Learn Agentic AI · 10 min read

Speculative Decoding: Using Small Models to Speed Up Large Model Inference

Learn how speculative decoding uses lightweight draft models to generate candidate tokens that a large target model verifies in parallel, achieving 2-3x inference speedups without quality loss.

The Inference Bottleneck

Large language model inference is fundamentally bottlenecked by memory bandwidth, not compute. Each token generation requires loading billions of parameters from memory, but the actual computation per token is minimal. This means that whether you are generating one token or checking five candidate tokens, the wall-clock time is similar — the memory transfer dominates.
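A back-of-envelope sketch makes this concrete (the model size and bandwidth figures below are illustrative assumptions, not measurements):

```python
# A 70B-parameter model in fp16 must stream ~140 GB of weights per decode step
params = 70e9
bytes_per_param = 2                # fp16
bandwidth_bytes_per_s = 2e12       # ~2 TB/s, typical of a modern accelerator

weight_bytes = params * bytes_per_param
time_per_token_s = weight_bytes / bandwidth_bytes_per_s
# ≈ 0.07 s per step, i.e. ~14 tokens/s — and verifying 5 candidate tokens in
# one pass moves the same weights, so it costs roughly the same as generating 1
```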

Speculative decoding exploits this insight: use a small, fast model to draft several tokens at once, then verify all of them in a single pass through the large model. If the large model agrees with the draft, you have generated multiple tokens in the time it would take to generate one.

How Speculative Decoding Works

The process has three phases:

Draft phase. A small model (the draft model) autoregressively generates K candidate tokens. Because the draft model is small, this is fast — often faster than a single forward pass of the target model.

Verify phase. The large target model processes all K draft tokens in a single forward pass, computing the probability distribution for each position. This is efficient because transformer attention over K tokens in parallel costs roughly the same as generating one token due to the memory-bandwidth bottleneck.

Accept/reject phase. Each draft token is compared against the target model's distribution. Tokens are accepted or rejected using a modified rejection sampling scheme that preserves the exact output distribution of the target model.


import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def speculative_decode(
    draft_model,
    target_model,
    tokenizer,
    prompt: str,
    max_tokens: int = 100,
    draft_length: int = 5,
) -> str:
    """Speculative decoding with a draft model and a target model.

    Illustrative only: recomputes the full prefix on every forward pass;
    a production implementation would reuse KV caches for both models.
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids.clone()

    tokens_generated = 0
    while tokens_generated < max_tokens:
        # Phase 1: Draft K tokens with the small model
        draft_ids = generated.clone()
        draft_probs_list = []

        for _ in range(draft_length):
            with torch.no_grad():
                draft_out = draft_model(draft_ids)
                draft_logits = draft_out.logits[:, -1, :]
                draft_probs = torch.softmax(draft_logits, dim=-1)
                draft_probs_list.append(draft_probs)
                next_token = torch.multinomial(draft_probs, 1)
                draft_ids = torch.cat([draft_ids, next_token], dim=1)

        # Phase 2: Verify all draft tokens with the target model
        with torch.no_grad():
            target_out = target_model(draft_ids)
            target_logits = target_out.logits

        # Phase 3: Accept or reject each draft token
        n_accepted = 0
        for i in range(draft_length):
            pos = generated.shape[1] + i
            # Target's predicted distribution for the token at position `pos`
            target_probs = torch.softmax(target_logits[:, pos - 1, :], dim=-1)
            draft_token = draft_ids[:, pos]
            draft_p = draft_probs_list[i][:, draft_token].item()
            target_p = target_probs[:, draft_token].item()

            # Acceptance criterion preserving target distribution
            if np.random.random() < min(1.0, target_p / (draft_p + 1e-10)):
                n_accepted += 1
            else:
                # Reject: sample from adjusted distribution
                adjusted = torch.clamp(target_probs - draft_probs_list[i], min=0)
                adjusted = adjusted / adjusted.sum()
                new_token = torch.multinomial(adjusted, 1)
                # Keep the accepted draft tokens, then append the resampled token
                accepted = draft_ids[:, generated.shape[1]:pos]
                generated = torch.cat([generated, accepted, new_token], dim=1)
                tokens_generated += n_accepted + 1
                break
        else:
            # All draft tokens accepted: keep them and sample one bonus token
            # from the target distribution at the final position.
            bonus_probs = torch.softmax(target_logits[:, -1, :], dim=-1)
            bonus_token = torch.multinomial(bonus_probs, 1)
            generated = torch.cat([draft_ids, bonus_token], dim=1)
            tokens_generated += draft_length + 1

        if tokenizer.eos_token_id in generated[0, input_ids.shape[1]:]:
            break

    return tokenizer.decode(generated[0, input_ids.shape[1]:], skip_special_tokens=True)
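The distribution-preserving claim behind the accept/reject phase can be checked empirically on a toy three-token vocabulary (a standalone NumPy sketch, independent of the models above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target distribution
q = np.array([0.2, 0.5, 0.3])  # deliberately mismatched draft distribution

def speculative_sample(p, q, rng):
    x = rng.choice(len(q), p=q)                  # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with prob min(1, p/q)
        return x
    residual = np.clip(p - q, 0.0, None)         # on reject, resample from (p - q)+
    return rng.choice(len(p), p=residual / residual.sum())

samples = [speculative_sample(p, q, rng) for _ in range(100_000)]
empirical = np.bincount(samples, minlength=3) / len(samples)
# Empirical frequencies converge to p, not q, despite drafting from q
```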

Speedup Factors and Draft Model Selection

The speedup depends on the acceptance rate — how often the target model agrees with the draft model. A well-matched draft model that agrees 70-80% of the time typically yields 2-3x speedup. Poor matches drop to 1.2-1.5x or even no speedup.

Good draft model choices:

  • A smaller model from the same family (Llama-7B drafting for Llama-70B)
  • A quantized version of the target model
  • A model fine-tuned on similar data distributions
A rough model of the expected speedup, given the acceptance rate and per-round timings:

def estimate_speedup(
    acceptance_rate: float, draft_length: int,
    draft_time_ms: float, target_time_ms: float,
) -> float:
    """Estimate speculative decoding speedup factor."""
    # Expected tokens per round: geometric series, including the bonus token
    expected_tokens = (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)

    # Time per speculation round
    round_time = draft_length * draft_time_ms + target_time_ms

    # Standard autoregressive time for same tokens
    standard_time = expected_tokens * target_time_ms

    return standard_time / round_time
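Plugging in representative numbers (a 75% acceptance rate, K=5, a 2 ms draft step, a 20 ms target pass — all assumed, not benchmarked) shows where the ~2x figure comes from:

```python
acceptance_rate, draft_length = 0.75, 5
draft_time_ms, target_time_ms = 2.0, 20.0

# Geometric series: expected tokens accepted per round, incl. the bonus token
expected_tokens = (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)
round_time = draft_length * draft_time_ms + target_time_ms   # 30 ms per round
speedup = expected_tokens * target_time_ms / round_time
# expected_tokens ≈ 3.29, speedup ≈ 2.19x
```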

Implementation in Agent Pipelines

For agent developers using API-based inference, speculative decoding is typically handled by the serving infrastructure (vLLM, TensorRT-LLM, and llama.cpp all support it). Your role is choosing the right draft model and tuning the draft length.

For self-hosted agents, enable speculative decoding in your serving framework. In vLLM, it is a configuration flag. The serving layer handles the draft-verify-accept cycle transparently, and your application code sees only faster token generation with identical output quality.
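As a sketch of what that looks like in practice (the flag names and model IDs below are assumptions that vary across vLLM versions — check `vllm serve --help` for your install):

```shell
# Serve a large target model with a smaller same-family draft model,
# speculating 5 tokens per round
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5
```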

FAQ

Does speculative decoding change the output quality?

No. The mathematical guarantee of speculative decoding is that the output distribution is identical to what the target model would produce on its own. The rejection sampling scheme ensures that accepted tokens follow the exact same probability distribution. You get speed without any quality tradeoff.

What draft length should I use?

Start with K=5 and tune based on your acceptance rate. Higher acceptance rates support longer draft lengths (K=8-10). Lower acceptance rates benefit from shorter drafts (K=3-4) because rejected tokens waste the draft model's compute. Monitor the acceptance rate in production and adjust accordingly.

Can I use speculative decoding with API providers like OpenAI?

Not directly from your application code — the draft-verify cycle requires access to both models' logits during generation. However, API providers implement speculative decoding internally on their serving infrastructure. You benefit from it automatically without any code changes.


#SpeculativeDecoding #InferenceOptimization #DraftModels #Performance #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
