
LLM Inference Explained: How Models Generate Text Token by Token

Understand the autoregressive generation process, KV cache optimization, batching strategies, and the latency vs throughput trade-offs that govern LLM inference performance.

What Happens When You Call an LLM API

When you send a prompt to an LLM and receive a response, the model does not produce the entire answer at once. It generates text one token at a time in a process called autoregressive generation. Understanding this process is key to understanding why LLMs have the performance characteristics they do — why the first token takes longer than subsequent ones, why longer outputs cost more, and how to optimize for speed and throughput.

Autoregressive Generation: One Token at a Time

Autoregressive means each token depends on all previous tokens. The model generates token 1, then uses the prompt plus token 1 to generate token 2, then uses everything so far to generate token 3, and so on:

def autoregressive_generation(model, prompt_tokens, max_new_tokens=50):
    """
    Simplified autoregressive text generation.
    Each new token depends on all previous tokens.
    """
    generated = list(prompt_tokens)

    for step in range(max_new_tokens):
        # Feed ALL tokens so far into the model
        logits = model.forward(generated)  # Returns logits for next token

        # Get probabilities for the next token only
        next_token_logits = logits[-1]  # Last position

        # Sample from the distribution
        next_token = sample(next_token_logits, temperature=0.7)

        # Check for end-of-sequence
        if next_token == EOS_TOKEN:
            break

        # Append and continue
        generated.append(next_token)

    return generated[len(prompt_tokens):]  # Return only new tokens

This creates two distinct phases in every LLM request:

  1. Prefill phase: Process all prompt tokens in parallel. This is compute-bound — the model processes the entire prompt through all transformer layers in one pass.

  2. Decode phase: Generate output tokens one at a time. This is memory-bandwidth-bound — each step only produces one token but must read the model weights from GPU memory.

This two-phase structure explains the "time to first token" (TTFT) metric. The prefill phase must complete before the first output token appears. A long prompt means a longer wait before the response starts streaming.
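The two phases can be captured in a back-of-envelope latency model. The per-token timings below are illustrative assumptions (real numbers depend on hardware, model size, and batching), not measurements:

```python
def estimate_latency_ms(prompt_tokens, output_tokens,
                        prefill_ms_per_token=0.05,  # assumed: prefill is parallel, cheap per token
                        decode_ms_per_token=25.0):  # assumed: decode is sequential, expensive per token
    """Toy model: TTFT comes from prefill; total time adds sequential decode."""
    ttft = prompt_tokens * prefill_ms_per_token
    total = ttft + output_tokens * decode_ms_per_token
    return ttft, total

# A 10K-token prompt: roughly 500 ms before the first token, ~5.5 s overall
ttft, total = estimate_latency_ms(prompt_tokens=10_000, output_tokens=200)
```

Note how prompt length dominates TTFT while output length dominates total time; the two are tuned independently.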

The KV Cache: Avoiding Redundant Computation

Without optimization, generating each new token would require reprocessing the entire sequence from scratch. For a 1,000-token prompt generating a 500-token response, that means processing sequences of length 1000, 1001, 1002, all the way to 1499. This is catastrophically wasteful.
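The arithmetic makes the waste concrete. For the 1,000-token prompt and 500-token response above:

```python
# Tokens the model would process across all 500 decode steps
naive = sum(range(1_000, 1_500))  # reprocess lengths 1000, 1001, ..., 1499
cached = 1_000 + 500              # prefill once, then one new token per step

print(naive, cached, naive / cached)  # 624750 vs 1500 — over 400x more work
```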

The KV cache solves this. During attention computation, each token produces key (K) and value (V) vectors. These do not change once computed. The KV cache stores them so they are computed only once:

def generation_with_kv_cache(model, prompt_tokens, max_new_tokens=50):
    """
    Generation with KV cache — dramatically faster than naive generation.
    """
    # Phase 1: Prefill — process all prompt tokens at once
    # This computes and caches K, V for every prompt token
    logits, kv_cache = model.forward(
        prompt_tokens,
        kv_cache=None,  # No cache yet — compute everything
    )

    next_token = sample(logits[-1])
    generated = [next_token]

    # Phase 2: Decode — process one new token at a time
    for step in range(max_new_tokens - 1):
        # Only process the NEW token — use cached K, V for all previous tokens
        logits, kv_cache = model.forward(
            [next_token],     # Just the one new token
            kv_cache=kv_cache,  # Reuse cached K, V from all previous tokens
        )

        next_token = sample(logits[-1])
        if next_token == EOS_TOKEN:
            break

        generated.append(next_token)

    return generated

The KV cache turns what would be O(n²) work per decode step into O(n). But it comes at a cost — memory. For a large model with a long context, the KV cache can consume tens of gigabytes of GPU memory.

def estimate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    dtype_bytes: int = 2,  # FP16 = 2 bytes
) -> float:
    """
    Estimate KV cache memory in GB.

    For GPT-4-scale: 120 layers, 8 KV heads, 128 dim, 128K seq
    """
    # K and V each: [num_layers, num_kv_heads, seq_len, head_dim]
    kv_elements = 2 * num_layers * num_kv_heads * seq_len * head_dim
    memory_bytes = kv_elements * dtype_bytes
    memory_gb = memory_bytes / (1024 ** 3)

    print(f"KV cache memory: {memory_gb:.2f} GB")
    print(f"  Layers: {num_layers}")
    print(f"  KV heads: {num_kv_heads}")
    print(f"  Sequence length: {seq_len:,}")
    return memory_gb

# Llama 3.1 70B with full 128K context
estimate_kv_cache_memory(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    seq_len=128_000,
)
# KV cache memory: ~39 GB — significant GPU memory just for one request!

Batching: Serving Multiple Requests Simultaneously

In production, an LLM server handles many requests concurrently. Batching groups multiple requests together to share the cost of loading model weights from GPU memory:

# Conceptual batching in an inference server
class InferenceServer:
    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch_size = max_batch_size
        self.request_queue = []

    def add_request(self, request):
        self.request_queue.append(request)

    def step(self):
        """
        Process one generation step for all active requests.
        Model weights are loaded from GPU memory ONCE and shared.
        """
        # Select up to max_batch_size active requests
        batch = self.request_queue[:self.max_batch_size]

        # Process all requests in one forward pass
        # Each request may be at a different token position
        outputs = self.model.batched_forward(
            tokens=[req.current_tokens for req in batch],
            kv_caches=[req.kv_cache for req in batch],
        )

        # Handle completions — remove finished requests, keep others
        for req, output in zip(batch, outputs):
            next_token = sample(output)
            if next_token == EOS_TOKEN or req.token_count >= req.max_tokens:
                req.complete(req.generated_tokens)
                self.request_queue.remove(req)
            else:
                req.generated_tokens.append(next_token)
                req.current_tokens = [next_token]

Continuous batching is a more advanced technique where new requests can join the batch as old ones finish, instead of waiting for the entire batch to complete. This is how production servers like vLLM and TensorRT-LLM achieve high throughput.
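A toy simulation illustrates why. The one-token-per-step model below is a simplification (it ignores prefill cost and memory limits), and the request lengths are made up:

```python
def static_batch_steps(output_lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(output_lengths, batch_size):
    """Continuous batching: a waiting request joins as soon as a slot frees."""
    queue = list(output_lengths)
    active = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while active:
        steps += 1
        active = [r - 1 for r in active]          # decode one token per active request
        freed = active.count(0)                   # requests that just finished
        active = [r for r in active if r > 0]
        for _ in range(freed):                    # refill freed slots immediately
            if queue:
                active.append(queue.pop(0))
    return steps

lengths = [100, 10, 10, 10, 10, 10]  # one long request, several short ones
static = static_batch_steps(lengths, batch_size=2)       # 100 + 10 + 10 = 120 steps
continuous = continuous_batch_steps(lengths, batch_size=2)  # 100 steps
```

With static batching the short requests in later batches wait idle behind the long one's batch; with continuous batching they ride along in the freed slot, so total GPU steps drop from 120 to 100.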

Latency vs Throughput: The Fundamental Trade-off

There is a core tension in LLM serving:

  • Latency (time per request) is best with small batches — each request gets more GPU attention
  • Throughput (requests per second) is best with large batches — GPU utilization is maximized

# Measuring the trade-off
import time
from openai import OpenAI

client = OpenAI()

def measure_latency(prompt, model="gpt-4o"):
    """Measure time-to-first-token and total generation time."""
    start = time.monotonic()
    first_token_time = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    tokens = []
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.monotonic() - start
            tokens.append(chunk.choices[0].delta.content)

    total_time = time.monotonic() - start
    token_count = len(tokens)

    print(f"Time to first token: {first_token_time:.3f}s")
    print(f"Total time: {total_time:.3f}s")
    print(f"Output tokens: {token_count}")
    print(f"Tokens per second: {token_count / total_time:.1f}")

    return {
        "ttft": first_token_time,
        "total": total_time,
        "tokens": token_count,
        "tps": token_count / total_time,
    }

# Short prompt = fast prefill
measure_latency("What is 2+2?")

# Long prompt = slow prefill but same decode speed
measure_latency("Summarize this: " + "word " * 5000)

Speculative Decoding: Using a Small Model to Speed Up a Big One

Speculative decoding is a technique where a small, fast "draft" model generates candidate tokens, and the large "target" model verifies them in parallel. Since verification is parallel (like prefill) rather than sequential (like decode), this can speed up generation by 2-3x:

def speculative_decoding(target_model, draft_model, prompt, gamma=5):
    """
    Speculative decoding concept:
    1. Draft model quickly generates gamma candidate tokens
    2. Target model verifies all candidates in one parallel pass
    3. Accept matching tokens, reject from first mismatch
    """
    tokens = list(prompt)

    while not is_complete(tokens):
        n = len(tokens)  # prefix length before this round

        # Step 1: Draft model generates gamma candidates quickly
        draft_tokens = []
        for _ in range(gamma):
            next_token = draft_model.generate_one(tokens + draft_tokens)
            draft_tokens.append(next_token)

        # Step 2: Target model verifies ALL candidates in one forward pass
        # This is parallel, so it takes roughly the same time as generating 1 token
        target_probs = target_model.forward(tokens + draft_tokens)

        # Step 3: Accept tokens where draft and target agree.
        # target_probs[p] is the distribution for the token AFTER position p,
        # so the candidate at position n + i is scored by target_probs[n + i - 1].
        # (Index from the fixed n, not len(tokens), which grows as we append.)
        for i, draft_token in enumerate(draft_tokens):
            if should_accept(draft_token, target_probs[n + i - 1]):
                tokens.append(draft_token)
            else:
                # Resample from the target at the first mismatch, then stop
                corrected = sample(target_probs[n + i - 1])
                tokens.append(corrected)
                break

        # If all gamma candidates are accepted, we generated gamma tokens
        # in roughly the time it takes to generate 1 token

    return tokens
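How much this helps depends on the acceptance rate. If the target accepts each draft token independently with probability alpha, the expected number of tokens produced per target forward pass works out to (1 - alpha^(gamma+1)) / (1 - alpha), a standard result from the speculative decoding literature. A quick sketch, with alpha treated as an assumed input (in practice it is measured empirically):

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model forward pass,
    assuming each draft token is accepted independently with prob alpha."""
    if alpha == 1.0:
        return gamma + 1.0  # every candidate always accepted
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# A draft that matches the target 80% of the time, with gamma=5:
rate = expected_tokens_per_pass(0.8, 5)  # ~3.69 tokens per pass
```

Since one target pass costs roughly one decode step, this corresponds to the 2-3x speedups quoted above.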

Practical Implications for Application Developers

Understanding inference mechanics directly impacts application design:

  1. Stream responses. Do not wait for the full response before showing output. Use streaming to display tokens as they are generated. This makes applications feel responsive even when total generation takes several seconds.

  2. Keep prompts concise. Longer prompts increase time-to-first-token due to the prefill phase. Every unnecessary token in your system prompt adds latency to every request.

  3. Set appropriate max_tokens. The model will generate tokens until it hits the limit or produces a stop token. Setting max_tokens too high wastes time if the model could have stopped earlier. For classification tasks, max_tokens=10 is often sufficient.

  4. Choose the right model size. Smaller models are faster. If GPT-4o-mini handles your task adequately, it will respond 2-3x faster than GPT-4o and at a fraction of the cost.

FAQ

Why is the first token slower than subsequent tokens?

The first token requires the prefill phase — processing the entire prompt through all transformer layers. Subsequent tokens only need to process the single new token (using the KV cache for all previous tokens). A 10,000-token prompt might take 500ms for prefill, but each subsequent decode step only takes 20-30ms. This is why time-to-first-token (TTFT) and inter-token latency are measured separately.

How does streaming work at the protocol level?

When you set stream=True in the API, the server sends the response using Server-Sent Events (SSE). Each event contains a small JSON object with the next token or tokens. The connection stays open until generation is complete. This allows your application to display partial responses immediately rather than waiting for the full response.
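A minimal parser sketch for such a stream. The payload shape here mirrors the OpenAI chat-completions streaming format (real chunks carry more fields, such as `finish_reason`):

```python
import json

def iter_sse_content(raw: str):
    """Yield content deltas from a raw SSE response body."""
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # stream terminator sentinel
        event = json.loads(payload)
        delta = event["choices"][0]["delta"].get("content")
        if delta:
            yield delta

raw = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
    'data: [DONE]\n'
)
text = "".join(iter_sse_content(raw))  # "Hello"
```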

Why are output tokens more expensive than input tokens?

Input tokens are processed in parallel during the prefill phase, which is compute-efficient. Output tokens are generated sequentially, one at a time, and each requires reading the model weights from GPU memory. The sequential nature means GPU utilization is lower, and the KV cache consumes additional memory. This per-token overhead is why API providers charge 2-4x more for output tokens than input tokens.
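The pricing asymmetry is easy to account for in cost estimates. The per-million-token prices below are hypothetical placeholders, not any provider's actual rates:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m=2.50,    # hypothetical $/1M input tokens
                 output_price_per_m=10.00):  # hypothetical $/1M output tokens
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

# A 10,000-token prompt with a 500-token answer:
cost = request_cost(10_000, 500)  # $0.025 input + $0.005 output = $0.03
```

Note that at a 4x price multiplier, a short answer to a long prompt is still dominated by input cost; the output premium matters most for generation-heavy workloads.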


#LLMInference #KVCache #Autoregressive #Performance #Optimization #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
