Token Healing and Output Recovery: Fixing Common LLM Generation Artifacts
Learn techniques for detecting and repairing common LLM output problems including truncated responses, malformed JSON, encoding artifacts, and broken code blocks through robust post-processing pipelines.
The Reality of LLM Outputs
LLM outputs are not always clean. Even the best models produce artifacts: truncated responses when hitting token limits, malformed JSON with trailing commas or missing brackets, code blocks that open but never close, and Unicode encoding errors from tokenizer edge cases. In agentic pipelines where outputs feed into downstream parsers, tools, and other models, these artifacts cause cascading failures.
Token healing and output recovery are the defensive techniques that make agent pipelines robust against these inevitable generation imperfections.
Token Healing: Fixing Tokenization Boundary Issues
Token healing addresses a specific problem at the boundary between a prompt and the model's completion. When a prompt ends mid-token (for example, ending with a partial URL or code string), the model may generate an unexpected continuation because the tokenizer splits the boundary differently than intended.
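To see the problem concretely, consider a toy longest-match tokenizer. The vocabulary and matching rule here are invented for this sketch (real BPE applies learned merge rules rather than longest-match), but the boundary effect is the same:

```python
# Toy longest-match tokenizer: a simplification of BPE with an
# invented vocabulary, just to illustrate the boundary problem.
VOCAB = ["international", "inter", "in", "i", "t", "e", "r", "n", "a", "o", "l"]

def greedy_tokenize(text: str) -> list[str]:
    """Tokenize by always taking the longest vocabulary match."""
    tokens = []
    while text:
        match = max((v for v in VOCAB if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

# Tokenized whole, "international" is a single token:
greedy_tokenize("international")  # ["international"]
# But a prompt ending in "inter" pins the boundary: the completion must
# begin with the character-level tokens of "national", a sequence the
# model rarely saw during training.
greedy_tokenize("inter") + greedy_tokenize("national")
```

The model's statistics are conditioned on the natural tokenization, so forcing an unnatural split at the boundary degrades the continuation.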
The solution is to back up by one token from the prompt boundary and let the model regenerate from that point with a constrained prefix:
import tiktoken

def heal_token_boundary(prompt: str, model: str = "gpt-4") -> tuple[str, str]:
    """Back up one token at the prompt-completion boundary.

    Returns (trimmed_prompt, forced_prefix). Resend trimmed_prompt and
    constrain generation so the completion begins with forced_prefix;
    the model can then re-tokenize the boundary naturally, merging the
    prefix with its continuation into whatever tokens fit best.
    """
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = encoding.encode(prompt)
    if not prompt_tokens:
        return prompt, ""
    # The last token is the one that may have been split mid-word,
    # mid-URL, or mid-string. Drop it from the prompt and hand its
    # text back as a required prefix for the regenerated completion.
    forced_prefix = encoding.decode([prompt_tokens[-1]])
    trimmed_prompt = encoding.decode(prompt_tokens[:-1])
    return trimmed_prompt, forced_prefix
Truncation Recovery
When responses hit the max_tokens limit, they are cut off mid-sentence or mid-structure. For structured outputs, this is catastrophic — a truncated JSON string is unparseable. Recovery strategies depend on the output format:
import json
import re

def recover_truncated_json(raw: str) -> dict | None:
    """Attempt to recover a valid JSON object from truncated output."""
    # Strip markdown fences if present
    raw = re.sub(r"```json\s*", "", raw)
    raw = re.sub(r"```\s*$", "", raw)
    raw = raw.strip()
    # Try parsing as-is first
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 1: close unclosed brackets and braces
    open_braces = raw.count("{") - raw.count("}")
    open_brackets = raw.count("[") - raw.count("]")
    repaired = raw.rstrip(",\n ")  # remove trailing commas
    # Remove any incomplete key-value pair at the end
    repaired = re.sub(r',\s*"[^"]*"\s*:\s*$', "", repaired)
    repaired = re.sub(r',\s*"[^"]*$', "", repaired)
    repaired = re.sub(r",\s*$", "", repaired)
    repaired += "]" * max(0, open_brackets)
    repaired += "}" * max(0, open_braces)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find the longest prefix that parses once closed
    for end in range(len(raw), 0, -1):
        candidate = raw[:end]
        open_b = candidate.count("{") - candidate.count("}")
        open_k = candidate.count("[") - candidate.count("]")
        candidate += "]" * max(0, open_k) + "}" * max(0, open_b)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
Format Repair Pipeline
A robust format repair pipeline applies multiple repair strategies in sequence, from cheapest to most expensive:
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RepairResult:
    success: bool
    data: Any
    strategy_used: str

def build_repair_pipeline(
    strategies: list[tuple[str, Callable[[str], Any]]],
) -> Callable[[str], RepairResult]:
    """Build a repair pipeline that tries strategies in order."""
    def repair(raw_output: str) -> RepairResult:
        for name, strategy in strategies:
            try:
                result = strategy(raw_output)
                if result is not None:
                    return RepairResult(success=True, data=result, strategy_used=name)
            except Exception:
                continue
        return RepairResult(success=False, data=None, strategy_used="none")
    return repair

# Configure the pipeline, cheapest strategy first
json_repair = build_repair_pipeline([
    ("direct_parse", lambda s: json.loads(s)),
    ("strip_fences", lambda s: json.loads(re.sub(r"```\w*\n?|\n?```", "", s).strip())),
    ("truncation_recovery", recover_truncated_json),
    ("extract_first_object", lambda s: json.loads(re.search(r"\{.*\}", s, re.DOTALL).group())),
])

# Usage
result = json_repair(llm_output)
if result.success:
    print(f"Parsed using: {result.strategy_used}")
    process(result.data)
else:
    trigger_retry_or_escalate()
Post-Processing Best Practices
Always validate structure before content. Check that JSON is valid before checking that it has the right keys. Check that code compiles before checking that it runs correctly. Structural validation is cheap and catches the most common artifacts.
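A minimal sketch of structure-before-content checking (the required-key set and return shape are illustrative):

```python
import json

def validate_output(raw: str, required_keys: set[str]) -> tuple[bool, str]:
    """Validate structure first, then content."""
    # Stage 1: structural check; cheap, and catches the most common artifacts
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    # Stage 2: content check, only once the structure is known good
    missing = required_keys - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

Running the cheap stage first means a malformed response never reaches the more expensive content checks.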
Log repair actions. Every repair is a signal that something went wrong upstream. Track which repair strategies fire most often and use that data to improve your prompts, adjust token limits, or switch models.
Set repair budgets. A post-processing pipeline should not retry indefinitely. Define a maximum number of repair attempts and a fallback behavior (return a default, escalate to a human, return a graceful error).
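A repair budget can be enforced with a simple wrapper. This sketch (names and defaults are illustrative) caps the number of attempts and falls back to a default instead of looping:

```python
from typing import Any, Callable

def repair_with_budget(
    raw: str,
    strategies: list[Callable[[str], Any]],
    budget: int = 3,
    fallback: Any = None,
) -> Any:
    """Try at most `budget` repair strategies, then give up gracefully."""
    for attempt, strategy in enumerate(strategies):
        if attempt >= budget:
            break  # budget exhausted: stop repairing rather than retry forever
        try:
            result = strategy(raw)
            if result is not None:
                return result
        except Exception:
            continue
    return fallback
```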
Common Artifacts and Their Fixes
Trailing commas in JSON arrays and objects — strip with regex before parsing. Missing closing quotes — count quote parity and append if needed. Markdown code fences wrapping structured output — strip known fence patterns. HTML entities in plain text responses — decode with html.unescape(). Repeated tokens (model degeneration) — detect consecutive duplicate n-grams and truncate.
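The degeneration case from that list can be caught with a consecutive n-gram scan. A minimal sketch, with arbitrary defaults for the n-gram size and repeat threshold:

```python
def truncate_degeneration(text: str, n: int = 3, max_repeats: int = 4) -> str:
    """Cut the output at the first word n-gram that repeats back-to-back
    max_repeats or more times, keeping a single copy of it."""
    words = text.split()
    for i in range(len(words) - n):
        gram = words[i:i + n]
        repeats = 1
        j = i + n
        # Count how many times the same n-gram follows itself immediately
        while words[j:j + n] == gram:
            repeats += 1
            j += n
        if repeats >= max_repeats:
            return " ".join(words[:i + n])
    return text
```

Word-level n-grams are a coarse filter; character-level variants catch tighter loops at higher cost.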
FAQ
When should I use output recovery versus retrying the LLM call?
Use output recovery first — it is faster and cheaper than an LLM retry. Retry only when recovery fails or when the content itself (not just the format) is inadequate. A good rule of thumb: if the semantic content is present but the format is broken, repair it. If the content is missing or wrong, retry.
How do I handle truncation proactively?
Monitor the finish_reason field in the API response. If it is length instead of stop, the output was truncated. For structured outputs, set max_tokens high enough to accommodate the expected output plus a 30% buffer. For variable-length outputs, implement continuation — send a follow-up request asking the model to continue from where it stopped.
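Assuming an OpenAI-style Chat Completions response shape, the truncation check and the continuation request can be sketched as:

```python
def was_truncated(response: dict) -> bool:
    """True if the first choice stopped because of the token limit
    (OpenAI-style response dict assumed)."""
    return response["choices"][0]["finish_reason"] == "length"

def continuation_messages(messages: list[dict], partial: str) -> list[dict]:
    """Build a follow-up request asking the model to keep going."""
    return messages + [
        {"role": "assistant", "content": partial},
        {"role": "user", "content": "Continue exactly where you left off. "
                                    "Do not repeat any earlier text."},
    ]
```

Concatenate the partial output with the continuation, then run the result through the same repair pipeline, since the seam between the two responses is itself a boundary artifact.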
Does token healing apply to all models?
The boundary artifact that token healing addresses is specific to byte-pair encoding (BPE) tokenizers, which are used by GPT, Llama, Mistral, and most major models. Models using character-level or word-level tokenizers do not exhibit this specific artifact, but they have their own edge cases.
#TokenHealing #OutputRecovery #PostProcessing #ErrorHandling #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.