Token Healing and Output Recovery: Fixing Common LLM Generation Artifacts
Learn techniques for detecting and repairing common LLM output problems including truncated responses, malformed JSON, encoding artifacts, and broken code blocks through robust post-processing pipelines.
The Reality of LLM Outputs
LLM outputs are not always clean. Even the best models produce artifacts: truncated responses when hitting token limits, malformed JSON with trailing commas or missing brackets, code blocks that open but never close, and Unicode encoding errors from tokenizer edge cases. In agentic pipelines where outputs feed into downstream parsers, tools, and other models, these artifacts cause cascading failures.
Token healing and output recovery are the defensive techniques that make agent pipelines robust against these inevitable generation imperfections.
Token Healing: Fixing Tokenization Boundary Issues
Token healing addresses a specific problem at the boundary between a prompt and the model's completion. When a prompt ends mid-token (for example, ending with a partial URL or code string), the model may generate an unexpected continuation because the tokenizer splits the boundary differently than intended.
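To see the problem concretely, consider a toy longest-match tokenizer. The vocabulary and matching rule here are invented for this sketch (real BPE applies learned merge rules rather than longest-match), but the boundary effect is the same:

```python
# Toy longest-match tokenizer: a simplification of BPE with an
# invented vocabulary, just to illustrate the boundary problem.
VOCAB = ["international", "inter", "in", "i", "t", "e", "r", "n", "a", "o", "l"]

def greedy_tokenize(text: str) -> list[str]:
    """Tokenize by always taking the longest vocabulary match."""
    tokens = []
    while text:
        match = max((v for v in VOCAB if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

# Tokenized whole, "international" is a single token:
greedy_tokenize("international")  # ["international"]
# But a prompt ending in "inter" pins the boundary: the completion must
# begin with the character-level tokens of "national", a sequence the
# model rarely saw during training.
greedy_tokenize("inter") + greedy_tokenize("national")
```

The model's statistics are conditioned on the natural tokenization, so forcing an unnatural split at the boundary degrades the continuation.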
The solution is to back up by one token from the prompt boundary and let the model regenerate from that point with a constrained prefix:
import tiktoken

def heal_token_boundary(prompt: str, model: str = "gpt-4") -> tuple[str, str]:
    """Back up one token at the prompt-completion boundary.

    Returns (trimmed_prompt, forced_prefix). Resend trimmed_prompt and
    constrain generation so the completion begins with forced_prefix;
    the model can then re-tokenize the boundary naturally, merging the
    prefix with its continuation into whatever tokens fit best.
    """
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = encoding.encode(prompt)
    if not prompt_tokens:
        return prompt, ""
    # The last token is the one that may have been split mid-word,
    # mid-URL, or mid-string. Drop it from the prompt and hand its
    # text back as a required prefix for the regenerated completion.
    forced_prefix = encoding.decode([prompt_tokens[-1]])
    trimmed_prompt = encoding.decode(prompt_tokens[:-1])
    return trimmed_prompt, forced_prefix
Truncation Recovery
When responses hit the max_tokens limit, they are cut off mid-sentence or mid-structure. For structured outputs, this is catastrophic — a truncated JSON string is unparseable. Recovery strategies depend on the output format:
import json
import re

def recover_truncated_json(raw: str) -> dict | None:
    """Attempt to recover a valid JSON object from truncated output."""
    # Strip markdown fences if present
    raw = re.sub(r"```json\s*", "", raw)
    raw = re.sub(r"```\s*$", "", raw)
    raw = raw.strip()
    # Try parsing as-is first
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 1: close unclosed brackets and braces
    open_braces = raw.count("{") - raw.count("}")
    open_brackets = raw.count("[") - raw.count("]")
    repaired = raw.rstrip(",\n ")  # remove trailing commas
    # Remove any incomplete key-value pair at the end
    repaired = re.sub(r',\s*"[^"]*"\s*:\s*$', "", repaired)
    repaired = re.sub(r',\s*"[^"]*$', "", repaired)
    repaired = re.sub(r",\s*$", "", repaired)
    repaired += "]" * max(0, open_brackets)
    repaired += "}" * max(0, open_braces)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find the longest prefix that parses once closed
    for end in range(len(raw), 0, -1):
        candidate = raw[:end]
        open_b = candidate.count("{") - candidate.count("}")
        open_k = candidate.count("[") - candidate.count("]")
        candidate += "]" * max(0, open_k) + "}" * max(0, open_b)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
Format Repair Pipeline
A robust format repair pipeline applies multiple repair strategies in sequence, from cheapest to most expensive:
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RepairResult:
    success: bool
    data: Any
    strategy_used: str

def build_repair_pipeline(
    strategies: list[tuple[str, Callable[[str], Any]]],
) -> Callable[[str], RepairResult]:
    """Build a repair pipeline that tries strategies in order."""
    def repair(raw_output: str) -> RepairResult:
        for name, strategy in strategies:
            try:
                result = strategy(raw_output)
                if result is not None:
                    return RepairResult(success=True, data=result, strategy_used=name)
            except Exception:
                continue
        return RepairResult(success=False, data=None, strategy_used="none")
    return repair

# Configure the pipeline, cheapest strategy first
json_repair = build_repair_pipeline([
    ("direct_parse", lambda s: json.loads(s)),
    ("strip_fences", lambda s: json.loads(re.sub(r"```\w*\n?|\n?```", "", s).strip())),
    ("truncation_recovery", recover_truncated_json),
    ("extract_first_object", lambda s: json.loads(re.search(r"\{.*\}", s, re.DOTALL).group())),
])

# Usage
result = json_repair(llm_output)
if result.success:
    print(f"Parsed using: {result.strategy_used}")
    process(result.data)
else:
    trigger_retry_or_escalate()
Post-Processing Best Practices
Always validate structure before content. Check that JSON is valid before checking that it has the right keys. Check that code compiles before checking that it runs correctly. Structural validation is cheap and catches the most common artifacts.
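A minimal sketch of structure-before-content checking (the required-key set and return shape are illustrative):

```python
import json

def validate_output(raw: str, required_keys: set[str]) -> tuple[bool, str]:
    """Validate structure first, then content."""
    # Stage 1: structural check; cheap, and catches the most common artifacts
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    # Stage 2: content check, only once the structure is known good
    missing = required_keys - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

Running the cheap stage first means a malformed response never reaches the more expensive content checks.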
Log repair actions. Every repair is a signal that something went wrong upstream. Track which repair strategies fire most often and use that data to improve your prompts, adjust token limits, or switch models.
Set repair budgets. A post-processing pipeline should not retry indefinitely. Define a maximum number of repair attempts and a fallback behavior (return a default, escalate to a human, return a graceful error).
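A repair budget can be enforced with a simple wrapper. This sketch (names and defaults are illustrative) caps the number of attempts and falls back to a default instead of looping:

```python
from typing import Any, Callable

def repair_with_budget(
    raw: str,
    strategies: list[Callable[[str], Any]],
    budget: int = 3,
    fallback: Any = None,
) -> Any:
    """Try at most `budget` repair strategies, then give up gracefully."""
    for attempt, strategy in enumerate(strategies):
        if attempt >= budget:
            break  # budget exhausted: stop repairing rather than retry forever
        try:
            result = strategy(raw)
            if result is not None:
                return result
        except Exception:
            continue
    return fallback
```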
Common Artifacts and Their Fixes
Trailing commas in JSON arrays and objects — strip with regex before parsing. Missing closing quotes — count quote parity and append if needed. Markdown code fences wrapping structured output — strip known fence patterns. HTML entities in plain text responses — decode with html.unescape(). Repeated tokens (model degeneration) — detect consecutive duplicate n-grams and truncate.
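The degeneration case from that list can be caught with a consecutive n-gram scan. A minimal sketch, with arbitrary defaults for the n-gram size and repeat threshold:

```python
def truncate_degeneration(text: str, n: int = 3, max_repeats: int = 4) -> str:
    """Cut the output at the first word n-gram that repeats back-to-back
    max_repeats or more times, keeping a single copy of it."""
    words = text.split()
    for i in range(len(words) - n):
        gram = words[i:i + n]
        repeats = 1
        j = i + n
        # Count how many times the same n-gram follows itself immediately
        while words[j:j + n] == gram:
            repeats += 1
            j += n
        if repeats >= max_repeats:
            return " ".join(words[:i + n])
    return text
```

Word-level n-grams are a coarse filter; character-level variants catch tighter loops at higher cost.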
FAQ
When should I use output recovery versus retrying the LLM call?
Use output recovery first — it is faster and cheaper than an LLM retry. Retry only when recovery fails or when the content itself (not just the format) is inadequate. A good rule of thumb: if the semantic content is present but the format is broken, repair it. If the content is missing or wrong, retry.
How do I handle truncation proactively?
Monitor the finish_reason field in the API response. If it is length instead of stop, the output was truncated. For structured outputs, set max_tokens high enough to accommodate the expected output plus a 30% buffer. For variable-length outputs, implement continuation — send a follow-up request asking the model to continue from where it stopped.
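Assuming an OpenAI-style Chat Completions response shape, the truncation check and the continuation request can be sketched as:

```python
def was_truncated(response: dict) -> bool:
    """True if the first choice stopped because of the token limit
    (OpenAI-style response dict assumed)."""
    return response["choices"][0]["finish_reason"] == "length"

def continuation_messages(messages: list[dict], partial: str) -> list[dict]:
    """Build a follow-up request asking the model to keep going."""
    return messages + [
        {"role": "assistant", "content": partial},
        {"role": "user", "content": "Continue exactly where you left off. "
                                    "Do not repeat any earlier text."},
    ]
```

Concatenate the partial output with the continuation, then run the result through the same repair pipeline, since the seam between the two responses is itself a boundary artifact.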
Does token healing apply to all models?
The boundary artifact that token healing addresses is specific to byte-pair encoding (BPE) tokenizers, which are used by GPT, Llama, Mistral, and most major models. Models using character-level or word-level tokenizers do not exhibit this specific artifact, but they have their own edge cases.
#TokenHealing #OutputRecovery #PostProcessing #ErrorHandling #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.