
Constrained Decoding: Forcing LLM Outputs to Match Specific Grammars and Formats

Explore constrained decoding techniques that guarantee LLM outputs conform to formal grammars, regex patterns, or JSON schemas — eliminating format errors in agentic pipelines.

The Format Reliability Problem

Every agent developer has experienced it: you carefully instruct the LLM to return valid JSON, and 95% of the time it works. But 5% of the time the model adds a trailing comma, wraps the JSON in markdown fences, or injects an explanation before the opening brace. That 5% failure rate crashes your downstream parser and breaks the entire agent pipeline.

Constrained decoding solves this by modifying the token selection process itself so that only tokens consistent with a target grammar can be chosen. The model literally cannot produce invalid output.

How Constrained Decoding Works

During standard autoregressive generation, the model samples the next token from a probability distribution over the entire vocabulary. Constrained decoding introduces a mask at each generation step that zeroes out the probability of every token that would violate the target grammar. Only tokens that keep the output on a valid path through the grammar remain eligible for selection.

Under the hood, this is typically implemented with a finite-state machine (for regular-expression constraints) or a pushdown automaton (for context-free grammars) that tracks the current position in the grammar and determines which tokens are valid continuations.
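The masking step itself is simple arithmetic on the logits. Here is a minimal sketch with a toy vocabulary and a hypothetical grammar state; none of these values come from a real model:

```python
import math

# Toy vocabulary and raw logits from one hypothetical decoding step.
vocab = ["{", "}", '"', "hello", ",", " "]
logits = [1.2, 0.4, 2.0, 3.1, 0.7, 0.9]

# Hypothetical grammar state: at the start of a JSON object, the only
# valid continuation is the opening brace.
valid_token_ids = {0}  # index of "{" in vocab

# Zero out invalid tokens by setting their logits to -inf, then softmax.
masked = [l if i in valid_token_ids else float("-inf")
          for i, l in enumerate(logits)]
exps = [math.exp(l) if l != float("-inf") else 0.0 for l in masked]
total = sum(exps)
probs = [e / total for e in exps]

print(probs[0])  # 1.0: all probability mass lands on "{"
```

Note that "hello" had the highest raw logit, but the mask makes it unselectable; sampling from `probs` can only ever yield the brace.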

GBNF: Grammar-Based Format Specification

GBNF (GGML BNF) is a grammar format used by llama.cpp and compatible inference engines to define output constraints:

# GBNF grammar for a JSON object with specific fields
json_grammar = r"""
root   ::= "{" ws "\"action\"" ws ":" ws action "," ws "\"params\"" ws ":" ws params "}"
action ::= "\"search\"" | "\"calculate\"" | "\"respond\""
params ::= "{" ws (param ("," ws param)*)? ws "}"
param  ::= string ws ":" ws value
string ::= "\"" [a-zA-Z_]+ "\""
value  ::= string | number | "true" | "false" | "null"
number ::= "-"? [0-9]+ ("." [0-9]+)?
ws     ::= [ \t\n]*
"""

When this grammar is applied during generation, the model is prevented at the sampling level from producing output that does not match the root rule. Every generated token must be a valid continuation within the grammar.
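Token-level enforcement requires an inference engine, but the guarantee can be sanity-checked on finished strings with an equivalent regular expression. The recognizer below is an illustrative approximation of the root rule, hand-translated for this article; it is not part of llama.cpp:

```python
import re

# Simplified regex translation of the GBNF rules above.
WS = r"[ \t\n]*"
STRING = r'"[a-zA-Z_]+"'
NUMBER = r"-?[0-9]+(\.[0-9]+)?"
VALUE = rf"({STRING}|{NUMBER}|true|false|null)"
PARAM = rf"{STRING}{WS}:{WS}{VALUE}"
PARAMS = rf"\{{{WS}({PARAM}(,{WS}{PARAM})*)?{WS}\}}"
ACTION = r'"(search|calculate|respond)"'
ROOT = rf'\{{{WS}"action"{WS}:{WS}{ACTION},{WS}"params"{WS}:{WS}{PARAMS}\}}'

def matches_grammar(text: str) -> bool:
    """Check whether a complete output string matches the root rule."""
    return re.fullmatch(ROOT, text) is not None

print(matches_grammar('{"action": "search", "params": {"query": true}}'))  # True
print(matches_grammar('{"action": "delete", "params": {}}'))               # False
```

The second call fails because "delete" is not one of the three alternatives in the `action` rule; under grammar-constrained decoding, that token sequence could never have been generated in the first place.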


The Outlines Library

Outlines is a Python library that brings constrained generation to any HuggingFace-compatible model. It supports regex patterns, JSON schemas, and custom grammars:

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Regex-constrained generation: force a valid email
email_generator = outlines.generate.regex(
    model,
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)
result = email_generator("Extract the email from: Contact us at ")
print(result)  # guaranteed to be a valid email format

# JSON schema-constrained generation
from pydantic import BaseModel

class ToolCall(BaseModel):
    action: str
    query: str
    confidence: float

json_generator = outlines.generate.json(model, ToolCall)
tool_call = json_generator("Decide what tool to use for: What is 42 * 17?")
print(tool_call)  # always a valid ToolCall instance

Regex-Guided Generation

For simpler format constraints, regex-guided generation offers a lightweight alternative. The regex is compiled into a finite-state automaton, and at each generation step the automaton determines which tokens are valid continuations:

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Force output to be a valid ISO date
date_gen = outlines.generate.regex(model, r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
date = date_gen("Today's date in ISO format is ")
print(date)  # guaranteed to match YYYY-MM-DD

# Force output to be one of specific choices
choice_gen = outlines.generate.choice(model, ["approve", "reject", "escalate"])
decision = choice_gen("Should this refund request be approved? Customer spent $500 last month.")
print(decision)  # guaranteed to be one of the three options
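For a fixed-width pattern like the ISO date above, the automaton collapses to a position-indexed table. A minimal sketch of the idea, assuming the prefix generated so far is already valid:

```python
# Hand-built automaton for the pattern [0-9]{4}-[0-9]{2}-[0-9]{2},
# illustrating how a compiled regex tells the decoder which
# characters are legal next. Assumes `prefix` is already on a valid path.
DIGITS = set("0123456789")

def valid_next_chars(prefix: str) -> set[str]:
    """Return the characters that keep `prefix` on a valid path."""
    n = len(prefix)
    if n in (4, 7):   # positions of the two hyphens
        return {"-"}
    if n < 10:        # every other position is a digit
        return DIGITS
    return set()      # pattern complete: generation must stop

print(valid_next_chars("2024"))        # {'-'}
print(valid_next_chars("2024-06-15"))  # set(): end of string
```

Real implementations do the same bookkeeping over multi-character tokens rather than single characters, which is the main engineering challenge: a token like `024-` must be checked against several automaton transitions at once.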

Impact on Agent Architecture

Constrained decoding changes how you design agent pipelines. Instead of parsing LLM output and handling format errors with retries, you get guaranteed-valid structured output on every call. This eliminates an entire category of error-handling code and makes agents more reliable and faster — no retry loops needed.
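The difference shows up directly in pipeline code. With stub functions standing in for real model calls (the outputs below are invented for illustration):

```python
import json

# Stubs standing in for model calls.
def unconstrained_llm(prompt: str) -> str:
    # Typical failure mode: prose wrapped around the JSON.
    return 'Sure! Here is the JSON:\n{"action": "search"}'

def constrained_llm(prompt: str) -> str:
    return '{"action": "search"}'  # grammar-enforced output

# Without constraints: parse, catch, retry.
def call_with_retries(llm, prompt, max_retries=3):
    for _ in range(max_retries):
        try:
            return json.loads(llm(prompt))
        except json.JSONDecodeError:
            continue
    raise RuntimeError("model never produced valid JSON")

try:
    call_with_retries(unconstrained_llm, "pick a tool")
except RuntimeError as err:
    print(err)  # model never produced valid JSON

# With constrained decoding, the retry scaffolding disappears:
result = json.loads(constrained_llm("pick a tool"))
print(result["action"])  # search
```

The `call_with_retries` scaffolding, and the failure mode it guards against, simply have no place in the constrained version.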

The tradeoff is that constrained decoding requires access to the model's logits during generation. This works with local models and some API providers but is not available through all inference endpoints. OpenAI's structured output mode and Anthropic's tool use provide similar guarantees through different mechanisms.

FAQ

Does constrained decoding reduce output quality?

Constraining the format does not meaningfully reduce content quality: at each step the model still samples from the renormalized distribution over the valid tokens. For structured tasks, benchmark results reported by constrained-generation libraries suggest accuracy often improves, because the model no longer spends capacity on format compliance.

Can I use constrained decoding with OpenAI's API?

Not directly — you do not have access to logits during generation. However, OpenAI's response_format: { type: "json_schema" } parameter provides a similar guarantee through their own constrained decoding implementation on the server side.

What happens when the grammar is too restrictive?

If the grammar leaves very few valid tokens at a given step, the model may be forced to choose low-probability tokens, reducing coherence. Design grammars that constrain format without over-constraining content — for example, require JSON structure but allow free-form string values.
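One way to apply that advice is to relax the `string` rule from the GBNF grammar earlier in this article, so values may contain arbitrary escaped characters while the JSON skeleton stays fixed. A sketch of such a rule (GBNF supports negated character classes; exact escape handling may need adjusting for your engine):

```
string ::= "\"" ([^"\\] | "\\" ["\\bfnrt])* "\""
```

The structure rules stay exactly as restrictive as before; only the leaf content opens up.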


#ConstrainedDecoding #StructuredOutput #GBNF #Outlines #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
