LLM Calibration: Understanding and Improving Model Confidence Estimates
Learn what LLM calibration means, how to measure it with calibration curves, and how practical techniques like temperature scaling and verbalized confidence help you build agents that know when they do not know.
Why Calibration Matters for Agents
An LLM is well-calibrated when its expressed confidence matches its actual accuracy. If a model says it is 90% confident in an answer, that answer should be correct roughly 90% of the time. Poorly calibrated models are dangerous in agentic systems because they either overstate confidence — leading agents to take incorrect actions — or understate it — causing unnecessary escalations and human-in-the-loop bottlenecks.
For agent developers, calibration directly impacts two critical decisions: when to act autonomously and when to ask for help.
Measuring Calibration: The Calibration Curve
A calibration curve plots predicted confidence against observed accuracy. A perfectly calibrated model produces a diagonal line where predicted probability equals actual correctness. Most LLMs deviate significantly from this ideal.
import numpy as np
from sklearn.calibration import calibration_curve

def evaluate_calibration(
    predictions: list[dict],  # [{"confidence": 0.9, "correct": True}, ...]
) -> dict:
    """Compute calibration metrics from model predictions."""
    confidences = np.array([p["confidence"] for p in predictions])
    accuracies = np.array([p["correct"] for p in predictions], dtype=float)

    # Compute calibration curve (sklearn drops empty bins)
    prob_true, prob_pred = calibration_curve(
        accuracies, confidences, n_bins=10, strategy="uniform"
    )

    # Expected Calibration Error (ECE): bin-weighted gap between
    # confidence and accuracy. Keep only non-empty bins so the weights
    # align with the (possibly shorter) arrays sklearn returns.
    bin_sizes = np.histogram(confidences, bins=10, range=(0, 1))[0]
    bin_weights = bin_sizes[bin_sizes > 0] / len(confidences)
    ece = np.sum(bin_weights * np.abs(prob_true - prob_pred))

    return {
        "ece": float(ece),
        "prob_true": prob_true.tolist(),
        "prob_pred": prob_pred.tolist(),
        "mean_confidence": float(confidences.mean()),
        "mean_accuracy": float(accuracies.mean()),
    }
The Expected Calibration Error (ECE) summarizes miscalibration as a single number. An ECE of 0 means perfect calibration. Most production LLMs have ECE values between 0.05 and 0.20, meaning their confidence is off by 5-20 percentage points on average.
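To make the ECE definition concrete, here is a toy hand computation with three made-up bins (the numbers are illustrative, not measured):

```python
import numpy as np

# Hypothetical binned results: average predicted confidence, observed
# accuracy, and the fraction of predictions landing in each bin.
bin_confidence = np.array([0.60, 0.80, 0.95])
bin_accuracy = np.array([0.55, 0.70, 0.85])
bin_weight = np.array([0.2, 0.3, 0.5])

# ECE is the bin-weighted average |confidence - accuracy| gap
ece = float(np.sum(bin_weight * np.abs(bin_confidence - bin_accuracy)))
print(round(ece, 3))  # 0.09: confidence overshoots accuracy by 9 points
```

Here every bin's confidence exceeds its accuracy, which is the typical overconfident pattern for LLMs.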
Temperature Scaling: Post-Hoc Calibration
Temperature scaling is the simplest and most effective post-hoc calibration technique. It applies a single learned parameter (temperature T) to the model's output logits to bring confidence estimates in line with actual accuracy:
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def find_optimal_temperature(
    logits: np.ndarray, labels: np.ndarray
) -> float:
    """Find the temperature that minimizes negative log-likelihood."""
    def nll_with_temperature(T):
        scaled = logits / T
        probs = softmax(scaled, axis=1)
        correct_probs = probs[np.arange(len(labels)), labels]
        return -np.mean(np.log(correct_probs + 1e-10))

    result = minimize_scalar(
        nll_with_temperature, bounds=(0.1, 10.0), method="bounded"
    )
    return float(result.x)

# Usage: after finding the optimal T on a held-out calibration set
optimal_T = find_optimal_temperature(validation_logits, validation_labels)
calibrated_probs = softmax(test_logits / optimal_T, axis=1)
Temperature scaling requires access to model logits, which is available with local models but not through most API providers. For API-based agents, verbalized confidence is the practical alternative.
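The effect of T is easy to see in isolation: dividing logits by T > 1 flattens the softmax distribution, pulling an overconfident top probability down. A minimal sketch with made-up logits:

```python
import numpy as np
from scipy.special import softmax

# Hypothetical logits for a single 3-class prediction
logits = np.array([[4.0, 1.0, 0.0]])

raw = softmax(logits, axis=1)           # sharp: top probability ~0.94
cooled = softmax(logits / 2.0, axis=1)  # T = 2 softens it to ~0.74

print(raw[0].round(3), cooled[0].round(3))
```

T < 1 would sharpen the distribution instead; the optimization above learns whichever direction the model needs.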
Verbalized Confidence: API-Friendly Calibration
When you cannot access logits, you can ask the model to express its confidence as a number. Research shows that with careful prompting, verbalized confidence provides useful — though imperfect — calibration signals:
from openai import OpenAI
import json

def get_calibrated_answer(question: str, client: OpenAI) -> dict:
    """Get an answer with a verbalized confidence score."""
    response = client.chat.completions.create(
        model="gpt-4o",  # requires a model that supports JSON mode
        messages=[{
            "role": "user",
            "content": f"""Answer this question and rate your confidence.

Question: {question}

Respond in JSON with:
- "answer": your answer
- "confidence": a number from 0.0 to 1.0 representing your true confidence
- "reasoning": why you assigned this confidence level

Be honest about uncertainty. A 0.7 means you expect to be right about 70% of the time on similar questions.""",
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def should_agent_act(confidence: float, threshold: float = 0.85) -> str:
    """Decide whether the agent should act autonomously."""
    if confidence >= threshold:
        return "act"
    elif confidence >= 0.5:
        return "act_with_caveat"
    else:
        return "escalate_to_human"
Practical Calibration for Agent Pipelines
In production agent systems, calibration informs routing decisions. High-confidence answers proceed through automated workflows, while low-confidence answers get routed to human reviewers or trigger additional verification steps.
Build a calibration dataset specific to your domain by collecting model predictions with confidence scores and comparing them against ground truth. Track calibration metrics over time — model updates, prompt changes, and distribution shifts all affect calibration.
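A minimal way to start such a dataset is an append-only prediction log that you backfill with ground truth as it becomes known. The file name and record fields below are illustrative assumptions, not a fixed schema:

```python
import json
import time

def log_prediction(question: str, answer: str, confidence: float,
                   path: str = "calibration_log.jsonl") -> None:
    """Append one prediction record to a JSONL calibration log."""
    record = {
        "ts": time.time(),
        "question": question,
        "answer": answer,
        "confidence": confidence,
        "correct": None,  # backfilled later against ground truth
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Once records have their `correct` field filled in, they can be fed directly into a function like `evaluate_calibration` above.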
FAQ
Are LLMs generally overconfident or underconfident?
Most LLMs are overconfident — they express high confidence even when their answers are wrong. This is especially pronounced for factual knowledge questions outside the model's strong training domains. Instruction-tuned models tend to be slightly better calibrated than base models.
Can I calibrate an API-based model without logit access?
Yes, through verbalized confidence. Ask the model to output a confidence score with each answer, then build a calibration curve from these scores against ground truth. You can then apply a simple mapping function (learned from your calibration set) to adjust raw verbalized confidence into calibrated estimates.
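One common choice for that mapping is isotonic regression, which learns a monotone function from raw verbalized confidence to observed accuracy. A sketch, assuming you have already collected (confidence, correct) pairs; the data here is made up:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical calibration set: verbalized confidences and outcomes
verbalized = np.array([0.50, 0.60, 0.70, 0.80, 0.90, 0.90, 0.95, 0.99])
correct = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])

# Fit a monotone map from raw confidence to observed accuracy
mapper = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
mapper.fit(verbalized, correct)

# Adjust a new raw score before using it in routing decisions
calibrated = float(mapper.predict([0.95])[0])
```

A monotone fit preserves the model's ranking of its own answers while correcting the absolute scale, which is usually what routing thresholds need.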
How often should I recalibrate?
Recalibrate whenever the underlying model changes (new version, different provider) or when your input distribution shifts significantly. A monthly calibration check on a held-out evaluation set is good practice for production agents.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.