
AI Cost Management: Building a Budget for Production LLM Apps

A comprehensive guide to understanding, forecasting, and optimizing the costs of running LLM-powered applications in production, with real pricing data and cost reduction strategies.

Why LLM Costs Surprise Engineering Teams

Building an LLM prototype costs almost nothing. Running it in production can cost thousands of dollars per day. This gap catches teams off guard because LLM pricing is fundamentally different from traditional API costs: you pay per token processed, and token consumption scales with both request volume and request complexity.

A single Claude Sonnet API call processing a 4,000-token prompt and generating a 1,000-token response costs approximately $0.027. That seems trivial. But at 100,000 requests per day with an average context of 8,000 input tokens and 1,000 output tokens per request, the daily bill reaches roughly $3,900 -- and that is before you account for retries, multi-turn conversations, or RAG context injection.
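Back-of-envelope math like this is worth scripting early. A minimal sketch, assuming Claude Sonnet 4 list prices ($3.00 input / $15.00 output per 1M tokens):

```python
# Per-token rates for Claude Sonnet 4 (list prices, $ per token)
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call at real-time pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# One call: 4,000-token prompt, 1,000-token response
single_call = request_cost(4_000, 1_000)           # ~ $0.027

# At scale: 100K requests/day, 8,000-token context, 1,000-token output
daily_bill = request_cost(8_000, 1_000) * 100_000  # ~ $3,900/day
```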

Understanding LLM Pricing Models

Token-Based Pricing

All major LLM providers use token-based pricing with separate rates for input and output tokens.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
| --- | --- | --- | --- |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K |
| Claude Opus 4 | $15.00 | $75.00 | 200K |
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M |

Output tokens are four to five times more expensive than input tokens across all of these providers, which makes response length a primary cost driver.

Batch vs. Real-Time Pricing

Most providers offer a batch API at a 50% discount for non-time-sensitive workloads:

import anthropic

client = anthropic.Anthropic()

# Real-time: $3.00 / $15.00 per 1M tokens
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyze this document..."}]
)

# Batch: $1.50 / $7.50 per 1M tokens (50% cheaper)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "doc-001",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Analyze this document..."}]
            }
        }
        # ... up to 100,000 requests per batch
    ]
)

Building a Cost Model

Step 1: Measure Your Token Distribution

Before you can forecast costs, you need to know your actual token consumption patterns:

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TokenStats:
    input_tokens: list[int]
    output_tokens: list[int]

    @property
    def avg_input(self) -> float:
        return sum(self.input_tokens) / len(self.input_tokens)

    @property
    def avg_output(self) -> float:
        return sum(self.output_tokens) / len(self.output_tokens)

    @property
    def p95_input(self) -> int:
        sorted_tokens = sorted(self.input_tokens)
        return sorted_tokens[int(len(sorted_tokens) * 0.95)]

    @property
    def p95_output(self) -> int:
        sorted_tokens = sorted(self.output_tokens)
        return sorted_tokens[int(len(sorted_tokens) * 0.95)]

class CostTracker:
    PRICING = {
        "claude-haiku": {"input": 0.80, "output": 4.00},
        "claude-sonnet": {"input": 3.00, "output": 15.00},
        "claude-opus": {"input": 15.00, "output": 75.00},
    }

    def __init__(self):
        self.stats: dict[str, TokenStats] = defaultdict(
            lambda: TokenStats([], [])
        )

    def record(self, model: str, input_tokens: int, output_tokens: int):
        self.stats[model].input_tokens.append(input_tokens)
        self.stats[model].output_tokens.append(output_tokens)

    def daily_cost_estimate(self, model: str, daily_requests: int) -> float:
        stats = self.stats[model]
        pricing = self.PRICING[model]
        input_cost = (stats.avg_input * daily_requests / 1_000_000) * pricing["input"]
        output_cost = (stats.avg_output * daily_requests / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def monthly_forecast(self, model: str, daily_requests: int) -> float:
        return self.daily_cost_estimate(model, daily_requests) * 30
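Plugging representative numbers into the estimator's formula gives a feel for the scale involved. A sketch with assumed traffic (4,000 input / 1,000 output tokens on average, 50,000 requests/day at Sonnet rates):

```python
# Assumed averages and volume; rates are Sonnet list prices ($/1M tokens)
avg_input, avg_output = 4_000, 1_000
daily_requests = 50_000

input_cost = avg_input * daily_requests / 1_000_000 * 3.00     # $600/day
output_cost = avg_output * daily_requests / 1_000_000 * 15.00  # $750/day
daily_total = input_cost + output_cost                         # $1,350/day
monthly_forecast = daily_total * 30                            # $40,500/month
```

Note how output tokens dominate despite being only a fifth of the volume: the 5x output rate more than offsets the 4x input token count.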

Step 2: Identify Cost Drivers

The top cost drivers in production LLM applications are:

  1. System prompts repeated on every request: A 2,000-token system prompt at 100K requests/day costs $600/day on Sonnet just for system prompt input tokens
  2. RAG context injection: Stuffing 5,000 tokens of retrieved context into each request multiplies input costs
  3. Multi-turn conversations: Each turn re-sends the full conversation history
  4. Retries: Failed requests that are retried double the token cost
  5. Verbose outputs: Not constraining output length leads to unnecessarily long responses
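The multi-turn driver compounds quickly because each turn re-sends everything before it. A sketch with assumed sizes (2,000-token system prompt, 500 tokens per turn):

```python
SYSTEM_PROMPT_TOKENS = 2_000
TOKENS_PER_TURN = 500

def cumulative_input_tokens(turns: int) -> int:
    # Turn k re-sends the system prompt plus all k turns so far,
    # so total input tokens grow quadratically with conversation length.
    return sum(SYSTEM_PROMPT_TOKENS + k * TOKENS_PER_TURN
               for k in range(1, turns + 1))

# A 10-turn conversation consumes 47,500 input tokens in total --
# nearly double the 25,000 a naive 10 x 2,500 estimate would suggest.
```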

Step 3: Set Budgets and Alerts

class BudgetManager:
    def __init__(self, daily_budget: float, alert_threshold: float = 0.8):
        self.daily_budget = daily_budget
        self.alert_threshold = alert_threshold
        self.daily_spend = 0.0

    async def check_budget(self, estimated_cost: float) -> bool:
        """Check if request is within budget before making the API call."""
        if self.daily_spend + estimated_cost > self.daily_budget:
            await self.send_alert(
                f"Daily budget exceeded: ${self.daily_spend:.2f} / ${self.daily_budget:.2f}"
            )
            return False

        if self.daily_spend + estimated_cost > self.daily_budget * self.alert_threshold:
            await self.send_alert(
                f"Approaching daily budget: ${self.daily_spend:.2f} / ${self.daily_budget:.2f}"
            )

        return True

    def record_spend(self, input_tokens: int, output_tokens: int, model: str):
        pricing = CostTracker.PRICING[model]
        cost = (input_tokens / 1_000_000 * pricing["input"] +
                output_tokens / 1_000_000 * pricing["output"])
        self.daily_spend += cost

Cost Optimization Strategies

1. Prompt Caching

Anthropic's prompt caching reduces costs for repeated system prompts and context. Cached input tokens cost 90% less than uncached tokens:

# First request: input billed at the cache-write rate (25% premium),
# caches the system prompt
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,  # 3000 tokens
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "User question here"}]
)

# Subsequent requests: system prompt served from cache at 10% cost
# Saves ~$2.70 per 1M cached input tokens on Sonnet
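The break-even arithmetic is worth spelling out: cache writes carry a 25% premium over base input, while reads cost 10% of base, so caching pays off from the second request onward (assuming hits land within the cache's TTL). A sketch for a 3,000-token prompt on Sonnet:

```python
BASE_RATE = 3.00                 # $ per 1M input tokens (Sonnet)
CACHE_WRITE = BASE_RATE * 1.25   # first request writes the cache
CACHE_READ = BASE_RATE * 0.10    # subsequent requests read it

def cached_cost(prompt_tokens: int, requests: int) -> float:
    millions = prompt_tokens / 1_000_000
    return millions * CACHE_WRITE + millions * CACHE_READ * (requests - 1)

def uncached_cost(prompt_tokens: int, requests: int) -> float:
    return prompt_tokens / 1_000_000 * BASE_RATE * requests

# 3,000-token prompt, 1,000 requests: ~$0.91 cached vs ~$9.00 uncached
```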

2. Model Tiering

Route requests to the cheapest model that can handle them:

async def tiered_request(task_type: str, prompt: str) -> str:
    model_map = {
        "classification": "claude-haiku",      # $0.80 input
        "extraction": "claude-haiku",           # $0.80 input
        "summarization": "claude-sonnet",       # $3.00 input
        "analysis": "claude-sonnet",            # $3.00 input
        "complex_reasoning": "claude-opus",     # $15.00 input
    }
    model = model_map.get(task_type, "claude-sonnet")
    return await call_model(model, prompt)
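The payoff of tiering depends on your traffic mix. A sketch assuming a hypothetical mix where 70% of requests can be served by Haiku (input rates from the pricing table above):

```python
INPUT_RATES = {"claude-haiku": 0.80, "claude-sonnet": 3.00}  # $/1M tokens

def blended_rate(mix: dict[str, float]) -> float:
    """Average input cost per 1M tokens for a given traffic mix."""
    return sum(INPUT_RATES[model] * share for model, share in mix.items())

all_sonnet = blended_rate({"claude-sonnet": 1.0})                   # $3.00
tiered = blended_rate({"claude-haiku": 0.7, "claude-sonnet": 0.3})  # ~$1.46
# Routing 70% of traffic to Haiku cuts blended input cost roughly in half
```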

3. Response Length Control

Explicitly limit output tokens and instruct the model to be concise:

# Instead of max_tokens=4096 for every request:
MAX_TOKENS_BY_TASK = {
    "yes_no_classification": 10,
    "entity_extraction": 256,
    "short_summary": 512,
    "detailed_analysis": 2048,
}
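Because output tokens are the expensive ones, capping max_tokens also caps worst-case spend. A quick sketch of the bound (Sonnet output rate assumed):

```python
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token (Sonnet)

def worst_case_output_cost(max_tokens: int, requests: int) -> float:
    # Upper bound: every response fills its entire output budget
    return max_tokens * requests * OUTPUT_RATE

loose = worst_case_output_cost(4_096, 50_000)  # blanket cap
tight = worst_case_output_cost(512, 50_000)    # task-appropriate cap
# An 8x smaller cap bounds worst-case output spend at 1/8 the cost
```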

4. Semantic Caching

Cache responses for semantically similar queries to avoid redundant API calls:

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a_arr, b_arr = np.asarray(a), np.asarray(b)
    return float(np.dot(a_arr, b_arr) /
                 (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.cache: dict[str, str] = {}
        self.embeddings: dict[str, list[float]] = {}
        self.threshold = similarity_threshold

    async def get_or_compute(self, query: str, compute_fn) -> str:
        query_embedding = await get_embedding(query)

        for cached_key, cached_embedding in self.embeddings.items():
            similarity = cosine_similarity(query_embedding, cached_embedding)
            if similarity >= self.threshold:
                return self.cache[cached_key]

        result = await compute_fn(query)
        self.cache[query] = result
        self.embeddings[query] = query_embedding
        return result

5. Conversation Summarization

For multi-turn conversations, summarize older turns instead of re-sending the full history:

async def manage_conversation_context(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    total_tokens = count_tokens(messages)

    if total_tokens <= max_tokens:
        return messages

    # Keep the system prompt and the last 4 messages verbatim;
    # summarize everything in between
    middle = messages[1:-4]
    summary = await summarize_conversation(middle)

    return [
        messages[0],  # system prompt
        {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
        *messages[-4:]  # recent messages
    ]

Real-World Cost Breakdown

Here is a real cost breakdown for a production RAG application handling 50,000 queries per day:

| Component | Daily Cost | Optimization Applied | Savings |
| --- | --- | --- | --- |
| Embedding generation | $8 | Cached embeddings for repeated queries | 40% |
| Vector search | $15 | Managed service (not LLM cost) | N/A |
| LLM inference (Sonnet) | $142 | Prompt caching + model tiering | 55% |
| Retries | $12 | Reduced timeout, better error handling | 60% |
| Total | $177/day | | |
| After optimization | $95/day | | 46% savings |

Conclusion

LLM cost management is a discipline, not an afterthought. The teams that control costs effectively build instrumentation from day one, route requests to appropriate model tiers, leverage prompt caching aggressively, and set hard budget limits with alerting. Start measuring today -- you cannot optimize what you do not measure.
