
Token Usage Analytics: Understanding and Optimizing LLM Consumption Patterns

Learn how to track token consumption across AI agents, attribute costs to specific features and users, identify usage trends, and implement optimization strategies that reduce LLM spend without sacrificing quality.

Why Token Usage Analytics Matter

LLM costs are directly tied to token consumption. A single agent conversation might use anywhere from 500 to 50,000 tokens depending on context length, tool calls, and conversation depth. Without granular tracking, you cannot answer basic questions: Which agent costs the most? Which conversations are outliers? Is your cost per resolution trending up or down?

Token analytics transform LLM spending from an opaque monthly bill into a controllable, optimizable metric.

Capturing Token Data

Every LLM API response includes token usage information. The key is capturing this data consistently and attaching it to the right context: the conversation, the agent, and the specific step within the agent loop.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

@dataclass
class TokenRecord:
    conversation_id: str
    agent_name: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    step_type: str = ""  # "main_response", "tool_call", "classification"
    cost_usd: float = 0.0

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
    "gpt-4.1": {"input": 2.00 / 1_000_000, "output": 8.00 / 1_000_000},
    "gpt-4.1-mini": {"input": 0.40 / 1_000_000, "output": 1.60 / 1_000_000},
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    return (
        prompt_tokens * pricing["input"]
        + completion_tokens * pricing["output"]
    )
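As a quick sanity check of the arithmetic, here is the cost function exercised with round numbers (the gpt-4o pricing entry is repeated from the table above so the snippet runs on its own):

```python
# Pricing entry and cost function repeated from above so this
# snippet is self-contained.
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    return prompt_tokens * pricing["input"] + completion_tokens * pricing["output"]

# 1,200 prompt tokens and 300 completion tokens on gpt-4o:
# 1200 * $2.50/1M + 300 * $10.00/1M = $0.003 + $0.003 = $0.006
print(round(calculate_cost("gpt-4o", 1200, 300), 6))
```

Note that unknown models silently fall back to zero cost, so a typo in a model name shows up as suspiciously free usage rather than an error.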

Building a Token Tracker

A centralized tracker wraps every LLM call, records token usage, and provides aggregation methods.

from collections import defaultdict
import json

class TokenTracker:
    def __init__(self):
        self.records: list[TokenRecord] = []
        self._by_conversation: dict[str, list[TokenRecord]] = defaultdict(list)
        self._by_agent: dict[str, list[TokenRecord]] = defaultdict(list)

    def record(self, rec: TokenRecord) -> None:
        rec.cost_usd = calculate_cost(
            rec.model, rec.prompt_tokens, rec.completion_tokens
        )
        self.records.append(rec)
        self._by_conversation[rec.conversation_id].append(rec)
        self._by_agent[rec.agent_name].append(rec)

    def tracked_completion(
        self, conversation_id: str, agent_name: str,
        step_type: str, **kwargs
    ) -> dict:
        response = client.chat.completions.create(**kwargs)
        usage = response.usage
        rec = TokenRecord(
            conversation_id=conversation_id,
            agent_name=agent_name,
            model=kwargs.get("model", "unknown"),
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
            total_tokens=usage.total_tokens,
            step_type=step_type,
        )
        self.record(rec)
        return {
            "response": response,
            "tokens": rec,
        }

    def cost_by_agent(self) -> dict[str, float]:
        return {
            agent: sum(r.cost_usd for r in records)
            for agent, records in self._by_agent.items()
        }

    def cost_by_conversation(self) -> dict[str, float]:
        return {
            conv: sum(r.cost_usd for r in records)
            for conv, records in self._by_conversation.items()
        }
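A quick sketch of the aggregation methods in action, bypassing the API by constructing records directly (the classes are condensed from the versions above to just the fields this demo touches):

```python
from collections import defaultdict
from dataclasses import dataclass

# Condensed TokenRecord/TokenTracker from above -- only the fields
# the aggregation demo needs.
@dataclass
class TokenRecord:
    conversation_id: str
    agent_name: str
    cost_usd: float

class TokenTracker:
    def __init__(self):
        self._by_agent = defaultdict(list)
        self._by_conversation = defaultdict(list)

    def record(self, rec: TokenRecord) -> None:
        self._by_agent[rec.agent_name].append(rec)
        self._by_conversation[rec.conversation_id].append(rec)

    def cost_by_agent(self) -> dict[str, float]:
        return {a: sum(r.cost_usd for r in recs)
                for a, recs in self._by_agent.items()}

tracker = TokenTracker()
tracker.record(TokenRecord("conv-1", "support_agent", 0.004))
tracker.record(TokenRecord("conv-1", "router", 0.0002))
tracker.record(TokenRecord("conv-2", "support_agent", 0.006))
print(tracker.cost_by_agent())  # support_agent totals ~0.01, router 0.0002
```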

Usage Trend Analysis

Tracking token usage over time reveals whether your agents are becoming more or less efficient. A rising cost-per-conversation trend signals prompt bloat or unnecessary tool calls.


from collections import defaultdict

def daily_usage_summary(
    records: list[TokenRecord], days: int = 30
) -> list[dict]:
    daily: dict[str, dict] = defaultdict(
        lambda: {"total_tokens": 0, "cost_usd": 0.0, "conversations": set()}
    )
    for rec in records:
        day = rec.timestamp[:10]  # extract YYYY-MM-DD
        daily[day]["total_tokens"] += rec.total_tokens
        daily[day]["cost_usd"] += rec.cost_usd
        daily[day]["conversations"].add(rec.conversation_id)

    summary = []
    for day in sorted(daily.keys())[-days:]:
        data = daily[day]
        conv_count = len(data["conversations"])
        summary.append({
            "date": day,
            "total_tokens": data["total_tokens"],
            "total_cost": round(data["cost_usd"], 4),
            "conversations": conv_count,
            "cost_per_conversation": round(
                data["cost_usd"] / conv_count, 4
            ) if conv_count else 0,
            "tokens_per_conversation": (
                data["total_tokens"] // conv_count
            ) if conv_count else 0,
        })
    return summary
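A quick check of the grouping logic with stand-in records; the function is condensed from the version above, and `SimpleNamespace` stands in for full `TokenRecord` instances:

```python
from collections import defaultdict
from types import SimpleNamespace

# daily_usage_summary condensed from above so this runs standalone.
def daily_usage_summary(records, days=30):
    daily = defaultdict(
        lambda: {"total_tokens": 0, "cost_usd": 0.0, "conversations": set()}
    )
    for rec in records:
        day = rec.timestamp[:10]  # extract YYYY-MM-DD
        daily[day]["total_tokens"] += rec.total_tokens
        daily[day]["cost_usd"] += rec.cost_usd
        daily[day]["conversations"].add(rec.conversation_id)
    summary = []
    for day in sorted(daily)[-days:]:
        data = daily[day]
        n = len(data["conversations"])
        summary.append({
            "date": day,
            "conversations": n,
            "cost_per_conversation": round(data["cost_usd"] / n, 4) if n else 0,
        })
    return summary

# Two conversations on Jan 1, one on Jan 2 -- stand-in data.
records = [
    SimpleNamespace(timestamp="2025-01-01T09:00:00", total_tokens=800,
                    cost_usd=0.004, conversation_id="c1"),
    SimpleNamespace(timestamp="2025-01-01T10:00:00", total_tokens=1200,
                    cost_usd=0.008, conversation_id="c2"),
    SimpleNamespace(timestamp="2025-01-02T09:30:00", total_tokens=500,
                    cost_usd=0.002, conversation_id="c3"),
]
for row in daily_usage_summary(records):
    print(row)
```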

Optimization Opportunities

Once you have visibility into token consumption, several optimization strategies become obvious. Prompt compression reduces input tokens. Model tiering routes simple requests to cheaper models. Caching avoids redundant calls entirely.

class TokenOptimizer:
    def __init__(self, tracker: TokenTracker):
        self.tracker = tracker

    def find_expensive_conversations(
        self, threshold_usd: float = 0.10
    ) -> list[dict]:
        costs = self.tracker.cost_by_conversation()
        return [
            {"conversation_id": cid, "cost": cost}
            for cid, cost in sorted(costs.items(), key=lambda x: -x[1])
            if cost > threshold_usd
        ]

    def find_prompt_bloat(self, threshold_ratio: float = 5.0) -> list[dict]:
        bloated = []
        for rec in self.tracker.records:
            ratio = rec.prompt_tokens / max(rec.completion_tokens, 1)
            if ratio > threshold_ratio:
                bloated.append({
                    "conversation_id": rec.conversation_id,
                    "agent": rec.agent_name,
                    "prompt_tokens": rec.prompt_tokens,
                    "completion_tokens": rec.completion_tokens,
                    "ratio": round(ratio, 1),
                })
        return bloated

    def model_tier_recommendation(self) -> list[dict]:
        recommendations = []
        for agent, records in self.tracker._by_agent.items():
            avg_tokens = sum(r.total_tokens for r in records) / len(records)
            if avg_tokens < 500 and records[0].model != "gpt-4o-mini":
                recommendations.append({
                    "agent": agent,
                    "current_model": records[0].model,
                    "suggested_model": "gpt-4o-mini",
                    # rough estimate from the pricing gap between tiers
                    "potential_savings_pct": 85,
                })
        return recommendations
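The prompt-bloat check is easy to exercise with stand-in records; the ratio logic below is repeated from `find_prompt_bloat` above, minus the tracker plumbing:

```python
from types import SimpleNamespace

# Ratio check from find_prompt_bloat, driven by stand-in records.
def find_prompt_bloat(records, threshold_ratio=5.0):
    bloated = []
    for rec in records:
        ratio = rec.prompt_tokens / max(rec.completion_tokens, 1)
        if ratio > threshold_ratio:
            bloated.append({
                "conversation_id": rec.conversation_id,
                "ratio": round(ratio, 1),
            })
    return bloated

records = [
    SimpleNamespace(conversation_id="c1",
                    prompt_tokens=4000, completion_tokens=200),  # 20:1 -- bloated
    SimpleNamespace(conversation_id="c2",
                    prompt_tokens=900, completion_tokens=300),   # 3:1 -- fine
]
print(find_prompt_bloat(records))  # flags only c1
```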

FAQ

How do I track token usage for streaming responses?

Most APIs provide token counts in the final chunk of a streaming response. For OpenAI, the last chunk includes a usage field when you set stream_options={"include_usage": True} in your request. Capture this final chunk and feed it into your tracker just like a non-streaming response.
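That pattern can be sketched as a small consumer function. This assumes the OpenAI Python SDK's chunk shape (a `usage` attribute on the final chunk when `include_usage` is set, `choices[].delta.content` on the rest); the demo drives it with stand-in chunks so it runs without an API key:

```python
from types import SimpleNamespace as NS

def consume_stream(chunks):
    """Accumulate streamed text and capture the final usage chunk."""
    text_parts, usage = [], None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage  # final chunk carries token counts
        for choice in chunk.choices:
            if choice.delta.content:
                text_parts.append(choice.delta.content)
    return "".join(text_parts), usage

# A real call would look like:
# stream = client.chat.completions.create(
#     model="gpt-4o-mini", messages=[...],
#     stream=True, stream_options={"include_usage": True},
# )
# text, usage = consume_stream(stream)

# Stand-in chunks shaped like the SDK's:
fake_stream = [
    NS(usage=None, choices=[NS(delta=NS(content="Hel"))]),
    NS(usage=None, choices=[NS(delta=NS(content="lo"))]),
    NS(usage=NS(prompt_tokens=12, completion_tokens=2, total_tokens=14),
       choices=[]),
]
text, usage = consume_stream(fake_stream)
print(text, usage.total_tokens)  # Hello 14
```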

What is a good cost-per-conversation benchmark?

It varies dramatically by use case. Simple FAQ agents using gpt-4o-mini might cost $0.001 per conversation. Complex multi-step agents with tool calls on gpt-4o can reach $0.05 to $0.20. The more useful benchmark is cost-per-resolution, which factors in whether the agent actually solved the problem.

Should I set hard token limits on conversations?

Yes, but with a graceful fallback. Set a warning threshold at 80% of your budget and a hard limit at 100%. When the warning threshold is hit, instruct the agent to summarize and resolve quickly. When the hard limit is hit, escalate to a human rather than abruptly cutting the conversation.
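That policy fits in a small helper; a sketch using the 80%/100% thresholds suggested above:

```python
def budget_status(tokens_used: int, budget: int, warn_at: float = 0.8) -> str:
    """Return 'ok', 'warn' (ask the agent to wrap up), or 'stop' (escalate)."""
    if tokens_used >= budget:
        return "stop"
    if tokens_used >= warn_at * budget:
        return "warn"
    return "ok"

print(budget_status(3_000, 10_000))   # ok
print(budget_status(8_500, 10_000))   # warn
print(budget_status(10_200, 10_000))  # stop
```

Check the status after each agent turn, since a single tool-heavy step can jump straight past the warning band.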


#TokenUsage #CostOptimization #LLM #Analytics #AIAgents #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
