LiteLLM: A Unified Interface for 100+ LLM Providers in Agent Applications
Set up LiteLLM to call OpenAI, Anthropic, Mistral, Ollama, and 100+ other providers through a single API. Implement fallbacks, load balancing, and cost tracking for production agents.
The Multi-Provider Problem
Production agent systems rarely depend on a single LLM provider. You might use GPT-4o for complex reasoning, Claude for long-context tasks, Mistral for cost-effective classification, and a local Ollama model for development. Each provider has a different API format, authentication mechanism, and error handling behavior.
LiteLLM solves this by providing a single completion() function that translates your request to any of 100+ providers. You write your code once, and LiteLLM handles the API differences, retry logic, and response normalization.
Installation and Basic Usage
Install LiteLLM:
pip install litellm
The core API mirrors OpenAI's interface. To switch providers, you only change the model string:
import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from OpenAI"}],
)

# Anthropic — same interface
response = litellm.completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello from Anthropic"}],
)

# Mistral
response = litellm.completion(
    model="mistral/mistral-large-latest",
    messages=[{"role": "user", "content": "Hello from Mistral"}],
)

# Local Ollama
response = litellm.completion(
    model="ollama/llama3.1:8b",
    messages=[{"role": "user", "content": "Hello from Ollama"}],
    api_base="http://localhost:11434",
)

# All responses have the same structure
print(response.choices[0].message.content)
Set API keys via environment variables:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."
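Since LiteLLM reads these variables at call time, a missing key surfaces only when a request hits that provider. A minimal stdlib sketch for checking keys up front — the helper name `missing_provider_keys` is ours for illustration, not part of LiteLLM, though the variable names match the ones above:

```python
import os

# Provider name -> environment variable LiteLLM reads for its API key.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def missing_provider_keys(providers, env=os.environ):
    """Return the providers whose API key variable is unset or empty."""
    return [p for p in providers if not env.get(REQUIRED_KEYS[p])]
```

Running this at startup turns a confusing mid-session authentication error into an immediate, actionable failure.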
The LiteLLM Proxy Server
For production, run LiteLLM as a proxy server that your agents connect to. This centralizes API key management, logging, and cost tracking:
# litellm_config.yaml
model_list:
  - model_name: "fast-agent"
    litellm_params:
      model: "gpt-4o-mini"
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: "smart-agent"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: "local-agent"
    litellm_params:
      model: "ollama/llama3.1:8b"
      api_base: "http://localhost:11434"
  - model_name: "smart-agent"  # Second deployment for fallback
    litellm_params:
      model: "gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
Start the proxy:
litellm --config litellm_config.yaml --port 4000
Now your agents connect to http://localhost:4000 using the standard OpenAI client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-anything",  # Proxy handles real keys
)

response = client.chat.completions.create(
    model="smart-agent",  # Routes to Claude, falls back to GPT-4o
    messages=[{"role": "user", "content": "Analyze this data..."}],
)
Implementing Fallbacks
Provider outages happen. LiteLLM supports automatic fallbacks so your agent keeps working when one provider goes down:
from litellm import completion

# Fallback chain: try Claude first, then GPT-4o, then local
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    fallbacks=["gpt-4o", "ollama/llama3.1:8b"],
    num_retries=2,
)
For the proxy server, configure fallbacks in the YAML:
router_settings:
  routing_strategy: "simple-shuffle"  # Load balance across same-name models
  num_retries: 3
  timeout: 30
  fallbacks: [{"smart-agent": ["fast-agent", "local-agent"]}]
When a request to smart-agent (Claude) fails, LiteLLM automatically retries with fast-agent (GPT-4o-mini), then local-agent (Ollama).
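The fallback behavior amounts to trying deployments in order until one succeeds. A simplified pure-Python sketch of that loop — the callables stand in for model deployments, and this is our illustration of the pattern, not LiteLLM's internals:

```python
def call_with_fallbacks(primary, fallbacks, request):
    """Try each deployment in order; return the first successful response.

    `primary` and `fallbacks` are callables standing in for model
    deployments. LiteLLM additionally narrows the caught exceptions to
    retryable errors (timeouts, rate limits, 5xx) rather than all of them.
    """
    last_error = None
    for deployment in [primary, *fallbacks]:
        try:
            return deployment(request)
        except Exception as exc:
            last_error = exc  # remember the failure, try the next deployment
    raise last_error  # every deployment failed: surface the last error
```

The important design point is that the caller sees either one successful response or one exception, never the intermediate failures.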
Cost Tracking and Budgets
LiteLLM tracks costs per request automatically:
import litellm

litellm.success_callback = ["langfuse"]  # Send cost data to Langfuse

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this report."}],
)

# Access cost information
print(f"Cost: ${response._hidden_params['response_cost']:.6f}")
Set spending limits per model or per user through the proxy:
general_settings:
  max_budget: 100.0  # $100 monthly budget
  budget_duration: "monthly"
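Budget enforcement like this can also be approximated client-side when you are not running the proxy. A minimal sketch of a tracker that refuses requests once a cap is reached — `BudgetTracker` is our illustration, not a LiteLLM class:

```python
class BudgetTracker:
    """Accumulate per-request costs and refuse spend beyond a cap."""

    def __init__(self, max_budget: float):
        self.max_budget = max_budget
        self.spent = 0.0

    def record(self, cost: float) -> float:
        """Record a request's cost; raise if it would exceed the budget."""
        if self.spent + cost > self.max_budget:
            raise RuntimeError(
                f"budget exceeded: {self.spent + cost:.4f} > {self.max_budget:.4f}"
            )
        self.spent += cost
        return self.max_budget - self.spent  # remaining budget
```

In practice you would feed it the `response_cost` value shown above after each call.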
Agent Integration Pattern
Here is a production-ready agent class that uses LiteLLM for multi-provider support:
from dataclasses import dataclass

from openai import OpenAI


@dataclass
class ModelConfig:
    name: str
    max_tokens: int
    temperature: float


MODELS = {
    "reasoning": ModelConfig("smart-agent", 4096, 0.2),
    "classification": ModelConfig("fast-agent", 256, 0.0),
    "summarization": ModelConfig("fast-agent", 1024, 0.3),
}


class MultiProviderAgent:
    def __init__(self, proxy_url: str = "http://localhost:4000/v1"):
        self.client = OpenAI(base_url=proxy_url, api_key="internal")

    def call(self, task_type: str, messages: list) -> str:
        config = MODELS[task_type]
        response = self.client.chat.completions.create(
            model=config.name,
            messages=messages,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        return response.choices[0].message.content

    def classify(self, text: str, categories: list[str]) -> str:
        return self.call("classification", [
            {"role": "system", "content": f"Classify into: {categories}. "
                                          "Respond with just the category name."},
            {"role": "user", "content": text},
        ])

    def reason(self, query: str, context: str) -> str:
        return self.call("reasoning", [
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ])


agent = MultiProviderAgent()
category = agent.classify("My order hasn't arrived", ["billing", "shipping", "technical"])
print(f"Category: {category}")
FAQ
Does LiteLLM add significant latency?
As a Python library (not proxy mode), LiteLLM adds less than 1ms of overhead — it is just translating the request format. As a proxy server, it adds 5-15ms of network latency for the extra hop. For most agent applications, this is negligible compared to the 200-2000ms LLM inference time.
Can LiteLLM handle streaming responses?
Yes, LiteLLM fully supports streaming across all providers. Use stream=True in your completion call, and LiteLLM normalizes the streaming format so you get consistent ChatCompletionChunk objects regardless of the underlying provider.
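A sketch of consuming such a stream. The chunk shape follows the OpenAI delta convention that LiteLLM normalizes to; `stream_completion` and `join_deltas` are illustrative helper names of ours, and the actual network call requires real API keys:

```python
def join_deltas(deltas):
    """Concatenate streamed content deltas, skipping the None values
    that appear in role-only and final chunks."""
    return "".join(d for d in deltas if d is not None)


def stream_completion(model: str, messages: list) -> str:
    """Stream a completion via LiteLLM and return the full text.

    Requires network access and provider API keys, so the import is
    kept local to this function.
    """
    import litellm

    chunks = litellm.completion(model=model, messages=messages, stream=True)
    return join_deltas(chunk.choices[0].delta.content for chunk in chunks)
```

Because the chunk shape is the same for every provider, the accumulation logic never has to branch on which backend served the request.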
How does LiteLLM compare to building my own provider abstraction?
Building your own abstraction for two or three providers is manageable. Beyond that, you are reinventing LiteLLM. LiteLLM handles edge cases you would not think of — different error codes, rate limit headers, token counting differences, and streaming format variations across providers. Use the library and focus your engineering time on agent logic.
#LiteLLM #LLMGateway #MultiProvider #Fallback #CostOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.