
LiteLLM: A Unified Interface for 100+ LLM Providers in Agent Applications

Set up LiteLLM to call OpenAI, Anthropic, Mistral, Ollama, and 100+ other providers through a single API. Implement fallbacks, load balancing, and cost tracking for production agents.

The Multi-Provider Problem

Production agent systems rarely depend on a single LLM provider. You might use GPT-4o for complex reasoning, Claude for long-context tasks, Mistral for cost-effective classification, and a local Ollama model for development. Each provider has a different API format, authentication mechanism, and error handling behavior.

LiteLLM solves this by providing a single completion() function that translates your request to any of 100+ providers. You write your code once, and LiteLLM handles the API differences, retry logic, and response normalization.

Installation and Basic Usage

Install LiteLLM:

pip install litellm

The core API mirrors OpenAI's interface. To switch providers, you only change the model string:

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from OpenAI"}],
)

# Anthropic — same interface
response = litellm.completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello from Anthropic"}],
)

# Mistral
response = litellm.completion(
    model="mistral/mistral-large-latest",
    messages=[{"role": "user", "content": "Hello from Mistral"}],
)

# Local Ollama
response = litellm.completion(
    model="ollama/llama3.1:8b",
    messages=[{"role": "user", "content": "Hello from Ollama"}],
    api_base="http://localhost:11434",
)

# All responses have the same structure
print(response.choices[0].message.content)

Set API keys via environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."
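As the examples above show, LiteLLM routes on the model string itself: either a well-known model name like gpt-4o, or an explicit provider/model prefix such as mistral/mistral-large-latest. The sketch below illustrates that naming convention only; it is not LiteLLM's internal router, and the "auto" placeholder is an assumption for models without a prefix:

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model' into (provider, model)."""
    if "/" in model:
        provider, _, name = model.partition("/")
        return provider, name
    # No prefix: LiteLLM infers the provider from known model names.
    return "auto", model

print(split_model_string("mistral/mistral-large-latest"))  # ('mistral', 'mistral-large-latest')
print(split_model_string("ollama/llama3.1:8b"))            # ('ollama', 'llama3.1:8b')
print(split_model_string("gpt-4o"))                        # ('auto', 'gpt-4o')
```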

The LiteLLM Proxy Server

For production, run LiteLLM as a proxy server that your agents connect to. This centralizes API key management, logging, and cost tracking:

# litellm_config.yaml
model_list:
  - model_name: "fast-agent"
    litellm_params:
      model: "gpt-4o-mini"
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: "smart-agent"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "os.environ/ANTHROPIC_API_KEY"

  - model_name: "local-agent"
    litellm_params:
      model: "ollama/llama3.1:8b"
      api_base: "http://localhost:11434"

  - model_name: "smart-agent"  # Second deployment for fallback
    litellm_params:
      model: "gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"

Start the proxy:

litellm --config litellm_config.yaml --port 4000

Now your agents connect to http://localhost:4000 using the standard OpenAI client:


from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-anything",  # Proxy handles real keys
)

response = client.chat.completions.create(
    model="smart-agent",  # Routes to Claude, falls back to GPT-4o
    messages=[{"role": "user", "content": "Analyze this data..."}],
)

Implementing Fallbacks

Provider outages happen. LiteLLM supports automatic fallbacks so your agent keeps working when one provider goes down:

import litellm
from litellm import completion

# Fallback chain: try Claude first, then GPT-4o, then local
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    fallbacks=["gpt-4o", "ollama/llama3.1:8b"],
    num_retries=2,
)

For the proxy server, configure fallbacks in the YAML:

router_settings:
  routing_strategy: "simple-shuffle"  # Load balance across same-name models
  num_retries: 3
  timeout: 30
  fallbacks: [
    {"smart-agent": ["fast-agent", "local-agent"]}
  ]

When a request to smart-agent (Claude) fails, LiteLLM automatically retries with fast-agent (GPT-4o-mini), then local-agent (Ollama).
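The retry-then-fall-back behavior is easy to picture as a loop over an ordered list of deployments. Here is a minimal stdlib sketch of the idea; the provider callables are stand-ins for illustration, not LiteLLM internals:

```python
import time

def call_with_fallbacks(providers, prompt, num_retries=2, backoff=0.0):
    """Try each (name, provider) in order; retry transient failures before falling back."""
    last_error = None
    for name, provider in providers:
        for attempt in range(num_retries + 1):
            try:
                return name, provider(prompt)
            except Exception as exc:  # real code would catch provider-specific errors
                last_error = exc
                time.sleep(backoff * (2 ** attempt))  # exponential backoff between retries
    raise RuntimeError(f"all providers failed: {last_error}")

def broken_provider(prompt):
    raise TimeoutError("provider down")

# Stand-in deployments: the first always fails, the second succeeds.
providers = [
    ("smart-agent", broken_provider),
    ("fast-agent", lambda p: f"answer to: {p}"),
]
used, answer = call_with_fallbacks(providers, "Explain quantum computing")
print(used)  # fast-agent
```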

Cost Tracking and Budgets

LiteLLM tracks costs per request automatically:

import litellm

litellm.success_callback = ["langfuse"]  # Send cost data to Langfuse

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this report."}],
)

# Access cost information (also exposed via litellm.completion_cost())
print(f"Cost: ${response._hidden_params['response_cost']:.6f}")

Set spending limits per model or per user through the proxy:

litellm_settings:
  max_budget: 100.0  # USD budget across the proxy
  budget_duration: 30d  # Reset every 30 days

Agent Integration Pattern

Here is a production-ready agent class that uses LiteLLM for multi-provider support:

from openai import OpenAI
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    max_tokens: int
    temperature: float

MODELS = {
    "reasoning": ModelConfig("smart-agent", 4096, 0.2),
    "classification": ModelConfig("fast-agent", 256, 0.0),
    "summarization": ModelConfig("fast-agent", 1024, 0.3),
}

class MultiProviderAgent:
    def __init__(self, proxy_url: str = "http://localhost:4000/v1"):
        self.client = OpenAI(base_url=proxy_url, api_key="internal")

    def call(self, task_type: str, messages: list) -> str:
        config = MODELS[task_type]
        response = self.client.chat.completions.create(
            model=config.name,
            messages=messages,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        return response.choices[0].message.content

    def classify(self, text: str, categories: list[str]) -> str:
        return self.call("classification", [
            {"role": "system", "content": f"Classify into: {categories}. "
             "Respond with just the category name."},
            {"role": "user", "content": text},
        ])

    def reason(self, query: str, context: str) -> str:
        return self.call("reasoning", [
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ])

agent = MultiProviderAgent()
category = agent.classify("My order hasn't arrived", ["billing", "shipping", "technical"])
print(f"Category: {category}")

FAQ

Does LiteLLM add significant latency?

As a Python library (not proxy mode), LiteLLM adds less than 1ms of overhead — it is just translating the request format. As a proxy server, it adds 5-15ms of network latency for the extra hop. For most agent applications, this is negligible compared to the 200-2000ms LLM inference time.

Can LiteLLM handle streaming responses?

Yes, LiteLLM fully supports streaming across all providers. Use stream=True in your completion call, and LiteLLM normalizes the streaming format so you get consistent ChatCompletionChunk objects regardless of the underlying provider.
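Assembling a streamed answer means concatenating each chunk's content delta. A sketch of that accumulation loop using stand-in dataclasses in place of real ChatCompletionChunk objects:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Delta:
    content: Optional[str]

@dataclass
class Choice:
    delta: Delta

@dataclass
class Chunk:  # stand-in for a ChatCompletionChunk
    choices: list

def accumulate(stream) -> str:
    """Join the content deltas from a stream of chunks into the full reply."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

fake_stream = [Chunk([Choice(Delta(t))]) for t in ["Hel", "lo", None]]
print(accumulate(fake_stream))  # Hello
```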

How does LiteLLM compare to building my own provider abstraction?

Building your own abstraction for two or three providers is manageable. Beyond that, you are reinventing LiteLLM. LiteLLM handles edge cases you would not think of — different error codes, rate limit headers, token counting differences, and streaming format variations across providers. Use the library and focus your engineering time on agent logic.


#LiteLLM #LLMGateway #MultiProvider #Fallback #CostOptimization #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
