Agentic AI with Open Source: Building a Self-Hosted LLM Agent Stack
Build agentic AI with open-source models like Llama 3 and Mistral using vLLM, Ollama, LangGraph, and pgvector. Full stack comparison guide.
Why Self-Host Your LLM Agent Stack
Cloud-hosted LLM APIs from OpenAI, Anthropic, and Google are the fastest way to build agentic AI. But as your product scales and matures, several forces push teams toward self-hosted open-source alternatives.
Cost control. At high conversation volumes, API costs become a significant line item. A self-hosted Llama 3.1 70B model on dedicated GPU hardware can reduce per-token costs by 60 to 80 percent compared to GPT-4o pricing, with the tradeoff of higher fixed infrastructure costs.
Data sovereignty. Regulated industries and certain geographies require that data never leaves specific network boundaries. Self-hosted models let you run inference entirely within your own infrastructure, satisfying the strictest data residency requirements.
Latency control. When you own the inference infrastructure, you control the hardware, queue depth, and batching configuration. This eliminates the variable latency that comes from shared multi-tenant API services and lets you optimize specifically for your workload profile.
Customization. Self-hosted models can be fine-tuned on your domain data, quantized to fit specific hardware constraints, and served with custom sampling parameters that hosted APIs may not expose.
The tradeoff is operational complexity. Running inference infrastructure requires GPU management, model serving expertise, monitoring, and on-call support. This guide covers the full stack — from model selection through inference serving to agent frameworks — so you can make informed decisions about what to self-host and when.
Open Source Model Selection
The open-source model landscape has matured dramatically. As of early 2026, several model families are production-viable for agentic AI workloads.
Model Comparison Table
| Model | Parameters | License | Tool Calling | Context Window | Best For |
|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | Meta Community | Native support | 128K | Primary agent reasoning, strong general capability |
| Llama 3.1 8B | 8B | Meta Community | Native support | 128K | Intent classification, simple tool routing, low-latency tasks |
| Mistral Large 2 | 123B | Apache 2.0 | Native support | 128K | Complex reasoning, European data residency option |
| Mistral Nemo 12B | 12B | Apache 2.0 | Native support | 128K | Mid-range tasks, good efficiency-to-quality ratio |
| Qwen 2.5 72B | 72B | Apache 2.0 | Native support | 128K | Multilingual agents, strong Asian language support |
| Qwen 2.5 7B | 7B | Apache 2.0 | Native support | 128K | Low-resource deployments, edge inference |
| DeepSeek V3 | 671B MoE | MIT | Native support | 128K | Maximum capability, requires significant GPU resources |
| Phi-3 Medium 14B | 14B | MIT | Limited | 128K | Cost-efficient medium tasks, small GPU footprint |
Choosing the Right Model for Your Workload
For a typical agentic AI deployment, you need models at multiple capability tiers. Use a large model (70B or above) for the primary agent reasoning loop where the model selects tools, processes results, and generates responses. Use a small model (7B to 14B) for high-volume, low-complexity tasks like intent classification, conversation routing, and simple extraction.
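The tiering described above can be sketched as a small routing function. This is a minimal illustration with hypothetical task names; the model identifiers match the Hugging Face names used elsewhere in this guide, and your own task taxonomy will differ.

```python
# Illustrative two-tier model router. Task names are hypothetical;
# adapt the set to your own workload taxonomy.
SMALL_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
LARGE_MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct"

# High-volume, low-complexity tasks go to the small model tier
SMALL_MODEL_TASKS = {"intent_classification", "routing", "extraction"}

def select_model(task_type: str) -> str:
    """Route a task to the cheapest model tier that can handle it."""
    if task_type in SMALL_MODEL_TASKS:
        return SMALL_MODEL
    # Default to the large model for agent reasoning and tool selection
    return LARGE_MODEL
```

In practice the router usually keys off the calling code path rather than a classifier, so routing itself adds no latency.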
Llama 3.1 70B is the default recommendation for primary agent reasoning. It has native function calling support, performs well on tool selection benchmarks, handles complex multi-turn conversations reliably, and has a permissive license for commercial use. For the small model tier, Llama 3.1 8B or Qwen 2.5 7B both perform well for classification and routing tasks at a fraction of the compute cost.
If your agents serve multilingual users, Qwen 2.5 is stronger than Llama for Chinese, Japanese, Korean, and other Asian languages. For European language support beyond English, Mistral models have an edge.
Inference Servers
The inference server sits between your application code and the model weights. It handles loading models onto GPUs, processing inference requests, batching for throughput, and exposing an API that your agent framework calls.
vLLM
vLLM is the leading open-source inference server for production deployments. It implements PagedAttention for efficient GPU memory management, continuous batching for high throughput, and an OpenAI-compatible API that makes it a drop-in replacement for hosted APIs.
```bash
# Start vLLM serving Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --port 8000
```
When to use vLLM: Production deployments where throughput and latency matter. Multi-GPU setups with tensor parallelism. When you need OpenAI-compatible API endpoints for easy migration from hosted APIs.
Hardware requirements for Llama 3.1 70B: Two NVIDIA A100 80GB GPUs or four A100 40GB GPUs with tensor parallelism. With AWQ quantization (4-bit), you can fit the model on a single A100 80GB.
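Because vLLM exposes an OpenAI-compatible endpoint, tool definitions use the standard Chat Completions `tools` schema. The sketch below builds such a request payload; the `get_order_status` tool is a hypothetical example, not part of any library.

```python
def build_tool_call_request(messages: list) -> dict:
    """Build an OpenAI-compatible chat request with one example tool.

    `get_order_status` is a hypothetical tool for illustration;
    replace it with your own tool definitions.
    """
    return {
        "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
        "messages": messages,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_order_status",
                "description": "Look up the status of a customer order",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {"type": "string"}
                    },
                    "required": ["order_id"],
                },
            },
        }],
        "tool_choice": "auto",
    }
```

POST this payload to `http://<vllm-host>:8000/v1/chat/completions`; with `--enable-auto-tool-choice` set, the model decides per turn whether to emit a tool call or a plain response.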
Ollama
Ollama is designed for simplicity. It handles model downloading, quantization, and serving with minimal configuration. It is excellent for development, testing, and small-scale deployments.
```bash
# Pull and run a model with Ollama
ollama pull llama3.1:70b
ollama serve

# API call
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b",
  "messages": [{"role": "user", "content": "Hello"}]
}'
```
When to use Ollama: Local development and testing. Small-scale deployments where operational simplicity matters more than maximum throughput. Prototyping with different models before committing to a production deployment.
Limitations: Lower throughput than vLLM for high-concurrency workloads. Limited batching optimization. Fewer configuration options for production tuning.
Text Generation Inference (TGI)
Hugging Face's TGI offers a middle ground between Ollama's simplicity and vLLM's performance. It supports continuous batching, quantization, and speculative decoding.
```bash
# Run TGI with Docker
docker run --gpus all -p 8080:80 -v /data/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
    --num-shard 2 \
    --max-input-tokens 8192 \
    --max-total-tokens 16384
```
When to use TGI: When you want Docker-native deployment with good performance. When you are already in the Hugging Face ecosystem. When speculative decoding is important for your latency requirements.
Performance Comparison
| Server | Throughput (tokens/sec, 70B) | P50 Latency (first token) | Ease of Setup | Production Readiness |
|---|---|---|---|---|
| vLLM | 800-1200 | 150-300ms | Medium | High |
| Ollama | 200-400 | 300-800ms | Very Easy | Medium |
| TGI | 600-900 | 200-400ms | Easy (Docker) | High |
These numbers are approximate and vary significantly based on hardware, model quantization, batch size, and input/output length.
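The throughput column translates into serving capacity with some rough arithmetic. The sketch below is an upper-bound estimate under stated assumptions: output tokens dominate cost, and prompt processing and queueing overhead are ignored.

```python
def conversations_per_hour(tokens_per_sec: float,
                           tokens_per_turn: float,
                           turns_per_conversation: float) -> float:
    """Rough upper-bound capacity from aggregate decode throughput.

    Assumes generated tokens dominate serving cost; prompt processing
    and queueing overhead are ignored.
    """
    tokens_per_conversation = tokens_per_turn * turns_per_conversation
    return tokens_per_sec * 3600 / tokens_per_conversation

# Example: vLLM at 1000 tok/s, ~250 output tokens per turn, 8 turns
# -> 2000 tokens per conversation, so roughly 1800 conversations/hour
capacity = conversations_per_hour(1000, 250, 8)
```

Even halved to account for prompt processing and uneven load, one well-tuned vLLM server covers a substantial conversation volume.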
Agent Frameworks for Self-Hosted Models
LangGraph
LangGraph is the most mature framework for building stateful, multi-step agents. It models agent workflows as directed graphs where nodes are processing steps and edges are transitions. It works with any LLM backend that exposes a chat completion API, making it compatible with vLLM and TGI through their OpenAI-compatible endpoints.
```python
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

# Point to your self-hosted vLLM instance via its OpenAI-compatible API
llm = ChatOpenAI(
    base_url="http://vllm-server:8000/v1",
    api_key="not-needed",  # vLLM does not require an API key by default
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
)

# Define agent graph (AgentState, the node functions, and should_respond
# are defined elsewhere in your application)
graph = StateGraph(AgentState)
graph.add_node("reason", reason_node)
graph.add_node("execute_tool", tool_node)
graph.add_node("respond", response_node)
graph.set_entry_point("reason")
graph.add_conditional_edges("reason", should_respond, {
    True: "respond",
    False: "execute_tool"
})
graph.add_edge("execute_tool", "reason")
graph.add_edge("respond", END)
agent = graph.compile()
```
Strengths: Flexible graph-based workflow design. Good support for complex multi-step agents. Built-in state management and checkpointing. Active development and community.
AutoGen
Microsoft's AutoGen framework is designed for multi-agent conversations where multiple specialized agents collaborate to accomplish tasks. It works well with self-hosted models and supports both two-agent and group-chat patterns.
Strengths: Purpose-built for multi-agent collaboration. Good patterns for agent-to-agent communication. Useful when your architecture involves distinct specialized agents that need to coordinate.
Custom Lightweight Orchestrator
For many agentic AI products, a custom orchestrator built with 50 to 100 lines of Python provides more control and less overhead than a full framework. The core loop is simple: send messages to the LLM, check if the response includes tool calls, execute tools and append results, repeat until the LLM generates a final response.
```python
import json
import httpx

VLLM_URL = "http://vllm-server:8000/v1/chat/completions"

async def agent_loop(messages: list, tools: list) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        while True:
            response = await client.post(
                VLLM_URL,
                json={
                    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
                    "messages": messages,
                    "tools": tools,
                    "tool_choice": "auto"
                }
            )
            result = response.json()["choices"][0]["message"]
            messages.append(result)
            # No tool calls means the model produced its final answer
            if not result.get("tool_calls"):
                return result["content"]
            for tool_call in result["tool_calls"]:
                # Tool arguments arrive as a JSON string and must be parsed
                output = await execute_tool(
                    tool_call["function"]["name"],
                    json.loads(tool_call["function"]["arguments"])
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": str(output)
                })
```
When to use a custom orchestrator: When your agent flow is straightforward (single agent, linear tool use). When you want full control over every aspect of the pipeline. When framework abstractions add complexity without corresponding value.
Embedding Models and Vector Databases
If your agents need retrieval-augmented generation (RAG) to access knowledge bases, product catalogs, or documentation, you need an embedding model and a vector database.
Open Source Embedding Models
| Model | Dimensions | Performance (MTEB) | Self-Host Ease |
|---|---|---|---|
| BGE-Large-en-v1.5 | 1024 | Strong | Easy (sentence-transformers) |
| E5-Mistral-7B | 4096 | Very Strong | Medium (needs GPU) |
| Nomic-Embed-Text | 768 | Good | Easy (Ollama supported) |
| GTE-Qwen2-7B | 3584 | Very Strong | Medium (needs GPU) |
For most agentic AI use cases, BGE-Large or Nomic-Embed-Text provide excellent quality without requiring GPU resources for the embedding model itself.
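Whichever embedding model you choose, retrieval reduces to nearest-neighbor search under cosine similarity. The brute-force sketch below shows the core operation; a vector database replaces this linear scan with an approximate index such as HNSW.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the metric most embedding models are tuned for."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Brute-force nearest neighbors over pre-computed document embeddings.

    Fine for small collections; vector databases replace this O(n) scan
    with an approximate index (e.g. HNSW) at scale.
    """
    ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                    reverse=True)
    return ranked[:k]
```

For collections up to a few tens of thousands of vectors, this brute-force approach is often fast enough and removes a dependency entirely.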
Vector Database Options
pgvector extends PostgreSQL with vector similarity search. If you already use PostgreSQL for your application data, pgvector eliminates the need for a separate vector database. It handles millions of vectors with HNSW indexing and is operationally simple since it is just a PostgreSQL extension.
ChromaDB is a lightweight, open-source vector database designed for AI applications. It is easy to set up and works well for small to medium collections. For production at scale, pgvector or Milvus are more robust choices.
Milvus is a purpose-built vector database designed for billion-scale similarity search. Use it when your vector collection exceeds what pgvector handles comfortably (roughly 10 million or more vectors with high query concurrency).
Infrastructure Architecture
A production self-hosted agentic AI stack looks like this:

```
User Request
      |
      v
API Gateway / Load Balancer
      |
      v
Agent Application (Python/Node.js)
      |
      +---> vLLM (Large Model, 2x A100)   -- primary reasoning
      +---> vLLM (Small Model, 1x A10G)   -- classification/routing
      +---> Embedding Service (CPU)       -- RAG queries
      +---> PostgreSQL + pgvector         -- data + vectors
      +---> Redis                         -- caching, sessions
      +---> External Tool APIs            -- business integrations
```
GPU Infrastructure Options
| Option | Cost (70B model) | Flexibility | Operational Overhead |
|---|---|---|---|
| Cloud GPU instances (AWS/GCP) | $3-6/hr per A100 | High (scale up/down) | Medium |
| Bare metal GPU rental (Lambda, CoreWeave) | $1.50-3/hr per A100 | Medium | Low-Medium |
| On-premises GPU servers | $15K-30K per server (capex) | Low (fixed capacity) | High |
For most teams, cloud GPU instances provide the best balance of cost, flexibility, and operational simplicity. Reserve instances reduce cloud GPU costs by 30 to 50 percent if you can commit to steady-state usage.
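The reserved-instance savings are easy to make concrete. The sketch below uses illustrative rates from the table above (an assumed $4/hr A100 and a 40 percent reserved discount); plug in your provider's actual pricing.

```python
def monthly_gpu_cost(hourly_rate: float, gpus: int,
                     reserved_discount: float = 0.0) -> float:
    """Monthly cost of an always-on GPU deployment.

    Illustrative only: rates and the 30-50% reserved discount are the
    ranges cited in the table above, not quotes from any provider.
    """
    hours_per_month = 730  # average hours in a month
    return hourly_rate * gpus * hours_per_month * (1 - reserved_discount)

on_demand = monthly_gpu_cost(4.0, 2)                          # 2x A100, $5,840/mo
reserved = monthly_gpu_cost(4.0, 2, reserved_discount=0.4)    # $3,504/mo
```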
When to Self-Host Versus When to Use APIs
Self-hosting makes sense when LLM API costs exceed 30 to 40 percent of revenue, when data residency requirements prohibit sending data to third-party APIs, when you need deterministic latency that shared API services cannot guarantee, or when you want to fine-tune models on your domain data.
Continue using hosted APIs when you are pre-product-market-fit and need to iterate quickly, when your conversation volume is under roughly 50,000 per month and the operational overhead of self-hosting outweighs the cost savings, when you lack the engineering capacity to operate GPU infrastructure, or when you need access to frontier models that are not available as open weights.
A hybrid approach works well for many teams. Use hosted APIs like GPT-4o for complex reasoning tasks where frontier model quality matters, and self-host an open-source model for high-volume, simpler tasks like classification, extraction, and routine responses. This captures most of the cost savings while maintaining access to the best models for the hardest tasks.
Frequently Asked Questions
How does the quality of Llama 3.1 70B compare to GPT-4o for agentic tasks?
For structured tool calling and straightforward multi-step reasoning, Llama 3.1 70B performs within 10 to 15 percent of GPT-4o on most benchmarks. Where the gap widens is on complex, ambiguous tasks that require nuanced judgment, multi-constraint reasoning, or sophisticated language understanding. For many production agentic AI workloads — scheduling, lead qualification, ticket triage — Llama 3.1 70B is fully capable and the cost savings make it the better choice.
What hardware do I need to run Llama 3.1 70B in production?
At full precision (BF16), you need approximately 140GB of GPU memory — two A100 80GB GPUs with tensor parallelism. With 4-bit quantization (AWQ or GPTQ), the memory requirement drops to approximately 35 to 40GB, fitting on a single A100 80GB or two A100 40GB GPUs. Quantization reduces quality slightly but the impact on agentic task performance is minimal for most workloads. For the 8B model, a single A10G (24GB) or even a consumer RTX 4090 is sufficient.
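The memory figures above follow from simple arithmetic: parameters times bytes per parameter, plus runtime overhead. The sketch below makes that explicit; the ~20 percent overhead factor is an assumption standing in for KV cache and activation memory, which vary widely with context length and batch size.

```python
def model_memory_gb(params_billion: float, bits_per_param: float,
                    overhead: float = 1.2) -> float:
    """Approximate GPU memory needed to serve a model.

    The overhead factor (assumed ~20%) stands in for KV cache and
    activation memory; real overhead depends heavily on context length
    and batch size.
    """
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

bf16 = model_memory_gb(70, 16)  # ~140 GB weights, ~168 GB with overhead
awq4 = model_memory_gb(70, 4)   # ~35 GB weights, ~42 GB with overhead
```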
Can I fine-tune open-source models for better tool calling performance?
Yes, and this is one of the strongest arguments for self-hosting. Collect examples of correct tool calls from your production conversations, format them as training data, and fine-tune using LoRA or QLoRA. Even a small fine-tuning dataset of 500 to 1,000 high-quality examples can significantly improve tool selection accuracy and reduce hallucinated tool parameters for your specific use case. The fine-tuned model runs on the same infrastructure as the base model with negligible additional cost.
How do I handle model updates and version management?
Treat model versions like application deployments. Tag each model version (base model plus any fine-tuning checkpoints), test new versions against your evaluation suite before deploying to production, maintain the ability to roll back to the previous version within minutes, and run A/B tests between model versions by routing a percentage of traffic to the new version and comparing metrics. Use separate vLLM instances for the current and candidate model versions during rollouts.
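The percentage-based routing can be done deterministically by hashing the conversation ID, so every turn of a conversation stays on the same model version. A minimal sketch, with hypothetical endpoint names for the current and candidate vLLM instances:

```python
import hashlib

def route_model_version(conversation_id: str,
                        candidate_fraction: float = 0.1) -> str:
    """Deterministically send a fraction of traffic to the candidate model.

    Hashing the conversation ID keeps a whole conversation pinned to one
    model version. Endpoint names are hypothetical placeholders.
    """
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = digest[0] / 255.0  # uniform-ish value in [0, 1]
    if bucket < candidate_fraction:
        return "http://vllm-candidate:8000/v1"
    return "http://vllm-current:8000/v1"
```

Because the assignment is a pure function of the ID, you can recompute after the fact which version served any logged conversation when comparing metrics.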
What is the total cost of ownership for a self-hosted stack versus API usage?
For a deployment handling 100,000 conversations per month, approximate costs are as follows. With hosted APIs at roughly $0.05 per conversation, monthly cost is around $5,000. With self-hosted on two reserved A100 instances, the monthly cost is around $2,500 to $3,500 including compute, storage, and operational overhead. The self-hosted option saves 30 to 50 percent on inference costs but adds 10 to 20 hours per month of engineering time for infrastructure management. The break-even point where self-hosting becomes cost-effective is typically between 50,000 and 100,000 conversations per month.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.