
Agentic AI with Open Source: Building a Self-Hosted LLM Agent Stack

Build agentic AI with open-source models like Llama 3 and Mistral using vLLM, Ollama, LangGraph, and pgvector. Full stack comparison guide.

Why Self-Host Your LLM Agent Stack

Cloud-hosted LLM APIs from OpenAI, Anthropic, and Google are the fastest way to build agentic AI. But as your product scales and matures, several forces push teams toward self-hosted open-source alternatives.

Cost control. At high conversation volumes, API costs become a significant line item. A self-hosted Llama 3.1 70B model on dedicated GPU hardware can reduce per-token costs by 60 to 80 percent compared to GPT-4o pricing, with the tradeoff of higher fixed infrastructure costs.

Data sovereignty. Regulated industries and certain geographies require that data never leaves specific network boundaries. Self-hosted models let you run inference entirely within your own infrastructure, satisfying the strictest data residency requirements.

Latency control. When you own the inference infrastructure, you control the hardware, queue depth, and batching configuration. This eliminates the variable latency that comes from shared multi-tenant API services and lets you optimize specifically for your workload profile.

Customization. Self-hosted models can be fine-tuned on your domain data, quantized to fit specific hardware constraints, and served with custom sampling parameters that hosted APIs may not expose.

The tradeoff is operational complexity. Running inference infrastructure requires GPU management, model serving expertise, monitoring, and on-call support. This guide covers the full stack — from model selection through inference serving to agent frameworks — so you can make informed decisions about what to self-host and when.

Open Source Model Selection

The open-source model landscape has matured dramatically. As of early 2026, several model families are production-viable for agentic AI workloads.

Model Comparison Table

| Model | Parameters | License | Tool Calling | Context Window | Best For |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 70B | 70B | Meta Community | Native support | 128K | Primary agent reasoning, strong general capability |
| Llama 3.1 8B | 8B | Meta Community | Native support | 128K | Intent classification, simple tool routing, low-latency tasks |
| Mistral Large 2 | 123B | Mistral Research License | Native support | 128K | Complex reasoning, European data residency option |
| Mistral Nemo 12B | 12B | Apache 2.0 | Native support | 128K | Mid-range tasks, good efficiency-to-quality ratio |
| Qwen 2.5 72B | 72B | Qwen License | Native support | 128K | Multilingual agents, strong Asian language support |
| Qwen 2.5 7B | 7B | Apache 2.0 | Native support | 128K | Low-resource deployments, edge inference |
| DeepSeek V3 | 671B (MoE) | MIT | Native support | 128K | Maximum capability, requires significant GPU resources |
| Phi-3 Medium 14B | 14B | MIT | Limited | 128K | Cost-efficient medium tasks, small GPU footprint |

Choosing the Right Model for Your Workload

For a typical agentic AI deployment, you need models at multiple capability tiers. Use a large model (70B or above) for the primary agent reasoning loop where the model selects tools, processes results, and generates responses. Use a small model (7B to 14B) for high-volume, low-complexity tasks like intent classification, conversation routing, and simple extraction.

Llama 3.1 70B is the default recommendation for primary agent reasoning. It has native function calling support, performs well on tool selection benchmarks, handles complex multi-turn conversations reliably, and has a permissive license for commercial use. For the small model tier, Llama 3.1 8B or Qwen 2.5 7B both perform well for classification and routing tasks at a fraction of the compute cost.

If your agents serve multilingual users, Qwen 2.5 is stronger than Llama for Chinese, Japanese, Korean, and other Asian languages. For European language support beyond English, Mistral models have an edge.

Inference Servers

The inference server sits between your application code and the model weights. It handles loading models onto GPUs, processing inference requests, batching for throughput, and exposing an API that your agent framework calls.

vLLM

vLLM is the leading open-source inference server for production deployments. It implements PagedAttention for efficient GPU memory management, continuous batching for high throughput, and an OpenAI-compatible API that makes it a drop-in replacement for hosted APIs.

# Start vLLM serving Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --port 8000

When to use vLLM: Production deployments where throughput and latency matter. Multi-GPU setups with tensor parallelism. When you need OpenAI-compatible API endpoints for easy migration from hosted APIs.

Hardware requirements for Llama 3.1 70B: Two NVIDIA A100 80GB GPUs or four A100 40GB GPUs with tensor parallelism. With AWQ quantization (4-bit), you can fit the model on a single A100 80GB.
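For the single-GPU quantized option, vLLM can serve a pre-quantized AWQ checkpoint directly. A minimal sketch — the repository name is illustrative, so substitute the AWQ checkpoint you actually use:

```shell
# Serve a 4-bit AWQ checkpoint of Llama 3.1 70B on a single 80GB GPU
# (repository name is illustrative -- use your own AWQ checkpoint)
python -m vllm.entrypoints.openai.api_server \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --quantization awq \
    --max-model-len 16384 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --port 8000
```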

Ollama

Ollama is designed for simplicity. It handles model downloading, quantization, and serving with minimal configuration. It is excellent for development, testing, and small-scale deployments.

# Pull and run a model with Ollama
ollama pull llama3.1:70b
ollama serve

# API call
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}'

When to use Ollama: Local development and testing. Small-scale deployments where operational simplicity matters more than maximum throughput. Prototyping with different models before committing to a production deployment.

Limitations: Lower throughput than vLLM for high-concurrency workloads. Limited batching optimization. Fewer configuration options for production tuning.


Text Generation Inference (TGI)

Hugging Face's TGI offers a middle ground between Ollama's simplicity and vLLM's performance. It supports continuous batching, quantization, and speculative decoding.

# Run TGI with Docker
docker run --gpus all \
    -p 8080:80 \
    -v /data/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
    --num-shard 2 \
    --max-input-tokens 8192 \
    --max-total-tokens 16384

When to use TGI: When you want Docker-native deployment with good performance. When you are already in the Hugging Face ecosystem. When speculative decoding is important for your latency requirements.

Performance Comparison

| Server | Throughput (tokens/sec, 70B) | P50 Latency (first token) | Ease of Setup | Production Readiness |
| --- | --- | --- | --- | --- |
| vLLM | 800-1200 | 150-300 ms | Medium | High |
| Ollama | 200-400 | 300-800 ms | Very Easy | Medium |
| TGI | 600-900 | 200-400 ms | Easy (Docker) | High |

These numbers are approximate and vary significantly based on hardware, model quantization, batch size, and input/output length.

Agent Frameworks for Self-Hosted Models

LangGraph

LangGraph is the most mature framework for building stateful, multi-step agents. It models agent workflows as directed graphs where nodes are processing steps and edges are transitions. It works with any LLM backend that exposes a chat completion API, making it compatible with vLLM and TGI through their OpenAI-compatible endpoints.

from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI

# Point to your self-hosted vLLM instance via its OpenAI-compatible API
llm = ChatOpenAI(
    base_url="http://vllm-server:8000/v1",
    api_key="not-needed",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct"
)

class AgentState(TypedDict):
    messages: list  # shared conversation state passed between nodes

# Define agent graph; reason_node, tool_node, response_node, and
# should_respond are application-specific functions
graph = StateGraph(AgentState)
graph.add_node("reason", reason_node)
graph.add_node("execute_tool", tool_node)
graph.add_node("respond", response_node)
graph.add_edge(START, "reason")
graph.add_conditional_edges("reason", should_respond, {
    True: "respond",
    False: "execute_tool"
})
graph.add_edge("execute_tool", "reason")
graph.add_edge("respond", END)
app = graph.compile()

Strengths: Flexible graph-based workflow design. Good support for complex multi-step agents. Built-in state management and checkpointing. Active development and community.

AutoGen

Microsoft's AutoGen framework is designed for multi-agent conversations where multiple specialized agents collaborate to accomplish tasks. It works well with self-hosted models and supports both two-agent and group-chat patterns.

Strengths: Purpose-built for multi-agent collaboration. Good patterns for agent-to-agent communication. Useful when your architecture involves distinct specialized agents that need to coordinate.

Custom Lightweight Orchestrator

For many agentic AI products, a custom orchestrator built with 50 to 100 lines of Python provides more control and less overhead than a full framework. The core loop is simple: send messages to the LLM, check if the response includes tool calls, execute tools and append results, repeat until the LLM generates a final response.

import json

import httpx

VLLM_URL = "http://vllm-server:8000/v1/chat/completions"
MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct"

async def agent_loop(messages: list, tools: list) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        while True:
            response = await client.post(
                VLLM_URL,
                json={
                    "model": MODEL,
                    "messages": messages,
                    "tools": tools,
                    "tool_choice": "auto"
                }
            )
            result = response.json()["choices"][0]["message"]
            messages.append(result)

            # No tool calls means the model has produced its final answer
            if not result.get("tool_calls"):
                return result["content"]

            for tool_call in result["tool_calls"]:
                # Arguments arrive as a JSON string; decode before dispatch.
                # execute_tool is your application-specific dispatcher.
                output = await execute_tool(
                    tool_call["function"]["name"],
                    json.loads(tool_call["function"]["arguments"])
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": str(output)
                })

When to use a custom orchestrator: When your agent flow is straightforward (single agent, linear tool use). When you want full control over every aspect of the pipeline. When framework abstractions add complexity without corresponding value.

Embedding Models and Vector Databases

If your agents need retrieval-augmented generation (RAG) to access knowledge bases, product catalogs, or documentation, you need an embedding model and a vector database.

Open Source Embedding Models

| Model | Dimensions | Performance (MTEB) | Self-Host Ease |
| --- | --- | --- | --- |
| BGE-Large-en-v1.5 | 1024 | Strong | Easy (sentence-transformers) |
| E5-Mistral-7B | 4096 | Very Strong | Medium (needs GPU) |
| Nomic-Embed-Text | 768 | Good | Easy (Ollama supported) |
| GTE-Qwen2-7B | 3584 | Very Strong | Medium (needs GPU) |

For most agentic AI use cases, BGE-Large or Nomic-Embed-Text provide excellent quality without requiring GPU resources for the embedding model itself.
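
Whichever embedding model you choose, retrieval at query time reduces to nearest-neighbor search over vectors. A minimal sketch of cosine-similarity ranking — the three-dimensional vectors are toy values, not real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query vector
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings" for illustration only
docs = {"refund-policy": [0.9, 0.1, 0.0],
        "shipping-times": [0.1, 0.9, 0.1],
        "warranty": [0.8, 0.2, 0.1]}
print(top_k([1.0, 0.0, 0.0], docs))  # most similar documents first
```

In production the vector database performs this ranking with an approximate index rather than a brute-force scan, but the scoring function is the same.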

Vector Database Options

pgvector extends PostgreSQL with vector similarity search. If you already use PostgreSQL for your application data, pgvector eliminates the need for a separate vector database. It handles millions of vectors with HNSW indexing and is operationally simple since it is just a PostgreSQL extension.
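
A minimal sketch of the pgvector setup, assuming pgvector 0.5+ (for HNSW) and 1024-dimensional embeddings to match BGE-Large; the table and column names are illustrative:

```sql
-- Enable the extension and store embeddings alongside content
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1024)
);

-- HNSW index for approximate nearest-neighbor search (cosine distance)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Top-5 most similar documents to a query embedding
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
```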

ChromaDB is a lightweight, open-source vector database designed for AI applications. It is easy to set up and works well for small to medium collections. For production at scale, pgvector or Milvus are more robust choices.

Milvus is a purpose-built vector database designed for billion-scale similarity search. Use it when your vector collection exceeds what pgvector handles comfortably (roughly 10 million or more vectors with high query concurrency).

Infrastructure Architecture

A production self-hosted agentic AI stack looks like this.

User Request
    |
    v
API Gateway / Load Balancer
    |
    v
Agent Application (Python/Node.js)
    |
    +---> vLLM (Large Model, 2x A100)  -- primary reasoning
    +---> vLLM (Small Model, 1x A10G)  -- classification/routing
    +---> Embedding Service (CPU)       -- RAG queries
    +---> PostgreSQL + pgvector         -- data + vectors
    +---> Redis                         -- caching, sessions
    +---> External Tool APIs            -- business integrations

GPU Infrastructure Options

| Option | Cost (70B model) | Flexibility | Operational Overhead |
| --- | --- | --- | --- |
| Cloud GPU instances (AWS/GCP) | $3-6/hr per A100 | High (scale up/down) | Medium |
| Bare metal GPU rental (Lambda, CoreWeave) | $1.50-3/hr per A100 | Medium | Low-Medium |
| On-premises GPU servers | $15K-30K per server (capex) | Low (fixed capacity) | High |

For most teams, cloud GPU instances provide the best balance of cost, flexibility, and operational simplicity. Reserve instances reduce cloud GPU costs by 30 to 50 percent if you can commit to steady-state usage.

When to Self-Host Versus When to Use APIs

Self-hosting makes sense when LLM API costs exceed 30 to 40 percent of revenue, when data residency requirements prohibit sending data to third-party APIs, when you need deterministic latency that shared API services cannot guarantee, or when you want to fine-tune models on your domain data.

Continue using hosted APIs when you are pre-product-market-fit and need to iterate quickly, when your conversation volume is under roughly 50,000 per month (below which the operational overhead of self-hosting is not justified by the cost savings), when you lack the engineering capacity to operate GPU infrastructure, or when you need access to frontier models that are not available as open weights.

A hybrid approach works well for many teams. Use hosted APIs like GPT-4o for complex reasoning tasks where frontier model quality matters, and self-host an open-source model for high-volume, simpler tasks like classification, extraction, and routine responses. This captures most of the cost savings while maintaining access to the best models for the hardest tasks.
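
The hybrid split can be as simple as a routing function keyed on task type. A minimal sketch — the task labels and endpoint URLs are illustrative assumptions:

```python
# High-volume, low-complexity work goes to the self-hosted small model;
# everything else goes to a hosted frontier API. Task labels are illustrative.
SIMPLE_TASKS = {"classification", "extraction", "routing"}

def pick_backend(task_type: str) -> dict:
    if task_type in SIMPLE_TASKS:
        return {"base_url": "http://vllm-server:8000/v1",
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct"}
    # Ambiguous or multi-constraint reasoning goes to the frontier model
    return {"base_url": "https://api.openai.com/v1",
            "model": "gpt-4o"}

print(pick_backend("classification")["model"])
```

Because both backends expose the same OpenAI-compatible chat API, the rest of the agent code does not need to know which model handled a given request.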

Frequently Asked Questions

How does the quality of Llama 3.1 70B compare to GPT-4o for agentic tasks?

For structured tool calling and straightforward multi-step reasoning, Llama 3.1 70B performs within 10 to 15 percent of GPT-4o on most benchmarks. Where the gap widens is on complex, ambiguous tasks that require nuanced judgment, multi-constraint reasoning, or sophisticated language understanding. For many production agentic AI workloads — scheduling, lead qualification, ticket triage — Llama 3.1 70B is fully capable and the cost savings make it the better choice.

What hardware do I need to run Llama 3.1 70B in production?

At full precision (BF16), you need approximately 140GB of GPU memory — two A100 80GB GPUs with tensor parallelism. With 4-bit quantization (AWQ or GPTQ), the memory requirement drops to approximately 35 to 40GB, fitting on a single A100 80GB or two A100 40GB GPUs. Quantization reduces quality slightly but the impact on agentic task performance is minimal for most workloads. For the 8B model, a single A10G (24GB) or even a consumer RTX 4090 is sufficient.
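
These figures follow from simple bytes-per-parameter arithmetic. A sketch — this counts weights only, so KV cache and activations need additional headroom on top:

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    # Weight memory only; KV cache and activations are extra
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

print(model_memory_gb(70, 16))  # BF16: 140 GB
print(model_memory_gb(70, 4))   # 4-bit AWQ/GPTQ: 35 GB
print(model_memory_gb(8, 16))   # 8B at BF16: 16 GB
```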

Can I fine-tune open-source models for better tool calling performance?

Yes, and this is one of the strongest arguments for self-hosting. Collect examples of correct tool calls from your production conversations, format them as training data, and fine-tune using LoRA or QLoRA. Even a small fine-tuning dataset of 500 to 1,000 high-quality examples can significantly improve tool selection accuracy and reduce hallucinated tool parameters for your specific use case. The fine-tuned model runs on the same infrastructure as the base model with negligible additional cost.
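
Formatting the collected tool calls is mostly serialization work. A minimal sketch writing one training record in the common chat-messages convention — the exact schema depends on your fine-tuning framework, and the tool name here is a hypothetical example:

```python
import json

def to_training_record(user_text: str, tool_name: str, tool_args: dict) -> str:
    # One supervised example: user turn -> assistant turn with a tool call
    record = {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "tool_calls": [{
                "type": "function",
                "function": {"name": tool_name,
                             "arguments": json.dumps(tool_args)}
            }]}
        ]
    }
    return json.dumps(record)

# Each line of the JSONL training file is one such record
line = to_training_record("Book me Tuesday at 3pm",
                          "create_appointment",
                          {"date": "2026-03-03", "time": "15:00"})
print(line)
```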

How do I handle model updates and version management?

Treat model versions like application deployments. Tag each model version (base model plus any fine-tuning checkpoints), test new versions against your evaluation suite before deploying to production, maintain the ability to roll back to the previous version within minutes, and run A/B tests between model versions by routing a percentage of traffic to the new version and comparing metrics. Use separate vLLM instances for the current and candidate model versions during rollouts.
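
For the traffic split, deterministic hash-based bucketing keeps each conversation pinned to one model version across turns. A minimal sketch, assuming a stable conversation ID:

```python
import hashlib

def pick_model_version(conversation_id: str, candidate_pct: int = 10) -> str:
    # Hash the conversation id to a stable bucket in [0, 100)
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    # The same conversation always lands in the same bucket, so a rollout
    # never switches model versions mid-conversation
    return "candidate" if bucket < candidate_pct else "current"

print(pick_model_version("conv-42"))
```

Ramping the rollout is then just raising candidate_pct; rolling back is setting it to zero.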

What is the total cost of ownership for a self-hosted stack versus API usage?

For a deployment handling 100,000 conversations per month, approximate costs are as follows. With hosted APIs at roughly $0.05 per conversation, monthly cost is around $5,000. With self-hosted on two reserved A100 instances, the monthly cost is around $2,500 to $3,500 including compute, storage, and operational overhead. The self-hosted option saves 30 to 50 percent on inference costs but adds 10 to 20 hours per month of engineering time for infrastructure management. The break-even point where self-hosting becomes cost-effective is typically between 50,000 and 100,000 conversations per month.
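
The break-even point falls out of the arithmetic above. A sketch using illustrative figures from this answer (engineering time valued at an assumed $100/hour):

```python
def monthly_cost_api(conversations: int, cost_per_conv: float = 0.05) -> float:
    # Hosted APIs scale linearly with volume
    return conversations * cost_per_conv

def monthly_cost_self_hosted(gpu_monthly: float = 3000.0,
                             eng_hours: float = 15,
                             eng_rate: float = 100.0) -> float:
    # Fixed GPU cost plus engineering time; marginal per-conversation
    # cost is negligible once the hardware is running
    return gpu_monthly + eng_hours * eng_rate

for volume in (25_000, 50_000, 100_000):
    api = monthly_cost_api(volume)
    hosted = monthly_cost_self_hosted()
    winner = "self-host cheaper" if hosted < api else "API cheaper"
    print(volume, api, hosted, winner)
```

With these inputs the crossover lands between 50,000 and 100,000 conversations per month, consistent with the estimate above; plug in your own rates to find yours.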


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
