
How to Train an AI Voice Agent on Your Business: Prompts, RAG, and Fine-Tuning

A practical guide to training an AI voice agent on your specific business — system prompts, RAG over knowledge bases, and when to fine-tune.

"Train it on my business"

Every buyer says it. "Can you train the agent on my business?" The word "train" hides three completely different techniques: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. They live at different layers, cost different amounts, and solve different problems.

This guide walks through all three for AI voice agents, with the decision tree CallSphere uses in production to decide which lever to pull for a given customer.

Need → choose technique
   │
   ├── "use our tone"           → system prompt
   ├── "know our catalog"       → RAG
   ├── "talk like our best rep" → fine-tune (rarely)
   └── "take actions"           → tool calls

Architecture overview

┌────────────────────────────────────────┐
│          Voice agent runtime           │
│                                        │
│  system_prompt  ──────┐                │
│                       ▼                │
│  user audio ──► LLM ◄── RAG context    │
│                       │                │
│                       ▼                │
│                  tool calls            │
└────────────────────────────────────────┘
              │
              ▼
     ┌────────────────────┐
     │ Vector DB (pgvector│
     │ / Pinecone)        │
     └────────────────────┘

Prerequisites

  • A corpus of business documents (FAQ, SOPs, pricing, product pages).
  • An embedding model (text-embedding-3-small is a sensible default).
  • Postgres with pgvector, or a hosted vector DB.
  • Access to the OpenAI Realtime API for the runtime.
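The ingestion and search snippets in steps 2 and 3 assume a `kb_chunks` table roughly like the following. This is a sketch, not a prescribed schema — adjust the embedding dimension to your model (1536 for text-embedding-3-small) and the index to your corpus size.

```sql
-- Assumed schema for the examples below.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE kb_chunks (
    id        bigserial PRIMARY KEY,
    doc_id    text NOT NULL,
    chunk_idx int  NOT NULL,
    text      text NOT NULL,
    embedding vector(1536) NOT NULL
);

-- Optional for larger corpora: an ANN index on cosine distance.
CREATE INDEX ON kb_chunks USING hnsw (embedding vector_cosine_ops);
```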

Step-by-step walkthrough

1. Write a tight system prompt

Voice is not chat. A system prompt that works for ChatGPT will be too long and too wordy for a voice agent. Keep it under 400 tokens and prioritize persona, boundaries, and escalation rules.

You are Jamie, the after-hours receptionist for Maple Dental.
Speak warmly and naturally. Keep replies under 2 sentences.
Never quote prices. If asked, say: "I can get an exact quote
from the scheduling team — want me to book that callback?"
Escalate to human if caller mentions pain, trauma, or bleeding.
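On the OpenAI Realtime API, a prompt like this is delivered as the `instructions` field of a `session.update` event. A minimal sketch of that payload (the event shape matches the Realtime previews at the time of writing; the voice name is an assumption — check the current docs):

```python
import json

SYSTEM_PROMPT = (
    "You are Jamie, the after-hours receptionist for Maple Dental. "
    "Speak warmly and naturally. Keep replies under 2 sentences."
)

# session.update event sent over the Realtime websocket after connecting
session_update = {
    "type": "session.update",
    "session": {
        "instructions": SYSTEM_PROMPT,
        "voice": "alloy",
        "turn_detection": {"type": "server_vad"},
    },
}

payload = json.dumps(session_update)
```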

2. Chunk and embed your knowledge base

from openai import OpenAI
import asyncpg

client = OpenAI()  # sync client for brevity; use AsyncOpenAI in production
conn: asyncpg.Connection  # acquired elsewhere, e.g. await asyncpg.connect(DSN)

async def ingest(doc_id: str, text: str):
    # chunk_by_sentence: any sentence-aware splitter with token budget + overlap
    chunks = chunk_by_sentence(text, max_tokens=300, overlap=40)
    for i, chunk in enumerate(chunks):
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=chunk
        ).data[0].embedding
        # pgvector expects a vector literal like "[0.1, 0.2, ...]",
        # so serialize the list and cast in SQL
        await conn.execute(
            "INSERT INTO kb_chunks (doc_id, chunk_idx, text, embedding) "
            "VALUES ($1, $2, $3, $4::vector)",
            doc_id, i, chunk, str(emb),
        )
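The `chunk_by_sentence` helper above is not shown in full anywhere standard; here is one way to sketch it, assuming a rough 4-characters-per-token estimate (swap in tiktoken for exact counts):

```python
import re

def chunk_by_sentence(text: str, max_tokens: int = 300, overlap: int = 40) -> list[str]:
    """Greedily pack sentences into chunks of ~max_tokens, carrying
    ~overlap tokens of tail context into the next chunk."""
    est = lambda s: max(1, len(s) // 4)  # crude token estimate
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, size = [], [], 0
    for sent in sentences:
        if current and size + est(sent) > max_tokens:
            chunks.append(" ".join(current))
            # keep roughly `overlap` tokens of tail for continuity
            tail, tail_size = [], 0
            for s in reversed(current):
                if tail_size + est(s) > overlap:
                    break
                tail.insert(0, s)
                tail_size += est(s)
            current, size = tail, tail_size
        current.append(sent)
        size += est(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```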

3. Retrieve at tool-call time, not per turn

Running RAG on every user turn is wasteful. Instead, expose a search_knowledge_base tool and let the LLM call it when it needs to.

async def search_kb(query: str, k: int = 4):
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    rows = await conn.fetch(
        "SELECT text, 1 - (embedding <=> $1::vector) AS score "
        "FROM kb_chunks ORDER BY embedding <=> $1::vector LIMIT $2",
        str(emb), k,  # serialize the list into a pgvector literal
    )
    return [{"text": r["text"], "score": float(r["score"])} for r in rows]

4. Expose the search tool to the agent

const kbTool = {
  type: "function",
  name: "search_knowledge_base",
  description: "Search the company knowledge base for a specific fact",
  parameters: {
    type: "object",
    properties: { query: { type: "string" } },
    required: ["query"],
  },
};
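When the model emits a function call for this tool, you run the search locally and send the result back as the tool output. A hedged sketch of that dispatch step (the stub `search_kb` stands in for the real pgvector search from step 3; the event-loop plumbing around it depends on your Realtime client):

```python
import asyncio, json

async def search_kb(query: str, k: int = 4):
    # stand-in for the real pgvector search from step 3
    return [{"text": f"stub result for {query!r}", "score": 1.0}]

async def handle_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch a function call emitted by the model and return the
    JSON string to send back as the tool output."""
    handlers = {"search_knowledge_base": lambda a: search_kb(a["query"])}
    handler = handlers.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps(await handler(args))

out = asyncio.run(handle_tool_call("search_knowledge_base", '{"query": "hours"}'))
```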

5. Decide whether you actually need fine-tuning

Fine-tuning is rarely worth it for voice agents. It shines only when:

  • You have a consistent, domain-specific vocabulary the base model keeps mangling.
  • You have 500+ high-quality dialogue examples.
  • The improvement will be measured in production, not just vibes.

Ninety-five percent of the time, a better system prompt + RAG beats fine-tuning on both quality and cost.
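If you do cross the fine-tuning threshold, each of those 500+ dialogue examples goes into a JSONL file, one JSON object per line in the chat fine-tuning format. A sketch of one line (the example content here is illustrative):

```python
import json

# One training example in the chat fine-tuning JSONL format.
example = {
    "messages": [
        {"role": "system", "content": "You are Jamie, the after-hours receptionist for Maple Dental."},
        {"role": "user", "content": "Do you do same-day crowns?"},
        {"role": "assistant", "content": "We do! Want me to book a callback so the team can confirm timing?"},
    ]
}

line = json.dumps(example)  # append one such line per example to train.jsonl
```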

6. Close the loop with evals

Create a regression suite of 50+ realistic caller turns. Run it on every prompt or knowledge-base change and track pass rate.


EVAL_CASES = [
    {"input": "Are you open Sunday?", "expected_contains": ["closed Sunday", "Monday"]},
    {"input": "How much is a cleaning?", "expected_not_contains": ["$"]},
]
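A runner over cases like these can be very small. This sketch scores each case by substring checks against the agent's reply (`agent_fn` is a placeholder for however you invoke the agent in test mode):

```python
def run_evals(agent_fn, cases):
    """Return the pass rate: a case passes if the reply contains every
    `expected_contains` substring and none of `expected_not_contains`."""
    passed = 0
    for case in cases:
        reply = agent_fn(case["input"])
        ok = all(s.lower() in reply.lower() for s in case.get("expected_contains", []))
        ok = ok and not any(s in reply for s in case.get("expected_not_contains", []))
        passed += ok
    return passed / len(cases)

# usage with a stub agent
rate = run_evals(
    lambda q: "We're closed Sunday; we reopen Monday at 8am.",
    [{"input": "Are you open Sunday?", "expected_contains": ["closed Sunday", "Monday"]}],
)
```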

Production considerations

  • Prompt versioning: check prompts into git, tag releases, A/B test changes.
  • RAG freshness: re-ingest on source changes; show "last updated" in admin.
  • Latency budget: embedding + vector search adds 100-250ms. Run in parallel with the first LLM thought.
  • Citation: include the chunk ID in the tool response so you can audit what the LLM saw.
  • Access control: RAG over customer data needs per-tenant isolation in the vector DB.
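One way to hide the retrieval latency is to start the knowledge-base search and a short spoken acknowledgement concurrently, rather than making the caller sit through embed + vector search in silence. A sketch of that pattern (`say` and `search` are placeholders for your TTS and retrieval calls):

```python
import asyncio

async def answer_with_parallel_rag(query: str, say, search):
    """Speak a filler line while retrieval runs in the background."""
    ack_task = asyncio.create_task(say("Let me check that for you."))
    search_task = asyncio.create_task(search(query))
    context = await search_task  # typically ~100-250ms
    await ack_task
    return context

async def _say(text):  # stand-in for TTS
    return text

async def _search(q):  # stand-in for search_kb
    return [{"text": "Open 9-5"}]

context = asyncio.run(answer_with_parallel_rag("hours?", _say, _search))
```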

CallSphere's real implementation

CallSphere uses the prompt-plus-RAG approach across almost every vertical. IT helpdesk is the clearest example: 10 tools plus a RAG layer over customer knowledge bases, all orchestrated through the OpenAI Agents SDK. Healthcare (14 tools), real estate (10 agents), salon (4 agents), after-hours escalation (7 tools), and the ElevenLabs sales pod (5 GPT-4 specialists) all keep fine-tuning off the table because the ROI never beats a better prompt plus a better knowledge base.

The runtime is the OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) at 24kHz PCM16 with server VAD. Post-call analytics from a GPT-4o-mini pipeline flag any turn where the LLM said "I don't know" so customers can close knowledge gaps quickly. CallSphere supports 57+ languages and runs under one second end-to-end on live traffic.

Common pitfalls

  • Bloated system prompts: 2000-token prompts make voice feel sluggish.
  • Running RAG on every turn: it is wasted work and latency.
  • Skipping citations: you cannot debug what you cannot trace.
  • Ingesting PDFs raw: clean out headers, footers, and page numbers first.
  • Fine-tuning when a tool would do: if the answer is "call an API", do not bake it into weights.

FAQ

How big should my chunks be?

200-400 tokens with 10-15% overlap for voice agents.

Should I use a different embedding model for search vs storage?

No — use the same model for both.

Is hybrid search (BM25 + vector) worth it?

For short voice queries, pure vector is usually enough.

How do I handle multi-language knowledge bases?

Store chunks in their original language and let the model translate at response time.

When does fine-tuning actually help?

For brand voice consistency in regulated industries with >1000 high-quality examples.

Next steps

Want to see your knowledge base powering a voice agent in a week? Book a demo, read the technology page, or see pricing.

#CallSphere #RAG #PromptEngineering #VoiceAI #KnowledgeBase #Embeddings #AIVoiceAgents

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
