Reference
AI Voice Agent Glossary
Definitions of 25 technical terms used in AI voice agent systems.
A
ASR
Automatic Speech Recognition. Converts spoken audio into text. ASR models process audio in real time and output transcripts used by the reasoning layer. Accuracy depends on language, accent, and background noise.
B
Barge-in
A caller's ability to interrupt the AI agent while it is speaking. When barge-in is detected, the agent stops its current response, processes the new input, and generates a new reply.
C
Conversation State
A structured object maintained for the duration of a call or chat session. Contains extracted entities, tool results, turn history, and context. Used by the LLM to generate contextually relevant responses.
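A minimal sketch of such a state object, assuming illustrative field names (the actual schema is platform-specific):

```python
from dataclasses import dataclass, field

# Illustrative conversation-state container; field names are assumptions,
# not an actual platform schema.
@dataclass
class ConversationState:
    session_id: str
    entities: dict = field(default_factory=dict)      # extracted entities, e.g. {"name": "Ana"}
    tool_results: list = field(default_factory=list)  # results returned by invoked tools
    turns: list = field(default_factory=list)         # turn history as (role, text) pairs

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))

state = ConversationState(session_id="call-123")
state.add_turn("caller", "I'd like to book an appointment")
state.entities["intent"] = "book_appointment"
```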
E
Endpointing
The process of detecting when a speaker has finished their utterance. Uses silence-duration thresholds (typically 600 ms) combined with linguistic cues to determine turn boundaries.
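The silence-threshold part of endpointing can be sketched as follows, assuming per-frame speech flags from a VAD and illustrative frame sizes:

```python
# Minimal endpointing sketch: declare end-of-utterance after 600 ms of
# continuous trailing silence. Frame flags would come from a VAD.
SILENCE_THRESHOLD_MS = 600
FRAME_MS = 20  # duration of one audio frame

def utterance_ended(vad_flags: list) -> bool:
    """vad_flags: True = speech detected in that frame, newest frame last."""
    silence_ms = 0
    for is_speech in reversed(vad_flags):
        if is_speech:
            break
        silence_ms += FRAME_MS
    return silence_ms >= SILENCE_THRESHOLD_MS

# 10 speech frames followed by 30 silent frames = 600 ms of silence
frames = [True] * 10 + [False] * 30
print(utterance_ended(frames))  # True
```

A production endpointer would also weigh linguistic cues (e.g. whether the transcript so far looks complete), not silence alone.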
Escalation
Transferring a conversation from the AI agent to a human operator. Triggered by policy rules (confidence threshold, turn limit, sensitive topic) or explicit caller request. Full conversation context is passed to the human.
G
Guardrail
A constraint applied to agent behavior to prevent undesirable outputs. Examples: topic deny-lists, PII redaction, tool allowlists, and confidence thresholds. Guardrails operate independently of the LLM.
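One of the listed guardrails, PII redaction, can be sketched as a post-processing step; the patterns below are deliberately simplified examples, not production-grade detectors:

```python
import re

# Sketch of a PII-redaction guardrail applied to transcripts before logging.
# Patterns are simplified for illustration.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Call me at 555-867-5309, SSN 123-45-6789"))
# Call me at [PHONE], SSN [SSN]
```

Because this runs outside the LLM, it catches sensitive data regardless of what the model generates.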
H
Hallucination
When an LLM generates factually incorrect or fabricated information. Mitigated by grounding responses in structured tools, knowledge bases (RAG), and explicit system prompt instructions to avoid speculation.
Human Handoff
The process of transferring an active conversation to a human agent. Includes passing conversation history, extracted entities, and caller sentiment. The human receives full context before joining.
I
Intent Detection
Identifying the purpose of a caller's utterance. The LLM maps spoken input to predefined intents (e.g., book appointment, check status, make payment) using the system prompt and conversation context.
L
Latency Budget
The maximum acceptable time between the end of a caller's utterance and the start of the agent's audio response. CallSphere targets under 1.5 seconds end-to-end, distributed across ASR, LLM, and TTS stages.
LLM
Large Language Model. A neural network trained on text data that generates human-like responses. CallSphere supports GPT-4o, Claude, and Gemini. The LLM handles reasoning, intent detection, and response generation.
P
PII
Personally Identifiable Information. Data that can identify an individual: names, phone numbers, email addresses, SSNs, credit card numbers. CallSphere redacts PII from transcripts and logs by default.
R
RAG
Retrieval-Augmented Generation. A technique that retrieves relevant documents from a knowledge base and includes them in the LLM's context window. Grounds responses in factual source material rather than relying solely on model training data.
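The retrieve-then-ground flow can be sketched with a toy scorer; a real system would rank documents by vector-embedding similarity, and the documents and query here are illustrative:

```python
# Minimal RAG retrieval sketch using word-overlap scoring in place of
# embeddings. Documents and query are illustrative.
DOCS = [
    "Appointments can be rescheduled up to 24 hours in advance.",
    "Payments are accepted by card or bank transfer.",
    "Our office hours are 9am to 5pm on weekdays.",
]

def retrieve(query: str, docs: list, k: int = 1) -> list:
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

context = retrieve("can I reschedule my appointment?", DOCS)[0]
# The retrieved passage is placed in the LLM's context window:
prompt = f"Answer using only this context:\n{context}\nQuestion: can I reschedule?"
```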
S
SIP
Session Initiation Protocol. A signaling protocol for establishing, maintaining, and terminating voice calls over IP networks. CallSphere connects to PSTN carriers and PBX systems via SIP trunks.
Speech Synthesis (TTS)
Text-to-Speech. Converts the agent's text response into natural-sounding audio. Configurable voice, speed, pitch, and language per agent. Providers include ElevenLabs, Google TTS, and Azure Neural TTS.
Structured Response
A response from the LLM formatted as structured data (JSON) rather than free text. Used when the agent needs to invoke tools, return specific data fields, or follow a defined output schema.
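A sketch of how the platform might handle such a response; the field names and schema are illustrative assumptions:

```python
import json

# Illustrative structured LLM output: JSON rather than free text.
raw = '{"tool": "book_appointment", "arguments": {"date": "2025-03-14", "time": "10:00"}}'

response = json.loads(raw)
if "tool" in response:
    # The platform would dispatch to the named tool with these arguments.
    print(response["tool"], response["arguments"]["date"])
```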
System Prompt
The initial instructions given to the LLM that define the agent's role, personality, available tools, guardrails, and behavioral constraints. Each agent has a unique system prompt configured during onboarding.
T
Tool Calling
The mechanism by which an LLM invokes external functions during a conversation. The LLM outputs a tool name and parameters; the platform executes the function and returns results to the LLM for response generation.
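The loop described above can be sketched as follows; `llm_output` and `check_status` are illustrative stand-ins, not real platform APIs:

```python
# Sketch of the tool-calling loop: the LLM emits a tool name and arguments,
# the platform executes the matching function and returns the result.
def check_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stubbed lookup

TOOLS = {"check_status": check_status}

# Structured output the LLM might produce (illustrative):
llm_output = {"tool": "check_status", "arguments": {"order_id": "A-42"}}

fn = TOOLS[llm_output["tool"]]
result = fn(**llm_output["arguments"])
# `result` is fed back to the LLM, which phrases the final spoken reply.
print(result)  # {'order_id': 'A-42', 'status': 'shipped'}
```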
Tool Definition
A schema that describes a tool available to the agent. Includes the tool name, a natural-language description, and a JSON schema for input parameters. The LLM uses definitions to decide when and how to invoke tools.
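An illustrative tool definition in the common JSON-schema style; the exact wire format varies by LLM provider:

```python
# Hypothetical definition for a "check_status" tool; the structure mirrors
# the JSON-schema convention used by major LLM providers.
check_status_tool = {
    "name": "check_status",
    "description": "Look up the shipping status of an order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order identifier"},
        },
        "required": ["order_id"],
    },
}
```

The natural-language `description` fields are what the LLM reads when deciding whether this tool fits the caller's request.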
Turn
A single exchange in a conversation: one caller utterance followed by one agent response. Multi-turn conversations maintain context across turns. Turn count is used for escalation policies and analytics.
Turn Detection
The combined process of voice activity detection (VAD) and endpointing that determines when a caller's turn has ended and the agent should begin responding. Prevents the agent from interrupting the caller.
V
VAD
Voice Activity Detection. An algorithm that distinguishes speech from silence and background noise in an audio stream. VAD triggers ASR processing and contributes to turn detection and endpointing decisions.
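A toy energy-based VAD sketch; production VADs use trained models, and the threshold here is illustrative:

```python
import math

# Minimal energy-based VAD: a frame counts as speech if its RMS energy
# exceeds a threshold. Threshold and samples are illustrative.
def is_speech(frame: list, threshold: float = 0.05) -> bool:
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

silence = [0.001] * 160           # near-zero samples
speech = [0.2, -0.3, 0.25] * 54   # higher-energy samples
print(is_speech(silence), is_speech(speech))  # False True
```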
W
Webhook
An HTTP callback triggered by a specific event. CallSphere sends webhooks for events like call started, call ended, escalation triggered, and appointment booked. Payloads are signed with HMAC-SHA256.
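Receivers should verify the HMAC-SHA256 signature before trusting a payload. A minimal sketch, assuming the signature arrives hex-encoded (the header name and secret format are assumptions):

```python
import hashlib
import hmac

# Verify an HMAC-SHA256 webhook signature over the raw request body.
def verify_signature(payload: bytes, signature_hex: str, secret: bytes) -> bool:
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels
    return hmac.compare_digest(expected, signature_hex)

secret = b"whsec_example"  # illustrative secret
body = b'{"event": "call.ended", "call_id": "c-123"}'
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(body, sig, secret))  # True
```

Always sign and verify the raw bytes of the body; re-serializing the JSON first can change whitespace or key order and break the signature.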
WebRTC
Web Real-Time Communication. An open set of browser standards and protocols for real-time audio and video. CallSphere uses WebRTC for browser-based voice agents, enabling sub-second audio streaming without plugins.
WebSocket
A persistent, bidirectional communication protocol layered over a single TCP connection, established via an HTTP upgrade handshake. Used for real-time text chat and streaming audio. Lower overhead than HTTP polling for continuous data exchange between client and server.