Turn-Taking in Conversational AI: Building Natural Voice Interactions
Master turn-taking mechanics for voice AI agents — including endpointing strategies, barge-in detection, silence thresholds, and overlap handling to create conversations that feel natural and responsive.
The Art of Taking Turns
Human conversations have an elegant turn-taking system that we learn as children. We know when someone has finished speaking, when to jump in, and how to handle interruptions — all without explicit signals. Voice AI agents need to replicate this behavior, and getting it wrong is one of the fastest ways to frustrate users.
The two most common complaints about voice agents are: "It cut me off before I finished" (premature endpointing) and "It takes forever to respond after I stop talking" (late endpointing). Building a good turn-taking system means finding the sweet spot between these two failure modes.
Endpointing: Detecting When the User Finishes
Endpointing is the process of determining that the user has completed their turn and is waiting for a response. The simplest approach uses a fixed silence timeout, but production systems combine multiple signals.
```python
from enum import Enum
from dataclasses import dataclass

class TurnState(Enum):
    IDLE = "idle"                  # No one is speaking
    USER_SPEAKING = "user_speaking"
    USER_PAUSING = "user_pausing"  # Brief pause, might continue
    AGENT_SPEAKING = "agent_speaking"
    AGENT_INTERRUPTED = "agent_interrupted"

@dataclass
class EndpointingConfig:
    silence_threshold_ms: int = 700    # Silence before endpoint
    pause_threshold_ms: int = 300      # Short pause (not an endpoint)
    max_turn_duration_ms: int = 30000  # Force endpoint after 30s
    min_speech_duration_ms: int = 200  # Ignore very short speech

class TurnManager:
    def __init__(self, config: EndpointingConfig | None = None):
        self.config = config or EndpointingConfig()
        self.state = TurnState.IDLE
        self.speech_start_time = None
        self.silence_start_time = None
        self.transcript_buffer = []

    def on_vad_event(self, event: str, timestamp: float) -> str | None:
        """Process VAD events and return an action if one is needed."""
        if event == "speech_start":
            if self.state == TurnState.AGENT_SPEAKING:
                return self._handle_barge_in(timestamp)
            self.state = TurnState.USER_SPEAKING
            self.speech_start_time = timestamp
            self.silence_start_time = None
            return None
        elif event == "speech_end":
            self.silence_start_time = timestamp
            self.state = TurnState.USER_PAUSING
            return None
        return None

    def check_endpoint(self, current_time: float) -> str | None:
        """Call this periodically to check for endpoint conditions."""
        if self.state != TurnState.USER_PAUSING:
            return None
        if self.silence_start_time is None:
            return None
        silence_duration = (current_time - self.silence_start_time) * 1000
        speech_duration = (
            (self.silence_start_time - self.speech_start_time) * 1000
            if self.speech_start_time else 0
        )
        # Ignore very short utterances (likely noise)
        if speech_duration < self.config.min_speech_duration_ms:
            return None
        # User has been silent long enough: endpoint
        if silence_duration >= self.config.silence_threshold_ms:
            self.state = TurnState.IDLE
            return "endpoint_detected"
        return None

    def _handle_barge_in(self, timestamp: float) -> str:
        self.state = TurnState.AGENT_INTERRUPTED
        self.speech_start_time = timestamp
        return "barge_in_detected"
```
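To see the endpointing logic in action, here is a small simulated timeline. The `TurnManager` is restated in condensed form (endpoint checks only, with the default 700ms threshold) so the sketch runs on its own; the timestamps are made up for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class TurnState(Enum):
    IDLE = "idle"
    USER_SPEAKING = "user_speaking"
    USER_PAUSING = "user_pausing"

@dataclass
class EndpointingConfig:
    silence_threshold_ms: int = 700
    min_speech_duration_ms: int = 200

class TurnManager:
    def __init__(self, config=None):
        self.config = config or EndpointingConfig()
        self.state = TurnState.IDLE
        self.speech_start_time = None
        self.silence_start_time = None

    def on_vad_event(self, event, timestamp):
        if event == "speech_start":
            self.state = TurnState.USER_SPEAKING
            self.speech_start_time = timestamp
            self.silence_start_time = None
        elif event == "speech_end":
            self.state = TurnState.USER_PAUSING
            self.silence_start_time = timestamp

    def check_endpoint(self, now):
        if self.state != TurnState.USER_PAUSING or self.silence_start_time is None:
            return None
        silence_ms = (now - self.silence_start_time) * 1000
        speech_ms = (self.silence_start_time - self.speech_start_time) * 1000
        if speech_ms < self.config.min_speech_duration_ms:
            return None
        if silence_ms >= self.config.silence_threshold_ms:
            self.state = TurnState.IDLE
            return "endpoint_detected"
        return None

# Simulated timeline: user speaks from 0.0s to 2.0s, then goes silent.
tm = TurnManager()
tm.on_vad_event("speech_start", 0.0)
tm.on_vad_event("speech_end", 2.0)
print(tm.check_endpoint(2.3))  # 300ms of silence, below threshold: None
print(tm.check_endpoint(2.8))  # 800ms of silence: "endpoint_detected"
```

The periodic `check_endpoint` poll is what actually fires the endpoint; the VAD events only update state.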
Adaptive Endpointing
Fixed silence thresholds work poorly because natural pause lengths vary by context. A user listing items pauses briefly between items. A user thinking about a complex question pauses longer. Adaptive endpointing adjusts the threshold based on conversational context.
```python
class AdaptiveEndpointer:
    def __init__(self):
        self.base_silence_ms = 700
        self.recent_pauses = []  # Track recent pause durations
        self.max_history = 10

    def get_threshold(self, context: dict) -> int:
        """Compute a dynamic silence threshold based on context."""
        threshold = self.base_silence_ms
        # If the user asked a question, they expect a quick response
        if context.get("last_utterance_is_question"):
            threshold = min(threshold, 500)
        # If the user is mid-list ("first... second..."), use a longer threshold
        if context.get("is_enumeration"):
            threshold = max(threshold, 1200)
        # Adapt to the user's natural pace
        if self.recent_pauses:
            avg_pause = sum(self.recent_pauses) / len(self.recent_pauses)
            # Set threshold to 1.5x their average pause length
            threshold = max(threshold, int(avg_pause * 1.5))
        return min(threshold, 2000)  # Cap at 2 seconds

    def record_pause(self, pause_ms: int):
        """Record a mid-speech pause for adaptation."""
        self.recent_pauses.append(pause_ms)
        if len(self.recent_pauses) > self.max_history:
            self.recent_pauses.pop(0)
```
Barge-In Detection and Handling
Barge-in occurs when the user starts speaking while the agent is still talking. This is a natural conversational behavior — users interrupt to correct misunderstandings, provide quick answers, or redirect the conversation. Handling it well is essential.
```python
class BargeInHandler:
    def __init__(self, min_duration_ms: int = 200):
        self.min_duration_ms = min_duration_ms
        self.barge_in_start = None

    async def handle(
        self,
        timestamp: float,
        tts_player,
        stt_processor,
    ) -> bool:
        """Handle a potential barge-in. Returns True if confirmed."""
        if self.barge_in_start is None:
            self.barge_in_start = timestamp
            return False
        duration = (timestamp - self.barge_in_start) * 1000
        if duration >= self.min_duration_ms:
            # Confirmed barge-in: stop agent speech
            await tts_player.stop()
            # Reset STT to capture the interruption clearly
            await stt_processor.reset()
            self.barge_in_start = None
            return True
        return False

    def cancel(self):
        """Cancel if speech was too short (likely noise)."""
        self.barge_in_start = None
```
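Here is how the two-call confirmation pattern plays out, with stub TTS/STT objects standing in for real components (the class is restated so the sketch runs on its own; the stub names are hypothetical):

```python
import asyncio

class BargeInHandler:
    def __init__(self, min_duration_ms: int = 200):
        self.min_duration_ms = min_duration_ms
        self.barge_in_start = None

    async def handle(self, timestamp, tts_player, stt_processor) -> bool:
        if self.barge_in_start is None:
            self.barge_in_start = timestamp
            return False
        duration = (timestamp - self.barge_in_start) * 1000
        if duration >= self.min_duration_ms:
            await tts_player.stop()
            await stt_processor.reset()
            self.barge_in_start = None
            return True
        return False

class StubPlayer:
    def __init__(self):
        self.stopped = False
    async def stop(self):
        self.stopped = True

class StubSTT:
    async def reset(self):
        pass

async def main():
    handler = BargeInHandler()
    player, stt = StubPlayer(), StubSTT()
    # First speech_start while the agent talks: arms the timer, not yet confirmed
    print(await handler.handle(10.0, player, stt))   # False
    # 250ms later the user is still talking: confirmed, TTS stopped
    print(await handler.handle(10.25, player, stt))  # True
    print(player.stopped)                            # True

asyncio.run(main())
```

The first call only records a start time; confirmation requires a second observation past `min_duration_ms`, which is what filters out coughs and clicks.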
Client-Side Turn Indicators
Visual feedback helps users understand the agent's state. Show clear indicators for listening, thinking, and speaking states.
```javascript
class TurnIndicator {
  constructor(containerEl) {
    this.container = containerEl;
    this.states = {
      idle: { label: 'Tap to speak', color: '#6b7280' },
      listening: { label: 'Listening...', color: '#22c55e' },
      thinking: { label: 'Thinking...', color: '#eab308' },
      speaking: { label: 'Speaking...', color: '#3b82f6' },
    };
  }

  setState(state) {
    const config = this.states[state];
    this.container.querySelector('.status-label').textContent = config.label;
    const dot = this.container.querySelector('.status-dot');
    dot.style.backgroundColor = config.color;
    // Pulse animation for active states
    if (state === 'listening' || state === 'thinking') {
      dot.classList.add('animate-pulse');
    } else {
      dot.classList.remove('animate-pulse');
    }
  }
}
```
Production Turn-Taking Pipeline
Putting all the pieces together, here is how a production turn-taking system flows:
- IDLE: VAD detects speech start, transition to LISTENING
- LISTENING: STT processes audio, VAD monitors for silence
- PAUSING: Silence detected, adaptive endpointer starts countdown
- ENDPOINT: Silence exceeds threshold, send final transcript to LLM
- THINKING: LLM generates response
- SPEAKING: TTS plays response, VAD monitors for barge-in
- If barge-in detected during SPEAKING, cancel TTS and return to LISTENING
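The flow above can be sketched as a transition table; the phase and event names are illustrative stand-ins for the components described earlier:

```python
from enum import Enum, auto

class Phase(Enum):
    IDLE = auto()
    LISTENING = auto()
    PAUSING = auto()
    THINKING = auto()
    SPEAKING = auto()

# (current phase, event) -> next phase
TRANSITIONS = {
    (Phase.IDLE, "speech_start"): Phase.LISTENING,
    (Phase.LISTENING, "speech_end"): Phase.PAUSING,
    (Phase.PAUSING, "speech_start"): Phase.LISTENING,    # user resumed talking
    (Phase.PAUSING, "endpoint"): Phase.THINKING,         # send transcript to LLM
    (Phase.THINKING, "response_ready"): Phase.SPEAKING,  # TTS begins
    (Phase.SPEAKING, "barge_in"): Phase.LISTENING,       # cancel TTS, listen again
    (Phase.SPEAKING, "tts_done"): Phase.IDLE,
}

def step(phase: Phase, event: str) -> Phase:
    # Unknown (phase, event) pairs leave the phase unchanged
    return TRANSITIONS.get((phase, event), phase)

# Walk a barge-in scenario through the table
phase = Phase.IDLE
for event in ["speech_start", "speech_end", "endpoint",
              "response_ready", "barge_in"]:
    phase = step(phase, event)
print(phase)  # Phase.LISTENING: the barge-in returned us to listening
```

Keeping the transitions in a table rather than scattered `if` statements makes it easy to audit which interruptions are legal from each phase.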
FAQ
What silence threshold should I start with for a customer service voice agent?
Start with 700ms. This works well for most conversational scenarios. If users complain about being cut off, increase to 900ms. If they complain about slow responses, decrease to 500ms. Then implement adaptive endpointing to handle both fast and slow speakers automatically. The key insight is that there is no single correct threshold — it depends on your user population and use case.
How do I prevent false barge-in from the agent's own audio?
This is the echo problem. The agent's speech plays through the user's speakers and can be picked up by their microphone, triggering false barge-in detection. WebRTC's built-in Acoustic Echo Cancellation (AEC) handles this well. If you are not using WebRTC, you need to implement echo suppression by feeding the agent's output audio as a reference signal to your audio processing pipeline. Additionally, you can mute VAD detection during the first 100-200ms after TTS playback begins, since echo appears almost immediately.
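One way to implement that mute window is a small guard in front of the barge-in logic; this is a minimal sketch, and the 150ms default and `EchoGuard` class shape are assumptions, not a standard API:

```python
class EchoGuard:
    """Ignore VAD speech events for a short window after TTS playback
    starts, when residual echo is most likely to cause false barge-in."""

    def __init__(self, mute_window_ms: int = 150):
        self.mute_window_ms = mute_window_ms
        self.tts_start_time = None

    def on_tts_start(self, timestamp: float):
        self.tts_start_time = timestamp

    def should_ignore(self, timestamp: float) -> bool:
        if self.tts_start_time is None:
            return False
        elapsed_ms = (timestamp - self.tts_start_time) * 1000
        return elapsed_ms < self.mute_window_ms

guard = EchoGuard()
guard.on_tts_start(5.0)
print(guard.should_ignore(5.05))  # True: 50ms into playback, likely echo
print(guard.should_ignore(5.30))  # False: past the mute window
```

This complements AEC rather than replacing it; the guard only suppresses the brief burst of echo before cancellation converges.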
Should the agent acknowledge interruptions explicitly?
Yes, acknowledging interruptions builds trust. When a user barges in, the agent should respond with a brief acknowledgment like "Go ahead" or "Sure" before processing their new input. This mimics natural human behavior and confirms that the agent heard and responded to the interruption rather than simply malfunctioning.
#TurnTaking #ConversationalAI #Endpointing #BargeIn #VoiceUX #VoiceAI #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.