
Turn-Taking in Conversational AI: Building Natural Voice Interactions

Master turn-taking mechanics for voice AI agents — including endpointing strategies, barge-in detection, silence thresholds, and overlap handling to create conversations that feel natural and responsive.

The Art of Taking Turns

Human conversations have an elegant turn-taking system that we learn as children. We know when someone has finished speaking, when to jump in, and how to handle interruptions — all without explicit signals. Voice AI agents need to replicate this behavior, and getting it wrong is one of the fastest ways to frustrate users.

The two most common complaints about voice agents are: "It cut me off before I finished" (premature endpointing) and "It takes forever to respond after I stop talking" (late endpointing). Building a good turn-taking system means finding the sweet spot between these two failure modes.

Endpointing: Detecting When the User Finishes

Endpointing is the process of determining that the user has completed their turn and is waiting for a response. The simplest approach uses a fixed silence timeout, but production systems combine multiple signals.

from enum import Enum
from dataclasses import dataclass

class TurnState(Enum):
    IDLE = "idle"                # No one is speaking
    USER_SPEAKING = "user_speaking"
    USER_PAUSING = "user_pausing"  # Brief pause, might continue
    AGENT_SPEAKING = "agent_speaking"
    AGENT_INTERRUPTED = "agent_interrupted"

@dataclass
class EndpointingConfig:
    silence_threshold_ms: int = 700    # Silence before endpoint
    pause_threshold_ms: int = 300      # Short pause (not an endpoint)
    max_turn_duration_ms: int = 30000  # Force endpoint after 30s
    min_speech_duration_ms: int = 200  # Ignore very short speech

class TurnManager:
    def __init__(self, config: EndpointingConfig | None = None):
        self.config = config or EndpointingConfig()
        self.state = TurnState.IDLE
        self.speech_start_time = None
        self.silence_start_time = None
        self.transcript_buffer = []

    def on_vad_event(self, event: str, timestamp: float) -> str | None:
        """Process VAD events and return action if needed."""
        if event == "speech_start":
            if self.state == TurnState.AGENT_SPEAKING:
                return self._handle_barge_in(timestamp)
            self.state = TurnState.USER_SPEAKING
            self.speech_start_time = timestamp
            self.silence_start_time = None
            return None

        elif event == "speech_end":
            self.silence_start_time = timestamp
            self.state = TurnState.USER_PAUSING
            return None

        return None

    def check_endpoint(self, current_time: float) -> str | None:
        """Call this periodically to check for endpoint conditions."""
        # Enforce the maximum turn duration even while the user is speaking
        if self.state == TurnState.USER_SPEAKING and self.speech_start_time is not None:
            elapsed = (current_time - self.speech_start_time) * 1000
            if elapsed >= self.config.max_turn_duration_ms:
                self.state = TurnState.IDLE
                return "endpoint_detected"

        if self.state != TurnState.USER_PAUSING:
            return None

        if self.silence_start_time is None:
            return None

        silence_duration = (current_time - self.silence_start_time) * 1000
        speech_duration = (
            (self.silence_start_time - self.speech_start_time) * 1000
            if self.speech_start_time else 0
        )

        # Ignore very short utterances (likely noise)
        if speech_duration < self.config.min_speech_duration_ms:
            return None

        # User has been silent long enough — endpoint
        if silence_duration >= self.config.silence_threshold_ms:
            self.state = TurnState.IDLE
            return "endpoint_detected"

        return None

    def _handle_barge_in(self, timestamp: float) -> str:
        self.state = TurnState.AGENT_INTERRUPTED
        self.speech_start_time = timestamp
        return "barge_in_detected"
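
Stripped of the state bookkeeping, the core endpoint decision is a pure function of three timestamps. Here is a minimal sketch of the same check (`should_endpoint` is illustrative, not part of the class above):

```python
def should_endpoint(
    speech_start: float,
    speech_end: float,
    now: float,
    silence_threshold_ms: int = 700,
    min_speech_ms: int = 200,
) -> bool:
    """Endpoint when speech was long enough and silence has lasted long enough."""
    speech_ms = (speech_end - speech_start) * 1000
    silence_ms = (now - speech_end) * 1000
    return speech_ms >= min_speech_ms and silence_ms >= silence_threshold_ms

# User spoke for 1s; 800ms of silence triggers an endpoint, 300ms does not.
assert should_endpoint(0.0, 1.0, 1.8)
assert not should_endpoint(0.0, 1.0, 1.3)
# A 100ms blip is ignored as noise no matter how long the silence is.
assert not should_endpoint(0.0, 0.1, 5.0)
```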

Adaptive Endpointing

Fixed silence thresholds work poorly because natural pause lengths vary by context. A user listing items pauses briefly between items. A user thinking about a complex question pauses longer. Adaptive endpointing adjusts the threshold based on conversational context.

class AdaptiveEndpointer:
    def __init__(self):
        self.base_silence_ms = 700
        self.recent_pauses = []  # Track recent pause durations
        self.max_history = 10

    def get_threshold(self, context: dict) -> int:
        """Compute dynamic silence threshold based on context."""
        threshold = self.base_silence_ms

        # If the user asked a question, they expect a quick response
        if context.get("last_utterance_is_question"):
            threshold = min(threshold, 500)

        # If the user is mid-list ("first... second..."), use a longer threshold
        if context.get("is_enumeration"):
            threshold = max(threshold, 1200)

        # Adapt to user's natural pace
        if self.recent_pauses:
            avg_pause = sum(self.recent_pauses) / len(self.recent_pauses)
            # Set threshold to 1.5x their average pause length
            threshold = max(threshold, int(avg_pause * 1.5))

        return min(threshold, 2000)  # Cap at 2 seconds

    def record_pause(self, pause_ms: int):
        """Record a mid-speech pause for adaptation."""
        self.recent_pauses.append(pause_ms)
        if len(self.recent_pauses) > self.max_history:
            self.recent_pauses.pop(0)
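
To make the adaptation concrete, here is the 1.5x rule as a standalone function with worked numbers (a sketch; the constants mirror the class above):

```python
def adaptive_threshold(recent_pauses_ms: list[int],
                       base_ms: int = 700, cap_ms: int = 2000) -> int:
    """Scale the silence threshold to 1.5x the user's average mid-speech pause."""
    if not recent_pauses_ms:
        return base_ms
    avg = sum(recent_pauses_ms) / len(recent_pauses_ms)
    return min(max(base_ms, int(avg * 1.5)), cap_ms)

assert adaptive_threshold([]) == 700             # no history: base threshold
assert adaptive_threshold([300, 400]) == 700     # fast talker: base still wins
assert adaptive_threshold([800, 1000]) == 1350   # slow talker: 900 * 1.5
assert adaptive_threshold([2000, 2000]) == 2000  # capped at 2 seconds
```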

Barge-In Detection and Handling

Barge-in occurs when the user starts speaking while the agent is still talking. This is a natural conversational behavior — users interrupt to correct misunderstandings, provide quick answers, or redirect the conversation. Handling it well is essential.


class BargeInHandler:
    def __init__(self, min_duration_ms: int = 200):
        self.min_duration_ms = min_duration_ms
        self.barge_in_start = None

    async def handle(
        self,
        timestamp: float,
        tts_player,
        stt_processor,
    ) -> bool:
        """Handle potential barge-in. Returns True if confirmed."""
        if self.barge_in_start is None:
            self.barge_in_start = timestamp
            return False

        duration = (timestamp - self.barge_in_start) * 1000

        if duration >= self.min_duration_ms:
            # Confirmed barge-in: stop agent speech
            await tts_player.stop()
            # Reset STT to capture the interruption clearly
            await stt_processor.reset()
            self.barge_in_start = None
            return True

        return False

    def cancel(self):
        """Cancel if speech was too short (likely noise)."""
        self.barge_in_start = None
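
A quick asyncio driver shows the debounce in action. For self-containment this sketch inlines a minimal handler equivalent to the class above, with `StubTTS` and `StubSTT` as hypothetical stand-ins for real pipeline components:

```python
import asyncio

class StubTTS:
    def __init__(self): self.stopped = False
    async def stop(self): self.stopped = True

class StubSTT:
    def __init__(self): self.was_reset = False
    async def reset(self): self.was_reset = True

async def handle_barge_in(state: dict, t: float, tts, stt,
                          min_duration_ms: int = 200) -> bool:
    """Minimal debounce: confirm only after speech persists past the minimum."""
    if state.get("start") is None:
        state["start"] = t
        return False
    if (t - state["start"]) * 1000 >= min_duration_ms:
        await tts.stop()   # cut agent speech immediately
        await stt.reset()  # capture the interruption cleanly
        state["start"] = None
        return True
    return False

async def demo() -> bool:
    state, tts, stt = {}, StubTTS(), StubSTT()
    assert not await handle_barge_in(state, 10.000, tts, stt)    # timer starts
    confirmed = await handle_barge_in(state, 10.250, tts, stt)   # 250ms later
    return confirmed and tts.stopped and stt.was_reset

assert asyncio.run(demo())
```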

Client-Side Turn Indicators

Visual feedback helps users understand the agent's state. Show clear indicators for listening, thinking, and speaking states.

class TurnIndicator {
  constructor(containerEl) {
    this.container = containerEl;
    this.states = {
      idle: { label: 'Tap to speak', color: '#6b7280' },
      listening: { label: 'Listening...', color: '#22c55e' },
      thinking: { label: 'Thinking...', color: '#eab308' },
      speaking: { label: 'Speaking...', color: '#3b82f6' },
    };
  }

  setState(state) {
    const config = this.states[state];
    if (!config) return; // Ignore unknown states instead of throwing
    this.container.querySelector('.status-label').textContent = config.label;
    this.container.querySelector('.status-dot').style.backgroundColor = config.color;

    // Pulse animation for active states
    const dot = this.container.querySelector('.status-dot');
    if (state === 'listening' || state === 'thinking') {
      dot.classList.add('animate-pulse');
    } else {
      dot.classList.remove('animate-pulse');
    }
  }
}

Production Turn-Taking Pipeline

Putting all the pieces together, here is how a production turn-taking system flows:

  1. IDLE: VAD detects speech start, transition to LISTENING
  2. LISTENING: STT processes audio, VAD monitors for silence
  3. PAUSING: Silence detected, adaptive endpointer starts countdown
  4. ENDPOINT: Silence exceeds threshold, send final transcript to LLM
  5. THINKING: LLM generates response
  6. SPEAKING: TTS plays response, VAD monitors for barge-in
  7. If barge-in detected during SPEAKING, cancel TTS and return to LISTENING
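
The flow above can be sketched as a transition table (a simplified model; real systems add timers and error states):

```python
from enum import Enum, auto

class S(Enum):
    IDLE = auto()
    LISTENING = auto()
    PAUSING = auto()
    THINKING = auto()
    SPEAKING = auto()

TRANSITIONS = {
    (S.IDLE, "speech_start"): S.LISTENING,
    (S.LISTENING, "speech_end"): S.PAUSING,
    (S.PAUSING, "speech_start"): S.LISTENING,   # user resumed mid-pause
    (S.PAUSING, "endpoint"): S.THINKING,
    (S.THINKING, "response_ready"): S.SPEAKING,
    (S.SPEAKING, "playback_done"): S.IDLE,
    (S.SPEAKING, "barge_in"): S.LISTENING,      # cancel TTS, listen again
}

def step(state: S, event: str) -> S:
    """Unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Walk a full turn that ends with the user interrupting the agent.
state = S.IDLE
for event in ("speech_start", "speech_end", "endpoint",
              "response_ready", "barge_in"):
    state = step(state, event)
assert state is S.LISTENING
```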

FAQ

What silence threshold should I start with for a customer service voice agent?

Start with 700ms. This works well for most conversational scenarios. If users complain about being cut off, increase to 900ms. If they complain about slow responses, decrease to 500ms. Then implement adaptive endpointing to handle both fast and slow speakers automatically. The key insight is that there is no single correct threshold — it depends on your user population and use case.

How do I prevent false barge-in from the agent's own audio?

This is the echo problem. The agent's speech plays through the user's speakers and can be picked up by their microphone, triggering false barge-in detection. WebRTC's built-in Acoustic Echo Cancellation (AEC) handles this well. If you are not using WebRTC, you need to implement echo suppression by feeding the agent's output audio as a reference signal to your audio processing pipeline. Additionally, you can mute VAD detection during the first 100-200ms after TTS playback begins, since echo appears almost immediately.

Should the agent acknowledge interruptions explicitly?

Yes, acknowledging interruptions builds trust. When a user barges in, the agent should respond with a brief acknowledgment like "Go ahead" or "Sure" before processing their new input. This mimics natural human behavior and confirms that the agent heard and responded to the interruption rather than simply malfunctioning.
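
A minimal sketch of this behavior, assuming a list-like output queue (`tts_queue` is hypothetical):

```python
import random

ACKNOWLEDGMENTS = ("Go ahead.", "Sure.", "Of course.")

def acknowledge_barge_in(tts_queue: list) -> None:
    """Queue a brief acknowledgment before handling the user's new input."""
    tts_queue.append(random.choice(ACKNOWLEDGMENTS))

queue: list = []
acknowledge_barge_in(queue)
assert queue[0] in ACKNOWLEDGMENTS
```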


#TurnTaking #ConversationalAI #Endpointing #BargeIn #VoiceUX #VoiceAI #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
