Capstone: Building a Voice-Enabled Appointment Booking System from Scratch

System Architecture

A voice-enabled appointment booking system takes an inbound phone call, converts speech to text, processes the request through an AI agent, books or modifies appointments in a calendar, and speaks the response back to the caller. This capstone integrates Twilio for telephony, Deepgram for speech-to-text, OpenAI for the conversational agent, ElevenLabs for natural text-to-speech, and a PostgreSQL database for appointment storage.

The call flow is: Twilio receives the call and opens a WebSocket media stream to your backend. Your FastAPI backend receives raw audio frames, streams them to Deepgram for real-time transcription, sends the transcript to an AI agent, receives the agent response, converts it to speech via ElevenLabs, and streams the audio back through the Twilio WebSocket.

Database Schema for Appointments

# models.py
from sqlalchemy import Column, String, DateTime, Boolean, ForeignKey
from sqlalchemy.dialects.postgresql import UUID
import uuid

class Provider(Base):
    __tablename__ = "providers"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name = Column(String(200), nullable=False)
    specialty = Column(String(100))
    timezone = Column(String(50), default="America/New_York")

class TimeSlot(Base):
    __tablename__ = "time_slots"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    provider_id = Column(UUID(as_uuid=True), ForeignKey("providers.id"))
    start_time = Column(DateTime, nullable=False)
    end_time = Column(DateTime, nullable=False)
    is_available = Column(Boolean, default=True)

class Appointment(Base):
    __tablename__ = "appointments"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    slot_id = Column(UUID(as_uuid=True), ForeignKey("time_slots.id"))
    patient_name = Column(String(200), nullable=False)
    patient_phone = Column(String(20), nullable=False)
    reason = Column(String(500))
    confirmed = Column(Boolean, default=False)
    created_at = Column(DateTime, server_default="now()")

Twilio WebSocket Integration

Twilio sends a webhook when a call arrives. You respond with TwiML that opens a bidirectional media stream to your server.

# routes/twilio.py
from fastapi import APIRouter, Request
from fastapi.responses import Response

router = APIRouter()

@router.post("/incoming-call")
async def handle_incoming_call(request: Request):
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
    <Response>
        <Connect>
            <Stream url="wss://your-domain.com/media-stream" />
        </Connect>
    </Response>"""
    return Response(content=twiml, media_type="application/xml")

The WebSocket handler receives audio frames from Twilio and manages the conversation loop.

# routes/media_stream.py
from fastapi import WebSocket
import json, base64

@app.websocket("/media-stream")
async def media_stream(ws: WebSocket):
    await ws.accept()
    stream_sid = None
    deepgram_ws = await connect_deepgram()
    conversation_history = []

    async for raw in ws.iter_text():
        msg = json.loads(raw)

        if msg["event"] == "start":
            stream_sid = msg["start"]["streamSid"]

        elif msg["event"] == "media":
            audio_bytes = base64.b64decode(msg["media"]["payload"])
            await deepgram_ws.send(audio_bytes)

        elif msg["event"] == "stop":
            break

    await deepgram_ws.close()

Booking Agent with Tool Calls

The AI agent uses tools to check availability, book slots, and cancel appointments.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

Book a Demo ROI Calculator

# agents/booking_agent.py
from agents import Agent, function_tool
from datetime import datetime, timedelta

@function_tool
def check_availability(provider_name: str, date: str) -> str:
    """Check available time slots for a provider on a given date."""
    target = datetime.strptime(date, "%Y-%m-%d")
    slots = db.query(TimeSlot).join(Provider).filter(
        Provider.name.ilike(f"%{provider_name}%"),
        TimeSlot.start_time >= target,
        TimeSlot.start_time < target + timedelta(days=1),
        TimeSlot.is_available == True,
    ).order_by(TimeSlot.start_time).all()
    if not slots:
        return f"No availability for {provider_name} on {date}."
    times = [s.start_time.strftime("%I:%M %p") for s in slots]
    return f"Available times: {', '.join(times)}"

@function_tool
def book_appointment(slot_time: str, patient_name: str, reason: str) -> str:
    """Book an appointment at the specified time."""
    slot = db.query(TimeSlot).filter(
        TimeSlot.start_time == datetime.strptime(slot_time, "%Y-%m-%d %H:%M"),
        TimeSlot.is_available == True,
    ).first()
    if not slot:
        return "That time slot is no longer available."
    slot.is_available = False
    appt = Appointment(
        slot_id=slot.id, patient_name=patient_name, reason=reason, confirmed=True
    )
    db.add(appt)
    db.commit()
    return f"Appointment booked for {patient_name} at {slot_time}."

booking_agent = Agent(
    name="Booking Agent",
    instructions="""You are a friendly appointment booking assistant on a phone call.
    Always confirm the provider, date, time, and reason before booking.
    Speak naturally since the caller is listening to TTS output.
    Keep responses under 2 sentences for quick voice delivery.""",
    tools=[check_availability, book_appointment],
)

Speech-to-Text and Text-to-Speech Pipeline

Connect Deepgram for real-time STT with interim results, and ElevenLabs for low-latency TTS streaming.

# services/stt.py
import websockets, json, os

async def connect_deepgram():
    url = "wss://api.deepgram.com/v1/listen?model=nova-2&punctuate=true"
    ws = await websockets.connect(url, extra_headers={
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"
    })
    return ws

async def stream_tts(text: str) -> bytes:
    """Convert text to speech using ElevenLabs streaming API."""
    import httpx
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            json={"text": text, "model_id": "eleven_turbo_v2"},
        )
        return resp.content

Deployment and Testing

Deploy with Docker Compose using three services: the FastAPI backend, PostgreSQL, and an ngrok container for exposing your local WebSocket to Twilio during development. For production, deploy behind an nginx reverse proxy with TLS and configure Twilio to point to your domain.

Test the booking flow end-to-end by calling your Twilio number, requesting an appointment, confirming the details, and verifying the database record. Automated testing uses recorded audio fixtures played through the WebSocket handler.

FAQ

How do I handle interruptions when the caller speaks over the AI?

Implement barge-in detection by monitoring the Deepgram transcript stream while TTS audio is playing. When new speech is detected, immediately stop the TTS playback by sending a clear message on the Twilio WebSocket, then process the new utterance.

What latency should I target for a natural voice experience?

Aim for under 800ms total round-trip from end-of-speech to start-of-response-audio. Deepgram Nova-2 typically returns final transcripts within 200ms, the LLM response takes 300-400ms, and ElevenLabs streaming TTS begins output within 200ms.

How do I prevent double-booking?

Use a database-level unique constraint or a SELECT FOR UPDATE lock on the time slot row. Wrap the availability check and booking in a single database transaction so that concurrent callers cannot book the same slot.

#CapstoneProject #VoiceAI #Twilio #AppointmentBooking #STTTTS #FullStackAI #AgenticAI #LearnAI #AIEngineering

Capstone: Building a Voice-Enabled Appointment Booking System from Scratch

System Architecture

Database Schema for Appointments

Twilio WebSocket Integration

Booking Agent with Tool Calls

Speech-to-Text and Text-to-Speech Pipeline

Deployment and Testing

FAQ

How do I handle interruptions when the caller speaks over the AI?

What latency should I target for a natural voice experience?

How do I prevent double-booking?

Try CallSphere AI Voice Agents

Related Articles

WebArena and Real-World Web Agent Benchmarks: How We Measure Browser Agent Performance

Taking Screenshots and Recording Videos with Playwright for AI Analysis

Playwright Selectors Deep Dive: CSS, XPath, Text, and Role-Based Element Finding