Capstone: Building a Voice-Enabled Appointment Booking System from Scratch
Build a complete voice-powered appointment booking system using Twilio, speech-to-text, text-to-speech, calendar integration, and intelligent booking logic with a FastAPI backend.
System Architecture
A voice-enabled appointment booking system takes an inbound phone call, converts speech to text, processes the request through an AI agent, books or modifies appointments in a calendar, and speaks the response back to the caller. This capstone integrates Twilio for telephony, Deepgram for speech-to-text, OpenAI for the conversational agent, ElevenLabs for natural text-to-speech, and a PostgreSQL database for appointment storage.
The call flow is: Twilio receives the call and opens a WebSocket media stream to your backend. Your FastAPI backend receives raw audio frames, streams them to Deepgram for real-time transcription, sends the transcript to an AI agent, receives the agent response, converts it to speech via ElevenLabs, and streams the audio back through the Twilio WebSocket.
Database Schema for Appointments
# models.py
from sqlalchemy import Column, String, DateTime, Boolean, ForeignKey
from sqlalchemy.dialects.postgresql import UUID
import uuid
class Provider(Base):
__tablename__ = "providers"
id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
name = Column(String(200), nullable=False)
specialty = Column(String(100))
timezone = Column(String(50), default="America/New_York")
class TimeSlot(Base):
__tablename__ = "time_slots"
id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
provider_id = Column(UUID(as_uuid=True), ForeignKey("providers.id"))
start_time = Column(DateTime, nullable=False)
end_time = Column(DateTime, nullable=False)
is_available = Column(Boolean, default=True)
class Appointment(Base):
__tablename__ = "appointments"
id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
slot_id = Column(UUID(as_uuid=True), ForeignKey("time_slots.id"))
patient_name = Column(String(200), nullable=False)
patient_phone = Column(String(20), nullable=False)
reason = Column(String(500))
confirmed = Column(Boolean, default=False)
created_at = Column(DateTime, server_default="now()")
Twilio WebSocket Integration
Twilio sends a webhook when a call arrives. You respond with TwiML that opens a bidirectional media stream to your server.
# routes/twilio.py
from fastapi import APIRouter, Request
from fastapi.responses import Response
router = APIRouter()
@router.post("/incoming-call")
async def handle_incoming_call(request: Request):
twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Connect>
<Stream url="wss://your-domain.com/media-stream" />
</Connect>
</Response>"""
return Response(content=twiml, media_type="application/xml")
The WebSocket handler receives audio frames from Twilio and manages the conversation loop.
# routes/media_stream.py
from fastapi import WebSocket
import json, base64
@app.websocket("/media-stream")
async def media_stream(ws: WebSocket):
await ws.accept()
stream_sid = None
deepgram_ws = await connect_deepgram()
conversation_history = []
async for raw in ws.iter_text():
msg = json.loads(raw)
if msg["event"] == "start":
stream_sid = msg["start"]["streamSid"]
elif msg["event"] == "media":
audio_bytes = base64.b64decode(msg["media"]["payload"])
await deepgram_ws.send(audio_bytes)
elif msg["event"] == "stop":
break
await deepgram_ws.close()
Booking Agent with Tool Calls
The AI agent uses tools to check availability, book slots, and cancel appointments.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
# agents/booking_agent.py
from agents import Agent, function_tool
from datetime import datetime, timedelta
@function_tool
def check_availability(provider_name: str, date: str) -> str:
"""Check available time slots for a provider on a given date."""
target = datetime.strptime(date, "%Y-%m-%d")
slots = db.query(TimeSlot).join(Provider).filter(
Provider.name.ilike(f"%{provider_name}%"),
TimeSlot.start_time >= target,
TimeSlot.start_time < target + timedelta(days=1),
TimeSlot.is_available == True,
).order_by(TimeSlot.start_time).all()
if not slots:
return f"No availability for {provider_name} on {date}."
times = [s.start_time.strftime("%I:%M %p") for s in slots]
return f"Available times: {', '.join(times)}"
@function_tool
def book_appointment(slot_time: str, patient_name: str, reason: str) -> str:
"""Book an appointment at the specified time."""
slot = db.query(TimeSlot).filter(
TimeSlot.start_time == datetime.strptime(slot_time, "%Y-%m-%d %H:%M"),
TimeSlot.is_available == True,
).first()
if not slot:
return "That time slot is no longer available."
slot.is_available = False
appt = Appointment(
slot_id=slot.id, patient_name=patient_name, reason=reason, confirmed=True
)
db.add(appt)
db.commit()
return f"Appointment booked for {patient_name} at {slot_time}."
booking_agent = Agent(
name="Booking Agent",
instructions="""You are a friendly appointment booking assistant on a phone call.
Always confirm the provider, date, time, and reason before booking.
Speak naturally since the caller is listening to TTS output.
Keep responses under 2 sentences for quick voice delivery.""",
tools=[check_availability, book_appointment],
)
Speech-to-Text and Text-to-Speech Pipeline
Connect Deepgram for real-time STT with interim results, and ElevenLabs for low-latency TTS streaming.
# services/stt.py
import websockets, json, os
async def connect_deepgram():
url = "wss://api.deepgram.com/v1/listen?model=nova-2&punctuate=true"
ws = await websockets.connect(url, extra_headers={
"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"
})
return ws
async def stream_tts(text: str) -> bytes:
"""Convert text to speech using ElevenLabs streaming API."""
import httpx
async with httpx.AsyncClient() as client:
resp = await client.post(
f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
json={"text": text, "model_id": "eleven_turbo_v2"},
)
return resp.content
Deployment and Testing
Deploy with Docker Compose using three services: the FastAPI backend, PostgreSQL, and an ngrok container for exposing your local WebSocket to Twilio during development. For production, deploy behind an nginx reverse proxy with TLS and configure Twilio to point to your domain.
Test the booking flow end-to-end by calling your Twilio number, requesting an appointment, confirming the details, and verifying the database record. Automated testing uses recorded audio fixtures played through the WebSocket handler.
FAQ
How do I handle interruptions when the caller speaks over the AI?
Implement barge-in detection by monitoring the Deepgram transcript stream while TTS audio is playing. When new speech is detected, immediately stop the TTS playback by sending a clear message on the Twilio WebSocket, then process the new utterance.
What latency should I target for a natural voice experience?
Aim for under 800ms total round-trip from end-of-speech to start-of-response-audio. Deepgram Nova-2 typically returns final transcripts within 200ms, the LLM response takes 300-400ms, and ElevenLabs streaming TTS begins output within 200ms.
How do I prevent double-booking?
Use a database-level unique constraint or a SELECT FOR UPDATE lock on the time slot row. Wrap the availability check and booking in a single database transaction so that concurrent callers cannot book the same slot.
#CapstoneProject #VoiceAI #Twilio #AppointmentBooking #STTTTS #FullStackAI #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.