Conference Calling with AI: Adding AI Agents as Meeting Participants
Learn how to add AI agents to conference calls as real-time participants. Covers conference bridge setup, live transcription, automatic note-taking, action item extraction, and post-meeting summaries.
AI as a Meeting Participant
Conference calls are where decisions happen, tasks get assigned, and context gets shared — yet most of that information evaporates the moment the call ends. Adding an AI agent as a silent participant changes this entirely. The AI listens to the entire conversation, transcribes in real time, identifies action items, and produces a structured summary minutes after the call ends.
Unlike passive recording tools, an AI participant can be interactive — responding to questions like "What did we decide about the budget?" during the call itself, or flagging when a discussion is going off-agenda.
Setting Up the Conference Bridge
Use Twilio to create a conference room that both human participants and your AI agent can join:
```python
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Dial
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()
twilio_client = Client()


@app.post("/join-conference")
async def join_conference(request: Request):
    """Webhook for human participants joining the conference."""
    form = await request.form()
    conference_name = form.get("conference_name", "team-meeting")

    response = VoiceResponse()
    response.say("Joining the conference. An AI note-taker is active.")

    dial = Dial()
    dial.conference(
        conference_name,
        start_conference_on_enter=True,
        record="record-from-start",
        recording_status_callback="/recording-complete",
        status_callback="/conference-events",
        status_callback_event="start end join leave",
    )
    response.append(dial)
    return Response(content=str(response), media_type="application/xml")
```
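For reference, the TwiML this webhook returns renders roughly as follows (attribute names per Twilio's Conference noun; `team-meeting` is the default conference name from the handler above):

```xml
<Response>
  <Say>Joining the conference. An AI note-taker is active.</Say>
  <Dial>
    <Conference startConferenceOnEnter="true"
                record="record-from-start"
                recordingStatusCallback="/recording-complete"
                statusCallback="/conference-events"
                statusCallbackEvent="start end join leave">team-meeting</Conference>
  </Dial>
</Response>
```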
```python
def add_ai_agent_to_conference(conference_name: str):
    """Programmatically add the AI agent to an existing conference."""
    participant = twilio_client.conferences(conference_name).participants.create(
        from_="+15551234567",  # Your Twilio number
        to="sip:ai-agent@yourdomain.com",
        early_media=True,
        beep="false",   # Do not announce the AI joining
        record=False,   # Conference itself is already recording
        muted=True,     # AI can listen but cannot be heard
    )
    return participant.call_sid
```

The muted=True parameter puts the AI in listen-only mode: it receives the conference audio but cannot be heard by other participants, which is ideal for a note-taking agent. (Twilio's separate coaching feature is not a fit here; it requires a call_sid_to_coach and is designed for whispering to one specific call, not silent observation.)
Real-Time Transcription Pipeline
Connect the AI agent's audio stream to a live transcription service:
```python
from datetime import datetime


class LiveTranscriptionEngine:
    """Real-time transcription with speaker tracking."""

    def __init__(self, deepgram_client):
        self.deepgram = deepgram_client
        self.transcript_buffer = []
        self.current_speakers = {}

    async def start_live_stream(self, audio_stream):
        """Connect to Deepgram for live transcription."""
        options = {
            "model": "nova-2",
            "language": "en-US",
            "smart_format": True,
            "diarize": True,
            "interim_results": True,
            "punctuate": True,
        }
        # Register handlers before opening the stream (Deepgram Python SDK v3)
        connection = self.deepgram.listen.asynclive.v("1")
        connection.on("Results", self.handle_transcript_result)
        connection.on("Error", self.handle_error)
        await connection.start(options)

        # Stream audio chunks to Deepgram
        async for chunk in audio_stream:
            await connection.send(chunk)
        await connection.finish()

    def handle_transcript_result(self, result):
        """Process each transcription result."""
        if not result.is_final:
            return  # Skip interim results for notes
        for alt in result.channel.alternatives:
            if alt.transcript.strip():
                entry = {
                    "timestamp": datetime.utcnow().isoformat(),
                    "speaker": self.identify_speaker(alt),
                    "text": alt.transcript,
                    "confidence": alt.confidence,
                }
                self.transcript_buffer.append(entry)

    def handle_error(self, error):
        """Log stream errors without stopping the transcription loop."""
        print(f"Transcription error: {error}")

    def identify_speaker(self, alternative):
        """Map diarization labels to speaker identifiers."""
        if hasattr(alternative, "words") and alternative.words:
            speaker_id = alternative.words[0].speaker
            return f"Speaker {speaker_id}"
        return "Unknown"

    def get_full_transcript(self) -> list[dict]:
        return list(self.transcript_buffer)
```
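Final results from a streaming recognizer often arrive as short fragments. Before handing the transcript to an LLM, it can help to merge consecutive entries from the same speaker into single turns. A minimal sketch, where merge_turns is our own helper and not part of the engine above:

```python
def merge_turns(transcript: list[dict]) -> list[dict]:
    """Merge consecutive transcript entries from the same speaker into turns."""
    turns: list[dict] = []
    for entry in transcript:
        if turns and turns[-1]["speaker"] == entry["speaker"]:
            # Same speaker kept talking: append to the open turn
            turns[-1]["text"] += " " + entry["text"]
        else:
            turns.append({"speaker": entry["speaker"], "text": entry["text"]})
    return turns
```

Fewer, longer turns give the downstream prompts more coherent context per speaker and reduce token overhead from repeated speaker labels.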
Action Item Extraction in Real Time
Process the transcript incrementally to detect action items as they are spoken:
```python
import json

from openai import AsyncOpenAI


class ActionItemDetector:
    """Detects action items from conversation segments."""

    def __init__(self):
        self.client = AsyncOpenAI()
        self.detected_items = []
        self.processed_up_to = 0

    async def process_new_segments(self, transcript: list[dict]):
        """Analyze new transcript segments for action items."""
        new_segments = transcript[self.processed_up_to:]
        if len(new_segments) < 3:
            return  # Wait for enough context

        text_block = "\n".join(
            f"[{s['speaker']}]: {s['text']}" for s in new_segments
        )
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract action items from this meeting segment. "
                        "Return JSON: {\"action_items\": [{\"assignee\": "
                        "\"...\", \"task\": \"...\", \"deadline\": "
                        "\"...\", \"priority\": \"high/medium/low\"}]}. "
                        "Return empty array if no action items found."
                    ),
                },
                {"role": "user", "content": text_block},
            ],
            response_format={"type": "json_object"},
            temperature=0.1,
        )
        result = json.loads(response.choices[0].message.content)
        new_items = result.get("action_items", [])
        self.detected_items.extend(new_items)
        self.processed_up_to = len(transcript)

        for item in new_items:
            print(f"ACTION: {item['assignee']} -> {item['task']}")
```
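Because the detector analyzes the conversation in rolling segments, the same task can surface twice, once when it is assigned and again when it is recapped at the end of the call. A simple normalization pass can drop repeats before the final summary; dedupe_action_items below is a hypothetical helper, not part of the class above:

```python
def dedupe_action_items(items: list[dict]) -> list[dict]:
    """Drop action items that repeat the same assignee and normalized task."""
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for item in items:
        # Normalize case and whitespace so trivial variants match
        key = (
            item.get("assignee", "").strip().lower(),
            " ".join(item.get("task", "").lower().split()),
        )
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

This only catches near-verbatim repeats; semantically equivalent rewordings are better left to the deduplication step in the summarization prompt.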
Post-Meeting Summary Generation
When the conference ends, generate a comprehensive summary:
```python
import json

from openai import AsyncOpenAI


class MeetingSummarizer:
    """Generates structured meeting summaries."""

    def __init__(self):
        self.client = AsyncOpenAI()

    async def generate_summary(
        self, transcript: list[dict], action_items: list[dict]
    ) -> dict:
        formatted_transcript = "\n".join(
            f"[{t['timestamp']}] {t['speaker']}: {t['text']}"
            for t in transcript
        )
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Generate a structured meeting summary.
Return JSON with:
- title: short meeting title
- duration_minutes: estimated duration
- participants: list of identified speakers
- summary: 3-5 sentence executive summary
- key_decisions: list of decisions made
- discussion_topics: list of {topic, summary, outcome}
- action_items: cleaned/deduped list from detected items
- open_questions: unresolved topics for follow-up
- next_steps: recommended follow-up actions""",
                },
                {
                    "role": "user",
                    "content": (
                        f"TRANSCRIPT:\n{formatted_transcript}\n\n"
                        f"DETECTED ACTION ITEMS:\n"
                        f"{json.dumps(action_items, indent=2)}"
                    ),
                },
            ],
            response_format={"type": "json_object"},
            temperature=0.3,
        )
        return json.loads(response.choices[0].message.content)
```
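The prompt asks the model to estimate duration, but since every transcript entry already carries an ISO timestamp, you can compute the duration deterministically and inject it into the summary instead. A small sketch (transcript_duration_minutes is our own helper):

```python
from datetime import datetime


def transcript_duration_minutes(transcript: list[dict]) -> int:
    """Estimate meeting length from the first and last transcript timestamps."""
    if len(transcript) < 2:
        return 0
    start = datetime.fromisoformat(transcript[0]["timestamp"])
    end = datetime.fromisoformat(transcript[-1]["timestamp"])
    # At least one minute for any non-trivial transcript
    return max(1, round((end - start).total_seconds() / 60))
```

Computed values like this are generally more trustworthy than model estimates, so prefer overwriting duration_minutes in the returned summary with this figure.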
Distributing Meeting Notes
After generating the summary, distribute it to participants via email or messaging:
```python
async def distribute_meeting_notes(
    summary: dict,
    participant_emails: list[str],
    email_client,
):
    """Send formatted meeting notes to all participants."""
    action_list = "\n".join(
        f"- [{item['priority'].upper()}] {item['assignee']}: "
        f"{item['task']} (due: {item.get('deadline', 'TBD')})"
        for item in summary.get("action_items", [])
    )
    body = f"""Meeting: {summary['title']}
Duration: {summary['duration_minutes']} minutes
Participants: {', '.join(summary['participants'])}

SUMMARY
{summary['summary']}

KEY DECISIONS
{chr(10).join('- ' + d for d in summary.get('key_decisions', []))}

ACTION ITEMS
{action_list}

OPEN QUESTIONS
{chr(10).join('- ' + q for q in summary.get('open_questions', []))}
"""
    for email in participant_emails:
        await email_client.send(
            to=email,
            subject=f"Meeting Notes: {summary['title']}",
            body=body,
        )
```
FAQ
Does the AI agent add latency to the conference call?
No. When added in listen-only mode, the AI agent receives a copy of the audio stream but injects no audio back, so there is no impact on call quality or latency for human participants. Transcription and analysis happen asynchronously on your server.
How accurate is speaker diarization in conference calls?
With high-quality audio (each person on a separate phone line), diarization accuracy is typically 85-90%. It degrades when multiple people speak in the same room through a single phone. For best results, have each participant join individually rather than clustering around a speakerphone. Post-processing can improve accuracy by using voice profiles if participants have called before.
Can the AI agent actively participate in the conversation?
Yes, though it requires careful design. Join the agent unmuted so its audio reaches the conference, and implement a keyword trigger (e.g., "Hey AI, summarize what we discussed") so the agent only speaks when addressed. Use interruption detection to avoid talking over participants. Active participation is useful for real-time fact-checking or retrieving information, but keep responses brief; a verbose AI agent in a meeting is counterproductive.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.