Conference Calling with AI: Adding AI Agents as Meeting Participants
Learn how to add AI agents to conference calls as real-time participants. Covers conference bridge setup, live transcription, automatic note-taking, action item extraction, and post-meeting summaries.
AI as a Meeting Participant
Conference calls are where decisions happen, tasks get assigned, and context gets shared — yet most of that information evaporates the moment the call ends. Adding an AI agent as a silent participant changes this entirely. The AI listens to the entire conversation, transcribes in real time, identifies action items, and produces a structured summary minutes after the call ends.
Unlike passive recording tools, an AI participant can be interactive — responding to questions like "What did we decide about the budget?" during the call itself, or flagging when a discussion is going off-agenda.
Setting Up the Conference Bridge
Use Twilio to create a conference room that both human participants and your AI agent can join:
```python
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Dial
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()
twilio_client = Client()


@app.post("/join-conference")
async def join_conference(request: Request):
    """Webhook for human participants joining the conference."""
    form = await request.form()
    conference_name = form.get("conference_name", "team-meeting")

    response = VoiceResponse()
    response.say("Joining the conference. An AI note-taker is active.")

    dial = Dial()
    dial.conference(
        conference_name,
        start_conference_on_enter=True,
        record="record-from-start",
        recording_status_callback="/recording-complete",
        status_callback="/conference-events",
        status_callback_event="start end join leave",
    )
    response.append(dial)
    return Response(content=str(response), media_type="application/xml")
```
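For reference, the TwiML this webhook returns renders roughly as follows (attribute names per Twilio's Conference noun; `team-meeting` is the default conference name from the handler above):

```xml
<Response>
  <Say>Joining the conference. An AI note-taker is active.</Say>
  <Dial>
    <Conference startConferenceOnEnter="true"
                record="record-from-start"
                recordingStatusCallback="/recording-complete"
                statusCallback="/conference-events"
                statusCallbackEvent="start end join leave">team-meeting</Conference>
  </Dial>
</Response>
```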
```python
def add_ai_agent_to_conference(conference_name: str):
    """Programmatically add the AI agent to an existing conference."""
    participant = twilio_client.conferences(conference_name).participants.create(
        from_="+15551234567",  # Your Twilio number
        to="sip:ai-agent@yourdomain.com",
        early_media=True,
        beep="false",   # Do not announce the AI joining
        record=False,   # Conference itself is already recording
        muted=True,     # AI can listen but cannot be heard
    )
    return participant.call_sid
```

The muted=True parameter puts the AI in listen-only mode: it receives the conference audio but cannot be heard by other participants, which is ideal for a note-taking agent. (Twilio's separate coaching feature is not a fit here; it requires a call_sid_to_coach and is designed for whispering to one specific call, not silent observation.)
Real-Time Transcription Pipeline
Connect the AI agent's audio stream to a live transcription service:
```python
from datetime import datetime


class LiveTranscriptionEngine:
    """Real-time transcription with speaker tracking."""

    def __init__(self, deepgram_client):
        self.deepgram = deepgram_client
        self.transcript_buffer = []
        self.current_speakers = {}

    async def start_live_stream(self, audio_stream):
        """Connect to Deepgram for live transcription."""
        options = {
            "model": "nova-2",
            "language": "en-US",
            "smart_format": True,
            "diarize": True,
            "interim_results": True,
            "punctuate": True,
        }
        # Register handlers before opening the stream (Deepgram Python SDK v3)
        connection = self.deepgram.listen.asynclive.v("1")
        connection.on("Results", self.handle_transcript_result)
        connection.on("Error", self.handle_error)
        await connection.start(options)

        # Stream audio chunks to Deepgram
        async for chunk in audio_stream:
            await connection.send(chunk)
        await connection.finish()

    def handle_transcript_result(self, result):
        """Process each transcription result."""
        if not result.is_final:
            return  # Skip interim results for notes
        for alt in result.channel.alternatives:
            if alt.transcript.strip():
                entry = {
                    "timestamp": datetime.utcnow().isoformat(),
                    "speaker": self.identify_speaker(alt),
                    "text": alt.transcript,
                    "confidence": alt.confidence,
                }
                self.transcript_buffer.append(entry)

    def handle_error(self, error):
        """Log stream errors without stopping the transcription loop."""
        print(f"Transcription error: {error}")

    def identify_speaker(self, alternative):
        """Map diarization labels to speaker identifiers."""
        if hasattr(alternative, "words") and alternative.words:
            speaker_id = alternative.words[0].speaker
            return f"Speaker {speaker_id}"
        return "Unknown"

    def get_full_transcript(self) -> list[dict]:
        return list(self.transcript_buffer)
```
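Final results from a streaming recognizer often arrive as short fragments. Before handing the transcript to an LLM, it can help to merge consecutive entries from the same speaker into single turns. A minimal sketch, where merge_turns is our own helper and not part of the engine above:

```python
def merge_turns(transcript: list[dict]) -> list[dict]:
    """Merge consecutive transcript entries from the same speaker into turns."""
    turns: list[dict] = []
    for entry in transcript:
        if turns and turns[-1]["speaker"] == entry["speaker"]:
            # Same speaker kept talking: append to the open turn
            turns[-1]["text"] += " " + entry["text"]
        else:
            turns.append({"speaker": entry["speaker"], "text": entry["text"]})
    return turns
```

Fewer, longer turns give the downstream prompts more coherent context per speaker and reduce token overhead from repeated speaker labels.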
Action Item Extraction in Real Time
Process the transcript incrementally to detect action items as they are spoken:
```python
import json

from openai import AsyncOpenAI


class ActionItemDetector:
    """Detects action items from conversation segments."""

    def __init__(self):
        self.client = AsyncOpenAI()
        self.detected_items = []
        self.processed_up_to = 0

    async def process_new_segments(self, transcript: list[dict]):
        """Analyze new transcript segments for action items."""
        new_segments = transcript[self.processed_up_to:]
        if len(new_segments) < 3:
            return  # Wait for enough context

        text_block = "\n".join(
            f"[{s['speaker']}]: {s['text']}" for s in new_segments
        )
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract action items from this meeting segment. "
                        "Return JSON: {\"action_items\": [{\"assignee\": "
                        "\"...\", \"task\": \"...\", \"deadline\": "
                        "\"...\", \"priority\": \"high/medium/low\"}]}. "
                        "Return empty array if no action items found."
                    ),
                },
                {"role": "user", "content": text_block},
            ],
            response_format={"type": "json_object"},
            temperature=0.1,
        )
        result = json.loads(response.choices[0].message.content)
        new_items = result.get("action_items", [])
        self.detected_items.extend(new_items)
        self.processed_up_to = len(transcript)

        for item in new_items:
            print(f"ACTION: {item['assignee']} -> {item['task']}")
```
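Because the detector analyzes the conversation in rolling segments, the same task can surface twice, once when it is assigned and again when it is recapped at the end of the call. A simple normalization pass can drop repeats before the final summary; dedupe_action_items below is a hypothetical helper, not part of the class above:

```python
def dedupe_action_items(items: list[dict]) -> list[dict]:
    """Drop action items that repeat the same assignee and normalized task."""
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for item in items:
        # Normalize case and whitespace so trivial variants match
        key = (
            item.get("assignee", "").strip().lower(),
            " ".join(item.get("task", "").lower().split()),
        )
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

This only catches near-verbatim repeats; semantically equivalent rewordings are better left to the deduplication step in the summarization prompt.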
Post-Meeting Summary Generation
When the conference ends, generate a comprehensive summary:
```python
import json

from openai import AsyncOpenAI


class MeetingSummarizer:
    """Generates structured meeting summaries."""

    def __init__(self):
        self.client = AsyncOpenAI()

    async def generate_summary(
        self, transcript: list[dict], action_items: list[dict]
    ) -> dict:
        formatted_transcript = "\n".join(
            f"[{t['timestamp']}] {t['speaker']}: {t['text']}"
            for t in transcript
        )
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Generate a structured meeting summary.
Return JSON with:
- title: short meeting title
- duration_minutes: estimated duration
- participants: list of identified speakers
- summary: 3-5 sentence executive summary
- key_decisions: list of decisions made
- discussion_topics: list of {topic, summary, outcome}
- action_items: cleaned/deduped list from detected items
- open_questions: unresolved topics for follow-up
- next_steps: recommended follow-up actions""",
                },
                {
                    "role": "user",
                    "content": (
                        f"TRANSCRIPT:\n{formatted_transcript}\n\n"
                        f"DETECTED ACTION ITEMS:\n"
                        f"{json.dumps(action_items, indent=2)}"
                    ),
                },
            ],
            response_format={"type": "json_object"},
            temperature=0.3,
        )
        return json.loads(response.choices[0].message.content)
```
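The prompt asks the model to estimate duration, but since every transcript entry already carries an ISO timestamp, you can compute the duration deterministically and inject it into the summary instead. A small sketch (transcript_duration_minutes is our own helper):

```python
from datetime import datetime


def transcript_duration_minutes(transcript: list[dict]) -> int:
    """Estimate meeting length from the first and last transcript timestamps."""
    if len(transcript) < 2:
        return 0
    start = datetime.fromisoformat(transcript[0]["timestamp"])
    end = datetime.fromisoformat(transcript[-1]["timestamp"])
    # At least one minute for any non-trivial transcript
    return max(1, round((end - start).total_seconds() / 60))
```

Computed values like this are generally more trustworthy than model estimates, so prefer overwriting duration_minutes in the returned summary with this figure.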
Distributing Meeting Notes
After generating the summary, distribute it to participants via email or messaging:
```python
async def distribute_meeting_notes(
    summary: dict,
    participant_emails: list[str],
    email_client,
):
    """Send formatted meeting notes to all participants."""
    action_list = "\n".join(
        f"- [{item['priority'].upper()}] {item['assignee']}: "
        f"{item['task']} (due: {item.get('deadline', 'TBD')})"
        for item in summary.get("action_items", [])
    )
    body = f"""Meeting: {summary['title']}
Duration: {summary['duration_minutes']} minutes
Participants: {', '.join(summary['participants'])}

SUMMARY
{summary['summary']}

KEY DECISIONS
{chr(10).join('- ' + d for d in summary.get('key_decisions', []))}

ACTION ITEMS
{action_list}

OPEN QUESTIONS
{chr(10).join('- ' + q for q in summary.get('open_questions', []))}
"""
    for email in participant_emails:
        await email_client.send(
            to=email,
            subject=f"Meeting Notes: {summary['title']}",
            body=body,
        )
```
FAQ
Does the AI agent add latency to the conference call?
No. When added in listen-only mode, the AI agent receives a copy of the audio stream but injects no audio back, so there is no impact on call quality or latency for human participants. Transcription and analysis happen asynchronously on your server.
How accurate is speaker diarization in conference calls?
With high-quality audio (each person on a separate phone line), diarization accuracy is typically 85-90%. It degrades when multiple people speak in the same room through a single phone. For best results, have each participant join individually rather than clustering around a speakerphone. Post-processing can improve accuracy by using voice profiles if participants have called before.
Can the AI agent actively participate in the conversation?
Yes, though it requires careful design. Join the agent unmuted so its audio reaches the conference, and implement a keyword trigger (e.g., "Hey AI, summarize what we discussed") so the agent only speaks when addressed. Use interruption detection to avoid talking over participants. Active participation is useful for real-time fact-checking or retrieving information, but keep responses brief; a verbose AI agent in a meeting is counterproductive.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.