Learn Agentic AI · 14 min read

Call Recording and Transcription for AI Analysis: Building a Call Analytics Pipeline

Build a complete call analytics pipeline that records calls, transcribes them, and extracts actionable insights using AI. Covers recording APIs, speaker diarization, sentiment analysis, and trend detection.

Why Call Analytics Matters

Every phone call your business handles is a goldmine of unstructured data — customer pain points, competitor mentions, product feedback, and sales signals. Without a structured analytics pipeline, these insights vanish the moment the call ends. A call analytics pipeline captures recordings, transcribes them accurately, and uses AI to extract structured insights at scale.

The pipeline has four stages: recording, transcription, analysis, and storage. Each stage feeds the next, and the final output is a structured dataset you can query, visualize, and act on.

Stage 1: Recording Calls

Using Twilio as an example, you can start a dual-channel recording on the live call through the Recordings REST API from your answer webhook:

import os

from fastapi import FastAPI, Request
from fastapi.responses import Response
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse

app = FastAPI()
twilio_client = Client(
    os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"]
)

@app.post("/incoming-call")
async def handle_call(request: Request):
    form = await request.form()

    # Start a dual-channel recording (separate audio track per call leg).
    # The call must be in progress, and the callback must be a publicly
    # reachable URL on your domain.
    twilio_client.calls(form["CallSid"]).recordings.create(
        recording_channels="dual",
        recording_status_callback="/recording-status",  # use your full public URL
        recording_status_callback_event=["completed"],
    )

    response = VoiceResponse()
    response.say("Thank you for calling. How can I help?")
    response.gather(input="speech", action="/handle-speech")
    return Response(content=str(response), media_type="application/xml")


@app.post("/recording-status")
async def recording_complete(request: Request):
    """Webhook called when recording is finalized."""
    form = await request.form()
    recording_sid = form["RecordingSid"]
    recording_url = form["RecordingUrl"]
    duration = int(form["RecordingDuration"])
    call_sid = form["CallSid"]

    # Trigger the transcription pipeline
    await start_transcription_pipeline(
        recording_sid=recording_sid,
        recording_url=f"{recording_url}.wav",
        duration=duration,
        call_sid=call_sid,
    )
    return {"status": "accepted"}

Dual-channel recording is critical for analytics — it puts each speaker on a separate audio track, which dramatically improves transcription accuracy and makes speaker diarization trivial.
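Because dual-channel recording already separates the speakers physically, one alternative to statistical diarization is to transcribe in Deepgram's multichannel mode (`multichannel=True` instead of `diarize=True`), in which each utterance carries a `channel` index. The channel-to-role mapping below is an assumption about which Twilio leg lands on which channel, so verify it against a sample recording for your own setup:

```python
# Assumed mapping for Twilio dual-channel recordings: channel 0 = caller leg,
# channel 1 = agent leg. Confirm this against a sample recording before relying on it.
CHANNEL_ROLES = {0: "Caller", 1: "Agent"}

def label_utterances_by_channel(utterances: list[dict]) -> list[dict]:
    """Replace generic speaker IDs with roles derived from the audio channel."""
    return [
        {**u, "speaker": CHANNEL_ROLES.get(u.get("channel"), f"Speaker {u.get('channel')}")}
        for u in utterances
    ]
```

Channel-based labeling is deterministic, so it avoids the occasional speaker-swap errors that statistical diarization can make on short utterances.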

Stage 2: Transcription with Speaker Diarization

Download the recording and run it through a speech-to-text engine with speaker separation:


import os

import httpx
from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

async def transcribe_recording(recording_url: str, auth_token: str):
    """Download recording and transcribe with speaker diarization."""
    # Download the recording from Twilio
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            recording_url,
            auth=(os.environ["TWILIO_ACCOUNT_SID"], auth_token),
        )
        audio_bytes = resp.content

    # Transcribe with Deepgram (diarization + punctuation)
    options = PrerecordedOptions(
        model="nova-2",
        smart_format=True,
        diarize=True,
        punctuate=True,
        utterances=True,
        language="en-US",
    )

    response = await deepgram.listen.asyncrest.v("1").transcribe_file(
        {"buffer": audio_bytes, "mimetype": "audio/wav"},
        options,
    )

    # Structure the transcript by speaker
    utterances = response.results.utterances
    structured_transcript = []
    for utterance in utterances:
        structured_transcript.append({
            "speaker": f"Speaker {utterance.speaker}",
            "text": utterance.transcript,
            "start": utterance.start,
            "end": utterance.end,
            "confidence": utterance.confidence,
        })

    return structured_transcript
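Raw utterance lists are often choppy, with the same speaker split across many short segments. A small post-processing pass that merges consecutive utterances from the same speaker (a sketch; the one-second gap threshold is an arbitrary assumption you can tune) produces a cleaner transcript for the LLM stage:

```python
def merge_utterances(transcript: list[dict], max_gap: float = 1.0) -> list[dict]:
    """Merge consecutive utterances from the same speaker when the pause
    between them is at most `max_gap` seconds."""
    merged: list[dict] = []
    for u in transcript:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["speaker"] == u["speaker"]
            and u["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous utterance instead of starting a new one
            prev["text"] += " " + u["text"]
            prev["end"] = u["end"]
            prev["confidence"] = min(prev["confidence"], u["confidence"])
        else:
            merged.append(dict(u))
    return merged
```

Fewer, longer turns also reduce the token count of the formatted transcript sent to the analysis model.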

Stage 3: AI-Powered Analysis

With a structured transcript in hand, use an LLM to extract insights:

from openai import AsyncOpenAI

client = AsyncOpenAI()

ANALYSIS_PROMPT = """Analyze this call transcript and extract:
1. **Summary**: 2-3 sentence summary of the call
2. **Sentiment**: overall caller sentiment (positive/neutral/negative)
3. **Intent**: caller's primary intent (support, sales, complaint, etc.)
4. **Key Topics**: list of topics discussed
5. **Action Items**: any follow-up actions promised
6. **Satisfaction Score**: 1-10 estimate of caller satisfaction
7. **Escalation Risk**: low/medium/high
8. **Competitor Mentions**: any competitor names mentioned

Return valid JSON with exactly these keys: summary, sentiment, intent,
key_topics (array of strings), action_items (array of strings),
satisfaction_score (integer), escalation_risk, competitor_mentions
(array of strings)."""

async def analyze_transcript(transcript: list[dict]) -> dict:
    """Run AI analysis on a structured transcript."""
    # Format transcript for the LLM
    formatted = "\n".join(
        f"[{t['speaker']}] ({t['start']:.1f}s): {t['text']}"
        for t in transcript
    )

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ANALYSIS_PROMPT},
            {"role": "user", "content": formatted},
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )

    import json
    return json.loads(response.choices[0].message.content)
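Even with `response_format={"type": "json_object"}`, the model can omit keys or return out-of-range values, which would break the storage stage. A defensive normalization step helps; the helper below is a hypothetical sketch whose keys and defaults mirror the prompt above, not an API from the article:

```python
# Hypothetical guard for the analysis dict before it reaches the database.
# Keys mirror the analysis prompt; the defaults are assumptions.
ANALYSIS_DEFAULTS = {
    "summary": "",
    "sentiment": "neutral",
    "intent": "unknown",
    "key_topics": [],
    "action_items": [],
    "satisfaction_score": 5,
    "escalation_risk": "low",
    "competitor_mentions": [],
}

def normalize_analysis(raw: dict) -> dict:
    """Fill missing keys with defaults and clamp out-of-range values."""
    out = {**ANALYSIS_DEFAULTS,
           **{k: v for k, v in raw.items() if k in ANALYSIS_DEFAULTS and v is not None}}
    # Clamp the score to the 1-10 range the prompt requests
    try:
        out["satisfaction_score"] = max(1, min(10, int(out["satisfaction_score"])))
    except (TypeError, ValueError):
        out["satisfaction_score"] = ANALYSIS_DEFAULTS["satisfaction_score"]
    if out["escalation_risk"] not in {"low", "medium", "high"}:
        out["escalation_risk"] = "low"
    return out
```

Running the analysis dict through a guard like this before the INSERT turns malformed LLM output into a degraded-but-storable row instead of a failed pipeline run.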

Stage 4: Storage and Querying

Store the raw transcript and analysis results in a database optimized for querying:
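The INSERT below assumes a `call_analytics` table roughly like the following. This DDL is a hypothetical sketch with column types inferred from the values being written (JSONB for the serialized transcript and action items, TEXT[] for the topic and competitor arrays):

```python
# Hypothetical schema matching the INSERT below; run once at startup,
# e.g. `await pool.execute(CALL_ANALYTICS_DDL)`.
CALL_ANALYTICS_DDL = """
CREATE TABLE IF NOT EXISTS call_analytics (
    id                  BIGSERIAL PRIMARY KEY,
    call_sid            TEXT UNIQUE NOT NULL,
    transcript          JSONB NOT NULL,
    summary             TEXT,
    sentiment           TEXT,
    intent              TEXT,
    topics              TEXT[],
    action_items        JSONB,
    satisfaction_score  INT,
    escalation_risk     TEXT,
    competitor_mentions TEXT[],
    duration_seconds    INT,
    analyzed_at         TIMESTAMPTZ NOT NULL
);
CREATE INDEX IF NOT EXISTS call_analytics_analyzed_at_idx
    ON call_analytics (analyzed_at);
"""
```

The index on `analyzed_at` supports the time-windowed aggregate queries shown in this stage.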

import asyncpg
import json
from datetime import datetime

async def store_call_analysis(
    pool: asyncpg.Pool,
    call_sid: str,
    transcript: list[dict],
    analysis: dict,
    duration: int,
):
    """Persist call data and analysis to PostgreSQL."""
    await pool.execute(
        """
        INSERT INTO call_analytics (
            call_sid, transcript, summary, sentiment,
            intent, topics, action_items, satisfaction_score,
            escalation_risk, competitor_mentions,
            duration_seconds, analyzed_at
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12)
        """,
        call_sid,
        json.dumps(transcript),
        analysis["summary"],
        analysis["sentiment"],
        analysis["intent"],
        analysis["key_topics"],
        json.dumps(analysis["action_items"]),
        analysis["satisfaction_score"],
        analysis["escalation_risk"],
        analysis.get("competitor_mentions", []),
        duration,
        datetime.utcnow(),
    )


async def get_insights_summary(pool: asyncpg.Pool, days: int = 7):
    """Query aggregate insights over a time period."""
    # Aggregate per-call metrics first, then collect topics in a separate
    # subquery, so unnesting the topics array does not inflate call counts
    # and satisfaction averages.
    return await pool.fetch(
        """
        WITH recent AS (
            SELECT call_sid, intent, satisfaction_score, escalation_risk, topics
            FROM call_analytics
            WHERE analyzed_at >= NOW() - make_interval(days => $1)
        )
        SELECT
            r.intent,
            COUNT(*) AS call_count,
            AVG(r.satisfaction_score) AS avg_satisfaction,
            COUNT(*) FILTER (WHERE r.escalation_risk = 'high') AS escalations,
            (SELECT array_agg(DISTINCT topic)
             FROM recent r2, unnest(r2.topics) AS topic
             WHERE r2.intent = r.intent) AS all_topics
        FROM recent r
        GROUP BY r.intent
        ORDER BY call_count DESC
        """,
        days,
    )
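To make the query results actionable, you can render them into a plain-text digest, for example for a weekly email. The helper below is a hypothetical sketch; it treats each row as a mapping keyed by the query's column aliases, which is how asyncpg Records behave:

```python
def format_insights_report(rows) -> str:
    """Render insight rows (mapping-like, keyed by the query's column
    aliases) into a one-line-per-intent plain-text summary."""
    lines = []
    for row in rows:
        topics = ", ".join(row["all_topics"] or [])
        lines.append(
            f"{row['intent']}: {row['call_count']} calls, "
            f"avg satisfaction {row['avg_satisfaction']:.1f}, "
            f"{row['escalations']} high-risk ({topics})"
        )
    return "\n".join(lines)
```

Feeding this into a scheduled job gives stakeholders a trend snapshot without anyone opening a SQL console.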

The Complete Pipeline

Wire all four stages together with an async task queue:

async def start_transcription_pipeline(
    recording_sid: str,
    recording_url: str,
    duration: int,
    call_sid: str,
):
    """Orchestrate the full recording-to-insights pipeline."""
    # Stage 2: Transcribe
    transcript = await transcribe_recording(
        recording_url, os.environ["TWILIO_AUTH_TOKEN"]
    )

    # Stage 3: Analyze
    analysis = await analyze_transcript(transcript)

    # Stage 4: Store (db_pool is the asyncpg pool created at app startup)
    await store_call_analysis(
        db_pool, call_sid, transcript, analysis, duration
    )

    print(f"Pipeline complete for call {call_sid}: "
          f"intent={analysis['intent']}, "
          f"satisfaction={analysis['satisfaction_score']}/10")
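Since the pipeline should run after the call without blocking the webhook response, one way to launch it (a sketch under assumptions, not the article's implementation; `run_pipeline_with_retry` is a hypothetical helper) is to fire it with `asyncio.create_task` and retry transient failures with backoff:

```python
import asyncio
import logging

logger = logging.getLogger("call_pipeline")

async def run_pipeline_with_retry(pipeline_coro_factory, call_sid: str,
                                  retries: int = 2) -> bool:
    """Run the full pipeline, retrying transient failures with backoff.

    `pipeline_coro_factory` is a zero-argument callable that returns a fresh
    coroutine each attempt, e.g. lambda: start_transcription_pipeline(...).
    """
    for attempt in range(retries + 1):
        try:
            await pipeline_coro_factory()
            return True
        except Exception:
            logger.exception("Pipeline failed for call %s (attempt %d)",
                             call_sid, attempt + 1)
            if attempt < retries:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
    return False

# In the /recording-status webhook, fire and forget:
# asyncio.create_task(run_pipeline_with_retry(
#     lambda: start_transcription_pipeline(recording_sid, recording_url,
#                                          duration, call_sid),
#     call_sid,
# ))
```

For production volume, a durable queue (Celery, RQ, or a Postgres-backed job table) is a safer home for this work than in-process tasks, since in-process tasks die with the server.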

FAQ

How long does the pipeline take per call?

Transcription takes roughly 20-30% of the call duration with modern engines like Deepgram Nova-2. AI analysis adds 2-5 seconds. For a 5-minute call, expect the full pipeline to complete in about 90 seconds. Run it asynchronously after the call ends so it never impacts call quality.

Is it legal to record customer calls?

Recording laws vary by jurisdiction. In "two-party consent" states (like California) and countries (like Germany), you must inform all parties and obtain consent before recording. Add a recording disclosure at the start of every call and implement a mechanism to disable recording if consent is denied. Consult legal counsel for your specific jurisdictions.

How accurate is modern speech-to-text for phone calls?

Modern engines like Deepgram Nova-2 and OpenAI Whisper achieve 90-95% accuracy on clean phone audio. Accuracy drops with heavy accents, background noise, or poor phone connections. Dual-channel recording improves accuracy by 5-10% because each speaker has a clean audio track. Always store the raw recording alongside the transcript so you can re-transcribe as models improve.


#CallAnalytics #Transcription #SentimentAnalysis #SpeechtoText #VoiceAI #DataPipeline #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
