Audio Analysis Agent: Music Classification, Speaker Identification, and Sound Events
Build an audio analysis agent in Python that classifies music genres, identifies speakers through diarization, and detects sound events. Covers audio feature extraction, classification models, and structured audio understanding.
Audio as a First-Class Modality
While text and images dominate AI agent discussions, audio carries information that other modalities cannot. Tone of voice reveals sentiment. Background sounds provide context. Speaker identity matters for meeting transcription. An audio analysis agent goes beyond simple speech-to-text — it understands the full audio landscape.
Core Dependencies
pip install openai librosa numpy soundfile torch torchaudio
pip install pyannote.audio # for speaker diarization
Audio Feature Extraction
Before classification, extract meaningful features from raw audio. Librosa provides the standard toolkit for audio feature analysis:
import librosa
import numpy as np
from dataclasses import dataclass
@dataclass
class AudioFeatures:
duration_seconds: float
sample_rate: int
tempo: float
spectral_centroid_mean: float
mfcc_means: list[float]
rms_energy_mean: float
zero_crossing_rate: float
is_speech_likely: bool
def extract_features(audio_path: str) -> AudioFeatures:
"""Extract audio features for classification."""
y, sr = librosa.load(audio_path, sr=22050)
duration = librosa.get_duration(y=y, sr=sr)
# Tempo estimation
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
tempo_value = float(tempo) if np.isscalar(tempo) else float(tempo[0])
# Spectral features
spectral_centroid = librosa.feature.spectral_centroid(
y=y, sr=sr
)
# MFCCs — standard for audio classification
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_means = [float(np.mean(mfccs[i])) for i in range(13)]
# Energy
rms = librosa.feature.rms(y=y)
# Zero crossing rate: tends to run higher for speech than for music
zcr = librosa.feature.zero_crossing_rate(y)
zcr_mean = float(np.mean(zcr))
# Rough heuristic only: moderate ZCR plus low RMS energy suggests speech
is_speech = zcr_mean > 0.05 and float(np.mean(rms)) < 0.1
return AudioFeatures(
duration_seconds=duration,
sample_rate=sr,
tempo=tempo_value,
spectral_centroid_mean=float(np.mean(spectral_centroid)),
mfcc_means=mfcc_means,
rms_energy_mean=float(np.mean(rms)),
zero_crossing_rate=zcr_mean,
is_speech_likely=is_speech,
)
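The `is_speech_likely` flag above leans on zero crossing rate, which is worth seeing in isolation. Here is a numpy-only sketch of what ZCR actually measures; the signals and thresholds are illustrative, not calibrated against real speech:

```python
import numpy as np

def zero_crossing_rate(y: np.ndarray) -> float:
    """Fraction of consecutive-sample pairs whose sign differs."""
    signs = np.sign(y)
    # Treat exact zeros as positive so silent samples don't inflate the rate
    signs[signs == 0] = 1
    return float(np.mean(signs[:-1] != signs[1:]))

# A 50 Hz sine sampled at 1 kHz crosses zero ~100 times per second,
# i.e. roughly 0.1 crossings per sample pair
t = np.linspace(0, 1, 1000, endpoint=False)
low_tone = np.sin(2 * np.pi * 50 * t)
# White noise flips sign about half the time, so its ZCR sits near 0.5
noise = np.random.default_rng(0).standard_normal(1000)

print(zero_crossing_rate(low_tone))  # ≈ 0.1
print(zero_crossing_rate(noise))     # ≈ 0.5
```

A pure tone's ZCR scales with its frequency, which is why noisy, broadband signals like unvoiced speech sit much higher than most music.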
Speaker Diarization
Speaker diarization answers the question "who spoke when" — essential for meeting transcription and multi-party audio analysis:
from pyannote.audio import Pipeline
import torch
@dataclass
class SpeakerSegment:
speaker: str
start: float
end: float
duration: float
def diarize_speakers(
audio_path: str, hf_token: str
) -> list[SpeakerSegment]:
"""Identify different speakers and their time segments."""
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token=hf_token,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = pipeline.to(device)
diarization = pipeline(audio_path)
segments = []
for turn, _, speaker in diarization.itertracks(
yield_label=True
):
segments.append(SpeakerSegment(
speaker=speaker,
start=round(turn.start, 2),
end=round(turn.end, 2),
duration=round(turn.end - turn.start, 2),
))
return segments
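Once you have speaker segments, simple aggregations fall out directly. As one example, a sketch of per-speaker talk time; the `SpeakerSegment` dataclass is repeated so the snippet is self-contained, and the `SPEAKER_00` labels follow pyannote's naming convention:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float

def talk_time(segments: list[SpeakerSegment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)

segments = [
    SpeakerSegment("SPEAKER_00", 0.0, 4.5, 4.5),
    SpeakerSegment("SPEAKER_01", 4.5, 6.0, 1.5),
    SpeakerSegment("SPEAKER_00", 6.0, 8.0, 2.0),
]
print(talk_time(segments))  # {'SPEAKER_00': 6.5, 'SPEAKER_01': 1.5}
```

Talk-time ratios like this are often more useful in meeting analytics than the raw segment list.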
Transcription with Speaker Labels
Combine diarization with Whisper transcription to produce speaker-labeled transcripts:
import openai
async def transcribe_with_speakers(
audio_path: str,
segments: list[SpeakerSegment],
client: openai.AsyncOpenAI,
) -> list[dict]:
"""Transcribe audio and align with speaker diarization."""
# First, get the full transcript with timestamps
with open(audio_path, "rb") as f:
transcript = await client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="verbose_json",
timestamp_granularities=["segment"],
)
# Align transcript segments with speaker labels.
# The openai v1 SDK returns segments as objects, so use
# attribute access rather than dict indexing.
labeled_segments = []
for t_seg in transcript.segments:
seg_mid = (t_seg.start + t_seg.end) / 2
speaker = "Unknown"
for s_seg in segments:
if s_seg.start <= seg_mid <= s_seg.end:
speaker = s_seg.speaker
break
labeled_segments.append({
"speaker": speaker,
"start": t_seg.start,
"end": t_seg.end,
"text": t_seg.text.strip(),
})
return labeled_segments
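The labeled segments are easy to render as a readable transcript. A minimal formatter, assuming the dict shape produced by `transcribe_with_speakers`; consecutive turns from the same speaker are merged into one line:

```python
def format_transcript(labeled: list[dict]) -> str:
    """Render speaker-labeled segments as '[MM:SS] speaker: text' lines."""
    # Merge consecutive segments from the same speaker
    merged: list[dict] = []
    for seg in labeled:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            merged[-1]["text"] += " " + seg["text"]
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))  # copy so the input isn't mutated
    lines = []
    for seg in merged:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

example = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.2, "text": "Hi everyone."},
    {"speaker": "SPEAKER_00", "start": 3.2, "end": 5.0, "text": "Let's begin."},
    {"speaker": "SPEAKER_01", "start": 65.0, "end": 67.0, "text": "Sounds good."},
]
print(format_transcript(example))
# [00:00] SPEAKER_00: Hi everyone. Let's begin.
# [01:05] SPEAKER_01: Sounds good.
```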
Sound Event Detection
Beyond speech, detect environmental sounds and events:
async def detect_sound_events(
audio_path: str, client: openai.AsyncOpenAI
) -> list[dict]:
"""Use GPT-4o audio capabilities to detect sound events."""
# Encode audio for API
import base64
with open(audio_path, "rb") as f:
audio_bytes = f.read()
b64_audio = base64.b64encode(audio_bytes).decode()
# GPT-4o with audio understanding
response = await client.chat.completions.create(
model="gpt-4o-audio-preview",
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": (
"Listen to this audio and identify all "
"distinct sound events. For each event, "
"provide the approximate timestamp, type "
"of sound, and confidence level. Return "
"as a JSON array."
),
},
{
"type": "input_audio",
"input_audio": {
"data": b64_audio,
"format": "wav",
},
},
],
}],
response_format={"type": "json_object"},
)
import json
result = json.loads(response.choices[0].message.content)
return result.get("events", [])
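Model-detected events usually need light post-processing, since the same sound is often reported more than once. Here is a sketch that collapses same-type events within a short window, assuming each event dict carries `timestamp`, `type`, and `confidence` keys; the exact shape of the model's JSON is not guaranteed, so adapt the keys to what you actually receive:

```python
def merge_events(events: list[dict], window: float = 1.0) -> list[dict]:
    """Collapse events of the same type within `window` seconds,
    keeping the higher-confidence report."""
    merged: list[dict] = []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if (merged
                and merged[-1]["type"] == ev["type"]
                and ev["timestamp"] - merged[-1]["timestamp"] <= window):
            # Duplicate within the window: keep whichever is more confident
            if ev["confidence"] > merged[-1]["confidence"]:
                merged[-1] = ev
            continue
        merged.append(ev)
    return merged

events = [
    {"timestamp": 2.1, "type": "dog_bark", "confidence": 0.9},
    {"timestamp": 2.6, "type": "dog_bark", "confidence": 0.7},
    {"timestamp": 10.0, "type": "door_slam", "confidence": 0.8},
]
print(merge_events(events))  # two events remain; the 2.6 s duplicate is dropped
```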
The Audio Analysis Agent
class AudioAnalysisAgent:
def __init__(self, hf_token: str | None = None):
self.client = openai.AsyncOpenAI()
self.hf_token = hf_token
async def analyze(self, audio_path: str) -> dict:
features = extract_features(audio_path)
result = {
"duration": features.duration_seconds,
"tempo": features.tempo,
"is_speech": features.is_speech_likely,
}
if features.is_speech_likely and self.hf_token:
segments = diarize_speakers(audio_path, self.hf_token)
transcript = await transcribe_with_speakers(
audio_path, segments, self.client
)
unique_speakers = set(s.speaker for s in segments)
result["speaker_count"] = len(unique_speakers)
result["transcript"] = transcript
else:
events = await detect_sound_events(
audio_path, self.client
)
result["sound_events"] = events
return result
FAQ
What is the difference between speaker diarization and speaker identification?
Speaker diarization answers "who spoke when" by segmenting audio into speaker turns and labeling them as Speaker 1, Speaker 2, and so on — without knowing who those speakers are. Speaker identification matches voice segments against a known database of speaker voiceprints to determine the actual identity. Diarization is unsupervised and works on any audio, while identification requires pre-enrolled speaker profiles.
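To make the distinction concrete: speaker identification typically compares a segment's voice embedding against enrolled voiceprints by cosine similarity. A toy numpy sketch with hypothetical 3-D embeddings; real systems use high-dimensional embeddings from models such as ECAPA-TDNN, and the 0.7 threshold is illustrative:

```python
import numpy as np

def identify(embedding: np.ndarray, enrolled: dict[str, np.ndarray],
             threshold: float = 0.7) -> str:
    """Return the enrolled speaker whose voiceprint is most similar,
    or 'unknown' if no similarity exceeds the threshold."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_name, best_score = "unknown", threshold
    for name, voiceprint in enrolled.items():
        score = cos(embedding, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

enrolled = {
    "alice": np.array([1.0, 0.0, 0.2]),
    "bob": np.array([0.0, 1.0, 0.1]),
}
print(identify(np.array([0.9, 0.1, 0.2]), enrolled))   # alice
print(identify(np.array([0.5, 0.5, -0.9]), enrolled))  # unknown
```

Diarization labels like `SPEAKER_00` can be mapped to real names by running exactly this kind of lookup on an embedding from each diarized speaker's audio.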
How accurate is pyannote for speaker diarization in noisy environments?
Pyannote 3.1 achieves strong results in clean recordings (under 5% diarization error rate) but degrades in noisy environments, overlapping speech, and phone-quality audio. For noisy recordings, preprocess with noise reduction (using libraries like noisereduce) before diarization. Also consider increasing the minimum segment duration to avoid spurious speaker switches caused by noise.
Can I classify music genres using the extracted audio features?
Yes. The MFCC features, spectral centroid, tempo, and zero crossing rate are the classic features used for genre classification. Train a simple classifier (random forest or small neural network) on a labeled dataset like GTZAN. Alternatively, skip manual feature engineering and use a pretrained audio classification model like those from Hugging Face's audio transformers, which accept raw waveforms and output genre labels directly.
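As a minimal illustration of the classical-features route, here is a nearest-centroid classifier over toy feature vectors. The genres, feature dimensions, and values are synthetic stand-ins, not real GTZAN data; in practice you would build each vector from the `AudioFeatures` fields above:

```python
import numpy as np

def nearest_centroid(train: dict[str, np.ndarray], sample: np.ndarray) -> str:
    """Classify a feature vector by its nearest class centroid (Euclidean)."""
    centroids = {genre: feats.mean(axis=0) for genre, feats in train.items()}
    return min(centroids,
               key=lambda g: float(np.linalg.norm(sample - centroids[g])))

rng = np.random.default_rng(42)
# Toy 3-D feature vectors, e.g. [tempo, spectral centroid, ZCR], normalized
train = {
    "rock": rng.normal([0.8, 0.6, 0.4], 0.05, size=(20, 3)),
    "classical": rng.normal([0.3, 0.2, 0.1], 0.05, size=(20, 3)),
}
print(nearest_centroid(train, np.array([0.75, 0.55, 0.45])))  # rock
```

A random forest or small MLP will beat this baseline easily, but nearest-centroid makes the geometry of feature-based genre classification visible in a few lines.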
#AudioAnalysis #SpeakerDiarization #SoundClassification #AudioFeatures #Python #AgenticAI #LearnAI #AIEngineering