Audio Analysis Agent: Music Classification, Speaker Identification, and Sound Events
Build an audio analysis agent in Python that classifies music genres, identifies speakers through diarization, and detects sound events. Covers audio feature extraction, classification models, and structured audio understanding.
Audio as a First-Class Modality
While text and images dominate AI agent discussions, audio carries information that other modalities cannot. Tone of voice reveals sentiment. Background sounds provide context. Speaker identity matters for meeting transcription. An audio analysis agent goes beyond simple speech-to-text — it understands the full audio landscape.
Core Dependencies
pip install openai librosa numpy soundfile torch torchaudio
pip install pyannote.audio # for speaker diarization
Audio Feature Extraction
Before classification, extract meaningful features from raw audio. Librosa provides the standard toolkit for audio feature analysis:
import librosa
import numpy as np
from dataclasses import dataclass
@dataclass
class AudioFeatures:
duration_seconds: float
sample_rate: int
tempo: float
spectral_centroid_mean: float
mfcc_means: list[float]
rms_energy_mean: float
zero_crossing_rate: float
is_speech_likely: bool
def extract_features(audio_path: str) -> AudioFeatures:
"""Extract audio features for classification."""
y, sr = librosa.load(audio_path, sr=22050)
duration = librosa.get_duration(y=y, sr=sr)
# Tempo estimation
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
tempo_value = float(tempo) if np.isscalar(tempo) else float(tempo[0])
# Spectral features
spectral_centroid = librosa.feature.spectral_centroid(
y=y, sr=sr
)
# MFCCs — standard for audio classification
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_means = [float(np.mean(mfccs[i])) for i in range(13)]
# Energy
rms = librosa.feature.rms(y=y)
# Zero crossing rate: tends to run higher for speech than for music
zcr = librosa.feature.zero_crossing_rate(y)
zcr_mean = float(np.mean(zcr))
# Rough heuristic only: moderate ZCR plus low RMS energy suggests speech
is_speech = zcr_mean > 0.05 and float(np.mean(rms)) < 0.1
return AudioFeatures(
duration_seconds=duration,
sample_rate=sr,
tempo=tempo_value,
spectral_centroid_mean=float(np.mean(spectral_centroid)),
mfcc_means=mfcc_means,
rms_energy_mean=float(np.mean(rms)),
zero_crossing_rate=zcr_mean,
is_speech_likely=is_speech,
)
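The `is_speech_likely` flag above leans on zero crossing rate, which is worth seeing in isolation. Here is a numpy-only sketch of what ZCR actually measures; the signals and thresholds are illustrative, not calibrated against real speech:

```python
import numpy as np

def zero_crossing_rate(y: np.ndarray) -> float:
    """Fraction of consecutive-sample pairs whose sign differs."""
    signs = np.sign(y)
    # Treat exact zeros as positive so silent samples don't inflate the rate
    signs[signs == 0] = 1
    return float(np.mean(signs[:-1] != signs[1:]))

# A 50 Hz sine sampled at 1 kHz crosses zero ~100 times per second,
# i.e. roughly 0.1 crossings per sample pair
t = np.linspace(0, 1, 1000, endpoint=False)
low_tone = np.sin(2 * np.pi * 50 * t)
# White noise flips sign about half the time, so its ZCR sits near 0.5
noise = np.random.default_rng(0).standard_normal(1000)

print(zero_crossing_rate(low_tone))  # ≈ 0.1
print(zero_crossing_rate(noise))     # ≈ 0.5
```

A pure tone's ZCR scales with its frequency, which is why noisy, broadband signals like unvoiced speech sit much higher than most music.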
Speaker Diarization
Speaker diarization answers the question "who spoke when" — essential for meeting transcription and multi-party audio analysis:
from pyannote.audio import Pipeline
import torch
@dataclass
class SpeakerSegment:
speaker: str
start: float
end: float
duration: float
def diarize_speakers(
audio_path: str, hf_token: str
) -> list[SpeakerSegment]:
"""Identify different speakers and their time segments."""
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token=hf_token,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = pipeline.to(device)
diarization = pipeline(audio_path)
segments = []
for turn, _, speaker in diarization.itertracks(
yield_label=True
):
segments.append(SpeakerSegment(
speaker=speaker,
start=round(turn.start, 2),
end=round(turn.end, 2),
duration=round(turn.end - turn.start, 2),
))
return segments
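Once you have speaker segments, simple aggregations fall out directly. As one example, a sketch of per-speaker talk time; the `SpeakerSegment` dataclass is repeated so the snippet is self-contained, and the `SPEAKER_00` labels follow pyannote's naming convention:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float

def talk_time(segments: list[SpeakerSegment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)

segments = [
    SpeakerSegment("SPEAKER_00", 0.0, 4.5, 4.5),
    SpeakerSegment("SPEAKER_01", 4.5, 6.0, 1.5),
    SpeakerSegment("SPEAKER_00", 6.0, 8.0, 2.0),
]
print(talk_time(segments))  # {'SPEAKER_00': 6.5, 'SPEAKER_01': 1.5}
```

Talk-time ratios like this are often more useful in meeting analytics than the raw segment list.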
Transcription with Speaker Labels
Combine diarization with Whisper transcription to produce speaker-labeled transcripts:
import openai
async def transcribe_with_speakers(
audio_path: str,
segments: list[SpeakerSegment],
client: openai.AsyncOpenAI,
) -> list[dict]:
"""Transcribe audio and align with speaker diarization."""
# First, get the full transcript with timestamps
with open(audio_path, "rb") as f:
transcript = await client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="verbose_json",
timestamp_granularities=["segment"],
)
# Align transcript segments with speaker labels.
# The openai v1 SDK returns segments as objects, so use
# attribute access rather than dict indexing.
labeled_segments = []
for t_seg in transcript.segments:
seg_mid = (t_seg.start + t_seg.end) / 2
speaker = "Unknown"
for s_seg in segments:
if s_seg.start <= seg_mid <= s_seg.end:
speaker = s_seg.speaker
break
labeled_segments.append({
"speaker": speaker,
"start": t_seg.start,
"end": t_seg.end,
"text": t_seg.text.strip(),
})
return labeled_segments
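The labeled segments are easy to render as a readable transcript. A minimal formatter, assuming the dict shape produced by `transcribe_with_speakers`; consecutive turns from the same speaker are merged into one line:

```python
def format_transcript(labeled: list[dict]) -> str:
    """Render speaker-labeled segments as '[MM:SS] speaker: text' lines."""
    # Merge consecutive segments from the same speaker
    merged: list[dict] = []
    for seg in labeled:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            merged[-1]["text"] += " " + seg["text"]
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))  # copy so the input isn't mutated
    lines = []
    for seg in merged:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

example = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.2, "text": "Hi everyone."},
    {"speaker": "SPEAKER_00", "start": 3.2, "end": 5.0, "text": "Let's begin."},
    {"speaker": "SPEAKER_01", "start": 65.0, "end": 67.0, "text": "Sounds good."},
]
print(format_transcript(example))
# [00:00] SPEAKER_00: Hi everyone. Let's begin.
# [01:05] SPEAKER_01: Sounds good.
```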
Sound Event Detection
Beyond speech, detect environmental sounds and events:
async def detect_sound_events(
audio_path: str, client: openai.AsyncOpenAI
) -> list[dict]:
"""Use GPT-4o audio capabilities to detect sound events."""
# Encode audio for API
import base64
with open(audio_path, "rb") as f:
audio_bytes = f.read()
b64_audio = base64.b64encode(audio_bytes).decode()
# GPT-4o with audio understanding
response = await client.chat.completions.create(
model="gpt-4o-audio-preview",
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": (
"Listen to this audio and identify all "
"distinct sound events. For each event, "
"provide the approximate timestamp, type "
"of sound, and confidence level. Return "
"as a JSON array."
),
},
{
"type": "input_audio",
"input_audio": {
"data": b64_audio,
"format": "wav",
},
},
],
}],
response_format={"type": "json_object"},
)
import json
result = json.loads(response.choices[0].message.content)
return result.get("events", [])
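Model-detected events usually need light post-processing, since the same sound is often reported more than once. Here is a sketch that collapses same-type events within a short window, assuming each event dict carries `timestamp`, `type`, and `confidence` keys; the exact shape of the model's JSON is not guaranteed, so adapt the keys to what you actually receive:

```python
def merge_events(events: list[dict], window: float = 1.0) -> list[dict]:
    """Collapse events of the same type within `window` seconds,
    keeping the higher-confidence report."""
    merged: list[dict] = []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if (merged
                and merged[-1]["type"] == ev["type"]
                and ev["timestamp"] - merged[-1]["timestamp"] <= window):
            # Duplicate within the window: keep whichever is more confident
            if ev["confidence"] > merged[-1]["confidence"]:
                merged[-1] = ev
            continue
        merged.append(ev)
    return merged

events = [
    {"timestamp": 2.1, "type": "dog_bark", "confidence": 0.9},
    {"timestamp": 2.6, "type": "dog_bark", "confidence": 0.7},
    {"timestamp": 10.0, "type": "door_slam", "confidence": 0.8},
]
print(merge_events(events))  # two events remain; the 2.6 s duplicate is dropped
```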
The Audio Analysis Agent
class AudioAnalysisAgent:
def __init__(self, hf_token: str | None = None):
self.client = openai.AsyncOpenAI()
self.hf_token = hf_token
async def analyze(self, audio_path: str) -> dict:
features = extract_features(audio_path)
result = {
"duration": features.duration_seconds,
"tempo": features.tempo,
"is_speech": features.is_speech_likely,
}
if features.is_speech_likely and self.hf_token:
segments = diarize_speakers(audio_path, self.hf_token)
transcript = await transcribe_with_speakers(
audio_path, segments, self.client
)
unique_speakers = set(s.speaker for s in segments)
result["speaker_count"] = len(unique_speakers)
result["transcript"] = transcript
else:
events = await detect_sound_events(
audio_path, self.client
)
result["sound_events"] = events
return result
FAQ
What is the difference between speaker diarization and speaker identification?
Speaker diarization answers "who spoke when" by segmenting audio into speaker turns and labeling them as Speaker 1, Speaker 2, and so on — without knowing who those speakers are. Speaker identification matches voice segments against a known database of speaker voiceprints to determine the actual identity. Diarization is unsupervised and works on any audio, while identification requires pre-enrolled speaker profiles.
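To make the distinction concrete: speaker identification typically compares a segment's voice embedding against enrolled voiceprints by cosine similarity. A toy numpy sketch with hypothetical 3-D embeddings; real systems use high-dimensional embeddings from models such as ECAPA-TDNN, and the 0.7 threshold is illustrative:

```python
import numpy as np

def identify(embedding: np.ndarray, enrolled: dict[str, np.ndarray],
             threshold: float = 0.7) -> str:
    """Return the enrolled speaker whose voiceprint is most similar,
    or 'unknown' if no similarity exceeds the threshold."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_name, best_score = "unknown", threshold
    for name, voiceprint in enrolled.items():
        score = cos(embedding, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

enrolled = {
    "alice": np.array([1.0, 0.0, 0.2]),
    "bob": np.array([0.0, 1.0, 0.1]),
}
print(identify(np.array([0.9, 0.1, 0.2]), enrolled))   # alice
print(identify(np.array([0.5, 0.5, -0.9]), enrolled))  # unknown
```

Diarization labels like `SPEAKER_00` can be mapped to real names by running exactly this kind of lookup on an embedding from each diarized speaker's audio.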
How accurate is pyannote for speaker diarization in noisy environments?
Pyannote 3.1 achieves strong results in clean recordings (under 5% diarization error rate) but degrades in noisy environments, overlapping speech, and phone-quality audio. For noisy recordings, preprocess with noise reduction (using libraries like noisereduce) before diarization. Also consider increasing the minimum segment duration to avoid spurious speaker switches caused by noise.
Can I classify music genres using the extracted audio features?
Yes. The MFCC features, spectral centroid, tempo, and zero crossing rate are the classic features used for genre classification. Train a simple classifier (random forest or small neural network) on a labeled dataset like GTZAN. Alternatively, skip manual feature engineering and use a pretrained audio classification model like those from Hugging Face's audio transformers, which accept raw waveforms and output genre labels directly.
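As a minimal illustration of the classical-features route, here is a nearest-centroid classifier over toy feature vectors. The genres, feature dimensions, and values are synthetic stand-ins, not real GTZAN data; in practice you would build each vector from the `AudioFeatures` fields above:

```python
import numpy as np

def nearest_centroid(train: dict[str, np.ndarray], sample: np.ndarray) -> str:
    """Classify a feature vector by its nearest class centroid (Euclidean)."""
    centroids = {genre: feats.mean(axis=0) for genre, feats in train.items()}
    return min(centroids,
               key=lambda g: float(np.linalg.norm(sample - centroids[g])))

rng = np.random.default_rng(42)
# Toy 3-D feature vectors, e.g. [tempo, spectral centroid, ZCR], normalized
train = {
    "rock": rng.normal([0.8, 0.6, 0.4], 0.05, size=(20, 3)),
    "classical": rng.normal([0.3, 0.2, 0.1], 0.05, size=(20, 3)),
}
print(nearest_centroid(train, np.array([0.75, 0.55, 0.45])))  # rock
```

A random forest or small MLP will beat this baseline easily, but nearest-centroid makes the geometry of feature-based genre classification visible in a few lines.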
#AudioAnalysis #SpeakerDiarization #SoundClassification #AudioFeatures #Python #AgenticAI #LearnAI #AIEngineering