Building a Unified AI Agent API: One API for Chat, Voice, and Task Agents
Design a single unified API that serves chat, voice, and task-based AI agents through a common interface. Learn channel abstraction, response normalization, and how to handle the unique requirements of each modality without code duplication.
The Problem with Separate Agent APIs
Many organizations start with one API for their chatbot, another for their voice agent, and yet another for task automation. Each API has its own authentication, session management, error handling, and data models. Within months, you are maintaining three codebases that do fundamentally the same thing — send user input to an AI agent and return a response — but with incompatible interfaces.
A unified API consolidates these into a single interface with channel-specific adapters. The core logic — agent routing, conversation management, tool execution — lives in one place. Channel-specific concerns like voice transcription or chat formatting are handled at the edges.
The Unified Request Model
Design a request model that accommodates all channels through a common structure with channel-specific extensions:
from pydantic import BaseModel, Field
from typing import Any, Optional, Literal
from enum import Enum


class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"
    EMAIL = "email"


class InputContent(BaseModel):
    text: Optional[str] = None
    audio_url: Optional[str] = None
    audio_base64: Optional[str] = None
    attachments: list[dict] = Field(default_factory=list)


class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)
    response_format: Literal["text", "ssml", "audio", "structured"] = "text"
    stream: bool = False


class ToolCallOutput(BaseModel):
    call_id: str
    tool_name: str
    arguments: dict[str, Any]


class UnifiedResponse(BaseModel):
    session_id: str
    agent_id: str
    channel: Channel
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None
    tool_calls: list[ToolCallOutput] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)
    usage: dict[str, int] = Field(default_factory=dict)
A chat client sends {"channel": "chat", "input": {"text": "Hello"}}. A voice client sends {"channel": "voice", "input": {"audio_base64": "..."}}. A task agent sends {"channel": "task", "input": {"text": "Analyze this dataset"}}. The same endpoint handles all three.
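To make the "same endpoint handles all three" claim concrete, here is a small sketch that validates all three payloads against trimmed copies of the models above (the session and agent IDs are placeholder values):

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


# Trimmed copies of the models above, just enough to validate three payloads.
class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"


class InputContent(BaseModel):
    text: Optional[str] = None
    audio_base64: Optional[str] = None


class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent


payloads = [
    {"channel": "chat", "session_id": "s1", "agent_id": "support",
     "input": {"text": "Hello"}},
    {"channel": "voice", "session_id": "s2", "agent_id": "support",
     "input": {"audio_base64": "UklGRg=="}},
    {"channel": "task", "session_id": "s3", "agent_id": "analyst",
     "input": {"text": "Analyze this dataset"}},
]
requests = [UnifiedRequest(**p) for p in payloads]
channels = [r.channel.value for r in requests]
```

One model, one endpoint: the only per-channel variation is which `input` fields are populated.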
Channel Adapters
Each channel has distinct preprocessing and postprocessing needs. Adapters handle these transformations:
from abc import ABC, abstractmethod


class ChannelAdapter(ABC):
    @abstractmethod
    async def preprocess(self, request: UnifiedRequest) -> str:
        """Convert channel-specific input to plain text for the agent."""

    @abstractmethod
    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        """Convert agent text output to channel-specific format."""


class ChatAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        return {"text": text}


class VoiceAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        if request.input.audio_base64:
            return await transcribe_audio(request.input.audio_base64)
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "ssml":
            return {"ssml": text_to_ssml(text)}
        if request.response_format == "audio":
            audio_url = await synthesize_speech(text)
            return {"audio_url": audio_url, "text": text}
        return {"text": text}


class TaskAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # Tasks may include structured instructions and attachments
        parts = [request.input.text or ""]
        for attachment in request.input.attachments:
            parts.append(f"[Attachment: {attachment.get('name', 'file')}]")
        return "\n".join(parts)

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "structured":
            return {"text": text, "metadata": {"structured": True}}
        return {"text": text}
ADAPTERS: dict[Channel, ChannelAdapter] = {
    Channel.CHAT: ChatAdapter(),
    Channel.VOICE: VoiceAdapter(),
    Channel.TASK: TaskAdapter(),
    # Email has no special transport needs, so it reuses the plain-text chat adapter
    Channel.EMAIL: ChatAdapter(),
}
The Unified Endpoint
The main endpoint delegates to the appropriate adapter, runs the agent, and normalizes the response:
from fastapi import FastAPI

app = FastAPI(title="Unified Agent API")


@app.post("/v1/agent/invoke")
async def invoke_agent(request: UnifiedRequest) -> UnifiedResponse:
    adapter = ADAPTERS[request.channel]

    # Preprocess: convert channel input to text
    user_text = await adapter.preprocess(request)

    # Load conversation history
    history = await get_session_messages(request.session_id)

    # Run the agent
    agent_result = await run_agent(
        agent_id=request.agent_id,
        user_message=user_text,
        history=history,
        context=request.context,
    )

    # Postprocess: convert text to channel-appropriate format
    output = await adapter.postprocess(agent_result["text"], request)

    # Save to session history
    await save_message(request.session_id, "user", user_text)
    await save_message(request.session_id, "assistant", agent_result["text"])

    return UnifiedResponse(
        session_id=request.session_id,
        agent_id=request.agent_id,
        channel=request.channel,
        tool_calls=[
            ToolCallOutput(**tc) for tc in agent_result.get("tool_calls", [])
        ],
        usage=agent_result.get("usage", {}),
        **output,
    )
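The endpoint also relies on get_session_messages and save_message, which are left undefined above. An in-memory sketch that matches those call sites (a production deployment would back this with Redis or a database rather than a process-local dict):

```python
import asyncio
from collections import defaultdict

# session_id -> ordered list of {"role": ..., "content": ...} messages.
_SESSIONS: dict[str, list[dict]] = defaultdict(list)


async def get_session_messages(session_id: str) -> list[dict]:
    # Return a copy so callers can't mutate stored history in place
    return list(_SESSIONS[session_id])


async def save_message(session_id: str, role: str, content: str) -> None:
    _SESSIONS[session_id].append({"role": role, "content": content})


async def _demo() -> list[dict]:
    await save_message("s1", "user", "Hello")
    await save_message("s1", "assistant", "Hi there!")
    return await get_session_messages("s1")


history = asyncio.run(_demo())
```

Because sessions are keyed only by session_id, a conversation started in chat can be continued over voice with the same history.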
Streaming Across Channels
Streaming works differently per channel. Chat needs Server-Sent Events. Voice needs audio chunks. Tasks may not need streaming at all:
from fastapi.responses import StreamingResponse
import json


@app.post("/v1/agent/stream")
async def stream_agent(request: UnifiedRequest):
    adapter = ADAPTERS[request.channel]
    user_text = await adapter.preprocess(request)
    history = await get_session_messages(request.session_id)

    async def event_stream():
        full_text = ""
        async for chunk in stream_agent_response(
            agent_id=request.agent_id,
            user_message=user_text,
            history=history,
        ):
            full_text += chunk["text"]
            output = await adapter.postprocess(chunk["text"], request)
            event_data = json.dumps({
                "session_id": request.session_id,
                "chunk": output,
                "done": chunk.get("done", False),
            })
            yield f"data: {event_data}\n\n"

        await save_message(request.session_id, "user", user_text)
        await save_message(request.session_id, "assistant", full_text)

    return StreamingResponse(event_stream(), media_type="text/event-stream")
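On the client side, each SSE frame is a `data:` line followed by a blank line. A minimal parser for the frames this endpoint emits (the sample payloads are illustrative; a production client would use an SSE library and handle partial reads):

```python
import json


def parse_sse_events(raw: str) -> list[dict]:
    """Split a raw SSE stream into its decoded JSON payloads."""
    events = []
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events


raw_stream = (
    'data: {"session_id": "s1", "chunk": {"text": "Hel"}, "done": false}\n\n'
    'data: {"session_id": "s1", "chunk": {"text": "lo"}, "done": true}\n\n'
)
events = parse_sse_events(raw_stream)
full_text = "".join(e["chunk"]["text"] for e in events)
```

The `done` flag on the final event tells the client when to stop reading and render the assembled response.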
FAQ
How do I handle channel-specific features like voice barge-in or chat typing indicators?
Add channel-specific metadata to the context field of the request and response. For voice barge-in, the client sends {"context": {"voice_barge_in": true}}. The voice adapter checks this flag and adjusts response behavior. Keep these features in the adapter layer, not in core agent logic.
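As a sketch, the adapter-layer check might look like this (the `voice_barge_in` key and the `interruptible` response field are illustrative assumptions, not part of the models above):

```python
def apply_voice_options(text: str, context: dict) -> dict:
    """Channel-specific behavior stays in the adapter layer."""
    response = {"text": text}
    if context.get("voice_barge_in"):
        # Barge-in enabled: mark the response as interruptible so the
        # telephony layer can cut playback when the caller starts speaking.
        response["interruptible"] = True
    return response


normal = apply_voice_options("One moment, please.", {})
barge = apply_voice_options("One moment, please.", {"voice_barge_in": True})
```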
Should the unified API normalize all responses to text, or preserve rich formats?
Always generate text as the canonical format, then let adapters transform it. The agent produces text. The chat adapter returns it as-is. The voice adapter converts it to SSML or audio. The task adapter may parse it into structured JSON. This keeps agent logic channel-agnostic.
How do I route to different agent implementations based on channel?
Add routing logic in the endpoint that selects the agent based on both agent_id and channel. A customer service agent might use a faster model for chat and a more capable model for complex task requests. Store this mapping in configuration rather than code.
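A sketch of that configuration-driven routing (the agent IDs and model names here are placeholders; in practice the table would be loaded from a config file or database):

```python
# (agent_id, channel) -> model name.
AGENT_MODEL_ROUTES: dict[tuple[str, str], str] = {
    ("customer-service", "chat"): "fast-model",
    ("customer-service", "voice"): "fast-model",
    ("customer-service", "task"): "capable-model",
}
DEFAULT_MODEL = "fast-model"


def resolve_model(agent_id: str, channel: str) -> str:
    # Fall back to the default when no route is configured
    return AGENT_MODEL_ROUTES.get((agent_id, channel), DEFAULT_MODEL)
```

Because the lookup key includes the channel, adding a new modality never requires touching existing routes.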
CallSphere Team
Expert insights on AI voice agents and customer communication automation.