Real-Time AI Applications: Streaming, WebSockets, and Low-Latency Patterns
Building real-time AI applications with Claude -- SSE streaming, WebSocket bidirectional chat, and production latency optimization.
Why Streaming Matters
A non-streaming response can take 15-30 seconds with nothing visible until it completes. Streaming shows the first token within 1-2 seconds. Total completion time is identical, but perceived performance is dramatically better.
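On the wire, a streamed response is just a sequence of Server-Sent Events: each token arrives as a `data: ...` frame, and a sentinel frame marks the end. A minimal sketch of client-side frame parsing (`parse_sse_lines` is a hypothetical helper operating on already-decoded lines; it assumes the `data: [DONE]` sentinel used by the endpoint in this section):

```python
def parse_sse_lines(lines):
    """Collect SSE data payloads until the [DONE] sentinel."""
    chunks = []
    for line in lines:
        if not line.startswith('data: '):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len('data: '):]
        if payload == '[DONE]':
            break
        chunks.append(payload)  # render each chunk as it arrives
    return chunks

# Frames as the server would emit them, already split on blank lines:
print(''.join(parse_sse_lines(['data: Hel', 'data: lo', 'data: [DONE]'])))
# prints "Hello"
```

In a browser, `EventSource` does this parsing for you; the sketch above is only needed for non-browser clients reading the raw stream.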
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
# AsyncAnthropic keeps the event loop free while tokens stream in
client = anthropic.AsyncAnthropic()

async def stream_generator(prompt: str):
    # Forward each token to the client as an SSE data frame
    async with client.messages.stream(
        model='claude-sonnet-4-6',
        max_tokens=2048,
        messages=[{'role': 'user', 'content': prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield f'data: {text}\n\n'
    yield 'data: [DONE]\n\n'

@app.post('/stream')
async def stream_endpoint(req: dict):
    return StreamingResponse(
        stream_generator(req['prompt']),
        media_type='text/event-stream',
        # X-Accel-Buffering: no stops nginx from buffering the stream
        headers={'Cache-Control': 'no-cache', 'X-Accel-Buffering': 'no'},
    )
```

Latency Optimization
- Reduce input tokens: compress system prompts to reduce time-to-first-token
- Prompt caching: cached tokens process 10x faster
- Stream to client immediately: no server-side buffering before forwarding
- Model selection: Haiku first token in ~200ms vs ~500ms for Sonnet
- Parallelize: run independent LLM calls concurrently
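The last bullet can be sketched with `asyncio.gather`: fire independent calls concurrently and total latency approaches the slowest call, not the sum. `fake_llm_call` here is a hypothetical stand-in for a real model request:

```python
import asyncio
import time

async def fake_llm_call(name: str, delay: float) -> str:
    # Stand-in for an independent model request (hypothetical;
    # swap in real AsyncAnthropic calls in practice)
    await asyncio.sleep(delay)
    return name

async def run_parallel() -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather() runs all three awaitables concurrently
    results = await asyncio.gather(
        fake_llm_call('summary', 0.20),
        fake_llm_call('tags', 0.15),
        fake_llm_call('title', 0.10),
    )
    return list(results), time.perf_counter() - start

results, elapsed = asyncio.run(run_parallel())
# elapsed tracks the slowest call (~0.2 s), not the sum (~0.45 s)
```

The same pattern applies to fan-out work like generating a summary, tags, and a title for one document in parallel.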