
Real-Time AI Applications: Streaming, WebSockets, and Low-Latency Patterns

Building real-time AI applications with Claude -- SSE streaming, WebSocket bidirectional chat, and production latency optimization.

Why Streaming Matters

A long non-streaming response can take 15-30 seconds with nothing visible on screen. Streaming shows the first token within 1-2 seconds. Total completion time is identical, but perceived latency is dramatically better.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()  # async client so streaming doesn't block the event loop

async def stream_generator(prompt: str):
    # Stream tokens from Claude and re-emit each one as an SSE 'data:' frame
    async with client.messages.stream(
        model='claude-sonnet-4-6', max_tokens=2048,
        messages=[{'role': 'user', 'content': prompt}]
    ) as stream:
        async for text in stream.text_stream:
            yield f'data: {text}\n\n'
    yield 'data: [DONE]\n\n'  # sentinel so clients know the stream ended

@app.post('/stream')
async def stream_endpoint(req: dict):
    return StreamingResponse(stream_generator(req['prompt']),
        media_type='text/event-stream',
        headers={'Cache-Control': 'no-cache', 'X-Accel-Buffering': 'no'})
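On the client side, a browser's EventSource handles SSE framing for you, but any other consumer has to split the byte stream into frames itself. A minimal sketch of that parsing, assuming the frame shape and `[DONE]` sentinel emitted by the server above (the `parse_sse` helper name is ours, not part of any library):

```python
def parse_sse(raw: str):
    """Yield the payload of each 'data:' frame in an SSE stream.

    SSE frames are separated by a blank line; each frame here is a
    single 'data: ...' line, matching the server's yield format.
    """
    for frame in raw.split('\n\n'):
        if frame.startswith('data: '):
            payload = frame[len('data: '):]
            if payload == '[DONE]':  # server's end-of-stream sentinel
                return
            yield payload

# Reassemble the streamed text as frames arrive
chunks = 'data: Hello\n\ndata:  world\n\ndata: [DONE]\n\n'
print(''.join(parse_sse(chunks)))  # -> Hello world
```

A real client would feed network chunks into a buffer and parse frames as blank-line boundaries appear, since a single read can end mid-frame.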

Latency Optimization

  • Reduce input tokens: compress system prompts to reduce time-to-first-token
  • Prompt caching: cached tokens process 10x faster
  • Stream to client immediately: no server-side buffering before forwarding
  • Model selection: Haiku first token in ~200ms vs ~500ms for Sonnet
  • Parallelize: run independent LLM calls concurrently
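The last point is worth a sketch: with the async client, independent calls can run under `asyncio.gather`, so total latency tracks the slowest call rather than the sum. The `ask` stub below stands in for a real awaited API call, and its delays are invented for illustration:

```python
import asyncio
import time

async def ask(prompt: str, delay: float) -> str:
    # Stand-in for an awaited LLM call; `delay` simulates model latency.
    await asyncio.sleep(delay)
    return f'answer to {prompt!r}'

async def main() -> float:
    start = time.perf_counter()
    # Independent prompts fire concurrently instead of back-to-back.
    answers = await asyncio.gather(
        ask('summarize', 0.3),
        ask('classify', 0.2),
        ask('extract entities', 0.25),
    )
    elapsed = time.perf_counter() - start
    print(f'{len(answers)} answers in {elapsed:.2f}s')  # ~0.3s, not 0.75s
    return elapsed

elapsed = asyncio.run(main())
```

Only use this for calls with no data dependency between them; a call that needs another call's output still has to await it sequentially.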
