Designing Streaming APIs for LLM Applications: SSE, WebSockets, and HTTP Chunked Transfer
Learn how to choose and implement the right streaming protocol for LLM applications. Covers Server-Sent Events, WebSockets, and HTTP chunked transfer with FastAPI code examples and error handling strategies.
Why LLM Applications Need Streaming
Large language models generate tokens sequentially, often taking several seconds to produce a complete response. Without streaming, users stare at a blank screen until the entire response is ready. Streaming lets you push tokens to the client as they are generated, dramatically improving perceived latency and user experience.
Three protocols dominate the streaming landscape for LLM applications: Server-Sent Events (SSE), WebSockets, and HTTP chunked transfer encoding. Each comes with distinct tradeoffs in complexity, browser support, and bidirectional capability.
Server-Sent Events: The Default Choice
SSE is a unidirectional protocol built on top of standard HTTP. The server pushes a stream of events over a long-lived connection. It is the protocol OpenAI, Anthropic, and most LLM providers use for their streaming endpoints.
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI()

async def generate_tokens(prompt: str):
    """Simulate LLM token generation."""
    words = ["Hello", " there!", " I", " am", " an", " AI", " assistant."]
    for token in words:
        yield token
        await asyncio.sleep(0.1)

@app.post("/v1/chat/completions")
async def stream_chat(request: dict):
    prompt = request.get("prompt", "")

    async def event_stream():
        async for token in generate_tokens(prompt):
            chunk = {
                "choices": [{"delta": {"content": token}}],
                "finish_reason": None,
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
```
The X-Accel-Buffering: no header tells reverse proxies like Nginx to disable response buffering, which is critical for real-time streaming. The Cache-Control: no-cache header prevents intermediaries from caching the stream.
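On the client side, consuming this stream means reading line by line, stripping the data: prefix, and stopping at the [DONE] sentinel. A minimal parser sketch (the name parse_sse_tokens is ours, not part of any library; in practice you would feed it lines from an incremental HTTP client such as httpx.stream):

```python
import json
from typing import Iterable, Iterator

def parse_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token deltas from raw SSE lines, stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comment lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        yield chunk["choices"][0]["delta"].get("content", "")
```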
WebSockets: When You Need Bidirectional Communication
WebSockets provide full-duplex communication over a single TCP connection. Use WebSockets when the client needs to send data during generation, such as cancellation signals, follow-up context, or tool results mid-stream.
```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import json

app = FastAPI()

@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            prompt = data.get("prompt", "")
            async for token in generate_tokens(prompt):
                await websocket.send_json({
                    "type": "token",
                    "content": token,
                })
            await websocket.send_json({
                "type": "done",
                "usage": {"prompt_tokens": 10, "completion_tokens": 7},
            })
    except WebSocketDisconnect:
        pass
```
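The handler above streams each reply to completion. The bidirectional payoff comes when the client can interrupt mid-generation; one way to sketch that is to listen for a cancel message concurrently with generation. The message shape {"type": "cancel"} and the helper stream_with_cancel are our assumptions, not a FastAPI API, and the inline generate_tokens stands in for the generator defined earlier:

```python
import asyncio

async def generate_tokens(prompt: str):
    # Stand-in for the token generator from the earlier examples.
    for token in ["Hello", " there!"]:
        yield token
        await asyncio.sleep(0)

async def stream_with_cancel(websocket, prompt: str):
    """Send tokens while concurrently listening for a client cancel message."""
    listener = asyncio.create_task(websocket.receive_json())
    try:
        async for token in generate_tokens(prompt):
            # If the client sent {"type": "cancel"}, stop generating early.
            if listener.done() and listener.result().get("type") == "cancel":
                await websocket.send_json({"type": "cancelled"})
                return
            await websocket.send_json({"type": "token", "content": token})
        await websocket.send_json({"type": "done"})
    finally:
        listener.cancel()
        try:
            await listener
        except asyncio.CancelledError:
            pass
```

Because the websocket argument is duck-typed, the same coroutine works with FastAPI's WebSocket object or a test double.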
HTTP Chunked Transfer: The Simplest Approach
HTTP chunked transfer encoding sends the response body in chunks without knowing the total size upfront. It requires no special protocol support, works everywhere HTTP works, and is the simplest to implement. The downside is that it lacks the structured event format of SSE and the bidirectionality of WebSockets.
```python
@app.post("/v1/generate")
async def chunked_generate(request: dict):
    async def chunked_response():
        async for token in generate_tokens(request.get("prompt", "")):
            yield token

    return StreamingResponse(
        chunked_response(),
        media_type="text/plain",
    )
```
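Under the hood, chunked transfer encoding frames each piece of the body with its size in hexadecimal and terminates the stream with a zero-length chunk. A small illustration of the wire format (frame_chunked is a hypothetical helper; real servers such as uvicorn apply this framing for you):

```python
def frame_chunked(chunks):
    """Encode byte chunks using HTTP/1.1 chunked framing:
    hex size, CRLF, data, CRLF, then a zero-length terminating chunk."""
    framed = b""
    for chunk in chunks:
        framed += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    return framed + b"0\r\n\r\n"
```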
Error Handling During Streams
Errors during streaming are tricky because HTTP status codes are sent before the body. Once the stream starts, you cannot change the status code. The standard pattern is to embed errors inside the stream itself.
```python
async def safe_event_stream(prompt: str):
    try:
        async for token in generate_tokens(prompt):
            chunk = {"choices": [{"delta": {"content": token}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
    except Exception as e:
        error_event = {
            "error": {
                "message": str(e),
                "type": "stream_error",
                "code": "generation_failed",
            }
        }
        yield f"data: {json.dumps(error_event)}\n\n"
    finally:
        yield "data: [DONE]\n\n"
```
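The flip side is that clients must check each event for an error key before treating it as a token delta. A sketch of that client-side branching (read_stream_events is our name, matching the error shape emitted above):

```python
import json

def read_stream_events(lines):
    """Split SSE events into token deltas and in-stream error payloads."""
    tokens, errors = [], []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            errors.append(event["error"])  # surfaced mid-stream, not via status code
        else:
            tokens.append(event["choices"][0]["delta"].get("content", ""))
    return tokens, errors
```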
Protocol Selection Guide
Choose SSE when your application follows a request-response pattern where the client sends a prompt and receives a streamed response. It has automatic reconnection built into the browser EventSource API and works behind most proxies without configuration.
Choose WebSockets when you need the client to send cancellation signals, provide tool call results during generation, or maintain a persistent conversational session with server-push notifications.
Choose HTTP chunked transfer when you need maximum compatibility, your consumers are backend services rather than browsers, or you are building internal microservice communication.
FAQ
When should I use SSE over WebSockets for LLM streaming?
Use SSE when your pattern is unidirectional: the client sends a prompt and the server streams back tokens. SSE is simpler to implement, works through HTTP proxies without special configuration, has built-in browser reconnection via EventSource, and uses standard HTTP semantics for authentication. Most production LLM APIs, including OpenAI and Anthropic, use SSE.
How do I handle connection drops during a long LLM stream?
For SSE, include an id field with each event. The browser EventSource API sends the last received ID in a Last-Event-ID header on reconnection, letting your server resume from where it left off. For WebSockets, implement application-level heartbeats and reconnection logic with exponential backoff. In both cases, cache partial generation state on the server keyed by a request ID so you can resume.
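Resumption along these lines can be sketched server-side: cache the tokens emitted for each request, number every event, and replay only the events past the client's Last-Event-ID on reconnect (TOKEN_CACHE and events_since are illustrative names, not a standard API):

```python
import json

# Hypothetical per-request cache of tokens emitted so far.
TOKEN_CACHE: dict = {}

def events_since(request_id: str, last_event_id=None):
    """Yield numbered SSE events for tokens the client has not yet received."""
    tokens = TOKEN_CACHE.get(request_id, [])
    start = last_event_id + 1 if last_event_id is not None else 0
    for i in range(start, len(tokens)):
        chunk = {"choices": [{"delta": {"content": tokens[i]}}]}
        yield f"id: {i}\ndata: {json.dumps(chunk)}\n\n"
```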
Why does my SSE stream appear to arrive all at once instead of token by token?
This is almost always caused by response buffering in a reverse proxy (Nginx, AWS ALB, Cloudflare) or in your application server. Set the X-Accel-Buffering: no header for Nginx, disable proxy buffering in your load balancer, and ensure your ASGI server (uvicorn) is not batching output. Also check that your client is reading the stream incrementally rather than awaiting the full response.
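For Nginx specifically, a location block along these lines turns off buffering for the streaming path (the upstream name and timeout values are illustrative):

```nginx
location /v1/ {
    proxy_pass http://llm_app;
    proxy_http_version 1.1;
    proxy_buffering off;       # flush upstream bytes to the client immediately
    proxy_cache off;
    proxy_read_timeout 300s;   # allow long generations without gateway timeouts
}
```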
CallSphere Team
Expert insights on AI voice agents and customer communication automation.