Designing Streaming APIs for LLM Applications: SSE, WebSockets, and HTTP Chunked Transfer
Learn how to choose and implement the right streaming protocol for LLM applications. Covers Server-Sent Events, WebSockets, and HTTP chunked transfer with FastAPI code examples and error handling strategies.
Why LLM Applications Need Streaming
Large language models generate tokens sequentially, often taking several seconds to produce a complete response. Without streaming, users stare at a blank screen until the entire response is ready. Streaming lets you push tokens to the client as they are generated, dramatically improving perceived latency and user experience.
Three protocols dominate the streaming landscape for LLM applications: Server-Sent Events (SSE), WebSockets, and HTTP chunked transfer encoding. Each comes with distinct tradeoffs in complexity, browser support, and bidirectional capability.
Server-Sent Events: The Default Choice
SSE is a unidirectional protocol built on top of standard HTTP. The server pushes a stream of events over a long-lived connection. It is the protocol OpenAI, Anthropic, and most LLM providers use for their streaming endpoints.
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI()

async def generate_tokens(prompt: str):
    """Simulate LLM token generation."""
    words = ["Hello", " there!", " I", " am", " an", " AI", " assistant."]
    for token in words:
        yield token
        await asyncio.sleep(0.1)

@app.post("/v1/chat/completions")
async def stream_chat(request: dict):
    prompt = request.get("prompt", "")

    async def event_stream():
        async for token in generate_tokens(prompt):
            chunk = {
                "choices": [{"delta": {"content": token}}],
                "finish_reason": None,
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
```
The X-Accel-Buffering: no header tells reverse proxies like Nginx to disable response buffering, which is critical for real-time streaming. The Cache-Control: no-cache header prevents intermediaries from caching the stream.
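On the client side, consuming this stream means reading line by line, stripping the data: prefix, and stopping at the [DONE] sentinel. A minimal parser sketch (the name parse_sse_tokens is ours, not part of any library; in practice you would feed it lines from an incremental HTTP client such as httpx.stream):

```python
import json
from typing import Iterable, Iterator

def parse_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token deltas from raw SSE lines, stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comment lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        yield chunk["choices"][0]["delta"].get("content", "")
```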
WebSockets: When You Need Bidirectional Communication
WebSockets provide full-duplex communication over a single TCP connection. Use WebSockets when the client needs to send data during generation, such as cancellation signals, follow-up context, or tool results mid-stream.
```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import json

app = FastAPI()

@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            prompt = data.get("prompt", "")
            async for token in generate_tokens(prompt):
                await websocket.send_json({
                    "type": "token",
                    "content": token,
                })
            await websocket.send_json({
                "type": "done",
                "usage": {"prompt_tokens": 10, "completion_tokens": 7},
            })
    except WebSocketDisconnect:
        pass
```
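The handler above streams each reply to completion. The bidirectional payoff comes when the client can interrupt mid-generation; one way to sketch that is to listen for a cancel message concurrently with generation. The message shape {"type": "cancel"} and the helper stream_with_cancel are our assumptions, not a FastAPI API, and the inline generate_tokens stands in for the generator defined earlier:

```python
import asyncio

async def generate_tokens(prompt: str):
    # Stand-in for the token generator from the earlier examples.
    for token in ["Hello", " there!"]:
        yield token
        await asyncio.sleep(0)

async def stream_with_cancel(websocket, prompt: str):
    """Send tokens while concurrently listening for a client cancel message."""
    listener = asyncio.create_task(websocket.receive_json())
    try:
        async for token in generate_tokens(prompt):
            # If the client sent {"type": "cancel"}, stop generating early.
            if listener.done() and listener.result().get("type") == "cancel":
                await websocket.send_json({"type": "cancelled"})
                return
            await websocket.send_json({"type": "token", "content": token})
        await websocket.send_json({"type": "done"})
    finally:
        listener.cancel()
        try:
            await listener
        except asyncio.CancelledError:
            pass
```

Because the websocket argument is duck-typed, the same coroutine works with FastAPI's WebSocket object or a test double.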
HTTP Chunked Transfer: The Simplest Approach
HTTP chunked transfer encoding sends the response body in chunks without knowing the total size upfront. It requires no special protocol support, works everywhere HTTP works, and is the simplest to implement. The downside is that it lacks the structured event format of SSE and the bidirectionality of WebSockets.
```python
@app.post("/v1/generate")
async def chunked_generate(request: dict):
    async def chunked_response():
        async for token in generate_tokens(request.get("prompt", "")):
            yield token

    return StreamingResponse(
        chunked_response(),
        media_type="text/plain",
    )
```
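Under the hood, chunked transfer encoding frames each piece of the body with its size in hexadecimal and terminates the stream with a zero-length chunk. A small illustration of the wire format (frame_chunked is a hypothetical helper; real servers such as uvicorn apply this framing for you):

```python
def frame_chunked(chunks):
    """Encode byte chunks using HTTP/1.1 chunked framing:
    hex size, CRLF, data, CRLF, then a zero-length terminating chunk."""
    framed = b""
    for chunk in chunks:
        framed += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    return framed + b"0\r\n\r\n"
```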
Error Handling During Streams
Errors during streaming are tricky because HTTP status codes are sent before the body. Once the stream starts, you cannot change the status code. The standard pattern is to embed errors inside the stream itself.
```python
async def safe_event_stream(prompt: str):
    try:
        async for token in generate_tokens(prompt):
            chunk = {"choices": [{"delta": {"content": token}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
    except Exception as e:
        error_event = {
            "error": {
                "message": str(e),
                "type": "stream_error",
                "code": "generation_failed",
            }
        }
        yield f"data: {json.dumps(error_event)}\n\n"
    finally:
        yield "data: [DONE]\n\n"
```
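The flip side is that clients must check each event for an error key before treating it as a token delta. A sketch of that client-side branching (read_stream_events is our name, matching the error shape emitted above):

```python
import json

def read_stream_events(lines):
    """Split SSE events into token deltas and in-stream error payloads."""
    tokens, errors = [], []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            errors.append(event["error"])  # surfaced mid-stream, not via status code
        else:
            tokens.append(event["choices"][0]["delta"].get("content", ""))
    return tokens, errors
```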
Protocol Selection Guide
Choose SSE when your application follows a request-response pattern where the client sends a prompt and receives a streamed response. It has automatic reconnection built into the browser EventSource API and works behind most proxies without configuration.
Choose WebSockets when you need the client to send cancellation signals, provide tool call results during generation, or maintain a persistent conversational session with server-push notifications.
Choose HTTP chunked transfer when you need maximum compatibility, your consumers are backend services rather than browsers, or you are building internal microservice communication.
FAQ
When should I use SSE over WebSockets for LLM streaming?
Use SSE when your pattern is unidirectional: the client sends a prompt and the server streams back tokens. SSE is simpler to implement, works through HTTP proxies without special configuration, has built-in browser reconnection via EventSource, and uses standard HTTP semantics for authentication. Most production LLM APIs, including OpenAI and Anthropic, use SSE.
How do I handle connection drops during a long LLM stream?
For SSE, include an id field with each event. The browser EventSource API sends the last received ID in a Last-Event-ID header on reconnection, letting your server resume from where it left off. For WebSockets, implement application-level heartbeats and reconnection logic with exponential backoff. In both cases, cache partial generation state on the server keyed by a request ID so you can resume.
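Resumption along these lines can be sketched server-side: cache the tokens emitted for each request, number every event, and replay only the events past the client's Last-Event-ID on reconnect (TOKEN_CACHE and events_since are illustrative names, not a standard API):

```python
import json

# Hypothetical per-request cache of tokens emitted so far.
TOKEN_CACHE: dict = {}

def events_since(request_id: str, last_event_id=None):
    """Yield numbered SSE events for tokens the client has not yet received."""
    tokens = TOKEN_CACHE.get(request_id, [])
    start = last_event_id + 1 if last_event_id is not None else 0
    for i in range(start, len(tokens)):
        chunk = {"choices": [{"delta": {"content": tokens[i]}}]}
        yield f"id: {i}\ndata: {json.dumps(chunk)}\n\n"
```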
Why does my SSE stream appear to arrive all at once instead of token by token?
This is almost always caused by response buffering in a reverse proxy (Nginx, AWS ALB, Cloudflare) or in your application server. Set the X-Accel-Buffering: no header for Nginx, disable proxy buffering in your load balancer, and ensure your ASGI server (uvicorn) is not batching output. Also check that your client is reading the stream incrementally rather than awaiting the full response.
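For Nginx specifically, a location block along these lines turns off buffering for the streaming path (the upstream name and timeout values are illustrative):

```nginx
location /v1/ {
    proxy_pass http://llm_app;
    proxy_http_version 1.1;
    proxy_buffering off;       # flush upstream bytes to the client immediately
    proxy_cache off;
    proxy_read_timeout 300s;   # allow long generations without gateway timeouts
}
```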
CallSphere Team
Expert insights on AI voice agents and customer communication automation.