
Typing Indicators and Streaming: Making Chat Agents Feel Responsive

Implement SSE streaming, token-by-token display, typing simulation delays, and progressive loading states that make chat agents feel fast and human-like even when LLM inference takes several seconds.

The Perception of Speed

Users judge chat agents on perceived responsiveness, not actual latency. A chat agent that shows nothing for 3 seconds then dumps a full paragraph feels slow. An agent that immediately shows a typing indicator, then streams text word by word, feels fast — even if the total time to completion is identical. Research on human-computer interaction consistently shows that progressive feedback reduces perceived wait times by 30-40%.

This article covers three techniques: typing indicators for immediate feedback, SSE streaming for progressive text display, and simulation strategies that make the experience feel natural.

Typing Indicators: Instant Feedback

The moment a user sends a message, show a typing indicator. Do not wait for the LLM to start generating. This immediate feedback tells the user their message was received and processing has begun:

from fastapi import WebSocket

async def handle_message(ws: WebSocket, session_id: str, content: str):
    # Step 1: Immediately signal typing
    await ws.send_json({"type": "typing_start"})

    try:
        # Step 2: Process with the AI agent (may take 1-5 seconds)
        response = await generate_response(content, session_id)

        # Step 3: Send the response
        await ws.send_json({
            "type": "message",
            "role": "assistant",
            "content": response,
        })
    finally:
        # Step 4: Always clear the typing indicator
        await ws.send_json({"type": "typing_stop"})

On the frontend, render the typing indicator as an animated element:

function TypingIndicator({ visible }: { visible: boolean }) {
  if (!visible) return null;

  return (
    <div className="typing-indicator" role="status" aria-label="Agent is typing">
      <span className="dot" />
      <span className="dot" />
      <span className="dot" />
    </div>
  );
}

Style the dots with a staggered CSS animation:

.typing-indicator {
  display: flex;
  gap: 4px;
  padding: 12px 16px;
}
.typing-indicator .dot {
  width: 8px;
  height: 8px;
  border-radius: 50%;
  background: #888;
  animation: bounce 1.4s infinite ease-in-out;
}
.typing-indicator .dot:nth-child(2) { animation-delay: 0.2s; }
.typing-indicator .dot:nth-child(3) { animation-delay: 0.4s; }
@keyframes bounce {
  0%, 80%, 100% { transform: scale(0.6); opacity: 0.4; }
  40% { transform: scale(1); opacity: 1; }
}

SSE Streaming: Token-by-Token Display

For longer responses, streaming tokens as they are generated is dramatically better than waiting for the full response. Use Server-Sent Events for HTTP-based streaming or WebSocket messages for socket-based architectures:

from fastapi import WebSocket
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(ws: WebSocket, messages: list[dict]):
    await ws.send_json({"type": "stream_start"})

    full_response = ""
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            full_response += delta
            await ws.send_json({
                "type": "stream_delta",
                "content": delta,
            })

    await ws.send_json({
        "type": "stream_end",
        "content": full_response,
    })

    return full_response
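The same deltas map directly onto SSE for plain HTTP: each token becomes a `data:` frame on a `text/event-stream` response. A minimal sketch of the framing — the `format_sse` helper and the generator wiring are illustrative, not part of any library (in FastAPI you would wrap the generator in a `StreamingResponse` with `media_type="text/event-stream"`):

```python
import json
from typing import AsyncIterator

def format_sse(event: str, data: dict) -> str:
    """Frame a payload as a Server-Sent Event: an `event:` line,
    a `data:` line carrying JSON, and a blank line ending the event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

async def sse_stream(deltas: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield one SSE frame per token delta, then a final stream_end
    frame carrying the accumulated response."""
    full = ""
    async for delta in deltas:
        full += delta
        yield format_sse("stream_delta", {"content": delta})
    yield format_sse("stream_end", {"content": full})
```

On the client, an `EventSource` (or a `fetch` reader) consumes these frames the same way the WebSocket handler below consumes `stream_delta` messages.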

The frontend accumulates deltas into the current message:

const [streamingContent, setStreamingContent] = useState("");
const [isStreaming, setIsStreaming] = useState(false);

function handleSocketMessage(data: any) {
  switch (data.type) {
    case "stream_start":
      setIsStreaming(true);
      setStreamingContent("");
      break;

    case "stream_delta":
      setStreamingContent((prev) => prev + data.content);
      break;

    case "stream_end":
      setIsStreaming(false);
      setMessages((prev) => [
        ...prev,
        { role: "assistant", content: data.content, id: crypto.randomUUID() },
      ]);
      setStreamingContent("");
      break;
  }
}

Simulated Typing Speed

Raw LLM streaming can be uneven — bursts of tokens arrive quickly, then a pause, then another burst. This jagged pacing feels unnatural. Smooth it out with a token buffer:

class TokenBuffer {
  private buffer: string[] = [];
  private rendering = false;
  private onToken: (token: string) => void;
  private minDelay: number;
  private maxDelay: number;

  constructor(
    onToken: (token: string) => void,
    minDelay = 15,
    maxDelay = 40,
  ) {
    this.onToken = onToken;
    this.minDelay = minDelay;
    this.maxDelay = maxDelay;
  }

  push(token: string) {
    this.buffer.push(token);
    if (!this.rendering) this.render();
  }

  private async render() {
    this.rendering = true;
    while (this.buffer.length > 0) {
      const token = this.buffer.shift()!;
      this.onToken(token);

      // Adaptive delay: faster when buffer is large (catching up),
      // slower when buffer is small (natural pacing)
      const delay = this.buffer.length > 10
        ? this.minDelay
        : this.minDelay + Math.random() * (this.maxDelay - this.minDelay);

      await new Promise((r) => setTimeout(r, delay));
    }
    this.rendering = false;
  }

  flush() {
    while (this.buffer.length > 0) {
      this.onToken(this.buffer.shift()!);
    }
  }
}

The buffer introduces micro-delays between tokens. When the LLM is producing tokens faster than the buffer renders, the buffer absorbs the surplus and renders at a smooth, consistent pace. When the LLM pauses, the buffer drains and the display catches up naturally.
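The same smoothing can also live server-side, pacing the WebSocket sends themselves rather than the client renders. A minimal asyncio sketch under that assumption — the `send` callback, delay values, and `None` sentinel are illustrative choices, not a fixed protocol:

```python
import asyncio
import random
from typing import Awaitable, Callable

async def paced_forward(
    tokens: asyncio.Queue,
    send: Callable[[str], Awaitable[None]],
    min_delay: float = 0.015,
    max_delay: float = 0.040,
) -> None:
    """Drain a token queue at a smoothed pace: minimum delay while a
    backlog exists (catching up), jittered toward max_delay once the
    queue is nearly empty. A None sentinel on the queue ends the loop."""
    while True:
        token = await tokens.get()
        if token is None:  # sentinel: the producer is done
            return
        await send(token)
        delay = (
            min_delay
            if tokens.qsize() > 10
            else min_delay + random.random() * (max_delay - min_delay)
        )
        await asyncio.sleep(delay)
```

The producer (the LLM stream handler) pushes tokens onto the queue as fast as they arrive; this consumer forwards them at a steady rate, mirroring the TypeScript buffer's adaptive-delay logic.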

Progressive Loading States

Different operations deserve different feedback. Use staged indicators that reflect what the agent is actually doing:

async def process_with_status(ws: WebSocket, message: str, session_id: str):
    # Stage 1: Thinking
    await ws.send_json({"type": "status", "message": "Thinking..."})

    intent = await classify_intent(message)

    if intent == "data_lookup":
        # Stage 2: Searching
        await ws.send_json({"type": "status", "message": "Looking up your account..."})
        data = await fetch_account_data(session_id)

        # Stage 3: Composing
        await ws.send_json({"type": "status", "message": "Preparing response..."})
        response = await compose_response(message, data)
    else:
        response = await generate_response(message)

    await ws.send_json({"type": "status", "message": None})
    return response

The frontend renders these status messages in place of the generic typing indicator, giving users transparency into what the agent is doing. "Looking up your account..." is far more reassuring than three bouncing dots when someone is asking about a billing issue.

FAQ

How do I handle streaming when the agent needs to call tools mid-response?

Break the response into segments. Stream the initial text, pause streaming when the tool call starts (show a status like "Checking availability..."), execute the tool, then resume streaming the rest of the response. On the frontend, concatenate all segments into a single message bubble. The user sees a natural flow rather than separate messages.
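That segmenting can be sketched as a single loop over mixed events. Here `llm_segments`, `send`, and `run_tool` are hypothetical callbacks standing in for your agent loop, socket, and tool executor:

```python
async def stream_with_tool(send, llm_segments, run_tool):
    """Stream pre-tool text, show a status while the tool runs, then
    stream the remainder. `llm_segments` yields ("text", str) or
    ("tool", name) events from the agent loop."""
    full = ""
    async for kind, value in llm_segments:
        if kind == "text":
            full += value
            await send({"type": "stream_delta", "content": value})
        elif kind == "tool":
            await send({"type": "status", "message": "Checking availability..."})
            await run_tool(value)
            await send({"type": "status", "message": None})
    await send({"type": "stream_end", "content": full})
    return full
```

Because every text segment is emitted as a `stream_delta`, the frontend's existing accumulator concatenates them into one bubble with no extra logic.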

What is a good token rendering speed for streaming?

Aim for 30-60 tokens per second displayed to the user. This matches comfortable reading speed and feels like a fast, fluent typist. Below 20 tokens per second feels laggy. Above 100 tokens per second is too fast to follow and defeats the purpose of streaming. Use the adaptive buffer approach to maintain consistent pacing regardless of LLM output speed.
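The target rate converts directly into the per-token delay the buffer uses — milliseconds per token is just 1000 divided by tokens per second:

```python
def delay_ms_for_rate(tokens_per_second: float) -> float:
    """Per-token render delay in milliseconds for a target display rate."""
    return 1000.0 / tokens_per_second

# 30-60 tokens/s works out to roughly 17-33 ms per token, which is
# why the TokenBuffer defaults above sit in the 15-40 ms range.
```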

Should I stream short responses or only long ones?

Set a threshold. For responses under 50 tokens, collect the full response and display it at once — streaming three words looks odd. For responses over 50 tokens, stream them. You can estimate response length from the intent: greetings and confirmations are short; explanations and recommendations are long. When in doubt, start streaming and the short response will appear almost instantly anyway.
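The threshold check is a one-liner. A minimal sketch, where the intent labels and the 50-token cutoff are illustrative defaults rather than fixed rules:

```python
# Intents that reliably produce short responses; tune for your agent.
SHORT_INTENTS = {"greeting", "confirmation", "farewell"}

def should_stream(intent: str, estimated_tokens: int, threshold: int = 50) -> bool:
    """Stream only when the response is likely long enough to benefit."""
    if intent in SHORT_INTENTS:
        return False
    return estimated_tokens >= threshold
```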


#Streaming #SSE #TypingIndicators #UX #RealTime #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
