Gemini Streaming and Real-Time Responses: Building Responsive Agent UIs
Implement Gemini streaming for real-time token delivery in agent UIs. Learn stream_generate_content, chunk handling, SSE integration with FastAPI, and building responsive chat interfaces.
Why Streaming Matters for Agent UX
When a Gemini API call takes 5-10 seconds to complete, users stare at a loading spinner wondering if something broke. Streaming delivers tokens as they are generated, typically starting within 200-500 milliseconds. The user sees the response forming in real time, which feels dramatically faster even though the total generation time is the same.
For agent applications, streaming is even more important. When your agent calls tools, the user can see "Searching for flights..." appear immediately rather than waiting for the entire tool call and response cycle to finish.
Basic Streaming
Call generate_content as usual, but pass stream=True and iterate over the result:
```python
import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Write a detailed explanation of how transformer attention works.",
    stream=True,
)

for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)
print()  # Final newline
```
Each chunk contains a portion of the response text. Chunks arrive as soon as the model generates them, so the first chunk typically appears within a few hundred milliseconds.
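Chunk handling can be exercised without an API key. The harness below times a stream of stand-in chunk objects (FakeChunk and fake_stream are test doubles, not part of the SDK) and shows why time-to-first-chunk, not total time, is the latency users actually feel:

```python
import time
from dataclasses import dataclass

@dataclass
class FakeChunk:  # stand-in for a streamed response chunk
    text: str

def fake_stream():
    # Simulates a model emitting three chunks with generation latency.
    for word in ["Streaming ", "feels ", "fast."]:
        time.sleep(0.05)
        yield FakeChunk(word)

def consume(stream):
    """Accumulate chunk text while recording time-to-first-chunk and total time."""
    start = time.monotonic()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            first = time.monotonic() - start
        parts.append(chunk.text)
    total = time.monotonic() - start
    return "".join(parts), first, total

text, ttft, total = consume(fake_stream())
print(f"first chunk after {ttft:.2f}s, done after {total:.2f}s")
```

Swapping fake_stream() for a real streaming response gives you the same two numbers for your own prompts.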
Streaming with Chat Sessions
Streaming works seamlessly with multi-turn chat:
```python
model = genai.GenerativeModel("gemini-2.0-flash")
chat = model.start_chat()

def stream_chat(message: str):
    response = chat.send_message(message, stream=True)
    full_response = []
    for chunk in response:
        if chunk.text:
            print(chunk.text, end="", flush=True)
            full_response.append(chunk.text)
    print()
    return "".join(full_response)

stream_chat("What are the main differences between REST and GraphQL?")
stream_chat("Which would you recommend for a real-time dashboard?")
```
stream_chat("What are the main differences between REST and GraphQL?")
stream_chat("Which would you recommend for a real-time dashboard?")
The chat history is maintained across streaming calls, so follow-up questions work correctly. Note that with stream=True, the model's reply is appended to the history only after the stream has been fully consumed.
Async Streaming for Web Applications
For web servers, use the async streaming interface to avoid blocking the event loop:
```python
import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

async def stream_response(prompt: str):
    response = await model.generate_content_async(
        prompt,
        stream=True,
    )
    async for chunk in response:
        if chunk.text:
            yield chunk.text
    # After iteration completes, usage metadata is available
    # via response.usage_metadata if needed.
```
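An async generator like this is consumed with `async for` from inside a running event loop. A minimal sketch of the calling pattern, using a stand-in generator (fake_stream_response is a test double, not the SDK) so it runs without credentials:

```python
import asyncio

async def fake_stream_response(prompt: str):
    # Stand-in for a streaming generator: yields text pieces asynchronously.
    for piece in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield piece

async def collect(prompt: str) -> str:
    parts = []
    async for piece in fake_stream_response(prompt):
        parts.append(piece)
    return "".join(parts)

print(asyncio.run(collect("demo")))
# Hello, world
```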
Server-Sent Events with FastAPI
Here is a complete FastAPI endpoint that streams Gemini responses to the browser using SSE:
```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import google.generativeai as genai
import json
import os

app = FastAPI()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

@app.post("/api/chat/stream")
async def chat_stream(request: Request):
    body = await request.json()
    prompt = body["message"]

    async def event_generator():
        response = await model.generate_content_async(prompt, stream=True)
        async for chunk in response:
            if chunk.text:
                data = json.dumps({"type": "text", "content": chunk.text})
                yield f"data: {data}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
```
Client-Side SSE Consumption
On the frontend, consume the stream with fetch and a stream reader. The EventSource API only supports GET requests, so it cannot send this endpoint's POST body:
This is JavaScript for the browser, included to show the full-stack pattern:

```javascript
// Browser-side consumer for the /api/chat/stream endpoint.
async function streamChat(message) {
  const response = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const text = decoder.decode(value);
    const lines = text.split('\n');

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));
        if (data.type === 'text') {
          appendToChat(data.content);
        }
      }
    }
  }
}
```
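One subtlety the simple line split glosses over: a single network read can end mid-event, so the `data:` line you parse may be incomplete. Robust clients buffer until they see the blank line that terminates an SSE event. The same parsing logic as a buffered Python helper (parse_sse is an illustrative name, handy for testing the endpoint from a script):

```python
import json

def parse_sse(chunks):
    """Yield decoded JSON payloads from an iterable of byte chunks,
    buffering so events split across chunk boundaries still parse."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8")
        # SSE events are terminated by a blank line.
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.split("\n"):
                if line.startswith("data: "):
                    yield json.loads(line[len("data: "):])

# An event split across two reads still parses correctly.
chunks = [b'data: {"type": "text", "con',
          b'tent": "hi"}\n\ndata: {"type": "done"}\n\n']
events = list(parse_sse(chunks))
print(events)
# [{'type': 'text', 'content': 'hi'}, {'type': 'done'}]
```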
Streaming with Function Calling
When streaming is combined with function calling, you receive function call chunks that signal when to execute tools:
```python
def get_stock_price(symbol: str) -> dict:
    """Get the current stock price.

    Args:
        symbol: Stock ticker symbol, e.g. 'AAPL'.
    """
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    tools=[get_stock_price],
)
chat = model.start_chat()

response = chat.send_message(
    "What is Apple's stock price?",
    stream=True,
)

for chunk in response:
    for part in chunk.parts:
        if part.function_call:
            fc = part.function_call
            print(f"Calling tool: {fc.name}({dict(fc.args)})")
            result = get_stock_price(**dict(fc.args))
            # Send result back and continue streaming
```
This allows your UI to show "Looking up AAPL stock price..." in real time while the tool executes.
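Once an agent has more than one tool, the lookup-and-execute step is worth factoring out so it can be tested without the API. A minimal sketch using a registry dict; FakeCall here is a test double standing in for part.function_call, not an SDK type:

```python
from dataclasses import dataclass, field

def get_stock_price(symbol: str) -> dict:
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

# Registry mapping tool names (as the model emits them) to callables.
TOOLS = {"get_stock_price": get_stock_price}

@dataclass
class FakeCall:  # stand-in for a streamed function_call part
    name: str
    args: dict = field(default_factory=dict)

def dispatch(call) -> dict:
    """Look up the requested tool and execute it with the model's arguments."""
    fn = TOOLS.get(call.name)
    if fn is None:
        return {"error": f"unknown tool: {call.name}"}
    return fn(**dict(call.args))

result = dispatch(FakeCall("get_stock_price", {"symbol": "AAPL"}))
print(result)
# {'symbol': 'AAPL', 'price': 198.5}
```

The unknown-tool branch matters in practice: returning an error payload lets the model recover instead of crashing the stream.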
FAQ
Does streaming affect token costs?
No. Streaming delivers the same tokens as non-streaming — it just delivers them incrementally. The total cost is identical regardless of whether you use streaming.
Can I abort a streaming response mid-way?
Yes. Simply stop iterating over the response object. The connection will be closed and no further tokens will be generated. This is useful for implementing "Stop generating" buttons in chat UIs.
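The "stop iterating" pattern can be sketched with a plain stop predicate; the names here are illustrative, and the fake chunk list stands in for a real streaming response:

```python
def stream_until(chunks, stop):
    """Consume a stream of text pieces, bailing out as soon as stop()
    returns True -- the pattern behind a 'Stop generating' button."""
    parts = []
    for text in chunks:
        if stop():
            break  # leaving the loop abandons the rest of the stream
        parts.append(text)
    return "".join(parts)

# Stand-in stream and a stop signal that fires before the third chunk.
seen = {"n": 0}
def stop_after_two():
    seen["n"] += 1
    return seen["n"] > 2

partial = stream_until(iter(["a", "b", "c", "d"]), stop_after_two)
print(partial)
# ab
```

In a web UI the stop predicate would typically check a flag flipped by the button's click handler.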
What happens if the network drops during streaming?
The iterator will raise an exception. Implement retry logic that re-sends the request. Since Gemini API calls are not resumable, you need to restart the full generation. Consider saving partial responses so the user does not lose context.
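A retry wrapper along these lines keeps the last partial text so the UI has something to show even if all attempts fail. This is a sketch with illustrative names, demonstrated against a stand-in stream that drops the connection once:

```python
import time

def stream_with_retry(make_stream, max_attempts=3):
    """Restart generation on mid-stream failure, keeping the partial
    text from the last attempt. Returns (text, error)."""
    partial = ""
    for attempt in range(max_attempts):
        try:
            parts = []
            for text in make_stream():
                parts.append(text)
            return "".join(parts), None
        except ConnectionError:
            partial = "".join(parts)  # save what arrived before the drop
            time.sleep(0.1 * (attempt + 1))  # simple linear backoff
    return partial, "gave up"

# Stand-in: fails mid-stream on the first call, succeeds on the second.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    yield "Hello "
    if state["calls"] == 1:
        raise ConnectionError("network dropped")
    yield "again"

text, err = stream_with_retry(flaky)
print(text, err)
# Hello again None
```

Because each retry restarts the full generation, the retried text can differ from the partial text already shown; UIs usually replace the partial output rather than append to it.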
#GoogleGemini #Streaming #RealTime #FastAPI #ServerSentEvents #AgenticAI #LearnAI #AIEngineering
CallSphere Team