
Gemini Streaming and Real-Time Responses: Building Responsive Agent UIs

Implement Gemini streaming for real-time token delivery in agent UIs. Learn stream_generate_content, chunk handling, SSE integration with FastAPI, and building responsive chat interfaces.

Why Streaming Matters for Agent UX

When a Gemini API call takes 5-10 seconds to complete, users stare at a loading spinner wondering if something broke. Streaming delivers tokens as they are generated, typically starting within 200-500 milliseconds. The user sees the response forming in real time, which feels dramatically faster even though the total generation time is the same.

For agent applications, streaming is even more important. When your agent calls tools, the user can see "Searching for flights..." appear immediately rather than waiting for the entire tool call and response cycle to finish.

Basic Streaming

Call generate_content as usual, but set stream=True:

import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Write a detailed explanation of how transformer attention works.",
    stream=True,
)

for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)

print()  # Final newline

Each chunk contains a portion of the response text. Chunks arrive as soon as the model generates them, so the first chunk typically appears within a few hundred milliseconds.

Streaming with Chat Sessions

Streaming works seamlessly with multi-turn chat:

model = genai.GenerativeModel("gemini-2.0-flash")
chat = model.start_chat()

def stream_chat(message: str):
    response = chat.send_message(message, stream=True)
    full_response = []

    for chunk in response:
        if chunk.text:
            print(chunk.text, end="", flush=True)
            full_response.append(chunk.text)

    print()
    return "".join(full_response)

stream_chat("What are the main differences between REST and GraphQL?")
stream_chat("Which would you recommend for a real-time dashboard?")

The chat history is maintained across streaming calls, so follow-up questions work correctly.

Async Streaming for Web Applications

For web servers, use the async streaming interface to avoid blocking the event loop:


import google.generativeai as genai
import asyncio
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

async def stream_response(prompt: str):
    response = await model.generate_content_async(
        prompt,
        stream=True,
    )

    full_text = []
    async for chunk in response:
        if chunk.text:
            full_text.append(chunk.text)
            yield chunk.text

    # After iteration, usage metadata is available
    # Access via response.usage_metadata if needed
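Consuming an async generator like stream_response looks like the sketch below. A stub generator stands in for the Gemini call here so the pattern runs without API access; in a real handler you would iterate stream_response(prompt) instead:

```python
import asyncio

# Stub standing in for stream_response above: it yields canned text
# chunks instead of calling the API.
async def fake_stream(prompt: str):
    for piece in ["Hello", ", ", "world"]:
        yield piece

async def collect(prompt: str) -> str:
    collected = []
    async for text in fake_stream(prompt):
        # In a real app, forward each piece to the client here.
        collected.append(text)
    return "".join(collected)

print(asyncio.run(collect("hi")))  # Hello, world
```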

Server-Sent Events with FastAPI

Here is a complete FastAPI endpoint that streams Gemini responses to the browser using SSE:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import google.generativeai as genai
import json
import os

app = FastAPI()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

@app.post("/api/chat/stream")
async def chat_stream(request: Request):
    body = await request.json()
    prompt = body["message"]

    async def event_generator():
        response = await model.generate_content_async(prompt, stream=True)

        async for chunk in response:
            if chunk.text:
                data = json.dumps({"type": "text", "content": chunk.text})
                yield f"data: {data}\n\n"

        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
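The "data: ...\n\n" framing inside event_generator is the core of the SSE wire format: one "data:" line per event, terminated by a blank line. A small helper (our own sse_event, not part of FastAPI) keeps that framing in one place:

```python
import json

def sse_event(payload: dict) -> str:
    """Frame a JSON payload as a single SSE data event.

    Each event is a 'data:' line followed by a blank line.
    """
    return f"data: {json.dumps(payload)}\n\n"

# The generator body then becomes:
#     yield sse_event({"type": "text", "content": chunk.text})
print(sse_event({"type": "done"}), end="")
```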

Client-Side SSE Consumption

On the frontend, consume the stream with fetch and a streaming reader (the built-in EventSource API only supports GET requests, and this endpoint is a POST):

// Browser-side JavaScript, included for the full-stack pattern
async function streamChat(message) {
    const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message }),
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // stream: true prevents multi-byte characters split across
        // network chunks from being mangled
        const text = decoder.decode(value, { stream: true });
        const lines = text.split('\n');

        for (const line of lines) {
            if (line.startsWith('data: ')) {
                const data = JSON.parse(line.slice(6));
                if (data.type === 'text') {
                    appendToChat(data.content);
                }
            }
        }
    }
}

Streaming with Function Calling

When streaming is combined with function calling, you receive function call chunks that signal when to execute tools:

def get_stock_price(symbol: str) -> dict:
    """Get the current stock price.

    Args:
        symbol: Stock ticker symbol, e.g. 'AAPL'.
    """
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    tools=[get_stock_price],
)

chat = model.start_chat()

response = chat.send_message(
    "What is Apple's stock price?",
    stream=True,
)

for chunk in response:
    for part in chunk.parts:
        if part.function_call:
            fc = part.function_call
            print(f"Calling tool: {fc.name}({dict(fc.args)})")
            result = get_stock_price(**dict(fc.args))
            # Send result back and continue streaming

This allows your UI to show "Looking up AAPL stock price..." in real time while the tool executes.
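The loop above hard-codes a single tool. As your agent grows, a dispatch table keeps the streaming loop generic; this is a sketch (dispatch_tool_call and TOOLS are our own names, not SDK API), and the result would then go back to the model with another chat.send_message(..., stream=True) call:

```python
def get_stock_price(symbol: str) -> dict:
    """Same demo tool as above."""
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

# Map tool names (as reported in function_call.name) to callables.
TOOLS = {"get_stock_price": get_stock_price}

def dispatch_tool_call(name: str, args: dict) -> dict:
    """Run the named tool, returning an error dict for unknown names."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**args)

print(dispatch_tool_call("get_stock_price", {"symbol": "AAPL"}))
# {'symbol': 'AAPL', 'price': 198.5}
```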

FAQ

Does streaming affect token costs?

No. Streaming delivers the same tokens as non-streaming — it just delivers them incrementally. The total cost is identical regardless of whether you use streaming.

Can I abort a streaming response mid-way?

Yes. Simply stop iterating over the response object. The connection will be closed and no further tokens will be generated. This is useful for implementing "Stop generating" buttons in chat UIs.
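A "Stop generating" button can be sketched with a shared flag checked between chunks. The stub iterable below stands in for the Gemini response so the pattern runs anywhere, and threading.Event is just one choice of signal:

```python
import threading

stop_requested = threading.Event()

def stream_until_stopped(chunks) -> str:
    """Consume text chunks until the stream ends or stop is requested."""
    collected = []
    for text in chunks:
        if stop_requested.is_set():
            break  # leaving the loop abandons the underlying stream
        collected.append(text)
    return "".join(collected)

def fake_chunks():
    yield "partial "
    yield "answer"
    stop_requested.set()  # simulate the user clicking Stop
    yield " that is never shown"

print(stream_until_stopped(fake_chunks()))  # partial answer
```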

What happens if the network drops during streaming?

The iterator will raise an exception. Implement retry logic that re-sends the request. Since Gemini API calls are not resumable, you need to restart the full generation. Consider saving partial responses so the user does not lose context.
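That retry logic can be sketched as a wrapper that restarts the stream from scratch and salvages partial text from failed attempts. A fake flaky stream stands in for the API below, and all names are our own:

```python
import time

def stream_with_retry(make_stream, max_retries: int = 3, delay: float = 0.0):
    """Run a non-resumable stream, restarting from scratch on failure.

    make_stream: zero-arg callable returning a fresh iterable of text chunks.
    Returns (full_text, partials), where partials holds text salvaged from
    failed attempts (e.g. to keep showing the user while retrying).
    """
    partials = []
    for attempt in range(max_retries):
        pieces = []
        try:
            for text in make_stream():
                pieces.append(text)
            return "".join(pieces), partials
        except ConnectionError:
            partials.append("".join(pieces))
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)

attempts = {"n": 0}

def flaky_stream():
    attempts["n"] += 1
    yield "Hello"
    if attempts["n"] == 1:
        raise ConnectionError("network dropped")
    yield ", world"

print(stream_with_retry(flaky_stream))  # ('Hello, world', ['Hello'])
```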


#GoogleGemini #Streaming #RealTime #FastAPI #ServerSentEvents #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
