Gemini Streaming and Real-Time Responses: Building Responsive Agent UIs
Implement Gemini streaming for real-time token delivery in agent UIs. Learn stream_generate_content, chunk handling, SSE integration with FastAPI, and building responsive chat interfaces.
Why Streaming Matters for Agent UX
When a Gemini API call takes 5-10 seconds to complete, users stare at a loading spinner wondering if something broke. Streaming delivers tokens as they are generated, typically starting within 200-500 milliseconds. The user sees the response forming in real time, which feels dramatically faster even though the total generation time is the same.
For agent applications, streaming is even more important. When your agent calls tools, the user can see "Searching for flights..." appear immediately rather than waiting for the entire tool call and response cycle to finish.
Basic Streaming
Call generate_content as usual, but pass stream=True and iterate over the result:
```python
import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Write a detailed explanation of how transformer attention works.",
    stream=True,
)

for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)
print()  # Final newline
```
Each chunk contains a portion of the response text. Chunks arrive as soon as the model generates them, so the first chunk typically appears within a few hundred milliseconds.
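Chunk handling can be exercised without an API key. The harness below times a stream of stand-in chunk objects (FakeChunk and fake_stream are test doubles, not part of the SDK) and shows why time-to-first-chunk, not total time, is the latency users actually feel:

```python
import time
from dataclasses import dataclass

@dataclass
class FakeChunk:  # stand-in for a streamed response chunk
    text: str

def fake_stream():
    # Simulates a model emitting three chunks with generation latency.
    for word in ["Streaming ", "feels ", "fast."]:
        time.sleep(0.05)
        yield FakeChunk(word)

def consume(stream):
    """Accumulate chunk text while recording time-to-first-chunk and total time."""
    start = time.monotonic()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            first = time.monotonic() - start
        parts.append(chunk.text)
    total = time.monotonic() - start
    return "".join(parts), first, total

text, ttft, total = consume(fake_stream())
print(f"first chunk after {ttft:.2f}s, done after {total:.2f}s")
```

Swapping fake_stream() for a real streaming response gives you the same two numbers for your own prompts.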
Streaming with Chat Sessions
Streaming works seamlessly with multi-turn chat:
```python
model = genai.GenerativeModel("gemini-2.0-flash")
chat = model.start_chat()

def stream_chat(message: str):
    response = chat.send_message(message, stream=True)
    full_response = []
    for chunk in response:
        if chunk.text:
            print(chunk.text, end="", flush=True)
            full_response.append(chunk.text)
    print()
    return "".join(full_response)

stream_chat("What are the main differences between REST and GraphQL?")
stream_chat("Which would you recommend for a real-time dashboard?")
```
stream_chat("What are the main differences between REST and GraphQL?")
stream_chat("Which would you recommend for a real-time dashboard?")
The chat history is maintained across streaming calls, so follow-up questions work correctly. Note that with stream=True, the model's reply is appended to the history only after the stream has been fully consumed.
Async Streaming for Web Applications
For web servers, use the async streaming interface to avoid blocking the event loop:
```python
import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

async def stream_response(prompt: str):
    response = await model.generate_content_async(
        prompt,
        stream=True,
    )
    async for chunk in response:
        if chunk.text:
            yield chunk.text
    # After iteration completes, usage metadata is available
    # via response.usage_metadata if needed.
```
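An async generator like this is consumed with `async for` from inside a running event loop. A minimal sketch of the calling pattern, using a stand-in generator (fake_stream_response is a test double, not the SDK) so it runs without credentials:

```python
import asyncio

async def fake_stream_response(prompt: str):
    # Stand-in for a streaming generator: yields text pieces asynchronously.
    for piece in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield piece

async def collect(prompt: str) -> str:
    parts = []
    async for piece in fake_stream_response(prompt):
        parts.append(piece)
    return "".join(parts)

print(asyncio.run(collect("demo")))
# Hello, world
```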
Server-Sent Events with FastAPI
Here is a complete FastAPI endpoint that streams Gemini responses to the browser using SSE:
```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import google.generativeai as genai
import json
import os

app = FastAPI()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

@app.post("/api/chat/stream")
async def chat_stream(request: Request):
    body = await request.json()
    prompt = body["message"]

    async def event_generator():
        response = await model.generate_content_async(prompt, stream=True)
        async for chunk in response:
            if chunk.text:
                data = json.dumps({"type": "text", "content": chunk.text})
                yield f"data: {data}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
```
Client-Side SSE Consumption
On the frontend, consume the stream with fetch and a stream reader. The EventSource API only supports GET requests, so it cannot send this endpoint's POST body:
This is JavaScript for the browser, included to show the full-stack pattern:

```javascript
// Browser-side consumer for the /api/chat/stream endpoint.
async function streamChat(message) {
  const response = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const text = decoder.decode(value);
    const lines = text.split('\n');

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));
        if (data.type === 'text') {
          appendToChat(data.content);
        }
      }
    }
  }
}
```
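One subtlety the simple line split glosses over: a single network read can end mid-event, so the `data:` line you parse may be incomplete. Robust clients buffer until they see the blank line that terminates an SSE event. The same parsing logic as a buffered Python helper (parse_sse is an illustrative name, handy for testing the endpoint from a script):

```python
import json

def parse_sse(chunks):
    """Yield decoded JSON payloads from an iterable of byte chunks,
    buffering so events split across chunk boundaries still parse."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8")
        # SSE events are terminated by a blank line.
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.split("\n"):
                if line.startswith("data: "):
                    yield json.loads(line[len("data: "):])

# An event split across two reads still parses correctly.
chunks = [b'data: {"type": "text", "con',
          b'tent": "hi"}\n\ndata: {"type": "done"}\n\n']
events = list(parse_sse(chunks))
print(events)
# [{'type': 'text', 'content': 'hi'}, {'type': 'done'}]
```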
Streaming with Function Calling
When streaming is combined with function calling, you receive function call chunks that signal when to execute tools:
```python
def get_stock_price(symbol: str) -> dict:
    """Get the current stock price.

    Args:
        symbol: Stock ticker symbol, e.g. 'AAPL'.
    """
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    tools=[get_stock_price],
)
chat = model.start_chat()

response = chat.send_message(
    "What is Apple's stock price?",
    stream=True,
)

for chunk in response:
    for part in chunk.parts:
        if part.function_call:
            fc = part.function_call
            print(f"Calling tool: {fc.name}({dict(fc.args)})")
            result = get_stock_price(**dict(fc.args))
            # Send result back and continue streaming
```
This allows your UI to show "Looking up AAPL stock price..." in real time while the tool executes.
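Once an agent has more than one tool, the lookup-and-execute step is worth factoring out so it can be tested without the API. A minimal sketch using a registry dict; FakeCall here is a test double standing in for part.function_call, not an SDK type:

```python
from dataclasses import dataclass, field

def get_stock_price(symbol: str) -> dict:
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 420.15}
    return {"symbol": symbol, "price": prices.get(symbol, 0)}

# Registry mapping tool names (as the model emits them) to callables.
TOOLS = {"get_stock_price": get_stock_price}

@dataclass
class FakeCall:  # stand-in for a streamed function_call part
    name: str
    args: dict = field(default_factory=dict)

def dispatch(call) -> dict:
    """Look up the requested tool and execute it with the model's arguments."""
    fn = TOOLS.get(call.name)
    if fn is None:
        return {"error": f"unknown tool: {call.name}"}
    return fn(**dict(call.args))

result = dispatch(FakeCall("get_stock_price", {"symbol": "AAPL"}))
print(result)
# {'symbol': 'AAPL', 'price': 198.5}
```

The unknown-tool branch matters in practice: returning an error payload lets the model recover instead of crashing the stream.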
FAQ
Does streaming affect token costs?
No. Streaming delivers the same tokens as non-streaming — it just delivers them incrementally. The total cost is identical regardless of whether you use streaming.
Can I abort a streaming response mid-way?
Yes. Simply stop iterating over the response object. The connection will be closed and no further tokens will be generated. This is useful for implementing "Stop generating" buttons in chat UIs.
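The "stop iterating" pattern can be sketched with a plain stop predicate; the names here are illustrative, and the fake chunk list stands in for a real streaming response:

```python
def stream_until(chunks, stop):
    """Consume a stream of text pieces, bailing out as soon as stop()
    returns True -- the pattern behind a 'Stop generating' button."""
    parts = []
    for text in chunks:
        if stop():
            break  # leaving the loop abandons the rest of the stream
        parts.append(text)
    return "".join(parts)

# Stand-in stream and a stop signal that fires before the third chunk.
seen = {"n": 0}
def stop_after_two():
    seen["n"] += 1
    return seen["n"] > 2

partial = stream_until(iter(["a", "b", "c", "d"]), stop_after_two)
print(partial)
# ab
```

In a web UI the stop predicate would typically check a flag flipped by the button's click handler.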
What happens if the network drops during streaming?
The iterator will raise an exception. Implement retry logic that re-sends the request. Since Gemini API calls are not resumable, you need to restart the full generation. Consider saving partial responses so the user does not lose context.
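A retry wrapper along these lines keeps the last partial text so the UI has something to show even if all attempts fail. This is a sketch with illustrative names, demonstrated against a stand-in stream that drops the connection once:

```python
import time

def stream_with_retry(make_stream, max_attempts=3):
    """Restart generation on mid-stream failure, keeping the partial
    text from the last attempt. Returns (text, error)."""
    partial = ""
    for attempt in range(max_attempts):
        try:
            parts = []
            for text in make_stream():
                parts.append(text)
            return "".join(parts), None
        except ConnectionError:
            partial = "".join(parts)  # save what arrived before the drop
            time.sleep(0.1 * (attempt + 1))  # simple linear backoff
    return partial, "gave up"

# Stand-in: fails mid-stream on the first call, succeeds on the second.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    yield "Hello "
    if state["calls"] == 1:
        raise ConnectionError("network dropped")
    yield "again"

text, err = stream_with_retry(flaky)
print(text, err)
# Hello again None
```

Because each retry restarts the full generation, the retried text can differ from the partial text already shown; UIs usually replace the partial output rather than append to it.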
#GoogleGemini #Streaming #RealTime #FastAPI #ServerSentEvents #AgenticAI #LearnAI #AIEngineering
CallSphere Team