RAG Pipeline Optimization: Reducing Latency from Seconds to Milliseconds
Learn practical techniques to dramatically reduce RAG pipeline latency including async retrieval, semantic caching, pre-computation, and embedding optimization without sacrificing answer quality.
Where RAG Latency Comes From
A typical RAG pipeline has five latency-contributing stages:
- Embedding the query — 50-200ms (API call to embedding model)
- Vector search — 10-500ms (depends on index size and infrastructure)
- Document retrieval — 5-50ms (fetching full documents from storage)
- Context assembly — 1-5ms (concatenating and formatting)
- LLM generation — 500-5000ms (the dominant cost)
A naive implementation runs these sequentially, resulting in 1-6 seconds of total latency. With the optimizations in this guide, you can reduce stages 1-4 to under 100ms combined and significantly improve the perceived speed of stage 5 through streaming.
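Before optimizing, it is worth measuring where the time actually goes in your own pipeline. A minimal instrumentation sketch (the `time.sleep` calls are stand-ins for real stage calls):

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name: str, timings: dict):
    """Record a stage's wall-clock time in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000


timings: dict[str, float] = {}
with stage_timer("embed_query", timings):
    time.sleep(0.02)  # stand-in for the embedding API call
with stage_timer("vector_search", timings):
    time.sleep(0.01)  # stand-in for the vector search

# timings now maps each stage to its elapsed milliseconds,
# making the dominant cost obvious before you optimize
```

Wrap each real stage the same way and the dominant cost in your deployment becomes obvious rather than assumed.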
Optimization 1: Semantic Cache
The highest-impact optimization is caching. If two users ask semantically similar questions, the second query can return a cached response instantly:
```python
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.embedding_cache_key = "rag:embeddings"
        self.response_cache_key = "rag:responses"

    def _get_embedding(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

    def _cosine_similarity(
        self, a: list[float], b: list[float]
    ) -> float:
        a_np, b_np = np.array(a), np.array(b)
        return float(
            np.dot(a_np, b_np)
            / (np.linalg.norm(a_np) * np.linalg.norm(b_np))
        )

    def get(self, query: str) -> str | None:
        """Check if a semantically similar query was cached."""
        # One embedding call per lookup (~50-200ms) -- still far
        # cheaper than running the full pipeline on a hit.
        query_emb = self._get_embedding(query)
        # Linear scan over all cached embeddings: fine for a few
        # thousand entries; beyond that, move to a vector index.
        cached = cache.hgetall(self.embedding_cache_key)
        for key, emb_json in cached.items():
            cached_emb = json.loads(emb_json)
            similarity = self._cosine_similarity(query_emb, cached_emb)
            if similarity >= self.threshold:
                response = cache.hget(self.response_cache_key, key)
                if response:
                    return response.decode()
        return None

    def set(self, query: str, response: str, ttl: int = 3600):
        """Cache a query-response pair."""
        query_emb = self._get_embedding(query)
        key = hashlib.md5(query.encode()).hexdigest()
        cache.hset(
            self.embedding_cache_key, key, json.dumps(query_emb)
        )
        cache.hset(self.response_cache_key, key, response)
        # Redis hashes expire as a whole, not per field, so each
        # write refreshes the TTL for the entire cache. For
        # per-entry expiry, store each pair under its own key.
        cache.expire(self.embedding_cache_key, ttl)
        cache.expire(self.response_cache_key, ttl)
```
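Note that `get()` returns the first entry that clears the threshold, which depends on iteration order. A variant that returns the best match instead is a small change; here is a self-contained sketch of that lookup logic, using toy 2-D vectors in place of real embeddings:

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def lookup(query_emb, cached: dict, threshold: float = 0.95):
    """Return the key of the cached embedding most similar to
    query_emb, or None if nothing clears the threshold."""
    best_key, best_sim = None, -1.0
    for key, emb in cached.items():
        sim = cosine_similarity(query_emb, emb)
        if sim > best_sim:
            best_key, best_sim = key, sim
    return best_key if best_sim >= threshold else None


cached = {"capital": [1.0, 0.0], "weather": [0.0, 1.0]}
lookup([1.0, 0.0], cached)  # -> "capital" (exact match)
```

Returning the best match matters once several cached queries sit near the threshold: the first-match version can return a weaker neighbor purely because of storage order.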
Optimization 2: Async Parallel Retrieval
When searching multiple sources, run them concurrently:
```python
import asyncio


async def async_embed(text: str) -> list[float]:
    """Non-blocking embedding call."""
    # The OpenAI SDK call is synchronous, so run it in a thread
    # pool to keep the event loop free.
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        lambda: client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        ),
    )
    return response.data[0].embedding


async def async_search(
    vectorstore, query_embedding: list[float], k: int
) -> list[dict]:
    """Non-blocking vector search."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None,
        lambda: vectorstore.search_by_vector(query_embedding, k=k),
    )


async def optimized_retrieval(
    query: str,
    vectorstores: list,
    k_per_store: int = 3,
) -> list[dict]:
    """Search all vector stores in parallel."""
    # Single embedding call shared across all stores
    query_embedding = await async_embed(query)
    # Search all stores concurrently
    tasks = [
        async_search(vs, query_embedding, k_per_store)
        for vs in vectorstores
    ]
    results = await asyncio.gather(*tasks)
    # Flatten and return
    return [
        doc for store_results in results for doc in store_results
    ]
```
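To see the payoff without any real infrastructure, here is a self-contained toy where `fake_search` stands in for a vector store query with network latency:

```python
import asyncio
import time


async def fake_search(store_id: str, delay: float) -> list[str]:
    # Stand-in for one vector store query with network latency
    await asyncio.sleep(delay)
    return [f"{store_id}-doc{i}" for i in range(2)]


async def search_all() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_search("products", 0.1),
        fake_search("support", 0.1),
        fake_search("policies", 0.1),
    )
    elapsed = time.perf_counter() - start
    return [doc for r in results for doc in r], elapsed


docs, elapsed = asyncio.run(search_all())
# three 100ms searches finish together in roughly 100ms,
# not the 300ms a sequential loop would take
```

The same shape applies to the real pipeline: total retrieval latency becomes the slowest store rather than the sum of all stores.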
Optimization 3: Matryoshka Embeddings for Faster Search
Modern embedding models like text-embedding-3-small support dimensionality reduction. Shorter embeddings mean faster similarity computation:
```python
def get_compact_embedding(
    text: str, dimensions: int = 256
) -> list[float]:
    """Get a reduced-dimension embedding for faster search.

    text-embedding-3-small supports any dimension count up to
    its native 1536 via the `dimensions` parameter; 256 and 512
    are common choices.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        dimensions=dimensions,  # Reduce from 1536 to 256
    )
    return response.data[0].embedding


# 256-dim embeddings are 6x smaller, and search is
# approximately 4x faster with minimal quality loss
```
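The `dimensions` parameter is documented to work by truncating the embedding and re-normalizing it, so you can apply the same reduction client-side to full-length embeddings you have already stored, without re-calling the API. A sketch (the random vector stands in for a stored 1536-dim embedding):

```python
import numpy as np


def truncate_and_normalize(
    embedding, dims: int
) -> np.ndarray:
    """Shorten an embedding Matryoshka-style: keep the first
    `dims` components, then rescale to unit length so cosine
    similarity remains meaningful."""
    v = np.asarray(embedding, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)


full = np.random.default_rng(0).normal(size=1536)
compact = truncate_and_normalize(full, 256)  # unit-length, 256-dim
```

This lets you migrate an existing index to compact embeddings offline instead of re-embedding every document.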
Optimization 4: Streaming Generation
The LLM generation step dominates latency. Streaming gives users immediate feedback:
```python
def streaming_rag(query: str, context: str):
    """Stream the RAG response token by token."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer using the provided context.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\n"
                f"Question: {query}",
            },
        ],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks carry only metadata
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
```
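Streaming does not shorten total generation time; it shortens time-to-first-token, which is what users perceive. A small helper for measuring that on any token generator, including `streaming_rag` above:

```python
import time


def consume_with_ttft(token_stream):
    """Collect a token stream while recording time-to-first-token,
    the latency metric that streaming actually improves."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(token)
    return "".join(parts), ttft
```

Tracking time-to-first-token separately from total latency keeps the streaming win visible in your metrics instead of being averaged away.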
Optimization 5: Pre-Computed Popular Queries
For queries that follow predictable patterns, pre-compute and cache results during off-peak hours:
```python
from datetime import datetime


def precompute_popular_queries(
    popular_queries: list[str],
    rag_pipeline,
    semantic_cache: SemanticCache,
):
    """Pre-compute answers for frequently asked questions
    during off-peak hours."""
    for query in popular_queries:
        # Check if already cached and fresh
        cached = semantic_cache.get(query)
        if cached:
            continue
        # Generate and cache
        answer = rag_pipeline.answer(query)
        semantic_cache.set(query, answer, ttl=86400)
    print(
        f"Pre-computed {len(popular_queries)} queries "
        f"at {datetime.now()}"
    )
```
Combined Pipeline with All Optimizations
When you apply all these optimizations together, the typical latency profile changes dramatically. Cache hits return in under 100ms. Cache misses with parallel retrieval and streaming return the first token in 300-500ms. The user perceives near-instant responses for common queries and fast streaming for novel ones.
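One hypothetical way to wire the pieces together is a cache-first flow with the cache, retriever, and generator injected as callables, so each stage can be swapped or tested independently (the function and parameter names here are illustrative, not from a specific library):

```python
def answer_query(query, cache_get, cache_set, retrieve, generate_stream):
    """Cache-first RAG flow: a hit streams back instantly; a miss
    runs retrieval, streams the answer, then populates the cache."""
    cached = cache_get(query)
    if cached is not None:
        yield cached  # instant path, no retrieval or generation
        return
    context = "\n".join(retrieve(query))
    parts = []
    for token in generate_stream(query, context):
        parts.append(token)
        yield token  # stream to the user as tokens arrive
    cache_set(query, "".join(parts))  # warm the cache afterward
```

Because the answer is cached only after the stream completes, a crashed generation never poisons the cache with a partial response.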
FAQ
What cache hit rate should I expect?
In production RAG systems with enterprise users, cache hit rates of 30-50% are common because users often ask variations of the same questions. Consumer-facing systems see lower hit rates (10-20%) due to query diversity. Even a 30% hit rate means nearly a third of your queries return instantly.
Does reducing embedding dimensions hurt retrieval quality?
At 256 dimensions (down from 1536), text-embedding-3-small retains approximately 95% of its retrieval quality on standard benchmarks. For most applications, this is an excellent tradeoff. If you work in a domain with very fine-grained semantic distinctions (like legal or medical), test on your specific evaluation set before committing to reduced dimensions.
Should I optimize the retrieval pipeline or the generation step first?
Optimize generation first with streaming — it gives the biggest perceived latency improvement because users see tokens appearing immediately instead of waiting for the full response. Then add semantic caching, which eliminates both retrieval and generation latency for repeated queries. Async retrieval and embedding optimization are worthwhile refinements after those two are in place.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.