Advanced RAG Patterns: Query Rewriting, HyDE, and Multi-Step Retrieval
Go beyond basic RAG with advanced retrieval patterns including query rewriting, hypothetical document embeddings (HyDE), step-back prompting, and iterative multi-step retrieval chains.
When Basic RAG Falls Short
Basic RAG follows a simple pattern: embed the user's query, find similar documents, generate an answer. This works well for straightforward factual questions but struggles with three common scenarios:
- Vague or poorly worded queries — "how does the thing work" retrieves nothing useful
- Vocabulary mismatch — the user says "cancel my account" but the docs say "subscription termination"
- Multi-hop questions — "Which of our enterprise customers in healthcare had SLA violations last quarter?" requires multiple retrieval steps
Advanced RAG patterns address each of these failure modes. This post covers four production-proven techniques.
Pattern 1: Query Rewriting
Query rewriting uses an LLM to transform the user's original query into one or more queries that are more likely to retrieve relevant documents.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
def rewrite_query(original_query: str, num_variants: int = 3) -> list[str]:
    """Generate multiple search queries from the original question."""
    prompt = f"""You are a search query optimizer for a RAG system.
Given the user's question, generate {num_variants} different search queries
that would help find the relevant information in a knowledge base.
Each query should approach the question from a different angle or use
different terminology.

User question: {original_query}

Return only the queries, one per line, no numbering."""
    response = llm.invoke(prompt)
    queries = [q.strip() for q in response.content.strip().split("\n") if q.strip()]
    return queries
# Example
original = "how does the thing with payments work"
rewritten = rewrite_query(original)
for q in rewritten:
    print(f" -> {q}")
# Output:
# -> How does the payment processing system function?
# -> What is the billing and payment workflow?
# -> Payment integration setup and configuration guide
Now retrieve with all queries and merge the results:
def multi_query_retrieve(queries: list[str], retriever, k: int = 5) -> list:
    """Retrieve documents using multiple queries, deduplicate by content."""
    all_docs = []
    seen_content = set()
    for query in queries:
        docs = retriever.invoke(query)
        for doc in docs:
            content_hash = hash(doc.page_content)
            if content_hash not in seen_content:
                seen_content.add(content_hash)
                all_docs.append(doc)
    # Keep the first k in order of appearance; queries run in the order
    # given, so earlier variants take priority
    return all_docs[:k]
Pattern 2: HyDE — Hypothetical Document Embeddings
HyDE is a counterintuitive but effective technique. Instead of embedding the question, you ask the LLM to generate a hypothetical answer (even if it is wrong), then embed that hypothetical answer and use it as the search vector.
The insight is that a hypothetical answer is closer in embedding space to the real document than the question itself. Questions and answers live in different semantic neighborhoods — HyDE bridges this gap.
def hyde_retrieve(question: str, retriever, llm, k: int = 5) -> list:
    """
    Hypothetical Document Embeddings:
    1. Generate a hypothetical answer
    2. Embed the hypothetical answer
    3. Use it to search for real documents
    """
    # Step 1: Generate a hypothetical answer
    hyde_prompt = f"""Write a detailed paragraph that would answer the following question.
Write as if you are writing a section of a technical document.
Do not mention that this is hypothetical.

Question: {question}

Answer paragraph:"""
    hypothetical_doc = llm.invoke(hyde_prompt).content
    # Steps 2-3: Use the hypothetical doc as the search query.
    # The retriever embeds this text and finds similar real documents.
    docs = retriever.invoke(hypothetical_doc)
    return docs[:k]
# Usage
question = "What security measures protect customer payment data?"
docs = hyde_retrieve(question, retriever, llm)
for doc in docs:
    print(f"Retrieved: {doc.page_content[:100]}...")
When HyDE helps most: Technical questions where users describe problems in different terms than the documentation. Customer support queries where the question vocabulary differs significantly from the knowledge base vocabulary.
When to skip HyDE: Simple factual lookups, queries that already use domain terminology, latency-sensitive applications (HyDE adds an LLM call before retrieval).
Pattern 3: Step-Back Prompting
Step-back prompting handles overly specific queries by first generating a more general version of the question, retrieving for both, and combining the context.
def step_back_retrieve(question: str, retriever, llm, k: int = 5) -> list:
    """
    Retrieve using both the original question and a more general version.
    """
    # Generate the step-back question
    step_back_prompt = f"""Given a specific question, generate a more general
question that would retrieve broader context helpful for answering
the specific question.

Specific question: {question}

General question:"""
    general_question = llm.invoke(step_back_prompt).content.strip()
    # Retrieve for both
    specific_docs = retriever.invoke(question)
    general_docs = retriever.invoke(general_question)
    # Merge with deduplication
    seen = set()
    merged = []
    for doc in specific_docs + general_docs:
        key = hash(doc.page_content)
        if key not in seen:
            seen.add(key)
            merged.append(doc)
    return merged[:k]
# Example
question = "What is the TLS version used for API endpoints in the EU region?"
# Step-back generates: "What are the security and encryption standards for API endpoints?"
# This retrieves both the specific TLS doc and the broader security architecture doc
docs = step_back_retrieve(question, retriever, llm)
Pattern 4: Iterative Multi-Step Retrieval
For complex questions that require information from multiple documents, iterative retrieval performs multiple rounds of search, using information gathered in each round to refine subsequent queries.
def multi_step_retrieve(
    question: str,
    retriever,
    llm,
    max_steps: int = 3,
    k_per_step: int = 3,
) -> dict:
    """
    Iterative retrieval: use each round's findings to inform the next query.
    """
    all_context = []
    queries_used = [question]
    steps_run = 0
    for step in range(max_steps):
        # Retrieve for the current query
        current_query = queries_used[-1]
        docs = retriever.invoke(current_query)[:k_per_step]
        new_context = [doc.page_content for doc in docs]
        all_context.extend(new_context)
        steps_run += 1
        # Check whether we have enough to answer
        check_prompt = f"""Given the question and the context gathered so far,
determine if we have enough information to answer completely.

Question: {question}

Context gathered:
{chr(10).join(all_context)}

If we have enough information, respond with: SUFFICIENT
If we need more information, respond with a follow-up search query
that would find the missing pieces."""
        check_response = llm.invoke(check_prompt).content.strip()
        if "SUFFICIENT" in check_response.upper():
            break
        queries_used.append(check_response)
    return {
        "context": all_context,
        "steps": steps_run,  # retrieval rounds actually executed
        "queries": queries_used,
    }
Combining Patterns
In production, these patterns compose naturally:
User Query
|
v
Query Rewriting (generate 3 variants)
|
v
For each variant: HyDE (generate hypothetical doc)
|
v
Retrieve top-k for each hypothetical doc
|
v
Merge + Deduplicate all results
|
v
Re-rank with cross-encoder
|
v
Top-5 chunks -> LLM generation
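The re-rank stage at the end of the pipeline can be sketched as a generic scoring pass. The `score_fn` parameter and the toy token-overlap scorer below are illustrative stand-ins; in production you would plug in a real cross-encoder (for example, the `predict` method of a sentence-transformers CrossEncoder).

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and return the top_n chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query tokens present in the chunk.
    A real deployment would use a cross-encoder here instead."""
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(chunk.lower().split())) / max(len(q_tokens), 1)

chunks = [
    "payment processing uses TLS 1.3 for all API endpoints",
    "the office coffee machine is cleaned on Fridays",
    "how payment refunds and chargebacks work",
]
top = rerank("how does payment processing work", chunks, overlap_score, top_n=2)
```

Because `rerank` only depends on a callable, swapping the toy scorer for a cross-encoder is a one-line change.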
Each additional layer adds latency but improves retrieval quality. Start with basic RAG, measure where retrieval fails, and add the pattern that addresses your specific failure mode.
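One way to measure where retrieval fails is a simple hit-rate check over a hand-labeled eval set. This is a minimal sketch under assumptions not in the original post: each eval item pairs a question with a substring the correct chunk must contain, and the stub retriever (a naive keyword matcher) stands in for a real vector store.

```python
def retrieval_hit_rate(eval_set: list[tuple[str, str]], retriever, k: int = 5) -> float:
    """Fraction of questions whose expected text appears in the top-k chunks."""
    hits = 0
    for question, expected in eval_set:
        top_k = retriever.invoke(question)[:k]
        if any(expected in chunk for chunk in top_k):
            hits += 1
    return hits / len(eval_set)

# Stub retriever for illustration; a real one would query a vector store
class StubRetriever:
    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    def invoke(self, query: str) -> list[str]:
        # Naive keyword match: most-overlapping chunks first
        q = set(query.lower().split())
        return sorted(self.corpus,
                      key=lambda c: len(q & set(c.lower().split())),
                      reverse=True)

corpus = ["payments are settled daily", "refunds take 5 business days"]
evals = [("how long do refunds take", "refunds"),
         ("when are payments settled", "settled")]
rate = retrieval_hit_rate(evals, StubRetriever(corpus), k=1)
```

Running the same eval set before and after adding a pattern tells you whether the extra latency actually bought retrieval quality.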
FAQ
Does HyDE work if the LLM hallucinates the hypothetical answer?
Yes, and this is the counterintuitive insight. Even a factually wrong hypothetical answer lands in the same vocabulary, structure, and semantic neighborhood as a real answer. The embedding of a wrong answer about "TLS 1.3 encryption for API endpoints" is still closer to the real documentation about API encryption than the original question "What security does the API use?"
How much latency does query rewriting add?
Query rewriting adds one LLM call (100-500ms with GPT-4o-mini) before retrieval begins. If you then retrieve with 3 query variants in parallel, the total added latency is just the rewriting call — the parallel retrievals take the same time as a single retrieval. This is usually an acceptable tradeoff for the retrieval quality improvement.
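The parallel fan-out described above can be sketched with the standard library. The stub retriever here is illustrative only; a real one would be the vector-store retriever from the earlier examples, and its invoke calls would overlap in wall-clock time.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(queries: list[str], retriever) -> list:
    """Issue retriever.invoke for every query variant concurrently, so
    total latency is roughly one retrieval rather than len(queries)."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        per_query = list(pool.map(retriever.invoke, queries))
    # Flatten while preserving query order; deduplicate downstream as before
    return [doc for docs in per_query for doc in docs]

# Stub retriever for illustration
class StubRetriever:
    def invoke(self, query: str) -> list[str]:
        return [f"doc for: {query}"]

docs = parallel_retrieve(["q1", "q2", "q3"], StubRetriever())
```

pool.map preserves input order, so the flattened result can feed straight into the deduplication logic from multi_query_retrieve.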
When should I use multi-step retrieval vs. just retrieving more documents?
Multi-step retrieval is better when the answer requires synthesizing information from documents that would not be retrieved together by a single query. For example, answering "Which customers affected by the Q3 outage are also on expired contracts?" requires first finding outage-affected customers, then looking up their contract status. Retrieving more documents with a single query would not find this cross-referenced information.
#RAG #AdvancedRetrieval #HyDE #QueryRewriting #MultiStepRetrieval #LLM #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.