Advanced RAG Patterns: Query Rewriting, HyDE, and Multi-Step Retrieval
Go beyond basic RAG with advanced retrieval patterns including query rewriting, hypothetical document embeddings (HyDE), step-back prompting, and iterative multi-step retrieval chains.
When Basic RAG Falls Short
Basic RAG follows a simple pattern: embed the user's query, find similar documents, generate an answer. This works well for straightforward factual questions but struggles with three common scenarios:
- Vague or poorly worded queries — "how does the thing work" retrieves nothing useful
- Vocabulary mismatch — the user says "cancel my account" but the docs say "subscription termination"
- Multi-hop questions — "Which of our enterprise customers in healthcare had SLA violations last quarter?" requires multiple retrieval steps
Advanced RAG patterns address each of these failure modes. This post covers four production-proven techniques.
Pattern 1: Query Rewriting
Query rewriting uses an LLM to transform the user's original query into one or more queries that are more likely to retrieve relevant documents.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
def rewrite_query(original_query: str, num_variants: int = 3) -> list[str]:
    """Generate multiple search queries from the original question."""
    prompt = f"""You are a search query optimizer for a RAG system.
Given the user's question, generate {num_variants} different search queries
that would help find the relevant information in a knowledge base.
Each query should approach the question from a different angle or use
different terminology.

User question: {original_query}

Return only the queries, one per line, no numbering."""
    response = llm.invoke(prompt)
    queries = [q.strip() for q in response.content.strip().split("\n") if q.strip()]
    return queries
# Example
original = "how does the thing with payments work"
rewritten = rewrite_query(original)
for q in rewritten:
    print(f" -> {q}")
# Output:
# -> How does the payment processing system function?
# -> What is the billing and payment workflow?
# -> Payment integration setup and configuration guide
Now retrieve with all queries and merge the results:
def multi_query_retrieve(queries: list[str], retriever, k: int = 5) -> list:
    """Retrieve documents using multiple queries, deduplicate by content."""
    all_docs = []
    seen_content = set()
    for query in queries:
        docs = retriever.invoke(query)
        for doc in docs:
            content_hash = hash(doc.page_content)
            if content_hash not in seen_content:
                seen_content.add(content_hash)
                all_docs.append(doc)
    # Keep the first k in order of appearance; queries run in the order
    # given, so earlier variants take priority
    return all_docs[:k]
Pattern 2: HyDE — Hypothetical Document Embeddings
HyDE is a counterintuitive but effective technique. Instead of embedding the question, you ask the LLM to generate a hypothetical answer (even if it is wrong), then embed that hypothetical answer and use it as the search vector.
The insight is that a hypothetical answer is closer in embedding space to the real document than the question itself. Questions and answers live in different semantic neighborhoods — HyDE bridges this gap.
def hyde_retrieve(question: str, retriever, llm, k: int = 5) -> list:
    """
    Hypothetical Document Embeddings:
    1. Generate a hypothetical answer
    2. Embed the hypothetical answer
    3. Use it to search for real documents
    """
    # Step 1: Generate a hypothetical answer
    hyde_prompt = f"""Write a detailed paragraph that would answer the following question.
Write as if you are writing a section of a technical document.
Do not mention that this is hypothetical.

Question: {question}

Answer paragraph:"""
    hypothetical_doc = llm.invoke(hyde_prompt).content
    # Steps 2-3: Use the hypothetical doc as the search query.
    # The retriever embeds this text and finds similar real documents.
    docs = retriever.invoke(hypothetical_doc)
    return docs[:k]
# Usage
question = "What security measures protect customer payment data?"
docs = hyde_retrieve(question, retriever, llm)
for doc in docs:
    print(f"Retrieved: {doc.page_content[:100]}...")
When HyDE helps most: Technical questions where users describe problems in different terms than the documentation. Customer support queries where the question vocabulary differs significantly from the knowledge base vocabulary.
When to skip HyDE: Simple factual lookups, queries that already use domain terminology, latency-sensitive applications (HyDE adds an LLM call before retrieval).
Pattern 3: Step-Back Prompting
Step-back prompting handles overly specific queries by first generating a more general version of the question, retrieving for both, and combining the context.
def step_back_retrieve(question: str, retriever, llm, k: int = 5) -> list:
    """
    Retrieve using both the original question and a more general version.
    """
    # Generate the step-back question
    step_back_prompt = f"""Given a specific question, generate a more general
question that would retrieve broader context helpful for answering
the specific question.

Specific question: {question}

General question:"""
    general_question = llm.invoke(step_back_prompt).content.strip()
    # Retrieve for both
    specific_docs = retriever.invoke(question)
    general_docs = retriever.invoke(general_question)
    # Merge with deduplication
    seen = set()
    merged = []
    for doc in specific_docs + general_docs:
        key = hash(doc.page_content)
        if key not in seen:
            seen.add(key)
            merged.append(doc)
    return merged[:k]
# Example
question = "What is the TLS version used for API endpoints in the EU region?"
# Step-back generates: "What are the security and encryption standards for API endpoints?"
# This retrieves both the specific TLS doc and the broader security architecture doc
docs = step_back_retrieve(question, retriever, llm)
Pattern 4: Iterative Multi-Step Retrieval
For complex questions that require information from multiple documents, iterative retrieval performs multiple rounds of search, using information gathered in each round to refine subsequent queries.
def multi_step_retrieve(
    question: str,
    retriever,
    llm,
    max_steps: int = 3,
    k_per_step: int = 3,
) -> dict:
    """
    Iterative retrieval: use each round's findings to inform the next query.
    """
    all_context = []
    queries_used = [question]
    steps_run = 0
    for step in range(max_steps):
        # Retrieve for the current query
        current_query = queries_used[-1]
        docs = retriever.invoke(current_query)[:k_per_step]
        new_context = [doc.page_content for doc in docs]
        all_context.extend(new_context)
        steps_run += 1
        # Check whether we have enough to answer
        check_prompt = f"""Given the question and the context gathered so far,
determine if we have enough information to answer completely.

Question: {question}

Context gathered:
{chr(10).join(all_context)}

If we have enough information, respond with: SUFFICIENT
If we need more information, respond with a follow-up search query
that would find the missing pieces."""
        check_response = llm.invoke(check_prompt).content.strip()
        if "SUFFICIENT" in check_response.upper():
            break
        queries_used.append(check_response)
    return {
        "context": all_context,
        "steps": steps_run,  # retrieval rounds actually executed
        "queries": queries_used,
    }
Combining Patterns
In production, these patterns compose naturally:
User Query
|
v
Query Rewriting (generate 3 variants)
|
v
For each variant: HyDE (generate hypothetical doc)
|
v
Retrieve top-k for each hypothetical doc
|
v
Merge + Deduplicate all results
|
v
Re-rank with cross-encoder
|
v
Top-5 chunks -> LLM generation
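The re-rank stage at the end of the pipeline can be sketched as a generic scoring pass. The `score_fn` parameter and the toy token-overlap scorer below are illustrative stand-ins; in production you would plug in a real cross-encoder (for example, the `predict` method of a sentence-transformers CrossEncoder).

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and return the top_n chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query tokens present in the chunk.
    A real deployment would use a cross-encoder here instead."""
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(chunk.lower().split())) / max(len(q_tokens), 1)

chunks = [
    "payment processing uses TLS 1.3 for all API endpoints",
    "the office coffee machine is cleaned on Fridays",
    "how payment refunds and chargebacks work",
]
top = rerank("how does payment processing work", chunks, overlap_score, top_n=2)
```

Because `rerank` only depends on a callable, swapping the toy scorer for a cross-encoder is a one-line change.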
Each additional layer adds latency but improves retrieval quality. Start with basic RAG, measure where retrieval fails, and add the pattern that addresses your specific failure mode.
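One way to measure where retrieval fails is a simple hit-rate check over a hand-labeled eval set. This is a minimal sketch under assumptions not in the original post: each eval item pairs a question with a substring the correct chunk must contain, and the stub retriever (a naive keyword matcher) stands in for a real vector store.

```python
def retrieval_hit_rate(eval_set: list[tuple[str, str]], retriever, k: int = 5) -> float:
    """Fraction of questions whose expected text appears in the top-k chunks."""
    hits = 0
    for question, expected in eval_set:
        top_k = retriever.invoke(question)[:k]
        if any(expected in chunk for chunk in top_k):
            hits += 1
    return hits / len(eval_set)

# Stub retriever for illustration; a real one would query a vector store
class StubRetriever:
    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    def invoke(self, query: str) -> list[str]:
        # Naive keyword match: most-overlapping chunks first
        q = set(query.lower().split())
        return sorted(self.corpus,
                      key=lambda c: len(q & set(c.lower().split())),
                      reverse=True)

corpus = ["payments are settled daily", "refunds take 5 business days"]
evals = [("how long do refunds take", "refunds"),
         ("when are payments settled", "settled")]
rate = retrieval_hit_rate(evals, StubRetriever(corpus), k=1)
```

Running the same eval set before and after adding a pattern tells you whether the extra latency actually bought retrieval quality.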
FAQ
Does HyDE work if the LLM hallucinates the hypothetical answer?
Yes, and this is the counterintuitive insight. Even a factually wrong hypothetical answer lands in the same vocabulary, structure, and semantic neighborhood as a real answer. The embedding of a wrong answer about "TLS 1.3 encryption for API endpoints" is still closer to the real documentation about API encryption than the original question "What security does the API use?"
How much latency does query rewriting add?
Query rewriting adds one LLM call (100-500ms with GPT-4o-mini) before retrieval begins. If you then retrieve with 3 query variants in parallel, the total added latency is just the rewriting call — the parallel retrievals take the same time as a single retrieval. This is usually an acceptable tradeoff for the retrieval quality improvement.
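The parallel fan-out described above can be sketched with the standard library. The stub retriever here is illustrative only; a real one would be the vector-store retriever from the earlier examples, and its invoke calls would overlap in wall-clock time.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(queries: list[str], retriever) -> list:
    """Issue retriever.invoke for every query variant concurrently, so
    total latency is roughly one retrieval rather than len(queries)."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        per_query = list(pool.map(retriever.invoke, queries))
    # Flatten while preserving query order; deduplicate downstream as before
    return [doc for docs in per_query for doc in docs]

# Stub retriever for illustration
class StubRetriever:
    def invoke(self, query: str) -> list[str]:
        return [f"doc for: {query}"]

docs = parallel_retrieve(["q1", "q2", "q3"], StubRetriever())
```

pool.map preserves input order, so the flattened result can feed straight into the deduplication logic from multi_query_retrieve.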
When should I use multi-step retrieval vs. just retrieving more documents?
Multi-step retrieval is better when the answer requires synthesizing information from documents that would not be retrieved together by a single query. For example, answering "Which customers affected by the Q3 outage are also on expired contracts?" requires first finding outage-affected customers, then looking up their contract status. Retrieving more documents with a single query would not find this cross-referenced information.
#RAG #AdvancedRetrieval #HyDE #QueryRewriting #MultiStepRetrieval #LLM #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.