Retrieval-Augmented Prompting: Injecting Context Dynamically into Prompts
Learn how to design retrieval-augmented prompts that dynamically inject relevant context, manage context windows efficiently, and produce grounded answers from external knowledge.
Static Prompts Hit a Knowledge Wall
A static prompt contains only the information you wrote into it. The moment a user asks about data the model was not trained on — your company's internal docs, recent events, or domain-specific knowledge — the model either hallucinates or admits ignorance.
Retrieval-Augmented Prompting (RAP) solves this by fetching relevant context at query time and injecting it directly into the prompt. This is the prompt engineering layer that sits at the heart of every RAG system. The retrieval pipeline finds relevant documents, but the prompt template determines how effectively the model uses that information.
Designing Effective RAP Templates
The template structure matters as much as the retrieval quality. A well-designed template clearly separates the retrieved context from the user query and gives the model explicit instructions on how to use the context:
def build_rap_prompt(
    query: str,
    retrieved_chunks: list[dict],
    system_instructions: str = "",
) -> list[dict]:
    """Build a retrieval-augmented prompt with clear context boundaries."""
    context_block = "\n\n---\n\n".join(
        f"[Source: {chunk['source']}, Relevance: {chunk['score']:.2f}]\n"
        f"{chunk['text']}"
        for chunk in retrieved_chunks
    )
    system_prompt = (
        "You are a knowledgeable assistant. Answer the user's question "
        "based ONLY on the provided context. If the context does not "
        "contain enough information to answer fully, say so explicitly. "
        "Cite the source for each claim you make.\n\n"
        f"{system_instructions}"
    )
    user_message = (
        f"## Retrieved Context\n\n{context_block}\n\n"
        f"---\n\n## Question\n\n{query}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
Three design decisions matter here: the context comes before the question so the model processes it first; each chunk carries its source and relevance score so the model can cite and weigh it; and the separator between chunks is visually distinct so the model does not blend information across sources.
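To see what the assembled user message looks like, here is a quick sketch with two made-up chunks (the sources, scores, and text are hypothetical, purely for illustration):

```python
# Hypothetical chunks as they might come back from a vector store.
chunks = [
    {"source": "handbook.md", "score": 0.91,
     "text": "Employees accrue 1.5 vacation days per month."},
    {"source": "faq.md", "score": 0.78,
     "text": "Vacation requests need two weeks' notice."},
]

# Same formatting build_rap_prompt uses for its context block.
context_block = "\n\n---\n\n".join(
    f"[Source: {c['source']}, Relevance: {c['score']:.2f}]\n{c['text']}"
    for c in chunks
)
user_message = (
    f"## Retrieved Context\n\n{context_block}\n\n"
    f"---\n\n## Question\n\nHow many vacation days do I get?"
)
print(user_message)
```

The rendered message puts both labeled sources above the question, separated by the `---` rule, exactly the shape the model receives.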
Context Window Management
The biggest practical challenge is fitting retrieved context within the model's context window while leaving room for the system prompt, user query, and generated response. You need a context budget:
import tiktoken

def manage_context_budget(
    chunks: list[dict],
    max_context_tokens: int = 6000,
    model: str = "gpt-4o",
) -> list[dict]:
    """Select chunks that fit within the token budget."""
    encoder = tiktoken.encoding_for_model(model)
    selected = []
    token_count = 0
    # Chunks are assumed pre-sorted by relevance (highest first)
    for chunk in chunks:
        chunk_tokens = len(encoder.encode(chunk["text"]))
        if token_count + chunk_tokens > max_context_tokens:
            # Try to include a truncated version of this chunk, then stop
            remaining = max_context_tokens - token_count
            if remaining > 200:
                tokens = encoder.encode(chunk["text"])[:remaining]
                chunk = {**chunk, "text": encoder.decode(tokens) + "..."}
                selected.append(chunk)
            break
        selected.append(chunk)
        token_count += chunk_tokens
    return selected
A practical budget split for a 128K-token model: reserve 1000 tokens for the system prompt, 500 for the user query, and 4000 for the expected response. That leaves roughly 122,000 tokens for context — but in practice, packing that much context degrades quality. Keeping retrieved context between 4,000 and 12,000 tokens typically produces the best results.
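If you want to prototype the selection loop without tiktoken (which fetches encoding files over the network on first use), a rough character-based estimate is enough to test the logic. The `approx_tokens` heuristic below is an assumption (~4 characters per English token), not a real tokenizer:

```python
def approx_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: ~4 characters per token.
    return max(1, len(text) // 4)

def select_chunks(chunks: list[dict], budget: int) -> list[dict]:
    """Greedy selection, same shape as manage_context_budget but tokenizer-free."""
    selected, used = [], 0
    for chunk in chunks:  # assumed pre-sorted by relevance
        cost = approx_tokens(chunk["text"])
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected

# Three ~1000-token chunks against a 2500-token budget: only two fit.
chunks = [{"text": "a" * 4000}, {"text": "b" * 4000}, {"text": "c" * 4000}]
print(len(select_chunks(chunks, 2500)))
```

Swapping the heuristic for a real encoder later changes only `approx_tokens`, not the selection logic.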
Dynamic Template Patterns
Different query types benefit from different template structures. A routing layer can select the appropriate template:
import json
from enum import Enum

from openai import OpenAI

client = OpenAI()

class QueryType(Enum):
    FACTUAL = "factual"
    COMPARISON = "comparison"
    PROCEDURAL = "procedural"
    ANALYTICAL = "analytical"

TEMPLATES = {
    QueryType.FACTUAL: (
        "Answer the question directly using the provided sources. "
        "Quote the relevant passage when possible."
    ),
    QueryType.COMPARISON: (
        "Compare and contrast the information from different sources. "
        "Organize your answer with clear sections for each item being compared."
    ),
    QueryType.PROCEDURAL: (
        "Provide step-by-step instructions based on the context. "
        "Number each step and note any prerequisites or warnings."
    ),
    QueryType.ANALYTICAL: (
        "Analyze the information from the sources to answer the question. "
        "Consider multiple perspectives and note any contradictions "
        "between sources."
    ),
}

def classify_query(query: str) -> QueryType:
    """Classify the query type to select the right template."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the query as one of: factual, comparison, "
                "procedural, analytical. Return JSON with key 'type'."
            )},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    data = json.loads(response.choices[0].message.content)
    try:
        return QueryType(data.get("type", "factual"))
    except ValueError:
        # Fall back to factual if the model returns an unexpected label
        return QueryType.FACTUAL

def build_adaptive_prompt(query: str, chunks: list[dict]) -> list[dict]:
    """Build a prompt with template selected by query type."""
    query_type = classify_query(query)
    template_instructions = TEMPLATES[query_type]
    budget_chunks = manage_context_budget(chunks)
    return build_rap_prompt(query, budget_chunks, template_instructions)
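The classifier call adds latency and can fail, so it is worth having a cheap fallback. Here is a keyword-based sketch that returns a `QueryType` value string; the keyword lists are illustrative assumptions, not tuned rules:

```python
def classify_query_heuristic(query: str) -> str:
    """Keyword fallback returning a QueryType value string.

    A sketch for when the LLM classifier is unavailable or too slow;
    the keyword lists below are illustrative, not tuned.
    """
    q = query.lower()
    if any(kw in q for kw in ("compare", " vs ", "versus", "difference between")):
        return "comparison"
    if any(kw in q for kw in ("how do i", "how to", "steps to", "set up", "install")):
        return "procedural"
    if any(kw in q for kw in ("why", "analyze", "implication", "trade-off")):
        return "analytical"
    return "factual"
```

In `build_adaptive_prompt`, the result can be wrapped back into the enum with `QueryType(classify_query_heuristic(query))`, so the rest of the pipeline is unchanged.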
Handling Missing Context Gracefully
A robust RAP system tells the model what to do when the retrieved context does not contain the answer. Without this instruction, models tend to hallucinate an answer using their training data, defeating the purpose of retrieval augmentation:
NO_CONTEXT_INSTRUCTION = (
    "If the provided context does not contain sufficient information "
    "to answer the question, respond with: 'The available sources do "
    "not contain information about this topic. Here is what I found "
    "that may be related:' followed by the most relevant partial "
    "information from the context."
)
Adding this instruction to your system prompt significantly reduces hallucination rates in production RAG systems.
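Wiring it in is a simple concatenation. In this sketch, `BASE_SYSTEM_PROMPT` is a stand-in for whatever system instructions you already use, and the constant is repeated so the snippet runs standalone:

```python
# Repeated from above so this snippet is self-contained.
NO_CONTEXT_INSTRUCTION = (
    "If the provided context does not contain sufficient information "
    "to answer the question, respond with: 'The available sources do "
    "not contain information about this topic. Here is what I found "
    "that may be related:' followed by the most relevant partial "
    "information from the context."
)

# Stand-in for your existing system instructions.
BASE_SYSTEM_PROMPT = (
    "You are a knowledgeable assistant. Answer the user's question "
    "based ONLY on the provided context."
)

system_prompt = f"{BASE_SYSTEM_PROMPT}\n\n{NO_CONTEXT_INSTRUCTION}"
```

Equivalently, the constant can be passed as the `system_instructions` argument of `build_rap_prompt`.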
FAQ
How many retrieved chunks should I include in the prompt?
Three to five highly relevant chunks is the sweet spot for most tasks. Including more chunks adds noise and can actually decrease answer quality if lower-relevance chunks contradict or dilute the useful information. Quality of retrieval matters more than quantity.
Should context go before or after the user question in the prompt?
Context before the question is the standard approach and works best for most models. The model processes context first and has it fully in working memory when it encounters the question. Some practitioners put a brief summary of the question before the context and the full question after — this can help the model read the context with the right focus.
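That question-sandwich layout can be sketched as a small helper; the one-line truncation used for the summary is a naive assumption (a real system might summarize long queries with an LLM):

```python
def build_sandwich_message(query: str, context_block: str) -> str:
    """Question-sandwich layout: brief summary, then context, then full question."""
    # Naive one-line summary by truncation; illustrative only.
    summary = query if len(query) <= 120 else query[:117] + "..."
    return (
        f"You will be asked: {summary}\n\n"
        f"## Retrieved Context\n\n{context_block}\n\n"
        f"---\n\n## Question\n\n{query}"
    )

msg = build_sandwich_message(
    "How many vacation days do I get?",
    "[Source: handbook.md]\nEmployees accrue 1.5 vacation days per month.",
)
```

The leading summary primes the model to read the context with the question in mind, while the full question at the end stays closest to the generation point.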
How do I prevent the model from using its training data instead of the retrieved context?
Use explicit instructions like "Answer ONLY based on the provided context" and "Do not use any knowledge not present in the context above." Additionally, setting temperature to 0 reduces the chance of the model improvising. In evaluation, test with questions where the correct answer from the context differs from what the model might know from training to verify compliance.
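One way to run that compliance test is to plant a deliberately counterfactual fact in the context and string-match the answer. A minimal sketch; the checker assumes short, unambiguous answer strings and is illustrative only:

```python
def grounding_check(answer: str, context_answer: str, parametric_answer: str) -> bool:
    """True if the model answered from the (counterfactual) context,
    not from its training data."""
    a = answer.lower()
    return (context_answer.lower() in a
            and parametric_answer.lower() not in a)

# Context was seeded with "Springfield" where the model's training
# data would say "Paris": a grounded answer repeats the seeded fact.
ok = grounding_check(
    "According to the sources, the capital moved to Springfield.",
    "Springfield",
    "Paris",
)
```

Run this over a small suite of seeded questions and track the pass rate as a grounding metric across prompt revisions.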
#PromptEngineering #RAG #Retrieval #ContextManagement #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.