What Is RAG: Retrieval-Augmented Generation Explained from Scratch
Understand what Retrieval-Augmented Generation is, why it exists, how the core architecture works, and when to choose RAG over fine-tuning for grounding LLM responses in your own data.
The Problem RAG Solves
Large language models are trained on static datasets with a fixed knowledge cutoff. When you ask an LLM about your company's internal documentation, last week's product changelog, or a proprietary research paper, it either hallucinates an answer or admits it does not know. Fine-tuning can inject new knowledge, but it is expensive, slow, and the model still cannot access information that changes daily.
Retrieval-Augmented Generation (RAG) solves this by giving the LLM a search engine at inference time. Instead of relying solely on parametric memory baked into the model's weights, RAG retrieves relevant documents from an external knowledge base and passes them into the prompt as context. The model then generates an answer grounded in those documents.
Core Architecture
A RAG system has two phases that execute in sequence for every user query:
Phase 1 — Retrieval: The user's question is converted into a vector embedding, then compared against pre-indexed document embeddings in a vector database. The top-k most similar chunks are returned.
Phase 2 — Generation: The retrieved chunks are injected into the LLM prompt alongside the original question. The model synthesizes an answer using only (or primarily) the provided context.
The data flow looks like this:
User Query
|
v
Embedding Model --> Query Vector
|
v
Vector Database (similarity search)
|
v
Top-K Document Chunks
|
v
LLM Prompt = System Instructions + Retrieved Context + User Query
|
v
Generated Answer (grounded in retrieved documents)
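The prompt-assembly step at the bottom of this flow can be sketched in plain Python. This is a minimal illustration of the "System Instructions + Retrieved Context + User Query" composition; the instruction text and chunk strings are illustrative placeholders, not part of any library API:

```python
def build_rag_prompt(system_instructions, retrieved_chunks, user_query):
    """Combine system instructions, retrieved context, and the user's
    question into a single grounded prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        f"{system_instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

prompt = build_rag_prompt(
    "Answer using only the context below.",
    ["Chunk A: Refunds are processed within 14 days.",
     "Chunk B: Enterprise plans include priority support."],
    "How long do refunds take?",
)
print(prompt)
```

Everything the model sees at generation time is inside this one string, which is why retrieval quality directly bounds answer quality.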
Offline Ingestion Pipeline
Before queries can be answered, documents must be preprocessed and indexed. This happens offline:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
documents = load_your_documents()  # PDFs, markdown, HTML, etc.

# 2. Chunk them
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")
```
Each chunk gets an embedding vector. These vectors are stored in the vector database alongside the original text.
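The retrieval step over those stored vectors is nearest-neighbor search, typically by cosine similarity. A minimal in-memory sketch (the three-dimensional vectors here are hand-written stand-ins for real embedding-model output, which has hundreds or thousands of dimensions):

```python
import math

# Toy store: each entry keeps the embedding vector next to the original
# chunk text, mirroring what a vector database does at scale.
store = [
    ([0.9, 0.1, 0.0], "Refunds for enterprise plans take 14 days."),
    ([0.1, 0.8, 0.1], "Our API rate limit is 100 requests per minute."),
    ([0.2, 0.1, 0.9], "Support is available 24/7 via chat."),
]

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# A query vector pointing in the "refunds" direction retrieves that chunk first.
results = top_k([1.0, 0.0, 0.1], k=2)
print(results[0])  # the refunds chunk ranks highest
```

Production vector databases replace the linear scan with approximate nearest-neighbor indexes, but the retrieval contract is the same: vector in, top-k texts out.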
RAG vs Fine-Tuning: When to Use Which
The decision between RAG and fine-tuning depends on five factors:
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data changes frequently | Excellent — re-index new docs | Poor — retrain required |
| Need citations / sources | Built-in — retrieved docs are traceable | Not possible |
| Domain-specific style/tone | Weaker — model writes in its default style | Strong — model learns the style |
| Latency budget | Higher — retrieval adds 100-500ms | Lower — single model call |
| Cost | Lower — no GPU training costs | Higher — compute for training |
Use RAG when your knowledge base changes often, you need source attribution, or you cannot afford fine-tuning compute. Use fine-tuning when you need the model to adopt a specific writing style or deeply understand a narrow domain.
In practice, many production systems combine both: fine-tune for tone and format, then use RAG for factual grounding.
A Minimal RAG Query in Python
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load the pre-built vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve
query = "What is our refund policy for enterprise plans?"
docs = retriever.invoke(query)
context = "\n\n".join(doc.page_content for doc in docs)

# Generate
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = f"""Answer the question based ONLY on the following context.
If the context does not contain the answer, say "I don't have that information."

Context:
{context}

Question: {query}"""
response = llm.invoke(prompt)
print(response.content)
```
Common Pitfalls
Chunks too large: The model gets flooded with irrelevant text and misses the key passage. Keep chunks between 256 and 1024 tokens.
No overlap between chunks: Important information that spans a chunk boundary gets split and lost. Use 10-15% overlap.
Ignoring retrieval quality: Teams focus on the generation model but the answer quality ceiling is set by retrieval. If the right document is not retrieved, no model can produce a correct answer.
FAQ
How is RAG different from just pasting documents into a prompt?
RAG is selective — it retrieves only the most relevant chunks rather than dumping entire documents into the context window. This keeps costs low, avoids hitting token limits, and reduces noise so the model focuses on pertinent information.
Can RAG work with open-source LLMs or only OpenAI models?
RAG is model-agnostic. The retrieval phase uses embedding models (which can be open-source like sentence-transformers) and the generation phase works with any LLM — Llama, Mistral, Gemma, or any other model that accepts a text prompt.
When should I NOT use RAG?
Skip RAG when all the knowledge the model needs is already in its training data (general knowledge tasks), when you need sub-50ms latency with no room for a retrieval step, or when your use case is purely generative (creative writing, brainstorming) with no factual grounding requirement.
#RAG #RetrievalAugmentedGeneration #LLM #VectorSearch #AIArchitecture #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.