RAG

Retrieval-Augmented Generation: giving LLMs access to external knowledge.

What is RAG?

Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt. This gives models access to up-to-date or specialized information.

Why Use RAG?

LLMs have knowledge cutoffs and can hallucinate. RAG grounds responses in actual documents, reducing hallucination and enabling domain-specific knowledge without fine-tuning.

The RAG Pipeline

RAG systems follow a consistent pattern: embed the query, retrieve relevant chunks, augment the prompt, and generate a response.

1. Query Embedding: Convert the user's question into a vector using an embedding model.
2. Retrieval: Search the vector database for chunks similar to the query embedding.
3. Augmentation: Insert retrieved chunks into the prompt as context.
4. Generation: The LLM generates a response grounded in the retrieved context.
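
A minimal sketch of the four stages in Python. The embed_text and generate functions are hypothetical placeholders for a real embedding model and LLM API, and a brute-force cosine-similarity search stands in for a vector database.

```python
# Minimal RAG pipeline sketch. embed_text and generate are hypothetical
# placeholders for a real embedding model and LLM call.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Placeholder: deterministic pseudo-embedding so the example runs end to end.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def generate(prompt: str) -> str:
    # Placeholder: swap in a real LLM API call.
    return f"[LLM answer grounded in a {len(prompt)}-character prompt]"

documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The periodic table organizes chemical elements.",
]
doc_vectors = np.stack([embed_text(d) for d in documents])

def rag_answer(query: str, top_k: int = 2) -> str:
    q = embed_text(query)                                          # 1. embed the query
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))   # 2. retrieve by cosine similarity
    top = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(documents[i] for i in top)                 # 3. augment the prompt
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                                        # 4. generate

print(rag_answer("What is the capital of France?"))
```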

Document Chunking

Documents are split into smaller chunks (typically 200-1000 tokens) for embedding and retrieval. Chunk size is a trade-off: smaller chunks retrieve more precisely, while larger chunks preserve more surrounding context.
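
As an illustration, a simple fixed-size chunker with overlapping windows. It counts words for simplicity; a production system would typically count tokens with the embedding model's tokenizer.

```python
# Fixed-size chunking sketch with overlapping windows (word-based for simplicity).
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):   # last window already covers the tail
            break
    return chunks

# Example: a 500-word document becomes three chunks of up to 200 words with 40-word overlap.
```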

Vector Databases

Specialized databases like Pinecone, Weaviate, or pgvector enable fast similarity search over millions of embeddings.
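
As a sketch, a nearest-neighbor query against a pgvector-backed table from Python, assuming a hypothetical docs table with a vector(384) embedding column and the psycopg2 driver. The <=> operator is pgvector's cosine-distance operator.

```python
# Hypothetical pgvector lookup. Assumes a table such as:
#   CREATE TABLE docs (id serial PRIMARY KEY, content text, embedding vector(384));
# created with the pgvector extension; the DSN and table name are placeholders.
import psycopg2

def nearest_chunks(query_vector: list[float], top_k: int = 5) -> list[str]:
    literal = "[" + ",".join(str(x) for x in query_vector) + "]"   # pgvector text format
    conn = psycopg2.connect("dbname=rag_demo")                     # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT %s",
            (literal, top_k),
        )
        return [row[0] for row in cur.fetchall()]
```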

Interactive RAG Pipeline

See how queries flow through a RAG system: a sample query ("What is the capital of France?") moves through the four stages (embed query, retrieve, augment, generate) against a small vector database of eight indexed documents covering geography, science, history, and technology.

Traditional vs Agentic RAG

Two approaches to retrieval-augmented generation with different trade-offs.

Traditional RAG

Fixed pipeline, predictable flow

  • Linear execution: query → retrieve → generate
  • Single retrieval pass, no iteration
  • Fast and predictable, easier to debug

Agentic RAG

LLM-controlled, iterative process

  • LLM decides when and what to retrieve
  • Can loop: retrieve → evaluate → re-retrieve
  • Handles complex, multi-step queries

Aspect         | Traditional RAG            | Agentic RAG
Control Flow   | Fixed pipeline             | LLM decides
Retrieval      | Single pass                | Multiple iterations
Query Handling | Used as-is                 | Can reformulate
Latency        | Fast                       | Variable
Best For       | Simple Q&A, factual lookup | Complex reasoning, multi-hop

When Each Approach Wins (or Fails)

Real-world scenarios show when Traditional RAG outperforms Agentic RAG, when the reverse is true, and when neither can help. The walkthrough below covers a simple factual lookup where Traditional RAG wins.

User Query

"What is the company's return policy?"

Process Steps

  1. Searching: embed the query "return policy"
  2. Retrieved: 1 highly relevant document found (similarity: 0.94)
  3. Generating: produce a response from the retrieved context

Retrieved Documents

[policies/returns.md] "Return Policy: Items may be returned within 30 days of purchase with original receipt. Refunds processed to original payment method within 5-7 business days. Electronics must be unopened. Sale items are final sale."

Final Response

Items can be returned within 30 days with the original receipt. Refunds are processed to your original payment method in 5-7 business days. Note that electronics must be unopened and sale items are final sale.

Why This Outcome?

For simple factual queries, Traditional RAG is more efficient. The answer exists in a single document, so the direct retrieve-then-generate pipeline works perfectly. Agentic RAG reaches the same answer but with unnecessary overhead from planning and evaluation steps—wasting time and tokens.

Agentic RAG

In agentic RAG, the LLM doesn't just receive retrieved documents—it actively controls the retrieval process. The model decides when to search, what to search for, and which retrieval tools to use.

How It Works

Instead of a fixed pipeline, the LLM is given retrieval tools it can call as needed. It might reformulate queries, search multiple times, or combine different search strategies based on the task.
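
A rough sketch of that control loop. Here plan_next_action, search_tool, and generate_answer are hypothetical stand-ins for an LLM tool-calling step, a retriever, and a final generation call.

```python
# Agentic retrieval loop sketch. plan_next_action, search_tool and generate_answer
# are hypothetical placeholders for an LLM tool-calling step and a retriever.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    evidence: list[str] = field(default_factory=list)

def plan_next_action(state: AgentState) -> dict:
    # Placeholder: in practice the LLM returns a structured decision
    # (search again with a reformulated query, or answer now).
    if not state.evidence:
        return {"action": "SEARCH", "query": state.question}
    return {"action": "ANSWER"}

def search_tool(query: str) -> list[str]:
    return [f"passage retrieved for '{query}'"]   # placeholder retriever

def generate_answer(state: AgentState) -> str:
    return f"Answer grounded in {len(state.evidence)} retrieved passages."

def agentic_rag(question: str, max_steps: int = 4) -> str:
    state = AgentState(question)
    for _ in range(max_steps):                    # hard cap to avoid endless retrieval loops
        decision = plan_next_action(state)
        if decision["action"] != "SEARCH":
            break
        state.evidence += search_tool(decision["query"])
    return generate_answer(state)

print(agentic_rag("Which suppliers were affected by the 2024 policy change?"))
```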

Advantages

  • Query refinement: The LLM can rephrase or decompose complex questions
  • Multi-hop reasoning: Chain multiple retrievals to answer complex questions
  • Adaptive search: Choose the right tool for each sub-question
  • Self-correction: Re-retrieve if initial results are insufficient

Disadvantages

  • Higher latency: Multiple LLM calls and retrievals add up
  • Increased cost: Each reasoning step costs tokens
  • Complexity: Harder to debug and predict behavior
  • Failure modes: LLM might loop, over-retrieve, or miss obvious queries

Standard RAG is simpler and faster for straightforward Q&A. Use agentic RAG when queries are complex, require multiple sources, or benefit from query reformulation.

Multi-Tool Retrieval

Give the LLM multiple retrieval tools for different use cases. This flexibility lets the model choose the best approach for each query.

Semantic Search

Vector similarity for conceptual matching. Best for: "documents about X", finding related content.

Full-Text Search

Keyword/BM25 search for exact matches. Best for: specific terms, names, codes, error messages.

SQL/Structured Query

Query structured data directly. Best for: counts, aggregations, filtering by attributes.

Knowledge Graph

Traverse entity relationships. Best for: "how is X related to Y", multi-hop facts.
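
A sketch of how such a toolbox might be exposed to the model. The tool names, descriptions, and registry shape are illustrative, not a specific framework's API; each function body is a placeholder for a real retriever.

```python
# Hypothetical retrieval toolbox: each retriever is registered with a short
# description the LLM can use to pick the right tool for a given query.
from typing import Callable

def semantic_search(query: str) -> list[str]:
    return []   # placeholder: vector similarity search

def keyword_search(query: str) -> list[str]:
    return []   # placeholder: BM25 / full-text search

def sql_query(question: str) -> list[str]:
    return []   # placeholder: text-to-SQL over structured data

def graph_lookup(question: str) -> list[str]:
    return []   # placeholder: knowledge-graph traversal

RETRIEVAL_TOOLS: dict[str, tuple[str, Callable[[str], list[str]]]] = {
    "semantic_search": ("Conceptual matches via vector similarity", semantic_search),
    "keyword_search": ("Exact terms, names, codes, error messages", keyword_search),
    "sql_query": ("Counts, aggregations, attribute filters", sql_query),
    "graph_lookup": ("Relationship and multi-hop questions", graph_lookup),
}

def call_tool(name: str, query: str) -> list[str]:
    _, fn = RETRIEVAL_TOOLS[name]   # the LLM chooses `name` via tool calling
    return fn(query)
```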

Advanced RAG Techniques

Beyond basic RAG, modern systems use sophisticated techniques to improve retrieval quality and answer accuracy and to handle complex queries. The approaches below represent the state of the art as of 2025.

Self-RAG

Self-RAG introduces self-reflection into the retrieval process. Instead of always retrieving, the model decides when retrieval is needed and critically evaluates retrieved content before using it.

How Self-RAG Works

The model generates special reflection tokens during inference: [Retrieve] to decide if retrieval is needed, [IsRel] to assess relevance of retrieved passages, [IsSup] to verify if the response is supported by the context, and [IsUse] to evaluate overall utility.

Retrieve Decision

Model decides whether the query needs external knowledge or can be answered from parametric memory alone.

Self-Critique

Retrieved passages are evaluated for relevance. Irrelevant or low-quality results are filtered before generation.

Grounded Generation

Response is generated with explicit grounding checks. The model verifies claims are supported by retrieved context.
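
A simplified sketch of this control flow. The judge functions below are hypothetical LLM calls approximating the trained reflection tokens; this illustrates the idea, not the paper's implementation.

```python
# Self-RAG-style control flow sketch. needs_retrieval, is_relevant and is_supported
# are hypothetical LLM-judge calls approximating the [Retrieve] / [IsRel] / [IsSup]
# reflection tokens; retrieve and generate are placeholders for a retriever and an LLM.
def retrieve(question: str) -> list[str]:
    return ["retrieved passage"]                 # placeholder retriever

def generate(question: str, context: list[str] | None = None) -> str:
    return "draft answer"                        # placeholder LLM call

def needs_retrieval(question: str) -> bool:
    return True                                  # placeholder [Retrieve] decision

def is_relevant(question: str, passage: str) -> bool:
    return True                                  # placeholder [IsRel] judgment

def is_supported(answer: str, passages: list[str]) -> bool:
    return True                                  # placeholder [IsSup] judgment

def self_rag(question: str) -> str:
    if not needs_retrieval(question):            # answer from parametric memory alone
        return generate(question)
    passages = [p for p in retrieve(question) if is_relevant(question, p)]
    answer = generate(question, context=passages)
    if not is_supported(answer, passages):       # grounding check failed: regenerate
        answer = generate(question, context=passages)
    return answer
```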

GraphRAG

GraphRAG combines vector similarity search with knowledge graph traversal. It builds a graph of entities and relationships from your documents, enabling both semantic search and structured reasoning.

Vector Search Layer

Traditional semantic search finds relevant document chunks. This handles the "what is similar to my query" part of retrieval.

Knowledge Graph Layer

Entities and relationships are extracted and linked. Enables multi-hop reasoning like "Find all products mentioned by companies that partnered with X".

Key Benefits

  • Better handling of questions requiring relationship reasoning
  • Improved accuracy for multi-entity queries
  • Enables global summarization across entire document collections
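
A toy illustration of the two layers working together: vector search finds seed entities, then the graph is traversed to collect multi-hop facts. The entities, relations, and vector_search helper are invented for the example.

```python
# Toy GraphRAG-style retrieval. The graph contents and vector_search are illustrative.
KNOWLEDGE_GRAPH = {
    ("AcmeCorp", "partnered_with"): ["Globex"],
    ("Globex", "sells"): ["WidgetPro", "WidgetLite"],
}

def vector_search(query: str) -> list[str]:
    return ["AcmeCorp"]   # placeholder: semantic search returning seed entities

def expand(entity: str, depth: int = 2) -> list[str]:
    """Follow outgoing relations from an entity to collect facts, up to `depth` hops."""
    if depth == 0:
        return []
    facts: list[str] = []
    for (head, relation), tails in KNOWLEDGE_GRAPH.items():
        if head != entity:
            continue
        for tail in tails:
            facts.append(f"{head} {relation} {tail}")
            facts += expand(tail, depth - 1)
    return facts

seeds = vector_search("Which products are sold by partners of AcmeCorp?")
context = [fact for seed in seeds for fact in expand(seed)]
print(context)
# ['AcmeCorp partnered_with Globex', 'Globex sells WidgetPro', 'Globex sells WidgetLite']
```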

Query Augmentation

User queries are often incomplete or poorly phrased for retrieval. Query augmentation techniques transform queries before search to improve retrieval quality.

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer first, then use that answer's embedding for retrieval. This bridges the gap between question and document embedding spaces.

Query: "climate change effects" -> Generate hypothetical doc -> Embed that -> Search

Query Decomposition

Break complex queries into simpler sub-queries. Each sub-query retrieves independently, then results are combined.

"Compare A vs B" -> "What is A?" + "What is B?" -> Merge results

Query Expansion

Add synonyms, related terms, or rephrasings to the original query. Increases recall by matching documents that use different terminology.

Query Rewriting

Use an LLM to rewrite ambiguous or conversational queries into clear, search-optimized forms. Handles pronouns, context, and implicit references.

RAG Evaluation

Measuring RAG system quality requires specialized metrics that evaluate both retrieval and generation. RAGAS (Retrieval Augmented Generation Assessment) provides a standard framework.

RAGAS Framework

RAGAS uses LLM-based evaluation to score RAG systems without requiring ground truth labels for every question. It measures multiple dimensions of quality.

Faithfulness

Does the answer only contain information from the retrieved context? Measures hallucination—claims not supported by the provided documents.

Answer Relevance

Is the answer actually addressing the question asked? A faithful answer can still be irrelevant if it misses the point.

Context Recall

Did the retrieval find all the information needed to answer? Measures if relevant passages were missed.

Context Precision

Are the retrieved passages actually relevant? High precision means less noise in the context, reducing confusion.
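
As a toy illustration of the faithfulness idea (not the RAGAS library's actual API), one can extract claims from the answer and check each against the retrieved context; in practice both steps are LLM-judge calls.

```python
# Toy faithfulness score in the spirit of RAGAS, not the library's API.
def extract_claims(answer: str) -> list[str]:
    # Placeholder: an LLM would split the answer into atomic factual claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def claim_is_supported(claim: str, contexts: list[str]) -> bool:
    # Placeholder: an LLM judge would verify entailment; here, a crude substring check.
    return any(claim.lower() in c.lower() for c in contexts)

def faithfulness(answer: str, contexts: list[str]) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(claim_is_supported(c, contexts) for c in claims)
    return supported / len(claims)   # fraction of claims grounded in the context
```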

Evaluation Best Practices

  1. Create a diverse test set covering different query types and difficulty levels
  2. Track metrics over time as you iterate on chunking, embeddings, and prompts
  3. Combine automated metrics with human evaluation for nuanced quality assessment

Key Takeaways

  1. RAG retrieves relevant documents and includes them in the prompt
  2. It reduces hallucination by grounding responses in actual sources
  3. Chunking strategy and embedding quality are critical for good retrieval
  4. RAG is often preferable to fine-tuning for adding domain knowledge
  5. Advanced techniques like Self-RAG and GraphRAG improve accuracy for complex queries
  6. Use RAGAS metrics to systematically evaluate and improve your RAG pipeline