What is RAG?
Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt. This gives models access to up-to-date or specialized information.
Why Use RAG?
LLMs have knowledge cutoffs and can hallucinate. RAG grounds responses in actual documents, reducing hallucination and enabling domain-specific knowledge without fine-tuning.
The RAG Pipeline
RAG systems follow a consistent pattern: embed the query, retrieve relevant chunks, augment the prompt, and generate a response.
Query Embedding
Convert the user's question into a vector using an embedding model.
Retrieval
Search the vector database for chunks similar to the query embedding.
Augmentation
Insert retrieved chunks into the prompt as context.
Generation
The LLM generates a response grounded in the retrieved context.
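The whole loop fits in a few lines. Below is a minimal sketch of the four stages over an in-memory store; `embed` and `llm` are placeholders for whatever embedding and chat models you use, and `chunk_vecs` is assumed to be a matrix of pre-computed chunk embeddings.

```python
import numpy as np

def cosine_sim(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a (n_chunks, dim) matrix."""
    return (chunk_vecs @ query_vec) / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )

def answer(query: str, chunks: list[str], chunk_vecs: np.ndarray,
           embed, llm, top_k: int = 3) -> str:
    # 1. Query embedding
    q_vec = np.asarray(embed(query))
    # 2. Retrieval: take the top-k most similar chunks
    scores = cosine_sim(q_vec, chunk_vecs)
    top_idx = np.argsort(scores)[::-1][:top_k]
    # 3. Augmentation: insert the retrieved chunks into the prompt
    context = "\n\n".join(chunks[i] for i in top_idx)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    # 4. Generation
    return llm(prompt)
```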
Document Chunking
Documents are split into smaller chunks (typically 200-1000 tokens) for embedding and retrieval. Chunk size is a trade-off: smaller chunks produce more precise matches but can lose surrounding context, while larger chunks preserve context but dilute relevance.
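A minimal sketch of fixed-size chunking with overlap is shown below. It counts whitespace words as a rough stand-in for tokens; a production pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Uses whitespace words as a rough stand-in for tokens; a production pipeline
    would count tokens with the embedding model's tokenizer instead.
    """
    words = text.split()
    chunks: list[str] = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```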
Vector Databases
Specialized databases like Pinecone, Weaviate, or pgvector enable fast similarity search over millions of embeddings.
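As an illustration, here is roughly what a similarity query against pgvector could look like from Python. The DSN and the table layout (`chunks(content text, embedding vector(1536))`) are assumptions made for this sketch, not part of any fixed schema.

```python
import psycopg  # psycopg 3

def search_chunks(query_vec: list[float], dsn: str, top_k: int = 5) -> list[str]:
    # pgvector expects a vector literal like '[0.1,0.2,...]'
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with psycopg.connect(dsn) as conn:
        rows = conn.execute(
            # <=> is pgvector's cosine-distance operator (smaller = more similar)
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, top_k),
        ).fetchall()
    return [content for (content,) in rows]
```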
Traditional vs Agentic RAG
Two approaches to retrieval-augmented generation with different trade-offs.
Traditional RAG
Fixed pipeline, predictable flow
- Linear execution: query → retrieve → generate
- Single retrieval pass, no iteration
- Fast and predictable, easier to debug
Agentic RAG
LLM-controlled, iterative process
- LLM decides when and what to retrieve
- Can loop: retrieve → evaluate → re-retrieve
- Handles complex, multi-step queries
| Aspect | Traditional RAG | Agentic RAG |
|---|---|---|
| Control Flow | Fixed pipeline | LLM decides |
| Retrieval | Single pass | Multiple iterations |
| Query Handling | Used as-is | Can reformulate |
| Latency | Fast | Variable |
| Best For | Simple Q&A, factual lookup | Complex reasoning, multi-hop |
When Each Approach Wins (or Fails)
The following scenario shows a case where Traditional RAG outperforms Agentic RAG; the reverse holds for complex, multi-hop questions, and neither approach helps when the answer simply is not in the knowledge base.
User Query
"What is the company's return policy?"
Process Steps
1. Embed the query: "return policy"
2. Retrieve: 1 highly relevant document found (similarity: 0.94)
3. Generate a response from the retrieved context
Retrieved Documents
[policies/returns.md] "Return Policy: Items may be returned within 30 days of purchase with original receipt. Refunds processed to original payment method within 5-7 business days. Electronics must be unopened. Sale items are final sale."
Final Response
Items can be returned within 30 days with the original receipt. Refunds are processed to your original payment method in 5-7 business days. Note that electronics must be unopened and sale items are final sale.
Why This Outcome?
For simple factual queries, Traditional RAG is more efficient. The answer exists in a single document, so the direct retrieve-then-generate pipeline works perfectly. Agentic RAG reaches the same answer but with unnecessary overhead from planning and evaluation steps, wasting time and tokens.
Agentic RAG
In agentic RAG, the LLM doesn't just receive retrieved documents; it actively controls the retrieval process. The model decides when to search, what to search for, and which retrieval tools to use.
How It Works
Instead of a fixed pipeline, the LLM is given retrieval tools it can call as needed. It might reformulate queries, search multiple times, or combine different search strategies based on the task.
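A bare-bones version of that control loop might look like the sketch below. `llm_decide` and `search` are hypothetical stand-ins for a tool-calling model and a retriever; the step budget keeps the agent from looping forever.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str   # "search" or "answer"
    content: str  # a search query, or the final answer text

def agentic_rag(question: str, llm_decide, search, max_steps: int = 4) -> str:
    notes: list[str] = []
    for _ in range(max_steps):
        step = llm_decide(question, notes)  # LLM sees the question plus evidence so far
        if step.action == "answer":
            return step.content
        # The LLM chose to retrieve, and also chose (or reformulated) the search query
        notes.extend(search(step.content))
    # Step budget exhausted: force a final answer from whatever was gathered
    return llm_decide(question, notes, force_answer=True).content
```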
Advantages
- Query refinement: the LLM can rephrase or decompose complex questions
- Multi-hop reasoning: chain multiple retrievals to answer complex questions
- Adaptive search: choose the right tool for each sub-question
- Self-correction: re-retrieve if initial results are insufficient
Disadvantages
- Higher latency: multiple LLM calls and retrievals add up
- Increased cost: each reasoning step costs tokens
- Complexity: harder to debug and predict behavior
- Failure modes: the LLM might loop, over-retrieve, or miss obvious queries
Standard RAG is simpler and faster for straightforward Q&A. Use agentic RAG when queries are complex, require multiple sources, or benefit from query reformulation.
Multi-Tool Retrieval
Give the LLM multiple retrieval tools for different use cases. This flexibility lets the model choose the best approach for each query.
Semantic Search
Vector similarity for conceptual matching. Best for: "documents about X", finding related content.
Full-Text Search
Keyword/BM25 search for exact matches. Best for: specific terms, names, codes, error messages.
SQL/Structured Query
Query structured data directly. Best for: counts, aggregations, filtering by attributes.
Knowledge Graph
Traverse entity relationships. Best for: "how is X related to Y", multi-hop facts.
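One way to wire this up is a simple registry mapping tool names to retrieval functions; which tool the LLM picks for a given query would normally come from a function/tool-calling API. Every name and stub body below is an illustrative placeholder.

```python
from typing import Callable

def semantic_search(q: str) -> list[str]:
    return []  # vector-similarity search would go here

def keyword_search(q: str) -> list[str]:
    return []  # BM25 / full-text search would go here

def sql_query(q: str) -> list[str]:
    return []  # structured query against a database would go here

def graph_lookup(q: str) -> list[str]:
    return []  # knowledge-graph traversal would go here

TOOLS: dict[str, Callable[[str], list[str]]] = {
    "semantic_search": semantic_search,  # conceptual similarity
    "keyword_search": keyword_search,    # exact terms, names, codes, error messages
    "sql_query": sql_query,              # counts, aggregations, attribute filters
    "graph_lookup": graph_lookup,        # entity relationships, multi-hop facts
}

def run_tool(name: str, query: str) -> list[str]:
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](query)
```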
Advanced RAG Techniques
Beyond basic RAG, modern systems use sophisticated techniques to improve retrieval quality, answer accuracy, and handle complex queries. The approaches below represent the current state of the art.
Self-RAG
Self-RAG introduces self-reflection into the retrieval process. Instead of always retrieving, the model decides when retrieval is needed and critically evaluates retrieved content before using it.
How Self-RAG Works
The model generates special reflection tokens during inference: [Retrieve] to decide if retrieval is needed, [IsRel] to assess relevance of retrieved passages, [IsSup] to verify if the response is supported by the context, and [IsUse] to evaluate overall utility.
Retrieve Decision
Model decides whether the query needs external knowledge or can be answered from parametric memory alone.
Self-Critique
Retrieved passages are evaluated for relevance. Irrelevant or low-quality results are filtered before generation.
Grounded Generation
Response is generated with explicit grounding checks. The model verifies claims are supported by retrieved context.
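The sketch below approximates that control flow without the trained reflection tokens, using hypothetical yes/no judge calls (`llm_yes_no`) in place of [Retrieve], [IsRel], and [IsSup]; `llm` and `retrieve` are stand-ins for your generator and retriever.

```python
def self_rag(question: str, llm, llm_yes_no, retrieve) -> str:
    # [Retrieve]: does this question need external knowledge at all?
    if not llm_yes_no(f"Does answering this require looking up documents?\n{question}"):
        return llm(question)

    passages = retrieve(question)
    # [IsRel]: keep only passages judged relevant to the question
    relevant = [p for p in passages
                if llm_yes_no(f"Is this passage relevant to '{question}'?\n{p}")]

    context = "\n\n".join(relevant)
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}")

    # [IsSup]: regenerate if the answer is not supported by the retrieved context
    if not llm_yes_no("Is this answer fully supported by the context?\n"
                      f"Context:\n{context}\n\nAnswer:\n{answer}"):
        answer = llm("Answer strictly from the context; say so if the context is insufficient.\n"
                     f"Context:\n{context}\n\nQuestion: {question}")
    return answer
```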
GraphRAG
GraphRAG combines vector similarity search with knowledge graph traversal. It builds a graph of entities and relationships from your documents, enabling both semantic search and structured reasoning.
Vector Search Layer
Traditional semantic search finds relevant document chunks. This handles the "what is similar to my query" part of retrieval.
Knowledge Graph Layer
Entities and relationships are extracted and linked. Enables multi-hop reasoning like "Find all products mentioned by companies that partnered with X".
Key Benefits
- Better handling of questions requiring relationship reasoning
- Improved accuracy for multi-entity queries
- Enables global summarization across entire document collections
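A rough sketch of the two layers working together: a vector search finds entry-point entities, then the knowledge graph is traversed for related facts. It uses networkx for the graph; `vector_search` and the edge attribute `relation` are assumptions for the sketch.

```python
import networkx as nx

def graph_rag_retrieve(question: str, graph: nx.DiGraph, vector_search,
                       hops: int = 2) -> list[str]:
    # 1. Vector layer: find entities semantically close to the question
    seed_entities = vector_search(question)  # e.g. ["CompanyX", "ProductY"]

    # 2. Graph layer: collect relationship facts within `hops` edges of each seed
    facts: list[str] = []
    for entity in seed_entities:
        if entity not in graph:
            continue
        neighborhood = nx.ego_graph(graph, entity, radius=hops)
        for u, v, data in neighborhood.edges(data=True):
            facts.append(f"{u} --{data.get('relation', 'related_to')}--> {v}")
    return facts
```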
Query Augmentation
User queries are often incomplete or poorly phrased for retrieval. Query augmentation techniques transform queries before search to improve retrieval quality.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer first, then use that answer's embedding for retrieval. This bridges the gap between question and document embedding spaces.
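A minimal HyDE sketch, with `llm`, `embed`, and `vector_search` as stand-ins for your generation model, embedding model, and vector store:

```python
def hyde_retrieve(question: str, llm, embed, vector_search, top_k: int = 5) -> list[str]:
    # 1. Generate a hypothetical (possibly imperfect) answer to the question
    hypothetical_doc = llm(
        f"Write a short passage that plausibly answers this question:\n{question}"
    )
    # 2. Embed the hypothetical answer instead of the raw question
    hyde_vec = embed(hypothetical_doc)
    # 3. Retrieve real documents whose embeddings are close to that answer
    return vector_search(hyde_vec, top_k=top_k)
```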
Query Decomposition
Break complex queries into simpler sub-queries. Each sub-query retrieves independently, then results are combined.
Query Expansion
Add synonyms, related terms, or rephrasings to the original query. Increases recall by matching documents that use different terminology.
Query Rewriting
Use an LLM to rewrite ambiguous or conversational queries into clear, search-optimized forms. Handles pronouns, context, and implicit references.
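Decomposition and rewriting are each typically a single LLM call. The sketch below shows one possible prompting scheme; the prompts and the `llm` callable are illustrative, not a fixed API.

```python
def rewrite_query(conversational_query: str, chat_history: str, llm) -> str:
    """Turn a conversational follow-up ('what about electronics?') into a standalone query."""
    return llm(
        "Rewrite the final user question as a standalone, search-friendly query.\n"
        f"Conversation:\n{chat_history}\n\nQuestion: {conversational_query}"
    )

def decompose_query(complex_query: str, llm) -> list[str]:
    """Split a multi-part question into independent sub-queries, one per line."""
    response = llm(
        f"Break this question into simple sub-questions, one per line:\n{complex_query}"
    )
    return [line.strip() for line in response.splitlines() if line.strip()]
```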
RAG Evaluation
Measuring RAG system quality requires specialized metrics that evaluate both retrieval and generation. RAGAS (Retrieval Augmented Generation Assessment) provides a standard framework.
RAGAS Framework
RAGAS uses LLM-based evaluation to score RAG systems without requiring ground truth labels for every question. It measures multiple dimensions of quality.
Faithfulness
Does the answer only contain information from the retrieved context? Measures hallucination: claims not supported by the provided documents.
Answer Relevance
Is the answer actually addressing the question asked? A faithful answer can still be irrelevant if it misses the point.
Context Recall
Did the retrieval find all the information needed to answer? Measures if relevant passages were missed.
Context Precision
Are the retrieved passages actually relevant? High precision means less noise in the context, reducing confusion.
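As a concrete starting point, here is roughly what an evaluation run looks like with the ragas library, assuming its 0.1-style API and column conventions (which may differ in newer releases); the single evaluation record reuses the return-policy example from earlier.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation record: question, generated answer, retrieved contexts, reference answer
eval_data = Dataset.from_dict({
    "question": ["What is the return window?"],
    "answer": ["Items may be returned within 30 days with the original receipt."],
    "contexts": [["Return Policy: Items may be returned within 30 days of purchase "
                  "with original receipt."]],
    "ground_truth": ["Items may be returned within 30 days of purchase with original receipt."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 0.95, ...}
```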
Evaluation Best Practices
1. Create a diverse test set covering different query types and difficulty levels
2. Track metrics over time as you iterate on chunking, embeddings, and prompts
3. Combine automated metrics with human evaluation for nuanced quality assessment
Key Takeaways
1. RAG retrieves relevant documents and includes them in the prompt
2. It reduces hallucination by grounding responses in actual sources
3. Chunking strategy and embedding quality are critical for good retrieval
4. RAG is often preferable to fine-tuning for adding domain knowledge
5. Advanced techniques like Self-RAG and GraphRAG improve accuracy for complex queries
6. Use RAGAS metrics to systematically evaluate and improve your RAG pipeline