RAG (Retrieval-Augmented Generation) grounds LLM responses in your own data. You'll chunk documents, embed them into a vector store, and wire up a retrieval step that feeds relevant context into your prompts.
Prerequisites
- Basic Python
- Familiarity with LLM prompting
Learning Outcomes
- Explain when RAG is the right tool and when it isn't
- Chunk and preprocess documents for embedding
- Generate and store embeddings in a vector index
- Retrieve relevant chunks at query time and inject them into a prompt
- Measure and improve retrieval quality
Step 1: Why RAG and When to Use It
RAG is the right choice when the model needs access to facts that change frequently, are too large to fit in context, or are proprietary. It's not the right choice when the task is reasoning-heavy and doesn't depend on external facts.
Step 2: Chunking and Cleaning Documents
Split documents into chunks small enough to embed meaningfully (300–600 tokens is a common starting point) but large enough to preserve context. Remove boilerplate, normalize whitespace, and keep metadata like source URL and section title with each chunk.
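The chunking step above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it approximates tokens with whitespace-separated words (a real pipeline would likely count tokens with the embedding model's tokenizer), and the `Chunk` class, the `chunk_document` function, and the `overlap` parameter are hypothetical names and choices introduced here.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    source_url: str      # metadata kept alongside every chunk
    section_title: str


def clean(text: str) -> str:
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


def chunk_document(text: str, source_url: str, section_title: str,
                   max_tokens: int = 400, overlap: int = 50) -> List[Chunk]:
    # Assumption: one whitespace-separated word stands in for one token.
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = clean(text).split()
    step = max_tokens - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        chunks.append(Chunk(" ".join(window), source_url, section_title))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

Overlapping consecutive chunks by a few dozen words is one common way to avoid cutting a sentence's context in half at a chunk boundary; whether it helps depends on your documents.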
Step 3: Generating Embeddings
Pass each chunk through an embedding model to get a dense vector. Batch requests to stay within rate limits and cache results so you don't re-embed unchanged documents on every run.
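One way to combine batching and caching is sketched below. The `toy_embed_batch` function is a placeholder that derives numbers from a hash so the example runs offline; it is not a semantic embedding, and in practice you would swap in a call to your embedding provider. The `embed_with_cache` helper, the dictionary cache, and the `batch_size` default are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def toy_embed_batch(texts: List[str]) -> List[Vector]:
    # Placeholder for a real embedding API call. NOT semantically meaningful:
    # it just turns the first 8 bytes of a SHA-256 digest into floats.
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([b / 255.0 for b in digest[:8]])
    return vectors


def embed_with_cache(texts: List[str],
                     embed_batch: Callable[[List[str]], List[Vector]],
                     cache: Dict[str, Vector],
                     batch_size: int = 32) -> List[Vector]:
    # Key the cache by content hash so unchanged chunks are never re-embedded.
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache:
            missing[key] = text  # dict also deduplicates repeated texts
    pending = list(missing.items())
    # Send only the uncached texts, one batch at a time.
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector
    return [cache[key] for key in keys]
```

An in-memory dictionary only lasts for one run. To skip re-embedding across runs, the same keys could be persisted to disk or a database.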
Step 4: Storing and Querying a Vector Index
Store vectors in a database that supports ANN (approximate nearest-neighbour) search. For prototyping, an in-memory index like FAISS works fine. For production, use a store that handles persistence and scale, such as Pinecone, Weaviate, or Postgres with the pgvector extension.
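To make the store-and-query idea concrete, here is a small sketch of an exact, brute-force cosine-similarity index in plain Python. It is deliberately not ANN: it scans every vector, which is the baseline that libraries such as FAISS speed up or approximate. The `TinyVectorIndex` class and its methods are hypothetical names, not any library's API.

```python
import math
from typing import Any, List, Tuple


def _normalize(vector: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vector]


class TinyVectorIndex:
    """Exact nearest-neighbour search by cosine similarity (illustrative only)."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._payloads: List[Any] = []

    def add(self, vector: List[float], payload: Any) -> None:
        # Store unit-length vectors so a dot product equals cosine similarity.
        self._vectors.append(_normalize(vector))
        self._payloads.append(payload)

    def search(self, query: List[float], k: int = 3) -> List[Tuple[float, Any]]:
        q = _normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), payload)
            for v, payload in zip(self._vectors, self._payloads)
        ]
        # Highest similarity first; keep only the top k.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A linear scan like this is usually fine for a few thousand chunks. Beyond that, an ANN index trades a small amount of recall for much faster queries.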
Step 5: Wiring Retrieval into Your Prompt
At query time, embed the user's question and retrieve the top-k most similar chunks. Inject them into the prompt as context, typically before the question. Experiment with k and the placement of context (before vs. after the question) to find what works best for your use case.
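The prompt-assembly step might look something like the sketch below, which takes chunks that have already been retrieved and lets you flip the context placement. The `build_prompt` function, the numbered `[1]`, `[2]` labels, and the instruction wording are all assumptions chosen for illustration; adjust them to your model and task.

```python
from typing import List


def build_prompt(question: str, chunks: List[str],
                 context_first: bool = True) -> str:
    # Number each chunk so the model (and you) can refer back to sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    context_block = f"Context:\n{context}"
    question_block = f"Question: {question}"
    # Hypothetical instruction text; tune the wording for your use case.
    instruction = (
        "Answer using only the context. "
        "If the context is insufficient, say so."
    )
    if context_first:
        parts = [instruction, context_block, question_block]
    else:
        parts = [instruction, question_block, context_block]
    return "\n\n".join(parts)


if __name__ == "__main__":
    retrieved = ["Refunds are issued within 14 days.", "Shipping takes 3 days."]
    print(build_prompt("How long do refunds take?", retrieved))
```

Keeping placement behind a single flag makes it straightforward to compare both layouts against the same evaluation set.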
Step 6: Evaluating Retrieval Quality
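Measure recall@k (did the right chunk appear in the top k results?) and MRR (mean reciprocal rank: the average, across queries, of 1/rank of the first relevant chunk). High answer quality alongside low recall@k usually means the model is answering from its training data or hallucinating rather than using the retrieved context. Fix the retrieval step before tuning the generation prompt.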
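Both metrics can be computed in a few lines, as in this sketch. It uses the definition of recall@k given above (a query counts as a hit if any relevant chunk appears in its top k), and it assumes you have a labelled set of queries with known relevant chunk IDs. The function names and the ID format are hypothetical.

```python
from typing import List, Set


def recall_at_k(retrieved: List[List[str]], relevant: List[Set[str]],
                k: int) -> float:
    # Fraction of queries with at least one relevant chunk in the top k.
    if not retrieved:
        raise ValueError("need at least one query to evaluate")
    hits = sum(
        1 for ids, rel in zip(retrieved, relevant)
        if any(chunk_id in rel for chunk_id in ids[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: List[List[str]],
                         relevant: List[Set[str]]) -> float:
    # Average of 1/rank of the first relevant chunk; 0 if none was retrieved.
    if not retrieved:
        raise ValueError("need at least one query to evaluate")
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


if __name__ == "__main__":
    # Three queries: relevant chunk ranked 1st, ranked 3rd, and never retrieved.
    retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
    relevant = [{"a"}, {"f"}, {"z"}]
    print(recall_at_k(retrieved, relevant, k=3))
    print(mean_reciprocal_rank(retrieved, relevant))
```

In the small example at the bottom, two of three queries are hits at k=3, and the reciprocal ranks are 1, 1/3, and 0.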
Next Mini Course
Advanced RAG – Reranking, Hybrid Search, and Agentic Retrieval – Take your RAG pipeline further: add BM25 hybrid search, cross-encoder reranking, query expansion, and agentic retrieval loops that decide when and what to retrieve based on the model's confidence.
Further Reading
- LlamaIndex Docs – Comprehensive RAG framework and patterns
- FAISS – Efficient similarity search library by Meta
- pgvector – Vector similarity search extension for Postgres