Build a RAG Pipeline from Scratch
2h 00min
Beginner
Author: George K.
Overview

RAG (Retrieval-Augmented Generation) grounds LLM responses in your own data. You'll chunk documents, embed them into a vector store, and wire up a retrieval step that feeds relevant context into your prompts.

Prerequisites

  • Basic Python
  • Familiarity with LLM prompting
Learning Outcomes
  • Explain when RAG is the right tool and when it isn't
  • Chunk and preprocess documents for embedding
  • Generate and store embeddings in a vector index
  • Retrieve relevant chunks at query time and inject them into a prompt
  • Measure and improve retrieval quality
Steps

Step 1: Why RAG and When to Use It

RAG is the right choice when the model needs access to facts that change frequently, are too large to fit in context, or are proprietary. It's not the right choice when the task is reasoning-heavy and doesn't depend on external facts.

Step 2: Chunking and Cleaning Documents

Split documents into chunks small enough to embed meaningfully (300–600 tokens is a common starting point) but large enough to preserve context. Remove boilerplate, normalize whitespace, and keep metadata like source URL and section title with each chunk.
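The chunking step above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it counts whitespace-separated words as a rough stand-in for tokens, and the `Chunk` class, the `chunk_document` function, and the `overlap` parameter are hypothetical names and choices introduced here, not part of the course material.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """A piece of a document plus the metadata kept alongside it."""
    text: str
    source_url: str
    section_title: str
    index: int


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_document(
    text: str,
    source_url: str,
    section_title: str,
    max_tokens: int = 400,  # inside the 300-600 starting range
    overlap: int = 50,      # assumption: a small overlap between neighbours
) -> List[Chunk]:
    """Split text into overlapping windows of at most max_tokens words.

    Words are only an approximation of model tokens; swap in a real
    tokenizer if you need exact counts.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    cleaned = normalize_whitespace(text)
    if not cleaned:
        return []

    words = cleaned.split(" ")
    step = max_tokens - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        chunks.append(Chunk(" ".join(window), source_url, section_title, len(chunks)))
        # Stop once this window has reached the end of the document.
        if start + max_tokens >= len(words):
            break
    return chunks
```

Each chunk carries its source URL and section title, so a retrieved chunk can later be traced back to where it came from.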

Step 3: Generating Embeddings

Pass each chunk through an embedding model to get a dense vector. Batch requests to stay within rate limits and cache results so you don't re-embed unchanged documents on every run.

Example embedding model: text-embedding-3-small
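Batching and caching can be sketched independently of any particular provider. In the example below, `embed_batch` is a hypothetical callable that you would back with your embedding client of choice; `fake_embed_batch` is a toy stand-in so the sketch runs offline, and the plain dict used as a cache is an assumption (a real pipeline would likely persist it to disk or a database).

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def content_key(text: str) -> str:
    """Hash the chunk text so unchanged chunks map to the same cache entry."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    cache: Dict[str, Vector],
    batch_size: int = 64,
) -> List[Vector]:
    """Embed texts in batches, skipping anything already in the cache."""
    # Collect each uncached text once, even if it appears several times.
    pending: Dict[str, str] = {}
    for text in texts:
        key = content_key(text)
        if key not in cache and key not in pending:
            pending[key] = text

    keys = list(pending)
    for i in range(0, len(keys), batch_size):
        batch_keys = keys[i:i + batch_size]
        vectors = embed_batch([pending[k] for k in batch_keys])
        if len(vectors) != len(batch_keys):
            raise ValueError("embed_batch must return one vector per input")
        for key, vector in zip(batch_keys, vectors):
            cache[key] = vector

    # Return vectors in the same order as the input texts.
    return [cache[content_key(text)] for text in texts]


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Toy stand-in for a real embedding call: NOT a meaningful embedding."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]


cache: Dict[str, Vector] = {}
vectors = embed_with_cache(["first chunk", "second chunk"], fake_embed_batch, cache)
```

Keying the cache on a content hash rather than a file name means a document that is re-ingested without changes costs no new embedding calls.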

Step 4: Storing and Querying a Vector Index

Store vectors in a database that supports ANN (approximate nearest-neighbour) search. For prototyping, an in-memory index like FAISS works fine. For production, use a store that handles persistence and scale, such as a managed vector database (Pinecone, Weaviate) or Postgres with the pgvector extension.
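To make the store-and-query idea concrete, here is a minimal in-memory index written with the standard library only. Note the swap: it does exact brute-force cosine search rather than ANN, and it is not the FAISS API. The `InMemoryIndex` class and its method names are hypothetical, chosen for illustration.

```python
import math
from typing import Any, List, Sequence, Tuple


def _normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vector]


class InMemoryIndex:
    """Exact cosine-similarity search over vectors held in a Python list."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._payloads: List[Any] = []

    def add(self, vector: Sequence[float], payload: Any) -> None:
        """Store a vector with an arbitrary payload (e.g. the chunk it came from)."""
        if self._vectors and len(vector) != len(self._vectors[0]):
            raise ValueError("all vectors must have the same dimension")
        self._vectors.append(_normalize(vector))
        self._payloads.append(payload)

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[float, Any]]:
        """Return the k (score, payload) pairs most similar to the query."""
        q = _normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), payload)
            for v, payload in zip(self._vectors, self._payloads)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = InMemoryIndex()
index.add([1.0, 0.0], "chunk about x")
index.add([0.0, 1.0], "chunk about y")
index.add([1.0, 1.0], "chunk about x and y")
top = index.search([1.0, 0.1], k=2)
```

Brute-force search scans every vector on each query, which is fine for a few thousand chunks; ANN indexes trade a little accuracy for much faster lookups on larger collections.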

Step 5: Wiring Retrieval into Your Prompt

At query time, embed the user's question with the same embedding model you used for the chunks and retrieve the top-k most similar chunks. Inject them into the prompt as context, placing them before the question as a starting point. Then experiment with k and with the placement of context (before vs. after the question) to find what works best for your use case.
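The prompt assembly described above might look like the sketch below. The `build_prompt` function, the instruction wording, and the `text`/`source` dictionary keys are all assumptions made for illustration; the `context_first` flag exists so the two placements can be compared.

```python
from typing import Dict, List


def build_prompt(
    question: str,
    retrieved: List[Dict[str, str]],
    k: int = 3,
    context_first: bool = True,
) -> str:
    """Assemble a prompt from the top-k retrieved chunks and the question.

    `retrieved` is assumed to be sorted best-first, with each item holding
    a "text" and a "source" key.
    """
    top = retrieved[:k]
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(top, start=1)
    )
    # Hypothetical instruction wording; tune it for your own use case.
    instruction = (
        "Answer using only the numbered passages. "
        "If they are insufficient, say so."
    )
    context_block = "Context:\n" + context
    question_block = "Question: " + question

    if context_first:
        parts = [instruction, context_block, question_block]
    else:
        parts = [instruction, question_block, context_block]
    return "\n\n".join(parts)


retrieved = [
    {"text": "Refunds are issued within 14 days.", "source": "policy.md"},
    {"text": "Shipping takes 3 to 5 business days.", "source": "shipping.md"},
]
prompt = build_prompt("How long do refunds take?", retrieved, k=1)
```

Numbering the passages and including their source makes it easier to ask the model to cite what it used, and to check afterwards whether it did.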

Step 6: Evaluating Retrieval Quality

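Measure recall@k (did the right chunk appear in the top k results?) and MRR (mean reciprocal rank: the average of 1/rank of the right chunk across queries). High LLM answer quality with low recall@k usually means the model is answering from its own training data rather than from your documents, so the answers are ungrounded and may be hallucinated. Fix the retrieval step before tuning the generation prompt.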
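Both metrics are short to compute. The sketch below assumes each test query has exactly one correct chunk, identified by a chunk ID; the function names and the shape of `results` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """1.0 if the correct chunk is among the first k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """1/rank of the correct chunk (rank starts at 1), or 0.0 if it is missing."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(results: List[Tuple[Sequence[str], str]], k: int) -> Dict[str, float]:
    """Average both metrics over (ranked_ids, relevant_id) pairs."""
    if not results:
        raise ValueError("need at least one evaluated query")
    n = len(results)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in results) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
    }


results = [
    (["a", "b", "c"], "a"),  # correct chunk ranked first
    (["a", "b", "c"], "c"),  # correct chunk ranked third
    (["a", "b", "c"], "z"),  # correct chunk not retrieved
]
metrics = evaluate(results, k=2)
```

With these three queries and k=2, only the first query counts as a hit for recall@k, while MRR also gives partial credit (1/3) to the second query.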

What's next?

Next Mini Course

Advanced RAG – Reranking, Hybrid Search, and Agentic Retrieval
Take your RAG pipeline further: add BM25 hybrid search, cross-encoder reranking, query expansion, and agentic retrieval loops that decide when and what to retrieve based on the model's confidence.

Further Reading

  • LlamaIndex Docs: Comprehensive RAG framework and patterns
  • FAISS: Efficient similarity search library by Meta
  • pgvector: Vector similarity search extension for Postgres