Practical AI Thessaloniki Meetup

Courses

Build a RAG Pipeline from Scratch

2h 00min

Beginner

Author: George K.

Overview

RAG (Retrieval-Augmented Generation) grounds LLM responses in your own data. You'll chunk documents, embed them into a vector store, and wire up a retrieval step that feeds relevant context into your prompts.

Prerequisites

Basic Python
Familiarity with LLM prompting

Learning Outcomes

Explain when RAG is the right tool and when it isn't
Chunk and preprocess documents for embedding
Generate and store embeddings in a vector index
Retrieve relevant chunks at query time and inject them into a prompt
Measure and improve retrieval quality

Steps

Step 1: Why RAG and When to Use It

RAG is the right choice when the model needs access to facts that change frequently, are too large to fit in context, or are proprietary. It's not the right choice when the task is reasoning-heavy and doesn't depend on external facts.

Step 2: Chunking and Cleaning Documents

Split documents into chunks small enough to embed meaningfully (300–600 tokens is a common starting point) but large enough to preserve context. Remove boilerplate, normalize whitespace, and keep metadata like source URL and section title with each chunk.
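The chunking described above can be sketched roughly as follows. This is a minimal illustration, not the course's reference implementation: it counts whitespace-separated words as a crude stand-in for tokens, and the names `Chunk`, `normalize_whitespace`, `chunk_document`, and `max_words` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """A piece of a document plus the metadata we want to keep with it."""
    text: str
    source_url: str
    section_title: str


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_document(text: str, source_url: str, section_title: str,
                   max_words: int = 400) -> List[Chunk]:
    """Split text into consecutive chunks of at most max_words words.

    Assumption: words approximate tokens. A real pipeline would count
    tokens with the tokenizer of its embedding model.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = normalize_whitespace(text).split()
    return [
        Chunk(
            text=" ".join(words[start:start + max_words]),
            source_url=source_url,
            section_title=section_title,
        )
        for start in range(0, len(words), max_words)
    ]
```

Splitting on a fixed word count can cut sentences in half. Splitting on paragraph or heading boundaries first, then applying the size limit, is a common refinement.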

Step 3: Generating Embeddings

Pass each chunk through an embedding model to get a dense vector. Batch requests to stay within rate limits and cache results so you don't re-embed unchanged documents on every run.

Example embedding model: text-embedding-3-small
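Batching and caching can be sketched independently of any particular embedding provider. In the example below, `embed_batch` is a hypothetical callable that you would back with your embedding API of choice, and `toy_embed_batch` is a deterministic stand-in, not a real model, included only so the sketch runs without network access.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def content_key(text: str) -> str:
    """Cache key derived from the chunk text, so unchanged chunks hit the cache."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts: List[str],
                     embed_batch: Callable[[List[str]], List[Vector]],
                     cache: Dict[str, Vector],
                     batch_size: int = 64) -> List[Vector]:
    """Embed texts in batches, skipping anything already in the cache."""
    missing: List[str] = []
    seen = set()
    for text in texts:
        key = content_key(text)
        if key not in cache and key not in seen:
            seen.add(key)
            missing.append(text)

    # One call per batch keeps request counts low and predictable.
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        for text, vector in zip(batch, embed_batch(batch)):
            cache[content_key(text)] = vector

    return [cache[content_key(text)] for text in texts]


def toy_embed_batch(batch: List[str]) -> List[Vector]:
    """Stand-in embedder (assumption): pseudo-vectors from hash bytes.

    The vectors carry no semantic meaning; swap in a real model call.
    """
    return [
        [byte / 255.0 for byte in hashlib.sha256(text.encode("utf-8")).digest()[:8]]
        for text in batch
    ]
```

Here the cache is a plain dictionary. Persisting it to disk or a key-value store is what avoids re-embedding across runs.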

Step 4: Storing and Querying a Vector Index

Store vectors in a database that supports ANN (approximate nearest-neighbour) search. For prototyping, an in-memory index like FAISS works fine. For production, use a dedicated store (such as Pinecone, Weaviate, or pgvector on PostgreSQL) that handles persistence and scale.
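To make the store-and-query idea concrete without extra dependencies, here is a small in-memory index. Note the assumption: it performs exact brute-force cosine search, not ANN, so it only illustrates the interface that a library such as FAISS or a vector database would provide at scale. The class name `BruteForceIndex` is hypothetical.

```python
import math
from typing import Any, List, Tuple


def _unit(vector: List[float]) -> List[float]:
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class BruteForceIndex:
    """Exact nearest-neighbour search over a list of vectors.

    Stand-in for an ANN index: fine for a few thousand vectors,
    too slow for large collections.
    """

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._payloads: List[Any] = []

    def add(self, vector: List[float], payload: Any) -> None:
        """Store a normalized vector together with its chunk or ID."""
        self._vectors.append(_unit(vector))
        self._payloads.append(payload)

    def search(self, query: List[float], k: int = 3) -> List[Tuple[float, Any]]:
        """Return up to k (similarity, payload) pairs, most similar first."""
        q = _unit(query)
        scored = [
            (sum(a * b for a, b in zip(q, vector)), payload)
            for vector, payload in zip(self._vectors, self._payloads)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Keeping the `add` and `search` interface narrow makes it easier to swap the prototype for a real ANN backend later.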

Step 5: Wiring Retrieval into Your Prompt

At query time, embed the user's question and retrieve the top-k most similar chunks. Inject them into the prompt as context before the question. Experiment with k and the placement of context (before vs. after the question) to find what works best for your use case.
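The injection step is mostly string assembly. The sketch below assumes the top-k chunks have already been retrieved as plain strings. The function name `build_prompt`, the `context_first` flag, and the instruction wording are all illustrative choices rather than a prescribed template.

```python
from typing import List


def build_prompt(question: str, chunks: List[str], context_first: bool = True) -> str:
    """Assemble a prompt from retrieved chunks and the user's question.

    context_first toggles whether the context block comes before or
    after the question, so both placements can be compared.
    """
    # Number the chunks so the model (and you) can refer back to them.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    context_block = f"Context:\n{context}"
    question_block = f"Question: {question}"
    # Assumed instruction wording; adjust to your use case.
    instruction = (
        "Answer using only the context. "
        "If the context is insufficient, say so."
    )
    if context_first:
        parts = [instruction, context_block, question_block]
    else:
        parts = [instruction, question_block, context_block]
    return "\n\n".join(parts)
```

Calling `build_prompt(question, chunks, context_first=False)` gives the alternative placement, so the two variants can be evaluated on the same set of questions.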

Step 6: Evaluating Retrieval Quality

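Measure recall@k (did the right chunk appear in the top k results?) and MRR (mean reciprocal rank: the average of 1/rank of the first correct chunk across queries). High answer quality combined with low recall@k usually means the model is answering from its own training data, or hallucinating, rather than using the retrieved context. Fix the retrieval step before tuning the generation prompt.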
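Both metrics are short to compute once you have a labelled set of queries. The sketch below assumes exactly one correct chunk ID per query, matching the "right chunk" framing above. The function names and the list-of-IDs input format are illustrative.

```python
from typing import List


def recall_at_k(retrieved: List[List[str]], relevant: List[str], k: int) -> float:
    """Fraction of queries whose correct chunk ID appears in the top k results.

    retrieved[i] is the ranked list of chunk IDs for query i;
    relevant[i] is the single correct chunk ID for that query.
    """
    if not relevant:
        raise ValueError("need at least one labelled query")
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if rel in ids[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: List[List[str]], relevant: List[str]) -> float:
    """Average of 1/rank of the correct chunk; a miss contributes 0."""
    if not relevant:
        raise ValueError("need at least one labelled query")
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        if rel in ids:
            total += 1.0 / (ids.index(rel) + 1)  # ranks start at 1
    return total / len(relevant)
```

For example, with three queries where the correct chunk is ranked first, ranked third, and missing, recall@1 is 1/3, recall@3 is 2/3, and MRR is (1 + 1/3 + 0) / 3.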

All done! Let's recap what we covered in this course:

Decided when RAG is the right tool

Chunked and embedded documents into a vector store

Built a retrieval step for prompt context

Queried a vector index with ANN search

Measured and improved retrieval quality

What's next?

Next Mini Course

Advanced RAG – Reranking, Hybrid Search, and Agentic Retrieval – Take your RAG pipeline further: add BM25 hybrid search, cross-encoder reranking, query expansion, and agentic retrieval loops that decide when and what to retrieve based on the model's confidence.