Retrieval Augmented Generation (RAG)¶
Instead of relying only on what it learned during training, a RAG-enabled model first retrieves fresh, relevant information from a trusted source (such as a database, document library, or API) and then uses that information to generate its reply.
Introduction to RAG¶
Retrieval Augmented Generation is a technique that combines retrieval (finding relevant external information) with generation (using an LLM to produce an answer).
Why it matters:
- Reduces hallucinations by grounding answers in real, verifiable data.
- Lets LLMs answer with up‑to‑date and domain‑specific knowledge without retraining.
Core Components of a RAG System¶
| Component | Role |
|---|---|
| Retriever | Finds relevant documents/snippets from a knowledge base. |
| Index / Vector Store | Stores embeddings of documents for fast semantic search. |
| Chunker | Splits large documents into manageable, retrievable pieces. |
| Embedder | Converts text into vector embeddings for similarity search. |
| Generator (LLM) | Produces the final answer using the retrieved context. |
| Orchestrator | Coordinates retrieval, augmentation, and generation steps. |
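
As a rough sketch of how these components fit together, the Python function below shows the orchestrator's role. The `embed_text`, `vector_store`, and `llm_generate` arguments are hypothetical placeholders for a real embedding model, vector store client, and LLM call:

```python
# Orchestrator sketch. embed_text, vector_store, and llm_generate are
# hypothetical stand-ins for a real embedding model, vector store client,
# and LLM call.
def answer(question: str, embed_text, vector_store, llm_generate, k: int = 3) -> str:
    # Retriever: embed the question and look up the most similar chunks.
    query_vector = embed_text(question)
    retrieved_chunks = vector_store.search(query_vector, top_k=k)

    # Augmentation: insert the retrieved chunks into the prompt.
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Generation: the LLM produces the final, grounded answer.
    return llm_generate(prompt)
```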
Your First RAG — “Hello, Docs”¶
Build a minimal RAG pipeline by walking through the steps below; a runnable sketch of the core loop follows the list.
- Corpus – Your Knowledge Base
    - Start with a small, clearly defined set of documents — for example, a folder of Markdown notes or PDFs.
    - This is the “source of truth” your system will draw from.
- Chunking – Breaking It Down
    - Large documents are split into smaller, meaningful pieces (“chunks”).
    - Why: Smaller units make it easier to find exactly the right information.
    - Typical size: A few hundred words or tokens, sometimes with a little overlap so context isn’t lost.
- Embedding – Turning Text into Numbers
    - Each chunk is transformed into a numerical representation (a vector) that captures its meaning.
    - This allows the system to compare “semantic similarity” between a question and the stored chunks.
- Storage – Building the Searchable Index
    - All vectors are stored in a special database (a vector store).
    - Think of it as a library card catalogue, but instead of titles and authors, it indexes meaning.
- Retrieval – Finding Relevant Chunks
    - When you ask a question, the system:
        - Converts your question into a vector.
        - Finds the most similar chunks in the index.
        - Returns the top matches (often called top‑k results).
- Augmentation – Adding Context to the Prompt
    - The retrieved chunks are inserted into the LLM’s prompt alongside your question.
    - This “grounds” the model in real, relevant information from your corpus.
- Generation – Producing the Answer
    - The LLM uses both your question and the retrieved context to generate a response.
    - The goal: factual, context‑aware answers that reflect your documents, not just the model’s training data.
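
The chunk, embed, store, and retrieve steps can be sketched in a few lines of Python. In the snippet below, `embed_text` is a hypothetical placeholder for whatever embedding model you use; chunking, in-memory storage, and cosine-similarity retrieval are shown with plain NumPy:

```python
# Minimal sketch of the Chunk → Embed → Store → Retrieve steps.
# embed_text is a hypothetical stand-in for any embedding model that maps
# a string to a fixed-length vector of floats.
import numpy as np

def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-based chunks."""
    words = text.split()
    step = max(size - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def build_index(documents: list[str], embed_text) -> tuple[list[str], np.ndarray]:
    """Chunk every document and store one embedding per chunk (the 'vector store')."""
    chunks = [chunk for doc in documents for chunk in chunk_words(doc)]
    vectors = np.array([embed_text(chunk) for chunk in chunks], dtype=np.float32)
    return chunks, vectors

def retrieve(question: str, chunks: list[str], vectors: np.ndarray,
             embed_text, k: int = 3) -> list[str]:
    """Return the top-k chunks whose embeddings are most similar (cosine) to the question."""
    query = np.asarray(embed_text(question), dtype=np.float32)
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-10)
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]
```

Feeding the retrieved chunks into the prompt (Augment) and calling the model (Generate) then works exactly like the orchestrator sketch shown earlier.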
Key Takeaways¶
- RAG is about combining search with generation.
- The quality of your chunks, embeddings, and retrieval directly affects answer accuracy.
- Even a minimal RAG pipeline follows this same loop: Corpus → Chunk → Embed → Store → Retrieve → Augment → Generate.
Info
In the context of AI and RAG, a corpus simply means the complete collection of
documents or text that your system can draw from.
Think of it as your knowledge library — it could be anything from a folder of
Markdown notes, to a set of PDFs, to a database of support tickets. In RAG, the
corpus is the raw material you break into chunks, embed, and store so the model
can later retrieve the most relevant pieces when answering a question.
In short: Corpus = all the source content you make searchable for your AI.
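
As a tiny sketch, assuming the corpus is a folder of Markdown notes at a hypothetical `notes/` path, loading it can be a couple of lines:

```python
# Load every Markdown file under a hypothetical notes/ folder into the corpus.
from pathlib import Path

corpus = [path.read_text(encoding="utf-8") for path in Path("notes").glob("**/*.md")]
print(f"Loaded {len(corpus)} documents")
```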
Example flow:¶
- Corpus: A folder of Markdown files.
- Chunk: Split into ~500‑token chunks.
- Embed: Use an embedding model to vectorize chunks.
- Store: Save embeddings in a local vector DB (e.g., FAISS).
- Retrieve: On a query, find top‑k similar chunks.
- Augment: Insert retrieved chunks into the LLM prompt.
- Generate: LLM answers using both the query and context.
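
The same flow can be sketched with concrete libraries. The snippet below assumes a `notes/` folder of Markdown files, uses the `sentence-transformers` library for embeddings, and FAISS as the local vector store; the model name and sample query are illustrative only, and the final LLM call is left out:

```python
# Sketch of the example flow with sentence-transformers and FAISS.
# The notes/ folder, model name, and query are illustrative placeholders.
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus + Chunk: naive fixed-size word chunks from a folder of Markdown files.
chunks = []
for path in Path("notes").glob("*.md"):
    words = path.read_text(encoding="utf-8").split()
    chunks += [" ".join(words[i:i + 300]) for i in range(0, len(words), 300)]

# Embed + Store: vectorize the chunks and add them to a FAISS index.
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

# Retrieve: embed the query and fetch the top-k most similar chunks.
query = "How do I reset my password?"
query_vec = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_vec, 3)
context = "\n\n".join(chunks[i] for i in ids[0])

# Augment: the grounded prompt you would hand to your LLM of choice.
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```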