Building Retrieval-Augmented Generation (RAG) Systems From Scratch

1. Brief Overview

Retrieval-Augmented Generation (RAG) is an architectural pattern that extends Large Language Models (LLMs) by connecting them to external knowledge bases. At its core, RAG is a hybrid approach that combines two distinct AI fields: information retrieval and natural language generation. Instead of relying solely on the vast but static knowledge encoded in its parameters during training, a RAG system fetches relevant information from a specified data source at query time and uses that information to inform its generated responses.

This technology matters because it directly addresses several inherent limitations of standalone LLMs. First, it reduces "hallucinations," where models generate plausible but factually incorrect or nonsensical information. Grounding the LLM's response in retrieved, verifiable data improves the factual accuracy and reliability of the output, although it does not eliminate errors entirely. Second, it provides a mechanism for keeping the model's knowledge up to date without costly and time-consuming retraining: as your knowledge base evolves, new information becomes available to the system as soon as it is indexed. This makes RAG a practical choice for applications that require current and contextually relevant answers.

This tutorial is designed for developers, data scientists, and AI engineers who want to move beyond simply using LLM APIs and start building more sophisticated, data-aware applications. If you need to build a question-answering system over your company's internal documents, create a chatbot that's an expert in a specific, evolving domain, or simply want to build more accurate and trustworthy AI systems, then learning to build RAG systems is a crucial next step. We will walk through a complete, from-scratch implementation that you can adapt as a starting point for your own use cases.

2. Key Concepts

To build a RAG system, it's essential to understand the core components and how they interact. The entire process can be broken down into two main stages: an offline Indexing Pipeline to prepare the knowledge base and an online Retrieval-Generation Pipeline to answer queries.

Knowledge Base: This is the corpus of documents containing the information you want the LLM to use. It can be a collection of text files, PDFs, Markdown files, or entries in a database. For our purposes, this is the raw, unstructured, or semi-structured data that will serve as the "long-term memory" for our system.

Chunking: LLMs have a limited context window, meaning they can only process a certain amount of text at once. It is also inefficient to feed entire documents to the model. Chunking is the process of breaking down large documents from the knowledge base into smaller, semantically coherent pieces of text. A good chunking strategy is critical; chunks that are too small may lack sufficient context, while chunks that are too large can introduce noise and exceed the model's context limit.

Embedding Model & Vector Embeddings: An embedding model is a neural network that transforms text into high-dimensional numerical vectors, known as embeddings. These vectors capture the semantic meaning of the text, such that similar concepts are represented by vectors that are close to each other in the vector space. We use the same embedding model to convert both our document chunks and, later, our user queries into vectors. sentence-transformers is a popular library for this.
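
To make "close to each other in the vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made toy values (real embeddings from a model such as all-MiniLM-L6-v2 have hundreds of dimensions), and cosine_similarity is an illustrative helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": made-up values for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {sim_related:.3f}")
print(f"cat vs invoice: {sim_unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores close to 0. A retriever relies on exactly this ordering.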

Vector Database (or Index): A vector database is a specialized database designed to store and efficiently search through high-dimensional vector embeddings. After converting all document chunks into embeddings, we store them in a vector database. When a user asks a query, the query is also converted into an embedding, and the vector database performs a similarity search (e.g., cosine similarity or Euclidean distance) to find the document chunk embeddings that are most similar to the query embedding. For our tutorial, we will use FAISS (Facebook AI Similarity Search), a library that provides efficient similarity search, acting as our vector index.

The Retriever: This is the component responsible for taking a user's query, converting it to an embedding, and using the vector database to find the top-k most relevant document chunks. The output of the retriever is a ranked list of text chunks that are believed to be most relevant to answering the query.
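
The retriever's core operation can be sketched without any vector database: compare the query vector with every chunk vector and keep the k nearest. The example below is a toy, exhaustive version of that idea using squared L2 distance. The two-dimensional vectors, the chunk labels, and the helper names are all hypothetical.

```python
import heapq

def l2_distance_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return (chunk, distance) pairs for the k nearest chunks, nearest first."""
    scored = [(l2_distance_sq(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    nearest = heapq.nsmallest(k, scored)
    return [(chunks[i], dist) for dist, i in nearest]

# Made-up chunk labels and vectors for illustration
chunks = ["Normans history", "Immune system basics", "Norman conquest"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
query_vec = [1.0, 0.1]

results = retrieve_top_k(query_vec, chunk_vecs, chunks, k=2)
for chunk, dist in results:
    print(f"{dist:.3f}  {chunk}")
```

A vector index performs the same ranking, only over far more vectors and with optimized (and optionally approximate) search.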

The Generator (LLM): This is the Large Language Model that receives the original user query along with the retrieved document chunks as context. We construct a carefully crafted prompt that instructs the LLM to answer the user's question based on the provided context. This forces the model to ground its answer in the retrieved information, rather than relying solely on its internal knowledge.

Augmented Prompt & Response: The final step is to synthesize the retrieved context and the original query into a single prompt for the LLM. A typical prompt template might look like this: "Based on the following context, please answer the question. Context: {retrieved_chunks} Question: {user_query}". The LLM then processes this augmented prompt and generates a final, context-aware response for the user.
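
As a rough sketch of how such a template can be filled in, the helper below numbers each retrieved chunk and substitutes it into the template together with the query. The function name build_augmented_prompt and the numbering scheme are illustrative choices, not part of any library.

```python
PROMPT_TEMPLATE = (
    "Based on the following context, please answer the question.\n\n"
    "Context:\n{retrieved_chunks}\n\n"
    "Question: {user_query}\n\n"
    "Answer:"
)

def build_augmented_prompt(user_query, retrieved_chunks):
    """Join the chunks (most relevant first) and fill in the template."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(
        retrieved_chunks=numbered, user_query=user_query.strip()
    )

prompt = build_augmented_prompt(
    "Who were the Normans?",
    ["The Normans gave their name to Normandy.", "Rollo was their leader."],
)
print(prompt)
```

Numbering the chunks is optional, but it makes it easier to ask the model to cite which chunk supports its answer.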

3. Practical Code Examples

This section provides a complete, end-to-end Python implementation of a RAG system. We will use the datasets, transformers, sentence-transformers, and faiss-cpu libraries.

3.1. Installation

First, let's install the necessary packages.


pip install datasets transformers sentence-transformers faiss-cpu torch

Expected Output: You will see a series of installation logs from pip. A successful installation will end with a line similar to:


Successfully installed datasets-... transformers-... sentence-transformers-... faiss-cpu-... torch-...

3.2. Step 1: Load and Prepare the Knowledge Base

We'll use the squad dataset from the Hugging Face datasets library as our knowledge base. This dataset contains context-question-answer triplets. We will only use the 'context' paragraphs as our document corpus.


import os
from datasets import load_dataset

# Load the SQuAD dataset
print("Loading SQuAD dataset...")
dataset = load_dataset("squad", split="train")

# We'll use the 'context' field as our knowledge base.
# Let's extract unique contexts to avoid redundancy.
print("Extracting unique contexts...")
contexts = list(set(item['context'] for item in dataset))

print(f"Loaded {len(contexts)} unique contexts as the knowledge base.")

# For demonstration, let's see one of the contexts
print("\n--- Sample Context ---")
print(contexts[0])
print("--------------------")

# Create a directory to store our knowledge base (no error if it already exists)
os.makedirs("knowledge_base", exist_ok=True)

# Write each context to a separate file
print("\nWriting contexts to files...")
for i, context in enumerate(contexts):
    with open(f"knowledge_base/doc_{i}.txt", "w") as f:
        f.write(context)

print("Knowledge base created successfully.")

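Expected Output (the sample context is illustrative: contexts is built from a set, so its order, and therefore contexts[0], varies between runs):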


Loading SQuAD dataset...
Extracting unique contexts...
Loaded 18891 unique contexts as the knowledge base.

--- Sample Context ---
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
--------------------

Writing contexts to files...
Knowledge base created successfully.

3.3. Step 2: The Indexing Pipeline (Chunking, Embedding, Indexing)

Now, we will process the documents in our knowledge_base directory, chunk them, create embeddings, and store them in a FAISS index.


import os
import torch
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# --- 1. Chunking (in this case, we treat each file as a single chunk) ---
# For this tutorial, our documents are already paragraph-sized, so we'll treat
# each document as a single "chunk". In a real-world scenario with longer
# documents, you would implement a more sophisticated chunking strategy here.
file_paths = [os.path.join("knowledge_base", f) for f in os.listdir("knowledge_base")]
documents = []
for path in file_paths:
    with open(path, 'r') as f:
        documents.append(f.read())

print(f"Loaded {len(documents)} documents to be indexed.")

# --- 2. Embedding ---
# We'll use a pre-trained model from sentence-transformers
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all documents
print("Generating embeddings for documents...")
# Note: This can take a while for a large number of documents.
# For faster processing, you can use a GPU if available.
embeddings = embedding_model.encode(documents, convert_to_tensor=True, show_progress_bar=True)

# Convert to numpy for FAISS
embeddings_np = embeddings.cpu().numpy()

print(f"Embeddings created with shape: {embeddings_np.shape}")

# --- 3. Indexing ---
# Create a FAISS index
d = embeddings_np.shape[1]  # dimension of the vectors
index = faiss.IndexFlatL2(d) # Using L2 distance for similarity

# Add the document embeddings to the index
print("Adding embeddings to FAISS index...")
index.add(embeddings_np)

# Save the index and the document list for later use
faiss.write_index(index, "faiss_index.bin")
import pickle
with open("documents.pkl", "wb") as f:
    pickle.dump(documents, f)

print("Indexing complete. FAISS index and documents saved.")

Expected Output:


Loaded 18891 documents to be indexed.
Loading embedding model...
Generating embeddings for documents...
Batches: 100%|██████████| 591/591 [00:30<00:00, 19.41it/s]
Embeddings created with shape: (18891, 384)
Adding embeddings to FAISS index...
Indexing complete. FAISS index and documents saved.

3.4. Step 3: The Retrieval-Generation Pipeline

This is the online part of our system. We'll define a function that takes a user query, retrieves relevant documents, and generates an answer using a generative LLM.


import faiss
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# --- Load the necessary components ---
print("Loading FAISS index, documents, and models...")

# Load the FAISS index
index = faiss.read_index("faiss_index.bin")

# Load the documents
with open("documents.pkl", "rb") as f:
    documents = pickle.load(f)

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Load a generative model for question answering
# We use a text-generation pipeline from Hugging Face's transformers
# Using a smaller model for demonstration purposes
generator = pipeline('text-generation', model='gpt2')

print("Components loaded successfully.")

# --- RAG Function ---
def answer_question(query, k=5):
    """
    Takes a user query, retrieves relevant documents, and generates an answer.
    """
    # 1. Retrieve
    print(f"\nQuery: '{query}'")
    print("Retrieving relevant documents...")
    # Embed the query
    query_embedding = embedding_model.encode([query], convert_to_tensor=True).cpu().numpy()

    # Search the FAISS index
    distances, indices = index.search(query_embedding, k)

    # Get the retrieved documents
    retrieved_docs = [documents[i] for i in indices[0]]

    print(f"Retrieved {len(retrieved_docs)} documents.")

    # 2. Generate
    print("Generating answer...")
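    # Prepare the context for the generator.
    # gpt2 accepts at most 1024 tokens in total, so cap the context at 800 tokens
    # to leave room for the instructions, the question, and the generated answer.
    context = "\n\n".join(retrieved_docs)
    tokenizer = generator.tokenizer
    context_ids = tokenizer(context, truncation=True, max_length=800)["input_ids"]
    context = tokenizer.decode(context_ids)

    # Create the prompt
    prompt = f"""
    Based on the following context, please provide a concise answer to the question.

    Context:
    {context}

    Question: {query}

    Answer:
    """

    # Generate the answer.
    # max_new_tokens limits only the generated continuation (max_length would also
    # count the prompt tokens), and return_full_text=False returns just that
    # continuation, so no string splitting on "Answer:" is needed.
    generated_text = generator(prompt, max_new_tokens=100, num_return_sequences=1, return_full_text=False)
    answer = generated_text[0]['generated_text'].strip()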

    return answer

# --- Example Usage ---
user_query = "Who were the Normans?"
final_answer = answer_question(user_query)
print("\n--- Final Answer ---")
print(final_answer)

user_query_2 = "What is the main purpose of the immune system?"
final_answer_2 = answer_question(user_query_2)
print("\n--- Final Answer ---")
print(final_answer_2)

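Expected Output (illustrative: gpt2 is a small model that is not instruction-tuned, so the generated answers will typically differ between runs and are often less clean than shown here):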


Loading FAISS index, documents, and models...
Components loaded successfully.

Query: 'Who were the Normans?'
Retrieving relevant documents...
Retrieved 5 documents.
Generating answer...
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- Final Answer ---
The Normans were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia.

Query: 'What is the main purpose of the immune system?'
Retrieving relevant documents...
Retrieved 5 documents.
Generating answer...
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- Final Answer ---
The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue.

4. Best Practices

Choose the Right Embedding Model: The quality of your retrieval heavily depends on the embedding model. all-MiniLM-L6-v2 is a good starting point, but for domain-specific tasks, consider fine-tuning an embedding model on your own data or using models trained for specific domains (e.g., BioBERT for biomedical text).

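Optimize Chunking Strategy: Don't rely only on splitting documents by a fixed number of characters, because such splits can cut sentences in half. Prefer semantic boundaries: split by paragraphs, or use libraries like NLTK or spaCy to split by sentences. It is also good practice to overlap adjacent chunks so that context isn't lost at the boundaries. The snippet below shows the overlap idea in its simplest, character-based form; the same approach applies to sentence- or paragraph-based chunks.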


    # Example of overlapping chunks
    text = "..." # your long document
    chunk_size = 1000
    overlap = 100
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size - overlap)]
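
For comparison, here is a sketch of a sentence-aware variant that never cuts a sentence in half and carries the last sentence of each chunk into the next one. It uses a simple regular expression as a stand-in for a real sentence splitter such as NLTK or spaCy; chunk_by_sentences and its parameters are hypothetical names.

```python
import re

def chunk_by_sentences(text, max_chars=50, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars with sentence overlap.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Carry the trailing sentence(s) forward as overlap
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would not fit together with the new sentence
            if current and len(" ".join(current + [sentence])) > max_chars:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "RAG has two stages. Indexing runs offline. "
    "Retrieval runs online. Generation uses context."
)
sentence_chunks = chunk_by_sentences(text, max_chars=50, overlap_sentences=1)
for c in sentence_chunks:
    print(repr(c))
```

Each chunk here starts with the sentence that ended the previous one, so a fact that spans a boundary still appears intact in at least one chunk.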

Use an Efficient Vector Index: For small-scale projects, IndexFlatL2 in FAISS is fine. For production systems with millions of documents, exhaustive search becomes too slow. Use an approximate index such as IndexIVFFlat (shown below) or, when memory is also a constraint, IndexIVFPQ, which additionally compresses the vectors. Both require a training step on your data but result in significantly faster search times.


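    # Example of an approximate (IVF) FAISS index; reuses d and embeddings_np from Step 2
    nlist = 100 # number of clusters (Voronoi cells)
    quantizer = faiss.IndexFlatL2(d) # assigns each vector to its nearest cluster
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(embeddings_np) # Learn the cluster centroids from the data
    index.add(embeddings_np)
    index.nprobe = 10 # clusters scanned per query (default is 1); higher improves recall but is slower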

Refine the Prompt Template: The way you present the context and question to the LLM matters. Experiment with different prompt templates. Be explicit in your instructions. For example, you can add a sentence like "If the context does not contain the answer, say 'I do not have enough information to answer this question.'" to reduce hallucinations.

Implement a Re-ranking Step: The initial retrieval from the vector database is based on semantic similarity, which isn't always perfect. A good practice is to retrieve a larger number of documents (e.g., k=20) and then use a more sophisticated (but slower) model, like a cross-encoder, to re-rank the top 20 documents and select the best 3-5 to feed to the LLM. This can significantly improve the quality of the context.
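
The retrieve-then-re-rank flow can be sketched as a small function that takes any pairwise scoring function. To keep the example runnable without downloading a model, it uses a crude word-overlap scorer as a stand-in; the commented lines show how a cross-encoder from sentence-transformers could be plugged in instead. The model name in that comment is an assumption, and rerank and word_overlap_scores are hypothetical helpers.

```python
import re

def rerank(query, candidates, score_fn, top_n=3):
    """Score every (query, candidate) pair and return the top_n candidates."""
    scores = score_fn([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

def word_overlap_scores(pairs):
    """Stand-in scorer: number of distinct words shared by query and document."""
    def words(text):
        return set(re.findall(r"\w+", text.lower()))
    return [len(words(q) & words(d)) for q, d in pairs]

# With sentence-transformers installed, a real scorer could look like this
# (the model name is an assumption; pick one suited to your data):
#   from sentence_transformers import CrossEncoder
#   cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top_docs = rerank(query, candidates, cross_encoder.predict, top_n=3)

candidates = [
    "The immune system protects against disease.",
    "The Normans gave their name to Normandy.",
    "Normandy is a region in France settled by the Normans.",
]
top_docs = rerank("who were the normans", candidates, word_overlap_scores, top_n=2)
print(top_docs)
```

Passing the scorer as an argument keeps the re-ranking logic independent of the model, so the stand-in can be swapped for a cross-encoder without changing the function.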

Filter Metadata: When you create your index, store metadata alongside your vectors (e.g., document source, creation date). Many vector databases support filtered search, which restricts the similarity search to vectors whose metadata matches a condition. This can narrow down the search space and improve retrieval speed and relevance.
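
The idea of filtering before searching can be illustrated with a toy in-memory store, where each record keeps its vector next to its metadata. This is a sketch only: the records, field names, and the filtered_search helper are hypothetical, and a real vector database would apply the filter inside the index rather than in a Python loop.

```python
from datetime import date

# Hypothetical records: each vector is stored next to its text and metadata
records = [
    {"text": "Vacation policy", "vector": [0.9, 0.1], "source": "hr", "created": date(2024, 3, 1)},
    {"text": "Expense policy", "vector": [0.8, 0.3], "source": "finance", "created": date(2024, 5, 1)},
    {"text": "Old vacation policy", "vector": [0.9, 0.2], "source": "hr", "created": date(2019, 1, 1)},
]

def filtered_search(query_vec, records, predicate, k=2):
    """Keep records that pass the metadata predicate, then rank by squared L2 distance."""
    def distance(record):
        return sum((q - v) ** 2 for q, v in zip(query_vec, record["vector"]))
    return sorted((r for r in records if predicate(r)), key=distance)[:k]

def recent_hr(record):
    """Metadata condition: HR documents created in 2023 or later."""
    return record["source"] == "hr" and record["created"] >= date(2023, 1, 1)

hits = filtered_search([1.0, 0.0], records, recent_hr)
print([h["text"] for h in hits])  # ['Vacation policy']
```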

5. Common Pitfalls to Avoid

Mismatched Embeddings:
Error: The RAG system provides irrelevant or nonsensical answers because the retrieved documents have no relation to the query.
Cause: You used a different embedding model for indexing the documents and for embedding the query. The vector representations are not in the same space and thus cannot be compared meaningfully.
Fix: ALWAYS use the exact same embedding model for both the indexing and retrieval pipelines. Ensure the model checkpoint (all-MiniLM-L6-v2 in our case) is identical in both places.
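
One way to enforce this is to save the embedding model's name and vector dimension next to the index and check them before serving queries. The sketch below assumes a small JSON sidecar file; the file name and both helper functions are hypothetical, not part of FAISS or sentence-transformers.

```python
import json
import os
import tempfile

def save_index_metadata(path, model_name, dim):
    """Record which embedding model produced the vectors in the index."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"embedding_model": model_name, "dim": dim}, f)

def check_index_metadata(path, model_name, dim):
    """Raise ValueError if the query-time model differs from the index-time model."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    if meta["embedding_model"] != model_name or meta["dim"] != dim:
        raise ValueError(
            f"Index was built with {meta['embedding_model']} (dim={meta['dim']}), "
            f"but the retriever uses {model_name} (dim={dim})."
        )

# Indexing side: write the sidecar file when the index is saved
meta_path = os.path.join(tempfile.mkdtemp(), "faiss_index.meta.json")
save_index_metadata(meta_path, "all-MiniLM-L6-v2", 384)

# Retrieval side: a matching model passes silently
check_index_metadata(meta_path, "all-MiniLM-L6-v2", 384)

# A different model is rejected before any query is served
try:
    check_index_metadata(meta_path, "all-mpnet-base-v2", 768)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```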

Context Window Overflow:
Error Message: Token indices sequence length is longer than the specified maximum sequence length for this model...
Cause: The combined length of your prompt, retrieved chunks, and the query exceeds the LLM's maximum context window (e.g., 1024 tokens for the base gpt2 model).
Fix:
Truncate the context: Before feeding the context to the LLM, ensure you truncate it to fit within the model's limit, leaving space for the query and the generated answer.
Reduce k: Retrieve fewer documents (e.g., decrease k from 5 to 3).
Use a model with a larger context window: Models like GPT-4 or Claude have much larger context windows.
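
The first two fixes can be combined by keeping only as many retrieved chunks as fit into a token budget. The sketch below approximates token counts by whitespace-separated words so that it runs without a tokenizer; that approximation and the helper name fit_chunks_to_budget are assumptions. In practice you would pass a counter based on the generator's own tokenizer, for example lambda text: len(tokenizer.encode(text)).

```python
def fit_chunks_to_budget(chunks, max_context_tokens, count_tokens=None):
    """Keep chunks in rank order until adding the next one would exceed the budget."""
    if count_tokens is None:
        # Rough approximation: one token per whitespace-separated word
        count_tokens = lambda text: len(text.split())
    selected, used = [], 0
    for chunk in chunks:  # chunks are assumed to be ordered most relevant first
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

ranked_chunks = [
    "one two three four",
    "five six seven eight nine",
    "ten eleven twelve thirteen fourteen fifteen",
]
kept = fit_chunks_to_budget(ranked_chunks, max_context_tokens=10)
print(len(kept), "of", len(ranked_chunks), "chunks fit the budget")
```

For a model with a 1024-token window, the budget would be 1024 minus the tokens reserved for the instructions, the question, and the generated answer.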

Slow Retrieval on Large Datasets:
Error: The index.search() call takes an unacceptably long time to return results as your knowledge base grows.
Cause: You are using a brute-force index like IndexFlatL2 which compares the query vector to every single vector in the index. This is an O(n) operation, where n is the number of documents.
Fix: Use an approximate nearest neighbor (ANN) index such as IndexIVFPQ or IndexHNSWFlat in FAISS. These indexes trade a small amount of accuracy for a large speedup in search time, which is essential for production systems. IVF-based indexes must be trained on your data before you add vectors; HNSW indexes need no training step but take longer to build and use more memory.

6. Next Steps and Additional Resources

Official Documentation:
Hugging Face Transformers: https://huggingface.co/docs/transformers
Sentence Transformers: https://www.sbert.net/
FAISS: https://faiss.ai/
Hugging Face Datasets: https://huggingface.co/docs/datasets/

Follow-up Projects:
Build a Gradio/Streamlit UI: Create a simple web interface for your RAG system to make it interactive.
Integrate a Real-time Data Source: Modify the indexing pipeline to periodically pull new data from an API or database to keep your knowledge base up-to-date.
Experiment with Different Models: Swap out the embedding and generator models with other pre-trained models from Hugging Face to see how it affects performance. Try larger, more capable models if you have the hardware.
Deploy as an API: Wrap your RAG pipeline in a FastAPI or Flask application to serve it as a REST API.
