Introduction

Retrieval-Augmented Generation (RAG) is a powerful approach for creating more accurate and context-aware responses from Large Language Models (LLMs). Instead of relying solely on the model’s internal training data, RAG uses external documents to ground answers, making them more factual and relevant.

In this tutorial, you’ll learn how to build a simple RAG pipeline using Haystack and Ollama. By the end, you’ll have a working setup that retrieves relevant documents and generates answers locally—no expensive API keys required.

For this project, I opted for a lightweight model, Qwen 2.5-Coder, at only 0.5 billion parameters. A model this small runs comfortably on modest hardware, and it shows that you don’t need a large model to get useful, grounded answers once retrieval supplies the right context.

For a complete working example of this RAG pipeline setup, check out my GitHub repository (linked in the References below).

For more details on using Ollama with Haystack, see the official Haystack Ollama Integration page.

If you’d like to follow a similar step-by-step tutorial, check out the official Haystack guide on creating your first RAG pipeline.

Tools and Setup

Haystack provides high-level building blocks for search and question-answering pipelines. We’ll use:

  • InMemoryDocumentStore: To store our documents in memory.
  • SentenceTransformersDocumentEmbedder & SentenceTransformersTextEmbedder: To create embeddings for both documents and queries.
  • InMemoryEmbeddingRetriever: To fetch the most relevant documents based on query embeddings.
  • PromptBuilder: To construct a prompt that pairs the user’s question with retrieved context (a standalone sketch of this component follows the list).
  • OllamaGenerator: To run an LLM locally instead of calling external APIs.
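
To see how one of these pieces behaves on its own, here is a minimal sketch of PromptBuilder rendering a Jinja2 template outside a pipeline. The template and example document are purely illustrative:

from haystack import Document
from haystack.components.builders import PromptBuilder

# A toy template: one retrieved document plus the user's question
builder = PromptBuilder(template="Context: {{ documents[0].content }}\nQuestion: {{ question }}")
result = builder.run(
    documents=[Document(content="The Great Pyramid of Giza is in Egypt.")],
    question="Where is the Great Pyramid?",
)
print(result["prompt"])  # the fully rendered prompt string that would be sent to the LLM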

Before getting started, ensure you have:

  • Python 3.10+ installed.
  • Haystack and its dependencies installed (pip install haystack-ai ollama-haystack datasets sentence-transformers python-dotenv).
  • Ollama installed and running locally (Ollama Setup Instructions).

Create a .env file and set:

export OLLAMA_ENDPOINT="http://localhost:11434" # adjust if Ollama is listening on a different host or port
export OLLAMA_MODEL="qwen2.5-coder:0.5b" # the 0.5B variant; change the tag if you pulled a different size
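
Before wiring up the pipeline, you can optionally check that Ollama is reachable and that the model has been pulled. The sketch below assumes the requests package is installed and queries Ollama’s /api/tags endpoint, which lists locally available models:

import os
import requests
from dotenv import load_dotenv

load_dotenv()
endpoint = os.environ.get("OLLAMA_ENDPOINT", "http://localhost:11434")
model = os.environ.get("OLLAMA_MODEL", "qwen2.5-coder:0.5b")

# /api/tags returns the models currently available to the local Ollama server
response = requests.get(f"{endpoint}/api/tags", timeout=5)
response.raise_for_status()
available = [m["name"] for m in response.json().get("models", [])]
print("Ollama is up. Local models:", available)
if not any(name.startswith(model.split(":")[0]) for name in available):
    print(f"'{model}' not found locally; pull it first with: ollama pull {model}")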

Data Preparation and Indexing

We’ll use a preprocessed dataset of the Seven Wonders of the Ancient World. After fetching and embedding these documents, they’ll be stored in the in-memory DocumentStore.
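
If you’d like to inspect the data before indexing it, a quick peek at the first record looks like this (content and meta are the fields we map into Haystack Documents below):

from datasets import load_dataset

dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
print(f"{len(dataset)} passages in the dataset")
print(dataset[0]["content"][:200])  # first 200 characters of the first passage
print(dataset[0]["meta"])           # metadata attached to that passage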

The Code

Below is a simplified code snippet. Save it as qa_pipeline_with_retrieval_augmentation.py:

import os
from dotenv import load_dotenv
from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import load_dataset
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

from haystack.components.builders import PromptBuilder

# from getpass import getpass
# from haystack.components.generators import OpenAIGenerator
# Let's use open-source instead!
from haystack_integrations.components.generators.ollama import OllamaGenerator
from haystack import Pipeline

# if "OPENAI_API_KEY" not in os.environ:
#     os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
load_dotenv()
generator = OllamaGenerator(
    model=os.environ["OLLAMA_MODEL"], url=os.environ["OLLAMA_ENDPOINT"]
)


# Load the preprocessed Seven Wonders dataset and wrap each record in a Haystack Document
document_store = InMemoryDocumentStore()
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]


# Embed every document once and write the embedded documents to the in-memory store
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()

docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])


# The query embedder must use the same model as the document embedder so the vectors are comparable
text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)


# Retrieves the stored documents whose embeddings are closest to the query embedding
retriever = InMemoryEmbeddingRetriever(document_store)


# Jinja2 prompt template: the retrieved documents form the context, followed by the user's question
template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)


basic_rag_pipeline = Pipeline()
# Add components to your pipeline
basic_rag_pipeline.add_component("text_embedder", text_embedder)
basic_rag_pipeline.add_component("retriever", retriever)
basic_rag_pipeline.add_component("prompt_builder", prompt_builder)
basic_rag_pipeline.add_component("llm", generator)

# Now, connect the components to each other
basic_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
basic_rag_pipeline.connect("retriever", "prompt_builder.documents")
basic_rag_pipeline.connect("prompt_builder", "llm")


example_questions = [
    "Where is Gardens of Babylon?",
    "Why did people build Great Pyramid of Giza?",
    "What does Rhodes Statue look like?",
    "Why did people visit the Temple of Artemis?",
    "What is the importance of Colossus of Rhodes?",
    "What happened to the Tomb of Mausolus?",
    "How did Colossus of Rhodes collapse?",
]

for question in example_questions:

    response = basic_rag_pipeline.run(
        {"text_embedder": {"text": question}, "prompt_builder": {"question": question}}
    )

    print(response["llm"]["replies"][0])

Run this script, and you’ll see your pipeline return answers grounded in the dataset:

source .env && python qa_pipeline_with_retrieval_augmentation.py

Conclusion and Next Steps

You now have a basic Retrieval-Augmented Generation pipeline running locally. From here, you can:

  • Experiment with different datasets.
  • Swap out embedding models or document stores (see the sketch after this list).
  • Integrate a more advanced vector database like FAISS.
  • Customize prompts and explore other LLMs that run with Ollama.
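
For example, switching the embedding model or the Ollama model is a one-line change each. The model names below are illustrative; any Sentence-Transformers model works as long as documents and queries use the same one, and any model you have pulled in Ollama can be passed to the generator:

import os
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack_integrations.components.generators.ollama import OllamaGenerator

# Swap in a larger Sentence-Transformers model (must be the same for documents and queries)
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-mpnet-base-v2")
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-mpnet-base-v2")

# Point the generator at any model served by your Ollama instance, e.g. a larger Qwen variant
generator = OllamaGenerator(model="qwen2.5-coder:7b", url=os.environ["OLLAMA_ENDPOINT"])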

With just a few components and a simple setup, Haystack lets you build flexible, cost-effective QA systems that are easy to adapt and scale.

In the next blog post, I will dive deeper into setting up Ollama locally (or on a VPS), which will act as your own private GPT!

RAG implemented with custom data!

References

  1. Haystack GitHub Repository
    https://github.com/deepset-ai/haystack

  2. Ollama Qwen 2.5-Coder
    https://ollama.com/library/qwen2.5-coder

  3. Seven Wonders Dataset
    https://huggingface.co/datasets/bilgeyucel/seven-wonders

  4. Ollama Setup Instructions
    https://docs.ollama.ai/getting-started/installation

  5. Haystack Ollama Integration
    https://haystack.deepset.ai/integrations/ollama

  6. Creating Your First RAG Pipeline - Haystack Tutorial
    https://haystack.deepset.ai/tutorials/27_first_rag_pipeline#fetching-and-indexing-documents

  7. Sentence-Transformers: all-MiniLM-L6-v2
    https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  8. My GitHub Repository with the project
    https://github.com/shreyashag/simple-qa-rag-with-haystack