Introduction
Retrieval-Augmented Generation (RAG) is a powerful approach for creating more accurate and context-aware responses from Large Language Models (LLMs). Instead of relying solely on the model’s internal training data, RAG uses external documents to ground answers, making them more factual and relevant.
In this tutorial, you’ll learn how to build a simple RAG pipeline using Haystack and Ollama. By the end, you’ll have a working setup that retrieves relevant documents and generates answers locally—no expensive API keys required.
For this project, I opted for a lightweight yet capable model, Qwen 2.5-Coder, with only 0.5 billion parameters. Running such a small model keeps the pipeline fast on modest hardware, and it shows that a compact model can still produce useful answers when it is grounded with retrieved context.
For a complete working example of this RAG pipeline setup, check out the GitHub repository.
For more details on using Ollama with Haystack, see the official Haystack Ollama Integration page.
If you’d like to follow a similar step-by-step tutorial, check out the official Haystack guide on creating your first RAG pipeline.
Tools and Setup
Haystack provides high-level building blocks for search and question-answering pipelines. We’ll use:
- InMemoryDocumentStore: To store our documents in memory, with no external database required.
- SentenceTransformersDocumentEmbedder & SentenceTransformersTextEmbedder: To create embeddings for both documents and queries.
- InMemoryEmbeddingRetriever: To fetch the most relevant documents based on query embeddings.
- PromptBuilder: To construct a prompt that pairs the user’s question with retrieved context (a short sketch follows this list).
- OllamaGenerator: To run an LLM locally instead of calling external APIs.
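To make the PromptBuilder role concrete, here is a minimal standalone sketch of how it renders a Jinja2 template. The template text and the variable names (documents, question) are illustrative assumptions, not the exact prompt used later in the pipeline.

```python
from haystack import Document
from haystack.components.builders import PromptBuilder

# A tiny Jinja2 template: loop over the retrieved documents, then append the question.
template = """Answer the question using the context below.
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{ question }}"""

builder = PromptBuilder(template=template)
result = builder.run(
    documents=[Document(content="The Great Pyramid of Giza is in Egypt.")],
    question="Where is the Great Pyramid?",
)
print(result["prompt"])  # the rendered prompt string that gets passed to the generator
```

In the full pipeline, the retriever supplies documents and the user’s query supplies question, so you never call the builder by hand like this.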
Before getting started, ensure you have:
- Python 3.10+ installed.
- Haystack, the Ollama integration, and their dependencies installed (pip install haystack-ai ollama-haystack datasets sentence-transformers).
- Ollama installed and running locally (Ollama Setup Instructions).
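It also helps to pull the model ahead of time so the first query doesn’t stall on a download; this assumes the 0.5B tag of Qwen 2.5-Coder from the Ollama library:

```bash
# Download the small Qwen 2.5-Coder model before running the pipeline.
ollama pull qwen2.5-coder:0.5b
```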
Create a .env file to hold your local configuration.
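What the file contains depends on how your script reads its settings. As an assumption for this tutorial, the snippets below look for OLLAMA_URL and OLLAMA_MODEL, and the export form is what makes the later source .env command work:

```bash
# Assumed variable names for this tutorial; adjust them to match your own script.
export OLLAMA_URL=http://localhost:11434
export OLLAMA_MODEL=qwen2.5-coder:0.5b
```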
Data Preparation and Indexing
We’ll use a preprocessed dataset of the Seven Wonders of the Ancient World. After fetching and embedding these documents, they’ll be stored in the in-memory DocumentStore.
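Here is a minimal sketch of that indexing step, assuming the all-MiniLM-L6-v2 embedding model referenced at the end of this post:

```python
from datasets import load_dataset
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Keep documents and their embeddings in memory for the lifetime of the script.
document_store = InMemoryDocumentStore()

# Fetch the preprocessed Seven Wonders dataset and wrap each row as a Haystack Document.
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=row["content"], meta=row["meta"]) for row in dataset]

# Embed every document with a small sentence-transformers model, then write it to the store.
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]
document_store.write_documents(docs_with_embeddings)
```

Calling warm_up() loads the embedding model once up front, so the run() call doesn’t pay that cost on first use.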
The Code
Below is a simplified snippet for the query side of the pipeline. Together with the indexing code above, it forms the full script; save the combined code as qa_pipeline_with_retrieval_augmentation.py:
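This is a minimal sketch rather than the exact code from the repository; the wiring follows the official first-RAG-pipeline tutorial, it reuses the document_store defined in the indexing snippet above, and OLLAMA_URL and OLLAMA_MODEL are the assumed variable names from the .env example:

```python
import os

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack_integrations.components.generators.ollama import OllamaGenerator

# Jinja2 template that pairs the retrieved context with the user's question.
template = """
Given the following context, answer the question.

Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

# OLLAMA_URL and OLLAMA_MODEL are the assumed environment variables from the .env example.
generator = OllamaGenerator(
    model=os.environ.get("OLLAMA_MODEL", "qwen2.5-coder:0.5b"),
    url=os.environ.get("OLLAMA_URL", "http://localhost:11434"),
)

# Wire the components: embed the question, retrieve similar documents,
# build the prompt, and generate the answer locally with Ollama.
rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("llm", generator)

rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Ask a question; the same string goes to both the embedder and the prompt template.
question = "What does the Colossus of Rhodes look like?"
result = rag_pipeline.run(
    {"text_embedder": {"text": question}, "prompt_builder": {"question": question}}
)
print(result["llm"]["replies"][0])
```

The prompt_builder-to-llm connection works because PromptBuilder outputs a single prompt string, which is exactly the input OllamaGenerator expects.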
Run this script, and you’ll see your pipeline return answers grounded in the dataset:
source .env && python qa_pipeline_with_retrieval_augmentation.py
Conclusion and Next Steps
You now have a basic Retrieval-Augmented Generation pipeline running locally. From here, you can:
- Experiment with different datasets.
- Swap out embedding models or document stores.
- Integrate a more advanced vector database like FAISS.
- Customize prompts and explore other LLMs that run with Ollama.
With just a few components and a simple setup, Haystack lets you build flexible, cost-effective QA systems that are easy to adapt and scale.
In the next blog post, I will dive deeper into setting up Ollama locally (or on a VPS), which will work like your own private GPT!
References
- Haystack GitHub Repository: https://github.com/deepset-ai/haystack
- Ollama Qwen 2.5-Coder: https://ollama.com/library/qwen2.5-coder
- Seven Wonders Dataset: https://huggingface.co/datasets/bilgeyucel/seven-wonders
- Ollama Setup Instructions: https://docs.ollama.ai/getting-started/installation
- Haystack Ollama Integration: https://haystack.deepset.ai/integrations/ollama
- Creating Your First RAG Pipeline (Haystack Tutorial): https://haystack.deepset.ai/tutorials/27_first_rag_pipeline#fetching-and-indexing-documents
- Sentence-Transformers all-MiniLM-L6-v2: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- My GitHub Repository with the project: https://github.com/shreyashag/simple-qa-rag-with-haystack