RAG on Your Own Data —
Without Sending Anything to the Cloud
You have a local LLM. You have a gateway. But the model still knows nothing about your data. RAG is how you fix that — and the entire pipeline runs locally.
The Problem With Local Knowledge
Language models only know what was in their training data, and that data has a cutoff date. Ask a local Llama model about your Q3 internal risk report, your customer contracts, or your company's knowledge base — and it will either hallucinate something plausible-sounding or honestly tell you it doesn't know.
The naive solution is to paste the document into the prompt. This works until the document is longer than the context window, or until you have ten documents, or a hundred, or a SharePoint full of thousands of PDFs. Stuffing entire document collections into every prompt is expensive in tokens, slow to process, and often worse than a targeted retrieval approach — because the model struggles to focus when given too much irrelevant context.
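Some back-of-the-envelope arithmetic makes the gap concrete. All numbers here are assumptions for illustration, not measurements:

```python
# Rough token arithmetic: prompt-stuffing vs. targeted retrieval
docs = 100                  # documents in the collection (assumed)
tokens_per_doc = 3_000      # average tokens per document (assumed)

stuffed = docs * tokens_per_doc   # naive: every document in every prompt
retrieved = 4 * 250               # RAG: 4 relevant chunks of ~250 tokens each

print(stuffed, retrieved, stuffed // retrieved)  # 300000 1000 300
```

Even with these modest numbers, retrieval sends roughly 300x fewer tokens per query, and the gap grows with the collection.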
RAG — Retrieval Augmented Generation — solves this properly. Instead of sending all your documents to the model, you first retrieve only the relevant passages for a given query, then send just those passages as context. The model answers based on the retrieved content rather than its training data alone. It's the difference between asking someone to memorize your entire library versus giving them a good search engine.
The key insight: a vector database and an embedding model — the two components that make RAG work — can both run entirely locally. Neither requires a cloud API.
The Stack We're Building
Everything in this stack is open-source and runs locally. Nothing touches a cloud API unless you explicitly configure it to.
- Document loading: LangChain document loaders — handles PDF, DOCX, TXT, Markdown, and more
- Embedding model: nomic-embed-text via Ollama — a strong open-source embedding model that runs locally
- Vector database: Qdrant — fast, open-source, runs as a Docker container or in-memory
- Text splitting: langchain-text-splitters — splits documents into indexed chunks
- LLM: your local Ollama model, optionally routed via LiteLLM from Article #2
- Orchestration: LangChain — the glue between all of these components
Create a dedicated virtual environment for this project — keeping its dependencies isolated:
# Create project folder and virtual environment
mkdir rag-local && cd rag-local
python3 -m venv .venv
# Activate — macOS / Linux:
source .venv/bin/activate
# Windows (WSL2): source .venv/Scripts/activate
Install dependencies. Note that LangChain's Qdrant integration is now a separate package — langchain-qdrant:
pip install langchain langchain-community langchain-ollama
pip install langchain-qdrant # Qdrant integration (separate package since v0.1.2)
pip install langchain-text-splitters # Text splitting (separate package)
pip install qdrant-client
pip install pypdf # PDF support
pip install docx2txt # Word document support (used by Docx2txtLoader)
langchain-qdrant is a required separate install. The old langchain_community.vectorstores.Qdrant class was deprecated in v0.1.2 and is scheduled for removal. Use QdrantVectorStore from langchain_qdrant instead — that's what all code in this article uses.

Pull the embedding model via Ollama:
ollama pull nomic-embed-text
nomic-embed-text is a 137M parameter model purpose-built for text embeddings. It's fast, produces 768-dimensional vectors, and performs well on document retrieval. At ~270MB it's trivial compared to your LLM.
Verify it works:
curl http://localhost:11434/api/embeddings \
-d '{"model": "nomic-embed-text", "prompt": "What is data sovereignty?"}'
You'll get back a JSON object with a list of 768 floating-point numbers — the semantic fingerprint of that phrase.
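What makes those vectors useful is similarity comparison: retrieval ranks chunks by the cosine similarity between the query vector and each chunk vector. A minimal stdlib sketch of the math, using toy 3-dimensional vectors in place of real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dim "embeddings" (real nomic-embed-text vectors have 768 dims)
query = [0.9, 0.1, 0.0]
chunk_same_topic = [0.8, 0.2, 0.1]
chunk_other_topic = [0.0, 0.1, 0.9]

print(cosine_similarity(query, chunk_same_topic) >
      cosine_similarity(query, chunk_other_topic))  # True
```

Qdrant does exactly this comparison, just over millions of vectors with an index that avoids scanning them all.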
Qdrant is the vector database that stores and searches your document embeddings. Run it via Docker with persistent storage:
docker run -d \
--name qdrant \
-p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant
The -v flag is critical. Without volume mounting, your entire indexed collection disappears when the container restarts. Always mount a local directory for persistent storage.

Verify it's running:
curl http://localhost:6333/healthz
# healthz check passed
Qdrant includes a built-in web UI at http://localhost:6333/dashboard — useful for inspecting your collections and running test searches visually.
For quick prototyping without Docker, Qdrant supports in-memory mode:
from qdrant_client import QdrantClient
client = QdrantClient(":memory:") # No Docker needed — data lost on restart
This script reads documents from a folder, splits them into chunks, generates embeddings locally, and stores everything in Qdrant:
# index_documents.py
import os
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
# Configuration
DOCS_FOLDER = "./documents"
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "my_knowledge_base"
EMBEDDING_MODEL = "nomic-embed-text"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
def load_documents(folder: str):
"""Load all supported documents from a folder."""
documents = []
for filepath in Path(folder).rglob("*"):
if filepath.suffix.lower() == ".pdf":
loader = PyPDFLoader(str(filepath))
elif filepath.suffix.lower() in [".txt", ".md"]:
loader = TextLoader(str(filepath), encoding="utf-8")
elif filepath.suffix.lower() in [".docx", ".doc"]:
loader = Docx2txtLoader(str(filepath))
else:
continue
try:
docs = loader.load()
for doc in docs:
doc.metadata["source_file"] = filepath.name
documents.extend(docs)
print(f" Loaded: {filepath.name} ({len(docs)} sections)")
except Exception as e:
print(f" Failed: {filepath.name}: {e}")
return documents
def index_documents():
print("Loading documents...")
documents = load_documents(DOCS_FOLDER)
print("\nSplitting into chunks...")
splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
print("\nGenerating embeddings and indexing into Qdrant...")
embeddings = OllamaEmbeddings(
model=EMBEDDING_MODEL,
base_url="http://localhost:11434"
)
QdrantVectorStore.from_documents(
documents=chunks,
embedding=embeddings,
url=QDRANT_URL,
collection_name=COLLECTION_NAME,
)
print(f"\nDone. {len(chunks)} chunks indexed into '{COLLECTION_NAME}'")
if __name__ == "__main__":
os.makedirs(DOCS_FOLDER, exist_ok=True)
if not any(Path(DOCS_FOLDER).iterdir()):
print(f"Put your documents in '{DOCS_FOLDER}' and run again.")
else:
index_documents()
Docx2txtLoader is used here instead of UnstructuredWordDocumentLoader. The latter requires the heavy unstructured package with many system dependencies; Docx2txtLoader needs only pip install docx2txt and works reliably.

Create the documents folder, add your files, and run:
mkdir documents
# copy your PDFs, TXT, DOCX files into ./documents/
python index_documents.py
Chunking parameters explained:
- chunk_size=1000 — approximately 1000 characters per chunk. Smaller chunks (500) give more precise retrieval but less context per result; larger chunks (2000) provide more context but reduce precision. 1000 is a good starting point.
- chunk_overlap=200 — adjacent chunks share 200 characters, which prevents information loss when a sentence spans two chunk boundaries.
- RecursiveCharacterTextSplitter — tries to split at paragraph boundaries first, then newlines, then sentences. This preserves the natural structure of your documents better than naive character-count splitting.
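To see what chunk_size and chunk_overlap mean mechanically, here is the naive character-count splitter that RecursiveCharacterTextSplitter improves on (a stdlib sketch for illustration, not what LangChain does internally):

```python
def naive_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size sliding window: each chunk re-includes the last `overlap`
    characters of the previous one, so text cut at a boundary still appears
    whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = ("Sentence one is here. Sentence two follows it. "
        "Sentence three ends the paragraph.")
chunks = naive_chunks(text, chunk_size=40, overlap=10)
# chunks[0][-10:] == chunks[1][:10] — the shared overlap region
```

The recursive splitter keeps the same size/overlap contract but prefers to cut at paragraph and sentence boundaries rather than at an arbitrary character offset.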
This script connects to your indexed Qdrant collection, retrieves relevant chunks for a query, and sends them to your local LLM for synthesis.
Many tutorials on the internet still use old import paths that no longer work on current LangChain versions:

from langchain.schema.runnable import RunnablePassthrough    # ✗ old, broken
from langchain.schema.output_parser import StrOutputParser   # ✗ old, broken
from langchain.prompts import ChatPromptTemplate             # ✗ old, broken

The correct current paths are all in langchain_core:

from langchain_core.runnables import RunnablePassthrough     # ✓
from langchain_core.output_parsers import StrOutputParser    # ✓
from langchain_core.prompts import ChatPromptTemplate        # ✓
# query.py
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_qdrant import QdrantVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from qdrant_client import QdrantClient
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "my_knowledge_base"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.1:8b"
TOP_K = 4
def build_rag_chain():
embeddings = OllamaEmbeddings(
model=EMBEDDING_MODEL,
base_url="http://localhost:11434"
)
qdrant_client = QdrantClient(url=QDRANT_URL)
vectorstore = QdrantVectorStore(
client=qdrant_client,
collection_name=COLLECTION_NAME,
embedding=embeddings
)
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": TOP_K}
)
llm = ChatOllama(
model=LLM_MODEL,
base_url="http://localhost:11434",
temperature=0.1
)
prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based strictly on the provided context.
If the answer is not in the context, say "I don't have enough information in the provided
documents to answer this." Do not make up information.
Context:
{context}
Question: {question}
Answer:""")
def format_docs(docs):
formatted = []
for doc in docs:
src = doc.metadata.get("source_file", "unknown")
formatted.append(f"[Source: {src}]\n{doc.page_content}")
return "\n\n---\n\n".join(formatted)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
return rag_chain, retriever
def ask(question: str, rag_chain, retriever):
print(f"\nQuestion: {question}")
retrieved_docs = retriever.invoke(question)
for i, doc in enumerate(retrieved_docs, 1):
src = doc.metadata.get("source_file", "unknown")
print(f" [{i}] {src}: {doc.page_content[:80].replace(chr(10),' ')}...")
answer = rag_chain.invoke(question)
print(f"\nAnswer: {answer}")
return answer
if __name__ == "__main__":
rag_chain, retriever = build_rag_chain()
questions = [
"What are the main data retention policies described in the documents?",
"Who is responsible for security incident reporting?",
"What are the consequences of non-compliance?",
]
for q in questions:
ask(q, rag_chain, retriever)
print("\n" + "="*60)
python query.py
For interactive use, here is a simple command-line loop. It deliberately includes a sources command: for compliance and debugging, you want to see exactly which passages an answer was based on.
# chat.py
from langchain_core.messages import HumanMessage, AIMessage
from query import build_rag_chain
def chat_with_documents():
rag_chain, retriever = build_rag_chain()
history = []
last_docs = []
print("RAG Chat — ask questions about your documents.")
print("Type 'quit' to exit, 'sources' to show last retrieved documents.\n")
while True:
question = input("You: ").strip()
if question.lower() == "quit": break
if question.lower() == "sources":
if last_docs:
for doc in last_docs:
print(f" - {doc.metadata.get('source_file', 'unknown')}")
print(f" {doc.page_content[:150]}...")
continue
if not question: continue
last_docs = retriever.invoke(question)
answer = rag_chain.invoke(question)
print(f"\nAssistant: {answer}\n")
history.extend([HumanMessage(content=question), AIMessage(content=answer)])
if __name__ == "__main__":
chat_with_documents()
python chat.py
# You: What does the contract say about liability?
# Assistant: Based on the contract documents, liability is limited to...
# You: sources
# - contract_v3.pdf
# ...section 8.2 regarding limitation of liability...
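Note that the chat loop accumulates history but the RAG chain itself is stateless, so a follow-up like "what about termination?" retrieves poorly on its own. A crude workaround (a hypothetical helper, not part of the scripts above; a proper fix would use the LLM to condense history into a standalone question) is to prepend recent user turns to the retrieval query:

```python
def contextualize(question: str, history: list[dict], max_turns: int = 2) -> str:
    """Prepend the last few user turns so vague follow-ups carry their
    context into retrieval. `history` is assumed to be a list of
    {"role": ..., "content": ...} dicts."""
    recent = [t["content"] for t in history if t["role"] == "user"][-max_turns:]
    return " ".join(recent + [question])

history = [
    {"role": "user", "content": "What does the contract say about liability?"},
    {"role": "assistant", "content": "Liability is limited to..."},
]
print(contextualize("What about termination?", history))
# What does the contract say about liability? What about termination?
```

This inflates the query with some noise, but it usually beats retrieving on a context-free follow-up.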
You don't need to re-index everything when you add new documents. Create a separate script that adds to the existing Qdrant collection:
# add_documents.py
import sys
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from index_documents import EMBEDDING_MODEL, QDRANT_URL, COLLECTION_NAME, CHUNK_SIZE, CHUNK_OVERLAP
def add_document(filepath: str):
path = Path(filepath)
loader = PyPDFLoader(filepath) if path.suffix.lower() == ".pdf" else TextLoader(filepath, encoding="utf-8")
documents = loader.load()
for doc in documents: doc.metadata["source_file"] = path.name
chunks = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
).split_documents(documents)
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url="http://localhost:11434")
client = QdrantClient(url=QDRANT_URL)
# add_documents on an existing QdrantVectorStore instance — no recreate
vectorstore = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME, embedding=embeddings)
vectorstore.add_documents(chunks)
print(f"Added {len(chunks)} chunks from {path.name}")
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python add_documents.py path/to/document.pdf")
else: add_document(sys.argv[1])
python add_documents.py new_policy_document.pdf
# Added 47 chunks from new_policy_document.pdf
The new document is immediately searchable. No downtime, no full reindex.
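One caveat with incremental adds: running the script twice on the same file stores its chunks twice. A common pattern is to derive a deterministic point ID from chunk content, so re-adding the same content overwrites instead of duplicating. Sketched here with the stdlib only; passing these as IDs to the vector store is left to you (LangChain's add_documents accepts an ids keyword in recent versions, but verify against your version; Qdrant point IDs must be UUIDs or unsigned integers, hence the UUID form):

```python
import hashlib
import uuid

def chunk_id(source_file: str, chunk_text: str) -> str:
    """Deterministic UUID from source + content: identical chunks map to
    identical IDs, so an upsert replaces rather than duplicates."""
    digest = hashlib.sha256(f"{source_file}\n{chunk_text}".encode("utf-8")).digest()
    return str(uuid.UUID(bytes=digest[:16]))

a = chunk_id("policy.pdf", "Data is retained for 90 days.")
b = chunk_id("policy.pdf", "Data is retained for 90 days.")
c = chunk_id("policy.pdf", "Data is retained for 30 days.")
print(a == b, a == c)  # True False
```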
Practical Considerations
Chunk size tuning is the single biggest lever for RAG quality. Smaller chunks (500) give more precise retrieval but may lack context. Larger chunks (2000) provide more context per result but reduce precision. Start at 1000 and experiment.
TOP_K (retrieved chunks per query) affects quality and prompt length. More chunks = more context but more tokens and more noise. Start with 4, experiment with 3–6.
Embedding model consistency is mandatory. The model used during indexing and during query must be identical. If you switch embedding models, you must re-index your entire collection — vectors from different models are not comparable.
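A lightweight way to enforce that consistency (a hypothetical convention, not a Qdrant or LangChain feature): write a small manifest at index time recording which embedding model was used, and fail fast at query time if it doesn't match:

```python
import json
from pathlib import Path

MANIFEST = Path("collection_manifest.json")

def write_manifest(collection: str, embedding_model: str) -> None:
    """Record at index time which embedding model produced the vectors."""
    MANIFEST.write_text(json.dumps(
        {"collection": collection, "embedding_model": embedding_model}))

def check_manifest(collection: str, embedding_model: str) -> None:
    """Fail fast instead of silently comparing vectors from different models."""
    saved = json.loads(MANIFEST.read_text())
    if saved["collection"] == collection and saved["embedding_model"] != embedding_model:
        raise RuntimeError(
            f"Collection '{collection}' was indexed with "
            f"{saved['embedding_model']}, not {embedding_model}. "
            f"Re-index before querying.")

write_manifest("my_knowledge_base", "nomic-embed-text")
check_manifest("my_knowledge_base", "nomic-embed-text")  # passes silently
```

Call write_manifest at the end of indexing and check_manifest at the start of querying; a mismatch then surfaces as a clear error instead of as mysteriously bad retrieval.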
The system prompt instruction to "answer only based on the provided context" is what prevents the model from mixing retrieved content with training data in ways that produce confident-sounding hallucinations.
Complete Docker Compose Setup
Combining all three articles, here's a docker-compose.yaml that runs the full stack as a persistent local service. Two files are required in the same folder: this compose file and the litellm_config.yaml from Article #2.
services:
ollama:
image: ollama/ollama
volumes:
- ollama_models:/root/.ollama
ports:
- "11434:11434"
restart: unless-stopped
qdrant:
image: qdrant/qdrant
volumes:
- qdrant_storage:/qdrant/storage
ports:
- "6333:6333"
restart: unless-stopped
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
volumes:
- ./litellm_config.yaml:/app/config.yaml
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:-not_set}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-not_set}
command: ["--config", "/app/config.yaml", "--port", "4000"]
depends_on:
- ollama
restart: unless-stopped
volumes:
ollama_models:
qdrant_storage:
docker compose up -d
# Pull models once
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text
What You've Built
Across three articles you've assembled a complete local AI data stack:
- A local LLM serving inference via Ollama (article #1)
- An LLM gateway with routing, cost tracking, and access control via LiteLLM (article #2)
- A document retrieval layer that makes your own data queryable without sending it anywhere (this article)
Every component — embedding, vector search, language model inference — runs locally. The data flow is end-to-end private and verifiable, not just a policy claim.
Common Issues
- Cannot import Qdrant from langchain_community: the old Qdrant class was deprecated and removed. Install the correct package: pip install langchain-qdrant. Then use from langchain_qdrant import QdrantVectorStore — that's what all code in this article uses.
- Import errors for RunnablePassthrough, StrOutputParser, or ChatPromptTemplate: these moved to langchain_core. Use: from langchain_core.runnables import RunnablePassthrough, from langchain_core.output_parsers import StrOutputParser, from langchain_core.prompts import ChatPromptTemplate.
- Connection refused on port 6333: the Qdrant container isn't running. Run docker ps to check. If it's missing, restart with docker run -d --name qdrant -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant. If it stopped unexpectedly: docker start qdrant.
- Collection disappears after a restart: the container was started without the -v volume flag. Without persistent storage, the entire collection disappears when the container restarts. Re-run the indexing script and always use the volume mount.