Local Sovereign AI · Article #3

RAG on Your Own Data —
Without Sending Anything to the Cloud

You have a local LLM. You have a gateway. But the model still knows nothing about your data. RAG is how you fix that — and the entire pipeline runs locally.

~60 min total · Intermediate · Requires Articles #1 and #2

Time breakdown: Install + Qdrant ~10 min · Index documents ~15 min · Query pipeline ~15 min · Chat interface ~20 min

Ollama (Article #1) and a working venv from Article #2 are required before you start.

The Problem With Local Knowledge

Language models only know what was in their training data, and that data has a cutoff date. Ask a local Llama model about your Q3 internal risk report, your customer contracts, or your company's knowledge base — and it will either hallucinate something plausible-sounding or honestly tell you it doesn't know.

The naive solution is to paste the document into the prompt. This works until the document is longer than the context window, or until you have ten documents, or a hundred, or a SharePoint full of thousands of PDFs. Stuffing entire document collections into every prompt is expensive in tokens, slow to process, and often worse than a targeted retrieval approach — because the model struggles to focus when given too much irrelevant context.

RAG — Retrieval Augmented Generation — solves this properly. Instead of sending all your documents to the model, you first retrieve only the relevant passages for a given query, then send just those passages as context. The model answers based on the retrieved content rather than its training data alone. It's the difference between asking someone to memorize your entire library versus giving them a good search engine.

The key insight: a vector database and an embedding model — the two components that make RAG work — can both run entirely locally. Neither requires a cloud API.

How RAG Works — Two Distinct Phases
Phase 1 — Indexing (run once): documents (PDF / TXT / DOCX) are split into chunks of ~1000 characters, each chunk is embedded with nomic-embed-text, and the resulting vectors are stored in Qdrant. Each chunk becomes a vector — a semantic fingerprint stored locally.

Phase 2 — Retrieval (every query): the user's question is embedded with the same model, Qdrant returns the top-K closest chunks, and the LLM generates an answer from them. Semantically similar chunks are retrieved even without an exact keyword match.

Why vector search finds semantically related content: the query "vacation policy" becomes a vector like [0.21, -0.87, ...]; a cosine similarity search finds mathematically close vectors, such as "annual leave entitlements" at [0.19, -0.91, ...]. Different words, same semantic space — no keyword overlap required.
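The cosine-similarity idea can be sketched in a few lines of plain Python. The vectors here are toy 3-dimensional stand-ins for the real 768-dimensional embeddings — the numbers are made up for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
vacation_policy = [0.21, -0.87, 0.40]   # "vacation policy"
annual_leave    = [0.19, -0.91, 0.38]   # "annual leave entitlements"
q3_revenue      = [-0.75, 0.30, 0.55]   # "Q3 revenue figures"

print(cosine_similarity(vacation_policy, annual_leave))  # close to 1.0
print(cosine_similarity(vacation_policy, q3_revenue))    # much lower
```

Related phrases end up pointing in nearly the same direction; unrelated ones don't. That's the entire trick behind retrieval without keyword matching.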

The Stack We're Building

Everything in this stack is open-source and runs locally. Nothing touches a cloud API unless you explicitly configure it to.

Complete Local Data Flow — Nothing Leaves Your Machine
Your docs (~/documents/) → embed (nomic-embed-text via Ollama) → Qdrant (:6333, local) → retrieve top-K chunks → LLM (Ollama, :11434) → answer. Zero external API calls in this pipeline — every step runs on your hardware.
1
Set Up the Environment
~10 min

Create a dedicated virtual environment for this project — keeping its dependencies isolated:

bash — create venv
# Create project folder and virtual environment
mkdir rag-local && cd rag-local
python3 -m venv .venv

# Activate — macOS / Linux:
source .venv/bin/activate
# Windows (WSL2): same as Linux — source .venv/bin/activate
# Windows (native, PowerShell): .venv\Scripts\Activate.ps1

Install dependencies. Note that LangChain's Qdrant integration is now a separate package — langchain-qdrant:

bash — install packages
pip install langchain langchain-community langchain-ollama
pip install langchain-qdrant          # Qdrant integration (separate package since v0.1.2)
pip install langchain-text-splitters  # Text splitting (separate package)
pip install qdrant-client
pip install pypdf                     # PDF support
pip install docx2txt                  # Word (.docx) support for Docx2txtLoader
Important: langchain-qdrant is a required separate install. The old langchain_community.vectorstores.Qdrant class was deprecated in v0.1.2 and scheduled for removal. Use QdrantVectorStore from langchain_qdrant instead — that's what all code in this article uses.

Pull the embedding model via Ollama:

bash
ollama pull nomic-embed-text

nomic-embed-text is a 137M parameter model purpose-built for text embeddings. It's fast, produces 768-dimensional vectors, and performs well on document retrieval. At ~270MB it's trivial compared to your LLM.

Verify it works:

bash — verify embedding endpoint
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is data sovereignty?"}'

You'll get back a JSON object with a list of 768 floating-point numbers — the semantic fingerprint of that phrase.

2
Start Qdrant
~5 min

Qdrant is the vector database that stores and searches your document embeddings. Run it via Docker with persistent storage:

bash
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
The -v flag is critical. Without volume mounting, your entire indexed collection disappears when the container restarts. Always mount a local directory for persistent storage.

Verify it's running:

bash
curl http://localhost:6333/healthz
# healthz check passed

curl http://localhost:6333/
# {"title":"qdrant - vector search engine","version":"..."}

Qdrant includes a built-in web UI at http://localhost:6333/dashboard — useful for inspecting your collections and running test searches visually.

For quick prototyping without Docker, Qdrant supports in-memory mode:

python — in-memory (prototype only)
from qdrant_client import QdrantClient
client = QdrantClient(":memory:")   # No Docker needed — data lost on restart
3
Index Your Documents
~15 min

This script reads documents from a folder, splits them into chunks, generates embeddings locally, and stores everything in Qdrant:

python — index_documents.py
# index_documents.py
import os
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore

# Configuration
DOCS_FOLDER = "./documents"
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "my_knowledge_base"
EMBEDDING_MODEL = "nomic-embed-text"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


def load_documents(folder: str):
    """Load all supported documents from a folder."""
    documents = []
    for filepath in Path(folder).rglob("*"):
        if filepath.suffix.lower() == ".pdf":
            loader = PyPDFLoader(str(filepath))
        elif filepath.suffix.lower() in [".txt", ".md"]:
            loader = TextLoader(str(filepath), encoding="utf-8")
        elif filepath.suffix.lower() == ".docx":  # legacy .doc is not supported by docx2txt
            loader = Docx2txtLoader(str(filepath))
        else:
            continue
        try:
            docs = loader.load()
            for doc in docs:
                doc.metadata["source_file"] = filepath.name
            documents.extend(docs)
            print(f"  Loaded: {filepath.name} ({len(docs)} sections)")
        except Exception as e:
            print(f"  Failed: {filepath.name}: {e}")
    return documents


def index_documents():
    print("Loading documents...")
    documents = load_documents(DOCS_FOLDER)

    print("\nSplitting into chunks...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")

    print("\nGenerating embeddings and indexing into Qdrant...")
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url="http://localhost:11434"
    )

    QdrantVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        url=QDRANT_URL,
        collection_name=COLLECTION_NAME,
    )
    print(f"\nDone. {len(chunks)} chunks indexed into '{COLLECTION_NAME}'")


if __name__ == "__main__":
    os.makedirs(DOCS_FOLDER, exist_ok=True)
    if not any(Path(DOCS_FOLDER).iterdir()):
        print(f"Put your documents in '{DOCS_FOLDER}' and run again.")
    else:
        index_documents()
Word documents note: Docx2txtLoader is used here instead of UnstructuredWordDocumentLoader. The latter requires the heavy unstructured package with many system dependencies. Docx2txtLoader needs only pip install docx2txt and works reliably — for modern .docx files; legacy .doc is not supported.

Create the documents folder, add your files, and run:

bash
mkdir documents
# copy your PDFs, TXT, DOCX files into ./documents/
python index_documents.py

Chunking parameters explained: CHUNK_SIZE (1000 characters) is the target length of each chunk — large enough to carry a coherent idea, small enough for precise retrieval. CHUNK_OVERLAP (200 characters) repeats the tail of one chunk at the start of the next, so sentences that straddle a boundary aren't cut off. The separators list tells the splitter to prefer paragraph breaks, then line breaks, then sentence ends, falling back to single words only when it must.
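A stripped-down character splitter shows how chunk size and overlap interact. This is a toy sketch, not LangChain's actual splitter — the real RecursiveCharacterTextSplitter additionally respects separator boundaries:

```python
def naive_chunk(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap: each chunk starts (chunk_size - overlap)
    characters after the previous one, so adjacent chunks share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
print([len(c) for c in naive_chunk(doc)])  # [1000, 1000, 900]
```

Note that 2500 characters yield three chunks, not two and a half — the overlap means each step forward covers only 800 new characters, which is exactly the redundancy that keeps boundary-straddling sentences retrievable.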

4
Build the RAG Query Pipeline
~15 min

This script connects to your indexed Qdrant collection, retrieves relevant chunks for a query, and sends them to your local LLM for synthesis.

LangChain import paths changed — this is a common source of errors.
Many tutorials on the internet still use the old paths which no longer work on current LangChain versions:

from langchain.schema.runnable import RunnablePassthrough  ✗ old, broken
from langchain.schema.output_parser import StrOutputParser  ✗ old, broken
from langchain.prompts import ChatPromptTemplate  ✗ old, broken

The correct current paths — all in langchain_core:

from langchain_core.runnables import RunnablePassthrough  ✓
from langchain_core.output_parsers import StrOutputParser  ✓
from langchain_core.prompts import ChatPromptTemplate  ✓
python — query.py
# query.py
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_qdrant import QdrantVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from qdrant_client import QdrantClient

QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "my_knowledge_base"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.1:8b"
TOP_K = 4


def build_rag_chain():
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url="http://localhost:11434"
    )
    qdrant_client = QdrantClient(url=QDRANT_URL)
    vectorstore = QdrantVectorStore(
        client=qdrant_client,
        collection_name=COLLECTION_NAME,
        embedding=embeddings
    )
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": TOP_K}
    )
    llm = ChatOllama(
        model=LLM_MODEL,
        base_url="http://localhost:11434",
        temperature=0.1
    )
    prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based strictly on the provided context.
If the answer is not in the context, say "I don't have enough information in the provided
documents to answer this." Do not make up information.

Context:
{context}

Question: {question}

Answer:""")

    def format_docs(docs):
        formatted = []
        for doc in docs:
            src = doc.metadata.get("source_file", "unknown")
            formatted.append(f"[Source: {src}]\n{doc.page_content}")
        return "\n\n---\n\n".join(formatted)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain, retriever


def ask(question: str, rag_chain, retriever):
    print(f"\nQuestion: {question}")
    retrieved_docs = retriever.invoke(question)
    for i, doc in enumerate(retrieved_docs, 1):
        src = doc.metadata.get("source_file", "unknown")
        print(f"  [{i}] {src}: {doc.page_content[:80].replace(chr(10),' ')}...")
    answer = rag_chain.invoke(question)
    print(f"\nAnswer: {answer}")
    return answer


if __name__ == "__main__":
    rag_chain, retriever = build_rag_chain()
    questions = [
        "What are the main data retention policies described in the documents?",
        "Who is responsible for security incident reporting?",
        "What are the consequences of non-compliance?",
    ]
    for q in questions:
        ask(q, rag_chain, retriever)
        print("\n" + "="*60)
bash
python query.py
Why the system prompt matters: The instruction "answer only based on the provided context" is what prevents hallucination. Test explicitly by asking about something not in your documents — the model should say it doesn't know, not invent an answer.
5
Interactive Chat Interface
~10 min

For interactive use, here's a simple command-line loop with a sources command — deliberately included so you can audit which documents informed each answer, for both compliance and debugging:

python — chat.py
# chat.py
from langchain_core.messages import HumanMessage, AIMessage
from query import build_rag_chain

def chat_with_documents():
    rag_chain, retriever = build_rag_chain()
    history = []
    last_docs = []

    print("RAG Chat — ask questions about your documents.")
    print("Type 'quit' to exit, 'sources' to show last retrieved documents.\n")

    while True:
        question = input("You: ").strip()
        if question.lower() == "quit": break
        if question.lower() == "sources":
            if last_docs:
                for doc in last_docs:
                    print(f"  - {doc.metadata.get('source_file', 'unknown')}")
                    print(f"    {doc.page_content[:150]}...")
            continue
        if not question: continue

        last_docs = retriever.invoke(question)
        answer = rag_chain.invoke(question)
        print(f"\nAssistant: {answer}\n")
        # History is collected but not yet fed back into the chain — groundwork for multi-turn support
        history.extend([HumanMessage(content=question), AIMessage(content=answer)])

if __name__ == "__main__":
    chat_with_documents()
bash
python chat.py

# You: What does the contract say about liability?
# Assistant: Based on the contract documents, liability is limited to...
# You: sources
#   - contract_v3.pdf
#     ...section 8.2 regarding limitation of liability...
6
Adding New Documents Without Rebuilding
~5 min

You don't need to re-index everything when you add new documents. Create a separate script that adds to the existing Qdrant collection:

python — add_documents.py
# add_documents.py
import sys
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from index_documents import EMBEDDING_MODEL, QDRANT_URL, COLLECTION_NAME, CHUNK_SIZE, CHUNK_OVERLAP

def add_document(filepath: str):
    path = Path(filepath)
    loader = PyPDFLoader(filepath) if path.suffix.lower() == ".pdf" else TextLoader(filepath, encoding="utf-8")
    documents = loader.load()
    for doc in documents: doc.metadata["source_file"] = path.name

    chunks = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    ).split_documents(documents)

    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url="http://localhost:11434")
    client = QdrantClient(url=QDRANT_URL)

    # add_documents on an existing QdrantVectorStore instance — no recreate
    vectorstore = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME, embedding=embeddings)
    vectorstore.add_documents(chunks)
    print(f"Added {len(chunks)} chunks from {path.name}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python add_documents.py path/to/document.pdf")
    else: add_document(sys.argv[1])
bash
python add_documents.py new_policy_document.pdf
# Added 47 chunks from new_policy_document.pdf

The new document is immediately searchable. No downtime, no full reindex.

Practical Considerations

Chunk size tuning is the single biggest lever for RAG quality. Smaller chunks (500) give more precise retrieval but may lack context. Larger chunks (2000) provide more context per result but reduce precision. Start at 1000 and experiment.

TOP_K (retrieved chunks per query) affects quality and prompt length. More chunks = more context but more tokens and more noise. Start with 4, experiment with 3–6.
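Conceptually, top-K retrieval is just "score every chunk, keep the K best." A minimal sketch, with made-up similarity scores standing in for real cosine values:

```python
def top_k(scored_chunks: list[tuple[str, float]], k: int = 4) -> list[str]:
    """Return the k chunk texts with the highest similarity scores."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, _score in ranked[:k]]

# Hypothetical scores for the query "vacation policy"
scores = [
    ("annual leave is 25 days", 0.91),
    ("expense reports are due monthly", 0.34),
    ("carry-over of unused leave", 0.88),
    ("office wifi password policy", 0.12),
    ("parental leave provisions", 0.79),
]
print(top_k(scores, k=3))
# ['annual leave is 25 days', 'carry-over of unused leave', 'parental leave provisions']
```

In practice Qdrant doesn't literally score every vector — it uses an approximate index (HNSW) to find the nearest neighbors efficiently — but the contract is the same: the K chunks most similar to the query reach the prompt, everything else stays out.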

Embedding model consistency is mandatory. The model used during indexing and during query must be identical. If you switch embedding models, you must re-index your entire collection — vectors from different models are not comparable.

The system prompt instruction to "answer only based on the provided context" is what prevents the model from mixing retrieved content with training data in ways that produce confident-sounding hallucinations.

Complete Docker Compose Setup

Combining all three articles, here's a docker-compose.yaml that runs the full stack as a persistent local service. You need two files in the same folder: this docker-compose.yaml and the litellm_config.yaml from Article #2:

yaml — docker-compose.yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  qdrant:
    image: qdrant/qdrant
    volumes:
      - qdrant_storage:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY:-not_set}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-not_set}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_models:
  qdrant_storage:
bash — start the full stack
docker compose up -d

# Pull models once
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

What You've Built

Across three articles you've assembled a complete local AI data stack:

Every component — embedding, vector search, language model inference — runs locally. The data flow is end-to-end private and verifiable, not just a policy claim.

Common Issues

ImportError: cannot import name 'Qdrant' from 'langchain_community.vectorstores'
The old Qdrant class was deprecated and removed. Install the correct package: pip install langchain-qdrant. Then use from langchain_qdrant import QdrantVectorStore — that's what all code in this article uses.
ImportError: cannot import name 'RunnablePassthrough' from 'langchain.schema.runnable'
Import paths moved to langchain_core. Use: from langchain_core.runnables import RunnablePassthrough, from langchain_core.output_parsers import StrOutputParser, from langchain_core.prompts import ChatPromptTemplate.
Qdrant connection refused on port 6333
The Qdrant container isn't running. Run docker ps to check. If it's missing, restart with docker run -d --name qdrant -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant. If it stopped unexpectedly: docker start qdrant.
Collection not found — but I indexed documents earlier
You ran Qdrant without the -v volume flag. Without persistent storage, the entire collection disappears when the container restarts. Re-run the indexing script and always use the volume mount.
Embedding is very slow (minutes per document)
Embedding runs on your local hardware. On CPU, expect 50–200 chunks per minute. For large document collections, let it run overnight. On Apple Silicon or with an NVIDIA GPU, Ollama will use the accelerator automatically and embedding will be significantly faster.
RAG answers are hallucinated or irrelevant
Two separate problems: (1) Hallucination — ensure the system prompt says "answer only based on the provided context." (2) Irrelevant chunks retrieved — experiment with smaller chunk sizes (500–800) and verify that embeddings are from the same model used during indexing.