Local Sovereign AI · Article #3

RAG on Your Own Data —
Without Sending Anything to the Cloud

You have a local LLM. You have a gateway. But the model still knows nothing about your data. RAG is how you fix that — and the entire pipeline runs locally.

~60 min total · Intermediate · Requires Articles #1 and #2

Time breakdown: Install + Qdrant ~10 min · Index documents ~15 min · Query pipeline ~15 min · Chat interface ~20 min

Ollama (Article #1) and a working venv from Article #2 are required before you start.

The Problem With Local Knowledge

Language models only know what was in their training data, and that data has a cutoff date. Ask a local Llama model about your Q3 internal risk report, your customer contracts, or your company's knowledge base — and it will either hallucinate something plausible-sounding or honestly tell you it doesn't know.

The naive solution is to paste the document into the prompt. This works until the document is longer than the context window, or until you have ten documents, or a hundred, or a SharePoint full of thousands of PDFs. Stuffing entire document collections into every prompt is expensive in tokens, slow to process, and often worse than a targeted retrieval approach — because the model struggles to focus when given too much irrelevant context.

RAG — Retrieval Augmented Generation — solves this properly. Instead of sending all your documents to the model, you first retrieve only the relevant passages for a given query, then send just those passages as context. The model answers based on the retrieved content rather than its training data alone. It's the difference between asking someone to memorize your entire library versus giving them a good search engine.

The key insight: a vector database and an embedding model — the two components that make RAG work — can both run entirely locally. Neither requires a cloud API.

How RAG Works — Two Distinct Phases
Phase 1 — Indexing (run once): documents (PDF / TXT / DOCX) are split into chunks of ~1000 characters, each chunk is embedded with nomic-embed-text, and the resulting vectors are stored in Qdrant. Each chunk becomes a vector — a semantic fingerprint stored locally.

Phase 2 — Retrieval (every query): the user's question is embedded with the same model, Qdrant returns the top-K closest chunks, and the LLM generates an answer from them. Semantically similar chunks are retrieved even without an exact keyword match.

Why vector search finds semantically related content: the query "vacation policy" becomes a vector like [0.21, -0.87, ...]; a cosine similarity search finds mathematically close vectors, such as "annual leave entitlements" at [0.19, -0.91, ...]. Different words, same semantic space — no keyword overlap required.
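The cosine-similarity idea can be sketched in a few lines of plain Python. The vectors here are toy 3-dimensional stand-ins for the real 768-dimensional embeddings — the numbers are made up for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
vacation_policy = [0.21, -0.87, 0.40]   # "vacation policy"
annual_leave    = [0.19, -0.91, 0.38]   # "annual leave entitlements"
q3_revenue      = [-0.75, 0.30, 0.55]   # "Q3 revenue figures"

print(cosine_similarity(vacation_policy, annual_leave))  # close to 1.0
print(cosine_similarity(vacation_policy, q3_revenue))    # much lower
```

Related phrases end up pointing in nearly the same direction; unrelated ones don't. That's the entire trick behind retrieval without keyword matching.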

The Stack We're Building

Everything in this stack is open-source and runs locally. Nothing touches a cloud API unless you explicitly configure it to.

Complete Local Data Flow — Nothing Leaves Your Machine
Your docs (~/documents/) → embed (nomic-embed-text via Ollama) → Qdrant (:6333, local) → retrieve top-K chunks → LLM (Ollama, :11434) → answer. Zero external API calls in this pipeline — every step runs on your hardware.
1
Set Up the Environment
~10 min

Create a dedicated virtual environment for this project — keeping its dependencies isolated:

bash — create venv
# Create project folder and virtual environment
mkdir rag-local && cd rag-local
python3 -m venv .venv

# Activate — macOS / Linux:
source .venv/bin/activate
# Windows (WSL2): same as Linux — source .venv/bin/activate
# Windows (native, PowerShell): .venv\Scripts\Activate.ps1

Install dependencies. Note that LangChain's Qdrant integration is now a separate package — langchain-qdrant:

bash — install packages
pip install langchain langchain-community langchain-ollama
pip install langchain-qdrant          # Qdrant integration (separate package since v0.1.2)
pip install langchain-text-splitters  # Text splitting (separate package)
pip install qdrant-client
pip install pypdf                     # PDF support
pip install docx2txt                  # Word (.docx) support for Docx2txtLoader
Important: langchain-qdrant is a required separate install. The old langchain_community.vectorstores.Qdrant class was deprecated in v0.1.2 and scheduled for removal. Use QdrantVectorStore from langchain_qdrant instead — that's what all code in this article uses.

Pull the embedding model via Ollama:

bash
ollama pull nomic-embed-text

nomic-embed-text is a 137M parameter model purpose-built for text embeddings. It's fast, produces 768-dimensional vectors, and performs well on document retrieval. At ~270MB it's trivial compared to your LLM.

Verify it works:

bash — verify embedding endpoint
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is data sovereignty?"}'

You'll get back a JSON object with a list of 768 floating-point numbers — the semantic fingerprint of that phrase.

2
Start Qdrant
~5 min

Qdrant is the vector database that stores and searches your document embeddings. Run it via Docker with persistent storage:

bash
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
The -v flag is critical. Without volume mounting, your entire indexed collection disappears when the container restarts. Always mount a local directory for persistent storage.

Verify it's running:

bash
curl http://localhost:6333/healthz
# healthz check passed

curl http://localhost:6333/
# {"title":"qdrant - vector search engine","version":"..."}

Qdrant includes a built-in web UI at http://localhost:6333/dashboard — useful for inspecting your collections and running test searches visually.

For quick prototyping without Docker, Qdrant supports in-memory mode:

python — in-memory (prototype only)
from qdrant_client import QdrantClient
client = QdrantClient(":memory:")   # No Docker needed — data lost on restart
3
Index Your Documents
~15 min

This script reads documents from a folder, splits them into chunks, generates embeddings locally, and stores everything in Qdrant:

python — index_documents.py
# index_documents.py
import os
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore

# Configuration
DOCS_FOLDER = "./documents"
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "my_knowledge_base"
EMBEDDING_MODEL = "nomic-embed-text"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


def load_documents(folder: str):
    """Load all supported documents from a folder."""
    documents = []
    for filepath in Path(folder).rglob("*"):
        if filepath.suffix.lower() == ".pdf":
            loader = PyPDFLoader(str(filepath))
        elif filepath.suffix.lower() in [".txt", ".md"]:
            loader = TextLoader(str(filepath), encoding="utf-8")
        elif filepath.suffix.lower() == ".docx":  # legacy .doc is not supported by docx2txt
            loader = Docx2txtLoader(str(filepath))
        else:
            continue
        try:
            docs = loader.load()
            for doc in docs:
                doc.metadata["source_file"] = filepath.name
            documents.extend(docs)
            print(f"  Loaded: {filepath.name} ({len(docs)} sections)")
        except Exception as e:
            print(f"  Failed: {filepath.name}: {e}")
    return documents


def index_documents():
    print("Loading documents...")
    documents = load_documents(DOCS_FOLDER)

    print("\nSplitting into chunks...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")

    print("\nGenerating embeddings and indexing into Qdrant...")
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url="http://localhost:11434"
    )

    QdrantVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        url=QDRANT_URL,
        collection_name=COLLECTION_NAME,
    )
    print(f"\nDone. {len(chunks)} chunks indexed into '{COLLECTION_NAME}'")


if __name__ == "__main__":
    os.makedirs(DOCS_FOLDER, exist_ok=True)
    if not any(Path(DOCS_FOLDER).iterdir()):
        print(f"Put your documents in '{DOCS_FOLDER}' and run again.")
    else:
        index_documents()
Word documents note: Docx2txtLoader is used here instead of UnstructuredWordDocumentLoader. The latter requires the heavy unstructured package with many system dependencies. Docx2txtLoader needs only pip install docx2txt and works reliably — for modern .docx files; legacy .doc is not supported.

Create the documents folder, add your files, and run:

bash
mkdir documents
# copy your PDFs, TXT, DOCX files into ./documents/
python index_documents.py

Chunking parameters explained: CHUNK_SIZE (1000 characters) is the target length of each chunk — large enough to carry a coherent idea, small enough for precise retrieval. CHUNK_OVERLAP (200 characters) repeats the tail of one chunk at the start of the next, so sentences that straddle a boundary aren't cut off. The separators list tells the splitter to prefer paragraph breaks, then line breaks, then sentence ends, falling back to single words only when it must.
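A stripped-down character splitter shows how chunk size and overlap interact. This is a toy sketch, not LangChain's actual splitter — the real RecursiveCharacterTextSplitter additionally respects separator boundaries:

```python
def naive_chunk(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap: each chunk starts (chunk_size - overlap)
    characters after the previous one, so adjacent chunks share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
print([len(c) for c in naive_chunk(doc)])  # [1000, 1000, 900]
```

Note that 2500 characters yield three chunks, not two and a half — the overlap means each step forward covers only 800 new characters, which is exactly the redundancy that keeps boundary-straddling sentences retrievable.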

4
Build the RAG Query Pipeline
~15 min

This script connects to your indexed Qdrant collection, retrieves relevant chunks for a query, and sends them to your local LLM for synthesis.

LangChain import paths changed — this is a common source of errors.
Many tutorials on the internet still use the old paths which no longer work on current LangChain versions:

from langchain.schema.runnable import RunnablePassthrough  ✗ old, broken
from langchain.schema.output_parser import StrOutputParser  ✗ old, broken
from langchain.prompts import ChatPromptTemplate  ✗ old, broken

The correct current paths — all in langchain_core:

from langchain_core.runnables import RunnablePassthrough  ✓
from langchain_core.output_parsers import StrOutputParser  ✓
from langchain_core.prompts import ChatPromptTemplate  ✓
python — query.py
# query.py
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_qdrant import QdrantVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from qdrant_client import QdrantClient

QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "my_knowledge_base"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.1:8b"
TOP_K = 4


def build_rag_chain():
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url="http://localhost:11434"
    )
    qdrant_client = QdrantClient(url=QDRANT_URL)
    vectorstore = QdrantVectorStore(
        client=qdrant_client,
        collection_name=COLLECTION_NAME,
        embedding=embeddings
    )
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": TOP_K}
    )
    llm = ChatOllama(
        model=LLM_MODEL,
        base_url="http://localhost:11434",
        temperature=0.1
    )
    prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based strictly on the provided context.
If the answer is not in the context, say "I don't have enough information in the provided
documents to answer this." Do not make up information.

Context:
{context}

Question: {question}

Answer:""")

    def format_docs(docs):
        formatted = []
        for doc in docs:
            src = doc.metadata.get("source_file", "unknown")
            formatted.append(f"[Source: {src}]\n{doc.page_content}")
        return "\n\n---\n\n".join(formatted)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain, retriever


def ask(question: str, rag_chain, retriever):
    print(f"\nQuestion: {question}")
    retrieved_docs = retriever.invoke(question)
    for i, doc in enumerate(retrieved_docs, 1):
        src = doc.metadata.get("source_file", "unknown")
        print(f"  [{i}] {src}: {doc.page_content[:80].replace(chr(10),' ')}...")
    answer = rag_chain.invoke(question)
    print(f"\nAnswer: {answer}")
    return answer


if __name__ == "__main__":
    rag_chain, retriever = build_rag_chain()
    questions = [
        "What are the main data retention policies described in the documents?",
        "Who is responsible for security incident reporting?",
        "What are the consequences of non-compliance?",
    ]
    for q in questions:
        ask(q, rag_chain, retriever)
        print("\n" + "="*60)
bash
python query.py
Why the system prompt matters: The instruction "answer only based on the provided context" is what prevents hallucination. Test explicitly by asking about something not in your documents — the model should say it doesn't know, not invent an answer.
5
Interactive Chat Interface
~10 min

For interactive use, here's a simple command-line loop with a sources command — deliberately included so you can audit which documents informed each answer, for both compliance and debugging:

python — chat.py
# chat.py
from langchain_core.messages import HumanMessage, AIMessage
from query import build_rag_chain

def chat_with_documents():
    rag_chain, retriever = build_rag_chain()
    history = []
    last_docs = []

    print("RAG Chat — ask questions about your documents.")
    print("Type 'quit' to exit, 'sources' to show last retrieved documents.\n")

    while True:
        question = input("You: ").strip()
        if question.lower() == "quit": break
        if question.lower() == "sources":
            if last_docs:
                for doc in last_docs:
                    print(f"  - {doc.metadata.get('source_file', 'unknown')}")
                    print(f"    {doc.page_content[:150]}...")
            continue
        if not question: continue

        last_docs = retriever.invoke(question)
        answer = rag_chain.invoke(question)
        print(f"\nAssistant: {answer}\n")
        # History is collected but not yet fed back into the chain — groundwork for multi-turn support
        history.extend([HumanMessage(content=question), AIMessage(content=answer)])

if __name__ == "__main__":
    chat_with_documents()
bash
python chat.py

# You: What does the contract say about liability?
# Assistant: Based on the contract documents, liability is limited to...
# You: sources
#   - contract_v3.pdf
#     ...section 8.2 regarding limitation of liability...
6
Adding New Documents Without Rebuilding
~5 min

You don't need to re-index everything when you add new documents. Create a separate script that adds to the existing Qdrant collection:

python — add_documents.py
# add_documents.py
import sys
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from index_documents import EMBEDDING_MODEL, QDRANT_URL, COLLECTION_NAME, CHUNK_SIZE, CHUNK_OVERLAP

def add_document(filepath: str):
    path = Path(filepath)
    loader = PyPDFLoader(filepath) if path.suffix.lower() == ".pdf" else TextLoader(filepath, encoding="utf-8")
    documents = loader.load()
    for doc in documents: doc.metadata["source_file"] = path.name

    chunks = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    ).split_documents(documents)

    embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url="http://localhost:11434")
    client = QdrantClient(url=QDRANT_URL)

    # add_documents on an existing QdrantVectorStore instance — no recreate
    vectorstore = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME, embedding=embeddings)
    vectorstore.add_documents(chunks)
    print(f"Added {len(chunks)} chunks from {path.name}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python add_documents.py path/to/document.pdf")
    else: add_document(sys.argv[1])
bash
python add_documents.py new_policy_document.pdf
# Added 47 chunks from new_policy_document.pdf

The new document is immediately searchable. No downtime, no full reindex.

Practical Considerations

Chunk size tuning is the single biggest lever for RAG quality. Smaller chunks (500) give more precise retrieval but may lack context. Larger chunks (2000) provide more context per result but reduce precision. Start at 1000 and experiment.

TOP_K (retrieved chunks per query) affects quality and prompt length. More chunks = more context but more tokens and more noise. Start with 4, experiment with 3–6.
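Conceptually, top-K retrieval is just "score every chunk, keep the K best." A minimal sketch, with made-up similarity scores standing in for real cosine values:

```python
def top_k(scored_chunks: list[tuple[str, float]], k: int = 4) -> list[str]:
    """Return the k chunk texts with the highest similarity scores."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, _score in ranked[:k]]

# Hypothetical scores for the query "vacation policy"
scores = [
    ("annual leave is 25 days", 0.91),
    ("expense reports are due monthly", 0.34),
    ("carry-over of unused leave", 0.88),
    ("office wifi password policy", 0.12),
    ("parental leave provisions", 0.79),
]
print(top_k(scores, k=3))
# ['annual leave is 25 days', 'carry-over of unused leave', 'parental leave provisions']
```

In practice Qdrant doesn't literally score every vector — it uses an approximate index (HNSW) to find the nearest neighbors efficiently — but the contract is the same: the K chunks most similar to the query reach the prompt, everything else stays out.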

Embedding model consistency is mandatory. The model used during indexing and during query must be identical. If you switch embedding models, you must re-index your entire collection — vectors from different models are not comparable.

The system prompt instruction to "answer only based on the provided context" is what prevents the model from mixing retrieved content with training data in ways that produce confident-sounding hallucinations.

Complete Docker Compose Setup

Combining all three articles, here's a docker-compose.yaml that runs the full stack as a persistent local service. You need two files in the same folder: this docker-compose.yaml and the litellm_config.yaml from Article #2:

yaml — docker-compose.yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  qdrant:
    image: qdrant/qdrant
    volumes:
      - qdrant_storage:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY:-not_set}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-not_set}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_models:
  qdrant_storage:
bash — start the full stack
docker compose up -d

# Pull models once
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

What You've Built

Across three articles you've assembled a complete local AI data stack:

Every component — embedding, vector search, language model inference — runs locally. The data flow is end-to-end private and verifiable, not just a policy claim.

Common Issues

ImportError: cannot import name 'Qdrant' from 'langchain_community.vectorstores'
The old Qdrant class was deprecated and removed. Install the correct package: pip install langchain-qdrant. Then use from langchain_qdrant import QdrantVectorStore — that's what all code in this article uses.
ImportError: cannot import name 'RunnablePassthrough' from 'langchain.schema.runnable'
Import paths moved to langchain_core. Use: from langchain_core.runnables import RunnablePassthrough, from langchain_core.output_parsers import StrOutputParser, from langchain_core.prompts import ChatPromptTemplate.
Qdrant connection refused on port 6333
The Qdrant container isn't running. Run docker ps to check. If it's missing, restart with docker run -d --name qdrant -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant. If it stopped unexpectedly: docker start qdrant.
Collection not found — but I indexed documents earlier
You ran Qdrant without the -v volume flag. Without persistent storage, the entire collection disappears when the container restarts. Re-run the indexing script and always use the volume mount.
Embedding is very slow (minutes per document)
Embedding runs on your local hardware. On CPU, expect 50–200 chunks per minute. For large document collections, let it run overnight. On Apple Silicon or with an NVIDIA GPU, Ollama will use the accelerator automatically and embedding will be significantly faster.
RAG answers are hallucinated or irrelevant
Two separate problems: (1) Hallucination — ensure the system prompt says "answer only based on the provided context." (2) Irrelevant chunks retrieved — experiment with smaller chunk sizes (500–800) and verify that embeddings are from the same model used during indexing.