Local Sovereign AI · Article #1

Your First Local LLM —
A Complete Setup Guide

Every prompt you send to ChatGPT or Claude travels over the internet, gets logged on someone else's servers, and potentially contributes to future model training. For anything involving real data, that's a problem worth solving.

~30 min total · Beginner–Intermediate · Zero data leaves your machine

Time breakdown: Install Ollama ~5 min · Model download ~15 min · First run + API ~10 min.
The majority of the time is waiting for the model file to download.

Why Would You Even Do This?

The case for running a language model locally isn't about being anti-cloud. It's about control.

When you use a hosted AI service, you're accepting an implicit deal: you get access to a powerful model, and in exchange, your inputs pass through infrastructure you don't own or control. Most providers have terms of service that limit training on your data, but "limited" and "zero" are different things. Audit trails don't exist. Data residency is often vague. And for companies operating under GDPR, EU AI Act, DORA, or sector-specific regulations, "our vendor promises not to look" isn't a compliance posture.

Running a model locally changes the math entirely. The model file lives on your hardware. Inference happens on your CPU or GPU. Nothing leaves your machine — not the prompt, not the response, not the metadata about what you asked. There's no API to go down, no rate limits, no per-token cost. It works on a train with no internet. It works in an air-gapped environment. It works at 3am when OpenAI has an outage.

Cloud API vs. Local Inference — Where Does Your Data Go?
CLOUD API: Your app sends prompt + data over the internet to servers where it is logged. Data leaves your perimeter · Cost per token · Rate limits · Outages · Residency unclear · Audit trail: none.

LOCAL INFERENCE: Your app sends prompt + data to localhost:11434. Data stays on your hardware · Zero cost per token · No rate limits · Works offline · Full audit control.

The honest trade-off: local models are not GPT-4. The best open-source models available today are genuinely impressive, but there is still a capability gap at the frontier. For the majority of real-world use cases — summarization, classification, code explanation, Q&A over provided documents, drafting text — a well-chosen local model is good enough, and the privacy guarantee is worth more than the marginal quality difference.

What "Local" Actually Means

A language model is fundamentally a large file containing billions of numerical parameters called weights. When you run inference (send a prompt and get a response), you're loading those weights into memory and performing a very large matrix multiplication repeatedly.

The file format you'll encounter most often is GGUF — a compressed format optimized for consumer hardware. GGUF files use a technique called quantization: instead of storing each weight as a 32-bit float, they're compressed to 4-bit or 8-bit integers, with a small but usually acceptable quality trade-off. This allows a model that would otherwise require 40GB to run in 5GB. You'll see this as q4_0 or q8_0 in model names — lower number = more compression.
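The 40GB-to-5GB claim is just arithmetic on parameter count and bit width. A back-of-envelope sketch (real GGUF files add a few percent of format overhead on top of this):

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate on-disk size of the weights alone: params × bits / 8 bytes each."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 10B-parameter model:
#   fp32 (32-bit) → 40 GB    q8_0 (8-bit) → ~10 GB    q4_0 (4-bit) → ~5 GB
```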

The critical thing to understand: once that GGUF file is on your disk, the model is entirely self-contained. No license server, no telemetry, no cloud dependency. It's just math running on your hardware.

How Ollama Works — Architecture

Your Code (Python / curl)
        │ HTTP
        ▼
Ollama Server (localhost:11434)
        │ loads
        ▼
GGUF Model (~/.ollama/models)
        │ inference
        ▼
CPU / GPU (your hardware)

Nothing exits this chain. All computation is local.

Hardware Reality Check

Most tutorials skip this section and leave readers confused when things don't work. Let's be direct.

Mac users: Apple Silicon (M1–M4) is excellent for local AI. Unified memory means the full RAM pool is available to the GPU, so a 32GB MacBook Pro can use all 32GB for model weights. If you have 16GB+, you're in good shape.

Linux + NVIDIA GPU: This is the fastest path. An RTX 3080 (10GB) or RTX 4070 (12GB) will run 7–13B models with excellent performance. Ollama auto-detects your GPU.

Everyone else: 16GB RAM on CPU works — expect 5–15 tokens/sec instead of 50+. Slower, but functional for most tasks.
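To make those speeds concrete, here is what they mean in waiting time; the 500-token response length is an illustrative assumption, roughly two long paragraphs:

```python
def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to stream a full response at a given generation speed."""
    return tokens / tokens_per_sec

# A 500-token answer: ~50 s at CPU speed (10 tok/s), ~10 s on a GPU (50 tok/s)
```

Slower, but for batch jobs like overnight summarization runs, CPU speed is rarely the bottleneck.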

Hardware → Model Size Quick Reference

Hardware             Model size   Example            Experience
8 GB RAM             3–4B         phi4-mini          Slow, works
16 GB RAM            Up to 8B     llama3.3:8b        Usable
32 GB / M1 16GB      Up to 13B    llama3.3:13b       Good
M2/M3 32 GB+         Up to 34B    llama3.3:34b       Very good
NVIDIA 8 GB VRAM     Up to 8B     GPU accelerated    Fast
NVIDIA 24 GB+        Up to 34B    GPU accelerated    Excellent

One note on Windows: everything below works, but you'll have a smoother experience using WSL2 (Windows Subsystem for Linux). The terminal tooling around local AI is significantly more mature in Linux environments.

Step 1 · Install Ollama (~5 min)

Ollama handles model downloading, version management, quantization selection, GPU detection, and automatically runs a local REST API server. Think of it as Docker for language models.

macOS (Homebrew):

bash
brew install ollama

On macOS with Homebrew, Ollama starts automatically as a menu bar app — do not run ollama serve manually. If you installed via the .dmg from ollama.com, launch the app from your Applications folder instead.

Linux:

bash
curl -fsSL https://ollama.com/install.sh | sh
# Then start the service:
ollama serve   # leave this terminal open, or configure it as a systemd service

Windows:

Download the installer from ollama.com. Recommended: install WSL2 first (Ubuntu via the Microsoft Store), then use the Linux command inside WSL2 — most local AI tooling works more reliably in Linux environments.

Verify it's running in a new terminal:

bash
ollama --version
# ollama version is 0.x.x
What Ollama is actually doing: When you pull a model, Ollama downloads the GGUF file from its registry (like Docker Hub) and stores it in ~/.ollama/models/. It then exposes a REST API on localhost:11434. After the initial download, no internet connection is needed.
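You can also confirm the API server itself is reachable from code. A minimal sketch using only the standard library; it assumes the default port 11434 and that Ollama's root endpoint answers with HTTP 200 when the server is up:

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on its default port."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Useful as a guard at the top of any script from the later steps, so failures say "Ollama is not running" instead of a raw connection error.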
Step 2 · Choose Your First Model (~2 min decision)
A note on versions: The open-source model ecosystem moves fast — new releases appear weekly. The families below are stable recommendations; specific version numbers change. Always check ollama.com/library for the current version before pulling.
🦙 Llama 3.3 (Meta) · General
Best all-around general-purpose model. Strong instruction following, 128K context. The 8B version is the sweet spot for most hardware.

Mistral Small 3 (Mistral AI) · Fast
Fastest tokens/sec in the 7B class on consumer hardware. Excellent structured output. Apache 2.0 — cleanest commercial license.

🧑‍💻 Qwen 3 (Alibaba) · Code
Strongest on coding benchmarks in the 7B class. Best multilingual support. Recommended for non-English content or code-heavy workflows.

🔬 Phi-4-mini (Microsoft) · Small
3.8B parameters, competes with many 7B models on reasoning. MIT license. Best choice for 8GB RAM machines or when you need instant load times.

🧠 DeepSeek-R1 (DeepSeek) · Reasoning
Purpose-built for complex reasoning, debugging, step-by-step problem solving. The 32B version on a 24GB GPU genuinely rivals GPT-4 on technical tasks.

Simple recommendation by hardware:

quick reference
16GB RAM or M1/M2 16GB  →  llama3.3:8b
8GB RAM                 →  phi4-mini
32GB+ or M2 Pro/Max     →  llama3.3:8b  (then try larger)
NVIDIA GPU 12GB+        →  llama3.3:8b  (auto GPU acceleration)
Primary use case: code  →  qwen3:8b or deepseek-r1:7b
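If you want that default baked into a script, the quick reference translates directly into a helper; the thresholds and model tags below simply mirror the table above and are not official guidance:

```python
def recommend_model(ram_gb: int, use_case: str = "general") -> str:
    """Pick a default model tag from the article's quick-reference table."""
    if use_case == "code":
        return "qwen3:8b"        # strongest coding benchmarks in its class
    if ram_gb < 16:
        return "phi4-mini"       # small enough for 8 GB machines
    return "llama3.3:8b"         # general-purpose sweet spot
```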
Step 3 · Pull and Run Your First Model (~15 min download)
bash
ollama pull llama3.3:8b   # ~5GB download, one-time
ollama run llama3.3:8b

You'll get a prompt. Try something real:

example prompts
>>> Summarize the key differences between GDPR and the EU AI Act in 3 paragraphs.
>>> Write a Python function that reads a CSV file and returns the top 10 rows by a column.
>>> Find all users in my PostgreSQL users table who haven't logged in for 90 days.
>>> /bye

Immediately verify GPU usage — many people run on CPU without realizing it and think local AI is just slow:

bash
ollama ps
# NAME              ID    SIZE      PROCESSOR    UNTIL
# llama3.3:8b       ...   5.0 GB    100% GPU     4 minutes from now

If PROCESSOR shows CPU and you expected GPU: on Apple Silicon the model may be too large for your unified memory (try phi4-mini); on NVIDIA, check that drivers are installed and Ollama detected your GPU at startup.
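For scripted health checks you can test that output programmatically. A small sketch that looks for GPU in the PROCESSOR column; it assumes the ollama ps layout shown above, which may change between versions:

```python
def uses_gpu(ps_output: str) -> bool:
    """True if any loaded model's PROCESSOR column mentions GPU."""
    rows = ps_output.splitlines()[1:]          # skip the header row
    return any("GPU" in row for row in rows)

sample = (
    "NAME          ID    SIZE      PROCESSOR    UNTIL\n"
    "llama3.3:8b   abc   5.0 GB    100% GPU     4 minutes from now"
)
```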

bash — useful commands
ollama list              # see downloaded models
ollama show llama3.3:8b  # model details, context length
ollama rm llama3.3:8b    # remove to free disk space
Step 4 · The Local API (~10 min)

The chat interface is useful for experimenting. The REST API is what makes this genuinely powerful.

Ollama automatically exposes an OpenAI-compatible API on localhost:11434. Any tool, library, or application built for OpenAI's API works with your local model by changing one configuration line — the base URL. Nothing else.

bash — verify with curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:8b",
    "messages": [{"role": "user", "content": "What is data sovereignty?"}]
  }'

For scripting and real applications, here's the same call in Python using the OpenAI SDK — which works without modification because the API format is identical:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.3:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize zero-trust security in 3 points."}
    ]
)
print(response.choices[0].message.content)

If you have existing code using the OpenAI SDK, switching to local is two lines:

python — migration
# Before (OpenAI cloud):
client = OpenAI(api_key="sk-...")

# After (local Ollama — everything else stays identical):
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
The architectural insight: The API is the interface, not the vendor. The same application code can route to different models depending on context, data sensitivity, or cost — without being rewritten. This is the foundation the rest of this series builds on.
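A minimal sketch of that routing idea (the cloud model name and the environment-variable convention are illustrative assumptions, not part of any fixed API):

```python
import os

def pick_backend(sensitive: bool) -> dict:
    """Route sensitive work to local Ollama; everything else to a cloud API."""
    if sensitive:
        return {"base_url": "http://localhost:11434/v1",
                "api_key": "ollama",          # required by the SDK, ignored by Ollama
                "model": "llama3.3:8b"}
    return {"base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
            "model": "gpt-4o"}                # illustrative cloud default
```

The returned dict plugs straight into the client from this step: client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]).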
Step 5 · Practical Test: Document Summarization (~5 min)

Here's a complete script that reads a local document and summarizes it — entirely locally, no data leaving your machine. First, if you're working with PDFs, convert them to text:

bash — PDF conversion
# macOS:
brew install poppler
# Linux (Ubuntu/Debian):
sudo apt install poppler-utils

# Convert PDF to text:
pdftotext internal_report.pdf internal_report.txt
python — summarize.py
# summarize.py — run: python summarize.py contract.txt
from openai import OpenAI
import sys

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def summarize_document(filepath: str) -> str:
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    max_chars = 12000
    if len(content) > max_chars:
        content = content[:max_chars] + "\n\n[Truncated — see article #3 on RAG for long docs]"

    response = client.chat.completions.create(
        model="llama3.3:8b",
        messages=[
            {"role": "system", "content": "You are a document analyst. Summarize concisely, covering main points, key decisions, and notable details."},
            {"role": "user", "content": f"Summarize this document:\n\n{content}"}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize_document(sys.argv[1] if len(sys.argv) > 1 else "document.txt"))

Honest Assessment: What Local Models Are Good At

Setting accurate expectations is more useful than cheerleading.

✓ Handles well
  • Summarization of documents you provide
  • Code explanation and basic generation
  • Drafting text from structured input
  • Classification and extraction tasks
  • Q&A with context provided in prompt
  • Translation for major languages
△ Still lags behind frontier
  • Complex multi-step reasoning
  • Highly nuanced creative prose
  • Mathematical reasoning (beyond algebra)
  • Up-to-date world knowledge (cutoff date)
  • Very long documents without RAG

For 60–70% of realistic daily AI tasks in a business context, a well-configured local 8B model produces output that is good enough. The remaining 30–40% — tasks that genuinely benefit from GPT-4-class capability — are usually also tasks where external API usage is acceptable or data can be sanitized before sending. That routing decision is exactly what the next article covers.

Troubleshooting Common Issues

Model runs but output is very slow (under 3 tokens/sec)
Check ollama ps and look at the PROCESSOR column. If it shows CPU: on Apple Silicon, the model is likely too large for unified memory — try phi4-mini. On Linux/Windows, verify NVIDIA drivers are installed and Ollama detected your GPU at startup.
"Error: model not found"
Pull explicitly before running: ollama pull llama3.3:8b. Model names are case-sensitive and must exactly match the Ollama library name.
Out of memory / process crashes mid-generation
The model is too large for your RAM. Try a more aggressively quantized variant: ollama pull llama3.3:8b-q4_0 uses less memory than the default at some quality cost.
macOS: "address already in use" when running ollama serve
With the Homebrew install, Ollama is already running as a menu bar app. You don't need ollama serve — use ollama run directly.
API returns connection refused on port 11434
Ollama isn't running. On macOS, check the menu bar icon. On Linux: sudo systemctl enable --now ollama to start and enable as a service.
Windows: various terminal and path issues
Switch to WSL2. Install Ubuntu via the Microsoft Store, then use the Linux install command inside WSL2. This resolves the vast majority of Windows-specific friction with local AI tooling.
Model gives incoherent or very short responses
Add a system prompt that clearly defines the task. Local models are more sensitive to prompt quality than frontier models — being explicit about the task format and expected output length improves results significantly.
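Concretely, "being explicit" can look like the difference between these two message lists; both are illustrative examples, not required templates:

```python
# Vague: small models often answer this with a rambling or one-line reply.
vague = [{"role": "user", "content": "Tell me about this log file."}]

# Explicit: defines role, output format, and length up front.
explicit = [
    {"role": "system",
     "content": "You are a log analyst. Answer in exactly 3 bullet points, "
                "each under 20 words, focusing on errors and their causes."},
    {"role": "user", "content": "Analyze this log file: <log contents here>"},
]
```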

What You Have Now

You have a working local LLM that runs entirely on your hardware, costs nothing per token, exposes an OpenAI-compatible API on localhost, handles your data without it leaving your machine, and works offline. That's a real foundation.

But the moment you try to use this seriously — multiple models for different tasks, teammates sharing the same setup, switching between local and cloud based on data sensitivity, tracking what's actually being used — you've outgrown talking directly to Ollama. Those are infrastructure problems, and they need a gateway layer.