Your First Local LLM —
A Complete Setup Guide
Every prompt you send to ChatGPT or Claude travels over the internet, gets logged on someone else's servers, and potentially contributes to future model training. For anything involving real data, that's a problem worth solving.
Why Would You Even Do This?
The case for running a language model locally isn't about being anti-cloud. It's about control.
When you use a hosted AI service, you're accepting an implicit deal: you get access to a powerful model, and in exchange, your inputs pass through infrastructure you don't own or control. Most providers have terms of service that limit training on your data, but "limited" and "zero" are different things. Audit trails are rarely available. Data residency is often vague. And for companies operating under GDPR, EU AI Act, DORA, or sector-specific regulations, "our vendor promises not to look" isn't a compliance posture.
Running a model locally changes the math entirely. The model file lives on your hardware. Inference happens on your CPU or GPU. Nothing leaves your machine — not the prompt, not the response, not the metadata about what you asked. There's no API to go down, no rate limits, no per-token cost. It works on a train with no internet. It works in an air-gapped environment. It works at 3am when OpenAI has an outage.
The honest trade-off: local models are not GPT-4. The best open-source models available today are genuinely impressive, but there is still a capability gap at the frontier. For the majority of real-world use cases — summarization, classification, code explanation, Q&A over provided documents, drafting text — a well-chosen local model is good enough, and the privacy guarantee is worth more than the marginal quality difference.
What "Local" Actually Means
A language model is fundamentally a large file containing billions of numerical parameters called weights. When you run inference (send a prompt and get a response), you're loading those weights into memory and performing a very large matrix multiplication repeatedly.
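That operation can be sketched in a few lines. Here is a toy single "layer" with made-up numbers, purely to show the shape of the computation; real models chain dozens of layers over billions of weights, but each step is the same weights-times-activations pattern:

```python
def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by an activation vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

# A made-up 3x3 weight matrix and a 3-element activation vector:
W = [[0.5, -1.0, 0.0],
     [2.0,  0.0, 1.0],
     [0.0,  1.0, 1.0]]
x = [1.0, 2.0, 3.0]

print(matvec(W, x))  # one layer's worth of computation: [-1.5, 5.0, 5.0]
```

Inference repeats this kind of product, layer after layer, once per generated token, which is why memory bandwidth and parallel hardware matter so much.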
The file format you'll encounter most often is GGUF — a compressed format optimized for consumer hardware. GGUF files use a technique called quantization: instead of storing each weight as a 32-bit float, they're compressed to 4-bit or 8-bit integers, with a small but usually acceptable quality trade-off. This allows a model that would otherwise require 40GB to run in 5GB. You'll see this as q4_0 or q8_0 in model names — lower number = more compression.
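The file sizes follow from simple arithmetic: parameter count times bits per weight. A rough sketch (real GGUF files add metadata and mix precisions per layer, so treat these as ballpark figures):

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight-file size: parameter count x bits, in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# An 8-billion-parameter model at common precision levels:
for bits, tag in [(32, "fp32"), (16, "fp16"), (8, "q8_0"), (4, "q4_0")]:
    print(f"{tag:>5}: ~{model_size_gb(8, bits):.0f} GB")
```

This is why an 8B model that would need 32 GB at full 32-bit precision fits comfortably on a 16 GB laptop once quantized to 4 bits.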
The critical thing to understand: once that GGUF file is on your disk, the model is entirely self-contained. No license server, no telemetry, no cloud dependency. It's just math running on your hardware.
Nothing exits this chain. All computation is local.
Hardware Reality Check
Most tutorials skip this section and leave readers confused when things don't work. Let's be direct.
Mac users: Apple Silicon (M1–M4) is excellent for local AI. Unified memory means the full RAM pool is available to the GPU, so a 32GB MacBook Pro can use all 32GB for model weights. If you have 16GB+, you're in good shape.
Linux + NVIDIA GPU: This is the fastest path. An RTX 3080 (10GB) or RTX 4070 (12GB) will run 7–13B models with excellent performance. Ollama auto-detects your GPU.
Everyone else: 16GB RAM on CPU works — expect 5–15 tokens/sec instead of 50+. Slower, but functional for most tasks.
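To put those rates in context, here is a quick back-of-the-envelope conversion, assuming the common rule of thumb of about 0.75 English words per token:

```python
def words_per_minute(tokens_per_sec: float, words_per_token: float = 0.75) -> float:
    """Convert generation speed into approximate words per minute."""
    return tokens_per_sec * words_per_token * 60

# CPU-only low end, CPU-only high end, and a typical GPU rate:
for tps in (5, 15, 50):
    print(f"{tps:>3} tok/s ≈ {words_per_minute(tps):.0f} words/min")
```

People read at roughly 200–300 words per minute, so even 5 tokens/sec keeps pace with reading speed, and GPU rates are effectively instant for short answers.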
One note on Windows: everything below works, but you'll have a smoother experience using WSL2 (Windows Subsystem for Linux). The terminal tooling around local AI is significantly more mature in Linux environments.
Installing Ollama

Ollama handles model downloading, version management, quantization selection, and GPU detection, and automatically runs a local REST API server. Think of it as Docker for language models.
# macOS:
brew install ollama
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Then start the service:
ollama serve # leave this terminal open, or configure as systemd
On macOS with Homebrew, you need to start the server with ollama serve manually. If you installed via the .dmg from ollama.com, launch the app from your Applications folder instead.
Verify it's running in a new terminal:
ollama --version
# ollama version is 0.x.x
Ollama stores downloaded models in ~/.ollama/models/. It then exposes a REST API on localhost:11434. After the initial download, no internet connection is needed.
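Before building anything against that API, a quick programmatic liveness check can save confusion. A minimal stdlib-only probe, assuming the default port; Ollama's root endpoint answers 200 "Ollama is running" when the server is up:

```python
import urllib.request
import urllib.error

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on its default port."""
    try:
        # Any connection error (server not started, wrong port) means "down".
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama reachable:", ollama_is_up())
```

Dropping a check like this at the top of any script that talks to the local API turns a cryptic connection error into an actionable message.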
Choosing a Model

New model versions appear constantly; check ollama.com/library for the current version of a model family before pulling.
Simple recommendation by hardware:
- 16GB RAM or M1/M2 16GB → llama3.1:8b
- 8GB RAM → phi4-mini
- 32GB+ or M2 Pro/Max → llama3.1:8b (then try larger)
- NVIDIA GPU 12GB+ → llama3.1:8b (auto GPU acceleration)
- Primary use case: code → qwen3:8b or deepseek-r1:7b
ollama pull llama3.1:8b # ~5GB download, one-time
ollama run llama3.1:8b
You'll get a prompt. Try something real:
>>> Summarize the key differences between GDPR and the EU AI Act in 3 paragraphs.
>>> Write a Python function that reads a CSV file and returns the top 10 rows by a column.
>>> Find all users in my PostgreSQL users table who haven't logged in for 90 days.
>>> /bye
Immediately verify GPU usage — many people run on CPU without realising it and think local AI is just slow:
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b ... 5.0 GB 100% GPU 4 minutes from now
If PROCESSOR shows CPU and you expected GPU: on Apple Silicon the model may be too large for your unified memory (try phi4-mini); on NVIDIA, check that drivers are installed and Ollama detected your GPU at startup.
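If you want to automate this check in a setup-verification script, the ps output is easy to parse. A sketch that assumes the column layout shown above; the sample string is illustrative, not real output:

```python
def runs_on_gpu(ps_output: str, model: str) -> bool:
    """Return True if `ollama ps` reports the model running fully on GPU.

    Treats any CPU share (e.g. a split like '48%/52% CPU/GPU') as not
    fully offloaded.
    """
    for line in ps_output.splitlines():
        if line.startswith(model):
            return "GPU" in line and "CPU" not in line
    raise ValueError(f"{model} is not loaded")

# Illustrative sample, not captured output:
sample = (
    "NAME         ID      SIZE    PROCESSOR  UNTIL\n"
    "llama3.1:8b  abc123  5.0 GB  100% GPU   4 minutes from now\n"
)
print(runs_on_gpu(sample, "llama3.1:8b"))  # True
```

In a real script you would feed it subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout instead of a hard-coded sample.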
ollama list # see downloaded models
ollama show llama3.1:8b # model details, context length
ollama rm llama3.1:8b # remove to free disk space
The chat interface is useful for experimenting. The REST API is what makes this genuinely powerful.
Ollama automatically exposes an OpenAI-compatible API on localhost:11434. Any tool, library, or application built for OpenAI's API works with your local model by changing one configuration line — the base URL. Nothing else.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "What is data sovereignty?"}]
}'
For scripting and real applications, here's the same call in Python using the OpenAI SDK — which works without modification because the API format is identical:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required by client, ignored by Ollama
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize zero-trust security in 3 points."}
]
)
print(response.choices[0].message.content)
If you have existing code using the OpenAI SDK, switching to local is two lines:
# Before (OpenAI cloud):
client = OpenAI(api_key="sk-...")
# After (local Ollama — everything else stays identical):
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
Here's a complete script that reads a local document and summarizes it — entirely locally, no data leaving your machine. First, if you're working with PDFs, convert them to text:
# macOS:
brew install poppler
# Linux (Ubuntu/Debian):
sudo apt install poppler-utils
# Convert PDF to text:
pdftotext internal_report.pdf internal_report.txt
# summarize.py — run: python summarize.py contract.txt
from openai import OpenAI
import sys

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def summarize_document(filepath: str) -> str:
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    # Keep the prompt within the model's context window.
    max_chars = 12000
    if len(content) > max_chars:
        content = content[:max_chars] + "\n\n[Truncated — see article #3 on RAG for long docs]"

    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content": "You are a document analyst. Summarize concisely, covering main points, key decisions, and notable details."},
            {"role": "user", "content": f"Summarize this document:\n\n{content}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize_document(sys.argv[1] if len(sys.argv) > 1 else "document.txt"))
Honest Assessment: What Local Models Are Good At
Setting accurate expectations is more useful than cheerleading.
- Summarization of documents you provide
- Code explanation and basic generation
- Drafting text from structured input
- Classification and extraction tasks
- Q&A with context provided in prompt
- Translation for major languages
Where they fall short:
- Complex multi-step reasoning
- Highly nuanced creative prose
- Mathematical reasoning (beyond algebra)
- Up-to-date world knowledge (limited by the training cutoff)
- Very long documents without RAG
For 60–70% of realistic daily AI tasks in a business context, a well-configured local 8B model produces output that is good enough. The remaining 30–40% — tasks that genuinely benefit from GPT-4 class capability — are usually also tasks where external API usage is acceptable or data can be sanitised before sending. That routing decision is exactly what the next article covers.
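That routing decision can start as something very small. A deliberately naive sketch, where the model names, URLs, and criteria are placeholders rather than recommendations:

```python
def choose_endpoint(contains_sensitive_data: bool, needs_frontier_quality: bool) -> dict:
    """Pick local vs. cloud inference based on two simple flags."""
    if contains_sensitive_data:
        # Sensitive data never leaves the machine, even at a quality cost.
        return {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}
    if needs_frontier_quality:
        return {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"}
    # Default to local: free, private, and good enough for routine tasks.
    return {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}

print(choose_endpoint(True, True)["base_url"])  # sensitive data wins: local
```

Because both endpoints speak the same OpenAI-compatible API, the returned dict can be passed straight into the client constructor.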
Troubleshooting Common Issues
Model runs on CPU when you expected GPU: run ollama ps and look at the PROCESSOR column. If it shows CPU: on Apple Silicon, the model is likely too large for unified memory (try phi4-mini). On Linux/Windows, verify NVIDIA drivers are installed and Ollama detected your GPU at startup.
"model not found" error: the model hasn't been downloaded yet; run ollama pull llama3.1:8b first. Model names are case-sensitive and must exactly match the Ollama library name.
Out of memory: switch to a smaller model such as phi4-mini, or pull a more aggressively quantized tag from the model's page on ollama.com/library; lower-bit variants use less memory at some quality cost.
"address already in use" when starting ollama serve: the server is already running (as a service or the desktop app), so there's no need to start it again; use ollama run directly.
Starting Ollama at boot on Linux: sudo systemctl enable --now ollama starts it now and enables it as a service.
What You Have Now
You have a working local LLM that runs entirely on your hardware, costs nothing per token, exposes an OpenAI-compatible API on localhost, handles your data without it leaving your machine, and works offline. That's a real foundation.
But the moment you try to use this seriously — multiple models for different tasks, teammates sharing the same setup, switching between local and cloud based on data sensitivity, tracking what's actually being used — you've outgrown talking directly to Ollama. Those are infrastructure problems, and they need a gateway layer.