Your First Local LLM —
A Complete Setup Guide
Every prompt you send to ChatGPT or Claude travels over the internet, gets logged on someone else's servers, and potentially contributes to future model training. For anything involving real data, that's a problem worth solving.
Why Would You Even Do This?
The case for running a language model locally isn't about being anti-cloud. It's about control.
When you use a hosted AI service, you're accepting an implicit deal: you get access to a powerful model, and in exchange, your inputs pass through infrastructure you don't own or control. Most providers have terms of service that limit training on your data, but "limited" and "zero" are different things. Audit trails are rarely available. Data residency is often vague. And for companies operating under GDPR, EU AI Act, DORA, or sector-specific regulations, "our vendor promises not to look" isn't a compliance posture.
Running a model locally changes the math entirely. The model file lives on your hardware. Inference happens on your CPU or GPU. Nothing leaves your machine — not the prompt, not the response, not the metadata about what you asked. There's no API to go down, no rate limits, no per-token cost. It works on a train with no internet. It works in an air-gapped environment. It works at 3am when OpenAI has an outage.
The honest trade-off: local models are not GPT-4. The best open-source models available today are genuinely impressive, but there is still a capability gap at the frontier. For the majority of real-world use cases — summarization, classification, code explanation, Q&A over provided documents, drafting text — a well-chosen local model is good enough, and the privacy guarantee is worth more than the marginal quality difference.
What "Local" Actually Means
A language model is fundamentally a large file containing billions of numerical parameters called weights. When you run inference (send a prompt and get a response), you're loading those weights into memory and performing a very large matrix multiplication repeatedly.
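That operation can be sketched in a few lines. Here is a toy single "layer" with made-up numbers, purely to show the shape of the computation; real models chain dozens of layers over billions of weights, but each step is the same weights-times-activations pattern:

```python
def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by an activation vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

# A made-up 3x3 weight matrix and a 3-element activation vector:
W = [[0.5, -1.0, 0.0],
     [2.0,  0.0, 1.0],
     [0.0,  1.0, 1.0]]
x = [1.0, 2.0, 3.0]

print(matvec(W, x))  # one layer's worth of computation: [-1.5, 5.0, 5.0]
```

Inference repeats this kind of product, layer after layer, once per generated token, which is why memory bandwidth and parallel hardware matter so much.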
The file format you'll encounter most often is GGUF — a compressed format optimized for consumer hardware. GGUF files use a technique called quantization: instead of storing each weight as a 32-bit float, they're compressed to 4-bit or 8-bit integers, with a small but usually acceptable quality trade-off. This allows a model that would otherwise require 40GB to run in 5GB. You'll see this as q4_0 or q8_0 in model names — lower number = more compression.
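The file sizes follow from simple arithmetic: parameter count times bits per weight. A rough sketch (real GGUF files add metadata and mix precisions per layer, so treat these as ballpark figures):

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight-file size: parameter count x bits, in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# An 8-billion-parameter model at common precision levels:
for bits, tag in [(32, "fp32"), (16, "fp16"), (8, "q8_0"), (4, "q4_0")]:
    print(f"{tag:>5}: ~{model_size_gb(8, bits):.0f} GB")
```

This is why an 8B model that would need 32 GB at full 32-bit precision fits comfortably on a 16 GB laptop once quantized to 4 bits.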
The critical thing to understand: once that GGUF file is on your disk, the model is entirely self-contained. No license server, no telemetry, no cloud dependency. It's just math running on your hardware.
Nothing exits this chain. All computation is local.
Hardware Reality Check
Most tutorials skip this section and leave readers confused when things don't work. Let's be direct.
Mac users: Apple Silicon (M1–M4) is excellent for local AI. Unified memory means the full RAM pool is available to the GPU, so a 32GB MacBook Pro can use all 32GB for model weights. If you have 16GB+, you're in good shape.
Linux + NVIDIA GPU: This is the fastest path. An RTX 3080 (10GB) or RTX 4070 (12GB) will run 7–13B models with excellent performance. Ollama auto-detects your GPU.
Everyone else: 16GB RAM on CPU works — expect 5–15 tokens/sec instead of 50+. Slower, but functional for most tasks.
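To put those rates in context, here is a quick back-of-the-envelope conversion, assuming the common rule of thumb of about 0.75 English words per token:

```python
def words_per_minute(tokens_per_sec: float, words_per_token: float = 0.75) -> float:
    """Convert generation speed into approximate words per minute."""
    return tokens_per_sec * words_per_token * 60

# CPU-only low end, CPU-only high end, and a typical GPU rate:
for tps in (5, 15, 50):
    print(f"{tps:>3} tok/s ≈ {words_per_minute(tps):.0f} words/min")
```

People read at roughly 200–300 words per minute, so even 5 tokens/sec keeps pace with reading speed, and GPU rates are effectively instant for short answers.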
One note on Windows: everything below works, but you'll have a smoother experience using WSL2 (Windows Subsystem for Linux). The terminal tooling around local AI is significantly more mature in Linux environments.
Installing Ollama

Ollama handles model downloading, version management, quantization selection, and GPU detection, and automatically runs a local REST API server. Think of it as Docker for language models.
# macOS:
brew install ollama
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Then start the service:
ollama serve # leave this terminal open, or configure as systemd
On macOS with Homebrew, you need to start the server with ollama serve manually. If you installed via the .dmg from ollama.com, launch the app from your Applications folder instead.
Verify it's running in a new terminal:
ollama --version
# ollama version is 0.x.x
Ollama stores downloaded models in ~/.ollama/models/. It then exposes a REST API on localhost:11434. After the initial download, no internet connection is needed.
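Before building anything against that API, a quick programmatic liveness check can save confusion. A minimal stdlib-only probe, assuming the default port; Ollama's root endpoint answers 200 "Ollama is running" when the server is up:

```python
import urllib.request
import urllib.error

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on its default port."""
    try:
        # Any connection error (server not started, wrong port) means "down".
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama reachable:", ollama_is_up())
```

Dropping a check like this at the top of any script that talks to the local API turns a cryptic connection error into an actionable message.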
Choosing a Model

New model versions appear constantly; check ollama.com/library for the current version of a model family before pulling.
Simple recommendation by hardware:
- 16GB RAM or M1/M2 16GB → llama3.1:8b
- 8GB RAM → phi4-mini
- 32GB+ or M2 Pro/Max → llama3.1:8b (then try larger)
- NVIDIA GPU 12GB+ → llama3.1:8b (auto GPU acceleration)
- Primary use case: code → qwen3:8b or deepseek-r1:7b
ollama pull llama3.1:8b # ~5GB download, one-time
ollama run llama3.1:8b
You'll get a prompt. Try something real:
>>> Summarize the key differences between GDPR and the EU AI Act in 3 paragraphs.
>>> Write a Python function that reads a CSV file and returns the top 10 rows by a column.
>>> Find all users in my PostgreSQL users table who haven't logged in for 90 days.
>>> /bye
Immediately verify GPU usage — many people run on CPU without realising it and think local AI is just slow:
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3.1:8b ... 5.0 GB 100% GPU 4 minutes from now
If PROCESSOR shows CPU and you expected GPU: on Apple Silicon the model may be too large for your unified memory (try phi4-mini); on NVIDIA, check that drivers are installed and Ollama detected your GPU at startup.
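If you want to automate this check in a setup-verification script, the ps output is easy to parse. A sketch that assumes the column layout shown above; the sample string is illustrative, not real output:

```python
def runs_on_gpu(ps_output: str, model: str) -> bool:
    """Return True if `ollama ps` reports the model running fully on GPU.

    Treats any CPU share (e.g. a split like '48%/52% CPU/GPU') as not
    fully offloaded.
    """
    for line in ps_output.splitlines():
        if line.startswith(model):
            return "GPU" in line and "CPU" not in line
    raise ValueError(f"{model} is not loaded")

# Illustrative sample, not captured output:
sample = (
    "NAME         ID      SIZE    PROCESSOR  UNTIL\n"
    "llama3.1:8b  abc123  5.0 GB  100% GPU   4 minutes from now\n"
)
print(runs_on_gpu(sample, "llama3.1:8b"))  # True
```

In a real script you would feed it subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout instead of a hard-coded sample.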
ollama list # see downloaded models
ollama show llama3.1:8b # model details, context length
ollama rm llama3.1:8b # remove to free disk space
The chat interface is useful for experimenting. The REST API is what makes this genuinely powerful.
Ollama automatically exposes an OpenAI-compatible API on localhost:11434. Any tool, library, or application built for OpenAI's API works with your local model by changing one configuration line — the base URL. Nothing else.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "What is data sovereignty?"}]
}'
For scripting and real applications, here's the same call in Python using the OpenAI SDK — which works without modification because the API format is identical:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required by client, ignored by Ollama
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize zero-trust security in 3 points."}
]
)
print(response.choices[0].message.content)
If you have existing code using the OpenAI SDK, switching to local is two lines:
# Before (OpenAI cloud):
client = OpenAI(api_key="sk-...")
# After (local Ollama — everything else stays identical):
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
Here's a complete script that reads a local document and summarizes it — entirely locally, no data leaving your machine. First, if you're working with PDFs, convert them to text:
# macOS:
brew install poppler
# Linux (Ubuntu/Debian):
sudo apt install poppler-utils
# Convert PDF to text:
pdftotext internal_report.pdf internal_report.txt
# summarize.py — run: python summarize.py contract.txt
from openai import OpenAI
import sys

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def summarize_document(filepath: str) -> str:
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    # Keep the prompt within the model's context window.
    max_chars = 12000
    if len(content) > max_chars:
        content = content[:max_chars] + "\n\n[Truncated — see article #3 on RAG for long docs]"

    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content": "You are a document analyst. Summarize concisely, covering main points, key decisions, and notable details."},
            {"role": "user", "content": f"Summarize this document:\n\n{content}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(summarize_document(sys.argv[1] if len(sys.argv) > 1 else "document.txt"))
Honest Assessment: What Local Models Are Good At
Setting accurate expectations is more useful than cheerleading.
- Summarization of documents you provide
- Code explanation and basic generation
- Drafting text from structured input
- Classification and extraction tasks
- Q&A with context provided in prompt
- Translation for major languages
Where they fall short:
- Complex multi-step reasoning
- Highly nuanced creative prose
- Mathematical reasoning (beyond algebra)
- Up-to-date world knowledge (limited by the training cutoff)
- Very long documents without RAG
For 60–70% of realistic daily AI tasks in a business context, a well-configured local 8B model produces output that is good enough. The remaining 30–40% — tasks that genuinely benefit from GPT-4 class capability — are usually also tasks where external API usage is acceptable or data can be sanitised before sending. That routing decision is exactly what the next article covers.
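That routing decision can start as something very small. A deliberately naive sketch, where the model names, URLs, and criteria are placeholders rather than recommendations:

```python
def choose_endpoint(contains_sensitive_data: bool, needs_frontier_quality: bool) -> dict:
    """Pick local vs. cloud inference based on two simple flags."""
    if contains_sensitive_data:
        # Sensitive data never leaves the machine, even at a quality cost.
        return {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}
    if needs_frontier_quality:
        return {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"}
    # Default to local: free, private, and good enough for routine tasks.
    return {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}

print(choose_endpoint(True, True)["base_url"])  # sensitive data wins: local
```

Because both endpoints speak the same OpenAI-compatible API, the returned dict can be passed straight into the client constructor.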
Troubleshooting Common Issues
Model runs on CPU when you expected GPU: run ollama ps and look at the PROCESSOR column. If it shows CPU: on Apple Silicon, the model is likely too large for unified memory (try phi4-mini). On Linux/Windows, verify NVIDIA drivers are installed and Ollama detected your GPU at startup.
"model not found" error: the model hasn't been downloaded yet; run ollama pull llama3.1:8b first. Model names are case-sensitive and must exactly match the Ollama library name.
Out of memory: switch to a smaller model such as phi4-mini, or pull a more aggressively quantized tag from the model's page on ollama.com/library; lower-bit variants use less memory at some quality cost.
"address already in use" when starting ollama serve: the server is already running (as a service or the desktop app), so there's no need to start it again; use ollama run directly.
Starting Ollama at boot on Linux: sudo systemctl enable --now ollama starts it now and enables it as a service.
What You Have Now
You have a working local LLM that runs entirely on your hardware, costs nothing per token, exposes an OpenAI-compatible API on localhost, handles your data without it leaving your machine, and works offline. That's a real foundation.
But the moment you try to use this seriously — multiple models for different tasks, teammates sharing the same setup, switching between local and cloud based on data sensitivity, tracking what's actually being used — you've outgrown talking directly to Ollama. Those are infrastructure problems, and they need a gateway layer.