Local Sovereign AI · Article #5

Running Open Source Models at Scale —
vLLM vs. Ollama vs. LM Studio

Ollama got you started. Now you have 10 users, or 50, or a response time that's no longer acceptable. This is what you reach for next — and why it matters which one.

~45 min read + setup · Intermediate–Advanced · NVIDIA GPU required for vLLM

When Ollama Isn't Enough

Ollama is the right tool for getting started. One command, model running, API available. For a single developer running local experiments, it's close to perfect.

The problem appears at scale. By default, Ollama processes requests sequentially: it handles one request at a time and queues everything else. An OLLAMA_NUM_PARALLEL setting allows some concurrency, but it doesn't change the underlying architecture: Ollama allocates GPU memory statically per model load, which limits how many requests it can genuinely parallelize.

The numbers make this concrete. In a controlled benchmark on the same hardware, Ollama peaked at around 41 tokens per second total throughput under concurrent load. vLLM — with the same model on the same GPU — hit 793 tokens per second. That's not a marginal difference. That's the difference between a service that can handle your team and one that collapses under ten simultaneous users.

This article explains why that gap exists, when it matters, and how to make the right choice for your situation.

Three Tools, Three Philosophies

Ollama
Developer-first
Docker for language models. One command, model running. Built for the individual developer.
Setup time: ~2 min
GPU requirement: Optional
Peak throughput: ~41 TPS
Concurrent users: 1–5
Apple Silicon: ✓ Native

vLLM
Production-first
High-throughput inference engine from UC Berkeley. Built for many concurrent users.
Setup time: ~20 min
GPU requirement: NVIDIA required
Peak throughput: ~793 TPS
Concurrent users: 10–500+
Apple Silicon: ✗ CUDA only

LM Studio
GUI-first
Desktop application with a visual interface. Built for non-technical users and personal use.
Setup time: ~5 min
GPU requirement: Optional
Peak throughput: ~Ollama-level
Concurrent users: 1–3
Apple Silicon: ✓ Native

Why the Performance Gap Exists

To understand when to switch from Ollama to vLLM, you need to understand two architectural concepts that explain almost the entire performance difference: PagedAttention and continuous batching.

The memory problem: PagedAttention

When a language model processes a request, it builds a "KV cache" — a growing table of attention keys and values for every token generated so far. Traditional inference engines pre-allocate a large contiguous block of GPU memory for this cache for each request, sized for the maximum possible output length. Most of that memory sits empty for most of the request's lifetime, and it can't be used for anything else.

vLLM's core innovation, PagedAttention, borrows a concept from operating system virtual memory. Instead of one big block, it divides the KV cache into small fixed-size pages and allocates them on demand, much as an OS hands out physical memory in fixed-size pages rather than requiring contiguous blocks. The result: near-zero memory waste, which means more requests can fit on the same GPU simultaneously, which means higher throughput.
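A toy calculation makes the waste concrete. This sketch uses illustrative numbers (a 16-token page and a 2048-token maximum output are assumptions for the example, not vLLM's actual accounting):

```python
# Toy model of KV-cache memory use: contiguous preallocation vs. paging.
# PAGE_SIZE and MAX_OUTPUT are illustrative numbers for this example only.

PAGE_SIZE = 16          # tokens covered by one fixed-size page
MAX_OUTPUT = 2048       # worst-case output length the engine must allow for

def contiguous_alloc(generated_tokens: int) -> int:
    """Traditional engines reserve the worst case up front, regardless of use."""
    return MAX_OUTPUT

def paged_alloc(generated_tokens: int) -> int:
    """PagedAttention-style: allocate pages on demand, rounded up to a page."""
    pages = -(-generated_tokens // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE

# A typical short completion: 200 tokens actually generated
used = 200
print(contiguous_alloc(used))   # 2048 token slots reserved
print(paged_alloc(used))        # 208 token slots reserved
# Roughly a 10x difference in reserved memory for this request.
```

The reserved-but-unused slots in the contiguous scheme are exactly the memory that could have held other requests' caches.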

Ollama, by contrast, allocates GPU memory statically per model load. That works fine for one or two simultaneous users; by ten concurrent users it hits the memory ceiling and throughput plateaus.

The scheduling problem: Continuous batching

Traditional batching waits for a full batch of requests to arrive, processes them all together, and only then starts the next batch. If one request in the batch takes twice as long, every other request in the batch waits for it, and the GPU sits idle between batches.

vLLM uses continuous batching: new requests are inserted into the processing pipeline at every single iteration, not only between batches. When a request completes, its freed GPU capacity is immediately handed to the next waiting request. No idle time, no waiting for batch windows.
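The effect shows up even in a toy scheduler. This simulation is illustrative only (real schedulers also handle prefill, priorities, and memory limits); it counts decode steps for eight mixed-length requests on a GPU that runs four requests per step:

```python
# Toy comparison of static vs. continuous batching.
# Each request needs `length` decode steps; the GPU processes up to
# `capacity` requests per step. Illustrative only.

def static_batching(lengths, capacity):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])  # short requests wait for the longest
    return steps

def continuous_batching(lengths, capacity):
    """Refill freed slots immediately: a finished request's slot is reused."""
    queue = list(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < capacity:   # fill free slots every step
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]  # drop finished requests
    return steps

reqs = [10, 100, 10, 100, 10, 100, 10, 100]
print(static_batching(reqs, 4))      # 200 steps: every batch waits on a 100
print(continuous_batching(reqs, 4))  # 130 steps: short requests never block long ones
```

Same work, same capacity, 35% fewer steps in this small example; the gap widens as request lengths get more varied.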

These two features together explain why vLLM's throughput scales almost linearly with concurrent users until GPU saturation, while Ollama's throughput hits a ceiling quickly.

Throughput vs. Concurrent Users — Ollama vs. vLLM
[Chart: tokens/sec (0–600+) vs. concurrent users (1–128). vLLM's curve climbs with load; Ollama's plateaus early at roughly its 41 TPS ceiling.]

Based on benchmarks from Red Hat Developer (Aug 2025) on identical hardware. Exact numbers vary by GPU, model, and prompt length.

Getting More Out of Ollama

Before switching to vLLM, it's worth knowing how far you can push Ollama with tuning. For small teams (3–8 users), these settings can close a significant portion of the gap without changing your infrastructure:

bash — Ollama environment variables
# Allow up to 4 requests in parallel (default is 1 or 4 depending on version)
# Start low and increase — watch VRAM with `ollama ps`
export OLLAMA_NUM_PARALLEL=4

# Spread work across multiple GPUs if you have them
export OLLAMA_SCHED_SPREAD=1

# Keep model loaded for longer (seconds) — avoid reload penalty
export OLLAMA_KEEP_ALIVE=3600

# Max queue depth for pending requests (avoid 503s under burst)
export OLLAMA_MAX_QUEUE=512

# Restart with settings applied
ollama serve
Watch your VRAM. Each parallel slot consumes additional GPU memory for the KV cache. Run ollama ps while serving to monitor how much memory is in use. If requests start spilling to CPU (shown as "CPU" in ollama ps), you've exceeded VRAM and performance will drop sharply. Reduce OLLAMA_NUM_PARALLEL until everything stays on GPU.

Even tuned, Ollama's architecture means throughput will plateau. These settings buy you more headroom, not unlimited scale. Once you're regularly seeing queue buildup or response times above 10 seconds under load, it's time for vLLM.
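A quick load test tells you whether tuned Ollama still meets your load. This is a sketch under assumptions: Ollama's OpenAI-compatible endpoint on localhost:11434, a pulled llama3.1:8b model, and a prompt chosen arbitrarily; `run_load_test` and `aggregate_tps` are hypothetical helper names:

```python
# Rough concurrent load test against a local Ollama server (sketch).
# Assumes Ollama serves its OpenAI-compatible API on localhost:11434
# and llama3.1:8b is already pulled; adjust URL and MODEL for your setup.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/v1/chat/completions"
MODEL = "llama3.1:8b"

def one_request(prompt: str) -> int:
    """Send one chat completion and return its completion-token count."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def aggregate_tps(total_tokens: int, wall_seconds: float) -> float:
    """Total throughput: all completion tokens over wall-clock time."""
    return total_tokens / wall_seconds

def run_load_test(concurrency: int = 8) -> float:
    """Fire `concurrency` identical requests at once; report tokens/sec."""
    prompts = ["Explain DNS in two sentences."] * concurrency
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(one_request, prompts))
    return aggregate_tps(tokens, time.monotonic() - start)

# With the server running: print(run_load_test(8))
```

Run it before and after changing OLLAMA_NUM_PARALLEL; if total tokens/sec stops improving as you raise concurrency, you've found the plateau the section above describes.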

Setting Up vLLM

vLLM requires an NVIDIA GPU with CUDA. It does not run on CPU-only machines or Apple Silicon. If your production server runs on Apple hardware, Ollama remains your best option for local inference. For cloud deployment on NVIDIA hardware (AWS g5, Azure NC, GCP A100), vLLM is the standard choice.

vLLM installs as a standard Python package and runs with a single command that spins up an OpenAI-compatible API server:

bash — install
# Create a dedicated environment (Python 3.10+ required)
python3 -m venv .venv && source .venv/bin/activate

pip install vllm

# Verify GPU is visible
python -c "import torch; print(torch.cuda.get_device_name(0))"
bash — start the server
# Serve Llama 3.1 8B Instruct — downloads from HuggingFace on first run
# Server starts on http://localhost:8000 with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct

# With a Hugging Face token (required for gated models like Llama):
HF_TOKEN=your_token vllm serve meta-llama/Llama-3.1-8B-Instruct

# With explicit GPU memory fraction (if you need to share the GPU):
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8

# Across two GPUs with tensor parallelism:
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 2

Once running, the API is a drop-in replacement for the OpenAI API — just change the base URL. Any code using openai or LiteLLM from Article #2 works without modification:

python — calling vLLM
from openai import OpenAI

# Only the base_url and model name change — everything else is identical
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize our Q3 report in three bullet points."}],
    temperature=0.1
)
print(response.choices[0].message.content)

If you already have LiteLLM from Article #2 in front of Ollama, pointing it at vLLM instead is a one-line config change:

yaml — litellm_config.yaml update
model_list:
  - model_name: llama3
    litellm_params:
      # Before (Ollama):
      # model: ollama/llama3.1:8b
      # api_base: http://localhost:11434

      # After (vLLM) — same model name, new backend:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://localhost:8000/v1
      api_key: not-needed
First startup is slow. vLLM downloads model weights from HuggingFace (several GB) and then compiles CUDA kernels — this can take 10–30 minutes on first run. Subsequent starts from the same machine are fast because the weights are cached in ~/.cache/huggingface/ and the compiled kernels are reused.

Key configuration flags

vLLM exposes a large number of serving parameters. These are the ones worth knowing for a standard deployment:

bash — important flags
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --api-key your-secret-key

# --host 0.0.0.0             bind to all interfaces (needed for remote access)
# --port 8000                default port
# --gpu-memory-utilization   fraction of GPU VRAM to use (0–1, default 0.9)
# --max-model-len            max context length; reduce to save VRAM
# --dtype                    precision: bfloat16 (recommended), float16, auto
# --tensor-parallel-size     number of GPUs to split the model across
# --api-key                  optional: require this key in the auth header

LM Studio — When You Don't Want a Terminal

LM Studio is a different kind of tool. It's a desktop application with a visual UI — you browse and download models through a GUI, chat with them directly, and optionally expose a local API server. It's not built for concurrent production workloads; it's built for the person who wants to explore models without touching a command line.

Where LM Studio genuinely shines is exactly that use case: browsing, downloading, and chatting with models entirely through a GUI, with no terminal required.

LM Studio also exposes an OpenAI-compatible API on port 1234 when you start its server mode. For single-user access to a locally-running model from code, it's perfectly viable. For anything serving multiple users simultaneously, its throughput characteristics are similar to Ollama — respectable for one person, insufficient for a team.
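Because all three tools speak the OpenAI wire protocol, just on different local ports, switching a client between them is only a base-URL change. A minimal sketch (the backend labels here are our own, chosen for illustration):

```python
# Local OpenAI-compatible endpoints, per the ports listed in this article:
# Ollama on 11434, vLLM on 8000, LM Studio on 1234.
# The backend names are illustrative labels, not official identifiers.

BACKENDS = {
    "ollama":    "http://localhost:11434/v1",
    "vllm":      "http://localhost:8000/v1",
    "lm_studio": "http://localhost:1234/v1",
}

def base_url(backend: str) -> str:
    """Return the local OpenAI-compatible base URL for a given engine."""
    return BACKENDS[backend]

print(base_url("lm_studio"))  # http://localhost:1234/v1

# Plug into any OpenAI-compatible client, e.g.:
#   client = OpenAI(base_url=base_url("lm_studio"), api_key="not-needed")
```

The same pattern is what makes the LiteLLM swap later in this article a one-line change.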

Download it at lmstudio.ai. No install commands, no configuration files — just download, open, and start chatting.

Full Comparison

Dimension | Ollama | vLLM | LM Studio
Setup complexity | Single binary, 2 min | pip install, 20 min first run | Desktop app, 5 min
GPU requirement | Optional (CPU works) | NVIDIA CUDA required | Optional (CPU works)
Apple Silicon | ✓ Native | ✗ Not supported | ✓ Native
Concurrent users | 1–5 comfortably | 10–500+ | 1–3
OpenAI-compatible API | ✓ Port 11434 | ✓ Port 8000 | ✓ Port 1234
Model source | Ollama Hub (curated) | HuggingFace (any model) | HuggingFace + built-in
PagedAttention | ✗ No | ✓ Yes | ✗ No
Continuous batching | ✗ No | ✓ Yes | ✗ No
Multi-GPU (tensor parallel) | ~ Layer offload only | ✓ True tensor parallel | ✗ No
GUI interface | ✗ CLI only | ✗ CLI only | ✓ Full GUI
Structured outputs | ~ Basic | ✓ JSON schema, regex | ~ Basic
Best for | Dev, prototyping, small teams | Production, scale, APIs | Individuals, non-technical
What happened to Hugging Face TGI? Text Generation Inference (TGI) used to be a common third option alongside Ollama and vLLM. In December 2025, Hugging Face put TGI into maintenance mode — no new features, security patches only. They now recommend vLLM or SGLang for new deployments. If you're evaluating inference engines today, TGI is not a recommended choice for new projects.

How to Choose

Are you on Apple Silicon (M1/M2/M3/M4)?
Yes: use Ollama. vLLM does not support Apple Silicon. Ollama has native Metal acceleration and is the right tool for your hardware.
No: continue to the next question.

Do you need a GUI with no command line?
Yes: use LM Studio. Download, install, start chatting. No terminal required.
No: continue to the next question.

Will more than 5 people use this simultaneously?
Yes: use vLLM. This is exactly the problem it was built for. Ollama will queue and degrade under this load.
No: continue to the next question.

Are your response times acceptable under current Ollama load?
Yes: stay on Ollama. Don't add complexity you don't need. Try the tuning flags first if you need a bit more headroom.
No: move to vLLM. Check your hardware first; you need an NVIDIA GPU with enough VRAM for your model.

VRAM Requirements by Model Size

vLLM requires all model weights to fit in GPU VRAM. This is the practical constraint that determines which models you can run:

Minimum VRAM for Common Models (fp16 / bfloat16)

Model | Min VRAM | GPU example | Notes
llama3.1:8b | 16 GB | RTX 4080, A10G | Common dev GPU
mistral-small3:22b | 44 GB | 2× RTX 4090, A100 | Use --tensor-parallel-size 2
llama3.1:70b | 140 GB | 4× A100 80GB | Or use 4-bit quantization
qwen3:7b (INT4) | 8 GB | RTX 3080, RTX 4070 | --quantization awq

Tip: add ~20% headroom for the KV cache; a 16 GB model needs ~20 GB VRAM in practice under load. Use --quantization awq or gptq to halve VRAM requirements at modest quality cost.
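The numbers in the table follow a simple rule of thumb: weight memory is parameter count times bytes per parameter, plus roughly 20% headroom for the KV cache. A sketch of that estimate (real usage varies with context length, batch size, and engine overhead):

```python
# Rule-of-thumb VRAM estimate: weights = params x bytes/param, plus ~20%
# headroom for the KV cache. Illustrative; real usage varies with context
# length, batch size, and engine overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def vram_gb(params_billion: float, dtype: str = "bf16",
            kv_headroom: float = 0.20) -> float:
    # 1B params at 1 byte each is ~1 GB, so this works directly in GB
    weights = params_billion * BYTES_PER_PARAM[dtype]
    return round(weights * (1 + kv_headroom), 1)

print(vram_gb(8))          # 19.2 -> ~20 GB in practice for llama3.1:8b
print(vram_gb(70))         # 168.0 -> needs multiple 80 GB cards
print(vram_gb(8, "int4"))  # 4.8 -> fits an 8 GB consumer GPU
```

This also shows why int4 quantization, covered next, is the difference between renting an A100 and using the GPU you already own.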

Reducing VRAM with quantization

If your GPU has less VRAM than the model requires, quantization is the answer. vLLM supports AWQ and GPTQ quantization natively, which reduce model size by roughly half with only a modest quality reduction:

bash — serving a quantized model
# AWQ (recommended) — use a pre-quantized model from HuggingFace
# Search for "AWQ" variants of your target model on huggingface.co
vllm serve TheBloke/Llama-3-8B-Instruct-AWQ --quantization awq

# --max-model-len also reduces VRAM by limiting the context window
# (a shorter context means a significantly smaller KV cache)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.85

Common Issues

CUDA out of memory error on startup
Your model is too large for available VRAM. Three options: (1) Use a quantized model variant (--quantization awq). (2) Reduce context length with --max-model-len 2048. (3) Reduce memory allocation with --gpu-memory-utilization 0.75. Check available VRAM with nvidia-smi before starting.
vllm command not found after pip install
Your virtual environment isn't activated, or the venv's bin directory isn't in PATH. Run source .venv/bin/activate and try again. Also verify: pip show vllm should show the package details including its install location.
First startup takes 20+ minutes and seems stuck
Normal on first run. vLLM is downloading model weights from HuggingFace (can be 5–15 GB) and then compiling CUDA kernels. You'll see "Loading weights" followed by "Capturing CUDA graphs". Both steps are slow the first time. Subsequent starts from the same machine complete in 1–3 minutes. Watch progress with watch -n1 nvidia-smi — GPU utilization will spike during CUDA compilation.
Access to model gated — 401 error downloading from HuggingFace
Models like Llama 3.1 require accepting a license on HuggingFace and providing an access token. Go to the model page, accept the terms, then create an access token at huggingface.co/settings/tokens. Pass it as: HF_TOKEN=hf_... vllm serve meta-llama/Llama-3.1-8B-Instruct. Alternatively, use non-gated models like Qwen/Qwen2.5-7B-Instruct which require no token.
Ollama is still slow even after setting OLLAMA_NUM_PARALLEL
The model may be spilling to CPU. Run ollama ps and check the PROCESSOR column. If it shows "CPU" or a mix, you've exceeded VRAM capacity. Reduce OLLAMA_NUM_PARALLEL until everything shows "GPU". You can also use a smaller quantized model to free up headroom for more parallel slots.

The Full Picture

You now have a clear map of the inference engine landscape. Ollama for development and prototyping, with tuning options that take you further than the defaults. vLLM for production scale on NVIDIA hardware, with PagedAttention and continuous batching that change the economics of self-hosted AI entirely. LM Studio for the non-technical user who needs models locally without any infrastructure work.

The choice isn't permanent. The right pattern for most teams is to start with Ollama, expose it through LiteLLM (Article #2) so the application code doesn't care about the backend, and then swap in vLLM when load requires it. One config change, no application changes.