Running Open Source Models at Scale —
vLLM vs. Ollama vs. LM Studio
Ollama got you started. Now you have 10 users, or 50, or a response time that's no longer acceptable. This is what you reach for next — and why it matters which one.
When Ollama Isn't Enough
Ollama is the right tool for getting started. One command, model running, API available. For a single developer running local experiments, it's close to perfect.
The problem appears at scale. By default Ollama processes requests sequentially — it handles one request at a time, queuing everything else. It has an OLLAMA_NUM_PARALLEL setting that allows some concurrency, but that doesn't change the underlying architecture: Ollama allocates GPU memory statically per model load, which limits how many requests it can genuinely parallelize.
The numbers make this concrete. In a controlled benchmark on the same hardware, Ollama peaked at around 41 tokens per second total throughput under concurrent load. vLLM — with the same model on the same GPU — hit 793 tokens per second. That's not a marginal difference. That's the difference between a service that can handle your team and one that collapses under ten simultaneous users.
This article explains why that gap exists, when it matters, and how to make the right choice for your situation.
Three Tools, Three Philosophies
Why the Performance Gap Exists
To understand when to switch from Ollama to vLLM, you need to understand two architectural concepts that explain almost the entire performance difference: PagedAttention and continuous batching.
The memory problem: PagedAttention
When a language model processes a request, it builds a "KV cache" — a growing table of attention keys and values for every token generated so far. Traditional inference engines pre-allocate a large contiguous block of GPU memory for this cache for each request, sized for the maximum possible output length. Most of that memory sits empty for most of the request's lifetime, and it can't be used for anything else.
vLLM's core innovation, PagedAttention, borrows a concept from operating system virtual memory. Instead of one big block, it divides the KV cache into small fixed-size pages and allocates them on demand, just as an OS manages virtual memory in fixed-size pages rather than large contiguous reservations. The result: near-zero memory waste, which means more requests fit on the same GPU simultaneously, which means higher throughput.
Ollama, by contrast, sizes its GPU allocation statically when a model loads. That works fine for one or two simultaneous users; with ten concurrent users it hits the memory ceiling and throughput plateaus.
The scheduling problem: Continuous batching
Traditional batching waits for a full batch of requests to arrive, processes them all together, and only then starts the next batch. If one request in the batch takes twice as long, every other request in that batch waits, and the GPU sits idle between batches.
vLLM uses continuous batching: new requests are inserted into the processing pipeline at every single iteration — not only between batches. When a request finishes, its freed GPU capacity is immediately filled with the next waiting request. No idle time, no waiting for batch windows.
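The scheduling difference can be sketched in a few lines. This is a deliberately simplified model (one unit of work per decode step, a fixed number of batch slots, hypothetical request lengths), not vLLM's actual scheduler:

```python
# Toy scheduler comparison (illustration, not vLLM's real scheduler).
# Each request needs `tokens` decode steps; the GPU runs BATCH_SLOTS
# requests per step.
BATCH_SLOTS = 4

def static_batching_steps(requests: list[int]) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(requests), BATCH_SLOTS):
        steps += max(requests[i:i + BATCH_SLOTS])
    return steps

def continuous_batching_steps(requests: list[int]) -> int:
    """Refill any freed slot from the queue at every single iteration."""
    queue = list(requests)
    running: list[int] = []
    steps = 0
    while queue or running:
        while queue and len(running) < BATCH_SLOTS:
            running.append(queue.pop(0))             # admit new work mid-flight
        running = [r - 1 for r in running if r > 1]  # one decode step for all
        steps += 1
    return steps

# Three long requests mixed with nine short ones
reqs = [100, 10, 10, 10] * 3
print(static_batching_steps(reqs), continuous_batching_steps(reqs))
```

With this workload, static batching pays for the longest request in every batch (300 steps total), while continuous batching keeps all slots busy and finishes in roughly a third of the time.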
These two features together explain why vLLM's throughput scales almost linearly with concurrent users until GPU saturation, while Ollama's throughput hits a ceiling quickly.
(Based on benchmarks from Red Hat Developer, Aug 2025, on identical hardware. Exact numbers vary by GPU, model, and prompt length.)
Getting More Out of Ollama
Before switching to vLLM, it's worth knowing how far you can push Ollama with tuning. For small teams (3–8 users), these settings can close a significant portion of the gap without changing your infrastructure:
# Allow up to 4 requests in parallel (default is 1 or 4 depending on version)
# Start lower and increase — watch VRAM with `ollama ps`
export OLLAMA_NUM_PARALLEL=4
# Spread work across multiple GPUs if you have them
export OLLAMA_SCHED_SPREAD=1
# Keep model loaded for longer (seconds) — avoid reload penalty
export OLLAMA_KEEP_ALIVE=3600
# Max queue depth for pending requests (avoid 503s under burst)
export OLLAMA_MAX_QUEUE=512
# Restart with settings applied
ollama serve
Run ollama ps while serving to monitor how much memory is in use. If requests start spilling to CPU (shown as "CPU" in the ollama ps output), you've exceeded VRAM and performance will drop sharply. Reduce OLLAMA_NUM_PARALLEL until everything stays on GPU.
Even tuned, Ollama's architecture means throughput will plateau. These settings buy you more headroom, not unlimited scale. Once you're regularly seeing queue buildup or response times above 10 seconds under load, it's time for vLLM.
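To see where your tuned instance actually plateaus, a small load test helps. The sketch below assumes an OpenAI-compatible /v1/chat/completions endpoint (Ollama exposes one on port 11434, vLLM on 8000); the base URL, model name, and prompt are placeholders to replace with your own:

```python
# Minimal concurrency smoke test: send N identical chat requests in
# parallel and report aggregate tokens per second.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def send_chat(base_url: str, model: str, prompt: str) -> int:
    """POST one OpenAI-style chat completion; return completion token count."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def throughput(token_counts: list[int], elapsed_s: float) -> float:
    """Aggregate tokens per second across all concurrent requests."""
    return sum(token_counts) / elapsed_s

def run_load_test(base_url: str, model: str, n: int = 10) -> float:
    """Fire n requests at once and measure total generation throughput."""
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=n) as pool:
        counts = list(pool.map(
            lambda _: send_chat(base_url, model,
                                "Write one sentence about GPUs."),
            range(n)))
    return throughput(counts, time.time() - t0)
```

Call run_load_test("http://localhost:11434", "llama3.1:8b") against your tuned Ollama instance, then the same call against vLLM on port 8000; if the queue is building up, the Ollama number will stop growing well before ten workers.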
Setting Up vLLM
vLLM installs as a standard Python package and runs with a single command that spins up an OpenAI-compatible API server:
# Create a dedicated environment (Python 3.10+ required)
python3 -m venv .venv && source .venv/bin/activate
pip install vllm
# Verify GPU is visible
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Serve Llama 3.1 8B Instruct — downloads from HuggingFace on first run
# Server starts on http://localhost:8000 with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct
# With a Hugging Face token (required for gated models like Llama):
HF_TOKEN=your_token vllm serve meta-llama/Llama-3.1-8B-Instruct
# With explicit GPU memory fraction (if you need to share the GPU):
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8
# Across two GPUs with tensor parallelism:
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 2
Once running, the API is a drop-in replacement for the OpenAI API — just change the base URL. Any code using openai or LiteLLM from Article #2 works without modification:
from openai import OpenAI
# Only the base_url and model name change — everything else is identical
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed" # vLLM doesn't require a key by default
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Summarize our Q3 report in three bullet points."}],
temperature=0.1
)
print(response.choices[0].message.content)
If you already have LiteLLM from Article #2 in front of Ollama, pointing it at vLLM instead is a one-line config change:
model_list:
- model_name: llama3
litellm_params:
# Before (Ollama):
# model: ollama/llama3.1:8b
# api_base: http://localhost:11434
# After (vLLM) — same model name, new backend:
model: openai/meta-llama/Llama-3.1-8B-Instruct
api_base: http://localhost:8000/v1
api_key: not-needed
Subsequent startups are much faster: model weights are cached in ~/.cache/huggingface/ and the compiled kernels are reused.
Key configuration flags
vLLM exposes a large number of serving parameters. These are the ones worth knowing for a standard deployment:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \ # bind to all interfaces (needed for remote access)
--port 8000 \ # default port
--gpu-memory-utilization 0.9 \ # fraction of GPU VRAM to use (0–1, default 0.9)
--max-model-len 4096 \ # max context length — reduce to save VRAM
--dtype bfloat16 \ # precision: bfloat16 (recommended), float16, auto
--tensor-parallel-size 1 \ # number of GPUs to split across
--api-key your-secret-key # optional: require auth header
LM Studio — When You Don't Want a Terminal
LM Studio is a different kind of tool. It's a desktop application with a visual UI — you browse and download models through a GUI, chat with them directly, and optionally expose a local API server. It's not built for concurrent production workloads; it's built for the person who wants to explore models without touching a command line.
Where LM Studio genuinely shines:
- Non-technical colleagues who need to run local models for sensitive document work but aren't comfortable with the terminal
- Model exploration — comparing responses from different models in a side-by-side chat interface
- Apple Silicon Macs where you want a polished native experience without Docker or Python setup
- Quick prototyping where you want to test a model before deciding whether to add it to your Ollama or vLLM stack
LM Studio also exposes an OpenAI-compatible API on port 1234 when you start its server mode. For single-user access to a locally-running model from code, it's perfectly viable. For anything serving multiple users simultaneously, its throughput characteristics are similar to Ollama — respectable for one person, insufficient for a team.
Download it at lmstudio.ai. No install commands, no configuration files — just download, open, and start chatting.
Full Comparison
| Dimension | Ollama | vLLM | LM Studio |
|---|---|---|---|
| Setup complexity | Single binary, 2 min | pip install, 20 min first run | Desktop app, 5 min |
| GPU requirement | Optional (CPU works) | NVIDIA CUDA required | Optional (CPU works) |
| Apple Silicon | ✓ Native | ✗ Not supported | ✓ Native |
| Concurrent users | 1–5 comfortably | 10–500+ | 1–3 |
| OpenAI-compatible API | ✓ Port 11434 | ✓ Port 8000 | ✓ Port 1234 |
| Model source | Ollama Hub (curated) | HuggingFace (any model) | HuggingFace + built-in |
| PagedAttention | ✗ No | ✓ Yes | ✗ No |
| Continuous batching | ✗ No | ✓ Yes | ✗ No |
| Multi-GPU (tensor parallel) | ~ Layer offload only | ✓ True tensor parallel | ✗ No |
| GUI interface | ✗ CLI only | ✗ CLI only | ✓ Full GUI |
| Structured outputs | ~ Basic | ✓ JSON schema, regex | ~ Basic |
| Best for | Dev, prototyping, small teams | Production, scale, APIs | Individuals, non-technical |
How to Choose
VRAM Requirements by Model Size
vLLM requires all model weights to fit in GPU VRAM. This is the practical constraint that determines which models you can run. As a rough rule of thumb, a model at 16-bit precision needs about 2 GB of VRAM per billion parameters for the weights alone, plus headroom for the KV cache and runtime overhead.
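A back-of-the-envelope estimate makes the constraint concrete. This is a rough sketch that ignores activations and framework overhead; the architecture defaults below are the published Llama-3.1-8B numbers (32 layers, 8 KV heads, head dimension 128), so substitute your own model's:

```python
# Rough VRAM estimate: weights plus KV cache for in-flight sequences.
def weight_bytes(params_billions: float, bytes_per_param: float = 2) -> float:
    """fp16/bf16 = 2 bytes per parameter; 4-bit AWQ/GPTQ is roughly 0.5."""
    return params_billions * 1e9 * bytes_per_param

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """Keys + values, per layer, for every token currently in flight."""
    return tokens * 2 * layers * kv_heads * head_dim * bytes_per_val

GIB = 2**30
weights = weight_bytes(8)          # 8B model at bf16
cache = kv_cache_bytes(4096 * 10)  # 10 concurrent sequences at 4k context
print(f"weights ~{weights / GIB:.1f} GiB, KV cache ~{cache / GIB:.1f} GiB")
```

For this configuration the weights alone come to about 15 GiB and ten full 4k-context sequences add another 5 GiB of KV cache, which is why an 8B model at bf16 wants a 24 GB card for real concurrency and why the quantization options below matter.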
Reducing VRAM with quantization
If your GPU has less VRAM than the model requires, quantization is the answer. vLLM supports AWQ and GPTQ quantization natively; both are 4-bit schemes that shrink the weights to roughly a quarter of their 16-bit size with only a modest quality reduction:
# AWQ (recommended) — use a pre-quantized model from HuggingFace
# Search for "AWQ" variants of your target model on huggingface.co
vllm serve TheBloke/Llama-3-8B-Instruct-AWQ --quantization awq
# The --max-model-len flag also reduces VRAM by limiting context window
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 2048 \ # reduces KV cache size significantly
--gpu-memory-utilization 0.85
Common Issues
- CUDA out of memory on startup: (1) Use a quantized model (--quantization awq). (2) Reduce context length with --max-model-len 2048. (3) Reduce memory allocation with --gpu-memory-utilization 0.75. Check available VRAM with nvidia-smi before starting.
- "vllm: command not found": Activate the virtual environment with source .venv/bin/activate and try again. Also verify: pip show vllm should show the package details including its install location.
- First startup is slow: vLLM compiles CUDA kernels on first run. Monitor with watch -n1 nvidia-smi — GPU utilization will spike during CUDA compilation.
- Access denied downloading a gated model: Accept the license on HuggingFace, then pass your token: HF_TOKEN=hf_... vllm serve meta-llama/Llama-3.1-8B-Instruct. Alternatively, use non-gated models like Qwen/Qwen2.5-7B-Instruct which require no token.
- Ollama slows down under parallel load: Run ollama ps and check the PROCESSOR column. If it shows "CPU" or a mix, you've exceeded VRAM capacity. Reduce OLLAMA_NUM_PARALLEL until everything shows "GPU". You can also use a smaller quantized model to free up headroom for more parallel slots.
The Full Picture
You now have a clear map of the inference engine landscape. Ollama for development and prototyping, with tuning options that take you further than the defaults. vLLM for production scale on NVIDIA hardware, with PagedAttention and continuous batching that change the economics of self-hosted AI entirely. LM Studio for the non-technical user who needs models locally without any infrastructure work.
The choice isn't permanent. The right pattern for most teams is to start with Ollama, expose it through LiteLLM (Article #2) so the application code doesn't care about the backend, and then swap in vLLM when load requires it. One config change, no application changes.