LiteLLM as Your Local AI Gateway —
One API to Rule Them All
You have a local model running. Now you need multiple models, cost tracking, cloud fallbacks, and team access — without rewriting your application each time. A gateway layer solves this. LiteLLM is the open-source tool that makes it practical.
The Problem With Talking Directly to Models
After getting Ollama running, the natural next step is to wire it into something useful. A script that summarizes documents. An internal chatbot. A code assistant for your team. And this is where things get complicated faster than expected.
You start with one model. Then you realize some tasks need something stronger, so you pull a second. Then a colleague wants to use the setup too, so now you're thinking about networking. Then you start wondering how many tokens you're actually using. Then a task comes in that your local model handles poorly, and you want to route it to Claude or GPT-4 as a fallback — but only for that type of query, not for queries that contain sensitive data.
Suddenly what was "a model running on localhost" has become an infrastructure problem. You need:
- A single endpoint your applications can target, regardless of which model handles the request
- The ability to switch or add models without changing application code
- Routing logic — send this request to the local model, that one to the cloud
- Token counting and cost tracking, even for local models that don't charge per token
- Rate limiting and load balancing when multiple people or processes use the same setup
- A unified logging layer so you can see what's being asked and what responses are generated
This is exactly what an LLM gateway does. And LiteLLM is the best open-source implementation of one.
What LiteLLM Actually Is
LiteLLM is a Python library and proxy server that presents a single, unified, OpenAI-compatible API in front of over 100 different LLM providers — including local models via Ollama. You point your applications at LiteLLM instead of directly at any specific model or provider. LiteLLM handles routing, API format translation, retry logic, cost tracking, and logging.
Provider abstraction: Every model — Ollama local, OpenAI, Anthropic, Mistral, AWS Bedrock, Azure — speaks a slightly different API dialect. LiteLLM normalizes all of them into a single OpenAI-compatible interface. Your application code never needs to know which provider is actually serving the request.
Fully self-hostable: LiteLLM runs as a Docker container or a plain Python process on your own infrastructure. No data passes through LiteLLM's servers. The proxy runs locally, calls your local Ollama instance locally, everything stays inside your perimeter.
OpenAI-compatible output: If you already have code using the OpenAI SDK, switching to LiteLLM proxy is one configuration line. The response format is identical.
Built-in observability: Every request gets logged with model name, token counts, latency, and estimated cost — even for local models where the cost is $0.
What LiteLLM is not: it's not a model, doesn't store your data by default, and is not a replacement for Ollama. It sits in front of Ollama as a routing and management layer. Ollama continues to do the actual inference.
Your application speaks only to LiteLLM — it never needs to know which backend handles the request.
LiteLLM requires Python 3.8+. Install with the proxy extras:
python3 --version
pip install 'litellm[proxy]'
litellm --version
No Docker required to get started. LiteLLM is a Python process you run alongside Ollama. Make sure Ollama is already running:
# macOS: Ollama runs as menu bar app automatically
# Linux: start if not already running
ollama serve
# Pull a model if you don't have one yet (from Article #1)
ollama pull llama3.1:8b
LiteLLM is configured via a YAML file. Create a file called litellm_config.yaml:
model_list:
  - model_name: llama3                       # The name your apps will use
    litellm_params:
      model: ollama/llama3.1:8b              # How LiteLLM routes it internally
      api_base: http://localhost:11434       # Your local Ollama instance
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true        # Ignore unsupported params instead of erroring
  request_timeout: 600     # 10 min timeout for slow local inference
Start the proxy:
litellm --config litellm_config.yaml --port 4000
Leave this running. Test it immediately:
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer anything" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Are you running locally?"}]}'
The Authorization: Bearer anything header is required by the OpenAI client format, but LiteLLM ignores the value by default. You'll add real key enforcement in Step 8.

Your Python code targets LiteLLM on port 4000 instead of Ollama directly:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="anything"  # required by the SDK, ignored by LiteLLM in dev mode
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key principles of zero-trust security."}
    ]
)
print(response.choices[0].message.content)
Switch models without changing any other code:
# Same client, same code — just a different model name in the request
response = client.chat.completions.create(
    model="mistral",
    messages=[...]
)
This is where the gateway pattern becomes genuinely powerful. Some queries are too complex for your local model. Some are completely safe to send to a cloud API. You want to handle both from the same codebase — without your application needing any logic about which model to use when.
Update your litellm_config.yaml:
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

  # Cloud models — only used when explicitly requested
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # "smart" — tries local first, falls back to cloud on error/timeout
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  request_timeout: 600
  num_retries: 3
  fallbacks: [{"smart": ["gpt-4o"]}]   # if smart fails, fall back to cloud
The fallbacks format is a list of dicts: [{"source_model": ["fallback_model"]}]. The fallback triggers on errors (5xx, timeouts), not on slow responses: if the local model is responding slowly but successfully, LiteLLM considers it a success.

Set your API keys as environment variables; never hardcode them:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
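LiteLLM applies the fallback server-side, so your application never sees it happen. To build intuition for the semantics, here is a toy client-side sketch of the same rule: an error from the primary triggers the fallback, while a slow-but-successful response does not. The function names are illustrative, not LiteLLM APIs.

```python
from typing import Callable

def with_fallback(primary: Callable[[str], str],
                  fallback: Callable[[str], str],
                  prompt: str) -> str:
    """Try the primary model; on any error, retry the fallback."""
    try:
        return primary(prompt)   # a slow-but-successful call still counts as success
    except Exception:
        return fallback(prompt)  # errors (timeouts, 5xx) trigger the fallback
```

This mirrors why a sluggish local model never falls through to the cloud: only a raised error reaches the fallback path.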
Now your application can make a deliberate routing choice per request:
# Sensitive internal data — always local, never leaves your machine
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": sensitive_internal_query}]
)

# Public content where quality matters — explicit cloud
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": marketing_copy_query}]
)

# Let the gateway decide — try local first, cloud if it fails
response = client.chat.completions.create(
    model="smart",
    messages=[{"role": "user", "content": ambiguous_query}]
)
Routing logic lives in YAML config — not in application code. Change policy without touching your app.
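The application still has to choose which model_name to request. A small policy function keeps that choice in one place; this is a sketch using the model names from the config above, and the policy criteria are illustrative.

```python
def pick_model(sensitive: bool, needs_quality: bool) -> str:
    """Map request traits to a model_name from litellm_config.yaml.
    The names match the config above; the policy itself is illustrative."""
    if sensitive:
        return "llama3"   # sensitive data never leaves the machine
    if needs_quality:
        return "gpt-4o"   # deliberate cloud choice
    return "smart"        # local first, cloud fallback on failure

# client.chat.completions.create(
#     model=pick_model(sensitive=True, needs_quality=False), messages=[...])
```

If the routing policy ever changes (say, a new local model replaces llama3), you edit one function and the YAML, not every call site.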
One of the most underappreciated features of a gateway is immediate visibility into what's happening across all AI calls. LiteLLM logs every request with model, token counts, latency, and estimated cost — even for local models where cost is $0.
Enable verbose logging:
litellm_settings:
  drop_params: true
  request_timeout: 600
  set_verbose: true   # log all requests to console
For a structured UI dashboard, start with:
litellm --config litellm_config.yaml --port 4000 --ui
# Dashboard at http://localhost:4000/ui
Here's what a log entry looks like — even for a free local model:
{
  "model": "llama3",
  "provider": "ollama",
  "total_tokens": 847,
  "prompt_tokens": 312,
  "completion_tokens": 535,
  "response_time_ms": 4823,
  "estimated_cost": 0.0,
  "success": true
}
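Once you collect entries of this shape (from the console log or a log file), a few lines of Python give you a per-model summary. A sketch, assuming a list of dicts with the fields shown above:

```python
from collections import defaultdict

def summarize(entries: list) -> dict:
    """Aggregate LiteLLM-style log entries (shape as above) per model."""
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
    for e in entries:
        t = totals[e["model"]]
        t["requests"] += 1
        t["tokens"] += e["total_tokens"]
        t["cost"] += e["estimated_cost"]
    return dict(totals)

# summarize(logs) -> {"llama3": {"requests": ..., "tokens": ..., "cost": ...}, ...}
```

Token counts for local models are the interesting number here: they tell you what the same workload would cost on a paid API.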
If you're running Ollama on several machines or a server with multiple GPUs, LiteLLM can load balance across them automatically. Use the same model_name for multiple backends — LiteLLM treats them as a pool:
model_list:
  - model_name: llama3              # same model_name across all three...
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-1:11434
  - model_name: llama3              # ...LiteLLM distributes load automatically
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-2:11434
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-3:11434

router_settings:
  routing_strategy: least-busy   # or: latency-based, simple-shuffle
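To build intuition for what least-busy means, here is a toy model of the idea: pick the backend with the fewest in-flight requests. This is a conceptual sketch, not LiteLLM's actual implementation.

```python
def least_busy(in_flight: dict) -> str:
    """Pick the api_base with the fewest in-flight requests.
    A toy model of the 'least-busy' strategy, not LiteLLM's code."""
    return min(in_flight, key=in_flight.get)

# least_busy({"gpu-server-1": 3, "gpu-server-2": 0, "gpu-server-3": 5})
# picks gpu-server-2, the idle instance
```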
Your applications still request model="llama3" at localhost:4000. LiteLLM distributes the load, retries on failure, and tracks which instance is fastest. This is the pattern that scales a single developer's local setup into shared team infrastructure.

For sharing with a team or running as a persistent service, Docker is the right deployment model. Create a docker-compose.yaml:
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    # GPU support on Linux — uncomment to enable:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_models:
Use docker compose (with a space), the Docker Compose v2 command; the old hyphenated docker-compose command is deprecated. Also omit any top-level version: key, which is no longer required or recommended.

Update litellm_config.yaml to use the Docker service name for Ollama:
litellm_params:
  model: ollama/llama3.1:8b
  api_base: http://ollama:11434   # Docker service name, NOT localhost
Start everything and pull models:
docker compose up -d
docker compose exec ollama ollama pull llama3.1:8b
By default LiteLLM accepts any request without authentication. Fine for local development, not for team deployments. Add to your config:
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # load from env, never hardcode
For more granular control, generate virtual keys — different keys per team or application, each with configurable model access and budget limits:
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["llama3", "mistral"],
    "max_budget": 10,
    "duration": "30d"
  }'
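The same call can be scripted from Python with the standard library. This sketch only builds the request; the endpoint path and payload fields are the ones from the curl example above, and the helper name is illustrative.

```python
import json
import urllib.request

def build_key_request(base_url: str, master_key: str, models: list,
                      max_budget: float, duration: str) -> urllib.request.Request:
    """Build the /key/generate call shown in the curl example above."""
    body = json.dumps({
        "models": models,
        "max_budget": max_budget,
        "duration": duration,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/key/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# req = build_key_request("http://localhost:4000", master_key,
#                         ["llama3", "mistral"], 10, "30d")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read())  # response contains the new virtual key
```

Applications then use the returned virtual key as their api_key instead of the master key, so each team's access and budget can be revoked independently.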
Putting It Together: A Production-Ready Config
Here's a complete litellm_config.yaml that reflects a realistic setup — local models as the default, cloud as a deliberate choice, observability and security configured. All technical details have been verified against current LiteLLM documentation:
model_list:
  # Primary local models
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
  - model_name: qwen-coder
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434

  # Cloud models — available but require explicit selection
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # Smart routing — local first, cloud fallback on failure
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

router_settings:
  routing_strategy: least-busy

litellm_settings:
  drop_params: true
  request_timeout: 600
  num_retries: 3
  fallbacks: [{"smart": ["gpt-4o"]}]
  success_callback: ["langfuse"]   # optional: full observability (article #6)
  failure_callback: ["langfuse"]
  set_verbose: false               # set to true for debugging
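After deploying a config like this, it's worth smoke-testing every model_name once. A sketch: the request function is injected so the check logic stays separate from the network call; the model list mirrors the config above, and the helper names are illustrative.

```python
from typing import Callable

MODELS = ["llama3", "mistral", "qwen-coder", "gpt-4o", "claude-sonnet", "smart"]

def probe(models: list, ask: Callable) -> dict:
    """Fire one tiny request per model_name and record which ones answer."""
    results = {}
    for m in models:
        try:
            ask(m)
            results[m] = True
        except Exception:
            results[m] = False
    return results

# Wire it to the proxy with the OpenAI SDK client from earlier, e.g.:
# ask = lambda m: client.chat.completions.create(
#     model=m, messages=[{"role": "user", "content": "ping"}])
# print(probe(MODELS, ask))
```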
What You've Built
You now have a proper AI gateway that gives you a single endpoint for all AI calls regardless of which model handles them, the ability to route different request types without changing application code, local-first inference with optional cloud fallback, token counting and cost tracking across all models, load balancing across Ollama instances, API key authentication with per-team budget caps, and Docker deployment for shared team infrastructure.
More importantly, you've established the pattern that the rest of this series builds on. The gateway is the control point. Everything that matters — routing decisions, access control, observability, cost management — flows through this single layer. Adding a model means adding a config entry. Changing a routing rule means editing YAML. Your application code stays clean.
Common Issues
- Model not found: the model in your request must exactly match a model_name in your config (case-sensitive). Verify Ollama has the model pulled: ollama list.
- Requests timing out: raise request_timeout in litellm_settings. Local inference is slower than cloud APIs, especially on CPU. 600 seconds (10 minutes) is a reasonable upper bound for long-context tasks.
- LiteLLM in Docker can't reach Ollama: inside a container, localhost refers to the container itself, not your host machine. Your config must use http://ollama:11434, not http://localhost:11434.
- Cloud API keys not picked up: LiteLLM resolves os.environ/OPENAI_API_KEY by looking for the OPENAI_API_KEY environment variable in the shell where LiteLLM is running. Run echo $OPENAI_API_KEY to confirm it's set before starting LiteLLM.
- "docker-compose: command not found": use docker compose (with a space); Docker Compose v2 is built into Docker Desktop and modern Docker installations. The old standalone docker-compose binary is deprecated.