Local Sovereign AI · Article #2

LiteLLM as Your Local AI Gateway —
One API to Rule Them All

You have a local model running. Now you need multiple models, cost tracking, cloud fallbacks, and team access — without rewriting your application each time. A gateway layer solves this. LiteLLM is the open-source tool that makes it practical.

~45 min total · Intermediate · Requires Ollama from Article #1

Time breakdown: Install ~5 min · First config ~10 min · Cloud fallback ~10 min · Docker deploy ~20 min

Ollama from Article #1 must already be running before you start.

The Problem With Talking Directly to Models

After getting Ollama running, the natural next step is to wire it into something useful. A script that summarizes documents. An internal chatbot. A code assistant for your team. And this is where things get complicated faster than expected.

You start with one model. Then you realize some tasks need something stronger, so you pull a second. Then a colleague wants to use the setup too, so now you're thinking about networking. Then you start wondering how many tokens you're actually using. Then a task comes in that your local model handles poorly, and you want to route it to Claude or GPT-4 as a fallback — but only for that type of query, not for queries that contain sensitive data.

Suddenly what was "a model running on localhost" has become an infrastructure problem. You need:

- one endpoint your applications can target, no matter which model answers
- routing rules that decide where each request goes
- token and cost tracking across every call
- access control for teammates sharing the setup
- cloud fallbacks that never see sensitive data

This is exactly what an LLM gateway does. And LiteLLM is the best open-source implementation of one.

What LiteLLM Actually Is

LiteLLM is a Python library and proxy server that presents a single, unified, OpenAI-compatible API in front of over 100 different LLM providers — including local models via Ollama. You point your applications at LiteLLM instead of directly at any specific model or provider. LiteLLM handles routing, API format translation, retry logic, cost tracking, and logging.

Provider abstraction: Every model — Ollama local, OpenAI, Anthropic, Mistral, AWS Bedrock, Azure — speaks a slightly different API dialect. LiteLLM normalizes all of them into a single OpenAI-compatible interface. Your application code never needs to know which provider is actually serving the request.

Fully self-hostable: LiteLLM runs as a Docker container or a plain Python process on your own infrastructure. No data passes through LiteLLM's servers. The proxy runs locally, calls your local Ollama instance locally, everything stays inside your perimeter.

OpenAI-compatible output: If you already have code using the OpenAI SDK, switching to LiteLLM proxy is one configuration line. The response format is identical.

Built-in observability: Every request gets logged with model name, token counts, latency, and estimated cost — even for local models where the cost is $0.

What LiteLLM is not: it's not a model, doesn't store your data by default, and is not a replacement for Ollama. It sits in front of Ollama as a routing and management layer. Ollama continues to do the actual inference.
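The provider-abstraction idea can be sketched in a few lines of Python. This is our own illustration of the alias-to-backend lookup, not LiteLLM's internals; resolve_backend and the dict shapes are invented for the example:

```python
# Illustrative sketch of the gateway's core idea: the app asks for a model
# by alias, the gateway resolves it to a provider-specific target.
# (resolve_backend and these dict shapes are ours, not LiteLLM internals.)

MODEL_LIST = [
    {"model_name": "llama3",
     "litellm_params": {"model": "ollama/llama3.1:8b",
                        "api_base": "http://localhost:11434"}},
    {"model_name": "gpt-4o",
     "litellm_params": {"model": "gpt-4o"}},  # cloud: no local api_base
]

def resolve_backend(alias: str) -> dict:
    """Map the alias the application uses to a concrete backend target."""
    for entry in MODEL_LIST:
        if entry["model_name"] == alias:
            return entry["litellm_params"]
    raise KeyError(f"model not found: {alias}")  # what the proxy reports, too

print(resolve_backend("llama3")["model"])  # ollama/llama3.1:8b
```

The application only ever sees the alias on the left; everything on the right can change without touching application code.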

Gateway Architecture — How Requests Flow
[Diagram] Your application (Python, curl, or any SDK) sends OpenAI-compatible HTTP to the LiteLLM proxy at localhost:4000, which handles routing, cost tracking, rate limiting, auth, and logging. Behind it sit local Ollama backends (llama3.1:8b and mistral, both $0) and a cloud API used only as an explicit fallback.

Your application speaks only to LiteLLM — it never needs to know which backend handles the request.

Step 1: Installation (~5 min)

LiteLLM requires Python 3.8+. Install with the proxy extras:

bash
python3 --version
pip install 'litellm[proxy]'
litellm --version

No Docker required to get started. LiteLLM is a Python process you run alongside Ollama. Make sure Ollama is already running:

bash
# macOS: Ollama runs as menu bar app automatically
# Linux: start if not already running
ollama serve

# Pull a model if you don't have one yet (from Article #1)
ollama pull llama3.1:8b
Step 2: Your First Configuration File (~10 min)

LiteLLM is configured via a YAML file. Create a file called litellm_config.yaml:

yaml — litellm_config.yaml
model_list:
  - model_name: llama3       # The name your apps will use
    litellm_params:
      model: ollama/llama3.1:8b          # How LiteLLM routes it internally
      api_base: http://localhost:11434   # Your local Ollama instance

  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true        # Ignore unsupported params instead of erroring
  request_timeout: 600     # 10 min timeout for slow local inference
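The drop_params setting is worth understanding before you rely on it. What drop_params: true does can be sketched in plain Python; SUPPORTED and apply_drop_params are our own illustration of the behavior, not LiteLLM code:

```python
# What drop_params: true does, conceptually: OpenAI params the backend does
# not support are silently removed before forwarding, instead of erroring.
# The SUPPORTED set is illustrative; the real set varies per backend.

SUPPORTED = {"model", "messages", "temperature", "stream"}

def apply_drop_params(request: dict, drop_params: bool) -> dict:
    unsupported = set(request) - SUPPORTED
    if unsupported and not drop_params:
        raise ValueError(f"unsupported params: {sorted(unsupported)}")
    # drop_params: true -> just forward the supported subset
    return {k: v for k, v in request.items() if k in SUPPORTED}

req = {"model": "llama3", "messages": [], "logit_bias": {"50256": -100}}
print(apply_drop_params(req, drop_params=True))  # logit_bias dropped
```

With drop_params: false the same request would fail loudly, which is sometimes what you want during development.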

Start the proxy:

bash
litellm --config litellm_config.yaml --port 4000

Leave this running. Test it immediately:

bash — verify
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer anything" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Are you running locally?"}]}'
The Authorization: Bearer anything header is required by the OpenAI client format, but LiteLLM ignores the value by default. You'll add real key enforcement in Step 8.
Step 3: Using LiteLLM From Your Code (~5 min)

Your Python code targets LiteLLM on port 4000 instead of Ollama directly:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="anything"  # required by SDK, ignored by LiteLLM in dev mode
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key principles of zero-trust security."}
    ]
)
print(response.choices[0].message.content)

Switch models without changing any other code:

python — switch model
# Same client, same code — just a different model name in the request
response = client.chat.completions.create(
    model="mistral",
    messages=[...]
)
The key insight: The application doesn't know or care that both models are running locally via Ollama. It asks for a model by name and LiteLLM handles where that request actually goes.
Step 4: Adding a Cloud Fallback (~10 min)

This is where the gateway pattern becomes genuinely powerful. Some queries are too complex for your local model. Some are completely safe to send to a cloud API. You want to handle both from the same codebase — without your application needing any logic about which model to use when.

Update your litellm_config.yaml:

yaml — with cloud fallback
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

  # Cloud models — only used when explicitly requested
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # "smart" — tries local first, falls back to cloud on error/timeout
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  request_timeout: 600
  num_retries: 3
  fallbacks: [{"smart": ["gpt-4o"]}]  # if smart fails, fall back to cloud
Config note: The fallbacks format is a list of dicts: [{"source_model": ["fallback_model"]}]. The fallback triggers on errors (5xx, timeouts) — not on slow responses. If the local model is responding slowly but successfully, LiteLLM considers it a success.
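The fallback semantics described in the note can be sketched as plain Python. The complete function, the backend stand-ins, and the FALLBACKS dict are our own illustration of the behavior, not LiteLLM internals:

```python
# Fallback semantics, conceptually: an error (exception / 5xx / timeout)
# triggers the fallback chain; a slow-but-successful response does not.

FALLBACKS = {"smart": ["gpt-4o"]}  # mirrors the YAML: [{"smart": ["gpt-4o"]}]

def complete(model, backends):
    """Try the primary backend; on error, walk the fallback chain."""
    last_err = None
    for name in [model] + FALLBACKS.get(model, []):
        try:
            return name, backends[name]()
        except Exception as e:      # in LiteLLM: 5xx responses and timeouts
            last_err = e
    raise last_err

def local_down():  raise TimeoutError("ollama timed out")
def cloud_ok():    return "response from gpt-4o"

print(complete("smart", {"smart": local_down, "gpt-4o": cloud_ok}))
# ('gpt-4o', 'response from gpt-4o')
```

If the local backend returns successfully, however slowly, the chain never advances, which is exactly the behavior described above.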

Set your API keys as environment variables — never hardcode them:

bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

Now your application can make a deliberate routing choice per request:

python — routing by sensitivity
# Sensitive internal data — always local, never leaves your machine
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": sensitive_internal_query}]
)

# Public content where quality matters — explicit cloud
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": marketing_copy_query}]
)

# Let the gateway decide — try local first, cloud if it fails
response = client.chat.completions.create(
    model="smart",
    messages=[{"role": "user", "content": ambiguous_query}]
)
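If sensitive queries must never reach the cloud even via fallback, a thin client-side helper can pin them to the local model before the request leaves your code. A minimal sketch assuming a keyword heuristic; pick_model and SENSITIVE_MARKERS are our own illustration, not LiteLLM features:

```python
# Client-side guardrail sketch: sensitive requests are pinned to the local
# model *before* they reach the gateway, so no fallback chain can ever
# forward them to a cloud API. The keyword heuristic is deliberately naive.

SENSITIVE_MARKERS = ("ssn", "salary", "password", "internal-only")

def pick_model(query: str) -> str:
    if any(m in query.lower() for m in SENSITIVE_MARKERS):
        return "llama3"   # local only: never enters a fallback chain
    return "smart"        # gateway may fall back to cloud on failure

print(pick_model("What is the CFO's salary band?"))  # llama3
print(pick_model("Draft a blog post intro"))         # smart
```

In practice you would replace the keyword check with whatever data-classification rules your organization already uses.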
Routing Decision — Where Each Request Goes
[Diagram] model="llama3" (sensitive data) and model="smart" (auto-route) go through LiteLLM on :4000 to local Ollama at localhost:11434 (private, $0/token); model="gpt-4o" (public content) goes to the cloud API, explicitly or as fallback (cloud cost per token).

Routing logic lives in YAML config — not in application code. Change policy without touching your app.

Step 5: Cost Tracking and Logging (~5 min)

One of the most underappreciated features of a gateway is immediate visibility into what's happening across all AI calls. LiteLLM logs every request with model, token counts, latency, and estimated cost — even for local models where cost is $0.

Enable verbose logging:

yaml
litellm_settings:
  drop_params: true
  request_timeout: 600
  set_verbose: true    # log all requests to console

For a structured dashboard, recent LiteLLM versions serve an admin UI at /ui. Note that it requires a master key and a database (Postgres) to be configured; check the LiteLLM docs for the exact setup:

bash
litellm --config litellm_config.yaml --port 4000
# Dashboard at http://localhost:4000/ui

Here's what a log entry looks like — even for a free local model:

json — sample log entry
{
  "model": "llama3",
  "provider": "ollama",
  "total_tokens": 847,
  "prompt_tokens": 312,
  "completion_tokens": 535,
  "response_time_ms": 4823,
  "estimated_cost": 0.0,
  "success": true
}
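How a gateway derives estimated_cost is simple arithmetic: token counts times per-token price, with local models priced at zero. A sketch with illustrative prices (the GPT-4o figures here are examples, not current provider pricing):

```python
# How estimated_cost is derived, conceptually: tokens x per-token price,
# with local models priced at zero. Prices below are illustrative only.

PRICE_PER_1K = {                      # USD per 1,000 tokens (prompt, completion)
    "ollama/llama3.1:8b": (0.0, 0.0),
    "gpt-4o": (0.0025, 0.01),         # illustrative, not current pricing
}

def estimated_cost(model, prompt_tokens, completion_tokens):
    p_in, p_out = PRICE_PER_1K[model]
    return prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out

# The sample log entry above: 312 prompt + 535 completion tokens, local model
print(estimated_cost("ollama/llama3.1:8b", 312, 535))  # 0.0
```

Run the same token counts through a cloud model's prices and you get the "cloud equivalent" figure that makes local inference savings concrete.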
What You See in the Logs — Even Free Local Requests
[Dashboard view] MODEL llama3 (ollama / local) · COST $0.00 · TOKENS 847 (312 prompt, 535 completion; 2.1% of context window) · LATENCY 4.8 s (~111 tok/sec on Apple M2) · success. Cloud equivalent: the same request on GPT-4o would cost ~$0.013; at 10,000 requests/day that is ~$130/day saved.
Step 6: Load Balancing Multiple Instances (~5 min config)

If you're running Ollama on several machines or a server with multiple GPUs, LiteLLM can load balance across them automatically. Use the same model_name for multiple backends — LiteLLM treats them as a pool:

yaml — load balancing
model_list:
  - model_name: llama3   # same model_name across all three...
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-1:11434

  - model_name: llama3   # ...LiteLLM distributes load automatically
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-2:11434

  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-3:11434

router_settings:
  routing_strategy: least-busy   # or: latency-based, simple-shuffle
From the application's perspective nothing changes — it still calls model="llama3" at localhost:4000. LiteLLM distributes the load, retries on failure, and tracks which instance is fastest. This is the pattern that scales a single developer's local setup into shared team infrastructure.
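The least-busy strategy itself is easy to picture: among backends registered under the same model_name, pick the one with the fewest in-flight requests. A minimal sketch; least_busy and the request counts are our own illustration, not LiteLLM's router code:

```python
# least-busy routing, conceptually: among backends serving the same
# model_name, pick the one with the fewest in-flight requests.

def least_busy(backends: dict) -> str:
    """backends: api_base -> current in-flight request count."""
    return min(backends, key=backends.get)

pool = {
    "http://gpu-server-1:11434": 4,
    "http://gpu-server-2:11434": 1,
    "http://gpu-server-3:11434": 3,
}
print(least_busy(pool))  # http://gpu-server-2:11434
```

latency-based routing replaces the in-flight count with a moving average of observed response times; the selection step is the same min() over the pool.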
Step 7: Running as a Docker Container (~20 min)

For sharing with a team or running as a persistent service, Docker is the right deployment model. Create a docker-compose.yaml:

yaml — docker-compose.yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    # GPU support on Linux — uncomment to enable:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_models:
Note: Use docker compose (with a space — Docker Compose v2). The old docker-compose hyphenated command is deprecated. Also omit any top-level version: key — it's no longer required or recommended.

Update litellm_config.yaml to use the Docker service name for Ollama:

yaml
litellm_params:
  model: ollama/llama3.1:8b
  api_base: http://ollama:11434   # Docker service name, NOT localhost

Start everything and pull models:

bash
docker compose up -d
docker compose exec ollama ollama pull llama3.1:8b
Step 8: Securing the Gateway (~5 min)

By default LiteLLM accepts any request without authentication. Fine for local development, not for team deployments. Add to your config:

yaml
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # load from env, never hardcode

For more granular control, generate virtual keys — different keys per team or application, each with configurable model access and budget limits:

bash — generate virtual key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["llama3", "mistral"],
    "max_budget": 10,
    "duration": "30d"
  }'
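Conceptually, each virtual key carries a model allowlist and a budget, and the gateway checks both before routing. A sketch of that enforcement logic; VirtualKey and authorize are invented names for illustration, not LiteLLM internals:

```python
# Per-key enforcement, conceptually: a virtual key carries an allowed model
# list and a budget; the gateway checks both before routing the request.
# (VirtualKey / authorize are our own names, not LiteLLM internals.)

from dataclasses import dataclass

@dataclass
class VirtualKey:
    models: list
    max_budget: float   # USD
    spend: float = 0.0

def authorize(key: VirtualKey, model: str, cost: float) -> bool:
    if model not in key.models:
        return False                        # model access blocked
    if key.spend + cost > key.max_budget:
        return False                        # budget cap reached
    key.spend += cost                       # track spend per key
    return True

eng = VirtualKey(models=["llama3", "mistral"], max_budget=10)
print(authorize(eng, "llama3", 0.0))   # True  -- local model, free
print(authorize(eng, "gpt-4o", 0.01))  # False -- not in this key's list
```

This is why the curl above can hand Engineering a key scoped to llama3 and mistral while Marketing's key only ever reaches gpt-4o.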
Virtual Keys — Per-Team Access Control
[Diagram] The master key (sk-master) is admin-only: all models, no budget cap. Virtual keys scope access per team, e.g. Engineering (llama3, mistral · no cap), Marketing (gpt-4o only · $50/mo), Data Science (smart · $200/mo). LiteLLM enforces model access and tracks spend per key, allowing or blocking each request.

Putting It Together: A Production-Ready Config

Here's a complete litellm_config.yaml that reflects a realistic setup — local models as the default, cloud as a deliberate choice, observability and security configured. All technical details have been verified against current LiteLLM documentation:

yaml — complete production config
model_list:
  # Primary local models
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

  - model_name: qwen-coder
    litellm_params:
      model: ollama/qwen3:7b
      api_base: http://localhost:11434

  # Cloud models — available but require explicit selection
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # Smart routing — local first, cloud fallback on failure
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

router_settings:
  routing_strategy: least-busy

litellm_settings:
  drop_params: true
  request_timeout: 600
  num_retries: 3
  fallbacks: [{"smart": ["gpt-4o"]}]
  success_callback: ["langfuse"]   # optional: full observability (article #6)
  failure_callback: ["langfuse"]
  set_verbose: false               # set to true for debugging

What You've Built

You now have a proper AI gateway:

- a single endpoint for all AI calls, regardless of which model handles them
- routing of different request types without changing application code
- local-first inference with optional cloud fallback
- token counting and cost tracking across all models
- load balancing across Ollama instances
- API key authentication with per-team budget caps
- Docker deployment for shared team infrastructure

More importantly, you've established the pattern that the rest of this series builds on. The gateway is the control point. Everything that matters — routing decisions, access control, observability, cost management — flows through this single layer. Adding a model means adding a config entry. Changing a routing rule means editing YAML. Your application code stays clean.

Common Issues

LiteLLM starts but returns "model not found" errors
Check that the model name in your request exactly matches model_name in your config (case-sensitive). Verify Ollama has the model pulled: ollama list.
Requests time out on large prompts
Increase request_timeout in litellm_settings. Local inference is slower than cloud APIs, especially on CPU. 600 seconds (10 minutes) is a reasonable upper bound for long-context tasks.
Fallback to cloud isn't triggering
Fallback triggers on errors (5xx responses, timeouts) — not on slow responses. If your local model is slow but responding, LiteLLM considers it successful. Test by temporarily pointing the primary model at a non-existent endpoint to force an error.
Docker: LiteLLM can't reach Ollama
In Docker Compose, services communicate by service name — not localhost. Your config must use http://ollama:11434, not http://localhost:11434.
"Invalid API key" errors in cloud fallbacks
LiteLLM reads os.environ/OPENAI_API_KEY by looking for the OPENAI_API_KEY environment variable in the shell where LiteLLM is running. Run echo $OPENAI_API_KEY to confirm it's set before starting LiteLLM.
"command not found: docker-compose"
Use docker compose (with a space) — Docker Compose v2 is built into Docker Desktop and modern Docker installations. The old standalone docker-compose binary is deprecated.