LiteLLM as Your Local AI Gateway —
One API to Rule Them All
You have a local model running. Now you need multiple models, cost tracking, cloud fallbacks, and team access — without rewriting your application each time. A gateway layer solves this. LiteLLM is the open-source tool that makes it practical.
The Problem With Talking Directly to Models
After getting Ollama running, the natural next step is to wire it into something useful. A script that summarizes documents. An internal chatbot. A code assistant for your team. And this is where things get complicated faster than expected.
You start with one model. Then you realize some tasks need something stronger, so you pull a second. Then a colleague wants to use the setup too, so now you're thinking about networking. Then you start wondering how many tokens you're actually using. Then a task comes in that your local model handles poorly, and you want to route it to Claude or GPT-4 as a fallback — but only for that type of query, not for queries that contain sensitive data.
Suddenly what was "a model running on localhost" has become an infrastructure problem. You need:
- A single endpoint your applications can target, regardless of which model handles the request
- The ability to switch or add models without changing application code
- Routing logic — send this request to the local model, that one to the cloud
- Token counting and cost tracking, even for local models that don't charge per token
- Rate limiting and load balancing when multiple people or processes use the same setup
- A unified logging layer so you can see what's being asked and what responses are generated
This is exactly what an LLM gateway does. And LiteLLM is the best open-source implementation of one.
What LiteLLM Actually Is
LiteLLM is a Python library and proxy server that presents a single, unified, OpenAI-compatible API in front of over 100 different LLM providers — including local models via Ollama. You point your applications at LiteLLM instead of directly at any specific model or provider. LiteLLM handles routing, API format translation, retry logic, cost tracking, and logging.
Provider abstraction: Every model — Ollama local, OpenAI, Anthropic, Mistral, AWS Bedrock, Azure — speaks a slightly different API dialect. LiteLLM normalizes all of them into a single OpenAI-compatible interface. Your application code never needs to know which provider is actually serving the request.
Fully self-hostable: LiteLLM runs as a Docker container or a plain Python process on your own infrastructure. No data passes through LiteLLM's servers. The proxy runs locally, calls your local Ollama instance locally, everything stays inside your perimeter.
OpenAI-compatible output: If you already have code using the OpenAI SDK, switching to LiteLLM proxy is one configuration line. The response format is identical.
Built-in observability: Every request gets logged with model name, token counts, latency, and estimated cost — even for local models where the cost is $0.
What LiteLLM is not: it's not a model, doesn't store your data by default, and is not a replacement for Ollama. It sits in front of Ollama as a routing and management layer. Ollama continues to do the actual inference.
Your application speaks only to LiteLLM — it never needs to know which backend handles the request.
LiteLLM requires Python 3.8+. Install with the proxy extras:
python3 --version
pip install 'litellm[proxy]'
litellm --version
No Docker required to get started. LiteLLM is a Python process you run alongside Ollama. Make sure Ollama is already running:
# macOS: Ollama runs as menu bar app automatically
# Linux: start if not already running
ollama serve
# Pull a model if you don't have one yet (from Article #1)
ollama pull llama3.1:8b
LiteLLM is configured via a YAML file. Create a file called litellm_config.yaml:
model_list:
  - model_name: llama3                       # The name your apps will use
    litellm_params:
      model: ollama/llama3.1:8b              # How LiteLLM routes it internally
      api_base: http://localhost:11434       # Your local Ollama instance
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true        # Ignore unsupported params instead of erroring
  request_timeout: 600     # 10 min timeout for slow local inference
Start the proxy:
litellm --config litellm_config.yaml --port 4000
Leave this running. Test it immediately:
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer anything" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Are you running locally?"}]}'
The Authorization: Bearer anything header is required by the OpenAI client format, but LiteLLM ignores the value by default. You'll add real key enforcement in Step 8.

Your Python code targets LiteLLM on port 4000 instead of Ollama directly:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="anything"  # required by the SDK, ignored by LiteLLM in dev mode
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key principles of zero-trust security."}
    ]
)
print(response.choices[0].message.content)
Switch models without changing any other code:
# Same client, same code — just a different model name in the request
response = client.chat.completions.create(
    model="mistral",
    messages=[...]
)
This is where the gateway pattern becomes genuinely powerful. Some queries are too complex for your local model. Some are completely safe to send to a cloud API. You want to handle both from the same codebase — without your application needing any logic about which model to use when.
Update your litellm_config.yaml:
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434

  # Cloud models — only used when explicitly requested
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # "smart" — tries local first, falls back to cloud on error/timeout
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  request_timeout: 600
  num_retries: 3
  fallbacks: [{"smart": ["gpt-4o"]}]   # if smart fails, fall back to cloud
The fallbacks format is a list of dicts: [{"source_model": ["fallback_model"]}]. The fallback triggers on errors (5xx, timeouts), not on slow responses: if the local model is responding slowly but successfully, LiteLLM considers it a success.

Set your API keys as environment variables; never hardcode them:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
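LiteLLM applies the fallback server-side, so your application never sees it happen. To build intuition for the semantics, here is a toy client-side sketch of the same rule: an error from the primary triggers the fallback, while a slow-but-successful response does not. The function names are illustrative, not LiteLLM APIs.

```python
from typing import Callable

def with_fallback(primary: Callable[[str], str],
                  fallback: Callable[[str], str],
                  prompt: str) -> str:
    """Try the primary model; on any error, retry the fallback."""
    try:
        return primary(prompt)   # a slow-but-successful call still counts as success
    except Exception:
        return fallback(prompt)  # errors (timeouts, 5xx) trigger the fallback
```

This mirrors why a sluggish local model never falls through to the cloud: only a raised error reaches the fallback path.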
Now your application can make a deliberate routing choice per request:
# Sensitive internal data — always local, never leaves your machine
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": sensitive_internal_query}]
)

# Public content where quality matters — explicit cloud
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": marketing_copy_query}]
)

# Let the gateway decide — try local first, cloud if it fails
response = client.chat.completions.create(
    model="smart",
    messages=[{"role": "user", "content": ambiguous_query}]
)
Routing logic lives in YAML config — not in application code. Change policy without touching your app.
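The application still has to choose which model_name to request. A small policy function keeps that choice in one place; this is a sketch using the model names from the config above, and the policy criteria are illustrative.

```python
def pick_model(sensitive: bool, needs_quality: bool) -> str:
    """Map request traits to a model_name from litellm_config.yaml.
    The names match the config above; the policy itself is illustrative."""
    if sensitive:
        return "llama3"   # sensitive data never leaves the machine
    if needs_quality:
        return "gpt-4o"   # deliberate cloud choice
    return "smart"        # local first, cloud fallback on failure

# client.chat.completions.create(
#     model=pick_model(sensitive=True, needs_quality=False), messages=[...])
```

If the routing policy ever changes (say, a new local model replaces llama3), you edit one function and the YAML, not every call site.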
One of the most underappreciated features of a gateway is immediate visibility into what's happening across all AI calls. LiteLLM logs every request with model, token counts, latency, and estimated cost — even for local models where cost is $0.
Enable verbose logging:
litellm_settings:
  drop_params: true
  request_timeout: 600
  set_verbose: true   # log all requests to console
For a structured UI dashboard, start with:
litellm --config litellm_config.yaml --port 4000 --ui
# Dashboard at http://localhost:4000/ui
Here's what a log entry looks like — even for a free local model:
{
  "model": "llama3",
  "provider": "ollama",
  "total_tokens": 847,
  "prompt_tokens": 312,
  "completion_tokens": 535,
  "response_time_ms": 4823,
  "estimated_cost": 0.0,
  "success": true
}
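Once you collect entries of this shape (from the console log or a log file), a few lines of Python give you a per-model summary. A sketch, assuming a list of dicts with the fields shown above:

```python
from collections import defaultdict

def summarize(entries: list) -> dict:
    """Aggregate LiteLLM-style log entries (shape as above) per model."""
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost": 0.0})
    for e in entries:
        t = totals[e["model"]]
        t["requests"] += 1
        t["tokens"] += e["total_tokens"]
        t["cost"] += e["estimated_cost"]
    return dict(totals)

# summarize(logs) -> {"llama3": {"requests": ..., "tokens": ..., "cost": ...}, ...}
```

Token counts for local models are the interesting number here: they tell you what the same workload would cost on a paid API.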
If you're running Ollama on several machines or a server with multiple GPUs, LiteLLM can load balance across them automatically. Use the same model_name for multiple backends — LiteLLM treats them as a pool:
model_list:
  - model_name: llama3              # same model_name across all three...
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-1:11434
  - model_name: llama3              # ...LiteLLM distributes load automatically
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-2:11434
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://gpu-server-3:11434

router_settings:
  routing_strategy: least-busy   # or: latency-based, simple-shuffle
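To build intuition for what least-busy means, here is a toy model of the idea: pick the backend with the fewest in-flight requests. This is a conceptual sketch, not LiteLLM's actual implementation.

```python
def least_busy(in_flight: dict) -> str:
    """Pick the api_base with the fewest in-flight requests.
    A toy model of the 'least-busy' strategy, not LiteLLM's code."""
    return min(in_flight, key=in_flight.get)

# least_busy({"gpu-server-1": 3, "gpu-server-2": 0, "gpu-server-3": 5})
# picks gpu-server-2, the idle instance
```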
Your applications still request model="llama3" at localhost:4000. LiteLLM distributes the load, retries on failure, and tracks which instance is fastest. This is the pattern that scales a single developer's local setup into shared team infrastructure.

For sharing with a team or running as a persistent service, Docker is the right deployment model. Create a docker-compose.yaml:
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    # GPU support on Linux — uncomment to enable:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_models:
Use docker compose (with a space), the Docker Compose v2 command; the old hyphenated docker-compose command is deprecated. Also omit any top-level version: key, which is no longer required or recommended.

Update litellm_config.yaml to use the Docker service name for Ollama:
litellm_params:
  model: ollama/llama3.1:8b
  api_base: http://ollama:11434   # Docker service name, NOT localhost
Start everything and pull models:
docker compose up -d
docker compose exec ollama ollama pull llama3.1:8b
By default LiteLLM accepts any request without authentication. Fine for local development, not for team deployments. Add to your config:
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # load from env, never hardcode
For more granular control, generate virtual keys — different keys per team or application, each with configurable model access and budget limits:
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["llama3", "mistral"],
    "max_budget": 10,
    "duration": "30d"
  }'
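The same call can be scripted from Python with the standard library. This sketch only builds the request; the endpoint path and payload fields are the ones from the curl example above, and the helper name is illustrative.

```python
import json
import urllib.request

def build_key_request(base_url: str, master_key: str, models: list,
                      max_budget: float, duration: str) -> urllib.request.Request:
    """Build the /key/generate call shown in the curl example above."""
    body = json.dumps({
        "models": models,
        "max_budget": max_budget,
        "duration": duration,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/key/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# req = build_key_request("http://localhost:4000", master_key,
#                         ["llama3", "mistral"], 10, "30d")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read())  # response contains the new virtual key
```

Applications then use the returned virtual key as their api_key instead of the master key, so each team's access and budget can be revoked independently.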
Putting It Together: A Production-Ready Config
Here's a complete litellm_config.yaml that reflects a realistic setup — local models as the default, cloud as a deliberate choice, observability and security configured. All technical details have been verified against current LiteLLM documentation:
model_list:
  # Primary local models
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://localhost:11434
  - model_name: qwen-coder
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434

  # Cloud models — available but require explicit selection
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

  # Smart routing — local first, cloud fallback on failure
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

router_settings:
  routing_strategy: least-busy

litellm_settings:
  drop_params: true
  request_timeout: 600
  num_retries: 3
  fallbacks: [{"smart": ["gpt-4o"]}]
  success_callback: ["langfuse"]   # optional: full observability (article #6)
  failure_callback: ["langfuse"]
  set_verbose: false               # set to true for debugging
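After deploying a config like this, it's worth smoke-testing every model_name once. A sketch: the request function is injected so the check logic stays separate from the network call; the model list mirrors the config above, and the helper names are illustrative.

```python
from typing import Callable

MODELS = ["llama3", "mistral", "qwen-coder", "gpt-4o", "claude-sonnet", "smart"]

def probe(models: list, ask: Callable) -> dict:
    """Fire one tiny request per model_name and record which ones answer."""
    results = {}
    for m in models:
        try:
            ask(m)
            results[m] = True
        except Exception:
            results[m] = False
    return results

# Wire it to the proxy with the OpenAI SDK client from earlier, e.g.:
# ask = lambda m: client.chat.completions.create(
#     model=m, messages=[{"role": "user", "content": "ping"}])
# print(probe(MODELS, ask))
```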
What You've Built
You now have a proper AI gateway that gives you a single endpoint for all AI calls regardless of which model handles them, the ability to route different request types without changing application code, local-first inference with optional cloud fallback, token counting and cost tracking across all models, load balancing across Ollama instances, API key authentication with per-team budget caps, and Docker deployment for shared team infrastructure.
More importantly, you've established the pattern that the rest of this series builds on. The gateway is the control point. Everything that matters — routing decisions, access control, observability, cost management — flows through this single layer. Adding a model means adding a config entry. Changing a routing rule means editing YAML. Your application code stays clean.
Common Issues
- Model not found: the model in your request must exactly match a model_name in your config (case-sensitive). Verify Ollama has the model pulled: ollama list.
- Requests timing out: raise request_timeout in litellm_settings. Local inference is slower than cloud APIs, especially on CPU. 600 seconds (10 minutes) is a reasonable upper bound for long-context tasks.
- LiteLLM in Docker can't reach Ollama: inside a container, localhost refers to the container itself, not your host machine. Your config must use http://ollama:11434, not http://localhost:11434.
- Cloud API keys not picked up: LiteLLM resolves os.environ/OPENAI_API_KEY by looking for the OPENAI_API_KEY environment variable in the shell where LiteLLM is running. Run echo $OPENAI_API_KEY to confirm it's set before starting LiteLLM.
- "docker-compose: command not found": use docker compose (with a space); Docker Compose v2 is built into Docker Desktop and modern Docker installations. The old standalone docker-compose binary is deprecated.