Local Sovereign AI · Article #6

Local AI Observability —
You Can't Trust What You Can't See

Your LLM returns HTTP 200 and the output looks fine. But is it actually correct? Did retrieval find the right documents? Are prompts drifting? Is latency creeping up? Without observability, you're flying blind — and AI systems fail silently in ways traditional monitoring completely misses.

~50 min · Intermediate · LiteLLM from Article #2 helpful but not required
Time breakdown: Langfuse setup ~15 min · LiteLLM integration ~5 min · Python instrumentation ~15 min · LangChain tracing ~15 min
Docker required. Langfuse runs as a local Docker service — all data stays on your machine.

Why AI Observability Is Different From Regular Monitoring

Traditional application monitoring — uptime checks, error rates, latency percentiles — tells you whether your infrastructure is healthy. It doesn't tell you whether your AI is working.

An LLM can return HTTP 200 with a plausible-sounding response that is completely wrong. It can hallucinate facts confidently. A prompt change from two days ago can silently degrade output quality across every user request without any server error. A RAG pipeline can start returning irrelevant document chunks — again, no error, just quietly bad answers. A cost spike can accumulate for days before anyone notices the token counts are 10x higher than expected.

These failures have no equivalent in traditional APM. The model "worked" — it generated tokens. But the output was useless. Catching this requires a different kind of observability: one that captures what went into the model, what came out, and what happened in between — at every request, across every user session, in a searchable, queryable form.

The Five Signals That Matter

Latency — broken down
Not just total response time, but time-to-first-token vs. generation time vs. retrieval time. Slow retrieval is a different problem than a slow model.
Token usage per request
Input tokens + output tokens, per model, per user, over time. Token creep — prompts slowly growing — is invisible without tracking this. Cost is derived directly from token counts.
Full prompt + response pairs
The exact prompt sent to the model and the exact response received. Debugging a bad output without seeing the prompt is guesswork. This is the most important thing to capture.
Tool calls and retrieval steps
For agents and RAG: which tools were called, with what inputs, and what came back. Did retrieval return relevant chunks? Did the agent use the right tool? These are separate observable events.
Errors and retries
Model errors, timeouts, context-length exceeded, guardrail rejections, invalid tool calls. Error rates per model and endpoint identify flaky configurations before users notice.
Sessions and user attribution
Group traces into conversations. Attribute requests to users. When a user reports "the bot gave me wrong information", you need to find that exact session and trace it.

The Tool: Langfuse

Langfuse is an open-source LLM observability platform that covers all five signals above, can be self-hosted completely, and integrates with everything in our stack: LiteLLM, LangChain, and direct Python code via a decorator. It stores your traces in Postgres and ClickHouse — both running locally in Docker — and provides a web UI at localhost:3000.

The critical point for a data-sovereign stack: when self-hosted, your prompts, responses, and user data never leave your machine. The cloud version of Langfuse sends data to their servers. Self-hosting keeps it local.

How Langfuse Fits Into the Stack
[Diagram: Your App (LangChain / API) → LiteLLM Gateway :4000 → Ollama / vLLM :11434 / :8000, with traces from every layer flowing to Langfuse :3000 via @observe() and callbacks. Everything local; no data leaves your machine.]
Step 1: Self-Host Langfuse (~15 min)

Langfuse runs as a set of Docker containers: a web UI + API server, an async worker, Postgres, ClickHouse (for trace storage), Redis, and MinIO (for event uploads). The official docker-compose handles all of it:

bash
# Clone Langfuse and start it — takes ~2-3 minutes on first run
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d

Wait until you see langfuse-web-1 log "Ready". Then open http://localhost:3000 in your browser, create an account, create an organization, and create a project. Copy the Public Key and Secret Key from Settings → API Keys — you'll need them in the next steps.

Change the default secrets before exposing to a network. The official docker-compose.yml marks several lines with # CHANGEME: NEXTAUTH_SECRET, SALT, CLICKHOUSE_PASSWORD, and the MinIO secret key. For local-only use they're fine, but if the machine is accessible on your network, update them first.

To update Langfuse later:

bash
docker compose pull && docker compose up -d
Step 2: Connect LiteLLM — Zero Code Required (~5 min)

If you have the LiteLLM gateway from Article #2 running, connecting Langfuse takes a three-line litellm_settings block in your litellm_config.yaml plus three environment variables. Every request that passes through the gateway will be automatically traced — no changes to any client code:

yaml — litellm_config.yaml (add the success_callback section)
model_list:
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

# Add this block to enable Langfuse tracing:
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

# Set these environment variables before starting LiteLLM:
# LANGFUSE_PUBLIC_KEY=pk-lf-your-public-key
# LANGFUSE_SECRET_KEY=sk-lf-your-secret-key
# LANGFUSE_HOST=http://localhost:3000
bash — start LiteLLM with Langfuse env vars
export LANGFUSE_PUBLIC_KEY="pk-lf-your-key"
export LANGFUSE_SECRET_KEY="sk-lf-your-key"
export LANGFUSE_HOST="http://localhost:3000"

litellm --config litellm_config.yaml --port 4000

Make a test request through LiteLLM, then open Langfuse at http://localhost:3000 → your project → Traces. You should see the request appear within a few seconds with the full prompt, response, model name, latency, and token counts — all automatically captured.

This is the fastest win in this article. If LiteLLM is already your gateway, you get full observability for every model request — Ollama, vLLM, OpenAI, Anthropic, anything — with three lines of config and zero application code changes.
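For that first test request, a stdlib-only smoke test against the gateway works. This sketch assumes the Article #2 setup (the model alias "smart" and port 4000 come from that config); whether the OpenAI-style `user` field reaches Langfuse's user attribution depends on your LiteLLM configuration:

```python
import json
import urllib.request

GATEWAY = "http://localhost:4000"  # LiteLLM gateway from Article #2

def chat_payload(question: str, user_id: str) -> dict:
    # OpenAI-style request body; "smart" is the model alias from
    # litellm_config.yaml. Forwarding of `user` into Langfuse attribution
    # depends on your LiteLLM configuration.
    return {
        "model": "smart",
        "messages": [{"role": "user", "content": question}],
        "user": user_id,
    }

def ask_gateway(question: str, user_id: str = "smoke-test") -> str:
    req = urllib.request.Request(
        f"{GATEWAY}/chat/completions",
        data=json.dumps(chat_payload(question, user_id)).encode(),
        # Add an Authorization header here if you set a LiteLLM master key
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Live usage (requires LiteLLM + Ollama running):
#   print(ask_gateway("What is data sovereignty?"))
```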
Step 3: Direct Python Instrumentation (~15 min)

For code that doesn't go through LiteLLM — standalone scripts, custom pipelines, direct Ollama calls — instrument with the Langfuse Python SDK. The @observe() decorator creates a trace for any function automatically:

bash
pip install langfuse
python — basic instrumentation with @observe()
import os
from langfuse import get_client, observe
from langchain_ollama import ChatOllama

# Configure Langfuse — point at your local instance
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-your-key"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-your-key"
os.environ["LANGFUSE_HOST"] = "http://localhost:3000"

langfuse = get_client()
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")


@observe()  # Creates a trace for every call to this function
def answer_question(question: str, user_id: str = "anonymous") -> str:
    # Add metadata to the current trace
    langfuse.update_current_trace(
        user_id=user_id,
        session_id=f"session-{user_id}",
        tags=["qa-pipeline"],
    )

    messages = [{"role": "user", "content": question}]
    response = llm.invoke(messages)
    return response.content


if __name__ == "__main__":
    answer = answer_question("What is data sovereignty?", user_id="marcin")
    print(answer)

    # IMPORTANT: flush pending traces before script exits
    # Langfuse sends traces asynchronously — without flush(), short-lived
    # scripts may exit before all traces are sent
    langfuse.flush()

The @observe() decorator automatically captures the function's inputs, outputs, and execution time. Nested @observe() functions create nested spans — the trace tree mirrors your call stack.

For more granular control — manually marking specific steps, recording token counts, or tracking costs — use the span context manager directly:

python — manual spans for multi-step pipelines
@observe()
def rag_pipeline(question: str) -> str:
    # Step 1: retrieval — tracked as a separate span
    with langfuse.start_as_current_observation(
        as_type="span", name="retrieve-context"
    ) as retrieval_span:
        chunks = retrieve_from_qdrant(question)
        retrieval_span.update(output={"chunks_found": len(chunks)})

    # Step 2: generation — tracked as a generation span with token metadata
    with langfuse.start_as_current_observation(
        as_type="generation",
        name="llm-response",
        model="llama3.1:8b",
    ) as gen_span:
        prompt = build_prompt(question, chunks)
        response = llm.invoke(prompt)
        gen_span.update(
            input=prompt,
            output=response.content,
            usage={
                "input": response.response_metadata.get("prompt_eval_count", 0),
                "output": response.response_metadata.get("eval_count", 0),
            }
        )
    return response.content
Step 4: LangChain Integration (~15 min)

For LangChain-based pipelines, Langfuse provides a CallbackHandler that automatically captures every LLM call, retrieval step, tool invocation, and chain execution — with their full inputs and outputs, nested in the correct trace hierarchy:

python — LangChain + Langfuse callback
from langfuse.langchain import CallbackHandler
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create the Langfuse callback handler — configure once, pass everywhere
langfuse_handler = CallbackHandler()

llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
prompt = ChatPromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm | StrOutputParser()

# Pass the callback in config — every step in the chain is traced automatically
result = chain.invoke(
    {"question": "What is PagedAttention?"},
    config={"callbacks": [langfuse_handler]}
)
print(result)

For the RAG pipeline from Article #3, add the callback to the chain invocation:

python — RAG chain with tracing
# In query.py from Article #3, update the ask() function:
from langfuse.langchain import CallbackHandler

langfuse_handler = CallbackHandler()

def ask(question: str, rag_chain, retriever, user_id: str = "user"):
    answer = rag_chain.invoke(
        question,
        config={
            "callbacks": [langfuse_handler],
            # metadata passed to trace:
            "metadata": {"user_id": user_id, "pipeline": "rag-v1"}
        }
    )
    return answer

In Langfuse's UI, you'll see a nested trace for every RAG query: the top-level chain → the retrieval step with the returned document chunks → the prompt construction → the LLM call with prompt and response. You can click any step to see its exact input and output.

What to Actually Look At in Langfuse

Having traces is only useful if you know what to monitor. Here are the specific views and metrics that matter for a local AI stack:

Trace list and search

The main view. Filter by user, time range, model, tags, or latency. When a user reports a problem, search by their user ID and time. Find the exact trace. See the exact prompt and response. This is the most valuable debugging tool you'll have.
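When you'd rather search programmatically than click through the UI, the same trace list is reachable over Langfuse's REST API. A minimal stdlib-only sketch, assuming the endpoint shape GET /api/public/traces with HTTP Basic auth (public key as username, secret key as password):

```python
import base64
import json
import urllib.parse
import urllib.request

LANGFUSE_HOST = "http://localhost:3000"
PUBLIC_KEY = "pk-lf-your-key"
SECRET_KEY = "sk-lf-your-key"

def trace_query_url(user_id: str, limit: int = 20) -> str:
    # Per-user trace search against the public API (endpoint shape assumed)
    params = urllib.parse.urlencode({"userId": user_id, "limit": limit})
    return f"{LANGFUSE_HOST}/api/public/traces?{params}"

def fetch_user_traces(user_id: str, limit: int = 20) -> list[dict]:
    # Basic auth: public key as username, secret key as password
    token = base64.b64encode(f"{PUBLIC_KEY}:{SECRET_KEY}".encode()).decode()
    req = urllib.request.Request(
        trace_query_url(user_id, limit),
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["data"]

# Live usage (requires Langfuse running and your real project keys):
#   for t in fetch_user_traces("marcin"):
#       print(t["id"], t.get("name"), t.get("latency"))
```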

Token usage over time

Navigate to Metrics → Token Usage. Watch for upward trends. If average input token count per request is growing, your prompts are getting longer — often silently, due to growing conversation history or retrieved context. Catching this early prevents runaway costs if you're also routing to paid API providers via LiteLLM.

Latency breakdown

Metrics → Latency. Look at P50, P95, and P99. The gap between P50 and P99 tells you how predictable your system is. A P99 that's 10x the P50 means occasional requests are hanging — often due to model loading time, queue buildup, or a slow retrieval step. The per-step breakdown shows you whether latency is in the LLM or in retrieval.
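If you export latency samples (from Langfuse or your own logs), the same percentiles are easy to compute with the standard library, which makes the P50/P99 gap concrete:

```python
from statistics import quantiles

def latency_percentiles(samples: list[float]) -> dict[str, float]:
    # quantiles() with n=100 returns 99 cut points: index 49 is P50, etc.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Seven normal requests and one outlier (e.g. a cold model load)
lat = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 14.0]
stats = latency_percentiles(lat)
print(stats)  # here P99 lands around 13x the P50: tail latency worth chasing in traces
```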

Error rates per model

Filter traces by "error" status. Group by model. If one model is failing more than others — context window exceeded, timeouts, parsing errors — this shows it immediately. Silent errors (model returns a response but the chain fails downstream) also surface here when you use the failure_callback in LiteLLM.
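The same grouping can be done offline on exported traces. A small sketch (the `model` and `status` field names and model names are illustrative, not an exact export schema):

```python
from collections import Counter

def error_rates(traces: list[dict]) -> dict[str, float]:
    # Fraction of error-status traces per model
    total, errors = Counter(), Counter()
    for t in traces:
        model = t.get("model", "unknown")
        total[model] += 1
        if t.get("status") == "error":
            errors[model] += 1
    return {m: errors[m] / total[m] for m in total}

traces = [
    {"model": "llama3.1:8b", "status": "ok"},
    {"model": "llama3.1:8b", "status": "error"},  # e.g. context length exceeded
    {"model": "qwen2.5:7b", "status": "ok"},
]
print(error_rates(traces))  # {'llama3.1:8b': 0.5, 'qwen2.5:7b': 0.0}
```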

Session replay

Click any trace → Session. See the entire multi-turn conversation that request was part of. When debugging a bad output, context from earlier in the conversation often explains it — the model was given contradictory information five turns ago. Without session replay, this is invisible.

Adding Quality Scores — Moving Beyond Latency

Latency and token counts are infrastructure metrics. Quality is the harder problem — was the answer actually correct? Langfuse supports three ways to attach quality scores to traces:

User feedback: If your application has a thumbs up/down button, post the feedback to Langfuse via the API. Every trace then carries a user satisfaction signal.

python — recording user feedback on a trace
from langfuse import get_client

langfuse = get_client()

# After the user rates the response (called from your application's feedback handler):
def record_feedback(trace_id: str, thumbs_up: bool):
    langfuse.score(
        trace_id=trace_id,
        name="user-feedback",
        value=1 if thumbs_up else 0,
        comment="thumbs up" if thumbs_up else "thumbs down"
    )
    langfuse.flush()

Automated LLM-as-judge: For factual Q&A or RAG, use a second (potentially smaller and faster) model to evaluate whether the answer is supported by the retrieved context. Langfuse has built-in evaluation templates for this.
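A hedged sketch of the LLM-as-judge pattern: the prompt wording and verdict parsing below are our own conventions, not one of Langfuse's built-in templates, and the judge call is shown commented out because it needs a running model:

```python
JUDGE_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Is the answer fully supported by the context? Reply with exactly one "
    "word: SUPPORTED or UNSUPPORTED."
)

def build_judge_prompt(context: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(context=context, answer=answer)

def parse_verdict(judge_output: str) -> int:
    # Map the judge's reply to a 1/0 score; unparseable output scores 0
    text = judge_output.upper()
    if "UNSUPPORTED" in text:
        return 0
    return 1 if "SUPPORTED" in text else 0

# Live usage (requires Ollama; using a local model as the judge is an
# assumption, not a Langfuse requirement):
#   from langchain_ollama import ChatOllama
#   judge = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
#   verdict = judge.invoke(build_judge_prompt(context, answer)).content
#   langfuse.score(trace_id=trace_id, name="faithfulness",
#                  value=parse_verdict(verdict))
```

A smaller, faster judge keeps evaluation cheap enough to run on every trace rather than on samples.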

Manual labeling: The Langfuse UI has a labeling queue — route low-confidence or flagged traces there for human review. Useful for compliance scenarios where a human needs to sign off on AI outputs.

Common Issues

Langfuse containers start but UI shows no traces
Two common causes: (1) The worker container isn't running — check docker compose ps and ensure langfuse-worker-1 is Up. It processes events asynchronously; without it, traces are queued but never ingested. (2) Your API keys don't match — the Public Key and Secret Key in your app environment must match the keys shown in Langfuse UI → Settings → API Keys for the correct project.
LiteLLM callback throws "Connection refused" to Langfuse
LiteLLM can't reach Langfuse at the configured host. If LiteLLM runs directly on the host machine, set LANGFUSE_HOST=http://localhost:3000 explicitly. If LiteLLM runs inside Docker on the same compose network, localhost resolves to the LiteLLM container itself, so use the service hostname instead: http://langfuse-web:3000. Check docker compose logs litellm for the exact error.
Traces appear in Langfuse but token counts show 0
Ollama doesn't always include token usage in its response metadata by default. Token counts in Langfuse traces come from the model's response metadata. For Ollama, check that response.response_metadata contains prompt_eval_count and eval_count — if not, these aren't being returned. You can manually set them via gen_span.update(usage={...}) as shown in the manual spans example above.
Short Python scripts complete but no traces appear in Langfuse
Langfuse sends traces asynchronously in a background thread. If your script exits before the background thread flushes, traces are lost. Always call langfuse.flush() at the end of short-lived scripts. In long-running services (FastAPI, etc.), the background flush happens automatically and you don't need to call flush explicitly.
Langfuse is consuming too much disk space
ClickHouse stores all trace data and can grow quickly for high-volume usage. Check disk usage with docker system df. You can configure trace retention in Langfuse Settings → Data Retention. For development, you can periodically purge traces: in the UI, go to Settings → Danger Zone → Delete all traces. For production, set a retention policy from day one.

What You've Built

You now have a complete observability layer for your local AI stack. Every LLM request — whether it comes through LiteLLM, a LangChain pipeline, or direct Python code — is captured with its prompt, response, latency, token counts, and user context. You can search, filter, and replay any trace. You can see quality scores and user feedback alongside infrastructure metrics.

More importantly: you're no longer flying blind. When something goes wrong, you have the data to understand what happened. When a user reports a bad response, you can find it in under 30 seconds. When latency creeps up, you can pinpoint which step in the pipeline is responsible.