Local AI Observability —
You Can't Trust What You Can't See
Your LLM returns HTTP 200 and the output looks fine. But is it actually correct? Did retrieval find the right documents? Are prompts drifting? Is latency creeping up? Without observability, you're flying blind — and AI systems fail silently in ways traditional monitoring completely misses.
Why AI Observability Is Different From Regular Monitoring
Traditional application monitoring — uptime checks, error rates, latency percentiles — tells you whether your infrastructure is healthy. It doesn't tell you whether your AI is working.
An LLM can return HTTP 200 with a plausible-sounding response that is completely wrong. It can hallucinate facts confidently. A prompt change from two days ago can silently degrade output quality across every user request without any server error. A RAG pipeline can start returning irrelevant document chunks — again, no error, just quietly bad answers. A cost spike can accumulate for days before anyone notices the token counts are 10x higher than expected.
These failures have no equivalent in traditional APM. The model "worked" — it generated tokens. But the output was useless. Catching this requires a different kind of observability: one that captures what went into the model, what came out, and what happened in between — at every request, across every user session, in a searchable, queryable form.
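To make "what went in, what came out, and what happened in between" concrete, here is a hypothetical sketch (not Langfuse's actual schema) of the minimum fields a trace record needs before AI failures become debuggable:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical illustration — NOT Langfuse's real data model.
# The point: a trace pairs the full input/output with the metadata
# you need to search, filter, and replay it later.
@dataclass
class TraceRecord:
    input: str            # the full prompt sent to the model
    output: str           # the full response that came back
    model: str            # which model produced it
    latency_ms: float     # how long generation took
    input_tokens: int     # prompt size — watch for silent growth
    output_tokens: int
    user_id: str          # who asked — lets you find a reported bad answer
    tags: list[str] = field(default_factory=list)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = TraceRecord(
    input="What is data sovereignty?",
    output="Data sovereignty means...",
    model="llama3.1:8b",
    latency_ms=840.0,
    input_tokens=12,
    output_tokens=96,
    user_id="marcin",
    tags=["qa-pipeline"],
)
print(record.model, record.latency_ms)
```

Every tool discussed below ultimately captures some superset of these fields; the differences are in how automatically they are collected and how well you can query them afterwards.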
The Five Signals That Matter
The Tool: Langfuse
Langfuse is an open-source LLM observability platform that covers all five signals above, can be self-hosted completely, and integrates with everything in our stack: LiteLLM, LangChain, and direct Python code via a decorator. It stores your traces in Postgres and ClickHouse — both running locally in Docker — and provides a web UI at localhost:3000.
The critical point for a data-sovereign stack: when self-hosted, your prompts, responses, and user data never leave your machine. The cloud version of Langfuse sends data to their servers. Self-hosting keeps it local.
Langfuse runs as a set of Docker containers: a web UI + API server, an async worker, Postgres, ClickHouse (for trace storage), Redis, and MinIO (for event uploads). The official docker-compose handles all of it:
# Clone Langfuse and start it — takes ~2-3 minutes on first run
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
Wait until you see langfuse-web-1 log "Ready". Then open http://localhost:3000 in your browser, create an account, create an organization, and create a project. Copy the Public Key and Secret Key from Settings → API Keys — you'll need them in the next steps.
The docker-compose.yml marks several values with # CHANGEME: NEXTAUTH_SECRET, SALT, CLICKHOUSE_PASSWORD, and the MinIO secret key. The defaults are fine for local-only use, but if the machine is reachable from your network, change them before the first start.
To update Langfuse later:
docker compose pull && docker compose up -d
If you have the LiteLLM gateway from Article #2 running, connecting Langfuse takes only a short litellm_settings block in your litellm_config.yaml. Every request that passes through the gateway is then traced automatically — no changes to any client code:
model_list:
  - model_name: smart
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://localhost:11434

# Add this block to enable Langfuse tracing:
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
# Set these environment variables before starting LiteLLM:
# LANGFUSE_PUBLIC_KEY=pk-lf-your-public-key
# LANGFUSE_SECRET_KEY=sk-lf-your-secret-key
# LANGFUSE_HOST=http://localhost:3000
export LANGFUSE_PUBLIC_KEY="pk-lf-your-key"
export LANGFUSE_SECRET_KEY="sk-lf-your-key"
export LANGFUSE_HOST="http://localhost:3000"
litellm --config litellm_config.yaml --port 4000
Make a test request through LiteLLM, then open Langfuse at http://localhost:3000 → your project → Traces. You should see the request appear within a few seconds with the full prompt, response, model name, latency, and token counts — all automatically captured.
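If you want to script that test request, here is a minimal sketch using only the standard library. It assumes the gateway from the config above is listening on localhost:4000 with a model alias named "smart" — adjust both if your setup differs:

```python
import json
import urllib.request

# Assumed gateway address from the LiteLLM setup above
GATEWAY_URL = "http://localhost:4000/v1/chat/completions"

def build_payload(question: str) -> dict:
    # OpenAI-compatible chat payload — LiteLLM resolves the "smart"
    # alias to ollama/llama3.1:8b per the config above
    return {
        "model": "smart",
        "messages": [{"role": "user", "content": question}],
    }

def send_test_request(question: str) -> dict:
    # Only works with the gateway actually running
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the gateway up, uncomment to send a traced request:
# print(send_test_request("Say hello")["choices"][0]["message"]["content"])
print(json.dumps(build_payload("Say hello"), indent=2))
```

Each call made this way should appear as a fresh trace in the Langfuse UI within a few seconds.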
For code that doesn't go through LiteLLM — standalone scripts, custom pipelines, direct Ollama calls — instrument with the Langfuse Python SDK. The @observe() decorator creates a trace for any function automatically:
pip install langfuse
import os
from langfuse import get_client, observe
from langchain_ollama import ChatOllama
# Configure Langfuse — point at your local instance
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-your-key"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-your-key"
os.environ["LANGFUSE_HOST"] = "http://localhost:3000"
langfuse = get_client()
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
@observe()  # Creates a trace for every call to this function
def answer_question(question: str, user_id: str = "anonymous") -> str:
    # Add metadata to the current trace
    langfuse.update_current_trace(
        user_id=user_id,
        session_id=f"session-{user_id}",
        tags=["qa-pipeline"],
    )
    messages = [{"role": "user", "content": question}]
    response = llm.invoke(messages)
    return response.content

if __name__ == "__main__":
    answer = answer_question("What is data sovereignty?", user_id="marcin")
    print(answer)
    # IMPORTANT: flush pending traces before the script exits.
    # Langfuse sends traces asynchronously — without flush(), short-lived
    # scripts may exit before all traces are sent.
    langfuse.flush()
The @observe() decorator automatically captures the function's inputs, outputs, and execution time. Nested @observe() functions create nested spans — the trace tree mirrors your call stack.
For more granular control — manually marking specific steps, recording token counts, or tracking costs — use the span context manager directly:
@observe()
def rag_pipeline(question: str) -> str:
    # Step 1: retrieval — tracked as a separate span
    with langfuse.start_as_current_observation(
        as_type="span", name="retrieve-context"
    ) as retrieval_span:
        chunks = retrieve_from_qdrant(question)
        retrieval_span.update(output={"chunks_found": len(chunks)})

    # Step 2: generation — tracked as a generation span with token metadata
    with langfuse.start_as_current_observation(
        as_type="generation",
        name="llm-response",
        model="llama3.1:8b",
    ) as gen_span:
        prompt = build_prompt(question, chunks)
        response = llm.invoke(prompt)
        gen_span.update(
            input=prompt,
            output=response.content,
            usage_details={
                "input": response.response_metadata.get("prompt_eval_count", 0),
                "output": response.response_metadata.get("eval_count", 0),
            },
        )
    return response.content
For LangChain-based pipelines, Langfuse provides a CallbackHandler that automatically captures every LLM call, retrieval step, tool invocation, and chain execution — with their full inputs and outputs, nested in the correct trace hierarchy:
from langfuse.langchain import CallbackHandler
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Create the Langfuse callback handler — configure once, pass everywhere
langfuse_handler = CallbackHandler()
llm = ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434")
prompt = ChatPromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm | StrOutputParser()
# Pass the callback in config — every step in the chain is traced automatically
result = chain.invoke(
    {"question": "What is PagedAttention?"},
    config={"callbacks": [langfuse_handler]},
)
print(result)
For the RAG pipeline from Article #3, add the callback to the chain invocation:
# In query.py from Article #3, update the ask() function:
from langfuse.langchain import CallbackHandler
langfuse_handler = CallbackHandler()
def ask(question: str, rag_chain, retriever, user_id: str = "user"):
    answer = rag_chain.invoke(
        question,
        config={
            "callbacks": [langfuse_handler],
            # metadata passed to the trace:
            "metadata": {"user_id": user_id, "pipeline": "rag-v1"},
        }
    )
    return answer
In Langfuse's UI, you'll see a nested trace for every RAG query: the top-level chain → the retrieval step with the returned document chunks → the prompt construction → the LLM call with prompt and response. You can click any step to see its exact input and output.
What to Actually Look At in Langfuse
Having traces is only useful if you know what to monitor. Here are the specific views and metrics that matter for a local AI stack:
Trace list and search
The main view. Filter by user, time range, model, tags, or latency. When a user reports a problem, search by their user ID and time. Find the exact trace. See the exact prompt and response. This is the most valuable debugging tool you'll have.
Token usage over time
Navigate to Metrics → Token Usage. Watch for upward trends. If average input token count per request is growing, your prompts are getting longer — often silently, due to growing conversation history or retrieved context. Catching this early prevents runaway costs if you're also routing to paid API providers via LiteLLM.
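The check the Token Usage view performs visually can be sketched in a few lines: compare the average input token count in the oldest and newest slices of recent requests and flag silent growth. This is a toy illustration on simulated numbers, not a Langfuse API:

```python
from statistics import mean

def prompt_growth_ratio(input_tokens_per_request: list[int], window: int = 50) -> float:
    """Ratio of the newest window's average input tokens to the oldest window's."""
    early = input_tokens_per_request[:window]
    recent = input_tokens_per_request[-window:]
    return mean(recent) / mean(early)

# Simulated history: prompts slowly bloat as conversation context accumulates
history = [200 + i * 4 for i in range(200)]   # ~200 tokens growing toward ~1000

ratio = prompt_growth_ratio(history)
if ratio > 1.5:
    print(f"Input tokens grew {ratio:.1f}x — check conversation/context trimming")
```

A steady ratio near 1.0 is healthy; anything trending upward means your prompt assembly — history, retrieved chunks, system instructions — is accumulating without bound.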
Latency breakdown
Metrics → Latency. Look at P50, P95, and P99. The gap between P50 and P99 tells you how predictable your system is. A P99 that's 10x the P50 means occasional requests are hanging — often due to model loading time, queue buildup, or a slow retrieval step. The per-step breakdown shows you whether latency is in the LLM or in retrieval.
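If you export latencies from your traces, the P50/P95/P99 figures are easy to reproduce yourself. A minimal nearest-rank sketch (Langfuse's exact interpolation may differ):

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value >= p% of the sample."""
    ranked = sorted(values)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

# Millisecond latencies with one hanging request mixed in
latencies = [120, 135, 150, 140, 160, 145, 9000, 130, 155, 125]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"P50={p50}ms  P99={p99}ms  ratio={p99 / p50:.0f}x")
```

Note how a single hanging request leaves the P50 untouched but blows up the P99 — exactly the signature of intermittent model-loading or retrieval stalls described above.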
Error rates per model
Filter traces by "error" status. Group by model. If one model is failing more than others — context window exceeded, timeouts, parsing errors — this shows it immediately. Silent errors (model returns a response but the chain fails downstream) also surface here when you use the failure_callback in LiteLLM.
Session replay
Click any trace → Session. See the entire multi-turn conversation that request was part of. When debugging a bad output, context from earlier in the conversation often explains it — the model was given contradictory information five turns ago. Without session replay, this is invisible.
Adding Quality Scores — Moving Beyond Latency
Latency and token counts are infrastructure metrics. Quality is the harder problem — was the answer actually correct? Langfuse supports three ways to attach quality scores to traces:
User feedback: If your application has a thumbs up/down button, post the feedback to Langfuse via the API. Every trace then carries a user satisfaction signal.
from langfuse import get_client
langfuse = get_client()
# After the user rates the response (called from your application's feedback handler):
def record_feedback(trace_id: str, thumbs_up: bool):
    langfuse.create_score(
        trace_id=trace_id,
        name="user-feedback",
        value=1 if thumbs_up else 0,
        comment="thumbs up" if thumbs_up else "thumbs down",
    )
    langfuse.flush()
Automated LLM-as-judge: For factual Q&A or RAG, use a second (potentially smaller and faster) model to evaluate whether the answer is supported by the retrieved context. Langfuse has built-in evaluation templates for this.
Manual labeling: The Langfuse UI has a labeling queue — route low-confidence or flagged traces there for human review. Useful for compliance scenarios where a human needs to sign off on AI outputs.
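The LLM-as-judge approach can be sketched in a few lines. The template and helper names below are hypothetical illustrations (not Langfuse's built-in evaluation templates) — the judge model call itself is omitted, since any chat model behind your gateway can fill that role:

```python
# Hypothetical judge prompt — not a Langfuse built-in template
JUDGE_TEMPLATE = """You are grading a RAG answer.

Context:
{context}

Question: {question}
Answer: {answer}

Is every claim in the answer supported by the context?
Reply with exactly one word: SUPPORTED or UNSUPPORTED."""

def build_judge_prompt(context: str, question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)

def parse_verdict(judge_output: str) -> int:
    # 1 = grounded in the retrieved context, 0 = possible hallucination
    return 1 if "UNSUPPORTED" not in judge_output.upper() else 0

prompt = build_judge_prompt(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer="Paris.",
)
print(prompt.splitlines()[0])
```

The resulting 0/1 verdict can then be attached to the original trace as a score, the same way the user-feedback example above does, giving you an automated groundedness signal on every RAG answer.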
Common Issues
Traces aren't appearing in the UI
Two common causes. (1) The worker isn't running — run docker compose ps and ensure langfuse-worker-1 is Up. It processes events asynchronously; without it, traces are queued but never ingested. (2) Your API keys don't match — the Public Key and Secret Key in your app environment must match the keys shown in Langfuse UI → Settings → API Keys for the correct project.
Traces go to the wrong host
If LANGFUSE_HOST is unset, the SDK defaults to Langfuse Cloud rather than localhost:3000. Set LANGFUSE_HOST=http://localhost:3000 explicitly. If LiteLLM is running inside Docker in the same compose network, use the container hostname instead: http://langfuse-web:3000. Check with docker compose logs litellm for the exact error.
Token counts are missing
Verify that response.response_metadata contains prompt_eval_count and eval_count — if not, these aren't being returned by the model backend. You can set them manually via gen_span.update() as shown in the manual spans example above.
Traces from short scripts never arrive
Call langfuse.flush() at the end of short-lived scripts. In long-running services (FastAPI, etc.), the background flush happens automatically and you don't need to call flush explicitly.
Disk usage keeps growing
Trace data accumulates in Postgres and ClickHouse — check with docker system df. You can configure trace retention in Langfuse Settings → Data Retention. For development, you can periodically purge traces: in the UI, go to Settings → Danger Zone → Delete all traces. For production, set a retention policy from day one.
What You've Built
You now have a complete observability layer for your local AI stack. Every LLM request — whether it comes through LiteLLM, a LangChain pipeline, or direct Python code — is captured with its prompt, response, latency, token counts, and user context. You can search, filter, and replay any trace. You can see quality scores and user feedback alongside infrastructure metrics.
More importantly: you're no longer flying blind. When something goes wrong, you have the data to understand what happened. When a user reports a bad response, you can find it in under 30 seconds. When latency creeps up, you can pinpoint which step in the pipeline is responsible.