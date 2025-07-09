

You did it. You added the latest AI agent to your product. Shipped it. Went to sleep. Woke up to find it returning a blank string, taking five seconds longer than yesterday, or confidently outputting lies in perfect JSON.

Naturally, you check your logs. You see a prompt. You see a response. And you see nothing helpful.

Surprise. Prompt in and response out is not observability. It is vibes.

There is a lot of buzz around “LLM observability.” Most of it involves charts you do not need and dashboards you will forget to check. If you are actually building and shipping LLM-powered products—chatbots, internal agents, retrieval apps—you need observability. But not the kind that stops where things get interesting.

Let’s walk through what to track, when to track it, and why you should care before you accidentally spend five thousand dollars on empty completions.

Pre-Production (AKA “Prompt Graveyard”)

You are building a prototype. You have a notebook, a vector store, an OpenAI key, and a dream. This is not the time for dashboards. This is the time for panic-saving broken prompts.

At this stage, you are debugging yourself more than your users. So log:

The full prompt and response

Model name, temperature, and function schema version

Token usage and latency

Something—anything—that identifies the version of your prompt

```python
import logging
import os

logger = logging.getLogger(__name__)

# prompt, response, usage, and duration_ms come from your own model call.
log_data = {
    "prompt": prompt,
    "response": response,
    "model": "gpt-4-turbo",
    "temperature": 0.7,
    "latency_ms": duration_ms,
    "tokens": {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    },
    # Set GIT_COMMIT_HASH at build or deploy time; it is None if unset.
    "prompt_version": os.getenv("GIT_COMMIT_HASH"),
}
logger.info("llm_trace", extra=log_data)
```

Prompt versioning does not need a fancy system. A commit hash will do. Or a sticky note. Just write it down before you forget what changed.
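If the prompt lives in a string rather than in its own file, a content hash is another cheap version identifier. The sketch below is an alternative to the commit hash, not a replacement for it; `prompt_version` and `SUMMARIZE_TEMPLATE` are made-up names for illustration.

```python
import hashlib


def prompt_version(template: str, length: int = 12) -> str:
    """Return a short, stable identifier for a prompt template."""
    # Normalize line endings and outer whitespace so cosmetic edits
    # (CRLF vs LF, a trailing newline) do not look like a new version.
    normalized = template.replace("\r\n", "\n").strip()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical template, only here to show the call.
SUMMARIZE_TEMPLATE = "Summarize the following text in three bullet points:\n{text}"
version = prompt_version(SUMMARIZE_TEMPLATE)
```

Log `version` next to the prompt and response. Any change to the template text changes the identifier, even if nobody remembered to commit.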

Where to keep these logs? Any of these will do:

A JSON file

A table in Postgres

Sentry with structured logs

Traceloop, if you like that sort of thing
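The JSON file option can be as small as this. It is a minimal sketch that assumes one trace per line (JSON Lines) and a single writer process; `append_trace` and `load_traces` are illustrative names, not part of any library.

```python
import json
import time
from pathlib import Path


def append_trace(path: Path, record: dict) -> None:
    """Append one trace to the file as a single JSON line."""
    # Add a timestamp unless the caller already supplied one.
    stamped = {"ts": time.time(), **record}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(stamped, ensure_ascii=False) + "\n")


def load_traces(path: Path) -> list[dict]:
    """Read every trace back, one dict per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending one line at a time means a crash loses at most the trace being written, and you can read the file with nothing more than a text editor.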

Your only goal here is to make weird behavior reproducible. If something explodes and you can explain it in less than five minutes, you win.

Production (Now It Is Someone Else’s Problem)

Your app is live. Someone, somewhere is typing something terrible into your input box. This is where your optimistic prototype becomes a haunted house of retry storms, stale embeddings, and LLM behavior changes you did not ask for.

At this point, you are not debugging the model. You are debugging everything around it.

| Layer | What Will Break |
| --- | --- |
| Frontend | Laggy input fields, users pasting PDFs |
| Backend | Prompt assembly bugs, retry loops |
| LLM | Latency, token burn, mysterious hallucinations |
| Retrieval | Missing documents, low relevance scores |
| External APIs | Schema changes, rate limits, surprise outages |
| Infra | Cold starts, memory spikes, silent container exits |

This is not “what did we send to the model.” This is “what happened from the user click to the flaming output.”

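```python
import logging
import time
import uuid

logger = logging.getLogger(__name__)

# user_input and call_agent come from your application.
trace_id = str(uuid.uuid4())
logger.info("start_trace", extra={"trace_id": trace_id, "input": user_input})

started = time.perf_counter()
response = call_agent(user_input, trace_id=trace_id)
elapsed = (time.perf_counter() - started) * 1000  # milliseconds

logger.info(
    "end_trace",
    extra={
        "trace_id": trace_id,
        "response": response,
        "latency_ms": elapsed,
        "tools_used": tool_calls,  # however your agent reports the tools it called
    },
)
```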

Add trace identifiers across frontend and backend

Use OpenTelemetry or Sentry to follow request flow

Track retries and token usage

Set alerts for latency spikes and error rates
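Tracking retries and token usage can start with a wrapper like the one below. It is a sketch under assumptions: `call` is a hypothetical zero-argument function returning `(text, prompt_tokens, completion_tokens)`, `is_valid` is your own check on the output, and a call that raises is treated as billing no tokens.

```python
import time
from dataclasses import dataclass


@dataclass
class CallStats:
    attempts: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0


def call_with_retries(call, is_valid, max_attempts=3, base_delay_s=0.5,
                      sleep=time.sleep):
    """Retry call() until is_valid(text) passes or attempts run out.

    Returns (text, stats); text is None if every attempt failed.
    """
    stats = CallStats()
    for attempt in range(1, max_attempts + 1):
        stats.attempts = attempt
        try:
            text, p_tok, c_tok = call()
        except Exception:
            # Broad catch for the sketch; narrow it to your client's errors.
            pass
        else:
            # A bad answer still costs tokens, so count it before validating.
            stats.prompt_tokens += p_tok
            stats.completion_tokens += c_tok
            if is_valid(text):
                return text, stats
        if attempt < max_attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    return None, stats
```

Log `stats` on every request, not only on failure. A request that succeeded on the third attempt looks fine to the user and expensive to you.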

If you cannot tell what the user saw, what the model saw, and what changed, you are not doing observability. You are doing archaeology.

Failures you will otherwise miss:

A retry loop burns tokens quietly during a vector store outage

The UI breaks because the model adds an unexpected prefix

Retrieval stops working after a model change, and you find out on Twitter

Latency jumps 600 milliseconds and no one knows why

You cannot fix what you cannot find. Especially when it is wrapped in base64 inside a JSON blob.
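For the latency jump in particular, a percentile comparison between two windows of requests is often enough to raise a flag. The sketch below uses the nearest-rank method; the 500 ms threshold and the function names are assumptions to adjust, not recommendations.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_jump_ms(baseline_ms, recent_ms, pct=95):
    """How much the chosen percentile moved between the two windows."""
    return percentile(recent_ms, pct) - percentile(baseline_ms, pct)


def should_alert(baseline_ms, recent_ms, threshold_ms=500.0, pct=95):
    """True when the percentile rose by at least threshold_ms."""
    return latency_jump_ms(baseline_ms, recent_ms, pct) >= threshold_ms
```

Comparing a high percentile rather than the maximum means one slow request does not page anyone, while a shift affecting a real share of traffic does.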

Post-PMF (Also Known As “Now We Are On Call”)

Your product is being used. People depend on it. You have new problems now. Like cost. And scale. And trying to explain to your manager why your summarizer now refuses to summarize.

Observability shifts from catching bugs to catching regressions and making tradeoffs.

What to track:

Output drift detection for every model upgrade

Evaluation metrics like semantic similarity and formatting checks

Cost and latency breakdowns by user, endpoint, and model

RAG (Retrieval-Augmented Generation) quality tracking: is your index fresh, relevant, and not broken

Full stack tracing, all the way down to the weird tool call that silently failed
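A formatting check is the cheapest evaluation metric on that list. The sketch below assumes the model is supposed to return a single JSON object; `check_json_output` and the `required_keys` you pass are illustrative, so swap in whatever your output contract actually is.

```python
import json


def check_json_output(raw: str, required_keys: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    problems = []
    stripped = raw.strip()
    # Catches chatty prefixes in front of the JSON, such as "Sure! {...}".
    if not stripped.startswith("{"):
        problems.append("unexpected prefix before JSON object")
    try:
        data = json.loads(stripped)
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON: {exc.msg}")
        return problems
    if not isinstance(data, dict):
        problems.append("top-level value is not an object")
        return problems
    missing = required_keys - set(data)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    return problems
```

Run it on every response and count the failures per model version. A jump in that count after an upgrade is drift you can measure without any embeddings.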

```python
# embed, cosine_similarity, and alert are placeholders for your own helpers.
score = cosine_similarity(embed(output), embed(reference))
if score < 0.8:
    alert("Drift detected in summarizer")
```
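The snippet above leaves `cosine_similarity` undefined. If you do not want a dependency for it, a plain-Python version looks roughly like this. The vectors are assumed to be equal-length lists of floats from whatever embedding model you use, and the 0.8 threshold is carried over from the example, not a tuned value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as "nothing in common"
    return dot / (norm_a * norm_b)


def drifted(output_vec, reference_vec, threshold=0.8):
    """True when the output has moved too far from the reference."""
    return cosine_similarity(output_vec, reference_vec) < threshold
```

The threshold depends heavily on the embedding model, so calibrate it on outputs you already consider good before wiring it to an alert.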

What you can build yourself:

Nightly evals using cron and SQL

Token usage reports grouped by endpoint

Manual diffs of output before and after model changes

Embedding freshness checks with thresholds
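The token usage report is a single `GROUP BY`. The sketch below uses an in-memory SQLite database so it runs anywhere; the `llm_calls` table, its columns, and the sample rows are invented for illustration, and the same query should carry over to Postgres with little or no change.

```python
import sqlite3

REPORT_SQL = """
SELECT endpoint,
       COUNT(*)               AS calls,
       SUM(prompt_tokens)     AS prompt_tokens,
       SUM(completion_tokens) AS completion_tokens
FROM llm_calls
GROUP BY endpoint
ORDER BY SUM(prompt_tokens + completion_tokens) DESC
"""


def usage_by_endpoint(conn):
    """Return (endpoint, calls, prompt_tokens, completion_tokens) rows,
    most expensive endpoint first."""
    return conn.execute(REPORT_SQL).fetchall()


# Made-up sample data standing in for your real trace table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls ("
    "endpoint TEXT, model TEXT, prompt_tokens INTEGER, completion_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?, ?)",
    [
        ("/summarize", "model-a", 1200, 300),
        ("/summarize", "model-a", 900, 250),
        ("/chat", "model-b", 400, 600),
    ],
)
report = usage_by_endpoint(conn)
```

Run it from cron once a night and send the result wherever your team actually looks. Add `model` or a user id to the `GROUP BY` when you need a finer breakdown.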

You should pay for tooling when:

You spend more time reading logs than shipping features

It takes longer than five minutes to answer “what changed”

Your system breaks silently and no one knows until customers complain

Useful tools:

LangSmith for tracing and evals

Sentry for full-stack visibility

WhyLabs or Arize for drift detection and output monitoring

Monitoring LLM systems is messy. These are probabilistic tools. You are not logging errors. You are logging behavior.

| Stage | What You Should See |
| --- | --- |
| Pre-Production | Prompt logs, token usage, version tracking |
| Production | Tracing, retries, feedback, structured logs |
| Post-PMF | Drift detection, evals, cost insights, RAG health |

A good observability stack should answer four questions. Fast.

What did we send the model?

Why did it respond that way?

What changed recently?

How much is this costing us?

If you can answer those, you are probably fine. If you cannot, start small. Fix the parts that hurt. Add more when things get weird. They will.

If your monitoring stops at the model call, you are not monitoring. You are hoping.

