What You Actually Need to Monitor AI Systems in Production

You did it. You added the latest AI agent to your product. Shipped it. Went to sleep. Woke up to find it returning a blank string, taking five seconds longer than yesterday, or confidently outputting lies in perfect JSON.
Naturally, you check your logs. You see a prompt. You see a response. And you see nothing helpful.
Surprise. Prompt in and response out is not observability. It is vibes.
There is a lot of buzz around “LLM observability.” Most of it involves charts you do not need and dashboards you will forget to check. If you are actually building and shipping LLM-powered products—chatbots, internal agents, retrieval apps—you need observability. But not the kind that stops where things get interesting.
Let’s walk through what to track, when to track it, and why you should care before you accidentally spend five thousand dollars on empty completions.
Stage One: Pre-Production
(AKA “Prompt Graveyard”)
You are building a prototype. You have a notebook, a vector store, an OpenAI key, and a dream. This is not the time for dashboards. This is the time for panic-saving broken prompts.
What to Log
At this stage, you are debugging yourself more than your users. So log:
- The full prompt and response
- Model name, temperature, and function schema version
- Token usage and latency
- Something—anything—that identifies the version of your prompt
import logging
import os

logger = logging.getLogger(__name__)

# Everything you need to reproduce a weird completion later.
log_data = {
    "prompt": prompt,
    "response": response,
    "model": "gpt-4-turbo",
    "temperature": 0.7,
    "latency_ms": duration_ms,
    "tokens": {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    },
    # Tie every trace to the code that produced it.
    "prompt_version": os.getenv("GIT_COMMIT_HASH"),
}
logger.info("llm_trace", extra=log_data)
Prompt versioning does not need a fancy system. A commit hash will do. Or a sticky note. Just write it down before you forget what changed.
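If the GIT_COMMIT_HASH variable is not set (it rarely is in a notebook), you can ask git directly. A minimal sketch; the helper name is ours:
import os
import subprocess

def prompt_version() -> str:
    # Prefer the env var, fall back to the short hash of the current commit.
    version = os.getenv("GIT_COMMIT_HASH")
    if version:
        return version
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"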
Tooling You Can Get Away With
- A JSON file (a sketch follows below)
- A table in Postgres
- Sentry with structured logs
- Traceloop, if you like that sort of thing
Your only goal here is to make weird behavior reproducible. If something explodes and you can explain it in less than five minutes, you win.
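If you take the JSON-file route, one append-only JSONL file per day is plenty. A rough sketch, with the directory and field names as placeholders:
import json
import os
from datetime import date, datetime, timezone

def log_trace(log_data: dict, log_dir: str = "llm_traces") -> None:
    # One JSON object per line, one file per day, greppable forever.
    os.makedirs(log_dir, exist_ok=True)
    log_data["timestamp"] = datetime.now(timezone.utc).isoformat()
    path = os.path.join(log_dir, f"{date.today().isoformat()}.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(log_data) + "\n")

log_trace(log_data)  # the dict from the snippet above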
Stage Two: Production
(Now It Is Someone Else’s Problem)
Your app is live. Someone, somewhere is typing something terrible into your input box. This is where your optimistic prototype becomes a haunted house of retry storms, stale embeddings, and LLM behavior changes you did not ask for.
What to Monitor
At this point, you are not debugging the model. You are debugging everything around it.
Layer | What Will Break |
---|---|
Frontend | Laggy input fields, users pasting PDFs |
Backend | Prompt assembly bugs, retry loops |
LLM | Latency, token burn, mysterious hallucinations |
Retrieval | Missing documents, low relevance scores |
External APIs | Schema changes, rate limits, surprise outages |
Infra | Cold starts, memory spikes, silent container exits |
You Need Tracing. Actual Tracing.
This is not “what did we send to the model.” This is “what happened from the user click to the flaming output.”
import time
import uuid

# One ID per user request, threaded through every hop.
trace_id = str(uuid.uuid4())
start = time.perf_counter()
logger.info("start_trace", extra={"trace_id": trace_id, "input": user_input})

response = call_agent(user_input, trace_id=trace_id)

elapsed = int((time.perf_counter() - start) * 1000)
logger.info("end_trace", extra={
    "trace_id": trace_id,
    "response": response,
    "latency_ms": elapsed,
    "tools_used": tool_calls,  # whatever tools the agent reported calling
})
Free Stuff That Actually Helps
- Add trace identifiers across frontend and backend
- Use OpenTelemetry or Sentry to follow request flow (see the sketch below)
- Track retries and token usage
- Set alerts for latency spikes and error rates
If you cannot tell what the user saw, what the model saw, and what changed, you are not doing observability. You are doing archaeology.
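If you would rather not hand-roll trace IDs, OpenTelemetry gives you propagation and nesting for free. A minimal sketch; the span and attribute names are our own choices, and the console exporter is just for local testing:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Swap the console exporter for an OTLP or Sentry exporter in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")

def handle_request(user_input: str) -> str:
    # One parent span per user request, one child span per LLM call.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("input.length", len(user_input))
        with tracer.start_as_current_span("llm.call") as llm_span:
            response = call_agent(user_input)  # your existing agent call
            llm_span.set_attribute("response.length", len(response))
        return response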
What Breaks Without It
- A retry loop burns tokens quietly during a vector store outage
- The UI breaks because the model adds an unexpected prefix
- Retrieval stops working after a model change, and you find out on Twitter
- Latency jumps 600 milliseconds and no one knows why
You cannot fix what you cannot find. Especially when it is wrapped in base64 inside a JSON blob.
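The retry-storm failure mode in that list is cheap to guard against: cap attempts, log each retry against the trace ID, and stop once a token budget is gone. A sketch, assuming a call_llm helper that returns the response text and a usage object (both names are illustrative):
import time

def call_with_budget(prompt: str, trace_id: str, max_attempts: int = 3,
                     token_budget: int = 20_000) -> str:
    tokens_spent = 0
    for attempt in range(1, max_attempts + 1):
        response, usage = call_llm(prompt)  # illustrative: returns (text, usage)
        tokens_spent += usage.prompt_tokens + usage.completion_tokens
        if response.strip():  # whatever "valid" means for you: non-empty, parseable JSON, etc.
            return response
        logger.warning("llm_retry", extra={
            "trace_id": trace_id,
            "attempt": attempt,
            "tokens_spent": tokens_spent,
        })
        if tokens_spent >= token_budget:
            raise RuntimeError(f"Token budget exhausted after {attempt} attempts")
        time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"No valid response after {max_attempts} attempts")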
Stage Three: Product Market Fit
(Also Known As “Now We Are On Call”)
Your product is being used. People depend on it. You have new problems now. Like cost. And scale. And trying to explain to your manager why your summarizer now refuses to summarize.
Observability shifts from catching bugs to catching regressions and making tradeoffs.
What You Actually Need
- Output drift detection for every model upgrade
- Evaluation metrics like semantic similarity and formatting checks
- Cost and latency breakdowns by user, endpoint, and model
- RAG (Retrieval-Augmented Generation) quality tracking: is your index fresh, relevant, and not broken?
- Full stack tracing, all the way down to the weird tool call that silently failed
Example: Quick and Dirty Eval
score = cosine_similarity(embed(output), embed(reference))
if score < 0.8:
    alert("Drift detected in summarizer")
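The embed, cosine_similarity, and alert helpers above are stand-ins. One way to fill in the first two, assuming OpenAI embeddings via the v1 Python client (the model name is just an example):
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Any embedding model works; just use the same one for output and reference.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))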
You Can Still Do Some of This Yourself
- Nightly evals using cron and SQL
- Token usage reports grouped by endpoint (sketch below)
- Manual diffs of output before and after model changes
- Embedding freshness checks with thresholds
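For the token report, a cron job and one query go a long way, assuming your Stage One logs landed in a Postgres table; the table and column names below are made up:
import psycopg2

REPORT_SQL = """
    SELECT endpoint,
           model,
           SUM(prompt_tokens + completion_tokens) AS total_tokens,
           AVG(latency_ms) AS avg_latency_ms
    FROM llm_traces
    WHERE created_at >= now() - interval '1 day'
    GROUP BY endpoint, model
    ORDER BY total_tokens DESC;
"""

with psycopg2.connect("dbname=llm_logs") as conn, conn.cursor() as cur:
    cur.execute(REPORT_SQL)
    for endpoint, model, total_tokens, avg_latency in cur.fetchall():
        print(f"{endpoint} ({model}): {total_tokens} tokens, {avg_latency:.0f} ms avg")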
When to Stop Building and Start Paying
You should pay for tooling when:
- You spend more time reading logs than shipping features
- It takes longer than five minutes to answer “what changed”
- Your system breaks silently and no one knows until customers complain
Useful tools:
- LangSmith for tracing and evals
- Sentry for full-stack visibility
- WhyLabs or Arize for drift detection and output monitoring
What Good Looks Like
Monitoring LLM systems is messy. These are probabilistic tools. You are not logging errors. You are logging behavior.
Stage | What You Should See |
---|---|
Pre-Production | Prompt logs, token usage, version tracking |
Production | Tracing, retries, feedback, structured logs |
Post-PMF | Drift detection, evals, cost insights, RAG health |
A good observability stack should answer four questions. Fast.
- What did we send to the model?
- Why did it respond that way?
- What changed recently?
- How much is this costing us?
If you can answer those, you are probably fine. If you cannot, start small. Fix the parts that hurt. Add more when things get weird. They will.
One More Thing
If your monitoring stops at the model call, you are not monitoring. You are hoping.
Sentry gives you full request traces across your app. You can follow a user click through your toolchain, vector store, model call, and back to the output. You can even see why a tool invocation failed inside an agent plan, without spending your entire afternoon rebuilding context from logs.
So stop guessing. Start knowing. And please log the prompt before you forget what broke it.
Check out the docs to learn more, join the discussion in Discord, or if you’re new to Sentry, you can get started for free.