AI agent observability: The developer's guide to agent monitoring

Sergiy Dybskiy


Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

Agent monitoring needs two things: dashboards that show you what's happening across all your agents, and traces that show you exactly why something went wrong in a specific run. Most tools give you one or the other. Here's what it looks like when you have both.

What is agent observability?

Agent observability is end-to-end visibility into what your AI agents are doing: which models they call, which tools they invoke, what decisions they make at each step, and how those decisions affect the final output.

Traditional application monitoring tracks requests, errors, and latency. That works for stateless HTTP services where each request is independent.

AI agents are different. A single agent run might involve multiple LLM calls, tool executions, sub-agent handoffs, and multi-turn reasoning loops, all dependent on each other. When the output is wrong, the failure could be anywhere in that chain: a bad tool response, a context window overflow, a model selecting the wrong function, or a handoff losing state.

Agent observability tracks the full reasoning chain across these interactions. You can't evaluate agent quality, debug agentic workflows, or control costs without this level of visibility.

Why traditional monitoring fails for AI agents

Standard APM tools will tell you that POST /api/chat returned 200 in 4.2 seconds. They won't tell you that inside that request, the agent made 5 LLM calls, the third one picked the wrong tool, the tool returned stale data, and the model faithfully summarized garbage.

If your agent monitoring plan is "log everything and figure it out later," you'll have a dashboard full of counts and averages with no way to dig deeper. An agent that returns the wrong answer might have made 12 LLM calls, executed 4 tools, and handed off to a sub-agent before producing garbage. Aggregate metrics tell you that error rates went up. They don't tell you where the reasoning went wrong.

What you need is structured tracing built on a standard convention, so dashboards, traces, and alerts all speak the same language.

The OpenTelemetry standard for agent observability

The OpenTelemetry gen_ai semantic conventions define the standard for instrumenting AI agent systems. Instead of custom logging, every AI operation produces a structured span with a consistent set of attributes. Here are the core operations defined by the convention:

Span operation         What it captures
gen_ai.request         A single LLM call: model, prompt, response, token counts
gen_ai.invoke_agent    The full lifecycle of an agent run, from task to final output
gen_ai.execute_tool    A tool/function call: name, input, output, duration

These compose into a span tree:

POST /api/chat (http.server)
└── gen_ai.invoke_agent "Research Agent"
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← initial reasoning
    ├── gen_ai.execute_tool "search_docs"              ← tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← process results
    ├── gen_ai.execute_tool "summarize"                ← second tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← decides to hand off
    └── gen_ai.execute_tool "transfer_to_writer"       ← handoff via tool
        └── gen_ai.invoke_agent "Writer Agent"
            ├── gen_ai.request "chat gemini-2.5-flash"
            └── gen_ai.execute_tool "format_output"

This isn't proprietary. It's an open standard, and any observability platform that follows it can ingest these spans. The span op follows the pattern gen_ai.{operation_name}: for manual instrumentation, gen_ai.request covers all LLM calls, while auto-instrumentation from SDKs may produce more specific ops like gen_ai.chat or gen_ai.embeddings depending on the API being called. Because these are structured spans rather than unstructured logs, they power both dashboards and trace views.
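Because the spans are structured, aggregations fall straight out of a tree walk. Here's a stdlib sketch of rolling up per-agent token totals from a span tree like the one above (the nested-dict shape is illustrative, not an SDK API; attribution goes to the nearest enclosing agent):

```python
def tokens_per_agent(span, current_agent=None, totals=None):
    """Sum gen_ai.usage.total_tokens per gen_ai.invoke_agent span.

    Illustrative only: real spans come from your OTel/Sentry SDK, not dicts.
    Tokens are attributed to the nearest enclosing agent span.
    """
    totals = {} if totals is None else totals
    if span["op"] == "gen_ai.invoke_agent":
        current_agent = span["name"]
    if span["op"] == "gen_ai.request" and current_agent:
        totals[current_agent] = (
            totals.get(current_agent, 0) + span.get("gen_ai.usage.total_tokens", 0)
        )
    for child in span.get("children", []):
        tokens_per_agent(child, current_agent, totals)
    return totals

# A miniature version of the span tree shown above:
trace = {
    "op": "gen_ai.invoke_agent", "name": "Research Agent", "children": [
        {"op": "gen_ai.request", "gen_ai.usage.total_tokens": 1200, "children": []},
        {"op": "gen_ai.execute_tool", "name": "transfer_to_writer", "children": [
            {"op": "gen_ai.invoke_agent", "name": "Writer Agent", "children": [
                {"op": "gen_ai.request", "gen_ai.usage.total_tokens": 800, "children": []},
            ]},
        ]},
    ],
}
```

The same walk, grouped by gen_ai.execute_tool instead, gives you per-tool call counts; this is essentially what the dashboards in the next section compute for you.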

Key metrics for AI agent monitoring

Before diving into tooling, here's what you should be tracking for any AI agent in production:

Reliability metrics:

  • Agent error rate - percentage of agent runs that fail or return errors

  • Tool failure rate - which tools are unreliable, and how often they break agent runs

  • Latency (p50, p95) - per-agent and per-model, to catch regressions and slowdowns

Cost metrics:

  • Token usage - input, output, cached, and reasoning tokens per model. Cached and reasoning tokens are subsets of the totals, not additive. Get this wrong and your cost dashboards are fiction.

  • Cost per model - compare models doing similar work. You might discover claude-sonnet-4-6 costs $10.8K/week while gemini-2.5-flash-lite handles the same volume for $645.

  • Cost per user/tier - which users or pricing tiers are consuming the most AI resources

Quality metrics:

  • Tool call frequency - how often agents invoke each tool, and in what order

  • Token efficiency - average tokens per successful agent completion. If this number is growing, your prompts or context windows might be bloating.

  • Cache hit rate - what percentage of input tokens are served from prompt cache. If you've enabled caching and this isn't improving, something is wrong.

If you want to find out more about these metrics and why they matter, our post on the core KPI metrics of LLM performance is a great way to get started. Any agent observability platform that follows OpenTelemetry conventions should surface these from trace data.
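To make the arithmetic concrete, here's a minimal sketch of computing cache hit rate and token efficiency from gen_ai.request spans. The attribute names follow the gen_ai convention used in this post; the per-span dicts and the "success" flag are illustrative stand-ins for whatever your backend exposes:

```python
def agent_metrics(spans):
    """Compute cache hit rate and token efficiency over gen_ai.request spans.

    Cached tokens are a subset of input tokens (not additive), so the
    cache hit rate is cached / input, never cached / (cached + input).
    """
    input_tokens = sum(s.get("gen_ai.usage.input_tokens", 0) for s in spans)
    cached = sum(s.get("gen_ai.usage.input_tokens.cached", 0) for s in spans)
    total = sum(s.get("gen_ai.usage.total_tokens", 0) for s in spans)
    completions = sum(1 for s in spans if s.get("success"))
    return {
        "cache_hit_rate": cached / input_tokens if input_tokens else 0.0,
        "tokens_per_success": total / completions if completions else 0.0,
    }

spans = [
    {"gen_ai.usage.input_tokens": 1000, "gen_ai.usage.input_tokens.cached": 400,
     "gen_ai.usage.total_tokens": 1500, "success": True},
    {"gen_ai.usage.input_tokens": 1000, "gen_ai.usage.input_tokens.cached": 600,
     "gen_ai.usage.total_tokens": 1500, "success": True},
]
```

Watch tokens_per_success over time: a steady climb with flat traffic usually means prompt or context-window bloat, per the token efficiency metric above.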

Here's what this looks like in practice using Sentry as an example.

Auto-instrumentation for 10+ frameworks

Sentry auto-instruments the major AI frameworks in both Python and Node.js, including OpenAI, Anthropic, Google GenAI, LangChain, LangGraph, Pydantic AI, OpenAI Agents SDK, Vercel AI SDK, and more. No manual span creation needed. Install the package, enable tracing, and Sentry picks it up.

The entire setup:

import sentry_sdk

sentry_sdk.init(
    dsn="YOUR_DSN",
    traces_sample_rate=1.0,
)
# OpenAI, Anthropic, LangChain, LangGraph, Pydantic AI,
# Google GenAI -- all auto-instrumented when detected.

That's it. Make an Anthropic or OpenAI call and the spans show up in Sentry.

Pre-built agent monitoring dashboards

Most observability platforms offer pre-built dashboards for agent monitoring. Once instrumentation is in place, Sentry's AI Agents dashboard includes three views:

AI Agents Overview

Agent Runs, duration, total LLM calls, tokens consumed, tool calls. This is your "is everything healthy?" view.

AI Agents Model Details

Per-model cost estimates, token breakdown (input/output/cached/reasoning), and latency. This surfaces the cost metrics above automatically.

AI Agents Tool Details

Per-tool call frequency, error rates, and p95 latency. If your search_docs tool is failing 12% of the time, this is where you'll spot it before your users start complaining.

These dashboards are pre-built and available the moment spans start flowing. But they're aggregate views: per-model totals, per-tool error rates, overall agent run counts. They help you answer technical questions and show you what's broken. What about the deeper business-level questions?

Custom agent monitoring dashboards

Pre-built dashboards give you aggregate health signals. They don't tell you who's driving your AI costs, which features are worth the spend, or whether your caching strategy is actually saving money. For those questions, you need to slice trace data by your own dimensions: user tier, feature flag, experiment group.

Some platforms let you build custom queries against span data. With the Sentry CLI, you can script this, and because it has an agent skill system, AI coding assistants like Claude Code can build dashboards for you:

"Who are my most expensive users?"

sentry dashboard create 'AI Cost Attribution'

sentry dashboard widget add 'AI Cost Attribution' "Most Expensive Users" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" \
  --where "span.op:gen_ai.request" \
  --group-by "user.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20

"Which pricing tier is eating my AI budget?"

Tag your users with their plan, then group the dashboard widget by that tag:

sentry_sdk.set_tag("user_tier", user.plan)  # "free", "pro", "enterprise"

sentry dashboard widget add 'AI Cost Attribution' "AI Cost by Tier" \
  --display bar --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" \
  --where "span.op:gen_ai.request" \
  --group-by "user_tier" \
  --sort "-sum:gen_ai.usage.total_tokens"

This is how you find out that free-tier users consume 60% of your AI budget. The same tagging pattern works for any dimension: team, feature_flag, experiment_group.

"Which agents are token-hungry?"

sentry dashboard widget add 'AI Cost Attribution' "Avg Tokens per Agent" \
  --display table --dataset spans \
  --query "avg:gen_ai.usage.total_tokens" "count" \
  --where "span.op:gen_ai.invoke_agent" \
  --group-by "gen_ai.agent.name" \
  --sort "-avg:gen_ai.usage.total_tokens"

If your "Research Agent" averages 15K tokens per run while "Summarizer Agent" averages 2K, you know where to focus prompt optimization.

"Is my prompt caching actually saving money?"

sentry dashboard widget add 'AI Cost Attribution' "Cache Hit Rate" \
  --display line --dataset spans \
  --query "sum:gen_ai.usage.input_tokens.cached" "sum:gen_ai.usage.input_tokens" \
  --where "span.op:gen_ai.request"

If your cached-to-total ratio isn't improving after enabling Anthropic prompt caching, something is wrong with your prompt structure.

Why tracing matters for agent monitoring

Dashboards show aggregates. Traces show decisions.

A dashboard tells you error rates went up or latency spiked. A trace tells you which agent, which model call, and which tool was responsible.

Distributed tracing already captures the full span tree for a request: browser interactions, HTTP calls, server-side routing, database queries. Agent observability plugs into this. Your gen_ai.* spans show up as children inside existing traces, so the LLM calls, tool executions, MCP server interactions, and sub-agent handoffs sit right alongside your regular application spans. No separate system needed.

This is what makes it powerful. You're not looking at agent data in isolation. You're looking at the full request, from the user's click to the final tool response, with agent decisions as one layer in the stack.

Here's what this looks like in Sentry's trace view:

[Screenshot: Distributed trace showing a full agent workflow from browser click through LLM calls, MCP server interactions, and agent handoffs]

One request, end to end: from a user clicking "Send Message" in the browser, through the API layer, into agent orchestration with LLM calls and MCP server interactions, through a handoff to a second agent. Click any span to see the model, tokens, cost, and system prompt.

Agent observability best practices

Regardless of which platform you use, you should:

  1. Use structured tracing, not logs. Unstructured logs can't reconstruct the reasoning chain. OpenTelemetry gen_ai spans give you a searchable, filterable span tree that powers dashboards and trace views simultaneously.

  2. Sample AI traces at 100%. Agent runs are span trees. Sampling drops entire executions, not individual calls. If your tracesSampleRate is below 1.0, you're losing complete agent runs. Use a tracesSampler to keep AI routes at 100% while sampling everything else at your baseline.

  3. Track cost by user, not just by model. The pre-built dashboard shows per-model totals. You need per-user and per-tier attribution to make business decisions about rate limiting, pricing, and model routing.

  4. Monitor tool reliability separately. A tool that fails 5% of the time might not show up in your overall error rate, but it's causing 1 in 20 agent runs to produce bad output. Your agent monitoring dashboard should surface per-tool error rates separately.

  5. Connect AI monitoring to your full stack. An agent failure might be caused by a slow database query, a failed external API call, or a frontend timeout. Isolated AI monitoring can't surface these root causes.
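Practice #2 boils down to a small sampler function. A minimal sketch, assuming your agent endpoints live under the (hypothetical) routes below; in Python you'd pass it as traces_sampler to sentry_sdk.init, in Node the equivalent is tracesSampler:

```python
AI_ROUTES = ("/api/chat", "/api/agents")  # hypothetical: your agent endpoints

def traces_sampler(sampling_context):
    """Keep AI agent routes at 100% sampling, everything else at a 10% baseline."""
    name = (sampling_context.get("transaction_context") or {}).get("name", "")
    if any(route in name for route in AI_ROUTES):
        return 1.0  # never drop an agent run: sampling loses the whole span tree
    return 0.1  # baseline for the rest of the app

# Wire it up at init time, e.g.:
# sentry_sdk.init(dsn="YOUR_DSN", traces_sampler=traces_sampler)
```

Because a sampling decision applies to the entire trace, returning 1.0 here preserves every LLM call, tool execution, and handoff inside the agent run, not just the top-level request.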

Full-stack agent observability

Agent observability is most powerful when it sits on top of a full APM platform, connecting agent spans to errors, performance traces, session replays, and logs across your entire stack.

Isolated AI monitoring gives you gen_ai spans in a silo. You can see that your Research Agent made 8 LLM calls and cost $0.04. What you can't see is that the reason it made 8 calls instead of 3 is that your search_docs tool is hitting a slow Postgres query that's timing out, causing the agent to retry with a rephrased query each time.

When agent spans share context with the rest of your stack, the full picture comes together. Errors surface with their full span tree attached. Session replays show the user interaction that triggered a bad agent run. Upstream infrastructure problems (a degraded vector database, a flaky external API, etc.) show up in the same trace as the agent behavior they caused.

Four steps to first trace

  1. Install the SDK: pip install sentry-sdk or npm install @sentry/node

  2. Initialize with tracing enabled

  3. Make an AI call; spans and dashboards populate automatically


  4. (Optional) Install the CLI skill for your AI assistant:

    npx skills add https://cli.sentry.dev

If your framework is auto-instrumented, you're done. If it's not, manual instrumentation takes about 10 lines of code per span type.
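What those ~10 lines look like depends on the SDK, but the shape is the same everywhere: open a span with a gen_ai op, attach the convention's attributes, close it when the operation finishes. A self-contained sketch with a stand-in tracer (with sentry_sdk, the `span` helper below would be sentry_sdk.start_span(op=..., name=...); the output attribute name is illustrative):

```python
from contextlib import contextmanager

RECORDED = []  # stand-in for your tracing backend

@contextmanager
def span(op, name, **attributes):
    """Stand-in tracer: records the span dict when the block exits."""
    record = {"op": op, "name": name, **attributes}
    try:
        yield record
    finally:
        RECORDED.append(record)

def search_docs(query):
    # Manual instrumentation for one tool, per the gen_ai convention:
    # one span, op gen_ai.execute_tool, tool name and output as attributes.
    with span("gen_ai.execute_tool", "execute_tool search_docs",
              **{"gen_ai.tool.name": "search_docs"}) as s:
        results = ["doc-1", "doc-2"]  # stand-in for the real search
        s["gen_ai.tool.output"] = str(results)
        return results
```

An LLM call is instrumented the same way with op gen_ai.request plus model and token-usage attributes, which is why the per-span-type cost stays around ten lines.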

Try Sentry for free - AI monitoring is included on all plans.

