When agents orchestrate agents, who's watching?
You used to monitor services.
Then you started monitoring AI calls inside services.
Now your AI agent is spinning up other AI agents to complete tasks. Your old monitoring instincts need to evolve.
This isn't hypothetical. Agentic architectures are already in production. Coding agents are calling search agents; orchestrators are spawning specialized sub-agents for retrieval, planning, and execution. Teams are shipping these systems faster than they're figuring out how to watch them.
The problem isn't that agents fail. It's that when they do, you often can't tell which agent introduced the failure, or whether anything technically failed at all.
Traditional tracing wasn't built for this
In a traditional stack, debugging a request means following one thread from entry point to database. One service, one owner, one place to look.
In a multi-agent system, a single user action might trigger a planner agent, three tool-call agents, a validation agent, and a write agent. That's five actors, potentially across different models, different prompts, and very different latency budgets. Errors don't always surface as exceptions. A bad output from a sub-agent might not throw an error at all. It might just start the spiral, propagating as context corruption further down the chain. The orchestrator thinks it succeeded. The user sees something wrong. You open your logs and find nothing obviously broken.
If you want to see what this looks like in practice, this breakdown of a real multi-agent debugging session shows exactly how a silent tool failure two hops upstream can corrupt final output without triggering a single error. It's a good illustration of why the instinct to "read the logs" stops working at this level of complexity. In this world, little missteps compound and avalanche.
This post focuses on what that complexity looks like when you're operating at scale, across teams, with enterprise reliability expectations.
The visibility problem compounds with scale
One agent is readable. Two agents are manageable. Five agents calling each other conditionally, with branching logic and shared context? It’s a different category of problem entirely.
You're no longer debugging code execution. You're debugging emergent behavior across a distributed decision graph. The same way microservices made "it's slow somewhere in the stack" a meaningless statement without traces, multi-agent systems make "the AI did something wrong" nearly impossible to act on without the right instrumentation.
Most teams discover this the hard way. Maybe it's a sudden uptick in user churn with no clear cause, an LLM silently returning bad data three hops down the chain, or a token cost bill that tripled overnight. No alert fired because no single component technically crossed a threshold.
Distributed tracing solved this exact problem for microservices. The question is whether your AI pipeline is instrumented to handle the next version of that problem.
What actual, useful multi-agent monitoring looks like
Getting visibility into multi-agent systems doesn't require a new product category. It's about applying the right primitives at the right granularity. Sentry's AI observability tooling is built on the same foundation as its distributed tracing, which means the mental model transfers even as the complexity scales. Here's what that actually requires:
Trace continuity across agent handoffs. The trace ID needs to follow the task through every agent invocation, not restart at each boundary. You need to see the full tree: who called what, in what order, with what inputs and outputs. A flat list of spans with the same parent doesn't offer the same value when you need to understand which agent in the middle of the chain introduced a bad state.
Per-agent span attribution. Latency, token usage, model version, prompt hash, and output signal should be attributable to each agent individually, not rolled up to the top-level call. Knowing your orchestrator took 4.2 seconds tells you almost nothing. Knowing it was waiting 3.8 seconds on a retrieval sub-agent that returned low-confidence results tells you exactly where to go.
Failure mode differentiation. Agent timeout, bad tool call output, context window overflow, model refusal, and hallucination downstream of a technically valid response are completely different problems with completely different fixes. Grouping them all as "AI errors" is the equivalent of logging every 500 as "server error." Technically accurate, operationally useless.
Cost and token attribution at the task level. A task that spawned six agents and consumed 40K tokens is a different animal than one that consumed 4K. You need this at query time, broken down per transaction, per user, and per feature. Not buried in an end-of-month billing aggregate.
Nested span trees showing agent relationships. Sentry's trace view shows agent invocations as nested spans, so you can see which agent called which, in what order, and what each one consumed. When multiple agents are calling a shared tool or a downstream agent is being invoked by more than one parent, that structure is visible in the trace.
Where Sentry fits
Sentry already has the primitives: distributed tracing, spans, breadcrumbs, performance metrics. If you're using the Sentry SDK in your AI pipeline, you're closer than you think.
For supported frameworks, setup is essentially zero-config. Sentry auto-instruments agent invocations, tool calls, and LLM requests across the major AI frameworks in both Python and Node.js, including OpenAI, Anthropic, Google GenAI, LangChain, LangGraph, Pydantic AI, OpenAI Agents SDK, and Vercel AI SDK. Install the package, enable tracing, and Sentry picks it up automatically, including tool failure patterns and automatic grouping of similar failures across runs.
Here's what that looks like for the OpenAI Agents SDK in Python:
```python
import sentry_sdk
from sentry_sdk.integrations.openai_agents import OpenAIAgentsIntegration

sentry_sdk.init(
    dsn="YOUR_DSN",
    # Capture every trace; lower this for high-volume production traffic
    traces_sample_rate=1.0,
    integrations=[OpenAIAgentsIntegration()],
)
```

If your framework isn't on the supported list, manual instrumentation takes about 10 lines of code per span type using Sentry's gen_ai.* span conventions such as gen_ai.invoke_agent, gen_ai.execute_tool, and gen_ai.request.
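As a rough sketch of what that manual path looks like, here's a hypothetical helper that wraps a tool invocation in a gen_ai.execute_tool span. The attribute keys and helper name are assumptions for illustration; check the current span conventions in Sentry's docs before relying on them:

```python
import sentry_sdk

def traced_tool_call(tool_name, tool_fn, **kwargs):
    """Run a tool inside a gen_ai.execute_tool span (hypothetical helper)."""
    with sentry_sdk.start_span(op="gen_ai.execute_tool", name=tool_name) as span:
        span.set_data("gen_ai.tool.name", tool_name)
        output = tool_fn(**kwargs)
        # Record whatever per-call signal you have: output size, tokens,
        # confidence scores, so it's attributable to this agent's span.
        span.set_data("gen_ai.tool.output.size", len(str(output)))
        return output
```

The point is that each tool call becomes its own child span in the trace tree, so a bad output two hops upstream is attributable rather than invisible.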
One thing worth knowing before you set this up: all AI integrations capture prompt and response content by default, since recordInputs and recordOutputs both default to true. If your prompts or responses contain sensitive data, set both to false. Make sure your privacy policy permits capturing this content before going to production with the defaults enabled.
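In the Python SDK, the equivalent control is the send_default_pii flag on init (some integrations also expose their own prompt-capture options); treat the exact parameter names as something to verify against the docs for your integration. A minimal sketch:

```python
import sentry_sdk
from sentry_sdk.integrations.openai_agents import OpenAIAgentsIntegration

sentry_sdk.init(
    dsn="YOUR_DSN",
    traces_sample_rate=1.0,
    # Don't attach prompt/response bodies or other PII to events.
    send_default_pii=False,
    integrations=[OpenAIAgentsIntegration()],
)
```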
Either way, you end up with a trace tree showing nested agent invocations, tool executions, and LLM calls as child spans. That gives you visibility into execution and performance. Understanding output correctness and decision quality still requires additional validation layers on top.
Seer can help reduce time-to-triage. When a multi-agent task fails and you have a trace spanning five agents, Seer can analyze the error context, surface the most likely source of degradation, and give you a starting point grounded in your actual production data rather than five equally plausible places to begin.
For a full setup guide across supported frameworks, the AI agent observability guide covers instrumentation in detail. Start there if you're setting this up for the first time.
Practical starting point: instrument your orchestrator first. Get the top-level task as a transaction, with each agent call as a child span. Even partial visibility is better than none when you're trying to triage a production degradation at 2am.
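The orchestrator-first pattern can be sketched in a few lines with Sentry's tracing API. The plan_agents helper and the agent/result objects here are hypothetical stand-ins for your own orchestration code:

```python
import sentry_sdk

def handle_task(task):
    # The top-level task becomes the transaction; each agent call is a
    # child span, so even this partial instrumentation yields a tree.
    with sentry_sdk.start_transaction(op="gen_ai.invoke_agent", name="orchestrator"):
        result = None
        for agent in plan_agents(task):  # hypothetical planner
            with sentry_sdk.start_span(op="gen_ai.invoke_agent", name=agent.name) as span:
                result = agent.run(task, context=result)
                span.set_data("gen_ai.usage.total_tokens", result.tokens)
        return result
```

From there you can push instrumentation down into individual tool calls and LLM requests as child spans, one layer at a time.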
This is the readiness question
Teams adopting agentic systems are going to face the same question their SRE teams faced when they migrated from monolith to microservices: how do we know this is working?
It's not a question that stays abstract for long. The first time an agent-orchestrated workflow produces a wrong answer at scale, or quietly runs up a token bill nobody can attribute, or degrades in a way that no individual span flagged, that's when the question becomes urgent. By then, the teams that already instrumented are triaging. The teams that didn't are left guessing.
The teams that answer that question first with real traces, real attribution, and real alerting are the ones that get to keep running agents in production. The others roll it back after the first incident they can't explain.
Multi-agent observability isn't a nice-to-have at scale. It's table stakes for anyone taking agents beyond the prototype phase. The complexity doesn't ask permission before it shows up in production. It's already there.

