
Debugging multi-agent AI: When the failure is in the space between agents

Sergiy Dybskiy

I've been building a multi-agent research system. The idea is simple: give it a controversial technical topic like "Should we rewrite our Python backend in Rust?", and three agents work on it. An Advocate argues for it, a Skeptic argues against, and a Synthesizer reads both briefs blind and produces a balanced analysis. Each agent has its own model, its own tools, its own system prompt.

It worked great in testing. Then I noticed the Synthesizer kept producing analyses that leaned heavily toward one side. Not wrong, but noticeably lopsided. Granted, rewriting the Sentry monorepo in Rust is arguably a bad idea, but the Synthesizer was arguing against positions where I clearly knew it should have been for them.

I eventually traced it to the Skeptic’s web_search tool. The Advocate was returning 3-4 solid data points per query. The Skeptic, however, was searching for different terms that didn't match the data as well, and was getting back a single generic result. So the Advocate's brief was well-sourced with citations, and the Skeptic's brief was... vibes. The Synthesizer did what any reasonable reader would do: it weighted the better-sourced argument more heavily.

The bug was in a tool call, inside one agent, that silently degraded the input to a completely different agent two steps later. I only found it by clicking through the trace and reading tool outputs at each step.

What is multi-agent observability?

Multi-agent observability is visibility into how multiple AI agents coordinate, hand off work, and influence each other's decisions.

You probably already know single-agent observability: one reasoning chain, some tool calls, a response. The multi-agent version tracks a graph of interconnected reasoning chains where the output of one agent becomes the input of another. A failure anywhere in the graph can silently corrupt everything downstream.

If you're running a single agent with a few tools, standard agent observability has you covered. But the moment you have agents calling other agents, delegating subtasks, or running in parallel with results merged later, you need a different level of visibility.

Why single-agent monitoring doesn't cut it here

Your existing agent monitoring tells you that Skeptic ran in 3.1 seconds and consumed 2,400 tokens. It does not tell you that Skeptic's web_search returned weak results, that the brief it produced was thin compared to the Advocate's, and that the Synthesizer produced a biased analysis because one of its inputs was poor.

There are three specific reasons this falls apart.

Blame is distributed. When the final output is wrong, you can't point at one agent. The Advocate built a reasonable argument from what its tools gave it. The Synthesizer did a reasonable synthesis of what it received. The bug is in the interaction between them, and no single agent's logs will show it.

The worst failures look fine. In traditional software, things throw errors. In multi-agent AI, an agent returns a plausible-but-thin result, the next agent incorporates it without question, and by the time the final output arrives, weak data has been confidently summarized through multiple layers. You'd never know unless you compared the raw inputs.

You can't test every path. A single agent with 5 tools has 5 possible actions per step. Three agents with 5 tools each, running in parallel and merging results? The number of possible execution paths is absurd. You need to observe what actually happens in production because you can't pre-test every combination.
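
As back-of-envelope arithmetic (assuming, purely for illustration, four reasoning steps per agent):

```python
# Back-of-envelope path counting. The 4-step depth is an assumed
# illustration, not a property of any particular system.
single_agent = 5 ** 4             # one agent, 5 tools, 4 steps: 625 paths
three_agents = single_agent ** 3  # three such agents in parallel, merged: 244,140,625 paths
```

Even this toy count makes the point: exhaustive pre-deployment testing stops being an option once agents run in parallel.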

Most "multi-agent" examples are actually single-agent

Before going further, I want to be honest. I built a multi-agent startup idea validator as my first attempt at this playground, and then realized... it was fake multi-agent. A "Market Analyst" handing off to a "Technical Advisor" handing off to a "Devil's Advocate" is just one agent with different tools. A single agent with all the tools and a comprehensive system prompt produces the same output with less latency and less cost.

Microsoft's Cloud Adoption Framework puts it directly: "Don't assume role separation requires multiple agents. Distinct roles might suggest multiple agents, but they don't automatically justify a multi-agent architecture."

Multi-agent earns its pain when:

  • Objectives genuinely conflict. An agent told to "argue for" and "argue against" in the same prompt produces mediocre output at both. A generator and a critic need to be separate, or the critic pulls its punches.

  • Information must be isolated. If Agent A seeing Agent B's work would bias the result, they can't share a context window. Advocate/skeptic. Blind peer review.

  • Different models serve different roles. Cheap fast model for research, expensive capable model for synthesis. One agent means one model.

  • Tasks should run in parallel. Two independent research tasks running concurrently as separate agents is genuinely faster than one agent doing them sequentially.

  • Security boundaries require separation. The agent reading user PII shouldn't have database write access.

If your use case doesn't hit at least two of these, start with a single agent and save yourself the debugging pain I'm about to describe.

Common multi-agent architecture patterns

Each pattern produces a different trace shape and breaks in its own way.

Orchestrator / Worker

One agent routes tasks to specialists. This is the most common pattern in the OpenAI Agents SDK, LangGraph, and custom implementations.

POST /api/research (http.server)
└── gen_ai.invoke_agent "Research Director"
    ├── gen_ai.request "chat gpt-5.4"                         ← plan subtasks
    ├── gen_ai.execute_tool "delegate_research"
    │   └── gen_ai.invoke_agent "Web Research Agent"
    │       ├── gen_ai.request "chat gpt-5.4-mini"
    │       ├── gen_ai.execute_tool "web_search"
    │       └── gen_ai.request "chat gpt-5.4-mini"            ← summarize
    ├── gen_ai.execute_tool "delegate_analysis"
    │   └── gen_ai.invoke_agent "Data Analysis Agent"
    │       ├── gen_ai.request "chat gpt-5.4-mini"
    │       ├── gen_ai.execute_tool "query_database"
    │       └── gen_ai.request "chat gpt-5.4-mini"
    └── gen_ai.request "chat gpt-5.4"                         ← synthesize

How it breaks: The orchestrator misclassifies the task and routes to the wrong specialist, who then does perfect work on the wrong problem. Or it passes insufficient context, and the specialist hallucinates what's missing.
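
A framework-free sketch makes the routing failure mode concrete. The worker functions and routing table here are illustrative, not from any SDK:

```python
# Minimal orchestrator/worker routing sketch. Worker names and the
# routing table are hypothetical, for illustration only.

def research_worker(task: str) -> str:
    return f"research findings for: {task}"

def analysis_worker(task: str) -> str:
    return f"analysis of: {task}"

WORKERS = {"research": research_worker, "analysis": analysis_worker}

def orchestrate(task: str, classification: str) -> str:
    # The failure mode lives here: a wrong `classification` still routes
    # somewhere, and the worker returns a plausible answer to the wrong
    # problem with no error raised.
    worker = WORKERS.get(classification)
    if worker is None:
        raise ValueError(f"unknown route: {classification}")
    return worker(task)
```

`orchestrate("compare Rust benchmarks", "analysis")` completes without complaint even if the task was really a research question; only the trace shows which worker actually ran.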

Parallel with merge

Independent agents work concurrently on the same problem, and a final agent merges results. This is what the balanced research system uses, and it's the pattern I think has the most interesting debugging challenges.

Advocate workflow .............. 3.2s  (parallel)
├── gen_ai.invoke_agent "Advocate"
│   ├── gen_ai.request "chat gpt-5.4-mini"         ← plan research
│   ├── gen_ai.execute_tool "web_search"           ← find evidence
│   ├── gen_ai.execute_tool "fetch_benchmark"      ← get numbers
│   └── gen_ai.request "chat gpt-5.4-mini"         ← write brief

Skeptic workflow ............... 2.8s  (parallel)
├── gen_ai.invoke_agent "Skeptic"
│   ├── gen_ai.request "chat gpt-5.4-mini"         ← plan research
│   ├── gen_ai.execute_tool "web_search"           ← find counter-evidence
│   └── gen_ai.request "chat gpt-5.4-mini"         ← write brief

Synthesizer workflow ........... 4.1s  (sequential, after both)
└── gen_ai.invoke_agent "Synthesizer"
    └── gen_ai.request "chat gpt-5.4"              ← blind analysis

How it breaks: Uneven tool quality. If one agent's tool calls return richer data, the merge agent naturally weights that side more heavily. The merge agent has no way to know its inputs were unequal, because it only sees the finished briefs, not the raw tool results underneath. This is the bug I had the pleasure of dealing with while crafting this blog post.

Peer handoffs

Agents transfer control directly to each other. The OpenAI Agents SDK handoff() pattern works this way.

POST /api/chat (http.server)
└── gen_ai.invoke_agent "Triage Agent"
    ├── gen_ai.request "chat gpt-5.4-mini"
    ├── gen_ai.handoff "from Triage Agent to Billing Agent"
    └── gen_ai.invoke_agent "Billing Agent"
        ├── gen_ai.request "chat gpt-5.4-mini"
        ├── gen_ai.execute_tool "check_balance"
        ├── gen_ai.handoff "from Billing Agent to Dispute Specialist"
        └── gen_ai.invoke_agent "Dispute Specialist"
            ├── gen_ai.request "chat gpt-5.4"
            └── gen_ai.execute_tool "file_dispute"

How it breaks: State management at the handoff. When Agent A transfers to Agent B, what gets passed? Full conversation history? A summary? Just the last message? Pass everything and you blow context windows. Summarize and you lose nuance. Bugs in the handoff protocol are the hardest to find because they look like bugs in the receiving agent.
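
The trade-off can be sketched as three context-passing strategies. This is plain Python, not any SDK's handoff API, and the strategy names are mine:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    target: str
    context: list[str]  # what the receiving agent actually sees

def hand_off(history: list[str], target: str, strategy: str) -> Handoff:
    if strategy == "full":
        ctx = list(history)  # blows up context windows as history grows
    elif strategy == "summary":
        ctx = [f"summary of {len(history)} messages"]  # lossy: nuance disappears
    else:  # "last": the receiver loses everything before the final message
        ctx = history[-1:]
    return Handoff(target=target, context=ctx)
```

Whichever strategy you pick, a bug in it surfaces downstream as "the Billing Agent gave a bad answer," which is why you need the prompt contents at both sides of the boundary.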

What makes multi-agent debugging different

There are a few specific problems you only hit when multiple agents are involved.

  • Blame attribution across boundaries. When a multi-agent system returns wrong output, the question is: did the right agent receive the task? Did it get the right context? Did it do bad work with good input, or good work with bad input? Without traces that span the full agent graph, you're reading each agent's logs in isolation trying to reconstruct what happened at the boundaries.

  • Silent cascading failures. This is the one that got me. An agent returns a plausible response, the downstream agent accepts it, and the final output is wrong, but every span shows status: ok. To catch these, you need to be able to compare input and output at each agent boundary and see the full prompt and response at each LLM call. Token counts and latency alone won't help.

  • Context drift across handoffs. Every time an agent summarizes before passing to the next, information is lossy-compressed. After three handoffs, the original user intent can be barely recognizable. In a trace, you can see this by reading the prompts in sequence: the first agent has the full query, the second has a summary, the third has a summary of a summary. The fix is usually architectural (pass structured data instead of natural language), but you have to see the drift before you can fix it.

  • Cost explosion without attribution. In our research system, the Synthesizer uses gpt-5.4 while the researchers use gpt-5.4-mini. Without per-agent cost tracking, you'd see total spend growing but wouldn't know the Synthesizer accounts for 60% of the cost despite running only once per query.
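
The "pass structured data instead of natural language" fix for context drift can be sketched like this (the `ResearchBrief` fields are hypothetical):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ResearchBrief:
    position: str
    claims: list[str]
    sources: list[str]

def handoff_payload(brief: ResearchBrief) -> str:
    # A JSON payload survives any number of handoffs byte-for-byte,
    # whereas a prose summary gets re-summarized (and degraded) at every hop.
    return json.dumps(asdict(brief))

def receive(payload: str) -> ResearchBrief:
    return ResearchBrief(**json.loads(payload))
```

The structured round trip is lossless, which is exactly the property a chain of natural-language summaries lacks.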

A debugging walkthrough with the balanced research system

Here's how I actually found the bug from the opening. The Synthesizer was producing lopsided analyses, and I wanted to figure out why.

Comparing the parallel agents

First thing I did was look at both research agent workflows side by side in the trace view:

Advocate workflow .......................... 3.2s  ✓
├── gen_ai.invoke_agent "Advocate" ......... 3.1s
│   ├── gen_ai.request "chat gpt-5.4-mini" . 0.6s  ← plan
│   ├── gen_ai.execute_tool "web_search" ... 0.2s  ← "rust performance"
│   ├── gen_ai.execute_tool "web_search" ... 0.1s  ← "rust adoption"
│   ├── gen_ai.execute_tool "fetch_benchmark" 0.1s ← rust benchmarks
│   └── gen_ai.request "chat gpt-5.4-mini" . 1.8s  ← write brief

Skeptic workflow ........................... 2.8s  ✓
├── gen_ai.invoke_agent "Skeptic" .......... 2.7s
│   ├── gen_ai.request "chat gpt-5.4-mini" . 0.5s  ← plan
│   ├── gen_ai.execute_tool "web_search" ... 0.1s  ← "python migration costs"
│   └── gen_ai.request "chat gpt-5.4-mini" . 1.9s  ← write brief

The asymmetry was immediately obvious. The Advocate made 3 tool calls. The Skeptic made 1.

Inspecting the tool results

Clicking into the Advocate's web_search spans, each returned 3-4 data points:

["Rust programs typically run 2-5x faster than equivalent Python...",
 "Discord switched from Go to Rust... latency drop from 50ms to 1ms",
 "Figma rewrote their multiplayer server... memory usage by 10x"]

The Skeptic's single web_search had searched for "python migration costs":

["No specific data found for 'python migration costs'. Consider refining your search terms."]

So the Skeptic wrote its brief from general knowledge with no citations, while the Advocate had 10+ data points from 3 searches.

Following it to the Synthesizer

Clicking the Synthesizer's gen_ai.request span and reading the prompt confirmed it. It received one well-sourced brief with citations and benchmark data, and one brief with general arguments and no data. It weighted the better-sourced one more heavily, which is exactly what you'd want a synthesizer to do. The problem was upstream.

The fix

Two options: improve the Skeptic's prompt to try multiple search queries when the first returns weak results, or improve the web_search tool to handle broader query terms. I did both. Watched the traces afterward, and both agents were producing comparably sourced briefs.
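
The retry side of that fix can be sketched as a wrapper around whatever search function the tool calls. `search_fn`, the query list, and the threshold are assumptions, not my actual implementation:

```python
def search_with_fallback(search_fn, queries: list[str], min_results: int = 2) -> list[str]:
    # Try progressively broader queries until enough data points come back,
    # instead of writing a brief from a single weak result.
    results: list[str] = []
    for q in queries:
        results = search_fn(q)
        if len(results) >= min_results:
            return results
    return results  # best effort; the caller can flag the brief as weakly sourced
```

The point is that the agent (or tool) notices a thin result and reacts, rather than silently passing vibes downstream.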

The root cause was a weak tool result for one agent that cascaded through the pipeline as information asymmetry. Without seeing every tool call and every prompt in the trace, I would have blamed the Synthesizer's prompt for being biased.

Auto-instrumenting multi-agent frameworks

Sentry auto-instruments the OpenAI Agents SDK, LangGraph, CrewAI, and other frameworks. The integration activates automatically when the package is detected. Here's the setup for the balanced research system:

import asyncio
import sentry_sdk
from agents import Agent, Runner, function_tool, ModelSettings

sentry_sdk.init(
    send_default_pii=True,   # captures prompts and responses in spans
    traces_sample_rate=1.0,
    enable_logs=True,
)

@function_tool
def web_search(query: str) -> str:
    """Search the web for information on a topic."""
    ...

advocate = Agent(
    name="Advocate",
    model="gpt-5.4-mini",
    model_settings=ModelSettings(temperature=0.3),
    instructions="Build the strongest case FOR the position...",
    tools=[web_search],
)

skeptic = Agent(
    name="Skeptic",
    model="gpt-5.4-mini",
    model_settings=ModelSettings(temperature=0.3),
    instructions="Build the strongest case AGAINST the position...",
    tools=[web_search],
)

synthesizer = Agent(
    name="Synthesizer",
    model="gpt-5.4",
    model_settings=ModelSettings(temperature=0.5),
    instructions="Produce balanced analysis from two research briefs...",
)

async def analyze(topic: str):
    # Parallel execution: two independent trace trees
    advocate_result, skeptic_result = await asyncio.gather(
        Runner.run(advocate, topic),
        Runner.run(skeptic, topic),
    )

    synthesis_input = f"""
    Brief A: {advocate_result.final_output}
    Brief B: {skeptic_result.final_output}
    """
    return await Runner.run(synthesizer, synthesis_input)

SENTRY_DSN is read from the environment. send_default_pii=True is what enables prompt and response capture in spans, which is essential for debugging the handoff problems described above. The SDK creates gen_ai.invoke_agent spans for each agent, gen_ai.execute_tool spans for tool calls, and gen_ai.request spans for LLM calls with token counts and model info.

For JavaScript/TypeScript with the Vercel AI SDK or LangChain, use tracesSampler to capture AI routes at 100%:

import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  sendDefaultPii: true,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    if (attributes?.['sentry.op']?.startsWith('gen_ai.')) {
      return 1.0;
    }
    if (name?.includes('/api/chat') || name?.includes('/api/agent')) {
      return 1.0;
    }
    return inheritOrSampleWith(0.2);
  },
});

For more on why you should sample AI traces at 100%, see the companion post on sampling strategies for agentic applications.

Building multi-agent dashboards

Pre-built agent dashboards show per-model and per-tool aggregates. For multi-agent systems, you need to slice by agent. Some dashboards you can build with the Sentry CLI:

Per-agent cost attribution:

sentry dashboard widget add 'Multi-Agent Monitoring' "Cost by Agent" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" "count" \
  --where "span.op:gen_ai.invoke_agent" \
  --group-by "gen_ai.agent.name" \
  --sort "-sum:gen_ai.usage.total_tokens"

This is how I found out the Synthesizer was 60% of my cost despite running once per query (because it uses gpt-5.4 instead of gpt-5.4-mini).

Tool reliability by agent:

sentry dashboard widget add 'Multi-Agent Monitoring' "Tool Errors by Agent" \
  --display table --dataset spans \
  --query "failure_rate" "count" \
  --where "span.op:gen_ai.execute_tool" \
  --group-by "gen_ai.agent.name" "gen_ai.tool.name" \
  --sort "-failure_rate"

If the Skeptic's web_search returns empty results 15% of the time while the Advocate's returns empty 3% of the time, you've found your lopsided synthesis problem before users report it.

Agent duration comparison:

sentry dashboard widget add 'Multi-Agent Monitoring' "Agent Duration p95" \
  --display bar --dataset spans \
  --query "p95:span.duration" \
  --where "span.op:gen_ai.invoke_agent" \
  --group-by "gen_ai.agent.name"

Agents doing similar work should take similar time. Big duration gaps between parallel agents usually mean one is making more (or fewer) tool calls than expected.

What I'd recommend if you're building multi-agent systems

Based on debugging this system and reading a lot of traces:

Capture prompts and responses at every agent boundary. This is the send_default_pii=True flag. Token counts show cost, but the prompts, responses, and tool input/output data are where you'll actually find bugs. The handoff boundaries between agents are where most multi-agent issues live.

Name your agents clearly. "Agent" and "Sub-Agent" in your trace view tells you nothing. "Advocate" and "Skeptic" and "Synthesizer" tells a story you can follow.

Compare parallel agents. When agents run concurrently and their outputs merge, the merge agent can't tell if its inputs were equally good. But you can tell from the traces. Look for asymmetry in tool call counts, token usage, and duration between agents that should be doing similar work.

Sample at 100%. This matters even more for multi-agent than single-agent. A run that fails on a specific combination of tool results might happen 1 in 50 times. At 10% sampling, you'll need 500 runs before you capture one. See how to sample AI traces at 100% for the setup.

Alert on tool failure rates per agent, not globally. A tool that fails 5% globally might fail 20% for one specific agent because of how it formulates queries. Global averages hide per-agent problems.

Connect to your full stack. A slow web_search tool might be caused by rate limiting from an upstream API, not an agent issue. Multi-agent traces that sit inside your existing distributed traces let you see everything.
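
To make the "compare parallel agents" check above concrete, here's a small sketch that counts tool-call spans per agent from exported span data. The span dict shape is simplified from what Sentry actually emits:

```python
def tool_call_asymmetry(spans: list[dict]) -> dict[str, int]:
    # Count gen_ai.execute_tool spans per agent. A large gap between
    # agents that should be doing similar work is the smell to look for.
    counts: dict[str, int] = {}
    for span in spans:
        if span.get("op") == "gen_ai.execute_tool":
            agent = span.get("agent", "unknown")
            counts[agent] = counts.get(agent, 0) + 1
    return counts
```

Run against the traces from the walkthrough above, this would have surfaced the 3-vs-1 asymmetry between Advocate and Skeptic without clicking through a single span.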

Getting started

If you're already using Sentry for agent monitoring, multi-agent traces work automatically. The SDKs detect agent invocations, handoffs, and tool calls.

Starting fresh:

  1. pip install sentry-sdk or npm install @sentry/node

  2. Initialize with traces_sample_rate=1.0 and send_default_pii=True

  3. Run your multi-agent workflow. Spans appear in Sentry's trace view.

For setup across 10+ frameworks, see the AI agent observability guide.
