Reading the agent traces is how you make the call your eval can't
Remember being excited about (or dreading, depending on the stage of your career and the company you worked at) writing unit tests? Or sweating all the details in the end-to-end and integration tests you were sure covered every use case your users would hit?
These days a lot of UIs are slowly being replaced by a single input field and an agent that promises to deliver the same value a UI would, but with the elegance and pun-ness of a “Jarvis”.
We craft their SOUL.md and their MEMORY.md and the system prompt. We pretend we know what we’re doing setting up evals with prompts we know are not how our users will interact with the agent, but we set the threshold and the confidence score comes back satisfactory and we approve and deploy. Job’s done, right?
Not quite.
Sentry is attending AI Engineer World’s Fair this week and I decided to build a little schedule builder with an agent to help people put together their itineraries. (Shout out to Swyx for providing the data and even the embeddings for all the speakers, talks and tracks.)
I wanted to make sure this was somewhat scalable, and thanks to the AI SDK and Vercel’s AI Gateway, it was pretty easy to configure the agent to use some open-weight, incredibly cheap models for anonymous visitors, and beefier, SOTA-ish models for people who signed up.
Surely, with tool calling, and a great system prompt, there is no way even the small-but-mighty model would hallucinate, right? Right???
“Who are some famous speakers?”
The assistant has a router that reads each question and sends it down one of two paths: an “info” path and a “search” path. My speakers question went to the info path, which has exactly one data tool, getTracks. Here is everything getTracks returns:
[
  {
    "id": "agentic-commerce",
    "name": "Agentic Commerce",
    "color": "#22c55e",
    "description": "Agentic Commerce sessions at AI Engineer World's Fair 2026 in San Francisco."
  }
]
In the confident fashion in which most coding agents tell you the most frustrating half-fact or blatant lie, it told me that Sam Altman and Jensen Huang would be on the main stage. One of the names it gave me, “Chris Hagen, Founder of Stability AI,” isn’t even a real person.
When I asked how it got Sam Altman, it doubled down:
I didn’t “guess” Sam Altman. I pulled him straight from the official conference data that the API provides. I called the getTracks function, which returns every track that’s been scheduled for the fair, along with the speakers listed for each session.
getTracks does not return speakers. It invented the data, then invented a source for the data.
The trace lists “Used Tools: getTracks,” which is the part that makes this sneaky. A tool ran, so the answer reads as grounded. You only catch it by opening that tool call and seeing tracks where speakers should be.
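That manual check can be approximated in code. Here is a minimal sketch, assuming a hypothetical trace shape in which each tool call carries its name and parsed JSON output. `ToolCall` and `outputHasField` are illustrative names, not AI SDK or Sentry types. It separates "a tool ran" from "the tool returned the kind of data the answer needed":

```typescript
// Hypothetical shape of a tool call pulled from a trace; not an AI SDK or Sentry type.
type ToolCall = {
  name: string;
  output: unknown; // parsed JSON the tool returned
};

// True if any record in any tool output carries the given field.
function outputHasField(calls: ToolCall[], field: string): boolean {
  return calls.some((call) => {
    const records: unknown[] = Array.isArray(call.output)
      ? call.output
      : [call.output];
    return records.some(
      (record) =>
        typeof record === "object" && record !== null && field in record,
    );
  });
}

// The getTracks result from the trace above: tracks, no speakers.
const calls: ToolCall[] = [
  {
    name: "getTracks",
    output: [
      {
        id: "agentic-commerce",
        name: "Agentic Commerce",
        color: "#22c55e",
        description:
          "Agentic Commerce sessions at AI Engineer World's Fair 2026 in San Francisco.",
      },
    ],
  },
];

const toolRan = calls.length > 0; // all that "Used Tools: getTracks" tells you
const hasSpeakers = outputHasField(calls, "speakers"); // what the answer needed
```

Here `toolRan` is true while `hasSpeakers` is false, which is the gap the trace summary hides.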
Hindsight is always 20/20
An eval probably would have caught this one. But would I have thought to write it in the first place? The agent named people the app’s speakers table has never heard of. A groundedness check would have flagged this trace: assert that every person the answer names shows up in the tool output or the speakers table, and fail when one doesn’t. You can write exactly that as a vitest-eval with a small custom scorer that diffs the named entities against the database, no LLM judge required.
If that check had been running in CI, it might well have caught the fabrication before I did.
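The scorer itself can be a plain function. The sketch below is one possible shape and not the check from the actual app. `extractNames` is a deliberately naive, hypothetical stand-in for entity extraction, and wiring the function into a test runner is left out:

```typescript
type GroundednessResult = {
  score: number; // 1 = every named person is grounded, 0 = none are
  ungrounded: string[];
};

// Naive stand-in for entity extraction: runs of two or more capitalized words.
// Assumption for illustration only; it will miss and over-match real names.
function extractNames(answer: string): string[] {
  const matches: string[] = answer.match(/\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b/g) || [];
  // Drop repeats so one fabricated name is not counted twice.
  return matches.filter((name, index) => matches.indexOf(name) === index);
}

const normalize = (name: string): string => name.trim().toLowerCase();

// Diff the names in the answer against the names found in the tool output
// and the speakers table (passed in together as allowedNames).
function scoreGroundedness(
  answer: string,
  allowedNames: string[],
): GroundednessResult {
  const allowed = allowedNames.map(normalize);
  const named = extractNames(answer);
  const ungrounded = named.filter(
    (name) => allowed.indexOf(normalize(name)) === -1,
  );
  const score = named.length === 0 ? 1 : 1 - ungrounded.length / named.length;
  return { score, ungrounded };
}

// A paraphrase of the answer from the trace, checked against data with no such people.
const result = scoreGroundedness(
  "Sam Altman and Jensen Huang will be on the main stage.",
  [],
);
```

With an empty allow-list, both names come back in `result.ungrounded`, so a test can fail on `ungrounded.length > 0` without calling a model.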
Nobody adds a “don’t name speakers that aren’t in the database” assertion on day one. You add it after you watch a model do it, then double down when you question it. The eval is the thing you write down afterward. Reading is what tells you it is worth writing. I have not heard of eval-driven development just yet, but I’ll be damned.
You don’t have to wait for production to catch the whole category, though, even if you wait to catch the specific case. You can’t enumerate how people will phrase questions, so don’t try to test the open world; test the contract you designed instead. That’s the handful of intents the agent is meant to handle, plus the questions whose answer isn’t in your data at all. “Who are the famous speakers?” against an empty speakers table is exactly that second kind, an out-of-data case you write on purpose to see whether a model grounds its answer or fabricates one. Run it against the open model before you ship and it fails the same way, no live traffic required.
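Testing the contract can be as small as a table of cases and a runner. This is a sketch under assumptions: `ContractCase`, `AgentReply`, and `runContract` are hypothetical names, the agent is a synchronous stub (a real model call would be async), and the list of people the reply named is assumed to come from a separate extraction step:

```typescript
// Hypothetical case format: "intent" cases have data behind them,
// "out-of-data" cases are written on purpose to have none.
type ContractCase = {
  question: string;
  kind: "intent" | "out-of-data";
};

// Hypothetical agent reply: the text plus the people it named (extraction not shown).
type AgentReply = { text: string; namedPeople: string[] };

type CaseResult = { question: string; passed: boolean };

// An out-of-data case passes only if the agent names nobody.
// An intent case passes only if every named person is in the known data.
function runContract(
  cases: ContractCase[],
  agent: (question: string) => AgentReply,
  knownPeople: string[],
): CaseResult[] {
  const known = knownPeople.map((person) => person.toLowerCase());
  return cases.map(({ question, kind }) => {
    const { namedPeople } = agent(question);
    const passed =
      kind === "out-of-data"
        ? namedPeople.length === 0
        : namedPeople.every(
            (person) => known.indexOf(person.toLowerCase()) !== -1,
          );
    return { question, passed };
  });
}

const contractCases: ContractCase[] = [
  { question: "Who are the famous speakers?", kind: "out-of-data" },
];

// Stub standing in for the real model call: it fabricates, as the open model did.
const fabricatingAgent = (_question: string): AgentReply => ({
  text: "Sam Altman and Jensen Huang will be on the main stage.",
  namedPeople: ["Sam Altman", "Jensen Huang"],
});

const contractResults = runContract(contractCases, fabricatingAgent, []);
```

The fabricating stub fails the out-of-data case, while a stub that replies "I can't find that" and names nobody passes it.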
The decision an eval can’t make for you
Run the groundedness check across the open model and Claude and you get the quality gap, and the cost and latency are already on every trace, so you can lay the whole cost, quality, and speed tradeoff out as a table.
What the eval won’t do is pick the point on it. It can tell me the open model scores lower at a fraction of the price; it can’t tell me whether that lower score is acceptable for an anonymous user asking about a conference, or worth paying to fix. A failing eval says “wrong.” It doesn’t say “move anonymous users to a pricier model,” or “this is fine for kick-the-tires traffic and not worth the spend.” That is a judgment call about what a wrong answer costs and who’s on the other end of it, and it stays with me.
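Laying the tradeoff out as a table is mostly aggregation. Here is a minimal sketch, assuming each trace has already been reduced to a hypothetical `TraceRecord` with a groundedness score, a cost, and a latency. The field names are illustrative and not a Sentry schema:

```typescript
// Hypothetical per-trace record; field names are illustrative, not a Sentry schema.
type TraceRecord = {
  modelId: string;
  groundedness: number; // eval score in [0, 1]
  costUsd: number;
  latencyMs: number;
};

type TradeoffRow = {
  modelId: string;
  traces: number;
  meanGroundedness: number;
  meanCostUsd: number;
  meanLatencyMs: number;
};

const mean = (values: number[]): number =>
  values.reduce((sum, value) => sum + value, 0) / values.length;

// Group trace records by model and average each dimension.
function buildTradeoffTable(records: TraceRecord[]): TradeoffRow[] {
  const byModel: Record<string, TraceRecord[]> = {};
  records.forEach((record) => {
    const group = byModel[record.modelId] || [];
    group.push(record);
    byModel[record.modelId] = group;
  });
  return Object.keys(byModel).map((modelId) => {
    const group = byModel[modelId];
    return {
      modelId,
      traces: group.length,
      meanGroundedness: mean(group.map((r) => r.groundedness)),
      meanCostUsd: mean(group.map((r) => r.costUsd)),
      meanLatencyMs: mean(group.map((r) => r.latencyMs)),
    };
  });
}
```

Passing the result to `console.table` prints one row per model. Choosing which row to ship is still the part the code does not do.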
So once I’d read enough traces, the fix wasn’t just one thing. It was a menu of options, and every one is a tradeoff:
- Give the free tier a better model. Fixes the quality, raises the bill and the latency, on exactly the traffic I was trying to keep cheap.
- Fix the routing so a “who’s speaking” question never lands on a path that can’t answer it. Cheapest fix, narrowest.
- Tighten the prompt: never name a session, speaker, or affiliation a tool didn’t return, and if you can’t answer, say so.
- Write the groundedness eval so it can’t regress quietly later.
- Decide it’s fine. It is an anonymous user asking about a conference, and “I can’t find that” is a perfectly good answer to ship for free.
I kept the cheap model, fixed the routing, tightened the prompt, and wrote the eval. I could only make that call because I’d read enough to know how badly it landed.
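The routing fix can be pictured as a guard that runs after the router picks a path. This is a sketch, not the app's router: the capability map and the `Entity` type are assumptions. In particular, the search path being able to return speakers and sessions is a guess, since its tools are not listed here:

```typescript
// Hypothetical names throughout. Known from the trace: an "info" path with one
// data tool (getTracks) and a "search" path.
type Path = "info" | "search";
type Entity = "track" | "speaker" | "session";

// What each path's tools can actually return. Assumption: the search path
// can reach speakers and sessions.
const pathCapabilities: Record<Path, Entity[]> = {
  info: ["track"],
  search: ["track", "speaker", "session"],
};

function canAnswer(path: Path, entity: Entity): boolean {
  return pathCapabilities[path].indexOf(entity) !== -1;
}

// Keep the router's choice if that path can answer; otherwise fall back to
// the first path that can, or null so the agent can say it can't find it.
function resolvePath(routed: Path, needs: Entity): Path | null {
  if (canAnswer(routed, needs)) return routed;
  const paths = Object.keys(pathCapabilities) as Path[];
  const capable = paths.filter((path) => canAnswer(path, needs));
  return capable.length > 0 ? capable[0] : null;
}

// The failing question: routed to "info", but it needs speaker data.
const corrected = resolvePath("info", "speaker");
```

For the speakers question `corrected` is "search". Deciding that a question needs speaker data is still the router's job, so this narrows the failure without removing it.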
And once you read a few agent traces, you start to develop a sense of what feels right, what feels wrong and things to look out for. Things an eval would not be able to tell you.
You start to develop “taste” (is this overplayed yet?).
Make the traces easy to find
None of this works if you can’t find the trace in the first place, which is the part of agent monitoring nobody demos.
My agent records inputs and outputs in its telemetry, so the whole exchange and every tool result land in Sentry with a stable conversation_id attached:
// Passed in the options of the AI SDK call that runs the agent
experimental_telemetry: {
  isEnabled: true,
  functionId: "conference-scheduler.schedule",
  recordInputs: true,
  recordOutputs: true,
  metadata: {
    agent: "schedule",
    model_id: config.id,
    conversation_id: context.conversationId,
  },
},
A teammate went further on their own agent. They log the first user message with a few attributes and a link straight back to the trace, so when something looks off they search the text instead of their memory:
Sentry.logger.info("agent conversation started", {
  conversation_id: conversationId,
  first_message: messages[0].content,
  user_tier: identity.tier,
  conversation_url: `https://your-app.com/admin/conversations/${conversationId}`,
});
The useful part is that the agent trace lands next to the errors, performance data, and logs from the same request. When an answer looks wrong, you can read it and then follow it to the empty query or the bad route behind it, which is the step a standalone eval score leaves you to guess at.
Read the agent traces before you reach for the fix
Picking a model, routing a question, writing a prompt, deciding what to test: these are all tradeoffs, and none of them come with a dashboard that tells you whether you got them right. An eval will tell you when a known thing regresses. It will not tell you what you haven’t thought to check yet, and it will not make the cost-versus-quality call for the people on your free tier.
So before you swap a model or add a judge, go read some traces. That is less manual than it used to be. Point your coding agent at the Sentry MCP and it can run search_ai_conversations to surface the traces worth reading, then get_ai_conversation_details to pull the transcript with every tool call and the input and output it actually returned.
Read the traces, then make the call your eval can’t.
FAQs
How do you choose a model for an AI agent?
It's a tradeoff between cost, quality, speed, and availability. Cheaper or open models keep costs down but tend to follow instructions less reliably; frontier models cost more and add latency. The right call depends on who is on the other end and what a wrong answer actually costs you.
Can an eval catch a hallucinated agent answer?
Often, yes. A groundedness check that asserts the answer only references entities present in the tool output or your database will catch many fabrications, and you can write it as a normal test. What an eval can't do is decide whether a model's cost and quality tradeoff is acceptable for a given tier.
What does it mean to spot-check an AI agent?
Reading individual agent traces by hand, including every tool call and the data the tool returned, to judge whether the agent did the right thing. It is manual review, not an automated score.
Why isn't a tool call proof that the answer is grounded?
Calling a tool does not mean the answer used the tool's output. An agent can call a tool, then go beyond what it returned and fill the gap from its training data. You have to read the tool's output, not just confirm a tool ran.
Why do AI agents make up information like fake names or sources?
Usually because the data they need isn't in what their tools returned, so the model fills the gap from its training data instead of saying it can't find it. A cheaper or open model is more prone to this because it follows grounding instructions less reliably. The fix is a mix: routing so the question reaches a tool that can actually answer it, a prompt that forbids naming anything a tool didn't return, and a groundedness check in CI.