Sample AI traces at 100% without sampling everything
A little while ago, when agents were telling me “You’re absolutely right!”, I was building webvitals.com. You put in a URL, and it kicks off an API request to a Next.js API route that invokes an agent with a few tools to scan it and provide AI-generated suggestions to improve your… you guessed it… Web Vitals. Do we even care about these anymore?
I had `tracesSampleRate` set to 100% in development, but in production I sampled it down to 10% because… well, that’s what our instrumentation recommends. Kyle wrote a great blog post explaining that “Watching everything is watching nothing”. But AI is non-deterministic. And when I was debugging an error from a tool call, I realized I was missing very important spans emitted from the Vercel AI SDK because of that sampling strategy.
An agent run with 7 tool calls doesn't get partially sampled. You either capture the whole span tree or you lose it entirely. This is how head-based sampling works.
I was chasing ghosts.
Agent runs are span trees, and sampling is all-or-nothing
A typical agent execution looks like this in Sentry's trace view:
```text
POST /api/chat (http.server)
└── gen_ai.invoke_agent "Research Agent"
    ├── gen_ai.request "chat claude-sonnet-4-6"    ← initial reasoning
    ├── gen_ai.execute_tool "search_docs"          ← tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"    ← process results
    ├── gen_ai.execute_tool "summarize"            ← second tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"    ← decides to hand off
    └── gen_ai.execute_tool "transfer_to_writer"   ← handoff via tool
        └── gen_ai.invoke_agent "Writer Agent"
            ├── gen_ai.request "chat gemini-2.5-flash"
            └── gen_ai.execute_tool "format_output"
```

That's 11 spans in a single run. The sampling decision happens once, at the root: the `POST /api/chat` HTTP transaction. Every child span inherits that decision. If the root is dropped, all 11 spans disappear.
This is fundamentally different from sampling HTTP requests, where dropping one GET /api/users is no big deal because the next one is basically identical.
Agent runs are not identical. Each one makes different decisions, calls different tools, processes different data. An agent that hallucinated on run 67 might work perfectly on run 420. If your sample rate dropped run 67, you'll never know what went wrong.
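To put a number on that risk, here's a quick back-of-the-envelope calculation (illustrative math, not SDK code): with head-based sampling, the chance of capturing at least one trace of a bug depends only on the sample rate and how many runs hit the bug.

```python
# Probability of capturing at least one trace of a failure
# under head-based sampling. Illustrative math only.

def p_capture_at_least_one(sample_rate: float, failing_runs: int) -> float:
    """Chance that at least one of `failing_runs` traces is sampled."""
    return 1 - (1 - sample_rate) ** failing_runs

# A bug that showed up in 5 agent runs, traced at the 10% production rate:
print(round(p_capture_at_least_one(0.10, 5), 3))  # ≈ 0.41 — worse than a coin flip
# The same bug with AI routes sampled at 100%:
print(p_capture_at_least_one(1.0, 5))             # 1.0 — every occurrence captured
```

At 10% you miss all five occurrences of that hallucination more often than not.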
How head-based sampling actually works (and why it matters here)
Both the Sentry JavaScript and Python SDKs use head-based sampling: the decision is made at the start of the trace, before any child spans exist.
In the JavaScript SDK, SentrySampler.shouldSample() is explicit about this:
```javascript
// We only sample based on parameters (like tracesSampleRate or tracesSampler)
// for root spans. Non-root spans simply inherit the sampling decision
// from their parent.
```

Non-root spans don't get a vote. If the root span was dropped, `tracesSampler` is never called for any child, including your `gen_ai.request` and `gen_ai.execute_tool` spans. They inherit the parent's fate.
In Python, the same logic lives in Transaction._set_initial_sampling_decision(). The traces_sampler callback receives a sampling_context dict with transaction_context (containing op and name) and parent_sampled. It only fires for root transactions.
This means head-based sampling doesn't support independently sampling gen_ai child spans at a different rate than their parent transaction. There's no "sample 100% of LLM calls but 10% of HTTP requests." If the HTTP request is dropped, the LLM calls inside it are dropped too.
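As a toy model of that decision order (this mimics the logic described above, not the SDK's actual implementation):

```python
import random
from typing import Optional

def head_based_decision(is_root: bool, parent_sampled: Optional[bool],
                        sample_rate: float) -> bool:
    """Only root spans consult the sample rate; children
    inherit the parent's fate and never get a vote."""
    if not is_root:
        return bool(parent_sampled)
    return random.random() < sample_rate

# Root HTTP transaction dropped -> every gen_ai child is dropped too,
# no matter what rate you'd like to apply to the children.
root = head_based_decision(is_root=True, parent_sampled=None, sample_rate=0.0)
child = head_based_decision(is_root=False, parent_sampled=root, sample_rate=1.0)
print(root, child)  # False False — the child's own rate is irrelevant
```

The `sample_rate=1.0` on the child does nothing; the only lever you have is the decision at the root.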
Let's walk through a few scenarios to show how the filtering approach differs depending on whether the root span comes from an agent or from the application.
Scenario 1: The gen_ai span IS the root
Sometimes your agent run is the root span. Maybe it's a cron job that's running an agent, a queue consumer processing an AI task, or a CLI script. In these cases, `tracesSampler` sees the `gen_ai.*` operation directly and you can match on it:
JavaScript:
```javascript
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    // Standalone gen_ai root spans - always sample
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }
    return inheritOrSampleWith(0.2);
  },
});
```

Python:
```python
def traces_sampler(sampling_context):
    op = sampling_context.get("transaction_context", {}).get("op", "")
    # Standalone gen_ai root spans - always sample
    if op.startswith("gen_ai."):
        return 1.0
    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)
    return 0.2

sentry_sdk.init(dsn="...", traces_sampler=traces_sampler)
```

This is the easy case. The hard case is next.
Scenario 2: The gen_ai spans are children of an HTTP transaction
This is the common case in web applications. A user hits `POST /api/chat`, your framework creates an `http.server` root span, and somewhere inside that request handler your agent runs. By the time the first `gen_ai.request` span is created, the sampling decision was already made for the HTTP transaction.
The fix: identify which routes trigger AI calls and sample those routes at 100%.
JavaScript:
```javascript
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    // Standalone gen_ai root spans
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }
    // HTTP routes that serve AI features - always sample
    if (name?.includes('/api/chat') ||
        name?.includes('/api/agent') ||
        name?.includes('/api/generate')) {
      return 1.0;
    }
    return inheritOrSampleWith(0.2);
  },
});
```

Python:
```python
def traces_sampler(sampling_context):
    tx_context = sampling_context.get("transaction_context", {})
    op = tx_context.get("op", "")
    name = tx_context.get("name", "")
    # Standalone gen_ai root spans
    if op.startswith("gen_ai."):
        return 1.0
    # HTTP routes that serve AI features - always sample
    if op == "http.server" and any(
        p in name for p in ["/api/chat", "/api/agent", "/api/generate"]
    ):
        return 1.0
    # Honour parent decision in distributed traces
    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)
    return 0.2

sentry_sdk.init(dsn="...", traces_sampler=traces_sampler)
```

Replace the route strings with whatever paths your AI features live on. If your entire app is AI-powered, skip the `tracesSampler` and just set `tracesSampleRate: 1.0`.
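One caveat with substring matching: `/api/chat` also matches `/api/chatterbox`. If that matters for your routes, a small prefix-based matcher tightens it up. This is a sketch of a hypothetical helper, not an SDK feature; the patterns are the example routes from above:

```python
import re

# Hypothetical helper: match AI routes by path prefix instead of substring.
AI_ROUTE_PATTERNS = [
    re.compile(r"^/api/chat(/|$)"),
    re.compile(r"^/api/agent(/|$)"),
    re.compile(r"^/api/generate(/|$)"),
]

def is_ai_route(name: str) -> bool:
    """True when the transaction name's path starts with an AI route."""
    # Transaction names often look like "POST /api/chat"; strip the method.
    path = name.split(" ", 1)[-1]
    return any(p.match(path) for p in AI_ROUTE_PATTERNS)

print(is_ai_route("POST /api/chat"))        # True
print(is_ai_route("GET /api/chat/stream"))  # True
print(is_ai_route("POST /api/chatterbox"))  # False
```

Call `is_ai_route(name)` inside your `traces_sampler` where the substring checks live now.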
The cost math: AI API bills dwarf observability costs
The instinct to sample AI traces at a lower rate usually comes from cost concerns. Let's look at the actual numbers.
| What | Cost per event |
|---|---|
| Claude Sonnet 4 input (1K tokens) | ~$0.003 |
| Claude Sonnet 4 output (1K tokens) | ~$0.015 |
| Gemini 2.5 Flash input (1K tokens) | ~$0.00015 |
| Gemini 2.5 Flash output (1K tokens) | ~$0.0006 |
| A typical agent run (3 LLM calls, 2 tool calls) | $0.02-$0.15 |
| Sentry span events for that agent run (~9 spans) | Fraction of a cent |
The LLM calls themselves are 10-100x more expensive than the monitoring. You're already paying for the AI call; dropping the observability span to save a fraction of a cent per call is like skipping the dashcam to save on gas.
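Plugging the table's per-1K rates into a concrete run makes the gap obvious. The token counts per call and the per-span cost below are illustrative assumptions, not quoted prices:

```python
# Back-of-the-envelope cost of one agent run vs. tracing it.
INPUT_PER_1K = 0.003   # Claude Sonnet input, per the table above
OUTPUT_PER_1K = 0.015  # Claude Sonnet output

llm_calls = 3
input_tokens_per_call = 2_000   # assumed
output_tokens_per_call = 1_000  # assumed

llm_cost = llm_calls * (
    input_tokens_per_call / 1_000 * INPUT_PER_1K
    + output_tokens_per_call / 1_000 * OUTPUT_PER_1K
)
span_cost = 11 * 0.0001  # ~11 spans at a fraction of a cent each (assumed)

print(f"LLM cost per run:  ${llm_cost:.3f}")   # ≈ $0.063
print(f"Span cost per run: ${span_cost:.4f}")  # $0.0011
print(f"Ratio: ~{llm_cost / span_cost:.0f}x")
```

Even with conservative assumptions, the model bill dominates by well over an order of magnitude.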
When 100% tracing isn't feasible: Metrics and Logs as a safety net
If you genuinely can't sample AI routes at 100%, because of, say, massive scale or strict budget constraints, you can still capture the important signals from every AI call using Sentry Metrics and Logs. Both are independent of trace sampling.
JavaScript - emit metrics on every LLM call:
```javascript
import * as Sentry from "@sentry/node";

// After every LLM call, regardless of trace sampling:
Sentry.metrics.distribution("gen_ai.token_usage", result.usage.totalTokens, {
  unit: "none",
  attributes: {
    model: "claude-sonnet-4-6",
    user_id: user.id,
    endpoint: "/api/chat",
  },
});

Sentry.metrics.distribution("gen_ai.latency", responseTimeMs, {
  unit: "millisecond",
  attributes: { model: "claude-sonnet-4-6" },
});

Sentry.metrics.count("gen_ai.calls", 1, {
  attributes: {
    model: "claude-sonnet-4-6",
    status: result.error ? "error" : "success",
  },
});
```

Python - emit metrics on every LLM call:
```python
import sentry_sdk

sentry_sdk.metrics.distribution(
    "gen_ai.token_usage",
    result.usage.total_tokens,
    attributes={
        "model": "claude-sonnet-4-6",
        "user_id": str(user.id),
        "endpoint": "/api/chat",
    },
)

sentry_sdk.metrics.distribution(
    "gen_ai.latency",
    response_time_ms,
    unit="millisecond",
    attributes={"model": "claude-sonnet-4-6"},
)

sentry_sdk.metrics.count(
    "gen_ai.calls",
    1,
    attributes={
        "model": "claude-sonnet-4-6",
        "status": "error" if error else "success",
    },
)
```

You can also log every call with structured attributes for searchability:
JavaScript:
```javascript
Sentry.logger.info("LLM call completed", {
  model: "claude-sonnet-4-6",
  user_id: user.id,
  input_tokens: result.usage.promptTokens,
  output_tokens: result.usage.completionTokens,
  latency_ms: responseTimeMs,
  status: "success",
});
```

Python:
```python
sentry_sdk.logger.info(
    "LLM call completed",
    model="claude-sonnet-4-6",
    user_id=str(user.id),
    input_tokens=result.usage.prompt_tokens,
    output_tokens=result.usage.completion_tokens,
    latency_ms=response_time_ms,
    status="success",
)
```

Here's what each telemetry layer gives you:
| Signal | Traces (sampled) | Metrics (100%) | Logs (100%) |
|---|---|---|---|
| Full span tree with prompts/responses | Yes | No | No |
| Token usage distributions (p50, p99) | Partial | Yes | No |
| Cost attribution by model/user | Partial | Yes | Yes |
| Error rates by model/endpoint | Partial | Yes | Yes |
| Latency distributions | Partial | Yes | No |
| Searchable per-call records | Yes | No | Yes |
The recommended approach: Use tracesSampler to capture 100% of AI-related routes. If that's not possible, combine a lower trace rate with metrics and logs emitted on every call. Traces give you the debugging depth; metrics and logs give you the aggregate picture.
Once you're emitting these metrics, you can build custom dashboards that go beyond what the pre-built AI Agents dashboard shows. The Sentry CLI makes this scriptable:
```bash
# Find your most expensive users - the pre-built dashboard doesn't group by user
sentry dashboard create 'AI Cost Attribution'

sentry dashboard widget add 'AI Cost Attribution' "Most Expensive Users" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" \
  --where "span.op:gen_ai.request" \
  --group-by "user.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20

# Cost per conversation - find runaway multi-turn sessions
sentry dashboard widget add 'AI Cost Attribution' "Cost per Conversation" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" "count" \
  --where "span.op:gen_ai.request" \
  --group-by "gen_ai.conversation.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20
```

The pre-built dashboard gives you per-model and per-tool aggregates. Custom dashboards answer the business questions: who's driving cost, which features justify their AI spend, and which conversations are spiraling.
The full production config
Here's a complete setup that samples AI routes at 100%, everything else at your baseline, and emits metrics as a safety net:
JavaScript:
```javascript
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }
    if (name?.includes('/api/chat') || name?.includes('/api/agent')) {
      return 1.0;
    }
    return inheritOrSampleWith(0.2);
  },
});

// Wrapper for any LLM call - emit metrics regardless of sampling
function trackLLMCall(model, usage, latencyMs, userId) {
  Sentry.metrics.distribution("gen_ai.token_usage", usage.totalTokens, {
    attributes: { model, user_id: userId },
  });
  Sentry.metrics.distribution("gen_ai.latency", latencyMs, {
    unit: "millisecond",
    attributes: { model },
  });
  Sentry.metrics.count("gen_ai.calls", 1, {
    attributes: { model, status: "success" },
  });
}
```

Python:
```python
import sentry_sdk

def traces_sampler(sampling_context):
    tx = sampling_context.get("transaction_context", {})
    op, name = tx.get("op", ""), tx.get("name", "")
    if op.startswith("gen_ai."):
        return 1.0
    if op == "http.server" and any(
        p in name for p in ["/api/chat", "/api/agent"]
    ):
        return 1.0
    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)
    return 0.2

sentry_sdk.init(
    dsn="...",
    traces_sampler=traces_sampler,
)

# Wrapper for any LLM call - emit metrics regardless of sampling
def track_llm_call(model, usage, latency_ms, user_id):
    sentry_sdk.metrics.distribution(
        "gen_ai.token_usage", usage.total_tokens,
        attributes={"model": model, "user_id": str(user_id)},
    )
    sentry_sdk.metrics.distribution(
        "gen_ai.latency", latency_ms,
        unit="millisecond",
        attributes={"model": model},
    )
    sentry_sdk.metrics.count(
        "gen_ai.calls", 1,
        attributes={"model": model, "status": "success"},
    )
```

Quick reference
| Situation | What to do |
|---|---|
| AI is the core product | Set `tracesSampleRate: 1.0` |
| AI is one feature in a larger app | `tracesSampler` with AI routes at 1.0, baseline for the rest |
| Can't afford 100% on AI routes | Lower trace rate + metrics/logs on every call |
| Already using `tracesSampler` | Add AI route matching to your existing logic |
| Sample rate is already 1.0 | No change needed |
The underlying principle: agent runs are high-value, low-volume (relative to HTTP traffic), and expensive to reproduce. Sample them accordingly.
If you're just getting started with AI monitoring, check out our companion post on the developer's guide to AI agent monitoring, which covers the full setup across 10+ frameworks, the pre-built dashboards, and a real debugging walkthrough.
For framework-specific setup, see our AI monitoring docs. If you're using an AI coding assistant, install the Sentry CLI skill (npx skills add <https://cli.sentry.dev>) to configure your sampling, build custom dashboards, and investigate issues directly from your editor.

