Sample AI traces at 100% without sampling everything

Sergiy Dybskiy

A little while ago, back when agents were telling me “You’re absolutely right!”, I was building webvitals.com. You put in a URL, and it kicks off a request to a Next.js API route that invokes an agent with a few tools to scan the page and provide AI-generated suggestions to improve your… you guessed it… Web Vitals. Do we even care about these anymore?

I had tracesSampleRate set to 100% in development, but in production I sampled it down to 10% because… well, that’s what our instrumentation recommends. Kyle wrote a great blog post explaining that “Watching everything is watching nothing”. But AI is non-deterministic, and when I was debugging an error from a tool call, I realized I was missing critical spans emitted by the Vercel AI SDK because of that sampling strategy.

An agent run with 7 tool calls doesn't get partially sampled. You either capture the whole span tree or you lose it entirely. This is how head-based sampling works.

I was chasing ghosts.

Agent runs are span trees, and sampling is all-or-nothing

A typical agent execution looks like this in Sentry's trace view:

POST /api/chat (http.server)
└── gen_ai.invoke_agent "Research Agent"
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← initial reasoning
    ├── gen_ai.execute_tool "search_docs"              ← tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← process results
    ├── gen_ai.execute_tool "summarize"                ← second tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← decides to hand off
    └── gen_ai.execute_tool "transfer_to_writer"       ← handoff via tool
        └── gen_ai.invoke_agent "Writer Agent"
            ├── gen_ai.request "chat gemini-2.5-flash"
            └── gen_ai.execute_tool "format_output"

That's 11 spans in a single run. The sampling decision happens once, at the root: the POST /api/chat HTTP transaction. Every child span inherits that decision. If the root is dropped, all 11 spans disappear.

This is fundamentally different from sampling HTTP requests, where dropping one GET /api/users is no big deal because the next one is basically identical.

Agent runs are not identical. Each one makes different decisions, calls different tools, and processes different data. An agent that hallucinated on run 67 might work perfectly on run 420. If your sampler dropped run 67, you'll never know what went wrong.
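A toy simulation makes the failure mode obvious. This is a hypothetical sketch, not SDK code: one random draw at the root decides the fate of the entire span tree.

```python
import random

def sample_trace(spans, sample_rate, rng=random):
    """Head-based sampling: one decision at the root, made before any child exists."""
    keep = rng.random() < sample_rate
    # Children never get their own draw - they inherit the root's fate.
    return list(spans) if keep else []

# The 11-span agent run from above, at a 10% sample rate:
agent_run = (
    ["http.server", "gen_ai.invoke_agent"]
    + 3 * ["gen_ai.request", "gen_ai.execute_tool"]
    + ["gen_ai.invoke_agent", "gen_ai.request", "gen_ai.execute_tool"]
)
surviving = [sample_trace(agent_run, 0.1) for _ in range(1000)]
# Roughly 900 of 1000 runs come back completely empty - never "10% of each tree".
assert all(len(s) in (0, len(agent_run)) for s in surviving)
```

Every run is either all 11 spans or zero; there is no middle ground.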

How head-based sampling actually works (and why it matters here)

Both the Sentry JavaScript and Python SDKs use head-based sampling: the decision is made at the start of the trace, before any child spans exist.

In the JavaScript SDK, SentrySampler.shouldSample() is explicit about this:

// We only sample based on parameters (like tracesSampleRate or tracesSampler)
// for root spans. Non-root spans simply inherit the sampling decision
// from their parent.

Non-root spans don't get a vote. If the root span was dropped, tracesSampler is never called for any child, including your gen_ai.request and gen_ai.execute_tool spans. They inherit the parent's fate.

In Python, the same logic lives in Transaction._set_initial_sampling_decision(). The traces_sampler callback receives a sampling_context dict with transaction_context (containing op and name) and parent_sampled. It only fires for root transactions.

This means head-based sampling doesn't support independently sampling gen_ai child spans at a different rate than their parent transaction. There's no "sample 100% of LLM calls but 10% of HTTP requests." If the HTTP request is dropped, the LLM calls inside it are dropped too.
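Here's a minimal Python sketch of that rule (a hypothetical model, not the SDKs' actual code): only root spans consult the sampler, and children blindly inherit their parent's decision.

```python
def should_sample(parent_sampled, is_root, sampler, rate=0.1):
    """Mimics the head-based rule: only root spans consult the sampler;
    non-root spans simply return the parent's decision."""
    if not is_root:
        return parent_sampled  # the sampler is never called for children
    return sampler(rate)

calls = []
def sampler(rate):
    calls.append(rate)
    return False  # root dropped

root = should_sample(None, True, sampler)          # http.server root span
child = should_sample(root, False, sampler)        # gen_ai.request child span
print(root, child, len(calls))  # False False 1 - the child never got a vote
```

The sampler ran exactly once, for the root; the gen_ai child inherited the drop without any code of yours ever seeing it.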

Let’s walk through a few scenarios to show how the filtering approach differs depending on whether the root span comes from an agent or from the application.

Scenario 1: The gen_ai span IS the root

Sometimes your agent run is the root span. Maybe it’s a cron job that runs an agent, a queue consumer processing an AI task, or a CLI script. In these cases, tracesSampler sees the gen_ai.* operation directly and you can match on it:

JavaScript:

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    // Standalone gen_ai root spans - always sample
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }

    return inheritOrSampleWith(0.2);
  },
});

Python:

def traces_sampler(sampling_context):
    op = sampling_context.get("transaction_context", {}).get("op", "")

    # Standalone gen_ai root spans - always sample
    if op.startswith("gen_ai."):
        return 1.0

    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)

    return 0.2

sentry_sdk.init(dsn="...", traces_sampler=traces_sampler)

This is the easy case. The hard case is next.

Scenario 2: The gen_ai spans are children of an HTTP transaction

This is the common case in web applications. A user hits POST /api/chat, your framework creates an http.server root span, and somewhere inside that request handler your agent runs. By the time the first gen_ai.request span is created, the sampling decision was already made for the HTTP transaction.

The fix: identify which routes trigger AI calls and sample those routes at 100%.

JavaScript:

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    // Standalone gen_ai root spans
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }

    // HTTP routes that serve AI features - always sample
    if (name?.includes('/api/chat') ||
        name?.includes('/api/agent') ||
        name?.includes('/api/generate')) {
      return 1.0;
    }

    return inheritOrSampleWith(0.2);
  },
});

Python:

def traces_sampler(sampling_context):
    tx_context = sampling_context.get("transaction_context", {})
    op = tx_context.get("op", "")
    name = tx_context.get("name", "")

    # Standalone gen_ai root spans
    if op.startswith("gen_ai."):
        return 1.0

    # HTTP routes that serve AI features - always sample
    if op == "http.server" and any(
        p in name for p in ["/api/chat", "/api/agent", "/api/generate"]
    ):
        return 1.0

    # Honour parent decision in distributed traces
    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)

    return 0.2

sentry_sdk.init(dsn="...", traces_sampler=traces_sampler)

Replace the route strings with whatever paths your AI features live on. If your entire app is AI-powered, skip the tracesSampler and just set tracesSampleRate: 1.0.

The cost math: AI API bills dwarf observability costs

The instinct to sample AI traces at a lower rate usually comes from cost concerns. Let's look at the actual numbers.

| What | Cost per event |
| --- | --- |
| Claude Sonnet 4 input (1K tokens) | ~$0.003 |
| Claude Sonnet 4 output (1K tokens) | ~$0.015 |
| Gemini 2.5 Flash input (1K tokens) | ~$0.00015 |
| Gemini 2.5 Flash output (1K tokens) | ~$0.0006 |
| A typical agent run (3 LLM calls, 2 tool calls) | $0.02–$0.15 |
| Sentry span events for that agent run (~9 spans) | Fraction of a cent |

The LLM calls themselves are 10-100x more expensive than the monitoring. You're already paying for the AI call; dropping the observability span to save a fraction of a cent per call is like skipping the dashcam to save on gas.
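A quick back-of-the-envelope check, using the approximate prices from the table and an assumed, purely illustrative per-trace ingest cost (not a quoted Sentry price):

```python
# Approximate Claude Sonnet prices from the table above, per 1K tokens.
INPUT_PER_1K = 0.003
OUTPUT_PER_1K = 0.015

# A typical run: 3 LLM calls, each ~1K tokens in and ~1K tokens out (assumption).
llm_cost = 3 * (INPUT_PER_1K + OUTPUT_PER_1K)

# Assumed observability cost for the whole span tree - illustrative figure only.
tracing_cost = 0.0005

print(f"LLM: ${llm_cost:.3f}, tracing: ${tracing_cost}, ratio: {llm_cost / tracing_cost:.0f}x")
```

Even with conservative assumptions, the traces cost two orders of magnitude less than the model calls they describe.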

When 100% tracing isn't feasible: Metrics and Logs as a safety net

If you genuinely can't sample AI routes at 100%, because of, say, massive scale or strict budget constraints, you can still capture the important signals from every AI call using Sentry Metrics and Logs. Both are independent of trace sampling.

JavaScript - emit metrics on every LLM call:

import * as Sentry from "@sentry/node";

// After every LLM call, regardless of trace sampling:
Sentry.metrics.distribution("gen_ai.token_usage", result.usage.totalTokens, {
  unit: "none",
  attributes: {
    model: "claude-sonnet-4-6",
    user_id: user.id,
    endpoint: "/api/chat",
  },
});

Sentry.metrics.distribution("gen_ai.latency", responseTimeMs, {
  unit: "millisecond",
  attributes: { model: "claude-sonnet-4-6" },
});

Sentry.metrics.count("gen_ai.calls", 1, {
  attributes: {
    model: "claude-sonnet-4-6",
    status: result.error ? "error" : "success",
  },
});

Python - emit metrics on every LLM call:

import sentry_sdk

sentry_sdk.metrics.distribution(
    "gen_ai.token_usage",
    result.usage.total_tokens,
    attributes={
        "model": "claude-sonnet-4-6",
        "user_id": str(user.id),
        "endpoint": "/api/chat",
    },
)

sentry_sdk.metrics.distribution(
    "gen_ai.latency",
    response_time_ms,
    unit="millisecond",
    attributes={"model": "claude-sonnet-4-6"},
)

sentry_sdk.metrics.count(
    "gen_ai.calls",
    1,
    attributes={
        "model": "claude-sonnet-4-6",
        "status": "error" if error else "success",
    },
)

You can also log every call with structured attributes for searchability:

JavaScript:

Sentry.logger.info("LLM call completed", {
  model: "claude-sonnet-4-6",
  user_id: user.id,
  input_tokens: result.usage.promptTokens,
  output_tokens: result.usage.completionTokens,
  latency_ms: responseTimeMs,
  status: "success",
});

Python:

sentry_sdk.logger.info(
    "LLM call completed",
    model="claude-sonnet-4-6",
    user_id=str(user.id),
    input_tokens=result.usage.prompt_tokens,
    output_tokens=result.usage.completion_tokens,
    latency_ms=response_time_ms,
    status="success",
)

Here's what each telemetry layer gives you:

| Signal | Traces (sampled) | Metrics (100%) | Logs (100%) |
| --- | --- | --- | --- |
| Full span tree with prompts/responses | Yes | No | No |
| Token usage distributions (p50, p99) | Partial | Yes | No |
| Cost attribution by model/user | Partial | Yes | Yes |
| Error rates by model/endpoint | Partial | Yes | Yes |
| Latency distributions | Partial | Yes | No |
| Searchable per-call records | Yes | No | Yes |

The recommended approach: Use tracesSampler to capture 100% of AI-related routes. If that's not possible, combine a lower trace rate with metrics and logs emitted on every call. Traces give you the debugging depth; metrics and logs give you the aggregate picture.

Once you're emitting these metrics, you can build custom dashboards that go beyond what the pre-built AI Agents dashboard shows. The Sentry CLI makes this scriptable:

# Find your most expensive users - the pre-built dashboard doesn't group by user
sentry dashboard create 'AI Cost Attribution'
sentry dashboard widget add 'AI Cost Attribution' "Most Expensive Users" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" \
  --where "span.op:gen_ai.request" \
  --group-by "user.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20

# Cost per conversation - find runaway multi-turn sessions
sentry dashboard widget add 'AI Cost Attribution' "Cost per Conversation" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" "count" \
  --where "span.op:gen_ai.request" \
  --group-by "gen_ai.conversation.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20

The pre-built dashboard gives you per-model and per-tool aggregates. Custom dashboards answer the business questions: who's driving cost, which features justify their AI spend, and which conversations are spiraling.

The full production config

Here's a complete setup that samples AI routes at 100%, everything else at your baseline, and emits metrics as a safety net:

JavaScript:

import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }
    if (name?.includes('/api/chat') || name?.includes('/api/agent')) {
      return 1.0;
    }
    return inheritOrSampleWith(0.2);
  },
});

// Wrapper for any LLM call - emit metrics regardless of sampling
function trackLLMCall(model, usage, latencyMs, userId) {
  Sentry.metrics.distribution("gen_ai.token_usage", usage.totalTokens, {
    attributes: { model, user_id: userId },
  });
  Sentry.metrics.distribution("gen_ai.latency", latencyMs, {
    unit: "millisecond",
    attributes: { model },
  });
  Sentry.metrics.count("gen_ai.calls", 1, {
    attributes: { model, status: "success" },
  });
}

Python:

import sentry_sdk

def traces_sampler(sampling_context):
    tx = sampling_context.get("transaction_context", {})
    op, name = tx.get("op", ""), tx.get("name", "")

    if op.startswith("gen_ai."):
        return 1.0
    if op == "http.server" and any(
        p in name for p in ["/api/chat", "/api/agent"]
    ):
        return 1.0

    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)
    return 0.2

sentry_sdk.init(
    dsn="...",
    traces_sampler=traces_sampler,
)

# Wrapper for any LLM call - emit metrics regardless of sampling
def track_llm_call(model, usage, latency_ms, user_id):
    sentry_sdk.metrics.distribution(
        "gen_ai.token_usage", usage.total_tokens,
        attributes={"model": model, "user_id": str(user_id)},
    )
    sentry_sdk.metrics.distribution(
        "gen_ai.latency", latency_ms,
        unit="millisecond",
        attributes={"model": model},
    )
    sentry_sdk.metrics.count(
        "gen_ai.calls", 1,
        attributes={"model": model, "status": "success"},
    )

Quick reference

| Situation | What to do |
| --- | --- |
| AI is the core product | tracesSampleRate: 1.0, sample everything |
| AI is one feature in a larger app | tracesSampler with AI routes at 1.0, baseline for the rest |
| Can't afford 100% on AI routes | Lower trace rate + metrics/logs on every call |
| Already using tracesSampler | Add AI route matching to your existing logic |
| Sample rate is already 1.0 | No change needed |

The underlying principle: agent runs are high-value, low-volume (relative to HTTP traffic), and expensive to reproduce. Sample them accordingly.

If you're just getting started with AI monitoring, check out our companion post on the developer's guide to AI agent monitoring, which covers the full setup across 10+ frameworks, the pre-built dashboards, and a real debugging walkthrough.


For framework-specific setup, see our AI monitoring docs. If you're using an AI coding assistant, install the Sentry CLI skill (npx skills add https://cli.sentry.dev) to configure your sampling, build custom dashboards, and investigate issues directly from your editor.
