Seer fixes Seer: How Seer pointed us toward a bug and helped fix an outage
On February 21, 2026, Sentry’s AI-powered issue summarization experienced an outage in the EU region. Approximately 80-90% of requests to Seer’s Issue Summary API endpoint failed, disabling AI Summary cards on new issues and generating over 40,000 error events.
The root cause traced back to an upstream incident: Google Cloud Platform declared unavailability for gemini-2.5-flash-lite in several EU regions. However, Sentry had provisioned throughput capacity in europe-west1 with guaranteed resources. The outage should have been minor—Sentry was only using 12% of provisioned capacity.
The actual problem stemmed from application code, not infrastructure. A latency optimization feature blocklisted every Gemini region in the EU, including the one with guaranteed capacity.
How Seer Routes LLM Calls in the EU
Seer runs gemini-2.5-flash-lite through GCP Vertex AI. The EU deployment maintains provisioned throughput in europe-west1, which provides reserved capacity during demand spikes. Several other EU regions use Standard pay-as-you-go capacity with no availability guarantees.
The LLM client implements region fallback with temporary blocklisting: a region that accumulates 6 failures within a short window is pulled from rotation for a while. This optimization reduces latency during Autofix sessions, which can trigger 50-100 LLM calls.
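A minimal sketch of how such a fallback-plus-blocklist loop can work, assuming a sliding failure window and a blocklist TTL. The constants, helper names, and in-memory bookkeeping here are illustrative, not Seer's actual implementation:

import time
from collections import defaultdict, deque


class LlmNoRegionsToRunError(Exception):
    pass


BLOCKLIST_THRESHOLD = 6        # failures inside the window before a region is pulled
FAILURE_WINDOW_SECONDS = 60    # assumed sliding window; the real value isn't stated
BLOCKLIST_TTL_SECONDS = 300    # assumed time a region stays out of rotation

_recent_failures = defaultdict(deque)   # region -> timestamps of recent failures
_blocklisted_until = {}                 # region -> monotonic time the blocklist expires


def record_failure(region: str) -> None:
    """Track a failure and blocklist the region once the threshold is hit."""
    now = time.monotonic()
    failures = _recent_failures[region]
    failures.append(now)
    # Drop failures that have aged out of the sliding window.
    while failures and now - failures[0] > FAILURE_WINDOW_SECONDS:
        failures.popleft()
    if len(failures) >= BLOCKLIST_THRESHOLD:
        _blocklisted_until[region] = now + BLOCKLIST_TTL_SECONDS


def call_with_fallback(regions, make_request):
    """Try regions in order, skipping any that are currently blocklisted."""
    now = time.monotonic()
    candidates = [r for r in regions if _blocklisted_until.get(r, 0) <= now]
    if not candidates:
        raise LlmNoRegionsToRunError("no allowed regions remain")
    last_error = None
    for region in candidates:
        try:
            return make_request(region)
        except Exception as exc:  # e.g. 504 Deadline Exceeded, 429 RESOURCE_EXHAUSTED
            record_failure(region)
            last_error = exc
    raise last_error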
A critical invariant should exist: never blocklist provisioned throughput regions. That capacity represents paid-for, guaranteed resources. Sentry enforced this rule in the US deployment but omitted it from the EU configuration.
The Cascade
When europe-west1 returned 504 Deadline Exceeded errors during the GCP incident, six failures were enough to trigger blocklisting. All traffic shifted to Standard PayGo regions that were not prepared for the full load. europe-west4 returned 429 RESOURCE_EXHAUSTED and was blocklisted, then europe-central2. Within minutes, every EU region was blocklisted, and calls failed with LlmNoRegionsToRunError because no allowed regions remained.
Critically, most calls to europe-west1 were still succeeding because provisioned throughput absorbed the load. The blocklist triggered on raw failure count regardless of success rate, so a region successfully handling the vast majority of traffic could be banned over six clustered failures.
The Code Problem
The original blocklist logic:
def should_blocklist(region: str, model: str, error_count: int) -> bool:
    return error_count >= BLOCKLIST_THRESHOLD
The required fix:
def should_blocklist(region: str, model: str, error_count: int) -> bool:
    if is_provisioned_throughput_region(region, model):
        return False  # Never blocklist PT regions
    return error_count >= BLOCKLIST_THRESHOLD
The US deployment hardcoded an exception for its PT region. When EU provisioned throughput was added after a previous incident, the blocklist code wasn't updated to match. The configuration relied on developers remembering to maintain a separate, manually updated allowlist: a classic gap between infrastructure provisioning and application awareness.
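One way to close that gap is to derive the check from configuration that lives alongside the provisioning itself, so adding PT capacity in a new region updates the blocklist logic automatically. A rough sketch, with an assumed config shape rather than Sentry's real one:

# Mapping of model -> regions with provisioned (reserved) throughput.
# Ideally generated from, or reviewed together with, the infrastructure
# config so it cannot drift from what was actually provisioned.
PROVISIONED_THROUGHPUT_REGIONS = {
    "gemini-2.5-flash-lite": {"europe-west1"},  # the US PT region would also be listed here
}


def is_provisioned_throughput_region(region: str, model: str) -> bool:
    """True if this (model, region) pair has paid-for, guaranteed capacity."""
    return region in PROVISIONED_THROUGHPUT_REGIONS.get(model, set())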
A secondary issue: the blocklist threshold of 6 errors was hardcoded based on months-old load patterns. Sentry is replacing it with an error-rate-based approach.
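A hedged sketch of what an error-rate check could look like; the minimum sample size and rate threshold below are placeholders, not Sentry's chosen values:

MIN_CALLS_IN_WINDOW = 20     # don't judge a region on a handful of calls
ERROR_RATE_THRESHOLD = 0.5   # blocklist only if most recent calls are failing


def should_blocklist(region: str, model: str, error_count: int, total_count: int) -> bool:
    if is_provisioned_throughput_region(region, model):
        return False  # PT regions stay in rotation no matter what
    if total_count < MIN_CALLS_IN_WINDOW:
        return False  # too little traffic to draw a conclusion
    return error_count / total_count >= ERROR_RATE_THRESHOLD

Under a rate-based check, europe-west1's six failures amid mostly successful calls would never have crossed the threshold.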
Seer Debugging Seer
Sentry’s AI debugging tool proved essential for understanding the blast radius of its own outage. Standard monitoring raised the alert, but Seer’s analysis of the LlmNoRegionsToRunError issue determined the impact in seconds.
Seer identified that failed issue summaries caused ~42,000 errors, with spam detection (~1,600) and autofix (~850) also affected. It confirmed >99% of events occurred in the EU deployment and traced the blocklisting cascade through breadcrumb trails.
The analysis reached the region blocklisting mechanism autonomously. Engineers, applying knowledge of the provisioned throughput architecture, recognized that the PT region shouldn’t have been blocklisted. Seer confirmed that calls to the PT region mostly succeeded during the GCP incident: the precise combination of facts needed to identify the fix.
The Lesson
Latency optimizations can create failure modes worse than having no optimization at all. Circuit breakers opening too aggressively, blocklists ignoring reserved capacity, or fallback chains amplifying failures can transform upstream provider incidents into complete service outages.
The bug exploited a mundane gap: the distance between “we provisioned GCP capacity” and “our code knows we provisioned GCP capacity.” Organizations routing LLM requests across multiple regions should audit circuit breakers to ensure reserved regions receive special protection. This fix required six lines of code.
For Seer’s analytical capabilities, consult the Seer documentation.