Details of the Sentry outage on May 6th, 2022
On May 6th, 2022 between 1:30 AM PDT and 12:17 PM PDT, Sentry experienced a large-scale incident that left the majority of our services inaccessible for more than six hours. This had the following potential impact for customers:
- Approximately 90% of incoming events were lost between 3:15 AM and 11:15 AM PDT
- Sentry’s UI was unavailable, making any existing customer data inaccessible
- Sentry’s API was unavailable, impacting customers who rely on artifact uploads as part of build and release processes
- Some SDKs (such as PHP) may have caused application degradation, because their transport is synchronous and our edge network was experiencing issues
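To illustrate the last point, here is a minimal sketch (not Sentry's actual SDK code; all names are hypothetical) of why a synchronous transport can degrade the host application: the calling request blocks for the full duration of the network call, so a slow or unreachable edge adds that latency directly to every instrumented request.

```python
import time

def send_event_sync(event, transport, timeout=0.05):
    """Send one event, blocking the caller for at most `timeout` seconds."""
    try:
        transport(event, timeout=timeout)
        return True
    except TimeoutError:
        # Drop the event rather than stall the application further; with a
        # generous timeout, every request would pay roughly `timeout` seconds
        # of extra latency during an outage like this one.
        return False

def unreachable_edge(event, timeout):
    # Simulate an ingestion endpoint that never answers within `timeout`.
    time.sleep(timeout)
    raise TimeoutError

delivered = send_event_sync({"message": "boom"}, unreachable_edge, timeout=0.05)
print(delivered)  # False: the event is lost, but the caller only waited ~50 ms
```

A tight send timeout bounds the damage to the host application at the cost of dropping events when the network is unhealthy.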
The root cause of this incident was an issue within our primary Google Cloud Platform compute region that affected persistent volumes attached to our compute infrastructure. In this post we’ll share more details about how our SRE team handled the incident and our plans to increase our resilience to this class of issues.
At 1:38 AM PDT on May 6th, our Site Reliability Engineering (SRE) team was paged for multiple distinct alerts across varying services and infrastructure. Our team quickly identified that numerous persistent volumes (PV) across our Google Cloud Platform (GCP) compute infrastructure were experiencing abnormal levels of IO wait and IO latencies of more than a few minutes in some cases.
By 3:00 AM PDT much of Sentry’s core infrastructure residing in GCP was either heavily degraded due to IO performance or inaccessible due to the expanding scope of the GCP incident. These issues spanned everything from our remote bastion access to our Kubernetes control plane, and greatly impacted the SRE team’s attempts to mitigate or improve the overall incident impact.
At 6:24 AM PDT, as the team was nearing the beginning of our required disaster recovery window (a process which could take upwards of 72 hours), we began to see an improvement in numerous metrics suggesting the underlying issue was being resolved. The team began the long process of restoring, fixing, and validating all of our core infrastructure services. Because this incident affected applications’ ability to write reliably to data volumes, many services required careful verification before returning to full operation.
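As a sketch of what one basic check in that verification step can look like (hypothetical code, not Sentry's actual tooling): confirm a data volume accepts writes and returns the same bytes on read-back before a service goes back into rotation. Real validation would also cover fsync latency, filesystem state, and application-level consistency; this only shows the shape of the check.

```python
import os
import tempfile

def volume_write_read_check(mount_point, size=1 << 20):
    """Write `size` random bytes to the volume, read them back, compare."""
    payload = os.urandom(size)
    fd, probe = tempfile.mkstemp(dir=mount_point, prefix=".probe-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())        # force the write through to the device
        with open(probe, "rb") as f:
            return f.read() == payload  # True only if the round trip is intact
    finally:
        os.unlink(probe)

print(volume_write_read_check(tempfile.gettempdir()))  # True on a healthy volume
```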
This cleanup phase lasted until 8:44 AM PDT, after which we restored core services (Dashboards, APIs, Authentication, Notifications, 3rd-party Integrations, and Event Ingestion). The application UI became available to users, and the system began working through its event backlog.
While observing the recovery, our team discovered that event ingestion was struggling to process the typical volume of Sentry’s incoming event load. Our geographically distributed ingestion infrastructure – referred to internally as Points of Presence, or PoPs – was failing to operate properly. We made the decision to stop using PoPs entirely and divert traffic directly to our centralized ingestion path, routing around the problem until our engineering teams could investigate and resolve the underlying issue. Because this change had to propagate through DNS, our ingestion pipeline recovered gradually as the full load of traffic returned. At 11:08 AM PDT, after monitoring confirmed all issues had been resolved, the SRE team re-enabled traffic to our PoPs infrastructure.
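The diversion itself boils down to a routing decision: point the ingestion hostname at the central path while the edge tier is unhealthy, then back at the PoPs once they recover. A minimal sketch, with illustrative endpoint names that are not Sentry's:

```python
# Hypothetical endpoints; the real cutover is done by updating DNS records,
# so clients follow the change only as cached records expire (TTL), which is
# why the recovery ramps up gradually rather than all at once.
POP_ENDPOINT = "https://pop.ingest.example.io"
CENTRAL_ENDPOINT = "https://central.ingest.example.io"

def pick_ingest_endpoint(pops_healthy: bool) -> str:
    """Return the endpoint new DNS records should point at."""
    return POP_ENDPOINT if pops_healthy else CENTRAL_ENDPOINT

print(pick_ingest_endpoint(False))  # central path while PoPs are disabled
print(pick_ingest_endpoint(True))   # PoPs again once they are re-enabled
```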
At 12:17 PM PDT, after nearly 11 hours of investigation, monitoring, and resolution work, we marked our internal incident as resolved.
Takeaways and Next Steps
While the root cause of this incident was ultimately a problem with our cloud provider, Sentry itself lacked sufficient resilience against single-zone failures. We had already kicked off long-term project work at the start of 2022 to expand Sentry’s availability across multiple zones and regions, but this incident has underscored the urgency of prioritizing near-term solutions. We expect these improvements to greatly reduce our reliance on a single zone or region and improve our options for recovery in a similar future incident.
Additionally, while we are confident in the horizontal scalability of our distributed ingestion infrastructure (PoPs), this incident exposed a failure mode in PoPs when downstream services are unavailable. We are prioritizing work to address this flaw, which resulted in unnecessary event loss despite PoPs being geographically dispersed and resilient to a single-region incident of this nature.
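One common way to address that class of flaw (a sketch under our own assumptions, not Sentry's announced design) is for each edge node to spool events in a bounded local buffer while the downstream path is unavailable, then drain the backlog once it recovers, instead of dropping events outright:

```python
from collections import deque

class SpoolingForwarder:
    """Buffer events at the edge while the downstream service is down."""

    def __init__(self, forward, max_buffered=10_000):
        self.forward = forward                    # sends one event downstream
        self.buffer = deque(maxlen=max_buffered)  # bounded: oldest evicted first

    def handle(self, event):
        self.buffer.append(event)
        self.drain()

    def drain(self):
        while self.buffer:
            event = self.buffer[0]
            try:
                self.forward(event)
            except ConnectionError:
                # Downstream still unavailable: keep the event, retry later.
                return
            self.buffer.popleft()

# Simulate a downstream outage followed by recovery.
down = True
def downstream(event):
    if down:
        raise ConnectionError

spool = SpoolingForwarder(downstream)
spool.handle({"id": 1})
spool.handle({"id": 2})
assert len(spool.buffer) == 2   # nothing lost while downstream is down
down = False
spool.drain()
assert len(spool.buffer) == 0   # backlog flushed after recovery
```

The bound on the buffer is a deliberate trade-off: a long enough outage still loses the oldest events, but a single node can never exhaust its memory.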
Lastly, roughly two hours of our total downtime were spent fully restoring service after the original incident was resolved. We’re not happy with the length of this recovery process, and it is an area where we will focus additional investment going forward.
Outages are never desired, but just like software defects, we inevitably have to deal with them. We recognize the importance of Sentry’s reliability and the trust you place in us. Ultimately, our goal is to ensure that even when we experience an outage, it cannot cascade into a customer’s application. We have work ahead of us, but we hope that by continuing to be transparent, others can learn from our mistakes and our customers can trust that we are investing in the right solutions.