March 26, 2019

Sentry recently experienced two minor outages related to database lock contention: a situation where a process stops executing while it waits for another process to release a shared resource that both depend on. It’s kind of like [insert comical metaphor intended to inspire a light-hearted chuckle here].

Database locks are notoriously tricky to debug since it’s difficult to replicate a concurrent system with all of the unexpected side effects in an often single-threaded test suite. We know from past experience how painful a long-running transaction can be when running Postgres.
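To make the idea concrete, here is a minimal sketch of lock contention in plain Python. It is not Sentry's code and it does not touch Postgres: a `threading.Lock` stands in for a database lock, and the function names (`long_running_transaction`, `blocked_writer`, `demo`) are hypothetical, chosen only for illustration. One thread holds the lock for a while, the way a long-running transaction would, and a second thread measures how long it stalls waiting for it.

```python
import threading
import time

# Assumption: this in-process lock stands in for a database row or table lock.
lock = threading.Lock()


def long_running_transaction(hold_seconds, started):
    """Hold the lock for hold_seconds, like a slow query inside a transaction."""
    with lock:
        started.set()  # signal that the lock is now held
        time.sleep(hold_seconds)


def blocked_writer(results):
    """Try to take the same lock and record how long the wait lasted."""
    start = time.monotonic()
    with lock:
        results["waited"] = time.monotonic() - start


def demo(hold_seconds=0.2):
    """Return the number of seconds the second thread spent blocked."""
    results = {}
    started = threading.Event()

    holder = threading.Thread(
        target=long_running_transaction, args=(hold_seconds, started)
    )
    holder.start()
    started.wait()  # ensure the holder owns the lock before the writer starts

    waiter = threading.Thread(target=blocked_writer, args=(results,))
    waiter.start()

    holder.join()
    waiter.join()
    return results["waited"]


if __name__ == "__main__":
    print(f"writer was blocked for {demo():.2f}s")
```

The writer does no work of its own, yet its elapsed time roughly tracks however long the other thread keeps the lock. In a real database the same effect compounds, because every blocked process may itself be holding locks that others are waiting on.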

Symptoms

Sentry experienced brief moments of unavailability at the end of January and again at the end of February. As a side effect of a long-running query, our production database began to grind to a halt, and our billing system was left in an incomplete state. In other words, pending changes to a paid subscription couldn’t be applied without manual intervention.

To identify the root cause, we had to look deeper. So we asked ourselves whether any other issues could be traced back to the organization affected by this odd billing state.

Looking deeper

Thanks to Sentry's incredibly robust tagging and search infrastructure, we found the original exception that caused the inconsistent state. Here, the plot thickens.