On Monday, July 20th, Sentry was down for most of the US working day. We deeply regret any issues this may have caused for your team and have taken measures to reduce the risk of this happening in the future. For transparency purposes, and the hope of helping others who may find themselves in this situation, we’ve described the event in detail below.
If you’re familiar with Sentry at all you know that we live and breathe SQL. For a long time I’ve had the stance that simple solutions are the best solutions. That holds true in choice of databases as well. Unfortunately that choice has also led to some complex problems. One such problem caused a major outage of the hosted Sentry service earlier this week. That problem is Postgres’ transaction ID (XID).
To understand the problem we first need to understand how transaction isolation works in Postgres. That turns out to be a lot easier said than done, as public documentation on the internals is sparse. What little documentation exists barely touches the surface. Given that I can’t speak with authority on the subject, we’ll approach this from a high level.
In essence, the first time a transaction manipulates rows in the database (typically via an
DELETE statement) the database’s XID counter is incremented. This counter is used to determine row visibility.
In reality, it’s quite a bit more complex than that. The best human-readable description we’ve found is Heroku’s topic on the MVCC and concurrency.
You don’t need to have a strong grasp of MVCC to really understand the basics. It does what it’s supposed to, and works amazingly well for many workloads.
Due to the way transaction IDs are implemented in Postgres, it’s necessary to run a routine maintenance task called vacuuming. Postgres will do this automatically (via autovacuum), and in most worlds it can go unnoticed. Usually the casual PostgreSQL user first encounters autovacuum around the time that they start seeing heavy load (especially write load) on their database. The internet is full of awful advice of users suggesting you should turn off autovacuum, or run it manually at low traffic times, or simply adjust its schedule to run less often. To know why that’s ill-advised, you first need to understand the consequences of autovacuum not running.
For example, let’s say we’ve got a reasonably active database, and we’ve already disabled autovacuum. Eventually, that database is going to reach a situation where the XID counter has reached its maximum value. The absolute peak is something around 2 billion, but it can be far lower than that in some situations. Once you’ve reached this value, the semantics of MVCC can no longer function correctly. As mentioned earlier, the XID counter is used to issue unique values that identify an individual transaction. Since there are no longer any unique XID values available to identify new transactions, the database must halt to prevent XID re-use – otherwise, transactions that occurred in the past could suddenly appear to be in the future, leading to all kinds of strange and undesirable behavior (such as deleted rows reappearing, updated rows reverting to previous states, referential integrity or constraint violations, among others.) To protect you from this, Postgres will stop accepting commands when there are fewer than one million transactions left before the maximum XID value is reached.
Sentry is a very write heavy application. On top of that we have some very large relational tables (both in number of rows, as well as physical size on disk). Due to these concerns we’ve gone back and forth on tuning autovacuum to avoid this very situation. Our goal is to vacuum as aggressively as possible so that this situation should never arise.
Unfortunately this very situation is what caused our most recent outage.
Recovering from Disaster
Once Postgres kicks in its XID wraparound protection, unless you’re willing to accept data loss, your only option is to stop accepting writes and vacuum the relations. One piece of general advice is to stop the database and restart in single user mode, thus running vacuum in a more isolated and safe environment.
This is where it gets a bit tricky in our situation.
The database has stopped accepting writes, but we’re already neck deep in autovacuums which certainly need to finish. Now you start asking vital questions:
- Is a VACUUM resumable?
- How long until the process completes?
- Will another 100GB of memory dedicated to maintenance make things faster?
- Is the VACUUM that’s running now (while the protection has kicked in) even going to allow the database to recover?
At this point we had to make a bet: it was better to let autovacuum finish with our current configuration than to risk interrupting it. Our biggest fear would be that we’d restart, either to up the memory dedicated to the task or to switch to single user mode, and we’d increase the duration of the downtime.
So, we are now in the waiting game.
Eventually the autovacuums finish. Unfortunately one of them either failed or simply didn’t work correctly. It’s unclear as to what happened, as the logs contain nothing to suggest failure and we didn’t have verbosity on for autovacuums. Now we’re in a state of limbo to decide what to do next.
Up until this point we were operating a read-only state. We had replicas online that we could point the application at to maintain that view of the world, but it wouldn’t allow us to accept writes. In Sentry’s case this problem compounds itself as we process a very large amount of data continuously. Our backlog of events in the queue is growing, our Redis buffers are getting out of control, and catastrophe is bound to strike. As to avoid making the problem any worse, we decided to flush the entire backlog of events.
Foresight played in our favor a bit here. We had just provisioned new hardware dedicated solely to ensuring this problem was no longer such a risk. New machines with much more memory targeted at maintenance and additional CPUs to these tasks. Immediately when we identified the XID issue we had failed the primary master over to the new hardware, which is where we’d been running the vacuums.
Now we’re several hours in, autovacuums have finished, the queue will survive (assuming we restore writes very shortly), but Postgres still isn’t accepting writes. Did we waste all of that time allowing autovacuum to finish for it to have no effect? Did it do anything at all? Our only choice at this point was to shut down the database and restart in single user mode. Our fear being that we’ve now invested nearly 3 hours and have nothing to show for it.
While we couldn’t estimate how long the vacuum was going to take we were able to identify the problem. By querying Postgres’ internal statistics we identified that the autovacuums actually had finished on all of the relations except one. That relation however is a massive one. It’s a very large table which stores a mapping of events to their rollups, and we require one of these entries for every single event Sentry receives. It’s also a table that has very limited uses: to let you look up an event by its ID and as a safeguard for duplicate events.
At this stage we made the decision that we were willing to risk all data in the table if it meant we could recovery immediately. In this case the risk was minor. It’s used for features that aren’t critical, and if nothing else we had full backups via replicas.
With our best estimations, we made the call to truncate the table. Five minutes later and the system was fully restored.
Perspective from two days later tells us this was absolutely the right call. As we work towards future proofing this problem even more, we’re still watching one of our test machines run a vacuum in single user mode going on 24 hours. This machine, being on the older hardware, tells us that we made many correct calls: letting autovacuum finish, truncating the table, and most importantly the aggressiveness of hardware and configuration tuning to improve the vacuum times.
We’ve got a few ideas on how to resolve this problem, but the biggest is splitting relations into multiple databases. This has its own set of challenges as connection management is a fairly complex thing, but it’s at the very least predictable and understood. Time is more favorable now as well as the machine tuning has provided lots of breathing room.
For the interested, our new hardware specs look something akin to:
- 3x RAIDs (OS, WAL, and Data)
- 256GB Memory (we bumped up from 128, specifically to dedicate the new memory to vacuum)
- 24 Cores (dual hexcore)
Note: We don’t store blob data in Postgres at Sentry (that goes in Riak), so we’re able to get away with a fraction of the hardware we otherwise might need for SQL.
We previously ran with too high value for
autovacuum_freeze_max_age and the default of 3 autovacuum workers. Additionally we had been running with far too much delay in vacuuming which meant lower load on the system, but more idle time in maintenance.
Our updated configuration:
autovacuum_freeze_max_age = 500000000 autovacuum_max_workers = 6 autovacuum_naptime = '15s' autovacuum_vacuum_cost_delay = 0 maintenance_work_mem = '10GB' vacuum_freeze_min_age = 10000000
There’s a number of other things we tune, though it’s all fairly standard these days.
While we feel pretty good about the progress we’ve made, ultimately we’re going to be moving away from a SQL-based architecture. In the future we’ll talk more about that, but we’re happy Postgres has taken us this far.
Update: We’ve had many discussions since posting this and wanted to provide some additional context.
vacuum_freeze_table_agehad not been updated with our new configuration. We’ve removed this from the configuration above as it didn’t affect anything.
maintenance_work_memhas a hard internal limit of 1GB for the relevant paths, and thus our 100GB of memory dedicated to it wasn’t even being used. This was confused by looking at the processes and seeing the large amount of memory they were using, though it was coming from the other relevant settings (
In the future we’ll be following up with a post on what we’ve deemed as best practices for tuning Postgres under high write loads.