Buffering SQL Writes with Redis
It’s no secret that one of Sentry’s core technologies is SQL, specifically PostgreSQL. We’re huge advocates of simplicity, and Postgres is one of those tools that’s not only quick to get started with, but can also grow with you. While at our scale very few things are simple, we’ve still managed to keep complexity to a minimum.
To give a bit of background, it’s important to understand how we use Postgres today. We operate two clusters (unique databases), one which stores event metadata, and the other which stores the remainder of the Sentry platform data. These clusters operate with cold-standby replicas, which are only used in disaster or maintenance scenarios.
When data comes into Sentry, it primarily falls into three categories:
- the event blob, which is immutable and written into a Riak cluster
- key/value pairs representing tags (such as the device name, or operating system)
- various attributes like “last time seen” on an aggregate
Most data is stored in SQL, with the exception being the large event blobs. We rely on Riak for it’s operational scalability, but utilize it in only the most primitive way — as a key value store. Inside of SQL we maintain references to the Riak keys, keeping it as our source of truth.
We commonly see a pattern of heavily-duplicated, high-frequency events due to the nature of errors. Internally we take an error and de-duplicate it into an aggregate. This gives us two scenarios:
- A lot of the same errors bubble up to the same aggregate issue
- Many unique errors are creating new aggregates (or updating existing ones).
For the data within SQL, there are two types of attributes which see heavy mutation:
- Cardinality counters — frequency of events or the number of values seen for a tag.
- “Latest Value” attributes — such as the last time an event was seen in an aggregate.
In the first situation we want to increment a basic counter. This would be equivalent to the following SQL statement:
UPDATE table SET counter = counter + 1 WHERE key = %s;
Our second case is a denormalization of the data that is stored with the event. In the aggregate we store a few attributes of the latest event, including the timestamp and title.
In Sentry we’re content with a last-write-wins scenario, so we simply overwrite the existing value with the most recently observed values:
UPDATE table SET last_seen = %s WHERE key = %s;
While the way Postgres does locking is too complex for this post, it’s an important factor in some of the decisions we’ve taken. Whenever an entity needs written to Postgres it will take out a shared lock for that row. In Sentry this can be a serious issue as many times an error will aggregate up into the same entity, which then triggers many updates to the same row. When these updates are all trying to hit the same entity (row) throughput grinds to a halt while you wait to acquire the write lock.
There are different approaches to improving write performance in this scenario. For
instance, when dealing with counter data, the data could be split across multiple
rows. That is, instead of having a single (
counter) row, we could have
partition) and split up the writes:
UPDATE table SET counter = counter + 1 WHERE key = %s AND partition = ABS(RANDOM() * 100);
This means we could create multiple rows per entity — in the example above we have 100 partitions. When it needs incremented we pick a random row to change, which means the locking can be spread out to more unique rows in the database. This would allow us a much higher write throughput, but it we would need to aggregate the counter at some point later. While this approach might ease the counter scenario, it won’t resolve the other types of attributes we need to store.
Our core issue here is the choice to use Postgres for this type of data. Postgres must provide strong consistency and therefore spends considerable time on locks. We don’t need that level of consistency, and we absolutely don’t want the cost of those locks. By choosing to sacrifice consistency we are be able to greatly increase our throughput. We do this by buffering writes.
To solve our locking issue we take a buffered writes approach. This means that we aggregate writes over a period of time, and flush them after an interval. Our constraints primarily revolve around one question: how much data are we willing to lose? That question controls the flush interval of the buffers. In our case it’s 10 seconds.
In Sentry this means that while we write the event data immediately when the counter processed — to Riak, as it’s unique and immutable — we don’t apply counters, search indexes, or many other things for up to 10 seconds. This creates two scenarios that we need to be aware of:
- The UI will not atomically update — some items will be more up to date than others.
- There is a possibility for data loss if we lose lose a buffer partition due to sustained network failure or a number of other less common scenarios.
We rely on Redis a lot at Sentry, including systems ranging from simple data caches all the way to persistent storage for time-series data. Our solution to buffering is no different.
The schema of buffers is fairly simple. We have a set of attributes per entity, with their current value (or delta), and then a set listing those entity’s keys. In Redis we store these in two structures:
- A hash per entity
- A set of the hashes which need flushed
Due to the simplicity of the schema, it means operations are also easier to reason about. Whenever a pending write comes in, we step through the following:
- Write the changes to the entity’s hash key.
- With counters we use
- With another value — such as “last seen timestamp” — we simply
- With counters we use
ZADDthe key to a ‘pending’ set using the current timestamp.
Now every tick — in Sentry’s case this is 10 seconds — we’re going to flush the pending writes. This happens via something acting like a cron:
- Get all keys using
- Fire off a job into our queue with each pending hash key.
ZREMthe given keys.
When a worker receives a job, it does a bit of work and then applies the write:
- In a pipeline:
ZREMthe key from ‘pending’ — if multiple jobs were queued before this was executed this ensures the excess can be noop’d.
HGETALLvalues from the entity’s hash key.
REMthe entity’s hash key.
- Convert the pending values to a SQL update, just as we would have without buffering.
- Counters do a delta update —
SET counter = counter + %d.
- All other values set the new value —
SET value = %s.
- Counters do a delta update —
A few notes about the process:
- We use a sorted set for the case where we would want to only pop off a set amount (e.g. we want to process the 100 oldest).
- The system scales linearly with the addition of Redis nodes by putting a ‘pending’ key on each node.
With this model we mostly guarantee that only a single row in SQL is being updated at once, which alleviates most locking contention that we’d see.
As with every system, things change, or better ideas emerge. That holds true with our buffers as well. There’s a few improvements and problems that we’ve become aware of, and have yet to tackle:
The current implementation is a LIFO structure. This works OK for as long as we’re able to keep up with the pending queue. If we weren’t however it’d mean that the most frequent events would have the highest priority. That may seem like a good idea, but often you’d want to prioritize things which are happening less.
We could resolve that by using the
NX parameter of
ZADD. This would ensure
that the timestamp doesn’t get updated if it already exists in the set. Unfortunately
this requires a newer version of Redis, and being that Sentry is shipped to a variety
of environments is something that we don’t feel is worth the headache.
In the event that the pending writes jobs don’t process in a timely manner the queue can easily get backlogged. This would result in large amounts of tasks being created. These tasks are generally NOOP when they do get executed, but they’re still far from free.
One way to solve this would be to move to a pull model rather than push. We could have a workers that are responsible for popping off pending updates continuously which means that we’d never be executing duplicate work.
This adds complexity in that we’d need to add a new set of background services and scale those out in addition to our existing queue workers. We’ve avoided this so far for the same reasons we don’t use newer Redis features.
It’s possible for a worker to clear a hash and fail midway through the process so that the data does not get written. For example, if a process were to be killed by the OOM killer it would have already removed the data from Redis, but not yet committed changes to Postgres.
This could be improved by adding a secondary “in progress” set and duplicating the hash. Given that this failure chance is so low, we didn’t feel it was worth the extra cost here as this is an extremely hot path in our application.
The buffer implementation has been around in Sentry for four years now, and has only had minor changes along the way. Only time will tell how much longer until we face new constraints, but we’re confident that choosing Redis early on was the right choice.
What’s the coolest thing you’ve used Redis for?