Sentry on Sentry: How Metrics saved us $160K
If you know me, you know I care about fast code. Recently, I ran a simple query that revealed that we spend almost $160k a year on one task. Luckily, we launched the Metrics beta back in March. Over the last month or so, 10 of us Sentry engineers collaborated across many functions to leverage Metrics to track custom data points and pinpoint the issue leading to this ridiculous ingestion cost.
Identifying the key problem to solve: save_event_transaction
I’m on the Processing team at Sentry. My colleague Jernej Strasner and I wanted to find out if we could optimize Sentry’s ingestion pipeline, because this recent query revealed that we spend almost $6K a week on the save_event_transaction task – definitely room for improvement.
How Performance and Metrics helped us identify the issue
We thought this was a perfect use case for Metrics. We set up Metrics to visualize the computation time for this task, which revealed that we spent 1.37 weeks of computation in the past seven days on the save_event_transaction task.
Knowing we should start with the save_event_transaction task, we began the discovery journey to identify a way to reduce this overhead. The Trace View gave us a hint as to which function seemed to be the biggest offender. As you can see in the screenshot below, in this particular trace 7.37s out of 8.00s was spent on set_subkeys.
At this point, we wanted to understand the actual impact of set_subkeys. We used Metrics to plot the extracted transaction and span metrics, which gave us the percentage of time that set_subkeys takes relative to save_event_transaction. In practice, this means that for every ingested transaction we spend between 27% and 81% of the processing time on setting the cache.
To summarize, we spend on average 51% of the time processing a transaction on setting the cache. In financial terms, that works out to roughly $3,060 per week spent on setting the cache (51% of the ~$6K weekly cost of save_event_transaction), or about $160k per year.
Saving Sentry from save events
There were four main ways we identified, with the help of Metrics, that could help us reduce the time (and money) cost of save_event_transaction:
1. Be deliberate about when we set the cache
2. Improve the balanced distribution of traffic between memcache servers
3. Enable background threads for third-party tools
4. Move away from Twemproxy to alternatives
These changes resulted in improved performance across the entire Sentry codebase, which of course reduces costs overall.
Impactful changes can be one line
After careful consideration, we decided that instead of setting the cache on every transaction (a separate write operation each time), we could simply set it during the first read operation. The change can be found on GitHub.
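To make the idea concrete, here is a minimal sketch of the pattern – not Sentry’s actual code; the cache client, load_value callback, and key format are placeholders – showing the cache being populated lazily on the first read instead of written on every ingested transaction:

# Minimal sketch of the "set on first read" pattern (illustrative only).
# `cache` stands in for the memcached-backed cache client; `load_value`
# and the key format are hypothetical placeholders.

def get_cached(cache, key, load_value, ttl=3600):
    """Read-through cache: only the first reader pays for the write."""
    value = cache.get(key)
    if value is None:
        # Cache miss: compute the value once and store it for later readers,
        # instead of unconditionally setting it on every ingested transaction.
        value = load_value(key)
        cache.set(key, value, ttl)
    return value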
You can see that immediately after this minor change was made, the time dropped from ~300ms on average to under 100ms. The change had a huge impact on other Celery tasks as well.
Engaging the ops team to improve traffic distribution
Two of my colleagues in ops, Anton Ovchinnikov and Alex Tarasov, also found an issue with the Kubernetes config of twemproxy, the load balancer in front of the memcached servers at Sentry. They experimented with switching the twemproxy-save service from clusterIP: None to a regular ClusterIP to get a properly balanced distribution of traffic between the memcache servers.
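A rough way to see why the Service type matters, assuming you run this from inside the cluster and that the Service name below is the right one: a headless Service (clusterIP: None) resolves to the individual pod IPs and leaves balancing entirely to the client, while a regular ClusterIP resolves to one virtual IP that kube-proxy balances across pods.

# Illustrative only: compare what cluster DNS returns for a headless Service
# versus a regular ClusterIP Service (Service name is a placeholder).
import socket

addresses = {
    info[4][0]
    for info in socket.getaddrinfo("twemproxy-save", 11211, proto=socket.IPPROTO_TCP)
}
print(addresses)
# headless (clusterIP: None): one record per pod; long-lived clients can pile
#   onto whichever pod they happened to resolve first
# regular ClusterIP: a single virtual IP; kube-proxy spreads connections
#   across the pods behind it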
As a result, the total percentage of CPU time spent in user space decreased significantly, and connections were distributed far more evenly, which kept CPU usage stable. The experiment produced a huge decrease in sentry.tasks.store.save_event p95, cut the sentry.tasks.store.save_event duration by 10ms (~13%), and also brought down sentry.tasks.store.save_event_transaction p95.
Unblocking us from third-party tools
After an in-depth analysis of the Metrics, we realized that the SDK for our infrastructure monitoring tool was sending sequential, blocking requests to the statsd proxy, causing high execution times on hot paths at Sentry. In some profiles, sending a single event to the tool’s proxy took so long that it accounted for 30% of the whole transaction. We updated the client and enabled its background thread feature so that metrics are sent from a separate thread and no longer block the main one.
This change resulted in an 11ms improvement in sentry.tasks.store.save_event_transaction.
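The vendor’s SDK exposes this as a client option, so the snippet below is only a generic sketch of the background-sender pattern we enabled: hot paths push a metric onto an in-memory queue and return immediately, while a daemon thread does the blocking UDP send to the statsd proxy.

# Generic sketch of the background-sender pattern (illustrative only; the real
# monitoring SDK provides this as a configuration option, not hand-rolled code).
import queue
import socket
import threading

_pending: "queue.Queue[bytes]" = queue.Queue(maxsize=10_000)

def _sender(host: str = "127.0.0.1", port: int = 8125) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        payload = _pending.get()            # blocks only in this background thread
        sock.sendto(payload, (host, port))  # statsd speaks UDP: fire-and-forget

threading.Thread(target=_sender, daemon=True).start()

def incr(metric: str, value: int = 1) -> None:
    """Called from hot paths: never blocks on the network."""
    try:
        _pending.put_nowait(f"{metric}:{value}|c".encode())
    except queue.Full:
        pass  # drop the metric rather than slow down transaction processing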
Moving from Twemproxy to Mcrouter
Finally, one of our security engineers, Alex Tarasov, ran a quick benchmark and found out that Twemproxy is three times slower than accessing a memcached server directly.
>>> benchmark('192.168.208.93:11211')
1.0897316932678223
>>> benchmark('10.4.140.127:11211')
3.4573745727539062
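The benchmark helper itself isn’t shown in the post; a hypothetical reconstruction, assuming pymemcache, could simply time a fixed number of get round-trips against whichever address you point it at (direct memcached or the Twemproxy sidecar):

# Hypothetical reconstruction of the benchmark helper (the real one isn't shown).
import time
from pymemcache.client.base import Client

def benchmark(address: str, iterations: int = 10_000) -> float:
    host, port = address.split(":")
    client = Client((host, int(port)))
    client.set("bench-key", b"value")
    start = time.monotonic()
    for _ in range(iterations):
        client.get("bench-key")       # one network round-trip per iteration
    return time.monotonic() - start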
One of our engineers on the processing team, Arpad Borsos, is working on removing twemproxy and distributing the workload from Python instead. As a result of his work, we can reduce the overhead that Alex Tarasov found.
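As a sketch of what distributing the workload from Python can look like – the server addresses below are placeholders, and the final approach may well differ – pymemcache’s HashClient hashes each key to one of the memcached servers directly, with no proxy hop in between:

# Sketch of client-side key distribution with pymemcache's HashClient
# (server addresses are placeholders; Sentry's final implementation may differ).
from pymemcache.client.hash import HashClient

client = HashClient([
    ("memcached-0.memcached", 11211),
    ("memcached-1.memcached", 11211),
    ("memcached-2.memcached", 11211),
])

client.set("event:123", b"payload")  # the key is hashed to exactly one server
print(client.get("event:123"))       # no Twemproxy hop on the way there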
While investigating caching issues, we also realized that the assemble status is stored in memcache. This is problematic because memcache has no persistence, nor can we guarantee that we won’t get stale reads, which makes moving to a different memcache clustering system difficult. Sebastian Zivota will be working on this over the next couple of weeks, and we expect to see our overhead drop significantly. Stay tuned for that update.
Did Metrics help us save time and money?
Here's an overview of ‘save_event_transaction’ p95.
I’d say the graph speaks for itself.
TL;DR: Try Metrics and save money
If you’re inspired to try Metrics yourself, it’s currently in beta. Eligible users will now see Metrics in their Sentry accounts (if you don’t see Metrics yet, sign up here, and we will give you access). We’d love to hear how you use it: share your experiences or feedback on Discord.
Shoutouts
This was a huge collaborative effort. I get to be the one to tell you about it, but without the help of these 9 other engineers it wouldn’t have been as big of a win: Alex Tarasov, Anthony Sottile, Anton Ovchinnikov, Armin Ronacher, Arpad Borsos, Jernej Strasner, Mark Story, Michael Warkentin, and Sebastian Zivota.