Sentry on Sentry: How Metrics saved us $160K
If you know me, you know I care about fast code. Recently, I ran a simple query that revealed that we spend almost $160k a year on one task. Luckily, we launched the Metrics beta back in March. Over the last month or so, 10 of us Sentry engineers collaborated across many functions to leverage Metrics to track custom data points and pinpoint the issue leading to this ridiculous ingestion cost.
Identifying the key problem to solve: save_event_transaction
I’m on the Processing team at Sentry. My colleague Jernej Strasner and I wanted to find out if we could optimize Sentry’s ingestion pipeline, because this recent query revealed that we spend almost $6K a week on the save_event_transaction task – definitely room for improvement.
How Performance and Metrics helped us identify the issue
We thought this was a perfect use case for Metrics. We set up Metrics to visualize the computation time for this task, which revealed that we spent 1.37 weeks of computation in the past seven days on the save_event_transaction task.
Knowing we should start with the save_event_transaction task, we began the discovery journey to identify a way to reduce this overhead. The Trace View gave us a hint as to which function seemed to be the biggest offender. As you can see in the screenshot below, in this particular trace 7.37s out of 8.00s was spent on set_subkeys.
At this point, we wanted to understand the actual impact of set_subkeys. We used Metrics to plot the extracted transaction and span metrics, which gave us the percentage of time that set_subkeys takes relative to save_event_transaction. In practice, this means that for every ingested transaction we spend between 27% and 81% of the processing time on setting the cache.
To summarize, we spend on average 51% of the time processing a transaction on setting the cache. In financial terms, that works out to roughly $3,060 per week spent on setting the cache (51% of the ~$6K weekly cost of save_event_transaction), or about $160k per year.
Saving Sentry from save events
There were four main ways we identified, with the help of Metrics, that could help us reduce the time (and money) cost of save_event_transaction:
1. Be deliberate about when we set the cache
2. Improve the balanced distribution of traffic between memcache servers
3. Enable background threads for third-party tools
4. Move away from Twemproxy to alternatives
These changes resulted in improved performance across the entire Sentry codebase, which of course reduces costs overall.
Impactful changes can be one line
After careful consideration, we decided that instead of setting the cache on every transaction (a separate write operation each time), we could simply set it during the first read operation. The change can be found on GitHub.
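To make the idea concrete, here is a minimal sketch of the pattern – not Sentry’s actual code; the cache client, load_value callback, and key format are placeholders – showing the cache being populated lazily on the first read instead of written on every ingested transaction:

# Minimal sketch of the "set on first read" pattern (illustrative only).
# `cache` stands in for the memcached-backed cache client; `load_value`
# and the key format are hypothetical placeholders.

def get_cached(cache, key, load_value, ttl=3600):
    """Read-through cache: only the first reader pays for the write."""
    value = cache.get(key)
    if value is None:
        # Cache miss: compute the value once and store it for later readers,
        # instead of unconditionally setting it on every ingested transaction.
        value = load_value(key)
        cache.set(key, value, ttl)
    return value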
You can see that immediately after this minor change was made, the time dropped from ~300ms on average to under 100ms. The change had a huge impact on other Celery tasks as well.
Engaging the ops team to improve traffic distribution
Two of my colleagues in ops, Anton Ovchinnikov and Alex Tarasov, also found an issue with the Kubernetes config of twemproxy, the load balancer in front of the memcached servers at Sentry. They experimented with switching the twemproxy-save service from clusterIP: None to a regular ClusterIP to get a properly balanced distribution of traffic between the memcache servers.
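A rough way to see why the Service type matters, assuming you run this from inside the cluster and that the Service name below is the right one: a headless Service (clusterIP: None) resolves to the individual pod IPs and leaves balancing entirely to the client, while a regular ClusterIP resolves to one virtual IP that kube-proxy balances across pods.

# Illustrative only: compare what cluster DNS returns for a headless Service
# versus a regular ClusterIP Service (Service name is a placeholder).
import socket

addresses = {
    info[4][0]
    for info in socket.getaddrinfo("twemproxy-save", 11211, proto=socket.IPPROTO_TCP)
}
print(addresses)
# headless (clusterIP: None): one record per pod; long-lived clients can pile
#   onto whichever pod they happened to resolve first
# regular ClusterIP: a single virtual IP; kube-proxy spreads connections
#   across the pods behind it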
As a result, the total percentage of CPU time spent in user space decreased significantly, and connections were distributed far more evenly, which kept CPU usage stable. The experiment produced a huge decrease in sentry.tasks.store.save_event p95, cut the sentry.tasks.store.save_event duration by 10ms (~13%), and also brought down sentry.tasks.store.save_event_transaction p95.
Unblocking us from third-party tools
After an in-depth analysis of the Metrics, we realized that the SDK for our infrastructure monitoring tool was sending sequential, blocking requests to the statsd proxy, causing high execution times on hot paths at Sentry. In some profiles, sending a single event to the tool’s proxy took so long that it accounted for 30% of the whole transaction. We updated the client and enabled its background thread feature so that metrics are sent from a separate thread and no longer block the main one.
This change resulted in an 11ms improvement in sentry.tasks.store.save_event_transaction.
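The vendor’s SDK exposes this as a client option, so the snippet below is only a generic sketch of the background-sender pattern we enabled: hot paths push a metric onto an in-memory queue and return immediately, while a daemon thread does the blocking UDP send to the statsd proxy.

# Generic sketch of the background-sender pattern (illustrative only; the real
# monitoring SDK provides this as a configuration option, not hand-rolled code).
import queue
import socket
import threading

_pending: "queue.Queue[bytes]" = queue.Queue(maxsize=10_000)

def _sender(host: str = "127.0.0.1", port: int = 8125) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        payload = _pending.get()            # blocks only in this background thread
        sock.sendto(payload, (host, port))  # statsd speaks UDP: fire-and-forget

threading.Thread(target=_sender, daemon=True).start()

def incr(metric: str, value: int = 1) -> None:
    """Called from hot paths: never blocks on the network."""
    try:
        _pending.put_nowait(f"{metric}:{value}|c".encode())
    except queue.Full:
        pass  # drop the metric rather than slow down transaction processing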
Moving from Twemproxy to Mcrouter
Finally, one of our security engineers, Alex Tarasov, ran a quick benchmark and found out that Twemproxy is three times slower than accessing a memcached server directly.
>>> benchmark('192.168.208.93:11211')
1.0897316932678223
>>> benchmark('10.4.140.127:11211')
3.4573745727539062
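The benchmark helper itself isn’t shown in the post; a hypothetical reconstruction, assuming pymemcache, could simply time a fixed number of get round-trips against whichever address you point it at (direct memcached or the Twemproxy sidecar):

# Hypothetical reconstruction of the benchmark helper (the real one isn't shown).
import time
from pymemcache.client.base import Client

def benchmark(address: str, iterations: int = 10_000) -> float:
    host, port = address.split(":")
    client = Client((host, int(port)))
    client.set("bench-key", b"value")
    start = time.monotonic()
    for _ in range(iterations):
        client.get("bench-key")       # one network round-trip per iteration
    return time.monotonic() - start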
One of our engineers on the processing team, Arpad Borsos, is working on removing twemproxy and distributing the workload from Python instead. As a result of his work, we can reduce the overhead that Alex Tarasov found.
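As a sketch of what distributing the workload from Python can look like – the server addresses below are placeholders, and the final approach may well differ – pymemcache’s HashClient hashes each key to one of the memcached servers directly, with no proxy hop in between:

# Sketch of client-side key distribution with pymemcache's HashClient
# (server addresses are placeholders; Sentry's final implementation may differ).
from pymemcache.client.hash import HashClient

client = HashClient([
    ("memcached-0.memcached", 11211),
    ("memcached-1.memcached", 11211),
    ("memcached-2.memcached", 11211),
])

client.set("event:123", b"payload")  # the key is hashed to exactly one server
print(client.get("event:123"))       # no Twemproxy hop on the way there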
While investigating caching issues, we also realized that the assemble status is stored in memcache. This is problematic because memcache has no persistence, nor can we guarantee that we won’t get stale reads, which makes moving to a different memcache clustering system difficult. Sebastian Zivota will be working on this over the next couple of weeks, and we expect to see our overhead drop significantly. Stay tuned for that update.
Did Metrics help us save time and money?
Here's an overview of ‘save_event_transaction’ p95.
I’d say the graph speaks for itself.
TL;DR: Try Metrics and save money
If you’re inspired to try Metrics yourself, it’s currently in beta. Eligible users will now see Metrics in their Sentry accounts (if you don’t see Metrics yet, sign up here, and we will give you access). We’d love to hear how you use it: share your experiences or feedback on Discord.
Shoutouts
This was a huge collaborative effort. I get to be the one to tell you about it, but without the help of these 9 other engineers it wouldn’t have been as big of a win: Alex Tarasov, Anthony Sottile, Anton Ovchinnikov, Armin Ronacher, Arpad Borsos, Jernej Strasner, Mark Story, Michael Warkentin, and Sebastian Zivota.