March 31, 2022

Sentry Points of Presence: How We Built a Distributed Ingestion Infrastructure

Event ingestion is one of the most mission-critical components at Sentry, so it’s only natural that we constantly strive to improve its scalability and efficiency. In this blog post, we want to share our journey of designing and building a distributed ingestion infrastructure, Sentry Points of Presence, that handles billions of events per day and helps thousands of organizations see what actually matters and solve critical issues quickly.

Space is Time

Historically (both before and after Sentry infrastructure migrated to Google Cloud Platform), all Sentry SaaS servers have been located in a single region somewhere in North America. This meant that users and servers transmitting events from the other side of the globe (e.g., Australia or India) sometimes experienced end-to-end latency as high as 1 second. Europe fared better (~450 ms), but this still paled in comparison to events sent from within the continental US (a respectable ~150 ms).

Ideally, Sentry SDKs should add as little overhead as possible, and that’s not feasible with a component in your app that regularly makes requests with 500-1000 ms of latency. Besides tying up system resources, on some platforms this latency could have the seriously negative consequence of blocking all app execution (e.g., PHP, which is single-threaded and synchronous).

Additionally, after a request has been sent by the SDK, it has to make its way through the chaos of the public Internet, traversing wires we have no control over. And, after traveling all that distance, the event might be rejected or dropped because of exhausted quotas, an invalid payload, inbound filters, or a number of other valid reasons. This means the client application could wait for a full second just to learn the data was never ingested.

Sending an event before Points of Presence

To solve this problem, we needed essentially the inverse of a content delivery network: an infrastructure layer that’s close enough to users to minimize the time the request spends in transit and smart enough to perform minimal processing on the payload.

Infrastructure, Assemble

While we cannot pull Australia closer to Sentry, we could try to bring Sentry closer to Australia. Fortunately, some time ago, we developed a component called Relay. It can do everything we need (and, remarkably, sometimes even more). With it, we just need to do a few basic things: run a few Relays closer to our users on top of a scalable cloud infrastructure, ensure that they can sustain the desired load, and make it all transparent for users. This was our thinking at the beginning of 2020, and the first experiments in February 2020 showed that we had all the pieces we needed to build a working prototype of our first Sentry Point of Presence (PoP).

Sending an event with Points of Presence
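To make the idea concrete, here is a minimal sketch of an edge layer that does its cheap checks close to the user, answers right away, and passes accepted events upstream in the background. It is not Relay's actual implementation: the `EdgePoP` class, the `send_upstream` callback, the size limit, the per-project quota table, and the status codes are all assumptions chosen for illustration.

```python
import queue
import threading
from typing import Callable, Dict

MAX_PAYLOAD_BYTES = 1_000_000  # hypothetical size limit, not Sentry's real one


class EdgePoP:
    """Toy point of presence: validate locally, forward in the background."""

    def __init__(self, quotas: Dict[str, int],
                 send_upstream: Callable[[str, bytes], None]) -> None:
        self._quotas = dict(quotas)          # remaining events per project key (assumed model)
        self._send_upstream = send_upstream  # slow call to the main region (hypothetical)
        self._lock = threading.Lock()
        self._outbox = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def handle(self, project_key: str, payload: bytes) -> int:
        """Return an HTTP-style status code without waiting for upstream."""
        if not payload:
            return 400  # invalid payload
        if len(payload) > MAX_PAYLOAD_BYTES:
            return 413  # payload too large
        with self._lock:
            remaining = self._quotas.get(project_key)
            if remaining is None:
                return 403  # unknown project
            if remaining <= 0:
                return 429  # quota exhausted, rejected at the edge
            self._quotas[project_key] = remaining - 1
        # Accepted: enqueue for forwarding and answer the client immediately.
        self._outbox.put((project_key, payload))
        return 200

    def _drain(self) -> None:
        """Background worker that forwards accepted events upstream."""
        while True:
            project_key, payload = self._outbox.get()
            try:
                self._send_upstream(project_key, payload)
            except Exception:
                pass  # a real system would retry or buffer; the sketch drops it
            finally:
                self._outbox.task_done()

    def flush(self) -> None:
        """Block until every queued event has been handed to send_upstream."""
        self._outbox.join()


if __name__ == "__main__":
    delivered = []
    pop = EdgePoP({"demo": 1}, lambda key, body: delivered.append((key, body)))
    print(pop.handle("demo", b'{"message": "boom"}'))   # 200
    print(pop.handle("demo", b'{"message": "again"}'))  # 429
    pop.flush()
```

The point of the design is that `handle` never touches the slow upstream link: a rejected event costs the client only the round trip to the nearby PoP, and an accepted one is acknowledged before it has crossed the ocean.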

Here are the most important components that shaped the final Points of Presence solution:

Sentry Relay — Relay is a component that has been powering our ingestion infrastructure for more than a year, and throughout the PoP project, it had to learn how to wear a new hat. Relay is what ultimately allows us to reject invalid or above-the-quota events on the edge while sending all the good ones upstream to our main processing infrastructure. And most importantly, PoP Relay does the forwarding asynchronously from the user’s perspective, so SDKs do not have to wait for the event to reach our main infrastructure.

Kubernetes — A container orchestration framework that lets us focus on “what” and not “how.” We already used Kubernetes for our main processing pipeline, so it was only natural to keep using it for our Points of Presence. Every Sentry PoP is basically a Kubernetes cluster in a separate geographic region that runs a few Relays and other auxiliary services, such as an abuse protection layer and logging agents. As Google Cloud Platform citizens, we naturally use Google Kubernetes Engine as our managed Kubernetes offering.

Google HTTPS Load Balancer — A geo-distributed managed service provided by Google Cloud that allows us to hide multiple PoP clusters behind a single anycast IP address, also offering geo-aware routing to backends and efficient TLS termination. Additionally, after a user’s event reaches its closest entry point to the Google infrastructure, the data stays within the Google Cloud network—avoiding the turmoil of the public Internet.

Nginx — Our favorite (but not only our favorite) web server. Nginx serves as a user-facing reverse proxy, powering our anti-abuse layer: if someone is sending us too many requests or if those requests are clearly invalid (invalid URL, payload too big, and so on), Nginx will promptly respond with the corresponding status code.

Envoy proxy — A powerful proxy software that connects different parts of our infrastructure and provides flexible instruments for cool things like service discovery, dynamic configuration, circuit breaking, and request retries.

When connected together, these components form the following architecture: