We're Officially Clouding
It was the first weekend of July. And because we’re in San Francisco, it was so foggy and windy that we may as well have been huddled inside an Antarctic research station trying to avoid The Thing.
The operations team had gathered at Sentry HQ to finish migrating our infrastructure to Google Cloud Platform. The previous two and a half months had been tireless (and tiring), but we finished that day by switching over sentry.io’s DNS records and watched as traffic slowly moved from our colo provider in Texas to us-central1 in Iowa.
We’ve now gone several months without dedicated hardware, with no downtime along the way, and we feel good about having made the switch from bare metal to a cloud provider. Here’s what we learned.
It’s all about meeting unpredictable demand.
As an error tracking service, Sentry’s traffic is naturally unpredictable, as there’s simply no way to foresee when a user’s next influx of events will be. On bare metal, we handled this by preparing for the worst(ish) and over-provisioning machines in case of a spike. We’re hardly the only company to do this; it’s a popular practice for anyone running on dedicated hardware. And since providers often compete on price, users like us reap the benefits of cheap computing power.
Unfortunately, as demand grew, our window for procuring new machines shrunk. We demanded more from our provider, requesting machines before we really needed them and kept them idle for days on end. This was exacerbated when we needed bespoke machines, since databases and firewalls took even more time to piece together than commodity boxes.
But even in the best case, you still had an onsite engineer sprinting down the floor clutching a machine like Marshawn Lynch clutches a football. It was too nerve-wracking. We made the decision to switch to GCP because the machines are already there, they turn on seconds after we request them, and we only pay for them when they’re on.
Building the bridge is harder than crossing it.
We decided that migrating to GCP was possible in April, and the operations team spent the next two months working diligently to make it happen. Our first order of business: weed out all the single data center assumptions that we’d made. Sentry was originally constructed for internal services communicating across the room from each other. Increasing that distance to hundreds of miles during the migration would change behavior in ways that we never planned for. At the same time, we wanted to make certain that we could sustain the same throughput between two providers during the migration that we previously sustained inside of only one.
The first fork in the road that we came to was the literal network bridge. We had two options: Maintain our own IPsec VPN or encrypt arbitrary connections between providers. Weighing the options, we agreed that public end-to-end latency was low enough that we could rely on stunnel to protect our private data across the public wire. Funneling this private traffic through machines acting as pseudo-NATs yielded surprisingly solid results. For two providers that were roughly 650 miles apart, we saw latencies of around 15 milliseconds for established connections.
The rest of our time was spent simulating worst-case scenarios, like “What happens if this specific machine disappears?” and “How do we point traffic back the other way if something goes wrong?” After a few days of back-and-forth on disaster scenarios, we ultimately determined that we could successfully migrate—with the caveat that the more time we spent straddled between two providers, the less resilient we would be.
Every change to infrastructure extends your timeline.
A lot of conversations about migrating to the cloud weigh the pros and cons of doing a “lift and shift” vs. re-architecting for the cloud. We chose the former. If we were going to be able to migrate quickly, it was because we were treating GCP as a hardware provider. We gave them money, they gave us machines to connect to and configure. Our entire migration plan was focused around moving off of our current provider, not adopting a cloud-based architecture.
Sure, there were solid arguments for adopting GCP services as we started moving, but we cast those arguments aside and reminded ourselves that the primary reason for our change was not about architecture — it was about infrastructure. Minimizing infrastructure changes not only reduced the work required, but also reduced the possibility for change in application behavior. Our focus was on building the bridge, not rebuilding Sentry.
Migrate like you’re stealing a base.
Once we agreed that we’d built the bridge correctly, we sought out to divert our traffic the safest way we could think of: slow and thorough testing, followed by quick and confident migrating. We spent a week diverting our L4 traffic to GCP in short bursts, which helped us build confidence that we could process data in one provider and store it in the other.
Then the migration really got underway. It started with failing over our single busiest database, just to be extra certain that Compute Engine could actually keep up with our IO. Those requirements met, it was a race to get everything else into GCP — the other databases, the workers writing to them, and the web machines reading from them. We did everything we could to rid ourselves of hesitation. Like stealing a base, successful migrations are the result of careful planning and confident execution.
Dust doesn’t settle, you settle dust.
The fateful and foggy July day when we switched over finally came. After a few days, we deleted our old provider’s dashboard from our Bookmarks and set out to get comfortable in GCP: we hung a few pictures, removed the shims we put in place for the migration and checked what time the bar across the street had happy hour. More to the point, now that we had a clearer picture of resource requirements, we could start resizing our instances.
No matter how long we spent projecting resource usage within Google Compute Engine, we never would have predicted our increased throughput. Due to GCP’s default microarchitecture, Haswell, we noticed an immediate performance increase across our CPU-intensive workloads, namely source map processing. The operations team spent the next few weeks making conservative reductions in our infrastructure, and still managed to cut our infrastructure costs by roughly 20%. No fancy cloud technology, no giant infrastructure undertaking — just new rocks that were better at math.
Now that we’ve finished our apples-to-apples migration, we can finally explore all of the features that GCP provides in hopes of adding even more resilience to Sentry. We’ll talk about these features, as well as the techniques we use to cut costs, in future blog posts.
I’d like to send out a special thanks to Matt Robenolt and Evan Ralston for their contributions to this project; you are both irreplaceable parts of the Sentry team. We would love to have another person on our team to help us break ground on our next infrastructure build-out. Maybe that person is you?