How Matthew Machuga of Auth0 Narrowly Avoided Launch Catastrophe
In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. This week we hear from Matthew Machuga, who leads the analytics team at Auth0. Auth0 solves the most complex and large-scale identity use cases for global enterprises, securing billions of logins every year.
I've been with Auth0 since July, leading the Analytics team. We're responsible for one of Auth0’s newest projects, largely targeted toward giving marketers a way to view when and how people are active and using the system, so they can make better business decisions.
Security is a huge challenge for any company that works directly with digital identity, as we do, since mismanagement can be catastrophic. We understand this innately, so we take security very seriously. We follow many different compliance standards, which means we put a lot of effort into balancing out all of their rules.
But I am really proud of just being able to get our team ramped up. I would say building an entire project from the ground up - from the AWS configuration all the way up to the front-end optimization - has been very exciting to do as a team.
But as to errors we’ve faced?
About a month ago I was trying to deploy one of our services to a new region, so that we would have a redundant backup somewhere. We're using Terraform to do this. And when Terraform presents you with an error message, you're not entirely sure if the error message is from Terraform, if it's from the underlying library that Terraform uses, or if it's from AWS.
In this case, I was trying to deploy something in a cluster, and I got an error message back saying that my deploy failed because I had "entered an invalid number of nodes". And this was some arbitrary number, let's say it's 10. Okay that's weird, I thought. I got the security guy on the phone to take a look at it with me. We went through the docs a little bit and tried troubleshooting - we tried another number, because once before, arbitrarily changing a number and then setting it back to the original had fixed an issue, so we tried that again.
This time when I updated the number it said "invalid number nodes for region - exceeded valid number", and it gave me back “0” as the appropriate number. I thought, this is nonsensical, there is no reason this should be popping up. After maybe five or ten minutes of digging, the security guy rhetorically asked, "I wonder if this is not available in this AWS region?"
I was still an AWS newbie at this point. I didn't realize that was a thing. “They seriously don't allow certain services in different regions?”
He said, “Yeah it hasn't rolled out yet probably.”
Sure enough I went into the control panel, scrolled down, and for that specific region it was unavailable. That was very frustrating, because it kind of shot our deployment plan in the foot before we got it out the door. We also discovered it way too late in the process.
We had been planning to go live in that region for probably two months, and the whole time we were on schedule. It was one of those things where all systems were go, and we were doing the final deployment, but then it just… Would. Not. Launch. We actually rewrote a massive part of our infrastructure to accommodate that one.
One month later and we’re finally deploying it this week. So it was quite an endeavor.
Because of all this, we do far more investigation ahead of time. We're now all fully aware that Amazon has this caveat, so we keep an eye out for this problem and refer to their docs on a regular basis. We also try to track their release history and their blog posts. For instance, if they released the feature in December 2016, and we see they released it to three more regions in June 2017, we know that their rollout cadence is very slow. If they've rolled out a few times since then, we can usually take a guess as to whether it will be available by the time we're ready to go live in a specific region. That way we can make appropriate failover plans in the future.
So now we spend a little extra time looking through all the possible non-coding hangups we might encounter along the way. If something isn't available, as happened here with AWS, we just write (or rewrite) it using a different technology altogether.
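To make that kind of up-front check concrete, here is a minimal sketch (not Auth0's actual tooling) of a pre-deployment guard written in Python with boto3. The region and service names are placeholders, and boto3's region list comes from botocore's bundled endpoint data, so AWS's own documentation remains the source of truth - but a script like this can flag a missing service weeks before the final deployment instead of at `terraform apply` time.

```python
# Minimal sketch, assuming boto3 is installed and the region/service names
# below are illustrative placeholders rather than a real deployment manifest.
import sys

import boto3

TARGET_REGION = "eu-central-1"                        # hypothetical launch region
REQUIRED_SERVICES = ["kinesis", "elasticache", "es"]  # hypothetical dependencies


def missing_services(region: str, services: list[str]) -> list[str]:
    """Return the services that boto3 does not list as available in `region`."""
    session = boto3.session.Session()
    return [
        service
        for service in services
        if region not in session.get_available_regions(service)
    ]


if __name__ == "__main__":
    gaps = missing_services(TARGET_REGION, REQUIRED_SERVICES)
    if gaps:
        # Fail fast so the gap is discovered during planning, not at launch.
        print(f"Not available in {TARGET_REGION}: {', '.join(gaps)}")
        sys.exit(1)
    print(f"All required services are offered in {TARGET_REGION}")
```

Running a check like this in CI alongside the Terraform plan is one way to catch a region gap early; cross-checking the result against the AWS Regional Services documentation covers the cases where the bundled endpoint data lags behind a recent rollout.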
The big way we’re celebrating finally deploying it is by… moving onto the next feature, of course!