Thinking Through AWS Anomalies with Valentino Volonghi of AdRoll
In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. Our first engineer is Valentino Volonghi. He’s the CTO of AdRoll and was also one of their very first employees way back when.
This particular issue is from a long time ago. There are fun ones throughout our history, of course, but this one was particularly obscure and didn’t present itself in a way that you would normally expect. Developers might often think back on a time when the system crashed or they faced some sort of spectacular failure, and certainly every company has its own fair share of these. However, in our case, the most fascinating event was one in which we were seeing a critical failure, but no machines were crashing, and the systems appeared to be completely healthy.
At AdRoll, we manage billions of requests on a daily basis. To accomplish this, we load-balance and scale that traffic across numerous machines. At the same time, we’re always looking to ensure we don’t have more machines than we need because excess boxes beyond a certain level just cause extra headaches during releases and maintenance.
We use AWS here and, whenever Amazon releases a new system that’s even a little bit faster, we usually adopt it with two goals:
Continue to move towards using fewer machines to accomplish the same amount of work
Minimize the ways a box can fail or behave unexpectedly
A few years ago, Amazon released the C3 Instance Family. At the time, we were using C1 Extra Large Instances, so this felt like the right moment to upgrade. The process was incredibly smooth. Not only did it enable us to use fewer instances, but we also saw the number of timeouts generated by the software drop from an already low 0.1% rate to what was effectively a 0% rate. There were very occasional timeouts but, for practical purposes, the machines themselves never timed out. From an engineering perspective, that was fantastic. If you remove even that tiny 0.1% failure rate, it means we can process many more requests, and our customers can realize even more revenue due to better performance.
However, over the course of the following week, we learned that our machines weren’t spending their budgets properly, and we couldn’t figure out why, since everything should have been better than ever! After a lot of investigation, we noticed that CPU usage was low on most of the machines sitting behind one particular load balancer. We immediately suspected this was the culprit, but we couldn’t tell whether it was the cause or just a symptom.
As we thought it through, we kept coming back to the fact that the most notable change that occurred with the release wasn’t the software itself; it was the remarkable drop in timeout rate. And that’s when we pinpointed that the core problem was this lack of timeouts.
But why?
In advertising in particular, connections are typically very long-lived. Between an exchange and our machines, a connection can last several hours and carry many thousands of requests. This way, we avoid paying the handshake latency of setting up a new connection for every request.
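To make that handshake saving concrete, here is a minimal sketch (not AdRoll’s actual client code) of reusing one persistent HTTP connection for many requests with Python’s standard library; the host name and path are placeholders.

```python
import http.client

# Open one TCP (and TLS) connection to the exchange and keep it alive.
# Every request after the first skips the connection-setup handshake entirely.
conn = http.client.HTTPSConnection("exchange.example.com", timeout=5)

for _ in range(10_000):
    conn.request("POST", "/bid", body=b"{}",
                 headers={"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection can be reused

conn.close()
```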
Additionally, we were using Layer 3 load balancers that balanced on individual connections rather than on HTTP requests, to keep latency in the chain as low as possible. As a result, the first two or three instances that joined the load balancer group post-deployment received most of the connections. Since they weren’t timing out anymore, they held on to those connections, and traffic wasn’t being balanced across all the machines.
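As a rough illustration (a toy simulation, not our load balancer), connection-level balancing hands out whole connections at connect time; if those connections never close, instances that join the pool later see almost none of the traffic.

```python
import itertools

# Toy model: a connection-level balancer assigns each new connection
# round-robin to whatever backends exist at connect time.
backends = ["i-aaa", "i-bbb", "i-ccc"]  # the first instances to come up
rr = itertools.cycle(backends)

# Exchanges open their long-lived connections right after the deploy...
connections = {b: 0 for b in backends}
for _ in range(300):
    connections[next(rr)] += 1

# ...then more instances join, but no existing connection ever times out,
# so the newcomers only receive brand-new connections, of which there are few.
for newcomer in ("i-ddd", "i-eee", "i-fff"):
    connections[newcomer] = 0

print(connections)
# {'i-aaa': 100, 'i-bbb': 100, 'i-ccc': 100, 'i-ddd': 0, 'i-eee': 0, 'i-fff': 0}
```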
The problem is that every single machine is assigned a budget that needs to be spent, but the machines hogging all the connections were the only ones able to spend theirs. Every other machine’s budget went unspent, which led to us spending significantly less money than usual.
Obviously, we wanted to stick with the new instances, but we still needed to solve this problem for the future. We ultimately built out a long-term solution, but in the short term, our quick-and-dirty fix was to introduce random timeouts into the software, giving any connection a 1-in-10,000 chance of timing out and therefore giving the system a natural incentive to rebalance connections across different boxes. Even in a worst-case scenario, this kept connections well distributed across machines, even for our very long-lived ones.
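Here is a minimal sketch of that 1-in-10,000 trick, assuming the probability is applied per request and that the bidder can ask the client to close the connection via a standard `Connection: close` header; the `Response` class and placeholder bid logic are illustrative, not our production code.

```python
import random

REBALANCE_PROBABILITY = 1 / 10_000  # roughly one request in ten thousand


class Response:
    """Minimal stand-in for whatever response object the bidder returns."""

    def __init__(self, body: bytes):
        self.body = body
        self.headers = {"Connection": "keep-alive"}


def handle_request(request_body: bytes) -> Response:
    response = Response(b'{"bid": 0.42}')  # placeholder bidding logic

    # With a tiny probability, deliberately end this otherwise immortal
    # connection. The exchange reconnects, the load balancer assigns the new
    # connection to a (possibly different) instance, and over time the
    # connections spread back out across the whole group of machines.
    if random.random() < REBALANCE_PROBABILITY:
        response.headers["Connection"] = "close"

    return response
```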
It took us about a week to come up with this solution and, because we weren’t spending as much money, we also weren’t taking in as much money, so we saw a drop in revenue for that week. What we learned is that normal alerts and signals are not enough; you’re not always looking in the right place to find the key that will help you quickly solve an issue. You need a way to discover and think through anomalies that fall outside the range of your normally expected events.