For the fourth edition of The Monitor, we spoke to Valentino Volonghi, the CTO of AdRoll and a member of its founding team. AdRoll is the most widely used prospecting and retargeting platform in the world. They have 100,000 customers across 35 countries and process right around 70 billion requests a day. This post is an excerpt from a video interview and longer article that you can see here.
As CTO of AdRoll, I oversee all of the technology that goes into running our business on thousands of globally distributed machines. It’s my job to make sure that when the unexpected happens (and it will), it is isolated and managed before it sets off a domino effect. That makes real-time monitoring of our system a critical aspect of our business.
Here at AdRoll we help thousands of companies drive people to their websites through retargeting and prospecting ads. They use AdRoll to target people who visited their site but didn’t convert and show them an ad to get them to come back. Or they can get new visitors to their site by targeting people who are similar to their existing customers.
But to do that, every day AdRoll needs to handle over 70 billion requests from all over the internet and all across the globe. It’s an almost unfathomable number of requests that need to be processed. And since each one of those requests needs to be handled in 100ms or less, our infrastructure needs to be globally distributed.
Today we have AdRoll deployed on as many as 3,000 different machines across the globe. At the scale of AdRoll, if an issue spreads through our system and creates just a 1% error rate, that amounts to 700 million errors a day. An error rate like that could easily cost us well over a million dollars a day.
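The arithmetic behind that figure is easy to check. The snippet below is only a minimal sketch using the numbers quoted above (roughly 70 billion requests a day and a 1% error rate); the function and constant names are made up for illustration and are not part of AdRoll's systems.

```python
# Figures quoted in the interview; the names here are illustrative only.
REQUESTS_PER_DAY = 70_000_000_000  # roughly 70 billion requests a day
ERROR_RATE_PERCENT = 1             # the 1% error rate used as the example


def errors_per_day(requests_per_day: int, error_rate_percent: int) -> int:
    """Return the number of failed requests per day, using integer arithmetic."""
    return requests_per_day * error_rate_percent // 100


daily_errors = errors_per_day(REQUESTS_PER_DAY, ERROR_RATE_PERCENT)
print(f"{daily_errors:,} errors per day")
```

Integer arithmetic is used so the result is exact rather than subject to floating-point rounding.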
With that many requests happening on that many machines, all around the world, every machine needs to be monitored for the unexpected, so when that first domino falls, we know and can react.
Each machine runs a complicated network of decisions for buying and delivering ads, and we log that entire flow. That translates into about seven trillion events every day. These events are the core of our monitoring. They tell us how AdRoll is operating.
A pillar of our monitoring and incident response philosophy is that instances are not to be coddled. A machine can just be killed and rebooted and nobody is going to cry over it. That philosophy and the ability to act on it are crucial in stopping the bleeding as soon as an issue presents itself, as soon as that first domino falls. This allows us to isolate an issue before it has the chance to set off a domino effect across the system.
Knowing that an issue is controlled gives our engineering team breathing room to approach any issue a little bit more calmly. This is the basis for our Blue-Green deployment strategy.
This is only part of what Valentino had to say. Go here to watch our video interview and read the rest of his post.