In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. This week we hear from Rich Chetwynd, a Product Manager at OneLogin, whose focus is on looking after developers and APIs. He is also the founder of ThisData.
I joined OneLogin as the Founder/CEO of ThisData when we were acquired. ThisData was an API-first risk engine that took in different types of contextual information to look for patterns, then used that information to score whether a given pattern was consistent with the user's normal traffic.
It made sense for OneLogin to acquire ThisData; we were the risk-scoring API behind OneLogin’s multi-factor authentication - we call it adaptive authentication. The pitch is: rather than getting asked for an MFA token every single time you log in, you only get asked when the risk is high. ThisData works out what that level of risk should be each time you log in.
Since I’m focused on Product Management at OneLogin, and don’t do much engineering work as a result, my own error story comes from ThisData. The way ThisData works is that it takes in lots of indicators from a web transaction: browser and device fingerprints, IP addresses and GeoIP locations, and various other attributes. It pulls them all together to create the risk score mentioned above.
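To give a rough idea of what that kind of contextual scoring looks like, here's a toy sketch in Python. Every signal, weight, and field name is hypothetical, made up for illustration rather than taken from ThisData's actual model:

```python
# Toy contextual risk scorer; the signals and weights are invented
# for illustration, not ThisData's real model.
def risk_score(event, profile):
    score = 0.0
    if event["country"] not in profile["known_countries"]:
        score += 0.5   # login from a country this user has never used
    if event["fingerprint"] not in profile["known_devices"]:
        score += 0.3   # unfamiliar browser/device fingerprint
    if event["ip"] in profile.get("flagged_ips", set()):
        score += 0.9   # address previously tied to abuse
    return min(score, 1.0)  # clamp to [0, 1]

profile = {"known_countries": {"NZ"}, "known_devices": {"fp-123"}}
event = {"country": "FR", "fingerprint": "fp-999", "ip": "203.0.113.7"}
print(risk_score(event, profile))  # well above a threshold like 0.5, so: challenge for MFA
```

A real engine would learn these patterns from traffic rather than hard-code weights; the point is simply that a high score triggers an MFA challenge while a low score lets the login through quietly.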
As part of that scoring, it’s important to build up a profile for each user. The way we approached it in the early days of ThisData was like this: you might have multiple profiles online and want to aggregate the traffic from all of them. So if you had logged in with Google, or done something on Facebook, or you were on Twitter or some other custom site, we wanted to be able to join all those different aliases together. That way we knew it was just one person, and we could use those patterns of behavior across different sources. And it might not just be social sites; a more common scenario is that you’re using a bunch of different custom sites, where in one place your username is Rich and in another it’s Rich@gmail.com, and so on.
We built the alias-merging system in quite a smart way: it would look at patterns around usernames, emails, and various IDs that might have been provided, then fall back to other attributes we could get. We put all this information together to build profiles, and the profile matching was pretty good; we tested it heavily for about six months before we released it to the world.
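As a rough illustration of that merging idea (a sketch, not the actual ThisData implementation), a union-find structure over identifiers captures the core behavior: any two aliases that show up together in one event get folded into the same profile.

```python
class AliasMerger:
    """Union-find over user identifiers: aliases seen together in one
    transaction are treated as belonging to the same person."""

    def __init__(self):
        self.parent = {}

    def _find(self, alias):
        # Follow parent links to the root, halving the path as we go.
        self.parent.setdefault(alias, alias)
        while self.parent[alias] != alias:
            self.parent[alias] = self.parent[self.parent[alias]]
            alias = self.parent[alias]
        return alias

    def link(self, *aliases):
        # Merge every alias observed together in a single event.
        roots = [self._find(a) for a in aliases]
        for r in roots[1:]:
            self.parent[r] = roots[0]

    def same_person(self, a, b):
        return self._find(a) == self._find(b)

merger = AliasMerger()
merger.link("rich", "rich@gmail.com")        # one site's login event
merger.link("rich@gmail.com", "rchetwynd")   # another site's event
print(merger.same_person("rich", "rchetwynd"))  # True: transitively linked
```

The real system layered heuristics (username patterns, emails, provided IDs, fallback attributes) on top of something like this; the sketch only shows the transitive-joining part.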
So we put it out there. These were very early days, probably our first month or so of ThisData, and we only had a couple of pilot customers. One of them had a few hundred users, and the profiles were all working seamlessly; the aliases were all mapping just as we expected. Then one day they said they had a confirmed breach. They knew because ThisData would look for high-risk activity, flag when a transaction was extremely high risk, and then send an email notification to the end user, similar to what you get from Facebook: someone signed into your account from another country, is this you?
It was confirmed the person in question was signed in from somewhere in Europe, even though in reality they were on the other side of the world at the time. There were tons of transactions coming through from Europe, so the customer thought they’d been breached and were under attack.
They were disappointed about that, of course, but thanks to the early detection, the notification, and the way they were able to respond, the whole thing was shut down within about ten minutes. That part was pretty cool: everyone was super happy about the service, everyone was kind of surprised it had happened, and we gave ourselves a pat on the back.
Then we did a bit of a dig into it, like we always did whenever there was any sort of issue, to see what went on with those transactions and make sure we’d done everything as well as we could. We’d also look for ways to improve the service even more, you know, sharpen the knife. The upshot of that dig was that this wasn’t a breach at all. It was actually a bug.
What happened was that our alias matching had gotten confused. It thought the user logging in from the other side of the world was one particular user, but the alias that had belonged to that user had actually been moved to another user, and our software didn’t account for that use case. We had a G Suite integration, and with G Suite you can have your main email address plus multiple other domains, email addresses, and aliases.
This company had a pattern when somebody left: they’d move the departing employee’s email addresses to another user, so that person could catch all the departing user’s email, or for whatever other reasons they had internally. But that shift meant we weren’t matching aliases correctly. We realized it was a matching problem: aliases could potentially move around. There were a couple of other cases of that happening over the month or so we’d had it out there. Which led to a realization of: oh shit, we really thought we were doing all this good work, but not quite.
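In hindsight the failure mode is easy to sketch. The snippet below is a hypothetical illustration, not our codebase: an alias-to-profile mapping that is written once and never re-checked against the directory keeps attributing a moved alias to its old owner.

```python
# Hypothetical illustration of the bug: a cache that is only ever
# added to, never re-validated against the directory.
alias_to_profile = {}

def attribute(alias, directory):
    # Bug: once an alias is cached, we never re-consult the directory.
    if alias not in alias_to_profile:
        alias_to_profile[alias] = directory[alias]
    return alias_to_profile[alias]

directory = {"alice@example.com": "alice"}
attribute("alice@example.com", directory)   # -> "alice", correctly

# Alice leaves; IT moves her address onto Bob's account.
directory["alice@example.com"] = "bob"

# Bob's perfectly normal logins now look like "alice" signing in
# from the other side of the world.
attribute("alice@example.com", directory)   # still -> "alice"
```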
The initial fix for us was to change our sync with Google. We had been syncing once a day as a background process, just to get the latest user profile information, aliases, and other bits of metadata we used. The quickest, easiest fix was to start doing that hourly instead. As time moved on, we moved to a push-notification style using webhooks, and ultimately we shut down the Google integration altogether because it wasn’t providing as much value as the other things we were doing.
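The shape of that progression, sketched in Python (all function and field names here are hypothetical):

```python
import time

def sync_directory(fetch_all_profiles, apply_update):
    """One full resync: pull every profile from the directory and apply it."""
    for profile in fetch_all_profiles():
        apply_update(profile)

def run_periodic(fetch_all_profiles, apply_update, interval_seconds=3600):
    """The quick fix: run the same batch sync hourly (3600s) instead of daily."""
    while True:
        sync_directory(fetch_all_profiles, apply_update)
        time.sleep(interval_seconds)

def handle_webhook(event, apply_update):
    """The later fix: the provider pushes each change as it happens, so a
    moved alias is reflected in seconds rather than hours."""
    apply_update(event["profile"])
```

Tightening the polling interval just shrinks the window during which a moved alias is stale; the webhook approach closes it almost entirely, which is why push-style sync was the longer-term direction.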
But yeah… it was quite a big bug for a service whose purpose is to detect breaches to hand you a false positive about a breach. Everyone was always asking about false positives: How do you get it right? Is this really a confirmed breach? How do you know how accurate you are? We were always about testing, and I’d say we were pretty accurate, but to have that slapped back in our faces, having gotten it completely wrong, was pretty embarrassing. And a bit of a confidence knock, given what the customer was expecting.
All that said, aside from helping us improve our service in those early days, there was also a silver lining for the customer. It was like a fire drill for them: they locked down the account and worked out quickly what they were going to do in that sort of situation and how to make it happen.
It became this sort of happy moment, where the customer was pleased they’d had the opportunity to ensure their own internal processes had worked without the actual threat of a major breach. I mean, it was still annoying that we’d run into this use case that we hadn’t yet considered, but we were happy that they were happy. And it enabled us to catch a problem on our side that other customers may have found more frustrating if they had encountered it.