Learning Difficult Infrastructure Lessons with Conor McDermottroe of CircleCI
In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. This week we hear from Conor McDermottroe, a Software Developer at CircleCI. Conor’s focus is on streamlining processes by taking common products and services and building reusable libraries, so that CircleCI can move a little bit faster when creating new services.
Everything we do here is really five to ten people’s worth of work — so there's nothing I can really take credit for personally — but I’m really proud of shipping our 2.0 execution engine. Actually seeing it run such a huge portion of our builds as it does now — that's been massive. We were hitting boundaries on the old execution engine, which is still serving a lot of customers who do enjoy it, but it just wasn't going to scale. Seeing us as a company, as an engineering team, actually solve that problem and avoid that disaster was a wonderful thing to be involved in.
But we’re not here to just talk about how wonderful things are, are we?
Generally the things that I worry about are data-loss incidents, because they tend to be the slowest to recover from. Stateless services are lovely because if they crash, you just start a new one and it's fine. But if you trash a database, you're in a bad place [rhyme unintended, but in retrospect it should have been intended]. I tend to be very wary around changes because of that.
Which is why everybody's worst story is probably That time I dropped the database and broke shit. And here’s mine!
Two companies and ten years ago, we were running a standard MySQL setup with a master and a couple of replicas. We had bought the master first, so the replicas were on better hardware. One of my colleagues and I decided we would promote one of the replicas to be the new master. That sounds simple enough, but we didn’t yet have any tooling to do any of this. This was 2008, so although the tooling existed, we'd never used it. We were flying blind.
First things first, we wanted to make sure that everything in the replica was actually a legit copy of the data. There was some open source tooling that had a pretty solid reputation for MySQL, and one of the scripts they shipped would do a check, as best it could, to make sure that data was the same between master and replica.
We ran that and it showed some differences, which was no big deal. If you run MySQL replication for long enough, you're going to have those. They were relatively minor, so we figured we could just fix them up. But then we noticed this same company also had a tool that would fix the issues, and we thought, Excellent. The detection phase ran perfectly, we should just run this too, it'll be fine. Except we didn't read the script to figure out what it was actually doing.
We had one really, really big table, which didn't fit in RAM, and this turned out to be a crucial problem. The tool would – table by table – select all the data, delete it, then re-insert it so that it would flow down to the replicas via the replication logs. Well, the tool zipped through the first couple of tables and everything was working great. Then it got to the big table, sucked all the data out and deleted it… and promptly ran out of RAM and crashed. Plus the command to delete all the data had replicated down to the replicas as well, so it just nuked about 10 years of data.
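To make that failure mode concrete, here's a minimal sketch of the select-delete-reinsert pattern described above. It is not the tool we actually ran; the driver, connection details, and table names are all invented for illustration. The point is simply that the DELETE is committed and replicated before the re-insert happens, so a crash in between empties the table on every replica too.

```python
# Hypothetical sketch of the resync pattern described above (not the actual
# tool we ran). Connection details and table names are invented.
import mysql.connector

conn = mysql.connector.connect(host="db-master", user="admin",
                               password="secret", database="prod")
cur = conn.cursor()

for table in ("small_table_a", "small_table_b", "the_huge_table"):
    cur.execute(f"SELECT * FROM {table}")
    rows = cur.fetchall()                # entire table pulled into RAM
    if not rows:
        continue
    placeholders = ", ".join(["%s"] * len(rows[0]))

    cur.execute(f"DELETE FROM {table}")  # this statement replicates to every replica
    conn.commit()

    # If the process runs out of memory and dies anywhere past this point,
    # the DELETE has already flowed downstream and the table is empty everywhere.
    cur.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
```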
That was the day I learned to do point-in-time recovery.
We had a nightly backup, and we had been saving the replication logs to disk as well. Of course we had a plan for how to do restores, but we'd never actually tested a restore because we didn't have big enough hardware to be able to do it.
We took the last nightly backup and loaded it in. Then we thought we just needed to replay the last stretch of the replication logs to bring things up to the point where we'd had the problem. Naturally, we miscalculated the timestamps and replayed the very statement that broke everything, so we had to start all over again. Sadly, restoring from backup took about 45 minutes, so at this point we were in a full-on outage.
I think on the second attempt at restoring we did it right and restored it just before the point at which we dropped everything. We did lose about two or three minutes of data, which all things considered was a relief.
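For anyone who hasn't had to do it, the general shape of that point-in-time recovery looks something like the sketch below: restore the last full backup, then replay the binary (replication) logs up to a timestamp just before the bad statement. This assumes stock mysqldump-style backups and the standard mysqlbinlog tool; the file names and timestamps here are invented.

```python
# Rough shape of a MySQL point-in-time recovery (hypothetical paths and times).
import subprocess

# 1. Restore the last nightly dump. For us this was the ~45-minute step.
subprocess.run("mysql prod < /backups/nightly.sql", shell=True, check=True)

# 2. Replay the binary logs, stopping just *before* the statement that
#    destroyed the data. Getting this timestamp right is the step we fumbled
#    on the first attempt.
binlog = subprocess.run(
    ["mysqlbinlog", "--stop-datetime=2008-06-01 14:02:00",
     "/var/log/mysql/mysql-bin.000042"],
    check=True, capture_output=True)
subprocess.run(["mysql", "prod"], input=binlog.stdout, check=True)
```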
The (obvious) lesson: if you're using third-party tools, maybe you should test them first. Don't just run them for the first time in production. It was a complete facepalm moment.
As for issues here at CircleCI, there was one good error in our 1.0 build infrastructure, on the machines that host the containers that run customers’ builds and manage all of the container lifecycle directly.
We deliberately made sure that every container had a completely segregated network, because at the time the way that Docker did it wasn't quite ideal; there was the potential, at least, for snooping traffic between containers. We had given every container its own subnet, a /24, and we had set it up so that even within that /24 the IP assigned to your container would be randomized, which would make it harder to attack.
The machines were set up to run a fixed number of containers each, and would tear down a container at the end of a build and replace it with a clean one. We noticed that sometimes they'd start off with their full complement of containers running, then fail to start a new one and simply lose a container. Bit by bit, they'd be running fewer and fewer containers. The machines were relatively beefy, but after a while they would be running only one or two containers.
I spent some time trying to figure out what this was, and it turned out to be due to that networking change. As previously noted, each container got a /24. Within it, the gateway was .1 and the broadcast address was .255. When picking an IP to assign, we successfully excluded .1, but we didn't exclude .255. So every now and then we'd initialize a container with: hey, your IP is something ending in .255, which is also your broadcast address… and obviously the network just doesn't come up at all. Everything is completely hosed and nothing works.
Because of the tooling we were using to manipulate the containers, we couldn't tell any of this was happening. We would start the container, then attempt to SSH into it to complete the container setup. That would just hang because the container’s networking configuration was completely broken. It took us a disappointing amount of time to figure that one out.
It was literally a one-character fix that took us, on and off, maybe three months of investigation. And at the time the engineering team was small enough that almost everyone had read the code at some point and said, yeah, that looks fine. I think it was a collective How did we miss this?? I guess corner cases like that are very hard to find during testing, because they only happen about one in every 250 test runs. But if you start enough containers, it can happen about once a minute.
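For illustration, here's roughly what that class of off-by-one looks like. This is a hedged sketch, not CircleCI's actual code; the subnet and function names are invented.

```python
# Illustrative sketch of the bug (not CircleCI's actual code): pick a random
# host address for a new container inside its dedicated /24.
import ipaddress
import random

subnet = ipaddress.ip_network("10.0.7.0/24")   # invented example subnet

def pick_container_ip_buggy():
    # .0 is the network address and .1 is the gateway, both skipped, but .255
    # (the broadcast address) is still in the pool: roughly 1 pick in 254
    # hands a container an address whose networking can never come up.
    return subnet.network_address + random.randint(2, 255)

def pick_container_ip_fixed():
    # The one-character style of fix: stop one short of the broadcast address.
    return subnet.network_address + random.randint(2, 254)
```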