A Comedy of Errors -- Or Weis of Rookout
In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. This week, we hear from Or Weis, co-founder and CEO of Rookout. Rookout’s focus is on collecting data in a seamless, immediate way that maximizes a developer’s insight into live code.
Rookout actually has an integration with Sentry. We enable developers to extract data from wherever they want in their code and deliver it for analysis in whichever platform they’d like. Sentry is among our most commonly desired targets. We’re big fans.
In general, what we do is data collection and pipelining. Rookout doesn’t provide any analysis itself, because there’s essentially an infinite spectrum of ways to analyze the data. We’d rather work with different partners so that we can focus on providing optimized data. One great thing about our solution is the immediate data return. Click a button; without restarting, re-deploying, or writing more code, you immediately get the data you need.
I have one specific bug in mind, which I experienced while working in the Israeli Army. I was an intelligence officer in a specific unit that a lot of people might know. Back then, it was kind of a secret, but we’re not as good at keeping secrets as we used to be. 😉
I was responsible for building and deploying a system in production. The system was going to be deployed in the field, and we weren’t provided with any information about the environment in which the hardware and software would run. There were a lot of unknowns and a lot of things that we knew would change in production (or even in the final deployed environment). The situation was challenging, to say the least.
Because we had a limited number of iterations in which to run the software, we had to plan to run a few cases simultaneously. Running on a Windows machine, the software had to parse unknown hard drive filesystems that would be loaded at random.
So, the first time it ran, not only did we get a bug that caused the parsing to fail, but we also got a blue screen — the entire thing failed. We had zero data and no information on why it failed. We also had a limited set of repeat deployments looming in the background. You know that if you really fail — if you fail all the iterations — people are likely to die. With the Army, there’s always something important on the line, and that’s a lot of pressure, especially when you’re young. I think I was 20 or 21 at the time.
So, to figure out the issue, I knew we had to build it in a way that would collect data as it ran. In other words, if it fails after two iterations, we’ll have the information we need to make it work correctly. Part of what we discovered was that the bug wasn’t in our code. The bug ended up being a vulnerability in Windows itself: on a FAT32 disk, defining two partition tables causes Windows to blue-screen. At that moment in time, that vulnerability in Windows wasn’t well known. It wasn’t until two or three years later that the bug became publicly recognized.
Since our solution collected all of the data, we could statistically sample the memory from the disk. We were able to reconstruct it and figure out the issue. Ultimately, it took between a month and a month-and-a-half to solve.
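To make the idea of collecting data as the system runs a little more concrete, here is a minimal Python sketch of a crash-surviving journal. It is not the system described in this interview; the `CrashJournal` class, its method names, and the file layout are all hypothetical. The one technique it illustrates is writing a record and forcing it to disk before each risky step, so that the last line left behind after a hard crash points at the step that was in progress.

```python
import json
import os
import time


class CrashJournal:
    """Hypothetical append-only log, synced to disk before each risky step."""

    def __init__(self, path):
        self.path = path

    def record(self, step, **details):
        # Write one JSON line per step and force it out of OS buffers,
        # so the entry is on disk even if the machine dies right after.
        entry = {"time": time.time(), "step": step, "details": details}
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
            fh.flush()
            os.fsync(fh.fileno())

    def last_entry(self):
        # After a crash, the final line names the step that was running.
        if not os.path.exists(self.path):
            return None
        with open(self.path, encoding="utf-8") as fh:
            lines = [line for line in fh if line.strip()]
        return json.loads(lines[-1]) if lines else None
```

A caller would invoke `record("parse_partition_table", sector=0)` immediately before the parsing step and read `last_entry()` on the next run. Syncing on every record is slow, which is a reasonable trade when the number of runs is limited and each one has to yield information.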
After diagnosing the problem, we added a module that changed the way the partition looked before it got loaded by Windows. When Windows, or any computer, loads a filesystem, there are three levels. There’s the initial level where it maps the disk device itself. Then it maps the partition. And, finally, it maps the volume, which is the filesystem itself, on top of a specific subset — a specific set of clusters within that hard disk. So when the disk drive was loaded, you got the opportunity to say, Wait a second. Let me check that everything makes sense here. After you tidy things up, Windows resumes, and the entire process continues.
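As an illustration of the "let me check that everything makes sense here" step, the following Python sketch parses a classic MBR partition table from a raw 512-byte sector and flags entries that look inconsistent. It is only a sketch under stated assumptions: the interview does not say what the malformed table looked like, so the checks chosen here (status byte, bounds, overlap) are illustrative and do not reproduce the actual crash. The function and class names are hypothetical, and the code works on bytes in memory rather than hooking into how Windows loads a disk.

```python
import struct
from dataclasses import dataclass

SECTOR_SIZE = 512
TABLE_OFFSET = 446           # the partition table starts here in a classic MBR
ENTRY_SIZE = 16              # four 16-byte entries
BOOT_SIGNATURE = b"\x55\xaa"  # bytes 510-511 of the sector


@dataclass
class PartitionEntry:
    index: int
    status: int
    type_id: int
    first_lba: int
    sector_count: int


def parse_mbr(sector):
    """Return the non-empty partition entries found in an MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need a full 512-byte sector")
    if sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("missing 0x55AA boot signature")
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped),
        # first LBA and sector count as little-endian 32-bit integers.
        status, type_id, first_lba, count = struct.unpack("<B3xB3xII", raw)
        if type_id != 0:  # type 0 marks an unused slot
            entries.append(PartitionEntry(i, status, type_id, first_lba, count))
    return entries


def find_problems(entries, disk_sectors):
    """Illustrative sanity checks; an empty list means nothing was flagged."""
    problems = []
    for e in entries:
        if e.status not in (0x00, 0x80):
            problems.append(f"entry {e.index}: unexpected status byte {e.status:#04x}")
        if e.first_lba == 0 or e.sector_count == 0:
            problems.append(f"entry {e.index}: empty or zero-based extent")
        if e.first_lba + e.sector_count > disk_sectors:
            problems.append(f"entry {e.index}: extends past end of disk")
    ordered = sorted(entries, key=lambda e: e.first_lba)
    for a, b in zip(ordered, ordered[1:]):
        if a.first_lba + a.sector_count > b.first_lba:
            problems.append(f"entries {a.index} and {b.index} overlap")
    return problems
```

A tidy-up step like the one described above would run checks of this kind on the first sector, then decide whether to repair or hide the offending entries before handing the disk on.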
Once we figured it out and had that partition, we were able to simulate the device and run it. We were literally cheering: Yay! It’s crashing! Once we identified the problem, we brainstormed a solution in a few minutes and then implemented that solution within an hour and a half.
That case really drove my understanding of how difficult it is to go from the mindset of building software, where you’re on your own, working in your dev environment, to what happens when it actually hits the road and what constraints you might meet. I learned the importance of being able to both plan ahead and plan your data collection to respond to whatever happens. As you’re gaining experience and learning to develop software, you don’t really learn the minute aspects like a potential bug in the operating system. They never cover that in college, they never cover that in tutorials, and they never cover that when they give you the basic walkthrough of how to work with your SDK. In the end, these elements can really change how your software is being deployed and how it actually works.
Through my years working for other companies — and even my own — I’ve felt that pain point of, It’s my own software, my own code, my own application, and I have no idea what’s going on with it. Every time I want to know what’s going on, I have to go through an excruciating process of writing more code, adding that SDK call to Sentry, and writing that log-line or adding other dependencies. Even then, I have to go through the approval process of doing code review, doing those pull requests. I have to go through the CI/CD process, waiting for the machine to deploy, waiting for the code to run, waiting for the specific part of the code I want to test to run. Only then, after several hours, do I get a taste of what’s going on in my production code.
In a roundabout way, that’s how Rookout started. My co-founder and I looked at that process and said, It’s insane! Why the hell can’t I just press a button and get the information I want? So, that’s essentially what Rookout does. You see your code, you select the area you want, you click, and you get the data. We literally ran in the direction of our own pain point.