How Py Surfaces Critical Errors with Sentry
In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. This week, we hear about Py from Derek Lo, Founder and CEO, and Brian Sweatt, Lead Full-Stack Engineer. Py empowers hiring teams with a suite of products to evaluate technical candidates.
Py is democratizing access to career opportunities in tech. Over the last two years, we’ve helped over a million people improve their programming skills. What we learned along the way is that many of our users were having difficulty getting past resume screens at companies. That insight fueled our launch of Py for Work, a platform that helps companies take a skills-based approach to hiring.
The most significant value-add we can bring to a company is the conservation of engineering time. Essentially, companies that are aggressively hiring end up having to interview a lot of engineers. Unsurprisingly, the process takes a large amount of time for both engineering and recruiting. Py identifies candidates who are a good fit for companies from a technical perspective earlier in the funnel. Voilà — time saved.
Here is an example of how we tracked down an obscure bug in Py with the help of Sentry, a tool that has saved us engineering time.
Error context is key
Recently, a group of candidates could not log in to our site, and we couldn’t reproduce the problem on our side. When we looked in the AWS logs, nothing stood out. The issue was only happening to a small subset of users, so we were stumped.
We initially looked into one of our microservices for user authentication, because that’s the natural place to look for a sign-in bug. While we used Sentry for other things, we didn’t have it integrated there yet, so we quickly pushed a patch to production. Once it was up and running, the Sentry integration uncovered additional context about the requests that were causing errors, including emails from the customers complaining about the issue.
As it turns out, a third-party service integration was breaking our sign-in flow completely. This specific service integration was with a recruiting company that has a hook-in to our site that allows them to create candidates. We realized that the error was only happening for candidates with accounts made using this flow. Essentially, if users were brought in through this third-party service and that service was throwing any kind of error, sign-in would fail completely.
You know what they say about assumptions
Before investigating the sign-in bug, we assumed that users who couldn’t sign in simply weren’t authorized for a session. We eventually discovered that wasn’t always the case. The failure could instead come from a misconfiguration for one of our customers, or from an error because the third-party service went down, as with this sign-in bug. Regardless of the cause, these circumstances shouldn’t prevent a user from signing in.
Instead of making that assumption, we now look at the content of the error to see what is actually happening. If it’s something that should stop the user from signing in, we have a record of the user and know whether the user actually isn’t authorized or whether something else went wrong. We also now log a warning rather than an error in Sentry.
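To make the idea concrete, here is a minimal sketch of a sign-in path that inspects the error instead of assuming the user is unauthorized. It is not Py's actual code: the exception classes, the `sign_in` function, and the two callables it receives are all hypothetical names, and the standard `logging` module stands in for whatever reporting the service really uses.

```python
import logging

logger = logging.getLogger("auth")


class NotAuthorizedError(Exception):
    """Hypothetical: the user really is not allowed to sign in."""


class ThirdPartyServiceError(Exception):
    """Hypothetical: the recruiting integration failed or was unreachable."""


def sign_in(user_email, check_credentials, fetch_partner_profile):
    """Return (signed_in, partner_profile) for the given user.

    Both callables are placeholders for real service calls.
    """
    try:
        check_credentials(user_email)
    except NotAuthorizedError:
        # Only a genuine authorization failure blocks sign-in.
        logger.warning("sign-in denied for %s", user_email)
        return False, None

    try:
        profile = fetch_partner_profile(user_email)
    except ThirdPartyServiceError as exc:
        # The integration is optional for sign-in: record it and continue.
        logger.warning("partner lookup failed for %s: %s", user_email, exc)
        profile = None

    return True, profile
```

In this sketch a failing integration costs the user some partner data but never the session, and the log still records which user was affected and why.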
Ignore what you can, surface the critical errors
We were actually getting tons of errors from this third-party service, which made the sign-in errors hard to spot while we were trying to figure out the problem. We’ve fixed that now. No third-party service can crash our sign-in anymore, whether we can’t reach the service or it returns an error, which is great.
One of the reasons we didn’t see these bugs in Sentry initially is that the same third-party service was throwing a lot of errors in our log when creating candidates. These errors were drowning out all the other issues that were coming up. We knew the cause of the noisy errors and that they didn’t need to be fixed immediately. Thankfully, Sentry allowed us to hide them and surface the more critical errors.
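Hiding a known issue can be done from the Sentry interface, and the text does not say which mechanism was used here. As one possible code-level alternative, the Python SDK accepts a `before_send` hook that can drop events before they are sent. The sketch below assumes a hypothetical `PartnerCandidateError` raised by the noisy integration; only the hook signature and the `sentry_sdk.init(before_send=...)` option come from the SDK.

```python
class PartnerCandidateError(Exception):
    """Hypothetical error raised by the noisy third-party integration."""


# Exception types judged safe to drop (an assumption for this sketch).
IGNORED_ERRORS = (PartnerCandidateError,)


def before_send(event, hint):
    """Drop events for known, low-priority errors; keep everything else."""
    exc_info = hint.get("exc_info")
    if exc_info is not None and isinstance(exc_info[1], IGNORED_ERRORS):
        return None  # returning None tells the SDK to discard the event
    return event


# With the SDK installed, the hook would be registered at startup:
# sentry_sdk.init(dsn=YOUR_DSN, before_send=before_send)
```

Filtering by exception type keeps the rule narrow, so unrelated errors from the same service would still reach Sentry.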