In A Comedy of Errors, we talk to engineers about the weirdest, worst, and most interesting application and infrastructure issues they’ve encountered (and resolved) over the years. This week, we hear from Jason Dufair, Full Stack Developer on the Studio team at Purdue University.
In February 2013, Gradient, Purdue’s now-retired peer review app, was brand new. The app was a .NET MVC web app hosted on campus at Purdue, and the internal team was fairly early in its maturity curve. Gradient assignments had several “phases” that required use of the app: submission, calibration (against the instructor examples), peer review, and self review. Needless to say, students would wait until the last minute of each phase to submit or review work (we’ve all been there, haven’t we?).
One particular class, STAT 301, had 1200 students enrolled who jumped right into the deep end with Gradient. The first assignment was due at midnight on a Friday, so we knew there would be substantial load. With fingers crossed, we made a few last-minute performance tweaks, deployed, and headed home. (This was before we had CI; we were rocking Subversion in all its non-distributed glory.)
Friday evening, we started receiving texts and phone calls from Purdue’s Director of Teaching and Learning. Gradient was crashing for STAT, and no one could submit their assignment. We had 1200 angry students trying to reload, attach, and submit, but getting only server timeouts. The server was completely unresponsive. We were flying blind.
We got infrastructure on the line, and they literally had to open the machines up to add more RAM. This was also before we used VMs, let alone cloud infrastructure.
We spent Friday night through Sunday night tweaking queries and integrating memcache into a hot app. We had to go query by query: pushing performance fixes, caching frequent queries, and redeploying. The team tried to keep an eye on how many students were able to submit and how many were sitting outside the central IT office with torches and pitchforks. Two developers survived the weekend on nothing but adrenaline and caffeine.
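For readers who haven’t done this dance: the “caching frequent queries” fix above is the classic cache-aside pattern. Here is a minimal sketch in Python (the real app was .NET MVC talking to memcache; the names `get_submissions`, `query_db`, and the in-memory TTL cache here are hypothetical stand-ins, but the shape is the same): check the cache before hitting the database, and store results with a short TTL so a hot query stops hammering the server.

```python
import time

class SimpleTTLCache:
    """Tiny in-process stand-in for memcache: values expire after a TTL."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            # Entry is stale; drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds=30):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

cache = SimpleTTLCache()

def get_submissions(assignment_id, query_db):
    """Cache-aside: serve repeated reads from the cache, fall back to the DB."""
    key = f"submissions:{assignment_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = query_db(assignment_id)       # the expensive query
    cache.set(key, result, ttl_seconds=30)  # short TTL keeps data reasonably fresh
    return result
```

With 1200 students reloading the same assignment page, even a 30-second TTL collapses thousands of identical queries into one database hit per expiry window.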
Needless to say, Sentry would have helped quite a bit that weekend by detecting errors and timeouts before we went into crisis mode. Sentry would have also helped triage the worst offenders more quickly and helped get the app back in service. Of course, hindsight is 20/20.
We actually spent the last year building a replacement for Gradient: Purdue’s “Circuit” app. In fact, Sentry has been instrumental in building this new app, and it is integrated on both the frontend and the backend. We’ve been tracking down intermittent Redis timeout errors by running load tests against our dev environment and monitoring Sentry for the frequency of that class of errors, which has been super helpful.
So thanks, Sentry, from all of us at Purdue, especially the Studio team.