Improve Uptime with Error Prevention and Awareness
On any given day, we use rideshare apps to get from one place to another. We check public transportation apps to see when the next train is arriving at the station. We use streaming services to watch Frasier at the end of lo-ooo-ng work days. As part of engineering teams, we use application monitoring and cloud services (like CI and cloud infrastructure) to function, so that code changes seamlessly deploy into production.
We expect the services and products we rely on to always work, without question. When these services and products don’t work and are experiencing what we call downtime, we aren’t happy — like Frasier when he finds his sherry decanter empty (gasp!).
From a business perspective, there are also downstream effects of downtime. For example, contracts typically define uptime service-level agreements (SLAs). Dropping below the agreed-upon expectations can lead to user/customer unhappiness, contract termination, legal action, and impact on revenue. Let’s not forget the time and effort spent updating customers, tickets, and status pages, as well as the engineering cost of actually fixing the issue. These fires are insidiously stressful and distract development teams from working on new features and adding value to services.
By balancing error prevention and awareness, developer teams can avoid broken code in production, decrease time to resolution, and, ultimately, improve service uptime.
Very simply, uptime is the time when something — like your application or service — is operational. Operational, however, can mean many things, resulting in a somewhat flexible definition of uptime.
Sentry, for example, may have an SLA for how long it takes an event to be processed by our system. Events processed too slowly violate this SLA, even though the system is still operational at some level.
Uptime and downtime aren’t a dichotomy so much as two ends of a spectrum of functionality. Ultimately, each business needs to define uptime, SLAs, and priorities around both.
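To make the processing-time example concrete, here is a minimal sketch of measuring SLA compliance as the share of events handled within a threshold. The 30-second threshold, the `ProcessedEvent` type, and the sample data are all assumptions made up for illustration; they are not Sentry’s actual SLA or data model.

```python
from dataclasses import dataclass

# Hypothetical SLA: an event must be processed within this many seconds.
PROCESSING_SLA_SECONDS = 30.0


@dataclass
class ProcessedEvent:
    event_id: str
    processing_seconds: float


def sla_compliance(events, threshold=PROCESSING_SLA_SECONDS):
    """Return the fraction of events processed within the threshold."""
    if not events:
        return 1.0  # no events means nothing violated the SLA
    within = sum(1 for e in events if e.processing_seconds <= threshold)
    return within / len(events)


# Made-up sample: every event was processed, but one was too slow.
events = [
    ProcessedEvent("a", 2.0),
    ProcessedEvent("b", 12.5),
    ProcessedEvent("c", 95.0),  # processed, but outside the SLA
    ProcessedEvent("d", 4.1),
]
print(f"{sla_compliance(events):.0%} of events met the SLA")
```

In this sample the system never stopped working, yet it only partially met its SLA, which is the sense in which uptime is a spectrum.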
Uptime expectations led to measuring uptime in nines. Each additional nine is exponentially harder to achieve, because it allows only one-tenth of the downtime of the one before, and the count of nines is used as a grading scale for reliability (e.g., A, B-, C).
One nine, or 90% uptime, equates to 36.5 days per year of downtime. While 90% uptime doesn’t sound bad, in reality, this service would not be considered reliable. No one wants a service to be down one out of every ten times they use it, even if that service is free. Imagine sitting down to watch Frasier with a glass of wine, only to have your streaming service hit you with a 500.
Two nines, or 99% uptime, equals 3.65 days of downtime per year. While relatively easy to achieve, 99% uptime is not necessarily a good look for services. Heavy users notice the frequency of downtime (and tell their radio show listeners, or rather their social media followers, which may ultimately impact your bottom line).
Three nines, or 99.9% uptime, is roughly 8.77 hours per year of downtime. Most customers expect this level of uptime; three nines is now the standard. At this point, protections are needed to prevent broken code from going out. If broken code does ship, developers need immediate alerts so they can stabilize it ASAP.
Four nines, or 99.99% uptime, equals 52.6 minutes of downtime per year. This level of uptime is extremely hard to achieve — one bad outage can easily jeopardize the fourth nine.
Five nines, or 99.999% uptime, is a mere 5.26 minutes of downtime per year. Unsurprisingly, five nines is the gold standard and requires engineering excellence with regard to monitoring and stabilizing. Very few services achieve 99.999% uptime.
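The downtime figures above follow from simple arithmetic: each nine leaves one-tenth of the previous downtime budget. Here is a small sketch that reproduces them, assuming an average year of 365.25 days (an assumption of this example; using 365 days shifts the results slightly). The function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, leap years included


def downtime_hours_per_year(nines):
    """Allowed downtime per year, in hours, for a given number of nines."""
    unavailable_fraction = 10 ** -nines  # 1 nine -> 0.1, 2 nines -> 0.01, ...
    return HOURS_PER_YEAR * unavailable_fraction


def describe(hours):
    """Format a duration in the largest sensible unit."""
    if hours >= 24:
        return f"{hours / 24:.1f} days"
    if hours >= 1:
        return f"{hours:.2f} hours"
    return f"{hours * 60:.2f} minutes"


for nines in range(1, 6):
    uptime_percent = 100 * (1 - 10 ** -nines)
    print(f"{nines} nine(s): {uptime_percent:.3f}% uptime, "
          f"{describe(downtime_hours_per_year(nines))} of downtime per year")
```

The same function also works as a budget check: divide the downtime already spent this year by `downtime_hours_per_year(n)` to see how much of the target’s allowance remains.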
One nine is relatively simple to achieve with competent engineers, manual and automated testing, and on-call engineers for managing production incidents. Subsequent nines require more validation and automation. Developer teams can no longer rely on manual processes and feedback, because human intervention isn’t fast enough: it takes minutes or hours to receive an alert, understand it, and then manually roll back or forward. Companies with four and five nines also have a carefully cultivated engineering culture that encourages high quality and accountability, as well as a seamless incident management process that stabilizes code and user experience ASAP.
It’s worth noting that not all companies need five nines. Near-perfect uptime isn’t always cost-efficient, nor does it always make sense. For example, if revenue is not negatively impacted by 99% uptime, then two or three nines is a good goal. Each additional nine incurs exponentially more effort and cost.
Instead, each organization can focus on defining the trade-off that makes sense for it, while minimizing the scope of impact and resolving issues quickly when they occur.
Code validation and checks contribute to cleaner code, which prevents errors in production and lengthens the mean time between failures.
Testing is the first step toward detecting broken code and features early in the development cycle, so that they are never merged and deployed. Always test code prior to merging.
Again, three, four, and five nines are not achievable with manual tests alone. To reach this level of uptime, automated testing is necessary. While automated testing often requires up-front work, like writing unit tests, it saves time and increases efficiency, because a failing unit test shows that a specific code block is not functioning as expected.
Testing also provides an opportunity to learn from previous mistakes. With every issue, development teams should pose the question, “Could we have written a test that would have prevented this broken feature or outage?”
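As an illustration of both points, here is a minimal sketch using Python’s standard `unittest` module. The `error_rate` function and the zero-traffic bug it guards against are hypothetical, invented for this example. The last test is the kind a team might add after an incident so that the same mistake cannot be merged again.

```python
import unittest


def error_rate(error_count, request_count):
    """Hypothetical function under test: fraction of requests that failed."""
    if error_count < 0 or request_count < 0:
        raise ValueError("counts must be non-negative")
    if request_count == 0:
        # Guard for a (hypothetical) past bug: dividing by zero with no traffic.
        return 0.0
    return error_count / request_count


class ErrorRateTest(unittest.TestCase):
    def test_typical_rate(self):
        self.assertAlmostEqual(error_rate(5, 100), 0.05)

    def test_rejects_negative_counts(self):
        with self.assertRaises(ValueError):
            error_rate(-1, 100)

    def test_no_traffic_does_not_crash(self):
        # Regression test written after the (hypothetical) outage.
        self.assertEqual(error_rate(0, 0), 0.0)


# Run the suite programmatically; in CI this would more typically be
# `python -m unittest`, configured to block the merge on any failure.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ErrorRateTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

If someone later removes the zero-traffic guard, `test_no_traffic_does_not_crash` fails and points directly at the broken code block.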
A canary deployment directs a small percentage of live traffic (e.g., 1%) to the new release, which keeps errors from reaching all end users at once. Once the team is confident in the canary, additional traffic is directed to it, with the percentage increasing toward 100%. However, canary deploys aren’t feasible for all scenarios. For example, you can’t always canary a database migration.
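The routing and ramp-up logic can be sketched in a few lines. This is only an illustration under stated assumptions: the ramp schedule, the 1% error-rate ceiling, and all function names are hypothetical, and real deployments usually delegate this to a load balancer or deployment tool rather than application code.

```python
import hashlib

# Hypothetical ramp schedule: percent of traffic sent to the canary at each step.
RAMP_STEPS = [1, 5, 25, 50, 100]


def bucket_for(user_id):
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def routes_to_canary(user_id, canary_percent):
    """Return True if this user should be served by the canary release."""
    return bucket_for(user_id) < canary_percent


def next_canary_percent(current_percent, canary_error_rate, max_error_rate=0.01):
    """Advance the ramp if the canary is healthy; otherwise roll back to 0."""
    if canary_error_rate > max_error_rate:
        return 0
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return 100


users = [f"user-{n}" for n in range(10_000)]
share = sum(routes_to_canary(u, 1) for u in users) / len(users)
print(f"share of users on the canary at 1%: {share:.2%}")
print("healthy canary moves to:", next_canary_percent(1, canary_error_rate=0.002))
print("unhealthy canary moves to:", next_canary_percent(5, canary_error_rate=0.08))
```

Hashing the user ID, rather than picking randomly per request, keeps each user on the same version throughout the rollout, and a user who is in the canary at 5% stays in it at 25%.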
Awareness around errors — like when they occur, whom they impact, and where in the code they happen — via application monitoring minimizes the mean time to recovery.
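Prevention and awareness pull on two different levers of the same quantity. One common approximation (an addition here, not something stated in this article) estimates availability as MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recovery. The numbers below are made up purely to show the effect of each lever.

```python
def availability(mtbf_hours, mttr_hours):
    """Approximate availability from mean time between failures and to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up baseline: a failure roughly every 500 hours, 2 hours to recover.
baseline = availability(mtbf_hours=500, mttr_hours=2)
# Prevention (testing, canaries) doubles the time between failures.
fewer_failures = availability(mtbf_hours=1000, mttr_hours=2)
# Awareness (monitoring, alerting) halves the time to recover.
faster_recovery = availability(mtbf_hours=500, mttr_hours=1)

for label, value in [
    ("baseline", baseline),
    ("fewer failures", fewer_failures),
    ("faster recovery", faster_recovery),
]:
    print(f"{label}: {value:.4%}")
```

Under this approximation, halving recovery time improves availability about as much as doubling the time between failures, which is why both halves of the balance matter.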
Application monitoring provides developer teams with real-time insights into their service’s health, alerting them to issues immediately. Once alerted, developer teams can use source code management information alongside monitoring to pinpoint which change may have caused the problem, and, more importantly, get the issue into the hands of the developer best equipped to resolve it. Then, if necessary, the developer can roll forward and fix the issue.
Monitoring metrics provide a 10,000-foot view of the state of the code and system. However, a change in these metrics only qualifies as an incident in specific cases. Knowing that errors are occurring isn’t enough. Instead, developers need deep context, including whom an error is impacting, what product flows are affected or compromised, which version of the service an error is affecting, what change(s) or commit(s) caused an issue, and who is the best person to fix the problem.
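To show what that context buys you, here is a sketch that turns raw error events into an answer to “who is affected, where, on which release, and who should look at it?” The event fields, release-to-commit mapping, and names are all hypothetical; this is plain Python over made-up data, not the API of any monitoring product.

```python
from collections import defaultdict

# Hypothetical error events, shaped the way a monitoring tool might report them.
events = [
    {"user": "u1", "release": "2.4.0", "flow": "checkout", "error": "TypeError"},
    {"user": "u2", "release": "2.4.0", "flow": "checkout", "error": "TypeError"},
    {"user": "u2", "release": "2.4.0", "flow": "checkout", "error": "TypeError"},
    {"user": "u3", "release": "2.3.9", "flow": "login", "error": "TimeoutError"},
]

# Hypothetical mapping from a release to the commits (and authors) it shipped.
release_commits = {
    "2.4.0": [{"sha": "abc123", "author": "niles"}],
    "2.3.9": [{"sha": "def456", "author": "daphne"}],
}


def summarize(events):
    """Group events by (release, flow, error); count events and distinct users."""
    groups = defaultdict(lambda: {"events": 0, "users": set()})
    for event in events:
        key = (event["release"], event["flow"], event["error"])
        groups[key]["events"] += 1
        groups[key]["users"].add(event["user"])
    return groups


def worst_group(groups):
    """Pick the group affecting the most distinct users."""
    return max(groups.items(), key=lambda item: len(item[1]["users"]))


(release, flow, error), stats = worst_group(summarize(events))
suspects = release_commits.get(release, [])
print(f"{error} in {flow} on release {release}: "
      f"{stats['events']} events, {len(stats['users'])} users affected")
for commit in suspects:
    print(f"suspect commit {commit['sha']} by {commit['author']}")
```

Ranking by distinct users rather than raw event count is a design choice: one user stuck in a retry loop should not outrank an error that blocks many people.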
Outages happen, and there is no surefire way to avoid them altogether. What we can do instead is learn from the outages and figure out how to resolve them quickly. Simply put: proper implementation and utilization of code validation and monitoring translate to better uptime.
Do you hear the downtime blues a-callin’? See how Sentry can improve your service’s uptime.