
What is Crash Reporting?

Crash reporting is a critical programming best practice. However, if you’ve never been exposed to the concept before, it can be tough to understand how it works and why it’s valuable. Here is how we look at crash reporting at Sentry.

What is a crash?

When most people hear the term “crash,” they picture a desktop application that abruptly closes without warning. On the web, however, a crash sometimes takes the form of an unresponsive website or a server that returns an error. In well-built applications, crashes often aren’t even clearly visible to end users.

In the context of Sentry, a crash is an unexpected, non-ideal event that would cause your code to stop functioning, regardless of whether it actually causes a true crash to happen. Essentially, Sentry is used for every manner of error monitoring.

For example, take this Python code:

while True:
    1 / 0  # will raise ZeroDivisionError

Because this code does not include any exception handling, the resulting ZeroDivisionError goes unhandled and the program terminates with a traceback. This is the most familiar kind of crash: the program simply stops functioning.

A real application might handle the Python error like this:

import logging

while True:
    try:
        1 / 0
    except Exception as exc:
        logging.exception(exc)  # log the error with its traceback and keep running

Because the error is handled gracefully, the application continues to run. We still consider this a crash, because without the handler it would have stopped the application.

Handling crashes

It is common to use frameworks to build applications. Most frameworks include some form of basic crash reporting that resembles our try/catch/log example above. This pattern ensures that when crashes happen, the application handles them without failing and continues to run. It also means that it isn’t possible to capture these crashes at a high level with a try/catch strategy, because the errors do not bubble up past the framework’s error handler.

For example, this Python code is using a library:

def run():
    while True:
        try:
            do_work()
        except Exception as exc:
            logging.exception(exc)

And this generic implementation of Sentry is waiting to catch crashes:

try:
    run()
except Exception as exc:
    sentry.capture_crash(exc)

While this code would catch an error if one ever escaped run(), the framework will most likely capture the error and log it first, bypassing our crash handler completely.
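To make the problem concrete, here is a small runnable sketch. The names framework_run, do_work, and captured are hypothetical stand-ins, and a plain list takes the place of a real Sentry client.

```python
import logging

captured = []  # stand-in for a crash reporter; not a real Sentry client


def do_work():
    raise ValueError("something broke")


def framework_run(iterations=3):
    # Hypothetical framework loop: it logs every error and keeps going,
    # so nothing propagates to the caller.
    for _ in range(iterations):
        try:
            do_work()
        except Exception:
            logging.exception("framework swallowed an error")


try:
    framework_run()
except Exception as exc:
    captured.append(exc)  # never reached: the framework handled it first
```

After this runs, captured is still empty even though do_work failed three times, which is exactly why the outer handler alone is not enough.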

Deeper Instrumentation

Since we can’t rely on native crash reporting, we need to look for more appropriate ways to capture relevant errors. In web applications this is commonly done with middleware. Middleware is a component in a stack which executes around a request’s lifecycle. For example:

def on_web_request(request):
    for middleware in middleware_list:
        middleware(request)
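Because middleware runs around the request, it can also wrap the handler in its own error handling. The following is a minimal sketch of that idea, not any particular framework's API; crash_reporting_middleware, view, and reported are names invented for illustration.

```python
def crash_reporting_middleware(handler, report):
    """Wrap a request handler so unhandled errors are reported, then re-raised."""
    def wrapped(request):
        try:
            return handler(request)
        except Exception as exc:
            report(exc)  # hand the error to the reporter
            raise        # let the framework's own handling continue
    return wrapped


def view(request):
    # Hypothetical handler: fails for one specific path.
    if request == "/boom":
        raise RuntimeError("view failed")
    return "ok"


reported = []  # stand-in for a crash reporting client
app = crash_reporting_middleware(view, reported.append)
```

Re-raising after reporting keeps the framework's normal error response intact while still giving the reporter a copy of the exception.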

Many frameworks provide unhandled exception middleware. The Python framework Django uses process_exception for this purpose:

class MyMiddleware(object):
    def process_exception(self, request, exception):
        pass

Handling a Django error this way means we can capture exceptions that happen when processing a request without changing any code in our own application or within the framework.
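As a rough sketch, a middleware class in Django's newer callable style could forward the exception to a reporter from process_exception. Here report_crash and reported are hypothetical placeholders for whatever crash reporting client you use.

```python
reported = []  # stand-in for a crash reporting client


def report_crash(exc):
    # Hypothetical reporter; swap in your crash reporting client here.
    reported.append(exc)


class CrashReportingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_exception(self, request, exception):
        report_crash(exception)
        return None  # returning None lets Django continue its default handling
```

Returning None means the middleware only observes the error; Django still produces its usual error response.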

Unfortunately, it might not capture some situations such as errors thrown from within other middleware. To do that, we need hooks which provide better guarantees. Fortunately, many frameworks also provide a native error handler. Django uses something called a “signal” that allows us to register a callback for whenever it encounters any exception:

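import sys
from django.core.signals import got_request_exception

def handle_error(sender, request=None, **kwargs):
    # The signal does not pass the exception itself; read it from sys.exc_info()
    sentry.capture_crash(sys.exc_info()[1])

got_request_exception.connect(handle_error)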

The cases where you need such a guarantee may be rare, but they are extremely important nonetheless.

Passthru

Frameworks generally allow us to capture anything that looks like a crash, but it’s common to need to capture more. We typically want to try...catch errors and report them like crashes while continuing to function smoothly at a much higher scope. However, if we catch errors before they bubble up, we have to send them to the reporter ourselves. This is called passthru.

The most common workaround for this is to embed a crash handler:

def handle_web_request(request):
    try:
        stripe.collect_all_the_money()
    except OutOfFunds as exc:
        sentry.capture_crash(exc)

This works just as you’d expect, and there’s nothing wrong with it. However, it does lock you into a specific API (capture_crash), which could cause issues as you grow. To solve this, most of our Sentry SDKs provide abstractions that work through logging, allowing you to call the native logging.error and have that bubble up as if it were a crash.
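To illustrate how a logging-based abstraction can work, here is a minimal sketch built on the standard library's logging.Handler. It is not the Sentry SDK's actual implementation; CrashReportHandler, the "myapp" logger name, and the reported list are assumptions made for the example.

```python
import logging


class CrashReportHandler(logging.Handler):
    """Forward ERROR-and-above log records to a reporting callable."""

    def __init__(self, report):
        super().__init__(level=logging.ERROR)
        self.report = report

    def emit(self, record):
        self.report(record)


reported = []  # stand-in for a crash reporting client
logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)
logger.addHandler(CrashReportHandler(reported.append))

logger.debug("GET /users")      # below ERROR: not forwarded
logger.error("payment failed")  # forwarded to the reporter
```

Application code only ever calls the logger, so the reporting backend can be swapped by changing which handler is attached.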

Abstraction through Logging

Sentry is built on top of best practices like logging. However, this can cause confusion and signal-to-noise issues if you don’t understand the difference between logging and error reporting.

Throughout your application you are likely logging many kinds of events. While these are commonly errors, you are probably logging a lot of debug information as well. This is the fundamental reason why logging has levels (e.g., debug, warning, error). When we abstract our crash reporting through logs, the challenge is to differentiate between actual errors and debug noise.

For example, let’s say our request handler looks like this:

def handle_web_request(request):
    logging.debug('{ip} {method} {path}'.format(
        ip=request.env['REMOTE_ADDR'],
        method=request.method,
        path=request.path,
    ))

    try:
        do_something_crazy()
    except Exception as exc:
        logging.error(exc)

In this example, it’s easy enough to configure our logging handlers to avoid debug events and only include exception events. However, issues begin if we forget the difference between crash reporting and logging:

def do_something_crazy():
    user_list = rpc.call('users.list')
    if len(user_list) == 0:
        logging.error('no users were returned')
    return user_list

In the example above, logging.error('no users were returned') is a useful message for debugging, but it is not a crash. It can quickly become noise when reporting and aggregating crashes.

Signal vs Noise

Since there is no machine-capable context provided beyond the log level that can reduce the noise, our alternative is to build a new abstraction we can explicitly send errors to.

def handle_web_request(request):
    try:
        do_something_crazy()
    except Exception as exc:
        report_exception(exc)


def report_exception(exc):
    logging.error(exc)
    sentry.capture_crash(exc)

In the example above, the noise decision is made in the report_exception abstraction rather than at each logging call. This gives us a stable internal API for reporting exceptions, so we can decide explicitly what is reported upstream while still logging everything locally. It also keeps the application flexible if Sentry is later removed or if what we capture changes:

def report_exception(exc):
    logging.error(exc)
    if not isinstance(exc, DownForMaintenance):
        sentry.capture_crash(exc)
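For a version you can run without a Sentry client, the sketch below replaces the upstream call with a plain list. DownForMaintenance is defined here only for illustration, and IGNORED and sent_upstream are hypothetical names.

```python
import logging


class DownForMaintenance(Exception):
    """Hypothetical expected error that should not be reported upstream."""


IGNORED = (DownForMaintenance,)  # exception types treated as noise
sent_upstream = []               # stand-in for a crash reporting client


def report_exception(exc):
    logging.error(exc)  # always log locally
    if not isinstance(exc, IGNORED):
        sent_upstream.append(exc)  # only unexpected errors go upstream


report_exception(DownForMaintenance("back at 02:00"))
report_exception(ValueError("bad input"))
```

Keeping the ignored types in one tuple means the reporting policy can change without touching any call site.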

In Closing

Ultimately, crash reporting is about safely catching and reporting application-breaking events using a thoughtful combination of platform handlers, framework handlers, and judicious use of logging.

Whether you want to debug JavaScript, do Python error tracking, or handle an obscure PHP exception, we’ll be working hard to provide the best possible experience for you and your team!

Crash Bandicoot illustration by Uzura Edge.
