Welcome Kelly Carino

We’re excited to announce that Kelly Carino is joining the Sentry team.

Kelly joins us from Massdrop where she worked on improving the customer experience as well as scaling the support team. During her last few months at Massdrop, she completed an online course at Coding Dojo. At Sentry, she will be focusing on technical support and building out projects with the team.

React Native

We have released a dedicated React Native SDK with some awesome features. If you are using raven-js with React Native, we recommend that you switch over to our brand new react-native-sentry SDK.

With this new SDK, Sentry is now able to provide mixed stacktraces. This means that if a JavaScript call causes a crash in native code, you will see the last call from JavaScript before the crash. This also means that with the new SDK, native crashes are properly handled on iOS.

Mixed Stacktraces

Of course, if an error occurs in JavaScript, we also provide you with a useful stacktrace.

React Native Stacktrace

Since we use our powerful, native Swift SDK in the background, you will also get much more information about the device and operating system.

Device Info

If you also have an Android app, the SDK gracefully falls back to raven-js, since we currently fully support only iOS.

To start using the new SDK, see the react-native-sentry documentation.

Introducing Reprocessing for iOS

In order to provide useful, human-readable stacktraces for iOS crashes, developers have to share their app's debug symbols with Sentry. When a crash comes in, Sentry uses these debug symbols to map each memory address to the corresponding function name and line number.

For example: the unreadable 0x205d2d000 becomes ViewController.onClickFatalError(AnyObject) -> () (ViewController.swift:113)
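Under the hood, this mapping is essentially a range lookup in a symbol table. Here is a minimal, illustrative sketch in Python — the addresses, symbol names, and `symbolicate` helper are hypothetical, and real dSYM lookups also consult line tables and load-address slides:

```python
import bisect

# Hypothetical, simplified symbol table: sorted (start_address, symbol) pairs
# extracted from debug symbols.
SYMBOLS = [
    (0x205d2c000, "ViewController.viewDidLoad() -> ()"),
    (0x205d2d000, "ViewController.onClickFatalError(AnyObject) -> ()"),
    (0x205d2e400, "AppDelegate.application(_:didFinishLaunchingWithOptions:)"),
]

def symbolicate(address):
    """Map a raw memory address to the function whose range contains it."""
    starts = [start for start, _ in SYMBOLS]
    # Find the last symbol that starts at or before this address.
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return hex(address)  # no symbol covers this address
    return SYMBOLS[i][1]

print(symbolicate(0x205d2d000))
# ViewController.onClickFatalError(AnyObject) -> ()
```

Without the symbol table, all Sentry has to work with is the raw address on the left of that lookup.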

Now the user knows exactly where the app crashed and can fix the bug. But for this to work inside Sentry, there was a catch: you had to provide these symbols before a crash occurred. And not all iOS developers have access to symbols before their apps go up on the App Store.

Today we are proud to announce our new reprocessing feature. With this update, iOS events that cannot be processed due to missing debug symbols will be held from the event stream and reprocessed once the symbols have been uploaded.

Why reprocessing matters

This has two major advantages for iOS developers. The first is that this keeps noise down due to bad grouping caused by lack of information. In the past, if events were submitted before debug symbols were uploaded, you could easily end up with lots of incorrectly grouped errors that were just duplicates from older issues.
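To see why missing symbols wreck grouping, consider a toy model of issue fingerprinting. This is an illustrative sketch, not Sentry's actual grouping algorithm, and the addresses and symbol table are made up:

```python
# Illustrative sketch: here an issue's fingerprint is simply the tuple of
# its stacktrace frames.
def fingerprint(frames):
    return tuple(frames)

# The same crash reported from two app launches. ASLR loads the app at a
# different base address each launch, so the raw frame addresses differ.
crash_a = ["0x205d2d000", "0x205d2c010"]
crash_b = ["0x10f42d000", "0x10f42c010"]

# A hypothetical symbolication table built from uploaded debug symbols:
SYMBOLS = {
    "0x205d2d000": "ViewController.onClickFatalError(AnyObject) -> ()",
    "0x205d2c010": "ViewController.viewDidLoad() -> ()",
    "0x10f42d000": "ViewController.onClickFatalError(AnyObject) -> ()",
    "0x10f42c010": "ViewController.viewDidLoad() -> ()",
}

def symbolicate(frames):
    return [SYMBOLS.get(f, f) for f in frames]

# Without symbols, the two reports look like two unrelated issues:
print(fingerprint(crash_a) == fingerprint(crash_b))   # False

# With symbols, both resolve to the same frames and collapse into one issue:
print(fingerprint(symbolicate(crash_a)) == fingerprint(symbolicate(crash_b)))  # True
```

Reprocessing avoids the first case entirely by never fingerprinting an event until its symbols are available.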

The second benefit is that since we put those issues on hold temporarily, we will not send out any email notifications for them until the debug symbols are up.

While in a perfect world you would never see an event before the debug symbols are ingested, we know that this is hard to do in practice. In particular, if you have bitcode-enabled builds, it can take a long time for processing to finish on the iTunes Connect side. This means that there is a good chance that someone may start (ab)using your app before you’ve had time to upload symbols.

Enabling reprocessing

Reprocessing is disabled by default and can be turned on in the project settings. Additionally, if you have an issue that lacks debug symbols, we’ll point you to the reprocessing settings page.

Reprocessing Hint

In the project settings you can turn it on with a flip of a switch:

Reprocessing Settings

There you can also see a log of all the reasons events were not processed (for instance, missing debug symbols, or broken debug symbols being uploaded).

When debug symbols are missing, you can see this in the event stream: a red warning bar will appear, prompting you to upload the missing symbols.

Reprocessing Bar

Our command line tools have been updated to trigger reprocessing automatically. If you are using an older version of sentry-cli that does not have this automatically enabled, you can also manually trigger reprocessing from the project settings.

How it works

What happens if it’s turned on?

When the feature is enabled, all events that are missing mandatory debug symbols go into a queue and will not show up until their debug symbols are uploaded. Currently we require only a small set of debug symbols, and once all of them have been uploaded, the event is reprocessed.
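The hold-and-release flow can be sketched roughly like this. The names here (`ReprocessingQueue`, the symbol UUIDs) are hypothetical stand-ins, not Sentry's implementation:

```python
# Minimal sketch of the hold-and-reprocess flow: events missing any
# required debug symbol are held; uploading symbols releases every held
# event whose requirements are now met.
class ReprocessingQueue:
    def __init__(self):
        self.uploaded = set()   # debug symbol UUIDs we already have
        self.on_hold = []       # (event, required_uuids) pairs
        self.stream = []        # events visible in the issue stream

    def submit(self, event, required_uuids):
        missing = set(required_uuids) - self.uploaded
        if missing:
            self.on_hold.append((event, set(required_uuids)))
        else:
            self.stream.append(event)

    def upload_symbols(self, uuids):
        self.uploaded |= set(uuids)
        still_held = []
        for event, required in self.on_hold:
            if required <= self.uploaded:
                self.stream.append(event)   # reprocess: all symbols are here
            else:
                still_held.append((event, required))
        self.on_hold = still_held

q = ReprocessingQueue()
q.submit("crash-1", {"UUID-A"})
print(q.stream)        # [] -- held, UUID-A missing
q.upload_symbols({"UUID-A"})
print(q.stream)        # ['crash-1']
```

Disabling the feature corresponds to flushing `on_hold` straight into the stream, symbols or not — which is exactly why grouping suffers when you turn it off.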

What is considered a mandatory debug symbol?

This is currently implementation-defined and we might tweak it in the future. For now, you can see the list of debug symbols we deem required in your project settings once we encounter them. We recommend uploading all the debug symbols you have.

What triggers reprocessing?

There is an API you can hit to start reprocessing. By default, the sentry-cli tool triggers it automatically, but this can be disabled. You can also trigger reprocessing again from the project settings.

What about optional debug symbols?

We recommend uploading optional debug symbols first, or holding off on triggering reprocessing until all of them are up. Once an event is no longer on hold, we no longer permit reprocessing.

Can I see events on hold?

There is currently no way for you to see such events but we will investigate this option in the future.

What happens if I disable reprocessing?

If you disable reprocessing, events that are already on hold will be processed and shown to you in the stream view. Future events that come in will also show up as they arrive. This, however, might cause bad grouping because information relevant for grouping could be unavailable. We recommend against turning off reprocessing.

See also our updated Swift client

Reprocessing isn’t the only thing that’s changed for iOS. We recently pushed out a big change to our Swift client which – together with some server side changes – greatly improves the accuracy of our symbolication process. If you are not using the latest and greatest version yet, please make sure to upgrade.

Dodging S3 Downtime With Nginx and HAProxy

Like many websites and service providers, we use and depend on Amazon S3. Among other things, we use S3 as the primary data store for uploaded artifacts like JavaScript source maps and iOS debug symbols, which are a critical part of our event processing pipeline. Yesterday, S3 experienced an outage that lasted three hours, but the impact on our processing pipeline was minimal.

Last week, we set out to solve a problem we had been experiencing: fetching data out of S3 was neither as performant nor as reliable as we would have hoped. Our servers are in Dallas, while our S3 buckets are in Virginia (us-east-1). This means we see an average of 32ms ping times and 100ms for a full Layer 6 TLS handshake. Additionally, S3 throttles bandwidth to servers outside its network, which limits our ability to fetch our largest assets in a timely manner. While increasing performance was our primary goal, this project turned out to be extremely beneficial during the S3 outage and kept our processing pipeline chugging along for the duration of it.

We started off by putting together a quick S3 proxy cache in our datacenter that would cache full assets from S3, allowing us to serve them over our local network without going to Virginia and back every single time. The requirements for this experiment were:

  • minimize risk to production-facing traffic
  • avoid introducing single points of failure
  • prove the concept without increasing hardware bill
  • avoid committing changes to application code

We completed our task with two popular services and no application code required: nginx acting as an S3 cache, and HAProxy routing requests back to S3 if nginx were to fail.

The goal of the nginx server was to leverage proxy_cache and store all of our S3 assets on disk when requested, using a large 750GB disk cache to keep a very large set of actively cached data.

Our new proposed infrastructure would look like this:

The setup should look relatively familiar to anyone who has worked with service discovery. Each application server running our Sentry code has an HAProxy process on localhost. HAProxy is tasked with directing traffic to our cache server, which proxies upstream to Amazon. This configuration also allows for failover: HAProxy can talk directly to Amazon and bypass our cache without any interruption.

Configuring HAProxy is quick, taking only a handful of lines:

# Define a DNS resolver for S3
resolvers dns
  # Which nameserver do we want to use? (Google's public DNS assumed here)
  nameserver google 8.8.8.8:53
  # Cache name resolutions for 300s
  hold valid 300s

listen s3
  # Local port the application talks to (any free port works)
  bind 127.0.0.1:10000
  mode http
  # Define our s3cache upstream server (substitute your cache server's address)
  server s3cache 10.0.0.1:80 check inter 500ms rise 2 fall 3
  # With actual Amazon S3 as a backup host using our DNS resolver
  server amazon s3.amazonaws.com:443 resolvers dns ssl verify required check inter 1000ms backup

On each application server tasked with communicating to S3, the HAProxy Admin ends up displaying this:

This gives us a live view of the cache's health from the perspective of a single application server. Ironically, this also came in handy when S3 went down, clearly showing that there were exactly two hours and 57 minutes of downtime during which we could not communicate with Amazon.

Configuring nginx was a little more involved, since it's doing the heavy lifting:

http {
  gzip off;

  # Set up an external DNS resolver (Google's public DNS assumed here)
  resolver 8.8.8.8;
  resolver_timeout 5s;

  # configure cache directory with 750G and holding old objects for max 30 days
  proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=default:500m max_size=750g inactive=30d;

  server {
    # Port is a placeholder; use whatever your HAProxy upstream points at
    listen 80 default_server;

    keepalive_timeout 3600;

    location / {
      proxy_http_version 1.1;

      # Make sure we're proxying along the correct headers
      proxy_set_header Host s3.amazonaws.com;
      # Pass along Authorization credentials to upstream S3
      proxy_set_header Authorization $http_authorization;
      # Make sure we're using Keep-Alives with S3
      proxy_set_header Connection '';

      # Configure our caches
      proxy_cache default;
      # Cache all 200 OK's for 30 days
      proxy_cache_valid 200 30d;
      # Use stale cache file in all errors from upstream if we can
      proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;
      # Lock the cache so that only one request can populate it at a time
      proxy_cache_lock on;

      # Verify and reuse our SSL session for our upstream connection
      proxy_ssl_verify on;
      proxy_ssl_session_reuse on;

      # Send back a nice HTTP header to indicate what the cache status was
      add_header X-Cache-Status $upstream_cache_status always;

      # Set this to a variable instead of using an `upstream`
      # to coerce nginx into resolving via DNS instead of caching
      # it once on process boot and never updating.
      set $s3_host 's3.amazonaws.com';
      proxy_pass https://$s3_host;
    }
  }
}
This configuration allows us to use a 750GB disk cache for our S3 objects, as configured by proxy_cache_path.

This proxy service had been running for a week while we watched our bandwidth and S3 bill drop, but we got an unexpected test of it yesterday morning:

HAProxy immediately notified us when S3 started showing signs of trouble. In a random turn of events, the proxy we had implemented as a cache was now serving all of Sentry's S3 assets while the Amazon service was offline. During the full three hours, we saw some problems when users attempted to upload artifacts, but the event processing pipeline happily kept flowing.

From start to finish, our proxy cache took less than a day to implement. In the past week, we have reduced our S3 bandwidth cost by 70%, gained even more performance and reliability when processing events, and took back the majority of the eggs we had in the S3 basket.

Welcome Brett Hoerner

Brett most recently worked at Spredfast, where he helped build out and scale a high throughput data processing pipeline and search cluster. He loves deep dives on databases, distributed systems, virtual machines, and programming languages. He will be working on the Java SDK and helping out with our event processing and storage growth. Brett resides in Austin, Texas and when he’s not working he is busy pacifying his newborn son using his pinky finger.