Dodging S3 Downtime With Nginx and HAProxy

Like many websites and service providers, we use and depend on Amazon S3. Among other things, we primarily use S3 as a data store for uploaded artifacts like JavaScript source maps and iOS debug symbols; which are a critical part in our event processing pipeline. Yesterday, S3 experienced an outage that lasted 3 hours, but the impact on our processing pipeline was very minimal.

Last week, we set off on solving one potential problem that we were experiencing: fetching data out of S3 was neither as performant nor reliable as we would hope. Our servers are in Dallas, while our S3 buckets are in Virginia (us-east-1). This means that we see an average of 32ms ping times, and 100ms for a full Layer 6 TLS handshake. Additionally, S3 throttles bandwidth to servers outside of their network, which limits the ability for us to fetch our largest assets in a timely manner. While increasing performance was our primary goal, this project turned out to be extremely beneficial during the S3 outage and kept our processing pipeline chugging along for the duration of it.

We started off by putting together a quick S3 proxy cache that lived in our datacenter that would cache full assets from S3, allowing us to serve them on our local network without going to Virginia and back every single time. The requirements of this experiment were:

  • minimize risk to production-facing traffic
  • avoid introducing single points of failure
  • prove the concept without increasing hardware bill
  • avoid committing changes to application code

We completed our task with two popular services, no application code required. We used nginx as an S3 cache, while using HAProxy to route requests back to S3 if nginx were to fail.

The goal of the nginx server was to leverage the proxy_cache and store all of our S3 assets on disk when requested. We wanted to leverage a large 750GB disk cache and keep a very large set of actively cached data.

Our new proposed infrastructure would look like this:

The setup should look relatively familiar for anyone who has worked with service discovery. Each application server that’s running our Sentry code has an HAProxy process running on localhost. HAProxy is tasked with directing traffic to our cache server, which will proxy upstream to Amazon. This configuration also allows for a failover to occur, allowing HAProxy to talk directly to Amazon and bypass our cache without any interruption.

Configuring HAProxy is quick, only taking eight lines:

# Define a DNS resolver for S3
resolvers dns
  # Which nameserver do we want to use?
  nameserver google
  # Cache name resolutions for 300s
  hold valid 300s

listen s3
  mode http
  # Define our s3cache upstream server
  server s3cache check inter 500ms rise 2 fall 3
  # With actual Amazon S3 as a backup host using our DNS resolver
  server amazon resolvers dns ssl verify required check inter 1000ms backup

On each application server tasked with communicating to S3, the HAProxy Admin ends up displaying this:

This gives us a live view of the cache’s health from the perspective of a single application server. Ironically, this also came in handy when S3 went down, clearly depicting that there was exactly two hours and 57 minutes of downtime during which we could not communicate with Amazon.

Configuring nginx was a little bit more involving since it’s doing the heavy lifting:

http {
  gzip off;

  # need to setup external DNS resolver
  resolver_timeout 5s;

  # configure cache directory with 750G and holding old objects for max 30 days
  proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=default:500m max_size=750g inactive=30d;

  server {
    listen default_server;

    keepalive_timeout 3600;

    location / {
      proxy_http_version 1.1;

      # Make sure we're proxying along the correct headers
      proxy_set_header Host;
      # Pass along Authorization credentials to upstream S3
      proxy_set_header Authorization $http_authorization;
      # Make sure we're using Keep-Alives with S3
      proxy_set_header Connection '';

      # Configure out caches
      proxy_cache default;
      # Cache all 200 OK's for 30 days
      proxy_cache_valid 200 30d;
      # Use stale cache file in all errors from upstream if we can
      proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;
      # Lock the cache so that only one request can populate it at a time
      proxy_cache_lock on;

      # Verify and reuse our SSL session for our upstream connection
      proxy_ssl_verify on;
      proxy_ssl_session_reuse on;

      # Set back a nice HTTP Header to indicate what the cache status was
      add_header X-Cache-Status $upstream_cache_status always;

      # Set this to a variable instead of using an `upstream`
      # to coerce nginx into resolving as DNS instead of caching
      # it once on process boot and never updating.
      set $s3_host '';
      proxy_pass https://$s3_host;

This configuration is allowing us to use a 750GB disk cache for our S3 objects as configured by proxy_cache_path.

This proxy service had been running for a week while we watched our bandwidth and S3 bill drop, but we had an unexpected exchange yesterday morning:

HAProxy had immediately notified us when S3 started showing signs of trouble. In a random turn of events, the proxy that we had implemented to serve as a cache was now serving all of Sentry’s S3 assets while the Amazon service was offline. During the full three hours, we exhibited some problems when users attempted to upload artifacts, but the event processing pipeline happily kept flowing.

From start to finish, our proxy cache took less than a day to implement. In this past week, we have reduced our S3 bandwidth cost by 70%, gained even more performance and reliability when processing events, and took back the majority of the eggs we had the S3 basket.