API Authentication Bypass
On July 20th a customer informed us of an authentication bypass vulnerability in our API, specifically centered around our integration platform. A patch was deployed later that morning at approximately 14:04 UTC (followed up with tests). On July 22nd, after a preliminary investigation, we contacted 462 customers who had the possibility of exposure. In parallel, we continued our forensics process, as it was not yet complete. In the end, we found no evidence of any customer data exposure or the vulnerability being used in the wild.
Forensics on this incident was particularly challenging. We hope that sharing some insight into the challenges we faced will help others avoid similar situations. While this is not the entirety of our process, it covers concerns that apply more generally.
Understanding The Problem
The vulnerable code was responsible for enforcing tenant access with authentication tokens generated by our custom integrations. These tokens are used to make requests to Sentry’s Web API (not to be confused with our event ingestion APIs), specifically any request made to https://sentry.io/api/0/*. The bypass, while somewhat obscure, was also fairly trivial to exploit. If you had a valid authentication token for our integration platform, you could use it on endpoints that were outside the scope of that originating token.
These tokens worked primarily with our custom integrations, which are commonly used by organizations to develop internal applications against our API. There are a couple of core limitations of the vulnerability: you could only access other organizations that were also using a custom integration and you were limited from performing certain higher-scoped actions that require a real user.
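To make the class of bug concrete, here is a minimal sketch of the kind of tenant check involved. It is not Sentry's actual code: the `IntegrationToken` shape, the `token_can_access` helper, and the scope strings are all hypothetical, chosen only to illustrate that a valid token must also be tied to the organization it is used against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationToken:
    # Hypothetical shape: the organization the token was issued for,
    # plus the scopes it was granted.
    token_id: int
    organization_id: int
    scopes: frozenset


def token_can_access(token: IntegrationToken, organization_id: int, required_scope: str) -> bool:
    """Return True only if the token belongs to the target organization
    and carries the scope the endpoint requires."""
    if token.organization_id != organization_id:
        # Without a check like this, a valid token for one tenant
        # could be replayed against another tenant's endpoints.
        return False
    return required_scope in token.scopes
```

In this sketch the token's validity and its tenant binding are separate questions, and an endpoint is only safe when both are answered.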
Once we identify and patch a vulnerability like this, our next step is to immediately identify the full scope, and determine if there’s a breach. In this scenario that meant pulling in as much data about requests made with these tokens as we possibly could. This would help us identify suspicious behavior such as unique IPs or other signatures accessing data on multiple organizations, as well as understand which customers may be impacted. The time window to achieve all of this is very short, as we have committed to notifying our customers and compliance organizations within 48 hours of a breach.
Our challenge came into play when we realized we could not associate an entry in our access logs with the token that made the request. That meant we could not immediately distinguish valid from invalid requests: without knowing the token used for a request, we didn’t know whether it had the correct permissions. This is where the rabbit hole starts, and ultimately where we identified a number of things we need to improve going forward.
Digging into the access logs, we only had a few relevant bits of information stored for these endpoints:
With the above, we were able to extract which organizations were affected by parsing the request_uri, as well as begin searching for patterns using combinations of remote_addr, and time windows. Initially, the data set was extremely large, but because we knew the scope of the vulnerability, we first narrowed it to only customers who even had the possibility of being exposed. That list started at nearly 4,000 customers.
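As a rough illustration of that first pass, the sketch below pulls an organization slug out of each request_uri and flags any remote_addr that touched more than one organization. The URL layout in the regular expression, the row format, and the function names are assumptions for the example, not a description of our actual queries.

```python
import re
from collections import defaultdict

# Assumed URL layout: /api/0/organizations/<org slug>/...
# or /api/0/projects/<org slug>/<project slug>/...
ORG_SLUG = re.compile(r"^/api/0/(?:organizations|projects)/([^/?]+)")


def org_from_uri(request_uri):
    """Return the organization slug embedded in the URI, or None."""
    match = ORG_SLUG.match(request_uri)
    return match.group(1) if match else None


def multi_org_addresses(rows):
    """rows: iterable of dicts with 'request_uri' and 'remote_addr' keys.

    Returns {remote_addr: set of org slugs} for addresses that
    requested data from more than one organization.
    """
    orgs_by_addr = defaultdict(set)
    for row in rows:
        slug = org_from_uri(row["request_uri"])
        if slug is not None:
            orgs_by_addr[row["remote_addr"]].add(slug)
    return {addr: orgs for addr, orgs in orgs_by_addr.items() if len(orgs) > 1}
```

An address appearing here is only a lead, not proof of abuse: shared egress IPs on cloud providers can legitimately serve several customers.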
It’s worth noting that our production systems work a bit differently than most organizations. We don’t rely on systems like Splunk or Kibana for log access, but rather we store these in Clickhouse (via clicktail), which in situations like this is a huge boon. It allows us to load correlating information (such as a CSV of customers) and filter down the logs extremely quickly using a SQL-like interface.
Refining the Scope
That list was large: 4,000 organizations and tens of millions of requests. We quickly realized that, due to the way the vulnerability worked, it was only active during the time ranges in which an organization had a custom integration available. Knowing that, we were able to pull in data from another source and filter that list down even further. At this point, we had narrowed the scope to a few million requests and around 3,500 organizations.
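The time-range filter can be sketched as follows. This assumes each request has already been tagged with an organization and a timestamp, and that the other data source yields a list of (start, end) windows per organization during which a custom integration existed. The field names and the `windows_by_org` structure are hypothetical.

```python
from datetime import datetime


def in_scope(request_time, windows):
    """True if request_time falls inside any (start, end) window.

    An end of None means the integration is still installed.
    """
    return any(
        start <= request_time and (end is None or request_time < end)
        for start, end in windows
    )


def filter_requests(requests, windows_by_org):
    """Keep only requests made while the target organization
    had a custom integration available."""
    return [
        r for r in requests
        if in_scope(r["time"], windows_by_org.get(r["org"], []))
    ]
```

Organizations with no recorded window drop out entirely, which is what shrinks the organization count as well as the request count.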
There were two tasks to work in parallel:
- Cluster data in the original set to determine if there were signs of cross-customer data access
- Reduce the scope by identifying ways to shrink the set of unknown requests
Most of our learnings (and follow-up items) come from our attempts at reducing scope, so we’re going to focus on that.
Once we had the narrowed list of scoped requests, we asked ourselves how we could scientifically filter it. We saw three immediate categories of legitimate access to identify:
- Service partners that were authorized by a customer (such as Clubhouse)
- Sentry employees who assist customers with issues and have access
- Customers with access to the given account
We started by focusing on what requests we could eliminate based on details we keep elsewhere of authenticated sessions. One of Sentry’s security features is tracking of known signatures for an account. We expose some of this in your account details as a session history, and it’s generally to allow you to understand if your account has been accessed from a location you don’t recognize. In this case, we’re able to filter out Sentry staff access and access that is traceable to a user account which is a member of the organization.
Service partners were a bit trickier to identify. Most companies these days use cloud providers, which means IP spaces aren’t stable. We had already generated a large reverse DNS map to help us identify IP and host clusters (a vector we look at for attacker profiles), so we knew that a majority of the non-browser signatures were coming from cloud providers. We started looking at other ways to cluster information and were able to identify the signatures of three service providers by looking at IP space combined with the defined user-agent. For example, we identified Clubhouse by its Java implementation (surprisingly uncommon) and its limited set of IPs. We took those IPs and verified they were still in use in production post-patch, which confirmed that the requests were authorized and came from Clubhouse. This latter verification was a turning point for us.
Coming into our 2nd day of forensics, we were feeling better about our situation. We had reduced the original list down drastically. Our verification technique for service providers would prove to be hugely valuable. We knew that if a signature, which 36 hours before was making valid requests, was still making valid requests today, it was trusted. This meant it had plausibly valid credentials. We already knew these signatures were not accessing more than one account, so we eliminated them when we validated they still had access.
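That elimination step amounts to a set intersection. The sketch below assumes a signature is the tuple of organization, remote address, and user-agent, and that the post-patch logs include an HTTP status we can use to tell successful requests apart. Both are assumptions for the example, as are the field and function names.

```python
def signature(row):
    """Assumed signature: organization + remote address + user-agent."""
    return (row["org"], row["remote_addr"], row["user_agent"])


def split_by_post_patch_validity(pre_patch_rows, post_patch_rows):
    """Split pre-patch signatures into (verified, still_unknown).

    A signature is verified when the same signature is seen making
    successful (2xx) requests after the patch was deployed.
    """
    verified_after = {
        signature(r) for r in post_patch_rows if 200 <= r["status"] < 300
    }
    candidates = {signature(r) for r in pre_patch_rows}
    return candidates & verified_after, candidates - verified_after
```

Because the organization is part of the signature, a client that succeeds against one organization after the patch says nothing about its requests to another.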
We confidently reduced the list down to 462 customers. As mentioned above, we have a responsibility to notify our customers of the incident. So on Day 2 we emailed these customers and let them know that everything appeared fine, and we’d provide updates as they come in. Email sent. We kept working.
Upon reviewing the remaining list there were two patterns of data:
- What appeared to be browser sessions (based on a naive look at the user-agent)
- Server-side communications - many with unique signatures per customer
It’s important to note that because user-agents are simply user input, you can only use them to suggest behavior, not to factually confirm it.
We had already filtered out known valid signatures, so we knew we were not going to be able to validate the server-side requests on our own. We did a manual review of these datasets, and while they all appeared fine, we couldn’t scientifically eliminate them. At the same time, we were working with a number of our customers to help them verify these requests, if nothing else so they could close the incident out on their end.
During these conversations, we identified another large class of requests we could eliminate now that we had more time. Many of the signatures looking like browser sessions were actually trackable to a user account on Sentry. Those accounts were not originally filtered because they did not have access to the organization, but upon looking at historical data, we could actually match that up to previous valid access. This allowed us to reduce the list to a point where we felt satisfied with our original conclusions.
This process was quite arduous and has taught us a lot, but it has mostly taught us that we need to improve the way we manage metadata in some of our systems. The most obvious thing here is ensuring our access logs always include identifiers for authentication. We have this metadata available in Sentry (e.g. auth.type=integration, auth.token-id=[primary key]), but that same metadata is not carried over to our other systems. The tricky bit here is that raw access logs can never obtain this information, as we capture those higher up in the stack. Our solution to this is likely going to be utilizing response headers to propagate up additional metadata to the systems which ultimately capture the logs.
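One way the response-header idea could look, sketched as a plain WSGI middleware. This is an illustration under assumptions rather than our implementation: the header names `X-Auth-Type` and `X-Auth-Token-Id` and the `app.auth` environ key are hypothetical, and the application is assumed to store its authentication result in the request environ before it starts the response.

```python
def auth_metadata_middleware(app):
    """Wrap a WSGI app so authentication metadata is emitted as
    response headers that an upstream proxy can write to its access log."""

    def wrapped(environ, start_response):
        def start_with_auth(status, headers, exc_info=None):
            # Hypothetical key, set by the application's auth layer.
            auth = environ.get("app.auth")
            extra = []
            if auth:
                extra = [
                    ("X-Auth-Type", str(auth["type"])),
                    ("X-Auth-Token-Id", str(auth["token_id"])),
                ]
            return start_response(status, list(headers) + extra, exc_info)

        return app(environ, start_with_auth)

    return wrapped
```

The layer that writes the access log would then need to record these headers, and presumably strip them before the response reaches the client so that internal identifiers are not exposed.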
Some of our more direct follow-ups and takeaways that are likely applicable to others:
- We use slug in URLs, but that is mutable. We need the id to be logged. The slug is great for quickly humanizing things, but forensics requires precision.
- Authentication identifiers need to be present in every access log. We already do this for things like staff requests, or application logs, but need to carry that over everywhere.
- The ability to more easily correlate historical user access (“Jane had access to Sentry on Monday”), vs. just present-day access. This would have helped us greatly reduce the original scope.
- Improve our penetration testing program. This vulnerability was introduced right before our most recent pen test, yet it went uncaught, and it sits on what is considered a critical path.
- Improve our ability to contact customers en masse. This turned out to be one of the most complicated parts of the whole process. Marketing has a ton of tools around this, but those often don’t work well for this kind of scenario.
- Improve forensic tooling. We used a combination of production systems, clickhouse-local, and Datasette + sqlite databases for this. While those tools worked great, they still had a lot of overhead when doing simple drill-down/pivots.
- Most importantly, use this incident for education. It was easily avoidable via automated testing and peer review. We can and must do better here.
It should go without saying, but we take security very seriously at Sentry. These things happen, but we use these situations to learn and better ourselves, and hope that by sharing, you’ll find something that may apply to your business as well.