September 4, 2019

In our Sentry for Data series, we explain precisely why Sentry is the perfect tool for your data team. The present post focuses on how we used Sentry to make debugging Apache Beam easier (and faster).

Since its creation, Sentry has embraced a single vision: help all developer teams build the best software, faster. We want to give developers the information they need to resolve issues quickly, without having to dig through noisy log lines. When the code that powers data pipelines breaks, data engineers often need more context than is available in logs to solve the issue.

Sentry’s data team feels the pains of searching logs first-hand. We build pipelines that run computations (aggregations, sums, etc.) over streaming data where the data stream never actually “ends.” We chose Apache Beam as our execution framework to manipulate, shape, aggregate, and estimate data in real time. Beam provides out-of-the-box support for technologies we already use (BigQuery and PubSub), which allows the team to focus on understanding our data.

While we appreciate these features, Beam writes its errors to traditional log files, where they are easy to lose. To capture those errors without searching through logs, we integrated Sentry into Beam’s Python and Java SDKs. Hooray!

Sentry + Beam + Python

Beam’s distributed execution model makes it tricky to instrument: the Python SDK serializes our user code and uploads it for Google Dataflow to execute. As a result, Sentry code can only be injected into a few files in Beam, and the injected code has to conform to specific function signatures.

With these restrictions in mind, we injected Sentry into the ParDo class. As a consequence, the integration does not catch errors raised by classes that do not inherit from ParDo. For example, errors from any class that inherits from PTransform (and executes process functions) would not be included.
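To make the idea concrete, here is a minimal sketch of the general technique of wrapping a process function so that exceptions are reported before they propagate. It is not the integration's actual code: `report_exception`, `capture_errors`, `captured`, and the `ParseInt` class are hypothetical names used only for illustration, and the list stands in for events that would be sent to Sentry. `functools.wraps` is used so the wrapper keeps the wrapped function's name and reported signature, which matters when a framework inspects them.

```python
import functools
import inspect

# Stand-in for events that a real integration would send to Sentry (assumption).
captured = []


def report_exception(exc):
    """Hypothetical reporter: records the exception instead of sending it anywhere."""
    captured.append(exc)


def capture_errors(process):
    """Wrap a process function so exceptions are reported, then re-raised."""

    @functools.wraps(process)  # keep the original name, docstring, and signature
    def wrapper(*args, **kwargs):
        try:
            return process(*args, **kwargs)
        except Exception as exc:
            report_exception(exc)
            raise  # the pipeline still sees the failure

    return wrapper


class ParseInt:
    """Plays the role of a DoFn-like class that exposes a process method."""

    @capture_errors
    def process(self, element):
        return [int(element)]


# The wrapper reports the same signature as the function it wraps.
assert list(inspect.signature(ParseInt.process).parameters) == ["self", "element"]
```

Calling `ParseInt().process("x")` in this sketch raises `ValueError` as usual, but the exception is also appended to `captured` first. A class whose process function is never wrapped this way would fail without being reported, which mirrors the limitation described above.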

Installation

When running your pipeline, add the Beam integration to your Sentry init call.

    import sentry_sdk
    from sentry_sdk.integrations.beam import BeamIntegration

    integrations = [BeamIntegration()]
    sentry_sdk.init(dsn="YOUR DSN", integrations=integrations)

Note: Make sure that sentry_sdk is installed on all of your workers by passing the --requirements_file flag when you launch the pipeline, using the 0.11.1 release: https://github.com/getsentry/sentry-python/releases/tag/0.11.1
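As a rough sketch of what that could look like, the helpers below write a requirements file and build the launch arguments. The file name `requirements.txt`, the pin `sentry-sdk==0.11.1` (the package's PyPI name combined with the release linked above), the `DataflowRunner` value, and both helper functions are assumptions for illustration. The resulting list is meant to be passed to Beam's pipeline options or given on the command line.

```python
from pathlib import Path


def write_requirements(path="requirements.txt", pin="sentry-sdk==0.11.1"):
    """Write a one-line requirements file that pins the Sentry SDK (assumed pin)."""
    Path(path).write_text(pin + "\n")
    return str(path)


def pipeline_args(requirements_path):
    """Build launch arguments; the runner value is an assumption."""
    return [
        "--runner=DataflowRunner",
        "--requirements_file=" + requirements_path,
    ]


# Example: args = pipeline_args(write_requirements())
# The list can then be handed to the pipeline's options when it is constructed.
```

Keeping the pin in a requirements file means the workers install the same SDK version as the machine that launches the pipeline.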