Works on my machine: how we use AI to reproduce reported bugs

Neel Shah - June 8, 2026 · 4 min read

Sentry’s SDK teams maintain and support SDKs for a vast ecosystem of languages and frameworks. See our release registry for a source of truth. We’re currently at 159 published packages across the entire ecosystem. If you use it, we probably support it.

All of these SDKs are open source and have their own GitHub repositories that we maintain on a daily basis. And like any other open source project, we get tons of bug reports and issues on these.

In this post, I’ll talk about a Claude skill we’ve been leveraging to help make our reproduction flow smoother and reduce triage time and fatigue.

Bug triage flow

Sometimes bugs are easy to fix: a missing null check, a missing conditional branch, or some other small oversight.

Other times, they aren’t so easy for a plethora of reasons:

Tedious setup, or “boilerplate”, just to get the environment ready
Esoteric code paths
Legacy versions
Edge case interactions no one thought of
Data races and other concurrency problems
Forked libraries with different contracts

Boilerplate

Particularly for our SDK bugs, the boilerplate factor is quite annoying. Let’s take a recent example. To reproduce it, we would need to set up the following:

A Python venv with the correct version
A new Django boilerplate app with the correct version
A Sentry SDK with the correct version
A Django view that reproduces and showcases the exact problem, which applies only to HTTPS proxies
Run everything, trigger the view and hope that it shows the problem in question

All of this is necessary just to acknowledge that the problem the original user reported is real and replicable. Once reproduced, it’s typically much easier to roll out the actual fix.

Reproduction papertrail

Another recurring discussion within the teams was how to keep track of all these one-off boilerplate apps that we used to test SDK logic, and reproduce/fix problems.

Ideally we would have a shared repository of these apps with backlinks to the issues, but no one wanted the burden of maintaining yet another collection of apps on top of everything else we already do. Several SDK engineers had their own ad-hoc collection of apps they used for their day-to-day SDK development.

`repro` skill + repository

Enter LLMs. Turns out LLMs are pretty good at doing some of the tedious stuff mentioned above.

Even if they cannot get to the root of a hairy problem, they at least set up the boilerplate and give me a playground with all the correct parameters which I can move forward with, massively reducing tedium.

So I wrote up and iterated on a Claude skill that:

Takes a GitHub issue URL as input
Parses the SDK language and issue number
Gathers metadata on language version, framework version, SDK version
Makes a new directory and branch from the language/issue-number
Attempts to create a minimal reproduction using standard tooling for the language (uv, npm, bundle, etc.)
Tries to run the reproduction, bails out if it’s too complicated
Writes up clear instructions for running the reproduction
Makes a PR
Optionally adds a backlink to the PR to the original user issue (using Claude’s AskUserQuestion tool)
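
As a rough illustration of the first few steps above, here is a minimal Python sketch of turning an issue URL into a language and issue number, and then into a `language/issue-number` branch name. The skill itself is a set of instructions for Claude, not this code: the `REPO_TO_LANGUAGE` mapping, the `ReproTarget` type, and the `parse_issue_url` helper are all hypothetical names made up for this example, and the issue number used below is a placeholder.

```python
import re
from dataclasses import dataclass

# Hypothetical mapping from SDK repository name to a short language label.
# The naming scheme the real skill uses may differ.
REPO_TO_LANGUAGE = {
    "sentry-python": "python",
    "sentry-javascript": "javascript",
    "sentry-ruby": "ruby",
}

# Matches URLs of the form https://github.com/<owner>/<repo>/issues/<number>
ISSUE_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/issues/(?P<number>\d+)/?$"
)


@dataclass(frozen=True)
class ReproTarget:
    language: str
    issue_number: int

    @property
    def branch(self) -> str:
        # The same string doubles as the directory name in this sketch.
        return f"{self.language}/{self.issue_number}"


def parse_issue_url(url: str) -> ReproTarget:
    """Extract the SDK language and issue number, rejecting anything else."""
    match = ISSUE_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a GitHub issue URL: {url!r}")
    language = REPO_TO_LANGUAGE.get(match["repo"])
    if language is None:
        raise ValueError(f"unknown SDK repository: {match['repo']!r}")
    return ReproTarget(language=language, issue_number=int(match["number"]))


if __name__ == "__main__":
    # Placeholder issue number, not a real report.
    target = parse_issue_url("https://github.com/getsentry/sentry-python/issues/1234")
    print(target.branch)
```

Rejecting unexpected input early mirrors the "bail out" idea: a bad URL produces a clear error instead of a half-built reproduction directory.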

Note that we only ask the LLM to attempt a reproduction and to stop if it gets too complicated. This is very effective when working with agents: if we ask too much of them, they will often stumble. If we give them an out, they’re more likely to explain the challenge than to push through it blindly.

Example run on the Python issue

Continuing with the above Python example, the skill created this reproduction. We can see that it created a minimal Django app and gave very clear instructions to run the reproduction. Using this basic setup, I was able to roll out the subsequent fix very rapidly. I probably saved a few hours of figuring out how to set up Django with an HTTPS proxy correctly and then examining how that interacts with our SDK logic.

Lessons on writing skills

Skills are very generic Markdown files so it’s a bit opaque how to make them reliable and avoid having them go off the rails.

Some insights I have from writing this one:

Use CLIs to interact with other systems; here we’re using the gh CLI to perform GitHub operations
Split out the work to be done into clear steps
Add an Error Handling section explaining what’s not allowed and what to do with bad inputs
Use other in-built tools such as AskUserQuestion for user input or validation

Full automation?

We will play around with fully automating this flow on GitHub issues in the future. A major concern voiced by several engineers here is increased bot noise. We’re already drowning in bot communication on several fronts, so we want to be careful about how many of these we enable automatically. The right amount of automation in any given problem space is not always full automation, and a pair of human eyes in the right places is absolutely necessary.
