Works on my machine: how we use AI to reproduce reported bugs
Sentry’s SDK teams maintain and support SDKs for a vast ecosystem of languages and frameworks. See our release registry for a source of truth. We’re currently at 159 published packages across the entire ecosystem. If you use it, we probably support it.
All of these SDKs are open source and have their own GitHub repositories that we maintain on a daily basis. And like any other open source project, we get tons of bug reports and issues on these.
In this post, I’ll talk about a Claude skill we’ve been leveraging to help make our reproduction flow smoother and reduce triage time and fatigue.
Bug triage flow
Sometimes bugs are easy to fix: the cause might be a missing null check, a missing conditional branch or some other small oversight.
Other times, they aren’t so easy for a plethora of reasons:
- Tedious setup, or “boilerplate”, just to get the environment ready
- Esoteric code paths
- Legacy versions
- Edge case interactions no one thought of
- Data races and other concurrency problems
- Forked libraries with different contracts
Boilerplate
Particularly for our SDK bugs, the boilerplate factor is quite annoying. Let’s take a recent example. To reproduce it, we would need to set up the following:
- A Python venv with the correct version
- A new Django boilerplate app with the correct version
- A Sentry SDK with the correct version
- Create a Django View that reproduces and showcases the exact problem which is applicable only to HTTPS proxies
- Run everything, trigger the view and hope that it shows the problem in question
All of this is necessary just to acknowledge that the problem the original user reported is real and replicable. Once reproduced, it’s typically much easier to roll out the actual fix.
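To give a feel for how much setup that list implies, here is a minimal sketch that only builds the commands for such a playground using uv and django-admin. The names ReproEnv and setup_commands, the project name and the version numbers in the usage note are placeholders for illustration, not the values from the actual issue, and the HTTPS proxy part is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReproEnv:
    """Versions collected from a bug report (hypothetical container)."""
    python_version: str
    django_version: str
    sdk_version: str


def setup_commands(env: ReproEnv, project: str = "repro") -> list[list[str]]:
    """Return the commands for a pinned Django + Sentry SDK playground.

    Nothing is executed here; the caller decides how to run the commands.
    """
    return [
        # Virtual environment with the reported Python version
        ["uv", "venv", "--python", env.python_version],
        # Pin Django and the SDK to the reported versions
        [
            "uv", "pip", "install",
            f"django=={env.django_version}",
            f"sentry-sdk=={env.sdk_version}",
        ],
        # Fresh Django boilerplate project in the current directory
        ["uv", "run", "django-admin", "startproject", project, "."],
    ]
```

Calling setup_commands(ReproEnv("3.12", "5.0", "2.0.0")) returns three commands that could be passed to subprocess.run one by one. The view that actually triggers the bug still has to be written by hand (or by the agent), which is where most of the remaining effort goes.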
Reproduction papertrail
Another recurring discussion within the teams was how to keep track of all these one-off boilerplate apps that we used to test SDK logic, and reproduce/fix problems.
Ideally we would have a shared repository of these apps with backlinks to the issues, but no one wanted the burden of maintaining yet another collection of apps on top of everything else we already do. Several SDK engineers had their own ad-hoc collection of apps they used for their day-to-day SDK development.
repro skill + repository
Enter LLMs. Turns out LLMs are pretty good at doing some of the tedious stuff mentioned above.
Even if they cannot get to the root of a hairy problem, they at least set up the boilerplate and give me a playground with all the correct parameters which I can move forward with, massively reducing tedium.
So I wrote up and iterated on a Claude skill that:
- Takes a GitHub issue URL as input
- Parses the SDK language, issue number
- Gathers metadata on language version, framework version, SDK version
- Makes a new directory and branch from the language/issue-number
- Attempts to create a minimal reproduction using standard tooling for the language (uv, npm, bundle, etc.)
- Tries to run the reproduction, bails out if it’s too complicated
- Writes up clear instructions for running the reproduction
- Makes a PR
- Optionally adds a backlink to the PR to the original user issue (using Claude’s AskUserQuestion tool)
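The first few steps are mechanical enough to sketch in code. The example below is only an illustration of the parsing and branch-naming steps, not how the skill does it (the skill is a Markdown file that instructs the agent). It assumes SDK repositories are named sentry-<language>, and IssueRef, parse_issue_url and the issue number in the usage note are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: SDK repositories follow the "sentry-<language>" naming scheme.
ISSUE_URL = re.compile(
    r"^https://github\.com/(?P<owner>[\w.-]+)/sentry-(?P<language>[\w.-]+)"
    r"/issues/(?P<number>\d+)/?$"
)


class IssueRef(NamedTuple):
    owner: str
    repo: str
    language: str
    number: int

    @property
    def branch(self) -> str:
        """Branch (and directory) name in the language/issue-number form."""
        return f"{self.language}/{self.number}"


def parse_issue_url(url: str) -> IssueRef:
    """Extract owner, repo, SDK language and issue number from an issue URL."""
    match = ISSUE_URL.match(url.strip())
    if match is None:
        # Bad input: fail loudly instead of guessing
        raise ValueError(f"not a recognised SDK issue URL: {url!r}")
    language = match["language"]
    return IssueRef(
        owner=match["owner"],
        repo=f"sentry-{language}",
        language=language,
        number=int(match["number"]),
    )
```

For a URL such as https://github.com/getsentry/sentry-python/issues/1234 (a made-up issue number), the branch property evaluates to python/1234. Rejecting anything that does not match mirrors the idea of giving the agent a clear way out on bad input.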
Note that we only ask the LLM to attempt a reproduction and to stop if it gets too complicated. This sort of instruction is very effective when working with agents: if we ask too much of them, they will often stumble, but if we give them an out, they’re more likely to explain the challenge than to push through it blindly.
Example run on the Python issue
Continuing with the above Python example, the skill created this reproduction. We can see that it created a minimal Django app and gave very clear instructions to run the reproduction. Using this basic setup, I was able to roll out the subsequent fix very rapidly. I probably saved a few hours of figuring out how to set up Django with an HTTPS proxy correctly and then examining how that interacts with our SDK logic.
Lessons on writing skills
Skills are just generic Markdown files, so it’s not obvious how to make them reliable and keep them from going off the rails.
Some insights I have from writing this one:
- Use CLIs to interact with other systems; here we’re using the gh CLI to perform GitHub operations
- Split out the work to be done into clear steps
- Add an Error Handling section explaining what’s not allowed and what to do with bad inputs
- Use other in-built tools such as AskUserQuestion for user input or validation
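To make the first point concrete, here is a rough sketch of the kind of gh invocations a skill might tell the agent to run, wrapped in Python only so they are easy to read and test. The helper names are hypothetical and this is not taken from the skill itself; the gh subcommands and flags used (issue view with --repo and --json, issue comment with --body) are standard gh CLI options.

```python
import subprocess


def gh_issue_view_cmd(repo: str, number: int) -> list[str]:
    """Command that fetches an issue's title, body and labels as JSON."""
    return [
        "gh", "issue", "view", str(number),
        "--repo", repo,
        "--json", "title,body,labels",
    ]


def gh_issue_comment_cmd(repo: str, number: int, body: str) -> list[str]:
    """Command that posts a comment (e.g. a backlink to the repro PR)."""
    return [
        "gh", "issue", "comment", str(number),
        "--repo", repo,
        "--body", body,
    ]


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Keeping command construction separate from execution means the commands can be inspected, or shown to a human for confirmation, before anything is posted to GitHub.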
Full automation?
We will play around with fully automating this flow on GitHub issues in the future. A major concern voiced by several engineers here is increased bot noise. We’re already drowning in bot communication on several fronts, so we want to be careful about how many of these we enable automatically. The right amount of automation in any given problem space is not always full automation, and a pair of human eyes in the right places is absolutely necessary.