Using LLMs to find Python C-extension bugs
By Jake Edge
April 21, 2026
The open-source world is currently awash in
reports of LLM-discovered bugs and vulnerabilities, which makes for a lot more
work for maintainers, but many of the current crop are being reported
responsibly with an eye toward minimizing that impact. A recent report
on an effort to systematically find bugs in Python extensions
written in C has followed that approach. Hobbyist Daniel Diniz used Claude
Code to find more than 500 bugs of various sorts across nearly a million
lines of code in 44 extensions; he has been working with maintainers to get
fixes upstream, and his methodology serves as a great example of how to keep
the human in the loop—and the maintainers out of burnout—when employing LLMs.
The numbers are fairly eye-opening: “575+ confirmed bugs (~10-15% false
positive rate after review, ~140 reproduced from Python) and fixes already
merged in 14 projects”. The types of the bugs range widely: “from
hard crashes and memory corruption to correctness issues and spec
violations”. Meanwhile, Diniz would like to work with maintainers to
make the effort “more useful and scalable for maintainers”; the goal
is to provide high-quality reports of “a large class of non-trivial
bugs” that are difficult to find manually.
To do that, Diniz created a Claude Code plugin, cext-review-toolkit,
that is tuned for Python-specific problems that might be found in C
extensions, such as problems with reference counts, with handling of the global interpreter lock
(GIL), and with exception state. It uses “13 specialized analysis agents analyzing the C extension source code in parallel, with each agent targeting a different bug class”.
Results
The lengthy report is worth reading in its entirety, but we will highlight
a few parts of it here. The tool found
lots of bugs, as noted, many of which resulted in bug reports and pull
requests (PRs). There are lots of links to both for more than a dozen
different C extension projects, including Cython, Guppy 3, regex,
Pillow, and more. The Guppy
3 maintainer, YiFei Zhu, was highlighted for digging into the extensive
report for that project, fixing 24 of 30 issues found, and finding
“additional bugs the tool missed”. In addition, the feedback
provided in the umbrella issue
for the findings was “invaluable”, leading to improvements to the
tools to reduce false positives.
The report describes how
the tool and process work: the agents are run for a project, the
findings are reviewed, pure-Python reproducers are created when possible,
and then a report is shared with the maintainers via a secret GitHub gist.
There is another document that describes techniques
for creating reproducers in Python and the report itself describes
the specific types of bugs targeted by the agents.
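As a rough illustration of what a pure-Python reproducer for one common bug class can look like (this sketch is not taken from Diniz's documents; the `leaky()` function here is a hypothetical pure-Python stand-in for a buggy C function), a reference leak can often be demonstrated with nothing more than `sys.getrefcount()`:

```python
import sys

def leaky(obj, _cache=[]):
    # Hypothetical stand-in for a C-extension function that takes a
    # new reference to its argument but never releases it with
    # Py_DECREF(); the module-lifetime cache simulates the held reference.
    _cache.append(obj)

x = object()
before = sys.getrefcount(x)
for _ in range(100):
    leaky(x)
leaked = sys.getrefcount(x) - before
print(leaked)  # → 100: each call held on to one extra reference
```

A real reproducer would call into the suspect extension instead of `leaky()`, but the measurement pattern (snapshot the count, exercise the code path repeatedly, check the delta) is the same.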
More importantly, given the widespread problems with maintainers being
buried under slop bug reports and PRs, Diniz is clearly trying to ensure
that his work is worthwhile to the projects:
Reports like these can be time and energy-intensive for maintainers to
investigate. Historically, automated bug-finding tools have produced far
more false positives than useful information, and AI can make those false
positives look incredibly convincing. […]

When a maintainer points out a false positive, I immediately update
the agents’ prompts so that specific pattern is avoided in the future.

Beyond polishing the tools, I try to communicate in a non-invasive, helpful
manner. The maintainer always holds the reins: I ask them how they prefer
to receive the information (an umbrella issue? individual issues? direct
PRs? or do nothing at all) and let them decide exactly what to do with the
findings.
There is more to the report, including an
example of a bug and reproducer, a look at things
that did not work, and so on. He ended with a set
of questions for the community about whether it is useful, how to
improve the tools and reports, and ideas for future tools. He mentions
several other projects he is working on, such as an analysis tool aimed at C
extensions with regard to free-threaded Python and another tool to
analyze the CPython source code.
Reaction
The reaction has been quite positive—no surprise—with a few Python
developers and maintainers
popping up to talk about the experience and to suggest ideas for further
refinements. James Parrott wondered
about the number of bugs that would be eliminated if Rust had been used
instead. Cython maintainer David Woods thought
that Rust could eliminate things like reference-counting problems, but
probably not the exception-handling bugs that were prevalent in the report
for Cython. Diniz prompted
Claude Code with the Rust question; it estimated that 60-70% of the bugs would not have been prevented by Rust, though
Diniz cautioned that “given LLM’s troubles with numbers and estimates, I
wouldn’t trust the percentages too much”. But even the broad
categorization may be suspect, as Matthias Urlichs said
that he thought Rust could prevent more types of problems “if the Rust API is designed safely (in the Rust sense) instead of literally following the C API”.
Parrott also suggested using the GitHub Actions system to
reproduce the bugs. That would improve
the tool’s reports, which are less than ideal for him: “I don’t want to
have to read a huge machine generated report and work out what’s what.”
Diniz was appreciative
of the suggestions and thought that he could implement them relatively
easily. In particular, customizing reports is already on his radar: “I’d like to tailor the reports to what maintainers need, some like having reproducers and suggested fixes, others would prefer just a short description and code locations.”
Eric Soroos, one of the Pillow maintainers, thought
it was “one of the better sets of reports that we’ve gotten about
potential security/correctness issues”. He did note that the
coverage was incomplete, as he spotted similar bugs in related functions
that were not found. Some of the bugs were difficult to reproduce because
they required a memory-allocation failure to occur in a specific place,
leading to a tooling suggestion:
It would be interesting as a test run to have a fuzzer that used coverage guidance to fail mallocs (or c-api python methods) to test the error handling in those cases. It would need to run under valgrind to catch memory leaks or invalid accesses. This could give better code coverage for the repetitive if(ptr==null) {free everything allocated in the function} c level error handling.
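The C-level idea behind Soroos's suggestion can be sketched in pure Python (the `FailNth` injector and `build_table()` function below are hypothetical illustrations, not part of any proposed tool): systematically fail each allocation site in turn and verify that the error paths clean up what was already allocated.

```python
class FailNth:
    """Raise MemoryError on the nth allocation-like call.

    A pure-Python analogue of failing malloc() at each call site in
    turn to exercise the error-handling paths.
    """
    def __init__(self, n):
        self.n, self.count = n, 0

    def alloc(self, size):
        self.count += 1
        if self.count == self.n:
            raise MemoryError("injected failure at allocation %d" % self.n)
        return bytearray(size)

def build_table(allocator):
    # Hypothetical function with several allocation points; on failure
    # it must release everything allocated so far, mirroring the
    # repetitive "if (ptr == NULL) { free everything }" C pattern.
    rows = []
    try:
        for _ in range(3):
            rows.append(allocator.alloc(64))
    except MemoryError:
        rows.clear()  # the "free everything allocated so far" path
        raise
    return rows

# Fail each of the three allocation points in turn.
for n in range(1, 4):
    try:
        build_table(FailNth(n))
    except MemoryError as e:
        print(e)  # one line per injected failure
```

In C, the equivalent harness would hook the allocator (and run under Valgrind, as Soroos notes) rather than wrapping it, but the enumeration of failure points is the same.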
The idea was met with approval, so Soroos expanded
on it some later in the thread.
The severity of the bugs being found, and whether they are worth the
maintainer attention needed to fix them, may also factor into the question
about the reports,
as Maurycy Pawłowski-Wieroński noted.
He had tried using Diniz’s LLM tool for CPython and had mixed results, in
part because some of the bugs are only reproducible in ways that users are
unlikely to ever hit:
Unless the issue is critical (even if perfectly reproducible), many fixes are just distracting. Maintainers have their own projects, plans, schedules etc., and some pathological refleak is not really that important. I believe that such PRs used to make it in the past, because they were seen as an investment (education) in a potential maintainer, a future colleague. Now, it’s “Contributor” badge hunting.
Diniz had a seemingly characteristic, thoughtful
reply, agreeing that “not all findings are worth fixing”.
Maintainers will draw their own lines of what warrants a fix, so he is not
in a position to decide which bugs merit addressing. “The best I can do
is offer a listing of what the tools find and let them decide what to
fix.” He said that so far he has not gotten much feedback on whether
“tiny PRs targeting nits, leaks, etc.” are valuable or not, but he
is open to discussing it.
This issue is likely to recur. Finding and fixing
memory-allocation-failure handling, for example, is certainly important,
but it may well not be as important as other things that maintainers are
trying to accomplish. Tuning LLMs to prioritize their reports based on the
likelihood of real-world exploitation would be another helpful step. Those
who are using these tools for ill are surely pointing them toward
exploitable bugs; LLM providers could potentially use those prompts
(or share them) for defensive purposes. The LLM providers just might have
their own tools and models that could be loosed on such a task as well.
Keeping maintainers fully in control is perhaps the most important element
of this effort; giving them the ability to opt out is particularly key.
There is a balance to be struck there, of course, because there may be bugs
found that need escalation even when the project and its maintainers are
not interested in the machine-generated reports. These are the early days
for LLM bug-finding—and machines can generate far more reports than mere
humans can process—so we are likely to see a variety of approaches, both
good and ill. For now, this seems like a nice example of the “good” side
of the coin.
Source: lwn.net…
