
How I used o3 to find a remote zeroday in the Linux SMB implementation
by zielmicha
In this post I’ll show you how I found a zeroday vulnerability in the Linux kernel using OpenAI’s o3 model. I found the vulnerability with nothing more complicated than the o3 API – no scaffolding, no agentic frameworks, no tool use.
Recently I’ve been auditing ksmbd for vulnerabilities. ksmbd is “a linux kernel server which implements SMB3 protocol in kernel space for sharing files over network”. I started this project specifically to take a break from LLM-related tool development, but after the release of o3 I couldn’t resist using the bugs I had found in ksmbd as a quick benchmark of o3’s capabilities. In a future post I’ll discuss o3’s performance across all of those bugs, but here we’ll focus on how o3 found a zeroday vulnerability during my benchmarking. The vulnerability it found is CVE-2025-37899 (fix here), a use-after-free in the handler for the SMB ‘logoff’ command. Understanding the vulnerability requires reasoning about concurrent connections to the server and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not reference counted is freed while still being accessible by another thread. As far as I’m aware, this is the first public discussion of a vulnerability of that nature being found by an LLM.
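To make the shape of that bug concrete, here is a deliberately simplified userspace model of the race. The struct definitions and handler functions below are hypothetical stand-ins, not the real ksmbd code: one thread is still servicing a request on a session while another connection bound to the same session processes a logoff and frees the non-reference-counted user object.

```c
/* Simplified, hypothetical model of the race described above;
 * the names loosely mirror ksmbd but this is not the kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct ksmbd_user {
	char name[32];
};

struct ksmbd_session {
	struct ksmbd_user *user;	/* not reference counted */
};

/* Connection A's worker: still handling a request that reads sess->user. */
static void *connection_a_worker(void *arg)
{
	struct ksmbd_session *sess = arg;

	usleep(1000);			/* lose the race on purpose */
	if (sess->user)			/* stale pointer: use-after-free */
		printf("handled request for user %s\n", sess->user->name);
	return NULL;
}

/* Connection B's SMB2 LOGOFF handler: frees the shared user object
 * without coordinating with other connections bound to the session. */
static void session_logoff(struct ksmbd_session *sess)
{
	free(sess->user);
	/* sess->user is left dangling, still visible to connection A */
}

int main(void)
{
	struct ksmbd_session sess = { .user = malloc(sizeof(struct ksmbd_user)) };
	pthread_t worker;

	strcpy(sess.user->name, "alice");
	pthread_create(&worker, NULL, connection_a_worker, &sess);
	session_logoff(&sess);		/* logoff races with the in-flight request */
	pthread_join(worker, NULL);
	return 0;
}
```

Compiled with `gcc -pthread` and run under AddressSanitizer, the read in connection_a_worker is flagged as a heap-use-after-free; the real vulnerability is the same class of condition, just reached through the kernel’s connection and session handling.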
Before I get into the technical details, the main takeaway from this post is this: with o3, LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention. If you’re an expert-level vulnerability researcher or exploit developer, the machines aren’t about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you significantly more efficient and effective. If you have a problem that can be represented in fewer than 10k lines of code, there is a reasonable chance o3 can either solve it or help you solve it.
Benchmarking o3 using CVE-2025-37778
Let’s first discuss CVE-2025-37778, a vulnerability that I found manually and which I was using as a benchmark for o3’s capabilities. While it’s not a zeroday, it’s worth explaining, as the benchmark shows o3 performing 2x-3x better than Claude Sonnet 3.7 on the same task.
CVE-2025-37778 is a use-after-free vulnerability. The issue occurs during the Kerberos authentication path when handling a “session setup” request from a remote client. To save us referring to CVE numbers, I will refer to this vulnerability as the “kerberos authentication vulnerability”.
The root cause looks as follows:
```c
static int krb5_authenticate(struct ksmbd_work *work,
			     struct smb2_sess_setup_req *req,
			     struct smb2_sess_setup_rsp *rsp)
{
	...
	if (sess->state == SMB2_SESSION_VALID)
		ksmbd_free_user(sess->user);

	retval = ksmbd_krb5_authenticate(sess, in_blob, in_len,
					 out_blob, &out_len);
	if (retval) {
		ksmbd_debug(SMB, "krb5 authentication failed\n");
		return -EINVAL;
	}
	...
```
If krb5_authenticate detects that the session state is SMB2_SESSION_VALID then it frees sess->user. The assumption here appears to be that afterwards either ksmbd_krb5_authenticate will reinitialise it to a new valid value, or that, after returning from krb5_authenticate with a return value of -EINVAL, sess->user will not be used elsewhere. As it turns out, this assumption is false. We can force ksmbd_krb5_authenticate to not reinitialise sess->user, and we can access sess->user even if krb5_authenticate returns -EINVAL.
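The kind of hardening this pattern calls for is to leave no dangling reference behind at the point of the free. The snippet below is an illustrative sketch of that idea applied to the code above, not necessarily the actual upstream patch:

```c
	/* Illustrative hardening sketch, not necessarily the upstream fix:
	 * clear sess->user as soon as it is freed so that any later path
	 * reaching it sees NULL rather than a dangling pointer. */
	if (sess->state == SMB2_SESSION_VALID) {
		ksmbd_free_user(sess->user);
		sess->user = NULL;
	}
```

On its own this only converts the use-after-free into a NULL check for paths that test sess->user before using it; whether that is sufficient depends on how the rest of the session setup and connection handling code reaches the field.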
This vulnerability is a nice benchmark for LLM capabilities as:
- It is interesting by virtue of being part of the remote attack surface of the Linux kernel.
- It is not trivial as it requires:
  - (a) Figuring out how to get sess->state == SMB2_SESSION_VALID in order to trigger the free.
  - (b) Realising that there are paths in ksmbd_krb5_authenticate that do not reinitialise sess->user, and reasoning about how to trigger those paths.
  - (c) Realising that there are other parts of the codebase that could potentially access sess->user after it has been freed.
- While it is not trivial, it is also not insanely complicated. I could walk a colleague through the entire code-path in 10 minutes, and you don’t really need to understand a lot of auxiliary information about the Linux kernel, the SMB protocol, or the remainder of ksmbd, outside of connection handling and session setup code. I calculated how much code you would need to read at a minimum if you read every ksmbd function called along the path from a packet arriving at the ksmbd module to the vulnerability being triggered, and it works out at about 3.3k LoC.
OK, so we have the vulnerability we want to use for evaluation, now what code do we show the LLM to see if it can find it? My goal here is to evaluate how o3 would perform were it the backend for a hypothetical vulnerability detection system, so we need to ensure we have clarity on how such a system would generate queries to the LLM. In other words, it is no good arbit
19 Comments
zison
Very interesting. Is the bug it found exploitable in practice? Could this have been found by syzkaller?
zielmicha
(To be clear, I'm not the author of the post, the title just starts with "How I")
mdaniel
Notable:
> o3 finds the kerberos authentication vulnerability in 8 of the 100 runs
And I'd guess this only became a blog post because the author already knew about the vuln and was just curious to see if the intern could spot it too, given a curated subset of the codebase
Retr0id
The article cites a signal to noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating this part will be where the real wins are, so I'll be watching this closely.
Hilift
Does the vulnerability exist in other implementations of SMB?
logifail
My understanding is that ksmbd is a kernel-space SMB server "developed as a lightweight, high-performance alternative" to the traditional (user-space) Samba server…
Q1: Who is using ksmbd in production?
Q2: Why?
iandanforth
The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
mezyt
Meanwhile, as a maintainer, I've been reviewing more than a dozen false-positive slop CVEs in my library and not a single one found an actual issue. This article is probably going to make my situation worse.
jobswithgptcom
Wow, interesting. I've been hacking on a tool called https://diffwithgpt.com with a similar angle, but indexing git changelogs with qwen to have it raise risks for backward-compat issues, including security risks when upgrading k8s etc.
empath75
Given the value of finding zero days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred API calls. Especially if you can fine-tune a model with lots of examples, which I don't think OpenAI etc. are going to do with any public API.
akomtu
This made me think that the near future will be LLMs trained specifically on Linux or another large project. The source code is a small part of the dataset fed to LLMs. More interesting is the runtime data flow, similar to what we observe in a debugger. Looking at the codebase alone is like trying to understand a waterfall by looking at equations that describe the water flow.
KTibow
> With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.
This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see if using the official thinking mechanism gives it enough space to get differing results.
nxobject
A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for system prompt, background information, and auxiliary instructions [1], and then running it through `llm`.
It reveals how good LLM use, like any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.
[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
dehrmann
Are there better tools for finding this? It feels like the sort of thing static analysis should reliably find, but it's in the Linux kernel, so you'd think either coding standards or tooling around these sorts of C bugs would be mature.
firesteelrain
I really hope this is legit and not what keeps happening to curl
[1] https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f…
ape4
Seems we need something like kernel modules but with memory protection
martinald
I think this is the biggest alignment problem with LLMs in the short term imo. It is getting scarily good at this.
I recently found a pretty serious security vulnerability in an open source very niche server I sometimes use. This took virtually no effort using LLMs. I'm worried that there is a huge long tail of software out there which wasn't worth finding vulnerabilities in for nefarious means manually but if it was automated could lead to really serious problems.
dboreham
I feel like our jobs are reasonably secure for a while because the LLM didn't immediately say "SMB implemented in the kernel, are you f-ing joking!?"
simonw
There's a beautiful little snippet here that perfectly captures how most of my prompt development sessions go:
> I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I’d like it to help, so here we are. In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.