Released in 2025, 'N-Day-Bench' evaluates AI's ability to find security holes in actual software code rather than artificial problems, with Claude 3.5 Sonnet achieving the highest scores.
Can AI Keep Our Digital Doors Locked? The Truth About ‘N-Day-Bench’, the Hunt for Real Software Holes
Imagine you are the security chief of a massive apartment complex where thousands of families live. This complex has tens of thousands of front doors and windows, and new corridors and automated parcel lockers are installed every day for the residents’ convenience. Your security team is limited, and every night you suffer from the anxiety that “somewhere, a door might have been left unlocked.” What if a brilliant, tireless “AI Security Agent” appeared and said, “I’ll go shake every door for you and check for any gaps”?
But one question remains: Does this AI agent actually have the ability to find the tiny cracks that a “real thief” could squeeze through? Or is it just a “theory expert” that is only good at solving obvious textbook problems?
To answer this, a very special testing ground for measuring AI’s real-world “muscles” emerged in early 2025: ‘N-Day-Bench’.
Why is this important?
From the smartphone banking apps and delivery apps we use daily to the autonomous driving software in the cars we ride—every digital service is made of millions of lines of “code.” Hidden within this massive pile of code could be “vulnerabilities” (weak points hackers can exploit) that we haven’t discovered yet. They are like invisible “termites” eating away at the pillars of a seemingly sturdy wooden house.
Until now, professional security personnel have scrutinized code until their eyes were weary, or used automated tools that scan according to pre-set rules to find these termites. However, as software becomes exponentially more complex, it has become nearly impossible for humans to plug every hole.
What if Large Language Models (LLMs) like ChatGPT or Claude could effortlessly find security holes in real software environments? We would live in a much safer digital world. ‘N-Day-Bench’ drills into exactly this point. It goes beyond AI simply advising that “theoretically, this code is dangerous” and strictly verifies whether AI can extract real problems from complex, functioning software.
Easy Understanding: What kind of test is N-Day-Bench?
The ‘N-Day’ in this benchmark’s name refers to vulnerabilities that are already known to the world, ones with a “track record.” Usually, when a security flaw is found in software, it is assigned a unique ‘CVE (Common Vulnerabilities and Exposures)’ number, much like a case number assigned to a criminal case.
N-Day-Bench uses these real-world CVE cases—which have actually made numerous companies and users tremble—as test questions rather than fake practice problems. I’ve summarized the features of this test into three key points.
1. Three AI Agents: Exploring Vulnerabilities through ‘Team Play’
N-Day-Bench isn’t just about showing code to a single AI and saying “find the problem.” Instead, AI agents with three distinct roles collaborate organically, much like a police investigation team.
- Curator: The “team leader” who selects and organizes appropriate problems for the AI to solve from countless incidents.
- Finder: The “field detective” who scours the code to find suspicious holes.
- Judge: The “arbiter” who impartially determines whether the evidence found by the detective holds up or is just a forced argument.
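The division of labor above can be sketched roughly in code. This is an illustrative sketch only, not N-Day-Bench’s actual implementation: every name in it (`Task`, `curator`, `finder`, `judge`) is hypothetical, invented to show how the three roles hand work to each other.

```python
from dataclasses import dataclass

# Hypothetical three-role pipeline, loosely mirroring the article's
# Curator -> Finder -> Judge description. None of these names come
# from the real benchmark.

@dataclass
class Task:
    cve_id: str
    repo: str

def curator(candidate_cves: list[dict]) -> list[Task]:
    """Team leader: select well-scoped, reproducible cases as tasks."""
    return [Task(c["cve_id"], c["repo"])
            for c in candidate_cves if c.get("reproducible")]

def finder(task: Task) -> dict:
    """Field detective: explore the codebase, report a suspected hole."""
    return {"cve_id": task.cve_id,
            "claim": f"suspicious code path in {task.repo}"}

def judge(task: Task, report: dict) -> bool:
    """Arbiter: accept the report only if it matches the target case."""
    return report["cve_id"] == task.cve_id

candidates = [
    {"cve_id": "CVE-2025-0001", "repo": "example/lib", "reproducible": True},
    {"cve_id": "CVE-2025-0002", "repo": "example/app", "reproducible": False},
]
for task in curator(candidates):
    verdict = "accepted" if judge(task, finder(task)) else "rejected"
    print(task.cve_id, verdict)
```

The point of the structure is separation of concerns: the Finder can be wrong or overeager, because the Judge independently checks every claim before it counts.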
2. “Find the Culprit Within 24 Steps”
The AI models are granted permission to directly execute code within a virtual space called a ‘Sandbox.’ A sandbox, simply put, is an isolated laboratory where you can safely run code, much like how children can build and destroy houses in a sandpit without damaging the surroundings.
However, the AI is not given unlimited time. It must analyze the code and write a final report within a budget of 24 shell steps. This is like a detective having to collect evidence quickly and effectively while the crime scene is still preserved.
3. ‘Monthly Updates’ to Prevent Cheating
It wouldn’t be a true test of skill if the AI simply memorized answer keys (security patch code) already available on the internet, right? Therefore, the developer, ‘WinFunc’, creates new test questions every month by pulling the freshest security cases from GitHub, the code repository used by developers worldwide. By presenting the latest problems that the AI hasn’t yet learned, the benchmark verifies whether it is truly “thinking” its way to a solution.
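The freshness idea boils down to filtering out anything a model could have seen during training. The cutoff date and CVE records below are invented for illustration; the article does not describe N-Day-Bench’s actual selection logic.

```python
from datetime import date

# Illustrative anti-memorization filter: keep only CVEs published after
# an assumed model training cutoff, so a memorized patch can't help.
# Both the cutoff and the records are fabricated examples.

TRAINING_CUTOFF = date(2024, 10, 1)  # assumption; real cutoffs vary per model

def fresh_cases(cves: list[dict], cutoff: date = TRAINING_CUTOFF) -> list[dict]:
    """Return only the CVEs published after the model's training cutoff."""
    return [c for c in cves if c["published"] > cutoff]

cves = [
    {"id": "CVE-2024-1111", "published": date(2024, 3, 5)},   # likely in training data
    {"id": "CVE-2025-2222", "published": date(2025, 1, 20)},  # post-cutoff, "fresh"
]
print([c["id"] for c in fresh_cases(cves)])
```

Refreshing the pool monthly keeps this filter meaningful: as models retrain on newer data, yesterday’s “fresh” cases age out and must be replaced.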
Current Situation: How is the AI’s Report Card?
AI models boasting the latest technology took this real-world test, and the results have been revealed.
- 1st Place: Claude 3.5 Sonnet, which correctly identified a whopping 32% of vulnerabilities. In simple terms, it solved about three out of ten problems on its own.
- 2nd Place: GPT-4o, which followed with a detection rate of 22%.
Overall, it was found that the latest AIs can independently find about 18–32% of vulnerabilities in real code. Looking only at the numbers, you might think, “Is that all?” But security experts see it differently.
Traditional automated analysis tools lacked flexibility because they only followed pre-set rules. In one experiment, the AI’s analytical ability was judged so overwhelmingly superior that existing tools looked like “toys” by comparison.
What Lies Ahead?
While it is certainly welcome news that AI is getting better at finding security holes, there is a worrisome side, like two sides of a coin.
Security expert Ken Huang warns that if AI starts finding vulnerabilities at an “unprecedented speed,” the question of who will handle the aftermath will become a major challenge.
To use a metaphor, it’s like a robot with a very high-performance microscope reporting that it has found tens of thousands of microscopic bugs in every corner of the house. The owner must then read each of those tens of thousands of reports and catch the bugs, potentially giving up much of their daily life in the process. Especially for open-source projects run by volunteers, there is a high risk that maintainers will suffer “burnout” while trying to verify the thousands of warning reports poured out by AI.
Nevertheless, AI is much more likely to become the “most reliable assistant” that dramatically reduces the workload of security experts. In the future, AI will move beyond being an auxiliary tool for writing code and establish itself as a “tireless sentinel” that monitors day and night to keep our digital world safe.
AI Perspective: MindTickleBytes AI Reporter’s View
The emergence of N-Day-Bench proves that AI is no longer just a “smooth-talking assistant.” Now, AI is developing the real-world muscles to fight on the actual battlefield.
However, just as quickly as the technology develops, the human systems for responsibly handling the many tasks and warnings it uncovers must mature as well. The tool has already become sharp. What will be tested now is our wisdom in wielding it.
References
- N-Day-Bench: Can LLMs Find Real Vulnerabilities in Real Code?
- N-Day-Bench: LLMs Detect 18-32% Real Code Vulnerabilities
- LLMs Find Vulnerabilities: N-Day-Bench & ZeroDayBench Insights
- Token Is All You Need: Finding 0days with LLMs
- Benchmark pits frontier LLMs against fresh real-world vulns
- LLMs Can Now Find Zero-Day Vulnerabilities. Here’s Why That’s Both Impressive and Alarming. (Vidoc Security Lab)
- [Can LLMs find bugs in large codebases? | Hamming AI Blog](https://hamming.ai/blog/bug-in-the-codestack)