New Multi-Vulnerability Benchmark Shows LLM Accuracy Declines with Higher Bug Density
Researchers have introduced a comprehensive benchmark to assess large language models (LLMs) on the detection of multiple, interacting software vulnerabilities within large code files. The study, posted on arXiv in December 2025, aims to fill gaps left by existing single‑vulnerability tests and to quantify biases that emerge in multi‑label security tasks.
Benchmark Design
According to the authors, the benchmark targets four widely used programming languages—C, C++, Python, and JavaScript—and evaluates model performance across varying vulnerability densities. Files range from 7,500 to 10,000 tokens, reflecting realistic codebases where developers often encounter extensive source files.
Dataset Construction
The team assembled a dataset of 40,000 code files by injecting controlled numbers of vulnerabilities (1, 3, 5, and 9 per file) into long‑context samples drawn from the CodeParrot repository. This systematic approach enables precise measurement of how detection rates change as the number of bugs increases.
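The paper does not publish its injection code, but the described procedure can be sketched as follows. This is a hypothetical illustration: the `BUG_TEMPLATES` pool, CWE labels, and `inject` function are invented for this sketch, not taken from the benchmark.

```python
# Hypothetical sketch of controlled vulnerability injection: given a clean
# source file and a pool of bug templates, insert exactly `n` bugs at random
# line positions and record ground-truth labels. Templates are illustrative.
import random

BUG_TEMPLATES = [
    ("CWE-798", 'API_KEY = "sk-live-1234"  # hard-coded credential'),
    ("CWE-89",  'cursor.execute("SELECT * FROM users WHERE id=" + uid)'),
    ("CWE-22",  'open("/data/" + user_path).read()  # path traversal'),
]

def inject(source: str, n: int, seed: int = 0):
    """Return (modified_source, labels) with n injected vulnerabilities."""
    rng = random.Random(seed)           # fixed seed for reproducible datasets
    lines = source.splitlines()
    labels = []
    for _ in range(n):
        cwe, snippet = rng.choice(BUG_TEMPLATES)
        pos = rng.randrange(len(lines) + 1)   # any position, incl. file end
        lines.insert(pos, snippet)
        labels.append((cwe, pos))
    return "\n".join(lines), labels
```

Recording the labels at injection time is what makes detection rates exactly measurable as density grows from 1 to 9 bugs per file.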
Model Evaluation
Five state‑of‑the‑art LLMs were evaluated, including GPT‑4o‑mini, Llama‑3.3‑70B, and the Qwen‑2.5 series. Each model was tasked with identifying all present vulnerabilities and labeling them correctly, allowing the researchers to compute standard metrics such as precision, recall, and F1 score.
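For a multi-label task like this, the standard metrics can be computed per file by comparing the set of ground-truth labels against the set a model reports. A minimal sketch, assuming each label is a (CWE, location) pair; the label format and function name are assumptions, not the paper's scoring code:

```python
# Multi-label scoring for one file: precision, recall, and F1 over label sets.
def prf1(true_labels: set, predicted_labels: set):
    tp = len(true_labels & predicted_labels)   # correctly identified bugs
    precision = tp / len(predicted_labels) if predicted_labels else 0.0
    recall = tp / len(true_labels) if true_labels else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 5 injected bugs, model reports 4 labels, 3 of them correct.
truth = {("CWE-787", 12), ("CWE-416", 40), ("CWE-476", 77),
         ("CWE-190", 120), ("CWE-22", 300)}
pred = {("CWE-787", 12), ("CWE-416", 40), ("CWE-476", 77), ("CWE-89", 15)}
p, r, f = prf1(truth, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.75 0.6 0.67
```

Requiring both the correct label and the correct location is what distinguishes this setup from single-vulnerability classification benchmarks.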
Key Findings for C and C++
The authors report that Llama‑3.3‑70B achieved an F1 score of approximately 0.97 on single‑vulnerability C tasks, indicating near‑perfect detection in low‑density scenarios. However, when the vulnerability count rose to nine per file, performance dropped by up to 40%, highlighting a pronounced “count bias” as the models struggled to enumerate all defects.
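One simple way to quantify the count bias described above is to average the difference between how many vulnerabilities a model reports and how many were injected, per density level. The helper and the numbers below are invented for illustration and do not come from the paper:

```python
# Mean count error: negative values mean the model under-counts on average.
def mean_count_error(pairs):
    """Average of (predicted - true) over a list of (true, predicted) counts."""
    return sum(pred - true for true, pred in pairs) / len(pairs)

# e.g. at density 9, a model that reports only 5-7 bugs per file under-counts:
high_density = [(9, 6), (9, 5), (9, 7), (9, 6)]
print(mean_count_error(high_density))  # -3.0
```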
Challenges in Python and JavaScript
In contrast, the same models exhibited markedly lower recall on Python and JavaScript files. Under high‑density conditions, recall fell below 0.30, suggesting severe “under‑counting” where many vulnerabilities were missed entirely. The authors note that dynamic language features and less rigid syntax may contribute to these distinct failure modes.
Implications for Future Research
The study underscores the need for benchmarks that reflect real‑world code complexity and for model improvements that mitigate multi‑label biases. By making the dataset publicly available, the authors encourage further exploration of techniques—such as chain‑of‑thought prompting or specialized fine‑tuning—to enhance multi‑vulnerability detection capabilities.
This report is based on the abstract of the research paper, posted on arXiv as an open-access academic preprint; the full text is available via arXiv.