NeoChainDaily
02.02.2026 • 05:35 Research & Innovation

Study Compares Prompt-Based and Fine-Tuned LLMs for Security Bug Report Detection

A research paper posted to arXiv on January 30, 2026, evaluates how large language models (LLMs) can be used to predict security bug reports (SBRs). The authors—Farnaz Soltaniani, Shoaib Razzaq, and Mohammad Ghafari—investigate both prompt-engineering and fine-tuning strategies to determine which approach yields better detection performance.

Methodology Overview

The study applies a prompt‑based approach using proprietary LLMs and a fine‑tuning approach that adapts open‑source models to the SBR prediction task. Experiments are conducted across multiple publicly available datasets, and performance is measured with standard metrics such as G‑measure, precision, and recall.
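For readers unfamiliar with these metrics, the sketch below shows how they are typically computed from a confusion matrix. Note that the paper's exact G-measure definition is not given in this summary; the version here, the harmonic mean of recall and specificity, is one commonly used in bug-report prediction studies and is an assumption.

```python
# Minimal sketch of standard classification metrics for SBR prediction.
# Assumption: G-measure is taken as the harmonic mean of recall and
# specificity (true-negative rate), a common choice in this literature;
# the paper may use a different variant.

def metrics(y_true, y_pred):
    # Count the four confusion-matrix cells (1 = security bug report).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    g_measure = (2 * recall * specificity / (recall + specificity)
                 if recall + specificity else 0.0)
    return {"precision": precision, "recall": recall, "g_measure": g_measure}
```

With labels `[1,1,1,0,0,0,0,0]` and predictions `[1,1,0,1,1,0,0,0]`, this yields a precision of 0.5 and a recall of about 0.67, illustrating how a model can find most true positives while also raising false alarms.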

Prompt‑Based Proprietary Models

According to the authors, the prompt‑based models achieve the highest sensitivity, recording an average G‑measure of 77% and a recall of 74% across all datasets. However, this sensitivity comes with a higher false‑positive rate, resulting in an average precision of only 22%.

Fine‑Tuned Open Models

In contrast, the fine‑tuned models attain a lower overall G‑measure of 51% but demonstrate substantially higher precision at 75%. The trade‑off is a reduced recall of 36%, indicating that fewer true security bug reports are identified.

Performance Trade‑offs and Speed

The authors note that, after the initial investment required to fine‑tune a model, inference on the largest dataset is up to 50 times faster than the proprietary prompt‑based approach. This speed advantage may be relevant for organizations that need rapid, large‑scale scanning of bug reports.

Implications for Future Research

The findings suggest that both approaches have distinct strengths: prompt‑based models excel at finding more potential bugs, while fine‑tuned models provide higher confidence in the bugs they flag. The authors recommend further investigation into hybrid techniques that could combine sensitivity with precision.
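One way such a hybrid could work, purely as an illustration and not a method from the paper, is a two-stage cascade: a high-recall prompt-based screen followed by a high-precision fine-tuned confirmation step. The model calls below are hypothetical stubs standing in for the real classifiers.

```python
# Illustrative hybrid pipeline (not from the paper): cascade a sensitive
# first-pass filter into a precise second-stage confirmer. Both stages
# are stand-in keyword heuristics for what would be LLM calls.

def prompt_screen(report: str) -> bool:
    # Stand-in for a prompt-based LLM: casts a wide net (high recall).
    return any(k in report.lower() for k in ("overflow", "injection", "xss"))

def fine_tuned_confirm(report: str) -> bool:
    # Stand-in for a fine-tuned classifier: stricter check (high precision).
    return "overflow" in report.lower() or "injection" in report.lower()

def hybrid_flag(report: str) -> bool:
    # Only reports that pass both stages are flagged as security bugs,
    # trading some of stage one's recall for stage two's precision.
    return prompt_screen(report) and fine_tuned_confirm(report)
```

In this sketch, a report like "Possible SQL injection in login form" passes both stages, while a purely cosmetic report is rejected at the first stage.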

Limitations and Next Steps

The paper acknowledges that the evaluation is limited to the datasets examined and that real‑world deployment may present additional challenges, such as evolving bug report language and domain‑specific vocabularies. Future work is proposed to explore larger model families and to assess the impact of continual learning on detection performance.

This report is based on the abstract of the research paper, an open-access preprint. The full text is available via arXiv.

End of Transmission

Original source
