New Multi-Language Benchmark Targets Neural Theorem Proving for Software Verification Conditions
Lead
Researchers from multiple institutions have released the first multi-language benchmark designed to evaluate neural theorem proving on verification conditions (VCs) extracted from real-world software such as the Linux and Contiki‑OS kernels. The benchmark, named NTP4VC, was announced in a preprint posted to arXiv on January 30, 2026. It aims to address the long‑standing bottleneck in automated program verification where existing automated theorem provers frequently fail to discharge hard VCs, forcing developers to resort to manual proof effort.
Benchmark Composition
The dataset comprises thousands of VCs generated through industrial verification pipelines—Why3 for Isabelle and Lean, and Frama‑C for Coq—ensuring semantic equivalence across the three proof assistants. By covering multiple programming languages and formal systems, the benchmark seeks to reflect the diversity of verification tasks encountered in practice.
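To give a sense of what a verification condition looks like once it reaches a proof assistant, here is an illustrative (not drawn from the benchmark) obligation in Lean: the kind of arithmetic side condition a verifier might emit when proving that a loop counter stays within an array bound.

```lean
-- Hypothetical example of a verification condition, stated as a Lean theorem.
-- A verifier checking `a[i+1]` inside a loop guarded by `i < n` might emit
-- the obligation that the incremented index is still within bounds.
theorem vc_index_in_bounds (i n : Nat) (h : i < n) : i + 1 ≤ n :=
  h  -- on Nat, `i < n` is definitionally `i + 1 ≤ n`
```

Real VCs from kernel code are typically far larger, involving memory models and machine arithmetic, which is precisely why automated provers struggle with them.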
Model Evaluation
To assess current capabilities, the authors evaluated a range of large language models, including general‑purpose LLMs and versions fine‑tuned specifically for theorem proving. The experiments revealed that while some models can solve a modest subset of VCs, overall success rates remain far below the levels required for reliable automated verification.
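The headline metric in evaluations of this kind is the solve rate: the fraction of distinct VCs for which a model produces a proof that the proof assistant accepts. A minimal sketch of such a tally (the `Attempt` record and field names are assumptions, not the paper's harness):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """One proof attempt by one prover on one VC (hypothetical schema)."""
    vc_id: str    # identifier of the verification condition
    prover: str   # e.g. "general-llm" or "fine-tuned"
    proved: bool  # True iff the generated proof was checked successfully

def solve_rate(attempts: list[Attempt], prover: str) -> float:
    """Fraction of distinct VCs the given prover discharged at least once."""
    tried = {a.vc_id for a in attempts if a.prover == prover}
    solved = {a.vc_id for a in attempts if a.prover == prover and a.proved}
    return len(solved) / len(tried) if tried else 0.0
```

Counting distinct VCs rather than raw attempts matters when models are given multiple tries per obligation, as is common in neural theorem-proving evaluations.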
Implications for Research
The authors note that the performance gap highlights both the promise and the limitations of neural approaches in this domain. They argue that the benchmark provides a concrete target for future research, encouraging the development of models that can reason about program semantics and proof obligations more effectively.
Context Within Prior Work
Prior work has explored neural methods for mathematical theorem proving and for synthesizing annotations, but none has focused explicitly on the verification‑condition bottleneck in software verification. NTP4VC therefore fills a gap by offering a realistic, industrially relevant testbed.
Open‑Source Tooling
The release of the benchmark is accompanied by open‑source tooling that automates the extraction, translation, and validation of VCs across Isabelle, Lean, and Coq. This infrastructure is intended to lower the entry barrier for researchers aiming to benchmark or improve neural theorem‑proving systems.
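One validation step such tooling needs is checking that every VC actually has a counterpart in all three proof assistants before it enters the benchmark. A minimal sketch of that alignment check, assuming each target maps to a set of VC identifiers (the function and its inputs are hypothetical, not the released tooling's API):

```python
def aligned_vc_ids(vc_ids_by_target: dict[str, set[str]]) -> set[str]:
    """Return the VC ids present in every target, i.e. the obligations
    that have semantically equivalent versions in all proof assistants."""
    id_sets = list(vc_ids_by_target.values())
    return set.intersection(*id_sets) if id_sets else set()
```

Only the intersection would be admitted to a cross-assistant benchmark; ids missing from any one target indicate a translation that failed or was not attempted.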
Future Directions
The authors conclude that advancing neural theorem proving for verification conditions will require not only larger models but also training data that captures the logical structure of program verification, as well as tighter integration with existing verification pipelines.

This report is based on information from arXiv, released as an open-access academic preprint, and draws on the paper's abstract; the full text is available via arXiv.