New Benchmarks Assess Iterative Solver Debugging and Bias in Operations Research LLMs
Researchers have unveiled two evaluation suites—ORDebug and ORBias—to measure the capacity of large language models (LLMs) for iterative problem-solving in operations research (OR). The benchmarks target the diagnostic loop that practitioners use when confronting infeasible models, offering a more realistic assessment than traditional one-shot translation tests.
Background
In typical OR workflows, analysts debug infeasibility by identifying Irreducible Infeasible Subsystems (IIS), pinpointing conflicting constraints, and iteratively revising formulations until a feasible solution emerges. Existing LLM benchmarks, however, largely ignore this iterative process, focusing instead on generating solver code from a static description.
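The deletion-filter method is a standard way to extract an IIS from an infeasible system (the paper does not specify which IIS algorithm its solvers use; this is an illustrative sketch). The idea: drop each constraint in turn, and discard it permanently if the remaining system is still infeasible; what survives is an irreducible infeasible subsystem. Below is a minimal pure-Python version for linear constraints A x ≤ b with x ≥ 0, using SciPy's `linprog` as the feasibility oracle; the function and variable names are our own.

```python
from scipy.optimize import linprog

def feasible(constraints, n_vars):
    """Feasibility check for A x <= b, x >= 0, via an LP with zero objective."""
    if not constraints:
        return True
    A = [c[0] for c in constraints]
    b = [c[1] for c in constraints]
    res = linprog(c=[0.0] * n_vars, A_ub=A, b_ub=b,
                  bounds=[(0, None)] * n_vars)
    return res.status == 0  # 0 = optimal found (feasible), 2 = infeasible

def deletion_filter_iis(constraints, n_vars):
    """Return indices of one Irreducible Infeasible Subsystem (deletion filter)."""
    assert not feasible(constraints, n_vars), "system must be infeasible"
    keep = list(range(len(constraints)))
    for i in list(keep):
        trial = [constraints[j] for j in keep if j != i]
        if not feasible(trial, n_vars):
            keep.remove(i)  # constraint i is not needed for infeasibility
    return keep

# One variable x >= 0 with: x <= 1, -x <= -2 (i.e. x >= 2), x <= 5.
# The first two constraints conflict; the third is redundant.
cons = [([1.0], 1.0), ([-1.0], -2.0), ([1.0], 5.0)]
print(deletion_filter_iis(cons, 1))  # → [0, 1]
```

Production solvers (Gurobi, CPLEX) expose the same operation natively, but the deletion filter makes the "pinpoint the conflicting constraints" step concrete.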
ORDebug Benchmark
ORDebug comprises over 5,000 problems spanning nine distinct error categories. Each model‑generated repair action triggers a re‑execution of the solver and recomputation of the IIS, delivering deterministic, verifiable feedback that guides subsequent corrections. This loop enables precise measurement of a model’s self‑correction ability and resolution speed.
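The execute–diagnose–repair loop described above can be sketched as a small harness. Everything here is an assumption for illustration: `check` stands in for the solver-plus-IIS oracle, `propose_fix` for the model's repair policy, and the toy "model" is just a pair of bounds that conflicts when the lower exceeds the upper.

```python
def debug_loop(check, propose_fix, model_state, max_steps=10):
    """ORDebug-style loop: run the verifiable checker, feed its
    diagnostic back to the repair policy, and count steps to resolution."""
    for step in range(1, max_steps + 1):
        ok, diagnostic = check(model_state)
        if ok:
            return model_state, step - 1  # resolved; repairs taken so far
        model_state = propose_fix(model_state, diagnostic)
    return model_state, max_steps  # gave up; useful as a failure signal

# Toy stand-ins: the "model" is a bound pair (lo, hi), infeasible if lo > hi.
def check(state):
    lo, hi = state
    return (lo <= hi, f"conflict: lower bound {lo} > upper bound {hi}")

def fix(state, diagnostic):
    lo, hi = state
    return (lo - 1, hi)  # naive policy: relax the lower bound one unit

print(debug_loop(check, fix, (3, 1)))  # → ((1, 1), 2)
```

Because the checker is deterministic, the step count it returns is exactly the "average steps to resolution" metric the benchmark reports.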
ORBias Benchmark
ORBias evaluates behavioral rationality using 2,000 newsvendor instances—1,000 in‑distribution (ID) and 1,000 out‑of‑distribution (OOD). The suite quantifies systematic deviations from analytically optimal policies, thereby exposing bias that may arise when models encounter unfamiliar data.
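The analytically optimal policy for the newsvendor problem is the critical-fractile solution q* = F⁻¹(cu / (cu + co)), where cu is the underage cost (price − cost) and co the overage cost (cost − salvage). A sketch of the optimum and a signed bias measure, assuming normally distributed demand (the paper's exact demand distributions and bias metric are not specified here):

```python
from statistics import NormalDist

def newsvendor_optimal(mu, sigma, price, cost, salvage=0.0):
    """Critical-fractile order quantity q* = F^{-1}(cu / (cu + co))
    for demand ~ Normal(mu, sigma)."""
    cu = price - cost     # underage cost: profit lost per unit of unmet demand
    co = cost - salvage   # overage cost: loss per unsold unit
    fractile = cu / (cu + co)
    return NormalDist(mu, sigma).inv_cdf(fractile)

def bias(model_order, mu, sigma, price, cost):
    """Signed deviation from optimal, as a fraction of the optimum."""
    q_star = newsvendor_optimal(mu, sigma, price, cost)
    return (model_order - q_star) / q_star

# Demand ~ N(100, 20), price 10, cost 4: fractile 0.6, q* ≈ 105.07.
print(round(newsvendor_optimal(100, 20, 10, 4), 2))
```

A positive bias means over-ordering relative to the optimum; the well-known "pull-to-center" effect in human decision makers shows up as bias toward mean demand, which is the kind of systematic deviation ORBias quantifies.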
Performance Results
Across 26 models and more than 12,000 sampled runs, a domain-specific 8-billion-parameter model trained with reinforcement learning from verifiable rewards (RLVR) outperformed leading public APIs. Recovery rates rose to 95.3 % versus 86.2 % (a 9.1-percentage-point gain), diagnostic accuracy improved to 62.4 % from 47.8 % (a 14.6-percentage-point increase), and the average number of steps to resolution fell to 2.25 from 3.78, a roughly 1.7-fold speedup.
Bias Mitigation
Curriculum training on ORBias yielded the only negative bias drift observed when moving from ID to OOD scenarios, cutting systematic bias roughly in half (from 20.0 % to 10.4 %, a 48 % relative reduction). This result suggests that targeted training can counteract bias without sacrificing overall performance.
Implications for LLM Training
The findings demonstrate that process‑level evaluation, anchored by verifiable oracles, can guide the development of LLMs that excel in realistic OR debugging tasks. By focusing on iterative correction and bias assessment, researchers can achieve improvements that surpass gains obtained solely through model scaling.
This report is based on the abstract of an open-access research preprint; the full text is available via arXiv.