LLMs Demonstrate Superoptimization Capabilities on Large-Scale Assembly Benchmark
A team of researchers has demonstrated that large language models can function as superoptimizers, generating assembly code that runs faster than code produced by traditional compilers. The study, posted to arXiv, evaluated 23 LLMs against a newly created benchmark of 8,072 assembly programs averaging 130 lines each. The strongest baseline model, Claude‑opus‑4, passed 51.5% of test cases and achieved a 1.43× speedup over gcc -O3. Subsequent fine‑tuning raised correctness to 95.0% and average speedup to 1.46×.
Benchmark Construction
The authors assembled the first large‑scale superoptimization dataset, expanding beyond earlier collections limited to 2–15 straight‑line, loop‑free snippets. The new benchmark comprises 8,072 programs sourced from diverse domains, providing a more realistic testbed for performance‑critical code transformation.
Baseline Model Performance
Among the 23 evaluated models, Claude‑opus‑4 emerged as the strongest baseline, achieving a 51.5% pass rate on functional tests and delivering a 1.43× average runtime improvement compared with code compiled using gcc -O3. Other models displayed lower correctness and speedup metrics, underscoring the challenge of assembly‑level optimization.
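The paper's abstract does not spell out the evaluation harness, but the two reported metrics imply a simple protocol: a candidate earns credit only if it matches the reference output on every test case, and its speedup is the baseline runtime divided by its own. A minimal sketch of that logic (names and signatures are assumptions, not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    stdin: str       # input fed to the program
    expected: str    # output of the gcc -O3 reference binary

def evaluate(run, baseline_time, candidate_time, tests):
    """Score one LLM-generated candidate.

    `run` executes the candidate on an input and returns its output.
    Returns the speedup over the baseline if every test passes,
    or None if the candidate is functionally incorrect.
    """
    if not all(run(t.stdin) == t.expected for t in tests):
        return None  # wrong output on any test: no speedup credit
    return baseline_time / candidate_time
```

Under this scheme, the reported 51.5% pass rate is the fraction of programs where `evaluate` returns a value rather than None, and 1.43× is the average of those returned speedups.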
Reinforcement Learning Fine‑Tuning
To improve outcomes, the researchers applied reinforcement learning to fine‑tune Qwen2.5‑Coder‑7B‑Instruct, using a reward function that combined correctness and performance gains. After training, the resulting model, named SuperCoder, reached 95.0% correctness and an average speedup of 1.46×, far surpassing the strongest baseline's 51.5% correctness while also edging out its 1.43× speedup.
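The abstract says only that the reward combines correctness and performance; one common shaping for this kind of objective gates the performance term on full correctness, so the model cannot trade accuracy for speed. The sketch below is an illustrative assumption, not the paper's actual reward:

```python
def reward(pass_rate: float, speedup: float) -> float:
    """Hypothetical RL reward: correctness first, speed second.

    pass_rate: fraction of test cases the generated assembly passes (0..1).
    speedup:   runtime of gcc -O3 output divided by the candidate's runtime.
    """
    if pass_rate < 1.0:
        return pass_rate          # partial credit for correctness only
    return 1.0 + max(0.0, speedup - 1.0)  # bonus only for genuine gains
```

Gating the speedup bonus on a perfect pass rate is one plausible reason the fine‑tuned model's correctness (95.0%) improved far more dramatically than its speedup (1.43× to 1.46×).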
Sampling Strategies and Iterative Refinement
The study also explored Best‑of‑N sampling and iterative refinement techniques. These methods further boosted SuperCoder’s performance, demonstrating that multiple generation attempts and subsequent polishing can extract additional speed improvements from the same model.
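Best‑of‑N sampling is straightforward to express: draw N candidates from the model and keep the fastest one that passes all tests. A minimal sketch, with `generate` and `evaluate` as assumed callables (the latter returning a speedup, or None for incorrect candidates):

```python
def best_of_n(generate, evaluate, n=8):
    """Sample n candidates and keep the fastest correct one.

    generate() -> a candidate program.
    evaluate(candidate) -> speedup over the baseline, or None if any
    test case fails. Returns (best_candidate, best_speedup); the
    candidate is None if no sample was correct.
    """
    best, best_speedup = None, 0.0
    for _ in range(n):
        candidate = generate()
        speedup = evaluate(candidate)
        if speedup is not None and speedup > best_speedup:
            best, best_speedup = candidate, speedup
    return best, best_speedup
```

Iterative refinement works similarly but feeds failure information (failing tests, timing) back into the prompt for the next attempt rather than sampling independently.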
Implications for Future Optimization Research
These findings suggest that LLMs can complement or even exceed conventional compiler heuristics in certain contexts. The authors propose that future work explore integration of LLM‑based superoptimizers into development pipelines and investigate scalability to larger codebases and different hardware targets.
This report is based on the abstract of the research paper, an open-access preprint; the full text is available on arXiv.