NeoChainDaily
28.01.2026 • 05:45 Research & Innovation

Researchers Use LLMs to Create Diverse Black‑Box Optimization Benchmarks

On January 31, 2026, a team of researchers posted a new study on arXiv that details an approach for generating continuous black‑box optimization problems using large language models (LLMs) embedded in an evolutionary loop. The paper outlines how the method produces test functions with clearly defined high‑level landscape characteristics, aiming to broaden the pool of benchmarks available to the optimization community.

Motivation for New Benchmarks

Current benchmark suites such as the Black-Box Optimization Benchmarking (BBOB) suite are criticized for limited structural diversity, which can constrain the evaluation of optimization algorithms. Consequently, the authors argue that expanding the variety of problem landscapes is essential for more robust algorithm assessment.

LLM‑Driven Problem Generation

The study introduces the LLaMEA framework, which guides an LLM to translate natural‑language descriptions of target properties—such as multimodality, separability, basin‑size homogeneity, search‑space homogeneity, and global‑local optima contrast—into executable problem code. This translation enables rapid prototyping of functions that embody specific landscape traits.
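
To make the idea concrete, here is a hypothetical illustration (not taken from the paper) of the kind of executable problem code an LLM might emit for a description such as "multimodal, non-separable": a Rastrigin-style base function plus a coupling term between neighbouring coordinates that breaks separability.

```python
import numpy as np

def generated_problem(x):
    """Hypothetical LLM-generated test function: multimodal and
    non-separable. Not from the paper; for illustration only."""
    x = np.asarray(x, dtype=float)
    # Rastrigin-style term: many regularly spaced local optima.
    multimodal = 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))
    # Cross-term between adjacent coordinates breaks separability.
    coupling = 0.5 * np.sum(x[:-1] * x[1:])
    return float(multimodal + coupling)

print(generated_problem(np.zeros(2)))  # 0.0 (optimum of the base term)
```

In the framework described by the paper, code like this would be one candidate in the evolutionary population, subsequently scored against the requested landscape properties.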

Diversity‑Enhancing Mechanisms

Within the evolutionary loop, candidate problems are scored using Exploratory Landscape Analysis (ELA)‑based property predictors. An ELA‑space fitness‑sharing mechanism is applied to promote population diversity and steer the generator away from redundant landscapes, thereby increasing the likelihood of novel problem instances.
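
A minimal sketch of fitness sharing in a feature space, assuming candidates are described by ELA-style feature vectors (the function name and kernel choice below are illustrative, not the authors' implementation): each candidate's raw fitness is divided by a niche count, so candidates crowded together in feature space are penalized relative to isolated ones.

```python
import numpy as np

def shared_fitness(raw_fitness, ela_features, sigma=1.0):
    """Fitness sharing sketch: divide each candidate's raw fitness by
    its niche count, computed from pairwise distances between
    ELA-style feature vectors. Illustrative, not the paper's code."""
    F = np.asarray(ela_features, dtype=float)
    # Pairwise Euclidean distances in feature space.
    d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    # Triangular sharing kernel: 1 at distance 0, 0 beyond sigma.
    sharing = np.clip(1.0 - d / sigma, 0.0, None)
    niche_counts = sharing.sum(axis=1)  # self-similarity contributes 1
    return np.asarray(raw_fitness, dtype=float) / niche_counts

# Two identical candidates share a niche; the distant one does not,
# so it keeps more of its raw fitness.
print(shared_fitness([1.0, 1.0, 1.0], [[0, 0], [0, 0], [5, 5]]))
```

With equal raw fitness, the isolated candidate scores 1.0 while the duplicated pair score 0.5 each, which is exactly the pressure toward non-redundant landscapes the paper describes.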

Validation of Generated Landscapes

The authors employ a suite of verification techniques, including basin‑of‑attraction analysis, statistical testing, and visual inspection, to confirm that many generated functions exhibit the intended structural characteristics. These analyses provide evidence that the LLM‑generated problems align with their specified descriptions.

Embedding Analysis and Impact

A t‑SNE embedding of the generated instances reveals that they extend the BBOB instance space rather than forming an isolated cluster. This observation suggests that the new library meaningfully expands the diversity of available benchmarks.
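
The paper uses t-SNE; as a simpler stand-in for illustration, a PCA projection of feature vectors to 2-D shows the same kind of check, namely whether newly generated instances land inside or beyond the cloud formed by existing benchmarks (sketch only, not the authors' pipeline).

```python
import numpy as np

def pca_2d(features):
    """Project feature vectors to 2-D via PCA (a stand-in for the
    paper's t-SNE) to visualise whether new instances extend an
    existing instance cloud."""
    X = np.asarray(features, dtype=float)
    Xc = X - X.mean(axis=0)          # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T             # coordinates on top 2 components
```

Plotting the projected BBOB instances next to the projected generated instances would then reveal whether the latter occupy new regions, which is the observation the authors report for their t-SNE embedding.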

Potential Applications

According to the authors, the resulting library offers a broad, interpretable, and reproducible set of benchmark problems that can support landscape analysis and downstream tasks such as automated algorithm selection. The approach may also inspire further integration of generative AI techniques into research toolchains.

This report is based on the abstract of the research paper, an open-access preprint on arXiv. The full text is available via arXiv.

End of Transmission

Original source
