NeoChainDaily
02.02.2026 • 05:05 Research & Innovation

New Bi-Level Data Reweighting Method Improves LLM Judges for Code Generation

A team of AI researchers introduced a reasoning-based large language model (LLM) judge named DAJ that leverages bi‑level data reweighting to enhance test‑time scaling for code generation. The approach was detailed in a preprint posted to arXiv in January 2026 and aims to address reliability issues that arise when selecting the best candidate solution from multiple model outputs.

Background

Test‑time scaling for code generation typically employs a Best‑of‑N strategy, where a base model produces several candidate programs and an LLM judge evaluates them to select the most promising one. This pipeline has become a standard method for improving the quality of generated code without increasing model size.
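The Best‑of‑N pipeline described above can be sketched in a few lines. Here `generate_candidate` and `judge_score` are hypothetical placeholders for calls to a base code model and a judge model; they are not part of the paper's implementation.

```python
def best_of_n(problem, generate_candidate, judge_score, n=8):
    """Generate n candidate programs and return the judge's top pick."""
    candidates = [generate_candidate(problem) for _ in range(n)]
    # The LLM judge scores each candidate; the highest-scoring one is selected.
    scores = [judge_score(problem, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```

The quality of the final answer therefore hinges entirely on how reliably `judge_score` ranks candidates, which is the problem DAJ targets.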

Challenges in LLM Judge Training

Training dependable LLM judges is complicated by several distribution shifts. These include imbalances between easy and hard coding problems, mismatches between the tasks used for training and the benchmarks used for evaluation, and trajectory mismatches that stem from training data generated by cheaper models whose behavior differs from that of the models deployed at inference time.

Proposed DAJ Framework

The DAJ system introduces a bi‑level learning framework that assigns importance weights to training data, either at the domain level or the individual instance level. By reweighting data, the model automatically prioritizes hard problems, in‑distribution samples, and data that aligns with the inference‑time trajectory, eliminating the need for manually crafted heuristics.
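A minimal sketch of instance‑level reweighting, assuming the importance weights are obtained by normalizing learnable per‑example logits with a softmax (an assumption; the paper's exact parameterization may differ):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_judge_loss(per_example_losses, weight_logits):
    """Reweight each training example's loss by a learned importance weight.

    The outer loop (described below in the article) can raise the logits of
    hard or in-distribution examples so they dominate the judge's training.
    """
    weights = softmax(weight_logits)
    return sum(w * loss for w, loss in zip(weights, per_example_losses))
```

With uniform logits this reduces to an ordinary average; shifting a logit upward shifts training emphasis toward that example.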

Training Strategy

Within the bi‑level setup, the outer loop optimizes the data‑importance weights to maximize performance on a held‑out meta‑set that mirrors target benchmarks such as LiveCodeBench and BigCodeBench. The inner loop trains the LLM judge using these weighted samples, producing verifiable rewards that guide the learning process.
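The interaction between the two loops can be sketched as follows. This is a deliberate simplification: the finite‑difference outer update stands in for whatever meta‑gradient computation the authors actually use, and `train_judge` and `meta_loss` are hypothetical callbacks for the inner training loop and the held‑out meta‑set evaluation.

```python
def bilevel_reweight(train_judge, meta_loss, weight_logits, outer_steps=50, lr=1.0):
    """Outer loop of a bi-level reweighting sketch.

    train_judge(logits) runs the inner loop: it fits a judge on data
    reweighted by the given importance logits and returns the trained judge.
    meta_loss(judge) evaluates that judge on a held-out meta-set.
    """
    eps = 1e-3  # finite-difference step (illustrative stand-in for a meta-gradient)
    for _ in range(outer_steps):
        for i in range(len(weight_logits)):
            up, down = list(weight_logits), list(weight_logits)
            up[i] += eps
            down[i] -= eps
            # Estimate how meta-set loss changes as this example's weight grows,
            # then move the logit in the direction that lowers that loss.
            grad = (meta_loss(train_judge(up)) - meta_loss(train_judge(down))) / (2 * eps)
            weight_logits[i] -= lr * grad
    return weight_logits
```

Examples whose upweighting improves meta‑set performance end up with larger logits, which is the mechanism by which hard and benchmark‑aligned data get prioritized automatically.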

Empirical Results

Experimental evaluation reported that DAJ achieved state‑of‑the‑art results on both LiveCodeBench and BigCodeBench. The method outperformed strong test‑time scaling baselines and even surpassed leading proprietary models, demonstrating the effectiveness of data reweighting for LLM‑as‑a‑Judge training.

Implications

By automatically emphasizing challenging and benchmark‑aligned data, DAJ offers a scalable solution for improving LLM judges without extensive manual tuning. The authors suggest that the bi‑level reweighting approach could be extended to other domains where LLM evaluation plays a critical role, potentially broadening its impact across AI‑driven software development.

This report is based on the abstract of a research paper posted to arXiv as an open-access preprint; the full text is available via arXiv.
