New Benchmark Highlights LLMs’ Sensitivity to Instruction Order
Overview of the RIFT Testbed
A new benchmark called the Reordered Instruction Following Testbed (RIFT) has been introduced to evaluate how large language models handle non‑sequential instruction flows. The study, posted to arXiv in January 2026, examines whether models can maintain task performance when the order of instructions within a prompt is altered. The researchers constructed the benchmark to isolate the effect of prompt topology from overall task difficulty.
Design of Prompt Structures
RIFT employs rephrased Jeopardy! question‑answer pairs as a controlled content set. Two distinct prompt configurations are used: linear prompts that present information in a sequential order, and jumping prompts that retain identical content but require the model to navigate the instructions out of sequence. This separation allows the test to focus solely on structural ordering.
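The abstract does not give the exact prompt templates, but the contrast between the two configurations can be illustrated with a small sketch. The step labels, "go to" directives, and shuffling scheme below are illustrative assumptions, not the benchmark's actual format:

```python
import random

def linear_prompt(steps: list[str]) -> str:
    """Present instructions in their natural, sequential order."""
    lines = [f"Step {i + 1}: {s}" for i, s in enumerate(steps)]
    return "Follow the steps in the order written:\n" + "\n".join(lines)

def jumping_prompt(steps: list[str], seed: int = 0) -> str:
    """Same content, but laid out out of order; the model must follow
    explicit jump directives to recover the intended sequence."""
    order = list(range(len(steps)))
    random.Random(seed).shuffle(order)  # scramble the physical layout
    # Map each logical step to the line it now occupies
    position = {step_idx: pos for pos, step_idx in enumerate(order)}
    lines = []
    for pos, step_idx in enumerate(order):
        nxt = position.get(step_idx + 1)  # layout line of the next logical step
        jump = f" Then go to line {nxt + 1}." if nxt is not None else ""
        lines.append(f"Line {pos + 1} (step {step_idx + 1}): {steps[step_idx]}.{jump}")
    return "Start at the line labelled step 1 and follow the jumps:\n" + "\n".join(lines)

steps = ["Read the clue", "Recall the answer", "Phrase it as a question"]
print(linear_prompt(steps))
print(jumping_prompt(steps))
```

Both functions emit identical content, so any performance gap between the two conditions can be attributed to the ordering structure alone.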
Evaluation Results
The authors evaluated six state‑of‑the‑art open‑source LLMs across roughly 10,000 individual trials. Under the jumping condition, accuracy declined by as much as 72% compared with the linear baseline, indicating a pronounced dependence on positional continuity.
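To make the headline figure concrete, the sketch below shows how a relative decline of this kind is typically computed from per‑trial outcomes. The trial data here are invented for illustration and are not the paper's results:

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of trials answered correctly."""
    return sum(results) / len(results)

def relative_decline(linear: list[bool], jumping: list[bool]) -> float:
    """Fractional drop of jumping-condition accuracy vs. the linear baseline."""
    base = accuracy(linear)
    return (base - accuracy(jumping)) / base

# Illustrative numbers only: 90% linear accuracy falling to 25%
linear = [True] * 90 + [False] * 10
jumping = [True] * 25 + [False] * 75
print(f"decline: {relative_decline(linear, jumping):.0%}")  # ~72%
```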
Error Attribution
Subsequent error analysis revealed that about 50% of the failures could be traced to violations of the intended instruction order or to semantic drift as the model attempted to reconcile out‑of‑order cues.
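The abstract does not describe the authors' attribution procedure. The following sketch merely illustrates one way failures could be tagged into the two reported categories; the heuristics and inputs are hypothetical:

```python
from enum import Enum

class Failure(Enum):
    ORDER_VIOLATION = "executed steps out of the intended order"
    SEMANTIC_DRIFT = "answer drifted from the content while reconciling jumps"
    OTHER = "unrelated failure (formatting, refusal, etc.)"

def classify(intended_order: list[int], executed_order: list[int],
             answer_matches_content: bool) -> Failure:
    """Tag a failed trial with a coarse error category (hypothetical rules)."""
    if executed_order != intended_order:
        return Failure.ORDER_VIOLATION
    if not answer_matches_content:
        return Failure.SEMANTIC_DRIFT
    return Failure.OTHER

# Example: the model visited step 3 before step 2
print(classify([1, 2, 3], [1, 3, 2], answer_matches_content=True))
```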
Implications for Real‑World Applications
These findings suggest that current architectures treat instruction following as a sequential pattern rather than a more general reasoning capability. Consequently, applications that rely on non‑linear control flow—such as workflow automation tools and multi‑agent coordination systems—may encounter reliability challenges.
Context Within Existing Benchmarks
Previous benchmarks have often conflated task complexity with the underlying prompt structure, making it difficult to determine whether performance drops stem from the difficulty of the content or from its ordering. RIFT addresses this gap by keeping content constant while varying only the structural layout.
Future Research Directions
The authors recommend developing model designs or training regimes that explicitly model instruction hierarchy and support order‑agnostic reasoning. Such advances could mitigate the observed sensitivity and broaden the applicability of LLMs in complex, multi‑step environments.

This report is based on information from arXiv, licensed under Academic Preprint / Open Access, and draws on the abstract of the research paper; the full text is available via arXiv.