New InSPO Method Aims to Enhance LLM Alignment by Leveraging Intrinsic Self-Reflection
Paper: InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization
Background
A new algorithm called Intrinsic Self-reflective Preference Optimization (InSPO) has been introduced to improve the alignment of large language models (LLMs) by addressing limitations of existing Direct Preference Optimization (DPO) methods. The paper was submitted to arXiv on 29 December 2025 and revised on 30 December 2025.
Limitations of Existing Methods
DPO and its variants are widely used because they are simple and can be applied offline, but the authors identify two fundamental drawbacks: dependence on arbitrary modeling choices, namely the scalarization function and the reference policy, and the isolation of response generation from the comparative pairwise data used for training.
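Both dependences are visible in the standard DPO objective, which scores a preferred/rejected response pair via log-ratios against a frozen reference policy and collapses the margin through a log-sigmoid. The sketch below uses plain scalar log-probabilities for illustration; the function name and interface are not from the paper.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss on summed per-response log-probabilities.

    The ref_* terms tie the loss to an arbitrary frozen reference
    policy, and the log-sigmoid is one particular scalarization --
    the two modeling choices the InSPO authors criticize. Note also
    that the loss only *scores* a given pair: generation never sees
    the alternative response, which is the second drawback.
    """
    # Log-ratios against the reference policy (the "reference choice")
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # Log-sigmoid maps the margin to a scalar loss (the "scalarization choice")
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A larger margin in favor of the chosen response yields a smaller loss, and the loss reduces to log 2 when the two log-ratios are equal.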
Proposed Approach
InSPO proposes a globally optimal policy that conditions on both the original context and alternative responses, thereby enabling the model to reflect on its own outputs. The authors prove that this formulation outperforms DPO and reinforcement learning from human feedback (RLHF) while remaining invariant to the choice of scalarization function and reference policy.
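The abstract does not spell out how the conditioning on alternative responses is realized. One natural way to condition a policy π(y | x, y_alt) without architectural changes is to fold the alternative into the model's input; the template below is an illustrative guess, not the authors' actual formulation.

```python
def self_reflective_input(context: str, alternative: str) -> str:
    """Hypothetical input construction for a policy pi(y | x, y_alt).

    InSPO conditions generation on the original context *and* an
    alternative response; appending the alternative to the prompt is
    one simple way to do that with an unmodified LLM. The section
    markers here are invented for illustration.
    """
    return (
        f"{context}\n\n"
        f"[Alternative response]\n{alternative}\n\n"
        "[Improved response]\n"
    )
```

Because the alternative appears in the input, the model can compare candidate responses while generating, rather than only being scored against them after the fact.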
Implementation
The method is designed as a plug-and-play enhancement: it requires no changes to the model architecture and adds no inference overhead. The same LLM can be trained with InSPO by modifying only the preference-optimization objective.
Experimental Results
Experimental evaluation reported in the paper shows consistent improvements in win‑rate metrics across benchmark datasets, as well as better performance on length‑controlled generation tasks. The authors attribute these gains to the model’s ability to internally compare and refine candidate responses.
Contributions
The study lists three primary contributions: (1) a formal definition of intrinsic self‑reflection for preference optimization, (2) theoretical guarantees of optimality and invariance, and (3) empirical evidence of superior alignment outcomes.
Publication Details
The work was authored by Yu Li, Tian Lan, and Zhengling Qi, and is classified under the artificial‑intelligence (cs.AI) and machine‑learning (cs.LG) subjects on arXiv. The paper is accessible via DOI https://doi.org/10.48550/arXiv.2512.23126.
This report is based on the abstract of the paper as posted on arXiv, an open-access preprint server; the full text is available via arXiv.
End of transmission