Study Finds Base Language Model Shapes Reward Model Value Preferences
Researchers examined ten leading open-weight reward models (RMs) derived from various large language models (LLMs) and identified systematic differences in value alignment that correspond to the underlying base model. Using validated psycholinguistic corpora, the analysis revealed a consistent preference for “agency” in RMs built on Llama and a corresponding preference for “communion” in RMs based on Gemma.
Methodology
The team applied the “Big Two” psychological axes—agency and communion—to quantify value dimensions across models. Identical preference datasets and fine‑tuning procedures were used for each RM to isolate the influence of the base LLM while controlling for other variables.
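As a rough illustration of how text can be quantified along these two axes, the sketch below scores responses with lexicon counts. The word lists are invented stand-ins for the validated psycholinguistic corpora the study actually used, and the scoring function is a simplified assumption, not the paper's method.

```python
# Hypothetical lexica: illustrative stand-ins for the validated
# psycholinguistic corpora used in the study.
AGENCY_WORDS = {"achieve", "assert", "compete", "lead", "win"}
COMMUNION_WORDS = {"care", "share", "support", "trust", "help"}

def big_two_score(text: str) -> dict:
    """Return normalized agency and communion scores for one text."""
    tokens = text.lower().split()
    if not tokens:
        return {"agency": 0.0, "communion": 0.0}
    agency = sum(t in AGENCY_WORDS for t in tokens) / len(tokens)
    communion = sum(t in COMMUNION_WORDS for t in tokens) / len(tokens)
    return {"agency": agency, "communion": communion}

# Compare two candidate responses the way a simple value probe might:
a = big_two_score("I will lead the team and win the competition")
b = big_two_score("We should support each other and share the work")
print(a["agency"] > b["agency"])        # True: agency-leaning response
print(b["communion"] > a["communion"])  # True: communion-leaning response
```

In the study itself, such per-response value scores would then be correlated with the reward model's preference judgments across many response pairs.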
Key Findings
Statistical evaluation showed that the observed preferences persisted even when the preference data and fine‑tuning process were held constant. Further analysis traced the effect to log‑probability outputs of the instruction‑tuned and pre‑trained models, indicating that these logits encode implicit value signals.
Implicit Reward Scores
By reformulating the log‑probability differences as an implicit reward model, the researchers derived usable implicit reward scores. These scores reproduced the same agency versus communion distinction, confirming that the underlying LLMs contribute measurable bias to the RM outputs.
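One plausible reading of this reformulation is the DPO-style implicit reward, where a response's reward is proportional to the log-probability difference between the tuned model and its pre-trained reference. The function and the toy log-probabilities below are illustrative assumptions, not values from the paper.

```python
def implicit_reward(logp_tuned: float, logp_base: float, beta: float = 1.0) -> float:
    """Implicit reward of a response: scaled log-probability difference
    between the instruction-tuned model and the pre-trained reference,
    i.e. r(x, y) = beta * (log p_tuned(y|x) - log p_base(y|x))."""
    return beta * (logp_tuned - logp_base)

# A response the tuned model upweights relative to the base model
# receives a positive implicit reward; a downweighted one, negative.
r_up = implicit_reward(logp_tuned=-4.2, logp_base=-6.0)    # 1.8
r_down = implicit_reward(logp_tuned=-7.5, logp_base=-5.0)  # -2.5
print(r_up > 0 > r_down)  # True
```

Under this reading, comparing implicit rewards across agency- and communion-flavored responses would surface the same value distinction the trained RMs exhibit.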
Ablation Tests
Additional experiments varied the source and quantity of preference data, demonstrating that the base‑model influence remained durable across different data conditions. The effect was repeatable and did not diminish with reduced preference information.
Implications for Alignment
The findings suggest that reward models inherit value biases from the pretrained LLMs on which they are built, underscoring the importance of safety and alignment considerations during the pre‑training stage rather than solely at the fine‑tuning phase.
Recommendations for Developers
Open‑source developers are advised to evaluate both performance metrics and value orientation when selecting a base model for reward‑model development, as the choice may affect downstream alignment outcomes.
This report is based on the abstract of a research paper distributed as an open-access preprint; the full text is available via arXiv.