New Framework Reveals Instability in Model Explanations Across Training Runs
In December 2025, researchers posted a study on arXiv introducing a diagnostic tool for assessing whether the explanations produced by machine-learning models remain consistent when those models are retrained on the same data. The work asks whether high-accuracy models rely on a single internal logic or on multiple, potentially competing mechanisms.
Introducing EvoXplain
The proposed framework, named EvoXplain, treats each model explanation as a sample drawn directly from the stochastic optimization process rather than aggregating predictions or forming ensembles. By examining the distribution of these samples, EvoXplain determines whether they converge into a single coherent explanation or diverge into distinct explanatory modes.
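The abstract does not include an implementation, but the core sampling idea can be sketched in a few lines of Python. Everything below (the Breast Cancer dataset loader, the Random Forest model, the number of runs, and the use of feature importances as the explanation) is an illustrative assumption, not the paper's actual code:

```python
# Minimal sketch of the EvoXplain sampling idea (illustrative, not the
# paper's implementation): retrain the same model class on a fixed split
# and treat each run's feature importances as one explanation sample.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Fix the split so that variation comes only from training stochasticity.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

explanations = []
for seed in range(50):
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_tr, y_tr)
    # Each run contributes one point in "explanation space".
    explanations.append(model.feature_importances_)

E = np.vstack(explanations)  # shape: (n_runs, n_features)
```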
Methodological Approach
Instead of focusing on a single trained instance, the authors repeatedly train the same model class on an identical data split, capturing the resulting explanations each time. The analysis then quantifies the degree of multimodality present in the explanation space, providing a metric for explanatory stability.
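The abstract does not say which multimodality measure the authors use. Continuing the sketch above, one hypothetical proxy is to fit Gaussian mixtures with one, two, or three components to the explanation samples and compare them by BIC; a best fit with more than one component points to distinct explanatory modes:

```python
# Hypothetical multimodality check, continuing from the sketch above
# (the paper's actual metric may differ).
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Project to two dimensions first: mixture fits are unreliable with
# ~50 samples in a 30-dimensional feature-importance space.
E_low = PCA(n_components=2, random_state=0).fit_transform(E)

bic = {
    k: GaussianMixture(n_components=k, random_state=0).fit(E_low).bic(E_low)
    for k in (1, 2, 3)
}
best_k = min(bic, key=bic.get)
print(f"BIC by component count: {bic}")
print(f"Best fit has {best_k} mode(s); more than one suggests multimodality.")
```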
Empirical Evaluation
The framework was applied to two widely used datasets—Breast Cancer and COMPAS—using two common model families: Logistic Regression and Random Forests. All models attained high predictive accuracy on these benchmarks, yet their explanatory outputs frequently displayed clear multimodal patterns.
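The reported pattern, high accuracy alongside divergent explanations, can also be checked within the same sketch by contrasting per-run test accuracy with the separation of clusters in explanation space. The two-cluster choice and the silhouette score are illustrative assumptions, not the paper's protocol:

```python
# Continuing the sketch: contrast predictive stability with explanatory
# spread (cluster count and metric are illustrative choices).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

accs = []
for seed in range(50):
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_tr, y_tr)
    accs.append(model.score(X_te, y_te))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(E)
print(f"test accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
print(f"explanation-cluster silhouette: {silhouette_score(E, labels):.3f}")
# Tightly grouped high accuracy next to a clearly positive silhouette would
# indicate well-separated explanatory modes despite equivalent performance.
```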
Unexpected Findings
Even models traditionally regarded as stable, such as Logistic Regression, produced multiple well-separated explanatory basins across repeated training runs. The authors report that these variations could not be attributed to differences in hyperparameter settings or to simple trade-offs between accuracy and interpretability.
Implications for Interpretability
EvoXplain does not aim to identify a single “correct” explanation; rather, it makes the presence of explanatory instability visible and quantifiable. The results suggest that single‑instance or averaged explanations may obscure underlying heterogeneous decision‑making processes.
Broader Perspective
By reframing interpretability as a property of an entire model class under repeated instantiation, the study encourages practitioners to consider explanation variability as an intrinsic characteristic of machine‑learning systems, especially in high‑stakes domains such as healthcare and criminal justice.
This report is based on the abstract of an open-access preprint posted to arXiv; the full text is available via arXiv.