New Framework Reveals Instability in Model Explanations Across Training Runs
In December 2025, researchers posted a study on arXiv introducing a diagnostic tool for assessing whether the explanations produced by machine-learning models remain consistent when those models are retrained on the same data. The work asks whether high-accuracy models rely on a single internal logic or on multiple, potentially competing mechanisms.
Introducing EvoXplain
The proposed framework, named EvoXplain, treats each model explanation as a sample drawn directly from the stochastic optimization process rather than aggregating predictions or forming ensembles. By examining the distribution of these samples, EvoXplain determines whether they converge into a single coherent explanation or diverge into distinct explanatory modes.
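The abstract does not include an implementation, but the core sampling idea can be sketched in a few lines of Python. Everything below (the Breast Cancer dataset loader, the Random Forest model, the number of runs, and the use of feature importances as the explanation) is an illustrative assumption, not the paper's actual code:

```python
# Minimal sketch of the EvoXplain sampling idea (illustrative, not the
# paper's implementation): retrain the same model class on a fixed split
# and treat each run's feature importances as one explanation sample.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Fix the split so that variation comes only from training stochasticity.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

explanations = []
for seed in range(50):
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_tr, y_tr)
    # Each run contributes one point in "explanation space".
    explanations.append(model.feature_importances_)

E = np.vstack(explanations)  # shape: (n_runs, n_features)
```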
Methodological Approach
Instead of focusing on a single trained instance, the authors repeatedly train the same model class on an identical data split, capturing the resulting explanations each time. The analysis then quantifies the degree of multimodality present in the explanation space, providing a metric for explanatory stability.
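The abstract does not say which multimodality measure the authors use. Continuing the sketch above, one hypothetical proxy is to fit Gaussian mixtures with one, two, or three components to the explanation samples and compare them by BIC; a best fit with more than one component points to distinct explanatory modes:

```python
# Hypothetical multimodality check, continuing from the sketch above
# (the paper's actual metric may differ).
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Project to two dimensions first: mixture fits are unreliable with
# ~50 samples in a 30-dimensional feature-importance space.
E_low = PCA(n_components=2, random_state=0).fit_transform(E)

bic = {
    k: GaussianMixture(n_components=k, random_state=0).fit(E_low).bic(E_low)
    for k in (1, 2, 3)
}
best_k = min(bic, key=bic.get)
print(f"BIC by component count: {bic}")
print(f"Best fit has {best_k} mode(s); more than one suggests multimodality.")
```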
Empirical Evaluation
The framework was applied to two widely used datasets—Breast Cancer and COMPAS—using two common model families: Logistic Regression and Random Forests. All models attained high predictive accuracy on these benchmarks, yet their explanatory outputs frequently displayed clear multimodal patterns.
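The reported pattern, high accuracy alongside divergent explanations, can also be checked within the same sketch by contrasting per-run test accuracy with the separation of clusters in explanation space. The two-cluster choice and the silhouette score are illustrative assumptions, not the paper's protocol:

```python
# Continuing the sketch: contrast predictive stability with explanatory
# spread (cluster count and metric are illustrative choices).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

accs = []
for seed in range(50):
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_tr, y_tr)
    accs.append(model.score(X_te, y_te))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(E)
print(f"test accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
print(f"explanation-cluster silhouette: {silhouette_score(E, labels):.3f}")
# Tightly grouped high accuracy next to a clearly positive silhouette would
# indicate well-separated explanatory modes despite equivalent performance.
```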
Unexpected Findings
Even models traditionally regarded as stable, such as Logistic Regression, produced multiple well-separated explanatory basins across repeated training runs. The authors report that these variations could not be attributed to differences in hyperparameter settings or to simple trade-offs between accuracy and interpretability.
Implications for Interpretability
EvoXplain does not aim to identify a single “correct” explanation; rather, it makes the presence of explanatory instability visible and quantifiable. The results suggest that single‑instance or averaged explanations may obscure underlying heterogeneous decision‑making processes.
Broader Perspective
By reframing interpretability as a property of an entire model class under repeated instantiation, the study encourages practitioners to consider explanation variability as an intrinsic characteristic of machine‑learning systems, especially in high‑stakes domains such as healthcare and criminal justice.
This report is based on the abstract of an open-access preprint posted to arXiv; the full text is available via arXiv.