NeoChainDaily
01.01.2026 • 05:31 • Research & Innovation

Study Shows Language Models Can Explain Their Own Computations

A recent arXiv preprint, posted in November 2025, reports that researchers have successfully fine‑tuned language models to generate natural‑language descriptions of their internal operations. The work investigates whether models can leverage privileged access to their own activations to produce reliable explanations.

Methodology

The authors employed established interpretability techniques as a source of ground‑truth data, creating tens of thousands of example explanations. These examples covered three dimensions: the information encoded by individual model features, the causal relationships among internal activations, and the influence of specific input tokens on the model’s output.
Training involved fine‑tuning the same language model whose internal computations were being described, allowing it to learn a mapping from internal states to coherent textual descriptions. The resulting “explainer” models were then evaluated on queries that were not present in the training set.
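The training setup described above can be sketched in miniature. The snippet below is purely illustrative: the `ExplanationExample` structure, field names, and the example feature description are assumptions for exposition, not taken from the paper.

```python
# Hypothetical sketch of how supervised examples for an "explainer" model
# might be assembled: a query about an internal quantity is paired with a
# ground-truth explanation produced by an interpretability technique.
from dataclasses import dataclass


@dataclass
class ExplanationExample:
    """One supervised training example for the explainer model."""
    dimension: str  # "feature", "causal", or "token_influence"
    query: str      # natural-language question posed to the model
    target: str     # ground-truth explanation from an interpretability tool


def build_feature_example(feature_id: int, description: str) -> ExplanationExample:
    """Pair a feature's ground-truth description with a query about it.

    The feature id and description here are placeholders; in practice the
    description would come from an automated interpretability pipeline.
    """
    return ExplanationExample(
        dimension="feature",
        query=f"What information does feature {feature_id} encode?",
        target=description,
    )


example = build_feature_example(4217, "tokens related to legal terminology")
```

Tens of thousands of such examples, spanning all three dimensions, would then serve as the fine‑tuning dataset.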

Findings

Results indicate that the explainer models achieved non‑trivial generalization, correctly describing unseen internal computations across the three targeted dimensions. Notably, when a model was tasked with explaining its own activations, performance consistently exceeded that of a separate, larger model attempting to explain a different model’s internals.
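The self‑ versus cross‑model comparison amounts to scoring each explainer on the same held‑out queries. As a hedged sketch, the snippet below uses simple exact‑match accuracy as a stand‑in metric; the paper's actual evaluation protocol and data are not reproduced here.

```python
# Illustrative comparison: a model explaining its own internals versus a
# separate model explaining another model's internals. Exact-match accuracy
# is an assumed stand-in for the paper's evaluation metric.

def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of held-out queries answered correctly (exact match)."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Toy held-out ground truth and two explainers' answers (fabricated labels).
truth = ["legal terms", "negation", "dates"]
self_explainer = ["legal terms", "negation", "dates"]       # explains itself
cross_explainer = ["legal terms", "punctuation", "dates"]   # explains another model
```

Under the paper's reported pattern, the self‑explainer's score would exceed the cross‑model explainer's, as it does for these toy inputs.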

Implications and Future Work

The authors suggest that self‑generated explanations could serve as a scalable complement to existing interpretability methods, potentially reducing the reliance on labor‑intensive manual analysis. By automating the description of internal processes, such techniques may accelerate debugging and safety assessments for increasingly complex models.
The research team has made both the codebase and the dataset publicly available on GitHub, inviting further replication and extension by the broader AI community.
This report is based on the abstract of the research paper, an open‑access preprint; the full text is available via arXiv.

End of Transmission
