NeoChainDaily
13.01.2026 • 05:16 Research & Innovation

Distillation May Increase Membership Inference Risks in Large Language Models

A recent study posted on arXiv in May 2025 examines how knowledge distillation influences membership inference attacks (MIAs) on large language models (LLMs). The researchers evaluate privacy outcomes across multiple teacher‑student configurations and attack techniques, aiming to determine whether compression inherently safeguards sensitive training data.

Privacy Threats in LLM Training

LLMs are trained on extensive corpora that can inadvertently embed personally identifiable or confidential information. Membership inference attacks exploit model outputs to infer whether a specific data point was part of the training set, posing significant privacy concerns for both users and data providers.
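To make the threat concrete, the sketch below shows a minimal loss-based membership inference attack against a causal language model, assuming a PyTorch model with a Hugging Face-style interface (`model(input_ids).logits`). It illustrates the general attack class rather than any of the specific methods evaluated in the paper: examples the model fits unusually well (low loss) are scored as likely training members.

```python
import torch
import torch.nn.functional as F

def loss_based_mia_scores(model, token_batches):
    """Score candidate examples for membership: a lower language-modeling
    loss is treated as evidence that the example was seen in training."""
    scores = []
    model.eval()
    with torch.no_grad():
        for input_ids in token_batches:            # each: (1, seq_len) LongTensor
            logits = model(input_ids).logits       # (1, seq_len, vocab_size)
            # Shift so each position predicts the next token.
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
            )
            scores.append(-loss.item())            # higher score = more "member-like"
    return scores

# An attacker then picks a threshold (e.g. calibrated on reference data)
# and labels every example scoring above it as a training-set member.
```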

Knowledge Distillation and Assumed Benefits

Knowledge distillation compresses a large teacher model into a smaller student model by transferring learned representations. Practitioners often assume that the reduction in model capacity and the smoothing effect of teacher predictions improve privacy, though systematic evidence has been limited.
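For reference, a common form of soft-label distillation minimizes the KL divergence between the teacher's temperature-smoothed output distribution and the student's. The short PyTorch sketch below shows this standard formulation; the paper's exact training recipe is not detailed in the abstract.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label knowledge distillation: the student is trained to match
    the teacher's temperature-smoothed token distribution."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in standard distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```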

Experimental Design

The authors construct six distinct teacher‑student model pairs and apply six established MIA methods to each. This systematic approach enables a comprehensive comparison of attack success rates between original teachers and their distilled counterparts.
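The evaluation can be pictured as a grid: every attack is run against every teacher and every student, and the resulting membership scores are summarized, for example by ROC AUC, where 0.5 indicates no leakage. The sketch below is a hypothetical illustration of that grid; the model pairs and attack functions are caller-supplied placeholders, not the specific models and attacks from the study.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

def evaluate_grid(model_pairs, attacks, members, non_members):
    """Run every attack against every teacher and student model,
    reporting AUC of the membership scores (0.5 = no leakage)."""
    results = {}
    for (pair_name, teacher, student), (attack_name, attack) in product(model_pairs, attacks):
        for role, model in (("teacher", teacher), ("student", student)):
            scores = attack(model, members) + attack(model, non_members)
            labels = [1] * len(members) + [0] * len(non_members)
            results[(pair_name, role, attack_name)] = roc_auc_score(labels, scores)
    return results
```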

Key Findings on Attack Success

Results reveal that distilled student models do not consistently achieve lower MIA success than the teachers. In several instances, students exhibit substantially higher member‑specific attack rates, contradicting the notion that distillation automatically enhances privacy.

Mechanisms Behind Increased Vulnerability

Analysis attributes the heightened risk to mixed supervision during distillation. For data points that are already vulnerable, teacher predictions often align with ground‑truth labels, leading students to produce overly confident outputs that accentuate the distinction between members and non‑members. Conversely, for less vulnerable points, discrepancies between teacher predictions and true labels generate inconsistent learning signals.
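One way to picture this mechanism is to partition distillation examples by whether the teacher confidently agrees with the ground-truth label. The sketch below is a hypothetical diagnostic along those lines, not the paper's procedure, and the confidence threshold is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F

def split_by_teacher_agreement(teacher_logits, labels, confidence_threshold=0.9):
    """Flag examples where the teacher confidently agrees with the ground-truth
    label (supervision that can push the student toward over-confidence) versus
    examples where teacher and label disagree (an inconsistent learning signal)."""
    probs = F.softmax(teacher_logits, dim=-1)        # (N, num_classes)
    confidence, prediction = probs.max(dim=-1)
    agrees = (prediction == labels) & (confidence >= confidence_threshold)
    return agrees  # boolean mask over the batch
```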

Proposed Mitigations

To address these issues, the study introduces three practical interventions: (1) restricting distillation to non‑vulnerable data points, (2) incorporating a low‑dimensional bottleneck projection layer, and (3) applying a normalization variant termed NoNorm. Each method targets the sources of over‑confidence and signal inconsistency identified in the analysis.
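The abstract-level description leaves the implementation details of these interventions open, so the sketch below is only a rough illustration of how the first two might look in PyTorch: filtering flagged points out of the distillation set, and inserting a low-dimensional bottleneck before the output head. The class and function names are hypothetical, and the NoNorm variant is omitted because its definition is not given here.

```python
import torch
import torch.nn as nn

class BottleneckProjection(nn.Module):
    """Illustrative low-dimensional bottleneck placed before the output head,
    limiting how much member-specific detail the student can encode."""
    def __init__(self, hidden_dim, bottleneck_dim, vocab_size):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, vocab_size)

    def forward(self, hidden_states):
        return self.up(torch.tanh(self.down(hidden_states)))

def filter_non_vulnerable(dataset, vulnerable_mask):
    """Intervention (1): distill only on points not flagged as vulnerable."""
    return [example for example, flagged in zip(dataset, vulnerable_mask) if not flagged]
```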

Effectiveness of Interventions

Empirical evaluation shows that the proposed techniques reduce both aggregate and member‑specific MIA success while maintaining model utility. The interventions improve the privacy‑utility trade‑off for distilled LLMs, suggesting viable pathways for safer model compression.

Implications for Future Research

The findings underscore the need for rigorous privacy assessment when employing knowledge distillation and highlight that compression alone cannot be relied upon to protect sensitive training data. Ongoing work will likely explore additional safeguards and broader attack scenarios.

This report is based on the abstract of the research paper, an open-access academic preprint posted on arXiv; the full text is available via arXiv.
