NeoChainDaily
12.01.2026 • 05:15 • Research & Innovation

Study Reveals Agentic LLMs Can Re-Identify Individuals in Public Interview Dataset

Researchers have demonstrated that widely available large language models (LLMs) equipped with web‑search and agentic capabilities can link anonymized interview records to specific scientists. The analysis focused on the Anthropic Interviewer dataset, released by Anthropic on December 4, 2025, which comprises 125 interviews, twenty‑four of them with scientists. Using off‑the‑shelf tools and a series of natural‑language prompts, the author identified six of the twenty‑four scientist interviews by cross‑referencing details with publicly available publications and online profiles. The findings were documented in a paper submitted to arXiv on January 9, 2026.

Methodology Overview

The author employed a standard LLM agent that can execute web searches, retrieve documents, and synthesize information. Each interview was parsed for unique identifiers such as project titles, grant numbers, and specific methodological descriptions. The agent then queried search engines with these cues, collected candidate matches, and ranked them based on textual similarity and contextual relevance. No custom training or fine‑tuning of the model was performed; the workflow relied solely on publicly released capabilities.
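The paper's abstract does not include code, but the loop described above, extracting distinctive cues from a transcript, searching for candidate documents, and ranking them by textual similarity, can be sketched roughly as follows. All names, sample data, and the choice of similarity measure here are illustrative assumptions, not the author's actual implementation:

```python
# A minimal, hypothetical sketch of the cue-extraction and ranking steps.
# In the study, an LLM agent performed these steps; here simple string
# heuristics stand in so the pipeline shape is visible end to end.
from difflib import SequenceMatcher


def extract_cues(transcript: str) -> list[str]:
    # Stand-in for the LLM prompt that pulls out unique identifiers
    # (project titles, grant numbers, methodological descriptions).
    keywords = ("project", "grant", "method")
    return [line.strip() for line in transcript.splitlines()
            if any(k in line.lower() for k in keywords)]


def rank_candidates(cues: list[str],
                    candidates: dict[str, str]) -> list[tuple[str, float]]:
    # Score each candidate document by its best similarity to any cue,
    # then sort descending -- a crude proxy for the agent's ranking step.
    scores = []
    for name, text in candidates.items():
        best = max(SequenceMatcher(None, cue.lower(), text.lower()).ratio()
                   for cue in cues)
        scores.append((name, best))
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Entirely fabricated example data for illustration.
transcript = ("We ran project AuroraSim under grant 12-345.\n"
              "Our method uses spectral clustering.")
candidates = {
    "paper_A": "AuroraSim: spectral clustering for climate models (grant 12-345)",
    "paper_B": "A survey of unrelated topics",
}
ranking = rank_candidates(extract_cues(transcript), candidates)
print(ranking[0][0])  # the top-ranked candidate document
```

The point the study makes is precisely that every step of this shape uses only off‑the‑shelf capabilities, with no custom training required.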

Key Findings

Of the twenty‑four scientist interviews examined, six could be matched to corresponding scholarly works, allowing the interviewees' identities to be reconstructed. In three cases the match was unique, revealing the participant's name, institutional affiliation, and recent publications; the other three matches were ambiguous but still narrowed the pool of possible individuals considerably. The study notes that although this success rate is modest, it is significant given the low effort required.

Privacy Implications

According to the author, the results illustrate a new privacy risk where qualitative datasets, even when stripped of explicit identifiers, may be vulnerable to re‑identification by autonomous agents. Existing de‑identification safeguards—such as removing direct names and contact information—can be circumvented when an attacker decomposes the task into smaller, seemingly benign queries that the LLM can execute autonomously.

Proposed Mitigations

The paper recommends several mitigation strategies, including: (1) limiting the granularity of technical details released in interview transcripts; (2) applying differential privacy techniques to textual data; (3) restricting automated web‑search access to released datasets; and (4) providing clear usage licenses that prohibit re‑identification attempts. The author also suggests that dataset curators conduct adversarial testing with LLM agents before public release.
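As one concrete illustration of mitigation (1), a dataset curator could coarsen high‑risk specifics in transcripts before release. The following sketch is a hypothetical example of such a scrubber; the regex patterns and sample text are illustrative assumptions, not taken from the paper:

```python
# Hypothetical pre-release scrubber: replaces identifiers that make a
# transcript easy to cross-reference (grant numbers, DOIs) with coarse
# placeholders. Patterns are illustrative stand-ins for real rules.
import re

RISKY_PATTERNS = [
    # Grant identifiers such as "grant 12-345" or "grant no. AB-123".
    (re.compile(r"\bgrant\s+(?:no\.\s*)?[\w-]*\d[\w-]*", re.IGNORECASE),
     "grant [REDACTED]"),
    # Bare DOIs such as "doi:10.1000/xyz123".
    (re.compile(r"\bdoi:10\.\d{4,9}/[^\s)]+", re.IGNORECASE),
     "[DOI REDACTED]"),
]


def scrub(transcript: str) -> str:
    """Replace high-risk identifiers with coarse placeholders."""
    for pattern, replacement in RISKY_PATTERNS:
        transcript = pattern.sub(replacement, transcript)
    return transcript


cleaned = scrub("Our work, funded by grant 12-345 (see doi:10.1000/xyz123), ...")
print(cleaned)
```

A curator would pair such coarsening with the adversarial testing the paper recommends: re‑running an LLM agent against the scrubbed transcripts and checking whether any participant can still be re‑identified before the dataset goes public.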

Future Directions

Future research is encouraged to explore the scalability of such attacks across larger, more diverse datasets and to evaluate the effectiveness of proposed defenses in real‑world settings. The author has notified Anthropic of the findings and calls for a broader community discussion on responsible data sharing in the era of capable AI agents.

This report is based on the abstract of the research paper, which is available in full via arXiv (Academic Preprint / Open Access).

End of Transmission

Original source
