New Open Dataset Improves AI-Driven Email Security

New Open Dataset Distinguishes Human and AI-Generated Phishing, Spam, and Legitimate Emails
A research team has released a comprehensive email dataset that distinguishes phishing, spam, and legitimate messages written by humans from those generated by large language models (LLMs). The dataset was made publicly available in November 2025 via the authors’ project website, with the aim of supporting the development of more effective AI‑assisted email security systems.

Dataset Overview

The collection comprises three primary categories: phishing, spam, and legitimate correspondence. Each message is explicitly labeled according to its origin (human or LLM). This provenance labeling enables comparative analyses of human‑written and LLM‑generated content in both malicious and benign email.

Annotation Scheme

Every email is further annotated with an emotional appeal tag, such as urgency, fear, or authority, and a motivation label that reflects the intended outcome, including link‑following, credential theft, or financial fraud. These granular descriptors are intended to capture the persuasive techniques employed in malicious communications.
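
As an illustration of the annotation scheme described above, a single record could be modeled roughly as follows. This is a minimal sketch: the class and field names are hypothetical and not taken from the released dataset, and the example email is invented for demonstration.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PHISHING = "phishing"
    SPAM = "spam"
    LEGITIMATE = "legitimate"


class Origin(Enum):
    HUMAN = "human"
    LLM = "llm"


@dataclass(frozen=True)
class AnnotatedEmail:
    """One hypothetical dataset row; field names are illustrative assumptions."""
    text: str
    category: Category
    origin: Origin
    emotional_appeal: str  # e.g. "urgency", "fear", "authority"
    motivation: str        # e.g. "link-following", "credential theft", "financial fraud"


# Invented example record, not drawn from the dataset.
example = AnnotatedEmail(
    text="Your account will be suspended today unless you verify your password.",
    category=Category.PHISHING,
    origin=Origin.HUMAN,
    emotional_appeal="urgency",
    motivation="credential theft",
)
```

Keeping category and origin as separate fields means the same record type can cover all six combinations of category and origin.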

Benchmarking Methodology

The authors evaluated several large language models on their ability to identify the annotated emotional and motivational cues. After systematic testing, the most reliable model was selected to annotate the entire dataset. To assess classification robustness, a subset of emails was rephrased using multiple LLMs while preserving original meaning and intent, and the annotations were re‑applied.
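
One simple way to quantify the robustness check described above is to measure how often an annotator assigns the same label to an email before and after rephrasing. The sketch below assumes this agreement measure, which may differ from the metric the authors used. The function names are hypothetical, and the keyword-based annotator is only a stand-in for an LLM call so that the example runs on its own.

```python
from typing import Callable, Sequence


def annotation_stability(
    originals: Sequence[str],
    rephrased: Sequence[str],
    annotate: Callable[[str], str],
) -> float:
    """Fraction of email pairs whose label is unchanged after rephrasing."""
    if len(originals) != len(rephrased):
        raise ValueError("originals and rephrased must be aligned pair by pair")
    if not originals:
        raise ValueError("at least one email pair is required")
    unchanged = sum(
        annotate(orig) == annotate(para)
        for orig, para in zip(originals, rephrased)
    )
    return unchanged / len(originals)


def toy_annotator(text: str) -> str:
    """Stand-in for an LLM annotator: a single keyword rule, for illustration only."""
    return "urgency" if "immediately" in text.lower() else "none"


# Invented email pairs; each rephrasing is meant to keep the original intent.
originals = ["Verify your account immediately.", "Lunch is at noon on Friday."]
rephrased = [
    "Please confirm your account as soon as you can.",
    "We will have lunch Friday at noon.",
]
stability = annotation_stability(originals, rephrased, toy_annotator)
```

Here the toy annotator loses the urgency label on the first pair once the keyword is gone, which is the kind of instability the rephrasing experiment is designed to expose.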

Key Findings

Results indicate that the chosen state‑of‑the‑art LLM achieves strong detection performance for phishing emails, correctly recognizing deceptive cues in the majority of cases. However, the same model exhibits persistent difficulty in reliably distinguishing spam messages from legitimate emails, suggesting an area for further refinement.
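
A finding of this kind is usually read off per-class metrics and a confusion matrix. The following sketch shows how per-class recall and confusion counts could be computed from ground-truth and predicted labels. The labels below are invented toy values chosen to show the mechanics; they are not the paper's results, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_counts(gold: Sequence[str], predicted: Sequence[str]) -> Counter:
    """Count (gold label, predicted label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def per_class_recall(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Share of each gold class that was predicted correctly."""
    counts = confusion_counts(gold, predicted)
    totals = Counter(gold)
    return {label: counts[(label, label)] / totals[label] for label in totals}


# Toy labels for illustration only (not data from the study).
gold = ["phishing", "phishing", "spam", "spam", "legitimate", "legitimate"]
predicted = ["phishing", "phishing", "legitimate", "spam", "spam", "legitimate"]

recall = per_class_recall(gold, predicted)
counts = confusion_counts(gold, predicted)
```

In the confusion counts, the off-diagonal pairs ("spam", "legitimate") and ("legitimate", "spam") are the cells that would show confusion between spam and legitimate mail.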

Implications and Resources

The authors argue that the dataset and its accompanying evaluation framework can accelerate research on AI‑driven email security tools. All code, templates, and supporting materials are released on the project site to promote open‑science collaboration.

This report is based on the abstract of a research paper published on arXiv as an open‑access academic preprint. The full text is available via arXiv.
