NeoChainDaily
01.01.2026 • 05:01 Cybersecurity & Exploits

Study Finds Large Language Models Ineffective for Password Cracking

In October 2025, researchers posted a paper on arXiv that examined whether state‑of‑the‑art open‑source large language models could be leveraged to guess user passwords. The investigation focused on three models—TinyLLaMA, Falcon‑RW‑1B, and Flan‑T5—by prompting them with structured synthetic user profiles that included attributes such as name, birthdate, and hobbies. The authors measured success using Hit@1, Hit@5, and Hit@10 metrics against both plaintext passwords and SHA‑256 hashes.

Study Overview

The paper, identified as arXiv:2510.17884v3, frames its inquiry within the broader interest in applying natural‑language understanding to cybersecurity tasks. By generating plausible passwords from user‑specific data, the authors aimed to determine whether LLMs could rival conventional cracking techniques without additional supervised training on leaked password corpora.

Methodology

To create a controlled environment, the researchers synthesized a dataset of user profiles, each paired with a known password. They then crafted prompts that asked each LLM to produce likely passwords for the given attributes. The generated candidates were evaluated against the true passwords, first as plain text and subsequently after hashing with SHA‑256, allowing the calculation of hit‑rate metrics at various ranks.
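The Hit@k evaluation described above can be sketched in a few lines. This is a minimal illustration of the metric, not the authors' code; the function name, the sample guesses, and the profile they imply are assumptions made for the example.

```python
import hashlib

def hit_at_k(candidates, true_password, k, hashed=False):
    """Return True if the true password appears among the top-k candidates.

    With hashed=True, comparison is done on SHA-256 digests instead of
    plaintext, mirroring the paper's second evaluation setting.
    """
    top_k = candidates[:k]
    if hashed:
        target = hashlib.sha256(true_password.encode()).hexdigest()
        top_k = [hashlib.sha256(c.encode()).hexdigest() for c in top_k]
        return target in top_k
    return true_password in top_k

# Hypothetical model output for one synthetic profile
guesses = ["john1990", "john_doe", "doe1990", "johnny90", "jd1990"]
print(hit_at_k(guesses, "doe1990", 1))   # Hit@1 -> False
print(hit_at_k(guesses, "doe1990", 5))   # Hit@5 -> True
```

Averaging these boolean outcomes over all profiles in the dataset yields the Hit@1, Hit@5, and Hit@10 rates reported in the study.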

Performance Metrics

Across all three models, the highest observed accuracy was under 1.5% at Hit@10, with Hit@1 and Hit@5 rates even lower. These figures indicate that, in the tested scenario, the LLMs failed to produce the correct password within the top ten guesses at a rate that would be considered practical for attackers.

Comparison with Traditional Techniques

By contrast, established rule‑based and combinatorial cracking methods achieved substantially higher success rates on the same dataset. The study reports that traditional approaches outperformed the LLMs by an order of magnitude, underscoring the continued relevance of specialized password‑cracking tools.
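To make the contrast concrete, a rule-based approach enumerates candidates by applying simple transformations to profile attributes. The toy generator below is an assumption for illustration only; real tools such as Hashcat or John the Ripper apply far larger rule sets, and the specific rules here are not taken from the paper.

```python
from itertools import product

def rule_based_candidates(profile):
    """Combine profile attributes with common suffix rules to produce
    password guesses (a toy sketch of rule-based generation)."""
    name = profile["name"].lower()
    year = profile["birthdate"][-4:]          # e.g. "1990" from "14.07.1990"
    tokens = [name, name.capitalize(), profile["hobby"].lower()]
    suffixes = ["", year, year[-2:], "123", "!"]
    return [t + s for t, s in product(tokens, suffixes)]

profile = {"name": "John", "birthdate": "14.07.1990", "hobby": "Chess"}
print(rule_based_candidates(profile)[:5])
# -> ['john', 'john1990', 'john90', 'john123', 'john!']
```

Because such rules directly encode how people actually construct passwords, even this naive enumeration targets the candidate space far more efficiently than free-form generative sampling.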

Analysis of Limitations

The authors attribute the poor performance to several factors: limited domain adaptation, insufficient memorization of password‑specific patterns, and the reliance on generic linguistic knowledge rather than targeted security heuristics. Visualizations in the paper highlight how the models’ generative reasoning diverges from the structured logic required for effective password inference.

Implications for Security

According to the findings, current open‑source LLMs do not pose a significant new threat to password security in the absence of fine‑tuning on large, compromised password datasets. Security professionals can therefore continue to prioritize traditional defenses, such as strong password policies and multi‑factor authentication, without immediate concern for LLM‑driven attacks.

Future Directions

The study suggests that future research might explore supervised fine‑tuning of LLMs on leaked password collections, though such work raises ethical and privacy considerations. The authors emphasize the need for robust, privacy‑preserving frameworks should the community pursue deeper investigations into generative models for security testing.

This report is based on the abstract of the research paper, which is available as an open-access preprint on arXiv.

End of Transmission

Original Source
