Prompt-Injection Attacks Threaten Consensus-Generating LLMs in Digital Democracy Experiments
Researchers analyzing large language models (LLMs) used for consensus generation in digital democracy experiments reported that off‑the‑shelf models are vulnerable to prompt‑injection attacks that can skew or erase public‑policy opinions. The study, posted on arXiv, evaluated models such as LLaMA 3.1 8B Instruct, GPT‑4.1 Nano, and Apertus 8B, finding high attack success rates across multiple policy topics.
Methodology
The team created paired sets of prompts: one free of adversarial content and another containing injected text designed to amplify specific viewpoints, suppress others, or divert consensus toward unrelated topics. Opinion and consensus valences were classified using a fine‑tuned BERT model, and Attack Success Rates (ASR) were calculated from 3×3 confusion matrices conditioned on alignment with human‑majority responses.
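The pairing-and-scoring step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the valence encoding (-1 / 0 / +1), and the specific ASR definition (fraction of prompts whose clean-prompt valence matched the human majority but flipped away from it after injection) are all assumptions.

```python
from collections import Counter

VALENCES = (-1, 0, 1)  # against / neutral / in favor (assumed encoding)

def confusion_matrix(clean_labels, injected_labels):
    """Build a 3x3 confusion matrix of (clean valence, injected valence)
    pairs, rows and columns ordered as in VALENCES."""
    counts = Counter(zip(clean_labels, injected_labels))
    return [[counts[(c, i)] for i in VALENCES] for c in VALENCES]

def attack_success_rate(clean_labels, injected_labels, majority):
    """Hypothetical ASR: among prompts whose clean-prompt valence
    aligned with the human-majority valence, the fraction whose
    valence changed after injection."""
    flipped = sum(
        1 for c, i in zip(clean_labels, injected_labels)
        if c == majority and i != majority
    )
    aligned = sum(1 for c in clean_labels if c == majority)
    return flipped / aligned if aligned else 0.0
```

For example, with clean valences `[1, 1, 0, -1, 1]`, injected valences `[-1, 0, 0, -1, 1]`, and a human-majority valence of `+1`, two of the three majority-aligned prompts flip, giving an ASR of 2/3.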
Key Findings
Across the examined topics, the default configurations of LLaMA 3.1 8B Instruct, GPT‑4.1 Nano, and Apertus 8B demonstrated widespread vulnerability. The ASR was especially pronounced for economically and socially conservative parties and for prompts employing rational, instruction‑like rhetorical strategies.
Defense Strategy
A robustness pipeline combining GPT‑OSS‑SafeGuard injection detection, structured opinion representations, and GSPO‑based reinforcement learning was tested. When the system limited analysis to non‑ambiguous consensus outcomes, the pipeline reduced ASR to near zero across all parties and policy clusters.
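The "non-ambiguous outcomes" gating can be illustrated as below. This is a hypothetical sketch: the paper's actual pipeline combines an injection detector, structured opinion representations, and GSPO-based reinforcement learning, and its ambiguity criterion and threshold are not specified in the abstract; the margin rule and the value `0.2` here are invented for illustration.

```python
def filter_non_ambiguous(results, margin=0.2):
    """Keep only consensus outcomes where the classifier's top valence
    probability exceeds the runner-up by at least `margin` (assumed rule).

    `results` is a list of (probabilities, label) pairs, where
    `probabilities` holds the classifier scores over the three valences.
    """
    kept = []
    for probs, label in results:
        top, runner_up = sorted(probs, reverse=True)[:2]
        if top - runner_up >= margin:
            kept.append(label)
    return kept
```

Under this rule, a confident outcome like `(0.7, 0.2, 0.1)` is retained while a near-tie like `(0.4, 0.35, 0.25)` is discarded before ASR is computed.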
Implications for Digital Democracy
The results suggest that while LLMs can facilitate large‑scale opinion aggregation, unchecked prompt manipulation could undermine the legitimacy of digital democratic processes. The identified defenses offer a potential path toward more resilient consensus‑building tools.
Future Directions
Authors recommend extending the evaluation to additional policy domains, larger model families, and real‑world deployment scenarios to further assess both vulnerabilities and mitigation techniques.
This report is based on the abstract of the research paper, posted to arXiv under an Academic Preprint / Open Access license; the full text is available via arXiv.