New Benchmark Assesses Agentic AI on Identity Security Visibility Tasks
A new benchmark, the Sola Visibility ISPM Benchmark, has been introduced to evaluate how agentic artificial-intelligence systems handle identity security posture management (ISPM) visibility tasks across cloud and SaaS platforms. Described in a paper posted to arXiv in January 2026, the benchmark draws on live, production-grade data from Amazon Web Services (AWS), Okta, and Google Workspace to test inventory and configuration-hygiene queries.
Background and Motivation
Enterprises increasingly rely on multiple identity providers, making it difficult to maintain a clear picture of identity inventory and configuration hygiene. According to the paper’s authors, this complexity has spurred interest in AI agents capable of interpreting and answering natural‑language security questions.
Benchmark Design
The authors constructed a set of 77 benchmark questions that cover foundational ISPM visibility tasks. Each question is posed in natural language; the Sola AI Agent translates it into executable data-exploration steps and returns an evidence-backed answer. The benchmark covers two core dimensions, identity inventory and configuration hygiene, on each of the three platforms.
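To make the design concrete, the following is a minimal Python sketch of how a question set organized by platform and dimension could be represented and checked for coverage. It is an illustration only: the class name, field names, platform labels, and the three sample questions are hypothetical and are not taken from the paper or from the actual 77-question set.

```python
from collections import Counter
from dataclasses import dataclass

# Labels are assumptions chosen for this sketch, not the paper's identifiers.
PLATFORMS = ("aws", "okta", "google_workspace")
DIMENSIONS = ("inventory", "hygiene")


@dataclass(frozen=True)
class BenchmarkQuestion:
    """One natural-language question, tagged by platform and dimension."""
    question_id: str
    platform: str
    dimension: str
    text: str

    def __post_init__(self):
        # Reject tags outside the three platforms and two dimensions.
        if self.platform not in PLATFORMS:
            raise ValueError(f"unknown platform: {self.platform}")
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


def coverage(questions):
    """Count questions per (platform, dimension) cell, including empty cells."""
    counts = Counter((q.platform, q.dimension) for q in questions)
    return {(p, d): counts.get((p, d), 0) for p in PLATFORMS for d in DIMENSIONS}


# Hypothetical sample questions, invented for illustration.
EXAMPLE_QUESTIONS = [
    BenchmarkQuestion("q1", "aws", "hygiene",
                      "Which IAM users have access keys older than 90 days?"),
    BenchmarkQuestion("q2", "okta", "inventory",
                      "How many active users exist in Okta?"),
    BenchmarkQuestion("q3", "google_workspace", "hygiene",
                      "Which accounts do not have 2-step verification enabled?"),
]

if __name__ == "__main__":
    for cell, n in coverage(EXAMPLE_QUESTIONS).items():
        print(cell, n)
```

A coverage table of this kind makes it easy to see whether any platform and dimension pair is left without questions.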
Agent Performance
When evaluated against the full suite of questions, the Sola AI Agent achieved an expert accuracy of 0.84 and a strict success rate of 0.77, according to the authors. Performance was highest on AWS hygiene tasks, where expert accuracy reached 0.94. Results for Google Workspace and Okta hygiene tasks were described as moderate but still competitive.
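The report does not define how the two metrics are computed. As a rough sketch, assume each question receives an expert-assigned score between 0 and 1 and a separate all-or-nothing strict pass flag; both figures are then simple averages, overall and per platform and dimension. The scoring model, the names, and the toy results below are assumptions made for illustration, not the authors' method or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QuestionResult:
    platform: str
    dimension: str
    expert_score: float  # assumed: reviewer-assigned score in [0, 1]
    strict_pass: bool    # assumed: True only if the answer is fully correct


def summarize(results):
    """Average both metrics overall and per (platform, dimension) group."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.platform, r.dimension)].append(r)
        groups[("all", "all")].append(r)
    return {
        key: {
            "n": len(rs),
            "expert_accuracy": sum(r.expert_score for r in rs) / len(rs),
            "strict_success_rate": sum(r.strict_pass for r in rs) / len(rs),
        }
        for key, rs in groups.items()
    }


# Toy results invented for this sketch; not figures from the paper.
TOY_RESULTS = [
    QuestionResult("aws", "hygiene", 1.0, True),
    QuestionResult("aws", "hygiene", 0.5, False),
    QuestionResult("okta", "inventory", 1.0, True),
    QuestionResult("okta", "inventory", 0.0, False),
]

if __name__ == "__main__":
    for key, stats in summarize(TOY_RESULTS).items():
        print(key, stats)
```

Under this assumed model, expert accuracy can exceed the strict success rate because partially correct answers earn partial credit on the first metric and none on the second, which is consistent with the reported 0.84 and 0.77.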
Implications for the Field
The benchmark provides a practical, reproducible framework for measuring the effectiveness of agentic AI in identity security contexts. By offering a standardized evaluation method, the work lays groundwork for future ISPM benchmarks that could address more advanced analysis and governance scenarios.
Limitations and Future Work
The authors note that while the agent performed well on AWS‑related queries, its results on Google Workspace and Okta were less robust, suggesting areas for improvement. They propose extending the benchmark to cover additional identity providers and more complex governance tasks.
This report is based on the abstract of a research paper posted to arXiv as an open-access academic preprint. The full text is available via arXiv.