AgentDrive Benchmark Released: Evaluating Autonomous Driving Language Models

Researchers have unveiled AgentDrive, a large‑scale benchmark comprising 300,000 driving scenarios generated by large language models (LLMs), intended to support the training, fine‑tuning, and assessment of autonomous agents in safety‑critical contexts.

Comprehensive Scenario Design

The dataset structures each scenario along seven independent dimensions—scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density—providing a factorized space that captures a wide range of real‑world driving conditions.
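As a rough illustration of what a factorized scenario space looks like, the sketch below models the seven dimensions as fields of a small record and enumerates every combination. The example values for each dimension and the names `DIMENSIONS`, `ScenarioSpec`, and `enumerate_specs` are assumptions made for illustration; the benchmark's actual vocabularies and data format may differ.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical example values; only the seven dimension names come from the report.
DIMENSIONS = {
    "scenario_type": ["highway_merge", "intersection"],
    "driver_behavior": ["cautious", "aggressive"],
    "environment": ["clear_day", "rain_night"],
    "road_layout": ["two_lane", "roundabout"],
    "objective": ["reach_goal", "avoid_collision"],
    "difficulty": ["easy", "hard"],
    "traffic_density": ["low", "high"],
}


@dataclass(frozen=True)
class ScenarioSpec:
    """One point in the seven-dimensional scenario space."""
    scenario_type: str
    driver_behavior: str
    environment: str
    road_layout: str
    objective: str
    difficulty: str
    traffic_density: str


def enumerate_specs(dimensions):
    """Return one ScenarioSpec per combination of dimension values."""
    keys = list(dimensions)
    value_lists = [dimensions[key] for key in keys]
    return [ScenarioSpec(**dict(zip(keys, values))) for values in product(*value_lists)]


specs = enumerate_specs(DIMENSIONS)
print(len(specs))  # 2 values on each of 7 dimensions gives 2**7 = 128 combinations
```

Because the dimensions are independent, adding one value to any single dimension multiplies the number of combinations, which is how a small set of vocabularies can cover a large space of conditions.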

Automated Generation and Validation

An LLM‑driven prompt‑to‑JSON workflow produces simulation‑ready specifications, which are subsequently checked against both physical plausibility constraints and a predefined schema to ensure semantic consistency.
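The two-stage check described above might look something like the following minimal sketch, which uses only the Python standard library. The field names (`ego_speed_mps`, `lead_gap_m`), the speed bound, and the function `validate_spec` are hypothetical; the report does not describe the actual schema or the plausibility rules the authors apply.

```python
import json

# Assumed schema: field name -> accepted Python type(s). Not the benchmark's real schema.
REQUIRED_FIELDS = {
    "scenario_type": str,
    "ego_speed_mps": (int, float),
    "lead_gap_m": (int, float),
}
MAX_SPEED_MPS = 70.0  # assumed upper bound for a plausible road speed


def validate_spec(raw):
    """Return a list of error strings; an empty list means the spec passed."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(spec, dict):
        return ["top-level value must be an object"]

    # Stage 1: schema check (required fields and their types).
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in spec:
            errors.append(f"missing field: {name}")
        elif isinstance(spec[name], bool) or not isinstance(spec[name], expected):
            errors.append(f"wrong type for {name}")
    if errors:
        return errors

    # Stage 2: physical plausibility check on the now well-typed values.
    if not 0.0 <= spec["ego_speed_mps"] <= MAX_SPEED_MPS:
        errors.append("ego_speed_mps outside plausible range")
    if spec["lead_gap_m"] <= 0:
        errors.append("lead_gap_m must be positive")
    return errors


print(validate_spec('{"scenario_type": "highway_merge", "ego_speed_mps": 25, "lead_gap_m": 30}'))
```

Running the schema check first means the plausibility rules can assume well-typed input, so a malformed LLM output is rejected before any numeric comparison is attempted.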

Simulation Rollouts and Safety Metrics

Each scenario undergoes simulated execution, during which surrogate safety metrics are calculated and outcomes are labeled using rule‑based criteria, enabling quantitative comparison of agent performance.
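The report does not say which surrogate safety metrics or labeling rules AgentDrive uses. As one illustrative possibility, the sketch below computes time-to-collision (TTC), a widely used surrogate safety metric, for a simple car-following case and assigns a label from the minimum TTC seen during a rollout. The 1.5-second threshold, the label names, and both function names are assumptions.

```python
import math


def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """TTC in seconds for an ego vehicle following a lead vehicle in the same lane."""
    if gap_m <= 0:
        return 0.0  # vehicles already overlap
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf  # ego is not closing in, so no collision is projected
    return gap_m / closing_speed


def label_rollout(ttc_series, unsafe_ttc_s=1.5):
    """Rule-based outcome label from the worst (minimum) TTC in a rollout."""
    min_ttc = min(ttc_series)
    if min_ttc <= 0:
        return "collision"
    if min_ttc < unsafe_ttc_s:
        return "unsafe"
    return "safe"


# Hypothetical rollout: (gap in m, ego speed in m/s, lead speed in m/s) per time step.
steps = [(30.0, 20.0, 10.0), (20.0, 20.0, 10.0), (12.0, 20.0, 10.0)]
ttcs = [time_to_collision(*step) for step in steps]
print(ttcs, label_rollout(ttcs))
```

A fixed rule like this gives every agent the same pass or fail criterion, which is what makes outcomes comparable across rollouts.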

AgentDrive‑MCQ: A Reasoning Test

In parallel, the authors introduced AgentDrive‑MCQ, a multiple‑choice benchmark containing 100,000 questions that probe five reasoning categories: physics, policy, hybrid, scenario, and comparative reasoning.

Evaluation of Leading LLMs

A large‑scale study evaluated fifty prominent LLMs on the MCQ test. Findings indicate that proprietary frontier models lead in contextual and policy reasoning, while advanced open‑source models are rapidly narrowing the gap in structured and physics‑grounded reasoning.
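Comparing models by reasoning category, as the study does, comes down to computing accuracy separately for each of the five categories. A minimal sketch follows, assuming each graded question is a record with `category`, `answer`, and `predicted` keys; that record layout and the function `accuracy_by_category` are hypothetical and do not reflect the released evaluation code.

```python
from collections import defaultdict

# The five reasoning categories named in the report.
CATEGORIES = ("physics", "policy", "hybrid", "scenario", "comparative")


def accuracy_by_category(records):
    """Return {category: fraction correct} for every category that has questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        correct[category] += int(record["predicted"] == record["answer"])
    return {cat: correct[cat] / total[cat] for cat in CATEGORIES if total[cat]}


# Made-up records purely to show the shape of the input and output.
sample = [
    {"category": "physics", "answer": "B", "predicted": "B"},
    {"category": "physics", "answer": "C", "predicted": "A"},
    {"category": "policy", "answer": "D", "predicted": "D"},
]
print(accuracy_by_category(sample))
```

Reporting per-category scores rather than a single overall number is what lets a study distinguish, for example, strength in policy reasoning from strength in physics-grounded reasoning.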

Public Release and Future Work

The full AgentDrive dataset, the MCQ benchmark, evaluation code, and associated resources have been made publicly available on GitHub, inviting further research into safe and effective autonomous driving agents.

This report is based on information from arXiv. See the original source for license terms; source attribution is required.
