New MineNPC-Task Benchmark Evaluates Memory-Aware LLM Agents in Minecraft
In January 2026, a team of researchers released MineNPC-Task, a benchmark designed to assess memory‑aware, mixed‑initiative large language model (LLM) agents operating within an open‑world Minecraft environment. The benchmark aims to provide a transparent, reproducible framework for measuring how effectively agents plan, act, and utilize memory when interacting with complex, dynamic worlds.
Motivation for Memory‑Aware Evaluation
Current evaluations of embodied agents often rely on synthetic prompts that do not capture the nuanced challenges of real‑time interaction. By focusing on memory management and mixed‑initiative collaboration, the authors seek to address gaps in existing testing methodologies that overlook long‑term context retention and adaptive clarification.
Benchmark Architecture
MineNPC-Task is constructed from user‑authored scenarios generated through formative and summative co‑play sessions with expert Minecraft players. Each scenario is distilled into a parametric template that explicitly defines preconditions, dependency structures, and permissible actions, thereby eliminating reliance on out‑of‑world shortcuts.
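A template of this kind can be sketched as a small data structure. The following is a minimal illustration only, assuming a hypothetical schema; the field names and example task are not taken from the benchmark itself.

```python
from dataclasses import dataclass

@dataclass
class TaskTemplate:
    """Hypothetical sketch of a parametric task template (illustrative schema)."""
    name: str
    preconditions: list[str]   # in-world facts that must hold before the task starts
    dependencies: list[str]    # subtasks that must be completed first
    permitted_actions: set[str]  # whitelist; anything outside it is an out-of-world shortcut

    def ready(self, completed: set[str], world_facts: set[str]) -> bool:
        """A task may begin once its dependencies are done and its preconditions hold."""
        return (set(self.dependencies) <= completed
                and set(self.preconditions) <= world_facts)

# Illustrative example: crafting depends on a gathering subtask.
craft = TaskTemplate(
    name="craft_pickaxe",
    preconditions=["has_crafting_table"],
    dependencies=["gather_wood"],
    permitted_actions={"open_inventory", "craft"},
)
print(craft.ready(completed={"gather_wood"}, world_facts={"has_crafting_table"}))  # → True
print(craft.ready(completed=set(), world_facts={"has_crafting_table"}))            # → False
```

Making preconditions and dependencies explicit in this way is what allows a checker, rather than a human judge, to decide whether an agent attempted a task legally.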
Validation and Policy Constraints
To ensure objective assessment, the framework incorporates machine‑checkable validators that operate under a bounded‑knowledge policy. Validators verify plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts, reporting outcomes based solely on in‑world evidence.
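A validator of this kind can be expressed as a pure function over an in‑world event log, never over the agent's internal reasoning. The sketch below is an assumption‑laden illustration; the event kinds and check names are hypothetical, not the benchmark's actual validator API.

```python
# Minimal sketch of a machine-checkable validator under a bounded-knowledge
# policy: it inspects only the recorded in-world event log (illustrative event
# kinds), never the agent's hidden state or chain of thought.
def validate_episode(events: list[dict]) -> dict[str, bool]:
    """Report which required behaviors are evidenced by the event log."""
    kinds = [e["kind"] for e in events]
    return {
        "previewed_plan": "plan_preview" in kinds,
        # Clarification counts only if it precedes the first action.
        "clarified_before_acting": (
            "clarification" in kinds and "action" in kinds
            and kinds.index("clarification") < kinds.index("action")
        ),
        "checked_preconditions": "precondition_check" in kinds,
        "used_memory": any(k in kinds for k in ("memory_read", "memory_write")),
    }

log = [
    {"kind": "plan_preview"},
    {"kind": "clarification"},
    {"kind": "memory_read"},
    {"kind": "precondition_check"},
    {"kind": "action"},
]
print(validate_episode(log))  # → all four checks True for this log
```

Because every check is grounded in logged events, two runs of the same episode produce identical verdicts, which is what makes the evaluation reproducible.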
Experimental Deployment
The initial implementation evaluated GPT‑4o across 216 subtasks performed by eight experienced players. Results highlighted recurring failure modes in code execution, inventory and tool handling, referencing, and navigation, while also documenting successful recoveries facilitated by mixed‑initiative clarifications and lightweight memory usage.
User Feedback and Observations
Participants rated the interaction quality and interface usability positively but emphasized a need for stronger memory persistence across sequential tasks. The findings suggest that while current LLM agents can adapt through clarification, sustained memory remains a limiting factor.
Release and Future Directions
All task definitions, validators, execution logs, and the evaluation harness have been made publicly available to support ongoing research into memory‑aware embodied agents. The authors anticipate that the benchmark will enable systematic comparison of future models and promote advances in long‑term contextual reasoning.
This report is based on the abstract of the research paper, an open‑access preprint available via arXiv.