Study Evaluates LLM Performance on Missouri Collegiate Mathematics Competition Problems
Three leading large language models—GPT-4o-mini, Gemini-2.0-Flash, and DeepSeek-V3—were tested on a set of underrepresented problems from the Missouri Collegiate Mathematics Competition covering calculus, analytic geometry, and discrete mathematics. The researchers compared each model’s responses to the official solutions to gauge accuracy and to analyze reasoning patterns across problem types.
Methodology and Dataset
The authors selected competition questions that are rarely used in benchmark suites, aiming to broaden the evaluation beyond standard datasets. Each model received the same prompts and was required to produce step‑by‑step reasoning before delivering a final answer. Results were recorded for correctness and for the quality of intermediate reasoning.
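The evaluation protocol described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual harness: the prompt wording, the "Answer:" extraction rule, and exact-match grading are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of the evaluation loop: each model gets an identical
# prompt asking for step-by-step reasoning, and the final answer is compared
# against the official solution. All names and formats here are illustrative.
import re

PROMPT_TEMPLATE = (
    "Solve the following competition problem. Show step-by-step reasoning, "
    "then give the final answer on a line starting with 'Answer:'.\n\n"
    "Problem: {problem}"
)

def extract_final_answer(response: str) -> str:
    """Pull the final answer line out of a model's free-form response."""
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else ""

def grade(response: str, official_answer: str) -> bool:
    """Mark a response correct if its final answer matches the official one.
    (Real grading would likely need normalization of equivalent forms.)"""
    return extract_final_answer(response) == official_answer.strip()

def evaluate(model_fn, problems):
    """Run one model over all problems; return accuracy per domain."""
    correct, total = {}, {}
    for item in problems:
        response = model_fn(PROMPT_TEMPLATE.format(problem=item["problem"]))
        d = item["domain"]
        total[d] = total.get(d, 0) + 1
        if grade(response, item["answer"]):
            correct[d] = correct.get(d, 0) + 1
    return {d: correct.get(d, 0) / n for d, n in total.items()}
```

Running the same `evaluate` call with each model's generation function, over an identical problem list, yields the kind of per-domain accuracy comparison the study reports.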
Overall Performance by Domain
DeepSeek-V3 achieved the highest accuracy in all three evaluated domains—calculus, analytic geometry, and discrete mathematics—outperforming both GPT-4o-mini and Gemini-2.0-Flash. In contrast, all three models showed notably weak performance on geometry‑focused items, indicating a persistent gap in spatial‑reasoning capabilities.
Error Patterns Across Models
Analysis of the incorrect responses revealed distinct error signatures. DeepSeek-V3’s mistakes were primarily computational or logical slips, while GPT-4o-mini often erred in logical flow and problem‑approach selection. Gemini‑2.0‑Flash tended toward incomplete reasoning and premature conclusions, suggesting divergent weaknesses among the models.
Implications for Structured Reasoning Research
The findings underscore the value of testing LLMs on diverse, competition‑style mathematics problems to uncover nuanced deficiencies that standard benchmarks may miss. In particular, the consistent underperformance in geometry highlights an area where future model improvements and specialized training data could be directed.
Limitations and Future Directions
The study relied on a single competition dataset and focused on three proprietary models, which may limit the generalizability of the conclusions. Expanding the evaluation to additional problem sets and open‑source models could provide a more comprehensive picture of LLM reasoning capabilities.
This report is based on the abstract of an open-access research preprint; the full text is available via arXiv.