New Benchmark Assesses Chinese-English Code-Mixing in Large Language Models
Researchers Qingyan Yang, Tongxi Wang, and Yunsheng Luo introduced ChiEngMixBench, a benchmark for evaluating large language models’ ability to generate spontaneous and natural Chinese‑English code‑mixed text. The work was submitted to arXiv on 2 January 2026 and seeks to fill a gap in authentic evaluation methods for code‑mixing by framing the task as a cognitive alignment problem measured through spontaneity and naturalness.
Benchmark Design and Objectives
ChiEngMixBench is positioned as the first benchmark that captures code‑mixing behavior in real‑world community contexts rather than treating it merely as a translation exercise. The authors emphasize that the benchmark evaluates whether model‑generated switches align with human conventions, thereby providing a more nuanced assessment of multilingual interaction.
Dataset Construction Pipeline
The benchmark was built using a general construction pipeline that can be scaled across domains and bilingual pairs. This pipeline extracts authentic code‑mixed utterances from online communities, annotates them for spontaneity and naturalness, and formats them for systematic testing of language models.
Evaluation Metrics: Spontaneity and Naturalness
Two complementary signals, spontaneity and naturalness, constitute the core metrics. Spontaneity gauges how organically a model initiates language switches, while naturalness measures the fluency and acceptability of the mixed output. The authors report that these metrics reliably differentiate performance across large language models.
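Any switch-based metric first needs the switch points themselves. The sketch below locates Chinese-English switch points by Unicode script (CJK Unified Ideographs, U+4E00 to U+9FFF, versus other alphabetic characters); this is a common simplification for preprocessing, and is an assumption here, not the paper's scoring method.

```python
def segment_languages(text):
    """Split text into runs tagged 'zh' or 'en' by Unicode script.

    Characters in the CJK Unified Ideographs block count as Chinese;
    any other alphabetic character counts as English (a deliberate
    simplification). Punctuation and whitespace are skipped so they
    do not create spurious switch points.
    """
    runs = []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':
            lang = 'zh'
        elif ch.isalpha():
            lang = 'en'
        else:
            continue
        if runs and runs[-1][0] == lang:
            runs[-1][1] += ch
        else:
            runs.append([lang, ch])
    return [(lang, seg) for lang, seg in runs]

def switch_count(text):
    """Number of language switch points in one utterance."""
    return max(len(segment_languages(text)) - 1, 0)
```

For example, `switch_count("我今天有个deadline要赶")` finds two switch points (into and out of the English term). Where the switches fall, and how often, is the kind of raw signal a spontaneity metric could be built on.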
Empirical Findings Across Models
Experimental results demonstrate systematic variation in code‑mixing quality across models. Some models excel in producing fluent mixed sentences but lag in appropriately timed switches, whereas others display more human‑like switch timing but generate less natural phrasing. The benchmark thus reveals trade‑offs that were previously obscured.
Emergent Terminology Layering Strategy
Beyond performance measurement, the study uncovered an implicitly emergent “Terminology Layering Strategy,” a phenomenon consistent with the Matrix Language Frame (MLF) theory. This suggests that multilingual large language models may develop structured cognitive alignment mirroring human multilingual communication patterns.
Implications and Future Directions
The authors argue that ChiEngMixBench can guide the development of more linguistically aware models and inform future research on multilingual interaction. They also propose extending the pipeline to other language pairs and domains to broaden the benchmark’s applicability.
This report is based on the abstract of the research paper, an open-access academic preprint. The full text is available via arXiv.