NeoChainDaily
12.01.2026 • 05:35 Research & Innovation

IndexTTS 2.5 Advances Multilingual Zero-Shot Emotional Speech Synthesis

A team of researchers announced the release of IndexTTS 2.5 in January 2026, a neural text-to-speech foundation model that builds on its predecessor to broaden language coverage, accelerate inference, and preserve synthesis quality. The model, described in an arXiv preprint, integrates a transformer‑based Text‑to‑Semantic (T2S) module with a non‑autoregressive Semantic‑to‑Mel (S2M) module, enabling zero‑shot emotional speech generation across multiple languages.
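The two-stage data flow described above can be sketched as follows. This is an illustrative mock-up of the pipeline shape only: the function names, codebook size, per-character duration heuristic, and upsampling factor are assumptions, not the paper's API.

```python
# Hypothetical sketch of the two-stage IndexTTS pipeline: an autoregressive
# T2S stage produces discrete semantic tokens, and a non-autoregressive S2M
# stage expands them into a mel-spectrogram in one parallel pass.
import numpy as np

def text_to_semantic(text: str, frame_rate_hz: int = 25) -> np.ndarray:
    """Stand-in for the transformer-based T2S module: maps text to a
    sequence of discrete semantic tokens at the codec frame rate."""
    duration_s = 0.08 * len(text)          # assumed ~0.08 s of speech per character
    n_tokens = int(duration_s * frame_rate_hz)
    return np.random.randint(0, 8192, size=n_tokens)  # 8192 = assumed codebook size

def semantic_to_mel(tokens: np.ndarray, n_mels: int = 80, upsample: int = 4) -> np.ndarray:
    """Stand-in for the non-autoregressive S2M module: all frames are
    generated in parallel, so latency does not grow token by token."""
    return np.zeros((len(tokens) * upsample, n_mels))

tokens = text_to_semantic("Hello, world!")
mel = semantic_to_mel(tokens)
print(tokens.shape, mel.shape)  # (26,) (104, 80)
```

The split matters for speed: only the T2S stage is sequential, so shortening its token sequence (see below) directly reduces the autoregressive bottleneck.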

Semantic Codec Compression Halves Sequence Length

The new version reduces the semantic codec frame rate from 50 Hz to 25 Hz, effectively cutting the sequence length in half. This compression lowers both training and inference costs, allowing the system to operate more efficiently without sacrificing the fidelity of semantic representations.
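The arithmetic behind that saving is straightforward; a back-of-the-envelope sketch (clip length chosen for illustration):

```python
# Halving the semantic codec frame rate (50 Hz -> 25 Hz) halves the number
# of tokens the autoregressive stage must generate. Since self-attention
# cost grows roughly quadratically with sequence length, the saving in
# attention compute is larger than the 2x token reduction alone.

def semantic_sequence_length(audio_seconds: float, frame_rate_hz: int) -> int:
    """Number of semantic tokens the codec emits for a clip."""
    return int(audio_seconds * frame_rate_hz)

clip = 10.0  # a 10-second utterance
old = semantic_sequence_length(clip, 50)  # IndexTTS 2
new = semantic_sequence_length(clip, 25)  # IndexTTS 2.5
print(old, new, old / new)  # 500 250 2.0
```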

Zipformer Backbone Streamlines Mel‑Spectrogram Generation

IndexTTS 2.5 replaces the U‑DiT‑based backbone of the S2M module with a Zipformer architecture. The change yields a notable reduction in parameter count and speeds up mel‑spectrogram generation, contributing to a 2.28‑fold improvement in real‑time factor (RTF) compared with the earlier model.
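Real-time factor is synthesis time divided by audio duration, so lower is faster. The sketch below shows what the reported 2.28× speedup means in wall-clock terms; the absolute timings are invented for illustration, only the 2.28 ratio comes from the article.

```python
# RTF = time spent synthesizing / duration of the audio produced.
# An RTF below 1.0 means the system runs faster than real time.

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

audio = 10.0                            # hypothetical 10 s of output audio
old_synthesis = 3.0                     # hypothetical IndexTTS 2 timing
new_synthesis = old_synthesis / 2.28    # the reported 2.28x speedup
print(round(rtf(old_synthesis, audio), 3))  # 0.3
print(round(rtf(new_synthesis, audio), 3))  # 0.132
```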

Cross‑Lingual Modeling Extends to Four Languages

Three explicit strategies—boundary‑aware alignment, token‑level concatenation, and instruction‑guided generation—enable robust zero‑shot multilingual emotional synthesis. The model now supports Chinese, English, Japanese, and Spanish, and can transfer emotional prosody even when target‑language emotional data are unavailable.
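Of the three strategies, token-level concatenation is the easiest to picture: segments in different languages are tokenized separately, tagged with a language identifier, and joined into one sequence. The sketch below is a toy illustration of that idea; the tag format and whitespace tokenizer are assumptions, not the paper's implementation.

```python
# Toy illustration of token-level concatenation for code-switched input:
# each segment keeps an explicit language tag so the model can condition
# its pronunciation and prosody per language.

def tag_segment(tokens: list[str], lang: str) -> list[str]:
    """Wrap a token list in opening/closing language tags."""
    return [f"<{lang}>"] + tokens + [f"</{lang}>"]

def concatenate_segments(segments: list[tuple[str, str]]) -> list[str]:
    """segments: (text, language-code) pairs in utterance order."""
    out: list[str] = []
    for text, lang in segments:
        out.extend(tag_segment(text.split(), lang))
    return out

seq = concatenate_segments([("I love", "en"), ("la música", "es")])
print(seq)  # ['<en>', 'I', 'love', '</en>', '<es>', 'la', 'música', '</es>']
```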

Reinforcement Learning Enhances Pronunciation

Post‑training of the T2S module employs Group Relative Policy Optimization (GRPO), a reinforcement‑learning technique that improves pronunciation accuracy and naturalness. Experiments indicate that these refinements maintain word error rate (WER) and speaker similarity comparable to the prior version.
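The core idea of GRPO-style post-training is a group-relative advantage: several candidate utterances are sampled for the same text, each is scored by a reward (for pronunciation, this could be a WER-based score), and each sample's advantage is its reward standardized within the group. A minimal sketch, assuming the standard group-relative formulation; the reward values are invented:

```python
# Group-relative advantage computation: no learned value network is needed,
# since each sample is judged only against the other samples for the same prompt.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [0.9, 0.7, 0.5, 0.3]  # hypothetical per-sample pronunciation rewards
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])  # [1.342, 0.447, -0.447, -1.342]
```

Samples that pronounce the text better than their group peers get positive advantages and are reinforced; worse samples are suppressed.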

Evaluation Shows Faster Real‑Time Factor with Stable Accuracy

Benchmark tests demonstrate that IndexTTS 2.5 achieves a 2.28× speedup in RTF while preserving comparable WER and speaker similarity metrics to IndexTTS 2. The results suggest that the model delivers faster synthesis without compromising intelligibility or emotional expressiveness.

This report is based on the abstract of the research paper, an open-access preprint; the full text is available via arXiv.
