Zyphra Unveils ZONOS2: MoE-Based TTS

🤖 Zyphra has introduced ZONOS2, a second-generation text-to-speech (TTS) model based on the Mixture of Experts (MoE) architecture.

The model has 8 billion parameters, of which 900 million are active, and generates high-quality 44.1 kHz audio in real time via the Descript Audio Codec (DAC). Byte-level tokenization lets it support multiple languages efficiently without explicit phonemization.
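
To illustrate what byte-level tokenization means in general, here is a minimal sketch that maps text straight to its UTF-8 byte values, so English and Russian share one 256-entry vocabulary with no phoneme dictionary. The function names `byte_tokenize` and `byte_detokenize` are hypothetical and are not taken from the ZONOS2 codebase; the model's actual tokenizer may differ in its details.

```python
def byte_tokenize(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Inverse of byte_tokenize: rebuild the string from byte values."""
    return bytes(ids).decode("utf-8")


# Illustrative inputs only; any Unicode text works the same way.
english_ids = byte_tokenize("Hi")
russian_ids = byte_tokenize("Привет")

print(english_ids)       # [72, 105]
print(len(russian_ids))  # 12 (six Cyrillic letters, two bytes each)
print(byte_detokenize(russian_ids))
```

The trade-off is sequence length: non-Latin scripts take more tokens per character, but no language-specific preprocessing step is needed.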

🌍 ZONOS2 addresses the trade-off between voice cloning quality and latency through its sparse MoE architecture, making high-quality real-time TTS accessible to the open-source community.
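
With the reported figures, 900 million active out of 8 billion total parameters means roughly 11% of the weights are used for any given step, which is where the latency savings of a sparse MoE come from. The sketch below shows generic top-k expert routing under stated assumptions: four toy experts, k = 2, and a softmax over the selected router scores. The source does not describe the ZONOS2 router, so the expert count, k, and the names `top_k_route` and `moe_forward` are all hypothetical.

```python
import math


def top_k_route(logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, experts, logits: list[float], k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts (assumption): each just scales its input by a fixed factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

weights = top_k_route(router_logits, k=2)
output = moe_forward(1.0, experts, router_logits, k=2)
print(weights)  # experts 1 and 3 are selected with equal weight
print(output)

# Active fraction implied by the reported parameter counts.
active_fraction = 0.9e9 / 8e9
print(f"{active_fraction:.4f}")
```

Only the selected experts are evaluated, so compute per step scales with k rather than with the total number of experts.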

👤 You can create highly natural digital voices in Russian and English at studio quality, in real time.

Source 1: https://www.zyphra.com/our-work/zonos2
Source 2: https://github.com/Zyphra/ZONOS2
