Resemble AI Unveils Chatterbox Multilingual v3: Open-Source TTS...

Resemble AI has released Chatterbox Multilingual v3, a new open-source text-to-speech (TTS) model built on a 0.5B-parameter Llama architecture that supports 25 languages and dialects. The key technical feature of this release is the integration of PerTh, a neural watermarking system designed to keep the watermark detectable after tampering attempts.

What Happened

Resemble AI introduced the Chatterbox Multilingual v3 model, which uses a lightweight Llama architecture with 0.5 billion parameters. The model supports zero-shot voice cloning from short samples and allows control over the emotional tone of the speech. The reported Real-Time Factor (RTF) is approximately 5 on an H100 accelerator; since this figure is presented as high performance, it most likely means that speech is generated about five times faster than real time. The main innovation is PerTh, which embeds watermarks that are resistant to MP3 and Opus compression, audio editing, and resampling.
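Because RTF is quoted under two opposite conventions, a small worked example may help. The sketch below is illustrative only: the function names and the timing numbers are assumptions made for this example, not measurements or code from Resemble AI. It shows the conventional definition (processing time divided by audio duration, where lower is better) next to its inverse, the speed-up over real time, which is the reading under which a value of about 5 indicates fast synthesis.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: compute time per second of generated audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(processing_seconds: float, audio_seconds: float) -> float:
    """Inverse convention: seconds of audio generated per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical numbers: 10 s of speech synthesized in 2 s of compute.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
speedup = speedup_over_real_time(processing_seconds=2.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, speed-up = {speedup:.1f}x real time")
```

Under the conventional definition, the same hypothetical run would be reported as an RTF of 0.2, so it is worth checking which convention a benchmark uses before comparing models.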

Context

Regulations such as the EU AI Act require labeling of synthetic content to combat the spread of deepfakes. Using the Llama architecture for TTS lets techniques developed for large language models carry over to speech synthesis, with the aim of producing more natural-sounding speech in a compact model.

Why It Matters for the Industry

For the AI industry, Chatterbox Multilingual v3 is a high-quality open-source tool with built-in watermarking, which simplifies compliance with content-labeling requirements. The integration of PerTh could help establish new safety standards in which neural watermarks become a mandatory layer in consumer AI products and commercial TTS engines.

Why It Matters for Users

Developers and content creators gain access to a powerful model for local testing and for creating voiceovers with emotional nuance and voice cloning. However, synthesis quality for Russian is currently mid-range (character error rate, CER, of 1-5%), with possible errors in word stress, and for some languages, such as Korean and Vietnamese, the model is not yet suitable for production use.
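For readers unfamiliar with the metric, CER is the character-level edit distance between a reference text and a transcript, divided by the length of the reference. The sketch below is a minimal illustration under assumptions: it presumes the common practice of transcribing the synthesized audio with a speech recognizer and comparing the result with the input text (the article does not say how the 1-5% figure was measured), and the function names and sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions that turn reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                       # delete a reference character
                current[j - 1] + 1,                    # insert a hypothesis character
                previous[j - 1] + substitution_cost,   # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the input text versus a transcript of the synthesized audio.
input_text = "hello world"
transcript = "hallo world"
print(f"CER = {character_error_rate(input_text, transcript):.1%}")
```

Note that a transcript-based CER does not capture misplaced word stress, which is a prosody issue, so the two limitations mentioned above have to be evaluated separately.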

What Is Not Yet Known / Limitations

Prosody quality in Russian remains an issue, and language support for several regions is still immature.

Author

Look at AI, Editorial Team