Sber's Kandinsky Lab team has introduced KVAE-Audio, an audio tokenizer released under the open-source MIT license. It compresses audio 960x along the time axis, which its developers say comes without a loss in quality.

What Happened
Sber has released KVAE-Audio, a model that operates on 48 kHz audio and compresses it 960x along the time axis. The architecture uses a 64-channel latent space and the periodic Snake activation function to model audio signals precisely. The model is optimized for training generative diffusion models such as Text-to-Audio (T2A) and Text-to-Audio-Video (T2AV).
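To make these figures concrete, the sketch below works out the latent size for a clip and evaluates the Snake function. It is an illustration, not KVAE-Audio's code. It assumes that "960x" means one latent frame per 960 input samples, that a partial final frame is padded up, and that Snake takes its commonly published form x + sin^2(alpha * x) / alpha. The article does not say how alpha is set, so the default here is a placeholder, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 48_000      # from the article
TEMPORAL_COMPRESSION = 960   # from the article
LATENT_CHANNELS = 64         # from the article


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Return (channels, frames) for a clip of the given length.

    Assumes one latent frame per 960 input samples, with a partial
    final frame padded up to a whole frame.
    """
    num_samples = round(duration_s * SAMPLE_RATE_HZ)
    frames = math.ceil(num_samples / TEMPORAL_COMPRESSION)
    return LATENT_CHANNELS, frames


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation in its commonly published form.

    alpha controls the frequency of the periodic term; its value in
    KVAE-Audio is not given in the article, so 1.0 is a placeholder.
    """
    return x + math.sin(alpha * x) ** 2 / alpha


print(latent_shape(10.0))  # (64, 500)
```

Under these assumptions a 10-second clip maps to 500 latent frames, or 50 frames per second. If the input were mono, which the article does not state, 480,000 samples would become 32,000 latent values, so the element count shrinks 15x even though the time axis shrinks 960x.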
Context
The development aims to produce a more compact, "diffusion-friendly" latent space. Compared with heavyweight solutions such as Stability AI's SAME-L, KVAE-Audio has roughly five times fewer parameters (166.9M vs. 852.1M), and it is reported to achieve better generation metrics than models such as Sony's MMAudio and Meta's DACVAE.
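The parameter counts above translate into a rough estimate of weight memory. The sketch below is a back-of-the-envelope calculation only. The helper name is hypothetical, and it assumes weights stored at 2 bytes (16-bit) or 4 bytes (32-bit) per parameter, which the article does not specify.

```python
KVAE_AUDIO_PARAMS = 166.9e6  # from the article
SAME_L_PARAMS = 852.1e6      # from the article


def weight_memory_mb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


# How many times larger SAME-L is than KVAE-Audio by parameter count.
size_ratio = SAME_L_PARAMS / KVAE_AUDIO_PARAMS

for name, params in [("KVAE-Audio", KVAE_AUDIO_PARAMS), ("SAME-L", SAME_L_PARAMS)]:
    fp16 = weight_memory_mb(params, 2)
    fp32 = weight_memory_mb(params, 4)
    print(f"{name}: {fp16:.1f} MB at 16-bit, {fp32:.1f} MB at 32-bit")
print(f"SAME-L / KVAE-Audio parameter ratio: {size_ratio:.2f}")
```

The ratio is about 5.1. At 16 bits per parameter, the weights take about 334 MB for KVAE-Audio and about 1,704 MB for SAME-L. This counts weights only. Activations, optimizer state, and the diffusion model trained on top of the tokenizer add to the total.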
Why It Matters for the Industry
The release of KVAE-Audio lowers the barrier to entry and the cost of training multimodal models. A compact latent space means a diffusion model has less data to process per second of audio, which simplifies prototyping and lets startups with limited compute build high-quality audio generation systems. In the long term, the technology could become an industry standard for efficient audio tokenization.
Why It Matters for Users
Developers gain an open-source tool that needs fewer computational resources, VRAM in particular, for both inference and training. This makes it possible to integrate high-quality sound generation into multimodal pipelines, including audio content creation and video with sound, at lower cost in time and money.
Author
Look at AI, Editorial Team
