Fish Audio has introduced S2.1 Pro, a new flagship speech synthesis model that supports 83 languages within a single architecture. Until the end of July 2026, developers get free API access through a dedicated s2.1-pro-free model intended for testing and prototyping.
What Happened
The S2.1 Pro release brings significant performance improvements: time to first audio (TTFA), the delay between sending a request and receiving the first chunk of audio, is down to 70–90 ms, and throughput has doubled. The gains come from custom fish-scales-ops kernels optimized for NVIDIA Hopper and Blackwell architectures, together with specialized FP8 GEMM and FlashAttention libraries.
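To make the TTFA figure concrete, here is a minimal sketch of how a developer might measure it for any streaming TTS client. It does not call the Fish Audio API: `measure_ttfa` and `fake_tts_stream` are hypothetical names introduced for illustration, and the fake stream simply sleeps to simulate server delay. In practice you would pass in a function that starts a real streaming request and yields audio chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfa(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float, int]:
    """Return (ttfa_ms, total_ms, total_bytes) for one streamed synthesis call.

    `start_stream` is any zero-argument callable that issues the request and
    returns an iterable of audio chunks (hypothetical interface).
    """
    t0 = time.perf_counter()  # clock starts before the request is issued
    ttfa_ms = None
    total_bytes = 0
    for chunk in start_stream():
        if not chunk:
            continue  # skip empty keep-alive chunks; they carry no audio
        if ttfa_ms is None:
            # First non-empty chunk: this is the time to first audio.
            ttfa_ms = (time.perf_counter() - t0) * 1000.0
        total_bytes += len(chunk)
    total_ms = (time.perf_counter() - t0) * 1000.0
    if ttfa_ms is None:
        raise RuntimeError("stream produced no audio")
    return ttfa_ms, total_ms, total_bytes


def fake_tts_stream(first_chunk_delay_s: float = 0.08,
                    chunks: int = 5,
                    chunk_size: int = 1024) -> Iterator[bytes]:
    """Stand-in for a real streaming client (assumption, not a real API)."""
    time.sleep(first_chunk_delay_s)  # simulated model and network latency
    for _ in range(chunks):
        yield b"\x00" * chunk_size
        time.sleep(0.01)  # simulated gap between chunks


if __name__ == "__main__":
    ttfa, total, size = measure_ttfa(fake_tts_stream)
    print(f"TTFA: {ttfa:.1f} ms, total: {total:.1f} ms, bytes: {size}")
```

Measured this way, TTFA includes network round-trip time, so client-side numbers will usually be higher than a server-side figure such as the 70–90 ms quoted above.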
Context
Development focuses on extremely low latency, which is critical for interactive systems. Modern GPU stacks and low-precision formats such as FP8 make high-quality text-to-speech (TTS) efficient to scale, turning it from a costly infrastructure task into an accessible component.
Why It Matters for the Industry
For the AI industry, this lowers the barrier to entry for building real-time voice agents. Optimization for modern NVIDIA architectures raises the bar for efficiency, allowing high-quality speech synthesis to run in high-load systems without prohibitive infrastructure costs.
Why It Matters for Users
Developers and content creators can test a flagship, low-latency speech synthesis model at no cost. This makes it possible to quickly prototype multilingual interfaces, conversational assistants, and automated content dubbing in dozens of languages.
What Is Not Yet Known / Limitations
Free API access is temporary and ends at the end of July 2026. The deep optimization for NVIDIA-specific kernels also carries a risk of vendor lock-in to NVIDIA hardware.
Author
Look at AI, Editorial Team
