🤖 Qwen3-ForcedAligner-0.6B: High-Precision Audio and Text Alignment

The Qwen3-ForcedAligner-0.6B-hf model for forced alignment has been released. With 0.9B parameters, it runs in non-autoregressive (NAR) mode and predicts word-level timestamps in 11 languages, including Russian. It is compatible with any ASR system and is optimized via torch.compile.

๐ŸŒ The transition to non-autoregressive methods significantly accelerates subtitling and audio indexing, providing higher accuracy than traditional E2E models.

👤 This tool makes audio workflows such as captioning and audio search faster and more accurate, producing precise timestamps from any existing transcript.
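To illustrate the captioning use case, here is a minimal sketch of turning word-level timestamps into SRT subtitles. It does not call the model: the `Word` container, the `words_to_srt` helper, and the `max_words` / `max_gap` thresholds are all hypothetical, and the aligner's real output format may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container for one aligned word (not the model's actual output type).
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7, max_gap: float = 0.6) -> str:
    """Group aligned words into cues and render them as SRT text."""
    cues: list[list[Word]] = []
    for w in words:
        # Start a new cue when the current one is full or after a long pause.
        if not cues or len(cues[-1]) >= max_words or w.start - cues[-1][-1].end > max_gap:
            cues.append([w])
        else:
            cues[-1].append(w)

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(f"{i}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    demo = [Word("Hello", 0.0, 0.4), Word("world", 0.45, 0.9), Word("again", 2.0, 2.5)]
    print(words_to_srt(demo))
```

Splitting cues on pauses as well as on word count keeps each subtitle close to a natural phrase; both thresholds are arbitrary defaults to tune per project.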

Source 1: https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B-hf