Qwen3-ForcedAligner-0.6B-hf Release for Precise Audio Alignment


The Qwen3-ForcedAligner-0.6B-hf model has been released. It uses non-autoregressive inference to align audio with text at high precision across 11 languages.

Compiled by Sergey Kostenchuk | Published 2026-06-26 | Updated 2026-06-26


🤖 Qwen3-ForcedAligner-0.6B: High-Precision Audio and Text Alignment

The Qwen3-ForcedAligner-0.6B-hf model, designed for forced alignment, has been released. With 0.9B parameters, the model operates in non-autoregressive (NAR) mode, predicting word-level timecodes in parallel rather than one token at a time, and supports 11 languages, including Russian. Because it aligns an existing transcript to the audio, it works with the output of any ASR system, and it is optimized via torch.compile.

🌍 The move to non-autoregressive alignment significantly speeds up subtitling and audio indexing while delivering more accurate timecodes than traditional end-to-end (E2E) models.

👤 This tool makes audio workflows such as captioning and audio search faster and more accurate, producing precise word-level timecodes from any existing transcription.
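To illustrate the captioning workflow mentioned above, here is a minimal sketch of how word-level timecodes could be turned into SRT subtitles. It does not call the model: the WordTiming record, the words_to_srt helper, and the max_words and max_gap thresholds are all hypothetical names and values chosen for this example, and the aligner's actual output format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTiming:
    """Hypothetical record for one aligned word (not the model's real output type)."""
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[WordTiming], max_words: int = 7, max_gap: float = 0.6) -> str:
    """Group aligned words into subtitle cues and render them as SRT text.

    A new cue starts when the current one already holds max_words words or
    when the pause before the next word is longer than max_gap seconds.
    Both thresholds are illustrative defaults, not recommended values.
    """
    cues: List[List[WordTiming]] = []
    current: List[WordTiming] = []
    for w in words:
        if current and (len(current) >= max_words or w.start - current[-1].end > max_gap):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.word for w in cue)
        span = f"{format_srt_time(cue[0].start)} --> {format_srt_time(cue[-1].end)}"
        blocks.append(f"{index}\n{span}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up timings purely for demonstration.
    demo = [
        WordTiming("Hello", 0.0, 0.4),
        WordTiming("world", 0.45, 0.9),
        WordTiming("again", 2.0, 2.5),
    ]
    print(words_to_srt(demo))
```

Splitting cues on pauses as well as on word count keeps subtitles from spanning long silences, which is one place where accurate word boundaries directly affect caption quality.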

Source 1: https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B-hf
