Vox, a new application for macOS and Windows, uses local artificial intelligence to turn speech into clean text without sending any data to the cloud.
What Happened
Developers have released Vox, a local dictation tool for macOS and Windows. The application transcribes audio with Whisper or NVIDIA Parakeet models, then post-processes the text with Gemma 4 or Apple Intelligence to remove filler words and correct errors. The entire pipeline runs on-device, using the Apple Neural Engine on macOS or DirectX 12 on Windows for hardware acceleration.
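The two-stage design (speech recognition first, then a language model to clean up the transcript) can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not Vox's code. `transcribe` is a hypothetical callable standing in for a local Whisper or Parakeet model. The cleanup step swaps the LLM for a simple rule-based filter over a hypothetical filler-word list, so it strips fillers but cannot fix grammar or recognition errors the way Gemma 4 or Apple Intelligence would.

```python
import re
from typing import Callable

# Hypothetical filler list for illustration. Vox delegates cleanup to a local
# LLM (Gemma 4 or Apple Intelligence), not to a fixed word list like this one.
FILLERS = ("um", "uh", "erm", "ah")

# Match a whole-word filler, an optional trailing comma, and following spaces.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    re.IGNORECASE,
)


def clean_transcript(raw: str) -> str:
    """Rule-based stand-in for LLM post-processing of a raw ASR transcript."""
    text = _FILLER_RE.sub("", raw)
    # Collapse runs of whitespace left behind by removed fillers.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop stray spaces before punctuation ("word ," -> "word,").
    text = re.sub(r"\s+([,.!?])", r"\1", text)
    # Capitalize the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?] )([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def dictate(audio_path: str, transcribe: Callable[[str], str]) -> str:
    """Two-stage pipeline: local ASR produces raw text, which is then cleaned."""
    return clean_transcript(transcribe(audio_path))


if __name__ == "__main__":
    # Fake ASR step so the sketch runs without audio files or model weights.
    print(dictate("note.wav", lambda path: "um so the meeting is at noon. uh, bring the slides."))
    # prints: So the meeting is at noon. Bring the slides.
```

Passing the ASR step in as a function keeps the post-processing testable without audio hardware; in a fuller version, `clean_transcript` would be replaced by a prompt to a locally running language model.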
Context
The project is a practical application of Edge AI, in which computation moves from remote servers onto users' own devices. Running on-device lets Vox chain a specialized Automatic Speech Recognition (ASR) model with a Small Language Model (SLM) to handle an everyday task in a closed loop, with no data leaving the machine.
Why It Matters for the Industry
Vox shows how mature the Edge AI stack has become and that pairing ASR with a local language model is a viable approach. It reflects a broader trend toward less developer dependence on cloud APIs and points the way to new vertical products with no cloud infrastructure costs and minimal latency.
Why It Matters for Users
Users get a tool that turns speech into structured text (emails, notes, code) at speaking speed (about 150 words per minute), with a claimed time saving of up to 60 minutes a day. The main advantages are complete data privacy and no cloud subscription fees.
Author
Look at AI, Editorial Team
