A developer has demonstrated the Qwen 3.6 35B MoE model running with a 450,000-token context on a single NVIDIA RTX 5090 with 32 GB of VRAM. The setup relies on a llama.cpp fork that supports TurboQuant KV cache compression and YaRN context extension, which together reduce memory use enough to process very long inputs on consumer-grade hardware.

What Happened
The developer ran the model on a llama.cpp fork that supports TurboQuant, which compresses the KV cache down to 3 bits per value, and YaRN for RoPE scaling. With weights quantized to Q6_K, the model reportedly consumes 28.5 GB of VRAM, leaving approximately 2.7 GB for the context. That is enough for a 450,000-token window on a single RTX 5090.
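The figures above imply a very small per-token memory budget. The short Python sketch below works through that arithmetic using only the two numbers quoted in the article. It assumes the full 2.7 GB is available to the KV cache and ignores compute buffers, which the report does not break out. The function name and the GB/GiB switch are illustrative choices, not something taken from the developer's setup.

```python
# Figures quoted in the article.
FREE_VRAM_GB = 2.7        # memory left for the context after loading weights
CONTEXT_TOKENS = 450_000  # reported context window


def bytes_per_token(free_gb: float, tokens: int, binary_units: bool = False) -> float:
    """Return the average memory budget per context token, in bytes."""
    # The article does not say whether "GB" means 10^9 or 2^30 bytes,
    # so both interpretations are supported.
    bytes_per_gb = 1024**3 if binary_units else 1000**3
    return free_gb * bytes_per_gb / tokens


if __name__ == "__main__":
    for binary in (False, True):
        unit = "GiB" if binary else "GB"
        budget = bytes_per_token(FREE_VRAM_GB, CONTEXT_TOKENS, binary)
        print(f"2.7 {unit} over 450,000 tokens: about {budget:,.0f} bytes per token")
```

Under either reading of the unit, the budget comes to roughly 6 KB per token across all layers, which is why the cache precision matters so much at this context length.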
Context
TurboQuant compresses the KV cache aggressively. This matters for long sequences because the cache grows with context length and can take up a large share of video memory. YaRN rescales the rotary positional embeddings (RoPE) so that the model can handle contexts longer than its native window.
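To make the effect of cache precision concrete, here is a minimal sketch of the standard KV cache size estimate for an attention-based model, together with the ratio that context extension has to cover. The layer count, KV head count, head dimension, and native context window below are hypothetical placeholders, not the published Qwen 3.6 configuration. The sketch also ignores the per-block scale metadata that real quantization formats store, so actual 3-bit caches would be somewhat larger.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: float) -> float:
    """Estimate KV cache size in bytes for a given context length."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    values = 2 * n_layers * n_kv_heads * head_dim * tokens
    return values * bits_per_value / 8


def context_scale_factor(target_ctx: int, native_ctx: int) -> float:
    """Ratio by which the positional scheme must be stretched."""
    return target_ctx / native_ctx


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=10, n_kv_heads=2, head_dim=256)
    tokens = 450_000

    fp16 = kv_cache_bytes(tokens, bits_per_value=16, **shape)
    q3 = kv_cache_bytes(tokens, bits_per_value=3, **shape)
    print(f"16-bit cache: {fp16 / 1e9:.2f} GB")
    print(f" 3-bit cache: {q3 / 1e9:.2f} GB")
    print(f"reduction:    {fp16 / q3:.2f}x")

    # Hypothetical native window; the article does not state the real one.
    print(f"scale factor: {context_scale_factor(tokens, 131_072):.2f}")
```

Whatever the real model shape is, the ratio between a 16-bit and a 3-bit cache is 16/3, a bit over five times, before metadata overhead.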
Why It Matters for the Industry
The case shows how far memory optimization can go when running heavy models locally on consumer hardware. It could push aggressive KV cache quantization into mainstream tools and reduce dependence on cloud APIs for building private RAG systems.
Why It Matters for Users
Enthusiasts and local developers can use powerful MoE models to analyze extremely long documents, entire code repositories, or large document collections without buying professional-grade server accelerators such as the H100.
What Is Still Unknown / Limitations
No detailed latency figures or confirmed quality metrics, such as perplexity, are available yet for contexts this large, so caution is warranted before relying on this setup for mission-critical tasks.
Author
Look at AI, Editorial Team
