A developer has demonstrated the Qwen 3.6 35B MoE model running with an extended context of 450,000 tokens on a single NVIDIA RTX 5090 with 32 GB of VRAM. The setup uses a llama.cpp fork that supports TurboQuant KV cache compression and YaRN RoPE scaling, which together cut memory use enough to process very long inputs on consumer-grade hardware.

What Happened

The developer ran the Qwen 3.6 35B MoE model on a llama.cpp fork that supports TurboQuant, which compresses the KV cache down to 3-bit, and the YaRN method for RoPE scaling. With weights quantized to Q6_K, the model consumes 28.5 GB of VRAM, leaving approximately 2.7 GB for the context (the rest of the card's 32 GB is presumably taken by compute buffers and other runtime overhead). That budget is enough for a 450,000-token window on a single RTX 5090.
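The reported figures imply a very tight per-token budget, which a few lines of arithmetic make concrete. The sketch below is only a back-of-the-envelope check. It assumes decimal gigabytes (the report does not say GB or GiB), assumes the whole 2.7 GB goes to the KV cache, and assumes cache size scales linearly with bits per value, ignoring any per-block scale metadata a real quantizer stores. The function names are illustrative and not part of llama.cpp.

```python
GB = 1e9  # assumption: decimal gigabytes

def bytes_per_token(cache_budget_gb: float, context_tokens: int) -> float:
    """Average cache bytes available per token for a given memory budget."""
    return cache_budget_gb * GB / context_tokens

def tokens_at_bit_width(tokens: int, from_bits: int, to_bits: int) -> int:
    """Tokens that fit in the same budget if each cached value used
    to_bits instead of from_bits (linear scaling, no metadata overhead)."""
    return int(tokens * from_bits / to_bits)

# Figures quoted in the article: 2.7 GB for a 450,000-token context.
per_token = bytes_per_token(2.7, 450_000)

# Same budget with an uncompressed 16-bit cache instead of 3-bit.
fp16_tokens = tokens_at_bit_width(450_000, from_bits=3, to_bits=16)

print(f"about {per_token:.0f} bytes of cache per token")
print(f"about {fp16_tokens} tokens would fit at 16-bit")
```

Under these assumptions, the same 2.7 GB would hold roughly a fifth as many tokens with a 16-bit cache, which is why the 3-bit compression matters for the headline figure.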

Context

TurboQuant compresses the KV cache aggressively. This matters for long sequences because the cache grows with every token in the context and ends up consuming a significant portion of video memory. YaRN rescales the rotary positional embeddings (RoPE) so that the model can handle contexts longer than those it was trained on.
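To give a sense of what YaRN's RoPE scaling involves, here is a minimal sketch of its "NTK-by-parts" frequency adjustment as described in the YaRN paper. It is not the fork's implementation. The values in the usage example (head dimension 128, RoPE base 10000, original context 32,768, scale factor 4) are hypothetical and are not the settings used in this demonstration; the `alpha=1` and `beta=32` defaults are the values the paper suggests for LLaMA-family models.

```python
import math

def yarn_inv_freqs(head_dim, base, scale, orig_ctx, alpha=1.0, beta=32.0):
    """Return YaRN-adjusted RoPE inverse frequencies, one per dimension pair."""
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)      # original inverse frequency
        wavelength = 2 * math.pi / theta
        rotations = orig_ctx / wavelength    # full turns within the trained context
        # Ramp: 0 means fully interpolated (divided by scale), 1 means unchanged.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs

def yarn_attention_factor(scale):
    """Magnitude factor from the paper, sqrt(1/t) = 0.1 * ln(scale) + 1,
    applied to the rotary embeddings of queries and keys."""
    return 0.1 * math.log(scale) + 1.0

# Hypothetical settings, for illustration only.
freqs = yarn_inv_freqs(head_dim=128, base=10000.0, scale=4.0, orig_ctx=32768)
print(len(freqs), freqs[0], freqs[-1], yarn_attention_factor(4.0))
```

High-frequency dimensions, which complete many rotations inside the trained context, are left untouched, while low-frequency dimensions are stretched by the full scale factor. Dimensions in between are blended linearly.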

Why It Matters for the Industry

This case shows how far memory optimization can go for running large models locally on consumer hardware. It could push mainstream tools toward adopting aggressive KV cache quantization as a standard option and reduce reliance on cloud APIs for building private RAG systems.

Why It Matters for Users

Enthusiasts and local developers gain the ability to use powerful MoE models to analyze extremely long documents, entire code repositories, or large data libraries without purchasing professional-grade server accelerators like the H100.

What Is Still Unknown / Limitations

No detailed latency figures or confirmed quality metrics (such as perplexity) have been published for context lengths this large, so caution is warranted before relying on this setup for mission-critical tasks.
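For readers who want to check quality themselves, perplexity is straightforward to compute once per-token log-probabilities are available. The sketch below assumes natural-log probabilities for the reference tokens have already been collected from the runtime. The `perplexity` helper is illustrative and not part of any particular tool.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy input: every token predicted with probability 0.25.
toy = [math.log(0.25)] * 8
print(perplexity(toy))
```

Comparing this number on the same text with a 16-bit cache and with the 3-bit cache, at both short and very long contexts, would be one way to estimate how much quality the compression costs.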

Author

Look at AI, Editorial Team