Running Qwen 3.6 35B MoE with 450k Context on a Single RTX 5090


A developer has shown how to run the Qwen 3.6 35B MoE model with a 450,000-token context on a single NVIDIA RTX 5090 (32 GB VRAM). The setup relies on a fork of `llama.cpp` with TurboQuant support, which quantizes the KV cache to 3 bits, combined with YaRN RoPE scaling to extend the context window.
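To make the YaRN part concrete: the scaling factor is the ratio of the target context length to the context length the model was trained with. The sketch below only illustrates that arithmetic. The native context value is a hypothetical placeholder, since the source does not state it, and `yarn_scale_factor` is an illustrative helper, not part of `llama.cpp`.

```python
def yarn_scale_factor(target_ctx: int, native_ctx: int) -> float:
    """Return the RoPE scaling factor needed to stretch native_ctx to target_ctx."""
    if target_ctx <= 0 or native_ctx <= 0:
        raise ValueError("context lengths must be positive")
    # A factor below 1.0 would mean no extension is needed, so clamp at 1.0.
    return max(1.0, target_ctx / native_ctx)


# ASSUMPTION: hypothetical native context length, not taken from the source.
NATIVE_CTX = 262_144
TARGET_CTX = 450_000  # context length reported in the post

factor = yarn_scale_factor(TARGET_CTX, NATIVE_CTX)
print(f"YaRN scale factor: {factor:.2f}")
```

Larger factors stretch the positional encoding further beyond what the model saw in training, so quality at the far end of the context is worth checking empirically.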

🌍 The result shows how far aggressive memory optimization can go when running large models with very long contexts locally on consumer hardware. Compressing the KV cache to 3 bits with TurboQuant is the main source of the VRAM savings.
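A rough back-of-the-envelope estimate shows why the bit width of the KV cache matters so much at this context length. The sketch below is a simplified model, not the fork's actual accounting. The layer count, KV head count, and head dimension are hypothetical placeholders (the source does not give the architecture), every layer is assumed to keep a full KV cache, and quantization overhead such as per-block scales is ignored.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: float) -> float:
    """Estimate KV cache size in bytes, ignoring quantization metadata."""
    # Each layer stores two tensors (K and V) of n_tokens * n_kv_heads * head_dim values.
    n_values = 2 * n_layers * n_tokens * n_kv_heads * head_dim
    return n_values * bits_per_value / 8


# ASSUMPTION: hypothetical architecture numbers, for illustration only.
N_LAYERS = 40
N_KV_HEADS = 4
HEAD_DIM = 128
N_TOKENS = 450_000  # context length reported in the post

for bits in (16, 8, 3):
    gib = kv_cache_bytes(N_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM, bits) / 2**30
    print(f"{bits:>2}-bit KV cache: {gib:.1f} GiB")
```

Whatever the real architecture numbers are, the size scales linearly with bits per value in this model, so going from 16 bits to 3 bits shrinks the cache by a factor of about 5.3.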

👤 For users, this means powerful Mixture-of-Experts (MoE) models can process very long documents (up to 450k tokens) without requiring data-center GPUs such as the H100.

Source 1: https://local-llm.utop.workers.dev/
