🧠 FlashMemory-Deepseek-V4: KV-cache optimization via a neural retriever
FlashMemory-Deepseek-V4 is a newly released lightweight neural retriever for optimizing the KV cache of DeepSeek-V4 models. It shrinks the GPU-resident Compressed-Sparse-Attention (CSA) cache by nearly 90%, keeping only 10–15% of the data on the GPU, without loss of quality on the RULER and LongBench V2 benchmarks.
🌍 The technology makes ultra-long contexts (500k+ tokens) workable on limited hardware by offloading cache blocks that are not currently needed to CPU memory or disk.
👤 Using powerful models will become cheaper, as VRAM requirements for long dialogues and document analysis are drastically reduced.
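To make the idea concrete, here is a minimal sketch of retriever-guided cache selection: score every cache block against the current query, keep the top 10–15% on the GPU, and mark the rest for offloading. It is an illustration under stated assumptions, not the project's code: the dot-product scorer stands in for the neural retriever, and `score_block`, `partition_blocks`, and `keep_fraction` are hypothetical names.

```python
# Illustrative sketch only: not the FlashMemory-Deepseek-V4 implementation.
# The dot-product scorer is an assumed stand-in for the neural retriever.

def score_block(query, block_key):
    """Relevance of one cache block: dot product with the query vector."""
    return sum(q * k for q, k in zip(query, block_key))


def partition_blocks(query, block_keys, keep_fraction=0.15):
    """Split block indices into (kept_on_gpu, offloaded) by relevance score."""
    n_keep = max(1, round(len(block_keys) * keep_fraction))
    # Rank block indices from most to least relevant to the query.
    ranked = sorted(
        range(len(block_keys)),
        key=lambda i: score_block(query, block_keys[i]),
        reverse=True,
    )
    kept = sorted(ranked[:n_keep])       # stays in GPU memory
    offloaded = sorted(ranked[n_keep:])  # candidates for CPU memory or disk
    return kept, offloaded


# Toy data: 20 blocks, each summarised by a 2-dimensional key vector.
block_keys = [[float(i % 5), float(i // 5)] for i in range(20)]
query = [1.0, 0.3]

kept, offloaded = partition_blocks(query, block_keys, keep_fraction=0.15)
print(f"kept on GPU: {kept}; offloaded: {len(offloaded)} blocks")
```

In a real system the selection would be re-evaluated as the query changes, and offloaded blocks would be fetched back when the retriever ranks them highly again.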
Source 1: https://github.com/libertywing/FlashMemory-Deepseek-V4
Source 2: https://huggingface.co/libertywing/FlashMemory-Deepseek-V4
