FlashMemory-Deepseek-V4: KV-cache optimization via neural retriever

FlashMemory-Deepseek-V4 is a newly released lightweight neural retriever that compresses the Compressed-Sparse-Attention (CSA) cache in DeepSeek-V4 models by nearly 90%.

Compiled by Sergey Kostenchuk. Published 2026-06-15. Updated 2026-06-15.

2026-06-15 Coding HuggingFace

FlashMemory-Deepseek-V4 is a lightweight neural retriever for optimizing the KV cache in DeepSeek-V4 models. It compresses the Compressed-Sparse-Attention (CSA) cache by nearly 90%, keeping only 10–15% of the data on the GPU, with no reported loss of quality on the RULER and LongBench V2 benchmarks.
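
To make the "keep only 10–15% on the GPU" idea concrete, here is a minimal Python sketch of the selection step. It assumes the retriever has already produced one relevance score per cache block; the function name `select_resident_blocks`, the `keep_fraction` parameter, and the example scores are hypothetical and are not taken from the project's code, which may work quite differently.

```python
import heapq
import math


def select_resident_blocks(block_scores, keep_fraction=0.15):
    """Return sorted indices of the highest-scoring blocks to keep on the GPU.

    block_scores: one relevance score per cache block (hypothetical input;
    the source does not describe how the real retriever scores blocks).
    keep_fraction: share of blocks that stays GPU-resident (0.10-0.15 in the post).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not block_scores:
        return []
    # Always keep at least one block, and round the budget up.
    keep = max(1, math.ceil(len(block_scores) * keep_fraction))
    top = heapq.nlargest(keep, range(len(block_scores)),
                         key=block_scores.__getitem__)
    return sorted(top)


# Made-up scores for 20 blocks; a 15% budget keeps 3 of them.
scores = [0.02, 0.91, 0.10, 0.05, 0.77, 0.01, 0.03, 0.40, 0.08, 0.06,
          0.04, 0.09, 0.66, 0.07, 0.02, 0.01, 0.03, 0.05, 0.02, 0.04]
resident = select_resident_blocks(scores, keep_fraction=0.15)
print(resident)
```

Everything not returned by the function would be a candidate for offloading; how well this works depends entirely on how good the scores are, which is the retriever's job.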

🌍 This makes it possible to work with ultra-long contexts (500k+ tokens) on limited hardware by offloading unused memory blocks to the CPU or disk.
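
The offloading idea can be illustrated with a toy two-tier store in Python. This is only a sketch under stated assumptions: plain dictionaries stand in for GPU and host memory, and the class `TieredBlockCache` and its methods are hypothetical names, not the project's API.

```python
class TieredBlockCache:
    """Toy two-tier store: a small 'GPU' dict backed by a larger 'host' dict."""

    def __init__(self, gpu_capacity):
        if gpu_capacity < 1:
            raise ValueError("gpu_capacity must be at least 1")
        self.gpu_capacity = gpu_capacity
        self.gpu = {}   # stands in for GPU-resident blocks
        self.host = {}  # stands in for blocks offloaded to CPU RAM or disk

    def put(self, block_id, data):
        # In this sketch, new blocks start in the offloaded tier.
        self.host[block_id] = data

    def make_resident(self, wanted_ids):
        """Move the wanted blocks to the 'GPU' tier and offload the rest."""
        # Deduplicate while preserving order, then respect the capacity.
        wanted = list(dict.fromkeys(wanted_ids))[:self.gpu_capacity]
        wanted_set = set(wanted)
        # Offload blocks that are no longer needed.
        for block_id in [b for b in self.gpu if b not in wanted_set]:
            self.host[block_id] = self.gpu.pop(block_id)
        # Fetch missing blocks (raises KeyError for unknown ids).
        for block_id in wanted:
            if block_id not in self.gpu:
                self.gpu[block_id] = self.host.pop(block_id)
        return sorted(self.gpu)


cache = TieredBlockCache(gpu_capacity=2)
for i in range(10):
    cache.put(i, f"kv-block-{i}")

first = cache.make_resident([1, 4])
second = cache.make_resident([4, 7])
print(first, second, len(cache.host))
```

In a real system the expensive part is the transfer between tiers, so the goal is to change the resident set as rarely as possible; this sketch ignores transfer cost entirely.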

👤 Running powerful models should become cheaper, since VRAM requirements for long dialogues and document analysis drop sharply.

Source 1: https://github.com/libertywing/FlashMemory-Deepseek-V4
Source 2: https://huggingface.co/libertywing/FlashMemory-Deepseek-V4
