The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache
Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don’t have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with high concurrency it can dwarf the model’s own footprint. Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers × 8 KV heads × 128 head-dim × 2 tensors × 2 bytes). At 128K tokens that is ~40 GB; at 1M tokens it exceeds 300 GB — more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency. Current approaches fall into roughly five families: token eviction ...
