
Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

8 December 2025 at 17:00

Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and KV cache, we can reduce the memory footprint and compute cost, directly improving throughput, latency, and achievable context length. This blog introduces NVFP4 KV cache quantization, a new KV format that enables significant performance gains on NVIDIA Blackwell GPUs.
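
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with precision. The model dimensions and the ~4.5 bits-per-element figure for NVFP4 (4-bit values plus a shared 8-bit scale per 16-element block) are illustrative assumptions, not numbers taken from the article.

```python
# Back-of-the-envelope KV cache sizing: FP16 vs. a 4-bit block-scaled
# format like NVFP4 (~4.5 effective bits per element once block scales
# are counted). Model shape below is a hypothetical 70B-class model
# with grouped-query attention, chosen purely for illustration.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_elem):
    """Bytes needed to store K and V for `tokens` positions of one sequence."""
    elems = 2 * tokens * layers * kv_heads * head_dim  # 2 = K and V
    return elems * bits_per_elem / 8

layers, kv_heads, head_dim = 80, 8, 128  # assumed model shape

for tokens in (32_768, 131_072):
    fp16 = kv_cache_bytes(tokens, layers, kv_heads, head_dim, 16)
    nvfp4 = kv_cache_bytes(tokens, layers, kv_heads, head_dim, 4.5)
    print(f"{tokens:>7} tokens: FP16 {fp16 / 2**30:6.2f} GiB  "
          f"NVFP4 ~{nvfp4 / 2**30:5.2f} GiB  ({fp16 / nvfp4:.1f}x smaller)")
```

Under these assumptions the per-sequence KV cache shrinks by roughly 3.5x, which is what frees up HBM for longer contexts or larger batch sizes.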
