Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users—from consumers to enterprises—to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt. Through extreme co-design across GPUs, CPUs…
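At the core of Mixture of Experts inference is a router that sends each token to only a few experts, which is what makes the throughput-per-watt gains possible. A minimal NumPy sketch of top-k gating (an illustration of the general technique, not NVIDIA's kernels; all names here are hypothetical):

```python
import numpy as np

def topk_route(logits: np.ndarray, k: int = 2):
    """Pick the top-k experts per token and softmax-normalize their gate weights."""
    # Indices of the k largest router logits for each token.
    topk = np.argsort(logits, axis=-1)[:, -k:]
    gathered = np.take_along_axis(logits, topk, axis=-1)
    # Softmax over the selected logits only, so the k gate weights sum to 1.
    w = np.exp(gathered - gathered.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return topk, w

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(4, 8))   # 4 tokens, 8 experts
experts, weights = topk_route(router_logits, k=2)
```

Because each token activates only k of the experts, compute per token stays roughly constant as the expert count grows, which is the property the co-designed hardware exploits.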

Source

  •  

Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer

AI has entered an industrial phase. What began as systems performing discrete AI model training and human-facing inference has evolved into always-on AI factories that continuously convert power, silicon, and data into intelligence at scale. These factories now underpin applications that generate business plans, analyze markets, conduct deep research, and reason across vast bodies of…

Source

  •  

Solving Large-Scale Linear Sparse Problems with NVIDIA cuDSS

Solving large-scale problems in Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD), and advanced optimization workflows has become the norm as chip designs, manufacturing, and multi-physics simulations have grown in complexity. These workloads push traditional solvers to their limits and require unprecedented scalability and performance. The NVIDIA CUDA Direct Sparse Solver (cuDSS) is built…
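As a frame of reference for the class of problem cuDSS targets (cuDSS itself exposes its own API; this is the analogous direct sparse solve on CPU with SciPy, using a toy matrix in place of the large EDA/CFD systems):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

# A small sparse symmetric system standing in for the large,
# mostly-zero matrices that arise in EDA and CFD workloads.
A = csr_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

x = spsolve(A, b)                      # direct (factorization-based) solve
residual = np.linalg.norm(A @ x - b)   # should be near machine precision
```

Direct solvers factorize the matrix once and then solve exactly, which is why they scale poorly without the kind of GPU acceleration the post describes.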

Source

  •  

How to Scale Fast Fourier Transforms to Exascale on Modern NVIDIA GPU Architectures

Fast Fourier Transforms (FFTs) are widely used across scientific computing, from molecular dynamics and signal processing to computational fluid dynamics (CFD), wireless multimedia, and machine-learning applications. As computational problem sizes scale to increasingly large domains, researchers require the capability to distribute FFT computations across hundreds or thousands of GPUs spanning…
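The standard way to distribute an FFT across many GPUs is slab decomposition: 1D FFTs along the locally owned axis, a global transpose (an all-to-all exchange between ranks), then 1D FFTs along the other axis. A single-process NumPy sketch of that structure (illustrative only, not a distributed implementation):

```python
import numpy as np

def fft2_by_slabs(x: np.ndarray) -> np.ndarray:
    """2D FFT as: row FFTs, transpose, row FFTs again.
    In a multi-GPU setting each rank owns a slab of rows and the
    transpose becomes an all-to-all exchange between GPUs."""
    step1 = np.fft.fft(x, axis=1)        # each rank FFTs its local rows
    step2 = np.fft.fft(step1.T, axis=1)  # "exchange", then FFT the other axis
    return step2.T

x = np.random.default_rng(1).normal(size=(8, 8))
y = fft2_by_slabs(x)                     # matches np.fft.fft2(x)
```

The compute is embarrassingly parallel; it is the transpose/all-to-all step that dominates at scale, which is why interconnect bandwidth drives exascale FFT performance.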

Source

  •  

NVIDIA Blackwell Enables 3x Faster Training and Nearly 2x Training Performance Per Dollar than Previous-Gen Architecture

AI innovation continues to be driven by three scaling laws: pre-training, post-training, and test-time scaling. Training is foundational to building smarter models, and post-training—which can include fine-tuning, reinforcement learning, and other techniques—helps to further increase accuracy for specific tasks, as well as provide models with new capabilities, such as the ability to reason.

Source

  •  

Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and KV cache, we can reduce the memory footprint and compute cost—directly improving throughput, latency, and achievable context length. This blog introduces NVFP4 KV cache quantization, a new KV cache format that enables significant performance gains on NVIDIA Blackwell GPUs.
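The memory saving comes from storing 4-bit values plus one shared scale per small block. A simplified NumPy sketch of block-scaled 4-bit quantization (this illustrates the general idea only; it is not the actual NVFP4 FP4 encoding, and all function names are hypothetical):

```python
import numpy as np

def quantize_blocks(x: np.ndarray, block: int = 16):
    """Round each block of 16 values to a signed 4-bit grid [-7, 7],
    with one shared per-block scale (simplified stand-in for NVFP4)."""
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # avoid divide-by-zero
    q = np.clip(np.round(blocks / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_blocks(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

kv = np.random.default_rng(2).normal(size=(4, 64)).astype(np.float32)
q, s = quantize_blocks(kv)
kv_hat = dequantize_blocks(q, s, kv.shape)
err = np.abs(kv - kv_hat).max()   # bounded by half a quantization step
```

Versus FP16, each value shrinks from 16 bits to roughly 4 bits plus the amortized per-block scale, which is what makes longer contexts and larger batches fit in GPU memory.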

Source

  •  

Making GPU Clusters More Efficient with NVIDIA Data Center Monitoring Tools

High-performance computing (HPC) customers continue to scale rapidly, with generative AI, large language models (LLMs), computer vision, and other uses leading to tremendous growth in GPU resource needs. As a result, GPU efficiency is an ever-growing focus of infrastructure optimization. With enormous GPU fleet sizes, even small inefficiencies translate into significant cluster bottlenecks…

Source

  •  

Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond

The NVIDIA GB200 NVL72 pushes AI infrastructure to new limits, enabling breakthroughs in training large language models and running scalable, low-latency inference workloads. Increasingly, Kubernetes plays a central role in deploying and scaling these workloads efficiently, whether on premises or in the cloud. However, rapidly evolving AI workloads, infrastructure requirements…

Source

  •  

Join Us for the Blackwell NVFP4 Kernel Hackathon with NVIDIA and GPU MODE

Join the Developer Kernel Hackathon, a four-part performance challenge hosted by NVIDIA in collaboration with GPU MODE and with support from Dell and Sesterce. Push the limits of GPU performance by optimizing low-level kernels for maximum efficiency on NVIDIA hardware. Compete for the chance to win the latest hardware for accelerated computing.

Source

  •  

Streamline AI Infrastructure with NVIDIA Run:ai on Microsoft Azure

Modern AI workloads, ranging from large-scale training to real-time inference, demand dynamic access to powerful GPUs. However, Kubernetes environments have limited native support for GPU management, which leads to challenges such as inefficient GPU utilization, lack of workload prioritization and preemption, limited visibility into GPU consumption, and difficulty enforcing governance and quota…
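For context on the governance gap described above, native Kubernetes offers only a static per-namespace cap on GPU requests via ResourceQuota; a minimal illustration (the name, namespace, and limit value are hypothetical):

```yaml
# Static per-namespace GPU cap -- the baseline that schedulers like
# NVIDIA Run:ai extend with dynamic sharing, prioritization, and preemption.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota          # hypothetical name
  namespace: team-a        # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs requested at once
```

A static cap like this cannot reassign idle GPUs to another team or preempt low-priority work, which is the utilization problem the post addresses.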

Source
