
Streamlining CUB with a Single-Call API

21 January 2026 at 21:28

The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional "two-phase" API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1…
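For context, the two-phase pattern the post refers to looks roughly like the sketch below. It is a minimal illustration using cub::DeviceReduce::Sum with error handling omitted; it shows the existing boilerplate, not the new single-call interface, whose exact CUDA 13.1 signature is described in the linked post.

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Minimal sketch of the traditional two-phase CUB pattern:
// the same primitive is called twice, first to query the size of the
// temporary storage it needs, then to do the actual work.
void sum_with_cub(const int *d_in, int *d_out, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: d_temp_storage == nullptr, so CUB only writes the
    // required scratch size into temp_storage_bytes.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // The caller allocates the scratch space (the boilerplate the post refers to).
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the same call again, now performing the reduction.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}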

Source

Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops

19 December 2025 at 17:00

Machine learning interatomic potentials (MLIPs) are transforming the landscape of computational chemistry and materials science. MLIPs enable atomistic simulations that combine the fidelity of computationally expensive quantum chemistry with the scaling power of AI. Yet developers working at this intersection face a persistent challenge: the lack of a robust, Pythonic toolbox for GPU…

Source

Democratizing Large-Scale Mixture-of-Experts Training with NVIDIA PyTorch Parallelism

6 November 2025 at 17:00

Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For most developers, the challenge wasn't building smarter models; it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank. With NVIDIA NeMo Automodel, an open-source library within NVIDIA NeMo…

Source

Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes

5 November 2025 at 16:00

Training models with billions or trillions of parameters demands advanced parallel computing. Researchers must decide how to combine parallelism strategies, select the most efficient accelerated libraries, and integrate low-precision formats such as FP8 and FP4, all without sacrificing speed or memory. There are accelerated frameworks that help, but adapting to these specific methodologies…

Source
