Streamlining CUB with a Single-Call API

The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1…
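As a quick illustration of the boilerplate the post refers to, the sketch below shows the traditional two-phase CUB pattern with cub::DeviceReduce::Sum: the first call passes a null workspace pointer so CUB only reports the temporary-storage size, the caller allocates that storage, and the same call is repeated to actually run the reduction. The function name two_phase_reduce and the pointers d_in and d_out are illustrative assumptions, not code from the post, and the new CUDA 13.1 single-call API itself is not reproduced here.

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch of the traditional two-phase CUB pattern. Assumes d_in and d_out
// are valid device pointers and num_items is the input length.
void two_phase_reduce(const float* d_in, float* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: with a null workspace pointer, CUB only writes the number of
    // temporary-storage bytes it needs into temp_storage_bytes.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // The caller allocates the requested workspace...
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: ...and repeats the identical call to perform the reduction.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}

According to the teaser, the single-call API introduced in CUDA 13.1 collapses this query-allocate-repeat sequence into a single invocation, which is what removes the repetitive boilerplate.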

Source

  •  

Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops

Machine learning interatomic potentials (MLIPs) are transforming the landscape of computational chemistry and materials science. MLIPs enable atomistic simulations that combine the fidelity of computationally expensive quantum chemistry with the scaling power of AI. Yet, developers working at this intersection face a persistent challenge: the lack of a robust, Pythonic toolbox for GPU…

Source

  •  

Democratizing Large-Scale Mixture-of-Experts Training with NVIDIA PyTorch Parallelism

Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For most developers, the challenge wasn’t building smarter models; it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank. With NVIDIA NeMo Automodel, an open-source library within NVIDIA NeMo…

Source

  •  

Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes

Training models with billions or trillions of parameters demands advanced parallel computing. Researchers must decide how to combine parallelism strategies, select the most efficient accelerated libraries, and integrate low-precision formats such as FP8 and FP4, all without sacrificing speed or memory. There are accelerated frameworks that help, but adapting to these specific methodologies…

Source
