Streamlining CUB with a Single-Call API
The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional "two-phase" API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1...
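To make the boilerplate concrete, here is a minimal sketch of the traditional two-phase pattern using `cub::DeviceReduce::Sum`. The helper name `sum_two_phase` is hypothetical and not from the post, and error checking is omitted for brevity. The single-call form is not shown because its exact signature does not appear in this excerpt.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (an assumption for illustration, not from the post):
// sums a host vector on the GPU with the classic two-phase CUB pattern.
// Error checking is omitted to keep the sketch short.
int sum_two_phase(const std::vector<int>& h_in)
{
    const int num_items = static_cast<int>(h_in.size());

    // Device buffers for the input and the single-element result.
    int* d_in  = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int),
               cudaMemcpyHostToDevice);

    // Phase 1: with a null temp-storage pointer, CUB does no work and
    // only writes the required scratch size into temp_storage_bytes.
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // The caller owns the allocation between the two phases.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the same call, now with real scratch space, runs the reduction.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // Blocking copy on the default stream, so it waits for the reduction.
    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_temp_storage);
    cudaFree(d_out);
    cudaFree(d_in);
    return h_out;
}
```

The algorithm call appears twice with identical arguments, and the size query, allocation, and cleanup surround it every time a primitive is used. That repetition is the boilerplate the excerpt refers to.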