
Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

30 January 2026 at 18:00

Sparse tensors are vectors, matrices, and higher-dimensional generalizations with many zeros. They are crucial in fields such as scientific computing, signal processing, and deep learning because of their efficiency in storage, computation, and power consumption. Despite their benefits, handling sparse tensors manually or through existing libraries is often cumbersome, error-prone, nonportable…
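The post's Universal Sparse Tensor interface is not reproduced in this excerpt; as a rough, hedged illustration of the storage savings sparse formats provide, the sketch below builds a COO-format matrix with SciPy (a stand-in chosen for this example, not the article's API) and compares its footprint to the dense equivalent.

# Generic COO (coordinate) sparse storage with SciPy; not the Universal Sparse Tensor API.
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)
n = 4_000
dense = np.zeros((n, n), dtype=np.float32)

# Fill roughly 0.06% of the entries with nonzero values.
rows = rng.integers(0, n, size=10_000)
cols = rng.integers(0, n, size=10_000)
dense[rows, cols] = rng.standard_normal(10_000).astype(np.float32)

sparse = coo_matrix(dense)  # keeps only (row, col, value) triples

dense_mb = dense.nbytes / 1e6
sparse_mb = (sparse.data.nbytes + sparse.row.nbytes + sparse.col.nbytes) / 1e6
print(f"dense:  {dense_mb:.1f} MB")   # 64.0 MB
print(f"sparse: {sparse_mb:.1f} MB")  # well under 1 MB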

Source

How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile

14 January 2026 at 20:41

This blog post is part of a series designed to help developers learn NVIDIA CUDA Tile programming for building high-performance GPU kernels, using matrix multiplication as a core example. Before you begin, be sure your environment meets the requirements listed in the quickstart: Install…
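The post's actual CUDA Tile kernel is not shown in this excerpt; as a hedged, CPU-only sketch of the tiling idea it is built around, the following NumPy code decomposes a matrix multiply into fixed-size tiles. The tile size and matrix shapes are arbitrary choices for illustration.

# CPU-only sketch of a tile-decomposed matrix multiply in NumPy.
# Illustrates the tiling concept only; it is not the CUDA Tile API.
import numpy as np

TILE = 64  # tile edge length (assumed to divide the matrix dimensions)

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    c = np.zeros((m, n), dtype=a.dtype)
    # Each (i, j) output tile accumulates products of A's row tiles and B's column tiles.
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                c[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)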

Source

Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11

16 December 2025 at 18:00

Simulating large-scale quantum computers has become more difficult as the quality of quantum processing units (QPUs) improves. Validating the results is key to ensuring that, once the devices scale beyond what is classically simulable, we can still trust the outputs. Similarly, when generating large-scale datasets for various AI models that aim to aid in the operation of quantum processors…
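cuQuantum's own APIs are not shown in this excerpt; as a hedged, minimal illustration of what "classically simulable" means and why it has a hard limit, the sketch below simulates a two-qubit Bell state with a plain NumPy statevector and prints how statevector memory grows as qubits are added.

# Minimal statevector sketch in NumPy; not cuQuantum's API.
# A full statevector holds 2**n complex amplitudes, so memory doubles with each qubit.
import numpy as np

def bell_state() -> np.ndarray:
    # |00> -> (|00> + |11>) / sqrt(2) via a Hadamard on qubit 0 and a CNOT.
    h = np.array([[1, 1], [1, -1]], dtype=np.complex128) / np.sqrt(2)
    cnot = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]], dtype=np.complex128)
    state = np.zeros(4, dtype=np.complex128)
    state[0] = 1.0  # start in |00>
    state = np.kron(h, np.eye(2)) @ state
    return cnot @ state

print(bell_state())  # amplitudes ~0.707 on |00> and |11>

for n in (20, 30, 40, 50):
    gib = (2**n * 16) / 2**30  # 16 bytes per complex128 amplitude
    print(f"{n} qubits -> {gib:,.2f} GiB of statevector")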

Source

Simplify GPU Programming with NVIDIA CUDA Tile in Python

4 December 2025 at 22:20

The release of NVIDIA CUDA 13.1 introduces tile-based programming for GPUs, making it one of the most fundamental additions to GPU programming since CUDA was invented. Writing GPU tile kernels enables you to write your algorithm at a higher level than a single-instruction multiple-thread (SIMT) model, while the compiler and runtime handle the partitioning of work onto threads under the covers.
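CUDA Tile's Python API is not shown in this excerpt; for contrast, the hedged sketch below is a conventional SIMT kernel written with Numba (it requires an NVIDIA GPU), where the developer computes a per-thread index and picks the launch geometry by hand, the bookkeeping that tile-level programming is described as handling for you.

# A conventional SIMT kernel in Numba CUDA: the programmer indexes individual threads.
# Shown only as the baseline that tile-level programming abstracts away; this is not CUDA Tile.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index, computed per thread
    if i < out.size:          # guard against threads past the end of the array
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # Numba copies the arrays to the GPU implicitly

assert np.allclose(out, a + b)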

Source

Focus on Your Algorithm: NVIDIA CUDA Tile Handles the Hardware

4 December 2025 at 22:20

CUDA 13.1 launches NVIDIA CUDA Tile, the platform's largest advancement since CUDA was invented in 2006. It introduces a virtual instruction set for tile-based parallel programming, letting developers write algorithms at a higher level and abstract away the details of specialized hardware, such as Tensor Cores. CUDA exposes a single…

Source

Achieve CUTLASS C++ Performance with Python APIs Using CuTe DSL

13 November 2025 at 20:30

CuTe, a core component of CUTLASS 3.x, provides a unified algebra for describing data layouts and thread mappings, and abstracts complex memory access patterns into composable mathematical operations. While CUTLASS 3.x and CuTe have empowered kernel developers to achieve peak performance on Tensor Cores through intuitive abstractions, the extensive use of C++ templates has resulted in high…
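The CuTe DSL itself is not reproduced in this excerpt; as a hedged sketch of the layout algebra it describes (a layout as a shape plus strides mapping logical coordinates to memory offsets), the plain-Python snippet below computes offsets for row-major and column-major views of the same 4x8 tile.

# Plain-Python sketch of the shape/stride layout idea behind CuTe; not the CuTe DSL itself.
# A layout maps a logical coordinate to a linear memory offset: offset = sum(coord[i] * stride[i]).

def offset(coord: tuple[int, ...], stride: tuple[int, ...]) -> int:
    return sum(c * s for c, s in zip(coord, stride, strict=True))

shape = (4, 8)                 # a 4x8 tile of elements
row_major = (shape[1], 1)      # strides (8, 1): consecutive columns are adjacent in memory
col_major = (1, shape[0])      # strides (1, 4): consecutive rows are adjacent in memory

print(offset((2, 3), row_major))  # 2*8 + 3*1 = 19
print(offset((2, 3), col_major))  # 2*1 + 3*4 = 14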

Source
