Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton

NVIDIA CUDA Tile is a GPU programming model that targets portability for NVIDIA Tensor Cores, unlocking peak GPU performance. One of the great things about CUDA Tile is that you can build your own DSL on top of it. This post shares the work NVIDIA is doing to integrate CUDA Tile as a backend for OpenAI Triton, an open source Python DSL designed for writing deep learning (DL) kernels for GPUs.
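
To make the target concrete, here is a minimal Triton kernel of the kind such a backend would compile, written as a plain vector add. This is a generic Triton sketch (assuming triton and torch are installed), not code specific to the CUDA Tile backend:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # one program instance per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                   # 1D launch grid
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out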

Source

  •  

Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

Sparse tensors are vectors, matrices, and higher-dimensional generalizations with many zeros. They are crucial in various fields such as scientific computing, signal processing, and deep learning due to their efficiency in storage, computation, and power. Despite their benefits, handling sparse tensors manually or through existing libraries is often cumbersome, error-prone, nonportable…
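
As a quick illustration of the storage argument, the sketch below keeps a mostly-zero matrix in compressed sparse row (CSR) form using SciPy; the numbers are made up, and SciPy serves only as a familiar stand-in, not as the universal sparse tensor interface the post describes:

import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000), dtype=np.float32)
dense[0, 3] = 1.5
dense[10, 997] = -2.0
dense[500, 500] = 4.0

sparse = csr_matrix(dense)
print(dense.nbytes)            # 4,000,000 bytes for the dense array
print(sparse.data.nbytes       # nonzero values
      + sparse.indices.nbytes  # column indices
      + sparse.indptr.nbytes)  # row pointers: a few KB instead of ~4 MB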

Source

  •  

Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk

AI coding agents enable developers to work faster by streamlining tasks and driving automated, test-driven development. However, they also introduce a significant, often overlooked attack surface by running tools from the command line with the same permissions and entitlements as the user, making them computer-use agents, with all the risks those entail. The primary threat to these tools is…
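
As one hedged illustration of reducing that execution risk, the sketch below runs agent-proposed commands without a shell, against an allowlist, in a throwaway working directory with a stripped environment and a timeout. It is a generic mitigation pattern, not the post's specific guidance, and no substitute for OS- or container-level sandboxing:

import shlex
import subprocess
import tempfile

ALLOWED_BINARIES = {"ls", "cat", "grep", "python3"}  # illustrative allowlist

def run_agent_command(command: str) -> str:
    argv = shlex.split(command)                      # no shell: no pipes, redirects, or expansion
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise PermissionError(f"command not allowed: {argv[:1]}")
    with tempfile.TemporaryDirectory() as workdir:   # isolate file writes
        result = subprocess.run(
            argv,
            cwd=workdir,
            env={"PATH": "/usr/bin:/bin"},           # minimal environment
            capture_output=True,
            text=True,
            timeout=30,
        )
    return result.stdout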

Source

  •  

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure. Consider two teams with equal priority sharing a cluster.
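
To make "fair share with time awareness" concrete, here is a toy sketch in which spare (over-quota) GPUs go to the team with the least recency-weighted over-quota usage so far. The decay factor and bookkeeping are assumptions for illustration only; this is not the KAI Scheduler's actual algorithm:

DECAY = 0.5  # weight applied to past usage each scheduling window (assumed)

class Team:
    def __init__(self, name):
        self.name = name
        self.over_quota_gpu_hours = 0.0

def pick_team_for_spare_gpu(teams):
    # Equal-priority teams: least accumulated over-quota usage wins.
    return min(teams, key=lambda t: t.over_quota_gpu_hours)

def end_of_window(teams, usage_this_window):
    # Record this window's over-quota usage, decaying older history.
    for team in teams:
        team.over_quota_gpu_hours = (
            DECAY * team.over_quota_gpu_hours + usage_this_window.get(team.name, 0.0)
        )

teams = [Team("team-a"), Team("team-b")]
print(pick_team_for_spare_gpu(teams).name)  # ties broken arbitrarily at first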

Source

  •  

Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core

This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per microbatch to efficiently handle variable-length sequences, achieving up to 1.48x speedup on real-world datasets. In large-scale model training, an often-overlooked bottleneck arises from the…
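
The heuristic below sketches what "selecting the CP size per microbatch" can look like: choose the smallest context-parallel group that keeps each GPU's share of the sequence under a token budget. The budget and power-of-two policy are assumptions for illustration, not Megatron Core's actual selection logic:

def pick_cp_size(seq_len: int, tokens_per_gpu_budget: int = 8192, max_cp: int = 8) -> int:
    # Grow the context-parallel group only as far as the sequence length demands.
    cp = 1
    while cp < max_cp and seq_len // cp > tokens_per_gpu_budget:
        cp *= 2
    return cp

for seq_len in (2_000, 20_000, 120_000):
    print(seq_len, "->", pick_cp_size(seq_len))   # short sequences avoid CP overhead entirely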

Source

  •  

Updating Classifier Evasion for Vision Language Models

Advances in AI architectures have unlocked multimodal functionality, enabling transformer models to process multiple forms of data in the same context. For instance, vision language models (VLMs) can generate output from combined image and text input, enabling developers to build systems that interpret graphs, process camera feeds, or operate with traditionally human interfaces like desktop…

Source

  •  

Accelerating Diffusion Models with an Open, Plug-and-Play Offering

Recent advances in large-scale diffusion models have revolutionized generative AI across multiple domains, from image synthesis to audio generation, 3D asset creation, molecular design, and beyond. These models have demonstrated unprecedented capabilities in producing high-quality, diverse outputs across various conditional generation tasks. Despite these successes…

Source

  •  

Adaptive Inference in NVIDIA TensorRT for RTX Enables Automatic Optimization

Deploying AI applications across diverse consumer hardware has traditionally forced a trade-off. You can optimize for specific GPU configurations and achieve peak performance at the cost of portability. Alternatively, you can build generic, portable engines and leave performance on the table. Bridging this gap often requires manual tuning, multiple build targets, or accepting compromises.

Source

  •  

How to Unlock Local Detail in Coarse Climate Projections with NVIDIA Earth-2

Global climate models are good at the big picture—but local climate extremes, like hurricanes and typhoons, often disappear in the details. Those patterns are still there—you just need the right tools to unlock them in high-resolution climate data. Using NVIDIA Earth‑2, this blog post shows you how to downscale coarse climate projections into higher-resolution, bias‑corrected fields—revealing…
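
For orientation only, the toy below shows the two terms in their simplest statistical form: interpolate a coarse field to a finer grid, then apply a crude mean-bias correction against a high-resolution reference. The arrays are random placeholders, and the Earth-2 workflow in the post uses generative downscaling models that recover far more local structure than this baseline:

import numpy as np
from scipy.ndimage import zoom

coarse = np.random.rand(45, 90).astype(np.float32)              # stand-in coarse projection
hires_reference = np.random.rand(180, 360).astype(np.float32)   # stand-in reference climatology

downscaled = zoom(coarse, 4, order=1)                # bilinear interpolation to 4x resolution
bias = downscaled.mean() - hires_reference.mean()    # crude domain-mean bias
corrected = downscaled - bias
print(corrected.shape)                               # (180, 360)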

Source

  •  

Scaling NVFP4 Inference for FLUX.2 on NVIDIA Blackwell Data Center GPUs

In 2025, NVIDIA partnered with Black Forest Labs (BFL) to optimize the FLUX.1 text-to-image model series, unlocking FP4 image generation performance on NVIDIA Blackwell GeForce RTX 50 Series GPUs. As a natural extension of the latent diffusion model, FLUX.1 Kontext [dev] proved that in-context learning is a feasible technique for visual-generation models, not just large language models (LLMs).

Source

  •  

Streamlining CUB with a Single-Call API

The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code. This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1…

Source

  •  

How to Train an AI Agent for Command-Line Tasks with Synthetic Data and Reinforcement Learning

What if your computer-use agent could learn a new Command Line Interface (CLI)—and operate it safely without ever writing files or free-typing shell commands? In Part 1 of our series on building a computer-use agent, we built a custom Bash computer-use agent using NVIDIA Nemotron in just one hour. In this sequel, we’ll take it further by teaching the same reasoning model with no prior…
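
One way to read "without ever free-typing shell commands" is a constrained action space: the model picks a named action with typed arguments, and the harness maps it to a fixed argv. The action names below (git_status, list_dir, read_file) are hypothetical, not taken from the post:

import subprocess

ACTIONS = {
    "git_status": lambda args: ["git", "status", "--short"],
    "list_dir":   lambda args: ["ls", "-la", args["path"]],
    "read_file":  lambda args: ["cat", args["path"]],
}

def execute_action(name: str, args: dict) -> str:
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")
    argv = ACTIONS[name](args)
    # shell=False and a fixed argv mean the model can never inject arbitrary commands
    return subprocess.run(argv, capture_output=True, text=True, timeout=10).stdout

print(execute_action("list_dir", {"path": "."}))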

Source

  •  

How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile

This blog post is part of a series designed to help developers learn NVIDIA CUDA Tile programming for building high-performance GPU kernels, using matrix multiplication as a core example. In this post, you’ll learn how to write a high-performance matrix multiply in CUDA Tile. Before you begin, be sure your environment meets the following requirements (see the quickstart for more information): Install…

Source

  •  

NVIDIA DLSS 4.5 Delivers Super Resolution Upgrades and New Dynamic Multi Frame Generation

NVIDIA DLSS 4 with Multi Frame Generation has become the fastest-adopted NVIDIA gaming technology ever. Over 250 games and apps use it to make real-time path tracing possible—and upcoming titles for 2026, including PRAGMATA and Resident Evil Requiem, also plan to incorporate the software. At CES 2026, the technology became even more powerful. NVIDIA introduced DLSS 4.5…

Source

  •  

Learn How NVIDIA cuOpt Accelerates Mixed Integer Optimization using Primal Heuristics

NVIDIA cuOpt is a GPU-accelerated optimization engine designed to deliver fast, high-quality solutions for large, complex decision-making problems. Mixed integer programming (MIP) is a technique for solving problems that can be modeled by a set of linear constraints, where some of the variables may assume only integer values. The types of problems that can be modeled as MIP are numerous and…
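
A tiny worked example of that problem class, solved here with SciPy's milp purely to show the shape of a MIP (linear objective, linear constraints, integer-restricted variables); cuOpt has its own APIs, which this does not represent:

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# maximize 3*x0 + 2*x1  ==  minimize -(3*x0 + 2*x1)
c = np.array([-3.0, -2.0])
constraints = LinearConstraint(np.array([[1.0, 1.0],
                                         [2.0, 1.0]]), ub=np.array([4.0, 5.0]))
integrality = np.array([1, 1])            # both variables restricted to integers
bounds = Bounds(lb=0, ub=np.inf)

result = milp(c=c, constraints=constraints, integrality=integrality, bounds=bounds)
print(result.x, -result.fun)              # optimum [1. 3.] with objective value 9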

Source

  •  

Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time

We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste the earlier context back into the chat for LLMs to “get it”. A smart co-worker would pick up on these patterns, adapt…
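
The sketch below illustrates the general mechanism of treating context as training data, using a small public model: take a few gradient steps on the context before generating. The example text is made up, and this shows only the broad idea, not the specific method the post discusses:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

context = "Our internal tool is invoked as `flowctl sync --project demo`."  # assumed example text
inputs = tokenizer(context, return_tensors="pt")

model.train()
for _ in range(3):                                   # a few quick adaptation steps on the context
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
prompt = tokenizer("To sync the demo project, run", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=20)[0]))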

Source

  •  

Multi-Agent Warehouse AI Command Layer Enables Operational Excellence and Supply Chain Intelligence

Warehouses have never been more automated, more data-rich, or more operationally demanding than they are now—yet they still rely on systems that can’t keep up. Throughput is rising, SLAs are shrinking, and fleets of autonomous mobile robots (AMRs), conveyors, and sensors expand every year. But beneath that technological surface, most sites still rely on a familiar trio: a Warehouse Management System (WMS)…

Source

  •  

Build an AI Catalog System That Delivers Localized, Interactive Product Experiences

E-commerce catalogs often contain sparse product data: generic images, a basic title, and a short description. This limits discoverability, engagement, and conversion. Manual enrichment doesn’t scale because it relies on catalog managers to write descriptions, apply tags, and categorize products by hand. The process is slow, inconsistent, and error-prone. This tutorial shows developers…
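
As a hedged sketch of what automated enrichment can look like, the snippet below calls an OpenAI-compatible chat endpoint to draft a localized description and tags from sparse fields. The endpoint URL, model name, and fields are placeholders, not the tutorial's actual stack:

from openai import OpenAI

client = OpenAI(base_url="https://example-llm-endpoint/v1", api_key="YOUR_KEY")  # placeholder endpoint

def enrich_product(title: str, short_description: str, locale: str = "de-DE") -> str:
    prompt = (
        f"Product title: {title}\n"
        f"Current description: {short_description}\n"
        f"Write a detailed, localized ({locale}) product description and five search tags."
    )
    response = client.chat.completions.create(
        model="example/instruct-model",               # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(enrich_product("Trail Running Shoes", "Lightweight shoes for off-road running."))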

Source

  •  

Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users—from consumers to enterprises—to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt. Through extreme co-design across GPUs, CPUs…

Source

  •  

Building Generalist Humanoid Capabilities with NVIDIA Isaac GR00T N1.6 Using a Sim-to-Real Workflow 

To make humanoid robots useful, they need cognition and loco-manipulation that span perception, planning, and whole-body control in dynamic environments. Building these generalist robots requires a workflow that unifies simulation, control, and learning for robots to acquire complex skills before transferring into the real world. In this post, we present NVIDIA Isaac GR00T N1.6…

Source

  •