Automate Kubernetes AI Cluster Health with NVSentinel

8 December 2025 at 18:00

Kubernetes underpins a large portion of all AI workloads in production. Yet, maintaining GPU nodes and ensuring that applications are running, training jobs are progressing, and traffic is served across Kubernetes clusters is easier said than done. NVSentinel is designed to help with these challenges. An open source system for Kubernetes AI clusters, NVSentinel continuously monitors GPU…

Building Scalable and Fault-Tolerant NCCL Applications

NVIDIA Technical Blog

By: Luke Robison

10 November 2025 at 21:29

The NVIDIA Collective Communications Library (NCCL) provides communication APIs for low-latency and high-bandwidth collectives, enabling AI workloads to scale from just a few GPUs on a single host to thousands of GPUs in a data center. This post discusses NCCL features that support run-time rescaling for cost optimization, as well as minimizing service downtime from faults by dynamically removing…

Streamline Complex AI Inference on Kubernetes with NVIDIA Grove

NVIDIA Technical Blog

By: Sanjay Chatterjee

10 November 2025 at 14:00

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components—prefill, decode, vision encoders, key value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval…

Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond

NVIDIA Technical Blog

By: Kevin Klues

10 November 2025 at 14:00

The NVIDIA GB200 NVL72 pushes AI infrastructure to new limits, enabling breakthroughs in training large language models and running scalable, low-latency inference workloads. Increasingly, Kubernetes plays a central role in deploying and scaling these workloads efficiently, whether on-premises or in the cloud. However, rapidly evolving AI workloads, infrastructure requirements…

Streamline AI Infrastructure with NVIDIA Run:ai on Microsoft Azure

NVIDIA Technical Blog

By: Julie Adrounie

30 October 2025 at 17:10

Modern AI workloads, ranging from large-scale training to real-time inference, demand dynamic access to powerful GPUs. However, Kubernetes environments have limited native support for GPU management, which leads to challenges such as inefficient GPU utilization, lack of workload prioritization and preemption, limited visibility into GPU consumption, and difficulty enforcing governance and quota…
