
Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure. Consider two teams with equal priority sharing a cluster.
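As a rough illustration of the idea (not the actual KAI Scheduler algorithm), time-based fairshare can be thought of as tracking each team's accumulated GPU-time and preferring the team that has consumed the least when handing out over-quota resources. The `Team` type and `pick_next_team` helper below are hypothetical:

```python
# Illustrative sketch only -- not the KAI Scheduler implementation.
# Model: teams with equal priority compete for over-quota GPUs, and the
# next grant goes to the team with the least accumulated GPU-time.

from dataclasses import dataclass

@dataclass
class Team:
    name: str
    gpu_seconds_used: float = 0.0  # accumulated over-quota GPU-time

def pick_next_team(teams: list[Team]) -> Team:
    """Grant the next over-quota GPU to the team with the least usage."""
    return min(teams, key=lambda t: t.gpu_seconds_used)

teams = [Team("team-a", gpu_seconds_used=3600), Team("team-b", gpu_seconds_used=600)]
winner = pick_next_team(teams)
print(winner.name)  # team-b has used less GPU-time, so it goes next
```

Under this model, a team that has monopolized spare GPUs for hours gradually loses preference to its peer, which is the fairness property the post describes.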

Source

Enabling Horizontal Autoscaling of Enterprise RAG Components on Kubernetes

Today’s best AI agents rely on retrieval-augmented generation (RAG) to enable more accurate results. A RAG system facilitates the use of a knowledge base to augment context to large language models (LLMs). A typical design pattern includes a RAG server that accepts prompt queries, consults a vector database for nearest context vectors, and then redirects the query with the appended context to an…
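The design pattern in that teaser can be sketched in a few lines. This is a minimal illustration of the flow (accept a query, retrieve nearest context from a vector database, append it to the prompt, forward to the LLM); `vector_db.search` and `call_llm` are hypothetical stand-ins, not a specific library API:

```python
# Minimal sketch of the RAG request flow described above.
# `vector_db` and `call_llm` are assumed interfaces, not a real SDK.

def answer(query: str, vector_db, call_llm, top_k: int = 3) -> str:
    # 1. Consult the vector database for the nearest context chunks.
    context_chunks = vector_db.search(query, top_k=top_k)
    # 2. Append the retrieved context to the original prompt.
    prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {query}"
    # 3. Forward the augmented prompt to the LLM and return its answer.
    return call_llm(prompt)
```

In production, each of these stages (RAG server, vector database, LLM backend) scales independently, which is what makes horizontal autoscaling of the individual components useful.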

Source

Automate Kubernetes AI Cluster Health with NVSentinel

Kubernetes underpins a large portion of all AI workloads in production. Yet, maintaining GPU nodes and ensuring that applications are running, training jobs are progressing, and traffic is served across Kubernetes clusters is easier said than done. NVSentinel is designed to help with these challenges. An open source system for Kubernetes AI clusters, NVSentinel continuously monitors GPU…

Source

Streamline Complex AI Inference on Kubernetes with NVIDIA Grove

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components: prefill, decode, vision encoders, key value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval…

Source
