
Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure. Consider two teams with equal priority sharing a cluster.
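As a rough illustration of the idea (not the actual KAI Scheduler algorithm), time-based fairshare can be thought of as tracking each team's accumulated GPU-time and preferring the team that has consumed the least when handing out over-quota resources. The `Team` type and `pick_next_team` helper below are hypothetical:

```python
# Illustrative sketch only -- not the KAI Scheduler implementation.
# Model: teams with equal priority compete for over-quota GPUs, and the
# next grant goes to the team with the least accumulated GPU-time.

from dataclasses import dataclass

@dataclass
class Team:
    name: str
    gpu_seconds_used: float = 0.0  # accumulated over-quota GPU-time

def pick_next_team(teams: list[Team]) -> Team:
    """Grant the next over-quota GPU to the team with the least usage."""
    return min(teams, key=lambda t: t.gpu_seconds_used)

teams = [Team("team-a", gpu_seconds_used=3600), Team("team-b", gpu_seconds_used=600)]
winner = pick_next_team(teams)
print(winner.name)  # team-b has used less GPU-time, so it goes next
```

Under this model, a team that has monopolized spare GPUs for hours gradually loses preference to its peer, which is the fairness property the post describes.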

Source

Enabling Horizontal Autoscaling of Enterprise RAG Components on Kubernetes

Today’s best AI agents rely on retrieval-augmented generation (RAG) to enable more accurate results. A RAG system facilitates the use of a knowledge base to augment context to large language models (LLMs). A typical design pattern includes a RAG server that accepts prompt queries, consults a vector database for nearest context vectors, and then redirects the query with the appended context to an…
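The design pattern in that teaser can be sketched in a few lines. This is a minimal illustration of the flow (accept a query, retrieve nearest context from a vector database, append it to the prompt, forward to the LLM); `vector_db.search` and `call_llm` are hypothetical stand-ins, not a specific library API:

```python
# Minimal sketch of the RAG request flow described above.
# `vector_db` and `call_llm` are assumed interfaces, not a real SDK.

def answer(query: str, vector_db, call_llm, top_k: int = 3) -> str:
    # 1. Consult the vector database for the nearest context chunks.
    context_chunks = vector_db.search(query, top_k=top_k)
    # 2. Append the retrieved context to the original prompt.
    prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {query}"
    # 3. Forward the augmented prompt to the LLM and return its answer.
    return call_llm(prompt)
```

In production, each of these stages (RAG server, vector database, LLM backend) scales independently, which is what makes horizontal autoscaling of the individual components useful.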

Source

Automate Kubernetes AI Cluster Health with NVSentinel

Kubernetes underpins a large portion of all AI workloads in production. Yet, maintaining GPU nodes and ensuring that applications are running, training jobs are progressing, and traffic is served across Kubernetes clusters is easier said than done. NVSentinel is designed to help with these challenges. An open source system for Kubernetes AI clusters, NVSentinel continuously monitors GPU…

Source

Streamline Complex AI Inference on Kubernetes with NVIDIA Grove

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components: prefill, decode, vision encoders, key value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval…

Source
