Next-Generation AI Factory Telemetry with NVIDIA Spectrum-X Ethernet

11 December 2025 at 19:03
As AI data centers rapidly evolve into AI factories, traditional network monitoring methods are no longer sufficient. Workloads continue to grow in complexity and infrastructures scale rapidly, making real-time, high-frequency insights critical. The need for effective system monitoring has never been greater. This post explores how high-frequency sampling and advanced telemetry techniques…

Enhancing Communication Observability of AI Workloads with NCCL Inspector

10 December 2025 at 21:45
When using the NVIDIA Collective Communications Library (NCCL) to run a deep learning training or inference workload that uses collective operations (such as AllReduce, AllGather, and ReduceScatter), it can be challenging to determine how NCCL is performing during the actual workload run. This post introduces the NCCL Inspector Profiler Plugin, which addresses this problem. It offers a way for…

Building Scalable and Fault-Tolerant NCCL Applications

10 November 2025 at 21:29
The NVIDIA Collective Communications Library (NCCL) provides communication APIs for low-latency and high-bandwidth collectives, enabling AI workloads to scale from just a few GPUs on a single host to thousands of GPUs in a data center. This post discusses NCCL features that support run-time rescaling for cost optimization, as well as minimizing service downtime from faults by dynamically removing…
