The Rising Risk Profile of CDUs in High-Density AI Data Centers

AI has pushed data center thermal loads to levels the industry has never encountered. Racks that once operated comfortably at 8-15 kW are now climbing past 50-100 kW, driving an accelerated shift toward liquid cooling. This transition is happening so quickly that many organizations are deploying new technologies faster than they can fully understand the operational risks.

In my recent five-part LinkedIn series:

  • 2025 U.S. Data Center Incident Trends & Lessons Learned (9-15-2025)
  • Building Safer Data Centers: How Technology is Changing Construction Safety (10-1-2025)
  • The Future of Zero-Incident Data Centers (10-15-2025)
  • Measuring What Matters: The New Safety Metrics in Data Centers (11-1-2025)
  • Beyond Safety: Building Resilient Data Centers Through Integrated Risk Management (11-15-2025)

— a central theme emerged: as systems become more interconnected, risks become more systemic.

That same dynamic influenced the Direct-to-Chip Cooling: A Technical Primer article that Steve Barberi and I published in Data Center POST (10-29-2025). Today, that systemic-risk framework is emerging most clearly in the growing role of Cooling Distribution Units (CDUs).

CDUs have evolved from peripheral equipment to a true point of convergence for engineering design, controls logic, chemistry, operational discipline, and human performance. As AI rack densities accelerate, understanding these risks is becoming essential.

CDUs: From Peripheral Equipment to Critical Infrastructure

Historically, CDUs were treated as supplemental mechanical devices. Today, they sit at the center of the liquid-cooling ecosystem, governing flow, pressure, temperature stability, fluid quality, isolation, and redundancy. In practice, the CDU now operates as the boundary between stable thermal control and cascading instability.

Yet, unlike well-established electrical systems such as UPSs, switchgear, and feeders, CDUs lack decades of operational history. Operators, technicians, commissioning agents, and even design teams have limited real-world reference points. That blind spot is where a new class of risk is emerging, and three patterns are showing up most frequently.

A New Risk Landscape for CDUs

  • Controls-Layer Fragility
    • Controls-related instability remains one of the most underestimated issues in liquid cooling. Many CDUs still rely on single-path PLC architectures, limited sensor redundancy, and firmware not designed for the thermal volatility of AI workloads. A single inaccurate pressure, flow, or temperature reading can trigger incorrect system responses that affect multiple racks before anyone realizes something is wrong (a minimal sketch of one mitigation follows this list).
  • Pressure and Flow Instability
    • AI workloads surge and cycle, producing heat patterns that stress pumps, valves, gaskets, seals, and manifolds in ways traditional IT never did. These fluctuations are accelerating wear modes that many operators are just beginning to recognize. Illustrative Open Compute Project (OCP) design examples (e.g., 7–10 psi operating ranges at relevant flow rates) are helpful reference points, but they are not universal design criteria.
  • Human-Performance Gaps
    • CDU-related high-potential near misses (HiPo NMs) frequently arise during commissioning and maintenance, when technicians are still learning new workflows. For teams accustomed to legacy air-cooled systems, tasks such as valve sequencing, alarm interpretation, isolation procedures, fluid handling, and leak response are unfamiliar. Unfortunately, as noted in my Building Safer Data Centers post, when technology advances faster than training, people become the first point of vulnerability.
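
To make the controls and pressure-stability risks above concrete, here is a minimal sketch of how a CDU supervisor might treat a single out-of-band or fast-moving pressure reading as something to confirm rather than act on immediately. The 7-10 psi band is borrowed from the illustrative OCP example above; the rate-of-change limit, confirmation count, and class and function names are assumptions for illustration, not vendor or standards requirements.

```python
from collections import deque

# Illustrative operating band from the OCP-style example above
# (an assumption for this sketch, not a universal design criterion).
PRESSURE_BAND_PSI = (7.0, 10.0)
MAX_RATE_PSI_PER_S = 0.5      # assumed plausibility limit on rate of change
CONFIRM_SAMPLES = 3           # require N consecutive bad readings before acting

class PressureGuard:
    """Flags secondary-loop pressure readings that are out of band or changing
    implausibly fast, and only escalates after sustained confirmation."""

    def __init__(self):
        self.history = deque(maxlen=CONFIRM_SAMPLES)
        self.last_reading = None

    def check(self, psi: float, dt_s: float = 1.0) -> str:
        out_of_band = not (PRESSURE_BAND_PSI[0] <= psi <= PRESSURE_BAND_PSI[1])
        too_fast = (
            self.last_reading is not None
            and abs(psi - self.last_reading) / dt_s > MAX_RATE_PSI_PER_S
        )
        self.last_reading = psi
        self.history.append(out_of_band or too_fast)

        if all(self.history) and len(self.history) == CONFIRM_SAMPLES:
            return "ESCALATE"   # sustained anomaly: alarm / controlled response
        if out_of_band or too_fast:
            return "SUSPECT"    # single reading: verify against a redundant sensor
        return "OK"

# Example: one spurious spike is flagged as suspect rather than acted on immediately.
guard = PressureGuard()
for reading in [8.2, 8.3, 14.0, 8.3, 8.2]:
    print(reading, guard.check(reading))
```

The point is not the specific thresholds but the pattern: a single sensor excursion should prompt verification, not an automatic multi-rack response.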

Photo: Borealis CDU
Photo by AGT

Additional Risks Emerging in 2025 Liquid-Cooled Environments

Beyond the three most frequent patterns noted above, several quieter but equally impactful vulnerabilities are also surfacing across 2025 deployments:

  • System Architecture Gaps
    • Some first-generation CDUs and loops lack robust isolation, bypass capability, or multi-path routing. A single point of failure, such as one valve, pump, or PLC, can drive a full-loop shutdown, mirroring the cascading-risk behaviors highlighted in my earlier work on resilience.
  • Maintenance & Operational Variability
    • SOPs for liquid cooling vary widely across sites and vendors. Fluid handling, startup/shutdown sequences, and leak-response steps remain inconsistent, creating conditions for preventable HiPo NMs.
  • Chemistry & Fluid Integrity Risks
    • As highlighted in the DTC article Steve Barberi and I co-authored, corrosion, additive depletion, cross-contamination, and stagnant zones can quietly degrade system health. ICP-MS analysis and other advanced techniques are recommended in OCP-aligned coolant programs for PG-25-class fluids, though not universally required.
  • Leak Detection & Nuisance Alarms
    • False positives and false negatives, especially across BMS/DCIM integrations, remain common. Predictive analytics are becoming essential despite not yet being formalized in standards.
  • Facility-Side Dynamics
    • Upstream conditions such as temperature swings, ΔP fluctuations, water hammer, cooling tower chemistry, and biofouling often drive CDU instability. CDUs are frequently blamed for behavior originating in facility water systems.
  • Interoperability & Telemetry Semantics
    • Inconsistent Modbus, BACnet, and Redfish mappings, naming conventions, and telemetry schemas create confusion and delay troubleshooting.
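
As a small illustration of the interoperability point above, the sketch below maps vendor-specific Modbus, BACnet, and Redfish point names onto one canonical telemetry schema before data reaches the BMS/DCIM layer. The vendor names, point labels, and unit conversions are hypothetical examples, not an actual register map or a published schema.

```python
# Hypothetical vendor point names mapped to a single canonical schema.
# Real register maps and BACnet/Redfish object names vary by vendor.
POINT_MAP = {
    ("vendor_a_modbus", "SupFluidTmp"):  ("supply_temp_c", 1.0),
    ("vendor_b_bacnet", "CDU-SUP-TEMP"): ("supply_temp_c", 1.0),
    ("vendor_c_redfish", "SupplyTempF"): ("supply_temp_c", None),      # needs F -> C
    ("vendor_a_modbus", "SecLoopDP"):    ("secondary_dp_psi", 1.0),
    ("vendor_b_bacnet", "CDU-DP-KPA"):   ("secondary_dp_psi", 0.145038),  # kPa -> psi
}

def normalize(source: str, point: str, value: float) -> tuple[str, float]:
    """Translate a vendor-specific reading into (canonical_name, value)."""
    canonical, scale = POINT_MAP[(source, point)]
    if scale is None:                       # special-case unit conversion
        value = (value - 32.0) * 5.0 / 9.0  # Fahrenheit -> Celsius
    else:
        value = value * scale
    return canonical, value

# Three vendors reporting the same physical quantity end up under one name.
print(normalize("vendor_a_modbus", "SupFluidTmp", 30.0))
print(normalize("vendor_c_redfish", "SupplyTempF", 86.0))
```

Normalizing names and units at the edge keeps downstream alarming and analytics consistent even when CDU vendors or protocols change.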

Best Practices: Designing CDUs for Resilience, Not Just Cooling Capacity

If CDUs are going to serve as the cornerstone of liquid cooling in AI environments, they must be engineered around resilience, not simply performance. Several emerging best practices are gaining traction:

  1. Controls Redundancy
    • Dual PLCs, dual sensors, and cross-validated telemetry signals reduce single-point failure exposure. These features do not have prescriptive standards today but are rapidly emerging as best practices for high-density AI environments.
  2. Real-Time Telemetry & Predictive Insight
    • Detecting drift, seal degradation, valve lag, and chemistry shift early is becoming essential. Predictive analytics and deeper telemetry integration are increasingly expected; a minimal sketch of practices 1 and 2 follows this list.
  3. Meaningful Isolation
    • Operators should be able to isolate racks, lines, or nodes without shutting down entire loops. In high-density AI environments, isolation becomes uptime.
  4. Failure-Mode Commissioning
    • CDUs should be tested not only for performance but also for failure behavior such as PLC loss, sensor failures, false alarms, and pressure transients. These simulations reveal early-life risk patterns that standard commissioning often misses.
  5. Reliability Expectations
    • CDU design should align with OCP’s system-level reliability expectations, such as MTBF targets on the order of >300,000 hours for OAI Level 10 assemblies, while recognizing that CDU-specific requirements vary by vendor and application.
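
As a brief sketch of practices 1 and 2, the example below combines median voting across redundant temperature sensors (so a single failed transducer cannot steer the loop) with an exponentially weighted moving average that flags slow drift away from a commissioning baseline. The thresholds, smoothing factor, and sensor counts are illustrative assumptions, not values drawn from any standard or vendor manual.

```python
import statistics

EWMA_ALPHA = 0.1      # assumed smoothing factor for drift tracking
DRIFT_LIMIT_C = 1.5   # assumed allowable slow deviation from baseline, in deg C

def vote(readings: list[float]) -> float:
    """Median of redundant sensors: one failed or spurious channel is outvoted."""
    return statistics.median(readings)

class DriftMonitor:
    """Tracks a slowly moving EWMA of the voted value against a commissioning
    baseline; sustained divergence suggests fouling, seal wear, or sensor drift."""

    def __init__(self, baseline: float):
        self.baseline = baseline
        self.ewma = baseline

    def update(self, voted_value: float) -> bool:
        self.ewma = EWMA_ALPHA * voted_value + (1 - EWMA_ALPHA) * self.ewma
        return abs(self.ewma - self.baseline) > DRIFT_LIMIT_C  # True = drift flagged

# Example: one stuck sensor (45.0) is outvoted; a slow 0.05 deg C per sample
# warming trend eventually pushes the EWMA past the drift limit.
monitor = DriftMonitor(baseline=32.0)
for step in range(80):
    readings = [32.0 + 0.05 * step, 45.0, 32.1 + 0.05 * step]
    if monitor.update(vote(readings)):
        print(f"drift flagged at step {step}, EWMA = {monitor.ewma:.2f}")
        break
```

On practice 5, the MTBF figure only becomes an availability number once a repair time is assumed: with a hypothetical 8-hour mean time to repair, 300,000 / (300,000 + 8) works out to roughly 99.997% inherent availability, which is why meaningful isolation and concurrent maintainability matter as much as the MTBF target itself.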

Standards Alignment

The risks and mitigation strategies outlined above align with emerging guidance from ASHRAE TC 9.9 and the OCP’s liquid-cooling workstreams, including:

  • OAI System Liquid Cooling Guidelines
  • Liquid-to-Liquid CDU Test Methodology
  • ASTM D8040 & D1384 for coolant chemistry durability
  • IEC/UL 62368-1 for hazard-based safety
  • ASHRAE 90.4, PUE/WUE/CUE metrics, and
  • ANSI/BICSI 002, ISO/IEC 22237, and the Uptime Institute’s Tier Standards, which emphasize concurrently maintainable infrastructure.

These collectively reinforce a shift: CDUs must be treated as availability-critical systems, not auxiliary mechanical devices.

Looking Ahead

The rise of CDUs represents a moment the data center industry has seen before. As soon as a new technology becomes mission-critical, its risk profile expands until safety, engineering, and operations converge around it. Twenty years ago, that moment belonged to UPS systems. Ten years ago, it was batteries. Now, in AI-driven environments, it is the CDU.

Organizations that embrace resilient CDU design, deep visibility, and operator readiness will be the ones that scale AI safely and sustainably.

# # #

About the Author

Walter Leclerc is an independent consultant and recognized industry thought leader in Environmental Health & Safety, Risk Management, and Sustainability, with deep experience across data center construction and operations, technology, and industrial sectors. He has written extensively on emerging risk, liquid cooling, safety leadership, predictive analytics, incident trends, and the integration of culture, technology, and resilience in next-generation mission-critical environments. Walter led the initiatives that earned Digital Realty the Environment+Energy Leader’s Top Project of the Year Award for its Global Water Strategy and recognition on EHS Today’s America’s Safest Companies List. A frequent global speaker on the future of safety, sustainability, and resilience in data centers, Walter holds a B.S. in Chemistry from UC Berkeley and an M.S. in Environmental Management from the University of San Francisco.



Where Is AI Taking Data Centers?

A Vision for the Next Era of Compute from Structure Research’s Jabez Tan

Framing the Future of AI Infrastructure

At the infra/STRUCTURE Summit 2025, held October 15–16 at the Wynn Las Vegas, Jabez Tan, Head of Research at Structure Research, opened the event with a forward-looking keynote titled “Where Is AI Taking Data Centers?” His presentation provided a data-driven perspective on how artificial intelligence (AI) is reshaping digital infrastructure, redefining scale, design, and economics across the global data center ecosystem.

Tan’s session served as both a retrospective on how far the industry has come and a roadmap for where it’s heading. With AI accelerating demand beyond traditional cloud models, his insights set the tone for two days of deep discussion among the sector’s leading operators, investors, and technology providers.

From the Edge to the Core – A Redefinition of Scale

Tan began by looking back just a few years to what he called “the 2022 era of edge obsession.” At that time, much of the industry believed the future of cloud would depend on thousands of small, distributed edge data centers. “We thought the next iteration of cloud would be hundreds of sites at the base of cell towers,” Tan recalled. “But that didn’t really happen.”

Instead, the reality has inverted. “The edge has become the new core,” he said. “Rather than hundreds of small facilities, we’re now building gigawatts of capacity in centralized regions where power and land are available.”

That pivot, Tan emphasized, is fundamentally tied to economics, where cost, energy, and accessibility converge. It reflects how hyperscalers and AI developers are chasing efficiency and scale over proximity, redefining where and how the industry grows.

The AI Acceleration – Demand Without Precedent

Tan then unpacked the explosive demand for compute since late 2022, when AI adoption began its steep ascent following the launch of ChatGPT. He described the industry’s trajectory as a “roller coaster” marked by alternating waves of panic and optimism—but one with undeniable momentum.

The numbers he shared were striking. NVIDIA’s GPU shipments, for instance, have skyrocketed: from 1.3 million H100 Hopper GPUs in 2024 to 3.6 million Blackwell GPUs sold in just the first three months of 2025, nearly a threefold increase. “That translates to an increase from under one gigawatt of GPU-driven demand to over four gigawatts in a single year,” Tan noted.

Tan linked this trend to a broader shift: “AI isn’t just consuming capacity, it’s generating revenue.” Large language model (LLM) providers like OpenAI, Anthropic, and xAI are now producing billions in annual income directly tied to compute access, signaling a business model where infrastructure equals monetization.

Measuring in Compute, Not Megawatts

One of the most notable insights from Tan’s session was his argument that power is no longer the most accurate measure of data center capacity. “Historically, we measured in square footage, then in megawatts,” he said. “But with AI, the true metric is compute, the amount of processing power per facility.”

This evolution is forcing analysts and operators alike to rethink capacity modeling and investment forecasting. Structure Research, Tan explained, is now tracking data centers by compute density, a more precise reflection of AI-era workloads. “The way we define market share and value creation will increasingly depend on how much compute each facility delivers,” he said.

From Training to Inference – The Next Compute Shift

Tan projected that as AI matures, the balance between training and inference workloads will shift dramatically. “Today, roughly 60% of demand is tied to training,” he explained. “Within five years, 80% will be inference.”

That shift will reshape infrastructure needs, pushing more compute toward distributed yet interconnected environments optimized for real-time processing. Tan described a future where inference happens continuously across global networks, increasing utilization, efficiency, and energy demands simultaneously.

The Coming Capacity Crunch

Perhaps the most sobering takeaway from Tan’s talk was his projection of a looming data center capacity shortfall. Based on Structure Research’s modeling, global AI-related demand could grow from 13 gigawatts in 2025 to more than 120 gigawatts by 2030, far outpacing current build rates.

“If development doesn’t accelerate, we could face a 100-gigawatt gap by the end of the decade,” Tan cautioned. He noted that 81% of capacity under development in the U.S. today comes from credible, established providers, but even that won’t be enough to meet demand. “The solution,” he said, “requires the entire ecosystem (utilities, regulators, financiers, and developers) to work in sync.”
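
For context, a quick back-of-the-envelope check using only the two figures Tan cited, and assuming smooth year-over-year compounding, shows the growth rate that projection implies:

```python
# Implied compound annual growth rate from 13 GW (2025) to 120 GW (2030),
# assuming smooth year-over-year compounding over five years.
start_gw, end_gw, years = 13, 120, 5
cagr = (end_gw / start_gw) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.0%}")   # roughly 56% per year
```

A sustained growth rate in that range is what underlies the "far outpacing current build rates" warning above.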

Fungibility, Flexibility, and the AI Architecture of the Future

Tan also emphasized that AI architecture must become fungible, able to handle both inference and training workloads interchangeably. He explained how hyperscalers are now demanding that facilities support variable cooling and compute configurations, often shifting between air and liquid systems based on real-time needs.

“This isn’t just about designing for GPUs,” he said. “It’s about designing for fluidity, so workloads can move and scale without constraint.”

Tan illustrated this with real-world examples of AI inference deployments requiring hundreds of cross-connects for data exchange and instant access to multiple cloud platforms. “Operators are realizing that connectivity, not just capacity, is the new value driver,” he said.

Agentic AI – A Telescope for the Mind

To close, Tan explored the concept of agentic AI, systems that not only process human inputs but act autonomously across interconnected platforms. He compared its potential to the invention of the telescope.

“When Galileo introduced the telescope, it challenged humanity’s view of its place in the universe,” Tan said. “Large language models are doing something similar for intelligence. They make us feel small today, but they also open an entirely new frontier for discovery.”

He concluded with a powerful metaphor: “If traditional technologies were tools humans used, AI is the first technology that uses tools itself. It’s a telescope for the mind.”

A Market Transformed by Compute

Tan’s session underscored that AI is redefining not only how data centers are built but also how they are measured, financed, and valued. The industry is entering an era where compute density is the new currency, where inference will dominate workloads, and where collaboration across the entire ecosystem is essential to keep pace with demand.

Infra/STRUCTURE 2026: Save the Date

Want to tune in live, receive all presentations, and gain access to C-level executives, investors, and industry-leading research? Then save the date for infra/STRUCTURE 2026, set for October 7-8, 2026, at The Wynn Las Vegas. Pre-registration for the 2026 event is now open, and you can visit www.infrastructuresummit.io to learn more.

