AI and cooling: toward more automation

AI is increasingly steering the data center industry toward new operational practices, where automation, analytics and adaptive control are paving the way for “dark” — or lights-out, unstaffed — facilities. Cooling systems, in particular, are leading this shift. Yet despite AI’s positive track record in facility operations, one persistent challenge remains: trust.

In some ways, AI faces a similar challenge to that of commercial aviation several decades ago. Even after airlines had significantly improved reliability and safety performance, making air travel not only faster but also safer than other forms of transportation, it still took time for public perceptions to shift.

That same tension between capability and confidence lies at the heart of the next evolution in data center cooling controls. As AI models — and there are several types in use for cooling control — improve in performance and become better understood, more transparent and more explainable, the question is no longer whether AI can manage operations autonomously, but whether the industry is ready to trust it enough to turn off the lights.

AI’s place in cooling controls

Thermal management systems, such as CRAHs, CRACs and airflow management, represent the front line of AI deployment in cooling optimization. Their modular nature enables the incremental adoption of AI controls, providing immediate visibility and measurable efficiency gains in day-to-day operations.

AI can now be applied across four core cooling functions:

  • Dynamic setpoint management. Continuously recalibrates temperature, humidity and fan speeds to match load conditions (see the sketch after this list).
  • Thermal load forecasting. Predicts shifts in demand and makes adjustments in advance to prevent overcooling or instability.
  • Airflow distribution and containment. Uses machine learning to balance hot and cold aisles and stage CRAH/CRAC operations efficiently.
  • Fault detection, predictive and prescriptive diagnostics. Identifies coil fouling, fan oscillation, or valve hunting before they degrade performance.
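
The following is a minimal, rule-based sketch of the first of these functions, dynamic setpoint management. It is illustrative only: commercial platforms rely on trained models rather than fixed rules, and every name, limit and step size here is an assumption.

    from dataclasses import dataclass

    @dataclass
    class CoolingState:
        supply_setpoint_c: float   # current CRAH supply air setpoint (deg C)
        max_inlet_c: float         # hottest measured rack inlet temperature (deg C)

    def adjust_setpoint(state, inlet_limit_c=27.0, margin_c=1.5, step_c=0.5,
                        floor_c=18.0, ceiling_c=24.0):
        """Nudge the supply setpoint toward the highest value that keeps the
        hottest rack inlet safely below its limit, trimming cooling energy."""
        headroom = inlet_limit_c - state.max_inlet_c
        if headroom < margin_c:
            new_setpoint = state.supply_setpoint_c - step_c   # too close to the limit: cool down
        else:
            new_setpoint = state.supply_setpoint_c + step_c   # comfortable headroom: relax
        return min(max(new_setpoint, floor_c), ceiling_c)

    # Example: hottest inlet at 26.2 deg C with a 22 deg C supply setpoint -> tighten to 21.5 deg C
    print(adjust_setpoint(CoolingState(supply_setpoint_c=22.0, max_inlet_c=26.2)))

In practice, a platform would apply such adjustments across many cooling units at once and learn its limits from operating data rather than hard-coding them.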

A growing ecosystem of vendors is advancing AI-driven cooling optimization across both air- and water-side applications. Companies such as Vigilent, Siemens, Schneider Electric, Phaidra and Etalytics offer machine learning platforms that integrate with existing building management systems (BMS) or data center infrastructure management (DCIM) systems to enhance thermal management and efficiency.

Siemens’ White Space Cooling Optimization (WSCO) platform applies AI to match CRAH operation with IT load and thermal conditions, while Schneider Electric, through its Motivair acquisition, has expanded into liquid cooling and AI-ready thermal systems for high-density environments. In parallel, hyperscale operators, such as Google and Microsoft, have built proprietary AI engines to fine-tune chiller and CRAH performance in real time. These solutions range from supervisory logic to adaptive, closed-loop control. However, all share a common aim: improve efficiency without compromising compliance with service level agreements (SLAs) or operator oversight.

The scope of AI adoption

While IT cooling optimization has become the most visible frontier, conversations with AI control vendors reveal that most mature deployments still begin at the facility water loop rather than in the computer room. Vendors often start with the mechanical plant and facility water system because these areas involve a limited set of variables (chiefly temperature differentials, flow rates and pressure setpoints) and can be treated as closed, well-bounded systems.

This makes the water loop a safer proving ground for training and validating algorithms before extending them to computer room air cooling systems, where thermal dynamics are more complex and influenced by containment design, workload variability and external conditions.

Predictive versus prescriptive: the maturity divide

AI in cooling is evolving along a maturity spectrum — from predictive insight to prescriptive guidance and, increasingly, to autonomous control. Table 1 summarizes the functional and operational distinctions among these three stages of AI maturity in data center cooling.

Table 1 Predictive, prescriptive, and autonomous AI in data center cooling


Most deployments today stop at the predictive stage, where AI enhances situational awareness but leaves action to the operator. Achieving full prescriptive control will require not only a deeper technical sophistication but also a shift in mindset.

Technically, it is more difficult to engineer because the system must not only forecast outcomes but also choose and execute safe corrective actions within operational limits. Operationally, it is harder to trust because it challenges long-held norms about accountability and human oversight.

The divide, therefore, is not only technical but also cultural. The shift from informed supervision to algorithmic control is redefining the boundary between automation and authority.

AI’s value and its risks

No matter how advanced the technology becomes, cooling exists for one reason: maintaining environmental stability and meeting SLAs. AI-enhanced monitoring and control systems support operating staff by:

  • Predicting and preventing temperature excursions before they affect uptime.
  • Detecting system degradation early and enabling timely corrective action.
  • Optimizing energy performance under varying load profiles without violating SLA thresholds.

Yet efficiency gains mean little without confidence in system reliability. It is also important to clarify that AI in data center cooling is not a single technology. Control-oriented machine learning models, such as those used to optimize CRAHs, CRACs and chiller plants, operate within physical limits and rely on deterministic sensor data. These differ fundamentally from language-based AI models such as GPT, where “hallucinations” refer to fabricated or contextually inaccurate responses.

At the Uptime Network Fall Americas Conference 2025, several operators raised concerns about AI hallucinations — instances where optimization models generate inaccurate or confusing recommendations from event logs. In control systems, such errors often arise from model drift, sensor faults, or incomplete training data, not from the reasoning failures seen in language-based AI. When a model’s understanding of system behavior falls out of sync with reality, it can misinterpret anomalies as trends, eroding operator confidence faster than it delivers efficiency gains.
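
As an illustration of how such drift can be caught before it misleads the controls, the sketch below compares a model’s predicted temperatures with sensor readings and flags sustained divergence. The thresholds and variable names are assumptions, not any vendor’s method.

    from statistics import mean

    def drift_detected(predicted_c, measured_c, mae_limit_c=1.0, bias_limit_c=0.5):
        """Flag when the model's view of the system no longer matches the sensors."""
        residuals = [m - p for p, m in zip(predicted_c, measured_c)]
        mae = mean(abs(r) for r in residuals)   # overall error magnitude
        bias = mean(residuals)                  # systematic offset (e.g., a drifting sensor)
        return mae > mae_limit_c or abs(bias) > bias_limit_c

    # Example: a persistent +0.8 deg C offset between prediction and measurement
    predicted = [21.0, 21.2, 21.1, 21.3]
    measured = [21.8, 22.0, 21.9, 22.1]
    print(drift_detected(predicted, measured))   # True -> revert to supervisory review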

The discomfort is not purely technical; it is also human. Many data center operators remain uneasy about letting AI take the controls entirely, even as they acknowledge its potential. In AI’s ascent toward autonomy, trust remains the runway still under construction.

Critically, modern AI control frameworks are being designed with built-in safety, transparency and human oversight. For example, Vigilent, a provider of AI-based optimization controls for data center cooling, reports that its optimizing control switches to “guard mode” whenever it is unable to maintain the data center environment within tolerances. Guard mode brings on additional cooling capacity (at the expense of power consumption) to restore SLA-compliant conditions; typical triggers include rapid temperature drift or hot spots. There is also a manual override option, which enables the operator to take control, supported by monitoring and event logs.

This layered logic provides operational resiliency by enabling systems to fail safely: guard mode ensures stability, manual override guarantees operator authority, and explainability, via decision-tree logic, keeps every AI action transparent. Even in dark-mode operation, alarms and reasoning remain accessible to operators.
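
A compressed sketch of that layered decision logic follows. The mode names and decision order are illustrative assumptions rather than Vigilent’s implementation.

    from enum import Enum, auto

    class Mode(Enum):
        OPTIMIZE = auto()   # AI adjusts setpoints and staging for efficiency
        GUARD = auto()      # add cooling capacity (at an energy cost) to restore SLA conditions
        MANUAL = auto()     # operator has taken control; AI only monitors and logs

    def decide_mode(within_sla: bool, operator_override: bool) -> Mode:
        if operator_override:          # human authority always wins
            return Mode.MANUAL
        if not within_sla:             # e.g., rapid drift or a hot spot
            return Mode.GUARD
        return Mode.OPTIMIZE           # within tolerance: resume optimization

    print(decide_mode(within_sla=False, operator_override=False))   # Mode.GUARD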

These frameworks directly address one of the primary fears among data center operators: losing visibility into what the system is doing.

Outlook

Gradually, the concept of a dark data center, one operated remotely with minimal on-site staff, has shifted from being an interesting theory to a desirable strategy. In recent years, many infrastructure operators have increased their use of automation and remote-management tools to enhance resiliency and operational flexibility, while also mitigating the effects of low staffing levels. Cooling systems, particularly those governed by AI-assisted control, are now central to this operational transformation.

Operational autonomy does not mean abandoning human control; it means achieving reliable operation without the need for constant supervision. Ultimately, a dark data center is not about turning off the lights; it is about turning on trust.


The Uptime Intelligence View

AI in thermal management has evolved from an experimental concept into an essential tool, improving efficiency and reliability across data centers. The next step — coordinating facility water, air and IT cooling liquid systems — will define the evolution toward greater operational autonomy. However, the transition to “dark” operation will be as much cultural as it is technical. As explainability, fail-safe modes and manual overrides build operator confidence, AI will gradually shift from copilot to autopilot. The technology is advancing rapidly; the question is how quickly operators will adopt it.



AI’s growth calls for useful IT efficiency metrics

The digital infrastructure industry is under pressure to measure and improve the energy efficiency of the computing work that underpins digital services. Enterprises seek to maximize returns on cost outlay and operating expenses for IT hardware, and regulators and local communities need reassurance that the energy devoted to data centers is used efficiently. These objectives call for a productivity metric to measure the amount of work that IT hardware performs per unit of energy.

With generative AI projected to boost data center power demand substantially, the stakes have arguably never been higher. Fortunately, organizations monitoring the performance and efficiency of their AI applications can benefit from experiences in the field of supercomputing.

In September 2025, Uptime Intelligence participated in a panel discussion about AI energy efficiency at the Yotta 2025 conference in Las Vegas (Nevada, US). The panelists drew on their extensive experience in supercomputing to weigh in on discussions around AI training efficiency. They discussed the need for a productivity metric to measure it, as well as a key caveat organizations need to consider.

Organizations such as Uptime Intelligence and The Green Grid have published guidance on calculating work capacity for various types of IT. Software applications and their supporting IT hardware vary significantly, so consensus on a single metric to compare energy performance remains out of reach for the foreseeable future. However, tracking energy performance in a given facility over time is important, and is achievable practically for many organizations today.

Defining AI computing work

The work capacity of IT equipment is needed to calculate its utilization and energy performance when running an application. The Green Grid white paper IT work capacity metric V1 — a methodology provides a methodology for calculating a work capacity value for CPU-based servers. Uptime Intelligence has proposed methodologies to extend this to accelerator-based servers for AI and other applications (see Calculating work capacity for server and storage products).

Floating point operations per second (FLOPS) is a common and readily available unit of work capacity for CPU- or accelerator-based servers. In 2025, an AI server’s work capacity is typically measured in trillions of FLOPS, or teraFLOPS (TFLOPS).

Not all FLOPS are the same

Even though large-scale AI training is radically reshaping many commercial data centers, the underlying software and hardware are not fundamentally new. AI training is essentially one of many applications of supercomputing. Supercomputing software, along with the IT selection and configuration, varies in many ways — and one of the most relevant variables when monitoring energy performance is floating point precision. This precision (measured in bits) is analogous to the number of decimal places used in inputs and outputs.

GPUs and other accelerators can perform 64-, 32-, 16-, 8- and 4-bit calculations, and some can use mixed precision. While a high-performance computing (HPC) workload such as computational fluid dynamics might use 64-bit (“double precision”) floating point calculations for high accuracy, other applications do not have such exacting requirements. Lower precision consumes less memory per calculation — and, crucially, less energy. The panel discussion at Yotta raised an important distinction: unlike most engineering and research applications, today’s AI training and inference calculations typically use 4-bit precision.

Floating point precision is essential context when evaluating a TFLOPS benchmark. For the same hardware, a 64-bit TFLOPS value is typically half the 32-bit TFLOPS value — and one-sixteenth of the 4-bit TFLOPS value. For consistent AI work capacity calculation, Uptime Institute recommends that IT operators use 32-bit TFLOPS values supplied by their AI server providers.
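
The sketch below applies this scaling rule to normalize vendor TFLOPS figures to a 32-bit baseline. It assumes the idealized doubling relationship described above, which real accelerators only approximate.

    def tflops_at_fp32(tflops: float, precision_bits: int) -> float:
        """Convert a TFLOPS rating at a given precision to its 32-bit equivalent,
        assuming throughput doubles each time precision is halved."""
        if precision_bits not in (64, 32, 16, 8, 4):
            raise ValueError("unsupported precision")
        return tflops * precision_bits / 32

    print(tflops_at_fp32(100, 64))    # 200.0 -- a 64-bit figure doubles at 32-bit
    print(tflops_at_fp32(1600, 4))    # 200.0 -- a 4-bit figure is divided by 8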

Working it out: work per energy

The maximum work capacity calculation for a server can be aggregated at the level of a rack, a cluster or a data center. Work capacity multiplied by average utilization (as a percentage) produces an estimate of the average rate of calculation work (in TFLOPS) delivered over a given period. Operators can divide this figure by the energy consumption (in MWh) over that same time to yield an estimate of the work’s energy efficiency, in TFLOPS/MWh. Separate calculations for CPU-based servers, accelerator-based servers, and other IT (e.g., storage) will provide a more accurate assessment of energy performance (see Figure 1).

Figure 1 Examples of IT equipment work-per-energy calculations

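The calculation described above reduces to a few lines of arithmetic, as in the sketch below. The cluster capacity, utilization and energy figures are hypothetical.

    def work_per_energy(capacity_tflops: float, avg_utilization: float,
                        energy_mwh: float) -> float:
        """Estimated productivity in TFLOPS/MWh for a group of servers."""
        delivered_tflops = capacity_tflops * avg_utilization   # average delivered work rate
        return delivered_tflops / energy_mwh

    # Example: a cluster rated at 2,000 TFLOPS (32-bit) running at 60% average
    # utilization, with 450 MWh of metered energy over the reporting period
    print(round(work_per_energy(2000, 0.60, 450), 1))   # ~2.7 TFLOPS/MWh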

Even when TFLOPS figures are normalized to the same precision, it is difficult to use this information to draw meaningful comparisons between the energy performance of significantly different hardware types and configurations. Accelerator power consumption does not scale linearly with utilization levels. Additionally, the details of software design will determine how closely real-world application performance aligns with simplified work capacity benchmarks.

However, many organizations can benefit from calculating this TFLOPS/MWh productivity metric and are already well equipped to do so. This calculation is most useful for quantifying efficiency gains over time, e.g., from IT refresh and consolidation, or refinements to operational control. In some jurisdictions, tracking TFLOPS/MWh as a productivity metric can satisfy some regulatory requirements. IT efficiency is often overlooked in favor of facility efficiency — but a consistent productivity metric can help to quantify available improvements.


The Uptime Intelligence View

Generative AI training is poised to drive up data center energy consumption, prompting calls for regulation, responsible resource use and return on investment. A productivity metric can help meet these objectives by consistently quantifying the amount of computing work performed per unit of energy. Supercomputing experts agree that operators should track and use this data, but they caution against interpreting it without the necessary context. A simplified, practical work-per-energy metric is most useful for tracking improvement in one facility over time.

The following participants took part in the panel discussion on energy efficiency at Yotta 2025:

  • Jacqueline Davis, Research Analyst at Uptime Institute (moderator)
  • Dr Peter de Bock, former Program Director, Advanced Research Projects Agency–Energy
  • Dr Alfonso Ortega, Professor of Energy Technology, Villanova University
  • Dr Jon Summers, Research Lead in Data Centers, Research Institutes of Sweden

Other related reports published by Uptime Institute include:

Calculating work capacity for server and storage products

The following Uptime Institute experts were consulted for this report:

Jay Dietrich, Research Director of Sustainability, Uptime Institute


Crypto mines are turning into AI factories

The pursuit of training ever-larger generative AI models has necessitated the creation of a new class of specialized data centers — facilities that have more in common with high-performance computing (HPC) environments than traditional enterprise IT.

These data centers support very high rack densities (130 kW and above with current Nvidia rack-scale systems), direct-to-chip liquid cooling, and supersized power distribution components. This equipment is deployed at scale, in facilities that consume tens of megawatts. Delivering such dense infrastructure at this scale is not just technically complicated — it often requires doing things that have never been attempted before.

Some of these ultra-dense AI training data centers are being built by well-established cloud providers and their partners — wholesale colocation companies. However, the new class of facility has also attracted a different kind of data center developer: former cryptocurrency miners. Many of the organizations now involved in AI infrastructure — such as Applied Digital, Core Scientific, CoreWeave, Crusoe and IREN — originated as crypto mining ventures.

Some have transformed into neoclouds, leasing GPUs at competitive prices. Others operate as wholesale colocation providers, building specialized facilities for hyperscalers, neoclouds, or large AI model developers like OpenAI or Anthropic. Few of them operated traditional data centers before 2020. These operators represent a significant and recent addition to the data center industry — especially in the US.

A league of their own

Crypto mining facilities differ considerably from traditional data centers. Their primary objective is to house basic servers equipped with either GPUs or ASICs (application-specific integrated circuits), running at near 100% utilization around the clock to process calculations that yield cryptocurrency tokens. The penalties for outages are direct — fewer tokens mean lower profits — but the hardware is generally considered disposable. The business case is driven almost entirely by the cost of power, which accounts for almost all of the operating expenditure.

Many crypto mines do not use traditional server racks. Most lack redundancy in power distribution and cooling equipment, and they have no means of continuing operations in the event of a grid outage: no UPS, no batteries, no generators, no fuel. In some cases, mining equipment is located outdoors, shielded from the rain, but little else.

While crypto miners didn’t build traditional data center facilities, they did have two crucial assets: land zoned for industrial use and access to abundant, low-cost power.

Around 2020, some of the largest crypto mining operators began pivoting toward hosting hardware for AI workloads — a shift that became more pronounced following the launch of ChatGPT in late 2022. Table 1 shows how quickly some of these companies have scaled their AI/HPC operations.

Table 1 The transformation of crypto miners


To develop data center designs that can accommodate the extreme power and cooling requirements of cutting-edge AI hardware, these companies are turning to engineers and consultants with experience in hyperscale projects. The same applies to construction companies. The resulting facilities are built to industry standards and are concurrently maintainable.

There are three primary reasons why crypto miners were successful in capitalizing on the demand for high-density AI infrastructure:

  • These organizations were accustomed to moving quickly, having been born in an industry that had to respond to volatile cryptocurrency pricing, shifting regulations and fast-evolving mining hardware.
  • Many were already familiar with GPUs through their use in crypto mining — and some had begun renting them out for research or rendering workloads.
  • Their site selection was primarily driven by power availability and cost, rather than proximity to customers or network hubs.

Violence of action

Applied Digital, a publicly traded crypto mining operator based in North Dakota, presents an interesting case study. The state is one of the least developed data center markets in the US, with only a few dozen facilities in total.

Applied Digital’s campus in Ellendale was established to capitalize on cheap renewable power flowing between local wind farms and Chicago. In 2024, the company removed all mentions of cryptocurrency from its website — despite retaining sizable (100 MW-plus) mining operations. It then announced plans to build a 250 MW AI campus in Ellendale, codenamed Polaris Forge, to be leased by CoreWeave.

The operator expects the first 100 MW data center to be ready for service in late 2025. The facility will use direct liquid cooling and is designed to support 300 kW-plus rack densities. It is built to be concurrently maintainable, powered by two utility feeds, and will feature N+2 redundancy on most mechanical equipment. To ensure cooling delivery in the event of a power outage, the facility will be equipped with 360,000 gallons (1.36 million liters) of chilled water thermal storage. This will be Applied Digital’s first non-crypto facility.

The second building, with a capacity of 150 MW, is expected to be ready in the middle of 2026. It will deploy medium-voltage static UPS systems to improve power distribution efficiency and optimize site layout. The company has several more sites under development.

Impact on the sector

Do crypto miners have an edge in data center development? What they do have is existing access to power and a higher tolerance for technical and business risk — qualities that enable them to move faster than much of the traditional competition. This willingness to place bets matters in a market that is lacking solid fundamentals: in 2025, capital expenditure on AI infrastructure is outpacing revenue from AI-based products by orders of magnitude. The future of generative AI is still uncertain.

At present, this new category of data center operators appears to be focusing exclusively on the ultra-high-density end of the market and is not competing for traditional colocation customers. For now, they don’t need to either, as demand for AI training capacity alone keeps them busy. Still, their presence in the market introduces a new competitive threat to colocation providers that have opted to accommodate extreme densities in their recently built or upcoming facilities.

M&E and IT equipment suppliers have welcomed the new arrivals — not simply because they drive overall demand but because they are new buyers in a market increasingly dominated by a handful of technology behemoths. Some operators will be concerned about supply chain capacity, especially when it comes to large-scale projects: high-density campuses could deplete the stock of data center equipment such as large generators, UPS systems and transformers.

One of the challenges facing this new category of operators is the evolving nature of AI hardware. Nvidia, for example, intends to start shipping systems that consume more than 500 kW per compute rack by the end of 2027. It is not clear how many data centers being built today will be able to accommodate this level of density.


The Uptime Intelligence View

The simultaneous pivot by several businesses toward building much more complex facilities is peculiar, yet their arrival will not immediately affect most operators.

While this trend will create business opportunities for a broad swathe of design, consulting and engineering firms, it is also likely to have a negative impact on equipment supply chains, extending lead times for especially large-capacity units.

Much of this group’s future hinges on the success of generative AI in general — and the largest and most compute-hungry models in particular — as a tool for business. However, the facilities they are building are legitimate data centers that will remain valuable even if the infrastructure needs of generative AI are being overestimated.


Cybersecurity and the cost of human error

Cyber incidents are increasing rapidly. In 2024, the number of outages caused by cyber incidents was twice the average of the previous four years, according to Uptime Institute’s annual report on data center outages (see Annual outage analysis 2025). More operational technology (OT) vendors are experiencing significant increases in cyberattacks on their systems. Data center equipment vendor Honeywell analyzed hundreds of billions of system logs and 4,600 events in the first quarter of 2025, identifying 1,472 new ransomware extortion incidents — a 46% increase on the fourth quarter of 2024 (see Honeywell’s 2025 Cyber Threat Report). Beyond the initial impact, cyberattacks can have lasting consequences for a company’s reputation and balance sheet.

Cyberattacks increasingly exploit human error

Cyberattacks on data centers often exploit vulnerabilities — some stemming from simple, preventable errors, others from overlooked systemic issues. Human error, such as failing to follow procedures, can create vulnerabilities that attackers exploit. For example, staff might forget regular system patches or delay firmware updates, leaving systems exposed. Companies, in turn, implement policies and procedures to ensure employees perform preventative actions on a consistent basis.

In many cases, data center operators may well be aware that elements of their IT and OT infrastructure have certain vulnerabilities. This may be due to policy noncompliance or the policy itself lacking appropriate protocols to defend against hackers. Often, employees lack training on how to recognize and respond to common social engineering techniques used by hackers. Tactics such as email phishing, impersonation and ransomware are increasingly targeting organizations with complex supply chain and third-party dependencies.

Cybersecurity incidents involving human error often follow similar patterns. Attacks may begin with some form of social engineering to obtain login credentials. Once inside, the attack moves laterally through a system, exploiting small errors to cause systemic damage (see Table 1).

Table 1 Cyberattackers exploit human factors to induce human error


Failure to follow correct procedures

Although many companies have policies and procedures in place, employees can become complacent and fail to follow them. At times, they may unintentionally skip a step or carry it out incorrectly. For instance, workers might forget to install a software update or accidentally misconfigure a port or firewall — despite having technical training. Others may feel overwhelmed by the volume of updates and leave systems vulnerable as a result. In some cases, important details are simply overlooked, such as leaving a firewall port open or setting their cloud storage to public access.

Procedures concerning password strength, password changes and inactive accounts are common vulnerabilities that hackers exploit. Inactive accounts that are not properly deactivated may miss out on critical security updates, as these are monitored less closely than active accounts, making it easier for security breaches to go unnoticed.
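
As a simple illustration of how such checks can be automated rather than left to memory, the sketch below flags enabled accounts that have not logged in within a defined window so they can be reviewed or deactivated. The field names, threshold and sample data are assumptions.

    from datetime import datetime, timedelta

    def stale_accounts(accounts, max_idle_days=90, now=None):
        """Return enabled accounts with no login inside the idle window."""
        now = now or datetime(2025, 9, 1)   # fixed date for a reproducible example
        cutoff = now - timedelta(days=max_idle_days)
        return [a["username"] for a in accounts
                if a["enabled"] and a["last_login"] < cutoff]

    directory = [
        {"username": "jsmith",   "enabled": True, "last_login": datetime(2025, 8, 20)},
        {"username": "svc_hvac", "enabled": True, "last_login": datetime(2024, 11, 2)},
    ]
    print(stale_accounts(directory))   # ['svc_hvac'] -> review for deactivation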

Unknowingly engaging with social engineering

Social engineering is a tactic used to deceive individuals into revealing sensitive information or downloading malicious software. It typically involves the attacker impersonating someone from the target’s company or organization to build trust with them. The primary goal is to steal login credentials or gain unauthorized access to the system.

Attackers may call employees while posing as someone from the IT help desk, requesting login details. Another common tactic involves the attacker pretending to be a help desk technician and, under the guise of “routine testing,” pressuring an employee to disclose their login credentials.

Like phishing, spoofing is a tactic used to gain an employee’s trust by simulating familiar conditions, but it often relies on misleading visual cues. For example, social engineers may email a link to a fake version of the company’s login screen, prompting the unsuspecting employee to enter their login information as usual. In some rare cases, attackers might even use AI to impersonate an employee’s supervisor during a video call.

Deviation from policies or best practices

Adhering to policies and best practices is critical to determining whether cybersecurity succeeds or fails. Procedures need to be written clearly and without ambiguity. For example, if a procedure does not explicitly require an employee to clear saved login data from their devices, hackers or rogue employees may be able to gain access to the device using default administrator credentials. Similarly, if regular password changes are not mandated, it may be easier for attackers to compromise system access credentials.

Policies must also account for the possibility of a disgruntled employee or third-party worker stealing or corrupting sensitive information for personal gain. To reduce this risk, companies can implement clear deprovisioning rules in their offboarding process, such as ensuring passwords are changed immediately upon an employee’s departure. While there is always a chance that a procedural step may be accidentally overlooked, comprehensive procedures increase the likelihood that each task is completed correctly.

Procedures are especially critical when employees have to work quickly to contain a cybersecurity incident. They should be clearly written, thoroughly tested for reliability, and easily accessible to serve as a reference during a variety of emergencies.

Poor security governance and oversight

A lack of governance or oversight from management can lead to overlooked risks and vulnerabilities, such as missed security patches or failure to monitor systems for threats or alerts. Training helps employees to approach situations with healthy skepticism, encouraging them to perform checks and balances consistent with the company’s policies.

Training should evolve to ensure that workers are informed about the latest threats and vulnerabilities, as well as how to recognize them.

Notable incidents exploiting human error

The types of human error described above are further complicated by the psychology of how individuals behave in high-pressure situations. For example, mistakes may occur due to heightened stress, fatigue or coercion, all of which can lead to errors of judgment when a quick decision or action is required.

Table 2 identifies how human error may have played a part in eight major public cybersecurity breaches between 2023 and 2025. This includes three of the 10 most significant data center outages — United Healthcare, CDK Global and Ascension Healthcare — highlighted in Uptime Institute’s outages report (see Annual outage analysis 2025). We note the following trends:

  • At least five of the incidents involved social engineering. These attacks often exploited legitimate credentials or third-party vulnerabilities to gain access and execute malicious actions.
  • All incidents likely involved failures by employees to follow policies, procedures or properly manage common vulnerabilities.
  • Seven incidents exposed gaps in skills, training or experience to mitigate threats to the organization.
  • In half of the incidents, policies may have been poorly enforced or bypassed for unknown reasons.

Table 2 Impact of major cyber incidents involving human error


Typically, organizations are reluctant to disclose detailed information about cyberattacks. However, regulators and government cybersecurity agencies are increasingly expecting more transparency — particularly when the attacks affect citizens and consumers — since attackers often leak information on public forums and the dark web.

The following findings are particularly concerning for data center operators and warrant serious attention:

  • The financial cost of cyber incidents is significant. Among the eight identified cyberattacks, the estimated total losses exceed $8 billion.
  • Full financial and reputational impact can take longer to play out. For example, UK retailer Marks & Spencer is facing lawsuits from customer groups over identity theft and fraud following a cyberattack. Similar actions may be taken by regulators or government agencies, particularly if breaches expose compliance failures with cybersecurity regulations, such as those in the Network and Information Security Directive 2 and the Digital Operational Resilience Act.

The Uptime Intelligence View

Human error is often viewed as a series of unrelated mistakes; however, the errors identified in this report stem from complex, interconnected systems and increasingly sophisticated attackers who exploit human psychology to manipulate events.

Understanding the role of human error in cybersecurity incidents is crucial to help employees recognize and prevent potential oversights. Training alone is unlikely to solve the problem. Data center operators should continuously adapt cyber practices and foster a culture that redefines how staff perceive and respond to the risk of cyber threats. This cultural shift is likely critical to staying ahead of evolving threat tactics.

John O’Brien, Senior Research Analyst, jobrien@uptimeinstitute.com
Rose Weinschenk, Analyst, rweinschenk@uptimeinstitute.com


Self-contained liquid cooling: the low-friction option

Each new generation of server silicon is pushing traditional data center air cooling closer to its operational limits. In 2025, the thermal design power (TDP) of top-bin CPUs reached 500 W, and server chip product roadmaps indicate further escalation in pursuit of higher performance. To handle these high-powered chips, more IT organizations are considering direct liquid cooling (DLC) for their servers. However, large-scale deployment of DLC with supporting facility water infrastructure can be costly and complex to operate, and is still hindered by a lack of standards (see DLC shows promise, but challenges persist).

In these circumstances, an alternative approach has emerged: air-cooled servers with internal DLC systems. Referred to by vendors as either air-assisted liquid cooling (AALC) or liquid-assisted air cooling (LAAC), these systems do not require coolant distribution units or facility water infrastructure for heat rejection. This means that they can be deployed in smaller, piecemeal installations.

Uptime Intelligence considers AALC a subset of DLC — which is defined by the use of coolant to remove heat from components within the IT chassis — and one that includes designs spanning multiple servers. This report discusses designs that use a coolant loop, typically water in commercially available products, fitting entirely within a single server chassis.

Such systems enable IT system engineers and operators to cool top-bin processor silicon in dense form factors — such as 1U rack-mount servers or blades — without relying on extreme-performance heat sinks or elaborate airflow designs. Given enough total air cooling capacity, self-contained AALC requires no disruptive changes to the data hall or new maintenance tasks for facility personnel.

Deploying these systems in existing space will not expand cooling capacity the way full DLC installations with supporting infrastructure can. However, selecting individual 1U or 2U servers with AALC can either reduce IT fan power consumption or enable operators to support roughly 20% greater TDP than they otherwise could — with minimal operational overhead. According to the server makers offering this type of cooling solution, such as Dell and HPE, the premium for self-contained AALC can pay for itself in as little as two years when used to improve power efficiency.

Does simplicity matter?

Many of today’s commercial cold plate and immersion cooling systems originated and matured in high-performance computing facilities for research and academic institutions. However, another group has been experimenting with liquid cooling for more than a decade: video game enthusiasts. Some have equipped their PCs with self-contained AALC systems to improve CPU and GPU performance, as well as reduce fan noise. More recently, to manage the rising heat output of modern server CPUs, IT vendors have started to offer similar systems.

The engineering is simple: fluid tubing connects one or more cold plates to a radiator and pump. The pumps circulate warmed coolant from the cold plates through the radiator, while server fans draw cooling air through the chassis and across the radiator (see Figure 1). Because water is a more efficient heat transfer medium than air, it can remove heat from the processor at a greater rate — even at a lower case temperature.

Figure 1 Closed-loop liquid cooling within the server


The coolant used in commercially shipping products is usually PG25, a mixture of 75% water and 25% propylene glycol. This formulation has been widely adopted in both DLC and facility water systems for decades, so its chemistry and material compatibility are well understood.
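
For a sense of scale, the sketch below applies the basic heat balance, Q = mass flow x specific heat x temperature rise, to estimate the coolant flow such a loop must sustain. The heat load, temperature rise and fluid properties are rough assumptions, not vendor figures.

    def coolant_flow_lpm(heat_w, delta_t_c, cp_j_per_kg_c=3900.0, density_kg_per_l=1.02):
        """Coolant flow (liters per minute) needed to carry a given heat load.
        Property values roughly approximate PG25 at typical loop temperatures."""
        mass_flow_kg_s = heat_w / (cp_j_per_kg_c * delta_t_c)
        return mass_flow_kg_s / density_kg_per_l * 60

    # Example: two 500 W CPUs sharing one loop with a 10 deg C coolant temperature rise
    print(round(coolant_flow_lpm(1000, 10.0), 2))   # ~1.51 L/min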

As with larger DLC systems, alternative cooling approaches can use a phase change to remove IT heat. Some designs use commercial two-phase dielectric coolants, and an experimental alternative uses a sealed system containing a small volume of pure water under partial vacuum. This lowers the boiling point of water, effectively turning it into a two-phase coolant.

Self-contained AALC designs with multiple cold plates usually have redundant pumps — one on each cold plate in the same loop — and can continue operating if one pump fails. Because AALC systems for a single server chassis contain a smaller volume of coolant than larger liquid cooling systems, any leak is less likely to spill into systems below. Cold plates are typically equipped with leak detection sensors.

Closed-loop liquid cooling is best applied in 1U servers, where space constraints prevent the use of sufficiently large heat sinks. In internal testing by HPE, the pumps and fans of an AALC system in a 1U server consumed around 40% less power than the server fans in an air-cooled equivalent. This may amount to as much as a 5% to 8% reduction in total server power consumption under full load. The benefits of switching to AALC are smaller for 2U servers, which can mount larger heat sinks and use bigger, more efficient fan motors.
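
Those two figures are consistent if fans account for roughly one-eighth to one-fifth of server power at full load, as the short check below shows. The fan-share values are assumptions for illustration.

    def total_saving_pct(fan_share_pct, fan_reduction_pct=40.0):
        """Total-server energy saving implied by cutting fan-related power by ~40%."""
        return fan_share_pct * fan_reduction_pct / 100

    for fan_share in (12.5, 15.0, 20.0):   # assumed fan share of server power at full load
        print(f"{fan_share}% fan share -> {total_saving_pct(fan_share):.1f}% of total server power")
    # 12.5% -> 5.0%, 15.0% -> 6.0%, 20.0% -> 8.0%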

However, radiator size, airflow limitations and temperature-sensitive components mean that self-contained AALC is not on par with larger DLC systems, making it better suited as a transitional measure. Additionally, these systems are not currently available for GPU servers.

Advantages of AALC within the server:

  • Offers higher cooling capacity (up to 20% more) than air cooling in the same form factor and for the same energy input, with more even heat distribution and faster thermal response than heat sinks.
  • Requires no changes to white space or gray space.
  • Components are widely available.
  • Can operate without maintenance for the lifetime of the server, with low risk of failure.
  • Does not require space outside the rack, unlike “sidecars” or rear-mounted radiators.

Drawbacks of AALC within the server:

  • Closed-loop server cooling systems use several complex components that cost more than a heat sink.
  • Offers less IT cooling capacity than other liquid cooling approaches: systems available outside of high-performance computing and AI-specific deployments will typically support up to 1.2 kW of load per 1U server.
  • Self-contained systems generally consume more server fan power (a parasitic component of IT energy consumption) than larger DLC systems.
  • No control of coolant loop temperatures; control of flow rate through pumps may be available in some designs.
  • Radiator and pumps limit space savings within the server chassis.

Outlook

For some organizations, AALC offers the opportunity to maximize the value of existing investments in air cooling infrastructure. For others, it may serve as a measured step on the path toward DLC adoption.

This form of cooling is likely to be especially valuable for operators of legacy facilities that have sufficient air cooling infrastructure to support some high-powered servers but would otherwise suffer from hot spots. Selecting AALC over air cooling may also reduce server fan power enough to allow operators to squeeze another server into a rack.

Much of AALC’s appeal is its potential for efficient use of fan power and its compatibility with existing facility cooling capabilities. Expanding beyond this to increase a facility’s cooling capacity is a different matter, requiring larger, more expensive DLC systems supported by additional heat transport and rejection equipment. In comparison, server-sized AALC systems represent a much smaller cost increase over heat sinks.

Future technical development may address some of AALC’s limitations, although progress and funding will largely depend on the commercial interest in servers with self-contained AALC. In conversations with Uptime Intelligence, IT vendors have diverging views of the role of self-contained AALC in their server portfolios, suggesting that the market’s direction remains uncertain. Nonetheless, there is some interesting investment in the field. For example, Belgian startup Calyos has developed passive closed-loop cooling systems that operate without pumps, instead moving coolant via capillary action. The company is working on a rack-scale prototype that could eventually see deployment in data centers.


The Uptime Intelligence View

AALC within the server may only deliver a fraction of the improvements associated with DLC, but it does so at a fraction of the cost and with minimal disruption to the facility. For many, the benefits may seem negligible. However, for a small group of air-cooled facilities, AALC can deliver either cooling capacity benefits or energy savings.

