
AI and cooling: toward more automation

AI is increasingly steering the data center industry toward new operational practices, where automation, analytics and adaptive control are paving the way for “dark” — or lights-out, unstaffed — facilities. Cooling systems, in particular, are leading this shift. Yet despite AI’s positive track record in facility operations, one persistent challenge remains: trust.

In some ways, AI faces a similar challenge to that of commercial aviation several decades ago. Even after airlines had significantly improved reliability and safety performance, making air travel not only faster but also safer than other forms of transportation, it still took time for public perceptions to shift.

That same tension between capability and confidence lies at the heart of the next evolution in data center cooling controls. As AI models — of which there are several — improve in performance, becoming better understood, transparent and explainable, the question is no longer whether AI can manage operations autonomously, but whether the industry is ready to trust it enough to turn off the lights.

AI’s place in cooling controls

Thermal management systems, such as CRAHs, CRACs and airflow management, represent the front line of AI deployment in cooling optimization. Their modular nature enables the incremental adoption of AI controls, providing immediate visibility and measurable efficiency gains in day-to-day operations.

AI can now be applied across four core cooling functions:

  • Dynamic setpoint management. Continuously recalibrates temperature, humidity and fan speeds to match load conditions (a minimal sketch follows this list).
  • Thermal load forecasting. Predicts shifts in demand and makes adjustments in advance to prevent overcooling or instability.
  • Airflow distribution and containment. Uses machine learning to balance hot and cold aisles and stage CRAH/CRAC operations efficiently.
  • Fault detection, predictive and prescriptive diagnostics. Identifies coil fouling, fan oscillation, or valve hunting before they degrade performance.
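
To make the first of these functions concrete, below is a minimal, hypothetical sketch of a dynamic setpoint adjustment loop. The sensor names, target temperature, gain and limits are illustrative assumptions, not any vendor's implementation; a production AI controller would use learned models rather than a fixed proportional rule.

```python
# Minimal illustration of dynamic setpoint management (all values are assumptions).
# A real AI controller would use learned models, not a fixed proportional gain.

def adjust_fan_speed(cold_aisle_temp_c: float,
                     target_temp_c: float = 24.0,
                     current_fan_pct: float = 60.0,
                     gain_pct_per_deg: float = 5.0,
                     min_pct: float = 30.0,
                     max_pct: float = 100.0) -> float:
    """Nudge CRAH fan speed in proportion to the temperature error."""
    error = cold_aisle_temp_c - target_temp_c          # positive = too warm
    new_fan_pct = current_fan_pct + gain_pct_per_deg * error
    return max(min_pct, min(max_pct, new_fan_pct))     # clamp to a safe operating range

if __name__ == "__main__":
    # Cold aisle running 1.5 degC warm: fan speed rises from 60% to 67.5%.
    print(adjust_fan_speed(25.5))
```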

A growing ecosystem of vendors is advancing AI-driven cooling optimization across both air- and water-side applications. Companies such as Vigilent, Siemens, Schneider Electric, Phaidra and Etalytics offer machine learning platforms that integrate with existing building management systems (BMS) or data center infrastructure management (DCIM) systems to enhance thermal management and efficiency.

Siemens’ White Space Cooling Optimization (WSCO) platform applies AI to match CRAH operation with IT load and thermal conditions, while Schneider Electric, through its Motivair acquisition, has expanded into liquid cooling and AI-ready thermal systems for high-density environments. In parallel, hyperscale operators, such as Google and Microsoft, have built proprietary AI engines to fine-tune chiller and CRAH performance in real time. These solutions range from supervisory logic to adaptive, closed-loop control. However, all share a common aim: improve efficiency without compromising compliance with service level agreements (SLAs) or operator oversight.

The scope of AI adoption

While IT cooling optimization has become the most visible frontier, conversations with AI control vendors reveal that most mature deployments still begin at the facility water loop rather than in the computer room. Vendors often start with the mechanical plant and facility water system because these areas involve a smaller, well-defined set of variables (temperature differentials, flow rates and pressure setpoints) and can be treated as closed, well-bounded systems.

This makes the water loop a safer proving ground for training and validating algorithms before extending them to computer room air cooling systems, where thermal dynamics are more complex and influenced by containment design, workload variability and external conditions.

Predictive versus prescriptive: the maturity divide

AI in cooling is evolving along a maturity spectrum — from predictive insight to prescriptive guidance and, increasingly, to autonomous control. Table 1 summarizes the functional and operational distinctions among these three stages of AI maturity in data center cooling.

Table 1 Predictive, prescriptive, and autonomous AI in data center cooling


Most deployments today stop at the predictive stage, where AI enhances situational awareness but leaves action to the operator. Achieving full prescriptive control will require not only deeper technical sophistication but also a shift in mindset.

Technically, it is more difficult to engineer because the system must not only forecast outcomes but also choose and execute safe corrective actions within operational limits. Operationally, it is harder to trust because it challenges long-held norms about accountability and human oversight.

The divide, therefore, is not only technical but also cultural. The shift from informed supervision to algorithmic control is redefining the boundary between automation and authority.

AI’s value and its risks

No matter how advanced the technology becomes, cooling exists for one reason: maintaining environmental stability and meeting SLAs. AI-enhanced monitoring and control systems support operating staff by:

  • Predicting and preventing temperature excursions before they affect uptime.
  • Detecting system degradation early and enabling timely corrective action.
  • Optimizing energy performance under varying load profiles without violating SLA thresholds.

Yet efficiency gains mean little without confidence in system reliability. It is also important to clarify that AI in data center cooling is not a single technology. Control-oriented machine learning models, such as those used to optimize CRAHs, CRACs and chiller plants, operate within physical limits and rely on deterministic sensor data. These differ fundamentally from language-based AI models such as GPT, where “hallucinations” refer to fabricated or contextually inaccurate responses.

At the Uptime Network Americas Fall Conference 2025, several operators raised concerns about AI hallucinations — instances where optimization models generate inaccurate or confusing recommendations from event logs. In control systems, such errors often arise from model drift, sensor faults, or incomplete training data, not from the reasoning failures seen in language-based AI. When a model’s understanding of system behavior falls out of sync with reality, it can misinterpret anomalies as trends, eroding operator confidence faster than it delivers efficiency gains.
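
As an illustration of how such drift might be caught before it erodes trust, the sketch below compares a model's predicted temperatures against sensor readings and flags sustained divergence. The error threshold and sample window are arbitrary assumptions; they are not drawn from any vendor's product.

```python
# Hypothetical drift check: flag when model predictions and sensor data diverge
# for a sustained period (threshold and window are illustrative assumptions).
from statistics import mean

def drift_detected(predicted_c, measured_c, threshold_c=1.0, window=12):
    """Return True if the mean absolute prediction error over the most recent
    `window` samples exceeds `threshold_c` degrees Celsius."""
    if len(predicted_c) < window or len(measured_c) < window:
        return False
    errors = [abs(p - m) for p, m in zip(predicted_c[-window:], measured_c[-window:])]
    return mean(errors) > threshold_c

# Example: the model consistently reads ~1.5 degC off over the last 12 samples.
pred = [24.0] * 12
meas = [25.5] * 12
print(drift_detected(pred, meas))  # True -> retrain or fall back to supervisory control
```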

The discomfort is not purely technical; it is also human. Many data center operators remain uneasy about letting AI take the controls entirely, even as they acknowledge its potential. In AI’s ascent toward autonomy, trust remains the runway still under construction.

Critically, modern AI control frameworks are being designed with built-in safety, transparency and human oversight. For example, Vigilent, a provider of AI-based optimization controls for data center cooling, reports that its optimizing control switches to “guard mode” whenever it is unable to maintain the data center environment within tolerances. Guard mode brings on additional cooling capacity (at the expense of power consumption) to restore SLA-compliant conditions; typical triggers include rapid temperature drift or hot spots. A manual override option also allows the operator to take control at any time, with visibility maintained through monitoring and event logs.

This layered logic provides operational resiliency by enabling systems to fail safely: guard mode ensures stability, manual override guarantees operator authority, and explainability, via decision-tree logic, keeps every AI action transparent. Even in dark-mode operation, alarms and reasoning remain accessible to operators.
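
A minimal sketch of how this kind of layered fail-safe logic could be structured appears below. It is an illustration of the concept only, not Vigilent's actual control code; the mode names follow the description above and the tolerance checks and logging are assumptions.

```python
# Illustrative layered control logic: optimize -> guard mode -> manual override.
# Mode names follow the description above; all behavior here is an assumption.
from enum import Enum

class Mode(Enum):
    OPTIMIZE = "optimize"   # AI adjusts cooling for efficiency
    GUARD = "guard"         # add cooling capacity to restore SLA conditions
    MANUAL = "manual"       # operator has taken control

def select_mode(within_tolerance: bool, operator_override: bool, current: Mode) -> Mode:
    if operator_override:
        return Mode.MANUAL          # operator authority always wins
    if not within_tolerance:
        return Mode.GUARD           # fail safely: favor stability over efficiency
    return Mode.OPTIMIZE if current != Mode.MANUAL else current

def log_decision(mode: Mode, reason: str) -> None:
    # Explainability: every transition is recorded so operators retain visibility,
    # even in dark-mode operation.
    print(f"mode={mode.value} reason={reason}")

# Example: a hot spot pushes the environment out of tolerance.
mode = select_mode(within_tolerance=False, operator_override=False, current=Mode.OPTIMIZE)
log_decision(mode, "cold-aisle temperature above SLA threshold")
```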

These frameworks directly address one of the primary fears among data center operators: losing visibility into what the system is doing.

Outlook

Gradually, the concept of a dark data center, one operated remotely with minimal on-site staff, has shifted from an interesting theory to a desirable strategy. In recent years, many infrastructure operators have increased their use of automation and remote-management tools to enhance resiliency and operational flexibility, while also compensating for low staffing levels. Cooling systems, particularly those governed by AI-assisted control, are now central to this operational transformation.

Operational autonomy does not mean abandoning human control; it means achieving reliable operation without the need for constant supervision. Ultimately, a dark data center is not about turning off the lights; it is about turning on trust.


The Uptime Intelligence View

AI in thermal management has evolved from an experimental concept into an essential tool, improving efficiency and reliability across data centers. The next step — coordinating facility water, air and IT cooling liquid systems — will define the evolution toward greater operational autonomy. However, the transition to “dark” operation will be as much cultural as it is technical. As explainability, fail-safe modes and manual overrides build operator confidence, AI will gradually shift from being a copilot to autopilot. The technology is advancing rapidly; the question is how quickly operators will adopt it.

The post AI and cooling: toward more automation appeared first on Uptime Institute Blog.


Data Center Outsourcing Market to Surpass USD 243.3 Billion by 2034

12 November 2025 at 15:00

The global data center outsourcing market was valued at USD 132.3 billion in 2024 and is estimated to grow at a CAGR of 6.4% to reach USD 243.3 billion by 2034, according to a recent report by Global Market Insights Inc.

The demand for data center outsourcing continues to rise as businesses increasingly pursue flexible and secure infrastructure solutions. Organizations are embracing hybrid cloud strategies that combine the control of private clouds with the agility of public cloud services. This approach enables companies to scale operations while maintaining tighter oversight of critical data. Outsourcing providers are now offering integrated solutions that span both private and public environments, optimizing performance and cost management simultaneously.

As emerging technologies such as 5G, IoT, and real-time applications gain momentum, enterprises are turning to edge computing for faster processing at the source of data generation. This has led to a shift toward more distributed outsourcing models, where smaller, decentralized facilities are placed closer to end users. In the US, hyperscale operators, including Microsoft Azure, Google Cloud, AWS, and IBM, are leading the outsourcing movement with their massive infrastructure and capacity to support enterprises at scale without hefty upfront investments. Meanwhile, data privacy frameworks such as HIPAA, FINRA, and CCPA are shaping outsourcing demand, driving businesses to work with providers who offer certified facilities, robust compliance support, and regional regulatory alignment.

The hardware segment of the data center outsourcing market captured a 43.7% share in 2024 and is projected to grow at a CAGR of 6.4% through 2034. With rising data volumes and evolving technologies, organizations are opting to outsource hardware management to cut capital expenses and adopt an operating cost model. Managing in-house infrastructure upgrades is cost-intensive and time-consuming, which is why outsourcing hardware services has become a preferred path to scalability and agility.

The power and cooling infrastructure segment is expected to register a CAGR of 8.7% from 2025 to 2034. Outsourcing providers are introducing advanced energy management solutions to support high-performance computing environments. Technologies such as AI-based temperature control, liquid cooling, and free cooling are being adopted to handle heat generated by dense workloads while also reducing energy consumption and enhancing system reliability.

The United States data center outsourcing market held a 76.1% share in 2024, generating USD 34.8 billion. The US remains a global hub for data centers, driven by the presence of major providers such as Equinix, Amazon Web Services (AWS), Verizon Communications, and Google Cloud. As regulatory frameworks become more complex, companies increasingly seek third-party partners with the compliance credentials and infrastructure to navigate evolving data privacy laws. Canada’s enterprise market is also transitioning toward cloud-driven outsourcing models, prioritizing speed, innovation, and cross-platform orchestration. Providers with strong hybrid and multi-cloud capabilities are seeing increased traction across the region.

Key companies operating in the global data center outsourcing market include Cognizant, Tata Consultancy Services (TCS), Fujitsu, Accenture, Amazon Web Services (AWS), Google Cloud, Microsoft Azure, Equinix, Verizon Communications, and Digital Realty. To strengthen their position in the competitive data center outsourcing space, companies are focusing on expanding global infrastructure, integrating edge computing solutions, and offering hybrid and multi-cloud management platforms. Strategic investments are being made in AI-based automation for data management, energy optimization, and real-time monitoring. Providers are also forming alliances with hyperscale cloud vendors to co-deliver scalable services while ensuring compliance with evolving regional regulations. Emphasis is being placed on offering flexible service models, cost-effective infrastructure-as-a-service (IaaS), and dedicated support for industry-specific compliance, like healthcare or finance.

The post Data Center Outsourcing Market to Surpass USD 243.3 Billion by 2034 appeared first on Data Center POST.

Addressing the RF Blind Spot in Modern Data Centers

10 November 2025 at 16:00

The rapid adoption of artificial intelligence (AI) and the computing power required to train and deploy advanced models have driven a surge in data center development at a scale not seen before. According to UBS, companies will spend $375 billion globally this year on AI infrastructure and $500 billion next year. It is projected that more than 4,750 data centers will be under construction in primary markets in the United States alone in 2025.

While data center investments often focus on servers, power, and cooling, cellular connectivity is an underrated element in ensuring these facilities operate reliably and safely over the long term. It is important for operators to understand how connectivity impacts both commercial operations and public safety.

Supporting Technicians and On-Site Personnel

Reliable cellular connectivity is important in day-to-day operations for technicians, engineers, and contractors. From accessing digital work orders to coordinating with off-site experts, mobile devices are central tools for keeping operations running smoothly.

The challenge is that signal strength often weakens in the very areas where staff spend the most time: data halls, mechanical rooms, and utility spaces. Consistent coverage across the entire facility eliminates those gaps. It allows technicians to complete tasks more efficiently, reduces delays, and ensures that communications remain uninterrupted.

Connectivity also improves worker safety. Personnel must be able to reach colleagues or emergency services at any time, regardless of where they are in the facility. Reliable connectivity helps protect both people and operations.

Cellular Connectivity for Data Center Operations

Data centers are highly complex ecosystems, requiring constant monitoring, rapid coordination, and efficient communication. They are often built in remote locations with plenty of land and natural resources to help with cooling, but this frequently results in poor cellular connectivity. In addition, they are primarily constructed of steel and concrete for stability and fire resistance, materials that are difficult for radio frequency (RF) signals to penetrate. Weak signals or dropped calls can delay problem resolution, introduce operational risks, and reduce resiliency.

In the event of an emergency, the stakes are even higher. Cellular service becomes the lifeline for coordinating evacuation procedures, communicating with local authorities, and enabling first responders to perform their duties. Without strong coverage throughout the facility, including in underground or shielded areas, response times can be compromised.

Solutions like distributed antenna systems (DAS) help solve this challenge by connecting base stations to the site, bringing wireless connectivity from the macro network inside the facility and ensuring operators can maintain real-time contact with vendors, remote support teams, and internal staff.

As new facilities increasingly rise in remote or challenging environments, extending reliable cellular service inside the building ensures operational continuity, no matter the location or construction materials involved.

Unified Cellular Networks for Lower Costs

Even though data center spending is at record levels, cellular infrastructure can be costly. There are, however, ways to mitigate the expense up front. Normally, DAS is implemented in large facilities due to public safety requirements. Building codes enforced by authorities having jurisdiction (AHJs) require in-building coverage for emergency communications, ensuring that first responders can connect reliably in critical situations. These mandates drive the deployment of emergency responder communication enhancement systems (ERCES) designed to meet strict performance standards in adherence with the International Fire Code (IFC) and the National Fire Protection Association (NFPA).

Many operators realize too late that this infrastructure can also deliver substantial benefits for their own staff; by that point, adding commercial coverage usually requires an entirely separate system running in parallel with the public safety system, including new remote units, cables, and passive components. Operators who plan ahead and install both at the same time can serve public safety and commercial cellular needs within a unified architecture.

The advantages are significant. A unified cellular network reduces the cost and complexity of building two separate systems in parallel. It also ensures that first responders, facility operators, and everyday users all benefit from consistent connectivity throughout the building. It is also capable of supporting evolving technologies such as 5G and emerging public safety requirements.

Developing Resilience

As AI accelerates the demand for new data centers, operators must look beyond traditional infrastructure requirements. Power and cooling remain fundamental, but so too does the ability to maintain clear and reliable lines of communication. Cellular coverage should not be a secondary concern because it supports remote monitoring, emergency response, technician efficiency, and worker safety. When deployed as a unified cellular solution, it also maximizes investment by serving both public safety and commercial needs.

In a mission-critical environment like data center operations, uninterrupted communication on-site and with outside stakeholders is non-negotiable. As facilities continue to expand in size and complexity, cellular connectivity will be essential to keeping them operational with minimal downtime.


About the Author:

Mohammed Ali is the manager of DAS Engineering at Advanced RF Technologies, Inc. (ADRF), responsible for leading the DAS engineering division within the company across all global accounts. He has more than 10 years of experience in in-building DAS engineering and wireless network planning. Prior to joining ADRF, Mohammed worked as an RF Engineer at TeleworX and Huawei Technologies Sudan and as a Network Management Engineer at ZAIN Sudan. Mohammed holds a Bachelor of Science in Telecommunications Engineering from the University of Khartoum in Sudan and a Master of Science in Telecommunications Engineering from the University of Maryland.

The post Addressing the RF Blind Spot in Modern Data Centers appeared first on Data Center POST.

AI’s growth calls for useful IT efficiency metrics

The digital infrastructure industry is under pressure to measure and improve the energy efficiency of the computing work that underpins digital services. Enterprises seek to maximize returns on cost outlay and operating expenses for IT hardware, and regulators and local communities need reassurance that the energy devoted to data centers is used efficiently. These objectives call for a productivity metric to measure the amount of work that IT hardware performs per unit of energy.

With generative AI projected to boost data center power demand substantially, the stakes have arguably never been higher. Fortunately, organizations monitoring the performance and efficiency of their AI applications can benefit from experiences in the field of supercomputing.

In September 2025, Uptime Intelligence participated in a panel discussion about AI energy efficiency at the Yotta 2025 conference in Las Vegas (Nevada, US). The panelists drew on their extensive experience in supercomputing to weigh in on discussions around AI training efficiency. They discussed the need for a productivity metric to measure it, as well as a key caveat organizations need to consider.

Organizations such as Uptime Intelligence and The Green Grid have published guidance on calculating work capacity for various types of IT. Software applications and their supporting IT hardware vary significantly, so consensus on a single metric to compare energy performance remains out of reach for the foreseeable future. However, tracking energy performance in a given facility over time is important, and is achievable practically for many organizations today.

Defining AI computing work

The work capacity of IT equipment is needed to calculate its utilization and energy performance when running an application. The Green Grid white paper IT work capacity metric V1 — a methodology provides a methodology for calculating a work capacity value for CPU-based servers. Uptime Intelligence has proposed methodologies to extend this to accelerator-based servers for AI and other applications (see Calculating work capacity for server and storage products).

Floating point operations per second (FLOPS) is a common and readily available unit of work capacity for CPU- or accelerator-based servers. In 2025, an AI server’s work capacity is usually measured in trillions of FLOPS, or teraFLOPS (TFLOPS).

Not all FLOPS are the same

Even though large-scale AI training is radically reshaping many commercial data centers, the underlying software and hardware are not fundamentally new. AI training is essentially one of many applications of supercomputing. Supercomputing software, along with the IT selection and configuration, varies in many ways — and one of the most relevant variables when monitoring energy performance is floating point precision. This precision (measured in bits) is analogous to the number of decimal places used in inputs and outputs.

GPUs and other accelerators can perform 64-, 32-, 16-, 8- and 4-bit calculations, and some can use mixed precision. While a high-performance computing (HPC) workload such as computational fluid dynamics might use 64-bit (“double precision”) floating point calculations for high accuracy, other applications do not have such exacting requirements. Lower precision consumes less memory per calculation — and, crucially, less energy. The panel discussion at Yotta raised an important distinction: unlike most engineering and research applications, today’s AI training and inference calculations typically use 4-bit precision.

Floating point precision is necessary information when evaluating a TFLOPS benchmark. A server’s 64-bit TFLOPS rating is typically one-half of its 32-bit TFLOPS rating, and one-sixteenth of its 4-bit rating. For consistent AI work capacity calculation, Uptime Institute recommends that IT operators use the 32-bit TFLOPS values supplied by their AI server providers.
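
The sketch below normalizes vendor TFLOPS ratings to a common 32-bit basis using the halving relationship described above. The scaling factors assume throughput doubles each time precision halves, which will not hold exactly for every accelerator; treat it as an illustration rather than a universal conversion.

```python
# Normalize a vendor TFLOPS rating to a 32-bit equivalent, assuming throughput
# doubles each time precision halves (64 -> 32 -> 16 -> 8 -> 4 bit).
SCALE_TO_32BIT = {64: 2.0, 32: 1.0, 16: 0.5, 8: 0.25, 4: 0.125}

def tflops_at_32bit(rated_tflops: float, precision_bits: int) -> float:
    return rated_tflops * SCALE_TO_32BIT[precision_bits]

# Example: an 800 TFLOPS figure quoted at 4-bit precision corresponds to
# ~100 TFLOPS at 32-bit, the same work capacity as a 50 TFLOPS 64-bit rating.
print(tflops_at_32bit(800, 4))   # 100.0
print(tflops_at_32bit(50, 64))   # 100.0
```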

Working it out: work per energy

The maximum work capacity calculation for a server can be aggregated at the level of a rack, a cluster or a data center. Work capacity multiplied by average utilization (as a percentage) produces an estimate of the amount of calculation work (in TFLOPS) that was performed over a given period. Operators can divide this figure by the energy consumption (in MWh) over that same time to yield an estimate of the work’s energy efficiency, in TFLOPS/MWh. Separate calculations for CPU-based servers, accelerator-based servers, and other IT (e.g., storage) will provide a more accurate assessment of energy performance (see Figure 1).
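
As a worked illustration of the calculation just described, the sketch below aggregates rated work capacity, applies average utilization and divides by energy consumed over the same period. The server count, capacity, utilization and energy figures are made-up assumptions.

```python
# Work-per-energy estimate for a group of servers, following the method above.
# Inputs (capacity in 32-bit TFLOPS, utilization as a fraction, energy in MWh)
# are illustrative assumptions.

def work_per_energy(server_tflops_32bit, avg_utilization, energy_mwh):
    """Estimated work performed per unit of energy, in TFLOPS/MWh."""
    total_capacity = sum(server_tflops_32bit)           # aggregate rated work capacity
    work_performed = total_capacity * avg_utilization   # capacity x average utilization
    return work_performed / energy_mwh

# Example: 20 accelerator servers rated at 100 TFLOPS each, 55% average
# utilization, consuming 45 MWh over the measurement period.
print(round(work_per_energy([100.0] * 20, 0.55, 45.0), 1))  # ~24.4 TFLOPS/MWh
```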

Figure 1 Examples of IT equipment work-per-energy calculations


Even when TFLOPS figures are normalized to the same precision, it is difficult to use this information to draw meaningful comparisons between the energy performance of significantly different hardware types and configurations. Accelerator power consumption does not scale linearly with utilization levels. Additionally, the details of software design will determine how closely real-world application performance aligns with simplified work capacity benchmarks.

However, many organizations can benefit from calculating this TFLOPS/MWh productivity metric and are already well equipped to do so. This calculation is most useful for quantifying efficiency gains over time, e.g., from IT refresh and consolidation, or refinements to operational control. In some jurisdictions, tracking TFLOPS/MWh as a productivity metric can satisfy certain regulatory requirements. IT efficiency is often overlooked in favor of facility efficiency — but a consistent productivity metric can help to quantify available improvements.


The Uptime Intelligence View

Generative AI training is poised to drive up data center energy consumption, prompting calls for regulation, responsible resource use and return on investment. A productivity metric can help meet these objectives by consistently quantifying the amount of computing work performed per unit of energy. Supercomputing experts agree that operators should track and use this data, but they caution against interpreting it without the necessary context. A simplified, practical work-per-energy metric is most useful for tracking improvement in one facility over time.

The following participants took part in the panel discussion on energy efficiency at Yotta 2025:

  • Jacqueline Davis, Research Analyst at Uptime Institute (moderator)
  • Dr Peter de Bock, former Program Director, Advanced Research Projects Agency–Energy
  • Dr Alfonso Ortega, Professor of Energy Technology, Villanova University
  • Dr Jon Summers, Research Lead in Data Centers, Research Institutes of Sweden

Other related reports published by Uptime Institute include:

Calculating work capacity for server and storage products

The following Uptime Institute experts were consulted for this report:

Jay Dietrich, Research Director of Sustainability, Uptime Institute

The post AI’s growth calls for useful IT efficiency metrics appeared first on Uptime Institute Blog.

AI power fluctuations strain both budgets and hardware

AI training at scale introduces power consumption patterns that can strain both server hardware and supporting power systems, shortening equipment lifespans and increasing the total cost of ownership (TCO) for operators.

These workloads can cause GPU power draw to spike briefly, even for only a few milliseconds, pushing them past their nominal thermal design power (TDP) or against their absolute power limits. Over time, this thermal stress can degrade GPUs and their onboard power delivery components.

Even when average power draw stays within hardware specifications, thermal stress can affect voltage regulators, solder joints and capacitors. This kind of wear is often difficult to detect and may only become apparent after a failure. As a result, hidden hardware degradation can ultimately affect TCO — especially in data centers that are not purpose-built for AI compute.

Strain on supporting infrastructure

AI training power swings can also push server power supply units (PSUs) and connectors beyond their design limits. PSUs may be forced to absorb rapid current fluctuations, straining their internal capacitors and increasing heat generation. In some cases, power swings can trip overcurrent protection circuits, causing unexpected reboots or shutdowns. Certain power connectors, such as the standard 12VHPWR cables used for GPUs, are also vulnerable. High contact resistance can cause localized heating, further compounding the wear and tear effects.

When AI workloads involve many GPUs operating in synchronization, power swing effects multiply. In some cases, simultaneous power spikes across multiple servers may exceed the rated capacity of row-level UPS modules — especially if they were sized following legacy capacity allocation practices. Under such conditions, AI compute clusters can sometimes reach 150% of their steady-state maximum power levels.

In extreme cases, load fluctuations of large AI clusters can exceed a UPS system’s capability to source and condition power, forcing it to use its stored energy. This happens when the UPS is overloaded and unable to meet demand using only its internal capacitance. Repeated substantial overloads will put stress on internal components as well as the energy storage subsystem. For batteries, particularly lead-acid cells, this can shorten their service life. In worst-case scenarios, these fluctuations may cause voltage sags or other power quality issues (see Electrical considerations with large AI compute).

Capacity planning challenges

Accounting for the effects of power swings from AI training workloads during the design phase is challenging. Many circuits and power systems are sized based on the average demand of a large and diverse population of IT loads, rather than their theoretical combined peak. In the case of large AI clusters, this approach can lead to a false sense of security in capacity planning.
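
The sketch below contrasts conventional average-based sizing with sizing for coincident peaks in a synchronized GPU cluster. The node count, per-node power figures and the assumption that all spikes align are hypothetical; they simply illustrate why average-based planning can understate the required headroom.

```python
# Hypothetical comparison of two capacity-planning approaches for a GPU cluster.
NODES = 32            # GPU servers in the cluster
AVG_NODE_KW = 8.0     # typical sustained draw per node
PEAK_NODE_KW = 12.0   # brief synchronized spike per node (above TDP)

average_based_sizing_kw = NODES * AVG_NODE_KW   # 256 kW "diverse load" view
coincident_peak_kw = NODES * PEAK_NODE_KW       # 384 kW if spikes align

headroom_needed_pct = 100 * (coincident_peak_kw / average_based_sizing_kw - 1)
print(f"average-based: {average_based_sizing_kw:.0f} kW, "
      f"coincident peak: {coincident_peak_kw:.0f} kW "
      f"({headroom_needed_pct:.0f}% above the average-based figure)")
```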

When peak amplitudes are underestimated, branch circuits can overheat, breakers may trip, and long-term damage can occur to conductors and insulation — particularly in legacy environments that lack the headroom to adapt. Compounding this challenge, typical monitoring tools track GPU power every 100 milliseconds or more — too slow to detect the microsecond-speed spikes that can accelerate the wear on hardware through current inrush.

Estimating peak power behavior depends on several factors, including the AI model, training dataset, GPU architecture and workload synchronization. Two training runs on identical hardware can produce vastly different power profiles. This uncertainty significantly complicates capacity planning, leading to under-provisioned resources and increased operational risks.

Facility designs for large-scale AI infrastructure need to account for the impact of dynamic power swings. Operators of dedicated training clusters may overprovision UPS capacity, use rapid-response PSUs, or set absolute power and rate-of-change limits on GPU servers using software tools (e.g., Nvidia-SMI). While these approaches can help reduce the risk of power-related failures, they also increase capital and operational costs and can reduce efficiency under typical load conditions.
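
As one example of the software-based limits mentioned above, the sketch below applies a GPU power cap by shelling out to nvidia-smi. The 500 W cap is an arbitrary assumption (valid limits depend on the GPU model), the command requires administrative privileges, and newer rate-of-change ("power smoothing") controls are not shown.

```python
# Illustrative use of nvidia-smi to cap GPU power (requires admin privileges).
# The 500 W value is an assumption; valid limits depend on the GPU model.
import subprocess

def set_gpu_power_cap(gpu_index: int, watts: int) -> None:
    # Query the current limit, then apply the new cap.
    current = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.limit", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout.strip()
    print(f"GPU {gpu_index} current power limit: {current}")
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

if __name__ == "__main__":
    set_gpu_power_cap(0, 500)   # trade some peak performance for smaller spikes
```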

Many smaller operators — including colocation tenants and enterprises exploring AI — are likely testing or adopting AI training on general-purpose infrastructure. Nearly three in 10 operators already perform AI training, and of those that do not, nearly half expect to begin in the near future, according to results from the Uptime Institute AI Infrastructure Survey 2025 (see Figure 1).

Figure 1 Three in 10 operators currently perform AI training


Many smaller data center environments may lack the workload diversity (non-AI loads) needed to absorb power swings, or the specialized engineering required to manage dynamic power consumption behavior. As a result, these operators face a greater risk of failure events, hardware damage, shortened component lifespans and reduced UPS reliability — all of which contribute to higher TCO.

Several low-cost strategies can help mitigate risk. These include oversizing branch circuits — ideally dedicating them to GPU servers — distributing GPUs across racks and data halls to prevent localized hotspots, and setting power caps on GPUs to trade some peak performance for longer hardware lifespan.

For operators considering or already experimenting with AI training, TDP alone is an insufficient design benchmark for capacity planning. Infrastructure needs to account for rapid power transients, workload-specific consumption patterns, and the complex interplay between IT hardware and facility power systems. This is particularly crucial when using shared or legacy systems, where the cost of misjudging these dynamics can quickly outweigh the perceived benefits of performing AI training in-house.


The Uptime Intelligence View

For data centers not specifically designed to support AI training workloads, GPU power swings can quietly accelerate hardware degradation and increase costs. Peak power consumption of these workloads is often difficult to predict, and signs of component wear may remain hidden until failures occur. Larger operators with dedicated AI infrastructure are more likely to address these power dynamics during the design phase, while smaller operators — or those using general-purpose infrastructure — may have fewer options.

To mitigate risk, these operators can consider overprovisioning rack-level UPS capacity for GPU servers, oversizing branch circuits (and dedicating them to GPU loads where possible), distributing heat from GPU servers across racks and rooms to avoid localized hotspots, and applying software-based power caps. Data center operators should also factor in more frequent hardware replacements during financial planning to more accurately reflect the actual cost of running AI training workloads.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute
Daniel Bizo, Senior Research Director, Uptime Institute Intelligence
Max Smolaks, Research Analyst, Uptime Institute Intelligence

Other related reports published by Uptime Institute include:
Electrical considerations with large AI compute

The post AI power fluctuations strain both budgets and hardware appeared first on Uptime Institute Blog.

Retail vs wholesale: finding the right colo pricing model

Colocation providers may offer two pricing and packaging models to sell similar products and capabilities. In both models, customers purchase space, power and services. However, the method of purchase differs.

In a retail model, customers purchase a small quantity of space and power, usually by the rack or a fraction of a rack. The colocation provider standardizes contracts, pricing and capabilities — the cost and complexity of delivering to a customer’s precise requirements are not justified, considering the relatively small contract value.

In a wholesale model, customers purchase a significantly larger quantity of space and power, typically at least a dedicated, enclosed suite of white space. Due to the size of these contracts, colocation providers need to be flexible in meeting customer needs, even potentially building new facilities to accommodate their requirements. The colocation provider negotiates price and terms, and customers often prefer to pay for actual power consumption rather than be billed on maximum capacity. A metered model allows the customer to scale power usage in response to changing demands.

A colocation provider may focus on a particular market by offering only a retail or wholesale model, or the provider may offer both to broaden its appeal. The terms “wholesale” and “retail” colocation more accurately describe the pricing and packaging models used by colocation providers rather than the type of customer.

Table 1 Key differences between retail and wholesale colocation providers


Retail colocation deals typically have higher gross margins in percentage terms, but the volume of sales is lower. Most colocation providers would rather sell wholesale contracts because they offer higher revenues through larger volumes of sales, despite having lower gross margins. Because wholesale customers are the better prospects, retail customers are more likely to experience cost increases at renewal than wholesale customers.

Retail colocation pricing model

Retail terms are designed to be simple and predictable. Customers are typically charged a fixed fee based on the maximum power capacity supplied to equipment and the space used. This fee covers both the repayment of fixed costs and the variable costs associated with IT power and cooling. The fixed fee bundles all these elements together, so customers have no visibility into the individual components — but they benefit from predictable pricing.

In retail colocation, the facilities are already available, so capital costs are recovered across all retail customers through standard pricing. If a customer exceeds their allotted maximum power capacity, they risk triggering a breaker and potentially powering down their IT equipment. Some colocation providers monitor for overages and warn customers that they need to increase their capacity before an outage occurs.

Customers are likely to purchase more power capacity than they need to prevent these outages. As a result, some colocation providers may deliberately oversubscribe power consumption to reduce their power costs and increase their profit margins. There are operational and reputational risks if oversubscription causes service degradation or outages.

Some colocation providers also meter power, charging a fee based on IT usage, which factors in the repayment of capital, IT and cooling costs, as well as a profit margin. Those with metering enabled may charge customers for usage exceeding maximum capacity, typically at a higher rate.

Can a colocation provider increase prices during a contract term? Occasionally, but only as a last resort — such as if power costs increase significantly. This possibility will be stipulated in the contract as an emergency or force majeure measure.

Usually, an internet connection is included. However, data transfer over that connection may be metered or bundled into a fixed cost package. Customers have the option to purchase cross-connects linking their infrastructure to third-party communications providers, including on-ramps to cloud providers.

Wholesale colocation pricing model

Wholesale colocation pricing is designed to offer customers the flexibility to utilize their capacity as they choose. Because terms are customized, pricing models will vary from customer to customer.

Some customers may prefer to pay for a fixed capacity of total power, regardless of whether the power is used or not. In this model, both IT power and cooling costs are factored into the price.

Other customers may prefer a more granular approach, with multiple charging components:

  • A fixed fee per unit of space or rack, based on maximum power capacity, designed to cover the colocation provider’s fixed costs while including a profit margin.
  • Variable IT power costs are passed directly from the electricity supplier to the customer, metered in kilowatts (kW). Customers bear the full cost of price fluctuations, which can change rapidly depending on grid conditions.
  • To account for variable cooling costs, power costs may be calculated by multiplying actual power usage by an agreed design PUE to create an “additional power” fee. This figure may also be multiplied by a “utilization factor” to reflect cases where a customer is using only a small fraction of the data hall (and therefore impacting overall efficiency). A worked sketch of this calculation follows the list.
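
A minimal sketch of the granular charging model described above follows. The rates, design PUE and utilization factor are invented for illustration; the additional-power term multiplies IT energy by (design PUE minus 1), which is one possible reading of the description above, and real contracts will differ.

```python
# Hypothetical wholesale colocation bill using the components listed above.
FIXED_FEE_PER_KW_MONTH = 120.0   # covers fixed costs plus margin, per kW of capacity
ENERGY_PRICE_PER_KWH = 0.11      # passed through from the electricity supplier
DESIGN_PUE = 1.3                 # agreed design PUE for the "additional power" fee
UTILIZATION_FACTOR = 1.1         # surcharge while the customer fills only part of the hall

def monthly_bill(contracted_kw: float, it_energy_kwh: float) -> float:
    fixed = contracted_kw * FIXED_FEE_PER_KW_MONTH
    it_power = it_energy_kwh * ENERGY_PRICE_PER_KWH
    # "Additional power" covers cooling and other overhead, here taken as
    # IT energy x (design PUE - 1), scaled by the utilization factor.
    additional_power = it_energy_kwh * (DESIGN_PUE - 1) * ENERGY_PRICE_PER_KWH * UTILIZATION_FACTOR
    return fixed + it_power + additional_power

# Example: 500 kW contracted, 250,000 kWh of IT energy consumed in the month.
print(f"${monthly_bill(500, 250_000):,.0f}")   # ~$96,575
```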

Some customers may prefer a blended model of both a fixed element for baseline capacity and a variable charge for consumption above the baseline. Redundant feeds are also likely to impact cost. If new data halls need to be constructed, these costs may be passed on to the customers directly, or some capital may be recovered through a higher fixed rack fee.

Alternatively, for long-term deployments, customers may opt for either a “build-to-suit” or “powered shell” arrangement. In a build-to-suit model, the colocation provider designs and constructs the facility — including power, cooling and layout — to the customer’s exact specifications. The space is then leased back to the customer, typically under a long-term agreement exceeding a decade.

In a powered shell setup, the provider delivers a completed exterior building with core infrastructure, such as utility power and network access. The customer is then responsible for outfitting the interior (racks, cooling, electrical systems) to suit their operational needs.

Most customers using wholesale colocation providers will need to implement cross-connects to third-party connectivity and network providers hosted in meet-me rooms. They may also need to arrange the construction of new capacity into the facility with the colocation provider and suppliers.

Hyperscalers are an excellent prospect for wholesale colocation, given their significant scale. However, their limited numbers and strong market power enable them to negotiate lower margins from colocation providers.

Table 2 Pricing models used in retail and wholesale colocation


In a retail colocation engagement, the customer has limited negotiating power — with little scale, they generally have minimal flexibility on pricing, terms and customization. In a wholesale engagement, the opposite is true, and the arrangement favors the customer. Colocation providers want the scale and sales volume, so are willing to cut prices and accommodate additional requirements. They are also willing to offer flexible pricing in response to customers’ rapidly changing requirements.


The Uptime Intelligence View

Hyperscalers have the strongest market power to dictate contracts and prices. With so few players, it is unlikely that many hyperscalers will be bidding for the same space, which would push up prices. However, colocation providers still want their business, because of the volume it brings. They would prefer to reduce gross margins to ensure a win, rather than risk losing a customer with such unmatched scale.

The post Retail vs wholesale: finding the right colo pricing model appeared first on Uptime Institute Blog.

Electrical considerations with large AI compute

The training of large generative AI models is a special case of high-performance computing (HPC) workloads. This is not simply due to the reliance on GPUs — numerous engineering and scientific research computations already use GPUs as standard. Neither is it about the power density or the liquid cooling of AI hardware, as large HPC systems are already extremely dense and use liquid cooling. Instead, what makes AI compute special is its runtime behavior: when training transformer-based models, large compute clusters can create step load-related power quality issues for power distribution systems in data center facilities. A previous Intelligence report offers an overview of the underlying hardware-software mechanisms.

The scale of the power fluctuations makes this phenomenon unusual and problematic. The vast number of generic servers found in most data centers collectively produce a relatively steady electrical load — even if individual servers experience sudden changes in power usage, those changes are uncorrelated. In contrast, the power use of compute nodes in AI training clusters moves in near unison.

Even compared with most other HPC clusters, AI training clusters exhibit larger power swings. This is due to an interplay between transformer-based neural network architectures and compute hardware, which creates frequent spikes and falls (every second or two) in power demand. These fluctuations correspond to the computational steps in the training processes, exacerbated by an aggressive pursuit of peak performance typical in modern silicon.

Powerful fluctuations

The scope of the resulting step changes in power will depend on the size and configuration of the compute cluster, as well as operational factors such as AI server performance and power management settings. Uptime Intelligence estimates that in worst-case scenarios, the difference between the low and high points of power draw during training program execution can exceed 100% on a system level (the load doubles almost instantaneously, within milliseconds) for some configurations.

These extremes occur every few seconds, whenever a batch of weights and biases is loaded on GPUs and the training begins. This is often accompanied by a massive spike in current, produced by power excursion events as GPUs overshoot their thermal design power rating (TDP) to opportunistically exploit any extra thermal and power delivery budget following a phase of lower transistor activity. In short, power spikes are made possible by intermittent lulls.

This behavior is common in modern compute silicon, including in personal devices and generic servers. Still, it is only with large AI compute clusters that these fluctuations across dozens or hundreds of servers move almost synchronously.

Even in moderately sized clusters with just a few dozen racks, this can result in sudden, millisecond-speed changes in AC power — ranging from several hundred kilowatts to even a few megawatts. If there are no other substantial loads present in the electrical mix to dampen these fluctuations, these step changes may stress capacity components in the power distribution systems. They may also cause power quality issues such as voltage sags and swells, or significant harmonics and sub-synchronous oscillations that distort the sinusoidal waveforms in AC power systems.

Based on several discussions with and disclosures by major electrical equipment manufacturers — including ABB, Eaton, Schneider Electric, Siemens and Vertiv — there is a general consensus that modern power distribution equipment is expected to be able to handle AI power fluctuations, as long as they remain within the rated load.

IT system capacity redefined

The issue of AI step loads appears to center on equipment capacity and the need to avoid frequent overloads. Standard capacity planning practices often start with the nameplate power of installed IT hardware, then derate it to estimate the expected actual power. This adjustment can reduce the total nameplate power by 25% to 50% across all IT loads when accounting for the diversity of workloads — since they do not act in unison — and also for the fact that most software rarely pushes the IT hardware close to its rated power.

In comparison, AI training systems can show extreme behavior. Larger AI compute clusters have the potential to draw something similar to an inrush current (a rapid change of current, often denoted by high di/dt) that exceeds the IT system’s sustained maximum power rating.

Normally, overloads would not pose a problem for modern power distribution. All electrical components and systems have specified overload ratings to handle transient events (e.g., current surges during the startup of IT hardware or other equipment) and are designed and tested accordingly. However, if power distribution components are sized closely to the rated capacity of the AI compute load, these transient overloads could happen millions of times per year in the worst cases — components are not tested for regularly repeated overloads. Over time, this can lead to electromechanical stress, thermal stress and gradual overheating (heat-up is faster than cool-off) — potentially resulting in component failure.

This brings the definition of capacity to the forefront of AI compute step loads. Establishing the repeated peak power of a single GPU-server node is already a non-trivial effort — it requires running a variety of computationally intensive codes and setting up a high-precision power monitor. However, predicting how a specific compute cluster spanning several racks and potentially hundreds or even thousands of GPUs will behave during a training run is difficult to ascertain ahead of deployment.

The expected power profile also depends on server configurations, such as power supply redundancy level, cooling mode and GPU generations. For example, in a typical AI system from the 2022-2024 generation, power fluctuations can reach up to 4 kW per 8-GPU server node, or 16 kW per rack when populated with four nodes, according to Uptime estimates. Even so, the likelihood of exceeding the rack power rating of around 41 kW is relatively low. Any overshoot is likely to be minor, as these systems are mostly air-cooled hardware designed to meet ASHRAE Class A2 specifications — allowed to operate in environments up to 35°C (95°F). In practice, most facilities supply much cooler air, making system fans cycle less intensely.

However, with recently launched systems, the issue is further exacerbated as GPUs account for a larger share of the power budget, not only because they use more power (in excess of 1 kW per GPU module) but also because these systems are more likely to use direct liquid cooling (DLC). Liquid cooling reduces system fan power, thereby reducing the stable load of server power. It also has better thermal performance, which helps the silicon to accumulate extra thermal budget for power excursions.

IT hardware specifications and information shared with Uptime by power equipment vendors indicate that, in the worst cases, load swings can reach 150%, with a potential for overshoots exceeding 10% above the system’s power specification. In the case of rack-scale systems based on Nvidia’s GB200 NVL72 architecture, sudden power climbs from around 60 kW to 70 kW up to more than 150 kW per rack can occur.

This compares to a maximum power specification of 132 kW, which means that, under worst-case assumptions, repeated overloads can amount to as much as 20% in instantaneous power, Uptime estimates. This warrants extra care regarding circuit sizing (including breakers, tap-off units and placements, busways and other conductors) to avoid overheating and related reliability issues.

Figure 1 shows the power pattern of a GPU-based compute cluster running a transformer-based model training workload. Based on hardware specifications and real-world power data disclosed to Uptime Intelligence, we algorithmically mimicked the behavior of a compute cluster comprising four Nvidia GB200 NVL72 racks and four non-compute racks. It demonstrates the power fluctuations to be expected during training runs on these clusters and underscores the need to rethink capacity planning compared with traditional, generic IT loads. Even though the average power stays below the power rating of the cluster, peak fluctuations can exceed it. While this estimate covers a relatively small cluster with 288 GPUs, a larger cluster would exhibit similar behavior at the megawatt scale.

Figure 1 Power profile of a GPU-based training cluster (algorithmic not real-world data)


In electrical terms, no multi-rack workload is perfectly synchronous, while the presence of other loads will help smooth out the edges of fluctuations further. When including non-compute ancillary loads in the cluster — such as storage systems, networks and CDUs (which also require UPS power) — a lower safety margin above the nominal rating (e.g., 10% to 15%) appears sufficient to cover any regular peaks over the nominal system power specifications, even with the latest AI hardware.
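
For readers who want to experiment with this behavior, the following is a simplified sketch of a synchronized step-load profile for a small cluster plus a steadier ancillary load. The rack count, power levels, timing and jitter are assumptions for illustration only; this is not the model used to produce Figure 1.

```python
# Simplified synthetic power profile: four compute racks stepping between a low
# phase (data loading/synchronization) and a high phase (training step) in near
# unison, plus a steady ancillary load. All values are assumptions.
import random

COMPUTE_RACKS = 4
LOW_KW, HIGH_KW = 65.0, 150.0   # per-rack power in the low and high phases
ANCILLARY_KW = 40.0             # storage, network, CDUs (roughly constant)
STEP_SECONDS = 2                # each phase lasts about two seconds
JITTER_KW = 3.0                 # racks are not perfectly synchronous

def facility_profile(duration_s: int):
    profile = []
    for t in range(duration_s):
        high_phase = (t // STEP_SECONDS) % 2 == 1
        per_rack = HIGH_KW if high_phase else LOW_KW
        compute = sum(per_rack + random.uniform(-JITTER_KW, JITTER_KW)
                      for _ in range(COMPUTE_RACKS))
        profile.append(compute + ANCILLARY_KW)
    return profile

p = facility_profile(20)
print(f"min {min(p):.0f} kW, max {max(p):.0f} kW, swing {max(p) - min(p):.0f} kW")
```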

Current mitigation options

There are several factors that data center operators may want to consider when deploying compute clusters dedicated to training large, transformer-based AI models. Currently, data center operators have a limited toolkit for fully handling large power fluctuations in a power distribution system, particularly when it comes to avoiding passing them on to the source in full. However, in collaboration with the IT infrastructure team or tenant, it should be possible to minimize fluctuations:

  • Mix with diverse IT loads, share generators. The best first option is to integrate AI training compute with other, diverse IT loads in a shared power infrastructure. This helps to diminish the effects of power fluctuations, particularly on generator sets. For dedicated AI training data center infrastructure installations, this may not be an option for power distribution. However, sharing engine generators will go a long way to dampen the effects of AI power fluctuations.
    Among power equipment, engine generator sets will be the most stressed if exposed to the full extent of the fluctuations seen in a large, dedicated AI training infrastructure. Even if correctly sized for the peak load, generators may struggle with large and fast fluctuations — for example, the total facility load stepping from 45% to 50% of design capacity to 80% to 85% within a second, then dropping back to 45% to 50% after two seconds, on repeat. Such fluctuation cycles may be at the limit of what the engines can handle, risking reduced expected life or outright failure.
  • Select UPS configurations to minimize power quality issues, overload. Even if a smaller frame can handle the fluctuations, according to the vendors, larger systems will carry more capacitance to help absorb the worst of the fluctuations, maintaining voltage and frequency within performance specifications. An additional measure is to use a higher capacity redundancy configuration, for example, by opting for N+2. This allows for UPS maintenance while avoiding any repeated overloads on the operational UPS systems, some of which might hit the battery energy storage system.
  • Use server performance/power management tools. Power and performance management of hardware remain largely underused, despite their ability to not only improve IT power efficiency but also contribute to the overall performance of the data center infrastructure. Even though AI compute clusters feature some exotic interconnect subsystems, they are essentially standard servers using standard hardware and software. This means there are a variety of levers to manage the peaks in their power and performance levels, such as power capping, turning off boost clocks, limiting performance states, or even setting lower temperature limits.
    To address the low end of fluctuations, switching off server energy-saving modes — such as silicon sleep states (known as C-states in CPU parlance) — can help raise the IT hardware’s power floor. A more advanced technique involves limiting the rate of power change (including on the way down). This feature, called “power smoothing”, is available through Nvidia’s System Management Interface on the latest generation of Blackwell GPUs.

Electrical equipment manufacturers are investigating the merits of additional rapid discharge/recharge energy storage and updated controls to UPS units with the aim of shielding the power source from fluctuations. These approaches include super capacitors, advanced battery chemistries or even flywheels that can tolerate frequent, short duration but high-powered discharge and recharge cycles. Next-generation AI compute systems may also include more capacitance and energy storage to limit fluctuations on the data center power system. Ultimately, it is often best to address an issue at its root (in this case the IT hardware and software) rather than treat the symptoms, although these may lie outside the control of data center facilities teams.


The Uptime Intelligence View

Most of the time, data center operators do not need to be overly concerned with the power profile of the IT hardware or the specifics of the associated workloads — rack density estimates were typically overblown to begin with, and overall capacity utilization tends to stay well below 100%. Even so, safety margins, which are expensive, could be thin. However, training large transformer models is different. The specialized compute hardware can be extremely dense, creates large power swings, and is capable of producing frequent power surges that are close to or even above its hardware power rating. This will force data center operators to reconsider their approach to both capacity planning and safety margins across their infrastructure.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute

The post Electrical considerations with large AI compute appeared first on Uptime Institute Blog.

Is this the data center metric for the 2030s?

When the PUE metric was first proposed and adopted at a Green Grid meeting in California in 2008, few could have forecast how important this simple ratio — despite its limitations — would become.

Few would have expected, too, that the industry would make so little progress on another metric proposed at those same early Green Grid meetings. While PUE highlighted the energy efficiency of the non-IT portion of a data center’s energy use, a separate “useful work” metric was intended to identify how much IT work was being done relative to the total facility and IT energy consumed. A list of proposals was put forward and votes were taken, but none of the ideas came close to being adopted.

Sixteen years later, minimal progress has been made. While some methods for measuring “work per energy” have been proposed, none have garnered any significant support or momentum. Efforts to measure inefficiencies in IT energy use — by far the largest source of both energy consumption and waste in a data center — have repeatedly stalled or failed to gain support.

That is set to change soon. The European Union and key member states are looking to adopt representative measurements of server (and storage) work capacity — which, in turn, will enable the development of a work per energy or work per watt-hour metric (see below and accompanying report).

So far, the EU has provided limited guidance on the work per energy metric, which it will need to agree on in 2025 or 2026. However, it will clearly require a technical definition of CPU, GPU and accelerator work capacity, along with energy-use boundaries.

Once the metric is agreed upon and adopted by the EU, it will likely become both important and widely cited. It would be the only metric that links IT performance to the energy consumed by the data center. Although it may take several years to roll out, this metric is likely to become widely adopted around the world.

The new metric

The EU officials developing and applying the rules set out in the Energy Efficiency Directive (EED) are still working on many key aspects of a data center labeling scheme set to launch in 2026. One area they are struggling with is the development of meaningful IT efficiency metrics.

Uptime Institute and The Green Grid’s proposed work per energy metric is not the only option, but it offers many key advantages. Chief among them: it has a clear methodology; the work capacity value increases with physical core count and newer technology generations; and it avoids the need to measure the performance of every server. The methodology can also be adapted to measure work per megawatt-hour for GPU/accelerator-based servers and dedicated storage equipment. While there are some downsides, these will likely be shared by most alternative approaches.

Full details of the methodology are outlined in the white papers and webinar listed at the end of the report. The initial baseline work — on how to calculate work capacity of standard CPU-based servers — was developed by The Green Grid. Uptime Institute Sustainability and Energy Research Director Jay Dietrich extended the methodology to GPU/accelerator-based servers and dedicated storage equipment, and expanded it to calculate the work per megawatt-hour metric.

The methodology has five components:

  • Build or access an inventory of all the IT in the data center. The required data on CPU, GPU and storage devices should be available in procurement systems, inventory management tools, computerized maintenance management systems (CMMS) or some DCIM platforms.
  • Calculate the work capacity of the servers using the PerfCPU values available on The Green Grid website. These values are based on CPU cores by CPU technology generation.
  • Include GPU or accelerator-based compute servers using the 32-bit TFLOPS metrics. An alternative performance metric, such as Total Processing Performance (TPP), may be used if agreed upon later.
  • Include online, dedicated storage equipment (excluding tape) measured in terabytes.
  • Collect data, usually from existing systems, on:
    • Power supplied to CPUs, GPUs and storage systems. This should be relatively straightforward if the appropriate meters and databases are in place. Where there is insufficient metering, it may be necessary to use reasonable allocation methods.
    • Utilization. It is critical for a work per energy metric to know the utilization averages. This data is routinely monitored in all IT systems, but it needs to be collected and normalized for reporting purposes.

With this data, the work per energy metric can be calculated by adding up and averaging the number of transactions per second, then dividing the result by the total amount of energy consumed. Like PUE, it is calculated over the course of a year to give an annual average. A simplified version, for three different workloads, is shown in Figure 1.

Figure 1 Examples of IT equipment work-per-energy calculations

Diagram: Examples of IT equipment work-per-energy calculations
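
To make the shape of this calculation concrete, the sketch below works through a simplified CPU-only example in Python. The inventory, PerfCPU-style per-core values, utilization figures and energy totals are all illustrative placeholders, not Green Grid or Uptime data, and the formula is a simplification of the published methodology.

    # Simplified sketch of a work per energy calculation for CPU servers.
    # All figures are illustrative placeholders; real per-core work capacity
    # values come from The Green Grid's published PerfCPU tables.
    from dataclasses import dataclass

    @dataclass
    class ServerGroup:
        count: int              # number of identical servers
        cores: int              # physical cores per server
        perf_per_core: float    # work capacity per core (placeholder value)
        avg_utilization: float  # annual average utilization, 0 to 1
        annual_mwh: float       # measured or allocated annual energy, MWh

    inventory = [
        ServerGroup(count=200, cores=64, perf_per_core=1.8,
                    avg_utilization=0.35, annual_mwh=950.0),
        ServerGroup(count=120, cores=96, perf_per_core=2.4,
                    avg_utilization=0.50, annual_mwh=820.0),
    ]

    # Delivered work: installed work capacity scaled by average utilization.
    delivered_work = sum(g.count * g.cores * g.perf_per_core * g.avg_utilization
                         for g in inventory)
    total_mwh = sum(g.annual_mwh for g in inventory)

    print(f"Work per MWh: {delivered_work / total_mwh:.1f}")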

Challenges

There are undoubtedly some challenges with this metric. One is that Figure 1 shows three different figures for three different workloads, whereas data centers usually report a single PUE number. This complexity, however, is unavoidable when measuring very different workloads, especially if the figures are to give meaningful guidance on how to make efficiency improvements.

Under its EED reporting scheme, the EU has so far allowed for the inclusion of only one final figure each for server work capacity and storage capacity reporting. While a single figure works for storage, the different performance characteristics of standard CPU servers, AI inference and high-performance compute, and AI training servers make it necessary to report their capacities separately. Uptime argues that combining these three workloads into a single figure — essentially for at-a-glance public consumption — distorts and oversimplifies the report, risking the credibility of the entire effort. Whatever the EU decides, the problem is likely to be the same for any work per energy metric.

A second issue is that 60% of operators lack a complete component and location inventory of their IT infrastructure. Collecting the required information for the installed infrastructure, adjusting purchasing contracts to require inventory data reporting, and automating data collection for new equipment represent a considerable effort, especially at scale. By contrast, a PUE calculation requires as few as two meters.

However, most of the data collection — and even the calculations — can be automated once the appropriate databases and software are in place. While collecting the initial data and building the necessary systems may take several months, doing so will provide ongoing data to support efficiency improvements. In the case of this metric, data is already available from The Green Grid, and Uptime will support the process.

There are several reasons why, until now, no work per energy metric has been successful. Two are particularly noteworthy. First, IT and facilities organizations are often either entirely separate — as in colocation provider/tenant — or they do not generally collaborate or communicate, as is common in enterprise IT. Even when this is not the case, and the data on IT efficiency is available, chief information officers or marketing teams may prefer not to publicize serious inefficiencies. However, such objections will no longer hold sway if legal compliance is required.

The second issue is that industry technical experts have often let the perfect stand in the way of the good, raising concerns about data accuracy. For an effective work per energy metric, the work capacity metric needs to provide a representative, configuration-independent measure that tracks increased work capacity as physical core count increases and as new CPU and GPU generations are introduced.

The Green Grid and Uptime methodologies will no doubt be questioned or opposed by some, but they achieve the intended goal. The work capacity metric does not have to drill down to specific computational workloads or application types, as some industry technologists demand. The argument that there is no reasonable metric, or that it lacks critical support, is no longer grounds for procrastination. IT energy inefficiencies need to be surfaced and understood.

Further information

To access the Uptime report on server and storage capacity (and on work per unit of energy):

Calculating work capacity for server and storage products

To access The Green Grid reports on IT work capacity:

IT work capacity metric V1 — a methodology

Searchable PerfCPU tables by manufacturer

To access an Uptime webinar covering the metrics discussed in this report:

Calculating Data Center Work Capacity: The EED and Beyond

The post Is this the data center metric for the 2030s? appeared first on Uptime Institute Blog.

Crypto mines are turning into AI factories

The pursuit of training ever-larger generative AI models has necessitated the creation of a new class of specialized data centers — facilities that have more in common with high-performance computing (HPC) environments than traditional enterprise IT.

These data centers support very high rack densities (130 kW and above with current Nvidia rack-scale systems), direct-to-chip liquid cooling, and supersized power distribution components. This equipment is deployed at scale, in facilities that consume tens of megawatts. Delivering such dense infrastructure at this scale is not just technically complicated — it often requires doing things that have never been attempted before.

Some of these ultra-dense AI training data centers are being built by well-established cloud providers and their partners — wholesale colocation companies. However, the new class of facility has also attracted a different kind of data center developer: former cryptocurrency miners. Many of the organizations now involved in AI infrastructure — such as Applied Digital, Core Scientific, CoreWeave, Crusoe and IREN — originated as crypto mining ventures.

Some have transformed into neoclouds, leasing GPUs at competitive prices. Others operate as wholesale colocation providers, building specialized facilities for hyperscalers, neoclouds, or large AI model developers like OpenAI or Anthropic. Few of them operated traditional data centers before 2020. These operators represent a significant and recent addition to the data center industry — especially in the US.

A league of their own

Crypto mining facilities differ considerably from traditional data centers. Their primary objective is to house basic servers equipped with either GPUs or ASICs (application-specific integrated circuits), running at near 100% utilization around the clock to process calculations that yield cryptocurrency tokens. The penalties for outages are direct — fewer tokens mean lower profits — but the hardware is generally considered disposable. The business case is driven almost entirely by the cost of power, which accounts for almost all of the operating expenditure.

Many crypto mines do not use traditional server racks. Most lack redundancy in power distribution and cooling equipment, and they have no means of continuing operations in the event of a grid outage: no UPS, no batteries, no generators, no fuel. In some cases, mining equipment is located outdoors, shielded from the rain, but little else.

While crypto miners didn’t build traditional data center facilities, they did have two crucial assets: land zoned for industrial use and access to abundant, low-cost power.

Around 2020, some of the largest crypto mining operators began pivoting toward hosting hardware for AI workloads — a shift that became more pronounced following the launch of ChatGPT in late 2022. Table 1 shows how quickly some of these companies have scaled their AI/HPC operations.

Table 1 The transformation of crypto miners

Table: The transformation of crypto miners

To develop data center designs that can accommodate the extreme power and cooling requirements of cutting-edge AI hardware, these companies are turning to engineers and consultants with experience in hyperscale projects. The same applies to construction companies. The resulting facilities are built to industry standards and are concurrently maintainable.

There are three primary reasons why crypto miners were successful in capitalizing on the demand for high-density AI infrastructure:

  • These organizations were accustomed to moving quickly, having been born in an industry that had to respond to volatile cryptocurrency pricing, shifting regulations and fast-evolving mining hardware.
  • Many were already familiar with GPUs through their use in crypto mining — and some had begun renting them out for research or rendering workloads.
  • Their site selection was primarily driven by power availability and cost, rather than proximity to customers or network hubs.

Violence of action

Applied Digital, a publicly traded crypto mining operator based in North Dakota, presents an interesting case study. The state is one of the least developed data center markets in the US, with only a few dozen facilities in total.

Applied Digital’s campus in Ellendale was established to capitalize on cheap renewable power flowing between local wind farms and Chicago. In 2024, the company removed all mentions of cryptocurrency from its website — despite retaining sizable (100 MW-plus) mining operations. It then announced plans to build a 250 MW AI campus in Ellendale, codenamed Polaris Forge, to be leased by CoreWeave.

The operator expects the first 100 MW data center to be ready for service in late 2025. The facility will use direct liquid cooling and is designed to support 300 kW-plus rack densities. It is built to be concurrently maintainable, powered by two utility feeds, and will feature N+2 redundancy on most mechanical equipment. To ensure cooling delivery in the event of a power outage, the facility will be equipped with 360,000 gallons (1.36 million liters) of chilled water thermal storage. This will be Applied Digital’s first non-crypto facility.

The second building, with a capacity of 150 MW, is expected to be ready in the middle of 2026. It will deploy medium-voltage static UPS systems to improve power distribution efficiency and optimize site layout. The company has several more sites under development.

Impact on the sector

Do crypto miners have an edge in data center development? What they do have is existing access to power and a higher tolerance for technical and business risk — qualities that enable them to move faster than much of the traditional competition. This willingness to place bets matters in a market that is lacking solid fundamentals: in 2025, capital expenditure on AI infrastructure is outpacing revenue from AI-based products by orders of magnitude. The future of generative AI is still uncertain.

At present, this new category of data center operators appears to be focusing exclusively on the ultra-high-density end of the market and is not competing for traditional colocation customers. For now, they don’t need to either, as demand for AI training capacity alone keeps them busy. Still, their presence in the market introduces a new competitive threat to colocation providers that have opted to accommodate extreme densities in their recently built or upcoming facilities.

M&E and IT equipment suppliers have welcomed the new arrivals — not simply because they drive overall demand but because they are new buyers in a market increasingly dominated by a handful of technology behemoths. Some operators will be concerned about supply chain capacity, especially when it comes to large-scale projects: high-density campuses could deplete the stock of data center equipment such as large generators, UPS systems and transformers.

One of the challenges facing this new category of operators is the evolving nature of AI hardware. Nvidia, for example, intends to start shipping systems that consume more than 500 kW per compute rack by the end of 2027. It is not clear how many data centers being built today will be able to accommodate this level of density.


The Uptime Intelligence View

The simultaneous pivot by several businesses toward building much more complex facilities is peculiar, yet their arrival will not immediately affect most operators.

While this trend will create business opportunities for a broad swathe of design, consulting and engineering firms, it is also likely to have a negative impact on equipment supply chains, extending lead times for especially large-capacity units.

Much of this group’s future success hinges on the fortunes of generative AI in general — and of the largest and most compute-hungry models in particular — as a tool for business. However, the facilities they are building are legitimate data centers that will remain valuable even if the infrastructure needs of generative AI are being overestimated.

The post Crypto mines are turning into AI factories appeared first on Uptime Institute Blog.

Cybersecurity and the cost of human error

Cyber incidents are increasing rapidly. In 2024, the number of outages caused by cyber incidents was twice the average of the previous four years, according to Uptime Institute’s annual report on data center outages (see Annual outage analysis 2025). A growing number of operational technology (OT) vendors are experiencing significant increases in cyberattacks on their systems. Data center equipment vendor Honeywell analyzed hundreds of billions of system logs and 4,600 events in the first quarter of 2025, identifying 1,472 new ransomware extortion incidents — a 46% increase on the fourth quarter of 2024 (see Honeywell’s 2025 Cyber Threat Report). Beyond the initial impact, cyberattacks can have lasting consequences for a company’s reputation and balance sheet.

Cyberattacks increasingly exploit human error

Cyberattacks on data centers often exploit vulnerabilities — some stemming from simple, preventable errors, others from overlooked systemic issues. Human error, such as failing to follow procedures, can create vulnerabilities that attackers then exploit. For example, staff might forget regular system patches or delay firmware updates, leaving systems exposed. Companies, in turn, implement policies and procedures to ensure employees perform preventative actions consistently.

In many cases, data center operators may well be aware that elements of their IT and OT infrastructure have certain vulnerabilities. This may be due to policy noncompliance or the policy itself lacking appropriate protocols to defend against hackers. Often, employees lack training on how to recognize and respond to common social engineering techniques used by hackers. Tactics such as email phishing, impersonation and ransomware are increasingly targeting organizations with complex supply chain and third-party dependencies.

Cybersecurity incidents involving human error often follow similar patterns. Attacks may begin with some form of social engineering to obtain login credentials. Once inside, the attacker moves laterally through the system, exploiting small errors to cause systemic damage (see Table 1).

Table 1 Cyberattackers exploit human factors to induce human error

Table: Cyberattackers exploit human factors to induce human error

Failure to follow correct procedures

Although many companies have policies and procedures in place, employees can become complacent and fail to follow them. At times, they may unintentionally skip a step or carry it out incorrectly. For instance, workers might forget to install a software update or accidentally misconfigure a port or firewall — despite having technical training. Others may feel overwhelmed by the volume of updates and leave systems vulnerable as a result. In some cases, important details are simply overlooked, such as leaving a firewall port open or setting cloud storage to public access.

Procedures concerning password strength, password changes and inactive accounts are common vulnerabilities that hackers exploit. Inactive accounts that are not properly deactivated may miss critical security updates and are monitored less closely than active accounts, making it easier for security breaches to go unnoticed.

Unknowingly engaging with social engineering

Social engineering is a tactic used to deceive individuals into revealing sensitive information or downloading malicious software. It typically involves the attacker impersonating someone from the target’s company or organization to build trust. The primary goal is to steal login credentials or gain unauthorized access to the system.

Attackers may call employees while posing as someone from the IT help desk and, under the guise of “routine testing,” pressure them to disclose their login credentials.

Like phishing, spoofing is a tactic used to gain an employee’s trust by simulating familiar conditions, but it often relies on misleading visual cues. For example, social engineers may email a link to a fake version of the company’s login screen, prompting the unsuspecting employee to enter their login information as usual. In some rare cases, attackers might even use AI to impersonate an employee’s supervisor during a video call.

Deviation from policies or best practices

Adherence to policies and best practices often determines whether cybersecurity succeeds or fails. Procedures need to be written clearly and without ambiguity. For example, if a procedure does not explicitly require an employee to clear saved login data from their devices, hackers or rogue employees may be able to gain access to the device using default administrator credentials. Similarly, if regular password changes are not mandated, it may be easier for attackers to compromise system access credentials.

Policies must also account for the possibility of a disgruntled employee or third-party worker stealing or corrupting sensitive information for personal gain. To reduce this risk, companies can implement clear deprovisioning rules in their offboarding process, such as ensuring passwords are changed immediately upon an employee’s departure. While there is always a chance that a procedural step may be accidentally overlooked, comprehensive procedures increase the likelihood that each task is completed correctly.

Procedures are especially critical when employees have to work quickly to contain a cybersecurity incident. They should be clearly written, thoroughly tested for reliability, and easily accessible to serve as a reference during a variety of emergencies.

Poor security governance and oversight

A lack of governance or oversight from management can lead to overlooked risks and vulnerabilities, such as missed security patches or failure to monitor systems for threats or alerts. Training helps employees to approach situations with healthy skepticism, encouraging them to perform checks and balances consistent with the company’s policies.

Training should evolve to ensure that workers are informed about the latest threats and vulnerabilities, as well as how to recognize them.

Notable incidents exploiting human error

The types of human error described above are further complicated by the psychology of how individuals behave in intense situations. For example, mistakes may occur due to heightened stress, fatigue or coercion, all of which can lead to errors of judgment when a quick decision or action is required.

Table 2 identifies how human error may have played a part in eight major public cybersecurity breaches between 2023 and 2025. This includes three of the 10 most significant data center outages — United Healthcare, CDK Global and Ascension Healthcare — highlighted in Uptime Institute’s outages report (see Annual outage analysis 2025). We note the following trends:

  • At least five of the incidents involved social engineering. These attacks often exploited legitimate credentials or third-party vulnerabilities to gain access and execute malicious actions.
  • All incidents likely involved failures by employees to follow policies, procedures or properly manage common vulnerabilities.
  • Seven incidents exposed gaps in skills, training or experience to mitigate threats to the organization.
  • In half of the incidents, policies may have been poorly enforced or bypassed for unknown reasons.

Table 2 Impact of major cyber incidents involving human error

Table: Impact of major cyber incidents involving human error

Typically, organizations are reluctant to disclose detailed information about cyberattacks. However, regulators and government cybersecurity agencies are increasingly expecting more transparency — particularly when the attacks affect citizens and consumers — since attackers often leak information on public forums and the dark web.

The following findings are particularly concerning for data center operators and warrant serious attention:

  • The financial cost of cyber incidents is significant. Among the eight identified cyberattacks, the estimated total losses exceed $8 billion.
  • The full financial and reputational impact can take longer to play out. For example, UK retailer Marks & Spencer is facing lawsuits from customer groups over identity theft and fraud following a cyberattack. Similar actions may be taken by regulators or government agencies, particularly if breaches expose compliance failures with cybersecurity regulations, such as those in the Network and Information Security Directive 2 and the Digital Operational Resilience Act.

The Uptime Intelligence View

Human error is often viewed as a series of unrelated mistakes; however, the errors identified in this report stem from complex, interconnected systems and increasingly sophisticated attackers who exploit human psychology to manipulate events.

Understanding the role of human error in cybersecurity incidents is crucial to help employees recognize and prevent potential oversights. Training alone is unlikely to solve the problem. Data center operators should continuously adapt cyber practices and foster a culture that redefines how staff perceive and respond to the risk of cyber threats. This cultural shift is likely critical to staying ahead of evolving threat tactics.

John O’Brien, Senior Research Analyst, jobrien@uptimeinstitute.com
Rose Weinschenk, Analyst, rweinschenk@uptimeinstitute.com

The post Cybersecurity and the cost of human error appeared first on Uptime Institute Blog.

Cloud: when high availability hurts sustainability

In recent years, the environmental sustainability of IT has become a significant concern for investors and customers, as well as regulatory, legislative and environmental stakeholders. This concern is expected to intensify as the impact of climate change on health, safety and the global economy becomes more pronounced. It has given rise to an assortment of voluntary and mandatory initiatives, standards and requirements that collectively represent, but do not yet define, a basic framework for sustainable IT.

Cloud providers have come under increasing pressure from both the public and governments to reduce their carbon emissions. Their significant data center footprints consume considerable energy to deliver an ever-increasing range of cloud services to a growing customer base. The recent surge in generative AI has thrust the issues of power and carbon further into the spotlight.

Cloud providers have responded with large investments in renewable energy and energy attribute certificates (EACs), widespread use of carbon offsets and the construction of high-efficiency data centers. However, the effectiveness of these initiatives and their impact on carbon emissions vary significantly depending on the cloud provider. While all are promoting an eco-friendly narrative, unwrapping their stories and marketing campaigns to find meaningful facts and figures is challenging.

These efforts and initiatives have garnered considerable publicity. However, the impact of customer configurations on carbon emissions can be significant and is often overlooked. To build resiliency into cloud services, users face a range of options, each carrying its own carbon footprint. This report examines how resiliency affects carbon emissions.

Sustainability is the customer’s responsibility

The battle to reduce hyperscaler data center carbon emissions is being fought on two fronts. First, service providers are transitioning to lower-carbon energy sources. Second, cloud customers are being encouraged to optimize their resource usage, aided by data and reporting, to help lower carbon emissions.

Cloud provider responsibilities

Data centers consume significant power. To reduce their carbon impact, many cloud providers are investing in carbon offsets — projects with a negative carbon impact that can balance or negate a specified quantity of carbon emissions.

Renewable energy certificates (RECs) are tradable, non-tangible energy commodities. Each REC certifies that the holder has used or will use a quantity of electricity generated from a renewable source, thus avoiding the need for carbon emission offsets for that power use.

Cloud providers can use both offsets and RECs to claim their overall carbon emissions are zero. However, this does not equate to zero carbon production; instead, it means providers are balancing their emissions by accounting for a share of another organization’s carbon reductions.

Although cloud providers are making their own environmental changes, responsibility for sustainability is also being shared with users. Many providers now offer access to carbon emissions information via online portals and application programming interfaces (APIs), aiming to appear “green” by helping users measure, report and reduce carbon emissions.

Customer responsibilities

In public cloud, application performance and resiliency are primarily the responsibility of the user. While cloud providers offer services to their customers, they are not responsible for the efficiency or performance of the applications that customers build.

The cloud model lets customers consume services when they are needed. However, this flexibility and freedom can lead to overconsumption, increasing both costs and carbon emissions.

Tools and guidelines are available to help customers manage their cloud usage. Typical recommendations include resizing virtual machines to achieve higher utilization or turning off unused resources. However, these are only suggestions; it is the customers’ job to implement any changes.

Since cloud providers charge based on the resources used, helping customers to reduce their cloud usage is likely to also reduce their bills, which in the short term may impact provider revenue. However, cloud providers are willing to take this risk, betting that helping customers lower both carbon emissions and costs will increase overall revenue in the longer term.

Cloud customers are also encouraged to move workloads to regions with less carbon-intensive electricity supplies. This can often result in lower costs and lower carbon emissions — a win-win. However, it is up to the customer to implement these changes.

Cloud users face a challenging balancing act: they need to architect applications that are available, cost-effective and have a low carbon footprint. Even with the aid of tools, achieving this balance is far from easy.

Previous research

In previous reports comparing cost, carbon emissions and availability across architectures, Uptime Intelligence started by defining an unprotected baseline. This is an application situated in a single location and not protected from the loss of an availability zone (a data center) or a region (a collection of closely connected data centers). Other architectures were then designed to distribute resources across availability zones and regions so that the application could operate during outages. The costs of these new architectures were compared with the price of the baseline to assess how increased availability affects cost.

Table 1 provides an overview of these architectures. A full description can be found in Build resilient apps: do not rely solely on cloud infrastructure.

Table 1 Summary of application architecture characteristics

Table: Summary of application architecture characteristics

An availability percentage for 2024 was calculated for each architecture using historical status update information. In the cloud, applications are charged based on the resources consumed to deliver them. An application architected across multiple locations uses more resources than one deployed in a single location. In Cloud availability comes at a price, the cost of using each application was calculated.

Finally, in this report, Uptime Intelligence calculates the carbon emissions for each architecture and combines this with the availability and cost data.

Carbon versus cost versus downtime

Figure 1 combines availability, cost and carbon emissions into a single chart. The carbon quantities are based on the location-based Scope 2 emissions, which are associated with the electricity consumed by the data center. The availability of the architectures is represented by bubble sizes: inner rings indicate the average annual downtime across all regions in 2024, while the outer rings show the worst-case regional downtime. The axes display cost and carbon premiums, which reflect additional costs and carbon emissions relative to the unprotected baseline. The methodology for calculating carbon is included as an appendix at the end of this report.

Figure 1 Average and worst-case regional availabilities by carbon and cost

Diagram: Average and worst-case regional availabilities by carbon and cost
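
For clarity on how the axes in Figure 1 are derived, the snippet below shows the premium calculation in Python. The baseline and pilot light cost values are placeholders chosen only to illustrate the arithmetic; they are not figures from this analysis.

    # Sketch of how cost and carbon premiums are defined relative to the
    # unprotected baseline. The input values below are placeholders.
    def premium(architecture_value, baseline_value):
        """Fractional increase over the unprotected baseline (0.5 means +50%)."""
        return (architecture_value - baseline_value) / baseline_value

    baseline_monthly_cost = 100.0     # unprotected single-zone application
    pilot_light_monthly_cost = 150.0  # dual-region pilot light equivalent

    print(f"Cost premium: {premium(pilot_light_monthly_cost, baseline_monthly_cost):.0%}")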

Findings

Figure 1 shows that the cost premium is linearly proportional to the carbon premium — a rise in cost directly corresponds to an increase in carbon emissions, and vice versa. This proportionality makes sense: designing for resiliency uses more resources across multiple regions. Due to the cloud’s consumption-based pricing model, more resources equate to higher costs. And with more resources, more servers are working, which produces more carbon emissions.

However, higher costs and carbon emissions do not necessarily translate into better availability. As shown in Figure 1, the size of the bubbles does not always decrease with an increase in cost and carbon. Customers, therefore, do not have to pay the highest premiums in cash and carbon terms to obtain good availability. However, they should expect that resilient applications will require additional expenditure and produce more carbon emissions.

A good compromise is to architect the application across regions using a pilot light configuration. This design provides an average annual downtime of 2.6 hours, a similar level of availability to the equivalent dual region active-active configuration, but with roughly half the cost and carbon emissions.

Even if this architecture were deployed across the worst-performing regions, downtime would remain relatively low at 5.3 hours, which is still consistent with the more expensive resilient design.

However, although the cost and carbon premiums of the pilot light design are at the midpoint in our analysis, they are still high. Compared with an unprotected application, a dual region pilot light configuration produces double the carbon emissions and costs 50% more.

For those organizations looking to keep emissions and costs low, a dual zone active-failover provides an average downtime of 2.9 hours per year at a cost premium of 14% and a carbon premium of 38%. However, it is more susceptible to regional failures — in the worst-performing regions, downtime increases almost fourfold to 10.8 hours per year.

Conclusions

In all examined cases, increases in carbon are substantial. High availability inevitably comes with an increase in carbon emissions. Enterprises need to decide what compromises they are willing to make between low cost, low carbon and high availability.

These trade-offs should be evaluated during the design phase, before implementation. Ironically, most tools provided by cloud providers focus only on reporting and optimizing current resource usage rather than helping assess the impact of potential architectures.

AWS provides its Customer Carbon Footprint Tool, Google offers a Cloud Carbon Footprint capability, Microsoft delivers an Emissions Impact Dashboard for Azure, IBM has a Cloud Carbon Calculator, and Oracle Cloud has its OCI Sustainability Dashboard. These tools aid carbon reporting and may make recommendations to reduce carbon emissions. However, they do not suggest fundamental changes to the architecture design based on broader requirements such as cost and availability.

Considering the direct relationship between carbon emissions and cost, organizations can take some comfort in knowing that architectures built with an awareness of cost optimization are also likely to reduce emissions. In AWS’s Well-Architected framework for application development, the Cost Optimization pillar and the Sustainability pillar share similarities, such as turning off unused resources and sizing virtual machines correctly. Organizations should investigate whether their cost optimization efforts can also reduce carbon emissions.


The Uptime Intelligence View

The public cloud may initially appear to be a low-cost, low-carbon option. However, customers aiming for high availability should architect their applications across availability zones and regions. More resources running in more locations equates to higher costs (due to the cloud’s consumption-based pricing) and increased carbon emissions (due to the use of multiple physical resources). Ultimately, those developing cloud applications need to decide where their priorities lie regarding cost reduction, environmental credentials and user experience.

Appendix: methodology

The results presented in this report should not be considered prescriptive but hypothetical use cases. Readers should perform their own analyses before pursuing or avoiding any action.

Data is obtained from the Cloud Carbon Footprint (CCF) project, an open-source tool for analyzing carbon emissions. This initiative seeks to aid users in measuring and reducing the carbon emissions associated with their public cloud use.

The CCF project uses several sources, including the SPECpower database, to calculate power consumption for various cloud services hosted on AWS, Google and Microsoft Azure. SPECpower is a database of server power consumption measured at a range of utilization points. Power is converted to an estimate of carbon emissions using data from the European Environment Agency, the US Environmental Protection Agency and carbonfootprint.com.

Uptime Intelligence used the CCF’s carbon and power assumptions to estimate carbon emissions for several cloud architectures. We consider the CCF’s methodology and assumptions reasonable enough to compare carbon emissions based on cloud architecture. However, we cannot state that the CCF’s tools, methods and assumptions suit all purposes. That said, the project’s open-source and collaborative nature means it is more likely to be an unbiased and fair methodology than those offered by cloud providers.

The CCF’s methodology details are available on the project’s website and in the freely accessible source code. See cloudcarbonfootprint.org/docs/methodology.

For this research, Uptime Intelligence based its calculations on Amazon Web Services (AWS). Not only is AWS the market leader, but it also provides sufficiently detailed information to make an investigation possible. Other public cloud services have similar pricing models, services and architectural principles, so this report’s fundamental analysis will apply to other cloud providers. AWS costs are obtained from the company’s website and carbon emissions are obtained from the CCF project’s assumptions for AWS. We used an m5.large virtual machine in us-east-1 for our architecture.
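
The sketch below illustrates the general shape of a CCF-style, location-based Scope 2 estimate for a single virtual machine. Every number in it (idle and full-load wattage, vCPU share, PUE and grid emissions factor) is an illustrative assumption rather than a CCF coefficient or an AWS figure; consult the CCF documentation for the actual values and refinements.

    # Sketch of a CCF-style, location-based emissions estimate for one VM.
    # All inputs are illustrative assumptions, not CCF or AWS coefficients.
    HOURS_PER_YEAR = 8760

    def annual_kgco2e(min_watts, max_watts, avg_utilization, vcpu_share,
                      pue, grid_kgco2e_per_kwh, hours=HOURS_PER_YEAR):
        """Estimate annual emissions for a VM occupying a share of a host server."""
        # Linear interpolation between idle and full-load host power
        host_watts = min_watts + avg_utilization * (max_watts - min_watts)
        vm_kwh = host_watts * vcpu_share * hours / 1000.0
        return vm_kwh * pue * grid_kgco2e_per_kwh

    # Example: a 2-vCPU VM on a 48-thread host in a hypothetical region
    print(round(annual_kgco2e(min_watts=80, max_watts=350, avg_utilization=0.5,
                              vcpu_share=2 / 48, pue=1.2,
                              grid_kgco2e_per_kwh=0.38), 1), "kgCO2e per year")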

Table 2 shows the carbon emissions calculations based on these sources.

Table 2 Carbon emissions calculations

Table: Carbon emissions calculations

The following Uptime Institute expert was consulted for this report:
Jay Dietrich, Research Director of Sustainability, Uptime Institute

The post Cloud: when high availability hurts sustainability appeared first on Uptime Institute Blog.

Self-contained liquid cooling: the low-friction option

Each new generation of server silicon is pushing traditional data center air cooling closer to its operational limits. In 2025, the thermal design power (TDP) of top-bin CPUs reached 500 W, and server chip product roadmaps indicate further escalation in pursuit of higher performance. To handle these high-powered chips, more IT organizations are considering direct liquid cooling (DLC) for their servers. However, large-scale deployment of DLC with supporting facility water infrastructure can be costly and complex to operate, and is still hindered by a lack of standards (see DLC shows promise, but challenges persist).

In these circumstances, an alternative approach has emerged: air-cooled servers with internal DLC systems. Referred to by vendors as either air-assisted liquid cooling (AALC) or liquid-assisted air cooling (LAAC), these systems do not require coolant distribution units or facility water infrastructure for heat rejection. This means that they can be deployed in smaller, piecemeal installations.

Uptime Intelligence considers AALC a subset of DLC — defined by the use of coolant to remove heat from components within the IT chassis — that, in its broader forms, includes options spanning multiple servers. This report discusses designs whose coolant loop (typically water in commercially available products) fits entirely within a single server chassis.

Such systems enable IT system engineers and operators to cool top-bin processor silicon in dense form factors — such as 1U rack-mount servers or blades — without relying on extreme-performance heat sinks or elaborate airflow designs. Given enough total air cooling capacity, self-contained AALC requires no disruptive changes to the data hall or new maintenance tasks for facility personnel.

Deploying these systems in existing space will not expand cooling capacity the way full DLC installations with supporting infrastructure can. However, selecting individual 1U or 2U servers with AALC can either reduce IT fan power consumption or enable operators to support roughly 20% greater TDP than they otherwise could — with minimal operational overhead. According to the server makers offering this type of cooling solution, such as Dell and HPE, the premium for self-contained AALC can pay for itself in as little as two years when used to improve power efficiency.

Does simplicity matter?

Many of today’s commercial cold plate and immersion cooling systems originated and matured in high-performance computing facilities for research and academic institutions. However, another group has been experimenting with liquid cooling for more than a decade: video game enthusiasts. Some have equipped their PCs with self-contained AALC systems to improve CPU and GPU performance, as well as reduce fan noise. More recently, to manage the rising heat output of modern server CPUs, IT vendors have started to offer similar systems.

The engineering is simple: fluid tubing connects one or more cold plates to a radiator and pump. The pumps circulate warmed coolant from the cold plates through the radiator, while server fans draw cooling air through the chassis and across the radiator (see Figure 1). Because water is a more efficient heat transfer medium than air, it can remove heat from the processor at a greater rate — even at a lower case temperature.

Figure 1 Closed-loop liquid cooling within the server

Diagram: Closed-loop liquid cooling within the server

The coolant used in commercially shipping products is usually PG25, a mixture of 75% water and 25% propylene glycol. This formulation has been widely adopted in both DLC and facility water systems for decades, so its chemistry and material compatibility are well understood.

As with larger DLC systems, alternative cooling approaches can use a phase change to remove IT heat. Some designs use commercial two-phase dielectric coolants, and an experimental alternative uses a sealed system containing a small volume of pure water under partial vacuum. This lowers the boiling point of water, effectively turning it into a two-phase coolant.

Self-contained AALC designs with multiple cold plates usually have redundant pumps — one on each cold plate in the same loop — and can continue operating if one pump fails. Because AALC systems for a single server chassis contain a smaller volume of coolant than larger liquid cooling systems, any leak is less likely to spill into systems below. Cold plates are typically equipped with leak detection sensors.

Closed-loop liquid cooling is best applied in 1U servers, where space constraints prevent the use of sufficiently large heat sinks. In internal testing by HPE, the pumps and fans of an AALC system in a 1U server consumed around 40% less power than the server fans in an air-cooled equivalent. This may amount to as much as a 5% to 8% reduction in total server power consumption under full load. The benefits of switching to AALC are smaller for 2U servers, which can mount larger heat sinks and use bigger, more efficient fan motors.
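
To see how such savings translate into the roughly two-year payback cited by vendors, the sketch below runs a simple payback calculation. The server power, savings fraction, electricity price, PUE and AALC premium are all illustrative assumptions, not vendor or Uptime figures.

    # Rough payback sketch for the AALC price premium, using illustrative
    # assumptions only (not vendor or Uptime figures).
    HOURS_PER_YEAR = 8760

    def payback_years(server_watts, savings_fraction, usd_per_kwh,
                      premium_usd, pue=1.4):
        """Years for fan-power savings (scaled by PUE) to repay the AALC premium."""
        saved_kwh = server_watts * savings_fraction * HOURS_PER_YEAR / 1000.0
        annual_saving_usd = saved_kwh * pue * usd_per_kwh
        return premium_usd / annual_saving_usd

    # Example: a 1 kW 1U server saving 7% of its power at $0.12/kWh,
    # with a $220 premium for the self-contained AALC option
    print(round(payback_years(1000, 0.07, 0.12, 220), 1), "years")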

However, radiator size, airflow limitations and temperature-sensitive components mean that self-contained AALC is not on par with larger DLC systems, making it more suitable as a transitional measure. Additionally, these systems are not currently available for GPU servers.

Advantages of AALC within the server:

  • Offers higher cooling capacity (up to 20% more) than air cooling in the same form factor and for the same energy input, with more even heat distribution and faster thermal response than heat sinks.
  • Requires no changes to white space or gray space.
  • Components are widely available.
  • Can operate without maintenance for the lifetime of the server, with low risk of failure.
  • Does not require space outside the rack, unlike “sidecars” or rear-mounted radiators.

Drawbacks of AALC within the server:

  • Closed-loop server cooling systems use several complex components that cost more than a heat sink.
  • Offers less IT cooling capacity than other liquid cooling approaches: systems available outside of high-performance computing and AI-specific deployments will typically support up to 1.2 kW of load per 1U server.
  • Self-contained systems generally consume more server fan power (a parasitic component of IT energy consumption) than larger DLC systems.
  • No control of coolant loop temperatures; control of flow rate through pumps may be available in some designs.
  • Radiator and pumps limit space savings within the server chassis.

Outlook

For some organizations, AALC offers the opportunity to maximize the value of existing investments in air cooling infrastructure. For others, it may serve as a measured step on the path toward DLC adoption.

This form of cooling is likely to be especially valuable for operators of legacy facilities that have sufficient air cooling infrastructure to support some high-powered servers but would otherwise suffer from hot spots. Selecting AALC over air cooling may also reduce server fan power enough to allow operators to squeeze another server into a rack.

Much of AALC’s appeal is its potential for efficient use of fan power and its compatibility with existing facility cooling capabilities. Expanding beyond this to increase a facility’s cooling capacity is a different matter, requiring larger, more expensive DLC systems supported by additional heat transport and rejection equipment. In comparison, server-sized AALC systems represent a much smaller cost increase over heat sinks.

Future technical development may address some of AALC’s limitations, although progress and funding will largely depend on the commercial interest in servers with self-contained AALC. In conversations with Uptime Intelligence, IT vendors have diverging views of the role of self-contained AALC in their server portfolios, suggesting that the market’s direction remains uncertain. Nonetheless, there is some interesting investment in the field. For example, Belgian startup Calyos has developed passive closed-loop cooling systems that operate without pumps, instead moving coolant via capillary action. The company is working on a rack-scale prototype that could eventually see deployment in data centers.


The Uptime Intelligence View

AALC within the server may only deliver a fraction of the improvements associated with DLC, but it does so at a fraction of the cost and with minimal disruption to the facility. For many, the benefits may seem negligible. However, for a small group of air-cooled facilities, AALC can deliver either cooling capacity benefits or energy savings.

The post Self-contained liquid cooling: the low-friction option appeared first on Uptime Institute Blog.

The two sides of a sustainability strategy

While much has been written, said and taught about data center sustainability, there is still limited consensus on the definition and scope of an ideal data center sustainability strategy. This lack of clarity has created much confusion, encouraged many operators to pursue strategies with limited results, and enabled some to make claims that are ultimately of little worth.

To date, the data center industry has adopted three broad, complementary approaches to sustainability:

  • Facility and IT sustainability. This approach prioritizes operational efficiency, minimizing the energy, direct carbon and water footprints of IT and facility infrastructure. It directly addresses the operational impacts of individual facilities, reducing material and energy use and costs. Maximizing the sustainability of individual facilities is key to addressing the increased government focus on regulating individual data centers.
  • Ecosystem sustainability. This strategy focuses on carbon neutrality (or carbon negativity), water positivity and nature positivity across the enterprise. Ecosystem sustainability offsets the environmental impacts of an enterprise’s operations, which may increase business costs.
  • Overall sustainability. While some data center operators promote the sustainability of their facilities with limited efforts on ecosystem sustainability, others build their brand around ecosystem sustainability with minimal discussion about the sustainability of their facilities. Although it is common for organizations to make efforts in both areas, it is less common for the strategies to be integrated as a part of a coherent plan.

Each approach has its own benefits and challenges, providing different levels of business and environmental performance improvement. This report is an extension and update to the Sustainability Series of reports, published by Uptime Intelligence in 2022 (see below for a list of the reports), which detailed the seven elements of a sustainability strategy.

Data center sustainability

Data center sustainability involves incorporating sustainability and efficiency considerations into siting, design and operational processes throughout a facility’s life. The organizations responsible for siting and design, IT operations, facility operations, procurement, contracting (colocation and cloud operators) and waste management must embrace the enterprise’s overall sustainability strategy and incorporate it into their daily operations.

Achieving sustainability objectives may require a more costly initial investment for an individual facility, but the reward is likely an overall lower cost of ownership over its life. To implement a sustainability strategy effectively, an operator must address the full range of sustainability elements:

  • Siting and design. Customer and business needs dictate a data center’s location. Typically, multiple sites will satisfy these criteria; however, the location should also be selected based on whether it can help optimize the facility’s sustainability performance. Operators should focus on maximizing free cooling and carbon-free energy consumption while minimizing energy and water consumption. The design should choose equipment and materials that maximize the facility’s environmental performance.
  • Cooling system. The design should minimize water and energy use, including capturing available free-cooling hours. In water-scarce or water-stressed regions, operators should deploy waterless cooling systems. Where feasible and economically viable, heat reuse systems should also be incorporated into the design.
  • Standby power system. The standby power system design should enable fuel flexibility (the ability to use low-carbon or carbon-free fuels) and should be both able and permitted to deliver primary power for extended periods. This enables the system to support grid reliability and help address the intermittency of the wind and solar generation contracted to supply power to the data center, thereby reducing the carbon intensity of its electricity consumption.
  • IT infrastructure efficiency. IT equipment should be selected to maximize the average work delivered per watt of installed capacity. The installed equipment should run at or close to the highest practical utilization level of the installed workloads while meeting their reliability and resiliency requirements. IT workload placement and management software should be used to monitor and optimize the IT infrastructure performance.
  • Carbon-free energy consumption. Operators should work with electricity utilities, energy retailers, energy developers and regulators to maximize the quantity of clean energy consumed and minimize location-based emissions. Over time, they should plan to increase carbon-free energy consumption to 90% or more of total consumption. Timelines will vary by region depending on the economics and availability of carbon-free energy.
  • End-of-life equipment reuse and materials recovery. Operators need an end-of-life equipment management process that maximizes the reuse of equipment and components, both within the organization and through refurbishment and use by others. Where equipment must be scrapped, valuable metals, minerals and energy should be recovered through environmentally responsible processes.
  • Scope 3 emissions management. Operators should require key suppliers to maintain a sustainability strategy, publicly disclose their greenhouse gas (GHG) emissions inventory and reduction goals, and demonstrate progress toward their sustainability objectives. There should be consequences in place for suppliers that fail to show reasonable progress.

While these strategies may appear simple, creating and executing a sustainability strategy requires the commitment of the whole organization — from technicians and engineers to procurement, finance and executive leadership. In some cases, financial criteria may need to shift from the initial upfront costs to the total cost of ownership and the revenue benefits gained from a demonstrably sustainable operation. A data center sustainability strategy can enhance both business and environmental performance.

Ecosystem sustainability

An ecosystem sustainability strategy emphasizes mitigating and offsetting the environmental impacts of an operator’s data center portfolio. While these efforts do not change the environmental operating profile of individual data centers, they are designed to benefit the surrounding community and natural environment. Such projects and environmental offsets are typically managed at the enterprise level rather than the facility level and represent a cost to the enterprise.

  • Carbon-neutral or carbon-negative operations. Operators should purchase energy attribute certificates (EACs) and carbon capture offsets to reduce or eliminate their Scope 1, 2 and 3 emissions inventory. The offsets are generated primarily from facilities geographically separate from the data center facilities. EACs and offsets can be purchased directly from brokers or from operators of carbon-free energy or carbon capture systems.
  • Water-positive operations. Operators should work with communities and conservation groups to implement water recharge and conservation projects that return more water to the ecosystem than is used across their data centers. Examples include wetlands reclamation, water replenishment, support of sustainable agriculture, and leak detection and minimization systems for water distribution networks. These projects can benefit the local watershed or unrelated, geographically distinct watersheds.
  • Nature-positive facilities. The data center or campus should be landscaped to regenerate and integrate with the natural landscape and local ecosystem. Rainwater and stormwater should be naturally filtered and reused where practical. The landscape should be designed and managed to support local flora and fauna, so that the campus integrates seamlessly into the local ecosystem. The intent is to make the facility as “invisible” as possible to the local community.
  • Emissions reductions achieved with IT tools. Some operators and data center industry groups quantify and promote the emissions reduction benefits (known as Scope 4 “avoided emissions”) generated from the operation of the IT infrastructure. They assert that the “avoided emissions” achieved through the application of IT systems to increase the operational efficiency of systems or processes, or “dematerialize” products, can offset some or all of the data center infrastructure’s emissions footprint. However, these claims should be approached with caution, as there is a high degree of uncertainty in the calculated quantities of “avoided emissions.”
  • Proactive work with supply chains. Some operators work directly with supply chain partners to decarbonize their operations. This approach is practical when an enterprise represents a significant percentage of a supplier’s revenue, but becomes impractical when an operator’s purchases represent only a small percentage of the supplier’s business.
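
To make the accounting behind carbon-neutral or carbon-negative claims in the first item above more concrete, the short Python sketch below nets a hypothetical Scope 1, 2 and 3 inventory against purchased EACs and carbon capture offsets. All figures, variable names and the simple netting logic are assumptions made for illustration; actual GHG accounting follows the GHG Protocol’s market-based rules, which are considerably more detailed.

    # Illustrative sketch only: invented figures and simplified netting logic.
    inventory_tco2e = {
        "scope_1": 2_000,                  # on-site fuel combustion, refrigerants, etc.
        "scope_2_location_based": 45_000,  # grid electricity at average emission factor
        "scope_3": 30_000,                 # supply chain and other indirect emissions
    }

    eac_covered_mwh = 80_000               # consumption matched with purchased EACs
    grid_emission_factor = 0.4             # assumed tCO2e per MWh of grid electricity
    carbon_capture_offsets_tco2e = 10_000  # purchased carbon capture offsets

    # Market-based Scope 2: consumption matched with EACs is treated as zero-emission.
    scope_2_market_based = max(
        inventory_tco2e["scope_2_location_based"] - eac_covered_mwh * grid_emission_factor,
        0,
    )

    gross_tco2e = inventory_tco2e["scope_1"] + scope_2_market_based + inventory_tco2e["scope_3"]
    net_tco2e = max(gross_tco2e - carbon_capture_offsets_tco2e, 0)

    print(f"Market-based Scope 2: {scope_2_market_based:,.0f} tCO2e")
    print(f"Net emissions after offsets: {net_tco2e:,.0f} tCO2e")

Because the EACs and offsets in this kind of calculation typically originate far from the data center itself, a net-zero figure says little about the operating efficiency of any individual facility.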

Ecosystem sustainability seeks to deliver environmental performance improvements to operations and ecosystems outside the operator’s direct control. These improvements compensate for and offset any remaining environmental impacts following the full execution of the data center sustainability strategy. They typically represent a business cost and enhance an operator’s commercial reputation and brand.

Where to focus

Facility and IT sustainability strategies and ecosystem sustainability strategies are complementary, together addressing the full range of sustainability activities and opportunities. In most organizations, it will be necessary to cover all of these areas, often with different teams focusing on their respective domains.

An operator’s primary focus should be improving the operational efficiency and sustainability performance of its data centers. Investments in the increased use of free cooling, automated control of chiller and IT space cooling systems, and IT consolidation projects can yield significant energy, water and cost savings, along with reductions in GHG emissions. These improvements will not only reduce the environmental footprint of the data center but can also improve its business performance.

These efforts also enable operators to proactively address emerging regulatory and standards frameworks. Such regulations are intended to increase the reporting of operating data and metrics and may ultimately dictate minimum performance standards for data centers.

To reduce the Scope 2 emissions (purchased electricity) associated with data center operations to zero, operators need to work with utilities, energy retailers, and the electricity transmission and distribution system operators. The shared goal is to help build a resilient, interconnected electricity grid populated by carbon-free electricity generation and storage systems, a prerequisite for meeting government net-zero mandates.

Addressing ecosystem sustainability opportunities is a valuable next step in an operator’s sustainability journey. Ecosystem projects can enhance the natural environment surrounding the data facility, improve the availability of carbon-free energy and water resources locally and globally, and directly support, inform and incentivize the sustainability efforts of customers and suppliers.

Data center sustainability should be approached in two distinct ways: first, the infrastructure itself and, second, the ecosystem. Confusion and overlap between these two aspects can lead to misleading conclusions. For example, in many cases, a net-zero and water-positive data center program is (wrongly) accepted as an indication that an enterprise is operating a sustainable data center infrastructure.


The Uptime Intelligence View

Operators should prioritize IT and facilities sustainability over ecosystem sustainability. The execution and results of an IT and facilities sustainability strategy directly minimize the environmental footprint of a data center portfolio, while maximizing its business and sustainability performance.

Data reporting and minimum performance standards embodied in enacted or proposed regulations are focused on the operation of individual data centers, not aggregated enterprise-level sustainability performance. Operators must demonstrate that they have a highly utilized IT infrastructure (maximizing work delivered per unit of energy consumed) and that they have minimized the energy and water consumption and GHG emissions associated with their facility operations.

Pursuing an ecosystem sustainability strategy is the logical next step for operators that want to do more and further enhance their sustainability credentials. However, an ecosystem sustainability strategy should not be pursued at the expense of an IT and facilities strategy, nor used to mask poor or marginal facility and IT systems performance.

The following Uptime Institute expert was consulted for this report:
Jay Paidipati, Vice President Sustainability Program Management, Uptime Institute

Other related reports published by Uptime Institute include:
Creating a sustainability strategy
IT Efficiency: the critical core of sustainability
Three key elements: water, circularity and siting
Navigating regulations and standards
Tackling greenhouse gases
Reducing the energy footprint

The post The two sides of a sustainability strategy appeared first on Uptime Institute Blog.

Reliance Industries Consolidates 16 Step-Down Subsidiaries into Reliance New Energy to Streamline Clean Energy Operations – EQ

In Short: Reliance Industries has merged 16 step-down subsidiaries into Reliance New Energy, reinforcing its strategic focus on clean energy and new-age technologies. The consolidation aims to simplify the corporate structure, improve operational efficiency, optimise resource deployment, and strengthen execution across renewable energy, energy storage, and green technology initiatives within the Reliance ecosystem.

In Detail: Reliance Industries has approved the merger of 16 step-down subsidiaries into Reliance New Energy, marking a significant organisational move to strengthen its clean energy and sustainability-focused businesses. The consolidation reflects the company’s intent to build a more agile and integrated structure to support its long-term energy transition strategy.

The step-down subsidiaries being merged were engaged in various activities linked to renewable energy, energy storage, advanced materials, and emerging clean technologies. Bringing these entities under a single umbrella is expected to enhance coordination, reduce administrative complexity, and enable faster decision-making across projects and investments.

Reliance New Energy has been positioned as the group’s primary vehicle for driving growth in the clean energy domain. By consolidating multiple subsidiaries into this entity, Reliance aims to create a unified platform that can efficiently manage large-scale investments, technology development, and project execution in a rapidly evolving sector.

Operational efficiency is a key driver behind the merger. A streamlined corporate structure allows for better capital allocation, reduced compliance burden, and improved utilisation of shared resources such as talent, infrastructure, and intellectual property. This is particularly important in capital-intensive segments like renewable energy and advanced manufacturing.

The consolidation is also expected to strengthen governance and financial transparency. With fewer entities and clearer reporting lines, Reliance New Energy can present a more cohesive financial and operational profile, which supports long-term planning and enhances confidence among investors and stakeholders.

From a strategic perspective, the merger aligns with Reliance Industries’ broader vision of becoming a global leader in clean energy and decarbonisation solutions. The company has committed significant investments toward renewable power, battery storage, green hydrogen, and related technologies as part of its transition roadmap.

The integrated structure is likely to accelerate project execution timelines by improving coordination across development, engineering, procurement, and deployment activities. Faster execution is critical as competition intensifies and demand for clean energy solutions continues to grow in India and globally.

For employees and partners, the merger is expected to create clearer roles, unified processes, and better alignment with the group’s clean energy objectives. A consolidated organisation can also attract specialised talent and foster innovation by bringing diverse capabilities together under a single leadership framework.

Overall, the merger of 16 step-down subsidiaries into Reliance New Energy represents a strategic step toward building scale, efficiency, and focus in Reliance Industries’ clean energy journey. By simplifying its structure and strengthening execution capabilities, the company is positioning itself to play a central role in shaping India’s future energy landscape.
