

AI and cooling: toward more automation

AI is increasingly steering the data center industry toward new operational practices, where automation, analytics and adaptive control are paving the way for “dark” — or lights-out, unstaffed — facilities. Cooling systems, in particular, are leading this shift. Yet despite AI’s positive track record in facility operations, one persistent challenge remains: trust.

In some ways, AI faces a similar challenge to that of commercial aviation several decades ago. Even after airlines had significantly improved reliability and safety performance, making air travel not only faster but also safer than other forms of transportation, it still took time for public perceptions to shift.

That same tension between capability and confidence lies at the heart of the next evolution in data center cooling controls. As the various AI models in use improve in performance and become better understood, more transparent and explainable, the question is no longer whether AI can manage operations autonomously, but whether the industry is ready to trust it enough to turn off the lights.

AI’s place in cooling controls

Thermal management systems, such as CRAHs, CRACs and airflow management, represent the front line of AI deployment in cooling optimization. Their modular nature enables the incremental adoption of AI controls, providing immediate visibility and measurable efficiency gains in day-to-day operations.

AI can now be applied across four core cooling functions:

  • Dynamic setpoint management. Continuously recalibrates temperature, humidity and fan speeds to match load conditions.
  • Thermal load forecasting. Predicts shifts in demand and makes adjustments in advance to prevent overcooling or instability.
  • Airflow distribution and containment. Uses machine learning to balance hot and cold aisles and stage CRAH/CRAC operations efficiently.
  • Fault detection, predictive and prescriptive diagnostics. Identifies coil fouling, fan oscillation, or valve hunting before they degrade performance.
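
As a simple illustration of the last function above, the sketch below flags possible valve hunting by counting direction reversals in a valve-position signal over a short window. The thresholds and signal names are hypothetical; commercial analytics use far richer models and telemetry.

```python
# Minimal sketch of rule-based fault detection for valve hunting (illustrative only).
# A real deployment would use vendor analytics and richer telemetry; thresholds are assumed.

def detect_valve_hunting(valve_positions_pct: list[float],
                         min_swing_pct: float = 5.0, max_reversals: int = 6) -> bool:
    """Flag hunting if the valve position reverses direction too often with large swings."""
    reversals, last_direction = 0, 0
    for prev, curr in zip(valve_positions_pct, valve_positions_pct[1:]):
        delta = curr - prev
        if abs(delta) < min_swing_pct:
            continue                          # ignore small, normal adjustments
        direction = 1 if delta > 0 else -1
        if last_direction and direction != last_direction:
            reversals += 1                    # count a change of direction
        last_direction = direction
    return reversals >= max_reversals

# A valve oscillating widely every sample is flagged; a steady ramp is not.
print(detect_valve_hunting([20, 60, 25, 65, 22, 70, 24, 68, 21, 66]))   # True
print(detect_valve_hunting([20, 25, 30, 35, 40, 45, 50, 55, 60, 65]))   # False
```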

A growing ecosystem of vendors is advancing AI-driven cooling optimization across both air- and water-side applications. Companies such as Vigilent, Siemens, Schneider Electric, Phaidra and Etalytics offer machine learning platforms that integrate with existing building management systems (BMS) or data center infrastructure management (DCIM) systems to enhance thermal management and efficiency.

Siemens’ White Space Cooling Optimization (WSCO) platform applies AI to match CRAH operation with IT load and thermal conditions, while Schneider Electric, through its Motivair acquisition, has expanded into liquid cooling and AI-ready thermal systems for high-density environments. In parallel, hyperscale operators, such as Google and Microsoft, have built proprietary AI engines to fine-tune chiller and CRAH performance in real time. These solutions range from supervisory logic to adaptive, closed-loop control. However, all share a common aim: improve efficiency without compromising compliance with service level agreements (SLAs) or operator oversight.

The scope of AI adoption

While IT cooling optimization has become the most visible frontier, conversations with AI control vendors reveal that most mature deployments still begin at the facility water loop rather than in the computer room. Vendors often start with the mechanical plant and facility water system because these areas involve a limited, well-defined set of variables (temperature differentials, flow rates and pressure setpoints) and can be treated as closed, well-bounded systems.

This makes the water loop a safer proving ground for training and validating algorithms before extending them to computer room air cooling systems, where thermal dynamics are more complex and influenced by containment design, workload variability and external conditions.

Predictive versus prescriptive: the maturity divide

AI in cooling is evolving along a maturity spectrum — from predictive insight to prescriptive guidance and, increasingly, to autonomous control. Table 1 summarizes the functional and operational distinctions among these three stages of AI maturity in data center cooling.

Table 1 Predictive, prescriptive, and autonomous AI in data center cooling


Most deployments today stop at the predictive stage, where AI enhances situational awareness but leaves action to the operator. Achieving full prescriptive control will require not only deeper technical sophistication but also a shift in mindset.

Technically, it is more difficult to engineer because the system must not only forecast outcomes but also choose and execute safe corrective actions within operational limits. Operationally, it is harder to trust because it challenges long-held norms about accountability and human oversight.

The divide, therefore, is not only technical but also cultural. The shift from informed supervision to algorithmic control is redefining the boundary between automation and authority.

AI’s value and its risks

No matter how advanced the technology becomes, cooling exists for one reason: maintaining environmental stability and meeting SLAs. AI-enhanced monitoring and control systems support operating staff by:

  • Predicting and preventing temperature excursions before they affect uptime.
  • Detecting system degradation early and enabling timely corrective action.
  • Optimizing energy performance under varying load profiles without violating SLA thresholds.

Yet efficiency gains mean little without confidence in system reliability. It is also important to clarify that AI in data center cooling is not a single technology. Control-oriented machine learning models, such as those used to optimize CRAHs, CRACs and chiller plants, operate within physical limits and rely on deterministic sensor data. These differ fundamentally from language-based AI models such as GPT, where “hallucinations” refer to fabricated or contextually inaccurate responses.

At the Uptime Network Americas Fall Conference 2025, several operators raised concerns about AI hallucinations — instances where optimization models generate inaccurate or confusing recommendations from event logs. In control systems, such errors often arise from model drift, sensor faults, or incomplete training data, not from the reasoning failures seen in language-based AI. When a model’s understanding of system behavior falls out of sync with reality, it can misinterpret anomalies as trends, eroding operator confidence faster than it delivers efficiency gains.

The discomfort is not purely technical; it is also human. Many data center operators remain uneasy about letting AI take the controls entirely, even as they acknowledge its potential. In AI’s ascent toward autonomy, trust remains the runway still under construction.

Critically, modern AI control frameworks are being designed with built-in safety, transparency and human oversight. For example, Vigilent, a provider of AI-based optimization controls for data center cooling, reports that its optimizing control switches to “guard mode” whenever it is unable to maintain the data center environment within tolerances. Guard mode brings on additional cooling capacity (at the expense of power consumption) to restore SLA-compliant conditions; typical triggers include rapid drift or temperature hot spots. A manual override option also enables the operator to take control, supported by monitoring and event logs.

This layered logic provides operational resiliency by enabling systems to fail safely: guard mode ensures stability, manual override guarantees operator authority, and explainability, via decision-tree logic, keeps every AI action transparent. Even in dark-mode operation, alarms and reasoning remain accessible to operators.
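
This layered behavior can be pictured as a thin supervisory wrapper around the optimizer. The sketch below is a hypothetical illustration of such logic (not Vigilent’s implementation or any vendor’s actual code): a manual override always wins, guard mode engages full cooling when the environment drifts out of tolerance, and every decision carries a human-readable reason.

```python
# Hypothetical sketch of layered fail-safe control logic (not any vendor's actual code).

def select_action(ai_recommendation: dict, hottest_inlet_c: float,
                  sla_limit_c: float = 27.0, manual_override: bool = False) -> dict:
    """Choose the cooling action for this control cycle and record the reasoning."""
    if manual_override:
        # Operator authority always wins: hand control back to the operator/BMS.
        return {"mode": "manual", "reason": "operator override active"}
    if hottest_inlet_c > sla_limit_c:
        # Guard mode: bring on additional cooling capacity at the expense of power.
        return {"mode": "guard", "fan_speed_pct": 100, "extra_units_on": True,
                "reason": f"inlet {hottest_inlet_c:.1f}C exceeds limit {sla_limit_c:.1f}C"}
    # Normal operation: apply the optimizer's recommendation, keeping it auditable.
    return {"mode": "optimized", **ai_recommendation, "reason": "environment within tolerance"}

print(select_action({"fan_speed_pct": 55}, hottest_inlet_c=28.2))
```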

These frameworks directly address one of the primary fears among data center operators: losing visibility into what the system is doing.

Outlook

Gradually, the concept of a dark data center, one operated remotely with minimal on-site staff, has shifted from an interesting theory to a desirable strategy. In recent years, many infrastructure operators have increased their use of automation and remote-management tools to enhance resiliency and operational flexibility, while also mitigating the effects of low staffing levels. Cooling systems, particularly those governed by AI-assisted control, are now central to this operational transformation.

Operational autonomy does not mean abandoning human control; it means achieving reliable operation without the need for constant supervision. Ultimately, a dark data center is not about turning off the lights; it is about turning on trust.


The Uptime Intelligence View

AI in thermal management has evolved from an experimental concept into an essential tool, improving efficiency and reliability across data centers. The next step — coordinating facility water, air and IT cooling liquid systems — will define the evolution toward greater operational autonomy. However, the transition to “dark” operation will be as much cultural as it is technical. As explainability, fail-safe modes and manual overrides build operator confidence, AI will gradually shift from being a copilot to autopilot. The technology is advancing rapidly; the question is how quickly operators will adopt it.



AI’s growth calls for useful IT efficiency metrics

The digital infrastructure industry is under pressure to measure and improve the energy efficiency of the computing work that underpins digital services. Enterprises seek to maximize returns on cost outlay and operating expenses for IT hardware, and regulators and local communities need reassurance that the energy devoted to data centers is used efficiently. These objectives call for a productivity metric to measure the amount of work that IT hardware performs per unit of energy.

With generative AI projected to boost data center power demand substantially, the stakes have arguably never been higher. Fortunately, organizations monitoring the performance and efficiency of their AI applications can benefit from experiences in the field of supercomputing.

In September 2025, Uptime Intelligence participated in a panel discussion about AI energy efficiency at the Yotta 2025 conference in Las Vegas (Nevada, US). The panelists drew on their extensive experience in supercomputing to weigh in on discussions around AI training efficiency. They discussed the need for a productivity metric to measure it, as well as a key caveat organizations need to consider.

Organizations such as Uptime Intelligence and The Green Grid have published guidance on calculating work capacity for various types of IT. Software applications and their supporting IT hardware vary significantly, so consensus on a single metric to compare energy performance remains out of reach for the foreseeable future. However, tracking energy performance in a given facility over time is important, and is achievable practically for many organizations today.

Defining AI computing work

The work capacity of IT equipment is needed to calculate its utilization and energy performance when running an application. The Green Grid white paper IT work capacity metric V1 — a methodology provides a methodology for calculating a work capacity value for CPU-based servers. Uptime Intelligence has proposed methodologies to extend this to accelerator-based servers for AI and other applications (see Calculating work capacity for server and storage products).

Floating point operations per second (FLOPS) is a common and readily available unit of work capacity for CPU- or accelerator-based servers. In 2025, an AI server’s capacity is typically measured in trillions of FLOPS, or teraFLOPS (TFLOPS).

Not all FLOPS are the same

Even though large-scale AI training is radically reshaping many commercial data centers, the underlying software and hardware are not fundamentally new. AI training is essentially one of many applications of supercomputing. Supercomputing software, along with the IT selection and configuration, varies in many ways — and one of the most relevant variables when monitoring energy performance is floating point precision. This precision (measured in bits) is analogous to the number of decimal places used in inputs and outputs.

GPUs and other accelerators can perform 64-, 32-, 16-, 8- and 4-bit calculations, and some can use mixed precision. While a high-performance computing (HPC) workload such as computational fluid dynamics might use 64-bit (“double precision”) floating point calculations for high accuracy, other applications do not have such exacting requirements. Lower precision consumes less memory per calculation — and, crucially, less energy. The panel discussion at Yotta raised an important distinction: unlike most engineering and research applications, today’s AI training and inference calculations typically use 4-bit precision.

Floating point precision is necessary information when evaluating a TFLOPS benchmark. For a given device, a 64-bit TFLOPS value is typically half the 32-bit TFLOPS value — and one-sixteenth of the 4-bit TFLOPS value. For consistent AI work capacity calculation, Uptime Institute recommends that IT operators use 32-bit TFLOPS values supplied by their AI server providers.
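
Assuming throughput doubles each time precision is halved (an approximation that not all hardware follows exactly), a vendor-quoted TFLOPS figure can be normalized to a 32-bit basis as sketched below.

```python
# Normalize a vendor TFLOPS figure to a 32-bit basis, assuming throughput doubles
# each time precision is halved (an approximation; real hardware varies).

def tflops_at_32bit(tflops: float, precision_bits: int) -> float:
    if precision_bits not in (64, 32, 16, 8, 4):
        raise ValueError("unsupported precision")
    # Relative to 32-bit: 64-bit is half the throughput, 16-bit double, 8-bit 4x, 4-bit 8x.
    return tflops * (precision_bits / 32.0)

print(tflops_at_32bit(1000.0, 4))   # a 1,000 TFLOPS 4-bit rating ~ 125 TFLOPS at 32-bit
print(tflops_at_32bit(50.0, 64))    # a 50 TFLOPS 64-bit rating ~ 100 TFLOPS at 32-bit
```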

Working it out: work per energy

The maximum work capacity calculation for a server can be aggregated at the level of a rack, a cluster or a data center. Work capacity multiplied by average utilization (as a percentage) produces an estimate of the amount of calculation work (in TFLOPS) that was performed over a given period. Operators can divide this figure by the energy consumption (in MWh) over that same time to yield an estimate of the work’s energy efficiency, in TFLOPS/MWh. Separate calculations for CPU-based servers, accelerator-based servers, and other IT (e.g., storage) will provide a more accurate assessment of energy performance (see Figure 1).

Figure 1 Examples of IT equipment work-per-energy calculations

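A minimal sketch of the calculation behind Figure 1, using invented capacity, utilization and energy figures; the structure of the arithmetic, not the numbers, is the point.

```python
# Illustrative work-per-energy calculation (all figures invented).
# work = capacity (32-bit TFLOPS) x average utilization; efficiency = work / energy (MWh)

fleet = [
    # (equipment class, 32-bit TFLOPS capacity, average utilization, energy in MWh)
    ("CPU servers",          5_000.0, 0.35, 400.0),
    ("Accelerator servers", 80_000.0, 0.60, 900.0),
]

for name, capacity_tflops, utilization, energy_mwh in fleet:
    work = capacity_tflops * utilization      # estimated work performed
    print(f"{name}: {work / energy_mwh:,.1f} TFLOPS/MWh")
```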

Even when TFLOPS figures are normalized to the same precision, it is difficult to use this information to draw meaningful comparisons between the energy performance of significantly different hardware types and configurations. Accelerator power consumption does not scale linearly with utilization levels. Additionally, the details of software design will determine how closely real-world application performance aligns with simplified work capacity benchmarks.

However, many organizations can benefit from calculating this TFLOPS/MWh productivity metric and are already well equipped to do so. This calculation is most useful to quantify efficiency gains over time, e.g., from IT refresh and consolidation, or refinements to operational control. In some jurisdictions, tracking FLOPS/MWh as a productivity metric can satisfy some regulatory requirements. IT efficiency is often overlooked in favor of facility efficiency — but a consistent productivity metric can help to quantify available improvements.


The Uptime Intelligence View

Generative AI training is poised to drive up data center energy consumption, prompting calls for regulation, responsible resource use and return on investment. A productivity metric can help meet these objectives by consistently quantifying the amount of computing work performed per unit of energy. Supercomputing experts agree that operators should track and use this data, but they caution against interpreting it without the necessary context. A simplified, practical work-per-energy metric is most useful for tracking improvement in one facility over time.

The following participants took part in the panel discussion on energy efficiency at Yotta 2025:

  • Jacqueline Davis, Research Analyst at Uptime Institute (moderator)
  • Dr Peter de Bock, former Program Director, Advanced Research Projects Agency–Energy
  • Dr Alfonso Ortega, Professor of Energy Technology, Villanova University
  • Dr Jon Summers, Research Lead in Data Centers, Research Institutes of Sweden

Other related reports published by Uptime Institute include:

Calculating work capacity for server and storage products

The following Uptime Institute experts were consulted for this report:

Jay Dietrich, Research Director of Sustainability, Uptime Institute


AI power fluctuations strain both budgets and hardware

AI training at scale introduces power consumption patterns that can strain both server hardware and supporting power systems, shortening equipment lifespans and increasing the total cost of ownership (TCO) for operators.

These workloads can cause GPU power draw to spike briefly, even for only a few milliseconds, pushing them past their nominal thermal design power (TDP) or against their absolute power limits. Over time, this thermal stress can degrade GPUs and their onboard power delivery components.

Even when average power draw stays within hardware specifications, thermal stress can affect voltage regulators, solder joints and capacitors. This kind of wear is often difficult to detect and may only become apparent after a failure. As a result, hidden hardware degradation can ultimately affect TCO — especially in data centers that are not purpose-built for AI compute.

Strain on supporting infrastructure

AI training power swings can also push server power supply units (PSUs) and connectors beyond their design limits. PSUs may be forced to absorb rapid current fluctuations, straining their internal capacitors and increasing heat generation. In some cases, power swings can trip overcurrent protection circuits, causing unexpected reboots or shutdowns. Certain power connectors, such as the standard 12VHPWR cables used for GPUs, are also vulnerable. High contact resistance can cause localized heating, further compounding the wear and tear effects.

When AI workloads involve many GPUs operating in synchronization, power swing effects multiply. In some cases, simultaneous power spikes across multiple servers may exceed the rated capacity of row-level UPS modules — especially if they were sized following legacy capacity allocation practices. Under such conditions, AI compute clusters can sometimes reach 150% of their steady-state maximum power levels.

In extreme cases, load fluctuations of large AI clusters can exceed a UPS system’s capability to source and condition power, forcing it to use its stored energy. This happens when the UPS is overloaded and unable to meet demand using only its internal capacitance. Repeated substantial overloads will put stress on internal components as well as the energy storage subsystem. For batteries, particularly lead-acid cells, this can shorten their service life. In worst-case scenarios, these fluctuations may cause voltage sags or other power quality issues (see Electrical considerations with large AI compute).
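
A rough, entirely hypothetical sizing check shows why synchronized spikes matter: a row UPS sized with a conventional margin over steady-state draw can still be exceeded by a 150% synchronized excursion, even though the average load looks comfortable.

```python
# Hypothetical check of a row UPS against synchronized GPU power spikes (all values assumed).

servers_per_row = 32
steady_state_kw_per_server = 8.0      # assumed steady-state draw per GPU server
spike_factor = 1.5                    # up to 150% of steady-state, per the text
ups_rating_kw = 320.0                 # row UPS sized with ~25% headroom over steady state

steady_row_kw = servers_per_row * steady_state_kw_per_server   # 256 kW
peak_row_kw = steady_row_kw * spike_factor                     # 384 kW

print(f"steady state: {steady_row_kw:.0f} kW, synchronized peak: {peak_row_kw:.0f} kW")
print("peak exceeds UPS rating" if peak_row_kw > ups_rating_kw else "peak within UPS rating")
```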

Capacity planning challenges

Accounting for the effects of power swings from AI training workloads during the design phase is challenging. Many circuits and power systems are sized based on the average demand of a large and diverse population of IT loads, rather than their theoretical combined peak. In the case of large AI clusters, this approach can lead to a false sense of security in capacity planning.

When peak amplitudes are underestimated, branch circuits can overheat, breakers may trip, and long-term damage can occur to conductors and insulation — particularly in legacy environments that lack the headroom to adapt. Compounding this challenge, typical monitoring tools track GPU power every 100 milliseconds or more — too slow to detect the microsecond-speed spikes that can accelerate the wear on hardware through current inrush.
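
The monitoring gap can be illustrated with synthetic numbers: averaging power over 100 millisecond windows smooths away the short excursions that stress hardware. The sketch below builds a toy 1 millisecond power trace and compares its true peak with what a 100 millisecond averaging monitor would report.

```python
# Synthetic illustration: millisecond-scale spikes versus a 100 ms averaging monitor.

import random

random.seed(1)
baseline_kw, spike_kw = 6.0, 10.5       # hypothetical per-server draw
trace_ms = [spike_kw if random.random() < 0.02 else baseline_kw for _ in range(1000)]

true_peak = max(trace_ms)
averaged = [sum(trace_ms[i:i + 100]) / 100 for i in range(0, 1000, 100)]   # 100 ms windows

print(f"true 1 ms peak: {true_peak:.1f} kW")
print(f"peak seen by 100 ms averaging: {max(averaged):.1f} kW")
```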

Estimating peak power behavior depends on several factors, including the AI model, training dataset, GPU architecture and workload synchronization. Two training runs on identical hardware can produce vastly different power profiles. This uncertainty significantly complicates capacity planning, leading to under-provisioned resources and increased operational risks.

Facility designs for large-scale AI infrastructure need to account for the impact of dynamic power swings. Operators of dedicated training clusters may overprovision UPS capacity, use rapid-response PSUs, or set absolute power and rate-of-change limits on GPU servers using software tools (e.g., Nvidia-SMI). While these approaches can help reduce the risk of power-related failures, they also increase capital and operational costs and can reduce efficiency under typical load conditions.
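
As one concrete example of a software-based limit, the sketch below shells out to nvidia-smi to apply a GPU power cap from a management host. The 500 W figure is purely illustrative, the command normally requires administrative privileges, and supported limits and rate-of-change controls vary by GPU generation.

```python
# Illustrative: apply a GPU power cap via nvidia-smi from a management script.
# The 500 W value is hypothetical; nvidia-smi -q -d POWER reports the enforceable
# range for a given GPU. Setting the limit typically requires admin privileges.

import subprocess

def cap_gpu_power(watts: int) -> None:
    subprocess.run(["nvidia-smi", "-pl", str(watts)], check=True)

if __name__ == "__main__":
    cap_gpu_power(500)
```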

Many smaller operators — including colocation tenants and enterprises exploring AI — are likely testing or adopting AI training on general-purpose infrastructure. Nearly three in 10 operators already perform AI training, and of those that do not, nearly half expect to begin in the near future, according to results from the Uptime Institute AI Infrastructure Survey 2025 (see Figure 1).

Figure 1 Three in 10 operators currently perform AI training


Many smaller data center environments may lack the workload diversity (non-AI loads) needed to absorb power swings, or the specialized engineering to manage dynamic power consumption behavior. As a result, these operators face a greater risk of failure events, hardware damage, shortened component lifespans and reduced UPS reliability — all of which contribute to higher TCO.

Several low-cost strategies can help mitigate risk. These include oversizing branch circuits — ideally dedicating them to GPU servers — distributing GPUs across racks and data halls to prevent localized hotspots, and setting power caps on GPUs to trade some peak performance for longer hardware lifespan.

For operators considering or already experimenting with AI training, TDP alone is an insufficient design benchmark for capacity planning. Infrastructure needs to account for rapid power transients, workload-specific consumption patterns, and the complex interplay between IT hardware and facility power systems. This is particularly crucial when using shared or legacy systems, where the cost of misjudging these dynamics can quickly outweigh the perceived benefits of performing AI training in-house.


The Uptime Intelligence View

For data centers not specifically designed to support AI training workloads, GPU power swings can quietly accelerate hardware degradation and increase costs. Peak power consumption of these workloads is often difficult to predict, and signs of component wear may remain hidden until failures occur. Larger operators with dedicated AI infrastructure are more likely to address these power dynamics during the design phase, while smaller operators — or those using general-purpose infrastructure — may have fewer options.

To mitigate risk, these operators can consider overprovisioning rack-level UPS capacity for GPU servers, oversizing branch circuits (and dedicating them to GPU loads where possible), distributing heat from GPU servers across racks and rooms to avoid localized hotspots, and applying software-based power caps. Data center operators should also factor in more frequent hardware replacements during financial planning to more accurately reflect the actual cost of running AI training workloads.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute
Daniel Bizo, Senior Research Director, Uptime Institute Intelligence
Max Smolaks, Research Analyst, Uptime Institute Intelligence

Other related reports published by Uptime Institute include:
Electrical considerations with large AI compute


Retail vs wholesale: finding the right colo pricing model

Colocation providers may offer two pricing and packaging models to sell similar products and capabilities. In both models, customers purchase space, power and services. However, the method of purchase differs.

In a retail model, customers purchase a small quantity of space and power, usually by the rack or a fraction of a rack. The colocation provider standardizes contracts, pricing and capabilities — the cost and complexity of delivering to a customer’s precise requirements are not justified, considering the relatively small contract value.

In a wholesale model, customers purchase a significantly larger quantity of space and power, typically at least a dedicated, enclosed suite of white space. Due to the size of these contracts, colocation providers need to be flexible in meeting customer needs, even potentially building new facilities to accommodate their requirements. The colocation provider negotiates price and terms, and customers often prefer to pay for actual power consumption rather than be billed on maximum capacity. A metered model allows the customer to scale power usage in response to changing demands.

A colocation provider may focus on a particular market by offering only a retail or wholesale model, or the provider may offer both to broaden its appeal. The terms “wholesale” and “retail” colocation more accurately describe the pricing and packaging models used by colocation providers rather than the type of customer.

Table 1 Key differences between retail and wholesale colocation providers


Retail colocation deals typically have higher gross margins in percentage terms, but the volume of sales is lower. Most colocation providers would rather sell wholesale contracts because they offer higher revenues through larger volumes of sales, despite having lower gross margins. Because wholesale customers are the better prospects, retail customers are more likely to experience cost rises at renewal than wholesale customers.

Retail colocation pricing model

Retail terms are designed to be simple and predictable. Customers are typically charged a fixed fee based on the maximum power capacity supplied to equipment and the space used. This fee covers both the repayment of fixed costs, and the variable costs associated with IT power and cooling. The fixed fee bundles all these elements together, so customers have no visibility into these individual components — but they benefit from predictable pricing.

In retail colocation, the facilities are already available, so capital costs are recovered across all retail customers through standard pricing. If a customer exceeds their allotted maximum power capacity, they risk triggering a breaker and potentially powering down their IT equipment. Some colocation providers monitor for overages and warn customers that they need to increase their capacity before an outage occurs.

Customers are likely to purchase more power capacity than they need to prevent these outages. As a result, some colocation providers may deliberately oversubscribe power consumption to reduce their power costs and increase their profit margins. There are operational and reputational risks if oversubscription causes service degradation or outages.

Some colocation providers also meter power, charging a fee based on IT usage, which factors in the repayment of capital, IT and cooling costs, as well as a profit margin. Those with metering enabled may charge customers for usage exceeding maximum capacity, typically at a higher rate.

Can a colocation provider increase prices during a contract term? Occasionally, but only as a last resort — such as if power costs increase significantly. This possibility will be stipulated in the contract as an emergency or force majeure measure.

Usually, an internet connection is included. However, data transfer over that connection may be metered or bundled into a fixed cost package. Customers have the option to purchase cross-connects linking their infrastructure to third-party communications providers, including on-ramps to cloud providers.

Wholesale colocation pricing model

Wholesale colocation pricing is designed to offer customers the flexibility to utilize their capacity as they choose. Because terms are customized, pricing models will vary from customer to customer.

Some customers may prefer to pay for a fixed capacity of total power, regardless of whether the power is used or not. In this model, both IT power and cooling costs are factored into the price.

Other customers may prefer a more granular approach, with multiple charging components:

  • A fixed fee per unit of space/rack, based on maximum power capacity, is designed to cover the colocation provider’s fixed costs while including a profit margin.
  • Variable IT power costs are passed directly from the electricity supplier to the customer, metered in kilowatts (kW). Customers bear the full cost of price fluctuations, which can change rapidly depending on grid conditions.
  • To account for variable cooling costs, power costs may be calculated by multiplying actual power usage by an agreed design PUE to create an “additional power” fee. This figure may also be multiplied by a “utilization factor” to reflect cases where a customer is using only a small fraction of the data hall (and therefore impacting overall efficiency).
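
A simplified, hypothetical illustration of the granular model above: a fixed per-rack fee, metered IT energy passed through at the supplier rate, and an additional power fee here interpreted as the cooling overhead (metered energy multiplied by design PUE minus one, then by a utilization factor). All rates and factors are invented and will differ by contract.

```python
# Hypothetical monthly wholesale bill under a granular pricing model (all values invented).

fixed_fee_per_rack = 1_200.0     # covers the provider's fixed costs plus margin
racks = 20
it_energy_kwh = 450_000.0        # metered IT consumption for the month
supplier_rate_per_kwh = 0.11     # passed through from the electricity supplier
design_pue = 1.3                 # agreed design PUE
utilization_factor = 0.9         # reflects partial use of the data hall

fixed = fixed_fee_per_rack * racks
it_power = it_energy_kwh * supplier_rate_per_kwh
# One reading of the "additional power" fee: the cooling overhead implied by the design PUE.
additional_power = it_energy_kwh * (design_pue - 1) * supplier_rate_per_kwh * utilization_factor

print(f"fixed: ${fixed:,.0f}, IT power: ${it_power:,.0f}, additional power: ${additional_power:,.0f}")
print(f"total: ${fixed + it_power + additional_power:,.0f}")
```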

Some customers may prefer a blended model of both a fixed element for baseline capacity and a variable charge for consumption above the baseline. Redundant feeds are also likely to impact cost. If new data halls need to be constructed, these costs may be passed on to the customers directly, or some capital may be recovered through a higher fixed rack fee.

Alternatively, for long-term deployments, customers may opt for either a “build-to-suit” or “powered shell” arrangement. In a build-to-suit model, the colocation provider designs and constructs the facility — including power, cooling and layout — to the customer’s exact specifications. The space is then leased back to the customer, typically under a long-term agreement exceeding a decade.

In a powered shell setup, the provider delivers a completed exterior building with core infrastructure, such as utility power and network access. The customer is then responsible for outfitting the interior (racks, cooling, electrical systems) to suit their operational needs.

Most customers using wholesale colocation providers will need to implement cross-connects to third-party connectivity and network providers hosted in meet-me rooms. They may also need to arrange the construction of new capacity into the facility with the colocation provider and suppliers.

Hyperscalers are an excellent prospect for wholesale colocation, given their significant scale. However, their limited numbers and strong market power enable them to negotiate lower margins from colocation providers.

Table 2 Pricing models used in retail and wholesale colocation


In a retail colocation engagement, the customer has limited negotiating power — with little scale, they generally have minimal flexibility on pricing, terms and customization. In a wholesale engagement, the opposite is true, and the arrangement favors the customer. Colocation providers want the scale and sales volume, so are willing to cut prices and accommodate additional requirements. They are also willing to offer flexible pricing in response to customers’ rapidly changing requirements.


The Uptime Intelligence View

Hyperscalers have the strongest market power to dictate contracts and prices. With so few players, it is unlikely that multiple hyperscalers will bid for the same space and push prices up. However, colocation providers still want their business, because of the volume it brings. They would prefer to reduce gross margins to ensure a win, rather than risk losing a customer with such unmatched scale.


Electrical considerations with large AI compute

The training of large generative AI models is a special case of high-performance computing (HPC) workloads. This is not simply due to the reliance on GPUs — numerous engineering and scientific research computations already use GPUs as standard. Neither is it about the power density or the liquid cooling of AI hardware, as large HPC systems are already extremely dense and use liquid cooling. Instead, what makes AI compute special is its runtime behavior: when training transformer-based models, large compute clusters can create step load-related power quality issues for power distribution systems in data center facilities. A previous Intelligence report offers an overview of the underlying hardware-software mechanisms.

The scale of the power fluctuations makes this phenomenon unusual and problematic. The vast number of generic servers found in most data centers collectively produce a relatively steady electrical load — even if individual servers experience sudden changes in power usage, they are discordant. In contrast, the power use of compute nodes in AI training clusters moves in near unison.

Even compared with most other HPC clusters, AI training clusters exhibit larger power swings. This is due to an interplay between transformer-based neural network architectures and compute hardware, which creates frequent spikes and falls (every second or two) in power demand. These fluctuations correspond to the computational steps in the training processes, exacerbated by an aggressive pursuit of peak performance typical in modern silicon.

Powerful fluctuations

The scope of the resulting step changes in power will depend on the size and configuration of the compute cluster, as well as operational factors such as AI server performance and power management settings. Uptime Intelligence estimates that in worst-case scenarios, the difference between the low and high points of power draw during training program execution can exceed 100% on a system level (the load doubles almost instantaneously, within milliseconds) for some configurations.

These extremes occur every few seconds, whenever a batch of weights and biases is loaded on GPUs and the training begins. This is often accompanied by a massive spike in current, produced by power excursion events as GPUs overshoot their thermal design power rating (TDP) to opportunistically exploit any extra thermal and power delivery budget following a phase of lower transistor activity. In short, power spikes are made possible by intermittent lulls.

This behavior is common in modern compute silicon, including in personal devices and generic servers. Still, it is only with large AI compute clusters that these fluctuations across dozens or hundreds of servers move almost synchronously.

Even in moderately sized clusters with just a few dozen racks, this can result in sudden, millisecond-speed changes in AC power — ranging from several hundred kilowatts to even a few megawatts. If there are no other substantial loads present in the electrical mix to dampen these fluctuations, these step changes may stress capacity components in the power distribution systems. They may also cause power quality issues such as voltage sags and swells, or significant harmonics and sub-synchronous oscillations that distort the sinusoidal waveforms in AC power systems.

Based on several discussions with and disclosures by major electrical equipment manufacturers — including ABB, Eaton, Schneider Electric, Siemens and Vertiv — there is a general consensus that modern power distribution equipment is expected to be able to handle AI power fluctuations, as long as they remain within the rated load.

IT system capacity redefined

The issue of AI step loads appears to center on equipment capacity and the need to avoid frequent overloads. Standard capacity planning practices often start with the nameplate power of installed IT hardware, then derate it to estimate the expected actual power. This adjustment can reduce the total nameplate power by 25% to 50% across all IT loads when accounting for the diversity of workloads — since they do not act in unison — and also for the fact that most software rarely pushes the IT hardware close to its rated power.

In comparison, AI training systems can show extreme behavior. Larger AI compute clusters can draw something akin to an inrush current (a rapid change in current, often denoted by high di/dt) that exceeds the IT system’s sustained maximum power rating.

Normally, overloads would not pose a problem for modern power distribution. All electrical components and systems have specified overload ratings to handle transient events (e.g., current surges during the startup of IT hardware or other equipment) and are designed and tested accordingly. However, if power distribution components are sized closely to the rated capacity of the AI compute load, these transient overloads could happen millions of times per year in the worst cases — components are not tested for regularly repeated overloads. Over time, this can lead to electromechanical stress, thermal stress and gradual overheating (heat-up is faster than cool-off) — potentially resulting in component failure.
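
A quick arithmetic check supports the order of magnitude: with an excursion roughly every two seconds, as described earlier, a cluster training near-continuously would see the count run into the millions per year.

```python
# Rough count of transient excursions for a cluster training near-continuously.
seconds_per_year = 365 * 24 * 3600
spike_period_s = 2                     # roughly one excursion every two seconds
print(f"{seconds_per_year // spike_period_s:,} potential excursions per year")   # ~15.8 million
```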

This brings the definition of capacity to the forefront of AI compute step loads. Establishing the repeated peak power of a single GPU-server node is already a non-trivial effort — it requires running a variety of computationally intensive codes and setting up a high-precision power monitor. However, how a specific compute cluster spanning several racks and potentially hundreds or even thousands of GPUs will behave during a training run is difficult to ascertain ahead of deployment.

The expected power profile also depends on server configurations, such as power supply redundancy level, cooling mode and GPU generations. For example, in a typical AI system from the 2022-2024 generation, power fluctuations can reach up to 4 kW per 8-GPU server node, or 16 kW per rack when populated with four nodes, according to Uptime estimates. Even so, the likelihood of exceeding the rack power rating of around 41 kW is relatively low. Any overshoot is likely to be minor, as these systems are mostly air-cooled hardware designed to meet ASHRAE Class A2 specifications — allowed to operate in environments up to 35°C (95°F). In practice, most facilities supply much cooler air, making system fans cycle less intensely.

However, with recently launched systems, the issue is further exacerbated as GPUs account for a larger share of the power budget, not only because they use more power (in excess of 1 kW per GPU module) but also because these systems are more likely to use direct liquid cooling (DLC). Liquid cooling reduces system fan power, thereby reducing the stable load of server power. It also has better thermal performance, which helps the silicon to accumulate extra thermal budget for power excursions.

IT hardware specifications and information shared with Uptime by power equipment vendors indicate that in the worst cases, load swings can reach 150%, with a potential for overshoots exceeding 10% above the system’s power specification. In the case of the rack-scale systems based on Nvidia’s GB200 NVL72 architecture, sudden power climbs from around 60 kW and 70 kW to more than 150 kW per rack can occur.

This compares to a maximum power specification of 132 kW, which means that, under worst-case assumptions, repeated overloads can amount to as much as 20% in instantaneous power, Uptime estimates. This warrants extra care regarding circuit sizing (including breakers, tap-off units and placements, busways and other conductors) to avoid overheating and related reliability issues.

Figure 1 shows the power pattern of a GPU-based compute cluster running a transformer-based model training workload. Based on hardware specifications and real-world power data disclosed to Uptime Intelligence, we algorithmically mimicked the behavior of a compute cluster comprising four Nvidia GB200 NVL72 racks and four non-compute racks. It demonstrates the expected power fluctuations during training runs on such clusters and underscores the need to rethink capacity planning compared with traditional, generic IT loads. Even though the average power stays below the power rating of the cluster, peak fluctuations can exceed it. While this estimates a relatively small cluster with 288 GPUs, a larger cluster would exhibit similar behavior at the megawatt scale.

Figure 1 Power profile of a GPU-based training cluster (algorithmic not real-world data)

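In the same spirit as Figure 1 (though not the model used to produce it), the sketch below generates a synthetic power trace for a small cluster: four compute racks swinging between a low phase and a synchronized peak every couple of seconds, plus a steady ancillary load. Every parameter is an assumption for illustration.

```python
# Synthetic power profile of a small AI training cluster (illustrative only; not the
# model behind Figure 1). Four compute racks alternate between a low phase and a
# synchronized peak every ~2 seconds; ancillary racks draw a steady load.

import random

random.seed(7)
compute_racks, rack_low_kw, rack_peak_kw = 4, 70.0, 150.0
ancillary_kw = 40.0                  # storage, network and CDU load (assumed)
period_s, duty = 2.0, 0.4            # spike period and fraction of each period at peak

profile_kw = []
for step in range(200):              # 20 seconds at 100 ms resolution
    t = step * 0.1
    at_peak = (t % period_s) < duty * period_s
    per_rack = rack_peak_kw if at_peak else rack_low_kw
    jitter = random.uniform(-2.0, 2.0)   # racks are not perfectly synchronous
    profile_kw.append(compute_racks * (per_rack + jitter) + ancillary_kw)

print(f"average: {sum(profile_kw) / len(profile_kw):.0f} kW, peak: {max(profile_kw):.0f} kW")
```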

In electrical terms, no multi-rack workload is perfectly synchronous, while the presence of other loads will help smooth out the edges of fluctuations further. When including non-compute ancillary loads in the cluster — such as storage systems, networks and CDUs (which also require UPS power) — a lower safety margin above the nominal rating (e.g., 10% to 15%) appears sufficient to cover any regular peaks over the nominal system power specifications, even with the latest AI hardware.

Current mitigation options

There are several factors that data center operators may want to consider when deploying compute clusters dedicated to training large, transformer-based AI models. Currently, data center operators have a limited toolkit for fully handling large power fluctuations in a power distribution system, particularly when it comes to preventing them from being passed on to the power source in full. However, in collaboration with the IT infrastructure team/tenant, it should be possible to minimize fluctuations:

  • Mix with diverse IT loads, share generators. The best first option is to integrate AI training compute with other, diverse IT loads in a shared power infrastructure. This helps to diminish the effects of power fluctuations, particularly on generator sets. For dedicated AI training data center infrastructure installations, this may not be an option for power distribution. However, sharing engine generators will go a long way to dampen the effects of AI power fluctuations.
    Among power equipment, engine generator sets will be the most stressed if exposed to the full extent of the fluctuations seen in a large, dedicated AI training infrastructure. Even if correctly sized for the peak load, generators may struggle with large and fast fluctuations — for example, the total facility load stepping from 45% to 50% of design capacity to 80% to 85% within a second, then dropping back to 45% to 50% after two seconds, on repeat. Such fluctuation cycles may be close to what the engines can handle, at the expense of reduced expected life or outright failure.
  • Select UPS configurations to minimize power quality issues, overload. Even if a smaller frame can handle the fluctuations, according to the vendors, larger systems will carry more capacitance to help absorb the worst of the fluctuations, maintaining voltage and frequency within performance specifications. An additional measure is to use a higher capacity redundancy configuration, for example, by opting for N+2. This allows for UPS maintenance while avoiding any repeated overloads on the operational UPS systems, some of which might hit the battery energy storage system.
  • Use server performance/power management tools. Power and performance management of hardware remain largely underused, despite their ability to not only improve IT power efficiency but also contribute to the overall performance of the data center infrastructure. Even though AI compute clusters feature some exotic interconnect subsystems, they are essentially standard servers using standard hardware and software. This means there are a variety of levers to manage the peaks in their power and performance levels, such as power capping, turning off boost clocks, limiting performance states, or even setting lower temperature limits.
    To address the low end of fluctuations, switching off server energy-saving modes — such as silicon sleep states (known as C-states in CPU parlance) — can help raise the IT hardware’s power floor. A more advanced technique involves limiting the rate of power change (including on the way down). This feature, called “power smoothing”, is available through Nvidia’s System Management Interface on the latest generation of Blackwell GPUs.

Electrical equipment manufacturers are investigating the merits of additional rapid discharge/recharge energy storage and updated controls to UPS units with the aim of shielding the power source from fluctuations. These approaches include super capacitors, advanced battery chemistries or even flywheels that can tolerate frequent, short duration but high-powered discharge and recharge cycles. Next-generation AI compute systems may also include more capacitance and energy storage to limit fluctuations on the data center power system. Ultimately, it is often best to address an issue at its root (in this case the IT hardware and software) rather than treat the symptoms, although these may lie outside the control of data center facilities teams.


The Uptime Intelligence View

Most of the time, data center operators do not need to be overly concerned with the power profile of the IT hardware or the specifics of the associated workloads — rack density estimates were typically overblown to begin with, and overall capacity utilization tends to stay well below 100%. Even so, safety margins, which are expensive, could be thin. However, training large transformer models is different. The specialized compute hardware can be extremely dense, creates large power swings, and is capable of producing frequent power surges that are close to or even above its hardware power rating. This will force data center operators to reconsider their approach to both capacity planning and safety margins across their infrastructure.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute


Is this the data center metric for the 2030s?

When the PUE metric was first proposed and adopted at a Green Grid meeting in California in 2008, few could have forecast how important this simple ratio — despite its limitations — would become.

Few would have expected, too, that the industry would make so little progress on another metric proposed at those same early Green Grid meetings. While PUE highlighted the energy efficiency of the non-IT portion of a data center’s energy use, a separate “useful work” metric was intended to identify how much IT work was being done relative to the total facility and IT energy consumed. A list of proposals was put forward, votes were taken, but none of the ideas came near to being adopted.

Sixteen years later, minimal progress has been made. While some methods for measuring “work per energy” have been proposed, none have garnered any significant support or momentum. Efforts to measure inefficiencies in IT energy use — by far the largest source of both energy consumption and waste in a data center — have repeatedly stalled or failed to gain support.

That is set to change soon. The European Union and key member states are looking to adopt representative measurements of server (and storage) work capacity — which, in turn, will enable the development of a work per energy or work per watt-hour metric (see below and accompanying report).

So far, the EU has provided limited guidance on the work per energy metric, which it will need to agree on in 2025 or 2026. However, it will clearly require a technical definition of CPU, GPU and accelerator work capacity, along with energy-use boundaries.

Once the metric is agreed upon and adopted by the EU, it will likely become both important and widely cited. It would be the only metric that links IT performance to the energy consumed by the data center. Although it may take several years to roll out, this metric is likely to become widely adopted around the world.

The new metric

The EU officials developing and applying the rules set out in the Energy Efficiency Directive (EED) are still working on many key aspects of a data center labeling scheme set to launch in 2026. One area they are struggling with is the development of meaningful IT efficiency metrics.

Uptime Institute and The Green Grid’s proposed work per energy metric is not the only option, but it offers many key advantages. Chief among them: it has a clear methodology; the work capacity value increases with physical core count and newer technology generations; and it avoids the need to measure the performance of every server. The methodology can also be adapted to measure work per megawatt-hour for GPU/accelerator-based servers and dedicated storage equipment. While there are some downsides, these will likely be shared by most alternative approaches.

Full details of the methodology are outlined in the white papers and webinar listed at the end of the report. The initial baseline work — on how to calculate work capacity of standard CPU-based servers — was developed by The Green Grid. Uptime Institute Sustainability and Energy Research Director Jay Dietrich extended the methodology to GPU/accelerator-based servers and dedicated storage equipment, and expanded it to calculate the work per megawatt-hour metric.

The methodology has five components:

  • Build or access an inventory of all the IT in the data center. The required data on CPU, GPU and storage devices should be available in procurement systems, inventory management tools, CMMS or some DCIM platforms.
  • Calculate the work capacity of the servers using the PerfCPU values available on The Green Grid website. These values are based on CPU cores by CPU technology generation.
  • Include GPU or accelerator-based compute servers using the 32-bit TFLOPS metrics. An alternative performance metric, such as Total Processing Performance (TPP), may be used if agreed upon later.
  • Include online, dedicated storage equipment (excluding tape) measured in terabytes.
  • Collect data, usually from existing systems, on:
    • Power supplied to CPUs, GPUs and storage systems. This should be relatively straightforward if the appropriate meters and databases are in place. Where there is insufficient metering, it may be necessary to use reasonable allocation methods.
    • Utilization. It is critical for a work per energy metric to know the utilization averages. This data is routinely monitored in all IT systems, but it needs to be collected and normalized for reporting purposes.

With this data, the work per energy metric can be calculated by summing the work performed across the IT inventory and dividing it by the total amount of energy consumed. Like PUE, it is calculated over the course of a year to give an annual average. A simplified version, for three different workloads, is shown in Figure 1.

Figure 1 Examples of IT equipment work-per-energy calculations

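Following the five methodology components above, the sketch below assembles a toy inventory and reports a separate work-per-energy figure for each equipment class, as Figure 1 does. The PerfCPU, TFLOPS and terabyte values are invented purely for illustration.

```python
# Toy inventory-based work-per-energy calculation (all values invented).
# Each entry: (equipment class, work capacity, unit, average utilization, annual energy in MWh)

inventory = [
    ("CPU servers",       120_000.0, "PerfCPU",       0.30, 1_800.0),
    ("GPU/accelerator",    60_000.0, "32-bit TFLOPS", 0.55, 2_400.0),
    ("Dedicated storage",   9_500.0, "TB",            1.00,   350.0),
]

for cls, capacity, unit, utilization, energy_mwh in inventory:
    work = capacity * utilization
    print(f"{cls}: {work / energy_mwh:,.1f} {unit} per MWh")
```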

Challenges

There are undoubtedly some challenges with this metric. One is that Figure 1 shows three different figures for three different workloads — whereas, in contrast, data centers usually report a single PUE number. This complexity, however, is unavoidable when measuring very different workloads, especially if the figure(s) are to give meaningful guidance on how to make efficiency improvements.

Under its EED reporting scheme, the EU has so far allowed for the inclusion of only one final figure each for server work capacity and storage capacity reporting. While a single figure works for storage, the different performance characteristics of standard CPU servers, AI inference and high-performance compute, and AI training servers make it necessary to report their capacities separately. Uptime argues that combining these three workloads into a single figure — essentially for at-a-glance public consumption — distorts and oversimplifies the report, risking the credibility of the entire effort. Whatever the EU decides, the problem is likely to be the same for any work per energy metric.

A second issue is that 60% of operators lack a complete component and location inventory of their IT infrastructure. Collecting the required information for the installed infrastructure, adjusting purchasing contracts to require inventory data reporting, and automating data collection for new equipment represents a considerable effort, especially at scale. By contrast, a PUE calculation only requires two meters at a minimum. 

However, most of the data collection — and even the calculations — can be automated once the appropriate databases and software are in place. While collecting the initial data and building the necessary systems may take several months, doing so will provide ongoing data to support efficiency improvements. In the case of this metric, data is already available from The Green Grid and Uptime will support the process.

There are several reasons why, until now, no work per energy metric has been successful. Two are particularly noteworthy. First, IT and facilities organizations are often either entirely separate — as in colocation provider/tenant — or they do not generally collaborate or communicate, as is common in enterprise IT. Even when this is not the case, and the data on IT efficiency is available, chief information officers or marketing teams may prefer not to publicize serious inefficiencies. However, such objections will no longer hold sway if legal compliance is required.

The second issue is that industry technical experts have often let the perfect stand in the way of the good, raising concerns about data accuracy. For an effective work per energy metric, the work capacity metric needs to provide a representative, configuration-independent measure that tracks increased work capacity as physical core count increases and as new CPU and GPU generations are introduced.

The Green Grid and Uptime methodologies will no doubt be questioned or opposed by some, but they achieve the intended goal. The work capacity metric does not have to drill down to specific computational workloads or application types, as some industry technologists demand. The argument that there is no reasonable metric, or that it lacks critical support, is no longer grounds for procrastination. IT energy inefficiencies need to be surfaced and understood.

Further information

To access the Uptime report on server and storage capacity (and on work per unit of energy):

Calculating work capacity for server and storage products

To access The Green Grid reports on IT work capacity:

IT work capacity metric V1 — a methodology

Searchable PerfCPU tables by manufacturer

To access an Uptime webinar discussing the metrics discussed in this report:

Calculating Data Center Work Capacity: The EED and Beyond


Cybersecurity and the cost of human error

Cyber incidents are increasing rapidly. In 2024, the number of outages caused by cyber incidents was twice the average of the previous four years, according to Uptime Institute’s annual report on data center outages (see Annual outage analysis 2025). More operational technology (OT) vendors are experiencing significant increases in cyberattacks on their systems. Data center equipment vendor Honeywell analyzed hundreds of billions of system logs and 4,600 events in the first quarter of 2025, identifying 1,472 new ransomware extortion incidents — a 46% increase on the fourth quarter of 2024 (see Honeywell’s 2025 Cyber Threat Report). Beyond the initial impact, cyberattacks can have lasting consequences for a company’s reputation and balance sheet.

Cyberattacks increasingly exploit human error

Cyberattacks on data centers often exploit vulnerabilities — some stemming from simple and preventable errors, while others are overlooked systemic issues. Human error, such as failing to follow procedures, can create vulnerabilities, which the attacker exploits. For example, staff might forget regular system patches or delay firmware updates, leaving systems exposed. Companies, in turn, implement policies and procedures to ensure employees perform preventative actions on a consistent basis.

In many cases, data center operators may well be aware that elements of their IT and OT infrastructure have certain vulnerabilities. This may be due to policy noncompliance or the policy itself lacking appropriate protocols to defend against hackers. Often, employees lack training on how to recognize and respond to common social engineering techniques used by hackers. Tactics such as email phishing, impersonation and ransomware are increasingly targeting organizations with complex supply chain and third-party dependencies.

Cybersecurity incidents involving human error often follow similar patterns. Attacks may begin with some form of social engineering to obtain login credentials. Once inside, the attacker moves laterally through the system, exploiting small errors to cause systemic damage (see Table 1).

Table 1 Cyberattackers exploit human factors to induce human error

Table: Cyberattackers exploit human factors to induce human error

Failure to follow correct procedures

Although many companies have policies and procedures in place, employees can become complacent and fail to follow them. At times, they may unintentionally skip a step or carry it out incorrectly. For instance, workers might forget to install a software update or accidentally misconfigure a port or firewall — despite having technical training. Others may feel overwhelmed by the volume of updates and leave systems vulnerable as a result. In some cases, important details are simply overlooked, such as leaving a firewall port open or setting their cloud storage to public access.

Weak procedures concerning password strength, password changes and inactive accounts are common vulnerabilities that hackers exploit. Inactive accounts that are not properly deactivated often miss critical security updates and are monitored less closely than active accounts, making it easier for breaches to go unnoticed.
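As one hedged illustration of how such gaps can be surfaced, the Python sketch below flags accounts with no login in the past 90 days from a hypothetical account inventory. The field names, dates and 90-day threshold are assumptions for the example, not a reference to any specific identity platform.

    # Minimal sketch, assuming a hypothetical account inventory export: flag
    # accounts with no login in the last 90 days for review and deactivation.
    from datetime import datetime, timedelta

    accounts = [
        {"user": "a.lee",       "last_login": "2025-09-30", "active": True},
        {"user": "temp.vendor", "last_login": "2025-01-12", "active": True},
        {"user": "j.smith",     "last_login": "2025-10-02", "active": True},
    ]

    STALE_AFTER = timedelta(days=90)
    today = datetime(2025, 10, 15)  # fixed date so the example is reproducible

    stale = [
        a["user"] for a in accounts
        if a["active"] and today - datetime.fromisoformat(a["last_login"]) > STALE_AFTER
    ]

    print("Accounts to review for deactivation:", stale)  # ['temp.vendor']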

Unknowingly engaging with social engineering

Social engineering is a tactic used to deceive individuals into revealing sensitive information or downloading malicious software. It typically involves the attacker impersonating someone from the target’s company or organization to build trust with them. The primary goal is to steal login credentials or gain unauthorized access to the system.

Attackers may call employees while posing as someone from the IT help desk and request login details, often pressuring the employee to disclose their credentials under the guise of “routine testing.”

Like phishing, spoofing is a tactic used to gain an employee’s trust by simulating familiar conditions, but it often relies on misleading visual cues. For example, social engineers may email a link to a fake version of the company’s login screen, prompting the unsuspecting employee to enter their login information as usual. In rare cases, attackers might even use AI to impersonate an employee’s supervisor during a video call.

Deviation from policies or best practices

Adhering to policies and best practices often determines whether cybersecurity succeeds or fails. Procedures need to be written clearly and without ambiguity. For example, if a procedure does not explicitly require an employee to clear saved login data from their devices, hackers or rogue employees may be able to gain access to the device using default administrator credentials. Similarly, if regular password changes are not mandated, it may be easier for attackers to compromise system access credentials.

Policies must also account for the possibility of a disgruntled employee or third-party worker stealing or corrupting sensitive information for personal gain. To reduce this risk, companies can implement clear deprovisioning rules in their offboarding process, such as ensuring passwords are changed immediately upon an employee’s departure. While there is always a chance that a procedural step may be accidentally overlooked, comprehensive procedures increase the likelihood that each task is completed correctly.

Procedures are especially critical when employees have to work quickly to contain a cybersecurity incident. They should be clearly written, thoroughly tested for reliability, and easily accessible to serve as a reference during a variety of emergencies.

Poor security governance and oversight

A lack of governance or oversight from management can lead to overlooked risks and vulnerabilities, such as missed security patches or failure to monitor systems for threats and alerts. Training helps employees approach situations with healthy skepticism and apply the checks required by the company’s policies.

Training should evolve to ensure that workers are informed about the latest threats and vulnerabilities, as well as how to recognize them.

Notable incidents exploiting human error

The types of human error described above are further complicated by the psychology of how individuals behave in intense situations. For example, mistakes may occur due to heightened stress, fatigue or coercion, all of which can lead to errors of judgment when a quick decision or action is required.

Table 2 identifies how human error may have played a part in eight major public cybersecurity breaches between 2023 and 2025. This includes three of the 10 most significant data center outages — United Healthcare, CDK Global and Ascension Healthcare — highlighted in Uptime Institute’s outages report (see Annual outage analysis 2025). We note the following trends:

  • At least five of the incidents involved social engineering. These attacks often exploited legitimate credentials or third-party vulnerabilities to gain access and execute malicious actions.
  • All incidents likely involved failures by employees to follow policies, procedures or properly manage common vulnerabilities.
  • Seven incidents exposed gaps in skills, training or experience to mitigate threats to the organization.
  • In half of the incidents, policies may have been poorly enforced or bypassed for unknown reasons.

Table 2 Impact of major cyber incidents involving human error

Table: Impact of major cyber incidents involving human error

Typically, organizations are reluctant to disclose detailed information about cyberattacks. However, regulators and government cybersecurity agencies are increasingly expecting more transparency — particularly when the attacks affect citizens and consumers — since attackers often leak information on public forums and the dark web.

The following findings are particularly concerning for data center operators and warrant serious attention:

  • The financial cost of cyber incidents is significant. Among the eight identified cyberattacks, the estimated total losses exceed $8 billion.
  • The full financial and reputational impact can take time to play out. For example, UK retailer Marks & Spencer is facing lawsuits from customer groups over identity theft and fraud following a cyberattack. Similar actions may be taken by regulators or government agencies, particularly if breaches expose compliance failures with cybersecurity regulations, such as those in the Network and Information Security Directive 2 and the Digital Operational Resilience Act.

The Uptime Intelligence View

Human error is often viewed as a series of unrelated mistakes; however, the errors identified in this report stem from complex, interconnected systems and increasingly sophisticated attackers who exploit human psychology to manipulate events.

Understanding the role of human error in cybersecurity incidents is crucial to help employees recognize and prevent potential oversights. Training alone is unlikely to solve the problem. Data center operators should continuously adapt cyber practices and foster a culture that redefines how staff perceive and respond to the risk of cyber threats. This cultural shift is likely critical to staying ahead of evolving threat tactics.

John O’Brien, Senior Research Analyst, jobrien@uptimeinstitute.com
Rose Weinschenk, Analyst, rweinschenk@uptimeinstitute.com

The post Cybersecurity and the cost of human error appeared first on Uptime Institute Blog.

Cloud: when high availability hurts sustainability

In recent years, the environmental sustainability of IT has become a significant concern for investors and customers, as well as regulatory, legislative and environmental stakeholders. This concern is expected to intensify as the impact of climate change on health, safety and the global economy becomes more pronounced. It has given rise to an assortment of voluntary and mandatory initiatives, standards and requirements that collectively represent, but do not yet define, a basic framework for sustainable IT.

Cloud providers have come under increasing pressure from both the public and governments to reduce their carbon emissions. Their significant data center footprints consume considerable energy to deliver an ever-increasing range of cloud services to a growing customer base. The recent surge in generative AI has thrust the issues of power and carbon further into the spotlight.

Cloud providers have responded with large investments in renewable energy and energy attribute certificates (EACs), widespread use of carbon offsets and the construction of high-efficiency data centers. However, the effectiveness of these initiatives and their impact on carbon emissions vary significantly depending on the cloud provider. While all are promoting an eco-friendly narrative, unwrapping their stories and marketing campaigns to find meaningful facts and figures is challenging.

These efforts and initiatives have garnered considerable publicity. However, the impact of customer configurations on carbon emissions can be considerable and is often overlooked. To build resiliency into cloud services, users face a range of options, each carrying its own carbon footprint. This report examines how resiliency affects carbon emissions.

Sustainability is the customer’s responsibility

The reduction of hyperscaler data center carbon emissions is being fought on two fronts. First, service providers are transitioning to lower-carbon energy sources. Second, cloud customers are being encouraged to optimize their resource usage through data and reporting to help lower carbon emissions.

Cloud provider responsibilities

Data centers consume significant power. To reduce their carbon impact, many cloud providers are investing in carbon offsets: projects that avoid or remove a specified quantity of carbon, which the purchaser can count against an equivalent weight of its own emissions.

Renewable energy certificates (RECs) are tradable, non-tangible energy commodities. Each REC certifies that the holder has used or will use a quantity of electricity generated from a renewable source, thus avoiding the need for carbon emission offsets for that power use.

Cloud providers can use both offsets and RECs to claim their overall carbon emissions are zero. However, this does not equate to zero carbon production; instead, it means providers are balancing their emissions by accounting for a share of another organization’s carbon reductions.

Although cloud providers are making their own environmental changes, responsibility for sustainability is also being shared with users. Many providers now offer access to carbon emissions information via online portals and application programming interfaces (APIs), aiming to appear “green” by helping users measure, report and reduce carbon emissions.

Customer responsibilities

In public cloud, application performance and resiliency are primarily the responsibility of the user. While cloud providers offer services to their customers, they are not responsible for the efficiency or performance of the applications that customers build.

The cloud model lets customers consume services when they are needed. However, this flexibility and freedom can lead to overconsumption, increasing both costs and carbon emissions.

Tools and guidelines are available to help customers manage their cloud usage. Typical recommendations include resizing virtual machines to achieve higher utilization or turning off unused resources. However, these are only suggestions; it is up to customers to implement any changes.
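By way of a hedged example, the Python sketch below applies this kind of recommendation to hypothetical utilization data, flagging low-utilization virtual machines as candidates for resizing or shutdown. The VM names, threshold and simple halving heuristic are assumptions for illustration, not any provider’s tooling.

    # Minimal sketch using hypothetical utilization data: flag virtual machines
    # whose average CPU utilization falls below a threshold as rightsizing or
    # shutdown candidates. Figures and the halving heuristic are illustrative.

    vms = [
        {"name": "web-01",   "vcpus": 8,  "avg_cpu_pct": 11},
        {"name": "batch-02", "vcpus": 4,  "avg_cpu_pct": 63},
        {"name": "dev-07",   "vcpus": 16, "avg_cpu_pct": 3},
    ]

    RIGHTSIZE_BELOW_PCT = 20

    for vm in vms:
        if vm["avg_cpu_pct"] < RIGHTSIZE_BELOW_PCT:
            suggested = max(1, vm["vcpus"] // 2)  # crude halving heuristic
            print(f"{vm['name']}: avg CPU {vm['avg_cpu_pct']}%, consider resizing "
                  f"from {vm['vcpus']} to {suggested} vCPUs or shutting down")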

Since cloud providers charge based on the resources used, helping customers to reduce their cloud usage is likely to also reduce their bills, which in the short term may impact provider revenue. However, cloud providers are willing to take this risk, betting that helping customers lower both carbon emissions and costs will increase overall revenue in the longer term.

Cloud customers are also encouraged to move workloads to regions with less carbon-intensive electricity supplies. This can often result in both lower costs and lower carbon emissions, a win-win. However, it is up to the customer to implement these changes.

Cloud users face a challenging balancing act: they need to architect applications that are available, cost-effective and have a low carbon footprint. Even with the aid of tools, achieving this balance is far from easy.

Previous research

In previous reports comparing cost, carbon emissions and availability across architectures, Uptime Intelligence started by defining an unprotected baseline. This is an application situated in a single location and not protected from the loss of an availability zone (a data center) or region (a collection of closely connected data centers). Other architectures were then designed to distribute resources across availability zones and regions so that the application could continue operating during outages. The costs of these new architectures were compared with the price of the baseline to assess how increased availability affects cost.

Table 1 provides an overview of these architectures. A full description can be found in Build resilient apps: do not rely solely on cloud infrastructure.

Table 1 Summary of application architecture characteristics

Table: Summary of application architecture characteristics

An availability percentage for 2024 was calculated for each architecture using historical status update information. In the cloud, charges are based on the resources consumed to deliver an application, and an application architected across multiple locations uses more resources than one deployed in a single location. In Cloud availability comes at a price, the cost of using each application was calculated.

Finally, in this report, Uptime Intelligence calculates the carbon emissions for each architecture and combines this with the availability and cost data.

Carbon versus cost versus downtime

Figure 1 combines availability, cost and carbon emissions into a single chart. The carbon quantities are based on the location-based Scope 2 emissions, which are associated with the electricity consumed by the data center. The availability of the architectures is represented by bubble sizes: inner rings indicate the average annual downtime across all regions in 2024, while the outer rings show the worst-case regional downtime. The axes display cost and carbon premiums, which reflect additional costs and carbon emissions relative to the unprotected baseline. The methodology for calculating carbon is included as an appendix at the end of this report.

Figure 1 Average and worst-case regional availabilities by carbon and cost

Diagram: Average and worst-case regional availabilities by carbon and cost

Findings

Figure 1 shows that the cost premium is linearly proportional to the carbon premium: a rise in cost directly corresponds to an increase in carbon emissions, and vice versa. This proportionality makes sense: designing for resiliency uses more resources across multiple regions. Due to the cloud’s consumption-based pricing model, more resources equate to higher costs. And with more resources, more servers are working, which produces more carbon emissions.

However, higher costs and carbon emissions do not necessarily translate into better availability. As shown in Figure 1, the size of the bubbles does not always decrease with an increase in cost and carbon. Customers, therefore, do not have to pay the highest premiums in cash and carbon terms to obtain good availability. However, they should expect that resilient applications will require additional expenditure and produce more carbon emissions.

A good compromise is to architect the application across regions using a pilot light configuration. This design provides an average annual downtime of 2.6 hours, a similar level of availability to the equivalent dual region active-active configuration, but with roughly half the cost and carbon emissions.

Even if this architecture were deployed across the worst-performing regions, downtime would remain relatively low at 5.3 hours, which is still consistent with the more expensive resilient design.

However, although the cost and carbon premiums of the pilot light design are at the midpoint in our analysis, they are still high. Compared with an unprotected application, a dual region pilot light configuration produces double the carbon emissions and costs 50% more.

For those organizations looking to keep emissions and costs low, a dual zone active-failover provides an average downtime of 2.9 hours per year at a cost premium of 14% and a carbon premium of 38%. However, it is more susceptible to regional failures — in the worst-performing regions, downtime increases almost fourfold to 10.8 hours per year.
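To make these figures concrete, the short Python sketch below converts the quoted downtime hours into availability percentages and applies the stated premiums to an arbitrary baseline. It is a worked illustration of the arithmetic, not a reproduction of Uptime Intelligence’s model; the baseline values are assumptions.

    # Worked arithmetic for the figures quoted above (an illustration, not
    # Uptime Intelligence's model). Baseline values are arbitrary assumptions.

    HOURS_PER_YEAR = 8760

    def availability_pct(downtime_hours: float) -> float:
        return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

    baseline_cost = 100.0    # unprotected baseline, arbitrary units
    baseline_carbon = 100.0

    # Dual zone active-failover figures quoted in the text
    print(f"Average availability:    {availability_pct(2.9):.3f}%")   # ~99.967%
    print(f"Worst-case availability: {availability_pct(10.8):.3f}%")  # ~99.877%
    print(f"Cost with 14% premium:   {baseline_cost * 1.14:.0f}")     # 114
    print(f"Carbon with 38% premium: {baseline_carbon * 1.38:.0f}")   # 138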

Conclusions

In all examined cases, increases in carbon are substantial. High availability inevitably comes with an increase in carbon emissions. Enterprises need to decide what compromises they are willing to make between low cost, low carbon and high availability.

These trade-offs should be evaluated during the design phase, before implementation. Ironically, most tools provided by cloud providers focus only on reporting and optimizing current resource usage rather than helping assess the impact of potential architectures.

AWS provides its Customer Carbon Footprint Tool, Google offers a Cloud Carbon Footprint capability, Microsoft delivers an Emissions Impact Dashboard for Azure, IBM has a Cloud Carbon Calculator, and Oracle Cloud has its OCI Sustainability Dashboard. These tools aid carbon reporting and may make recommendations to reduce carbon emissions. However, they do not suggest fundamental changes to the architecture design based on broader requirements such as cost and availability.

Considering the direct relationship between carbon emissions and cost, organizations can take some comfort in knowing that architectures built with an awareness of cost optimization are also likely to reduce emissions. In AWS’s Well-Architected framework for application development, the Cost Optimization pillar and the Sustainability pillar share similarities, such as turning off unused resources and sizing virtual machines correctly. Organizations should investigate if their cost optimization developments can also reduce carbon emissions.


The Uptime Intelligence View

The public cloud may initially appear to be a low-cost, low-carbon option. However, customers aiming for high availability should architect their applications across availability zones and regions. More resources running in more locations equates to higher costs (due to the cloud’s consumption-based pricing) and increased carbon emissions (due to the use of multiple physical resources). Ultimately, those developing cloud applications need to decide where their priorities lie regarding cost reduction, environmental credentials and user experience.

Appendix: methodology

The results presented in this report should not be considered prescriptive but hypothetical use cases. Readers should perform their own analyses before pursuing or avoiding any action.

Data is obtained from the Cloud Carbon Footprint (CCF) project, an open-source tool for analyzing carbon emissions. This initiative seeks to aid users in measuring and reducing the carbon emissions associated with their public cloud use.

The CCF project uses several sources, including the SPECpower database, to calculate power consumption for various cloud services hosted on AWS, Google and Microsoft Azure. SPECpower is a database of server power consumption measured at a range of utilization levels. Power is converted to an estimate of carbon emissions using data from the European Environment Agency, the US Environmental Protection Agency and carbonfootprint.com.
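For orientation, the Python sketch below shows the general shape of such an estimate, assuming a simple linear interpolation between idle and maximum server power and a single grid emission factor. All inputs are placeholder assumptions, not CCF’s published coefficients.

    # Minimal sketch of the general estimation approach: interpolate server power
    # between idle and maximum by utilization, convert to energy, then apply a
    # grid emission factor. All figures are placeholder assumptions.

    def estimate_kgco2e(idle_w, max_w, utilization, hours, pue, grid_kgco2e_per_kwh):
        avg_power_w = idle_w + utilization * (max_w - idle_w)
        energy_kwh = (avg_power_w / 1000) * hours * pue   # scale IT energy by PUE
        return energy_kwh * grid_kgco2e_per_kwh

    # Hypothetical inputs for one server running for a year
    emissions = estimate_kgco2e(
        idle_w=50, max_w=200, utilization=0.5, hours=8760,
        pue=1.2, grid_kgco2e_per_kwh=0.4,
    )
    print(f"Estimated emissions: {emissions:.0f} kgCO2e per year")  # ~526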

Uptime Intelligence used the CCF’s carbon and power assumptions to estimate carbon emissions for several cloud architectures. We consider the CCF’s methodology and assumptions reasonable enough to compare carbon emissions based on cloud architecture. However, we cannot state that the CCF’s tools, methods and assumptions suit all purposes. That said, the project’s open-source and collaborative nature means it is more likely to be an unbiased and fair methodology than those offered by cloud providers.

The CCF’s methodology details are available on the project’s website and in the freely accessible source code. See cloudcarbonfootprint.org/docs/methodology.

For this research, Uptime Intelligence based its calculations on Amazon Web Services (AWS). Not only is AWS the market leader, but it also provides sufficiently detailed information to make an investigation possible. Other public cloud services have similar pricing models, services and architectural principles, so this report’s fundamental analysis will apply to other cloud providers. AWS costs are obtained from the company’s website and carbon emissions are obtained from the CCF project’s assumptions for AWS. We used an m5.large virtual machine in us-east-1 for our architecture.

Table 2 shows the carbon emissions calculations based on these sources.

Table 2 Carbon emissions calculations

Table: Carbon emissions calculations

The following Uptime Institute expert was consulted for this report:
Jay Dietrich, Research Director of Sustainability, Uptime Institute

The post Cloud: when high availability hurts sustainability appeared first on Uptime Institute Blog.

The two sides of a sustainability strategy

While much has been written, said and taught about data center sustainability, there is still limited consensus on the definition and scope of an ideal data center sustainability strategy. This lack of clarity has created much confusion, encouraged many operators to pursue strategies with limited results, and enabled some to make claims that are ultimately of little worth.

To date, the data center industry has adopted three broad, complementary approaches to sustainability:

  • Facility and IT sustainability. This approach prioritizes operational efficiency, minimizing the energy, direct carbon and water footprints of IT and facility infrastructure. It directly addresses the operational impacts of individual facilities, reducing material and energy use and costs. Maximizing the sustainability of individual facilities is key to addressing the increased government focus on regulating individual data centers.
  • Ecosystem sustainability. This strategy focuses on carbon neutrality (or carbon negativity), water positivity and nature positivity across the enterprise. Ecosystem sustainability offsets the environmental impacts of an enterprise’s operations, which may increase business costs.
  • Overall sustainability. While some data center operators promote the sustainability of their facilities with limited efforts on ecosystem sustainability, others build their brand around ecosystem sustainability with minimal discussion about the sustainability of their facilities. Although it is common for organizations to make efforts in both areas, it is less common for the strategies to be integrated as a part of a coherent plan.

Each approach has its own benefits and challenges, providing different levels of business and environmental performance improvement. This report is an extension and update to the Sustainability Series of reports, published by Uptime Intelligence in 2022 (see below for a list of the reports), which detailed the seven elements of a sustainability strategy.

Data center sustainability

Data center sustainability involves incorporating sustainability and efficiency considerations into siting, design and operational processes throughout a facility’s life. The organizations responsible for siting and design, IT operations, facility operations, procurement, contracting (colocation and cloud operators) and waste management must embrace the enterprise’s overall sustainability strategy and incorporate it into their daily operations.

Achieving sustainability objectives may require a more costly initial investment for an individual facility, but the reward is likely an overall lower cost of ownership over its life. To implement a sustainability strategy effectively, an operator must address the full range of sustainability elements:

  • Siting and design. Customer and business needs dictate a data center’s location. Typically, multiple sites will satisfy these criteria; however, the location should also be selected based on whether it can help optimize the facility’s sustainability performance. Operators should focus on maximizing free cooling and carbon-free energy consumption while minimizing energy and water consumption. The design should specify equipment and materials that maximize the facility’s environmental performance.
  • Cooling system. The design should minimize water and energy use, including capturing available free-cooling hours. In water-scarce or water-stressed regions, operators should deploy waterless cooling systems. Where feasible and economically viable, heat reuse systems should also be incorporated into the design.
  • Standby power system. The standby power system should be fuel-flexible (able to use low-carbon or carbon-free fuels) and capable of, and permitted to, deliver primary power for extended periods. This enables the system to support grid reliability and help address the intermittency of wind and solar generation contracted to supply the data center, thereby reducing the carbon intensity of its electricity consumption.
  • IT infrastructure efficiency. IT equipment should be selected to maximize the average work delivered per watt of installed capacity. The installed equipment should run at or close to the highest practical utilization level of the installed workloads while meeting their reliability and resiliency requirements. IT workload placement and management software should be used to monitor and optimize the IT infrastructure performance.
  • Carbon-free energy consumption. Operators should work with electricity utilities, energy retailers, energy developers and regulators to maximize the quantity of clean energy consumed and minimize location-based emissions. Over time, they should plan to increase carbon-free energy consumption to 90% or more of total consumption. Timelines will vary by region depending on the economics and availability of carbon-free energy.
  • End-of-life equipment reuse and materials recovery. Operators need an end-of-life equipment management process that maximizes the reuse of equipment and components, both within the organization and through refurbishment and use by others. Where equipment must be scrapped, there should be a process in place to recover valuable metals and minerals, as well as energy, through environmentally responsible processes.  
  • Scope 3 emissions management. Operators should require key suppliers to maintain a sustainability strategy, publicly disclose their greenhouse gas (GHG) emissions inventory and reduction goals, and demonstrate progress toward their sustainability objectives. There should be consequences in place for suppliers that fail to show reasonable progress.

While these strategies may appear simple, creating and executing a sustainability strategy requires the commitment of the whole organization — from technicians and engineers to procurement, finance and executive leadership. In some cases, financial criteria may need to shift from considering the initial upfront costs to the total cost of ownership and the revenue benefits/enhancements gained from a demonstrably sustainable operation. A data center sustainability strategy can enhance business and environmental performance.

Ecosystem sustainability

An ecosystem sustainability strategy emphasizes mitigating and offsetting the environmental impacts of an operator’s data center portfolio. While these efforts do not change the environmental operating profile of individual data centers, they are designed to benefit the surrounding community and natural environment. Such projects and environmental offsets are typically managed at the enterprise level rather than the facility level and represent a cost to the enterprise.

  • Carbon-neutral or carbon-negative operations. Operators should purchase energy attribute certificates (EACs) and carbon capture offsets to reduce or eliminate their Scope 1, 2 and 3 emissions inventory. The offsets are generated primarily from facilities geographically separate from the data center facilities. EACs and offsets can be purchased directly from brokers or from operators of carbon-free energy or carbon capture systems.
  • Water-positive operations. Operators should work with communities and conservation groups to implement water recharge and conservation projects that return more water to the ecosystem than is used across their data centers. Examples include wetlands reclamation, water replenishment, support of sustainable agriculture, and leak detection and minimization systems for water distribution networks. These projects can benefit the local watershed or unrelated, geographically distinct watersheds.
  • Nature-positive facilities. The data center or campus should be landscaped to regenerate and integrate with the natural landscape and local ecosystem. Rainwater and stormwater should be naturally filtered and reused where practical. The landscape should be designed and managed to support local flora and fauna, ensuring that the overall campus is seamlessly integrated into the local ecosystem. The overall intent is to make the facility as “invisible” as possible to the local community.
  • Emissions reductions achieved with IT tools. Some operators and data center industry groups quantify and promote the emissions reduction benefits (known as Scope 4 “avoided emissions”) generated from the operation of the IT infrastructure. They assert that the “avoided emissions” achieved through the application of IT systems to increase the operational efficiency of systems or processes, or “dematerialize” products, can offset some or all of the data center infrastructure’s emissions footprint. However, these claims should be approached with caution, as there is a high degree of uncertainty in the calculated quantities of “avoided emissions.”
  • Pro-active work with supply chains. Some operators work directly with supply chain partners to decarbonize their operations. This approach is practical when an enterprise represents a significant percentage of a supplier’s revenue. However, it becomes impractical when an operator’s purchases represent only a small percentage of the supplier’s business.

Ecosystem sustainability seeks to deliver environmental performance improvements to operations and ecosystems outside the operator’s direct control. These improvements compensate for and offset any remaining environmental impacts following the full execution of the data center sustainability strategy. They typically represent a business cost and enhance an operator’s commercial reputation and brand.

Where to focus

Facility and IT sustainability and ecosystem sustainability strategies are complementary, together addressing the full range of sustainability activities and opportunities. In most organizations, it will be necessary to cover all of these areas, often with different teams focusing on their respective domains.

An operator’s primary focus should be improving the operational efficiency and sustainability performance of its data centers. Investments in the increased use of free cooling, automated control of chiller and IT space cooling systems, and IT consolidation projects can yield significant energy, water and cost savings, along with reductions in GHG emissions. These improvements will not only reduce the environmental footprint of the data center but can also improve its business performance.

These efforts also enable operators to proactively address emerging regulatory and standards frameworks. Such regulations are intended to increase the reporting of operating data and metrics and may ultimately dictate minimum performance standards for data centers.

To reduce the Scope 2 emissions (purchased electricity) associated with data center operations to zero, operators need to work with utilities, energy retailers, and the electricity transmission and distribution system operators. The shared goal is to help build a resilient, interconnected electricity grid populated by carbon-free electricity generation and storage systems — a requirement for government net-zero mandates.

Addressing ecosystem sustainability opportunities is a valuable next step in an operator’s sustainability journey. Ecosystem projects can enhance the natural environment surrounding the data facility, improve the availability of carbon-free energy and water resources locally and globally, and directly support, inform and incentivize the sustainability efforts of customers and suppliers.

Data center sustainability should be approached in two separate ways: first, the infrastructure itself and, second, the ecosystem. Confusion and overlap between these two aspects can lead to unfortunate results. For example, in many cases, a net-zero and water-positive data center program is (wrongly) accepted as an indication that an enterprise is operating a sustainable data center infrastructure.


The Uptime Intelligence View

Operators should prioritize IT and facilities sustainability over ecosystem sustainability. The execution and results of an IT and facilities sustainability strategy directly minimize the environmental footprint of a data center portfolio, while maximizing its business and sustainability performance.

Data reporting and minimum performance standards embodied in enacted or proposed regulations focus on the operation of individual data centers, not on aggregated enterprise-level sustainability performance. An operator must demonstrate that it has a highly utilized IT infrastructure (maximizing work delivered per unit of energy consumed) and has minimized the energy and water consumption and GHG emissions associated with its facility operations.

Pursuing an ecosystem sustainability strategy is the logical next step for operators that want to do more and further enhance their sustainability credentials. However, an ecosystem sustainability strategy should not be pursued at the expense of an IT and facilities strategy, nor used to shield poor or marginal facility and IT systems performance.

The following Uptime Institute expert was consulted for this report:
Jay Paidipati, Vice President Sustainability Program Management, Uptime Institute

Other related reports published by Uptime Institute include:
Creating a sustainability strategy
IT Efficiency: the critical core of sustainability
Three key elements: water, circularity and siting
Navigating regulations and standards
Tackling greenhouse gases
Reducing the energy footprint

The post The two sides of a sustainability strategy appeared first on Uptime Institute Blog.
