Retail vs wholesale: finding the right colo pricing model

Colocation providers may offer two pricing and packaging models to sell similar products and capabilities. In both models, customers purchase space, power and services. However, the method of purchase differs.

In a retail model, customers purchase a small quantity of space and power, usually by the rack or a fraction of a rack. The colocation provider standardizes contracts, pricing and capabilities — the cost and complexity of delivering to a customer’s precise requirements are not justified, considering the relatively small contract value.

In a wholesale model, customers purchase a significantly larger quantity of space and power, typically at least a dedicated, enclosed suite of white space. Due to the size of these contracts, colocation providers need to be flexible in meeting customer needs, even potentially building new facilities to accommodate their requirements. The colocation provider negotiates price and terms, and customers often prefer to pay for actual power consumption rather than be billed on maximum capacity. A metered model allows the customer to scale power usage in response to changing demands.

A colocation provider may focus on a particular market by offering only a retail or wholesale model, or the provider may offer both to broaden its appeal. The terms “wholesale” and “retail” colocation more accurately describe the pricing and packaging models used by colocation providers rather than the type of customer.

Table 1 Key differences between retail and wholesale colocation providers

Retail colocation deals typically have higher gross margins in percentage terms, but the volume of sales is lower. Most colocation providers would rather sell wholesale contracts because they offer higher revenues through larger volumes of sales, despite having lower gross margins. Because wholesale customers are the better prospects, retail customers are more likely to experience cost rises at renewal than wholesale customers.

Retail colocation pricing model

Retail terms are designed to be simple and predictable. Customers are typically charged a fixed fee based on the maximum power capacity supplied to equipment and the space used. This fee covers both the repayment of fixed costs and the variable costs associated with IT power and cooling. The fixed fee bundles all these elements together, so customers have no visibility into these individual components — but they benefit from predictable pricing.

In retail colocation, the facilities are already available, so capital costs are recovered across all retail customers through standard pricing. If a customer exceeds their allotted maximum power capacity, they risk triggering a breaker and potentially powering down their IT equipment. Some colocation providers monitor for overages and warn customers that they need to increase their capacity before an outage occurs.

Customers are likely to purchase more power capacity than they need to prevent these outages. As a result, some colocation providers may deliberately oversubscribe power consumption to reduce their power costs and increase their profit margins. There are operational and reputational risks if oversubscription causes service degradation or outages.

Some colocation providers also meter power, charging a fee based on IT usage, which factors in the repayment of capital, IT and cooling costs, as well as a profit margin. Those with metering enabled may charge customers for usage exceeding maximum capacity, typically at a higher rate.
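To make the structure of such a bill concrete, the minimal sketch below combines a fixed fee on committed capacity with a higher metered rate for usage above that capacity, as described above. All rates and quantities are hypothetical and will differ by provider and contract.

```python
# Illustrative sketch of how a retail colocation bill might be composed: a fixed fee
# on committed capacity plus a premium rate on metered usage above that capacity.
# All rates and quantities below are hypothetical.

def retail_monthly_bill(committed_kw, fixed_rate_per_kw,
                        overage_kwh=0.0, overage_rate_per_kwh=0.0):
    fixed_fee = committed_kw * fixed_rate_per_kw        # bundles space, power and cooling
    overage_fee = overage_kwh * overage_rate_per_kwh    # usage beyond committed capacity
    return fixed_fee + overage_fee

# Example: 10 kW committed at $250/kW per month, plus 300 kWh of overage at $0.30/kWh
print(retail_monthly_bill(10, 250.0, 300, 0.30))   # -> 2590.0
```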

Can a colocation provider increase prices during a contract term? Occasionally, but only as a last resort — such as if power costs increase significantly. This possibility will be stipulated in the contract as an emergency or force majeure measure.

Usually, an internet connection is included. However, data transfer over that connection may be metered or bundled into a fixed cost package. Customers have the option to purchase cross-connects linking their infrastructure to third-party communications providers, including on-ramps to cloud providers.

Wholesale colocation pricing model

Wholesale colocation pricing is designed to offer customers the flexibility to utilize their capacity as they choose. Because terms are customized, pricing models will vary from customer to customer.

Some customers may prefer to pay for a fixed capacity of total power, regardless of whether the power is used or not. In this model, both IT power and cooling costs are factored into the price.

Other customers may prefer a more granular approach, with multiple charging components:

  • A fixed fee per unit of space/rack, based on maximum power capacity, designed to cover the colocation provider’s fixed costs while including a profit margin.
  • Variable IT power costs are passed directly from the electricity supplier to the customer, metered in kilowatts (kW). Customers bear the full cost of price fluctuations, which can change rapidly depending on grid conditions.
  • To account for variable cooling costs, power costs may be calculated by multiplying actual power usage by an agreed design PUE to create an “additional power” fee. This figure may also be multiplied by a “utilization factor” to reflect cases where a customer is using only a small fraction of the data hall (and therefore impacting overall efficiency). A worked sketch combining these components follows this list.
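As a rough illustration of how these components might combine, the sketch below adds a fixed rack fee, a metered pass-through for IT power and an “additional power” fee derived from a design PUE and a utilization factor. The formula is one plausible interpretation of such contracts, and every figure is hypothetical.

```python
# Illustrative sketch of the granular wholesale charging components described above.
# The split between fixed fee, metered IT power and an "additional power" fee follows
# one possible interpretation of such contracts; all figures are hypothetical.

def wholesale_monthly_bill(racks, fixed_fee_per_rack,
                           it_energy_kwh, tariff_per_kwh,
                           design_pue, utilization_factor):
    fixed = racks * fixed_fee_per_rack                  # covers fixed costs plus margin
    it_power = it_energy_kwh * tariff_per_kwh           # metered, passed through from the utility
    # "Additional power" fee: cooling/overhead estimated from the agreed design PUE,
    # scaled by a utilization factor for partially occupied data halls.
    additional = it_energy_kwh * (design_pue - 1) * utilization_factor * tariff_per_kwh
    return fixed + it_power + additional

# Example: 20 racks, 120 MWh of metered IT energy in the month
print(wholesale_monthly_bill(20, 3_000.0, 120_000, 0.09, 1.3, 1.2))
# fixed 60,000 + IT 10,800 + additional 3,888 = 74,688
```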

Some customers may prefer a blended model of both a fixed element for baseline capacity and a variable charge for consumption above the baseline. Redundant feeds are also likely to impact cost. If new data halls need to be constructed, these costs may be passed on to the customers directly, or some capital may be recovered through a higher fixed rack fee.

Alternatively, for long-term deployments, customers may opt for either a “build-to-suit” or “powered shell” arrangement. In a build-to-suit model, the colocation provider designs and constructs the facility — including power, cooling and layout — to the customer’s exact specifications. The space is then leased back to the customer, typically under a long-term agreement exceeding a decade.

In a powered shell setup, the provider delivers a completed exterior building with core infrastructure, such as utility power and network access. The customer is then responsible for outfitting the interior (racks, cooling, electrical systems) to suit their operational needs.

Most customers using wholesale colocation providers will need to implement cross-connects to third-party connectivity and network providers hosted in meet-me rooms. They may also need to arrange the construction of new capacity into the facility with the colocation provider and suppliers.

Hyperscalers are an excellent prospect for wholesale colocation, given their significant scale. However, their limited numbers and strong market power enable them to negotiate lower margins from colocation providers.

Table 2 Pricing models used in retail and wholesale colocation

In a retail colocation engagement, the customer has limited negotiating power — with little scale, they generally have minimal flexibility on pricing, terms and customization. In a wholesale engagement, the opposite is true, and the arrangement favors the customer. Colocation providers want the scale and sales volume, so are willing to cut prices and accommodate additional requirements. They are also willing to offer flexible pricing in response to customers’ rapidly changing requirements.


The Uptime Intelligence View

Hyperscalers have the strongest market power to dictate contracts and prices. With so few players, it is unlikely that multiple hyperscalers will bid for the same space and push up prices. However, colocation providers still want their business, because of the volume it brings. They would prefer to reduce gross margins to ensure a win, rather than risk losing a customer with such unmatched scale.


Electrical considerations with large AI compute

The training of large generative AI models is a special case of high-performance computing (HPC) workloads. This is not simply due to the reliance on GPUs — numerous engineering and scientific research computations already use GPUs as standard. Neither is it about the power density or the liquid cooling of AI hardware, as large HPC systems are already extremely dense and use liquid cooling. Instead, what makes AI compute special is its runtime behavior: when training transformer-based models, large compute clusters can create step load-related power quality issues for power distribution systems in data center facilities. A previous Intelligence report offers an overview of the underlying hardware-software mechanisms.

The scale of the power fluctuations makes this phenomenon unusual and problematic. The vast number of generic servers found in most data centers collectively produce a relatively steady electrical load — even if individual servers experience sudden changes in power usage, they are discordant. In contrast, the power use of compute nodes in AI training clusters moves in near unison.

Even compared with most other HPC clusters, AI training clusters exhibit larger power swings. This is due to an interplay between transformer-based neural network architectures and compute hardware, which creates frequent spikes and falls (every second or two) in power demand. These fluctuations correspond to the computational steps in the training processes, exacerbated by an aggressive pursuit of peak performance typical in modern silicon.

Powerful fluctuations

The scope of the resulting step changes in power will depend on the size and configuration of the compute cluster, as well as operational factors such as AI server performance and power management settings. Uptime Intelligence estimates that in worst-case scenarios, the difference between the low and high points of power draw during training program execution can exceed 100% on a system level (the load doubles almost instantaneously, within milliseconds) for some configurations.

These extremes occur every few seconds, whenever a batch of weights and biases is loaded on GPUs and the training begins. This is often accompanied by a massive spike in current, produced by power excursion events as GPUs overshoot their thermal design power rating (TDP) to opportunistically exploit any extra thermal and power delivery budget following a phase of lower transistor activity. In short, power spikes are made possible by intermittent lulls.

This behavior is common in modern compute silicon, including in personal devices and generic servers. Still, it is only with large AI compute clusters that these fluctuations across dozens or hundreds of servers move almost synchronously.

Even in moderately sized clusters with just a few dozen racks, this can result in sudden, millisecond-speed changes in AC power — ranging from several hundred kilowatts to even a few megawatts. If there are no other substantial loads present in the electrical mix to dampen these fluctuations, these step changes may stress capacity components in the power distribution systems. They may also cause power quality issues such as voltage sags and swells, or significant harmonics and sub-synchronous oscillations that distort the sinusoidal waveforms in AC power systems.

Based on several discussions with and disclosures by major electrical equipment manufacturers — including ABB, Eaton, Schneider Electric, Siemens and Vertiv — there is a general consensus that modern power distribution equipment is expected to be able to handle AI power fluctuations, as long as they remain within the rated load.

IT system capacity redefined

The issue of AI step loads appears to center on equipment capacity and the need to avoid frequent overloads. Standard capacity planning practices often start with the nameplate power of installed IT hardware, then derate it to estimate the expected actual power. This adjustment can reduce the total nameplate power by 25% to 50% across all IT loads when accounting for the diversity of workloads — since they do not act in unison — and also for the fact that most software rarely pushes the IT hardware close to its rated power.
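A minimal sketch of this derating arithmetic, using the 25% to 50% range quoted above and a hypothetical fleet of generic servers:

```python
# Minimal sketch of the standard capacity-planning practice described above:
# derating summed nameplate power by 25% to 50% to estimate expected load.
# Figures are illustrative only.

nameplate_kw = [0.8] * 500 + [1.2] * 200    # hypothetical mixed fleet of generic servers
total_nameplate = sum(nameplate_kw)          # 640 kW

for derating in (0.25, 0.50):
    expected = total_nameplate * (1 - derating)
    print(f"{int(derating * 100)}% derating -> expected load ~{expected:.0f} kW")

# AI training clusters break this assumption: synchronized peaks mean the planned
# figure must track repeated peak power, not a derated average.
```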

In comparison, AI training systems can show extreme behavior. Larger AI compute clusters can draw something akin to an inrush current (a rapid change in current, often characterized by high di/dt) that exceeds the IT system’s sustained maximum power rating.

Normally, overloads would not pose a problem for modern power distribution. All electrical components and systems have specified overload ratings to handle transient events (e.g., current surges during the startup of IT hardware or other equipment) and are designed and tested accordingly. However, if power distribution components are sized closely to the rated capacity of the AI compute load, these transient overloads could happen millions of times per year in the worst cases — components are not tested for regularly repeated overloads. Over time, this can lead to electromechanical stress, thermal stress and gradual overheating (heat-up is faster than cool-off) — potentially resulting in component failure.

This brings the definition of capacity to the forefront of AI compute step loads. Establishing the repeated peak power of a single GPU-server node is already a non-trivial effort — it requires running a variety of computationally intensive codes and setting up a high-precision power monitor. However, how a specific compute cluster spanning several racks and potentially hundreds or even thousands of GPUs will behave during a training run is even harder to ascertain ahead of deployment.

The expected power profile also depends on server configurations, such as power supply redundancy level, cooling mode and GPU generation. For example, in a typical AI system from the 2022-2024 generation, power fluctuations can reach up to 4 kW per 8-GPU server node, or 16 kW per rack when populated with four nodes, according to Uptime estimates. Even so, the likelihood of exceeding the rack power rating of around 41 kW is relatively low. Any overshoot is likely to be minor, as these systems are mostly air-cooled hardware designed to meet ASHRAE Class A2 specifications — allowed to operate in environments up to 35°C (95°F). In practice, most facilities supply much cooler air, making system fans cycle less intensely.

However, with recently launched systems, the issue is further exacerbated as GPUs account for a larger share of the power budget, not only because they use more power (in excess of 1 kW per GPU module) but also because these systems are more likely to use direct liquid cooling (DLC). Liquid cooling reduces system fan power, thereby reducing the stable load of server power. It also has better thermal performance, which helps the silicon to accumulate extra thermal budget for power excursions.

IT hardware specifications and information shared with Uptime by power equipment vendors indicate that in the worst cases, load swings can reach 150%, with a potential for overshoots exceeding 10% above the system’s power specification. In the case of rack-scale systems based on Nvidia’s GB200 NVL72 architecture, sudden power climbs from a 60 kW to 70 kW baseline to more than 150 kW per rack can occur.

This compares to a maximum power specification of 132 kW, which means that, under worst-case assumptions, repeated overloads can amount to as much as 20% in instantaneous power, Uptime estimates. This warrants extra care regarding circuit sizing (including breakers, tap-off units and placements, busways and other conductors) to avoid overheating and related reliability issues.
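A simple arithmetic check, using the approximate figures quoted above (a low point of around 65 kW, peaks of just over 150 kW and a 132 kW rack specification), illustrates how the swing and overshoot percentages are derived. These inputs are estimates, not measured data.

```python
# Arithmetic check on the figures quoted above for a GB200 NVL72-class rack.
# Peak values are worst-case estimates, not measured data.

rack_spec_kw = 132          # maximum rack power specification cited above
low_kw, peak_kw = 65, 150   # approximate low point and "more than 150 kW" peak

swing_pct = (peak_kw - low_kw) / low_kw * 100                   # size of the step load
overshoot_pct = (peak_kw - rack_spec_kw) / rack_spec_kw * 100   # overload vs. the rating

print(f"step load: ~{swing_pct:.0f}% of the low point")      # ~131%
print(f"overshoot: ~{overshoot_pct:.0f}% above the rating")  # ~14% here; up to ~20% in the worst cases
```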

Figure 1 shows the power pattern of a GPU-based compute cluster running a transformer-based model training workload. Based on hardware specifications and real-world power data disclosed to Uptime Intelligence, we algorithmically mimicked the behavior of a compute cluster comprising four Nvidia GB200 NVL72 racks and four non-compute racks. It demonstrates the power fluctuations expected during training runs on these clusters and underscores the need to rethink capacity planning compared with traditional, generic IT loads. Even though the average power stays below the power rating of the cluster, peak fluctuations can exceed it. While this models a relatively small cluster with 288 GPUs, a larger cluster would exhibit similar behavior at the megawatt scale.

Figure 1 Power profile of a GPU-based training cluster (algorithmic, not real-world data)

In electrical terms, no multi-rack workload is perfectly synchronous, while the presence of other loads will help smooth out the edges of fluctuations further. When including non-compute ancillary loads in the cluster — such as storage systems, networks and CDUs (which also require UPS power) — a lower safety margin above the nominal rating (e.g., 10% to 15%) appears sufficient to cover any regular peaks over the nominal system power specifications, even with the latest AI hardware.
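The sketch below shows, in simplified form, how a power profile like the one in Figure 1 can be mimicked algorithmically: a square-wave compute phase alternating with a short lull across a handful of compute racks, plus a steady ancillary load. It is not Uptime Intelligence’s actual model, and the per-rack figures are the estimates discussed earlier.

```python
# Simplified sketch of a synchronized training-cluster power profile (not Uptime
# Intelligence's actual model). A square-wave compute phase alternates with a short
# lull, with small random jitter per compute rack.
import random

COMPUTE_RACKS = 4              # e.g., GB200 NVL72-class racks
ANCILLARY_KW = 4 * 15          # four non-compute racks (storage, network, CDUs), assumed steady
PEAK_KW, IDLE_KW = 150, 65     # per-rack high/low points, per the estimates above
STEP_MS, PERIOD_MS = 100, 2000 # sampling step and assumed training-iteration period

def rack_power(t_ms):
    in_compute_phase = (t_ms % PERIOD_MS) < (PERIOD_MS * 0.7)   # ~70% of each cycle at peak
    base = PEAK_KW if in_compute_phase else IDLE_KW
    return base + random.uniform(-2, 2)                         # small per-rack jitter

for t in range(0, 10_000, STEP_MS):                             # ten seconds of samples
    total = ANCILLARY_KW + sum(rack_power(t) for _ in range(COMPUTE_RACKS))
    print(f"{t / 1000:4.1f} s  {total:6.0f} kW")
```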

Current mitigation options

There are several factors that data center operators may want to consider when deploying compute clusters dedicated to training large, transformer-based AI models. Currently, data center operators have a limited toolkit for fully handling large power fluctuations in a power distribution system, particularly for avoiding passing them on to the source in their full extent. However, in collaboration with the IT infrastructure team/tenant, it should be possible to minimize fluctuations:

  • Mix with diverse IT loads, share generators. The best first option is to integrate AI training compute with other, diverse IT loads in a shared power infrastructure. This helps to diminish the effects of power fluctuations, particularly on generator sets. For dedicated AI training data center infrastructure installations, this may not be an option for power distribution. However, sharing engine generators will go a long way to dampen the effects of AI power fluctuations.
    Among power equipment, engine generator sets will be the most stressed if exposed to the full extent of the fluctuations seen in a large, dedicated AI training infrastructure. Even if correctly sized for the peak load, generators may struggle with large and fast fluctuations — for example, the total facility load stepping from 45% to 50% of design capacity to 80% to 85% within a second, then dropping back to 45% to 50% after two seconds, on repeat. Such fluctuation cycles may be close to what the engines can handle, at the expense of reduced expected life or outright failure.
  • Select UPS configurations to minimize power quality issues and overloads. Even if a smaller frame can handle the fluctuations, according to the vendors, larger systems will carry more capacitance to help absorb the worst of the fluctuations, maintaining voltage and frequency within performance specifications. An additional measure is to use a higher capacity redundancy configuration, for example, by opting for N+2. This allows for UPS maintenance while avoiding any repeated overloads on the operational UPS systems, some of which might hit the battery energy storage system.
  • Use server performance/power management tools. Power and performance management of hardware remain largely underused, despite their ability to not only improve IT power efficiency but also contribute to the overall performance of the data center infrastructure. Even though AI compute clusters feature some exotic interconnect subsystems, they are essentially standard servers using standard hardware and software. This means there are a variety of levers to manage the peaks in their power and performance levels, such as power capping, turning off boost clocks, limiting performance states, or even setting lower temperature limits.
    To address the low end of fluctuations, switching off server energy-saving modes — such as silicon sleep states (known as C-states in CPU parlance) — can help raise the IT hardware’s power floor. A more advanced technique involves limiting the rate of power change (including on the way down). This feature, called “power smoothing”, is available through Nvidia’s System Management Interface on the latest generation of Blackwell GPUs. A sketch of one such lever, power capping, follows this list.
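As an example of one such lever, the hedged sketch below applies a GPU power cap through NVML (via the nvidia-ml-py / pynvml bindings). The 85% cap is purely illustrative, and setting limits typically requires administrative privileges.

```python
# Hedged sketch of GPU power capping via NVML (the nvidia-ml-py / pynvml bindings).
# Capping trims the top of the power excursions described above; the 85% figure is an
# arbitrary illustration, and setting limits typically requires root privileges.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Constraints are reported in milliwatts.
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cap_mw = int(max_mw * 0.85)                     # illustrative 85% cap
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, max(cap_mw, min_mw))
        print(f"GPU {i}: capped at {cap_mw / 1000:.0f} W (max {max_mw / 1000:.0f} W)")
finally:
    pynvml.nvmlShutdown()
```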

Electrical equipment manufacturers are investigating the merits of additional rapid discharge/recharge energy storage and updated controls to UPS units with the aim of shielding the power source from fluctuations. These approaches include super capacitors, advanced battery chemistries or even flywheels that can tolerate frequent, short duration but high-powered discharge and recharge cycles. Next-generation AI compute systems may also include more capacitance and energy storage to limit fluctuations on the data center power system. Ultimately, it is often best to address an issue at its root (in this case the IT hardware and software) rather than treat the symptoms, although these may lie outside the control of data center facilities teams.


The Uptime Intelligence View

Most of the time, data center operators do not need to be overly concerned with the power profile of the IT hardware or the specifics of the associated workloads — rack density estimates were typically overblown to begin with, and overall capacity utilization tends to stay well below 100%. Even so, safety margins, which are expensive, can be kept thin. However, training large transformer models is different. The specialized compute hardware can be extremely dense, creates large power swings, and is capable of producing frequent power surges that are close to or even above its hardware power rating. This will force data center operators to reconsider their approach to both capacity planning and safety margins across their infrastructure.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute


Cloud: when high availability hurts sustainability

In recent years, the environmental sustainability of IT has become a significant concern for investors and customers, as well as regulatory, legislative and environmental stakeholders. This concern is expected to intensify as the impact of climate change on health, safety and the global economy becomes more pronounced. It has given rise to an assortment of voluntary and mandatory initiatives, standards and requirements that collectively represent, but do not yet define, a basic framework for sustainable IT.

Cloud providers have come under increasing pressure from both the public and governments to reduce their carbon emissions. Their significant data center footprints consume considerable energy to deliver an ever-increasing range of cloud services to a growing customer base. The recent surge in generative AI has thrust the issues of power and carbon further into the spotlight.

Cloud providers have responded with large investments in renewable energy and energy attribute certificates (EACs), widespread use of carbon offsets and the construction of high-efficiency data centers. However, the effectiveness of these initiatives and their impact on carbon emissions vary significantly depending on the cloud provider. While all are promoting an eco-friendly narrative, unwrapping their stories and marketing campaigns to find meaningful facts and figures is challenging.

These efforts and initiatives have garnered considerable publicity. However, the impact of customer configurations on carbon emissions can be considerable and often overlooked. To build resiliency into cloud services, users face a range of options, each carrying its own carbon footprint. This report examines how resiliency affects carbon emissions.

Sustainability is the customer’s responsibility

The reduction of hyperscaler data center carbon emissions is being fought on two fronts. First, service providers are transitioning to lower-carbon energy sources. Second, cloud customers are being encouraged to optimize their resource usage through data and reporting to help lower carbon emissions.

Cloud provider responsibilities

Data centers consume significant power. To reduce their carbon impact, many cloud providers are investing in carbon offsets — these are projects with a negative carbon impact that can balance or negate carbon emissions by a specified weight.

Renewable energy certificates (RECs) are tradable, non-tangible energy commodities. Each REC certifies that the holder has used or will use a quantity of electricity generated from a renewable source, thus avoiding the need for carbon emission offsets for that power use.

Cloud providers can use both offsets and RECs to claim their overall carbon emissions are zero. However, this does not equate to zero carbon production; instead, it means providers are balancing their emissions by accounting for a share of another organization’s carbon reductions.

Although cloud providers are making their own environmental changes, responsibility for sustainability is also being shared with users. Many providers now offer access to carbon emissions information via online portals and application programming interfaces (APIs), aiming to appear “green” by supporting users to measure, report and reduce carbon emissions.

Customer responsibilities

In public cloud, application performance and resiliency are primarily the responsibility of the user. While cloud providers offer services to their customers, they are not responsible for the efficiency or performance of the applications that customers build.

The cloud model lets customers consume services when they are needed. However, this flexibility and freedom can lead to overconsumption, increasing both costs and carbon emissions.

Tools and guidelines are available to help customers manage their cloud usage. Typical recommendations include resizing virtual machines to achieve higher utilization or turning off unused resources. However, these are only suggestions; it is the customer’s job to implement any changes.

Since cloud providers charge based on the resources used, helping customers to reduce their cloud usage is likely to also reduce their bills, which in the short term may impact provider revenue. However, cloud providers are willing to take this risk, betting that helping customers lower both carbon emissions and costs will increase overall revenue in the longer term.

Cloud customers are also encouraged to move workloads to regions with less carbon-heavy electricity supplies. This can often result in lower costs and lower carbon emissions — a win-win. However, it is up to the customer to implement these changes.

Cloud users face a challenging balancing act: they need to architect applications that are available, cost-effective and have a low carbon footprint. Even with the aid of tools, achieving this balance is far from easy.

Previous research

In previous reports comparing cost, carbon emissions and availability between architectures, Uptime Intelligence started by defining an unprotected baseline. This is an application situated in a single location and not protected from the loss of an availability zone (a data center) or region (a collection of closely connected data centers). Then, other architectures were designed to distribute resources across availability zones and regions so that the application could operate during outages. The costs of these new architectures were compared with the price of the baseline to assess how increased availability affects cost.

Table 1 provides an overview of these architectures. A full description can be found in Build resilient apps: do not rely solely on cloud infrastructure.

Table 1 Summary of application architecture characteristics

An availability percentage for 2024 was calculated for each architecture using historical status update information. In the cloud, applications are charged based on the resources consumed to deliver that application. An application architected across multiple locations uses more resources than one deployed in a single location. In Cloud availability comes at a price, the cost of using each application was calculated.

Finally, in this report, Uptime Intelligence calculates the carbon emissions for each architecture and combines this with the availability and cost data.

Carbon versus cost versus downtime

Figure 1 combines availability, cost and carbon emissions into a single chart. The carbon quantities are based on the location-based Scope 2 emissions, which are associated with the electricity consumed by the data center. The availability of the architectures is represented by bubble sizes: inner rings indicate the average annual downtime across all regions in 2024, while the outer rings show the worst-case regional downtime. The axes display cost and carbon premiums, which reflect additional costs and carbon emissions relative to the unprotected baseline. The methodology for calculating carbon is included as an appendix at the end of this report.

Figure 1 Average and worst-case regional availabilities by carbon and cost

Findings

Figure 1 shows that the cost premium is linearly proportional to carbon emissions — a rise in cost directly corresponds to an increase in carbon emissions, and vice versa. This proportionality makes sense: designing for resiliency uses more resources across multiple regions. Due to the cloud’s consumption-based pricing model, more resources equate to higher costs. And with more resources, more servers are working, which produces more carbon emissions.

However, higher costs and carbon emissions do not necessarily translate into better availability. As shown in Figure 1, the size of the bubbles does not always decrease with an increase in cost and carbon. Customers, therefore, do not have to pay the highest premiums in cash and carbon terms to obtain good availability. However, they should expect that resilient applications will require additional expenditure and produce more carbon emissions.

A good compromise is to architect the application across regions using a pilot light configuration. This design provides an average annual downtime of 2.6 hours, a similar level of availability to the equivalent dual region active-active configuration, but with roughly half the cost and carbon emissions.

Even if this architecture were deployed across the worst-performing regions, downtime would remain relatively low at 5.3 hours, which is still consistent with the more expensive resilient design.

However, although the cost and carbon premiums of the pilot light design are at the midpoint in our analysis, they are still high. Compared with an unprotected application, a dual region pilot light configuration produces double the carbon emissions and costs 50% more.

For those organizations looking to keep emissions and costs low, a dual zone active-failover provides an average downtime of 2.9 hours per year at a cost premium of 14% and a carbon premium of 38%. However, it is more susceptible to regional failures — in the worst-performing regions, downtime increases almost fourfold to 10.8 hours per year.
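For reference, the downtime figures quoted above convert to availability percentages as follows, assuming 8,760 hours in a non-leap year:

```python
# Quick conversion of the annual downtime figures quoted above into availability
# percentages (8,760 hours per non-leap year).
HOURS_PER_YEAR = 8760

configs = {
    "dual region pilot light (average)": 2.6,
    "dual region pilot light (worst regions)": 5.3,
    "dual zone active-failover (average)": 2.9,
    "dual zone active-failover (worst regions)": 10.8,
}

for name, downtime_h in configs.items():
    availability = (1 - downtime_h / HOURS_PER_YEAR) * 100
    print(f"{name}: {availability:.3f}% available")   # e.g., 2.6 h -> ~99.970%
```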

Conclusions

In all examined cases, increases in carbon are substantial. High availability inevitably comes with an increase in carbon emissions. Enterprises need to decide what compromises they are willing to make between low cost, low carbon and high availability.

These trade-offs should be evaluated during the design phase, before implementation. Ironically, most tools provided by cloud providers focus only on reporting and optimizing current resource usage rather than helping assess the impact of potential architectures.

AWS provides its Customer Carbon Footprint Tool, Google offers a Cloud Carbon Footprint capability, Microsoft delivers an Emissions Impact Dashboard for Azure, IBM has a Cloud Carbon Calculator, and Oracle Cloud has its OCI Sustainability Dashboard. These tools aid carbon reporting and may make recommendations to reduce carbon emissions. However, they do not suggest fundamental changes to the architecture design based on broader requirements such as cost and availability.

Considering the direct relationship between carbon emissions and cost, organizations can take some comfort in knowing that architectures built with an awareness of cost optimization are also likely to reduce emissions. In AWS’s Well-Architected framework for application development, the Cost Optimization pillar and the Sustainability pillar share similarities, such as turning off unused resources and sizing virtual machines correctly. Organizations should investigate if their cost optimization developments can also reduce carbon emissions.


The Uptime Intelligence View

The public cloud may initially appear to be a low-cost, low-carbon option. However, customers aiming for high availability should architect their applications across availability zones and regions. More resources running in more locations equates to higher costs (due to the cloud’s consumption-based pricing) and increased carbon emissions (due to the use of multiple physical resources). Ultimately, those developing cloud applications need to decide where their priorities lie regarding cost reduction, environmental credentials and user experience.

Appendix: methodology

The results presented in this report should not be considered prescriptive but hypothetical use cases. Readers should perform their own analyses before pursuing or avoiding any action.

Data is obtained from the Cloud Carbon Footprint (CCF) project, an open-source tool for analyzing carbon emissions. This initiative seeks to aid users in measuring and reducing the carbon emissions associated with their public cloud use.

The CCF project uses several sources, including the SPECpower database, to calculate power consumption for various cloud services hosted on AWS, Google and Microsoft Azure. SPECpower is a database of power consumption at various utilization points for various servers. Power is converted to an estimate of carbon emissions using data from the European Environment Agency, the US Environmental Protection Agency and carbonfootprint.com.

Uptime Intelligence used the CCF’s carbon and power assumptions to estimate carbon emissions for several cloud architectures. We consider the CCF’s methodology and assumptions reasonable enough to compare carbon emissions based on cloud architecture. However, we cannot state that the CCF’s tools, methods and assumptions suit all purposes. That said, the project’s open-source and collaborative nature means it is more likely to be an unbiased and fair methodology than those offered by cloud providers.

The CCF’s methodology details are available on the project’s website and in the freely accessible source code. See cloudcarbonfootprint.org/docs/methodology.

For this research, Uptime Intelligence based its calculations on Amazon Web Services (AWS). Not only is AWS the market leader, but it also provides sufficiently detailed information to make an investigation possible. Other public cloud services have similar pricing models, services and architectural principles — this report’s fundamental analysis will apply to other cloud providers. AWS costs are obtained from the company’s website and carbon emissions are obtained from the CCF project’s assumptions for AWS. We used an m5.large virtual machine in us-east-1 for our architecture.
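The sketch below outlines, in simplified form, the CCF-style calculation: interpolate server power between idle and maximum wattage at a given utilization, apply a PUE, then multiply by a grid emission factor. The coefficients shown are illustrative placeholders, not the CCF project’s published values for an m5.large in us-east-1.

```python
# Simplified sketch of a CCF-style estimate for a single virtual machine: interpolate
# server power between idle and max watts for a given utilization, apply a PUE, then
# multiply by a grid emission factor. The coefficients below are illustrative
# placeholders, not the CCF project's published values for an m5.large in us-east-1.

def monthly_kgco2e(min_watts, max_watts, utilization, pue,
                   grid_kgco2e_per_kwh, hours=730):
    avg_watts = min_watts + (max_watts - min_watts) * utilization
    kwh = avg_watts / 1000 * hours * pue
    return kwh * grid_kgco2e_per_kwh

# Example with placeholder coefficients: 50% utilization, PUE 1.2, 0.4 kg CO2e/kWh grid
print(f"{monthly_kgco2e(5.0, 20.0, 0.5, 1.2, 0.4):.2f} kg CO2e per month")
# 12.5 W average -> ~9.1 kWh IT energy -> ~11.0 kWh with PUE -> ~4.4 kg CO2e
```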

Table 2 shows the carbon emissions calculations based on these sources.

Table 2 Carbon emissions calculations

The following Uptime Institute expert was consulted for this report:
Jay Dietrich, Research Director of Sustainability, Uptime Institute


The two sides of a sustainability strategy

While much has been written, said and taught about data center sustainability, there is still limited consensus on the definition and scope of an ideal data center sustainability strategy. This lack of clarity has created much confusion, encouraged many operators to pursue strategies with limited results, and enabled some to make claims that are ultimately of little worth.

To date, the data center industry has adopted three broad, complementary approaches to sustainability:

  • Facility and IT sustainability. This approach prioritizes operational efficiency, minimizing the energy, direct carbon and water footprints of IT and facility infrastructure. It directly addresses the operational impacts of individual facilities, reducing material and energy use and costs. Maximizing the sustainability of individual facilities is key to addressing the increased government focus on regulating individual data centers.
  • Ecosystem sustainability. This strategy focuses on carbon neutrality (or carbon negativity), water positivity and nature positivity across the enterprise. Ecosystem sustainability offsets the environmental impacts of an enterprise’s operations, which may increase business costs.
  • Overall sustainability. While some data center operators promote the sustainability of their facilities with limited efforts on ecosystem sustainability, others build their brand around ecosystem sustainability with minimal discussion about the sustainability of their facilities. Although it is common for organizations to make efforts in both areas, it is less common for the strategies to be integrated as a part of a coherent plan.

Each approach has its own benefits and challenges, providing different levels of business and environmental performance improvement. This report is an extension and update to the Sustainability Series of reports, published by Uptime Intelligence in 2022 (see below for a list of the reports), which detailed the seven elements of a sustainability strategy.

Data center sustainability

Data center sustainability involves incorporating sustainability and efficiency considerations into siting, design and operational processes throughout a facility’s life. The organizations responsible for siting and design, IT operations, facility operations, procurement, contracting (colocation and cloud operators) and waste management must embrace the enterprise’s overall sustainability strategy and incorporate it into their daily operations.

Achieving sustainability objectives may require a more costly initial investment for an individual facility, but the reward is likely an overall lower cost of ownership over its life. To implement a sustainability strategy effectively, an operator must address the full range of sustainability elements:

  • Siting and design. Customer and business needs dictate a data center’s location. Typically, multiple sites will satisfy these criteria; however, the location should also be selected based on whether it can help optimize the facility’s sustainability performance. Operators should focus on maximizing free cooling and carbon-free energy consumption while minimizing energy and water consumption. The design should choose equipment and materials that maximize the facility’s environmental performance.
  • Cooling system. The design should minimize water and energy use, including capturing available free-cooling hours. In water-scarce or water-stressed regions, operators should deploy waterless cooling systems. Where feasible and economically viable, heat reuse systems should also be incorporated into the design.
  • Standby power system. The standby power system design should enable fuel flexibility (able to use low-carbon or carbon-free fuels) and provide primary power capability. It should be capable and permitted to deliver primary power for extended periods. This enables the system to support grid reliability and assist in addressing the intermittency of wind and solar generation contracted to supply power to the data center, thereby reducing the carbon intensity of the electricity consumption.
  • IT infrastructure efficiency. IT equipment should be selected to maximize the average work delivered per watt of installed capacity. The installed equipment should run at or close to the highest practical utilization level of the installed workloads while meeting their reliability and resiliency requirements. IT workload placement and management software should be used to monitor and optimize the IT infrastructure performance.
  • Carbon-free energy consumption. Operators should work with electricity utilities, energy retailers, energy developers and regulators to maximize the quantity of clean energy consumed and minimize location-based emissions. Over time, they should plan to increase carbon-free energy consumption to 90% or more of the total consumption. Timelines will vary by region depending on the economics and availability of carbon-free energy.
  • End-of-life equipment reuse and materials recovery. Operators need an end-of-life equipment management process that maximizes the reuse of equipment and components, both within the organization and through refurbishment and use by others. Where equipment must be scrapped, there should be a process in place to recover valuable metals and minerals, as well as energy, through environmentally responsible processes.  
  • Scope 3 emissions management. Operators should require key suppliers to maintain a sustainability strategy, publicly disclose their greenhouse gas (GHG) emissions inventory and reduction goals, and demonstrate progress toward their sustainability objectives. There should be consequences in place for suppliers that fail to show reasonable progress.

While these strategies may appear simple, creating and executing a sustainability strategy requires the commitment of the whole organization — from technicians and engineers to procurement, finance and executive leadership. In some cases, financial criteria may need to shift from considering the initial upfront costs to the total cost of ownership and the revenue benefits/enhancements gained from a demonstrably sustainable operation. A data center sustainability strategy can enhance business and environmental performance.

Ecosystem sustainability

An ecosystem sustainability strategy emphasizes mitigating and offsetting the environmental impacts of an operator’s data center portfolio. While these efforts do not change the environmental operating profile of individual data centers, they are designed to benefit the surrounding community and natural environment. Such projects and environmental offsets are typically managed at the enterprise level rather than the facility level and represent a cost to the enterprise.

  • Carbon-neutral or carbon-negative operations. Operators should purchase energy attribute certificates (EACs) and carbon capture offsets to reduce or eliminate their Scope 1, 2 and 3 emissions inventory. The offsets are generated primarily from facilities geographically separate from the data center facilities. EACs and offsets can be purchased directly from brokers or from operators of carbon-free energy or carbon capture systems.
  • Water-positive operations. Operators should work with communities and conservation groups to implement water recharge and conservation projects that return more water to the ecosystem than is used across their data centers. Examples include wetlands reclamation, water replenishment, support of sustainable agriculture, and leak detection and minimization systems for water distribution networks. These projects can benefit the local watershed or unrelated, geographically distinct watersheds.
  • Nature-positive facilities. The data center or campus should be landscaped to regenerate and integrate with the natural landscape and local ecosystem. Rainwater and stormwater should be naturally filtered and reused where practical. The landscape should be designed and managed to support local flora and fauna, ensuring that the overall campus is seamlessly integrated into the local ecosystem. The overall intent is to make the facility as “invisible” as possible to the local community.
  • Emissions reductions achieved with IT tools. Some operators and data center industry groups quantify and promote the emissions reduction benefits (known as Scope 4 “avoided emissions”) generated from the operation of the IT infrastructure. They assert that the “avoided emissions” achieved through the application of IT systems to increase the operational efficiency of systems or processes, or “dematerialize” products, can offset some or all of the data center infrastructure’s emissions footprint. However, these claims should be approached with caution, as there is a high degree of uncertainty in the calculated quantities of “avoided emissions.”
  • Pro-active work with supply chains. Some operators work directly with supply chain partners to decarbonize their operations. This approach is practical when an enterprise represents a significant percentage of a supplier’s revenue. However, it becomes impractical when an operator’s purchases represent only a small percentage of the supplier’s business.

Ecosystem sustainability seeks to deliver environmental performance improvements to operations and ecosystems outside the operator’s direct control. These improvements compensate for and offset any remaining environmental impacts following the full execution of the data center sustainability strategy. They typically represent a business cost and enhance an operator’s commercial reputation and brand.

Where to focus

Facility and IT and ecosystem sustainability strategies are complementary, addressing the full range of sustainability activities and opportunities. In most organizations, it will be necessary to cover all of these areas, often by different teams focusing on their respective domains.

An operator’s primary focus should be improving the operational efficiency and sustainability performance of its data centers. Investments in the increased use of free cooling, automated control of chiller and IT space cooling systems, and IT consolidation projects can yield significant energy, water and cost savings, along with reductions in GHG emissions. These improvements will not only reduce the environmental footprint of the data center but can also improve its business performance.

These efforts also enable operators to proactively address emerging regulatory and standards frameworks. Such regulations are intended to increase the reporting of operating data and metrics and may ultimately dictate minimum performance standards for data centers.

To reduce the Scope 2 emissions (purchased electricity) associated with data center operations to zero, operators need to work with utilities, energy retailers, and the electricity transmission and distribution system operators. The shared goal is to help build a resilient, interconnected electricity grid populated by carbon-free electricity generation and storage systems — a requirement for government net-zero mandates.

Addressing ecosystem sustainability opportunities is a valuable next step in an operator’s sustainability journey. Ecosystem projects can enhance the natural environment surrounding the data facility, improve the availability of carbon-free energy and water resources locally and globally, and directly support, inform and incentivize the sustainability efforts of customers and suppliers.

Data center sustainability should be approached in two separate ways: first, the infrastructure itself and, second, the ecosystem. Confusion and overlap between these two aspects can lead to unfortunate results. For example, in many cases, a net-zero and water-positive data center program is (wrongly) accepted as an indication that an enterprise is operating a sustainable data center infrastructure.


The Uptime Intelligence View

Operators should prioritize IT and facilities sustainability over ecosystem sustainability. The execution and results of an IT and facilities sustainability strategy directly minimize the environmental footprint of a data center portfolio, while maximizing its business and sustainability performance.

Data reporting and minimum performance standards embodied in enacted or proposed regulations are focused on the operation of the individual data centers, not the aggregated enterprise-level sustainability performance. An operator must demonstrate that they have a highly utilized IT infrastructure (maximized work delivered per unit of energy consumed) and minimized the energy and water consumption and GHG emissions associated with its facility operations.

Pursuing an ecosystem sustainability strategy is the logical next step for operators that want to do more and further enhance their sustainability credentials. However, an ecosystem sustainability strategy should not be pursued at the expense of an IT and facilities strategy, nor used to shield poor or marginal facility and IT systems performance.

The following Uptime Institute expert was consulted for this report:
Jay Paidipati, Vice President Sustainability Program Management, Uptime Institute

Other related reports published by Uptime Institute include:
Creating a sustainability strategy
IT Efficiency: the critical core of sustainability
Three key elements: water, circularity and siting
Navigating regulations and standards
Tackling greenhouse gases
Reducing the energy footprint

