
Received yesterday — 31 January 2026

AI and cooling: toward more automation

AI is increasingly steering the data center industry toward new operational practices, where automation, analytics and adaptive control are paving the way for “dark” — or lights-out, unstaffed — facilities. Cooling systems, in particular, are leading this shift. Yet despite AI’s positive track record in facility operations, one persistent challenge remains: trust.

In some ways, AI faces a similar challenge to that of commercial aviation several decades ago. Even after airlines had significantly improved reliability and safety performance, making air travel not only faster but also safer than other forms of transportation, it still took time for public perceptions to shift.

That same tension between capability and confidence lies at the heart of the next evolution in data center cooling controls. As AI models improve in performance and become better understood, more transparent and more explainable, the question is no longer whether AI can manage operations autonomously, but whether the industry is ready to trust it enough to turn off the lights.

AI’s place in cooling controls

Thermal management systems, such as CRAHs, CRACs and airflow management, represent the front line of AI deployment in cooling optimization. Their modular nature enables the incremental adoption of AI controls, providing immediate visibility and measurable efficiency gains in day-to-day operations.

AI can now be applied across four core cooling functions:

  • Dynamic setpoint management. Continuously recalibrates temperature, humidity and fan speeds to match load conditions (see the sketch after this list).
  • Thermal load forecasting. Predicts shifts in demand and makes adjustments in advance to prevent overcooling or instability.
  • Airflow distribution and containment. Uses machine learning to balance hot and cold aisles and stage CRAH/CRAC operations efficiently.
  • Fault detection, predictive and prescriptive diagnostics. Identifies coil fouling, fan oscillation, or valve hunting before they degrade performance.
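
To make the first of these functions concrete, the following is a minimal, hypothetical sketch of a dynamic setpoint loop in Python. Sensor readings are simulated, and the target temperature, limits and gain are illustrative assumptions rather than values from any vendor mentioned here; a production controller would add forecasting, interlocks and alarm handling.

    import random

    # Operating envelope (assumed for illustration; real limits come from the SLA).
    SUPPLY_AIR_MIN_C = 18.0
    SUPPLY_AIR_MAX_C = 27.0
    TARGET_INLET_C = 24.0
    STEP_C = 0.5          # maximum setpoint change per control cycle

    def read_max_inlet_temp() -> float:
        """Stand-in for rack inlet sensors; returns the hottest reading."""
        return random.uniform(21.0, 27.5)

    def next_setpoint(current: float, max_inlet: float) -> float:
        """Nudge the supply-air setpoint toward the target inlet temperature.

        Hot racks pull the setpoint down (more cooling); cool racks let it rise
        (less cooling, less energy). Changes are rate-limited and clamped to the
        operating envelope so the loop fails safe.
        """
        error = max_inlet - TARGET_INLET_C
        adjustment = max(-STEP_C, min(STEP_C, -0.5 * error))
        return max(SUPPLY_AIR_MIN_C, min(SUPPLY_AIR_MAX_C, current + adjustment))

    setpoint = 22.0
    for cycle in range(5):
        inlet = read_max_inlet_temp()
        setpoint = next_setpoint(setpoint, inlet)
        print(f"cycle {cycle}: hottest inlet {inlet:.1f} C -> supply-air setpoint {setpoint:.1f} C")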

A growing ecosystem of vendors is advancing AI-driven cooling optimization across both air- and water-side applications. Companies such as Vigilent, Siemens, Schneider Electric, Phaidra and Etalytics offer machine learning platforms that integrate with existing building management systems (BMS) or data center infrastructure management (DCIM) systems to enhance thermal management and efficiency.

Siemens’ White Space Cooling Optimization (WSCO) platform applies AI to match CRAH operation with IT load and thermal conditions, while Schneider Electric, through its Motivair acquisition, has expanded into liquid cooling and AI-ready thermal systems for high-density environments. In parallel, hyperscale operators, such as Google and Microsoft, have built proprietary AI engines to fine-tune chiller and CRAH performance in real time. These solutions range from supervisory logic to adaptive, closed-loop control. However, all share a common aim: improve efficiency without compromising compliance with service level agreements (SLAs) or operator oversight.

The scope of AI adoption

While IT cooling optimization has become the most visible frontier, conversations with AI control vendors reveal that most mature deployments still begin at the facility water loop rather than in the computer room. Vendors often start with the mechanical plant and facility water system because these areas present fewer variables, such as temperature differentials, flow rates and pressure setpoints, and can be treated as closed, well-bounded systems.

This makes the water loop a safer proving ground for training and validating algorithms before extending them to computer room air cooling systems, where thermal dynamics are more complex and influenced by containment design, workload variability and external conditions.

Predictive versus prescriptive: the maturity divide

AI in cooling is evolving along a maturity spectrum — from predictive insight to prescriptive guidance and, increasingly, to autonomous control. Table 1 summarizes the functional and operational distinctions among these three stages of AI maturity in data center cooling.

Table 1 Predictive, prescriptive, and autonomous AI in data center cooling

Most deployments today stop at the predictive stage, where AI enhances situational awareness but leaves action to the operator. Achieving full prescriptive control will require not only a deeper technical sophistication but also a shift in mindset.

Technically, it is more difficult to engineer because the system must not only forecast outcomes but also choose and execute safe corrective actions within operational limits. Operationally, it is harder to trust because it challenges long-held norms about accountability and human oversight.

The divide, therefore, is not only technical but also cultural. The shift from informed supervision to algorithmic control is redefining the boundary between automation and authority.

AI’s value and its risks

No matter how advanced the technology becomes, cooling exists for one reason: maintaining environmental stability and meeting SLAs. AI-enhanced monitoring and control systems support operating staff by:

  • Predicting and preventing temperature excursions before they affect uptime.
  • Detecting system degradation early and enabling timely corrective action.
  • Optimizing energy performance under varying load profiles without violating SLA thresholds.

Yet efficiency gains mean little without confidence in system reliability. It is also important to clarify that AI in data center cooling is not a single technology. Control-oriented machine learning models, such as those used to optimize CRAHs, CRACs and chiller plants, operate within physical limits and rely on deterministic sensor data. These differ fundamentally from language-based AI models such as GPT, where “hallucinations” refer to fabricated or contextually inaccurate responses.

At the Uptime Network Americas Fall Conference 2025, several operators raised concerns about AI hallucinations — instances where optimization models generate inaccurate or confusing recommendations from event logs. In control systems, such errors often arise from model drift, sensor faults, or incomplete training data, not from the reasoning failures seen in language-based AI. When a model’s understanding of system behavior falls out of sync with reality, it can misinterpret anomalies as trends, eroding operator confidence faster than it delivers efficiency gains.
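
One common safeguard against this kind of drift (a general pattern, not a description of any vendor's product) is to compare a model's predictions against measured values and flag the model when the residual error trends upward. A minimal sketch, with illustrative window size and threshold:

    from collections import deque
    from statistics import mean

    class DriftMonitor:
        """Flags a control model whose predictions diverge from measured conditions.

        The window size and error threshold are illustrative assumptions.
        """
        def __init__(self, window: int = 96, max_mean_error_c: float = 1.0):
            self.residuals = deque(maxlen=window)   # e.g., 96 samples = 24 h at 15-minute intervals
            self.max_mean_error_c = max_mean_error_c

        def update(self, predicted_c: float, measured_c: float) -> bool:
            """Record one prediction/measurement pair; return True if drift is suspected."""
            self.residuals.append(abs(predicted_c - measured_c))
            window_full = len(self.residuals) == self.residuals.maxlen
            return window_full and mean(self.residuals) > self.max_mean_error_c

    monitor = DriftMonitor()
    if monitor.update(predicted_c=23.5, measured_c=26.2):
        print("Model drift suspected: revert to rule-based control and schedule retraining")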

The discomfort is not purely technical; it is also human. Many data center operators remain uneasy about letting AI take the controls entirely, even as they acknowledge its potential. In AI’s ascent toward autonomy, trust remains the runway still under construction.

Critically, modern AI control frameworks are being designed with built-in safety, transparency and human oversight. For example, Vigilent, a provider of AI-based optimization controls for data center cooling, reports that its optimizing control switches to “guard mode” whenever it is unable to maintain the data center environment within tolerances. Guard mode brings on additional cooling capacity (at the expense of power consumption) to restore SLA-compliant conditions. Typical triggers include rapid drift or temperature hot spots. In addition, a manual override option enables the operator to take control, supported by monitoring and event logs.

This layered logic provides operational resiliency by enabling systems to fail safely: guard mode ensures stability, manual override guarantees operator authority, and explainability, via decision-tree logic, keeps every AI action transparent. Even in dark-mode operation, alarms and reasoning remain accessible to operators.
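
The layered pattern described above can be illustrated with a short sketch: optimize when the environment is within tolerance, fall back to a conservative guard mode when it is not, and always defer to a manual override. The mode names and thresholds below are hypothetical and not drawn from Vigilent's implementation.

    from enum import Enum, auto

    class Mode(Enum):
        OPTIMIZE = auto()   # AI adjusts setpoints for efficiency
        GUARD = auto()      # bring on extra cooling capacity, ignore efficiency
        MANUAL = auto()     # operator has taken control

    def select_mode(manual_override: bool, max_inlet_c: float, sla_limit_c: float = 27.0) -> Mode:
        """Pick the control mode for the next cycle; operator override always wins."""
        if manual_override:
            return Mode.MANUAL
        if max_inlet_c > sla_limit_c:   # out of tolerance: fail safe and spend energy
            return Mode.GUARD
        return Mode.OPTIMIZE

    # Example: a hot spot pushes the hottest inlet past the SLA limit.
    print(select_mode(manual_override=False, max_inlet_c=28.4))   # Mode.GUARD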

These frameworks directly address one of the primary fears among data center operators: losing visibility into what the system is doing.

Outlook

Gradually, the concept of a dark data center, one operated remotely with minimal on-site staff, has shifted from interesting theory to desirable strategy. In recent years, many infrastructure operators have increased their use of automation and remote-management tools to enhance resiliency and operational flexibility, while also compensating for low staffing levels. Cooling systems, particularly those governed by AI-assisted control, are now central to this operational transformation.

Operational autonomy does not mean abandoning human control; it means achieving reliable operation without the need for constant supervision. Ultimately, a dark data center is not about turning off the lights; it is about turning on trust.


The Uptime Intelligence View

AI in thermal management has evolved from an experimental concept into an essential tool, improving efficiency and reliability across data centers. The next step — coordinating facility water, air and IT liquid cooling systems — will define the evolution toward greater operational autonomy. However, the transition to “dark” operation will be as much cultural as it is technical. As explainability, fail-safe modes and manual overrides build operator confidence, AI will gradually shift from copilot to autopilot. The technology is advancing rapidly; the question is how quickly operators will adopt it.

The post AI and cooling: toward more automation appeared first on Uptime Institute Blog.

LTL industry meets in Atlanta

29 January 2026 at 22:20



The state of the freight economy, rise of artificial intelligence (AI), and accelerating levels of fraud across the trucking industry topped the agenda at SMC3 JumpStart26, an annual supply chain education event held in Atlanta earlier this week.

JumpStart brings together professionals from across the less-than-truckload (LTL) industry for three days of networking, presentations, and panel discussions on the issues affecting the industry. More than 500 people turned out for the event, which was held at the Renaissance Atlanta Waverly, January 26-28.

The freight economy continues to be marked by uncertainty, despite some bright spots on the broader economic horizon, according to economist Keith Prather of Armada Corporate Intelligence, who gave an economic update on the first day of the conference. Prather cited consumer spending on services rather than goods, tariff volatility, and an unhealthy housing market as persistent drags on the freight economy. Bright spots include anticipated tax refunds that may boost consumer spending on goods later this year, a slowly improving residential construction market that could help spur freight movement, and strong GDP growth heading into 2026.

Touting AI

AI dominated much of the discussion over the three days, with LTL freight carriers, third-party logistics (3PL) providers, and technology companies detailing how the technology can be used to improve operations within companies and across the industry. Speakers included Mark Albrecht, vice president of artificial intelligence and enterprise strategy at 3PL C.H. Robinson.

Albrecht’s talk coincided with the company’s launch of AI agents aimed at combatting missed LTL pickups. Two new AI agents are tracking down missed pickups and using advanced reasoning to determine how to keep freight moving, according to a January 26 company announcement. C.H. Robinson said it has automated 95% of checks on missed LTL pickups, saving more than 350 hours of manual work per day, helping shippers’ freight move up to a day faster, and reducing unnecessary return trips to pick up missed freight by 42%. The tools are part of a fleet of more than 30 AI agents C.H. Robinson has developed in house to streamline LTL processes.

Cracking down on fraud

The conference also featured an interview with Derek Barrs, administrator of the Federal Motor Carrier Safety Administration (FMCSA). Barrs addressed the widespread fraud affecting the trucking industry, discussing how FMCSA is working with states and other federal agencies to combat safety problems arising from several issues, including the issuing of non-domiciled commercial driver’s licenses (CDLs), English-language proficiency enforcement, and entry-level driver training programs.

Barrs said FMCSA is working with states to ensure the enforcement of existing English-language proficiency regulations and that “thousands and thousands” of drivers have been placed out of service as a result. FMCSA is also working with states to review their processes for issuing non-domiciled CDLs, which may be granted to non-citizens living in the United States. Much of the problem centers around states issuing licenses to non-citizens for periods of time that exceed their legal status to work in the country. Barrs said most states have stopped issuing non-domiciled CDLs while those processes are being reviewed but said much work remains to fix breakdowns in the system.

Barrs said FMCSA is focused on rooting out bad actors in driver training as well, noting that the agency has removed 6,800 listings from its training provider registry to date and that investigations of driver training schools continue.

Show organizers cited a strong turnout for the event despite the disruption caused by winter storm Fern, which resulted in thousands of flight cancellations nationwide and widespread power outages in the Southeast. The crowd of more than 500 attendees was down from an expected group of more than 700 registrants.

SMC3 will hold its annual Connections event this coming June in Palm Beach, Fla.

Transportation and logistics providers see 2026 as critical year for technology to transform business processes

29 January 2026 at 17:48



In his 40 years leading McLeod Software, one of the nation’s largest providers of transportation management systems for truckers and 3PLs (third-party logistics providers), Tom McLeod has seen many a new technology product introduced with much hype and promise, only to fade in real-world practice and fail to mature into a productive application.

In his view, as new tech players have come and gone, the basic demand from shippers and trucking operators for technology has remained simple and essentially unchanged over time: “Find me a way to use computers and software to get more done in less time and [at a] lower cost,” he says.

“It’s been the same goal, from decades ago when we replaced typewriters, all the way to today finding ways to use artificial intelligence (AI) to automate more tasks, streamline processes, and make the human worker more efficient,” he adds. “Get more done in less time. Make people more productive.”

The difference between now and the pretenders of the past? McLeod and others believe that AI is the real thing. As it continues to develop and mature, they expect it to be incorporated ever more deeply into transportation and logistics planning, execution, and supply chain processes, fundamentally changing how shippers and logistics service providers operate and forcing a reinvention of how they manage the supply chain function.

“But it is not a magic bullet you can easily switch on,” McLeod cautions. “While the capabilities look magical, at some level it takes time to train these models and get them using data properly and then come back with recommendations or actions that can be relied upon,” he adds.

THE DATA CONUNDRUM

One of the challenges is that so much supply chain data today remains highly unstructured—by one estimate, as much as 75%. Converting and consolidating myriad sources and formats of data, and ensuring it is clean, complete, and accurate remains perhaps the biggest challenge to accelerated AI adoption.

Often today when a broker is searching for a truck, entering an order, quoting a load, or pulling a status update, someone is interpreting that text or email, extracting information from the transportation management system (TMS), and creating a response to the customer, explains Doug Schrier, McLeod’s vice president of growth and special projects. “With AI, what we can do is interpret what the email is asking for, extract that, overlay the TMS information, and use AI to respond to the customer in an automated fashion,” he says.

To come up with a price quote using traditional methods might take three or four minutes, he’s observed. An AI-enabled process cuts that down to five seconds. Similarly, entering an order into a system might take four to five minutes. With AI interpreting the email string and other inputs, a response is produced in a minute or less. “So if you are doing [that task] hundreds of times a week, it makes a difference. What you want to do is get the human adding the value and [use AI] to get the mundane out of the workflow.”
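
The workflow Schrier describes (interpret the request, pull the matching TMS data, respond automatically) can be pictured as a short pipeline. The sketch below is a generic illustration rather than McLeod's implementation: the extraction step is stubbed with a regular expression where a production system would call a language model, and all names and rates are hypothetical.

    import re
    from dataclasses import dataclass

    @dataclass
    class QuoteRequest:
        origin: str
        destination: str
        weight_lb: int

    def extract_request(email_body: str) -> QuoteRequest:
        """Stub for the AI step: pull origin, destination and weight out of free text.

        A production system would use a language model; a regex keeps this sketch runnable.
        """
        match = re.search(r"from (\w+) to (\w+).*?(\d+)\s*lbs", email_body, re.IGNORECASE | re.DOTALL)
        if not match:
            raise ValueError("could not parse the request")
        return QuoteRequest(match.group(1), match.group(2), int(match.group(3)))

    def price_quote(req: QuoteRequest, rate_per_lb: float = 0.12) -> float:
        """Stand-in for the TMS lookup and dynamic pricing engine."""
        return round(req.weight_lb * rate_per_lb, 2)

    body = "Hi, need a quote from Atlanta to Dallas for one pallet, 1200 lbs, ship Friday."
    request = extract_request(body)
    print(f"Quote for {request.origin} -> {request.destination}: ${price_quote(request)}")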

Yet the growth of AI is happening across a technology landscape that remains fragmented, with some solutions that fit part of the problem, and others that overlap or conflict. Today it’s still a market where there is not one single tech provider that can be all things to all users.

In McLeod’s view, its job is to focus on the mission of providing a highly functional primary TMS platform—and then complement and enhance that with partners who provide a specialized piece of an ever-growing solution puzzle. “We currently have built, over the past three decades, 150 deep partnerships, which equates to about 250 integrations,” says Ahmed Ebrahim, McLeod’s vice president of strategic alliances. “Customers want us to focus on our core competencies and work with best-of-breed parties to give them better choices [and a deeper solution set] as their needs evolve,” he adds.

One example of such a best-of-breed partnership is McLeod’s arrangement with Qued, an AI-powered application developer that provides McLeod TMS clients with connectivity and process automation for every load appointment scheduling mode, whether through a portal, email, voice, or text.

Before Qued was integrated, there were about 18 steps a user had to complete to get an appointment back into the TMS, notes Tom Curee, Qued’s president. With Qued, those steps are reduced to virtually zero and require no human intervention.

As soon as a stop is entered into the TMS, it is immediately and automatically routed to Qued, which reaches out to the scheduling platform or location, secures the appointment, and returns an update into the TMS with the details. It eliminates manual appointment-making tasks like logging on and entering data into a portal, and rekeying or emailing, and it significantly enhances the value and efficiency of this particular workflow activity for McLeod users.
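
As a rough illustration of that hand-off (not Qued's actual API, which is not documented here), the integration can be sketched as an event-driven flow: a new stop triggers a scheduling request, and the confirmed appointment is written back to the TMS. All names and values below are hypothetical.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Stop:
        load_id: str
        location: str

    @dataclass
    class Appointment:
        load_id: str
        scheduled_for: datetime
        confirmation: str

    def request_appointment(stop: Stop) -> Appointment:
        """Stand-in for the scheduling hand-off (portal, email, voice or text in reality)."""
        return Appointment(load_id=stop.load_id,
                           scheduled_for=datetime(2026, 2, 2, 9, 0),
                           confirmation="CONF-0001")

    def on_stop_created(stop: Stop, tms_records: dict) -> None:
        """Event handler: as soon as a stop exists in the TMS, schedule it and write back the result."""
        tms_records[stop.load_id] = request_appointment(stop)

    tms_records = {}
    on_stop_created(Stop(load_id="L-1001", location="Memphis, TN"), tms_records)
    print(tms_records["L-1001"])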

LEGACY SYSTEM PAIN

One of the effects of the three-year freight recession has been its impact on investment. Whereas in better times, logistics and trucking firms would focus on buying tech to reduce costs, enhance productivity, and improve customer service, the constant financial pressure has narrowed that focus.

“First and exclusively, it is now on ‘How do we create efficiency by replacing people and really bring cost levels down because rates are still extremely low and margins really tight,’” says Bart De Muynck, a former Gartner research analyst covering the visibility and supply chain tech space, and now principal at consulting firm Bart De Muynck LLC.

Most industry operators he’s spoken with have looked at AI. One example he cites as ripe for transformation is freight brokerages, “where you have rows and rows of people on the phone.” They are asking the question “Which of these processes or activities can we do with AI?”

Yet De Muynck points to one issue that is proving to be a roadblock to change and transformation. “For many of these companies, their foundational technology is still on older architectural platforms,” in some cases proprietary ones, he notes. “It’s hard to combine AI with those.” And because of years of low margins and cash flow restrictions, “they have not been able to replace their core ERP [enterprise resource planning system] or the TMS for that carrier or broker, so they are still running on very old tech.”

For those players, De Muynck says they will discover a disconcerting reality: the difficulty of trying to apply AI on a platform that is decades old. “That will yield some efficiencies, but those will be short term and limited in terms of replacing manual tasks,” he says.

The larger question, De Muynck says, is “How do you reinvent your company to become more successful? How do we create applications and processes that are based on the new architecture so there is a big [transformative] lift and shift [and so we can implement and deploy foundational pieces fairly quickly]? Then with those solutions build something with AI that is truly transformational and effective.” And, he adds, bring the workforce along successfully in the process.

“People have some things in their jobs they have to do 100 times a day,” often a menial or boring task, De Muynck adds. “AI can automate or streamline those tasks in such a way that it improves the employee’s work experience and job satisfaction, while driving efficiencies. [Rather than eliminate a position], brokers can redirect worker time to higher-value, more complex tasks that need human input, intuition, and leadership.”

“With logistics, you cannot take people completely out of the equation,” he emphasizes. “[The best AI solutions] will be a human paired up with an intelligent AI agent. It will be a combination of people [and their tribal knowledge and institutional experience] and technology,” he predicts.

EYES OPEN

Shippers, truckers, and 3PLs are experiencing an awakening around the possibilities of technologies today and what modern architecture, in-the-cloud platforms, and AI-powered agents can do, says Ann Marie Jonkman, vice president–industry advisory for software firm Blue Yonder. For many, the hardest decision is where to start. It can be overwhelming, particularly in a market environment shaped by chaos, uncertainty, and disruption, where surviving every week sometimes seems a challenge in itself.

“First understand and be clear about what you want to achieve and the problems you want to solve” with a tech strategy, she advises. “Pick two or three issues and develop clear, defined use cases for each. Look at the biggest disruptions—where are the leakages occurring and how do I start?”

Among the most frequently targeted areas of investment she sees is automation in the broad sense: not just physical activity with robotics, but business processes, workflows, and operations. It also is about being able to understand tradeoffs, getting ahead of and removing waste, and moving the organization from a reactionary posture to one that’s more proactive and informed, and can leverage what Jonkman calls “decision velocity.” That places a priority on not only connecting the silos, but also on incorporating clean, accurate, and actionable data into one command center or control tower. When built and deployed correctly, such central platforms can provide near-immediate visibility into supply chain health as well as more efficient and accurate management of the end-to-end process.

Those investments in supply chain orchestration not only accelerate and improve decision-making around stock levels, fulfillment, shipping choices, and overall network and partner performance, but also provide the ability to “respond to disruption and get a handle on the data to monitor and predict disruption,” Jonkman adds. It’s tying together the nodes and flows of the supply chain so “fulfillment has the order ready at the right place and the right time [with the right service]” to reduce detention and ensure customer expectations are met.

It is important for companies not to sit on the sidelines, she advises. Get into the technology transformation game in some form. “Just start somewhere,” even if it is a small project, learn and adapt, and then go from there. “It does not need to be perfect. Perfection can be the enemy of success.”

The speed of technology innovation always has been rapid, and the advent of AI and automation is accelerating that even further, observes Jason Brenner, senior vice president of digital portfolio at FedEx. “We see that as an opportunity, rather than a challenge.”

He believes one of the industry’s biggest challenges is turning innovation into adoption, “ensuring new capabilities integrate smoothly into existing operations and deliver value quickly.” Brenner adds that in his view, “innovation is healthy and pushes everyone forward.”

Execution at scale is where the rubber meets the road. “Delivering technology that works reliably across millions of shipments, geographies, and constantly changing conditions requires deep operational integration, massive data sets, and the ability to test solutions in multiple environments,” he says. “That’s where FedEx is uniquely positioned.”

DEFYING AUTOMATION NO MORE

Before the arrival of the newest forms of AI, “there were shipping tasks that had defied automation for decades,” notes Mark Albrecht, vice president of artificial intelligence for freight broker and 3PL C.H. Robinson. “Humans had to do this repetitive, time-consuming—I might even say mind-numbing—yet essential work.”

Application of early forms of AI, such as machine learning tools and algorithms, provided a hint of what was to come. CHR, which has one of the largest in-house IT development groups in the industry, has been using those for a decade.

Large language models and generative AI were the next big leap. “It’s the advent of agentic AI that opens up new possibilities and holds the greatest potential for transformation in the coming year,” Albrecht says, adding, “Agentic AI doesn’t just analyze or generate content; it acts autonomously to achieve goals like a human would. It can apply reasoning and make decisions.”

CHR has built and deployed more than 30 AI agents, Albrecht says. Collectively, they have performed millions of once-manual tasks—and generated significant benefits. “Take email pricing requests. We get over 10,000 of those a day, and people used to open each one, read it, get a quote from our dynamic pricing engine, and send that back to the customer,” he notes. “Now a proprietary AI agent does that—in 32 seconds.”

Another example is load tenders. “It used to take our people upwards of four hours to get to those through a long queue of emails,” he recalls. That work is now done by an AI agent that reads the email subject line, body, and attachments; collects other needed information; and “turns it into an order in our system in 90 seconds,” Albrecht says. He adds that if the email is for 20 orders, “the agent can handle them simultaneously in the same 90 seconds,” whereas a human would have to handle them sequentially.

Time is money for the shipper at every step of the logistics process. So the faster a rate quote is provided, order created, carrier selected, and load appointment scheduled, the greater the benefits to the shipper. “It’s all about speed to market, which, whether you are a retailer or a manufacturer, often translates into whether you make the sale or keep an assembly line rolling.”

LOOKING AHEAD

Strip away all the hype, and the one tech deliverable that remains table stakes for all logistics providers and their customers is a platform that provides a timely and accurate view into where goods are and with whom, and when they will get to their destination. “First and foremost is real-time visibility that enables customer access to the movement of their product across the supply chain,” says Penske Executive Vice President Mike Medeiros. “Then, getting further upstream and allowing them to be more agile and responsive to disruptions.”

As for AI, “it’s not about replacing [workers]; it’s about pointing them in the right direction and helping [them] get more done in the same amount of time, with a higher level of service and enabling a more satisfying work experience. It’s human capital complemented by AI-powered agents as virtual assistants. We’ve already [started] down that path.”

C.H. Robinson uses AI agents to avoid missed LTL freight pickups

27 January 2026 at 20:33



Logistics provider C.H. Robinson has launched artificial intelligence (AI) agents to combat the problem of missed less-than-truckload (LTL) pickups, the company said.

The new technology is now tracking down missed pickups and using advanced reasoning to determine how to keep freight moving. Those agents are also collecting and analyzing previously unavailable data that LTL carriers are now using to improve their technology, scheduling, and operations.

C.H. Robinson says it launched the initiative because with one truck carrying freight from up to 20 different shippers, LTL shipping requires complex coordination to pick it all up, take it to a terminal, and recombine it on other trucks with other freight heading the same direction. That complexity means that missed pickups and costly delays can ripple through LTL networks.

According to the company, the results are already in: 95% of checks on missed LTL pickups have been automated, saving over 350 hours of manual work per day. And unnecessary return trips to pick up missed freight have been reduced by 42%.

“A missed pickup isn’t just a minor inconvenience,” Greg West, Vice President for LTL, said in a release. “When a truck arrives and the freight or packaging isn’t ready, or the carrier couldn’t make it because they got stuck in traffic, it forces another truck to come back the next day. That might not even be our shipper’s freight, but it creates a domino effect for other freight that was supposed to get picked up and for all the other trucks down the line.”

The new agents join a fleet of more than 30 other AI agents that C.H. Robinson has already built for LTL. They include units that handle LTL price quotes, orders, freight classification, shipment tracking, and proof of delivery.

“We don’t just throw AI at anything and everything. It’s not a hobby for us. We use AI agents only where they can deliver tangible business results,” C.H. Robinson’s vice president for artificial intelligence, Mark Albrecht, said. “Our Lean AI processes helped us uncover the extent of time wasted in handling missed pickups and where artificial intelligence had the most potential to augment our automation software.”

Received before yesterday

ALCF Issues AI for Science Program Call for Proposals, Feb. 27 Deadline

Jan. 20, 2026 — The Argonne Leadership Computing Facility (ALCF) invites proposals for a new collaboration and development program, called APEX, designed to fast-track novel applications of AI in science. This program seeks proposals that apply AI methods in new, creative, or unconventional ways within their domain, such as introducing new AI methods or bringing […]

The post ALCF Issues AI for Science Program Call for Proposals, Feb. 27 Deadline appeared first on Inside HPC & AI News | High-Performance Computing & Artificial Intelligence.

Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

8 January 2026 at 19:43

As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users—from consumers to enterprises—to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt. Through extreme co-design across GPUs, CPUs…

Source

Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate

15 December 2025 at 14:00

Agentic AI systems increasingly rely on collections of cooperating agents—retrievers, planners, tool executors, verifiers—working together across large contexts and long time spans. These systems demand models that deliver fast throughput, strong reasoning accuracy, and persistent coherence over large inputs. They also require a level of openness that allows developers to customize, extend…

Source


How to Build Privacy-Preserving Evaluation Benchmarks with Synthetic Data

12 December 2025 at 16:33

Validating AI systems requires benchmarks—datasets and evaluation workflows that mimic real-world conditions—to measure accuracy, reliability, and safety before deployment. Without them, you’re guessing. But in regulated domains such as healthcare, finance, and government, data scarcity and privacy constraints make building benchmarks incredibly difficult. Real-world data is locked behind…

Source

NVIDIA Blackwell Enables 3x Faster Training and Nearly 2x Training Performance Per Dollar than Previous-Gen Architecture

11 December 2025 at 19:20

AI innovation continues to be driven by three scaling laws: pre-training, post-training, and test-time scaling. Training is foundational to building smarter models, and post-training—which can include fine-tuning, reinforcement learning, and other techniques—helps to further increase accuracy for specific tasks, as well as provide models with new capabilities like the ability to reason.

Source

NVIDIA Kaggle Grandmasters Win Artificial General Intelligence Competition

5 December 2025 at 18:00

NVIDIA researchers on Friday won a key Kaggle competition many in the field treat as a real-time pulse check on humanity’s progress toward artificial general intelligence (AGI). Ivan Sorokin and Jean-Francois Puget, two members of the Kaggle Grandmasters of NVIDIA (KGMoN), came in first on the Kaggle ARC Prize 2025 public leaderboard with a 27.64% score by building a solution evaluated on…

Source

Op-Ed: XPENG’s New Extended-Range EVs Are Actually About Ultra-Fast Charging & AI

20 January 2026 at 16:17

On January 8, XPENG dropped some interesting news about its product roadmap that actually reveals something bigger than just two new extended-range models. The company is doing something different with extended-range EVs — instead of treating them as a workaround for charging infrastructure problems, XPENG is building them as electric-first ... [continued]

The post Op-Ed: XPENG’s New Extended-Range EVs Are Actually About Ultra-Fast Charging & AI appeared first on CleanTechnica.

AI Is Moving to the Water’s Edge, and It Changes Everything

5 January 2026 at 15:00

A new development on the Jersey Shore is signaling a shift in how and where AI infrastructure will grow. A subsea cable landing station has announced plans for a data hall built specifically for AI, complete with liquid-cooled GPU clusters and an advertised PUE of 1.25. That number reflects a well-designed facility, but it highlights an emerging reality. PUE only tells us how much power reaches the IT load. It tells us nothing about how much work that power actually produces.

As more “AI-ready” landing stations come online, the industry is beginning to move beyond energy efficiency alone and toward compute productivity. The question is no longer just how much power a facility uses, but how much useful compute it generates per megawatt. That is the core of Power Compute Effectiveness, PCE. When high-density AI hardware is placed at the exact point where global traffic enters a continent, PCE becomes far more relevant than PUE.

To understand why this matters, it helps to look at the role subsea landing stations play. These are the locations where the massive internet cables from overseas come ashore. They carry banking records, streaming platforms, enterprise applications, gaming traffic, and government communications. Most people never notice them, yet they are the physical beginning of the global internet.

For years, large data centers moved inland, following cheaper land and more available power. But as AI shifts from training to real-time inference, location again influences performance. Some AI workloads benefit from sitting directly on the network path instead of hundreds of miles away. This is why placing AI hardware at a cable landing station is suddenly becoming not just possible, but strategic.

A familiar example is Netflix. When millions of viewers press Play, the platform makes moment-to-moment decisions about resolution, bitrate, and content delivery paths. These decisions happen faster and more accurately when the intelligence sits closer to the traffic itself. Moving that logic to the cable landing reduces distance, delays, and potential bottlenecks. The result is a smoother user experience.

Governments have their own motivations. Many countries regulate which types of data can leave their borders. This concept, often called sovereignty, simply means that certain information must stay within the nation’s control. Placing AI infrastructure at the point where international traffic enters the country gives agencies the ability to analyze, enforce, and protect sensitive data without letting it cross a boundary.

This trend also exposes a challenge. High-density AI hardware produces far more heat than traditional servers. Most legacy facilities, especially multi-tenant carrier hotels in large cities, were never built to support liquid cooling, reinforced floors, or the weight of modern GPU racks. Purpose-built coastal sites are beginning to fill this gap.

And here is the real eye-opener. Two facilities can each draw 10 megawatts, yet one may produce twice the compute of the other. PUE will give both of them the same high efficiency score because it cannot see the difference in output. Their actual productivity, and even their revenue potential, could be worlds apart.

PCE and ROIP (Return on Invested Power) expose that difference immediately. PCE reveals how much compute is produced per watt, and ROIP shows the financial return on that power. These metrics are quickly becoming essential in AI-era build strategies, and investors and boards are beginning to incorporate them into their decision frameworks.
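
A simple worked example shows how the two metrics separate the hypothetical 10 MW facilities described above. The article does not give exact formulas, so the functions below reflect one reasonable reading (PCE as useful compute per watt, ROIP as annual revenue per watt of power drawn), and every number is illustrative.

    def pce(useful_tflops: float, power_w: float) -> float:
        """Power Compute Effectiveness: useful compute delivered per watt (TFLOPS/W)."""
        return useful_tflops / power_w

    def roip(annual_compute_revenue_usd: float, power_w: float) -> float:
        """Return on Invested Power: annual revenue earned per watt drawn ($/W/year)."""
        return annual_compute_revenue_usd / power_w

    FACILITY_POWER_W = 10e6   # both hypothetical sites draw 10 MW

    # Site A runs newer accelerators and produces twice the useful compute of site B.
    site_a_pce = pce(useful_tflops=400_000, power_w=FACILITY_POWER_W)
    site_b_pce = pce(useful_tflops=200_000, power_w=FACILITY_POWER_W)

    print(f"Site A PCE: {site_a_pce:.3f} TFLOPS/W, Site B PCE: {site_b_pce:.3f} TFLOPS/W")
    print(f"Site A ROIP: ${roip(30e6, FACILITY_POWER_W):.2f}/W/yr, "
          f"Site B ROIP: ${roip(15e6, FACILITY_POWER_W):.2f}/W/yr")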

What is happening at these coastal sites is the early sign of a new class of data center. High density. Advanced cooling. Strategic placement at global entry points for digital traffic. Smaller footprints but far higher productivity per square foot.

The industry will increasingly judge facilities not by how much power they receive, but by how effectively they turn that power into intelligence. That shift is already underway, and the emergence of AI-ready landing stations is the clearest signal yet that compute productivity will guide the next generation of infrastructure.

# # #

About the Author

Paul Quigley is the former President and current Chief Strategic Partnership Officer of Airsys Cooling Technologies, and a global advocate for high density, energy efficient data center design. With more than three decades in HVAC and mission critical cooling, he focuses on practical solutions that connect energy stewardship with real world compute performance. Paul writes and speaks internationally about PCE, ROIP, and the future of data center health in the age of AI.

The post AI Is Moving to the Water’s Edge, and It Changes Everything appeared first on Data Center POST.

Inside the 2025 INCOMPAS Show and the Convergence of Policy Infrastructure and AI

29 December 2025 at 15:00

The 2025 INCOMPAS Show, held November 2–4 at the JW Marriott and Tampa Marriott Water Street in Tampa, Florida, brought together more than 3,000 leaders across communications, broadband, fiber, and technology sectors to explore the evolving landscape of connectivity and competition. One of the most influential gatherings in the U.S. communications ecosystem, the event provided a platform for senior executives, policymakers, and innovators to align on strategies shaping the future of broadband deployment, infrastructure investment, and digital transformation.

This year’s theme of collaboration and convergence set the tone for a comprehensive agenda that highlighted how technology, policy, and innovation are coming together to expand connectivity and bridge the digital divide. Across three days of panels, workshops, and executive-level discussions, speakers addressed the accelerating impact of AI, automation, and public-private partnerships on both network operations and competitive strategy.

The Convergence Era: Policy, Infrastructure, and AI

The opening remarks emphasized the urgency of convergence in today’s communications landscape. Chip Pickering, CEO of INCOMPAS, framed the event with a focus on consolidation, critical infrastructure, and the growing interdependence of networks, power, and policy.

That theme carried into high-profile sessions featuring executives from Verizon, Lumen Technologies, and Bluebird Fiber, where speakers examined how fiber density, cloud connectivity, and edge infrastructure are reshaping both network design and M&A strategy. Panels such as Future-Proofing the Network and Strategic Convergence: How Wireline-Wireless Integration Is Impacting M&A highlighted how capacity planning and integration are now central drivers of transaction value.

AI-driven transformation emerged as a defining force throughout the agenda. In the session Powering Intelligence: The Convergence of Energy, Networks, and AI Infrastructure, leaders including Jeff Uphues, CEO of DC BLOX, and Dan Davis, CEO and co-founder of Arcadian Infracom, explored the mounting energy demands of AI workloads and the need for resilient, scalable infrastructure. Discussions emphasized that AI is no longer an overlay, but a foundational consideration in network architecture, power strategy, and long-term investment planning.

Cybersecurity also took center stage, with experts from Granite Telecommunications, UNITEL, Axcent Networks, and Verizon Partner Solutions outlining how AI is being deployed to detect threats, automate responses, and protect increasingly complex telecom environments.

Policy at the Center of Broadband Expansion

Policy reform remained a cornerstone of the INCOMPAS agenda. Sessions focused on the future of the Universal Service Fund, broadband permitting reform, and federal regulatory alignment drew strong engagement from both providers and policymakers. Led by INCOMPAS policy leadership and legal experts from firms including Morgan Lewis, Cooley, Nelson Mullins Riley & Scarborough LLP, and JSI, these discussions reinforced the critical role of permitting, spectrum access, and funding mechanisms such as BEAD in accelerating equitable broadband deployment nationwide.

Modern Marketing and the Human Element

Beyond infrastructure and policy, the Marketing Workshop Series delivered some of the show’s most actionable insights. The opening session, Marketing’s New Blueprint: Balancing AI, Automation, and Authenticity, featured Laura Johns, Founder and CEO of The Business Growers, and Joy Milkowski, Partner at Access Marketing Company. Together, they explored how communications and technology companies can leverage automation and AI tools without losing the authenticity and strategic clarity required to build trust and drive revenue.

The discussion reinforced that AI should function as a strategic enabler rather than a replacement for human insight. Follow-on workshops expanded on this theme, with sessions focused on revenue-driven AI strategy, practical prompt frameworks, and marketing automation systems designed to align sales and marketing teams while supporting scalable growth.

Networking, Partnerships, and Industry Momentum

As always, the INCOMPAS Show excelled as a venue for relationship-building and deal-making. The Buyers Forum and Deal Center facilitated high-value, pre-scheduled meetings, while exhibit hall programming and networking events fostered collaboration across fiber providers, technology vendors, and service partners.

Workforce development, sustainability, and inclusion also emerged as shared priorities. Speakers stressed the need to build talent pipelines capable of supporting AI-driven networks while ensuring that digital transformation delivers measurable benefits across communities.

The Road Ahead

The 2025 INCOMPAS Show made one thing clear: the future of communications will be defined by integration, collaboration, and adaptability. From AI-powered networks and evolving policy frameworks to authentic marketing and workforce readiness, the conversations in Tampa reflected an industry actively shaping its next chapter.

As the ecosystem looks toward 2026, the momentum from INCOMPAS reinforces a collective commitment to closing connectivity gaps, modernizing infrastructure, and aligning innovation with opportunity.

To learn more about INCOMPAS and upcoming events, visit www.incompas.org and www.show.incompas.org.

The post Inside the 2025 INCOMPAS Show and the Convergence of Policy Infrastructure and AI appeared first on Data Center POST.

AI’s growth calls for useful IT efficiency metrics

The digital infrastructure industry is under pressure to measure and improve the energy efficiency of the computing work that underpins digital services. Enterprises seek to maximize returns on cost outlay and operating expenses for IT hardware, and regulators and local communities need reassurance that the energy devoted to data centers is used efficiently. These objectives call for a productivity metric to measure the amount of work that IT hardware performs per unit of energy.

With generative AI projected to boost data center power demand substantially, the stakes have arguably never been higher. Fortunately, organizations monitoring the performance and efficiency of their AI applications can benefit from experiences in the field of supercomputing.

In September 2025, Uptime Intelligence participated in a panel discussion about AI energy efficiency at the Yotta 2025 conference in Las Vegas (Nevada, US). The panelists drew on their extensive experience in supercomputing to weigh in on discussions around AI training efficiency. They discussed the need for a productivity metric to measure it, as well as a key caveat organizations need to consider.

Organizations such as Uptime Intelligence and The Green Grid have published guidance on calculating work capacity for various types of IT. Software applications and their supporting IT hardware vary significantly, so consensus on a single metric to compare energy performance remains out of reach for the foreseeable future. However, tracking energy performance in a given facility over time is important, and is achievable practically for many organizations today.

Defining AI computing work

The work capacity of IT equipment is needed to calculate its utilization and energy performance when running an application. The Green Grid white paper IT work capacity metric V1 — a methodology describes how to calculate a work capacity value for CPU-based servers. Uptime Intelligence has proposed methodologies to extend this to accelerator-based servers for AI and other applications (see Calculating work capacity for server and storage products).

Floating point operations per second (FLOPS) is a common and readily available unit of work capacity for CPU- or accelerator-based servers. In 2025, an AI server’s capacity is typically quoted in trillions of FLOPS, or teraFLOPS (TFLOPS).

Not all FLOPS are the same

Even though large-scale AI training is radically reshaping many commercial data centers, the underlying software and hardware are not fundamentally new. AI training is essentially one of many applications of supercomputing. Supercomputing software, along with the IT selection and configuration, varies in many ways — and one of the most relevant variables when monitoring energy performance is floating point precision. This precision (measured in bits) is analogous to the number of decimal places used in inputs and outputs.

GPUs and other accelerators can perform 64-, 32-, 16-, 8- and 4-bit calculations, and some can use mixed precision. While a high-performance computing (HPC) workload such as computational fluid dynamics might use 64-bit (“double precision”) floating point calculations for high accuracy, other applications do not have such exacting requirements. Lower precision consumes less memory per calculation — and, crucially, less energy. The panel discussion at Yotta raised an important distinction: unlike most engineering and research applications, today’s AI training and inference calculations typically use 4-bit precision.

Floating point precision is necessary context when evaluating a TFLOPS benchmark. A server’s 64-bit TFLOPS rating is typically one-half of its 32-bit rating — and one-sixteenth of its 4-bit rating. For consistent AI work capacity calculations, Uptime Institute recommends that IT operators use the 32-bit TFLOPS values supplied by their AI server providers.
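
Because vendors quote peak TFLOPS at different precisions, a practical first step is to normalize every figure to the recommended 32-bit basis before aggregating work capacity. The sketch below assumes the idealized scaling described above (throughput doubling each time precision halves); real ratios vary by accelerator and should come from the vendor datasheet.

    # Throughput scaling relative to 32-bit, assuming throughput doubles each time
    # precision halves. Real ratios vary by accelerator; use vendor datasheet values.
    RELATIVE_TO_FP32 = {64: 0.5, 32: 1.0, 16: 2.0, 8: 4.0, 4: 8.0}

    def to_fp32_tflops(vendor_tflops: float, precision_bits: int) -> float:
        """Convert a vendor-quoted peak TFLOPS figure to its 32-bit equivalent."""
        return vendor_tflops / RELATIVE_TO_FP32[precision_bits]

    # Example: a server quoted at 2,000 TFLOPS of 4-bit compute is 250 TFLOPS at 32-bit.
    print(to_fp32_tflops(2000, precision_bits=4))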

Working it out: work per energy

The maximum work capacity calculation for a server can be aggregated at the level of a rack, a cluster or a data center. Work capacity multiplied by average utilization (as a percentage) produces an estimate of the amount of calculation work (in TFLOPS) that was performed over a given period. Operators can divide this figure by the energy consumption (in MWh) over that same time to yield an estimate of the work’s energy efficiency, in TFLOPS/MWh. Separate calculations for CPU-based servers, accelerator-based servers, and other IT (e.g., storage) will provide a more accurate assessment of energy performance (see Figure 1).
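
That calculation maps directly onto a few lines of code. The numbers below are illustrative; in practice, the capacity figure comes from the aggregated 32-bit ratings and the energy figure from metered consumption over the same period.

    def work_per_energy(capacity_tflops: float, avg_utilization: float, energy_mwh: float) -> float:
        """Estimated computing work delivered per unit of energy (TFLOPS/MWh).

        capacity_tflops: aggregated 32-bit work capacity of the servers in scope
        avg_utilization: average utilization over the period, expressed as 0.0 to 1.0
        energy_mwh:      metered energy consumption of the same servers over the same period
        """
        work_done = capacity_tflops * avg_utilization
        return work_done / energy_mwh

    # Illustrative month for an accelerator cluster: 50,000 TFLOPS of capacity,
    # 60% average utilization, 1,800 MWh consumed.
    print(f"{work_per_energy(50_000, 0.60, 1_800):.1f} TFLOPS/MWh")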

Figure 1 Examples of IT equipment work-per-energy calculations

Even when TFLOPS figures are normalized to the same precision, it is difficult to use this information to draw meaningful comparisons between the energy performance of significantly different hardware types and configurations. Accelerator power consumption does not scale linearly with utilization levels. Additionally, the details of software design will determine how closely real-world application performance aligns with simplified work capacity benchmarks.

However, many organizations can benefit from calculating this TFLOPS/MWh productivity metric and are already well equipped to do so. This calculation is most useful to quantify efficiency gains over time, e.g., from IT refresh and consolidation, or refinements to operational control. In some jurisdictions, tracking FLOPS/MWh as a productivity metric can satisfy some regulatory requirements. IT efficiency is often overlooked in favor of facility efficiency — but a consistent productivity metric can help to quantify available improvements.


The Uptime Intelligence View

Generative AI training is poised to drive up data center energy consumption, prompting calls for regulation, responsible resource use and return on investment. A productivity metric can help meet these objectives by consistently quantifying the amount of computing work performed per unit of energy. Supercomputing experts agree that operators should track and use this data, but they caution against interpreting it without the necessary context. A simplified, practical work-per-energy metric is most useful for tracking improvement in one facility over time.

The following participants took part in the panel discussion on energy efficiency at Yotta 2025:

  • Jacqueline Davis, Research Analyst at Uptime Institute (moderator)
  • Dr Peter de Bock, former Program Director, Advanced Research Projects Agency–Energy
  • Dr Alfonso Ortega, Professor of Energy Technology, Villanova University
  • Dr Jon Summers, Research Lead in Data Centers, Research Institutes of Sweden

Other related reports published by Uptime Institute include:

Calculating work capacity for server and storage products

The following Uptime Institute experts were consulted for this report:

Jay Dietrich, Research Director of Sustainability, Uptime Institute

The post AI’s growth calls for useful IT efficiency metrics appeared first on Uptime Institute Blog.

AI power fluctuations strain both budgets and hardware

AI training at scale introduces power consumption patterns that can strain both server hardware and supporting power systems, shortening equipment lifespans and increasing the total cost of ownership (TCO) for operators.

These workloads can cause GPU power draw to spike briefly, even for only a few milliseconds, pushing them past their nominal thermal design power (TDP) or against their absolute power limits. Over time, this thermal stress can degrade GPUs and their onboard power delivery components.

Even when average power draw stays within hardware specifications, thermal stress can affect voltage regulators, solder joints and capacitors. This kind of wear is often difficult to detect and may only become apparent after a failure. As a result, hidden hardware degradation can ultimately affect TCO — especially in data centers that are not purpose-built for AI compute.

Strain on supporting infrastructure

AI training power swings can also push server power supply units (PSUs) and connectors beyond their design limits. PSUs may be forced to absorb rapid current fluctuations, straining their internal capacitors and increasing heat generation. In some cases, power swings can trip overcurrent protection circuits, causing unexpected reboots or shutdowns. Certain power connectors, such as the standard 12VHPWR cables used for GPUs, are also vulnerable. High contact resistance can cause localized heating, further compounding the wear and tear effects.

When AI workloads involve many GPUs operating in synchronization, power swing effects multiply. In some cases, simultaneous power spikes across multiple servers may exceed the rated capacity of row-level UPS modules — especially if they were sized following legacy capacity allocation practices. Under such conditions, AI compute clusters can sometimes reach 150% of their steady-state maximum power levels.

In extreme cases, load fluctuations of large AI clusters can exceed a UPS system’s capability to source and condition power, forcing it to use its stored energy. This happens when the UPS is overloaded and unable to meet demand using only its internal capacitance. Repeated substantial overloads will put stress on internal components as well as the energy storage subsystem. For batteries, particularly lead-acid cells, this can shorten their service life. In worst-case scenarios, these fluctuations may cause voltage sags or other power quality issues (see Electrical considerations with large AI compute).

Capacity planning challenges

Accounting for the effects of power swings from AI training workloads during the design phase is challenging. Many circuits and power systems are sized based on the average demand of a large and diverse population of IT loads, rather than their theoretical combined peak. In the case of large AI clusters, this approach can lead to a false sense of security in capacity planning.
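
A simple numerical comparison illustrates the gap between average-based sizing and a synchronized peak. The 1.5x transient factor below reflects the 150% figure cited earlier; server counts and power figures are illustrative.

    def required_capacity_kw(servers: int, avg_kw: float, peak_factor: float, sync_fraction: float) -> float:
        """Capacity needed when a fraction of servers hit their transient peak at the same time."""
        synchronized = servers * sync_fraction * avg_kw * peak_factor
        background = servers * (1 - sync_fraction) * avg_kw
        return synchronized + background

    SERVERS, AVG_KW = 64, 10.0   # illustrative GPU server count and average draw

    print(f"Average-based sizing: {SERVERS * AVG_KW:.0f} kW")
    print(f"Fully synchronized at 150% of steady state: "
          f"{required_capacity_kw(SERVERS, AVG_KW, peak_factor=1.5, sync_fraction=1.0):.0f} kW")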

When peak amplitudes are underestimated, branch circuits can overheat, breakers may trip, and long-term damage can occur to conductors and insulation — particularly in legacy environments that lack the headroom to adapt. Compounding this challenge, typical monitoring tools track GPU power every 100 milliseconds or more — too slow to detect the microsecond-speed spikes that can accelerate the wear on hardware through current inrush.

Estimating peak power behavior depends on several factors, including the AI model, training dataset, GPU architecture and workload synchronization. Two training runs on identical hardware can produce vastly different power profiles. This uncertainty significantly complicates capacity planning, leading to under-provisioned resources and increased operational risks.

Facility designs for large-scale AI infrastructure need to account for the impact of dynamic power swings. Operators of dedicated training clusters may overprovision UPS capacity, use rapid-response PSUs, or set absolute power and rate-of-change limits on GPU servers using software tools (e.g., Nvidia-SMI). While these approaches can help reduce the risk of power-related failures, they also increase capital and operational costs and can reduce efficiency under typical load conditions.
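
As one concrete example of software-based capping, the nvidia-smi utility mentioned above can apply a per-GPU power limit. The cap value below is illustrative and must fall within the range the device reports; rate-of-change limiting generally requires additional vendor tooling beyond this sketch.

    import subprocess

    def set_gpu_power_limit(gpu_index: int, watts: int) -> None:
        """Apply a power cap to one GPU via nvidia-smi (requires administrative privileges)."""
        subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

    # Illustration: cap eight GPUs in a server at 600 W each, below their default limit.
    for idx in range(8):
        set_gpu_power_limit(idx, 600)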

Many smaller operators — including colocation tenants and enterprises exploring AI — are likely testing or adopting AI training on general-purpose infrastructure. Nearly three in 10 operators already perform AI training, and of those that do not, nearly half expect to begin in the near future, according to results from the Uptime Institute AI Infrastructure Survey 2025 (see Figure 1).

Figure 1 Three in 10 operators currently perform AI training

Diagram: Three in 10 operators currently perform AI training

Many smaller data center environments may lack the workload diversity (non-AI loads) needed to absorb power swings, or the specialized engineering expertise to manage dynamic power consumption behavior. As a result, these operators face a greater risk of failure events, hardware damage, shortened component lifespans and reduced UPS reliability — all of which contribute to higher TCO.

Several low-cost strategies can help mitigate risk. These include oversizing branch circuits — ideally dedicating them to GPU servers — distributing GPUs across racks and data halls to prevent localized hotspots, and setting power caps on GPUs to trade some peak performance for longer hardware lifespan.

For operators considering or already experimenting with AI training, TDP alone is an insufficient design benchmark for capacity planning. Infrastructure needs to account for rapid power transients, workload-specific consumption patterns, and the complex interplay between IT hardware and facility power systems. This is particularly crucial when using shared or legacy systems, where the cost of misjudging these dynamics can quickly outweigh the perceived benefits of performing AI training in-house.


The Uptime Intelligence View

For data centers not specifically designed to support AI training workloads, GPU power swings can quietly accelerate hardware degradation and increase costs. Peak power consumption of these workloads is often difficult to predict, and signs of component wear may remain hidden until failures occur. Larger operators with dedicated AI infrastructure are more likely to address these power dynamics during the design phase, while smaller operators — or those using general-purpose infrastructure — may have fewer options.

To mitigate risk, these operators can consider overprovisioning rack-level UPS capacity for GPU servers, oversizing branch circuits (and dedicating them to GPU loads where possible), distributing heat from GPU servers across racks and rooms to avoid localized hotspots, and applying software-based power caps. Data center operators should also factor in more frequent hardware replacements during financial planning to more accurately reflect the actual cost of running AI training workloads.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute
Daniel Bizo, Senior Research Director, Uptime Institute Intelligence
Max Smolaks, Research Analyst, Uptime Institute Intelligence

Other related reports published by Uptime Institute include:
Electrical considerations with large AI compute

The post AI power fluctuations strain both budgets and hardware appeared first on Uptime Institute Blog.

Electrical considerations with large AI compute

The training of large generative AI models is a special case of high-performance computing (HPC) workloads. This is not simply due to the reliance on GPUs — numerous engineering and scientific research computations already use GPUs as standard. Neither is it about the power density or the liquid cooling of AI hardware, as large HPC systems are already extremely dense and use liquid cooling. Instead, what makes AI compute special is its runtime behavior: when training transformer-based models, large compute clusters can create step load-related power quality issues for power distribution systems in data center facilities. A previous Intelligence report offers an overview of the underlying hardware-software mechanisms.

The scale of the power fluctuations makes this phenomenon unusual and problematic. The vast number of generic servers found in most data centers collectively produce a relatively steady electrical load — even if individual servers experience sudden changes in power usage, they are discordant. In contrast, the power use of compute nodes in AI training clusters moves in near unison.

Even compared with most other HPC clusters, AI training clusters exhibit larger power swings. This is due to an interplay between transformer-based neural network architectures and compute hardware, which creates frequent spikes and dips in power demand (every second or two). These fluctuations correspond to the computational steps in the training process, exacerbated by the aggressive pursuit of peak performance typical of modern silicon.

Powerful fluctuations

The magnitude of the resulting step changes in power will depend on the size and configuration of the compute cluster, as well as operational factors such as AI server performance and power management settings. Uptime Intelligence estimates that in worst-case scenarios, the difference between the low and high points of power draw during training program execution can exceed 100% at the system level (the load doubles almost instantaneously, within milliseconds) for some configurations.

These extremes occur every few seconds, whenever a new batch of training data is loaded onto the GPUs and computation begins. This is often accompanied by a massive spike in current, produced by power excursion events as GPUs overshoot their thermal design power (TDP) rating to opportunistically exploit any extra thermal and power delivery budget following a phase of lower transistor activity. In short, power spikes are made possible by intermittent lulls.

This behavior is common in modern compute silicon, including in personal devices and generic servers. Still, it is only with large AI compute clusters that these fluctuations across dozens or hundreds of servers move almost synchronously.

Even in moderately sized clusters with just a few dozen racks, this can result in sudden, millisecond-speed changes in AC power — ranging from several hundred kilowatts to even a few megawatts. If there are no other substantial loads present in the electrical mix to dampen these fluctuations, these step changes may stress capacity components in the power distribution systems. They may also cause power quality issues such as voltage sags and swells, or significant harmonics and sub-synchronous oscillations that distort the sinusoidal waveforms in AC power systems.

Based on several discussions with and disclosures by major electrical equipment manufacturers — including ABB, Eaton, Schneider Electric, Siemens and Vertiv — there is a general consensus that modern power distribution equipment is expected to be able to handle AI power fluctuations, as long as they remain within the rated load.

IT system capacity redefined

The issue of AI step loads appears to center on equipment capacity and the need to avoid frequent overloads. Standard capacity planning practices often start with the nameplate power of installed IT hardware, then derate it to estimate the expected actual power. This adjustment can reduce the total nameplate power by 25% to 50% across all IT loads when accounting for the diversity of workloads — since they do not act in unison — and also for the fact that most software rarely pushes the IT hardware close to its rated power.

In comparison, AI training systems can show extreme behavior. Larger AI compute clusters have the potential to draw something akin to an inrush current (a rapid change in current, often denoted by high di/dt) that exceeds the IT system’s sustained maximum power rating.
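
The gap between the two sizing philosophies can be shown with a few lines of arithmetic. In the sketch below, the 1,000 kW nameplate figure, the 35% derate (within the 25% to 50% range quoted above) and the 10% transient excursion are illustrative assumptions, not measurements.

```python
# Illustrative capacity arithmetic: conventional derated sizing versus the
# synchronized transient peak of an AI training cluster. All inputs are
# assumptions chosen for illustration only.

nameplate_kw = 1000.0        # total nameplate power of the installed IT hardware

# Conventional mixed-IT planning: derate nameplate by 25% to 50%
expected_mixed_kw = nameplate_kw * (1 - 0.35)   # 650 kW with a 35% derate

# AI training cluster: brief excursions above the sustained system rating
ai_peak_kw = nameplate_kw * 1.10                # assumed 10% transient overshoot

print(f"Derated mixed-IT expectation: {expected_mixed_kw:.0f} kW")
print(f"AI training transient peak:   {ai_peak_kw:.0f} kW")
```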

Normally, overloads would not pose a problem for modern power distribution. All electrical components and systems have specified overload ratings to handle transient events (e.g., current surges during the startup of IT hardware or other equipment) and are designed and tested accordingly. However, if power distribution components are sized closely to the rated capacity of the AI compute load, these transient overloads could happen millions of times per year in the worst cases (a spike every two seconds, sustained around the clock, adds up to more than 15 million events a year) — and components are not tested for regularly repeated overloads. Over time, this can lead to electromechanical stress, thermal stress and gradual overheating (components heat up faster than they cool down) — potentially resulting in component failure.

This brings the definition of capacity to the forefront of AI compute step loads. Establishing the repeated peak power of a single GPU-server node is already a non-trivial effort — it requires running a variety of computationally intensive codes and setting up a high-precision power monitor. Predicting how a specific compute cluster spanning several racks and potentially hundreds or even thousands of GPUs will behave during a training run is harder still, and difficult to ascertain ahead of deployment.

The expected power profile also depends on server configurations, such as power supply redundancy level, cooling mode and GPU generation. For example, in a typical AI system from the 2022-2024 generation, power fluctuations can reach up to 4 kW per 8-GPU server node, or 16 kW per rack when populated with four nodes, according to Uptime estimates. Even so, the likelihood of exceeding the rack power rating of around 41 kW is relatively low. Any overshoot is likely to be minor, as these systems are mostly air-cooled hardware designed to meet ASHRAE Class A2 specifications — allowed to operate in environments up to 35°C (95°F). In practice, most facilities supply much cooler air, so system fans cycle less intensely.

However, with recently launched systems, the issue is further exacerbated as GPUs account for a larger share of the power budget, not only because they use more power (in excess of 1 kW per GPU module) but also because these systems are more likely to use direct liquid cooling (DLC). Liquid cooling reduces system fan power, lowering the server’s stable baseline load. It also offers better thermal performance, which helps the silicon accumulate extra thermal budget for power excursions.

IT hardware specifications and information shared with Uptime by power equipment vendors indicate that in the worst cases, load swings can reach 150%, with a potential for overshoots exceeding 10% above the system’s power specification. In the case of rack-scale systems based on Nvidia’s GB200 NVL72 architecture, power can suddenly climb from a baseline of around 60 kW to 70 kW to more than 150 kW per rack.

This compares to a maximum power specification of 132 kW, which means that, under worst-case assumptions, repeated overloads can amount to as much as 20% in instantaneous power, Uptime estimates. This warrants extra care regarding circuit sizing (including breakers, tap-off units and placements, busways and other conductors) to avoid overheating and related reliability issues.
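
The headroom arithmetic behind these estimates is simple. In the sketch below, the 132 kW rating and the roughly 150 kW peak are the figures cited above, while the 158 kW value is a hypothetical peak back-calculated to match the 20% worst-case estimate.

```python
# Illustrative headroom check for a GB200 NVL72-class rack. The 132 kW rating
# and ~150 kW peak are the figures cited in the text; 158 kW is a hypothetical
# value back-calculated from the 20% worst-case estimate.

RATED_KW = 132.0

def overload_pct(peak_kw: float, rated_kw: float = RATED_KW) -> float:
    """Instantaneous overload as a percentage above the sustained rating."""
    return 100.0 * (peak_kw / rated_kw - 1.0)

for peak in (150.0, 158.0):
    print(f"{peak:.0f} kW peak -> {overload_pct(peak):.0f}% above the {RATED_KW:.0f} kW rating")
```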

Figure 1 shows the power pattern of a GPU-based compute cluster running a transformer-based model training workload. Based on hardware specifications and real-world power data disclosed to Uptime Intelligence, we algorithmically mimicked the behavior of a compute cluster comprising four Nvidia GB200 NVL72 racks and four non-compute racks. It demonstrates the power fluctuations expected during training runs on such clusters and underscores the need to rethink capacity planning compared with traditional, generic IT loads. Even though the average power stays below the power rating of the cluster, peak fluctuations can exceed it. While this models a relatively small cluster with 288 GPUs, a larger cluster would exhibit similar behavior at the megawatt scale.

Figure 1 Power profile of a GPU-based training cluster (algorithmic not real-world data)

Diagram: Power profile of a GPU-based training cluster (algorithmic not real-world data)
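
Uptime’s underlying model is not public, so the sketch below is a much cruder stand-in: it generates a synthetic, square-wave-like profile for four compute racks plus a steady ancillary load (standing in for the non-compute racks), purely to show how peaks can exceed the nominal rating while the average stays below it. The per-rack idle and peak levels, the two-second period, the ancillary load and the jitter are all assumptions.

```python
# Crude synthetic power profile for four GB200 NVL72-class racks plus a steady
# ancillary load. Not Uptime's model: levels, period and jitter are assumptions
# chosen only to reproduce the qualitative pattern described in the text.
import random

random.seed(1)

RACKS = 4
IDLE_KW, PEAK_KW = 65.0, 150.0      # assumed per-rack low/high points
ANCILLARY_KW = 40.0                 # storage, network, CDUs (assumed)
NOMINAL_KW = RACKS * 132.0 + ANCILLARY_KW   # 132 kW per-rack rating cited above
PERIOD_S, STEP_S = 2.0, 0.05        # ~2 s training iteration, 50 ms resolution

profile = []
t = 0.0
while t < 60.0:                     # simulate one minute
    total = ANCILLARY_KW
    for _ in range(RACKS):
        jitter = random.uniform(-0.05, 0.05)          # racks not perfectly in sync
        phase = ((t / PERIOD_S) + jitter) % 1.0
        total += PEAK_KW if phase < 0.5 else IDLE_KW  # compute phase vs. lull
    profile.append(total)
    t += STEP_S

print(f"nominal rating: {NOMINAL_KW:.0f} kW")
print(f"average load:   {sum(profile)/len(profile):.0f} kW")
print(f"peak load:      {max(profile):.0f} kW")
```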

In electrical terms, no multi-rack workload is perfectly synchronous, while the presence of other loads will help smooth out the edges of fluctuations further. When including non-compute ancillary loads in the cluster — such as storage systems, networks and CDUs (which also require UPS power) — a lower safety margin above the nominal rating (e.g., 10% to 15%) appears sufficient to cover any regular peaks over the nominal system power specifications, even with the latest AI hardware.

Current mitigation options

There are several factors that data center operators may want to consider when deploying compute clusters dedicated to training large, transformer-based AI models. Currently, data center operators have a limited toolkit for fully handling large power fluctuations in a power distribution system, particularly when it comes to preventing them from being passed on to the source in their full extent. However, in collaboration with the IT infrastructure team or tenant, it should be possible to minimize fluctuations:

  • Mix with diverse IT loads, share generators. The best first option is to integrate AI training compute with other, diverse IT loads in a shared power infrastructure. This helps to diminish the effects of power fluctuations, particularly on generator sets. For dedicated AI training installations, this may not be an option at the power distribution level. However, sharing engine generators will go a long way toward dampening the effects of AI power fluctuations.
    Among power equipment, engine generator sets will be the most stressed if exposed to the full extent of the fluctuations seen in a large, dedicated AI training infrastructure. Even if correctly sized for the peak load, generators may struggle with large and fast fluctuations — for example, the total facility load stepping from 45% to 50% of design capacity up to 80% to 85% within a second, then dropping back after two seconds, on repeat. Such fluctuation cycles may be at the limit of what the engines can handle, risking reduced service life or outright failure.
  • Select UPS configurations to minimize power quality issues and overloads. Even if a smaller frame can handle the fluctuations, according to the vendors, larger systems carry more capacitance to help absorb the worst of the fluctuations, maintaining voltage and frequency within performance specifications. An additional measure is to use a higher level of capacity redundancy, for example by opting for N+2. This allows for UPS maintenance while avoiding repeated overloads on the operational UPS systems, some of which might otherwise hit the battery energy storage system.
  • Use server performance/power management tools. Power and performance management of hardware remain largely underused, despite their ability to not only improve IT power efficiency but also contribute to the overall performance of the data center infrastructure. Even though AI compute clusters feature some exotic interconnect subsystems, they are essentially standard servers using standard hardware and software. This means there are a variety of levers to manage the peaks in their power and performance levels, such as power capping, turning off boost clocks, limiting performance states, or even setting lower temperature limits.
    To address the low end of fluctuations, switching off server energy-saving modes — such as silicon sleep states (known as C-states in CPU parlance) — can help raise the IT hardware’s power floor (a minimal sketch of this follows the list). A more advanced technique involves limiting the rate of power change (including on the way down). This feature, called “power smoothing”, is available through Nvidia’s System Management Interface on the latest generation of Blackwell GPUs.
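
To give the "raise the power floor" idea in the last bullet a concrete flavor, the sketch below disables deep CPU idle (C-) states on a Linux host through the standard cpuidle sysfs interface. This addresses only the host-CPU contribution to the load floor; GPU-side measures such as Nvidia's power smoothing are configured through the vendor's own tooling and are not shown. Root privileges are required, and idle power consumption rises by design.

```python
# Minimal sketch: disable deep CPU idle (C-) states via the Linux cpuidle
# sysfs interface to raise the host's power floor. Requires root. State 0
# (the shallowest) is left enabled so CPUs can still idle briefly.
from pathlib import Path

for state in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpuidle/state[1-9]*"):
    name = (state / "name").read_text().strip()
    (state / "disable").write_text("1")   # 1 = do not enter this idle state
    print(f"disabled {name} on {state.parent.parent.name}")
```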

Electrical equipment manufacturers are investigating the merits of adding rapid discharge/recharge energy storage and updated controls to UPS units, with the aim of shielding the power source from fluctuations. These approaches include supercapacitors, advanced battery chemistries or even flywheels that can tolerate frequent, short-duration but high-power discharge and recharge cycles. Next-generation AI compute systems may also include more capacitance and energy storage to limit fluctuations on the data center power system. Ultimately, it is often best to address an issue at its root (in this case, the IT hardware and software) rather than treat the symptoms, although the root may lie outside the control of data center facilities teams.


The Uptime Intelligence View

Most of the time, data center operators do not need to be overly concerned with the power profile of the IT hardware or the specifics of the associated workloads — rack density estimates were typically overblown to begin with, and overall capacity utilization tends to stay well below 100%. Even so, safety margins are expensive and can therefore be thin. However, training large transformer models is different. The specialized compute hardware can be extremely dense, creates large power swings, and is capable of producing frequent power surges that are close to or even above its hardware power rating. This will force data center operators to reconsider their approach to both capacity planning and safety margins across their infrastructure.

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute

The post Electrical considerations with large AI compute appeared first on Uptime Institute Blog.

Crypto mines are turning into AI factories

The pursuit of training ever-larger generative AI models has necessitated the creation of a new class of specialized data centers — facilities that have more in common with high-performance computing (HPC) environments than traditional enterprise IT.

These data centers support very high rack densities (130 kW and above with current Nvidia rack-scale systems), direct-to-chip liquid cooling, and supersized power distribution components. This equipment is deployed at scale, in facilities that consume tens of megawatts. Delivering such dense infrastructure at this scale is not just technically complicated — it often requires doing things that have never been attempted before.

Some of these ultra-dense AI training data centers are being built by well-established cloud providers and their partners — wholesale colocation companies. However, the new class of facility has also attracted a different kind of data center developer: former cryptocurrency miners. Many of the organizations now involved in AI infrastructure — such as Applied Digital, Core Scientific, CoreWeave, Crusoe and IREN — originated as crypto mining ventures.

Some have transformed into neoclouds, leasing GPUs at competitive prices. Others operate as wholesale colocation providers, building specialized facilities for hyperscalers, neoclouds, or large AI model developers like OpenAI or Anthropic. Few of them operated traditional data centers before 2020. These operators represent a significant and recent addition to the data center industry — especially in the US.

A league of their own

Crypto mining facilities differ considerably from traditional data centers. Their primary objective is to house basic servers equipped with either GPUs or ASICs (application-specific integrated circuits), running at near 100% utilization around the clock to process calculations that yield cryptocurrency tokens. The penalties for outages are direct — fewer tokens mean lower profits — but the hardware is generally considered disposable. The business case is driven almost entirely by the cost of power, which accounts for almost all of the operating expenditure.

Many crypto mines do not use traditional server racks. Most lack redundancy in power distribution and cooling equipment, and they have no means of continuing operations in the event of a grid outage: no UPS, no batteries, no generators, no fuel. In some cases, mining equipment is located outdoors, shielded from the rain, but little else.

While crypto miners didn’t build traditional data center facilities, they did have two crucial assets: land zoned for industrial use and access to abundant, low-cost power.

Around 2020, some of the largest crypto mining operators began pivoting toward hosting hardware for AI workloads — a shift that became more pronounced following the launch of ChatGPT in late 2022. Table 1 shows how quickly some of these companies have scaled their AI/HPC operations.

Table 1 The transformation of crypto miners

Table: The transformation of crypto miners

To develop data center designs that can accommodate the extreme power and cooling requirements of cutting-edge AI hardware, these companies are turning to engineers and consultants with experience in hyperscale projects. The same applies to construction companies. The resulting facilities are built to industry standards and are concurrently maintainable.

There are three primary reasons why crypto miners were successful in capitalizing on the demand for high-density AI infrastructure:

  • These organizations were accustomed to moving quickly, having been born in an industry that had to respond to volatile cryptocurrency pricing, shifting regulations and fast-evolving mining hardware.
  • Many were already familiar with GPUs through their use in crypto mining — and some had begun renting them out for research or rendering workloads.
  • Their site selection was primarily driven by power availability and cost, rather than proximity to customers or network hubs.

Violence of action

Applied Digital, a publicly traded crypto mining operator based in North Dakota, presents an interesting case study. The state is one of the least developed data center markets in the US, with only a few dozen facilities in total.

Applied Digital’s campus in Ellendale was established to capitalize on cheap renewable power flowing between local wind farms and Chicago. In 2024, the company removed all mentions of cryptocurrency from its website — despite retaining sizable (100 MW-plus) mining operations. It then announced plans to build a 250 MW AI campus in Ellendale, codenamed Polaris Forge, to be leased by CoreWeave.

The operator expects the first 100 MW data center to be ready for service in late 2025. The facility will use direct liquid cooling and is designed to support 300 kW-plus rack densities. It is built to be concurrently maintainable, powered by two utility feeds, and will feature N+2 redundancy on most mechanical equipment. To ensure cooling delivery in the event of a power outage, the facility will be equipped with 360,000 gallons (1.36 million liters) of chilled water thermal storage. This will be Applied Digital’s first non-crypto facility.

The second building, with a capacity of 150 MW, is expected to be ready in the middle of 2026. It will deploy medium-voltage static UPS systems to improve power distribution efficiency and optimize site layout. The company has several more sites under development.

Impact on the sector

Do crypto miners have an edge in data center development? What they do have is existing access to power and a higher tolerance for technical and business risk — qualities that enable them to move faster than much of the traditional competition. This willingness to place bets matters in a market that is lacking solid fundamentals: in 2025, capital expenditure on AI infrastructure is outpacing revenue from AI-based products by orders of magnitude. The future of generative AI is still uncertain.

At present, this new category of data center operators appears to be focusing exclusively on the ultra-high-density end of the market and is not competing for traditional colocation customers. For now, they don’t need to either, as demand for AI training capacity alone keeps them busy. Still, their presence in the market introduces a new competitive threat to colocation providers that have opted to accommodate extreme densities in their recently built or upcoming facilities.

M&E and IT equipment suppliers have welcomed the new arrivals — not simply because they drive overall demand but because they are new buyers in a market increasingly dominated by a handful of technology behemoths. Some operators will be concerned about supply chain capacity, especially when it comes to large-scale projects: high-density campuses could deplete the stock of data center equipment such as large generators, UPS systems and transformers.

One of the challenges facing this new category of operators is the evolving nature of AI hardware. Nvidia, for example, intends to start shipping systems that consume more than 500 kW per compute rack by the end of 2027. It is not clear how many data centers being built today will be able to accommodate this level of density.


The Uptime Intelligence View

The simultaneous pivot by several businesses toward building much more complex facilities is peculiar, yet their arrival will not immediately affect most operators.

While this trend will create business opportunities for a broad swathe of design, consulting and engineering firms, it is also likely to have a negative impact on equipment supply chains, extending lead times, especially for large-capacity units.

Much of this group’s future success hinges on the success of generative AI in general — and the largest and most compute-hungry models in particular — as a tool for business. However, the facilities they are building are legitimate data centers that will remain valuable even if the infrastructure needs of generative AI are being overestimated.

The post Crypto mines are turning into AI factories appeared first on Uptime Institute Blog.

MIT spinoff lands $120 million funding to make AI apps for frontline workers

16 January 2026 at 19:09



The MIT spinoff firm Tulip says it has landed $120 million in venture backing for its technology that provides artificial intelligence (AI)-enabled, connected apps for frontline workers in the manufacturing, pharmaceutical, and medical device sectors.

The Series D round was led by Mitsubishi Electric Corp., which has invested in and signed a strategic alliance agreement with Somerville, Massachusetts-based Tulip, committing the two companies to advancing digital transformation (DX) in manufacturing.

Tulip says its technology is needed because manufacturers today face the dual threats of volatile supply chains and critical labor shortages. Traditional systems and paper-based workarounds are too slow and disconnected for manufacturers to react, so Tulip solves this by embedding AI into frontline operations, enabling rapid problem solving and turning complex data into insight and action, the firm says.

Through the new partnership, Mitsubishi Electric will leverage Tulip’s composable platform to rapidly roll out scalable, AI-driven applications, signaling a shift away from monolithic software toward agile, human-centric innovation.

“We believe that people are the most valuable asset in any operation,” Natan Linder, CEO of Tulip Interfaces Inc., said in a release. “Our partnership with Mitsubishi Electric solidifies a shared commitment to a human-first digital transformation. We are building modern, composable architectures not to automate people away, but to give them superpowers through practical use of AI. We recognize that technology must work for the operators and the engineers, not the other way around.”

42% of logistics leaders are holding back on Agentic AI, survey shows

15 January 2026 at 20:33



A recent survey of North American transportation, logistics, and supply chain executives reveals a disconnect between what those leaders see as the promise of advanced artificial intelligence (AI) solutions, such as Agentic AI, and their readiness to implement them.

Conducted by global technology firm Ortec, which provides optimization software and analytics solutions to a range of industries, the survey examined the effects of adopting AI and machine learning (ML) in logistics. While nearly all of the survey’s 400 respondents said they recognize the potential of Agentic AI to modernize planning and execution, 42% said they are not yet exploring the technology and instead remain focused solely on traditional AI and ML approaches.

“The survey … found that only a small minority had active Agentic AI pilots or deployments at the end of 2025, even as 23% say they plan to pilot Agentic AI within the next 12 months—putting 2026 squarely in focus as a test-and-learn year for autonomous decision-making in logistics,” according to the report.

There are key differences between traditional and advanced AI: Traditional AI solutions perform tasks based on predefined rules and algorithms—a common example is the virtual assistant Siri. Agentic AI solutions can make decisions without human intervention—examples include autonomous vehicles that can navigate traffic.

Despite a lack of industry testing and deployment of Agentic AI, respondents said they have high expectations for its use in supply chain operations, citing drastic cost savings through fuel and mileage optimization (30%), increased operational resilience (22%), and improved data quality (20%) as their top anticipated benefits.

That optimism is balanced by concerns about getting Agentic AI production-ready in 2026, according to the report. Respondents point to high integration costs with existing systems as their number one frustration (32%). They also cite a “lack of model explainability” (26%)—which refers to situations in which AI systems make planning or execution decisions, but logistics teams can’t clearly understand why a specific recommendation or action was taken. Poor data quality is another key concern (22%).

Respondents said they are also concerned about a lack of in-house expertise and unclear return on investment (ROI) when it comes to implementing AI in general.

Despite the obstacles, executives say they have a clear view of where Agentic AI should be applied first in supply chains: First- and final-mile route scheduling is seen as the top target for AI-driven reinvention (35%), followed by global supply chain network design (20%).

When asked what would most accelerate adoption, respondents prioritized clear ROI measurement frameworks (30%), peer case studies from similar organizations (25%), and seamless integration with existing planning systems (24%).

“Executives are entering 2026 with a clear mandate: make Agentic AI real, measurable, and safe for operations,” Daphne de Poot, Ortec’s senior vice president of operations for the Americas, said in a statement announcing the survey’s findings. “Our research shows they believe Agentic AI can fundamentally improve cost, service, and resilience, but they need transparent decisioning, reliable data, and a phased approach that keeps planners in control while AI gradually takes on more of the repetitive and complex decision-making work.

“These survey findings provide a detailed view into how leaders are thinking about the next wave of AI—beyond predictive analytics and into autonomous, decision-making systems that can continuously optimize complex logistics networks.”
