2025 AIChE Annual Meeting

(711f) A Graph-Based Modeling Framework for Cooling System Availability Analysis and Component Isolation in Modular Data Centers

Authors

Aniket Nitin Deore, University of California, Davis
Matthew Ellis, University of California, Davis
Data centers power critical digital infrastructure, from artificial intelligence (AI) to healthcare. Modular data centers (MDCs) are prefabricated, portable data centers housed in customized shipping containers, which enables rapid expansion and deployment in remote locations [1,2,3]. As in large-scale data centers, the computing equipment in MDCs produces heat during operation, and a cooling system must dissipate this heat to keep the equipment below its maximum operating temperature and prevent damage and failures [4]. As MDCs are increasingly deployed in remote locations and high-density urban environments, their cooling systems must satisfy space and other resource constraints (e.g., minimal energy consumption) while remaining capable of dissipating the increasingly large heat loads generated by modern computing systems [3]. To meet these requirements, MDCs, like large-scale data centers, are moving to liquid cooling systems [5,6].

While there are several liquid cooling system designs, the high-level approach is similar: a coolant, usually a mixture of water and propylene glycol, is circulated to the computing equipment [5]. The coolant flows through a heat sink attached to the computing equipment, so the heat generated by the equipment transfers into the cold coolant, which keeps the equipment below its maximum operating temperature.

System availability, a key performance metric for a data center, is the system uptime divided by the total time (uptime plus downtime) [7]. The Uptime Institute’s tier classification is widely used to evaluate data center performance. Tier III data centers must be concurrently maintainable and achieve at least 99.982% availability, corresponding to about 1.6 hours of system downtime annually [8]. To meet these requirements, the cooling system must be designed with sufficient redundancy to avoid any single point of failure and with a control system that can automatically isolate any component in case of faults, failures, or maintenance.
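
The relationship between an availability target and annual downtime can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 8760-hour (non-leap) year are assumptions made here, not part of the original work.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def availability(uptime_h: float, downtime_h: float) -> float:
    """Availability = uptime / (uptime + downtime)."""
    return uptime_h / (uptime_h + downtime_h)


def annual_downtime_hours(avail: float) -> float:
    """Hours of downtime per year implied by a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Availability implied by 1.6 hours of downtime in one year.
tier_iii = availability(HOURS_PER_YEAR - 1.6, 1.6)
print(f"availability for 1.6 h/yr downtime: {tier_iii:.5%}")
print(f"downtime at that availability: {annual_downtime_hours(tier_iii):.2f} h/yr")
```

Under these assumptions, 1.6 hours of downtime per year corresponds to an availability of roughly 99.982%.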

This work presents a generalized, directed graph-based modeling framework for MDC cooling systems. The framework serves two purposes. First, it enables the computation of system availability from component-level mean time between failures (MTBF) and mean time to repair (MTTR), which can be used to down-select candidate cooling system designs. We develop a Monte Carlo-based approach that estimates system availability from the system graph by simulating different component fault and failure scenarios, automatically isolating the failed or faulty components, and using the resulting directed graph, which describes the available system components, to determine whether the MDC cooling system can be operated. Second, as a by-product, the method automatically generates component isolation strategies, which can be used for real-time component isolation. We apply the framework to a recently developed MDC cooling system featuring a single-loop architecture and heat rejection via microchannel polymer heat exchangers. We evaluate several design variants to demonstrate the framework’s generality and compare their system availabilities. Finally, we illustrate the generated isolation strategies for failed or maintainable components.
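
To make the idea concrete, the following is a minimal sketch of a Monte Carlo availability estimate on a directed graph. It is not the authors' implementation. It assumes a toy topology (two redundant pumps feeding one heat exchanger), hypothetical mean time between failures (MTBF) and mean time to repair (MTTR) values, independent component failures sampled from steady-state availabilities MTBF / (MTBF + MTTR), and an operability criterion defined as the existence of a directed path from the heat source to the heat sink. All names (`isolate`, `is_operable`, `estimate_availability`) are illustrative.

```python
import random
from collections import deque


def isolate(graph, failed):
    """Return the subgraph with failed components and their edges removed."""
    return {u: [v for v in nbrs if v not in failed]
            for u, nbrs in graph.items() if u not in failed}


def is_operable(graph, source, sink):
    """Breadth-first search: is there a directed path from source to sink?"""
    if source not in graph:
        return False
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        if u == sink:
            return True
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def estimate_availability(graph, component_data, source, sink,
                          n_samples=50_000, seed=0):
    """Fraction of sampled failure scenarios in which the system is operable."""
    rng = random.Random(seed)
    # Assumed steady-state component availability: MTBF / (MTBF + MTTR).
    avail = {c: mtbf / (mtbf + mttr) for c, (mtbf, mttr) in component_data.items()}
    operable = 0
    for _ in range(n_samples):
        failed = {c for c, a in avail.items() if rng.random() > a}
        if is_operable(isolate(graph, failed), source, sink):
            operable += 1
    return operable / n_samples


# Hypothetical topology: boundary nodes "it_load" and "ambient" never fail.
graph = {
    "it_load": ["pump_a", "pump_b"],
    "pump_a": ["heat_exchanger"],
    "pump_b": ["heat_exchanger"],
    "heat_exchanger": ["ambient"],
    "ambient": [],
}
# Hypothetical (MTBF, MTTR) pairs in hours.
component_data = {
    "pump_a": (5000.0, 50.0),
    "pump_b": (5000.0, 50.0),
    "heat_exchanger": (20000.0, 20.0),
}

estimate = estimate_availability(graph, component_data, "it_load", "ambient")
print(f"estimated system availability: {estimate:.4f}")
```

In this sketch, the `isolate` step doubles as a simple isolation strategy: the set of removed nodes and edges for a given failure scenario indicates which components would be cut off. A time-domain simulation that samples failure and repair times would be closer to the scenario-based approach described above, at the cost of more code.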

References:

[1] K. V. Vishwanath, A. Greenberg, and D. A. Reed, “Modular data centers: how to design them?”, in Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance, Germany, June 2009, pp. 3–10.

[2] W. Vinson, M. Slaby, and I. Levine, “Modular data centers: Design, deployment, and other considerations”, Data Center Handbook, pp. 59–87, 2014.

[3] M. M. Waldrop, “Data center in a box”, Scientific American, vol. 297, no. 2, pp. 90–93, 2007.

[4] L. Ling, Q. Zhang, Y. Yu, and S. Liao, “Experimental investigation on the thermal performance of water cooled multi-split heat pipe system (MSHPS) for space cooling in modular data centers”, Applied Thermal Engineering, vol. 107, pp. 591–601, 2016.

[5] M. Azarifar, M. Arik, and J.-Y. Chang, “Liquid cooling of data centers: a necessity facing challenges”, Applied Thermal Engineering, vol. 247, p. 123112, 2024.

[6] J. Matteson, “Pathway to liquid cooling”, Data Center Frontier, 2024. [Online]. Accessed on March 30, 2025.

[7] K. S. Trivedi and A. Bobbio, “Reliability and availability engineering: modeling, analysis, and applications”, Cambridge University Press, 2017.

[8] W. Pitt Turner IV, J. H. Seader, V. Renaud, and K. G. Brill, “Tier classifications define site infrastructure performance”, Uptime Institute, 17, 2006.