Efficient Reinforcement Learning Implementations for Sustainable Operation of Liquid Cooled HPC Data Centers (Papers Track)
Avisek Naug (Hewlett Packard Enterprise); Antonio Guillen-Perez (Hewlett Packard Enterprise); Vineet Gundecha (Hewlett Packard Enterprise); Ashwin Ramesh Babu (Hewlett Packard Enterprise); Sahand Ghorbanpour (Hewlett Packard Enterprise); Ricardo Luna Gutierrez (Hewlett Packard Enterprise); Soumyendu Sarkar (Hewlett Packard Enterprise)
Abstract
The rapid growth of data-intensive applications such as AI has led to a significant increase in the energy consumption and carbon footprint of data centers. Liquid cooling has emerged as a crucial technology for managing the thermal loads of high-density servers more efficiently than traditional air cooling. However, optimizing the complex dynamics of liquid cooling systems to maximize energy efficiency remains a significant challenge. To accelerate research in this domain, we design a suite of highly scalable reinforcement learning (RL) control strategies for liquid-cooled data centers. We demonstrate our work on a digital twin of the Oak Ridge National Laboratory's Frontier supercomputer cooling system, which provides a detailed, customizable, and scalable platform for end-to-end liquid cooling control. We illustrate the utility of our framework by developing and evaluating centralized and decentralized multi-agent RL controllers that optimize cooling tower and server-level operations. Our results show that centralized RL-based control can significantly reduce the operational carbon footprint and improve thermal management compared to traditional RL applications in the literature, offering a promising path toward more sustainable data centers and mitigating their climate impact.
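For concreteness, the sketch below illustrates the kind of control interface the abstract implies: an RL agent interacting with a simulated liquid-cooling environment through a step/reset loop, choosing pump and cooling-tower fan setpoints and receiving a reward that trades off cooling power against thermal limits. All names and dynamics here (LiquidCoolingEnv, the state variables, the reward terms) are illustrative assumptions for exposition, not the authors' digital twin or controller.

```python
# Minimal, hypothetical sketch of an RL control loop for liquid cooling.
# The environment, state layout, and reward shaping are assumptions, not
# the paper's actual Frontier digital twin.
import numpy as np

class LiquidCoolingEnv:
    """Toy stand-in for a liquid-cooling digital twin.

    State:  [coolant supply temp (degC), rack inlet temp (degC), IT load (0-1)]
    Action: [pump speed, cooling tower fan speed], each normalized to [0, 1]
    """
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = np.array([30.0, 35.0, 0.6])
        return self.state.copy()

    def step(self, action):
        pump, fan = np.clip(action, 0.0, 1.0)
        # IT load drifts stochastically, bounded away from zero.
        load = np.clip(self.state[2] + 0.05 * self.rng.standard_normal(), 0.2, 1.0)
        # Crude thermal response: higher fan/pump effort lowers temperatures.
        supply = 25.0 + 10.0 * load - 6.0 * fan
        inlet = supply + 8.0 * load - 5.0 * pump
        self.state = np.array([supply, inlet, load])
        # Cubic fan/pump affinity laws approximate cooling energy use.
        cooling_power = 0.4 * pump**3 + 0.6 * fan**3
        # Penalize exceeding an assumed 40 degC rack inlet limit.
        thermal_penalty = max(0.0, inlet - 40.0)
        reward = -(cooling_power + 2.0 * thermal_penalty)
        return self.state.copy(), reward, False, {}

env = LiquidCoolingEnv()
obs = env.reset()
for t in range(5):
    action = np.array([0.5, 0.5])  # placeholder for a trained RL policy
    obs, reward, done, info = env.step(action)
    print(f"t={t} inlet={obs[1]:.1f}C reward={reward:.3f}")
```

In a centralized setup, one such agent would receive the full state and emit both cooling-tower and server-level actions; a decentralized multi-agent variant would split the action vector across agents with local observations.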