An Exascale Computing Project–funded team has developed MemHC, a GPU memory management framework that optimizes the calculation of many-body correlation functions. This computational kernel, fundamental to modern physics applications, is both compute and memory intensive. MemHC accelerates the calculation of many-body correlation functions with a series of new memory reduction designs. Other recent efforts have focused on optimizing individual tensor contractions, which leads to suboptimal performance. MemHC instead optimizes memory management across contractions. This reduces redundant GPU memory allocations, redundant CPU–GPU communication, and GPU oversubscription, which makes the calculations more efficient. The framework is portable across platforms with NVIDIA and AMD GPUs. The team’s work was published in the March 2022 issue of ACM Transactions on Architecture and Code Optimization.
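To illustrate how redundant allocations can be avoided across contractions, here is a minimal host-side sketch. It assumes tensors are identified by a string ID and that "device memory" is only simulated bookkeeping. The class `LazyDedupPool` and its methods are hypothetical and are not MemHC's actual API. A tensor that is already resident is shared rather than re-allocated. Releasing a tensor does not free it right away, so a later contraction can reuse it. Unreferenced blocks are freed only when the pool would otherwise overflow.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tensor_id: str
    nbytes: int
    refs: int = 0  # number of active users (e.g. contractions reading it)


class LazyDedupPool:
    """Hypothetical simulated GPU memory pool (host-side bookkeeping only)."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks: dict[str, Block] = {}
        self.allocations = 0  # count of simulated device allocations

    def acquire(self, tensor_id: str, nbytes: int) -> Block:
        # Duplication awareness: a resident tensor is shared, not re-allocated.
        block = self.blocks.get(tensor_id)
        if block is None:
            self._make_room(nbytes)
            block = Block(tensor_id, nbytes)
            self.blocks[tensor_id] = block
            self.used += nbytes
            self.allocations += 1
        block.refs += 1
        return block

    def release(self, tensor_id: str) -> None:
        # Lazy release: dropping the last reference keeps the block resident
        # so a later contraction needing the same tensor can reuse it.
        self.blocks[tensor_id].refs -= 1

    def _make_room(self, nbytes: int) -> None:
        # Free unreferenced blocks (oldest first) only when space is needed.
        for tid in [t for t, b in self.blocks.items() if b.refs == 0]:
            if self.used + nbytes <= self.capacity:
                break
            self.used -= self.blocks.pop(tid).nbytes
        if self.used + nbytes > self.capacity:
            raise MemoryError("pool oversubscribed")


pool = LazyDedupPool(capacity_bytes=300)
pool.acquire("A", 100)
pool.acquire("B", 100)
pool.release("A")
pool.release("B")
pool.acquire("A", 100)  # still resident: reused without a new allocation
pool.acquire("C", 200)  # forces the unreferenced block "B" to be freed
```

Deferring frees until space is actually needed trades some memory headroom for fewer allocations and fewer host-to-device copies. This fits workloads where the same intermediate tensors recur across many contractions.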
Many-body correlation functions are widely used in physics applications such as lattice quantum chromodynamics and are critical for computing physics observables, such as the properties of light nuclei. Calculating these functions is inefficient for three reasons. First, it is difficult to fully utilize GPU computing power. Second, the calculation produces voluminous intermediate results, which add complexity and may overwhelm available GPU memory. Third, the lack of data reuse generates a large number of GPU input/output operations. MemHC addresses these problems in three ways. It employs duplication-aware management and lazy release of GPU memory for better data reuse (e.g., intermediate outputs used as inputs for subsequent contractions). It implements data reorganization and on-demand synchronization to eliminate redundant or unneeded data transfers between CPUs and GPUs. It also uses a Pre-Protected LRU (least recently used) eviction policy to reduce evictions and increase memory hits.
In tests, MemHC achieved 2.17–10.73× higher GFLOPS than unified memory management for general correlation functions. For three real-world physics correlation functions, it achieved a 3.56–6.12× speedup in execution time and 3.56–6.08× higher GFLOPS. MemHC’s optimized (Pre-Protected) LRU eviction policy outperformed the original LRU policy by up to 1.36×.
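The sketch below illustrates how an LRU policy can be "pre-protected." It assumes that the contraction schedule is known in advance. Under that assumption, the tensors needed by the next few contractions can be shielded from eviction. This interpretation is ours, and MemHC's exact policy may differ. The class `PreProtectedLRU`, its `protect` method, and the `load` callback (standing in for a host-to-device copy) are hypothetical. On eviction, the cache skips protected keys. It falls back to plain LRU only if every resident entry is protected.

```python
from collections import OrderedDict
from typing import Callable, Iterable


class PreProtectedLRU:
    """Hypothetical LRU cache whose eviction skips protected keys (capacity >= 1)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[str, object] = OrderedDict()  # LRU first
        self.protected: set[str] = set()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def protect(self, upcoming: Iterable[str]) -> None:
        # Assumed pre-protection: keys needed by the next scheduled
        # contractions are exempt from eviction.
        self.protected = set(upcoming)

    def get(self, key: str, load: Callable[[str], object]) -> object:
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self._evict()
        value = load(key)  # stands in for a host-to-device copy on a miss
        self.entries[key] = value
        return value

    def _evict(self) -> None:
        # Scan from least to most recently used, skipping protected keys;
        # if everything is protected, evict the plain-LRU victim.
        victim = next((k for k in self.entries if k not in self.protected),
                      next(iter(self.entries)))
        del self.entries[victim]
        self.evictions += 1


cache = PreProtectedLRU(capacity=2)
cache.get("A", str.lower)
cache.get("B", str.lower)  # "A" is now least recently used
cache.protect(["A"])       # "A" is an input of the next contraction
cache.get("C", str.lower)  # plain LRU would evict "A"; this evicts "B"
```

In a contraction loop, one would call `protect` before each step, passing the inputs of the next few contractions. A tensor that is about to be reused then stays resident even if it has not been touched recently, which avoids an eviction followed immediately by a reload.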
Future work includes extending MemHC to more types of hadronic systems and further optimizing high-rank tensor contractions, such as those in tetra systems based on 4D tensors, which are much more demanding in both memory utilization and computational cost. The team also plans to extend the framework to multinode GPU clusters and to optimize intranode and internode communication, including asynchronous data copies and data prefetching.
Qihan Wang, Zhen Peng, Bin Ren, Jie Chen, and Robert G. Edwards. “MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation.” 2022. ACM Transactions on Architecture and Code Optimization (March).