Researchers funded by the Exascale Computing Project have developed a method that relieves overloaded MPI-IO communication by adding a second layer of I/O request aggregation. Their method, called TAM, for two-phase aggregation method, combines data within each node before performing internode optimizations that accelerate I/O. Carrying out communication in two layers reduces contention at the global aggregators, allowing collective I/O to scale up and improving communication speed.
MPI-IO’s two-phase I/O strategy, whose collective functions require all processes to participate in each I/O request, has delivered high performance in parallel computing, but at exascale it is expected to become a performance bottleneck, with higher costs than independent I/O. The new method adds a tool to the high-performance computing (HPC) toolbox, enabling richer output and faster time-to-science.
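The communication pattern behind two-phase I/O can be illustrated with a simplified, hypothetical model (the function names, the region-partitioning scheme, and the message accounting below are illustrative assumptions, not ROMIO's actual implementation): every process ships each of its noncontiguous requests to the global aggregator owning that file region, and each aggregator then performs one contiguous write.

```python
# Hypothetical, simplified model of two-phase collective write.
# Each process holds noncontiguous (offset, data) requests. In the
# communication phase, every process sends each request to the global
# aggregator owning that file region; in the I/O phase, each aggregator
# merges its requests into one sorted, contiguous write.

def owning_aggregator(offset, region_size):
    """Map a file offset to the global aggregator owning that region."""
    return offset // region_size

def two_phase_write(requests_per_proc, n_aggregators, file_size):
    region = file_size // n_aggregators
    inboxes = [[] for _ in range(n_aggregators)]  # per-aggregator requests
    messages = 0
    # Communication phase: all processes -> global aggregators.
    for proc_requests in requests_per_proc:
        for offset, data in proc_requests:
            inboxes[owning_aggregator(offset, region)].append((offset, data))
            messages += 1  # one message per (process, request) pair
    # I/O phase: each aggregator writes its region contiguously.
    writes = [sorted(inbox) for inbox in inboxes]
    return writes, messages
```

In this toy model the message count grows with the total number of process-level requests, which is the contention the article describes at large scale:

```python
reqs = [[(0, b"a"), (8, b"b")], [(4, b"c"), (12, b"d")]]
writes, msgs = two_phase_write(reqs, 2, 16)
# msgs == 4: every process messages the global aggregators directly
```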
TAM adds an intranode request aggregation layer: local aggregators coalesce requests from processes on the same node into fewer, contiguous requests, then forward them across nodes to global aggregators, which complete the I/O on behalf of the group. The researchers implemented TAM in ROMIO and benchmarked its performance against traditional two-phase I/O using E3SM-IO, S3D-IO, and BTIO.
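TAM's extra layer can be sketched in the same toy model (again a hypothetical simulation, not the ROMIO implementation): a local aggregator per node first coalesces its node's requests, so only local aggregators exchange internode messages with the global aggregators, reducing that traffic from roughly one message per process-level request to at most one per (node, global aggregator) pair.

```python
# Hypothetical sketch of TAM's intranode aggregation layer. One local
# aggregator per node coalesces the node's requests and bins them by
# destination global aggregator, so internode traffic is at most one
# message per (node, global aggregator) pair.

def tam_write(requests_per_node, n_aggregators, file_size):
    region = file_size // n_aggregators
    inboxes = [[] for _ in range(n_aggregators)]
    internode_messages = 0
    for node_requests in requests_per_node:  # one entry per compute node
        # Intranode layer: local aggregator gathers and sorts requests
        # from the node's processes into a coalesced, contiguous list.
        coalesced = sorted(r for proc in node_requests for r in proc)
        bins = {}
        for offset, data in coalesced:
            bins.setdefault(offset // region, []).append((offset, data))
        # Internode layer: one message per destination global aggregator.
        for agg, reqs in bins.items():
            inboxes[agg].extend(reqs)
            internode_messages += 1
    writes = [sorted(inbox) for inbox in inboxes]
    return writes, internode_messages
```

With the same four requests as before, but both processes placed on one node, the internode message count drops from 4 to 2 while the aggregators' final contiguous writes are unchanged:

```python
reqs_per_node = [[[(0, b"a"), (8, b"b")], [(4, b"c"), (12, b"d")]]]
writes, msgs = tam_write(reqs_per_node, 2, 16)
# msgs == 2: only the local aggregator talks to the global aggregators
```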
On NERSC Cori KNL nodes, for example, the MPICH ROMIO collective write bandwidth for the I/O kernel of the E3SM F case drops from ~600–700 MB/s on a small number of compute nodes to <100 MB/s when the problem scales to 16K processes on 256 compute nodes, due to contention in the communication phase of two-phase I/O. At the same scale, TAM maintains a collective write bandwidth of 700 MB/s.
Their experiments showed that the new method works best for applications that run many processes per node and exhibit a high degree of noncontiguity in their accesses.
Co-author Rob Ross recently received the Ernest Orlando Lawrence Award, one of DOE’s highest honors, for “significant research contributions in the areas of scientific data storage and management, and communication software and architectures; and leadership in major DOE initiatives such as the SciDAC program.”
Kang, Qiao, Sunwoo Lee, Kaiyuan Hou, Robert Ross, Ankit Agrawal, Alok Choudhary, and Wei-keng Liao. 2020. “Improving MPI Collective I/O for High Volume Non-Contiguous Requests With Intra-Node Aggregation.” IEEE Transactions on Parallel and Distributed Systems 31 (11): 2682–2695. doi:10.1109/TPDS.2020.3000458.