Figure 9. Timelines of (top) CG with PetscSF + CUDA-aware MPI and (bottom) CGAsync with PetscSF + NVSHMEM, on rank 2 of a test run with 6 MPI ranks (GPUs) on a Summit compute node. Each solver ran 10 iterations. The blue csr… bars are csrMV (i.e., SpMV) kernels in cuSPARSE, and the red c… bars are cudaMemcpyAsync() calls that copy data from device to host.