Visualization of HPCToolkit’s hpcprof-mpi using 8 nodes employing 63 threads each (504 threads total) to analyze performance data collected for a test case of the LAMMPS code running on 2K CPUs and 2K GPUS. hpcprof-mpi analyzes 38.1 GB of performance measurement data in 41 seconds. The prominent speckled region at top shows the team’s novel, highly parallel streaming aggregation approach at work. The lack of white space or solid color bands in this central region shows that the streaming aggregation approach does not stall or wait due to synchronization. The series of bands at right show the activity as hpcprof-mpi performs a sparse out-of-core transpose of performance data and writes its analysis results.