by Sameer Shende
As we near this important milestone in High Performance Computing, I believe we may need to step back, examine the key choices we have made so far, and refactor our codes to take advantage of such extreme levels of performance. While I have no doubt that we will reach the performance goals on key HPC benchmarks, achieving such high levels of performance in a sustainable manner for applications will require a careful orchestration of several components. These include appropriate programming models and runtime systems that target heterogeneous GPU devices, the management of threads on the CPU, the coordination of the I/O subsystem across a hierarchy of storage devices, the efficient utilization of network interfaces that may be connected directly to the GPUs, compilers, libraries to express parallelism, and finally tools to observe application performance. Performance evaluation tools will play a key role in the performance engineering process. These tools will need to support interfaces to runtime systems operating on CPUs and GPUs from competing vendors. To observe the performance of an application in a meaningful manner, tools will need to instrument higher-level runtime systems and map performance data from low-level executions back to higher levels of abstraction that make sense to the user. Libraries such as Kokkos and RAJA will play a key role in taming the complexity of expressing node-level parallelism, but users will need support from a mature ecosystem of compilers and tools to optimize their codes.
To reach these extreme levels of performance, application developers will also need to formulate sufficiently large problems and partition them efficiently to execute on an ever-increasing number of nodes. It will also be an interesting time for application developers as they grapple with the complexity of installing software tools and libraries in a consistent and reliable way that ensures GPU resources are used optimally across layers of the software stack. The Extreme-scale Scientific Software Stack (E4S) will help reduce the complexity of assembling the software stack correctly. Ultimately, however, choosing appropriate tools and libraries that can scale to the desired levels of performance, and balancing the rates of execution across memory, the I/O subsystem, GPUs, and network devices, may require a careful refactoring of our codes. It will certainly be a time for introspection, and our ability to observe performance at multiple levels of the software stack – from programming models to runtime systems – may ultimately be key as we iterate towards exascale levels of performance.