Quantitatively Assessing Performance Portability with Roofline

The IDEAS Productivity project, in partnership with the DOE Computing Facilities of the ALCF, OLCF, and NERSC and the DOE Exascale Computing Project (ECP) has resumed the webinar series on Best Practices for HPC Software Developers, which we began in 2016.

As part of this series, we offer one-hour webinars on topics in scientific software development and high-performance computing, approximately once a month. The next webinar is titled Quantitatively Assessing Performance Portability with Roofline, and will be presented by John Pennycook (Intel), Charlene Yang (Lawrence Berkeley National Laboratory) and Jack Deslippe (Lawrence Berkeley National Laboratory). The webinar will take place on Wednesday, January 23, 2019 at 1:00 pm ET.

Abstract:

Wouldn’t it be great if we could port a code to a new high-performance architecture without substantially changing the code yet achieving a similar level of performance as hand-optimized code? This webinar will frame the discussion around ‘performance portability’, why it is important and desirable, and how to quantitatively measure it. The webinar started with a background check on how the concept of performance portability came about and past attempts to define it and quantify it. Then the speaker introduced a simple yet powerful metric and an empirical methodology to quantitatively assess a code’s performance portability across multiple platforms. The methodology uses the Roofline performance model to measure an ‘architectural efficiency’ term in the metric proposed by Pennycook et al. The speaker then dove into a few nuances of this methodology, for example, how and why empirical ceilings should be used for performance bounds, how to accurately account for complex instructions such as divides, how to model strided memory accesses, and how to select the appropriate Roofline ceilings and application performance points to make sure that the performance portability analysis is not erroneously skewed. We also showed some results of measuring performance portability using the aforementioned metric and methodology on two modern architectures, Intel Xeon Phi and NVIDIA V100 GPUs.