Apr
7
Wed
A Workflow for Increasing the Quality of Scientific Software
Apr 7 @ 1:00 pm – 2:00 pm

The IDEAS Productivity project, in partnership with the DOE Computing Facilities of the ALCF, OLCF, and NERSC and the DOE Exascale Computing Project (ECP), has resumed the webinar series on Best Practices for HPC Software Developers, which we began in 2016.

As part of this series, we offer one-hour webinars on topics in scientific software development and high-performance computing, approximately once a month. The April webinar is titled A Workflow for Increasing the Quality of Scientific Software, and will be presented by Tomislav Maric (TU Darmstadt). The webinar will take place on Wednesday, April 7, 2021 at 1:00 pm ET.

Abstract:

The webinar will present a workflow that increases the quality of research software in Computational Science and Engineering (CSE) by applying established software engineering practices extended with CSE-specific testing and visualization, and periodic cross-linking of software with reports/publications and datasets. The workflow is minimalistic. It introduces only a small amount of overhead, which is crucial for research groups without dedicated funding for ensuring the quality of research software and reproducibility of scientific results.

Apr
12
Mon
Automating Application Performance Analysis with Caliper, SPOT, and Hatchet
Apr 12 all-day

This tutorial was held on April 12, 2021 as part of the 2021 ECP Annual Meeting.

At LLNL, we have developed a workflow enabling users to automate application performance analysis and pinpoint bottlenecks in their codes. Our workflow leverages three open-source tools – Caliper, SPOT, and Hatchet – to provide a holistic suite for integrated performance data.

In this tutorial, the presenters provided an overview of each of the tools, and demonstrated how to profile your applications with Caliper, how to visualize your performance data in SPOT, and how to programmatically analyze your data with Hatchet. Caliper is a performance analysis toolbox in a library. It provides performance profiling capabilities for HPC applications, making them available at runtime for any application run. This approach greatly simplifies performance profiling tasks for application end users, who can enable performance measurements for regular program runs without the complex setup steps often required by specialized performance debugging tools.

SPOT is a web-based tool for visualizing application performance data collected with Caliper. SPOT visualizes an application’s performance data across many runs. Users can track performance changes over time, compare the performance achieved by different users, or run scaling studies across MPI ranks. With a high-level overview of an application’s performance, users are also quickly able to identify data that they might be interested in analyzing in finer-grained detail. Hatchet is a Python-based tool for analyzing and visualizing performance data generated by popular profiling tools, such as Caliper, HPCToolkit, and gprof.

With Hatchet, users can write small code snippets to answer questions such as: What speedup am I getting from using the GPUs? Which portions of my code are scaling poorly? What differences exist in using one MPI implementation over another?

To answer these questions, Hatchet provides operations (e.g., sub-selection, aggregation, arithmetic) to analyze and visualize calling context trees and call graphs from one or multiple executions.
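As a rough illustration of the kind of question listed above, the following plain-Python sketch computes a per-region speedup between two runs and then sub-selects one part of the call tree. It deliberately does not use Hatchet's API: the dictionaries `cpu_run` and `gpu_run`, the helper names `speedup` and `select`, the region names, and all timings are invented for the example and merely stand in for calling-context-tree nodes and their measured times.

```python
# Hypothetical per-region timings (seconds) from two runs of the same code.
# Keys are call paths; values are inclusive time. All numbers are invented.
cpu_run = {
    "main": 10.0,
    "main/solve": 8.0,
    "main/solve/matvec": 6.0,
    "main/io": 1.5,
}
gpu_run = {
    "main": 4.0,
    "main/solve": 2.0,
    "main/solve/matvec": 1.0,
    "main/io": 1.5,
}


def speedup(baseline, candidate):
    """Divide baseline time by candidate time for every region present in both."""
    return {
        path: baseline[path] / candidate[path]
        for path in baseline
        if path in candidate and candidate[path] > 0
    }


def select(profile, prefix):
    """Sub-select the regions at or below a given call path."""
    return {
        path: value
        for path, value in profile.items()
        if path == prefix or path.startswith(prefix + "/")
    }


ratios = speedup(cpu_run, gpu_run)
solver_only = select(ratios, "main/solve")
for path, ratio in sorted(solver_only.items()):
    print(f"{path}: {ratio:.1f}x")
```

In this made-up data the I/O region shows no speedup at all, which is the sort of observation the arithmetic and sub-selection operations are meant to surface.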

LLNL-ABS-818869

Cabana Tutorial
Apr 12 all-day

This tutorial was held on April 12, 2021, as part of the 2021 ECP Annual Meeting.

This tutorial explained the design and use of the CoPA-developed Cabana Particle Simulation Toolkit and provided hands-on exercises in the form of simple examples and proxy applications. A full description of the library, connections to other ECP projects, and existing use cases began the session. Cabana uses Kokkos for on-node performance portability, extending to particle-specific sub-motifs, and MPI for multi-node simulations. Cabana provides capabilities for nearly all particle-based applications: neighbor list generation, particle redistribution, halo exchange, particle-grid interpolation, and more. A hands-on, interactive walk-through of the tutorial available within the Cabana repository introduced users to the library. Descriptions and code demonstrations of current use cases, including both proxy apps and production code kernels, exemplified what can be done with Cabana. Key design decisions for performance were an emphasis of this last section.
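To make "neighbor list generation" concrete, here is a minimal brute-force sketch in plain Python. It is not Cabana's API (Cabana is a C++ library built on Kokkos) and it is not meant to reflect Cabana's algorithms; the function name `build_neighbor_list` and the particle coordinates are assumptions made up for illustration.

```python
import math


def build_neighbor_list(positions, cutoff):
    """Return, for each particle index, the indices of particles within `cutoff`.

    Brute-force O(N^2) search over all pairs; production codes typically
    use cell lists or Verlet lists to avoid checking every pair.
    """
    neighbors = [[] for _ in positions]
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= cutoff:
                # The relation is symmetric, so record it for both particles.
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


# Four made-up particles in 3D.
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 2.0, 0.0), (0.4, 0.3, 0.0)]
print(build_neighbor_list(particles, cutoff=1.0))
```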

Developing a Testing and Continuous Integration Strategy for your Team
Apr 12 all-day

This tutorial was held on April 12, 2021 as part of the 2021 ECP Annual Meeting.

A thorough and robust testing regime is central to the productive development, evolution, and maintenance of quality, trustworthy scientific software. Continuous integration, though much discussed, is just one element of such a testing regime. Most project teams feel that they could (and should) do a “better job” of testing. In many cases, designing and implementing a strong testing strategy can seem so daunting that it is hard to know where to start.

In this tutorial, which was aimed at those with beginner to intermediate levels of comfort with testing and continuous integration, we briefly reviewed the multiple motivations for testing, and the different types of tests that address them. We discussed some strategies for testing complex software systems, and how continuous integration testing fits into the larger picture. In accompanying hands-on activities, available for self-study, we demonstrated how to get started with a very simple level of CI testing.
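A very simple starting point of the kind described could be a handful of automated checks that a CI service runs on every push. The sketch below is not taken from the tutorial materials: the function `trapezoid` is a hypothetical piece of code under test, and the test functions follow the common `test_` naming convention so that a runner such as pytest could discover them.

```python
import math


def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def test_linear_function_is_exact():
    # The trapezoid rule integrates linear functions exactly (up to round-off).
    assert math.isclose(trapezoid(lambda x: 2 * x, 0.0, 1.0, 4), 1.0)


def test_converges_for_sine():
    # The integral of sin on [0, pi] is 2; compare against a tolerance
    # rather than testing floating-point values for equality.
    approx = trapezoid(math.sin, 0.0, math.pi, 1000)
    assert math.isclose(approx, 2.0, rel_tol=1e-5)
```

Comparing against a tolerance instead of exact equality is a typical choice for numerical code, where results can differ slightly across compilers and platforms.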

Apr
13
Tue
ADIOS Storage and in situ I/O Tutorial
Apr 13 all-day

This tutorial was held on April 13, 2021, as part of the 2021 ECP Annual Meeting.

As concurrency and complexity continue to increase on high-end machines, I/O performance is rapidly becoming a fundamental challenge to achieving exascale computing. To address this challenge, wider adoption of higher-level I/O abstractions will be critically important. The ADIOS I/O framework provides data models, portable APIs, storage abstractions, in situ infrastructure, and self-describing data containers. These capabilities enable reproducible science, allow for more effective data management throughout the data life cycle, and facilitate parallel I/O with high performance and scalability.

In this tutorial, participants learned about the ADIOS concept and APIs, and we showed a pipeline of simulation, analysis, and visualization, using both storage I/O and in situ data staging. Participants also learned how to use state-of-the-art compression techniques with ADIOS. Finally, we discussed how to use ADIOS on the LCFs for the best storage performance, in situ data analysis, and code coupling. This was a short, 2-hour tutorial that aimed to teach the basic concepts, demonstrate the capabilities, and provide pointers for interested parties to incorporate ADIOS into their science applications.

Apr
14
Wed
Introduction to Containers for HPC
Apr 14 all-day

This tutorial was held on April 14, 2021 as part of the 2021 ECP Annual Meeting.

Container computing has revolutionized the way applications are developed and delivered. It offers opportunities that never existed before for significantly improving efficiency of scientific workflows and easily moving these workflows from the laptop to the supercomputer. Tools like Docker, Shifter, Singularity and Charliecloud enable a new paradigm for scientific and technical computing. However, to fully unlock its potential, users need to understand how to utilize these new approaches.

This tutorial introduced attendees to the basics of creating container images, explained best practices, and covered more advanced topics such as creating images to be run on HPC platforms using various container runtimes. The tutorial also explained how research scientists can utilize container-based computing to accelerate their research and how these tools can boost the impact of their research by enabling better reproducibility and sharing of their scientific process without compromising security. This is an updated version of the highly successful tutorial presented at 5 SC Conferences and multiple ECP Summits.

Product Lifecycle Management – An Industry Perspective
Apr 14 all-day

This talk was held on April 14, 2021 as part of the 2021 ECP Annual Meeting and was a dialog between ECP leadership and experts from industry around product lifecycle management. Three industry speakers gave short presentations on how their companies initiate, develop, enhance, maintain, support and retire software products. Then ECP leadership joined the industry speakers for a panel, with questions from the moderators and attendees. This talk was a great opportunity to learn from industry and share ECP experiences and plans.

Apr
15
Thu
HPC System and Software Testing via Buildtest
Apr 15 all-day

This talk was held on April 15, 2021 as part of the 2021 ECP Annual Meeting.

An HPC computing environment is a tightly coupled system that includes a cluster of nodes and accelerators interconnected with a high-speed interconnect, a parallel file system, multiple storage tiers, a job scheduler, and a software stack for users to run their workflows. This environment is highly interdependent; therefore, it is essential to regularly test various components of the HPC system and the software stack. There has been significant progress in software build frameworks (Spack, EasyBuild) for installing software packages on HPC systems; however, there is little consensus on the testing front.

In this talk, we presented buildtest (https://buildtest.readthedocs.io/en/devel/index.html), an acceptance testing framework for HPC systems. In buildtest, tests are written in YAML files called ‘buildspecs’, which buildtest processes into shell scripts. These tests can be run locally or via a job scheduler (Slurm, LSF, or Cobalt). Buildtest supports a rich YAML structure for writing buildspecs, which is defined in JSON Schema so that buildspecs can be validated. Currently, buildtest supports two major schema types (compiler and script) for writing shell and Python scripts as well as single-source compilation tests.
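The general idea of turning a declarative test description into a shell script can be sketched in a few lines of plain Python. This is not buildtest's implementation or its schema: the dictionary keys (`shell`, `env`, `run`) and the function `render_script` are assumptions chosen purely for illustration, and in buildtest the description would live in a YAML buildspec rather than a Python dictionary.

```python
def render_script(name, spec):
    """Render a made-up, buildspec-like dictionary into shell-script text."""
    lines = [f"#!{spec.get('shell', '/bin/bash')}", f"# test: {name}"]
    # Export each declared environment variable before the commands run.
    for key, value in spec.get("env", {}).items():
        lines.append(f"export {key}={value}")
    lines.extend(spec["run"])
    return "\n".join(lines) + "\n"


# Hypothetical test description (not a valid buildtest buildspec).
hello = {
    "shell": "/bin/bash",
    "env": {"OMP_NUM_THREADS": "4"},
    "run": ['echo "threads: $OMP_NUM_THREADS"', "hostname"],
}

print(render_script("hello_threads", hello), end="")
```

Keeping the test description as data, separate from the generated script, is what makes it possible to validate the description against a schema before anything is executed.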

In this talk, we covered the core framework, its features, and writing tests (i.e., buildspecs) using the script and compiler schemas. In addition, we presented a summary of the Cori testsuite (https://github.com/buildtesters/buildtest-cori), which includes real tests for the Cori system at NERSC.

In January 2021, we deployed the Spack E4S 20.10 stack (https://docs.nersc.gov/applications/e4s/) on Cori for the NERSC user community. As part of this initiative, we test the E4S stack via the E4S testsuite (https://github.com/E4S-Project/testsuite) using buildtest with GitLab scheduled pipelines. We concluded this talk with a brief demo of buildtest and additional resources to get started.

Apr
16
Fri
Using Spack to Accelerate Developer Workflows Tutorial
Apr 16 all-day

This tutorial was held on April 16, 2021 as part of the 2021 ECP Annual Meeting.

Spack is an open source tool for HPC package management that simplifies building, installing, developing, and sharing HPC software stacks. It is the official deployment and distribution tool for ECP, and it allows ECP developers to easily leverage each other’s work. Spack continues to grow in popularity among end-users, HPC developers, and the world’s largest HPC centers. It provides a powerful and flexible dependency model, a simple Python syntax for writing package build recipes, and a repository of over 5,000 community-maintained packages. The modern scientific software stack is complex and spans C, C++, Fortran, Python, and R; Spack can help reduce the integration burden and allow developers to spend more time on science and less on the drudgery of deployment.
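One way to picture a dependency model is as a directed acyclic graph in which every package must be built after the packages it depends on. The sketch below orders such a graph in plain Python. It is not how Spack resolves or installs packages, and it does not use Spack's recipe syntax; the function `install_order`, the package names, and the edges are made up for illustration.

```python
def install_order(dependencies):
    """Return packages ordered so each appears after all of its dependencies.

    `dependencies` maps a package name to the names it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    order, done, in_progress = [], set(), set()

    def visit(pkg):
        if pkg in done:
            return
        if pkg in in_progress:
            raise ValueError(f"dependency cycle involving {pkg}")
        in_progress.add(pkg)
        # Visit dependencies first so they land earlier in the order.
        for dep in sorted(dependencies.get(pkg, ())):
            visit(dep)
        in_progress.remove(pkg)
        done.add(pkg)
        order.append(pkg)

    for pkg in sorted(dependencies):
        visit(pkg)
    return order


# Made-up dependency graph for illustration only.
graph = {
    "app": ["hdf5", "mpi"],
    "hdf5": ["mpi", "zlib"],
    "mpi": [],
    "zlib": [],
}
print(install_order(graph))
```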

This tutorial built significantly on past Spack tutorials, with a stronger focus on developer workflows. We covered the traditional topics of installation, package authorship, and Spack’s dependency model. We went in-depth on Spack environments and configuration, and gave examples of how Spack can be used to bootstrap a developer environment and concurrently develop multiple packages. Finally, we demonstrated how `spack external find` and Spack build caches (binary packages) can accelerate development and CI workflows. Participants could expect to come away from this tutorial with new skills, even if they had participated in Spack tutorials in the past.

Apr
19
Mon
Timemory ECP Tutorial
Apr 19 @ 12:00 pm – 3:00 pm

Software monitoring

Have you ever written a multi-level logging abstraction for your project? Created an error checking system? Written a high-level timer + label abstraction? Have you then added additional abstractions for logging data values and/or recording the memory usage? Did you add or want to add support for exporting these labels to external profilers like VTune, Nsight, TAU, etc.? Do you need to support flushing this data intermittently? If your answer to any of these questions is yes, this is the right tutorial for you.

Logging, error-checking, and high-level timekeeping abstractions are a staple in HPC applications. As projects grow in complexity and users, the developers often end up having to provide these abstractions because these capabilities are generally viewed as necessary for debugging, validation, and ensuring optimal performance. Timemory aims to simplify monitoring the state and performance of your application so that relevant debugging, logging, and performance data can be trivially enabled or disabled in a consistent and portable manner.
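For readers who have not written one, a hand-rolled "timer + label" abstraction of the kind described here might look like the following. This is plain Python using only the standard library, not timemory's API; the names `ENABLED`, `RESULTS`, and `timed` are hypothetical.

```python
import time
from contextlib import ContextDecorator

ENABLED = True   # global switch to turn measurement on or off
RESULTS = {}     # label -> list of elapsed wall-clock seconds


class timed(ContextDecorator):
    """Record wall-clock time under a label; usable as a decorator or `with` block."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        # Skip the clock read entirely when measurement is disabled.
        self._start = time.perf_counter() if ENABLED else None
        return self

    def __exit__(self, *exc):
        if self._start is not None:
            elapsed = time.perf_counter() - self._start
            RESULTS.setdefault(self.label, []).append(elapsed)
        return False  # never swallow exceptions


@timed("assemble")
def assemble(n):
    return sum(i * i for i in range(n))


assemble(10_000)
with timed("solve"):
    sum(range(10_000))

for label, samples in RESULTS.items():
    print(f"{label}: {len(samples)} call(s), {sum(samples):.6f} s total")
```

Even this small version already needs a global switch, a result store, and a reporting loop, which hints at how such abstractions grow once memory usage, MPI aggregation, or export to external profilers are added.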

Why timemory?

Timemory is designed as a toolkit for implementing profiling, debugging, and logging solutions as well as providing a holistic profiling solution. If you would like to keep all your current abstractions and only want type-safe handles for invoking groups of them in bulk, timemory can provide that; if you would like to simplify aggregating the data from different MPI/UPC++ ranks, timemory can provide that; if you only want to add support for exporting to JSON/XML/etc., timemory can provide that; if you want to create a new command-line tool which combines different measurements, timemory can provide the components to easily do that; if you want a holistic solution that you can easily extend or restrict, timemory can provide that.

What is timemory?

Timemory is a multi-purpose C++ toolkit and suite of C/C++/Fortran/Python tools for performance analysis, optimization studies, logging, and debugging. The primary objective of timemory is to create a universal instrumentation framework which streamlines building software monitoring interfaces and tools by coupling the inversion of control programming principle with C++ template metaprogramming. The original intention of the toolkit design was specific to performance analysis; however, it was later realized that the design allowed debugging and logging abstractions to co-exist seamlessly with the performance analysis abstractions.
The design allows developers to construct production quality implementations which couple application-specific software monitoring requirements with third-party tools and libraries. In order to help ensure this objective is fully realized, timemory provides a number of pre-built implementations of a generic C/C++/Fortran library interface, compiler instrumentation, dynamic instrumentation, various popular frameworks such as MPI, OpenMP, NCCL, and Kokkos, Python bindings, and an extended analogue of the UNIX time command-line tool.
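Inversion of control here can be read as follows: the application only declares which measurements it wants, and the framework decides when each one starts and stops. The sketch below illustrates that idea in Python rather than with C++ templates. The classes `WallClock`, `CallCounter`, and `Bundle` are hypothetical and do not mirror timemory's actual types or interfaces.

```python
import time


class WallClock:
    """Component that accumulates elapsed wall-clock time."""

    def __init__(self):
        self.value = 0.0

    def start(self):
        self._t0 = time.perf_counter()

    def stop(self):
        self.value += time.perf_counter() - self._t0


class CallCounter:
    """Component that counts how many times a region was entered."""

    def __init__(self):
        self.value = 0

    def start(self):
        self.value += 1

    def stop(self):
        pass


class Bundle:
    """Drives any set of components that provide start() and stop()."""

    def __init__(self, label, *component_types):
        self.label = label
        self.components = [ctype() for ctype in component_types]

    def __enter__(self):
        for component in self.components:
            component.start()
        return self

    def __exit__(self, *exc):
        # Stop in reverse order so the first-started component stops last.
        for component in reversed(self.components):
            component.stop()
        return False

    def report(self):
        return {type(c).__name__: c.value for c in self.components}


region = Bundle("inner_loop", WallClock, CallCounter)
for _ in range(3):
    with region:
        sum(range(1000))
print(region.label, region.report())
```

Adding a new measurement in this sketch means writing one more class with `start()` and `stop()`; the instrumented code itself does not change.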

Does HPC need another profiling tool?

No. HPC has a surplus of performance analysis tools and APIs: VTune, NSight, TAU, Caliper, Score-P, Callgrind, LIKWID, Arm-MAP, CrayPAT, OpenSpeedShop, ittnotify, NVTX, PAPI, CUPTI, MPI-P, MPI-T, OMPT, gperftools, ROC-profiler, ROC-tracer, and innumerable application-specific abstractions which perform anything from basic timekeeping and memory usage to implementations and callbacks for the aforementioned APIs. We designed timemory as a way to easily integrate and maintain the exact set of measurements/tools/features you want to support with an interface best suited for your application.

Contents of the Tutorial

This is a preliminary outline of the tutorial. The tutorial is divided into two days. The first day will cover the front-end tools for C/C++/Fortran/CUDA/Python. The second day will cover how to use the C++ toolkit. The interactive tutorials will be held on Mondays: 9:00 AM – 12:00 PM PT (12:00 PM – 3:00 PM ET).

Day 1: Tools and Library (04/19/2021)

Introduction to timemory

  • Motivation
  • Design philosophy and nomenclature
  • Installation

Command-line Tools

  • timemory-avail — information tool
  • timem — UNIX time + more
  • timemory-run — dynamic instrumentation and binary re-writing
  • timemory-plotter — matplotlib plotting of results
  • timemory-roofline — generate the roofline

Library API

  • Compiler instrumentation
  • Extern C interface

Python API

  • Decorators and context-managers
  • Iterating over results in-situ

Python Command-Line Tools

  • timemory-python-profiler — python function profiler
  • timemory-python-trace — python line-by-line tracing
  • timemory-line-profiler — classic line-profiler tool extended to collect different metrics

Visualizing and Analyzing Results

  • Converting timemory data to pandas dataframes via Hatchet
  • Manipulating dataframes
  • Visualizing in Jupyter notebooks

Day 2: C++ and Python Toolkit (04/26/2021)

Python

  • Using Individual Components to build your own tools

C++

  • Creating a new component
  • Using a custom component for timemory-run
  • Designing a customized profiling API for your project
  • Designing a customized debugging/logging interface for your project
    • Wrapping externally defined functions
    • Creating profiling/debugging libraries for your project
    • Insert measurements/logging/error-checking around C/C++ function calls
    • Auditing incoming arguments and return values
  • Replacing externally defined functions
    • Experiment with mixed-precision without modifying original source code

How to Attend

  • The lecture series is available to everyone.
  • No-cost registration is necessary; a meeting password will be sent to registrants.
  • For the exercises, timemory can be installed locally or registrants may use a provided Docker image.

Presenters

  • Jonathan Madsen
  • Laurie Stephey
  • Muaaz Gul Awan
  • Rahulkumar Gayatri

Tutorial Material
Recording – Day 1
Recording – Day 2