Doug Kothe, ECP Director
The start of a new year inherently evokes a sense of anticipation of what exciting accomplishments and experiences may lie ahead. The Exascale Computing Project (ECP) has some particularly compelling reasons to feel energized and uplifted.
Soon after 2019 began, hundreds of our ECP team members, along with our US Department of Energy (DOE) sponsors and stakeholders, convened in Houston for our 3rd Annual Meeting. Multiple projects worked together on the crucial, overarching theme of integration, and many fruitful side discussions took place. In the end, we walked away with new actions, milestones, and tasks to tackle that we hadn’t appreciated before.
The annual meeting is essential for our planning and execution, and this year’s gathering was our best so far. Congratulations to Marta Garcia of Argonne National Laboratory and Doug Collins of Oak Ridge National Laboratory (ORNL) for their stellar work as committee chairs.
The venue was superb, as was the agenda, which was developed from the bottom up by the people in the technical research focus areas—Application Development (AD), Software Technology (ST), and Hardware and Integration (HI)—and the ECP Project Office. They proposed breakouts and tutorials to address specific challenges, along with a number of educational and training topics.
The meeting also presented an opportunity for senior principal investigators to sit down together and go through their projects’ war stories and journeys. What emerged were new commitments to collaborate and integrate products and technologies across ECP.
An especially pleasant surprise is how well senior researchers in computational science and data science have adapted their research plans and methodologies to a milestone-based, project-oriented formulation. ECP has struck a good balance between needing to perform high-risk exploratory research and yet deliver products and technologies.
This event was not a workshop or a conference, and the emphasis was not on rolling out the fruits of our efforts. But the fact is, we have numerous great results that clearly show we are on track to realize the vision of reaching exascale, and I will accentuate some of those promising outcomes in the following sections.
Highlights from AD
AD, composed of almost 40 research teams, has been executing for two and a half years, and the number of highlights that have emerged in the last six months has been astounding.
Before I share examples of some of AD’s successes, I’ll note that AD Director Andrew Siegel of Argonne and his team have performed a fairly rigorous assessment of the projects, brought in external reviewers, and documented the results in a capability assessment report. A public version of that highly detailed document will be made available soon.
Within AD are six co-design centers, in various stages of maturation. They focus on computational motifs, or common patterns of communication and computation. Each of these centers has been delivering abundant products.
The AMReX Co-Design Center, led by John Bell of Lawrence Berkeley National Laboratory, is deeply impacting five different codes and will likely do the same for many more down the road. The center will release new technologies, such as tiling for on-node performance and improved methods for pushing particles across boundaries in simulations. We’re seeing many applications using that technology.
The CEED Co-Design Center, led by Tzanio Kolev of Lawrence Livermore National Laboratory (LLNL), is focused on the development of unstructured meshes and partial-differential-equation-based solvers for finite elements. It’s also impacting a number of applications.
We have a particle co-design center, called COPA, and an online data-reduction center, known as CODAR. Those are under the direction of Tim Germann of Los Alamos National Laboratory (LANL) and Ian Foster of Argonne, respectively.
All of these centers are pushing out their technologies and directly enhancing applications. So our model for the co-designing of apps—which is focusing on the motifs and cross-cuts and hitting multiple apps—is really paying off.
Before I leave the discussion of co-design centers, I want to point out that ExaLearn, which has been funded and up and running for four months, has devised a really good, aggressive plan to impact a number of applications. The ExaLearn co-design center is led by Frank Alexander of Brookhaven National Laboratory.
Concerning some of the applications highlights, the ExaSMR project recently demonstrated an almost 40x speedup on the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) and a 20x speedup overall by refactoring their Monte Carlo algorithm from a history-based to an event-based approach. In doing so, they’re able to effectively use GPUs such as the Nvidia Volta on Summit.
ExaSMR’s 4x or 5x improvement beyond what the hardware alone provides exemplifies the kind of enhancement we’re aiming for in ECP. We’re not just seeking to exploit the hardware—we absolutely need to do that—but also to develop better algorithms.
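To make the history-based versus event-based distinction concrete, here is a minimal, hypothetical Python sketch (not the ExaSMR code; the particle count and absorption probability are invented). History-based tracking follows one particle’s whole life at a time, so each thread takes a different path; event-based tracking applies the same event to a whole batch of particles at once, which maps far better onto GPUs.

```python
import random

random.seed(0)
N_PARTICLES = 1000
ABSORB_PROB = 0.3   # hypothetical chance a collision absorbs the particle

def simulate_history_based():
    """Follow each particle's full history, one at a time. Each history
    branches differently, so GPU threads would diverge badly."""
    collisions = 0
    for _ in range(N_PARTICLES):
        alive = True
        while alive:
            collisions += 1
            if random.random() < ABSORB_PROB:
                alive = False   # absorbed; this history ends
    return collisions

def simulate_event_based():
    """Apply the same event to every particle awaiting it, as one batch:
    uniform work over a large array, which maps well onto GPUs."""
    alive = list(range(N_PARTICLES))
    collisions = 0
    while alive:
        collisions += len(alive)   # one collision event for the whole batch
        alive = [p for p in alive if random.random() >= ABSORB_PROB]
    return collisions

print("history-based collisions:", simulate_history_based())
print("event-based collisions:", simulate_event_based())
```

Both formulations simulate the same physics and give statistically identical answers; only the loop structure, and hence the hardware fit, differs.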
The NWChemEx chemistry application has significantly reduced its memory and communication footprint for the Cholesky-based decomposition of two-electron integrals. That’s a key kernel for electronic structure chemistry codes.
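To illustrate why a Cholesky-based decomposition shrinks the footprint, here is a hedged Python sketch (not the NWChemEx implementation; the matrix sizes are invented). A pivoted, early-terminating Cholesky factors a symmetric positive semidefinite matrix, such as a block of two-electron integrals, into a tall skinny factor, storing n·r numbers instead of n².

```python
import numpy as np

def partial_cholesky(V, tol=1e-8):
    """Pivoted, early-terminating Cholesky: factor a symmetric positive
    semidefinite n x n matrix V as L @ L.T using only r << n columns,
    so storage drops from n*n numbers to n*r."""
    d = np.diag(V).astype(float).copy()   # residual diagonal of V - L @ L.T
    cols = []
    while d.max() > tol:
        p = int(np.argmax(d))             # pivot: largest residual diagonal
        c = (V[:, p] - sum(q * q[p] for q in cols)) / np.sqrt(d[p])
        cols.append(c)
        d -= c * c
    return np.column_stack(cols)

# A rank-8 SPSD test matrix standing in for a two-electron-integral block
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 8))
V = A @ A.T                               # 100 x 100, but only rank 8
L = partial_cholesky(V)
print(L.shape)                            # (100, 8): 8 vectors suffice
print(np.allclose(L @ L.T, V))            # True: factorization is exact here
```

Real integral tensors are only numerically low rank, so the tolerance trades accuracy for memory, but the storage win scales the same way.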
Another app that’s made a big step forward is GAMESS—it is now interfaced with quantum Monte Carlo. The GAMESS team is experiencing good returns in their performance analysis and scaling up.
The accelerated molecular dynamics application, known as EXAALT, has demonstrated the ability to scale up to 272,000 cores, whereas before it could not scale well at all. By working closely with some ST efforts, such as the SLATE project, the EXAALT team is exploiting some very efficient DGEMM (double-precision general matrix multiply) operations, a core BLAS kernel. The EXAALT and SLATE projects are led by Art Voter of LANL and Jack Dongarra of the University of Tennessee, respectively.
We’ve seen QMCPACK, a quantum Monte Carlo materials science application, show a 6x to 7x speedup by delaying the updates associated with their Monte Carlo walkers. That example illustrates another instance of algorithmic improvement.
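The general idea behind delayed updates, sketched below in a toy numpy example (not the QMCPACK code; the dimensions are invented), is to accumulate several rank-1 updates and apply them all at once as a single matrix-matrix product, trading many memory-bound BLAS-2 operations for one compute-rich BLAS-3 operation that GPUs handle far better.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 64, 8
W = rng.standard_normal((n, n))

# k rank-1 updates u_i v_i^T, e.g. one per accepted Monte Carlo move
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

# Immediate path: k memory-bound rank-1 (BLAS-2) updates, one at a time
W_immediate = W.copy()
for i in range(k):
    W_immediate += np.outer(U[:, i], V[:, i])

# Delayed path: accumulate the updates, then apply them all at once as a
# single compute-rich matrix-matrix (BLAS-3) product
W_delayed = W + U @ V.T

print(np.allclose(W_immediate, W_delayed))   # True: same result
```

The answer is mathematically unchanged; only the arithmetic intensity improves, which is exactly the kind of algorithmic restructuring modern accelerators reward.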
While the first point of emphasis for our applications is to anticipate and prepare for the coming hardware and be ready to use it, we have to simultaneously rethink the algorithms. Consequently, all of the application efforts within ECP have unique plans to both reconceive—that is, refactor and redesign—the algorithms and exploit the hardware.
But there’s more to the story: A codependent arrangement exists between AD’s efforts and those of ST. AD’s needs and requirements must flow down to ST, and new capabilities in software deeply impact where the applications go.
I refer back to the upcoming assessment report that AD will soon publish, because it will provide a trove of information about what our apps are doing. We believe the insights will lead to many new potential interactions and collaborations.
Highlights from ST
ST Director Mike Heroux of Sandia National Laboratories (SNL) and his team have been together now for just over a year and have really mobilized to combine more development and deployment with their intensive research. For example, ST has already made two releases of the first version of the software development kits (SDKs) they’re creating. Their initial focus is on a math SDK.
Of the 90 or so tangible software products ECP is working on—and we don’t necessarily own them all; we’re contributing to a lot of community products—we are packaging them into approximately half a dozen SDKs organized around common themes. The math SDK is a good example, with linear and nonlinear dense and sparse solvers, eigensolvers, and the like bundled together.
ST’s initial release in November contained about 25 products, and the latest release, in December, had 37 products. Mike and his team are adopting a very agile methodology that allows for frequent releases. I encourage you to go to the website E4S.io and see the products ST is rolling out.
As the Applications group is doing, the ST team is releasing a capability assessment report on a fairly regular basis. ST issued its first report in July of last year, and another is planned for this month. The document will enable readers to peer into current and planned activities.
To call out a couple of ST highlights: first, the OpenMP effort has worked hard on deep-copy and memory-management APIs for heterogeneous architectures, and great improvements have resulted. ECP has also played a major role in the LLVM compiler infrastructure, and ST is ensuring OpenMP works well there by evolving the compiler and supporting C++, C, and Fortran front ends.
Although not something that leads to products, another essential function of ST is participation in community standards. ST staff sit on a number of standards committees. Examples are MPI, OpenMP, C++, OpenACC, Fortran, and the list goes on and on. The extent of ECP’s community standards involvement is itemized in the assessment report.
Another part of the assessment report deals with the very important aspect of abstraction layers, which have proved to be quite instrumental, where appropriate, in shielding applications from the compute and memory hierarchy. Examples of such abstractions are Kokkos at SNL and RAJA at LLNL. We believe the abstractions can and should lead to next-generation or future standards, and ECP is playing a role.
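Kokkos and RAJA are C++ libraries, but the abstraction-layer idea can be sketched in a few lines of Python. In this toy illustration (the `parallel_for` interface and policy names are invented for the example, not the Kokkos or RAJA API), application kernels are written once against a portable loop interface, and the execution backend is chosen separately per machine.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(n, body, policy="serial"):
    """Toy performance-portability layer. Application kernels call
    parallel_for once; the backend is selected per machine. Real layers
    such as Kokkos and RAJA dispatch to OpenMP, CUDA, etc. in C++."""
    if policy == "serial":
        for i in range(n):
            body(i)
    elif policy == "threads":
        with ThreadPoolExecutor() as pool:
            list(pool.map(body, range(n)))   # force evaluation
    else:
        raise ValueError(f"unknown policy {policy!r}")

# The kernel is written once, independent of where it runs
out = [0] * 10
parallel_for(10, lambda i: out.__setitem__(i, i * i), policy="threads")
print(out)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because only the policy changes between machines, the application code is shielded from the compute and memory hierarchy underneath, which is precisely what makes such layers good candidates for future standards.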
Highlights from HI
ECP’s HI focus area, led by Terri Quinn of LLNL, has existed for a little more than a year, during which time it has evolved from start-up to steady state to fully running. The scope of HI is broad and includes several key areas.
One of those areas is PathForward, which is funding US computer vendors’ R&D on node and system architecture designs. Nondisclosure agreements are required for discussion of some of the details, but that program has been executing since 2016, and the vendors are delivering fantastic results. We are confident that a portion of these technologies will appear in some of the exascale systems.
Among the pleasant surprises in ECP is the success so far of our resource loading and planning with respect to continuous software integration at facilities, an activity within Terri’s group led by Dave Montoya of LANL. This aspect of integration is metaphorically a handshake between Mike Heroux’s ST SDK products and the DOE HPC facilities.
As the SDKs are pushed out, they are continuously integrated and tested at the facilities, with an eye toward the six labs that have been leaders in hosting them. Eventually, users will see alpha or beta versions of our products at, say, the OLCF on Titan or Summit.
Another part of HI I want to briefly mention is application integration: the process of deploying our applications and getting them up and running on the systems at the facilities so that our facility, computer, and computational scientists can help with performance tuning. This is the final step across the finish line for ECP.
We now have three teams, sitting at the National Energy Research Scientific Computing Center, the Argonne Leadership Computing Facility, and the OLCF, that are targeting very specific ECP applications. This too is a formal handshake, this time between the AD team and the application specialists at the facilities. A lot of progress has been made in this effort, and we’re looking forward to the years to come.
Terri’s group is going to successfully take our products to journey’s end and deliver on the metrics that we promised to our DOE sponsors in ECP.
Keeping Everyone on the Same Page
The science and the products in ECP understandably receive a great deal of attention in our discussions. But what we don’t talk about enough is the element that makes everything we do possible: ECP’s Project Office, led by Kathlyn Boudwin of ORNL.
It’s important to point out that ECP’s scientists work closely with project management in a peer relationship that allows for cross-fertilization.
Our scientists have been learning various aspects of efficient project management that are critical for ECP but will also carry over to their future endeavors. Conversely, the project management team has been learning about the science and determining how to tailor their textbook Agile Project Management skills to research. The Project Office has performed brilliantly in its interactions with our technical teams in meeting the unique needs of ECP relative to areas such as risk, exploration, and return.
One of our principal requirements is that we be able to work from a central plan, and thanks to the Project Office, we’re doing that very well. Kathlyn’s team has adopted some commercial tools for Agile that allow everyone to see the entire plan, which is a crucial capability for integration, and thus for the success of ECP.
ECP’s New Deputy Director is Making a Big Impact
Lori Diachin of LLNL, who joined us as deputy director back in summer of last year, has been a breath of fresh air to ECP. From the get-go she humbly offered to help us in whatever ways were needed. She has rolled up her sleeves and done just that, applying her experience in technical leadership to some of our more difficult tasks.
She’s also challenged us in areas where she wondered why we were taking certain approaches and asked whether doing those things better or differently was possible. Besides adding value to ECP in a variety of ways, Lori is a real joy to work with. We’re not surprised at all, however—her reputation preceded her. We’re extremely lucky to have Lori.
ECP is, in fact, fortunate in a host of ways. As evidenced by the snapshots of success and momentum I’ve conveyed here, and by so much more we’re seeing across ECP, we have plenty of reasons to feel highly optimistic that we’re clearly on the road to realizing exascale computing.
Lori Diachin, Deputy Director
A Letter to the ECP Research Community
As you may know, I have been on board as the ECP deputy director for a little less than six months. Entering 2019 fully immersed in the myriad management functions of this complex project, I am both excited and honored to be ushering in the new year as a senior member of the project’s leadership team.
This is one of the largest and most important computing projects that DOE has ever undertaken, and I am proud to be a part of it.
I have been extremely impressed by the depth and quality of the technical work, the talent and commitment of the research staff, and the dedication and skills of the leadership team. All are outstanding, which was prominently noted by the reviewers in our most recent independent project review, held at the end of October.
I’d like to share what I find most exciting about the project—the many ways in which ECP is breaking new ground.
First, there is the breadth of scientific domains and new application areas that will be exascale-enabled by ECP. National policy decisions in critical areas ranging from energy and health to economic and national security will be impacted by the application codes that are being developed in, and would not be possible without, ECP.
The advances in physics capabilities and the innovations in algorithms and software implementations to improve scalability, portability, and sustainability are all critical to realizing the tremendous return on investment that is possible for the project.
Second, I have been involved with furthering DOE software my entire career, with activities ranging from developing new numerical libraries, to leading SciDAC projects that brought advanced numerical libraries to Office of Science applications and managing programs designed to foster the use of DOE high-performance computing (HPC) software tools in industrial applications. But I have never been involved in an effort of this scale and complexity that has as one of its primary goals the creation of interoperable and usable DOE software.
Thousands of years of experience are encapsulated in the libraries and tools supported by ECP, and the efforts to create turnkey installations in the software development kits (SDKs) should significantly improve users’ experiences of building and using multiple libraries together. Furthermore, the collaboration between the ECP and the DOE leadership-class facilities in the development of continuous integration tools represents a first-of-its-kind effort that will improve DOE software sustainability over the long term.
Third, I am a collaborative person by nature; and the ways in which ECP is bringing together the DOE community are inspiring.
As a result of ECP, stronger connections now exist between application developers and software technology experts as application teams make use of the new software tools to provide capabilities that enable new science, improve performance, or reduce the burden of interacting with increasingly complex computer architectures.
Increased interactions are also now present among software development teams in the creation of the SDKs and among the DOE Facility staff and application and software developers due to the efforts facilitated by the hardware and integration program element. I believe those connections and interactions will continue to grow well beyond the life of this project, bringing tremendous benefit to the US HPC ecosystem.
Finally, ECP is breaking new ground in project management practices. Although formal project management is a new area for me, I realize that this is the first time that such a large research, development, and deployment (RD&D) project has been managed under the formal DOE Order 413.3B. This has created a variety of challenges and opportunities that call for innovation and creative thinking.
The tools that have been developed in the Jira and Confluence platforms to plan projects and manage and track progress, and the metrics developed to define the success of ECP, are now generating significant interest from other large projects within DOE and from other federal agencies. ECP is establishing a model that others will learn from and follow in terms of modern project management.
Looking forward to this year, I anticipate we will all learn and grow together to ensure our collaboration and integration activities bear fruit, allowing the nation to see the progress that can be made when hundreds of researchers are pulling in the same direction. I am also excited about getting to know as many of you as possible and endeavoring to help foster the success of ECP.
ECP Receives HPCwire Editors’ Choice Award for Best HPC Collaboration of Government, Academia, and Industry
ECP announced it has been recognized by HPCwire with an Editors’ Choice Award for the project’s extensive collaborative engagement with government, academia, and industry in support of ECP’s effort to accelerate delivery of a capable exascale computing ecosystem as the nation prepares for the next era of supercomputers, machines capable of a quintillion operations per second.
Extreme-Scale Scientific Software Stack is Released
To coincide with the recent SC18 supercomputing conference in Dallas, ECP released a portion of the next version of its software stack, called the Extreme-Scale Scientific Software Stack, or E4S.
Perspective on the Extreme-Scale Scientific Software Stack Release (Source: The Next Platform)
An article on a high-performance computing industry blog provides the big picture surrounding ECP’s recent release of its software stack, E4S.
Optimizing a New Technology to Reduce Power Plant Carbon Dioxide Emissions
An ECP effort is developing a tool that will leverage future exascale supercomputers to enhance a new technology for carbon capture and storage.
Addressing the Challenge of Continuous Integration of Software at Department of Energy Facilities
A main aspect of ECP’s continuous integration activities is ensuring that the software in development for exascale can efficiently be deployed at the facilities and that it properly blends with the facilities’ many software components.
A Powerful Tool for Improving Parallel Computing Applications
Developers of parallel computing applications can well appreciate the Tuning and Analysis Utilities performance evaluation tool—it helps them optimize their efforts.
Aiming to Simulate the Universe with Maximal Computing Power
At the heart of the ExaSky project is work to develop a caliber of simulation that will use the coming exascale systems at full power.
Industry Leaders Prepare for Rice University Oil and Gas Conference (Source: insideHPC)
Lori Diachin, ECP project deputy director, will contribute as a speaker at the Rice University Oil and Gas HPC Conference in Houston, March 4–6, where the focus will be on the computational challenges and needs in the energy industry.
ECP’s Heroux, Alexander to Give Talks at HPC User Forum
Mike Heroux, Software Technology director, and Frank Alexander, principal investigator for the ExaLearn Co-Design Center, will speak at the 71st HPC User Forum, April 1–3, in Santa Fe, NM.