By Scott Gibson
The Let’s Talk Exascale podcast is joined in the latest episode by Hartwig Anzt of the University of Tennessee. He is co-principal investigator (PI) of both the PEEKS and xSDK4ECP projects within the Software Technology research focus area of the US Department of Energy’s Exascale Computing Project (ECP). He is also the technical PI of the multiprecision effort of the ECP math library consortium. In this role, he coordinates the multiprecision work in the distinct mathematical libraries of ECP to encourage cooperation and maximize synergies.
Several math library projects recently began exploring whether to abandon the strict IEEE double-precision paradigm and mix different precision formats to benefit from faster computation and communication in a lower-precision format. Instead of having different projects independently research multiprecision algorithms, ECP decided to synchronize these efforts and channel the activities through the xSDK4ECP project, with Anzt as coordinator.
“In xSDK4ECP, all major math libraries participate, and we have some external people on board: Nick Higham from Manchester and Erin Carson from Prague, both experts in multiprecision numerics who will help us understand the numerical effects of using lower-precision formats,” Anzt said.
Although the premise is that the algorithms preserve the IEEE double-precision quality in the output, lower-precision formats can be used in some intermediate calculations without destroying the final output quality. “But, of course, lowering the precision format for intermediate computations has to happen with careful consideration of the numerical effects,” Anzt said.
Using lower-precision formats in intermediate calculations is not a new concept. It has been done without impacting the quality of the final output. “However, to my knowledge, all of these efforts took place in prototype implementations and never made it into production code—we want to change this,” Anzt said. “I think in the past, the research on multiprecision algorithms was interesting from an academic standpoint, but there was no real pressure to bring the technology into production. The likely reason is that algorithms using uniform precision are much easier to program and to maintain. Multiprecision algorithms are much harder for the developers.”
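One well-known instance of this idea is mixed-precision iterative refinement: a linear system is solved in a low-precision format, while the residual and the solution update are computed in double precision. The Python sketch below is illustrative only and is not taken from any ECP library. The function names (`to_f32`, `solve_low_precision`, `refine`) are hypothetical, single precision is emulated by rounding through the standard `struct` module, and the small test system is an assumption chosen for the example.

```python
import struct


def to_f32(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding every
    intermediate result to single precision (emulated)."""
    n = len(b)
    # Augmented matrix [A | b], stored in single precision.
    M = [[to_f32(v) for v in row] + [to_f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = to_f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_f32(M[i][j] - to_f32(f * M[k][j]))
    # Back substitution, also in emulated single precision.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_f32(s - to_f32(M[i][j] * x[j]))
        x[i] = to_f32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, steps=5):
    """Low-precision solves, corrected by double-precision residuals.
    A real code would factor A once and reuse the factors; this sketch
    simply repeats the elimination to stay short."""
    x = solve_low_precision(A, b)
    for _ in range(steps):
        r = residual(A, x, b)              # double precision
        d = solve_low_precision(A, r)      # correction in low precision
        x = [xi + di for xi, di in zip(x, d)]  # update in double precision
    return x


# Assumed example system whose solution is close to [0.1, 0.2, 0.3].
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [0.6, 1.0, 0.8]
print("low precision only:", solve_low_precision(A, b))
print("after refinement:  ", refine(A, b))
```

For a well-conditioned system such as this one, the refined result should agree with a double-precision solve to many more digits than the low-precision solve alone, even though the expensive elimination never leaves the lower-precision format.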
One motivator for the development of multiprecision algorithms is the evolution of hardware.
“Specifically, an increasing number of hardware architectures are adopting low-precision special-function units,” Anzt said. “Maybe the most prominent is the tensor core technology in the NVIDIA Volta processors, powering the Summit supercomputer [at the Oak Ridge Leadership Computing Facility], for example. These tensor cores support IEEE half-precision operations at a more than 10X higher performance than what the GPU achieves for IEEE double- and single-precision computations.”
While the NVIDIA Volta processor technology is perhaps the most prominent example, other architectures are likely to take the same path. “Driving this technology evolution are machine-learning applications such as deep neural networks that greatly benefit from high-performance low-precision calculations,” Anzt said. “It’s now up to us to leverage for our needs the performance provided by the low-precision processors. Obviously, using low precision in complex numerical computations is a big challenge. But at the same time, we are convinced that leveraging this technology has enormous potential.”
Multiprecision algorithms are not limited to special-function units. They will port to any other architecture. Moreover, faster computations in low precision are only one facet of the technology. “Many of the algorithms used in the ECP application projects are memory bound, so the arithmetic performance is not even the most relevant aspect,” Anzt said. “Much more applicable is the data transfer volume; and using more-compact precision formats efficiently reduces the pressure on the memory bandwidth. Consequently, a central focus of the xSDK4ECP multiprecision effort is on developing technology that decouples the arithmetic format from the memory format and employs more compact formats for communication to main memory and in-between nodes. In this format decoupling, we depart from IEEE standard-precision formats but also consider customized formats and compression techniques. Anything reducing the data transfer volume while still communicating the information can help make use of the software more efficient. Benefits are available even if the decreased data transfer volume comes at the cost of additional operations. This is illustrated by what we did in the Ginkgo software library.”
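The decoupling of memory format from arithmetic format can be illustrated with a toy Python sketch. It is not Ginkgo code: the helper names (`pack_values`, `unpack_values`, `dot_from_storage`) are hypothetical, and the standard `struct` module stands in for the memory format, with `"e"`, `"f"`, and `"d"` denoting IEEE half, single, and double precision. Values are stored compactly, expanded on load, and all arithmetic runs in double precision.

```python
import struct


def pack_values(values, code):
    """Store values in the memory format given by a struct code
    ('e' = half, 'f' = single, 'd' = double)."""
    return struct.pack(f"<{len(values)}{code}", *values)


def unpack_values(buf, code):
    """Load stored values; Python floats are IEEE doubles, so this
    converts back to the arithmetic format."""
    n = len(buf) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{n}{code}", buf))


def dot_from_storage(buf_x, buf_y, code):
    """Dot product: compact storage, double-precision arithmetic."""
    x = unpack_values(buf_x, code)
    y = unpack_values(buf_y, code)
    return sum(a * b for a, b in zip(x, y))


# Assumed example data; these values happen to be exactly
# representable in half precision.
x = [0.5 + i / 1024 for i in range(8)]
for code, name in (("d", "double"), ("f", "single"), ("e", "half")):
    buf = pack_values(x, code)
    print(name, len(buf), "bytes", dot_from_storage(buf, buf, code))
```

The conversion on load is extra work, but for a memory-bound kernel the bytes moved matter more than the operations performed, which is the trade-off described in the quote above.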
Ginkgo is a next-generation sparse linear algebra library that can run on multi- and manycore architectures. It is open source software licensed under the BSD 3-Clause License and ships with the latest version (v0.5.0) of the xSDK4ECP package. “The design of the library is guided by combining ecosystem extensibility with heavy, architecture-specific kernel optimization,” Anzt said. “The software development cycle ensures production-quality code by featuring unit testing, automated configuration and installation, Doxygen code documentation, and a continuous integration and continuous benchmarking framework.”
Block-Jacobi is a preconditioner that is very effective for finite element discretizations. It is based on the idea of inverting small dense blocks on the main diagonal of the system matrix. “Applying a block-Jacobi preconditioner is a memory-based operation, and reducing the data transfer volume would speed up the block-Jacobi preconditioner,” Anzt said. “At the same time, the block-Jacobi is only a very rough approximation of the system matrix inverse. So a valid question to ask is if we have only a rough approximation of the matrix inverse, why should we use this operator in high precision. And, indeed, if we carefully use a lower-precision format for storing the block-Jacobi, we see almost no numeric effects. I say ‘carefully use’ because we have to make sure we do not destroy the regularity of the block-Jacobi, which means we have to check for each block to see if we can store it in lower precision.”
A different precision format is used for each block of the block-Jacobi preconditioner. “Each block uses an individually optimized format, but after we load the block-Jacobi matrix into the processor, all blocks are converted back to double precision and all computation uses double precision,” Anzt said.
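The per-block selection can be sketched in Python as follows. This is a simplified illustration, not Ginkgo's implementation: the selection rule used here (pick the most compact format whose round-trip error stays below a relative tolerance `tol`) is an assumption made for the example, whereas the article only states that each block is checked individually. The names `choose_format`, `store_block`, and `apply_block` are hypothetical.

```python
import struct

# Candidate storage formats, most compact first: half, single, double.
FORMATS = ["e", "f", "d"]


def round_trip(values, code):
    """Store and reload values in the given format; None on overflow."""
    try:
        buf = struct.pack(f"<{len(values)}{code}", *values)
    except (OverflowError, struct.error):
        return None
    return list(struct.unpack(f"<{len(values)}{code}", buf))


def choose_format(block, tol):
    """Assumed criterion: the most compact format whose largest
    round-trip error is at most tol times the largest block entry."""
    flat = [v for row in block for v in row]
    scale = max(abs(v) for v in flat) or 1.0
    for code in FORMATS:
        stored = round_trip(flat, code)
        if stored is None:
            continue
        if max(abs(a - s) for a, s in zip(flat, stored)) <= tol * scale:
            return code
    return "d"  # double precision always reproduces the input


def store_block(block, tol):
    """Return (format code, packed bytes) for one diagonal block."""
    code = choose_format(block, tol)
    flat = [v for row in block for v in row]
    return code, struct.pack(f"<{len(flat)}{code}", *flat)


def apply_block(code, buf, n, vec):
    """Convert the stored block back to double and multiply in double."""
    flat = struct.unpack(f"<{n * n}{code}", buf)
    return [sum(flat[i * n + j] * vec[j] for j in range(n)) for i in range(n)]


# Assumed example blocks (already inverted, for the sake of the sketch).
blocks = [[[0.5, 0.25], [0.125, 1.0]], [[0.1, 0.2], [0.3, 0.4]]]
for blk in blocks:
    code, buf = store_block(blk, tol=1e-6)
    print(code, len(buf), "bytes", apply_block(code, buf, 2, [2.0, 4.0]))
```

Because the stored block is expanded to double precision before the multiplication, the only place precision is reduced is in memory, which matches the description of all computation using double precision.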
Computation is not done in lower precision because there’s no difference in performance. “The algorithm is memory bound, and all we measure is the time we need to read the data into the processors,” Anzt said. “Computations are free; and there is a numeric reason for using double precision—this is the way block-Jacobi remains a constant operator.”
ECP’s Ginkgo library project can already point to a concrete speedup. “We now have realized the adaptive precision block-Jacobi as a production-ready preconditioner in the Ginkgo library; and averaging over different problems, it runs about 20 percent faster than the standard algorithm,” Anzt said. “While this already is a big success, I see it as a model algorithm, and I am sure the idea of decoupling memory precision from arithmetic precision has a lot of potential. This, of course, requires some research and motivates us to outline the xSDK4ECP multiprecision effort as several stages. The first stages are primarily focused on research, gathering information, and sharing experience. Later we will deploy production-ready multiprecision algorithms that will in the last stages be integrated into ECP applications.”
The multiprecision work in ECP is still in its setup phase. “But we have a pretty clear vision,” Anzt said. “At the next ECP all-hands meeting in Houston in February 2020, we want to gather everyone who is interested in research on multiprecision technology.”
At a breakout session during ECP’s annual meeting, where the Ginkgo software library team will give motivational talks, the team plans to gather information on which multiprecision efforts already exist, which technology is already available in prototype software, what the ECP applications need, and where the team sees the most potential.
“We will then form cross-project interest groups for specific topics, such as multiprecision preconditioning, low-precision multigrid, and multiprecision Krylov solvers, but also low-precision BLAS, mixed-precision FFT, and so on,” Anzt said. “In these interest groups we share our experience and work together on developing multiprecision algorithms and deploy them in software. Of course, it will be hard to decide on one software package where we deploy all multiprecision technology. At the same time, we don’t aim at starting a new library from scratch. But if we deploy multiprecision functionality in an existing library, we want to make sure that it can also be used from other libraries. Hence, a strong focus will be the multiprecision interoperability. If we deploy a multiprecision algorithm in a certain software library, we want to ensure that the functionality can smoothly be interfaced from the other libraries. And there we complete the cycle: the interoperability aspects perfectly align with the goals of the xSDK project. That is another reason that locating ECP’s multiprecision efforts in the xSDK4ECP project makes a lot of sense.”