The complexity of the exascale systems that will be delivered, from processors with many cores to accelerators and heterogeneous memory, makes it challenging for scientists to achieve high performance in their simulations. Legion shields scientists from this complexity: it provides a data-centric programming system in which scientists describe the properties of their program data and its dependencies, along with a runtime that extracts tasks and executes them using knowledge of the target system to improve performance. Using this approach, the team developed FlexFlow, a deep neural network (DNN) framework built on Legion that automatically discovers fast parallelization strategies for distributed training. FlexFlow generalizes and goes beyond the manually designed parallelization strategies (e.g., data and model parallelism) used across the industry today and has achieved training performance up to 15 times faster than leading industry toolkits (e.g., TensorFlow).
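The kind of search FlexFlow automates can be illustrated with a toy sketch: enumerate per-layer parallelization choices for a small two-layer network and pick the combination that minimizes a simple cost model. The layer sizes, cost formulas, and function names below are hypothetical illustrations, not FlexFlow's actual API or cost model.

```python
# Toy illustration of automated parallelization-strategy search: for each
# layer, choose "data" (split the batch) or "model" (split the parameters)
# parallelism, then pick the globally cheapest combination.
from itertools import product

# Hypothetical layers: a conv layer (compute-heavy, few parameters) and a
# fully connected layer (parameter-heavy). Units are arbitrary.
LAYERS = [
    {"name": "conv1", "compute": 8.0, "params": 1.0},
    {"name": "fc1",   "compute": 2.0, "params": 6.0},
]

def strategy_cost(choices, devices=4, bandwidth=10.0):
    """Estimate per-step time: compute divides across devices, and each split
    pays a communication cost (gradient all-reduce for data parallelism,
    activation exchange for model parallelism)."""
    cost = 0.0
    for layer, dim in zip(LAYERS, choices):
        cost += layer["compute"] / devices              # divided compute
        if dim == "data":
            cost += layer["params"] / bandwidth         # all-reduce gradients
        else:
            cost += layer["compute"] * 0.5 / bandwidth  # exchange activations
    return cost

def best_strategy(devices=4):
    """Exhaustively search all per-layer choices (feasible for a toy model)."""
    candidates = product(["data", "model"], repeat=len(LAYERS))
    return min(candidates, key=lambda c: strategy_cost(c, devices))

print(best_strategy())  # -> ('data', 'model')
```

Under this cost model the search recovers the conventional wisdom that manual strategies encode by hand: data parallelism for the compute-heavy conv layer, model parallelism for the parameter-heavy fully connected layer. FlexFlow explores a far richer space of splits (and uses a measured, not hand-written, cost model), which is how it finds strategies that beat the manual ones.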
Increasing hardware specialization and tightening power and cost constraints will result in exascale systems with billion-way concurrency, a growing gap between floating-point performance and memory and network latency, heterogeneity in processing and memory capabilities, and more dynamic performance characteristics due to power capping and highly tapered network topologies. Achieving sustained performance on these systems will require significant advances in hiding latency, minimizing data movement, and extracting additional levels of parallelism from applications.
Designed to address these challenges, the Legion parallel programming system is a data-centric system for writing portable, high-performance programs targeted at distributed, heterogeneous architectures. Legion presents abstractions that allow programmers to describe the properties of their program data, such as independence and locality. Because the Legion programming system is aware of the structure of program data, it can automate many of the tedious tasks programmers currently face, including correctly extracting task- and data-level parallelism and moving data through complex memory hierarchies. A novel mapping interface gives programmers explicit control over data placement in the memory hierarchy and task assignment to processors in a way that is orthogonal to correctness, making it easy to port and tune Legion applications for performance on new architectures.
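The core idea, that a runtime can extract parallelism automatically once tasks declare what data they touch and how, can be sketched in a few lines. This is a conceptual illustration, not Legion's actual API: here each task simply lists (region, privilege) pairs, and two tasks conflict only if they share a region and at least one writes it.

```python
# Conceptual sketch of dependence analysis from declared data usage:
# non-conflicting tasks may safely run in parallel.
READ, WRITE = "read", "write"

# Hypothetical task stream in program order: two independent initializers
# followed by a stencil that reads both regions.
tasks = [
    ("init_a",  [("region_a", WRITE)]),
    ("init_b",  [("region_b", WRITE)]),
    ("stencil", [("region_a", READ), ("region_b", READ)]),
]

def conflicts(reqs1, reqs2):
    """Two tasks conflict if they touch a common region and either writes it."""
    return any(r1 == r2 and (p1 == WRITE or p2 == WRITE)
               for r1, p1 in reqs1
               for r2, p2 in reqs2)

def dependencies(task_list):
    """Return (earlier, later) edges a runtime would enforce; everything else
    is free to execute in parallel."""
    edges = []
    for i, (name1, reqs1) in enumerate(task_list):
        for name2, reqs2 in task_list[i + 1:]:
            if conflicts(reqs1, reqs2):
                edges.append((name1, name2))
    return edges

print(dependencies(tasks))
# -> [('init_a', 'stencil'), ('init_b', 'stencil')]
# init_a and init_b touch disjoint regions, so the runtime can run them in
# parallel; stencil must wait for both.
```

In Legion the analogous information is carried by logical regions and privileges on region requirements, and the runtime performs this analysis dynamically and hierarchically; the sketch only shows why describing data, rather than orchestrating synchronization by hand, is enough to recover the parallelism.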
The Legion team is developing new and improved features and integrating them into the programming system to address application requirements unique to the Exascale Computing Project, including better support for complex data structures, scalable data partitioning mechanisms, more versatile decomposition into different forms of parallelism, and more flexible and performant mechanisms for mapping computations and data to hardware. This work benefits both traditional computational workloads and more recent data-intensive workloads, including machine learning training and inference.
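To make the partitioning requirement concrete, the following toy analogue (again not Legion's API) splits a 1-D index space into disjoint "owned" subspaces and then derives an overlapping "ghost" partition for a width-1 stencil. Keeping the disjoint partition separate from the derived, read-only halo partition is the kind of structure that scalable partitioning mechanisms must express, since both must be computed without any single node enumerating all the data.

```python
# Toy analogue of index-space partitioning: an "owned" equal partition plus a
# derived ghost (halo) partition for a width-1 stencil.
def equal_partition(n_points, n_parts):
    """Split [0, n_points) into n_parts disjoint, roughly equal subspaces."""
    step = (n_points + n_parts - 1) // n_parts
    return [range(i * step, min((i + 1) * step, n_points))
            for i in range(n_parts)]

def ghost_partition(parts, n_points, halo=1):
    """Widen each owned subspace by `halo` points on each side, clamped to the
    index space; these subspaces overlap and are read-only for the stencil."""
    return [range(max(p.start - halo, 0), min(p.stop + halo, n_points))
            for p in parts]

owned = equal_partition(100, 4)
ghost = ghost_partition(owned, 100)
print([(p.start, p.stop) for p in owned])  # -> [(0, 25), (25, 50), (50, 75), (75, 100)]
print([(g.start, g.stop) for g in ghost])  # -> [(0, 26), (24, 51), (49, 76), (74, 100)]
```

Legion's partitioning operators generalize this pattern to multidimensional, sparse, and data-dependent partitions, and computing such partitions scalably across thousands of nodes is one of the features called out above.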