SOLLVE: Scaling OpenMP With LLVm for Exascale Performance and Portability
Improvements to LLVM OpenMP
There are a number of ways in which the LLVM OpenMP implementation can be improved.
- Compiler optimizations for parallel regions. These optimizations include the fusion of adjacent parallel regions and parallel loops, as well as the fission of these same constructs, in order to increase the overall performance of the code. This matters for performance portability and, moreover, for allowing programs to use abstraction layers without paying prohibitive abstraction penalties. Consider, for example, the C++17 parallel algorithms, which can easily end up producing fine-grained parallel regions that should be combined:
    #include <algorithm>
    #include <execution>

    // in library A:
    void foo() {
      std::for_each(std::execution::par_unseq, a.begin(), a.end(), /* bodyA */);
    }

    // in library B:
    void bar() {
      std::for_each(std::execution::par_unseq, b.begin(), b.end(), /* bodyB */);
    }

    // in main code: two adjacent fine-grained parallel regions
    foo();
    bar();
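To make the intended transformation concrete, here is a minimal sketch of what region fusion could look like if the two parallel algorithms were lowered to OpenMP worksharing loops. The lowering itself, the function names `unfused` and `fused`, and the loop bodies `body_a` and `body_b` are assumptions made for illustration. They are not taken from the LLVM implementation, and the sketch assumes the two loops touch disjoint data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical loop bodies standing in for bodyA and bodyB.
inline double body_a(double x) { return 2.0 * x; }
inline double body_b(double x) { return x + 1.0; }

// Before fusion: roughly what foo(); bar(); looks like after inlining.
// Each loop forks and joins its own team of threads.
void unfused(std::vector<double>& a, std::vector<double>& b) {
    const std::ptrdiff_t na = static_cast<std::ptrdiff_t>(a.size());
    const std::ptrdiff_t nb = static_cast<std::ptrdiff_t>(b.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < na; ++i) a[i] = body_a(a[i]);
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < nb; ++i) b[i] = body_b(b[i]);
}

// After region fusion: one parallel region containing two worksharing loops.
// Only one fork/join is paid for both loops.
void fused(std::vector<double>& a, std::vector<double>& b) {
    // Dropping the barrier with nowait is only valid for disjoint data.
    assert(&a != &b);
    const std::ptrdiff_t na = static_cast<std::ptrdiff_t>(a.size());
    const std::ptrdiff_t nb = static_cast<std::ptrdiff_t>(b.size());
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (std::ptrdiff_t i = 0; i < na; ++i) a[i] = body_a(a[i]);
        #pragma omp for
        for (std::ptrdiff_t i = 0; i < nb; ++i) b[i] = body_b(b[i]);
    }
}
```

Both versions compute the same results. Whether the fused form is actually faster depends on the target, which is why the decision is left to a cost model rather than applied unconditionally.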
LLVM Intrinsic Function and Tag Name String Interface for Directives
After inlining, the two adjacent parallel regions might be better combined into one region. With parallel algorithms implemented on top of OpenMP, the compiler will be able to analyze these regions, understand their semantics, apply a cost model, and, if profitable, fuse them. The best choice could differ greatly between architectures: a GPU might prefer finer-grained regions, whereas a CPU might prefer the regions combined. Even then, a CPU system might need an analysis of the number of hardware prefetching streams required in order to determine the optimal fusion strategy. The high-level LLVM IR representations for parallelism being developed by the ECP PROTEAS project will likely prove very helpful in this regard. We're collaborating with Intel and others on developing this technology.
OpenMP, Unified Memory, and Prefetching
- Preparing for unified virtual memory (UVM). UVM is NVIDIA's term, but more generally, systems in which targets share access with the host to all of the host's available memory seem to be becoming more common. Aside from the engineering work needed to make sure that the current runtime systems interact well with this capability, UVM also raises interesting questions about what to do with target regions whose working sets are larger than the available device memory. In the past, these generally could not be supported. Now, with UVM and the availability of on-demand paging, they can be supported naturally. However, on-demand paging may give highly suboptimal performance. As a result, we're implementing both new OpenMP directives to direct data-streaming transformations and compiler-analysis-driven transformations that implement streaming/pipelining of kernel executions over large data sets. These transformations involve both compile-time and runtime changes. Preliminary experimentation on NVIDIA P100 GPUs has revealed that properly pipelining kernel execution with data "prefetching" to the device can result in performance equivalent to that of manual data movement on smaller working sets (thus extending this good performance to larger working sets). Our compiler analysis combines loop-level analysis with profiling data and runtime inputs to the cost models in order to determine when to apply this technique.
- The quality of the OpenMP frontend implementation can also be improved by adding more analysis-based warnings. For example, a variable marked as shared that is modified in a parallel loop without any obvious synchronization can generate a warning (along with hints suggesting synchronization, reduction, or privatization).
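As an illustration of the kind of code such a warning would target, the sketch below shows a racy pattern (left in a comment so that the example stays well-defined) together with two of the fixes a hint might suggest. The function names `sum_reduction` and `sum_atomic` are hypothetical. The sketch does not reproduce any existing or planned Clang diagnostic.

```cpp
#include <cassert>
#include <vector>

// Pattern the warning is meant to catch (not compiled here):
//
//   long sum = 0;
//   #pragma omp parallel for shared(sum)
//   for (long i = 0; i < n; ++i)
//       sum += v[i];   // unsynchronized update of a shared variable
//
// The two functions below show fixes that a hint could point to.

// Fix 1: a reduction clause gives each thread a private partial sum
// and combines the partial sums at the end of the loop.
long sum_reduction(const std::vector<long>& v) {
    long sum = 0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) sum += v[i];
    return sum;
}

// Fix 2: explicit synchronization keeps the variable shared but makes
// each update atomic. This is usually slower than a reduction.
long sum_atomic(const std::vector<long>& v) {
    long sum = 0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for shared(sum)
    for (long i = 0; i < n; ++i) {
        #pragma omp atomic
        sum += v[i];
    }
    return sum;
}
```

A quick sanity check such as `assert(sum_reduction(v) == sum_atomic(v));` confirms that both fixes agree. Privatization, the third hint mentioned, applies when each iteration only needs its own scratch copy of the variable rather than a combined result.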