SOLLVE: Scaling OpenMP with LLVm for Exascale performance and portability
The Current OpenMP Support in LLVM/Clang
The purpose of this document is to evaluate the current status of OpenMP support in LLVM/Clang and to identify potential inefficiencies in both the implementation and the OpenMP standard. This document will later be used to guide the SOLLVE compiler implementation as well as the evolution of the OpenMP standard.
OpenMP 3.1 Support
The initial OpenMP support in LLVM/Clang was implemented by Intel and has been publicly available since 2013. The project was initially hosted at clang-omp.github.io and has been fully integrated into upstream LLVM/Clang since version 3.7. The Intel compiler group continues to enhance its OpenMP 3.1 implementation in several directions, such as vectorization.
We use the OpenMP validation suite developed at the University of Houston to evaluate the completeness and correctness of LLVM/Clang’s OpenMP support. Since only the C/C++ frontend is available at present, we test only the OpenMP extensions for C/C++. To the best of our knowledge, the Fortran frontend, Flang, is under development by PGI, and its current implementation lacks OpenMP support. All OpenMP 3.1 C/C++ tests pass, which indicates that OpenMP 3.1 features are fully supported, including internal control variables, runtime library routines, and the various parallel constructs. In terms of performance, LLVM/Clang has been shown to deliver performance and scalability for OpenMP programs similar to those of other mainstream compilers [3, 4].
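To give a sense of what such a validation test checks, the following is a minimal sketch in the spirit of a validation suite. It is not taken from the University of Houston suite; the function names check_parallel_for_reduction and check_num_threads_icv are hypothetical. The first check compares a parallel reduction against a closed-form result, and the second verifies that a value set through a runtime library routine is reflected by the corresponding query routine.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical check: the parallel reduction of 1..n must equal n(n+1)/2.
 * Returns 1 on success, 0 on failure. */
int check_parallel_for_reduction(int n)
{
    assert(n >= 0);
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum == (long)n * (n + 1) / 2;
}

/* Hypothetical check: after omp_set_num_threads(requested), the value
 * reported by omp_get_max_threads() should match. When the file is built
 * without OpenMP, the check is skipped and reported as passing. */
int check_num_threads_icv(int requested)
{
    assert(requested > 0);
#ifdef _OPENMP
    omp_set_num_threads(requested);
    return omp_get_max_threads() == requested;
#else
    return 1;
#endif
}
```

Both functions also compile without OpenMP enabled, in which case the pragmas are ignored and the loop runs serially; this makes it easy to compare the serial and parallel builds of the same test.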
OpenMP 4.0/4.5 Support
Since OpenMP 3.1 is already well supported in LLVM/Clang, we focus on the features introduced in OpenMP 4.0/4.5. The major additions to the OpenMP standard after 3.1 are SIMD support and offloading to devices (e.g., GPUs, MICs). LLVM/Clang already supports the simd directive, so we concentrate on the OpenMP features related to device offloading.
Offloading support for Intel Xeon Phi processors was implemented by Intel and has been integrated into upstream LLVM/Clang. Offloading support for NVIDIA GPUs is being implemented by IBM; as of March 2017 the implementation is being finalized, and it is expected to be available in late June 2017. An AMD team is also implementing offloading support for AMD GPUs, but it is not clear when it will be available. Because of the limited development time and the lack of NVIDIA’s participation, we expect that GPU offloading support still has room for improvement. Evaluating GPU offloading support is therefore one of our focuses.
Note that our evaluation uses in-house microbenchmarks because no OpenMP 4.0/4.5 benchmarks are available yet. A subtask of SOLLVE is to develop an OpenMP verification and validation suite that supports OpenMP 4.0/4.5 as well as future versions. According to the current schedule it will be ready in 2019, at which point a thorough evaluation will become possible. We also obtained a great deal of information from the IBM compiler team, which is actively implementing OpenMP GPU offloading in LLVM/Clang.
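As an illustration of the two feature groups, the sketch below applies the simd directive and a combined target construct to the same loop. The function names saxpy_simd and saxpy_target are hypothetical and are not part of our microbenchmarks. When no offloading device is configured, the target region is expected to fall back to host execution.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], with the loop marked for SIMD execution. */
void saxpy_simd(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* The same loop offloaded to a device. The map clauses copy x to the
 * device and copy y in both directions. */
void saxpy_target(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The loop body is identical in both versions; only the directive changes, which is what makes the quality of the compiler's offloading support the deciding factor for performance.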
Most target features introduced in OpenMP 4.0/4.5 are already supported in clang-ykt, the compiler under development by IBM. One missing feature is asynchronous GPU execution. For instance, if a target region is tagged nowait, offloading it to the GPU still causes an unnecessary wait because GPU and CPU execution are synchronous. This affects only the performance of programs, not their correctness. We expect all OpenMP 4.5 features to be supported by the end of June 2017.
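The pattern affected by this limitation can be sketched as follows. The function name offload_two_regions is hypothetical. The two target regions work on independent arrays, so an implementation with asynchronous execution would be free to overlap them; with synchronous execution the result is the same but the regions run one after the other.

```c
#include <assert.h>
#include <stddef.h>

/* Two independent offloaded regions, both tagged nowait, followed by a
 * taskwait that waits for both deferred target tasks to complete. */
void offload_two_regions(int n, float *a, float *b)
{
    assert(n >= 0 && a != NULL && b != NULL);

    #pragma omp target map(tofrom: a[0:n]) nowait
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;

    /* Independent of the first region: it may start before the first
     * region finishes if the implementation is asynchronous. */
    #pragma omp target map(tofrom: b[0:n]) nowait
    for (int i = 0; i < n; i++)
        b[i] += 1.0f;

    /* Both results are guaranteed to be visible after this point. */
    #pragma omp taskwait
}
```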
In terms of performance, the IBM team has implemented several optimizations to achieve decent GPU offloading performance. For instance, the GPU device runtime is inlined to reduce register pressure and increase concurrency; the default thread scheduling policy has been changed on the GPU to suit its execution model; and adaptive thread organization is implemented to coalesce the memory accesses of different threads.
Much can still be done to optimize the performance of the OpenMP GPU offloading implementation, and several OpenMP features are still missing. Here we list the most important items that we believe are currently missing:
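To see why the scheduling policy matters for coalescing, consider which array element each thread touches at a given step. The sketch below is an assumption-based illustration, not the policy actually implemented by IBM: it contrasts a cyclic distribution (as produced by a static schedule with chunk size 1) with a blocked distribution. The function names are hypothetical.

```c
#include <assert.h>

/* Element index touched by thread tid on its k-th iteration when
 * iterations are dealt out cyclically (chunk size 1). */
int cyclic_index(int tid, int k, int nthreads)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && k >= 0);
    return k * nthreads + tid;
}

/* Element index touched by thread tid on its k-th iteration when each
 * thread owns one contiguous block of ceil(n / nthreads) iterations. */
int blocked_index(int tid, int k, int n, int nthreads)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && k >= 0 && n >= 0);
    int chunk = (n + nthreads - 1) / nthreads;
    return tid * chunk + k;
}
```

At any step, neighbouring threads under the cyclic distribution touch neighbouring elements (stride 1), which is the access pattern a GPU can coalesce into few memory transactions. Under the blocked distribution neighbouring threads are a whole chunk apart, which suits CPU caches but not GPU memory coalescing.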
- Asynchronous device offloading. Currently all activities offloaded to the GPU are synchronous with one another. For example, if there are two target offloading regions and nowait is specified for the first, the second still cannot start until the first completes.
- Parallel IR. Currently the frontend lowers OpenMP directives directly to runtime calls, which prevents efficient compiler optimization. A parallel IR would enable more compiler optimization opportunities.
- ThinLTO support. Link Time Optimization (LTO) is important for OpenMP performance. ThinLTO is a new LTO technique which reduces the overhead of LTO significantly. Having GPU offloading support in ThinLTO will enable a lot of compiler optimization without incurring a large overhead.
- Support for new GPU features. For instance, the unified memory feature provided by Kepler and later GPUs is not supported in LLVM/Clang. It is important for the compiler to be aware of these new features and to leverage them for better performance.
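The Parallel IR item can be made concrete with a hand-written sketch of what early lowering does to a parallel loop. This is a simplified illustration under stated assumptions, not the code Clang actually generates: sketch_fork_call is a hypothetical, sequential stand-in for the runtime's fork entry point (the LLVM OpenMP runtime provides one named __kmpc_fork_call), and the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* What the programmer writes. */
void scale_source(int n, double *v, double s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] *= s;
}

/* Variables captured by the parallel region, packed for the outlined body. */
struct scale_args { int n; double *v; double s; };

typedef void (*outlined_fn)(int tid, int nthreads, void *args);

/* Hypothetical stand-in for the runtime fork call. It simply invokes the
 * outlined body once per "thread", one after another. */
static void sketch_fork_call(outlined_fn fn, int nthreads, void *args)
{
    for (int tid = 0; tid < nthreads; tid++)
        fn(tid, nthreads, args);
}

/* The region body after outlining: each thread processes one contiguous
 * block of the iteration space. */
static void scale_outlined(int tid, int nthreads, void *p)
{
    struct scale_args *a = p;
    int chunk = (a->n + nthreads - 1) / nthreads;
    int lo = tid * chunk;
    int hi = (lo + chunk < a->n) ? lo + chunk : a->n;
    for (int i = lo; i < hi; i++)
        a->v[i] *= a->s;
}

/* Roughly what the optimizer sees after lowering: an opaque call that
 * takes a function pointer and a pointer to the captured variables. */
void scale_lowered(int n, double *v, double s)
{
    assert(n >= 0 && v != NULL);
    struct scale_args args = { n, v, s };
    sketch_fork_call(scale_outlined, 4, &args);
}
```

In the lowered form the loop is hidden behind a function pointer and the captured variables are reached through an untyped pointer, so optimizations that need to reason across the region boundary (for example, propagating the value of s into the loop) become difficult. A parallel IR would keep the region visible to the optimizer until later in the pipeline.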
In summary, for C/C++, OpenMP 3.1 is fully supported in LLVM/Clang, and preliminary OpenMP 4.0/4.5 support is under development and should be available soon. For Fortran, the LLVM frontend is not yet ready, and the frontend under development has no OpenMP implementation. We will work on optimizing OpenMP support for C/C++ as well as on the initial OpenMP support in the Fortran frontend.