SOLLVE: Scaling OpenMP with LLVm for Exascale performance and portability
OpenMP With LLVM's LTO
The LLVM compiler infrastructure has advanced support for link-time optimization (LTO). This enables the optimization of code across translation-unit boundaries. For example, functions that are defined in separate translation units (i.e., source files) can nevertheless be inlined into each other while the final binary is being composed. For more information on LTO in general, and LLVM's implementations in particular, see: ThinLTO and the video.
There's nothing special about integrating host-level OpenMP with LLVM's LTO, but additional work will be needed to integrate LTO with OpenMP's target offloading features.
LTO generally works by having Clang generate LLVM bitcode instead of machine code in the output object files. Those bitcode object files are then interpreted by an LLVM linker plugin. During normal (monolithic) LTO, the linker plugin collects all of the bitcode from the relevant input object files, links all of the IR together using LLVM's IR-level linking functionality, internalizes symbols as appropriate, optimizes the resulting IR module, generates code for that module, and then feeds that code back to the linker as the contribution from the bitcode object files. ThinLTO works in a similar way, except that:
- In addition to each bitcode object file generated, a summary file is also generated.
- The LLVM linker plugin processes each bitcode object-file input in parallel, making use of the combined summary information to figure out what functions should be loaded from other bitcode object files and from which files the definitions should be loaded.
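The summary-driven import step can be illustrated with a toy model. This is only a sketch, not LLVM's actual data structures or heuristics: the module names, function names, and the `build_combined_index` and `compute_imports` helpers are hypothetical, each "summary" is reduced to a list of direct callees, and the size and profitability checks that a real importer applies are ignored.

```python
from collections import defaultdict

# Toy per-module summaries: function name -> names of the functions it calls.
# All module and function names here are hypothetical.
summaries = {
    "a.o": {"main": ["helper", "log"]},
    "b.o": {"helper": ["clamp"]},
    "c.o": {"clamp": [], "log": []},
}


def build_combined_index(summaries):
    """Map each function name to the module that defines it."""
    index = {}
    for module, functions in summaries.items():
        for name in functions:
            index[name] = module
    return index


def compute_imports(summaries, index):
    """For each module, group the external definitions it calls directly
    by the module that defines them."""
    imports = {}
    for module, functions in summaries.items():
        wanted = defaultdict(set)
        for callees in functions.values():
            for callee in callees:
                definer = index.get(callee)
                if definer is not None and definer != module:
                    wanted[definer].add(callee)
        imports[module] = {m: sorted(names) for m, names in wanted.items()}
    return imports


index = build_combined_index(summaries)
imports = compute_imports(summaries, index)
for module, wanted in imports.items():
    print(module, "imports", wanted)
```

Because each module's import list depends only on its own summary and the combined index, the per-module work can proceed independently, which is what allows the inputs to be processed in parallel.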
When compiling with OpenMP target offload support enabled, Clang does not generate regular object files, but rather, multi-target bundles. These bundles contain collections of object files that can be separated using Clang's bundler/unbundler tool. When these bundled object files are passed to Clang for linking, they're unbundled and the linker is invoked. When linking the host code, LTO (monolithic or thin) should work as usual. There aren't regression tests for this, however, and they should be added. For the target code where the target is a "normal" target, this process should also work as usual (again, regression tests are needed). The complication comes when the target is a GPU: a regular linker is not used for the target code, and ptxas and fatbinary are not linkers and, thus, do not support linker plugins. Some additional information, although not exactly matching trunk, can be found here.
To support LTO for the GPU code, we'll need either to have the driver recognize the bitcode inputs in what would otherwise be PTX inputs and invoke the LTO directly, or to provide a wrapper around fatbinary/nvlink that will do this and then invoke the underlying NVIDIA utility. One interesting question in the context of a wrapper utility is: how generic could such a utility be? Could we have a generic wrapper utility that makes any underlying linker-like program LTO aware? During a regular link process, the linker provides the LLVM plugin with information on which symbols are requested, and thus, the plugin can mark others as dead. When optimizing for a GPU, which is an effectively-closed system, we should also be able to have LTO make strong assumptions about what is needed
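
As a rough illustration of the two ideas above, the sketch below shows how a wrapper might separate bitcode inputs from everything else, and how a closed-world assumption could turn a set of entry points into a set of live symbols. It is a toy under stated assumptions, not a proposed implementation: the helper names (`is_llvm_bitcode`, `partition_inputs`, `live_symbols`), the file names, and the call graph are all hypothetical, and no NVIDIA tool is actually invoked. The only external fact relied on is the LLVM bitcode magic number (`'BC' 0xC0DE`, or `0x0B17C0DE` for the bitcode wrapper format).

```python
import os
import tempfile

# Raw LLVM bitcode starts with 'B', 'C', 0xC0, 0xDE; the bitcode wrapper
# format starts with the little-endian encoding of 0x0B17C0DE.
RAW_BITCODE_MAGIC = b"BC\xc0\xde"
WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"


def is_llvm_bitcode(path):
    """Return True if the file begins with an LLVM bitcode magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
    return head in (RAW_BITCODE_MAGIC, WRAPPER_MAGIC)


def partition_inputs(paths):
    """Split inputs into those needing LTO and those passed through as-is."""
    bitcode, other = [], []
    for path in paths:
        (bitcode if is_llvm_bitcode(path) else other).append(path)
    return bitcode, other


def live_symbols(call_graph, roots):
    """Symbols reachable from the roots; under a closed-world assumption,
    everything else can be treated as dead."""
    live, stack = set(), list(roots)
    while stack:
        sym = stack.pop()
        if sym in live:
            continue
        live.add(sym)
        stack.extend(call_graph.get(sym, ()))
    return live


# Demo with hypothetical inputs: one fake bitcode file and one fake PTX file.
with tempfile.TemporaryDirectory() as tmp:
    bc_path = os.path.join(tmp, "kernel.o")
    ptx_path = os.path.join(tmp, "other.ptx")
    with open(bc_path, "wb") as f:
        f.write(RAW_BITCODE_MAGIC + b"\x00" * 8)
    with open(ptx_path, "wb") as f:
        f.write(b"// hypothetical PTX text\n")
    bitcode, other = partition_inputs([bc_path, ptx_path])
    bitcode_names = [os.path.basename(p) for p in bitcode]
    other_names = [os.path.basename(p) for p in other]

# Hypothetical device call graph; only "kernel" is an entry point.
call_graph = {"kernel": ["helper"], "helper": [], "unused": ["helper"]}
live = live_symbols(call_graph, ["kernel"])
print(bitcode_names, other_names, sorted(live))
```

Detecting bitcode by content rather than by file extension is what would let such a wrapper stay generic: it would not need to know how the underlying linker-like program names its inputs, only which inputs to run through LTO before handing the results on.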