SOLLVE: Scaling OpenMP with LLVm for Exascale performance and portability
Hard Target Semantics
Here we detail some aspects of OpenMP's semantics that make optimization difficult.
Tasks
When we have code like this:
void bar();
void foo(float *a, float *b, int &n) {
  if (n < 4)
    return;
  bar();
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + 1;
}
The compiler will analyze the pointers 'a' and 'b', and the resulting aliasing information, along with value information for n, will be available to the transformations optimizing the loop (e.g., vectorization). Now imagine that this code is called from some function that is part of a parallel region, and moreover, that we want to make a task out of the loop:
void bar();
void foo(float *a, float *b, int &n) {
  if (n < 4)
    return;
  #pragma omp task
  {
    bar();
    for (int i = 0; i < n; ++i)
      a[i] = b[i] + 1;
  }
}
In the first case, when the loop is run, we know that n is at least 4. Do we know the same thing when the loop in the task is run? No. Because n was passed by reference, nothing prevents the caller from modifying the value of n before the task is actually run. You might object that the compiler should be able to figure out that the value of n has not changed, because any change would introduce a race condition (i.e., use the same logic it generally uses to justify common-subexpression elimination on loads in regular code). In this case, however, unless the compiler can see that bar() contains no synchronization constructs, it doesn't know that any such change to the value of n must be a race condition. Worse, this entire code could be guarded so that it runs only when a single thread is available, and the compiler can't know that no such guarding condition exists, meaning that it can't assume that unsynchronized changes to values used in tasks must be race conditions.
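To make the concern concrete, here is a hypothetical caller (the function name, values, and structure are ours, not from the original text) of the kind the compiler must assume might exist. It modifies n after the task has been created; if bar(), whose body the compiler cannot see, synchronizes with that modification, the change is not even a data race.

extern void foo(float *a, float *b, int &n);   // the task version shown above

void caller(float *a, float *b) {
  int n = 8;
  #pragma omp parallel shared(n)
  #pragma omp single
  {
    foo(a, b, n);   // the 'if (n < 4)' check passes and the task is created
    n = 2;          // n may change before the (possibly deferred) task runs;
                    // if bar(), which the compiler cannot see into, synchronizes
                    // with this write, the change is not even a data race
  }                 // implicit barrier: all generated tasks complete here
}

Inside the task, the loop can therefore observe n == 2 even though the guard saw n == 8, so the compiler cannot propagate the n >= 4 fact into the task body.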
Possible Enhancement: OpenMP could place restrictions on when variables referred to in tasks, especially variables of reference type, can be modified after the task is created.
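Absent such a restriction, a programmer can approximate it manually by snapshotting the value at task-creation time. A minimal sketch of the example above with this change (the variable n_local is our addition, and freezing the trip count at creation time is exactly the semantics the proposed restriction would let the compiler assume):

void bar();
void foo(float *a, float *b, int &n) {
  if (n < 4)
    return;
  int n_local = n;                       // capture the value at task-creation time
  #pragma omp task firstprivate(n_local)
  {
    bar();
    for (int i = 0; i < n_local; ++i)    // the trip count can no longer change
      a[i] = b[i] + 1;
  }
}

With this form, the n_local >= 4 fact established by the guard remains valid inside the task, regardless of what the caller does to n.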
Target Semantics
There are a number of difficulties in the mapping of OpenMP target semantics onto common accelerators (e.g., GPUs). These generally boil down to semantic mismatches, either in form or scope, between OpenMP's model and the underlying programming model of the accelerator.
Consider the case of compiling OpenMP target regions for SIMT GPUs (e.g., NVIDIA GPUs). Such accelerators execute kernels in a SIMT model in which many threads execute each kernel in lock step, in a configuration dictated by the host when the kernel was scheduled for execution. The OpenMP model, however, provides for target execution of both serial and parallel code regions, where almost all aspects of the parallel execution can be determined within the target region itself. Moreover, in an implementation supporting separate compilation (i.e., not forcing all code that might be executed in a target region to be defined within the containing translation unit), it may be impossible to know statically (i.e., at compile time) what OpenMP features will be used within each target region. This has a number of practical effects:
- Essentially all of the OpenMP features available on the host are also available within target regions, including tasks and the various scheduling policies for parallel loops. Many of these features do not map directly onto capabilities provided by the underlying software and hardware layers; CUDA, for example, provides no native support for in-kernel tasking or dynamic loop scheduling. A sophisticated target-side runtime library is therefore needed to support these features. A sophisticated runtime library is needed on the host as well, but the cost of providing one can be much greater on an accelerator: the runtime library must be compiled into each kernel, and it may require additional registers beyond what a direct coding of the algorithm would need (within the library's implementation, at its call sites, or both), so its mere inclusion can reduce parallelism and performance. Instruction-cache resources may also be relatively limited on an accelerator compared to the host, and function calls relatively more expensive. GPUs, for example, have generally been designed to execute relatively small kernels with high degrees of parallelism, and thus the hardware contains no logic specifically devoted to reducing the overhead of function calls or dealing efficiently with large code regions.
- In addition to requiring a runtime library, there is overhead in dealing with both serial and parallel target regions on SIMT devices. To support target code in general, all threads must be launched, even while a serial code region executes. As a result, the compiler must essentially generate a state machine inside the kernel: all threads run, but most wait at a barrier until the serial region is done; then all threads run the parallel code; then most threads wait at a barrier again, and so on (see the sketch following this list). This decreases performance, not only because of the cost of the barriers, but also because the extra code tends to increase register pressure (and, thus, decrease the available parallelism).
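For concreteness, here is a small, illustrative target region (the function and variable names are ours) of the kind that forces the state-machine pattern: any target region that interleaves serial and parallel code behaves this way.

void saxpy(const float *x, float *y, int n, float a) {
  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
  {
    // Serial part: conceptually executed by a single thread, yet on a SIMT
    // device all threads of the kernel have already been launched and must
    // wait at a compiler-generated barrier while this runs.
    float scale = a * 2.0f;

    // Parallel part: now all threads participate in the loop.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      y[i] = scale * x[i] + y[i];

    // Another serial part: again only one thread is active while the
    // others wait, completing the state-machine pattern described above.
    y[0] += 1.0f;
  }
}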
Another problematic issue is the assumption of globally accessible thread stacks. Because OpenMP variables are shared by default, and this includes "local" variables on the stack of each thread, these variables often cannot be placed on the stack in the usual way. On many GPUs, including NVIDIA GPUs, a thread's local stack is not accessible from other threads, so the "address" of a local variable is not meaningful in a global sense. Instead, these variables must be placed in global memory, which makes the handling of recursive functions non-trivial. Moreover, because of separate compilation, it is often unknown which local variables might need to live somewhere other than the local stack (and so, to be conservatively correct, none of them that have their address captured or passed to any function can be put on the thread-local stack).
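A small illustration of the problem (the function and variable names are ours): the array below is a "local" variable of the target region, yet it is shared by the nested parallel region, so its address must be visible to every thread and it cannot be placed in SIMT thread-private stack space.

#include <omp.h>

void reduce_example(const float *x, int n, float *out) {
  #pragma omp target map(to: x[0:n]) map(from: out[0:1])
  {
    float partials[64];      // local to the target region, but shared by the
                             // parallel region below; every thread writes into
                             // it, so it cannot live on a per-thread GPU stack
    for (int t = 0; t < 64; ++t)
      partials[t] = 0.0f;    // zero-init in case fewer than 64 threads run

    #pragma omp parallel num_threads(64)
    {
      float sum = 0.0f;      // truly private: one copy per thread, stack is fine
      #pragma omp for
      for (int i = 0; i < n; ++i)
        sum += x[i];
      partials[omp_get_thread_num()] = sum;
    }

    // Serial part of the target region: combine the per-thread results.
    float total = 0.0f;
    for (int t = 0; t < 64; ++t)
      total += partials[t];
    out[0] = total;
  }
}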