Dynamic Correctness Testing of MPI Programs
Debugging MPI applications can be difficult. Software complexity,
data races, and scheduling dependencies can make simple programming
errors very difficult to locate with manual debugging techniques. Worse,
few debugging tools target MPI abstractions, and error messages from
MPI implementations are often misleading when the programmer uses MPI
incorrectly. As a result, users rely on a spectrum of time-consuming
and complicated ad hoc techniques to locate MPI programming errors.
Clearly, MPI programmers need tools that simplify code development.
Umpire is an innovative tool that dynamically analyzes any MPI
application for typical MPI programming errors. Examples of these errors
include resource exhaustion and configuration-dependent buffer deadlock.
Umpire performs this analysis on unmodified application codes at runtime
by using the MPI profiling layer.
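
To make the resource-exhaustion check concrete, here is a minimal sketch of the kind of bookkeeping a profiling-layer tool could keep. It is an illustration under stated assumptions, not Umpire's actual data structure: the names ledger_t, ledger_acquire, ledger_release, and ledger_leaked are hypothetical, and the code deliberately does not call MPI. In a real tool, a wrapper named after an MPI routine would record the handle and then forward the call to the corresponding PMPI_ entry point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ledger of live opaque handles (requests, communicators,
 * datatypes). Illustration only: not Umpire's design, not part of MPI. */
#define LEDGER_CAPACITY 1024

typedef struct {
    const void *handles[LEDGER_CAPACITY]; /* handles currently live */
    size_t live;                          /* number of live handles */
    size_t high_water;                    /* peak number of live handles */
} ledger_t;

void ledger_init(ledger_t *l) {
    assert(l != NULL);
    l->live = 0;
    l->high_water = 0;
}

/* Record a handle at the point where a wrapper would see it created.
 * Returns 0 on success, -1 if the ledger itself is full. */
int ledger_acquire(ledger_t *l, const void *h) {
    assert(l != NULL);
    if (l->live == LEDGER_CAPACITY) return -1;
    l->handles[l->live++] = h;
    if (l->live > l->high_water) l->high_water = l->live;
    return 0;
}

/* Forget a handle at the point where a wrapper would see it freed.
 * Returns 0 on success, -1 if the handle was never recorded, which
 * would indicate a double free or a bogus handle. */
int ledger_release(ledger_t *l, const void *h) {
    assert(l != NULL);
    for (size_t i = 0; i < l->live; i++) {
        if (l->handles[i] == h) {
            /* Order does not matter, so fill the hole with the last entry. */
            l->handles[i] = l->handles[--l->live];
            return 0;
        }
    }
    return -1;
}

/* Handles still live when the program finalizes were never released. */
size_t ledger_leaked(const ledger_t *l) {
    assert(l != NULL);
    return l->live;
}
```

Checking ledger_leaked at finalization time reports handles that were never released, and the high_water field shows how close a run came to exhausting a resource even when nothing leaked.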
This talk presents a distributed-memory version of Umpire that uses
techniques similar to those employed in high-performance multi-threaded
MPI implementations. This version of Umpire has identified several MPI
programming errors in Sphinx, a widely available MPI benchmark suite,
and initial performance results with several applications are
promising. The talk will cover the distributed design and preliminary
results, as well as key issues in identifying complex MPI programming
errors, such as deadlocks involving MPI_Recv with MPI_ANY_SOURCE.
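
Wildcard receives complicate deadlock detection because a rank blocked in MPI_Recv with MPI_ANY_SOURCE can be released by any peer, not by one specific peer. The sketch below shows one simple way to model this; it is not necessarily the algorithm Umpire uses. It assumes a simplified world in which each rank is running, finished, or blocked on a single receive, and it ignores messages already in flight. The names proc_t, rank_state_t, ANY_SOURCE, and find_deadlocked are hypothetical, and ANY_SOURCE is a stand-in for MPI_ANY_SOURCE so that the code compiles without an MPI installation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ANY_SOURCE (-1) /* stand-in for MPI_ANY_SOURCE in this model */

typedef enum { RUNNING, BLOCKED, FINISHED } rank_state_t;

typedef struct {
    rank_state_t state;
    int waits_on; /* peer rank or ANY_SOURCE; meaningful only if BLOCKED */
} proc_t;

/* Sets stuck[i] to true for every blocked rank that no sequence of
 * releases can ever wake, and returns how many such ranks exist.
 * Optimistic assumption: a rank that can still run will eventually
 * send whatever its waiters need. */
size_t find_deadlocked(const proc_t *procs, size_t n, bool *stuck) {
    size_t i, j, count = 0;
    bool changed = true;

    /* Initially only RUNNING ranks are able to send anything. */
    for (i = 0; i < n; i++) stuck[i] = (procs[i].state != RUNNING);

    /* Repeatedly release blocked ranks until nothing changes. */
    while (changed) {
        changed = false;
        for (i = 0; i < n; i++) {
            bool can_release = false;
            if (procs[i].state != BLOCKED || !stuck[i]) continue;
            if (procs[i].waits_on == ANY_SOURCE) {
                /* Wildcard receive: any other live rank may satisfy it. */
                for (j = 0; j < n; j++)
                    if (j != i && !stuck[j]) can_release = true;
            } else {
                /* Specific source: only that one rank can satisfy it. */
                assert(procs[i].waits_on >= 0 &&
                       (size_t)procs[i].waits_on < n);
                can_release = !stuck[procs[i].waits_on];
            }
            if (can_release) {
                stuck[i] = false;
                changed = true;
            }
        }
    }

    /* FINISHED ranks cannot send, but they are done, not deadlocked. */
    for (i = 0; i < n; i++) {
        if (procs[i].state == FINISHED) stuck[i] = false;
        if (stuck[i]) count++;
    }
    return count;
}
```

In this model a wildcard receive is deadlocked only when every other rank is finished or itself stuck, whereas a receive from a specific source is deadlocked as soon as that one source is. A runtime tool would additionally have to account for buffered and in-flight messages, which this sketch omits.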