Computer Science and Mathematics Projects

Integrated End-to-End Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD)

It is increasingly difficult to design, analyze, and implement large-scale workflows for scientific computing, especially when time-critical decisions must be made. Workflows are designed to execute on a loosely connected set of distributed and heterogeneous computational resources. Each computational resource may have vastly different capabilities, ranging from sensors to high-performance clusters. Frequently, workflows are composite applications built from loosely connected parts. Each task of a workflow may be designed for a different programming model and implemented in a different language. Most workflow tasks communicate via files sent over general-purpose networks. As a result of this complex software and execution space, large-scale scientific workflows exhibit extreme performance variability. It is therefore critically important to understand the factors that influence their performance and the opportunities for optimizing their execution.

Performance Prediction and Diagnosis

The performance of a workflow is determined by a wide range of factors. Some are specific to a particular workflow component and include both software factors (application, data sizes, etc.) and hardware factors (compute nodes, I/O, network). Others stem from the combination and orchestration of the different tasks in the workflow, including the workflow engine, the mapping of the workflow onto the distributed resources, the coordination of tasks and data organization across programming models, and the interaction between workflow components.

IPPD addresses three core issues in order to provide insights that can be used to both explain and optimize workflow execution:

  1. Provide an expectation of the performance of a workflow in advance of execution, to establish a best-performance baseline;
  2. Identify areas of consistently low performance and diagnose their causes; and
  3. Study the important issue of performance variability.
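As a rough illustration of how these three issues relate, the sketch below compares the observed runtimes of a single task against a predicted baseline. It is not the IPPD method: the function names, the slowdown threshold, and the cutoff values are all hypothetical, chosen only to make the distinction between "consistently slow" and "highly variable" concrete.

```python
from statistics import mean, stdev


def summarize_task(observed, baseline, slow_factor=1.5):
    """Compare observed runtimes (seconds) of one task with a predicted baseline.

    `baseline` stands in for an in-advance performance expectation (issue 1);
    `slow_factor` is an assumed threshold, not a value from the project.
    """
    avg = mean(observed)
    # Coefficient of variation: spread of runtimes relative to their mean (issue 3).
    cv = stdev(observed) / avg if len(observed) > 1 else 0.0
    # Count runs that exceed the baseline by more than the assumed factor (issue 2).
    slow_runs = sum(1 for t in observed if t > slow_factor * baseline)
    return {
        "mean": avg,
        "slowdown": avg / baseline,
        "cv": cv,
        "slow_fraction": slow_runs / len(observed),
    }


def classify(summary, slow_fraction_cutoff=0.8, cv_cutoff=0.25):
    """Label a task using hypothetical cutoffs on the summary statistics."""
    if summary["slow_fraction"] >= slow_fraction_cutoff:
        return "consistently slow"
    if summary["cv"] >= cv_cutoff:
        return "highly variable"
    return "near baseline"


if __name__ == "__main__":
    runs = {
        "stage_in": [10.0, 10.5, 9.8, 10.2],
        "simulate": [20.0, 21.0, 19.0, 22.0],
        "analyze": [10.0, 30.0, 9.0, 11.0],
    }
    for task, observed in runs.items():
        print(task, classify(summarize_task(observed, baseline=10.0)))
```

In this toy data, a task whose runs nearly all exceed the baseline would be a candidate for diagnosis, while a task with a few outlying runs would be a candidate for variability analysis. A real study would need to account for the hardware and orchestration factors described above.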

The design and analysis of large-scale scientific workflows are difficult precisely because each task can exhibit extreme performance variability. New prediction and diagnostic methods are required to enable efficient use of present and emerging workflow resources.

As part of IPPD, Brookhaven National Laboratory is focusing on the development of a new provenance paradigm that will enable the capture of empirical workflow performance information in extreme-scale environments, in order to define performance baselines and identify consistent performance bottlenecks and their sources.

Publications

Kleese van Dam, K., Stephan, E., Raju, B., Altintas, I., Elsethagen, T., Krishnamoorthy, S. November 2015. Enabling Structured Exploration of Workflow Performance Variability in Extreme-scale Environments. In Proceedings of MTAGS15, co-located with SC15.