General Lab Information

Computation and Data-Driven Discovery (C3D) Projects

Streaming Visualization for Performance Anomaly Detection

Due to the sheer volume of data, it is typically impractical to analyze the detailed performance of a high-performance computing (HPC) application running at scale. While conventional small-scale benchmarking and scaling studies are often sufficient for simple applications, many modern workflow-based applications couple multiple elements with competing resource demands and complex inter-communication patterns, for which performance cannot easily be studied in isolation or at small scales. This work discusses Chimbuko, a performance analysis framework that provides real-time, in situ anomaly detection. By focusing specifically on performance anomalies and their origin (provenance), data volumes are dramatically reduced without losing necessary details. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework. We demonstrate the tool’s usefulness on the Summit supercomputing system housed at the Oak Ridge Leadership Computing Facility.

As part of this effort, Chimbuko Visualization is presented as the module that enables monitoring and analysis of performance anomalies while the application is running. It comprises a back end server and a front end graphical user interface.

Back end server

The design of the back end server connects two dataflows: real-time performance statistics provided periodically by the parameter server (PS) and detailed anomaly data stored in the provenance database. The visualization module therefore has three major roles: 1) receiving and processing the streaming statistics from the parameter server, 2) querying the provenance database and processing the queried data, and 3) visualizing the processed data for users and responding to their interactions. The goal is a scalable server that can digest requests asynchronously with minimal latency and memory overhead and can handle concurrent requests and long-running tasks from the connected clients.

To meet these requirements, the back end server has been redesigned from previous work to provide two levels of scalability. At the first level, uWSGI is adopted to handle multiple concurrent connections. At the second level, requests are distributed to Celery workers and handled asynchronously, for both short- and long-running tasks. Finally, streaming (or broadcasting) data to the connected users is accomplished with WebSocket technology via the Socket.IO library.
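The benefit of handling short- and long-running tasks asynchronously can be illustrated with a minimal, self-contained sketch. Here Python's standard asyncio library stands in for the uWSGI/Celery worker stack, and the task and handler names are hypothetical, not from the Chimbuko codebase:

```python
import asyncio

async def short_task(request_id: int) -> str:
    # A quick statistics lookup returns almost immediately.
    await asyncio.sleep(0.01)
    return f"stats-{request_id}"

async def long_task(request_id: int) -> str:
    # A provenance-database query may take much longer; dispatching it
    # concurrently keeps short requests from queuing up behind it.
    await asyncio.sleep(0.05)
    return f"provenance-{request_id}"

async def serve(requests):
    # Dispatch every request without waiting for earlier ones to finish,
    # mirroring how a pool of workers digests requests asynchronously.
    tasks = [
        asyncio.create_task(long_task(i) if kind == "long" else short_task(i))
        for i, kind in requests
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(serve([(1, "short"), (2, "long"), (3, "short")]))
print(results)  # short requests complete without being blocked by the long one
```

In the actual server, the same separation is what lets a slow provenance query run on a Celery worker while uWSGI keeps serving new connections.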

Front end interface

The design of the visualizations shown in the front end interface manages two types of data: 1) dynamic performance statistics, streamed and displayed in the context of their execution while the code is running, and 2) detailed function-execution data retrieved for deeper investigation when a user selects a time interval and drills into individual function executions.

To meet these requirements, two front end visualizations are employed that present data following an “overview first, zoom and filter, then details on demand” approach.

Dynamic statistics visualization. Streaming data from the parameter server (PS) module are processed into a number of anomaly statistics, including the average, standard deviation, maximum, minimum, and total number of anomalous function executions. Users can select a statistic along with the number of ranks for which it is visualized. A dynamic “ranking dashboard” of the most problematic Message Passing Interface (MPI) ranks is provided at rank-level granularity.
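The per-rank statistics and the ranking dashboard can be sketched with standard-library Python. The data layout and function names below are hypothetical, chosen only to illustrate how the listed statistics would be computed and used to rank MPI ranks:

```python
import statistics

def rank_statistics(anomaly_counts):
    """Summarize per-time-frame anomaly counts for one MPI rank."""
    return {
        "average": statistics.mean(anomaly_counts),
        "stddev": statistics.pstdev(anomaly_counts),
        "maximum": max(anomaly_counts),
        "minimum": min(anomaly_counts),
        "total": sum(anomaly_counts),
    }

def most_problematic(per_rank_counts, stat="total", top=3):
    """Order MPI ranks by a user-selected statistic, as the dashboard does."""
    scored = {rank: rank_statistics(c)[stat] for rank, c in per_rank_counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top]

# Hypothetical anomaly counts per time frame, keyed by MPI rank.
counts = {0: [1, 2, 1], 1: [5, 9, 4], 2: [0, 0, 3]}
print(most_problematic(counts, stat="total"))  # rank 1 leads with 18 anomalies
```

Switching the `stat` argument corresponds to the user picking a different statistic in the dashboard.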

Selecting the corresponding ranks prompts the visualization server to broadcast the number of anomalies per time frame (e.g., per second) for those ranks to the connected users while the performance-traced applications are running. This streaming scatter plot provides time-frame-level granularity by showing how an MPI rank’s anomaly count changes from one time frame to the next.
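The per-time-frame aggregation behind the streaming scatter plot amounts to binning anomaly timestamps into fixed-length frames. A minimal sketch, with hypothetical timestamp data:

```python
from collections import Counter

def anomalies_per_frame(timestamps, frame_len=1.0):
    """Bin anomaly timestamps (in seconds) into fixed time frames,
    yielding the (frame_index, count) points of the scatter plot."""
    frames = Counter(int(t // frame_len) for t in timestamps)
    return sorted(frames.items())

# Anomaly event timestamps for one MPI rank (hypothetical data).
points = anomalies_per_frame([0.2, 0.7, 1.1, 3.4, 3.9])
print(points)  # [(0, 2), (1, 1), (3, 2)]
```

Each point would be broadcast to connected clients as its time frame closes, so the plot advances in step with the running application.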

Detailed functions visualization. This visualization retrieves data from the provenance database and shows function execution details. It consists of two parts: a function view and a call stack view. The function view visualizes the distribution of functions executed within a selected time interval; the distribution can be controlled by selecting the X and Y axes from among different function properties.
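Selecting the axes of the distribution can be thought of as grouping provenance records by one property and aggregating another. The record fields and property names below are hypothetical, used only to illustrate the idea:

```python
def distribution(executions, x="name", y="runtime"):
    """Group function-execution records by the selected X property and
    average the selected Y property, as the function view's axes do."""
    buckets = {}
    for rec in executions:
        buckets.setdefault(rec[x], []).append(rec[y])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Hypothetical provenance records for a selected time interval.
records = [
    {"name": "solve", "rank": 0, "runtime": 4.0},
    {"name": "solve", "rank": 1, "runtime": 6.0},
    {"name": "io_write", "rank": 0, "runtime": 1.0},
]
print(distribution(records, x="name", y="runtime"))
# {'solve': 5.0, 'io_write': 1.0}
```

Re-invoking `distribution` with different `x` and `y` arguments mirrors the user switching axes in the function view.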

In the call stack view, users can investigate a selected function execution in detail. The invocation relationships among functions and their communications with other ranks are presented to help users interpret the potential cause of the anomalous behavior.
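Recovering the invocation relationships for a selected execution can be sketched as walking parent links through the execution records. The record layout is hypothetical, not the provenance database's actual schema:

```python
def call_stack(executions, exec_id):
    """Walk parent links to recover the invocation chain (call stack)
    leading to a selected function execution."""
    by_id = {e["id"]: e for e in executions}
    stack = []
    node = by_id.get(exec_id)
    while node is not None:
        stack.append(node["name"])
        node = by_id.get(node["parent"])
    return stack[::-1]  # outermost caller first

# Hypothetical execution records with parent-child invocation links.
events = [
    {"id": 1, "name": "main", "parent": None},
    {"id": 2, "name": "solve", "parent": 1},
    {"id": 3, "name": "MPI_Send", "parent": 2},
]
print(call_stack(events, 3))  # ['main', 'solve', 'MPI_Send']
```

An `MPI_Send` leaf like the one above is where cross-rank communication would be annotated in the view.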

Finally, the rank statistics, along with the function statistics in the PS that describe overall workflow performance, can be stored for post hoc analysis. Normal events are also sampled to enrich contextual understanding.
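Persisting the statistics for post hoc analysis could be as simple as serializing them to disk. A minimal standard-library sketch, with hypothetical data and file naming (the actual storage format is not specified here):

```python
import json
import os
import tempfile

def store_for_posthoc(rank_stats, path):
    """Persist per-rank statistics so they can be analyzed after the run."""
    with open(path, "w") as f:
        json.dump(rank_stats, f)

def load_posthoc(path):
    """Reload stored statistics for post hoc analysis."""
    with open(path) as f:
        return json.load(f)

# Hypothetical end-of-run statistics keyed by rank.
stats = {"rank_0": {"total_anomalies": 18}, "rank_1": {"total_anomalies": 4}}
path = os.path.join(tempfile.mkdtemp(), "stats.json")
store_for_posthoc(stats, path)
print(load_posthoc(path) == stats)  # True
```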

The code and other detailed documentation are available here.