Computation and Data-Driven Discovery (C3D) Projects
Provenance and Reproducibility Tools for Multi-Modal X-ray Spectroscopy Experiments
Brookhaven Lab is building world leadership through the National Synchrotron Light Source II (NSLS-II), Center for Functional Nanomaterials (CFN), and Computational Science Initiative (CSI). New-generation detectors produce massive amounts of data that require novel infrastructure and advanced computational discovery and analysis techniques to address the challenges of designing and validating new material structures. Efficient, automated metadata and provenance capture, event tagging, and real-time analysis of massive experimental and observational data are significant challenges faced by all DOE facilities. As light sources get brighter and detectors get larger and faster, increasingly large data sets are generated by diverse user communities, processed in many unique and highly customized scientific workflows, and made more complex by large and changing collections of instruments with broad ranges of data rates, structures, and access patterns. Multi-modal techniques that characterize experimental data across different imaging modalities and facilities are poised to increase complexity by several orders of magnitude. The key data challenges of complexity, velocity, and volume require a new data and computing infrastructure, including curation, detailed recording of the processes acting on data, and advanced computational discovery techniques integrated into workflows, to enable autonomous experiments in the future.
Exploring relationships between material structures and desired properties for material discovery and synthesis requires mining machine-readable databases and linking experimental data with sufficiently rich annotations. There are vast gaps both in these data sources and in the connections between them. This project is developing building blocks for a provenance-based data management and analysis framework that enables capturing, persisting, and reanalyzing experimental data. Our system leverages NSLS-II Bluesky for experimental data acquisition, captures analysis parameters, and enriches the search space with results from the scientific literature (Figure 1). We have piloted a single search interface that spans heterogeneous data sources, covering samples and experiments for Inner Shell Spectroscopy (ISS) and X-ray Powder Diffraction (XPD). The Crystallography Open Database, an open-access collection of files for inorganic, organic, and metal-organic compounds and minerals, is also available through the same search system. The scientific literature module is described here.
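The idea of searching several heterogeneous sources from one interface can be illustrated with a minimal sketch. It does not reflect the project's actual search system or any Bluesky, Databroker, or Crystallography Open Database API. The `Record` type, the `make_source` and `unified_search` helpers, and the in-memory records are all hypothetical placeholders chosen only to show one query being sent to every source and the results merged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    source: str   # which backend the hit came from
    sample: str   # sample or compound name
    detail: str   # free-text description of the entry


# Hypothetical in-memory stand-ins for the real backends (placeholder data only).
ISS_RUNS = [Record("ISS", "NiO", "absorption spectroscopy scan")]
XPD_RUNS = [
    Record("XPD", "NiO", "powder diffraction pattern"),
    Record("XPD", "TiO2", "powder diffraction pattern"),
]
COD_ENTRIES = [Record("COD", "NiO", "crystal structure entry")]


def make_source(records: List[Record]) -> Callable[[str], List[Record]]:
    """Wrap a record list as a search function that matches on sample name."""
    def search(query: str) -> List[Record]:
        q = query.lower()
        return [r for r in records if q in r.sample.lower()]
    return search


# Registry of searchable sources; adding a new source is one more entry here.
SOURCES: Dict[str, Callable[[str], List[Record]]] = {
    "ISS": make_source(ISS_RUNS),
    "XPD": make_source(XPD_RUNS),
    "COD": make_source(COD_ENTRIES),
}


def unified_search(query: str) -> List[Record]:
    """Send one query to every registered source and merge the hits."""
    hits: List[Record] = []
    for name in sorted(SOURCES):  # sorted for a deterministic result order
        hits.extend(SOURCES[name](query))
    return hits
```

Calling `unified_search("NiO")` in this sketch returns hits from all three placeholder sources, each tagged with its origin, so a sample can be traced across experiments and reference structures.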
The data analysis pipelines are built upon the NSLS-II Databroker Event Model. They leverage experimental parameters captured in Databroker, automatically capture analysis processes, and persist them as directed acyclic graphs (DAGs) to obtain a complete provenance trace of the analysis (Figure 2). As a result, the details of each analysis are preserved for future comparison and replay by others (Figure 3).
Figure 2: DAG representing analysis of tooth data reconstruction using TomoPy.
Figure 3: Comparison of two reconstructions of the same data by two users (left and center) and their differences (right).
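To make the provenance idea concrete, the following is a minimal sketch of recording analysis steps as a DAG and replaying them with changed parameters. It is an illustration under stated assumptions, not the project's implementation: it does not use Databroker or the Event Model, and the `ProvenanceGraph` class, its methods, the two toy processing functions, and the numeric input are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: int
    name: str
    func: Optional[Callable[..., Any]]  # None for raw-data nodes
    params: Dict[str, Any]              # keyword parameters used by the step
    parents: List[int]                  # ids of the nodes this step consumed
    value: Any                          # result produced when first run


class ProvenanceGraph:
    """Hypothetical recorder: every step becomes a node; edges run from input to output."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def _add(self, node: Node) -> int:
        self.nodes.append(node)
        return node.node_id

    def source(self, name: str, value: Any) -> int:
        """Register raw data as a node with no parents."""
        return self._add(Node(len(self.nodes), name, None, {}, [], value))

    def step(self, func: Callable[..., Any], *parents: int, **params: Any) -> int:
        """Run func on the parents' values and record the call.

        Parents must already exist, so the graph is acyclic by construction.
        """
        inputs = [self.nodes[p].value for p in parents]
        value = func(*inputs, **params)
        node = Node(len(self.nodes), func.__name__, func, dict(params), list(parents), value)
        return self._add(node)

    def edges(self) -> List[Tuple[int, int]]:
        return [(p, n.node_id) for n in self.nodes for p in n.parents]

    def replay(self, target: int, overrides: Optional[Dict[int, Dict[str, Any]]] = None) -> Any:
        """Recompute target from the raw data, optionally overriding step parameters."""
        overrides = overrides or {}
        cache: Dict[int, Any] = {}

        def compute(i: int) -> Any:
            if i not in cache:
                n = self.nodes[i]
                if n.func is None:
                    cache[i] = n.value
                else:
                    params = {**n.params, **overrides.get(i, {})}
                    cache[i] = n.func(*[compute(p) for p in n.parents], **params)
            return cache[i]

        return compute(target)


# Toy analysis steps (placeholders for real processing routines).
def subtract_background(signal: List[float], level: float = 0.0) -> List[float]:
    return [x - level for x in signal]


def normalize(signal: List[float]) -> List[float]:
    peak = max(signal)
    return [x / peak for x in signal]


graph = ProvenanceGraph()
raw = graph.source("raw_scan", [2.0, 4.0, 6.0])          # placeholder data
bg = graph.step(subtract_background, raw, level=1.0)
norm = graph.step(normalize, bg)

# A second user replays the same analysis with a different background level.
alternative = graph.replay(norm, overrides={bg: {"level": 2.0}})
```

Because each node stores its function, parameters, and parents, the original result and the replayed variant can be compared directly, which is the kind of side-by-side comparison the figures describe. A persistent version would also need to serialize the nodes, which this sketch omits.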
Pouchard, L., Juhas, P., Billinge, S., Wright, C., Campbell, S., Park, G., Stavitski, E., Van Dam, H. B. Provenance infrastructure for multimodal x-ray experiments and reproducible analysis. Handbook on Big Data and Machine Learning in the Physical Sciences. Vol. 2: Advanced Analysis Solutions for Leading Experimental Techniques. Eds. K. Kleese, S. Campbell, K. Yager, R. Farnsworth, M. Van Dam. World Scientific, 2020. https://doi.org/10.1142/9789811204579_0015.