Computation and Data-Driven Discovery (C3D) Projects
Provenance and Reproducibility Tools for Multi-Modal X-ray Spectroscopy Experiments
New-generation detectors produce massive amounts of data, and using those data to design and validate new material structures requires novel infrastructure and advanced computational discovery and analysis techniques. Efficient automated metadata and provenance capture, event tagging, and real-time analysis of massive experimental and observational data are significant challenges faced by all DOE scientific user facilities. As light sources get brighter and detectors get larger and faster, diverse user communities generate increasingly large data sets. These data are processed in many unique, highly customized scientific workflows and come from large, changing collections of instruments with broad ranges of data rates, structures, and access patterns. Multi-modal techniques that characterize experimental data across different imaging modalities and facilities are poised to increase complexity by several orders of magnitude. Meeting the key data challenges of complexity, velocity, and volume requires a new data and computing infrastructure, including curation, detailed recording of the processes acting on data, and advanced computational discovery techniques integrated into workflows to enable future autonomous experiments.
Exploring relationships between material structures and desired properties for material discovery and synthesis requires mining machine-readable databases and linking experimental data with sufficiently rich annotations. There are vast gaps both in these data sources and in the connections between them. This project is developing the building blocks of a provenance-based data management and analysis framework that enables capturing, persisting, and reanalyzing experimental data. The system leverages NSLS-II Bluesky for experimental data acquisition, captures analysis parameters, and enriches the search space with results from the scientific literature (Figure 1). A pilot project launched a single search interface over heterogeneous data sources, covering samples and experiments from the Inner Shell Spectroscopy (ISS) and X-ray Powder Diffraction (XPD) beamlines. The Crystallography Open Database, an open-access collection of files for inorganic, organic, and metal-organic compounds and minerals, is also available through the same search system.
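One way to picture a single search interface over heterogeneous sources is as a set of thin adapters, each mapping one source's native record layout onto a shared result type. The sketch below is illustrative only and is not the project's implementation: the record layouts, field names (`scan_uid`, `sample_formula`, `cod_id`), identifiers, and the choice of chemical formula as the join key are all assumptions, and the records are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str      # which backend produced the hit
    identifier: str  # backend-specific record id
    formula: str     # normalized formula, used here as the common search key


# Each adapter maps one backend's (hypothetical) record layout onto Hit.
def iss_adapter(records):
    for r in records:
        yield Hit("ISS", r["scan_uid"], r["sample"]["composition"])


def xpd_adapter(records):
    for r in records:
        yield Hit("XPD", r["uid"], r["sample_formula"])


def cod_adapter(records):
    # Assumption: this source stores formulas with spaces between elements.
    for r in records:
        yield Hit("COD", str(r["cod_id"]), r["formula"].replace(" ", ""))


def search(formula, sources):
    """Return hits from every source whose normalized formula matches."""
    hits = []
    for adapter, records in sources:
        hits.extend(h for h in adapter(records) if h.formula == formula)
    return hits


# Placeholder records standing in for three differently shaped backends.
SOURCES = [
    (iss_adapter, [{"scan_uid": "iss-0001", "sample": {"composition": "TiO2"}},
                   {"scan_uid": "iss-0002", "sample": {"composition": "NiO"}}]),
    (xpd_adapter, [{"uid": "xpd-0001", "sample_formula": "TiO2"}]),
    (cod_adapter, [{"cod_id": 1, "formula": "Ti O2"},
                   {"cod_id": 2, "formula": "Ni O"}]),
]

for hit in search("TiO2", SOURCES):
    print(hit.source, hit.identifier)
```

Keeping the normalization inside each adapter means a new source can be added without touching the search function, which is the main reason for this layout.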
The data analysis pipelines are built upon the NSLS-II Databroker Event Model. They leverage experimental parameters captured in Databroker, automatically capture analysis processes, and persist them as directed acyclic graphs (DAGs) to obtain a complete provenance trace of the analysis (Figure 2). As a result, the details of an analysis are preserved so that others can compare and replay it later (Figure 3).
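The idea of recording an analysis as a DAG and replaying it can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the Databroker or Event Model API: the `ProvenanceGraph` class, the `replay` and `changed_params` helpers, and the two toy processing steps are all hypothetical names introduced for this example.

```python
import hashlib
import json


class ProvenanceGraph:
    """Records each analysis step as a node; parents are the step's inputs."""

    def __init__(self):
        self.nodes = {}   # node id -> {"op", "params", "parents"}
        self.values = {}  # node id -> computed value (in memory only)

    def add_source(self, name, value):
        node_id = f"n{len(self.nodes)}"
        self.nodes[node_id] = {"op": "source", "params": {"name": name}, "parents": []}
        self.values[node_id] = value
        return node_id

    def apply(self, func, parents, **params):
        """Run func on the parent values and record how the result was made."""
        node_id = f"n{len(self.nodes)}"
        self.nodes[node_id] = {"op": func.__name__, "params": params,
                               "parents": list(parents)}
        self.values[node_id] = func(*(self.values[p] for p in parents), **params)
        return node_id

    def to_json(self):
        return json.dumps(self.nodes, sort_keys=True)

    def fingerprint(self):
        """Hash of the recorded steps and parameters (not of the data)."""
        return hashlib.sha256(self.to_json().encode()).hexdigest()


def replay(nodes, sources, registry):
    """Re-run a recorded graph. Ids are assigned in creation order, so
    sorting by their numeric suffix visits parents before children."""
    values = {}
    for node_id in sorted(nodes, key=lambda k: int(k[1:])):
        node = nodes[node_id]
        if node["op"] == "source":
            values[node_id] = sources[node["params"]["name"]]
        else:
            func = registry[node["op"]]
            values[node_id] = func(*(values[p] for p in node["parents"]),
                                   **node["params"])
    return values


def changed_params(nodes_a, nodes_b):
    """Steps present in both graphs whose parameters differ."""
    return {nid: (nodes_a[nid]["params"], nodes_b[nid]["params"])
            for nid in nodes_a
            if nid in nodes_b and nodes_a[nid]["params"] != nodes_b[nid]["params"]}


# Toy processing steps standing in for real analysis routines.
def subtract_background(signal, offset):
    return [x - offset for x in signal]


def normalize(signal):
    peak = max(signal)
    return [x / peak for x in signal]


REGISTRY = {f.__name__: f for f in (subtract_background, normalize)}
raw = [2.0, 4.0, 6.0]


def run_analysis(offset):
    g = ProvenanceGraph()
    src = g.add_source("raw_scan", raw)
    bg = g.apply(subtract_background, [src], offset=offset)
    out = g.apply(normalize, [bg])
    return g, out


g1, out1 = run_analysis(offset=2.0)   # first user's choice
g2, out2 = run_analysis(offset=1.0)   # second user's choice

# Replay the first analysis from its persisted JSON form.
replayed = replay(json.loads(g1.to_json()), {"raw_scan": raw}, REGISTRY)
print(replayed[out1] == g1.values[out1])
print(changed_params(g1.nodes, g2.nodes))
```

Because the graph stores operation names and parameters rather than the data, two analyses of the same input can be compared step by step, and a persisted graph can be re-executed by anyone who has the raw data and the same functions.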
Figure 2: DAG representing analysis of tooth data reconstruction using tomopy.
Figure 3: Comparison of two reconstructions of the same data by two users (left and center) and their differences (right).
Pouchard, L., Juhas, P., Billinge, S., Wright, C., Campbell, S., Park, G., Stavitski, E., Van Dam, H. B. Provenance infrastructure for multimodal x-ray experiments and reproducible analysis. Handbook on Big Data and Machine Learning in the Physical Sciences. Vol 2: Advanced Analysis Solutions for Leading Experimental Techniques. Eds. K. Kleese, S. Campbell, K. Yager, R. Farnsworth, M. Van Dam, World Scientific. 2020. https://doi.org/10.1142/9789811204579_0015.