Computation and Data-Driven Discovery (C3D) Projects

Experimental Data Curation with the Open-source Invenio Platform

Scientific results are traditionally disseminated in publications. With the deluge of data unleashed by modern computational methods and high-throughput detector facilities, such as the National Synchrotron Light Source II (NSLS-II) at Brookhaven National Lab and others worldwide, publications alone have become insufficient for communicating results. Access to the data supporting results, their lineage, and the software processes operating on the data (also known as provenance) is necessary to improve communication of results and to enable their validation, interpretation, and ultimately reproducibility. Data repositories have become a preferred way to store, search, share, and publish the datasets associated with scientific projects. Repositories also can hold supplemental materials, such as data files and samples, that have no place in the traditional scientific literature. Brookhaven Lab has designed some of its archives using the open-source Invenio software platform developed at CERN. Invenio provides the means to develop a state-of-the-art, standards-based archival service built on widely used technologies. The best-known service of this kind is Zenodo. Invenio also provides modules to build archival packages that comply with the international standard for Open Archival Information Systems, and it complies with U.S. Department of Energy (DOE) data management regulations.

Brookhaven Lab has deployed customized instances of Invenio backed by its extensive, world-leading disk storage, tape backup, networking, and data-intensive computing infrastructure. A user interacts with the repository portal either through the graphical user interface (GUI) for single-record upload or through REST (representational state transfer) application programming interfaces (APIs) for bulk deposit, search, and download. The portal allows users to form communities of interest and supports customized metadata. Invenio was designed as a highly scalable system to meet the challenging data requirements of the CERN user community and, as such, should be able to accommodate future expansion required by Brookhaven Lab communities. Because the hardware backing the Invenio services is part of the overall Scientific Data and Computing Center data storage, computing, and networking infrastructure at Brookhaven Lab, it can likewise be expanded to meet user requirements and stakeholder budgets. Additional storage and processing capabilities (high throughput, high performance, or cloud) can be added on demand for both short- and long-term needs.
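As an illustration of the REST route for bulk deposit and search, the following minimal Python sketch prepares requests without sending them. It rests on several assumptions that are not taken from this page: the base URL is a placeholder, the `/api/records` endpoint and the `q`, `size`, and `page` parameters follow InvenioRDM-style conventions, the bearer-token header is assumed, and the function names are hypothetical. A customized instance may expose different endpoints or metadata fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; substitute the address of the actual repository instance.
BASE_URL = "https://invenio.example.org"


def build_search_url(query, size=10, page=1, base_url=BASE_URL):
    """Return a record-search URL (assumes an InvenioRDM-style /api/records endpoint)."""
    params = urlencode({"q": query, "size": size, "page": page})
    return f"{base_url}/api/records?{params}"


def build_deposit_request(metadata, token, base_url=BASE_URL):
    """Prepare, but do not send, a POST request that would create one record."""
    # The {"metadata": ...} envelope is an assumption; check the instance's schema.
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return Request(
        f"{base_url}/api/records",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def build_bulk_deposit_requests(records, token, base_url=BASE_URL):
    """Prepare one deposit request per metadata dictionary."""
    return [build_deposit_request(m, token, base_url) for m in records]
```

Each prepared request could then be passed to `urllib.request.urlopen` once the endpoint, token, and metadata schema have been confirmed for the target instance. Keeping request construction separate from sending makes the bulk logic easy to inspect before any data is uploaded.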

Several instances of the Invenio platform are deployed at Brookhaven Lab to serve the needs of different project communities. An important feature is that these instances use the InCommon Consortium and the COmanage system for authentication. This capability, unique within DOE, enables users whose institutions are members of InCommon to log in to the platform. It is particularly useful for collaborative projects with universities because their researchers do not need a guest appointment to access content. One instance of the Invenio platform has been deployed for storing programmatic data, such as reports and scientific measurements, in support of the National Nuclear Security Administration. Another facilitates collaboration and data sharing among project members in the Energy Frontier Research Center, Genesis. In addition to supporting data storage and access needs, the Institutional Cluster at the Scientific Data and Computing Center (SDCC) can be linked to Brookhaven Lab's Invenio instances for collaborative projects.

Figure 1

Publications

Pouchard, L., Campbell, S., Kleese Van Dam, K. “Experimental Data Curation at Large Instrument Facilities with Open Source Software,” International Journal of Digital Curation, Volume 14, Issue 1. https://doi.org/10.2218/ijdc.v14i1.637; http://www.ijdc.net/issue/view/27.