Computation and Data-Driven Discovery (C3D) Projects
Experimental Data Curation with the Open-source Invenio Platform
Scientific results are traditionally disseminated via publications. With the deluge of data from modern computational methods and high-throughput detectors, publications alone have become insufficient for communicating results. Access to the data supporting results, their lineage, and the software processes operating on the data, also known as provenance, is necessary to improve communication of results and to enable validation, interpretation, and, ultimately, reproducibility of scientific results. Data repositories have become a preferred way of storing, searching, sharing, and publishing datasets associated with scientific projects. Other supplemental materials, such as data files and samples that do not find their place in traditional scientific literature, can also be stored in repositories. Brookhaven Lab has designed some of its archives using the open-source Invenio software platform developed at CERN. Invenio provides the means to develop a state-of-the-art, standards-based archival service built on widely used technologies; the best-known service built on it is Zenodo. Invenio also provides modules to build archival packages in compliance with the international standard for Open Archival Information Systems, and it complies with U.S. Department of Energy (DOE) data management regulations.
Brookhaven has deployed customized instances of Invenio backed by the Lab’s world-leading, extensive disk storage, tape backup, networking, and data-intensive computing infrastructure. A user interacts with the repository portal either through the graphical user interface (GUI) for single-record upload or through REST application programming interfaces (APIs) for bulk deposit, search, and download. The portal allows users to form communities of interest and supports customized metadata. Invenio was designed as a highly scalable system to meet the challenging data requirements of the CERN user community. As such, it can accommodate any future expansion required by Brookhaven Lab research communities. As part of the overall Scientific Data and Computing Center data storage, computing, and networking infrastructure, the hardware backing Invenio services can likewise be expanded to meet user requirements and stakeholder budgets. Additional storage and processing capabilities (high throughput, high performance, or cloud) can be added on demand for both short- and long-term needs.
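The REST route for search and bulk deposit can be sketched as follows. This is a minimal illustration, not the Lab's actual interface: the base URL, the token, the `/api/records` path (which follows the convention used by public Invenio-based services), and the metadata fields are all assumptions, and a given Brookhaven instance may expose different endpoints. The sketch only builds the HTTP requests and does not send them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace with the real portal address and a personal token.
BASE_URL = "https://repository.example.org"
TOKEN = "REPLACE_ME"


def build_search_request(query, size=10, base_url=BASE_URL):
    """Build a GET request that searches records matching `query`.

    Assumes a `/api/records` search endpoint accepting `q` and `size`.
    """
    params = urlencode({"q": query, "size": size})
    return Request(
        f"{base_url}/api/records?{params}",
        headers={"Accept": "application/json"},
        method="GET",
    )


def build_deposit_requests(records, token=TOKEN, base_url=BASE_URL):
    """Build one POST request per metadata dictionary (bulk deposit).

    Assumes the instance accepts a JSON body of the form {"metadata": {...}}
    and bearer-token authentication.
    """
    requests = []
    for metadata in records:
        body = json.dumps({"metadata": metadata}).encode("utf-8")
        requests.append(
            Request(
                f"{base_url}/api/records",
                data=body,
                headers={
                    "Authorization": f"Bearer {token}",
                    "Content-Type": "application/json",
                },
                method="POST",
            )
        )
    return requests


if __name__ == "__main__":
    search = build_search_request("neutron scattering", size=5)
    print(search.get_method(), search.full_url)

    # Illustrative metadata only; real records would use the instance's schema.
    batch = build_deposit_requests([{"title": "Run 1"}, {"title": "Run 2"}])
    for req in batch:
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Each request could then be sent with `urllib.request.urlopen(req)`. Building the requests separately from sending them makes it easy to inspect a batch before depositing many records at once.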
At Brookhaven, Invenio instances are deployed using the InCommon federation and the COmanage system for authentication. This capability, unique within DOE, enables users whose institutions are members of InCommon to log in to the platform. It is particularly useful for collaborative projects with universities because researchers do not need a guest appointment to access content. One instance of the Invenio platform has been deployed for storing programmatic data, such as reports and scientific measurements, in support of the National Nuclear Security Administration. Another facilitates collaboration and data sharing among project members in the Energy Frontier Research Center, Genesis. In addition to supporting data storage and access needs, the CSI Institutional Cluster computational resources can be linked to Invenio instances for collaborative projects.
Pouchard, L., Campbell, S., Kleese Van Dam, K. “Experimental Data Curation at Large Instrument Facilities with Open Source Software,” International Journal of Digital Curation, Volume 14, Issue 1. https://doi.org/10.2218/ijdc.v14i1.637; http://www.ijdc.net/issue/view/27.