As the sole Tier-1 computing facility for ATLAS in the United States – and the largest ATLAS Tier-1 computing center worldwide – Brookhaven's RHIC and ATLAS Computing Facility provides a large portion of the overall computing resources for U.S. collaborators and serves as the central hub for storing, processing, and distributing ATLAS experimental data among scientists across the country. Brookhaven also houses one of the three U.S. ATLAS Analysis Support Centers, which organizes topical conferences and periodic "Jamborees" that train researchers in numerous data analysis techniques.
The ATLAS grid computing system is a complex structure, analogous to the power grid, that allows researchers and students around the world to analyze ATLAS data.
The beauty of the grid is that a wealth of computing resources is available for a scientist to carry out an analysis, even if those resources are not physically nearby. The data, software, processing power, and storage may be located hundreds or thousands of miles away, but the grid makes this invisible to the researcher.
Organizationally, the grid is set up in a tier system, with CERN, home of the Large Hadron Collider, serving as the Tier-0 center. CERN receives the raw data from the ATLAS detector, performs a first-pass analysis, and then distributes the data among ten Tier-1 locations, also known as regional centers, including Brookhaven. Each Tier-1 center, connected to Tier-0 by a dedicated high-performance optical network path, stores, processes, and analyzes a fraction of the raw data. It then distributes derived data to Tier-2 computing facilities, which provide data storage and processing capacity for more in-depth user analysis and simulation.
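The flow of data down the tiers can be pictured with a short Python sketch. The Site class and the site names below are invented for illustration and are not part of any actual ATLAS software; the point is simply that each tier stores what it receives and passes derived data to the sites beneath it.

    class Site:
        def __init__(self, name, tier):
            self.name = name
            self.tier = tier          # 0 (CERN), 1 (regional center), or 2
            self.children = []        # sites this one feeds
            self.datasets = []

        def distribute(self, dataset):
            self.datasets.append(dataset)                  # store a share locally
            derived = f"{dataset}.derived_t{self.tier + 1}"
            for child in self.children:                    # pass derived data downstream
                child.distribute(derived)

    # Hypothetical topology: CERN (Tier-0) feeds Brookhaven (Tier-1),
    # which in turn feeds two Tier-2 centers.
    cern = Site("CERN", 0)
    bnl = Site("Brookhaven", 1)
    tier2a, tier2b = Site("Tier2-A", 2), Site("Tier2-B", 2)
    cern.children = [bnl]
    bnl.children = [tier2a, tier2b]

    cern.distribute("raw_run_001")
    print(bnl.datasets)      # ['raw_run_001.derived_t1']
    print(tier2a.datasets)   # ['raw_run_001.derived_t1.derived_t2']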
The grid computing infrastructure is made up of several key components. The "fabric" consists of the hardware elements – processor "farms" comprising hundreds to thousands of compute nodes, disk and tape storage, and networking. The "applications" are the software programs that users employ, for example, to analyze data. Applications take the raw data from ATLAS and reconstruct it into meaningful information that scientists can interpret. Another type of software, called "middleware," links the fabric elements deployed within and across regions together so that they form a unified system — the Grid. The development of the middleware is a joint effort between physicists and computer scientists.
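To make the division of labor concrete, here is a minimal, purely illustrative sketch of the three layers. The FabricNode and Middleware classes and the submit call are hypothetical, not the interfaces of any real grid middleware; what matters is that the user hands an application and a dataset to the middleware, which decides which piece of the fabric does the work.

    import random

    class FabricNode:                     # "fabric": a compute node with local storage
        def __init__(self, site):
            self.site = site
        def run(self, application, data):
            return f"{application} processed {data} at {self.site}"

    class Middleware:                     # "middleware": hides where the work actually runs
        def __init__(self, nodes):
            self.nodes = nodes
        def submit(self, application, data):
            node = random.choice(self.nodes)   # the user never picks the site
            return node.run(application, data)

    grid = Middleware([FabricNode("BNL"), FabricNode("CERN"), FabricNode("SLAC")])
    # "application": the reconstruction program the physicist actually cares about
    print(grid.submit("reconstruct_tracks", "raw_run_001"))

In a production grid, the random choice would of course be replaced by a scheduler that weighs data locality, queue lengths, and site policies, but the layering is the same.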
Outside of high-energy physics, grid computing is used on smaller scales to manage data within other scientific areas such as astronomy, biology, and geology. But the LHC grid is the largest of its kind.
Funding for middleware development is provided by the National Science Foundation's (NSF) Information Technology Research program and by the U.S. Department of Energy (DOE). DOE also funds the Tier-1 center activities, while the Tier-2 centers are funded mostly by the NSF and the DOE. DOE and NSF also support the Open Science Grid, which provides the middleware used by all of the U.S. Tier-1 and Tier-2 sites.
The key to successfully managing ATLAS data to date has been highly efficient distributed data handling over powerful networks, minimal disk storage demands, minimal operational load, and constant innovation. The scientists store the data they want to keep permanently on tape or disk and use a workload distribution system known as PanDA to coherently aggregate that data and make it available to thousands of scientists via a globally distributed computing network. End users can access the needed files, stored on a server in the cloud, by making service requests.
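The end-user access pattern can be sketched roughly as follows. The DataService class, its replica catalog, and the hostnames are invented for illustration and are not the actual PanDA or ATLAS data-management interfaces; the idea is that a user asks for a dataset by name and receives a usable handle, without knowing which site or storage medium holds it.

    class DataService:
        def __init__(self, replica_catalog):
            # maps a dataset name to the sites holding a replica of it
            self.replica_catalog = replica_catalog

        def request(self, dataset):
            replicas = self.replica_catalog.get(dataset, [])
            if not replicas:
                raise FileNotFoundError(f"no replica of {dataset} on the grid")
            site = replicas[0]              # a real system would choose by locality and load
            return f"https://{site}/data/{dataset}"   # handle the user's job reads from

    service = DataService({"AOD.run123": ["bnl-storage.example.org",
                                          "cern-eos.example.org"]})
    print(service.request("AOD.run123"))

A production system would choose among replicas based on network cost and load, but the service-request pattern looks the same to the user.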
The latest drive to accommodate the torrent of data expected as the LHC begins collisions at higher energy is to move the tools of PanDA to the realm of supercomputers. The challenge is that time on advanced supercomputers is limited and expensive. But just as there’s room for sand in a ‘full’ jar of rocks, there’s room on supercomputers, between big jobs, for fine-grained processing of high-energy physics data.
The new fine-grained data processing system, called Yoda, is a specialization of an “event service” workflow engine designed for the efficient exploitation of distributed and architecturally diverse computing resources. To minimize the use of costly storage, data flows would make use of cloud data repositories with no pre-staging requirements. Every few minutes, the supercomputer would send “event requests” to the cloud for the small-batch subsets of data required for a particular analysis. This pre-fetched data would then be available for analysis on any unused supercomputing capacity: the grains of sand fitting in between the larger computational problems being handled by the machine.
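The idea can be summarized in a schematic loop. The functions below are stand-ins invented for illustration rather than the real Yoda code: whenever idle capacity is available, the loop requests a small batch of events from the cloud repository, processes it, and repeats.

    import time

    def idle_cores_available():
        # stand-in for querying the batch system for unused ("backfill") capacity
        return 64

    def fetch_event_batch(repo_url, batch_size):
        # stand-in for an "event request" to the cloud repository; it returns a
        # small subset of events, with no bulk pre-staging of data to local disk
        return [f"event_{i}" for i in range(batch_size)]

    def process(events):
        # stand-in for the per-event physics processing
        return [f"reconstructed({e})" for e in events]

    def event_service_loop(repo_url, cycles=3, batch_size=100):
        results = []
        for _ in range(cycles):                    # a real service would run continuously
            if idle_cores_available() > 0:
                batch = fetch_event_batch(repo_url, batch_size)
                results.extend(process(batch))
            time.sleep(0.1)                        # the article describes a cadence of minutes
        return results

    print(len(event_service_loop("https://cloud-repo.example.org")), "events processed")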
The Yoda system was constructed by a broad collaboration of U.S.-based ATLAS scientists at Brookhaven Lab, Lawrence Berkeley National Laboratory, Argonne National Laboratory, University of Texas at Arlington, and Oak Ridge National Laboratory, leveraging support from the DOE Office of Science—including the Office of Advanced Scientific Computing Research (ASCR) and the Office of High Energy Physics (HEP)—and the powerful high-speed networks of DOE’s Energy Sciences Network (ESnet).