Physicists and Students Format PHENIX Data for Easy Access

Effort standardizes data needed to unlock the secrets of matter while building skills and bringing new faces into science

The PHENIX detector at the Relativistic Heavy Ion Collider (RHIC) operated from 2000 to 2016, generating data that resulted in more than 200 scientific publications. (Joseph Rubino/Brookhaven National Laboratory)

Christine Nattrass, a physics professor at the University of Tennessee (UT), Knoxville, has recruited a crew of mostly undergraduate students to dig deep into data from billions of particle collisions at the Relativistic Heavy Ion Collider (RHIC)—a U.S. Department of Energy (DOE) Office of Science user facility for nuclear physics research at DOE’s Brookhaven National Laboratory. Their goal: reformat data from scientific papers published by RHIC’s PHENIX detector collaboration and upload it to a modern database now used across the nuclear and high energy physics (HEP) research communities.

Posting the PHENIX data to this database, known as “HEPData,” would make it accessible to anyone wanting to compare new findings with historical measurements or results from one experiment to another—or see how experimental results match up with theoretical descriptions of the building blocks of matter. 

The team has nearly completed the task—successfully reformatting and uploading the data from 201 of the 219 published PHENIX papers.* Drafts of the data for most of the remaining papers are undergoing final reviews.

“We worked with many PHENIX volunteers, including some people who were in grade school when these papers were originally written,” Nattrass said. “This is an ongoing collaboration-wide effort.”

Deep in the data

The project started when Nattrass was working with students to compare RHIC’s experimental findings with predictions about how the particles detected in collisions should behave. But to make those comparisons, they needed not just the scientific papers, which summarize findings in charts and graphs and include scientists’ interpretations, but also access to all the data points that generated those charts and graphs.

A computer image generated from data collected from a single collision of two gold nuclei at the center of the PHENIX detector. Scientists analyze data from millions or billions of collisions to arrive at the conclusions they describe in scientific papers. (The PHENIX Collaboration/Brookhaven National Laboratory)

Of all the data points generated by a typical PHENIX analysis of particle collisions, only a fraction may make it into the published paper. Nattrass noted one paper where the dataset fills more than 2,000 data tables. “That’s way more data than you can possibly put in a paper,” she said.

Those data exist on the original PHENIX collaboration website, but access to the website is now restricted to the members of the PHENIX group. Even for PHENIX members like Nattrass, getting the data in a useful format can be tricky, she said. In RHIC’s early days, dating back to 2000, when scientists were eager to share all their data by posting it on the once-open collaboration websites, there was no single standard for how that data appeared. Often, the numbers were simply pasted into text files, not formatted for easy searching and sorting. Reformatting that data in the now-standardized and publicly accessible HEPData format, Nattrass reasoned, would enable her and her students to make their comparisons and make the data available for download by anyone wanting future access. 

“Once it’s uploaded to HEPData, you can get access to all the data just by clicking a button,” she said.

For each paper, PHENIX physicists serve as the key to quality control. With a deep understanding of the physics, they must ensure that the reformatted data accurately represent the findings from the detector before the data are uploaded to the central database.

Maxim Potekhin, the PHENIX HEPData coordinator and a member of the Nuclear and Particle Physics Software (NPPS) group at Brookhaven Lab, oversees all submissions.

“We work very closely with CERN [Europe’s nuclear and particle physics lab] and the high energy and nuclear physics community at large to leverage best tools and practices to preserve PHENIX data and physics analyses,” Potekhin said. “This effort has been successful thanks to active participation of PHENIX members and efficient teamwork across PHENIX and the NPPS group. The results achieved by working in coordination with Christine’s group at UT have been quite remarkable for PHENIX and for the nuclear physics community.”

Opportunities for students

“For a couple of years, we were doing this parasitically,” Nattrass said. “As I had students working on projects and they needed the data for an analysis, they would format and upload the data to HEPData.”

In some cases, Nattrass offered students the opportunity to work on HEPData uploads to earn independent study credit, which fulfilled a physics elective requirement.

“Sometimes, a student would realize they did not have enough elective credit to graduate in their last semester,” Nattrass said. “This project would fulfill that requirement.”

They would also often gain skills they could put on their resume.

“They would learn how to use the Linux command line. They would do a lot of data analysis, even if it was not the most super high level. They would learn how to use a few different scripting languages and get some practical experience,” Nattrass said.

Christine Nattrass (front, lime green sweater), a physics professor at the University of Tennessee, Knoxville, started the PHENIX HEPData project by offering opportunities for undergraduate students to earn credit and eventually pay. Christal Martin, Nikolas Nelson, and Tom Krobatch are some of the students who participated. Martin and Nelson are now graduate students and Krobatch parlayed his experience into a job. (Showni Medlin/University of Tennessee, Knoxville)

In 2019, she incorporated the HEPData project into a “course-based undergraduate research experience” (CURE) she was teaching. She noted that such course-based research helps retain students in science majors. It also appeals to students who might not otherwise seek out a research experience—say, by having to ask an individual professor for an opportunity.

“It gives them a chance to get involved in research in a less intimidating way,” Nattrass said. “When you sign up for a class, you know how much you are signing up for.”

The course attracted a wide range of students, including some “nontraditional” students who had taken a break from their education, as well as women and students from other groups generally underrepresented in science, technology, engineering, and mathematics (STEM) fields.

“Of the 20 students who have taken my CURE since I started offering it, 50% are women and 25% are underrepresented minorities,” Nattrass said. Compare that to the demographics of those earning bachelor’s degrees in physics across the U.S.: 22% are women and around 15% are underrepresented minorities.

“Graduate student Christal Martin, who herself was a nontraditional student, helped bring students up to speed,” Nattrass said.

Martin got her start on the HEPData project as an undergraduate.

“I reformatted data for several papers so that I could use the data from HEPData,” she recalled. “I continued research as a graduate student and began mentoring other incoming physicists. I have been able to work with over a dozen CURE and summer research students who have had varying backgrounds and skills. I have enjoyed being able to get students started with HEPData tasks and see them transition their initial work to starting their own analyses—similar to what I did.”

Software innovation

Tom Krobatch, one of Nattrass’s first CURE students, was a physics and computer science double major who worked full-time while going to school. He wrote a software tool the group still uses.

The tool, called YAMLmaker, converts text files to a data serialization language called YAML, which is used in computer science and industry. It combs through text-based files and converts original tables containing data points into the YAML format. This helps ensure that the data can be interpreted properly, Nattrass explained. The software can handle the variability of how RHIC physicists and graduate students recorded their data.

“Different people can format data in a million different ways,” Krobatch said. “YAMLmaker can handle variations in the layout of the input data and convert them to YAML, which can be then used for HEPData and for processing in other systems and applications.”
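To make the idea concrete, here is a minimal sketch of this kind of conversion. It is not YAMLmaker itself, whose code the article does not show. The function names, the three-column layout (x value, y value, statistical error), and the sample numbers are all assumptions made for illustration. The key names follow the HEPData table layout as we understand it (independent and dependent variables, each with a header and a list of values). The sketch uses only Python's standard library and prints JSON, which YAML 1.2 parsers also accept; a dedicated YAML library would give the more familiar block style.

```python
import json


def parse_table(text):
    """Parse a loosely formatted numeric table into rows of floats.

    Assumption: '#' starts a comment, and columns may be separated by
    spaces, tabs, or commas. Real legacy files vary far more than this.
    """
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue  # skip blank or comment-only lines
        rows.append([float(tok) for tok in line.replace(",", " ").split()])
    return rows


def to_hepdata_table(rows, x_name, y_name, x_units="", y_units=""):
    """Build a HEPData-style table from rows of (x, y, stat_error).

    The three-column meaning is an assumption made for this example.
    """
    x_values, y_values = [], []
    for x, y, err in rows:
        x_values.append({"value": x})
        y_values.append(
            {"value": y, "errors": [{"symerror": err, "label": "stat"}]}
        )
    return {
        "independent_variables": [
            {"header": {"name": x_name, "units": x_units}, "values": x_values}
        ],
        "dependent_variables": [
            {"header": {"name": y_name, "units": y_units}, "values": y_values}
        ],
    }


# Made-up numbers in three different delimiter styles (spaces, commas, tabs).
raw = """
# pT   yield   stat
0.5    12.0    0.4
1.0,   5.5,    0.2
1.5\t2.1\t0.1
"""

table = to_hepdata_table(parse_table(raw), "pT", "yield", x_units="GeV/c")
print(json.dumps(table, indent=2))
```

Splitting the work into a forgiving parser and a separate structure builder mirrors the point in the quote: the input layout can vary, while the output always has the same shape.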

Krobatch’s experience and innovation set the trajectory for his future.

“YAML is not a hugely popular format or language, like HTML—everybody knows what that is,” he said. “My experience working with YAML, which is not common among undergrads, allowed me to directly get a job. Now I’m a system administrator for high-performance computing at the University of Tennessee. I work for a group that does corporate computing on the high-performance computing cluster at Oak Ridge National Laboratory.”

Better than retail

In January 2022, with the data from about 55 PHENIX papers uploaded to HEPData, Nattrass decided to apply for a supplement to her research grant that would allow her to pay students to work on HEPData uploads.

“I figured I could get most of the remaining PHENIX data formatted for HEPData by paying undergraduate students,” she said. 

Given that a lot of students work to support their education anyway, she reasoned, why not give them an opportunity to work a flexible schedule, for decent pay, and build up marketable skills in the process?

When the grant was awarded, one of the students who took the job was Nikolas Nelson. He’d started out as a UT engineering student, switched to physics, then got shut out of interactions with other students by COVID. After spending his junior year abroad in Wales, he felt somewhat out of sync with the other physics majors. Working with Nattrass gave him a chance to reconnect.

“I was going to be looking for a job anyway, probably in some retail hell, and this paid about the same, so I said, ‘Why not?’ And it ended up working really well with my schedule,” Nelson said.

Overall, Nelson reformatted about 70% of the PHENIX papers that have been posted to HEPData. He also helped supervise other undergrads this past summer. He’s now enrolled in a medical physics master’s program at UT, putting all his physics and data knowhow to work.

He’s enjoyed the cooperative spirit of the project.

“There were so many different problems that each paper had, you’d go to whomever had come across that same thing before for help,” he said. “This job working with physics data gave me an opportunity to see that people in the physics community who I was in class with weren't stuck up like I thought they could be. They were just a bunch of geeks like me, and it was kind of fun.”

*RHIC’s PHENIX detector operated from 2000 to 2016. The collaboration continues to analyze data collected during that time and publish papers with ongoing impact on nuclear science.

Brookhaven National Laboratory is supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit science.energy.gov.
