Meet NSLS-II's Tom Caswell: Lead Developer of Matplotlib
November 21, 2018
Tom Caswell stands on a balcony overlooking the experimental floor of NSLS-II, where he works to streamline data acquisition.
Computational scientist Tom Caswell is helping to solve one of today’s biggest challenges in science: how to collect, manage, and analyze big data. Based at the U.S. Department of Energy’s (DOE) Brookhaven National Laboratory, Caswell’s task is to streamline data acquisition at the National Synchrotron Light Source II (NSLS-II)—a DOE Office of Science User Facility that is one of the most advanced light sources in the world.
NSLS-II is a giant x-ray microscope with room for 60 different experiments to take place at the same time. Each experimental station at NSLS-II, called a beamline, has world-class tools to image materials at the atomic scale. But, like all synchrotron light sources, NSLS-II collects different kinds of data at each beamline, making it difficult for scientists to compare data sets across beamlines.
Caswell, as a member of NSLS-II’s Data Acquisition, Management and Analysis (DAMA) group, helped solve this widespread problem by developing Bluesky, software that automates the data collection process at NSLS-II and produces uniform, easily manageable data.
“Historically, beamlines have largely been run independently with no standardization of file formats or metadata,” Caswell said. “We’ve built a uniform data collection and management system that runs across all of NSLS-II.”
Now, Bluesky is being adopted by light sources around the world.
The programming language behind Bluesky, Python, is also a hobby for Caswell. For six years, he’s been helping to develop Matplotlib, a popular data visualization tool, or “plotting library,” for Python. Matplotlib can be found behind many technological and scientific advancements, such as the recent Nobel Prize-winning LIGO work. Almost all of the library’s development is done on a volunteer basis by passionate people, including Caswell.
In 2012, when Caswell was in graduate school studying physics, he began answering online questions about Matplotlib—admittedly to procrastinate on his studies. Nevertheless, his work would eventually lead him to be named lead developer of Matplotlib, which he estimates is used by over a million people every year.
“Matplotlib has a tremendous impact on the science world,” Caswell said. “NSLS-II uses Matplotlib on the experimental floor as part of Bluesky’s data collection system, and I was using it significantly in my work in graduate school. That was a really good way for me to learn the library, because I had the opportunity to solve a lot of problems and learn how to track and report bugs.”
After Caswell submitted countless “pull requests” to Matplotlib’s team in graduate school, the team eventually asked him to join. Now, six years later, Caswell oversees about 200 volunteers who edit Matplotlib’s code each year.
“Matplotlib is a community project, and we run by a consensus,” he said. “If a technical consensus can’t be made, I’m the vote of last resort. But I try hard not to do that, because I believe consensus is better. Matplotlib works because there’s a huge community of volunteers who put in a lot of untracked and unnoticed labor.”
Caswell says he has shifted to playing a larger role in code review and big-picture projects at Matplotlib. While he never expected to become lead developer of the project, he’s happy that he got involved with Matplotlib early on and has been able to integrate the library into his work at NSLS-II.
“My job and my hobby are indistinguishable,” he said. “I still walk into NSLS-II and am amazed that this is the place where I work. I’ve always known I wanted to be a physicist, and I love having the opportunity to work with the scientists here and get involved in their projects.”
Caswell added, “We have a lever to change some of the ways that science is done by moving to more systematic, database-driven data management systems. I think the way we have been doing science is not going to scale past the next generation of detectors. Now that we have deployed Bluesky at the scale of a whole facility, and we’re getting buy in from other facilities, we’ll hopefully change the way people think about data.”
Caswell earned a Ph.D. in physics from the University of Chicago in 2014 and a Bachelor of Arts in physics and mathematics from Cornell University in 2007. After completing his graduate studies, Caswell started as a postdoc at NSLS-II before becoming one of the inaugural members of the DAMA group as an assistant computational scientist in 2015.
Brookhaven National Laboratory is supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.