Accelerating Scientific Discovery Through Code Optimization on Many-Core Processors

Brookhaven Lab hosted a hackathon for research and computational scientists, code developers, and computing hardware experts to optimize scientific application codes for high-performance computing

At the Brookhaven Lab-hosted Xeon Phi hackathon, (left to right) mentor Bei Wang, a high-performance-computing software engineer at Princeton University; mentor Hideki Saito, a principal engineer at Intel; and participant Han Aung, a graduate student in the Department of Physics at Yale University, optimize an application code that simulates the formation of structures in the universe. Aung and his fellow team members sought to increase the numerical resolution of their simulations so that they could more realistically model the astrophysical processes in galaxy clusters.

Supercomputers are enabling scientists to study problems they could not otherwise tackle—from understanding what happens when two black holes collide and figuring out how to make tiny carbon nanotubes that clean up oil spills to determining the binding sites of proteins associated with cancer. Such problems involve datasets that are too large or complex for human analysis.

The Intel Xeon Phi processor is patterned using a 14-nanometer (nm) lithography process. The 14 nm refers to the approximate size of the transistor features on the chip, only about seven times the roughly 2 nm width of a DNA double helix.

In 2016, Intel released the second generation of its many-integrated-core architecture targeting high-performance computing (HPC): the Intel Xeon Phi processor (formerly code-named “Knights Landing”). With up to 72 processing units, or cores, per chip, Xeon Phi is designed to carry out multiple calculations at the same time (in parallel). This architecture is ideal for handling the large, complex computations that are characteristic of scientific applications.

Other features that make Xeon Phi appealing for such applications include its fast memory access; its ability to simultaneously execute multiple processes, or threads, that follow the same instructions while sharing some computing resources (multithreading); and its support of efficient vectorization, a form of parallel programming in which the processor performs the same operation on multiple elements (vectors) of independent data in a single processing cycle. All of these features can greatly enhance performance, enabling scientists to solve problems more quickly and with greater efficiency than ever before.
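To make multithreading and vectorization concrete, here is a minimal sketch in C using OpenMP directives. It is an illustration only, not code from any of the hackathon teams; the function names `scaled_add` and `dot` are hypothetical. Both loops have independent iterations, so a compiler with OpenMP support may divide them among threads and run each thread's share with vector instructions. Without OpenMP support the directives are ignored and the loops run serially with the same results.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for every element.
 * Each iteration touches only its own element, so the work can be
 * shared among threads (multithreading) and each thread can process
 * several elements per instruction (vectorization).
 * `restrict` promises the compiler that x and y do not overlap. */
void scaled_add(size_t n, double a, const double *restrict x,
                double *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical example: sum of x[i] * y[i].
 * The reduction clause gives every thread and vector lane a private
 * partial sum, and the partial sums are combined when the loop ends. */
double dot(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` flag. Whether the compiler actually vectorizes a given loop depends on the compiler, the target processor, and the optimization level.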

Making the most out of Xeon Phi 

Cori is a supercomputer that is named after Gerty Cori, the first American woman to win a Nobel Prize in science. Credit: NERSC.

Currently, several supercomputers in the United States are based on Intel’s Xeon Phi processors, including Cori at the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy (DOE) Office of Science User Facility at Lawrence Berkeley National Laboratory; Theta at Argonne Leadership Computing Facility, another DOE Office of Science User Facility; and Stampede2 at the University of Texas at Austin’s Texas Advanced Computing Center. Smaller-scale systems, such as the computing cluster at DOE’s Brookhaven National Laboratory, also rely on this architecture. But in order to take full advantage of its capabilities, users need to adapt and optimize their applications accordingly.

To facilitate that process, Brookhaven Lab’s Computational Science Initiative (CSI) hosted a five-day coding marathon, or hackathon, in partnership with the High-Energy Physics (HEP) Center for Computational Excellence—which Brookhaven joined last July—and collaborators from the SOLLVE software development project funded by DOE’s Exascale Computing Project.

“The goal of this hands-on workshop was to help participants optimize their application codes to exploit the different levels of parallelism and memory hierarchies in the Xeon Phi architecture,” said CSI computational scientist Meifeng Lin, who co-organized the hackathon with CSI Director Kerstin Kleese van Dam, CSI Computer Science and Mathematics Department Head Barbara Chapman, and CSI computational scientist Martin Kong. “By the end of the hackathon, the participants had not only made their codes run more efficiently on Xeon Phi–based systems, but also learned about strategies that could be applied to other CPU [central processing unit]-based systems to improve code performance.”

Last year, Lin was part of the committee that organized Brookhaven’s first hackathon, at which teams learned how to program their scientific applications on computing devices called graphics processing units (GPUs). As was the case for that hackathon, this one was open to any current or potential user of the hardware. In the end, five teams of three to four members each—representing Brookhaven Lab, the Institute for Mathematical Sciences in India, McGill University, Stony Brook University, University of Miami, University of Washington, and Yale University—were accepted to participate in the Intel Xeon Phi hackathon.

Xinmin Tian, a senior principal engineer at Intel, gives a presentation on vector programming to help the teams optimize their scientific codes for the Xeon Phi processors.

Expanding the possibilities for scientific breakthroughs

From February 26 through March 2, nearly 20 users of Xeon Phi–based supercomputers came together at Brookhaven Lab to be mentored by computing experts from Brookhaven and Lawrence Berkeley national labs, Indiana University, Princeton University, University of Bielefeld in Germany, and University of California–Berkeley. The hackathon organizing committee selected the mentors based on their experience in Xeon Phi optimization and in shared-memory parallel programming with OpenMP (Open Multi-Processing), an industry standard.

Participants did not need to have prior Xeon Phi experience to attend. Several weeks prior to the hackathon, the teams were assigned to mentors with scientific backgrounds relevant to the respective application codes. The mentors and teams then held a series of meetings to discuss the limitations of the existing codes and the teams' goals for the hackathon. In addition to their specific mentors, the teams had access to four Intel technical experts with backgrounds in programming and scientific domains. These Intel experts served as floating mentors during the event to provide expertise in hardware architecture and performance optimization.

“The hackathon provided an excellent opportunity for application developers to talk and work with Intel experts directly,” said mentor Bei Wang, an HPC software engineer at Princeton University. “The result was a significant speed up in the time it takes to optimize code, thus helping application teams achieve their science goals at a faster pace. Events like this hackathon are of great value to both scientists and vendors.”

Brookhaven co-organizer Meifeng Lin, Intel mentor Hideki Saito, and Yale graduate student participant Urmila Chadayammuri share their perspectives on the hackathon.

The five codes that were optimized cover a wide variety of applications:

  • A code for tracking particle-device and particle-particle interactions that has the potential to be used as the design platform for future particle accelerators
  • A code for simulating the evolution of the quark-gluon plasma (a hot, dense state of matter thought to have been present for a few millionths of a second after the Big Bang) produced through high-energy collisions at Brookhaven’s Relativistic Heavy Ion Collider (RHIC)—a DOE Office of Science User Facility
  • An algorithm for sorting records from databases, such as DNA sequences used to identify inherited genetic variations and disorders
  • A code for simulating the formation of structures in the universe, particularly galaxy clusters
  • A code for simulating the interactions between quarks and gluons in real time 

“Large-scale numerical simulations are required to describe the matter created at the earliest times after the collision of two heavy ions,” said team member Mark Mace, a PhD candidate in the Nuclear Theory Group in the Physics and Astronomy Department at Stony Brook University and the Nuclear Theory Group in the Physics Department at Brookhaven Lab. “My team had a really successful week—we were able to make our code run much faster (20x), and this improvement is a game changer as far as the physics we can study with the resources we have. We will now be able to more accurately describe the matter created after heavy-ion collisions, study a larger array of macroscopic phenomena observed in such collisions, and make quantitative predictions for experiments at RHIC and the Large Hadron Collider in Europe.”

“With the new memory subsystem recently released by Intel, we can order a huge number of elements faster than with conventional memory because more data can be transferred at a time,” said team member Sergey Madaminov, who is pursuing his PhD in computer science in the Computer Architecture at Stony Brook (COMPAS) Lab at Stony Brook University. “However, this high-bandwidth memory is physically located close to the processor, limiting its capacity. To mitigate this limitation, we apply smart algorithms that split data into smaller chunks that can then fit into high-bandwidth memory and be sorted inside it. At the hackathon, our goal was to demonstrate our theoretical results—our algorithms speed up sorting—in practice. We ended up finding many weak places in our code and were able to fix them with the help of our mentor and experts from Intel, improving our initial code more than 40x. With this improvement, we expect to sort much larger datasets faster.”
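The general idea of sorting data in pieces that fit inside a small, fast memory can be sketched as follows. This is a simplified chunk-and-merge sort written for illustration, not the team's algorithm: the function name `chunked_sort` and the `chunk` parameter are hypothetical, and the `fast` buffer is ordinary heap memory that only stands in for high-bandwidth memory, which a real program would have to request through platform-specific means.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison for qsort that avoids integer overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge the sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst,
                       size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Hypothetical sketch: sort data[0..n) using a work buffer that holds
 * at most `chunk` elements (a stand-in for limited fast memory).
 * Returns 0 on success, -1 if chunk is 0 or an allocation fails. */
int chunked_sort(int *data, size_t n, size_t chunk)
{
    if (chunk == 0) return -1;
    if (n < 2) return 0;
    assert(data != NULL);

    size_t cap = chunk < n ? chunk : n;
    int *fast = malloc(cap * sizeof *fast); /* stand-in for fast memory */
    int *tmp  = malloc(n * sizeof *tmp);    /* scratch space for merging */
    if (fast == NULL || tmp == NULL) { free(fast); free(tmp); return -1; }

    /* Phase 1: copy each chunk into the small buffer, sort it there,
     * and copy the sorted run back. */
    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (n - start < chunk) ? n - start : chunk;
        memcpy(fast, data + start, len * sizeof *fast);
        qsort(fast, len, sizeof *fast, cmp_int);
        memcpy(data + start, fast, len * sizeof *fast);
    }

    /* Phase 2: merge neighboring sorted runs, doubling the run width
     * on each pass until a single sorted run remains. */
    int *src = data, *dst = tmp;
    for (size_t width = chunk; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        int *swap = src; src = dst; dst = swap;
    }
    if (src != data) memcpy(data, src, n * sizeof *data);

    free(fast);
    free(tmp);
    return 0;
}
```

In this sketch only the first phase uses the small buffer. A version tuned for real hardware would also need to keep the merge phase within fast memory, which is where much of the algorithmic work lies.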

One hackathon team worked on taking advantage of the high-bandwidth memory in Xeon Phi processors to optimize their code to more quickly sort datasets of increasing size. The team members applied smart algorithms that split the original data into "blocks" (equally sized chunks), which are moved into "buckets" (sets of elements) that can fit inside high-bandwidth memory for sorting, as shown in the illustration above.

According to Lin, the hackathon was highly successful: all five teams improved the performance of their codes, achieving speedups ranging from 2x to 40x.

“It is expected that Intel Xeon Phi–based computing resources will continue operating until the next-generation exascale computers come online,” said Lin. “It is important that users can make these systems work to their full potential for their specific applications.”

Brookhaven National Laboratory is supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time.
