Computation and Data-Driven Discovery (C3D) Projects

Replicating Machine Learning Experiments in Materials Sciences

To harness the benefits of new-generation machine learning (ML) algorithms in materials informatics and broaden the path of accelerated discovery, it is necessary to understand their limits and establish transparency in how results are obtained. Better ways to explain results obtained with ML will improve the reproducibility of results in materials science, boosting confidence in the ability of those methods to predict new candidates for experimentation. Transparency and reproducibility are important aspects of validating ML models, yet they are not fully understood, and they apply across the board, independently of the application domain. Understanding the limits of computational reproducibility for complex mathematical models, such as ML models, requires more than making scientific code, training data, and hyperparameters accessible. ML methods present specific reproducibility challenges related to building models, the effects of random seeds, and the choice of platforms and execution environments.

This experiment investigates the reproducibility of results previously published by one of the study’s co-authors (Supervised Machine-Learning-Based Determination of Three-Dimensional Structure of Metallic Nanoparticles, 2017. DOI: 10.1021/acs.jpclett.7b02364). The work examines reproducibility across ML platforms and the influence of random factors in two widely used types of regression models: 1) gradient-boosted trees (GBT), an efficient ML model that ensembles a set of decision trees, and 2) multilayer perceptron (MLP), a classic fully connected neural network. These two models showed the highest performance in the original results.

Figure 1: Schematic for applying machine learning to guide high-throughput experiments.

In materials science, ML methods are increasingly used to predict the relationship between atomic structures and materials properties and to guide experimentalists by suggesting potentially useful combinations. In the previously published results, ML models were used to predict coordination numbers (CNs), which are known to characterize the size and three-dimensional shape of nanoparticles. Training sets are built from computational data produced by ab initio methods. The trained model can then be applied to experimental spectra to determine the properties of experimental particles. In the experimental process, X-ray absorption near edge structure (XANES) spectra are measured, and CNs are calculated. The computational approach calculates XANES spectra and CNs from computational structures. After validation, the computational XANES spectra and CNs are used to train the model. Predicted CNs are compared to calculated CNs to validate the model. In the ML approach, the trained model can be used with the large volumes of experimental spectra streaming from high-throughput detectors to predict expected CNs during the course of an experiment (Figure 1).

For the CN prediction task, the influence of one random factor at a time is measured by fixing all hyperparameters and all other random factors (with appropriate random seeds) and letting only the factor under consideration vary. For MLP, two factors are investigated: the order in which training data are presented to stochastic gradient descent, which iteratively optimizes the loss function, and the weight initialization. For GBT, the influence of random feature selection and that of random data selection are studied. The models are trained five times for each case, and each run yields an accuracy number on the test data (Table 1 shows some of the results).
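The one-factor-at-a-time protocol described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the study's code: the data are synthetic, the tiny NumPy network stands in for the MLP, mean absolute error on held-out data stands in for the accuracy number, and the names `make_data`, `train_mlp`, `mean_abs_error`, and `vary_one_factor` are hypothetical. The point it shows is that weight initialization and data order each get their own random generator, so one can be freed while the other stays fixed.

```python
import numpy as np

def make_data(n=200, d=5, seed=0):
    # Synthetic regression data (an assumption; not XANES spectra).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w = rng.normal(size=d)
    y = np.tanh(X @ w) + 0.05 * rng.normal(size=n)
    return X[:150], y[:150], X[150:], y[150:]

def train_mlp(X, y, init_seed, order_seed,
              hidden=16, epochs=30, lr=0.05, batch=16):
    init_rng = np.random.default_rng(init_seed)    # weight initialization only
    order_rng = np.random.default_rng(order_seed)  # data order only
    W1 = init_rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = init_rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        idx = order_rng.permutation(len(X))        # reshuffle every epoch
        for start in range(0, len(X), batch):
            sl = idx[start:start + batch]
            xb, yb = X[sl], y[sl]
            h = np.tanh(xb @ W1 + b1)              # hidden activations
            err = (h @ W2 + b2) - yb               # prediction error
            # Gradients of 0.5 * mean squared error over the mini-batch.
            dh = np.outer(err, W2) * (1.0 - h ** 2)
            W2 -= lr * (h.T @ err) / len(sl)
            b2 -= lr * err.mean()
            W1 -= lr * (xb.T @ dh) / len(sl)
            b1 -= lr * dh.mean(axis=0)
    return W1, b1, W2, b2

def mean_abs_error(params, X, y):
    W1, b1, W2, b2 = params
    pred = np.tanh(X @ W1 + b1) @ W2 + b2
    return float(np.mean(np.abs(pred - y)))

def vary_one_factor(factor, n_runs=5):
    # factor is "init" or "order"; the other seed stays fixed at 0.
    Xtr, ytr, Xte, yte = make_data()
    scores = []
    for run in range(n_runs):
        init_seed = run if factor == "init" else 0
        order_seed = run if factor == "order" else 0
        params = train_mlp(Xtr, ytr, init_seed, order_seed)
        scores.append(mean_abs_error(params, Xte, yte))
    return scores

if __name__ == "__main__":
    print("vary init :", vary_one_factor("init"))
    print("vary order:", vary_one_factor("order"))
```

Many ML libraries expose a single seed that drives several random factors at once, which is why this sketch keeps two explicit generators instead of relying on one global seed.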

Table 1. Results of replicating the Coordination Number prediction task with various random factors.

Coefficient of Variation (CV), also known as Relative Standard Deviation, and Mean Absolute Difference (MAD) are the metrics used to measure the dispersion of the accuracy numbers. As these metrics show, GBT appears more robust than MLP, i.e., under a given amount of randomness, the accuracy of GBT is more consistent than that of MLP. Yet in the literature, the results reported are usually those showing the best performance, regardless of their robustness.
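As a rough illustration of these two dispersion metrics, the sketch below computes both over a list of per-run accuracy numbers. Two choices here are assumptions, since the text does not specify them: CV uses the sample standard deviation (`ddof=1`), and MAD is taken as the average absolute difference over all distinct pairs of runs. The function names and the example values are hypothetical.

```python
from itertools import combinations

import numpy as np

def coefficient_of_variation(values):
    # Sample standard deviation divided by the mean (ddof=1 is an assumption).
    v = np.asarray(values, dtype=float)
    return float(np.std(v, ddof=1) / np.mean(v))

def mean_absolute_difference(values):
    # Average of |a - b| over all distinct pairs of runs (assumed definition).
    v = np.asarray(values, dtype=float)
    diffs = [abs(a - b) for a, b in combinations(v, 2)]
    return float(np.mean(diffs))

if __name__ == "__main__":
    # Made-up accuracy numbers from five hypothetical runs, not study results.
    runs = [0.91, 0.89, 0.93, 0.90, 0.92]
    print("CV :", coefficient_of_variation(runs))
    print("MAD:", mean_absolute_difference(runs))
```

Both metrics fall to zero when every run gives the same accuracy, so lower values indicate a model that is more robust to the random factor being varied.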

Typically, training an ML model multiple times, even with the same data set, does not produce the same model: each run yields different training and testing errors. A second class of errors, which arises in transfer learning or domain adaptation, tends to draw more attention from ML researchers. This experiment shows that the first class of errors, the run-to-run variation, should not be ignored by practitioners and scientists interested in the practical application of these models to their domain science.


Pouchard, L., Y. Lin, and H. van Dam, “Replicating Machine Learning Experiments in Materials Sciences,” ParCo Symposium 2019, DOI: 10.3233/APC200105.