2024 AIChE Annual Meeting

(569ay) Investigating the Fitting Errors of Machine Learning Potentials on the Open Catalyst 2020 (OC20) Dataset

Authors

Kolluru, A., Carnegie Mellon University
Cheula, R., Aarhus University
Kitchin, J., Carnegie Mellon University
Machine learning potentials (MLPs) have accelerated the atomistic simulations used for materials discovery. The Open Catalyst 2020 (OC20) dataset is one of the largest datasets for training MLPs for heterogeneous catalysis. The mean absolute errors (MAEs) of the MLPs on the energy target of the dataset were found to be imbalanced between the different material classes, with non-metals having the highest errors. In this work, we investigate the Density Functional Theory (DFT) settings and the adsorption energy referencing scheme used in the dataset as possible sources of this error imbalance. First, we investigate the impact of tighter convergence of the DFT calculations in the dataset with respect to k-point sampling, plane-wave energy cutoff, and smearing width. Significant DFT convergence errors, with a mean absolute value of ~0.15 eV, were found on the total energies of non-metals, higher than for the other material classes. Interestingly, due to the cancellation of errors, we find that the convergence errors on the adsorption energies are less than 0.05 eV across all material classes. Second, we show that calculations with surface reconstruction can introduce inconsistencies into the adsorption energy referencing scheme that cannot be captured by the MLPs. Non-metals and halides were found to have the highest fraction of calculations with surface reconstruction. Removing calculations with surface reconstruction from the validation sets lowers the MAEs by ~35% and reduces the imbalance of the MAEs between the different material classes. Although adsorption energy referencing reduces the convergence errors through the cancellation of errors, it introduces a data inconsistency issue for systems with surface reconstruction.
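The cancellation of errors described above can be illustrated with a minimal sketch. It assumes the common referencing form in which the adsorption energy is the total energy of the slab with adsorbate, minus the clean-slab energy, minus a gas-phase reference energy; the abstract does not spell out the exact scheme. The function name, the dictionary keys, and all energy values are hypothetical and chosen only for illustration; they are not taken from OC20.

```python
def adsorption_energy(e_adslab, e_slab, e_gas_ref):
    """Adsorption energy (eV) referenced to the clean slab and a gas-phase reference.

    Assumed form: E_ads = E(slab + adsorbate) - E(slab) - E(gas reference).
    """
    return e_adslab - e_slab - e_gas_ref


# Hypothetical total energies in eV (made-up numbers, not OC20 data).
# "loose" stands for the original DFT settings, "tight" for tighter convergence.
loose = {"adslab": -310.40, "slab": -300.10, "gas": -9.50}
tight = {"adslab": -310.55, "slab": -300.26, "gas": -9.50}

# Convergence error on a total energy: difference between the two settings.
total_energy_error = abs(loose["adslab"] - tight["adslab"])

# Convergence error on the adsorption energy: the slab and adslab errors
# are similar in size and sign, so most of the error cancels in the difference.
ads_loose = adsorption_energy(loose["adslab"], loose["slab"], loose["gas"])
ads_tight = adsorption_energy(tight["adslab"], tight["slab"], tight["gas"])
ads_energy_error = abs(ads_loose - ads_tight)

print(f"total-energy convergence error:      {total_energy_error:.2f} eV")
print(f"adsorption-energy convergence error: {ads_energy_error:.2f} eV")
```

In this toy case the total energies of the slab and of the slab with adsorbate shift by nearly the same amount under tighter settings, so the shift largely drops out of the adsorption energy. The same subtraction is what becomes inconsistent when the surface reconstructs: the relaxed slab under the adsorbate no longer corresponds to the clean-slab reference, so the two terms stop describing the same surface.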