2022 Annual Meeting

tmQM++ Dataset and Machine Learning Model Benchmarks

Much work has been done on computational materials characterization and property prediction in order to accelerate the materials discovery process. Instead of testing every material in a lab, methods like density functional theory (DFT) can be used to determine a material's electronic structure properties, particularly its energy, which can then be correlated with catalytic activity through concepts such as the Sabatier principle. However, DFT calculations are computationally expensive and can therefore be too slow for high-throughput screening. Machine learning (ML) approaches attempt to circumvent this limitation by training models that predict energies directly from a structure.

To train an ML model that generalizes well, a large and diverse dataset is necessary. Most of the large datasets developed for catalysis have focused on heterogeneous catalysis, even though many reactions also rely on homogeneous catalysts. To provide the necessary training data, the tmQM++ dataset is presented here, which provides DFT energies, computed with the ωB97M-V functional and def2-SVPD basis set, for the structures in the transition metal quantum mechanics (tmQM) dataset. This functional and basis set were chosen to better describe the transition metal and metal-ligand interactions. In addition to recomputing the tmQM energies, approximately 200 structures with implicit hydrogens were removed from tmQM when generating tmQM++.
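The implicit-hydrogen filter can be sketched as a consistency check between a structure's stored chemical formula and its explicit atom list; a geometry with fewer explicit H atoms than its formula implies is flagged and dropped. This is a minimal illustrative sketch, not tmQM's actual schema or the exact procedure used; the record fields and helper names below are hypothetical.

```python
import re
from collections import Counter

def formula_counts(formula):
    """Parse a simple formula like 'C19H21ClFeN2' into element counts."""
    counts = Counter()
    for sym, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[sym] += int(num) if num else 1
    return counts

def has_implicit_hydrogens(symbols, formula):
    """True if the geometry carries fewer explicit H atoms than the formula."""
    explicit_h = sum(1 for s in symbols if s == "H")
    return explicit_h < formula_counts(formula)["H"]

# Toy records (illustrative, not the tmQM file format):
structures = [
    {"symbols": ["C", "H", "H", "H", "H"], "formula": "CH4"},  # all H explicit
    {"symbols": ["C", "H"], "formula": "CH4"},                 # implicit H -> drop
]
kept = [s for s in structures
        if not has_implicit_hydrogens(s["symbols"], s["formula"])]
```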

The tmQM++ dataset was used to train several ML models commonly used in heterogeneous catalysis. To prepare the data for training, the energies were first converted to analogues of formation energy via a reference correction strategy. The models trained thus far include SchNet, PaiNN, SpinConv, and GemNet-T. All models demonstrated normally distributed residuals, indicative of unbiased predictions. Various preprocessing strategies were also explored, including a delta learning strategy using xTB energies and training only on charge-neutral structures. The best MAE obtained so far is 0.17 eV, achieved using a neutral-only subset of tmQM++ with targets consisting of the difference between DFT and xTB energies. This MAE is competitive with those of ML models trained on similarly diverse datasets, such as the Open Catalyst 2020 (OC20) dataset, where the best-performing models reach MAEs of roughly 0.3 eV.
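The reference-correction step described above can be sketched as a least-squares fit: one reference energy per element is fit so that each total energy is approximated by a composition-weighted sum, and the residual serves as the formation-energy-like training target. This is a hedged illustration of one common way to implement such a correction, not necessarily the exact procedure used for tmQM++; the data below are toy values.

```python
import numpy as np

def fit_reference_energies(compositions, energies, elements):
    """Least-squares fit of per-element reference energies mu so that
    E_total ~= sum_i n_i * mu_i for every structure."""
    X = np.array([[comp.get(el, 0) for el in elements] for comp in compositions],
                 dtype=float)
    y = np.asarray(energies, dtype=float)
    mu, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(elements, mu))

def reference_corrected(comp, energy, mu):
    """Formation-energy analogue: total energy minus elemental references."""
    return energy - sum(n * mu[el] for el, n in comp.items())

# Toy compositions with exactly linear energies (H: -0.5, C: -10, Fe: -100):
elements = ["H", "C", "Fe"]
comps = [{"H": 4, "C": 1}, {"H": 2, "C": 2}, {"Fe": 1, "C": 5}]
E = [4 * (-0.5) + 1 * (-10.0),
     2 * (-0.5) + 2 * (-10.0),
     1 * (-100.0) + 5 * (-10.0)]
mu = fit_reference_energies(comps, E, elements)
targets = [reference_corrected(c, e, mu) for c, e in zip(comps, E)]
```

For the delta learning variant, the same pipeline would apply with each target replaced by the difference between the DFT and xTB energies of the structure before any further correction.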