2022 Annual Meeting
tmQM++ Dataset and Machine Learning Model Benchmarks
To train an ML model that generalizes well, a large and diverse dataset is necessary. Most of the large datasets developed for catalysis have focused on heterogeneous catalysis, even though many reactions also rely on homogeneous catalysts. To provide the necessary training data, the tmQM++ dataset is presented here, which provides the DFT energies, computed using the ωB97M-V functional and def2-SVPD basis set, of the structures in the transition metal quantum mechanics (tmQM) dataset. This functional and basis set were chosen to better describe the transition metal centers and metal-ligand interactions. In addition to recomputing the energies in tmQM, approximately 200 structures with implicit hydrogens were removed from tmQM when generating tmQM++.
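For illustration, a single-point energy at this level of theory can be sketched with PySCF; this is a minimal sketch assuming PySCF's libxc-backed functional name "wb97m-v" and its "def2-svpd" basis, with a placeholder water geometry (the abstract does not state which DFT code was used for tmQM++).

```python
# Hedged sketch of a wB97M-V/def2-SVPD single-point energy with PySCF.
# The actual code and settings used to generate tmQM++ are not specified
# in the abstract.
from pyscf import gto, dft

# Placeholder geometry: a real tmQM++ entry would be a transition metal
# complex read from the dataset's XYZ files, with its own charge/spin.
mol = gto.M(
    atom="""
    O  0.0000  0.0000  0.0000
    H  0.0000  0.7572  0.5865
    H  0.0000 -0.7572  0.5865
    """,
    basis="def2-svpd",
    charge=0,
    spin=0,  # number of unpaired electrons
)

mf = dft.RKS(mol)
mf.xc = "wb97m-v"  # range-separated meta-GGA; recent PySCF builds enable
# the VV10 nonlocal term automatically for -V functionals; older versions
# may require setting it explicitly, e.g. mf.nlc = "vv10".
e_tot = mf.kernel()  # total energy in Hartree
print(f"E(wB97M-V/def2-SVPD) = {e_tot:.6f} Ha")
```

In practice each tmQM entry is a closed- or open-shell transition metal complex, so the charge and spin multiplicity would be set per structure rather than hard-coded as above.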
The tmQM++ dataset was used to train several ML models commonly applied in heterogeneous catalysis. To prepare the data for training, the total energies were first converted to analogues of formation energy via a reference correction strategy. The models trained thus far include SchNet, PaiNN, SpinConv, and GemNet-T. All models exhibited normally distributed residuals, indicative of unbiased predictions. Several preprocessing strategies were also explored, including a delta-learning strategy based on xTB energies and training only on charge-neutral structures. The best MAE obtained so far is 0.17 eV, achieved with the neutral-only subset of tmQM++ and targets consisting of the difference between the DFT and xTB energies. This MAE is competitive with those of ML models trained on similarly diverse datasets, such as the Open Catalyst 2020 (OC20) dataset, where the best-performing models reach MAEs of approximately 0.3 eV.
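One common realization of such a reference correction, shown below as a hedged sketch rather than the exact procedure used for tmQM++, fits per-element reference energies to the dataset by least squares and subtracts them from each total energy; the same script also forms the delta-learning targets (DFT minus xTB) described above. All compositions and energies are placeholder values.

```python
# Hedged sketch of a per-element reference correction and delta-learning
# targets; the exact scheme used for tmQM++ is not detailed in the abstract.
import numpy as np

# Rows: structures; columns: element counts (e.g., [n_H, n_C, n_Fe]).
compositions = np.array([
    [12,  6, 1],
    [18,  8, 1],
    [ 8,  4, 1],
    [20, 10, 1],
    [14,  7, 1],
], dtype=float)

e_dft = np.array([-1520.4, -1890.7, -1210.3, -2105.9, -1704.6])  # placeholder (eV)
e_xtb = np.array([-1498.1, -1862.5, -1193.8, -2071.4, -1677.9])  # placeholder (eV)

def reference_corrected(e_total, counts):
    """Fit per-element reference energies by least squares and subtract
    them, yielding a formation-energy-like target for each structure."""
    refs, *_ = np.linalg.lstsq(counts, e_total, rcond=None)
    return e_total - counts @ refs

# Plain targets: reference-corrected DFT totals.
y_plain = reference_corrected(e_dft, compositions)

# Delta-learning targets per the abstract: the DFT - xTB residual. A
# reference correction can also be applied to this residual if desired.
y_delta = e_dft - e_xtb

print("plain targets:", y_plain)
print("delta targets:", y_delta)
```

The appeal of the delta-learning variant is that the cheap xTB calculation carries most of the energy's magnitude, so the network only needs to learn the smaller DFT-minus-xTB residual; at inference the prediction is the xTB energy plus the model's output.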