2025 AIChE Annual Meeting

(641c) Assigning Accurate Ligand Charges in the Creation of the Openlig Dataset

Authors

Roland St. Michel - Presenter, Massachusetts Institute of Technology
Ilia Kevlishvili, Massachusetts Institute of Technology
Sukrit Mukhopadhyay, The Dow Chemical Company
Computational discovery of new transition metal complexes (TMCs) proceeds predominantly by combining different ligands with one another. The Cambridge Structural Database (CSD) houses hundreds of thousands of transition metal complexes with tens of thousands of unique ligands, and combinatorics turns these into large chemical spaces for computational exploration. However, each resulting complex needs a known charge before first-principles calculations can be run. As such, accurate charge assignment of mined ligands is vital for determining the final charge of any theoretical TMC. We build on previously introduced work that implemented an iterative approach to balance the ligand charges of complexes already in the CSD, bypassing the need for expensive calculations. We improve upon this iterative mining procedure by introducing an algorithm that incorporates the R-factor, a measure of the agreement between the crystallographic model and the experimental diffraction data, together with weighting schemes based on the charges of previously mined ligands. This leads to accurate ligand charge assignments in low-data regimes. We also incorporate charge assignment based on fulfillment of the octet rule to provide a second charge assignment scheme for comparison. We obtain many of the metal oxidation states directly from the user-reported naming available in the CSD and, where no oxidation state was reported, we supplement them by predicting a subset using cell2mol. Using these overall complex oxidation states, we obtain ligand charges. We present a dataset of 60,000 unique ligands with assigned charges. We split the data into a high-confidence set and a low-confidence set based on whether multiple charge assignment schemes agree. These ligands are partially optimized and then characterized with density functional theory.
We also associate these ligands with their known application in synthetic chemistry using a previously developed natural language processing workflow.
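
The iterative charge-balancing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each mined complex carries a metal oxidation state, a net charge, ligand counts, and an R-factor, and that charge balance reads net charge = oxidation state + sum of ligand charges. The function name, the dictionary keys, and the inverse-R-factor vote weight are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def assign_ligand_charges(complexes, known=None, max_rounds=10):
    """Hypothetical sketch of iterative, R-factor-weighted charge balancing.

    complexes: list of dicts with keys (all assumed for illustration)
      'ox_state'   metal oxidation state (int)
      'net_charge' net charge of the complex (int)
      'ligands'    dict mapping ligand id -> count in the complex
      'r_factor'   crystallographic R-factor in percent (float)
    """
    charges = dict(known or {})
    for _ in range(max_rounds):
        votes = defaultdict(lambda: defaultdict(float))
        for c in complexes:
            unknown = [lig for lig in c["ligands"] if lig not in charges]
            if len(unknown) != 1:
                continue  # only solvable when exactly one ligand type is unassigned
            lig = unknown[0]
            n = c["ligands"][lig]
            known_sum = sum(charges[l] * k for l, k in c["ligands"].items() if l != lig)
            residual = c["net_charge"] - c["ox_state"] - known_sum
            if residual % n != 0:
                continue  # skip complexes that would imply a fractional charge
            # Assumed weighting: lower R-factor (better fit) gives a stronger vote.
            votes[lig][residual // n] += 1.0 / max(c["r_factor"], 1e-6)
        if not votes:
            break  # nothing new could be assigned in this round
        for lig, tally in votes.items():
            charges[lig] = max(tally, key=tally.get)
    return charges

# Toy data (invented for illustration, not taken from the CSD).
toy = [
    {"ox_state": 2, "net_charge": 0, "ligands": {"Cl": 2}, "r_factor": 3.0},
    {"ox_state": 3, "net_charge": 0, "ligands": {"Cl": 3}, "r_factor": 5.0},
    {"ox_state": 2, "net_charge": 0, "ligands": {"Cl": 1}, "r_factor": 12.0},
    {"ox_state": 2, "net_charge": 1, "ligands": {"Cl": 1, "NH3": 4}, "r_factor": 4.0},
]
print(assign_ligand_charges(toy))  # {'Cl': -1, 'NH3': 0}
```

In this toy run the third entry, which has a poor R-factor, votes for a chloride charge of -2 but is outweighed by the two better-refined structures. The ammine charge can only be resolved in the second round, once chloride has been assigned.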