2025 AIChE Annual Meeting

(448e) Generalized Molecular Property Imputation Using a Flexible Transformer Architecture

Authors

Ericka Miller, University of Notre Dame
Brett Savoie, Purdue University
Chemical data is fundamentally sparse, as molecular structures can serve as database keys for countless properties. Ideally, it would be possible to convert between databases that report different properties for each molecule, to fill in missing properties based on those that are available, or to fuse databases with partially overlapping properties. However, classical data imputation strategies based on primitive interpolation or structural regressors fail at these tasks. Even when predicting a single property, traditional methods typically neglect additional property information that is already known, and they become unwise or inapplicable when the molecular structure is incomplete. Here, we present a more general paradigm of chemical property imputation that uses all available information in imputation, fusion, and conversion tasks. A robust transformer model architecture was developed for these generalized imputation tasks. We examine these capabilities in multiple trials using a dataset of approximately 16M organic molecules and 23 properties. Finally, we proffer an imputation protocol that uses the same architecture to impute a sparse dataset using only the data contained therein. The suitability of this protocol for general imputation is demonstrated by two case studies in which sparse data is imputed with average R² values of 0.79 and 0.85. These advances should herald more general models and strengthen our collective understanding of the relationships between molecules and their properties.
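
The abstract does not state how its R² values were computed. As a rough illustration only, one common way to score imputation on a sparse property table is to hide some known entries, impute them, and compare the imputed values with the hidden ones. The sketch below assumes that setup. All function names are hypothetical, and `mean_impute` is a trivial column-mean placeholder, not the transformer model described above.

```python
import random
from statistics import fmean


def hold_out(table, frac, rng):
    """Hide a random fraction of the observed entries.

    `table` is a list of rows (one per molecule); missing values are None.
    Returns a masked copy and the (row, col) positions that were hidden.
    """
    masked = [row[:] for row in table]
    hidden = []
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is not None and rng.random() < frac:
                masked[i][j] = None
                hidden.append((i, j))
    return masked, hidden


def mean_impute(table):
    """Placeholder imputer (an assumption, not the authors' model):
    fill each gap with the mean of the observed values in its column.
    Raises statistics.StatisticsError if a column has no observed values."""
    n_cols = len(table[0])
    means = [fmean(row[j] for row in table if row[j] is not None)
             for j in range(n_cols)]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in table]


def r2(truth, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = fmean(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot
```

In use, one would call `hold_out` on the sparse table, pass the masked copy to the imputer, and evaluate `r2` per property over the hidden positions only. A learned model would take the place of `mean_impute`; the column-mean baseline simply provides a floor against which such a model can be compared.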