2023 AIChE Annual Meeting

(451h) Informing Graph Convolutional Networks with Molecular Mechanics: A Case Study of Sigma Profile Prediction

Authors

Maginn, E., University of Notre Dame
Sigma profiles, a type of molecular descriptor obtained through quantum chemistry calculations, have performed very well as features for machine learning models. However, because computing a sigma profile involves DFT optimizations, often for reasonably large molecules, sigma profiles are expensive and time-consuming to obtain. Furthermore, there are no free, open-source tools readily available to calculate them directly. As such, predictive methodologies able to estimate sigma profiles without quantum-chemistry-based calculations would greatly enhance the applicability of these molecular descriptors in large-scale machine learning models.

Graph convolutional networks (GCNs) are a class of machine learning models whose architecture is similar to that of convolutional neural networks but which operate on graphs. A graph is a mathematical construct of nodes and edges and can be represented by an adjacency matrix, a table indicating the connectivity between nodes. There is an obvious resemblance between the mathematical concept of a graph and molecular structures, with atoms and bonds being represented by nodes and edges, respectively. Thus, because molecules are easily translated to graphs without the need to compute quantum-chemistry-based descriptors, GCNs are an ideal candidate to accelerate the development and collection of vast sigma profile datasets.
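The graph representation and a single convolution step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the abstract does not specify which GCN variant was used, so the symmetric normalization below follows the common Kipf-Welling formulation, and the molecule (the heavy atoms of ethanol), the weight values, and all function names are hypothetical choices made for this example.

```python
import math

# Heavy-atom graph of ethanol (CH3-CH2-OH): node 0 = C, node 1 = C, node 2 = O.
atomic_numbers = [6, 6, 8]
bonds = [(0, 1), (1, 2)]


def adjacency_matrix(n_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph."""
    adj = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    return adj


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 (assumed Kipf-Welling-style normalization)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during propagation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalized_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weights)]


A = adjacency_matrix(len(atomic_numbers), bonds)
H0 = [[float(z)] for z in atomic_numbers]  # one scalar feature per node
W = [[1.0, -1.0]]                          # arbitrary 1x2 weights; learned in practice
H1 = gcn_layer(A, H0, W)                   # 3 nodes x 2 hidden features
```

Each application of `gcn_layer` mixes a node's features only with those of its direct neighbors, so information from atoms k bonds away reaches a node only after k such layers.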

A well-known issue of GCNs pertains to the accurate description of the global and local environment of atoms in molecules, which often requires many convolution iterations, leading to over-smoothing and poor model performance. For example, the chemical natures of aromatic and aliphatic carbons are markedly distinct, but several iterations would be necessary for a GCN to learn a proper graph embedding of a ring with six nodes (e.g., a benzene ring). Fortunately, this issue has been addressed at length by the molecular mechanics community: developing a classical force field requires distinguishing the polarity and behavior of atoms with different neighboring environments. Thus, force fields typically assign class types (i.e., numerical values) to distinguish between atoms with the same atomic number but different neighboring moieties. This type of atom classification, developed over decades of trial-and-error methodologies and heuristics, may prove to be an excellent node-level featurization for GCNs and thus merits investigation.
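To make the contrast concrete, the sketch below one-hot encodes the heavy atoms of toluene in two ways: by atomic number and by force field atom type. It is an illustration only. The GAFF labels (`ca` for aromatic carbon, `c3` for sp3 carbon) are assigned by hand here rather than by an atom-typing tool, the helper `one_hot_features` is a hypothetical name, and in practice the label vocabulary would presumably be built over the whole dataset rather than a single molecule.

```python
# Heavy atoms of toluene: six ring carbons followed by the methyl carbon.
atomic_numbers = [6, 6, 6, 6, 6, 6, 6]

# GAFF atom types, hand-assigned for illustration (an assumption of this sketch):
# "ca" = aromatic carbon, "c3" = sp3 (aliphatic) carbon.
gaff_types = ["ca", "ca", "ca", "ca", "ca", "ca", "c3"]


def one_hot_features(labels):
    """Map each node label to a one-hot vector over the sorted set of labels."""
    vocab = sorted(set(labels))
    index = {label: k for k, label in enumerate(vocab)}
    features = [[1.0 if index[label] == k else 0.0 for k in range(len(vocab))]
                for label in labels]
    return vocab, features


z_vocab, z_features = one_hot_features(atomic_numbers)
gaff_vocab, gaff_features = one_hot_features(gaff_types)

# With atomic numbers, a ring carbon and the methyl carbon start out identical;
# with atom types, they are already distinct before any convolution is applied.
same_with_z = z_features[0] == z_features[6]
same_with_gaff = gaff_features[0] == gaff_features[6]
```

With atom types as input, the network no longer has to infer aromaticity from several rounds of neighbor aggregation, which is the motivation given above for borrowing the force field classification.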

Motivated by the need to expedite the development of large sigma profile datasets, the objective of this work is to ascertain the usefulness of force field atom types as node-level features in GCN models. To do so, graph representations were constructed for the compounds present in a freely available sigma profile database, with three types of node-level features being explored: simple atomic numbers, General Amber Force Field (GAFF) atom types, and Merck Molecular Force Field (MMFF) atom types. GAFF and MMFF were chosen because both force fields can describe a broad range of molecules and because tools are available that perform their atom typing automatically. Combining these node-level features with GCNs led to models with excellent sigma profile predictive capability (coefficients of determination larger than 0.95), with both force fields displaying similar performance. Finally, to demonstrate their practical applicability, the sigma profiles predicted using this methodology were used as the input to machine learning models previously developed in the literature to estimate the boiling temperature and aqueous solubility of organic compounds, yielding performance that rivals that attained using sigma profiles obtained through quantum chemistry calculations.
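For reference, the coefficient of determination quoted above can be computed as in the following sketch. The abstract does not state how the metric was aggregated (per molecule or over all points of all profiles), so this simply applies the standard definition to one pair of vectors. The numbers are toy values invented for the example and are not results from this work.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy "reference" and "predicted" profiles (made-up values, for illustration only).
reference = [0.0, 1.0, 4.0, 2.0, 0.5]
predicted = [0.1, 1.2, 3.7, 2.1, 0.4]

score = r_squared(reference, predicted)
```

A value of 1.0 corresponds to a perfect match, and values near zero indicate predictions no better than the mean of the reference profile.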