2023 AIChE Annual Meeting
(451h) Informing Graph Convolutional Networks with Molecular Mechanics: A Case Study of Sigma Profile Prediction
Graph convolutional networks (GCNs) are a class of machine learning models whose architecture resembles that of convolutional neural networks but which operate on graphs. A graph is a mathematical construct of nodes and edges and is commonly represented by an adjacency matrix, a table indicating which pairs of nodes are connected. The resemblance between graphs and molecular structures is direct, with atoms and bonds represented by nodes and edges, respectively. Because molecules are easily translated to graphs without the need to compute quantum-chemistry-based descriptors, GCNs are well-suited candidates to accelerate the development and collection of vast sigma profile datasets.
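To make the graph representation concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It builds the adjacency matrix of the heavy-atom graph of ethanol and applies one symmetrically normalized propagation step, D^-1/2 (A + I) D^-1/2 X, which is one common GCN formulation and is assumed here for illustration only. The molecule, the function names, and the omission of learned weights and activation functions are all choices made for this example.

```python
import math

# Heavy-atom graph of ethanol (C-C-O); hydrogens are omitted for brevity.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]


def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix: entry [i][j] is 1 if atoms i and j are bonded."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def gcn_propagate(adj, features):
    """One step of D^-1/2 (A + I) D^-1/2 X (no learned weights, no activation)."""
    n = len(adj)
    # Add self-loops so each atom keeps its own features during aggregation.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    n_feat = len(features[0])
    out = [[0.0] * n_feat for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if a_hat[i][j]:
                weight = 1.0 / math.sqrt(deg[i] * deg[j])
                for k in range(n_feat):
                    out[i][k] += weight * features[j][k]
    return out


A = adjacency_matrix(len(atoms), bonds)
X = [[6.0], [6.0], [8.0]]  # atomic numbers as one-dimensional node features
H = gcn_propagate(A, X)
print(A)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

After one step, each row of `H` mixes an atom's own feature with those of its bonded neighbors; stacking such steps is what lets information travel farther across the molecule.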
A well-known issue of GCNs is the difficulty of accurately describing both the local and global environments of atoms in molecules. Doing so often requires many convolution iterations, which leads to over-smoothing and poor model performance. For example, the chemical natures of aromatic and aliphatic carbons are markedly distinct, but several iterations would be necessary for a GCN to learn a proper graph embedding of a ring with six nodes (e.g., a benzene ring). Fortunately, this problem has been addressed at length by the molecular mechanics community: developing a classical force field requires distinguishing the polarity and behavior of atoms with different neighboring environments. Force fields therefore typically assign class types (i.e., numerical values) to distinguish between atoms with the same atomic number but different neighboring moieties. This type of atom classification, developed through decades of trial and error and heuristics, may prove to be an excellent node-level featurization for GCNs and thus merits investigation.
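The point about rings can be illustrated with a small, hedged sketch. Assuming that each convolution iteration only aggregates information from directly bonded neighbors, the code below counts how many atoms of a six-membered ring can influence a given atom after k iterations. The ring is an idealized graph with no substituents, and the helper names are hypothetical.

```python
# Six-membered ring (e.g., the carbon skeleton of benzene):
# node i is bonded to nodes i-1 and i+1, modulo the ring size.
n_ring = 6
neighbors = {i: {(i - 1) % n_ring, (i + 1) % n_ring} for i in range(n_ring)}


def receptive_field(neighbors, start, n_iterations):
    """Nodes whose features can reach `start` after n_iterations of neighbor aggregation."""
    seen = {start}
    frontier = {start}
    for _ in range(n_iterations):
        # Expand by one bond, keeping only nodes not reached before.
        frontier = {nbr for node in frontier for nbr in neighbors[node]} - seen
        seen |= frontier
    return seen


def iterations_to_cover(neighbors, start):
    """Smallest iteration count after which `start` has seen every node.

    Assumes the graph is connected; otherwise this loop would not terminate.
    """
    k = 0
    while len(receptive_field(neighbors, start, k)) < len(neighbors):
        k += 1
    return k


sizes = [len(receptive_field(neighbors, 0, k)) for k in range(4)]
print(sizes)                              # [1, 3, 5, 6]
print(iterations_to_cover(neighbors, 0))  # 3
```

Under these assumptions, an atom needs three iterations before it has seen the whole ring, whereas a force field atom type encodes ring membership and aromaticity in the input feature itself.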
Motivated by the need to expedite the development of large sigma profile datasets, the objective of this work is to ascertain the usefulness of force field atom types as node-level features in GCN models. To do so, graph representations were constructed for the compounds present in a freely available sigma profile database, and three types of node-level features were explored: simple atomic numbers, general Amber force field (GAFF) atom types, and Merck molecular force field (MMFF) atom types. GAFF and MMFF were chosen because of the breadth of molecules that both force fields can describe and the availability of tools that perform their atom typing automatically. Combining these node-level features with GCNs led to models with excellent sigma profile predictive capability (coefficients of determination larger than 0.95), with both force fields displaying similar performance. Finally, to demonstrate their practical applicability, the sigma profiles predicted using this methodology were used as input to machine learning models previously developed in the literature to estimate the boiling temperature and aqueous solubility of organic compounds, yielding performance that rivals that attained using sigma profiles obtained through quantum chemistry calculations.
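As a rough illustration of the three featurization schemes, the sketch below one-hot encodes the heavy atoms of toluene under each scheme. The atom labels were written by hand for this example ("ca" and "c3" for GAFF aromatic and sp3 carbons, 37 and 1 for the corresponding MMFF numeric types) and should be checked against an automatic atom-typing tool. The encoding, the per-molecule vocabulary, and all names are assumptions, not the procedure used in this work; a real dataset would build one vocabulary across all compounds.

```python
def one_hot(labels, vocabulary):
    """Return one row per atom, with a 1 in the column of that atom's label."""
    index = {label: col for col, label in enumerate(vocabulary)}
    rows = []
    for label in labels:
        row = [0] * len(vocabulary)
        row[index[label]] = 1
        rows.append(row)
    return rows


# Heavy atoms of toluene: six ring carbons followed by the methyl carbon.
# Labels are hand-assigned for illustration (an assumption, not tool output).
schemes = {
    "atomic_number": [6] * 7,
    "gaff": ["ca"] * 6 + ["c3"],
    "mmff": [37] * 6 + [1],
}

feature_matrices = {}
for name, labels in schemes.items():
    vocabulary = sorted(set(labels))  # per-molecule vocabulary, for brevity
    feature_matrices[name] = one_hot(labels, vocabulary)

# Number of distinct node feature vectors each scheme produces.
distinct = {name: len({tuple(row) for row in matrix})
            for name, matrix in feature_matrices.items()}
print(distinct)  # {'atomic_number': 1, 'gaff': 2, 'mmff': 2}
```

With atomic numbers alone, all seven carbons start from the same feature vector, while either set of force field atom types separates the ring carbons from the methyl carbon before any convolution is applied.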