2025 AIChE Annual Meeting

(121e) Graph Neural Network–Enhanced pKa Prediction Using Polarizable Molecular Simulations

Authors

Ziyu Song - Presenter, Institute of Process Engineering, Chinese Academy of Sciences
Zuyi Huang, Villanova University
Precise prediction of residue-level pKa values in proteins is critical to understanding enzymatic activity, protein–ligand binding, antibody–antigen interactions, and the behavior of membrane proteins, all of which are essential to drug design, disease mechanism investigation, and personalized medicine. For example, accurate pKa models can inform rational drug targeting of ion channels, metalloproteins, and pH-dependent binding pockets. Molecular-modeling-based pKa prediction has shown better accuracy than empirical models such as PROPKA, but most such approaches characterize the electrostatic interactions between protein residues by assigning a predefined partial charge to each atom [1,2]. This approach relies on conventional additive (nonpolarizable) force fields (FFs) such as AMBER [3], GROMOS [4], and CHARMM [5]. For general purposes, this method has proved effective and reliable. However, it does not account for induced electrostatics, that is, the charge redistribution caused by neighboring atoms, the aqueous surroundings, or, most crucially, protein folding. Consequently, it may lack the accuracy required to describe the atomic charge polarization induced by many-body interactions, and it is therefore not an ideal approach for capturing electrostatics-dominated properties [6].
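
The contrast between the two descriptions can be written out as a brief sketch. In an additive force field the electrostatic energy is a pairwise sum over fixed partial charges, whereas a polarizable model such as AMOEBA also gives each site an induced dipole that responds to the field of its surroundings. The notation below is assumed here for illustration and is not taken from the abstract: $\alpha_i$ is the atomic polarizability, $\mathbf{E}_i^{\mathrm{perm}}$ is the field from permanent multipoles, and $\mathbf{T}_{ij}$ is the dipole interaction tensor. Short-range damping is omitted.

```latex
\begin{align}
U_{\mathrm{elec}}^{\mathrm{additive}} &= \sum_{i<j} \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \\
\boldsymbol{\mu}_i^{\mathrm{ind}} &= \alpha_i \left( \mathbf{E}_i^{\mathrm{perm}} + \sum_{j \neq i} \mathbf{T}_{ij}\, \boldsymbol{\mu}_j^{\mathrm{ind}} \right) \\
U_{\mathrm{pol}} &= -\tfrac{1}{2} \sum_i \boldsymbol{\mu}_i^{\mathrm{ind}} \cdot \mathbf{E}_i^{\mathrm{perm}}
\end{align}
```

Because each induced dipole depends on all the others, the second equation must be solved self-consistently. This is the many-body effect that fixed partial charges cannot represent.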

To address this issue, this study presents a new application of molecular simulation to pKa prediction that uses the polarizable AMOEBA force field to improve on traditional empirical models such as PROPKA. All proteins containing residues with experimentally determined pKa values were downloaded from the Protein Data Bank (PDB). These files were then cleaned via a multi-step preparation protocol. First, PDBFixer (via OpenMM7) was used to add missing atoms and residues and to convert non-standard residues to canonical amino acids. Next, ChimeraX was used to strip extraneous water molecules while preserving essential metal ions (e.g., Cu, Fe). For problematic structures (e.g., 1I0E), Swiss-PdbViewer was employed to correct anomalies. The cleaned structures were then converted into .xyz files compatible with the Tinker molecular modeling package using the pdbxyz tool with the AMOEBABio18 polarizable force field. Proteins were solvated in pre-equilibrated or custom-generated cubic boxes of explicit water molecules. The systems were neutralized, and energy minimization was performed using Tinker. To isolate each residue of interest together with its local environment, a sphere centered at the residue's Cα was cropped from the whole protein. Five radii from 7 Å to 11 Å were tested to optimize the representation of the residue's local environment and to better capture the nonbonded repulsion-dispersion (vdW) effects, which the AMOEBA force field describes with Halgren's buffered 14-7 function.
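
The cropping step and the vdW term can be illustrated with a minimal Python sketch. It assumes atoms are stored as plain (name, x, y, z) tuples in Å and that the five radii are spaced 1 Å apart, neither of which the abstract states. The coordinates are hypothetical toy values, and the function names are illustrative rather than part of Tinker. The constants 0.07 and 0.12 are the standard buffering parameters of Halgren's 14-7 form.

```python
import math


def buffered_14_7(r, r0, eps, delta=0.07, gamma=0.12):
    """Halgren's buffered 14-7 potential for one atom pair.

    r   : interatomic distance
    r0  : distance at the energy minimum
    eps : well depth (returned energy has the same units)
    """
    rho = r / r0
    repulsion_buffer = ((1.0 + delta) / (rho + delta)) ** 7
    attraction_term = (1.0 + gamma) / (rho ** 7 + gamma) - 2.0
    return eps * repulsion_buffer * attraction_term


def crop_sphere(atoms, center, radius):
    """Keep atoms whose distance from `center` is at most `radius`."""
    return [atom for atom in atoms if math.dist(atom[1:], center) <= radius]


# Hypothetical toy coordinates (not from any real structure).
atoms = [
    ("CA", 0.0, 0.0, 0.0),
    ("CB", 1.5, 0.0, 0.0),
    ("O", 7.5, 0.0, 0.0),
    ("N", 0.0, 10.5, 0.0),
]
center = atoms[0][1:]  # the residue's C-alpha position

# Assumed radii: 7 to 11 angstroms in 1 angstrom steps.
counts = {
    radius: len(crop_sphere(atoms, center, radius))
    for radius in (7.0, 8.0, 9.0, 10.0, 11.0)
}
```

Scanning the radius in this way shows how many neighboring atoms each cutoff retains. Because the 14-7 energy decays smoothly with distance rather than vanishing at a fixed point, the choice of cutoff affects how much of the dispersion contribution is kept.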

The protein features extracted from this molecular modeling workflow were used as inputs for three graph-based deep learning models: Graph Convolutional Network (GCN), Graph Isomorphism Network (GIN), and Graph Attention Network (GAT). These models were selected, constructed, and tested for pKa prediction because molecular data are inherently graph-structured. Each protein residue was represented as a graph, with atoms as nodes and chemical bonds as edges. The same set of 15 node feature descriptors derived from the molecular simulations was used across all models, including induced dipole moment vectors, normalized atomic coordinates, hydrogen-bond counts, solvent accessibility, and local heavy-atom density. Adjacency matrices encoded bond connectivity and enabled the aggregation of information from neighboring atoms. Compared with the traditional PROPKA method, the graph neural network models, particularly GAT, demonstrated significantly improved accuracy in pKa prediction by capturing detailed spatial and electrostatic features that empirical approaches often overlook.
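
How an adjacency matrix enables aggregation from bonded neighbors can be shown with a small, dependency-free Python sketch of one GCN-style propagation step, using the common symmetric normalization with self-loops. This is an illustration under stated assumptions, not the authors' implementation. The three-atom graph, the two features per atom (instead of 15), and the identity weight matrix are all hypothetical, and a real model would likely be built with a deep learning library.

```python
import math


def normalized_adjacency(n_atoms, bonds):
    """Return D^(-1/2) (A + I) D^(-1/2) for an undirected bond graph."""
    adj = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        adj[i][i] = 1.0  # self-loop so each atom keeps its own features
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1.0  # chemical bond as an undirected edge
    degree = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n_atoms)]
        for i in range(n_atoms)
    ]


def gcn_layer(a_hat, features, weights):
    """One propagation step: ReLU(a_hat @ features @ weights)."""
    n, d_in, d_out = len(features), len(weights), len(weights[0])
    # Aggregate features from each atom and its bonded neighbors.
    aggregated = [
        [sum(a_hat[i][k] * features[k][f] for k in range(n)) for f in range(d_in)]
        for i in range(n)
    ]
    # Linear transform followed by ReLU.
    return [
        [
            max(0.0, sum(aggregated[i][f] * weights[f][o] for f in range(d_in)))
            for o in range(d_out)
        ]
        for i in range(n)
    ]


# Hypothetical backbone fragment N-CA-C with two toy features per atom.
bonds = [(0, 1), (1, 2)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only

a_hat = normalized_adjacency(3, bonds)
hidden = gcn_layer(a_hat, features, weights)
```

A GAT layer follows the same pattern but replaces the fixed normalization coefficients with learned attention weights over each atom's neighbors.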

References

  1. Cai, Z., Luo, F., Wang, Y., Li, E. & Huang, Y. Protein pKa Prediction with Machine Learning. ACS Omega 6, 34823–34831 (2021).
  2. Stepniewska-Dziubinska, M. M., Zielenkiewicz, P. & Siedlecki, P. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics 34, 3666–3674 (2018).
  3. Cornell, W. D. et al. A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. 117, 5179–5197 (1995).
  4. Wang, D. et al. Validation of the GROMOS 54A7 Force Field Regarding Mixed α/β-Peptide Molecules.
  5. MacKerell, A. D. et al. All-Atom Empirical Potential for Molecular Modeling and Dynamics Studies of Proteins. J. Phys. Chem. B 102, 3586–3616 (1998).
  6. Stone, A. The Theory of Intermolecular Forces. (Oxford University Press, 2013). doi:10.1093/acprof:oso/9780199672394.001.0001.