2025 AIChE Annual Meeting

(394aj) Manifold Learning for Mechanism Discovery in Automated Protein Folding

Authors

Begum Cicekdag, Columbia University
Sachin Kadyan, Columbia University
Gonzalo Gutierrez, Columbia University
Venkat Venkatasubramanian, Columbia University

AlphaFold, for all its success in predicting physics-abiding protein structures, largely remains a black-box model [1]. To begin reconciling human-derived knowledge of protein folding mechanisms with the model's behavior, we adopt the viewpoint that amino-acid sequences in large protein datasets are akin to “chunks” of text in a natural-language corpus. We propose a “Zipf’s Law”-style analysis [2] of protein sequence corpora, measuring the relative frequencies of amino-acid sub-sequences of various lengths across the tree of life. To probe whether AlphaFold “memorizes” the spatial arrangements of common sub-sequences, we test whether a sub-sequence’s frequency rank correlates with the confidence score that AlphaFold’s confidence module assigns to its spatial arrangement within various proteins. Further, we investigate how the latent representations of specific amino-acid sub-sequences (generated during model inference on full proteins) provide instructions for AlphaFold’s structure-generating module: we employ manifold learning techniques to identify clusters of sub-sequence latent representations across many different proteins, and determine whether representations in the same cluster condition similar motions of atom groups in the 3D space of the structure-generating module.
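
The rank-frequency analysis described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, sub-sequences are assumed to be overlapping windows of fixed length k, and Spearman rank correlation is an assumed choice, since the abstract does not name a correlation measure. The confidence values are placeholders for whatever per-sub-sequence score is extracted from AlphaFold's confidence module.

```python
from collections import Counter
import math


def kmer_counts(sequences, k):
    """Count every overlapping length-k sub-sequence across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, kmer, count) tuples; rank 1 is the most frequent k-mer.

    Ties in count are broken alphabetically so the ordering is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, kmer, n) for rank, (kmer, n) in enumerate(ordered, start=1)]


def zipf_exponent(ranked):
    """Estimate alpha in count ~ rank**(-alpha) by least squares in log-log space."""
    xs = [math.log(rank) for rank, _, _ in ranked]
    ys = [math.log(n) for _, _, n in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Toy data only: two made-up sequences and made-up confidence scores.
    toy_sequences = ["ACACAC", "ACGT"]
    ranked = rank_frequency(kmer_counts(toy_sequences, k=2))
    toy_confidence = {"AC": 0.9, "CA": 0.8, "CG": 0.6, "GT": 0.5}
    ranks = [rank for rank, _, _ in ranked]
    scores = [toy_confidence[kmer] for _, kmer, _ in ranked]
    print(ranked)
    print(spearman(ranks, scores))
```

Under these assumptions, a strongly negative correlation between frequency rank and confidence would mean that more common sub-sequences tend to receive higher confidence. On a real corpus the exponent fit would likely need binning or a restricted rank range, because the low-frequency tail dominates an unweighted log-log regression.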

[1] Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024). https://doi.org/10.1038/s41586-024-07487-w

[2] Piantadosi, S. T. Zipf's word frequency law in natural language: a critical review and future directions. Psychon. Bull. Rev. 21, 1112–1130 (2014). https://doi.org/10.3758/s13423-014-0585-6