AlphaFold, for all its success in predicting physics-abiding protein structures, largely remains a black-box model [1]. To begin reconciling human-derived knowledge of protein folding mechanisms with model behavior, we adopt the viewpoint that amino-acid sequences in large protein datasets are akin to “chunks” of text in a natural language corpus. We propose a Zipf's-law-style analysis [2] of proteinaceous corpora, measuring the relative frequencies of specific amino-acid sub-sequences of various lengths across the tree of life. To probe whether AlphaFold “memorizes” the spatial arrangements of common sub-sequences, we test whether a sub-sequence's frequency rank correlates with the confidence score that AlphaFold's confidence module assigns to its spatial arrangement within various proteins. Further, we investigate how latent representations of specific amino-acid sub-sequences, generated during model inference on full proteins, provide instructions to AlphaFold's structure-generating module; we employ manifold learning techniques to identify clusters of sub-sequence latent representations drawn from many different proteins, and determine whether representations in the same cluster condition similar motions of atom groups in the 3D space of the structural module.
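As a concrete illustration of the first two steps, the sketch below (not the authors' code) counts length-k amino-acid sub-sequences across a protein corpus, ranks them Zipf-style by frequency, and computes a rank correlation between a sub-sequence's frequency rank and the mean per-residue confidence (pLDDT) AlphaFold reports over that sub-sequence in one predicted structure. The k-mer length, file names, and the source of the pLDDT values are illustrative assumptions; Biopython and SciPy are assumed dependencies.

```python
# Minimal sketch of the proposed Zipf-style rank-frequency analysis and its
# correlation with AlphaFold confidence. All inputs are hypothetical stand-ins.

from collections import Counter
from statistics import mean

from Bio import SeqIO              # Biopython, for FASTA parsing (assumed dependency)
from scipy.stats import spearmanr  # SciPy, for rank correlation (assumed dependency)

K = 5  # assumed sub-sequence ("chunk") length; the analysis varies this

def kmer_counts(fasta_path: str, k: int = K) -> Counter:
    """Count all length-k amino-acid sub-sequences in a FASTA corpus."""
    counts: Counter = Counter()
    for record in SeqIO.parse(fasta_path, "fasta"):
        seq = str(record.seq)
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts

def zipf_ranks(counts: Counter) -> dict[str, int]:
    """Assign Zipf-style ranks: 1 = most frequent sub-sequence."""
    return {kmer: rank
            for rank, (kmer, _) in enumerate(counts.most_common(), start=1)}

def rank_vs_confidence(seq: str, plddt: list[float],
                       ranks: dict[str, int], k: int = K):
    """Spearman correlation between a sub-sequence's corpus frequency rank and
    the mean pLDDT of its residues in one prediction (plddt: one value per residue)."""
    xs, ys = [], []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in ranks:
            xs.append(ranks[kmer])
            ys.append(mean(plddt[i:i + k]))
    return spearmanr(xs, ys)

# Hypothetical usage: "corpus.fasta" and the per-residue pLDDT list from an
# AlphaFold prediction are placeholders, not artifacts shipped with this work.
# ranks = zipf_ranks(kmer_counts("corpus.fasta"))
# rho, pval = rank_vs_confidence(query_sequence, query_plddt, ranks)
```

A negative correlation under this setup would be consistent with (though not proof of) the memorization hypothesis: more frequent sub-sequences receiving higher-confidence spatial arrangements.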
[1] Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024). https://doi.org/10.1038/s41586-024-07487-w
[2] Piantadosi, S. T. Zipf's word frequency law in natural language: a critical review and future directions. Psychon Bull Rev 21, 1112–1130 (2014). https://doi.org/10.3758/s13423-014-0585-6