Although recent years have seen rapid improvements in the computational design of binding proteins, with tools like RFdiffusion and AlphaProteo significantly outperforming prior approaches, predicting which computational models of protein complexes will bind and which will fail when tested experimentally remains challenging. In previous work, we identified Expected Persistent Pairwise Interaction (EPPI) features: features of protein complexes that are expected to persist over time and contribute meaningfully to binding. Incorporating those features significantly improved an initial model’s ability to distinguish real antibody complexes from computational decoys.
In this work, we expand on that prior study to identify the features and characteristics that are most important for distinguishing real antibody complexes from decoys. We began by curating a non-redundant database of antibody-protein complexes. Experimental structures of every antibody-protein complex were downloaded from the international ImMunoGeneTics information system (IMGT®) 3DStructure Database, and the unique variable domains and antigens in each file were identified. Final structures were selected for inclusion in the non-redundant database on the basis of three criteria. First, every antibody that bound a unique antigen was included. Second, an antibody that bound the same antigen as another antibody was included if at least one of its complementarity-determining regions (CDRs) differed in length from the corresponding CDR of the other antibody. Finally, if neither of the first two criteria was met, the two antibodies were required to differ from one another by at least five amino acid mutations in their CDRs. After the database was curated, decoy complexes of the antibodies with their native antigens and with other antigens in the database were generated using existing complex prediction tools, including HADDOCK, ZDOCK, and AlphaFold Multimer. This process yielded a large database of real and decoy antibody-protein complexes for analysis.
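The three inclusion criteria can be read as a pairwise distinctness test applied while building the database. The following is a minimal sketch of that reading, not the curation pipeline itself: the `Complex` record, the six CDR names, the greedy first-come ordering, and the position-wise mutation count are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed CDR naming; the abstract does not specify a numbering scheme.
CDR_NAMES = ("H1", "H2", "H3", "L1", "L2", "L3")


@dataclass
class Complex:
    """Hypothetical record for one antibody-antigen structure."""
    structure_id: str
    antigen_id: str
    cdrs: Dict[str, str]  # CDR name -> amino acid sequence


def cdr_lengths_differ(a: Complex, b: Complex) -> bool:
    """Criterion 2: at least one CDR differs in length."""
    return any(len(a.cdrs[n]) != len(b.cdrs[n]) for n in CDR_NAMES)


def cdr_mutation_count(a: Complex, b: Complex) -> int:
    """Criterion 3: position-wise differences across all CDRs.

    Only meaningful when every CDR pair has equal length, which is
    guaranteed here because criterion 2 is checked first.
    """
    return sum(
        x != y
        for n in CDR_NAMES
        for x, y in zip(a.cdrs[n], b.cdrs[n])
    )


def is_distinct(a: Complex, b: Complex, min_mutations: int = 5) -> bool:
    """Apply the three criteria, in order, to a pair of complexes."""
    if a.antigen_id != b.antigen_id:       # criterion 1: different antigen
        return True
    if cdr_lengths_differ(a, b):           # criterion 2: CDR length differs
        return True
    return cdr_mutation_count(a, b) >= min_mutations  # criterion 3


def select_non_redundant(
    complexes: List[Complex], min_mutations: int = 5
) -> List[Complex]:
    """Greedy selection (an assumption): keep a complex only if it is
    distinct from every complex already kept."""
    selected: List[Complex] = []
    for c in complexes:
        if all(is_distinct(c, kept, min_mutations) for kept in selected):
            selected.append(c)
    return selected
```

Because the selection is greedy, the result depends on input order; a real pipeline might instead rank structures (for example by resolution) before filtering.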
We then developed and implemented a bespoke classifier to analyze the complexes. This classifier closely resembles a random forest classifier, with additional algorithmic steps for reevaluating prior decisions and ensuring redundancy in the selection criteria. The EPPI features were calculated for all complexes in the curated database and used to train the classifier. The output of this analysis is a set of clusters of real antibody complexes, each sharing a list of features, such that every decoy complex is excluded by two or more of those features. This talk will discuss the curation of the database of complexes, the details of the classifier algorithm, and the feature spaces of the largest clusters, which reveal details of the critical mechanisms by which antibodies bind to proteins.
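The stated output condition (real complexes in a cluster share every listed feature, and each decoy fails at least two of them) can be expressed as a simple check. The sketch below illustrates only that condition, not the classifier; it assumes, hypothetically, that each EPPI feature has been reduced to a present/absent flag per complex, and the function and parameter names are invented for illustration.

```python
from typing import Sequence


def cluster_is_valid(
    real_rows: Sequence[Sequence[bool]],
    decoy_rows: Sequence[Sequence[bool]],
    feature_idx: Sequence[int],
    min_exclusions: int = 2,
) -> bool:
    """Check one cluster against the output condition.

    real_rows:   feature flags for the real complexes in the cluster
    decoy_rows:  feature flags for all decoy complexes
    feature_idx: column indices of the cluster's shared feature list
    """
    # Every real complex in the cluster must have every listed feature.
    reals_share_all = all(
        all(row[i] for i in feature_idx) for row in real_rows
    )
    # Every decoy must lack at least `min_exclusions` listed features,
    # so no single feature is the sole basis for rejecting a decoy.
    decoys_excluded = all(
        sum(not row[i] for i in feature_idx) >= min_exclusions
        for row in decoy_rows
    )
    return reals_share_all and decoys_excluded
```

Requiring two failed features per decoy is one way to read the redundancy in selection criteria: removing any single feature from the list still leaves every decoy excluded.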