2025 AIChE Annual Meeting

A Novel Library Design for Exploring Epistatic Effects in Proteins

Proteins are sequence-defined biopolymers that serve as the primary effector molecules of most
biological functions and are of great interest as catalysts and materials. Engineering proteins is
complicated by the vast sequence space created by combinatorial replacement of the
twenty different amino acid monomers in any given sequence. For example, a 100 amino
acid protein has 1900 possible single substitutions (19 alternatives at each of 100
positions), but roughly 1.79 x 10^6 possible double
substitutions. There are many established strategies for exploring the massive fitness
landscapes of proteins, such as high-throughput selections (e.g., connecting protein activity to
cell survival) capable of interrogating large mutant libraries, or systematic approaches like
Deep Mutational Scanning (DMS), which fully samples all possible single substitutions by
constraining library scale.
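
The scaling argument above can be checked with a short calculation. This is a minimal sketch and not part of the original study; the function name `count_variants` and its `alternatives` parameter are illustrative. Counting 19 alternative residues per position gives 1900 single and about 1.79 x 10^6 double substitutions, while counting all 20 residues at each position gives 2000 and 1.98 x 10^6.

```python
from math import comb


def count_variants(length: int, k: int, alternatives: int = 19) -> int:
    """Count sequences that differ from the parent at exactly k positions.

    length       -- number of residues in the protein
    k            -- number of positions changed simultaneously
    alternatives -- residues allowed at each changed position
                    (19 excludes the parent residue; 20 includes it)
    """
    # Choose which k positions change, then pick a residue for each one.
    return comb(length, k) * alternatives ** k


singles = count_variants(100, 1)  # 100 * 19 = 1900
doubles = count_variants(100, 2)  # 4950 * 361 = 1786950

print(singles, doubles)
```

Under either counting convention, the number of double substitutions is roughly three orders of magnitude larger than the number of singles, which is why exhaustive sampling is usually restricted to single substitutions.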


Previously, we performed a DMS-based engineering study on the small ultra-red fluorescent
protein (smURFP), a near-infrared fluorescent protein of interest for in vivo imaging, and identified a set of
single codon mutations which improved the observed fluorescence in living cells. Identifying
beneficial single codon mutations by high-throughput screening was straightforward; however,
the process of testing these single mutations in combination by systematically cloning and
assaying unique sequences was laborious and time-consuming, and it did not capture all possible
combinations. In this work, we demonstrate a novel strategy for more comprehensively
exploring the combinatorial mutation space of smURFP by using a constrained library
comprising specific mutations selected based on enrichment scores from high-throughput
sequencing data of the DMS library.
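
The abstract does not define how enrichment scores were computed. One commonly used form is the log2 ratio of a variant's read frequency after selection to its frequency before selection. The sketch below assumes that definition; the function names, the pseudocount value, the variant labels, and the read counts are all hypothetical placeholders rather than data from the study.

```python
import math


def enrichment_scores(pre_counts, post_counts, pseudocount=0.5):
    """Return log2(post-selection frequency / pre-selection frequency) per variant.

    pre_counts, post_counts -- dicts mapping variant label to read count
    pseudocount             -- small value added to each count so that variants
                               absent after selection still get a finite score
    """
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    scores = {}
    for variant, pre in pre_counts.items():
        post = post_counts.get(variant, 0)
        pre_freq = (pre + pseudocount) / pre_total
        post_freq = (post + pseudocount) / post_total
        scores[variant] = math.log2(post_freq / pre_freq)
    return scores


def top_variants(scores, n):
    """Return the n variant labels with the highest enrichment scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical read counts before and after sorting (not real data).
pre = {"mut_a": 100, "mut_b": 100, "mut_c": 100}
post = {"mut_a": 300, "mut_b": 100, "mut_c": 0}

scores = enrichment_scores(pre, post)
selected = top_variants(scores, 2)
print(selected)
```

A positive score indicates that a variant became more frequent after selection, and a negative score indicates depletion. Ranking by score and keeping the top entries is one simple way to pick a constrained mutation set.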


Using enrichment data from the smURFP DMS library before and after eliminating non-functional
population members using fluorescence-activated cell sorting (FACS), we selected a
set of highly enriched single codon mutations to prioritize in a combinatorial study. Ten
fragments of the smURFP gene encoding these mutations and appropriate cloning scars were
generated by in vitro synthesis and assembled using Type IIS restriction-ligation
(Golden Gate assembly). We have established that this method can assemble complete and varied
gene sequences encoding our constrained set of mutations in various combinations. We
screened the population to generate a dataset connecting protein sequence to phenotype
(fluorescence activity in vivo), and will use this dataset to train a machine learning model to
predict the epistatic effects of different mutation combinations and to identify and test optimized
candidate sequences.
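
To make the notion of an epistatic effect concrete, a common textbook definition of pairwise epistasis is the deviation of a double mutant's phenotype from the additive expectation built from the two single mutants. The sketch below assumes phenotypes are expressed on an additive scale (for example, log fluorescence); it is not the model described in this work, and the function name and all numeric values are hypothetical.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Return the deviation of a double mutant from the additive expectation.

    f_wt -- phenotype of the parent sequence
    f_a  -- phenotype of single mutant A
    f_b  -- phenotype of single mutant B
    f_ab -- phenotype of the double mutant carrying A and B

    All values are assumed to be on an additive scale such as log fluorescence.
    """
    # If the two mutations act independently, their effects relative to the
    # parent simply add.
    expected = f_a + f_b - f_wt
    return f_ab - expected


# Hypothetical log-scale phenotypes (not measured values).
epsilon = pairwise_epistasis(f_wt=0.0, f_a=0.5, f_b=0.25, f_ab=1.0)
print(epsilon)  # 0.25
```

A positive value means the combination performs better than the sum of its parts, a negative value means it performs worse, and zero means the mutations behave additively. A model trained on sequence-to-phenotype data would need to capture such non-additive terms in order to rank untested combinations.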