Synthetic cells have been proposed to recapitulate one or more cellular functions in a compartmentalized volume, both for a fundamental understanding of the origin of life and for technological applications such as engineering synthetic cells that operate in harsh environments or carry and deliver drug payloads to targets. Lipid-based vesicles are widely employed, but their further development and wider application in synthetic cells are limited by their instability under mechanical forces and by the harsh chemical conditions required for their modification or functionalization. Biopolymers such as polypeptides serve as another class of potential building blocks. Of particular interest, elastin-like polypeptides (ELPs), composed of repetitive (VPGXG) sequences, are structurally robust candidates for synthetic cells or organelles and can form vesicles through self- or templated assembly. X is a guest residue that can be any amino acid except proline. The vesicular structures formed by amphiphilic ELP diblock polymers have been shown to be capable of encapsulating different cargos and carrying out cell-free expression. However, a systematic search for diblock ELP polymers requires extensive characterization of millions of sequences, which is prohibitive in an experimental setting. Data-driven methods and computational modeling instead offer opportunities to search the chemical space, to accelerate the discovery of ELP sequences capable of forming ultra-stable vesicles, and to infer design rules from large-scale data linking sequences to physicochemical properties.
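To make the size of this sequence space concrete, the following minimal sketch builds diblock ELP sequences from (VPGXG) repeats and counts the number of guest-residue combinations. The function names, the hydrophilic-first block order, and the per-repeat guest string are illustrative assumptions and are not taken from the library used in this work.

```python
# Illustrative sketch only: names, block order, and block lengths are assumptions.

# The 19 allowed guest residues: the 20 standard amino acids minus proline (P).
GUEST_RESIDUES = "ACDEFGHIKLMNQRSTVWY"


def elp_block(guests: str) -> str:
    """Build one ELP block with one VPGXG repeat per character in `guests`."""
    if not guests:
        raise ValueError("a block needs at least one guest residue")
    for x in guests:
        if x not in GUEST_RESIDUES:
            raise ValueError(f"invalid guest residue: {x!r}")
    return "".join(f"VPG{x}G" for x in guests)


def elp_diblock(hydrophilic_guests: str, hydrophobic_guests: str) -> str:
    """Concatenate a hydrophilic and a hydrophobic block (assumed order)."""
    return elp_block(hydrophilic_guests) + elp_block(hydrophobic_guests)


def design_space_size(n_guest_positions: int) -> int:
    """Number of sequences when each guest position varies independently."""
    return len(GUEST_RESIDUES) ** n_guest_positions


if __name__ == "__main__":
    print(elp_diblock("HS", "FF"))
    print(design_space_size(5))
```

Under these assumptions, even five independently varied guest positions already give more than two million sequences, which illustrates why exhaustive experimental characterization is impractical.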
Herein, we present a high-throughput screening pipeline that integrates coarse-grained (CG) simulations, alchemical free energy calculations, Gaussian process regression (GPR), and Bayesian optimization (BO) to identify optimal candidates from a library of putative amphiphilic diblock ELPs with a propensity to form stable and mechanically robust vesicles (Figure 1). The active learning-guided screen efficiently filters high-performing ELP sequences from a large design space for the experimental construction of synthetic cells. Moreover, the predictive capability of the model enables large-scale analysis of amino acid residue preferences in the hydrophilic and hydrophobic blocks of amphiphilic ELPs, which helps derive design principles to further guide experimental efforts. From the defined chemical space, we identified novel ELP sequences whose stability is several-fold higher than that of ELPs reported to be stable in experiments. We also reveal that top performers frequently adopt histidine as the guest residue in the hydrophilic block, indicating a potential role of hydrogen bonding and π-π stacking in shaping the stability of the ELP vesicles. The proposed ELPs are candidates for experimental testing of ultra-stable vesicle formation. The high-throughput virtual screening (HTVS) pipeline is broadly transferable, and we make it freely available as an open-source tool to accelerate the design and optimization of bioactive peptides or peptide-based biomaterials with desired structural or functional properties.
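The active learning loop that couples GPR and BO can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: each candidate is assumed to be represented by a numeric feature vector, the `oracle` callable is a hypothetical stand-in for the CG simulation and alchemical free energy calculation, and the squared-exponential kernel and expected-improvement acquisition function are common default choices that the text above does not specify.

```python
import math
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(X_train, y_train, X_query, noise=1e-6, length_scale=1.0):
    """GP posterior mean and std (unit prior variance, targets mean-centered)."""
    y_mean = y_train.mean()
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_query, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = K_s.T @ alpha + y_mean
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement over `best` when maximizing the score."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return np.maximum((mean - best) * cdf + std * pdf, 0.0)


def active_learning_screen(features, oracle, n_init=5, n_rounds=15, seed=0):
    """Score a few random candidates, then repeatedly score the max-EI candidate."""
    rng = np.random.default_rng(seed)
    n = len(features)
    initial = [int(i) for i in rng.choice(n, size=n_init, replace=False)]
    scores = {i: oracle(i) for i in initial}  # hypothetical expensive evaluation
    for _ in range(n_rounds):
        pool = np.array([i for i in range(n) if i not in scores])
        if pool.size == 0:
            break
        idx = np.array(list(scores))
        y = np.array([scores[int(i)] for i in idx])
        mean, std = gp_posterior(features[idx], y, features[pool])
        pick = int(pool[np.argmax(expected_improvement(mean, std, y.max()))])
        scores[pick] = oracle(pick)
    return scores


if __name__ == "__main__":
    # Toy library: random 2-D features and a synthetic stability score.
    rng = np.random.default_rng(1)
    features = rng.uniform(-1.0, 1.0, size=(200, 2))
    true_score = -np.sum((features - 0.3) ** 2, axis=1)
    scores = active_learning_screen(features, lambda i: float(true_score[i]))
    best = max(scores, key=scores.get)
    print(best, scores[best])
```

In this sketch only `n_init + n_rounds` candidates are ever passed to the expensive evaluation, while the surrogate model ranks the rest of the library; this is the sense in which an active learning-guided screen can filter a large design space at modest simulation cost.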