2015 Synthetic Biology: Engineering, Evolution & Design (SEED)

Learning the Sequence Determinants of Exon Definition from Millions of Random Synthetic Sequences


Many genetic variants in the coding regions of human genes cause disease through altered RNA splicing. Measuring the splicing effects of all exonic variants is infeasible, and training predictive models is challenging because few variants have experimental data. Here we develop a novel approach that allows us to accurately predict the effects of these variants on splicing. Rather than examining splicing of genomic sequences, we measure splicing patterns of millions of randomized sequences, encompassing 100 million bases of variation. The large size of our dataset allows us to improve on current models of splicing and to gain new mechanistic insights. From these data we learn that multiple sequence motifs regulate exon definition additively rather than cooperatively. We also show that the same motifs regulate exon definition in alternative 5’, 3’, and cassette exon splicing. Our model of exon definition and our model of the human 5’ splice site greatly improve prediction of the effects of variants on both alternative 5’ and cassette exon splicing. Our results suggest that large-scale assays of random or synthetic sequences can also be used to improve our understanding of other complex forms of gene regulation, such as translation or transcription.
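
To make the additive claim concrete, the following is a minimal sketch of what an additive motif model can look like; it is not the model described in this abstract. It assumes that each hexamer contributes a fixed amount to the log-odds of exon inclusion and that contributions sum with no interaction terms. The hexamers, their effect sizes, and all function names are hypothetical placeholders chosen for illustration.

```python
import math

# Hypothetical per-hexamer effects in log-odds units.
# Illustrative values only; they are not taken from the study.
KMER_EFFECTS = {
    "GAAGAA": 0.8,   # assumed enhancer-like motif
    "TAGGGT": -0.9,  # assumed silencer-like motif
}
K = 6


def additive_score(seq, effects=KMER_EFFECTS, k=K):
    """Sum the effects of every k-mer in seq, with no interaction terms."""
    return sum(effects.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def inclusion_probability(seq, baseline=0.0):
    """Map the additive log-odds score to a probability (logistic function)."""
    return 1.0 / (1.0 + math.exp(-(baseline + additive_score(seq))))


def variant_effect(ref, alt):
    """Predicted change in log-odds when sequence ref is replaced by alt."""
    return additive_score(alt) - additive_score(ref)


if __name__ == "__main__":
    ref = "GAAGAATAGGGT"
    alt = "GAAGATTAGGGT"  # single-base change that disrupts the first motif
    print(variant_effect(ref, alt))
    print(inclusion_probability(ref), inclusion_probability(alt))
```

Under this assumption, the predicted effect of a variant is simply the difference between the summed motif scores of the variant and reference sequences. A cooperative model would instead need terms that depend on pairs or groups of motifs.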