2025 AIChE Annual Meeting

(345b) Accelerating the Computational Design of Biosynthetic Pathways with Machine Learning

Authors

Zhuofu Ni, Northwestern University
Kevin Shebek, Northwestern University
Linda Broadbelt, Northwestern University
Keith Tyo, Northwestern University
Computationally aided synthesis-planning (CASP) tools within biology aim to leverage the inherent substrate promiscuity of enzymes to design novel biosynthetic routes to valuable small molecules, which may include commodity chemicals, biofuels, or even therapeutics. A common strategy among CASP tools is to use reaction templates to recursively enumerate all possible enzymatic transformations that may arise from a given precursor, and then to traverse the resulting reaction network for plausible pathways. While this approach is comprehensive, it often produces many false-positive reaction predictions because many reaction templates are broadly permissive. To mitigate this and accurately predict the feasibility of novel enzymatic reactions, we introduce DORA-XGB, a gradient-boosted classifier that integrates into our group’s CASP tool, DORAnet, to score newly generated reactions and, consequently, prioritize feasible pathways. We curated a high-quality dataset to train DORA-XGB by first extracting known enzymatic reactions from public databases and then filtering these reactions for thermodynamic feasibility. While positive reactions can easily be extracted from publicly available databases, negative reactions are rarely published. To circumvent this lack of failed reactions, we synthetically generated negative data by considering “alternate reaction centers” on known substrates. These are chemical moieties within a substrate that, despite being identical to the reaction center known to undergo catalysis, remain uncatalyzed in a given enzymatic reaction. With this “alternate reaction center” hypothesis, we strategically inferred negative reactions from known, positive ones and used the ensuing dataset to train DORA-XGB. In various case studies, our model successfully recovers newly published enzymatic reactions and accelerates pathway design by ranking reactions, and consequently pathways, by their feasibility.
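The “alternate reaction center” idea can be illustrated with a minimal toy sketch. It is not the DORA-XGB implementation: the `Reaction` record, the `infer_negatives` helper, the functional-group labels, and the `toy_diol` substrate are all hypothetical, and a real pipeline would presumably locate matching moieties by substructure search on molecular graphs rather than by comparing string labels. The sketch also assumes that every alternate center yields one negative example, which the abstract does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Toy reaction record (hypothetical; not the DORA-XGB data model)."""
    substrate: str   # illustrative substrate name
    groups: tuple    # functional-group label at each site of the substrate
    center: int      # index of the site treated as the reaction center
    label: int       # 1 = known positive reaction, 0 = inferred negative


def infer_negatives(positive: Reaction) -> list:
    """Build one inferred negative per alternate reaction center.

    An alternate center is any other site carrying the same
    functional-group label as the site that is known to react.
    """
    target = positive.groups[positive.center]
    return [
        Reaction(positive.substrate, positive.groups, i, 0)
        for i, group in enumerate(positive.groups)
        if group == target and i != positive.center
    ]


# Known reaction: the enzyme acts on the hydroxyl at site 0 only.
known = Reaction("toy_diol", ("OH", "CH3", "OH", "COOH"), center=0, label=1)
negatives = infer_negatives(known)

# Labeled set that a binary classifier could be trained on.
dataset = [known] + negatives
```

Here the second hydroxyl (site 2) is chemically identical to the known reaction center but is not the site acted on, so it becomes the single inferred negative. The resulting labeled examples would then be featurized and passed to a gradient-boosted classifier.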