2025 AIChE Annual Meeting

(345c) PathLM: Chemical Language Model for Biosynthesis Planning

Biocatalytic reactions offer transformative potential for molecular synthesis: they operate under mild conditions, generate minimal waste, enable one-pot enzymatic cascades, and deliver high stereoselectivity and regioselectivity. Yet systematic biosynthetic pathway design remains a challenge. The largest freely available biosynthetic reaction databases contain only a fraction of the entries found in chemical reaction databases. Existing retrosynthesis tools rely on the larger chemical reaction databases to predict single chemical reaction steps in isolation. Although tree search algorithms can quickly traverse the combinatorial explosion of possible routes, they often neglect stoichiometric balance, which is crucial for biosynthetic pathways.

Here, we introduce PathLM, a platform for de novo biosynthetic pathway design that integrates large language model fine-tuning with reinforcement learning to predict complete, balanced routes between precursors and target molecules. PathLM is trained on both publicly available biosynthetic reactions and an internally curated corpus of pathways, enabling it to capture evolutionary context and cofactor requirements across multistep sequences. A reinforcement learning framework enforces chemical validity and mass balance, so that individual reactions and overall pathways are stoichiometrically balanced and yield chemically feasible intermediates.
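To make the mass-balance requirement concrete, the following is a minimal sketch of an element-level balance check that could, in principle, feed a reward signal. It is an illustration under stated assumptions, not PathLM's actual implementation: the function names (`atom_counts`, `is_balanced`, `balance_reward`), the representation of a reaction as formula-to-coefficient mappings, and the +1/-1 reward values are all hypothetical, and the parser handles only flat molecular formulas without parentheses or charges.

```python
import re
from collections import Counter
from typing import Mapping

# Matches an element symbol (e.g. "C", "Cl") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H12O6' (no parentheses or charges)."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side: Mapping[str, int]) -> Counter:
    """Sum atom counts over one side of a reaction, weighted by stoichiometric coefficients."""
    total = Counter()
    for formula, coeff in side.items():
        for element, n in atom_counts(formula).items():
            total[element] += coeff * n
    return total


def is_balanced(reactants: Mapping[str, int], products: Mapping[str, int]) -> bool:
    """Return True when every element appears in equal amounts on both sides."""
    return side_counts(reactants) == side_counts(products)


def balance_reward(reactants: Mapping[str, int], products: Mapping[str, int]) -> float:
    """Hypothetical reward: +1.0 for a balanced step, -1.0 otherwise (assumed values)."""
    return 1.0 if is_balanced(reactants, products) else -1.0


# Ethanol fermentation: C6H12O6 -> 2 C2H6O + 2 CO2
fermentation_ok = is_balanced({"C6H12O6": 1}, {"C2H6O": 2, "CO2": 2})
# The same step with the CO2 side product omitted no longer balances.
fermentation_missing_co2 = is_balanced({"C6H12O6": 1}, {"C2H6O": 2})
```

The second case shows why side metabolites matter: dropping CO2 from the product side leaves carbon and oxygen unaccounted for, so the step fails the check. A pathway-level check could apply the same comparison to the summed reactants and products of all steps.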

We demonstrate that PathLM can propose concise, end-to-end biosynthetic routes that explicitly incorporate necessary cofactors and side metabolites. By bridging the gap between single-step retrosynthesis and holistic pathway engineering, PathLM promises to accelerate the development of sustainable biocatalytic processes for pharmaceuticals, fine chemicals and biofuels.