2025 AIChE Annual Meeting

(588bo) ProcedureT5: Enhanced Experimental Procedure Prediction with Pre-Training and Data Augmentation

Computer-aided synthesis planning (CASP) can assist the synthesis of target molecules by proposing synthetic pathways (1), recommending reaction conditions (2), and assessing the likelihood of success of proposed reactions (3). However, translating computer-generated synthesis routes into executable experimental procedures remains a challenge because robust automated methods are lacking. To meet the demand for automated experimental procedure prediction, AI-driven data extraction has been used to compensate for the limited availability of annotated experimental procedure data, enabling the curation of datasets for developing experimental procedure prediction models (4–6). Although AI-driven data extraction is efficient, model performance in this domain still needs improvement, and evaluation must extend beyond AI-curated test sets; both call for more advanced training frameworks and high-quality benchmarks.

In this work, we introduce ProcedureT5, an approach that integrates chemistry-oriented pre-trained models with augmented multi-source datasets to improve the prediction of experimental procedures across a broader range of scenarios. Our method achieves state-of-the-art performance on the Pistachio dataset, a collection of reaction procedures derived from US patent literature, with a 4-point increase in BLEU score and a 34% improvement in exact-match accuracy over existing methods. Additionally, we curate a small expert-annotated dataset, Orgsyn, consisting of verified organic synthesis procedures, to assess the model’s performance in more diverse applications. Fine-tuning ProcedureT5 on the Orgsyn dataset demonstrates its adaptability, yielding a BLEU score of 41.19 and an average similarity of 50.58%. This work underscores the potential of ProcedureT5 to bridge the gap between computational synthesis planning and practical laboratory implementation.
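For readers unfamiliar with the reported metrics, the following is a minimal sketch of how exact-match accuracy and an average similarity could be computed over predicted and reference procedures. It is an illustration only, not the evaluation code used in this work: the abstract does not define "average similarity", so the token-level ratio from Python's `difflib.SequenceMatcher` is an assumption, and the function names and the action strings in the example are hypothetical.

```python
from difflib import SequenceMatcher


def _normalize(text):
    # Collapse runs of whitespace so formatting differences do not count as errors.
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference after whitespace normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        _normalize(p) == _normalize(r) for p, r in zip(predictions, references)
    )
    return matches / len(references)


def average_similarity(predictions, references):
    """Mean token-level similarity in percent.

    Assumption: similarity is taken as difflib's ratio, 2 * M / T, where M is the
    number of matched tokens and T the total number of tokens in both sequences.
    The metric actually used in the work may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    ratios = [
        SequenceMatcher(None, p.split(), r.split()).ratio()
        for p, r in zip(predictions, references)
    ]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical action sequences, for illustration only.
references = ["ADD water ; STIR 2 h", "FILTER ; DRY"]
predictions = ["ADD water ; STIR 1 h", "FILTER ; DRY"]

accuracy = exact_match_accuracy(predictions, references)
similarity = average_similarity(predictions, references)
```

In this toy example one of the two predictions matches its reference exactly, while the other differs in a single token, so the similarity score gives partial credit that exact-match accuracy does not.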

References

(1) Jiang, Y.; Yu, Y.; Kong, M.; Mei, Y.; Yuan, L.; Huang, Z.; Kuang, K.; Wang, Z.; Yao, H.; Zou, J.; Coley, C. W.; Wei, Y. Artificial Intelligence for Retrosynthesis Prediction. Engineering 2023, 25, 32–50. https://doi.org/10.1016/j.eng.2022.04.021.

(2) Gao, H.; Struble, T. J.; Coley, C. W.; Wang, Y.; Green, W. H.; Jensen, K. F. Using Machine Learning To Predict Suitable Conditions for Organic Reactions. ACS Cent. Sci. 2018, 4 (11), 1465–1476. https://doi.org/10.1021/acscentsci.8b00357.

(3) Hua, P.-X.; Huang, Z.; Xu, Z.-Y.; Zhao, Q.; Ye, C.-Y.; Wang, Y.-F.; Xu, Y.-H.; Fu, Y.; Ding, H. An Active Representation Learning Method for Reaction Yield Prediction with Small-Scale Data. Commun Chem 2025, 8 (1), 1–12. https://doi.org/10.1038/s42004-025-01434-0.

(4) Vaucher, A. C.; Zipoli, F.; Geluykens, J.; Nair, V. H.; Schwaller, P.; Laino, T. Automated Extraction of Chemical Synthesis Actions from Experimental Procedures. Nat Commun 2020, 11 (1), 3601. https://doi.org/10.1038/s41467-020-17266-6.

(5) Vaucher, A. C.; Schwaller, P.; Geluykens, J.; Nair, V. H.; Iuliano, A.; Laino, T. Inferring Experimental Procedures from Text-Based Representations of Chemical Reactions. Nat Commun 2021, 12 (1), 2573. https://doi.org/10.1038/s41467-021-22951-1.

(6) Liu, Z.; Shi, Y.; Zhang, A.; Li, S.; Zhang, E.; Wang, X.; Kawaguchi, K.; Chua, T.-S. ReactXT: Understanding Molecular “Reaction-Ship” via Reaction-Contextualized Molecule-Text Pretraining. arXiv May 23, 2024. https://doi.org/10.48550/arXiv.2405.14225.