The environmental impacts of the commodity-type chemicals used in active pharmaceutical ingredient (API) synthesis are essential knowledge for sustainable chemical process design. These impacts, however, are often unknown, as existing environmental impact databases cover only approximately 6% of commercial chemicals.[1]
Accurately estimating the environmental impact of producing a molecule that is absent from these databases requires a Life Cycle Assessment (LCA) of its production process, which is time-consuming and often requires proprietary data. Machine learning (ML) models that require only the molecular structure of interest have been created to address this issue, but the small size of their training datasets limits their accuracy.
In this work, we investigate the effect of cross-property transfer learning from a molecular price dataset on model accuracy for four target environmental impacts. Our models were first trained on price data from Reaxys and then fine-tuned for environmental impact prediction. Separate models were trained to predict the carbon footprint, energy demand, mass consumption excluding water, and water consumption associated with manufacturing the organic precursor molecules. This methodology was applied to three model architectures: a SMILES-based transformer encoder,[2] a directed Message Passing Neural Network (d-MPNN),[3] and a transformer-enhanced MPNN.[4]
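The two-stage procedure described above (pretrain on a large source property, then fine-tune on a small target property) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the data are synthetic stand-ins rather than Reaxys prices or LCA values, the "encoder" is a single learned projection rather than a transformer or MPNN, and all names (`encode`, `train`, `make_task`) are hypothetical.

```python
import random

def encode(w, x):
    """Shared 'encoder': one learned projection of the feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, head, x):
    """Task-specific linear head (scale, bias) on top of the shared encoding."""
    scale, bias = head
    return scale * encode(w, x) + bias

def mse(w, head, X, y):
    """Mean squared error of the model on a dataset."""
    return sum((predict(w, head, x) - t) ** 2 for x, t in zip(X, y)) / len(X)

def train(w, head, X, y, lr=0.05, epochs=300):
    """Full-batch gradient descent on MSE, updating encoder and head jointly."""
    w, (scale, bias) = list(w), head
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_scale, grad_bias = [0.0] * len(w), 0.0, 0.0
        for x, t in zip(X, y):
            z = encode(w, x)
            err = 2.0 * (scale * z + bias - t) / n
            grad_scale += err * z
            grad_bias += err
            for j, xj in enumerate(x):
                grad_w[j] += err * scale * xj
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        scale, bias = scale - lr * grad_scale, bias - lr * grad_bias
    return w, (scale, bias)

# Synthetic stand-in data: both tasks depend on the same latent direction,
# which is the assumption that makes cross-property transfer useful.
rng = random.Random(0)
latent = [1.0, -0.5, 0.25]

def make_task(n, scale, bias):
    X = [[rng.uniform(-1, 1) for _ in latent] for _ in range(n)]
    y = [scale * encode(latent, x) + bias + rng.gauss(0, 0.05) for x in X]
    return X, y

X_price, y_price = make_task(200, 2.0, 0.5)  # large source task ("price")
X_lca, y_lca = make_task(20, 0.8, 1.0)       # small target task ("impact")

w_init, head_init = [0.1, 0.1, 0.1], (1.0, 0.0)

# Stage 1: supervised pretraining on the source property.
source_loss_init = mse(w_init, head_init, X_price, y_price)
w_pre, head_pre = train(w_init, head_init, X_price, y_price)
source_loss_final = mse(w_pre, head_pre, X_price, y_price)

# Stage 2: keep the pretrained encoder, attach a fresh head, fine-tune on the target.
loss_before = mse(w_pre, head_init, X_lca, y_lca)
w_ft, head_ft = train(w_pre, head_init, X_lca, y_lca)
loss_after = mse(w_ft, head_ft, X_lca, y_lca)
```

A baseline for comparison would call `train(w_init, head_init, X_lca, y_lca)` directly, skipping stage 1. Whether pretraining helps in practice depends on how much structure the source and target properties share.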
We compared models trained with transfer learning against models trained directly on environmental impacts across five cross-validation folds, and found that supervised pretraining on molecular price data increases model accuracy for three of the four LCA impact categories for all models. Averaged over all models, impacts, and folds, transfer learning increases R² by 0.13 and reduces the mean absolute percentage error by 7.6%. The most accurate transfer-learning-enhanced model for carbon footprint prediction (R² = 0.62) slightly outperforms the existing literature benchmarks.[5]
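For reference, the two reported metrics and the fold-averaged comparison can be written out as below. This is a minimal sketch using the standard definitions of R² and mean absolute percentage error. The helper names are hypothetical, and the exact averaging protocol used in this work is an assumption.

```python
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires non-zero targets)."""
    return 100.0 * mean(abs((t - p) / t) for t, p in zip(y_true, y_pred))

def mean_improvement(baseline_scores, transfer_scores):
    """Mean per-fold difference (transfer minus baseline) for one metric."""
    return mean(t - b for b, t in zip(baseline_scores, transfer_scores))
```

With per-fold scores collected for each model and impact category, `mean_improvement` applied to R² gives a positive number when transfer learning helps, while applied to the percentage error it gives a negative number when the error falls.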
Such models can be used to increase the accuracy of existing process greenness metrics,[6] or to aid process engineers in designing more sustainable chemical processes by allowing for more environmentally informed precursor choices. We also anticipate our findings on price transfer learning can be used to increase the accuracy of any future ML LCA prediction model.
References
[1] Parvatker, A.G. and M.J. Eckelman, Comparative Evaluation of Chemical Life Cycle Inventory Generation Methods and Implications for Life Cycle Assessment Results. ACS Sustainable Chemistry & Engineering, 2019. 7(1): p. 350-367.
[2] Ross, J., et al., Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 2022. 4(12).
[3] Heid, E., et al., Chemprop: A Machine Learning Package for Chemical Property Prediction. Journal of Chemical Information and Modeling, 2024. 64(1): p. 9-17.
[4] Liu, C.Y., et al., ABT-MPNN: an atom-bond transformer-based message-passing neural network for molecular property prediction. Journal of Cheminformatics, 2023. 15(1).
[5] Kleinekorte, J., et al., APPROPRIATE Life Cycle Assessment: A PROcess-Specific, PRedictive Impact AssessmenT Method for Emerging Chemical Processes. ACS Sustainable Chemistry & Engineering, 2023. 11(25): p. 9303-9319.
[6] Roschangar, F., et al., Improved iGAL 2.0 Metric Empowers Pharmaceutical Scientists to Make Meaningful Contributions to United Nations Sustainable Development Goal 12. ACS Sustainable Chemistry & Engineering, 2022. 10(16): p. 5148-5162.