Industries across the globe are striving to achieve net-zero emissions as a crucial step toward mitigating the adverse effects of climate change [1]. Among them, the chemical industry stands out as a major contributor, emitting around 2 billion metric tons of CO₂ annually, approximately 5% of total global greenhouse gas (GHG) emissions [2]. This is largely due to its heavy dependence on fossil-based feedstocks and a predominantly linear 'make-use-dispose' production model, which leads to significant carbon emissions and material waste. To align with net-zero targets, a fundamental transformation of this model is required. The shift toward a circular chemical industry is already underway, focusing on minimizing waste and maximizing resource efficiency. Achieving this transformation involves integrating advanced technologies, utilizing alternative and renewable raw materials, adopting innovative and sustainable business models, and implementing supportive regulatory frameworks and policies [3].
Chemical reactions are at the heart of the industry: they define the thermodynamics and chemistry of synthesis from raw materials and play a crucial role in reactor and process system design. Their significance has motivated studies on reaction synthesis planning [4], circular process solutions [5], the prediction of hazardous chemicals from reacting species and process conditions [6], and the prediction of product yield [7] using machine learning (ML) models and SMILES representations of molecules. Data-driven reaction prediction and retrosynthesis, which works backward from a target molecule to candidate precursors, underpin tools such as ASKCOS and Synthia, which propose likely reactions and synthesis routes. A bidirectional search approach has also been developed to form closed loops within a set of reactions, using retrosynthesis results and Tanimoto similarity to quantify the chemical similarity between entities [8]. In a similar direction, large language models (LLMs) have been used to predict reactions from the SMILES representations of reacting entities [9]. Establishing closed loops in the reaction network of a chemical defines circularity, and such a network is referred to as a circular reaction network (CRN). A CRN gives stakeholders flexibility in selecting among multiple circular pathways and supports further research to identify the most sustainable pathways across a chemical's life cycle. A recent framework [10] demonstrates how novel reaction pathways can be identified through hierarchical screening of a CRN. Its successful application depends on efficient methods for developing a CRN for any given product. To the best of the authors' knowledge, no prior work has leveraged LLMs to develop a CRN that covers the entire life cycle of a chemical.
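To make the notion of a closed loop concrete, the following minimal Python sketch treats a reaction network as a directed graph from reactants to products and searches for a path that returns to a chosen chemical. The function names (`build_graph`, `find_loop`) and the two-reaction toy network are illustrative assumptions and do not reproduce the methods of [8] or [10].

```python
from collections import defaultdict


def build_graph(reactions):
    """Map each reactant to the set of products it can form."""
    graph = defaultdict(set)
    for reactants, products in reactions:
        for r in reactants:
            for p in products:
                graph[r].add(p)
    return graph


def find_loop(graph, start):
    """Return one path start -> ... -> start, or None if no loop exists."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                return path + [start]
            if nxt not in path:  # never revisit a chemical within one path
                stack.append((nxt, path + [nxt]))
    return None


# Hypothetical toy network, not the CRN reported in this work.
toy_reactions = [
    (["CO2", "H2"], ["CH3OH", "H2O"]),  # methanol synthesis
    (["CH3OH", "O2"], ["CO2", "H2O"]),  # methanol combustion
]
graph = build_graph(toy_reactions)
loop = find_loop(graph, "CH3OH")
print(loop)  # ['CH3OH', 'CO2', 'CH3OH']
```

The depth-first search enumerates simple paths, so it is suitable only for small networks; a larger CRN would likely call for a dedicated graph library.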
We propose a customized LLM to develop a CRN for the life cycle of any chemical from the scientific literature. LLMs are transformer-based deep learning models capable of processing vast amounts of unstructured textual data and extracting meaningful insights. Trained on extensive corpora from books and journals, these models have inspired domain-specific adaptations in materials science [11], biology [12], and geology [13]. Notable examples include Recycle-BERT for plastic recycling [14] and CCU-Llama for literature mining on carbon capture and utilization (CCU) [15]. LLMs have proven effective in tasks such as question answering, entity extraction, and classification. Building on these capabilities, we tailor an LLM to extract reaction information from scientific literature and construct a CRN following a cradle-to-cradle system boundary. To remain within computational limits, we use research abstracts, which offer concise and structured summaries, as input data. Relevant abstracts are retrieved from the Elsevier database using an API and targeted keywords.
To test the proposed CRN creation methodology, methanol, one of the simplest building blocks in the chemical industry, is selected as a case study. Three keyword sets are used to capture its full life cycle: “syngas synthesis,” “methanol synthesis,” and “use of methanol.” Relevant abstracts are extracted from the Elsevier database and preprocessed to remove duplicates and irrelevant entries, and an expert annotates a sample of 100 abstracts to identify reactants, methodologies, and products. The annotated data are used to fine-tune the open-source LLaMA-2 model for entity extraction. Model performance is evaluated with a learning curve in which training and validation loss both decrease over 90 steps (3 epochs), indicating effective learning. The fine-tuned model, named Rxn-LLM, then processes the remaining abstracts and extracts reaction-related data, which are cleaned and stored in an Excel sheet. Two Python scripts, Text2Chemical and Chemical2Reaction, are developed to convert chemical names into formulas and to balance reactions, respectively. The resulting CRN for methanol includes 61 unique, balanced reactions and 31 chemical entities, as illustrated in Figure 1(b).
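The balancing step rests on an atom-conservation check, which can be sketched as follows. This is a minimal illustration only: it does not reproduce the Chemical2Reaction script, the helper names (`atom_counts`, `side_counts`, `is_balanced`) are assumptions, and the parser handles only simple formulas without brackets, hydrates, or charges.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "H3" or "O".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a simple formula such as 'CH3OH' (no brackets or charges)."""
    counts = Counter()
    for element, number in TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side):
    """Sum atom counts over a dict {formula: stoichiometric coefficient}."""
    total = Counter()
    for formula, coeff in side.items():
        for element, n in atom_counts(formula).items():
            total[element] += coeff * n
    return total


def is_balanced(reactants, products):
    """True if every element appears equally often on both sides."""
    return side_counts(reactants) == side_counts(products)


# CO + 2 H2 -> CH3OH
print(is_balanced({"CO": 1, "H2": 2}, {"CH3OH": 1}))  # True
# CO2 + H2 -> CH3OH (oxygen and hydrogen do not match)
print(is_balanced({"CO2": 1, "H2": 1}, {"CH3OH": 1}))  # False
```

A check of this kind only verifies given coefficients; finding the coefficients themselves would require solving the corresponding linear system of element balances.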
The resulting CRN is compared with two others, one developed using a human-based double-direction approach [10] and one using a pattern recognition-based method. The human-based approach relies heavily on domain expertise and involves manually reviewing the literature through two forward literature searches and one backward retrosynthesis search. It also uses Tanimoto similarity calculations to link commercially available and synthesized chemical entities, resulting in a CRN with 14 reactions and 10 chemical entities, as shown in Figure 1(c). In contrast, the pattern recognition-based method identifies reaction patterns in text using arrow symbols such as →, ↔, and ⇌ and extracts reactions directly from full-text articles. After post-processing, removing unbalanced reactions, and filtering out irrelevant entities such as Cl, Br, and Al, it produces a CRN with 31 reactions and 8 chemical entities, as presented in Figure 1(d). Compared to both methods, the LLM-based approach is more time-efficient, accurately retrieves balanced reactions, and offers greater scalability, making it a promising tool for developing CRNs with minimal manual effort.
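The arrow-based pattern recognition described above can be illustrated with a short regular-expression sketch. The pattern, the function names, and the sample sentence are assumptions made for illustration and are not the rules used to build the network in Figure 1(d). The species pattern is deliberately naive, so any capitalized word placed directly before an arrow would also be captured.

```python
import re

ARROWS = "→↔⇌"
# A species: optional integer coefficient, then a token starting with a capital letter.
SPECIES = r"(?:\d+\s*)?[A-Z][A-Za-z0-9]*"
# One side of a reaction: species joined by "+".
SIDE = rf"{SPECIES}(?:\s*\+\s*{SPECIES})*"
REACTION = re.compile(rf"({SIDE})\s*([{ARROWS}])\s*({SIDE})")


def split_side(side):
    """Split 'CO + 2H2' into ['CO', '2H2']."""
    return [s.strip() for s in side.split("+")]


def extract_reactions(text):
    """Return (reactants, arrow, products) for every arrow pattern found in text."""
    return [
        (split_side(left), arrow, split_side(right))
        for left, arrow, right in REACTION.findall(text)
    ]


# Hypothetical sentence, written for this example.
sample = (
    "Methanol forms via CO + 2H2 → CH3OH, "
    "while CO + H2O ⇌ CO2 + H2 shifts the gas ratio."
)
for reaction in extract_reactions(sample):
    print(reaction)
```

Reactions extracted this way still need the post-processing described in the text, such as discarding unbalanced reactions and irrelevant entities.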
Overall, for this case study, the LLM-based approach requires approximately 90% less time than the human-based approach when both are used to develop networks with similar reactions and chemicals. The approach automates CRN creation, enabling further analyses, such as reactor design, to support sustainable chemical processes for net-zero emission goals. By promoting circularity from synthesis to application, it leverages the vast chemical knowledge base to discover novel, sustainable circular reaction networks that can be further developed to reinvent the chemical industry as a sustainable circular economy. Future work will extend the approach with more integrated strategies, such as combining LLMs with retrosynthesis and with human expert advice, which may provide a more effective and robust way to develop the CRN for any chemical.
References:
[1] Kloo, Y., Nilsson, L. J., & Palm, E. (2024). Reaching net-zero in the chemical industry—A study of roadmaps for industrial decarbonisation. Renewable and Sustainable Energy Transition, 5, 100075.
[2] Gabrielli, P., Rosa, L., Gazzani, M., Meys, R., Bardow, A., Mazzotti, M., & Sansavini, G. (2023). Net-zero emissions chemical industry in a world of limited resources. One Earth, 6(6), 682-704.
[3] Jin, E., Jabarivelisdeh, B., Schoeneberger, C., Chamanara, S., Dunn, J. B., Christopher, P., & Masanet, E. (2024). Critical review of technologies, data, and scenario elements in net-zero pathway modeling for the chemical industry. Renewable and Sustainable Energy Reviews, 205, 114831.
[4] Baró, E. L., Nadal Rodríguez, P., Juárez‐Jiménez, J., Ghashghaei, O., & Lavilla, R. (2024). Reaction Space Charting as a Tool in Organic Chemistry Research and Development. Advanced Synthesis & Catalysis, 366(4), 551-573.
[5] Weber, J. M., Guo, Z., & Lapkin, A. A. (2022). Discovering circular process solutions through automated reaction network optimization. ACS Engineering Au, 2(4), 333-349.
[6] Saraf, S. R., Rogers, W. J., & Mannan, M. S. (2003). Prediction of reactive hazards based on molecular structure. Journal of Hazardous Materials, 98(1-3), 15-29.
[7] Zuranski, A. M., Martinez Alvarado, J. I., Shields, B. J., & Doyle, A. G. (2021). Predicting reaction yields via supervised learning. Accounts of Chemical Research, 54(8), 1856-1865.
[8] Yu, K., Roh, J., Li, Z., Gao, W., Wang, R., & Coley, C. (2024). Double-ended synthesis planning with goal-constrained bidirectional search. Advances in Neural Information Processing Systems, 37, 112919-112949.
[9] Zhang, C., Lin, Q., Zhu, B., Yang, H., Lian, X., Deng, H., ... & Liao, K. (2025). SynAsk: unleashing the power of large language models in organic synthesis. Chemical Science, 16(1), 43-56.
[10] Kim, S., & Bakshi, B. R. Discovering sustainable net-zero chemical processes and pathways by developing circular reaction networks and their hierarchical screening. Submitted to Industrial & Engineering Chemistry Research.
[11] Zaki, M., & Krishnan, N. A. (2024). MaScQA: investigating materials science knowledge of large language models. Digital Discovery, 3(2), 313-327.
[12] Luu, R. K., & Buehler, M. J. (2024). BioinspiredLLM: Conversational large language model for the mechanics of biological and bio‐inspired materials. Advanced Science, 11(10), 2306724.
[13] Lin, Z., Deng, C., Zhou, L., Zhang, T., Xu, Y., Xu, Y., ... & Zhou, C. (2023). Geogalactica: A scientific large language model in geoscience. arXiv preprint arXiv:2401.00434.
[14] Kumar, A., Bakshi, B. R., Ramteke, M., & Kodamana, H. (2023). Recycle-BERT: extracting knowledge about plastic waste recycling by natural language processing. ACS Sustainable Chemistry & Engineering, 11(32), 12123-12134.
[15] Jami, H. C., Singh, P. R., Kumar, A., Bakshi, B. R., Ramteke, M., & Kodamana, H. (2024). CCU-Llama: A Knowledge Extraction LLM for Carbon Capture and Utilization by Mining Scientific Literature Data. Industrial & Engineering Chemistry Research, 63(41), 17585-17598.
