2025 AIChE Annual Meeting

(584ci) Automated Synthesis Procedure Generation in Heterogeneous Catalysis Via Fine-Tuned Large Language Model

Exploring catalytic materials and their synthesis routes traditionally demands extensive iterative experimentation and a substantial investment of time and resources. To overcome these constraints, we developed an extraction and classification workflow that combines language models with multimodal processing. First, text from over 9,000 scientific articles was analyzed to extract detailed catalyst attributes, including chemical composition, structural motifs, morphology, crystal structure, size, shape, and support materials. Images and their associated captions were also captured from these publications and processed with vision-language methods to enrich the dataset. This structured information was then refined through classification, synthesis query generation, and feasibility validation, yielding a curated dataset of 2,250 high-quality catalyst synthesis procedures.
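
As an illustration of what one structured record in such a workflow might look like, the sketch below collects the attributes listed above into a single data structure and turns it into a synthesis query. The names `CatalystRecord` and `synthesis_query`, the field types, and the query wording are all assumptions made for this example; the abstract does not describe the actual schema or prompt format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalystRecord:
    """Hypothetical container for the attributes extracted from one article."""
    composition: str
    support: Optional[str] = None
    morphology: Optional[str] = None
    crystal_structure: Optional[str] = None
    size: Optional[str] = None
    shape: Optional[str] = None
    structural_motifs: List[str] = field(default_factory=list)


def synthesis_query(record: CatalystRecord) -> str:
    """Build a plain-text query from the attributes that are present.

    The wording is illustrative only; missing attributes are skipped.
    """
    parts = [f"composition: {record.composition}"]
    optional_fields = [
        ("support", record.support),
        ("morphology", record.morphology),
        ("crystal structure", record.crystal_structure),
        ("size", record.size),
        ("shape", record.shape),
    ]
    for label, value in optional_fields:
        if value:
            parts.append(f"{label}: {value}")
    if record.structural_motifs:
        parts.append("structural motifs: " + ", ".join(record.structural_motifs))
    return (
        "Describe a synthesis procedure for a catalyst with "
        + "; ".join(parts)
        + "."
    )


if __name__ == "__main__":
    example = CatalystRecord(composition="Pt", support="Al2O3", size="2 nm")
    print(synthesis_query(example))
```

Keeping every attribute except composition optional reflects the likelihood that many articles report only some of these properties; that design choice is also an assumption.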

Using this dataset, we fine-tuned a large language model with parameter-efficient adaptation to improve its ability to predict detailed catalyst synthesis methods. Training converged stably, and the fine-tuned model showed substantial improvements over baseline models, achieving a ROUGE-1 score of 0.522, a ROUGE-L score of 0.290, and a BERTScore of 0.863. These results underscore the effectiveness of integrating structured multimodal data with iterative validation, offering a pathway to accelerate catalyst discovery and synthesis optimization and thereby reduce research timelines and resource demands.
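
To make the reported metrics more concrete, the following is a minimal sketch of how ROUGE-1 and ROUGE-L F1 scores are commonly computed for a generated procedure against a reference. It assumes lowercase whitespace tokenization and no stemming, and the example sentences are invented. The abstract does not state which ROUGE implementation or settings were used, so this is not necessarily the evaluation setup behind the scores above.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace, no stemming.
    return text.lower().split()


def f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall for a given overlap count."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """ROUGE-1 F1: clipped unigram overlap, ignoring word order."""
    c, r = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(c) & Counter(r)).values())
    return f1(overlap, len(c), len(r))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1: based on the longest in-order common subsequence."""
    c, r = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    # Invented sentences: same words, but the two steps are swapped.
    reference = "dissolve nickel nitrate in water then calcine at 500 C"
    candidate = "calcine at 500 C then dissolve nickel nitrate in water"
    print(rouge_1(candidate, reference))  # 1.0
    print(rouge_l(candidate, reference))  # 0.5
```

In this toy case the candidate uses exactly the reference's words but reverses the order of the two steps, so ROUGE-1 is 1.0 while ROUGE-L drops to 0.5. Because ROUGE-L rewards only in-order matches, it is typically lower than ROUGE-1 for the same pair of texts.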