2025 AIChE Annual Meeting

(202f) Automatic Chemical Information Extraction Using Deep Learning and Large Language Models: Applications in Chemistry and Materials Predictions

Authors

Yufan Chen, Hong Kong University of Science and Technology
Yuxuan Zhang, China Agricultural University

Machine learning (ML) is transforming the molecular sciences by accelerating the discovery and optimization of molecules, materials, and reactions. However, the performance of ML models depends heavily on the availability of large, high-quality datasets. Historically, constructing such datasets in chemistry and materials science has required labor-intensive manual curation, because the data are scattered across publications and reported in complex, domain-specific formats. As ML applications in these fields expand rapidly, reliance on manual data extraction has become a critical bottleneck.

Automated data extraction from the scientific literature offers a promising solution, particularly given recent advances in large language models (LLMs). Yet chemistry and materials science present unique challenges, including complex named entities, long-range dependencies, and multimodal data (e.g., text, tables, and figures). In this work, we present deep learning and LLM-based approaches for automated information extraction from the organic chemistry and polymer materials literature. By leveraging data augmentation, prompt engineering, and fine-tuning, we develop efficient and accurate models for structured knowledge extraction. Notably, ML models trained on the automatically extracted data perform comparably in downstream tasks to models trained on expert-curated datasets. Our results demonstrate the potential of fully automated, data-driven pipelines to accelerate discovery in chemistry and materials science.
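
As an illustration of what prompt-based structured extraction can look like, the following minimal sketch builds an extraction prompt and validates a model's JSON reply. It is not the authors' pipeline: the schema (polymer, property, value, unit) and the function names build_prompt and parse_records are hypothetical, chosen only for this example, and the call to an actual LLM is deliberately left out.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not taken from the abstract.
REQUIRED_FIELDS = ("polymer", "property", "value", "unit")


def build_prompt(passage: str) -> str:
    """Assemble an extraction prompt that asks for a JSON array of records."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Extract every polymer property measurement from the passage below.\n"
        f"Return only a JSON array of objects with the keys: {fields}.\n"
        "Return [] if the passage reports no measurements.\n\n"
        f"Passage:\n{passage}"
    )


def parse_records(response: str) -> list[dict]:
    """Parse a model reply, keeping only complete records with numeric values."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return []  # reply was not valid JSON
    if not isinstance(data, list):
        return []  # reply was JSON but not the requested array
    records = []
    for item in data:
        # Skip anything that is not an object carrying every required key.
        if not isinstance(item, dict) or not all(k in item for k in REQUIRED_FIELDS):
            continue
        try:
            value = float(item["value"])
        except (TypeError, ValueError):
            continue  # skip records whose value is not numeric
        record = {k: item[k] for k in REQUIRED_FIELDS}
        record["value"] = value  # store the value as a float
        records.append(record)
    return records
```

In use, the string returned by build_prompt would be sent to whichever model is available, and the reply passed to parse_records; rejecting malformed or incomplete records at this stage is one simple way to keep noisy model output out of a downstream training set.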