2025 AIChE Annual Meeting
(202f) Automatic Chemical Information Extraction Using Deep Learning and Large Language Models: Applications in Chemistry and Materials Predictions
Automated data extraction from scientific literature offers a promising alternative to manual data curation, particularly given recent advances in large language models (LLMs). Yet chemistry and materials science present unique challenges, including complex named entities, long-range dependencies, and multimodal data (e.g., text, tables, and figures). In this work, we present deep learning and LLM-based approaches for automated information extraction from the organic chemistry and polymer materials literature. By leveraging data augmentation, prompt engineering, and fine-tuning, we develop efficient and accurate models for structured knowledge extraction. Notably, the automatically extracted data perform comparably to expert-curated datasets in downstream ML tasks. Our results demonstrate the potential of fully automated, data-driven pipelines to accelerate discovery in chemistry and materials science.