2025 AIChE Annual Meeting

(675e) Text Mining Experimental Heterogeneous Catalysis Literature with Large Language Models

Authors

Suljo Linic, University of Michigan-Ann Arbor
Extracting experimentally measured heterogeneous catalysis data from research articles into structured databases would facilitate the rapid screening of catalysts with target properties and the development of machine learning models that directly predict experimental outcomes. This text mining task has been transformed by the release of large language models (LLMs) capable of following general natural language instructions, which make it possible to mine text without training task-specific models or defining comprehensive expression-matching rules. Here, we present CatMiner, a text mining tool we developed that uses LLMs to extract arbitrary user-specified catalytic structure–environment–property data. CatMiner is agnostic to the choice of LLM: both closed-source GPT models and open-source Llama and DeepSeek models are supported without modification. We benchmark CatMiner on data extraction for the oxidative coupling of methane (OCM) reaction and measure the effect of different LLMs and prompting strategies on performance. Using Llama 3.1 405B, we achieve an F1-score of 80.3% on a catalyst–property extraction task and 68.7% on a more difficult catalyst–temperature–property extraction task. We find that domain knowledge, chat-like memory, follow-up prompting, and inter-paragraph search are all necessary to achieve the best performance. Using CatMiner, we generate a machine-readable database of 3628 OCM measurements extracted from 1029 papers and abstracts.
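To illustrate the kind of LLM-agnostic extraction step the abstract describes, the sketch below treats the model as a plain callable (prompt in, text out), so a closed-source GPT backend or an open-source Llama/DeepSeek backend can be swapped in without changing the extraction logic. This is a minimal, hypothetical sketch, not the actual CatMiner implementation; the prompt wording, the `extract_records` function, and the JSON record schema (`catalyst`/`value` keys) are all assumptions for illustration.

```python
import json

# Hypothetical prompt for one catalyst-property extraction pass.
# The leading expert framing stands in for the "domain knowledge"
# component mentioned in the abstract.
PROMPT_TEMPLATE = (
    "You are an expert in heterogeneous catalysis.\n"
    "From the paragraph below, extract every catalyst and its reported "
    "{prop} as a JSON list of objects with keys 'catalyst' and 'value'. "
    "Return [] if none are reported.\n\nParagraph:\n{paragraph}"
)

def extract_records(paragraph, prop, llm):
    """Ask an arbitrary LLM callable for structured records and parse
    its JSON reply, discarding malformed output."""
    reply = llm(PROMPT_TEMPLATE.format(prop=prop, paragraph=paragraph))
    try:
        records = json.loads(reply)
    except json.JSONDecodeError:
        return []  # treat unparseable replies as "nothing extracted"
    if not isinstance(records, list):
        return []
    return [r for r in records if isinstance(r, dict) and "catalyst" in r]
```

Because the model is injected as a function, the same extraction code can be benchmarked against different LLMs, as done in the abstract's comparison of prompting strategies and model choices.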