2024 AIChE Annual Meeting

(372ag) Variable Extraction and Equivalence Judgment with BERT Model Pre-Trained on Chemical Engineering-Related Papers

Authors

Shota Kato - Presenter, Kyoto University
Chunpu Zhang, Kyoto University
Kotaro Nagayama, Kyoto University
Manabu Kano, Kyoto University
Introduction

In the process industries, physical models that mimic manufacturing processes play a crucial role. However, building physical models demands substantial effort, including deep expertise and an extensive literature survey. To reduce this effort, we aim to realize an artificial intelligence system, the Automated Physical Model Builder (AutoPMoB), designed to automate the construction of physical models from a literature database [1]. As part of this work, this study addresses the extraction of variable definitions from chemical process-related literature and the judgment of whether variable definitions extracted from different sources are equivalent.

We assess the effectiveness of a domain-specific language model by constructing a language model specialized in chemical engineering, named ProcessOnlyBERT, and comparing its performance with that of other models on the variable definition extraction (VDE) and variable definition equivalence judgment (VDEJ) tasks. Unlike ProcessBERT [2], which was trained on a smaller dataset and shared the vocabulary of Bidirectional Encoder Representations from Transformers (BERT) [3], ProcessOnlyBERT uses a larger domain-specific dataset and a domain-specific vocabulary and is trained from scratch.

Methods

Our research involves creating a BERT model specialized in chemical engineering, using approximately 1.2 million papers from 130 journals. We focus on abstracts and main texts, excluding titles, keywords, references, tables, and figures. The total word count in the training dataset is approximately 4.1 billion (24.9 GB). For sentence segmentation, we employ ScispaCy [4], which was used in developing SciBERT [5], a model pre-trained on scientific papers. We set the vocabulary size to 30,000, consistent with BERT, and apply the same architecture as BERTBASE, which has 12 transformer blocks, a hidden size of 768, and 12 self-attention heads, for a total of 110 million parameters. Training is conducted on Google Cloud's TPU v3 (8 cores) with the same configuration used for the pre-training of BERTBASE.
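As a minimal sketch of this setup, the following Python snippet builds a 30,000-token domain-specific WordPiece vocabulary and instantiates a randomly initialized BERTBASE-sized model with Hugging Face's tokenizers and transformers libraries; the corpus file name is a placeholder, and the actual data pipeline and TPU training loop are omitted.

# Sketch only: the corpus path and output directory are placeholders.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForPreTraining

# Build a domain-specific WordPiece vocabulary of 30,000 tokens from the
# sentence-segmented corpus (one sentence per line).
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["chemeng_corpus.txt"], vocab_size=30_000)
tokenizer.save_model(".")  # writes vocab.txt

# BERTBASE-sized architecture: 12 transformer blocks, hidden size 768,
# 12 self-attention heads (roughly 110 million parameters in total).
config = BertConfig(
    vocab_size=30_000,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
)
model = BertForPreTraining(config)  # randomly initialized, i.e., trained from scratch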

We use different approaches for the VDE and VDEJ tasks and employ five models for both: BERTBASE; BERTLARGE, which has a total of 340 million parameters; SciBERT; ProcessBERT; and ProcessOnlyBERT. In the VDE task, for a target variable in each sentence, the corresponding definition must be extracted if it exists, and nothing must be extracted otherwise. To achieve this, we first replace the target variable in the input sentence with a special token [target] and then tokenize the sentence. The resulting token sequence is fed into the BERT model, which outputs, for each token, the probability that it is the start or the end of a definition span. We select the span that maximizes the combined probability under the condition that the start position coincides with or precedes the end position and extract that span as the definition. Not all variables in the papers have associated definitions; in such cases, the [CLS] token at the head of the input token sequence is extracted, indicating that no definition exists. In the VDEJ task, one must determine whether two definitions are equivalent. To solve this task with each BERT model, we feed into the model a word sequence that concatenates the two definitions with the [CLS] token and obtain the likelihood of their equivalence. If this likelihood exceeds a predetermined threshold, the definitions are judged equivalent; otherwise, they are judged non-equivalent.
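The sketch below illustrates our reading of these two inference procedures with Hugging Face transformers. The model identifiers, the QA-style span head, and the threshold value are illustrative assumptions rather than details of the actual implementation, and the sentence passed to the VDE function is assumed to already contain the [target] token.

import torch
from transformers import (BertForQuestionAnswering, BertForSequenceClassification,
                          BertTokenizerFast)

# Illustrative setup: a [target] special token is added to the vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[target]"]})
vde_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
vde_model.resize_token_embeddings(len(tokenizer))
vdej_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def extract_definition(sentence_with_target):
    """VDE: return the definition span of the [target] variable, or None."""
    enc = tokenizer(sentence_with_target, return_tensors="pt")
    with torch.no_grad():
        out = vde_model(**enc)
    start_p = out.start_logits.softmax(-1).squeeze(0)  # P(token is span start)
    end_p = out.end_logits.softmax(-1).squeeze(0)      # P(token is span end)
    # Score all (start, end) pairs and keep only those with start <= end.
    scores = torch.triu(start_p.unsqueeze(1) * end_p.unsqueeze(0))
    start, end = divmod(int(scores.argmax()), scores.size(1))
    if start == 0:  # position 0 is [CLS]: the variable has no definition
        return None
    return tokenizer.decode(enc["input_ids"].squeeze(0)[start:end + 1])

def are_equivalent(definition_a, definition_b, threshold=0.5):
    """VDEJ: judge whether two definitions are equivalent."""
    enc = tokenizer(definition_a, definition_b, return_tensors="pt")  # [CLS] a [SEP] b [SEP]
    with torch.no_grad():
        prob = vdej_model(**enc).logits.softmax(-1).squeeze(0)[1]
    return float(prob) >= threshold  # the threshold value here is a placeholder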

Experiments

In VDE, we target variables in papers that are either surrounded by spaces or appear alone on the left-hand side of an equation, and we extract their definitions. For example, in the sentence “The pressure P=nRT/V is assumed constant, where T and V are the temperature and volume, respectively,” the variables P, T, and V are targets for definition extraction, while n and R are not. The target variables are assumed to be known, since rule-based methods for extracting variables have achieved an accuracy of 97% [6]. For evaluation in the VDEJ task, we target all variable definitions used in VDE and determine the equivalence of every pair of definitions taken from two different papers.
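As a toy illustration of this targeting rule (not the rule-based extractor of [6]), the following regular-expression sketch picks out single-letter symbols that stand alone between spaces or appear alone on the left-hand side of an equation; the patterns are simplified assumptions tailored to the example sentence above.

import re

def target_variables(sentence):
    """Return symbols that are targets for definition extraction."""
    targets = set()
    # Symbols standing alone on the left-hand side of an equation, e.g. "P=nRT/V".
    targets.update(re.findall(r"(?<![A-Za-z])([A-Za-z])\s*=", sentence))
    # Symbols surrounded by spaces (or followed by punctuation), e.g. "T and V are".
    targets.update(re.findall(r"(?<=\s)([A-Za-z])(?=[\s,.])", sentence))
    return targets

sentence = ("The pressure P=nRT/V is assumed constant, where T and V are "
            "the temperature and volume, respectively.")
print(target_variables(sentence))  # {'P', 'T', 'V'}; n and R are not targets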

We created a dataset related to chemical processes to train and evaluate models. First, we selected five processes with different types of variables and equations and collected 47 papers related to these processes. The five processes are the biodiesel production process (BD), continuous stirred tank reactor (CSTR), crystallization process (CRYST), Czochralski process (CZ), and shell and tube heat exchanger (STHE). We then extracted variables and their definitions from each collected paper to create the dataset Dprocess, which contains a total of 2,028 variables, 1,276 of which include definitions. For the VDE task, in addition to this dataset, we also use the Symlink dataset Dsymlink [7], one of the largest datasets related to VDE.

For fine-tuning and evaluation, we split the two datasets. Dprocess is divided into training, validation, and test sets at the paper level: the test set contains two papers for STHE and three papers for each of the other processes, one paper is used for validation, and the remaining papers are used for training. Dsymlink is divided into training, validation, and test sets at an 8:1:1 ratio, and only its training and validation sets are used for fine-tuning.
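A minimal sketch of this splitting scheme is shown below, assuming the papers and examples are held in plain Python containers; the seed and container formats are placeholders rather than details of the actual pipeline.

import random

def split_dprocess(papers_by_process, n_test_default=3, n_test_sthe=2, n_valid=1, seed=0):
    """Paper-level split of Dprocess: per process, hold out test papers,
    a validation paper, and use the remaining papers for training."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for process, papers in papers_by_process.items():
        papers = list(papers)
        rng.shuffle(papers)
        k = n_test_sthe if process == "STHE" else n_test_default
        test += papers[:k]
        valid += papers[k:k + n_valid]
        train += papers[k + n_valid:]
    return train, valid, test

def split_dsymlink(examples, seed=0):
    """8:1:1 split of Dsymlink; only train and validation are used for fine-tuning."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    n = len(examples)
    return (examples[:int(0.8 * n)],
            examples[int(0.8 * n):int(0.9 * n)],
            examples[int(0.9 * n):])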

For evaluation, we use the five models introduced above: BERTBASE, BERTLARGE, SciBERT, ProcessBERT, and ProcessOnlyBERT. Fine-tuning is conducted on an NVIDIA A100 GPU using the Adam optimizer [8] with a batch size of 16, a learning rate of 1e-5, and 5 epochs. Performance in the VDE and VDEJ tasks is evaluated by accuracy and F1 score, respectively.
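The fine-tuning configuration can be expressed, for instance, with Hugging Face's Trainer as sketched below for the VDEJ task. The model identifier, the toy definition pairs, and the use of Trainer itself are illustrative assumptions (Trainer defaults to AdamW rather than the Adam optimizer cited above), not the actual training code.

from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "allenai/scibert_scivocab_uncased"  # one of the compared models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy definition pairs standing in for the fine-tuning data.
pairs = Dataset.from_dict({
    "def_a": ["temperature of the reactor", "temperature of the reactor"],
    "def_b": ["reactor temperature", "feed flow rate"],
    "label": [1, 0],
}).map(lambda ex: tokenizer(ex["def_a"], ex["def_b"], truncation=True), batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, logits.argmax(-1))}  # F1 for VDEJ; accuracy is used for VDE

args = TrainingArguments(
    output_dir="vdej_finetune",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=1e-5,              # learning rate 1e-5
    num_train_epochs=5,              # 5 epochs
)
trainer = Trainer(model=model, args=args, train_dataset=pairs,
                  eval_dataset=pairs, tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())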

Results and Discussion

In the VDE task, SciBERT achieved the highest accuracy of 79.7%, followed closely by ProcessBERT, whose performance was almost equivalent to that of BERTBASE. ProcessOnlyBERT showed the lowest accuracy, 70.9%. In the VDEJ task, SciBERT achieved the highest F1 score of 68.0%, followed by BERTLARGE at 66.6%. The performances of ProcessBERT and ProcessOnlyBERT were nearly the same, at 66.4% and 66.1%, respectively.

In both tasks, since all models except BERTLARGE share the same architecture and the sizes of their pre-training datasets are similar, the performance differences among the models are likely attributable to the quality of their training data. ProcessBERT and ProcessOnlyBERT were trained on text extracted from XML-formatted papers; such text contains various symbols, which may have degraded performance. Gupta et al. achieved high performance by standardizing the notation of identical symbols and removing unnecessary symbols before training SciBERT on materials science-related papers [9]. Adopting a similar cleaning step may therefore enhance the performance of domain-specific models.
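As an illustration of such a cleaning step (the normalization rules below are assumptions for illustration, not the preprocessing used in [9] or in this work), one could normalize Unicode forms, strip residual markup, and standardize a few common symbol variants before pre-training:

import re
import unicodedata

# Example symbol standardizations; an actual mapping would be domain-specific.
SYMBOL_MAP = {
    "\u2212": "-",       # minus sign -> hyphen-minus
    "\u00d7": "x",       # multiplication sign -> x
    "\u00b0C": " degC",  # degree Celsius -> " degC"
}

def clean_extracted_text(text):
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms
    text = re.sub(r"<[^>]+>", " ", text)        # strip residual XML/HTML tags
    for raw, standard in SYMBOL_MAP.items():
        text = text.replace(raw, standard)
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(clean_extracted_text("a conversion of 95\u2009% at 60\u00a0\u00b0C"))
# -> "a conversion of 95 % at 60 degC"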

Domain-specific language models such as ProcessBERT and SciBERT, which were initialized with BERT parameters, outperformed ProcessOnlyBERT, indicating that the knowledge from BERT can be beneficial. Further training based on language models that perform better than BERT could yield even better results.

Conclusion

We developed a domain-specific language model, ProcessOnlyBERT, and evaluated its effectiveness on the VDE and VDEJ tasks alongside four other models: BERTBASE, BERTLARGE, SciBERT, and ProcessBERT. The evaluation on a dataset comprising 47 chemical process-related papers revealed that SciBERT achieved the highest performance and that ProcessBERT demonstrated comparable results. Future work will consider refining the training data and modifying the architecture and initial parameters of the models to further enhance the performance of domain-specific language models.

Acknowledgment

This work was supported by JST, ACT-X Grant Number JPMJAX23C5, Japan.

References

[1] S. Kato and M. Kano: Towards an Automated Physical Model Builder: CSTR Case Study, Computer Aided Chemical Engineering, 49, 1669–1674 (2022)

[2] S. Kato, K. Kanegami, and M. Kano: ProcessBERT: A Pre-trained Language Model for Judging Equivalence of Variable Definitions in Process Models, IFAC-PapersOnLine, 55-7, 957–962 (2022)

[3] J. Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT, 4171–4186 (2019)

[4] M. Neumann et al.: ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, 18th BioNLP Workshop and Shared Task, 319–327 (2019)

[5] I. Beltagy, K. Lo, and A. Cohan: SciBERT: A Pre-trained Language Model for Scientific Text, EMNLP-IJCNLP, 3615–3620 (2019)

[6] S. Kato and M. Kano: VARAT: Variable Annotation Tool for Documents on Manufacturing Processes, Authorea preprint (2023)

[7] V. Lai et al.: SemEval 2022 task 12: Symlink - Linking Mathematical Symbols to their Descriptions, SemEval-2022, 1671–1678 (2022)

[8] D. P. Kingma and J. Ba: Adam: A Method for Stochastic Optimization, arXiv preprint arXiv:1412.6980 (2014)

[9] T. Gupta et al.: MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction, npj Computational Materials, 8-1, 1–11 (2022)