2025 AIChE Annual Meeting

(510h) Achieving More Accurate Global Warming Potential Prediction Based on Multimodal Machine Learning: A Case Study of the Chemical Industry

Authors

Shanying Hu, Center for Industrial Ecology
Chemicals are a fundamental component of socio-economic systems, yet they are inextricably linked to the climate crisis and a multitude of environmental impacts. It is estimated that over 95% of manufactured products contain chemicals. Consequently, reducing the life-cycle environmental impacts of chemicals will improve the overall environmental performance of products across numerous industries.

With growing concern about the environmental risks associated with chemicals, the concept of green chemistry, whose underlying principle is to minimize the risks that chemicals pose to humans and the environment, has become widely recognized and applied in the chemical industry. Notably, the best opportunity to mitigate the environmental impact of chemical production lies in the early planning and design stage. Technologies that allow researchers to quickly obtain sustainability information about a chemical's production stage, or even its full life cycle, during early process design will therefore maximize environmental benefits.

Life cycle assessment (LCA), a standard methodology for determining the environmental impact of a product or process, has become a key technique for assessing the environmental impacts of chemicals. To date, the life-cycle environmental impacts of hundreds of typical chemicals have been studied with LCA models, covering primary chemicals, intermediates, nitrogen fertilizers, polymer materials, additives, and other chemicals. The well-established commercial LCA database Ecoinvent (v3.10) contains approximately 2,000 chemistry-related datasets. However, some 275 million chemicals (e.g., organic substances, alloys, polymers, and salts) have been disclosed in the Chemical Abstracts Service registry since the early 19th century; only bulk chemicals have been analyzed in detail in the available data, which represents merely the tip of the iceberg. At present, the rate at which the life-cycle environmental impacts of chemicals are assessed lags far behind the rate at which new chemicals appear. Moreover, life-cycle assessments of chemicals typically demand considerable labor and time to collate large quantities of life-cycle inventory (LCI) data, which is often proprietary and difficult to obtain from commercial production activities. These limitations make rapid life cycle assessment of a chemical difficult.

To address these challenges, three kinds of approaches have been developed: proxy data based on chemical engineering knowledge, LCI data supplemented by data science methods, and machine learning (ML) models based on molecular structure. Approaches grounded in chemical engineering knowledge include using process simulation tools (e.g., Aspen Plus), conducting process design calculations (e.g., computing the heat requirements of reaction (Qreact) and distillation (Qdist) from mathematical formulae), and making rough stoichiometric estimates (e.g., mass balance calculations from balanced reactions, as sketched below) to obtain approximate LCI data. This type of approach requires detailed information on the chemical process conditions and chemical engineering expertise to organize the available data, and the rougher the proxy-data simulation, the greater the uncertainty in the resulting LCA. The most representative data science approach predicts partially missing LCI data with linear regression models, which still requires a small amount of known LCI data for support. Both kinds of methods fail when little data or information is available, yet for hundreds of millions of chemicals, the absence of any process information or baseline LCI data is the norm.

Therefore, a series of ML models based on molecular structure have been developed as better alternatives; their training does not depend on the availability of LCI data for the target chemical. Nevertheless, the results of the ML models proposed to date are not yet satisfactory. Wernet et al. presented the first molecular-structure-based model of chemical inventories using neural networks in 2008 and further proposed the FineChem 1 model in 2009. After a period of relative inactivity, the introduction of deep learning models into the field by Suh et al. in 2017 prompted a surge of research. From 2019 to the present, researchers have investigated a number of dimensions, including the introduction of process descriptors, specialized models for bio-based chemicals, the search for optimal model configurations, the expansion of training datasets, and the creation of application scenarios, in an ongoing effort to develop ML models with stronger predictive capability. However, the prediction accuracy for global warming potential (GWP) illustrates the limitations of current studies: the majority report low coefficients of determination (~0.6), with only a few reaching 0.8, and this low prediction accuracy is the critical drawback limiting the wide application of related techniques. Meanwhile, the development of ML models spans multiple dimensions, including data, feature engineering, and model architecture; existing research usually improves a model from a single technical perspective, and more comprehensive ML models are lacking.
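As a concrete illustration of the stoichiometric rough estimates mentioned above, the following minimal Python sketch computes the mass of each reactant consumed per kilogram of product from a balanced reaction, assuming 100% yield; the ethylene-hydration reaction and its molar masses are illustrative choices, not data from the studies cited.

```python
# Example reaction: C2H4 + H2O -> C2H5OH (ethanol via ethylene hydration)
reactants = {"ethylene": (1, 28.05), "water": (1, 18.02)}  # (coeff, g/mol)
product = ("ethanol", 1, 46.07)                            # (name, coeff, g/mol)

def mass_inputs_per_kg_product(reactants, product):
    """kg of each reactant per kg of product, from stoichiometry alone."""
    _, p_coeff, p_mw = product
    p_mass = p_coeff * p_mw  # grams of product per balanced-equation turnover
    return {name: (coeff * mw) / p_mass  # kg reactant / kg product
            for name, (coeff, mw) in reactants.items()}

print(mass_inputs_per_kg_product(reactants, product))
# {'ethylene': ~0.609, 'water': ~0.391}; inputs sum to 1.0 kg per kg product
```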

Herein, we propose a multimodal machine learning framework for more accurate prediction of the GWP of chemicals. More than 1,200 chemical-related unit process datasets from the Ecoinvent database (v3.10), a well-established and authoritative LCA database, were selected as the dataset. The GWP100 results calculated with the IPCC 2021 methodology and stored in each unit process dataset serve as output labels, and the associated descriptive information is retained so that key descriptive information can be extracted from it.

The feature engineering part of the framework integrates three modalities: chemical features (molecular fingerprints and molecular descriptors), numerical features, and textual features. The numerical and textual features are obtained with a large language model (Qwen-72B). By incorporating LCA domain expertise, we designed a set of prompt engineering templates that extract key knowledge and standardize the mining of relevant features. Applying this prompt engineering to the large language model, we extracted 98 relevant features, of which those with high information density were turned into feature inputs via one-hot encoding and word-vector transformations.

The multi-layer perceptron (MLP) model constructed in this study adopts the following architecture and implementation strategy: two hidden layers with 512 and 128 neurons constitute the core computational structure of the network; Dropout regularization is used to prevent overfitting and improve generalization; and the input feature dimension is set to 3,125, with forward propagation realizing the nonlinear transformation of the high-dimensional features. In the experiments, the dataset is divided into a training set and a validation set at a ratio of 8:2, and an additional 10 independent samples are held out as a test set to verify generalization performance. To fully assess robustness, the study adopts a five-fold cross-validation strategy and uses the coefficient of determination (R²) as the core evaluation metric to systematically validate the prediction accuracy and stability of the model.

The model with ECFP molecular fingerprints as feature inputs was selected as the benchmark (R² = 0.65); incorporating molecular descriptors, one-hot encodings derived from text extraction, continuous variables, and word vectors each improved the model's R² to varying degrees (R² = 0.65-0.70). By combining the multimodal features and directionally selecting the training set with a Euclidean distance method, the model achieves a significant improvement in prediction accuracy (R² = 0.84). The proposed multimodal machine learning model moves beyond the earlier paradigm of relying on a single modal feature and achieves improved prediction accuracy. The aim of this research is to build new technical workflows based on machine learning and large predictive models, to broaden the current working paradigm, and to provide technical support for more accurate prediction of product GWP toward the sustainable development of the chemical industry.
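The chemical-feature modality described above might be built as in the sketch below, assuming RDKit; the radius-2 / 2048-bit ECFP settings and the two descriptors shown are illustrative assumptions, since the abstract does not report the exact configuration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

def featurize(smiles: str) -> np.ndarray:
    """Concatenate an ECFP bit vector with a few molecular descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    # ECFP fingerprint (Morgan fingerprint in RDKit terms); radius 2 ~ ECFP4
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    fp_arr = np.array(fp, dtype=np.float32)
    # Two illustrative global descriptors appended as extra features
    desc = np.array([Descriptors.MolWt(mol), Descriptors.MolLogP(mol)],
                    dtype=np.float32)
    return np.concatenate([fp_arr, desc])

print(featurize("CCO").shape)  # (2050,) under this illustrative setup
```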
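The conversion of LLM-extracted textual attributes into one-hot feature inputs might look like the following sketch; the attribute names and values are invented for illustration and are not the 98 features extracted in the study.

```python
import pandas as pd

# Invented examples of categorical attributes an LLM might extract from
# the unit-process descriptions; not the study's actual extracted features.
extracted = pd.DataFrame([
    {"feedstock": "fossil", "region": "GLO", "separation": "distillation"},
    {"feedstock": "bio",    "region": "RER", "separation": "crystallization"},
])
onehot = pd.get_dummies(extracted)  # one binary column per (attribute, value)
print(onehot.astype(int))
```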
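A minimal rendering of the MLP described above is sketched here in PyTorch (the abstract does not name the framework): input dimension 3,125, hidden layers of 512 and 128 neurons, and Dropout regularization; the ReLU activations and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class GWPRegressor(nn.Module):
    """MLP with the architecture reported above: 3125 -> 512 -> 128 -> 1."""
    def __init__(self, in_dim: int = 3125, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(p_drop),  # hidden layer 1
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(p_drop),     # hidden layer 2
            nn.Linear(128, 1),  # single regression output: predicted GWP100
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

model = GWPRegressor()
print(model(torch.randn(4, 3125)).shape)  # torch.Size([4])
```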
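The evaluation protocol, five-fold cross-validation scored with R², can be sketched with scikit-learn as below; random placeholder data and a stand-in regressor replace the real feature matrix, GWP100 labels, and model.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))  # placeholder for the real 3,125-dim features
y = rng.normal(size=300)        # placeholder for the GWP100 labels

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = MLPRegressor(hidden_layer_sizes=(512, 128), max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))

print(f"mean R² over 5 folds: {np.mean(scores):.3f}")
```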
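One plausible reading of the Euclidean-distance-based directional training-set selection is sketched below: training candidates are ranked by Euclidean distance to a target sample in feature space and the nearest k are retained; this interpretation, and the value of k, are assumptions rather than details reported in the abstract.

```python
import numpy as np

def select_training_set(X_train: np.ndarray, x_target: np.ndarray, k: int = 100):
    """Indices of the k training samples nearest (Euclidean) to x_target."""
    dists = np.linalg.norm(X_train - x_target, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 16))  # placeholder multimodal feature matrix
x_target = rng.normal(size=16)         # placeholder target chemical
idx = select_training_set(X_train, x_target, k=100)
print(idx[:5], len(idx))               # nearest-neighbor indices, 100 kept
```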