2023 AIChE Annual Meeting

Exploring Machine Learning Models for Predicting Vapor Pressure

Vapor pressure is currently predicted with fitted empirical equations, such as Antoine’s equation. The drawback of these methods is that the coefficients of the equations must be determined experimentally for each molecule and temperature range. Fits to these empirical forms do not extrapolate well outside the fitted temperature range, and the equations cannot be used to estimate vapor pressure for species without existing experimental data. To address these problems, the goal of this study is to create a machine learning model that predicts the vapor pressure of molecules. The models are built with Chemprop, a software package for creating machine learning models of molecular properties.
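For reference, Antoine’s equation is commonly written as log10(P) = A - B / (T + C), where A, B, and C are the fitted coefficients. The short Python sketch below evaluates that form. The function name `antoine_pressure` is hypothetical, and the water coefficients are commonly quoted values (P in mmHg, T in degrees Celsius, roughly 1 to 100 degrees Celsius) used only for illustration; they are not taken from the data sets in this work.

```python
def antoine_pressure(a, b, c, temperature):
    """Vapor pressure from the Antoine form log10(P) = A - B / (T + C).

    Units follow whatever convention the coefficients were fitted in.
    """
    return 10.0 ** (a - b / (temperature + c))


# Illustrative, commonly quoted coefficients for water
# (P in mmHg, T in deg C, fitted roughly over 1 to 100 deg C).
WATER = (8.07131, 1730.63, 233.426)

# Inside the fitted range: the normal boiling point.
p_boil = antoine_pressure(*WATER, 100.0)
print(f"P(100 C) = {p_boil:.1f} mmHg")

# Outside the fitted range the same coefficients are still evaluated,
# but nothing guarantees the result is physically accurate.
p_extrapolated = antoine_pressure(*WATER, 300.0)
print(f"P(300 C, extrapolated) = {p_extrapolated:.0f} mmHg")
```

At 100 degrees Celsius the result should come out close to 760 mmHg, while the second call illustrates the extrapolation concern: the formula returns a number at any temperature, whether or not the fit supports it.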

In this study, we explore several variations of model structure. The first and most basic is a standard model that predicts the pressure directly from the training data, with temperature as an added model feature. A more complex version predicts the Antoine coefficients as an intermediate result and then combines those coefficients with temperature to obtain the vapor pressure. A third option uses a separate model to predict additional properties, such as the critical temperature and pressure, and adds them as extra features alongside temperature. A fourth option adds noise to the temperature supplied as an extra feature in order to avoid overfitting the Antoine coefficients. We take data gathered from several sources (DIPPR, NIST, Yaws Handbook) and use it with Chemprop to create models with the structures outlined above. We discuss how these structural variations affect the accuracy of the resulting models, including which variations perform best when combined. In evaluating model performance, we address both use cases in which no data are available for a molecule and cases in which the test data fall outside the fitted range of a model.
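The variations above differ mainly in what is fed to the model and what is done with its output. The NumPy sketch below illustrates only that data flow; it does not use Chemprop or reproduce the implementation in this work. All function names, the noise level, and the numeric values are hypothetical placeholders.

```python
import numpy as np


def antoine_log10_pressure(coeffs, temperature):
    """Variant 2: turn predicted (A, B, C) rows into log10(P) at each temperature."""
    coeffs = np.asarray(coeffs, dtype=float)
    temperature = np.asarray(temperature, dtype=float)
    a, b, c = coeffs[..., 0], coeffs[..., 1], coeffs[..., 2]
    return a - b / (temperature + c)


def jitter_temperature(temperature, sigma, rng):
    """Variant 4: add zero-mean Gaussian noise to the temperature feature."""
    temperature = np.asarray(temperature, dtype=float)
    return temperature + rng.normal(0.0, sigma, size=temperature.shape)


def build_features(temperature, extra=None):
    """Variants 1 and 3: temperature column, optionally followed by extra
    per-sample properties (for example predicted critical temperature and pressure)."""
    columns = [np.asarray(temperature, dtype=float).reshape(-1, 1)]
    if extra is not None:
        n_rows = columns[0].shape[0]
        columns.append(np.asarray(extra, dtype=float).reshape(n_rows, -1))
    return np.hstack(columns)


rng = np.random.default_rng(0)
temps = np.array([300.0, 350.0, 400.0])

# Placeholder "predicted" critical properties, one row per sample.
critical_props = np.array([[600.0, 40.0]] * 3)

noisy_temps = jitter_temperature(temps, sigma=1.0, rng=rng)
features = build_features(noisy_temps, extra=critical_props)

# Placeholder "predicted" Antoine coefficients, one row per sample.
predicted_coeffs = np.array([[4.0, 1200.0, -50.0]] * 3)
log10_p = antoine_log10_pressure(predicted_coeffs, temps)
print(features.shape, log10_p)
```

In this sketch the noise is applied only to the input feature, so the options can be combined freely, for example jittered temperature together with extra critical-property features.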