2025 AIChE Annual Meeting

(588ca) Leveraging Machine Learning Models to Predict Thermal Properties of Organic Chemicals

Accurate prediction of thermal properties such as melting point, boiling point, and critical temperature is essential for the design, safety, and performance of chemical processes involving organic compounds. These properties are essential for determining phase transitions, chemical reactivity, volatility and generalizing the environmental fate of these chemicals. Traditionally, experimental methods are used to obtain these properties, but the process is time-consuming, expensive and impractical for the vast amount of pre-existing chemicals.

To address this, this study aims to develop and validate a machine learning (ML) model capable of predicting thermal properties using pre-existing molecular descriptors sourced from publicly available databases. Specifically, we curated a dataset from PubChem, assembling property data for over 22,000 organic compounds with experimentally measured melting points. The initial feature set for training was derived directly from molecular descriptors available in the PubChem database for majority of the chemicals.

A Pearson correlation analysis was applied to identify features with a correlation coefficient of 0.3 or higher with melting point. These included descriptors such as double bond equivalence (DBE), molecular complexity, topological polar surface area (TPSA), heavy metal count, molar weight, number of carbons in the structure, number of H-bond acceptors, number of H-bond donators, and number of oxygens in the structure as shown in Figure 1. These features were selected as the input variables for ML models.

To assess model performance, training was conducted on a range of commonly used regression algorithms including Extra Trees Regressor, Random Forest Regressor, XGBoost Regressor, Bagging Regressor, LGBM Regressor, HistGradientBoosting Regressor, MLP Regressor, K-Nearest Neighbors Regressor, and Gradient Boosting Regressor. As illustrated in Figure 2, the XGBoost model emerged as the best-performing model, achieving an R-squared value of 0.77 and an RMSE of 42.49, outperforming all other algorithms in both accuracy and computation speed.

Following the initial comparison, the XGBoost model was selected for further optimization and model generalization using nested cross-validation and hyperparameter tuning via grid search and 5-fold cross-validation. The final optimized model was evaluated using a parity plot comparing the predicted melting points to experimental values from the test dataset, with a train/test ratio of 0.9 to 0.1. The results demonstrate a strong alignment and generalization performance as shown in Figure 3.

To evaluate the generalizability of the model, it was tested on previously unseen (vaulted) data. Specifically, the trained XGBoost model was applied to predict the melting points of compounds listed in the EPA Safer Chemical Ingredients List (SCIL). As illustrated in Figure 4, the model successfully predicted melting points for 135 classified safer chemicals, achieving a RMSE of 61.64 and R-squared of 0.45.

While the model demonstrated reasonable predictive capability on this external dataset, one limitation observed was reduced accuracy for compounds with melting points above 200 C or below -50 C. This suggests that despite efforts to generalize across a broad chemical space, the model still faces challenges in accurately capturing extreme thermal behaviors. Further model refinement and inclusion of additional high- and low- melting point compounds in the training data could improve performance in these regions.

The machine learning framework developed for melting point prediction was extended to a new dataset of 1,000 organic compounds from the NIST Chemistry WebBook to predict both boiling point and critical temperature. For boiling point prediction, the XGBoost model achieved a root mean square error (RMSE) of 28.95 and an R² value of 0.9192, demonstrating excellent predictive accuracy and a strong alignment with experimental values, as shown in Figure 5. Similarly, for critical temperature prediction, the model attained an RMSE of 44.85 and an R² value of 0.9056, as illustrated in Figure 6. These results highlight the model’s ability to generalize across multiple thermal properties with consistently high performance.

To ensure robust generalization and minimize the risk of overfitting, the model underwent hyperparameter tuning and nested cross-validation, replicating the optimization strategy used for the melting point prediction model. These results demonstrate the scalability and adaptability of the machine learning framework across different thermal property predictions for organic compounds.

In summary, this study demonstrates the effectiveness of machine learning as a scalable and accurate approach for predicting the thermal properties of organic compounds, including melting point, boiling point, and critical temperature. Using a curated dataset from PubChem and NIST, and applying a variety of regression algorithms, we identified XGBoost as the most reliable model, achieving high predictive accuracy with minimal computational cost. The model was further validated on external datasets, such as the EPA Safer Chemical Ingredients List, highlighting both its strengths and limitations across different temperature ranges. By selecting structurally and statistically relevant features and applying rigorous cross-validation and hyperparameter tuning, we minimized overfitting and improved generalizability. The same modeling pipeline was successfully extended to boiling point and critical temperature predictions, emphasizing the versatility of the framework. Overall, this work provides a powerful tool for high-throughput thermal property screening, with broad applications in green chemistry, material design, and process development.