2025 AIChE Annual Meeting

(645a) Federated Learning of Molecular and Mixture Properties

Authors

Jan G. Rittig - Presenter, RWTH Aachen University
Clemens Kortmann, RWTH Aachen University
Alexander Mitsos, RWTH Aachen University
Machine learning (ML) has advanced the prediction of physicochemical properties of molecules and their mixtures in chemical engineering but is often limited by the amount of readily available property data. By capturing structure-property relationships directly from property data sets, ML models such as graph neural networks (GNNs) and transformers outperform well-established semi-empirical prediction models like UNIFAC or COSMO-RS, cf. [1-6]. To further advance molecular ML models, i.e., to increase their predictive accuracy and applicability domain, it is desirable to assemble larger data sets of molecular and mixture properties. However, openly available property data is limited; most data is owned by private entities, mainly chemical companies. These companies are reluctant to share data because it contains confidential information about the chemical species they use and because they have invested substantial time and money in collecting it through lab experiments. Yet, they share a common interest in improving the quality of predictive ML models, so that they can more efficiently explore novel, more sustainable species and optimize chemical processes.

Federated learning allows multiple entities to collaborate in training ML models without sharing their private data and is thus highly promising for advancing molecular ML models together with the chemical industry, cf. [7]. Federated learning was proposed in 2017 by McMahan et al. [8], shifting the focus from sharing data to sharing ML models. Specifically, the participating entities agree to jointly train an ML model by exchanging model updates that they compute locally on their individual data sets, hence without the need to share their data; cf. the overview in [9]. Since then, federated learning has been applied in various domains where data privacy is a main concern, e.g., training ML models for next-word prediction on the mobile keyboards of private users [10] and for drug property prediction through the collaboration of multiple pharmaceutical companies at industrial scale [11]. Further, Zhu et al. recently demonstrated the successful application of federated learning for training GNNs to predict a variety of pure component properties [12]. However, the application of federated learning to mixture property prediction has been missing so far.
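The core protocol can be illustrated with a minimal federated averaging (FedAvg [8]) sketch. This is a toy example on a synthetic linear model, not the models or data of this work: each simulated "company" runs gradient descent on its private data and shares only the updated weights, which the server averages weighted by local data-set size. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: gradient descent on squared error.
    The client never shares (X, y), only the updated weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_weights, client_data):
    """One FedAvg round: each client trains locally; the server averages
    the returned weights, weighted by local data-set size."""
    updates, sizes = [], []
    for X, y in client_data:
        updates.append(local_update(global_weights, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
# Three "companies" with private data sets of different sizes
client_data = []
for n in (50, 120, 30):
    X = rng.normal(size=(n, 2))
    client_data.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, client_data)
print(np.round(w, 2))  # converges close to true_w without any raw data exchange
```

In practice, the shared updates are neural-network parameters or gradients rather than linear-model weights, and additional safeguards (e.g., secure aggregation) may be layered on top, cf. [9].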

We propose federated learning for predicting properties of mixtures. Specifically, we consider the prediction of activity coefficients of binary mixtures at infinite dilution and varying temperatures. Based on the data set by Brouwer et al. [13], which contains about 18,000 activity coefficient data points, we create different scenarios of data distributions among multiple chemical companies, aiming to reflect real-world industrial settings. These scenarios include varying sizes of the data sets the companies hold and different distributions of the chemical space, e.g., random and scaffold-based distributions, the latter yielding heterogeneous data distributions. We also consider different forms of federated learning, e.g., the companies share either the whole model or only a part of it with each other. We compare the federated learning approach to two baselines: first, the companies decide not to collaborate and each trains a separate ML model on its private data set; second, all data is aggregated into a single data set and then used for training, which would typically be ideal for developing an ML model. Our results show that the predictive accuracy of a model trained with federated learning is superior to individual training, even if data is distributed heterogeneously; hence, each company benefits from federated learning. We also find scenarios where federated learning achieves a similar accuracy to a model trained on the whole data set. We thus demonstrate the potential of federated learning for advancing molecular ML models through collaboration within the chemical industry without the need for data sharing, making it promising for considering further properties and targeting industry-scale applications. We will provide our data scenarios, models, and code as open source.
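The two kinds of data distributions can be sketched as partition strategies. The following toy example (not the actual data pipeline of this work; all names and records are hypothetical) contrasts a random split, where clients see similar chemistry, with a group-based split, a stand-in for scaffold-based splitting, where all data points sharing a group key go to the same client so that each client covers a different region of chemical space:

```python
import random
from collections import defaultdict

def random_partition(records, n_clients, seed=0):
    """Homogeneous scenario: shuffle and deal data points round-robin."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_clients] for i in range(n_clients)]

def group_partition(records, key, n_clients):
    """Heterogeneous scenario: all points sharing a group key (e.g., a
    molecular scaffold) go to the same client."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    parts = [[] for _ in range(n_clients)]
    # Greedily assign the largest groups to the currently smallest client
    for g in sorted(groups.values(), key=len, reverse=True):
        min(parts, key=len).extend(g)
    return parts

# Toy records: (scaffold_id, solute, solvent, ln_gamma_inf)
records = [(i % 7, f"solute{i}", f"solvent{i}", 0.1 * i) for i in range(100)]
parts = group_partition(records, key=lambda r: r[0], n_clients=3)
print([len(p) for p in parts])  # unequal sizes; scaffolds never span clients
```

Under such a heterogeneous split, an individually trained model never sees other clients' scaffolds, which is one intuition for why federated training helps most in this setting.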

References

[1] Vermeire, F. H., & Green, W. H. (2021). Transfer learning for solvation free energies: From quantum chemistry to experiments. Chemical Engineering Journal, 418, 129307.

[2] Medina, E. I. S., Linke, S., Stoll, M., & Sundmacher, K. (2022). Graph neural networks for the prediction of infinite dilution activity coefficients. Digital Discovery, 1(3), 216-225.

[3] Jirasek, F., Alves, R. A., Damay, J., Vandermeulen, R. A., Bamler, R., Bortz, M., Mandt, S., Kloft, M. & Hasse, H. (2020). Machine learning in thermodynamics: Prediction of activity coefficients by matrix completion. The Journal of Physical Chemistry Letters, 11(3), 981-985.

[4] Winter, B., Winter, C., Esper, T., Schilling, J., & Bardow, A. (2023). SPT-NRTL: A physics-guided machine learning model to predict thermodynamically consistent activity coefficients. Fluid Phase Equilibria, 568, 113731.

[5] Qin, S., Jiang, S., Li, J., Balaprakash, P., Van Lehn, R. C., & Zavala, V. M. (2023). Capturing molecular interactions in graph neural networks: A case study in multi-component phase equilibrium. Digital Discovery, 2(1), 138-151.

[6] Rittig, J. G., Felton, K. C., Lapkin, A. A., & Mitsos, A. (2023). Gibbs–Duhem-informed neural networks for binary activity coefficient prediction. Digital Discovery, 2(6), 1752-1767.

[7] Dutta, S., Leal de Freitas, I., Maciel Xavier, P., Miceli de Farias, C., & Bernal Neira, D. E. (2024). Federated Learning in Chemical Engineering: A Tutorial on a Framework for Privacy-Preserving Collaboration across Distributed Data Sources. Industrial & Engineering Chemistry Research.

[8] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (pp. 1273-1282). PMLR.

[9] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., ... & Zhao, S. (2021). Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2), 1-210.

[10] Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., ... & Beaufays, F. (2018). Applied federated learning: Improving Google Keyboard query suggestions. arXiv preprint arXiv:1812.02903.

[11] Heyndrickx, W., Mervin, L., Morawietz, T., Sturm, N., Friedrich, L., Zalewski, A., ... & Ceulemans, H. (2023). MELLODDY: cross-pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information. Journal of Chemical Information and Modeling, 64(7), 2331-2344.

[12] Zhu, W., Luo, J., & White, A. D. (2022). Federated learning of molecular properties with graph neural networks in a heterogeneous setting. Patterns, 3(6).

[13] Brouwer, T., Kersten, S. R., Bargeman, G., & Schuur, B. (2021). Trends in solvent impact on infinite dilution activity coefficients of solutes reviewed and visualized using an algorithm to support selection of solvents for greener fluid separations. Separation and Purification Technology, 272, 118727.