2024 AIChE Annual Meeting

(193d) Torchsisso: A Python Package for Explainable AI Using the Sure Independence Screening and Sparsifying Operator Method with GPU Support

Authors

Muthyala, M. - Presenter, The Ohio State University
Paulson, J., The Ohio State University
First-principles models have long been a fundamental tool in scientific inquiry due to their ability to (i) provide transparency into the underlying prediction mechanism and (ii) generalize beyond the situations/data for which they have been trained (assuming an accurate model structure) [1]. However, the development of efficient and accurate first-principles models in new domains can require huge amounts of time and/or resources due to the difficulty in knowing the right level of detail to include. Data-driven artificial intelligence (AI) methods, on the other hand, aim to learn statistical relationships directly from data such that they can in essence “discover” correlations between different sets of properties without a priori knowing which ones are important. Because of this, AI has become increasingly popular across many diverse fields such as material and drug discovery, medical diagnostics, financial forecasting, natural language processing, and robotics, to name a few. Despite the many successful applications of AI, the creation of explainable and physically relevant AI models remains an important open challenge [2]. The lack of explainability typically results in a lower degree of trust in the model from decision makers (reducing its real-world impact) and makes the models more susceptible to overfitting to limited training data (significantly reducing their performance and generalizability).

Symbolic regression (SR) refers to an interesting class of explainable AI methods that look to identify optimal closed-form (nonlinear) expressions for a given target property (or output) from a set of input features that are possibly related to the target [3, 4]. Early work on SR focused on genetic algorithm-based methods [5], but more recent work has focused on sparse linear regression-based approaches. For example, the sure independence screening and sparsifying operator (SISSO) method [6] uses compressed sensing with feature expansion to perform SR. The feature expansion step proceeds by combining a set of primary features with a set of unary and binary operators until a large enough feature set is available (typically on the order of 10^7 to 10^10 features). The SIS method [7] is used to identify a small set of promising features on which one can solve a full L0-regularized regression problem to construct the final set of descriptors. SISSO has been used to learn interpretable descriptors for many properties including phase stability [8], catalyst performance [9], and glass transition temperature [10].
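
The three stages described above (feature expansion, SIS screening, and an exhaustive L0 search) can be illustrated with a minimal NumPy sketch. This is not the TorchSISSO API or the reference SISSO code: the function names, the small operator set (square, product, sum), the synthetic target, and the single-pass screening are all assumptions chosen for illustration. In particular, the published method repeats the screening on residuals as the descriptor dimension grows, which is omitted here.

```python
import itertools
import numpy as np


def expand_features(X, names):
    """Apply one round of an illustrative operator set (square, product, sum)."""
    feats = {name: X[:, i] for i, name in enumerate(names)}
    for name in names:  # unary operator: square
        feats[f"({name})^2"] = feats[name] ** 2
    for a, b in itertools.combinations(names, 2):  # binary operators
        feats[f"({a}*{b})"] = feats[a] * feats[b]
        feats[f"({a}+{b})"] = feats[a] + feats[b]
    labels = list(feats)
    return np.column_stack([feats[k] for k in labels]), labels


def sis(Phi, y, k):
    """Keep the k features whose score (proportional to the absolute
    Pearson correlation with y) is largest."""
    Phi_c = Phi - Phi.mean(axis=0)
    y_c = y - y.mean()
    scores = np.abs(Phi_c.T @ y_c) / (np.linalg.norm(Phi_c, axis=0) + 1e-12)
    return np.argsort(scores)[::-1][:k]


def l0_fit(Phi, y, candidates, n_terms):
    """Exhaustive least squares over every n_terms-subset of the candidates."""
    best = None
    ones = np.ones((len(y), 1))  # intercept column
    for combo in itertools.combinations(candidates, n_terms):
        A = np.hstack([Phi[:, list(combo)], ones])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, combo, coef)
    return best


# Synthetic, noise-free target (hypothetical): y = 2*x0*x1 - 1.5*x2^2 + 0.7
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(200, 3))
y = 2.0 * X[:, 0] * X[:, 1] - 1.5 * X[:, 2] ** 2 + 0.7

Phi, labels = expand_features(X, ["x0", "x1", "x2"])
screened = sis(Phi, y, k=8)
rmse, combo, coef = l0_fit(Phi, y, screened, n_terms=2)
model = {labels[j]: float(c) for j, c in zip(combo, coef[:-1])}
intercept = float(coef[-1])
print(model, intercept, rmse)
```

Because the exhaustive search grows combinatorially with the number of candidates, the screening step is what keeps the L0 problem tractable: here it reduces 12 expanded features to 8 before any subsets are enumerated.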

Although powerful, current implementations of SISSO face some important challenges that have prevented its widespread use. First, the original SISSO repository [11] is implemented in Fortran, making it challenging for users to install and run (especially in cloud-based computing environments). Second, the feature expansion step in [11] is hard-coded such that it cannot be directly modified. This is important because, as we show through a couple of simple examples, a potentially incomplete expansion can result in a failure to learn the true symbolic expression. Third, the combinatorial expansion of the feature space can be slow or even infeasible depending on the available memory. To address these issues, we introduce a new Python package, TorchSISSO [12], that implements an enhanced version of the SISSO method. We build our implementation on the open-source machine learning library Torch [13] so that all of the internal operations can be GPU-accelerated if desired. TorchSISSO is pip installable (i.e., pip install torchsisso) such that it can be readily installed locally or in cloud environments. Through a series of examples, we show that TorchSISSO can discover physically relevant equations up to 18x faster (and with higher accuracy) than the original SISSO implementation. We also implement a novel filtering strategy that enables application of the method to problems with high-dimensional primary feature spaces (common in material discovery problems).
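
To give a feel for why the combinatorial expansion strains memory, the following back-of-the-envelope sketch counts candidate features per expansion level. The counting rule is an assumption, not how SISSO or TorchSISSO actually enumerates features: every unary operator is applied to every current feature, every binary operator to every unordered pair, and duplicate or invalid features are not removed, so the result is only a rough estimate. The helper names and the example operator counts are hypothetical.

```python
def expanded_feature_count(n_primary, n_unary, n_binary, levels):
    """Rough feature count after `levels` rounds of expansion (assumed rule:
    keep old features, add one per unary op per feature and one per binary
    op per unordered feature pair; no duplicate or validity filtering)."""
    n = n_primary
    for _ in range(levels):
        n = n + n_unary * n + n_binary * n * (n - 1) // 2
    return n


def dense_storage_gib(n_features, n_samples, bytes_per_value=8):
    """Memory needed to hold the full feature matrix densely, in GiB."""
    return n_features * n_samples * bytes_per_value / 2**30


# Hypothetical setting: 10 primary features, 5 unary and 4 binary operators
counts = [expanded_feature_count(10, 5, 4, level) for level in (1, 2, 3)]
for level, count in zip((1, 2, 3), counts):
    print(level, count, dense_storage_gib(count, n_samples=100))
```

Under these assumptions the count grows roughly quadratically at each level, so a third expansion level already exceeds 10^10 candidates and cannot be stored densely on typical hardware.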

References:

[1] Hermann, J., DiStasio Jr, R. A., & Tkatchenko, A. (2017). First-principles models for van der Waals interactions in molecules and materials: Concepts, theory, and applications. Chemical Reviews, 117(6), 4714-4758.

[2] Angelov, P. P., Soares, E. A., Jiang, R., Arnold, N. I., & Atkinson, P. M. (2021). Explainable artificial intelligence: an analytical review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(5), e1424.

[3] Aldeia, G. S. I., & de França, F. O. (2021, June). Measuring feature importance of symbolic regression models using partial effects. In Proceedings of the genetic and evolutionary computation conference (pp. 750-758).

[4] Wang, Y., Wagner, N., & Rondinelli, J. M. (2019). Symbolic regression in materials science. MRS Communications, 9(3), 793-805.

[5] Koza, J. R. (1994). Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4, 87-112.

[6] Ouyang, R., Curtarolo, S., Ahmetcik, E., Scheffler, M., & Ghiringhelli, L. M. (2018). SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Physical Review Materials, 2(8), 083802.

[7] Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5), 849-911.

[8] Schleder, G. R., Acosta, C. M., & Fazzio, A. (2019). Exploring two-dimensional materials thermodynamic stability via machine learning. ACS applied materials & interfaces, 12(18), 20149-20157.

[9] Han, Z. K., Sarker, D., Ouyang, R., Mazheika, A., Gao, Y., & Levchenko, S. V. (2021). Single-atom alloy catalysts designed by first-principles calculations and artificial intelligence. Nature communications, 12(1), 1833.

[10] Pilania, G., Iverson, C. N., Lookman, T., & Marrone, B. L. (2019). Machine-learning-based predictive modeling of glass transition temperatures: a case of polyhydroxyalkanoate homopolymers and copolymers. Journal of Chemical Information and Modeling, 59(12), 5013-5025.

[11] https://github.com/rouyang2017/SISSO

[12] https://pypi.org/project/TorchSisso/

[13] Collobert, R., Bengio, S., & Mariéthoz, J. (2002). Torch: a modular machine learning software library.