2025 AIChE Annual Meeting

(593b) Regularized Symbolic Regression with Pystar

Authors

Radhakrishna Tumbalam Gooty, Purdue University
Nikolaos Sahinidis, Georgia Institute of Technology
Symbolic regression (SR) is a surrogate modelling technique that simultaneously determines a model's functional form and the regression parameter values that best fit the data. SR has received significant attention in recent years because it facilitates the analytical discovery of unknown functions without any prior assumptions about their functional form. Traditionally, SR has been solved with genetic programming, which, owing to its stochastic nature, requires many iterations to find an appropriate expression. Cozad and Sahinidis [1] were the first to formulate the symbolic regression problem as a Mixed-Integer Nonlinear Program (MINLP) and solve it to global optimality. However, this approach can also be computationally inefficient, especially for large datasets. To address this, Sarwar [2] developed STAR (Symbolic regression Through Algebraic Representations), an algorithm that solves a relaxed MINLP whose solution provides probabilities for randomized rounding, yielding a scalable approach to symbolic regression. Kim et al. [3] present results on various datasets obtained with PySTAR (Python Symbolic regression Through Algebraic Representations), an open-source Python implementation of STAR. Their results indicate that STAR delivers superior overall predictive performance compared to state-of-the-art SR methods such as GPLearn [4] and Operon [5].
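To make the randomized-rounding idea concrete, here is a generic sketch of the technique itself, not PySTAR's actual implementation: given a relaxed solution whose binary variables take fractional values in [0, 1], each variable is independently set to 1 with probability equal to its fractional value. The function name and the example probabilities are hypothetical.

```python
import random

def randomized_round(fractional, rng=random.random):
    """Round a relaxed 0-1 solution: each binary variable is set to 1
    with probability equal to its fractional value from the relaxation."""
    return [1 if rng() < p else 0 for p in fractional]

random.seed(0)
# e.g. relaxation-derived probabilities over candidate operators at one node
rounded = randomized_round([0.9, 0.1, 0.5])
print(rounded)  # a random 0/1 vector drawn from those probabilities
```

Repeating the rounding several times and keeping the best-scoring rounded model is the usual way such schemes trade a single expensive global solve for many cheap samples.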

Here, we present recent advances in SR and PySTAR. In our work, we use regularized objectives, including the Bayesian Information Criterion (BIC), as the model fitness metric instead of the Sum of Squared Residuals (SSR) used in previous studies, to address the bias-variance trade-off of the STAR surrogates. Regularized objectives balance predictive accuracy and model complexity by penalizing the number of non-zero parameters through a regularization term added to the traditional SSR. This leads to models that are not only accurate but also simple and interpretable. To demonstrate this regularization capability and its implementation in PySTAR, we build surrogates for critical-minerals processes using both SSR and regularized objectives, and compare them within optimization frameworks. We also perform a benchmarking study comparing STAR with other surrogate modelling techniques, including deep learning and regularized regression. Our results show that the regularized SR expressions are among the most accurate models, while their simplicity facilitates optimization and interpretability.
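As a concrete illustration of how such a regularized objective trades fit against complexity (a generic sketch, not the abstract's exact formulation): one common BIC form for a Gaussian error model is BIC = n·ln(SSR/n) + k·ln(n), where n is the number of data points and k the number of non-zero parameters. The toy numbers below are invented for illustration only.

```python
import math

def bic(ssr, n, k):
    """BIC for a Gaussian error model: n*ln(SSR/n) + k*ln(n),
    where k counts the model's non-zero parameters."""
    return n * math.log(ssr / n) + k * math.log(n)

# Toy comparison on n = 100 points: a 5-parameter model fits slightly
# better in SSR terms, but the complexity penalty favors the 2-parameter one.
n = 100
simple = bic(ssr=12.0, n=n, k=2)
complex_ = bic(ssr=11.5, n=n, k=5)
print(simple < complex_)  # → True: the simpler model wins under BIC
```

Minimizing such an objective instead of raw SSR is what steers the search toward expressions that remain short and interpretable.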

References

[1] Cozad, A. and Sahinidis, N. V. A global MINLP approach to symbolic regression. Mathematical Programming, 170:97–119, 2018.

[2] Sarwar, O. Algorithms for interpretable high-dimensional regression, Carnegie Mellon University, Pittsburgh, PA, 2022.

[3] Kim, M., Sarwar, O. and Sahinidis, N. V. STAR: Symbolic regression Through Algebraic Representations. Submitted, 2025.

[4] Stephens, T. GPLearn: Genetic Programming in Python, with a scikit-learn inspired API. https://github.com/trevorstephens/gplearn, 2017.

[5] Burlacu, B., Kronberger, G., and Kommenda, M. Operon C++: an efficient genetic programming framework for symbolic regression. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pp. 1562–1570, 2020.