Symbolic regression (SR) is a machine learning technique that seeks closed-form mathematical expressions to describe data, offering both predictive accuracy and interpretability. Unlike traditional machine learning models, SR does not assume a predefined functional form but instead discovers both the structure and the parameters of the underlying system. While neural networks and other black-box models have achieved impressive performance across many domains, they often lack interpretability, which limits their utility in scientific discovery. SR bridges this gap by providing analytical mathematical expressions that can yield insight into the governing processes of a system.
For engineering applications, SR could provide valuable insight into the kinetics of systems such as bioprocesses or catalytic reactions. Other application domains include building surrogate models of expensive experimental procedures or simulations, as well as use in mathematical programming.1
Conventional SR methods, such as those based on evolutionary algorithms2,3,4,5, evolve populations of candidate expressions to balance accuracy against model complexity. However, these methods are computationally expensive and do not exploit gradient information; they rely on stochastic search and offer no guarantee or measure of convergence. Alternative approaches using deep learning6,7, reinforcement learning8, or combinations thereof with evolutionary algorithms9,10 show promise but suffer from large search spaces and limited data efficiency. Reinforcement learning, for instance, learns policies to construct equations but often underutilizes the structure of the data.
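For illustration, a minimal sketch of an evolutionary SR search using the PySR library2; the data, operator choices, and iteration count are illustrative, not a prescription:

    import numpy as np
    from pysr import PySRRegressor

    # Synthetic data from a known target: y = 2.5*cos(x0) + x1**2
    X = np.random.uniform(-3, 3, size=(200, 2))
    y = 2.5 * np.cos(X[:, 0]) + X[:, 1] ** 2

    model = PySRRegressor(
        niterations=40,                  # generations of the evolutionary search
        binary_operators=["+", "*"],
        unary_operators=["cos", "exp"],
        maxsize=20,                      # complexity cap: accuracy vs. parsimony
    )
    model.fit(X, y)
    print(model.sympy())                 # best expression as a SymPy object

Note that the search is stochastic: repeated runs may return different expressions of comparable accuracy, which is precisely the convergence issue raised above.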
An alternative approach that has already yielded promising results is AI Feynman11, in which the authors use standard neural networks and statistical tests to decompose the original complex problem into smaller, simpler subproblems that can be solved more easily by SR. As the SR tool, they use a brute-force search combined with polynomial fitting; the brute-force search suffers severely from the curse of dimensionality and, again, makes no use of gradient information, while the polynomial fit reimposes the very predefined model structure that SR is meant to avoid.
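As a toy illustration of the decomposition idea, the numerical test below checks for translational symmetry, one of the properties AI Feynman probes; in the actual method the probed function is a neural-network surrogate fitted to the data, which we replace here by the true function for brevity:

    import numpy as np

    # Toy surrogate: in AI Feynman this would be a neural network fit to data.
    f = lambda x1, x2: np.exp(-(x1 - x2) ** 2)

    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(-2, 2, size=(2, 1000))
    delta = 0.5

    # If f(x1 + d, x2 + d) == f(x1, x2) at all sampled points, f depends only
    # on x1 - x2, reducing the task to a one-dimensional subproblem g(x1 - x2).
    residual = np.max(np.abs(f(x1 + delta, x2 + delta) - f(x1, x2)))
    print("translational symmetry detected:", residual < 1e-6)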
SR becomes particularly challenging in the context of dynamic systems, where the goal is to uncover the differential equations that govern temporal behavior. Sparse Identification of Nonlinear Dynamics12 (SINDy) is a notable method in this area, relying on a library of basis functions and sparse regression. However, owing to its dependence on numerical differentiation, its performance is highly sensitive to noise and to the choice of basis functions. Recent work by Forster et al.13 mitigates this by learning smooth approximations of the derivatives, but such approaches often revert to population-based SR, which remains difficult to interpret and tune.
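For context, a minimal PySINDy sketch on simulated, noise-free data; the library and threshold choices are illustrative and, as noted above, must be tuned carefully once noise enters the measurements:

    import numpy as np
    import pysindy as ps
    from scipy.integrate import solve_ivp

    # Simulate a damped oscillator: x' = y, y' = -0.1*y - 2*x
    rhs = lambda t, s: [s[1], -0.1 * s[1] - 2.0 * s[0]]
    t = np.linspace(0, 10, 500)
    sol = solve_ivp(rhs, (0, 10), [1.0, 0.0], t_eval=t)
    X = sol.y.T

    model = ps.SINDy(
        feature_library=ps.PolynomialLibrary(degree=2),  # fixed basis: a key sensitivity
        optimizer=ps.STLSQ(threshold=0.05),              # sparsity-promoting regression
    )
    model.fit(X, t=t)   # derivatives estimated numerically from X internally
    model.print()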
In this work, we address key limitations of existing SR methods by combining expressive neural networks with mathematical simplification techniques. Specifically, we leverage Kolmogorov–Arnold Networks14 (KANs), a recently proposed class of neural networks, and their gradient-based optimization to navigate the SR search space efficiently. KANs have already shown great potential for SR thanks to their expressivity and trainable activation functions, yet they have limitations: as network complexity (width and depth) grows, the trained activation functions often no longer correspond to simple mathematical terms that can be easily extracted, and interpretability drops substantially. By decomposing complex expressions into smaller subproblems, inspired by the AI Feynman11 approach, and solving them with KANs instead of brute-force search, we improve scalability and accuracy. Furthermore, our framework extends naturally to dynamic systems via Neural ODEs15,16, avoiding explicit derivative estimation and enabling end-to-end training directly from time-series data, as demonstrated with KANs by Koenig et al.17, while still retaining the ability to decompose the original complex problem into smaller subproblems.
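The sketch below illustrates the kind of end-to-end training this builds on: a KAN vector field trained through a differentiable ODE solve, in the spirit of Koenig et al.17. It is a minimal sketch, not our implementation; it assumes the pykan14 and torchdiffeq15 packages, whose exact API names may differ across versions:

    import torch
    from torchdiffeq import odeint  # differentiable ODE solver (Chen et al.15)
    from kan import KAN             # pykan (Liu et al.14); API may vary by version

    # Ground-truth pendulum system, used only to generate synthetic observations.
    def true_rhs(t, x):
        return torch.stack([x[..., 1], -torch.sin(x[..., 0])], dim=-1)

    x0 = torch.tensor([[1.0, 0.0]])
    t = torch.linspace(0.0, 10.0, 100)
    with torch.no_grad():
        x_obs = odeint(true_rhs, x0, t)      # trajectory, shape (100, 1, 2)

    # KAN as the learnable vector field dx/dt = f_KAN(x).
    kan_field = KAN(width=[2, 5, 2], grid=5, k=3)
    rhs = lambda t, x: kan_field(x)

    opt = torch.optim.Adam(kan_field.parameters(), lr=1e-3)
    for step in range(500):
        opt.zero_grad()
        x_pred = odeint(rhs, x0, t)          # forward ODE solve inside the graph
        loss = ((x_pred - x_obs) ** 2).mean()
        loss.backward()                      # gradients flow through the solver:
        opt.step()                           # no explicit derivative estimation

Because the loss is computed on the states themselves rather than on estimated derivatives, no numerical differentiation of noisy time series is required, in contrast to SINDy-style approaches.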
Preliminary results show great potential in terms of the overall computational time needed to reach accurate solutions. Tests are currently being performed on the SRSD-Feynman datasets18, which consist of equations from physics problems, as well as on synthetic bioprocess datasets. Some limitations remain, however, particularly in handling nested multiplications in the data: these are difficult for the decomposition algorithms to discover and nearly impossible for standard KANs to represent compactly. Future work will focus on optimizing the algorithm and extending its capabilities to identify and model multiplicative interactions more effectively.
1. Daoutidis, P. et al. Machine learning in process systems engineering: Challenges and opportunities. Comput Chem Eng 181, (2024).
2. Cranmer, M. Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl. (2023).
3. Guimerà, R. et al. A Bayesian Machine Scientist to Aid in the Solution of Challenging Scientific Problems. https://www.science.org (2020).
4. Cozad, A., Sahinidis, N. V. & Miller, D. C. Learning surrogate models for simulation-based optimization. AIChE Journal 60, 2211–2227 (2014).
6. Kamienny, P.-A., d’Ascoli, S., Lample, G. & Charton, F. End-to-end symbolic regression with transformers. (2022).
7. Mežnar, S., Džeroski, S. & Todorovski, L. Efficient Generator of Mathematical Expressions for Symbolic Regression. (2023) doi:10.1007/s10994-023-06400-2.
8. Petersen, B. K. et al. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. (2019).
9. Mundhenk, T. N. et al. Symbolic Regression via Neural-Guided Genetic Programming Population Seeding. (2021).
10. Landajuela, M. et al. A Unified Framework for Deep Symbolic Regression. in Proceedings of the 36th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, USA, 2022).
11. Udrescu, S.-M. & Tegmark, M. AI Feynman: A Physics-Inspired Method for Symbolic Regression. https://www.science.org (2020).
12. Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data: Sparse identification of nonlinear dynamical systems. (2015) doi:10.1073/pnas.1517384113.
13. Forster, T., Vázquez, D., Müller, C. & Guillén-Gosálbez, G. Machine learning uncovers analytical kinetic models of bioprocesses. Chem Eng Sci 300, (2024).
14. Liu, Z. et al. KAN: Kolmogorov-Arnold Networks. (2024).
15. Chen, R. T. Q., Rubanova, Y., Bettencourt, J. & Duvenaud, D. Neural Ordinary Differential Equations. (2018).
16. Kidger, P. On Neural Differential Equations. (2022).
17. Koenig, B. C., Kim, S. & Deng, S. KAN-ODEs: Kolmogorov-Arnold Network Ordinary Differential Equations for Learning Dynamical Systems and Hidden Physics. (2024) doi:10.1016/j.cma.2024.117397.
18. Matsubara, Y., Chiba, N., Igarashi, R. & Ushiku, Y. Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery. (2022).