Symbolic regression (SR) [1, 2] aims to discover concise, interpretable mathematical expressions that describe relationships in data – offering a compelling alternative to black-box machine learning (ML) models (such as neural networks and gradient-boosted decision trees), particularly in scientific and engineering domains where transparency and extrapolation are critical. Unlike conventional regression, SR does not assume a fixed functional form but instead searches over symbolic expressions to reveal governing equations or empirical laws. This flexibility makes SR attractive for tasks such as physical law discovery, dynamical system identification, and property prediction.
Over the past decade, a wide range of SR methods have emerged. Early approaches based on genetic programming (GP) [3, 4] are expressive but computationally expensive and prone to bloat. More recent neural-guided SR methods [5, 6] leverage deep learning to scale to larger problems but often require extensive training data and lack robustness in noisy or low-data settings. A third family of methods reformulates SR as sparse regression over a constructed feature library (e.g., ALAMO [7], SISSO [8], and SINDy [9]), enabling greater control over model complexity. However, these approaches still struggle with hyperparameter tuning, correlated features, and navigating the trade-off between model accuracy and interpretability.
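To make the sparse-regression family concrete, the following minimal sketch fits a toy target by sequentially thresholded least squares over a small polynomial feature library, in the spirit of SINDy [9]. The library contents, the threshold value, the toy data, and the helper names `build_library` and `stlsq` are illustrative assumptions; this is not the implementation of ALAMO, SISSO, or SINDy.

```python
import numpy as np

def build_library(X):
    """Return candidate features and their names for a two-column input (hypothetical library)."""
    x1, x2 = X[:, 0], X[:, 1]
    feats = [np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2]
    names = ["1", "x1", "x2", "x1^2", "x1*x2", "x2^2"]
    return np.column_stack(feats), names

def stlsq(Theta, y, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small coefficients, refit."""
    coef = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit using only the features that survived thresholding.
        coef[active] = np.linalg.lstsq(Theta[:, active], y, rcond=None)[0]
    return coef

# Toy, noise-free data generated from y = 2*x1 - 0.5*x1*x2 (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 2.0 * X[:, 0] - 0.5 * X[:, 0] * X[:, 1]

Theta, names = build_library(X)
coef = stlsq(Theta, y)
model = {n: round(float(c), 3) for n, c in zip(names, coef) if c != 0.0}
print(model)
```

The threshold in this sketch is the kind of hyperparameter that has to be tuned in practice: set too low, spurious terms survive; set too high, true terms are removed.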
To overcome these limitations, we introduce SyMANTIC (Symbolic Modeling with Adaptive iNtelligent feaTure expansIon) [10], a fast, flexible, and interpretable SR framework designed for data-driven scientific discovery. SyMANTIC builds on the sparse regression paradigm but introduces three key innovations: (1) a recursive, information-theoretic feature expansion strategy guided by maximal information coefficient (MIC) screening; (2) a modified sparse regression formulation based on L0 regularization with adaptive constraints on model dimensionality; and (3) a model complexity metric inspired by minimum description length (MDL), used to rank expressions along an approximate Pareto frontier so that the selected models better balance predictive accuracy and complexity. These components are integrated into a GPU-accelerated PyTorch framework with automated hyperparameter tuning, enabling rapid and robust symbolic modeling with minimal user intervention.
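As a rough illustration of the last two ideas, the sketch below performs L0-constrained regression by exhaustive subset search over a tiny feature library and then keeps only the models on the accuracy-versus-complexity Pareto front. It makes several simplifying assumptions: complexity is just the number of terms (not the MDL-inspired metric), the library is fixed rather than recursively expanded with MIC screening, and the names `best_subset` and `pareto_front` are hypothetical. It is not SyMANTIC's algorithm or API.

```python
import itertools
import numpy as np

def best_subset(Theta, y, k):
    """Return (rmse, column indices, coefficients) of the best k-term least-squares fit."""
    best = (np.inf, None, None)
    for idx in itertools.combinations(range(Theta.shape[1]), k):
        cols = list(idx)
        coef = np.linalg.lstsq(Theta[:, cols], y, rcond=None)[0]
        rmse = float(np.sqrt(np.mean((Theta[:, cols] @ coef - y) ** 2)))
        if rmse < best[0]:
            best = (rmse, idx, coef)
    return best

def pareto_front(candidates, tol=1e-8):
    """Keep a model only if it is more accurate (by more than tol) than every simpler one."""
    front, best_err = [], np.inf
    for complexity, rmse, idx in sorted(candidates, key=lambda c: (c[0], c[1])):
        if rmse < best_err - tol:
            front.append((complexity, rmse, idx))
            best_err = rmse
    return front

# Toy, noise-free data from y = 3*x^2 + sin(x) (assumed for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=100)
names = ["x", "x^2", "sin(x)", "exp(x)"]
Theta = np.column_stack([x, x**2, np.sin(x), np.exp(x)])
y = 3.0 * x**2 + np.sin(x)

candidates = []
for k in range(1, 4):  # dimensionality constraint: at most 3 terms
    rmse, idx, _ = best_subset(Theta, y, k)
    candidates.append((k, rmse, idx))

front = pareto_front(candidates)
for complexity, rmse, idx in front:
    print(complexity, [names[i] for i in idx], f"rmse={rmse:.2e}")
```

Exhaustive search is tractable here only because the library has four features. The cost grows combinatorially with library size, which is why screening candidate features before the sparse regression step matters.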
We benchmark SyMANTIC across a diverse set of problems, including synthetic expressions, physical equations, real-world chemical and materials property prediction, and chaotic dynamical systems. On standard symbolic regression benchmarks, SyMANTIC recovers over 95% of ground-truth equations, significantly outperforming state-of-the-art baselines that recover just over 50%. It also produces more concise models (30–40% fewer terms on average) and achieves these results with an order-of-magnitude speedup relative to existing methods. In the challenging task of learning the chaotic Lorenz system from limited data, SyMANTIC accurately recovers governing equations using just five time points – surpassing popular approaches like SINDy [9]. In a materials property prediction task relevant to sustainable battery design, SyMANTIC outperforms both symbolic and black-box ML models in predictive accuracy, while yielding compact, interpretable expressions. Together, these results establish SyMANTIC as a high-performance, user-friendly SR tool for interpretable model discovery in noisy, high-dimensional, and data-scarce scientific domains.
References:
[1] Makke, N., & Chawla, S. (2024). Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review, 57(1), 2.
[2] Angelis, D., Sofos, F., & Karakasidis, T. E. (2023). Artificial intelligence in physical sciences: Symbolic regression trends and perspectives. Archives of Computational Methods in Engineering, 30(6), 3845-3865.
[3] Koza, J. R. (1994). Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4, 87-112.
[4] Cranmer, M. (2023). Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582.
[5] Petersen, B. K., Landajuela, M., Mundhenk, T. N., Santiago, C. P., Kim, S. K., & Kim, J. T. (2019). Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv preprint arXiv:1912.04871.
[6] Kamienny, P. A., d'Ascoli, S., Lample, G., & Charton, F. (2022). End-to-end symbolic regression with transformers. Advances in Neural Information Processing Systems, 35, 10269-10281.
[7] Wilson, Z. T., & Sahinidis, N. V. (2017). The ALAMO approach to machine learning. Computers & Chemical Engineering, 106, 785-795.
[8] Ouyang, R., Curtarolo, S., Ahmetcik, E., Scheffler, M., & Ghiringhelli, L. M. (2018). SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Physical Review Materials, 2(8), 083802.
[9] Brunton, S. L., Proctor, J. L., & Kutz, J. N. (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15), 3932-3937.
[10] Muthyala, M. R., Sorourifar, F., Peng, Y., & Paulson, J. A. (2025). SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond. Industrial & Engineering Chemistry Research.