2025 AIChE Annual Meeting
(469c) SyMANTIC: A Symbolic Regression Framework for Scalable, Interpretable, and Efficient Model Discovery
Over the past decade, a wide range of symbolic regression (SR) methods has emerged. Early approaches based on genetic programming (GP) [3, 4] are expressive but computationally expensive and prone to bloat. More recent neural-guided SR methods [5, 6] leverage deep learning to scale to larger problems but often require extensive training data and lack robustness in noisy or low-data settings. A third family of methods reformulates SR as sparse regression over a constructed feature library (e.g., ALAMO [7], SISSO [8], and SINDy [9]), enabling greater control over model complexity. However, these approaches still struggle with hyperparameter tuning, correlated features, and the trade-off between model accuracy and interpretability.
To overcome these limitations, we introduce SyMANTIC (Symbolic Modeling with Adaptive iNtelligent feaTure expansIon) [10], a fast, flexible, and interpretable SR framework designed for data-driven scientific discovery. SyMANTIC builds on the sparse regression paradigm but introduces several key innovations: (1) a recursive, information-theoretic feature expansion strategy guided by Maximal Information Coefficient (MIC) screening; (2) a modified sparse regression formulation based on L0 regularization with adaptive constraints on model dimensionality; and (3) a model complexity metric inspired by the minimum description length (MDL) principle, used to rank expressions along an approximate Pareto frontier and thereby identify models that better balance predictive accuracy and complexity. These components are integrated into a GPU-accelerated PyTorch framework with automated hyperparameter tuning, enabling rapid and robust symbolic modeling with minimal user intervention.
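To make the three ingredients concrete, the following is a minimal NumPy sketch on toy data, not the SyMANTIC implementation. Everything in it is an assumption made for illustration: the synthetic target, the single (non-recursive) expansion level, the use of |Pearson r| as a cheap stand-in for MIC screening, exhaustive subset enumeration as the L0-constrained fit, and a crude term-plus-operator count in place of the MDL-inspired complexity metric. All function names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2*x0*x1 + 3*x2^2, with x3 irrelevant.
X = rng.uniform(-1.0, 1.0, size=(200, 4))
y = 2.0 * X[:, 0] * X[:, 1] + 3.0 * X[:, 2] ** 2
primary = {f"x{i}": X[:, i] for i in range(X.shape[1])}


def expand(features):
    """One expansion level: originals plus squares and pairwise products."""
    out = dict(features)
    for name, col in features.items():
        out[f"{name}^2"] = col ** 2
    for (a, ca), (b, cb) in itertools.combinations(features.items(), 2):
        out[f"{a}*{b}"] = ca * cb
    return out


def screen(features, target, keep):
    """Keep the `keep` features with the largest |Pearson r| to the target.
    Assumption: |r| is only a stand-in for an MIC score."""
    score = {n: abs(np.corrcoef(c, target)[0, 1]) for n, c in features.items()}
    top = sorted(score, key=score.get, reverse=True)[:keep]
    return {n: features[n] for n in top}


def complexity(terms):
    """Crude proxy (assumption): number of terms plus number of operators."""
    return len(terms) + sum(t.count("*") + t.count("^") for t in terms)


def l0_candidates(features, target, max_terms):
    """L0-constrained least squares by enumeration: fit every subset of
    at most `max_terms` features (plus an intercept)."""
    cands = []
    for k in range(1, max_terms + 1):
        for terms in itertools.combinations(features, k):
            A = np.column_stack([np.ones_like(target)] + [features[t] for t in terms])
            coef, *_ = np.linalg.lstsq(A, target, rcond=None)
            rmse = float(np.sqrt(np.mean((A @ coef - target) ** 2)))
            cands.append({"terms": terms, "coef": coef, "rmse": rmse,
                          "complexity": complexity(terms)})
    return cands


def pareto_front(cands, tol=1e-9):
    """Keep models that lower the RMSE by more than `tol` relative to
    every model of equal or lower complexity."""
    front, best = [], np.inf
    for c in sorted(cands, key=lambda c: (c["complexity"], c["rmse"])):
        if c["rmse"] < best - tol:
            front.append(c)
            best = c["rmse"]
    return front


screened = screen(expand(primary), y, keep=6)
front = pareto_front(l0_candidates(screened, y, max_terms=3))
for m in front:
    print(m["complexity"], m["terms"], f"rmse={m['rmse']:.3g}")
```

Enumeration is tractable here only because screening shrinks the library to six candidates; the tolerance in `pareto_front` keeps floating-point noise from admitting needlessly larger models onto the frontier.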
We benchmark SyMANTIC across a diverse set of problems, including synthetic expressions, physical equations, real-world chemical and materials property prediction, and chaotic dynamical systems. On standard symbolic regression benchmarks, SyMANTIC recovers over 95% of ground-truth equations, significantly outperforming state-of-the-art baselines that recover just over 50%. It also produces more concise models (30–40% fewer terms on average) and achieves these results with an order-of-magnitude speedup relative to existing methods. In the challenging task of learning the chaotic Lorenz system from limited data, SyMANTIC accurately recovers the governing equations using just five time points, surpassing popular approaches such as SINDy [9]. In a materials property prediction task relevant to sustainable battery design, SyMANTIC outperforms both symbolic and black-box ML models in predictive accuracy while yielding compact, interpretable expressions. Together, these results establish SyMANTIC as a high-performance, user-friendly SR tool for interpretable model discovery in noisy, high-dimensional, and data-scarce scientific domains.
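For readers unfamiliar with the Lorenz task, the sketch below shows what "recovering governing equations" means in the sparse regression setting, using sequentially thresholded least squares in the style of SINDy [9]. It is a baseline illustration only: it is not SyMANTIC, it does not reproduce the five-time-point experiment or any reported result, and it assumes a long trajectory (1001 samples) with exact, noise-free derivatives. The degree-2 polynomial library, the threshold, the initial state, and all function names are assumptions chosen for illustration.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz parameters


def lorenz(s):
    """Right-hand side of the Lorenz system for state s = (x, y, z)."""
    x, y, z = s
    return np.array([SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z])


def rk4_trajectory(s0, dt, n_steps):
    """Integrate the Lorenz system with classical fourth-order Runge-Kutta."""
    traj = [np.asarray(s0, dtype=float)]
    for _ in range(n_steps):
        s = traj[-1]
        k1 = lorenz(s)
        k2 = lorenz(s + 0.5 * dt * k1)
        k3 = lorenz(s + 0.5 * dt * k2)
        k4 = lorenz(s + dt * k3)
        traj.append(s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(traj)


NAMES = ["1", "x", "y", "z", "x^2", "x*y", "x*z", "y^2", "y*z", "z^2"]


def library(S):
    """Degree-2 polynomial feature library evaluated on states S of shape (n, 3)."""
    x, y, z = S.T
    return np.column_stack([np.ones_like(x), x, y, z,
                            x * x, x * y, x * z, y * y, y * z, z * z])


def stlsq(Theta, dXdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero out small
    coefficients, refit on the surviving columns, and repeat."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dXdt.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                Xi[keep, j] = np.linalg.lstsq(Theta[:, keep], dXdt[:, j], rcond=None)[0]
    return Xi


S = rk4_trajectory([1.0, 1.0, 1.0], dt=0.01, n_steps=1000)
dS = np.array([lorenz(s) for s in S])  # assumption: exact, noise-free derivatives
Xi = stlsq(library(S), dS)

# Print each recovered equation as a sum of the surviving library terms.
for j, lhs in enumerate(["dx/dt", "dy/dt", "dz/dt"]):
    rhs = " + ".join(f"{Xi[i, j]:.3f}*{NAMES[i]}" for i in range(len(NAMES)) if Xi[i, j] != 0.0)
    print(f"{lhs} = {rhs}")
```

With only five samples, the ten-column library makes the initial least-squares fit underdetermined, which is one reason the limited-data setting is considered challenging for this kind of baseline.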
References:
[1] Makke, N., & Chawla, S. (2024). Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review, 57(1), 2.
[2] Angelis, D., Sofos, F., & Karakasidis, T. E. (2023). Artificial intelligence in physical sciences: Symbolic regression trends and perspectives. Archives of Computational Methods in Engineering, 30(6), 3845-3865.
[3] Koza, J. R. (1994). Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4, 87-112.
[4] Cranmer, M. (2023). Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582.
[5] Petersen, B. K., Landajuela, M., Mundhenk, T. N., Santiago, C. P., Kim, S. K., & Kim, J. T. (2019). Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv preprint arXiv:1912.04871.
[6] Kamienny, P. A., d'Ascoli, S., Lample, G., & Charton, F. (2022). End-to-end symbolic regression with transformers. Advances in Neural Information Processing Systems, 35, 10269-10281.
[7] Wilson, Z. T., & Sahinidis, N. V. (2017). The ALAMO approach to machine learning. Computers & Chemical Engineering, 106, 785-795.
[8] Ouyang, R., Curtarolo, S., Ahmetcik, E., Scheffler, M., & Ghiringhelli, L. M. (2018). SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Physical Review Materials, 2(8), 083802.
[9] Brunton, S. L., Proctor, J. L., & Kutz, J. N. (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15), 3932-3937.
[10] Muthyala, M. R., Sorourifar, F., Peng, Y., & Paulson, J. A. (2025). SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond. Industrial & Engineering Chemistry Research.