(103g) Subset Selection in Multiple Linear Regression
2014 AIChE Annual Meeting, Computing and Systems Technology Division
Session: Advances in Data Analysis: Theory and Applications
The purpose of this paper is to present a systematic analysis of new and existing approaches to the subset selection problem encountered in ALAMO [1]. The same subset selection problem arises naturally in a variety of applications and has been studied extensively in the machine learning and statistics literatures [2, 3]. Yet, an effective solution approach remains elusive due to the problem's highly combinatorial and nonlinear nature. In practice, greedy stepwise heuristics are often used to produce a well-fitting subset of regression variables [2]. These heuristics typically rely on a model fitness metric, such as Akaike's information criterion (AIC) or Mallows' Cp, to define a stopping point. We compare these stepwise heuristics, exhaustive search algorithms [3], and newly proposed direct optimization of integer programming formulations for several different model selection criteria. For this purpose, we use a large test set with problems from a variety of applications.
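To make the heuristic approach concrete, the following is a minimal sketch of greedy forward stepwise selection with an AIC stopping rule, one of the heuristic families discussed above. The function name and the particular AIC form (the Gaussian log-likelihood expression up to an additive constant) are illustrative choices, not taken from the paper.

```python
import numpy as np

def aic(rss, n, k):
    # AIC for Gaussian errors, up to an additive constant:
    # n * log(RSS / n) + 2 * (number of fitted parameters)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    """Greedily add the regressor that most improves AIC; stop when none does."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    # Start from the intercept-only model (one fitted parameter).
    rss0 = np.sum((y - y.mean()) ** 2)
    best_aic = aic(rss0, n, 1)
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            scores.append((aic(rss, n, len(cols) + 1), j))
        cand_aic, cand_j = min(scores)
        if cand_aic >= best_aic:
            break  # no single addition improves AIC: stop
        best_aic = cand_aic
        selected.append(cand_j)
        remaining.remove(cand_j)
    return selected
```

Because each step commits to the locally best regressor, the heuristic can miss the globally optimal subset, which is precisely the gap that the exhaustive and integer-programming approaches compared in the paper aim to close.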
References
[1] Cozad, A., N. V. Sahinidis, and D. C. Miller (2014). Automatic learning of algebraic models for optimization. AIChE Journal, 60, 2211-2227.
[2] Miller, A. J. (1990). Subset Selection in Regression. London: Chapman and Hall.
[3] Furnival, G. M. and R. W. Wilson (1974). Regression by leaps and bounds. Technometrics, 16, 499-511.