Selection of optimal regression models via cross-validation



A general problem in the development of regression models is the selection of the optimal model. Whenever a feature selection procedure (such as step forward, backward elimination, best subset, or all possible combinations) or a data compression approach (such as principal components or partial least-squares regression) is used, the question of how many regression terms to include in the final model must be addressed.

This work evaluates four criteria for selecting the optimal predictive regression model by cross-validation. The results illustrate the problems that can arise in the analysis of small or inadequately sampled data sets. The common approach, selecting the model that yields the absolute minimum in the predictive residual error sum of squares (PRESS), was found to have particularly poor statistical properties. A very simple change, to a criterion based on the first local minimum in PRESS, provides a significant improvement in the cross-validation result. A criterion based on testing the significance of incremental changes in PRESS with an F-test may provide more robust performance than the first local minimum criterion.
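
As an illustration of the difference between the absolute-minimum and first-local-minimum criteria, the following is a minimal sketch rather than the procedure used in this work. It assumes leave-one-out cross-validation and ordinary least squares on the first k columns of a predictor matrix (for example, principal component scores ordered by variance). The function names `loo_press`, `global_minimum`, and `first_local_minimum` are hypothetical, and the F-test criterion is not sketched because its exact form is not stated here.

```python
import numpy as np

def loo_press(X, y, max_terms):
    """Leave-one-out PRESS for models using the first k columns of X, k = 1..max_terms.

    Assumption: each model is an ordinary least-squares fit with an intercept.
    """
    n = len(y)
    press = np.zeros(max_terms)
    for k in range(1, max_terms + 1):
        for i in range(n):
            keep = np.arange(n) != i                      # hold out sample i
            A = np.column_stack([np.ones(n - 1), X[keep, :k]])
            coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
            pred = np.concatenate([[1.0], X[i, :k]]) @ coef
            press[k - 1] += (y[i] - pred) ** 2            # squared prediction error
    return press

def global_minimum(press):
    """Number of terms at the absolute minimum of PRESS."""
    return int(np.argmin(press)) + 1

def first_local_minimum(press):
    """Number of terms at the first point where PRESS stops decreasing."""
    for k in range(len(press) - 1):
        if press[k + 1] >= press[k]:
            return k + 1
    return len(press)

# Made-up PRESS curve for 1..5 terms, chosen so the two criteria disagree.
press = np.array([10.0, 6.0, 7.0, 5.0, 8.0])
print(global_minimum(press))       # 4
print(first_local_minimum(press))  # 2
```

On this made-up curve the absolute minimum selects four terms, while the first local minimum stops at two. The later dip may reflect chance variation in a small data set rather than real predictive gain.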