The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration



Classical least squares (CLS) and partial least squares (PLS) are two common multivariate regression algorithms in chemometrics. This paper presents an asymptotically exact mathematical analysis of the mean squared error of prediction of CLS and PLS under the linear mixture model commonly assumed in spectroscopy. For CLS regression with a very large calibration set, the root mean squared error is approximately equal to the noise per wavelength divided by the length of the net analyte signal vector. It is shown, however, that for a finite training set with n samples in p dimensions there are additional error terms that depend on σ²p²/n², where σ is the noise level per co-ordinate. Therefore, in the ‘large p, small n’ regime common in spectroscopy, these terms can be quite large and can even dominate the overall prediction error. It is demonstrated, both theoretically and by simulations, that dimensional reduction of the input data via their compact representation with a few features, selected for example by adaptive wavelet compression, can substantially decrease these effects and recover the asymptotic error. This analysis provides a theoretical justification for the need to perform feature selection (dimensional reduction) of the input data prior to application of multivariate regression algorithms. Copyright © 2005 John Wiley & Sons, Ltd.
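
The finite-sample effect described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the simulation used in the paper: the three Gaussian-shaped pure spectra, the uniform concentrations, the parameter values and all function names are hypothetical, and simple block averaging of adjacent wavelengths stands in for the adaptive wavelet compression mentioned in the abstract. It compares the asymptotic value σ divided by the length of the net analyte signal with the CLS prediction error for a large calibration set, for a small one, and for the small one after dimensional reduction.

```python
import numpy as np

def net_analyte_signal(V, j=0):
    """Component of the pure spectrum V[j] orthogonal to the other rows of V."""
    others = np.delete(V, j, axis=0)
    coef, *_ = np.linalg.lstsq(others.T, V[j], rcond=None)
    return V[j] - others.T @ coef

def simulate(V, sigma, n, rng):
    """Draw n samples from the linear mixture model X = C V + sigma * noise."""
    C = rng.uniform(0.0, 1.0, size=(n, V.shape[0]))  # assumed concentration law
    X = C @ V + sigma * rng.standard_normal((n, V.shape[1]))
    return C, X

def cls_predict(C_train, X_train, X_test):
    """CLS: estimate the pure spectra, then regress test spectra on them."""
    V_hat, *_ = np.linalg.lstsq(C_train, X_train, rcond=None)      # k x p
    C_hat_T, *_ = np.linalg.lstsq(V_hat.T, X_test.T, rcond=None)   # k x n_test
    return C_hat_T.T

def bin_features(X, width):
    """Stand-in for feature selection: sum blocks of adjacent wavelengths,
    scaled so that the noise level per feature stays equal to sigma."""
    n, p = X.shape
    return X.reshape(n, p // width, width).sum(axis=2) / np.sqrt(width)

def rmsep(C_hat, C_true, j=0):
    """Root mean squared error of prediction for analyte j."""
    return float(np.sqrt(np.mean((C_hat[:, j] - C_true[:, j]) ** 2)))

def run_demo(seed=0, p=200, sigma=0.2, n_small=10, n_large=5000,
             width=8, reps=20):
    rng = np.random.default_rng(seed)
    grid = np.arange(p)
    # Hypothetical pure spectra: three overlapping Gaussian peaks.
    V = np.stack([np.exp(-(grid - c) ** 2 / (2 * 15.0 ** 2))
                  for c in (60, 100, 140)])
    asymptotic = sigma / np.linalg.norm(net_analyte_signal(V))

    C_test, X_test = simulate(V, sigma, 2000, rng)
    C_big, X_big = simulate(V, sigma, n_large, rng)
    large_n = rmsep(cls_predict(C_big, X_big, X_test), C_test)

    raw, binned = [], []
    for _ in range(reps):  # average over independent small calibration sets
        C_tr, X_tr = simulate(V, sigma, n_small, rng)
        raw.append(rmsep(cls_predict(C_tr, X_tr, X_test), C_test))
        binned.append(rmsep(
            cls_predict(C_tr, bin_features(X_tr, width),
                        bin_features(X_test, width)), C_test))
    return {"asymptotic": asymptotic, "large_n": large_n,
            "small_n_raw": float(np.mean(raw)),
            "small_n_binned": float(np.mean(binned))}

if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name:>15s}: {value:.4f}")
```

In this sketch the error for the large calibration set should lie close to the asymptotic value, while the small calibration set in p = 200 dimensions should give a noticeably larger error that is reduced again once the spectra are compressed to p/width features. The block averaging works here only because the assumed spectra are smooth; it is not a substitute for a data-adaptive feature selection scheme.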