On estimating model complexity and prediction errors in multivariate calibration: generalized resampling by random sample weighting (RSW)

Authors

  • L. Xu,

    1. College of Pharmacology, Dali University, Dali 671003, P.R. China
  • Q.-S. Xu,

    1. School of Mathematical Sciences, Central South University, Changsha 410083, P.R. China
    2. State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P.R. China
  • M. Yang,

    1. College of Pharmacology, Dali University, Dali 671003, P.R. China
  • H.-Z. Zhang,

    1. College of Pharmacology, Dali University, Dali 671003, P.R. China
  • C.-B. Cai,

    1. Department of Chemistry and Life Science, Chuxiong Normal University, Chuxiong 675000, P.R. China
  • J.-H. Jiang,

    1. State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P.R. China
  • H.-L. Wu,

    1. State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P.R. China
  • R.-Q. Yu

    Corresponding author
    1. State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P.R. China

Abstract

The present paper focuses on determining the number of PLS components by resampling methods such as cross validation (CV), Monte Carlo cross validation (MCCV) and bootstrapping (BS). To resample the training data, random non-negative weights are assigned to the original training samples and a sample-weighted PLS model is developed without much increase in computational burden. Random weighting is a generalization of the traditional resampling methods and is expected to carry a lower risk of producing an insufficient training set. For prediction, only the training samples with random weights less than a threshold value are selected, which ensures that the prediction samples have less influence on training. For complicated data, the optimal number of PLS components is often not unique or readily distinguished, and an optimal region of model complexity might exist, so the distribution of prediction errors can be more useful than a single value of root mean squared error of prediction (RMSEP). Therefore, the distribution of prediction errors is estimated by repeated random sample weighting (RSW) and used to determine model complexity. RSW is compared with its traditional counterparts (CV, MCCV and BS) and with a recently proposed randomization test method to demonstrate its usefulness. Copyright © 2010 John Wiley & Sons, Ltd.
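The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: weights are drawn from a uniform distribution on [0, 1], the threshold is 0.3, and the sample-weighted PLS1 model is obtained by weighted centering followed by scaling each row by the square root of its weight. The function names weighted_pls1 and rsw_rmsep are hypothetical.

```python
import numpy as np

def weighted_pls1(X, y, w, n_comp):
    """Sample-weighted PLS1 (NIPALS) on sqrt(w)-scaled rows; an assumed scheme."""
    w = w / w.sum()                      # normalize weights to sum to 1
    x_mean = w @ X                       # weighted column means
    y_mean = w @ y                       # weighted response mean
    s = np.sqrt(w)
    Xk = (X - x_mean) * s[:, None]       # weighted, centered predictors
    yk = (y - y_mean) * s                # weighted, centered response
    W, P, q = [], [], []
    for _ in range(n_comp):
        wt = Xk.T @ yk                   # weight vector: covariance direction
        wt = wt / np.linalg.norm(wt)
        t = Xk @ wt                      # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        qk = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - qk * t                 # deflate y
        W.append(wt)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # regression vector
    return beta, x_mean, y_mean

def rsw_rmsep(X, y, max_comp, n_rep=200, threshold=0.3, seed=0):
    """RMSEP for each repetition (rows) and number of components (columns)."""
    rng = np.random.default_rng(seed)
    errors = np.full((n_rep, max_comp), np.nan)
    for r in range(n_rep):
        # Assumption: uniform random non-negative weights.
        w = rng.uniform(0.0, 1.0, size=len(y))
        # Low-weight samples influence training least, so predict those.
        test = w < threshold
        if not test.any():
            continue                     # leave this repetition as NaN
        for a in range(1, max_comp + 1):
            beta, xm, ym = weighted_pls1(X, y, w, a)
            pred = (X[test] - xm) @ beta + ym
            errors[r, a - 1] = np.sqrt(np.mean((pred - y[test]) ** 2))
    return errors

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = X[:, :2] @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=40)

errors = rsw_rmsep(X, y, max_comp=4, n_rep=50)
median_rmsep = np.nanmedian(errors, axis=0)   # one summary of each distribution
```

Each column of errors holds an empirical distribution of RMSEP for one model complexity, so the columns can be compared as whole distributions (for example by medians or quantiles) rather than by a single RMSEP value. In this framing, a weight vector of zeros and ones corresponds to CV or MCCV splits, and integer counts correspond to bootstrap resamples.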
