Keywords:

  • multivariate calibration;
  • semi-supervised learning;
  • unlabeled data;
  • optimal filtering;
  • drift

Abstract

In principal component regression (PCR) and partial least-squares regression (PLSR), the use of unlabeled data, in addition to labeled data, helps stabilize the latent subspaces in the calibration step, typically leading to a lower prediction error. To exploit unlabeled data in PLSR, a non-sequential approach based on optimal filtering (OF) has been proposed in the literature. In this work, a sequential version of the OF-based PLSR and a PCA-based PLSR (PLSR applied to PCA-preprocessed data) are proposed. It is shown analytically that the sequential version of the OF-based PLSR is equivalent to that of PCA-based PLSR, which leads to a new interpretation of OF. Simulated and experimental data sets are used to point out the usefulness and pitfalls of using unlabeled data. Unlabeled data can replace labeled data to some extent, thereby leading to an economic benefit. However, in the presence of drift, the use of unlabeled data can result in a higher prediction error than that obtained with a model based on labeled data alone. Copyright © 2011 John Wiley & Sons, Ltd.
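
As a rough illustration of how unlabeled data can stabilize the latent subspace, the following minimal sketch implements a semi-supervised PCR in NumPy: the PCA loadings are estimated from the pooled labeled and unlabeled measurements, and the regression on the scores uses the labeled samples only. This is not the OF-based or PCA-based PLSR of the paper. The function names, the simulated data, and the choice of PCR over PLSR are assumptions made for illustration.

```python
import numpy as np


def fit_semisupervised_pcr(X_lab, y_lab, X_unl, n_components):
    """Hypothetical helper: PCR with loadings estimated from pooled data."""
    # Pool labeled and unlabeled measurements to estimate the latent subspace.
    X_all = np.vstack([X_lab, X_unl])
    x_mean = X_all.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_all - x_mean, full_matrices=False)
    P = Vt[:n_components].T  # loadings, shape (n_variables, n_components)

    # Regress y on the scores of the labeled samples only (with intercept,
    # because the labeled scores need not have zero mean).
    T = (X_lab - x_mean) @ P
    A = np.column_stack([np.ones(len(T)), T])
    coef, *_ = np.linalg.lstsq(A, y_lab, rcond=None)

    # Map the score-space model back to the original variables.
    b = P @ coef[1:]
    b0 = coef[0] - x_mean @ b
    return b, b0


def predict(X, b, b0):
    """Predict y for new measurements X from regression vector b and offset b0."""
    return X @ b + b0


# Simulated example (assumed setup): 2 latent factors, 30 variables,
# 8 labeled and 200 unlabeled samples, small measurement noise.
rng = np.random.default_rng(0)
loadings_true = rng.normal(size=(2, 30))
c_true = np.array([1.5, -0.7])


def simulate(n):
    scores = rng.normal(size=(n, 2))
    X = scores @ loadings_true + 0.05 * rng.normal(size=(n, 30))
    return X, scores @ c_true


X_lab, y_lab = simulate(8)
X_unl, _ = simulate(200)       # labels of these samples are never used
X_new, y_new = simulate(50)

b, b0 = fit_semisupervised_pcr(X_lab, y_lab, X_unl, n_components=2)
rmsep = np.sqrt(np.mean((predict(X_new, b, b0) - y_new) ** 2))
print(f"RMSEP with unlabeled data: {rmsep:.4f}")
```

In this sketch the unlabeled samples enter only through the mean and the loadings. If those samples are affected by drift, the estimated subspace shifts with them, which is one way the increase in prediction error mentioned in the abstract could arise.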