SEARCH

SEARCH BY CITATION

Keywords:

  • High-dimensional data;
  • Misclassification;
  • Proteomics

Summary.  Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific in the sense that a tissue that shows itself to have malignant cells is certainly cancer, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification due to imperfect diagnosis certainty arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning. The proposal is to view the true class membership of the samples that are labeled with the error-prone class label as missing data, and apply an algorithm related to the EM algorithm for minimization of a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.