Statistical classification of multivariate flow cytometry data analyzed by manual gating: Stem, progenitor, and epithelial marker expression in nonsmall cell lung cancer and normal lung

Authors

  • Daniel P. Normolle

    1. Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA 15213
    2. University of Pittsburgh Cancer Institute
  • Vera S. Donnenberg

    1. University of Pittsburgh Cancer Institute
    2. Department of Cardiothoracic Surgery, University of Pittsburgh School of Medicine, Pittsburgh, PA 15213, USA
    3. McGowan Institute of Regenerative Medicine, Pittsburgh, PA 15219, USA
  • Albert D. Donnenberg

    Corresponding author
    1. University of Pittsburgh Cancer Institute
    2. McGowan Institute of Regenerative Medicine, Pittsburgh, PA 15219, USA
    3. Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA 15213, USA
    • 5117 Centre Avenue, Suite 2.42c Research Pavilion, Pittsburgh, PA 15213

Abstract

The use of supervised classification to extract markers from primary flow cytometry data is an emerging field that has made significant progress, spurred by the growing complexity of multidimensional flow cytometry. Whether the markers are extracted without supervision or by conventional gate and region methods, the number of candidate variables identified is typically larger than the number of specimens (p > n), and many variables are highly intercorrelated. Comparison across groups or treatments to determine which markers are significant is therefore challenging. Here, we used a data set in which 86 variables were created by conventional manual analysis of individual listmode data files, and compared five multivariate classification methods for discerning subtle differences between the stem/progenitor content of 35 nonsmall cell lung cancer and adjacent normal lung specimens. The methods compared were elastic-net, lasso, random forest, diagonal linear discriminant analysis, and best single variable (best-1). We describe a broadly applicable methodology consisting of: 1) variable transformation and standardization; 2) visualization and assessment of correlation between variables; 3) selection of significant variables and modeling; and 4) characterization of the quality and stability of the model. The analysis yielded both validating results (tumors are aneuploid and have higher light scatter properties than normal lung) and leads that require follow-up: Cytokeratin+ CD133+ progenitors are present in normal lung but reduced in lung cancer; diploid (or pseudo-diploid) CD117+CD44+ cells are more prevalent in tumor. We anticipate that the methods described here will be broadly applicable to a variety of multidimensional cytometry problems. © 2012 International Society for Advancement of Cytometry
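As a rough illustration of steps 1, 3, and 4 of the methodology, the following Python sketch standardizes a variable matrix, fits a diagonal linear discriminant analysis (DLDA) classifier, and estimates its accuracy by leave-one-out cross-validation. It is not the authors' implementation: the abstract does not specify their software, their validation scheme, or their parameter choices. The function names (`standardize`, `fit_dlda`, `predict_dlda`, `loo_accuracy`), the use of leave-one-out validation, and the synthetic data in the demo are all assumptions made for illustration only.

```python
import random
import statistics


def standardize(rows):
    """Center every variable (column) to mean 0 and scale it to unit SD."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def fit_dlda(X, y):
    """Diagonal LDA: per-class means plus one pooled variance per variable."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {
        k: [statistics.fmean(row[j] for row, lab in zip(X, y) if lab == k)
            for j in range(p)]
        for k in classes
    }
    dof = len(X) - len(classes)  # pooled within-class degrees of freedom
    pooled = [
        max(sum((row[j] - means[lab][j]) ** 2 for row, lab in zip(X, y)) / dof,
            1e-12)  # floor avoids division by zero for degenerate variables
        for j in range(p)
    ]
    return means, pooled


def predict_dlda(model, x):
    """Assign x to the class whose mean is nearest in variance-scaled distance."""
    means, pooled = model

    def distance(k):
        return sum((xj - mj) ** 2 / v for xj, mj, v in zip(x, means[k], pooled))

    return min(means, key=distance)


def loo_accuracy(X, y):
    """Leave-one-out cross-validation: refit without specimen i, then predict it."""
    hits = 0
    for i in range(len(X)):
        model = fit_dlda(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict_dlda(model, X[i]) == y[i]
    return hits / len(X)


if __name__ == "__main__":
    # Hypothetical synthetic data (NOT the study's measurements):
    # 20 specimens, 6 variables, of which the first 2 are shifted in "tumor".
    rng = random.Random(0)
    labels = ["normal"] * 10 + ["tumor"] * 10
    data = [
        [rng.gauss(4.0 if lab == "tumor" and j < 2 else 0.0, 1.0) for j in range(6)]
        for lab in labels
    ]
    print("Leave-one-out accuracy:", loo_accuracy(standardize(data), labels))
```

DLDA predictions do not change when a variable is shifted or rescaled, because each squared difference is divided by that variable's pooled variance. The standardization step is included because the methodology lists it and because penalized methods such as lasso and elastic-net do depend on variable scale.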
