Statistical classification of multivariate flow cytometry data analyzed by manual gating: Stem, progenitor, and epithelial marker expression in nonsmall cell lung cancer and normal lung


  • Daniel P. Normolle,

    1. Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA 15213
    2. University of Pittsburgh Cancer Institute
  • Vera S. Donnenberg,

    1. University of Pittsburgh Cancer Institute
    2. Department of Cardiothoracic Surgery, University of Pittsburgh School of Medicine, Pittsburgh, PA 15213, USA
    3. McGowan Institute of Regenerative Medicine, Pittsburgh, PA 15219, USA
  • Albert D. Donnenberg

    Corresponding author
    1. University of Pittsburgh Cancer Institute
    2. McGowan Institute of Regenerative Medicine, Pittsburgh, PA 15219, USA
    3. Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA 15213, USA
    • 5117 Centre Avenue, Suite 2.42c Research Pavilion, Pittsburgh, PA 15213


The use of supervised classification to extract markers from primary flow cytometry data is an emerging field that has made significant progress, spurred by the growing complexity of multidimensional flow cytometry. Whether the markers are extracted without supervision or by conventional gate and region methods, the number of candidate variables identified is typically larger than the number of specimens (p > n) and many variables are highly intercorrelated. Thus, comparison across groups or treatments to determine which markers are significant is challenging. Here, we utilized a data set in which 86 variables were created by conventional manual analysis of individual listmode data files, and compared the application of five multivariate classification methods to discern subtle differences between the stem/progenitor content of 35 nonsmall cell lung cancer and adjacent normal lung specimens. The methods compared include elastic-net, lasso, random forest, diagonal linear discriminant analysis, and best single variable (best-1). We describe a broadly applicable methodology consisting of: 1) variable transformation and standardization; 2) visualization and assessment of correlation between variables; 3) selection of significant variables and modeling; and 4) characterization of the quality and stability of the model. The analysis yielded both validating results (tumors are aneuploid and have higher light scatter properties than normal lung) and leads that require follow-up: Cytokeratin+ CD133+ progenitors are present in normal lung but reduced in lung cancer; diploid (or pseudo-diploid) CD117+CD44+ cells are more prevalent in tumor. We anticipate that the methods described here will be broadly applicable to a variety of multidimensional cytometry problems. © 2012 International Society for Advancement of Cytometry

Analysis of flow cytometry data is usually performed in two steps. In primary analysis, variables such as the proportion of events having given properties (e.g., percent CD3+ of CD45+) are extracted from individual Flow Cytometry Standard (FCS) files. Secondary analysis involves statistical comparison of the tabulated variables between groups or treatments. Although the use of statistical classification methods to automate the extraction of markers from raw or minimally processed FCS files is an emerging field (1–4), the application of similar methodologies to secondary data analysis has not been exploited. Regardless of the method of primary analysis, the number of candidate variables detected in multidimensional cytometry data may exceed the number of specimens to be compared. Thus, simple bivariate comparison, one variable at a time, lacks statistical power to detect between-group differences and is blind to correlations that may exist between variables. Clearly, a multivariate approach is needed for the secondary analysis of the large number of variables that are routinely extracted from multidimensional FCS files. When the number of specimens is limited, the conventional approach of nonoverlapping training and validation data sets is not feasible. In such cases, a naïve multivariate analysis may overfit the data, and then optimistically estimate the magnitude of the association between the variables and the group or treatment, resulting in findings that cannot be replicated. Further, when intercorrelation is substantial, small changes in the data or classification method may change which variables are identified as significant. There are a number of suitable contemporary computational methods for variable selection and modeling, but they have fundamentally different theoretical motivations, require tuning, and can choose different variables as "significant." Analysts who have success with one method on one data set tend to apply that method to subsequent data sets. Thus, the results of a particular analysis may become reified without a sense of their sensitivity to different variable selection methods or random variation.

In this report, we propose an analysis workflow to address these issues. Given a previously published set of variables assessed on a limited number of specimens (5), our goal is to identify an appropriate analytical method to select variables related to a dichotomous condition (nonsmall cell lung cancer versus normal lung), to build a model of the condition as a mathematical function of the variables that can be used to classify subsequent specimens and to characterize the robustness of the analysis. By focusing on a flow cytometry data set previously analyzed using simple bivariate comparisons between tumor and adjacent lung specimens, the present analysis will address hazards particular to flow cytometry data: 1) highly correlated variables result in competing models that are difficult to compare; 2) statistical tests summarized by p-values do not necessarily produce sets of variables that reproducibly discriminate between conditions; 3) estimates of sensitivity and specificity must be adjusted when the number of specimens is insufficient to form independent training and testing sets. The proposed workflow, summarized here, and presented in detail as online Supporting Information, provides a practical approach to the secondary analysis of multidimensional flow cytometry data.



Because the statistical and cytometry literature often attribute different meanings to the same terms (e.g., parameter, sample), we have used the following terminology. A flow cytometry assay assesses a limited number of features (e.g., forward light scatter, 525/40 fluorescence intensity) on hundreds of thousands or millions of events (d, e.g., cells) that are typically acquired from a limited number (n) of specimens (e.g., lung tumor specimens). Feature expression can either be treated as continuous (e.g., fluorescence intensity of an event), or dichotomous (present or absent on a given event). These features are then combined through a gating process to define a number of markers (e.g., CD90+/CD44+). Variables (p) are derived from markers as either: 1) ratios of the numbers of events having or not having these markers as determined by a gating process (e.g., proportion of CD90+/CD44+ cells among cytokeratin+ cells); 2) absolute number (events per volume) of events having a given marker; or 3) the numerical value of a feature in a population of events defined by a marker (e.g., the mean fluorescence intensity of CD90-PE among CD90+/CD44+/cytokeratin+ events).

Acquisition of Data by Flow Cytometry

Eight-color, 11-feature data were collected for 16 paired nonsmall cell lung tumors/normal lung specimens, plus three additional tumor specimens. All specimens were collected under IRB approval (UPCI 99-053). Specimens, specimen processing, data acquisition, and variable extraction using conventional (gate and region) analysis have been reported in detail, including a MIFlowCyt checklist (5). Following artifact removal (6, 7), the analysis was focused on the comparison of stem/progenitor marker expression (CD44, CD90, CD117, CD133) on tumor cells and cells from adjacent normal lung. After limiting the analysis to CD45-/CD14-/CD33-/glycophorin A- cells, we subsetted based on cytokeratin expression (epithelial versus nonepithelial or pre-epithelial) and ploidy (2N vs. >2N), yielding four classes of cells on which to assess variables (proportion of cells positive for stem/progenitor markers, proportion with low versus high light scatter, Fig. 1). The reason for using DNA content as a grouping variable is that, in tumor specimens, we could be certain that the majority of aneuploid cells were of bona fide tumor origin (as opposed to normal stromal or epithelial cells). The analytical regions in Figure 1, numbered 1–86, are linked to variable names and descriptions in Table 1. The data for all listmode files were analyzed using VenturiOne software (Applied Cytometry Systems, Dinnington, UK) in one session to standardize the placement of analytical regions. Region data (event counts) were exported to a comma-separated values (.csv) file for statistical analysis. The data are available to interested investigators by request to the corresponding author.

Figure 1.

Regions and gates used in multivariate analysis. This example shows a lung adenocarcinoma. Logical gates applied to each histogram are shown above the histogram frame. The “Clean Non Heme” gate was created on CD45-/CD14-/CD33-/glycophorin A- singlet events with DNA content ≥ 2N (not shown). The region numbers indicate the markers used for multivariate analysis and are keyed to Table 1. CKP = cytokeratin+; CKN = cytokeratin negative. Euploid = gated on region 4 or 9. Aneuploid = gated on region 5 or 10. The euploid and low light scatter regions were matched to tissue infiltrating lymphocytes (not shown).

Table 1. Variable names and descriptions keyed to the analytical regions in Figure 1
Region | Variable name | Description (Marker) | Denominator (Marker)
1 | CKP | Cytokeratin+ | Singlet ≥ 2N DNA events
2 | CKP_SM | Cytokeratin+ lymphoid scatter | Cytokeratin+
3 | CKP_LG | Cytokeratin+, > lymphoid scatter | Cytokeratin+
4 | CKP_2N | Cytokeratin+ euploid | Cytokeratin+
5 | CKP_GT2N | Cytokeratin+ aneuploid | Cytokeratin+
6 | CKN | Cytokeratin negative | Singlet ≥ 2N DNA events
7 | CKN_SM | Cytokeratin negative lymphoid scatter | Cytokeratin negative
8 | CKN_LG | Cytokeratin negative > lymphoid scatter | Cytokeratin negative
9 | CKN_2N | Cytokeratin negative euploid | Cytokeratin negative
10 | CKN_GT2N | Cytokeratin negative aneuploid | Cytokeratin negative
11 | CKN_44P | Cytokeratin negative CD44+ | Cytokeratin negative
12 | CKP_44P | Cytokeratin+ CD44+ | Cytokeratin+
13 | CKN_90P | Cytokeratin negative CD90+ | Cytokeratin negative
14 | CKP_90P | Cytokeratin+ CD90+ | Cytokeratin+
15 | CKN_117P | Cytokeratin negative CD117+ | Cytokeratin negative
16 | CKP_117P | Cytokeratin+ CD117+ | Cytokeratin+
17 | CKN_133P | Cytokeratin negative CD133+ | Cytokeratin negative
18 | CKP_133P | Cytokeratin+ CD133+ | Cytokeratin+
19 | CPG2_SMALL | Cytokeratin+ aneuploid lymphoid scatter | Cytokeratin+ aneuploid
20 | CPG290N44P | Cytokeratin+ aneuploid CD90 negative CD44+ | Cytokeratin+ aneuploid
21 | CPG290P44P | Cytokeratin+ aneuploid CD90+ CD44+ | Cytokeratin+ aneuploid
22 | CPG290N44N | Cytokeratin+ aneuploid CD90 negative CD44 negative | Cytokeratin+ aneuploid
23 | CPG290P44N | Cytokeratin+ aneuploid CD90+ CD44 negative | Cytokeratin+ aneuploid
24 | CPG2117N44P | Cytokeratin+ aneuploid CD117 negative CD44+ | Cytokeratin+ aneuploid
25 | CPG2117P44P | Cytokeratin+ aneuploid CD117+ CD44+ | Cytokeratin+ aneuploid
26 | CPG2117N44N | Cytokeratin+ aneuploid CD117 negative CD44 negative | Cytokeratin+ aneuploid
27 | CPG2117P44N | Cytokeratin+ aneuploid CD117+ CD44 negative | Cytokeratin+ aneuploid
28 | CPG2117N133P | Cytokeratin+ aneuploid CD117 negative CD133+ | Cytokeratin+ aneuploid
29 | CPG2117P133P | Cytokeratin+ aneuploid CD117+ CD133+ | Cytokeratin+ aneuploid
30 | CPG2117N133N | Cytokeratin+ aneuploid CD117 negative CD133 negative | Cytokeratin+ aneuploid
31 | CPG2117P133N | Cytokeratin+ aneuploid CD117+ CD133 negative | Cytokeratin+ aneuploid
32 | CPG2117N90P | Cytokeratin+ aneuploid CD117 negative CD90+ | Cytokeratin+ aneuploid
33 | CPG2117P90P | Cytokeratin+ aneuploid CD117+ CD90+ | Cytokeratin+ aneuploid
34 | CPG2117N90N | Cytokeratin+ aneuploid CD117 negative CD90 negative | Cytokeratin+ aneuploid
35 | CPG2117P90N | Cytokeratin+ aneuploid CD117+ CD90 negative | Cytokeratin+ aneuploid
36 | CP2_SMALL | Cytokeratin+ euploid lymphoid scatter | Cytokeratin+ euploid
37 | CP90N44P | Cytokeratin+ euploid CD90 negative CD44+ | Cytokeratin+ euploid
38 | CP90P44P | Cytokeratin+ euploid CD90+ CD44+ | Cytokeratin+ euploid
39 | CP90N44N | Cytokeratin+ euploid CD90 negative CD44 negative | Cytokeratin+ euploid
40 | CP90P44N | Cytokeratin+ euploid CD90+ CD44 negative | Cytokeratin+ euploid
41 | CP2117N44P | Cytokeratin+ euploid CD117 negative CD44+ | Cytokeratin+ euploid
42 | CP2117P44P | Cytokeratin+ euploid CD117+ CD44+ | Cytokeratin+ euploid
43 | CP2117N44N | Cytokeratin+ euploid CD117 negative CD44 negative | Cytokeratin+ euploid
44 | CP2117P44N | Cytokeratin+ euploid CD117+ CD44 negative | Cytokeratin+ euploid
45 | CP2117N133P | Cytokeratin+ euploid CD117 negative CD133+ | Cytokeratin+ euploid
46 | CP2117P133P | Cytokeratin+ euploid CD117+ CD133+ | Cytokeratin+ euploid
47 | CP2117N133N | Cytokeratin+ euploid CD117 negative CD133 negative | Cytokeratin+ euploid
48 | CP2117P133N | Cytokeratin+ euploid CD117+ CD133 negative | Cytokeratin+ euploid
49 | CP2117N90P | Cytokeratin+ euploid CD117 negative CD90+ | Cytokeratin+ euploid
50 | CP2117P90P | Cytokeratin+ euploid CD117+ CD90+ | Cytokeratin+ euploid
51 | CP2117N90N | Cytokeratin+ euploid CD117 negative CD90 negative | Cytokeratin+ euploid
52 | CP2117P90N | Cytokeratin+ euploid CD117+ CD90 negative | Cytokeratin+ euploid
53 | CNG2_SMALL | Cytokeratin negative aneuploid lymphoid scatter | Cytokeratin negative aneuploid
54 | CNG290N44P | Cytokeratin negative aneuploid CD90 negative CD44+ | Cytokeratin negative aneuploid
55 | CNG290P44P | Cytokeratin negative aneuploid CD90+ CD44+ | Cytokeratin negative aneuploid
56 | CNG290N44N | Cytokeratin negative aneuploid CD90 negative CD44 negative | Cytokeratin negative aneuploid
57 | CNG290P44N | Cytokeratin negative aneuploid CD90+ CD44 negative | Cytokeratin negative aneuploid
58 | CNG2117N44P | Cytokeratin negative aneuploid CD117 negative CD44+ | Cytokeratin negative aneuploid
59 | CNG2117P44P | Cytokeratin negative aneuploid CD117+ CD44+ | Cytokeratin negative aneuploid
60 | CNG2117N44N | Cytokeratin negative aneuploid CD117 negative CD44 negative | Cytokeratin negative aneuploid
61 | CNG2117P44N | Cytokeratin negative aneuploid CD117+ CD44 negative | Cytokeratin negative aneuploid
62 | CNG2117N133P | Cytokeratin negative aneuploid CD117 negative CD133+ | Cytokeratin negative aneuploid
63 | CNG2117P133P | Cytokeratin negative aneuploid CD117+ CD133+ | Cytokeratin negative aneuploid
64 | CNG2117N133N | Cytokeratin negative aneuploid CD117 negative CD133 negative | Cytokeratin negative aneuploid
65 | CNG2117P133N | Cytokeratin negative aneuploid CD117+ CD133 negative | Cytokeratin negative aneuploid
66 | CNG2117N90P | Cytokeratin negative aneuploid CD117 negative CD90+ | Cytokeratin negative aneuploid
67 | CNG2117P90P | Cytokeratin negative aneuploid CD117+ CD90+ | Cytokeratin negative aneuploid
68 | CNG2117N90N | Cytokeratin negative aneuploid CD117 negative CD90 negative | Cytokeratin negative aneuploid
69 | CNG2117P90N | Cytokeratin negative aneuploid CD117+ CD90 negative | Cytokeratin negative aneuploid
70 | CN2_SMALL | Cytokeratin negative euploid lymphoid scatter | Cytokeratin negative euploid
71 | CN90N44P | Cytokeratin negative euploid CD90 negative CD44+ | Cytokeratin negative euploid
72 | CN90P44P | Cytokeratin negative euploid CD90+ CD44+ | Cytokeratin negative euploid
73 | CN90N44N | Cytokeratin negative euploid CD90 negative CD44 negative | Cytokeratin negative euploid
74 | CN90P44N | Cytokeratin negative euploid CD90+ CD44 negative | Cytokeratin negative euploid
75 | CN2117N44P | Cytokeratin negative euploid CD117 negative CD44+ | Cytokeratin negative euploid
76 | CN2117P44P | Cytokeratin negative euploid CD117+ CD44+ | Cytokeratin negative euploid
77 | CN2117N44N | Cytokeratin negative euploid CD117 negative CD44 negative | Cytokeratin negative euploid
78 | CN2117P44N | Cytokeratin negative euploid CD117+ CD44 negative | Cytokeratin negative euploid
79 | CN2117N133P | Cytokeratin negative euploid CD117 negative CD133+ | Cytokeratin negative euploid
80 | CN2117P133P | Cytokeratin negative euploid CD117+ CD133+ | Cytokeratin negative euploid
81 | CN2117N133N | Cytokeratin negative euploid CD117 negative CD133 negative | Cytokeratin negative euploid
82 | CN2117P133N | Cytokeratin negative euploid CD117+ CD133 negative | Cytokeratin negative euploid
83 | CN2117N90P | Cytokeratin negative euploid CD117 negative CD90+ | Cytokeratin negative euploid
84 | CN2117P90P | Cytokeratin negative euploid CD117+ CD90+ | Cytokeratin negative euploid
85 | CN2117N90N | Cytokeratin negative euploid CD117 negative CD90 negative | Cytokeratin negative euploid
86 | CN2117P90N | Cytokeratin negative euploid CD117+ CD90 negative | Cytokeratin negative euploid

Data Analysis Workflow

We analyzed the data according to the following steps: 1) transformation and standardization; 2) visualization and assessment of correlation between variables; 3) selection of significant markers and modeling; 4) characterization of the quality and stability of the model. All of the analytic steps applied to the data in the .csv file were performed using the open-source statistical package R (8). A diagram of the workflow (Fig. S1) and the associated R code are included in the online Supporting Information.

Transformation and Standardization

Because the variables analyzed in this data set were proportions (e.g., proportion of cytokeratin+ viable cells), it was necessary to address their statistical properties. First, when a population of cells is divided into a set and its complement, the two proportions are perfectly correlated (r = −1). Similarly, large fractions are often highly correlated. For example, cytokeratin+ and cytokeratin negative, expressed as a proportion of nucleated cells, are highly inversely correlated. Second, mid-range proportions (e.g., aneuploid tumor cells as a proportion of cytokeratin positive cells) tend to be normally distributed, but proportions that are close to 0 or 1 have asymmetric distributions over the specimens. Third, proportion estimates on small denominators (<50 cells) are unreliable. Finally, zero proportions do not necessarily indicate an impossible event, but, possibly, a rare, yet informative event, especially when denominators are small. We addressed these issues by unzeroing, stabilization, and standardization.


Some numerical transformations (e.g., logarithmic) cannot operate on zeros or ones, and some statistical classifiers perform worse on highly skewed data. We replaced zeros by a random value between zero and one-tenth the smallest non-zero value and ones by a random value between 1 and 1−(1−m)/10, where m is the largest value less than 1. Additional details are presented in the Supporting Information.
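As an illustration only (the authors' R code is in the Supporting Information and is not reproduced here), the unzeroing rule described above can be sketched in Python. The function name `unzero`, the fixed random seed, and the example proportions are hypothetical; the sketch also assumes that the "smallest non-zero value" is taken among values strictly less than 1.

```python
import random


def unzero(proportions, rng=None):
    """Replace exact 0s and 1s with small random values strictly inside (0, 1)."""
    rng = rng or random.Random(0)  # fixed seed is an assumption, for reproducibility
    interior = [p for p in proportions if 0.0 < p < 1.0]
    if not interior:
        raise ValueError("need at least one proportion strictly between 0 and 1")
    smallest = min(interior)  # smallest non-zero value (assumed to be < 1)
    m = max(interior)         # largest value less than 1
    result = []
    for p in proportions:
        u = 1.0 - rng.random()  # uniform on (0, 1], so u is never exactly 0
        if p == 0.0:
            result.append(u * smallest / 10.0)         # in (0, smallest/10]
        elif p == 1.0:
            result.append(1.0 - u * (1.0 - m) / 10.0)  # in [1-(1-m)/10, 1)
        else:
            result.append(p)
    return result


# Hypothetical proportions for one variable across five specimens.
example = [0.0, 0.02, 0.40, 0.95, 1.0]
jittered = unzero(example)
```

After this step every value lies strictly between 0 and 1, so a logarithmic or logit transformation is defined for all specimens.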


Some statistical classification methods assume that the variables have Gaussian distributions (e.g., Fisher's linear discriminant analysis) or do not accommodate highly skewed data well. We used the logit transform in our analysis because it is symmetric about a proportion of 0.5 and its range is the entire real line (all positive and negative numbers). The logit and inverse logit transformations are shown in Supporting Information Figure S2, where merits of various transformations are discussed.
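For readers unfamiliar with the transform, a minimal Python sketch of the logit and its inverse follows. The function names are illustrative and are not taken from the authors' code.

```python
import math


def logit(p):
    """Map a proportion strictly between 0 and 1 onto the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined at 0 and 1; unzero the data first")
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Symmetry: a proportion and its complement map to values of opposite sign.
low, high = logit(0.1), logit(0.9)
```

The symmetry shown in the last line is what makes a marker and its complement (e.g., 10% positive versus 90% negative) equally far from zero after transformation.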


For each variable, we subtracted the mean and divided by the standard deviation, so that, over specimens from all conditions, the mean equals 0 and the observed standard deviation equals 1. This was required because some computational algorithms are unstable when operating simultaneously on values with multilog differences in scale.
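The standardization step can be sketched as follows. This is a hedged illustration: the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1 across all specimens."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator); an assumption
    if sd == 0.0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - mean) / sd for v in values]


# Hypothetical logit-transformed values of one variable over four specimens.
z = standardize([-2.0, -0.5, 0.5, 3.0])
```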

Visualization of Data and Assessment of Correlation Between Variables

The transformed, standardized data were scanned for artifacts, influence points, and obvious classifiers before analysis. Outliers, in the usual sense of very large (positive or negative) values, are not present in transformed, standardized proportions. However, influence points (single values that induce correlations), such as clusters of 0 and 1 counts, were evident in the data and required unzeroing as described above.

Initial Assessment of Variables for Classification

We performed comparisons of the individual variables versus the condition by t-tests and/or rank sum tests to determine whether modeling was feasible (Table 2). Because a number of significant differences were observed, we proceeded with the analysis. Although a small p-value does not necessarily indicate a good classifier, if none of the p-values were significant, we would have terminated the analysis.
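A rank sum screen of a single variable can be sketched in a few lines of Python. This is not the authors' implementation: it uses the normal approximation without tie or continuity corrections, which is an assumption made to keep the example short, and the data shown are hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, mid-ranks for ties)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0           # Mann-Whitney U for group x
    mu = n1 * n2 / 2.0                             # mean of U under the null
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u, p


# Hypothetical standardized values of one variable in normal and tumor specimens.
u_stat, p_value = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

In practice the screen would be repeated for each of the 86 variables and the resulting p-values sorted, as in Table 2.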

Table 2. P-values of unpaired t- and rank-sum tests on tumor versus normal lung specimens. The 32 variables with the smallest P-values (t-test) are shown
Variable | P-value (t-test) | P-value (rank sum test)

Selection of Significant Variables and Modeling of Condition

We tested a number of methods in a resampling framework to determine if: 1) some methods classify the specimens better than others; 2) there is agreement between the methods on which variables are important; 3) there are any specimens that have particular influence over the variables selected as important and/or coefficient estimates; and 4) the estimated coefficients are sensitive to small perturbations in the data. We used two resampling methods to address these issues: cross-validation and bootstrapping.


The data were partitioned into 10 disjoint subsets. At each of 10 iterations, a different subset was set aside and modeling parameters were estimated from the other 9 subsets, pooled together. We estimated the sensitivity and specificity, and set tuning parameters using the set-aside test set.
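The partitioning scheme can be illustrated with a short Python sketch. The helper name `cv_splits`, the shuffling seed, and the round-robin assignment of specimens to folds are assumptions; the source states only that 10 disjoint subsets were used.

```python
import random


def cv_splits(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation of n specimens."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    for i, test in enumerate(folds):
        # Pool the other k - 1 subsets to form the training set.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(cv_splits(35))  # 35 specimens, as in this data set
```

Each specimen appears in exactly one test set, so every specimen is classified once by a model that did not see it during fitting.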


In the nonparametric bootstrap (9), a sample of specimens, the same size as the original data set, was constructed by random draws from the original sample with replacement. That is, once an observation was selected for the bootstrap sample, it was returned to the pool of eligible observations, where it could be sampled again. All of the methods were trained on the bootstrap sample, and the specimens that were excluded from the sample (equal, on average, to a fraction (1−1/n)^n ≈ 0.37 of the sample size, n) were treated as the test set. These excluded specimens are referred to as out-of-bag (OOB). The bootstrapping process was repeated 500 times, each bootstrap iteration testing the full set of methods on different training and test sets, both of which were drawn from the full sample. Once the bootstrap iterations were completed, the average and the variability of the cross-validated sensitivity, specificity, accuracy, selected variables, influential specimens, and estimated coefficients of the different methods were estimated by the empirical distribution of bootstrap estimates. Further discussion and software code appear in the online Supporting Information.
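The bootstrap/out-of-bag split, and the expected out-of-bag fraction, can be checked with a small simulation. The sketch below is illustrative Python rather than the authors' R code; the function name and the random seed are hypothetical.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob


n, iterations = 35, 500          # sample size and bootstrap count from the text
rng = random.Random(1)           # arbitrary seed, for reproducibility only
oob_fractions = []
for _ in range(iterations):
    in_bag, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)

mean_oob_fraction = sum(oob_fractions) / iterations
expected_fraction = (1.0 - 1.0 / n) ** n  # tends to exp(-1), about 0.37, as n grows
```

For n = 35 the expected fraction is approximately 0.36, so each iteration tests on roughly a dozen specimens that played no part in fitting.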

Variable Importance

Within each bootstrap iteration, variables were chosen as either significant contributors to the discriminator, or nonsignificant. The proportion of iterations in which a given variable was considered significant is a measure of the robustness of its power as a discriminator. The correlation of selection between variables (where each variable is coded 1 if it is selected, and 0 otherwise) was interpreted as a descriptor of structure; variables whose selection indices are negatively correlated are possibly associated with a pathway.
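Both summaries described above (selection frequency and correlation of selection indicators) are simple to compute once the set of variables selected in each bootstrap iteration has been recorded. The following Python sketch assumes that record is a list of sets; the four "iterations" shown are hypothetical and do not reflect the actual results.

```python
import math


def selection_frequency(selected_sets, variables):
    """Proportion of bootstrap iterations in which each variable was selected."""
    total = len(selected_sets)
    return {v: sum(v in s for s in selected_sets) / total for v in variables}


def selection_correlation(selected_sets, a, b):
    """Pearson correlation of the 0/1 selection indicators of variables a and b."""
    xs = [1.0 if a in s else 0.0 for s in selected_sets]
    ys = [1.0 if b in s else 0.0 for s in selected_sets]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")  # undefined if a variable is always or never selected
    return sxy / math.sqrt(sxx * syy)


# Hypothetical selections from four bootstrap iterations.
runs = [{"CKP_GT2N", "CKN_SM"}, {"CKP_GT2N"}, {"CKP_133P"}, {"CKP_GT2N", "CKP_133P"}]
freq = selection_frequency(runs, ["CKP_GT2N", "CKN_SM", "CKP_133P"])
r = selection_correlation(runs, "CKN_SM", "CKP_133P")
```

A negative correlation, as in this toy example, means that the two variables tend to substitute for one another across resamples.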

Influential Observations

We also calculated the proportion of bootstrap iterations in which each test specimen was correctly classified as tumor or normal lung. Specimens that are consistently misclassified by one method, but not another, may indicate that a nonlinear partition may be required, or that the specimen is dissimilar from other specimens with the same condition. A specimen that is misclassified in close to ½ of the bootstrap iterations probably lies close to the partition boundary.
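A hedged sketch of this per-specimen tally follows. It assumes that each bootstrap iteration produces a mapping from out-of-bag specimen index to whether that specimen was classified correctly; this data structure is an illustrative choice, not one documented in the source.

```python
def specimen_accuracy(oob_results, n):
    """Proportion of out-of-bag appearances in which each specimen was correct."""
    correct = [0] * n
    tested = [0] * n
    for result in oob_results:
        for index, is_correct in result.items():
            tested[index] += 1
            if is_correct:
                correct[index] += 1
    # None marks a specimen that was never out-of-bag in the recorded iterations.
    return [c / t if t else None for c, t in zip(correct, tested)]


# Hypothetical outcomes from three iterations on a four-specimen data set.
results = [{0: True, 1: False}, {0: True}, {1: True, 2: False}]
accuracy = specimen_accuracy(results, 4)
```

Values near 0.5, such as specimen 1 in this toy example, correspond to specimens that lie close to the partition boundary.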

Candidate Classifiers

We focused on three methods that reflect different philosophies of model building: diagonal linear discriminant analysis (DLDA), random forests, and elastic net. DLDA is frequently used in very high-dimensional analyses; it is structurally very simple, and reduces computational complexity by ignoring interactions between the variables. Random forest is a nonparametric machine learning method that itself uses a resampling method to form a partition of the variable space that may be highly nonlinear. Elastic net is a “regularized” version of logistic regression that is designed for stability in the presence of highly correlated variables and has a built-in variable selection scheme. We also used the lasso (which is a special case of elastic net), and a naïve classifier, best single variable (Best-1). All of these methods were embedded in the bootstrap loop; all methods were applied to each bootstrap sample, and then the results were compared. Details and code for all these methods are in the online Supporting Information.
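To make the structural simplicity of DLDA concrete, the following Python sketch implements its core rule: class means plus a pooled, per-variable variance, with covariances between variables ignored. It assumes equal class priors and omits the variable selection step, so it is a simplified illustration rather than the procedure used in this study; the data are hypothetical.

```python
def fit_dlda(X, y):
    """Estimate class means and pooled within-class variances, one per variable."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    sums_of_squares = [0.0] * p
    for x, label in zip(X, y):
        for j in range(p):
            sums_of_squares[j] += (x[j] - means[label][j]) ** 2
    dof = len(X) - len(classes)
    variances = [s / dof for s in sums_of_squares]  # diagonal covariance only
    return means, variances


def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, variances = model

    def distance(c):
        return sum((x[j] - means[c][j]) ** 2 / variances[j] for j in range(len(x)))

    return min(means, key=distance)


# Hypothetical standardized values of two variables in four specimens.
X = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [1.2, 0.3]]
y = ["normal", "normal", "tumor", "tumor"]
model = fit_dlda(X, y)
label_a = predict_dlda(model, [0.1, 1.0])
label_b = predict_dlda(model, [1.0, 0.0])
```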


Visualization and Assessment of Correlation Within Subjects and Between Variables

Before logit transformation, centering, and scaling, we jittered 0 values to between 0 and 1/10 of the smallest non-zero value. Values of 1 were treated analogously. The intrasubject correlation (between tumor and normal lung) on the 86 variables was measured on the 16 pairs of specimens (three tumor specimens did not have paired normal tissue). The median of the 86 correlation coefficients was 0.22 (range, −0.63 to 0.87). The low median correlation indicated that an analysis based on differences within subject was unlikely to be useful. We therefore decided not to analyze the data as paired (tumor and normal specimens), but to treat the specimens as independent. Scatterplots of variables (tumor versus normal) with large coefficients indicated that many of the largest correlations (either positive or negative) were artifactual. The distribution of the 86 × (86−1)/2 = 3655 pairwise correlation coefficients (tumor vs. normal lung) is illustrated in Figure 2. A heat map and dendrogram of the normal lung and tumor specimens in the lung data set did not immediately indicate a dominant cluster of discriminating variables (Fig. S3). The closest two single variables are CKN_44P (cytokeratin negative/CD44+) and CN2117N44P (cytokeratin negative, euploid, CD117 negative/CD44+), which are highly correlated (r = 0.96) because the latter is a subset of the former.
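The count of pairwise correlations, and the computation of the correlations themselves, can be sketched as below. The variable names are borrowed from Table 1, but the values are hypothetical; CKN is constructed as the complement of CKP to reproduce the r = −1 case noted under Transformation and Standardization.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(columns):
    """Correlation for every unordered pair of variables (keys sorted by name)."""
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(sorted(columns), 2)}


n_pairs = math.comb(86, 2)  # number of unordered pairs among 86 variables

# Hypothetical proportions for three variables over four specimens.
toy = {
    "CKP": [0.2, 0.5, 0.7, 0.4],
    "CKN": [0.8, 0.5, 0.3, 0.6],      # complement of CKP
    "CKP_44P": [0.1, 0.3, 0.2, 0.6],
}
corr = pairwise_correlations(toy)
```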

Figure 2.

Distribution of between-variable correlations in the example data set.

Initial Assessment of Variables for Classification

Table 2 displays the p-values of Student's t and Wilcoxon rank sum tests comparing the distributions between normal lung and tumor specimens for all of the variables, from most significant (Student's t) to least significant. It is seen that there are some significant differences, so that further classification analysis is warranted, but that the distributions of most variables in normal lung and tumor specimens are similar, so the final discriminators are not expected to include many variables.

Sensitivity and Specificity

The sensitivity, specificity, and accuracy of the discriminators, estimated over the 500 bootstrap samples, are presented in Table 3. The bootstrapped estimates are calculated by classifying the OOB samples (averaged over all bootstrap iterations), whereas the resubstitution estimates are obtained by classifying the training samples (also averaged over all bootstrap iterations). DLDA, random forests, elastic net, and the lasso produce essentially equivalent accuracy; random forest is different from the other three in that its estimated specificity is larger than its sensitivity. The differences between sensitivity and specificity in these methods are more likely due to peculiarities of the example set than the methods themselves. Best-1 accuracy is lower than the other four methods, which is not surprising, given that it is restricted to a single variable. The bootstrapped estimates of sensitivity, specificity, and accuracy are much less optimistic than the resubstitution estimates. Resubstitution estimates of the operating characteristics of Best-1 were not as optimistic, but the bootstrapped estimates were not as good as the other methods.

Table 3. Sensitivity, specificity, and accuracy of the five methods
Estimate | Measure | E-Net | Random Forest | DLDA | Lasso | Best-1
Bootstrapped estimates | Sensitivity | 0.696 | 0.623 | 0.691 | 0.694 | 0.659
Resubstitution estimates | Sensitivity | 0.895 | 1 | 0.895 | 0.895 | 0.684

  1. Sensitivity is defined as true positives (TP)/(TP + false negatives (FN)); specificity = true negatives (TN)/(TN + false positives (FP)); accuracy = (TP + TN)/(TP + TN + FP + FN).
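The definitions in the footnote to Table 3 translate directly into code. In the Python sketch below, tumor is treated as the positive class, which is an assumption consistent with the text; the labels and predictions are hypothetical.

```python
def operating_characteristics(truth, predicted, positive="tumor"):
    """Return (sensitivity, specificity, accuracy) for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    tn = sum(t != positive and p != positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical out-of-bag specimens from one bootstrap iteration.
truth = ["tumor", "tumor", "tumor", "normal", "normal"]
predicted = ["tumor", "tumor", "normal", "normal", "tumor"]
sens, spec, acc = operating_characteristics(truth, predicted)
```

Applying the same function to the training specimens instead of the out-of-bag specimens gives the resubstitution estimates, which is why those are systematically more optimistic.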

Figure 3 is a scatterplot of the estimates of sensitivity and specificity over all 500 bootstrap samples for the five methods. This plot demonstrates the variability in sensitivity and specificity that could occur when sampling from the population from which the study sample was drawn. If there was no evaluation of the variability of estimation, then the observed sensitivity and specificity could be any one of the graphed values, and it would not be possible to determine how typical or atypical those estimates might be. It is seen that the observed sensitivities and specificities (Table 3) are roughly at the modes in Figure 3.

Figure 3.

Specificity and sensitivity of 500 bootstrap samples.

Variables Selected

Table 4 presents the importance index of each variable on a scale of 0–500 (the number of bootstrap iterations). The values for elastic net, random forest, DLDA, and lasso are much larger than those for Best-1 discrimination because the total number of variables chosen over 500 bootstrap iterations of Best-1 is fixed at a total of 500. The table is ordered by the average importance index over the five methods, where the top variables are, on average, most important. There is a fairly large drop in importance after the fifth most important variable, CKN_SM (cytokeratin negative, lymphoid scatter). Except for CKN_SM and CPG2117N133P (cytokeratin+ aneuploid CD117 negative CD133+), the five most important variables do not cluster together (Fig. S3), suggesting that they represent different processes. Figure 4 shows the first two principal components of the top five variables. The variables commonly chosen are mostly coherent on the average, but different methods can be fairly discordant on the same bootstrap sample. Table 5 displays the concordance of the five methods in choosing CKP_GT2N (cytokeratin+ aneuploid), the variable with the highest mean importance index, across the bootstrap samples. Even though elastic net and lasso are similar in concept, they are discordant in 86/500 = 17.2% of the bootstrap samples. Best-1, which is not smoothed or regularized in any way, is usually discordant with the other methods.

Figure 4.

Two most important variables versus principal component analysis discriminating between tumor and normal lung. The plot on the left shows the ability of the two most predictive variables (cytokeratin+ aneuploid and cytokeratin+ aneuploid CD117 negative CD133+) to distinguish between tumor and normal lung. The plot on the right shows the first two principal components of the five most important variables (Table 4). Numbers refer to the specimen ID column in Table 6. The case number indicates the frequency with which the case was correctly classified (lower number more frequently classified correctly).

Table 4. Importance of variables on a scale of 0–500
Table 5. Concordance between methods in choosing CKP_GT2N (cytokeratin+ aneuploid cells) as a significant classifying variable over 500 bootstrap iterations
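Method | Included | DLDA N | DLDA Y | Lasso N | Lasso Y | Best-1 N | Best-1 Y
RF | N | 68 | 85 | 65 | 88 | 142 | 11
RF | Y | 30 | 317 | 90 | 257 | 247 | 100
DLDA | N | | | 51 | 47 | 98 | 0
DLDA | Y | | | 104 | 298 | 291 | 111
Lasso | N | | | | | 148 | 7
Lasso | Y | | | | | 241 | 104

  1. Two-by-two tables are shown for each pair of methods. “Y” indicates the given model included CKP_GT2N, “N” indicates that it was excluded.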
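A two-by-two concordance table of this kind can be built from the per-iteration selection indicators of any two methods. The Python sketch below uses hypothetical indicators for five iterations; the function name is illustrative.

```python
from collections import Counter


def concordance_table(selected_a, selected_b):
    """Cross-tabulate two methods' selection indicators and the discordance rate."""
    counts = Counter(zip(selected_a, selected_b))
    table = {(a, b): counts.get((a, b), 0)
             for a in (False, True) for b in (False, True)}
    discordant = table[(False, True)] + table[(True, False)]
    return table, discordant / len(selected_a)


# Hypothetical: did each method select a given variable in each of five iterations?
method_a = [True, True, False, True, False]
method_b = [True, False, False, True, True]
table, discordance = concordance_table(method_a, method_b)
```

The discordance rate is the sum of the two off-diagonal cells divided by the number of iterations, the same arithmetic as the 86/500 = 17.2% quoted in the text for elastic net versus lasso.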

Figure 5 displays the distributions of the numbers of variables selected by the different methods across the bootstrap samples (Best-1 is not included because it always chooses exactly one variable). As expected, the numbers of variables in the elastic net models with the most variables were larger than similar numbers associated with the lasso (Fig. 5). Reducing the mixing parameter α shifts the elastic net curve to the right, capturing more variables, but not increasing the accuracy of classification (data not shown). Although DLDA's number of variables is smaller than the first two, it has a longer tail, describing a few runs where over 20 variables were selected. Random forest tends to build more parsimonious models, but also sometimes will indicate that 20 or even 30 variables are useful in classifying specimens.

Figure 5.

Numbers of variables chosen by individual method (Best-1 always chooses one variable).

Classification of Individual Specimens

Table 6 shows the frequency with which each specimen was correctly classified (tumor or normal lung) by each method over 500 bootstrap iterations. The column labeled c.v. is a measure of concordance of the methods, with low coefficients of variation indicating better concordance. All of the methods have difficulty with specimens 27 through 35.

Table 6. Proportion of correct classifications by method for each specimen
ID    E-Net    Random Forest    DLDA    Lasso    Best-1    c.v.
A value close to 1 indicates the specimen was almost always classified correctly. The c.v. column indicates the coefficient of variation of the five methods; higher values indicate more variability in the classification of the given observation between methods. Specimen IDs correspond to Figure 4.
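The c.v. column can be illustrated with a short calculation. This is a sketch under stated assumptions: the sample standard deviation (n − 1 denominator) is used, which the text does not specify, and the proportions below are hypothetical values rather than entries from Table 6.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        return float("nan")  # undefined when the mean proportion is zero
    return stdev(values) / m


# Hypothetical proportions correct for E-Net, RF, DLDA, Lasso, Best-1.
concordant_specimen = [0.98, 0.97, 0.99, 0.98, 0.97]
discordant_specimen = [0.98, 0.97, 0.99, 0.98, 0.60]

cv_concordant = coefficient_of_variation(concordant_specimen)
cv_discordant = coefficient_of_variation(discordant_specimen)
print(round(cv_concordant, 3), round(cv_discordant, 3))
```

A specimen on which one method performs very differently from the rest receives a much larger c.v. than one on which all five methods agree.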



The secondary analysis of multidimensional flow cytometry data can be daunting, in part because there are arbitrarily many ways that data can be parsed when conventional gate/region-based analysis is used. During exploratory data analysis, such as the example presented here, there is usually a fundamental objective that drives the analysis. In this case, it was to discern any differences in the expression of stem/progenitor markers between non-small cell lung tumors and adjacent normal lung. As tumors are heterogeneous and also contain much non-neoplastic tissue (stromal, vascular, and immune cells), our objectives were to: 1) examine cytokeratin (a definitive epithelial marker) versus the stem/progenitor markers, one at a time; and then 2) break the data into four classes (cytokeratin+ euploid, cytokeratin+ aneuploid, cytokeratin negative euploid, and cytokeratin negative aneuploid) to study the light scatter properties and the coexpression of the stem/progenitor markers in a pairwise fashion. As the latter categories are subsets of the former, it is expected that many of the variables derived from them will be correlated. Another problem inherent to multivariate cytometry data sets is that the number of variables (derived from proportions of analytical regions) is large, often exceeding the number of specimens. In this example, we had 86 variables (p) and only 35 specimens (n). Conventional bivariate analysis would require adjustment for the number of comparisons, greatly diluting the statistical power to detect differences.
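The cost of adjusting for multiple comparisons can be seen in a short worked example. It assumes a Bonferroni correction and a familywise significance level of 0.05, neither of which is specified in the text; only p = 86 comes from the data set.

```latex
\alpha_{\text{per test}} = \frac{\alpha_{\text{family}}}{p}
  = \frac{0.05}{86} \approx 5.8 \times 10^{-4}
```

With only 35 specimens, a threshold this stringent for each variable leaves little power to detect any but the largest differences.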

In this report, we offer an objective method to deal with multidimensional data sets where there is the potential for highly correlated variables and where p >> n. We used five different methods and found two, elastic net and the lasso, particularly useful for this data set, because they consistently chose a manageable number of variables (Fig. 5) that resulted in excellent discrimination between tumor and normal (Fig. 4). Random forest and DLDA were also useful and might prove superior for other data sets. Only best-1 was inadequate to the purpose of this analysis, because it inherently chooses the single best discriminating variable (Fig. 3). An area which was not addressed in this study is the effect of the variance of individual measurements. As raw event counts vary between variables and between samples, not all measured results are equally certain. Future analytical models can take this into account.

Classifying multivariate data against known categories (tumor and normal) can accomplish two important objectives: picking out differences of potentially biological importance for further exploration, and model building for prospective classification of unknowns. Requirements for accuracy are more relaxed for the first application, and considerably more stringent if the model is going to be used prospectively to classify unknown individual cases. In the present data set the model performed excellently for the first application, but consistently misclassified specific individual cases (Table 6). Given that we evaluated 86 variables covering the expression and coexpression of four stem/progenitor associated markers, the similarity in expression patterns between tumor and normal was striking (Fig. S3), suggesting that even among the most deranged tumor cells, these proteins serve important functions that are conserved. Among the most important markers that did distinguish tumor and normal were those in routine use by pathologists, such as cytokeratin+ aneuploid cells in tumors, and cytokeratin+, small (low scatter), euploid cells in normal lung. In addition to these known discriminators, selection of which validates the methodology proposed here, there are also several interesting leads (Table 4), such as the expression of the stem/progenitor marker CD133 (CPG2117N133P) and the coexpression of CD117 and CD44 (CP2117P44P). Indeed, CD133, coexpressed with the proliferation marker Ki67, has recently been proposed as a marker of poor prognosis in non-small cell lung cancer (10), but our comparison with normal tissue reveals a greater prevalence of cytokeratin+ cycling CD117 negative/CD133+ cells in normal lung (chosen by all 5 methods, Table 4). Similarly, coexpression of CD117, a lung stem cell-associated growth factor receptor (11), in euploid (or pseudo-diploid) cells, and CD44, a principal marker associated with tumorigenicity in breast cancer (7, 12), also distinguished tumor from normal. Although small data sets such as ours cannot by their nature offer conclusive proof that such markers are of mechanistic, diagnostic, or prognostic significance, they provide a sound rationale for prospective studies and a model for confirmatory data analysis.


The authors would like to acknowledge our clinical collaborators Drs. James D. Luketich and Rodney J. Landreneau, as well as Dr. Ludovic Zimmerlin, Melanie Pfeifer, James Arbore, and E. Michael Meyer for their assistance in compiling the data set used in this analysis. Biostatistical analysis was supported by P30CA047904 and P50CA090440. The authors have no conflicts of interest to declare.