Analysis of flow cytometry data is usually performed in two steps. In primary analysis, variables such as the proportion of events having given properties (e.g., percent CD3+ of CD45+) are extracted from individual Flow Cytometry Standard (FCS) files. Secondary analysis involves statistical comparison of the tabulated variables between groups or treatments. Although the use of statistical classification methods to automate the extraction of markers from raw or minimally processed FCS files is an emerging field (1–4), the application of similar methodologies to secondary data analysis has not been exploited. Regardless of the method of primary analysis, the number of candidate variables detected in multidimensional cytometry data may exceed the number of specimens to be compared. Thus, simple bivariate comparison, one variable at a time, lacks statistical power to detect between-group differences and is blind to correlations that may exist between variables. Clearly, a multivariate approach is needed for the secondary analysis of the large number of variables that are routinely extracted from multidimensional FCS files. When the number of specimens is limited, the conventional approach of nonoverlapping training and validation data sets is not feasible. In such cases, a naïve multivariate analysis may overfit the data and then optimistically estimate the magnitude of the association between the variables and the group or treatment, resulting in findings that cannot be replicated. Further, when the variables are strongly intercorrelated, small changes in the data or classification method may change which variables are identified as significant. There are a number of suitable contemporary computational methods for variable selection and modeling, but they have fundamentally different theoretical motivations, require tuning, and can choose different variables as "significant."
Analysts who have success with one method on one data set tend to apply that method to subsequent data sets. Thus, the results of a particular analysis may become reified without a sense of their sensitivity to different variable selection methods or random variation.
In this report, we propose an analysis workflow to address these issues. Given a previously published set of variables assessed on a limited number of specimens (5), our goal is to identify an appropriate analytical method to select variables related to a dichotomous condition (nonsmall cell lung cancer versus normal lung), to build a model of the condition as a mathematical function of the variables that can be used to classify subsequent specimens, and to characterize the robustness of the analysis. By focusing on a flow cytometry data set previously analyzed using simple bivariate comparisons between tumor and adjacent lung specimens, the present analysis will address hazards particular to flow cytometry data: 1) highly correlated variables result in competing models that are difficult to compare; 2) statistical tests summarized by p-values do not necessarily produce sets of variables that reproducibly discriminate between conditions; and 3) estimates of sensitivity and specificity must be adjusted when the number of specimens is insufficient to form independent training and testing sets. The proposed workflow, summarized here and presented in detail as online Supporting Information, provides a practical approach to the secondary analysis of multidimensional flow cytometry data.
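The optimism described in hazard 3 can be illustrated with a small simulation. The sketch below is not the workflow presented in the Supporting Information; it is a minimal illustration under stated assumptions: the data are pure Gaussian noise with arbitrary group labels, the classifier is a simple nearest-centroid rule chosen only for brevity, and all function names (`simulate_noise`, `resubstitution_accuracy`, `loo_accuracy`) are hypothetical. It compares accuracy estimated on the same specimens used to fit the model (resubstitution) with accuracy estimated by leave-one-out cross-validation when variables far outnumber specimens.

```python
import random


def simulate_noise(n_per_group, n_vars, seed=0):
    """Return (X, y): pure-noise variables with arbitrary 0/1 group labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)]
         for _ in range(2 * n_per_group)]
    y = [0] * n_per_group + [1] * n_per_group
    return X, y


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def fit(X, y):
    """Nearest-centroid 'model': one centroid per group label."""
    return {g: centroid([x for x, yi in zip(X, y) if yi == g]) for g in set(y)}


def predict(model, x):
    """Assign x to the group whose centroid is closest."""
    return min(model, key=lambda g: sq_dist(x, model[g]))


def resubstitution_accuracy(X, y):
    """Accuracy on the same specimens used for fitting (optimistic)."""
    model = fit(X, y)
    return sum(predict(model, x) == yi for x, yi in zip(X, y)) / len(y)


def loo_accuracy(X, y):
    """Leave-one-out accuracy: each specimen is classified by a model
    fitted without it."""
    hits = 0
    for i in range(len(y)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(y)


if __name__ == "__main__":
    # 20 specimens, 500 noise variables: no true group difference exists.
    X, y = simulate_noise(n_per_group=10, n_vars=500, seed=1)
    print("resubstitution accuracy:", resubstitution_accuracy(X, y))
    print("leave-one-out accuracy: ", loo_accuracy(X, y))
```

Because the variables carry no information about group, any apparent discrimination is an artifact of fitting. In a setting like this one, the resubstitution estimate is typically close to perfect, whereas the leave-one-out estimate should fall near or below chance, which is the gap that an adjusted estimate of sensitivity and specificity is meant to expose.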