### Abstract

Statistical mixture modeling provides an opportunity for automated identification and resolution of cell subtypes in flow cytometric data. The configuration of cells as represented by multiple markers simultaneously can be modeled arbitrarily well as a mixture of Gaussian distributions in the dimension of the number of markers. Cellular subtypes may be related to one or multiple components of such mixtures, and fitted mixture models can be evaluated in the full set of markers as an alternative, or adjunct, to traditional subjective gating methods that rely on choosing one or two dimensions. Four-color flow data from human blood cells labeled with FITC-conjugated anti-CD3, PE-conjugated anti-CD8, PE-Cy5-conjugated anti-CD4, and APC-conjugated anti-CD19 Abs were acquired on a FACSCalibur. Cells from four murine cell lines, JAWS II, RAW 264.7, CTLL-2, and A20, were also stained with FITC-conjugated anti-CD11c, PE-conjugated anti-CD11b, PE-Cy5-conjugated anti-CD8a, and PE-Cy7-conjugated anti-CD45R/B220 Abs, respectively, and single-color flow data were collected on an LSRII. The data were fitted with a mixture of multivariate Gaussians using standard Bayesian statistical approaches and Markov chain Monte Carlo computations. Statistical mixture models were able to identify and purify major cell subsets in human peripheral blood, using an automated process that can be generalized to an arbitrary number of markers. Validation against both traditional expert gating and synthetic mixtures of murine cell lines with known mixing proportions was also performed. This article describes studies of statistical mixture modeling of flow cytometric data, and demonstrates the utility of the approach in examples with four-color flow data from human peripheral blood samples and synthetic mixtures of murine cell lines. © 2008 International Society for Advancement of Cytometry

One of the fundamental uses of flow cytometry is the identification and quantification of distinct cell subsets with phenotypes defined by the density of cell surface or intracellular markers. Ideally, such a biological classification should be objective, stable, and predictive (1).

Objectivity, stability, and predictivity are all problematic with the traditional approach in which samples are sequentially gated in one or two dimensions. In particular, the choice of which sequence of markers to gate on and where to draw the gates depends on expertise and is highly subjective. This makes it difficult to replicate the cell subset identification procedure across different laboratories. The problem is compounded with polychromatic flow cytometry, because the number of possible gating sequences rises rapidly with the number of channels used. In a recent study of flow cytometric standardization involving 15 institutions, the mean inter-laboratory coefficient of variation ranged from 17 to 44%, even though preparation was standardized and performed using the same samples and reagents at each site (2). Most of the variation was attributed to gating, even though all analyses were conducted by individuals with expertise in antigen-specific flow cytometry. The process of manual compensation and gate delineation is also extremely time-consuming, and hence a major cost factor in large-scale clinical flow cytometric analysis. For these reasons, a reliable automated approach to flow cytometric analysis is desirable (3).

Statistical mixture models are widely used in scientific problems where objects represented in several or many dimensions are to be clustered or classified. One appeal of mixture models is the ability to represent essentially any observed data distribution to a high degree of accuracy (4, 5). Useful background on the methodology and ideas underlying mixture models, as well as some specific applications, appears in Refs. 6 and 7, and a range of examples in biomedical problems that provide insight into various applied aspects appears in Refs. 8–10. In some applications, identifying the underlying scientific meaning of specific mixture components is of relevance, whereas in others mixtures are of interest primarily as flexible data smoothers. Our interest here is the utility of multivariate mixture models for flow cytometric cell subtype identification, so that the resolution of mixture components is of interest. Part of the potential utility lies in directly modeling and resolving flow cytometric data for all the markers simultaneously, so that determining a one- or two-dimensional marker sequence for gating is unnecessary. Further, the analysis of mixtures using current computational statistical technology is automatic, and will apply in as many dimensions as we have markers.

We describe our studies of statistical mixture modeling using Gaussian mixtures for flow cytometric data densities. We use a Bayesian mixture modeling approach that is effectively standard modern statistical methodology, and fit such models using Markov Chain Monte Carlo (MCMC) computational algorithms (11, 12). The basic mixture model framework seems apt for modeling distinct cell subsets, each of which may be reflected in one or more of the multivariate Gaussian components. The analysis is open to exploiting biological expert knowledge in the specification of priors where available, or alternatively can be run in default, objective mode. The MCMC computations use a standard, efficient, and flexible algorithm that requires little tuning, and that works well with complex multimodal distributions which routinely arise with flow cytometric recordings.
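For reference, the Gaussian mixture density and the posterior probability of component membership can be written as follows. The notation here ($K$, $\pi_k$, $\mu_k$, $\Sigma_k$, $z$) is introduced for illustration and is not taken from the text above; $x \in \mathbb{R}^p$ denotes the vector of $p$ marker intensities for one event.

```latex
f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

\Pr(z = k \mid x) =
  \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}.
```

In a Bayesian analysis the weights, means, and covariance matrices are given priors and their posterior is explored by MCMC; a cell subset may correspond to one component or to a group of components.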

We demonstrate the application and utility of mixture modeling of flow cytometric data in a series of examples in which multimodality arises naturally as a result of individual or groups of mixture components that map well to biologically relevant cell subsets. The examples involve analyses of four-color flow cytometric data of human peripheral blood and murine cell line samples. They demonstrate the methodology and suggest that Bayesian mixture models can be used effectively to identify cell subsets, improve specificity, and remove outliers in such data, and to do so automatically. We also provide supporting software for others interested in such analyses.1

### DISCUSSION

We have shown in this article that Bayesian mixture models can extract biologically meaningful components (cell subsets) from flow cytometric data, by defining putative cell subsets as groups of mixture components fitting the following criteria:

1. Each component must have a density greater than some threshold to distinguish it from noise; a reasonable threshold density is the density of a single multivariate normal fitted to the same data set.
2. Each component must have a well-conditioned covariance matrix.
3. The components, taken together, have a single mode.
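As a rough illustration, criteria 1 and 2 could be screened along the following lines. This is a minimal sketch under stated assumptions, not the procedure used in this study: "density" is interpreted as the weighted peak density of a component (evaluated at its own mean), the condition-number cutoff `max_condition` is an arbitrary illustrative value, and the function names are hypothetical. Criterion 3 (a single mode for the merged components) is not covered here.

```python
import numpy as np

def peak_density(cov):
    """Density of a multivariate normal evaluated at its own mean."""
    p = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return float(np.exp(-0.5 * (p * np.log(2.0 * np.pi) + logdet)))

def screen_components(data, weights, covs, max_condition=1e6):
    """Return a boolean array: True where a component passes criteria 1 and 2."""
    # Reference threshold: one multivariate normal fitted to the whole data set.
    threshold = peak_density(np.cov(data, rowvar=False))
    keep = []
    for w, cov in zip(weights, covs):
        # Criterion 2: covariance matrix must be well conditioned (assumed cutoff).
        well_conditioned = np.linalg.cond(cov) < max_condition
        # Criterion 1: weighted peak density must exceed the reference (assumed reading).
        dense_enough = well_conditioned and w * peak_density(cov) > threshold
        keep.append(bool(dense_enough))
    return np.array(keep)
```

A broad, low-weight component that mostly absorbs scattered events fails the first check, while a component whose covariance has collapsed along one axis fails the second.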

It may also be necessary to merge components that result from events piling up against an axis, typically resulting from log transformation of the data. This is an artifact of the log transform, and disappears with linear FCS 3.0 data using the hyperlog (18) or logicle (19) transforms (data not shown).

Using these putative cell subsets, we show that the accuracy of event classification can be improved by thresholding on the posterior density of each event, or by selecting events from a smaller coverage set. Critically, this analysis also provides us with a robust statistical model of flow cytometric data that can potentially be used for the rigorous statistical comparison of two or more flow cytometric data sets. If flow cytometric data can be normalized in a standardized fashion, it should also be possible to build up a training set of flow cytometric samples, allowing future automated classification of cell subsets in a sample, as well as classification of entire data samples (e.g., as “normal” or “abnormal”).
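The idea of trading coverage for purity by thresholding can be sketched as below. The sketch assumes fixed point estimates of the mixture parameters (for example, posterior means) rather than full MCMC output, and it thresholds on the posterior probability of subset membership, which is one possible reading of the thresholding described above. The names `classify_with_rejection`, `subset_of`, and `min_prob` are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate normal at each row of x."""
    p = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def classify_with_rejection(x, weights, means, covs, subset_of, min_prob=0.9):
    """Assign each event to a cell subset, or -1 if no subset is probable enough.

    subset_of[k] is the subset label of mixture component k, so several
    components can be merged into one putative cell subset.
    """
    log_joint = np.column_stack([
        np.log(w) + log_gaussian(x, m, c)
        for w, m, c in zip(weights, means, covs)
    ])
    # Normalise in log space for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    resp /= resp.sum(axis=1, keepdims=True)
    # Sum component probabilities within each subset.
    subset_of = np.asarray(subset_of)
    labels = np.unique(subset_of)
    subset_prob = np.column_stack(
        [resp[:, subset_of == s].sum(axis=1) for s in labels]
    )
    assigned = labels[subset_prob.argmax(axis=1)]
    # Reject events whose best subset probability falls below the threshold.
    assigned[subset_prob.max(axis=1) < min_prob] = -1
    return assigned
```

Raising `min_prob` leaves more events unclassified but should make the retained events in each subset purer.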

In the traditional analysis of flow cytometry, the choice of which sequence of markers to gate on and where to draw the gates depends on expertise, and it is difficult to replicate cell subset identification and quantitation across different laboratories. The problem is compounded with an increasing number of colors, as the number of possible gating sequences rises rapidly with the number of markers used. The process of manual compensation and gate delineation is also extremely time-consuming, and hence a major cost factor in large-scale clinical flow cytometric analysis.

It is clear that some of the complexity of flow cytometric analysis arises from the use of 1D or 2D tools to analyze data that exist in higher-dimensional parameter space. The use of appropriate multivariate statistical approaches would therefore result in a simpler workflow and may also offer greater accuracy in the quantitation of cell subsets, because projections in 2D that cannot be resolved may often be separable with a higher-dimensional partition surface. For these reasons, several automated heuristic-based methods (e.g., discriminant analysis, neural networks, support vector machines) have been suggested for the clustering and classification of flow cytometry data.

However, we believe that a more principled model-based approach using mixture models has many benefits: for example, we can use the model to detect anomalous events, determine rejection criteria to minimize misclassifications, and even construct models for combining different sets of data. In addition, the analysis of mixtures using current computational statistical technology is automatic, and will apply in as many dimensions as we have markers. Unlike the approaches outlined earlier, statistical mixture methods can also more easily exploit biological expert knowledge in the specification of priors. Furthermore, calculation of estimation intervals and the assessment of uncertainty in the posterior probability of belonging to groups are simple with a Bayesian approach, allowing us to control the purity of extracted cell subsets naturally. Importantly, a model-based approach allows increasingly sophisticated models to be constructed so as to better capture data constraints (e.g., hierarchical cluster structure), which will result in more accurate and efficient classification schemes.

Bayesian mixture models can be extended to an arbitrary number of dimensions, and we are actively researching their utility for the analysis of polychromatic flow cytometry. Several challenges have to be overcome, however, for this to be practical. We have described a simple but adequate strategy for determining the number of components, by adding components until the contribution of the last added component is negligible. With a larger number of dimensions and a corresponding increase in the number of mixture components necessary, such an approach may be too inefficient. More advanced sampling methods can estimate the number of components directly (7, 20–22). In high dimensions, it is also critical to develop more efficient samplers, and advances in this direction include gradient optimization (23) and the combination of variational optimization with MCMC (24).
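The simple stopping rule described above, adding components until the newest one contributes negligibly, might be sketched as follows. This is only an illustration: `fit_weights` stands for a caller-supplied, hypothetical function that fits a k-component mixture (by MCMC or otherwise) and returns its mixing proportions, the smallest weight is used as an assumed proxy for "the contribution of the last added component", and `tol` is an arbitrary cutoff.

```python
def choose_n_components(fit_weights, max_components=20, tol=0.01):
    """Add components one at a time until the newest one contributes little.

    fit_weights(k) must return the k mixing proportions of a fitted
    k-component mixture (hypothetical, caller-supplied).
    """
    for k in range(2, max_components + 1):
        weights = fit_weights(k)
        # Assumed proxy for the last added component's contribution:
        # the smallest mixing proportion in the k-component fit.
        if min(weights) < tol:
            return k - 1
    return max_components
```

Because every candidate value of k requires a full model fit, the cost grows quickly with the number of components, which is the inefficiency noted above for high-dimensional data.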

We believe that Bayesian models are a promising approach to the automated or semiautomated analysis of flow cytometric data. With the increasing dimensionality and volume of flow cytometric data being generated, such an approach is likely to prove ever more useful and necessary.