One of the fundamental uses of flow cytometry is the identification and quantification of distinct cell subsets with phenotypes defined by the density of cell surface or intracellular markers. Ideally, such a biological classification should be objective, stable, and predictive (1).

Objectivity, stability, and predictivity are all problematic with the traditional approach, in which samples are gated sequentially in one or two dimensions. In particular, the choice of which sequence of markers to gate on, and where to draw the gates, depends on expertise and is highly subjective. This makes it difficult to replicate the cell subset identification procedure across laboratories. The problem is compounded in polychromatic flow cytometry, because the number of possible gating sequences rises rapidly with the number of channels used. In a recent study of flow cytometric standardization involving 15 institutions, the mean inter-laboratory coefficient of variation ranged from 17 to 44%, even though preparation was standardized and performed using the same samples and reagents at each site (2). Most of the variation was attributed to gating, even though all analyses were conducted by individuals with expertise in antigen-specific flow cytometry. Manual compensation and gate delineation are also extremely time-consuming, and hence a major cost factor in large-scale clinical flow cytometric analysis. For these reasons, a reliable automated approach to flow cytometric analysis is desirable (3).

Statistical mixture models are widely used in scientific problems where objects represented in several or many dimensions are to be clustered or classified. One appeal of mixture models is their ability to represent essentially any observed data distribution to a high degree of accuracy (4, 5). Useful background on the methodology and ideas underlying mixture models, together with some specific applications, appears in Refs. 6 and 7; a range of biomedical examples that offer insight into various applied aspects appears in Refs. 8–10. In some applications, identifying the underlying scientific meaning of specific mixture components is of relevance, whereas in others mixtures are of interest primarily as flexible data smoothers. Our interest here is the utility of multivariate mixture models for flow cytometric cell subtype identification, so the resolution of individual mixture components matters. One potential advantage is that flow cytometric data for all markers are modeled and resolved simultaneously, so that determining a one- or two-dimensional marker sequence for gating is unnecessary. Further, the analysis of mixtures using current computational statistical technology is automatic, and applies in as many dimensions as there are markers.
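
For concreteness, the generic form of a finite mixture density can be written as below. The notation is an assumption introduced here for illustration and is not taken from the text: $k$ is the number of components, $\pi_j$ are the mixing weights, and $f$ is a component density with parameters $\theta_j$. In the Gaussian case, $f$ would be a multivariate normal density with mean vector $\mu_j$ and covariance matrix $\Sigma_j$, evaluated at a $p$-dimensional vector $x$ of marker measurements for one cell.

```latex
p(x) \;=\; \sum_{j=1}^{k} \pi_j \, f(x \mid \theta_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1,
\qquad\text{with, in the Gaussian case,}\quad
f(x \mid \theta_j) = \mathrm{N}(x \mid \mu_j, \Sigma_j).
```

Under this formulation, assigning each cell to the component (or group of components) most likely to have generated it plays the role that a sequence of manual gates plays in the traditional analysis.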

We describe our studies of statistical mixture modeling using Gaussian mixtures for flow cytometric data densities. We use a Bayesian mixture modeling approach that is effectively standard modern statistical methodology, and fit such models using Markov chain Monte Carlo (MCMC) computational algorithms (11, 12). The basic mixture model framework seems apt for modeling distinct cell subsets, each of which may be reflected in one or more of the multivariate Gaussian components. The analysis can exploit biological expert knowledge in the specification of priors where it is available, or alternatively can be run in a default, objective mode. The MCMC computations use a standard, efficient, and flexible algorithm that requires little tuning and that works well with the complex multimodal distributions that routinely arise in flow cytometric recordings.
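
To give a feel for how an MCMC fit of a Gaussian mixture proceeds, the following is a minimal sketch of a Gibbs sampler for a one-dimensional mixture. It is not the algorithm or software used in this work, which handles multivariate data. The function name `gibbs_gmm_1d`, the conjugate priors, and the hyperparameter values (`m0`, `s0sq`, `a0`, `b0`) are all assumptions chosen for illustration.

```python
import numpy as np

def gibbs_gmm_1d(x, k=2, n_iter=500, seed=0):
    """Gibbs sampler for a 1-D Gaussian mixture (illustrative sketch only).

    Assumed priors (hypothetical defaults, not from the paper):
      weights     ~ Dirichlet(1, ..., 1)
      mean_j      ~ Normal(m0, s0sq)
      precision_j ~ Gamma(shape=a0, rate=b0)
    Returns the final draw of (weights, means, std devs, labels).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    m0, s0sq = x.mean(), x.var()   # vague prior centred on the data
    a0, b0 = 1.0, 1.0              # assumed Gamma hyperparameters

    # Initialise means at spread-out quantiles, common precision, equal weights.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    tau = np.full(k, 1.0 / x.var())
    w = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # 1. Sample component labels from their conditional probabilities.
        logp = np.log(w) + 0.5 * np.log(tau) - 0.5 * tau * (x[:, None] - mu) ** 2
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        u = rng.random(n)
        z = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
        z = np.minimum(z, k - 1)   # guard against floating-point rounding

        # 2. Sample mixing weights given the label counts.
        counts = np.bincount(z, minlength=k)
        w = rng.dirichlet(1.0 + counts)

        # 3. Sample each component's mean, then its precision.
        for j in range(k):
            xj = x[z == j]
            nj = xj.size
            prec = 1.0 / s0sq + nj * tau[j]
            mean = (m0 / s0sq + tau[j] * xj.sum()) / prec
            mu[j] = rng.normal(mean, 1.0 / np.sqrt(prec))
            shape = a0 + 0.5 * nj
            rate = b0 + 0.5 * ((xj - mu[j]) ** 2).sum()
            tau[j] = rng.gamma(shape, 1.0 / rate)

    return w, mu, 1.0 / np.sqrt(tau), z
```

In practice one would retain many draws after a burn-in period rather than only the last one, and would need to address label switching between components. The sketch omits both to stay short. Informative priors based on expert knowledge would enter through the hyperparameters, whereas the vague, data-centred values used here correspond loosely to a default mode.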

We demonstrate the application and utility of mixture modeling of flow cytometric data in a series of examples in which multimodality arises naturally because individual mixture components, or groups of components, map well to biologically relevant cell subsets. The examples involve analyses of four-color flow cytometric data from human peripheral blood and murine cell line samples. They demonstrate the methodology and suggest that Bayesian mixture models can be used effectively, and automatically, to identify cell subsets, improve specificity, and remove outliers in such data. We also provide supporting software for others interested in such analyses.1