Flow cytometry (FCM) can be used to analyze thousands of samples per day. However, because each dataset typically consists of multiparametric descriptions of millions of individual cells, data analysis can present a significant challenge. As a result, despite its widespread use, FCM has not reached its full potential because of the lack of an automated analysis platform to parallel the high-throughput data-generation platform. As noted in a recent Communication to the Editor (1), in contrast to the tremendous interest in FCM technology, there is a dearth of statistical and bioinformatics tools to manage, analyze, present, and disseminate FCM data. There is considerable demand for appropriate software tools, as manual analysis of individual samples is error-prone, nonreproducible, nonstandardized, not open to reevaluation, and inordinately time-consuming, making it a limiting aspect of the technology (2–10).

One major component of FCM analysis involves gating, the process of identifying homogeneous groups of cells that display a particular function. This identification of cell populations currently relies on using software to apply a series of manually drawn gates (i.e., data filters) that select regions in 2D graphical representations of the data. This process is based largely on intuition rather than standardized statistical inference (3, 11, 12). It also ignores the high dimensionality of FCM data, which may convey information that cannot be displayed in 1D or 2D projections. This is illustrated in Supplementary Figure 1 with a synthetic two-dimensional dataset generated from a *t*-mixture model (13) with three components. While the three clusters can be identified using both dimensions, the structure is hardly recognizable when the dataset is projected onto either dimension. Such an example illustrates the potential loss of information if we disregard the multivariate nature of the data. The same problem occurs when projecting data of three or more dimensions onto two dimensions.
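
The effect can be reproduced with a minimal Python sketch. It does not use the dataset of Supplementary Figure 1; the cluster geometry (three bivariate *t* clusters with 4 degrees of freedom, elongated along the diagonal and offset across it), the sample sizes, and the helper names `sample_t_cluster`, `mad`, and `separation` are all assumptions made for illustration.

```python
import math
import random
import statistics

def sample_t_cluster(rng, n, center_offset, nu=4.0, long_scale=5.0, short_scale=0.3):
    """Draw n points from a bivariate t cluster elongated along the (1, 1) diagonal."""
    points = []
    for _ in range(n):
        # A multivariate t draw is a Gaussian draw divided by sqrt(w),
        # with w ~ Gamma(shape=nu/2, scale=2/nu), i.e. chi-square(nu) / nu.
        w = rng.gammavariate(nu / 2.0, 2.0 / nu)
        along = long_scale * rng.gauss(0.0, 1.0) / math.sqrt(w)
        across = center_offset + short_scale * rng.gauss(0.0, 1.0) / math.sqrt(w)
        # Rotate from (along, across) coordinates to the observed (x, y) axes.
        x = (along + across) / math.sqrt(2.0)
        y = (along - across) / math.sqrt(2.0)
        points.append((x, y))
    return points

def mad(values):
    """Median absolute deviation, a spread measure that tolerates heavy tails."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def separation(values_by_cluster):
    """Smallest gap between adjacent cluster medians over the largest within-cluster MAD."""
    medians = sorted(statistics.median(v) for v in values_by_cluster)
    gap = min(hi - lo for lo, hi in zip(medians, medians[1:]))
    return gap / max(mad(v) for v in values_by_cluster)

rng = random.Random(0)  # fixed seed so the illustration is repeatable
clusters = [sample_t_cluster(rng, 500, c) for c in (-2.0, 0.0, 2.0)]

# Separation when only one measured dimension is inspected at a time.
sep_x = separation([[p[0] for p in c] for c in clusters])
sep_y = separation([[p[1] for p in c] for c in clusters])
# Separation along the rotated axis (x - y) / sqrt(2), which needs both dimensions.
sep_diag = separation([[(p[0] - p[1]) / math.sqrt(2.0) for p in c] for c in clusters])

print(f"x only: {sep_x:.2f}  y only: {sep_y:.2f}  using both: {sep_diag:.2f}")
```

A separation well above 1 along the rotated axis, together with values below 1 along x and y, means the clusters are distinct in two dimensions yet overlap heavily in each 1D projection.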

Several attempts have been made to automate the gating process. Among those, the *K*-means algorithm (14) has found the most applications (15–18). Demers et al. (17) have proposed an extension of *K*-means allowing for nonspherical clusters, but this algorithm has been shown to perform worse than fuzzy *K*-means clustering (18). In fuzzy *K*-means (19), each cell can belong to several clusters with different degrees of association, rather than belonging completely to only one cluster. Even though fuzzy *K*-means takes some form of classification uncertainty into consideration, it is a heuristic-based algorithm and lacks a formal statistical foundation. Other popular choices include hierarchical clustering algorithms (e.g., linkage or Pearson coefficient methods). However, these algorithms are not appropriate for FCM data, since the size of the pairwise distance matrix grows on the order of *n*^{2} with the number of cells *n*, unless they are applied to some preliminary partition of the data (16), or they are used to cluster across samples, each of which is represented by a few statistics aggregating measurements of individual cells (20, 21). Classification and regression trees (22), artificial neural networks (23), and support vector machines (24, 25) have also been used in the context of FCM analyses (26–29), but these supervised approaches require training data, which are not always available.
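
To make the quadratic growth concrete, the following back-of-the-envelope Python sketch computes the memory needed to hold all distinct pairwise distances. The function name `pairwise_distance_bytes` is hypothetical, and the sketch assumes 8-byte floating-point values and storage of only the *n*(*n* − 1)/2 distinct pairs.

```python
def pairwise_distance_bytes(n_cells, bytes_per_value=8):
    """Memory for the n*(n-1)/2 distinct pairwise distances among n_cells cells."""
    n_pairs = n_cells * (n_cells - 1) // 2
    return n_pairs * bytes_per_value

for n_cells in (10_000, 100_000, 1_000_000):
    gigabytes = pairwise_distance_bytes(n_cells) / 1e9
    print(f"{n_cells:>9,d} cells -> {gigabytes:,.1f} GB")
```

Under these assumptions, a single sample of one million cells would need roughly 4 TB for the distance matrix alone, which is why hierarchical methods are restricted to preliminary partitions or per-sample summaries.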

In statistics, the problem of finding homogeneous groups of observations is referred to as clustering. An increasingly popular choice is model-based clustering (13, 30–33), which has been shown to give good results in many applied fields involving high dimensions (greater than ten); see, for example, Refs. (33–35). In this paper, we propose to apply an unsupervised model-based clustering approach to identify cell populations in FCM analysis. In contrast to previous unsupervised methods (6–8, 15–18), our approach provides a formal unified statistical framework to answer central questions: How many populations are there? Should we transform the data? What model should we use? How should we deal with outliers (aberrant observations)? These questions are fundamental to FCM analysis, where one does not usually know the number of populations, and where outliers are frequent. By clustering on all fluorescent-marker variables at once, the full multidimensionality of the data is exploited, leading to more accurate and more reproducible identification of cell populations.

The most commonly used model-based clustering approach is based on finite Gaussian mixture models (13, 31–33). However, Gaussian mixture models rely heavily on the assumption that each component follows a Gaussian distribution, which is often unrealistic. A common approach is to look for transformations of the data that make the normality assumption more realistic. Box and Cox (36) discussed the power transformation in the context of linear regression, which has also been applied to Gaussian mixture models (37, 38); see also Ref. (39) for a variant of the Box–Cox transformation for FCM data. In addition to nonnormality, there is also the problem of outlier identification in mixture modeling. Outliers can have a significant effect on the resulting clustering. For example, they usually lead to overestimating the number of components needed to represent the data well. If a more robust model is used, fewer clusters may suffice. Outliers can be handled in the model-based clustering framework by either replacing the Gaussian distribution with a more robust one [e.g., *t* (13, 40)] or adding an extra component to model the outliers [e.g., uniform (30)].
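
One way to see why a *t* model is more robust is through the weight it assigns to each observation during estimation. The Python sketch below uses the standard expression for that weight in the univariate case, (ν + 1)/(ν + d²), where d is the standardized distance from the center. It is a simplified illustration, not the implementation used in this paper: the scale `sigma` and degrees of freedom `nu` are held fixed, the data are made up, and the names `t_weight` and `robust_mean` are hypothetical.

```python
def t_weight(x, mu, sigma, nu):
    """Weight of x under a univariate t model: (nu + 1) / (nu + d^2)."""
    d2 = ((x - mu) / sigma) ** 2  # squared standardized distance from the center
    return (nu + 1.0) / (nu + d2)

def robust_mean(values, sigma=1.0, nu=4.0, n_iter=50):
    """Iteratively reweighted mean with t weights (sigma and nu fixed for simplicity)."""
    mu = sum(values) / len(values)  # start from the ordinary mean
    for _ in range(n_iter):
        weights = [t_weight(x, mu, sigma, nu) for x in values]
        mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return mu

values = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]  # made-up data; the last value is an outlier
plain = sum(values) / len(values)
robust = robust_mean(values)

print(f"ordinary mean:   {plain:.2f}")
print(f"t-weighted mean: {robust:.2f}")
print(f"weight of the outlier at the robust center: {t_weight(50.0, robust, 1.0, 4.0):.4f}")
```

The ordinary mean, which corresponds to a Gaussian model, is pulled toward the outlier, whereas the *t*-weighted mean stays near the bulk of the data because the outlier receives a weight close to zero.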

Transformation selection can be heavily influenced by the presence of outliers (41, 42). To handle the issues of transformation selection and outlier identification simultaneously, we have developed an automated clustering approach based on *t*-mixture models with Box–Cox transformation. The *t* distribution is similar in shape to the Gaussian distribution but has heavier tails, and thus provides a robust alternative (43). The Box–Cox transformation is a type of power transformation that can bring skewed data back to symmetry, a property of both the Gaussian and *t* distributions. In particular, the Box–Cox transformation is effective for data whose dispersion increases with magnitude, a scenario not uncommon in FCM data.
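
As a small illustration of how a power transformation restores symmetry, the Python sketch below applies the standard Box–Cox form for positive data, (x^λ − 1)/λ for λ ≠ 0 and log x for λ = 0, to a right-skewed toy sample. The variant used for FCM data may differ (for example, to accommodate nonpositive values); the toy data and the helper names `box_cox` and `skewness` are assumptions for illustration only.

```python
import math

def box_cox(x, lam):
    """Standard Box-Cox transform for x > 0: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if x <= 0:
        raise ValueError("the standard Box-Cox transform requires positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy data whose spread grows with magnitude: exponentials of a symmetric grid.
raw = [math.exp(z) for z in (-2, -1, 0, 1, 2)]

for lam in (1.0, 0.5, 0.0):
    transformed = [box_cox(x, lam) for x in raw]
    print(f"lambda = {lam}: skewness = {skewness(transformed):.3f}")
```

With λ = 1 the data are only shifted, so the skewness is unchanged; it shrinks as λ decreases and vanishes at λ = 0, where the transform recovers the symmetric grid exactly.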