### Abstract


The capability of flow cytometry to offer rapid quantification of multidimensional characteristics for millions of cells has made this technology indispensable for health research, medical diagnosis, and treatment. However, the lack of statistical and bioinformatics tools to parallel recent high-throughput technological advancements has hindered this technology from reaching its full potential. We propose a flexible statistical model-based clustering approach for identifying cell populations in flow cytometry data based on *t*-mixture models with a Box–Cox transformation. This approach generalizes the popular Gaussian mixture models to account for outliers and allow for nonelliptical clusters. We describe an Expectation-Maximization (EM) algorithm to simultaneously handle parameter estimation and transformation selection. Using two publicly available datasets, we demonstrate that our proposed methodology provides enough flexibility and robustness to mimic manual gating results performed by an expert researcher. In addition, we present results from a simulation study, which show that this new clustering framework gives better results in terms of robustness to model misspecification and estimation of the number of clusters, compared to the popular mixture models. The proposed clustering methodology is well adapted to automated analysis of flow cytometry data. It tends to give more reproducible results, and helps reduce the significant subjectivity and human time cost encountered in manual gating analysis. © 2008 International Society for Analytical Cytology

Flow cytometry (FCM) can be applied to analyze thousands of samples per day. However, as each dataset typically consists of multiparametric descriptions of millions of individual cells, data analysis can present a significant challenge. As a result, despite its widespread use, FCM has not reached its full potential because of the lack of an automated analysis platform to parallel the high-throughput data-generation platform. As noted in a recent Communication to the Editor (1), in contrast to the tremendous interest in the FCM technology, there is a dearth of statistical and bioinformatics tools to manage, analyze, present, and disseminate FCM data. There is considerable demand for the development of appropriate software tools, as manual analysis of individual samples is error-prone, nonreproducible, nonstandardized, not open to reevaluation, and requires an inordinate amount of time, making it a limiting aspect of the technology (2–10).

One major component of FCM analysis involves gating, the process of identifying homogeneous groups of cells that display a particular function. This identification of cell populations currently relies on using software to apply a series of manually drawn gates (i.e., data filters) that select regions in 2D graphical representations of the data. This process is based largely on intuition rather than standardized statistical inference (3, 11, 12). It also ignores the high-dimensionality of FCM data, which may convey information that cannot be displayed in 1 or 2D projections. This is illustrated in Supplementary Figure 1 with a synthetic dataset, consisting of two dimensions, generated from a *t*-mixture model (13) with three components. While the three clusters can be identified using both dimensions, the structure is hardly recognized when the dataset is projected on either dimension. Such an example illustrates the potential loss of information if we disregard the multivariate nature of the data. The same problem occurs when projecting three (or more)-dimensional data onto two dimensions.

Several attempts have been made to automate the gating process. Among those, the *K*-means algorithm (14) has found the most applications (15–18). Demers et al. (17) have proposed an extension of *K*-means allowing for nonspherical clusters, but this algorithm has been shown to lead to performance inferior to fuzzy *K*-means clustering (18). In fuzzy *K*-means (19), each cell can belong to several clusters with different association degrees, rather than belonging completely to only one cluster. Even though fuzzy *K*-means takes into consideration some form of classification uncertainty, it is a heuristic-based algorithm and lacks a formal statistical foundation. Other popular choices include hierarchical clustering algorithms (e.g., linkage or Pearson coefficients method). However, these algorithms are not appropriate for FCM data, since the size of the pairwise distance matrix increases in the order of *n*^{2} with the number of cells, unless they are applied to some preliminary partition of the data (16), or they are used to cluster across samples, each of which is represented by a few statistics aggregating measurements of individual cells (20, 21). Classification and regression trees (22), artificial neural networks (23) and support vector machines (24, 25) have also been used in the context of FCM analyses (26–29), but these supervised approaches require training data, which are not always available.

In statistics, the problem of finding homogeneous groups of observations is referred to as clustering. An increasingly popular choice is model-based clustering (13, 30–33), which has been shown to give good results in many applied fields involving high dimensions (greater than ten); see, for example Refs. (33–35). In this paper, we propose to apply an unsupervised model-based clustering approach to identify cell populations in FCM analysis. In contrast to previous unsupervised methods (6–8, 15–18), our approach provides a formal unified statistical framework to answer central questions: How many populations are there? Should we transform the data? What model should we use? How should we deal with outliers (aberrant observations)? These questions are fundamental to FCM analysis, where one does not usually know the number of populations, and where outliers are frequent. By performing clustering using all variables consisting of fluorescent markers, the full multidimensionality of the data is exploited, leading to more accurate and more reproducible identification of cell populations.

The most commonly used model-based clustering approach is based on finite Gaussian mixture models (13, 31–33). However, Gaussian mixture models rely heavily on the assumption that each component follows a Gaussian distribution, which is often unrealistic. A common approach is to look for transformations of the data that make the normality assumption more realistic. Box and Cox (36) discussed the power transformation in the context of linear regression, which has also been applied to Gaussian mixture models (37, 38); see also Ref. (39) for a variant of Box–Cox transformation for FCM data. In addition to nonnormality, there is also the problem of outlier identification in mixture modeling. Outliers can have a significant effect on the resulting clustering. For example, they will usually lead to overestimating the number of components to provide a good representation of the data. If a more robust model is used, fewer clusters may suffice. Outliers can be handled in the model-based clustering framework, by either replacing the Gaussian distribution with a more robust one [e.g., *t* (13, 40)] or adding an extra component to model the outliers (e.g., uniform (30)).
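The robustness of the *t* distribution can be illustrated with the weight that each observation receives in the EM algorithm for a *t*-mixture, following the standard formulation of the multivariate *t* as a scale mixture of Gaussians. The sketch below is not taken from this paper's implementation: the data, the current estimates (`center`, `scale`), and the degrees of freedom are hypothetical values chosen for illustration.

```python
def t_weight(sq_mahalanobis: float, dim: int, dof: float) -> float:
    """Weight u = (dof + dim) / (dof + squared Mahalanobis distance).

    Observations far from the cluster center get weights close to zero,
    so they contribute little to the updated mean and covariance.
    """
    return (dof + dim) / (dof + sq_mahalanobis)


def weighted_mean(values, weights):
    """Mean update in which each observation is scaled by its weight."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical one-dimensional cluster; the last value is an outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 30.0]
center, scale, dof = 10.0, 0.2, 4.0  # assumed current parameter estimates

weights = [t_weight(((y - center) / scale) ** 2, dim=1, dof=dof) for y in data]
plain_mean = sum(data) / len(data)         # Gaussian-style update, pulled by the outlier
robust_mean = weighted_mean(data, weights)  # t-style update, outlier downweighted

print(f"plain mean:  {plain_mean:.3f}")
print(f"robust mean: {robust_mean:.3f}")
```

Under a Gaussian model every observation has weight one, which is why a single aberrant event can shift the estimated center or force an extra component.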

Transformation selection can be heavily influenced by the presence of outliers (41, 42). To handle the issues of transformation selection and outlier identification simultaneously, we have developed an automated clustering approach based on *t*-mixture models with Box–Cox transformation. The *t* distribution is similar in shape to the Gaussian distribution with heavier tails and thus provides a robust alternative (43). The Box–Cox transformation is a type of power transformation, which can bring skewed data back to symmetry, a property of both the Gaussian and *t* distributions. In particular, the Box–Cox transformation is effective for data where the dispersion increases with the magnitude, a scenario not uncommon to FCM data.
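As a minimal sketch of how a power transformation restores symmetry, the following applies the original Box–Cox transformation (36) to positive, right-skewed values. The synthetic data and the `skewness` helper are assumptions made for illustration; in the proposed method the transformation parameter is estimated within the EM algorithm rather than fixed by hand.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Original Box-Cox transformation for strictly positive y."""
    if y <= 0:
        raise ValueError("the original Box-Cox transformation requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lam approaches 0
    return (y ** lam - 1.0) / lam


def skewness(values) -> float:
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic values that are symmetric on the log scale, hence right-skewed
# on the raw scale, with dispersion growing with magnitude.
raw = [math.exp(k / 4.0) for k in range(-8, 9)]
transformed = [box_cox(y, 0.0) for y in raw]

print(f"skewness before: {skewness(raw):.3f}")
print(f"skewness after:  {skewness(transformed):.3f}")
```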

### DISCUSSION


The experimental data and the simulation studies have demonstrated the importance of handling transformation selection, outlier identification, and clustering simultaneously. A stepwise approach in which the transformation is preselected ahead of outlier detection (or vice versa) may be considered, but it is unlikely to tackle the problem well in general, as the preselected transformation may be influenced by the presence of outliers. This is shown in the analysis of the Rituximab dataset. Without outlier removal, Gaussian mixture models chose an inappropriate transformation in order to accommodate the outliers, which led to poor classification (Fig. 3d and Supplementary Fig. 3d). Conversely, without transformation, the *t*-mixture model could not model the shape of the top cluster well (Fig. 3c and Supplementary Fig. 3c). Similarly, it is necessary to perform transformation selection and clustering simultaneously (37, 38) as opposed to a stepwise approach. It is difficult to know what transformation to select beforehand, as one observes only the mixture distribution and the classification labels are unknown. A skewed distribution could be the result of one dominant cluster and one (or more) smaller clusters. As shown by our analysis of the experimental data and the simulation studies, our proposed approach based on *t*-mixture models with Box–Cox transformation benefits from handling these mutually dependent issues simultaneously. Furthermore, as confirmed by our simulation studies, the proposed approach is robust against model misspecification and avoids a known problem of Gaussian mixture models, namely that an excessive number of clusters is often needed to provide a reasonable fit when the model is misspecified (34).

One of the benefits of model-based clustering is that it provides a mechanism for both “hard” clustering (i.e., the partitioning of the whole data into separate clusters) and fuzzy clustering (i.e., a “soft” clustering approach in which each event may be associated with more than one cluster). The latter approach is in line with the rationale that there is uncertainty about the cluster to which an event should be assigned. The overlaps between clusters seen in Figures 3 and 4 reveal such uncertainty in the cluster assignment. When fuzzy clustering is considered, the posterior probability *z̃*_{ig} can be interpreted as the evidence for the association of *y*_{i} with cluster *g*; when a partition of the data is desired, we may assign each observation *y*_{i} to the cluster *g* with the largest *z̃*_{ig} value.
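The relationship between the soft and hard views can be sketched in a few lines. The posterior probabilities below are hypothetical numbers, not results from the datasets analyzed here, and the helper names are illustrative only.

```python
# Hypothetical posterior probabilities z[i][g] for three events and three
# clusters; each row sums to one.
posteriors = [
    [0.98, 0.01, 0.01],  # clearly cluster 0
    [0.55, 0.40, 0.05],  # lies in the overlap between clusters 0 and 1
    [0.10, 0.20, 0.70],  # mostly cluster 2
]


def hard_assign(row) -> int:
    """Index of the cluster with the largest posterior probability."""
    return max(range(len(row)), key=row.__getitem__)


def uncertainty(row) -> float:
    """One minus the largest posterior: 0 means an unambiguous assignment."""
    return 1.0 - max(row)


labels = [hard_assign(row) for row in posteriors]
uncertainties = [uncertainty(row) for row in posteriors]

for i, (label, u) in enumerate(zip(labels, uncertainties)):
    print(f"event {i}: cluster {label}, uncertainty {u:.2f}")
```

The second event is assigned to cluster 0 under a hard partition, but its uncertainty of 0.45 records that cluster 1 is nearly as plausible, which is the information a fuzzy interpretation retains.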

In many FCM clustering applications, the number of clusters is unknown and requires estimation. There are several approaches for choosing the number of clusters in model-based clustering, including resampling, cross validation, and various information criteria (58). Our approach to the problem is based on the BIC, which gives good results in the context of mixture models (33, 59). BIC is cheap to compute once maximum likelihood estimation of the model parameters has been completed, an advantage over other approaches, especially in the context of FCM, where datasets tend to be very large. However, BIC relies heavily on an approximation of marginal likelihoods, which might not be very accurate for some data. We are currently looking for alternatives, for example, the integrated completed likelihood (60), to improve the estimation of the number of clusters. Nevertheless, combined with expert knowledge, we view BIC as a useful tool that can provide guidance on choosing a reasonable value, as supported by our simulation study assessing the accuracy in selecting the number of clusters.
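To show why BIC is inexpensive once the likelihood has been maximized, the sketch below scores candidate numbers of clusters from their maximized log-likelihoods. Several points are assumptions rather than details taken from this paper: the sign convention (larger is better, as commonly used in the model-based clustering literature), the parameter count (unrestricted covariance matrices plus two shared parameters for the transformation and the degrees of freedom), and the log-likelihood values, which are invented for illustration.

```python
import math


def n_free_params(n_clusters: int, dim: int, extra: int = 2) -> int:
    """Free parameters of a mixture with unrestricted covariance matrices.

    `extra` counts parameters shared by all clusters; two is assumed here
    (one transformation parameter, one degrees-of-freedom parameter).
    """
    proportions = n_clusters - 1
    means = n_clusters * dim
    covariances = n_clusters * dim * (dim + 1) // 2
    return proportions + means + covariances + extra


def bic(log_lik: float, n_params: int, n_events: int) -> float:
    """BIC = 2 * log-likelihood - (number of parameters) * log(n)."""
    return 2.0 * log_lik - n_params * math.log(n_events)


# Hypothetical maximized log-likelihoods for 1 to 5 clusters.
log_liks = {1: -5200.0, 2: -4700.0, 3: -4550.0, 4: -4540.0, 5: -4535.0}
n_events, dim = 1000, 2

scores = {g: bic(ll, n_free_params(g, dim), n_events) for g, ll in log_liks.items()}
best_g = max(scores, key=scores.get)

for g in sorted(scores):
    print(f"G = {g}: BIC = {scores[g]:.1f}")
print(f"selected number of clusters: {best_g}")
```

In this example the log-likelihood keeps increasing beyond three clusters, but the gain is smaller than the penalty for the additional parameters, so the criterion selects three.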

There exist several modified versions of the Box–Cox transformation to handle negative-valued data, for example, the log-shift transformation, which was also proposed in the paper for the original Box–Cox transformation (36). The advantage of our choice, given by Eq. (10), is that while continuity is maintained across the whole range of the data, it retains the simplicity of the form of the transformation without introducing any additional parameters; when all data are positive, it reduces to the same form of the original Box–Cox transformation.
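Because Eq. (10) is not reproduced in this section, the following is only a sketch of one sign-preserving power transformation that has the three properties stated above (continuity over the whole real line, no additional parameters, and agreement with the original Box–Cox form for positive data). It should be read as an assumption about the general shape of such a transformation, not as a restatement of Eq. (10); the function name is hypothetical.

```python
import math


def signed_box_cox(y: float, lam: float) -> float:
    """Sign-preserving power transformation: (sgn(y) * |y|**lam - 1) / lam.

    Assumed form for illustration. It is defined for any real y when
    lam > 0 and equals the original Box-Cox transformation when y > 0.
    """
    if lam <= 0:
        raise ValueError("this sketch assumes lam > 0")
    return (math.copysign(abs(y) ** lam, y) - 1.0) / lam


lam = 0.5
for y in (-4.0, -1e-9, 0.0, 1e-9, 4.0):
    print(f"y = {y:>8}: transformed = {signed_box_cox(y, lam):.6f}")
```

Values just below and just above zero map to nearly the same number, which reflects the continuity property, and the ordering of the data is preserved.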

It is well known that the convergence of the EM algorithm depends on the initial conditions used. A bad initialization may incur slow convergence or convergence to a local, rather than global, maximum of the likelihood. In the real-data examples and the simulation studies, we used a deterministic approach, hierarchical clustering (30, 53), for initialization. We have found this approach to perform well in the datasets explored here. However, better initialization, perhaps incorporating expert knowledge, might be needed for more complex datasets. For example, if there is a high level of noise in the data, it might be necessary to use an initialization method that accounts for the resulting outliers; see Ref. (33) for an example.

To estimate how long it takes to analyze a sample of a size typical for an FCM dataset, we carried out a test run on a synthetic dataset consisting of one million events and 10 dimensions. An analysis with 10 clusters took about 20 min on a 3-GHz Intel Xeon processor with 2 GB of RAM. This illustrates that the algorithm should be quick enough for analyzing a large flow dataset. In general, the computational time per EM iteration increases linearly with the number of events and in the order of *p*^{2} with the number of variables, *p*. This is an advantage over hierarchical clustering, in which the computational time and memory space required increase in the order of *n*^{2} with the number of events, making a hierarchical approach impractical when a sample of even moderate size, say, >5,000 events, is investigated. Meanwhile, we recognize the need for high computational speed in FCM analysis, and are currently investigating means to speed up the EM algorithm for parameter estimation.
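The difference in scaling can be made concrete with a rough memory estimate. The sketch below assumes 8-byte floating-point values and counts only the dominant array for each method: the pairwise distance matrix for hierarchical clustering, and the events-by-clusters matrix of posterior probabilities for a mixture model. Actual implementations need further working memory, so the figures are lower bounds for illustration.

```python
def pairwise_distance_bytes(n_events: int, bytes_per_value: int = 8) -> int:
    """Storage for the n * (n - 1) / 2 distinct pairwise distances."""
    return n_events * (n_events - 1) // 2 * bytes_per_value


def posterior_matrix_bytes(n_events: int, n_clusters: int, bytes_per_value: int = 8) -> int:
    """Storage for one posterior probability per event and cluster."""
    return n_events * n_clusters * bytes_per_value


for n in (5_000, 100_000, 1_000_000):
    quadratic = pairwise_distance_bytes(n)
    linear = posterior_matrix_bytes(n, n_clusters=10)
    print(f"n = {n:>9,}: distances {quadratic / 1e9:10.2f} GB, "
          f"posteriors {linear / 1e9:6.3f} GB")
```

At 5,000 events the distance matrix needs about 0.1 GB, but at one million events it needs about 4,000 GB, whereas the posterior matrix for 10 clusters stays at 0.08 GB.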

Like all clustering approaches, the methodology we have developed includes assumptions, which may limit the applicability of this approach, and it will not identify every cell population in every sample. If the distribution of the underlying population is highly sparse without a well-defined core, our approach may not properly identify all subpopulations. This is illustrated in the Rituximab analysis, where the loosely structured group of apoptotic cells was left undetected. This in turn prevented the approach from giving satisfactory estimates of the G1 and S frequencies for the identified clusters, which would be desired for normal analysis of a 7-AAD DNA distribution for cultured cells. On the other hand, identification of every cluster may not always be important. The Rituximab study was designed as a high-throughput drug screen to identify compounds that caused a >50% reduction in S-phase cells (46), as would be captured by both the manual gates and our automated analysis should it occur. Furthermore, the exact identification of every cluster through careful manual analysis may not always be possible, especially in high-throughput experiments. For instance, in the manual analysis of the GvHD dataset, a quadrant gate was set in Figure 1c in order to identify the CD3^{+}CD4^{+}CD8β^{+} population, which was of primary interest. For convenience, this gate was set at the same level across all the samples being investigated. While five clusters can be clearly identified on the graph, it would be time consuming to manually adjust the positions of each of the gates for all the samples in a high-throughput environment, as well as to identify all novel populations. In contrast, our automated approach can identify these clusters in short order without the need for manual adjustment. The analysis of the GvHD dataset (>12,000 cells, six dimensions) to identify the CD3^{+}CD4^{+}CD8β^{+} population (Fig. 1) took less than 5 min, using the aforementioned sequential approach to clustering, on an Intel Core 2 Duo with 2 GB of RAM running Mac OS X 10.4.10.

A rigorous quantitative assessment is important before implementing this, or any approach, as a replacement for expert manual analysis. The availability of a wide variety of example data would aid in the development and evaluation of automated analysis methodologies. We are therefore developing such a public resource, and would welcome contributions from the wider FCM community.

An R (61) package called flowClust is being developed to implement the clustering methodology proposed in this paper. The source code is built in C for optimal utilization of system resources and makes use of the BLAS library (62), which enables multithreaded processes. The R package will be available from Bioconductor (63) at http://www.bioconductor.org.