Flow Cytometry (FCM) can be applied in a high-throughput fashion to process thousands of samples per day. However, data analysis can be a significant challenge because each data set is a multiparametric description of millions of individual cells. Consequently, despite widespread use, FCM has not reached its full potential due to the lack of an automated analysis platform to assist high-throughput data generation (1–3).

A critical bottleneck in data analysis is the identification of groups of similar cells for further study (i.e., gating). This process involves identifying regions in multivariate space that contain homogeneous cell populations of interest. Gating has generally been performed manually by expert users, but manual gating is subject to user variability (4–6) and is unsuitable for high-throughput data analysis (7). Several methods have been developed to automate the gating process (8). flowClust is a model-based clustering approach that models cell populations using a Box-Cox transformation to remove skewness, followed by a mixture of *t*-distributions (9). flowMerge (10) extends the flowClust framework by applying a cluster merging algorithm (11) that allows multiple components to model the same population, enabling it to fit concave cell populations. FLAME (12) uses a mixture of skew-*t*-distributions to make the model more flexible for skewed cell populations. curvHDR (13) is a nonparametric density-based approach that models cell populations based on the curvature of the underlying distribution, and is therefore not limited to cell populations of a particular shape. However, it requires user-defined parameter values and cannot be applied to data with more than three dimensions. SamSpectral (14) uses a spectral clustering algorithm to find cell populations, including nonconvex ones. Given the high time and memory requirements of spectral clustering, SamSpectral finds cell populations based on a representative subsample of the data; however, this can decrease the quality of the gating, as some biological information can be lost during sampling. SamSpectral also requires user-defined parameter values for each data set of similar experiments.

With the advent of high-throughput FCM analysis, millions of cells can be analyzed for up to 20 markers per sample. For these experiments, the runtime of gating algorithms is a bottleneck of automated FCM data analysis pipelines (18). The K-means clustering algorithm was one of the first automated data analysis approaches applied to FCM data (15). Given a set of *n* *d*-dimensional vectors *X*_{1},*X*_{2},…,*X*_{n}, K-means aims to partition them into *K* < *n* sets *S* = {*S*_{1},*S*_{2},…,*S*_{K}} so as to minimize the within-cluster sum of squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{K} \sum_{X_j \in S_i} \lVert X_j - c_i \rVert^2$$

where *c*_{i} is the centroid or center of *S*_{i}, estimated by the mean of its members.
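
To make the objective concrete, the following is a minimal sketch of the standard Lloyd iteration for K-means, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. The text does not state which K-means variant or initialization is used, so the random initialization from data points and all function names (`kmeans`, `wcss`, and the helpers) are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(points, labels, centroids):
    """Within-cluster sum of squares for a given assignment."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    # Assumed initialization: k data points drawn without replacement.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean_point(members)
    return labels, centroids
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns one label per point together with the final centroids, and `wcss(points, labels, centroids)` evaluates the quantity being minimized. Because the result depends on the starting centroids, different `seed` values can yield different local minima on less well-separated data.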

However, the adoption of K-means has been restricted because it requires the number of populations to be specified in advance, it is sensitive to initialization, and it is limited to modeling spherical cell populations. To estimate the number of clusters, Pelleg et al. (16) and Hamerly and Elkan (17) extended basic K-means by using the Bayesian Information Criterion (BIC) and a normality test, respectively. Voting-K-means (18) tries to achieve a good clustering by running the K-means algorithm with a number of different settings and combining the results using an ensemble clustering algorithm. However, these algorithms have not been applied successfully to automated FCM data analysis, since the first two are sensitive to noise and all three require user-defined parameter values (8, 14).
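
For reference, the BIC in its general form trades goodness of fit against model complexity. The symbols below are introduced here for illustration and are not taken from (16), which may use a different but equivalent sign convention:

$$\mathrm{BIC} = p \ln n - 2 \ln \hat{L}$$

where $\hat{L}$ is the maximized likelihood of the fitted model, $p$ is the number of free parameters, and $n$ is the number of observations. Under this convention a lower value is preferred, so a BIC-based selection of *K* would fit candidate models for several values of *K* and keep the one with the lowest score.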

We have developed a new K-means-based clustering framework that addresses the initialization, shape limitation, and model-selection problems of K-means clustering, and can be applied to FCM data. We extended the flowMerge (10) approach by replacing the statistical model with a faster clustering algorithm. By introducing a new merging criterion, our approach finds nonconvex cell populations, and we use a change-point detection algorithm to estimate the number of clusters.
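
This section does not specify the merging criterion or the change-point detector. As a purely illustrative sketch, one simple way to detect a change point in a sequence of numbers (for example, a hypothetical sequence of distances recorded at successive merge steps) is to fit two straight-line segments and choose the split that minimizes the total squared error. The function names and the two-segment model are assumptions and are not claimed to be the method used in this framework.

```python
def line_sse(xs, ys):
    """Sum of squared residuals of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def change_point(values):
    """Index that best splits `values` into two straight-line segments.

    Returns the index of the first element of the second segment.
    Each segment is required to contain at least two points.
    """
    if len(values) < 4:
        raise ValueError("need at least four values for two segments")
    xs = list(range(len(values)))
    best_split, best_error = None, float("inf")
    for split in range(2, len(values) - 1):
        # Total error of fitting one line to each side of the split.
        error = line_sse(xs[:split], values[:split]) + line_sse(xs[split:], values[split:])
        if error < best_error:
            best_split, best_error = split, error
    return best_split
```

For a made-up sequence such as `[1.0, 2.0, 3.0, 4.0, 12.0, 18.0, 24.0]`, where the values grow slowly and then rapidly, `change_point` returns `4`, the position at which the slope changes. How such an index would be translated into a number of clusters depends on the merging procedure and is not covered by this sketch.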