The next‐generation K‐means algorithm

Typically, when referring to a model‐based classification, the mixture distribution approach is understood. In contrast, we revive the hard‐classification model‐based approach developed by Banfield and Raftery (1993) for which K‐means is equivalent to the maximum likelihood (ML) estimation. The next‐generation K‐means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model‐based approach for the K‐means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no‐clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K‐means.


INTRODUCTION
K-means is the most popular clustering algorithm. A review of the technique is outside the scope of the present work-we refer the reader to a highly cited paper by Jain [20] for a general discussion.
Typically, K-means is referred to as a hard classification clustering technique because the answer to whether an observation belongs to a cluster is either yes or no. In contrast, another popular classification algorithm based on a mixture (in most instances a Gaussian mixture) distribution is a soft classification technique because the answer on cluster membership is expressed in terms of a probability. An advantage of the mixture distribution approach is that the membership indicator is a continuous parameter (probability) and therefore smooth optimization methodology applies so that maximization of the likelihood function can be effectively achieved by the expectation-maximization (EM) algorithm [26]. An attractive feature of the Gaussian mixture is that it is a model-based classification approach; therefore, traditional likelihood-based methodologies, such as hypothesis testing or the AIC/BIC criteria, can be employed to facilitate testing of the components or to select the number of clusters. It is well forgotten that the K-means also can be viewed as a model-based approach with minimization of the total within sum of squares equivalent to the maximum likelihood (ML). However, unlike the mixture distribution approach, the classical ML theory fails here because (1) the number of parameters, as the partition index sets, exponentially increases with n and K (typically referred to as an HP-hard problem); (2) parameters as index sets are integers (discrete) and therefore the Wald and likelihood ratio tests do not apply (the parameter value must be an inner point of the parameter space); and (3) the AIC/BIC criteria are not applicable because of the discontinuity of the parameter space.
Statistical model-based hard classification was popularized and developed by Banfield and Raftery [2], although they do not mention the term "K-means algorithm." Remarkably, not much has been done in terms of developing and extending the model-based K-means algorithm since then. Perhaps the most attractive feature of model-based cluster analysis, compared to a method-based approach, is that data can be generated according to the model, and the statistical properties of clusterization can be studied via simulations.
The goal of the present work is to revive and extend the ideas presented by Banfield and Raftery by viewing the K-means algorithm as the ML technique in several directions: (1) testing the presence of clusters and computing the p-value; (2) identification of the number of clusters; (3) viewing the K-medians algorithm as the ML based on the Laplace distribution; (4) developing a semisupervised K-means algorithm in the case of a priori information; (5) developing the clusterwise K-means regression; and, finally, (6) extension of the K-means algorithm to clustering of multilevel data in the presence of replicates. However, it is not the goal of this work to develop new numerical algorithms. Instead, our hard classification procedures are reduced to the repeated application of the existing and efficient Hartigan-Wong [17] algorithm. In the present work, only the spherical Gaussian distribution is assumed; an extension to the case when observations are heteroscedastic or correlated, as studied by Banfield and Raftery [2], can be carried out along the lines of the spherical case and is a topic of future research.

SPHERICAL GAUSSIAN DISTRIBUTION
In this section, we consider the simplest model-based hard classification problem leading to the K-means algorithm. It is assumed that n independently distributed observation vectors x 1 , x 2 , ..., x n ∈ R m are independent and belong to K groups specified by the index sets C 1 , C 2 , ..., C K . These index sets partition the set {1, 2, ...., n} so that ∪ K k=1 C k = {1, 2, … , n} and C k ∩ C l = ∅ for k ≠ l. The distribution of observations from each cluster is identical to the common variance. Moreover, it is assumed that the distribution is spherical Gaussian: The parameters to estimate are the means { k , k = 1, 2, ..., K}, the common variance 2 , and, most importantly, the index sets (C 1 , C 2 , ..., C K ). The twice-negative log-likelihood function takes the form Differentiating with respect to k , we find that, given the index sets, the ML estimator is where n k is the number of elements in the cluster k. Differentiating (2) with respect to 2 , we find that the ML estimation is equivalent to the minimum of the total within sum of squares: Thus, ML with a spherical Gaussian distribution is equivalent to the traditional K-means algorithm. The minimization of criterion (3) is not trivial and may have multiple minima, so several starting points may be used to confirm that the global minimum is found. An ML estimate of the variance iŝ 2 = ( ) −1 S K , as follows from (2). An immediate implication of the fact that the K-means algorithm solves the ML problem is an obvious but sometimes ignored consequence that the K-means algorithm is applicable only to normally distributed data with equal variance. Consequently, the K-means algorithm is not justified for uniformly distributed data or when vector components are measured on different scales and therefore have different variances. One might suggest normalizing the original data by subtracting the gross mean and dividing by the standard deviation, but such normalization would be suboptimal because the variance should be computed around the mean in each cluster, not around the gross mean.

Testing the presence of clusters
A fundamental question is: Are there clusters? A false clusterization is illustrated in Figure 1. The K-means algorithm with 2 clusters (K = 2) is applied to n = 100 points generated from the same normal distribution with zero mean, unit variance, and zero correlation (spherical Gaussian distribution). The K-means algorithm divides these points into 2 clusters, but in fact there are no clusters because points are generated from the same distribution. Visualization may be deceiving. Needless to say, the absence of clusters becomes even more difficult to detect for higher dimensions (m > 2). We aim to test whether points {x i , i = 1, 2, ..., n} belong to the same normal population-that is, there are no clusters. This hypothesis will be referred to as the no-clusters hypothesis. Clustering tendency bothered mathematicians from the very beginning [42], but most of the work has been done in an asymptotic setup when n → ∞ . We mention just a sample of authors: Pollard [27], Bryant and Williamson [7], Bock [5], and Jain and Dubes [19]. Unlike previous research, we want to compute the p-value for testing the no-clusters hypothesis for small n. The idea is to use the established MANOVA test statistic when the index sets are known. The key observation is that, for the K-means algorithm, the index sets are unknown and subject to estimation. Therefore, a distribution, such as the F-distribution, does not hold. This distribution will be derived via simulations; see also refs. 25, 22, 30. The K-means algorithm with K = 2 for a sample of 100 random points from the same bivariate normal distribution with zero mean and unit variance. A wrong clusterization is shown in the right plot (the same points)!
We say that there are no clusters if the null hypothesis H 0 : 1 = 2 = · · · = K is not rejected with the given Type I error (typically, = 0.05). If the index sets C k were known, the traditional exact F-test or approximate likelihood ratio (LR) MANOVA test could be applied, Anderson [1]. These are based on the total and within-cluster sums of squares respectively. When the index sets are unknown and estimated, as in the K-means algorithm, the distribution of classical statistics does not hold, so the classical MANOVA does not apply.
To compute the p-value for the no-clusters hypothesis when the index sets are unknown, we need to estimate the cumulative distribution function (cdf) of statistics under the null hypothesis: that is, when x i ∼ ( , 2 I), i = 1, 2, ..., n. We could use either the F-statistic, (S 1 − S K )/S K , or the likelihood ratio test, log(S 1 /S K ), but the p-value does not change upon any strictly increasing transformation, so it suffices to find the cdf of the ratio The advantage of the statistic (5) is that its distribution, under the null hypothesis, does not depend on and 2 . Indeed, simple algebra proves that and z i ∼ (0, I).
Finally, the method of computing the p-value for the no-clusters null hypothesis versus the alternative that the number of clusters is K is as follows: Let the K-means algorithm for the data at hand {x i , i = 1, 2, ..., n} produce r * as the ratio of 2 sums of squares (5). Carry out a fairly large number of simulations N, say N = 1000, to obtain the empirical cdf of r: For each simulation, (1) generate {z 1 , z 2 , … , z n } ∼ (0, I), (2) run K-means, and (3) compute the total sum of squares S 1z , the within sum of squares from the K-means, S Kz , and r = S 1z /S Kz . It took about 2 s for the data depicted in Figure 1 to do simulations in R on a regular desktop using 10 random initialization starts. Then, the p-value is the proportion of simulations in which r > r * . If there were clusters, then r * would be greater than the typical r under the null hypothesis (no clusters). Typically, we say that the null hypothesis is rejected if the proportion (p-value) is < 0.05. The p-value for the configuration of points depicted in Figure 1 is 0.438. This means that the no-clusters hypothesis cannot be rejected. If the number of simulations N is fairly large, the p-value is computed with precision of order 1/N. The typical threshold for the p-value, 0.05, specifies Type I error (the alpha error): the probability of concluding that there are several clusters when in fact there are no clusters. Type II error (the beta error) is the probability of concluding that there are no clusters when in fact there are clusters. Usually, we compute the power function as complement to the beta error, that is, the probability of rejecting clusters when in fact there are clusters. Of course, the power function depends on how separated the clusters are. For example, in the case of 2 clusters, the power function depends on the Mahalanobis distance, || 1 − 2 ||/ = . When = 0, the power function turns into Type I error ; when → ∞, the power function approaches 1. The power function tells how different the centers of the clusters, adjusted for , must be to claim that there are 2 clusters. An example of the power function for cluster detection is shown in Figure 2 for different n and m = 2. More points produce a higher probability of cluster detection. With 20 points, one needs to have the distance ≈ 3 to be able to detect the cluster configuration with probability ∼80%.

2.2
How many clusters: the broken-line algorithm "What is K?" is the paramount question of the K-means algorithm, Hastie et al. [18]. There is a rich body of literature on the topic, and it is not the objective of the present work to review available methods for choosing the number of clusters in the K-means algorithm. Instead, we develop a new broken-line algorithm and compare its performance via simulations against 27 other algorithms of K determination computed by the R function NbClust based on the statistical model (1); see the next section.
Our broken-line algorithm is an elaboration of the well-known and loosely defined elbow method: (1) Plot the log total within sum of squares, S K , against K for a sequence of values K = 1, 2, … , K max , and (2) chose K at the elbow of the curve, that is, where the line exhibits a change of slope. Although this method is intuitively appealing, there is no formal rule to define the elbow. We facilitate the determination of K by plotting lnS K and identifying K where the rate of decrease of lnS K (the slope) changes. Precisely, the broken-line algorithm is as follows: Fit 2 linear regressions using 2 segments of the data, {S 1 , S 2 , .., S K } and {S K+1 , S K+2 , .., S K max }, and compute the total residual sum of squares for K = 2, 3, … , K max − 2. The optimal K is where the sum of squares takes a minimum.
This algorithm is illustrated in Figure 3. In the left plot, 6 clusters are simulated according to model (1) using = 0.2 with about 150 points in each cluster. The circles depict the 95% confidence region with centers at the true mean and radius √ −2 (0.95, 2), where −2 (0.95, 2) is the 0.95th quantile of the chi-distribution with 2 degrees of freedom. In the right plot, we run 24 kmeans algorithms, letting K = 1, 2, ..., 24 (=K max ) and plot lnS K against K. Then we run 23 × 2 = 46 linear fits and find the pair that produces the minimum total residual sum of squares. The minimum occurs at K = 6. Note that plotting S K against K, as usually recommended, does not detect the change in slope-the log scale is crucial.
Although no theoretical justification for using lnS K is offered in this work yet, the link back to the log-likelihood (2) can be easily traced. Indeed, the minimum twice-negative log-likelihood is mn[lnS K − ln(mn) + 1]. Since m and n are K-independent, the optimal log-likelihood solely depends on lnS K , which is the prime metric in the famous Neyman-Pearson lemma for hypothesis testing, Lehmann and Romano [24]. Example 1. Human tumor microarray data. Hastie et al. [18], p. 512 provide an example of the K-means algorithm with n = 64 human tumors to be classified in groups using 6830 gene microarray expression data. As stated in the book, "...there is no clear indication" on the number of clusters; they use K = 3. The identification of the number of clusters in this example based on our broken-line algorithm is depicted in the left plot of Figure 6 . Although visual identification of the elbow is indeed difficult even on the log scale, the rate of the drop changes at K = 5 (the gap statistic identified 2 clusters).

Comparison with other methods using simulations
We use the package NbClustin R to compare our broken-line algorithm against 27 other methods previously reported in the literature over the years, including a popular gap statistic method by Tibshirani et al. [36]. We simulate six clusters according to model (1) with = 0.2, 0.3, 0.4, and 0.5, with typical configurations shown in Figure 4; a typical configuration for = 0.2 is depicted in Figure 3. The best 6 methods of K determination are presented in Table 1; we do not report the results of classification on the other 21 methods, including gap statistic, because they are worse in terms of the deviation of the identified number clusters from K = 6. For each method, we compute the mean of the identified K across simulations, K, to evaluate the bias; the standing is determined by the absolute deviation of averages from 6 across (the last column). It is understandable that, when increases (clusters are getting wider), the methods tend to find fewer clusters. The superiority of the broken-line algorithm is obvious.

Example 2. Classification of ovarian cancer microarrays.
The identification of latent clusters of genes of ovarian cancer is an important problem for improving treatment outcomes [32]. Considerable effort has been devoted by The Cancer Genome Atlas (TCGA) Research Network researchers to carry out microarray experiments to identify clusters of genes that could better classify the disease with a possibility of gene therapy [33]. However, the number of gene clusters is still an open question. Several researchers hypothesize that the number of clusters of genes should be equal to the number of clinically supported ovarian tumor subtypes: serous, mucinous, endometrioid, and clear cell [39]. Here we use the gene expressions data of the n = 1500 most representative genes from m = 489 ovary tumors [12].

K-MEDIANS CLUSTERING ALGORITHM
In reality, observations may contain outliers or even observations that do not belong to either cluster. In this section, we suggest a statistical model for the K-medians clustering algorithm. The K-medians is a well-known robust version of hard clustering-we will derive this algorithm via the method of ML using the multivariate Laplace distribution. Although the application of Laplace distribution to mixture distribution and fuzzy clustering is known [6,9,3,13,29,34,37], we are not aware of derivation of the K-medians algorithm through the method of ML, but most importantly by taking full advantage of a statistical model-based approach by (1) applying classical statistical tests to answer important questions about clusters, (2) computing the confidence region for each cluster, and, finally, (3) generating data and carrying out simulations to study statistical properties of statistical tests and estimators.
Denote with ( , ) the Laplace (or double-exponential) distribution with the density f (x; , ) = (2 ) −1 e −|x − |/ , where is referred to as the location parameter and is referred to as the scale parameter. It is well known that, if , then the ML estimator of is the median and solves the minimization problem ∑ n i=1 |x i − | ⇒ min. This fact is the impetus for our statistical model: It is assumed that the components of the m-dimensional vector x i from cluster k are independent and identically distributed with the location parameter k and the common scale parameter (vector observations are independent as well). Symbolically The log-likelihood function, up to a constant term, takes the form Commonly, |x i − k | refers to the L 1 -norm or Manhattan distance between the observation vector x i and the respective center k . Obviously, the maximum of l occurs when   Figure 6 wherex k is the m × 1 median vector in cluster k. This implies that the method of ML with the Laplace distribution is equivalent to the K-medians algorithm. Now we illustrate how the no-clusters test can be generalized to the K-medians: As before, the test statistic is the ratio (5), but now wherex is the overall median vector assuming no clusters. It is easy to see that, similar to the Gaussian case, the ratio r = S 1 /S K does not depend on either or . This means that we can estimate the cdf of r from simulations using z i iid ∼ (0, I m ) instead of x i . Then the p-value for testing the null hypothesis that there are no clusters is the th quantile of the empirical cdf (typically we use = 0.05).
The broken-line algorithm for selection of K generalizes to K-medians in a straightforward manner and is illustrated in Figure 7, where observations from 3 clusters are generated according to the Laplace distribution. We used the R function cclust of the package flexclust to run the K-medians algorithm.
The (1 − )th confidence region for each cluster is con- In particular, for m = 2, as in Figure 7, the confidence region for the kth cluster is the 45 • rotated square (rhombus) and is defined by the equation , 4) is the 0.95th quantile of the chi-distribution with 4 degrees of freedom (the left plot). The right plot shows that the broken-line algorithm correctly determines the number of clusters.

SEMISUPERVISED K-MEANS ALGORITHM
A common critique of cluster analysis is that it does not make a connection between cluster and group. The labeling and interpretation is up to the user because cluster analysis is an unsupervised classification technique. Sometimes, one has an additional set of observations from some clusters to put the labels right. Several authors have suggested variants of the K-means algorithm to account for observations with known labels/groups. For example, Wagstaff et al. [41] and Basu et al. [4] suggested improving the K-means algorithm starting from seeding generated by the label-known (known membership) observations or do clustering in the restricted sense, so that all observations with known membership belong to the same cluster. However, unlike previous authors, we suggest the incorporation of a priori knowledge using a model-based approach.
We use the following example to illustrate the K-means when the cluster membership of some observations is known (these observations will be referred to as supervised observations). That is why this version will be called the semisupervised K-means algorithm. The following example clarifies the concept.
Example 3. Political party classification. We want to use reading proficiency and attitude toward gay marriage to identify whether the individual is a Democrat or a Republican. Thirty people were tested for reading proficiency, and the question was asked about their opinion on gay marriage. In addition, each person reported his/her political party, Republican (circle) or Democrat (triangle); the measurements were transformed into a scoring system where 0 means national average; see Figure 8. The party membership information was not used for classification but for computing the misclassification error. The standard K-means algorithm was applied to classify the 30 people into 2 groups. Small circles and triangles indicate the true party membership, and large circles and triangles indicate the K-means classification accordingly (an individual is misclassified if the symbols are different). As follows from the left plot, one Republican was mistakenly classified as a Democrat, but there are many more Democrats misclassified as Republicans. Overall, 30% of individuals were misclassified. In the right plot, the same points are used, but, in addition, we have 5 individuals (supervised observations) with known party, marked as solid symbols. The question is: How to incorporate the supervised observations into the classification algorithm and what is the appropriate statistical model?
Now we describe the statistical model for the hard classification that incorporates observations with known clusters. As in the regular K-means model, it is assumed that unsupervised observations follow the assumption where i = k for i ∈ C k , k = 1, 2, ..., K. In addition to these n points, we have p k supervised observations for the kth cluster. Note that p k ≥ 0, and in a special case when p k = 0 for all k = 1, ..., K, we come to the standard K-means model. The twice negative log-likelihood function, up to a constant term, is m(n + P) ln 2 .
Equating the derivative with respect to 2 to zero, we reduce the ML estimation to the following minimization problem: We use this derivation to solve the ML hard classification via the repeated K-means algorithm: 1. Apply the regular K-means to n unsupervised observations {x i }.

Adjust
and apply the K-means to n + ∑ K k=1 p k points {x *i } iterating until convergence.
To understand the adjustments (6), find the center of the kth cluster for observations {x *i }: As follows from this algebra, the adjustments (6) can be viewed as the weighted means using x k and y k with the weights proportional to the number of unsupervised and supervised observations in cluster k, respectively. Typically, it takes 1 or 2 iterations to converge.
This algorithm was applied to the above example (see the right plot of Figure 8), and it converged in 2 iterations. The addition of supervised observations improved the discrimination: the total misclassification error dropped from 30% to 17%.

CLUSTERWISE REGRESSION
Most literature takes the soft clusterization, mixture distribution, approach to linear regression, for example, Yan et al. [38]. An extension of the hard classification to the linear regression model is also known and called clusterwise regression, Spath [31]. In this section, we suggest a statistical model for clusterwise regression, reduce the ML estimation to repeated K-means, demonstrate how the distribution of the regression coefficients can be studied via simulations, and, finally, generalize clusterwise regression to multiple dependent variables.

Single dependent variable
If y i is the ith observation of the dependent variable and x i is the respective m × 1 vector of independent (explanatory) variables, it is assumed that, within each cluster, there is its own vector of regression coefficients , i ∈ C k , k = 1, 2, … , K, under a standard assumption that observations {y i , i = 1, 2, ..., n} are independent. As in the case of regular K-means, the task is not only to estimate the Km regression coefficients but also to identify to what cluster each observation i belongs: that is, to find/estimate the partition of the set {1, 2, ..., n} into K nonoverlapping index sets {C k }. If index sets were known, the residual sum of squares within cluster k could be expressed using the generalized matrix inverse: where y k is the n k × 1 vector of the dependent variable, and X k is the n k × m matrix of independent variables composed of vectors x i (n k is the number of observations in cluster k). This formula covers the cases when n k < m or when matrix X k does not have full rank. Simple algebra shows that the ML estimation reduces to the following optimization problem: This representation gives rise to another interpretation of clusterwise regression. To simplify, let us assume that matrix X k has full rank. Noting that 2 (X ′ k X k ) = cov k is the covariance matrix of̂k, rewrite Thus (7) can be interpreted as the maximization of the total significance test statistic in the Wald test.
The K-means regression analysis can be extended to the case when clusters share regression coefficients (supplied with the subscript 0): For example, the clusters may have the same slopes but different intercepts (see an example below).
To estimate the clusterwise regression with shared coefficients (8), the following repeated K-means algorithm is proposed: (0) apply the least squares to the entire dataset and compute residuals r i ; (1) apply the K-means to residuals {r i } to classify them into K clusters (classification on the real line); (2) estimate 0 and k in each cluster separately using the dummy-variable approach and compute new residuals, and return to step (1). Iterate until convergence. The following example illustrates the repeated K-means algorithm for the clusterwise regression.

Example 4.
Two group regressions with common slope. Consider a simple linear regression y i = 0 + 1 x i + i , where i = 1, 2, ..., n denotes the subject id. The data is suspected to combine 2 groups with different baselines-the intercepts are group-specific, but the slope coefficient 1 is the same (the groups are unknown and subject to estimation). Specifically, we want to know what group the subject depicted with "?" belongs to (the left bottom corner); see Figure 9. The statistical model is y i = 01 + 1 x i + i if i ∈ C 1 , and y i = 02 + 1 x i + i if i ∈ C 2 , where C 1 ∩ C 2 = ∅ and C 1 ∪ C 2 = {1, 2, ..., n}. We start by fitting the data at left with the least squares regression, treating the data as 1 sample. Then we compute the residuals and apply the K-means algorithm to separate the residuals into 2 groups and obtain the first index set approximation, C 1 and C 2 . Next, we introduce 2 dummy variables, d 1i = 1 if i ∈ C 1 and 0 otherwise, and d 2i = 1 if i ∈ C 2 and 0 otherwise (d 1 and d 2 are orthogonal). In the next step, we run the linear model y i = 1 d 1i + 2 d 2i + 1 x i + i and obtain the residuals; we apply the K-means again to obtain the next index set, C 1 and C 2 , and iterate in this fashion while the total residual sum of squares decreases. It took 2 iterations for the data in Figure 9 to converge. The plot at right depicts the results of the clusterwise regression. To indicate the classification, we use different symbols; the regression lines are parallel because the groups have common slope. The question mark subject belongs to Group 1.

Multidimensional dependent variable
Here we generalize the above example to the case when the dependent variable y is m-dimensional [40]. Let X i denote the known m × p matrix of explanatory variables, i = 1, 2, ..., n. As before, we assume that vectors from different clusters have different means (intercepts) but the same slopes. Then the statistical model takes the form where k is the m × 1 vector of cluster-specific intercepts, and is the p × 1 vector of common slope coefficients. Matrix X i should not contain a column of 1's (no intercept) because the model will be not identifiable otherwise-the intercepts are captured by k . It is easy to see that maximization of the log-likelihood function turns into the minimization of The following repeated K-means algorithm is proposed for minimization of this criterion: (0) Estimate the intercepts and slopes treating the data as 1 cluster by stacking {y i } into the nm × 1 vector y and {X i } into the nm × p matrix X.
To represent the vector of cluster-specific intercepts, , let Z = 1 n ⊗ I m (stack n is the m × m identity matrices, ⊗ denotes the matrix Kronecker product), and estimate the linear model

Example 5.
Statistical simulations for clusterwise regression. An advantage of a statistical model for classification is that one can generate data and study the statistical properties of clustering through simulations. Consider the following regression problem with a three-dimensional dependent variable (m = 3) and 2 slope coefficients (p = 2). Two groups of observations (K = 2)-146 observations from the first group (mean vector 1 ) and 54 from the second (mean vector 2 )-have to be identified along with estimation of the 2 slope coefficients (n = 200). Let = 0.75, with the Mahalanobis distance between group-specific intercepts D = || 1 − 2 ||/ = 1.6. How well can the observations be classified into 2 groups, and what is the statistical distribution of the slope coefficients? In particular, we want to understand the impact of grouping on the distribution of the slope coefficients. The results of 10 000 simulations with the data generated according to model (9) are presented in Figure 10.
For each simulated dataset, the repeated K-means algorithm was applied (typically it took about 4-5 iterations to converge), and the index sets C k and slope coefficients were estimated. The slope coefficients were also estimated under the assumption that there were no clusters using the standard linear model for a benchmark comparison. The left plot in Figure 10 shows the results of clustering in 10 000 experiments; the fact that the first 146 observations belong to cluster 1 and the remaining 54 observations belong to cluster 2 (the Approximately 34% of observations were wrongly assigned to another cluster (interestingly, each i has almost the same misclassification error). The distribution of 10 000 slope coefficients is shown in the plots at right. The solid line depicts the Gaussian kernel density estimate from the clusterwise regression, and the dotted line depicts the density of the coefficients when the presence of clusters is ignored; the vertical line indicates the true value of the coefficient. In both cases, the no-cluster distribution (1 group/mean) is tighter with an underestimated standard deviation. The estimates of the second slope are positively biased in both methods; however, the two-group model has a smaller bias.

CLUSTERING OF MULTILEVEL DATA
Traditional cluster algorithms work under the assumption of data homogeneity. Sometimes, the data to classify have a multilevel structure; for example, we may want to classify subjects for whom repeated measurements (replicates) are available. Such data will be referred to as multilevel data. The following example illustrates the concept.

Example 6.
Atomic force microscopy (AFM) for cervical cancer detection. Several studies report that AFM imaging can discriminate normal and cancer cervical cells using physical characteristics of the cell surface [14,15]. Figure 11 depicts a typical distribution of 2 cell AFM image characteristics, namely fractal dimension and cell surface area. The original data (the left plot) represents cell samples from a pap smear exam from n = 25 women; each exam sample contained 2 to 10 cells (black filled circles); the red circle is the average across replicates (red filled circle). We use segments to connect replicates to averages for a better hierarchy visualization; overall there are 138 pairs of observations.
The ground truth is known: there are 3 types of samples: (1) normal cells (10 women), (2) squamous cell carcinoma (7 women), and (3) adenocarcinoma (8 women). Can the K -means algorithm identify the type of the woman's cervix cells in an unsupervised fashion? Two approaches are obvious: (1) use cells as the measurement unit and apply the K-means to all n = 138 two-dimensional vectors, or (2) apply the K-means to averages over replicates (red circles), n = 25. The first approach may lead to confusion because replicates from 1 woman may be assigned to different clusters. The second approach treats the averages equal, but in fact one has to take into account the number of averaged replicates. The following statistical model takes into account the hierarchy of the data by recognizing the difference between the variation of image characteristics within each woman and between women (women heterogeneity).
The statistical model for classification with replicates takes the form of the variance components model [35,21], but the groups are not known. As before, x indicates the vector of observations, but now it has 2 indices: The first index i indicates the observation to be classified (the woman in the AFM example), and the second index j indicates a replicate (there are p i replicates for woman i). The statistical model can be viewed as a simple mixed model [11] where b i ∼ (0, 2 2 I m ), ∼ (0, 2 I m ) are random effects representing intra-individual variation. Note that in the traditional mixed model, the clusters specified by the index sets are known; here we want to estimate them along with the means and variance parameters. In the AFM example, parameter 2 reflects the heterogeneity among women, and it is expected that 2 > 1, reflecting a commonly observed biological phenomenon: namely the variation  between women is larger than the variation within woman. The total variance of x ij is 2 + 2 2 = 2 (1 + 2 ), the sum of the variation across replicates of the same woman ( 2 ) and the variation between women ( 2 2 ). This model implies that replicates corresponding to the same i correlate with the correlation coefficient = 2 /(1 + 2 ). The following theorem lists the facts about the ML classification of the multilevel data specified by model (11).

Maximum likelihood classification
over , k , and {C k , k = 1, ..., K}, where N = ∑ n i=1 p i and (c) Minimization of (12) can be accomplished by alternating between the weighted K-means algorithm using {x i } with weights w i = p i /(1 + p i 2 ) when 2 is held fixed, and the fix-point algorithm for when k and {C k } are held fixed: (d) When 2 = 0, minimization of (12) turns into the weighted K-means algorithm for {x i } with weights w i = p i . See the Appendix for the proof. The fact that the ML estimation with an equal number of replicates reduces to the K-means is understandable because then averages have the same variance and therefore can be treated equally. It is easy to prove that fix-point iterations produce a positive solution if and otherwise 2 = 0. Indeed, consider the right-hand side of expression (13) as a function of 2 . This function approaches (14) when 2 → ∞ . The solution is positive if the derivative of this function, evaluated at 2 = 0, is greater than 1-it is easy to see that this holds under the inequality (15).

Example 7. AFM cell imaging (continued).
We apply the ML for hard classification to AFM cervical cell images using 2 the characteristics shown in the left plot of Figure 11. The R function cclust of package flexclust is used to run the weighted K-means algorithm when 2 is held fixed. The results of classification are shown in the right plot of Figure 11. Only 1 woman, indicated with a large red circle, is misclassified. She belongs to Cluster 1 (adenocarcinoma), but the ML hard classification put her into Cluster 2 (squamous cell carcinoma).

CONCLUSIONS
Typical cluster analysis stops after classification is complete. For the next-generation K-means algorithm, the work is about to start: What is the confidence interval for the mean vector, and how well are the index sets C k estimated? How to test that clusters exist? What is the number of clusters and what is their distribution of its estimate? What are statistical properties of clusterwise regression coefficients? These questions cannot be answered based on the standard algorithm-driven paradigm. The only way to study the properties of the classification is to use a model-based K-means algorithm. This model was proposed by Banfield and Raftery more than 20 years ago, but little has been done since.
We have developed new directions and extensions to the statistical model-based K-means algorithm which turns into the ML estimation. But it is too early to claim victory: The hard classification problem does not fall into the track of the well-established statistical theory because the number of parameters grows with n and the index sets are discrete. Special statistical methods, married with combinatorics, are required, and simulations here will be very helpful.
The hard classification problem, and particularly finding the optimal partition set, may have several local solutions. Development of the global minimum criteria, following the route of continuous optimization [10], is a matter of future work. We strongly recommend the use of at least 10 starting points in the K-means algorithm to ensure that the global minimum has been achieved (kmeans[...,nstart = 10,...] in R).
Obviously, 1 paper cannot solve multiple problems emerging in connection with extension of the K-means algorithm. However, we hope that our work will stimulate interest in further development of hard classification algorithms and deeper understanding of their statistical properties.

ACKNOWLEDGMENTS
I am grateful to the reviewers and the editor for their helpful comments and suggestions that improved the paper. Therefore, the twice-negative log-likelihood function for the ith observation from cluster k, up to a constant term, can be written as l i ( K , 2 , 2 ) = ln 2 + m ln(1 + p 2 ) After some matrix algebra, we obtain to shorten the notation. After these simplifications, the twice-negative log-likelihood function to be minimized takes the form ) .
(A1) The minimum of this function over 2 is attained at Plugging this back into expression (A1) leads to the minimization of