Clustering with Statistical Error Control

This paper presents a clustering approach that allows for rigorous statistical error control similar to a statistical test. We develop estimators for both the unknown number of clusters and the clusters themselves. The estimators depend on a tuning parameter α which is similar to the significance level of a statistical hypothesis test. By choosing α, one can control the probability of overestimating the true number of clusters, while the probability of underestimation is asymptotically negligible. In addition, the probability that the estimated clusters differ from the true ones is controlled. In the theoretical part of the paper, formal versions of these statements on statistical error control are derived in a standard model setting with convex clusters. A simulation study and two applications to temperature and gene expression microarray data complement the theoretical analysis.


INTRODUCTION
Applications of cluster analysis are numerous, ranging from the clustering of temperature curves using data recorded on a spatial grid (DeGaetano, 2001; Finazzi, Haggarty, Miller, Scott, & Fasso, 2015; Fovell & Fovell, 1993) to the clustering of consumer profiles on the basis of survey data (Tripathi, Bhardwaj, & Poovammal, 2018; Wedel & Kamakura, 2000). A major challenge in cluster analysis is to estimate the unknown number of groups K_0 from a sample of data. A common approach is to compute a criterion function that measures the quality of the clustering for different cluster numbers K. An estimator of K_0 is then obtained by optimizing the criterion function over K. Prominent examples of this approach are the Hartigan index (Hartigan, 1975), the silhouette statistic (Rousseeuw, 1987), and the gap statistic (Tibshirani, Walther, & Hastie, 2001).
Another common way to estimate K 0 is based on statistical test theory. Roughly speaking, one can distinguish between two types of test-based procedures: The first type relies on a statistical test that checks either whether some clusters can be merged or whether a cluster can be subdivided. Given a set of clusters, the test is repeatedly applied until no clusters can be merged or split any more. The number of remaining clusters serves as an estimator of K 0 . Classical examples of methods that proceed in this way are discussed in Gordon (1999, Chapter 3.5) who terms them "local methods." Obviously, these methods involve a multiple testing problem. However, the employed critical values do not properly control for the fact that multiple tests are performed. The significance level used to carry out the tests thus cannot be interpreted strictly. Put differently, the procedures do not allow for rigorous statistical error control.
Test-based approaches of the second type proceed by sequentially testing a model with K clusters against one with K + 1 clusters. The smallest number K for which the test does not reject serves as an estimator of K 0 . Most work in this direction has been done in the framework of Gaussian mixture models; see McLachlan and Rathnayake (2014) for an overview. However, deriving a general theory for testing a mixture with K components against one with K ′ > K components has turned out to be a very challenging problem; see Ghosh and Sen (1985) and Hartigan (1985) for a description of the main technical issues involved. Many results are therefore restricted to the special case of testing a homogeneous model against a mixture with K = 2 clusters; see Liu and Shao (2004) and Li, Chen, and Marriott (2009) among many others. More general test procedures often lack a complete theoretical foundation or are based on very restrictive conditions.
Only recently, there have been some advances in developing a general theory for testing K against K ′ > K clusters under reasonably weak conditions. In a mixture model setup, Li and Chen (2010) and Chen, Li, and Fu (2012) have constructed a new expectation-maximization procedure to tackle this testing problem. Outside the mixture model context, Maitra, Melnykov, and Lahiri (2012) have developed a bootstrap procedure to test a model with K groups against one with K ′ > K groups. These articles derive the theoretical properties of the proposed tests under the null hypothesis of K clusters, where K is a prespecified fixed number. However, they do not formally investigate the properties of a procedure that estimates K 0 by sequentially applying the tests. In particular, they do not analyze whether such a sequential procedure may allow for a rigorous interpretation of the significance level that is used to carry out the tests.
The main aim of this article is to construct an estimator K̂_0 of K_0 which allows for rigorous statistical error control in the following sense: For any prespecified significance level α ∈ (0, 1), the proposed estimator K̂_0 = K̂_0(α) has the properties

P(K̂_0 > K_0) ≤ α + o(1),  (1)
P(K̂_0 < K_0) = o(1).  (2)

According to this, the probability of overestimating K_0 is asymptotically controlled by the level α, while the probability of underestimating K_0 is asymptotically negligible. By picking α, one can thus control the probability of choosing too many clusters, while, on the other hand, one can ignore the probability of choosing too few clusters (at least asymptotically).
Estimating K_0 with statistical error control in the sense of (1) and (2) is a challenging problem. In this article, we make a first step toward solving this problem by providing statistical methodology and rigorous mathematical theory for a simple baseline model. Specifically, we construct an estimator K̂_0 with the properties (1) and (2) in a model with convex spherical clusters, which is formally introduced in Section 2. This model is widely used in practice and suitable for a number of applications, as discussed in Section 2 and illustrated by our empirical examples in Section 5. It should be noted, though, that it is not suited for applications where the clusters have general, nonconvex shapes. Restricting attention to the simple baseline model of Section 2 allows us to present our main ideas in a clean way and to derive a complete asymptotic theory for our estimators.
Our estimation approach is developed in Section 3. As will be shown there, the proposed procedure does not only provide us with an estimator K̂_0 of the unknown number of groups K_0. It also yields estimators of the groups themselves, which allow for statistical error control similarly to K̂_0. Our approach is based on the following general strategy:
1. Construct a statistical test that, for any given number K, checks the null hypothesis that there are K clusters in the data.
2. Starting with K = 1, sequentially apply this test until it does not reject the null hypothesis of K clusters any more.
3. Define the estimator K̂_0 of K_0 as the smallest number K for which the test does not reject the null.
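To make the sequential strategy concrete, here is a minimal Python sketch (the paper's actual implementation is the R package CluStErr; the function names here are hypothetical and the test is a stand-in for any level-α test of H0: K clusters):

```python
# Hypothetical sketch of the generic sequential strategy. `test_K_clusters`
# stands in for ANY level-alpha test of H0: "there are K clusters"; it
# returns True when the test rejects H0.

def estimate_K0(data, test_K_clusters, alpha, K_max=20):
    """Return the smallest K for which the K-cluster test does not reject."""
    for K in range(1, K_max + 1):
        if not test_K_clusters(data, K, alpha):  # "do not reject H0"
            return K
    return K_max  # fallback if the test rejects for every K <= K_max

# Toy test that pretends the data supports exactly 3 clusters:
toy_test = lambda data, K, alpha: K < 3      # rejects as long as K < 3
print(estimate_K0(None, toy_test, alpha=0.05))  # -> 3
```

The loop stops at the first non-rejection, which is exactly the definition of the estimator in step 3 above.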
This strategy is discussed in detail in Section 3.1. It is generic in the sense that it can be employed with different test statistics. For our theoretical analysis, we apply it with a specific statistic which is introduced in Section 3.2. For this specific choice, we derive the statements (1) and (2) on statistical error control under suitable regularity conditions (Section 4). Some alternative choices of the test statistic are discussed in Section 6. In the following, we refer to our estimation procedure as CluStErr ("Clustering with Statistical Error Control"). We complement the theoretical analysis of the article by a simulation study and two applications to temperature and microarray data in Section 5. The R code to reproduce the numerical examples is contained in the add-on package CluStErr (Lasota, Vogt, & Schmid, 2019), which implements the CluStErr method and which is part of the supplemental materials of the article.

MODEL
Suppose we measure p features on n different subjects. In particular, for each subject i ∈ {1, …, n}, we observe the vector Y_i = (Y_i1, …, Y_ip)⊤, where Y_ij denotes the measurement of the jth feature for the ith subject. Our data sample thus has the form {Y_i : 1 ≤ i ≤ n}. Both the number of subjects n and the number of features p are assumed to tend to infinity, with n diverging much faster than p. This reflects the fact that n is much larger than p in the applications we have in mind. When clustering the genes in a typical microarray dataset, for instance, the number of genes n is usually a few thousand, whereas the number of tissue samples p is not more than a few dozen. The exact technical conditions on the sizes of n and p are laid out in Section 4.1.
The data vectors Y_i of the various subjects i = 1, …, n are supposed to satisfy the model

Y_i = μ_i + e_i,  (3)

where μ_i = (μ_i1, …, μ_ip)⊤ is a deterministic signal vector and e_i = (e_i1, …, e_ip)⊤ is the noise vector. The subjects in our sample are assumed to belong to K_0 different classes. More specifically, the set of subjects {1, …, n} can be partitioned into K_0 groups G_1, …, G_{K_0} such that for each k,

μ_i = m_k for all i ∈ G_k,  (4)

where m_k ∈ R^p are vectors with m_k ≠ m_{k′} for k ≠ k′. Hence, the members of each group G_k all have the same signal vector m_k.
Equations (3) and (4) specify a model with convex spherical clusters, which underlies the k-means and many other Euclidean distance-based clustering algorithms. Despite its simplicity, this framework has frequently been employed in the literature and is useful for a number of applications as illustrated by our empirical examples. It is thus a suitable baseline model for developing our ideas on clustering with statistical error control. We now discuss Equations (3) and (4) in detail.
Details on Equation (3). The noise vector e_i = (e_i1, …, e_ip)⊤ is assumed to consist of entries e_ij with the additive component structure e_ij = a_i + ε_ij. Equation (3) for the ith subject thus writes as

Y_ij = μ_ij + a_i + ε_ij  (1 ≤ j ≤ p).  (5)

Here, a_i is a subject-specific random intercept term. Moreover, the terms ε_ij are standard idiosyncratic noise variables with E[ε_ij] = 0. For simplicity, we assume the error terms ε_ij to be i.i.d. both across i and j. The random intercepts a_i, in contrast, are allowed to be dependent across subjects i in an arbitrary way. In general, the components of (5) may depend on p, so that the exact formulation of Equation (5) for the ith subject reads Y_ij = μ_ij,p + a_i,p + ε_ij,p. However, to keep the notation simple, we suppress this dependence on p and write the model for the ith subject as (5). If we drop the random intercept a_i from (5), the signal vector μ_i is equal to the mean E[Y_i]. In the general Equation (5), in contrast, μ_i is only identified up to an additive constant. To identify μ_i in (5), we impose the normalization constraint p^{-1} Σ_{j=1}^p μ_ij = 0 for each i. We thus normalize the entries of μ_i to be zero on average for each i. Under the technical conditions specified in Section 4.1, the constraint p^{-1} Σ_{j=1}^p μ_ij = 0 implies that a_i = lim_{p→∞} p^{-1} Σ_{j=1}^p Y_ij almost surely, which in turn identifies the signal vector μ_i.
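The identification argument can be checked numerically: once the signal is normalized to average to zero, the row mean of Y_i converges to the intercept, and subtracting it recovers the signal. A small simulation sketch (the symbols a_i and mu follow the notation above; the particular signal shape is made up for illustration):

```python
import random

random.seed(1)
p = 20000
a_i = 1.7                                   # subject-specific random intercept
mu = [((j % 4) - 1.5) for j in range(p)]    # some signal, centred below
mu_bar = sum(mu) / p
mu = [m - mu_bar for m in mu]               # normalisation: mean-zero signal
Y = [mu[j] + a_i + random.gauss(0, 1) for j in range(p)]

a_hat = sum(Y) / p                          # row mean estimates the intercept
mu_hat = [y - a_hat for y in Y]             # de-meaned data approximates mu
print(abs(a_hat - a_i))                     # small for large p
```

Because the signal averages to zero, the only systematic component of the row mean is a_i plus noise of order p^{-1/2}, which is exactly the almost-sure limit statement in the text.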
Details on Equation (4). This equation specifies the group structure in our model. We assume the number of groups K 0 to be fixed, implying that the groups G k = G k,n depend on the sample size n. Keeping the number of classes K 0 fixed while letting the size of the classes G k,n grow is a reasonable assumption: It reflects the fact that in most applications, we expect the number of groups K 0 to be small compared with the total number of subjects n. To keep the notation simple, we suppress the dependence of the classes G k,n on the sample size n and denote them by G k throughout the article.
In the remainder of this section, we discuss two special cases of model (3)-(4), which are relevant for our empirical applications.
A model for the clustering of time series data. Suppose we observe time series Y_i = (Y_i1, …, Y_ip)⊤ of length p for n different subjects i. The time series Y_i of the ith subject is assumed to follow the time trend model

Y_ij = μ_i(t_j) + a_i + ε_ij  (1 ≤ j ≤ p),  (6)

where μ_i(⋅) is an unknown nonparametric trend function and t_1 < … < t_p are the time points of the observations. The deterministic design points t_j are supposed to be the same across subjects i and are normalized to lie in the unit interval. An important example is the equidistant design t_j = j/p. However, it is also possible to allow for nonequidistant designs. To identify the trend function μ_i(⋅) in (6), we suppose that ∫_0^1 μ_i(w) dw = 0 for each i, which is a slight modification of the identification constraint stipulated in (5). Analogous to our general model, we impose a group structure on the observed time series: There are K_0 groups of time series G_1, …, G_{K_0} such that μ_i(⋅) = m_k(⋅) for all i ∈ G_k. Hence, the members of each class G_k all have the same time trend function m_k(⋅).
A model for the clustering of genes in microarray experiments. In a microarray experiment, the expression levels of n different genes are often measured in p different tissue samples (obtained, e.g., from p different patients). For each gene i, we observe the vector Y_i = (Y_i1, …, Y_ip)⊤, where Y_ij is the measured expression level of gene i for tissue sample j. The vector Y_i of gene i is supposed to satisfy the model Equation (5), which componentwise reads as Y_ij = μ_ij + a_i + ε_ij. Here, μ_ij can be regarded as the true expression level of gene i for tissue j, whereas Y_ij is the measured expression level corrupted by the noise term a_i + ε_ij. Most microarray experiments involve different types of tissues, for example, tumor "cases" versus healthy "controls," or different tumor (sub)types. We therefore suppose that there are T different types of tissues in our sample and order them according to their type (which is known by experimental design). More specifically, the tissues j of type t are labeled by j_{t−1} ≤ j < j_t, where 1 = j_0 < j_1 < ⋯ < j_{T−1} < j_T = p + 1. If the patients from which tissues are obtained constitute samples of sufficiently homogeneous populations, it is natural to assume that the true expression level μ_ij of gene i is the same for tissues j of the same type, that is, μ_ij = μ_ij′ for j_{t−1} ≤ j, j′ < j_t. The signal vector μ_i thus has a piecewise constant structure for each i. As in our general model, we suppose that there are K_0 groups of genes G_1, …, G_{K_0} such that μ_i = m_k for all i ∈ G_k and some vector m_k. The genes of each class G_k thus have the same (co)expression profile m_k.
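The piecewise-constant structure can be illustrated with a tiny sketch that builds signal vectors for T = 3 tissue types and two gene groups (the boundaries j_t and the profile values are made up for illustration; 0-based indexing is used instead of the paper's 1-based labels):

```python
# Tissue-type boundaries: tissues j with boundaries[t] <= j < boundaries[t+1]
# share type t, so the signal is constant on each of these segments.
boundaries = [0, 3, 6, 9]            # three types, p = 9 tissues (made up)
profiles = {0: [1.0, -2.0, 1.0],     # group-specific expression per type
            1: [0.5, 0.5, -1.0]}

def signal_vector(group):
    """Piecewise-constant signal mu_i: one value per tissue type."""
    mu = []
    for t in range(len(boundaries) - 1):
        mu += [profiles[group][t]] * (boundaries[t + 1] - boundaries[t])
    return mu

print(signal_vector(0))
# -> [1.0, 1.0, 1.0, -2.0, -2.0, -2.0, 1.0, 1.0, 1.0]
```

Every gene in a group shares such a vector, which is exactly the condition μ_ij = μ_ij′ within a tissue type combined with μ_i = m_k within a group.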

ESTIMATION METHOD
We now present our approach to estimate the unknown groups G 1 , … , G K 0 and their unknown number K 0 in model (3)-(4). Section 3.1 gives an overview of the general method, while Sections 3.2-3.4 fill in the details.

The general method
To construct our method, we proceed in two steps: In the first step, we specify an algorithm that clusters the set of subjects {1, …, n} into K groups for any given number K (which may or may not coincide with the true number of classes K_0). Let {Ĝ_k^[K] : 1 ≤ k ≤ K} be the K clusters produced by the algorithm when the number of clusters is K. For K = 1, we trivially set Ĝ_1^[1] = {1, …, n}. For our theory to work, we require the clustering algorithm to consistently estimate the class structure {G_k : 1 ≤ k ≤ K_0} when K = K_0. More specifically, we require the estimators {Ĝ_k^[K_0] : 1 ≤ k ≤ K_0} to satisfy

P({Ĝ_k^[K_0] : 1 ≤ k ≤ K_0} = {G_k : 1 ≤ k ≤ K_0}) → 1.  (8)

This restriction is satisfied by a wide range of clustering algorithms under our regularity conditions. It is, for example, satisfied by a k-means-type algorithm introduced in Section 3.3. Moreover, it can be shown to hold for a number of hierarchical clustering algorithms, in particular for agglomerative algorithms with single, average, and complete linkage. Our estimation method can be based on any clustering algorithm that has the consistency property (8).
In the second step, we construct a test for each K, which checks whether the data can be well described by the K clusters Ĝ_1^[K], …, Ĝ_K^[K]. We thereby test whether the number of clusters is equal to K. More formally, we use the K-cluster partition {Ĝ_k^[K] : 1 ≤ k ≤ K} to construct a statistic Ĥ^[K] that allows us to test the hypothesis H_0: K = K_0 versus H_1: K < K_0. For any given number of clusters K, our test is defined as T^[K] = 1(Ĥ^[K] > q(α)), where q(α) is the (1 − α)-quantile of a known distribution, which will be specified later on. We reject H_0 at the level α if T^[K] = 1, that is, if Ĥ^[K] > q(α). A detailed construction of the statistic Ĥ^[K] along with a precise definition of the quantile q(α) is given in Section 3.2.
To estimate the classes G_1, …, G_{K_0} and their number K_0, we proceed as follows: For each K = 1, 2, …, we check whether Ĥ^[K] ≤ q(α) and stop as soon as this criterion is satisfied. Put differently, we carry out our test for each K = 1, 2, … until it does not reject H_0 any more. Our estimator of K_0 is defined as the smallest number K for which Ĥ^[K] ≤ q(α), that is, for which the test does not reject H_0. Formally speaking, we define

K̂_0 = min{K = 1, 2, … : Ĥ^[K] ≤ q(α)}.  (9)

Moreover, we estimate the class structure {G_k : 1 ≤ k ≤ K_0} by {Ĝ_k : 1 ≤ k ≤ K̂_0} with Ĝ_k = Ĝ_k^[K̂_0]. The definition (9) can equivalently be written as

K̂_0 = min{K = 1, 2, … : p̂^[K] > α},  (10)

where p̂^[K] is the p-value corresponding to the statistic Ĥ^[K]. The heuristic idea behind (10) is as follows: Starting with K = 1, we successively test whether the data can be well described by a model with K clusters, in particular by the partition {Ĝ_k^[K] : 1 ≤ k ≤ K}. For each K, we compute the p-value p̂^[K] that expresses our confidence in a model with K clusters. We stop as soon as p̂^[K] > α, that is, as soon as we have enough statistical confidence in a model with K groups.
As shown in Section 4, under appropriate regularity conditions, our statistic Ĥ^[K] has the property that

P(Ĥ^[K] ≤ q(α)) → 1 − α for K = K_0 and P(Ĥ^[K] > q(α)) → 1 for K < K_0.  (11)

Put differently, P(T^[K] = 0) → 1 − α for K = K_0 and P(T^[K] = 1) → 1 for K < K_0. Hence, our test is asymptotically of level α. Moreover, it detects the alternative H_1: K < K_0 with probability tending to 1, that is, its power against H_1 is asymptotically equal to 1. From (11), it follows that

P(K̂_0 > K_0) = α + o(1),  (12)
P(K̂_0 < K_0) = o(1).  (13)

Hence, the probability of overestimating K_0 is asymptotically bounded by α, while the probability of underestimating K_0 is asymptotically negligible. By picking α, we can thus control the probability of choosing too many clusters similarly to the Type-I-error probability of a test. Moreover, we can asymptotically ignore the probability of choosing too few clusters similarly to the Type-II-error probability of a test. In finite samples, there is of course a trade-off between the probabilities of under- and overestimating K_0: By decreasing the significance level α, we can reduce the probability of overestimating K_0. However, we pay for this by increasing the probability of underestimating K_0, since P(K̂_0(α′) < K_0) ≥ P(K̂_0(α) < K_0) for α′ < α. This can also be regarded as a trade-off between the size and the power of the test on which K̂_0 is based. Taken together, the two statements (12) and (13) yield that

P(K̂_0 ≠ K_0) → α,  (14)

that is, the probability that the estimated number of classes K̂_0 differs from the true number of classes K_0 is asymptotically equal to α. With the help of (14) and the consistency property (8) of the estimated clusters, we can further show that

P({Ĝ_k : 1 ≤ k ≤ K̂_0} ≠ {G_k : 1 ≤ k ≤ K_0}) → α,  (15)

that is, the probability of making a classification error is asymptotically equal to α as well. The statements (12)-(15) give a mathematically precise description of the statistical error control that can be performed by our method.

Construction of the statistic Ĥ^[K]
To construct the statistic Ĥ^[K], we use the following notation:
1. Let Y*_ij = Y_ij − a_i denote the observations purged of the random intercept and define Ŷ_ij = Y_ij − â_i with â_i = p^{-1} Σ_{j=1}^p Y_ij. The variables Ŷ_ij serve as approximations of Y*_ij, since under standard regularity conditions Ŷ_ij ≈ Y*_ij for large p.
2. For any set S ⊆ {1, …, n}, let m_{j,S} = (#S)^{-1} Σ_{i∈S} μ_ij be the average of the signals μ_ij with i ∈ S and estimate it by m̂_{j,S} = (#S)^{-1} Σ_{i∈S} Ŷ_ij. We use the notation m_{j,k}^[K] = m_{j,Ĝ_k^[K]} and m̂_{j,k}^[K] = m̂_{j,Ĝ_k^[K]} to denote the average of the signals in the cluster Ĝ_k^[K] and its estimator, respectively.
With this notation at hand, we define the statistic

Δ̂_i^[K] = ( Σ_{j=1}^p (Ŷ_ij − m̂_{j,k(i)}^[K])² − σ̂² p ) / ( κ̂ √p )  (16)

for each subject i, where k(i) is the index of the cluster Ĝ_{k(i)}^[K] to which subject i is assigned, σ̂² is an estimator of the error variance σ² = E[ε_ij²], and κ̂ is an estimator of the normalization constant κ = (E[ε_ij⁴] − σ⁴)^{1/2} (both are constructed in Section 3.4). The statistic Δ̂_i^[K] is essentially a scaled version of the residual sum of squares for the ith subject when the number of clusters is K. Intuitively, Δ̂_i^[K] measures how well the data of the ith subject are described when the sample of subjects is partitioned into the K clusters {Ĝ_k^[K] : 1 ≤ k ≤ K}. The individual statistics Δ̂_i^[K] are the building blocks of the overall statistic Ĥ^[K].

Before we move on with the construction of Ĥ^[K], we have a closer look at the stochastic behavior of the statistics Δ̂_i^[K]. To do so, we consider the following stylized situation: We assume that the variables ε_ij are i.i.d. normally distributed with mean 0 and variance σ², so that κ = √2 σ² and κ√p = σ²√(2p). Moreover, we neglect the estimation error in the expressions Ŷ_ij, m̂_{j,k}^[K], σ̂², and κ̂. In this situation, Δ̂_i^[K] takes the form

Δ_i^[K] = ( Σ_{j=1}^p (ε_ij + δ_{ij,S})² − σ² p ) / ( κ √p ),

where δ_{ij,S} = μ_ij − m_{j,S} is the difference between the signal μ_ij of the ith subject and the average signal in the cluster S = Ĝ_{k(i)}^[K]. We now give a heuristic discussion of the behavior of Δ_i^[K] in the following two cases:

K = K_0: In this case, each cluster S = Ĝ_k^[K_0] consistently estimates a class G_k. Neglecting the estimation error in Ĝ_k^[K_0], we have δ_{ij,S} = 0 and thus

Δ_i^[K_0] = ( Σ_{j=1}^p ε_ij² − σ² p ) / ( κ √p )  (17)

for each i. Hence, the individual statistics Δ_i^[K_0] all have a rescaled χ²-distribution.

K < K_0: If we pick K smaller than the true number of classes K_0, the clusters {Ĝ_k^[K] : 1 ≤ k ≤ K} cannot provide an appropriate approximation of the true class structure {G_k : 1 ≤ k ≤ K_0}. In particular, there is always a cluster S = Ĝ_k^[K] which contains subjects from at least two different classes. For simplicity, let S = G_{k_1} ∪ G_{k_2}. For any i ∈ S, it holds that δ_{ij,S} ≠ 0 for a nonnegligible fraction of the entries j under our regularity conditions from Section 4.1. Moreover, it is not difficult to see that, for at least some i ∈ S,

Δ_i^[K] ≥ c √p with probability tending to 1 for some constant c > 0,  (18)

that is, the statistic Δ_i^[K] has an explosive behavior.

According to these heuristic considerations, the statistics Δ_i^[K] behave as follows. If K = K_0, then {Δ_i^[K_0] : 1 ≤ i ≤ n} is a collection of (approximately) independent random variables that (approximately) have a rescaled χ²-distribution. Hence, all statistics Δ_i^[K_0] have a stable, nonexplosive behavior. This reflects the fact that the partition {Ĝ_k^[K_0] : 1 ≤ k ≤ K_0} is an accurate estimate of the true class structure and thus yields a moderate residual sum of squares for all subjects i. If K < K_0, in contrast, Δ_i^[K] has an explosive behavior at least for some subjects i. This mirrors the fact that a partition with K < K_0 clusters cannot give a reasonable approximation to the true class structure. In particular, it cannot describe the data of all subjects i in an accurate way, resulting in an explosive behavior of the (rescaled) residual sum of squares for some i.

Since the statistics Δ̂_i^[K] behave quite differently depending on whether K = K_0 or K < K_0, they can be used to test H_0: K = K_0 versus H_1: K < K_0. In particular, testing H_0 versus H_1 can be achieved by testing the hypothesis that Δ̂_1^[K], …, Δ̂_n^[K] are variables with a rescaled χ²-distribution against the alternative that at least one Δ̂_i^[K] has an explosive behavior. We now construct a statistic Ĥ^[K] for this testing problem. A natural approach is to take the maximum of the individual statistics Δ̂_i^[K]: Define

Ĥ^[K] = max_{1≤i≤n} Δ̂_i^[K].  (19)

Our heuristic discussion from above, in particular formula (17), suggests that for K = K_0, P(Ĥ^[K] ≤ q(α)) → 1 − α. Moreover, for K < K_0, we can show with the help of (18) and some additional considerations that Ĥ^[K] ≥ c√p for some c > 0 with probability tending to 1. The quantile q(α), in contrast, can be shown to grow at the rate √log n. Since √log n = o(√p) under our formal assumptions on n and p, Ĥ^[K] diverges faster than the quantile q(α), implying that P(Ĥ^[K] > q(α)) → 1 for K < K_0. This suggests that Ĥ^[K] has the property (11) and thus is a reasonable statistic to test the hypothesis H_0: K = K_0 versus H_1: K < K_0. In this article, we restrict attention to the maximum statistic Ĥ^[K] defined in (19). In principle though, we may work with any statistic that satisfies the property (11). In Section 6, we discuss some alternative choices of Ĥ^[K].
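A minimal numerical sketch of the individual statistics and the maximum statistic follows; it assumes the residual sum of squares is centred by σ²p and scaled by κ√p (this particular normalization, and treating σ², κ as given, are assumptions of the sketch):

```python
import math

def delta_stat(residuals_i, sigma2, kappa):
    """Centred and scaled residual sum of squares for one subject.

    Under a correct K-cluster partition the squared residuals behave like
    sigma^2 * chi^2_p draws, so after centring by sigma^2 * p and scaling
    by kappa * sqrt(p) the statistic stays moderate; under a too-coarse
    partition it blows up at rate sqrt(p) for some subjects.
    """
    p = len(residuals_i)
    rss = sum(r * r for r in residuals_i)
    return (rss - sigma2 * p) / (kappa * math.sqrt(p))

def max_statistic(residual_matrix, sigma2, kappa):
    """Maximum of the individual statistics over all subjects."""
    return max(delta_stat(r, sigma2, kappa) for r in residual_matrix)

# Tiny deterministic example: the second subject has inflated residuals,
# so it dominates the maximum.
print(max_statistic([[1.0] * 4, [2.0] * 4], sigma2=1.0, kappa=1.0))  # -> 6.0
```

A subject whose residuals match the noise level contributes a value near 0; a subject sitting in a merged cluster contributes a value growing like √p, which is the separation the test exploits.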

A k-means clustering algorithm
In this section, we construct a k-means-type clustering algorithm, which has the consistency property (8). Since its introduction by Cox (1957) and Fisher (1958), the k-means algorithm has become one of the most popular tools in cluster analysis. Our version of the algorithm differs from the standard one in the choice of the initial values.
To ensure the consistency property (8), we pick initial clusters Ĝ_1^{[K],0}, …, Ĝ_K^{[K],0} for each given K as follows: Let i_1, …, i_K be indices which (with probability tending to 1) belong to K different classes G_{k_1}, …, G_{k_K} in the case that K ≤ K_0 and to K_0 different classes in the case that K > K_0. We explain how to obtain such indices below. With these indices at hand, we compute the distances

ρ̂_k(i) = p^{-1} Σ_{j=1}^p (Ŷ_ij − Ŷ_{i_k j})²  for 1 ≤ k ≤ K.

The starting values Ĝ_1^{[K],0}, …, Ĝ_K^{[K],0} are now defined by assigning the index i to cluster Ĝ_k^{[K],0} if ρ̂_k(i) = min_{1≤k′≤K} ρ̂_{k′}(i). The indices i_1, …, i_K in this construction are computed as follows: For K = 2, pick any index i_1 ∈ {1, …, n} and calculate i_2 = arg max_{1≤i≤n} ρ̂(i_1, i), where ρ̂(i, i′) = p^{-1} Σ_{j=1}^p (Ŷ_ij − Ŷ_{i′j})². Next suppose we have already constructed the indices i_1, …, i_{K−1} for the case of K − 1 clusters and compute the corresponding starting values Ĝ_1^{[K−1],0}, …, Ĝ_{K−1}^{[K−1],0} as described above. Calculate the maximal within-cluster distance max_{1≤k≤K−1} max_{i∈Ĝ_k^{[K−1],0}} ρ̂(i_k, i) and let i_K be an index at which this maximum is attained.

To formulate the k-means algorithm, let the number of clusters K be given and denote the starting values by Ĝ_1^{[K],0}, …, Ĝ_K^{[K],0}. The rth iteration of the k-means algorithm proceeds as follows:

Step r: For each subject i, compute the distances p^{-1} Σ_{j=1}^p (Ŷ_ij − m̂_{j,k}^{[K],r−1})² to the cluster means of the previous step, where m̂_{j,k}^{[K],r−1} denotes the average of the observations Ŷ_ij in the cluster Ĝ_k^{[K],r−1}, and assign i to the cluster with the smallest distance. This yields the updated clusters Ĝ_1^{[K],r}, …, Ĝ_K^{[K],r}.

Repeat this algorithm until the estimated groups do not change any more. For a given sample of data, this is guaranteed to happen after finitely many steps. The resulting k-means estimators are denoted by Ĝ_1^[K], …, Ĝ_K^[K]. In Section 4, we formally show that these estimators have the consistency property (8) under our regularity conditions.
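The seeding and iteration can be sketched as follows. The max-min ("farthest point") seeding below is a common heuristic in the spirit of the construction above; the paper's exact seeding rule may differ in detail, and all function names are illustrative:

```python
import random

def dist2(x, y):
    """Feature-averaged squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def farthest_point_seeds(data, K):
    """Greedy max-min seeding: start anywhere, then repeatedly add the
    point farthest from the seeds chosen so far (a common heuristic)."""
    seeds = [0]                                   # i_1: any index
    while len(seeds) < K:
        d = [min(dist2(data[i], data[s]) for s in seeds)
             for i in range(len(data))]
        seeds.append(max(range(len(data)), key=d.__getitem__))
    return seeds

def kmeans(data, K, iters=50):
    """Plain k-means with the seeding above; returns cluster labels."""
    centers = [data[s] for s in farthest_point_seeds(data, K)]
    labels = [0] * len(data)
    for _ in range(iters):
        new = [min(range(K), key=lambda k: dist2(x, centers[k]))
               for x in data]
        if new == labels:                 # groups unchanged -> converged
            break
        labels = new
        for k in range(K):
            members = [data[i] for i in range(len(data)) if labels[i] == k]
            if members:                   # recompute cluster means
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels

random.seed(0)
data = [[random.gauss(m, 0.1)] for m in [0, 0, 0, 5, 5, 5]]
print(kmeans(data, 2))   # the two well-separated groups are recovered
```

Because the seeds are pushed apart, each seed tends to land in a different class with high probability when the classes are well separated, which is the role the indices i_1, …, i_K play above.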

Estimation of σ² and κ
In practice, the error variance σ² and the normalization constant κ are unknown and need to be estimated from the data at hand. We distinguish between two different estimation approaches, namely, a difference- and a residual-based approach.

Difference-based estimators
To start with, consider the time trend model from Section 2, where μ_ij = μ_i(j/p) with some smooth trend function μ_i(⋅). If the trend functions are Lipschitz continuous, the first differences satisfy Y_ij − Y_{i,j−1} = ε_ij − ε_{i,j−1} + O(p^{-1}), which suggests the variance estimator

σ̂²_Lip = (2n(p − 1))^{-1} Σ_{i=1}^n Σ_{j=2}^p (Y_ij − Y_{i,j−1})².

Similarly, the fourth moment γ = E[ε_ij⁴] can be estimated by

γ̂_Lip = (2n(p − 1))^{-1} Σ_{i=1}^n Σ_{j=2}^p (Y_ij − Y_{i,j−1})⁴ − 3 σ̂⁴_Lip,

which in turn allows us to estimate the parameter κ = (γ − σ⁴)^{1/2} by

κ̂_Lip = (γ̂_Lip − σ̂⁴_Lip)^{1/2}.

Difference-based estimators of this type have been considered in the context of nonparametric regression by Müller and Stadtmüller (1988), Hall, Kay, and Titterington (1990), and Tecuapetla-Gómez and Munk (2017). The estimators σ̂²_Lip and κ̂_Lip are particularly suited for applications where the trend functions μ_i(⋅) can be expected to be fairly smooth. This ensures that the unknown first differences (μ_ij − μ_{i,j−1}) are close to zero, so that the observed differences (Y_ij − Y_{i,j−1}) are dominated by the noise terms (ε_ij − ε_{i,j−1}). A similar difference-based estimation strategy can be used in the model for gene expression microarray data from Section 2. In this setting, the signal vectors μ_i have a piecewise constant structure. In particular, μ_ij − μ_{i,j−1} = 0 for all j ∉ {j_1, …, j_{T−1}}, so that the differences Y_ij − Y_{i,j−1} away from the breakpoints contain only noise. Similarly as before, we may thus estimate σ², γ, and κ by

σ̂²_pc = (2n(p − T))^{-1} Σ_{i=1}^n Σ_{j∉{j_1,…,j_{T−1}}, j≥2} (Y_ij − Y_{i,j−1})²,

with γ̂_pc and κ̂_pc defined analogously.
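The first-difference variance estimator can be sketched in a few lines. This is the classical Rice-type difference estimator; whether it matches the paper's σ̂²_Lip in every constant is an assumption of the sketch:

```python
import math
import random

def sigma2_diff(Y):
    """Difference-based variance estimate: for smooth trends,
    Y[i][j] - Y[i][j-1] ~ eps_ij - eps_{i,j-1}, whose variance is
    2 * sigma^2, hence the factor 1/2 in front of the mean."""
    n, p = len(Y), len(Y[0])
    s = sum((Y[i][j] - Y[i][j - 1]) ** 2
            for i in range(n) for j in range(1, p))
    return s / (2 * n * (p - 1))

# Simulated check: smooth sine trends plus i.i.d. Gaussian noise.
random.seed(2)
n, p, sigma = 50, 400, 1.5
trend = lambda w: math.sin(2 * math.pi * w)
Y = [[trend(j / p) + random.gauss(0, sigma) for j in range(p)]
     for _ in range(n)]
print(sigma2_diff(Y))    # close to sigma^2 = 2.25
```

The trend contributes only O(p^{-2}) per squared difference, so the estimate is driven almost entirely by the noise, as the Lipschitz argument in the text requires.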

Residual-based estimators
Let {Ĝ_k^[K] : 1 ≤ k ≤ K} be the k-means estimators from Section 3.3 for a given K. Moreover, let ε̂_ij^[K] = Ŷ_ij − m̂_{j,k(i)}^[K] be the cluster-specific residuals introduced at the beginning of Section 3.2 and denote the vector of residuals for the ith subject by ε̂_i^[K] = (ε̂_i1^[K], …, ε̂_ip^[K])⊤. With this notation at hand, we define the residual sum of squares for K clusters by

RSS(K) = (np)^{-1} Σ_{i=1}^n ||ε̂_i^[K]||²,

where ||⋅|| denotes the usual Euclidean norm for vectors. RSS(K) can be shown to be a consistent estimator of σ² for any fixed K ≥ K_0. The reason is the following: For any K ≥ K_0, the k-means estimators Ĝ_k^[K] have the property that, with probability tending to 1, each estimated cluster Ĝ_k^[K] contains elements from only one class G_{k′} (21) under the technical conditions (C1)-(C3). The residuals ε̂_ij^[K] should thus give a reasonable approximation to the unknown error terms ε_ij. This in turn suggests that the residual sum of squares RSS(K) should be a consistent estimator of σ² for K ≥ K_0.

Now suppose we know that K_0 is not larger than some upper bound K_max. In this situation, we may try to estimate σ² by σ̃²_RSS = RSS(K_max). Even though consistent, this is a very poor estimator of σ² in finite samples. The issue is the following: The larger K_max, the smaller the residual sum of squares RSS(K_max) tends to be. This is a natural consequence of the way in which the k-means algorithm works. Hence, if K_max is much larger than K_0, then σ̃²_RSS = RSS(K_max) tends to strongly underestimate σ². To avoid this issue, we replace the naive estimator σ̃²_RSS by a refined version σ̂²_RSS, which is computed as follows:

1. Split the features into two parts: let J_A and J_B be disjoint index sets with J_A ∪ J_B = {1, …, p}. For ℓ = A, B, write p_ℓ := #J_ℓ and assume that p_ℓ/p → c_ℓ with some c_ℓ ∈ (0, 1). Moreover, for ℓ = A, B, let Ŷ_i^ℓ = (Ŷ_ij : j ∈ J_ℓ) and set 𝒜 = {Ŷ_i^A : 1 ≤ i ≤ n} and ℬ = {Ŷ_i^B : 1 ≤ i ≤ n}. Under the technical conditions (C1)-(C3), the random vectors in 𝒜 are independent from those in ℬ.

2. Apply the k-means algorithm with K = K_max to the sample 𝒜 and denote the resulting estimators by Ĝ_1^A, …, Ĝ_{K_max}^A. These estimators can be shown to have the property (21), provided that we impose the following condition: Let m_k be the class-specific signal vector of the class G_k and define the subvectors m_k^A = (m_{j,k} : j ∈ J_A) and m_k^B = (m_{j,k} : j ∈ J_B) in the same way as above. Assume that m_k^A ≠ m_{k′}^A for k ≠ k′. According to this assumption, the signal vectors m_k and m_{k′} of two different classes can be distinguished from each other already by looking at m_k^A and m_{k′}^A. It goes without saying that this is not a very severe restriction.

3. Compute cluster-specific residuals from the data sample ℬ, that is, ε̂_ij^B = Ŷ_ij − m̂_{j,k(i)}^B for j ∈ J_B, where m̂_{j,k}^B = (#Ĝ_k^A)^{-1} Σ_{i∈Ĝ_k^A} Ŷ_ij is the average of the B-features in the cluster Ĝ_k^A, and set

σ̂²_RSS = (n p_B)^{-1} Σ_{i=1}^n Σ_{j∈J_B} (ε̂_ij^B)².

The normalization constant κ can be estimated analogously from the fourth moments of the residuals ε̂_ij^B, yielding κ̂_RSS. In contrast to the naive estimator σ̃²_RSS, the refined version σ̂²_RSS does not tend to strongly underestimate σ². The main reason is that the residuals ε̂_ij^B are computed from the random vectors Ŷ_i^B, which are independent of the estimated clusters Ĝ_k^A.
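The sample-splitting idea can be sketched as follows. The partition is taken as given (as if it had been fitted on the A-features), and the variance is estimated only from the B-features; the function name and the even/odd split are illustrative choices, not the paper's:

```python
import random

def split_variance_estimate(Y, labels_from_A, J_B):
    """Estimate sigma^2 from the B-features only, using a partition that
    was fitted on the disjoint A-features. Because the B-features are
    independent of the estimated clusters, the residuals avoid the
    downward bias of the naive within-sample RSS."""
    groups = {}
    for i, k in enumerate(labels_from_A):
        groups.setdefault(k, []).append(i)
    total, count = 0.0, 0
    for members in groups.values():
        for j in J_B:
            m = sum(Y[i][j] for i in members) / len(members)  # B-mean
            total += sum((Y[i][j] - m) ** 2 for i in members)
            count += len(members)
    return total / count

# Simulated check: two groups, unit noise variance.
random.seed(3)
n, p, sigma = 60, 200, 1.0
J_B = list(range(1, p, 2))             # B-features; A = the even indices
labels = [i % 2 for i in range(n)]     # partition, as if fitted on A
Y = [[(2.0 if i % 2 else -2.0) + random.gauss(0, sigma) for _ in range(p)]
     for i in range(n)]
print(split_variance_estimate(Y, labels, J_B))   # close to sigma^2 = 1.0
```

Had the residuals been taken on the same features used for clustering, the fitted means would chase the noise and the estimate would be biased downward; the split removes that dependence.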
A sketch of the proof is given in the Supplementary Material.
Remark 1. The difference- and residual-based estimators of σ² and κ constructed above are suited to different situations. While the residual-based estimators σ̂²_RSS and κ̂_RSS work under general conditions, the difference-based estimators σ̂²_Lip and κ̂_Lip as well as σ̂²_pc and κ̂_pc are designed for special model settings where the underlying signal vectors μ_i have a specific structure: σ̂²_Lip and κ̂_Lip are constructed for applications where the underlying signals μ_i are smooth trends, whereas σ̂²_pc and κ̂_pc are constructed for the case of piecewise constant signals. We suggest using the difference-based estimators in the special cases for which they are designed rather than the residual-based estimators, since the latter do not exploit the specific structure of the signal vectors and can thus be expected to be less precise. In the general case where the signal vectors are not known to have any particular structure, in contrast, the residual-based estimators should be used.
Remark 2. In this article, we analyze a simple model setting that presupposes that the error variance σ² (as well as the parameter κ) is the same across all subjects i. This is of course a quite restrictive assumption that might be violated in applications. In a more general model, it could be assumed that the error variances are group-specific. In particular, for each 1 ≤ k ≤ K_0, the subjects i ∈ G_k could be assumed to have the error variance σ_k² = Var(ε_ij), which may be different across k. Notably, it is in principle possible to generalize the CluStErr method to a model with group-specific error variances. To see this, consider the situation where the signal vectors are piecewise constant. In this case, we can estimate the error variance of each individual subject i by the difference-based estimator

σ̂²_{i,pc} = (2(p − T))^{-1} Σ_{j∉{j_1,…,j_{T−1}}, j≥2} (Y_ij − Y_{i,j−1})².

If the true groups G_k were known, we could further estimate the group-specific error variances σ_k² by σ̂²_{k,pc} = (#G_k)^{-1} Σ_{i∈G_k} σ̂²_{i,pc} for 1 ≤ k ≤ K_0. These estimators are of course not feasible in practice. Assuming that we know an upper bound K_max on K_0, we can replace them by feasible versions computed from the estimated clusters Ĝ_1^{[K_max]}, …, Ĝ_{K_max}^{[K_max]}.
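The group-averaging step of Remark 2 can be sketched in a few lines (the function is hypothetical; it simply averages given subject-level variance estimates within each estimated cluster):

```python
def groupwise_variances(subject_vars, labels):
    """Average subject-specific variance estimates within each cluster,
    as suggested in Remark 2 for group-specific error variances."""
    groups = {}
    for v, k in zip(subject_vars, labels):
        groups.setdefault(k, []).append(v)
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}

print(groupwise_variances([1.0, 1.2, 3.8, 4.2], [0, 0, 1, 1]))
# -> {0: 1.1, 1: 4.0}
```

With the true groups replaced by estimated clusters, this yields the feasible group-specific estimates described above.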

ASYMPTOTICS
In this section, we investigate the asymptotic properties of our estimators. We first list the assumptions needed for the analysis and then summarize the main results.

Assumptions
To formulate the technical conditions that we impose on model (3)-(4), we denote the size, that is, the cardinality, of the class G_k by n_k = #G_k. Moreover, we use the shorthand a ≪ b to express that a/b ≤ c p^{−δ} for sufficiently large p, with some c > 0 and a small δ > 0. Our assumptions read as follows:

(C1) The errors ε_ij are identically distributed and independent across both i and j with E[ε_ij] = 0 and E[|ε_ij|^θ] ≤ C < ∞ for some θ > 8.

(C2) The class-specific signal vectors m_k = (m_{1,k}, …, m_{p,k})⊤ differ across groups in the following sense: There exists a constant c_0 > 0 such that p^{-1} Σ_{j=1}^p (m_{j,k} − m_{j,k′})² ≥ c_0 for any pair of groups G_k and G_{k′} with k ≠ k′. Moreover, |m_{j,k}| ≤ C for all k and j, where C > 0 is a sufficiently large constant.

(C3) Both n and p tend to infinity. The group sizes n_k = #G_k are such that p ≪ n_k ≪ p^{(θ/4)−1} for all 1 ≤ k ≤ K_0, implying that p ≪ n ≪ p^{(θ/4)−1}.
We briefly comment on the above conditions. By imposing (C1), we restrict the noise terms ε_ij to be i.i.d. Yet, the error terms e_ij = η_i + ε_ij of our model may be dependent across subjects i, as we do not impose any restrictions on the random intercepts η_i. This is important, for instance, when clustering the genes in a microarray dataset, where we may expect different genes i to be correlated. (C2) is a fairly harmless condition, which requires the signal vectors to differ in an L²-sense across groups. By (C3), the group sizes n_k and thus the total number of subjects n are supposed to grow faster than the number of features p. We thus focus attention on applications where n is (much) larger than p. However, n should not grow too quickly compared with p. Specifically, it should not grow faster than p^{(θ/4)−1}, where θ is the number of existing error moments E[|ε_ij|^θ] < ∞. As can be seen, the bound p^{(θ/4)−1} on the growth rate of n gets larger with increasing θ. In particular, if all moments of ε_ij exist, n may grow as quickly as any polynomial of p. Importantly, (C3) allows the group sizes n_k to grow at different rates (between p and p^{(θ/4)−1}). Put differently, it allows for strongly heterogeneous group sizes. Our estimation methods are thus able to deal with situations where some groups are much smaller than others.

Main results
Our first result shows that the maximum statistic Φ̂^[K] = max_{1≤i≤n} Δ̂_i^[K] has the property (11) and thus is a reasonable statistic to test the hypothesis H_0: K = K_0.

Theorem 1. Assume that the estimated clusters have the consistency property (8). Moreover, let σ̂² and κ̂ be any estimators with σ̂² = σ² + O_p(p^{−(1/2+δ)}) and κ̂ = κ + O_p(p^{−δ}) for some δ > 0. Then, under (C1)-(C3), the statistic Φ̂^[K] has the property (11).

This theorem is the main stepping stone to derive the central result of the article, which describes the asymptotic properties of the estimators K̂_0 and {Ĝ_k : 1 ≤ k ≤ K̂_0}.

Theorem 2. Under the conditions of Theorem 1, it holds that P(K̂_0 > K_0) = α + o(1) and P(K̂_0 < K_0) = o(1), implying that P(K̂_0 = K_0) = (1 − α) + o(1).
Theorem 2 holds true for any clustering algorithm with the consistency property (8). The next result shows that this property is fulfilled, for example, by the k-means algorithm from Section 3.3.

Simulation study
To explore the properties of the CluStErr method in a systematic way, we carry out a simulation study which splits into three main parts. The first part investigates the finite sample behavior of CluStErr, whereas the second part compares CluStErr with several alternative methods. The third and final part reports the results of some robustness checks concerning the estimation of the error variance σ² and the parameter κ. The simulation design is inspired by the analysis of the gene expression data in Section 5.3. It is based on model (7) from Section 2. The data vectors Y_i have the form Y_i = μ_i + ε_i with piecewise constant signal profiles μ_i. We set the number of clusters to K_0 = 10 and define the cluster-specific signal vectors m_k as block vectors composed of the sub-vectors 1 = (1, …, 1) and 0 = (0, …, 0) of length p/5; the first of them is m_1 = (1, 0, 0, 0, 0) in this block notation. A graphical illustration of the signal vectors m_k is provided in Figure 1. The error terms ε_ij are assumed to be i.i.d. normally distributed with mean 0 and variance σ². In the course of the simulation study, we consider different values of n, p, and σ² as well as different cluster sizes. To assess the noise level in the simulated data, we consider the ratios between the error variance σ² and the "variances" of the signals m_k. In particular, we define the noise-to-signal ratios NSR_k(σ²) = σ²/Var(m_k), where Var(m_k) denotes the empirical variance of the vector m_k. Since Var(m_k) ≈ 0.16 is the same for all k in our design, we obtain that NSR_k(σ²) = NSR(σ²) ≈ σ²/0.16 for all k.
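The simulation design above can be sketched in a few lines of code. This is a minimal illustration (in Python), assuming the 5-block layout of the signal vectors as described in the text; the values p = 30 and σ² = 0.16 correspond to the NSR = 1 setting.

```python
import numpy as np

# Sketch of the simulation design described above. The 5-block layout of the
# signal vectors is our reading of the text; values (p = 30, sigma^2 = 0.16)
# match the NSR = 1 setting.
p = 30                      # number of features, divisible by 5
block = p // 5              # each signal vector consists of 5 blocks
one, zero = np.ones(block), np.zeros(block)

# Example signal vector m_1 = (1, 0, 0, 0, 0) in block notation.
m1 = np.concatenate([one, zero, zero, zero, zero])

# Empirical variance: a {0,1}-vector with one active block out of five has
# variance 0.2 * 0.8 = 0.16, matching Var(m_k) ~ 0.16 in the text.
var_m1 = m1.var()

# Noise-to-signal ratio NSR(sigma^2) = sigma^2 / Var(m_k).
sigma2 = 0.16               # error variance for NSR = 1
nsr = sigma2 / var_m1       # = 1.0

# Simulated observations for one cluster: Y_i = m_1 + eps_i, eps_ij iid N(0, sigma^2).
rng = np.random.default_rng(0)
Y = m1 + rng.normal(0.0, np.sqrt(sigma2), size=(100, p))
```

Any signal vector with one active (or one inactive) block out of five has the same empirical variance 0.2 · 0.8 = 0.16, which is why the noise-to-signal ratio is identical across clusters in this design.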

Finite sample properties of CluStErr
In this part of the simulation study, we analyze a design with equally sized clusters and set the sample size to (n, p) = (1,000, 30). Three different noise-to-signal ratios NSR are considered, in particular NSR = 1, 1.5, and 2. Since σ² ≈ 0.16 NSR, the corresponding error variances amount to σ² ≈ 0.16, 0.25, and 0.32, respectively. The noise-to-signal ratio NSR = 1 mimics the noise level in the application on gene expression data from Section 5.3, where the estimated noise-to-signal ratios all lie between 0.6 and 1. The ratios NSR = 1.5 and NSR = 2 are used to investigate how the CluStErr method behaves when the noise level increases. We implement the CluStErr algorithm with α = 0.05 and the difference-based estimators σ̂²_pc and κ̂_pc from Section 3.4. For each of the three noise-to-signal ratios under consideration, we simulate B = 1,000 samples and compute the estimate K̂_0 for each sample.
The simulation results are presented in Figure 2. Each panel shows a histogram of the estimates K̂_0 for a specific noise-to-signal ratio. For the ratio level NSR = 1, the CluStErr method produces very accurate results: about 95% of the estimates are equal to the true value K_0 = 10, and most of the remaining estimates take the value 11. For the ratio level NSR = 1.5, the estimation results are also quite precise: most of the estimates take a value between 9 and 11, with around 55% of them being equal to the true value K_0 = 10. Only for the highest noise-to-signal ratio NSR = 2 are the estimation results less accurate. In this case, the noise level in the data is too high for the method to produce precise results. As one can see, the estimates have a strong downward bias, which can be explained as follows: when there is too much noise in the data, the test procedure on which the estimator K̂_0 is based does not have enough power to detect the alternative H_1: K < K_0. As a result, our repeated test procedure stops too soon, thus underestimating the true number of clusters. According to our theoretical results, the estimator K̂_0 allows for statistical error control in the following sense: it has the property that P(K̂_0 > K_0) = α + o(1) and P(K̂_0 < K_0) = o(1), implying that P(K̂_0 = K_0) = (1 − α) + o(1). Setting α to 0.05, we should thus observe that K̂_0 equals K_0 = 10 in approximately 95% of the simulations and overestimates K_0 in about 5% of them. Table 1 shows that this is indeed the case for the lowest noise-to-signal ratio NSR = 1. In this situation, the probability P(K̂_0 > K_0) of overestimating K_0 is around 5%, while the probability P(K̂_0 < K_0) of underestimating K_0 is 0%, implying that P(K̂_0 = K_0) is about 95%.
For the two higher ratio levels NSR = 1.5 and 2, in contrast, the estimated values of the probabilities P(K̂_0 < K_0), P(K̂_0 = K_0), and P(K̂_0 > K_0) do not accurately match the values predicted by the theory. This is due to the fact that the statistical error control of the CluStErr method is asymptotic in nature. Table 2 illustrates this fact by reporting the estimated values of the probabilities P(K̂_0 < K_0), P(K̂_0 = K_0), and P(K̂_0 > K_0) for the noise-to-signal ratio NSR = 1.5 and various sample sizes (n, p) = (1,000, 30), (1,500, 40), (2,000, 50), (2,500, 60), (3,000, 70). As one can clearly see, the estimated probabilities gradually approach the values predicted by the theory as the sample size increases.
To summarize, our simulations on the finite sample behavior of the CluStErr method indicate the following: (i) The method produces accurate estimates of K_0 as long as the noise level in the data is not too high. (ii) For sufficiently large sample sizes, it controls the probability of under- and overestimating K_0 as predicted by the theory. (iii) In smaller samples, the statistical error control is less precise.

TABLE 2 Estimates of the probabilities P(K̂_0 < K_0), P(K̂_0 = K_0), and P(K̂_0 > K_0) in the simulation scenario with NSR = 1.5 and five different sample sizes (n, p)
It is important to note that (iii) is not a major issue: even in situations where the error control is not very precise, the CluStErr method may still produce accurate estimates of K_0. This is illustrated by our simulations. Inspecting the histogram of Figure 2 with NSR = 1.5, for example, the estimated probability P(K̂_0 = K_0) is seen to be only around 55% rather than 95%. Nevertheless, 904 out of 1,000 estimates take a value between 9 and 11. Hence, in a large number of simulations, the CluStErr method yields a reasonable approximation to the true number of clusters. From a heuristic perspective, the CluStErr method can indeed be expected to produce satisfying estimation results even in smaller samples where the error control is not very precise. This becomes clear when regarding CluStErr as a thresholding procedure. For K = 1, 2, …, it checks whether the statistic Φ̂^[K] is below a certain threshold level q and stops as soon as this is the case. For this approach to work, it is crucial to pick the threshold level q appropriately. Our theoretical results suggest that the choice q = q(α) for a common significance level such as α = .05 should be appropriate. Of course, this choice guarantees precise error control only for sufficiently large sample sizes. Nevertheless, in smaller samples, the threshold level q = q(α) can still be expected to be of the right order of magnitude, thus resulting in reasonable estimates of K_0.
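The thresholding view of CluStErr can be sketched in a few lines. The helper `cluster_and_statistic` below is a hypothetical stand-in for the real pipeline (cluster the data into K groups and return the maximum statistic for that partition); only the stopping logic is illustrated here.

```python
# Minimal sketch of the repeated-test ("thresholding") view of CluStErr
# described above: increase K until the test statistic drops below the
# threshold q(alpha). `cluster_and_statistic` is a hypothetical stand-in for
# the real pipeline (cluster the data into K groups, return the maximum
# statistic for that partition).

def estimate_K0(data, q_alpha, K_max, cluster_and_statistic):
    """Return the smallest K whose statistic falls below the threshold."""
    for K in range(1, K_max + 1):
        if cluster_and_statistic(data, K) <= q_alpha:
            return K            # H_0: K = K_0 is not rejected -> stop
    return K_max                # fall back to the upper bound

# Toy usage: statistics that decay in K; the threshold is first crossed at K = 4.
toy_stats = {1: 9.0, 2: 6.5, 3: 4.2, 4: 1.1, 5: 0.9}
K_hat = estimate_K0(None, q_alpha=2.0, K_max=5,
                    cluster_and_statistic=lambda d, K: toy_stats[K])
# K_hat == 4
```

The quality of the estimate thus hinges entirely on the threshold q(α) being of the right order of magnitude, which is the point made above.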

Comparison of CluStErr with alternative methods
We now compare the CluStErr method to other criteria for selecting the number of clusters K_0, in particular to (i) the gap statistic (Tibshirani et al., 2001), (ii) the silhouette statistic (Rousseeuw, 1987), (iii) the Hartigan index (Hartigan, 1975), and (iv) a BIC based method for Gaussian mixtures (Fraley & Raftery, 2002). As before, we set the sample size to (n, p) = (1,000, 30) and consider the three noise-to-signal ratios NSR = 1, 1.5, and 2. In addition to a "balanced" scenario with clusters of the same size n/K_0 each, we also consider an "unbalanced" scenario with clusters of sizes 1 + 18k for k = 1, …, K_0. For each design, we simulate B = 100 samples and compare the estimated cluster numbers obtained from the CluStErr method with those produced by the alternative algorithms. The CluStErr estimates are computed as described in the first part of the simulation study. The gap, silhouette, and Hartigan statistics are implemented with a k-means algorithm as the underlying clustering method. To compute the values of the gap statistic, we employ the clusGap function contained in the R package cluster (Maechler, Rousseeuw, Struyf, & Hubert, 2016). The number of clusters is estimated by the function maxSE with the option Tibs2001SEmax. We thus determine the number of clusters as suggested in Tibshirani et al. (2001). To compute the silhouette and Hartigan statistics, we apply the R package NbClust (Charrad, Ghazzali, Boiteau, & Niknafs, 2015). The BIC based method for Gaussian mixtures (GM) is computed using the R package mclust (Fraley, Raftery, & Scrucca, 2019). The results of the comparison study are presented in Table 3. Part (a) of the table provides the results for the balanced scenario with equal cluster sizes. As can be seen, CluStErr clearly outperforms the gap, silhouette and Hartigan statistics in the setting with NSR = 1. In the scenario with NSR = 1.5, it also performs well in comparison with these three methods.
Only for the highest noise-to-signal ratio NSR = 2 does it produce estimates of K_0 with a strong downward bias; in this setting, it is outperformed by the silhouette and Hartigan statistics. The simulation results further show that the BIC method for Gaussian mixtures outperforms all other methods including CluStErr. This result is most likely due to the fact that the simulation setup under consideration is a Gaussian mixture model, so that methods tailored to this class of models (such as the BIC based method) can be expected to perform better than the more general CluStErr, gap, silhouette and Hartigan statistics.

TABLE 3 Results of the comparison study
Part (b) of Table 3 presents the results for the unbalanced scenario, where the clusters strongly differ in size. In this scenario, the gap, silhouette and Hartigan statistics substantially underestimate the number of clusters. CluStErr, in contrast, provides accurate estimates of K_0 in the two designs with NSR = 1 and NSR = 1.5. Only in the high-noise design with NSR = 2 does it produce estimates with a substantial downward bias, which is nevertheless much less pronounced than that of the gap, silhouette and Hartigan statistics. As can further be seen, CluStErr performs quite similarly to the BIC method for Gaussian mixtures. Hence, in the unbalanced setting, CluStErr is competitive with the BIC procedure, even though it is not specially constructed for Gaussian mixture models.
To summarize, the main findings of our comparison study are as follows: (i) Overall, CluStErr performs well in comparison with the three competing methods that are not specifically designed for Gaussian mixture models (that is, in comparison with the gap, silhouette and Hartigan statistics). On the other hand, as expected, it is outperformed by the BIC method for Gaussian mixtures. (ii) CluStErr outperforms the gap, silhouette and Hartigan statistics as long as the noise-to-signal ratio is not too high. It is, however, outperformed by some of them in the balanced scenario when the noise level is high. In Section 6, we discuss some modifications of the CluStErr method that are designed to improve its behavior in the case of high noise. (iii) CluStErr is able to deal with both balanced and unbalanced cluster sizes, whereas the gap, silhouette and Hartigan statistics perform less adequately in unbalanced settings.
The findings (ii) and (iii) can heuristically be explained as follows: The CluStErr method is based on the test statistic Φ̂^[K] = max_{1≤i≤n} Δ̂_i^[K], which is essentially the maximum over the residual sums of squares of the various individuals i. The gap, silhouette and Hartigan approaches, in contrast, are based on statistics which evaluate averages rather than maxima. Hartigan's rule, for instance, relies on a statistic which is essentially a scaled version of the ratio RSS(K)/RSS(K + 1), where RSS(K) is defined as in (20) and denotes the average residual sum of squares for a partition with K clusters. Averaging the residual sums of squares reduces the noise in the data much more strongly than taking the maximum. This is the reason why Hartigan's rule tends to perform better than the CluStErr method in a balanced setting with high noise. On the other hand, the average residual sum of squares hardly reacts to changes in the residual sums of squares of a few individuals that form a small cluster. Hence, small clusters are effectively ignored when taking the average of the residual sums of squares. This is the reason why Hartigan's statistic is not able to deal adequately with unbalanced settings. Taking the maximum of the residual sums of squares instead allows us to handle even highly unbalanced cluster sizes.
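The contrast between averaging and maximizing can be made concrete with a toy computation; the numbers below are invented purely for illustration.

```python
import numpy as np

# Toy numeric illustration of the heuristic above: a small, badly fitted
# cluster barely moves the *average* residual sum of squares but dominates
# the *maximum*. Numbers are invented for illustration.
rss = np.full(1000, 1.0)    # residual sums of squares of 1,000 subjects
rss[:10] = 25.0             # a small 10-subject cluster is merged incorrectly

avg = rss.mean()            # (10 * 25 + 990 * 1) / 1000 = 1.24 -> barely reacts
mx = rss.max()              # 25.0 -> reacts strongly
```

An average-based criterion sees almost no difference between the correct and the incorrect partition here, whereas a maximum-based criterion flags the misfit immediately.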

Robustness checks
In order to run CluStErr, one needs to estimate the error variance σ² (as well as the parameter κ). We now present the results of some additional simulations that were carried out to explore in more detail how CluStErr is affected by the estimation of the error variance σ² (and the parameter κ). To do so, we restrict attention to the balanced scenario with clusters of the same size, where we set the sample size to (n, p) = (1,000, 30) and consider the three noise-to-signal ratios NSR = 1, 1.5, and 2.
(i) So far, we have implemented CluStErr with the difference-based estimators σ̂²_pc and κ̂_pc, which are specifically designed for piecewise constant signals. Alternatively, we can use the residual-based estimators σ̂²_RSS and κ̂_RSS. Since these are not specifically constructed for piecewise constant signals, we expect CluStErr to produce somewhat less precise results when implemented with them. Table 4 shows how much precision is lost when CluStErr is run with the residual-based estimators σ̂²_RSS and κ̂_RSS, where we set K_max = 20, J_B = {5, 10, 15, …, p}, and J_A = {1, …, p}∖J_B. As can be seen, the loss of precision is fairly moderate, at least in the simulation scenario at hand.
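To give an idea of how a difference-based variance estimator exploits piecewise constant signals, here is an illustrative stand-in; the exact estimator σ̂²_pc of Section 3.4 is not reproduced here, and the function below is a standard first-difference construction under our stated assumptions.

```python
import numpy as np

# Illustrative stand-in for a difference-based variance estimator of the kind
# referred to above (the exact estimator sigma^2_pc of Section 3.4 is not
# reproduced here). For a piecewise constant signal, successive differences
# Y_j - Y_{j-1} equal eps_j - eps_{j-1} away from the few jump points, so
# sum_j (Y_j - Y_{j-1})^2 / (2(p - 1)) estimates sigma^2 up to an O(1/p) bias.

def diff_variance(y):
    d = np.diff(y)
    return np.sum(d ** 2) / (2 * (len(y) - 1))

rng = np.random.default_rng(1)
p, sigma2 = 1000, 0.25
signal = np.repeat([0.0, 1.0, 0.0, 1.0, 0.0], p // 5)   # piecewise constant
y = signal + rng.normal(0.0, np.sqrt(sigma2), size=p)
est = diff_variance(y)      # close to 0.25 (the 4 jumps add a bias of order 1/p)
```

Differencing removes the (locally constant) signal before the variance is computed, which is exactly why such estimators are tailored to piecewise constant profiles, while residual-based estimators are not.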
(ii) Throughout the article, we have assumed that the error variance σ² (as well as the parameter κ) is the same across groups. To investigate the sensitivity of CluStErr to violations of this assumption, we simulate data from the same setup as above, the only difference being that the error variances are group-specific. In particular, for each 1 ≤ k ≤ K_0, the error variance is σ²_k = c_k σ² for all i ∈ G_k, where σ² ≈ 0.16, 0.25, 0.32 as before and c_k = 0.5 + 0.05k for 1 ≤ k ≤ K_0. In this scenario, the error variances σ²_k range between 0.55σ² and σ², implying that the largest variance is about double the smallest one. Table 5 reports the estimation results and compares them with those obtained in the corresponding homogeneous setting. (Note: the format of the table is the same as that of Table 3.)

FIGURE 3 Analysis of the Berkeley Earth temperature data. The plot depicts the average land surface temperature curves at various example locations on earth.

As can be seen, CluStErr tends to overestimate the true number of groups K_0 much more strongly in the heterogeneous than in the homogeneous setting. This can be explained as follows: when the error variances are heterogeneous, the statistic σ̂² estimates a weighted average of them. Hence, it tends to lie between the smallest and the largest error variance, that is, between 0.55σ² and σ². This implies that the statistics Δ̂_i are not normalized appropriately. In particular, since σ̂² tends to be smaller than σ², the statistics Δ̂_i^[K_0] tend to be too large for subjects i in the group with the error variance σ². As a consequence, CluStErr tends to overestimate K_0. This tendency is the stronger, the more heterogeneous the error variances are. Hence, in applications where the noise level in the data is expected to be strongly heterogeneous across groups, one should work with an extended version of CluStErr which can handle group-specific error variances (see Remark 2).
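The normalization argument can be checked numerically. The sketch below mirrors the design values from the robustness check (c_k = 0.5 + 0.05k) and shows that a pooled variance estimate undershoots the largest group variance.

```python
import numpy as np

# Numeric illustration of the argument above: with group-specific variances
# sigma^2_k = c_k * sigma^2 (c_k = 0.5 + 0.05 k, as in the design), a pooled
# variance estimate lands between the smallest and the largest group variance,
# so residuals from the noisiest group look "too large" after normalization.
sigma2 = 0.25
c = 0.5 + 0.05 * np.arange(1, 11)           # c_1, ..., c_10
group_vars = c * sigma2                     # sigma^2_k, from 0.55*sigma^2 to sigma^2

pooled = group_vars.mean()                  # equal group sizes -> simple average
ratio_noisiest = group_vars.max() / pooled  # > 1: this group's statistics inflate
```

Statistics of the noisiest group are inflated by roughly this ratio after normalization with the pooled estimate, which is the mechanism behind the overestimation of K_0 described above.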

Clustering of temperature curves
Our first application is concerned with the analysis of a dataset on land surface temperatures that was collected by the investigators of the Berkeley Earth project (Rohde et al., 2013). The data, which are publicly available at http://berkeleyearth.org/data, contain measurements on a grid of worldwide locations that is defined on a one degree (longitude) by one degree (latitude) basis. For each grid point, the dataset contains a monthly land surface temperature profile. This profile is a vector with 12 entries, the first entry specifying the average temperature of all Januaries from 1951 to 1980, the second entry specifying the average temperature of all Februaries from 1951 to 1980, and so on. The temperature profiles at various example locations on earth are shown in Figure 3. As grid points containing 100% sea surface are not taken into account, the overall number of grid points is equal to n = 24,311. A detailed description of the derivation of the data can be found in Rohde et al. (2013). The aim of our analysis is to cluster the 24,311 grid points in order to obtain a set of climate regions characterized by distinct temperature profiles. For this purpose, we impose the time trend model (6) on the data and apply the CluStErr method to them, setting n = 24,311, p = 12, and α = .05. To estimate the error variance σ² and the normalization constant κ, we apply the difference-based estimators σ̂²_Lip and κ̂_Lip from Section 3.4, thus making use of the smoothness of the temperature curves illustrated in Figure 3.
The estimation results are presented in Figures 4 and 5. Figure 4 depicts the p-values p̂^[K] corresponding to the test statistics Φ̂^[K] for different numbers of clusters K. It shows that the p-value p̂^[K] remains below the α = .05 threshold for any K < 26 but jumps across this threshold for K = 26. The CluStErr algorithm thus estimates the number of clusters to be K̂_0 = 26, suggesting that there are 26 distinct climate regions. The sizes of the estimated clusters range between 244 and 2,110; the error variance is estimated to be σ̂² = 16.25. Figure 5 uses a spatial grid to visualize the 26 regions and demonstrates the plausibility of the obtained results. For example, mountain ranges such as the Himalayas and the South American Andes, but also tropical climates in Africa, South America, and Indonesia, are easily identified from the plot. Of note, the results presented in Figure 5 show a remarkable similarity to the most recent modification of the Köppen-Geiger classification, which is one of the most widely used classification systems in environmental research (Peel, Finlayson, & McMahon, 2007). In particular, the overall number of climate regions defined in Peel et al. (2007) is equal to 29, which is similar to the cluster number K̂_0 = 26 identified by the CluStErr algorithm. Thus, although our example is purely illustrative, and although expert classification systems account for additional characteristics such as precipitation and vegetation, Figure 5 confirms the usefulness of the CluStErr method in this setting.

FIGURE 4 Analysis of the Berkeley Earth temperature data. The plot depicts the p-values p̂^[K] corresponding to the test statistics Φ̂^[K] as a function of K. The horizontal dashed line specifies the significance level α = .05, and the vertical dashed line indicates that the estimated number of clusters is K̂_0 = 26.

FIGURE 5 Visualization of the K̂_0 = 26 clusters obtained from the analysis of the Berkeley Earth temperature data. Each shade of color refers to one cluster.

Clustering of gene expression data
Our second application is concerned with the analysis of gene expression data, which has become a powerful tool for the understanding of disease processes in biomedical research (Jiang et al., 2004). A popular approach to measure gene expression is to carry out microarray experiments. These experiments simultaneously quantify the expression levels of n genes across p samples of patients with different clinical conditions, such as tumor stages or disease subtypes. In the analysis of microarray data, clustering of the n genes is frequently used to detect genes with similar cellular function and to discover groups of "coexpressed" genes showing similar expression patterns across clinical conditions (Chipman, Hastie, & Tibshirani, 2003;D'haeseleer, 2005;Jiang et al., 2004).
In what follows, we analyze a set of gene expression data that was collected during the first stage of the Microarray Innovations in Leukemia (MILE) study (Haferlach et al., 2010). The dataset contains expression-level measurements for 20,172 genes and is publicly available as part of the Bioconductor package leukemiasEset (Aibar et al., 2013). The gene expression levels were measured using Affymetrix HG-U133 Plus 2.0 microarrays. For statistical analysis, the raw expression data were normalized using the Robust Multichip Average (RMA) method, followed by an additional genewise standardization of the expression levels. For details on data collection and preprocessing, we refer to Haferlach et al. (2010) and Aibar et al. (2013).
The data of the MILE study were obtained from p = 60 bone marrow samples of patients who were untreated at the time of diagnosis. Of these patients, 48 were diagnosed with either acute lymphoblastic leukemia (ALL, 12 patients), acute myeloid leukemia (AML, 12 patients), chronic lymphocytic leukemia (CLL, 12 patients), or chronic myeloid leukemia (CML, 12 patients). The other 12 samples were obtained from nonleukemia (NoL) patients. From a biomedical point of view, the main interest focuses on the set of "differentially expressed" genes, that is, on those genes that show a sufficient amount of variation in their expression levels across the five tissue types (ALL, AML, CLL, CML, NoL). To identify this set of genes, we run a univariate ANOVA for each gene and discard those with Bonferroni-corrected p-values ≥ .01 in the respective overall F-tests. Application of this procedure results in a sample of n = 3,167 univariately significant genes.
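The screening step above can be sketched as follows. The data are simulated here (the real study has 20,172 genes), and the helper names are our own; only the per-gene ANOVA with Bonferroni correction mirrors the procedure in the text.

```python
import numpy as np
from scipy.stats import f_oneway

# Sketch of the univariate screening step described above: a per-gene one-way
# ANOVA across the five tissue types, a Bonferroni correction, and a cutoff of
# .01 on the corrected p-values. Data are simulated here; group sizes mirror
# the 12 samples per tissue type.
rng = np.random.default_rng(2)
n_genes, n_per_group, n_groups = 200, 12, 5

expr = rng.normal(size=(n_genes, n_groups * n_per_group))
expr[:50, :n_per_group] += 3.0          # first 50 genes: shifted ALL samples

keep = []
for i in range(n_genes):
    groups = np.split(expr[i], n_groups)        # ALL, AML, CLL, CML, NoL
    _, pval = f_oneway(*groups)
    if pval * n_genes < 0.01:                   # Bonferroni-corrected cutoff
        keep.append(i)
# `keep` now consists (almost) entirely of genes from the first block of 50
```

The Bonferroni correction multiplies each p-value by the number of genes tested, so the effective per-gene threshold is very strict; this is what keeps the false-positive rate low despite thousands of simultaneous tests.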
The aim of our analysis is to cluster the n = 3,167 genes into groups whose members have similar expression patterns across the five tissue types (ALL, AML, CLL, CML, NoL). To do so, we impose model (7) from Section 2 on the data; the measured expression profiles Y_i = (Y_i1, …, Y_ip)^⊤ of the various genes i = 1, …, n are thus assumed to follow the corresponding model equation. The signal vectors μ_i are supposed to have a piecewise constant structure after the patients have been ordered according to their tissue type (ALL, AML, CLL, CML, NoL). For illustration, the expression profile Y_i of a randomly selected gene is plotted in Figure 6.
To cluster the genes, we apply the CluStErr algorithm with the significance level α = .05 and the difference-based estimators σ̂²_pc and κ̂_pc from Section 3.4, thus exploiting the piecewise constant structure of the signal vectors. The estimation results are presented in Figures 7 and 8. The plot in Figure 7 depicts the p-values p̂^[K] corresponding to the test statistics Φ̂^[K] as a function of the cluster number K. It shows that the estimated number of clusters is K̂_0 = 14. The estimated sizes of the 14 clusters range between 58 and 469. Moreover, the estimated error variance is σ̂² = 0.442.
Inspecting Figure 7, the p-values p̂^[K] can be seen to be fairly small for any K < 14, which suggests that the structure in the data is not well described by a model with K < 14 clusters. A first larger upward jump occurs at K = 14, indicating that the data are much better described by a model with K = 14 clusters. This large jump or "elbow" in the p-value plot indicates that K̂_0 = 14 is a reasonable estimate of the number of clusters. The p-value plot thus lends credibility to the estimate K̂_0 = 14. Generally speaking, the p-value plot is a useful diagnostic tool in practice.

FIGURE 6 Analysis of the MILE study gene expression data. The plot depicts the expression levels of a randomly selected gene after normalization and standardization. The label of the gene ("ENSG00000002834") refers to its Ensembl gene ID to which the original Affymetrix probesets were mapped (Aibar, Fontanillo, & De Las Rivas, 2013). Horizontal lines represent the average gene expression levels across the five tissue types.
In Figure 8, the cluster centers m̂_k = (#Ĝ_k)^{−1} Σ_{i∈Ĝ_k} Y_i are presented, which estimate the cluster-specific signal vectors m_k = (#G_k)^{−1} Σ_{i∈G_k} μ_i. All clusters show a distinct separation of at least one tissue type, supporting the assumption of piecewise constant signals m_k and indicating that the genes contained in the clusters are coexpressed differently across the five tissue types. For example, cluster #2 separates CML and NoL samples from ALL, AML, and CLL samples, whereas cluster #4 separates CLL samples from the other tissue types. Thus, each of the 14 clusters represents a specific pattern of coexpressed gene profiles.

FIGURE 8 Visualization of the cluster centers obtained from the analysis of the MILE study gene expression data. The dots represent the cluster centers m̂_k = (#Ĝ_k)^{−1} Σ_{i∈Ĝ_k} Y_i, which estimate the cluster-specific signal vectors m_k = (#G_k)^{−1} Σ_{i∈G_k} μ_i. Gene ENSG00000002834, whose expression profile is visualized in Figure 6, is an element of cluster #13. Note the similarity of the patterns in cluster #13 and Figure 6.

EXTENSIONS
In this article, we have developed an approach for estimating the number of clusters with statistical error control. We have derived rigorous statistical theory for a simple model with convex spherical clusters. An interesting question is how to extend our ideas to the case of general, potentially nonconvex clusters. Developing theory for this general case, however, is a very challenging problem and goes beyond the scope of this article. In what follows, we discuss some modifications and extensions of our methods within the limits of the model setting (3)-(4). So far, we have based our estimation methods on the maximum statistic Φ̂^[K] = max_{1≤i≤n} Δ̂_i^[K]. However, we are not bound to this choice. Our approach can be based on any test statistic that fulfills the higher order property (11). The maximum statistic serves as a baseline, which may be modified and improved in several directions. The building blocks of the maximum statistic are the individual statistics Δ̂_i^[K]. Their stochastic behavior has been analyzed in detail in Section 3.2. Under the null hypothesis H_0: K = K_0, the statistics Δ̂_i^[K_0] remain stochastically bounded, and this needs to be checked simultaneously for all subjects i = 1, …, n. We are thus faced with a multiple testing problem. A maximum statistic is a classical tool to tackle this problem. However, as is well known from the field of multiple testing, maximum statistics tend to be fairly conservative. When the noise level in the data is high, a test based on the maximum statistic Φ̂^[K] = max_{1≤i≤n} Δ̂_i^[K] can thus be expected to have low power against the alternative H_1: K < K_0. As a consequence, the repeated test procedure on which the estimator K̂_0 is based tends to stop too soon, thus underestimating the true number of clusters. This is exactly what we have seen in the high-noise scenarios of the simulation study from Section 5.1. We now present two ways to construct a statistic with better power properties.
A blocked maximum statistic. To start with, we order the indices {1, …, n} clusterwise. Let w_0 = 0 and define w_k = Σ_{r=1}^{k} #Ĝ_r^[K] for 1 ≤ k ≤ K, and write Ĝ_k^[K] = {i_{w_{k−1}+1}, …, i_{w_k}} with i_{w_{k−1}+1} < · · · < i_{w_k} for any k. This yields the clusterwise ordering {i_1, i_2, …, i_n} of the indices. We next partition the ordered indices into blocks of length N, where L = ⌈n/N⌉ is the number of blocks. With this notation at hand, we construct blockwise averages Λ̂_l^[K] of the individual statistics Δ̂_i^[K] and define the blocked statistic Φ̂_B^[K] as the maximum over these averages. In addition, we let q_B(α) be the (1 − α)-quantile of the corresponding null distribution. The maximum statistic Φ̂^[K] arises as a special case with the block length N = 1. Under appropriate restrictions on the block length N, the estimators that result from applying the CluStErr method with the blocked statistic Φ̂_B^[K] can be shown to have the theoretical properties stated in Theorems 1-3. More specifically, Theorems 1-3 can be shown to hold true for the blocked statistic Φ̂_B^[K] if the following two restrictions are satisfied: (i) N/p^{1−δ} = O(1) for some small δ > 0, that is, the block length N diverges more slowly than p. (ii) #G_k/n → c_k > 0 for all k, that is, the cluster sizes #G_k all grow at the same rate. Condition (ii) essentially rules out strongly differing cluster sizes. It is not surprising that we require such a restriction: to construct the blocked statistic Φ̂_B^[K], we average over the individual statistics Δ̂_i^[K]. As already discussed in the context of the simulation study of Section 5.1, averaging has the effect that small clusters are effectively ignored. Hence, in contrast to the maximum statistic Φ̂^[K] = max_{1≤i≤n} Δ̂_i^[K], the blocked statistic Φ̂_B^[K] with a large block size N can be expected not to perform adequately when the cluster sizes are highly unbalanced.
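The construction of the blocked statistic can be sketched as follows; the function below is a simplified stand-in (the real procedure applies it to the clusterwise-ordered statistics Δ̂_i^[K] and calibrates the quantile q_B(α), which is not shown here).

```python
import numpy as np

# Sketch of the blocked maximum statistic described above: order the individual
# statistics clusterwise, average them over blocks of length N, and take the
# maximum of the blockwise averages. N = 1 recovers the plain maximum statistic.

def blocked_max(delta_ordered, N):
    """delta_ordered: individual statistics after clusterwise ordering."""
    n = len(delta_ordered)
    L = -(-n // N)                                # number of blocks, ceil(n / N)
    blocks = [delta_ordered[l * N:(l + 1) * N] for l in range(L)]
    return max(float(np.mean(b)) for b in blocks)

# Toy example: the last two subjects form a badly fitted cluster.
delta = np.array([0.2, 0.4, 0.1, 0.3, 5.0, 4.0])
plain = blocked_max(delta, 1)       # = 5.0, the ordinary maximum statistic
stat = blocked_max(delta, 2)        # block averages 0.3, 0.2, 4.5 -> 4.5
```

Because subjects from the same cluster are adjacent after the clusterwise ordering, each block averages statistics with comparable signal, which reduces the noise without washing out a genuinely misfitting cluster, provided the clusters are not much smaller than the block length.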
In balanced settings, however, the blocked statistic Φ̂_B^[K] can be shown to have better power properties than the maximum statistic when the block size N is chosen sufficiently large. To see this, we examine the behavior of Φ̂_B^[K] for different block lengths N. Our heuristic discussion of the individual statistics Δ̂_i^[K] from Section 3.2 directly carries over to the blocked versions Λ̂_l^[K]: with the help of (17), it is easy to see that P(Φ̂_B^[K_0] ≤ q_B(α)) ≈ (1 − α).
Moreover, (18) together with some additional arguments suggests that Φ̂_B^[K] has an explosive behavior for K < K_0. Specifically, under our technical conditions from Section 4.1 and the two additional restrictions (i) and (ii) from above, we can show that Φ̂_B^[K] ≥ c√(Np) for some c > 0 with probability tending to 1.
As the quantile q_B(α) grows at the slower rate √(log L) (≤ √(log n)), we can conclude that P(Φ̂_B^[K] ≤ q_B(α)) = o(1) for K < K_0. As a result, Φ̂_B^[K] should satisfy the higher order property (11). Moreover, according to (24), the statistic Φ̂_B^[K] explodes at the rate √(Np) for K < K_0. Hence, the faster the block size N grows, the faster Φ̂_B^[K] diverges to infinity. Put differently, the larger N, the more power we have against the alternative H_1: K < K_0.