Supervised convex clustering

Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real‐world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.


Introduction
Clustering is an unsupervised learning approach that seeks to find groups of objects which are similar to each other. Despite successes in applying clustering in many fields such as genomics, online advertising, and text mining, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to its unsupervised nature.
Currently in practice, most people cluster data in a fully unsupervised manner and then interpret the clustering results via some outcomes of interest or other meta-data that help to validate the clusters. We call these "supervising auxiliary variables". Our goal is to use the supervising auxiliary variables as part of the clustering procedure itself to help guide towards finding more accurate and interpretable clusters.
Let us consider our motivating case study on the clinical genomics of Alzheimer's Disease (AD), which will be discussed in more detail in Section 4. All individuals in this case study experience cognitive decline as they age, yet cognitive abilities of some subjects decline at a much faster rate than others and there is a large degree of heterogeneity in cognitive skills.
Understanding such heterogeneity in cognitive decline can better elucidate the underlying genetic mechanisms responsible for AD and other dementias. However, apart from a handful of well-known genes such as APOE, little is known about the genomics of AD and of cognitive decline in older adults. It is common for people to study this by clustering the subjects and validating the results using additional information such as the clinical diagnosis or cognitive test scores. However, compelling evidence of genetic subtypes for AD and cognitive decline has not been found. We propose to use the additional meta information such as the cognitive test scores or clinical diagnosis directly to help us find better and more interpretable clusters, and hence, shed new light on the genetic basis responsible for the onset of AD and dementia. In many other genomics studies, additional clinical information or survival times are available and can thus be used as supervising auxiliary variables to guide clustering.
Yet, making use of the supervising auxiliary variables to help guide towards finding groups presents several major challenges. First, due to human subjectivity and measurement errors, these clinical outcomes are noisy and thus cannot be fully trusted as ground-truth outcomes or labels. Specifically, we do not know how much supervision we should get from these noisy supervising auxiliary variables when clustering unlabeled data. Second, these clinical outcomes can be of different data types. For example, diagnostic opinions assigned by clinicians can be categorical while survival time is censored data.
To address these challenges, in this paper, we seek to leverage information from both supervising auxiliary variables, usually of different types, and unlabeled data to uncover more scientifically interpretable group structures that may be hidden in completely unsupervised analyses of data. Our approach is distinct from supervised learning, which treats these outcomes as ground truth to make scientific discoveries, as supervised approaches fail to exploit the unlabeled data to uncover group structures. In addition, these supervising auxiliary variables differ from the outcomes or labels in supervised learning in that they are largely noisy and thus cannot be fully trusted. Our method is also distinct from unsupervised approaches, as we make better use of these potentially meaningful supervising auxiliary variables to understand the true underlying group structures. Although these supervising auxiliary variables are not true labels or outcomes themselves, they have a loose relationship with the group structure in the data and hence indicate some form of observed heterogeneity of the unlabeled data.
Though supervised clustering has not been widely studied, there is a plethora of literature on semi-supervised clustering specific to the nature of the outcome variables. For partially labeled data, Basu et al. (2002) proposed to modify the objective function of k-means to compute initial cluster centroids by incorporating such labels. In this case, labeled observations are always assigned to their known cluster. However, such approaches usually assume the labels are perfect and require prior knowledge of the total number of clusters. In other scenarios, people incorporate prior information on pairwise (must-link or cannot-link) constraints that dictate whether two data points must be clustered in the same group or not.
To take those constraints into account, Basu et al. (2004), Xing et al. (2003), and Bar-Hillel et al. (2003) modified the objective function of existing clustering methods or the distance metric in distance-based clustering methods. Still, those methods require that our presumed knowledge of the constraints is correct.
On the other hand, many have tried to improve the interpretability of clustering results by only using features related to some supervised outcome. Specifically, Bair and Tibshirani (2004), Koestler et al. (2010), and Gaynor and Bair (2013) discarded or down-weighted unimportant features via univariate filtering against the outcome and then performed clustering on the meaningful features. However, such methods might neglect features that are weakly associated with the outcome variable but differ across clusters. Above all, these semi-supervised clustering approaches make full use of the noisy outcome variable without adjustment. Yet, our goal is to leverage information from both supervising auxiliary variables, which may be imperfect, and unlabeled data to obtain more interpretable group structures.
Another line of work focuses on semi-supervised classification, which uses both labeled and unlabeled data to improve the performance of classifiers. The semi-supervised SVM minimizes the objective function by examining all possible label combinations of unlabeled data points and then finds low-density regions that the decision boundary could pass through (Chapelle et al. 2006, 2007; Yuille and Rangarajan 2003). The cluster-then-label techniques first find clusters of high-density regions in the data space by clustering; a standard supervised learner is then applied to find a separating decision boundary that passes through the low-density regions (Chapelle and Zien 2005; Gan et al. 2013). A somewhat related line of work proposes classification methods that can handle noisy or missing labels (Bi and Kwok 2014). Angluin and Laird (1988) proposed the random classification noise model, which assumes each label is flipped independently with some probability less than 0.5. Other popular approaches include using losses that are robust to the presence of noisy labels, such as the 0-1 loss (Manwani and Sastry 2013), or modifying surrogate loss functions that approximate the 0-1 loss via a convex function (Natarajan et al. 2013). We refer the reader to the survey by Frénay and Verleysen (2013). Our method is fundamentally different from both semi-supervised classification and classification with noisy labels in that the major task of those methods is supervised learning (prediction), while our goal is to find groups using the supervising auxiliary variables.
We seek to develop a unified, convex formulation of supervised clustering based on the increasingly popular convex clustering methods. Pelckmans et al. (2005), Lindsten et al. (2011), and Hocking et al. (2011) studied a convex formulation of clustering that achieves agglomeration through a convex fusion penalty. Due to this convex formulation, it enjoys nice statistical properties such as global optimal solutions, stability of solutions to small perturbations of the data (Pelckmans et al. 2005; Chi et al. 2017), and statistical consistency (Radchenko and Mukherjee 2017; Tan and Witten 2015). Recently, to address the expensive computation of convex clustering, Chi and Lange (2015) and Weylandt et al. (2019) developed fast and efficient algorithms that solve the convex clustering problem and yield full regularization paths. Further, convex clustering has been extended to many applications such as convex biclustering (Chi et al. 2017), which clusters observations and features simultaneously; closely related to our work, Wang and Allen (2019) recently adopted the convex clustering approach to perform integrative clustering for high-dimensional mixed, multi-view data.
In this paper, we propose and develop a new statistical pattern discovery method named Supervised Convex Clustering (SCC) that borrows strength from both the unlabeled data and supervising auxiliary variables to find more interpretable patterns. Specifically, we develop an optimization problem defining our method that consists of three parts: an unsupervised loss for the unlabeled data, a supervised loss that incorporates the supervising auxiliary variable, and a joint convex fusion penalty that forces the group structure of the unlabeled data and the supervising auxiliary variable to be the same. Our method, to the best of our knowledge, is the first to perform supervised clustering that directly uses supervising auxiliary variables to help cluster unlabeled data.

Supervised Convex Clustering Method
In this section, we propose and develop our supervised convex clustering method for different types of supervising auxiliary variables. Then we discuss some practical considerations for applying our method and develop an adaptive approach to adjust for additional covariates.

General Model & Formulation
Let $(y_i, X_{i\cdot})$ denote the pair of supervising auxiliary variable $y_i$ and feature vector $X_{i\cdot} \in \mathbb{R}^p$ for the $i$th observation, $i \in \{1, \ldots, n\}$. Let $(\theta_i, U_{i\cdot})$ denote the corresponding pair of supervising auxiliary variable centroid $\theta_i$ and data centroid $U_{i\cdot} \in \mathbb{R}^p$. Let $Z_{i\cdot} \in \mathbb{R}^d$ denote the additional covariates associated with the supervising auxiliary variable for the $i$th observation and $\beta \in \mathbb{R}^d$ the corresponding vector of coefficients. Define $g(\cdot)$ to be the appropriate link function for a Generalized Linear Model (GLM) whose exact form depends on the data type of the supervising auxiliary variable $y$ (continuous, skewed-continuous, binary, and count-valued, among others). For example, if $y$ is count-valued data, $g(\cdot)$ can be the log link. Define $C: i \mapsto k$ to be a function which maps from the observation indices $i \in \{1, \ldots, n\}$ to cluster labels $k \in \{1, \ldots, K\}$. Then for $i \in \{1, \ldots, n\}$, we consider the following data-generating model:
$$X_{i\cdot} = U^*_{C(i)\cdot} + \varepsilon_{i\cdot}, \qquad g\bigl(\mathbb{E}[y_i \mid Z_{i\cdot}]\bigr) = \theta^*_{C(i)} + Z_{i\cdot}\beta .$$
This model assumes the unlabeled data follows a group mean (centroid) plus noise model. The supervising auxiliary variable, adjusted for covariates $Z_{i\cdot}$, follows a GLM whose mean has the same group structure as the unlabeled data.
We propose to fit this model by formulating a convex optimization problem based on convex clustering. Let $\ell(\cdot)$ be the negative log-likelihood or loss function for the particular generalized linear model associated with $g(\cdot)$. Our Supervised Convex Clustering method is hence the solution to the following optimization problem:
$$\underset{U,\,\theta,\,\beta}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i=1}^{n} \ell\bigl(y_i;\,\theta_i + Z_{i\cdot}\beta\bigr) + \lambda \sum_{i < j} w_{ij} \bigl\|(\theta_i, U_{i\cdot}) - (\theta_j, U_{j\cdot})\bigr\|_2 . \qquad (1)$$
Here, $\pi_X$ and $\pi_y$ are fixed inputs specified by the user in advance; $\lambda$ is a non-negative tuning parameter; and the $w_{ij}$ are non-negative, user-specified fixed inputs.
Our optimization problem can be thought of as an extension of convex clustering that incorporates supervised data. One way to interpret this is that we have a loss function for the unlabeled data, a loss function for the supervising auxiliary variable, and a new joint convex fusion penalty that connects the supervised and unsupervised parts. Specifically, we employ a joint group-lasso fusion penalty on the concatenated centroid $[\theta\;U]$ that forces the group structure (cluster assignment) of the $i$th row of $U$ to be the same as that of $\theta$. Similar to convex clustering, our joint group-lasso-type fusion penalty encourages the differences in the rows to be shrunk towards zero, inducing a clustering behavior. Here, $\lambda$ is a non-negative tuning parameter which regulates both the cluster assignment and the number of clusters. When $\lambda$ equals zero, each observation forms its own cluster centroid. As $\lambda$ increases, the fusion penalty encourages the rows of the concatenated centroid $[\theta\;U]$ to merge together, forming clusters. We say that subjects with the same centroids belong to the same cluster; that is, $X_{i\cdot}$ and $X_{j\cdot}$ have the same cluster membership if $U_{i\cdot} = U_{j\cdot}$ and $\theta_i = \theta_j$. When $\lambda$ is sufficiently large, all the rows of the concatenated centroid $[\theta\;U]$ coalesce to a single cluster centroid. Our joint fusion penalty is novel as it puts together the centroids for both the unlabeled data and the supervising auxiliary variable, forcing them to have the same group structure. In this way, our method borrows strength from both information sources and yields the same cluster assignment for similar observations. The weight $w_{ij}$, which encodes pairwise affinity, is a user-specified input which will be discussed in detail in Section 2.7.
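To make the structure of problem (1) concrete, below is a minimal numerical sketch of the objective for a Gaussian supervising auxiliary variable without covariates. This is our illustration rather than a reference implementation, and the function and argument names are ours.

```python
import numpy as np

def scc_objective(X, U, theta, y, weights, lam, pi_X=1.0, pi_y=1.0):
    """Evaluate the SCC objective (1) for a Gaussian supervising variable.

    X: (n, p) unlabeled data; U: (n, p) data centroids;
    theta: (n,) auxiliary-variable centroids; y: (n,) auxiliary variable;
    weights: dict mapping index pairs (i, j) to nonnegative fusion weights w_ij.
    """
    unsup = 0.5 * pi_X * np.sum((X - U) ** 2)        # squared-error loss for X
    sup = 0.5 * pi_y * np.sum((y - theta) ** 2)      # Gaussian NLL, up to constants
    A = np.column_stack([theta, U])                  # concatenated centroids [theta U]
    fusion = sum(w_ij * np.linalg.norm(A[i] - A[j])  # joint group-lasso fusion penalty
                 for (i, j), w_ij in weights.items())
    return unsup + sup + lam * fusion
```

Minimizing this over $(U, \theta)$ while sweeping $\lambda$ upward from zero traces out the cluster path described above.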
The two loss functions, for the unsupervised and supervised parts respectively, are weighted by $\pi_X$ and $\pi_y$, which are deterministic parameters for fixed data. The user can directly specify these weights according to how much they want to weight the unsupervised and supervised parts. But if one wants these to be completely data-driven hyperparameters, we have found that setting $\pi_X$ and $\pi_y$ to be inversely proportional to the null deviance evaluated at the loss-specific center, i.e., $\pi_X = 1/\|X - \bar{X}\|_F^2$ and $\pi_y = 1/\sum_{i=1}^{n} \ell(y_i;\,\tilde{y})$, performs well in practice. Here $\tilde{y}$ denotes the loss-specific center for the loss $\ell(\cdot)$, as discussed in Wang and Allen (2019), and $\bar{X}$ denotes the column-wise mean of $X$. We use such $\pi$'s so that the two losses are evaluated on the same scale in the objective function. If we remove the second term in the objective function ($\pi_y = 0$), we get a fully unsupervised method; similarly, if we remove the first term ($\pi_X = 0$), we perform convex clustering on the supervising auxiliary variable alone.
We employ different loss functions to account for unlabeled data and supervising auxiliary variables of different data types. Notice that here we assume the data matrix $X$ follows a Gaussian distribution and use squared Euclidean distances as the loss function, but one could easily generalize this to any convex loss for non-Gaussian data $X$. The general loss $\ell(\cdot)$ is a convex function whose specific form depends on the data type of $y$. For example, $\ell$ can be the negative log-likelihood of any common generalized linear model such as the Gaussian, logistic, log-linear (Poisson), or negative binomial model. For a supervising auxiliary variable which is categorical or survival data, the general form above does not apply and we need some minor changes to the formulation. We specify these in the next subsections as special cases.
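For concreteness, standard negative log-likelihoods that can play the role of $\ell$ include the following (written up to additive constants, with linear predictor $\eta_i = \theta_i + Z_{i\cdot}\beta$):

```latex
\ell(y_i;\,\eta_i) =
\begin{cases}
\tfrac{1}{2}\,(y_i - \eta_i)^2 & \text{Gaussian (identity link)},\\[2pt]
-\,y_i\,\eta_i + \log\bigl(1 + e^{\eta_i}\bigr) & \text{Bernoulli (logit link)},\\[2pt]
-\,y_i\,\eta_i + e^{\eta_i} & \text{Poisson (log link)}.
\end{cases}
```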

Special Case: Categorical Supervising Auxiliary Variable
We model a categorical supervising auxiliary variable with $K$ classes using the multinomial loss. To facilitate this, we first transform the supervising auxiliary variable into dummy variables $Y \in \mathbb{R}^{n \times K}$, where $Y_{ik} = 1$ if subject $i$ belongs to the $k$th class and $Y_{ik} = 0$ otherwise.

By construction, we employ the negative log-likelihood of the multinomial distribution as the loss $\ell(\cdot)$ and similarly denote $\Theta \in \mathbb{R}^{n \times K}$ as the centroid matrix for the supervising auxiliary variable $Y$; this gives supervised convex clustering for categorical supervising auxiliary variables:
$$\underset{U,\,\Theta,\,\beta_1,\ldots,\beta_K}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i=1}^{n} \ell\bigl(Y_{i\cdot};\,\Theta_{i\cdot} + Z_{i\cdot}B\bigr) + \lambda \sum_{i<j} w_{ij} \bigl\|(\Theta_{i\cdot}, U_{i\cdot}) - (\Theta_{j\cdot}, U_{j\cdot})\bigr\|_2, \quad \text{subject to } \sum_{k=1}^{K}\beta_k = 0,$$
where $B = [\beta_1 \cdots \beta_K]$ collects the class-specific coefficient vectors. We enforce a joint fusion penalty on the rows of the concatenated centroid $[\Theta\;U]$ to yield a shared group structure between the two sources. Note that, as in the regular multinomial regression problem, this parameterization of $\beta_k$ is not identifiable, as the value of the objective function would not change if we replaced $\beta_k$ with $\beta_k + c$ for all $k$. To address this issue, we add the constraint $\sum_{k=1}^{K}\beta_k = 0$, as discussed in Zhu and Hastie (2004).
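For reference, the multinomial negative log-likelihood for subject $i$ takes the usual softmax form (our rendering, with class-$k$ linear predictor $\eta_{ik} = \Theta_{ik} + Z_{i\cdot}\beta_k$):

```latex
\ell(Y_{i\cdot};\,\eta_{i\cdot}) \;=\; -\sum_{k=1}^{K} Y_{ik}\,\eta_{ik} \;+\; \log \sum_{k=1}^{K} e^{\eta_{ik}},
\qquad \text{with } \sum_{k=1}^{K}\beta_k = 0 .
```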

Special Case: Censored Survival Time as Supervising Auxiliary Variable
Following the Cox Proportional Hazards (CPH) model, suppose we observe survival data $(X_{i\cdot}, \delta_i, t_i)$, where $\delta_i$ denotes the censoring indicator and $t_i$ refers to the observed (possibly censored) survival time.
For each $i = 1, \ldots, n$, denote $R_{t_i}$ as the risk set of individuals who are alive and in the study at time $t_i$. All survival times are assumed to be unique; for tied times, we use Breslow's approximation. The supervised convex clustering problem can be formulated as follows:
$$\underset{U,\,\theta}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i:\,\delta_i = 1} \Bigl[-\theta_i + \log \sum_{j \in R_{t_i}} \exp(\theta_j)\Bigr] + \lambda \sum_{i<j} w_{ij} \bigl\|(\theta_i, U_{i\cdot}) - (\theta_j, U_{j\cdot})\bigr\|_2,$$
where the second term is the negative Cox partial log-likelihood. Here, we interpret the supervising auxiliary variable centroid $\theta_i$ as (the log of) the relative hazard rate for the $i$th subject. A larger $\theta_i$ indicates that the event is more likely to be observed for that subject.
Hence, one interpretation of supervised convex clustering is that we are finding groups that have different hazard rates for survival.

Supervised Convex Biclustering
To allow for grouping observations and features simultaneously, we extend our method to supervised convex biclustering based on the approach discussed by Chi et al. (2017). The supervised convex biclustering problem can be formulated as follows:
$$\underset{U,\,\theta,\,\beta}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i=1}^{n} \ell\bigl(y_i;\,\theta_i + Z_{i\cdot}\beta\bigr) + \lambda \Bigl[\sum_{i<i'} w_{ii'} \bigl\|(\theta_i, U_{i\cdot}) - (\theta_{i'}, U_{i'\cdot})\bigr\|_2 + \sum_{j<j'} \tilde{w}_{jj'} \bigl\|U_{\cdot j} - U_{\cdot j'}\bigr\|_2\Bigr].$$
The row-wise fusion penalty fuses the rows of the concatenated centroid $[\theta\;U]$ while the column-wise fusion penalty fuses the columns of $U$. Note that we fuse observations based on both the data matrix and the supervising auxiliary variable while clustering only the features in the data matrix. For the choice of the weights $w_{ii'}$ and $\tilde{w}_{jj'}$, we refer the reader to Chi et al. (2017); we will also discuss this in detail in Section 2.7. The two penalties jointly achieve a checkerboard pattern that illustrates the associations between groups of subjects, guided by the supervising auxiliary variable, and groups of features that distinguish the subjects.

Doubly-supervised Convex Biclustering
In some cases, we observe meta-data or supervising auxiliary information for both the rows (subjects) and the columns (features) of the unlabeled data. Suppose, besides the supervising auxiliary variable for the $i$th subject, $y_i$, we observe another supervising auxiliary variable for the $j$th feature, denoted $\tilde{y}_j$. Denote $\tilde{\theta} \in \mathbb{R}^p$ as the cluster centroid for the supervising auxiliary variable $\tilde{y} \in \mathbb{R}^p$; denote $\tilde{Z} \in \mathbb{R}^{p \times \tilde{d}}$ as the additional covariates associated with the outcome $\tilde{y}$, and denote $\tilde{\beta} \in \mathbb{R}^{\tilde{d}}$ as the corresponding vector of coefficients. Let $\tilde{\ell}(\cdot)$ be the negative log-likelihood or loss function for the supervising auxiliary variable $\tilde{y}$, and let $\pi_{\tilde{y}}$ be a fixed input. The doubly-supervised convex clustering problem can be formulated as follows:
$$\underset{U,\,\theta,\,\tilde{\theta},\,\beta,\,\tilde{\beta}}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i=1}^{n} \ell\bigl(y_i;\,\theta_i + Z_{i\cdot}\beta\bigr) + \pi_{\tilde{y}} \sum_{j=1}^{p} \tilde{\ell}\bigl(\tilde{y}_j;\,\tilde{\theta}_j + \tilde{Z}_{j\cdot}\tilde{\beta}\bigr) + \lambda \Bigl[\sum_{i<i'} w_{ii'} \bigl\|(\theta_i, U_{i\cdot}) - (\theta_{i'}, U_{i'\cdot})\bigr\|_2 + \sum_{j<j'} \tilde{w}_{jj'} \bigl\|(\tilde{\theta}_j, U_{\cdot j}) - (\tilde{\theta}_{j'}, U_{\cdot j'})\bigr\|_2\Bigr].$$
Our doubly-supervised convex clustering can be interpreted as performing supervised clustering on both the subjects and the features of the unlabeled data, directly using the two sources of supervising auxiliary variables.

Algorithm
In this subsection, we propose an algorithm to solve our supervised convex clustering problem.
Since there are more than two separable blocks of functions in our problem, the most common approach is to use multi-block ADMM (Lin et al. 2015; Deng et al. 2017), which decomposes the original problem into several smaller and easier sub-problems.
Denote $D \in \mathbb{R}^{|E| \times n}$ as the directed difference matrix corresponding to the non-zero fusion weights. We can recast the supervised convex clustering problem (1) as the equivalent constrained optimization problem
$$\underset{U,\,\theta,\,\beta,\,V}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i=1}^{n} \ell\bigl(y_i;\,\theta_i + Z_{i\cdot}\beta\bigr) + \lambda \sum_{e \in E} w_e \|V_{e\cdot}\|_2 \quad \text{subject to } D[\theta\;U] - V = 0.$$
To express the constraints in matrix form, we concatenate the supervising auxiliary variable centroid vector $\theta \in \mathbb{R}^n$ and the data centroid matrix $U \in \mathbb{R}^{n \times p}$ into the aggregated centroid matrix $[\theta\;U] \in \mathbb{R}^{n \times (p+1)}$. We then introduce an auxiliary variable $V \in \mathbb{R}^{|E| \times (p+1)}$ containing the pairwise differences between connected rows of the aggregated centroid matrix $[\theta\;U]$. The constraints can now be written as $D[\theta\;U] - V = 0$.
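The multi-block ADMM updates are then derived from the augmented Lagrangian of this constrained problem. One way to write it, in our notation, with dual variable $Q$ and ADMM parameter $\rho > 0$, is:

```latex
\mathcal{L}_\rho(\theta, U, \beta, V, Q)
= \frac{\pi_X}{2}\|X - U\|_F^2
+ \pi_y \sum_{i=1}^{n} \ell\bigl(y_i;\,\theta_i + Z_{i\cdot}\beta\bigr)
+ \lambda \sum_{e \in E} w_e \|V_{e\cdot}\|_2
+ \bigl\langle Q,\, D[\theta\; U] - V \bigr\rangle
+ \frac{\rho}{2}\bigl\|D[\theta\; U] - V\bigr\|_F^2 ,
```

and the algorithm cycles through minimizing $\mathcal{L}_\rho$ in each primal block before a dual ascent step on $Q$.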
Proposition 1. Algorithm 1 converges to a global solution of problem (1); in addition, if $\ell$ is strictly convex, it converges to the unique global solution.
Proposition 1 is an extension of Theorem 4 in Wang and Allen (2019) and guarantees the convergence of multi-block ADMM using inexact sub-problem approximations. Similarly, to solve the supervised convex biclustering problem, we apply multi-block ADMM; we provide Algorithm 3 in Appendix A.

Practical Issues
In this section, we address some practical issues of applying our methods to real data. First, we show how to choose the regularization parameter $\lambda$. Then, we discuss the choice of weights and tuning parameters for the level of supervision. Moreover, we introduce an adaptive method to adjust for additional covariates.

Choice of Regularization Parameter
As mentioned, the tuning parameter $\lambda$ regulates both the number of clusters and the cluster assignments. The same procedures used to choose the regularization parameter $\lambda$ for other convex clustering methods apply here. For example, Wang (2010) and Fang and Wang (2012) proposed stability-selection-based methods, while Chi et al. (2017) proposed hold-out validation. In this paper, we suggest using stability selection when the number of clusters is not known; we find this approach works well in practice.
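As one concrete and deliberately simplified variant of the stability-selection idea (not the exact procedure of the cited papers), one can score each $\lambda$ by how consistently pairs of observations are co-clustered across random subsamples. The helper below assumes a user-supplied `cluster_fn(X, lam)` returning cluster labels.

```python
import numpy as np

def stability_score(cluster_fn, X, lam, B=20, frac=0.8, seed=0):
    """Average pairwise agreement of co-clustering indicators across subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(frac * n)
    runs = []
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=False)
        labels = np.asarray(cluster_fn(X[idx], lam))   # clustering at this lambda
        runs.append((idx, labels[:, None] == labels[None, :]))
    scores = []
    for a in range(B):
        for b in range(a + 1, B):
            ia, Ca = runs[a]
            ib, Cb = runs[b]
            common, pa, pb = np.intersect1d(ia, ib, return_indices=True)
            if len(common) > 1:                        # compare shared observations only
                scores.append(np.mean(Ca[np.ix_(pa, pa)] == Cb[np.ix_(pb, pb)]))
    return float(np.mean(scores))                      # choose the lambda maximizing this
```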

Choice of Weights and Level of Supervision
In practice, the choice of fusion weights has been shown to play an important role in computational efficiency and clustering quality. Chi and Lange (2015), Hocking et al. (2011), and Chi et al. (2017) have shown that setting weights inversely proportional to the distances between two observations yields superior performance. On the other hand, enforcing sparse weights reduces computational cost and improves clustering quality. Given these two considerations, the most commonly used weighting scheme for convex clustering is the k-nearest-neighbors method with a Gaussian kernel.
The challenge here is that the unlabeled data $X$ and the supervising auxiliary variable $y$ are measured from different sources and can be of different types; thus the Gaussian kernel with Euclidean distances is not an appropriate distance metric in this case. To measure the dissimilarity of two subjects measured in different data types, we adopt the Gower distance (Gower 1971), which is a commonly used distance metric for mixed types of data and has been shown to obtain superior performance compared with other distance metrics (Wang and Allen 2019; Ali and Massmoudi 2013; Hummel et al. 2017). The Gower distance between observations $i$ and $j$ can be defined as $g(X_{i\cdot}, X_{j\cdot}) = \sum_{l=1}^{p} g_{ijl}/p$, where $g_{ijl} = |X_{il} - X_{jl}|/R_l$ refers to the Gower distance between observations $i$ and $j$ for the $l$th feature and $R_l = \max_{i,j} |X_{il} - X_{jl}|$ is the range of the $l$th feature.
Denote $\nu^k_{ij}$ as the indicator which equals 1 if observation $j$ is among observation $i$'s $k$ nearest neighbors or vice versa, and 0 otherwise. Let $\alpha$ be a tuning parameter between 0 and 1. Then, our recommendation for the weights is
$$w_{ij} = \nu^k_{ij} \cdot \exp\bigl(-\phi\, d_{ij}^2\bigr), \qquad d_{ij} = (1 - \alpha)\, g(X_{i\cdot}, X_{j\cdot}) + \alpha\, g(y_i, y_j),$$
where $g(\cdot,\cdot)$ denotes the Gower distance and $\phi > 0$ is a kernel bandwidth. The tuning parameter $\alpha$ controls the level of supervision $y$ gives to the data $X$: a larger $\alpha$ puts more weight on the supervising auxiliary variable. In practice, we suggest $\alpha = D_y/(D_X + D_y)$, where $D_y$ is the null deviance of the supervising auxiliary variable $y$ and $D_X$ is the null deviance of $X$, so that $\alpha$ is the ratio of null deviances between the two sources. If the clustering signal in the supervising auxiliary variable is weak, our choice of weight will down-weight this variable, and vice versa. This weighting scheme balances the contributions of $X$ and $y$. Yet, one may choose other weighting schemes based on the level of confidence in the supervising auxiliary variable as well.
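A small sketch of this weighting scheme follows (our implementation, for continuous features only; `phi` is the kernel bandwidth and the $\alpha$-mixture of Gower distances mirrors the formula above):

```python
import numpy as np

def gower(X):
    """Pairwise Gower distances for a continuous data matrix X of shape (n, p)."""
    n, p = X.shape
    R = X.max(axis=0) - X.min(axis=0)            # per-feature ranges R_l
    R[R == 0] = 1.0                              # guard against constant features
    D = np.zeros((n, n))
    for l in range(p):
        D += np.abs(X[:, [l]] - X[:, l]) / R[l]  # |X_il - X_jl| / R_l
    return D / p

def scc_weights(X, y, alpha=0.5, k=5, phi=1.0):
    """Sparse kNN fusion weights mixing data and auxiliary-variable distances."""
    d = (1 - alpha) * gower(X) + alpha * gower(y.reshape(-1, 1))
    W = np.exp(-phi * d ** 2)
    order = np.argsort(d, axis=1)                # column 0 is self (distance 0)
    mask = np.zeros_like(W, dtype=bool)
    rows = np.arange(X.shape[0])[:, None]
    mask[rows, order[:, 1:k + 1]] = True
    mask |= mask.T                               # j in i's kNN, or vice versa
    return W * mask
```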
In the presence of additional covariates, we can no longer calculate weights based on the distance between $y_i$ and $y_j$, as the supervising auxiliary variable $y$ now contains the effect of the additional covariates $Z$. Recall that the fusion penalty measures the dissimilarity between cluster centroids. To remove the effect of additional covariates when calculating weights, we suggest (i) first estimating the covariate effects $\beta$ by fitting supervised convex clustering with weights not adjusted for covariates as usual, and (ii) then calculating weights based on distances between the supervising auxiliary variable centroids $\theta$, obtained by removing the effect of the covariates from the supervising auxiliary variable. This gives our adaptive supervised convex clustering with covariate-adjusted weights, as detailed in Algorithm 2. Our adaptive supervised convex clustering is similar to many adaptive approaches in the literature (Zou 2006).
Algorithm 2 Adaptive SCC with covariate-adjusted weights
1. Fit SCC with weights $w$ and a sequence of $\lambda$; find the $\lambda^{(k)}$ which gives the desired number of clusters; obtain the estimate $\hat{\beta}^{(k)}$.
2. Update the fusion weights to $\hat{w}$, computed from distances between the $\hat{y}_i$, where $\hat{y}$ is the supervising variable with the estimated covariate effects removed (as described below).
3. Fit SCC with $\hat{w}$.
To remove the effects of the additional covariates on the supervising auxiliary variable, we use the property of the link function: $g(\mathbb{E}[y \mid Z]) = \theta + Z\beta$. A good estimate of $\theta$ can be obtained by removing the effect of the covariates from the link function of $y$, i.e., $\hat{\theta} = g(y) - Z\hat{\beta}$. Hence we can get the estimated supervising auxiliary variable without the effect of the additional covariates using the inverse link function: $\hat{y} = g^{-1}(\hat{\theta})$.
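The two expressions above translate directly into code; a minimal sketch (our names; `g` and `g_inv` stand for the GLM link and inverse link functions):

```python
import numpy as np

def covariate_adjusted_y(y, Z, beta_hat, g, g_inv):
    """Remove estimated covariate effects from y on the link scale (Algorithm 2, step 2)."""
    theta_hat = g(y) - Z @ beta_hat   # theta-hat = g(y) - Z * beta-hat
    return g_inv(theta_hat)           # y-hat = g^{-1}(theta-hat)

# e.g., for a count-valued y with log link:
# y_adj = covariate_adjusted_y(y, Z, beta_hat, np.log, np.exp)
```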

Simulation Studies
In this section, we evaluate the performance of supervised convex clustering and compare it with existing methods. For all the simulations, we design challenging scenarios where clustering either $X$ or $y$ alone cannot lead to good clustering results. We will discuss the simulation setup in detail later.
We compare our supervised convex clustering method with convex clustering and hierarchical clustering using different distance metrics (Euclidean distances and Gower distances).
It should be pointed out that there are several linkage options for hierarchical clustering; we only report the linkage with the best clustering performance. We use the adjusted Rand index (Hubert and Arabie 1985) to evaluate the accuracy of the clustering results. The adjusted Rand index is a commonly used metric to measure the agreement between the estimated cluster labels and the true underlying labels. A larger adjusted Rand index (close to 1) indicates a strong resemblance between the estimated and true labels. For all methods, we assume that the oracle number of clusters is known for fair comparisons.
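For instance, using the scikit-learn implementation, the adjusted Rand index depends only on the induced partition, not on the label names:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
est_labels = [1, 1, 0, 0, 2, 2]    # same partition, permuted label names
print(adjusted_rand_score(true_labels, est_labels))   # 1.0
```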
We first consider the base simulation, where the supervising auxiliary variable $y$ is generated from the cluster centroids directly without additional covariates. In the base simulation, we study two designs of the unlabeled data, each under two different scenarios. In addition to the base simulation, we consider other setups including varying dimensions, unequal group sizes, and additional covariates.
The base simulation, as mentioned, assumes that the supervising auxiliary variable $y$ is generated from the cluster centroids directly. For each simulation, the data set consists of $n = 120$ observations and $p = 30$ features with 3 clusters. Each cluster has an equal number of observations in the base simulation. The data is generated from the group mean plus noise model $X_{i\cdot} = \mu_k + \varepsilon_{i\cdot}$ for $i \in G_k$ (where $G_k$ refers to the observation indices belonging to group $k$). The supervising auxiliary variable, $y_i$, is generated from different distributions with parameter $\mu_k$ based on the data type; the two sources share the same group labels, which means $y_i \sim \phi(\mu_k)$, where $i \in G_k$, $k = 1, 2, 3$, and $\phi$ is a distribution function. We denote $X_{G_k}$ and $y_{G_k}$ as the data points and their corresponding supervising auxiliary variables that belong to group $k$.
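A sketch of this generative setup for a Gaussian supervising auxiliary variable follows (the centroid and noise values here are illustrative placeholders; the exact parameters are given in Appendix B):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per, p, K, sigma = 40, 30, 3, 1.0              # n = 120 observations, 3 equal clusters

labels = np.repeat(np.arange(K), n_per)          # group indices G_1, G_2, G_3
mu = rng.normal(0.0, 3.0, size=(K, p))           # cluster centroids for X (illustrative)
X = mu[labels] + sigma * rng.normal(size=(K * n_per, p))   # group mean plus noise

mu_y = np.array([-3.0, 0.0, 3.0])                # auxiliary-variable centroids (illustrative)
y = rng.normal(mu_y[labels], 1.0)                # y_i ~ phi(mu_k), shared group labels
```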
We consider two designs of the unlabeled data $X$: spherical (S) and half-moon (H). For the half-moon data, we consider the standard simulated data of three interlocking half moons as suggested by Chi and Lange (2015) and Wang and Allen (2019). For each design, we consider two scenarios where neither data source leads to perfect clustering results. In the first scenario (S1 and H1), $X_{G_1}$ and $X_{G_3}$ overlap while $X_{G_2}$ is separate from $X_{G_1}$ and $X_{G_3}$; $y_{G_1}$ and $y_{G_3}$ form two separate clusters while $y_{G_2}$ is noisy and overlaps with $y_{G_1}$ and $y_{G_3}$. In the second scenario (S2 and H2), $X_{G_1}$, $X_{G_2}$, and $X_{G_3}$ all overlap; $y$ has three separate clusters with little overlap. Yet, $y$ is noisy and one cannot get perfect results by clustering $y$ alone. Hence, each of the four simulation scenarios above creates a challenging problem where good clustering results cannot be achieved by clustering either $X$ or $y$ alone, and we seek to obtain good clustering results by borrowing strength from the two sources with our method. The details of the exact parameters for simulating the unlabeled data and supervising auxiliary variables are given in Appendix B.
For the above simulations, we assume that the number of cluster centroids is the same for both sources. Yet, in the case of a categorical supervising auxiliary variable, the number of categories we observe in that variable can differ from the number of true classes. Hence we consider the following additional simulations. In additional simulation 1 (AS1), we assume the number of clusters of $X$ is greater than the number of categories in $y$; in AS1, we consider both binary and categorical supervising auxiliary variables. In additional simulation 2 (AS2), we assume the number of categories in $y$ is greater than the number of clusters of $X$; in AS2, we consider a categorical supervising auxiliary variable.
From Table 1, we see that our supervised convex clustering outperforms existing methods for different types of supervising auxiliary variables by leveraging information from both sources. For hierarchical clustering on spherical data, different distance metrics perform comparably well on different types of supervising auxiliary variables. For example, hierarchical clustering with Euclidean distances works well for Gaussian and count-valued supervising auxiliary variables, while hierarchical clustering with Gower distances works well for binary and categorical supervising auxiliary variables. Yet, our SCC performs comparably to the best hierarchical clustering method in all cases. For non-spherical data, our method performs significantly better than hierarchical clustering.

For the rest of this section, we consider setups that differ from the base simulation and verify that our method still performs well in these settings. We slightly change the setup of scenario 1 (S1) in the base simulation described above with a Gaussian supervising auxiliary variable. Specifically, we vary the number of features and the group sizes one at a time while keeping the rest of the setup the same. We then consider the case when the supervising auxiliary variable is affected by additional covariates. Finally, we examine the performance of supervised convex biclustering.
First, we change the number of features of $X$ from 30 to 50 and 100, respectively. To increase the difficulty of the simulation, we also increase the within-cluster variance $\sigma$ in the simulation setup. From Table 2, we see that our supervised convex clustering method still performs comparably well as the number of features increases by leveraging information from both sources. In contrast, existing methods do not perform as well as in Table 1, since $X$ now contains more clustering information with the increased dimension and dominates the clustering results for existing methods. Moreover, we evaluate the performance of our method on unequal group sizes with $n_1 = 80$, $n_2 = 10$, and $n_3 = 30$; Table 2 suggests that our method also performs better than existing methods in this setup. Next, we examine the performance of our supervised convex clustering when the supervising auxiliary variable is affected by additional covariates. We use our adaptive supervised convex clustering approach proposed in Section 2.7.2 to adjust for these additional covariates.
The data $X$ is generated from the same distribution as in scenario 1 (S1) of the base simulation described above, and we again consider different types of supervising auxiliary variables. Yet, the supervising auxiliary variable $y_i$ is now simulated from $y_i \sim \phi(\mu_k + Z_i^T\beta)$, where $Z_i \in \mathbb{R}^{10}$ with $Z_i \sim N(0, I_{10})$ and $\beta_j \sim N(\pm 3, 1)$; $\mu_k$ is generated as in scenario 1 of the base simulation. We set the number of additional covariates to 10.
Table 3 shows that our adaptive supervised convex clustering performs best by removing the effects of the additional covariates from the supervising auxiliary variable; hence our method leverages the information in both the unlabeled data $X$ and the supervising auxiliary variable $y$ to obtain better clustering results.
Case Study: Discovering New Subtypes of Alzheimer's Disease
An important application of our proposed method is in clinical genomics, where the objective is to elucidate the genetic basis of diseases and to find potential biomarkers for developing personalized treatments. An important aspect of personalized treatment is identifying groups of subjects with similar genetic profiles and similar clinical outcomes so that personalized medicine targeting specific gene groups can be developed. For example, breast cancer patients with different genomic subtypes now receive different sets of treatments. Alzheimer's Disease (AD) is a debilitating brain disorder that irreversibly damages cognitive skills; we would like to investigate the clinical genomics of AD to better study its genetic basis.
However, there is a large amount of heterogeneity in the cognition of older adults, and little is known about the underlying genetic mechanisms that cause AD apart from a handful of genes. In this case study, we apply our SCC method to find biologically meaningful group structures among both subjects and genomic profiles for AD by jointly analyzing clinical measurements and gene expression via RNASeq acquired from the Religious Orders Study Memory and Aging Project (ROSMAP) Study (Bennett et al. 2018).
For our analysis, to start with, we take the clinically measured global cognition score as the noisy supervising auxiliary variable. The global cognition score, a summary measure of cognition proximal to death, is computed by averaging nineteen clinical cognitive tests conducted during a subject's last clinical visit (Bennett et al. 2018). A higher global cognition score indicates relatively healthier cognitive abilities. The ROSMAP data consist of 507 subjects with complete recordings of 41,809 RNASeq genes. First, we log-transform the RNASeq counts, as is commonly done in many RNASeq analyses. After that, we remove undesirable batch effects from the RNASeq data using the ComBat technique (Johnson et al. 2007). To reduce the number of genes to a more manageable size, we take the top 20,000 RNASeq genes with the highest variance and subsequently keep the top 600 genes that are most associated with the cognition score via univariate filtering.
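A schematic of this preprocessing pipeline is sketched below (our sketch with synthetic stand-ins for the ROSMAP inputs; the ComBat step is noted but omitted, and absolute Pearson correlation is used as one simple choice of univariate filter):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the ROSMAP inputs (507 subjects, 41,809 genes);
# a real analysis would load the counts, batch labels, and cognition scores.
counts = rng.poisson(5.0, size=(507, 41809)).astype(float)
cognition = rng.normal(size=507)

logX = np.log1p(counts)                            # log-transform the RNASeq counts
# (ComBat batch correction would be applied to logX here; omitted in this sketch)

var_top = np.argsort(logX.var(axis=0))[-20000:]    # top 20,000 highest-variance genes
X = logX[:, var_top]

# univariate filtering: keep the 600 genes most associated with cognition
Xc = X - X.mean(axis=0)
yc = cognition - cognition.mean()
r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
X = X[:, np.argsort(r)[-600:]]
```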
Our goal is to identify both scientifically meaningful group structures among subjects and potential genetic biomarkers for AD that may be hidden in completely supervised or unsupervised analyses, by leveraging information from both the clinical global cognition score, as the supervising auxiliary variable, and the unlabeled RNASeq gene expression data. To this end, we run our SCC-biclustering method with adjustment for age at death, which simultaneously estimates group structures among subjects and RNASeq genes from the preprocessed ROSMAP data. We include the adjustment for age to account for its effects on the cognition score, because cognitive skills are expected to decline as an individual ages, as shown in Fig 3C. This yields biclusters of subjects and genes that might reveal genetic differences between subjects with higher cognition and subjects with lower cognition. To briefly summarize, our SCC method leverages information from both the unlabeled RNASeq gene expression data and the noisy supervising auxiliary variable, global cognition score, to recover more interpretable clusters of subjects and gene signatures, in contrast to either fully supervised or unsupervised methods, which only estimate clusters using one data source.
To study the scientific validity of the clusters discovered by our SCC method, we focus on analyzing the heterogeneity among subjects and RNASeq genes discovered by SCC (Fig 1A). First and foremost, many RNASeq genes appear to be upregulated (downregulated) for subjects in the low cognition cluster whereas these same genes seem to be downregulated (upregulated) for subjects in the high cognition cluster. Moreover, at least eight of the top differentially expressed genes (DEGs) found by SCC have been linked to cognitive decline and/or AD pathology in AD patients (Liu et al. 2018; Han et al. 2014; Carter 2017; Gomez et al. 2010; Li et al. 2017; Espuny-Camacho et al. 2017; Pankratova et al. 2018; Bossers et al. 2010). Additionally, the other 16 DEGs found by SCC, which could be missed by the supervised approach, may point to candidates for future studies of the genetic basis of AD. Overall, this indicates that our SCC method yields results that are very distinct from other supervised learning approaches. Notably, a handful of subjects grouped into the low cognition cluster based on their RNASeq signatures nevertheless maintain unusually high cognition (Fig 3A). Such a seemingly contradictory observation hints at the possibility that these atypical subjects in the low cognition cluster might possess certain degrees of so-called Cognitive Resilience (CR), a phenomenon in which healthy cognition exists despite extensive AD-related pathology (Negash et al. 2011; Stern 2012). Previous scientific studies have also observed that individuals with higher CR experience a slower rate of cognitive decline over time (Yu et al. 2015).
To further substantiate this finding, we examine levels of amyloid plaques, one of the hallmarks of AD brain pathology (Takahashi et al. 2017), across the various subgroups identified by our SCC (Fig 3B). Overall, the low cognition cluster (purple box in Fig 3B) has a significantly higher median amyloid level than the high cognition cluster (yellow box in Fig 3B) according to a one-sided Wilcoxon rank sum test (p-value = $2.9 \times 10^{-15}$). In the meantime, the mean cognition trajectory of the overall low cognition cluster is well below that of the high cognition cluster, likely due to AD-related cognitive decline. On the other hand, despite the significantly higher median amyloid level of the atypical subjects in the low cognition cluster as compared to the high cognition cluster (p-value = $3.3 \times 10^{-5}$), the mean cognition trajectory of these atypical subjects is fairly close to that of the high cognition cluster (yellow curve in Fig 3C). In other words, the atypical subjects in the low cognition cluster found by SCC manage to maintain cognitive abilities on par with the relatively healthy high cognition cluster, even though these atypical subjects also possess high amyloid levels indicative of extensive AD brain pathology. Discovery of such atypical subject groups from the ROSMAP data provides new potential avenues for further scientific studies to better understand the genetic mechanisms responsible for the development of Cognitive Resilience, conferring potential clinical utility to our SCC method in clinical genomics.

In addition to discovering subjects with high CR, our SCC method manages to identify subjects with dementia caused by conditions other than "gold standard" AD pathology.
Specifically, the atypical subjects in the high cognition cluster appear to be free of the RNASeq signatures that are common among subjects in the low cognition cluster (Fig 3A). Also, the median amyloid level of these atypical subjects in the high cognition cluster is significantly lower than that of the atypical subjects in the low cognition cluster (p-value = 0.036), as shown in Fig 3B.
Nonetheless, the atypical subjects in the high cognition cluster appear to experience a much steeper drop in mean cognition over time than both the overall high cognition cluster and the possibly Cognitive Resilient subgroup, even though the aforementioned evidence suggests that these atypical subjects probably do not possess AD-related pathology.
Further analyses reveal that Lewy bodies are present in 19% of the atypical subjects in the high cognition cluster while microinfarcts are present in another 43% of these atypical subjects, both of which have been identified as possible non-AD causes of dementia in previous studies (McKeith et al. 1996; Arvanitakis et al. 2011). Additionally, we also apply our SCC method to the ROSMAP data using the clinician's diagnosis as the supervising auxiliary variable. The clinician's diagnosis, a summary diagnostic opinion rendered by a neurologist prior to a patient's death, is a categorical variable with three levels: no cognitive impairment (NCI), mild cognitive impairment (MCI), and Alzheimer's Disease (AD). Due to the large amount of heterogeneity in cognitive decline, there are no definitive standards for diagnosing AD subtypes prior to death without postmortem pathology data. Therefore, the clinician's diagnosis can be subjective and prone to judgement errors; we expect these diagnostic opinions to be noisy and cannot fully trust them. Here, we would like to examine whether we can use genomics data to help improve these diagnostic opinions with our SCC method. Again, we apply our SCC-biclustering method with adjustment for age at death to simultaneously find group structures among subjects and RNASeq genes. Our method uncovers joint group structure and identifies subjects whose diagnoses might need to be re-assessed.

Discussion
In this paper, we develop a novel supervised convex clustering method that leverages information from both supervising auxiliary variables and unlabeled data. Our method, in contrast to existing semi-supervised clustering approaches, is the first to directly use the outcome of interest to help cluster unlabeled data. In particular, our SCC borrows strength from both information sources and yields more scientifically interpretable group structures that may be hidden in completely unsupervised analyses of data.
This paper mainly addresses the methodological development for supervised convex clustering, but there are many possible open areas for future research. One potentially interesting area of future work may be to investigate supervised convex clustering with missing data or with missing supervising auxiliary variables. Handling missing data may be more amenable for our convex-clustering-based approach, for which Chi et al. (2017) have developed extensions for missing data, than for other clustering techniques. Another extension, based on the recent paper of Wang and Allen (2019), could be supervised convex clustering with data integration, where multiple sources of data or supervising auxiliary variables are observed. Additionally, Wang et al. (2018) and Wang and Allen (2019) recently proposed to perform feature selection and convex clustering simultaneously, another extension that could be incorporated into our supervised convex clustering framework. This paper focuses on methodological development, but we expect our approach to inherit many desirable theoretical properties of convex clustering and plan to investigate this in future work. Finally, Weylandt et al. (2019) recently proposed fast algorithms and visualization tools, both static (dendrograms) and dynamic (clustering path plots), for the convex clustering solution. Their theoretical assumptions should apply in our supervised convex clustering setting and thus allow us to use dendrograms to additionally aid in visualizing our results.

One question that is worth further investigating is whether we should use supervised convex clustering at all, and how practitioners can tell whether the supervising auxiliary variable is useful for finding group structures. Further research could investigate when to apply supervised convex clustering and how much supervision is warranted for a given problem. We suggest a data-driven approach that determines the amount of supervision using the relative deviance of the two data sources, but one might adopt other approaches, such as learning the amount of supervision from the data.
We apply our method to a high-dimensional genomics case study. Yet, our approach may find applications in a variety of fields such as electronic health records, online market segmentation, and text mining, among many other clustering applications. For example, in online market segmentation, some additional information on the users and the items is typically available, such as previous purchasing history, demographics, and social media usage, among others. We might use this meta information as supervising auxiliary variables to help understand joint group structures. To summarize, we develop a novel, unified approach to an interesting but challenging problem that leads to more scientifically interpretable clustering results and opens many avenues for future research.

Supervised Convex Clustering: Supplementary Materials
Minjie Wang, Tianyi Yao and Genevera I. Allen

The supplementary materials are organized as follows. In Appendix A, we discuss the algorithm to solve the supervised convex biclustering problem. In Appendix B, we discuss the details of the exact parameters in the simulation study.

A Supervised Convex Biclustering Algorithm
In this appendix, we discuss the algorithm to solve supervised convex biclustering in Section 2.4.
The supervised convex biclustering problem is formulated as:
$$\underset{U,\,\theta,\,\beta}{\text{minimize}} \;\; \frac{\pi_X}{2}\|X - U\|_F^2 + \pi_y \sum_{i=1}^{n} \ell\bigl(y_i;\,\theta_i + Z_{i\cdot}\beta\bigr) + \lambda \Bigl[\sum_{i<i'} w_{ii'} \bigl\|(\theta_i, U_{i\cdot}) - (\theta_{i'}, U_{i'\cdot})\bigr\|_2 + \sum_{j<j'} \tilde{w}_{jj'} \bigl\|U_{\cdot j} - U_{\cdot j'}\bigr\|_2\Bigr].$$
Notice we cannot use the Dykstra-Like Proximal Algorithm (DLPA) mentioned in Weylandt et al. (2019) here, as DLPA requires an $\ell_2$-type loss whereas our loss $\ell$ is arbitrary. To address this, we use multi-block ADMM to solve the above problem. To account for the difference between the columns of the two centroids, we introduce a new variable $M$ which is equal to $U^T$. In this way, the row-wise and column-wise penalties on the differences between the two centroids decompose.
We can recast the problem as an equivalent constrained optimization problem and apply multi-block ADMM; the resulting updates are given in Algorithm 3.

• AS1: Binary/categorical simulation: the number of clusters of $X$ is greater than the number of categories in $y$; observations within the same category of $y$ are forced to have different cluster centroids $\mu_k$ so that those observations form more than one cluster. The categorical supervising auxiliary variable is simulated similarly as in S2.
• AS2: Categorical simulation: number of classes of y is greater than number of clusters of X.
Table 1 shows all the results for the base simulation. Overall, we see that our supervised convex clustering outperforms existing methods for different types of supervising auxiliary variables by leveraging information from both sources.
The resulting heatmap of RNASeq profiles is displayed in Fig 1A, where we order the subjects (rows) and genes (columns) according to the cluster assignment estimated by our SCC-biclustering, with black dashed lines indicating cluster boundaries. In addition, the corresponding global cognition score is displayed on the left side of the heatmap. For comparison, we also show a completely supervised approach that treats the global cognition score as the true response variable. The heatmap generated by this completely supervised approach is shown in Fig 1B, where the subjects are ordered by ascending global cognition score (from top to bottom) while the genes are ordered by ascending p-values obtained from univariate association tests of each gene with the global cognition score after adjusting for age. Finally, in Fig 1C, subjects and genes are ordered according to the cluster heatmap obtained from completely unsupervised hierarchical biclustering on the unlabeled RNASeq data with the Euclidean distance metric and Ward linkage.

Figure 1: (A) The subjects and genes are ordered according to the cluster assignment estimated by SCC-biclustering (adjusting for age at death). Cluster boundaries are indicated with black dashed lines. Atypical subjects in the high cognition cluster (low cognition cluster) are highlighted in blue (red) on the left. The top 40 DEGs whose median expression levels are significantly different across the two SCC clusters at FWER level 0.05 are highlighted with red bars on top. (B) The subjects are ordered in ascending order of cognition score and the genes are ordered in ascending order of p-values obtained from univariate association tests of each gene with cognition score, adjusting for age. The ranks of the DEGs found by SCC in terms of the univariate association p-values are indicated with red bars. (C) The subjects and genes are ordered according to dendrograms obtained from hierarchical biclustering on the RNASeq data alone.

For comparison, we also select the top 100 RNASeq genes with the smallest p-values from the completely supervised univariate association tests with the global cognition score. The intersection between the DEGs found by our SCC and the genes significantly associated with cognition score is displayed in the orange circle in Fig 2. After conducting a literature search on the top DEGs discovered by our SCC, we found evidence in the AD literature linking at least eight of these DEGs to cognitive decline and/or AD pathology; these eight DEGs are shown in the blue circle in Fig 2. While this is only a preliminary investigation into the improved scientific interpretation and validity of the clusters obtained from SCC, successfully identifying DEGs that have been validated in the AD literature is encouraging evidence.

Figure 2: We plot a Venn diagram to show the relations between the top 40 DEGs discovered by our SCC and genes found by other methods. Gray circle: the top 40 DEGs found by SCC. Orange circle: the intersection between the DEGs found by SCC and the top 100 genes that are significantly, univariately associated with cognition score. Blue circle: genes that have been related to cognitive decline/AD pathology in the biological literature.

Figure 3: (A) Zoom-in view of the RNASeq heatmap produced by SCC. Atypical subjects with unusually high cognition in the low cognition cluster are highlighted with a red bar, while atypical subjects with much poorer cognition in the high cognition cluster are highlighted with a blue bar on the left. (B) Boxplots of amyloid plaque levels of the various subgroups identified by SCC. (C) Longitudinal trajectories of cognition score of all subjects, with bolded lines representing the smoothed mean cognition of the various subgroups identified by SCC. Collectively, the figures provide evidence that the atypical subjects in the low cognition cluster could be Cognitively Resilient (CR) while the atypical subjects in the high cognition cluster may have non-AD-related dementia.
Fig 4A shows the heatmap of RNASeq profiles where the subjects and genes are ordered according to the cluster assignment estimated by SCC-biclustering, with the clinician's diagnosis displayed on the left side. Overall, we see that the genomics can help us differentiate AD subtypes fairly well. Interestingly, by borrowing strength from signals in both the supervising auxiliary variable and the unlabeled RNASeq data, our SCC method deems that a handful of MCI subjects should be grouped together with the majority of the AD subjects. This might first come as a surprise, but a close examination of the RNASeq expression of these MCI subjects does reveal that the gene signatures of the MCI subjects above the dashed line in Fig 4B indeed resemble those of AD subjects more than the expression of the rest of the MCI cluster.

Figure 4: (A) The results of SCC where the subjects and genes are ordered according to the cluster assignment estimated by SCC-biclustering with the clinician's diagnosis as the supervising auxiliary variable (adjusting for age at death). Cluster boundaries are indicated with black dashed lines. The zoomed-in subjects in (B) are highlighted in green. Cognition scores are plotted on the right for reference. (B) A zoomed-in plot of the heatmap reveals that several MCI subjects have gene expression profiles that are more similar to those of AD subjects.
Also, we replace the squared Euclidean distances with the squared Frobenius norm to facilitate a matrix-form algorithm. We can then form the augmented Lagrangian and apply multi-block ADMM to solve our supervised convex clustering problem. Further, note that both the $\theta$ and $\beta$ sub-problems generally do not have analytical closed-form solutions for an arbitrary loss function $\ell$. Hence we would need to apply an inner optimization routine with nested iterative updates to solve each sub-problem to full convergence, which is computationally intensive. To address this and speed up computation, we adopt the generalized multi-block ADMM with the inexact sub-problem approach of Wang and Allen (2019) and take a one-step descent update to solve each sub-problem approximately. For a differentiable loss $\ell$, we take a one-step gradient descent update by applying linearized multi-block ADMM to the $\theta$ or $\beta$ sub-problem at each iteration. This gives Algorithm 1, a multi-block ADMM algorithm to solve supervised convex clustering with a differentiable loss. Similarly, for a non-differentiable distance-based loss $\ell$, we can introduce a new block for the non-smooth function and apply multi-block ADMM with simple closed-form solutions for each primal variable update. The dual variable is denoted by $Q$.

Table 1: Comparisons of adjusted Rand index for supervised convex clustering and existing methods in the base simulation.

Table 2: Comparisons of adjusted Rand index for supervised convex clustering and existing methods; additional simulations for Gaussian supervising auxiliary variables. The data is simulated from the same setup as S1 for the Gaussian supervising auxiliary variable in the base simulation, but with different numbers of features and unequal group sizes.