A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data

We introduce a Bayesian approach for biclustering that accounts for prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterizes each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both simulated data and The Cancer Genome Atlas kidney cancer data.

Coexpressed genes are of interest since they either have a similar function, are involved in a common biological process often controlled by the same transcriptional regulatory program, or are members of the same pathway or protein complexes, and hence are functionally similar (Wikipedia Contributors, 2018). This forms the basic motivation for various clustering algorithms on microarray data, where a goal is to select genes that are coexpressed under all the experimental conditions (e.g., different environmental samples, different individuals or tissues, or different time points). In other words, all genes in a given cluster must show similar coregulation patterns across all experimental conditions. Therefore, clustering algorithms are a sensible choice in situations where experiments are conducted under a limited number of homogeneous experimental conditions. Now, with the advancement of sequencing technologies and the availability of public repositories of omics data, such as TCGA (The Cancer Genome Atlas) (The Cancer Genome Atlas Research Network et al., 2013), we need to perform a similar analysis but with data from a heterogeneous compendium. In this case, the assumptions underlying conventional clustering techniques may not hold, as genes that share similar expression patterns may exhibit coexpression only under a subset of conditions. Biclustering algorithms have the advantage, compared with conventional clustering methods, of discovering genes that are coexpressed in a subset of conditions instead of all of them. Moreover, since gene expression levels are measured under a large number of heterogeneous conditions, biclustering suits the system-level analysis needed for discovering transcriptional modules, which provide essential clues for revealing genetic networks.
Although existing biclustering algorithms originate with the same goal in mind, namely to cluster both genes and samples at the same time, they differ in the approaches they use and the problems they try to solve. Among early work on biclustering methods, algorithms were designed to divide the data into checkerboard units of patterns: clustering algorithms were applied iteratively to both the genes and the samples of the data (Alon et al., 1999). There are also algorithms designed specifically with the biclustering goals in mind.
Our model belongs to the class of biclustering algorithms that challenge conventional clustering with the idea that genes sharing functional similarities do not have to be coexpressed over all the samples in the data set. This idea has been used to develop algorithms that cluster genes sharing similar expression patterns over a subset of samples and, similarly, to cluster tumor samples sharing similar gene expression levels over a subset of genes. This type of algorithm, in which the problem is broken down into two separate orientations (biclustering genes and samples), was pioneered by Cheng and Church (2000). Multiple biclustering algorithms have been proposed in the literature (Chekouo & Murua, 2015; Chekouo et al., 2015; Cho et al., 2004; Getz et al., 2000; Gu & Liu, 2008; Lee et al., 2010; Sill et al., 2011; Turner et al., 2005; Xue et al., 2014; Zhang, 2010). For instance, Tanay et al. (2002) discretize the gene expression values into three levels (upregulated, downregulated, and inactive) and represent the discrete matrix as a bipartite graph. They then use a heuristic approach to find bicliques in the graph, which correspond to biclusters in the matrix. The plaid model of Lazzeroni and Owen (2002) is also one of the popular biclustering methods.
The spectral biclustering method (Kluger et al., 2003) uses singular value decomposition to solve the biclustering problem.
Most of these algorithms are not able to incorporate prior knowledge. In contrast, Bayesian models provide a systematic basis for the integration of prior knowledge and information from other data sources. The work of Chekouo et al. (2015) is an example of a Bayesian biclustering method that integrates other data sources.
Our proposed biclustering method not only circumvents these shortcomings but also accommodates much of the rich structure in genomic expression data. Moreover, it identifies "predefined" biclusters that are biologically meaningful and interpretable, that is, characterized by genes/features that are overexpressed or underexpressed. Since we use a single coherent probabilistic framework, the model is extendable and can incorporate other functional information, such as experiment type, putative binding sites, and so forth. Since genes can be sorted based on their annotation congruence (Christophe & Nives, 2017; Thomas et al., 2012), our proposed model takes advantage of the potential coexpression between neighboring genes by imposing a first-order Markovian assumption on hidden gene states. As with other biclustering models, ours is an incomplete-data problem with an unknown model structure. We treat the hidden variables (gene labels and cluster labels) and the model parameters as random variables and use Gibbs sampling to estimate their joint distribution. To sort genes based on functional similarity, we use the mgeneSim method in the Bioconductor package GOSemSim (Li et al., 2010) to compute the semantic similarity among multiple gene products.
The manuscript is organized as follows. In Section 2, we describe our Bayesian model-based (bi)clustering using the hidden Markov model (HMM). In Section 3, we present a simulated data study. Section 4 describes the application to TCGA kidney cancer data. We conclude the manuscript in Section 5.

| A BAYESIAN BICLUSTERING APPROACH USING HMM
In the context of model-based clustering for high-dimensional features, the feature space is usually divided into two nonoverlapping sets (Chekouo & Murua, 2018; Fop & Murphy, 2018): variables or features that are relevant for clustering, whose distribution directly depends on the group membership variable, and irrelevant variables, which may be further divided into either redundant variables (i.e., variables that contain similar information as the relevant variables, being correlated with them, but are conditionally independent of the grouping variable Z given the relevant variables) or uninformative variables (i.e., variables that correspond to noise and whose distribution is completely independent of the group structure) (Maugis et al., 2009). In our modeling context, we select variables within each cluster to be either relevant (overexpressed or underexpressed in most of the conditions/samples) or irrelevant (not significantly expressed but possibly correlated with relevant ones).
In our approach, we assume a dependence structure between features via prior distributions on the feature indicator variables ρ. In fact, the feature selection process within each cluster uses a first-order Markov assumption between feature variables. For each sample/subject cluster, we model gene expression across samples using an HMM with three states/groups: overexpressed genes (state 1), underexpressed genes (state 2), and not actively expressed genes (state 3) (Cui et al., 2015). Since genes are sorted/ordered based on their functional similarity, the HMM can take advantage of potential coexpression by borrowing information across genes that are close in the ordering structure. Hence, our biclustering results do not depend on the arbitrary order in which the p features are presented but, instead, on a prior ordering structure of features defined using biological knowledge (e.g., Gene Ontology [GO]). With this assumption, genes/features that are closer in the ordering structure are more likely to belong to the same feature state. Within each sample cluster, we have three groups of features (i.e., three states), which correspond to a total of 3 × K biclusters, where K is the number of sample clusters. Figure 1 illustrates an example of a biclustering structure captured by our approach with K = 2 clusters.

| The approach
For each cluster k = 1, …, K, the latent feature indicator ρ_{j,k} defines feature expression with three possible states: ρ_{j,k} is 1, 2, or 3 according to whether the expression of gene j in cluster k is viewed as overexpressed, underexpressed, or not actively expressed (i.e., irrelevant), for every feature j ∈ {1, 2, …, p} (see Figure 1 for an example). Let z_i be the cluster indicator variable for individual/sample i, where z_i = k if individual i belongs to cluster k ∈ {1, 2, …, K}. Let y_{ij} be the value (e.g., gene expression) of feature j for sample i. We assume that, given the cluster label k = 1, …, K and the state l = 1, 2, 3, the y_{ij} follow independent (with respect to i and j) distributions defined as

$$y_{ij} \mid z_i = k,\ \rho_{j,k} = l \;\sim\; \text{Normal}\big(\mu_{jkl},\, \sigma^2_{jkl}\big), \qquad (1)$$

where σ²_{jkl} is the error variance. We define a hierarchical model by placing further prior distributions on μ_{jkl} and σ²_{jkl}. In fact, we assume that

$$\mu_{jkl} \sim \begin{cases} \text{Normal}\big(\mu_{k1},\, \sigma^2_{k1}\big), & \text{if } \mu_{k1} > 0 \text{ and } \rho_{j,k} = 1 \ \text{(relevant and overexpressed)},\\ \text{Normal}\big(\mu_{k2},\, \sigma^2_{k2}\big), & \text{if } \mu_{k2} < 0 \text{ and } \rho_{j,k} = 2 \ \text{(relevant and underexpressed)}, \end{cases} \qquad (2)$$

and μ_{jk3} = 0 if ρ_{j,k} = 3, where μ_{k1} > 0 and μ_{k2} < 0 are the mean parameters of overexpressed and underexpressed features, respectively, and σ²_{k1} and σ²_{k2} are their corresponding variances within cluster k. From Equation (2), we can see that each cluster k is characterized by three different means, μ_{k1}, μ_{k2}, and 0, which correspond, respectively, to the average means of overexpressed, underexpressed, and not active features. Our model (Equations 1 and 2) identifies irrelevant features (i.e., redundant ones) within each cluster k, not across all clusters, which differs from standard methods that perform clustering and variable selection simultaneously (Chekouo & Murua, 2018; Fop & Murphy, 2018; Maugis et al., 2009). However, in our model, features that are redundant/irrelevant across all clusters (i.e., ρ_{j,k} = 3 for every k) can be considered unimportant. Moreover, we also assume that σ²_{jk3} = σ²_{j3} for every k, that is, independent of the subject clustering.
To incorporate the prior ordering structure of our features into the model, we define a first-order Markov dependency between the latent feature variables in cluster k, that is,

$$p\big(\rho_{j,k} \mid \boldsymbol{\rho}_k^{\,j-1}\big) = p\big(\rho_{j,k} \mid \rho_{j-1,k}\big),$$

where ρ_k^{j−1} = (ρ_{1,k}, ρ_{2,k}, …, ρ_{j−1,k}) denotes the sequence history up to sequence point j − 1, and ρ_{j−1,k} denotes the state of feature j − 1 within cluster k. Here p(ρ_{j,k} | ρ_{j−1,k}) represents the transition probability from state ρ_{j−1,k} to state ρ_{j,k} within cluster k. Within cluster k, the observation y_{ij} depends only on the latent state ρ_{j,k}. Hence, our complete HMM likelihood can be written as

$$p(\mathbf{y}, \mathbf{z}, \boldsymbol{\rho} \mid \theta) = \left[\prod_{i=1}^{n} \omega_{z_i} \prod_{j=1}^{p} p\big(y_{ij} \mid \rho_{j, z_i}, \theta\big)\right] \prod_{k=1}^{K} \left[p(\rho_{1,k}) \prod_{j=2}^{p} p\big(\rho_{j,k} \mid \rho_{j-1,k}\big)\right],$$

where θ is the set of other parameters defined below.
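To make the likelihood concrete, the following R sketch evaluates the complete-data log-likelihood for given values of (z, ρ, θ). This is only an illustrative sketch; the argument names and array layout (y as an n × p matrix; mu and sigma2 as p × K × 3 arrays with mu[, , 3] = 0) are our own conventions, not the API of our R package.

```r
# Complete-data log-likelihood of the HMM biclustering model.
# y: n x p matrix; z: length-n cluster labels in 1..K
# rho: p x K matrix of states (1 = over, 2 = under, 3 = irrelevant)
# mu, sigma2: p x K x 3 arrays of emission means/variances (mu[, , 3] = 0)
# omega: length-K cluster probabilities; xi: 3 x 3 transition matrix
complete_loglik <- function(y, z, rho, mu, sigma2, omega, xi) {
  n <- nrow(y); p <- ncol(y); K <- length(omega)
  ll <- sum(log(omega[z]))                      # cluster-membership term
  for (i in 1:n) {
    k  <- z[i]
    l  <- rho[, k]                              # states of the p features in cluster k
    m  <- mu[cbind(1:p, k, l)]                  # emission means
    s2 <- sigma2[cbind(1:p, k, l)]              # emission variances
    ll <- ll + sum(dnorm(y[i, ], mean = m, sd = sqrt(s2), log = TRUE))
  }
  for (k in 1:K) {                              # Markov prior on the latent chains
    ll <- ll + log(1 / 3)                       # uniform initial distribution
    ll <- ll + sum(log(xi[cbind(rho[1:(p - 1), k], rho[2:p, k])]))
  }
  ll
}
```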

| Prior and full conditional distributions
In our specific case, with a two-dimensional latent structure, one for the sample clustering z and the other for the feature classification ρ, we are dealing with three separate entities in the vector of model parameters θ. That is, θ can be decomposed as θ = (Ω, ξ, ζ), where (i) Ω = (ω_k; k = 1, 2, …, K), with ω_k = Prob(z_i = k) the prior probability that an individual i belongs to cluster k; (ii) ξ = (ξ_rs; r, s = 1, 2, 3) is the common transition matrix of the Markov chain processes ρ_k = (ρ_{1,k}, ρ_{2,k}, …, ρ_{p,k}), the sequences of states (1, 2, 3) of the p features, with ξ_rs the transition probability from state r to state s; and (iii) ζ = (μ_{jkl}, σ²_{jkl}, μ_{kl}, σ²_{kl}, σ²_{j3}; k = 1, …, K; l = 1, 2; j = 1, …, p) parameterizes the conditional distribution of y_{ij} given z and ρ.
The initial distribution of the process ρ_k = (ρ_{j,k}; j = 1, …, p) is assumed to be uniform, that is, p(ρ_{1,k} = l) = 1/3 for l = 1, 2, 3. We impose conjugate Dirichlet priors for both the cluster probabilities Ω and the transition probability matrix ξ = (ξ_rs).

Cluster probabilities:
We impose a Dirichlet prior distribution on Ω with parameters α_1, …, α_K, α_k > 0. Hence, the full conditional of Ω is also a Dirichlet distribution, with parameters α_k + n_k for k = 1, …, K, where n_k = #{i : z_i = k} is the number of samples currently assigned to cluster k.
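Base R has no Dirichlet sampler, but a draw from the Dirichlet full conditional can be obtained by normalizing independent gamma variates. A minimal sketch (function and variable names are our own):

```r
# Draw Omega ~ Dirichlet(alpha_1 + n_1, ..., alpha_K + n_K)
rdirichlet1 <- function(shape) {
  g <- rgamma(length(shape), shape = shape, rate = 1)
  g / sum(g)
}
update_omega <- function(z, alpha, K) {
  n_k <- tabulate(z, nbins = K)  # n_k = number of samples assigned to cluster k
  rdirichlet1(alpha + n_k)
}
```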
Error variances and means: With an InvGamma(α_0, β_0) prior, the full conditional of the error variance σ²_{j3} for irrelevant features is

$$\sigma^2_{j3} \mid \cdot \;\sim\; \text{InvGamma}\Big(\alpha_0 + \tfrac{N_{j3}}{2},\; \beta_0 + \tfrac{1}{2} \sum_{k:\, \rho_{j,k} = 3}\ \sum_{i:\, z_i = k} y_{ij}^2\Big),$$

where N_{j3} is the total number of samples falling in clusters k with ρ_{j,k} = 3. The full conditional of the error variance σ²_{jkl} for relevant features, l ∈ {1, 2}, is

$$\sigma^2_{jkl} \mid \cdot \;\sim\; \text{InvGamma}\Big(\alpha_0 + \tfrac{N_{kl}}{2},\; \beta_0 + \tfrac{1}{2} \sum_{i:\, z_i = k} \big(y_{ij} - \mu_{jkl}\big)^2\, \mathbb{1}(\rho_{j,k} = l)\Big),$$

where N_{kl} is the number of samples in cluster k if ρ_{j,k} = l, and 0 otherwise. The full conditional distribution of μ_{jkl}, l ∈ {1, 2}, is the normal distribution

$$\mu_{jkl} \mid \cdot \;\sim\; \text{Normal}\left(\frac{\sigma^2_{kl} \sum_{i:\, z_i = k} y_{ij} + \sigma^2_{jkl}\, \mu_{kl}}{N_{kl}\, \sigma^2_{kl} + \sigma^2_{jkl}},\; \frac{\sigma^2_{jkl}\, \sigma^2_{kl}}{N_{kl}\, \sigma^2_{kl} + \sigma^2_{jkl}}\right).$$

Latent chains ρ_k and latent cluster memberships z: For each cluster k, the full conditional probabilities of ρ can be written as

$$p\big(\rho_{j,k} = l \mid \cdot\big) \;\propto\; \xi_{\rho_{j-1,k},\, l}\; \xi_{l,\, \rho_{j+1,k}} \prod_{i:\, z_i = k} p\big(y_{ij} \mid \rho_{j,k} = l, \zeta\big)$$

for j = 2, 3, …, p and k = 1, 2, …, K, with the factor ξ_{l, ρ_{j+1,k}} omitted when j = p. The full conditional probability that observation i belongs to cluster k is

$$P\big(z_i = k \mid \cdot\big) \;\propto\; \omega_k \prod_{j=1}^{p} p\big(y_{ij} \mid \rho_{j,k}, \zeta\big). \qquad (11)$$
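As an illustration of the conjugate updates above, the sketch below draws μ_jkl from its normal full conditional for a single feature j and cluster k. It deliberately simplifies the bookkeeping by taking the relevant expression values and current parameter values as plain arguments (our own convention):

```r
# One Gibbs update of mu_jkl, given the expression values y_jk of feature j
# for the samples in cluster k (used when rho[j, k] = l, l in {1, 2}).
update_mu_jkl <- function(y_jk, mu_kl, sigma2_kl, sigma2_jkl) {
  n_k   <- length(y_jk)
  denom <- n_k * sigma2_kl + sigma2_jkl
  post_mean <- (sigma2_kl * sum(y_jk) + sigma2_jkl * mu_kl) / denom
  post_var  <- (sigma2_jkl * sigma2_kl) / denom
  rnorm(1, mean = post_mean, sd = sqrt(post_var))
}
```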

| Gibbs sampling scheme
The overview of our Gibbs sampling scheme is presented in Algorithm 1.After every iteration, the parameter values are updated and used in the next iteration.
The number of clusters K is fixed in the proposed work. To determine an optimal K, we used the deviance information criterion (DIC; Gelman et al., 2014), which measures both the goodness of fit of the model and the model complexity. It is defined as DIC = D̄(Θ) + p_DIC, where Θ = (θ, z, ρ) represents the set of unique parameters (including z and ρ), D̄(Θ) is the posterior mean of the deviance D(Θ), and p_DIC is the effective number of parameters, used as a measure of model complexity. From the MCMC samples, the DIC is estimated as

$$\widehat{\text{DIC}} = -\frac{4}{S} \sum_{s=1}^{S} \log p\big(y \mid \theta^{(s)}, z^{(s)}, \rho^{(s)}\big) + 2 \log p\big(y \mid \hat{\theta}, \hat{z}, \hat{\rho}\big),$$

where S is the number of MCMC samples, (θ^{(s)}, z^{(s)}, ρ^{(s)}) is the sth MCMC sample of all parameters, and (θ̂, ẑ, ρ̂) is the joint maximum a posteriori estimate (Geman & Geman, 1984) approximated from the MCMC output. Given a set of competing models, smaller DIC values indicate a better-fitting model. We use this criterion to select the number of clusters K.
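Given the per-iteration complete-data log-likelihoods saved during the MCMC run, the DIC estimate above reduces to a one-line computation. A sketch, assuming (our notation) that loglik_s stores log p(y | θ^(s), z^(s), ρ^(s)) for s = 1, …, S and loglik_map is the log-likelihood at the approximate joint MAP:

```r
# DIC_hat = -4 * mean_s log p(y | theta_s, z_s, rho_s) + 2 * log p(y | MAP)
estimate_dic <- function(loglik_s, loglik_map) {
  -4 * mean(loglik_s) + 2 * loglik_map
}
```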

| SIMULATED DATA STUDY
Setting 1: We tested our method on 12 simulated scenarios, divided into pairs for comparison. Within each pair, the input variance of the data-generating process is doubled to 2.0 in scenarios 2, 4, 6, 8, 10, and 12 relative to the respective paired case. The model's clustering capacity was tested by increasing (i) the number of clustering regimes, K ∈ {2, 4, 8}, and (ii) the number of subjects, n ∈ {100, 500}, for the data generated in each paired case. In this first setting, we assume that the first 100 features are overexpressed (i.e., ρ_{j,k} = 1) in each sample cluster and the next 100 features are underexpressed (i.e., ρ_{j,k} = 2). The remaining p − 200 = 800 features are irrelevant for every sample cluster. As the features belonging to the same states are close (i.e., ordered), we expect our main approach, which includes the HMM, to perform better. Expression values are generated as y_{ij} = μ_{kl} + ε_{ijk}, where μ_{k1} = (k + 1)/2, μ_{k2} = −(k + 1)/2, and μ_{k3} = 0 for every k = 1, …, K, and ε_{ijk} is normal with variance σ²_{jkl} ∈ {1, 2}. This allows different means between sample clusters. For cluster 1 (k = 1), for instance, the means are μ_11 = 1 and μ_12 = −1 for overexpressed and underexpressed features, respectively. Figure 2 shows some examples of simulated data for K ∈ {2, 4, 8}.
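The Setting 1 data-generating process translates into a few lines of R; this sketch follows the design stated above (p = 1000, the first 100 features overexpressed, the next 100 underexpressed, the rest irrelevant):

```r
simulate_setting1 <- function(n = 100, K = 2, p = 1000, sigma = 1) {
  z   <- sample(1:K, n, replace = TRUE)                # cluster labels
  rho <- c(rep(1, 100), rep(2, 100), rep(3, p - 200))  # ordered feature states
  mu_k <- function(k) c((k + 1) / 2, -(k + 1) / 2, 0)  # mu_k1, mu_k2, mu_k3
  y <- t(vapply(z, function(k) mu_k(k)[rho] + rnorm(p, sd = sigma),
                numeric(p)))
  list(y = y, z = z, rho = rho)                        # data plus ground truth
}
set.seed(1)
dat <- simulate_setting1(n = 500, K = 4, sigma = sqrt(2))  # e.g., a variance-2 scenario
```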
Setting 2: In this setting, the 200 relevant features are chosen randomly out of the p = 1000 features. Relevant features within sample clusters are thus not necessarily ordered as in Setting 1. Hence, we do not expect the order structure between features to affect clustering in this scenario; that is, we would not expect the hidden Markov structure, which encourages "closer" features to be clustered together in one of the three feature groups (underexpressed, overexpressed, and irrelevant), to improve the feature clustering performance within sample clusters. The remaining parameters for the simulated data are set as in Setting 1.

Hyperparameter settings:
In our Bayesian approach, we specify the prior distributions for each model parameter: the cluster probabilities Ω, the feature transition probabilities ξ, the mean parameters for each cluster (μ_kl and μ_jkl), and the cluster variance parameters (σ²_kl and σ²_jkl). We set the hyperparameters of those prior distributions as follows. A truncated normal distribution was used as a prior for the cluster mean parameter μ_kl, lower truncated at t = 0.2 for overexpressed conditions and upper truncated at −t for underexpressed conditions. Its hyperparameters were set to μ_0 = 0.0 and σ²_{μ_0} = 1000 for the mean and variance, respectively. The mean was kept constant at zero when estimating biclusters corresponding to irrelevant features (ρ_{j,k} = 3). For the cluster variance parameters (σ²_kl and σ²_jkl), the hyperparameters were set to α_0 = 1 and β_0 = 1 for the shape and scale of the distribution, respectively. We also fit the model with α_0 = 0.1 and β_0 = 0.1, a "more" noninformative prior for the variance parameters; the results are compared in Section 3.2. We choose the natural noninformative Dirichlet prior for the cluster and transition probabilities, that is, α_k = 1 for k = 1, 2, …, K and δ_l = 1 for l = 1, 2, 3.
To study the importance of incorporating the order structure and the biological meaning (over/underexpressed features) into the determination of biclusters, we fitted four different models derived from our algorithm: (i) HMMBi-C, which determines biclusters by assuming an order structure between features through a hidden Markov structure on the data y and imposing a constraint on the mean parameters μ_kl (positive, negative, or zero depending on the feature state) to obtain a better biological interpretation of the clusters; (ii) HMMBi-NoC, which assumes an order structure between features through a hidden Markov structure on y but does not impose a constraint on the mean parameters μ_kl; (iii) NoHMMBi-C, which does not assume an order structure between features but does impose a constraint on the mean parameters μ_kl; and (iv) NoHMMBi-NoC, which assumes neither an order structure between features nor a constraint on the mean parameters μ_kl. For the nonconstrained methods (HMMBi-NoC and NoHMMBi-NoC), the feature labels ρ_{j,k} ∈ {1, 2} are no longer identifiable within each cluster k, but we would expect the two feature clusters 1 and 2 obtained from these methods to correspond exclusively to either the underexpressed or the overexpressed feature cluster. We also compare these four model variants with four other relevant (bi)clustering methods, run roughly as in the sketch below. We use the R package biclust (Kaiser et al., 2018) to compare with two of the several algorithms it provides for finding biclusters in two-dimensional data, namely BC-Spectral, which is based on the spectral biclustering algorithm described in Kluger et al. (2003), and BC-Plaid, which performs plaid model biclustering as described in Turner et al. (2005). We also compare with the model-based biclustering technique FABIA (Hochreiter et al., 2010), which uses factor analysis to determine biclusters. Finally, we compare our methods with the penalized biclustering method (PenPlaid) described in Chekouo and Murua (2015).
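For reference, the competing methods can be run on an n × p expression matrix roughly as follows. This is a sketch: the tuning arguments shown are defaults or plausible choices, not necessarily the settings used in our experiments, and Y is assumed to come from, for example, the simulation sketch above.

```r
library(biclust)  # provides BC-Plaid and BC-Spectral
library(fabia)    # factor-analysis-based biclustering (FABIA)

Y <- as.matrix(dat$y)  # n x p expression matrix
K <- 4                 # number of sample clusters, hence 3 * K biclusters

res_plaid    <- biclust(Y, method = BCPlaid())
res_spectral <- biclust(Y, method = BCSpectral(), numberOfEigenvalues = 3)

# FABIA expects features in rows; p sets the number of biclusters extracted
res_fabia <- fabia(t(Y), p = 3 * K, cyc = 500)
bic_sets  <- extractBic(res_fabia)
```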

| Model comparison
The principal approaches used in clustering comparison can be described through the criteria they develop, of which there are two main families: pair counting (Pfitzner et al., 2008) and information-theoretic. To compare clustering results, we use a variation of the pair-counting measures, the so-called F_1 measure, which has been used extensively in the text mining literature and introduced to the biclustering literature. Consider two biclusters A and B; the F_1 measure (Chekouo & Murua, 2015) between A and B is defined as

$$F_1(A, B) = \frac{2\, r_{A \cap B}\, c_{A \cap B}}{n_A + n_B},$$

where r_{A∩B} denotes the number of common genes, c_{A∩B} the number of common conditions, and n_A = r_A c_A and n_B = r_B c_B the numbers of elements in biclusters A and B, respectively. The F_1 measure is based on the harmonic mean of precision and recall, where recall measures the proportion of elements in B that belong to A and precision measures the proportion of elements in A captured in B. When several biclusters are to be compared, we use an F_1-type average. Let M_1 = {A_1, A_2, …, A_K} be the set of estimated biclusters and M_2 = {B_1, B_2, …, B_L} the set of true biclusters. Then the similarity of the estimate M_1 to the true biclustering M_2 is measured by

$$S(M_1, M_2) = \frac{1}{L} \sum_{l=1}^{L} \max_{k}\, F_1(A_k, B_l).$$

Our F_1-type average is a symmetrized version of S(M_1, M_2), defined as

$$\bar{F}_1(M_1, M_2) = \frac{1}{2}\big[S(M_1, M_2) + S(M_2, M_1)\big],$$

and it is equal to 1.0 if M_1 = M_2.
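The F_1 computation translates directly into R. In the sketch below, a bicluster is represented as a list with gene and sample index vectors (our own convention):

```r
# F1 measure between two biclusters A and B, each a
# list(genes = <index vector>, samples = <index vector>).
f1_bicluster <- function(A, B) {
  r_ab <- length(intersect(A$genes,   B$genes))    # common genes
  c_ab <- length(intersect(A$samples, B$samples))  # common conditions
  nA   <- length(A$genes) * length(A$samples)
  nB   <- length(B$genes) * length(B$samples)
  2 * r_ab * c_ab / (nA + nB)
}

# Symmetrized F1-type average between two collections of biclusters
f1_average <- function(M1, M2) {
  S <- function(Ma, Mb)
    mean(vapply(Mb, function(B)
      max(vapply(Ma, f1_bicluster, numeric(1), B)), numeric(1)))
  (S(M1, M2) + S(M2, M1)) / 2
}
```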
The above-defined F_1 measure is employed to compare our model variants with other relevant biclustering methods.

| Results

Table 1 displays the means and standard deviations of the F_1 measure comparing the true collections of (bi)clusters to those estimated by each of the methods studied for Setting 1, where features are generated from an ordering structure. Our main proposed model is HMMBi-C; the other three models are obtained by adjusting it. For the small data sets (n = 100, scenarios 1-6), the performance of HMMBi-C is very similar to that of HMMBi-NoC, which does not assume a constraint on the cluster parameters. However, when n is large (scenarios 7-12), HMMBi-C performs slightly better than HMMBi-NoC; that is, incorporating the constraint into the HMM model improves performance. Moreover, for every scenario, HMMBi-C performs much better than the models that do not incorporate an ordering structure of features (NoHMMBi-C and NoHMMBi-NoC).
Adding a constraint to the model without the HMM structure (NoHMMBi-C) also yields better performance than not adding one (NoHMMBi-NoC), in particular in scenarios 3, 7, 8, and 10. In addition, as expected, each method's performance degrades as K or σ²_jkl increases. Overall, HMMBi-C clearly performs better than the three other methods, which shows that we obtain better results when we incorporate both the ordering structure (through the HMM) and the constraint on the cluster parameters for biological interpretation. Table 2 shows the results when α_0 = β_0 = 0.1; they are slightly worse than those with α_0 = β_0 = 1 (Table 1) for almost all scenarios. Even so, HMMBi-C still performs much better than the other methods in almost all scenarios when α_0 = β_0 = 0.1. Table 3 displays the results obtained from fitting competing biclustering algorithms to the simulated data. For a fair comparison with our four methods, we fixed the number of biclusters to 3 × K. First, all of those methods perform poorly compared with our four methods. Second, among the competing methods, the penalized method of Chekouo and Murua (2015) performs much better than the others. The plaid model of Turner et al. (2005) (BC-Plaid) was not able to identify any biclusters in most scenarios, and BC-Spectral did not identify any biclusters in any scenario (results not shown in the table).
Table 4 reports the results for Setting 2 of the simulated data, where the features are not ordered and are assigned randomly to the three groups (underexpressed, overexpressed, and irrelevant) as described previously. As expected, Table 4 shows that the methods with the ordering structure assumption perform similarly to the methods without it (HMMBi-C vs. NoHMMBi-C, or HMMBi-NoC vs. NoHMMBi-NoC). However, there is a slight improvement for the models with constraints over the models without them (HMMBi-C vs. HMMBi-NoC, or NoHMMBi-C vs. NoHMMBi-NoC).
Figure 3 plots the estimated DIC for K = 1 to 15 in the scenarios where the error variance is 1. As the plots show, when more and more clusters are added, the DIC values decline gradually until around the true number of clusters, after which they become relatively constant. We can then use a rule of thumb and select the K associated with a point in the flat part of the DIC curve near its elbow. We also observed that beyond the true number of clusters, some clustering results contain clusters with few or zero subjects; the corresponding numbers of clusters should not be selected.
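A simple numerical version of this rule of thumb (our own formalization, since in practice we apply the rule visually) is to pick the smallest K whose DIC lies within a small tolerance of the overall minimum:

```r
# dic: vector of DIC values for K = 1, 2, ...; tol: allowed slack in DIC units
select_K <- function(dic, tol = 2) {
  min(which(dic <= min(dic) + tol))  # first K reaching the flat part of the curve
}
```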

| APPLICATION TO TCGA KIDNEY mRNA EXPRESSION DATA
We applied our new methodology to 534 samples from the TCGA KIRC (kidney renal cell carcinoma) data, using the mRNA expression information collected on the Illumina HiSeq 2000 platform (~20,000 protein-coding genes). We removed genes (i) with more than 30% missing values and (ii) with minimal variance; the remaining missing values were imputed with the k-nearest-neighbor method implemented in the R package impute.
We used GO annotation to compute a semantic similarity score between each gene pair using the getGeneSim method provided with the Bioconductor package GOSim. We then reordered the data frame columns (genes) according to a sorted gene list computed from the GO similarity score matrix derived from the Biological Process (BP) ontology, as sketched below. We retained 1009 mRNAs expressed across 534 samples and applied our methodology to the log-transformed standardized data. We ran our MCMC algorithm with the same hyperparameter settings as in the simulation studies, using 10,000 iterations and discarding 5000 values as burn-in for each run.
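A sketch of this preprocessing pipeline is given below. The gene identifiers, the variance threshold, and the use of hierarchical-clustering leaf order to turn the similarity matrix into a gene ordering are our own illustrative assumptions (and GOSemSim::mgeneSim is used here; the paper's GOSim::getGeneSim plays the analogous role):

```r
library(impute)    # k-nearest-neighbor imputation
library(GOSemSim)  # GO-based semantic similarity

expr <- as.matrix(raw_expr)                  # hypothetical genes x samples matrix
expr <- expr[rowMeans(is.na(expr)) <= 0.30, ]         # (i) drop genes with >30% NA
expr <- impute.knn(expr)$data                         # impute remaining NAs
expr <- expr[apply(expr, 1, var) > 1e-6, ]            # (ii) drop minimal-variance genes

hsGO <- godata("org.Hs.eg.db", keytype = "SYMBOL", ont = "BP")
sim  <- mgeneSim(rownames(expr), semData = hsGO,
                 measure = "Wang", combine = "BMA")
ord  <- hclust(as.dist(1 - sim))$order                # one way to sort by GO similarity
expr <- expr[rownames(sim)[ord], ]                    # genes reordered for the HMM
```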

| Results
We applied our four methods to the RNA-seq data described above with the same hyperparameter settings as in the simulation study. Figure 4 shows images of the data sorted with respect to the sample clusters of the four methods; the horizontal patterns in the images correspond to clusters. The DIC criterion detected 13, 12, 10, and 9 clusters for HMMBi-C, HMMBi-NoC, NoHMMBi-C, and NoHMMBi-NoC, respectively. We discarded clustering results with cluster sizes less than 3. Figure 5 shows images of six clusters (out of 13) obtained using the HMMBi-C method. For each cluster image, we sorted the genes into the overexpressed, underexpressed, and irrelevant groups as classified by our method; the images show a clear distinction between the three groups of features. Table 5 shows the distribution of the three groups of features for each cluster. Overall, the number of underexpressed genes is usually smaller than the numbers of overexpressed and irrelevant genes. Genes can be over(under)expressed in multiple clusters; for instance, 157 and 216 genes are, respectively, over- and underexpressed in both clusters 1 and 2. In addition, the table shows the cluster sample means estimated over the under- and overexpressed genes for each cluster. In particular, cluster 1 has the smallest mean for underexpressed genes (μ_12 = −1.13), and cluster 6 has the largest "overexpressed" mean (μ_61 = 1.09).
To determine whether our clustering schemes are associated with survival outcomes (70% of the observations are censored), we fitted multivariate Cox regressions with the clusters obtained from the HMMBi-C algorithm as covariates, adjusting for clinical covariates such as age, sex, and cancer stage. Clusters 2 and 11 show positive and negative associations with the hazard, with respective hazard ratios of 2.98 and 0.48. Both regressions have good predictive performance, with concordance indices of about 0.72.
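This survival analysis can be reproduced with the survival package. A sketch, assuming a clinical data frame clin with hypothetical columns time, status, age, sex, and stage, and z_hat holding the HMMBi-C cluster labels:

```r
library(survival)

clin$cluster <- factor(z_hat)  # cluster memberships from HMMBi-C
fit <- coxph(Surv(time, status) ~ cluster + age + sex + stage, data = clin)
summary(fit)                   # hazard ratios and concordance index
```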
GO enrichment analysis was performed on the over- and underexpressed genes from cluster 1 of the HMMBi-C method. The top eight enriched GO terms are shown in Figure 6, with their respective counts in cluster 1. Among them, the GO terms xenobiotic metabolic process and organic anion transport (OAT) were dominant among the underexpressed genes (see Figure 7), whereas the GO term organic acid transport seems to be equally distributed between the two groups (over- and underexpressed) of genes. As an example, OATs are active transport proteins that regulate anion balance in the body and are members of the SLC (solute carrier) superfamily. They are primarily expressed in the kidney and liver and have recently been implicated in renal cell carcinoma (Whisenant & Nigam, 2022).

| CONCLUSION
In this work, we proposed an innovative, interpretable mixture model for biclustering expression data. The method brings feature selection and clustering together under a single "wrapper" method (Dy & Brodley, 2004; Law et al., 2004). This approach is beneficial from two standpoints: first, feature selection and sample clustering are integrated into the model-fitting process in a unified procedure; second, feature selection is performed within each cluster of samples, and features are classified automatically into three biological groups: underexpressed, overexpressed, and irrelevant. Features are classified using an HMM, with the initial ordering among the features imposed using prior biological knowledge (e.g., GO). Hence, for each cluster of samples, "close" features are encouraged to belong to the same group.
We assumed here a mixture of normal distributions to model the clusters, with the number of clusters K fixed. For future work, we can explore natural extensions to other distributions/assumptions (e.g., t-distributions, infinite mixtures of factor analyzers, nonparametric approaches) that would capture more complex structures with an unknown number of clusters.
Another essential step will be integrating other omics data types, such as DNA methylation or microRNA data, with the available RNA-Seq data. Fortunately, the TCGA consortium (our data source) provides samples matched across those omics data types for various cancers. Hence, extending our methods to integrate multiomics data types is possible, but we would need to understand the associations between those data. For instance, the association between methylation and gene expression is usually unknown and difficult to determine (Ma et al., 2017).


FIGURE 1.
An example of a biclustering structure captured by our approach. In this example, we have K = 2 clusters and K × 3 = 6 biclusters.

FIGURE 2.
Images of some simulated data with respect to the number of clusters K ∈ {2, 4, 8} when n = 500 and σ_jkl.

FIGURE 4.
Images of the observed data sorted with respect to each cluster for each of our four methods.

FIGURE 5.
Images of clusters with features (genes) sorted with respect to the overexpressed, underexpressed, and irrelevant groups of features. The six images represent the first six clusters identified using HMMBi-C.

FIGURE 6.
Top eight enriched Gene Ontology (GO) terms from the GO enrichment analysis of the set of over- and underexpressed genes in cluster 1 obtained from HMMBi-C.

FIGURE 7.
Gene Ontology (GO) enrichment analysis of the set of over- and underexpressed genes in cluster 1 obtained from HMMBi-C. Top four GO terms: function1, anion transmembrane transport; function2, organic anion transport; function3, sodium ion transport; function4, xenobiotic metabolic process.

ALGORITHM 1
Overview of the MCMC Gibbs sampling algorithm.
Input: y_1, …, y_n; N, the number of MCMC iterations.
Output: MCMC samples of all parameters.
1: Initialization: the latent clustering vector z and the latent chains ρ_k are initialized using discrete uniform priors over the K clusters and the three feature groups, respectively; that is, z_1, …, z_n ~ i.i.d. Multinomial_K(1, 1/K) and, for every j and k, ρ_{j,k} ~ i.i.d. Multinomial_3(1, 1/3).
2: for t = 1 to N do
3: Simulate the probability vector Ω from its full conditional distribution (a Dirichlet distribution), conditional on the previous z; in the first iteration, the randomly initialized value of z is used as input.
4: Simulate new values of μ_jkl and μ_kl, k = 1, …, K, l = 1, 2, j = 1, …, p, from their full conditional distributions (see Equations 8 and 4).
5: Simulate the variance parameters from their full conditional (inverse-gamma) distributions.
6: Simulate the rows of ξ independently, the rth row from its full conditional Dirichlet distribution.
7: Generate the new classification vector elements z_i, i = 1, 2, …, n, in two steps: (i) determine the posterior probability that observation y_i belongs to each cluster k from Equation (11); (ii) simulate z_i from the multinomial distribution determined in step (i).
8: In the last step, simulate the latent chains ρ_{j,k}, similarly to Step 7, from their full conditional (multinomial) distributions (see Equations 9 and 10).
9: end for
10: return MCMC samples of all parameters.

TABLE 1
Setting 1: Simulation results for all scenarios (1-12) across 20 replicates. Values reported are the averaged F_1 measures across the 20 replicates, with standard errors in parentheses. The highest values are in bold.

TABLE 4
Setting 2: Simulation results for all scenarios (1-12) across 20 replicates. Values reported are the averaged F_1 measures across the 20 replicates, with standard errors in parentheses. The highest values are in bold.