Unsupervised learning for medical data: A review of probabilistic factorization methods

We review popular unsupervised learning methods for the analysis of high-dimensional data encountered in, for example, genomics, medical imaging, cohort studies, and biobanks. We show that four commonly used methods, principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be written as probabilistic models underpinned by a low-rank matrix factorization. In addition to highlighting their similarities, this formulation clarifies the various assumptions and restrictions of each approach, which helps applied medical researchers identify the appropriate method for a specific application. We also touch upon the most important aspects of inference and model selection for the application of these methods to health data.

Where limited training data are available (relative to the data dimension, as is typical for healthcare data), unsupervised learning methods are superior.4 These methods also have technical advantages: because they model the full distribution of the data, they can in principle identify outliers and handle missing data.
When data exhibit a latent structure, it is possible to map these onto a smaller latent (or hidden) space while preserving most of the relevant information. Probably the best-known unsupervised learning method is principal component analysis (PCA), which uses a linear projection to map the data onto a lower-dimensional latent space, in such a way that the variance in the projected space is maximized. Since its introduction in 1901 by Karl Pearson,5 it has made a huge impact across many disciplines. Spearman (of correlation fame) used it to develop his theory of intelligence, eventually leading to today's IQ tests;6,7 Cavalli-Sforza was the first to show that PCA is very effective at identifying population structure in high-dimensional genetic data;8 and PCA is used extensively to construct economic performance indicators and indices for socioeconomic status.9,10 Another form of latent structure occurs when data are drawn from a discrete set of distinct but unknown clusters. A widely used method for identifying such latent cluster structure was published in 1956 by the Polish mathematician Hugo Steinhaus.11 Now known as K-means,12 it infers cluster assignments that minimize the total squared distance of data points to the center of their associated clusters. This method too has made a huge impact; restricting to the medical domain, K-means has been used to identify diabetes and cancer subtypes,13,14 classify EEG signals,15,16 and cluster gene expression data,17 among many other examples.
Despite the clear and continued success of both PCA and K-means, the specific choices that these methods make (linear projections and variance as performance measure for PCA; Euclidean distance for K-means) do not fit all applications, and much effort has gone into modifying and extending these methods.[26][27][28] Other methods combine several such features; for example, nonnegative matrix factorization (NMF) models nonnegative data, and induces sparsity by enforcing nonnegative linear superpositions, resulting in representations that are easier to interpret.29 Another important class of methods that focus on identifying latent structure in categorical and count data, such as genetic data30 and text documents,24 are topic models. Introduced in the 2000s, these models, the best known of which is called latent Dirichlet allocation (LDA),24 have found applications beyond text analysis, for instance, analyzing health care,31,32 genetic,30,33 and longitudinal data.34 They use Bayesian methods to infer parameters; indeed, "Dirichlet" refers to a particular prior distribution used for both components and loadings, which results in sparse solutions and tends to produce interpretable outcomes.
Clearly, a large number of methods are available to identify latent structure in a large and complex data set. This presents a challenge to an applied statistician: How to compare and contrast the various methods to identify the right method for a particular application? A helpful insight is that all of the methods mentioned so far share a fundamental component, namely low-rank matrix factorization. NMF wears this on its sleeve, but it is also true for PCA, LDA, and K-means. Highlighting this fundamental similarity helps to bring out the essential differences between these methods, which is key to making the right choices in an application. To this end, we here present all methods in a common notation, and emphasize for each method the restrictions (or prior distributions) on the parameters, the optimization target (or likelihood function), and the approach used for parameter inference.
While several review papers have focused on subsets of these methods,[35][36][37][38][39][40] and the framework of low-rank matrix factorization was noted before in the context of genetics,41 we believe that ours is the first to review general-purpose unsupervised learning methods using matrix factorization as a unifying theme. To keep our discussion focused, we limit ourselves to the application of linear unsupervised learning for unstructured medical data, that is, tabular data. Other important methods, such as convolutional neural networks, have been very successful at both prediction and learning from structured data, including image,42 audio,43 and DNA data.44 In addition, multi-layer ("deep") networks have been hugely successful, also for unsupervised learning tasks;45,46 notable examples include variational auto-encoders,47 deep Gaussian latent models,48 Gamma belief networks,49 and many more. Finally, we mention t-SNE and UMAP,28,50 two dimension reduction methods that are widely used mainly for qualitative visualisation purposes. All of these methods aim to identify complex mappings that are fundamentally nonlinear, and fall outside the scope of this review.
The article is structured as follows. We begin by reviewing each of the four methods in Section 2. We then discuss approaches to parameter inference and model selection in Section 3, and finally provide context for these methods within more recent developments in unsupervised learning in Section 4.

METHODS FOR UNSUPERVISED LEARNING
Unsupervised learning can be viewed as a way to summarise a high-dimensional data set by identifying a common structure among data points, and retaining this structure while discarding the residuals, which are interpreted as noise. Another helpful view is to regard unsupervised learning as performing dimension reduction, which is a form of data compression onto the retained dimensions, with the remaining dimensions again discarded as noise. Unsupervised learning methods differ in the assumed form of both the structure and the noise. However, despite their superficial differences, a common element in many methods is a matrix factorization step, which approximates high-dimensional data in a matrix X by a lower-rank matrix Y written as a product of matrices W and V; symbolically,

$$X_{N \times P} \approx Y_{N \times P} = W_{N \times K} V_{K \times P}.$$

Here, we indicated the dimensions of the matrices with subscripts: N for the number of data points, P for the number of features, and K for the number of latent dimensions, which is equal to the rank of the approximation. The matrix Y is related to the data matrix X of the same dimension by an observation (or "error") model; V holds a representation of the data's underlying structure and consists of K exemplars (or principal components, factors, or topics), while W holds the per-observation weights (or latent variables) in a space of reduced dimension K ≪ min(N, P), and provides a reduced representation of each observation. Data compression is achieved because the number of model parameters, NK + KP, is generally much smaller than the original data dimension, NP. Priors may induce sparsity, which further reduces the (effective) number of parameters. This means that some information is lost, and this lost information is captured by an error term. Specifying the distribution of this error term (and optionally, of the latent variables) turns the low-rank factorization into a probabilistic model:

$$X_n \sim F(Y_n),$$

where F is some distribution parameterized by $Y_n$; a common example would be a Gaussian distribution, $X_n \sim \mathcal{N}(Y_n, \sigma^2 I_P)$. This turns out to be a flexible and rich framework, and different choices of the error distribution F, and the form and distribution of the weights W and exemplars V, lead to a range of important unsupervised learning models. Below we show how PCA, K-means clustering, non-negative matrix factorization, and latent Dirichlet allocation all fit into this framework, as well as several related models.
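To make this notation concrete, the following minimal sketch (in Python with numpy; the simulated data, dimensions, and variable names are our own illustrative choices) builds a rank-K approximation X ≈ WV via a truncated singular value decomposition and compares the number of model parameters to the number of data entries:

```python
# A sketch of the generic low-rank factorization X ~ Y = W V; the
# simulated data and all constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 500, 40, 3                       # data points, features, latent dims

# Simulate data with rank-K structure plus isotropic Gaussian noise.
W_true = rng.normal(size=(N, K))
V_true = rng.normal(size=(K, P))
X = W_true @ V_true + 0.1 * rng.normal(size=(N, P))

# A truncated SVD gives the best rank-K approximation in least squares.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :K] * s[:K]                       # N x K weights (latent variables)
V = Vt[:K]                                 # K x P exemplars (components)
Y = W @ V                                  # rank-K reconstruction

print("model parameters:", N * K + K * P, "data entries:", N * P)
print("relative error:", np.linalg.norm(X - Y) / np.linalg.norm(X))
```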

Principal component analysis
PCA has been widely used for dimension reduction since its introduction in 1901.5,51 One characterization of PCA is as a low-dimensional projection that maximizes the variance of the data in the projected space. If each observed variable contains independent and identically distributed Gaussian noise (so that the noise is rotation-invariant, or isotropic), the same is true in the projected space, so that by maximizing the variance in the projected space the relative contribution of the noise is minimized. This justifies the interpretation of the lower-dimensional subspace as representing the principal components: they contain most of the signal and comparatively little noise. Tipping and Bishop37 showed that PCA can equivalently be characterised as the maximum likelihood solution of a probabilistic model for the data X:

$$X_n = W_n V + \epsilon_n, \qquad W_n \sim \mathcal{N}(0, I_K), \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2 I_P).$$

Here, we use vector notation and denote, for example, by $X_n$ the nth vector of observations of the matrix X; $I_K$ and $I_P$ are identity matrices of dimension K and P. This is close to a model known as factor analysis,53 which instead of assuming isotropic noise (the same in all directions) uses a more flexible error model.37 The formulation as a probabilistic model clearly shows the separation between the low-rank structure Y = WV and the "measurement model" X = Y + ε. This representation also brings out several other features of PCA, for instance, that it models real-valued data with Gaussian noise; therefore, we may expect it to be strongly influenced by outliers. The assumption of isotropic noise also shows that PCA is not invariant under re-scaling of the data. If the features live on very different scales, then those with the largest scale will also tend to have the largest variance, and will have a strong influence on the result. One way to address this is to standardize the data to unit variance. In general, preprocessing of the data before using PCA is important, as both non-standardized data and outliers may bias results.54 The prior on W is also Gaussian, while V and σ² have no prior and are optimized by maximum likelihood. Neither of the priors on V and W induces sparsity, so for high-dimensional data we can expect that estimates of both the principal components and latent variables are noisy.
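As an illustration of the points above (scale sensitivity, hence standardization), here is a minimal sketch using scikit-learn's PCA; the simulated data and parameter values are our own assumptions, not taken from the references:

```python
# A minimal PCA sketch: standardize, project, inspect retained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10)) * np.arange(1, 11)   # features on very different scales

X_std = StandardScaler().fit_transform(X)           # unit variance per feature
pca = PCA(n_components=2)
W = pca.fit_transform(X_std)                        # N x K latent variables (scores)
V = pca.components_                                 # K x P principal components
print(pca.explained_variance_ratio_)                # variance retained per component
```

Without the standardization step, the largest-scale features would dominate the leading components.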
While PCA is best suited for real data with isotropic Gaussian errors, it has been applied successfully in situations where these assumptions are not strictly met. A notable application in the life sciences is the modeling of population stratification in genetic association studies.55 Due to the discreteness of genetic markers, the noise term is far from normally distributed. Nevertheless, theoretical arguments show that PCA is appropriate for idealized models of population structure,56 and projecting the genetic variation within Europe onto its first two principal components recapitulates the geography of Europe to a remarkable degree52 (Figure 1).

FIGURE 1 Genetic data for individuals from different populations in Europe projected onto the first two principal components of their genetic data (categorical observations). Figure from Bartholomew et al.52

The explanation may lie in the fact that genetic markers are both numerous (genomes are large) and approximately identically distributed, so the central limit theorem applies and predicts that the latent variables W have a Gaussian distribution. In general, when the observed data contain a large number of samples (N ≫ P), the prior and distributional assumptions are less influential.25

A strength of PCA is that inference can be done analytically using eigenvalue decomposition, and standard implementations are very fast and suitable for large data sets. A drawback of analytic solutions is that they are unable to handle missing data,57 although specialized methods have been developed.58,59 Other extensions of PCA, which come at the cost of losing analytic tractability, include probabilistic PCA,37 and Johnstone and Lu60 present a version that uses a sparse prior. Bayesian extensions of PCA are more suitable for sparse data, and are also better able to handle missing observations.21,22,61

Several other methods have close connections to PCA. Multiple correspondence analysis is essentially PCA on transformed binary data,62 and canonical correlation analysis identifies relationships between two sets of variables by using PCA on a cross-correlation matrix.63 It is also worth mentioning independent component analysis (ICA), which is superficially similar to PCA but different in its aims. While PCA aims for data compression and dimension reduction, and results in orthogonal principal components (associated to the full data set) spanning the latent space, ICA aims for deconvolution, does not reduce the dimension of the data, and results in output vectors (each associated to a single data point) whose coordinates are independent.36,64,65

Nonnegative matrix factorization
Sometimes restrictions are imposed on the latent structure; for instance, factor weights are sometimes interpreted as probabilities, which can never be negative. Nonnegative matrix factorization (NMF) is a special case of factor analysis where all matrices (observations, factors, and weights) are restricted to be either positive or zero.29 The restriction that both the factors and the weights remain non-negative results in many entries in fact becoming zero; in other words, the method induces sparsity. In general sparse factors are easier to interpret, which is best demonstrated when applying NMF to image data. For example, presented with facial images, NMF identifies noses, ears and eyes,29 and the labelled image of a brain tumor in Figure 2 shows that the tumor is recognized and the image of the brain is factorized into a basis consisting of different areas of the tumor.[66][69][70] The model for NMF is

$$X_n = W_n V + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2 I_P), \qquad W \geq 0, \quad V \geq 0,$$

where W contains the weights, V contains the basis (or factors), and the inequalities are interpreted element-wise. Aside from non-negativity, improper uniform priors are assumed on V and W. Although NMF is restricted to non-negative data, strictly speaking the error model above allows negative observations too. Without the nonnegativity restriction the model would be close to that of PCA, and as in PCA, in general many entries would become negative. Instead, in NMF these entries are effectively "clamped" to 0, resulting in the desired sparsity.29,35 The nonnegativity constraints preclude an analytic solution, so numerical algorithms that minimize the reconstruction error are needed to obtain the factorization. This has the advantage that the error function can be changed without making the algorithm slower; common choices are the mean-square error (MSE), corresponding to normally distributed errors as above, and the Kullback-Leibler (KL) and Itakura-Saito (IS) divergences. The latter are preferred over MSE if the data have features with non-constant variance: the IS divergence is independent of scale,71 and the KL divergence is well suited for count data,72 in which case NMF becomes equivalent to probabilistic latent semantic analysis (pLSA),73,74 which was developed to model word counts. NMF can naturally handle missing values72 and can be used for imputation. NMF has been extended in various ways, such as Bayesian NMF to better account for uncertainty,75,76 alternative objective functions,35,77 binary NMF,68 and semi- and convex NMF.78 Additionally, a version of NMF dealing with ordinal data has been developed.79
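The following minimal sketch fits NMF with the Kullback-Leibler divergence (well suited to counts, as noted above) using scikit-learn; the simulated counts and all parameter choices are our own illustrative assumptions:

```python
# A minimal NMF sketch on nonnegative count-like data, using the KL
# divergence with multiplicative updates (the 'mu' solver).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
X = rng.poisson(lam=2.0, size=(100, 30)).astype(float)  # nonnegative counts

nmf = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
          max_iter=500, random_state=0)
W = nmf.fit_transform(X)        # N x K nonnegative weights
V = nmf.components_             # K x P nonnegative basis (factors)
print("reconstruction error:", nmf.reconstruction_err_)
```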

K-Means clustering
As mentioned in Section 1, K-means aims to identify an underlying grouping of similar data within a set of observations.80 Figure 3 shows an example of an application of K-means, in which patients are clustered into groups based on their symptoms. These symptoms (positive and negative) are converted to z-scores, and these real-valued and standardized values are then used to find clusters.

FIGURE 3 K-means clustering of z-scores of symptoms experienced by patients with first-episode psychosis. Image from MacKay.81

How does K-means achieve this? K-means optimizes an objective function to learn the best division of observed data into K groups or clusters. Since K-means optimizes an objective function, it can be presented as a maximum likelihood solution to a probabilistic model; this model turns out to be

$$X_n = W_n V + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, I_P), \qquad W_n \in B_K.$$

Here, $W_n$ is a binary vector that encodes which cluster each observation belongs to, and $B_K$ is the set of binary vectors of length K that have a single "1" entry; this is often referred to as one-hot encoding. Inference of the model parameters $V_k$ and $W_n$ is done using maximum likelihood (ML). By writing the likelihood function in full and taking the logarithm, we can see that optimizing the likelihood for $W_n$ and $V_k$ is indeed equivalent to minimizing the Euclidean distance between observations $X_n$ and the nearest cluster center $V_k$.38,80,82,83 The standard algorithm alternatingly assigns observations to the nearest cluster using the current centroids, and updates the centroids based on current membership, until convergence. The resulting algorithm is referred to as hard clustering.82 The equivalence between maximum likelihood and minimizing distances to centroids is directly related to the assumption that the error $\epsilon_n$ has a standard normal distribution, that is, that it is isotropic and that each component has variance 1. Assuming any other variance gives the same solution (it is equivalent to rescaling lengths), but assuming nonisotropic errors gives different variances for each dimension and does change the solution. Standardization of the data is therefore often advised for K-means.
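A minimal K-means sketch (scikit-learn) with the standardization step advised above follows; the simulated clusters are our own illustrative assumption:

```python
# A minimal K-means sketch: standardize, cluster, inspect centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (-2.0, 0.0, 2.0)])

X_std = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_std)   # cluster assignments (the one-hot W, as integers)
V = km.cluster_centers_          # K x P cluster centers (exemplars)
print("within-cluster sum of squares:", km.inertia_)
```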
K-means can also be relaxed into a probabilistic ("soft") clustering method, in which each observation is assigned a probability of membership in each cluster.83,84 In fact, soft K-means clustering is an example of a broader class of probabilistic models known as Gaussian mixture models (GMMs); K-means restricts the GMM such that the Gaussian distributions describing each group have their own mean but have identical (and isotropic) covariance matrices.85 The assumption of Gaussian errors limits K-means to applications for real-valued data. However, relatively small changes to the method have extended its application to discrete83,86 and nonnegative data.35,78 Being an ML algorithm, K-means cannot easily deal with missing values. Bayesian extensions of K-means are able to deal with missing values by treating them as latent variables and sampling from the posterior to impute or integrate them out.84,87,88
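A sketch of the corresponding "soft" clustering with a Gaussian mixture model is shown below (scikit-learn); note that covariance_type="spherical" gives each cluster an isotropic covariance but, unlike the K-means restriction, does not force the variances to be identical across clusters. The simulated data are our own assumption.

```python
# Soft clustering with a GMM: each observation receives a probability
# of membership in each cluster, rather than a hard assignment.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (-2.0, 0.0, 2.0)])

gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      random_state=0).fit(X)
resp = gmm.predict_proba(X)   # N x K soft memberships; each row sums to 1
```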

Latent Dirichlet allocation
Although the three previous methods have also been successfully applied in discrete settings, all three were designed for (possibly nonnegative) real-valued data. In addition, they were not specifically designed to handle sparse data (although NMF does implicitly promote sparsity, and sparse PCA methods have been developed). In 2003, Blei et al.24 introduced a model termed latent Dirichlet allocation (LDA) to identify factors in text data, which is both essentially discrete and sparse. LDA, a generalisation of probabilistic latent semantic analysis (pLSA),89 is essentially identical to a model developed in 2000 by Pritchard et al.30 for the analysis of genetic data, which is similarly sparse and discrete. Both models have been hugely successful and have given rise to numerous extensions and applications. LDA is a model for categorical or count data; more precisely, for data that is (conditionally) multinomially distributed. For text applications, LDA models text as a "bag of words",24 meaning that word order is ignored, and only word counts (over some fixed vocabulary) are considered. LDA describes a text document as a mixture of factors, termed "topics" in this context, which are modelled as probability distributions over words in the vocabulary. This allows the same word to have multiple meanings when used in different contexts;24,90 these words are disambiguated in combination with the use (or lack of use) of other words. Sparsity is modelled by the use of Dirichlet priors over both words and topics, so that a topic tends to contain few words (and fewer still at high probability), and documents tend to contain few topics; this improves the interpretability of the inferred topics and reduces overfitting.91,92 The model has an elegant mathematical structure: the Dirichlet is conjugate to both the categorical and multinomial distributions; this and the exchangeability of the observations (a result of the bag-of-words assumption) mean that efficient and easy-to-implement inference algorithms are available.24 For these reasons, we could say that the Dirichlet priors are key to the success of LDA.
The model is often written as a generative model as follows:

$$\phi_k \sim \mathrm{Dir}(\beta), \quad k = 1, \ldots, K \quad \text{(topics)};$$
$$\theta_n \sim \mathrm{Dir}(\alpha), \quad n = 1, \ldots, N \quad \text{(mixture coefficients)};$$
$$z_{in} \sim \mathrm{Cat}(\theta_n), \qquad x_{in} \sim \mathrm{Cat}(\phi_{z_{in}}).$$

The observations are encoded by $x_{in}$, representing the ith word in the nth document. The generative model highlights the categorical distribution for the observations x and topic choices z, and the Dirichlet priors on the latent variables φ for topics and θ for mixture coefficients. If α and β are small, topics tend to get assigned few words, and documents are a mixture of few topics.24,93,94 As LDA is a Bayesian model, inference procedures draw samples from the posterior distribution, which reduces the impact of over-training compared to maximum likelihood approaches such as pLSA. In the formulation above, the relation to the previous three models is not immediately obvious. To write LDA as a factorization method and highlight the similarity to the other models, we integrate out z and represent the data by a matrix of word counts, $X_{np} = \#\{i : x_{in} = p\}$. In this representation, $X_n$ is a vector of counts that follows a multinomial distribution, $X_n \sim \mathrm{Mult}(d_n, \theta_n \Phi)$, where $d_n$ is the number of words in document n and Φ is the matrix whose kth row is $\phi_k$. Renaming $W_n = \theta_n$ and $V_k = \phi_k$, the model takes on the familiar form:

$$X_n \sim \mathrm{Mult}(d_n, W_n V).$$

LDA has been used in numerous applications, many focusing on text modeling.32,96,97 Figure 4 shows an example of topics (indicated by colors) that were discovered in an analysis of medical documents. Models based on LDA have been used in many other fields, for instance to deconvolve sparse and noisy data from single-cell epigenetic assays98 and to identify mutational signatures in cancer.99 When LDA is used to find only two topics, it performs (soft) biclustering for categorical data, a task often used in the life sciences to assign observed variables to two outcome labels.100 Various extensions of LDA have been introduced, for example, to use different priors,36,101 to account for topic correlations,102,103 and to relax the bag-of-words assumption by modeling aspects of word order.104
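A minimal LDA sketch using scikit-learn on a document-term count matrix follows; the doc_topic_prior and topic_word_prior arguments play the role of α and β above. The stand-in counts and all parameter values are our own illustrative assumptions (real text would give a much sparser matrix):

```python
# A minimal LDA sketch on a (documents x vocabulary) count matrix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(5)
X = rng.poisson(lam=1.0, size=(100, 200))    # stand-in word counts

lda = LatentDirichletAllocation(n_components=10, doc_topic_prior=0.1,
                                topic_word_prior=0.01, random_state=0)
W = lda.fit_transform(X)    # N x K document-topic proportions (theta)
V = lda.components_         # K x P topic-word weights (unnormalized phi)
```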

PARAMETER INFERENCE AND MODEL SELECTION
So far we have mainly focused on the statistical model that underlies the various unsupervised learning methods. However, this is just one of the ingredients of a method. The other main ingredient, and one that strongly influences the behavior of a method, is an approach for learning the parameters of the model. Broadly, two approaches can be distinguished. One can try to find point estimates for unknown quantities, or one can try to characterize their distribution.82 Maximum likelihood (ML) inference is the most prominent example of the first approach. As the name suggests, it finds parameters by maximizing a function termed the likelihood: the probability of the observed data under the model, viewed as a function of its parameters. Usually, observations are independent of each other given the parameters, so that the likelihood of the full data set is the product of the likelihoods of the individual observations. The resulting function can then be optimized numerically using (stochastic) gradient ascent, or by semi-analytical approaches such as expectation-maximization (EM). Maximum likelihood is a widely used method for statistical model fitting, and provides a single set of "best" parameters to use for the model. While ML methods are widely available, intuitive, and easy to implement and use, they have some drawbacks. In particular, they tend to overfit the data, give biased estimates, and ignore uncertainty in the estimates.82 These drawbacks are largely addressed by Bayesian inference methods. Bayesian methods aim to characterize the posterior distribution of a model's latent variables and parameters, rather than provide point estimates. As a result, models are less prone to overfitting and bias. This is because the method does not maximize the probability of the observed data; rather, it explores all possible combinations of unknown variables weighted by their (posterior) probabilities given the observations. However, compared to ML methods the algorithms are often more technically involved and slower.
In addition to latent variables and parameters that are inferred from data, models typically also depend on fixed parameters, often called "hyperparameters." In the models we discussed, this always includes the number of latent dimensions K; in LDA we also have, for example, the prior parameters α and β.97,105 The performance of a model depends on these hyperparameters; in fact, models with different choices for hyperparameters can be regarded as different models. An important task is model selection or hyperparameter optimization: selecting the optimal value of the hyperparameters such as K.106 Often this is done using heuristic methods; for example, for PCA the retained variance per dimension can be visualized in "elbow plots" (also called "scree plots"), and the optimal K is visually determined to coincide with the elbow. Similar plots can be examined for K-means and NMF.94 An alternative is to define an objective performance measure, and determine a good value for K by increasing K until the performance measure no longer improves. However, in practice this is often not feasible, since the entire model needs to be recomputed for each value of K. More efficient approaches have been developed; for example, x-means clustering extends K-means to find the best value for K while optimizing the cluster assignments,105 or one can use a grid search algorithm to find the optimal number of clusters for the data.94 An obvious performance measure is the model's likelihood of the data. Optimizing the likelihood has the virtue of simplicity, but shares the drawbacks of ML methods.91 Also, structural hyperparameters, such as the number of latent dimensions, change the number of free parameters, and when used in combination with ML methods this increases the possibility of overfitting.107 One approach is to add priors for the (hyper)parameters and look for maximum a posteriori (MAP) rather than ML parameters.24 This reduces but does not eliminate overfitting, and does not account for the uncertainty in hyperparameter estimates. Using penalized likelihoods, such as the Akaike information criterion (AIC), is another possibility.
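A minimal sketch of the elbow-plot heuristic for PCA follows; the simulated rank-3 signal is our own assumption, chosen so that the elbow appears at K = 3:

```python
# Scree/elbow plot: plot retained variance per component and look for
# the point where the curve flattens.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20)) \
    + 0.2 * rng.normal(size=(300, 20))       # rank-3 signal plus noise

pca = PCA().fit(X)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("component")
plt.ylabel("explained variance ratio")
plt.show()                                   # the elbow should appear at K = 3
```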
A way to at least identify overfitting when it occurs is cross-validation: fitting a model on part of the data ("training data") and validating it on the remaining part ("test data"). Some care is needed to apply this in an unsupervised learning setting, as models for unsupervised learning do not predict an outcome. One approach is to mark part of the test data as "missing"; a better-fitting model can be recognized because it will assign a higher likelihood to the left-out (test) values. For instance, for NMF an algorithm for learning the optimal dimensionality removes a fraction of the observations in each data point, uses NMF for imputation, and uses the obtained imputation error as a performance measure.72 In the context of topic models, perplexity is often used as a performance measure; it captures the model's "surprise" when presented with new observations,49 measured as the equivalent vocabulary size needed to express the missing information per word. This approach, of optimizing hyperparameters on held-out data, reduces (although does not completely eliminate) overfitting, at the expense of a reduced training set.
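The held-out imputation idea can be sketched as follows; this is a minimal illustration of the principle using our own masked multiplicative-update implementation, not the specific algorithm of reference 72:

```python
# Rank selection for NMF by held-out imputation error: mask a fraction
# of entries, fit the factorization on the observed entries only
# (multiplicative updates for the masked squared error), and score each
# candidate K by the error on the held-out entries.
import numpy as np

rng = np.random.default_rng(7)
X = rng.poisson(lam=2.0, size=(80, 40)).astype(float)
M = (rng.random(X.shape) > 0.1).astype(float)      # 1 = observed, 0 = held out

def masked_nmf(X, M, K, iters=300, eps=1e-9):
    """Weighted NMF via multiplicative updates; M masks the loss."""
    W = rng.random((X.shape[0], K)) + eps
    V = rng.random((K, X.shape[1])) + eps
    for _ in range(iters):
        W *= ((M * X) @ V.T) / ((M * (W @ V)) @ V.T + eps)
        V *= (W.T @ (M * X)) / (W.T @ (M * (W @ V)) + eps)
    return W, V

for K in (1, 2, 3, 5, 8):
    W, V = masked_nmf(X, M, K)
    resid = (1 - M) * (X - W @ V)                  # held-out residuals only
    print(K, "held-out RMSE:", np.sqrt((resid ** 2).sum() / (1 - M).sum()))
```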
A fully Bayesian approach addresses the problem of overfitting by integrating out unknown parameters rather than selecting a single "best" value for them. This also works for integrating out the number of latent dimensions K, as is done for instance in hierarchical Dirichlet processes,108 but many more examples of such non-parametric Bayesian models exist, particularly for LDA,90,103,108-111 and the K-means model can also be treated in this way.88 Even in a fully Bayesian approach some parameters are fixed, for instance the parameters of hyperpriors, such as α and β in LDA. While in most applications of LDA these are kept at default values, ML can be used to optimize these parameters,112,113 and learning application-specific hyperparameters is important to obtain stable topics.93,106,112

TABLE 1 Overview of unsupervised learning methods as probabilistic models.

                   PCA                          K-means                   NMF                              LDA
Weight prior       Gaussian, W_n ∼ N(0, I_K)    one-hot, W_n ∈ B_K        uniform, nonnegative entries     Dirichlet, W_n ∼ Dir(α)
Component prior    uniform                      uniform                   uniform, V_k ∈ ℝ₊^P              Dirichlet, V_k ∼ Dir(β)
Error model        Gaussian, N(0, σ²I_P)        Gaussian, N(0, I_P)       Gaussian (or KL/IS divergence)   multinomial
Inference          ML (eigendecomposition)      ML (EM)                   ML (numerical optimization)      Bayesian (Gibbs sampling, VB)

Note: ℝ₊^P, the set of P-dimensional vectors with nonnegative entries; α and β, parameter vectors with dimensions K and P; EM, expectation-maximization; ML, maximum likelihood; VB, variational Bayes.

DISCUSSION
We presented four widely used unsupervised learning methods in a way that emphasises their common structure as probabilistic models underpinned by a low-rank matrix factorization. This common structure made it possible to systematically point out the differences between the methods (Table 1), which helps to choose the right method for an intended application. For instance, the fixed-variance normal error model used in PCA, K-means, and (often) NMF makes these methods dependent on scaling and vulnerable to outliers. Count data have a variance that typically scales with the counts themselves, making LDA more appropriate for these data. NMF with Kullback-Leibler divergence is similarly appropriate for count data, and in other cases, for example, where scaling is an issue, yet other measures of divergence between predictions and observations might be appropriate.71 Some mismatch between model assumptions and data characteristics does not necessarily preclude successful application. An example is the application of PCA to (binary or ternary) genetic marker data,55,56 although for understanding the population structure an LDA-based model that more precisely fits the data type is often used.30 Similarly, mutational signatures in cancer have been successfully analyzed using both NMF67 and LDA.99

The interpretability of the inferred latent representation differs between methods and is mostly driven by the sparsity of the solutions: in general, sparsity aids interpretation. PCA's Gaussian and uniform priors result in dense principal component (PC) vectors with both positive and negative entries, which are generally hard to interpret; however, the population genetic example (Figure 1) shows that the PC weights can have a clear interpretation even when the PC vectors themselves do not. K-means has uniform priors on V resulting in dense components, but enforces extreme ("one-hot") sparsity in the weights W; as a result the cluster centers can be regarded as data exemplars, which helps interpretation. NMF induces a degree of sparsity on both components and weights by requiring nonnegative entries, and as a result the sparse basis obtained by NMF is often interpretable,29 for instance when analyzing images (Figure 2) or somatic mutations in cancer; in the latter case, individual components can often be associated to particular mutagenic or DNA repair mechanisms.67 Finally, the Dirichlet priors used for LDA provide precise control of the sparsity of both topics and weights. Generally, sparse solutions are preferred and appropriate; for example, for text documents this means that individual documents include a relatively small number of distinct topics, each containing a relatively small number of words.82,83 Similarly, for single-cell epigenetic data, the model expects a modest number of regulatory pathways, each pertaining to a limited number of genetic loci, to be active in individual cells;98 linking these loci to genes then makes the components interpretable through enrichment analysis of gene ontology databases.
Although in principle many approaches for parameter inference can be used for any given model, in practice certain approaches are more natural than others. Maximum likelihood (ML) inference in the case of PCA boils down to eigenvalue decomposition, for which fast algorithms are available, making PCA suitable for large data sets. For K-means the ML solution is found using expectation maximization, which is also fast but does not guarantee global optimality. For NMF a variety of generic numerical optimization techniques are used, and because these techniques are generic, they allow for different choices of likelihood (or divergence) function, providing additional flexibility. Finally, LDA is typically viewed as a Bayesian model, and inferential algorithms aim to describe the (posterior) distribution of parameters, using methods such as Gibbs sampling or variational Bayes.
The various methods we described have some limitations, at least in their standard implementations. One common feature of high-dimensional data is heterogeneity: data comprising a mixture of, for example, real-valued, binary, ordinal, and other data types. Re-scaling can render real-valued data homogeneous, which tends to help in the case of PCA, but it is more challenging to deal with mixtures of, for example, the real-valued, categorical, ordinal, and time-to-event data that are typically found in EHR and population cohort data. To identify patterns in such data, efforts have been made to extend unsupervised learning methods to handle heterogeneous observation data by, for example, choosing appropriate link functions,49,113 enabling these methods to learn latent features from a combination of discrete, continuous, categorical, and binary features. Another typical feature of high-dimensional heterogeneous datasets, including but certainly not limited to cohort and EHR data, is that they often contain many missing observations. Standard implementations of the methods we discussed (with the exception of LDA) typically do not gracefully handle missing data, instead relying on imputation or complete-data approaches that can lead to bias in the presence of high missingness. Bayesian methods are in principle well able to deal with this issue, although this may come at some computational cost.
To conclude, with the increasing availability of large and high-dimensional data sets, methods for learning meaningful patterns within these data can help to perform data reduction for further analysis. Meaningful hidden patterns may partially explain inter-individual differences and lead to new hypotheses for further investigation. Challenges remain, particularly in dealing with heterogeneous data and with high missingness, both of which are often encountered in the medical domain. Continuing advances in hardware and machine learning mean that powerful, yet efficient and practical, approaches will likely continue to be developed to identify meaningful patterns in such data.

FIGURE 2 Top: manually labelled brain image; bottom: image labelled by NMF, using the weights and "parts-of-data" (basis) to segment the brain into separate areas around a brain tumor (purple and blue regions are active tumor). Figure from Sauwen et al.66

FIGURE 4 Topics, represented by colors, identified by LDA in medical documents. Figure from Tran et al.95