A fast and efficient Modal EM algorithm for Gaussian mixtures

In the modal approach to clustering, clusters are defined as the local maxima of the underlying probability density function, where the latter can be estimated either nonparametrically or using finite mixture models. Thus, clusters are closely related to certain regions around the density modes, and every cluster corresponds to a bump of the density. The Modal Expectation‐Maximization (MEM) algorithm is an iterative procedure that can identify the local maxima of any density function. In this contribution, we propose a fast and efficient MEM algorithm to be used when the density function is estimated through a finite mixture of Gaussian distributions with parsimonious component‐covariance structures. After describing the procedure, we apply the proposed MEM algorithm to both simulated and real data examples, showing its high flexibility in several contexts.


Introduction
The term cluster analysis encompasses a large set of methods and algorithms that aim at partitioning a set of data into some meaningful groups of homogeneous data points called clusters.
The presence of such clusters is not known a priori; often even their number is unknown, and no case labels are available. For this reason, cluster analysis is considered an instance of so-called unsupervised learning.
Several approaches and methods are available in the literature to explore the clustering structure of a dataset (Everitt et al., 2011). Among these, density-based approaches have been proposed to exploit the relationship between the underlying density of a dataset and the presence of clusters. In the parametric or model-based clustering approach, each component of a mixture distribution is associated with a cluster (McLachlan and Peel, 2000; Fraley and Raftery, 2002).
Thus, observations are allocated to the cluster with maximal weighted component density.
However, there may be situations where more than a single component is required to represent the shape of a cluster. Merging of mixture components is a possible answer to this problem. Baudry et al. (2010) proposed a merging method based on an entropy criterion, while Hennig (2010) discussed several methods based on unimodality and misclassification probabilities. All these methods are hierarchical in nature, so clusters can only be obtained by merging two or more mixture components. This indeed may constitute a limitation because data points assigned to a single Gaussian component cannot be subsequently allocated to different clusters. A different approach to tackle this problem was proposed by Scrucca (2016) based on the identification of connected components from high density regions of the underlying density function.
Modal clustering is another density-based approach to clustering where clusters are taken as the "domains of attraction" of the density modes (Stuetzle, 2003). This follows the definition proposed by Hartigan (1975, p. 205), according to which "clusters may be thought of as regions of high density separated from other such regions by regions of low density". This definition of cluster is the one adopted in the paper.
Modal EM (MEM) is an iterative algorithm aimed at identifying the local maxima of a density function (Li et al., 2007). Let f(x) = ∑_{k=1}^G π_k f_k(x) be a finite mixture density for x ∈ R^d, where π_k is the mixing probability of component k with density function f_k(x), under the constraints π_k > 0 for all k = 1, …, G, and ∑_{k=1}^G π_k = 1. Given an initial starting point x^(0), the following steps are iteratively executed until a stopping criterion is met: E-step: compute z_k^(t) = π_k f_k(x^(t−1)) / f(x^(t−1)) for k = 1, …, G; M-step: set x^(t) = argmax_x ∑_{k=1}^G z_k^(t) log f_k(x). Li et al. (2007) showed that the objective function in the M-step has a unique maximum if the f_k(x) are Gaussian densities. They also reported a closed-form solution in the case of Gaussian mixtures with a common covariance matrix. This is a fairly strong assumption that rarely holds in practice, so it is worth addressing the general case, which is not only more complex to deal with, but also much more relevant from a practical point of view.
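The claim that the M-step objective has a unique maximum for Gaussian components can be verified directly; the following sketch (in the notation above, with f_k(x) = φ(x; μ_k, Σ_k)) derives the maximiser in closed form:

```latex
Q(x) = \sum_{k=1}^{G} z_k^{(t)} \log \phi(x;\, \mu_k, \Sigma_k)
     = \text{const} \;-\; \frac{1}{2} \sum_{k=1}^{G} z_k^{(t)}\,
       (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k).
% Setting the gradient to zero:
\nabla Q(x) = -\sum_{k=1}^{G} z_k^{(t)} \Sigma_k^{-1} (x - \mu_k) = 0
\;\Longrightarrow\;
x = \Big(\sum_{k=1}^{G} z_k^{(t)} \Sigma_k^{-1}\Big)^{-1}
    \sum_{k=1}^{G} z_k^{(t)} \Sigma_k^{-1} \mu_k .
% The Hessian -\sum_k z_k^{(t)} \Sigma_k^{-1} is negative definite,
% so this stationary point is the unique maximum.
```

With a common covariance matrix Σ_k = Σ, and since ∑_k z_k^(t) = 1, this reduces to the weighted mean ∑_k z_k^(t) μ_k, which is the special case reported by Li et al. (2007).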
Regarding the question of how many modes a Gaussian mixture can have, we note that Carreira-Perpiñán and Williams (2003b,a) conjectured that the number of modes cannot exceed the number of components when the components of the mixture have the same covariance matrix (homoscedastic mixture), whereas if the components are allowed to have arbitrary and different covariance matrices (isotropic and full heteroscedastic mixtures) then the number of modes can be larger than the number of components. However, the first conjecture turns out to be wrong, so in general it is not possible to know a priori the number of modes of a multivariate Gaussian mixture. For a recent contribution on this issue see Améndola et al. (2019), where lower and upper bounds on the maximum number of modes of a Gaussian mixture are derived under the assumption they are finite.
Modal clustering plays a central role in the non-parametric approach to cluster analysis.
Several mode-seeking algorithms have been proposed in the literature, such as the mean-shift algorithm of Fukunaga and Hostetler (1975) and its many extensions (Carreira-Perpiñán, 2016).
However, regardless of the algorithm adopted, detection of high-density regions requires the choice of a density estimator, typically a kernel density estimator. The latter requires the selection of an appropriate kernel bandwidth, and extension to high dimensions is known to be somewhat problematic (Scott, 2009, ch. 9). Interestingly, connections exist between the Modal EM algorithm and the mean shift algorithm. In fact, Carreira-Perpiñán (2007) showed that the mean-shift algorithm is a generalized EM algorithm when the kernel of a non-parametric kernel density estimate is Gaussian. More recently, Chacón (2019) extended the use of the mean shift algorithm to non-isotropic Gaussian components. For a review on non-parametric modal clustering see Menardi (2016).
A motivating example

Consider the data shown in Figure 1a. They represent a sample of n = 500 observations drawn from the following bivariate two-component mixture: f(x) = π φ(x; μ_1, Σ_1) + (1 − π) SN(x; μ_2, Σ_2, λ_2), where π = 1/3 is the mixing weight of the first Gaussian component, with mean μ_1 = [5 −2]ᵀ and covariance matrix Σ_1 = [1 0; 0 1], whereas the second component is a skew-normal distribution (Azzalini, 2013) with location μ_2 = [0 0]ᵀ, scale matrix Σ_2 = [1 0.5; 0.5 1], and skewness parameter λ_2 = [5 1]ᵀ. Figure 1b shows the density estimate corresponding to the "best" Gaussian finite mixture model according to BIC. The selected model is a mixture of three components with ellipsoidal covariance matrices having common orientation (VVE in mclust nomenclature; see Scrucca et al., 2016). Figure 1c shows the corresponding clustering partition. Clearly, observations coming from the skewed component are not correctly identified by the estimated partition: two Gaussian components are needed to adequately represent this group of observations. However, note that the corresponding density estimate does correctly suggest a bimodal distribution. By exploiting this fact a better partition could be obtained, and the method discussed in this paper aims to deal with exactly such situations. In this contribution we propose a fast and efficient Modal EM algorithm for identifying the modes of a density estimated by a finite mixture of multivariate Gaussians having any of the parsimonious covariance structures available in the mclust R package (Scrucca et al., 2016).
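For readers wishing to reproduce data of this form, the skew-normal component can be simulated via the standard selection representation (a hedged sketch: `rmsn` is a hypothetical helper written for the unit-scale case used here; in R one would use Azzalini's sn package instead):

```python
import numpy as np

def rmsn(n, xi, Omega, alpha, rng):
    """Sample from a multivariate skew-normal SN_d(xi, Omega, alpha)
    via the selection representation (Azzalini, 2013). Omega is
    assumed to be a correlation (unit-scale) matrix here."""
    d = len(xi)
    delta = Omega @ alpha / np.sqrt(1.0 + alpha @ Omega @ alpha)
    # Joint (1+d)-dimensional normal with Corr(U0, U) = delta
    C = np.block([[np.ones((1, 1)), delta[None, :]],
                  [delta[:, None], Omega]])
    Z = rng.multivariate_normal(np.zeros(d + 1), C, size=n)
    U0, U = Z[:, 0], Z[:, 1:]
    # Flip the sign of U whenever U0 <= 0 -> skew-normal sample
    X = np.where(U0[:, None] > 0, U, -U)
    return xi + X

rng = np.random.default_rng(1)
n = 500
n1 = rng.binomial(n, 1 / 3)           # component sizes, mixing weight 1/3
X1 = rng.multivariate_normal([5.0, -2.0], np.eye(2), size=n1)
X2 = rmsn(n - n1, np.array([0.0, 0.0]),
          np.array([[1.0, 0.5], [0.5, 1.0]]), np.array([5.0, 1.0]), rng)
X = np.vstack([X1, X2])               # (500, 2) sample from the mixture
```

Under this representation, the skew-normal component has mean δ√(2/π) with δ = Ω̄λ / √(1 + λᵀΩ̄λ), which can be used to sanity-check the sampler.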
The outline of this article is as follows. Section 2 provides a brief review of the Modal EM approach for Gaussian mixtures available in the literature. Section 3 contains the proposal for extending the Modal clustering approach to any density estimated by fitting a finite mixture of Gaussian distributions with parsimonious component-covariance structures, and details on how to improve the computational efficiency of this approach. Section 4 describes the empirical results deriving from the application of the proposed Modal EM algorithm to examples using both synthetic and real datasets. The final section provides some concluding remarks.

Modal EM algorithm for Gaussian mixtures
Gaussian mixture models (GMMs) assume that the mixture components are all multivariate Gaussians with mean μ_k and covariance Σ_k, i.e. f_k(x) ≡ φ(x; μ_k, Σ_k). Therefore, the mixture density for any data point x_i can be written as f(x_i) = ∑_{k=1}^G π_k φ(x_i; μ_k, Σ_k). Clusters described by a GMM are centred at the means μ_k, with other geometric characteristics (such as volume, shape and orientation) determined by the covariance matrices Σ_k.
These can be controlled by introducing constraints on the covariance matrices through the eigen-decomposition (Banfield and Raftery, 1993; Celeux and Govaert, 1995) Σ_k = λ_k U_k Δ_k U_kᵀ, where λ_k = |Σ_k|^(1/d) is a scalar which controls the volume, Δ_k is a diagonal matrix with |Δ_k| = 1, whose entries are the normalised eigenvalues of Σ_k in decreasing order, which controls the shape, and U_k is an orthogonal matrix of eigenvectors of Σ_k, which controls the orientation. In this way, a total of 14 GMMs are obtained (Scrucca et al., 2016).
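As an illustration, the volume, shape and orientation components of a given covariance matrix can be recovered numerically (a small numpy sketch, not part of mclust):

```python
import numpy as np

def volume_shape_orientation(Sigma):
    """Decompose a covariance matrix as Sigma = lambda * U diag(Delta) U',
    with |diag(Delta)| = 1 (Banfield and Raftery, 1993)."""
    d = Sigma.shape[0]
    eigval, U = np.linalg.eigh(Sigma)
    # eigh returns ascending order; the decomposition uses decreasing order
    eigval, U = eigval[::-1], U[:, ::-1]
    lam = np.linalg.det(Sigma) ** (1 / d)   # volume
    Delta = eigval / lam                    # normalised eigenvalues, |Delta| = 1
    return lam, Delta, U

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
lam, Delta, U = volume_shape_orientation(Sigma)
# Reconstruction: Sigma == lam * U @ diag(Delta) @ U.T
```

Constraining λ_k, Δ_k or U_k to be equal across components, diagonal, or identity yields the 14 parsimonious models mentioned above.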
It is important to note that in this paper we shall consider the mixing proportions π k , the mean vectors µ k , and the covariance matrices Σ k as fixed (either estimated or known a priori) for all k = 1, . . . , G.
The MEM algorithm starts with t = 0 and initial point x_i^(0) = x_i. At iteration t, MEM performs the following steps:
• Set t = t + 1.
• E-step: update the posterior conditional probability that the current point belongs to the kth mixture component, z_ik^(t) = π_k φ(x_i^(t−1); μ_k, Σ_k) / f(x_i^(t−1)), for all k = 1, …, G.
• M-step: update the current value of x_i by solving the optimisation problem x_i^(t) = argmax_x ∑_{k=1}^G z_ik^(t) log φ(x; μ_k, Σ_k).
• Iterate the above steps until a stopping criterion is satisfied, for instance max{|x_i^(t) − x_i^(t−1)|} < ε, where ε is a tolerance value, say ε = 1e-5, or until a pre-specified maximum number of iterations is reached.
By the ascending property of the MEM algorithm (Li et al., 2007, Appendix A), at convergence the value x_i^(t) is the mode associated with data point x_i. Li et al. (2007) presented a closed-form solution only in the specific case of Gaussian mixtures with common covariance matrix, and reported that numerical procedures are required for the M-step if the covariance matrices differ across components. By replicating the above algorithm for all data points, it is possible to identify the mode associated with each x_i (i = 1, …, n), but this process is time-consuming even for moderately large datasets. In the next section we present an approach aimed at accelerating the MEM algorithm by iterating simultaneously over all data points, for any parsimonious covariance matrix decomposition.
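Before turning to the accelerated version, the per-point iteration can be sketched as follows (a minimal numpy illustration, not the paper's implementation). Since the M-step objective is a weighted sum of Gaussian log-densities, it is quadratic in x and its maximiser solves (∑_k z_ik Σ_k⁻¹) x = ∑_k z_ik Σ_k⁻¹ μ_k, so no numerical optimiser is needed:

```python
import numpy as np

def modal_em_point(x0, pro, mu, Sigma, tol=1e-5, maxiter=100):
    """Per-point Modal EM for a Gaussian mixture.
    pro: (G,) mixing proportions, mu: (G, d) means, Sigma: (G, d, d)."""
    G, d = mu.shape
    Sinv = np.linalg.inv(Sigma)                  # (G, d, d) precisions
    _, logdet = np.linalg.slogdet(Sigma)         # (G,) log-determinants
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        # E-step: log of pi_k * phi(x; mu_k, Sigma_k), up to a constant
        diff = x - mu                            # (G, d)
        quad = np.einsum('ki,kij,kj->k', diff, Sinv, diff)
        logz = np.log(pro) - 0.5 * (logdet + quad)
        z = np.exp(logz - logz.max())
        z /= z.sum()
        # M-step (closed form): solve (sum_k z_k Sinv_k) x = sum_k z_k Sinv_k mu_k
        A = np.einsum('k,kij->ij', z, Sinv)
        b = np.einsum('k,kij,kj->i', z, Sinv, mu)
        x_new = np.linalg.solve(A, b)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x
```

For a well-separated mixture, a point started near one component climbs to (essentially) that component's mean.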

Proposal
In this section we detail our proposal to speed up the MEM algorithm for Gaussian mixtures having any of the parsimonious component-covariance eigen-decompositions proposed by Banfield and Raftery (1993) and Celeux and Govaert (1995), and implemented in the mclust package (Scrucca et al., 2016) for R (R Core Team, 2019).
To this end, we start by noting that the objective function in the M-step presented in Section 2 has, for each data point, the closed-form maximiser x_i^(t) = (∑_{k=1}^G z_ik^(t) Σ_k⁻¹)⁻¹ ∑_{k=1}^G z_ik^(t) Σ_k⁻¹ μ_k (2), i.e. the solution of the linear system A_i x = b_i with A_i = ∑_k z_ik^(t) Σ_k⁻¹ and b_i = ∑_k z_ik^(t) Σ_k⁻¹ μ_k. The n systems can be stacked into a single (nd × d) matrix A and a single nd-vector b, so that the solution for data point x_i can equivalently be written as x_i^(t) = A[I,]⁻¹ b[I], where I ≡ {(i − 1)d + 1, …, id} is the set containing the indices used to select the rows of matrix A and the elements of vector b. Compared to the approach based on the calculation of the solution for each data point as in (2), the main advantage of our proposal is that the computation of the matrix A and vector b is performed in a single step for all the observations. Then, the use of the indices I allows us to select the relevant parts of A and b for computing the solutions. Although algebraically equivalent, this approach turned out to be about three times faster in our experiments under different settings.
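As an illustration of the batched M-step, the following sketch computes A and b for all points in one shot via einsum (a stand-in for the stacked construction; the actual mclust implementation may organise the arrays differently):

```python
import numpy as np

def mstep_all_points(Z, mu, Sinv):
    """One M-step for all n points at once.
    Z: (n, G) posterior probabilities, mu: (G, d) component means,
    Sinv: (G, d, d) component precision matrices. Builds the n systems
    A_i x = b_i, with A_i = sum_k z_ik Sinv_k and
    b_i = sum_k z_ik Sinv_k mu_k, and solves them in a single batch."""
    # A: (n, d, d) with A[i] = sum_k Z[i, k] * Sinv[k]
    #    (rows I = {(i-1)d+1, ..., id} of the stacked system)
    A = np.einsum('ik,kab->iab', Z, Sinv)
    # b: (n, d) with b[i] = sum_k Z[i, k] * Sinv[k] @ mu[k]
    b = np.einsum('ik,kab,kb->ia', Z, Sinv, mu)
    return np.linalg.solve(A, b)    # batched solve over all n points
```

The einsum calls and the batched `np.linalg.solve` replace the explicit per-point loop, which is where the speed-up comes from.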
The above algorithm is fast and efficient, but in practice Modal EM can suffer from some drawbacks which can be easily addressed as discussed below.

Setting the step size
Large jumps can occur during the initial iterations of the algorithm for data points x_i located in low-density regions. In these cases, since most of the z_ik are very small, the inverse of ∑_{k=1}^G z_ik Σ_k⁻¹ in (2) will contain large values (in magnitude). As a consequence, an initial data point may jump out of the domain of attraction of its mode, and therefore converge to a mode different from the attractor of the data point. As an example, consider the data point at the bottom-right of Figure 2a, and the corresponding path of MEM iterations (see the red arrows). In this case, after an initial large jump, the algorithm converges to a mode far from the domain of attraction of the data point.
To avoid these situations, we may compute the update at iteration t as a convex linear combination of the solution at the previous step and the proposed value x_i*, as follows: x_i^(t) = (1 − ω_i) x_i^(t−1) + ω_i x_i*, where ω_i ∈ [0, 1] is a tuning parameter that controls the step size. The definition of ω_i should consider whether or not a data point lies in a low-density region, and this value should be updated at each iteration.
For instance, we could define a function that depends on the determinant δ_i = |∑_{k=1}^G z_ik Σ_k⁻¹| for i = 1, …, n: when δ_i is small, i.e. the corresponding data point lies in a low-density region, the value of ω_i should be close to zero and x_i^(t) should be updated by small steps, whereas for relatively large values of δ_i the associated weight ω_i should converge to one, essentially setting x_i^(t) = x_i*. However, implementing such a strategy would require the computation of δ_i at each iteration for all the data points, and this may result in a significant increase of the execution time.
For this reason, in practice, we suggest computing the step size as ω_i = 1 − exp{−0.1 t} (see the function drawn in Figure 2b). The idea is that at early iterations the step size is small and x_i^(t) is updated by small steps, but as the number of iterations increases the step size converges to one, so the updated value becomes almost equivalent to x_i*. Returning to the example in Figure 2a, smaller initial steps (shown as blue arrows) allow the algorithm to converge to the correct density mode. Finally, we note that such a strategy inevitably increases the number of steps performed by the MEM algorithm. However, since each step is quite fast, the execution time is not significantly affected overall. For instance, in the previous example, the MEM iterations for all data points increase from 7 to 18, while the execution time goes from 0.04 to 0.08 seconds.
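The damped update is a one-liner; in this sketch `rate = 0.1` matches the suggested schedule ω_i = 1 − exp{−0.1 t}:

```python
import numpy as np

def damped_update(x_prev, x_star, t, rate=0.1):
    """Damped M-step update: convex combination of the previous iterate
    and the raw M-step solution, with step size omega = 1 - exp(-rate*t)."""
    omega = 1.0 - np.exp(-rate * t)
    return (1.0 - omega) * x_prev + omega * x_star
```

At t = 0 the point does not move at all (ω = 0), while for large t the update is essentially the raw M-step solution x*.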

Connected-components algorithm for "tight clusters"
After the final iteration of the Modal EM algorithm a set of points {x_i*}_{i=1}^n is obtained. These represent the modes to which each of the data points converges. However, points that would in the limit converge to the same mode may differ numerically from each other by a small amount, whose magnitude depends on the tolerance value used for checking the convergence of the algorithm. Thus, the solutions {x_i*}_{i=1}^n form tight clusters around the corresponding modes, widely separated from other tight clusters corresponding to different modes. The connected-components algorithm described in Carreira-Perpiñán (2016) allows for the merging of those points that ideally would be identical. This can be applied as a post-processing step to obtain the final estimated modes {x_m}_{m=1}^M.
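One simple way to realise this post-processing step (a sketch; Carreira-Perpiñán (2016) describes the general connected-components procedure) is to build the ε-neighbourhood graph of the converged points and label its connected components:

```python
import numpy as np

def merge_tight_clusters(xstar, eps=1e-2):
    """Group converged points whose pairwise distance is below eps into
    connected components; return component labels and component means
    (the estimated modes). O(n^2) sketch, fine for moderate n."""
    n = len(xstar)
    # Adjacency of the eps-neighbourhood graph
    D = np.linalg.norm(xstar[:, None, :] - xstar[None, :, :], axis=-1)
    adj = D < eps
    labels = np.full(n, -1)
    comp = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        # Depth-first search over the eps-graph starting from point i
        stack, labels[i] = [i], comp
        while stack:
            j = stack.pop()
            for m in np.where(adj[j] & (labels < 0))[0]:
                labels[m] = comp
                stack.append(m)
        comp += 1
    modes = np.array([xstar[labels == c].mean(axis=0) for c in range(comp)])
    return labels, modes
```

The threshold eps should be chosen somewhat larger than the convergence tolerance of the MEM iterations, so that numerical jitter within a tight cluster never splits it.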

Denoising low-density modes
Certain regions of the feature space may lack sufficient data points to obtain reliable density estimates, particularly in high dimensions. As a consequence, modes located in such regions might be spurious, and in these cases it may be convenient to filter them out (Carreira-Perpiñán, 2000). We consider a simple rule to drop modes associated with regions of relatively low probability. Following the approach of Banfield and Raftery (1993), we postulate the presence of a noise component uniformly distributed over the data region. Let V be the hypervolume of the data region; then any mode whose log-density does not exceed −log(V) can be considered a noisy artefact of the density estimation process. Here, the logarithmic scale is used to improve stability and numerical accuracy.
In practice, we need to compute log(V). A simple approximation can be obtained by taking the minimum among: (i) the volume of the hyperbox containing the observed data; (ii) the volume of the hyperbox obtained from the principal component scores; (iii) the volume of the ellipsoid hull, i.e. the ellipsoid of minimal volume such that all data points lie inside or on its boundary. Alternatively, the central (1 − α)100% region of a multivariate Gaussian distribution, i.e. the smallest region such that an observation falls in it with probability (1 − α), can be computed. This region is an ellipsoid in d dimensions, with log-hypervolume equal to log V = (d/2) log(π χ²_{d,1−α}) − log Γ(d/2 + 1) + (1/2) log|Σ|, where χ²_{d,1−α} is the (1 − α) quantile of a chi-squared distribution with d degrees of freedom and Γ(·) is the gamma function. The covariance matrix Σ can be estimated as the marginal covariance matrix using the well-known relationship between the parameters of a multivariate mixture distribution and the marginal parameters (see for instance Frühwirth-Schnatter, 2006), i.e. Σ = ∑_{k=1}^G π_k (Σ_k + μ_k μ_kᵀ) − μ μᵀ, where μ = ∑_{k=1}^G π_k μ_k is the vector of marginal means. Thus, modes whose log-density is smaller than −log(V), or equivalently with density smaller than exp(−log(V)) = 1/V, can be dropped, and points associated with them are re-assigned to the remaining modes with a few additional steps of the MEM algorithm. This is coherent with Hartigan's definition of cluster adopted in the paper, because the groups associated with low-density modes fail the main requirement of being high-density regions. The data example in Section 4.3 illustrates this approach.
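The two quantities needed by this denoising rule, the log-hypervolume of the central (1 − α) ellipsoid and the marginal moments of the mixture, might be computed as follows (a sketch using scipy; the exact choices in the paper's implementation may differ):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

def log_hypervolume(Sigma, alpha=0.01):
    """Log-volume of the central (1 - alpha) ellipsoid of a Gaussian
    with covariance Sigma:
    log V = (d/2) log(pi * chi2_{d,1-alpha}) - log Gamma(d/2 + 1)
            + (1/2) log |Sigma|."""
    d = Sigma.shape[0]
    q = chi2.ppf(1 - alpha, df=d)              # chi-squared quantile
    _, logdet = np.linalg.slogdet(Sigma)
    return (d / 2) * np.log(np.pi * q) - gammaln(d / 2 + 1) + 0.5 * logdet

def marginal_moments(pro, mu, Sigma):
    """Marginal mean and covariance of a Gaussian mixture:
    mu = sum_k pi_k mu_k,
    Sigma = sum_k pi_k (Sigma_k + mu_k mu_k') - mu mu'."""
    m = np.einsum('k,kd->d', pro, mu)
    S = np.einsum('k,kab->ab', pro,
                  Sigma + np.einsum('ka,kb->kab', mu, mu)) - np.outer(m, m)
    return m, S
```

A mode with log-density below −log_hypervolume(S, alpha) would then be flagged for removal.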

Simulated data example
Recalling the bivariate Gaussian/skew-normal mixture distribution described in Section 1, Figure 3a shows the estimated density obtained by the selected Gaussian mixture model, namely model VVE with 3 mixture components selected by BIC. To illustrate the procedure, some points are marked as filled blue points, and they are also reported in isolation in Figure 3b. For these selected points, the paths produced by the MEM algorithm described in Section 2 are shown as arrows in Figure 3c. At each step of the algorithm, points move uphill toward the density modes. The estimated modes are shown in Figure 3d. Figure 3e shows the modal clustering of all the data points, obtained according to the mode to which each of them converges. The MEM algorithm required 21 iterations and 0.23 seconds to run on an iMac with a 4-core Intel i5 CPU running at 2.8 GHz and 16GB of RAM. Finally, Figure 3f illustrates the partition of the feature space that defines the "domains of attraction" of the estimated density modes.

Mass cytometry data
Mass cytometry is a recent technology that couples flow cytometry with mass spectrometry, allowing several features of a cell to be measured simultaneously. The biological question of interest is the identification of subpopulations of cells. We consider two protein markers, CD4 and CD3all, from a mass cytometry experiment (Bendall et al., 2011), with the aim of finding latent classes in single-cell measurements. Data are preprocessed using the hyperbolic arcsine transformation (Holmes and Huber, 2018), i.e. asinh(x) = log(x + √(x² + 1)). A random sample of 10,000 cells (out of 91,392) is shown in Figure 4a. The clustering obtained with the "best" GMM selected by BIC (namely, VVV in the mclust nomenclature with 9 components; Scrucca et al., 2016, Table 3) is shown in Figure 4b. The corresponding density estimate and the modes estimated via the MEM algorithm are reported in Figure 4c, while Figure 4d shows the corresponding modal clustering partition. There appear to be five clusters: one for each combination of high/low values of the CD4 and CD3all markers, plus an additional cluster formed by the highest values of both markers. This result can be contrasted with those obtained using approaches based on merging mixture components. Figure 4e shows the partition derived from the entropy-based approach of Baudry et al. (2010), while Figure 4f shows the clusters obtained using the unimodal ridgeline approach of Hennig (2010). Both partitions are clearly different from the one obtained using the MEM algorithm, with the latter producing more compact and more easily interpretable clusters.
Finally, we note that with 10,000 observations the MEM algorithm required 24 iterations and 5.4 seconds to run on an iMac with a 4-core Intel i5 CPU running at 2.8 GHz and 16GB of RAM.

Bankruptcy dataset
Altman (1968) presented a study on financial ratios to predict corporate bankruptcy. The dataset provides the ratio of retained earnings (RE) to total assets, and the ratio of earnings before interest and taxes (EBIT) to total assets, for a sample of 66 US manufacturing firms, of which 33 had filed for bankruptcy within the following two years. Data are shown in Figure 5a.
The best GMM selected by BIC is the model VEI (diagonal, varying volume and equal shape) with three components. The contour plot of the corresponding density estimate is shown in Figure 5b, with points marked according to the implied maximum a posteriori (MAP) classification. There appear to be two prominent clusters, consisting mainly of solvent and bankrupt companies respectively, but also a spread-out group of firms with very low values for either financial ratio. Figure 5c shows the density estimate via a 3D perspective plot. Following the approach described in Section 3.3, a plane is included at the uniform density level corresponding to exp(−11.17492) = 1/71319.39 = 1.402 × 10⁻⁵, where V = 71319.39 is the hypervolume of the 99% central region. As can be seen, only two density bumps emerge, namely those corresponding to the two main groups in the data. A very low-density mode is also present in the region corresponding to small financial ratios; specifically, its density equals 4.661 × 10⁻⁶, approximately one third of the density threshold computed above. For this reason, this low-density mode is filtered out by the denoising procedure.
The modes estimated by the MEM algorithm are shown in Figure 5d, with data points marked according to the clusters assigned by the modal clustering procedure. Only 4 companies are misclassified, as indicated by the circled data points. This result can be compared with those reported by Lo and Gottardo (2012).

Conclusions

This paper addresses the problem of computing the modes of a density estimated by fitting a Gaussian mixture model. The proposed approach is based on the Modal EM algorithm, an iterative procedure aimed at identifying the local maxima of a density function. By exploiting specific characteristics of the underlying Gaussian mixture model, we extend the Modal EM algorithm to deal with any parsimonious component-covariance matrix decomposition. Furthermore, we discuss a fast implementation of the algorithm that performs the M-step simultaneously for all data points. Once the modes of the underlying density are estimated, a modal clustering partition can be obtained by associating each observation with the pertaining mode.
The Modal EM algorithm discussed here, like any other mode-seeking procedure, relies on the quality of the underlying density estimate. Clearly, if the parameters of the mixture model are not well estimated, some issues could arise. However, provided that the general form of the density estimate is not overly biased, the proposed method should not be significantly affected.
The proposed approach seems very promising and, in principle, could be extended to mixtures of non-Gaussian distributions (e.g. t, skew-normal, skew-t, shifted asymmetric Laplace). However, the potential benefits of adopting such more complex probability models still need to be investigated. Recently, an adaptation of the proposed algorithm has been used for clustering from an ensemble of Gaussian mixtures (Casa et al., 2019).
Another area of future research involves the use of the MEM algorithm in high-dimensional data settings. In this regard, we plan to study the effectiveness of the MEM algorithm applied to the subspace estimated by the GMM-based projection pursuit method proposed by Scrucca and Serafini (2019).