A two‐step proximal‐point algorithm for the calculus of divergence‐based estimators in finite mixture models

Estimators derived from the expectation‐maximization (EM) algorithm are not robust since they are based on the maximization of the likelihood function. We propose an iterative proximal‐point algorithm based on the EM algorithm to minimize a divergence criterion between a mixture model and the unknown distribution that generates the data. The algorithm estimates in each iteration the proportions and the parameters of the mixture components in two separate steps. Resulting estimators are generally robust against outliers and misspecification of the model. Convergence properties of our algorithm are studied. The convergence of the introduced algorithm is discussed on a two‐component Weibull mixture entailing a condition on the initialization of the EM algorithm in order for the latter to converge. Simulations on Gaussian and Weibull mixture models using different statistical divergences are provided to confirm the validity of our work and the robustness of the resulting estimators against outliers in comparison to the EM algorithm. An application to a dataset of velocities of galaxies is also presented. The Canadian Journal of Statistics 47: 392–408; 2019 © 2019 Statistical Society of Canada


INTRODUCTION
The expectation-maximization (EM) algorithm (Dempster, Laird & Rubin, 1977) is a well-known method for calculating the maximum likelihood estimator (MLE) of a model where incomplete data are considered. For example, when working with mixture models in the context of clustering, the labels or classes of observations are unknown during the training phase. Several variants of the EM algorithm are available; see McLachlan & Krishnan (2007). Another way to look at the EM algorithm is as a proximal-point problem; see Chrétien & Hero (1998) and Tseng (2004). Indeed, we may rewrite the conditional expectation of the complete log-likelihood as the log-likelihood function of the model (the objective) plus a proximal term. Generally, the proximal term has a regularization effect on the objective function, so that the algorithm becomes more stable, can avoid some saddle points and frequently outperforms classical optimization algorithms; see Goldstein & Russak (1987) and Chrétien & Hero (2008). Chrétien & Hero (1998) prove superlinear convergence of a proximal-point algorithm derived from the EM algorithm, whereas EM-type algorithms usually enjoy no more than linear convergence.
Taking into consideration the need for robust estimators, and the fact that the MLE is the least robust estimator among the class of divergence-type estimators which we present below, we generalize the EM algorithm (and the version in Tseng, 2004) by replacing the log-likelihood function with an estimator of a statistical divergence between the true distribution of the data and the model. We are particularly interested in $\varphi$-divergences and the density power divergence (DPD), which is a Bregman divergence. The DPD, introduced and studied by Basu et al. (1998), is defined for $a > 0$ as
$$D_a(g, f) = \int \left[ f^{1+a}(y) - \frac{a+1}{a}\, g(y)\, f^{a}(y) + \frac{1}{a}\, g^{1+a}(y) \right] dy \tag{1}$$
for two probability density functions $f$ and $g$. Given a random sample $Y_1, \ldots, Y_n$ distributed according to some probability measure $P_T$ with density $p_T$ with respect to Lebesgue measure, and given a model $(p_\phi)_{\phi \in \Phi}$ with $\Phi \subset \mathbb{R}^d$, the minimum density power divergence (MDPD) estimator is defined by:
$$\hat{\phi}_n = \arg\inf_{\phi \in \Phi} \; \int p_\phi^{1+a}(z)\, dz - \frac{a+1}{a}\, \frac{1}{n} \sum_{i=1}^{n} p_\phi^{a}(Y_i). \tag{2}$$
This estimator is robust for $a > 0$, and when $a$ goes to zero, we obtain the MLE.
A $\varphi$-divergence in the sense of Csiszár (Csiszár, 1963; Broniatowski & Keziou, 2009) is defined by:
$$D_\varphi(Q, P) = \int \varphi\!\left( \frac{dQ}{dP}(y) \right) dP(y),$$
where $\varphi$ is a nonnegative strictly convex function with $\varphi(1) = 0$, and $Q$ and $P$ are two probability measures such that $Q$ is absolutely continuous with respect to $P$. Examples (among others) of such divergences are: the Kullback-Leibler (KL) divergence when $\varphi_1(t) = t \log t + 1 - t$, the modified KL divergence when $\varphi_0(t) = -\log t + t - 1$, and the Hellinger distance when $\varphi_{0.5}(t) = (\sqrt{t} - 1)^2 / 2$. All these well-known divergences belong to the family of $\varphi$-divergences generated by the class of Cressie-Read functions defined by:
$$\varphi_\gamma(t) = \frac{t^\gamma - \gamma t + \gamma - 1}{\gamma(\gamma - 1)}, \qquad \gamma \in \mathbb{R} \setminus \{0, 1\}, \tag{3}$$
and defining $\varphi_0$ and $\varphi_1$ as the limit as $\gamma$ tends to 0 and 1, respectively. We consider the dual estimator of the $\varphi$-divergence (D$\varphi$DE) introduced independently by Broniatowski & Keziou (2006) and Liese & Vajda (2006). The use of this estimator is motivated by several reasons. Its minimum coincides with the MLE for $\varphi(t) = -\log t + t - 1$.
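As a small numerical companion to the Cressie-Read class, the following Python sketch evaluates the generator and the resulting divergence between two discrete distributions. It assumes the standard Cressie-Read form $(t^\gamma - \gamma t + \gamma - 1)/(\gamma(\gamma-1))$; the function names `cressie_read` and `phi_divergence` are illustrative and not taken from the paper. Evaluated literally at $\gamma = 0.5$, this form gives $2(\sqrt{t}-1)^2$, which differs from the $(\sqrt{t}-1)^2/2$ normalization of the Hellinger generator only by a constant factor.

```python
import math

def cressie_read(gamma: float, t: float) -> float:
    """Cressie-Read generator for t > 0, using the limiting forms at gamma = 0 and gamma = 1."""
    if t <= 0:
        raise ValueError("t must be positive")
    if gamma == 1:   # limit gamma -> 1: Kullback-Leibler generator
        return t * math.log(t) + 1 - t
    if gamma == 0:   # limit gamma -> 0: modified Kullback-Leibler generator
        return -math.log(t) + t - 1
    return (t ** gamma - gamma * t + gamma - 1) / (gamma * (gamma - 1))

def phi_divergence(gamma, q, p):
    """Discrete analogue of D_phi(Q, P): sum_j phi(q_j / p_j) * p_j, with all p_j, q_j > 0."""
    return sum(cressie_read(gamma, qj / pj) * pj for qj, pj in zip(q, p))
```

For instance, `phi_divergence(0.5, [0.2, 0.8], [0.5, 0.5])` is strictly positive, while the divergence of any distribution from itself is zero for every `gamma`.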
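Besides, it does not take into account any partitioning or smoothing and has the same form for discrete and continuous models, which is not the case for other estimators considered by Beran (1977), Park & Basu (2004) and Basu & Lindsay (1994), who use kernel density estimators. For $\phi$ in $\Phi$, the D$\varphi$DE is given by:
$$\hat{D}_\varphi(p_\phi, p_T) = \sup_{\alpha \in \Phi} \left\{ \int \varphi'\!\left( \frac{p_\phi}{p_\alpha} \right)(x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \varphi^{\#}\!\left( \frac{p_\phi}{p_\alpha} \right)(y_i) \right\}, \tag{4}$$
with $\varphi^{\#}(t) = t \varphi'(t) - \varphi(t)$. Al Mohamad (2018) argues that while this formula works well under the model, it underestimates the divergence between the true distribution and the model under misspecification of the model or contamination in the data, and proposes the following simpler estimator:
$$\tilde{D}_\varphi(p_\phi, p_T) = \int \varphi'\!\left( \frac{p_\phi}{K_{n,w}} \right)(x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \varphi^{\#}\!\left( \frac{p_\phi}{K_{n,w}} \right)(y_i), \tag{5}$$
where $K_{n,w}$ is a nonparametric estimator of the true distribution $P_T$. In this paper, $K_{n,w}$ is a kernel density estimator. The resulting new estimator is robust against outliers. It also removes the supremal form of the dual estimator (4). The minimum dual $\varphi$-divergence estimator (MD$\varphi$DE) and its kernel-based version are defined by:
$$\text{MD}\varphi\text{DE} = \arg\min_{\phi \in \Phi} \hat{D}_\varphi(p_\phi, p_T), \qquad \text{Kernel-based MD}\varphi\text{DE} = \arg\min_{\phi \in \Phi} \tilde{D}_\varphi(p_\phi, p_T).$$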
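The kernel-based estimator can be sketched numerically as follows. This is a minimal illustration in Python, not the authors' R implementation: it assumes a Cressie-Read generator with $\gamma \notin \{0, 1\}$, for which $\varphi_\gamma'(t) = (t^{\gamma-1} - 1)/(\gamma - 1)$ and $\varphi_\gamma^{\#}(t) = (t^{\gamma} - 1)/\gamma$, a Gaussian kernel, and a trapezoidal rule on a user-chosen interval `[lo, hi]`. The names `make_kde` and `kernel_dual_divergence` are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def make_kde(sample, w):
    """Gaussian kernel density estimator with window w (stands in for K_{n,w})."""
    return lambda x: sum(normal_pdf(x, y, w) for y in sample) / len(sample)

def kernel_dual_divergence(model_pdf, kde, sample, gamma, lo, hi, m=2000):
    """Kernel-based dual estimate for the Cressie-Read generator (gamma not in {0, 1}).

    Integral term: int phi'(p / K) p dx, approximated by the trapezoidal rule on [lo, hi].
    Empirical term: mean of phi#(p / K) over the sample.
    """
    def integrand(x):
        p = model_pdf(x)
        return ((p / kde(x)) ** (gamma - 1) - 1) / (gamma - 1) * p

    h = (hi - lo) / m
    vals = [integrand(lo + j * h) for j in range(m + 1)]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    empirical = sum(((model_pdf(y) / kde(y)) ** gamma - 1) / gamma for y in sample) / len(sample)
    return integral - empirical
```

A quick sanity check of the sketch: when the model density is the kernel estimate itself, both terms vanish and the estimated divergence is zero. The interval `[lo, hi]` should stay close enough to the data that the kernel estimate does not underflow to zero.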
Asymptotic properties and consistency of these two estimators can be found in Broniatowski & Keziou (2009), Toma & Broniatowski (2011) and Al Mohamad (2018). We propose to calculate the two MD$\varphi$DEs and the MDPD when $p_\phi$ is a mixture model using an iterative procedure based on the work of Tseng (2004) on the log-likelihood function. This procedure has the form of a proximal-point algorithm and extends the EM algorithm. A similar algorithm was introduced in Al Mohamad & Broniatowski (2015, 2016). Here, each iteration has two steps: a step to calculate the proportions and a step to calculate the parameters of the mixture components. The goal of this simplification is to reduce the dimension over which we optimize, since optimization procedures are generally more efficient in lower dimensions. Our convergence proof requires some regularity of the estimated divergence with respect to the parameter vector, which is not easily checked using Equation (4). Results in Rockafellar & Wets (1998) provide sufficient conditions to solve this problem. Differentiability of $\hat{D}_\varphi(p_\phi, p_T)$ with respect to $\phi$, however, may remain very hard to establish in many situations.
The paper is organized as follows. We explain in Section 2 the context and indicate the mathematical notation which may be nonstandard. We also present the progression and the derivation of our set of algorithms from the EM algorithm and Tseng's generalization. In Section 3, we prove some convergence properties of the sequence generated by our algorithm. We show in Section 4 a case study of a Weibull mixture including a convergence proof of the EM algorithm. Finally, Section 5 provides simulations confirming the robustness of the resulting estimator in comparison to the EM algorithm. The proofs of the main results are in the Appendix.

General Context and Notation
Let $(x_1, y_1), \ldots, (x_n, y_n)$ be $n$ realizations drawn from the joint probability density function $f(x, y|\phi)$ parameterized by a real vector $\phi \in \Phi \subset \mathbb{R}^d$. The $x_i$'s are the unobserved data (labels) and the $y_i$'s are the observations. The observed data $y_i$ are supposed to be real vectors and the labels $x_i$ belong to a space $\mathcal{X}$ that is not necessarily finite unless mentioned otherwise. Denote by $dx$ a measure defined on the label space $\mathcal{X}$ (it is counting measure if $\mathcal{X}$ is discrete). The marginal density of the observed data is given by $p_\phi(y) = \int_{\mathcal{X}} f(x, y|\phi)\, dx$, which is assumed to be a finite mixture model of the form
$$p_\phi(y) = \sum_{i=1}^{s} \lambda_i f_i(y|\theta_i), \qquad \phi = (\lambda, \theta),$$
where $\lambda$ collects the proportions and $\theta$ the parameters of the $s$ mixture components. For a parameterized function $f$ with a parameter $a$, we write $f(x|a)$. We use the notation $\phi^k$ for sequences with the index $k$. For a set $\Phi$, $\operatorname{Int} \Phi$ denotes its interior.

EM Algorithm for Mixture Models
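Let $h_i(x|\phi^k)$ be the conditional density of the labels (at step $k$) given $y_i$:
$$h_i(x|\phi^k) = \frac{f(x, y_i|\phi^k)}{p_{\phi^k}(y_i)}.$$
Let $\phi^0$ be an initial vector. Let $\psi(t) = -\log t$ (or $\psi(t) = -\log t + t - 1$). The EM algorithm estimates the unknown parameter vector by generating the sequence:
$$\phi^{k+1} = \arg\max_{\phi \in \Phi} \; J(\phi) - D_\psi(\phi, \phi^k), \qquad D_\psi(\phi, \phi^k) = \sum_{i=1}^{n} \int_{\mathcal{X}} \psi\!\left( \frac{h_i(x|\phi)}{h_i(x|\phi^k)} \right) h_i(x|\phi^k)\, dx, \tag{6}$$
where $J(\phi) = \sum_{i=1}^{n} \log p_\phi(y_i)$ is the log-likelihood. The formulation (6) is a proximal-point algorithm which was proposed by Tseng (2004), who studied its convergence properties for any convex nonnegative function $\psi$. When $p_\phi$ is a finite mixture model, the EM algorithm has the two-step form
$$\lambda^{k+1} = \arg\max_{\lambda} \; J(\lambda, \theta^k) - D_\psi\big((\lambda, \theta^k), \phi^k\big), \tag{7}$$
$$\theta^{k+1} = \arg\max_{\theta} \; J(\lambda^{k+1}, \theta) - D_\psi\big((\lambda^{k+1}, \theta), \phi^k\big). \tag{8}$$
The EM algorithm and its generalization (6) for any convex nonnegative function $\psi$ produce estimators based on the likelihood function, which are not robust against outliers or misspecification of the model. Estimators calculated using statistical divergences such as the Hellinger or the chi-squared are known to be robust. Let $D(p_\phi, p_T)$ be some statistical divergence calculated between the model and the true distribution of the data, and let $\hat{D}(p_\phi, p_T)$ be its estimator. We propose to estimate the mixture model through the sequences
$$\lambda^{k+1} = \arg\inf_{\lambda} \; \hat{D}(p_{\lambda, \theta^k}, p_T) + D_\psi\big((\lambda, \theta^k), \phi^k\big), \tag{9}$$
$$\theta^{k+1} = \arg\inf_{\theta} \; \hat{D}(p_{\lambda^{k+1}, \theta}, p_T) + D_\psi\big((\lambda^{k+1}, \theta), \phi^k\big), \tag{10}$$
where $D_\psi$ is as defined in (6). Examples of statistical divergences include $\varphi$-divergences (Broniatowski & Keziou, 2009), DPDs (Basu et al., 1998), S-divergences (Ghosh et al., 2013) and Rényi pseudodistances (e.g., Toma & Leoni-Aubin, 2013). They all include the MLE for a suitable choice of the tuning parameter or the generating function, so that the sequence (9) and (10) coincides with the sequence (7) and (8).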
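The link between the EM algorithm and its proximal-point formulation can be checked numerically. The Python sketch below assumes a two-component Gaussian mixture with unit variances and $\psi(t) = -\log t$; all function names are illustrative. It computes the log-likelihood minus the proximal term and the usual conditional expectation of the complete log-likelihood, and verifies that the two differ only by a quantity that does not depend on the candidate parameter, so both have the same maximizers.

```python
import math

def npdf(y, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def joint(y, phi):
    """Joint densities f(x, y | phi) for the labels x = 1, 2; phi = (lam, mu1, mu2)."""
    lam, mu1, mu2 = phi
    return (lam * npdf(y, mu1), (1 - lam) * npdf(y, mu2))

def posterior(y, phi):
    """Conditional probabilities h(x | phi) of the two labels given y."""
    j = joint(y, phi)
    return tuple(v / sum(j) for v in j)

def log_lik(phi, sample):
    """Log-likelihood J(phi) of the observed sample."""
    return sum(math.log(sum(joint(y, phi))) for y in sample)

def prox_neg_log(phi, phi_k, sample):
    """Proximal term D_psi(phi, phi_k) with psi(t) = -log t."""
    return sum(-math.log(hn / ho) * ho
               for y in sample
               for hn, ho in zip(posterior(y, phi), posterior(y, phi_k)))

def em_q(phi, phi_k, sample):
    """Conditional expectation of the complete log-likelihood given phi_k (the EM Q-function)."""
    return sum(ho * math.log(f)
               for y in sample
               for f, ho in zip(joint(y, phi), posterior(y, phi_k)))

def gap(phi, phi_k, sample):
    """J(phi) - D_psi(phi, phi_k) - Q(phi | phi_k); depends on phi_k only."""
    return log_lik(phi, sample) - prox_neg_log(phi, phi_k, sample) - em_q(phi, phi_k, sample)
```

The value returned by `gap` is the entropy of the conditional label distribution at `phi_k`, so it is the same for every `phi`.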
Our two-step algorithm in Equations (9) and (10) coincides with the one-step proximal-point algorithm introduced by Al Mohamad & Broniatowski (2015) for general models if we omit the optimization over the proportions. In other words, the one-step proximal-point algorithm is given by
$$\phi^{k+1} = \arg\inf_{\phi \in \Phi} \; \hat{D}(p_\phi, p_T) + D_\psi(\phi, \phi^k). \tag{11}$$
The remainder of the paper is devoted entirely to the study of the convergence of the sequences generated by the set of algorithms (9) and (10).
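To make the two-step recurrences concrete, here is a deliberately simple Python sketch. It is an illustration under several assumptions that are not part of the paper: a two-component Gaussian mixture with unit variances, the MDPD objective with $a = 0.5$, the proximal generator $\psi(t) = (\sqrt{t} - 1)^2/2$, and an exhaustive search over small candidate grids (`LAMS`, `MUS`) in place of a proper numerical optimizer. Because the current iterate always belongs to the candidate grids and the proximal term vanishes there, the estimated divergence cannot increase from one iteration to the next.

```python
import math
import random

A = 0.5                                         # DPD tuning parameter a (assumed value)
NODES = [-10 + 0.2 * j for j in range(101)]     # trapezoidal nodes on [-10, 10]
LAMS = [0.05 * j for j in range(1, 20)]         # candidate proportions 0.05, ..., 0.95
MUS = [-4 + 0.5 * j for j in range(17)]         # candidate means -4, ..., 4

def npdf(x, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def mix(x, lam, mu1, mu2):
    """Two-component Gaussian mixture density with unit variances."""
    return lam * npdf(x, mu1) + (1 - lam) * npdf(x, mu2)

def dpd(phi, sample):
    """MDPD objective: int p^(1+a) - (1 + 1/a) * mean_i p(y_i)^a."""
    vals = [mix(x, *phi) ** (1 + A) for x in NODES]
    integral = 0.2 * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    empirical = sum(mix(y, *phi) ** A for y in sample) / len(sample)
    return integral - (1 + 1 / A) * empirical

def post(y, lam, mu1, mu2):
    """Conditional probabilities of the two labels given y."""
    a, b = lam * npdf(y, mu1), (1 - lam) * npdf(y, mu2)
    return a / (a + b), b / (a + b)

def prox(phi, phi_k, sample):
    """Proximal term D_psi(phi, phi_k) with psi(t) = (sqrt(t) - 1)^2 / 2."""
    return sum(0.5 * (math.sqrt(new / old) - 1) ** 2 * old
               for y in sample
               for new, old in zip(post(y, *phi), post(y, *phi_k)))

def two_step(phi0, sample, iters=3):
    """Run the two-step recurrences by exhaustive search; return the last iterate and the objective history."""
    phi, history = phi0, [dpd(phi0, sample)]
    for _ in range(iters):
        _, mu1_k, mu2_k = phi
        cost = lambda cand: dpd(cand, sample) + prox(cand, phi, sample)
        # Step 1: update the proportion with the component parameters frozen.
        lam = min(LAMS, key=lambda l: cost((l, mu1_k, mu2_k)))
        # Step 2: update the component parameters with the new proportion frozen.
        phi = min(((lam, m1, m2) for m1 in MUS for m2 in MUS), key=cost)
        history.append(dpd(phi, sample))
    return phi, history

def simulate(n, seed=0):
    """Illustrative sample from 0.35 N(-2, 1) + 0.65 N(1.5, 1) (hypothetical values)."""
    rng = random.Random(seed)
    return [rng.gauss(-2, 1) if rng.random() < 0.35 else rng.gauss(1.5, 1) for _ in range(n)]
```

A typical call is `two_step((LAMS[9], MUS[4], MUS[12]), simulate(40))`; the starting point must be taken from the grids for the monotonicity argument to apply. The grid search is only meant to keep the sketch short and dependency-free.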

SOME CONVERGENCE PROPERTIES OF $\phi^k$
We adapt the ideas given by Tseng (2004) to develop proofs of convergence for our proximal algorithm as $k$ goes to infinity while $n$ is held fixed. The proofs are deferred to the Appendix. Throughout, $\hat{D}(\phi)$ is shorthand for the estimated divergence $\hat{D}(p_\phi, p_T)$. Let $\phi^0 = (\lambda^0, \theta^0)$ be a given initialization for the parameters, and define the set
$$\Phi^0 = \{\phi \in \Phi : \hat{D}(\phi) \le \hat{D}(\phi^0)\}.$$
We suppose that $\Phi^0$ is a subset of $\operatorname{Int} \Phi$. The idea of defining such a set in this context is inherited from Wu (1983). We use the set of assumptions A0-A4 provided in the Appendix. They are verifiable using Lebesgue theorems and the approaches provided in the Supplementary Material.
Proposition 1. Assume that recurrences (9) and (10) are well defined in $\Phi$. For both algorithms, the sequence $(\phi^k)_k$ satisfies the following properties: (a) $\hat{D}(\phi^{k+1}) \le \hat{D}(\phi^k)$; (b) $\phi^k \in \Phi^0$ for all $k$; (c) if assumptions A0 and A2 are fulfilled, then the sequence $(\phi^k)_k$ is defined and bounded. Moreover, the sequence $\{\hat{D}(\phi^k)\}_k$ converges as $k$ goes to infinity.
The interest of Proposition 1 is that the objective function is ensured, under mild assumptions, to decrease along the sequence $(\phi^k)_k$. This makes it possible to build a stopping criterion for the algorithm, since in general there is no guarantee that the whole sequence $(\phi^k)_k$ converges. The sequence may also continue to fluctuate in a neighbourhood of an optimum. The following result provides a first characterization of the properties of the limit of the sequence $(\phi^k)_k$ as a stationary point of the estimated divergence.
Proposition 2. Suppose that A1 is true, and assume that $\Phi^0$ is closed and $\{\phi^{k+1} - \phi^k\} \to 0$ as $k$ goes to infinity. If A4 is satisfied, then the limit of every convergent subsequence is a stationary point of $\phi \mapsto \hat{D}(\phi)$.
Proposition 3. Assume that A1, A2 and A3 hold. Then $\{\phi^{k+1} - \phi^k\} \to 0$ as $k$ goes to infinity, which implies, by Proposition 2, that any limit point of the sequence $(\phi^k)_k$ is a stationary point of $\phi \mapsto \hat{D}(\phi)$.
We can go further in exploring the properties of the sequence $(\phi^k)_k$ by imposing additional assumptions. The following corollary provides a convergence result for the whole sequence. The convergence also holds towards a local minimum as long as the estimated divergence is locally strictly convex.
Corollary 1. Under the assumptions of Proposition 3, the set of accumulation points of $(\phi^k)_k$ defined by (9) and (10) is a connected compact set. Moreover, if $\hat{D}(\phi)$ is strictly convex in a neighbourhood of a limit point of the sequence $(\phi^k)_k$, then the whole sequence $(\phi^k)_k$ converges to a local minimum of $\hat{D}(\phi)$ as $k$ goes to infinity.
Although Proposition 3 provides a general way to establish that $\{\phi^{k+1} - \phi^k\} \to 0$ as $k$ goes to infinity, the identifiability assumption on the proximal term is hard to fulfil. It does not hold in most simple mixtures, such as a two-component Gaussian mixture (Tseng, 2004). This is the reason behind our next result. A similar idea is employed by Chrétien & Hero (2008). Their work, however, requires that the log-likelihood approaches $-\infty$ as $\|\phi\| \to \infty$, which is not satisfied by usual mixture models (e.g., the Gaussian mixture model). Our result treats the problem from another perspective using the set $\Phi^0$.
Proposition 4. Assume that A1 and A2 hold. For the algorithm defined by (9) and (10), if $\|\theta^{k+1} - \theta^k\| \to 0$ as $k$ goes to infinity, then any convergent subsequence $\{\lambda^{N(k)}, \theta^{N(k)}\}_k$ converges to a stationary point of the objective function $(\lambda, \theta) \mapsto \hat{D}(\lambda, \theta)$ as $k$ goes to infinity.
Proposition 4 requires a condition on the distance between two consecutive members of the sequence $\{\theta^k\}_k$ which is weaker than the same condition on the whole sequence $\phi^k = (\lambda^k, \theta^k)$. Still, as the regularization term $D_\psi$ does not satisfy the identifiability condition A3, it remains an open problem for further work. It is interesting to notice that the condition $\|\theta^{k+1} - \theta^k\| \to 0$ can be replaced by $\|\lambda^{k+1} - \lambda^k\| \to 0$, but we then need to change the order of steps (9) and (10).
Following Chrétien & Hero (2008), we can define a proximal-point algorithm which converges to a global infimum. Let $\{\beta_k\}_k$ be a sequence of positive numbers which decreases to zero ($\beta_k = 1/k$ does the job). Define
$$\phi^{k+1} = \arg\inf_{\phi \in \Phi} \; \hat{D}(\phi) + \beta_k D_\psi(\phi, \phi^k).$$
The justification of such a variant follows directly from Theorem 3.2.4 of Chrétien & Hero (2008). The problem with this approach is that the infimum at each step of the algorithm needs to be calculated exactly, which does not happen in general unless the function $\phi \mapsto \hat{D}(\phi) + \beta_k D_\psi(\phi, \phi^k)$ is strictly convex.

CASE STUDY: A TWO-COMPONENT WEIBULL MIXTURE
Let $p_\phi$ be the two-component Weibull mixture of Equation (12), where $\phi = (\lambda, \nu_1, \nu_2)$ collects the proportion $\lambda$ and the two shape parameters $\nu_1$ and $\nu_2$. We have $\Phi = [\eta, 1 - \eta] \times \mathbb{R}_+^* \times \mathbb{R}_+^*$ for some $\eta > 0$ in order to avoid degeneracy. We will be interested only in power divergences defined through the Cressie-Read class of functions $\varphi = \varphi_\gamma$ given by (3). The functions $h_i$ are the conditional densities of the two labels given $y_i$. It is clear that the functions $h_i$ are in class $\mathcal{C}^1(\operatorname{Int} \Phi)$ and so is $\phi \mapsto D_\psi(\phi, \phi')$ for any $\phi' \in \Phi$.
Use the DPD defined by Equation (1): If we use the DPD (1), the continuity and differentiability of the estimated divergence $\hat{D}_a$ (the optimized function in Equation (2)) can be treated using Lebesgue theorems. To prove that $\Phi^0$ is compact, we prove that it is closed and bounded in the complete space $[\eta, 1 - \eta] \times \mathbb{R}_+^2$. We add zero to the values of the shape parameter so that the space becomes complete. Closedness is an immediate result of the continuity of the estimated divergence, since $\Phi^0$ is the inverse image of the closed set $(-\infty, \hat{D}_a(\phi^0)]$. To ensure boundedness of $\Phi^0$, we need to choose carefully the initial point $(\lambda^0, \nu_1^0, \nu_2^0)$ of the algorithm. Since $\lambda$ is bounded by 0 and 1, we only need to verify the boundedness of the shapes. If both shapes $\nu_1$ and $\nu_2$ go to infinity, then $\hat{D}_a(\phi) \to 0$. If either of the shapes goes to infinity, then the corresponding component vanishes. In order to prevent the shapes from growing to infinity, we start at a point where the estimated divergence is lower than those extremities. Then, because of the decreasing property of the algorithm and the definition of $\Phi^0$, the algorithm never goes back to any of the unbounded situations. We thus identify a condition on the initialization of the algorithm, Equation (13), which makes $\Phi^0$ bounded.
Use the dual estimator defined by Equation (4): If we use (4) to determine the estimator, then only continuity of the estimated divergence with respect to the parameters can be obtained. Write $\hat{D}_\varphi(\phi) = \sup_{\alpha} f(\alpha, \phi)$, where $f$ denotes the function optimized in (4). We list the following results without any proof, because it suffices to study the integral term in the formula.
Suppose, without loss of generality, that 1 < 2 and 1 < 2 .
In both cases 2 and 3, if $\Phi$ is compact, then using Theorem 1.17 of Rockafellar & Wets (1998), $\phi \mapsto \hat{D}_\varphi(\phi)$ is continuous. Differentiability, however, is difficult to prove and requires more investigation of the form of the estimated divergence and the model used. We conclude that if $\Phi$ is compact, then Proposition 1 can be used to deduce that the sequence $\hat{D}_\varphi(\phi^k)$ converges, but no information about the convergence towards stationary points can be obtained.
Use the kernel-based dual estimator given by Equation (5): If we use (5) to define the estimator, then the continuity of $\tilde{D}_\varphi(\phi)$ depends on the tail of the kernel to be used and the value of $\gamma$. For example, if we use a Gaussian kernel and $\gamma \in (0, 1)$, then the estimated divergence is $\mathcal{C}^1(\operatorname{Int} \Phi)$. A similar condition to (13) can be obtained and we have the same conclusion as Conclusion 1.
Use the likelihood of the model: If we use $\varphi(t) = \psi(t) = -\log t + t - 1$, we obtain the EM algorithm. Assumptions A1 and A4 are clearly satisfied. Let $L(\phi)$ be the likelihood function, and $J(\phi) = \log L(\phi)$. The set $\Phi^0$ is given by
$$\Phi^0 = \{\phi \in \Phi : J(\phi) \ge J(\phi^0)\}.$$
We will show that under suitable conditions, the set $\Phi^0$ is compact. Suppose that the shape parameters can take values in $\mathbb{R}_+$. The set $\Phi^0$ becomes the inverse image of $[L(\phi^0), \infty)$ by the likelihood function, which is continuous, and thus $\Phi^0$ is closed in the space $[\eta, 1 - \eta] \times \mathbb{R}_+^2$. Similarly to the previous cases, in order to prove boundedness we need to avoid the cases where either of the shape parameters tends to infinity. For example, when $\nu_1$ goes to infinity, the likelihood tends to a limit which is bounded almost everywhere. We then choose the initial point $\phi^0$ of the algorithm in such a way that its likelihood exceeds these limiting values, and the set $\Phi^0$ hence becomes bounded and therefore compact. The same conclusion as Conclusion 1 holds for the Weibull mixture model. Note that the verification of assumption A3 is a hard task, because it results in a set of $n$ nonlinear equations in $y_i$ and cannot be treated in a similar way to the Gaussian mixture in Tseng (2004) or Al Mohamad & Broniatowski (2016).

EXPERIMENTAL RESULTS
We summarize the results of 100 experiments on samples of size 100 (with and without outliers) from two-component Gaussian and Weibull mixtures. We measure the error of replacing the true distribution of the data with the model using the total variation distance (TVD), which is calculated using the $L^1$ distance by the Scheffé lemma (e.g., Meister, 2009, p. 129).
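The TVD criterion can be approximated numerically from two densities. The Python sketch below uses the identity $\mathrm{TVD}(p, q) = \frac{1}{2} \int |p - q|$ and a trapezoidal rule on a user-chosen interval; the function name `total_variation` and the integration settings are illustrative assumptions, not the authors' implementation.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of the normal distribution N(mean, sd^2)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def total_variation(p, q, lo, hi, m=4000):
    """Approximate 0.5 * integral of |p - q| over [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / m
    vals = [abs(p(lo + j * h) - q(lo + j * h)) for j in range(m + 1)]
    return 0.5 * h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

As a check, for the two unit-variance normal densities centred at 0 and 1 the exact value is $2\Phi(1/2) - 1 = \operatorname{erf}(1/(2\sqrt{2}))$, which the sketch reproduces to several decimal places on the interval $[-10, 11]$.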
We also provide for the Gaussian mixture the values of the square root of the $\chi^2$ divergence between the estimated model and the true mixture (it gives infinite values for the Weibull experiment because of the sensitivity of the $\chi^2$ to differences in the tail of the distribution). The $\chi^2$ criterion is defined by: We also apply our algorithms to a dataset of the velocities of galaxies where only estimates of the parameters are provided. We use different $\varphi$-divergences, namely the Hellinger, the Pearson $\chi^2$ and the Neyman $\chi^2$. For the MDPD, we use $a = 0.5$, a choice which gave the best tradeoff between robustness and efficiency according to the simulation results of Al Mohamad (2018). For the proximal term, we use $\psi(t) = (\sqrt{t} - 1)^2 / 2$. The methods are compared with the EM algorithm. All the experiments are carried out using the statistical tool R (R Core Team, 2013).

A Two-Component Gaussian Mixture
We consider a Gaussian mixture with true parameters $\lambda = 0.35$, $\mu_1 = 2$, $\mu_2 = 1.5$ and fixed variances $\sigma_1^2 = \sigma_2^2 = 1$. Figure 1 shows the values of the estimated divergence for both formulas (4) and (5) on a logarithmic scale at each iteration of the one-step algorithm (11) and of the two-step algorithm (9) and (10) until convergence. The results are presented in Table 1.
Contamination is done by adding, in the original sample, random observations from the uniform distribution $\mathcal{U}[-5, -2]$ to the 5 lowest values. We also add random observations from the uniform distribution $\mathcal{U}[2, 5]$ to the 5 largest values. Results are presented in Table 2. It is clear that both the MDPD and the kernel-based MD$\varphi$DE are more robust than the EM algorithm and the classical MD$\varphi$DE.
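The contamination scheme described above can be sketched in Python as follows. This reads the description as shifting the 5 smallest observations by independent draws from the uniform distribution on $[-5, -2]$ and the 5 largest by independent draws from the uniform distribution on $[2, 5]$; the function name `contaminate` and its arguments are illustrative.

```python
import random

def contaminate(sample, n_out=5, rng=random):
    """Shift the n_out lowest values downwards and the n_out largest values upwards.

    Assumes len(sample) >= 2 * n_out. Returns a new list; the input is not modified.
    """
    ys = sorted(sample)
    low = [y + rng.uniform(-5, -2) for y in ys[:n_out]]    # outliers in the left tail
    high = [y + rng.uniform(2, 5) for y in ys[-n_out:]]    # outliers in the right tail
    return low + ys[n_out:-n_out] + high
```

Passing a seeded `random.Random` instance as `rng` makes the contaminated sample reproducible.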

The Two-Component Weibull Mixture Model Revisited
We consider the Weibull mixture (12) with 1 = 0.5, 2 = 3 and = 0.35. We let = ( 1 , 2 ) denote the shape parameters of the Weibull mixture model p ( , ) , and = ( 1 , 2 ) for p ( , ) . Contamination is done by replacing 10 observations of each sample chosen randomly by 10 i.i.d. observations drawn from a Weibull distribution with shape 0.9 and scale 3. Results are presented in Table 3.
When there are no outliers, all estimation methods have the same performance. We see no significant difference between the results obtained using the one-step algorithm (11) and those obtained using the two-step algorithm (9) and (10) with the Hellinger divergence. Differences appear when we use the Neyman $\chi^2$-divergence with the classical MD$\varphi$DE. This shows again the difficulty in handling the supremal form of the dual formula (4).

The Galaxies Dataset
We study a dataset of the velocities at which 82 galaxies in the Corona Borealis region move away from our galaxy (Figure 2). The dataset is available from the R package MASS. The objective of the study is to figure out if there are superclusters in these galaxies. More details about this dataset can be found in Roeder (1990). Roeder (1990) estimates the number of modes to be between three and seven, and a test of unimodality is rejected at level 0.01. Using the R package mclust, we find that a mixture with four components best fits the data according to the BIC criterion. Therefore, we fix the number of components at four and assume that all components have the same variance to avoid degeneracy. We estimate a mixture of four Gaussian components using our algorithms.
For $\varphi$-divergences, we use $\varphi = \varphi_{0.5}$, which corresponds to the Hellinger divergence. For the MDPD, we use $a = 0.5$. The results are provided in Table 4. The results obtained with the one-step algorithm differ slightly from those obtained with the two-step algorithm for the estimates of the proportions and the variance. The difference is almost negligible for the centres of the clusters. On the other hand, all the results support the existence of a cluster with high proportion at a velocity of around $20 \times 10^3$ km s$^{-1}$.

CONCLUSIONS
We presented in this paper a two-step proximal-point algorithm whose objective was the minimization of (an estimate of) a statistical divergence for a mixture model. The EM algorithm constituted a special case. We established some convergence properties of the algorithm under mild conditions. Our simulation results showed that the proximal algorithm worked. The two-step algorithm (9) and (10) showed no difference from its one-step competitor (11), which was very encouraging, especially since the dimension of the optimization was reduced at each step in the two-step algorithm. Simulations showed again the robustness of $\varphi$-divergences and the DPD against outliers in comparison to the MLE calculated from the EM algorithm. The role of the proximal term and its influence on the convergence of the algorithm were not discussed here and might be considered in future work.

APPENDIX
The following are necessary assumptions required for the convergence results.
Proof of Proposition 1.
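(a) Recurrence (9) and the definition of the argmin give:
$$\hat{D}(\lambda^{k+1}, \theta^k) + D_\psi\big((\lambda^{k+1}, \theta^k), \phi^k\big) \le \hat{D}(\lambda^k, \theta^k) + D_\psi\big((\lambda^k, \theta^k), \phi^k\big) \le \hat{D}(\lambda^k, \theta^k). \tag{A.1}$$
The second inequality is obtained using the fact that $D_\psi(\phi, \phi) = 0$. Using recurrence (10), we get:
$$\hat{D}(\lambda^{k+1}, \theta^k) + D_\psi\big((\lambda^{k+1}, \theta^k), \phi^k\big) \ge \hat{D}(\lambda^{k+1}, \theta^{k+1}) + D_\psi\big(\phi^{k+1}, \phi^k\big) \ge \hat{D}(\lambda^{k+1}, \theta^{k+1}). \tag{A.2}$$
The second inequality is obtained using the fact that $D_\psi(\phi, \phi') \ge 0$. The conclusion is reached by combining the two inequalities (A.1) and (A.2).
(b) Using the decreasing property previously proved in (a), we have by recurrence
$$\hat{D}(\phi^k) \le \hat{D}(\phi^{k-1}) \le \cdots \le \hat{D}(\phi^0).$$
The result follows directly by definition of $\Phi^0$.
(c) By induction on $k$. For $k = 0$, clearly $\phi^0 = (\lambda^0, \theta^0)$ is well defined (a choice we make).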
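Suppose for some $k \ge 0$ that $\phi^k = (\lambda^k, \theta^k)$ exists. The infimum in (9) can be calculated over the $\lambda$'s such that $(\lambda, \theta^k) \in \Phi^0$. Indeed, suppose there exists a $\lambda$ such that
$$\hat{D}(\lambda, \theta^k) + D_\psi\big((\lambda, \theta^k), \phi^k\big) \le \hat{D}(\lambda^k, \theta^k) + D_\psi\big(\phi^k, \phi^k\big) = \hat{D}(\phi^k).$$
Since the proximal term is nonnegative and $\hat{D}(\phi^k) \le \hat{D}(\phi^0)$, this means that $(\lambda, \theta^k) \in \Phi^0$, so that the infimum need not be calculated over all values of $\lambda$ and can be restricted to values which satisfy $(\lambda, \theta^k) \in \Phi^0$. Define now
$$\Lambda_k = \{\lambda : (\lambda, \theta^k) \in \Phi^0\},$$
which contains $\lambda^k$ because $\phi^k \in \Phi^0$. Therefore, $\Lambda_k$ is not empty. Moreover, it is clearly compact since $\Phi^0$ is compact. Finally, by assumption A0, the optimized function is lower semicontinuous, so that it attains its infimum on the compact set $\Lambda_k$. We may now define $\lambda^{k+1}$ as any vector attaining this infimum.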
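The second part of the proof treats the definition of $\theta^{k+1}$ and is carried out analogously to that of $\lambda^{k+1}$. Convergence of the sequence $\{\hat{D}(\phi^k)\}_k$ in both algorithms results from the fact that it is nonincreasing and bounded. It is nonincreasing by virtue of (a). Boundedness results from the lower semicontinuity of $\phi \mapsto \hat{D}(\phi)$ and the compactness of the set $\Phi^0$. ◼
Proof of Proposition 2. Let $\{(\lambda^{n_k}, \theta^{n_k})\}_k$ be a convergent subsequence of $\{(\lambda^k, \theta^k)\}_k$ which converges to $(\lambda^\infty, \theta^\infty)$. First of all, $(\lambda^\infty, \theta^\infty) \in \Phi^0$, because $\Phi^0$ is closed and the subsequence $\{(\lambda^{n_k}, \theta^{n_k})\}_k$ is a sequence of elements of $\Phi^0$ (proved in Proposition 1(b)). Let us show that the subsequence $(\lambda^{n_k+1}, \theta^{n_k+1})$ also converges to $(\lambda^\infty, \theta^\infty)$. We simply have:
$$\|(\lambda^{n_k+1}, \theta^{n_k+1}) - (\lambda^\infty, \theta^\infty)\| \le \|(\lambda^{n_k+1}, \theta^{n_k+1}) - (\lambda^{n_k}, \theta^{n_k})\| + \|(\lambda^{n_k}, \theta^{n_k}) - (\lambda^\infty, \theta^\infty)\|.$$
Since $(\lambda^{k+1}, \theta^{k+1}) - (\lambda^k, \theta^k) \to 0$ and $(\lambda^{n_k}, \theta^{n_k}) \to (\lambda^\infty, \theta^\infty)$, we conclude that $\phi^{n_k+1} \to \phi^\infty$. By definition of $\lambda^{n_k+1}$ and $\theta^{n_k+1}$, they achieve the infimum in recurrences (9) and (10), respectively. Therefore, the gradient of the optimized function is zero at each step. In other words:
$$\nabla_\lambda \hat{D}(\lambda^{n_k+1}, \theta^{n_k}) + \nabla_\lambda D_\psi\big((\lambda^{n_k+1}, \theta^{n_k}), \phi^{n_k}\big) = 0, \qquad \nabla_\theta \hat{D}(\lambda^{n_k+1}, \theta^{n_k+1}) + \nabla_\theta D_\psi\big((\lambda^{n_k+1}, \theta^{n_k+1}), \phi^{n_k}\big) = 0.$$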
Since both $\{\phi^{n_k+1}\}$ and $\{\phi^{n_k}\}$ converge to the same limit $\phi^\infty$, then setting $\phi^\infty = (\lambda^\infty, \theta^\infty)$, we get that both $\lambda^{n_k+1}$ and $\lambda^{n_k}$ tend to $\lambda^\infty$. We also have that both $\theta^{n_k+1}$ and $\theta^{n_k}$ tend to $\theta^\infty$. The continuity of the two gradients (assumptions A1 and A4) implies that:
$$\nabla \hat{D}(\lambda^\infty, \theta^\infty) + \nabla D_\psi(\phi^\infty, \phi^\infty) = 0.$$
However, $\nabla D_\psi(\phi, \phi) = 0$, so that $\nabla \hat{D}(\phi^\infty) = 0$. ◼
Proof of Proposition 3. By contradiction, suppose that $\phi^{k+1} - \phi^k$ does not converge to 0. Using the compactness of $\Phi^0$, we can prove the existence of a subsequence of $\{\phi^k\}_k$ such that $\phi^{N(k)+1} - \phi^{N(k)}$ does not converge to 0, and such that
$$\phi^{N(k)+1} \to \tilde{\phi}, \qquad \phi^{N(k)} \to \bar{\phi}, \qquad \tilde{\phi} \ne \bar{\phi}.$$
The real sequence $\{\hat{D}(\phi^k)\}_k$ converges, as proved in Proposition 1(c), so that both sequences $\hat{D}(\phi^{N(k)+1})$ and $\hat{D}(\phi^{N(k)})$ converge to the same limit. From the proof of Proposition 1, we can deduce the following inequality:
$$\hat{D}(\phi^{k+1}) + D_\psi(\phi^{k+1}, \phi^k) \le \hat{D}(\phi^k), \tag{A.3}$$
which also holds when $k$ is replaced by $N(k)$. By passing to the limit on $k$, we get $D_\psi(\tilde{\phi}, \bar{\phi}) \le 0$. However, $D_\psi(\tilde{\phi}, \bar{\phi}) \ge 0$, so that it is equal to zero. Using assumption A3, $D_\psi(\tilde{\phi}, \bar{\phi}) = 0$ implies that $\tilde{\phi} = \bar{\phi}$. This contradicts the assumption that $\phi^{k+1} - \phi^k$ does not converge to 0. The second part of the proposition is a direct result of Proposition 2. ◼
Proof of Corollary 1. Since the sequence $(\phi^k)_k$ is bounded and satisfies $\phi^{k+1} - \phi^k \to 0$, Theorem 28.1 of Ostrowski (1966) implies that the set of accumulation points of $(\phi^k)_k$ is a connected compact set. It is not empty since $\Phi^0$ is compact. The remainder of the proof is a direct result of Theorem 3.3.1 of Chrétien & Hero (2008). The strict concavity of the objective function around an accumulation point is replaced here by the strict convexity of the estimated divergence. ◼
Proof of Proposition 4. If $\{\phi^k\}_k$ converges to, say, $\phi^\infty$, then the result follows simply from Proposition 2. Suppose now that $(\phi^k)_k$ does not converge. Since $\Phi^0$ is compact and $\phi^k \in \Phi^0$ for all $k$ (proved in Proposition 1), there exists a subsequence $\{\phi^{N_0(k)}\}_k$ such that $\phi^{N_0(k)} \to \tilde{\phi}$. Let us take the subsequence $(\phi^{N_0(k)-1})_k$.
This subsequence does not necessarily converge; still, it is contained in the compact set $\Phi^0$, so that we can extract a further subsequence $\{\phi^{N_1 \circ N_0(k)-1}\}_k$ which converges to, say, $\bar{\phi}$. Now, the subsequence $\{\phi^{N_1 \circ N_0(k)}\}_k$ converges to $\tilde{\phi}$, because it is a subsequence of $\{\phi^{N_0(k)}\}_k$. We have so far proved the existence of two convergent subsequences $\phi^{N(k)-1}$ and $\phi^{N(k)}$ with a priori different limits. For simplicity and without any loss of generality, we will consider these subsequences to be $\phi^k$ and $\phi^{k+1}$, respectively.
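We use again inequality (A.3):
$$\hat{D}(\phi^{k+1}) + D_\psi(\phi^{k+1}, \phi^k) \le \hat{D}(\phi^k).$$
By taking the limits of the two sides of the inequality as $k$ tends to infinity, and using the continuity of the two functions $\hat{D}$ and $D_\psi$, we have
$$\hat{D}(\tilde{\phi}) + D_\psi(\tilde{\phi}, \bar{\phi}) \le \hat{D}(\bar{\phi}).$$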
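Recall that under assumptions A1 and A2, the sequence $\{\hat{D}(\phi^k)\}_k$ converges due to Proposition 1, so that it has the same limit for any subsequence, that is, $\hat{D}(\tilde{\phi}) = \hat{D}(\bar{\phi})$. We also use the fact that the distance-like function $D_\psi$ is nonnegative to deduce that $D_\psi(\tilde{\phi}, \bar{\phi}) = 0$. Writing this equation explicitly gives
$$\sum_{i=1}^{n} \int_{\mathcal{X}} \psi\!\left( \frac{h_i(x|\tilde{\phi})}{h_i(x|\bar{\phi})} \right) h_i(x|\bar{\phi})\, dx = 0.$$
This is a sum of nonnegative terms. Thus, each term is also zero, that is
$$\forall i \in \{1, \ldots, n\}, \quad \int_{\mathcal{X}} \psi\!\left( \frac{h_i(x|\tilde{\phi})}{h_i(x|\bar{\phi})} \right) h_i(x|\bar{\phi})\, dx = 0.$$
The integrands are nonnegative functions, so that almost everywhere we have
$$\forall i \in \{1, \ldots, n\}, \quad \psi\!\left( \frac{h_i(x|\tilde{\phi})}{h_i(x|\bar{\phi})} \right) h_i(x|\bar{\phi}) = 0, \quad dx\text{-a.e.}$$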