Likelihood Maximization and Moment Matching in Low SNR Gaussian Mixture Models

We derive an asymptotic expansion for the log likelihood of Gaussian mixture models (GMMs) with equal covariance matrices in the low signal-to-noise regime. The expansion reveals an intimate connection between two types of algorithms for parameter estimation: the method of moments and likelihood optimizing algorithms such as Expectation-Maximization (EM). We show that likelihood optimization in the low SNR regime reduces to a sequence of least squares optimization problems that match the moments of the estimate to the ground truth moments one by one. This connection is a stepping stone toward the analysis of EM and maximum likelihood estimation in a wide range of models. A motivating application for the study of low SNR mixture models is cryo-electron microscopy data, which can be modeled as a GMM with algebraic constraints imposed on the mixture centers. We discuss the application of our expansion to algebraically constrained GMMs, among other example models of interest.


Introduction
Gaussian mixtures are a useful model to describe data in a wide variety of applications. Nevertheless, strong theoretical guarantees on the performance of classical algorithms for inference in Gaussian mixture models (GMMs) are lacking. This is primarily due to the complicated structure of the GMM log likelihood landscape. The most popular algorithm for inference is Expectation-Maximization (EM), an iterative algorithm which performs "soft assignment" of observations to mixture components. Although EM maximizes a surrogate function to the log likelihood at each step, it can nevertheless be viewed as gradient ascent on the log likelihood in the setting we study here. As such, analyzing it and other log likelihood optimizing algorithms is challenging.
Most existing guarantees are for the case in which the component distributions of the mixture are "well-separated". In [BWY17] the authors characterize the basin of attraction in which the EM algorithm is guaranteed to converge to the global maximum of the likelihood of a well-separated two component mixture. This is generalized in [XHM16], in which the authors provide a global analysis of the convergence of EM for two component mixtures. A further generalization is obtained in [YYS17], in which the basin of attraction for gradient EM in arbitrary mixtures with equal covariances is quantified, also under the assumption of some separation between component distributions. For mixtures with three or more components, [JZB + 16] shows that there are well-separated mixtures for which the log likelihood landscape has bad local maxima. Moreover, they show that in some cases, the EM algorithm can converge to these bad critical points with high probability.
Other algorithms have been proposed for learning the parameters of poorly separated GMMs in polynomial time [BS10, KMV10] without relying on the log likelihood; the latter paper is based on the method of moments. While EM and its variants are the most widely used methods for inference in GMMs, the method of moments is another class of inference methods which bypasses the log likelihood entirely. This method was proposed by Karl Pearson in his 1894 paper [Pea94], which also introduces the Gaussian mixture inference problem for the first time. Pearson shows that the parameters of a mixture of two one-dimensional Gaussians can be deduced from the mixture's first six moments. In general, the approach is to form estimates from the data of enough moments of the distribution to uniquely specify it. The challenge is then to "invert" the moments to recover the ground truth parameters. As an example, for a K-component uniform mixture with centers µ^(1), . . . , µ^(K) ∈ R^d, which we collectively denote µ, the moments are defined as

T_1(µ) = (1/K) Σ_{j=1}^K µ^(j), T_2(µ) = (1/K) Σ_{j=1}^K µ^(j) ⊗ µ^(j),

with higher moments T_k(µ) given by higher order tensors. Given estimates of the ground truth moment tensors T_k(µ*), moment inversion amounts to finding µ such that T_k(µ) = T_k(µ*), k = 1, 2, . . . . In some models, the moment tensors take a particularly convenient form and can be inverted explicitly. When this is not possible, one alternative approach is to minimize the objective function

(1.1) min_µ Σ_k λ_k ||T_k(µ) − T_k(µ*)||²,

where λ_k are regularizing weights.
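For concreteness, the moment tensors and the objective (1.1) can be written in a few lines. The following is a minimal sketch of our own; the helper names are illustrative, not from the paper.

```python
import numpy as np

def outer_power(v, k):
    """k-fold outer (tensor) power v^{(x)k} of a vector v."""
    T = v
    for _ in range(k - 1):
        T = np.multiply.outer(T, v)
    return T

def moment_tensor(mu, k):
    """T_k(mu) = (1/K) sum_j (mu^(j))^{(x)k} for centers mu of shape (K, d)."""
    return sum(outer_power(row, k) for row in mu) / mu.shape[0]

def moment_objective(mu, mu_star, lams):
    """Weighted least squares objective (1.1): sum_k lam_k ||T_k(mu) - T_k(mu*)||^2."""
    return sum(
        lam * np.sum((moment_tensor(mu, k) - moment_tensor(mu_star, k)) ** 2)
        for k, lam in enumerate(lams, start=1)
    )

mu_star = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(moment_tensor(mu_star, 1))   # first moment: [0, 0]
print(moment_tensor(mu_star, 2))   # second moment: [[1, 0], [0, 0]]

# the objective vanishes at the truth and is invariant to relabeling the centers
assert moment_objective(mu_star, mu_star, [1.0, 1.0, 1.0]) == 0.0
assert moment_objective(mu_star[::-1], mu_star, [1.0, 1.0, 1.0]) == 0.0
```

The relabeling invariance in the last line is the usual permutation symmetry of mixture models: the moments, and hence the objective, only see the centers as a set.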
In this paper, we study the log likelihood landscape of Gaussian mixture models in R^d with the following defining characteristics: (1) the covariance matrices of the mixture components are all the same, and (2) the center of each mixture component is small in norm relative to |Σ|^{1/d}, where Σ is the covariance of each of the mixture components. We will think of Σ as being known (although this is not required for our main result), and the mixture centers as the "signal" we wish to estimate. Since this is made more difficult by larger variances, one can think of mixtures with this second feature as having low signal-to-noise ratio (SNR).
We show an intimate connection between log likelihood optimization and the method of moments in the low SNR regime. We do so by deriving an asymptotic series expansion of the GMM log likelihood with respect to a small parameter related to the SNR. This expansion illuminates the structure of the likelihood landscape. It shows that in the low SNR regime, log likelihood maximization reduces to a sequence of least squares minimization problems, in which successively higher moments are matched to those of the true distribution on the manifold on which all previous moments have been fixed to the ground truth values. For the uniform mixture example, these minimization problems take the form

(1.2) min_{µ ∈ V_{k−1}} ||T_k(µ) − T_k(µ*)||², k = 1, 2, . . . ,

where V_0 = R^{Kd} and V_k = {µ ∈ R^{Kd} | T_ℓ(µ) = T_ℓ(µ*), ℓ = 1, . . . , k}.
This is very similar to the strategy of moment inversion described above. Indeed, taking weights λ 1 ≫ λ 2 ≫ . . . in (1.1) effectively reduces that minimization problem to the sequence of individual moment matching problems (1.2).
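This reduction can be watched in a small experiment of our own (weights, step size, and initialization are arbitrary illustrative choices): gradient descent on the weighted objective with λ_1 ≫ λ_2 first drives the first moment to its target and only then matches the second moment.

```python
import numpy as np

theta_star = np.array([1.0, -1.0])            # ground truth centers, uniform 2-mixture on R
m1_star, m2_star = theta_star.mean(), (theta_star**2).mean()
lam1, lam2 = 100.0, 1.0                       # lam_1 >> lam_2

def grad(t):
    d1 = t.mean() - m1_star                   # first moment mismatch
    d2 = (t**2).mean() - m2_star              # second moment mismatch
    # gradient of lam1*d1^2 + lam2*d2^2 with respect to t
    return lam1 * d1 + 2.0 * lam2 * d2 * t

t = np.array([0.5, -0.1])                     # initialization
for _ in range(50000):
    t = t - 2e-3 * grad(t)

# both moments end up matched: the first during a fast initial phase, then the second
print(t.mean(), (t**2).mean())                # close to 0 and 1, respectively
```

The large weight λ_1 makes the first-moment mismatch contract at a much faster rate than the second, which is the mechanism by which the penalized problem (1.1) mimics the sequential problems (1.2).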
This connection allows one to relate the roughness of the log-likelihood landscape with the roughness of the landscape of least squares moment matching objectives. In Section 4, we will classify the critical points of this moment matching landscape in two illustrative examples: a uniform mixture of two Gaussians in arbitrary dimension and an arbitrary (finite) mixture of Gaussians in one dimension. In general, however, understanding the roughness of this landscape can be a highly non-trivial task and is outside the scope of this paper.
The motivation for Taylor expanding the log likelihood comes from [BRW17]. In that paper, Taylor expansions for upper and lower bounds on the log likelihood are derived. However, in order to analyze algorithms which depend on the landscape of the log likelihood (i.e. on the function's derivatives), a Taylor expansion of the log likelihood itself is needed. For a certain class of models, a recent paper [FSWW20], the fruit of parallel research efforts, also establishes such an expansion, as we will discuss in more detail below.
A natural class of models to study in the low SNR regime are algebraically structured mixture models. A prime example is the orbit retrieval model, also known as multi-reference alignment (MRA). In this class of models, a known algebraic constraint relates the centers of the mixture components to one another. Specifically, the centers are all determined from any one center by applying to it the elements of a subgroup of rotations on R d . In particular, the centers therefore all have the same norm. This class of models is motivated by problems arising in molecule imaging using Cryo-Electron Microscopy (cryo-EM). The goal is to infer the density of a molecule from noisy observations of it in different unknown orientations. At a first approximation the data can be described by a GMM in which the centers are constrained to be observations of the same (unknown) molecule from different viewing directions. We describe this model in more detail in Section 4.3.
In a recent paper [FSWW20]^1, the authors derive an asymptotic expansion for the log likelihood of the orbit retrieval model. Remarkably, the authors then leverage this expansion and the algebraic structure present in the orbit retrieval problem to analyze the critical points of the log-likelihood landscape (via the critical points of the moment matching objectives (1.2)). The expansion we derive in the more general context of GMMs reduces to that of [FSWW20] when the model is of the orbit retrieval type. While an analysis of the complexity of the moment-matching landscape in the general case is beyond the scope of this paper, the results in [FSWW20] on the orbit retrieval model illustrate how such an analysis can be used to draw conclusions about the log likelihood landscape and maximum likelihood estimation.

^1 The authors learned of this work at an earlier stage of preparing the current manuscript, and have since leveraged insights of [FSWW20] to help motivate and simplify some of our arguments. The derivation of the expansion in the case of general mixture models appears to require a different set of techniques, and our arguments are quite different overall.
We note that our likelihood expansion applies to other important algebraically structured models as well, such as heterogeneous MRA, in which the centers constitute the orbits of several points in R d under a group action. Cryo-EM data can be modeled this way, since one often observes a molecule in several different conformations. The distinct orbits are then the rotations of these distinct conformations.
The method of moments is a natural approach for inference in algebraically structured models in the low SNR regime, and a theoretical understanding of the method has been developed in this setting. With the help of our asymptotic expansion, we expect that some of this understanding can be transferred to draw conclusions about likelihood optimizing methods such as EM. We discuss this as well as potential implications of the expansion beyond algebraically structured models in Section 4.3.
We have alluded to the fact that in the model setting we study, EM is the same as gradient descent on the negative log likelihood. We make this precise in Section 3. Specifically, we show that for finite mixtures and orbit retrieval models, both standard EM and a variant known as gradient EM are given by gradient ascent on the log likelihood with respect to the centers of the mixture. This implies that an understanding of the likelihood landscape directly translates into an understanding of the fixed points of EM and their basins of attraction. However, we will also show that the standard EM algorithm is suboptimal in the low SNR regime, in that it corresponds to gradient descent with too small a step size. This was shown in [FSWW20] for the orbit retrieval model. Thus gradient EM is a better option, since the step size is user-specified.
We note that in order to use our expansion to draw conclusions about EM and maximum likelihood estimation, a finite sample analysis of the likelihood landscape is required. Here, we focus only on the population log likelihood. In [FSWW20], concentration of the sample log likelihood and its first two derivatives around their population analogues is established for the orbit retrieval model. We also note that while our asymptotic expansion does not require the ground truth mixture weights to be known, we assume this is the case in our discussion of the consequences of the expansion.
Notation. For x ∈ R^d, we let g(x) denote the probability density function of the standard normal Gaussian N(0, I),

g(x) = (2π)^{−d/2} e^{−||x||²/2}.

For a probability measure ρ on R^d, we write supp(ρ) to denote its support. We write θ ∼ ρ to denote that θ ∈ R^d is a random variable with distribution ρ (boldface letters will always denote random variables). For θ ∼ ρ with ρ compactly supported, we define the moment tensors

T_k(ρ) = E_{θ∼ρ} θ^⊗k, k = 1, 2, 3, . . . .

We use T_{1:k} as shorthand for T_1, . . . , T_k. For two tensors T, S ∈ (R^d)^⊗k with real entries, we let ⟨T, S⟩ denote the entry-wise inner product of their vectorizations in R^{d^k}, and ||T|| = ⟨T, T⟩^{1/2}.

Model Description and Main Theorem
Let Y ∈ R^d be distributed according to a Gaussian mixture in which the component distributions have the same, nondegenerate covariance Σ. The assumption of equal covariances allows us to write Y as a Gaussian perturbation Σ^{1/2} Z of a random variable θ ∈ R^d encoding the centers of the mixture components and the mixture weights. For example, if Y is a uniform mixture of K Gaussian distributions N(µ_j, Σ), j = 1, . . . , K, then θ is a discrete random variable taking the value µ_j with probability 1/K, j = 1, . . . , K. In general, we have

(2.1) Y = Σ^{1/2} Z + θ, θ ∼ ρ, Z ∼ N(0, I_d),

with Z independent of θ. If ρ is a sum of point masses, then Y is a discrete mixture of component distributions. If ρ has a density, then Y is a continuous mixture.
We will consider maximum likelihood estimation of ρ = ρ * given independent identically distributed observations y i ∼ Y, i = 1, . . . , N in the case N → ∞. The asymptotic expansion of the log likelihood presented in the next section is valid for the family of compactly supported measures ρ, and we therefore present it in this most general setting. Importantly, this general setting also includes the parametric framework in which it is known that ρ * belongs to a set parameterized by a finite number of variables.
Note that if Σ is known, then we can transform (2.1) into a mixture of spherical distributions by multiplying Y by Σ^{−1/2}. Thus, the case in which the component distribution covariances are known, equal, and nondegenerate is equivalent to the model

(2.2) Y = σZ + θ, θ ∼ ρ, Z ∼ N(0, I_d).

We therefore assume the covariance is σ²I from now on. (We do not set σ = 1 because it will be convenient to perform Taylor expansions in 1/σ.) Now, the distribution ρ induces a density q_ρ(y) on Y. To compute q_ρ, note that

q_ρ(y) = E_{θ∼ρ} [ σ^{−d} g((y − θ)/σ) ].

The population log likelihood L(ρ; ρ*) is then given by

L(ρ; ρ*) = E_{Y∼q_{ρ*}} log E_{θ∼ρ} exp( −||Y − θ||²/(2σ²) ),

where we have discarded the normalization constant. Writing Y = σZ + θ*, θ* ∼ ρ*, we can also express the log likelihood in the following form:

L(ρ; ρ*) = E_{Z, θ*} log E_{θ∼ρ} exp( −||σZ + θ* − θ||²/(2σ²) ).

Abusing notation, we will sometimes write L(θ; θ*) for L(ρ; ρ*). Note that ρ = ρ* is the unique global maximizer (up to measure zero) of L in the space of probability distributions on R^d. This is a consequence of the fact that L(ρ; ρ*) = −D_KL(q_{ρ*} || q_ρ) + const., where D_KL is the Kullback–Leibler divergence between q_{ρ*} and q_ρ and the constant term depends on ρ* only.

The GMM formulation (2.2) lends itself to the signal processing viewpoint of the statistical estimation problem. Namely, one can consider the observations y_i as draws from the "signal" distribution ρ corrupted by the additive noise σZ. This reasoning, as well as the likelihood expansion in the following section, motivate us to define the signal-to-noise ratio (SNR) as follows:

Definition 1. Let ρ be a compactly supported measure on R^d. We define the SNR as

SNR(ρ, σ) = sup_{θ ∈ supp(ρ)} ||θ − T_1(ρ)||² / σ².

We note that this definition of SNR is not sensitive to how ||θ|| varies over θ ∈ supp(ρ*). For example, consider a discrete distribution ρ* concentrated on ±θ¹_*, ±θ²_*, where ||θ¹_*|| ≪ ||θ²_*||. Then SNR(ρ*, σ) = ||θ²_*||²/σ². One could argue that the SNR should depend not just on ||θ²_*||/σ but also on how small ||θ¹_*|| is relative to ||θ²_*||.
However, we will see that for our purposes this is a natural definition of SNR. Indeed, it is the scale parameter which emerges in the asymptotic expansion. The smaller this value, the more clear-cut the separation between successive moment-matching stages, as will be explained in Section 2.1.
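In code, Definition 1 and the insensitivity just described look as follows (a small sketch of our own; the function name is illustrative):

```python
import numpy as np

def snr(points, weights, sigma):
    """SNR(rho, sigma) = max_{theta in supp(rho)} ||theta - T_1(rho)||^2 / sigma^2
    for a discrete measure rho = sum_j weights[j] * delta_{points[j]}."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    T1 = np.average(points, axis=0, weights=weights)   # first moment
    return max(np.sum((p - T1) ** 2) for p in points) / sigma ** 2

sigma = 5.0
# discrete measure on +-theta1, +-theta2 with ||theta1|| << ||theta2||:
# the SNR only sees the largest deviation from the mean
a = snr([[0.1], [-0.1], [1.0], [-1.0]], [0.25] * 4, sigma)
b = snr([[0.01], [-0.01], [1.0], [-1.0]], [0.25] * 4, sigma)
assert a == b == 1.0 / sigma**2
```

Shrinking the small centers leaves the SNR untouched, exactly the insensitivity discussed above.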
Guiding Examples. It is helpful to keep in mind the following two classes of GMMs as examples of models to which the log likelihood expansion can be applied. Both classes (i.e. families of measures ρ) can be parameterized by a finite number of variables, and we write the SNR and moments as functions of these parameters.
Discrete Finite Mixture Model. This class of models can be described by Y = σZ + θ ∈ R^d where θ ∼ ρ, a finite sum of point masses. In other words, ρ is of the form

ρ = Σ_{j=1}^K α_j δ_{θ_j}, α_j > 0, Σ_{j=1}^K α_j = 1.

We have

T_k(θ, α) = Σ_{j=1}^K α_j θ_j^⊗k, k = 1, 2, . . . ,

where θ, α are shorthand for (θ_j, α_j)_{j=1}^K.
Orbit Retrieval. Let G ⊂ O(d) ⊂ R^{d×d} be a possibly infinite subgroup of the group of orthogonal transformations of R^d. Let γ be a measure on G, and g ∈ G denote the random variable with distribution γ. In the orbit retrieval model, we have

(2.9) Y = σZ + gθ, g ∼ γ, Z ∼ N(0, I_d).

Here, gθ denotes the action of g on θ, in this case multiplication by a matrix. Note that θ ∈ R^d is deterministic. In general, both the point θ whose orbit under G constitutes the centers of the GMM, and the distribution γ, can be unknown. We have

(2.10) T_k(θ, γ) = E_{g∼γ} (gθ)^⊗k, k = 1, 2, . . . .

The term orbit retrieval is also sometimes used to denote the model in which γ is known and given by the Haar measure (the uniform distribution on G). Due to the invariance of the Haar measure under the action of G, we have T_1 = gT_1 for any g ∈ G, so that the SNR is given by SNR(θ, γ = Haar, σ) = ||θ − T_1||²/σ².

An example of a discrete orbit retrieval model is multireference alignment (MRA). Here, G = {g_0, . . . , g_{d−1}} is the group which acts on vectors in R^d by cyclically shifting their entries. In other words, we have

(g_j θ)_i = θ_{(i−j) mod d}, i = 0, . . . , d − 1.

The measure γ is therefore a sum of point masses, and induces the following distribution on θ:

θ = g_j θ with probability γ({g_j}), j = 0, . . . , d − 1.

As an example of a continuous mixture, consider rotations in R², distributed uniformly over angles of rotation ω ∈ [0, 2π). Then the random variable θ ∈ R² is distributed as

θ = R_ω θ, ω ∼ Unif[0, 2π),

where R_ω ∈ O(2) denotes rotation by the angle ω.

2.1. Main results. In this section we state our main result, the asymptotic expansion of the log likelihood function. Recall that the log likelihood is given by

L(ρ; ρ*) = E_{Z, θ*} log E_{θ∼ρ} exp( −||σZ + θ* − θ||²/(2σ²) ).

Theorem 2.1. Let θ ∼ ρ and θ* ∼ ρ* be compactly supported random variables on R^d and define

δ = δ(ρ, ρ*) = max{ ||θ − θ*|| | θ ∈ supp(ρ), θ* ∈ supp(ρ*) }.

Let m be a positive integer. If

T_ℓ(ρ) = T_ℓ(ρ*), ℓ = 1, . . . , m − 1

(a vacuous condition when m = 1), then for any σ > 0 we have:

(2.12) −L(ρ; ρ*) = C_m(θ*) + (1/(2 m! σ^{2m})) ||T_m − T*_m||² + ε_m,

where C_m(θ*) is independent of θ and the error term ε_m = ε_m(θ, θ*) is bounded above by

|ε_m| ≤ C (δ/σ)^{2m+2} e^{δ²/σ²},

where C is a d-dependent absolute constant.
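The leading term of the expansion can be sanity-checked numerically. The following sketch (our own illustration, not part of the paper's development) computes −L(θ; θ*) + L(θ*; θ*) = KL(q_{θ*} || q_θ) by quadrature for a one-dimensional two-component mixture, and compares it with the m = 1 leading term ||T_1(θ) − T_1(θ*)||²/(2σ²); the grid size and tolerances are illustrative choices.

```python
import numpy as np

def kl_gap(theta, theta_star, probs, sigma):
    """-L(theta; theta*) + L(theta*; theta*) = KL(q_{theta*} || q_theta),
    computed by trapezoidal quadrature for 1-D discrete mixtures."""
    y = np.linspace(-10 * sigma, 10 * sigma, 200001)
    def q(centers):  # mixture density with the given centers
        return sum(
            p * np.exp(-(y - c) ** 2 / (2 * sigma ** 2)) for p, c in zip(probs, centers)
        ) / (sigma * np.sqrt(2 * np.pi))
    f = q(theta_star) * np.log(q(theta_star) / q(theta))
    return np.sum(0.5 * (f[1:] + f[:-1])) * (y[1] - y[0])

theta_star, theta, probs = [-1.0, 1.0], [-0.8, 1.0], [0.5, 0.5]
dT1 = np.mean(theta) - np.mean(theta_star)     # first moment difference, 0.1
ratios = [
    kl_gap(theta, theta_star, probs, s) / (dT1 ** 2 / (2 * s ** 2)) for s in (8.0, 16.0)
]
print(ratios)   # both ratios are close to 1
```

Since C_m(θ*) does not depend on θ, the gap above isolates the θ-dependent part of −L, and the ratio tends to 1 as σ grows, consistent with an error term of order σ^{−2m−2}.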

Discussion. Suppose the GMM lies in a parameterizable family

{ρ_θ | θ ∈ Θ},

with Θ a set in a finite dimensional space. This allows us to consider gradient based local search algorithms for likelihood optimization in Θ. Theorem 2.1 shows that in the low SNR regime σ → ∞, any such algorithm attempts to match the moments of ρ = ρ_θ to those of ρ* = ρ_{θ*} one by one, starting from the first moment. In other words, likelihood optimization reduces to the sequence of minimization problems

(2.14) min_{θ ∈ V_{m−1}} ||T_m(θ) − T_m(θ*)||², m = 1, 2, . . . ,

where V_0 = Θ and V_k ⊂ Θ, k = 1, 2, . . . are the varieties

V_k = {θ ∈ Θ | T_ℓ(θ) = T_ℓ(θ*), ℓ = 1, . . . , k}.

This is a consequence of the fact that there is a scale separation between ||T_m − T*_m||²/σ^{2m} and ε_m. Indeed, provided ||θ||_∞ = O(1) relative to σ, the former is on the order σ^{−2m} and the latter is on the order σ^{−2m−2}. Consider (2.12) when m = 1. Due to this scale separation, the algorithm will prioritize minimization of ||T_1(θ) − T_1(θ*)||² over that of ε_1. If the minimization is successful, θ will reach the variety V_1. On this variety, the objective function to be minimized is now ||T_2(θ) − T_2(θ*)||²/(4σ⁴) to highest order. The algorithm will continue to step through these distinct minimization stages for m = 1, 2, . . . , provided it does not get stuck in a local minimum or saddle point θ of ||T_m(θ) − T_m(θ*)||² restricted to V_{m−1}, i.e. a critical point for which T_m(θ) ≠ T_m(θ*). This suggests an intimate connection between likelihood optimizing algorithms such as EM and the method of moments in the low SNR regime. The connection between these two classes of algorithms will be discussed further in Section 4.
For nonparametric GMMs in which there is no knowledge of ρ* beyond the compact support assumption, the asymptotic expansion shows that likelihood optimization reduces to a sequence of minimization problems in the space of measures, i.e.

min_{ρ ∈ V_{m−1}} ||T_m(ρ) − T_m(ρ*)||², m = 1, 2, . . . ,

where V_{m−1} is defined analogously to the parametrizable case. We note that the moments are linear in ρ, so that the objective function is quadratic and the varieties are given by linear constraints. The sequence of least squares moment matching problems is therefore a quadratic programming problem, albeit in an infinite dimensional space. While an analysis of the non-parametric setting is outside the scope of this paper, it would be interesting to explore the connection between the method of moments and maximum likelihood estimation in this context. For results on maximum likelihood estimation and inference in non-parametric mixture models, see e.g. [SG20, FD18, Lai78].
Theorem 2.1 is a direct consequence of the following key Lemma. To state it we will need the following two definitions.

Definition 2. Let

T_{k_i} = E_{θ∼ρ_i} θ^{⊗k_i}, i = 1, . . . , n,

i.e. T_{k_i} is the order k_i moment tensor of some distribution ρ_i. Consider the outer product

T = T_{k_1} ⊗ · · · ⊗ T_{k_n}.

We define the total moment order of T to be Σ_{i=1}^n k_i, i.e. the sum of all moment orders. We also say that the total moment order of each entry of T is Σ_{i=1}^n k_i; in other words, the total moment order of products of entries of moment tensors is the sum of all moment orders in the product.
Let ρ and ρ * be compactly supported measures on R d . In the definition and lemma below, we write T k , T * k as shorthand for T k (ρ), T k (ρ * ), respectively. Definition 3. We define V k [T 1:m , T * 1:n ] as the set of all constant coefficient linear combinations of outer products of moment tensors T j , j ≤ m, T * ℓ , ℓ ≤ n, of total moment order k.
We define R k [T 1:m , T * 1:n ] as the set of all constant coefficient linear combinations of products of entries of moment tensors T j , j ≤ m, T * ℓ , ℓ ≤ n, of total moment order k.
Lemma 2.2. Let ρ and ρ* be compactly supported probability measures on R^d. For all m = 1, 2, 3, . . . we have

(2.16) −L(ρ; ρ*) = C_m(ρ*) + Σ_{k=1}^m (1/σ^{2k}) [ (1/(2 · k!)) ||T_k − T*_k||² + ⟨T_k, Q_k⟩ ] + ε_m,

where C_m(ρ*) is independent of ρ. Moreover, Q_k is such that

Q_k ∈ V_k[T_{1:k−1}, T*_{1:k−1}], with Q_k = 0 whenever T_ℓ = T*_ℓ, ℓ = 1, . . . , k − 1,

so that Theorem 2.1 follows by restricting (2.16) to the variety on which the first m − 1 moments are matched. The error term ε_m = ε_m(ρ, ρ*) is the same as in (2.12).
The expansion (2.16) generalizes the log likelihood series expansion (4.10) of [FSWW20], which is specific to the orbit recovery model (2.9) in which the measure γ on the group G is the Haar (uniform) measure. We note that our error bound decays as (1/σ)^{2m+2} when ||θ||_∞ = O(1) as σ → ∞; this is a somewhat tighter bound than that of [FSWW20], in which the error is shown to decay as (log σ/σ)^{2m+2} when σ → ∞ and ||θ||/σ = o(1/log σ). (Note that ||θ||_∞ = ||θ|| for the orbit retrieval model.)
The expansion (4.10) of [FSWW20] is the same as (2.16), except that (4.10) has no term of the form ⟨T_k, Q_k⟩. The following proposition explains why this is so. For the proof, see Proposition B.2 in the appendix.
Proposition 2.3. Let θ, θ* ∈ R^d, and G ⊂ O(d) ⊂ R^{d×d} be a group. Define the random variable g ∈ G distributed according to g ∼ γ, where γ is the Haar measure on G. Let T_k, T*_k be the moment tensors of the distributions gθ, gθ*, g ∼ γ, i.e.

(2.17) T_k = E_{g∼γ} (gθ)^⊗k, T*_k = E_{g∼γ} (gθ*)^⊗k.

Then for every tensor Q ∈ V_k[T_{1:m}, T*_{1:n}], we have

⟨T_k, Q⟩ ∈ R_{2k}[T_{1:m}, T*_{1:n}].

In particular, the inner product ⟨T_k, Q⟩ depends only on moment tensors T_j, j = 1, . . . , m, even if k > m.
The proof relies crucially on the Haar property of γ, namely, that hg has the same distribution as g for all h ∈ G. It follows from the proposition that for the orbit retrieval model, the σ^{−2k} coefficient (for k > 1) in the asymptotic expansion (2.16) of −L(θ; θ*) is given by

(1/(2 · k!)) ||T_k − T*_k||² + r_k, r_k ∈ R_{2k}[T_{1:k−1}, T*_{1:k−1}],

where r_k depends only on moments of order less than k.

Expectation Maximization As Gradient Descent
In this section, we consider the EM algorithm for finite GMMs and the orbit retrieval model, assuming that the mixture weights are known. We show that in these cases, both the standard and gradient EM algorithms reduce to gradient descent on the negative log likelihood with respect to the centers. This equivalence has been pointed out in the literature, in the context of particular models (see, for example, [WZ19,FSWW20]). In light of the structure of the log likelihood landscape given in Theorem 2.1, we show that the gradient descent step size of standard EM is unnecessarily small, leading to slow convergence.
To present the EM algorithm, it will be helpful to slightly reformulate the model.
3.1. Model Reformulation. We will represent mixture models by

Y = σZ + θ_χ, χ ∼ ρ, Z ∼ N(0, I_d),

where χ is a latent membership variable defined on a set X which parameterizes the component distributions of the mixture. We will use χ ∈ X (non bold) to denote a sample of χ.
For the finite mixture model, X = {1, . . . , K} and ρ({j}) = α_j. For the orbit retrieval model, we let G ⊂ O(d) ⊂ R^{d×d} be a possibly infinite group with elements {g_χ | χ ∈ X}, and χ ∼ ρ arbitrary. We let θ ∈ R^d denote the vector which generates all the centers θ_χ through the action of G, i.e. θ_χ = g_χ θ, χ ∈ X.
Since both of these models are parameterized by θ, we denote the density of Y by q_θ. It is given by

q_θ(y) = E_{χ∼ρ} [ σ^{−d} g((y − θ_χ)/σ) ] ∝ E_{χ∼ρ} exp( −||y − θ_χ||²/(2σ²) ).

In the next section we will need the conditional distribution χ | Y. It is given by

q_θ(dχ | Y) = w_θ(Y, χ) ρ(dχ),

where we have defined

w_θ(y, χ) = exp( −||y − θ_χ||²/(2σ²) ) / E_{χ′∼ρ} exp( −||y − θ_{χ′}||²/(2σ²) ).

Finally, the log likelihood is given by

L(θ; θ*) = E_Y log E_{χ∼ρ} exp( −||Y − θ_χ||²/(2σ²) ),

where we have discarded the normalization constant.
3.2. Algorithm Description. Assume θ* is the ground truth parameter. Define the function Q(θ′ | θ; θ*), which is a surrogate for the log likelihood. It is defined as follows:

Q(θ′ | θ; θ*) = E_Y E_{χ∼q_θ(·|Y)} [ −||Y − θ′_χ||²/(2σ²) ] = E_Y E_{χ∼ρ} [ w_θ(Y, χ) ( −||Y − θ′_χ||²/(2σ²) ) ].

Note that if θ is an estimate of the ground truth parameter θ*, then the distribution q_θ(dχ | Y) = w_θ(Y, χ)ρ(dχ) is our best guess for the distribution of the latent membership variable χ given the observed data Y. Given an initialization θ^(0), the standard and gradient EM updates are given by

θ^(t+1) = argmax_{θ′} Q(θ′ | θ^(t); θ*) (standard EM),
θ^(t+1) = θ^(t) + τ ∇_{θ′} Q(θ′ | θ^(t); θ*)|_{θ′=θ^(t)} (gradient EM),

where τ > 0 is some step size. Solving the optimization problem for the standard EM update, we have for finite GMMs the update

(3.5) θ_χ^(t+1) = E_Y [ w_{θ^(t)}(Y, χ) Y ] / E_Y [ w_{θ^(t)}(Y, χ) ], χ = 1, . . . , K,

and for the orbit retrieval model

θ^(t+1) = E_Y E_{χ∼ρ} [ w_{θ^(t)}(Y, χ) g_χ^T Y ].

Proposition 3.1. We have

∇_{θ′} Q(θ′ | θ; θ*)|_{θ′=θ} = ∇_θ L(θ; θ*)

for both the finite mixture and orbit retrieval models. Therefore, gradient based EM with step size τ is the same as gradient ascent on L(θ; θ*) with step size τ.
For the finite mixture model, the standard EM update can be written as

(3.6) θ_χ^(t+1) = θ_χ^(t) + τ_χ^t ∇_{θ_χ} L(θ^(t); θ*), τ_χ^t = σ² / ( α_χ E_Y [ w_{θ^(t)}(Y, χ) ] ),

for χ = 1, . . . , K. For the (possibly infinite) orbit retrieval model, the standard EM update can be written as

θ^(t+1) = θ^(t) + σ² ∇_θ L(θ^(t); θ*).

We remark that in standard EM for finite mixtures, the step size τ_χ^t varies with time, and is also different for different centers θ_χ.
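On a finite sample, the identity between the standard EM update and a gradient step holds exactly, and can be checked directly. The following is a self-contained sketch of our own, with synthetic data and illustrative variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma, N = 2, 3, 3.0, 500
alpha = np.array([0.5, 0.3, 0.2])          # known mixture weights
theta_true = rng.normal(size=(K, d))

# draw Y = sigma * Z + theta_chi
chi = rng.choice(K, size=N, p=alpha)
Y = sigma * rng.normal(size=(N, d)) + theta_true[chi]

theta = rng.normal(size=(K, d))            # current estimate

# responsibilities alpha_chi * w_theta(y, chi), normalized over chi
logits = -((Y[:, None, :] - theta[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2)
w = alpha * np.exp(logits)
w = w / w.sum(axis=1, keepdims=True)       # shape (N, K)

# standard EM update: posterior-weighted means
theta_em = (w[..., None] * Y[:, None, :]).sum(0) / w.sum(0)[:, None]

# gradient of the sample log likelihood with respect to each center
grad = (w[..., None] * (Y[:, None, :] - theta[None, :, :])).sum(0) / (N * sigma ** 2)

# per-center step sizes: sample analogue of tau_chi above
tau = N * sigma ** 2 / w.sum(0)
theta_gd = theta + tau[:, None] * grad

assert np.allclose(theta_em, theta_gd)     # EM step == gradient ascent step
```

The assertion holds to machine precision because the weighted-mean update and the gradient step with these per-center step sizes are algebraically identical, mirroring the population statement above.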
Proof. For the finite mixture, we use the fact that

(3.7) ∇_{θ_χ} log q_θ(y) = (α_χ/σ²) w_θ(y, χ) (y − θ_χ),

with the analogous identity holding for the orbit retrieval model. Thus,

(3.9) ∇_{θ_χ} L(θ; θ*) = (α_χ/σ²) E_Y [ w_θ(Y, χ) (Y − θ_χ) ].

Using this property to compute ∇_{θ′} Q as well, we obtain

(3.10) ∇_{θ′_χ} Q(θ′ | θ; θ*) = (α_χ/σ²) E_Y [ w_θ(Y, χ) (Y − θ′_χ) ].

We immediately see that in both cases the gradients of Q and L are equal if θ′ = θ.
To see why standard EM is also gradient ascent on L, note that Q is a quadratic function in the θ′_χ for finite GMMs, and quadratic in θ′ for orbit retrieval. Now, for a quadratic function f(θ′) = −(c/2)||θ′||² + x^T θ′ + const., we can reach the global maximum in one step of gradient ascent from any point θ′ by taking a step size 1/c. In other words,

argmax f = θ′ + (1/c) ∇f(θ′).

It remains to compute c. For the finite mixture, considering Q as a function of θ_χ we see that

c = (α_χ/σ²) E_Y [ w_θ(Y, χ) ],

which gives the step size τ_χ^t in (3.6); for the orbit retrieval model, E_{χ∼ρ} w_θ(Y, χ) = 1 gives c = 1/σ², i.e. a step size of exactly σ².

3.3. EM in Low SNR Regime. We will use the expansion (2.16) to informally demonstrate that in the low SNR regime, the step size in the standard EM update (3.6) for the finite mixture model is much smaller than necessary, leading to slow convergence. The same is true for the orbit retrieval model, as shown in [FSWW20]. Let θ* = (θ¹_*, . . . , θ^K_*) be the centers of the ground truth model with θ^j_* ∈ R^d, j = 1, . . . , K, and θ = (θ_1, . . . , θ_K) be the argument to the log likelihood. We define ||θ||_∞ = max_{j=1,...,K} ||θ_j||. Recall that the ground truth mixture weights α_j are considered known, and

T_k(θ) = Σ_{j=1}^K α_j θ_j^⊗k, k = 1, 2, . . . .

As in Section 2, we will assume T_1(θ*) = 0, ||θ*||_∞ = 1 and σ ≫ 1. Now, we will consider the standard EM update in the direction of T_1(θ) and in the subspace orthogonal to it. First, we have

Proposition 3.2. Let θ ↦ G(θ) be the standard EM update, given by (3.5). Fix a constant R > 0. Then for all θ such that ||θ||_∞/σ ≤ R, we have

||T_1(G(θ)) − T_1(θ*)|| = O(1/σ).

The proof is given in Proposition C.3 in the appendix. Proposition 3.2 shows that if the EM iterates θ^(t) remain in a radius O(1) ball, then starting with t = 1 the estimated first moment T_1(θ^(t)) is order O(σ^{−1}) away from T*_1 = 0. While T_1 nearly converges in one iteration of standard EM, the algorithm is much slower in the subspace orthogonal to T_1.
To show this, we use the gradient descent representation of EM,

G(θ)_χ = θ_χ + τ_χ ∇_{θ_χ} L(θ; θ*), τ_χ = σ²/(α_χ E_Y [ w_θ(Y, χ) ]).

For θ such that ||θ||_∞ = O(1) with respect to σ, we have

(3.11) −L(θ; θ*) = const. + ||T_1(θ)||²/(2σ²) + q_4(θ, θ*)/σ⁴ + O(σ^{−6}),

where q_4 is a homogeneous polynomial of order 4 with respect to the entries of θ, θ*. This follows from the representation of the log likelihood given in (2.16). Now, consider the gradient of (3.11) in the subspace orthogonal to T_1. On this subspace, the highest order term of −L, given by ||T_1(θ)||²/(2σ²), is constant (not optimized), while q_4/σ⁴ and its θ-derivatives are of order O(σ^{−4}). It follows that the optimal step size for gradient descent is O(σ⁴). However, the actual step size is O(σ²). This is shown in Lemma C.1 of the appendix.
Recall that for the orbit recovery model, the standard EM update is a gradient descent step on −L with step size σ² exactly. Numerical experiments in [FSWW20] show that gradient descent on −L in the subspace orthogonal to T_1 with step size O(σ⁴) achieves much faster convergence than standard EM.

Examples of Interest and Implications
Recall that Theorem 2.1 shows that in the low SNR regime, likelihood optimization for parameterizable GMMs reduces to the sequence of minimization problems

(4.1) min_{θ ∈ V_{m−1}} ||T_m(θ) − T_m(θ*)||², m = 1, 2, . . . .

In the following two sections, we characterize the critical points of the minimization problems (4.1) for two GMMs: a uniform mixture of two Gaussians in R^d and an arbitrary finite mixture of Gaussians in R. We conclude the section with a discussion of the implications of the expansion for models with algebraic structure and GMMs with randomly chosen centers. We also discuss the necessary steps to make rigorous the connection between the moment matching and likelihood landscapes.
4.1. Uniform Mixture of Two Gaussians. Motivated by [XHM16], we study the moment matching minimization problems for the uniform two-component mixture with centers θ_1, θ_2 ∈ R^d in the following coordinates:

(4.3) α = (θ_1 + θ_2)/2, β = (θ_1 − θ_2)/2.

Define α*, β* analogously for the ground truth parameters. This is a natural reparameterization for the landscape, since

T_1 = α, T_2 = α ⊗ α + β ⊗ β.

We see that the first moment T*_1 determines α*, while β* is determined up to sign from T*_2 given that α = α*. Swapping θ_1 and θ_2 does not change the mixture distribution (since it is uniform), so α and ±β uniquely specify the distribution. It therefore suffices to consider the first two moment-matching optimization problems.
The first unconstrained optimization problem, min_{α∈R^d} ||α − α*||², has only a global minimum at α = α*. Now, on the manifold α = α*, the second optimization problem reduces to

min_{β∈R^d} ||β ⊗ β − β* ⊗ β*||².

We see that the points β = ±β* are global minima, while β = 0 is a saddle point.
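These claims are easy to confirm numerically (a small sketch of our own, with an arbitrarily chosen β*):

```python
import numpy as np

beta_star = np.array([1.0, 0.5])

def f(beta):
    """Second moment matching objective on the manifold alpha = alpha*."""
    D = np.outer(beta, beta) - np.outer(beta_star, beta_star)
    return (D ** 2).sum()

# +-beta_star are global minima
assert f(beta_star) == 0.0 and f(-beta_star) == 0.0

# grad f(beta) = 4||beta||^2 beta - 4<beta, beta*> beta* vanishes at beta = 0,
# but beta = 0 is a saddle: f decreases along beta* and increases orthogonally
eps = 1e-3
v_perp = np.array([-0.5, 1.0])                     # orthogonal to beta_star
assert f(eps * beta_star) < f(np.zeros(2))         # descent direction
assert f(eps * v_perp) > f(np.zeros(2))            # ascent direction
```

Expanding f gives f(β) = ||β||⁴ − 2⟨β, β*⟩² + ||β*||⁴, from which both the gradient formula in the comment and the saddle behavior at β = 0 can be read off directly.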
This aligns with the results of [XHM16] on the fixed points of EM. The authors reformulate the EM updates in the α, β coordinates (4.3). Letting α^(t), β^(t), t = 0, 1, . . . be the EM iterates, they show that α^(t) converges to α* as t → ∞, while β^(t) converges to β* if ⟨β^(0), β*⟩ > 0 and to −β* if ⟨β^(0), β*⟩ < 0.

4.2. Finite Mixtures in One Dimension. Next, consider a finite mixture of K Gaussians in one dimension,

Y = σZ + θ, θ = θ_j with probability α_j, θ_j ∈ R.

Here, α_j are positive weights summing to 1. They are assumed known, so that the unknown parameters are θ = (θ_1, . . . , θ_K) ∈ R^K. Interestingly, if the mixture is uniform (α_j = 1/K ∀j), then in the low SNR regime this model is equivalent to the following orbit retrieval model studied in [FSWW20]:

Y = σZ + gθ, g ∼ Unif(G), Z ∼ N(0, I_K).

Here, G = {g_j, j = 1, . . . , K!} ⊂ O(K) ⊂ R^{K×K} is a subgroup of the orthogonal group acting on vectors in R^K by permuting their entries. In other words, the orbit of θ under G is the set of all permutations of the entries of θ.
The two models are equivalent in the sense that there is a one-to-one mapping

(4.5) {T_ℓ(ρ_θ)}_{ℓ=1}^k ↔ {T_ℓ(ν_θ)}_{ℓ=1}^k, k = 1, 2, . . . .

To show this, define the polynomials p_ℓ(θ) = (1/K) Σ_{j=1}^K θ_j^ℓ, so that T_ℓ(ρ_θ) = p_ℓ(θ) for ρ_θ ∈ R_{1/K-mix}. Now, let ν_θ be the corresponding measure in R_orbit. The entries of the tensor T_ℓ(ν_θ) are degree-ℓ polynomials in R[θ_1, . . . , θ_K] which are invariant under permutation of the θ_j. But the polynomials p_j, j = 1, . . . , ℓ generate the permutation invariant polynomials of degree at most ℓ (see [FSWW20] and the references therein), showing that both sets in (4.5) are in one-to-one correspondence. In particular, [FSWW20] shows that the moment-matching problem for the orbit retrieval model reduces to

min_{θ ∈ V_k} (p_{k+1}(θ) − p_{k+1}(θ*))², V_k = {θ ∈ R^K | p_ℓ(θ) = p_ℓ(θ*), ℓ = 1, . . . , k}.

This is precisely the moment-matching problem for a uniform mixture on R.
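For the uniform one-dimensional mixture, moment inversion can in fact be carried out explicitly via Newton's identities, as the following sketch illustrates. The helper `centers_from_power_sums` is our own illustration; it works with unnormalized power sums p_k = Σ_j θ_j^k, i.e. K times the mixture moments.

```python
import numpy as np

def centers_from_power_sums(p):
    """Recover the multiset {x_1, ..., x_K} from the power sums
    p[k-1] = sum_j x_j^k, k = 1..K, via Newton's identities and root finding."""
    K = len(p)
    e = [1.0]  # elementary symmetric polynomials e_0, ..., e_K
    for k in range(1, K + 1):
        # Newton's identity: k e_k = sum_{i=1}^k (-1)^{i-1} e_{k-i} p_i
        s = sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1))
        e.append(s / k)
    # prod_j (x - x_j) = x^K - e_1 x^{K-1} + e_2 x^{K-2} - ...
    coeffs = [(-1) ** k * e[k] for k in range(K + 1)]
    return np.sort(np.roots(coeffs).real)

x = np.array([-1.0, 0.5, 2.0])
p = [np.sum(x ** k) for k in range(1, 4)]
assert np.allclose(centers_from_power_sums(p), np.sort(x))
```

This is the one-dimensional analogue of Pearson-style moment inversion: the first K power sums determine the centers up to permutation, matching the correspondence (4.5).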
We now generalize results in [FSWW20] on critical points of the above moment matching landscape to the case of non-uniform mixtures in R. Fix positive weights α = (α_1, . . . , α_K) summing to 1, and define

p_k(x) = Σ_{j=1}^K α_j x_j^k, f_k(x) = (p_k(x) − p_k(x*))², V_n = {x ∈ R^K | p_ℓ(x) = p_ℓ(x*), ℓ = 1, . . . , n}.

The following result characterizes critical points of the moment matching objective function that are not global minima.

Proposition 4.1. Fix a generic x* ∈ R^K. Then every critical point x ∈ V_n \ V_{n+1} of f_{n+1}|_{V_n} has exactly n distinct entries. Moreover, every such critical point is a saddle point of f_{n+1}|_{V_n}.
Proof of Proposition 4.1. A point x ∈ V_n is a critical point of f_{n+1}|_{V_n} if and only if ∇f_{n+1}(x) lies in the span of ∇p_j(x), j = 1, . . . , n. We have ∇f_{n+1}(x) = 2(p_{n+1}(x) − p*_{n+1})∇p_{n+1}(x); hence, if x ∉ V_{n+1}, then ∇p_{n+1}(x) also lies in the span of ∇p_j(x), j = 1, . . . , n, and x is a critical point of p_{n+1}|_{V_n}. Now, by arguments analogous to those in Lemma 4.23 of [FSWW20], every point in V_n has at least n distinct entries (for generic x*), and V_n is nonsingular (i.e. the gradients ∇p_j(x), j = 1, . . . , n are linearly independent for every x ∈ V_n).
We show that a critical point x of p_{n+1}|_{V_n} can have at most n distinct entries. Note that ∂_j p_k(x) = kα_j x_j^{k−1}. Since ∇p_{n+1}(x) lies in the span of the gradients ∇p_k(x), k = 1, . . . , n, there exist λ_1, . . . , λ_n such that (4.6) (n+1)α_j x_j^n = Σ_{k=1}^n k λ_k α_j x_j^{k−1}, j = 1, . . . , K.
Define the polynomial (4.7) q_n(y) = (n+1)y^n − (nλ_n y^{n−1} + · · · + 2λ_2 y + λ_1). Now, q_n is a polynomial of degree n, and (4.6) gives that q_n(x_1) = · · · = q_n(x_K) = 0 (since the α_j are nonzero). This implies that there are at most n distinct points among x_1, . . . , x_K. The second assertion follows from [Arn86], but we provide a proof for the sake of completeness. We will use the following characterization of critical points on manifolds, reviewed in Appendix D: Let f : R^K → R and let M ⊂ R^K be the intersection of level sets of functions g_1, . . . , g_n. Let x ∈ M be a critical point of f|_M and c_1, . . . , c_n ∈ R be such that (4.8) ∇f(x) = Σ_{j=1}^n c_j ∇g_j(x). Then x is a saddle, local minimum, or local maximum of f on M iff the quadratic form (4.9) ∇²f(x) − Σ_{j=1}^n c_j ∇²g_j(x) is indefinite, positive definite, or negative definite, respectively, on the tangent plane to M at x.
We apply this result with f = f_{n+1}, g_j = p_j, and M = V_n. Let x ∈ V_n \ V_{n+1} be a critical point of f_{n+1}|_{V_n}. Without loss of generality, assume x_1 > x_2 > · · · > x_n are the distinct entries of x. Letting λ_j, j = 1, . . . , n be as in (4.6), we have ∇f_{n+1}(x) = 2(p_{n+1}(x) − p*_{n+1}) Σ_{k=1}^n λ_k ∇p_k(x). Now the Hessian of f_{n+1} is given by (4.10) ∇²f_{n+1} = 2∇p_{n+1}∇p_{n+1}^⊤ + 2(p_{n+1} − p*_{n+1})∇²p_{n+1}, where ∇p_{n+1} is a column vector. Since ∇p_{n+1}(x) is a linear combination of ∇p_j(x), j = 1, . . . , n, it is orthogonal to vectors in the tangent plane of V_n at x. We therefore drop the rank-one term from the quadratic form and consider 2(p_{n+1}(x) − p*_{n+1}) v^⊤[∇²p_{n+1}(x) − Σ_{k=1}^n λ_k ∇²p_k(x)]v = 2(p_{n+1}(x) − p*_{n+1}) Σ_{j=1}^K α_j q_n′(x_j) v_j², where the polynomial q_n is as in (4.7). We now characterize the vectors v ∈ R^K in the tangent plane to V_n at x, i.e. perpendicular to ∇p_j(x), j = 1, . . . , n. First, define the vectors u_k ∈ R^n, k = 1, . . . , n, by (u_k)_i = x_i^{k−1}. Then the matrix V ∈ R^{n×n} with columns u_1, . . . , u_n is a Vandermonde matrix with determinant Π_{i<j}(x_j − x_i) ≠ 0. Thus, u_1, . . . , u_n are linearly independent. Now, let v be perpendicular to ∇p_j(x), j = 1, . . . , n, and define ṽ ∈ R^n by ṽ_i = Σ_{j : x_j = x_i} α_j v_j. For such a v, we have (4.12) 0 = (1/k)⟨∇p_k(x), v⟩ = Σ_{j=1}^K α_j x_j^{k−1} v_j = ⟨ṽ, u_k⟩. Taking k = 1, . . . , n in (4.12), we see that ṽ is orthogonal to the n linearly independent vectors u_1, . . . , u_n and is therefore identically zero. The condition ṽ = 0 is clearly also sufficient for v to lie in the subspace orthogonal to ∇p_j(x), j = 1, . . . , n.
Note that if x_i is non-repeating for some i, then 0 = ṽ_i = α_i v_i, from which we infer v_i = 0. The quadratic form on the tangent plane therefore reduces to (4.13) 2(p_{n+1}(x) − p*_{n+1}) Σ_{i : x_i repeats} q_n′(x_i) Σ_{j : x_j = x_i} α_j v_j², where in the last line we have used that v_i = 0 for all i such that x_i is non-repeating. Now, let i ∈ {1, . . . , n} be such that x_i repeats. Note that as the v_j for j such that x_j = x_i range over the set satisfying ṽ_i = Σ_{j : x_j = x_i} α_j v_j = 0, the sum Σ_{j : x_j = x_i} α_j v_j² can take any value in [0, ∞). We therefore see that the quadratic form (4.13) is indefinite if q_n′(x_i) takes both positive and negative values over the repeating x_i. If the q_n′(x_i) are all of the same sign s for repeating x_i, then the quadratic form is positive definite if p_{n+1}(x) − p*_{n+1} also has sign s, and negative definite if p_{n+1}(x) − p*_{n+1} has sign −s. Now, recall that q_n is a degree n polynomial with roots x_1, . . . , x_n. Since these points are all distinct, this implies q_n is a positive multiple of (y − x_1)(y − x_2) · · · (y − x_n). The derivative of such a polynomial alternates in sign from root to root, and q_n′(x_1) > 0, since x_1 is the rightmost root on the real line. This finishes the proof of (b).
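The tangent space argument above rests on the invertibility of the Vandermonde matrix built from the distinct entries of x; a quick numerical check of this standard fact (with hypothetical points):

```python
import numpy as np

# Distinct points, ordered as in the proof (x_1 > x_2 > ... > x_n).
x = np.array([3.0, 1.5, -0.5, -2.0])
n = len(x)

# V[i, k] = x_i^k for k = 0..n-1, a Vandermonde matrix whose
# determinant is prod_{i<j} (x_j - x_i), nonzero iff the x_i are distinct.
V = np.vander(x, n, increasing=True)

det = np.linalg.det(V)
prod = np.prod([x[j] - x[i] for i in range(n) for j in range(i + 1, n)])
assert not np.isclose(det, 0.0)
assert np.isclose(det, prod)
```

The linear independence of the columns u_1, . . . , u_n is exactly this nonvanishing determinant.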

4.3. Algebraically Structured Models and Discussion. There are many important inference problems that are naturally modeled as GMMs with algebraic structure imposed on the centers. A motivating application is that of molecule imaging using Cryo-Electron Microscopy (cryo-EM), in which the goal is to reconstruct the density of a molecule given partial observations of it. The imaging data can be modeled by a GMM which generalizes the orbit recovery model in several ways. We describe the model in full generality, since it encapsulates most of the algebraically structured models of interest. The observations in cryo-EM are given by noisy projections of the molecule taken from different unknown viewing directions. Moreover, the molecule may be observed in one of several conformations. A common model assumption is to consider the noise to be Gaussian, in which case we can model the data by the GMM Y = σZ + θ, where Z is the additive Gaussian noise and θ encodes the projection, rotation, and conformation of the molecule.
Specifically, let θ_j ∈ R^d, j = 1, . . . , K, represent the densities of the molecule in its K different conformations. These are the signals we wish to recover. Let χ ∈ {1, 2, . . . , K} be a random variable representing which conformation is observed, with P(χ = j) = α_j, j = 1, . . . , K. This distribution is also unknown. Next, let G be the group of rotations on R^d, and let g ∼ Haar(G) be a random variable with uniform distribution over G. Finally, let Π : R^d → R^m, with m < d, be the tomographic projection, a linear projection operator corresponding to the imaging procedure. We can then write the GMM as (4.14) Y = σZ + θ, θ = Π(gθ_χ).
To summarize, the centers of this mixture, given by the support of θ, are the projections of the orbits under the continuous group G of K points in R^d. The following are simplifications of this general model: (1) Discrete Homogeneous Orbit Retrieval. This is the type of orbit retrieval model (2.9) described in Section 2. Here, there is no projection operator and only one orbit. One special case of interest is Multireference Alignment (MRA), in which the group G = {g_0, . . . , g_{d−1}} acts on vectors in R^d by cyclically shifting their entries, i.e. (g_s θ)_i = θ_{(i+s) mod d}.
(2) Orbit Retrieval with non-uniform weights. In this case, the distribution of g is not restricted to be uniform over G, and is unknown. (3) Continuous Orbit Retrieval. Here, the group G may be infinite. As an example, continuous MRA is a generalization of discrete MRA in which shifts of entries are generalized to continuous shifts of periodic functions on the torus, i.e. τ_x θ(y) = θ(y + x mod 1), x, y ∈ [0, 1). The periodic functions are assumed bandlimited so that they can be represented in a finite-dimensional Fourier basis. (4) Heterogeneous Orbit Retrieval. There is no projection operator, but the centers form K > 1 orbits of a group G. A connection between the log likelihood and the moments of the mixture was established in [BRW17, PWB+17] for homogeneous MRA. In those papers, upper and lower bounds on the KL divergence (essentially the negative log likelihood) are given in the form of a series similar to ours, in which each term is the squared norm of the difference between true and estimated moments. This was then used to understand the sample complexity of the orbit retrieval problem, heavily exploiting the fact that the moments, due to the model's algebraic structure, correspond to invariant polynomials with respect to the group action. This showed that in the low SNR regime, the sample complexity of MRA increases from the standard O(1/SNR) to O(1/SNR³), a previously unexplained phenomenon first observed in experiments performed in the context of cryo-EM [Sig98]. This connection was then extended to the general setting (4.14) in [BBSK+17].
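As a concrete illustration of the MRA group action in item (1) above, here is a small numerical sketch (the helper name cyclic_orbit and the choice of shift-invariant statistics are ours, not constructions from the paper):

```python
import numpy as np

def cyclic_orbit(theta):
    """Orbit of theta under the cyclic shift group G = {g_0, ..., g_{d-1}},
    acting by (g_s theta)_i = theta_{(i+s) mod d}."""
    return np.stack([np.roll(theta, -s) for s in range(len(theta))])

theta = np.array([1.0, -0.5, 2.0, 0.25])
orbit = cyclic_orbit(theta)

# Two shift-invariant statistics: the mean (a first moment) and the
# power spectrum |F(theta)|^2 (a second-order invariant polynomial).
for g_theta in orbit:
    assert np.isclose(g_theta.sum(), theta.sum())
    assert np.allclose(np.abs(np.fft.fft(g_theta)) ** 2,
                       np.abs(np.fft.fft(theta)) ** 2)
```

These invariants are examples of the invariant polynomials to which the mixture moments correspond under the group action.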
While these results help in understanding the statistical complexity of the models, they fail to explain why iterative methods such as EM appear to perform well in practice. (For more on EM and maximum likelihood estimation in cryo-EM, see e.g. [SDCS10].) The asymptotic expansion of the log likelihood and its connection to least squares moment matching is a step toward understanding why this is so. It is important to note, however, that understanding the roughness of the landscape of least squares of moments can be a highly non-trivial task.
Nevertheless, even without a theoretical understanding of the roughness of the moments landscape, the connection between moment methods and more classical iterative approaches such as EM is itself of interest (and unexpected). The connection is especially tight if the method used for inverting the moments is minimization of an objective function of the form min_θ Σ_k λ_k ‖T_k(θ) − T*_k‖². This is one of the methods used in a series of papers in which the moment-based approach was suggested for algebraically structured mixture models [BBM+18, BBL+19, MBB+20, LBBS20]. In fact, numerical simulations in these papers demonstrate this connection: the experiments suggest that the two methods have similar performance for MRA and some of its extensions mentioned above.
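The objective min_θ Σ_k λ_k ‖T_k(θ) − T*_k‖² can be sketched in a toy setting: a uniform two-component mixture on R, with hypothetical weights λ_k, optimized by plain gradient descent (our illustrative choice of optimizer and parameters, not the algorithms of the cited papers):

```python
import numpy as np

def moments(theta, K_max):
    """Moments T_k(theta) = (1/K) sum_j theta_j^k of a uniform mixture on R."""
    return np.array([np.mean(theta ** k) for k in range(1, K_max + 1)])

def grad(theta, T_star, lam):
    """Gradient of F(theta) = sum_k lam_k * (T_k(theta) - T_k^*)^2."""
    T = moments(theta, len(T_star))
    g = np.zeros_like(theta)
    for k in range(1, len(T_star) + 1):
        g += (2.0 * lam[k - 1] * (T[k - 1] - T_star[k - 1])
              * k * theta ** (k - 1) / len(theta))
    return g

theta_star = np.array([-1.0, 2.0])        # hypothetical ground truth
lam = np.array([1.0, 0.5, 0.25])          # hypothetical weights lambda_k
T_star = moments(theta_star, 3)

rng = np.random.default_rng(1)
theta = theta_star + 0.1 * rng.standard_normal(2)  # initialize near truth
for _ in range(20000):
    theta -= 0.02 * grad(theta, T_star, lam)

# Locally, gradient descent on the moment matching objective recovers
# the ground truth centers (up to permutation).
assert np.allclose(np.sort(theta), np.sort(theta_star), atol=1e-4)
```

This local success is of course far from a global landscape analysis, which is exactly the hard question discussed above.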
A particularly interesting example is that of heterogeneous MRA, in which there are M = Kd mixture components corresponding to the orbits under cyclic shifts of K vectors in R^d. Statistically, it is known [BBSK+17] that moments up to degree 3 suffice to resolve the model (provided the vectors are generic), even for K growing linearly with d. However, numerical experiments in which the vectors are chosen at random suggest that algorithms start failing above K ≳ √d, and that the moment matching landscape has spurious local minima in this regime. It is conceivable that, for vectors chosen randomly from a Gaussian distribution, the third moment matching landscape is benign for K ≪ √d and riddled with spurious critical points when K ≳ √d. If such a phase transition is established, our expansion could then be used as a vehicle to transfer such results into an understanding of the performance of EM and similar methods.

4.3.1. Random centers. Outside of algebraically structured GMMs, another interesting model is a GMM with "average-case" centers: take M vectors in R^d sampled at random from a Gaussian distribution and consider the GMM with these vectors as centers (and fixed isotropic covariances). The question of whether the mixture can be recovered from third moments is equivalent to low-rank tensor decomposition (the third moment tensor is a d × d × d tensor with rank ≤ M). This problem is believed to exhibit a statistical-to-computational gap: while the low-rank decomposition is possible for M ≪ d², it is believed to be computationally hard for M ≫ d^{3/2} [Wei18]. This is precisely the regime in which algorithms for moment inversion appear to fail in heterogeneous MRA, since the orbits of K ∼ √d vectors form the centers of a GMM with M = Kd ∼ d^{3/2} mixture components.
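The rank structure underlying this equivalence can be illustrated directly: the third moment tensor of M Gaussian centers is a d × d × d tensor of CP rank at most M, so any matrix unfolding has rank at most M (the dimensions below are hypothetical, chosen small for the sketch):

```python
import numpy as np

d, M = 6, 3
rng = np.random.default_rng(2)
centers = rng.standard_normal((M, d))  # M random Gaussian centers in R^d

# Third moment tensor of the uniform (noiseless) mixing measure:
# T3 = (1/M) sum_j theta_j (x) theta_j (x) theta_j, a rank-<=M CP tensor.
T3 = np.einsum('ja,jb,jc->abc', centers, centers, centers) / M

# Any unfolding of a CP tensor of rank <= M has matrix rank <= M.
unfolding = T3.reshape(d, d * d)
assert np.linalg.matrix_rank(unfolding) <= M
```

Recovering the centers from T3 is exactly the low-rank tensor decomposition problem whose hardness regime is discussed above.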
A characterization of the roughness of the landscape of low-rank tensor decomposition in these regimes could, with the help of our expansion, potentially be transferred to study the performance of EM in such a mixture model.

4.4. Towards finite sample guarantees. To make rigorous the connection between likelihood optimization and the series of minimization problems (4.1) in low SNR models, one must prove that the path of gradient descent on the negative log likelihood is well approximated by the stagewise least squares moment minimization. To study these algorithms in the finite sample case, one must also quantify the deviation of the sample log likelihood and its first two derivatives from their population counterparts.
[FSWW20] carries out this program to draw conclusions about log likelihood optimization in the case of homogeneous orbit retrieval. The tools developed in that paper lay the groundwork for analysis of more general models. In particular, the authors exploit the algebraic structure of the model to reparameterize the gradient descent dynamics in a basis of invariant polynomials under the group action. The varieties (4.2) are then level sets of these polynomials, simplifying the analysis of the landscape of (4.1).
We have not rigorously established the connection between the two landscapes for general GMMs or performed a finite sample analysis here, but we expect that doing so should be possible with the help of techniques developed in [FSWW20], as well as those used in the present paper for the derivation of the likelihood expansion.

5. Log Likelihood Asymptotic Expansion
In this section, we prove the asymptotic expansion of the population log likelihood given in Lemma 2.2, highlighting key parts of the argument and deferring technical lemmas to the appendix. Recall that, up to additive constants independent of ρ, the population log likelihood is given by L(ρ) = E_{θ*,Z} log E_θ exp(−‖σZ + θ* − θ‖²/(2σ²)), where θ ∼ ρ, θ* ∼ ρ*, and ρ, ρ* are compactly supported distributions on R^d. Note that in this section only, we use θ, θ* to denote random variables rather than fixed parameter vectors. We begin with the following key observation.
Lemma 5.1. Let Z′ ∈ R^d be a random vector independent of Z, θ, and θ*, such that Z′ ∼ N(0, I). Then (5.2) L(ρ) = E_{θ*,Z}[−‖Z‖²/2 + f(1/σ, Z, θ*)], where f(t, Z, θ*) = log M(t, Z, θ*) and M(t, Z, θ*) = E_{θ,Z′} exp(t W^⊤ w), with W = Z + iZ′ and w = θ − θ*.

Proof. Consider the random variable (θ − θ*)^⊤ Z′. For θ, θ* fixed, it is a mean zero Gaussian with variance ‖θ − θ*‖², and hence has characteristic function (5.3) E_{Z′} exp(it(θ − θ*)^⊤ Z′) = exp(−t²‖θ − θ*‖²/2). Using (5.3) with t = 1/σ, we then have E_θ exp(−‖σZ + θ* − θ‖²/(2σ²)) = e^{−‖Z‖²/2} E_{θ,Z′} exp(W^⊤ w / σ). Taking the logarithm and expectation with respect to θ*, Z gives (5.2).

The remainder of the proof of Lemma 2.2 centers around a finite Taylor expansion about t = 0 of f(t, Z, θ*) = log M(t, Z, θ*).
Note that f is C^∞ in t for every Z. Hence, for every m, f has a finite Taylor expansion of the form (5.6) f(t) = Σ_{p=1}^{2m+1} κ_p t^p/p! + ∂_t^{2m+2} f(ξ) t^{2m+2}/(2m+2)!. Here, the κ_j and ξ both depend on Z, θ*, and |ξ| < |t|. This expansion is valid for every t ∈ R and Z, θ* ∈ R^d. Substituting (5.6) into (5.2) with t = 1/σ, we have L(ρ) = −d/2 + Σ_{p=1}^{2m+1} E_{θ*,Z}[κ_p] σ^{−p}/p! + E_{θ*,Z}[∂_t^{2m+2} f(ξ)] σ^{−(2m+2)}/(2m+2)!. To prove Lemma 2.2, it remains to compute the expectations of κ_p, p = 1, . . . , 2m+1, and to upper bound the expectation of the error term. The following theorem summarizes the results of these computations.
where r_k ∈ R_{2k}[T_{1:k−1}, T*_{1:2k}], Q_k ∈ V_k[T_{1:k−1}, T*_{1:k−1}], and Q_k is such that Q_k(T*_{1:k−1}, T*_{1:k−1}) = 0. The error term is bounded above by the quantity in (5.22) below, where C is a d-dependent constant. We now outline the main steps of the proof of Theorem 5.2. In Section 5.2, we obtain expressions for the κ_p. We do so by taking advantage of the generalized moment-cumulant relations described below. We obtain the series expansion (5.9) f(t) = Σ_{p=1}^∞ κ_p t^p/p! for |t| < R_Z, where R_Z is a Z-dependent radius of convergence, within which the series converges uniformly in t. That the radius of convergence depends on Z will not be an issue, as we only care about the coefficients κ_p of (5.9). Indeed, the κ_p of (5.6) and (5.9) coincide. In Section 5.3, we upper bound the error term by explicitly computing ∂_t^{2m+2} f and bounding its Z, θ*-expectation. In Section 5.4, we compute the Z-expectation of the κ_p, and in Section 5.5, we compute the θ*-expectations of the Z-expectations.
In several key steps of the proof, we make use of the polynomials which express the cumulants of a distribution in terms of its moments. We will apply these moment-cumulant relations in a more general setting, in which the "moments" are coefficients of any Taylor expansion satisfying certain conditions. Before proceeding with the proof, we describe these generalized moment-cumulant relations.

Generalized Moment-Cumulant Relations. Let X be a random variable with moment-generating function M_X(t) = E[e^{tX}] = 1 + Σ_{k=1}^∞ μ_k t^k/k!
and cumulant generating function K_X(t) = log M_X(t) = Σ_{m=1}^∞ κ_m t^m/m!. The μ_k are the moments of X, μ_k = E[X^k], and the cumulants κ_m are given by the following polynomials κ_m(μ_1, . . . , μ_m): (5.10) κ_m(μ_1, . . . , μ_m) = Σ_{λ∈S_m} c_λ Π_{ℓ∈λ} μ_ℓ, where S_m is the set of all finite lists λ of positive integers whose sum is m. The c_λ are universal constants, and we will rarely need to know their exact values. As an example, κ_4 = c_4 μ_4 + c_{3,1} μ_3 μ_1 + c_{2,2} μ_2² + c_{2,1,1} μ_2 μ_1² + c_{1,1,1,1} μ_1⁴. The moment-cumulant relations (5.10) are typically applied in the context of random variables. However, they arise in a more general context: for a function f with Taylor series coefficients μ_k, (5.10) describes how the Taylor series coefficients κ_m of log f relate to the μ_k. More concretely, we have the following result, proved in Proposition A.1 of the Appendix.
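The relations (5.10) can be checked numerically; the sketch below uses the standard recursion between moments and cumulants (an equivalent form of (5.10)) and verifies the closed form for κ_4, whose universal coefficients are c_4 = 1, c_{3,1} = −4, c_{2,2} = −3, c_{2,1,1} = 12, c_{1,1,1,1} = −6:

```python
from math import comb

def cumulants_from_moments(mu):
    """Cumulants kappa_1..kappa_n from raw moments mu_1..mu_n via the
    recursion kappa_m = mu_m - sum_{j=1}^{m-1} C(m-1, j-1) kappa_j mu_{m-j}."""
    kappa = []
    for m in range(1, len(mu) + 1):
        k = mu[m - 1] - sum(comb(m - 1, j - 1) * kappa[j - 1] * mu[m - j - 1]
                            for j in range(1, m))
        kappa.append(k)
    return kappa

# Closed form: kappa_4 = mu_4 - 4 mu_3 mu_1 - 3 mu_2^2
#                        + 12 mu_2 mu_1^2 - 6 mu_1^4.
mu = [0.3, 1.1, 0.7, 2.9]  # arbitrary "moments" for the check
m1, m2, m3, m4 = mu
k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
assert abs(cumulants_from_moments(mu)[3] - k4) < 1e-12
```

As the text notes, the same polynomials apply to Taylor coefficients of any suitable function, not just moments of a random variable.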
Proposition 5.3. Let M(t) be m times differentiable in a neighborhood of t_0 with M(t_0) ≠ 0, and set μ_k = M^{(k)}(t_0)/M(t_0), k = 1, . . . , m. Then (5.11) (d^m/dt^m) log M(t)|_{t=t_0} = κ_m(μ_1, . . . , μ_m), where the function κ_m is defined by (5.10). In particular, if u(t) is given by the convergent series expansion u(t) = Σ_{k=1}^∞ μ_k (t − t_0)^k/k! with sup |u| < 1 in a neighborhood of t_0, then log(1 + u(t)) = Σ_{m=1}^∞ κ_m (t − t_0)^m/m! in that neighborhood, where the κ_m are given by (5.10).
Remark 5.4. Note that (5.11) holds for any function M(t) with m derivatives in a neighborhood of t_0 and M(t_0) ≠ 0. In other words, (5.11) is simply the expression for the mth derivative of log M that comes from repeatedly applying the rules of differentiation.
In the following four sections of the proof, we will make frequent use of the identity ⟨x^{⊗k}, y^{⊗k}⟩ = (x^⊤y)^k, x, y ∈ R^d, k = 1, 2, . . . .
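This identity is easy to verify numerically, e.g. for k = 3 (the vectors below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.standard_normal(4), rng.standard_normal(4)
k = 3

# Build the k-fold outer products x^{(x)k}, y^{(x)k} and take the
# entrywise inner product; it equals (x^T y)^k.
xk = np.einsum('a,b,c->abc', x, x, x)
yk = np.einsum('a,b,c->abc', y, y, y)
assert np.isclose(np.sum(xk * yk), np.dot(x, y) ** k)
```

The identity is what allows products of scalar expectations below to be rewritten as inner products of moment tensors.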

5.2. Log-Sum-Exp Taylor Series. Recall that M(t, Z, θ*) = E_{θ,Z′} exp(t W^⊤ w), where W = Z + iZ′ and w = θ − θ*. In this section, we derive the series expansion (5.9) of f = log M, valid for |t| < R_Z with R_Z to be specified. Throughout the section, Z and θ* are considered constant.
The partial sums are each bounded in absolute value by exp(δt(‖Z‖ + ‖Z′‖)), which has finite θ, Z′-expectation. We can therefore interchange summation and expectation to get (5.13) M(t) = Σ_{k=0}^∞ (t^k/k!) E_{θ,Z′}[(W^⊤ w)^k]. Denote the coefficients in this expansion by μ_k = E_{θ,Z′}[(W^⊤ w)^k]. We now apply the generalized moment-cumulant relations to Taylor expand the logarithm of M(t). In order to do so, we first limit the range of t to ensure that M(t) remains in a small neighborhood of 1. Now, we can write M(t) as M(t) = 1 + u(t), where u(t) = Σ_{k=1}^∞ μ_k t^k/k!. Using that ‖w‖ ≤ δ, we have |u(t)| ≤ E_{θ,Z′}[exp(|t|δ(‖Z‖ + ‖Z′‖))] − 1. Therefore, for |t| < R_Z we have (5.14) |u(t)| < 1, where we have used the fact that |e^x − 1| < 7|x|/4 when |x| < 1. For |t| < R_Z, we can thus make use of the generalized moment-cumulant relations of Proposition 5.3 to write f(t) = log(1 + u(t)) as a convergent power series whose t^p coefficient is κ_p/p!, with κ_p = Σ_{λ∈S_p} c_λ Π_{ℓ∈λ} μ_ℓ. In the third line of this computation, W_λ denotes the set {W_ℓ | ℓ ∈ λ}. We replaced W = Z + iZ′ with vectors W_ℓ = Z + iZ_ℓ, ℓ ∈ λ, where the Z_ℓ ∈ R^d are i.i.d. standard normal and independent of Z. This allowed us to write the product of expectations as the expectation of a product.

5.3. Error Term Upper Bound.
In this section, the constant C depends on d only and may change value from line to line. Recall that the error term is given by E_{θ*,Z}[∂_t^n f(ξ)] σ^{−n}/n!, with n = 2m + 2, for ξ such that 0 < ξ < 1/σ. To bound it, we first compute ∂_t^n f at a generic point t. Recall that w = θ − θ*, W = Z + iZ′, and f = log M for M(t, Z, θ*) = E_{θ,Z′} exp(t W^⊤ w). By (5.11), we have ∂_t^n f(t) = κ_n(M′(t)/M(t), . . . , M^{(n)}(t)/M(t)) at any point t, since M(t) is never zero. We therefore have the error bound (5.17). We now compute the t-derivatives of M in the following indirect way, which will yield an expression that is simpler to bound. Let t = t_0 + s; we will write M in terms of s and take its derivative at s = 0, giving (5.18). We now take the derivative at s = 0, passing it inside both expectations. This is justified since the resulting derivative is absolutely integrable. Noting that E_{Z′} e^{t w^⊤ W} > 0, we obtain an expression for the ratios M^{(ℓ)}(t)/M(t). Now, fix λ ∈ S_n, and let W_ℓ = Z + iZ′_ℓ, ℓ ∈ λ, where the Z′_ℓ are independent copies of Z′. We obtain an upper bound on the expectation of Π_{ℓ∈λ} |M^{(ℓ)}(t)/M(t)|, where expectation with no subscripts denotes expectation with respect to all random variables and C is a constant depending on d only. Note that this upper bound is independent of λ. We also have Σ_{λ∈S_n} |c_λ| ≤ n^n ≤ n! e^n. This follows from [FSWW20], in which it was shown that the sum of the absolute values of the coefficients arising in an nth order multivariate cumulant is bounded above by n^n. It is straightforward to see that this bound applies to univariate cumulants as well. Substituting these bounds into the error bound (5.17) and using |t| = |ξ| < 1/σ, we obtain (5.22). Taking n = 2m + 2 gives the desired error bound.
5.4. Z-Expectation of Cumulants. We will use the following notation in this section: for a list λ, we define ℓ_max as the maximum element of the list, and λ \ ℓ as the list with one copy of ℓ removed. Now, recall that w = θ − θ* and κ_p = Σ_{λ∈S_p} c_λ E[Π_{ℓ∈λ} (W_ℓ^⊤ w)^ℓ], where W_λ denotes the set {W_ℓ | ℓ ∈ λ} and W_ℓ = Z + iZ_ℓ, with Z, Z_ℓ, ℓ ∈ λ independent standard normal vectors in R^d. We therefore obtain E_Z[κ_p] as a sum over λ ∈ S_p of inner products whose left factor is a θ, θ*-expectation of tensor products of w and whose right factor is the expectation E[⊗_{ℓ∈λ} W_ℓ^{⊗ℓ}]. We make two initial observations: (1) If p = 2k + 1 is odd, then the expectation on the right side of the inner product is zero, so that E[κ_{2k+1}] = 0. This follows from the fact that all Gaussian random variables appearing in the expectation are mean zero, and the total order of the product is 2k + 1.
(2) Suppose p = 2k. If λ is such that all ℓ ∈ λ are less than k, then the θ*-expectation on the left is a polynomial only in moment tensors T_p with p < k (though it could depend on moment tensors T*_p for p up to 2k). This is proved in Lemma B.4 in the appendix. Therefore, the inner products for such λ will contribute only to the remainder polynomial r_k(T_{1:k−1}, T*_{1:2k}). We therefore consider p = 2k and discard from the sum those λ ∈ S_{2k} in which all entries are less than k. Let ℓ_max denote the maximum entry of a list λ. We have (5.23). Using results in [FPR19], we now prove the following proposition.
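Observation (1) ultimately rests on the vanishing of odd moments of mean-zero Gaussians; this standard fact can be verified to high precision with Gauss-Hermite quadrature (our choice of verification tool, with probabilists' weight e^{−x²/2}):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Degree-20 quadrature is exact for polynomial integrands of degree < 40,
# so the Gaussian moments E[Z^p] below are computed essentially exactly.
nodes, weights = hermegauss(20)
moments = {p: np.dot(weights, nodes ** p) / np.sqrt(2 * np.pi)
           for p in range(8)}

# Odd moments of a mean-zero Gaussian vanish; even ones are (p-1)!!.
for p in (1, 3, 5, 7):
    assert abs(moments[p]) < 1e-8
for p, val in ((0, 1.0), (2, 1.0), (4, 3.0), (6, 15.0)):
    assert abs(moments[p] - val) < 1e-8
```

The same parity argument, applied entrywise to the tensor expectations, kills every odd-order cumulant in the expansion.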
Proposition 5.5. Let λ ∈ S_{2k} be such that ℓ_max, the maximum of the list, is at least k. Let Z, Z_ℓ, ℓ ∈ λ be i.i.d. standard normal vectors in R^d. Then the expectation E[⊗_{ℓ∈λ} (Z + iZ_ℓ)^{⊗ℓ}] vanishes when ℓ_max > k; the case ℓ_max = k is computed in the proof below. The proof relies on the following result of [FPR19].

Proposition 5.6 (Proposition 2 of [FPR19]). Let V = (V_1, . . . , V_p) ∈ C^p be jointly circularly symmetric. Then E[V_1^{k_1} · · · V_p^{k_p} V̄_1^{m_1} · · · V̄_p^{m_p}] is nonzero only if k_1 + · · · + k_p = m_1 + · · · + m_p.
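A quick Monte Carlo illustration of Proposition 5.6 in the scalar case (sample size, seed, and tolerances are our choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
# A circularly symmetric complex Gaussian: independent real and
# imaginary parts of equal variance, normalized so E[V conj(V)] = 1.
V = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

# E[V^2] = 0: two unconjugated factors, zero conjugated ones, so the
# counts in Proposition 5.6 do not balance. E[V conj(V)] = 1 survives.
assert abs(np.mean(V ** 2)) < 0.02
assert abs(np.mean(V * np.conj(V)) - 1.0) < 0.02
```

The proof below applies the same counting of conjugated versus unconjugated factors to the vectors W and W_ℓ.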
Proof of Proposition 5.5. Note that each Z + iZ_ℓ is a circularly symmetric random vector. However, (Z + iZ_ℓ)_{ℓ∈λ} ∈ C^{|λ|d} is not jointly circularly symmetric, because the real parts are correlated while the imaginary parts are independent. We can nevertheless take advantage of Proposition 5.6 as follows: first, define Z′_ℓ ∈ R^d, ℓ ∈ λ \ ℓ_max, to be standard normal and independent of Z, Z_ℓ, ℓ ∈ λ, and of each other. Define W = Z + iZ_{ℓ_max} and W_ℓ = Z′_ℓ + iZ_ℓ, ℓ ∈ λ \ ℓ_max. Thus, (W; (W_ℓ)_{ℓ∈λ\ℓ_max}) ∈ C^{|λ|d} is circularly symmetric. Now, for ℓ ∈ λ \ ℓ_max we write Z + iZ_ℓ = (W + W̄ + W_ℓ − W̄_ℓ)/2. With this notation, each entry of the tensor E[⊗_{ℓ∈λ}(Z + iZ_ℓ)^{⊗ℓ}] is of the form given in the proposition, i.e. it is a product of some number of conjugated and unconjugated complex random variables which are jointly circularly symmetric Gaussian. We count how many conjugated and unconjugated variables appear in a typical entry of this tensor. There are at least ℓ_max unconjugated variables (from the ℓ_max copies of W), and at most 2k − ℓ_max ≤ ℓ_max conjugated variables. Thus, we immediately see that the expectation is zero if ℓ_max > k. If ℓ_max = k, the expectations of all terms with fewer than k conjugated variables are zero. Writing k instead of ℓ_max, we then note that W_ℓ is independent of W, so the expectation of products involving both entries of W and entries of W_ℓ splits into a product of two expectations, one of which involves only entries of W_ℓ. But this expectation is zero by the proposition, since it involves only conjugated variables. Hence only the products involving W alone survive. We recall the additional factor 1/2^{2k−ℓ_max} = 2^{−k} to conclude.
Substituting this back into the cumulant formula (5.23), we obtain an expression whose remaining expectation is straightforward to evaluate (see Proposition B.6 in the Appendix), and we obtain (5.26).

5.5. θ*-Expectation of Cumulants. Recall that w = θ − θ*, and that T_k and T*_k denote the moment tensors of θ and θ*. We also remind the reader of the definitions of total moment order and the spaces V_k, R_k. Consider a tensor product S = S_{k_1} ⊗ · · · ⊗ S_{k_n}, where each S_{k_i} is either T_{k_i} or T*_{k_i}. We define the total moment order of S to be Σ_{i=1}^n k_i, i.e. the sum of all moment orders. We also say that the total moment order of each entry of S is Σ_{i=1}^n k_i; in other words, the total moment order of a product of entries of moment tensors is the sum of all moment orders in the product.
Definition. We define V_k[T_{1:m}, T*_{1:n}] as the set of all constant coefficient linear combinations of outer products of moment tensors T_j, j ≤ m, and T*_ℓ, ℓ ≤ n, of total moment order k. We define R_k[T_{1:m}, T*_{1:n}] as the set of all constant coefficient linear combinations of products of entries of moment tensors T_j, j ≤ m, and T*_ℓ, ℓ ≤ n, of total moment order k.
In the previous section, we have shown (5.28), where s is such that E_{θ*}[s(T_{1:k−1}, θ*)] ∈ R_k[T_{1:k−1}, T*_{1:2k}]. For brevity, denote the tensor on the right of (5.28) by J_k. In this section, we take the θ*-expectation of the inner product in (5.28), that is, of ⟨E_θ[w^{⊗k}], J_k⟩. We are interested only in terms involving T_k, the highest order moment tensor of θ appearing in this inner product. Therefore, we can separate out T_k on each side of the inner product and discard terms in which T_k appears on neither side. The discarded terms will collectively be denoted s(T_{1:k−1}, θ*). Note that the θ-moment tensors which arise upon expanding E_θ[(θ − θ*)^{⊗p}] are T_j, j ≤ p. Therefore, T_k appears only through E_θ[(θ − θ*)^{⊗k}], which arises on the left of the inner product with coefficient 1 and on the right with coefficient c_{k,k}. We have (5.29), and we may then write (5.30). To get the last line, we substitute E_θ[θ^{⊗k}] with E_{θ′}[θ′^{⊗k}], where θ′ is an independent copy of θ, and pull the θ′-expectation out of the inner product.
The following lemma will complete the proof of Theorem 5.2.
Lemma 5.7. Let x ∈ R^d be constant. Then the stated decomposition holds with Q_k ∈ V_k[T_{1:k−1}, T*_{1:k−1}]. Moreover, Q_k is independent of x and satisfies Q_k(T*_{1:k−1}, T*_{1:k−1}) = 0. To finish the proof of Theorem 5.2, note that substituting x = θ′ and taking the θ′-expectation gives the corresponding identity in expectation. Substituting into (5.30) and dividing by (2k)! gives the kth term of the expansion, where we have absorbed c_{k,k}‖T*_k‖² into r_k. In Lemma A.2 in the appendix, we show that c_{k,k} = −(2k)!/(2(k!)²).
Hence, the coefficient in front of the squared difference of kth moments is as in Theorem 5.2. Now, Lemma 5.7 will follow from the following lemma.

Lemma 5.8. Let X, X* ∈ R be bounded random variables, and let W = X − X*. Let μ_j = E[X^j], μ*_j = E[X*^j], and ω_j = E_X[W^j]. (Note that ω_j is random with respect to X*.) Let c_λ be the universal coefficients of the moment-cumulant relations (5.10). Then the stated identity holds, where π ∈ R[μ_{1:k−1}, μ*_{1:k−1}] satisfies π(μ*_{1:k−1}, μ*_{1:k−1}) = 0. We show how Lemma 5.7 follows from Lemma 5.8 in the appendix; see Lemma A.4. The idea is to take X = x^⊤θ and X* = x^⊤θ*.
Proof of Lemma 5.8. Define the polynomial truncation q_ω(t) = Σ_{j=1}^k ω_j t^j/j!, and analogously q and q* from the moments μ_j and μ*_j. The proof consists of the following sequence of computations.
A. We show that ω_k Σ_{λ∈S_{2k}, ℓ_max=k} c_λ Π_{ℓ∈λ\k} ω_ℓ = κ_{2k}(ω_1, . . . , ω_k, 0, . . . , 0) − κ_{2k}(ω_1, . . . , ω_{k−1}, 0, 0, . . . , 0). B. We show that (5.33) holds. This is true even if ω_k = 0. Indeed, (5.33) is true if ω_k ≠ 0, and both sides of the equation are continuous with respect to ω_k (the right hand side is also a polynomial in ω_k), so we can set ω_k = ε and take ε → 0. Taking the X* expectation, we get the claimed identity, where π has the desired properties. We prove A. here and delegate the rest to the appendix; see Lemma A.3. Consider the moment-cumulant relation κ_{2k}(ω_1, . . . , ω_{2k}) = Σ_{λ∈S_{2k}} c_λ Π_{ℓ∈λ} ω_ℓ. We split this sum into three parts: lists λ containing an entry greater than k, lists with ℓ_max = k, and lists with all entries less than k. The sum in the middle contains the expression we are interested in. We will express it in terms of the function κ_{2k} by strategically replacing some of the arguments with zero. Indeed, if we replace ω_ℓ with zero for all ℓ > k, then the first sum will vanish. If we also replace ω_k with zero, we get the third sum. The second sum is then the difference of these two. Summarizing, we have ω_k Σ_{λ∈S_{2k}, ℓ_max=k} c_λ Π_{ℓ∈λ\k} ω_ℓ = κ_{2k}(ω_1, . . . , ω_k, 0, . . . , 0) − κ_{2k}(ω_1, . . . , ω_{k−1}, 0, 0, . . . , 0).
Proof. We have M(t) = M(t_0)(1 + u(t)). If sup_{|t−t_0|<R} |u(t)| < 1, then f(t) = log(M(t)/M(t_0)) = log(1 + u(t)) is also real analytic in this neighborhood, and therefore has a convergent Taylor series expansion. On the other hand, since the series defining u(t) is absolutely convergent and since |u(t)| stays below 1, we also have log(1 + u(t)) = Σ_{k=1}^∞ (−1)^{k+1} u(t)^k/k = Σ_{m=1}^∞ κ_m (t − t_0)^m/m!, where in the last equality the κ_m are obtained by expanding the powers of the series and rearranging terms to combine like powers of t − t_0. Note that the lowest power of t − t_0 in u(t)^k is k; therefore, the term (t − t_0)^m arises only in the series expansions of u(t)^1, . . . , u(t)^m. Moreover, the coefficient of (t − t_0)^m in the expansions of u(t)^k, k = 1, . . . , m, will depend only on μ_1, . . . , μ_m. Therefore, κ_m is a polynomial in μ_1, . . . , μ_m, and it is clear that this polynomial should be of the form (A.1). The coefficients c_λ clearly do not depend on the particular values of the μ_k, so they are universal constants. This proves the first assertion; the second assertion, about the Taylor expansion of log(1 + u(t)), clearly follows.
Lemma A.2. The coefficient c_{2p} in front of the term μ_{2p} in the polynomial defining κ_{2p} (see (A.1)) is c_{2p} = 1. The coefficient c_{p,p} in front of the term μ_p² is c_{p,p} = −(2p)!/(2(p!)²).

Proof. Write M(t) = 1 + ε, where ε = Σ_{k≥1} μ_k t^k/k!, so that κ_{2p} is (2p)! times the t^{2p} coefficient of log(1 + ε) = ε − ε²/2 + · · ·. Note that the moments appear individually in ε. Thus, the moment μ_{2p} by itself can only appear through ε¹, where it has coefficient 1/(2p)!. Since κ_{2p} is (2p)! times the t^{2p} coefficient, we must have c_{2p} = 1. Similarly, μ_p² is a product of two moments and can therefore only appear through ε². In the expansion of ε², μ_p² t^{2p} appears with coefficient 1/(p!)². Multiplying by −1/2, we have that the coefficient of μ_p² t^{2p} in the expansion of log(1 + ε) is −1/(2(p!)²). Multiplying by (2p)!, we arrive at c_{p,p} = −(2p)!/(2(p!)²).

Lemma A.3. Let X, X* ∈ R be bounded random variables, and let W = X − X*. We denote the moments of X, X*, W by μ_j = E[X^j], μ*_j = E[X*^j], and ω_j = E_X[W^j]; note that ω_j is random with respect to X*. Let c_λ be the universal coefficients of the moment-cumulant relations (A.1). Then the identity of Lemma 5.8 holds, where π is a polynomial in μ_{1:k−1}, μ*_{1:k−1} with universal coefficients (independent of the values of the moments), such that each monomial has total order k, and which satisfies π(μ*_{1:k−1}, μ*_{1:k−1}) = 0. Define G_ω(t) = E_X[e^{tW}] and G(t) = E_X[e^{tX}], together with the polynomial truncations q_ω, q, q* of the corresponding exponential series. The proof consists of the following steps (we only summarize them; for more detail see the main text).
A. We showed this step in the main text (in the proof of Lemma 5.8). B. We show the corresponding identity, where π has the desired properties. Since A. was proved in the main text, it remains to prove B., C., and D.
Proof. For B., note that M^{(p)}(0) = 0 for p = 1, . . . , k − 1, so all products Π_{ℓ∈λ} M^{(ℓ)}(0) involving some ℓ < k are zero. But the only lists λ ∈ S_{2k} involving only entries of order k and higher are λ = {k, k} and λ = {2k}. We must therefore compute the kth and 2kth derivatives of M at zero. By the product rule, and using that only the kth derivative of t^k is nonzero at t = 0, we obtain these derivatives, and therefore the claimed identity. (We used that c_{2k} = 1 and that the binomial coefficient (2k choose k) equals −2c_{k,k}, proved in Lemma A.2.) Subtracting c_{k,k} ω_k² and multiplying by −1 gives the desired result. This concludes B. For C., note that G_ω(t) = 1 + q_ω(t) + t^{k+1} r_ω(t) for some smooth function r_ω(t). By Taylor expanding (1 + q_ω)^{−1} and (1 + q_ω + t^{k+1} r_ω)^{−1} in a neighborhood of 0, and combining like powers of t (justified by the absolute convergence of both series in a neighborhood of zero), we see that the order t^k terms in both series are the same, and hence so is d_k. We now take the X* expectation and bring it inside the derivative. This is justified by the absolute convergence of the series and its derivatives, and the fact that X* is bounded. Now, note that since W = X − X*, we can write G_ω(t) = e^{−tX*} G(t). Finally, G*/G = (1 + q* + t^{k+1} r*)(1 + q + t^{k+1} r)^{−1} for some smooth functions r* and r. By a similar Taylor expansion argument as before, the t^k coefficient of the expansion of G*/G around t = 0 is the same as that of (1 + q*)/(1 + q).

Appendix B. Moment Tensor Computations
Let θ, θ* ∈ R^d be random vectors distributed according to ρ and ρ* respectively, each of which is a compactly supported probability measure. Recall the moment tensors T_k = E_θ[θ^{⊗k}] and T*_k = E_{θ*}[θ*^{⊗k}]. For a multi-index ℓ = (ℓ_1, . . . , ℓ_k), we write T^ℓ_k to denote the tensor entry T^{ℓ_1,...,ℓ_k}_k. Recall the spaces V_k, R_k from Definition 3; we define these spaces more formally here.

Proof of Lemma B.3. Since the elements g of the group G act on vectors by orthogonal transformations, we have ⟨gθ, hθ⟩ = ⟨θ, g^{−1}hθ⟩ = ⟨g′θ, g′g^{−1}hθ⟩ for all g, g′, h ∈ G.
Lemma B.4. Let v_1, . . . , v_n be multi-indices, where the indices in each multi-index are between 1 and d. Let |v_j| denote the number of indices in v_j, and define L = max{|v_1|, |v_2|, . . . , |v_n|} and M = |v_1| + |v_2| + · · · + |v_n|. Note that |I| ≤ |v|. Now, applying (B.11) for v = v_1, . . . , v_n and multiplying the results together yields entries of θ-moment tensors of order no higher than L = max_{1≤i≤n} |v_i|, as well as products of at most M = |v_1| + · · · + |v_n| entries of θ*. Upon taking the θ*-expectation, these products become entries of θ*-moment tensors of order at most M. Noting that the sum of the moment orders in the product (B.10) is M, we see that a(θ, θ*) ∈ R_M[T_{1:L}, T*_{1:M}], as desired. Recall the definition of J_p.

Lemma B.5. Define r(θ, θ*) = E_{θ*}⟨E_θ[w^{⊗p}] − T_p, J_p − c_{p,p}T_p⟩.
Proposition C.3. For all θ such that (‖θ‖_∞ ∨ 1)/σ ≤ R, we have the following bound, for some constant C that depends on d and R only.
Proof. The first condition is standard. To show the second condition, note that a critical point x is a local minimum (maximum) of f|_M if and only if, for every curve u(t) ⊂ M passing through x, the function t → f(u(t)) has a local minimum (maximum) at t = t_x, the point for which u(t_x) = x.
Let u be such a curve. Now, (f ∘ u)′(t_x) = ⟨∇f(x), u′(t_x)⟩ = 0, since ∇f(x) lies in the span of ∇g_j(x), j = 1, . . . , n, while u′(t_x) lies in the tangent plane to M at x. Hence t → f(u(t)) has a critical point at t_x. Note that (f ∘ u)″(t_x) = u′(t_x)^⊤ ∇²f(x) u′(t_x) + ⟨∇f(x), u″(t_x)⟩ = u′(t_x)^⊤ [∇²f(x) − Σ_{j=1}^n λ_j ∇²g_j(x)] u′(t_x). The last equality follows from the fact that 0 = (d²/dt²) g_j(u(t))|_{t=t_x} = ⟨∇g_j(x), u″(t_x)⟩ + u′(t_x)^⊤ ∇²g_j(x) u′(t_x), j = 1, . . . , n. Now, u′(t_x) can be any vector in the tangent plane of M at x, i.e. any vector perpendicular to ∇g_j(x), j = 1, . . . , n. Thus, for (f ∘ u)″(t_x) to have the same sign for all curves u, the quadratic form ∇²f(x) − Σ_{j=1}^n λ_j ∇²g_j(x) must be definite on the tangent plane at x.