On the entropy and information of Gaussian mixtures

We establish several convexity properties for the entropy and Fisher information of mixtures of centered Gaussian distributions. First, we prove that if $X_1, X_2$ are independent scalar Gaussian mixtures, then the entropy of $\sqrt{t}X_1 + \sqrt{1-t}X_2$ is concave in $t \in [0,1]$, thus confirming a conjecture of Ball, Nayar and Tkocz (2016) for this class of random variables. In fact, we prove a generalisation of this assertion which also strengthens a result of Eskenazis, Nayar and Tkocz (2018). For the Fisher information, we extend a convexity result of Bobkov (2022) by showing that the Fisher information matrix is operator convex as a matrix-valued function acting on densities of mixtures in $\mathbb{R}^d$. As an application, we establish rates for the convergence of the Fisher information matrix of the sum of weighted i.i.d. Gaussian mixtures in the operator norm along the central limit theorem under mild moment assumptions.

1. Introduction

1.1. Entropy. Let $X$ be a continuous random vector in $\mathbb{R}^d$ with density $f : \mathbb{R}^d \to \mathbb{R}_+$. The (differential) entropy of $X$ is the quantity

$$h(X) = -\int_{\mathbb{R}^d} f(x)\log f(x)\,dx, \tag{1}$$

where $\log$ always denotes the natural logarithm. The celebrated entropy power inequality of Shannon and Stam [30, 31] (see also [25]) implies that for all independent continuous random vectors $X_1, X_2$ in $\mathbb{R}^d$, we have

$$e^{2h(X_1+X_2)/d} \geq e^{2h(X_1)/d} + e^{2h(X_2)/d}. \tag{2}$$

In general, the entropy power inequality cannot be reversed (see, e.g., the construction of [11, Proposition 4]). However, reverse entropy power inequalities have been considered under different assumptions on the random vectors, such as log-concavity [9, 15, 3, 26]. It follows directly from (2) that if $X_1, X_2$ are i.i.d. random vectors, then the entropy function $t \mapsto h(\sqrt{t}X_1 + \sqrt{1-t}X_2)$ is minimised at $t = 0$ and $t = 1$. In the spirit of reversing the entropy power inequality, Ball, Nayar and Tkocz [3] raised the question of maximising this function. In particular, they gave an example of a random variable $X_1$ for which the maximum is not attained at $t = \frac{1}{2}$, but conjectured that for i.i.d. log-concave random variables this function must be concave in $t \in [0,1]$, in which case it is in particular maximised at $t = \frac{1}{2}$. It is worth noting that the conjectured concavity would also be a strengthening of the entropy power inequality for i.i.d. random variables, as (2) amounts to the concavity condition for the points $0, t, 1$. So far, no special case of the conjecture of [3] seems to be known.
In this work, we consider (centered) Gaussian mixtures, i.e. random variables of the form

$$X = YZ, \tag{3}$$

where $Y$ is an almost surely positive random variable and $Z$ is a standard Gaussian random variable, independent of $Y$. The resulting random variable has density of the form

$$f_X(x) = \mathbb{E}\Big[\frac{1}{\sqrt{2\pi}\,Y}\, e^{-x^2/2Y^2}\Big], \qquad x \in \mathbb{R}. \tag{4}$$

In particular, as observed in [18], (4) combined with Bernstein's theorem readily implies that a symmetric random variable $X$ is a Gaussian mixture if and only if $x \mapsto f_X(\sqrt{x})$ is completely monotonic on $(0,\infty)$. Therefore, distributions with density proportional to $e^{-|x|^p}$, symmetric $p$-stable random variables, where $p \in (0,2]$, and the Cauchy distribution are Gaussian mixtures. Let us mention that Costa [16] also considered symmetric stable laws to prove a strengthened version of the entropy power inequality that fails in general.
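To make the scalar definition concrete, here is a quick numerical sanity check (our illustration, not part of the original argument): the standard Laplace distribution, whose density is $\frac{1}{2}e^{-|x|}$, arises as $X = YZ$ with $Y = \sqrt{2E}$ for $E$ standard exponential, as can be verified through the characteristic function $\mathbb{E}[e^{itYZ}] = \mathbb{E}[e^{-t^2E}] = \frac{1}{1+t^2}$.

```python
import numpy as np

# Sketch: verify numerically that the Laplace law (density 0.5*exp(-|x|))
# is the Gaussian mixture Y*Z with Y = sqrt(2*E), E ~ Exp(1).  By symmetry
# its characteristic function is real and equals 1/(1+t^2).
rng = np.random.default_rng(0)
n = 10**6
Y = np.sqrt(2.0 * rng.exponential(size=n))
Z = rng.standard_normal(n)
X = Y * Z

for t in [0.5, 1.0, 2.0]:
    empirical = np.mean(np.cos(t * X))      # empirical characteristic function
    exact = 1.0 / (1.0 + t**2)              # CF of the standard Laplace law
    print(f"t={t}: empirical {empirical:.4f} vs exact {exact:.4f}")
```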
Our first result proves the concavity of entropy conjectured in [3] for Gaussian mixtures.
Theorem 1. Let $X_1, X_2$ be independent Gaussian mixtures. Then the function

$$t \mapsto h\big(\sqrt{t}X_1 + \sqrt{1-t}\,X_2\big)$$

is concave on the interval $[0,1]$.
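Before turning to the general statement, the following quadrature experiment (our sketch; the two-point scale law is a hypothetical choice) illustrates Theorem 1: for $X_1, X_2$ i.i.d. Gaussian mixtures with $Y$ uniform on $\{0.5, 2\}$, the sum $\sqrt{t}X_1 + \sqrt{1-t}X_2$ is an explicit four-component Gaussian mixture, and the second differences of its entropy in $t$ come out non-positive.

```python
import numpy as np

# Sketch illustrating Theorem 1: for Y uniform on {0.5, 2.0}, the sum
# sqrt(t)X1 + sqrt(1-t)X2 is a 4-component centered Gaussian mixture with
# variances t*s1^2 + (1-t)*s2^2, so its entropy is computable by quadrature.
sigmas = [0.5, 2.0]
x = np.linspace(-30, 30, 20001)
dx = x[1] - x[0]

def entropy(t):
    f = np.zeros_like(x)
    for s1 in sigmas:
        for s2 in sigmas:
            v = t * s1**2 + (1 - t) * s2**2
            f += 0.25 * np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return -np.sum(f * np.log(f)) * dx

ts = np.linspace(0, 1, 21)
h = np.array([entropy(t) for t in ts])
second_diff = h[:-2] - 2 * h[1:-1] + h[2:]     # discrete second derivative
print("max second difference (concavity => <= 0):", second_diff.max())
```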
Theorem 1 will be a straightforward consequence of a more general result for the Rényi entropy of a weighted sum of $n$ Gaussian mixtures. Let $\triangle^{n-1}$ be the standard simplex in $\mathbb{R}^n$,

$$\triangle^{n-1} = \Big\{(t_1,\ldots,t_n) \in \mathbb{R}^n :\ t_i \geq 0 \ \text{and}\ \sum_{i=1}^n t_i = 1\Big\}.$$

The Rényi entropy of order $\alpha > 1$ of a random vector $X$ with density $f$ is given by

$$h_\alpha(X) = \frac{1}{1-\alpha}\log\int_{\mathbb{R}^d} f(x)^\alpha\,dx, \tag{7}$$

and $h_1(X) := \lim_{\alpha\to 1^+} h_\alpha(X)$ is simply the Shannon entropy $h(X)$. We will prove the following general concavity.

Theorem 2. Fix $\alpha \geq 1$ and let $X_1,\ldots,X_n$ be independent Gaussian mixtures. Then the function

$$\triangle^{n-1} \ni (t_1,\ldots,t_n) \mapsto h_\alpha\big(\sqrt{t_1}X_1 + \cdots + \sqrt{t_n}X_n\big) \tag{8}$$

is concave.
In [18, Theorem 8], it was shown that if $X_1,\ldots,X_n$ are i.i.d. Gaussian mixtures, then the function (8) is Schur concave, namely that if $(a_1,\ldots,a_n)$ and $(b_1,\ldots,b_n)$ are two unit vectors in $\mathbb{R}^n$ with $(a_1^2,\ldots,a_n^2) \preceq_m (b_1^2,\ldots,b_n^2)$, then for any $\alpha \geq 1$,

$$h_\alpha\Big(\sum_{i=1}^n a_iX_i\Big) \geq h_\alpha\Big(\sum_{i=1}^n b_iX_i\Big), \tag{9}$$

where $\preceq_m$ is the majorisation ordering of vectors (see [18]). As the point $(\frac{1}{n},\ldots,\frac{1}{n})$, corresponding to the unit vector with all coordinates equal to $\frac{1}{\sqrt{n}}$, is majorised by any other vector in $\triangle^{n-1}$, (9) implies that the function (8) achieves its maximum on the main diagonal for Gaussian mixtures.
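This consequence of (9) is easy to see numerically. In the sketch below (our illustration, with an arbitrary two-point scale law and $n = 3$), the entropy at the main diagonal dominates the entropy at randomly drawn unit vectors.

```python
import numpy as np
from itertools import product

# Sketch: for i.i.d. Gaussian mixtures with scale Y uniform on {0.5, 2.0},
# compare h(sum a_i X_i) at the diagonal unit vector with random unit
# vectors; Schur concavity predicts the diagonal is the maximiser.
sigmas = [0.5, 2.0]
x = np.linspace(-40, 40, 80001)
dx = x[1] - x[0]

def entropy(a):
    f = np.zeros_like(x)
    for combo in product(sigmas, repeat=len(a)):     # 2^n scale choices
        v = sum(ai**2 * s**2 for ai, s in zip(a, combo))
        f += np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v) / 2**len(a)
    return -np.sum(f * np.log(f)) * dx

rng = np.random.default_rng(4)
diag = np.ones(3) / np.sqrt(3)
rand = [v / np.linalg.norm(v) for v in rng.standard_normal((5, 3))]
print("h at the diagonal:", entropy(diag))
print("max h over 5 random unit vectors:", max(entropy(v) for v in rand))
```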
We note in passing that, while the conclusion of Theorem 1 has been conjectured in [3] to hold for all i.i.d. log-concave random variables $X_1, X_2$, the conclusion of Theorem 2 cannot hold for this class of variables. In [27], Madiman, Nayar and Tkocz constructed a symmetric log-concave random variable $X$ for which the Schur concavity (9) does not hold for i.i.d. copies of $X$ and thus, as a consequence of [28, p. 97], the concavity of Theorem 2 must also fail.

1.2. Fisher information. Let $X$ be a continuous random vector in $\mathbb{R}^d$ with smooth density $f : \mathbb{R}^d \to \mathbb{R}_+$. The Fisher information of $X$ is the quantity

$$I(X) = \int_{\mathbb{R}^d} \frac{|\nabla f(x)|^2}{f(x)}\,dx = \mathbb{E}\,|\rho(X)|^2, \tag{10}$$

where $\rho(x) = \frac{\nabla f(x)}{f(x)}$ is the score function of $X$. Fisher information and entropy are connected by the classical de Bruijn identity (see, e.g., [31]), due to which most results for Fisher information are formally stronger than their entropic counterparts. In particular, the inequality of Blachman and Stam [31, 8],

$$\frac{1}{I(X_1+X_2)} \geq \frac{1}{I(X_1)} + \frac{1}{I(X_2)},$$

which holds for all independent random vectors $X_1, X_2$ in $\mathbb{R}^d$, implies the entropy power inequality (2). In the spirit of the question of Ball, Nayar and Tkocz [3] and of the result of [18], we raise the following problem.
Question 3. Let $X_1,\ldots,X_n$ be i.i.d. Gaussian mixtures. For which unit vectors $(a_1,\ldots,a_n)$ in $\mathbb{R}^n$ is the Fisher information of $\sum_{i=1}^n a_iX_i$ minimised?

While Question 3 still remains elusive, we shall now explain how to obtain some useful bounds for the Fisher information of mixtures. In order to state our results in the greatest possible generality, we consider random vectors which are mixtures of centered multivariate Gaussians. Recall that the Fisher information matrix of a random vector $X$ in $\mathbb{R}^d$ is given by

$$\mathcal{I}(X) = \int_{\mathbb{R}^d} \frac{\nabla f(x)\,\nabla f(x)^{\mathsf T}}{f(x)}\,dx, \tag{11}$$

where $f : \mathbb{R}^d \to \mathbb{R}_+$ is the smooth density of $X$, so that $I(X) = \mathrm{tr}\,\mathcal{I}(X)$. Let $\mathcal{F}_d \subset L^1(\mathbb{R}^d)$ be the space of smooth probability densities on $\mathbb{R}^d$. By abuse of notation, we will also write $I(f)$ and $\mathcal{I}(f)$ to denote the Fisher information and Fisher information matrix respectively of a random vector with smooth density $f$ on $\mathbb{R}^d$. In his recent treatise on estimates for the Fisher information, Bobkov made crucial use of the convexity of the Fisher information functional $I(X)$ as a function of the density of the random variable $X$, see [10, Proposition 15.2]. For our purposes we shall need the following matricial extension of this.
Proposition 4. Let $\pi$ be a probability measure on $\mathcal{F}_d$. Then,

$$\mathcal{I}\Big(\int_{\mathcal{F}_d} g\,d\pi(g)\Big) \preceq \int_{\mathcal{F}_d} \mathcal{I}(g)\,d\pi(g), \tag{13}$$

provided that $\int_{\mathcal{F}_d} \|\mathcal{I}(g)\|_{op}\,d\pi(g) < \infty$. Here $\preceq$ denotes the positive semi-definite ordering of matrices.
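As a sanity check of the definition (11) (our sketch, not from the paper), the Fisher information matrix can be estimated by Monte Carlo from the score: for $X \sim N(0,\Sigma)$ one has $\rho(x) = -\Sigma^{-1}x$, so $\mathcal{I}(X) = \Sigma^{-1}$, the fact used below in the proof of Corollary 6.

```python
import numpy as np

# Sketch: estimate I(X) = E[rho(X) rho(X)^T] by Monte Carlo for a Gaussian
# X ~ N(0, Sigma), whose score is rho(x) = -Sigma^{-1} x; the estimate
# should recover Sigma^{-1}.
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

X = rng.multivariate_normal(np.zeros(2), Sigma, size=10**6)
score = -X @ Sigma_inv                     # Sigma_inv is symmetric
I_hat = score.T @ score / len(X)           # empirical E[rho rho^T]
print("Monte Carlo estimate of I(X):\n", I_hat)
print("Sigma^{-1}:\n", Sigma_inv)
```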
We propose the following definition of Gaussian mixtures in arbitrary dimension.

Definition 5. Fix $d \in \mathbb{N}$. A random vector $X$ in $\mathbb{R}^d$ is a (centered) Gaussian mixture if $X$ has the same distribution as $YZ$, where $Y$ is a random $d \times d$ matrix which is almost surely positive definite (and thus symmetric) and $Z$ is a standard Gaussian random vector in $\mathbb{R}^d$, independent of $Y$.

As in the scalar case, a Gaussian mixture $X$ in $\mathbb{R}^d$ has density of the form

$$f_X(x) = \mathbb{E}\Big[\frac{1}{(2\pi)^{d/2}\det(Y)}\, e^{-\frac12\langle (YY^{\mathsf T})^{-1}x,\,x\rangle}\Big], \qquad x \in \mathbb{R}^d. \tag{14}$$

Employing Proposition 4 for Gaussian mixtures we deduce the following bound.
Corollary 6. Fix $d \in \mathbb{N}$ and let $X$ be a random vector in $\mathbb{R}^d$ admitting a Gaussian mixture representation $YZ$. Then, we have

$$\mathcal{I}(X) \preceq \mathbb{E}\big[(YY^{\mathsf T})^{-1}\big]. \tag{15}$$

This upper bound should be contrasted with the general lower bound

$$\big(\mathbb{E}\,YY^{\mathsf T}\big)^{-1} = \mathrm{Cov}(X)^{-1} \preceq \mathcal{I}(X), \tag{16}$$

where the inequality is the multivariate Cramér–Rao bound [7, Theorem 3.4.4].
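Both bounds can be tested numerically. The following quadrature sketch (our own discretisation, with two arbitrarily chosen positive definite scales) checks the sandwich $\mathrm{Cov}(X)^{-1} \preceq \mathcal{I}(X) \preceq \mathbb{E}[(YY^{\mathsf T})^{-1}]$ for a two-component matrix mixture in $\mathbb{R}^2$.

```python
import numpy as np

# Sketch: X = YZ with Y uniform on two positive definite matrices; compute
# the Fisher information matrix of the two-component Gaussian mixture by
# 2-d quadrature and verify Cov(X)^{-1} <= I(X) <= E[(YY^T)^{-1}].
A = np.array([[1.0, 0.2], [0.2, 0.8]])    # hypothetical scale matrices
B = np.array([[0.6, -0.1], [-0.1, 1.5]])
covs = [A @ A.T, B @ B.T]                 # the two values of YY^T

g = np.linspace(-12, 12, 601)
dx = g[1] - g[0]
xx, yy = np.meshgrid(g, g)
pts = np.stack([xx.ravel(), yy.ravel()], axis=1)

f = np.zeros(len(pts))
grad = np.zeros((len(pts), 2))
for C in covs:
    Cinv = np.linalg.inv(C)
    dens = 0.5 * np.exp(-0.5 * np.sum((pts @ Cinv) * pts, axis=1)) \
           / (2 * np.pi * np.sqrt(np.linalg.det(C)))
    f += dens
    grad += -dens[:, None] * (pts @ Cinv)   # gradient of each component

I_X = (grad.T * (1.0 / f)) @ grad * dx**2   # integral of grad f grad f^T / f
I_X = (I_X + I_X.T) / 2                     # symmetrise numerical noise
lower = np.linalg.inv(0.5 * (covs[0] + covs[1]))       # Cov(X)^{-1}
upper = 0.5 * sum(np.linalg.inv(C) for C in covs)      # E[(YY^T)^{-1}]

print("eigs of I(X) - Cov(X)^{-1} (>= 0):", np.linalg.eigvalsh(I_X - lower))
print("eigs of E[(YY^T)^{-1}] - I(X) (>= 0):", np.linalg.eigvalsh(upper - I_X))
```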
1.2.1. Quantitative CLT for the Fisher information matrix of Gaussian mixtures. Equality in the Cramér–Rao bound (16) is attained if and only if $X$ is Gaussian. The deficit in the scalar version of this inequality is the relative Fisher information $I(X\|Z)$ between $X$ and $Z$, and it may be interpreted as a strong measure of distance of $X$ from Gaussianity. In particular, in view of Gross' logarithmic Sobolev inequality [21] and Pinsker's inequality [29, 17, 23], closeness in relative Fisher information implies closeness in relative entropy and a fortiori in total variation distance. Therefore, a very natural question is under which conditions and with what rate the relative Fisher information of a weighted sum tends to zero along the central limit theorem, thus offering a strengthening of the entropic central limit theorem [4]. As an application of Corollary 6, we obtain a bound for a matrix analogue of the relative Fisher information of Gaussian mixtures. Here and throughout, $\|\cdot\|_{op}$ denotes the operator norm of a square matrix.
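This chain of implications can also be seen numerically. The sketch below (our quadrature illustration, with an arbitrary two-point scale mixture) computes $I(X\|Z)$, the relative entropy $D(X\|Z)$ and the total variation distance for a scalar Gaussian mixture, and verifies $D \leq I/2$ (logarithmic Sobolev) and $\mathrm{TV} \leq \sqrt{D/2}$ (Pinsker).

```python
import numpy as np

# Sketch: for a scalar Gaussian mixture X (scales 0.7 and 1.3), compute
# the relative Fisher information I(X||Z), relative entropy D(X||Z) and
# total variation to the standard Gaussian, then check the chain
# D <= I/2 (Gross' log-Sobolev inequality) and TV <= sqrt(D/2) (Pinsker).
x = np.linspace(-25, 25, 400001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

sigmas = [0.7, 1.3]
f = sum(0.5 * np.exp(-x**2 / (2*s**2)) / (s * np.sqrt(2*np.pi)) for s in sigmas)
fp = sum(0.5 * np.exp(-x**2 / (2*s**2)) / (s * np.sqrt(2*np.pi)) * (-x / s**2)
         for s in sigmas)

I_rel = np.sum(f * (fp / f + x)**2) * dx    # I(X||Z) = E[(score(X) + X)^2]
D_rel = np.sum(f * np.log(f / phi)) * dx    # relative entropy D(X||Z)
TV = 0.5 * np.sum(np.abs(f - phi)) * dx

print(f"I(X||Z)={I_rel:.5f}, D(X||Z)={D_rel:.5f}, TV={TV:.5f}")
print("log-Sobolev D <= I/2:", D_rel <= I_rel / 2)
print("Pinsker TV <= sqrt(D/2):", TV <= np.sqrt(D_rel / 2))
```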
Theorem 7. Fix $d \in \mathbb{N}$, $\delta \in (0,1]$ and let $X_1,\ldots,X_n$ be i.i.d. random vectors in $\mathbb{R}^d$, each admitting a Gaussian mixture representation $YZ$ as above. Assume also that

$$\mathbb{E}\,\|YY^{\mathsf T}\|_{op}^{1+\delta} + \mathbb{E}\,\|(YY^{\mathsf T})^{-1}\|_{op}^{1+\delta} < \infty. \tag{17}$$

Then, for every unit vector $a = (a_1,\ldots,a_n)$ in $\mathbb{R}^n$, the weighted sum $S_n = \sum_{i=1}^n a_iX_i$ satisfies

$$\big\|\mathrm{Cov}(S_n)^{1/2}\,\mathcal{I}(S_n)\,\mathrm{Cov}(S_n)^{1/2} - \mathrm{I}_d\big\|_{op} \leq C(Y)\,\big(\log(d+1)\big)^{\delta}\,\|a\|_{2+2\delta}^{\frac{2\delta}{1+\delta}}, \tag{18}$$

where $C(Y)$ is a constant that depends only on the moments of $\|YY^{\mathsf T}\|_{op}$.
There is a vast literature on quantitative versions of the central limit theorem. The first to obtain efficient bounds for the relative Fisher information of weighted sums were Artstein, Ball, Barthe and Naor [1] (see also the work [22] of Johnson and Barron), who obtained a $O(\|a\|_4^4)$ upper bound on $I(S_n\|Z)$, where $S_n = \sum_{i=1}^n a_iX_i$ for $X_1,\ldots,X_n$ i.i.d. random variables satisfying a Poincaré inequality. In particular, this bound reduces to the sharp rate $O(\frac{1}{n})$ on the main diagonal. Following a series of works on the relative entropy of weighted sums [12, 13], Bobkov, Chistyakov and Götze investigated in [14] upper bounds for the relative Fisher information along the main diagonal under finite moment assumptions. More specifically, their main result asserts that if $\mathbb{E}|X_1|^s < \infty$ for some $s \in (2,4)$, then

$$I(S_n\|Z) = O\big(n^{-\frac{s-2}{2}+o(1)}\big),$$

where the $n^{o(1)}$ term is a power of $\log n$, provided that the Fisher information of the sum is finite for some $n$. The exponent $\frac{s-2}{2}$ is sharp in this estimate. Moreover, it is also shown in [14] that if $\mathbb{E}X_1^4 < \infty$, then the relative Fisher information decays with the optimal $O(\frac{1}{n})$ rate of convergence. This is a far-reaching extension of the results of [1, 22] on the main diagonal, as the Poincaré inequality assumption in particular implies finiteness of all moments.
The scalar version of Theorem 7 (corresponding to $d = 1$) is in various ways weaker than the results of [14]. Firstly, it applies only within the class of Gaussian mixtures and it requires the finiteness of a negative moment of the random variable besides a positive one. Moreover, even if these assumptions are satisfied, the bound (18) yields the rate $O(n^{-c_\delta})$ with $c_\delta = \frac{\delta^2}{(1+\delta)^2}$ along the main diagonal if $X_1$ has a finite $2+2\delta$ moment. This is weaker than the sharp $O(n^{-\delta+o(1)})$ which follows from [14]. On the other hand, Theorem 7 applies to general coefficients beyond the main diagonal and, in contrast to [1, 22], does not require the finiteness of all positive moments. More importantly though, (18) is a multi-dimensional bound with a subpolynomial dependence on the dimension $d$. To the best of our knowledge, this is the first such bound for the relative Fisher information matrix of a weighted sum and it would be very interesting to extend it to more general classes of random vectors and to obtain sharper rates. The logarithmic dependence on the dimension in Theorem 7 is a consequence of a classical result of Tomczak-Jaegermann [32] on the uniform smoothness of Schatten classes. While Theorem 7 is stated in terms of the operator norm, the proof yields an upper bound for any operator monotone matrix norm (see Remark 13) in terms of its Rademacher type constants.
Acknowledgements.We are grateful to Léonard Cadilhac for helpful discussions.

2. Concavity of entropy
This section is devoted to the proof of Theorem 2. We shall make use of the standard variational formula for entropy, which asserts that if $X$ is a continuous random variable with density $f$, then

$$h(X) = \inf_{g}\Big\{-\int_{\mathbb{R}} f(x)\log g(x)\,dx\Big\},$$

where the infimum is taken over all probability densities $g$.

Proof of Theorem 2. We start with the Shannon entropy, which corresponds to $\alpha = 1$. Fix two unit vectors $(a_1,\ldots,a_n)$ and $(b_1,\ldots,b_n)$ and, for $t \in [0,1]$, denote by $g_t : \mathbb{R} \to \mathbb{R}_+$ the density of

$$X_t = \sum_{i=1}^n \big(ta_i^2 + (1-t)b_i^2\big)^{1/2} X_i.$$

The statement of the theorem is equivalent to the concavity of the function $f(t) = h(X_t)$ on the interval $[0,1]$. Let $\lambda, t_1, t_2 \in [0,1]$ and set $t = \lambda t_1 + (1-\lambda)t_2$. By the variational formula for entropy, we have

$$\lambda h(X_{t_1}) + (1-\lambda)h(X_{t_2}) \leq -\lambda\int_{\mathbb{R}} g_{t_1}\log g_t - (1-\lambda)\int_{\mathbb{R}} g_{t_2}\log g_t. \tag{22}$$

Moreover, since $X_i$ has the same distribution as the independent product $Y_iZ_i$, the stability of Gaussian measure implies the equality in distribution

$$X_t \stackrel{d}{=} \Big(\underbrace{\sum_{i=1}^n \big(ta_i^2 + (1-t)b_i^2\big)Y_i^2}_{=:\,V_t}\Big)^{1/2} Z,$$

where $Z$ is a standard Gaussian random variable independent of $Y_1,\ldots,Y_n$. Therefore, $X_t$ is itself a Gaussian mixture. By the characterisation of [18, Theorem 2], this is equivalent to the complete monotonicity of the function $g_t(\sqrt{\cdot})$. Thus, by Bernstein's theorem, $g_t(\sqrt{\cdot})$ is the Laplace transform of a non-negative Borel measure on $(0,\infty)$ and therefore the function $\varphi_t := -\log g_t(\sqrt{\cdot})$ is concave on $(0,\infty)$. Since $V_t = \lambda V_{t_1} + (1-\lambda)V_{t_2}$, this concavity gives, pointwise,

$$\lambda\varphi_t(V_{t_1}z^2) + (1-\lambda)\varphi_t(V_{t_2}z^2) \leq \varphi_t(V_tz^2). \tag{23}$$

Hence, by (22) and (23), we have

$$\lambda h(X_{t_1}) + (1-\lambda)h(X_{t_2}) \leq \lambda\,\mathbb{E}\,\varphi_t(X_{t_1}^2) + (1-\lambda)\,\mathbb{E}\,\varphi_t(X_{t_2}^2) \leq \mathbb{E}\,\varphi_t(X_t^2) = h(X_t),$$

where we also used that $X_{t_j}$ has the same distribution as $\sqrt{V_{t_j}}\,Z$. This completes the proof of the concavity of Shannon entropy.
Next, let $\alpha > 1$ and consider again $t = \lambda t_1 + (1-\lambda)t_2$. Denoting by $\psi_t = g_t^{\alpha-1}(\sqrt{\cdot})$ and applying the same reasoning, we get

$$\int_{\mathbb{R}} g_t^\alpha = \mathbb{E}\,\psi_t(X_t^2) = \mathbb{E}\,\psi_t\big(\lambda V_{t_1}Z^2 + (1-\lambda)V_{t_2}Z^2\big).$$

Now $\psi_t = e^{-(\alpha-1)\varphi_t}$ is log-convex and thus, by Hölder's inequality and (23),

$$\mathbb{E}\,\psi_t(X_t^2) \leq \mathbb{E}\big[\psi_t(V_{t_1}Z^2)^{\lambda}\,\psi_t(V_{t_2}Z^2)^{1-\lambda}\big] \leq \big(\mathbb{E}\,\psi_t(X_{t_1}^2)\big)^{\lambda}\,\big(\mathbb{E}\,\psi_t(X_{t_2}^2)\big)^{1-\lambda}. \tag{26}$$

By two more applications of Hölder's inequality, we get

$$\mathbb{E}\,\psi_t(X_{t_1}^2) = \int_{\mathbb{R}} g_{t_1}g_t^{\alpha-1} \leq \Big(\int_{\mathbb{R}} g_{t_1}^\alpha\Big)^{1/\alpha}\Big(\int_{\mathbb{R}} g_t^\alpha\Big)^{(\alpha-1)/\alpha} \tag{27}$$

and

$$\mathbb{E}\,\psi_t(X_{t_2}^2) = \int_{\mathbb{R}} g_{t_2}g_t^{\alpha-1} \leq \Big(\int_{\mathbb{R}} g_{t_2}^\alpha\Big)^{1/\alpha}\Big(\int_{\mathbb{R}} g_t^\alpha\Big)^{(\alpha-1)/\alpha}. \tag{28}$$

Combining (26), (27) and (28) we thus obtain

$$\Big(\int_{\mathbb{R}} g_t^\alpha\Big)^{1/\alpha} \leq \Big(\int_{\mathbb{R}} g_{t_1}^\alpha\Big)^{\lambda/\alpha}\Big(\int_{\mathbb{R}} g_{t_2}^\alpha\Big)^{(1-\lambda)/\alpha},$$

which, after taking logarithms and dividing by $1-\alpha < 0$, is exactly the claimed concavity of Rényi entropy. □

Remark 8. One may wonder whether Theorem 2 can be extended to Gaussian mixtures on $\mathbb{R}^d$ in the sense of Definition 5. Denoting by $\sqrt{M}$ the positive semidefinite square root of a positive semidefinite matrix $M$ and repeating the above argument, we would need the validity of the inequality

$$g\big(\sqrt{\lambda A + (1-\lambda)B}\,z\big) \leq g\big(\sqrt{A}\,z\big)^{\lambda}\,g\big(\sqrt{B}\,z\big)^{1-\lambda}, \tag{30}$$

where $g : \mathbb{R}^d \to \mathbb{R}_+$ is the density of a Gaussian mixture, $A$ and $B$ are positive semidefinite $d\times d$ matrices and $z$ is a vector in $\mathbb{R}^d$. The validity of (30) for a Gaussian density with arbitrary covariance is equivalent to the operator concavity of the matrix function

$$f(A) = \sqrt{A}\,Y\sqrt{A} \tag{31}$$

for an arbitrary positive semidefinite matrix $Y$. The following counterexample to this statement was communicated to us by Léonard Cadilhac. As the function $f$ takes values in the cone of positive semidefinite matrices, operator concavity is equivalent to operator monotonicity (see the proof of [6, Theorem V.2.5]).

Take, for instance, the two non-negative matrices

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \preceq B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad \text{and} \qquad Y = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.$$

Then $f(B) - f(A) = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}$ has negative determinant, hence is not positive semi-definite, and thus $f$ is not operator monotone or concave.
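The variational formula used at the beginning of the proof of Theorem 2 is also easy to test numerically. In the sketch below (our illustration), the entropy of the standard Laplace law, $h = 1 + \log 2$, is compared with the cross entropies $-\int f\log g$ for several Gaussian test densities $g$.

```python
import numpy as np

# Sketch: for the Laplace density f(x) = 0.5*exp(-|x|), h(X) = 1 + log 2,
# while -int f log g >= h(X) for every test density g, with equality only
# at g = f (here g ranges over a few centered Gaussians).
x = np.linspace(-40, 40, 800001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-np.abs(x))

print("h(X) exact:     ", 1 + np.log(2))
print("h(X) quadrature:", -np.sum(f * np.log(f)) * dx)
for s in [0.5, 1.0, np.sqrt(2.0), 3.0]:
    # log of the N(0, s^2) density, written analytically to avoid underflow
    log_g = -x**2 / (2 * s**2) - np.log(s * np.sqrt(2 * np.pi))
    print(f"s={s:.3f}: -int f log g =", -np.sum(f * log_g) * dx)
```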

3. Convexity of Fisher information
3.1. Warm-up: the Fisher information of independent products. Before showing the general argument which leads to Proposition 4, we present a short proof for the case of mixtures of dilates of a fixed distribution, which corresponds exactly to the Fisher information of a product of independent random variables. As this is a special case of Bobkov's [10, Proposition 15.2], we shall disregard rigorous integrability assumptions for the sake of simplicity of exposition.

Theorem 9. Let $W$ be a random variable with zero mean and smooth enough density and let $Y$ be an independent positive random variable. Then,

$$\frac{1}{\mathbb{E}[Y^2]\,\mathbb{E}[W^2]} \leq I(YW) \leq \mathbb{E}[Y^{-2}]\,I(W).$$

Proof. The first inequality is the Cramér–Rao lower bound, since $\mathrm{Var}(YW) = \mathbb{E}[Y^2]\,\mathbb{E}[W^2]$. Suppose that $W$ has density $e^{-V}$ with $V$ nice enough. Then, $YW$ has density

$$f(x) = \mathbb{E}\Big[\frac{1}{Y}\,e^{-V(x/Y)}\Big],$$

and thus, differentiating under the expectation and using the Cauchy–Schwarz inequality, we get

$$f'(x)^2 = \mathbb{E}\Big[\frac{1}{Y^2}V'(x/Y)e^{-V(x/Y)}\Big]^2 \leq \mathbb{E}\Big[\frac{1}{Y}e^{-V(x/Y)}\Big]\cdot\mathbb{E}\Big[\frac{1}{Y^3}V'(x/Y)^2e^{-V(x/Y)}\Big].$$

Dividing by $f(x)$, integrating in $x$ and applying Fubini's theorem, we conclude that $I(YW) \leq \mathbb{E}[Y^{-2}]\,I(W)$. □
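A quick quadrature check of Theorem 9 (our sketch, with a hypothetical two-point law for $Y$ and $W$ standard Gaussian, so that $I(W) = 1$ and $\mathbb{E}[W^2] = 1$):

```python
import numpy as np

# Sketch: W standard Gaussian, Y uniform on {0.6, 1.5}; the density of YW
# is an explicit two-component Gaussian mixture, and its Fisher information
# should satisfy 1/(E[Y^2] E[W^2]) <= I(YW) <= E[Y^{-2}] I(W).
x = np.linspace(-25, 25, 400001)
dx = x[1] - x[0]
ys = np.array([0.6, 1.5])

f = sum(0.5 * np.exp(-x**2 / (2*y**2)) / (y * np.sqrt(2*np.pi)) for y in ys)
fp = sum(0.5 * np.exp(-x**2 / (2*y**2)) / (y * np.sqrt(2*np.pi)) * (-x / y**2)
         for y in ys)

print("lower bound 1/E[Y^2]:", 1 / np.mean(ys**2))
print("I(YW) by quadrature: ", np.sum(fp**2 / f) * dx)
print("upper bound E[Y^-2]: ", np.mean(ys**-2.0))
```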
Proof of Proposition 4. We start by proving the two-point convexity of $\mathcal{I}$.

Proposition 10. The Fisher information matrix is operator convex on $\mathcal{F}_d$, that is, for every $\theta \in (0,1)$ and $f, g \in \mathcal{F}_d$,

$$\mathcal{I}\big(\theta f + (1-\theta)g\big) \preceq \theta\,\mathcal{I}(f) + (1-\theta)\,\mathcal{I}(g). \tag{35}$$

Proof. First we claim that the function $R : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^{d\times d}$, given by $R(x,\lambda) = \frac{xx^{\mathsf T}}{\lambda}$, is jointly operator convex. To prove this, we need to show that for every $\theta \in (0,1)$, $x, y \in \mathbb{R}^d$ and $\lambda, \mu > 0$,

$$\frac{\big(\theta x + (1-\theta)y\big)\big(\theta x + (1-\theta)y\big)^{\mathsf T}}{\theta\lambda + (1-\theta)\mu} \preceq \theta\,\frac{xx^{\mathsf T}}{\lambda} + (1-\theta)\,\frac{yy^{\mathsf T}}{\mu}.$$

After rearranging, this can be rewritten as

$$\frac{\theta(1-\theta)}{\lambda\mu\big(\theta\lambda+(1-\theta)\mu\big)}\,(\mu x - \lambda y)(\mu x - \lambda y)^{\mathsf T} \succeq 0,$$

which is true since $(\mu x - \lambda y)(\mu x - \lambda y)^{\mathsf T} \succeq 0$.
Since the Fisher information matrix can be written as

$$\mathcal{I}(f) = \int_{\mathbb{R}^d} \frac{\nabla f(x)\,\nabla f(x)^{\mathsf T}}{f(x)}\,dx = \int_{\mathbb{R}^d} R\big(\nabla f(x), f(x)\big)\,dx,$$

the conclusion follows by the convexity of $R$ and the linearity of $\nabla$ and $\int$. □

In order to derive the general Jensen inequality of Proposition 4 from Proposition 10, we will use a somewhat involved compactness argument that was invoked in [14, 10]. We point out that these intricacies arise since the space $\mathcal{F}_d$ of smooth densities in $\mathbb{R}^d$ is infinite-dimensional. As our argument shares similarities with Bobkov's, we shall only point out the necessary modifications which need to be implemented.
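The joint operator convexity of $R$ admits a quick randomised spot check (our sketch, in $d = 3$):

```python
import numpy as np

# Sketch: verify on random instances that
#   th*R(x,lam) + (1-th)*R(y,mu) - R(th*x+(1-th)*y, th*lam+(1-th)*mu)
# is positive semi-definite, where R(x, lam) = x x^T / lam.
rng = np.random.default_rng(2)
R = lambda x, l: np.outer(x, x) / l

for _ in range(10000):
    xv, yv = rng.standard_normal(3), rng.standard_normal(3)
    lam, mu = rng.uniform(0.1, 5.0, size=2)
    th = rng.uniform(0.0, 1.0)
    gap = th * R(xv, lam) + (1 - th) * R(yv, mu) \
          - R(th * xv + (1 - th) * yv, th * lam + (1 - th) * mu)
    assert np.linalg.eigvalsh(gap).min() > -1e-10   # PSD up to rounding
print("joint operator convexity verified on 10000 random instances")
```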
We start by proving the following technical lemma.

Lemma 11. Let $X$ and $\{X_k\}_{k\geq 1}$ be random vectors in $\mathbb{R}^d$ with smooth densities such that $X_k \to X$ in distribution as $k \to \infty$.

(i) If $\sup_k \|\mathcal{I}(X_k)\|_{op} < \infty$, then for every $x \in S^{d-1}$,

$$\langle \mathcal{I}(X)x, x\rangle \leq \liminf_{k\to\infty}\,\langle \mathcal{I}(X_k)x, x\rangle. \tag{39}$$

(ii) Moreover, we always have

$$\|\mathcal{I}(X)\|_{op} \leq \liminf_{k\to\infty}\,\|\mathcal{I}(X_k)\|_{op}. \tag{40}$$

Proof. We start with (39). It clearly suffices to show that any subsequence of $\{X_k\}$ has a further subsequence for which the conclusion holds. If $\|\mathcal{I}(X_k)\|_{op} \leq I < \infty$ for all $k \geq 1$, then $I(X_k) = \mathrm{tr}\,\mathcal{I}(X_k) \leq dI$. Write $f_k$ and $f$ for the densities of $X_k$ and $X$ respectively. Choose and fix any subsequence of $\{f_k\}$. By the proof of [10, Proposition 14.2], using the boundedness of the Fisher informations, there is a further subsequence, say $f_{k_j}$, for which $f_{k_j} \to f$ and $\nabla f_{k_j} \to \nabla f$ a.e. as $j \to \infty$. Therefore

$$\frac{\langle \nabla f_{k_j}(u), x\rangle^2}{f_{k_j}(u)} \longrightarrow \frac{\langle \nabla f(u), x\rangle^2}{f(u)}$$

for almost every $u$. Integration with respect to $u$, linearity and Fatou's lemma yield (39).
To prove (40), fix a subsequence $X_{k_j}$ for which the $\liminf$ in (40) is attained and without loss of generality assume that it is finite. Then the subsequence satisfies $\sup_j \|\mathcal{I}(X_{k_j})\|_{op} < \infty$ and thus, by (39), for every $x \in S^{d-1}$ we have

$$\langle \mathcal{I}(X)x, x\rangle \leq \liminf_{j\to\infty}\,\langle \mathcal{I}(X_{k_j})x, x\rangle \leq \liminf_{j\to\infty}\,\|\mathcal{I}(X_{k_j})\|_{op} = \liminf_{k\to\infty}\,\|\mathcal{I}(X_k)\|_{op}.$$

Taking a supremum over $x \in S^{d-1}$ concludes the proof, as $\mathcal{I}(X)$ is positive semi-definite. □

Equipped with the lower semi-continuity of $\mathcal{I}$, we proceed to the main part of the proof.
Proof of Proposition 4. Inequality (35) may be extended to arbitrary finite mixtures by induction, that is, if $p_1,\ldots,p_N \geq 0$ satisfy $\sum_{i=1}^N p_i = 1$, then

$$\mathcal{I}\Big(\sum_{i=1}^N p_ig_i\Big) \preceq \sum_{i=1}^N p_i\,\mathcal{I}(g_i) \tag{44}$$

for all $g_1,\ldots,g_N \in \mathcal{F}_d$. We need to extend (44) to arbitrary mixtures. For $I > 0$, we write

$$\mathcal{F}_d(I) = \big\{g \in \mathcal{F}_d :\ \|\mathcal{I}(g)\|_{op} \leq I\big\} \qquad \text{and} \qquad \mathcal{F}_d(\infty) = \bigcup_{I>0}\mathcal{F}_d(I).$$

By the assumption $\int_{\mathcal{F}_d}\|\mathcal{I}(g)\|_{op}\,d\pi(g) < \infty$, we deduce that the measure $\pi$ is supported on $\mathcal{F}_d(\infty)$. We shall prove that

$$\mathcal{I}\Big(\int_{\mathcal{F}_d} g\,d\pi(g)\Big) \preceq \int_{\mathcal{F}_d} \mathcal{I}(g)\,d\pi(g), \tag{45}$$

first for measures supported on some $\mathcal{F}_d(I)$. Fix $x \in S^{d-1}$ and $I \in \mathbb{N}$. By the operator convexity of the Fisher information matrix (Proposition 10), the functional

$$f \mapsto \langle \mathcal{I}(f)x, x\rangle \tag{46}$$

is convex and, by Lemma 11, lower semi-continuous on $\mathcal{F}_d(I)$. Again by operator convexity, the set $\mathcal{F}_d(I)$ is convex and by Lemma 11 it is closed. Now we may repeat exactly the same proof as in [10, Proposition 15.1, Steps 1-2], but working with the functional $\langle \mathcal{I}(f)x, x\rangle$ instead of the Fisher information $I(f)$, to obtain (45) if the measure $\pi$ is supported on $\mathcal{F}_d(I)$.

To derive inequality (45) in general, fix $I_0$ large enough such that $\pi(\mathcal{F}_d(I_0)) > \frac12$ and for $I \geq I_0$ write the inequality (45) for the restriction of $\pi$ to $\mathcal{F}_d(I)$, namely

$$\mathcal{I}\Big(\frac{1}{\pi(\mathcal{F}_d(I))}\int_{\mathcal{F}_d(I)} g\,d\pi(g)\Big) \preceq \frac{1}{\pi(\mathcal{F}_d(I))}\int_{\mathcal{F}_d(I)} \mathcal{I}(g)\,d\pi(g). \tag{47}$$

Denoting by $f_I$ the density on the left-hand side of the inequality, we have that $f_I$ converges weakly to the density $\int_{\mathcal{F}_d} g\,d\pi(g)$ as $I \to \infty$ and moreover (47) yields

$$\sup_{I\geq I_0}\|\mathcal{I}(f_I)\|_{op} \leq 2\int_{\mathcal{F}_d}\|\mathcal{I}(g)\|_{op}\,d\pi(g) < \infty.$$

Therefore, the assumptions of (39) are satisfied for $\{f_I\}_{I\geq I_0}$ and thus, for every $x \in S^{d-1}$,

$$\Big\langle \mathcal{I}\Big(\int_{\mathcal{F}_d} g\,d\pi(g)\Big)x, x\Big\rangle \leq \liminf_{I\to\infty}\,\langle \mathcal{I}(f_I)x, x\rangle \leq \liminf_{I\to\infty}\,\frac{1}{\pi(\mathcal{F}_d(I))}\int_{\mathcal{F}_d(I)}\langle \mathcal{I}(g)x, x\rangle\,d\pi(g) = \int_{\mathcal{F}_d}\langle \mathcal{I}(g)x, x\rangle\,d\pi(g),$$

and this concludes the proof. □

Proof of Corollary 6. In view of (14) and Proposition 4, we have

$$\mathcal{I}(X) \preceq \mathbb{E}\big[\mathcal{I}\big(N(0, YY^{\mathsf T})\big)\big] = \mathbb{E}\big[(YY^{\mathsf T})^{-1}\big],$$

since the Fisher information matrix of a Gaussian vector with covariance matrix $\Sigma$ is $\Sigma^{-1}$. □
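In one dimension, the two-point inequality (35) of Proposition 10 is easy to verify by quadrature; the sketch below (our illustration) mixes two Gaussian densities of different variances.

```python
import numpy as np

# Sketch: I(th*f + (1-th)*g) <= th*I(f) + (1-th)*I(g) for two Gaussian
# densities f, g (variances 1 and 9), with all integrals by quadrature.
x = np.linspace(-30, 30, 600001)
dx = x[1] - x[0]

def gaussian(s):
    d = np.exp(-x**2 / (2*s**2)) / (s * np.sqrt(2*np.pi))
    return d, d * (-x / s**2)              # density and its derivative

fisher = lambda dens, der: np.sum(der**2 / dens) * dx
(f, fp), (g, gp) = gaussian(1.0), gaussian(3.0)

for th in [0.25, 0.5, 0.75]:
    lhs = fisher(th*f + (1-th)*g, th*fp + (1-th)*gp)
    rhs = th * fisher(f, fp) + (1-th) * fisher(g, gp)
    print(f"theta={th}: I(mixture) = {lhs:.4f} <= {rhs:.4f}")
```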

4. CLT for the Fisher information matrix
Before delving into the proof of Theorem 7, we shall discuss some geometric preliminaries. Recall that a normed space $(V, \|\cdot\|_V)$ has Rademacher type $p \in [1,2]$ with constant $T \in (0,\infty)$ if for every $n \in \mathbb{N}$ and every $v_1,\ldots,v_n \in V$, we have

$$\mathbb{E}\Big\|\sum_{i=1}^n \varepsilon_iv_i\Big\|_V^p \leq T^p\sum_{i=1}^n\|v_i\|_V^p, \tag{51}$$

where $\varepsilon_1,\ldots,\varepsilon_n$ are i.i.d. symmetric random signs. The least constant $T$ for which this inequality holds will be denoted by $T_p(V)$. A standard symmetrisation argument (see, for instance, [24, Proposition 9.11]) shows that for any $n \in \mathbb{N}$ and any independent, mean-zero random vectors $V_1,\ldots,V_n$ in $V$,

$$\mathbb{E}\Big\|\sum_{i=1}^n V_i\Big\|_V^p \leq \big(2T_p(V)\big)^p\sum_{i=1}^n\mathbb{E}\|V_i\|_V^p. \tag{52}$$

We denote by $M_d(\mathbb{R})$ the vector space of all $d\times d$ matrices with real entries. We shall consider the $p$-Schatten trace class $S_p^d$ of $d\times d$ matrices. This is the normed space $S_p^d = (M_d(\mathbb{R}), \|\cdot\|_{S_p})$, where for a $d\times d$ real matrix $A$, we denote

$$\|A\|_{S_p} = \Big(\sum_{i=1}^d \sigma_i(A)^p\Big)^{1/p},$$

$\sigma_1(A) \geq \cdots \geq \sigma_d(A)$ being the singular values of $A$. A classical result of Tomczak-Jaegermann [32] (see also [2] for the exact values of the constants) asserts that

$$T_2(S_p^d) \leq \sqrt{p-1}, \qquad p \geq 2. \tag{53}$$

We shall use the following consequence of this.

Lemma 12. Fix $\delta \in (0,1]$ and $d \in \mathbb{N}$. For every $p \geq 2$,

$$T_{1+\delta}(S_p^d) \leq (p-1)^{\frac{\delta}{1+\delta}}. \tag{54}$$

Moreover, the analogous bound (55) holds in the range $1+\delta \leq p \leq 2$ with a constant depending only on $p$. Consequently, for all independent, mean-zero random matrices $W_1,\ldots,W_n$ in $M_d(\mathbb{R})$,

$$\mathbb{E}\Big\|\sum_{i=1}^n W_i\Big\|_{op}^{1+\delta} \leq (2e)^{1+\delta}\big(\log(d+1)\big)^{\delta}\sum_{i=1}^n\mathbb{E}\|W_i\|_{op}^{1+\delta}. \tag{56}$$

Proof. We first prove (54). Given a normed space $(X, \|\cdot\|_X)$ and $n \in \mathbb{N}$, consider the linear operator $T_n : \ell_p^n(X) \to L_p(\{-1,1\}^n; X)$ given by

$$T_n(v_1,\ldots,v_n)(\varepsilon) = \sum_{i=1}^n \varepsilon_iv_i,$$

where $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n) \in \{-1,1\}^n$. Then, it follows from (51) that $T_p(X) = \sup_n\|T_n\|$. In fact, if $X$ is finite-dimensional (like $S_p^d$) then it was shown in [20, Lemma 6.1] that the supremum is attained for some $n \leq \dim(X)(\dim(X)+1)/2$. Either way, by complex interpolation of vector-valued $L_p$ spaces (see [5, Section 5.6]), we thus deduce that

$$T_{1+\delta}(S_p^d) \leq T_1(S_p^d)^{\theta}\,T_2(S_p^d)^{1-\theta},$$

where $\frac{\theta}{1} + \frac{1-\theta}{2} = \frac{1}{1+\delta}$. The conclusion of (54) follows by plugging in the value $\theta = \frac{1-\delta}{1+\delta}$ and the result (53) of [32, 2]. The proof of inequality (55) is similar, interpolating between 1 and $p$. Finally, to deduce (56), note that for any $A \in M_d(\mathbb{R})$,

$$\|A\|_{op} \leq \|A\|_{S_p} \leq d^{1/p}\,\|A\|_{op}, \tag{57}$$

so that, combining (52) and (54) with $p = \log(d+1)+1$, for which $d^{1/p} \leq e$ and $T_{1+\delta}(S_p^d)^{1+\delta} \leq (\log(d+1))^{\delta}$, we derive the desired inequality. □
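The norm comparison (57) with $p = \log(d+1)+1$, which drives the logarithmic dimension dependence, can be checked directly (our sketch, with random Gaussian matrices):

```python
import numpy as np

# Sketch: with p = log(d+1) + 1, the Schatten p-norm is within a factor
# d^{1/p} <= e of the operator norm, uniformly in the dimension d.
rng = np.random.default_rng(3)
for d in [5, 50, 500]:
    p = np.log(d + 1) + 1
    A = rng.standard_normal((d, d))
    sv = np.linalg.svd(A, compute_uv=False)      # singular values of A
    ratio = (sv**p).sum() ** (1 / p) / sv.max()  # ||A||_{S_p} / ||A||_op
    print(f"d={d}: ratio = {ratio:.4f}  (must lie in [1, e={np.e:.4f}])")
```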
Equipped with these inequalities, we can now proceed to the main part of the proof.

Proof of Theorem 7. Throughout the proof we write $M = \mathbb{E}[YY^{\mathsf T}]$ and $\Sigma_a = \sum_{i=1}^n a_i^2\,Y_iY_i^{\mathsf T}$. Since $\mathbb{E}S_n = 0$ and $\mathrm{Cov}(S_n) = \sum_i a_i^2\,\mathbb{E}[Y_iY_i^{\mathsf T}] = M$, we have

$$\big\|\mathrm{Cov}(S_n)^{1/2}\,\mathcal{I}(S_n)\,\mathrm{Cov}(S_n)^{1/2} - \mathrm{I}_d\big\|_{op} \leq \|M\|_{op}\,\big\|\mathcal{I}(S_n) - M^{-1}\big\|_{op}, \tag{61}$$

using that for any PSD matrices $A, B$, $\|AB\|_{op} \leq \|A\|_{op}\|B\|_{op}$ and $\|A^{1/2}\|_{op} = \|A\|_{op}^{1/2}$. Now, $S_n$ is a Gaussian mixture itself and it satisfies the equality in distribution

$$S_n \stackrel{d}{=} \Sigma_a^{1/2}Z. \tag{62}$$

Corollary 6 yields the estimate

$$\mathcal{I}(S_n) \preceq \mathbb{E}\big[\Sigma_a^{-1}\big]. \tag{63}$$

Moreover, by the multivariate Cramér–Rao lower bound [7, Theorem 3.4.4], we have

$$M^{-1} = \mathrm{Cov}(S_n)^{-1} \preceq \mathcal{I}(S_n), \tag{64}$$

and thus the matrix in the right-hand side of (61) is positive semi-definite. Therefore, since $\|\cdot\|_{op}$ is increasing with respect to the matrix ordering on positive matrices, (63) and (64) yield

$$\big\|\mathcal{I}(S_n) - M^{-1}\big\|_{op} \leq \big\|\mathbb{E}[\Sigma_a^{-1}] - M^{-1}\big\|_{op}. \tag{65}$$

For $i = 1,\ldots,n$ consider the i.i.d. random matrices $W_i = Y_iY_i^{\mathsf T} - M$ and, for $\varepsilon > 0$, the event $E_\varepsilon = \{\|\Sigma_a - M\|_{op} \leq \varepsilon\}$; note that $\Sigma_a - M = \sum_i a_i^2W_i$ since $a$ is a unit vector. To bound the probability of the complement of $E_\varepsilon$, notice that, by Markov's inequality and (56),

$$\mathbb{P}(E_\varepsilon^c) \leq \varepsilon^{-(1+\delta)}\,\mathbb{E}\Big\|\sum_i a_i^2W_i\Big\|_{op}^{1+\delta} \leq \varepsilon^{-(1+\delta)}(2e)^{1+\delta}\big(\log(d+1)\big)^{\delta}\,\|a\|_{2+2\delta}^{2+2\delta}\,\mathbb{E}\|W_1\|_{op}^{1+\delta}. \tag{67}$$

Moreover, since $\mathbb{E}\|W_1\|_{op}^{1+\delta} \leq 2^{1+\delta}\,\mathbb{E}\|YY^{\mathsf T}\|_{op}^{1+\delta}$, we get the bound

$$\mathbb{P}(E_\varepsilon^c) \lesssim \varepsilon^{-(1+\delta)}\big(\log(d+1)\big)^{\delta}\,\|a\|_{2+2\delta}^{2+2\delta}. \tag{68}$$

Next, we write

$$\mathbb{E}[\Sigma_a^{-1}] - M^{-1} = \mathbb{E}\big[(\Sigma_a^{-1}-M^{-1})\mathbf{1}_{E_\varepsilon}\big] - M^{-1}\,\mathbb{P}(E_\varepsilon^c) + \mathbb{E}\big[\Sigma_a^{-1}\mathbf{1}_{E_\varepsilon^c}\big]$$

and use the triangle inequality to get

$$\big\|\mathbb{E}[\Sigma_a^{-1}] - M^{-1}\big\|_{op} \leq \big\|\mathbb{E}\big[(\Sigma_a^{-1}-M^{-1})\mathbf{1}_{E_\varepsilon}\big]\big\|_{op} + \|M^{-1}\|_{op}\,\mathbb{P}(E_\varepsilon^c) + \big\|\mathbb{E}\big[\Sigma_a^{-1}\mathbf{1}_{E_\varepsilon^c}\big]\big\|_{op}. \tag{69}$$

To control the first term in (69), we use Jensen's inequality for $\|\cdot\|_{op}$ to get

$$\big\|\mathbb{E}\big[(\Sigma_a^{-1}-M^{-1})\mathbf{1}_{E_\varepsilon}\big]\big\|_{op} \leq \mathbb{E}\big[\|\Sigma_a^{-1}(M-\Sigma_a)M^{-1}\|_{op}\mathbf{1}_{E_\varepsilon}\big] \leq \mathbb{E}\big[\|\Sigma_a^{-1}\|_{op}\,\|\Sigma_a - M\|_{op}\,\mathbf{1}_{E_\varepsilon}\big]\,\|M^{-1}\|_{op}, \tag{70}$$

where we used the identity $X^{-1} - Y^{-1} = X^{-1}(Y-X)Y^{-1}$ for positive matrices $X, Y$. Now, by the definition of the event $E_\varepsilon$, the middle factor is at most $\varepsilon$ and thus we derive the bound

$$\big\|\mathbb{E}\big[(\Sigma_a^{-1}-M^{-1})\mathbf{1}_{E_\varepsilon}\big]\big\|_{op} \leq \varepsilon\,\mathbb{E}\big[\|\Sigma_a^{-1}\|_{op}\big]\,\|M^{-1}\|_{op}. \tag{71}$$

Finally, the function $A \mapsto A^{-1}$ is operator convex on positive matrices (see [6, p. 117]), thus

$$\Sigma_a^{-1} \preceq \sum_i a_i^2\,(Y_iY_i^{\mathsf T})^{-1} \qquad \text{and} \qquad M^{-1} = \big(\mathbb{E}\,YY^{\mathsf T}\big)^{-1} \preceq \mathbb{E}\big[(YY^{\mathsf T})^{-1}\big]. \tag{72}$$

Applying the operator norm, taking expectations and using the triangle inequality, (71) and (72) bound the first term in (69) by $\varepsilon\,\big(\mathbb{E}\|(YY^{\mathsf T})^{-1}\|_{op}\big)^2$. In view of (67) and (72), the second term in (69) is bounded by

$$\|M^{-1}\|_{op}\,\mathbb{P}(E_\varepsilon^c) \lesssim \varepsilon^{-(1+\delta)}\big(\log(d+1)\big)^{\delta}\,\|a\|_{2+2\delta}^{2+2\delta}. \tag{73}$$

To bound the third term in (69), we use Jensen's and Hölder's inequalities to get

$$\big\|\mathbb{E}\big[\Sigma_a^{-1}\mathbf{1}_{E_\varepsilon^c}\big]\big\|_{op} \leq \mathbb{E}\big[\|\Sigma_a^{-1}\|_{op}\mathbf{1}_{E_\varepsilon^c}\big] \leq \big(\mathbb{E}\|\Sigma_a^{-1}\|_{op}^{1+\delta}\big)^{\frac{1}{1+\delta}}\,\mathbb{P}(E_\varepsilon^c)^{\frac{\delta}{1+\delta}}, \tag{74}$$

while, by (72) and the triangle inequality in $L_{1+\delta}$,

$$\big(\mathbb{E}\|\Sigma_a^{-1}\|_{op}^{1+\delta}\big)^{\frac{1}{1+\delta}} \leq \sum_i a_i^2\,\big(\mathbb{E}\|(YY^{\mathsf T})^{-1}\|_{op}^{1+\delta}\big)^{\frac{1}{1+\delta}} = \big(\mathbb{E}\|(YY^{\mathsf T})^{-1}\|_{op}^{1+\delta}\big)^{\frac{1}{1+\delta}}. \tag{76}$$

Combining this with (74) and (68), the third term in (69) is $\lesssim \varepsilon^{-\delta}(\log(d+1))^{\frac{\delta^2}{1+\delta}}\|a\|_{2+2\delta}^{2\delta}$. Plugging these bounds in (69) and recalling (61) and (65), we get that for every $\varepsilon > 0$,

$$\big\|\mathrm{Cov}(S_n)^{1/2}\,\mathcal{I}(S_n)\,\mathrm{Cov}(S_n)^{1/2} - \mathrm{I}_d\big\|_{op} \lesssim \varepsilon + \varepsilon^{-\delta}\big(\log(d+1)\big)^{\frac{\delta^2}{1+\delta}}\|a\|_{2+2\delta}^{2\delta} + \varepsilon^{-(1+\delta)}\big(\log(d+1)\big)^{\delta}\|a\|_{2+2\delta}^{2+2\delta}, \tag{77}$$

where the implicit constant depends only on the moments of $\|YY^{\mathsf T}\|_{op}$ and $\|(YY^{\mathsf T})^{-1}\|_{op}$. Finally, the (almost) optimal choice $\varepsilon = \|a\|_{2+2\delta}^{\frac{2\delta}{1+\delta}}$ yields the desired bound, since $\|a\|_{2+2\delta} \leq \|a\|_2 = 1$. □

Remark 13. We insisted on stating Theorem 7 as a bound for the operator norm of the (normalised) Fisher information matrix of $S_n$ but this is not necessary. An inspection of the proof reveals that given any norm $\|\cdot\|$ on $M_d(\mathbb{R})$ which is operator monotone, i.e.

$$0 \preceq A \preceq B \implies \|A\| \leq \|B\|, \tag{78}$$

and satisfies the ideal property

$$\forall\ A, B \in M_d(\mathbb{R}), \qquad \|AB\| \leq \|A\|_{op}\,\|B\|, \tag{79}$$

we can derive a bound of the form

$$\big\|\mathrm{Cov}(S_n)^{1/2}\,\mathcal{I}(S_n)\,\mathrm{Cov}(S_n)^{1/2} - \mathrm{I}_d\big\| \leq C\big(Y, \|\cdot\|\big)\,\|a\|_{2+2\delta}^{\frac{2\delta}{1+\delta}} \tag{80}$$

for random matrices $Y$ satisfying (17). The implicit constant depends on moments of $\|YY^{\mathsf T}\|$ and $\|YY^{\mathsf T}\|_{op}$ and on the Rademacher type $(1+\delta)$-constant of $\|\cdot\|$. These conditions are, in particular, satisfied for all $S_p^d$ norms and the corresponding type constant is subpolynomial in $d$ for $p \geq 1+\delta$.
Remark 14. As was already mentioned in the introduction, bounding the relative Fisher information of a random vector automatically implies bounds for the relative entropy in view of the Gaussian logarithmic Sobolev inequality [21]. However, bounds for the Fisher information matrix allow one to get better bounds for the relative entropy using more sophisticated functional inequalities which capture the whole spectrum of $\mathcal{I}(X)$. We refer to [19] for more on inequalities of this kind.
Finally, we present some examples of Gaussian mixtures related to conditions (17).
Examples. 1. Fix $p \in (0,2)$ and consider the random variable $X_p$ with density $c_pe^{-|x|^p}$, where $x \in \mathbb{R}$. It was shown in [18, Lemma 23] that $X_p$ can be expressed as

$$X_p = Y_pZ,$$

where $Y_p = (2V_p)^{-1/2}$ and $V_p$ has density proportional to $t^{-1/2}g_{p/2}(t)$, with $g_a$ denoting the density of the standard positive $a$-stable law. Since positive $a$-stable random variables have finite $\beta$-moments for all powers $\beta \in (-\infty, a)$, the moments $\mathbb{E}[Y_p^\beta]$ are finite exactly when $\beta > -p-1$. Hence the assumptions (17) are satisfied when

$$\min\{2\delta+2,\, -2\delta-2\} > -p-1, \tag{83}$$

or, equivalently, $\delta < \frac{p-1}{2}$. Therefore, Theorem 7 applies for these variables when $p \in (1,2)$.

2. It is well-known (see, for instance, [18, Lemma 23]) that for $p \in (0,2)$, the standard symmetric $p$-stable random variable $X_p$ can be written as

$$X_p = (2G_{p/2})^{1/2}Z,$$

where $G_{p/2}$ is a standard positive $\frac{p}{2}$-stable random variable. In this setting, the factor $(2G_{p/2})^{1/2}$ does not have a finite $2+2\delta$ moment for any value of $p$, so Theorem 7 does not apply.
