Threshold for Detecting High Dimensional Geometry in Anisotropic Random Geometric Graphs

In the anisotropic random geometric graph model, vertices correspond to points drawn from a high-dimensional Gaussian distribution and two vertices are connected if their distance is smaller than a specified threshold. We study when it is possible to hypothesis test between such a graph and an Erd\H{o}s-R\'enyi graph with the same edge probability. If $n$ is the number of vertices and $\alpha$ is the vector of eigenvalues, Eldan and Mikulincer show that detection is possible when $n^3 \gg (\|\alpha\|_2/\|\alpha\|_3)^6$ and impossible when $n^3 \ll (\|\alpha\|_2/\|\alpha\|_4)^4$. We show detection is impossible when $n^3 \ll (\|\alpha\|_2/\|\alpha\|_3)^6$, closing this gap and affirmatively resolving the conjecture of Eldan and Mikulincer.


Introduction
Extracting information from large graphs is a fundamental statistical task. Because many natural networks have underlying metric structure (for example, nearby proteins in a biological network are more likely to share function, and users with similar interests in a social network are more likely to interact), a central inference problem is to infer latent geometric structure in an observed graph. Moreover, with the proliferation of large data sets in the modern world, statistical inference is inherently high dimensional; see e.g. the survey [Joh06]. This motivates the study of inferring latent high dimensional geometry in a graph.
In this paper, we consider the hypothesis testing problem that determines if such inference is information-theoretically possible. This continues a line of work originated by Bubeck, Ding, Eldan, and Rácz [BDER16] and continued by Eldan and Mikulincer [EM19]. Our main contribution is a tight characterization of when detection is possible in the anisotropic setting introduced in [EM19].
Formally, given a graph G on [n], we wish to test between two hypotheses. The null hypothesis is that G is a sample from the Erdős-Rényi graph G(n, p), where each edge is present with independent probability p. The alternative hypothesis is that G is a sample from a random geometric graph (RGG), which we define precisely below. In such graphs, each vertex corresponds to a random point in some metric space and an edge exists between two vertices if their distance is smaller than a given threshold.
Arguably the most natural RGG is the isotropic model: each vertex $i \in [n]$ corresponds to an independent latent vector $X_i$ sampled from the Haar measure on the sphere $S^{d-1}$ or an isotropic d-dimensional Gaussian, and edge (i, j) is present if $\langle X_i, X_j \rangle \ge t_{p,d}$, where $t_{p,d}$ is chosen so that each edge is present with probability p. Let G(n, p, d) denote the isotropic RGG with spherical latent data; we fix $p \in (0, 1)$ and allow d to vary with n. The following seminal result of Bubeck, Ding, Eldan, and Rácz characterizes, for fixed $p \in (0, 1)$, when it is possible to test between G(n, p) and G(n, p, d). Let TV denote total variation distance.

Theorem 1 ([BDER16]). Let $p \in (0, 1)$ be fixed.
(a) If $d \gg n^3$, then $\mathrm{TV}(G(n, p, d), G(n, p)) \to 0$.
(b) If $d \ll n^3$, then $\mathrm{TV}(G(n, p, d), G(n, p)) \to 1$.
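To make the model concrete, here is a minimal simulation sketch of the isotropic RGG. This is our own illustration, not code from the paper: the function name is hypothetical, and the threshold is set empirically from the sampled inner products as a stand-in for the exact $t_{p,d}$.

```python
import numpy as np

def sample_isotropic_rgg(n, p, d, seed=None):
    """Sample an isotropic RGG: latent vectors uniform on S^{d-1};
    edge (i, j) is present iff <X_i, X_j> >= t.  The threshold is the
    empirical (1 - p)-quantile of the sampled inner products, a
    simulation convenience standing in for the exact t_{p,d}."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # project onto the sphere
    gram = X @ X.T                                 # matrix of inner products
    iu = np.triu_indices(n, k=1)
    t = np.quantile(gram[iu], 1 - p)               # empirical threshold
    A = (gram >= t).astype(int)
    np.fill_diagonal(A, 0)
    return A

A = sample_isotropic_rgg(n=200, p=0.3, d=50, seed=0)
print(A.sum() / (200 * 199))  # empirical edge density, approximately p
```

The empirical threshold makes the edge density close to p by construction; in the paper's definition each edge has probability exactly p.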
Each coordinate of the latent vector $X_i$ represents an attribute of vertex i. The isotropic model assumes that each attribute has the same influence on the network structure. In real networks, some attributes are more important than others: for example, in a social network, geographic location has a stronger influence on connectivity than preference of ice cream flavor. This motivates the following anisotropic generalization of the RGG, introduced in [EM19], in which attributes may have different weights.
By rotational invariance of the model, the assumption that $X_i$ has diagonal covariance is without loss of generality. Thus, all our results apply to the case of latent Gaussian vectors with arbitrary covariance.
Throughout, we fix $p \in (0, 1)$ and allow d, α to vary with n. The central question we study is: under what limiting behaviors of (n, d, α) can one statistically distinguish G(n, p, α) from G(n, p)? This question was first studied in [EM19], in which the following upper and lower bounds on detection were derived.
Theorem 2 ([EM19, Theorem 2]). Let $p \in (0, 1)$ be fixed. Then:
(a) if $n^3 \ll (\|\alpha\|_2/\|\alpha\|_4)^4$, then $\mathrm{TV}(G(n, p, \alpha), G(n, p)) \to 0$;
(b) if $n^3 \gg (\|\alpha\|_2/\|\alpha\|_3)^6$, then $\mathrm{TV}(G(n, p, \alpha), G(n, p)) \to 1$.

When $\alpha = \mathbf{1}_d$, this recovers the sharp $d \asymp n^3$ detection threshold in the isotropic model. However, for general α there is a polynomially sized gap between the upper and lower bounds of Theorem 2. [EM19, Conjecture 1] conjectures that the hypothesis of part (a) can be weakened to $n^3 \ll (\|\alpha\|_2/\|\alpha\|_3)^6$, i.e. that the detection threshold is $n^3 \asymp (\|\alpha\|_2/\|\alpha\|_3)^6$. The main result of this paper affirmatively resolves this conjecture.

Theorem 3. Let $p \in (0, 1)$ be fixed. If $n^3 \ll (\|\alpha\|_2/\|\alpha\|_3)^6$, then $\mathrm{TV}(G(n, p, \alpha), G(n, p)) \to 0$.
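To see the polynomial gap numerically, the following sketch (our own illustration; the two-scale spectrum is a hypothetical example, not one from [EM19]) compares the two candidate effective dimensions for a spectrum with one eigenvalue equal to 1 and d eigenvalues equal to $d^{-1/3}$.

```python
import numpy as np

def effective_dimensions(alpha):
    """Return the two candidate effective dimensions appearing in
    Theorem 2: (||a||_2/||a||_4)^4 from part (a) and
    (||a||_2/||a||_3)^6 from part (b)."""
    l2 = np.linalg.norm(alpha, 2)
    l3 = np.linalg.norm(alpha, 3)
    l4 = np.linalg.norm(alpha, 4)
    return (l2 / l4) ** 4, (l2 / l3) ** 6

# Hypothetical two-scale spectrum: one eigenvalue 1, d eigenvalues d^(-1/3).
for d in (10**3, 10**5):
    alpha = np.concatenate(([1.0], np.full(d, d ** (-1 / 3))))
    r4, r3 = effective_dimensions(alpha)
    # r4 grows like d^(2/3) while r3 grows like d: a polynomial gap.
    print(d, round(r4), round(r3))
```

For this spectrum the bound of part (a) is polynomially weaker than the conjectured (and, by Theorem 3, correct) threshold.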
In light of Theorem 1, this result can be interpreted as meaning that, for the task of detecting geometry, the effective dimension of the anisotropic RGG is $(\|\alpha\|_2/\|\alpha\|_3)^6$.
[EM19, Conjecture 1] is motivated by the fact that Theorem 2(b) is witnessed by the signed triangles statistic, which is also the optimal statistic witnessing Theorem 1(b). A heuristic reason to expect that this statistic is optimal is that it is the lowest-degree nontrivial term in the Fourier expansion of the likelihood ratio dG(n, p, α)/dG(n, p). So, Theorem 3 confirms this heuristic and the optimality of the signed triangles statistic in the anisotropic setting.
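For concreteness, a minimal sketch of the signed triangles statistic (our own illustration; the centering of each edge indicator at p is the standard definition from this literature):

```python
import numpy as np

def signed_triangles(A, p):
    """Sum over vertex triples {i,j,k} of
    (A_ij - p)(A_jk - p)(A_ki - p), computed as tr(B^3)/6 for the
    centered adjacency matrix B with zero diagonal."""
    B = A.astype(float) - p
    np.fill_diagonal(B, 0.0)
    return np.trace(B @ B @ B) / 6.0

# Under the null G(n, p), the statistic has mean zero.
rng = np.random.default_rng(0)
n, p = 300, 0.5
U = rng.random((n, n))
A = np.triu((U < p).astype(int), k=1)
A = A + A.T
print(signed_triangles(A, p))  # small relative to its null fluctuations
```

With p = 0, the statistic reduces to the raw triangle count; the centering at p is what removes the dominant noise term and makes the statistic sensitive to geometry.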

Central Limit Theorem for Anisotropic Wishart Matrices
Closely related to the anisotropic RGG is the following matrix of inner products generating G(n, p, α). A sample of G(n, p, α) can be obtained by thresholding each entry of this matrix at $t_{p,\alpha}$.
Definition 1.2 (Diagonal-removed anisotropic Wishart matrix). Let $W \sim W(n, \alpha)$ be the random n × n matrix generated as follows. Sample $X \in \mathbb{R}^{d \times n}$ with i.i.d. columns sampled from $N(0, D_\alpha)$, and set $W = \frac{1}{\|\alpha\|_2}\left(X^\top X - \mathrm{diag}(X^\top X)\right)$.

For fixed n, if $d \to \infty$ and $\|\alpha\|_\infty/\|\alpha\|_2 \to 0$, by the multidimensional CLT, W(n, α) converges to the following matrix of Gaussians.
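A simulation sketch of Definition 1.2 (our own illustration; the $1/\|\alpha\|_2$ normalization, which makes each off-diagonal entry have variance exactly 1, is our reading of the definition):

```python
import numpy as np

def sample_wishart(n, alpha, seed=None):
    """W ~ W(n, alpha): columns of X are i.i.d. N(0, D_alpha); return
    the Gram matrix X^T X with its diagonal zeroed, scaled by
    1/||alpha||_2 so each off-diagonal entry has unit variance."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    # Columns of X are N(0, D_alpha): scale each row by sqrt(alpha_k).
    X = rng.standard_normal((len(alpha), n)) * np.sqrt(alpha)[:, None]
    W = X.T @ X
    np.fill_diagonal(W, 0.0)
    return W / np.linalg.norm(alpha)

W = sample_wishart(n=100, alpha=np.ones(2000), seed=1)
offdiag = W[np.triu_indices(100, k=1)]
print(offdiag.std())  # close to 1, consistent with the CLT toward M(n)
```

Since $\mathrm{Var}\langle X_i, X_j\rangle = \sum_k \alpha_k^2 = \|\alpha\|_2^2$ for $i \ne j$, this normalization matches the unit-variance entries of M(n).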
Definition 1.3. Let $M \sim M(n)$ be a symmetric random n × n matrix with zero diagonal and i.i.d. standard Gaussians above the diagonal.
If we now allow d, α to vary with n, a natural question is: for which (n, d, α) can one test between W(n, α) and M(n)? This can be regarded as the random matrix analog of the question of detecting geometry in random graphs. Eldan and Mikulincer obtain the following detection lower bound.

Theorem 4 ([EM19]). If $n^3 \ll (\|\alpha\|_2/\|\alpha\|_4)^4$, then $\mathrm{KL}(W(n, \alpha) \,\|\, M(n)) \to 0$.
Of course, by Pinsker's inequality this also implies that W(n, α) and M(n) converge in total variation. Furthermore, the statistic $\theta(M) = \mathrm{tr}(M^3)$ distinguishes W(n, α) and M(n) to total variation distance 1 − o(1) when $n^3 \gg (\|\alpha\|_2/\|\alpha\|_3)^6$, which can be verified by computing the mean and variance of this statistic under the two hypotheses.
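A quick sketch of this test in a regime where detection is easy (our own illustration; the parameters are arbitrary, chosen so that $n^3$ far exceeds the effective dimension):

```python
import numpy as np

def trace_cubed(M):
    """The distinguishing statistic theta(M) = tr(M^3)."""
    return np.trace(M @ M @ M)

rng = np.random.default_rng(2)
n, d = 80, 200
alpha = np.ones(d)

# Null: symmetric Gaussian matrix with zero diagonal (Definition 1.3).
Z = np.triu(rng.standard_normal((n, n)), k=1)
M = Z + Z.T

# Alternative: diagonal-removed Wishart with unit-variance entries.
X = rng.standard_normal((d, n)) * np.sqrt(alpha)[:, None]
W = X.T @ X
np.fill_diagonal(W, 0.0)
W /= np.linalg.norm(alpha)

# Here n^3 >> d, so the means separate by far more than the fluctuations:
# E tr(W^3) is of order n^3 (||a||_3/||a||_2)^3, while E tr(M^3) = 0.
print(trace_cubed(M), trace_cubed(W))
```

Each triple (i, j, k) contributes $\mathbb{E}[W_{ij}W_{jk}W_{ki}] = (\|\alpha\|_3/\|\alpha\|_2)^3$ under the alternative, which is the source of the $(\|\alpha\|_2/\|\alpha\|_3)^6$ scaling after comparing to the null standard deviation of order $n^{3/2}$.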
Similarly to above, these upper and lower bounds match for $\alpha = \mathbf{1}_d$, but in general there is a polynomially sized gap between them. We prove the following result, which identifies the sharp threshold for this detection task by improving the lower bound in Theorem 4; it can be regarded as a tight CLT for anisotropic Wishart matrices.

Theorem 5. If $n^3 \ll (\|\alpha\|_2/\|\alpha\|_3)^6$, then $\mathrm{TV}(W(n, \alpha), M(n)) \to 0$.

Techniques
Theorem 3 follows from Theorem 5 by the thresholding idea introduced in [BDER16]. Note that G(n, p, α) and G(n, p) are entry-wise thresholdings of W(n, α) and M(n). Thus TV(G(n, p, α), G(n, p)) is upper bounded by TV(W(n, α), M(n)) plus a small error term from the difference of the thresholds.
Our main technical contributions are in the proof of Theorem 5. We divide the entries of α into large coordinates $\alpha_+$ and small coordinates $\alpha_-$, each accounting for a constant fraction of its $\ell_2$ mass. We note (Lemma 2.4) that $(\|\alpha_-\|_2/\|\alpha_-\|_4)^4$ is of the same order as $(\|\alpha\|_2/\|\alpha\|_3)^6$, so Theorem 4 is sufficient to show that $W(n, \alpha_-)$ converges in total variation to M(n).
It remains to control the contributions of the large coordinates $\alpha_+$. We consider a procedure (Lemma 2.3) where we add the coordinates of $\alpha_+$ to $\alpha_-$ one by one. Note that the effect of this operation on $W \sim W(n, \alpha)$ is to add an independent rank-one spike and scale down by a constant. By a data processing argument, the increase in TV(W(n, α), M(n)) from one step of this procedure is bounded by TV(M(n, u), M(n)), where M(n, u) is a linear combination of M(n) and an independent rank-one Gaussian spike (see Definition 2.1).
This last quantity is bounded (Lemma 2.2) using the Ingster-Suslina χ² method, as M(n, u) is a mixture of shifted Gaussian matrices parametrized by the spike. This is done after conditioning on a high probability event under which the tails of the χ² divergence are integrable. The resulting χ² divergence is an expectation over two independent copies of the Gaussian spike, which is bounded by hypercontractivity estimates.
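In our notation, the Ingster-Suslina identity behind this computation is the following standard fact: for a mixture $P = \mathbb{E}_{g \sim \mu}[P_g]$ and a null measure $Q$, Fubini's theorem gives

```latex
\chi^2\big(\mathbb{E}_{g\sim\mu}[P_g],\, Q\big) + 1
  \;=\; \mathbb{E}_{g,h\sim\mu\otimes\mu}
        \int \frac{dP_g}{dQ}\,\frac{dP_h}{dQ}\, dQ ,
```

so the χ² divergence becomes an expectation over two independent copies g, h of the spike. In the Gaussian location case $P_g = N(\theta_g, I)$ and $Q = N(0, I)$, the inner integral evaluates to $\exp(\langle \theta_g, \theta_h \rangle)$.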

Related Work
There is a long history of work on low-dimensional random geometric graphs; see e.g. [Pen03]. The study of high-dimensional random geometric graphs began in [DGLU11], which showed that the isotropic model G(n, p, d) converges in total variation to G(n, p) as d → ∞ for n fixed, and moreover that their clique numbers converge if $d \gg \log^3 n$. [BDER16] showed Theorem 1, that the threshold for convergence of G(n, p, d) and G(n, p) with p fixed is $d \asymp n^3$. They conjectured that if p = o(1), convergence occurs at smaller d; in particular, for p = c/n they conjectured the threshold $d \asymp \log^3 n$. [BBN20] proved that convergence occurs when $d = \tilde{\omega}(\min\{n^3 p,\, n^{7/2} p^2\})$, meaning the threshold does decrease with p. Recently, [LMSY22] proved that for p = c/n (c ≥ 1), convergence occurs when $d \gtrsim \log^{36} n$, resolving the conjecture of [BDER16] up to polylog factors. In a different direction, [LR21] obtain detection upper and lower bounds for soft random geometric graphs, wherein the inner products $\langle X_i, X_j \rangle$ determine the probability of edge (i, j) being present.
There is also a growing literature on CLTs for random matrices. [CM07] proved a general multidimensional CLT using Stein's method. [JL15] and [BDER16] concurrently showed that $W(n, d) := W(n, \mathbf{1}_d)$ converges in total variation to M(n) if $d \gg n^3$. [BG18] generalized this result to arbitrary log-concave entry distributions, showing that the random matrix

Notation and Preliminaries
We adopt standard asymptotic notations: $f \gg g$ means that $f/g \to \infty$ and $f \gtrsim g$ means that $f \ge cg$ for an absolute constant $c > 0$. Throughout, c, C > 0 denote universal constants that may change from line to line. We use TV, KL, and χ² to denote total variation distance, Kullback-Leibler divergence, and chi-square divergence. That is, for measures ν, µ with ν absolutely continuous with respect to µ,
$$\mathrm{TV}(\nu, \mu) = \sup_{A} |\nu(A) - \mu(A)|, \qquad \mathrm{KL}(\nu \,\|\, \mu) = \int \log\frac{d\nu}{d\mu}\, d\nu, \qquad \chi^2(\nu, \mu) = \int \left(\frac{d\nu}{d\mu} - 1\right)^2 d\mu.$$
We recall that TV satisfies the triangle inequality and the data processing inequality $\mathrm{TV}(K(\nu), K(\mu)) \le \mathrm{TV}(\nu, \mu)$ for any Markov kernel K. We also recall the Cauchy-Schwarz inequality $4\,\mathrm{TV}(\nu, \mu)^2 \le \chi^2(\nu, \mu)$.
For $g \in \mathbb{R}^n$, let $\Delta(g) = gg^\top - \mathrm{diag}(gg^\top)$. We introduce the following random matrix, consisting of a linear combination of a sample from M(n) and a rank-one Gaussian spike (with diagonal removed).
We defer the proof of the following lemma to Section 3. Using this lemma, we prove Theorems 3 and 5.
Remark 2.5. We conjecture that Theorem 5 remains true if the diagonal is not removed, i.e. if W(n, α) and M(n) are replaced by the law of $\frac{1}{\|\alpha\|_2}\left(X^\top X - \|\alpha\|_1 I_n\right)$, where X is as in Definition 1.2, and the law $M^*(n)$ of a GOE matrix. With minor modifications, the proof of Lemma 2.2 in the next section generalizes if the diagonal is not removed, i.e. if $\Delta(g)$ and M(n) are replaced by $gg^\top - I_n$ and $M^*(n)$. So, if Theorem 4 holds without diagonal removal, the above proof can be easily modified to conclude Theorem 5 without diagonal removal. The difficulty is that the entropy chain rule argument used to prove Theorem 4 requires the diagonal to be removed.

Detection Lower Bound for Anisotropic RGGs
The proof of Theorem 3 is identical to that of Theorem 2(a) (Theorem 2(b) in [EM19]), using Theorem 5 in place of Theorem 4.

Proof of TV Bound for Spiked Gaussian Matrix
In this section we will prove Lemma 2.2. Let M(n, u, g) be the random matrix M generated by (1) for $g \in \mathbb{R}^n$ fixed and $M' \sim M(n)$. Thus M(n, u) is a mixture of the random matrices M(n, u, g) over the latent randomness $g \sim N(0, I_n)$. Further, for (always measurable and high probability) $S \subseteq \mathbb{R}^n$, let $\mu_S$ be the law of $g \sim N(0, I_n)$ conditioned on $g \in S$. Let M(n, u, S) be the law of M generated by (1) for $g \sim \mu_S$ and $M' \sim M(n)$. This can be regarded as M(n, u) conditioned on $g \in S$, and as a mixture of the M(n, u, g) over $g \sim \mu_S$.
We begin with the following series of estimates. Let $S \subseteq \mathbb{R}^n$ be a set we will specify later.
$$\mathrm{TV}(M(n, u), M(n)) \le \mathrm{TV}(M(n, u), M(n, u, S)) + \mathrm{TV}(M(n, u, S), M(n)) \le \mathbb{P}_{g \sim N(0, I_n)}(g \notin S) + \mathrm{TV}(M(n, u, S), M(n)) \quad (6)$$
$$\le \mathbb{P}_{g \sim N(0, I_n)}(g \notin S) + \tfrac{1}{2}\sqrt{\chi^2(M(n, u, S), M(n))}. \quad (7)$$
The two estimates leading to (6) are by the TV triangle inequality and the data processing inequality. The estimate leading to (7) is by Cauchy-Schwarz. These estimates are the starting point of the so-called truncated Ingster-Suslina χ² method. It is necessary to condition on an appropriate S in (6) so that the tails of the χ² divergence in (7) are integrable. The following lemma evaluates the inner expectation in (7).
Lemma 3.4. The following estimates hold for all a ≥ 1.

Proof. The first two claims follow by the symmetry of S(a) under the map $(g_1, \ldots, g_n) \mapsto (x_1 g_1, \ldots, x_n g_n)$ for any $x \in \{-1, 1\}^n$. In the rest of this proof, let E denote expectation with respect to $g, h \sim N(0, I_n)$. By straightforward calculation, we estimate the discrepancy caused by changing the measure from $\mu_{S(a)}$ to $N(0, I_n)$ using the following generic bound for any (g, h)-measurable ξ. For $\xi = X^2$, by Lemma 3.2(a), $\mathbb{E}\xi^2 = \mathbb{E}X^4 \lesssim (\mathbb{E}X^2)^2 \lesssim n^4$, which proves the third conclusion. For $\xi = X^3$, we similarly have $\mathbb{E}\xi^2 \lesssim n^6$, proving the fifth conclusion. For $\xi = Y$, a straightforward calculation shows $\mathbb{E}\xi^2 \lesssim n^4$, proving the fourth conclusion.

Lemma 3.5. For $a \ge 1$ and integers $i, j \ge 0$,

Proof. For all $g \in S(a)$, and similarly for h. Thus, if $g, h \in S(a)$, then $|Y| \le (2an)^2$, which implies the claimed bound. If i = 0, this implies the result. Otherwise assume $i \ge 1$. By symmetry of the set S(a), the distribution of X under $g, h \sim \mu_{S(a)}$ is the same as that of $X = \sum_{1 \le i < j \le n} x_i x_j g_i g_j h_i h_j$, where $x \sim \mathrm{unif}(\{-1, 1\}^n)$ and $g, h \sim \mu_{S(a)}$ are independent. By Lemma 3.2(a), conditioned on g, h, we obtain a bound on $\mathbb{E}_{g, h \sim \mu_{S(a)}}$. Recalling (8) and (9), this implies the result.
where $X \in \mathbb{R}^{d \times n}$ has i.i.d. entries from a log-concave measure, converges to M(n) if $d/\log^2 d \gg n^3$. [RR19] refined the result of [JL15, BDER16] by computing the limiting value of TV(W(n, d), M(n)) if n, d → ∞ with $d/n^3 \to c$. [CW19] showed a countable sequence of phase transitions for the Wishart ensemble W(n, d): for each $k \in \mathbb{N}$, if $n^{k+3} \gg d^{k+1}$, they show that W(n, d) converges to an explicit density $f_k$. CLTs have been shown for Wishart tensors [Mik22] and Wishart matrices with arbitrary deleted entries [BBH21]. Finally, [NZ21] considers Wishart matrices $W = \sqrt{d}\,(d^{-1} X^\top X - I_n)$ where the columns of X are drawn i.i.d. from $N(0, \Sigma)$ for $\Sigma \in \mathbb{R}^{d \times d}$ of the form $\Sigma_{i,j} = s(i - j)$, where $s : \mathbb{Z} \to \mathbb{R}$ is a covariance function with s(0) = 1. They show that W converges in Wasserstein distance to a Gaussian matrix if $n^3 \ll d$ and $s \in \ell^{4/3}(\mathbb{Z})$, and under various conditions if s is the correlation function of a fractional Brownian noise.