On pairwise interaction multivariate Pareto models

The rich class of multivariate Pareto distributions forms the basis of recently introduced extremal graphical models. However, most existing literature on the topic is focused on the popular parametric family of H\"usler--Reiss distributions. It is shown that the H\"usler--Reiss family is in fact the only continuous multivariate Pareto model that exhibits the structure of a pairwise interaction model, justifying its use in many high-dimensional problems. Along the way, useful insight is obtained concerning a certain class of distributions considered in arXiv:2303.04258, a result of independent interest.


Introduction and main result
Multivariate Pareto distributions play a central role in tail dependence modeling and inference as the only non-degenerate limit laws that can arise from multivariate threshold exceedances.They are defined as the class of possible non-degenerate weak limits of u −1 X | X ∞ > u, as u → ∞, for random vectors X with unit Pareto margins.As such, they are usually considered to perfectly describe the possible tail dependence strucures of multivariate data.Originally introduced in Rootzén and Tajvidi (2006), they form the basis of multivariate peaks-over-threshold inference (Rootzén et al., 2018;Kiriliouk et al., 2019).Other than a constraint on their support and a certain marginal standardization arising from their definition, multivariate Pareto distributions essentially comprise all homogeneous distributions.In the absolutely continuous case, which is the focus of this note, they can be exactly defined as follows (cf.Engelke and Volgushev, 2022).The class of multivariate Pareto distributions is of course very rich.It is equivalent to the class of all extreme value copulas, or to all multivariate max-stale distributions (with fixed margins).As such, parametric subfamilies of multivariate Pareto distributions are derived from the corresponding multivariate max-stable models.The most popular such model is that associated to the family of Hüsler-Reiss distributions Hüsler and Reiss (1989), hereafter termed Hüsler-Reiss distributions themselves for convenience (rather than Hüsler-Reiss multivariate Pareto).
In the example below and in the rest of this note, let 1 denote a vector each element of which is 1, and the dimension of which will be clear from the context.For a vector y := (y 1 , . . ., y d ), we define log y as the elementwise (natural base) logarithm (log y 1 , . . ., log y d ).The space of symmetric d × d matrices is denotes by S d×d , and S d×d 1 ⊂ S d×d represents those matrices which have 1 in their kernel (i.e., that have zero row and column sums).Finally, S d×d 1,+ ⊂ S d×d 1 represents the matrices which, in addition, are positive semi-definite with rank d − 1.
Example 1 (Hüsler-Reiss distribution).The multivariate Pareto distributed random vector Y has a Hüsler-Reiss distribution if for a matrix Θ ∈ S d×d 1,+ , its density f is given by where µ HR (Θ) : , and Θ + denotes the Moore-Penrose pseudoinverse of Θ.While traditionally parametrized by the variogram matrix Γ, it has recently been suggested that the Hüsler-Reiss family can be elegantly parametrized by the Hüsler-Reiss precision matrix Θ (Hentschel et al., 2022).By Proposition 3.4 of that paper, these two matrices are uniquely determined by each other through a homeomorphic mapping between S d×d 1,+ and the space of symmetric, strictly conditionally negative definite matrices, to which Γ belongs.We follow Hentschel et al. (2022) and shall refer to the Hüsler-Reiss distribution with precision matrix Θ.
Hüsler-Reiss distributions enjoy a nice connection to recently introduced extremal graphical models (Engelke and Hitz, 2020).The authors of that paper declare that two components Y i and Y j of a multivariate Pareto random vector are conditionally independent given the other variables (Y k ) k / ∈{i,j} in the extremal sense if, roughly speaking, the density of Y admits the factorization where f i (respectively f j ) does not depend on its ith (respectively jth) argument; see their Proposition 1.While this would be equivalent to the usual notion of conditional independence were Y supported on a product space, this is not the case for multivariate Pareto distributions, which are supported on L.
They then define an extremal graphical model as a multivariate Pareto distribution satisfying a Markov property (with respect to this weaker notion of conditional independence) on a given undirected graph.
For the relevant notions of graphical modeling, the reader is referred to Engelke and Hitz (2020) or to Maathuis et al. (2019).
It is straightforward that if Y is Hüsler-Reiss distributed with precision matrix Θ, then Y i and Y j are conditionally independent given the other variables (in the extremal sense of Engelke and Hitz (2020)) if and only if Θ ij = 0.This forms an example of a pairwise interaction model: an exponential family of multivariate distributions where the (i, j)th element of a parameter matrix fully governs the dependence between the ith and jth variables.
Definition 2. Let q ∈ N. A (curved) exponential family of probability distributions supported on a common set Y ⊆ R d and indexed by a parameter space Ω ⊆ (R q ) d × S d×d will be called a pairwise interaction model if: (PI1) it corresponds to a family of Lebesgue densities for some measurable functions S i : R → R q and measurable, non-constant functions T i : R → R, and (PI2) for every pair (i, j), i = j, there exists at least one parameter (µ, Θ) ∈ Ω such that Θ ij = 0. Remark.
1.Only Lebesgue densities are considered for the purpose of the present note, but Definition 2 can be readily adapted to pairwise interaction models dominated by any base (product) measure.
2. Following standard definitions of exponential families, a so-called carrying density d i=1 h i (y i ) could be included as a factor in the functions f µ,Θ .However, since we do not require the family to be regular (in particular, the Hüsler-Reiss distributions form a curved exponential family), the carrying density can be absorbed in the marginal terms S i (y i ) by possibly adding a dimension to each µ i and reducing the domain Y.The reader is referred to standard texts such as Brown (1986) for notions of exponential families.
3. The property (PI2) is a technical requirement for the characterization of all pairwise interaction multivariate Pareto models in Lemma 1 below.It is very minor: it holds if at least one possible value of the parameter Θ has no zeros, i.e., if the model allows for the simultaneous presence of all pairwise interactions.Upon inspection of the proof of Lemma 1, it could even be further relaxed.For instance, consider the graph G which has an edge between i and j if and only if Θ ij = 0 for some (µ, Θ) ∈ Ω.
The property (PI2) states that G is connected, but Lemma 1 holds under the mere requirement that every edge in G is part of a cycle of odd length (e.g., a triangle).
Pairwise interaction models are ubiquitous in dependence modeling.When the common support Y is a product space, they are particularly elegant examples of undirected graphical models, where the conditional independence graph contains the edge (i, j) if and only if Θ ij = 0.Each variable then typically satisfies a generalized linear model conditionally on the other variables, with the regression coefficients being extracted from the parameter matrix.Structure learning and parameter inference on a given graph structure can be efficiently carried out via (possibly penalized) likelihood, but also regression (Yang et al., 2015) or score matching (Lin et al., 2016) based methods.Gaussian and Gaussian copula models (Liu et al., 2009) as well as the continuous square root graphical model (Inouye et al., 2016) are examples of pairwise interaction graphical models used for high-dimensional dependence modeling in Euclidean settings.Klein et al. (2020) introduce pairwise interaction graphical models for multivariate angular data.Discrete analogues include the Ising model and more general log-linear interaction models (Darroch et al., 1980) as well as the discrete square root graphical model (Inouye et al., 2016).
Even in a pairwise interaction model without a product space support (such as multivariate Pareto distributions), the (i, j)th element of the parameter matrix Θ is zero if and only if the density admits a factorization as in Equation (1).Yu et al. (2022) show that score matching can be adapted to perform model selection and inference for the interaction parameter Θ in such a setting.
The Hüsler-Reiss family has been the focus of many recent papers on high-dimensional modeling and inference for tail dependence, especially in relation to extremal graphical models (see, e.g., Asenova et al., 2021;Röttger et al., 2023;Engelke et al., 2022;Hentschel et al., 2022;Lederer and Oesting, 2023;Röttger et al., 2023).In fact, it is the only pairwise interaction family of multivariate Pareto distributions that can be found in the current literature, despite the fact that such families are naturally related to the density factorization property which underlies extremal graphical models.A natural question is whether there exists another such family.The main result of this note answers that question.
Theorem 1.Let P be a family of absolutely continuous multivariate Pareto distributions in dimension d ≥ 3 that forms a pairwise interaction model.Then P is a subset of the Hüsler-Reiss family.
Theorem 1 justifies the focus on the Hüsler-Reiss family in the recent literature on graphical extremes.It would be tempting to extend some of the existing models in the aforementioned papers into more complicated structures while retaining the practicality of estimating pairwise interaction models, but this is in fact not possible.
In particular, in order to develop an efficient score matching algorithm, Lederer and Oesting (2023) introduce the class of functions as surrogates of the densities of Hüsler-Reiss distributions.As the authors rightfully point out, F genHR strictly generalizes the class of Hüsler-Reiss densities; some of the functions therein are not even integrable, hence cannot be normalized to densities.While there are functions in F genHR which are in between, i.e., integrable but not Hüsler-Reiss densities, Theorem 1 guarantees that none of them corresponds to a multivariate Pareto distribution.In fact, a corollary of the proof of Theorem 1 is a full characterization of the functions f genHR µ,Θ ∈ F genHR based on the values of µ and Θ; see Lemma 2 below.
Theorem 1 is a direct consequence of the following two results, which are presented here because they are of independent interest.Their proofs are the object of Section 2.
Lemma 1.Let P be a family of absolutely continuous multivariate Pareto distributions in dimension d ≥ 3 that forms a pairwise interaction model and let F be the corresponding class of densities.Then F ⊆ F genHR .
Lemma 2. The functions f genHR µ,Θ in F genHR can be categorized as follows.
is the density of a Hüsler-Reiss distribution. (ii is not integrable on L. Remark.Part (i) in Lemma 2 was already stated and proved by Lederer and Oesting (2023), and is in fact the main justification for working with F genHR .The contribution here is in giving a complete picture of the functions in this class.In particular, there are no multivariate Pareto densities in F genHR other than Hüsler-Reiss densities.

Proofs
Throughout the proofs, for a vector y ∈ R d , we generically define y i as the ith entry of y and y \i ∈ R d−1 as the subvector obtained by removing its ith entry.The same conventions are used for indexing the rows and columns of a matrix.

Proof of Lemma 1
By assumption, the density class F is defined as in Equation (2).Using the assumed properties (MP2) and (MP3) of multivariate Pareto distributions as well as the property (PI2) of pairwise interaction models, we shall establish all the required properties that will ensure that each density in F is an element of F genHR .Specifically, it will be shown that the dimension q of the marginal parameters µ i can be taken as 1, that the functions S i and T i must be logarithmic, and that moreover the parameter matrix Θ must (or rather, can be assumed to) have zero row and column sums.
2.1.1The functions T i must be logarithmic and Θ can be assumed to have zero row sums The required homogeneity property (MP2) implies that for any y ∈ L and t > 1, Since for every pair (i, j), i = j, Θ ij can be non-zero, the above implies that for every such pair, any y ∈ L and any t > 1, for some functions a ij and b ij not depending on y j and on y i , respectively.Consider now the following auxiliary result.
To summarize, we have shown that holds for every t > 1 and x 1 , x 2 such that ξ(x 1 ) = ξ(x 2 ).By extension, it holds for every positive x 1 and x 2 , since ξ(x 1 ) = ξ(x 2 ) makes both sides vanish.Finally, the same clearly holds for t ∈ (0, 1] if δ ξ (t) is defined as 1 for t = 1, and as δ ξ (t −1 ) −1 for t < 1.This is the desired result for the function ξ.By symmetry, the same holds for ψ.
Applying Equation ( 5) to those four vectors, followed by Equation ( 7), we have 0 = a ij (y for any t > 1.By assumption, the last two terms in the product are non-zero.Deduce that δ i (t)δ j (t) = 1.However, for a third index k / ∈ {i, j}, we may apply the same logic to find that similarly, δ i (t)δ k (t) = δ j (t)δ k (t) = 1.This is only possible if δ i (t) = δ j (t) = δ k (t) = 1 for every t > 1, and by extension for every t > 0, recalling that δ i (t) is defined as 1 for t = 1 and as δ i (t −1 ) −1 for t < 1.The same argument applies to every triple (i, j, k), so that Equation ( 7) can be rewritten as which now holds for every index i ∈ {1, . . ., d} and positive y (1) i , y (2) i and t.Equivalently, for every t > 0 the function y → T i (ty) − T i (y) is constant.We now apply the following.
Applying Lemma 4, deduce that T i is a logarithmic function of the form T i (y) = c i log y + T i (1).Note however that the values of c i and T i (1) can be absorbed into the parameters µ i and Θ ij , j ∈ {1, . . ., d}, and into the normalizing constant, and are thus not important in characterizing the possible distributions in P. We may therefore assume that all the functions T 1 , . . ., T d are equal to the logarithm function.
With our current formulation of the distributions in P, the diagonal elements of Θ are not identifiable.Indeed, their value can be changed arbitrarily by adding a (q + 1)th dimension to each µ i and letting S i (y i ) q+1 = (log y i ) 2 .Therefore, we shall assume without loss of generality that Θ ii = − j =i Θ ij , so that the row sums of Θ (and by symmetry, its column sums) are all equal to zero.
2.1.2The parameters µ i can be assumed scalar and the functions S i must be logarithmic Replacing the functions T i by logarithms in Equation ( 4) and using the assumption that the row and columns sums of Θ are zero, we find Similarly to what was argued about the functions T i , deduce that for all y i and t > 0, which by Lemma 4 means that µ ⊤ i S i can be chosen to be simply a logarithm, up to scaling.Thus, we may assume without loss of generality that q = 1 and that the real-valued functions S i are logarithms.
To summarize, we have established that all the densities in F must be of the form with Θ symmetric with zero row (and column) sums.This concludes the proof.

Proof of Lemma 2
As mentioned after the statement of the result, (i) is already obtained by Lederer and Oesting (2023), so only (ii) and (iii) shall be proved here.
Let f genHR µ,Θ ∈ F genHR .It will first be shown that for f genHR µ,Θ to be integrable on L, it is necessary for Θ to be a Hüsler-Reiss precision matrix, i.e., an element of S d×d 1,+ , and for µ to satisfy µ ⊤ 1 > d, establishing (iii).Finally, it will be shown that for such a given matrix Θ, for f genHR µ,Θ to be a multivariate Pareto density, it is necessary for µ to have the specific form µ HR (Θ) in which case f genHR µ,Θ is a Hüsler-Reiss density, thus establishing (ii).

If
is not integrable For any index k, by the change of variable x = log y, we find that Decomposing x into x \k and x k , we may write (µ − 1) ⊤ x as (µ − 1) ⊤ \k x \k + (µ k − 1)x k , and x ⊤ Θx as where Θ (k) := Θ \k,\k .We may then rewrite Equation (8) as The inner integral is a Gaussian type integral.It is straightforward to show that it is finite if and only if Θ (k) is positive definite, using a spectral decomposition of that matrix.Deduce that for f genHR µ,Θ to be integrable, all the matrices Θ (k) must be positive definite (hence, of full rank d − 1).This implies that Θ must also be of rank d − 1.Moreover, Θ has to have only non-negative eigenvalues.Indeed, suppose it doesn't.Then there is an x ∈ R d such that x ⊤ Θx < 0. However, since Θ1 = 0, it is also true that 0 ) \k , contradicting the positive definiteness of Θ (k) .
It is therefore necessary for the integrability of f genHR µ,Θ on L that Θ ∈ S d×d 1,+ .For the remainder of the proof, we shall assume that this is the case.

If
is not integrable Now that we assume the matrices Θ (k) to be invertible, we may obtain a more precise expression for Equation (8).By tedious but elementary computations involving completion of the quadratic form x ⊤ \k Θ (k) x \k , we may rewrite the argument of the exponential in Equation (8) as where x \k only appears in the first term, we may rewrite Equation (8) as Expanding the quadratic form Therefore, Equation ( 9) is equal to and µ ⊤ 1 > d, which in particular implies (iii).By Equation ( 23) in Röttger et al. (2023), det(Θ (k) ) is in fact the pseudodeterminant of Θ and as such, does not depend on k.Hence, the marginal standardization property holds if and only if the value of (µ − 1) ⊤ \k Σ (k) (µ − 1) \k is the same for each k.

If f genHR
The matrices Σ (k) , however, enjoy a special structure.Let us augment Σ (k) by adding a row and column of zeros in its kth position, forming a matrix Σ (k) ∈ R d×d .Then the matrices Σ (k) satisfy Σ recalling the fact that µ ⊤ 1 = d + 1, or equivalently (µ − 1) ⊤ 1 = 1.This can only be zero for any k = ℓ if Γ k• (µ − 1) has the same value for every k, i.e., Γ(µ − 1) ∈ {γ1 : γ ∈ R}.By invertibility of Γ, this forms a non-singular system of d − 1 linear equations.The solution set is a one-dimensional linear subspace, only one element of which, say µ * − 1, satisfies (µ * − 1) ⊤ 1 = 1.Therefore, for any given parameter matrix Θ ∈ S d×d 1,+ , the parameter vector µ is uniquely determined by the multivariate Pareto structure.The resulting distribution can be none other than the Hüsler-Reiss distribution with precision matrix Θ.
Note that it can be verified by simple calculations that the unique solution to the above linear system is indeed the Hüsler-Reiss distribution.Indeed, using Lemma S.5.11 of Hentschel et al. (2022), it can be confirmed that Γ(µ HR (Θ) − 1) is indeed the vector 1 multiplied by the scalar d −2 1 ⊤ Γ( 1 2 ΘΓ − I)1, where I denotes the identity matrix, and that (µ HR (Θ) − 1) ⊤ 1 = 1.