Inducement of population sparsity

The pioneering work on parameter orthogonalization by Cox and Reid is presented as an inducement of abstract population‐level sparsity. This is taken as a unifying theme for this article, in which sparsity‐inducing parameterizations or data transformations are sought. Three recent examples are framed in this light: sparse parameterizations of covariance models, the construction of factorizable transformations for the elimination of nuisance parameters, and inference in high‐dimensional regression. Strategies for the problem of exact or approximate sparsity inducement appear to be context‐specific and may entail, for instance, solving one or more partial differential equations or specifying a parameterized path through transformation or parameterization space. Open problems are emphasized.


INTRODUCTION
Sparsity, the existence of many zeros or near-zeros in some domain, plays at least two roles in high-dimensional statistical theory: to aid interpretation and to restrain estimation error associated with multitudinous nuisance parameters. The ideas to be presented are motivated primarily by the latter, with a low-dimensional parameter of interest encapsulating relevant aspects of interpretation. In some contexts, there is a natural and interpretable notion of sparsity, and the statistical challenge is the now rather routine task of specifying an estimator that exploits this structure to give appropriate statistical guarantees. See, e.g., Wainwright (2019) for an extensive account covering numerous examples.
The present article is barely concerned with estimators and other sample quantities. Its contribution is to explore the idea that certain forms of abstract, population-level sparsity may be systematically induced, and to seek unification of some isolated examples through this perspective. The precursor, although not framed as sparsity inducement, appears to be the paper on parameter orthogonalization by Cox & Reid (1987).
The four principal examples considered fall broadly into two categories: sparsity induced by reparameterization and sparsity induced by transformations of the data. The exposition here follows this separation, although it is plausible that the two approaches are connected. To avoid repetition, only the essential aspects of each case are presented, from which a synthesis is attempted.

SPARSITY INDUCED BY REPARAMETERIZATION
2.1. Parameter Orthogonalization (Cox & Reid, 1987)

Direct use of the likelihood function often produces misleading estimates of parameters of interest when the dimension of the nuisance parameter vector is of a similar order of magnitude to the number of independent observations, and is typically suboptimal even for moderately many nuisance parameters. One resolution, implicitly an invocation of sparsity, is parameter orthogonalization (Cox & Reid, 1987) prior to likelihood inference on the parameter of interest.
Let $\psi$ and $\lambda$ represent parameters of interest and nuisance parameters, respectively, in a parametric statistical model. Where $i_{\psi\lambda}(\psi, \lambda)$ denotes the corresponding off-diagonal block of the Fisher information matrix, $\psi$ and $\lambda$ are said to be globally orthogonal if $i_{\psi\lambda}(\psi, \lambda) = 0$ for all $\psi$ and $\lambda$, and locally orthogonal if this equality holds at particular values. The implication of parameter orthogonality is that the maximum likelihood estimator of $\lambda$ behaves "almost as if" $\psi$ were fixed at its true value, in the sense that $\hat{\lambda}_\psi - \hat{\lambda} = O_p(n^{-1})$ for $\psi$ in an $O(n^{-1/2})$-neighbourhood of the true value. Here, $\hat{\lambda}$ is the unconstrained maximum likelihood estimator and $\hat{\lambda}_\psi$ maximizes the likelihood over the constrained parameter space at $\psi$. The previous statements assume that the dimension of $\lambda$ is fixed.
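A classical instance of an orthogonal pair, offered here as an illustration of ours rather than one taken from the article, is the Gamma model with shape parameter of interest and the mean as nuisance parameter: the $(\text{shape}, \text{mean})$ cross-information vanishes while the $(\text{shape}, \text{rate})$ cross-information does not. The sketch below checks this by Monte Carlo, using the identity that the Fisher information equals the expected outer product of scores.

```python
import numpy as np
from scipy.special import digamma

# Monte Carlo check of parameter orthogonality in a Gamma model.
# Interest parameter: shape k. In the (k, rate) parameterization the
# (k, rate) cross-information is -1/rate, nonzero; reparameterizing the
# nuisance as the mean mu = k/rate makes the cross-information vanish.
rng = np.random.default_rng(0)
k, mu = 3.0, 2.0          # true shape and mean
lam = k / mu              # implied rate
x = rng.gamma(shape=k, scale=1.0 / lam, size=200_000)

# Scores in the (k, lam) parameterization of the log-likelihood
#   l = k log(lam) - log Gamma(k) + (k-1) log x - lam x:
s_k_canon = np.log(lam) - digamma(k) + np.log(x)
s_lam = k / lam - x

# Scores in the (k, mu) parameterization, substituting lam = k/mu:
s_k_orth = np.log(k / mu) + 1.0 - digamma(k) + np.log(x) - x / mu
s_mu = (k / mu**2) * (x - mu)

i_k_lam = np.mean(s_k_canon * s_lam)   # estimates -1/lam, nonzero
i_k_mu = np.mean(s_k_orth * s_mu)      # estimates 0 (orthogonality)
print(i_k_lam, i_k_mu)
```

The cross-information estimate in the orthogonal parameterization is zero up to Monte Carlo error, while in the canonical parameterization it is close to $-1/\lambda$.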
Parameter orthogonality also enables higher-order inference via a simple modification to the profile log-likelihood function without the specification of an ancillary complement to the maximum likelihood estimator (Barndorff-Nielsen, 1983; Cox & Reid, 1987).
From an initial parameterization $(\psi, \lambda)$, Cox & Reid (1987) provided a way to construct an interest-respecting reparameterization $(\psi, \phi(\psi, \lambda))$ such that $i_{\psi\phi} = 0$. In other words, they proposed a sparsity-inducing reparameterization. In general, establishment of the appropriate reparameterization entails solving a set of partial differential equations, although there is a simpler route if the parameter of interest is a canonical parameter of a full exponential family, as noted by Huzurbazar (1956) for two-parameter families and more generally by Barndorff-Nielsen (1978, p. 183). A more explicit derivation was presented by Barndorff-Nielsen & Cox (1994, p. 64): writing the log-likelihood function of a full exponential family with canonical parameters $\psi$ and $\lambda$ and canonical statistics $s$ and $t$ as $\ell(\psi, \lambda) = \psi^{\top}s + \lambda^{\top}t - k(\psi, \lambda)$, the mean-value parameter $\phi = \partial k(\psi, \lambda)/\partial \lambda = E_{\psi,\lambda}(t)$ is globally orthogonal to $\psi$. This holds in arbitrary dimensions, as can be shown using the expression for the inverse of a block-partitioned matrix.

Sparsity of the Fisher information matrix has emerged in recent literature on inference in high-dimensional regression (e.g., van de Geer et al., 2014; Fang, Ning & Liu, 2017). However, its presence there is assumed rather than induced as in Cox & Reid (1987). Direct imposition of such sparsity is highly restrictive, as can be seen from the simple example of a normal-theory linear regression model. In that context, with $\psi$ taken as an arbitrary regression coefficient and $\lambda$ the remaining coefficients, the requirement $i_{\psi\lambda} = 0$ without preliminary manoeuvres is equivalent to assuming that all columns of the design matrix corresponding to $\psi$ are orthogonal to those corresponding to $\lambda$. We return to this example in Section 3.2.
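The exponential-family argument can be restated compactly. The following sketch is a reconstruction consistent with standard sources (Barndorff-Nielsen & Cox, 1994, p. 64); the notation $(s, t, k)$ is assumed, with $k$ the cumulant function.

```latex
% Log-likelihood of a full exponential family and the mean-value parameter:
\ell(\psi, \lambda) = \psi^{\top} s + \lambda^{\top} t - k(\psi, \lambda),
\qquad
\phi = \partial k(\psi,\lambda) / \partial \lambda = E_{\psi,\lambda}(t).
% Holding phi fixed and differentiating the defining identity gives
k_{\lambda\psi} + k_{\lambda\lambda}\,(\partial \lambda / \partial \psi)_{\phi} = 0
\;\Longrightarrow\;
(\partial \lambda / \partial \psi)_{\phi} = -\,k_{\lambda\lambda}^{-1} k_{\lambda\psi}.
% Cross-information in the (psi, phi) parameterization, using
% E(\ell_\psi \ell_\lambda^{\top}) = k_{\psi\lambda} and
% E(\ell_\lambda \ell_\lambda^{\top}) = k_{\lambda\lambda}:
i_{\psi\phi}
= E\bigl\{(\ell_{\psi} - k_{\psi\lambda} k_{\lambda\lambda}^{-1} \ell_{\lambda})
          \,\ell_{\lambda}^{\top}\bigr\} k_{\lambda\lambda}^{-1}
= k_{\psi\lambda} k_{\lambda\lambda}^{-1}
  - k_{\psi\lambda} k_{\lambda\lambda}^{-1} k_{\lambda\lambda} k_{\lambda\lambda}^{-1}
= 0.
```

The cancellation in the final line is the sense in which sparsity of the information matrix is induced rather than assumed.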

2.2. Sparse Parameterizations of Covariance Models
Covariance matrices and their inverses are encountered throughout classical multivariate analysis, almost always as nuisance parameters to be estimated, which necessitates a sparsity assumption if multivariate procedures are to extend to high-dimensional settings. More precisely, the requirement is typically that an estimator of the covariance or precision matrix is consistent in the spectral matrix norm as the dimension $p$ of the matrix tends to infinity with the effective sample size $n$ under a suitable scaling condition. This notional asymptotic regime is a theoretical device, a means of studying the probabilistic behaviour of an estimator as a function of $n$ and $p$ when $p > n$.
Spectral-norm consistency, while achievable under a sparsity constraint, is relevant only if the assumptions made are valid to an adequate order of approximation. This motivates a search for parameterizations under which relevant covariance models are sparse. In particular, it raises the question, referred to henceforth as (*): for a given (relevant) covariance model not obviously sparse in any domain, can a sparsity-inducing parameterization be deduced? An answer would enable reparameterization to achieve maximal sparsity; on the transformed scale, more effective and valid estimation could be achieved by exploiting the sparsity before transforming the conclusions back to the scale of interest. Battey (2019) established statistical guarantees for such a procedure assuming that the sparsity scale is known or can be reliably estimated, which does not address (*). The idea of parameterizing, and thereby estimating, the sparsity scale using a device analogous to that proposed by Box & Cox (1964) has been suggested by Peter McCullagh in unpublished communication.
The problem (*) remains open. The proof of concept to be outlined follows Battey (2017), who gave an example of a covariance model that is unexpectedly sparse after reparameterization. The starting point was the converse formulation: to impose sparsity in unusual domains and study the structure induced on the original and inverse scales. Consider the matrix logarithm $L$ of a covariance matrix $\Sigma$, implicitly defined through the Taylor series expansion of the matrix exponential, $\Sigma = \exp(L) = \sum_{k=0}^{\infty} L^k / k!$. In contrast to $\Sigma$, which belongs to the open cone of positive definite matrices, $L$ belongs to the vector space of symmetric matrices, for which there exists a natural basis, $\mathcal{B} = \{B_1, \ldots, B_{p(p+1)/2}\}$ say, consisting of $p(p+1)/2$ $p$-dimensional square matrices. The choice to study the matrix logarithm was made on mathematical grounds, as the properties of vector spaces make the formulation feasible and fruitful, and allow a simple characterization of sparsity through the basis expansion $L = L(\theta) = \sum_{j=1}^{p(p+1)/2} \theta_j B_j$. In particular, the sparsity of $L$ corresponds to that of the basis coefficient vector $\theta = (\theta_1, \ldots, \theta_{p(p+1)/2})^{\top}$. With the constraint $\|\theta\|_0 = s^* \ll p$, where $\|\theta\|_0$ is the number of nonzero entries of $\theta$, the eigenvalues $\ell_1 \geq \ell_2 \geq \cdots \geq \ell_p$ and corresponding eigenvectors of the resulting matrix $\Sigma(\theta)$ inherit substantial structure, as illustrated in Figure 1. The right-hand panel corrects Figure 1 of Battey (2017), which depicted a single realization instead of a Monte Carlo average. Figure 1 was obtained by generating 100 realizations of a random, sparse $\theta$, taking the support of $\theta$ to be random samples of size $s^*$ from the index set $\{1, \ldots, p(p+1)/2\}$. This was done for different values of $s^*$ as indicated in Figure 1. The values of the nonzero basis coefficients were then drawn from a standard normal distribution, although the latter aspect is irrelevant as far as the induced structure on $\Sigma$ is concerned: any other distribution could have been used instead.

Figure 1 indicates that an a priori unexpected structure is present in the eigenvectors and eigenvalues of the covariance matrix, which translates to structure on the covariance matrix and its inverse. Specifically, there exists a permutation matrix $P$ such that $P \Sigma P^{\top} = M$, where $M$ consists of a potentially large, dense block and is otherwise diagonal. The dimension of the dense block is provided by theoretical analysis and can be expressed in terms of the sparsity $s^*$ and dimension $p$ through a random-matrix perspective. The key conclusion is that $L$ can be considerably more sparse than $\Sigma$.
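The generating mechanism described above can be sketched in a few lines. The basis below is the natural one for symmetric matrices; the variable names and the small dimensions are assumptions for illustration, not taken from the article.

```python
import numpy as np
from scipy.linalg import expm

# Sketch of the construction in Battey (2017): impose sparsity on the
# matrix logarithm L of a covariance matrix via a basis expansion, and
# examine the induced covariance matrix Sigma = exp(L).
rng = np.random.default_rng(1)
p, s_star = 6, 3

# Natural basis of the p(p+1)/2-dimensional space of symmetric matrices.
basis = []
for i in range(p):
    for j in range(i, p):
        B = np.zeros((p, p))
        B[i, j] = B[j, i] = 1.0
        basis.append(B)

# Sparse coefficient vector theta: s_star nonzero entries, N(0,1) values.
theta = np.zeros(len(basis))
support = rng.choice(len(basis), size=s_star, replace=False)
theta[support] = rng.standard_normal(s_star)

L = sum(t * B for t, B in zip(theta, basis))  # sparse symmetric matrix
Sigma = expm(L)                               # induced covariance matrix

print("nonzeros in L:", np.count_nonzero(np.abs(L) > 1e-12))
print("smallest eigenvalue of Sigma:", np.linalg.eigvalsh(Sigma).min())
```

By construction $L$ has at most $2s^*$ nonzero entries, while $\Sigma$ is symmetric positive definite and generically dense, illustrating that sparsity can live in the logarithmic domain alone.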
This idea was extended to another class of examples by Rybak & Battey (2021), but the scope for further progress seems substantial. Most notably, the earlier work only contains a brief discussion of how one may traverse paths through a parameterization space in search of a sparse representation. These paths were chosen, following McCullagh's proposal, to pass through the covariance and inverse covariance parameterizations, as well as the matrix logarithmic parameterization. In the Box & Cox (1964) analysis of transformations, a path through a model space was viewed as a technically convenient way of assessing a discrete set of plausible models for compatibility with data. Any values of their key transformation parameter that yielded nonphysical models were not used in the final analysis. The situation in the present context is different, as the covariance matrix is almost always a nuisance parameter and the goal of the reparameterization is to aid inference on the parameters of interest, a line of argument similar to that of Cox & Reid (1987).

3.1. Construction of Factorizable Transformations in Matched-Comparison Problems
Suppose that $Y_1, \ldots, Y_n$ are responses on $n$ blocks of individuals, where each $Y_i$ is a vector of random variables. The simplest realistic example has $(Y_i)_{i=1}^n = (T_i, C_i)_{i=1}^n$, an outcome variable on $n$ pairs of homozygotic twins, where one twin from each pair has been chosen at random to receive a treatment, with outcome denoted by $T_i$, while the other twin is the untreated control with outcome $C_i$. The goal is inference on the treatment effect $\psi$ in the presence of pair-specific nuisance parameters $\lambda_1, \ldots, \lambda_n$, the latter arising from an inability or unwillingness to model the generating process in detail. These may, for instance, represent genetic differences between the twin pairs. Maximum likelihood estimation without preliminary manoeuvres typically produces misleading inference for $\psi$. It is, however, sometimes possible to eliminate nuisance parameters from the analysis by exploiting the natural separation in the data due to the matching, and transforming the observations in such a way that the resulting likelihood function, $L(\psi, \lambda; y)$ say, fruitfully factorizes. Such a factorization is of the form
$$L(\psi, \lambda; y) = L_{\mathrm{pa}}(\psi; y)\, L_{\mathrm{r}}(\psi, \lambda; y), \qquad (3.1)$$
where the factor $L_{\mathrm{pa}}(\psi; y)$ is called the partial likelihood (Cox, 1975). Ideally, little or no information for inference on $\psi$ is lost through relinquishment of the remainder likelihood $L_{\mathrm{r}}(\psi, \lambda; y)$. Conditional (Bartlett, 1936, 1937) and marginal (Fraser, 1968) likelihood are special cases of partial likelihood in which $L_{\mathrm{pa}}(\psi; y)$ is the product of appropriate conditional and marginal probability functions, respectively, evaluated at the data. A different construction leading to the encompassing form in Equation (3.1) was given by Cox (1972) to evade estimation of the baseline hazard function in the proportional hazards model.
In a matched-comparison setting, a suitable marginal likelihood is found by making a transformation $U_i = g(Y_i)$ such that the probability density or mass function $f_U$ of $U_i$ is free of $\lambda_i$, so that $L_{\mathrm{pa}}(\psi; y)$ can be taken as $\prod_{i=1}^n f_U(u_i; \psi)$. Such factorizations, with a partial-likelihood component that is free of nuisance parameters, need not exist, which raises the question of whether useful approximate versions are available.
Example 1. Suppose that $T_i$ and $C_i$ are exponentially distributed with rates $\lambda_i \psi$ and $\lambda_i / \psi$, respectively. The marginal density of $U_i = T_i / C_i$ at $u$ is $f_U(u; \psi) = \psi^2 / (1 + \psi^2 u)^2$, which does not depend on $\lambda_i$. Thus $(U_i)_{i=1}^n$ are independent and identically distributed and can be used for likelihood-based inference on $\psi$.
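Example 1 can be checked by simulation. The rates and density below follow the reconstruction of the garbled symbols in the example above (rates $\lambda\psi$ and $\lambda/\psi$, density $\psi^2/(1+\psi^2 u)^2$, hence distribution function $F(u;\psi) = \psi^2 u/(1+\psi^2 u)$); if that reconstruction is right, $F(U)$ is Uniform$(0,1)$ for every value of the nuisance rate $\lambda$.

```python
import numpy as np

# Probability integral transform check for Example 1: U = T/C with
# T ~ Exp(rate lam*psi), C ~ Exp(rate lam/psi) should have distribution
# F(u; psi) = psi^2 u / (1 + psi^2 u), free of lam.
rng = np.random.default_rng(2)
psi = 1.7
n = 100_000

def probability_integral_transform(lam):
    t = rng.exponential(scale=1.0 / (lam * psi), size=n)  # rate lam*psi
    c = rng.exponential(scale=psi / lam, size=n)          # rate lam/psi
    u = t / c
    return psi**2 * u / (1.0 + psi**2 * u)                # F(u; psi)

f1 = probability_integral_transform(lam=0.5)
f2 = probability_integral_transform(lam=8.0)
print(f1.mean(), f2.mean())   # both close to 0.5 whatever lam is
```

Both sample means sit at 0.5 up to Monte Carlo error, confirming that the distribution of the ratio is free of $\lambda$.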
The connection to sparsity is that the partial derivative of $f_U$ with respect to $\lambda$ is identically zero when the distribution of $U$ is free of $\lambda$. Thus, the search for a transformation $U = g(Y)$ that produces sparsity of $\nabla_{\lambda} f_U$ at any set of evaluation points corresponds to a search for a factorizable transformation of $Y$. Battey, Cox & Lee (2022) used this sparsity as the basis for systematically deducing a factorizable transformation. This was achieved via integro-differential equations constructed from the Laplace transform of $f_U$ and convertible to standard forms of partial differential equations in some contexts. The approach can also be viewed as inducing sparsity in the $(\psi, \lambda)$ off-diagonal blocks of the Fisher information matrix, as in Cox & Reid (1987).

3.2. Inference in High-Dimensional Linear Regression
The work to be outlined (Battey & Reid, 2022) stemmed from an attempt to formulate the high-dimensional linear regression problem in a way that would evade nuisance parameters as in Section 3.1. Sparsity is induced through linear transformations of the data, which exploits the linearity of the regression model.
For an arbitrary component $\beta_v$ of the $p$-dimensional regression coefficient vector $\beta$, write the linear regression model for the $n$ observations in matrix form as
$$Y = X_v \beta_v + X_{-v} \beta_{-v} + \varepsilon,$$
where $\varepsilon$ is an error term with mean zero and variance $\sigma^2 I_n$; $\beta_{-v}$ is the nuisance component of $\beta$; and $X_{-v}$ is the matrix of observations on explanatory variables after removing the column $X_v$ corresponding to $\beta_v$. Although $p \gg n$, we assume $\beta$ is sparse in the sense that $\|\beta\|_0 = s \ll n$.
Each coefficient is treated in turn as the parameter of interest, and an interest-respecting transformation is sought that produces sparsity (the existence of many zeros or near-zeros) in the $(\beta_v, \beta_{-v})$ off-diagonal block of the conditional Fisher information matrix or, rather, the notional conditional Fisher information matrix that would have emerged had a normality assumption been made.

Sparsity is achieved approximately by premultiplication of the regression equation by an $n \times n$ matrix $A_v$, with $\tilde{X}^{(v)} = A_v X$ such that $\tilde{X}^{(v)}_v$ is as orthogonal as achievable to the $p-1$ columns of $\tilde{X}^{(v)}_{-v}$. Orthogonality means that direct regression of $\tilde{Y}^{(v)} = A_v Y$ on $\tilde{X}^{(v)}_v$ estimates $\beta_v$ without bias, as in a factorial experiment. In practice, exact orthogonalization is not achievable from this premultiplication strategy, so some bias is expected from the simple least squares regression of $\tilde{Y}^{(v)}$ on the single column $\tilde{X}^{(v)}_v$. Using the optimized value $A_v = c\,(I_n + X_{-v} X_{-v}^{\top})^{-1}$ ensures that the observable upper bound on the mean-squared error of the resulting estimator $\tilde{\beta}_v$ is minimized, where $c$ is any nonzero real number and can be taken to be 1 without loss of generality.
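The premultiplication strategy can be sketched numerically. The form $A_v = (I_n + X_{-v}X_{-v}^{\top})^{-1}$ with $c = 1$ is a reconstruction from the garbled source, and the simulation design and variable names are assumptions of ours, not taken from Battey & Reid (2022).

```python
import numpy as np

# Sketch of the premultiplication strategy: premultiply the regression
# equation by A_v = (I_n + X_{-v} X_{-v}^T)^{-1} so that the transformed
# column of interest is nearly orthogonal to the transformed nuisance
# columns, then regress A_v Y on A_v X_v by simple least squares.
rng = np.random.default_rng(3)
n, p, v = 100, 200, 0

X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 50, 120]] = [2.0, -1.5, 1.0]          # sparse truth, beta_v = 2
Y = X @ beta + 0.5 * rng.standard_normal(n)

X_v = X[:, v]
X_minus = np.delete(X, v, axis=1)
A_v = np.linalg.inv(np.eye(n) + X_minus @ X_minus.T)

Xt, Yt = A_v @ X_v, A_v @ Y
beta_v_tilde = (Xt @ Yt) / (Xt @ Xt)           # simple least squares slope
print(beta_v_tilde)
```

A deterministic sanity check is that with no noise and no nuisance signal the transformed simple regression recovers $\beta_v$ exactly, since $A_v Y$ is then proportional to $A_v X_v$.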
Let $b_v$ be the bias associated with $\tilde{\beta}_v$. This decays at a rate of $O(s/n)$, where $n$ is the sample size and $s$ the sparsity of $\beta$. The implication is that, by ignoring $b_v$, Wald-based inference for each $\beta_v$ is accurate to an order of $O(s/\sqrt{n})$. There are strong connections to earlier work on the problem of inference in high-dimensional regression, notably to Zhang & Zhang (2014) and van de Geer et al. (2014). A key conceptual distinction from those works, which aligns the construction outlined above with the theme of the present article, is that it induces sparsity where earlier contributions assume it. Battey & Reid (2022) proposed using the confidence intervals for each $\beta_v$ as part of a broader inferential framework concerned with uncertainty over the sparse regression model. The need for such confidence sets of models was emphasized in Cox (1968, 1995), Cox & Snell (1974, 1989), and Cox & Battey (2017).

DISCUSSION AND OPEN PROBLEMS
This short article is an attempt to synthesize four ideas through a unifying theme of sparsity inducement: a search for parameterizations or data transformations that yield zeros or near-zeros as components of key population-level objects. This is in contrast to a large body of work in which any sparsity on population-level objects is assumed and subsequently imposed on sample objects through penalization or thresholding.
A precise mathematical characterization in full generality seems a formidable challenge. The following list of open problems may indicate useful endeavours towards this goal.
(1) In Section 2.2, for a given covariance model not obviously sparse in any domain, can a sparsity-inducing parameterization be deduced? What is the appropriate formalization of sparsity in such contexts?

(2) If the interpretability of nuisance parameters is accepted as immaterial in Section 2.2, then a broader search over the parameterization space is suggested than that permitted through simple parameterized paths. How can this be operationalized?

(3) What is the most appropriate approximate formulation of the factorizable transformation problem highlighted in Section 3.1? Such a formulation would need to allow the probability density function of the transformed random variable to depend weakly on the nuisance parameter. How should the adequacy of such transformations be assessed?

(4) Are there systematic routes to deducing fruitful partial likelihood factorizations more generally, beyond the matched-comparison problems of Section 3.1? This question was one of five posed by Cox (1975) and has remained open since, except for the modest progress by Battey, Cox & Lee (2022).

(5) A simple example is a normal-theory linear regression model with a coefficient vector $\beta$ and an unknown error variance $\sigma^2$. The minimal sufficient statistic is $(\hat{\beta}, s^2)$, where $\hat{\beta}$ is the least squares estimator and $s^2$ is the residual sum of squares divided by the residual degrees of freedom. Direct use of the likelihood function produces an estimate of error variance that is too small, particularly when the number of covariates is appreciable relative to the number of independent observations. By the minimal sufficiency and independence of $\hat{\beta}$ and $s^2$, the joint density function of the responses factorizes as $f(y; \beta, \sigma^2) = f(y \mid \hat{\beta}, s^2)\, f(\hat{\beta}; \beta, \sigma^2)\, f(s^2; \sigma^2)$, and reliable inference for $\sigma^2$ can be obtained by using the final component to construct a partial likelihood.
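The variance example admits a direct numerical illustration: dividing the residual sum of squares by $n$ (the maximum likelihood estimator) understates $\sigma^2$ by the factor $(n-p)/n$, while dividing by the residual degrees of freedom $n-p$, as the marginal likelihood for $\sigma^2$ dictates, is unbiased. The simulation design below is an assumption of ours.

```python
import numpy as np

# Bias of the ML estimator of error variance versus the residual
# (marginal-likelihood) estimator in normal-theory linear regression.
rng = np.random.default_rng(4)
n, p, sigma2, reps = 30, 10, 1.0, 2000

ml, resid = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    _, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(rss[0])
    ml.append(rss / n)           # ML estimator: biased downward
    resid.append(rss / (n - p))  # divides by residual d.f.: unbiased

print(np.mean(ml), np.mean(resid))   # roughly (n-p)/n = 0.67 and 1.0
```

With $p/n = 1/3$ the downward bias of the maximum likelihood estimator is substantial, whereas the estimator derived from the $f(s^2; \sigma^2)$ factor is centred on the truth.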
The point of considering simple examples is not to recover the correct answer, but rather to do so through a seamless application of theory, which can then be applied to more challenging situations. See Fraser, Reid & Lin (2018) for another discussion in this vein.

(7) Likelihood theory has associated with it an important and enlightening differential geometric interpretation. How does partial likelihood (including marginal and conditional likelihood as special cases) fit into this discussion?

(8) Can a connection be established between data-based transformations for the elimination of nuisance parameters via marginal or conditional likelihood and the interest-respecting reparameterizations of Cox & Reid (1987)? Some informal remarks in Battey, Cox & Lee (2022), which were based on Fraser (1964), made a connection between sample and parameter spaces through the notion of a local location model. A separate line of work (2018) devised an approach to quantifying sparsity using ideas from extreme value theory; the connection to the present discussion is unclear.
A broad goal is a unified theory of inference that is Fisherian when the dimension of the unknown parameter is smaller than the sample size, yet is also operational when the converse is true. A notion of sparsity seems inevitable, and this may take unusual forms as exemplified in the present article.