A certain invariance property of BLUE in a whole‐genome regression context

Abstract A curious result from mixed linear models applied to genome‐wide association studies was expanded. In particular, a model in which one or more markers are considered as fixed but are allowed to contribute to the covariance structure by treating such markers as random as well was examined. The best linear unbiased estimator of marker effects is invariant with respect to whether those markers are employed in constructing a genomic relationship matrix or are ignored, provided marker effects are uncorrelated with those not being tested. Also, the implications of regarding some marker effects as fixed when, in fact, these possess a non‐trivial covariance structure with those declared as random were examined.


| INTRODUCTION
In genome-wide association (GWAS) studies (e.g., Manolio et al., 2009), an objective is to find statistical connections between molecular markers and genomic regions affecting some complex trait. A linear regression model including marker genotype codes as covariates is used, and the simplest version fits a single marker regression via ordinary least squares (OLS).
If aggregation or clustering due to familial or molecular similarity exists in the data, a better estimation approach is generalized least squares (GLS), as it poses a more general covariance structure than OLS (Aulchenko, de Köning, & Haley, 2007;Gianola, Fariello, Naya, & Schön, 2016;Hoffman, 2013;Kang et al., 2008;Listgarten et al., 2012;Yang, Zaitlen, Goddard, Visscher, & Price, 2014). One such structure, for example, results from declaring all or a subset of marker effects as random variables, for example, assuming that j ∼ N 0, 2 ,j = 1,2, … , p, with all markers in the set taken as independently and identically distributed random variables. A random effects specification induces markerbased measures of similarity among individuals called molecular "relationship" or "kinship" matrices (G). The marker (s) evaluated for association is (are) treated as a fixed effect (s), and a test of nullity of effects on a trait is based on well-established procedures.
Should the marker (s) being tested be included or excluded when building G? A priori, if a marker is declared random it could not be fixed and vice versa. Including the contribution of a marker to G while declaring it as fixed constitutes a form of "double counting." When the number of markers (p) is very large and a single marker regression is used, the impact on G of including or removing the marker is tiny. Listgarten et al. (2012) suggest that markers being tested should be removed from G, followed by a concomitant re-estimation of necessary variance components at each instance of testing. This approach is computationally taxing, especially when p is huge, as it is the case with DNA sequence data. In many situations, it may be reasonable to assume that variance component estimates are affected only mildly by including or excluding the tested marker in G. For many complex traits in animal and plant breeding, each of the numerous markers in a chip has a small effect on both mean and variance of the data distribution. Gianola et al. (2016) showed that the best linear unbiased estimator (BLUE, also GLS) of the fixed effect of a marker (or sets of markers) examined in GWAS is invariant with respect to whether or not the marker (s) tested for association is (are) included in the construction of G, provided that variance components are assumed constant. This short communication expands on the preceding, as follows. First, we provide an expression that gives the variance-covariance matrix of the BLUE of each of the marker effects being tested using a simple adjustment. Second, it is shown that the best linear unbiased predictor (BLUP) of effects treated both as fixed and random is exactly zero, provided that no covariance exists between such effects and other marker effects treated as random in the model. Third, it is shown that if such covariance is not null, the fixed effects of a set of markers affect phenotypes through direct and indirect paths, and over and above the impact of linkage disequilibrium captured by columns of the matrix of genotype codes.

| MODEL
The linear regression model (assume that nuisance location effects have been eliminated somehow) used in GWAS is often posed as where y is an n × 1 vector of phenotypes, X is an n × p matrix of marker genotype codes, is a vector of p allelic substitution effects and e ∼ N 0,I 2 e is a residual vector where 2 e is the variance of the distribution of model residuals. Let X = [ X 1 X 2 ] , where X 1 is n × p 1 (without loss of generality assume that X 1 has full column rank), and X 2 is n × p 2 where p 2 may be much larger than n. An equivalent representation of 1 is and consider two alternative covariance structures for the phenotypes. The first structure results from treating 1 as a fixed vector and assuming 2 ∼ N 0,I 2 : The GLS estimate of the fixed effect 1 under V 2 is The second covariance structure stems from treating 1 as random, with the ∼ N 0,I 2 assumption assigned to all marker effects, but then 1 is estimated as if it were fixed. Here, the phenotypic covariance matrix is with the GLS estimator of , the fixed effect corresponding to 1 being . Arrays X 2 X � 2 2 = S 2 and X 1 X � 1 2 + X 2 X � 2 2 = S 12 can be referred to as "similarity" matrices, as in Listgarten et al. (2012). Gianola et al. (2016) showed that 4 and 6 are identical; the proof is presented in more detail here. To show this, we employ a model representation where the effects of one or more loci with genotypes in X 1 are regarded as possessing both fixed and random effects:

| Best linear unbiased estimation
where are the fixed effects of markers in X 1 , with Using 8, X � 1 V −1 2 can be written as: Now, using 9, and X � V −1 2 y is From 10 and 11, the GLS estimator ̂ 1 given in 4 is formed as where ̂ , also given in 6, is the GLS estimator resulting from 7.
The preceding shows that the estimator obtained under covariance structure V 2 is identical to the estimator resulting from structure V 12 . The practical implication is that the same (7) y = X 1 + 1 + X 2 2 + e = X 1 + X 1 1 + X 2 2 + e, similarity matrix, S 12 , can be used for conducting either single marker or sets of markers GWAS studies using linear regression models, provided that 2 is assumed known and kept constant (as well as the residual variance) throughout.
Note that the sampling variance-covariance matrix of the estimates of the fixed effects must be taken under V 2 , that is, Since X 1 typically has one or a few columns in GWAS, advantage can be taken of 8 for computing Var̂ 1 ), as V −1 12 is obtained only once, whereas V −1 2 changes with the set of markers included in X 1 and, therefore, in X 2 . Further, note from 9 that so V −1 12 can be used throughout. The preceding representation illustrates the over-statement of uncertainty and "loss of power" incurred by use of X � 1 V −1 12 X 1 −1 instead of 13. Yang et al. (2014) present a related discussion and recommend that markers in close linkage disequilibrium with the target marker(s) be removed when building G. Their approach requires re-estimation of variance components at every instance of testing. Our results do not apply under such strategy, as the variance-covariance structure would be expected to change over markers tested (Listgarten et al., 2012;Yang et al., 2014). An alternative could be to include the marker tested and some neighbours in close linkage disequilibrium in X 1 , provided that p 1 < n and that no rank deficiency accrues, and then use all markers when building G. A disadvantage of the alternative is the potentially strong collinearity in the set of markers in X 1 , producing unstable estimates with large sampling variances. A caveat must be mentioned. In a genomic best linear unbiased prediction setting (e.g., Legarra, 2016;Van Raden, 2008), a similarity matrix under V 12 can be constructed as where 2 g = p 2 is the "genomic variance" captured by all available markers; G 12 is known as the genomic relationship matrix (Van Raden, 2008). Accordingly, where 2 g � = p 2 2 is the genomic variance marked by the variants included in X 2 . Clearly, two maximum-likelihood analyses of variance components, one with a set markers fixed and removed from the covariance structure, and the other one with all markers contributing to similarity, will produce different estimates and interpretations of genomic variance.
Estimates of 2 must always be interpreted with great care. In a standard random effects model, the variance among effects of levels of a random factor represents a population parameter, with maximum-likelihood estimates of variance components interpreted accordingly. For example, if the random factor is the effect of a paternal half-sib family (a situation known as a "sire" model in animal breeding), the variance among sires, 2 s , say, has the same interpretation irrespective of whether the number of families is 10 or 10,000. However, in a marker-based model with n < p, the meaning and estimates of 2 depend crucially on p, as the variance component acts then as a regularization parameter. In the n < p situation, it is typically the case that estimates of 2 decrease as p increases, and the rate of decrease in 2 is critical for interpretation of estimates of marker effects when p → ∞ (Gianola, 2013;León-Novelo & Casella, 2012).

| Best linear unbiased prediction
The best linear unbiased predictor of 1 in model 7 is with ̂ =̂ 1 calculated as in 4 or 6. The previous result follows because X 1 V −1 12 X 1̂ = X 1 V −1 12 y are the GLS equations, implying that X � 1 V −1 12 y − X 1̂ = 0. Henderson's mixed model equations (MME) can be employed to verify result 16. For model 7 the MME are as follows: where = 2 e 2 . Subtracting the equations for ̂ from the equations for � ⃗ 1 gives (I ) � ⃗ 1 = 0, implying that � ⃗ 1 = 0, which verifies 16. Thus, solutions for ̂ and � ⃗ 2 from 17 are identical to those from the MME corresponding to a model where X 1 is excluded from forming a similarity matrix. Such model is The result � ⃗ 1 = 0 is easy to verify empirically (a reviewer pointed out that it is probably well known by scientists working in genetic evaluation computations) but, to our knowledge, has not been reported in the literature. A "mechanistic" explanation for 16 is that BLUP of random effects with null means depends on y through error contrasts that have a null mean vector. So, if a locus is included in a model as a fixed effect, the error contrasts used for BLUP do not possess any information on the effects at such locus. Thus, if any factor (e.g., a marker locus) is included (13) in the model both as fixed and random, the BLUP of the random effect will depend entirely on the "prior" (Bayesian view), and, as shown above, it will be identically equal to 0.

| Interdependent sets of marker effects
All elements of were assumed independent and identically distributed, but the result holds for more complex dependency structures. Markers included in X 1 may be in linkage disequilibrium with those in X 2 , and the phenomenon is encoded by correlations between columns of X = [ X 1 X 2 ]. Further, models have been suggested that include dependencies among marker effects (e.g., Gianola, Pérez-Enciso, & Toro, 2003).
Suppose that ∼ (0,B) and e ∼ (0,R) are independently distributed, that 1 ∼ (0,B 11 ), 2 ∼ (0,B 22 ) where B 11 and B 22 are non-singular and assume Cov 1 , � 2 = 0. Here, the MME equations for the situation in which 1 is treated as both fixed and random take the form Subtracting the ̂ from the � ⃗ 1 equations gives B 11 ⃗ 1 = 0, implying that the MME equations reduce to which provide GLS( ) and BLUP( 2 ) under a model where 1 is fixed and 2 is random. Note that, within block, marker effects can be correlated or uncorrelated.
Allow now for a covariance structure between the two sets of random effects, and let Cov 1 , � 2 = B 12 . The mixed model equations where 1 is treated as both fixed and random become Subtracting the from the 1 equations produces � ⃗ 1 = −B 12 � ⃗ 2 , which is not null unless B 12 = 0, contradicting the model assumption. Hence, when Cov 1 , ′ 2 ≠ 0 use of V 2 or V 12 produces distinct sets of generalized least-squares solutions, so the result for the independence case does not hold here.
If two random vectors are not independent, fixing the value of one such vector (Listgarten et al., 2012, call this "conditioning")  where X * 1 = X 1 + X 2 B 21 . Under this specification, the phenotypic covariance matrix is V * 2 = X 2 B 2.1 X � 2 + R and GLS( 1 ) should be computed as Likewise, will predict the effect of markers in X 2 on phenotypes, conditionally on the effects of markers in X 1 , that is, in the absence of genetic variation at loci in marker set 1.
Note in 22 that X * 1 1 = X 1 1 + X 2 B 21 1 so that the "total signal" on the trait contributed by 1 is decomposed into a "direct" component X 1 1 and an indirect contribution X 2 B 21 1 mediated through the covariance between 1 and 2 (B 12 ). This sort of phenomenon is well known in structural equation modelling and path analysis (Wright, 1921).

| CONCLUSION
When conducting a GWAS with either a single marker or a set of markers treated as fixed, it is unnecessary to reconstruct the phenotypic variance-covariance matrix at each specific instance of testing, provided that a BLUP model is used and that marker effects in sets regarded as both fixed and random are independent across sets but not necessarily within sets. BLUE is invariant with respect to whether the genotypes of markers being tested in GWAS are employed in the construction of a genetic similarity matrix. Likewise, BLUP of the effects of the sets treated as random is invariant as well. However, the variance-covariance matrix of the GLS estimator and the prediction error-covariance matrix must be taken with respect to the assumptions made in the model employed for analysis. If marker effects in the two subsets of genotypes have a between-set nontrivial dependency structure, the GWAS model requires modification.
The results presented in this paper, shown first by Gianola et al. (2016) just for GLS (BLUE), are seemingly unrecognized in the GWAS literature (e.g., Chen, Steibel, & Tempelman, 2017). An additional and probably useful result reported here is that represented by Equation 13: the variance of the estimate of any of the marker effects tested in y =X 1 1 + X 2 B 21 1 + + e = X 1 + X 2 B 21 1 + X 2 + e = X * 1 1 + X 2 + e,