Phenotypic variance partitioning by transcriptomic gene expression levels and environmental variables for anthropometric traits using GTEx data

Phenotypic variation in human is the results of genetic variation and environmental influences. Understanding the contribution of genetic and environmental components to phenotypic variation is of great interest. The variance explained by genome‐wide single nucleotide polymorphisms (SNPs) typically represents a small proportion of the phenotypic variance for complex traits, which may be because the genome is only a part of the whole biological process to shape the phenotypes. In this study, we propose to partition the phenotypic variance of three anthropometric traits, using gene expression levels and environmental variables from GTEx data. We use the gene expression of four tissues that are deemed relevant for the anthropometric traits (two adipose tissues, skeletal muscle tissue and blood tissue). Additionally, we estimate the transcriptome–environment correlation that partly underlies the phenotypes of the anthropometric traits. We found that genetic factors play a significant role in determining body mass index (BMI), with the proportion of phenotypic variance explained by gene expression levels of visceral adipose tissue being 0.68 (SE = 0.06). However, we also observed that environmental factors such as age, sex, ancestry, smoking status, and drinking alcohol status have a small but significant impact (0.005, SE = 0.001). Interestingly, we found a significant negative correlation between the transcriptomic and environmental effects on BMI (transcriptome–environment correlation = −0.54, SE = 0.14), suggesting an antagonistic relationship. This implies that individuals with lower genetic profiles may be more susceptible to the effects of environmental factors on BMI, while those with higher genetic profiles may be less susceptible. We also show that the estimated transcriptomic variance varies across tissues, e.g., the gene expression levels of whole blood tissue and environmental variables explain a lower proportion of BMI phenotypic variance (0.16, SE = 0.05 and 0.04, SE = 0.004 respectively). We observed a significant positive correlation between transcriptomic and environmental effects (1.21, SE = 0.23) for this tissue. In conclusion, phenotypic variance partitioning can be done using gene expression and environmental data even with a small sample size (n = 838 from GTEx data), which can provide insights into how the transcriptomic and environmental effects contribute to the phenotypes of the anthropometric traits.

positive correlation between transcriptomic and environmental effects (1.21, SE = 0.23) for this tissue.In conclusion, phenotypic variance partitioning can be done using gene expression and environmental data even with a small sample size (n = 838 from GTEx data), which can provide insights into how the transcriptomic and environmental effects contribute to the phenotypes of the anthropometric traits.

| INTRODUCTION
Most human complex traits are determined by many genes and environmental factors.Understanding the variation in the human genome is crucial to elucidating genetic contributions to complex traits, including human health and diseases.Genome-wide associations studies (GWAS) have been successful to identify single nucleotide polymorphisms (SNPs) associated with causal variants underlaying many complex traits and diseases (Pasaniuc & Price, 2017;Sud et al., 2017).However, the identified genome-wide significant SNPs explain only a small fraction of the genetic variation of complex traits (Manolio et al., 2009;Speliotes et al., 2010).Even with whole-genome approaches that use all available SNPs in the genome, there remains a substantial gap between the variance captured by the genomic information and the total phenotypic variance (Young, 2019).This gap may arise because the genome is only a part of what determines the phenotypes of a complex trait, to which transcriptomic and environmental factors also contribute significantly.
Another technical issue when analysing genome-wide SNPs is that a large sample size is necessarily required.For example, for the analysis of SNP-based heritability estimation, the sample size should be more than 2000 individuals (Naomi R Wray et al., 2014;Visscher et al., 2014), which results in a standard error of 0.15 or less for the estimation (Visscher et al., 2014).Currently, the sample size of Genome-Tissue Expression (GTEx) project is 838 individuals, for which genome-based analyses cannot be applied, limiting the phenotypic variance partitioning for complex traits.This problem is mostly because the number of genomic variants (SNPs) is very large even when considering the effective number of independent SNPs only (typically >50,000 when considering random sample in a population; Visscher et al., 2014).However, when considering transcriptomic gene expression levels or environmental variables, the effective number of independent variables is a lot lower than genomic variants, which allows us to carry out phenotypic variance partitioning by transcriptomic and environmental data for anthropometric traits available from GTEx data.It is noted that there are few studies working on phenotypic analyses of GTEx data for complex traits.
Phenotypic variance partitioning for complex traits, such as SNP-based heritability estimation, has been mostly done by linear mixed models (LMMs) (Runcie & Crawford, 2019), which can be extended to phenotypic variance partitioning by transcriptomic or environmental data.In the partitioning analysis to estimate the proportion of phenotypic variance explained by transcriptomic or/and environmental data, transcriptomic and environmental effects are treated as random effects on the phenotypes, that is, LMM is a random-effects model.One of the key assumptions of LMMs is that random effects are independent without any correlation between them, which can be violated especially when considering multi-omics and environmental data.For example, gene expression levels can be varied, depending on environmental conditions, which can generate non-negligible association between transcriptomic and environmental effects on the phenotypes.If such an association or correlation between two random effects is not properly modelled, heritability estimates can be biased (Zhou et al., 2020).An alternative to avoid bias in the variance estimation is to explicitly estimate the covariance between random effects, using CORE REML (Zhou et al., 2020).
Here, we estimate the proportion of the phenotypic variance explained by transcriptomic and environmental data for anthropometric traits, using the GTEx data (Aguet et al., 2017).We use three anthropometric traits (height, weight, and BMI) that are highly polygenic and have been extensively studied as model traits (Macgregor et al., 2006;Silventoinen et al., 2008;Stunkard et al., 1986;Turula et al., 1990).To estimate the phenotypic variance explained by transcriptomic data, we select the gene expression data of four tissues that may be relevant to the anthropometric traits (subcutaneous adipose tissue, visceral adipose tissue, skeletal muscle tissue and whole blood samples).Additionally, we use environmental variables available for each individual from the GTEx project (age, sex, ancestry, drinking status and smoking status).From our analyses, we partition the phenotypic variance of anthropometric traits into two variance components: One due to transcriptomic gene expression levels and the other due to environmental variables.To obtain an unbiased estimate of phenotypic variance, we use CORE REML to account for covariance between transcriptomic and environmental effects on the phenotypes of anthropometric traits.This approach may be an alternative method to estimate variance components for complex traits using multi-omics data, even with a relatively small sample size.

| Environmental data
We use the environmental variables from the GTEx phenotypic information table (pht002742.v8.p2.c1).From the literature, we selected potential environmental factors that were suggested to have a significant impact on anthropometric traits (Chhabra & Chhabra, 2011;Flegal et al., 2016).In addition, we used linear regression analysis to identify other potentially relevant environmental factors.We first performed a univariate analysis to identify any environmental factors that were significantly associated with the anthropometric traits.We then used stepwise regression analysis to select the most significant environmental factors to include in the final model.The environmental variables selected for analysis were age, sex, ancestry, drinking status and smoking status (Table S2).We identified 920 individuals with available data for all five variables.

| Transcriptomic data
We used the gene expression level data available from the GTEx project version 8, which consists of a list of 56,200 genes for 17,384 samples measured in transcript per million (TPM).We extracted the expression levels for each tissue according to their sample attribute.Quality control analysis was performed to select genes with TPM ≥ 0.1 and genes that were expressed in at least 20% of all samples (Ferraro et al., 2020).We only considered the gene expression levels of four tissues that are related to anthropometric traits to be used in this study: subcutaneous adipose tissue (SAT: 30,109 expressed genes), visceral adipose tissue (VAT: 30,090 expressed genes), skeletal muscle tissue (SkM: 24,810 expressed genes) and whole blood tissue (WB: 24,051 expressed genes).Then, we selected individuals with available information for age, sex, ancestry, drinking and smoking status (SAT: 631, VAT: 582, SkM: 771 and WB: 713).The selection (sub-setting) of the individuals by tissues did not significantly change the percentage in the phenotype or environmental variables comparing with all sample individuals (920) (Table S1).

| Variance estimation
To estimate the proportion of phenotypic variance attributed to the effect of transcriptomic and environmental variables, we used linear mixed models (LMM) for each selected trait.The m x n 1 matrix of gene expression levels (T) was used to make the transcriptomic relationship matrix as A TT = /n t 1 for each tissue, where m is number of individuals and n 1 is the number of gene expression levels.We also estimated the environmental relationship matrix (B) and removed collinearity between variables through orthogonal transformation (Morzuch & Ruark, 1991).First, the m x n 2 matrix of environmental variables (E), where n 2 is the number of environmental variables, was columnstandardised (mean = 0, variance = 1).Then, we transformed E to a column-orthogonal matrix, Ω, multiplying E by V (Ω = EV) where V can be obtained from the eigenvalue decomposition of the covariance matrix of the product of the variables E t E. The environmental relationship matrix was constructed as B ΩΩ = /n t 2 .Variance estimations were performed using REML (Lee & van der Werf, 2006) implemented in MTG2 (v2.21) (Zhou et al., 2020) by the following models: where y is a m vector of trait phenotypes (height, weight and BMI) without any adjustment for fixed effects, t is the transcriptomic effects (due to gene expression levels of SAT, VAT, SkM and WB), e is the environmental effects (due to age, sex, ancestry, drinking status and smoking status), and ε is the residual effects.Additionally, we estimated the covariance between transcriptomic and environmental effects (c t e , ) by fitting the Cholesky decomposition of kernel matrix of A ( A ), using CORE REML (Zhou et al., 2020), in model 4. We also estimated the interaction between transcriptomic and environmental variables (i t e x ) derived from the Hadamard product of T and E, following (Zhou & Lee, 2021).We compared the single model (model 1 or 2) against the joint model (model 3), using the likelihood ratio test (LRT) with one degree of freedom.To determine if the covariance between transcriptomic and environmental effects (c t e , ) have significant effects on the phenotypic variance, we compared the model fitted by CORE REML (model 4) against the model fitted by REML (model 3), using the LRT with one degree of freedom.We also assessed if the interaction between transcriptomic and environmental effects (i t e x ) was significant (model 3 vs.model 5).The main variance components of the phenotypic variance partitioning are the transcriptomic (σ t 2 ), environmental (σ e 2 ) and residual variances (σ ε 2 ).

| Genotype data
Genotype data from GTEx consist of 838 donors and a total of 46,569,704 SNPs.We performed a stringent quality control using PLINK (v1.9) (Chang et al., 2015) to exclude unreliable SNPs with the following settings: Hardy-Weinberg equilibrium p-value < 1 × 10 −6 , minor allele frequency < 0.01, and missing genotype > 0.01.We filter individuals who had a genotype missing rate > 0.01.After the filtering process, we selected 838 individuals and 9,384,219 SNPs.Before estimating heritability using genotype data, we adjusted the phenotype information for cofounders such as age, sex, and population structure (10 first principal components of the genomic relationship matrix) using linear regression in R (v3.6.0).Heritability of the phenotypic traits (height, weight and BMI) were estimated using REML implemented in MTG2 (v2.21) (Zhou et al., 2020).The heritability estimation for all traits was <0 in all cases (Table S3).
Due to the small sample size, these results are not unexpected as it has been shown that it is necessary > 2000 individuals to accurately estimate the genetic variance of complex traits (Visscher et al., 2014;Wray et al., 2014).For that reason, we did not include genotype information in our models.

| The proportion of phenotypic variance explained by transcriptomic and environmental effects
We first applied model (1) to fit the transcriptomic gene expression data from SAT, VAT, SkM and WB tissues.
For weight, we found that a larger proportion of phenotypic variance was explained by the expression levels of adipose tissues, compared with that by other tissues (σ t 2 = 0.83 (0.06), 0.85 (0.06), 0.79 (0.05) and 0.51 (0.08) for SAT, VAT, SkM and WB) (Figure 1b and Table S4).For BMI, there is more variation across tissues.Interestingly, a very low phenotypic variance was observed for the WB tissue (σ t 2 = 0.70 (0.08), 0.62 (0.08), 0.58 (0.07) and 0.17 (0.07) for SAT, VAT, SkM and WB).We also applied model (2) that fitted the selected environmental variables.The estimated proportion of phenotypic variance explained by the environmental variables (σ e 2 ) was 0.47, 0.21 and 0.01 for height, weight, and BMI, respectively when using all participants (838 individuals) (Figure 1 and Table S5).When we consider tissue-specific sample sizes (n = 631, 582, 771 and 713 for SAT, VAT, SkM and WB), the estimated σ e 2 slightly changes depending on the number of individuals for each trait per tissue (Supplementary Table S4).
Then, we estimated the phenotypic variance using transcriptomic and environmental data simultaneously (model 3) and compared the joint model against a single model (model 1 vs. model 3 and model 2 vs. model 3) using the likelihood ratio test (Figure 1 and Table S4).The goodness of fit of model 3 was significantly better than the other models when using height phenotypes in SAT and SkM, and weight phenotypes in all tissues.Model 1, which fitted the transcriptomic data only, was the best among the three models (models 1-3) when using BMI phenotypes in all tissues.Model 2, which fitted the environmental data only, was the best when using height phenotypes in VAT and WB (Figure 1).The best model is indicated with a star sign (* on the bar) in Figure 1.
The proportions of phenotypic variance explained by the transcriptomic and environmental data are distributed differently across traits (Figure 1).For height, the estimated environmental variance is much higher than the estimated transcriptomic variance (Figure 1a).It is noted that estimated transcriptomic variance is remarkably high in the SkM, compared to the adipose tissues and WB (Figure 1a).For weight, both transcriptomic and environmental variance estimates are substantial, although the transcriptomic effect is much higher.In addition, model (3) performed better in all tissues (Figure 1b).Interestingly, the estimated transcriptomic variance is relatively low in the WB tissue, compared with other tissues (Figure 1b).For BMI, the phenotypic variance is mostly explained by transcriptomic data in all four tissues and the model ( 1) is the best in all cases (Figure 1c).Again, the estimated transcriptomic variance is relatively low in the WB tissue, compared to other tissues (Figure 1c).

| Covariance between transcriptomic and environmental effects
The conventional LMM assumes that random effects are uncorrelated (Henderson, 1953;Singer et al., 2017).However, this assumption may not hold in certain situations, such as when studying the relationship between gene expression levels and environmental factors underlying a phenotype, as these factors can interact and influence each other (López-Maury et al., 2008).Therefore, to obtain accurate estimations, it is necessarily required to assess if the correlation between transcriptomic and environmental effects is significant.We estimated the covariance between transcriptomic and environmental effect, using CORE REML (model 4) and assessed if the goodness of fit improved significantly, compared to the conventional REML (model 3).We compared the models ( 3) and ( 4), using the likelihood ratio test with one degree of freedom (Figure 2 and Table S6).Model (4), CORE REML, was significantly better than model ( 3) for all traits when using the gene expression levels of WB tissue (Figure 2).Modeel (4) also significantly improved, compared to model (3), for height and BMI when using VAT (Figure 2).
The improvement of the model was mostly because of correctly accounting for the covariance between transcriptomic and environmental data (Figure S1).We estimated a significant positive correlation between the transcriptomic effects of WB and environmental effects for BMI and weight (Figure S1).In contrast, we estimated a significant negative correlation between the transcriptomic effects of VAT and the environmental effects for height and BMI (Figure S1a).This result suggests that individuals with lower genetic profiles may be more susceptible to the effects of environmental factors on BMI, although those with higher genetic profiles may be less susceptible.These findings have potential implications for understanding the complex interplay between genetic and environmental factors in BMI regulation.For example, they may suggest that interventions targeted at modifying environmental factors, such as smoking and drinking alcohol, may be more effective in certain individuals, depending on their genetic profiles.
It is important to notice that the estimates are the proportion of the variance that is explained by the specific component (transcriptomics and environment) after accounting for other factors and it is not the actual total proportion of variance.Therefore, it is possible to observe that the variance explained by one component can be larger than the total variance, as is observed for height using the gene expression of SkM (Figure 2).This can happen because there is negative correlation between the transcriptomic effects of SkM and environmental effects for height (Figure S1a).However, CORE REML (model 4) estimation is not significantly better than REML (model 3) for height using SkM tissue (Figure 2).This is because the uncertainty of estimated variance increased substantially with the model (4) (see 95% CI in Figure S1b,c).In contrast, when using the WB tissue, model 4 was significantly better than model (3) because the uncertainty of estimated variance decreased substantially with model (4) although there was no significant transcriptome-environment correlation (Figure S1b,c).
When considering all models (1-4), the best model identified from the likelihood ratio with an appropriate degree of freedom for each trait per tissue is shown in Table S7, which indicates that the CORE REML (model 4) is still the best when using height and BMI phenotypes with VAT and WB, and weight phenotypes with WB (i.e., consistent with Figure 2). .For model comparison (REML vs. CORE REML), we used the likelihood ratio test.p-values < 0.05 mean that the model fitted with CORE REML is significantly better than REML.The total variance estimated for height using SkM is higher than 1 due to negative correlation between transcriptomic and environmental data.BMI, body mass index; SkM, skeletal muscle tissue.

| Interaction between transcriptomic and environmental effects
We also assessed if there is significant interaction between transcriptomic and environmental effects, using model (5) (Table S8).From the analysis, we did not observe any significant interaction although there is a marginal significance when considering weight and BMI phenotypes with SAT tissue.However, these interaction signals are not significant after multiple testing corrections.
Transcriptomic data have been widely used to estimate the heritability of gene expression levels, as demonstrated in previous studies that have utilised GTEx data (Emilsson et al., 2008;Price et al., 2011;Sun et al., 2019;Wright et al., 2014).However, few studies have used transcriptomic data to estimate the proportion of phenotypic variance in complex traits.In our study, we used mixed models to estimate the contribution of gene expression levels and environmental variables to the phenotypic variance of three anthropometric traits: height, weight, and BMI.These traits have been extensively investigated and known to be determined by many genes and environmental factors.The heritabilities of human height, weight and BMI were estimated to be 0.8-0.9 from twin studies (Macgregor et al., 2006;Silventoinen et al., 2008;Stunkard et al., 1986;Turula et al., 1990), and 0.3-0.6 from studies with unrelated population samples using genome-wide common SNPs (Ge et al., 2017;Llewellyn et al., 2013;Yang et al., 2010).Here, we showed that the proportion of phenotypic variance from these three anthropometric traits can be estimated using gene expression and environmental data in a tissuespecific analysis for each trait.
For height, the environmental factors explain most of the total phenotypic variance, which may be due to the strong effects of sex variable on height phenotypes (Silventoinen et al., 2001).In contrast, the transcriptomic data have little effect on height, except that the estimated transcriptomic variance is 17% of the total phenotypic variance when using the SkM tissue.This result shows that the gene expression of SkM might have some effect in maintenance of the trait, as we know that in adults height slightly decrease with age (Ge et al., 2017).For weight, the proportion of phenotypic variance explained by the transcriptomic and environmental variance was estimated as 70%-73% for all tissues except WB (37%).We observed that the transcriptomic variance (0.49-0.61) was much higher than the environmental variance (0.10-0.21).For BMI, interestingly, the proportion of phenotypic variance was mostly explained by transcriptomic gene expression levels for all four tissues.Estimated transcriptomic and environmental variances are not uniformly distributed across different tissues for each trait.For estimations using the gene expression of WB tissue, the variance estimates are lower than the other tissues for all traits.
There could be several reasons why the phenotypic variance explained by gene expression levels in WB was much lower than those in other tissues.First, WB is a heterogeneous tissue that consists of multiple cell types, which may have different gene expression patterns and different contributions to the phenotype of interest (Pinhel et al., 2017).Therefore, gene expression measured in WB may not fully capture the expression patterns that are relevant to the phenotype of interest, leading to a lower proportion of phenotypic variance explained.Second, gene expression levels in other tissues may have a more direct and specific relationship with the phenotype of interest compared with WB.For example, adipose tissue may have a more direct relationship with BMI/obesity compared to other tissues, leading to the higher variance explained by transcriptomic effects in that tissue (Hao et al., 2018).Finally, technical factors such as differences in RNA extraction and sequencing protocols, batch effects, and sample size could also contribute to the observed differences in variance explained by transcriptomic effects across tissues.
For height, we observed that the transcriptomic variance estimated in model 1 significantly decreased when we fit the transcriptomic and environmental data jointly in model 3.This indicates that there is a considerable correlation between transcriptomic and environmental effects on height, and the environmental data better explain the phenotypic variance of height.Indeed, we found a significant correlation between transcriptomic and environmental effects on height when considering VAT and WB tissues.For BMI, the phenotypic variance is mostly determined by transcriptomic effects although there is a significant correlation between transcriptomic and environmental effects when considering VAT and WB tissues.Such transcriptomeenvironment correlations would be expected because gene expression levels can be changed, depending on environmental conditions.However, the significance of transcriptome-environment correlation depends on tissues for each trait and several estimates are not significantly different from zero.
Our study has several limitations.The SNP-based heritability estimates for all three traits were not estimable using the genome-wide common SNPs because of a small sample size in this GTEx data.It has been JULLIAN FABRES and LEE   | 471     reported that a reliable SNP-based heritability estimate requires a larger sample size, for example, more than 2000 individuals (Visscher et al., 2014;Wray et al., 2014).With a large genotype sample size, we could have jointly analysed genomic, transcriptomic, and environmental data, which better dissect the biological architecture of the complex traits.Additionally, we acknowledge that our sample size may not have been large enough to fit each environmental variable as a random effect in the variance component analysis.Although performing variance component analysis separately for each environmental variable can provide information on the specific effects of each variable on the trait, our limited sample size made this impractical.Another limitation of our study is the limited number of environmental variables that are available to us.Future studies could benefit from a more comprehensive set of environmental variables, such as exposome data.It is worth noting that although SNP-heritability analysis is based on the assumption that genetic variation (i.e., SNPs) is the direct cause of phenotypic variation, transcriptomic gene expression is an intermediate phenotype that can reflect both genetic and environmental effects.This does not necessarily imply causality.Therefore, the interpretation of the results from our study should be made with caution.
Despite the limitations of our study, our findings offer valuable insights into the complex interplay between genetic and environmental factors that determine anthropometric traits.By leveraging transcriptomic and environmental data, we were able to estimate the proportion of phenotypic variance attributable to these factors, even with the small sample size of the GTEx data set.Our approach can be extended to incorporate other omics data and enable more comprehensive dissection of phenotypic variation in diverse traits.These results can inform the development of interventions aimed at improving health outcomes.Overall, our study underscores the importance of considering both genetic and environmental factors in understanding the biology of complex traits and highlights the potential of multiomics analyses to advance our knowledge in this field.data, like gene expression, is available on the GTEx portal: https://gtexportal.org.

F 2 .
I G U R E 1 Estimated proportion of phenotypic variance explained by transcriptomic (σ t 2 ) and environmental data (σ e 2 ) for height, weight, and BMI.As phenotypes were standardised, the total phenotypic variance is σ = 1 y The number of gene expressed (TMP > 0.1) for SAT (613 individuals) is 30,090 genes, VAT (582 individuals) is 30,090 genes, SkM (771 individuals) is 24,810 genes and WB (713 individuals) is 24,051 genes.The environmental variables are age, sex, ancestry, drinking status and smoking status.We used likelihood ratio test to compare the goodness of fit of model 1 (m1: y = t + ε) and model 2 (m2: y = e + ε) against model 3 (m3: y = t + e + ε) with one degree of freedom.Significant p-values (<0.05) means that model 3 is significantly better than model 1or model 2.An asterisk (*) indicates the best model for the tissue-trait combination.BMI, body mass index; SAT, subcutaneous adipose tissue; SkM, skeletal muscle tissue; VAT, visceral adipose tissue; WB, whole blood tissue.

F
I G U R E 2 Estimation of the proportion of phenotypic variance explain by transcriptomic (σ t 2 ) and environmental data (σ e 2 ) for height, weight and BMI estimated by REML and CORE REML for each tissue.As phenotypes were standardised, the total phenotypic variance is σ = 1 y 2