Transcriptome-wide association study of breast cancer risk by estrogen-receptor status

Previous transcriptome-wide association studies (TWAS) have identified breast cancer risk genes by integrating data from expression quantitative loci and genome-wide association studies (GWAS), but analyses of breast cancer subtype-specific associations have been limited. In this study, we conducted a TWAS using gene expression data from GTEx and summary statistics from the hitherto largest GWAS meta-analysis conducted for breast cancer overall, and by estrogen receptor subtypes (ER+ and ER−). We further compared associations with ER+ and ER− subtypes, using a case-only TWAS approach. We also conducted multigene conditional analyses in regions with multiple TWAS associations. Two genes, STXBP4 and HIST2H2BA, were specifically associated with ER+ but not with ER− breast cancer. We further identified 30 TWAS-significant genes associated with overall breast cancer risk, including four that were not identified in previous studies. Conditional analyses identified single independent breast-cancer gene in three of six regions harboring multiple TWAS-significant genes. Our study provides new information on breast cancer genetics and biology, particularly about genomic differences between ER+ and ER− breast cancer.

Breast cancer is a heterogeneous disease consisting of several well-established subtypes. One of the most important markers of breast cancer subtypes is estrogen receptor (ER) status. ER+ and ER− tumors differ in etiology (X. R. Yang, Chang-Claude, et al., 2011), genetic predisposition (Mavaddat, Antoniou, Easton, & Garcia-Closas, 2010), and clinical behavior (Blows et al., 2010). ER− tumor occurs more often among younger women, and patients are more likely to carry BRCA1 pathogenic variants (Atchley et al., 2008;Garcia-Closas et al., 2013). ER− tumor also has worse short-term prognosis. Among the 177 GWAS-identified breast cancer-associated single nucleotide polymorphisms (SNPs), around 50 are more strongly associated with ER+ disease and 20 are more strongly associated with ER− disease Milne et al., 2017).
SNPs associated with complex traits are more likely to be in regulatory regions than in protein-coding regions, and many of these SNPs are also associated with expression levels of nearby genes (Nicolae et al., 2010). For example, breast cancer GWAS-identified variants at 6q25.1 regulate ESR1, but also coregulate other local genes such as RMND1, ARMT1, and CCDC170 (Dunning et al., 2016, p. 1). These results suggest that by integrating genotype, phenotype, and gene expression, we can identify novel trait-associated genes and understand biological mechanisms. However, due to costs and tissue availability, acquiring GWAS and gene expression data for the same set of individuals remains challenging.
A recently published approach, referred to as transcriptome-wide association study (TWAS; Gamazon et al., 2015;Gusev et al., 2016), overcomes these difficulties by using a relatively small set of reference individuals for whom both gene expression and SNPs have been measured to impute the cis-genetic component of expression for a much larger set of individuals from their GWAS summary statistics. The association between the predicted gene expression and traits can then be tested. This method has been shown to have greater power relative to GWAS; and has identified 1,196 trait-associated genes across 30 complex traits in a recently performed multitissue TWAS (Mancuso et al., 2017).
To date, three TWAS of breast cancer have been conducted (Gao, Pierce, Olopade, Im, & Huo, 2017;Hoffman et al., 2017;Wu et al., 2018). A fourth study linked expression quantitative loci (eQTL) data across multiple tissues and breast cancer GWAS results using EUGENE, a statistical approach that sums evidence for association with disease across eQTLs regardless of directionality. That study then tested EUGENE-significant genes using a TWAS statistic, which does take directionality into account (Ferreira et al., 2019). The two earliest TWAS used GWAS data from the National Cancer Institute's "Up for a Challenge" competition, which included data from 12,100 breast cancer cases (of which 3,900 had ER− disease) and 11,400 controls, as well as eQTL data from breast tissue and whole blood from the GTEx and DGN projects (Gao et al., 2017;Hoffman et al., 2017). The subsequent TWAS by Wu et al. (2018) and the EUGENE analysis by Ferreira et al. (2019) used results from a much larger GWAS conducted by the Breast Cancer Association Consortium (BCAC), which included 122,977 cases (of which 21,468 had ER− disease) and 105,974 controls. Together, these four studies have identified 59 genes whose predicted expression levels are associated with risk of overall breast cancer, and five associated with risk of ER− disease. Of these 64 genes, 30 are at loci not previously identified by breast cancer GWAS.
These previous TWAS largely focused on overall breast cancer risk. Analyses of ER− disease either were conducted using a small sample size (Gao et al., 2017) or did not scan all genes using a directional TWAS approach (Ferreira et al., 2019). Moreover, none of the previous analyses considered ER+ disease specifically or examined differences in association between predicted gene expression and ER+ versus ER− disease.
The interpretation of TWAS results is not straight-forward (Wainberg et al., 2019). Specifically, TWAS statistic by itself cannot distinguish between a mediated effect (SNPs influence breast cancer risk by changing the expression of the tested gene), pleiotropy (SNPs associated with gene expression also influence breast cancer risk through another mechanism), or colocalization (SNPs associated with gene expression are in LD with other SNPs that influence breast cancer risk through another mechanism). Previous studies have conducted limited sensitivity analyses (e.g., Wu et al., 2018 andFerreira et al., 2019 conditioned the TWAS tests on lead GWAS SNPs), but the genetic architecture at TWASidentified loci remains largely unclear.
In the current analysis, we complement previous work by conducting a TWAS for overall breast cancer and for ER+ and ER− subtypes. We also applied a case-only TWAS test to identify predicted transcript levels that were differentially associated with ER+ and ER− disease. We conducted expanded sensitivity analyses, conditioning on multiple TWASsignificant genes in a region to account for possible confounding due to LD (colocalization). We chose to focus on the expression of normal breast tissue of European ancestry women to maximize specificity and identify good targets for near-term follow-up experiments in mammary cells. One advantage of using a biologically relevant tissue is that it both increases the a priori plausibility of observed associations and increases the likelihood that genes with observed associations will be expressed and influence tumor development in cells from the target tissue. We have reproduced previous results (Ferreira et al., 2019;Wu et al., 2018) and provided evidence regarding the independent associations of multiple genes in regions containing one or more TWAS-significant genes. We also identified genes with subtypespecific associations, highlighting different biological mechanisms likely underlying the disease subtypes. array experiments (5 M or 2.5 M), with Hardy-Weinberg equilibrium p < 10 −6 or showing batch effects were excluded. The genotypes were then imputed to the Haplotype Reference Consortium reference panel (McCarthy et al., 2016) using Minimac3 for imputation and SHAPEIT for pre-phasing (Delaneau, Marchini, & Zagury, 2011;Howie, Donnelly, & Marchini, 2009). Only SNPs with high imputation quality (r 2 ≥ .8), minor allele frequency (MAF) ≥0.05, and were included in the Hap-Map Phase 2 version were used to build the expression prediction models.

| Breast cancer meta-GWAS data
The GWAS breast cancer summary-level data were mainly provided by the Breast Cancer Association Consortium (BCAC; Michailidou et al., 2017), as well as the Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA). BCAC conducted the largest breast cancer meta-GWAS to date (referred as the overall breast cancer GWAS analysis). The BCAC included 122,977 cases and 105,974 controls of European ancestry. Among these, 46,785 cases and 42,892 controls were genotyped using the Illumina iSelect genotyping array (iCOGS) on 211,155 SNPs; and 61,282 cases and 45,494 controls were genotyped using the Illumina OncoArray on 570,000 SNPs (Yovel, Franz, Stilz, & Schnitzler, 2008). The study also included data from 11 other GWAS on 14,910 cases and 17,588 controls. Genetic data for all individual participating studies were imputed to the 1000 Genomes Project Phase 3 v5 EUR reference panel Logistic regression was fitted to estimate per-allele odds ratios (ORs), adjusting for country and top principal components (PCs). Inverse variance fixed-effect meta-analysis was used to combine the genetic association for breast cancer risk across studies (Milne et al., 2017). In CIMBA, genotypes were generated by the Illumina OncoArray and imputed to the 1000 Genomes Project Phase 3 v5 EUR reference panel (Amos et al., 2016). A retrospective cohort analysis framework was adopted to estimate per-allele hazard ratios (HRs), modelling time-to-breast-cancer and stratified by country, Ashkenazi Jewish origin and birth cohort (Antoniou et al., 2005;Barnes et al., 2012). Fixed-effect meta-analysis (Willer, Li, & Abecasis, 2010) was performed to combine results across genotyping initiatives within the two consortia, assuming that the OR and HR estimates had roughly the same underlying relative risk. We restricted subsequent analyses to SNPs with an imputation r 2 > .3, and an MAF > 0.005 across all platforms were included in the analysis (approximately 11.5 M).
For the ER+ subtype, meta-GWAS summary data based on 69,501 ER+ cases and 105,974 controls (part of the overall breast cancer samples) were included and analyzed (Mavaddat et al., 2015). For the ER− subtype, meta-GWAS summary data based on 21,468 ER− cases and 105,974 controls from the BCAC were combined with 9,414 additional BRCA1 mutationpositive cases and 9,494 BRCA1 mutation-positive controls from CIMBA (Milne et al., 2017).
To distinguish different genetic signals between ER+ and ER− subtypes, we further retrieved GWAS summary-level data on a case-only GWAS, which compared ER+ patients (sample size: 23,330 in iCOGs and 44,746 in OncoArray) to ER− patients (sample size: 5,479 and 11,856;Milne et al., 2017). Logistic regression was performed to test the association between genetic variants with known ER status in the two studies separately, adjusting for substudy and top PCs for iCOGs, and patients' countries and top PCs for OncoArray. Results were then combined using a fixed-effect meta-analysis.

| Constructing expression weights
Before constructing the expression model (using GTEx data, regress gene expression on SNPs), we set several criteria to select eligible candidate genes for inclusion in the model (from the total 12,696 transcripts). We used a REML algorithm implemented in GCTA to estimate the cis (500 base-pair window surrounding transcription start site) SNP-heritability cis − ℎ g 2 for each transcript expression (Cai et al., 2014;J. Yang et al., 2010). Only genes with significant heritability (nominal p≤ .01) were included in the subsequent model construction (J. Yang, Lee, Goddard, & Visscher, 2011). The p values for null hypotheses cis − ℎ g 2 = 0 were computed using a likelihood ratio test. To account for population stratification, 20 PCs were always included as fixed effects. Consistent with previous research (ENCODE Project Consortium, 2012;J. Yang et al., 2010), we observed strong evidence for cis − ℎ g 2 on many genes (significantly non-zero for 1,355 genes).
We then constructed linear genetic predictors of gene expression for these genes. We performed five models: Bayesian Sparse Linear Mixed model, Best Linear Unbiased Predictor model, Elastic-net regression (with mixing parameter of 0.5), LASSO regression and Single best eQTL model. We used a fivefold cross-validation strategy to validate each model internally. Only genes with good model performance, corresponding to a prediction r 2 value (the square of the correlation between predicted and observed expression) of at least 1% (0.10 correlation) in at least one of the five models, were included in subsequent TWAS analyses. The weights were chosen from the best performed model out of the five models. We adopted this additional filter to improve the interpretability and specificity of results: significant TWAS results based on models with little or no predictive ability likely result from pleiotropy or colocalization, not the effect of modeled gene's expression levels. This additional filter narrowed the number of candidate genes to 901.

| Transcriptome-wide association study (TWAS) analyses
Using the functional weights of those 901 genes and summary level GWAS data, we assessed the association between predicted gene expression and breast cancer risk. We performed summary-based imputation using the ImpG-Summary algorithm (Pasaniuc et al., 2014). Briefly, let Z be a vector of standardized association statistics (z scores) of SNPs for a trait at a given cis locus, Σ s,s be the LD matrix from reference genotype data and let W = (w 1 w 2 w 3 … w j ) be the weights from the expression prediction model precompiled using the reference panel. Under the null hypothesis that none of the SNPs with w i ≠ 0 is associated with disease, the test statistic wz /(wΣ s,s w′) 1/2 follows a normal distribution with mean = 0 and variance = 1. To account for finite sample size and instances where Σ s,s was not invertible, we adjusted the diagonal of the matrix using a technique similar to ridge regression with λ = 0.1.

| Case-only TWAS
To assess whether genetically predicted expression was differentially associated with ER+ and ER− breast cancer, we applied the TWAS procedure described above to the Z statistics from the BCAC case-only analysis. Following arguments in Barfield et al. (2018) the standard TWAS statistic applied to a case-only GWAS results tests hypothesis H 0 : β 2 -β 1 = 0. This is similar to a conventional multinomial logistic model for subtype-specific breast cancer risk, with expression log odds ratio β 2 for ER− disease and β 1 for ER+ disease, under which scenario, the expression log odds ratio comparing ER− to ER+ cases is β 2 -β 1 .

| Conditional analyses
Colocalization makes the interpretation of TWAS hits challenging Wainberg et al., 2019). In addition to the main TWAS analysis, we also performed conditional and joint (COJO) multiple-SNP analysis at each TWAS significant gene location to distinguish colocalization, and to identify gene(s) independently responsible for the statistical association at each locus. COJO approximates the results of a joint conditional analysis including predicted expression levels from multiple proximal genes. The original COJO approach was designed to assess the association of individual SNPs with a phenotype; we used an extension that jointly models the associations between multiple linear combinations of individual SNPs (Gusev et al., 2016). We conducted two types of COJO: (a) For regions in which multiple associated features were identified (within 500 kb of each other, i.e., colocalization), we jointly modeled these significant TWAS genes to determine the strongest associated gene (or infer independent signals); (b) To provide information on whether the TWAS gene was responsible for the observed SNP-trait association, we also evaluated whether the GWAS-identified index SNPs remained significant after conditioning on the genes within the same region.

| Breast cancer TWAS
We selected 12,696 transcripts from the 67 GTEx breast tissue samples of Europeanancestry women that passed quality control. Based on GCTA-REML analysis, breast-tissue expression levels for 1,355 of these genes were heritable (p value for cis − ℎ g 2 < .01). We then built linear predictors for these heritable genes and estimated prediction r 2 using fivefold cross-validation. A total of 454 genes failed our cross-validation r 2 requirement (r 2 > .01), and we performed TWAS on the remaining 901 genes. We defined statistical significance for TWAS results as a marginal p < 5.5 × 10 −5 (Bonferroni correction controlling the familywise error rate at ≤0.05 for the 901 genes).
First, to compare with previous GWAS findings and to demonstrate the validity of our results, we performed TWAS analysis in overall breast cancer. We identified 30 genes in 18 cytoband regions associated with breast cancer risk (Table 1). Of these regions, 11 (containing 21 genes) were previously reported breast cancer susceptibility loci (harboring one or more GWAS-significant SNP). Five genes in the remaining seven regions were previously reported in TWAS or EUGENE analyses (LINC00886, CTD-2323K18.1, MAN2C1, NUP107, and CPNE1), while the remaining four genes in these regions were novel (MAEA, GDI2, ULK3, and HSD17B1P1). NUP107 and CPNE1 did not pass a stringent Bonferroni significance threshold in Wu et al. (2018) but passed a less-tringent false discovery rate threshold.
We also carried out analyses focusing on breast cancer subtypes. We found 20 genes associated with ER+ breast cancer, and six genes associated with ER− breast cancer (p < .05/901 = 5.5 × 10 −5 ; Table 1). In our results, all genes associated with ER− disease were also associated with ER+ disease, as well as with overall breast cancer risk. Using a more stringent threshold on the strength of the genetic predictor for expression (cross-validation r 2 > .1; 383 genes passed this threshold), we found four TWAS significant (p < .05/383 = 1.3 × 10 −4 ) genes for ER− disease, 14 genes for ER+ disease, and 19 genes for overall breast cancer (18 out of 19 genes are included in Table 1 except for one gene, CTD/3110H11.1).
As before, these gene sets were nested within each other.

| Difference of TWAS signal across breast cancer subtypes
We tested whether the imputed gene expression-breast cancer associations differed by subtype using GWAS summary statistics from a case-only analysis, which specifically compared ER+ with ER− breast cancer patients (see Section 2 for details), scanning through 901 eligible genes. Two genes, HIST2H2BA and STXBP4, showed significant associations (p < .05/901) with ER status among cases (Figure 1). These two genes were associated with ER+ breast cancer but not associated with ER− breast cancer.

| GWAS signal conditioning on TWAS gene expression
As shown in Table 2, 21 (of 30) TWAS-significant genes were located near GWAS signals.
To examine whether the observed GWAS signal within the gene region could be explained by the expression of that gene, we performed additional analyses conditioning SNP-cancer associations on the predicted expression of that particular significant TWAS gene (See Section 2 and Figure S1, for details). We found that for most regions, GWAS SNPs were no longer associated with the risk of breast cancer once conditioned on the expression of TWAS gene in the region: 15 of 21 genes had no SNPs with a conditional GWAS p value smaller than the genome-wide significant threshold (5 × 10 −8 ). Thus, there were six genes for which the GWAS SNP remained significantly associated with breast cancer risk at the genomewide threshold (5 × 10 −8 ) after conditioning on TWAS gene expression. The region containing HIST2H2BA had only one genome-wide significant SNP remaining, and the region containing ZNF155 and ZNF404 had five genome-wide significant SNPs remaining, indicating that the expression of identified genes might explain some but not all of the SNPbreast cancer associations in these regions. For CASP8 and MRPL23-AS1 regions, half of the GWAS hits remained genome-wide significant, and for the RP11-554A11.9 region, 33 out of 36 GWAS SNPs remained (Figures 2 and S1). These results suggest that the genetic association between breast cancer risk and those regions may not be mediated by transcriptional regulation of the genes on which we conditioned.
After mutually conditioning on the predicted expression of all significant genes in the same regions, ten genes remained nominally significant (p < 0.05). For some regions, only one gene remained, that is ATG10 for 5q14, L3MBTL3 for 6q22 and CRHR1-IT1 for 17q21 ( Figures 3a and S2); while for other regions, multiple genes remained significant, including CASP8 and ALS2CR12 for 2q33, ULK3 and MAN2C1 for 15q24, and ZNF404, ZNF155, and RP11-15A1.7 for 19q13 (Figures 3b and S2).

| DISCUSSION
We conducted a TWAS analysis using GTEx mammary tissue gene expression data and GWAS summary data from the largest meta-analysis for breast cancer risk. We assessed associations between overall breast cancer risk and ER+ versus ER− disease. We found 30 genes significantly associated with overall breast cancer risk, 20 genes associated with the ER+ subtype, and six genes with the ER− subtype.
These results are consistent with previous reports from TWAS or similar gene-based approaches, which used various algorithms to build gene expression models. For example, of the 30 genes that we found significantly related to overall breast cancer risk, 23 were also significant in Wu et al. (2018) with very similar test statistics (correlation = 0.96 for the z scores between our and Wu's results), and six were significant in Ferreira et al. (2019). One of the six genes we classified as significantly associated with ER− breast cancer was also found significantly associated with ER− breast cancer in Ferreira et al. (2019). Among these studies, the approach taken by Wu et al. was the most similar to ours. Only seven of the 30 genes that we identified were not identified by Wu et al. (2018), probably due to different cis-SNP selection criteria and different candidate genes selected for testing. We defined cis-SNPs using a 500 KB window around the gene boundary and included only candidate genes with a significant heritability, while Wu et al. used a 2 MB cis-SNP window and included genes with a prediction performance of at least 0.01 without heritability filtering. For genes whose expression could not be predicted well, Wu et al. built models using only SNPs located in promoter or enhancer regions. Despite these methodological differences, the two TWAS results were highly concordant. However, we did not replicate any of the findings in Hoffman et al. (2017) and Gao et al. (2017), which may reflect the smaller sample size of the breast cancer GWAS used in their analyses ( the BPC3, less than 2% of our GWAS sample). They also used different tissues to build their prediction weights: overall breast tissue (men and women combined, all ethnicities) and whole blood tissue (men and women combined, European ancestry).
Of the 30 genes associated with breast cancer risk in our study, 21 fell into known GWAS regions whereas nine were not close to any known GWAS hit and were, therefore, considered novel. Of these nine genes, five were identified and discussed in Wu et al. (2018) or Ferreira et al. (2019). The four genes uniquely identified in the present study were GDI2, HSD17B1P1, MAEA, and ULK3, several of which have been reported to play a role in breast tumorigenesis or related biological processes. For example, the expression of GDI2 has been linked with breast cancer through its contribution to enhanced epidermal growth factor receptor endocytosis (EGFR; de Graauw et al., 2014). HSD17B1P1 is a pseudo-gene related to HSD17, which participates in steroid hormone biosynthesis, metabolism, and signaling pathways potentially related to breast cancer risk (Jakubowska et al., 2010). These findings lend support to our results and suggested that further investigation into the roles of the novel genes identified for breast cancer is required.
We performed several conditional analyses not reported in previous TWAS. We examined the local GWAS signals conditioning on the expression of TWAS genes, to provide a measure of how well the expression level of identified TWAS genes explained the local GWAS signals. For many loci, these genes explained a large proportion of the local GWAS signals and were thus candidates for downstream experimental validation. We also identified candidate genes driving the statistical associations in regions with more than one TWAS gene (usually also regions with known GWAS risk loci) by jointly modeling multiple nominally significant genes. For example, previous studies have suggested that polymorphisms in CASP8 are associated with breast cancer risk (Cox et al., 2007), whereas a recent paper has shown that the most significant signal in this region is for the imputed intronic SNP rs1830298 in ALS2CR12 (telomeric to CASP8; Lin et al., 2015). Our results provide clarification on whether CASP8 or ALS2CR12 expression were more strongly associated with breast cancer risk, since both genes remained significantly associated with breast cancer risk after conditioning on the expression of the other (the conditional p value for ALS2CR12 was 3.70 × 10 −6 , whereas the conditional p value for CASP8 was .05). Eleven of the 12 GWAS hits disappeared after adjusting for the expression of ALS2CR12, while half of the GWAS hits remained after adjusting for the expression of CASP8. Therefore, we believe that ALS2CR12 SNPs have a stronger effect and are associated with breast cancer through ALS2CR12 expression, while CASP8 remains an additional independent hit, consistent with the latest fine-mapping results (Lin et al., 2015).
Because the genes found to be associated with ER− disease were also associated with ER+ disease, and these, in turn, were associated with overall breast cancer risk, it is difficult to conclude whether the differences in gene sets are due to distinct mechanisms underlying breast cancer subtypes or due to a lack of statistical power because of the smaller disease subtype sample sizes. To address this question, we further incorporated a case-only TWAS comparing ER+ versus ER− breast cancer. We identified two genes, STXBP4 and HIST2H2BA, associated with ER status, which were significantly associated only with ER+ but not ER− breast cancer. Previous studies supported the link between rs6504950 (a SNP in STXBP4) and overall breast cancer risk Warren Andersen et al., 2013). It has also been hypothesized that the risk allele for the two top breast cancer candidate SNPs, rs2787486 and rs244353, affected gene expression of STXBP4 (Darabi et al., 2016) and CD4 memory cells (Hnisz et al., 2013). One potential explanation for the association between STXBP4 and breast cancer risk is that it encodes syntaxin binding protein 4, a scaffold protein. In addition, STXBP4 functions to stabilize and degrade TP63 isoform (a member of the TP53 tumor suppressor protein family), a biologically plausible candidate cancer susceptibility gene. Similarly, SNPs rs2580520 and rs11249433 upstream of HIST2H2BA have been identified as breast cancer susceptibility alleles in a previous GWAS (Bogdanova, Helbig, & Dörk, 2013). Our results suggest that functional and pathway analyses targeting these two genes are likely to shed new light on the differences in tumorigenesis and progression mechanisms between ER+ and ER− patients.
By building gene expression linear predictors in GTEx breast tissue, our analysis offers a tissue-specific model of gene expression. The gene regulatory mechanisms in female breast tissue are arguably the most suitable for studying breast cancer. Moreover, by restricting our reference population to women of European ancestry, rather than mixing genders and ancestries, the resulting gene expression model was a better match to our breast cancer GWAS summary statistics. By using the largest GWAS meta-analysis currently available, we greatly improved the power compared with previous work by Hoffman et al. (2017) and Gao et al. (2017). Finally, by using case-only GWAS summary statistics, we provided insights into genes associated with breast cancer subtype specific risk compared with Wu et al. (2018) and Ferreira et al. (2019).
Similar to previous work by Wu et al. (2018) and Hoffman et al. (2017), our analyses focused on genetic tools trained using expression from breast tissue, chosen because of its direct relevance to breast carcinogenesis. However, given the relatively small sample size in the breast tissue eQTL panel, this choice limited both our power to detect genes with cisheritable expression and the precision of estimated genetic predictors for heritable transcripts. The genetic regulation of expression is constant across tissues for many genes, suggesting that considering other tissues with larger eQTL sample sizes or combining eQTL evidence across tissues may improve power. In addition, other tissues may be relevant for breast cancer development. For example, considering that obesity and hormonal signaling have been linked to breast cancer risk (Bertolini, 2013), gene expression in adipose tissue and brain tissue may have parallel involvement with breast cancer etiology. We are currently developing methods for cross-tissue TWAS, using sCCA (sparse canonical correlation analysis) to build features that combine gene expression values across tissues that share similar genetic regulation mechanisms, while allowing tissues with different regulation patterns to contribute to different features (Feng, Pasaniuc, Major, & Kraft, 2018).
In conclusion, we have identified new breast cancer target genes both for functional experiments and as causal gene candidates in the significant TWAS gene regions. We have also identified associations between gene expression and breast cancer risk specific to disease subtypes, where two novel genes have been found specifically associated with ER+ breast cancer risk. This analytic strategy warrants application in studies aimed at defining the genomic architecture of cancers other than breast cancer.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material. Scatter plot comparing the transcriptome-wide association study z scores in ER+ and ER−  Conditional and joint analysis (COJO) for genes near a strong breast cancer GWAS hit. (a) COJO results adjusting for predicted expression of ALS2CR12. After conditioning on ALS2CR12, almost all original significant GWAS signals (grey dots) disappear (blue dots). (b) COJO results adjusting for the predicted expression of CASP8. After conditioning on CASP8, some of the original GWAS significant signals (grey dots) remains (blue dots)  Abbreviations: COJO, conditional and joint; GWAS, genome-wide association studies; SNP, single nucleotide polymorphisms.
a Proportion of marginally significant SNPs that are not significant in conditional analyses. Analysis was performed using GWAS summary statistics of ER+ subtypes. The difference between marginal SNP tests for association (GWAS p values) and the SNP p values conditional on significant TWAS genes provides some evidence regarding the independence of the TWAS and single-SNP association signals.
The number and proportion of SNPs that are genome-wide significant before and after conditioning on a TWAS-significant gene summarizes the degree single-SNP associations are dependent on (or independent of) the TWAS association.  a Bolded genes remain significant in conditional analyses. Analysis was performed using GWAS summary statistics of ER+ subtypes. Our primary goal in these analyses is to establish whether any of the marginally significant TWAS genes remains significant after conditioning for the most significant gene in the region; sincesince all of the regions with multiple significant genes contain 2-3 significant genes, using a conditional p value threshold of .05 is a reasonable threshold for identifying independent signals.