Association between breast cancer risk and disease aggressiveness: Characterizing underlying gene expression patterns

Abstract The association between breast cancer risk defined by the Tyrer‐Cuzick score (TC) and disease prognosis is not well established. Here, we investigated the relationship between 5‐year TC and disease aggressiveness and then characterized underlying molecular processes. In a case‐only study (n = 2474), we studied the association of TC with molecular subtypes and tumor characteristics. In a subset of patients (n = 672), we correlated gene expression to TC and computed a low‐risk TC gene expression (TC‐Gx) profile, that is, a profile expected to be negatively associated with risk, which we used to test for association with disease aggressiveness. We performed enrichment analysis to pinpoint molecular processes likely to be altered in low‐risk tumors. A higher TC was found to be inversely associated with more aggressive surrogate molecular subtypes and tumor characteristics (P < .05) including Ki‐67 proliferation status (P < 5 × 10−07). Our low‐risk TC‐Gx, based on the weighted sum of 37 expression values of genes strongly correlated with TC, was associated with basal‐like (P < 5 × 10−13), HER2‐enriched subtype (P < 5 × 10−07) and worse 10‐year breast cancer‐specific survival (log‐rank P < 5 × 10−04). Associations between low‐risk TC‐Gx and more aggressive molecular subtypes were replicated in an independent cohort from The Cancer Genome Atlas database (n = 975). Gene expression that correlated with low TC was enriched in proliferation and oncogenic signaling pathways (FDR < 0.05). Moreover, higher proliferation was a key factor explaining the association with worse survival. Women who developed breast cancer despite having a low risk were diagnosed with more aggressive tumors and had a worse prognosis, most likely driven by increased proliferation. Our findings imply the need to establish risk factors associated with more aggressive breast cancer subtypes.

gram of high quality total RNA (RIN > 8, assessed using bioanalyzer) and rRNA was depleted using RiboZero (Illumina, US). For each sample, stranded RNA-seq libraries were constructed using a TruSeq Stranded Total RNA Library Prep Kit (Illumina, US). Then 2 × 101 paired-end sequencing libraries, at a median of 33 million reads per library, were generated on an Illumina HiSeq 2500 (Illumina, US) at the Science for Life Laboratory (Stockholm, Sweden). From a total of 307, 296 samples from unique individuals were included in this study after quality control and removing potential outliers. For the validation dataset (SCAN-B), high quality RNA samples (RIN/RQS > 7), were used to construct RNA-seq libraries from poly-A enriched RNA using the SCAN-B dUTP library protocol (see full description in Ref. (2)). Paired-end libraries of 50 bp length were sequenced on an Illumina HiSeq2000 instrument. After exclusion, 376 out of 428 samples were available for our validation analysis. These tumor samples had a minimum of more than 5 million successfully aligned paired-reads and less than 60% duplication level estimated for the aligned reads.
Gene expression quantification was performed in two steps: first by computing transcriptlevel estimates and then by gene-level aggregation. Transcript-level expression estimates were quantified for each sample using Salmon(3) version 0.9.1 (quasi-mapping based mode). Index was built on the reference transcriptome GRCh38 downloaded from ENSEMBL database (ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/cdna/), using the following parameters: --type quasi -k 19. The tximportData package as implemented in R version 3.5.0 was used to import the files containing the transcript level expression values obtained from Salmon. Genelevel aggregation was performed using the tximport package(4) v1.8. Gene-to-transcript IDs were extracted from Homo_sapiens.GRCh38.92.gtf annotation file using rtracklayer and tidyverse packages to be fed into tximport function. Only gene-coding mRNA were selected (n=19,035), since the SCAN-B library preparation protocol enriched mRNA by poly-A tail selection.

Gene Set Enrichment Analysis (GSEA)
Input for the GSEA analysis consisted on gene-level statistics obtained from separate regression analysis in our discovery and validation dataset. P-values were combined using the Fisher method as implemented in the metaRNAseq R package.(5) Genes with low-gene count and conflicting differential expression (i.e. with opposite effect size direction) were excluded. In total 7,881 genes (3,679 with negative, and 4,202 with positive beta estimate) were included in the analysis. A total of 1,948 genes (801 down-regulated and 1,147 up-regulated) were mapped onto the MSigDB hallmark gene set collection.
We computed gene set-level statistics using six different GSEA methods that make use of all available gene-level statistics (P-value) obtained from the regression analysis, while considering the direction for association (i.e. negative or positive beta estimate): Wilcoxon ranksum test, tail strength, mean, median, sum, reporter features, and Stouffer's method. For each of the methods, five gene set P-values were obtained under three directionality classes: nondirectional, distinct-directional, and mixed-directional. In the former two, a P-value on each direction of association is computed. Under the 'distinct' directionality class, differential expression in opposite direction is allowed to counteract with each other, while under the 'mixed' directionality class, differential expression by sub-set of genes associated on either direction is tested on each gene set.
Gene set P-values were estimated based on the empirical background distribution generated through 10,000 permutations of gene labels. We controlled the FDR to be lower than 0.05. A consensus 'mean' score was computed for each possible directionality class in order to rank most significant gene sets, based on the adjusted P-values from the seven different GSA methods. To summarize statistically significant enrichment under each class, the median adjusted P-value was reported. List of 37 genes found to be strongly correlated with 5-year TC score, as per 1-percent decrease on the TC scale. Genes were found associated with TC at FDR < 0.05 and beta coefficient (β) > ±log2(1.5) in the validation dataset, which are shown sorted by the β effect size. Validation (SCAN-B) effect estimates found in the same direction of association are marked in boldface type. *, test statistics from an independent differential expression analysis in the validation dataset. ᶲ, genes replicated in the validation dataset at P-value < 1×10-5 and β > ±log2(1.5).