Pan‐cancer RNA‐seq data stratifies tumours by some hallmarks of cancer

Abstract Numerous genetic and epigenetic alterations cause functional changes in cell biology underlying cancer. These hallmark functional changes constitute potentially tissue‐independent anticancer therapeutic targets. We hypothesized that RNA‐Seq identifies gene expression changes that underly those hallmarks, and thereby defines relevant therapeutic targets. To test this hypothesis, we analysed the publicly available TCGA‐TARGET‐GTEx gene expression data set from the University of California Santa CruzToil recompute project using WGCNA to delineate co‐correlated ‘modules’ from tumour gene expression profiles and functional enrichment of these modules to hierarchically cluster tumours. This stratified tumours according to T cell activation, NK‐cell activation, complement cascade, ATM, Rb, angiogenic, MAPK, ECM receptor and histone modification signalling. These correspond to the cancer hallmarks of avoiding immune destruction, tumour‐promoting inflammation, evading growth suppressors, inducing angiogenesis, sustained proliferative signalling, activating invasion and metastasis, and genome instability and mutation. This approach did not detect pathways corresponding to the cancer enabling replicative immortality, resisting cell death or deregulating cellular energetics hallmarks. We conclude that RNA‐Seq stratifies tumours along some, but not all, hallmarks of cancer and, therefore, could be used in conjunction with other analyses collectively to inform precision therapy.


| INTRODUC TI ON
Human cancers are classified based on anatomical, histopathological and molecular features. Hanahan and Weinberg posited cancer unifying changes in cell biology (hallmarks): resisting cell death, sustaining proliferative signalling, evading growth suppressors, activating invasion and metastasis, enabling replicative immortality and inducing angiogenesis. 1 Now included are two 'enabling' hallmarks (genome instability and mutation, and tumour-promoting inflammation) and two 'emerging' hallmarks (deregulating cellular energetics and avoiding immune detection). [1][2][3] Aberrant signalling axes identify specific hallmarks in a given cancer. 1 Treatments targeting each hallmark promise individualized therapies. 1,2,4 Individualized cancer therapy requires identifying contributors to these hallmarks, that is targetable biochemical pathways and genetic interactions. Presently, this includes histopathological,
DNA, cytogenic and proteomic analyses. [5][6][7] Gene expression analysis is currently used case-by-case to discover targets 8 ; there is no consensus framework for using such analyses across all cancer types in clinical care. Although previous studies of specific primary site cancers such as pancreatic 9 and breast cancers 10 identified transcriptomic subgroups, investigation of transcriptomic subgroups across all cancers is not well studied. Using The Cancer Genome Atlas (TGCA) 11 and FANTOM5 12 transcriptomic data sets, Kaczowski et al 13 looked for primary site-independent cancer subgroups by grouping cancers according to differential expression of individual transcripts initially in cultured cells and secondarily in tumours. This analysis found that cancers could be classified into molecular subtypes defined by expression of transcripts involved in DNA and biopolymer metabolism, tumour suppression, oxidoreductase activity and developmental or cell cycle signalling.
The work of Kaczowski et al led us to hypothesize that cancer-associated mutations and epimutations alter expression of co-correlated groups of genes independent of cancer type and are detectable primarily in cancer tissue by RNA-Seq. We reasoned that assessing for co-correlated groups of genes is arguably more sensitive to changes in expression of gene networks underlying biological processes than is identifying common processes among individual transcripts. To test this, we analysed gene expression profiles from the University of California, Santa Cruz (UCSC) Toil recompute of the TCGA, Therapeutically Applicable Research to Generate Effective Treatments (TARGET) 14 and Genotype-Tissue Expression (GTEx) 15 data sets available on the Xena Platform. 16 We found consistent stratification of cancers by signatures of T cell activation, NK-cell activation, complement cascade, ATM, Rb, angiogenic, MAPK, ECM receptor and histone modification signalling.

| Data sources and processing
The RNA-Seq data and associated metadata files were downloaded from the UCSC Xena Data Browser 16 (2016-09-03 version, TcgaTargetGtex_rsem_gene_tpm (TTG) data set) ( Table 1).These contained transcript-non-specific expression data for all coding genes as well as for long non-coding RNA (lncRNA), pseudogenes and other non-coding transcripts with unique Ensembl ENSG identifiers. 17 The TTG data set quantifies gene expression as log 2 TPM + 1 and were converted to TPM + 1 for this analysis. The BioMart 18 database was used to extract genes having ENSG identifiers annotated with the protein_coding biotype. This eliminated 40,826 (67.5%) non-coding entries leaving 19,672 protein-coding entries (TTG-C data set, Figure 1A, Table 1). The TTG-C data set was then reduced to cancers that had corresponding normal samples and vice versa to create the T-C-PS and N-C-PS data sets, respectively (Table 1). Primary sites of uncertain histological equivalence between tumour and normal samples (eg blood cancers) or with sample numbers below 20 in either cancer or normal data sets were excluded.

| Data normalization
Because non-cancerous primary site-specific gene expression might obscure cancer signatures, we used two methods to subtract noncancer expression data. We analysed data before and after correction ( Figure 1B).
We used two binary primary site classification matrices: P c , a t × q matrix of cancer primary sites, and P n , a t × r matrix of normal tissue primary sites. q and r are the number of cancer and normal samples, respectively, and t is the number of primary sites. We used two gene expression matrices:C, a s × q matrix of cancer gene expression from the T-C-PS data set (Table 1), and N, a s × r matrix of normal tissue gene expression from the N-C-PS data set (Table 1).q and r are the number of cancer and normal samples, respectively, and s is the number of genes.
For a given cancer expression vector of gene i in matrix C, and for a binary classification vector for primary site l in matrix P c , we derived the vector of tissue-specific cancer gene expression X i by multiplying these two vectors: For a normal tissue, given the expression vector for gene i in matrix N and the binary classification vector for primary site l in matrix P n , we derived the vector of tissue-specific normal tissue gene expression Y i by multiplying these two vectors: By calculating X i and Y i or all primary sites and all genes, we created a series of vectors that form the two three-dimensional matrices X and Y. X i,j,l is the TPM gene expression value for gene i in cancer j of primary site l . Y i,k,l is the TPM gene expression value for gene i in normal tissue k of primary site l .
The tissue normal-corrected data set (subsequently called 'tissue-corrected') was calculated by first defining the mean normal expression Ĝtissue for gene i at each primary site l as: Where r is the number of normal tissue samples in primary site l, m l is calculated as: The tissue-corrected gene expression matrix L tissue was calculated as: The grand normal-corrected data set (subsequently called 'grand mean-corrected') was calculated by partitioning matrix Y by both the F I G U R E 1 A flowchart depicting the analyses used in this study. Transcriptome profiles were first restricted to protein-coding genes (A), then two different primary site-correction approaches were taken to analyse the three data sets in parallel (B). Each data set was analysed using WGCNA to identify groups of genes (modules) that were co-correlated, and variable across cancers (C). Genes found in modules were put through pathway enrichment analysis (WebGestalt) and used for hierarchical clustering (D) total number of primary sites, t and the number of cancers within each primary site, r: m l was calculated as before. Finally, the grand mean-corrected gene expression matrix L grand as calculated as:

| Gene selection for clustering analysis
For clustering analysis, the genes profiled were restricted firstly to protein-coding genes because mechanisms of tumorigenesis are currently better understood for the protein-coding transcriptome ( Figure 1A). Secondly, protein-coding genes were restricted to 'modules' with expression values co-correlated and variable across cancers using weighted gene co-expression network analysis (WGCNA, Figure 1C). 19 WGCNA was carried out using the WGCNA package in R (version 1.68). 20 The mean TPM values of all genes in a module were used to evaluate the expression of a module in a cancer.

| Characterization of modules identified by WGCNA
Modules were characterized using the over representation analysis (ORA) in the WebGestaltR package (version 0.4.1, Figure 1C). 21 ORA used all protein-coding genes as a reference set, the WikiPathway 22 database for functional annotations and the Benjamini-Hochberg method 23 for multiple testing correction. Modules were named using default WGCNA settings, which assign each module a colour.
The module names were not changed after characterization due to the complexity of the functional enrichment.

| Clustering by transcript profiling
Clusters of similar cancers were defined by hierarchical clustering 24 using the cosine distance 25 between the expression profiles of the genes included in the modules and Ward's method 26 for agglomeration ( Figure 1C). The number of clusters was determined with the find_k function from the dendextend R package (version 1.12.0); this function estimates k using maximal average silhouette widths. 27 Dendrograms were cut into k groups to assign cancers to a cluster.

| RE SULTS
To test the hypotheses that cancer-inducing gene expression changes are detectable by RNA-Seq and traverse cancer primary sites, we analysed the TTG data set (Table 1) and restricted it to tumour and corresponding normal data ( Figure 1; Table 1). These data sets were normalized, analysed and stratified.

| Hallmark cancer and tissue-specific pathways distinguish cancer clusters in uncorrected data
Analysis of the uncorrected data set showed that subtle expression differences in both hallmark cancer and cancer-unrelated, tissuespecific pathways differentiated clusters. WGCNA categorized 991 genes into 17 modules. Eleven of those modules were enriched for functional pathway annotations: brown, cyan, green, grey60, light yellow, midnight blue, pink, purple, red, tan and turquoise. Eight modules were enriched for tissue-specific processes: brown, cyan, green, light yellow, pink, red, tan and turquoise (Table S1, ORA, P ≤ .049).
The remaining three modules were enriched for cancer-relevant processes. The grey60 and purple modules were, respectively, enriched for natural killer (NK) cell signalling and T cell receptor (TCR) signalling (Table S1, ORA, P ≤ 2.9*10 -5 ), axes characteristic of the avoiding immune destruction hallmark. The midnight blue module was enriched for histone modification signalling (Table S1, (Table S2).
Post hoc analysis by Tukey's Test determined pairwise differences in module expression between primary sites (Table S3). The brown module was expressed higher in uterine cancers than other sites ( Figure 4A, Tukey HSD, P ≤ 3.2*10 -9 ). The light yellow module was expressed higher in breast cancers than other sites ( Figure 4A, Tukey HSD, P ≤ 5.9*10 -6 ). The pink module was expressed higher in liver cancers than other sites ( Figure 4A, Tukey HSD, P ≤ 2.0*10 -11 ).
The red, tan and turquoise modules were expressed higher in brain cancers than other sites ( Figure 4A, Tukey HSD, P ≤ 2.0*10 -11 ).The cyan module was expressed higher in stomach and prostate cancers than other sites ( Figure 4A, Tukey HSD, P ≤ 7.6*10 -6 ). The green module was expressed higher in skin cancers than in prostate, liver, kidney, brain or breast cancers ( Figure 4A, Tukey HSD, P ≤ .037).The For panels B and C, module expression is ln(Tumour/Normal) as defined in the Methods expression of the grey60, midnight blue and purple modules differed between pairs of primary sites but without an appreciable pattern ( Figure 4A, Tukey HSD, P ≤ .05).

| Hallmark cancer and tissue-specific pathways distinguish cancer clusters in tissue-corrected data
Because of the potential for cancer-unrelated primary site-specific pathways to obfuscate tumorigenic signatures, we repeated the analyses ( Figure 1D) after correcting for tissue-specific gene expression. This removed some, but not all, of the primary site-specific pathways seen in the uncorrected data and introduced new primary site-specific pathways (Table S4 vs Table S1).  (Table S4); the former represents the sustained proliferative signalling hallmark and the latter the genome instability and mutation hallmark. 1 The green module was enriched for processes related to cell cycle progression (Table S4), a characteristic of the evading growth suppressors hallmark. The green yellow, magenta and red modules were enriched for NK cell, T cell or inflammatory signalling (Table S4, To investigate whether anatomical cancer primary site corresponded with cluster assignment, we evaluated the primary site composition of each cluster. Clusters 1, 2, 3 and 5 were primary site heterogeneous. Cluster 1 was primarily composed of bladder (66%), testis (24%) and uterine (9%) cancers ( Figure 3B). Cluster 2 was composed of stomach (47%), colon (33%) and oesophagus (20%) cancers ( Figure 3B). Cluster 3 was predominantly composed of prostate (61%), lung (14%) and breast (13%) cancers ( Figure 3B).
Cluster 5 was composed of liver (67%) and pancreas (33%) cancers ( Figure 3B). Clusters 4, 6, 7, 8, 9 and 10 were ≥ 99.9% a single primary site: skin, kidney, ovarian, lung, brain and breast cancers, respectively ( Figure 3B). The primary site homogeneity of clusters suggests either that correction for primary site signatures in this data set was incomplete or that the detected cancer signatures are primary site-dependent.
Post hoc analysis by Tukey's Test determined pairwise differences in module expression between primary sites (Table S6). The expression of the black, blue, brown, cyan, dark green, green, green yellow, grey60, light cyan, light yellow, magenta, pink, purple, red, turquoise and white modules differed between primary sites but without an appreciable pattern ( Figure 4B, Tukey HSD, P ≤ .05). Hierarchical clustering of the 1,084 genes in grand mean-corrected WGCNA modules defined 10 clusters. These clusters were characterized by distinct expression of modules ( Figure 2C). All modules showed differential expression across clusters ( Figure 2C,  (Table S8).
Post hoc analysis by Tukey's Test determined pairwise differences in module expression between primary sites (Table S9). The green yellow module was expressed higher in kidney and liver cancers than other primary sites ( Figure 4C, Tukey HSD,P ≤ 2.0*10 -11 ).
The magenta module was expressed higher in skin cancers than in prostate, pancreas, ovary, lung, liver, kidney, brain, colon and breast cancers ( Figure 4C, Tukey HSD,P ≤ .02). The expression of black, blue, cyan, green, light cyan, pink, salmon, tan, turquoise and yellow modules differed for many pairwise comparisons of primary sites but did not have an appreciable pattern ( Figure 4C, Tukey HSD, P ≤ .05).

| Clusters are incompletely primary siteindependent
To evaluate whether cancer clusters express hallmarks independently of primary site, we assessed the stratification of a primary site across clusters and the primary site diversity within clusters. For the former, we counted the number of clusters in which the null hypothesis of the hypergeometric test was rejected (P ≤ .05, Table S10).
Primary sites with 2 or more clusters that rejected that null hypothesis were considered stratified across clusters. We observed that in the uncorrected data breast, oesophagus, kidney, ovary, prostate, stomach and uterine cancers were stratified by their gene expression profiles (Table 2), that in the tissue-corrected data no cancer types were stratified by their gene expression profiles ( Table 2), and that in the grand mean-corrected data breast, oesophagus and lung cancers were stratified by their gene expression profiles ( Table 2).

| RNA-Seq data consistently stratifies cancers by seven therapeutically targetable hallmarks of cancer
Consistent with prior studies, 29  inhibitors. 33 This illustrates that expression assays might have utility detecting biomarkers for resistance to therapies targeting the sustained proliferative signalling hallmark.
Components of the tumour-promoting inflammation and the avoiding immune destruction hallmarks were enriched in several modules and were detected in all data sets (Tables S1, S4, S7).
Because only a minority of patients within a given cancer type respond to CD8 + T cell dependent cancer immunotherapy, 35  Our analyses detected modules enriched for angiogenesis hallmark related processes. Although clinical targeting of the VEGF signalling axis frequently induces resistance 38 that limits it as a monotherapy, expression of VEGF and its receptors correlates with cancer stage and metastasis and might be a useful prognostic indicator. 39 The variation in the angiogenic signalling detected in our study divides breast, kidney and colon cancers into high and low expression groups ( Figure 4C), a division that might not only be useful as a staging marker but also for identifying tumours likely to respond to antiangiogenic therapy.  The deregulating cellular energetics hallmarks have a strong transcriptomic footprint. 1,47 The processes underlying this hallmark originate from changes in gene expression, namely, glucose transport, 48 glutamine transport 49 and the pentose phosphate pathway, 50 as well as the biosynthesis of nucleotides, 51 serine, 52 arginine 53 and proline. 54 Studies detecting these changes in gene expression, however, either did not use tumour biopsies as the source tissue or use more targeted methods than RNA-Seq. In contrast to cultured tumour cells or xenografts of cultured tumour cells, our analyses agnostically probed tumour biopsies, a highly complex cell population, 55 for gene expression signatures varying across samples.

| RNA-Seq data does not stratify cancers by three hallmarks of cancer
Consequently, we postulate that tissue heterogeneity introduces too much biological variability or that our assumption of variability across cancers is invalid. Supporting the latter hypothesis, prior studies show consistent expression of metabolic genes across cancer types, 56 and we find that genes with the least variable expression are enriched for metabolic pathway annotations (Table S11).
Like the cellular energetics hallmark, the resisting cell death hallmark has a strong transcriptomic footprint. It is characterized by the subversion of the regulatory and functional elements of the cellular apoptosis machinery. 1 Cancer cells impair apoptosis by decreasing expression of proapoptotic proteins or by increasing expression of antiapoptotic proteins. 57 The specific family members up or down-regulated are, however, relatively cancer type-specific. 57 Consequently, although gene expression networks involved with apoptosis are altered, specific cancers usually have changes in only a few genes in that network, 57 and our methodology is insensitive to such limited changes.

| Clusters defined by hallmark gene expression are incompletely independent of cancer primary sites
Analyses of the uncorrected data set showed that brain, oesophageal, ovarian, prostate, stomach and uterine cancers were stratified across clusters (Table 2). Due to the presence of modules enriched for non-cancer processes (Table S1) and the lack of distinct expression of modules enriched for hallmarks across clusters (Figure 2), we are not certain that hallmark cancer signatures solely underlie that stratification.
Suggesting the dependence of clustering on the anatomical primary site, analyses of the tissue-corrected data set found that no cancer types were stratified across clusters ( Table 2). The normalization for the tissue-corrected data set used separate 'normal' gene expression vectors for each primary site and that process could introduce signatures by over-correction. Supporting this, comparison to the uncorrected data shows the concurrent elimination of modules enriched for non-cancer processes in the uncorrected data and the detection of other modules enriched for non-cancer processes (Tables S1, S4).
Analyses of the grand mean-corrected data set showed that breast, oesophageal and lung cancers were stratified across clusters (

| Biological processes identified in this study align with previous literature
Previous investigations into transcriptomic subdivisions of cancer observed several of the pathways that we identified. Specifically, pancreatic, 9 breast 10 and pan-cancer 13 studies found that cell cycle pathways define transcriptomic subgroups and that immune signalling defines subgroups of pancreatic 9 and breast 10 cancers. In contrast to the pan-cancer study of Kaczowski et al did, 13 our analysis did not detect differential expression of genes relevant to the cellular energetics hallmark; we hypothesize, as discussed above, that this arose because our analysis of co-correlated groups of transcripts is insensitive to small numbers of transcripts with altered expression, whereas the approach of Kaczowski et al is not so limited because it focuses on differential expression of individual transcripts. 13 On the other hand, detecting expression changes in the chromatin remodelling, angiogenesis and ECM signalling axes, our analysis distinguished molecular subtypes that were not noted in studies of the pancreatic, 9 breast 10 or pan-cancer 13 data sets.

| Limitations
The detection of some cancer hallmark signatures is encouraging, given the conservative approach (requiring an expression signature across all cancers) and the following limitations of our study. First, the data set did not contain isoform-specific expression, and this prevented the incorporation of alternatively spliced transcripts into our analysis. Alternatively, spliced transcripts play important roles in cancer cell biology 58 and are relevant to all hallmarks of cancer. 28 Second, although dysregulation of non-coding RNAs is integral to cancer biology 59 and play a therapeutically relevant role, 60 we restricted our analysis to the protein-coding transcriptome to facilitate pathway enrichment analysis. Third, our analysis did not account for therapeutically relevant gene fusions detectable by RNA-Seq. 61 Fourth, we did not assess allele-specific expression, which is relevant to cancer biology and progression. 62 Fifth, to mitigate the effects of spurious expression differences, we did not consider modules of single or small numbers of genes 20 ; this might be addressed in the future by the inclusion of spike-in reference RNAs. 63 Sixth, for several technical reasons, our analysis was indifferent to tumour microenvironments (TMEs) which are known to modify processes (proliferation, metastasis and interaction with immune cells 64 ) underlying cancer hallmarks. Although these limitations decreased the sensitivity of our analysis to hallmark changes, they do not alter the specificity of our analysis.

| CON CLUS IONS
RNA-Seq detects some hallmarks of cancer and those hallmarks stratified some, but not all, cancer types. We consistently identi-

ACK N OWLED G EM ENTS
GF was supported by a post-baccelaureate fellowship from Sanford Imagenetics. We recognize the freely available data on the UCSC Xena Platform, which made this analysis possible. We thank Lauren

Sanders, Allison Cheney, Elise Valkanas, Elise Flynn and Dr Rachel
Myerowitz for critical review of the manuscript.

CO N FLI C T S O F I NTE R E S T
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTH O R CO NTR I B UTI O N S
CB conceived the hypothesis tested in this study. GF and PC developed the methodology to test that hypothesis. All code for data analysis and processing was developed by GF. GF wrote the manuscript with input and revisions provided by CB, PC and SM. SM and CB aided the clinical and biological relevance of conclusions drawn from data analysis.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data used in this study is freely available from the UCSC Xena Data Browser. Specifically, the TCGA-TARGET-GTEx gene expres-