A human tissue‐specific transcriptomic analysis reveals a complex relationship between aging, cancer, and cellular senescence

Abstract Aging is the biggest risk factor for cancer, but the mechanisms linking these two processes remain unclear. Using GTEx and TCGA data, we compared genes differentially expressed with age and genes differentially expressed in cancer among nine human tissues. In most tissues, aging and cancer gene expression pattern changed in the opposite direction. The exception was thyroid and uterus, where we found transcriptomic changes in the same direction in aging and in their corresponding cancers. The overlapping sets between genes differentially expressed with age and genes differentially expressed in cancer across tissues were enriched for several processes, mainly cell cycle and the immune system. Moreover, cellular senescence signatures, derived from a meta‐analysis, changed in the same direction as aging in human tissues and in the opposite direction of cancer signatures. Therefore, transcriptomic changes in aging and in cellular senescence might relate to a decrease in cell proliferation, while cancer transcriptomic changes shift toward enhanced cell division. Our results highlight the complex relationship between aging and cancer and suggest that, while in general aging processes might be opposite to cancer, the transcriptomic links between human aging and cancer are tissue‐specific.

tissue, we identified differentially expressed genes with age using the following linear regression model: Where Y ij is the expression level of gene j in sample i, Age i denotes the age of sample i. Sex i denotes the sex of sample I, Death i denotes the death classification of sample i based on the 4-point Hardy scale (Ferreira et al., 2018), and ε ij denotes the error term. It should be noted that dataset downloaded from GTEx portal did not provide the actual age of each sample, the age ranges, i.e. 20-29, 30-39, 40-49, 50-59, 60-69 and 70-79, were provided instead. We then approximated the age of each sample to 25,35,45,55,65 and 75, respectively. Because our main aim was to compare the age-DEGs with cancer-DEGs from TCGA, and there was a large difference between the number of the total genes in GTEx (56,202) and TCGA (20,532), we focus only on protein coding genes. Protein-coding genes were identified using the R package biomaRt (version 2.36.1) (Durinck et al., 2009), based on Ensembl release 92 (April 2018). After removing the non-coding genes, there were 18,851 protein-coding genes.
To remove the low expressed genes, genes with expression less than 1 count per million (cpm) in more than 30 percent of samples were excluded. Raw read counts were normalized using TMM normalization and were voom transformed to remove heteroscedasticity from the count data. The linear model for each gene was generated by using the limma package in R (version 3.36.5) (Ritchie et al., 2015). Genes were considered to be significant differentially expressed genes with age if the empirical Bayes moderated t-statistics and their associated adjust P-value (Benjamini-Hochberg method) < 0.05 and absolute fold change across 50 years of age (from 25 to 75 years old) > 1.5. The genes in GTEx were in ensemble id, we then converted them into entrez id format using biomaRt. Lists of age-DEGs from all 26 tissues are provided in the Data S1. Age distributions of the samples from GTEx were shown in Figure S1.
We selected nine tissues for further analysis, including breast, colon, esophagus, liver, lung, prostate, stomach, thyroid and uterus. This selection was based on the criteria that 1) each of these tissues had a matched TCGA project from the same tissue of origin and 2) the number of solid normal samples in those TCGA projects were more than 10. We also performed the analyses for the brain. However, the number of solid normal samples in TCGA-GBM project was only 5 and they lack information of patient age, we included the brain-GBM results in supplementary figures instead. Numbers of samples used in our study were summarized in Table S1. Most of the age-DEGs were tissue-specific, indicating the tissue specificity of age-DEGs, consistent with a previous report (Yang et al., 2015). GRCh37. The data then processed using RNASeqV2 pipeline of TCGA, which provide the RSEM expected counts. Protein-coding genes were identified using the biomaRt R package (version 2.36.1). After removing non-coding genes, there were 18,548 protein-coding genes included in subsequent analysis. To remove the genes with low expression, genes with expression less than 1 count per million (cpm) in more than 30 percent of samples were excluded. The limma package in R was used to identify genes significantly differentially expressed between cancer and normal samples. The moderated t statistic P-value after correction by Benjamini-Hochberg method < 0.01 with an absolute fold-change > 2 were used to determine the statistical significance. Lists of cancer-DEGs from all 10 TCGA projects were provided in the supplementary Data S2. Age distributions of primary tumour and solid normal samples were shown in Figure S2.

Fold change with age in normal tissues of cancer-DEGs
We examined the fold change with age of the cancer-DEGs. After obtaining the cancer-DEGs from TCGA, we identified the fold change with age in normal tissues in GTEx of these cancer-DEGs. Although not every cancer-DEG was found in the GTEx data from the corresponding tissue because lowly expressed genes had been filtered out before performing linear regression, the majority of them retained (Data S6). For each tissue, the fold change with age in normal tissue of up-regulated genes in cancer and down-regulated genes in cancer were compared using Mann-Whitney U test.

Meta-analysis to identify cellular senescence signature genes
To identify cellular senescence signature genes, the meta-analysis of 20 cellular senescence GEO datasets was performed. The datasets included in the analysis were shown in Table S2.
We first separately identified differentially expressed genes in each dataset. The probe annotation for each dataset was slightly different, depended on the platform. In general, the steps are 1) Download the dataset from GEO, 2) Probe annotation to match probe ID with entrez ID using either the dataset's own GPL platform files, R packages (AnnotationDbi version 1.44.0 and org.Hs.eg.db version 3.7.0), or bioDBnet (https://biodbnetabcc.ncifcrf.gov/db/db2db.php) (Mudunuri et al., 2009) 3) Remove probes which do not match to any entrez ID 4) Remove probes which match to multiple entrez ID 5) Average intensity of probes which match to the same entrez ID. The differential expression analysis was performed using limma. Genes with a P-value below 0.05 and absolute fold change > 1.5 were considered putatively differentially expressed genes. For the datasets with only one nonsenescent sample and one senescent sample, only fold change cut-off was used to determine differentially expressed genes. There are two platforms (GPL2937 and GPL2947) in the dataset GSE3460, so we separately analysed each platform and treated them as two different datasets in the meta-analysis. After that, the results from all datasets were then combined using a binomial distribution and the false discovery rate set at Q < 0.05 using the same method as (de Magalhaes et al., 2009). In total, there were 1,259 senescence signature genes, with 526 overexpressed and 734 underexpressed. We found that 442 of 1,259 senescence signature genes were significantly overlapped with the signature of replicative senescence provided by Hernandez-Segura et al. (Hernandez-Segura et al., 2017) (P-value = 2.9e-161; Fisher's exact test), indicating the reliability of our data.

Overlap analysis
The overlap analyses were performed using the GeneOverlap package in R (version 1.16.0).
In each experiment, background was set differently depending on where the genes come from. For the overlap between age-DEGs from GTEx and cancer-DEGs from TCGA, all protein coding genes in GTEx data (18,851 genes) were used as a background. For the overlap between age-DEGs and cellular senescence signature genes, all protein coding genes in combined datasets used for the meta-analysis (18,878 genes) were used as a background.
For the overlap between cancer-DEGs and cellular senescence signature genes, all protein coding genes in combined datasets used for the meta-analysis (18,878 genes). The overlap was considered significant if an adjusted P-value < 0.05 (Fisher's exact test followed by Benjamini-Hochberg correction).

Functional enrichment analysis
Gene ontology (GO) enrichment analysis was performed using the clusterProfiler package in R (version 3.8.1) (Yu et al., 2012). A GO term was considered to be an enriched term if an adjusted P-value < 0.1 (Benjamini-Hochberg correction). The union set of age-DEGs and cancer-DEGs in each of the four conditions for each tissue was employed as a background for GO enrichment test of the overlapping gene set in that tissue. For example, the union set of genes up-regulated with age in colon and genes up-regulated in COAD was used as a background for GO enrichment analysis of overlapping genes between genes up-regulated with age in colon and genes up-regulated in COAD. KEGG pathway enrichment analysis was performed on the underexpressed and overexpressed cellular senescence signatures. A pathway was enriched if an adjusted P-value < 0.05 (Benjamini-Hochberg correction). All protein coding genes in combined datasets used for the meta-analysis (18,878 genes) were employed as a background.

Validating the results with recount2 data
Because GTEx and TCGA had processed the data differently, we confirmed that this has not affected the results of the comparison between age-DEGs and cancer-DEGs by using data from recount2 (Collado-Torres et al., 2017). Recount2 is a project to standardize the pipeline of analysis and provide the ready-to-analyze data, thus TCGA and GTEx (version 6) data have been processed using the same pipeline and they contain the same number of genes (58037 genes in total, including 19732 protein-coding genes). We performed the analyses following the same methods as describe above with this recount2 data and confirmed that age-DEGs and cancer-DEGs change in the same direction in thyroid, uterus, and brain, while change toward the opposite direction in the other tissues ( Figure S5). We, however, keep the results analysed from the GTEx database (version 7) and TCGA downloaded from FireBrowse, because GTEx (version 6) contain less samples, particularly in the oldest age group (70 -79 years old).

List of Supplementary Tables, Supplementary Figures and Additional Data
 Table S1 -Number of GTEx and TCGA samples used in the study  Table S2 -GEO datasets for meta-analysis of cellular senescence signatures  Figure S1 -Age distribution of samples from 26 GTEx tissues  Figure S2 -Age distribution of primary tumour and normal tissues from 10 TCGA projects used in this study  Figure S3 -The relationship between age-DEGs and cancer-DEGs in brain-GBM  Figure S4 -GO and KEGG pathway enrichment analysis of cellular senescence signatures  Figure S5 -Confirmation of the comparison between age-DEGs and cancer-DEGs using recount2 data  Data S1 -List of age-DEGs in GTEx tissues  Data S2 -List of cancer-DEGs in nine TCGA projects  Data S3 -List of enriched terms from GO enrichment analysis of overlap genes between age-DEGs and cancer-DEGs  Data S4 -List of cellular senescence signature genes  Data S5 -GO and KEGG pathway enrichment analysis of cellular senescence signature genes  Data S6 -Lists and fold change with age of cancer-DEGs presented in GTEx data

Figure S5 -Confirmation of the comparison between age-DEGs and cancer-DEGs using recount2 data
Numbers of (a) age-DEGs from GTEx (b) cancer-DEGs from TCGA, RNA-Seq data obtained from recount2 resource. (c) Fold change with age in GTEx data of cancer-DEGs (Compare to Figure 1c and Figure S3a). (d) Overlap analysis between tissue-specific age-DEGs and cancer-DEGs (Compare to Figure 1d and Figure S3b). Because only one age-DEG identified from stomach, the overlap analysis between age-DEG from stomach and cancer-DEGs from STAD was not able to performed. (e) GO enrichment analysis of significant overlap gene sets (Compare to Figure 1e and Figure S3c). The plot shows examples of significant enriched terms (P-values with Benjamini-Hochberg correction < 0.1).