muTarget: A platform linking gene expression changes and mutation status in solid tumors

Large oncology repositories have paired genomic and transcriptomic data for all patients. We used these data to perform two independent analyses: to identify gene expression changes related to a gene mutation and to identify mutations altering the expression of a selected gene. All data processing steps were performed in the R statistical environment. RNA‐sequencing and mutation data were acquired from The Cancer Genome Atlas (TCGA). The DESeq2 algorithm was applied for RNA‐seq normalization, and transcript variants were annotated with AnnotationDbi. MuTect2‐identified somatic mutation data were utilized, and the MAFtools Bioconductor program was used to summarize the data. The Mann‐Whitney U test was used for differential expression analysis. The established database contains 7876 solid tumors from 18 different tumor types with both somatic mutation and RNA‐seq data. The utility of the approach is presented via three analyses in breast cancer: gene expression changes related to TP53 mutations, gene expression changes related to CDH1 mutations and mutations resulting in altered progesterone receptor (PGR) expression. The breast cancer database was split into equally sized training and test sets, and these data sets were analyzed independently. The highly significant overlap of the results (chi‐square statistic = 16 719.7 and P < .00001) validates the presented pipeline. Finally, we set up a portal at http://www.mutarget.com enabling the rapid identification of novel mutational targets. By linking somatic mutations and gene expression, it is possible to identify biomarkers and potential therapeutic targets in different types of solid tumors. The registration‐free online platform can increase the speed and reduce the development cost of novel personalized therapies.

recommendations. 1 Molecular changes can serve as predictive biomarkers indicating treatment response. Other biomarkers can predict prognosis and help estimate the progression-free or overall survival of patients. 2 In many cases, it is not understood how exactly a genetic aberration contributes to cancer progression. Somatic mutations in a protein coding sequence affect not only the corresponding protein's function but the perturbation can propagate to a network of functionally related genes in the same and other cellular pathways. Uncovering molecular interaction networks can help prioritize drug combinations and biomarkers for experimental testing. 3 At the same time, multiomics studies contribute to the delivery of different levels of data, such as genomes, transcriptomes, proteomes, metabolomes and epigenomes. These data can support more complex interpretations of phenotype-genotype relationships compared to the investigation of only a single data layer. 4 Previously, a network-based integration of multiomics data was performed to prioritize cancer genes. 5 Graph-based integration of multiomics data, including copy number alteration, methylation, microRNAs (miRNAs) and gene expression data, could also help predict clinical outcomes. 6 However, such multiomic studies are still limited by data availability; to establish a sufficiently robust interaction network, one needs to have each layer of data at hand for each included clinical specimen.
In the present study, we selected breast cancer as a model. The reason behind this was the robust number of clinical breast cancer specimens available for our analysis. In breast cancer, several predictive and prognostic biomarkers are routinely used; some examples include the use of estrogen receptor (ER) and progesterone receptor (PGR) to predict sensitivity for endocrine therapy 7 and the use of expression of human epidermal growth factor receptor 2 (HER2) to predict response to HER2-targeted therapies such as trastuzumab, 8 pertuzumab 9 and lapatinib. 10 In addition, BRCA1/2 germline mutations assist in predicting the response to PARP inhibitors in metastatic breast cancer. 11 In addition to single gene indicators, multigene prognostic models have also gained acceptance in recent years. 12 However, despite all the advances, large groups of breast cancer patients harboring different somatic mutations still do not have adequate targeted therapies.
Identification of biomarkers and novel therapeutic targets is an essential goal in precision oncology that will facilitate the optimal treatment of human cancers. Here, we hypothesized that mutations affect multiple genes and that the resulting transcriptomic fingerprint of a mutation can harbor other clinically useful genes. To this end, our aim was to identify genes with altered expression in relation to somatic mutations in other genes. We validated the approach by employing a training-test approach in a data set comprising a suitable sample number. Finally, we established an online portal enabling the analysis for any new agent.

| Database processing
Data processing steps and algorithm development were performed in the R v3.5.2 statistical environment (http://www.r-project.org). RNA-seq-based gene expression and Mutect2-identified somatic mutation data were obtained from the The Cancer Genome Atlas (TCGA) repository (https://portal.gdc.cancer.gov/). We collected only solid tumors with more than 200 cancer specimens into the database. Data acquisition and processing were performed in November 2019.
MAFtools R Bioconductor program was used for the summarization. 13 For cancer-specific somatic mutations, mutation annotation format (MAF) files were utilized, which already exclude low-quality signals and variants most likely representing germline alterations. 14 Mutations were filtered based on the judgment of Mutect2 as well as additional hard filters of at least 20× coverage and the presence of the alteration in at least five reads.
We classified each mutation into three functional groups, including "all mutations," "protein coding mutations" and "disruptive mutations," based on the sequence ontology generated by snpEff for the canonical transcripts. 15 The all mutations group included any alteration in the gene; protein coding mutations group included those that hit a protein coding sequence, including missense, synonymous, frameshift, translational start and stop codon-associated mutations; and the disruptive mutations group included only those actually disrupting the protein structure, namely, translational start and stop codon-associated mutations and frameshift mutations ( Figure 1).
For gene expression, the raw high-throughput sequencing (HTSeq) count RNA-seq data generated by the Illumina HiSeq 2000 RNA Sequencing Version 2 platform were used. The "DESeq" package based on the negative binomial distribution was used to normalize the raw count data. 16 A second scaling normalization was performed to set the mean expression of all genes in each patient sample to 1000 to reduce batch effects. The "AnnotationDbi" package (https:// bioconductor.org/packages/AnnotationDbi/) was applied to annotate Ensembl transcript IDs with gene symbols.

| Linking mutation to gene expression changes
There are two possible approaches connecting genotype changes and gene expression: (a) we can start from a mutated gene and identify all What's New?
Genetic mutations do not always alter just a single protein. In some cases, they can affect the transcription of an entire pathway. In this study, the authors correlated somatic mutations and gene expression in cancer. They developed a platform that enables correlations to be analyzed in two directions: first, to identify changes in gene expression that are related to a specific mutation, and second, to identify mutations that alter the expression of selected genes. An online portal should enable this type of analysis for any new agent, to help researchers identify potential biomarkers and therapeutic targets.
up-and downregulated genes, or (b) we start by selecting a target gene and identify all mutations leading to an expression change in this specific target ( Figure 2). In both cases, the analysis is run across all genes, but the data source is different (either expression data or mutation data). Below, we present these two pipelines separately.

| Determining the effects of a gene mutation
In this analysis, our goal was to identify genes showing altered expression in tumor samples harboring a selected mutation.
In the first step, we used a single binary vector to assign each sample to one of two cohorts based on the mutation status of the selected gene. In this vector, "0" denoted wild-type samples, and "1" denoted samples with the selected mutation class (all/coding/disruptive mutations are processed independently). Then, we separately compared the expression of a gene between the two cohorts using a Mann-Whitney nonparametric test. The computation was run across all genes, but genes with a mean expression below 100 (corresponding to 10% of the mean expression across all genes) were not included in the analysis to reduce noise discovery. To perform multicore calculation of the Mann-Whitney U test, we applied the "doParallel" (https:// CRAN.R-project.org/package=doParallel) and "parallel" (https://www. rdocumentation.org/packages/parallel) packages.
Genes with significant expression changes between the two cohorts are selected based on the P value and mean fold change (FC) differences between the mutated and wild-type cohorts. We accepted a gene as significant only when the P value was below .01 and the FC was either over 1.4 or below 0.714. The final output was a list of up-and downregulated genes whose expression levels were significantly associated with the given genotype alteration in the original input gene.
Boxplots were generated to visualize expression differences for the most significant genes. For this, we employed the "ggplot2" package (https://ggplot2.tidyverse.org/). In each plot, the x-axis displays the mutation status, and the y-axis shows the expression value of the gene.

| Identifying mutations disturbing a gene
In this analysis, we started from a target (eg, a potentially druggable gene), and our goal was to uncover gene mutations that could be used as biomarkers to select patients most likely to benefit from targeted therapy. For this, we identified gene mutations associated with expression changes in the investigated gene.
The pipeline in this analysis is similar to that in the first approach: we started by generating a separate binary vector for each gene's mutation status, where "0" denoted wild-type samples, and "1" denoted samples with the selected mutation class. In the next step, differential gene expression for the selected gene was calculated based on the binary vectors prepared in the previous step using a Mann-Whitney U test. The analysis was run using the mutation status of all genes independently, and the multicore calculation of the Mann-Whitney U test described above was also applied.
Significant genes were selected based on the Mann-Whitney P value and mean FC of the test, where we used the default thresholds of P ≤ .01 and 0.714 > FC > 1.4. In addition, we restricted the analysis to those mutations that had a prevalence of at least 1% F I G U R E 1 Classification of the determined mutations into functional groups. "All mutations" include any alteration in the gene, "protein coding" mutations include those which hit a protein coding sequence and "disruptive mutations" include only those actually disrupting the protein structure [Color figure can be viewed at wileyonlinelibrary.com] F I G U R E 2 Overview of the two possible analysis pipelines linking mutations and gene expression changes. The analysis can start from a mutated gene (GENOTYPE) or from the selection of the target gene (TARGET). In both cases, the analysis is run across all genes but using expression (GENOTYPE) or mutation (TARGET) data [Color figure can be viewed at wileyonlinelibrary.com] across all patients in the investigated tumor cohort. In the case of gene expression data, genes with a mean expression below 100 were excluded from the analysis. The output of the analysis was a list of genes whose mutation status was significantly associated with upregulation or downregulation of the investigated gene. We generated boxplots of the most significant genes using the "ggplot2" package to visualize expression differences.

| Training-test validation
The established analysis pipeline assumes that gene expression changes are consistently linked to gene mutations. To validate this hypothesis, we employed a training-test approach using the breast cancer data set. We selected breast cancer as this tumor type had by far the most patients samples simultaneously available mutation and gene expression data enabling the splitting of patients into two cohorts with robust sample numbers.
The training set was generated by a random selection of 50% of breast cancer patients using the "sample" function in the R environment.
The remaining 50% of breast cancer patients were added to the test set.
We selected the top five most frequently mutated genes for validation of the downstream gene expression changes linked to their mutations.
The entire analysis pipeline was performed separately for the training and test sets. Genes with mean expression values below 100 were excluded from the analysis. Statistical significance was accepted in the case of P ≤ .01 and a FC cutoff of <1.4 and >0.714. The chi-square test was employed to compute statistical significance when comparing the overlap of the results generated in the training and test sets.
Venn diagram and volcano plots were applied to visualize the distribution of the overlapping differentially expressed genes between the training and test sets. In the case of the volcano plot, the x-axis shows log 2 FC values, and the y-axis shows the −log 10 P values.

| Setting up a web server for future validation
We established an online platform enabling the validation of the results presented in our study as well as for the future analysis of any selected gene by either the "determine the effects of a gene mutation" or by the "identify mutations disturbing a gene" pipeline.     Table 3.
These results were generated when using all samples. We performed an in silico validation for the identification of differentially expressed genes between the TP53 mutated and wild-type samples by splitting the entire data set into equally sized training and test cohorts. The complete analysis was performed independently as described above for both cohorts.
We identified 2676 genes that showed significant expression changes in relation to TP53 somatic mutations in the training set. In the test set, there were 2559 genes significant. Of these two lists, 2124 genes overlapped between the gene lists of the training and test sets ( Figure 3F). A chi-square test of independence was performed to examine the overlap between the training and the test sets, and the results were highly significant (chi-square statistic = 16 719.7 and F I G U R E 4 Linking CDH1 mutation to gene expression changes in breast cancer (GENOTYPE). Top genes with higher expression associated with CDH1 mutations in breast cancer, A-E. Proportion of genes overlapping between training and test sets showing significant expression changes in relation to CDH1 mutations, F. Volcano plot displaying the differential expression of all genes when comparing CDH1 mutated and wild-type patient cohorts, G [Color figure can be viewed at wileyonlinelibrary.com] P < .00001). We generated a volcano plot to visualize the full list of genes analyzed. In this plot, genes significantly altered in both the training and test sets are indicated by red dots ( Figure 3G).  Table 4. The results of gene ontology analysis of the up-and downregulated genes are found in Supplemental Table 5.

| Determining the transcriptomic effects of CDH1 gene mutations in breast cancer
CDH1 mutation-associated genes were also validated by using an identical training-test approach as described above. In this analysis, we identified 1402 and 953 genes with an expression change significantly associated with CDH1 somatic mutations in the training and test sets, respectively, and 681 of these were significantly associated with the CDH1 gene mutation status in both the training and test sets ( Figure 4F, G). The chi-square test of independence examining the overlap between the training and the test sets showed high significance (chi-square statistic = 8653.1 and P < .00001).

| Determining the transcriptomic effects of HER2 and IGF1R gene mutations
Furthermore, the transcriptomic effects of HER2 and IGF1R gene mutations were also determined in breast cancer. HER2 somatic mutations were found in 3%, and IGF1R mutations were identified in 1% of breast cancer patients. We identified 265 genes and 118 genes showing significant correlation to HER2 and IGF1R mutations, respectively (Supplemental Table 6 and Supplemental Table 7). In the HER2  Table 8 and Supplemental Table 9, respectively.
F I G U R E 5 Linking expression changes to mutations (TARGET). The five genes whose mutations are most strongly associated with PGR expression changes in breast cancer. PGR, progesterone receptor [Color figure can be viewed at wileyonlinelibrary.com] 3.5 | Identifying mutations resulting in expression changes in the input gene PGR is a ligand-activated transcription factor used to define the luminal and basal subtypes of breast cancer. We selected PGR because it is one of the FDA-approved makers employed in treatment selection. We aimed to identify mutations influencing PGR expression. In principle, these mutations could guide breast cancer treatment. We performed the Mann-Whitney U test to identify gene mutations associated with PGR expression and selected genes with P < .01 and .714 > FC > 1.4.
Mutations in 178 genes were correlated with PGR expression, and the six most strongly associated genes are depicted in Figure 5. PGR expression was higher in PIK3CA-mutant (mutation present in 33% of the patients) and MAP3K1-mutant (mutation prevalence 8.2%) breast cancer patients than wild-type in breast cancer patients ( Figure 5B, D).

| Online validation platform
We established a registration-free online analysis platform enabling the above described analysis pipeline in real time. The platform has two independent components: the "GENOTYPE" analysis module and the "TARGET" analysis module ( Figure 2). The online analysis site is available at http://www.mutarget.com/.
The GENOTYPE module can identify gene(s) showing altered expression in samples harboring a mutated input gene. Up to five genes can be investigated in one setting-in the case of multiple markers, the algorithm combines the mutation status of all selected genes into a single binary vector. All three types of somatic mutation data are available, and the analysis can be performed across 18 different types of solid tumors. We also added a filter to restrict the analysis to cancer hallmark genes 18 and to targets of FDA-approved drugs. 19 The TARGET module can identify mutations resulting in expression changes in the selected input gene. In addition to gene and mutation filtering, the user can also set a threshold to select only those genes that are mutated in at least one, 2 or 3 % of patients. The validation platform can be used to validate the genes discussed in the present manuscript and to investigate and rank new genes and biomarker candidates.

| DISCUSSION
In this study, we utilized a database containing 18 different types of solid tumors and more than 7800 patients with somatic mutations and RNA-seq gene expression data. First, we set up a pipeline to identify genes with altered expression when comparing mutated and wild-type patient cohorts. This first option is useful in case one searches for new drug targets in a cohort of patients with a given mutation. We have to note that in the demonstration of this option we were specifically seeking genes upregulated in patients harboring a selected mutation because these could actually serve as druggable targets in cancer therapy.
A second pipeline was established to identify mutations resulting in expression change of a selected gene. This approach is useful in case one has a drug target gene and seeks biomarkers to select patient cohorts with enriched expression of this gene. A biomarker is an objectively measured marker enabling the stratification of patients into subcohorts based on their expected response. Those with a mutation (or combination of mutations) could be eligible for a therapy targeting the original input gene. To demonstrate the utility of these approaches, we used the breast cancer data set with 979 specimens and selected genes that are among the most frequently mutated genes in breast cancer. 20 The results were validated using randomly separated training and test sets of the breast cancer data set.
We identified differentially expressed genes between TP53 mutated and TP53 wild-type breast cancer patient cohorts. Many of the top genes identified are in line with the results of previous studies investigating TP53 mutations, including expression changes in CDC20, 21 PSAT1, 22 CENPA 23 and CENPW. 24 We have to note that we were specifically seeking upregulated genes because these could serve as druggable targets in cancer therapy. Of the abovementioned genes, targeting CDC20 protein with a small molecule inhibitor was already suggested as a novel strategy for the treatment of different kinds of human cancers. 25 Apcin, an inhibitor of the anaphasepromoting complex/cyclosome (APC/C), contributes to the inhibition of the ubiquitination of CDC20 substrates. 26 Our results, also validated using a training-test approach, suggest that TP53 mutation has a characteristic transcriptomic change that could be exploited in novel anticancer therapies.
CDH1 is among the most frequently mutated genes in breast cancer. 20 It encodes a tumor suppressor protein, and its mutation contributes to the promotion of breast cancer invasion and metastasis. 27 In our analysis, we uncovered genes with an expression change associated with somatic mutations in the CDH1 gene and identified RAPGEF3, BTG2 and TFAP2B, among others, as those with the strongest difference related to CDH1 mutation. RAPGEF3-induced metabolic activation results in oncogenesis, which was demonstrated in a 3D cell culture model. 28 The Ras inhibitor farnesyl thiosalicylic acid (salirasib) can selectively inhibit RAPGEF3 protein in different types of cancer, including pancreatic cancer, nonsmall-cell lung cancer and some hematological malignancies. 29 Our results suggest the possibility of higher efficiency of RAPGEF inhibitors in CDH1-mutated tumors.
Increased expression of TFAP2B was found in lobular carcinoma in situ and invasive lobular breast cancer. 30 The BTG2 gene showed decreased expression in breast cancer, and lower expression correlates with high tumor grade, TP53 mutation status, vascular invasion, metastatic recurrence and decreased overall survival. 31 We previously demonstrated that by linking different types of data-somatic mutations and gene expression-it is possible to identify genes with prognostic relevance. We performed such studies on KRAS in lung adenocarcinoma patients 32 and on NPM1 in acute myeloid leukemia, 33 and potential druggable target genes were identified in breast 34 and colon cancer. 35 Finally, somatic mutations associated with elevated PD-L1 expression were identified in gastric cancer. 36 In these studies, we focused on a few genes; however, in the present project, we provide a platform to uncover and validate any gene in a large number of different tumor types.
Additionally, we set up a pipeline to pinpoint mutations correlated to a target gene's expression. To demonstrate this approach, we analyzed PGR expression, a gene with high prognostic relevance in luminal breast cancer. 37 A previous study reported that increased PGR expression was associated with PIK3CA, MUC16, EVC, GNAS, LRRIQ1, TRIO and TTN mutations in breast cancer; however, none of these associations reached statistical significance. 38 Another paper also described the lack of statistical significance between PGR expression and BRCA1 gene mutations. 39 According to our results, PGR receptor expression was increased in MAP3K1-and PIK3CA-mutated tumors and decreased in TP53-and DYNC2H1-mutated breast cancer patients.
Finally, we also implemented a direct link to gene ontology, specifically including biological processes and molecular functions. Actual phenotypic changes result from an interplay of multiple genes with linked functions and gene ontology enables a higher level evaluation of the generated results. In this context, the results not only enable the selection of single targets but further prioritization is possible capable to identify previously hidden relationships. As gene ontology categories are shared among different organisms, these results can also help to speed up preclinical studies investigating cancer pathophysiology. Due to limitations of present study, a validation of the best gene ontology hits is only possible in a future large-scale analysis.
There are some limitations to our approach including small sample sizes in the case of some tumor data sets. We used only data sets where the number of patients with both somatic mutations and transcriptomic data was higher than 200; because of this cutoff, we had to exclude esophageal cancer, glioblastoma, pancreas, rectum and testicular cancer. Future updates of the database could increase the sample numbers and enable the inclusion of further tumor types as well. A second limitation is the grouping of mutations into functional groups.
Because of this, the exact effect of a specific mutation remains hidden. However, the stratification of patients by exact mutations (and even by exact mutation types) cannot provide meaningful results due to the very limited sample sizes remaining in each mutation subgroup.
In conclusion, we developed two alternative approaches that can help cancer researchers prioritize genes for further studies and identify new biomarkers. The first approach establishes a direct link between a gene mutation and the whole transcriptome that is useful in case one searches new drug targets in a cohort of patients with a given mutation.
The second approach connects the expression changes of a gene with mutations in different genes. This comes handy in case one has a drug target gene and aims to identify patient cohorts with enriched expression. The entire analysis pipeline is provided in a registration-free web portal accessible at www.mutarget.com.