Bioinformatic Identification of a Breast‐Specific Transcript Profile

To identify a breast‐specific transcript profile for the first time, and present an updated bioinformatics strategy for searching tissue‐specific transcripts and predicting their significance in cancer.


Introduction
Tissue-specific biomarkers are valuable for screening early cancer, assessing treatment response, monitoring recurrence, identifying metastatic tumor origin, and serving as targets for cancer immunotherapy. [1][2][3][4][5] For example, the thyroid-specific thyroglobulin (Tg) has been widely used in post-thyroidectomy surveillance for predicting treatment response and monitoring recurrence of patients with differentiated thyroid cancer. [6,7] Moreover, the first FDAapproved urine-based cancer detection assay PROGENSA, which was specially designed for the testing of the prostatespecific prostate cancer gene 3, has been confirmed helpful to reduce unnecessary repeat prostate biopsies in suspicious prostate cancer cases. [8] Markers with tissue specificity are also important for breast cancer, for both tissue-based biopsy and minimal invasive serological testing. [9,10] For example, the dynamic serum level changes of CEA and CA 15-3, two commonly used serum markers for breast cancer in clinical practice, could predict response to systematic therapy and monitoring cancer progress to some extent. [11] However, due to the lack of sensitivity for low-volume diseases and unsatisfactory specificity, values of CEA, CA 15-3 were limited for breast cancer screening and post-operative surveillance. [9,12] Mammaglobin was one of the most widely accepted breast-specific markers. [13,14] Its positive expression in metastatic lesions was helpful to identify carcinoma of breast origin with high specificity (85-100%), while its sensitivity was relatively low (26-84%), especially for triple-negative breast cancer (TNBC). [15] Besides, the expression of mammaglobin is not truly breast-specific that it can also be detected in endometrial, ovarian, and cervical cancer tissues. [16] To date, the exact number of breast-specific markers in human genome remains unknown. Taken together, a comprehensive view is needed to get a better understanding of breast-specific markers, especially for novel ones with potential clinical values.
The rapid development of high-throughput gene analysis techniques has generated a vast amount of data on human transcriptomics and proteomics, [17,18] enabling the search of

Significance Statement
Our data indicated that these novel breast-specific transcripts may be valuable for tissue-based biopsy, blood-based testing, and circulating tumor cell (CTC) analysis in breast cancer. For tissue-based biopsy, their breast specificity is helpful to identify the metastatic status of sentinel lymph nodes and the breast origin of metastatic tumors. For blood-based testing, these transcripts can be used as serum biomarkers for screening early breast cancer, assessing treatment response, monitoring cancer recurrence. Breast-specific biomarkers can also be included in the multi-marker panel to increase the sensitivity and specificity of CTC analysis in breast cancer.
breast-specific biomarkers in a systematic way. For instance, the open-access expression atlas database provides manually curated high-quality microarray and RNA-seq data from over tens of thousands of homo sapiens experiments, including the normalized gene expression data in hundreds of human tissues. [19] Furthermore, the cancer genome atlas (TCGA), a landmark cancer genomics program, has established a rich genomics data resource by molecularly characterizing over 20 000 primary cancer and matched normal samples spanning 33 cancer types. [20,21] In addition, the genomic, epigenomic, transcriptomic, proteomic, and clinical characteristics data of 1098 breast cancer cases are also publicly available in the TCGA database. [22] Previously, our data suggested that a protein-coding gene ankyrin repeat domain 30A (ANKRD30A) and a novel long noncoding RNA (LncRNA) long intergenic non-protein coding RNA 993 (LINC00993) were enriched in human breast tissues. However, a definitive conclusion regarding their breast specificity could not be drawn because our results were only based on data from one RNA-seq project. [23] Therefore, in this study, to confirm our previous findings as well as demonstrate our hypothesis that there exist breast-specific transcripts in human genome, we searched for transcripts with breast specificity systematically using an updated bioinformatics strategy. Briefly, using RNA-seq data of 49 311 transcripts across 88 human tissues, we identified 96 breast-specific transcripts for adult women, and most of these transcripts are poorly understood at the current stage. Furthermore, we demonstrated that the two transcripts of interest, ANKRD30A and LINC00993, may play important roles in breast cancer because of their breast cancer specificity, distinct expression patterns in breast cancer subtypes, and values for prognostic predictions.

Breast-Specific Transcripts in Human Genome
To search for eligible projects for the screening of breast-specific biomarkers, we searched a total of 1293 homo sapiens experiments in the expression atlas, a database that integrated a huge amount of high-quality RNA-seq projects with normalized gene expression data. Only three RNA-seq projects met our criteria, including the GTEx (43 539 transcripts across 52 human tissues), the illumina body map (49 311 transcripts across 16 human tissues), and the RIKEN FANTOM5 project (21 105 transcripts across 76 human tissues). The normalized gene expression data in adult women tissues of the three experiments were integrated for further analysis. Based on our definition, a total of 96 breastspecific transcripts were identified (Figure 1), with 7/43 539 transcripts in the GTEx, 84/49 311 transcripts in the illumina body map, and 7/21 105 transcripts in the RIKEN FANTOM5 project. Only two breast-specific transcripts, ANKRD30A and RN7SL314P, lied in the overlap between the GTEx and the illumina body map project, whereas none of the transcripts in the RIKEN FANTOM5 overlapped the list of the other two projects. These data suggested that only quite a small proportion (about 0.19%) of transcripts in human genome were breast-specific.

Annotations of Identified Transcripts
To get a better understanding of these breast-specific transcripts, the Ensembl genome browser and the TCGA database were used to obtain gene annotations. Gene types, chromosome locations, and Ensembl gene IDs were available for all the 96 transcripts, while mutation status and copy number variations in breast cancer were documented for only 19 13.54%), miscellaneous RNAs (6/96, 6.25%), small nucleolar RNAs (4/96, 4.17%), and joining chain T cell receptor gene (3/96, 3.13%). Chromosome 10 had the highest number of breast-specific transcripts (14/96), while none of the identified transcripts located on chromosome 9, 13, 22, or X (see File S3, Supporting Information).

ANKRD30A and LINC00993
Interestingly, 6 out of the 14 breast-specific transcripts on chromosome 10 located in the same region in p11.21, including ANKRD30A, LINC00993, AL157387.1, RN7SL314P, Y-RNA, and VN1R53P (Figure 2). ANKRD30A is the only protein-coding gene and its sequence covers the full length of the other five transcripts. Next, we evaluated the expression correlation between ANKRD30A and its adjacent transcripts using the gene expression profiling interactive analysis (GEPIA) database based on their expression levels in breast cancer and normal breast tissues. The expression of ANKRD30 was strongly correlated with LINC00993, RN7SL314P, VN1R53P, and AL157387, with the Spearman's correlation coefficient of 0.9, 0.92, 0.94, and 0.94, respectively. Moreover, our previous findings suggested that ANKRD30A and LINC00993 were breast-enriched and LINC00993 may be important for breast cancer because it was significantly downregulated in TNBC [24] and acts as a tumor suppressor by inhibiting tumor proliferation and inducing cell apoptosis. [25] Therefore, the breast-specific protein-coding gene ANKRD30A and the LncRNA LINC00993 were selected for further research (Figure 3).

The Breast Specificity of ANKRD30A and LINC00993 in Cancer Tissues
To assess if ANKRD30A and LINC00993 were also specific for breast cancer, we examined their expression in 33 types of common cancer and paired normal tissues using the RNA-seq data from the GEPIA database. To be specific, apart from some prostate adenocarcinoma tissues in males, the expression of ANKRD30A and LINC00993 were only detectable in breast cancer and paired normal tissues, while the median expression of the two transcripts in other non-breast tissues, cancer or paired normal tissues, were always below the cut-off of 1 transcripts per million (TPM) (Figure 4). The median expression of ANKRD30A was 7.16 TPM in breast cancer tissues and 5.37 TPM in paired normal tissues, while that of LINC00993 was much higher (36.66 TPM in tumors and 42.15 TPM in matching non-tumor tissues). Moreover, compared with the paired normal tissues, their expression in breast cancer tissues were often aberrant with a considerably wide range. These findings suggested that ANKRD30A and LINC00993 were also specific for breast cancer in adult women, and the two transcripts may be involved in the development of breast cancer because of their aberrant expression in cancer tissues.

ANKRD30A and LINC00993 in Breast Cancer Subtypes
Since the expression of ANKRD30A and LINC00993 were often dysregulated in breast cancer, we further analyzed the association between their expression and molecular features of breast cancer. Using the TCGA-BRCA data collection, we integrated the normalized breast cancer RNA-seq data and the corresponding clinical characteristics, including but not limited to ER status (IHC), PR status (IHC) and human epidermal growth factor receptor 2 (HER2) status (IHC and FISH). The expression of ANKRD30A was available in 506 cases with confirmed ER, PR, and HER2 status, while 780 cases were available for LINC00993 analysis. We found that the expression levels of ANKRD30A and LINC00993 were significantly higher in Luminal (ER+/PR+/Her2±) breast cancer tissues compared with HER2 subtype (ER-/PR-/HER2+), while their expression levels in TNBC subtype (ER-/PR-/HER2-) were the lowest ( Figure 5). Specifically, the log2 median expression levels of ANKRD30A were 3.888 in luminal breast cancer tissues, 1.239 in HER2 subtype, and −0.085 in TNBC cases, while the values for LINC00993 were 4.0830, 0.8940, and −5.4470, respectively. These results, consistent with our previous findings, [24] demonstrated the expression levels of ANKRD30A and LINC000993 were significantly correlated with ER status, and considerably downregulated in TNBC.

Prognostic Values of ANKRD30A and LINC00993
Lastly, we examined the values of ANKRD30A and LINC00993 for predicting prognosis of patients with breast cancer using the Kaplan-Meier Plotter (Figure 6). For ANKRD30A, its high expression in breast cancer tissues indicated better relapse-free survival

Previous Methods for the Identification of Tissue-Specific Markers
Different methods were used to identify tissue-or diseasespecific markers, such as the serological analysis of recombinant tumor cDNA expression libraries (SEREX) [31][32][33] or using bioinformatics tools. However, there were few potential limitations of previous methods. For the SEREX technique, some important tissue-specific markers may be omitted due to the lack of IgG responses or sample heterogeneity, [34] such as the omission of the prostate specific antigen (PSA). [35,36] For bioinformatics approaches, to generate more reliable results, strategies need to be updated after the publication of several milestone projects, such as the GTEx project [37] in 2015 and the RIKEN FANTOM5 project [38] in 2017.

Updated Strategy for the Identification of Breast-Specific Markers
In this study, taking breast-specific markers as an example, we present an updated strategy for the searching of tissue-specific biomarkers, especially for potential cancer-related markers. Briefly, RNA-seq data of the GTEx, the illumina body map, and the RIKEN FANTOM5 project were analyzed for the screening of transcripts with potential tissue specificity. Next, annotations of selected markers were obtained from the Ensembl Genome Browser and the TCGA database, including gene type, chromosome location, adjacent genes, mutations, and copy number variations in breast cancer. The expression of candidate transcripts in common human cancer tissues was illustrated using the GEPIA database, and their potential associations with molecular features of breast cancer were evaluated using data from the TCGA. Lastly, the role of the chosen marker for prognosis prediction was assessed with the Kaplan-Meier Plotter.

Feasibility of our Bioinformatics Strategies
By reviewing previous studies regarding ANKRD30A, we partly proved the feasibility of our bioinformatics methods for the search of breast-specific transcripts. ANKRD30A, also known as NY-BR-1, antigen B726P, or ENSG00000148513, was first identified by the SEREX method in a metastatic breast cancer patient in 2001. [32,39] It has 37 exons and encodes a protein of 150 to 160 kDa. [40,41] The expression of ANKRD30A in different human tissues has been examined with tissue microarray, immunohistochemical staining, or RT-PCR in several studies with hundreds of samples. [40,[42][43][44][45] Apart from the male-specific testis and prostate tissues, ANKRD30A can only be detected in normal breast or breast cancer tissues. Hence, it was generally considered as a differentiation antigen of the mammary gland. [40] In addition, the expression of ANKRD30A in breast cancer was found to be significantly correlated with ER status and often markedly down-regulated in TNBC cases. [42][43][44] Due to the breast specificity and the encoding of HLA-A2 restricted CD8+ T cell epitopes, ANKRD30A was also considered as an ideal target for the immunotherapy of breast cancer. [41,46,47] For LINC0099, little is known about its breast specificity. However, considering the expression of LINC00993 is strongly correlated with ANKRD30A, it is still reasonable to regard it as a potential biomarker with breast specificity, but further evidence is needed before a definitive conclusion could be drawn.

Clinical Values of ANKRD30A and LINC00993 for Tissue-and Blood-Based Testing
There are some potential clinical applications of ANKRD30A and LINC00993 for tissue-and blood-based testing because of their breast specificity. For example, ANKRD30A could serve as a useful marker to improve the diagnostic accuracy of sentinel lymph nodes, [48][49][50] and it was reported that its positive predictive value was 83.8% and the negative predictive value was 90.9% in 314 SLN sections of 150 patients. [49] Besides, as is shown in this study and another study on TNBC, the abundance of ANKRD30A in breast cancer tissues may be used for better prognosis prediction. [51] Moreover, based on data from the human protein atlas, the expression of ANKRD30A should be extremely low in 19 types of blood cells in a breast-cancer-free population. [52] Therefore, like serum PSA for prostate cancer screening or Tg for post-thyroidectomy surveillance, ANKRD30A may be valuable for breast cancer early diagnosis and monitoring recurrence by detecting its expression in blood. However, to our best knowledge, currently, there is not any evidence to support ANKRD30A or LINC00993 is detectable in blood or specific for breast cancer as serum markers.

Potential Clinical Values of ANKRD30A and LINC00993 for Circulating Tumor Cell Analysis
The potential values of ANKRD30A and LINC00993 for circulating tumor cell (CTC)-based liquid biopsy should also be mentioned. Tissue-specific markers could be utilized to increase the sensitivity or specificity of CTC detection and identify CTCs without a known site of origin. [1] For instance, the inclusion of the colon epithelium marker CDX2 in CTCs enrichment could improve the detection rate of CTCs for colorectal cancer. [53,54] Likewise, thyroid transcription factor 1, which is significantly abundant in thyroid and lung, was considered as a feasible biomarker to confirm non-small-cell lung cancer CTCs. [55] According to the data of one single-cell RNA sequencing study, [55] the expression of ANKRD30A and LINC00993 was detectable in CTCs of metastatic breast cancer patients. Therefore, their expression status may be helpful to identify the breast origin of CTCs or improve the enrichment sensitivity of breast cancer CTCs. However, due to insufficient evidence at the current stage, further evidence is needed to demonstrate the value of the two markers for CTCs analysis.

Poor Understanding of Identified Transcripts
Interestingly, little is known about the majority of the potential breast tissue-specific transcripts identified in our study. For example, only 19/96 of these transcripts were annotated with information regarding gene mutations or copy number alterations in the TCGA database. Moreover, we originally intended to perform the Gene Ontology [56] and the Kyoto Encyclopedia of Genes and Genomes [57] analysis to obtain annotations of these genes in genomes, cellular components, biological functions, diseases, pathways or drugs. However, a reliable result cannot be generated because more than half of them were unable to be mapped in these databases. This may partly due to the fact that about 80% of the breast-specific genes are ncRNAs, which were initially regarded as "junk" in the human genome. [58,59]

Conclusion
In conclusion, to our best knowledge, we identified a breastspecific transcript profile for the first time. Based on our data, only a small proportion of transcripts in human genome (0.19% in this study) was breast-specific, with most of which were poorly understood. We also presented an updated and feasible bioinformatics strategy for the identification of tissue-specific markers. Moreover, we emphasized the significance of ANKRD30A and LINC00993 in breast cancer, not only because of their breast specificity, but also their correlation with ER, TNBC, prognosis, and potential clinical values. We believe our data could provide new insights into the research of tissue-specific biomarkers, especially for cancer-related tissue-specific markers.

Experimental Section
Breast-Specific Transcripts in Human Tissues: All 1352 homo sapiens experiments in the expression atlas database (https://www.ebi.ac.uk/gxa/ experiments?species=homo%20sapiens) [19] were screened for the search of potential breast specific transcripts in human tissues. Only experiments that examined gene expression in healthy adult human tissues and including data in breast tissues were eligible. The GTEx (E-MTAB-5214), the illumina body map (E-MTAB-513), and the RIKEN FANTOM5 (E-MTAB-3358) were the only three projects that met the eligibility criteria, involving the expression of almost 49311 transcripts across 88 human tissues. The normalized RNA-seq data of the three projects in TPM were integrated for further analysis. Data in fetal tissues and male-specific tissues were excluded (see File S5, Supporting Information). A potential breast tissue transcript was defined as follows: 1) its expression in non-breast tissues is less than 1 TPM; 2) its expression in breast tissues is 1 TPM or more; 3) its expression in breast tissues is at least 5-times greater than the highest in non-breast tissues.
Ensembl Genome Browser: Ensembl [26] (http://www.ensembl.org/ index.html) is a genome browser for vertebrate genomes that annotates genes, computes multiple alignments, predicts regulatory function, and collects disease data. Gene names of included breast-specific transcripts were searched in the Ensembl Genome Browser to obtain annotations, including Ensembl gene IDs, chromosome locations, gene types, and adjacent genes.
GEPIA Database: The GEPIA database (http://gepia.cancer-pku.cn/ index.html) is an interactive web server for analyzing the RNA sequencing expression data of 9736 tumors and 8587 normal samples from TCGA database and the GTEx project. [27] The expression of ANKRD30A and LINC00993 in 33 types of human cancer and paired normal tissues were illustrated with the automatically generated graphs using the GEPIA database. Expression correlations between transcripts of interest were also evaluated using the GEPIA database. Spearman correlation coefficient was calculated based on gene expression data in breast cancer and paired normal tissues from the TCGA-BRCA project as well as in normal breast tissues from the GTEx project. The strength of the correlation was determined with the following cut-offs: 0.00-0.19 for "very weak," 0.20-0.39 for "weak," 0.40-0.59 for "moderate," 0.60-0.79 for "strong," and 0.80-1.0 for "very strong." p-value < 0.05 was considered statistically significant.

TCGA
Database: TCGA (https://www.cancer.gov/about-nci/ organization/ccg/research/structural-genomics/tcga) provides public available genomic, epigenomic, transcriptomic, proteomic, and paired clinical characteristics data of 1098 breast cancer cases, including but not limited to ER status (IHC), PR status (IHC), and HER2 Status (IHC and FISH). The association between selected transcripts and clinical features of breast cancer was analyzed with the TCGA breast cancer RNA-seq data and the matching clinical features file following the standard TCGA workflow. [28] Patients with the unknown or ambiguous status of ER, PR, or HER2 were excluded. The two-sample t-test was performed to determine the difference.
Kaplan-Meier Plotter: The Kaplan-Meier Plotter [29] (http://kmplot. com/analysis/index.php?p=service&cancer=breast) provides metaanalysis based tools to evaluate the prognostic value of potential survival biomarkers with data from the Gene Expression Omnibus, [30] the European Genome-phenome Archive, and the TCGA database. The Kaplan-Meier survival analysis of ANKRD30A and LINC00993 was performed using the Kaplan-Meier Plotter database. Briefly, patients were split by the trichotomization method based on the specific gene expression value; those greater than the upper tercile value were considered as "high expression," and similarly, those less than the lower tercile value were considered as "low expression." The hazard ratio with a 95% CI and the corresponding log-rank p-value were automatically calculated.
Data Analysis and Others: Gene expression data were analyzed using the Microsoft Excel software (version 2019, Microsoft, USA). Statistical analysis was performed using the Stata software (version 16.0, StataCorp, USA). Methods in detail were described in the File S6, Supporting Information.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.