Two novel qualitative transcriptional signatures robustly applicable to non‐research‐oriented colorectal cancer samples with low‐quality RNA

Abstract Currently, due to the low quality of RNA caused by degradation or low abundance, the accuracy of gene expression measurements by transcriptome sequencing (RNA‐seq) is very challenging for non‐research‐oriented clinical samples, majority of which are preserved in hospitals or tissue banks worldwide with complete pathological information and follow‐up data. Molecular signatures consisting of several genes are rarely applied to such samples. To utilize these resources effectively, 45 stage II non‐research‐oriented samples which were formalin‐fixed paraffin‐embedded (FFPE) colorectal carcinoma samples (CRC) using RNA‐seq have been analysed. Our results showed that although gene expression measurements were significantly affected, most cancer features, based on the relative expression orderings (REOs) of gene pairs, were well preserved. We then developed two REO‐based signatures, which consisted of 136 gene pairs for early diagnosis of CRC, and 4500 gene pairs for predicting post‐surgery relapse risk of stage II and III CRC. The performance of our signatures, which included hundreds or thousands of gene pairs, was more robust for non‐research‐oriented clinical samples, compared to that of two published concise REO‐based signatures. In conclusion, REO‐based signatures with relatively more gene pairs could be robustly applied to non‐research‐oriented CRC samples.


| INTRODUC TI ON
With technological advancement and reduced cost, transcriptome sequencing (RNA-seq) has become the primary technology for gene expression measurements. 1,2 This platform generally requires input of pre-selected high-quality RNA samples, designated as researchoriented clinical samples, to obtain reliable research results. For example, adequate amounts of RNA extracted from fresh-frozen (FF) samples containing at least 60% or 70% of the tumour nuclei, 3 and high RNA integrity (RIN) scores, such as RIN > 6 or RIN > 7, 4 are required. However, realistically, millions of samples obtained in hospitals and tissue banks worldwide are considered to be nonresearch-oriented clinical samples with low-quality RNA, 5 characterized by RNA degradation or fragmentation, 6 low amounts of RNA, 7 low tumour purities, 8 or above multiple features simultaneously. Although amplification technology can be used for these samples containing low amounts of RNA, it may introduce amplification bias. 9 Thus, for these non-research-oriented clinical samples, the accuracy of gene expression measurements would be significantly challenged. As a result, the transcriptional signatures based on risk scores that are summarized from the expression levels of signature genes are rarely applied to these non-research-oriented samples.
Rather, these samples are primarily limited to pathological or immunohistochemical analysis 10 ; however, these samples contain valuable pathological information and follow-up data, 11 which are precious resources in disease-related research. Therefore, it is imperative that transcriptional signatures should be developed that are applicable to non-research-oriented clinical samples with low-quality RNA.
Colorectal carcinoma (CRC) is one of the most common malignant tumours with high morbidity and mortality, 12 which is mainly transformed from acquired pre-cancerous lesions. Inflammatory bowel disease (IBD), including ulcerative colitis (UC) and Crohn's disease (CD), is a main type of pre-cancerous colorectal lesions, which could result in dysplasia, eventually develops and progresses to CRC. 13 Some studies have been reported that long-term exposure to chronic inflammation is the primary risk factor for CRC pathogenesis. 14 Meanwhile, some patients with stage II and III CRC after surgery treatment commonly have relapse risk. Currently, it is essential to timely discriminate early CRC patients from patients with inflammation and accurately predict the recurrence risk for stages II and III CRC patients after surgery. 15,16 However, established noninvasive tests, such as the guaiac-based faecal occult blood test and faecal haemoglobin, usually lack proper sensitivity and specificity for early diagnosis. Carcinoembryonic antigens, CA125 and CA19.9, which have already been applied into clinical practice, are not highly promising diagnostic or prognostic targets for personalized medicine. 17 Therefore, there is a critical need to develop highly robust and reliable biomarkers for diagnosis and prognosis of CRC patients.
Specific methods, such as TSP (top scoring pairs), 18 k-TSP 19 and other adjusted methods, 20  and so on. 23 Nevertheless, gene expression measurements could be widely and significantly affected by the low-quality non-researchoriented clinical samples. If the expression measurements of one or several signature genes are severely influenced, and even become zero, the performance of these concise REO-based signatures with several gene pairs may be seriously weakened or even rendered unfeasible. In consideration of the rapid development and decreasing cost of high-throughput sequencing technology, we proposed that the REO-based signatures should include relatively more gene pairs, potentially even hundreds or thousands of gene pairs, to obtain robust performance for the non-research-oriented clinical samples with low-quality RNA.
To this end, herein we analysed 45 stage II CRC non-researchoriented samples that were formalin-fixed paraffin-embedded (FFPE) samples without location pre-selection or pre-purification of tumour cells, measured using RNA-seq, and evaluated the influences of low-quality samples on their gene expression measurements. For these widely preserved non-research-oriented clinical samples, two REO-based signatures with relatively more gene pairs were developed and robustly applied to diagnosis and recurrence prediction of individuation. It had great value for clinical translational applications.

| Samples and data measurement
A total of 45 stage II CRC FFPE samples, denoted as CRC45, including 24 non-relapse and 21 relapse samples, were collected from FFPE tissue blocks which have been preserved at room temperature for about 6 years. The RIN scores, overall alignment rate of sequencing reads and clinical information of the 45 samples are shown in Table S1. Each FFPE tissue block without location pre-selection or pre-purification of tumour cells was cut into 6-10 slides of approximately 5 μm thickness. Then, the slides with frozen preservation were directly sent to the sequencing company. The whole process was about 72 hours. Next, according to the manufacture's protocol, total RNA was isolated from each sample using the RNAprep pure FFPE kit (Tiangen Biotech) and the quality of the RNA was assessed by electrophoresis on an Agilent 2100 Bioanalyzer system (Agilent Technologies). Ribosomal RNA was removed using the Globin-Zero Gold rRNA Removal kit & directional library, and the stranded RNA-seq library was constructed using the NEB Next ® Ultra™ RNA Library Prep kit. Paired-end sequencing (2 × 150) was performed on the Illumina HiSeq X Ten system (Illumina). Subsequently, the generated raw RNA-seq (FASTQ) files were pre-processed using the Trimmomatic, 24 and the reads were aligned to the reference genome (GRCh38) using HISAT2. 25

| Evaluation of REO-based cancer features in non-research-oriented clinical samples
For a gene pair with two genes, for example gene i and gene j, the REO was denoted as G i > G j (or G i < G j ), where the expression measurement of gene i was higher (or lower) than that of gene j within a sample. A gene pair exhibiting the same REO pattern in most samples from one group, for example 95% or 99%, was defined as a stable gene pair. A stable gene pair with an opposite REO pattern between two groups was defined as a stable opposite gene pair. Using a hypergeometric cumulative distribution model, we evaluated whether a specific REO pattern, for example G i > G j in one group of samples, was significantly opposited into the pattern where m was the number of stable or significant opposite gene pairs, and k was the number of gene pairs that maintained the cancer features in a non-research-oriented sample. Notably, the gene pair was removed if two expression measurements of the gene pair both carried a value of zero. Additionally, the gene pairs containing a gene that was not measured were also removed.

| Identification of an REO-based signature for the early diagnosis of CRC
The stable opposite gene pairs between stage I FF CRC samples and FF normal samples were selected as candidate gene pairs for the REO-based signature for the early diagnosis of CRC. The gene expression profiles were then transformed into gene expression rank profiles according to their expression levels. All genes were sorted in ascending order, and the rank difference (RD) for a gene pair (i, j) in sample t was calculated as: where R ti and R tj were the expression ranks of gene i and gene j in sample t, respectively. Next, the RD for each gene pair was calculated in each stage I sample or normal sample, respectively.
(1) ratio = k∕m  All statistical analyses were performed by R 3.6.0. The R scripts were provided in Supplementary scripts.

| Protein-protein interactions and functional enrichment analysis
A regulatory protein-protein interaction (PPI) network for the genes of interest was constructed based on the integrated data from the

| Quality evaluation of non-research-oriented clinical samples
Here, we firstly evaluated the RNA qualities of the 45 non-researchoriented FFPE samples of stage II CRC measured in our laboratory.
Compared to the requirement of high-quality FF samples with RIN scores of 6.0 for RNA-seq, the RIN scores of total RNA in these nonresearch-oriented clinical samples ranged from 2.1 to 2.7. These  Figures 1B and S1). On the contrary, the retention rate of stable and significant opposite gene pairs (G i > G j ) in the normal samples were only kept about 5% and 20%, respectively. These results indicated that most of the REO-based cancer features were well preserved in the non-research-oriented clinical samples.

| Development of an REO-based signature for the early diagnosis of CRC
Recently, we have reported a concise REO-based signature consisting of seven gene pairs for discriminating early CRC from IBD samples, including UC and CD samples. 16 Table S2). The cut-off of the voting rule was set to 60% as the 73 colitis samples in the GSE72819 dataset were all correctly assigned to the non-cancer group. For a sample, if the retention ratio of the specific cancer pattern (G i > G j ) was ≥60%, the sample was labelled as CRC; otherwise, it was labelled as non-cancer (see Section 2). In the training datasets, 106 CRC and 51 normal samples were all correctly assigned to the CRC group and the non-cancer group, respectively.

| Identification of an REO-based signature for predicting post-surgery relapse risk of stage II and III CRC
The process for developing an REO-based signature for predicting the post-surgery relapse risk of stage II and III CRC was summarized in Figure 2B. First, 619 CRC samples with survival information were selected from TCGA. 27 Based on the hypothesis that the relapse of stage II and III CRC could be attributed to micro-metastasis, 31 Table S3). Next, the sorted significant opposite gene pairs were categorized into six groups for six candidate signatures based on the rule: the nth candidate signature consisted of 3500 + 1000 * (n − 1) gene pairs. Specifically, the six candidate signatures included 3500, 4500, 5500, 6500, 7500 and 7849 gene pairs, respectively. For example, the first candidate signature was assigned the top 1 to 3500 gene pairs, and the top 3501 to 8000 gene pairs were assigned to the second candidate signature and so on. The accuracy of classification for each candidate signature was evaluated in the training dataset. The cut-off was set to 49% in comparison with 50% and 51%, as the higher accuracy for stage IV samples. For a given sample, if more than 49% of the gene pairs in the signature contained the specific REO pattern for relapse, the sample was labelled as high-risk relapse, and vice versa (see Section 2).
Using the cut-off of 49%, the accuracies of two candidate signatures were found to both be above 80% in stage I and stage IV.  Table S4).
For 172 CRCs with DFI time obtained from TCGA, 80 samples and 92 samples were classified by the signature as high relapse risk and low relapse risk, respectively, of which the former had significantly worse DFI survival than the latter (univariate Cox, HR = 2.78, 95% CI = 1.15-6.72, log-rank test P = 0.018, Figure 3A). And there was no separate residual for the high relapse risk and the low relapse risk groups in the 172 CRCs from TCGA (the Schoenfeld residuals test, P = 0.25). The retention rates for 172 CRC samples were show n in Figure S3. Meanwhile, similar results were also observed within the 5-year (univariate Cox, HR = 3.42, 95% CI = 1.34-8.74, log-rank test P = 0.0064, Figure 3B) and 3-year DFI time (univariate Cox, HR = 3.65, 95% CI = 1.29-10.29, log-rank test P = 0.0090, Figure 3C). Moreover, our signature was able to robustly predict the post-surgery relapse risk in stage II CRC ( Figure 3D-F). Unfortunately, for stage III patients, the significant difference of DFI time between high and low relapse risk groups was not observed, as shown in Figure 3G-I. Actually, one patient with 85 years who had cancer relapse in about four months was classified into the low relapse risk group. After excluding the sample, the DFI time between the low and high relapse risk groups in stage III CRC patients was moderately different (univariate Cox, HR = 5.2, 95% CI = 0.65-41.71, log-rank test P = 0.083, Figure S4A). Moreover, the 5-year and 3-year DFI time was significantly different ( Figure S4B,C). Meanwhile, we also performed multivariable cox proportional-hazards regression analyses for the CRC relapse signature with 4500 gene pairs. Because the MSI information of 37 patients were lost, 135 II-III colorectal cancer patients, including 27 patients with MSI-high, 21 patients with MSIlow and 87 patients with MSI-stable, were evaluated. After correction for stage, gender, age and MSI, there was a modest difference Dataset Accuracy/sample size

Normal IBD Cancer
The performance of the signature in the training datasets  Figure 4B), the results further demonstrated that our signature with 4500 gene pairs had higher prognosis capacity, and more robust predictive performance in non-research-oriented samples.  Figure 5B. We also observed that seven hub DEGs (IFNG, IL2RB, IL12RB1, CCR7, XCR1, CXCR6

| The potential relapse mechanism of CRC
and NOS2) with more than ten PPI interactions were all immunerelated genes.
Furthermore, using the RankCompV2 algorithm (FDR < 0.05), 956 DEGs were detected between 24 non-relapse and 21 relapse samples measured in our laboratory, of which 42 DEGs overlapped with the above-mentioned 499 DEGs. The concordance score between the two DEG lists was as high as 85.71% (36 DEGs), which was unlikely to occur by chance (binomial test, P < 2.2 × 10 −07 ).
Meanwhile, the PPI analysis showed that the 18 DEGs directly interacted with 301 other genes (as shown in Figure S5) and these 319 genes were also significantly enriched in immune-related functional pathways, including 'T cell receptor signalling pathway' and 'B cell receptor signalling pathway' (Table S6). These results further suggest that the potential relapse mechanism of CRC might be closely related to immune dysfunction. In summary, the REO-based signature with relatively more gene pairs could be robustly applied to the non-research-oriented clinical samples containing low-quality RNA, and it holds significant value for clinical translational applications.

DATA AVA I L A B I L I T Y S TAT E M E N T
The in-house data used and analysed during the current study are available from the corresponding authors upon reasonable request.