Multiple cancer type classification by small RNA expression profiles with plasma samples from multiple facilities

Abstract Liquid biopsy is expected to be a promising cancer screening method because of its low invasiveness and the possibility of detecting multiple types in a single test. In the last decade, many studies on cancer detection using small RNAs in blood have been reported. To put small RNA tests into practical use as a multiple cancer type screening test, it is necessary to develop a method that can be applied to multiple facilities. We collected samples of eight cancer types and healthy controls from 20 facilities to evaluate the performance of cancer type classification. A total of 2,475 cancer samples and 496 healthy control samples were collected using a standardized protocol. After obtaining a small RNA expression profile, we constructed a classification model and evaluated its performance. First, we investigated the classification performance using samples from five single facilities. Each model showed areas under the receiver curve (AUC) ranging from 0.67 to 0.89. Second, we performed principal component analysis (PCA) to examine the characteristics of the facilities. The degree of hemolysis and the data acquisition period affected the expression profiles. Finally, we constructed the classification model by reducing the influence of these factors, and its performance had an AUC of 0.76. The results reveal that small RNA can be used for the classification of cancer types in samples from a single facility. However, interfacility biases will affect the classification of samples from multiple facilities. These findings will provide important insights to improve the performance of multiple cancer type classifications using small RNA expression profiles acquired from multiple facilities.


| INTRODUC TI ON
Globally, the number of cancer patients has been increasing.
It is important to detect and treat cancer at an early stage because the survival rate decreases significantly as the disease progresses. 1,2 Various methods are used for cancer screening, including computed tomography, endoscopy, X-ray, and PET. These screening tests are effective in reducing cancer deaths. [3][4][5] However, undergoing numerous screening tests is physically hard, time-consuming, and costly for patients.
As a solution to these problems, liquid biopsy is expected to be a promising cancer-screening method because of its low invasiveness and ability to detect multiple cancer types simultaneously.
Circulating tumor DNA (ctDNA) and small RNAs (microRNA [miRNA] and PIWI-interacting RNA [piRNA]) are the focus of cancer screening tests. [6][7][8][9][10] In a study of multiple cancer classification using ctDNA, 6 eight types of cancers were successfully classified, with a median sensitivity of 70% and a specificity of over 99%. However, the classification performance of stage 1 was lower than that of stages 2 and 3. The performance of cancer classification by miRNAs has been reported. In gastric cancer, the sensitivity and specificity were 73% and 89% 11 ; in pancreatic and biliary tract cancers, they were 82% and 97% 12 ; in breast cancer, they were 97% and 83% 13 ; in cervical cancer, they were 84% and 90% 14 ; and in prostate cancer, they were 95% and 87%. 15 In addition, several studies have reported that miR-NAs can detect even early-stage cancers with high sensitivity, and it is expected that miRNAs will be useful as an early cancer screening test. [14][15][16][17][18][19] However, issues remain for miRNA-based multiple cancer type screening tests. First, the possibility of identifying cancer types by collecting samples from multiple facilities has not been determined.
Second, there have been reports that small RNAs specific to cancer types have been identified, but the variety of small RNAs is not consistent among reports. 17,[20][21][22][23][24][25] To overcome these issues, we collected cancer and healthy control samples from multiple facilities with a standardized protocol. Then, the classification accuracy was evaluated by machine learning models constructed with the small RNA expression profiles.  Table S1. Informed consent was obtained from the participants.

| Plasma sample collection
Blood samples were collected in vacuum tubes with EDTA-2Na and refrigerated within 30 min. The blood samples were centrifuged only once at 1,500-1,900 ×g for 10 min within 12 h of venipuncture, and plasma fractions were collected. The plasma samples were stored at −80°C until further analysis, and more than two freeze-thaw cycles were avoided.

| Sequence data acquisition
To avoid bias toward certain cancer types in a particular experimental batch, data acquisition was conducted under the following conditions. First, a batch consisted of at least three categories of samples.
Second, the number of samples from a single category should be <50% of the total.

| Small RNA expression analysis
The raw sequence data were processed using fastp (version 0.19.6) 26 to remove the amplification adapter sequence. The unique molecular index (UMI) sequences and the probe adapter were trimmed using umi_tools (version 1.0.1). 27 Mapping was performed using Bowtie (1.3.0) with reference to the human genome reference (hg19). The following options were applied: -n0, -v0, -l18, -k1, and --best. After sorting with SAMtools (version 1.7), 28

| Machine learning for multiple cancer classification
XGBoost was used to classify the cancer types. 35

| Statistical analysis
The correlation analysis, principal component analysis (PCA), Cohen's d value and box plots were performed using the statistical analysis software R (version 3.6.3). 38 The effect size was calculated from Cohen's d value.

| Calculation of the hemolysis index
The hemolysis index was calculated from the small RNA expression counts of each sample using miR-23a-3p for the miR-451a express ion levels. 37,39,40 The following formula was used: Hemolysis index = Log 2 (miR-451a) − Log 2 (miR-23a-3p).

| Feature selection based on principal component analysis loading factors
For the first and second principal components in the PCA, the factor loadings of each miRNA were calculated, the threshold values were set, and the small RNAs were removed from the data according to the threshold values. The small RNA lists for 10%, 15%, 20%, and 25% removal are shown in Table S2.

| One-dimensional nearest neighbor matching
The x-axis (PC1) and y-axis (PC2) of each sample in the PCA were extracted, and the matching rate was defined as the probability that both sides of an arbitrary point projected on the x-axis or y-axis were in the same class.

| Study design
Cancer and healthy control plasma samples were collected from multiple facilities to construct and validate the cancer type classification model. A total of 2,475 cancer samples and 496 healthy control samples were collected. Information on the age, sex, and stage of the collected samples is summarized in Table 1.
For these 2,971 samples, the expression levels of small RNAs (miRNAs and piRNAs) were measured by next-generation sequencing (NGS), and 583 types of small RNAs were detected.
Thirty-four imprinted genes located on chromosome 14q32.31 were removed from our analysis because these genes were differentially expressed in healthy controls. A total of 534 small RNAs were used to construct the cancer classification models. First, samples of the multiple cancer types were collected from a single facility, and the F I G U R E 1 Schematic of cancer type classification model establishment. Plasma samples were obtained from 2971 subjects, including 496 healthy controls (CTR) and 2475 cases. The small RNA sequences were matched to a database, and 583 genes were identified. Imprinted genes and genes with low expression were removed, and 534 small RNAs were retained as features for the machine learning models. Cancer classification Models 1-5 were constructed using samples obtained from five facilities, and Model 6 was constructed using samples from all facilities performance of the cancer type classification in each of the five different facilities was examined using Models 1-5. Next, samples collected from multiple facilities were used to evaluate the performance of the classification model (Model 6) for multiple cancer types to determine whether the same results could be obtained across facilities (Figure 1).

| Verification of the reproducibility of data acquisition
To identify cancer type-specific expression profiles, it is necessary  Figure S1). These results verified that the data acquisition was highly reproducible.

| Classification performance using each singlefacility sample
The small RNA expression profiles of samples from multiple cancer types that were collected from five single facilities were obtained to evaluate the classification performance (Models 1-5). The classification models were constructed by downsampling with the fewest number of samples to equalize the cancer types. The facilities, cancer types, and number of samples used to construct the models are summarized in Table 2. The cancer type classification performance outcomes of Models 1-5 are shown in Figure 3. The AUCs of each model are shown in Figure S2. The sensitivities of Model 1 using COL, ESO, LBI, PAN, GAS, and CTR samples collected at KB were 69%, 58%, 47%, 52%, 45%, and 73%, respectively. These were higher than the sensitivity of 17% for six-class random classification. The AUC for the KB samples was 0.89 (95% confidence interval [CI], 0.88-0.89), as shown in Figure 3A. The sensitivities of Model 2 using BRE, COL, LUN, PAN, and GAS samples collected at OC were 83%, 30%, 38%, 34%, and 50%, respectively. These were higher than the sensitivity of 20% for five-class random classification. The AUC for the OC samples was 0.77 (95% CI, 0.75-0.78), as shown in Figure 3B. The sensitivities of Model 3 using BRE, LUN, PRO, and GAS samples collected at KC were 56%, 42%, 52%, and 56%, respectively. These were higher than the sensitivity of 25% for four-class random classification. The AUC was 0.76 (95% CI, 0.74-0.78), as shown in Figure 3C. The sensitivities of Model 4 using COL, ESO, PAN, and GAS samples collected at NC were 31%, 52%, 38%, and 47%, respectively. These were higher than the sensitivity of 25% for four-class random classification. The AUC was 0.67 (95% CI, 0.63-0.70), as shown in Figure 3D. The sensitivities of Model 5 using BRE, COL, LBI, and LUN samples collected at JD were 59%, 40%, 48%, and 52%, respectively. These were higher than the sensitivity or 25% for four-class random classification. The AUC was 0.77 (95% CI, 0.74-0.79), as shown in Figure 3E. Based on the above classification performance outcomes of the five facilities, Models 1-5 using samples from a single facility showed higher classification performance outcomes than random classification. These data suggest that a machine learning model with small RNA profiles can classify multiple cancer types at least within a single facility.
Next, we evaluated the influence of stage on performance in cancer classification. Stages 0, 1, and 2 were grouped together as early stages, and stages 3 and 4 were grouped together as advanced stages. We selected KB and OC based on sample numbers and compared their performance outcomes (Table S3). KB samples showed the following sensitivities: COL (early 58%, advanced 44%), ESO (early 42%, advanced 67%), GAS (early 71%, advanced 51%), PAN (early 53%, advanced 65%), and LBI (early 56%, advanced 54%), as shown in Figure S3A. OC samples showed the following sensitivities: COL (early 57%, advanced 57%), LUN (early 67%, advanced 31%), and PAN (early 21%, advanced 53%), as shown in Figure S3B. Early stages showed equal to or higher sensitivities than advanced stages in five out of eight classifications.

| principal component analysis using multiplefacility samples
To explore facility-specific biases, we performed PCA using multiplefacility samples. The color-coded PCA by cancer type, facility, and data acquisition period are shown in Figure 4A, Furthermore, other RBCs-derived miRNAs, such as hsa-miR-451a, 40 hsa-miR-486-5p, 44 hsa-miR-4732-3p, 45 and hsa-miR-363-3p, 46 were enriched in higher rankings in PC1 factor loadings (Table S4). The formation of clusters at different facilities was thought to be due to differences in the degree of hemolysis in the samples. Therefore, the hemolysis index of each sample was colored in the plot. The results showed that the increase or decrease in PC1 was linked to the hemolysis index ( Figure 5A). Then, the distributions of the hemolysis index at collection facilities were compared. The distribution range of the hemolytic index was different among facilities ( Figure 5B). In fact, the distribution of the hemolytic index in KC was different from that in KH, and the clusters of both in PCA were also clearly separated ( Figure 5C). The distribution of the hemolytic index in KB was near that in KC, and the clusters of both in PCA were also close ( Figure 5D).
These results suggest that PC1 reflects the effect of hemolysis and that there were differences in the hemolysis index among the facilities, which led to the formation of clusters.

F I G U R E 2
Evaluation of data acquisition reproducibility using reference samples. The distribution of correlation analysis using small RNA expression profiles measured from the same reference sample is shown. Reference samples were always measured at fixed positions of the 96-well plate. (Reference samples 1 (A) and 2 (B) are shown. Each data acquisition experiment was assigned a number, and the correlation coefficients between experiments were color-coded in matrices. The histogram of the correlation coefficients of Reference sample 1 is shown in (C) and that of Reference sample 2 is shown in D)

Model Facility
Category and number of samples

| Classification performance using multiplefacility samples
It was suggested that the hemolysis index and data acquisition period could affect the cancer classification performance. Therefore, we attempted to reduce these effects by removing the small RNAs and 98 for 25% removal ( Figure 6E). The retained small RNAs are listed in Table S2. As the removal rate increased, these clusters converged into a group ( Figure 6E).
To quantitate the change in cluster formation with/without the removal process, the probability that the nearest neighbor points of an arbitrary point were in the same class was calculated. In PC1, the matching rates focusing on cancer type were almost the same with/without small RNA removal, but the matching rates focusing on facility were reduced: 5.0% (without removal), 6.2% (10% removal), 5.7% (15% removal), 3.6% (20% removal), and 2.6% (25% removal). The matching rates of the data acquisition period were reduced to 5.5% (without removal), 3.0% (10% removal), 3.0% (15% removal), 2.4% (20% removal), and 2.2% (25% removal), as shown in Figure 6F. The reducing effect was largest under 25% removal. In PC2, the matching rates of cancer types and facilities were almost the same with/without small RNA removal, but the matching rates of the data acquisition period were reduced to 7.6% (without removal), 13.0% (10% removal), 11.0% (15% removal), 4.1% (20% removal), and 4.3% (25% removal), as shown in Figure 6G. The consistency of cancer type-associated small RNAs between our classification models and previously reported models was investigated (Table S5). To avoid interfacility variability, we extracted cancer type-associated small RNAs from Models 1-5 by quantifying their contribution. ESO, GAS, COL, LBI, PAN, and BRE were selected because these cancer types were tested in more than two facilities.
We narrowed down to the common miRNAs in multiple facilities and compared these common miRNAs with previously reported miRNAs. Two or three miRNAs in our common miRNAs were also reported in previous studies of cancer type-associated miRNAs. Our results were consistent with previous reports to some extent.
Regarding the effect of hemolysis on the classification performance, we collected samples with a standardized sample collection protocol (see Materials and Methods 2.2). Therefore, we expected that the interfacility variation in the hemolysis index would not exceed the intrafacility variation. However, it was found that the interfacility variability exceeded the intrafacility variability ( Figure 5B). Thus, the differences in the hemolysis index among the facilities became noise, and the accuracy of cancer type classification decreased. At the same time, the differences between facilities could be misinterpreted as differences in cancer types, resulting in a higher accuracy than the values with minimized hemolysis influences ( Figure 7A,B). As a result, the difference in the hemolysis index among facilities not only affected the performance of the cancer classification but also suggested the possibility that the differences among the facilities could be confounded by the differences in the cancer types.
Although the classification performance was higher than ran-