Machine Learning‐Assisted Evaluation of Circulating DNA Quantitative Analysis for Cancer Screening

Abstract While the utility of circulating cell‐free DNA (cfDNA) in cancer screening and early detection have recently been investigated by testing genetic and epigenetic alterations, here, an original approach by examining cfDNA quantitative and structural features is developed. First, the potential of cfDNA quantitative and structural parameters is independently demonstrated in cell culture, murine, and human plasma models. Subsequently, these variables are evaluated in a large retrospective cohort of 289 healthy individuals and 983 patients with various cancer types; after age resampling, this evaluation is done independently and the variables are combined using a machine learning approach. Implementation of a decision tree prediction model for the detection and classification of healthy and cancer patients shows unprecedented performance for 0, I, and II colorectal cancer stages (specificity, 0.89 and sensitivity, 0.72). Consequently, the methodological proof of concept of using both quantitative and structural biomarkers, and classification with a machine learning method are highlighted, as an efficient strategy for cancer screening. It is foreseen that the classification rate may even be improved by the addition of such biomarkers to fragmentomics, methylation, or the detection of genetic alterations. The optimization of such a multianalyte strategy with this machine learning method is therefore warranted.


Introduction
Cancer biomarkers exist to monitor the response to treatment for a specific cancer or group of cancers. They are limited, however, with regard to sensitivity and specificity in early detection or mass screening. [1] Machine learning methods were applied for the development of predictive models, [2] such as artificial neural networks (ANNs), Bayesian networks, support vector machines (SVMs), and decision trees (DTs). Different strategies have been described for the detection of different cancer types, [3] e.g., by identifying a subset of three single nucleotide polymorphisms as key discriminators between control and breast cancer patients. [4] The use of cfDNA [5] was recently proposed by our team, among others, [6,7] for cancer screening and early detection. Existing approaches include the detection of the viral DNA of the Epstein-Barr virus in asymptomatic nasopharyngeal cancer cases, [8] to the analysis of the methylome and cfDNA methylation patterns in different cancer types. [9,10] cfDNA levels [11] and the detection of genetic alterations [12,13] have also been considered for their diagnostic potential. Other studies have associated mutation detection and protein markers. [6] Despite intensive research, however, the reliability of cfDNA-based tests have been impacted by issues regarding their sensitivity and specificity, especially for the early stages of cancer.
Our team (among others) has worked for more than a decade on the quantitative and structural characteristics of cfDNA, [14][15][16][17][18] as a means of discriminating cancer and healthy subjects. We have identified various quantitative and structural biomarker candidates, and have independently evaluated their performance in three experimental models. We subsequently proposed a machine learning method based on a predictive DT model in a large cohort of healthy and cancer individuals. We believe that cfDNA-based attempts at developing a pan-cancer test should be multianalyte, using a combination of qualitative (i.e., presence or absence of genetic or epigenetic alterations) and quantitative parameters. In order to rigorously evaluate the predictive potential of different combinations of quantitative parameters, therefore, we are convinced that building our machine learning method's proof of concept is warranted.

Determination of cfDNA Quantitative Parameters in a Xenografted Mouse Model
To determine the potential of different quantitative cfDNA parameters, we used a murine model of nude mice xenografted with SW620 colorectal cancer human cells which we had established previously. [15] This model allowed us to differentiate human tumor cfDNA from normal murine cfDNA in the same mouse. Using Q-PCR (quantitative polymerase chain reaction), we quantified nuclear and mitochondrial cfDNA of human and murine origin in 14 xenografted mice. We found a significant increase of total nuclear DNA concentrations (ng mL −1 ) (which we called Ref A 67), in tumor over normal cfDNA. The receiver operating characteristics (ROC) curve analysis showed an AUC (area under curve) of 0.91 (0.8125-1.014, 95% CI; confidence interval) (Figure 1a,b) (Sensitivity Se = 0.79; Specificity Sp = 0.86). The results were also significant when analyzing mitochondrial DNA concentrations (Ref M 67) (Figure 1c,d) and the mitochondrial to nuclear ratio (MNR) (Figure 1e,f). These two parameters decreased significantly and featured higher AUCs of 0.99 (0.963-1.016, 95% CI) and 0.96 (0.9047-1.024, 95% CI), with respective sensitivities of 0.93 and 1, and specificities of 1 and 0.86. This suggests the high potential that mitochondrial DNA concentration and the MNR have for distinguishing between cfDNA of normal and of tumor origin.

Evaluation of the Ref A 67 and the MNR in Cell Culture Supernatants
The Ref A 67 and the MNR were then tested in the supernatant of cells in culture, to assess their ability to discriminate between normal and cancer cells. The supernatant of 14 cancer and 5 normal cell lines were tested (Table S1, Supporting Information). The Ref A 67 showed a significant increase, with a high AUC of 0.97 (0.901-1.042, 95% CI) (Se = 0.86, Sp = 1) (Figure 1g,h), while the MNR showed a potential of 100% in discriminating between normal and cancer cells in this series of 19 cell lines (Figure 1i,j).

Evaluation in an Exploratory Cohort of Healthy and Cancer Subjects
Next, the quantitative parameters were validated in the plasma of a small independent exploratory cohort of 76 healthy individuals and 50 stage IV colorectal cancer (CRC) patients ( Table 1). The Ref A 67 showed a significant increase in CRC patients with an AUC of 0.81 (0.73-0.89, 95% CI) ( Figure S1a,b, Supporting Information). As for the Ref M 67, a significant decrease was observed in cancer patients with an AUC of 0.89 (0.83-0.95, 95% CI) ( Figure S1c,d, Supporting Information). The MNR then showed a decrease, with the highest AUC (AUC = 0.98; 0.96-0.99, 95% CI) between healthy subjects and stage IV CRC patients ( Figure  S1e,f, Supporting Information).

Validation of the cfDNA Parameters in a Large Retrospective Cohort
The discriminative value of these cfDNA quantitative parameters was then evaluated in a large retrospective cohort of 289 healthy individuals and 983 patients with CRC (N = 791), breast (N = 169), or other types of cancers (hepatocellular, pancreatic, ovarian, and lymphoma) (N = 23) ( Table 1). The total nuclear cfDNA concentration showed a significant increase, with a pvalue of < 0.0001 between healthy individuals and the different cancer types (CRC, breast and others) ( Figure S2a, Supporting Information). This was the case between the healthy group and CRC patients, regardless of their stage (stage 0/I/II, stage III, stage IV) ( Figure S2b, Supporting Information). An AUC of 0.83 (0.81-0.86, 95% CI) was observed for the CRC group compared to healthy subjects ( Figure S3a, Supporting Information), with a high value of 0.79 (0.76-0.83, 95% CI) for early stages 0, I, and www.advancedsciencenews.com www.advancedscience.com II ( Figure S3b, Supporting Information). The Ref A 67 showed an AUC of 0.7 (0.74-0.82, 95% CI) for the discrimination of the breast cancer group from the healthy group ( Figure S3c, Supporting Information). We then assessed the potential of the MNR. This parameter decreased significantly between healthy patients and all cancer types ( Figure S2c, Supporting Information). It showed a good discriminative potential for CRC patients (AUC = 0.78; 0.75-0.82, 95% CI) and breast cancer patients (AUC = 0.82; 0.78-0.86, 95% CI) ( Figure S4a,c, Supporting Information). A significant decrease and a high discriminative potential were also observed between the healthy group and early stage CRC patients (AUC = 0.79; 0.76-0.82, 95% CI) (Figures S2d and S4b, Supporting Information).

Cohort Age Adjustment and cfDNA Parameters' Individual Performance for CRC Detection
With the aim of maximizing discriminative power, we wanted to assess the potential of a machine learning approach, by evaluating the combination of different cfDNA variables. We first focused on the CRC group compared to healthy subjects. A bias was observed in the age distribution of the studied cohort (Figure 2a); the fraction of patients < 30 years old was constituted of healthy individuals only, and the fraction of patients >70 years old of CRC patients only. No gender bias was observed ( Figure S5, Supporting Information). For this reason, subjects between 30 and 70 years old were regrouped in five categories: ≤ 45 years old, 45-50, 51-55, 56-60, and > 60 years old (Figure 2b). The tendency of all parameters was then appreciated on a heatmap (Figure 2c) clustering patients (by columns) and parameters (by rows). The importance score of each parameter was then extracted, after the application of recursive partitioning on 500 different resampling ( Figure 2d).
Two other cfDNA parameters were studied in the plasma of the majority of the tested cohort, which were to be added to the quantitative variables. These parameters correspond to the nuclear cfDNA concentration of the fragments with a size ≥ 145 base pairs (bp) (Ref A 145 (copy nb mL −1 plasma)) or ≥ 320 bp (Ref A 320 (copy nb mL −1 plasma)). They provide information regarding the size distribution of cfDNA in the healthy and the cancer group. The Ref A 145 was significantly higher in CRC patients than in healthy individuals, with a high discriminative value (ageadjusted AUC  Table 2).
Regarding early stage (0-I-II) CRC patients versus healthy individuals, when taken individually these parameters showed lower sensitivity, specificity, and AUC, with much larger confidence intervals than those obtained when analyzing CRC patients of all stages ( Table 2).

DT Construction and Prediction Value for CRC Detection
We implemented a DT prediction model by including our five different cfDNA quantitative and structural parameters, obtained www.advancedsciencenews.com www.advancedscience.com  a, Parameters estimation for each variable based on minimum distance between corresponding ROC curve and the point (Se = 1, 1-Sp = 0) with the use of an empirical bootstrap to build their respective confidence interval. These estimations were obtained before and after age adjustment with a focus on early stages (0-I-II) colorectal cancer patients. b, Se = sensitivity; c, Sp = specificity; d, AUC = area under curve.
from Q-PCR analysis after age resampling, to examine whether their association would further increase the discriminative power ( Figure 4a and Table S3, Supporting Information). 527 CRC and healthy patients were used for the construction of the tree, and showed 0.87 specificity (0.79-0.92, 95% CI) and 0.76 sensitivity (0.67-0.85, 95% CI) ( Table 3). Performance was assessed by the cross-validation of 2000 rounds of resampling of a total of 424 patients not used for the tree construction. This validation estimated a specificity of 0.89 (0.84-0.94, 95% CI) and a sensitivity of 0.77 (0.73-0.80, 95% CI) for CRC and healthy patients' classification (Table 3). Our DT prediction model provides improved specificity and sensitivity, compared to discrimination made using only individual parameters (Table 2). When applied to early stages CRC, a relatively high Sp and Se of 0.87 (0.83-0.91, 95% CI) and 0.72 (0.67-0.76, 95% CI), respectively, were observed (Table 3), improving the quality and robustness of early stages CRC detection.

Breast Cancer and Other Cancer Types
The distribution and discriminatory power of each of our five cfDNA parameters were tested in breast cancer patients (N = 169) of stages II and III versus healthy individuals. A two-sample nonparametric one-sided Mann-Whitney test with normal approximation (n > 50) was adopted for continuous variables comparison of the populations studied ( Figure S9 and    www.advancedsciencenews.com www.advancedscience.com respectively ( Figure S10 and  Figure S10 and Table  S5, Supporting Information). In the DT construction, we adopted the same strategy for the breast cancer cohort (Figure 4b,c and

Conclusion
Despite significant statistical differences in cfDNA concentration, the large overlap in cancer and healthy subjects [17,19] severely undermines its potential as a tool for cancer screening. Alternatively, another strategy which has been considered is the detection of genetic alterations from cfDNA. [6,12,20,21] Indeed, several attempts at this have recently been implemented. However, the false positive rates observed preclude its application, mainly due to i) the detection of white blood cell-derived mutations among cfDNA variants (notably with age), and ii) the very low circulating tumor DNA (ctDNA) rate in a significant number of cancer patient plasma samples. We hypothesized that quantitative and structural indicators from total cfDNA not completely and directly associated with malignancy may compensate the eventual deficiencies of qualitative indicators, such as genetic or epigenetic alterations, especially with respect to sensitivity. We believe that the combination of both types of measures in a multianalyte approach might meet the stringent levels of reliability required for the evaluation of a screening test or an early detection test. This study aims to provide the proof of concept of using cfDNA quantitative biomarkers for such tests, and proposes a machine learning method developed specifically to that end. Drawing on our extensive work on the origins and structures of cfDNA, in particular from cancer patients, we have identified two quantitative cfDNA candidate parameters which can be used to discriminate healthy individuals and cancer patients: the Ref A 67 and the MNR, determined by specific Q-PCR, targeting both nuclear and mitochondrial sequences. In this study, we first validated these two parameters independently, in three experimental models: cell culture media, tumor xenografted mice, and plasma from an exploratory cohort of healthy and cancer subjects.
Both our own team's work [16,17] and that of others [22][23][24] have shown that cfDNA fragmentation [25] and size profiles differ in healthy and cancer patients, especially in the fractions below 60 bp, between 60 and 145 bp, between 145 and 300 bp, and higher than 300 bp. [16,26,27] For this reason, we added two parameters to our study: the Ref A 145 (nuclear cfDNA concentration of the fragments with a size ≥ 145 base pairs) and the Ref A 320 (≥ 320 bp), which provide information regarding cfDNA size distribution. The identification of these parameters originated from previous reports showing that the examination of size profile and fragmentomics [25] may help in discriminating healthy and cancer individuals. [16,26,27] Subsequently, these variables were all evaluated in a large retrospective cohort of 289 healthy individuals and 983 patients with CRC (N = 791), breast (N = 169), and other types of cancers (hepatocellular, pancreatic, ovarian, and lymphoma) (N = 23). This evaluation was done using a machine learning approach, with the variables considered both independently and in combination, after age resampling. Total nuclear cfDNA concentration increases in cancer patients compared to healthy individuals, but mitochondrial cfDNA levels are reduced. Although the cellular level of mitochondria in cancer patients is now under intense scrutiny, there remain significant discrepancies in the literature. Furthermore, little is known about the production or release of mitochondrial cfDNA into the blood stream. This makes it difficult currently to provide a definitive explanation for this observation. The Ref A 67 and the MNR showed the two highest AUCs, and therefore the highest potential for discriminating cancer patients and healthy individuals.
The implementation of a DT prediction model for the detection and classification of healthy and cancer patients showed very encouraging results even for early CRC stages (specificity of 0.89 (0.84-0.94; 95% CI) and sensitivity of 0.72 (0.67-0.76; 95% CI)). It also showed an increase in discrimination potential. These results emphasize the importance of integrating both the nuclear and mitochondrial origin of cfDNA. Each of the tested parameters was selected as a discriminating variable at a node of the DT model, except for the Ref A 320. Although a difference can be discerned in the Ref A 320 levels of colorectal cancer and healthy controls, e.g., in Figure S8  Among the many techniques available for automatic classification, we opted for a robust, high-performance method which is capable of revealing nonlinear relationships between variables, and whose output would be exploited in a clinical setting in the future. For these reasons, we decided to use recursive regression trees, [28] which compared favorably to other machine learning techniques, such as ANNs or SVM for mortality prediction in gliomas. [29] However, the conventional deployment of regression trees with cross-validation (2/3 as learning set, 1/3 as test set), such as in the discrimination between head and neck squamous www.advancedsciencenews.com www.advancedscience.com cell carcinoma patients and healthy adults, [30] produced unstable results from our data. We obtained excessively large confidence intervals for sensitivity and specificity (Table S7 and Figure S11, Supporting Information), and the resulting DTs were highly dependent on the training set. The main reason for this was the age bias in our data, which contained an overrepresentation of cancer and healthy cases among older and younger patients, respectively. In addition, we noted that age was related to other parameters in a nonlinear way (Table S8, Supporting Information). We therefore divided patients into age classes, in such a way that an equivalent number of cancer and healthy cases could be drawn from each class before being pooled for training and testing. To obtain a robust DT, we applied an iterative procedure where trees were first generated by multiple resampling of the data (with correction for age bias), and the variable to be used as the most recurrent top node was fixed. A second round of resampling yielded a new series of trees, with the latter variable at the top; that variable was then used to select the most recurrent next level nodes, etc. At every step, i.e., for every node/variable, the threshold was estimated with a confidence interval which was built using a bootstrap method. This type of iteration has been applied in other domains, e.g., to infer suicide attempts [31] or to detect drug usage. [32] The final tree was tested on the patients not used for training. The iterative construction procedure resulted in more stable trees with narrower confidence intervals than direct regression (Table S7 and Figure S11, Supporting Information).
The selected machine learning method appears well suited to classifying and testing combinations of quantitative biomarkers. This would also encourage its use for combinations of both quantitative and qualitative biomarkers. The selected parameters evaluated in this study showed a significant screening power, but by themselves were not sufficiently efficient and robust to distinguish early stage patients. Their combination in a DT for the CRC or the breast cancer cohort revealed better results, and showed an improvement over the CancerSeek test [6] and over the analysis of the cfDNA methylation profile. [9] Breast cancer patients were poorly predicted by the DT based on the CRC cohort (Se = 0.58), despite the high specificity found (0.90). This high specificity could be explained by the fact that the same healthy patients were included for the construction of both algorithms. Alternatively, the use of the breast cancer tree to classify colorectal cancer patients was quite efficient (Se = 0.85), although the drop in specificity (0.66) was notable. This could be due to the fact that the breast cancer cohort only concerns women, and that a difference probably exists between healthy males and females regarding the cfDNA measured parameters, especially for mitochondrial cfDNA. [33] This may also explain the importance of the MNR, which appears in the first node of the breast cancer DT, for the classification of breast cancer patients. The algorithm was then applied on a small cohort of patients with other cancer types (N = 23), and showed a good level of prediction given the sensitivity. This seems to indicate that these trees are not cancer-type specific.
The reliability of biomarkers based on the total cfDNA quantification is also known to vary in conditions and pathologies other than cancer, such as autoimmune diseases, [34] sepsis, [35] myocardial infraction, [36] exercise, [37] and others. It is therefore crucial to investigate the impact on this approach of confounding con-ditions (such as inflammation). Although the quantitative multianalyte test performance presented in this work is noteworthy as a blood test, our study presents critical limitations in the context of a universal cancer screening. The studied cohort is retrospective and composed of healthy controls devoid of asymptomatic patients. While it does appear to confirm diagnosis at the prescreening stage, for it to be properly called "screening" the test ought to have been performed on an asymptomatic population ostensibly in good health, with positive results being confirmed by other testing techniques.
Protein markers have approximately a > 10% sensitivity and a > 99% specificity, and have the capacity to indicate one or more cancer types. [7] However, their diagnostic use does not currently meet the requirements for multicancer tests used at the population scale. The primary requirements for this are test performance (high sensitivity without sacrifice of specificity; > 90% and > 99%, respectively), reproducibility, and robustness. In addition, cancer screening test validation should ultimately be subject to large-scale population studies of people with no known diagnosis, with the inclusion of potentially confounding conditions in order to ensure specificity, and using multiple study sites to ensure demographic diversity. We believe that a blood test based on cfDNA analysis might have the capacity to achieve this goal, especially given its minimal invasiveness, high compliance, affordability, and scalability.
Other requirements for such a test would be generalizability to population, and the capacity to identify anatomic location, in order to direct appropriate diagnostic follow up. The biomarkers tested in this study cannot determine the tumor tissue of origin, which is one of the reported potentials of CancerSeek and methylation analysis. [6,9] For this reason, the addition of other parameters might improve the predictive capacity of the model. The detection of genetic alterations such as mutations, aneuploidy, or translocations/re-arrangements would a priori seem to be the best screening parameters. However, these qualitative indicators showed some limitations as screening biomarkers. This is because current methods implying cfDNA analysis, such as conventional next generation sequencing (NGS), show some restrictions in obtaining useful data, especially in the plasma of patients with a small tumor. [6] In addition, white blood cell-derived mutations among cfDNA variants (especially with age) might lead to false positive results. [38] For these reasons, their use was precluded as a single strategy. [39] A cfDNA-based cancer screening test would be multianalyte. Although the performance of the combination of quantitative and structural parameters as reported here is not of a sufficient standard, such a combination may reveal synergic power, especially with respect to early cancer stage detection, when associated with cfDNA parameters from different origins, such as genetic alterations. An examination of genome-wide methylation profiling was found to be specific to cancer and very informative as to cancer type, while also showing good sensitivity and specificity. [9] This could therefore be a candidate partner of choice. Analysis of the cfDNA fragment size profile might be another powerful candidate, as we have previously demonstrated, [26,27] and as has been recently confirmed. [40,41] The synergistic value of using our approach in combination with another cancer detection method should therefore be assessed, in order to confirm its additive advantage.

www.advancedsciencenews.com www.advancedscience.com
This study validated the concept of using a DT based on cfDNA quantitative and structural parameters to distinguish cancer patients from healthy individuals. The ultimate goal would be to use machine learning to combine these parameters with others, such as methylation profile and fragmentomics, in order to achieve a better selectivity between different cancer types in the context of pan-cancer testing. Evaluation of the combination of the quantitative biomarkers tested here with methylation and fragmentomics analysis is ongoing, using the method we have proposed.

Experimental Section
Study Design: This study presented an observational analysis of different quantitative and structural cfDNA parameters, calculated using an ultrasensitive Q-PCR method to quantify various wild type nuclear and mitochondrial cfDNA sequences in the plasma of cancer patients and healthy individuals. To establish its proof of concept, this study also proposed a machine learning method based on a predictive DT model combining these different parameters for early cancer detection.
The plasma of a small exploratory cohort of 76 healthy subjects and 50 CRC patients (stage IV) was first analyzed. Then 1272 plasma samples (Table 1)  Early stages CRC patients (stages 0, I, and II) represented more than 50% of the total CRC cohort tested.
The capacity of each of the tested cfDNA parameters was evaluated to discriminate between healthy individuals and cancer patients, especially for early stages cancers. Then a DT was developed, which combined five different parameters, and established its sensitivity and specificity, in order to assess its potential for early cancer detection.
Patient and Sample Characteristics: Blood samples from healthy individuals were obtained from healthy donors, from the Etablissement Français du Sang (E.F.S), which is Montpellier's blood transfusion center (Convention EFS-PM N°21PLER2015-0013). These samples were analyzed (virology, serology, immunology, blood numeration) and ruled out whenever any abnormality was detected.
Plasma samples from patients with colorectal cancer were provided by the hospital centers of Clermont-Ferrand and Limoges in France, and of Salamanca in Spain, as well as from the "BCB colon" cohort from Montpellier in France (ICM Biobank (Biobank number BB-0033-00059)) and from the KPLEX I and KPLEX II (ClinicalTrials.gov Identifier: NCT02784639) studies previously conducted by the team. Plasma samples from individuals with breast cancer were obtained from the "IDEA Sein" study (ClinicalTrials.gov Identifier: NCT03255486) provided by the ICM Biobank (Biobank number BB-0033-00059) of Montpellier. Hepatocellular carcinoma plasma samples were provided by Eric Assenat and Marie Dupuy from Montpellier, and three pancreatic adenocarcinomas, one lymphoma, and one ovarian cancer samples were obtained from the hospital center of Salamanca in Spain. All plasma samples from cancer patients were obtained at the time of diagnosis, or at least 45 days after treatment interruption to eliminate any bias due to the liberation of cfDNA by the treatment. Written informed consent was obtained from all participants prior to the onset of the study. The characteristics of all patients included are listed in Table 1.
Cell Lines: The supernatant of 14 tumor cell lines in culture was analyzed (Table S1, Supporting Information): three colorectal cancer (SW620, SW480, and CaCo2), five prostate cancer (VCAP, 22rV1, DU145, LNCAP, and PC3), three breast cancer (SUM159, MDA 468, and R2sh P53), two lymphoma (RAMOS and BJAB), and one lung cancer cell line (H1975). Five normal cell lines of various origins were also analyzed: human foreskin fibroblasts (HFF), skin fibroblasts (CDC45K), mammary fibroblasts (R2), lung fibroblasts (IMR-90 A), and hepatocytes (LWFD). Genomic DNA of the DIFI human colorectal cancer cell line grown in RPMI 1640 and supplemented with 10% fetal calf serum and ATB (antibiotics) was used as a stan-dard for human nuclear DNA quantification. Most cell lines were obtained from the American type culture collection (Manassas, VA, USA). The R2sh P53 cell line was obtained from the Charles Theillet/Claude Sardet team. Cell culture medium supernatant was collected when cells were at 70% confluence, and underwent two centrifugations: the first at 1200 g, and the second at 16 000 g, at 4°C for 10 min each. DNA was then extracted using the QIAamp DNA Blood Mini Kit (Qiagen, Courtaboeuf, France) from 200 µL of the supernatant and eluted in a final volume of 80 µL.
Mice Models: An experimental xenografted mouse model developed previously was used, [15] to allow to discriminate within the same mouse the nontumor-derived cfDNA (of murine origin) and tumor-derived cfDNA (of human origin). Mice were purchased from Harlan (Gannat, France) and maintained in a specific pathogen-free facility in an accredited establishment (N°B-34-172-27; Institut de Recherche en Cancérologie de Montpellier-CRLC Val d'Aurelle-Paul Lamarque, Montpellier, France). Peripheral blood was drawn into ethylenediaminetetraacetic acid (EDTA) precoated tubes of 14 female athymic nude mice (6-8 weeks old) xenografted with 1 × 10 6 SW620 human cancer cells 30 days post-injection, and was used for plasma preparation within 1 h. Tumors were also collected and weighed (200-600 g). All experiments complied with the current national and institutional regulations and ethical guidelines, and were performed by an accredited person (Dr. B. ROBERT, N°34-156). Research projects involving the use of animals were authorized by the French Ministry of Research.
Plasma Isolation and cfDNA Extraction: Human and murine blood samples were handled according to an established preanalytical guideline. [33,42] Blood was drawn in EDTA tubes and centrifuged at 1200 g at 4°C in a Heraeus Multifuge LR centrifuge for 10 min. The supernatants were isolated and centrifuged at 16 000 g at 4°C for 10 min. Subsequently, the plasma was either immediately handled for DNA extraction or stored at −80°C. cfDNA was extracted from 200 µL of plasma and eluted in a final volume of 80 µL using the QIAamp DNA Blood Mini Kit (Qiagen, Courtaboeuf, France) according to the published protocol. [43] DNA samples were kept at −20°C until use. Freeze-thaw cycles were avoided in order to reduce the phenomenon of cfDNA fragmentation. Hemolyzed blood samples were discarded and not included in the study. Hemolysis did not affect cfDNA, but was an indicator of mishandling or improper storage of sample tubes which had led to blood cell deterioration.
DNA Quantification by Q-PCR and Copy Number Calculation: Q-PCR amplifications were carried out at least in triplicate, in a 25 µL reaction volume on a CFX96 touch Real-Time PCR detection system (Bio-Rad) instrument, using the CFX manager software. Each PCR reaction mixture was composed of 12.5 µL of SsoAdvanced Universal SYBR Green Supermix (Bio-Rad, Marnes-la-Coquette, France), 2.5 µL of each amplification primer (3 pmol µL −1 ), 2.5 µL of Nuclease free water (Qiagen), and 5 µL of DNA extract. Thermal cycling was consisted of three repeated steps: a 3 min Hot-start polymerase activation-denaturation step at 95°C, followed by 40 repeated cycles at 90°C for 10 s, then at 60°C for 30 s. Melting curves were obtained by increasing the temperature from 55 to 105°C, with a plate reading every 0.2°C. The concentration was calculated from the Cq detected by Q-PCR. A triplicate of nontemplate negative control was included in each run for each pair of primers.
Human DNA Quantification: For human nuclear DNA, serial dilutions of genomic DNA from the DIFI human colorectal cancer cells were used as a standard for nuclear DNA quantification. The initial concentration and purity were assessed by optic density at = 260, 230, and 280 nm, with an Eppendorf BioPhotometer D30.
A control standard curve using a 3382 bp human ORF (open reading frame) vector with a 786 bp MT-CO3 insert obtained from ABM good (accession no.YP_003024032) of known concentration was used for human mitochondrial DNA quantification in cell lines and human samples. Initial vector solution concentration and purity were determined by measuring the optic density at = 260, 230, and 280 nm, with an Eppendorf BioPhotometer D30.
For each human plasma sample and cell line supernatant extract, a nuclear wild type KRAS amplicon and a MT-CO3 mitochondrial amplicon of 67 bp each were separately quantified. A wild type BRAF nuclear amplicon of 105 bp was also targeted for several human plasma samples in order to validate the total nuclear DNA quantification, which was calculated by targeting the KRAS 67 bp amplicon. KRAS, BRAF, and MT-CO3 were monogenic, and the respective targeted sequences were known to be wild type, and not affected by any mutation or genetic alteration. KRAS sequence-based detection was always controlled by BRAF sequencebased detection, and both quantifications had previously shown a very high correlation. [17,33] Targeting these specific genes confirmed whether the cfDNA fraction quantified was derived from the nucleus or the mitochondria. It was previously demonstrated that quantification of both mitochondrial and nuclear Q-PCR systems was highly specific, down to 1 copy/reaction (Ref A 67 and Ref M 67). [33] Amplicons of 145 and 320 bp of the KRAS gene were also targeted for cfDNA size profile analysis. Based on the team's work, [16,17] as well the work of others, [22][23][24] it was shown that cfDNA was present in the blood mainly in the form of mononucleosomes, highlighting a chromatin organization and that fragmentation [25] and size profiles were differed in healthy and cancer patients, especially in the fractions between 60 and 145 bp, between 145 and 300 bp, and higher than 300 bp. [16,26,27] As mononucleosomes had a size of 145-168 bp, the following fractions based on the Q-PCR systems were quantified: the total amount of cfDNA or Ref A 67 (amplicon of 67 bp); the amount of fragments over the mononucleosome size or Ref A 145 (amplicon of 145 bp); and the amount of cfDNA over the mononucleosome size or Ref A 320 (320 bp). The Q-PCR primers system used to amplify DNA sequences of various sizes was previously validated by the team, and demonstrated high specificity. [17,38] It should also be noted that this system was recently compared to whole-genome sequencing of single-stranded DNA library preparation, and that both analytical approaches showed similar results regarding the size distribution of cfDNA in cancer patients. That similarity could be taken as a confirmation of the system's accuracy. [18] Nuclear cfDNA copy number calculation: Nuclear cfDNA copy number per milliliter of plasma/supernatant was determined with the following calculation Q nuclear is the NcfDNA copy number per milliliter; c is the NcfDNA concentration (pg µL −1 ), determined by Q-PCR targeting the nuclear KRAS gene sequence; 3.3 pg is the human haploid genome mass; V elution is the volume of cfDNA extract (µL); and V plasma/supernatant is the volume of plasma or supernatant used for the extraction (mL).
Mitochondrial cfDNA copy number calculation: Mitochondrial cfDNA copy number per milliliter of plasma/supernatant was determined with the following calculation Q mito is the McfDNA copy number per milliliter of plasma/supernatant; c is the McfDNA mass concentration (g µL −1 ) determined by Q-PCR targeting the mitochondrial MT-CO3 gene; N a is Avogadro's number (6.02 × 10 23 molecules per mole); L vector is the plasmid length (nucleotides); MW is the molecular weight of one nucleotide (g mol −1 ); V elution is the elution volume of cfDNA extract (µL); and V plasma is the volume of plasma or supernatant used for the extraction (mL).
Murine DNA Quantification: For murine samples, murine nuclear DNA was quantified using as reference a serial dilution of genomic murine DNA (Promega). Human nuclear DNA of tumor origin was quantified using a standard curve of genomic DNA of the human Difi cell line. The relative amount of mitochondrial DNA to nuclear DNA was determined using the equation 2 −dCq , where dCq = (Cq mito -Cq nuc). Human cfDNA (of tumor origin) and murine cfDNA (of nontumor origin) were quantified in the same murine plasma samples using adequate primer sets for each amplification, and expressed in nanograms per milliliter of plasma (ng mL −1 ). Human KRAS nuclear amplicon and MT-CO3 mitochondrial amplicon of 67 bp each were quantified to assess the human cfDNA of tumor origin.
On the other hand, murine cfDNA of nontumor origin was analyzed by targeting a KRAS murine nuclear amplicon of 63 bp and a MT-CO1 murine mitochondrial amplicon of 114 bp.
Oligonucleotides: Oligonucleotides were synthesized and purified by high performance liquid chromatography by Integrated DNA Technologies (Coralville, Iowa). Quality control of the oligonucleotides was performed by matrix-assisted laser desorption/ionization-time of flight. They were tested for specificity and sensitivity before use. The sequences and characteristics of the selected primers are presented in Table S9 in the Supporting Information. The primer selection and validation were done in accordance with stringent internal guidelines, requiring the use of four distinct software programs developed by the team. When the DII was high (> 0.4), the sample was excluded from the analysis, as this suggested that the fraction of long cfDNA fragments was high, and that the sample could be contaminated by genomic DNA derived from white blood cells.
The mean mitochondrial DNA copy nb mL −1 value when targeting the MT-CO3 wild type 67 bp amplicon was named Ref M 67 (copy nb mL −1 of plasma or supernatant), and corresponded to the total concentration of mitochondrial cfDNA.
The MNR was expressed as the ratio of the mean of mitochondrial DNA copy nb mL For murine samples, human tumor MNR was expressed as the ratio of human mitochondrial cfDNA relative concentration to human nuclear cfDNA concentration (ng mL −1 ). Murine nontumor MNR was expressed as the ratio of murine mitochondrial cfDNA relative concentration to murine nuclear cfDNA concentration (ng mL −1 ).
Statistical Analysis: Murine, cell culture supernatant, and exploratory cohort samples: Data were expressed as mean ± SD. The nonparametric Wilcoxon-Mann-Whitney test was used for the comparison of medians for murine, cell culture supernatant, and exploratory cohort data. ROC curve was presented and the AUC was calculated. A probability of less than 0.05 was considered to be statistically significant; *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001, ****p ≤ 0.0001.
Human plasma samples: The analysis was conducted using R 3.6.0 and the stats package.
Colorectal cancer cohort: Patients with missing values on parameters used for tree regression were removed. This concerned 131 patients (129 CRC patients and 2 healthy patients). A two-sample nonparametric one-sided Mann-Whitney test with normal approximation (n > 50) was adopted for continuous variables comparisons among the populations studied, rather than a two-sample proportion test with continuity correction for binary variables comparison (Table S2, Supporting Information).
Breast cancer cohort: Patients with missing values on parameters used for tree regression were removed. This concerned two patients diagnosed with breast cancer. A two-sample nonparametric one-sided Mann-Whitney test with normal approximation (n > 50) was adopted for continuous variables comparison among the populations studied (Table S4, Supporting Information).
Age adjustment: Due to different age distribution among healthy and colorectal cancer patients (Figure 2a), age was split into five different categories in order to pair patients: less than 45 years old, 45 to 50 years old, 51 to 55 years old, 56 to 60 years old, and more than 60 years old (Figure 2b). This division into categories reduced the number of patients included to 527. Moreover, interaction effect between age and some measured parameters was observed using logistic modeling of the colorectal cancer status probability (Table S8, Supporting Information), which confirmed the benefit of removing this bias. This age distribution between breast cancer patients and healthy individuals was equally accentuated, and resulted in the overrepresentation of young healthy patients. Accordingly, a window of patients between 25 and 65 years was created.
Univariate preliminary analysis: For each measured parameter, ROC curves were produced using the pROC package. An optimal threshold corresponding to the best prediction was estimated by minimizing the distance between the ROC curve and the ideal point (Se = 1, 1-Sp = 0). Sensitivity, specificity, AUC, and the optimal threshold were estimated with their respective confidence intervals using an empirical bootstrap method, whose use for quantitative diagnostic tests had already been demonstrated. [44] First, 2000 resampling were applied with replacement and the punctual values were estimated for each step. The difference between each step estimation and the targeted value were then calculated, with the 2.5th and 97.5th distance percentiles being obtained and added to the targeted punctual value. Second, the same method of 2000 resampling was applied to 155 CRC and 155 healthy patients among the previously established age categories. Third, the same method was applied to patients from the breast cancer cohort, with 2000 resampling of 236 patients.
Machine learning: The cancer prediction algorithm was conducted by applying recursive partitioning for regression trees [45] using the rpart package. First, a global view of all the measured parameters was generated, in order to appreciate the tendency of each one, by ascending hierarchical clustering. The resulting heatmap was built using the ComplexHeatmap package (Figure 2c), clustering patients (by columns), and parameters (by rows), based on the Euclidian distance and using the ward.D2 agglomeration method. The values were reduced among the samples following log 10transformation, in order to appreciate the variability in a convenient color scale. Second, recursive partitioning (using the rpart package) was applied to 500 different resampling, in order to extract the importance score of each variable introduced in the procedure (Figure 2d). This allowed to exclude some variables created by certain parameter combinations (data not shown). It also allowed to check the absence of bias regarding the Age and Gender variables.
To build the global tree, an iterative procedure was performed on each node to retain the most frequently selected variable; this was done using the resampling method described in the Univariate Preliminary Analysis section above. The results were confirmed by applying a homogeneity Χ 2 -test to the frequency table. Also, a proportion test with continuity correction was applied to the two most frequent variables, in order to discriminate them. Once the selection was confirmed, the parameter of interest at the node was fixed, and its conditional threshold was estimated using the median calculated on 2000 resamplings with replacement. The confidence interval was obtained using the distribution of the difference from the mean. The median was kept as the condition applied at the node according to empirical bootstrap methods (Table S3, Supporting Information). Branch expansion was stopped as soon as i) no specific parameter could split the patients further, ii) less than 10% of patients in the learning set would remain upon further splitting, and iii) no parameter was selected according to an X² test. This procedure was iterated with a display in order to obtain fully visual trees for both the colorectal (Figure 4a) and breast cancer cohorts (Figure 4b,c). Each node contained the variable with its estimated threshold and the proportions of cancer/healthy patients present at this step. Each leaf contained the inferred status (cancer or healthy), the proportion of cancer/healthy patients, and the global proportion of patients ending at this leaf. For each DT obtained, its performance was assessed by comparing the predictive values given by the tree and the real diagnosis. Sensitivity and specificity were punctually estimated, then the 95% confidence interval was subsequently obtained by empirical bootstrap, as described in the Univariate Preliminary Analysis section above. Different groups of patients were used. First, the parameters were calculated with the patients used for tree building (N = 527 for CRC, N = 236 for breast cancer). Second, the tree prediction was applied to the remaining patients not included in the tree building. These predictions were also applied to early (0-I-II) CRC patients (Table 3).
Other cancer types: These parameters were measured using 23 patients with different cancer types. The results were predicted using both the breast cancer cohort tree and the CRC cohort tree (Table S6, Supporting Information).

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.