Risk factors and predictors of lymph nodes metastasis and distant metastasis in newly diagnosed T1 colorectal cancer

Abstract Background Lymph node metastasis (LNM) and distant metastasis (DM) are important prognostic factors in colorectal cancer (CRC) and determine subsequent treatment. We aimed to identify clinicopathological factors associated with LNM and DM and to analyze the prognosis of patients with T1 CRC. Methods A total of 17 516 eligible patients with T1 CRC diagnosed during 2004‐2016 were retrospectively enrolled from the Surveillance, Epidemiology, and End Results (SEER) database. Logistic regression analysis was performed to identify risk factors for LNM and DM. Unadjusted and adjusted Cox proportional hazard models were used to identify prognostic factors for overall survival (OS). The cumulative incidence function (CIF) was used to further determine the prognostic role of LNM and DM in colorectal cancer‐specific death (CCSD). LNM, DM, and OS nomograms were constructed from these models and evaluated by the C‐index for discrimination and by calibration plots for accuracy. The clinical utility of the nomograms was assessed by decision curve analyses (DCAs) and by subgroups with different risk scores. Results Tumor grade, mucinous adenocarcinoma, and age accounted for the three largest proportions of the LNM nomogram score (all, P < .001), whereas N stage, carcinoembryonic antigen (CEA), and tumor size accounted for the largest proportions of the DM nomogram score (all, P < .001). An OS nomogram was formulated to predict 3‐, 5‐, and 10‐year overall survival for patients with T1 CRC. The calibration curves showed good predictive accuracy, and the C‐indexes of the LNM, DM, and OS nomograms were 0.666, 0.874, and 0.760, respectively. DCAs and risk subgroups supported the clinical usefulness of these nomograms. Conclusions Novel population‐based nomograms for T1 CRC patients could objectively and accurately predict the risk of LNM and DM, as well as OS for different stages. These predictive tools may help clinicians make individualized clinical decisions before clinical management.


| Patient enrollment and characteristics
The records of patients were downloaded from the SEER 18 registry database using SEER*Stat 8.3.6 software (http://seer.cancer.gov/seerstat/). The SEER database currently collects and publishes cancer incidence and survival data covering approximately 34.6 percent of the US population (https://seer.cancer.gov/about/overview.html). Within the SEER database, we identified 17 516 adult patients who were diagnosed with a single primary T1 colorectal cancer from January 2004 to December 2016. The flowchart of case selection is shown in Figure 1. Colorectal carcinoma cases were screened by International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) Hist/behav, malignant. The TNM staging data were retrieved based on the American Joint Committee on Cancer (AJCC) 7th edition. Other characteristics at diagnosis were obtained for all patients, including year of diagnosis, age at diagnosis, race, gender, marital status, tumor location, histology, tumor size, regional nodes examined, grade, survival status, carcinoembryonic antigen (CEA), and follow-up time. Patients with carcinoma in situ, T2-4 disease, or unknown T, N, or M status were not included in the cohort. Cases with fewer than 12 regional nodes examined 12 or with incomplete survival information (time and cause of death) were also excluded. Based on existing evidence-based medicine, LNM status is a significant prognostic factor for T1 CRC patients without M1 disease, whereas patients with M1 disease (T1NXM1) are classified as stage IV, and LNM status would not determine their treatment. We therefore divided the patients into two study sets: the T1N0-2M0 CRC population, named study cohort N (n = 17 309), for predicting LNM, and the whole T1 CRC population, named study cohort M (n = 17 516), for predicting DM. No ethical approval was sought for this study, as the data used were collected from the public SEER database, which is available as open-access and anonymized data.

| Nomogram construction and validation
Univariable and multivariable analyses were performed to identify independent risk factors and prognostic factors in T1 colorectal carcinoma in the SEER cohort. Binary logistic regression models 13 were used to identify risk factors for LNM in study cohort N and for DM in study cohort M. A Cox proportional hazard model 14 was used to identify potentially important prognostic factors for patients with T1 CRC. The Kaplan-Meier method was used to plot overall survival curves, and the cumulative incidence function was used to plot cancer-specific cumulative incidence. Based on the multivariable binary logistic regression models and the Cox proportional hazard model, three novel nomograms 15 were established and validated by the concordance index (C-index) and by calibration plots generated with a bootstrapping method using 1000 resamples. The C-index was used to quantify the discriminatory power of each model, and the calibration plots were used to evaluate the accuracy of the nomograms. The clinical application value of the nomogram models was determined by decision curve analyses (DCAs), which calculate the net benefit at each risk threshold probability. Additionally, based on the DCAs, clinical impact curves were plotted to show the value of the nomogram models more intuitively. Moreover, all participants were divided by risk score quartiles into low-, medium-, and high-risk subgroups, by which the clinical utility of the nomograms could be measured.
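As an illustration of what the C-index measures for a binary outcome such as LNM or DM, the following minimal Python sketch counts concordant pairs of patients with and without the event (for a binary outcome this equals the area under the ROC curve). It is not the authors' code, which was written in R; the function name `c_index_binary` and the example values are hypothetical, and the bootstrap correction with 1000 resamples is not shown.

```python
from itertools import product


def c_index_binary(predicted, outcome):
    """Concordance index for a binary outcome.

    A pair (patient with event, patient without event) is concordant when
    the patient with the event has the higher predicted probability.
    Ties in predicted probability count as half a concordant pair.
    """
    events = [p for p, y in zip(predicted, outcome) if y == 1]
    non_events = [p for p, y in zip(predicted, outcome) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1.0
        elif p_event == p_non_event:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical predicted probabilities and observed status (illustration only,
# not data from the SEER cohort).
example_pred = [0.05, 0.10, 0.20, 0.30, 0.40, 0.15]
example_obs = [0, 0, 1, 0, 1, 0]
print(c_index_binary(example_pred, example_obs))  # 0.875
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of patients with and without the event.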

| Statistical analysis
All statistical analyses were performed in R software (version 3.6.1, https://www.r-project.org/). Chi-squared tests or Fisher's exact tests were used to compare categorical variables. Survival variables were compared by Wilcoxon tests. The nomograms, C-index, calibration curves, Kaplan-Meier curves, cumulative incidence curves, DCAs, and clinical impact curves were produced using relevant R packages and functions, such as rms, rmda, survival, cmprsk, and stdca (https://www.mskcc.org/departments/epidemiology-biostatistics/health-outcomes/tutorial-r). A two-tailed P < .05 was considered statistically significant.

| Patients and tumor characteristics
According to the screening criteria, 17 516 patients who were diagnosed with T1 colorectal cancer and underwent surgical resection during 2004-2016 were finally included from the SEER database. There were two study groups, study cohort N (n = 17 309) and study cohort M (n = 17 516) (Table 1).

| Independent risk factors of lymph nodes metastasis and development of the nomogram
Independent risk factors for LNM were determined by univariable and multivariable binary logistic regression analyses. The significant risk factors for LNM included year of diagnosis, age at diagnosis, race, gender, marital status, tumor location, histology, tumor size, number of regional nodes examined, grade, survival status, and CEA. To display the risk factors for LNM in T1N0-2M0 CRC more intuitively, a nomogram model was established (Figure 2A). The score assigned to each variable and the corresponding predicted probabilities are listed in Table 3. In the LNM nomogram, tumor grade accounted for the largest proportion of the score, followed by age, histology, tumor location, tumor size, race, CEA, marital status, and gender. The calibration curve showed good predictive accuracy of the nomogram, which had a C-index of 0.666 (Figure 2B). Moreover, DCA and clinical impact curve (CIC) analysis were performed on the LNM nomogram in study cohort N (Figure 2C,D), showing that the nomogram was most beneficial for predicting LNM at threshold probabilities of 0-0.3.
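The net benefit that a decision curve plots at each threshold probability can be sketched as follows. This is a minimal Python illustration of the standard net benefit formula for a binary outcome (true positives per patient minus false positives per patient weighted by the odds of the threshold), not the R routines used in the study; the function names and the example values are hypothetical.

```python
def net_benefit(predicted, outcome, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    net benefit = TP/n - FP/n * threshold / (1 - threshold)
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcome)
    true_pos = sum(1 for p, y in zip(predicted, outcome) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(predicted, outcome) if p >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcome, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(outcome) / len(outcome)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical predictions and outcomes (illustration only, not SEER data).
example_pred = [0.05, 0.10, 0.20, 0.30, 0.40, 0.15]
example_obs = [0, 0, 1, 0, 1, 0]
print(round(net_benefit(example_pred, example_obs, 0.2), 3))    # 0.292
print(round(net_benefit_treat_all(example_obs, 0.2), 3))        # 0.167
```

The "treat none" strategy has a net benefit of 0 at every threshold, so a model is useful at a given threshold when its net benefit exceeds both 0 and the treat-all value.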
Based on the significant risk factors identified in the multivariable regression analysis, a nomogram was constructed to predict the probability of DM in patients with T1 colorectal carcinoma (Figure 3A). Each variable was assigned a score, and the estimated probability of DM was calculated from the total score (Table 3). N classification made the largest contribution to the DM nomogram, followed by CEA, tumor size, grade, and age at diagnosis. The calibration plot showed relatively satisfactory predictive accuracy of the nomogram (Figure 3B). The nomogram had a C-index of 0.874, indicating that it effectively discriminated the risk of DM in T1 CRC. Furthermore, DCA and CIC plotted for the DM nomogram in study cohort M showed that threshold probabilities of 0-0.3 were the most beneficial for predicting DM (Figure 3C,D).
The significant factors identified by the Cox regression analyses were used to develop a nomogram to predict the probability of overall survival in patients with T1 CRC. The OS nomogram is shown in Figure 5A. The C-index of the OS nomogram was 0.760, and the calibration curves revealed close agreement between the nomogram predictions and actual observations (Figure 5B). Furthermore, using DCA, we found that the most beneficial threshold probabilities for predicting the 3-, 5-, and 10-year death probability were 0-0.3, 0-0.5, and 0-0.7, respectively (Figure 5C-E).
The low-, medium-, and high-risk subgroups, separated by scores of 0-59, 59-105, and 105-272, respectively, differed significantly in overall survival probability (P < .001) (Figure 6C).
[...] chemotherapy. However, patients with stage IV T1 CRC usually lose the chance of curative surgery and rely on systemic treatment, including chemotherapy, targeted drugs, and immunotherapy. In summary, it is important to distinguish the status of lymph nodes and distant metastasis in clinical practice. It is also important to be able to evaluate overall survival based on clinicopathological characteristics.

| DISCUSSION
Studies of this kind have been increasing recently, but they still have many shortcomings and limitations. First, former studies 16,17 constructed models based on logistic regression and Cox regression analyses, but these models could not provide predicted probabilities, making them difficult to apply clinically. The nomogram, as a newer form of display, can intuitively predict LNM, DM, and OS. It presents the model as a diagram from which the relevant probabilities can be read, providing a reference for further examination and clinical decision making. Second, nomogram studies of the T1 CRC population 18,19 remain incomplete. They examine only the occurrence of LNM or DM and rarely predict prognosis, so they cannot fully reflect the LNM, DM, and OS of T1 CRC patients in one dataset. Third, there are differences in patient inclusion criteria. For example, some studies directly use the data of all T1 CRC patients to assess whether there is lymph node metastasis. 18-20 Although these studies yield a prediction of LNM, it is of little significance, because predicting LNM is clinically meaningful only in nonmetastatic CRC patients, in whom the prediction will influence treatment decisions. Moreover, these studies 18-20 did not set a requirement for the number of lymph nodes examined, although 12 or more lymph nodes usually need to be examined to determine lymph node status. 12 Therefore, we divided the included population into an N subgroup (T1M0NX CRC for LNM) and an M subgroup (T1NXMX CRC for DM and OS). The incidence of LNM and DM and the OS of T1 CRC were analyzed, and three nomograms were established and validated for predicting LNM, DM, and OS in patients with T1 CRC.
The LNM nomogram includes nine factors: age at diagnosis, race, gender, marital status, tumor location, histology, tumor size, grade, and CEA, whereas the DM nomogram incorporates five factors, namely, age at diagnosis, tumor size, N classification, grade, and CEA. The OS nomogram for predicting 3-, 5-, and 10-year overall survival involves 10 factors: age at diagnosis, race, gender, marital status, histology, tumor size, N classification, M classification, grade, and CEA.
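To make concrete how a factor can "account for the largest proportion" of a nomogram score, the sketch below rescales each factor's contribution to the model's linear predictor so that the factor with the widest range spans 0 to 100 points. This is one common convention in nomogram software and is assumed here rather than taken from the paper; the function name, factor names, and coefficients are entirely hypothetical and are not estimates from the SEER models.

```python
def nomogram_points(contributions):
    """Rescale per-level linear-predictor contributions to 0-100 points.

    `contributions` maps each factor to a dict of level -> contribution
    (for example, a log-odds coefficient relative to the reference level).
    The factor with the widest range of contributions spans 0 to 100.
    """
    ranges = {
        factor: max(levels.values()) - min(levels.values())
        for factor, levels in contributions.items()
    }
    widest = max(ranges.values())
    if widest == 0:
        raise ValueError("all factors have zero range")
    return {
        factor: {
            level: 100.0 * (value - min(levels.values())) / widest
            for level, value in levels.items()
        }
        for factor, levels in contributions.items()
    }


# Hypothetical coefficients for two made-up factors (illustration only).
example_contributions = {
    "factor_a": {"level_1": 0.0, "level_2": 0.4, "level_3": 1.2},
    "factor_b": {"no": 0.0, "yes": 0.3},
}
example_points = nomogram_points(example_contributions)
for factor, levels in example_points.items():
    print(factor, {level: round(points, 1) for level, points in levels.items()})
```

Under this convention, the factor whose levels differ most in their effect on the outcome receives the longest points axis, which is the sense in which it dominates the total score.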
All the nomograms indicated good agreement between predictions and observations. The C-indexes of the LNM, DM, and OS nomograms were 0.666, 0.874, and 0.760, respectively. These nomograms showed good clinical utility within the appropriate threshold probability ranges. Furthermore, based on the quartiles of the nomogram scores, low-, medium-, and high-risk groups were identified and used to plot stacked bar charts and Kaplan-Meier survival curves, which intuitively indicated the discrimination ability of the nomograms.
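The assignment of patients to risk groups can be sketched as below. The cut points 59 and 105 are the boundaries reported for the OS nomogram score (0-59, 59-105, 105-272). Two things are assumptions made for this illustration: that the cut points correspond to the first and third quartiles of the score, and that a score exactly on a boundary belongs to the lower group. The function names are hypothetical.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_cutoffs(scores):
    """Return (first quartile, third quartile) of the scores.

    Assumption: the bottom quartile forms the low-risk group, the middle
    two quartiles the medium-risk group, and the top quartile the high-risk
    group. The paper does not spell out this mapping.
    """
    q1, _median, q3 = quantiles(scores, n=4)
    return q1, q3


def risk_group(score, cutoffs, labels=("low", "medium", "high")):
    """Assign a risk label; a score equal to a cut point goes to the lower group."""
    return labels[bisect_left(cutoffs, score)]


# Cut points reported for the OS nomogram score.
OS_CUTOFFS = (59, 105)
print([risk_group(s, OS_CUTOFFS) for s in (30, 59, 80, 105, 200)])
# ['low', 'low', 'medium', 'medium', 'high']
```

Once each patient has a label, the groups can be compared by outcome proportions or by Kaplan-Meier curves, as the paragraph above describes.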
FIGURE 5 Nomogram, calibration curve, and decision curve analysis for predicting overall survival in patients with T1 colorectal carcinoma. There are ten factors in the OS prediction nomogram (A). The calibration curve (B) for predicting 3-, 5-, and 10-year OS is shown (C-index = 0.760). The diagonal line shows equality between the actual and predicted OS probability. With the solid line close to the diagonal line, the plot reveals excellent agreement between the probability of nomogram prediction and actual observations. The decision curves (C-E) of the nomogram for predicting 3-, 5-, and 10-year OS are plotted. The x-axis represents the threshold probability and the y-axis shows the net benefit. The horizontal blue line represents one extreme situation in which all patients were alive, and the black line indicates the other extreme situation in which all patients were dead.

In this population-based study, we found that tumor grade III-IV, mucinous adenocarcinoma, and age 18-49 years accounted for the largest proportions of the LNM nomogram score. The degree of differentiation has been reported to be closely associated with LNM in T1 CRC. 9 In this study, compared with well-differentiated carcinoma, the LNM risk of poorly differentiated and undifferentiated cancer was approximately 3.99 and 2.33 times as high, respectively (both P < .001). Consistent with previous findings in T1 CRC, 21 mucinous adenocarcinoma increased the LNM risk by more than one-fold in comparison with adenocarcinoma. A growing number of studies 22,23 reveal that young age at diagnosis is related to an increased risk of LNM in patients with early colon cancer. Like these studies, we found that the LNM risk of the youngest T1 CRC group (patients aged 18-49 years) was higher than that of older patients. For the DM nomogram, the largest proportions of the risk score came from N2 stage, positive CEA, and tumor size over 30 mm.
Not surprisingly, N classification was a significant predictor for the risk of DM in T1 CRC. Of note, patients with a worse N stage are more prone to distant metastasis. Preoperative CEA has been found to be predictive of distant metastasis in T1 colorectal cancer after radical surgery. 24 We made similar observations, which indicate that positive CEA prior to treatment is a significant predictive factor for the risk of DM in T1 CRC. Unlike previous studies 24,25 concerning T1 CRC, we found that tumor size was significantly associated with the risk of DM.
In terms of the OS nomogram, age over 80 years, M1 stage, and N2 classification accounted for the largest percentages of the risk score for overall survival. It is not surprising that patients with distant metastasis and N2 classification have a poor prognosis. Consistent with our findings, older age in patients with early-stage CRC has been reported to be significantly related to shorter overall survival. 26,27 Furthermore, we found that LNM in T1 CRC was associated with both cancer-specific death and noncancer-specific death, whereas DM was linked only with cancer-specific death.

FIGURE 6 Clinical effects of the risk score in the nomograms. Based on the quartiles of the risk score, the three nomograms divide participants into low-, medium-, and high-risk subgroups. The clinical utility of these subgroups for predicting LNM and DM is presented as constituent ratios, which differ significantly (A-B). In terms of overall survival, the Kaplan-Meier method is used to assess the significance of differences among the risk subgroups (C).

In this database-based study, we screened 17 516 eligible patients with a median follow-up of 53 months from real-world data. We analyzed the data with appropriate statistical methods and reached these conclusions. However, our study had some limitations. This was a population-based retrospective analysis lacking important treatment information, such as surgical methods and chemoradiotherapy protocols. In addition, the data lacked a description of the site of distant metastasis and the detection of key molecules in colorectal cancer, such as KRAS and BRAF. Finally, these models were developed from the SEER database and were not verified with external data; they will need to be continuously modified based on future application.
In conclusion, based on independent risk factors from a large population database, we constructed three nomograms that can accurately predict the LNM, DM, and OS of T1 CRC patients at different stages. Moreover, validation of discrimination and calibration demonstrated that our nomograms have high accuracy and reliability, and they performed well in clinical utility. Therefore, they can help doctors make clinical decisions for patients with T1 CRC, including diagnostic investigations, individual treatment, and follow-up management strategies.