Risk factors and predictors for tumor site origin in metastatic adenocarcinoma of unknown primary site

Abstract Background Metastatic adenocarcinoma of unknown primary site (MACUP) is the most common type of cancer of unknown primary site and has a poor prognosis. Predicting its site of origin is attracting growing attention. However, a site determined by gene expression profiling has not been shown to have a significant impact on survival, so other approaches may be needed. Methods We reviewed 1011 MACUP patients from the Surveillance, Epidemiology, and End Results (SEER) database, diagnosed during 2010–2016, whose primary site was determined by pathological examination and immunohistochemistry. Kaplan–Meier curves and Cox proportional hazard models were used to compare survival. Logistic regression models and the corresponding nomograms were built to predict the probability that the primary site was the digestive system, the respiratory system, or the female breast. The models were validated, and their clinical utility was assessed, with the concordance index, calibration curves, decision curve analysis, and clinical impact curves. Results A total of 324 (32.1%), 299 (29.6%), and 203 (20.1%) MACUP patients had a primary site in the digestive system, respiratory system, and female breast, respectively. Patients with a digestive or respiratory primary site had poorer survival than those with other sites. A digestive system origin was significantly associated with liver (odds ratio = 13.21 [95% confidence interval = 8.48–21.02]) or lung (2.36 [1.40–3.97]) metastasis, while a respiratory system origin was linked to brain (11.68 [6.68–21.26]) or lymph node (3.39 [2.26–5.13]) metastasis. Patients with a female breast origin were prone to bone metastasis (5.85 [3.68–9.45]). Logistic regression nomograms were developed to help clinicians intuitively predict the probability of each site of origin, with C‐indexes of 0.867, 0.824, and 0.753, respectively. Decision curve analysis and clinical impact curves both indicated clinical usefulness. Conclusions We profiled the different sites of origin in MACUP patients and established prediction models. These features might help clinicians predict the primary site more accurately and decide on the subsequent treatment strategy.


| INTRODUCTION
Cancer of unknown primary site (CUP) represents a heterogeneous group of metastatic diseases for which the site of origin cannot be detected despite a standardized diagnostic approach with careful examination. 1 CUP is estimated to account for approximately 3%-5% of all newly diagnosed carcinomas, 2 with about 4-19 cases per 100,000 persons every year, 3 although the exact incidence rate is hard to determine. About 70%-80% of CUP cases are metastatic adenocarcinoma on histopathology 2,4 and more than 60% of patients present with metastasis in internal organs, 4,5 so metastatic adenocarcinoma of unknown primary site (MACUP) makes up the vast majority of CUP. Patients with CUP usually receive empirical chemotherapy with a platinum-taxane regimen, 6 but their prognosis remains poor, with a median survival of approximately 6-9 months. 7,8 With the development of diagnostic methods, individualized treatment plans based on gene expression profiling have become more popular. However, this site-specific therapy has not shown a significant clinical benefit. 9 In some CUP cases in which the T stage is classified as T0, the primary site can be determined by pathological examination and immunohistochemistry, although no primary tumor can be detected by imaging. Compared with other CUP patients, these T0NXM1 patients have significantly longer overall survival. 10 Therefore, using the Surveillance, Epidemiology, and End Results (SEER) database, we aimed to analyze the relationship between the clinical characteristics of MACUP and the sites of origin determined by pathological examination and immunohistochemistry, and to study survival by site of origin. In this study, we focused on primary tumor sites in the digestive system, respiratory system, and female breast. First, we described the differences in clinical features between these tumor sites and calculated survival times. Next, we used binary logistic regression models to identify factors significantly related to the probability of each site of origin. Finally, we constructed nomograms to predict the primary site probabilities for MACUP patients and to stratify cases by predicted probability.

| Population selection and characteristics
The data used in the present study were extracted from the SEER 18 registry database, which contains cancer incidence and survival data and covers approximately 34.6% of the population of the USA. Cases of metastatic adenocarcinoma of unknown primary site (MACUP) were identified by American Joint Committee on Cancer (AJCC) stage (7th edition) "T0NxM1" and "International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) Hist/behav, malignant." Patients aged 18-79 years who were initially diagnosed between January 2010 and December 2016 were included in the current study. For these patients, the tissue or organ of origin, as determined by pathological examination of the metastatic cancer, was recorded in "Site recode ICD-O-3/WHO 2008." Cases missing key information, including unknown race, unknown cause of death, or unknown survival time, were excluded. To analyze the different primary tumor sites, we divided the included MACUP patients into two groups: study cohort 1 (whole population) for the digestive system or respiratory system, and study cohort 2 (female population) for the female breast. The study cohorts were then split into a training set and a validation set for data analysis. The flowchart of patient screening is shown in Figure 1. Because the events per variable (EPV) 11 for prostate cancer in the male population was less than ten, we did not build a logistic regression nomogram for a source site in the male prostate. Because the data were downloaded from the SEER database, which provides open-access, anonymized data to the public, ethical approval was not required for this study.
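The EPV criterion mentioned above can be illustrated with a minimal Python sketch. It assumes the common definition of EPV for a binary outcome (size of the less frequent outcome group divided by the number of candidate model parameters); the function names and the counts in the example are hypothetical and are not taken from the study data.

```python
def events_per_variable(n_events: int, n_non_events: int, n_params: int) -> float:
    """EPV for a binary outcome.

    Assumed definition: the smaller of the two outcome groups divided by the
    number of candidate parameters (each dummy variable counts as one).
    """
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return min(n_events, n_non_events) / n_params


def has_enough_events(n_events: int, n_non_events: int, n_params: int,
                      threshold: float = 10.0) -> bool:
    """Apply the 'at least ten events per variable' rule of thumb."""
    return events_per_variable(n_events, n_non_events, n_params) >= threshold


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(events_per_variable(60, 365, 10))   # 6.0
    print(has_enough_events(60, 365, 10))     # False
```

With counts like these, the rule of thumb would argue against fitting a multivariable model for that outcome, which is the reasoning the text applies to the male prostate.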

| Outcome and variable declaration
Overall survival (OS) was defined as the time from cancer diagnosis to death. Kaplan-Meier curves were plotted to study the impact of the specific primary site and of treatment (radiation therapy and chemotherapy). Demographic features included age (18-49 years, 50-64 years, 65-79 years), gender (female, male), race (white, black, other), and marital status (married, other). Tumor characteristics included source site (digestive system, respiratory system, female breast, male prostate, gynecology system, other), lymph node metastasis (N0, Nn, Nx), liver metastasis (No, Yes), lung metastasis (No, Yes), bone metastasis (No, Yes), and brain metastasis (No, Yes). Treatment-related covariates included radiation therapy (No, Yes), chemotherapy (No, Yes), and surgery (No, Yes). Other variables included length of follow-up and vital status (alive, dead).
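As an illustration of how a Kaplan-Meier curve and a median OS are obtained from follow-up time and vital status, the following is a minimal Python sketch of the product-limit estimator. It is not the authors' R code; the function names are hypothetical, and it assumes that deaths at a given time are counted before censorings at that same time.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up times (e.g. months)
    events : 1 = death observed, 0 = censored (alive at last follow-up)
    Returns a list of (time, S(time)) at each distinct death time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)          # deaths and censorings leaving the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Patients censored at t are still counted as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below; None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    # Toy data, for illustration only.
    months = [2, 3, 3, 5, 8, 12]
    status = [1, 1, 0, 1, 0, 1]
    print(median_survival(kaplan_meier(months, status)))
```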

KEYWORDS
cancer of unknown primary site, metastatic adenocarcinoma, nomogram, predictors, SEER

| The binary logistic regression and Cox proportional hazard modeling
Logistic regression models were fitted to estimate the probability that the primary site was the digestive system, respiratory system, or female breast, based on the training sets from study cohort 1 and study cohort 2, respectively. Candidate risk factors included age, gender, race, marital status, tumor size, lymph node metastasis, liver metastasis, lung metastasis, bone metastasis, and brain metastasis. Multivariable logistic regression models were built to identify the significant risk factors. Additionally, univariable and multivariable Cox proportional hazard models were developed to study the prognostic factors of MACUP patients. In particular, to explore the effect of treatment on survival in patients stratified by tumor source, we conducted a subgroup analysis, displayed as forest plots, using univariable Cox regression followed by multivariable regression if the variable was significant, which yielded hazard ratios (HRs) and 95% confidence intervals (CIs) adjusted for potential confounders.
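The odds ratios and 95% confidence intervals reported for these models relate to the fitted log-odds coefficients in a simple way. The sketch below, in Python rather than the R used in the study, assumes Wald-type intervals (exponentiated coefficient plus or minus 1.96 standard errors); whether the published intervals were Wald or profile-likelihood intervals is not stated, so the back-calculation is only approximate. Function names are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # 97.5th percentile of the standard normal


def odds_ratio_ci(beta: float, se: float, z: float = Z_95):
    """Odds ratio and Wald confidence interval from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower: float, upper: float, z: float = Z_95) -> float:
    """Back out the standard error of the log-odds from a reported Wald CI."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


if __name__ == "__main__":
    # Reported in the abstract for liver metastasis and digestive system origin:
    # OR = 13.21, 95% CI = 8.48-21.02.
    se = se_from_ci(8.48, 21.02)
    print(round(se, 3))
    # Rebuild the interval from log(OR) and the approximate standard error.
    print(odds_ratio_ci(math.log(13.21), se))
```

The rebuilt interval will not match the published one exactly, because the reported point estimate is not the exact geometric midpoint of the reported limits.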

| Logistic regression nomograms development and validation
To help physicians predict the primary site in the whole and female MACUP populations, we constructed nomograms based on the multivariable logistic regression models fitted to the training sets. Patients were then classified as low-, medium-, or high-risk according to the quantiles of their nomogram scores, and the probabilities of the specific primary site (digestive system, respiratory system, or female breast) were compared among these subgroups. Next, we calculated the concordance index (C-index) and drew calibration plots for internal validation of the nomograms. 12 The C-index was used to evaluate the discriminatory power of the nomograms, and the calibration curves were used to assess the agreement between predicted and observed probabilities. External validation was performed with calibration curves in the validation set. Moreover, we performed decision curve analysis (DCA) 13 to assess the clinical utility of the nomograms by calculating the net benefit at each threshold probability, and plotted clinical impact curves (CIC) 14 to show the models' clinical value more intuitively.
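The two summary measures named above can be written out compactly. The following Python sketch is an illustration under stated assumptions, not the rms or rmda implementation used in the study: the C-index is computed as the proportion of concordant (event, non-event) pairs with ties counted as one half, and the net benefit uses the standard decision-curve formula, true positives per patient minus false positives per patient weighted by pt / (1 - pt). Function names and the toy data are hypothetical.

```python
def c_index(probs, outcomes):
    """Concordance index for a binary outcome.

    Proportion of (event, non-event) pairs in which the patient with the event
    has the higher predicted probability; tied predictions count as 0.5.
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                score += 1.0
            elif p == q:
                score += 0.5
    return score / (len(pos) * len(neg))


def net_benefit(probs, outcomes, threshold):
    """Net benefit of acting on patients whose predicted probability >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(probs)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    # Toy predictions and outcomes, for illustration only.
    probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    outcomes = [1, 1, 1, 0, 0, 0]
    print(c_index(probs, outcomes))
    print(net_benefit(probs, outcomes, 0.5))
```

A decision curve is then the net benefit evaluated over a grid of thresholds and compared with the "treat all" and "treat none" strategies.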

| Statistical analysis
All data in this study were analyzed in R software (version 3.6.1, https://www.r-project.org/). The logistic regression models, nomograms, C-index, calibration curves, DCA, CIC, Cox regression models, Kaplan-Meier curves, and forest plots were produced with R packages including stats, rms, rmda, survival, and forestplot. p < 0.05 was considered statistically significant.

| Population enrollment and features
We included a total of 1011 patients diagnosed with MACUP in this study, of whom 425 (42%) were male and 586 (58%) were female. The digestive system, respiratory system, female breast, male prostate, and gynecology system accounted for approximately 95% of all tumor sources determined by pathological examination of the metastatic cancer, and the breast was the most common source site in female patients (Table S1). The characteristics of the whole and female MACUP populations are shown in Table 1 and Table 2, respectively. Specifically, MACUP patients with a digestive system origin were more likely to have no lymph node metastasis (60.49%) and to have liver metastasis (63.89%), while those with a respiratory system origin were prone to lymph node metastasis (59.20%) and brain metastasis (35.79%). Moreover, white race (85.71%) and bone metastasis (59.61%) were significantly associated with a source site in the female breast.

| Survival analysis and prognostic factors
MACUP patients whose primary site was the digestive system or respiratory system showed the worst prognosis, with a median OS of only 6 and 9 months, respectively, compared with the source sites of female breast (33 months), male prostate (31 months), gynecology system (34 months), and other (40 months) (Figure 2A). Kaplan-Meier curves also showed that radiation therapy and chemotherapy did not affect OS, while surgery might prolong OS, although only 55 patients underwent resection (Figure 2B-D).
Considering the sample size, we conducted a subgroup analysis stratified by primary tumor site to determine the impact of radiation therapy or chemotherapy on prognosis. Again, there was no significant difference between the radiation therapy and nonradiation groups for any site (Figure 3A). Interestingly, patients with a primary site in the digestive system (HR = 0.41, 95%CI = 0.32-0.53, p < 0.001), respiratory system (HR = 0.54, 95%CI = 0.41-0.70, p < 0.001), or gynecology system (HR = 0.27, 95%CI = 0.11-0.66, p < 0.001) benefited from chemotherapy (Figure 3B).

| Risk factors of the specific source sites and prediction models
The study cohorts were further divided into training and validation sets, the characteristics of which are shown in Tables S2 and S3. Risk predictors for the source sites of digestive system, respiratory system, or female breast were estimated with binary logistic regression models, and the results are provided in Tables 3-5 (Table 6). Logistic regression nomograms were constructed from the significant factors to predict a source site in the digestive system (Figure 4A) or respiratory system (Figure 5A) in the whole MACUP population, and in the breast in female patients (Figure 6A). The C-indexes of these nomograms, 86.7%, 82.4%, and 75.3%, respectively, showed good discrimination, while the calibration curves showed excellent accuracy of the models in both internal and external validation (Figure 7A-F). The risk scores of some covariates were calculated and are shown in Table S4, and low-, medium-, and high-risk patients were identified using the 25th and 75th percentiles of the scores. The risk stratification showed that the site probability differed significantly between these subgroups (Figure S1). Furthermore, DCAs based on the logistic regression nomograms showed the range of threshold probabilities over which the models were useful for predicting a source site in the digestive system (0%-90%) (Figure 4B), respiratory system (0%-90%) (Figure 5B), or female breast (0%-80%) (Figure 6B). CICs revealed that the number of patients classified as high risk by these three models approached the number of high-risk patients with the event as the threshold probability increased (Figures 4C, 5C, and 6C).
TABLE 4 Logistic regression analysis of the risk factors for the source site of digestive system in MACUP patients
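The risk stratification by the 25th and 75th percentiles of the nomogram scores can be sketched as follows. This is a minimal Python illustration, not the authors' procedure: the handling of scores that fall exactly on a cut point (assigned here to the medium group) and the quantile interpolation method are assumptions, and the scores and outcomes are hypothetical.

```python
import statistics


def stratify_by_quartiles(scores):
    """Label each score 'low', 'medium', or 'high'.

    Assumption: low is below the 25th percentile, high is above the 75th
    percentile, and scores on a cut point fall in the medium group.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    labels = []
    for s in scores:
        if s < q1:
            labels.append("low")
        elif s > q3:
            labels.append("high")
        else:
            labels.append("medium")
    return labels


def observed_rate_by_group(labels, outcomes):
    """Observed proportion of the target primary site within each risk group."""
    totals, events = {}, {}
    for group, y in zip(labels, outcomes):
        totals[group] = totals.get(group, 0) + 1
        events[group] = events.get(group, 0) + y
    return {group: events[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Hypothetical nomogram scores and outcomes (1 = target site), illustration only.
    scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]
    outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1]
    labels = stratify_by_quartiles(scores)
    print(observed_rate_by_group(labels, outcomes))
```

Comparing the observed proportions across the three groups corresponds to the subgroup comparison of site probabilities described in the text.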

| DISCUSSION
Cancer of unknown primary site (CUP) is regarded as an enigmatic malignancy, and its prognosis is poorer than that of metastatic cancers with a clear primary site. 15 Pathological examination and gene expression profiling to determine the tissue of origin are attracting wide attention. 16,17 Theoretically, the treatment of CUP should depend on the site of origin, and individualized site-specific therapy should be preferable to nonselective empirical chemotherapy. However, whether precision therapy brings a survival benefit is controversial based on current evidence. Hayashi et al. 18 and Fizazi et al. 19 conducted randomized controlled trials and found that site-specific therapy based on  10 : Type 1 means that the tissue of origin cannot be determined by pathological examination and immunohistochemistry of the metastatic cancer; Type 2 means that the tissue of origin can be identified by pathological examination and immunohistochemistry of the metastatic cancer, so that the T stage is defined as T0. Moreover, Tao et al. 10 reported that type 2 CUP had a better prognosis than type 1. Metastatic adenocarcinoma of unknown primary site (MACUP) is the most common CUP 20 and signals an unfavorable prognosis. 4 Considering data availability, we included MACUP patients whose potential primary sites were identified by pathological examination and immunohistochemistry, and profiled their demographic variables and tumor characteristics. In particular, our nomograms display the probabilities of a digestive system, respiratory system, or female breast origin, which might inform the clinical strategy.
A total of 1011 patients with MACUP were included in the present study. Based on pathological examination and immunohistochemistry, the digestive system (32.05%), respiratory system (29.57%), and female breast (20.08%) were the three most common sites, which was similar to the distribution of cancers with a definite primary site. 21 As for prognosis by site of origin, we found that the digestive system and respiratory system showed the worst survival in the Kaplan-Meier curves and multivariate Cox regression, which was also consistent with carcinomas whose primary sites were determined at initial diagnosis. 21 Interestingly, radiation and chemotherapy were not significantly associated with prognosis, while surgery for existing tumors could prolong survival, suggesting that MACUP was so heterogeneous that radiation or chemotherapy could not accurately treat the primary tumor. Furthermore, the subgroup analysis showed that MACUP patients with a digestive or respiratory system origin could benefit from chemotherapy, which might be because these groups are composed of colon cancer and lung cancer and are highly sensitive to chemotherapy. Considering the sample size, we fitted three logistic regression models to identify risk factors for a digestive system, respiratory system, or female breast origin. We found that metastatic features were significantly associated with the different primary sites. MACUP patients with a digestive system origin were more likely to have liver or lung metastasis and less likely to have lymph node, bone, or brain metastasis, while lymph node or brain metastasis and the absence of liver or lung metastasis were more common with a respiratory system origin. For a female breast origin, lymph node metastasis, bone metastasis, and the absence of liver metastasis were more common. To our knowledge, these findings have not been reported before, and they might provide some reference for clinical practice.
TABLE 6 Logistic regression analysis of the risk factors for the source site of female breast in MACUP patients
Finally, based on the logistic regression models and variable screening, we constructed three nomograms for predicting the different sites of origin: digestive system, respiratory system, and female breast. These nomograms could predict the probabilities of the primary site for MACUP patients from common clinical features, with a high C-index and excellent calibration. Moreover, the clinical effectiveness of our nomograms was evaluated by DCAs and CICs, which showed good utility over a wide range of threshold probabilities and could help clinicians treat MACUP as a potentially site-determined carcinoma according to the predicted probabilities. Subsequent site-specific therapy could then be tailored individually by referring to the protocol for the known tumor site, and these patients could benefit from this treatment model. 10
Nevertheless, this study has some limitations. First, it was a retrospective analysis based on the SEER database, and the current determination of the origin sites by pathological examination and immunohistochemistry lacked a specific standard, which might reflect the complexity of MACUP and is a direction for future work on predicting the primary site. Second, gene testing might be important to clarify the heterogeneity of MACUP. Although these common clinical features could help physicians calculate the probabilities of the specific sites, directly distinguishing one origin site from another might be difficult and further research should
FIGURE 4 (A) Logistic-regression nomogram for predicting the site probability of digestive system in patients with metastatic adenocarcinoma of unknown primary site (MACUP). There are six factors in this nomogram: gender, lymph node metastasis, liver metastasis, lung metastasis, bone metastasis, and brain metastasis; (B) Decision curve analysis shows that if the threshold probability is between 1% and 90%, using the nomogram to predict the site probability of digestive system in MACUP patients adds more clinical benefit; (C) Clinical impact curve reveals that the number of high risk increases as the threshold probability increases, indicating that the nomogram can provide good clinical utility in MACUP patients
FIGURE 5 (A) Logistic-regression nomogram for predicting the site probability of respiratory system in patients with metastatic adenocarcinoma of unknown primary site (MACUP). There are six factors in this nomogram: age, gender, lymph node metastasis, liver metastasis, lung metastasis, and brain metastasis; (B) Decision curve analysis shows that if the threshold probability is between 1% and 90%, using the nomogram to predict the site probability of respiratory system in MACUP patients adds more clinical benefit; (C) Clinical impact curve reveals that the number of high risk increases as the threshold probability increases, indicating that the nomogram can provide good clinical utility in MACUP patients
FIGURE 6 (A) Logistic-regression nomogram for predicting the site probability of female breast in patients with metastatic adenocarcinoma of unknown primary site (MACUP). There are four factors in this nomogram: race, lymph node metastasis, liver metastasis, and bone metastasis; (B) Decision curve analysis shows that if the threshold probability is between 1% and 80%, using the nomogram to predict the site probability of female breast in MACUP patients adds more clinical benefit; (C) Clinical impact curve reveals that the number of high risk increases as the threshold probability increases, indicating that the nomogram can provide good clinical utility in MACUP patients