Prediction of lung cancer risk in Chinese population with genetic‐environment factor using extreme gradient boosting

Abstract Background Detecting early‐stage lung cancer is critical to reduce the lung cancer mortality rate; however, existing models based on germline variants perform poorly, and new models are needed. This study aimed to use extreme gradient boosting to develop a predictive model for the early diagnosis of lung cancer in a multicenter case–control study. Materials and Methods A total of 974 cases and 1005 controls in Shanghai and Taizhou were recruited, and 61 single nucleotide polymorphisms (SNPs) were genotyped. Multivariate logistic regression was used to calculate the association between signal SNPs and lung cancer risk. Logistic regression (LR) and extreme gradient boosting (XGBoost) algorithms, a large‐scale machine learning algorithm, were adopted to build the lung cancer risk model. In both models, 10‐fold cross‐validation was performed, and model predictive performance was evaluated by the area under the curve (AUC). Results After FDR adjustment, TYMS rs3819102 and BAG6 rs1077393 were significantly associated with lung cancer risk (p < 0.05). For lung cancer risk prediction, the model predicted only with epidemiology attained an AUC of 0.703 for LR and 0.744 for XGBoost. Compared with the LR model predicted only with epidemiology, further adding SNPs and applying XGBoost increased the AUC to 0.759 (p < 0.001) in the XGBoost model. BAG6 rs1077393 was the most important predictor among all SNPs in the lung cancer prediction XGBoost model, followed by TERT rs2735845 and CAMKK1 rs7214723. Further stratification in lung adenocarcinoma (ADC) showed a significantly elevated performance from 0.639 to 0.699 (p = 0.009) when applying XGBoost and adding SNPs to the model, while the best model for lung squamous cell carcinoma (SCC) prediction was the LR model predicted with epidemiology and SNPs (AUC = 0.833), compared with the XGBoost model (AUC = 0.816). 
Conclusion Our lung cancer risk prediction models in the Chinese population have a strong predictive ability, especially for SCC. Adding SNPs and applying the XGBoost algorithm to the epidemiologic‐based logistic regression risk prediction model significantly improves model performance.


| INTRODUCTION
Lung cancer is the most common cause of cancer mortality, and its 5-year survival rate is less than 20% overall and 56% in the early stages. 1 Early diagnosis and interventions for lung cancer are crucial for the extension of survival time. One of the early detection methods is low-dose computed tomography (LDCT), but low population coverage and X-ray exposure limit its application in lung cancer screening. [2][3][4] SNP-based risk prediction models are appropriate tools for preventive interventions that provide an estimate of the risk of developing lung cancer. 5 A model combining 301 SNPs with smoking pack-years achieved an area under the curve (AUC) of 0.656 in Caucasians. 6 A model based on smoking duration (years) had an AUC of 0.75, and adding 20 SNPs raised the AUC to as much as 0.81 in Caucasians. 5
Logistic regression is the most widely used method for model building because of its efficiency and interpretability. However, there are several disadvantages in the application of logistic regression. First, the assumption of linearity in logistic regression rarely holds. Next, interpreting a logistic regression coefficient as an odds ratio does not account for associations with other independent variables. Moreover, logistic regression is not sufficiently robust when a strongly influential outlier is present. 7
The extreme gradient boosting (XGBoost) algorithm, a large-scale machine learning algorithm, is an efficient and scalable variant of gradient boosting and can be used for both classification and regression problems. XGBoost has achieved the best AUC compared to other machine learning algorithms for predicting the stage of cancer patients. 8 To our knowledge, several studies have included SNPs in model-building in the Chinese population. [9][10][11] All of them used logistic regression, and none of these models had sufficient predictive performance, with an AUC <0.7.
In this study, we constructed susceptibility models using epidemiologic information and 61 SNPs by using logistic regression and machine learning methods for lung cancer and stratifications such as lung adenocarcinoma (ADC) and lung squamous cell carcinoma (SCC). The area under the receiver operating characteristic curve (AUC) was used to assess the contribution of SNPs and machine learning methods to the risk prediction models.

| Study subjects
We recruited 974 cases and 1005 controls for the case-control study of the Chinese Han population. Briefly, lung cancer patients diagnosed between March 2005 and January 2010 were recruited from Shanghai and Taizhou, as shown in Table S1. The inclusion criteria for patients were histologically or cytologically confirmed ADC, SCC, adenosquamous, or small cell lung cancer. Control subjects with no personal history of any cancer were recruited and matched to case subjects by age and sex. Every subject was interviewed to collect epidemiological and clinical data including sex, age, smoking intensity, smoking duration, family history of lung cancer (family history), lung cancer histology, and lung cancer stage. All participants signed informed consent, and the study was approved by the ethics committee of the School of Life Sciences, Fudan University.

KEYWORDS
Chinese population, extreme gradient boosting, lung cancer, risk model, single nucleotide polymorphisms

| SNP selection and genotype detection
Approximately 3-5 ml of peripheral blood was collected from each individual. Genomic DNA was extracted from whole blood samples by using the Qiagen Blood Kit (Qiagen) following the manufacturer's instructions. SNPs were selected from among those that were significantly associated with lung cancer in genome-wide association studies (GWAS) [12][13][14] or association studies, 15-21 as shown in Figure 1 and Table S2. Genotyping was performed as previously described. 22 SNPs were considered successfully genotyped if they met the following criteria: call rate >95%, Hardy-Weinberg equilibrium (HWE) p > 0.01 in the control group, and minor allele frequency (MAF) >0.01.

| Statistical analysis
Pearson's chi-squared test or Student's t-test was used to evaluate clinical and environmental factor differences between the case and control groups. Univariate unconditional logistic regression was used to estimate the association between SNPs and the risk of lung cancer. The odds ratio (OR) and 95% confidence interval (CI) were calculated by adjusting for age and gender in the additive, dominant, and recessive models of SNPs. p-values were adjusted for multiple testing by using the false discovery rate (FDR) method.
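As a rough illustration of two steps in this paragraph, the Python sketch below shows the conventional coding of a genotype under the additive, dominant, and recessive models and a false discovery rate adjustment. The text does not name the specific FDR procedure; the Benjamini-Hochberg step-up method is assumed here, and the function names and p-values are hypothetical.

```python
def code_genotype(minor_allele_count, model):
    """Encode a genotype (0, 1 or 2 copies of the minor allele) for a genetic model."""
    if model == "additive":
        return minor_allele_count            # per-allele effect
    if model == "dominant":
        return int(minor_allele_count >= 1)  # carriers vs non-carriers
    if model == "recessive":
        return int(minor_allele_count == 2)  # minor-allele homozygotes vs others
    raise ValueError(f"unknown model: {model}")

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four SNPs.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Under this procedure a SNP is declared significant when its adjusted p-value falls below the chosen threshold (0.05 in this study).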
Two types of models were built using 61 SNPs and age, gender, smoking intensity, smoking duration, and family history: logistic regression (LR) and extreme gradient boosting (XGBoost) models. For the LR model, a stepwise method (both directions) was used for feature selection. For XGBoost, improvements in accuracy were used for the feature selection. After model building, 10-fold cross-validation 23 was performed. The performance of all risk models was evaluated using the AUC statistics of the receiver operating characteristic (ROC). The cutoff value for the ROC curve was calculated using the Youden method, and the corresponding sensitivity and specificity were also calculated. The 95% confidence interval (CI) of the AUC was calculated by bootstrapping 2000 times. Two ROC curves were compared using 2000 bootstrap replicates, and the p-value was calculated. Further stratifications were applied by histology, sex, smoking status (non-smoker vs smoker), and family history.
All tests were two-sided, and statistical significance was set at p < 0.05. All the above analyses were performed using R v3.6.2.
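The evaluation metrics described in this section (AUC, the Youden cutoff with its sensitivity and specificity, and a bootstrap confidence interval for the AUC) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code used in the study: the function names are hypothetical, the bootstrap is a plain (non-stratified) percentile bootstrap, and the labels and scores are made-up values.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count 1/2
    return wins / (len(pos) * len(neg))

def youden_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximising J = sens + spec - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        # Predict "case" when the score is at or above the candidate cutoff.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= c)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < c)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1:]

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (subjects resampled with replacement)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both cases and controls
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical predicted risks for four controls (0) and four cases (1).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
print(auc(labels, scores))            # 15 of 16 case-control pairs ranked correctly
print(youden_cutoff(labels, scores))
print(bootstrap_auc_ci(labels, scores, n_boot=200))
```

In a cross-validated setting, the same functions would be applied to the out-of-fold predictions rather than to predictions on the training data.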

| RESULTS
The epidemiologic characteristics of 974 patients with lung cancer and 1005 controls are shown in Table 1 and Table S1. More male patients (71.1%) than female patients were enrolled. ADC accounted for 48.8% and SCC for 36.5% of all lung cancer patients. Smoking intensity was significantly associated with lung cancer risk; however, smoking duration was not associated with lung cancer risk.
All 61 SNPs were successfully genotyped. The association between 61 SNPs and lung cancer risk was evaluated in the additive model. After adjustment for sex and age, two SNPs were significantly associated with lung cancer risk (p < 0.05, Tables S3-S5). TYMS rs3819102 conferred an increased risk of lung cancer in the additive model. BAG6 rs1077393 was found to decrease the risk of developing lung cancer. Further stratification showed that TYMS rs3819102 was significantly associated with the risk of SCC. Compared with the additive model, no new significant lung cancer-associated SNPs were found in the dominant and recessive models. Some results have been previously published. 22,24
For lung cancer risk prediction, the model predicted only with epidemiology attained an AUC of 0.703 for LR and 0.744 for XGBoost (Table 2). Further addition of SNPs increased the AUC to 0.759 in the XGBoost model. Significant improvements were achieved using XGBoost and the addition of SNPs (p < 0.001).
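A comparison of two AUCs such as the one reported here can be sketched with a paired bootstrap. The Python code below is an illustration only: the study states that 2000 bootstrap replicates were used but does not give the test statistic, so the form assumed here (observed AUC difference divided by the standard deviation of the bootstrap differences, referred to a standard normal distribution) and all names and data are hypothetical.

```python
import math
import random

def auc(labels, scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def compare_auc_bootstrap(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Two-sided paired bootstrap test for AUC(a) - AUC(b) on the same subjects."""
    rng = random.Random(seed)
    n = len(labels)
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    diffs = []
    while len(diffs) < n_boot:
        # Resample subjects once and score both models on the same resample.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if not 0 < sum(ys) < n:  # skip resamples lacking cases or controls
            continue
        diffs.append(auc(ys, [scores_a[i] for i in idx])
                     - auc(ys, [scores_b[i] for i in idx]))
    mean = sum(diffs) / n_boot
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_boot - 1))
    if sd == 0:
        # Degenerate case: the difference never varies across resamples.
        return observed, (1.0 if observed == 0 else 0.0)
    z = observed / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return observed, p_value

# Hypothetical predictions from two models on the same eight subjects.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
model_b = [0.5] * 8  # an uninformative model
print(compare_auc_bootstrap(labels, model_a, model_b, n_boot=200))
```

Resampling subjects jointly keeps the two models' predictions paired, which matters because both models are evaluated on the same individuals.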
The improvements in the accuracy of the model variables are listed in Table 3. Smoking intensity was the most important factor in the model, followed by smoking duration, age, and sex. Among the 17 SNPs included in the lung cancer GWAS model, BAG6 rs1077393, TERT rs2735845, and CAMKK1 rs7214723 were more important than family history. BAG6 rs1077393 and TERT rs2735845 were significantly associated with lung cancer risk in our study, while CAMKK1 rs7214723 showed no association with lung cancer risk.
We performed model building for ADC and SCC risks separately (Table 4). Applying XGBoost to the ADC model significantly improved model performance. Compared with the LR model predicted only with epidemiology, the XGBoost model predicted with epidemiology and SNPs achieved a significant elevation in the AUC of approximately 0.06 (p = 0.009). Further analysis of model accuracy improvement showed that age was more important than smoking duration in ADC (Table 5). Moreover, family history was not considered important in the ADC model. The three top SNPs in the ADC model were CHRNB3 rs4236926, ARHGEF11 rs868188, and NQO1 rs1800566. None of these three SNPs was significantly associated with ADC risk in the logistic regression analysis.
The AUC of all SCC models was above 0.8 (Table 4). The performance of the LR model predicted only with epidemiology was AUC = 0.818. Adding SNPs or applying XGBoost resulted in little improvement in AUC, with the best AUC being 0.833 in the LR model predicted with epidemiology and SNPs. The most important factors for the …
TABLE 3 Improvement in accuracy for epidemiology factors and SNPs in the XGBoost-based lung cancer risk model
In the male population (AUC = 0.791), people aged 60 and older (AUC = 0.761), smokers (AUC = 0.785), people without family history (AUC = 0.734), and people with family history (AUC = 0.790), lung cancer risk prediction showed good performance.
Interestingly, in some of these sub-populations (males, people aged 60 and older, smokers, and people with family history), the XGBoost model predicted with epidemiology and SNPs was the best modeling approach, while in the others (females, people under age 60, non-smokers, and people without family history), LR was the best approach. All stratified models showed a trend toward improved performance of the LR model after adding SNPs, and the improvement was significant in some strata (females, non-smokers) (Table 7).
The important clinical factors differed across stratifications. For the lung cancer risk model in the male population, people under age 60, people aged 60 and older, smokers, and people without family history, the top three important factors were smoking intensity, smoking duration, and age. For the lung cancer risk model in people with family history, age was a more important factor than smoking duration. For the lung cancer risk model in the female population, age was the most important factor, while smoking duration became less important.
TYMS rs3819102, the SNP most significantly associated with lung cancer (p = 0.007), was included in the models for all stratifications. This SNP had an important role (among the top five important SNPs) in risk models for males, people aged 60 and older, smokers, and people without family history. BAG6 rs1077393, the most important SNP in the lung cancer risk model, remained in the top five important SNPs in risk models for people under the age of 60 and people with family history.

| DISCUSSION
This study demonstrates the use of logistic regression and XGBoost algorithms to investigate the contribution of SNPs to the prediction of lung cancer susceptibility. Using genotype data from 974 lung cancer patients and 1005 healthy people in the Chinese population, we found that adding SNPs or applying the XGBoost algorithm to the epidemiologic-based logistic regression risk prediction model significantly improved the model performance, especially for the lung cancer risk prediction model and ADC risk prediction model.
Lung cancer is a heterogeneous disease that includes two main types: ADC (40%) and SCC (25-30%). Different disease patterns and treatment strategies exist. 25 Exome sequences in tumor-normal pairs reveal that ADC and SCC are less similar than other cancer types for significantly mutated genes. 26 GWAS in Caucasians also revealed that SNPs associated with both ADC and SCC only account for a small part of ADC-or SCC-related SNPs. 27 However, heterogeneity between ADC and SCC has not been considered in model-building studies in the Chinese population. 9-11 Our model has a better result for lung cancer models when compared with previous model-building studies in the Chinese population.
Smoking intensity is the most important factor that improves the performance of the XGBoost model in the lung cancer risk model. Cigarette smoke (CS) contains hundreds of carcinogens and various factors that induce cancer. CS-generated oxidants lead to DNA adducts and further induce DNA double-strand breaks (DSBs) in normal human bronchial epithelial cells or A549 pulmonary carcinoma cells, especially in S-phase cells. 28 … DNA damage repair and promotes lung ADC development in response to CS-carcinogen exposure. 29 Another in vitro study showed that cigarette smoke components notably inhibited glycogen synthase kinase 3 (GSK3) and induced the expression of involucrin, a marker of squamous differentiation, implying a possible mechanism for CS-induced squamous differentiation. 30
We found good predictive performance in the SCC model with logistic regression, whereas SNPs played little role in the model. Smoking is a major risk factor for lung cancer, 31,32 and smokers mostly develop squamous cell lung cancer with a high somatic mutation burden. 33 Our SCC model performs well, suggesting that smoking plays a key role in the development of lung SCC. The role of SNPs in the model is relatively small, possibly because the environment-induced somatic mutations in lung SCC exceed the role of germline SNPs.
In our study population, TYMS rs3819102 increased lung cancer risk and also showed importance in our lung cancer risk prediction models, especially among males, people aged 60 and older, smokers, and people without a family history of cancer. TYMS encodes thymidylate synthase, an enzyme involved in the biosynthesis of thymidylate, and is a key regulator of DNA synthesis. 34 Thymidylate synthase uses CH2H4 folate as a methylene group donor; thus, the function of this enzyme is regulated by the folate pathway. The TCGA database shows that TYMS expression is elevated in lung ADC and SCC. 35 Data from three microarray datasets and protein-protein interaction network construction show that TYMS expression changes are the key signature in carcinogenesis of NSCLC. 36 rs3819102, located in the 3′-flanking region of TYMS, was reported to increase the risk of endometrial cancer in the Chinese population. 21
In both XGBoost-based lung cancer risk and SCC risk prediction models, the most important SNP was in BAG6. BAG6 is involved in cellular processes, such as apoptosis, gene regulation, and protein degradation. 37 Soluble BAG6 can be released from heat-shocked tumor cells and inhibit NK cell cytotoxicity, indicating that BAG6 may be a mediator of tumor immune escape. Polymorphism in BAG6 (rs3117582) was reported to have strong evidence of association with lung cancer risk among Caucasians in a meta-analysis. 38 However, in the ADC risk prediction model, CHRNB3 was found to be more important than BAG6. CHRNB3 encodes a nicotinic acetylcholine receptor subunit, and a study among Caucasian smokers involving 661 lung ADC cases and 1347 controls found that two CHRNA3 SNPs (rs1051730 and rs12914385) associated with ADC risk had significant indirect effects on lung ADC risk through nicotine dependence. 39
There are several points in our research that require improvement.
First, before applying this model, it is necessary to verify its performance in a larger cohort. Next, the predictive performance of the ADC model was poor. The main reason may be that, at the time of SNP selection, most studies focused on the effect of SNPs on lung cancer and smoking, and only a small proportion of studies reported ADC-associated SNPs, so most SNPs included in this study were associated with smoking. Future studies could include more ADC-associated SNPs in the model. Moreover, environmental factors such as particulate matter 2.5 level 40 were not included in the model. Thus, more SNPs and environmental factors need to be considered in the model, and this result should be validated with independent studies or meta-analysis.
In summary, our XGBoost modeling method significantly improved the predictive performance of the lung cancer model in the Chinese population. This model may be useful in assisting the early screening of lung cancer, advancing the clinical application of machine learning to genetic epidemiological data for screening and surveillance. High-risk individuals, such as people with a family history or smokers, whose risk is above the cutoff value should receive lung cancer screening examinations with LDCT [41][42][43] and have early access to high-quality lung cancer screening. Moreover, the SNPs included in our model may provide new targets for lung cancer.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the ethics committee of the School of Life Sciences, Fudan University. The patients/participants provided their written informed consent to participate in this study.