Risk prediction model for lung cancer incorporating metabolic markers: Development and internal validation in a Chinese population

Abstract Background Low‐dose computed tomography screening has been shown to reduce lung cancer mortality; however, the issues of high false‐positive rates and overdiagnosis remain unsolved. Risk prediction models for lung cancer that accurately identify high‐risk populations may help to increase screening efficiency. We thus sought to develop a risk prediction model for lung cancer incorporating epidemiological and metabolic markers in a Chinese population. Methods Between 2006 and 2015, a total of 122 497 people were observed prospectively for lung cancer incidence, contributing a total of 976 663 person‐years. Stepwise multivariable‐adjusted logistic regression with P entry = .15 and P stay = .20 was conducted to select candidate variables, including demographics and metabolic markers such as high‐sensitivity C‐reactive protein (hsCRP) and low‐density lipoprotein cholesterol (LDL‐C), for the prediction model. We used the C‐statistic to evaluate discrimination and the Hosmer‐Lemeshow test to evaluate calibration. Tenfold cross‐validation was conducted for internal validation to assess the model's stability. Results A total of 984 lung cancer cases were identified during follow‐up. The epidemiological model including age, gender, smoking status, alcohol intake status, coal dust exposure status, and body mass index generated a C‐statistic of 0.731. The full model, which additionally included hsCRP and LDL‐C, showed significantly better discrimination (C‐statistic = 0.735, P = .033). In stratified analysis, the full model showed better predictive power in terms of C‐statistic in younger participants (<50 years, 0.709), females (0.726), and former or current smokers (0.742). The model was well calibrated across the deciles of predicted risk in both the overall population (P HL = .689) and all subgroups. Conclusions We developed and internally validated an easy‐to‐use risk prediction model for lung cancer in a Chinese population that could provide guidance for screening and surveillance.


| INTRODUCTION
Lung cancer remains the leading cause of death from cancer worldwide. 1 In China, lung cancer is a serious public health issue. According to data from the International Agency for Research on Cancer (IARC), in 2018, 37.0% of new lung cancer cases and 39.2% of lung cancer-related deaths worldwide occurred in China. 2 The survival rate of lung cancer in China is poor (16.1%); however, the prognosis varies greatly by stage at diagnosis. 3 The 5-year survival was <10% for patients with stage IV lung cancer but over 77% for patients diagnosed at stage I. 4 Taken together, early detection and prevention strategies could have a profound effect on reducing the overall disease burden attributable to lung cancer.
Lung cancer screening has been shown to be beneficial. In 2011, the National Lung Screening Trial showed that low-dose computed tomography (LDCT) screening reduced lung cancer mortality in asymptomatic high-risk smokers. 5 Subsequently, the US Preventive Services Task Force (USPSTF) recommended annual LDCT screening for lung cancer in adults aged 55-80 years who were current or former (<15 years since quitting) smokers with a history of ≥30 pack-years. 6 However, LDCT screening could lead to a large number of indeterminate nodules, and a significant proportion of lung cancer cases do not meet the screening entry criteria defined by the USPSTF. 7 Therefore, accurate identification of the high-risk subpopulations to be screened is critical to maximize the efficacy of lung cancer screening.
An accurate lung cancer risk prediction model can contribute effectively to the identification of high-risk individuals. Several lung cancer risk prediction models have been developed, primarily on the basis of established risk factors such as smoking, occupational exposures, family history of lung cancer, and respiratory diseases. Previous studies have shown that lipids and high-sensitivity C-reactive protein (hsCRP) are predictive of lung cancer risk. 41,42 However, evidence on the predictive performance of these markers in lung cancer beyond smoking-based epidemiological models is limited. Moreover, no risk prediction model for lung cancer based on both traditional epidemiological risk factors and biomarkers exists for the mainland Chinese population. Therefore, in the present study, focusing on established risk factors for lung cancer that are routinely available in general clinical settings, we aimed to develop and internally validate a risk prediction model for lung cancer.

| Study population
The Kailuan cohort is a large prospective dynamic cohort study in Tangshan. Individuals aged more than 18 years were invited to participate in questionnaire interviews and clinical examinations every 2 years at 11 hospitals affiliated with the Kailuan Group.
Participants who provided informed consent and completed the questionnaire interview were enrolled in the present study. Participants with a diagnosis of cancer before the baseline survey (n = 555) or with missing information on covariates included in the models (n = 15 653) were excluded. Ultimately, a total of 122 497 participants were included in the final analysis.
The study was approved by the Medical Ethics Committee of the Kailuan Medical Group. All participants signed written informed consent forms.

| Exposure assessment
Standardized questionnaire interviews and health examinations were conducted for all individuals by trained staff at baseline. Information on demographics, lifestyle factors, personal medical history, and family history of common noninfectious chronic diseases (NCDs) was collected as potential indicators.
Smoking was defined as smoking ≥1 cigarette per week for at least 12 months. Drinking was defined as drinking ≥1 time per month for at least 6 months. In addition, we derived information on coal dust exposure from each miner's work history.
The weight and height of the individuals were measured on standard scales and stadiometers without shoes. The body mass index (BMI) was calculated as weight (kg)/height² (m²). The waist circumference (WC) was measured at the midpoint between the supramargin of the iliac crest plane and the lower edge of the rib. The blood pressure (BP) was measured on the left arm using a mercury sphygmomanometer according to the standard recommended procedures. 43 Systolic blood pressure (SBP) was defined as the point at which the first of two or more Korotkoff sounds is heard, and diastolic blood pressure (DBP) was defined as the point at which the Korotkoff sounds disappear.

| Ascertainment of lung cancer cases
We followed participants from the baseline examination until the occurrence of cancer, death, or 31 December 2015, whichever came first. The details of cohort follow-up and cancer assessment have been published previously. 42 In brief, people with cancer were identified through biennial health examinations and annual searches of the Tangshan medical insurance system and the Kailuan social security system. Moreover, the outcome information was further confirmed by checking discharge summaries from hospitals where participants were diagnosed or treated. The diagnosis of incident primary lung cancer was confirmed through review of medical records by clinical experts. Information on pathological diagnosis, imaging diagnosis (including ultrasonography, computerized tomographic scanning, and magnetic resonance imaging), blood biochemical examination, and alpha-fetoprotein test was collected for the incident lung cancer assessment. Cancers were coded according to the International Classification of Diseases, Tenth Revision (ICD-10), and lung cancer was coded as C34.
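As a minimal sketch of the follow-up rule described above (not the authors' code), the end of follow-up for each participant can be taken as the earliest of the cancer date, the death date, and the administrative censoring date. The function names and the 365.25-day year are assumptions introduced here for illustration; the study does not state how person-years were computed.

```python
from datetime import date

# Administrative end of follow-up stated in the text.
STUDY_END = date(2015, 12, 31)


def follow_up_end(cancer_date=None, death_date=None, study_end=STUDY_END):
    """Earliest of cancer diagnosis, death, and the end of the study."""
    candidates = [d for d in (cancer_date, death_date, study_end) if d is not None]
    return min(candidates)


def follow_up_years(baseline, cancer_date=None, death_date=None, study_end=STUDY_END):
    """Follow-up time in years, assuming 365.25 days per year (an assumption)."""
    end = follow_up_end(cancer_date, death_date, study_end)
    return (end - baseline).days / 365.25


def person_years(records):
    """Sum follow-up over (baseline, cancer_date, death_date) tuples."""
    return sum(follow_up_years(b, c, d) for b, c, d in records)
```

For example, a participant enrolled on 31 December 2006 and diagnosed on 31 December 2010 would contribute 4.0 person-years under these assumptions, and an event dated after 31 December 2015 would be censored at the study end.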

| Statistical methods
Categorical variables were described by percentages, and the chi-squared test was used to compare differences between groups. Continuous variables were described by mean (standard deviation), and ANOVA was used to compare differences between groups. For each risk factor, the association with lung cancer risk was first assessed by logistic regression adjusted for age group. Stepwise multivariable-adjusted logistic regression (P entry = .15, P stay = .20) was conducted to choose the variables included in the prediction model. Odds ratios (ORs) and 95% confidence intervals (CIs) were presented. The predicted risk of lung cancer was calculated as $\exp(\beta_0 + \sum_i \beta_i X_i) / (1 + \exp(\beta_0 + \sum_i \beta_i X_i))$, where $\beta_0$ was the intercept and $\beta_i$ was the regression coefficient for risk factor $X_i$. Model discrimination was evaluated by receiver-operating characteristic (ROC) curves and concordance statistics (C-statistics). In addition, the internal validation of model discrimination was evaluated by 10-fold cross-validation. The total cohort was randomly divided into 10 subsets; the prediction model was first fitted in 90 percent of the population (training set), and the predicted lung cancer risk was estimated in the remaining 10 percent of the population (validation set). This procedure was repeated for all 10 subsets, and the average C-statistic was calculated. The Hosmer-Lemeshow goodness-of-fit test was used to evaluate model calibration by comparing the observed and predicted probabilities. A value of P HL > .05 indicated satisfactory calibration. Subgroup analyses were performed by age (<50 years vs ≥50 years), gender (male vs female), and smoking status (never smoking vs former or current smoking).
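The predicted-risk formula, the C-statistic, and the cross-validation procedure described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' SAS code: the function names, the simple gradient-ascent fitter (`fit_logistic`) with its learning rate and epoch count, and the unstratified random folds are all assumptions introduced here, and no stepwise selection is performed.

```python
import math
import random


def predicted_risk(intercept, coefs, x):
    """Logistic predicted risk exp(lp) / (1 + exp(lp)), lp = b0 + sum(b_i * x_i)."""
    lp = intercept + sum(b * v for b, v in zip(coefs, x))
    if lp >= 0:  # algebraically identical form that avoids overflow
        return 1.0 / (1.0 + math.exp(-lp))
    e = math.exp(lp)
    return e / (1.0 + e)


def c_statistic(risks, outcomes):
    """Share of case/non-case pairs where the case has the higher risk (ties count 0.5)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    noncases = [r for r, y in zip(risks, outcomes) if y == 0]
    if not cases or not noncases:
        raise ValueError("need at least one case and one non-case")
    score = 0.0
    for rc in cases:
        for rn in noncases:
            if rc > rn:
                score += 1.0
            elif rc == rn:
                score += 0.5
    return score / (len(cases) * len(noncases))


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Hypothetical stand-in fitter: batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = yi - predicted_risk(b0, b, xi)
            g0 += err
            for j, v in enumerate(xi):
                g[j] += err * v
        b0 += lr * g0 / n
        b = [bj + lr * gj / n for bj, gj in zip(b, g)]
    return b0, b


def cross_validated_c(X, y, k=10, seed=0):
    """Average held-out C-statistic over k random folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        risks = [predicted_risk(b0, b, X[i]) for i in fold]
        scores.append(c_statistic(risks, [y[i] for i in fold]))
    return sum(scores) / k
```

The pairwise C-statistic loop is quadratic in the number of participants and is written for readability; a rank-based computation would be preferable at cohort scale. With a rare outcome, a purely random fold could contain no cases, in which case `c_statistic` raises an error; stratifying folds by outcome would be one way to avoid this.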
Furthermore, we calculated the integrated discrimination improvement (IDI) and the net reclassification improvement (NRI) to evaluate the added predictive ability of new factors in risk prediction models. 44 The NRI focuses on reclassification tables constructed separately for participants with and without events, and quantifies the correct movement between categories: upwards for events and downwards for nonevents. 45 The IDI focuses on the improvement in the mean discrimination slope and the probability of discrimination between the base model (eg, simple model) and the new model (eg, full model). 45 Larger NRI and IDI values indicate greater improvements in model discrimination.
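A minimal sketch of how the category-based NRI and the IDI could be computed from the predicted risks of a base model and a new model is given below. The function names and the risk cutoffs passed to `nri` are hypothetical; the text does not report which risk categories were used, and confidence intervals and P values are not covered here.

```python
def _mean(values):
    return sum(values) / len(values)


def discrimination_slope(risks, outcomes):
    """Mean predicted risk among events minus mean predicted risk among nonevents."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevents = [r for r, y in zip(risks, outcomes) if y == 0]
    return _mean(events) - _mean(nonevents)


def idi(base_risks, new_risks, outcomes):
    """IDI: gain in discrimination slope when moving from the base to the new model."""
    return discrimination_slope(new_risks, outcomes) - discrimination_slope(base_risks, outcomes)


def risk_category(risk, cutoffs):
    """Category index: the number of cutoffs at or below the predicted risk."""
    return sum(1 for c in cutoffs if risk >= c)


def nri(base_risks, new_risks, outcomes, cutoffs):
    """Category-based NRI: net upward moves in events plus net downward moves in nonevents."""
    up = {0: 0, 1: 0}
    down = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for rb, rn, y in zip(base_risks, new_risks, outcomes):
        move = risk_category(rn, cutoffs) - risk_category(rb, cutoffs)
        total[y] += 1
        if move > 0:
            up[y] += 1
        elif move < 0:
            down[y] += 1
    events_part = (up[1] - down[1]) / total[1]
    nonevents_part = (down[0] - up[0]) / total[0]
    return events_part + nonevents_part
```

In this sketch, an event reclassified into a higher category and a nonevent reclassified into a lower category each count as correct movement, matching the description above.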
In the secondary analysis, we evaluated all the potential predictors among participants older than 50 years to assess the applicability of our model to the population targeted for LDCT screening. In addition, in sensitivity analyses, continuous variables were used instead of categorical variables to examine whether discrimination could be improved.
All analyses were conducted using the SAS software (Version 9.4; SAS Institute). All statistical tests were two sided, and the significance level was set as P < .05.

| RESULTS
By December 2015, with a median follow-up of 8.87 (7.09-9.15) years and a total of 976 663 person-years, 984 (0.80%) primary lung cancer cases were identified. Compared with controls, lung cancer cases were typically older, had a lower BMI and a lower educational level (junior high school or below), and were more likely to smoke and drink (all P < .05). Moreover, the levels of LDL-C (P = .036) and BMI (P = .019) were lower in lung cancer cases than in controls, while the levels of WC (P = .003), SBP (P < .001), and hsCRP (P < .001) were significantly higher in lung cancer cases than in those without lung cancer (Table 1).  Table 2). In addition, age-adjusted logistic regression showed a positive association of smoking duration and age at smoking initiation with lung cancer risk. However, no association was found between the risk of lung cancer and smoking cessation time, family history of cancer, family history of lung cancer, abdominal obesity, FBG, BP, TC, TG, or HDL-C (Table S1).

| Predictors included in models
In the present study, we considered two sets of models. The epidemiological model included six established predictors for lung cancer: age, gender, smoking status, alcohol intake status, coal dust exposure status, and BMI. The full model, selected through stepwise logistic regression, additionally included two metabolic markers, hsCRP and LDL-C (Table 2).

| Predictive performance of the models
The epidemiological risk prediction model generated a C-statistic of 0.731. A significant improvement in the C-statistic was observed when the full model (C-statistic = 0.735, P = .033) was compared with the epidemiological model (Table 3). ROC curves also suggested improved discrimination when metabolic markers were added to the epidemiological model (Figure 1). Stratified analysis by age showed that the discriminatory performance of the full model was better in participants aged <50 years (C-statistic, 0.709) than in participants aged ≥50 years (C-statistic, 0.655). Moreover, the full model yielded a better C-statistic in females (C-statistic, 0.726) than in males (C-statistic, 0.716). Notably, the C-statistic of the full model in former or current smokers (0.742) was higher than that in never smokers and was statistically significantly higher than the C-statistic of the epidemiological model in former or current smokers (0.735, P = .016) (Table 3).
The results of internal validation by 10-fold cross-validation showed the stability of the models' predictive power. The average C-statistics of the epidemiological model and the full model were 0.728 and 0.735, respectively (Table S2). Table S3 shows the reclassification results. Compared with the epidemiological model, a statistically significantly higher NRI (P < .001) was observed for the full model (15.4, 95% CI,. Similarly, we found a statistically significant improvement in the IDI (P < .001) for the full model (0.03, 95% CI, 0.02-0.05).
The full model showed good calibration across deciles of predicted risk (P HL = .689). The predicted risk for lung cancer was 2.40% in the highest decile compared with 0.09% in the lowest decile (OR, 38.26; 95% CI, 18.95-77.23) (Table 4). Meanwhile, the full model also showed good calibration in all subpopulations.
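The decile-based Hosmer-Lemeshow check behind the P HL value can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the SAS procedure used in the study: the function names are hypothetical, participants are grouped into equal-sized groups by sorted predicted risk, and the chi-square tail probability uses a closed form that holds only for even degrees of freedom (10 groups give 8 degrees of freedom).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(risks, outcomes, groups=10):
    """Return (chi-square statistic, P value) comparing observed and expected cases per group."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    n = len(order)
    chi2 = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        size = len(members)
        observed = sum(outcomes[i] for i in members)
        expected = sum(risks[i] for i in members)
        if size == 0 or expected <= 0 or expected >= size:
            raise ValueError("group is empty or has degenerate expected counts")
        # (O - E)^2 / (n_g * p_bar * (1 - p_bar)) with p_bar = E / n_g
        chi2 += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return chi2, chi2_sf_even_df(chi2, groups - 2)
```

A large P value from this test (P HL > .05 in the study's criterion) means that no significant difference between observed and predicted case counts was detected across the risk groups.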
To test the broad utility of our models in the LDCT screening setting, in the secondary analysis, we considered only participants older than 50 years. As shown in Table S4, the predictors selected through stepwise regression and the corresponding associations were almost the same as those in the model developed in the whole population, which supports the stability and potential utility of our models.
Finally, in the sensitivity analysis, when continuous variables were used instead of categorical variables, the C-statistic of the full model was not improved (C-statistic, 0.728).

| DISCUSSION
In this study, we developed and internally validated two sets of risk prediction models for lung cancer based on data from routine health check-ups, aiming to provide simple and efficient tools for tailored lung cancer screening by effectively identifying high-risk subpopulations. Our results showed that the model that included solely demographic and lifestyle information could discriminate incident lung cancer cases from noncases (C-statistic, 0.731). Moreover, the incorporation of hsCRP and LDL-C as metabolic markers provided a small but statistically significant increase in discriminatory performance (C-statistic for the full model, 0.735). Because all the indicators included in this model can be acquired easily in general clinical or screening settings, the potential for translation into practice is great. Internal validation suggested that the models may discriminate well when applied to other populations.
The evidence base for the included predictors is one of the important measures of the validity of a risk prediction model. In this study, all the predictors have been shown to be associated with lung cancer risk. Smoking has been known to be causally associated with the risk of lung cancer since the 1950s. 46 Additionally, alcohol intake has been shown to be related to elevated lung cancer risk. 47,48 Moreover, a reduced risk of lung cancer has been indicated in men and women with higher BMI. 49,50 Consistent with previous evidence from epidemiological studies, we observed positive associations of smoking and alcohol intake, and an inverse association of BMI, with lung cancer risk. As for the metabolic markers, we have previously reported an elevated lung cancer risk for participants with low LDL-C. 46 Furthermore, based on 20 population-based cohort studies in the United States, Asia, Australia, and Europe, Muller et al found that former and current smokers with higher hsCRP had an increased risk of lung cancer. 41
In addition to credible predictors, a risk prediction model should also meet performance standards for discrimination, defined as the ability to distinguish lung cancer cases from controls, and calibration, defined as the consistency between observed and predicted risk of lung cancer. Several lung cancer prediction models for the general population have been developed in different populations. 51  population, never smokers (eg, EPIC model), 20 or overall population (eg, LLPi model) 29 were included for developing risk models. To our knowledge, this study is the only study assessing CRP and lipids directly to develop a lung cancer risk prediction model. It is hard to directly compare the discriminatory performance of risk prediction models, as each was developed in a different population with varying baseline risks or lengths of follow-up. Nevertheless, the discriminative ability of the models was relatively similar, with C-statistics ranging from 0.72 to 0.86. Our model showed predictive performance comparable to that of previous studies.
A major limitation of our study is that we were not able to validate the risk prediction model externally to assess its general applicability. However, the results of the internal validation suggest that this model may perform well when applied to other populations. Another limitation is that, because of the limited number of identified cases of squamous cell carcinoma (SCC, n = 150), adenocarcinoma (AC, n = 143), and small cell lung carcinoma (SCLC, n = 71), we did not construct separate models for these histologic types. However, our model is intended for use in the screening setting, and previous studies indicate that many of the commonly cited risk factors for lung cancer are shared by different pathological types. Furthermore, the competing risks of death and/or development of other kinds of cancer were not accounted for in the present model, which may bias the predictive accuracy of the models. Additionally, because a logistic regression model was used in this study, the predicted risk over a specific time interval could not be calculated. Finally, information on lung function, asbestos exposure, history of pneumonia, and history of chronic obstructive pulmonary disease was not collected, so their roles in lung cancer risk prediction could not be evaluated in this study. Meanwhile, this study has its unique strengths. To the best of our knowledge, this is the first model that predicts lung cancer risk by assessing CRP and lipid levels in a population-based study. The present study also offers several advantages for developing a lung cancer prediction model. The large sample size enabled us to validate the prediction model in an independent subset of the population, and the detailed information from questionnaires and blood tests, all of which is readily available in general settings, is particularly important for stratifying the population for screening.
In conclusion, we developed and internally validated a risk prediction model for lung cancer that incorporates metabolic markers, based on data from Chinese residents. The model, which consists of predictors that are readily available or easily accessible in general clinical or primary care settings, showed satisfactory performance in terms of both discrimination and calibration. Therefore, this model could be used as a practical tool to identify high-risk individuals for tailored lung cancer screening.