Using a risk model for probability of cancer in pulmonary nodules

Abstract
Background: Considering the high morbidity and mortality of lung cancer and the high incidence of pulmonary nodules, clearly distinguishing benign from malignant lung nodules at an early stage is of great significance. However, determining which kinds of lung nodule are more likely to be lung cancer remains a problem worldwide.
Methods: Data on 480 patients with pulmonary nodules were collected in Shandong, China. We assessed the clinical characteristics and computed tomography (CT) imaging features of pulmonary nodules in patients who had undergone video-assisted thoracoscopic surgery (VATS) lobectomy from 2013 to 2018. Preliminary selection of features was based on a statistical analysis using SPSS. We used WEKA to assess machine learning models built with several of its algorithms and selected the best decision tree model using its optimization algorithm.
Results: The combination of decision tree and logistic regression simplified the decision tree without affecting its AUC. The decision tree structure showed that lobulation was the most important feature, followed by spiculation, vessel convergence sign, nodule type, satellite nodule, nodule size and age of patient.
Conclusions: Our study shows that decision tree analyses can be applied to screen individuals for early lung cancer with CT. Our decision tree provides a new way to help clinicians establish a logical diagnosis by stepwise progression, but it still needs to be validated in prospective trials in a larger patient population.


INTRODUCTION
Lung cancer has the highest morbidity and mortality of all cancers in both sexes combined worldwide, and a large proportion of patients are diagnosed at an advanced stage of disease. 1 Previous studies 2-4 have shown that computed tomography (CT) is recommended by US guidelines for high-risk individuals to reduce lung cancer mortality, because more early-stage lung cancers can be diagnosed with the assistance of CT and more invasive procedures can be implemented. Owing to the widespread availability of CT screening, more and more lung nodules are being detected in a timely manner. A recognized risk of lung cancer screening programs is therefore surgical resection, performed with curative intent, in patients who turn out not to have lung cancer. 5 As medical technology advances, patients undergoing video-assisted thoracoscopic surgery (VATS) lobectomy have been reported to have a lower probability of readmission, pneumonia and postoperative blood transfusion than those undergoing open lobectomy; 6 however, surgery in patients with benign nodules not only wastes medical resources but can also cause serious physical and psychological harm. It is therefore crucial to improve the diagnostic accuracy of lung cancer and reduce unnecessary surgery. Although many studies 7-10 have used different methods to analyze differences in CT imaging characteristics between benign nodules and lung cancer, how best to distinguish benign from malignant nodules remains controversial.
Big data mining technology has opened up a new era in which patterns and characteristics can be extracted from large volumes of basic data. 11 In lung cancer, mixed models combining multiple factors have been shown to provide excellent prognostic benefits. 12,13 Many studies have tried to establish models for the automated identification of benign and malignant nodules, and have shown that machine learning plays an important role in disease diagnosis. 14,15 In this study, we performed a retrospective analysis of patients with lung nodules undergoing VATS lobectomy, with the aims of (i) comparing the clinical features and image characteristics of benign and malignant pulmonary nodules, (ii) comparing several common machine learning models from multiple aspects and (iii) providing a new method for clinicians to distinguish benign from malignant pulmonary nodules.

METHODS
Patient selection
A retrospective analysis was conducted of 480 patients (Figure 1) with lung nodules who had undergone VATS lobectomy from January 2013 to November 2018 at Shandong Provincial Hospital, China. We obtained definitive pathological results following surgery, which allowed us to proceed to further studies. First, based on the imaging reports, we excluded nodules greater than 3 cm in the longest dimension (16 cases). We also excluded nodules with unclear boundaries that could not be studied further (three cases). In addition, nodules confirmed by pathology as atypical hyperplasia (three cases) were excluded. Thus, 458 cases (102 benign nodules and 356 malignant nodules) remained in the study. Before the patients underwent VATS lobectomy, we performed auxiliary examinations such as craniocerebral magnetic resonance imaging (MRI), abdominal ultrasound and positron emission tomography/computed tomography (PET/CT). If tumors outside the lung were found, surgery was not performed; the samples in this study therefore did not include patients with metastatic lung cancer.

Patient characteristics
The clinical characteristics of all patients were derived from the electronic medical record system of the hospital, including gender, age, profession, smoking history, drinking history and family history of cancer. We evaluated all chest CTs for each patient within our picture archiving and communication system (PACS). Two radiologists, both of whom had received special training in chest radiology, described the characteristics of the nodules without knowing the pathological results. Conclusions were reached by consensus. Our CT cases were obtained using Somatom Definition Flash CT, Somatom Definition Edge CT and Somatom Force CT scanners (Siemens). At our institution, chest CT reconstruction protocols include 1-, 1.25- and 5-mm axial slices. A previous study 16 has stated that various viewing techniques have similar detection rates when experienced observers focus on nodule detection. Sagittal and coronal reconstructions are routinely obtained.
Documented characteristics of the nodules included their maximum diameter, location in the lung, signs of lobulation, spiculation, satellite nodule, vessel convergence sign, pleural indentation, presence of a distinct boundary, and type determined by density (solid, ground-glass or part-solid). A solitary pulmonary nodule (SPN) is defined as a round opacity that is at least moderately well circumscribed and no larger than 3 cm in diameter. 17 SPNs include solid and subsolid nodules, and subsolid nodules include ground-glass nodules (GGNs) and part-solid nodules. 18 GGNs are nuanced nodular opacities that do not obscure underlying bronchovascular structures of the lung. 19

Statistical analysis
Categorical clinical characteristics included gender (male or female), age (0-45, 45-60, 60+), profession (others, workmen, labourers, office clerk), smoking (no/yes), drinking (no/yes) and family history of cancer (no/yes). They were compared between benign and malignant nodules using logistic regression analyses. In addition, categorical imaging variables were added: nodule size (<0.6 cm, 0.6-1.0 cm, 1.0-2.0 cm, 2.0-3.0 cm), nodule location, lobulation (no/yes), spiculation (no/yes), boundary (no/yes), satellite nodule (no/yes), vessel convergence sign (no/yes), pleural indentation (no/yes) and nodule type (solid, GGN or part-solid). Univariable analysis and a multivariable logistic model were applied to explore the risk factors for lung cancer among patients with pulmonary nodules diagnosed by CT. Variables that were statistically significant in the univariable analysis were included in the multivariable analysis, and the Holm-Bonferroni correction was subsequently applied to factors with p-values <0.05. All logistic models were fitted using SPSS software (version 20.0). The diagnostic performance of the predictive model was assessed by receiver operating characteristic (ROC) curve analysis. In addition, we compared multiple indicators of four common machine learning models (naive Bayes, support vector machine, decision tree and random forest). All models were developed within WEKA (Waikato Environment for Knowledge Analysis) 3.8.3 (The University of Waikato, Hamilton, NZ). The support vector machine (SVM) used the sequential minimal optimization (SMO) algorithm, and the decision tree used a cost-sensitive version of J48, an implementation of the C4.5 algorithm. The areas under the curve (AUCs) of the models were compared using one-way ANOVA and Tamhane's T2 post hoc test at a significance level of α = 0.001.
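As an illustration of the step-down procedure named above, the following is a minimal sketch of the Holm-Bonferroni correction in Python. It is not the SPSS procedure used in this study; the function name `holm_bonferroni` and the example p-values are hypothetical and serve only to show the mechanics.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if it is rejected.

    Illustrative sketch only; the name and interface are assumptions,
    not part of the study's software.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1), i.e. alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Hypothetical p-values for four candidate predictors.
example_p = [0.001, 0.04, 0.03, 0.005]
print(holm_bonferroni(example_p))
```

Because the threshold loosens from alpha/m to alpha as hypotheses are rejected, the procedure is never less powerful than a plain Bonferroni correction while still controlling the family-wise error rate.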

RESULTS
Nodule characteristics
The characteristics of the nodules according to lung cancer status are shown in Table 1. In the univariate analysis, significant predictors of lung cancer included age, nodule size, lobulation, spiculation, satellite nodule, vessel convergence sign, pleural indentation and nodule type (p < 0.05). We took a maximum diameter of less than 0.6 cm as the reference, and found that an increase in nodule size was associated with lung cancer to some extent (0. 6 were at high risk of malignancy (p < 0.001). However, nodules with satellite nodules had a lower risk of lung cancer (OR: 0.06; 95% CI: 0.01-0.47; p = 0.008). It is worth noting that nodules in this cohort were rarely accompanied by satellite nodules (6/458), and malignant nodules even more rarely (1/458). Previous studies have concluded that rheumatoid pulmonary nodules are more likely to have satellite nodules, 20 which we think may help to explain this finding.
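The odds ratio for satellite nodules can be approximated from the counts quoted above. The sketch below assumes a 2 × 2 layout reconstructed from the text (6 of 458 nodules with satellite nodules, 1 of them malignant, against 356 malignant and 102 benign nodules overall) and uses the standard Wald interval on the log odds ratio. The function name `odds_ratio_ci` is hypothetical, and the reported values came from logistic regression in SPSS, so small differences are to be expected.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: feature present, malignant    b: feature present, benign
    c: feature absent,  malignant    d: feature absent,  benign
    Illustrative helper; not the study's SPSS computation.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2 x 2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Assumed table: 6 nodules with satellite nodules (1 malignant, 5 benign);
# the remaining 355 malignant and 97 benign nodules had none.
or_, lo, hi = odds_ratio_ci(a=1, b=5, c=355, d=97)
print(f"OR = {or_:.3f}, 95% CI {lo:.3f}-{hi:.3f}")
```

Under these assumptions the crude estimate is about 0.055 with an upper bound near 0.47, in line with the reported OR of 0.06 (95% CI: 0.01-0.47). The very wide interval reflects the single malignant nodule with a satellite nodule.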

Predictors of malignancy
We removed the variables that were not significant in the univariate model and obtained the multivariate model shown in Table 1, in which largest nodule diameter, lobulation, spiculation, satellite nodule, vessel convergence sign, pleural indentation, GGN type and part-solid type were all significant predictors of a nodule being malignant. Logistic regression analysis is essential for showing how multiple variables act on each other and for quantifying the effect size of each characteristic; however, it is impractical to apply directly during clinical diagnosis.
Machine learning models were constructed to distinguish lung cancer from benign lung diseases. First, all clinical characteristics were used as input features to develop the naivebayes-1, SMO-1, J48-1 and randomforest-1 models. Then, all clinical and imaging characteristics were used as input variables to develop the naivebayes-2, SMO-2, J48-2 and randomforest-2 models. Finally, the clinical and imaging features retained by the logistic regression screening model were used to develop the naivebayes-3, SMO-3, J48-3 and randomforest-3 models. Model performance was evaluated by sensitivity, specificity, precision, F-measure and AUC. For each model, we ran 3- to 10-fold cross-validation, and Table 2 shows the average, maximum and minimum values. In three-fold cross-validation, for example, the dataset was randomized and split into three subsets with similar class balances; in each fold, two subsets were used to train a model and the remaining subset was used to validate it. The SMO-1 model classified all nodules as malignant. By AUC comparison, naivebayes-2 performed better than the other models (p < 0.001), except randomforest-2 (p = 0.996), naivebayes-3 (p = 0.999) and randomforest-3 (p = 0.002). However, the sensitivity (83.4% vs. 83.2%) and specificity (46.2% vs. 37.9%) of J48-2 were slightly higher than those of naivebayes-2. Although the decision tree is not the best classifier, we were attracted by its intuitiveness and visibility and therefore analyzed it in more detail. Each decision tree model achieved its largest AUC in six-fold cross-validation. J48-2 was better than J48-1 (p < 0.001), and there was no difference between J48-2 and J48-3 by AUC comparison (p > 0.05). The J48-2 model was too complicated to be applicable in the clinic, so we ultimately chose J48-3 as a reference for clinicians.
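The fold construction and the threshold-based metrics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WEKA's implementation: the helper names `stratified_folds` and `binary_metrics` are hypothetical, labels are coded 1 for malignant and 0 for benign, and the only data used are the class counts reported in this study (356 malignant, 102 benign).

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)  # randomize within each class
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin
    return folds


def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, precision and F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return {"sensitivity": sens, "specificity": spec,
            "precision": prec, "f_measure": f1}


# Class counts from this study: 356 malignant (1) and 102 benign (0).
labels = [1] * 356 + [0] * 102
folds = stratified_folds(labels, k=3)
print([len(f) for f in folds])

# A classifier that labels every nodule malignant, as SMO-1 did.
print(binary_metrics(labels, [1] * len(labels)))
```

The second call shows why sensitivity alone is misleading on an imbalanced cohort: a model that calls every nodule malignant reaches a sensitivity of 1.0 with a specificity of 0.0, which is why specificity and AUC were reported alongside it.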
A streamlined version of this evidence-based decision tree is shown in Figure 2. Per nodule, this classifier is 84.5% accurate overall. In the decision tree, lobulation was assigned to the first and most informative node, followed by spiculation, vessel convergence sign, nodule type, satellite nodule, nodule size or patient age. The decision tree can be converted into a set of if-then rules by tracing the path from the root node to each terminal node. The if-then rules created by the model are presented in Table 3.
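The conversion from tree to rules is a depth-first walk that records the test taken on each branch. The sketch below uses a deliberately small, hypothetical tree (`toy_tree`) that is not the tree of Figure 2 and does not reproduce the rules of Table 3; the nested-dictionary layout and the function name `tree_to_rules` are likewise assumptions made for illustration.

```python
def tree_to_rules(node, conditions=()):
    """Return one if-then rule per terminal node of a nested-dict tree."""
    if "label" in node:  # terminal node: emit the accumulated path
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['feature']} = {value}"
        rules += tree_to_rules(child, conditions + (test,))
    return rules


# Hypothetical toy tree for illustration only (not Figure 2).
toy_tree = {
    "feature": "lobulation",
    "branches": {
        "no": {
            "feature": "spiculation",
            "branches": {
                "no": {"label": "benign"},
                "yes": {"label": "malignant"},
            },
        },
        "yes": {"label": "malignant"},
    },
}

for rule in tree_to_rules(toy_tree):
    print(rule)
```

Each rule corresponds to exactly one root-to-leaf path, so the number of rules equals the number of terminal nodes and the rules are mutually exclusive.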

DISCUSSION
Our study analyzed the differences in clinical and CT imaging characteristics between benign and malignant nodules, determined the increased odds ratio (OR) of lung cancer among patients with pulmonary nodules and provided a more feasible method for judging the nature of nodules in clinical work.
A previous study 21 has shown that combining multiple methods to build a model can optimize it. To our knowledge, this is the first study to examine and combine logistic regression with machine learning models (naive Bayes, decision tree, support vector machine and random forest) to predict lung cancer in a large Chinese population. In our comparison, the decision tree model showed certain advantages. We built a variety of decision tree models in order to identify the optimal one. The decision tree is a valuable classification algorithm in data mining. 22,23 In a decision tree, the first variable (root) is the most important factor, and variables farther from the root are progressively less important in classifying the data. 24 This study shows that lobulation is the most significant attribute discriminating between benign and malignant nodules. A lobulated border was defined when a portion of the surface of a lesion showed a wavy or scalloped configuration, apart from regions abutting the pleura. 25 This result confirms the finding of previous studies 25-27 that lobulation is a predictor of malignant nodules. The size and morphology of a pulmonary nodule are the two primary determinants of cancer risk. 7 Morphology refers specifically to the margins (smooth, lobulated or spiculated) and attenuation (solid, partly solid or purely ground-glass) of the nodule. 28 A fine spiculated margin is defined as very fine linear strands extending radially 1-2 mm beyond a lesion. 25 The decision tree shows that spiculation, vessel convergence sign, nodule type, satellite nodule, nodule size and age of patient are the next most important factors after lobulation. The tree identified a subgroup (22 nodules [88%]) of solid nodules without lobulation or spiculation that were benign. Another subgroup (197 nodules [91%]) with lobulation and vessel convergence sign but without satellite nodules was identified as malignant.
Notably, nodule sizes of 1.1 and 1.4 cm were also dividing points in the decision tree. These values may differ slightly across samples and patient populations, and the tree provides only an outline of how to estimate malignancy risk. Nevertheless, the finding agrees with previous studies 27,28 showing that nodules of greater diameter are more likely to be malignant. A previous study indicates that the lifetime risk of receiving a diagnosis of cancer by age 30 years is approximately 1% and is 2% by age 40 years. 29 We also found that the mean age at lung cancer diagnosis was 58.91 ± 9.69 years, and that the risk of lung cancer increases with age, which is consistent with many previous studies. 30,31 The decision tree identifies 55 years as a cutoff, and we therefore advocate routine chest CT scans for older individuals.
A major strength of this study is that we used a real medical dataset of patients with lung nodules who underwent VATS lobectomy at Shandong Provincial Hospital. Pathological results were obtained for all patients, and the results are therefore more reliable. By AUC, the naive Bayes model has clear advantages, but the simplicity and visual nature of the decision tree make it practical for use by clinicians. Feature selection by logistic regression makes the decision tree easier to use. In future, our model needs to be validated in prospective trials in a larger patient population.
Our study has limitations. First, the sample consisted of patients undergoing VATS lobectomy, so our findings are not applicable to patients with advanced metastatic lung cancer. Second, data were collected from only one large hospital in China. Further studies with additional data from this hospital and other centers need to be performed. Third, the clinical characteristics of patients were taken from medical records, which may introduce information bias.
In conclusion, in comparison with previous studies, our study had a much larger sample (458 patients), and each case was drawn from PACS and confirmed by pathological results, which allowed us to generate a more accurate and robust model. We combined the decision tree with logistic regression, simplifying the model as much as possible without reducing its goodness of fit, and thus making it possible to use clinically, especially for young doctors who do not yet have extensive diagnostic experience. Although our decision tree is not specific enough, it provides a new concept for our future clinical work and research, and we hope it will enable better use of CT in the early screening of lung cancer.

ACKNOWLEDGMENTS
All work included in the manuscript was performed at Shandong Provincial Hospital. The research was approved by the internal review board of the institution.