Application of computer‐aided detection for NCCN‐based follow‐up recommendation in subsolid nodules: Effect on inter‐observer agreement

Abstract
Rationale and Objectives: Computer‐aided detection (CAD) of pulmonary nodules reduces the impact of observer variability, improving the reliability and reproducibility of nodule assessments in clinical practice. Therefore, this study aimed to assess the impact of CAD on inter‐observer agreement in the follow‐up management of subsolid nodules.
Materials and Methods: A dataset comprising 60 subsolid nodule cases was constructed based on the National Cancer Center lung cancer screening data. Five observers independently assessed all low‐dose computed tomography scans and assigned follow‐up management strategies to each case according to the National Comprehensive Cancer Network (NCCN) guidelines, using both manual measurements and CAD assistance. The linearly weighted Cohen’s kappa test was used to measure agreement between paired observers. Agreement among multiple observers was evaluated using the Fleiss kappa statistic.
Results: The agreement of the five observers for NCCN follow‐up management categorization was moderate when measured manually, with a Fleiss kappa score of 0.437. Utilizing CAD led to a notable enhancement in agreement, achieving substantial agreement with a Fleiss kappa value of 0.623. After using CAD, the proportions of major and substantial management discrepancies decreased from 27.5% to 15.8% and from 4.8% to 1.5%, respectively (p < 0.01). In the 23 lung cancer cases presenting as part‐solid nodules, CAD significantly increased the average detection sensitivity (overall sensitivity, 82.6% vs. 92.2%; p < 0.05).
Conclusion: The application of CAD significantly improves inter‐observer agreement in the follow‐up management strategy for subsolid nodules. It also demonstrates the potential to reduce substantial management discrepancies and increase detection sensitivity in lung cancer cases presenting as part‐solid nodules.


| INTRODUCTION
Low-dose computed tomography (LDCT) screening can detect nodules. Pulmonary nodules are clinically significant since they may be early manifestations of lung cancer. Therefore, the accurate assessment and timely follow-up of these nodules are crucial. However, interpreting LDCT lung cancer screening results and formulating accurate follow-up recommendations are labor-intensive for radiologists.[1-5] Subgroup analyses based on nodule classification have revealed different malignant tendencies of subsolid nodules at different time points. Many persistent subsolid nodules represent early-stage invasive adenocarcinoma.[6] Furthermore, persistent non-solid nodules gradually evolve into part-solid nodules, with reports indicating that approximately 9.0% of these progress to minimally invasive or invasive adenocarcinomas.[7,8]
However, in the Asian region, due to differences in epidemiological profiles, healthcare systems, and accessibility of technology, the standards for lung cancer screening and the characteristics of nodules may vary significantly compared with Western countries. These variations can affect the detection rates of nodules, the types of nodules predominantly found, and the subsequent management strategies adopted. The results of a multicenter study on baseline LDCT lung cancer screening conducted in Shanghai indicate that the proportions of nonsolid nodules, part-solid nodules, and solid nodules among lung cancer patients were 52.94%, 31.93%, and 15.13%, respectively.[9] A study on LDCT screening targeting different lung cancer risk groups in Taiwan revealed that the proportion of subsolid nodules among participants with a family history of lung cancer was significantly higher than that in participants without such history (17.7% vs. 5.2%).[10] Consequently, subsolid nodules in the Asian population warrant more focused attention. In contrast, overdiagnosis can lead to anxiety and unnecessary treatment. Therefore, reasonable and consistent follow-up recommendations are crucial for the effective management of nodules.
Pivotal factors in the management of subsolid nodules include the presence of any solid component and their dimensions. However, visual assessment of nodule types and manual diameter measurements are subject to significant inter-observer variability, posing challenges to the accurate characterization and measurement of nodules and leading to inconsistent interpretations and management decisions.[11,12] Previous studies have shown varying degrees of agreement among observers in distinguishing solid and subsolid nodules and in assigning Lung-RADS categories.[13] The reliance on subjective assessments emphasizes the need for standardized and objective approaches to evaluate nodules, such as adopting advanced imaging techniques or implementing computer-aided detection (CAD) systems. CAD systems not only enhance the sensitivity of nodule detection but also enable further nodule volume measurement, automatic segmentation, classification, and risk assessment.[14,15] Thus, incorporating CAD with human observers for double reading may help mitigate the impact of observer variability and improve the reliability and reproducibility of nodule assessments in clinical practice.[16,17] Therefore, this study aimed to evaluate the inter-observer agreement of follow-up recommendations using CAD for subsolid nodules.

| NCCN management classification and study groups
Nodules in the dataset used in this study were categorized according to the NCCN guidelines (Version 2023.01, Data S1).[2] The observers recorded the type, size, and location of each risk-dominant nodule. The NCCN guidelines inform nodule type identification (solid, part-solid, or nonsolid) and size measurement (the sum of the longest diameter and its perpendicular diameter in the axial view, divided by 2). Recommendations for subsequent follow-up were determined by the risk-dominant nodule. There were four recommendation categories: repeat scans at 0, 3, 6, and 12 months. In this context, a "0-month follow-up" refers to immediate subsequent evaluation within a short timeframe, including positron emission tomography and computed tomography (PET-CT) scans and biopsy.
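The size definition above can be written out as a short calculation. The following is a minimal sketch; the function name `mean_diameter_mm` is hypothetical, and the example values are the 12.0 × 8.7 mm nodule described in the Figure 3B caption.

```python
def mean_diameter_mm(longest_mm, perpendicular_mm):
    """Mean of the longest axial diameter and its perpendicular diameter (mm)."""
    if longest_mm <= 0 or perpendicular_mm <= 0:
        raise ValueError("diameters must be positive")
    return (longest_mm + perpendicular_mm) / 2


# Example: a nodule measuring 12.0 x 8.7 mm in the axial view
size = mean_diameter_mm(12.0, 8.7)
print(f"{size:.2f} mm")
```

The sketch covers only the size calculation; mapping a size and nodule type to a follow-up category still requires the thresholds in the NCCN guidelines themselves.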

| Data collection
Between 2014 and 2017, we obtained LDCT scans and essential patient details from individuals who underwent lung cancer screening at the National Cancer Center. The screening results for all patients were categorized in the system into positive (>6 mm), indeterminate (4-6 mm), and negative (<4 mm) nodules. In the first step, we analyzed radiological reports from individuals with positive nodules that included specific terms related to subsolid nodules, such as ground-glass, nonsolid, subsolid, and part-solid nodules. At this stage, our study incorporated subsolid nodules from reports recommending additional evaluation within 6 months or those recommending biopsy and PET-CT scan. After excluding descriptions of typical ground-glass opacities associated with inflammation, 68 patients with subsolid nodules were included in this step. In the second step, we randomly chose 200 LDCT reports of participants with semi-positive nodules, all of which recommended annual repeat screening. From these, 42 LDCT cases with subsolid nodules were included. In the third step, two experienced radiologists (with 20 and 15 years of chest expertise, respectively) independently assessed the initial 110 (68 + 42) cases, aligning each with a management strategy based on the NCCN guidelines to establish a benchmark for subsequent management. Discrepancies were resolved by a chief radiologist with 30 years of chest expertise. Ultimately, the numbers of cases assigned to the 0-, 3-, 6-, and 12-month follow-up categories were 12, 16, 40, and 42, respectively. In the fourth step, to maintain a balanced representation across all follow-up categories, we created a dataset covering every NCCN follow-up classification in a 1:1:2:2 ratio across intervals (0, 3, 6, and 12 months). As a result, the final dataset for inter-observer agreement assessment consisted of 60 scans from 60 participants, of whom 29 were pathologically confirmed to have lung cancer. Figure 1 presents a detailed flowchart of the dataset composition.
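The per-category case counts implied by the 1:1:2:2 ratio can be derived as follows. This is a minimal sketch of the arithmetic only; the helper name `category_quotas` is hypothetical, and how individual cases were chosen within each category is not described in the text, so no selection step is shown.

```python
def category_quotas(total_cases, ratio):
    """Split total_cases across categories in proportion to integer ratio weights."""
    parts = sum(ratio.values())
    if total_cases % parts != 0:
        raise ValueError("total does not divide evenly by the ratio")
    unit = total_cases // parts
    return {category: weight * unit for category, weight in ratio.items()}


# Benchmark counts per follow-up category (months) among the 110 candidate cases
available = {0: 12, 3: 16, 6: 40, 12: 42}
# Target 1:1:2:2 ratio for the 60-case reading dataset
quotas = category_quotas(60, {0: 1, 3: 1, 6: 2, 12: 2})

for months, needed in quotas.items():
    # Each quota must be satisfiable from the available benchmark cases
    assert needed <= available[months]
    print(f"{months:>2}-month category: {needed} of {available[months]} cases")
```

With 60 cases and six ratio parts, the quotas are 10, 10, 20, and 20 cases, each within the number of cases available in the corresponding benchmark category.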
The Ethics Committee of Cancer Hospital, Chinese Academy of Medical Sciences approved this retrospective study and waived the need for written informed consent.

| LDCT scan parameters
CT scans were conducted using 64-detector row scanners from various manufacturers, including General Electric Medical Systems (Discovery CT750 HD or Optima CT660) and Siemens Medical Systems (SOMATOM go.Top or Definition Edge). Scans were performed at full inspiration, following standardized parameters, including a tube voltage of 120 kVp, an automatic tube current ranging from 25 to 100 mA with a rotation time of 0.5 s, and a slice thickness of 5 mm. Reconstructed images were generated using a standard algorithm with thicknesses of 1.0 or 1.25 mm and an interval of 0.8 mm.

| CAD for automatic nodule classification
A deep learning-based CAD system (Medical Imaging Assisted Diagnosis Platform, v.3.5.2; Huiying Medical Technology, Beijing) was utilized for the automatic classification of nodules in LDCT scans. After nodule identification, segmentation-based assessment, and size evaluation, all detected nodules (with a threshold >3 mm) were prioritized based on their risk level. The CAD-generated list of nodules was displayed in a separate window. This list provides information about the nodules detected using CAD, including their location, nodule type, two-dimensional diameter, volume, volume of solid components in part-solid nodules, and nodule risk level (Figure 2).

| Observer and reading method
In this study, each observer performed two assessment rounds for all LDCT scans. The first round was performed manually and independently, whereas the second round was conducted with the assistance of CAD. The two rounds of assessment were spaced 2 weeks apart to minimize recall bias. Five observers participated in this study, including two chest radiologists with 5 and 20 years of experience in chest diagnostics. The remaining three observers were radiology residents. Nodules were classified as part-solid, nonsolid, or solid. All lung cancer cases were histologically confirmed. In the first round, observers recorded the type, size, and location of the risk-dominant nodule and provided follow-up recommendations for each case. Determination of the nodule type (solid, part-solid, or nonsolid) and size (calculated as the average of the longest and perpendicular diameters in the axial plane) relied on annotations provided in the NCCN guidelines. All measurements were conducted using images with slice thicknesses of 1 mm or 1.25 mm. According to the NCCN guidelines, observers were required to classify the follow-up recommendations into four categories (0, 3, 6, and 12 months) based on the level of risk associated with the nodules. Throughout the evaluation phase, all the observers were familiar with and referred to the printed version of the NCCN guidelines.
During the second round, the observers analyzed the CT images in a simultaneous-observation setting, considering the annotations produced by the CAD. After evaluating the entire LDCT scan, the observers could select either a nodule detected by CAD or any newly discovered nodule as the risk-dominant nodule. If the observers found the type or size of a CAD-detected nodule to be incorrect, for example when the demarcation between a nodule and surrounding blood vessels was unclear, they could modify the nodule type and measurement values. In such cases, the software automatically saved the modified results.

| Statistical analysis
Inter-rater agreement among multiple observers was evaluated using the Fleiss' kappa test. The linearly weighted Cohen's kappa test was used to measure agreement between paired observers. The interpretation of kappa values adhered to the Landis and Koch criteria.[18] Discordant follow-up management categories between the observers were analyzed based on differences in risk-dominant nodule selection, nodule type, and measurement. McNemar's test was used to compare the proportions of these differences between the two rounds in the overall reading pairs. For each observer, changes in the follow-up recommendation category between the two reading rounds were assessed.
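The two agreement statistics named above can be illustrated with a short pure-Python sketch. It uses the standard textbook formulas and is not the SPSS procedure used in the study; the function names and the toy ratings are hypothetical, and the Landis and Koch bands are the commonly quoted ones.

```python
from collections import Counter

CATEGORIES = [0, 3, 6, 12]  # ordered follow-up intervals in months


def linear_weighted_kappa(r1, r2, categories=CATEGORIES):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    k, n = len(categories), len(r1)
    idx = {c: i for i, c in enumerate(categories)}
    cells = Counter((idx[a], idx[b]) for a, b in zip(r1, r2))
    m1 = Counter(idx[a] for a in r1)  # marginal counts, observer 1
    m2 = Counter(idx[b] for b in r2)  # marginal counts, observer 2
    observed = sum(abs(i - j) / (k - 1) * c / n for (i, j), c in cells.items())
    expected = sum(abs(i - j) / (k - 1) * m1[i] * m2[j] / n**2
                   for i in range(k) for j in range(k))
    return 1 - observed / expected  # undefined if expected == 0


def fleiss_kappa(ratings, categories=CATEGORIES):
    """Fleiss' kappa; ratings is a list of cases, each a list of one rating per observer."""
    n_cases, n_raters = len(ratings), len(ratings[0])
    totals, p_bar = Counter(), 0.0
    for case in ratings:
        counts = Counter(case)
        totals.update(counts)
        # Proportion of agreeing observer pairs for this case
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
    p_bar /= n_cases
    p_e = sum((totals[c] / (n_cases * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Verbal label for a kappa value using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy example: two observers disagree by one category on the first of four cases
pair_kappa = linear_weighted_kappa([0, 3, 6, 12], [3, 3, 6, 12])
print(round(pair_kappa, 3), landis_koch(pair_kappa))
```

Linear weights penalize a disagreement in proportion to the number of categories separating the two recommendations, so a 12-month versus 0-month split counts three times as much as a split between adjacent categories. Confidence intervals, which the study reports alongside each kappa, are not computed here.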
To evaluate the effect of inconsistent classifications on patient management, inconsistent cases were classified into two categories based on variations in follow-up recommendations. Minor inconsistency referred to a discrepancy shorter than 6 months (such as a 6-month follow-up vs. a 3-month follow-up), whereas major inconsistency referred to a discrepancy of 6 months or more (such as a 12-month follow-up vs. a 6-month follow-up, or a 6-month follow-up vs. a 0-month follow-up). Furthermore, a substantial discrepancy was defined as a variation of at least 9 months in follow-up time, specifically arising from 12-month versus 0- or 3-month follow-up recommendations.
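These definitions reduce to simple thresholds on the gap between two recommendations. The sketch below is illustrative; the function name is hypothetical, and it assumes that a substantial discrepancy is also counted as a major one, which the text does not state explicitly.

```python
def classify_discrepancy(months_a, months_b):
    """Flag the discrepancy level between two follow-up recommendations (in months)."""
    gap = abs(months_a - months_b)
    return {
        "gap_months": gap,
        "minor": 0 < gap < 6,     # e.g. 6 vs. 3 months
        "major": gap >= 6,        # e.g. 12 vs. 6 months, 6 vs. 0 months
        "substantial": gap >= 9,  # 12 vs. 0 or 3 months (assumed subset of major)
    }


for a, b in [(6, 3), (12, 6), (6, 0), (12, 3), (12, 0)]:
    print(a, b, classify_discrepancy(a, b))
```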
The sensitivity of manual and CAD-assisted reading in the classification of lung cancer was determined using a 6-month follow-up recommendation as the positive threshold. The sensitivities of the combined observers across the two rounds of reading were compared using generalized estimating equations. p < 0.05 was deemed statistically significant, and the Statistical Package for the Social Sciences (SPSS, version 27) was used for statistical calculations.
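The sensitivity calculation can be sketched as follows, assuming that a recommendation of 6 months or sooner counts as a positive call for a confirmed cancer; that reading of the threshold is an interpretation, not something the text spells out. The function name and the toy data are hypothetical, and the generalized estimating equations used for the pooled comparison are not reproduced.

```python
def sensitivity(recommendations_months, threshold_months=6):
    """Share of confirmed cancer cases whose recommended follow-up is <= threshold."""
    if not recommendations_months:
        raise ValueError("no cases supplied")
    positives = sum(1 for months in recommendations_months if months <= threshold_months)
    return positives / len(recommendations_months)


# Toy example: one observer's recommendations for four confirmed cancer cases
toy_sensitivity = sensitivity([0, 3, 6, 12])
print(f"{toy_sensitivity:.1%}")
```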

| Demographics results
The median age of the participants was 54 years, with 30 men (50.0%) and 30 women (50.0%). Of them, 38.3% were either current or former smokers. Most participants had a history of passive smoking (75.0%); 21.7% had a family history of lung cancer and 36.7% of other malignant tumors. Of the 60 patients, 29 were diagnosed with lung cancer, and all cases of lung cancer were pathologically confirmed as adenocarcinoma. Furthermore, all diagnosed cases were categorized as Stage I lung cancer. Table 1 presents the detailed participant demographic information.

| Observers with CAD
In the second round, the agreement among the five observers increased, reaching substantial agreement with a Fleiss kappa value of 0.623 (0.573-0.673) (Table 2). Among all 10 pairs of readings by the five observers, the agreement increased in 9 pairs compared to the previous round. The average agreement between the pairs was higher in the second round than in the first, with a Cohen's kappa value of 0.733 (0.673-0.793), ranging between 0.646 (0.498-0.794) and 0.814 (0.697-0.930) (Table 3).
For cases in which observers selected different risk-dominant nodules, the agreement in recommendation improved from 0.367 (0.288-0.446) to 0.561 (0.474-0.648) in the second round. For identical risk-dominant nodules, the agreement in recommendation increased from 0.484 (0.420-0.549) to 0.652 (0.589-0.715). In cases where there were disagreements in nodule type for identical risk-dominant nodules, the agreement increased from 0.159 (0.052-0.265) to 0.419 (0.300-0.537) with the assistance of CAD.
Most readings with discrepancies were related to the same risk-dominant nodules, accounting for 72.0% (113/157). In 26.1% (41/157) of the discrepant readings, the follow-up recommendations differed because of different nodule types. Additionally, in 45.9% (72/157), the follow-up recommendations differed because of differences in nodule size. Major and substantial management discrepancies were observed in 9.2% (55/600) and 0.3% (2/600) of the cases, respectively.

| CAD application
We analyzed the changes in follow-up recommendations from the five observers across two reading rounds. Among the five observers, 58.3% (35/60) to 81.7% (49/60) of cases remained unchanged across the two reading rounds. In the second reading session, the observers upgraded the follow-up management strategy in an average of 13.3% of cases and downgraded it in an average of 15.0% of cases. Most alterations were observed between adjacent categories relative to the first round (Table 4 and Figure 4).

Among all 290 possible pairs of lung cancer case readings between the five observers (calculated as in the previous approach, 10 × 29 = 290), the use of CAD reduced substantial management discrepancies (0- or 3-month versus 12-month follow-up recommendations) by approximately two-thirds, from 11 to 4 cases.
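The pair counts and the relative reduction quoted above follow from elementary arithmetic, sketched here with illustrative variable names.

```python
from math import comb

observer_pairs = comb(5, 2)               # unordered pairs among five observers
all_case_pairs = observer_pairs * 60      # paired readings over all 60 cases
cancer_case_pairs = observer_pairs * 29   # paired readings over the 29 cancer cases

# Relative reduction in substantial discrepancies among cancer cases (11 -> 4)
relative_reduction = (11 - 4) / 11

print(observer_pairs, all_case_pairs, cancer_case_pairs, f"{relative_reduction:.1%}")
```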
Regarding patient management, the observers determined a mean follow-up period of 7.4 months in the first round and 7.3 months in the second round. Although the observers provided a slightly shorter average follow-up interval for the second reading, the change was minimal, with an average reduction of 0.1 months.

| DISCUSSION
CAD application in CT lung cancer screening is typically believed to reduce missed diagnoses and improve work efficiency. However, there has been limited research on the impact of CAD on inter-observer agreement in terms of lung nodule management strategies. In our study, after CAD implementation, we noted a marked enhancement in agreement among observers, as evidenced by the increase in the Fleiss kappa value from 0.437 (0.388-0.487) to 0.623 (0.573-0.673).
Accurate and consistent management of nodules is imperative for effective screening of lung cancer. Among the 600 paired readings, the average Cohen's kappa value for the manual measurements was 0.609 (0.518-0.701), ranging between 0.484 (0.325-0.644) and 0.777 (0.665-0.889). Agreement improved after using CAD, with an average Cohen's kappa value of 0.733 (0.673-0.793), ranging between 0.646 (0.498-0.794) and 0.814 (0.697-0.930). The proportion of paired readings with inconsistent follow-up management decreased by 12.8 percentage points (39.0% vs. 26.2%) after CAD application. Major and substantial management discrepancies decreased by 11.7 percentage points (27.5% vs. 15.8%) and 3.3 percentage points (4.8% vs. 1.5%), respectively. With the introduction of CAD, we observed improved inter-observer agreement in cases where observers selected different risk-dominant nodules, with the Fleiss kappa increasing from 0.367 (0.288-0.446) to 0.561 (0.474-0.648). However, the proportion of inconsistencies resulting from the choice of alternative risk-dominant nodules increased slightly (26.5% vs. 28.0%). A possible explanation is that CAD may present more nodules of similar size within the same category, potentially increasing discrepancies in the selection of risk-dominant nodules and leading to divergences in management decisions. To address this limitation, CAD should facilitate a more detailed and in-depth characterization of lung nodules to provide radiologists with nodule information of higher clinical relevance. The deep learning-based CAD system developed by Trajanovski et al. integrates features of lung nodules and their surrounding background, enhancing the accuracy of nodule classification and outperforming the PanCan model in large datasets.[19] Additionally, discrepancies in nodule definition and perception among different observers, as well as missed diagnoses by CAD, can potentially lead to discrepancies in management.
One advantage of CAD in enhancing the agreement of follow-up management strategies is the automated identification of nodule type. In our study, compared with manual measurements, CAD significantly reduced the proportion of inconsistencies caused by discrepancies in nodule type classification (from 13.8% [83/600] to 6.8% [41/600]; p < 0.001). Although CAD also offers automated measurements, its impact on management discrepancies arising from variations in nodule size measurements was limited (from 14.8% [89/600] to 12.0% [72/600]; p > 0.05). A possible reason for this is that the classifications in the NCCN guidelines have narrow thresholds between adjacent categories. For instance, a subsolid nodule with a solid component measuring 5.3 mm is recommended for a 12-month follow-up, while one measuring 6.3 mm falls into the 6-month follow-up category, with a mere difference of 1 mm between them. Thus, the clinical implications of such inaccuracies become most evident when the measurements approach these decision thresholds, thereby affecting the agreement of management strategies.
Our study revealed that CAD assistance significantly improved agreement among resident physicians. All three sets of paired measurements across the resident physicians showed varying degrees of enhancement (from 0.417, 0.541, and 0.639 to 0.736, 0.814, and 0.782, respectively). However, the improvement in agreement between the chest radiologists was minimal (increasing from 0.655 to 0.656). Our findings emphasize that the effect of CAD assistance on follow-up management assessment varies with diagnostic experience, indicating that less experienced physicians may benefit more from CAD to ensure agreement in lung cancer screening results.
Another advantage of CAD is its capacity to reduce substantial management discrepancies among lung cancer cases. With the use of CAD, substantial management disagreements among the 29 lung cancer cases decreased by 63.6% (from 11 to 4 cases). Additionally, in lung cancer cases that presented as part-solid nodules, the average sensitivity with CAD assistance was notably higher than that with manual measurement (82.6% vs. 92.2%; p < 0.05). Thus, CAD offers a more consistent and accurate approach in the management and detection of lung cancer. Specifically, it helps in minimizing the discrepancies in lung cancer case management, especially in cases presenting with part-solid nodules.
Although CAD systems have demonstrated potential in enhancing inter-observer consistency and increasing the sensitivity of detecting lung cancer cases presenting as part-solid nodules, overdiagnosis remains a serious concern that must be addressed with diligence. Liu et al. suggest that employing varying growth thresholds to assess the growth patterns of subsolid nodules may more effectively discriminate between high-risk subsolid nodules and indolent ones. They propose developing personalized predictive models for different growth trends of subsolid nodules to optimize management strategies and maintain a balance between the benefits and drawbacks of lung cancer screening programs.[20] More effective lung cancer risk stratification should be realized to enhance screening outcomes in non-smoking populations, thereby reducing overdiagnosis among those at lower risk.[21] Meanwhile, although the high sensitivity of CAD systems has enhanced the detection of nodules, it may also lead to the overmanagement of minor ground-glass nodules, thus raising concerns about potential overtreatment. Such concerns are understandable. However, in our study, the management strategies for nodules were based on risk-dominant nodules, and these additional minor ground-glass nodules detected by CAD were unlikely to affect the patients' follow-up strategies significantly.
Nonetheless, our study had several limitations. First, unlike randomly selected CT datasets, the kappa values in this study were based on a specific dataset that evenly included all nodule categories. However, in a real-world setting, nodules detected during screenings are often benign, predominantly requiring 6-12 months of follow-up. Hence, evaluations should be based on a specific context rather than sweeping generalizations. Second, the sample size was relatively limited, and an expansion of the sample is necessary to enhance the generalizability of our findings. Moreover, the NCCN guidelines recommend a 1 mm scan slice thickness for LDCT. However, due to equipment variations in our study, we used slice thicknesses of 1 mm and 1.25 mm. Whether this affects the study outcomes warrants further investigation.

| CONCLUSIONS
The utilization of CAD bolsters agreement in management strategies for subsolid nodules among observers, diminishes substantive management disagreements, and concurrently elevates the average sensitivity in detecting lung cancer cases presenting as part-solid nodules.

FIGURE 3 Management discrepancies stemming from the same risk-dominant nodules and from different risk-dominant nodules. (A) Discrepancies in follow-up management arising from divergent interpretations by readers concerning the nodule type and measurements. In the first reading round, all five observers identified a subsolid nodule in the left upper lobe as the risk-dominant nodule. Of these, three classified it as a part-solid nodule and, due to measurement variations, recommended follow-ups at either 3 or 6 months. Meanwhile, two observers categorized it as a ground-glass nodule, suggesting a 12-month follow-up. During the second reading, with the assistance of CAD, all five observers unanimously defined the nodule as part-solid and recommended follow-ups at either 3 or 6 months. (B) Discrepancy in risk-dominant nodules due to one reader's failure to detect the nodule. The axial CT image shows a subsolid nodule measuring 12.0 × 8.7 mm (Left), located adjacent to the mediastinal pleura of the left lower lobe. In the first reading round, four observers categorized it as a part-solid nodule: three observers recommended a 6-month follow-up, while one suggested a 3-month follow-up. Another reader, however, missed this nodule and instead chose a solid nodule in the left upper lobe as the risk-dominant nodule, recommending a 3-month follow-up (Right). In the second round of reading, with the assistance of CAD, the nodule was correctly identified, with all five observers suggesting a 6-month follow-up. (C) Discrepancies in determining the risk-dominant nodule that might be attributed to variations in the nodule's definition among the observers. The axial and coronal CT scans depict a fibrotic consolidation of approximately 10.7 mm in the left lower lobe (Left, black arrow). In the first reading round, Observers 3, 4, and 5 identified this lesion as the risk-dominant nodule and recommended a 3-month follow-up. Conversely, Observers 1 and 2 pinpointed another nodule as the risk-dominant nodule and suggested a 6-month follow-up (Right, white arrow). In the subsequent reading round, CAD flagged the latter nodule (Right, white arrow) as the risk-dominant nodule. However, despite the CAD's indication, Observers 4 and 5 maintained their selection consistent with the first round, whereas Observer 3 heeded the CAD's recommendation. Observers 1 and 2 stayed true to their initial selections at either 3 or 6 months.

TABLE 1 Basic demographic information of the 60 participants.
a Data are median values, with the range in parentheses.

TABLE 2 Multirater inter-reader agreement in NCCN-based follow-up management categorization measured by Fleiss kappa.
Identical risk-dominant nodules, n = 39: 0.652 (0.589-0.715); nodule type disagreed, n = 10: 0.419 (0.300-0.537); nodule type agreed, n = 29: 0.733 (0.658-0.809).
Note: 95% confidence intervals are shown in parentheses.

TABLE 3 Pairwise inter-observer agreement in NCCN-based follow-up management categorization measured by weighted Cohen's kappa.

TABLE 4 Changes to NCCN-based follow-up management categorization when CAD results were added.
Note: Data are numbers of cases.

FIGURE 4 Distribution of follow-up management classifications of all the cases (n = 60) among five observers with manual measurement and with the CAD system.

TABLE 5 NCCN-based follow-up management categorization changes in 29 cancer cases.

Comparison of observer sensitivity between two rounds in 29 cancer cases. McNemar's test was used for the comparison within each reader, and a logit model with generalized estimating equations was used for the pooled data.