Opportunistic CT Screening—Machine Learning Algorithm Identifies Majority of Vertebral Compression Fractures: A Cohort Study

ABSTRACT Vertebral compression fractures (VCF) are common in patients older than 50 years but are often undiagnosed. Zebra Medical Vision developed a VCF detection algorithm, with machine learning, to detect VCFs from CT images of the chest and/or abdomen/pelvis. In this study, we evaluated the diagnostic performance of the algorithm in identifying VCF. We conducted a blinded validation study to estimate the operating characteristics of the algorithm in identifying VCFs using previously completed CT scans from 1200 women and men aged 50 years and older at a tertiary‐care center. Each scan was independently evaluated by two of three neuroradiologists to identify and grade VCF. Disagreements were resolved by a senior neuroradiologist. The algorithm evaluated the CT scans in a separate workstream. The VCF algorithm was not able to evaluate CT scans for 113 participants. Of the remaining 1087 study participants, 588 (54%) were women. Median age was 73 years (range 51–102 years; interquartile range 66–81). For the 1087 algorithm‐evaluated participants, the sensitivity and specificity of the VCF algorithm in diagnosing any VCF were 0.66 (95% confidence interval [CI] 0.59–0.72) and 0.90 (95% CI 0.88–0.92), respectively, and for diagnosing moderate/severe VCF were 0.78 (95% CI 0.70–0.85) and 0.87 (95% CI 0.85–0.89), respectively. Implementing this VCF algorithm within radiology systems may help to identify patients at increased fracture risk and could support the diagnosis of osteoporosis and facilitate appropriate therapy. © 2023 Amgen, Inc. JBMR Plus published by Wiley Periodicals LLC on behalf of American Society for Bone and Mineral Research.


Introduction
Osteoporosis is characterized by decreased bone mass and deterioration in bone microarchitecture (1,2) and is usually identified by decrements in the standard deviation scores of bone mineral density (BMD). (3) Osteoporosis is associated with an increased risk of fragility fractures, including hip and vertebral fractures, but most fragility fractures occur in individuals with BMD values above the threshold used to define the disease. (2,4,5) Fragility fractures are associated with significant morbidity and mortality, with hip fractures associated with a 1-year mortality in excess of 20%. (6) Vertebral fractures are the most common type of fragility fracture. (7,8) Vertebral compression fractures (VCFs) occur when a vertebral body in the spine collapses.
Clinical presentation is quite variable, ranging from asymptomatic to height loss or kyphosis to severe pain requiring hospitalization. (2,9) Most vertebral fractures are clinically unrecognized, but they are important markers of skeletal fragility and are associated with increased risk of other fractures, including hip fracture. (2) Women with preexisting vertebral fractures had approximately four times greater risk of subsequent vertebral fractures than those without prior fracture. (10) Women with preexisting vertebral fractures have a 1.5- to 2-fold increased risk of incident hip fracture compared with those without. (8) Further, vertebral compression fractures are associated with persistent pain, as well as increased risk of progression of age-related kyphosis (with its associated decreased pulmonary function, increased risk of gastroesophageal reflux disease [GERD], and decreased physical function). (8) Finally, incident clinical vertebral fractures are associated with an initial 2- to 8-fold increased age-adjusted mortality rate. (8,11,12)
Given that many vertebral fractures are clinically silent and others present with nonspecific back pain, diagnosis is a clinical challenge. Many patients, however, receive diagnostic tests for other clinical reasons that may incidentally detect vertebral fractures. Zebra Medical Vision algorithms are meant to assist radiologists in detecting frequently overlooked lesions. The Zebra VCF detection algorithm was developed using a combination of traditional machine vision segmentation and convolutional neural network (CNN) technology (13) and may be applied to CT images of the chest, abdomen, and/or pelvis. We conducted an independent and blinded validation study on previously completed CT scans of the chest or abdomen/pelvis from women and men aged 50 or older who, as outpatients or inpatients, had studies at Cedars-Sinai Medical Center in Los Angeles, CA, USA.
We estimated the sensitivity, specificity, likelihood ratios, and predictive values and their associated 95% confidence intervals (CI), using the diagnosis of board-certified neuroradiologists as the reference standard.

Study design and participants
Participants for this study were men and women aged 50 or older, with previously conducted CT scans of the chest or abdomen/pelvis performed at Cedars-Sinai Medical Center within the period from 2012 to 2017, with information on age and sex. The protocol requested that consecutive CT scans be identified from the radiology record system in reverse chronologic order from June 2017 until 550 CT scans of the chest were identified from 550 unique individuals and 550 CT scans of the abdomen/pelvis were identified from 550 unique individuals. In addition, to ensure that there were enough study participants with VCF, we also sought to identify 50 CT scans of the chest and 50 CT scans of the abdomen/pelvis with previously defined verifiable VCF (based on a prior radiologist determination in the record) from the same time period. The CT scanners used to obtain the original images were of multiple types, including CT scanners made by GE, Siemens, and Canon.

Image analysis and validation
Each set of CT scan images was independently reviewed by two neuroradiologists (from a pool of three neuroradiologists, each with more than 10 years of relevant experience) to identify both the presence of compression fractures of the spine and the associated grade (severity of vertebral body height loss, using the semiquantitative scale of Genant and colleagues (14)). The level(s) of the fractures were also determined and documented. The images were viewed using a Carestream picture archiving and communication system (PACS). The information was collected and then recorded in an independent csv file.
Scans with differences in VCF diagnoses (presence, grade, or location) were reviewed by the principal investigator (BDP) and a final diagnosis made. Study radiologists did not have access to the original CT study reports.
The Zebra Medical Vision software was installed at Cedars-Sinai using a server disconnected from the "radiology information system," but with access to the de-identified CT scans. The software did not have access to participant diagnoses or to the assessments of radiologists. Further, study radiologists were also blind to the assessments provided by the Zebra Medical Vision software. The algorithm output was a csv file containing a positive/negative finding per CT study. The algorithm did not identify the location of the fracture; instead, there was a positive finding if there was a fracture identified at any vertebral level, and there was a negative finding if no fracture was identified by the algorithm at any level of the spine. The details of the algorithm were described by Bar and colleagues. (13) After a full review, the study protocol was approved by the Cedars-Sinai Institutional Review Board. Patient consent was waived by the review board.
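Because the algorithm reports one finding per CT study while the radiologists graded individual vertebrae, the two can be compared only after the vertebra-level grades are collapsed to the study level. The following is a minimal sketch of such a collapse, assuming Genant grades coded 0 (no fracture) to 3 (severe) and a hypothetical dictionary layout; neither the function name nor the data layout comes from the study's files.

```python
from typing import Dict


def study_level_findings(grades_by_level: Dict[str, int]) -> Dict[str, bool]:
    """Collapse vertebra-level Genant grades to study-level findings.

    grades_by_level maps a vertebral level (e.g. "T8") to a Genant grade,
    assumed here to be coded 0 = none, 1 = mild, 2 = moderate, 3 = severe.
    """
    # The worst grade anywhere in the imaged spine drives the study-level call.
    worst_grade = max(grades_by_level.values(), default=0)
    return {
        "any_vcf": worst_grade >= 1,
        "moderate_severe_vcf": worst_grade >= 2,
    }


# Hypothetical reading: mild fracture at T8, moderate fracture at T12.
example = study_level_findings({"T8": 1, "T12": 2, "L1": 0})
print(example)
```

A study with only grade 1 fractures would count as positive for any VCF but negative for moderate/severe VCF, which is why the two outcomes can yield different sensitivity and specificity estimates.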

Data collection and protocol amendments
A separate csv file was created containing baseline information, including age and sex, and a study participant identifier. All files were exported to Amgen, where the files were merged into an analysis data set.
Data collection was planned before the performance of the index test and the reference standard. However, after initial data collection, it was discovered that the implementation team had not included the 100 CT studies with previously defined verifiable VCF. We therefore requested that the study data be augmented with an additional 200 CT studies, 100 with previously defined VCF, and the data randomly shuffled. This was done to maintain blinding.

Statistical analysis
Based on prevalence estimates from previous studies, (15,16) we estimated that we needed approximately 1000 patients to determine if the Zebra VCF detection algorithm had a positive likelihood ratio above 10 (based on a sensitivity of 0.9 and specificity of 0.91) and for the 95% confidence interval to exclude a positive likelihood ratio of 8.
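The planning target can be checked with simple arithmetic, since the positive likelihood ratio is the sensitivity divided by one minus the specificity. A minimal sketch follows; the function name is illustrative and not part of the study's code.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Positive likelihood ratio: sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


# Design assumptions stated in the sample size calculation.
planned_lr = positive_likelihood_ratio(sensitivity=0.90, specificity=0.91)
print(round(planned_lr, 2))
```

With the assumed sensitivity of 0.9 and specificity of 0.91, the ratio is 0.9/0.09, or 10, which matches the threshold named in the sample size calculation.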
Analyses were conducted both at the CT study level and at the vertebral level in the spine. Results were stratified by CT study region (chest versus abdomen/pelvis). We estimated sensitivity (probability of Zebra VCF algorithm positive result conditional on presence of VCF as determined by study radiologists), specificity (probability of Zebra VCF algorithm negative result conditional on absence of VCF as determined by study radiologists), positive and negative likelihood ratios, and predictive values along with relevant 95% confidence intervals. Ninety-five percent confidence intervals for sensitivity, specificity, and predictive values were estimated using exact binomial distributions. Likelihood ratios and their 95% CIs were estimated using loglinear regression models. We also re-estimated sensitivities inclusive of images that were not evaluable by the VCF algorithm.
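These definitions can be illustrated with a 2x2 table of algorithm result against reference standard. The sketch below computes the point estimates and exact (Clopper-Pearson) binomial intervals for sensitivity and specificity using only the Python standard library. The counts are hypothetical and are not the study's data, the function names are illustrative, and the loglinear regression used in the study for likelihood ratio intervals is not reproduced here.

```python
import math


def _binom_pmf(i: int, n: int, p: float) -> float:
    """P(X = i) for X ~ Binomial(n, p), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if i == 0 else 0.0
    if p >= 1.0:
        return 1.0 if i == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)


def _binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return min(1.0, sum(_binom_pmf(i, n, p) for i in range(k + 1)))


def _bisect(f, target: float, increasing: bool) -> float:
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_binomial_ci(k: int, n: int, alpha: float = 0.05):
    """Clopper-Pearson interval for a proportion k/n."""
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1.0 - _binom_cdf(k - 1, n, p), alpha / 2.0, increasing=True)
    upper = 1.0 if k == n else _bisect(
        lambda p: _binom_cdf(k, n, p), alpha / 2.0, increasing=False)
    return lower, upper


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Point estimates from a 2x2 table (assumes fp > 0 and tn > 0)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sens / (1.0 - spec),
        "lr_negative": (1.0 - sens) / spec,
        "sensitivity_ci": exact_binomial_ci(tp, tp + fn),
        "specificity_ci": exact_binomial_ci(tn, tn + fp),
    }


# Hypothetical counts for illustration only (not the study's Table 3).
metrics = diagnostic_metrics(tp=80, fp=30, fn=20, tn=270)
for name, value in metrics.items():
    print(name, value)
```

Predictive values, unlike sensitivity and specificity, depend on the prevalence of VCF in the sample, so values computed from an enriched sample do not transfer directly to an unselected population.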
To estimate interrater reliability of the initial evaluations of the three neuroradiologists, excluding the principal investigator (two per study participant), we used mixed-effects linear regression models with random intercepts at the study participant level and at the radiologist level, using presence versus absence of VCF, VCF severity, or number of VCFs per study participant, respectively, as a continuous outcome variable. We additionally included fixed-effects terms for the ratings of each of two radiologists (with the other modeled as the intercept). We estimated the intraclass correlation coefficient (ICC) as the ratio of between-subject variance to total variance (sum of between-subject variance and within-subject variance). We estimated 95% bias-corrected and accelerated (BCa) bootstrap confidence intervals for the ICCs from 1000 bootstrap samples for each of the three estimations.
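In symbols, the variance ratio described above can be written as follows, where the notation (introduced here for illustration only) uses \sigma_b^2 for the between-subject variance and \sigma_w^2 for the within-subject variance:

```latex
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
```

Values near 1 indicate that most of the variability in ratings reflects true differences between study participants rather than disagreement between raters.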
The data analysis for this article was generated using Stata software (StataCorp, College Station, TX, USA), version 17, (17) and SAS/STAT software (SAS Institute, Cary, NC, USA), version 9.4 of the SAS system for LIN X64 platform. (18)

Description of sample
Of the 1200 study participants (962 chest CT, 169 abdomen/pelvis, and 69 chest/abdomen/pelvis), the Zebra VCF algorithm was able to make a VCF determination on 1087 (90.6%; Figs. 1 and 2). The algorithm did not provide a reading for 113. Of the remaining 1087 study participants, 588 (54%) were women.

Distribution of VCF
Based on radiologist diagnosis (reference standard), among the 1087 scans that were evaluable by the algorithm, 227 (21%) were determined to have at least one VCF. Ninety had mild VCF (Table 2). One hundred fifteen presented with two or more VCFs. After excluding patients with previously documented VCF, 161 of 996 (14%) were determined to have at least one VCF. Table 3 and Figures 2-5 show the relationship between radiologist diagnosis and the results from the Zebra VCF algorithm at the patient level. The sensitivity and specificity of the Zebra VCF algorithm in diagnosing any VCF were 0.66 (95% CI 0.59-0.72) and 0.90 (95% CI 0.88-0.92), respectively, and for diagnosing moderate/severe VCF were 0.78 (95% CI 0.70-0.85) and 0.87 (95% CI 0.85-0.89), respectively. In other words, a positive finding with the VCF algorithm was associated with 6.88-fold increased odds of having a VCF, relative to not having a VCF (95% CI 5.32-8.44), and 5.98-fold increased odds of having a moderate/severe VCF, relative to not having such a fracture (95% CI 4.87-7.10).
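To see what a likelihood ratio of this size implies for an individual scan, the pre-test odds can be multiplied by the likelihood ratio and converted back to a probability. The sketch below applies that arithmetic to the 21% prevalence among evaluable scans and the reported ratio of 6.88. The function name is illustrative, and the result is an arithmetic illustration rather than a value reported in the study.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Prevalence of any VCF among evaluable scans and the reported positive
# likelihood ratio for any VCF, both taken from the text above.
probability_after_positive = post_test_probability(0.21, 6.88)
print(round(probability_after_positive, 2))
```

Under these inputs, a positive algorithm finding raises the probability of any VCF from 21% to roughly 65%. Because the sample was enriched with known fractures, the corresponding figure in an unselected population would be lower.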

Performance characteristics
Most VCFs were identified at the T8 and T11-L2 vertebral levels in the spine (Table 4).
At a patient level, it appeared that the Zebra VCF algorithm was able to identify most fractures that were located at T5 or below (Table 4).

Interrater reliability of radiologists' measurements
The intraclass correlation coefficient as a measure of inter-radiologist reliability for determining the presence versus absence

Discussion
The Zebra VCF algorithm identifies approximately five-sevenths (inclusive of non-evaluable images) of moderate to severe VCFs identifiable on CT studies of the thorax or abdomen in adults aged 50 years and older who receive CT scans for other reasons, while falsely labeling a tenth of patients without fracture as having fracture. The sensitivity for all VCF was 0.66 of evaluable images (approximately 59% of all images). The positive likelihood ratios of 6.9 and 6.0 for any fracture and for moderate to severe fracture versus other, respectively, mean that a positive finding by the algorithm increases the odds of any VCF almost 7-fold and the odds of moderate to severe VCF approximately 6-fold.
Kolanu and colleagues, (19) who evaluated the performance of the Zebra VCF algorithm in 1686 thoracic/abdominal CT studies at a single Australian tertiary-care facility, found that the algorithm had a lower sensitivity of 0.54 but a slightly higher specificity of 0.92 for all VCF. They found sensitivity and specificity of 0.65 and 0.92, respectively, for moderate/severe (Genant 2/3) VCF. The study, however, had a major limitation in that CT scans were reviewed by a second radiologist only in situations where there were discrepancies between the VCF algorithm and the initial radiologist. This version of the reference standard is non-ideal in that the standard depends much more on the performance of one radiologist relative to others and can bias the apparent performance of the algorithm in ways that are difficult to generalize. In a previous study (13) that reported on the training of the Zebra VCF algorithm, the authors evaluated the algorithm in a validation set, reportedly balanced between positive and negative samples and distinct from that used for training, and reported accuracy of 0.89, with sensitivity of 0.83 and specificity of 0.94. However, the authors did not report on procedures used for blinding, if any.
Two other studies (20,21) evaluated the Zebra VCF algorithm in terms of feasibility and prediction, but neither of the studies was blinded or involved a systematic validation. Roux and colleagues (21) reported on the use of the algorithm in a large cohort of French patients, but no validation by radiologists was done. Dagan and colleagues (20) reported on the predictive performance of a Zebra-defined CT scan-based algorithm performed on scans taken before 2012 to predict major osteoporotic fractures between 2012 and 2017. Other studies (22-25) have evaluated other machine learning-based algorithms to identify VCFs in thoracic/abdominal CT scans performed for other reasons, and most have reported very high sensitivity. However, these studies were limited by one or more of inadequate reference standard, inadequate blinding, or small study size.
Routine CT scans of the chest or abdomen have some limitations in that they do not allow evaluation of the entire spine. Thoracic CT allows examination of the thoracic and upper lumbar spine. (26) Abdominal CT allows examination of the lower thoracic (T10 and below) and lumbar spine. (27,28) The most frequent sites for VCFs are midthoracic (T8) and at the thoracolumbar junction (T12-L1). (29,30) Therefore, most VCFs may be identified by CT scans of the thorax or abdomen, but some will likely be missed. Additionally, a dedicated CT of the spine, covering either the entire thoracolumbar spine or the thoracic or lumbar spine individually, can be acquired with a more magnified view of the spine only, excluding the remaining soft tissues of the chest, abdomen, and pelvis. This allows a more detailed, focused evaluation of the spine.
Accurate diagnosis of VCF requires knowledge of other deformities that can generate false positives, including intervertebral osteochondrosis, Schmorl's nodes, and congenital abnormalities. (9,26) These complicate the widespread implementation of computer-aided diagnostic methods like the Zebra algorithm. Because many of these entities can make individual vertebral bodies mimic VCFs of various severities, proper implementation will require validation of findings by trained or experienced medical staff.
Strengths of this research include the design, which ensured that radiologists were blind to the ratings of each other and to the ratings of the VCF algorithm. Further, the VCF algorithm was implemented in such a way that it was blind to the ratings of the radiologists. The data resulting from the two evaluations were handled by separate institutions and were merged only after the evaluations were complete. The study was also designed to have adequate sample size to estimate the performance metrics with reasonable precision. We sought to optimize the reference standard by using verification by expert neuroradiologists who regularly review spine CT scans as part of their work and by having a third neuroradiologist resolve disagreements. However, there was some disagreement between radiologists, particularly with respect to determining severity (Genant grade) and the vertebral level of the VCF. Although radiologists graded VCF at the level of the vertebra, the Zebra algorithm determined VCF at the patient level only.
The study had other limitations. We evaluated previously completed CT scans rather than identifying patients prospectively. One consequence was that, other than age and sex, we had very little covariate data on included patients, and we did not know the reason for the examination. Furthermore, although reformats of the thoracic or lumbar spine can be made from CT scans of the chest, abdomen, or pelvis, because scans were identified retrospectively, the reformats were often made from the available data set (often reformatted at 2.5-mm intervals) as opposed to the original thin-section volumetric data sets, which are not sent to the PACS (picture archiving and communication system) database because of storage limitations. This limits the resolution of the reformatted images that are used to determine the presence of VCFs. This issue would have been avoided in a fully prospective study. In addition, the implementation team had not included the 100 CT studies with previously defined verifiable VCF in the original data evaluated by the algorithm. These data were added later together with an additional 100 CT studies of undetermined VCF status, which were shuffled to maintain blinding. Further, the VCF algorithm was not able to provide a determination on almost 10% of CT images evaluated.
The Zebra VCF algorithm works to identify just over 70% of moderate to severe VCF in adults, aged 50 years and older, who receive CT scans for other reasons, provided evaluation can be done. Implementing the Zebra VCF algorithm within radiology systems may help to identify patients at increased fracture risk and could support the diagnosis of osteoporosis. When used as a step in a comprehensive diagnostic process, the Zebra computer-aided diagnostic algorithm for VCF identification may be a helpful tool.
Note: 95% confidence intervals (CI) for the sensitivity were calculated using an exact binomial distribution.
Amgen, who provided project management support, technical support, and statistical programming support, respectively. The work was supported by Amgen Inc. Zebra Medical Vision provided the software that evaluated the images.