We studied reproducibility of the ISCD vertebral exclusion criteria among four interpreters. Surprisingly, agreement among interpreters was only moderate, because of differences in threshold for diagnosing focal structural defects and choice of which vertebra among a pair discordant for T-score, area, or BMC to exclude. Our results suggest that reproducibility may be improved by specifically addressing the sources of interobserver disagreement.
Introduction: Although DXA is widely used to measure vertebral BMD, its interpretation is subject to multiple confounders including osteoarthritis, aortic calcification, and scoliosis. In an attempt to standardize interpretation and minimize the impact of artifacts, the International Society for Clinical Densitometry (ISCD) established criteria for vertebral exclusion, including the presence of a focal structural defect (FSD), discrepancy of >1 SD in T-score between adjacent vertebrae, and a lack of increase in BMC or area from L1 to L4. Whereas the efforts of the ISCD represent an important advance in BMD interpretation, the interobserver reproducibility with application of these criteria is unknown. We hypothesized that there would be substantial agreement among four interpreters regarding application of the exclusion criteria and the final lumbar spine T-score.
Materials and Methods: Each interpreter read a set of 200 lumbar DXA scans obtained on male veterans, applying the ISCD vertebral body exclusion criteria.
Results: Surprisingly, agreement among interpreters was only moderate. Differences in interpretation resulted from differing thresholds for recognition of FSD and the choice of excluding the upper or lower vertebral body for the criteria requiring comparison between adjacent vertebrae.
Conclusions: Despite their apparent simplicity, the ISCD vertebral exclusion criteria are difficult to apply consistently. In principle, appropriate refinement of the exclusion criteria may significantly improve interobserver agreement.
BMD is the single best test to predict fragility fracture.(1–4) Although various biochemical markers of bone turnover, ultrasonic properties of bone, and quantitative CT scanning each predict fragility fracture,(5–10) BMD measured using DXA remains the gold standard for osteoporosis diagnosis and fracture risk assessment in clinical practice. However, DXA is an imperfect tool. For example, interpretation of lumbar spine bone mass is confounded by artifacts including degenerative arthritis, aortic calcification, and scoliosis.(11) To address this problem, the International Society for Clinical Densitometry (ISCD) currently recommends exclusion of individual vertebrae when interpreting lumbar spine BMD if any of four criteria are present.(12) Specifically, a vertebral body is excluded if there is a focal structural defect (FSD), unusual discrepancy in T-score between adjacent vertebrae, or a lack of increase in BMC or bone area when proceeding caudally from L1 to L4.(12)
While the efforts of the ISCD represent an important advance in the approach to BMD interpretation, interobserver reproducibility in the application of ISCD criteria for vertebral body exclusion is unknown. We hypothesized that interpreters could achieve good consensus in applying the ISCD exclusion criteria and determining the resultant lumbar spine T-score. To test this hypothesis, we analyzed agreement among observers in a set of 200 lumbar spine DXA scans.
MATERIALS AND METHODS
De-identified lumbar spine DXA reports of 200 male veterans among 533 tested at the William S. Middleton Veterans Hospital between January 2 and October 5, 2002 were randomly chosen for evaluation. A single ISCD-trained technologist performed all examinations using a GE Medical Systems Lunar Expert-XL densitometer (Madison, WI, USA). Data were analyzed using software version 1.92 and the GE Lunar male normative database.
Subsequently, four ISCD-certified physicians reviewed the DXA images and ancillary results to estimate reproducibility of the ISCD criteria for vertebral body exclusion. The indication(s) for exclusion of vertebral bodies were noted, along with the resulting lumbar spine T-score, for each report. The GE Lunar software for these BMD reports did not permit calculation of T-score among noncontiguous vertebrae. Thus, if two adjacent and one noncontiguous vertebral body remained, interpreters were instructed to choose the mean T-score of the two adjacent vertebrae as the final lumbar spine T-score, as suggested by the ISCD position statement.(12) When two nonadjacent vertebral bodies remained after application of the exclusion criteria, interpreters were asked to choose the numerically lower T-score as the final value.
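The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the densitometer software's logic: the function name `final_t_score`, the treatment of fewer than two remaining vertebrae as unreadable, and the use of a simple mean of per-vertebra T-scores for any contiguous run are assumptions made for the example.

```python
from statistics import mean

LEVELS = ("L1", "L2", "L3", "L4")

def final_t_score(t_scores, excluded):
    """Return a final lumbar spine T-score from per-vertebra T-scores.

    t_scores: dict mapping level name to T-score.
    excluded: set of level names removed by the interpreter.
    """
    kept = [i for i, level in enumerate(LEVELS) if level not in excluded]
    if len(kept) < 2:
        return None  # assumption: fewer than two vertebrae is treated as unreadable

    # Group the remaining vertebrae into runs of adjacent levels.
    runs, run = [], [kept[0]]
    for i in kept[1:]:
        if i == run[-1] + 1:
            run.append(i)
        else:
            runs.append(run)
            run = [i]
    runs.append(run)

    longest = max(runs, key=len)
    if len(longest) >= 2:
        # Adjacent vertebrae remain: use their mean T-score (simplified).
        return mean(t_scores[LEVELS[i]] for i in longest)
    # Only nonadjacent vertebrae remain: take the numerically lower T-score.
    return min(t_scores[LEVELS[i]] for i in kept)

# Hypothetical values for illustration only.
scores = {"L1": -1.0, "L2": -2.0, "L3": -0.5, "L4": 0.5}
print(final_t_score(scores, {"L3"}))        # L1 and L2 adjacent, L4 noncontiguous
print(final_t_score(scores, {"L2", "L4"}))  # L1 and L3 nonadjacent
```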
All study data were entered in duplicate into an Excel spreadsheet for statistical analysis. The University of Wisconsin Human Subjects Committee and the William S. Middleton Veterans' Hospital's Research and Development Committee reviewed the study protocol and, as all patient identifiers had been removed from the data, deemed that informed consent was not needed for this study.
To determine sample size for this study, interpreters 2 and 3 reviewed a random sample of 54 male BMD reports, applying the ISCD criteria for vertebral body exclusion. Agreement in lumbar spine T-score was high (r = 0.98, Pearson's correlation coefficient), with an average difference in T-score of 0.08 ± 0.4 between the two interpreters. Based on these data, using a two-sided α error of 0.05, review of 200 BMD reports would provide a power of 94% to detect a difference of 0.1 in T-score between paired interpreters.
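The stated power can be approximated with a normal-theory calculation for a paired comparison. The sketch below assumes that 0.4 is the SD of the paired T-score differences and that a z approximation is adequate at n = 200; the function name `paired_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def paired_power(delta, sd_diff, n, alpha=0.05):
    """Approximate power of a two-sided paired z-test to detect a mean difference delta."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = delta / (sd_diff / sqrt(n))    # standardized effect for n pairs
    # Probability of landing in either rejection region under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

power = paired_power(delta=0.1, sd_diff=0.4, n=200)
print(f"{power:.2f}")  # 0.94
```

Under these assumptions the result agrees with the 94% reported in the text.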
We used McNemar's and κ test statistics to assess differences in rates of, and indications for, vertebral body exclusion among interpreters. ANOVA and paired and unpaired t-tests were used to compare continuous variables, while the χ2 test was used to compare categorical variables. We performed linear regression to compare interpreters' T-scores for each scan, obtaining the Pearson correlation coefficient, prediction equation, and SE of the estimate for each interpreter pair. Significance levels were adjusted according to Bonferroni's correction to account for multiple comparisons. Statistical analyses were performed using Analyze-It software (Leeds, UK), Sigma Stat 2.03 (SPSS), and Excel (Microsoft). Calculation of κ was based on formulas obtained from the Medical Algorithms Project.(13)
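For readers unfamiliar with κ, a minimal sketch of Cohen's κ for two interpreters making a binary decision (exclude or keep a vertebra) follows. It is assumed, not confirmed, that the formulas taken from the Medical Algorithms Project correspond to this standard two-rater form; the data are hypothetical.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters' binary decisions (1 = excluded, 0 = kept)."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's marginal exclusion rate.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)

# Hypothetical exclusion decisions for one vertebral level across ten scans.
reader_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
reader_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
print(round(cohen_kappa(reader_1, reader_2), 3))  # 0.583
```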
RESULTS
Clinical characteristics of the study population
Male veterans in this study had a mean age of 64 ± 12 years (range, 25–89 years). Most were white (97%), with two Hispanic and four black men among the group. The average body mass index was 29.1 kg/m2 (range, 17.4–45.1 kg/m2). Current and prior alcohol use was reported by 39 (20%) and 27 (14%) men, respectively, whereas current and prior tobacco use was noted in 33 (17%) and 66 (33%) men, respectively. Among all men, 98 (49%) reported a prior fracture, including 95 (48%) with low trauma fractures.
Vertebral body exclusion
Figure 1 shows vertebral exclusion by the various observers, and two trends are apparent. First, the rate of vertebral body exclusion rises from L1 to L4. Second, each observer has a characteristic threshold for excluding vertebrae. For example, observer 1 excluded the fewest vertebrae, whereas observer 4 excluded the most vertebrae, at any given level. Both trends are significant by two-way ANOVA, with p < 0.0001 for vertebrae and p = 0.0016 for observers, without a significant interaction between vertebrae and observers. Post hoc analysis by Tukey's test reveals significant pairwise differences in vertebral body exclusion between all vertebrae except L1 and L2 and between all interpreters except 2 and 3.
Differences in BMD interpretation among the interpreters are also reflected by the number of scans deemed unreadable. Lumbar spine images were completely excluded from analysis in 4 cases by interpreter 1, 2 cases by interpreter 2, 16 cases by interpreter 3, and 24 cases by interpreter 4. The frequency of unreadable scans is significantly different among observers (χ2 = 29.8, df = 3, p < 0.0001). Pairwise comparisons are significant for observers 2 and 4 (χ2 = 13.9, df = 1, p = 0.0002), observers 1 and 4 (χ2 = 18.1, df = 1, p = 0.00002), and observers 1 and 3 (χ2 = 9.8, df = 1, p = 0.0017), whereas the difference is of marginal significance for observers 2 and 3 (χ2 = 6.4, df = 1, p = 0.012).
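The overall comparison of unreadable scans can be reproduced from the counts given above with a Pearson χ2 test on a 4 × 2 table (unreadable versus readable, 200 scans per observer). This is a plain-Python sketch without continuity correction; the function name `chi_square` is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

unreadable = [4, 2, 16, 24]                      # interpreters 1 to 4
table = [[u, 200 - u] for u in unreadable]       # [unreadable, readable] per interpreter
df = (len(table) - 1) * (len(table[0]) - 1)      # (4 - 1) * (2 - 1) = 3
print(round(chi_square(table), 1), df)           # 29.8 3
```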
We used the κ statistic to gauge agreement among four observers regarding vertebral body exclusion and regarding application of each of the four ISCD exclusion criteria (Table 1). κ values range from −0.039, indicating no agreement, to 0.851, indicating excellent agreement. However, the majority of κ values fall between 0.2 and 0.6, indicating fair to moderate agreement between observers.
The surprisingly poor agreement among observers prompted us to seek the factors leading to different choices regarding vertebral body exclusion. Two factors account for most of the discrepancy among observers: differing thresholds for vertebral exclusion because of FSD and differing choice of which vertebra among a discordant pair to exclude.
The four observers exhibited differing judgment regarding the presence of FSD (Fig. 1). An example of disparate exclusion based on FSD is shown in Fig. 2. In this case, interpreter 3 excluded L1 only, interpreters 1 and 2 excluded L1 and L2, and interpreter 4 excluded all vertebral bodies because of FSD, resulting in an unreadable scan. We analyzed FSD by two-way ANOVA, finding significant differences in FSD frequency related to vertebra (p = 0.00013) and observer (p = 0.00009) without interaction between these factors. Post hoc analysis by Tukey's test reveals significant pairwise differences for all vertebral comparisons except L1 versus L2 and L2 versus L3 and for all observer pairs except 1 versus 2 and 3 versus 4. Overall, observer 4 had the lowest threshold for calling an FSD, followed by observers 3, 2, and 1.
In general, increasing patient age was associated with less consistent agreement regarding the presence of FSD. For example, between interpreters 1 and 2, the mean age of men for whom full agreement regarding the presence of FSD was achieved was 61.2 ± 12 years. Conversely, among men for whom disagreement regarding FSD was noted, the mean age was 68.8 ± 9.6 years (p < 0.0001 for comparisons of age by independent t-test). Table 2 provides additional comparisons for age and agreement regarding FSD.
Table 2. Agreement Regarding FSDs and Impact on Mean T-Score
Exclusion criteria based on comparison of adjacent vertebrae
A T-score discrepancy of >1 SD unit among adjacent vertebrae and a lack of increase in BMC or area proceeding caudally from L1 to L4 are each cause for excluding a vertebral body. These exclusion criteria differ from the presence of FSD because they are applied to vertebral pairs, rather than to single vertebrae. Moreover, once a pair has been identified as warranting exclusion of one vertebral body, the interpreter must still decide which vertebral body to exclude.
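The three comparative criteria are mechanical enough to express as a short check over adjacent vertebral pairs. The sketch below only flags discordant pairs and deliberately leaves open which member of the pair to exclude, since that choice is left to the interpreter. The class and function names, and the example values, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Vertebra:
    level: str
    t_score: float
    bmc: float    # bone mineral content, g
    area: float   # projected bone area, cm^2

def discordant_pairs(spine):
    """Flag adjacent pairs (ordered L1 to L4) that meet a comparative exclusion criterion."""
    flags = []
    for upper, lower in zip(spine, spine[1:]):
        reasons = []
        if abs(upper.t_score - lower.t_score) > 1.0:
            reasons.append("T-score")   # discrepancy of more than 1 SD
        if lower.bmc <= upper.bmc:
            reasons.append("BMC")       # no increase proceeding caudally
        if lower.area <= upper.area:
            reasons.append("area")      # no increase proceeding caudally
        if reasons:
            flags.append((upper.level, lower.level, reasons))
    return flags

# Hypothetical scan for illustration only.
spine = [
    Vertebra("L1", -1.0, 13.0, 14.0),
    Vertebra("L2", -1.2, 14.5, 15.0),
    Vertebra("L3", -2.6, 14.0, 16.2),
    Vertebra("L4", -2.4, 17.0, 17.5),
]
print(discordant_pairs(spine))  # [('L2', 'L3', ['T-score', 'BMC'])]
```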
The κ data in Table 1 reveal that there was greater agreement among observers for these three comparative exclusion criteria than for FSD. Moreover, good agreement was achieved between interpreters (corrected McNemar's test not significant for all but 5 comparisons among 72) when applying the three exclusion criteria. However, despite the apparent objectivity of the numeric exclusion criteria, interpreters differed in their application of these criteria. In Table 3 we show one such example, with interpreters 1 and 4 excluding L2 and L4, whereas interpreters 2 and 3 excluded L1, L3, and L4 with application of the three comparative exclusion criteria.
Table 1. Agreement Among Interpreters by κ Statistic*
Table 3. Example of Interpreter Disagreement Regarding Vertebral Exclusion for Comparative Criteria
Therefore, given the observer's choice regarding which of two paired vertebrae to exclude, we investigated whether scoring agreement by the vertebral pair, rather than by the single vertebra, would reveal greater concordance among observers by κ. Because exclusion of L2 or L3 could be due to comparison with either the vertebra above or below it, we limited this additional analysis to the L1-L2 and the L3-L4 pairs. The corrected κ for the three comparative exclusion criteria are shown in Table 4. With adjustment for arbitrarily choosing the upper or lower of two vertebral bodies, the average κ among paired interpreters increased by an average of 0.195 to a mean κ of 0.629, indicating substantial agreement. Moreover, 12 of 36 corrected comparisons yielded κ > 0.8, indicating excellent agreement among observers.
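The effect of scoring by pair rather than by single vertebra can be illustrated with a small hypothetical example. Two readers who flag the same discordant L1-L2 pairs but exclude opposite members disagree at the vertebral level yet agree at the pair level. The sketch assumes a standard two-rater Cohen's κ; all names and data are illustrative.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters' binary decisions."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (observed - expected) / (1 - expected)

def by_vertebra(exclusions, level):
    """1 if the reader excluded this specific vertebra, else 0."""
    return [int(level in scan) for scan in exclusions]

def by_pair(exclusions, pair):
    """1 if the reader excluded either member of the pair, else 0."""
    return [int(bool(scan & set(pair))) for scan in exclusions]

# Hypothetical exclusions (one set per scan) for two readers.
reader_a = [{"L1"}, {"L2"}, set(), set(), {"L1"}, set()]
reader_b = [{"L2"}, {"L2"}, set(), set(), {"L2"}, {"L1"}]

kappa_l1 = cohen_kappa(by_vertebra(reader_a, "L1"), by_vertebra(reader_b, "L1"))
kappa_pair = cohen_kappa(by_pair(reader_a, ("L1", "L2")), by_pair(reader_b, ("L1", "L2")))
print(round(kappa_l1, 3), round(kappa_pair, 3))  # -0.286 0.667
```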
Table 4. Improvement in κ Statistic After Correction for Paired Vertebral Analysis
Impact of vertebral exclusion on T-score
Given that observers differ in their application of the ISCD vertebral exclusion criteria, it is important to determine the impact of interobserver differences on final T-scores. Excluding those studies deemed not assessable by at least one interpreter (n = 36), the mean lumbar spine T-score was −1.55 ± 1.32 for interpreter 1, −1.70 ± 1.29 for interpreter 2, −1.82 ± 1.29 for interpreter 3, and −1.85 ± 1.29 for interpreter 4 (p > 0.05 by ANOVA and subsequent pairwise comparisons). For all 200 BMD reports, the mean lumbar spine T-score was −1.36 ± 1.46 for interpreter 1, −1.45 ± 1.48 for interpreter 2, −1.68 ± 1.43 for interpreter 3, and −1.73 ± 1.40 for interpreter 4 (p > 0.05 by ANOVA and subsequent pairwise comparisons).
Using linear regression to assess the relationship of lumbar spine T-scores between observer pairs, we determined regression equations, correlation coefficients, and SE of the estimate. Scatter plots and summarized statistics are shown in Fig. 3. All correlation coefficients exceeded 0.88. The mean difference (Δ) and SE of the estimate in lumbar spine T-score among reports and between paired observers ranged from 0.05 and 0.47 (interpreters 3 and 4) to 0.28 and 0.63 (interpreters 1 and 4), respectively. The SE of the estimate is a measure of the distance between an average data point and the regression line and provides an additional measure of agreement among observers.
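The quantities reported for each interpreter pair (regression equation, correlation coefficient, SE of the estimate, and mean difference) can be computed as in the following sketch. It uses ordinary least squares with n − 2 degrees of freedom for the SE of the estimate, which is the usual convention but is assumed here rather than taken from the source. The T-scores shown are hypothetical.

```python
from math import sqrt

def regress(x, y):
    """Least-squares fit of y on x; returns slope, intercept, r, SE of estimate, mean(x - y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    # Residual scatter about the regression line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    see = sqrt(sse / (n - 2))
    mean_diff = sum(xi - yi for xi, yi in zip(x, y)) / n
    return slope, intercept, r, see, mean_diff

# Hypothetical final T-scores from two interpreters for the same five scans.
reader_a = [-2.9, -2.1, -1.6, -0.8, 0.3]
reader_b = [-3.1, -2.4, -1.5, -1.1, 0.1]
slope, intercept, r, see, mean_diff = regress(reader_a, reader_b)
print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.2f}, SEE = {see:.2f}, mean diff = {mean_diff:.2f}")
```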
Despite good correlation in T-score, review of Fig. 3 shows that, between interpreters, the final diagnostic category may differ substantially. Overall, changes in diagnostic category ranged from 18 (observers 3 and 4) to 30 patients (observers 1 and 4). For example, osteoporosis, osteopenia, and normal bone mass were diagnosed by DXA in 50, 66, and 80 men by interpreter 1 and 54, 69, and 53 men by interpreter 4. Among 176 men with T-scores provided by both interpreters, 30 (17%) experienced a change in diagnostic category between interpreters, moving down to a lower bone mass category in all but six cases with interpreter 4 compared with interpreter 1.
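Counting changes in diagnostic category between two interpreters can be sketched as follows. The thresholds are the conventional WHO T-score cut points (osteoporosis at −2.5 or below, osteopenia between −2.5 and −1.0), which the text does not restate and which are therefore an assumption here. Scans unreadable by either interpreter are represented as `None` and skipped, mirroring the restriction to men with T-scores from both interpreters.

```python
def who_category(t_score):
    """Classify a T-score using the conventional WHO cut points (assumed)."""
    if t_score <= -2.5:
        return "osteoporosis"
    if t_score < -1.0:
        return "osteopenia"
    return "normal"

def category_changes(t_a, t_b):
    """Return (number of scans whose category differs, number of scans compared)."""
    changed = compared = 0
    for a, b in zip(t_a, t_b):
        if a is None or b is None:
            continue  # unreadable for at least one interpreter
        compared += 1
        if who_category(a) != who_category(b):
            changed += 1
    return changed, compared

# Hypothetical final T-scores for four scans.
reader_a = [-2.6, -1.2, None, -0.5]
reader_b = [-2.4, -1.4, -1.0, -0.9]
print(category_changes(reader_a, reader_b))  # (1, 3)
```

Applied to the figures above, 30 changes among 176 comparable scans corresponds to 30/176 ≈ 17%.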
We found that interobserver disparities in T-scores could be attributed largely to disagreement regarding the presence of FSD. For example, interpreters 2 and 4 agreed fully on the presence of FSD among four vertebral bodies for 76 reports, and the mean T-score was −1.68 ± 1.51 and −1.69 ± 1.54 for these reports, respectively (p = 0.93 by paired t-test). Conversely, for the BMD reports with less than full agreement regarding FSD (n = 124), interpreters 2 and 4 had significantly different mean T-scores (−1.32 ± 1.44 and −1.75 ± 1.28, respectively, p = 0.0001). Complete data are summarized in Table 2.
DISCUSSION
Although bone turnover markers, ultrasound, and quantitative CT scanning each predict fragility fracture,(5–10) BMD measured by DXA remains the gold standard for osteoporosis diagnosis and fracture risk assessment in clinical practice.(1–5,12) However, DXA is an imperfect tool for various reasons including artifacts such as degenerative arthritis, aortic calcification, and scoliosis.(11,12) Because DXA is widely used to diagnose and manage metabolic bone disease, methods that improve the quality of DXA interpretation will thereby improve patient care. Indeed, consistent interpretation is crucial to quality patient care and the conduct of multicenter studies using BMD endpoints. As standardization of diagnostic criteria for vertebral fracture has proven beneficial to clinical investigation and patient care,(14–19) a similar benefit may be anticipated from successful standardization of BMD interpretation.
The ISCD has taken an important first step in standardizing BMD interpretation, by developing criteria for vertebral body exclusion. However, whereas the exclusion criteria seem straightforward, we have shown that their application is problematic and variable, even among ISCD-certified physicians. Contrary to our expectation of good agreement among interpreters, we find that agreement generally falls into the moderate range and leads to disagreement in diagnostic classification. This disagreement among observers arises from two sources: differing thresholds for recognizing FSD, with a trend toward a lower final T-score among interpreters with lower thresholds for noting FSD, and inconsistent exclusion of either the upper or lower vertebra from a discordant pair. Nevertheless, despite differences in vertebral exclusion, a strong linear relationship between interpreters is noted with respect to lumbar spine T-scores.
Determining when to exclude a vertebral body because of an FSD is the most subjective of the four ISCD exclusion criteria. Thus, it is not surprising that the greatest disagreement between interpreters was observed when applying this exclusion criterion. We sought, but did not find, evidence suggesting that differing thresholds for FSD are related to experience (data not shown). Moreover, when we reviewed the data reported herein, disagreements among us focused on whether FSD were of sufficient severity to warrant vertebral exclusion rather than on whether an FSD was present. Our data therefore suggest that progress in standardizing DXA interpretation will require establishment of uniform FSD definitions. Such standardization might be accomplished in any of several ways, each of which has advantages and disadvantages. One possible approach is to develop and use an atlas of FSD when training densitometrists. A second potential strategy would require that vertebral body exclusion because of FSD be accompanied by an ancillary exclusion criterion, such as an unusual discrepancy in T-score. Third, plain radiographs could be used to confirm an FSD before excluding the vertebral body. Finally, software is under development to identify differences in the digital image pixel density of each vertebra, allowing such differences to be highlighted as potential FSD in the printed report. Thus, whereas reaching consensus on FSD may be a difficult task, it is achievable in principle with adequate discussion among the bone community.
The other three ISCD exclusion criteria are based on numerical data; the only subjective decision is which of two vertebral bodies to exclude. For example, if there is a lack of increase in bone area from L1 to L2, the interpreter could choose to exclude L1 or L2. Such minimal subjectivity is likely responsible for the stronger agreement between observers when applying these three exclusion criteria. Nevertheless, we found that agreement among observers is better still if one corrects for arbitrarily excluding one or the other of a discordant vertebral body pair. Specific guidance regarding which vertebra of a pair should be excluded is therefore an important step in refining the vertebral exclusion criteria. Such guidance could be incorporated into bone density software, which could in turn highlight discordant pairs in the printed report, eliminating the possibility of arithmetic errors and allowing an interpreter to exclude vertebral bodies with more consistency.
Our study has several important limitations. Only DXA scans of predominantly white male veterans were evaluated. Reproducibility may be different among women and among subjects with more varied ethnic backgrounds. Another limitation of our study is the use of a single densitometer and technician; the performance characteristics of different technicians and instruments may affect scan performance. Most importantly, the data presented here offer no guidance regarding “best practice” in applying the ISCD exclusion criteria. It remains an open question whether the exclusion criteria in general, and the various thresholds for recognizing FSD in particular, improve the ability of spinal BMD to predict fracture. Such guidance will ultimately require that various interpretation schemes be prospectively related to incident fractures. Other areas for future study include validation of our findings in different patient groups and a more systematic analysis of reader experience on DXA interpretation.
We conclude that, despite their apparent simplicity, the ISCD criteria for vertebral body exclusion are difficult to apply consistently. Differences in interpretation are caused by differing thresholds for FSD and choice of which vertebra of a discordant pair to exclude. In principle, refinement of the exclusion criteria may overcome both sources of disagreement.
KEH received salary support from Grant 1K12RRO17614-01 during the course of this study. NB receives funding from Grant R01 DK58363-01, and MKD receives support from Grants R01 AR27032-33, 1K12RR017614-1, and R01 SK 65830–1. RDB gratefully acknowledges support provided by Grant DAMD17-00-1-0071. The U.S. Army Medical Research and Materiel Command, Ft Detrick, MD, is the awarding and administering acquisition office. The views expressed do not necessarily reflect the position or policy of the U.S. government, and no official endorsement should be inferred. This material is based on work supported in part by the Office of Research and Development, Medical Research Service, Department of Veterans Affairs, and was conducted in the Geriatrics, Research, Education, and Clinical Center at the William S. Middleton Veterans' Hospital.