Dr Leslie receives speaker fees from Merck Frosst Canada Ltd., honoraria, and unrestricted educational grants from Sanofi-Aventis and Procter & Gamble Pharmaceuticals Canada.
The most widely used procedure for performing a BMD reproducibility assessment (same-technologist with simple repositioning on the same day) systematically underestimates precision error and will lead to over categorization of change in a large fraction of monitored patients.
Introduction: The most common procedure for establishing the least significant change (LSC) to monitor bone mineral density (BMD) with DXA is for the same technologist to perform repeat subject scans on the same day with simple repositioning. The objective of the current report is to determine how the reproducibility scanning procedure impacts on the precision assessment and categorization of change in routine clinical practice.
Materials and Methods: The study population was drawn from the database of the Manitoba Bone Density Program which includes all clinical DXA test results for the Province of Manitoba, Canada. All patients who had baseline and follow up total spine (L1–4) and the total hip BMD measurements on the same instrument up to March 31, 2007 were included as the ‘clinical monitoring population’ (N = 5048 scan-pairs). BMD precision was assessed in a convenience sample of patients who were agreeable to undergoing a repeat assessment (50% performed on the same day with repositioning, 68% by different technologists) (N = 331 spine and 328 hip scan-pairs).
Results: Precision error was greater when the scan-pairs were acquired on different days than on the same day for both the total spine (p < .001) and total hip (p < .01). No other factor was consistently associated with precision error. The reference LSC (different days and different technologists) categorized the smallest fraction of the monitored population with change, whereas other combinations gave a significant rate of over categorization (up to 19.3% for the lumbar spine and up to 18.3% for the total hip).
Conclusions: The most widely used procedure for performing a BMD reproducibility assessment (same-technologist with simple repositioning on the same day) systematically underestimates precision error and will lead to over categorization of change in a large fraction of monitored patients.
BMD measurement has a primary clinical role in the initial diagnostic and fracture risk assessment of osteoporosis(1,2) and is also widely used for serial monitoring of patients with suspected or confirmed osteoporosis.(3) Assessment of precision errors in BMD is a prerequisite to characterizing longitudinal changes.(4,5) The International Society for Clinical Densitometry (ISCD) has proposed a standardized methodology for such precision studies.(6,7) The ISCD procedure states that precision error should be obtained from an assessment with 30 degrees of freedom (e.g., 30 individuals with two scans each or 15 individuals with three scans each) drawn from the patient referral population and using the root mean square (RMS) approach, with the recommendation that the subjects' scans be performed after simple repositioning (arising from the scanner table between the repeat measurements).(6,7) Typically, the repeat scan is performed by the same technologist on the same day as the initial scan. We hypothesized that this procedure may systematically underestimate both short-term (day-to-day) and long-term precision error that is likely to be encountered when patients are scanned by other technologists. The objective of this study was to determine whether the reproducibility scanning procedure or other subject characteristics significantly impacted on the precision assessment and how this would affect categorization of change in routine clinical practice.
MATERIALS AND METHODS
Clinical monitoring population
Data from the Manitoba Bone Density Program were used for the analyses. This report was reviewed and approved by the facility's Office for Clinical Research. The program was established in 1997 and provides all BMD services to the population of the Province of Manitoba, Canada (total population 1.1 million according to 2001 Statistics Canada census data).(8) An electronic database contains all DXA BMD tests performed since DXA testing was first offered. The database is >99% complete and accurate as judged by chart audit.(9) From this database we identified all individuals who had baseline and follow-up total spine (L1–L4) and total hip BMD measurements on the same instrument up to March 31, 2007. We excluded cases where scanning was performed on different instruments, where the lumbar spine and/or hip were not scanned or were unsuitable for clinical reporting, and where there were vertebral exclusions for severe focal structural defects. This left 5048 scan-pairs for the final clinical monitoring population.
Replicate measurements of the spine and hip were obtained from a convenience sample of individuals referred for bone density testing who were agreeable to undergoing a repeat assessment, of which one half (50%) were performed on the same day with repositioning and the remainder were performed on a separate day (median interval, 7 days; range, 3–49 days). The majority of the scans used for the reproducibility assessment (68%) were performed by two different technologists. For comparability with the clinical monitoring population, individuals with severe focal structural defects in the lumbar spine were not included in the reproducibility assessment. The final reproducibility population consisted of 331 spine scan-pairs and 328 hip scan-pairs. These were acquired as part of the Manitoba Bone Density Program's ongoing clinical quality assurance program and are therefore representative of BMD test precision during the years of clinical monitoring. During the course of this study, 20 technologists were involved in the precision assessments. No individual outliers or temporal variation were identified in terms of technologist performance.
The DXA scans in our densitometry clinics are performed and analyzed in accordance with the manufacturer's recommendations. All equipment and technologist performance is subject to a rigorous quality assurance program developed from published models and monitored by a medical physicist.(10–12) All densitometers underwent daily assessment of stability using an anthropomorphic spine phantom and each showed excellent long-term phantom stability (CV < 0.5%). A pencil-beam instrument (Lunar DPX; GE Lunar, Madison, WI, USA) was the primary instrument used before 2000 and a fan-beam instrument (Lunar Prodigy) was used after that date.
Weight and height were recorded at the time of each DXA assessment. Before 2000, this was by self-report. Starting in 2000, patient weight and height were directly measured using a calibrated scale (nearest pound) and wall-mounted stadiometer (nearest 0.1 cm). Body mass index (BMI) was calculated as weight divided by height squared.
The RMS method was used to calculate the SD of the precision error (RMS-SD; g/cm2) for the entire reproducibility population. The 95% confidence least significant change (LSC) was defined for each RMS-SD precision error by multiplying by 2.77.(4,5) The ISCD recommends the use of absolute LSC (g/cm2) for assessing significance in the absolute change between two BMD measurements rather than relative precision error (%).(6,7) The RMS-SD and LSC were computed for subgroups of the reproducibility population defined on the basis of timing (same-day versus different-day scans), technologist (same-technologist versus different-technologists), scanner design (pencil- versus fan-beam), age (quartiles), weight (quartiles), height (quartiles), BMI (quartiles), bone area (quartiles), T-score (quartiles), and WHO category (normal T-score −1 or higher versus osteopenic T-score between −1 and −2.5 versus osteoporotic T-score −2.5 or lower). Precision errors for independent subgroups were compared using the F-ratio test. The ISCD recommends that precision error should be obtained from an assessment with 30 degrees of freedom to ensure that the upper limit for the 95% CI of the calculated precision value is no more than 34% greater than the calculated precision value.(7) All subgroup precision errors consisted of at least 79 degrees of freedom except for the number of individuals in the WHO osteoporotic range (N = 40 for the total spine, N = 9 for the total hip). As an additional test of factors that might affect BMD precision, Pearson rank correlation statistics were computed between the CVs for each reproducibility scan-pair and the previously listed candidate variables as dichotomous (same-day versus different-day timing, same-technologist versus different-technologists, pencil- versus fan-beam scanner), or continuous measures (age, weight, height, BMI, bone area, T-score).
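The precision calculation described above can be sketched in a few lines. This is a minimal illustration, not the Excel/Statistica workflow used in the study: it assumes two scans per subject, the function names (`rms_sd`, `least_significant_change`, `upper_ci_ratio`) and the example measurements are hypothetical, and the chi-square value is a standard table entry added here to show one way of arriving at the 34% figure for 30 degrees of freedom.

```python
import math

# 2.77 is the multiplier quoted in the text (approximately 1.96 * sqrt(2)).
LSC_FACTOR = 2.77

# Lower 2.5% point of the chi-square distribution with 30 degrees of freedom,
# taken from standard tables (an assumption added for this illustration).
CHI2_LOWER_2_5_DF30 = 16.791


def rms_sd(scan_pairs):
    """Return the RMS-SD (g/cm2) for duplicate measurements.

    For a pair (a, b) the within-subject variance is (a - b)**2 / 2.
    The RMS-SD is the square root of the mean of those variances.
    """
    if not scan_pairs:
        raise ValueError("at least one scan-pair is required")
    variances = [(a - b) ** 2 / 2.0 for a, b in scan_pairs]
    return math.sqrt(sum(variances) / len(variances))


def least_significant_change(precision_sd):
    """Return the 95% confidence LSC for a given RMS-SD."""
    return LSC_FACTOR * precision_sd


def upper_ci_ratio(df, chi2_lower):
    """Ratio of the upper 95% confidence limit to the estimated precision error,
    using the usual chi-square interval for a variance estimate."""
    return math.sqrt(df / chi2_lower)


# Hypothetical duplicate BMD measurements (g/cm2), not data from the study.
example_pairs = [(1.002, 0.990), (0.875, 0.881), (1.130, 1.121), (0.954, 0.960)]
example_sd = rms_sd(example_pairs)
example_lsc = least_significant_change(example_sd)
ratio_30_df = upper_ci_ratio(30, CHI2_LOWER_2_5_DF30)

print(f"RMS-SD: {example_sd:.4f} g/cm2")
print(f"LSC:    {example_lsc:.4f} g/cm2")
print(f"Upper 95% limit / estimate at 30 df: {ratio_30_df:.2f}")
```

Under these assumptions the last ratio comes out at about 1.34, which is consistent with the statement that 30 degrees of freedom keeps the upper confidence limit within 34% of the calculated precision value.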
To assess effects on categorization of change in the monitored population, we evaluated the two interrelated factors of timing (same-day versus different-day) and number of technologists (same-technologist versus different-technologists). Although the latter was not statistically significant in the analysis of precision errors, it was included in the assessment of categorization because it is a fundamental consideration in setting up a facility's precision assessment procedure. Although a true “gold standard” is not available, we designated the “worst case” situation of different-days and different-technologists as the reference subgroup. The LSC measurements referred to previously were applied to the clinical monitoring population to determine the proportion with change exceeding this cut-off.
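The categorization step can be illustrated as follows. This is a sketch under stated assumptions: the function name `fraction_with_change` and the list of BMD changes are hypothetical, and "exceeding" is read as an absolute change strictly greater than the LSC. The LSC values are the total spine figures reported in the Results for the four timing and technologist combinations.

```python
def fraction_with_change(bmd_changes, lsc):
    """Return the fraction of scan-pairs whose absolute BMD change exceeds the LSC."""
    if not bmd_changes:
        raise ValueError("at least one BMD change is required")
    exceeding = sum(1 for delta in bmd_changes if abs(delta) > lsc)
    return exceeding / len(bmd_changes)


# Hypothetical follow-up minus baseline changes (g/cm2), not data from the study.
changes = [0.010, -0.045, 0.060, -0.039, 0.020, -0.055, 0.000, 0.050]

# Total spine LSC values (g/cm2) reported in the Results section.
spine_lsc = {
    "different days, different technologists (reference)": 0.057,
    "different days, same technologist": 0.052,
    "same day, different technologists": 0.037,
    "same day, same technologist": 0.038,
}

reference = fraction_with_change(
    changes, spine_lsc["different days, different technologists (reference)"]
)

for label, lsc in spine_lsc.items():
    fraction = fraction_with_change(changes, lsc)
    # Absolute difference from the reference, in percentage points.
    excess = 100.0 * (fraction - reference)
    print(f"{label}: {100.0 * fraction:.1f}% categorized with change "
          f"({excess:+.1f} points versus reference)")
```

The same list of changes is classified differently depending only on which LSC is applied, which is the mechanism behind the overcategorization examined in this report.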
Calculations and statistical analyses were performed with Excel 2002 (Microsoft) and Statistica version 7.1 (Statsoft, Tulsa, OK, USA). p < 0.05 was used to determine statistical significance.
The characteristics of the clinical monitoring and reproducibility populations are summarized in Table 1. For the reproducibility population, there was no appreciable change from the first to second scan (mean difference total spine, −0.001 ± 0.023; total hip, 0.001 ± 0.013 g/cm2). The LSC for the total spine (0.047 g/cm2) was significantly greater than for the total hip (0.027 g/cm2, p < .001). For the clinical monitoring population, a slight mean increase in BMD was seen at the total spine (+0.002 ± 0.060 g/cm2) with a slight decrease at the total hip (−0.003 ± 0.045 g/cm2). Based on the LSC from the entire reproducibility population, a change in BMD exceeding the LSC was seen in 37.6% of the total spine measurements and 44.9% of the total hip measurements. The total hip site classified a higher proportion of the population with change than the total spine (p < .001).
Table 1. Characteristics of the Study Populations
Precision error was greater when the scan-pairs were acquired on different days than on the same day for both the total spine (p < 0.001) and total hip (p < 0.01; Table 2). In the subgroup that underwent the reproducibility assessment on different days, there was no correlation between the interval (number of days) and precision error (Spearman r: total spine, −0.04, p > 0.2; total hip, 0.02, p > 0.2). There was no significant difference in precision error when scan-pairs were acquired by a single technologist or two technologists. Scanner design (fan-beam versus pencil-beam), age, height, and bone area also did not affect spine or hip precision error. Weight, BMI, and T-score showed inconsistent differences in precision error between quartiles. There was a suggestion that higher T-scores might be associated with greater spine precision error (highest quartile compared with lowest quartile, p < 0.01), but this was not confirmed for the second highest spine quartile or for the hip. Furthermore, when measurements were categorized according to WHO category, no significant differences were identified. The Pearson rank correlations between individual subject precision errors and age, weight, height, BMI, bone area, and T-score were not significant (Table 3). Once again, a significant timing effect (different-day versus same-day scan-pair acquisition) was noted (p < 0.05).
Table 2. Precision Error Measurements (RMS-SD, g/cm2) for the Total Spine and Total Hip by Patient or Testing Characteristics
Table 3. Pearson Rank Correlations Between Patient or Testing Characteristics and the Individual Subject Precision Error Measurements (SD, g/cm2) for the Total Spine and Total Hip
Figure 1 shows how timing and the number of persons performing the precision assessment impacted on categorization of change in the clinical monitoring population. Precision assessments based on different days and different technologists resulted in the greatest LSC for the total spine (0.057 g/cm2) and conversely the lowest rate of categorized change (29.4%). When the same technologist performed the precision assessment on different days, there was a slightly lower LSC (0.052 g/cm2) and the number of individuals categorized with change increased by an absolute 4.1%. When the precision assessments were performed on the same day, there was a substantial decrease in LSC whether performed by different technologists (0.037 g/cm2) or the same technologist (0.038 g/cm2) and a considerable absolute increase in the number of individuals categorized with change (+19.3% and +17.3%, respectively, compared with the reference measurement). Generally similar results were seen with the total hip. Once again, the greatest LSC (0.031 g/cm2) was observed when different technologists were involved in a precision assessment performed on different days (change categorized in 39.1%). The greatest discrepancy was seen with precision assessments performed on the same day by the same technologist (LSC, 0.020 g/cm2) with an absolute increase in categorized change of 18.3% versus reference.
This analysis identifies that the single most important factor determining BMD precision error is whether the reproducibility assessment is performed on the same day or different days. The former systematically underestimates short-term variability. The net effect is a significant rate of overcategorization of change (up to 19.3% for the lumbar spine and up to 18.3% for the total hip). Patient-related factors were found to be clinically insignificant.
Our findings are consistent with a report from Fuleihan et al.(13) that same-day precision error was smaller than different-day precision error for the lumbar spine, but this report did not assess the total hip and did not directly assess the impact on patient categorization. Furthermore, timing analyses could have been affected by the use of relative precision error (% CV) rather than absolute precision error (g/cm2), especially with regard to the analysis of menopausal status, as age-related bone loss will result in a larger relative error even when the absolute error is constant.(14) For this reason, the use of absolute precision error is preferred.(6,7) Although it has been reported that long-term precision of BMD is only slightly worse than short-term precision,(15) even small differences in precision error can have disproportionately large effects on categorizing change in patients.(14,16) Obesity has been suggested to have an adverse effect on BMD precision errors.(15) This was not clearly evident in our study, although extremes in weight could have a larger effect, especially in the presence of an abdominal panniculus.(17) Vertebral body exclusion and presence of focal structural defects each decrease lumbar spine precision,(18) with a marked degradation in spine precision in the most severely affected group (CV up to 8.7%).(19) We could not evaluate these factors as individuals with severe focal structural defects in the lumbar spine were not included in the reproducibility assessment, although we did not see a significant effect of lumbar spine area. Two studies have reported older age to be associated with larger precision errors,(20,21) but age was not identified to affect BMD precision of the spine or hip in our study. The independence of precision error over a wide range of BMD values has been previously noted(22) and is consistent with our findings.
Our study has important implications for how BMD precision assessments should be conducted. Although having subjects return on a different day for repeat scanning is inconvenient, it is clear that same-day scans will systematically underestimate precision error. Although we did not specifically study the technical reasons for this, one can speculate that same-day repeat scans will not capture small day-to-day calibration shifts, differences in clothing, and abdominal contents. The ratio of different-day to same-day LSC was 1.49 for the total spine and 1.30 for the total hip. The greater timing effect on the spine would be consistent with greater tissue thickness and variation in abdominal contents related to gut peristalsis and meals (bowel “commotion”). It is unclear if these ratios can be used to adjust for temporal effects by predicting different-day precision error from same-day precision error, but if this approach could be validated, it would be convenient for patients and practitioners by allowing for the use of existing same-day precision results to predict long-term precision error and LSC. A corollary to this work is that precision assessments intended to evaluate technologist performance (equivalent to our same-day/same-technologist subgroup) may not be optimal for overall facility performance (analogous to our different-day/different-technologist subgroup). Simple pooling or averaging of technologist precision errors will still systematically underestimate precision error when different technologists are involved.
An additional consideration relates to establishing acceptable precision standards for bone densitometry. The ISCD position is that the minimum acceptable precision for an individual technologist is 1.9% for the lumbar spine and 1.8% for the total hip. These figures would be applicable for the ISCD procedure where scans are completed on the same day, but would not be applicable when scans are performed on different days. It would be counterproductive to penalize technologists and facilities that perform reproducibility assessments under “real world” conditions. Although achieving a low precision error is clearly desirable as an indicator of good quality assurance/quality control, a biased LSC could adversely affect patient care. Falsely identifying BMD change, especially a decrease, could be deleterious in patient care because it may lead to initiation of treatment that is unnecessary or to the incorrect conclusion that a treatment has failed. Although failing to detect a decrease in BMD that is just outside of the LSC limit is not inconsequential, the reality is that ongoing BMD loss will be detected with additional follow-up measurements. Therefore, the impact of the temporal bias identified in this study on management decisions is likely to be smaller than its effect on categorization of change because treatment will usually be unaffected in those patients in whom BMD is stable or increases, and unless there is a very high rate of BMD loss between two measurements many physicians will wait to see evidence of loss on serial measurements before changing treatment and/or instigating further studies.
Strengths of this study include the relatively large size of the reproducibility population and of the clinical monitoring population, but several limitations are acknowledged. The reproducibility studies were performed as part of an ongoing quality assurance program and spanned many years. Although there was no evidence that precision error shifted during this time, staff changes could affect our results. We also had too few men in the reproducibility assessment to assess the effect of sex. It is possible that measurement errors in men may be worse than in women because of a relatively high prevalence of anatomic abnormalities such as degenerative sclerosis. We did not evaluate the femoral neck, which is an important diagnostic site, because previous studies show that it has poor reproducibility compared with the total hip site.(23) Finally, a direct comparison of timing and technologist factors would have been ideal. This would have required at least five scans per subject (one baseline and four repeat scans to assess combinations of timing and technologist number). This design would have maximal statistical efficiency and would avoid unknown factors that could confound our analysis.
In summary, these findings suggest that the most widely used procedure for performing a BMD reproducibility assessment (same-technologist with simple repositioning on the same day) systematically underestimates precision error and will lead to overcategorization of change in a large fraction of monitored patients. Individuals involved in BMD quality assurance and in clinical reporting should be aware of this, and may wish to modify the procedure for precision assessment to more accurately reflect day-to-day variability.
This article has been reviewed and approved by the members of the Manitoba Bone Density Program Committee. The author and Committee thank Manitoba Health, the Winnipeg Regional Health Authority, and the Brandon Regional Health Authority for their vision, trust, and support in the establishment of this program.