The Study of Osteoporotic Fractures Research Group:
University of California, San Francisco (Coordinating Center): S.R. Cummings (principal investigator), D.M. Black (coinvestigator, study statistician), M.C. Nevitt (coinvestigator), D.G. Seeley (project director), H.K. Genant (director, central radiology laboratory), C. Arnaud, D. Bauer, W. Browner, L. Christianson, M. Dockrell, C. Fox, C. Glüer, S. Harvey, M. Jergas, Mario Jaime-Chavez, R. Lipschutz, G. Milani, L. Palermo, A. Pressman, R. San Valentin, K. Stone, H. Tabor, D. Tanaka, and C. Yeung.
University of Maryland: J.C. Scott (principal investigator), R. Sherwin (coprincipal investigator), M.C. Hochberg (coinvestigator), J. Lewis (project director), Cheryl Bailey (clinic coordinator), A. Bauer, L. Finazzo, G. Greenberg, D. Harris, B. Hohman, S. Kallenberger, E. Oliner, T. Page, A. Pettit, S. Snyder, L. Stranovsky, and S. Trusty.
University of Minnesota: K. Ensrud (principal investigator), R. Grimm Jr. (coinvestigator), C. Bell (project director), E. Mitson (clinic coordinator), M. Baumhover, C. Berger, S. Estill, S. Fillhouer, J. Hansen, K. Jacobson, K. Kiel, C. Linville, N. Nelson, E. Penland-Miller, and Jayne Griffith.
University of Pittsburgh: J.A. Cauley (principal investigator), L.H. Kuller (coprincipal investigator), M. Vogt (coinvestigator), L. Harper (project director), L. Buck (clinic coordinator), C. Bashada, A. Githens, A. McCune, D. Medve, M. Nasim, C. Newman, S. Rudovsky, N. Watson, and J. Carothers.
The Kaiser Permanente Center for Health Research, Portland, Oregon: T.M. Vogt (principal investigator), W.M. Vollmer, E. Orwoll, H. Nelson (coinvestigators), J. Blank (project director), S. Craddick (clinic coordinator), R. Bright, J. Wallace, F. Heinith, K. Moore, K. Redden, C. Romero, and C. Souvanlasy.
Vertebral deformities are common and important outcomes in clinical trials and epidemiologic studies of osteoporosis. While several different methods for defining new deformities have been proposed, it is not clear which is best. We used data from serial spine radiographs obtained an average of 3.7 years apart in 7238 women age ≥65 years from the Study of Osteoporotic Fractures to compare several approaches to defining new deformities by morphometry including a fixed percentage reduction in any vertebral height (FIXED%), a change in a summary spinal deformity index, a change in a vertebra from no prevalent deformity at baseline to a deformity at follow-up, as well as several variations of these methods. We compared results of each definition with several clinical correlates, including height loss, back pain, age, baseline bone mineral density, and the presence of a baseline deformity. We also estimated the sample size required for a clinical trial using various cut points. At a given level of incidence, all methods had similar relationships with each of the correlates. Given that similarity, the FIXED% method was simplest and needed no reference data. Using the FIXED% method, a 20–25% vertebral height reduction criterion for deformity maximized the power for a clinical trial. We conclude that all of the morphometric approaches to defining incident deformities have similar relationships to clinical correlates of vertebral deformity, but that use of a fixed percentage reduction in vertebral height is the simplest and most practical. For the FIXED% method, a 20–25% reduction in vertebral height minimizes the sample size required for clinical trials and epidemiologic studies.
Vertebral fractures are the most common manifestation of osteoporosis and have important clinical consequences, including back pain, loss of stature, and kyphosis.(1-3) The frequency and importance of vertebral fractures have led to their use as an endpoint in both epidemiological studies and clinical trials of osteoporosis (e.g., see Refs. (4-7)).
In the past, vertebral fractures have been assessed by a radiologist's subjective evaluation based either on spinal radiographs taken at a single time point (prevalent fractures) or on a series of radiographs taken over time (incident fractures). More recently, techniques have been developed that use measurements of vertebral heights (vertebral morphometry) to define vertebral deformities objectively. In parallel, more systematic methods for defining deformities using subjective assessments based on a radiologist's reading have been proposed.(8,9)
Several different methods have been proposed for defining prevalent deformities from a single set of radiographs.(10-13) While the method proposed by Melton and modified by Eastell(13) is perhaps the most commonly used, other techniques have come into common use as well.(10,11) A recent study compared these techniques based on their relationship with clinical consequences of osteoporosis.(14)
Fewer techniques have been proposed for defining incident deformities using vertebral dimensions from a series of radiographs taken at multiple time points. Currently, the predominant method compares corresponding heights from the first and second radiographs and defines a deformity as a fixed percentage decrease in any one height (FIXED%).(15-17) While 15% and 20% reductions in vertebral height have often been used to define deformity, it is unclear which is best. This percentage reduction method has been suggested as a standard in recent guidelines issued by the National Osteoporosis Foundation(18) and is part of proposed guidelines for clinical trials under consideration by the Food and Drug Administration. It has also been suggested that a minimum change in vertebral height (e.g., 4 mm) be incorporated into the definition,(18) but this addition has not been systematically studied.
A second general approach to defining incident deformities involves developing indices of vertebral area to summarize frequency and severity of deformities across the spine.(11,19) Changes in these indices can then be used as a continuous measure of vertebral osteoporosis. These changes can also be dichotomized to define a significant change in spinal osteoporosis.(11)
A third approach to defining incident deformities involves a change in the presence or number of prevalent deformities.(10) In this method, a new deformity is defined as a vertebra that is unfractured at the first time point but fractured at the second. Any definition of prevalent deformity could be used in this way.
Ultimately, any comparison of techniques for defining incident deformities is hampered by the lack of a gold standard. The accuracy and reproducibility of radiologists' assessments are controversial. In particular, there may be disagreements between qualitative readings of mild deformities.
To assess the relative validity of various definitions of incident vertebral deformities, we compared height loss and back pain (consequences of vertebral fractures) in patients with and without deformities, using various definitions at the same level of incidence. We also looked at the relationship between known strong predictors of incident deformity (age, bone mineral density [BMD], and presence of an existing deformity) and incident deformity defined by various methods. We propose that the definition consistently yielding the strongest relationship to height loss, back pain, and the predictors should be considered the one that best models true deformity occurrence.
We also compared a variety of cut points and examined the impact of cut point on the sample size required for a randomized trial of a hypothetical drug expected to reduce the rate of vertebral deformities. We propose that the definition yielding the smallest required sample size should be preferred.
MATERIALS AND METHODS
All subjects were participants in the Study of Osteoporotic Fractures.(20) During 1986–1987, 9704 white women aged 65 years and older were recruited from population-based lists in four cities in the United States and attended a baseline examination that included vertebral radiography. Black women were excluded due to their low incidence of fractures.
All participants had lateral radiographs of the thoracic and lumbar spine taken during the baseline examination. During 1990 and 1991, 7629 of the participants (83% of survivors, average of 3.7 years later) returned for a third examination which included a repeat vertebral radiograph. Of these, 7238 (95%) had both baseline and follow-up radiographs available of sufficient quality to be included in the analysis. Radiographs were taken in accord with current guidelines.(18) Thoracic spine radiographs focused on T7–T8 and were taken using a breathing technique. Lumbar spine radiographs focused on L2–L3. All films were taken at a 40-in focus-film distance, with the participant lying in the left lateral decubitus position.
We used a technician triage system designed to decrease the number of films requiring morphometric measurements. The triage is based on a semiquantitative system for grading prevalent vertebral deformities,(8,9) which classifies vertebrae by gross visual inspection into categories based on the approximate percentage reduction in vertebral height from normal.
In this triage system, using the follow-up radiograph, the participant is assigned to one of the following three categories based on a semiquantitative assessment of the lumbar and thoracic radiographs(9):
a. Normal (< 20% reduction in anterior, middle, and/or posterior height compared with same or adjacent vertebrae);
b. Uncertain (one or more uncertain fractures, anatomical problems, or poor quality film); and
c. At least one potential reduction (> 20% reduction in vertebral height).
The classification was performed by X-ray technicians trained and certified by the study radiologist in semiquantitative evaluation and in vertebral morphometry. Participants falling into category a (no vertebral deformities, 56%) were classified as not fractured and were not assessed by vertebral morphometry. Participants classified into category c (at least one mild vertebral deformity, 41%) had baseline and follow-up films assessed by vertebral morphometry, as described below. For participants classified into category b (uncertain, 3%), baseline and follow-up films were first reviewed by the study radiologist, and any with exposure, positioning, or anatomical problems that made accurate assessment of prevalent or incident deformities unlikely were excluded from the study. The radiologist classified the remaining uncertain films by the presence or absence of >20% reduction in height; those with >20% reduction were assessed by vertebral morphometry.
The triage technique has been validated previously in a random sample of 503 participants from the Study of Osteoporotic Fractures.(8,14)
Assessment of vertebral dimensions
Vertebral morphometry was performed using a translucent digitizer and cursor, and the coordinates were used to calculate three heights for each vertebral body: anterior (Ha), middle (Hm), and posterior (Hp).(14) A total of 13 vertebral levels (L4–T4) were examined on each film.
Morphometric definitions of incident vertebral deformity
We compared seven methods for defining an incident vertebral deformity. These methods are described in more detail below and are summarized in Table 1. For this discussion, we denote the baseline height as Hbl and the follow-up height as Hfu.
Table 1. Methods Used for Defining Incident Vertebral Deformities
Method 1a uses the percentage change in anterior, posterior, and midvertebral heights between baseline and follow-up. The percentage change in height
\[ \%\Delta H = 100 \times \frac{H_{fu} - H_{bl}}{H_{bl}} \]
is calculated for each of the three heights on each vertebra. A vertebra in which any of the three heights has decreased by more than a specified percentage criterion is defined as having an incident deformity. We compared a number of criteria (cut points) between 5% and 30% height loss.
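As an illustration of the FIXED% rule, the following minimal Python sketch applies a percentage criterion to the three heights of one vertebra. The function names, the height values, and the 20% default are hypothetical choices made for the example; they are not taken from the study's software.

```python
def percent_change(h_bl, h_fu):
    """Percentage change in one vertebral height from baseline to follow-up."""
    return 100.0 * (h_fu - h_bl) / h_bl


def incident_deformity_fixed_pct(baseline, followup, cut_pct=20.0):
    """True if any height (Ha, Hm, Hp) decreased by more than cut_pct percent."""
    return any(
        percent_change(b, f) < -cut_pct
        for b, f in zip(baseline, followup)
    )


# Hypothetical heights in cm for one vertebra, ordered (Ha, Hm, Hp).
baseline = (2.50, 2.40, 2.60)
followup = (1.90, 2.35, 2.58)  # anterior height fell by about 24%

print(incident_deformity_fixed_pct(baseline, followup, cut_pct=20.0))
print(incident_deformity_fixed_pct(baseline, followup, cut_pct=30.0))
```

A woman would then be counted as having an incident deformity if the rule is met for any of her 13 measured vertebrae.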
Method 1b is a variation of method 1a but adjusts the percentage change by the mean change (in unfractured vertebrae) for each height at each vertebra. The change is calculated by:
\[ \%\Delta H_{adj} = 100 \times \frac{(H_{fu} - H_{bl}) - \mathrm{refmean}(H_{fu} - H_{bl})}{H_{bl}} \]
where refmean(Hfu − Hbl) is the reference mean change in vertebral height for the particular height and vertebra. The goal is to adjust for systematic changes (e.g., small changes in tube-to-film distance) between baseline and follow-up radiographs. We calculated the reference mean changes for each of the 3 heights for each of the 13 vertebrae (39 means total). Deformed vertebrae were deleted from this calculation using a trimming method originally developed for deriving reference values for prevalent deformities.(21)
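The adjustment can be sketched as follows. The exact form of the formula is an assumption here: the sketch subtracts the reference mean change from the observed change before dividing by the baseline height. The function name and the example heights are hypothetical; the reference mean of 0.023 cm is the average of the 39 reference means reported in this paper.

```python
def adjusted_percent_change(h_bl_cm, h_fu_cm, ref_mean_change_cm):
    """Percentage change after removing the reference mean change (assumed form)."""
    return 100.0 * ((h_fu_cm - h_bl_cm) - ref_mean_change_cm) / h_bl_cm


# Hypothetical single height measured at baseline and follow-up (cm).
h_bl, h_fu = 2.50, 2.01
unadjusted = 100.0 * (h_fu - h_bl) / h_bl
adjusted = adjusted_percent_change(h_bl, h_fu, ref_mean_change_cm=0.023)

print(round(unadjusted, 2), round(adjusted, 2))
```

Because the reference mean change is a small increase, the adjusted decrease is slightly larger than the unadjusted one, so a height close to the cut point can change classification.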
Method 1c is the same as method 1a but, in addition to a percentage reduction, requires that the change be at least 3 mm for a significant vertebral deformity. We chose 3 mm because it is approximately equal to a 15% change in an average unfractured vertebra and is about three times the SD of the change in vertebral heights (see method 2 below). We also repeated the analysis using a 4 mm minimum change.
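A minimal sketch of the combined criterion is given below, assuming heights in centimeters so that 3 mm is 0.3 cm. The function name, the 15% default, and the example heights are hypothetical.

```python
def incident_deformity_1c(h_bl_cm, h_fu_cm, cut_pct=15.0, min_change_cm=0.3):
    """True only if the height fell by more than cut_pct AND by at least min_change_cm."""
    drop_cm = h_bl_cm - h_fu_cm
    drop_pct = 100.0 * drop_cm / h_bl_cm
    return drop_pct > cut_pct and drop_cm >= min_change_cm


# A small vertebra: about a 15.6% drop, but only about 2.8 mm in absolute terms.
print(incident_deformity_1c(1.80, 1.52))
# A larger vertebra: about a 15.4% drop and about 4 mm in absolute terms.
print(incident_deformity_1c(2.60, 2.20))
```

The absolute requirement mainly affects small vertebrae, for which a given percentage reduction corresponds to a change of only a few millimeters.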
Method 2 applies to incident deformities an approach analogous to the one that Melton and colleagues(12,13) developed for defining prevalent deformity, which is based on the means and SDs of vertebral height ratios; there have been no previous attempts to use an analogous method to define incident deformities. A number of studies(12,14) have shown that use of the means and SDs to define prevalent deformities is superior to use of FIXED%. For each of the 3 heights at each of the 13 levels, we calculated the reference mean (as described above) and reference SD of the change in height for normal women (i.e., women without incident deformities) using methods previously published.(21) Using these reference values, we defined vertebral deformity cut points based on SDs below the reference mean change for each height and each vertebra.
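The SD-based criterion can be sketched as a standardized change score. The function names, the 3-SD default, and the example heights are hypothetical; the reference mean (0.023 cm) and reference SD (0.10 cm) are illustrative values within the ranges reported in this paper.

```python
def sd_units_from_reference(h_bl_cm, h_fu_cm, ref_mean_cm, ref_sd_cm):
    """Change in height expressed in reference SDs relative to the reference mean change."""
    return ((h_fu_cm - h_bl_cm) - ref_mean_cm) / ref_sd_cm


def incident_deformity_sd(h_bl_cm, h_fu_cm, ref_mean_cm, ref_sd_cm, cut_sd=3.0):
    """True if the change lies more than cut_sd SDs below the reference mean change."""
    return sd_units_from_reference(h_bl_cm, h_fu_cm, ref_mean_cm, ref_sd_cm) < -cut_sd


z = sd_units_from_reference(2.50, 2.15, ref_mean_cm=0.023, ref_sd_cm=0.10)
print(round(z, 2))
print(incident_deformity_sd(2.50, 2.15, 0.023, 0.10))
print(incident_deformity_sd(2.50, 2.30, 0.023, 0.10))
```

In practice a separate reference mean and SD would be supplied for each of the 39 height and level combinations.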
Developed by Minne and colleagues,(11,22) method 3 is based on the ratio of each anterior, posterior, and midvertebral height to the corresponding heights of T4. Each “T4-corrected” height is then compared with a normal range for that value for each vertebral body. We calculated our own normal ranges. Any height that is below the third percentile of a normal range is defined as abnormal; the score for that height is the deviation of the height from the third percentile value. The sum of the three scores for each vertebra is the vertebral deformity index (VDI) for that vertebra, and the sum of the VDIs across 13 vertebrae is the spinal deformity index (SDI). The SDI was calculated separately for baseline and follow-up radiographs. We defined a woman as having a deformity if the difference between her baseline and follow-up SDI values was over a specified cut point. Cut points were chosen to yield incidence rates comparable to other methods.
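The index calculation can be sketched as below. This is a simplified reading of the description above, not the published algorithm: each score is taken as the shortfall of the T4-corrected ratio below its third percentile, and all names, heights, and percentile values are hypothetical. Only one vertebral level is shown for brevity.

```python
def vertebral_deformity_index(heights, t4_heights, p3_ratios):
    """Sum of shortfalls of the T4-corrected heights below their third percentiles."""
    vdi = 0.0
    for h, t4, p3 in zip(heights, t4_heights, p3_ratios):
        ratio = h / t4                 # T4-corrected height
        vdi += max(0.0, p3 - ratio)    # scores only heights below the normal range
    return vdi


def spinal_deformity_index(spine, t4_heights, p3_table):
    """Sum of the VDIs over all vertebral levels present in `spine`."""
    return sum(
        vertebral_deformity_index(spine[level], t4_heights, p3_table[level])
        for level in spine
    )


t4 = (2.0, 2.0, 2.0)                   # hypothetical T4 heights (Ha, Hm, Hp), cm
p3_table = {"L1": (1.10, 1.10, 1.15)}  # hypothetical third-percentile ratios
spine_bl = {"L1": (2.40, 2.40, 2.50)}
spine_fu = {"L1": (2.00, 2.30, 2.50)}  # anterior ratio drops to 1.00

sdi_change = (spinal_deformity_index(spine_fu, t4, p3_table)
              - spinal_deformity_index(spine_bl, t4, p3_table))
print(round(sdi_change, 3))
```

An incident deformity would then be declared when `sdi_change` exceeds the chosen cut point.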
Methods 4a and 4b
These methods are point-prevalence definitions of deformity; a new deformity is defined by a vertebra that is fractured at follow-up but not fractured at baseline.(10,21) Thus, these two methods compare cross-sectional definitions of prevalent deformities and do not directly utilize longitudinal changes in vertebral height, as do methods 1–3.
Developed by McCloskey and colleagues,(10) method 4a defines the expected posterior height based on posterior heights of adjacent vertebrae and population normal ratios. A vertebral body is considered deformed by criteria involving ratios of observed and expected heights. Vertebrae are defined as deformed if they show a 3-SD decrease in these values, compared with the mean at that level among normal women. At the suggestion of the authors of the original paper, we removed the criterion for the posterior wedge mentioned in the paper (McCloskey E, personal communication). A woman was defined as having an incident deformity if one or more vertebrae undeformed at baseline reach the criterion for deformity at follow-up (point prevalence). Because of the nature of the point prevalence methods, only the single 3-SD prevalence cut point was used.
Method 4b is similar to method 4a in defining an incident deformity if one or more vertebrae that are undeformed at baseline reach the criterion for deformity at follow-up. However, rather than using the McCloskey definition of prevalence, we have used the prevalence definition originally proposed by Melton and Eastell(12,13) with a modification for changes in anterior height.(14)
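Whatever prevalence definition is used, the point-prevalence logic of methods 4a and 4b reduces to comparing two sets of per-vertebra flags. The sketch below assumes the flags have already been produced by a prevalence definition; the function name and example flags are hypothetical.

```python
def incident_by_point_prevalence(deformed_bl, deformed_fu):
    """True if any vertebra is deformed at follow-up but was not deformed at baseline."""
    return any(fu and not bl for bl, fu in zip(deformed_bl, deformed_fu))


# Hypothetical flags for three vertebral levels of one woman.
print(incident_by_point_prevalence([False, True, False], [False, True, True]))
print(incident_by_point_prevalence([False, True, False], [False, True, False]))
```

Under this rule a vertebra already deformed at baseline cannot contribute an incident deformity, even if it worsens.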
We also examined other methods for defining vertebral deformities, including the change in the Eastell/Melton height ratios, in absolute, SD, and percentage terms. However, since these other methods showed no stronger relationship to clinical criteria than the other methods compared in this analysis, we have not included their results in this paper.
Cut points for morphometry definitions
We examined methods 1–3 across a range of cut points. To standardize the comparison of the methods, we chose cut points that yielded incidences of 2, 5, 14, and 20% over the follow-up period. (These cut points are shown in Table 2.) We focused on two cut points: 5% (low incidence) and 14% (high incidence). The 5% incidence level was chosen because it was close to the incidence observed with method 1a using a criterion of a 20% decrease in vertebral height (the most commonly used decrease criterion); the 14% incidence value was chosen since it approximately corresponds to the incidence generated by the point prevalence methods. Note that for methods 4a and 4b, the incidence rate cannot be directly altered by changing the cut point. Therefore, multiple cut points are analyzed only for methods 1, 2, and 3.
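One way to find the cut point that yields a target incidence is to rank women by their largest percentage height decrease and read off the value at the required rank. The sketch below assumes one summary value per woman and ignores ties; the function names and data are hypothetical.

```python
def cut_point_for_incidence(worst_drop_pct, target_incidence):
    """Cut point such that about target_incidence of women have a drop exceeding it."""
    ranked = sorted(worst_drop_pct, reverse=True)
    k = round(target_incidence * len(ranked))
    k = min(max(k, 0), len(ranked) - 1)
    return ranked[k]


def incidence_at(worst_drop_pct, cut_pct):
    """Proportion of women whose largest drop exceeds the cut point."""
    return sum(d > cut_pct for d in worst_drop_pct) / len(worst_drop_pct)


# Hypothetical largest percentage decrease in any height, one value per woman.
drops = [3, 5, 8, 12, 25, 7, 2, 9, 30, 4]
cut = cut_point_for_incidence(drops, target_incidence=0.20)
print(cut, incidence_at(drops, cut))
```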
Table 2. Cut Points Used for Each Method to Achieve Goal Incidence Level
Reference values for morphometry definitions
Methods 1b and 2 require reference values for longitudinal changes in vertebral height, and methods 3, 4a, and 4b require reference values from cross-sectional data. For methods 1b and 2, we adapted methods previously published for cross-sectional data to derive normal values from data containing a mix of fractured and unfractured values.(21) These reference values for changes in vertebral dimensions were derived from all radiographs (n = 3203) in which changes in vertebral dimensions were measured. Using these methods, we found a mean increase in the reference vertebral height from baseline to follow-up (range of 39 means is 0.0098–0.037 cm, average of 39 means is 0.023 cm). This mean increase was small, < 1% of the average vertebral height. The reference SDs ranged from 0.0879 to 0.141 cm.
While published cross-sectional reference values for vertebral height ratios are available for methods 3, 4a, and 4b, we derived our own, because recent work(23,24) has shown that published values cannot reliably be applied to external data sets. To calculate these, we used the 503 radiographs digitized to validate the triage system. For methods 3 and 4a, we derived reference values from all baseline radiographs among the 503 that the radiologist had judged as containing no deformities (semiquantitative grade indicating < 20% reduction, n = 301). For method 3, the lower cut point was then defined as the third percentile among these 301 radiographs.(22) For defining deformities using method 4a, we calculated the observed means and SDs as recommended by the authors.(10) For method 4b, we derived reference values using the entire random sample of 503 by applying a trimming method.(21)
Baseline prevalent deformity
A baseline prevalent deformity was defined using the criteria originally proposed by Melton and Eastell(12,13) with a modification for changes in the anterior height.(14) This is the same definition of prevalence used in method 4b.
During the second examination (an average of 2.2 years after the baseline exam), measurements of bone density of the lumbar spine were made using dual-energy X-ray absorptiometry (QDR 1000; Hologic Inc., Waltham, MA, U.S.A.). Details of these measurement methods and quality control procedures have been published elsewhere.(25) Mean coefficients of variation between clinical centers were ∼1.5% when measuring two young staff members who visited all centers and 0.8% for a 1.0 g/cm² phantom.(25)
Participants were seen at the baseline examination, at a second examination ∼2 years after baseline, and at a third examination an average of 3.7 years after baseline. Height was measured at all examinations using Harpenden stadiometers(26) (Holtain Ltd., Crosswell, Wales, U.K.). Height loss was defined as the change from the baseline to the third examination.
Back pain was assessed approximately annually by self-administered questionnaire, using methods previously described.(2) Presence of severe back pain was defined as a report of severe back pain at any one of the two or three follow-up contacts.
A woman was classified as having a new vertebral deformity if any one of her vertebrae met the criteria of the method for defining incident deformity. Vertebrae deformed at baseline were included in all analyses; thus a new deformity may have been a worsening of an existing deformity (except for the point prevalence methods). To assess agreement between the various methods, we calculated the κ (kappa) statistic(27) comparing method 1a to the other methods at the low- and high-incidence cut points.
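For two methods that each classify every woman as deformed or not, the κ statistic can be computed as in the sketch below, which uses the standard two-rater (Cohen) form. The function name and the example classifications are hypothetical.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary classifications of the same women."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n
    p_b = sum(b) / n
    # Agreement expected by chance from the marginal proportions.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical deformity flags for ten women under two definitions.
method_x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
method_y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(cohen_kappa(method_x, method_y), 3))
```

Matching the incidence of the two methods before computing κ, as was done here with the low- and high-incidence cut points, keeps the marginal proportions comparable.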
We used a total of five variables for validating the definitions. Two of the validation variables, height loss and presence of severe back pain, were analyzed as consequences or outcomes of vertebral deformity. The three other validation variables, age, BMD, and presence of baseline vertebral deformity, were analyzed as predictors of new deformity.
Validation by outcome variable
For each method of defining deformity, and for each outcome variable, we calculated two proportions: PD, the proportion of women with a vertebral deformity who experienced the outcome variable; and PN, the proportion of women without a vertebral deformity who experienced the outcome variable.
For example, we defined a loss of at least 2 cm in height between baseline and follow-up visit as “height loss,” and thus PD was the proportion of women with height loss among those defined as having experienced a vertebral deformity. Similarly, PN was the proportion of women with height loss among those defined as not having a vertebral deformity. We then calculated the difference between the two proportions:
\[ D = P_D - P_N \]
for each outcome variable and each method of defining deformity and compared these differences. We reasoned that the best method for defining deformity would be the one that resulted in the largest D holding incidence rate constant.
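The difference in proportions and an approximate 95% confidence interval can be sketched as follows. The interval uses the usual normal (Wald) approximation, which is an assumption about the method of the cited reference; the function name and the counts are hypothetical and chosen only to give proportions of a plausible size.

```python
import math


def risk_difference(events_d, n_d, events_n, n_n, z=1.96):
    """D = P_D - P_N with a normal-approximation confidence interval."""
    p_d = events_d / n_d
    p_n = events_n / n_n
    d = p_d - p_n
    se = math.sqrt(p_d * (1 - p_d) / n_d + p_n * (1 - p_n) / n_n)
    return d, d - z * se, d + z * se


# Hypothetical counts: women with height loss among those with and without
# an incident deformity.
d, lower, upper = risk_difference(events_d=180, n_d=360, events_n=530, n_n=6880)
print(round(d, 3), round(lower, 3), round(upper, 3))
```

The same calculation applies to the predictor analyses if the two groups are taken to be the high-risk and low-risk categories and the event is an incident deformity.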
Validation by predictor variable
When the validation variable was a predictor of deformity (age, BMD, or presence of a baseline vertebral deformity), we compared the proportions of women experiencing an incident vertebral deformity among those with the predictor to those without the predictor. The proportions calculated in this case were: PL, the proportion of women in the low-risk category of the predictor variable who experienced an incident deformity, and PH, the proportion of women in the high-risk category who experienced an incident deformity.
Age, for example, was dichotomized into women age 75 years or younger (low risk) and those over age 75 (high risk). PH was the proportion of women defined as having an incident deformity among those over age 75, and PL was the proportion of women defined as having an incident deformity among those age 75 years and younger. For BMD, we compared those in the lowest quartile (high risk) with those in the highest three quartiles combined (low risk), and for baseline deformity, we compared those with prevalent deformities (high risk) to those without prevalent deformities (low risk).
For these predictor variables,
\[ D = P_H - P_L \]
and again we reasoned that the method of defining deformity that yielded the largest D was the best method.
In both sets of analyses, we plotted D for each method across a range of cut points. Each cut point results in a particular measured incidence of women with deformity, and these incidences were used as the common scale for the cut points. For example, for method 2, D was calculated at the cut points yielding 2, 5, 10, 14, and 20% incidence of deformity, and the result is plotted as a solid line in Fig. 1. For methods 4a and 4b, we included the difference in proportion with the outcome (D) at the single incidence point.
Methods 1b and 1c were not included in the figures since the results were indistinguishable from those for method 1a. Again we reasoned that if one curve was consistently higher than the other, then the method that yielded that curve was the best method of defining deformity.
More detailed results are given in the tables for each method at the low incidence (5%) and high incidence (14%) cut points. Method 1c could not be included in the 14% analysis since the incidence never reached 14%, regardless of the cut point chosen. Similarly, since methods 4a and 4b have a single incidence, they were only compared in the high-incidence categories. The 95% confidence intervals were calculated for D.(27)
For each of the five validation variables, we used a Bootstrap analysis(28,29) (1000 repetitions) to assess the statistical significance of differences between the values of D derived from each of the methods at the low-incidence (5%) as well as the high-incidence (14%) cut points. We compared methods 1b–4b to method 1a at these two cut points.
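A bootstrap comparison of D between two definitions can be sketched as below. This is only one plausible implementation: it resamples women with replacement and reports a percentile interval for the difference in D, whereas the details of the procedure actually used are not given here. All names and data are hypothetical.

```python
import random


def risk_diff(flags, outcomes):
    """D = proportion with outcome among flagged minus proportion among unflagged."""
    n_d = sum(flags)
    n_n = len(flags) - n_d
    if n_d == 0 or n_n == 0:
        return None  # D is undefined if either group is empty
    p_d = sum(o for f, o in zip(flags, outcomes) if f) / n_d
    p_n = sum(o for f, o in zip(flags, outcomes) if not f) / n_n
    return p_d - p_n


def bootstrap_d_difference(method_a, method_b, outcomes, reps=1000, seed=0):
    """Percentile 95% interval for D(method_a) - D(method_b), resampling women."""
    rng = random.Random(seed)
    n = len(outcomes)
    diffs = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        out = [outcomes[i] for i in idx]
        d_a = risk_diff([method_a[i] for i in idx], out)
        d_b = risk_diff([method_b[i] for i in idx], out)
        if d_a is not None and d_b is not None:
            diffs.append(d_a - d_b)
    diffs.sort()
    return diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs)) - 1]


# Hypothetical flags for 200 women under two definitions, plus an outcome.
flags_a = [i % 10 == 0 for i in range(200)]
flags_b = [i % 10 == 0 or i % 50 == 1 for i in range(200)]
outcome = [i % 4 == 0 for i in range(200)]
print(bootstrap_d_difference(flags_a, flags_b, outcome, reps=200))
```

Resampling whole women keeps each woman's classifications under both methods together, so the correlation between the two definitions is preserved in every replicate.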
Deformity definition in a hypothetical clinical trial
To determine the optimal cut point, we examined the impact of the cut point using method 1a on the sample size required for a hypothetical clinical trial comparing vertebral deformity incidence between groups of women on active medication versus placebo. For a range of cut points using method 1a, we calculated the number of women with a vertebral height decrease (vertebral deformity) at each cut point and took this to be the incidence in the placebo group. We then calculated the number of women with height increases at each cut point (e.g., a vertebral height increase from baseline to follow-up ≥ 20%), which was assumed to be a measure of random error. We estimated the true incidence as the difference between the number of women with height decreases and increases.
To estimate the incidence that would be observed in the active treatment group, we assumed that the treatment would affect only the true incident deformities, that the random error would be unaffected by treatment, and that treatment would reduce the true incidence of deformity by 40%. The observed incidence in the active group was then the sum of the random error plus the true rate decreased by 40%.
For example, if a particular cut point yielded an observed incidence of 10% and an observed proportion with height increases of 6%, we would calculate the true incidence as 4%. For the sample size calculation from this example, we would assume that the incidence in the placebo group would be 10% (4% true incidence plus 6% random error) and in the active treatment group would be 2.4% (40% reduction from the 4% true incidence) plus the 6% random error, or 8.4%. Using a significance level of 0.05 (two-sided) and a power of 0.90, we calculated the sample size required using standard statistical methods for comparing dichotomous variables in two groups.(27) This analysis was performed in the entire Study of Osteoporotic Fractures sample and also repeated in the subgroup with prevalent vertebral deformities at baseline.
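The worked example can be reproduced with the sketch below. The sample size formula shown is a common textbook formula for comparing two proportions without continuity correction; whether it matches the cited method, and whether the reported sample sizes are per group or total, are assumptions left open here. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def observed_rates(placebo_incidence, increase_rate, efficacy=0.40):
    """Observed incidence in the placebo and active groups under the stated assumptions."""
    true_incidence = placebo_incidence - increase_rate   # height increases taken as random error
    active = increase_rate + true_incidence * (1 - efficacy)
    return placebo_incidence, active


def n_per_group(p1, p2, alpha=0.05, power=0.90):
    """Sample size per group for a two-sided comparison of two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


p_placebo, p_active = observed_rates(0.10, 0.06)
print(round(p_active, 3))
print(n_per_group(p_placebo, p_active))
```

The sketch makes the trade-off explicit: a lower cut point raises the observed incidence but also the random-error component, which dilutes the apparent treatment effect and increases the required sample size.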
A description of the 7238 women who attended both the baseline and follow-up visits is given in Table 3. Their mean age was 71.1 years and 1416 (20%) had prevalent vertebral deformities at baseline, based on a 3-SD definition.(12-14)
Table 3. Characteristics of SOF Cohort with Baseline and Follow-up Radiographs (N = 7238)
The cut point for defining a deformity strongly influenced the incidence (Table 2). For example, with method 1a, a cut point of 39.5% height decrease was required to achieve an incidence of about 2%, while a cut point of 6.8% yielded an incidence of about 20%. The incidences for methods 1b, 2, and 3 were similarly influenced by the cut point chosen. The two point-prevalence methods (4a and 4b) yielded similar incidences: 14.0% for method 4a and 13.6% for method 4b.
At the low-incidence cut point, there was a very high level of agreement among methods 1a, 1b, and 1c (κ = 0.99; Table 4). Similarly, method 2 closely agreed with method 1a (κ = 0.94). Method 3 had a much lower level of agreement with method 1a (κ = 0.75). Results were similar for the high incidence cut points. However, methods 4a and 4b were only weakly related to method 1a (κ = 0.56). Agreement between methods 4a and 4b was only moderate (κ = 0.76; data not shown).
Table 4. Agreement Between Definitions of Deformity for Low- and High-Incidence Cut Points
Approximately 56% of women were triaged as normal (category a) and therefore did not have radiographs assessed by morphometry. We tested whether any deformities were missed by the triage by examining the data from the 503 women for whom morphometric assessments were made on all radiographs. No fractures were missed for methods 1a–1c with cut points of 15% or higher or for the point prevalence methods. A few mild deformities were missed at the highest incidence cut points for the other methods (e.g., with method 1a using a cut point of 8.9% height decrease [14% incidence] about 10% of deformities found by morphometry had been classified as normal by triage).
Validation based on clinical consequences of deformities
Those with new deformities were much more likely to have at least a 2-cm height loss than those without new deformities, regardless of the method or cut point used (Table 5 and Fig. 1). The relationship was much stronger at the lower incidence cut points. For example, using method 1a at the low-incidence cut point, 49.9% of those with a new deformity lost at least 2 cm in height compared with 26.5% of those at the high incidence cut point (Table 5).
Table 5. Relationship of Incident Vertebral Deformity to Height Loss >2 cm
At both the low- and high-incidence cut points, all methods showed similar differences in the proportions of women with height loss except method 3, which had a significantly lower difference than method 1a (35.8% vs. 42.2%, p = 0.004 for the low-incidence cut point; 16.5% vs. 19.3%, p = 0.018 for the high-incidence cut point). Across a wider range of cut points (Fig. 1), method 3 produced a consistently weaker relationship between deformity and height loss than the other methods. Analyses examining mean height loss yielded similar findings.
There was a greater incidence of severe back pain in those with new deformities than in those without deformities for all methods and all cut points (Table 6 and Fig. 2). As with height loss, the relationships were stronger at low-incidence cut points than at higher-incidence cut points.
Table 6. Relationship of Incident Vertebral Deformity to Occurrence of Severe Back Pain
At the lower-incidence cut point, the methods were virtually identical in their relationship to severe back pain. However, at the higher-incidence cut points, the relationships were weakest for methods 3, 4a, and 4b.
Validation based on predictors of deformity
The incidence of new deformities in those older than 75 years was approximately twice that in those under 75 years (Table 7 and Fig. 3) regardless of the definition used. The relationships were generally of similar magnitude for the different methods at both the low- and high-incidence cut points.
Table 7. Relationship of Incident Vertebral Deformity to Age
The presence of a baseline vertebral deformity was a very strong predictor of a new deformity; those with a baseline deformity had approximately a three to five times greater risk of a new deformity than those without (Table 8 and Fig. 4). The strength of the relationship of baseline deformity to new deformity was similar for all definition methods except method 3, which was significantly stronger at the low-incidence cut point (p = 0.002), and consistent across the range of cut points (Fig. 4).
Table 8. Relationship of Incident Vertebral Deformity to Baseline Vertebral Deformity
Spinal BMD was also strongly predictive of new deformities (Table 9 and Fig. 5). Those with spinal BMD in the lowest quartile were approximately two to three times more likely to develop new deformities than those in the highest three quartiles. Method 3 had a slightly weaker relationship to BMD than the other methods at the lower incidence cut point (risk difference 5.2% for method 3 compared with 6.0–6.2% for other methods), but this difference was not statistically significant. At the higher incidence cut points, the difference between method 3 and the other methods was more pronounced and was statistically significant (p = 0.006). The discrepancy between method 3 and the other methods continued to increase as incidence increased (Fig. 5).
Table 9. Relationship of Incident Vertebral Deformities to Spine BMD Quartiles
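The two summary measures used in these comparisons, the risk difference and the relative risk, can be sketched as follows. This is a minimal illustration only: the function name and the counts in the example are hypothetical and are not taken from the study tables.

```python
def risk_comparison(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Return (risk difference, relative risk) for an exposed vs. unexposed group."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    return risk_exposed - risk_unexposed, risk_exposed / risk_unexposed


if __name__ == "__main__":
    # Hypothetical counts: lowest BMD quartile vs. the other three quartiles.
    difference, ratio = risk_comparison(90, 1000, 90, 3000)
    print(f"risk difference = {difference:.1%}, relative risk = {ratio:.1f}")
```

With these invented counts, 9% of the lowest quartile and 3% of the remaining women develop a new deformity, which corresponds to a risk difference of about 6 percentage points and a threefold relative risk.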
Comparison of required sample size when different definitions of deformity are used
Figures 6A and 6B show the sample size required for a hypothetical clinical trial with a vertebral deformity endpoint (using method 1a) in which the treatment reduces the risk of new deformities by 40%. For all women in the Study of Osteoporotic Fractures population (Fig. 6A), the relationship between cut point and sample size was U-shaped, with higher sample size for the lower (e.g., 15%) cut points as well as the higher (e.g., 25%) cut points. The minimum required sample size was 5386 when a 24% cut point was used, which corresponds to about 5% incidence. Sample size was only slightly larger for cut points in the range of 20% (n = 5977) to 26% (n = 5702). However, sample size requirements were much higher for a 15% cut point (n = 8268), the other commonly used criterion for incident morphometric deformity. For women with vertebral deformities at baseline, there was also a decreasing sample size requirement as the cut point increased, up to about 22%, after which the sample size remained relatively constant. The required sample size was 12,550 for a 15% cut point, 2691 for a 20% cut point, and 2232 for a 25% cut point. The sample size was minimized at about 24% (n = 2198). A similar pattern held for the other methods: sample sizes tended to be minimized at cut points yielding about 5% incidence (data not shown).
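To make the dependence of sample size on incidence concrete, the following is a minimal sketch based on the standard normal-approximation formula for comparing two proportions. It is not the calculation used to produce Fig. 6. The significance level, the power, the function name, and the `false_positive_rate` parameter (a crude stand-in for apparent deformities produced by measurement error, assumed here to be unaffected by treatment) are all illustrative assumptions, so the numbers it prints will not match those reported above.

```python
import math
from statistics import NormalDist


def total_sample_size(true_incidence, false_positive_rate=0.0,
                      risk_reduction=0.40, alpha=0.05, power=0.90):
    """Total N for two equal arms, normal approximation for two proportions.

    All defaults except the 40% risk reduction are assumptions for illustration.
    """
    # Observed control-arm incidence: true deformities plus false positives.
    p_control = true_incidence + false_positive_rate
    # Assumption: treatment prevents only true deformities.
    p_treated = true_incidence * (1.0 - risk_reduction) + false_positive_rate
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_treated * (1.0 - p_treated)
    n_per_arm = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2
    return 2 * math.ceil(n_per_arm)


if __name__ == "__main__":
    for incidence in (0.03, 0.05, 0.10):
        print(f"true incidence {incidence:.0%}: N = {total_sample_size(incidence)}")
    # Same true incidence, with 2% hypothetical false positives added.
    print("with false positives:", total_sample_size(0.05, false_positive_rate=0.02))
```

Under these assumptions, a higher true incidence lowers the required sample size, whereas adding treatment-insensitive false positives raises it even though the observed incidence goes up. This is one plausible reading of why very liberal cut points can be inefficient.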
We found that a number of methods for defining incident vertebral deformities generally yielded similar relationships to clinical validation criteria: no method was consistently more strongly related to clinical criteria once we controlled for the absolute level of incidence. Since no method could be judged superior based on clinical criteria, other factors must be used to choose among them. One consideration might be the simplicity of calculation, particularly with regard to the requirement for reference data.
Among the methods we tested, only methods 1a and 1c did not require reference values. Estimation of reference values presents significant problems of both precision and accuracy and can complicate the definition of vertebral deformities.(23,24) In terms of precision, a large number of films are required to establish reference values reliably.(21,23,24) Small differences in the protocol for obtaining radiographs or in the measurement of vertebral heights, either across studies or within a study, can lead to different reference means and SDs. Also, reference values may vary among populations.(23) For these reasons, reference values for cross-sectional vertebral height ratios vary greatly among studies,(24) leading to the recommendation that new reference values be developed for each study or study site.(23,24) These problems with reference values can be avoided by choosing methods, such as 1a and 1c, that do not require them.
The choice between methods 1a and 1c depends on the advantage of incorporating a minimum height reduction into the definition of deformity. The primary rationale for including a minimum height decrease is that a very small absolute change in height (e.g., a 1-mm change in an already fractured vertebra) would be below the precision level for the technique. However, we found that incorporating the 3-mm minimum with a low-incidence cut point (e.g., 20% vertebral height decrease) made very little difference in the group of women identified as experiencing an incident deformity; the κ between methods 1a and 1c was 0.99. The inclusion of the 3-mm criterion did have a larger impact on the proportion of height increases at higher-incidence cut points.
Since undeformed vertebrae are, on average, 2–3 cm high, any change of 15% or more in an undeformed vertebra will be greater than about 3 mm. Thus, the primary impact of requiring a minimum change will be in vertebrae that are deformed at baseline. Therefore, use of a minimum may have greater impact in a sample of women with higher fracture prevalence than in our population-based sample.
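The interaction between a percentage criterion and an absolute minimum can be sketched as below. This is a hedged illustration rather than the study's analysis code: the function name, the use of "greater than or equal to" at the thresholds, and the example heights are assumptions. With `min_mm=0.0` the rule is a percentage-only definition in the spirit of method 1a; with `min_mm=3.0` it adds a 3-mm minimum in the spirit of method 1c.

```python
def has_incident_deformity(baseline_mm, followup_mm, pct_cut=0.20, min_mm=0.0):
    """True if any height fell by at least pct_cut of baseline and at least min_mm.

    baseline_mm and followup_mm hold the paired heights of one vertebra
    (e.g., anterior, middle, posterior) in millimetres.
    """
    for before, after in zip(baseline_mm, followup_mm):
        drop = before - after
        if drop >= pct_cut * before and drop >= min_mm:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical undeformed vertebra: a 6-mm (24%) anterior height loss.
    normal = ([25.0, 25.0, 26.0], [19.0, 24.0, 26.0])
    # Hypothetical vertebra already deformed at baseline: 2.5-mm (about 21%) loss.
    deformed = ([12.0, 20.0, 24.0], [9.5, 20.0, 24.0])
    for label, (base, follow) in (("normal", normal), ("deformed", deformed)):
        print(label,
              has_incident_deformity(base, follow),
              has_incident_deformity(base, follow, min_mm=3.0))
```

In this example the undeformed vertebra is classified identically under both rules, while the vertebra that was already deformed at baseline meets the 20% criterion but not the 3-mm minimum, which is the situation described above.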
While a 4-mm minimum has been used in a number of studies, we compared the use of absolute minimum values of 3 and 4 mm and found little difference between them when a 20% criterion is used: only one woman who met the 3-mm minimum did not meet the 4-mm minimum. However, the minimum value used should depend on the percentage change criterion for deformity. For example, requirement of a minimum 4-mm change will invalidate many 15% height decreases and, therefore, a smaller minimum value is probably appropriate at that cut point.
The similarity in the clinical relationships resulting from the use of method 1a and the point-prevalence methods was surprising. Point-prevalence methods, such as methods 4a and 4b, are fundamentally different from other measures of deformity in that they do not rely on longitudinal changes in vertebral height. These methods have the potential disadvantage of decreased reliability since a small change in ratios could define a new deformity. For example, a vertebral ratio changing from a standardized score of 2.9 at baseline to 3.1 at follow-up might correspond to a change of < 1 mm, but would be considered a new deformity. The ability to accurately assess changes this small is beyond the reproducibility of current morphometry techniques.
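The example in the preceding paragraph can be worked through numerically. The sketch below assumes a deformity threshold of 3 reference SDs (implied by the 2.9 to 3.1 example), and it uses a hypothetical reference mean and SD for the anterior/posterior height ratio and hypothetical heights; none of these values come from the study.

```python
def standardized_score(height_ratio, ref_mean, ref_sd):
    """Number of reference SDs by which a height ratio falls below the reference mean."""
    return (ref_mean - height_ratio) / ref_sd


def is_new_deformity(score_baseline, score_followup, threshold=3.0):
    """Point-prevalence rule: not deformed at baseline, deformed at follow-up."""
    return score_baseline < threshold <= score_followup


if __name__ == "__main__":
    REF_MEAN, REF_SD = 0.90, 0.05   # hypothetical reference values
    posterior_mm = 25.0             # hypothetical, unchanged between films
    anterior_baseline_mm, anterior_followup_mm = 18.875, 18.625

    z0 = standardized_score(anterior_baseline_mm / posterior_mm, REF_MEAN, REF_SD)
    z1 = standardized_score(anterior_followup_mm / posterior_mm, REF_MEAN, REF_SD)
    change_mm = anterior_baseline_mm - anterior_followup_mm
    print(f"scores {z0:.1f} -> {z1:.1f}, height change {change_mm:.2f} mm, "
          f"new deformity: {is_new_deformity(z0, z1)}")
```

With these illustrative values, an anterior height change of a quarter of a millimetre moves the score across the threshold and is counted as a new deformity.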
However, point prevalence methods have the advantage of unifying definitions of prevalence and incidence. Thus, a vertebra exhibiting an incident deformity between two radiographs obtained at different times would, by design, be defined as deformed if examined only on the follow-up film. Using vertebral height increases as a yardstick for comparison, several studies have suggested that point prevalence techniques are superior to methods based on longitudinal changes in height(10); however, our study failed to show either of the point-prevalence methods to be superior in its correlation to clinical outcomes. This disparity in results may be due to a limitation in the use of vertebral height increases as a validation measure and is discussed in more detail below.
When the point-prevalence methods were used, the method of defining prevalence made little difference in the relationship of incident deformities to clinical criteria. Thus, the prevalence definition proposed by McCloskey seems to offer no advantage over the simpler Eastell-Melton criteria. This result, with respect to incident deformities, agrees with an earlier comparison of the two methods for assessing prevalent deformities.(14)
The SDI (method 3) combines information about frequency and severity of deformity and therefore might be expected to correlate more closely with clinical criteria. However, the SDI did not show stronger correlations than the binary definition methods. In fact, it showed a weaker correlation with all validation criteria except age and baseline vertebral deformity. The potential advantages of the SDI may be offset by decreased precision due to its use of heights at T4, which are often distorted or missing from radiographs centered on T7–T8. It may be possible to develop continuous measures of spinal deformity that combine frequency and severity information without these disadvantages. Such methods require further study.
At the same incidence level, all methods generally yielded similar relationships to validation criteria. Therefore, the question of choosing a method becomes secondary to the question of choosing an optimal cut point.
For a given method, a lower-incidence cut point is more specific and will always lead to a smaller group of women with deformities. This smaller group will always show a stronger relationship to clinical criteria since more false positives and true mild deformities (which are likely to be more weakly related to clinical outcomes) have been excluded. However, a lower-incidence cut point will be less sensitive, missing some mild deformities.
As with any clinical measurement, the choice of an optimal cut point involves a tradeoff between sensitivity and specificity, but our results offer some evidence that a lower-incidence cut point would be superior. We found an increasing ratio of vertebral height increases to decreases as the incidence increased, suggesting that higher-incidence cut points may be more affected by small systematic problems, such as the very small mean vertebral height increase in our data. Also, our results suggest that lower-incidence cut points (e.g., 20–25% height decrease) should allow studies to use smaller sample sizes. If lower-incidence cut points are advantageous, this would call into question the use of point-prevalence methods that generate higher incidences and for which the incidence cannot be easily altered.
Our comparison of sample size requirements for different cut points uses vertebral height increases to assess the imprecision of the measurements. The underlying assumption of this technique is that the imprecision is symmetric in the sense that a vertebral height increase represents a random error and that height decreases include a similar proportion of random errors. However, our results suggest that this assumption may not always hold. We found that the errors were not entirely random; we observed an average increase in vertebral heights, which probably represents a systematic error. Potential causes of such systematic errors may include discrepancies in tube-to-film distances between the first and second radiographs or other differences in the manner in which the radiographs were obtained. Similar limitations apply to other studies that have compared different methods for defining deformity using vertebral height increases as a measure of imprecision.(10) Our conclusions regarding optimal cut points need to be verified in other contexts. For example, one could use various cut points to compare the treatment-control difference in a clinical trial of an agent shown to be effective in reducing vertebral deformities in order to address more accurately the optimal definitions for minimizing sample size.
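The use of height increases as a yardstick for imprecision can be illustrated with a short sketch. The function name and the list of fractional height changes are hypothetical, and expressing each change as a fraction of the baseline height is an assumption; the code simply counts, at a given cut point, how many heights increased by at least that fraction and how many decreased by at least that fraction.

```python
def increase_to_decrease_ratio(fractional_changes, cut):
    """Ratio of height increases >= cut to height decreases >= cut.

    fractional_changes holds (follow-up - baseline) / baseline for each height.
    Returns NaN when no decrease reaches the cut point.
    """
    decreases = sum(1 for change in fractional_changes if change <= -cut)
    increases = sum(1 for change in fractional_changes if change >= cut)
    return increases / decreases if decreases else float("nan")


if __name__ == "__main__":
    # Hypothetical data, invented purely for illustration.
    changes = [-0.31, -0.26, -0.22, -0.21, -0.18, -0.17, -0.16, -0.04,
               -0.02, 0.01, 0.03, 0.16, 0.17, 0.19, 0.22]
    for cut in (0.15, 0.20):
        print(f"cut {cut:.0%}: ratio = {increase_to_decrease_ratio(changes, cut):.2f}")
```

In this invented data set the ratio is higher at the 15% cut point than at the 20% cut point, mirroring the pattern described above in which more liberal cut points admit proportionally more apparent increases. If the errors are systematic rather than random, as discussed, the ratio no longer measures imprecision symmetrically.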
Since there is no gold standard for defining a vertebral deformity, we reasoned that the best definition would be that resulting in the strongest relationship to clinical consequences of spine fractures (such as back pain or height loss) or the one that best predicted deformities by markers of osteoporosis (such as presence of an existing deformity, low bone mass, etc.). While each of these criteria showed some relationship to new vertebral deformity, the strongest relationships involved height loss, presence of a baseline vertebral deformity, and bone mass in the lowest quartile. In particular, height loss showed a very strong relationship to deformity and seems to have the greatest potential as a validation tool.
Our study represents the largest and most systematic attempt to compare methods for defining vertebral deformities from vertebral height measurements. Despite these advantages, there are still some important limitations. First, it is a population-based study in which the incidence of deformities is relatively low. Some methods may perform differently in populations with higher incidence, such as women with existing deformities. Second, X-rays triaged as normal were not assessed by morphometry and, therefore, we may have missed some morphometric deformities. However, the small number of very mild deformities missed at the highest incidence cut points could not have affected our overall conclusions. Third, we utilized only quantitative morphometric measurements and did not incorporate a radiologist's reading of individual vertebrae. It has been argued that the optimal definition of incident deformity may be some combination of morphometry with a radiologist's reading.(30,31) This possibility is currently under study. Fourth, our radiographs were obtained in a carefully standardized fashion at the four clinical sites. In settings where such standardization is not possible, the methods may perform differently relative to each other. Last, our study represents healthy volunteers from the United States, most of whom are Caucasian and all of whom are female. These results may not be generalizable to other populations.
Despite these limitations, after comparing several commonly used definitions, we recommend that a simple percentage reduction in vertebral heights be adopted as the standard definition of incident vertebral deformity. The addition of an absolute height reduction, while making little difference in our study, may offer advantages in studies with very high incidence or lower measurement precision. Furthermore, our results suggest that a lower incidence cut point (e.g., 20–25% decrease in vertebral height) has the advantage of reducing sample size requirements for clinical trials and epidemiologic studies compared with a higher incidence cut point (e.g., ≤ 15% decrease in vertebral height).