Analyzing reliability of change in depression among persons with rheumatoid arthritis

Authors


  • The opinions contained herein are those of the grantee and do not necessarily reflect those of the Department of Education or the Department of Veterans Affairs.

Abstract

Objective

To examine several methods of determining reliability of change constructs in depressive symptoms in patients with rheumatoid arthritis (RA) and to demonstrate the strengths, weaknesses, and uses of each method.

Methods

Data were analyzed from a cohort of 54 persons with RA who participated in a combined behavioral/pharmacologic intervention of 15 months' duration. These longitudinal data were used to examine 3 methodologies for assessing the reliability of change for various measures of depression. The specific methodologies involved the calculations of reliable change, sensitivity to change, and reliability of the change score.

Results

The analyses demonstrated differences in reliability of change performance across the various depression measures, which suggest that no single measure of depression for persons with RA should be considered superior in all contexts.

Conclusion

The findings highlight the value of utilizing reliability of change constructs when examining changes in depressive symptoms over time.

INTRODUCTION

Several studies have indicated that individuals with rheumatoid arthritis (RA) experience a higher prevalence of major depressive disorder (MDD) and generalized depressive symptoms than do individuals in the general population (1, 2). Given both the elevated frequency of depressive symptoms among persons with RA and the body of research indicating that depression is a highly treatable condition (3–6), recent studies have examined depression-related symptoms in cohorts of individuals with RA (7–10).

When assessing the treatment of depression among individuals with RA, the accurate assessment of symptoms of depression and their change over the course of an intervention is of paramount importance. Commonly used measures of depression in arthritis research include the Center for Epidemiologic Studies Depression Scale (CES-D) (11), the Geriatric Depression Scale (GDS) (12), the Hamilton Depression Inventory (HAM-D) (13), the Affect score from the Arthritis Impact Measurement Scales 2 (AIMS2-A) (14), and the Depression subscale from the Symptom Checklist-90-R (SCL-90-R) (15). Several studies have demonstrated sound psychometric properties for these scales (14–17), including studies that have assessed their reliability and validity within the RA population (18, 19), but the utility of depression measures for assessing change over time has not been as carefully examined.

Research involving the accuracy of assessing change over time is a fairly new phenomenon in the medical literature, but one that is critically important for treatment outcome research. Traditional measures of reliability such as Cronbach's alpha and the Kuder-Richardson-20 provide “snapshots” of the accuracy of an instrument at certain points in time, but do not provide information about the reliability of change over time or the reliability of the change scores associated with the instrument. For example, an instrument may be internally consistent at both pre- and postintervention, but the actual change measured by the instrument may not in and of itself be reliable or accurate. Therefore, there is a need to examine reliability in the context of the actual change measured by an instrument.

Discussions of reliability of change in the medical literature can be confusing because researchers have used similar terminology to describe different concepts. Specifically, 3 conceptualizations of reliability of change have evolved in the literature: reliable change, sensitivity to change, and reliability of the change score. First, reliable change studies are designed to determine the amount of change that is necessary on a particular instrument for a change in scores to be considered reliable above and beyond change accounted for by measurement error (20–22). Second, sensitivity to change involves assessing the amount of change that is captured by an instrument; sensitivity to change can be measured in terms of effect size (23), which provides information such as amount of variance accounted for or the treatment effect. Finally, reliability of a change score involves the actual psychometric properties of the change score in the form of a reliability estimate, which takes into account within-subject variability; higher scores are indicative of instruments that yield change scores that are more reliable (24, 25).

Both reliable change and sensitivity to change have been discussed in the arthritis literature. For example, Stratford and Binkley (21) reported that one pain measure required a 13.3-point difference for reliable change to occur, whereas the difference for reliable change was only 6.7 points on another measure. In another study, Fitzpatrick et al (23) reported effect sizes on different health-status measures for a variety of indices (e.g., pain, mobility), with no measure demonstrating clearly superior sensitivity to change. However, there appears to be a paucity of studies in the arthritis literature that have addressed reliable change or sensitivity to change among depression instruments.

The purpose of this study was to calculate reliable change, sensitivity to change, and reliability of change scores on several self-report depression instruments within a sample of persons with RA. These analyses demonstrated the uses of each type of calculation and how they can be applied and interpreted among persons with RA.

SUBJECTS AND METHODS

Subjects.

This study was a secondary analysis of data obtained from a previous study of persons with RA (10). There were 54 subjects (15 male, 39 female) with a diagnosis of classic or definite RA. The mean ± SD age of the subjects was 54.6 ± 11.4 years, the mean ± SD educational level was 12.6 ± 2.3 years, and the median annual income was between $15,000 and $19,999. Two subjects (4%) were classified as functional Class I, and 26 (48%) each as Class II and Class III. Eight (15%) subjects were recruited from a Midwestern Department of Veterans Affairs hospital, 24 (44%) subjects from a university medical center, and 22 (41%) subjects were from a private rheumatology practice. The diagnosis of RA was made by a collaborating rheumatologist (SEW) using the criteria of the American College of Rheumatology (ACR; formerly the American Rheumatism Association) (26). Subjects also met the diagnostic criteria for MDD as diagnosed by a collaborating psychiatrist using the Structured Clinical Interview for the Diagnostic and Statistical Manual, Third Edition, Revised (27). Subjects were evaluated for MDD after reporting depressive symptoms during a phone screening for depression.

Measures.

The CES-D (11) is a 20-item self-report measure that assesses depressive symptoms. Scores on each individual item range from 0–3, with higher scores indicating more frequent experience of the symptoms; the range for the summary CES-D score is 0–60. The CES-D has been shown to exhibit adequate reliability and validity (28), including studies that have assessed its psychometric properties in patients with RA (18, 19).

The GDS (12) is a 30-item questionnaire designed to measure depression in an elderly population; scores can range from 0–30, with higher scores indicating greater depressive symptomatology. The GDS has been shown to have adequate internal consistency and validity (12).

The HAM-D (13, 29) is a 17-item, interview-based inventory that yields a measure of depression severity; scores can range from 0–100, with higher scores indicating greater depressive symptomatology. Studies have indicated that the HAM-D is a reliable and valid measure of depression (16, 30).

The SCL-90-R is a 90-item self-report inventory designed to measure psychological distress. There are 9 clinical scales and 3 measures of global distress. The SCL-90-R has been shown to have adequate reliability and validity (15). In this study, the depression subscale (SCL-D) was utilized; SCL-D T-scores have a mean ± SD of 50 ± 10, with higher scores indicating greater depressive symptomatology.

The AIMS2 is a 78-item questionnaire designed to measure health status for persons with arthritis. Five factor scores have been identified for the AIMS2: Affect score, Physical score, Symptom score, Social Interaction score, and Role score (14). Adequate reliability and validity data for the AIMS2 have been reported (14). In the current analyses, the affect score (AIMS2-A) was used. AIMS2-A normalized scores can range from 0 to 10, with higher scores indicating greater depression-related impairment.

Procedures.

Details on subject selection have been provided elsewhere (10). In brief, subjects were screened for eligibility to participate in the study by administering the CES-D. Those subjects who scored ≥1 on the instrument were then assessed for MDD, and those subjects who met criteria for MDD were invited to participate in the study. A total of 638 persons with RA were screened, 254 persons with RA were invited to participate in the evaluation for MDD, 84 consented to the diagnostic interview, and 54 subjects who met criteria for MDD were enrolled in the study.

Subjects were assigned to 1 of 3 treatment groups: medication only, medication plus cognitive-behavioral therapy, or medication plus attention control. All subjects were prescribed an antidepressant medication (sertraline), although 3 subjects were switched to nortriptyline after determination that sertraline was clinically ineffective. Data were collected at baseline (Time 1), postintervention (Time 2, 3 months from baseline), 6-month followup (Time 3), and 15-month followup (Time 4). Dropouts occurred during the course of the study, resulting in 41 of the original 54 subjects still participating at the time of the 15-month followup.

Results of the previous study (10) indicated no significant between-group differences at any time period for the depression, cognitive, or health-status measures, but a significant decrease in depression symptomatology over time occurred for all 3 groups. Therefore, the groups were combined for the current analyses of reliability of change.

Statistical analysis.

Reliable change, sensitivity to change, and the reliability of the change score were calculated for each depression instrument; calculations assessed change between Time 1 (baseline) and Time 4 (15-month followup) as well as between Time 2 (postintervention) and Time 4. The procedures for calculating each statistic are detailed below.

The procedures outlined by Iverson et al (20) were used to calculate reliable change. Only the data obtained at the first and last points in time are used to calculate the amount of change necessary on a particular construct for such change to be considered reliable. The calculations involve the standard error of the difference ($S_{\text{diff}}$) and a z-score constant that is representative of a specified confidence interval. $S_{\text{diff}}$ is calculated by taking the square root of the sum of the standard error of measurement at Time 1 squared and the standard error of measurement at Time 2 squared. Because the sample sizes are the same at the 2 points in time, the estimated $S_{\text{diff}}$ is as follows:

equation image

where $S_1^2$ and $S_2^2$ denote the variances at Times 1 and 2, respectively, and $n$ is the common sample size. $S_{\text{diff}}$, then, is an estimate of measurement error at both time periods. $S_{\text{diff}}$ is multiplied by a z-score to establish the amount of change necessary for such change to be considered reliable at a given confidence level. For example, multiplying $S_{\text{diff}}$ by a z-score of 1.28 would establish the value necessary to be 80% confident that such change is outside the range of measurement error, whereas $S_{\text{diff}}$ would have to be multiplied by 1.64 to establish the value necessary to be 90% confident that such change is reliable.
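
The arithmetic of the reliable change threshold can be sketched in a few lines. This is a minimal illustration rather than the procedure of Iverson et al: the function names are hypothetical, the two standard errors of measurement are treated as given inputs (how they are estimated is not shown), and the z-score is taken to be the two-sided normal quantile, which yields approximately 1.28 and 1.64 at 80% and 90% confidence.

```python
import math
from statistics import NormalDist


def s_diff(sem_time1: float, sem_time2: float) -> float:
    """Standard error of the difference: the square root of the sum of the
    squared standard errors of measurement at the two time points."""
    return math.sqrt(sem_time1 ** 2 + sem_time2 ** 2)


def z_for_confidence(confidence: float) -> float:
    """Two-sided standard normal z-score for a confidence level in (0, 1)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def reliable_change_threshold(sem_time1: float, sem_time2: float,
                              confidence: float = 0.80) -> float:
    """Smallest change regarded as reliable at the given confidence level."""
    return z_for_confidence(confidence) * s_diff(sem_time1, sem_time2)


# Hypothetical standard errors of measurement, for illustration only.
threshold_80 = reliable_change_threshold(6.0, 8.0, confidence=0.80)
threshold_90 = reliable_change_threshold(6.0, 8.0, confidence=0.90)
print(round(threshold_80, 2), round(threshold_90, 2))
```

With these illustrative inputs the standard error of the difference is 10.0, so the two thresholds are simply 10 times the respective z-scores.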

To calculate sensitivity to change, effect sizes were calculated. In the same manner as the reliable change statistic, only data from the first and last points in time are used. To calculate the effect size, denoted by E, for each instrument, the mean score at Time 2 was subtracted from the mean score at Time 1, and the difference was divided by the standard deviation of the change scores as follows:

$$E = \frac{\bar{d}}{s_d}$$

where $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$, and $d_i$ is the value at Time 2 subtracted from the value at Time 1 for the $i$th subject, and

$$s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$$

which is the standard deviation of the differences. Larger values are indicative of greater effect sizes, which can be interpreted as being more sensitive to change than smaller values.
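
As a minimal sketch of this effect size, the mean of the paired differences can be divided by their standard deviation. The function name and the scores below are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption.

```python
from statistics import mean, stdev


def effect_size(time1_scores, time2_scores):
    """Effect size E: mean of the paired differences (Time 1 minus Time 2)
    divided by the standard deviation of those differences."""
    if len(time1_scores) != len(time2_scores):
        raise ValueError("each subject needs a score at both time points")
    differences = [first - second
                   for first, second in zip(time1_scores, time2_scores)]
    # stdev() is the sample standard deviation (n - 1 denominator),
    # which is an assumption of this sketch.
    return mean(differences) / stdev(differences)


# Hypothetical scores for 5 subjects, for illustration only.
baseline = [30, 28, 35, 22, 40]
followup = [12, 15, 20, 10, 18]
print(round(effect_size(baseline, followup), 2))
```

Applied to the summary statistics in Table 1, the same ratio for the HAM-D from Time 1 to Time 4 is 31.66 / 12.01 ≈ 2.64, the value reported in Table 3.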

The reliability of the change score was calculated according to the procedures outlined in van Belle et al (24). An in-depth discussion of these calculations is beyond the scope of this report. In brief, all of the data for a given subject are used in the calculation, not just the first and last points in time. In this method, the reliability of the change score, denoted by $\rho_c$, is determined by dividing the variability of true scores by the variability in observed scores. This is accomplished by calculating a regression line for each subject across at least 3 measurement points. The computational formula for $\rho_c$ involves dividing the variability of true change among subjects by the variability of the true change among subjects plus the product of residual variability (measured by the residual from the regression line for each subject) and the square of the spacing of observations for each subject. Higher values for $\rho_c$ are indicative of more reliable change scores. Specifically, a high value for the reliability of the change score indicates that within-subject variability is small in comparison with between-subjects variability. The maximum value for a reliability of change score is 1.0. Although negative numbers are possible, they are indicative of a large amount of within-subject variability. The formula is as follows:

equation image

where $\hat{\sigma}_b^2$ is the variance of the slopes of the regression lines obtained for each subject, and

equation image

where $\hat{\sigma}_{y.t}^2$ is the mean of the residual mean squares obtained from the individual regression lines, and $[t^2] = \sum_{i=1}^{k} (t_i - \bar{t})^2$, where $t_1, \ldots, t_k$ are the times at which the data were obtained and $\bar{t} = \frac{1}{k}\sum_{i=1}^{k} t_i$. In our study, $k = 3$.
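
The quantities described above can be combined in a short sketch. It assumes, as one common form of this kind of estimator and not necessarily the exact formula of van Belle et al, that the true slope variance is estimated as the observed variance of the per-subject slopes minus the mean residual mean square divided by the sum of squared time deviations, and that the reliability is the ratio of that estimate to the observed slope variance. All function and variable names are hypothetical.

```python
from statistics import mean, variance


def fit_line(times, scores):
    """Least-squares slope and residual mean square for one subject."""
    t_bar, y_bar = mean(times), mean(scores)
    t_ss = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (y - y_bar)
                for t, y in zip(times, scores)) / t_ss
    intercept = y_bar - slope * t_bar
    rss = sum((y - (intercept + slope * t)) ** 2
              for t, y in zip(times, scores))
    return slope, rss / (len(times) - 2)


def change_score_reliability(times, subject_scores):
    """Assumed estimator: (observed slope variance minus error variance of a
    slope) divided by observed slope variance. All subjects share `times`."""
    if len(times) < 3:
        raise ValueError("at least 3 measurement points are required")
    fits = [fit_line(times, scores) for scores in subject_scores]
    slope_variance = variance([slope for slope, _ in fits])
    mean_residual_ms = mean([residual_ms for _, residual_ms in fits])
    t_bar = mean(times)
    t_ss = sum((t - t_bar) ** 2 for t in times)
    true_slope_variance = slope_variance - mean_residual_ms / t_ss
    return true_slope_variance / slope_variance


# Hypothetical data: 3 subjects measured at 3, 6, and 15 months.
months = [3, 6, 15]
linear_subjects = [[10, 7, -2], [20, 14, -4], [5, 5, 5]]
print(round(change_score_reliability(months, linear_subjects), 3))
```

When every subject changes exactly linearly, as in this illustrative data set, the residual term vanishes and the estimate reaches its maximum of 1.0. When the residual term exceeds the observed slope variance, the estimate becomes negative.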

RESULTS

The descriptive statistics for the depression change scores are shown in Table 1.

Table 1. Statistics for depression change scores* (T1–T4 = change between Time 1 and Time 4; T2–T4 = change between Time 2 and Time 4)

| Measure | T1–T4 Minimum | T1–T4 Maximum | T1–T4 Mean ± SD | T1–T4 SE | T2–T4 Minimum | T2–T4 Maximum | T2–T4 Mean ± SD | T2–T4 SE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HAM-D | 4.00 | 54.00 | 31.66 ± 12.01 | 1.88 | −20.00 | 26.00 | 5.61 ± 11.04 | 1.72 |
| GDS | −25.00 | 1.00 | −10.95 ± 6.72 | 1.05 | −12.00 | 10.00 | −1.95 ± 4.57 | 0.71 |
| AIMS2-A | −1.00 | 6.50 | 2.23 ± 1.63 | 0.26 | −2.50 | 3.75 | 0.50 ± 1.34 | 0.21 |
| CES-D | −12.00 | 52.00 | 17.00 ± 12.50 | 1.95 | −21.00 | 30.00 | 2.98 ± 10.21 | 1.59 |
| SCL-D | 0.00 | 51.00 | 13.93 ± 11.60 | 1.73 | −23.00 | 38.00 | 2.95 ± 10.91 | 1.70 |

* HAM-D = Hamilton Rating Scale for Depression; GDS = Geriatric Depression Scale; AIMS2-A = Affect subscale of the Arthritis Impact Measurement Scales 2; CES-D = Center for Epidemiologic Studies-Depression scale; SCL-D = Depression subscale of the Symptom Checklist-90-R.

Reliable change.

The values necessary on each depression instrument for change to be considered reliable are presented in Table 2. Results indicated that from postintervention to 15-month followup, the HAM-D and the SCL-D required the greatest raw score change for such change to be considered reliable, with values of 14 and 18 for 80% and 90% confidence, respectively. Therefore, a change of 14 points or more on these instruments between the 2 time periods would be required to be 80% confident that such change was not due to measurement error. Scores on the AIMS2-A required a change of 2 or greater at the 90% confidence level. The amount of change required for a given variable depends on the range of possible values of the variable and on its variability.

Table 2. Reliable change (amount of change necessary for change to be considered reliable)*

| Measure | 80% reliable, Time 1 to Time 4 | 90% reliable, Time 1 to Time 4 | 80% reliable, Time 2 to Time 4 | 90% reliable, Time 2 to Time 4 |
| --- | --- | --- | --- | --- |
| HAM-D | 15 | 20 | 14 | 18 |
| GDS | 9 | 11 | 6 | 7 |
| AIMS2-A | 2 | 3 | 2 | 2 |
| CES-D | 16 | 21 | 13 | 17 |
| SCL-D | 14 | 18 | 14 | 18 |

* See Table 1 for definitions.

Sensitivity to change.

Sensitivity to change, as represented by effect sizes, is shown in Table 3. Results indicated that from baseline to 15-month followup the HAM-D was the most sensitive to change, with scores on the HAM-D yielding the greatest effect size (2.64). The sensitivity to change was similar for the 4 other measures, with values ranging from 1.26 (SCL-D) to 1.63 (GDS). Sensitivity to change was more similar among the measures when assessing change between postintervention and 15-month followup, with values ranging from 0.27 (SCL-D) to 0.43 (GDS). The effect sizes, as expected, were smaller when comparing postintervention to 15-month followup change than when comparing baseline to 15-month followup change. These results indicated that the HAM-D is the instrument that is most sensitive to change when assessing change over the course of an entire intervention, but there was little difference among the measures when assessing change from postintervention.

Table 3. Sensitivity to change (effect sizes) of the depression measures*

| Measure | Effect size, Time 1 to Time 4 | Effect size, Time 2 to Time 4 |
| --- | --- | --- |
| HAM-D | 2.64 | 0.41 |
| GDS | 1.63 | 0.43 |
| AIMS2-A | 1.37 | 0.37 |
| CES-D | 1.36 | 0.29 |
| SCL-D | 1.26 | 0.27 |

* See Table 1 for definitions.

Reliability of the change score.

Reliability of the change score, as represented by $\rho_c$, is shown in Table 4. Two results of these analyses are noteworthy. First, reliability was low when assessing change between baseline and 15-month followup, with values ranging from −2.03 (HAM-D) to 0.24 (SCL-D). Second, the reliability scores improved considerably when comparing change between postintervention and 15-month followup, with values ranging from −0.25 (AIMS2-A) to 0.57 (SCL-D). Calculating the reliability of the change score via the $\rho_c$ statistic, which uses all data points, assumes linear change. In the present study, there was a large change on the depression measures between baseline and postintervention, followed by more gradual change between postintervention and 15-month followup. Although a linear relationship existed between postintervention and 15-month followup, the relationship between baseline and 15-month followup was not linear. Therefore, the low reliability observed between baseline and 15-month followup may be related more to the violation of the linearity assumption than to the properties of the depression instruments themselves.

Table 4. Reliability of the change score (ρ) for the depression measures*

| Measure | ρ, Time 1 to Time 4 | ρ, Time 2 to Time 4 |
| --- | --- | --- |
| HAM-D | −2.03 | 0.05 |
| GDS | −0.41 | 0.42 |
| AIMS2-A | −0.33 | −0.25 |
| CES-D | −0.12 | 0.22 |
| SCL-D | 0.24 | 0.57 |

* See Table 1 for definitions.
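
The effect of nonlinear change on the residual term can be illustrated with a single hypothetical trajectory. The sketch below fits a straight line to an invented subject whose score drops sharply after baseline and then levels off, first across all 4 assessments and then across the last 3 only. The scores, the assessment times (expressed as months from baseline), and the function name are assumptions made for illustration, not data from the study.

```python
from statistics import mean


def residual_mean_square(times, scores):
    """Residual mean square from a least-squares line (needs >= 3 points)."""
    if len(times) != len(scores) or len(times) < 3:
        raise ValueError("need matching times and scores, at least 3 points")
    t_bar, y_bar = mean(times), mean(scores)
    t_ss = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (y - y_bar)
                for t, y in zip(times, scores)) / t_ss
    intercept = y_bar - slope * t_bar
    rss = sum((y - (intercept + slope * t)) ** 2
              for t, y in zip(times, scores))
    return rss / (len(times) - 2)


# Hypothetical subject: large early improvement, then gradual change.
months = [0, 3, 6, 15]
scores = [30, 12, 11, 9]

rms_all = residual_mean_square(months, scores)           # baseline included
rms_late = residual_mean_square(months[1:], scores[1:])  # postintervention on
print(f"{rms_all:.2f} {rms_late:.2f}")
```

For this invented trajectory the residual mean square is roughly 70 when baseline is included and well under 1 when it is excluded. Because the residual term counts against the reliability of the change score, a pattern of this kind would be expected to depress the statistic regardless of the quality of the instrument.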

DISCUSSION

The purpose of our study was to demonstrate the utility of 3 different methods of assessing the reliability of change of depression scores among persons with RA. Each of the methods (i.e., reliable change, sensitivity to change, and reliability of the change score) provided unique information regarding the change that occurred on the depression instruments. The results also demonstrated differences in reliability of change performance across the various depression measures, suggesting that no one measure of depression for persons with RA should be considered superior in all contexts.

The concept of reliability of change can be confusing because prior studies have used different methodologies to describe this concept. An important component of this study involved examining the uniqueness of different methods in terms of what they offer toward an understanding of change that occurs. For example, results from the reliable change analyses (the amount of change necessary for such change to be considered outside the range of measurement error) demonstrated that different depression instruments required different amounts of change in raw scores for such change to be considered reliable. Similarly, results indicated that the various depression instruments were heterogeneous in terms of sensitivity to change (measured by an effect size), which conveys that each measure captured a different amount of change in subjects' depression scores. In a sense, both reliable change and sensitivity to change measure the same concept, but present it in opposite ways. Reliable change is presented in terms of how much change would have to occur, whereas sensitivity to change is presented in terms of how much change did occur. However, this study shows that the third methodology, reliability of the change score, is quite different from both the reliable change and the sensitivity to change methodologies. Instead of assessing how much change would need to occur or did occur, the reliability of the change score method provides a way of quantifying the reliability of the change score itself; the results again reveal large differences across the various depression measures.

The results from this study are important to RA clinicians and researchers in at least 2 ways. First, the results indicated that all of the measures of depression manifested comparative limitations in some of the reliability of change analyses. Specifically, the various depression instruments yielded good results in 1 or 2 areas of analysis, but no instrument demonstrated superiority across all 3 areas. For example, the HAM-D and the GDS were the most sensitive in terms of capturing change, but the SCL-D yielded the highest reliability of the change score. These results indicate that no gold standard exists for measuring reliable change in depression for persons with RA; different measures of depression may excel in different contexts.

Second, the various reliability of change constructs discussed in this article might be more or less useful for various subgroups of arthritis health professionals. For example, clinicians treating a depressed person with RA who are measuring his or her progress might be most interested in the concept of reliable change; it would be useful for them to know the absolute magnitude of change that needs to occur for such change to be considered beyond measurement error. In contrast, researchers examining the efficacy of various treatments might be more interested in examining the amount of change that actually occurred, which can be assessed via sensitivity to change.

The main message from these analyses appears to be that both clinicians and researchers who use depression measures to assess change over time should carefully consider their applications from a reliability of change standpoint. As these analyses demonstrate, multiple strategies for assessing reliability of change exist, although the preferred strategy is dependent upon the specific measurement application. Greater utilization of reliability of change constructs in the future can serve to increase measurement precision in the field of rheumatology.

There were limitations to this study. The sample was relatively small and geographically homogeneous, and the change that occurred between pretest and posttest was considerably larger than the changes that occurred from posttest to the various followups. Although such a trend is expected in a longitudinal intervention study, this nonlinear change was a limiting factor in the evaluation of the reliability of the change score from Time 1 to Time 4. One other cautionary note is that reliability of change analyses do not address the possibility of regression to the mean as a source of methodologic error; regression to the mean must be considered in the context of the overall measurement application (e.g., sampling strategy, research design).

Despite these limitations, the results illustrate important measurement concepts for researchers and clinicians who assess depression longitudinally in persons with RA. Although the focus of this study was on depression, the constructs of reliable change, sensitivity to change, and reliability of the change score can be beneficially applied to other areas of rheumatologic investigation and practice.
