Treatment outcomes in antidepressant medication trials are usually assessed by clinician-administered rating scales, such as the Hamilton Rating Scale for Depression (HAM-D). In addition to the use of a structured interview guide for such scales (e.g. Williams, 1988), training raters to administer such outcome measures reliably and validly has become a critical concern for clinical research over the past decade.[3-7] During this period, structured, procedurally invariant computer algorithms to obtain comparable electronic patient-reported outcomes (ePRO) using interactive voice response (IVR) technology have been developed and validated in the USA.[8-11] In light of mounting evidence of systemic bias in clinician-administered rating scales in randomized clinical trials[12, 13] and evidence of psychometric equivalence between the clinician-rated and IVR-rated HAM-D assessments (clinician HAM-D; IVR HAM-D),[8-11] the Food and Drug Administration (FDA) Division of Psychiatric Products announced that ePRO assessments, such as the IVR HAM-D assessment, would be acceptable as primary outcome measures to establish the efficacy of antidepressant medications in outpatient randomized clinical trials (presented by T. Laughren from the FDA at the New Clinical Drug Evaluation Unit at Phoenix, AZ on 30 May 2008). Although the clinical validity and usefulness of the IVR HAM-D assessment in clinical trials have been established and accepted in the USA, no studies have specifically demonstrated comparability between clinician and IVR HAM-D assessments in Japanese or any other Asian populations. This study aimed to examine the reliability and validity of a Japanese IVR HAM-D (17 items) program.
The patients consisted of 40 men and 20 women aged 22–69 years (mean = 40.7; SD = 11.1). There were 11 patients with major depressive disorder (MDD) single episode, nine with MDD single episode and dysthymia (double depression), 27 with MDD recurrent, five with MDD recurrent and dysthymia, two with dysthymia alone, five with bipolar II, and one with bipolar I. One patient did not complete the second IVR assessment; the remaining 59 patients completed the full procedure.
The clinician HAM-D score ranged between 2 and 33 points (mean = 15.1; SD = 6.7). The ICC of IVR HAM-D assessments between Day 1 and Day 2 was 0.93, indicating that test–retest reliability was high (Fig. 1). Internal consistency of the total scores obtained by the clinician (Day 1), psychologists (Day 1) and the IVR program (Days 1 and 2) was acceptable to good (Cronbach's alpha: 0.77, 0.79, 0.78, and 0.83, respectively).
Figure 1. Scatter plots of Interactive Voice Response (IVR) Hamilton Rating Scale for Depression (HAM-D) total scores on Days 1 and 2 (n = 59). ICC, intraclass correlation coefficient.
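The two reliability statistics reported above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: the text does not state which ICC form was computed, so the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), is an assumption here, and the function names and toy scores are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; `item_scores` holds one list of item scores per patient."""
    k = len(item_scores[0])  # number of items (17 for the HAM-D version used here)
    # Sample variance of each item across patients.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of the total score across patients.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `scores` holds one row per patient with that patient's total score at
    each occasion, e.g. [day1, day2]. The choice of ICC(2,1) is an assumption.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    # Two-way ANOVA sums of squares: patients (rows), occasions (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical Day 1 / Day 2 total scores for five patients (not study data).
day1_day2 = [[18, 17], [25, 27], [9, 8], [14, 15], [30, 29]]
print(round(icc_2_1(day1_day2), 2))
```

Because the absolute-agreement form penalizes a constant shift between occasions, a high value under this assumption would indicate both a consistent ranking of patients and similar score levels on the two days.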
Regarding concurrent validity, Pearson's correlation coefficient of HAM-D total scores on Day 1 was 0.81 (P < 0.0001) for clinician HAM-D versus IVR HAM-D and 0.93 (P < 0.0001) for clinician HAM-D versus psychologist HAM-D (Fig. 2), indicating strong correlations and high concurrent validity. Mean HAM-D total scores for clinician, psychologist, and IVR (Days 1 and 2) are shown in Table 1. With respect to the differences in HAM-D total scores, the mean IVR HAM-D score (Day 1) was approximately 3 and 4 points higher than the mean clinician HAM-D (t = 5.2, d.f. = 59, P < 0.001, paired t-test) and psychologist HAM-D (t = 6.0, d.f. = 59, P < 0.001) scores, respectively, indicating that the IVR program tends to overestimate depression severity compared with the clinician and psychologists.
Figure 2. Scatter plots of Hamilton Rating Scale for Depression (HAM-D) total scores on Day 1 (n = 60). (a) Plots of clinician HAM-D versus Interactive Voice Response (IVR) HAM-D, Pearson's r = 0.81. (b) Plots of clinician HAM-D versus psychologist HAM-D, r = 0.93.
Table 1. Mean HAM-D total score determined by different assessments on Days 1 and 2
|Assessment||Day||n||Mean (SD)|
|IVR HAM-D||1||60||18.1 (7.5)|
|IVR HAM-D||2||59||17.6 (8.3)|
|Clinician HAM-D||1||60||15.1 (6.7)|
|Psychologist HAM-D||1||60||14.2 (6.2)|
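The correlation and the paired comparison behind these figures can be reproduced with standard formulas. The sketch below is illustrative only: the function names and the example scores are hypothetical, and it does not use the study data. With 60 paired scores, the paired t-test has 59 degrees of freedom, as reported above.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def paired_t(x, y):
    """Return (t, degrees of freedom) for a paired t-test of x against y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical total scores for six patients (not study data).
ivr_scores = [20, 14, 27, 9, 18, 23]
clinician_scores = [16, 12, 25, 5, 17, 18]

r = pearson_r(ivr_scores, clinician_scores)
t, df = paired_t(ivr_scores, clinician_scores)
print(f"r = {r:.2f}, t = {t:.2f}, d.f. = {df}")
```

The two statistics answer different questions: the correlation describes how similarly the two methods rank patients, whereas the paired t-test detects a systematic shift in score level between methods.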
The inter-rater consistency of each HAM-D item determined by different assessments (i.e. IVR, clinician and psychologist HAM-D) was examined using Cohen's kappa coefficient (Table 2). The coefficients for IVR (Day 1) versus IVR (Day 2) HAM-D scores and for clinician versus psychologist showed moderate and substantial agreement, respectively (mean: 0.55 [SD 0.13] and 0.62 [SD 0.18]). However, lower values were obtained for clinician versus IVR (Day 1) HAM-D scores (mean 0.25 [SD 0.13]), indicating that agreement between clinician and IVR was ‘fair’ according to the criteria of Landis and Koch.
Table 2. Inter-rater consistency of each HAM-D item by different assessments on Days 1 and 2
|HAM-D item||Cohen's kappa coefficient|
| ||IVR vs clinician (Day 1) (n = 60)||IVR (Day 1) vs IVR (Day 2) (n = 59)||Clinician vs psychologist (n = 60)|
|01 Depressed mood||0.33||0.56||0.52|
|02 Feelings of guilt||0.08||0.55||0.69|
|04 Early insomnia||0.26||0.74||0.70|
|05 Middle insomnia||0.23||0.71||0.60|
|06 Late insomnia||0.21||0.49||0.78|
|07 Work and activities||0.11||0.54||0.53|
|10 Mental anxiety||0.02||0.39||0.48|
|11 Somatic anxiety||0.19||0.44||0.57|
|12 Somatic symptoms (gastrointestinal)||0.33||0.70||0.85|
|13 Somatic symptoms (general)||0.39||0.52||0.44|
|14 Genital symptoms||0.49||0.78||0.77|
|16 Loss of weight by history||0.27||0.52||0.93|
|Average (SD)||0.25 (0.13)||0.55 (0.13)||0.62 (0.18)|
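For readers who want to see how the item-level agreement and its verbal labels are derived, the following sketch computes an unweighted Cohen's kappa and maps it onto the Landis and Koch bands. Whether the study used unweighted or weighted kappa is not stated, so the unweighted form is an assumption, and the function names and example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of item scores.

    Undefined (division by zero) when both raters use a single identical
    category for every patient.
    """
    n = len(rater_a)
    # Observed proportion of exact agreement.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Verbal label for a kappa value following Landis and Koch."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Mean kappa values reported in Table 2.
for name, value in [("IVR vs clinician", 0.25),
                    ("IVR Day 1 vs Day 2", 0.55),
                    ("Clinician vs psychologist", 0.62)]:
    print(name, landis_koch(value))

# Hypothetical scores on one item for eight patients (not study data).
clinician_item = [0, 1, 2, 2, 1, 0, 3, 2]
ivr_item = [1, 1, 2, 3, 1, 0, 3, 1]
print(round(cohen_kappa(clinician_item, ivr_item), 2))
```

Applied to the mean values in Table 2, these bands give the labels used in the text: fair for 0.25, moderate for 0.55 and substantial for 0.62.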
We examined, for the first time, the reliability and validity of the Japanese version of the IVR program to score the HAM-D scale in 60 patients with depressive disorder. The observed ICC of 0.93 between IVR HAM-D scores of Day 1 and Day 2 indicates that the test–retest reliability of the IVR HAM-D program for the assessment of depression severity is quite high. The computed Cronbach's alpha coefficients (0.77–0.83) showed acceptable to good internal consistency whether the HAM-D was scored by the patients through the IVR program or by the clinician or psychologists through a clinical interview, suggesting that the construct validity of the IVR program is also acceptable to good.
With regard to the concurrent validity of overall depression severity assessment (i.e. total HAM-D score) using the IVR HAM-D program, Pearson's correlation coefficient of 0.81 between clinician HAM-D and IVR HAM-D supports its validity. The high coefficient (0.93) for clinician and psychologist assessments indicates very high inter-rater reliability. Taken together, our data suggest that the Japanese IVR HAM-D program is reliable and valid for assessing the HAM-D total score in patients with depressive disorders. Although the HAM-D total score obtained by the IVR program and that obtained by the clinician showed a strong correlation (0.81), the former was three points higher than the latter on average, suggesting that adjustment is required when the program is introduced and used in regular clinical practice. A similar difference between IVR and clinician was reported in a previous study conducted in the USA.
The Cohen's kappa coefficients for each HAM-D item assessed using the IVR program and by the clinician varied considerably from item to item (Table 2). The coefficients for IVR (Day 1) versus IVR (Day 2) HAM-D scores and for clinician versus psychologist showed moderate and substantial agreement, respectively. However, lower values were obtained for clinician versus IVR (Day 1) HAM-D scores, indicating that agreement between clinician and IVR was only fair. Accordingly, caution is required when individual HAM-D item scores from the IVR program are used in clinical practice. In particular, Cohen's kappa coefficients between clinician and IVR scores for feelings of guilt (0.08), work and activities (0.11), retardation (0.18), agitation (0.13) and mental anxiety (0.02) were very low (see Table 2). When these items were scrutinized, a significant correlation between clinician and IVR scores (nominal P < 0.05) was obtained for feelings of guilt (r = 0.44, P < 0.001), work and activities (r = 0.39, P = 0.002), and mental anxiety (r = 0.27, P = 0.036), but not for retardation (r = 0.25, P = 0.056) or agitation (r = 0.02, P = 0.86). Because agitation and retardation on the HAM-D scale must be scored on the basis of objective observation in the clinical setting, the low correlations and low kappa values for these items are not surprising. Taken together, adjustment or improvement of the IVR program is required to obtain higher kappa values for some items (e.g. feelings of guilt, work and activities, and mental anxiety), and a face-to-face interview might be required to obtain accurate scores for retardation and agitation, which might be a limitation of the current IVR program.
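A low kappa can coexist with a significant correlation, as seen for feelings of guilt, because kappa rewards only exact matches while correlation rewards a consistent ordering. The sketch below illustrates this general statistical point with invented ratings in which one method always scores one point higher. It is a hypothetical illustration, not an analysis of the study data, and the study did not test whether a systematic shift explains its item-level results.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa (the unweighted form is an assumption)."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


def pearson_r(x, y):
    """Pearson's product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented item scores: the second method is always exactly one point higher.
clinician = [0, 1, 2, 3, 0, 1, 2, 3]
ivr = [score + 1 for score in clinician]

print("r =", round(pearson_r(clinician, ivr), 2))
print("kappa =", round(cohen_kappa(clinician, ivr), 2))
```

In this constructed case the ordering of patients is identical under both methods, yet no single rating matches exactly, so the correlation is perfect while kappa falls below zero.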
There are several limitations in the study. First, this study included a relatively small number of subjects with severe depression; there were only five subjects whose clinician HAM-D score was 24 or more. Secondly, the imbalance of the subjects' sexes (40 men and 20 women) and the fact that the study was performed at only one site should also be taken into account. Thirdly, psychologists rated the HAM-D score by attending and referring to the clinician's interview but not by interviewing the patients themselves. Therefore, the psychologists' rating is not entirely independent, which may have biased the inter-rater reliability result towards the observed high correlation (r = 0.93) between clinician and psychologist HAM-D scores. Clinicians and psychologists should consider these limitations and findings carefully when assessing depression severity in clinical practice.
In conclusion, despite the limitations of this study, our results indicate that the Japanese version of the IVR HAM-D program is a reliable and valid method to assess overall depression severity in clinical practice. Although the total score was reliably and validly obtained, the score of each item did not always show high agreement. Further research in a larger number of subjects should provide Japanese clinicians with more detailed information regarding the use of the IVR HAM-D program.