Keywords:

  • Hamilton Rating Scale for Depression;
  • interactive voice response program;
  • Japanese;
  • reliability;
  • validity

Abstract

Aim

The aim of this study was to examine the reliability and validity of the Interactive Voice Response (IVR) program to rate the 17-item Hamilton Rating Scale for Depression (HAM-D) score in Japanese depressive patients.

Methods

Depression severity was assessed in 60 patients by a clinician and psychologists using HAM-D. Scoring by the IVR program was conducted on the same and the following days. Test–retest reliability, internal consistency, and concurrent validity for total HAM-D scores were examined by calculating intraclass correlation coefficient, Cronbach's alpha, and Pearson's correlation coefficient. Inter-rater consistency for each HAM-D item was examined by Cohen's kappa.

Results

Test–retest reliability of the IVR program was high (intraclass correlation coefficient: 0.93). Internal consistency of the total scores obtained by the clinician, the psychologists, and the IVR program (Days 1 and 2) was high (Cronbach's alpha: 0.77, 0.79, 0.78, and 0.83, respectively). Regarding concurrent validity, the correlation coefficients between total scores obtained by the clinician versus IVR and by the clinician versus psychologists were high (0.81 and 0.93, respectively). The HAM-D total score rated by the clinician was on average 3 points lower than that of IVR. Inter-rater consistency for each HAM-D item evaluated by the clinician versus IVR was estimated to be fair (Cohen's kappa coefficient: 0.02–0.50).

Conclusion

Our results suggest that the Japanese IVR HAM-D program is reliable and valid to assess 17-item HAM-D total score in Japanese depressive patients. However, the current program tends to overestimate depression severity, and the score of each item did not always show high agreement with clinician's rating, which warrants further improvement in the program.

Treatment outcomes in antidepressant medication trials are usually assessed by clinician-administered rating scales, such as the Hamilton Rating Scale for Depression (HAM-D).[1] In addition to the use of a structured interview guide for such scales (e.g. Williams, 1988),[2] training raters to administer such outcome measures reliably and validly has become a critical concern for clinical research over the past decade.[3-7] During this period, structured, procedurally invariant computer algorithms to obtain comparable electronic patient-reported outcomes (ePRO) using interactive voice response (IVR) technology have been developed and validated in the USA.[8-11] In light of mounting evidence of systemic bias in clinician-administered rating scales in randomized clinical trials[12, 13] and evidence of psychometric equivalence between the clinician-rated and IVR-rated HAM-D assessments (clinician HAM-D; IVR HAM-D),[8-11] the Food and Drug Administration (FDA) Division of Psychiatric Products announced that ePRO assessments, such as the IVR HAM-D assessment, would be acceptable as primary outcome measures to establish the efficacy of antidepressant medications in outpatient randomized clinical trials (presented by T. Laughren from the FDA at the New Clinical Drug Evaluation Unit at Phoenix, AZ on 30 May 2008). Although the clinical validity and usefulness of the IVR HAM-D assessment in clinical trials have been established and accepted in the USA, no studies have specifically demonstrated comparability between clinician and IVR HAM-D assessments in Japanese or any other Asian populations. This study aimed to examine the reliability and validity of a Japanese IVR HAM-D (17 items) program.

Methods

Subjects

Subjects were 60 Japanese patients with a major depressive episode or dysthymia who had a wide range of depression severity. They were under treatment and were recruited at the National Center of Neurology and Psychiatry (NCNP) Hospital (Tokyo, Japan) from July 2010 through October 2011. Diagnosis was made according to the DSM-IV-TR[14] based on a structured interview (the Mini-International Neuropsychiatric Interview),[15] an additional unstructured interview, and information from medical charts. Subjects who had functional disturbance of recognition, delirium or other conditions that would compromise insight were excluded from the study. The study protocol was approved by the ethics committees at NCNP and GlaxoSmithKline K.K. (GSKKK). After the study had been described to them, written informed consent was obtained from every subject. Research was conducted in accordance with the Helsinki Declaration as revised in 2008.

Assessment procedure

The IVR HAM-D program[9-11] was provided by eResearch Technology (ERT; Philadelphia, PA, USA). The Japanese HAM-D script (pre-recorded by a native Japanese speaker) built into the IVR program was linguistically validated against the original English script by ERT. The original English script was developed by Healthcare Technology Systems (Philadelphia, PA, USA) (http://www.healthtechsys.com/). The IVR HAM-D assessment was conducted by telephone. The patients called the IVR HAM-D center and entered their ID numbers and passwords by pushing buttons on the phone, which initiated automated voice-based questions about symptoms according to the script. The patients answered all questions by pushing buttons on the phone. Based on the answers, the program provided a score for each HAM-D item.

On Day 1, each patient first completed the IVR HAM-D assessment and then underwent the clinician HAM-D and psychologist-rated HAM-D (psychologist HAM-D) assessments. A coordinator was available on Day 1 to help with operating the phone. The clinician and the psychologists, who did not know the details of the script and algorithm of the IVR program, rated the HAM-D on the basis of the Japanese version of the Structured Interview Guide for the Hamilton Depression Rating Scale (SIGH-D).[16] The clinician rater was a fellow psychiatrist (H.K.), a member of the Japanese Society of Psychiatry and Neurology with 25 years of clinical and research experience. Two psychologists rated the subjects (M.H. rated the first 10 subjects and N.K. the remaining 50). A well-trained psychologist was present at the interview in which the clinician administered the HAM-D and provided an independent rating for the same subject. The ratings from the clinician and the psychologist were handled independently throughout the study. The patients were also instructed to self-administer the IVR HAM-D (i.e. without the coordinator present) at their own place 1 or 2 days after the initial assessment (Day 2). They were instructed to do this second assessment at the same time of day as the first assessment in order to avoid a possible effect of circadian variation in depression severity.

Statistical analyses

To assess the test–retest reliability of the IVR program, the intraclass correlation coefficient (ICC) of IVR HAM-D total scores was calculated using data obtained on Days 1 and 2. To investigate internal consistency and construct validity of the HAM-D total score, Cronbach's alpha coefficient was calculated for clinician HAM-D (Day 1), psychologist HAM-D (Day 1) and IVR HAM-D assessments (Days 1 and 2). Scores of the two psychologists were combined in the analysis. To assess concurrent validity of HAM-D total scores, Pearson's correlation coefficient was calculated between clinician HAM-D and IVR HAM-D assessments (Day 1) and between clinician and psychologist HAM-D assessments (Day 1). To investigate inter-rater consistency, Cohen's kappa coefficient was calculated for each HAM-D item between clinician HAM-D and IVR HAM-D assessments (Day 1), between IVR HAM-D assessments (Days 1 and 2), and between clinician and psychologist HAM-D assessments (Day 1). The paired t-test was used to compare total HAM-D scores rated by IVR (Day 1), clinician, and psychologist. Statistical analyses were performed using SAS 9.1.3 (http://www.sas.com/).
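For readers who want to see how the internal consistency statistic is formed, the following is a minimal sketch of Cronbach's alpha in Python. It is not the SAS procedure used in the study; the function name `cronbach_alpha` and the demo ratings are hypothetical, and sample (n − 1) variances are assumed.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a subjects-by-items table of item scores.

    scores: list of rows, one row per subject, one value per item.
    Uses sample variances (n - 1 denominator); this is an assumption.
    """
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings for 5 subjects on 3 items (not study data).
demo = [[2, 3, 3], [1, 1, 2], [4, 4, 3], [0, 1, 1], [3, 2, 4]]
alpha = cronbach_alpha(demo)
print(f"alpha = {alpha:.3f}")
```

For the 17-item HAM-D, each row would hold the 17 item scores of one patient from a single rater (clinician, psychologist, or IVR).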

Results

The patients consisted of 40 men and 20 women aged 22–69 years (mean = 40.7; SD = 11.1). There were 11 patients with major depressive disorder (MDD) single episode, nine with MDD single episode and dysthymia (double depression), 27 with MDD recurrent, five with MDD recurrent and dysthymia, two with dysthymia alone, five with bipolar II, and one with bipolar I. One patient failed to conduct the second IVR assessment; however, the remaining 59 patients completed the full procedure.

The clinician HAM-D score ranged between 2 and 33 points (mean = 15.1; SD = 6.7). The ICC of IVR HAM-D assessments between Day 1 and Day 2 was 0.93, indicating that test–retest reliability was high (Fig. 1). Internal consistency of the total scores obtained by the clinician (Day 1), psychologists (Day 1) and IVR program (Days 1 and 2) was acceptable to good in every case (Cronbach's alpha: 0.77, 0.79, 0.78, and 0.83, respectively).
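The test–retest ICC can be illustrated with a short sketch. The ICC model used in the study is not stated, so the one-way random-effects form, ICC(1,1), is assumed here purely for illustration; the function name `icc_oneway` and the Day 1/Day 2 pairs are hypothetical.

```python
def icc_oneway(pairs):
    """One-way random-effects ICC(1,1) for repeated measurements.

    pairs: list of tuples, one tuple of k measurements per subject.
    The choice of ICC(1,1) is an assumption; the paper does not name
    the ICC form it used.
    """
    n = len(pairs)
    k = len(pairs[0])
    grand = sum(sum(p) for p in pairs) / (n * k)
    means = [sum(p) / k for p in pairs]
    # Between-subject and within-subject mean squares
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for p, m in zip(pairs, means) for x in p) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical (Day 1, Day 2) IVR total scores for 5 subjects (not study data).
demo_pairs = [(18, 17), (10, 12), (25, 24), (7, 7), (30, 28)]
icc = icc_oneway(demo_pairs)
print(f"ICC(1,1) = {icc:.3f}")
```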

Figure 1. Scatter plots of Interactive Voice Response (IVR) Hamilton Rating Scale for Depression (HAM-D) total scores on Days 1 and 2 (n = 59). ICC, intraclass correlation coefficient.

Regarding concurrent validity, Pearson's correlation coefficient of HAM-D total scores on Day 1 was 0.81 (P < 0.0001) for clinician HAM-D versus IVR HAM-D and 0.93 (P < 0.0001) for clinician HAM-D versus psychologist HAM-D (Fig. 2), indicating strong correlations and high concurrent validity. Mean HAM-D total scores for clinician, psychologist, and IVR (Days 1 and 2) are shown in Table 1. With respect to the differences in HAM-D total scores, the mean IVR HAM-D score (Day 1) was approximately 3 and 4 points higher than the mean clinician HAM-D (t = 5.2, d.f. = 59, P < 0.001, paired t-test) and psychologist HAM-D (t = 6.0, d.f. = 59, P < 0.001) scores, respectively, indicating that IVR tends to overestimate depression severity compared with the clinician and psychologists.
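The paired comparison reported above follows the usual paired t-test construction: the statistic is the mean of the within-subject differences divided by its standard error, with n − 1 degrees of freedom (59 for 60 patients). A minimal sketch follows; the function name `paired_t` and the score lists are hypothetical and are not study data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, degrees of freedom) for a paired t-test of a versus b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical total scores for 6 subjects (not study data).
ivr_scores = [20, 15, 22, 9, 30, 18]
clinician_scores = [16, 13, 20, 8, 25, 14]
t_stat, dof = paired_t(ivr_scores, clinician_scores)
print(f"t = {t_stat:.2f}, d.f. = {dof}")
```

A positive t here means the first series (IVR) is higher on average than the second (clinician), which is the direction of the difference described in the text.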

Figure 2. Scatter plots of Hamilton Rating Scale for Depression (HAM-D) total scores on Day 1 (n = 60). (a) Plots of clinician HAM-D versus Interactive Voice Response (IVR) HAM-D, Pearson's r = 0.81. (b) Plots of clinician HAM-D versus psychologist HAM-D, r = 0.93.

Table 1. Mean HAM-D total score determined by different assessments on Days 1 and 2

Assessment            Day   n    HAM-D total score, mean (SD)
IVR HAM-D             1     60   18.1 (7.5)
IVR HAM-D             2     59   17.6 (8.3)
Clinician HAM-D       1     60   15.1 (6.7)
Psychologist HAM-D    1     60   14.2 (6.2)

IVR, Interactive Voice Response; HAM-D, Hamilton Rating Scale for Depression.

The inter-rater consistency of each HAM-D item determined by different assessments (i.e. IVR, clinician and psychologist HAM-D) was examined using Cohen's kappa coefficient (Table 2). The obtained coefficients between IVR (Day 1) versus IVR (Day 2) HAM-D scores and those between clinician versus psychologist showed moderate and substantial agreement, respectively (mean: 0.55 [SD 0.13] and 0.62 [SD 0.18]). However, relatively lower values were obtained for coefficients between clinician versus IVR (Day 1) HAM-D scores (mean 0.25 [SD 0.13]), indicating that observer agreement between clinician and IVR was ‘fair’ according to the criteria by Landis and Koch.[17]
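As an illustration of the item-level agreement statistic, the sketch below computes unweighted Cohen's kappa for two raters scoring the same subjects on one item. Whether the study used unweighted or weighted kappa is not stated, so the unweighted form is an assumption; the function name `cohen_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two equal-length lists of category scores.

    Undefined (division by zero) when both raters give every subject the
    same single category, because chance agreement is then 1.
    """
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from the two raters' marginal category frequencies
    expected = sum(c1[c] * c2[c] for c in set(r1) | set(r2)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical item scores from two raters for 8 subjects (not study data).
rater_a = [0, 1, 2, 1, 0, 2, 1, 0]
rater_b = [0, 1, 1, 1, 0, 2, 2, 0]
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Because kappa counts only exact matches, two raters whose item scores rise and fall together but differ by one point can show a clear correlation and still a low kappa.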

Table 2. Inter-rater consistency of each HAM-D item by different assessments on Days 1 and 2

Cohen's kappa coefficient
                                        IVR vs clinician    IVR (Day 1) vs IVR    Clinician vs
HAM-D item                              (Day 1) (n = 60)    (Day 2) (n = 59)      psychologist (n = 60)
01 Depressed mood                       0.33                0.56                  0.52
02 Feelings of guilt                    0.08                0.55                  0.69
03 Suicide                              0.50                0.64                  0.79
04 Early insomnia                       0.26                0.74                  0.70
05 Middle insomnia                      0.23                0.71                  0.60
06 Late insomnia                        0.21                0.49                  0.78
07 Work and activities                  0.11                0.54                  0.53
08 Retardation                          0.18                0.36                  0.25
09 Agitation                            0.13                0.39                  0.32
10 Mental anxiety                       0.02                0.39                  0.48
11 Somatic anxiety                      0.19                0.44                  0.57
12 Somatic symptoms (gastrointestinal)  0.33                0.70                  0.85
13 Somatic symptoms (general)           0.39                0.52                  0.44
14 Genital symptoms                     0.49                0.78                  0.77
15 Hypochondriasis                      0.25                0.65                  0.66
16 Loss of weight by history            0.27                0.52                  0.93
17 Insight                              0.23                0.41                  0.66
Average (SD)                            0.25 (0.13)         0.55 (0.13)           0.62 (0.18)

IVR, Interactive Voice Response; HAM-D, Hamilton Rating Scale for Depression.
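The summary row of Table 2 and the verbal agreement labels can be reproduced from the 17 item-level coefficients. The sketch below uses the kappa values from Table 2, takes the sample standard deviation (an assumption about how the SD was computed), and applies the Landis and Koch[17] bands; the function and variable names are illustrative only.

```python
from statistics import mean, stdev

# Item-level kappa values from Table 2, items 01 to 17 in order.
IVR_VS_CLINICIAN = [0.33, 0.08, 0.50, 0.26, 0.23, 0.21, 0.11, 0.18, 0.13,
                    0.02, 0.19, 0.33, 0.39, 0.49, 0.25, 0.27, 0.23]
IVR_DAY1_VS_DAY2 = [0.56, 0.55, 0.64, 0.74, 0.71, 0.49, 0.54, 0.36, 0.39,
                    0.39, 0.44, 0.70, 0.52, 0.78, 0.65, 0.52, 0.41]
CLINICIAN_VS_PSYCHOLOGIST = [0.52, 0.69, 0.79, 0.70, 0.60, 0.78, 0.53, 0.25,
                             0.32, 0.48, 0.57, 0.85, 0.44, 0.77, 0.66, 0.93,
                             0.66]


def landis_koch(kappa):
    """Verbal label for a kappa value using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def summarize(kappas):
    """Return (mean, sample SD, Landis-Koch label of the mean)."""
    m = mean(kappas)
    return round(m, 2), round(stdev(kappas), 2), landis_koch(m)


for name, column in [("IVR vs clinician", IVR_VS_CLINICIAN),
                     ("IVR Day 1 vs Day 2", IVR_DAY1_VS_DAY2),
                     ("Clinician vs psychologist", CLINICIAN_VS_PSYCHOLOGIST)]:
    print(name, summarize(column))
```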

Discussion

We examined, for the first time, the reliability and validity of the Japanese version of the IVR program to score the HAM-D scale in 60 patients with depressive disorder. The observed ICC of 0.93 between IVR HAM-D scores of Day 1 and Day 2 indicates that the test–retest reliability of the IVR HAM-D program for the assessment of depression severity is quite high. The computed Cronbach's alpha coefficients (0.77–0.83) showed acceptable to good internal consistency when the patients, the clinician or psychologists assessed HAM-D separately using the IVR program or by a clinical interview, suggesting that construct validity of the IVR program is acceptable to good as well.

With regard to the concurrent validity of overall depression severity assessment (i.e. total HAM-D score) using the IVR HAM-D program, Pearson's correlation coefficient of 0.81 between clinician HAM-D and IVR HAM-D supports its validity. The high coefficient (0.93) for clinician and psychologist assessments indicates very high inter-rater reliability in their assessments. Taken together, our data suggest that the Japanese IVR HAM-D program is reliable and valid for assessing the HAM-D total score in patients with depressive disorders. Although the HAM-D total score obtained by the IVR program and that obtained by the clinician showed a strong correlation (0.81), the former was three points higher than the latter on average, suggesting that adjustment is required when the program is introduced and used in regular clinical practice. A similar difference between IVR and clinician was reported in a previous study conducted in the USA.[11]

The computed Cohen's kappa coefficient for each HAM-D item assessed using the IVR program and by the clinician varied considerably depending on the item (Table 2). The obtained coefficients between IVR (Day 1) versus IVR (Day 2) HAM-D scores and those between clinician versus psychologist showed moderate and substantial agreement, respectively. However, relatively lower values were obtained for coefficients between clinician versus IVR (Day 1) HAM-D scores, indicating that observer agreement between clinician and IVR was fair.[17] Accordingly, caution is required when using individual HAM-D item scores from the IVR program in clinical practice. In particular, Cohen's kappa coefficients between clinician and IVR scores for the feelings of guilt (0.08), work and activities (0.11), retardation (0.18), agitation (0.13) and mental anxiety (0.02) items were very low (see Table 2). When these items were scrutinized, a significant correlation between clinician and IVR scores (nominal P < 0.05) was obtained for feelings of guilt (r = 0.44, P < 0.001), work and activities (r = 0.39, P = 0.002), and mental anxiety (r = 0.27, P = 0.036), but not for retardation (r = 0.25, P = 0.056) or agitation (r = 0.02, P = 0.86). Because agitation and retardation on the HAM-D scale must be scored on the basis of objective observation in the clinical setting, the low correlations and low kappa values for these items are not surprising. Taken together, adjustment or improvement of the IVR program is required to obtain higher kappa values for some items (e.g. feelings of guilt, work and activities, and mental anxiety), and a face-to-face interview might be required to obtain accurate scores for retardation and agitation, which might be a limitation of the current IVR program.

There are several limitations in the study. First, this study included a relatively small number of subjects with severe depression; there were only five subjects whose clinician HAM-D score was 24 or more. Secondly, the imbalance of the subjects' sexes (40 men and 20 women) and the fact that the study was performed at only one site should also be taken into account. Thirdly, the psychologists rated the HAM-D score by attending and referring to the clinician's interview, not by interviewing the patients themselves. Therefore, the psychologists' rating is not entirely independent, which may have biased the inter-rater reliability result towards the observed high correlation (r = 0.93) between clinician and psychologist HAM-D scores. Clinicians and psychologists should take these limitations into careful consideration when assessing depression severity in clinical practice.

In conclusion, despite the limitations of this study, our results confirm that the Japanese version of the IVR HAM-D program is a reliable and valid method to assess overall depression severity in clinical practice. Although the total score was reliably and validly obtained, the score of each item did not always show high agreement. Further research in a larger number of subjects will provide Japanese clinicians more detailed information regarding the use of the IVR HAM-D program.

Acknowledgments

We acknowledge the various contributions of all the participants in the research. Hiroshi Kunugi designed the study, rated the subjects (clinician HAM-D), and wrote the manuscript. Miyako Hashikura and Norie Koga assessed the subjects as psychologists using the simultaneous interview method. Takamasa Noda and Yu Shimizu supported subject recruitment and provided valuable comments on the manuscript of the study. Takayuki Kobayashi, who is a full-time employee of GSKKK, designed the study protocol in collaboration with the first author. Jun Yamanaka (another full-time employee of GSKKK) coordinated with ERT, and supported publication creation. Noriaki Kanemoto (another full-time employee of GSKKK) advised the lead author concerning statistical analyses. Teruhiko Higuchi supervised the study conduct and participated in the preparation of the manuscript. We would also like to express our appreciation to ERT, the service provider of the IVR HAM-D program used in the study. Funding for this research was provided by GSKKK.

References

  1. Hamilton M. A rating scale for depression. J. Neurol. Neurosurg. Psychiatry 1960; 23: 56–62.
  2. Williams JB. A structured interview guide for the Hamilton Depression rating scale. Arch. Gen. Psychiatry 1988; 45: 742–747.
  3. Bagby RM, Ryder AG, Schuller DR, Marshall MB. The Hamilton Depression Rating Scale: Has the gold standard become a lead weight? Am. J. Psychiatry 2004; 161: 2163–2177.
  4. Demitrack MA, Faries D, Herrera JM, DeBrota DJ, Potter WZ. The problem of measurement error in multisite clinical trials. Psychopharmacol. Bull. 1998; 34: 19–24.
  5. Greist J, Mundt J, Jefferson J, Katzelnick D. Comments on ‘Why do clinical trials fail? The problem of measurement error in clinical trials: Time to test new paradigms?’ J. Clin. Psychopharmacol. 2007; 27: 535–537.
  6. Kobak KA, Brown B, Sharp I et al. Sources of unreliability in depression ratings. J. Clin. Psychopharmacol. 2009; 29: 82–85.
  7. Kobak KA, Lipsitz J, Williams JB, Engelhardt N, Jeglic E, Bellew KM. Are the effects of rater training sustainable? Results from a multicenter clinical trial. J. Clin. Psychopharmacol. 2007; 27: 534–535.
  8. Kobak KA, Mundt JC, Greist JH, Katzelnick DJ, Jefferson WJ. Computer assessment of depression: Automating the Hamilton depression rating scale. Drug Inf. J. 2000; 34: 145–156.
  9. Kobak KA, Greist JH, Jefferson JW, Mundt JC, Katzelnick DJ. Computerized assessment of depression and anxiety over the telephone using interactive voice response. MD Comput. 1999; 16: 64–68.
  10. Moore HK, Mundt JC, Modell JG et al. An examination of 26,168 Hamilton depression rating scale scores administered via interactive voice response across 17 randomized clinical trials. J. Clin. Psychopharmacol. 2006; 26: 321–324.
  11. Mundt JC, Kobak KA, Taylor LV et al. Administration of the Hamilton Depression Rating Scale using interactive voice response technology. MD Comput. 1998; 15: 31–39.
  12. Kobak KA, Kane JM, Thase ME, Nierenberg AA. Why do clinical trials fail? The problem of measurement error in clinical trials: Time to test new paradigms? J. Clin. Psychopharmacol. 2007; 27: 1–5.
  13. Mundt JC, Greist JH, Jefferson JW et al. Is it easier to find what you are looking for if you think you know what it looks like? J. Clin. Psychopharmacol. 2007; 27: 121–125.
  14. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 4th edn, Text Revision (DSM-IV-TR). APA, Washington, DC, 2000.
  15. Sheehan DV, Lecrubier Y, Sheehan KH et al. The Mini-International Neuropsychiatric Interview (M.I.N.I.): The development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. J. Clin. Psychiatry 1998; 59 (Suppl. 20): 22–57.
  16. Nakane Y, Williams JB. Japanese Version of Structured Interview Guide for the Hamilton Depression Rating Scale (SIGH-D). Seiwa Shoten, Tokyo, 2004 (in Japanese).
  17. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.