An editorial comment on Lepping P, Schönfeldt-Lecuona C, Sambhi RS, Lanka SVN, Lane S, Whittington R, Leucht S, Poole R ‘A Systematic Review of the Clinical Relevance of Repetitive Transcranial Magnetic Stimulation’.
How can we make the results of trials and their meta-analyses using continuous outcomes clinically interpretable?
Article first published online: 11 APR 2014
© 2014 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd
Acta Psychiatrica Scandinavica
Volume 130, Issue 5, pages 321–323, November 2014
How to Cite
Furukawa, T. A. (2014), How can we make the results of trials and their meta-analyses using continuous outcomes clinically interpretable?. Acta Psychiatrica Scandinavica, 130: 321–323. doi: 10.1111/acps.12278
- Issue published online: 13 OCT 2014
Our scientific community seems by now to have reached broad agreement that repetitive transcranial magnetic stimulation (rTMS) produces statistically significantly greater improvement than sham rTMS in the treatment of acute depression. However, the clinical relevance of this statistical superiority remains unclear, and Lepping and colleagues in this issue of Acta Psychiatrica Scandinavica addressed this problem with a novel and clinically intuitive approach.
They first systematically searched for clinical trials that compared rTMS against sham rTMS in the treatment of depression and that used the Hamilton Depression Rating Scale (HAMD). They then calculated the percentage change in HAMD scores for each arm, took their weighted average across all arms of rTMS or sham rTMS, and then converted this average percentage change to the corresponding Clinical Global Impression-Improvement Score (CGI-I), using the established conversion method. Because the CGI-I has clearly interpretable anchor scores, ranging from 1 = very much improved, 2 = much improved, 3 = minimally improved, 4 = no change through 7 = very much worse, it is now easy to assess the effects of rTMS or sham rTMS qualitatively and clinically.
The improvement on rTMS was −36% on HAMD, which corresponded with 2.9 on CGI-I, while that on sham rTMS was −23% on HAMD and 3.4 on CGI-I. The difference between rTMS and sham rTMS is therefore 0.5 on the 7-point CGI-I scale, which does appear small, and the review authors concluded ‘Whilst rTMS appears to be efficacious (…), the clinical relevance of its efficacy is doubtful’.
Two of the most recent and comprehensive reviews on the same topic reached different conclusions, using different summary methods.
Slotema et al. conducted a standard systematic review and meta-analysis of rTMS versus sham rTMS and found an effect size of 0.55 (P < 0.001) and concluded ‘It is time to provide rTMS as a clinical treatment method for depression’. Berlim et al. focused on high-frequency rTMS and the dichotomized outcomes of response rates and remission rates and found an odds ratio of 3.3 (P < 0.0001) of rTMS over sham rTMS for both outcomes and concluded that rTMS ‘seems to be associated with clinically relevant antidepressant effects’.
Acknowledging that these three reviews included overlapping yet different sets of trials and focused on related yet different outcomes, how can the conclusions still be so different?
Basically, there are three ways to make continuous outcomes, such as ratings of depression severity, interpretable: (i) conversion to SD (standard deviation) units, (ii) conversion to dichotomous outcomes, and (iii) conversion to natural units.
Slotema et al. took the first approach: the effect size, also known as standardized mean difference, is the difference in the mean scores of the two groups, divided by the standard deviation. ‘0.55’ is an enigmatic number, unless we use the rough rule of thumb for interpreting the effect size proposed by Cohen that 0.2 represents a small effect, 0.5 a moderate effect, and 0.8 a large effect.
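As a brief illustration, the standardized mean difference described above can be written out as follows. The use of a pooled SD in the denominator, and the symbols themselves, are assumptions made here for concreteness; the text only says "the standard deviation", and individual meta-analyses may use a different variant.

```latex
% Standardized mean difference (effect size) between two arms.
% \bar{x}_1, \bar{x}_2 : mean scores of the two groups
% s_1, s_2             : group SDs;  n_1, n_2 : group sizes
% The pooled SD is an assumed (commonly used) choice of denominator.
\[
  d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pooled}}},
  \qquad
  s_{\mathrm{pooled}} =
  \sqrt{\frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}}
\]
```

Under this definition, an effect size of 0.55 means that the group means differ by a little more than half a standard deviation.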
Berlim et al. used the second approach by meta-analyzing response and remission rates. Because this is a meta-analysis of dichotomous outcomes, the results can be legitimately summarized as an odds ratio. Unfortunately, here again, clinicians are at a loss to evaluate if an odds ratio of 3.3 represents a large, moderate, or small effect.
Fortunately, both the effect size and the odds ratio can be converted to the number needed to treat (NNT), assuming a certain control event rate [6, 7]. Assuming the clinically expected response rate of 30% on sham rTMS for ordinary depression, the effect size of 0.55 corresponds with an NNT of 5, and the odds ratio of 3.3 with an NNT of 4.
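The two conversions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the effect-size conversion assumes normally distributed outcomes with equal SD in both arms, which is one commonly used approach and not necessarily the exact procedure of the cited references, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def nnt_from_effect_size(d, control_event_rate):
    """NNT from a standardized mean difference d.

    Assumption: outcomes are normally distributed with equal SD in both
    arms, so the treated-arm event rate is Phi(d + Phi^-1(CER)).
    """
    z = NormalDist()  # standard normal distribution
    treated_rate = z.cdf(d + z.inv_cdf(control_event_rate))
    return 1.0 / (treated_rate - control_event_rate)


def nnt_from_odds_ratio(odds_ratio, control_event_rate):
    """NNT from an odds ratio, given the control event rate."""
    control_odds = control_event_rate / (1.0 - control_event_rate)
    treated_odds = odds_ratio * control_odds
    treated_rate = treated_odds / (1.0 + treated_odds)
    return 1.0 / (treated_rate - control_event_rate)


# Values taken from the text: 30% response on sham rTMS.
nnt_es = nnt_from_effect_size(0.55, 0.30)
nnt_or = nnt_from_odds_ratio(3.3, 0.30)

# NNTs are conventionally rounded up to the next whole patient.
print(math.ceil(nnt_es), math.ceil(nnt_or))
```

With a control event rate of 30%, rounding up gives NNTs of about 5 for the effect size and 4 for the odds ratio. The result depends on the assumed control event rate, so a different expected sham response would give different NNTs.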
Lepping et al.'s method is a variant of the third approach, namely to convert the results to natural, interpretable units. In itself, this method, summarized above, is straightforward and the results should be very easy to interpret clinically. Let us, however, focus on what they mean for individual patients. Here, we have to keep a sharp distinction between individual change, group mean change, and difference in group mean changes.
On average, the patients on rTMS achieved a CGI-I of 2.9, that is, an improvement between 2 = much improved and 3 = minimally improved. On the other hand, the patients on sham rTMS achieved a mean CGI-I of 3.4, that is, an improvement between 3 = minimally improved and 4 = no change. It must be noted that these are all group averages: Individual patients may have made more or less improvement than this average.
So how many of these patients will have achieved a CGI-I score of 3 = minimally improved or better, that is, reached the minimum clinically meaningful threshold of individual change? To make some estimates, we can use a formula based on the mean and SD. The SD of the CGI-I itself is unavailable in Lepping et al.'s study, but we do have an SD for percentage change in HAMD, which is 16% [Table 3 in ], which would roughly correspond to 0.7 points on the CGI-I using the conversion formula [Table 1 in ]. If we can assume that the percentage changes in HAMD and the corresponding CGI-I scores are approximately normally distributed, the percentage of patients showing minimal or greater improvement can be calculated to be 56% on rTMS and 28% on sham rTMS using this formula. The NNT then is 4.
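The estimate above can be reproduced with a short Python sketch. It assumes, as the text does, that CGI-I scores are approximately normally distributed with an SD of 0.7 in both arms, and it treats the 7-point scale as continuous; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def proportion_at_or_below(mean, sd, threshold):
    """Share of a normal distribution lying at or below `threshold`.

    On the CGI-I, lower scores mean more improvement, so "minimally
    improved or better" corresponds to a score of 3 or less.
    """
    return NormalDist(mu=mean, sigma=sd).cdf(threshold)


SD_CGI = 0.7      # approximate SD on the CGI-I, as derived in the text
THRESHOLD = 3.0   # 3 = minimally improved

p_rtms = proportion_at_or_below(2.9, SD_CGI, THRESHOLD)  # mean CGI-I on rTMS
p_sham = proportion_at_or_below(3.4, SD_CGI, THRESHOLD)  # mean CGI-I on sham

nnt = 1.0 / (p_rtms - p_sham)

print(f"rTMS: {p_rtms:.0%}, sham: {p_sham:.0%}, NNT: {math.ceil(nnt)}")
```

Under these assumptions the proportions come out at roughly 56% and 28%, and the NNT rounds up to 4. Because the normality and equal-SD assumptions are approximations, the figures are best read as rough estimates.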
An astonishing agreement? But this should be no wonder because, after all, all three reviews dealt with an essentially identical clinical question and used largely overlapping trials in the literature.
The bottom line messages of this editorial comment then are two. First, the traditional summary method of systematic reviews remains hard to interpret clinically and leaves room for improvement [9, 10]. Second, anchoring to natural units is a way forward but there is a caveat here too: the difference in group mean changes (i.e., 0.5 in this case on CGI-I), and the group mean changes themselves (i.e., 2.9 in this case on CGI-I for patients undergoing rTMS) need to be evaluated from a different perspective than the individual change (e.g., one point to move up slightly on the improvement ladder of seven).
Whether we think an NNT of 4 or 5 to bring about one more response for acute depression with rTMS than with sham rTMS is clinically meaningful or not is a value judgment on the part of society and the individual patients, with due consideration for the sufferings that depression brings, the possible adverse effects of rTMS, and the availability and relative benefits and risks of alternative treatments.
Declaration of interest
TAF has received lecture fees from Eli Lilly, Meiji, Mochida, MSD, Pfizer, and Tanabe-Mitsubishi, and consultancy fees from Sekisui and Takeda Science Foundation. He is a diplomate of the Academy of Cognitive Therapy. He has received royalties from Igaku-Shoin, Seiwa-Shoten, and Nihon Bunka Kagaku-sha. The Japanese Ministry of Education, Science, and Technology, the Japanese Ministry of Health, Labor and Welfare, and the Japan Foundation for Neuroscience and Mental Health have funded his research projects.
- 4. Response, remission and drop-out rates following high-frequency repetitive transcranial magnetic stimulation (rTMS) for treating major depression: a systematic review and meta-analysis of randomized, double-blind and sham-controlled trials. Psychol Med 2014;44:225–239.
- 5. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum; 1988.
- 6. Measurement of patients' experience. In: Guyatt G, Drummond R, Meade MO, Cook DJ, eds. Users' guides to the medical literature: a manual for evidence-based clinical practice. 2nd edn. New York: The McGraw-Hill Companies, Inc.; 2008. 249–271.
- 7. From effect size into number needed to treat. Lancet 1999;15:353.