Compelling or irrelevant? Using number needed to treat can help decide

Authors

  • L. Citrome

    1. Department of Psychiatry, New York University School of Medicine, the Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA

Leslie Citrome, MD, MPH, Nathan S. Kline Institute for Psychiatric Research, 140 Old Orangeburg Road, Orangeburg, NY 10962, USA.
E-mail: citrome@nki.rfmh.org

Abstract

Objective:  The metric of number needed to treat (NNT), defined as the number of patients who need to be treated to achieve one additional favorable outcome, can help clinicians appraise claims that one intervention is meaningfully superior to the other.

Method:  A review of the use of NNT to evaluate the differences between interventions in the treatment of depression, schizophrenia and bipolar disorder. Instead of using disparate measures such as point change on a rating scale, kilograms gained over time or relative differences, results can be converted into a common unit of measure –‘patient units’– so that the clinician can anticipate how often actual differences between interventions would be expected to be observed. Calculation of NNT is demonstrated using reports published in the psychiatric literature, together with different graphical techniques to display this.

Results:  Clinical trial results expressed as NNT can be easily summarized and communicated effectively to patients, their families and payers. Limitations include ensuring that the NNT metric is calculated from well-designed and well-conducted research that enrolls subjects similar to patients that one treats in actual clinical practice, with doses of medications similar to what is used in the ‘real world’. Direct calculation of NNT is limited to binary or dichotomous outcomes.

Conclusion:  Using NNT can help predict treatment response in terms of both efficacy and tolerability.

Clinical recommendations

  • NNT can be used to appraise the clinical relevance of a statistically significant result.
  • NNT can be used to assess both the advantages and disadvantages between two competing interventions.
  • Before agreeing that a statistically significant clinical trial result is important to the clinician and patient, the NNT can be calculated to determine if this difference will likely be encountered in day-to-day clinical practice; NNT will inform if the clinical trial result is compelling or if it is irrelevant.

Additional comments

  • NNT is an effect size and is independent of statistical significance or ‘P-value’; thus NNT is best used together with its confidence interval.
  • NNT is best calculated from well-designed and well-conducted research.
  • NNT can only be calculated for binary (yes/no) outcomes; other measures of effect size will be needed for continuous variables or alternatively the continuous outcome will need to be converted to a binary one.

Introduction

Clinical trials can provide a large amount of data, some of it useful in day-to-day practice, some of it not. Clinicians often struggle to find interventions that make a difference in the wellbeing of patients. It is not always easy to discern whether or not a study result should actually change routine practice. Fortunately, there are now tools available to help gauge clinical significance or clinical relevance, rather than focusing exclusively on whether or not a study result is statistically significant at the ‘P < 0.05 level’.

These clinician-friendly tools fall under the general rubric of ‘Evidence-Based Medicine’ (EBM). EBM techniques can help answer clinical questions regarding two different competing interventions being proposed for an individual patient. Gray and Pinson (1) describe the five steps of EBM: i) formulate the question, ii) search for answers, iii) appraise the evidence, iv) apply the results and v) assess the outcome. Clinical judgment and clinical expertise are still required to make the best decision possible. EBM is not ‘cookbook medicine’, nor is it another way of implementing treatment guidelines (2). The first step is formulating the question so that the clinician can identify a search strategy to find published evidence that answers the question. This evidence can vary in quality, from anecdotal reports, which are subject to bias and hence of lower value, to the ‘gold standard’ of randomized clinical trials and systematic reviews of randomized clinical trials. Searching for evidence is made substantially easier by the availability of the internet and on-line search portals such as PubMed or HighWire (http://pubmed.gov; http://highwire.org/; see also reference (3)). These portals make it easier to locate a journal article entry and usually provide a link to the publisher’s website. Many journal articles are available as a free download for everyone. Clinicians with hospital or university affiliations can get broader access to even more journal titles. Other resources include EBM online (http://ebm.bmj.com) and the Cochrane Collaboration (http://www.cochrane.org).

The clinician will need to identify evidence that differences between treatments are both statistically significant and clinically significant or relevant. In the treatment of many mental disorders, such as depressive disorder or schizophrenia, differences between drugs are clinically relevant when the differences are commonly encountered in day-to-day clinical practice. This is not the same as statistical significance in that the latter denotes whether or not the result could have arisen from chance. For example, when comparing drug A and drug B on the outcome of remission of a major depressive episode at 6 weeks (Fig. 1), if remission with drug A occurs 30.5% of the time and with drug B 31.5% of the time, this result could be statistically significant, particularly if the sample sizes are large enough. However, the clinician will not commonly observe any difference between drug A and drug B in day-to-day clinical practice, and hence this 1% difference is not likely to be clinically important. Figure 1 was deliberately formatted to accentuate the difference between drug A and drug B. Despite the impressive P-value, the results are not clinically relevant.

Figure 1.

 The difference in remission for a major depressive episode at 6 weeks for drug A vs. drug B is highly statistically significant.

Clinical significance can be quantified by calculating the effect size between two interventions. Effect sizes are used to standardize the descriptions of differences between interventions, risk factors or anything else being compared. There are a number of measures of effect size that are commonly used (4), and one in particular is very easy to calculate and understand: number needed to treat (NNT) (5).

Aims of the study

The aim of this study was: i) to describe how to calculate number needed to treat (NNT) and apply this to the individual care of patients and ii) to demonstrate different graphical representations of NNT.

Material and methods

The calculation and application of NNT is demonstrated using data from reports published in the psychiatric literature. Differences between interventions in the treatment of depression, schizophrenia and bipolar disorder are described. Instead of depending on disparate measures such as point change on a rating scale, kilograms gained over time or relative differences, results can be converted into a common unit of measure –‘patient units’– so that the clinician can anticipate how often actual differences between interventions would be expected to be observed.

Results

NNT was introduced about 20 years ago by Laupacis and colleagues as a clinically useful measure of the consequences of treatment (6). NNT is a number that represents how many patients one would need to treat with one intervention vs. another to see a difference in an outcome. If the NNT were 2, it would mean with every two patients a difference in outcome would be seen. This is a very large effect size (Fig. 2). In general, single digit NNT values represent differences that would be apparent in day-to-day clinical practice and hence compelling. A NNT of 50 or 100 or more would indicate differences that are largely irrelevant. However, clinical judgment must be exercised when evaluating if a NNT is important. For example, an intervention would be avoided if it results in an additional death once every 1000 patients (NNT 1000). Some interventions have a NNT of over 100 and yet can be quite compelling – for example, the treatment of mild or moderate diastolic hypertension (90–109 mmHg) will prevent death, stroke or myocardial infarction with a NNT of 141 to prevent one event in a 5-year period (7). In the opposite direction, a NNT of 3 for a mild dry mouth would likely be irrelevant for medications that otherwise show excellent efficacy. Gauging the importance of a particular NNT will also be dependent on what the patient agrees is important, particularly when assessing the different risks for adverse outcomes. The NNT for an adverse outcome or a disadvantage can be referred to as a number needed to harm (NNH).

Figure 2.

 Number needed to treat (NNT) ranges from 1 (essentially theoretical) to infinity.

The concept of NNT is valuable because of its simplicity. It is easy to calculate for any binary outcome such as response/no response, remission/no remission and presence of an adverse event/absence of an adverse event. A clinical trial report will usually provide the proportion of patients who have the outcome of interest for each intervention. To calculate NNT, we first take the absolute difference in rates for the outcome of interest (formally called the ‘absolute risk reduction’). The reciprocal of this difference is the NNT. Rounding up to the next higher whole number is necessary in order not to potentially exaggerate differences between treatments (for example an NNT of 3.2 or 3.7 is rounded up to 4). An example is the prior hypothetical comparison of drug A and drug B on the outcome of remission of a major depressive episode at 6 weeks. The rates were 30.5% and 31.5%, so the absolute difference in rates was 1% (0.01). NNT is 1 divided by 0.01, or 100. This means the clinician would have to treat 100 patients with drug B instead of drug A to encounter one extra remission. This is not a compelling difference, despite its statistical significance.
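The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than code from any of the cited reports; the function name `nnt` and the 9-decimal cleanup step (used only to keep binary floating-point noise from pushing a whole number up to the next integer) are assumptions made for the example.

```python
import math


def nnt(rate_a: float, rate_b: float) -> int:
    """Number needed to treat from two outcome proportions (each between 0 and 1)."""
    arr = abs(rate_a - rate_b)  # absolute difference in rates ('absolute risk reduction')
    if arr == 0:
        raise ValueError("rates are identical; NNT is infinite")
    raw = 1.0 / arr  # reciprocal of the absolute difference
    # Remove floating-point noise (e.g. 10.000000000000002), then round up
    # to the next whole number so the difference is not exaggerated.
    return math.ceil(round(raw, 9))


# Remission rates for drug A (30.5%) and drug B (31.5%) from the text.
print(nnt(0.305, 0.315))
```

The same function applies to an adverse outcome, in which case the result is read as a number needed to harm (NNH).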

Differences between interventions may also be accentuated by focusing on the relative instead of the absolute difference. This is illustrated in Fig. 3 where drug A’s effect on remission is 30%, but drug B’s is only 20%. Drug A appears ‘50% better’ than drug B in achieving remission, because 0.30/0.20 = 1.5, a 50% relative difference. However, to translate this into how often one would see this difference in clinical practice requires the calculation of NNT, 1/(0.30 − 0.20) = 10. The difference between drug A vs. drug B in achieving one additional remitted patient will be encountered only every 10 patients given drug A instead of drug B. This is not necessarily going to sway treatment if there are other considerations such as speed of onset, sustained response, tolerability or cost.

Figure 3.

 Relative vs. absolute differences. Drug A is ‘50% better’ than drug B, yet number needed to treat (NNT) is 10.
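The contrast between the relative and the absolute difference in Fig. 3 can be reproduced numerically. The sketch below is illustrative only; the helper name `compare` and its returned keys are hypothetical and chosen for this example.

```python
import math


def compare(rate_a: float, rate_b: float) -> dict:
    """Contrast relative and absolute differences between two remission rates."""
    absolute = rate_a - rate_b        # absolute difference, as a proportion
    if absolute == 0:
        raise ValueError("rates are identical; NNT is infinite")
    relative = rate_a / rate_b - 1.0  # relative difference, with drug B as the reference
    # Remove floating-point noise before rounding up to a whole patient.
    nnt = math.ceil(round(1.0 / abs(absolute), 9))
    return {"relative": relative, "absolute": absolute, "nnt": nnt}


result = compare(0.30, 0.20)  # drug A 30% vs. drug B 20%, as in Fig. 3
print(f"relative difference: {result['relative']:.0%}")
print(f"absolute difference: {result['absolute']:.0%}")
print(f"NNT: {result['nnt']}")
```

The same pair of rates therefore yields a ‘50%’ relative advantage and an NNT of 10, which is why the absolute difference is the more useful quantity for anticipating what will be seen in practice.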

Confidence intervals (CI) should be reported along with the NNT. If the CI includes ‘infinity’, it means there is the possibility that it would take an infinite number of patients to be treated with one intervention instead of the other to encounter a difference. This really means there is no difference. If the CI has a lower bound that is a negative number and an upper bound that is a positive number, the result is not statistically significant because the CI includes infinity (8). On-line calculators are available to determine the CI, such as the website maintained by the University of Toronto’s Centre for Evidence-Based Medicine (http://www.cebm.utoronto.ca).
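For readers who want to see how such an interval behaves, the following sketch computes a 95% CI for the absolute difference and takes the reciprocals of its limits. It assumes a simple normal-approximation (Wald) interval; the text does not state which method the on-line calculators use, so this is one plausible approach rather than a description of them. The function name, the returned keys and the patient counts are hypothetical.

```python
import math


def nnt_with_ci(events_a: int, n_a: int, events_b: int, n_b: int, z: float = 1.96) -> dict:
    """NNT with an approximate 95% CI (assumption: Wald interval on the rate difference)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lo, hi = diff - z * se, diff + z * se  # CI for the absolute difference
    includes_infinity = lo <= 0.0 <= hi    # a difference of zero corresponds to NNT = infinity
    if includes_infinity:
        # Not statistically significant: the NNT interval is two segments
        # that meet at infinity, one negative and one positive.
        segments = []
        if lo < 0:
            segments.append((-math.inf, 1.0 / lo))
        if hi > 0:
            segments.append((1.0 / hi, math.inf))
    else:
        segments = [tuple(sorted((1.0 / lo, 1.0 / hi)))]
    return {
        "nnt": math.inf if diff == 0 else 1.0 / diff,  # signed, not rounded
        "includes_infinity": includes_infinity,
        "segments": segments,
    }


# Hypothetical counts: the same 30% vs. 20% rates at two sample sizes.
small = nnt_with_ci(30, 100, 20, 100)
large = nnt_with_ci(300, 1000, 200, 1000)
print(small["includes_infinity"], small["segments"])
print(large["includes_infinity"], large["segments"])
```

With 100 patients per arm the interval for the difference crosses zero, so the NNT interval runs through infinity; with 1000 per arm the same rates give a bounded interval. The limits are left unrounded here because the text specifies rounding only for the point estimate.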

Applying NNT to the treatment of major depressive disorder

The efficacy of duloxetine for the treatment of major depressive disorder was evaluated using NNT (9). Data were pooled from several clinical trials that tested 8–9 weeks of acute treatment with duloxetine as well as fluoxetine or paroxetine, compared with placebo. The Hamilton Depression Rating Scale was used to define response and remission, and improvements were also measured by the Clinical Global Impression Scale. Table 1 reveals a favourable NNT for all active treatments vs. placebo. For example, NNT for remission for duloxetine vs. placebo was 7, meaning for every seven patients randomized to duloxetine instead of placebo, one additional remitted patient was observed, as defined by a Hamilton Depression Scale score of no more than 7 at endpoint. The 95% CI ranges from 5 to 20, meaning there is a 95% statistical likelihood that the actual NNT can be as strong as 5 or as weak as 20. Although the NNTs for duloxetine vs. placebo are stronger (i.e. lower absolute numbers) than those for fluoxetine or paroxetine vs. placebo, the differences between duloxetine and fluoxetine or paroxetine are not statistically significant. This can be surmised by inspecting the CIs (they overlap) or by directly testing this as the authors have done in their report (9). Before applying these results to patients in a clinical practice, it would be important to discern whether or not the patients enrolled in the pertinent clinical trials are similar to the actual patients for whom the intervention is being proposed (this can be assessed by inspection of the report’s demographic and baseline clinical characteristics table). Another consideration is what doses were tested in the clinical trials – duloxetine was dosed at 80 to 120 mg/day, and fluoxetine and paroxetine were dosed at 20 mg/day. Other factors to contemplate include potential differences in tolerability and patient preference, which for many patients are inextricably linked.

Table 1.   Treatments for major depressive disorder: NNT (and 95% CIs) for duloxetine 80–120 mg/day or serotonin-specific reuptake inhibitors (fluoxetine or paroxetine 20 mg/day) vs. placebo [adapted from Table 4 of Cookson et al. (9)]

Outcome | NNT duloxetine vs. placebo (95% CI) | NNT fluoxetine or paroxetine vs. placebo (95% CI)
Response (Hamilton Depression Scale score decrease by at least 50% at endpoint) | 6 (4, 13) | 7 (5, 18)
Remission (Hamilton Depression Scale score no more than 7 at endpoint) | 7 (5, 20) | 11 (not significant)
Improvement (Clinical Global Impression change from at least 4 at baseline to 1 or 2 at endpoint) | 7 (4, 15) | 8 (5, 40)

CI, confidence interval; NNT, number needed to treat.

Applying NNT to the treatment of schizophrenia

Patients with schizophrenia who discontinue antipsychotic medication are at high risk for relapse. Quantifying this risk can help communicate this issue to patients and their families. NNT for relapse risk can be calculated from data provided in a review of studies of antipsychotic withdrawal vs. maintenance (10). Mean relapse rates for the withdrawal and maintenance groups were 51.5% and 16.2% respectively. The difference was statistically significant (paired t = 12.48, df = 27, P = 0.0001). The NNT is 1/(difference in relapse rates) = 1/(0.515 − 0.162) = 1/0.353 ≈ 2.8, which is rounded up to 3. This means for every three patients remaining on maintenance antipsychotic treatment instead of discontinuing the antipsychotic, one extra relapse would be avoided. In other words, for every three patients discontinued from antipsychotic treatment, one extra relapse would be encountered.

The Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) for schizophrenia (11–14) can also be re-interpreted using NNT (15) to describe how often differences can be expected between the antipsychotic medications tested (Fig. 4). The NNT or NNH values reported below are all statistically significant. NNT for all-cause discontinuation in phase 1 for olanzapine compared with perphenazine, quetiapine, risperidone and ziprasidone ranged from 6 to 11 depending on the comparator, with an advantage for olanzapine (Fig. 5). For example, for every six patients randomized to olanzapine instead of quetiapine, there was one additional patient receiving olanzapine who did not discontinue phase 1 prematurely. Differences are also observed when comparing quetiapine with perphenazine or risperidone with a disadvantage for quetiapine (Fig. 6) (16). However, there were marked differences between antipsychotics regarding association with weight gain and metabolic effects, with olanzapine demonstrating a NNH (a disadvantage) ranging from 13 to 18 depending on the comparator, in terms of discontinuation of treatment in phase 1 because of these effects (15). Weight gain greater than 7% from baseline was more commonly encountered with olanzapine, with an NNH ranging from 5 to 8, depending on the comparator (15).

Figure 4.

 All-cause discontinuation in phase 1 of Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE).

Figure 5.

 All-cause discontinuation in phase 1 of Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) and number needed to treat (NNT) relative to olanzapine. Positive values of NNT indicate an advantage for olanzapine.

Figure 6.

 All-cause discontinuation in phase 1 of Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) and number needed to treat (NNT) relative to quetiapine. Negative NNT values indicate a disadvantage for quetiapine.

The additional phases of CATIE (phase 1B and the two pathways of phase 2) can provide information on the different outcomes after switching from one randomized medication to another (16). For the clozapine pathway of phase 2 (12), the NNTs for all-cause discontinuation for clozapine compared with risperidone or quetiapine were 4 and 3 respectively (Fig. 7a). The comparison of clozapine vs. olanzapine was not statistically significant, but yielded a NNT of 7. These results can also be shown with the corresponding CIs (Figs 7b,c), and serve as an illustration of some of the difficulties in plotting the CI for results that are not statistically significant (8). The reader may prefer to inspect a table of all possible pair-wise comparisons (Fig. 7d).

Figure 7.

 All-cause discontinuation for the clozapine pathway of phase 2 of Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) and number needed to treat (NNT) relative to clozapine. (a) NNT displayed without confidence intervals for the statistically significant results. (b) NNT displayed with 95% confidence intervals. Because confidence intervals that represent non-statistically significant results must include ‘infinity’ (∞), that interval is not continuous. For example, for the comparison of clozapine vs. olanzapine, the confidence interval includes two segments: −10 to −∞ and +3 to +∞. Also, as an NNT must be 1 or greater (or −1 or less), the area around zero is boxed out on the graph. (c) NNT displayed with 95% confidence intervals. Centering the graph on ∞ eliminates some of the problems found in (b). Confidence intervals that cross ∞ denote lack of statistical significance. (d) NNT displayed as a 4 × 4 table with the 95% confidence intervals for statistically significant results. Note that when an intervention is compared against itself, the NNT is ∞, as there is no difference.

Tolerability issues are highlighted in the ziprasidone pathway of CATIE phase 2 (13). Although ziprasidone did not yield an advantage regarding the primary outcome measure, all-cause discontinuation, there were advantages for ziprasidone on metabolic parameters in terms of discontinuation because of weight gain or metabolic effects, with NNTs ranging from 11 to 21 depending on the comparator (15). Ziprasidone’s advantage in phase 2 was greatest for the outcome of weight loss greater than 7% in patients who had gained greater than 7% from baseline in phase 1, with a NNT of 3 vs. olanzapine or quetiapine, and a NNT of 5 vs. risperidone (17). In other words, for every three patients randomized to ziprasidone instead of olanzapine or quetiapine, one additional patient receiving ziprasidone experienced this weight loss.

CATIE outcomes upon switching were highly dependent on what the patient was receiving in the prior randomized segment of the study. NNT can quantify these differences and reinforces the idea that clinical context is important when interpreting this metric (16). Despite quetiapine’s relatively poor performance in phases 1 and 2, it was superior to risperidone in phase 1B, where patients randomized to perphenazine and who discontinued that drug were re-randomized to risperidone, olanzapine or quetiapine (14). In phase 1B, the NNT for all-cause discontinuation for quetiapine vs. risperidone was 4, and that for olanzapine vs. risperidone was 5 (16). This is illustrated in Fig. 8 with a ‘negative’ NNT for risperidone (used as the primary reference intervention for the calculation of the NNT) vs. the comparators of quetiapine and olanzapine. The authors proffered the explanation that failure with an antipsychotic that has relatively high affinity for the dopamine D2 receptor may predict failure with another high affinity agent (14).

Figure 8.

 All-cause discontinuation for phase 1B of Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) and number needed to treat (NNT) relative to risperidone. Negative NNT values indicate a disadvantage for risperidone.

Payers may be particularly interested in re-hospitalization rates. NNT to avoid a psychiatric hospitalization as a result of the exacerbation of schizophrenia ranged from 3 to 7 in favour of olanzapine compared with the other antipsychotics in phase 1 (18).

NNT can also be applied to the problem of comparing interventions to treat agitation associated with either schizophrenia or bipolar mania (19). The efficacy and safety of the intramuscular formulations of ziprasidone, olanzapine, aripiprazole, haloperidol and lorazepam were compared using data from the nine available pivotal registration trials for the second-generation antipsychotics. NNT for treatment response at 2 h vs. placebo (or placebo equivalent) was 3 for ziprasidone 10–20 mg and olanzapine 10 mg, and was 5 for aripiprazole 9.75 mg, in contrast to 4 for haloperidol 6.5–7.5 mg or lorazepam 2 mg. Moreover, a clear dose–response relationship was evident for intramuscular ziprasidone, with the NNT strengthening from 4 to 2 when doses of 10 mg and 20 mg, respectively, were compared with 2 mg (19). Treatment-emergent adverse events occurring during the pivotal trials revealed a statistically significant NNH vs. placebo (or placebo equivalent) for aripiprazole for headache (NNH 20) and nausea (NNH 17), for ziprasidone for headache (NNH 15) and for olanzapine for treatment-emergent hypotension (NNH 50). Olanzapine and aripiprazole had a more favourable extrapyramidal side-effect profile compared with haloperidol (there was no haloperidol treatment arm in the ziprasidone studies), demonstrable not only by the rating scales used in the studies and the standard comparisons of total score differences but also by NNH.

Applying NNT to the treatment of bipolar disorder

Cookson and colleagues (20) assessed the antidepressant effects of 8 weeks of quetiapine monotherapy in patients with acute bipolar depression. The Montgomery–Asberg Depression Rating Scale was used to define response and remission. Table 2 summarizes the NNT for quetiapine 600 or 300 mg/day vs. placebo on these outcomes, and for the NNH for treatment-emergent mania/hypomania. The authors also report on speed of onset of response and remission and calculate the respective NNT for each weekly time-point (20). Such analyses can add useful information for the clinician.

Table 2.   Treatments for bipolar disorder (bipolar depression): NNT (and 95% CIs) for quetiapine 300 or 600 mg/day vs. placebo [adapted from Cookson et al. (20)]

Outcome | NNT quetiapine 300 mg/day vs. placebo (95% CI) | NNT quetiapine 600 mg/day vs. placebo (95% CI)
Response (Montgomery–Asberg Depression Rating Scale score decrease by at least 50% at endpoint) | 5 (4, 9) | 5 (4, 9)
Remission (Montgomery–Asberg Depression Rating Scale score no more than 12 at endpoint) | 5 (3, 7) | 5 (3, 7)
Treatment-emergent mania/hypomania (NNH) | 153 (not significant) | 56 (not significant)

CI, confidence interval; NNH, number needed to harm; NNT, number needed to treat.

Discussion

Limitations

NNT is an effect size and independent of statistical significance, and thus is best used together with its confidence interval. The quality of the NNT estimate is highly dependent on the quality of the source data. Poorly designed and badly analyzed studies yield NNTs of questionable utility. Care must be taken to ensure that the research subjects that were enrolled in the clinical trial being evaluated are similar to the actual patients for whom the treatment is being proposed. Study design may also be problematic if the doses of the medication interventions are dissimilar to what would be commonly used. Moreover, NNT can only be calculated for well-defined binary (yes/no or dichotomous) outcomes; other measures of effect size will be needed for continuous measures or alternatively the continuous outcome will need to be converted to a binary one by establishing a clinically-appropriate threshold for abnormality (for example, instead of analyzing change in body weight in kilograms over time, patients would be categorized by whether they gained 7% in body weight from baseline or not).
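Converting a continuous measure into a binary outcome, as described above for body weight, can be sketched as follows. All patient weights are invented for illustration, the names `proportion_gaining` and `nnh` are hypothetical, and the threshold is applied as ‘gained at least 7% from baseline’, which is an assumption about how the cut-off is defined.

```python
import math


def proportion_gaining(baseline_kg, endpoint_kg, threshold=0.07):
    """Share of patients whose weight rose by at least `threshold` (as a fraction of baseline)."""
    flags = [
        (end - base) / base >= threshold  # dichotomize each patient: yes/no
        for base, end in zip(baseline_kg, endpoint_kg)
    ]
    return sum(flags) / len(flags)


def nnh(rate_a: float, rate_b: float) -> int:
    """Number needed to harm: reciprocal of the absolute rate difference, rounded up."""
    diff = abs(rate_a - rate_b)
    if diff == 0:
        raise ValueError("rates are identical; NNH is infinite")
    # Remove floating-point noise before rounding up to a whole patient.
    return math.ceil(round(1.0 / diff, 9))


# Hypothetical weights (kg) for five patients on each of two drugs.
baseline = [80, 70, 90, 60, 75]
drug_x_endpoint = [88, 71, 99, 61, 76]  # two patients gain about 10%
drug_y_endpoint = [81, 71, 91, 66, 76]  # one patient gains about 10%

rate_x = proportion_gaining(baseline, drug_x_endpoint)
rate_y = proportion_gaining(baseline, drug_y_endpoint)
print(rate_x, rate_y, nnh(rate_x, rate_y))
```

The choice of threshold determines the resulting NNH, so it should be set on clinical grounds before the data are examined.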

To conclude, using NNT can help predict treatment response in terms of both efficacy and tolerability. Clinical trial results expressed as NNT or NNH can be easily summarized and communicated effectively to patients, their families and payers. Medication switching decisions ultimately depend on the individual patient’s history of response to prior interventions in terms of both efficacy and tolerability, as well as by patient preference.

Disclosure

Leslie Citrome, MD, MPH, is a consultant for, has received honoraria from, or has conducted clinical research supported by the following: Abbott Laboratories, AstraZeneca Pharmaceuticals, Avanir Pharmaceuticals, Azur Pharma Inc, Barr Laboratories, Bristol-Myers Squibb, Eli Lilly and Company, Forest Research Institute, GlaxoSmithKline, Janssen Pharmaceuticals, Jazz Pharmaceuticals, Pfizer Inc, and Vanda Pharmaceuticals.
