Number needed to treat and number needed to harm are not the best way to report and assess the results of randomised clinical trials

Authors


Prof Jane L. Hutton, Department of Statistics, University of Warwick, Coventry, UK.
E-mail: J.L.Hutton@warwick.ac.uk

Summary

The inverse of the difference between rates, called the ‘number needed to treat’ (NNT), was suggested 20 years ago as a good way to present the results of comparisons of success or failure under different therapies. Such comparisons usually arise in randomised controlled trials and meta-analysis. This article reviews the claims made about this statistic, and the problems associated with it. Methods that have been proposed for confidence intervals are evaluated, and shown to be erroneous. We suggest that giving the baseline risk, and the difference in success or event rates, the ‘absolute risk reduction’, is preferable to the number needed to treat, for both theoretical and practical reasons.

Man must learn to simplify, but not to the point of falsification. Aldous Huxley

Assessment of new therapies through clinical trials is well established, and results should obviously be expressed clearly and correctly. For results to be useful, data must be appropriately collected and analysed. The subsequent choice of presentation of results can be made separately from the method of analysis.

Results of clinical trials that compare the rates or proportions achieving an outcome, such as thromboembolism or bleeding, can be analysed and presented in several ways. Common methods of comparing rates or proportions are the relative risk (RR), the difference in proportions, which is also known as the absolute risk reduction (ARR), the odds ratio (OR), and the log odds ratio (log OR). A measure called the ‘number needed to treat’ (NNT), the inverse of the difference in probabilities, has been advocated in medical journals, first in the context of randomised controlled trials (RCTs) (Laupacis et al, 1988; Cook & Sackett, 1995; Sackett et al, 1996). As treatments can have detrimental effects, the expression ‘number needed to harm’ (NNH) was introduced (Altman, 1998).

We let p1 be the baseline or placebo response rate, and p2 the response rate for the intervention group in a clinical trial. The ARR is the difference between the rates: ARR = p1 − p2. The use of the inverse of the difference between the rates: NNT = 1/(p1 − p2), was suggested as an alternative presentation of such results (Laupacis et al, 1988). The correct interpretation of this statistic is quite subtle, and expressing it correctly requires considerable care. If p1 > p2, and if NNT patients similar to those in the study were treated with treatment 1, on average one more patient would have a positive response within the time interval considered in the trial than if the NNT patients were treated with treatment 2. The NNT is the average number of typical patients ‘needed to be treated’ under treatment 1 to achieve one additional positive response over treatment 2. A negative NNT corresponds to a negative ARR, i.e. a poorer outcome on the drug, and should be interpreted as ‘the number needed to treat to harm’ (NNTH); a positive NNT is then to be interpreted as ‘the number needed to treat to benefit’ (NNTB) (Altman, 1998).
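As a minimal sketch of how these summaries relate, the following Python function computes them from two rates. The function name `summary_measures` and the dictionary keys are illustrative assumptions, and the RR and OR are taken here as intervention relative to baseline, which is one common convention. The rates are those of the extended-duration prophylaxis meta-analysis discussed in this article (3·3% versus 1·3%; Eikelboom et al, 2001).

```python
import math

def summary_measures(p1, p2):
    """Summaries of two rates: p1 = baseline (placebo), p2 = intervention."""
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    arr = p1 - p2                                   # absolute risk reduction
    rr = p2 / p1                                    # relative risk
    odds_ratio = (p2 / (1 - p2)) / (p1 / (1 - p1))  # odds ratio
    # NNT = 1/ARR has no finite value when the two rates are equal,
    # so it is reported as None rather than as an arbitrary infinity.
    nnt = None if arr == 0 else 1 / arr
    return {
        "ARR": arr,
        "RR": rr,
        "OR": odds_ratio,
        "logOR": math.log(odds_ratio),
        "NNT": nnt,
    }

measures = summary_measures(0.033, 0.013)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
# ARR is about 0.02 and NNT about 50, matching the figures quoted in the text.
```

All five numbers are functions of the same two rates; none of them carries information that the pair (p1, p2) does not.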

For example, a meta-analysis reported several measures: ‘Extended-duration prophylaxis for 30–42 d significantly reduced the frequency of symptomatic venous thromboembolism [1·3% vs. 3·3%, OR 0·38; 95% confidence interval (CI) 0·24–0·61, numbers needed to treat (NNT) = 50],’ without specific interpretation of NNT (Eikelboom et al, 2001). Another meta-analysis was more explicit: ‘For treatment effects that were statistically significant, the authors determined the absolute risk reduction and the number needed to treat for benefit [NNT (B)] to prevent an outcome. During anticoagulant prophylaxis, patients had significant reductions in any PE [relative risk, 0·43 (CI, 0·26–0·71); absolute risk reduction, 0·29%; NNT (B), 345]’ (Dentali et al, 1977).

For estimating the effects of treatment on a success or failure outcome, statisticians generally prefer to use the log OR, for theoretical reasons: essentially, this scale provides the most accurate and reliable comparison of treatments. Medical journals have generally accepted the importance of indicating the precision or accuracy of the main outcome measures, and estimates reported, as recommended in the Consolidated Standards of Reporting Trials (CONSORT) statement (CONSORT, 1994). These considerations – correct analysis and reporting the precision of outcome measures – are important in assessing the claims made in the articles which introduce and advocate NNT.

Arguments for NNT

A good discussion of the issues to be considered in comparing risks and benefits of treatments, and the difficulties that are encountered in assessing evidence from published reports of clinical trials, was given in the article which introduced ‘Number needed to be treated’ (Laupacis et al, 1988). The authors considered the relative risk reduction (RRR), ARR, OR and NNT. Unfortunately, the relationship between various ways of summarising two rates in one statistical measure was poorly understood and expressed. The conclusions were affected by two major confusions as well as the failure to distinguish between properties of statistics, properties of methods of data collection, and characteristics of the people who use the statistics. Of six alleged shortcomings of OR and RRR only the first is actually a property of RRR and OR: the fact that two measurements are combined into one. However, this is also true of ARR and NNT, so cannot be the basis for choosing between these four statistics. The remaining five disadvantages are important issues, which are carefully explained, but are not properties of these statistics. They are aspects of pharmacology, patient characteristics, study design, incomplete knowledge and appropriate use of data, again irrelevant to the choice of summary statistic.

The authors wanted a summary statistic that compared the effects of an intervention with no intervention, or an alternative intervention; summarised the harm associated with each intervention; and identified high risk patients who would respond to treatment (Laupacis et al, 1988). These criteria are relevant to the design of studies, not to choice between measures. The consequences of interventions, and characteristics of patients at risk, can only be found by carrying out the relevant studies. To measure effects and summarise harm, one has to define benefit and harm, and how they will be measured. A common scale is necessary for comparisons across interventions and conditions, but the choice between scales depends on external factors. Haemoglobin concentration can be reported as g/l, g/dl or mol/l. A patient’s temperature can be given in degrees Fahrenheit or Celsius; countries which use the metric system will generally use Celsius. It is certainly useful to have measures of both risk and benefit, but there is no unique way of comparing harm and benefit.

A NNT of 7 was incorrectly defined as ‘seven patients would have had to be treated for three years in order to prevent one event’ (Laupacis et al, 1988). This error encouraged other supporters of NNT to claim that NNT yields an interpretation ‘in terms of patients treated rather than the arguably less intuitive probabilities’ (Cook & Sackett, 1995). As the NNT is an inverse of a difference in probabilities, it must be expressed probabilistically. Consider the example of an NNT of 50 for reduction of symptomatic venous thromboembolism (Eikelboom et al, 2001). There is only a 37% probability that exactly one event will be prevented by treating 50 patients, almost equal to the 36% chance that no events will be prevented. Averages might provide a simple interpretation of NNT: ‘for every 50 patients treated with extended-duration prophylaxis for 30–42 d, an average of one patient will not have a symptomatic venous thromboembolism which they would otherwise have had’. Simply stating that ‘50 patients must be treated for one thromboembolism to be prevented’ is wrong.
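The 37% and 36% figures can be reproduced with a short calculation. The sketch below rests on a modelling assumption that is ours rather than stated in the text: the number of events prevented among 50 treated patients is treated as binomial, with each patient independently having probability ARR = 0·02 of an event being prevented. The function name `prob_prevented` is illustrative.

```python
from math import comb

def prob_prevented(k, n, arr):
    """P(exactly k events prevented among n patients) under a binomial model."""
    return comb(n, k) * arr**k * (1 - arr)**(n - k)

n, arr = 50, 0.02   # NNT = 1 / 0.02 = 50
for k in range(4):
    print(f"P({k} events prevented) = {prob_prevented(k, n, arr):.3f}")
# P(0) is about 0.36 and P(1) is about 0.37, as quoted in the text.
```

Under this model the mean number of events prevented is n × ARR = 1, which is the sense in which the ‘average of one patient’ reading holds, even though preventing exactly one event is far from certain.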

Similar reasons for NNT and against ORs were advanced, and similar mistakes made, by other authors (Cook & Sackett, 1995; Sackett et al, 1996). Costs and side effects of a treatment are relevant to clinical decisions, but independently of the choice of summary statistic. The claim that NNT can be used to extrapolate published research findings does not distinguish it from other statistics (Cook & Sackett, 1995). Reliable extrapolation requires careful statement of other assumptions, such as the baseline or placebo rate, and the similarity of patient populations, for NNT as much as for RRR or ARR. The mistake of using a property of all summaries of two rates as an argument against only one statistic is repeated. It is important to appreciate that all the one parameter summaries of p1 and p2 have different properties if the summary parameter is held constant while the two parameters vary. This means that the choice of summary parameter can affect the rankings of treatments. Before ranking therapies, we must decide what criteria are to be used in the ranking: is ARR more important than RR? The claim that it is not easy to calculate the NNTs from ORs is wrong, as there is a simple formula for this (Sackett et al, 1996). The claim that it is easier to work on the ARR scale than the RRR scale is not generally valid, but depends on the method of analysis used (Hutton, 2000). Simplicity on the ARR scale does not automatically translate to the NNT scale, whether in terms of meta-analysis or confidence intervals.
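One standard way to carry out this conversion, sketched here, recovers the intervention rate from the OR and an assumed baseline rate, then inverts the difference. The function names are illustrative, the OR is taken as intervention odds over baseline odds, and the baseline rates in the loop are chosen only for illustration. Holding the OR fixed while the baseline varies shows that the implied ARR and NNT change with the baseline rate.

```python
def rate_from_or(p1, odds_ratio):
    """Intervention rate p2 implied by baseline rate p1 and an odds ratio."""
    odds2 = odds_ratio * p1 / (1 - p1)   # intervention odds
    return odds2 / (1 + odds2)           # convert odds back to a rate

def nnt_from_or(p1, odds_ratio):
    """NNT = 1 / (p1 - p2), with p2 derived from the OR and the baseline."""
    arr = p1 - rate_from_or(p1, odds_ratio)
    if arr == 0:
        raise ValueError("OR of 1 means no difference: NNT has no finite value")
    return 1 / arr

fixed_or = 0.38   # OR quoted for symptomatic venous thromboembolism
for baseline in (0.016, 0.033, 0.086, 0.25):
    p2 = rate_from_or(baseline, fixed_or)
    print(f"baseline {baseline:.3f}: ARR = {baseline - p2:.4f}, "
          f"NNT = {nnt_from_or(baseline, fixed_or):.1f}")
```

The calculation itself is simple; the substantive step is the choice of baseline rate, which has to be stated whichever summary is reported.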

Misinterpretation of ORs is alleged to support the use of NNT as ‘a clinically useful measure’ (Sackett et al, 1996). Ease of understanding of NNT is assumed, without evidence (Cook & Sackett, 1995). That ‘very few clinicians are facile at dealing with odds and relative odds’, is an observation about clinicians, not odds ratios (Sackett et al, 1996). Clinicians are even less likely to understand differential equations, but no-one would therefore claim that work in pharmacokinetics should not use differential equations. One of the reasons for restricting access to many drugs, by requiring a prescription from a qualified practitioner, is that these drugs cannot safely be used by the general public. The lack of knowledge of the general public is not a reason for advocating that such drugs are not used. The lack of knowledge and ability of clinicians is not a good reason to abandon models of the behaviour of drugs which accurately describe such behaviour.

It seems that suppressing uncertainty is a large part of the attraction of NNT, and is essential to the major claim for NNT, which is that it is easy for clinicians to understand. Despite this, clinicians are recommended to attend an ‘educational course in clinical epidemiology’ before using published nomograms for NNT (Chatellier et al, 1996). Evidence that NNT can be misunderstood is given implicitly in the discussion of its properties, and explicitly in the mistakes made in calculating confidence intervals (Hutton, 2000). Methods can be misused, and statistics can be misinterpreted.

In order to gain knowledge about therapies, it is essential to design studies carefully, record outcomes for all therapies included, and model the resulting data efficiently and accurately. There are many situations in which a regression model using log OR will be the best, in terms of efficiency, goodness of fit, and flexibility. These criteria are the relevant basis for judging the choice of statistical methods. The choice of scale on which to present the results to a general readership is separate.

The claims made of difficulties with existing methods and advantages of NNT have no substance. We now consider the theoretical properties of the NNT and its estimators.

Problems with NNT

There are several undesirable properties of NNT, including problems of bias and precision (Lesaffre & Pledger, 1999; Hutton, 2000). Results of treatment comparisons often do not exclude the possibility that there is no difference between the treatments or groups. For ARR and log OR, no difference is equivalent to zero, and a confidence interval which includes zero indicates that we cannot ignore the possibility that the groups do not differ. On the OR and RR scales, the value 1 indicates no difference, and again we can see whether a confidence interval includes 1. There is no finite value, and no single value of NNT which corresponds to no difference: both positive and negative infinity are possible values. No simple test of no treatment effect can be constructed for the supposedly easily comprehensible NNT.

The CONSORT guidelines require that an estimate of the precision of the estimated treatment effects is given, usually by a confidence interval (CONSORT, 1994). As the usual statistical theory does not work for NNT, the standard error cannot be estimated directly from standard statistical package output (Hutton, 2000). The difficulty with NNT when the two rates are the same means that bias in estimates of NNT cannot be eliminated. This is true regardless of whether any particular confidence interval for the ARR includes zero. The standard interval construction is invalid, and no consistent simulation methods can be found that are reliable for general clinical practice (Hutton, 2000).
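One way to see why the case of equal rates causes trouble for estimation is to compute, for a hypothetical two-arm trial, how often the observed rates coincide or reverse even though the true rates differ. The sketch below assumes independent binomial arms of 100 patients each (an illustrative size, not taken from any cited trial) and true rates of 3·3% and 1·3%. The function names are ours.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k events among n patients with event rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def arr_sign_probabilities(n1, p1, n2, p2):
    """Return P(observed ARR > 0), P(observed ARR == 0), P(observed ARR < 0).

    Arm 1 is baseline, arm 2 is intervention; observed ARR = x1/n1 - x2/n2.
    """
    pmf2 = [binom_pmf(x2, n2, p2) for x2 in range(n2 + 1)]
    positive = zero = negative = 0.0
    for x1 in range(n1 + 1):
        f1 = binom_pmf(x1, n1, p1)
        for x2, f2 in enumerate(pmf2):
            # Compare x1/n1 with x2/n2 in integers so exact ties are detected.
            diff = x1 * n2 - x2 * n1
            if diff > 0:
                positive += f1 * f2
            elif diff == 0:
                zero += f1 * f2
            else:
                negative += f1 * f2
    return positive, zero, negative

pos, zero, neg = arr_sign_probabilities(100, 0.033, 100, 0.013)
print(f"P(observed ARR > 0)  = {pos:.3f}")
print(f"P(observed ARR == 0) = {zero:.3f}  (estimated NNT undefined)")
print(f"P(observed ARR < 0)  = {neg:.3f}  (estimated NNT negative)")
```

Because the observed ARR equals zero with positive probability, the estimated NNT is undefined for some possible trial outcomes, whatever the true rates are.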

The naive approach to a confidence interval for NNT of inverting the limits of ARR intervals does not, in general, yield a valid interval. When the interval for ARR includes zero, this procedure results in an interval that apparently spans zero. Incomplete confidence intervals have been given in publications: e.g. ‘a NNT of −12·5 (−3·7 to 1)’ (Tramèr et al, 1995). Notice that −12·5 is not included in the interval (−3·7 to 1), but the interval (−1, 1), in which NNT cannot take values, is included. The haematology examples cited do not give confidence intervals for NNT. Dentali et al (1977) avoided giving NNT for non-significant relative risks. However, an NNT of 250 was associated with an estimated OR (95% CI) of 0·43 (0·17, 1·06) (Table I in Eikelboom et al, 2001).
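The behaviour described above can be reproduced with a short sketch. The ARR estimate and interval used here are hypothetical, chosen only so that the interval includes zero. The names `naive_nnt_interval` and `is_possible_nnt` are illustrative, and the code shows what the naive inversion produces rather than a recommended method.

```python
def naive_nnt_interval(arr_low, arr_high):
    """Invert the ARR limits and sort them: the naive (invalid) NNT interval."""
    if arr_low == 0 or arr_high == 0:
        raise ValueError("a limit of exactly zero has no finite inverse")
    a, b = 1 / arr_low, 1 / arr_high
    return (min(a, b), max(a, b))

def is_possible_nnt(value):
    """Since |ARR| <= 1, a finite NNT = 1/ARR must satisfy |NNT| >= 1."""
    return abs(value) >= 1

# Hypothetical ARR estimate with a 95% interval that includes zero.
arr_hat, arr_low, arr_high = 0.025, -0.05, 0.10

nnt_hat = 1 / arr_hat                               # about 40
low, high = naive_nnt_interval(arr_low, arr_high)   # about (-20, 10)

print(f"NNT estimate: {nnt_hat:.1f}")
print(f"naive interval: ({low:.1f}, {high:.1f})")
print("estimate inside naive interval:", low <= nnt_hat <= high)
print("naive interval covers 0.5, an impossible NNT:",
      low <= 0.5 <= high and not is_possible_nnt(0.5))
```

The ARR values between −0·05 and 0·10 (other than zero) actually correspond to NNT values at or below −20 together with values at or above 10, that is, two disjoint regions rather than the single interval the naive inversion suggests.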

It is not possible to use NNT directly as a summary statistic in a meta-analysis, because simple calculations, such as addition, on the NNT scale give ridiculous results (Lesaffre & Pledger, 1999). In order to calculate ARR or NNT from a meta-analysis, assumptions about the baseline rates must be made. The claim that a fixed-effect model for meta-analysis is ‘an assumption-free model because it does not assume that included studies are a random sample of the universe of studies’ is wrong (Eikelboom et al, 2001). A fixed-effect model assumes that all studies have the same true underlying OR. A further assumption on baseline rates is required to calculate NNT. As Table I (Eikelboom et al, 2001) gave the NNT for any venous thromboembolism as 50, the inverse of 3·3% − 1·3%, it seems that Eikelboom et al (2001) assumed all nine trial populations had a true control rate of 3·3%, though the observed rates of thromboembolism ranged from 1·6% to 8·6%. However, this is not consistent with the stated OR of 0·43, but OR = 0·33 if one decimal place is used with the percentages, or 0·24 if the precise rates are used.
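A small numerical illustration of why arithmetic on the NNT scale misleads, using two hypothetical trials of equal size. The ARR values are invented for illustration and do not come from any cited study, and equal weighting is a simplifying assumption. Averaging the two NNTs is compared with inverting the average ARR.

```python
def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical ARRs from two equally sized trials: one benefit, one slight harm.
arrs = [0.10, -0.02]
nnts = [1 / a for a in arrs]          # about 10 and -50

mean_of_nnts = mean(nnts)             # about -20: reads as net harm
nnt_of_mean_arr = 1 / mean(arrs)      # about 25: reads as net benefit

print(f"mean of NNTs:       {mean_of_nnts:.1f}")
print(f"NNT from mean ARR:  {nnt_of_mean_arr:.1f}")
```

The two routes disagree even about the direction of the effect, because inversion is not a linear operation and changes sign discontinuously at an ARR of zero.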

Discussion

Ease of understanding is the only claim for NNT that cannot be directly refuted. Despite the concept of NNT being propagated within evidence-based medicine, no evidence for ease of understanding is presented (Sackett et al, 1997). Careful definition and measurement of ‘understanding’ of infinity, singularity, and disjoint intervals would be essential in any attempt to seek evidence of ease of understanding. This claim is not given any plausibility by the confusions and errors of the original advocates. That it is not simple (easy enough for any clinician to use) is demonstrated by the errors in the methods for, and the use of, confidence intervals, and calculations of NNT for meta-analysis.

The most important point in the discussion is that results of studies that compare treatments should be presented clearly and correctly. Success rates are widely used: newspapers and television report results of GCSEs in this way. Car ownership is widespread and many people understand ‘miles per gallon’ well enough to use that rate in comparing cars. Haematologists deal with rates: many clinical measurements are expressed as rates, e.g. haemoglobin levels, drugs are prescribed per day, and many outcomes are rates of success or adverse events per patient within particular time intervals.

Reporting a single measure without a baseline rate is very unwise. Media reports on oral contraceptives which only gave relative risks led to many women not taking contraceptives, with a subsequent increase in abortions. Such problems led to the publication of ‘Guidelines on Science and Health Communication’, which require that risk is communicated in terms of absolute as well as relative risk (Social Issues Research Centre, The Royal Society and The Royal Institution, 2001). A clear interpretation of the meta-analysis on extended-duration prophylaxis was given: ‘The reduction in risk is equivalent to about 20 symptomatic events per 1000 patients treated’ (Eikelboom et al, 2001). This is easily improved to meet the guidelines: ‘Without treatment, about 33 symptomatic events would be observed per 1000 patients. Treatment reduces the rate by 20/1000, to 13/1000.’ Although reducing the rate of symptomatic events from 250/1000 to 230/1000 is the same absolute reduction, the level of anxiety which a patient and their family is likely to have will differ if they know the baseline rate is 33/1000 or 250/1000. Although one can, and should, obviously give the baseline rates or baseline odds when quoting RRR and OR, the expression in terms of direct rates (ARR) is the form which is in common use.
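A statement of this form, giving the baseline rate together with the absolute change, can be generated mechanically from the two rates. The following is a sketch only: the function name `describe_per_1000` and the wording template are illustrative assumptions, and rates are rounded to whole events per 1000.

```python
def describe_per_1000(baseline_rate, treated_rate):
    """Describe a treatment effect as a baseline rate and a change per 1000."""
    base = round(baseline_rate * 1000)
    treated = round(treated_rate * 1000)
    change = base - treated
    if change > 0:
        effect = f"Treatment reduces the rate by {change}/1000, to {treated}/1000."
    elif change < 0:
        effect = f"Treatment increases the rate by {-change}/1000, to {treated}/1000."
    else:
        effect = f"Treatment leaves the rate unchanged at {treated}/1000."
    return (f"Without treatment, about {base} events would be observed "
            f"per 1000 patients. {effect}")

print(describe_per_1000(0.033, 0.013))
print(describe_per_1000(0.250, 0.230))
```

Both calls report the same reduction of 20/1000, but the baseline in each sentence makes clear how different the two situations are.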

The name ‘number needed to treat’ is unfortunate, as it encourages people to think it is a precise number, without probabilistic content, which does not have to be referred to a baseline risk. In contrast, ‘absolute risk reduction’ explicitly mentions both the probabilistic element and the comparative element which are inherent in the estimator. Given the regularity with which clinicians state that they find statistical concepts difficult, it is as well to have names that keep the basis for judgment explicit.

The NNT has poor qualities, and at best conveys only the same information as the ARR. The ARR is an absolute measure in a form which is in common use, and has good statistical properties. Therefore, it appears to us strongly preferable to base both statistical inference and scientific conclusions on the latter. If RCTs and meta-analyses are held up as the gold standard method for obtaining evidence, an unreliable statistic should not be used in interpreting this evidence.
