Can we trust the results of trials that are stopped early?
Article first published online: 22 JUN 2006
BJOG: An International Journal of Obstetrics & Gynaecology
Volume 113, Issue 7, pages 766–768, July 2006
How to Cite
Khan, K. and Hills, R. (2006), Can we trust the results of trials that are stopped early? BJOG: An International Journal of Obstetrics & Gynaecology, 113: 766–768. doi: 10.1111/j.1471-0528.2006.00972.x
- Issue published online: 22 JUN 2006
- Accepted 3 April 2006.
The purpose of a medical intervention is to help patients get better. There is probably no other valid reason for starting the treatment. The ‘randomised controlled trial’ (RCT) is a central tool in deciding who should get what treatment and is at the heart of evidence-based practice. As with all research, there is an ethical obligation to ensure that clinical trials are conducted properly so that their results are interpretable, even if they are negative. For this reason, clinical trials should be randomised, with a peer-reviewed and registered study protocol,1 and investigators need to obtain suitable ethics committee approval.2 In order to ensure that the trial remains ethical and the question at the heart of a trial is still relevant, trials need to be adequately monitored during recruitment.3 Finally, trials need to be reported honestly, even if they are negative or inconclusive. To assist in this process, there are a number of reporting guidelines in place for clinical trials, of which the most widely used is the CONSORT guideline.4
In order to ensure the highest standards of clinical research, BJOG encourages researchers to register their detailed trial protocols so that errors can be picked up and rectified before it is too late.1 BJOG emphasises that to avoid misleading conclusions (especially Type II errors: concluding that treatments do not differ when in fact they do) ‘trials must also be of adequate size’, and it also aims to ‘stop authors giving up trials early’. It encourages transparency if a trial stops early, so that both editors and readers are aware and can adjust their conclusions appropriately. Two such trials were published in BJOG in February 2006.5,6
There are some people who say that if a trial has failed to recruit its original sample size, it is automatically of poor quality and its results in some sense suspect. This type of attitude would perhaps be fine in an ideal world, where clinicians and participants were lining up to join RCTs and there could be no excuse for failing to recruit. However, as anyone who has taken part in a clinical trial will recognise, real life is seldom that simple. In a hard-pressed hospital environment, there simply may not be the resources to spend the extra time required in explaining a study to a participant and then gaining informed consent, doing the baseline assessments, and ringing up the randomisation service, let alone any long-term follow up. Additionally, although there may be overall scientific uncertainty as to the best treatment (demonstrated by the use of a wide variety of treatments in current practice), this may merely reflect a diversity of deeply entrenched positions. Individual clinicians (or participants) may not be uncertain at all: but they have reached a variety of different (even diametrically opposing) conclusions. We consider that just because a trial has failed to reach its original sample size, it is not necessarily of poor quality. Indeed, the size of the trial is automatically taken into account in the P values and confidence intervals for the estimates of effect size. So long as the trial has closed early for administrative reasons (such as failing to recruit sufficiently quickly or the nonavailability of one or other treatment), this early closure is not of itself a reason to disbelieve the results.
However, it is a different matter when a trial is closed early because of the findings of the analysis of interim data.7–9 The most important thing to remember about statistical analysis is that it never proves anything absolutely, merely establishes things beyond a reasonable doubt. In many cases, the level of ‘reasonable doubt’ is set at P = 0.05: this means that there is a 1 in 20 chance of a trial showing a ‘statistically significant difference’ between treatments even if in truth they are completely equivalent. This chance is higher than that of throwing a double six in Monopoly. Moreover, if two trials of the same treatment comparison are performed, there is almost a 1 in 10 chance of at least one of them coming up with a statistically significant result (this highlights the importance of journals publishing negative as well as positive trials, as otherwise the literature is bound to show a positive bias). But, even more importantly, if one runs a single trial but keeps looking at the accumulating data, the more times one looks, the greater is the chance of finding a significant difference by chance, even when the treatments being compared are equivalent. For example, if one analyses a trial twice (i.e. one interim analysis), then the chance of seeing a significant result at P = 0.05 goes up to 7%; with five interim analyses the rate is approximately 10%, and analysing after every patient gives a rate of about 20%. Clearly, analysing after every patient and then stopping and reporting when a significant result is achieved is misleading: instead of the reported P value of 0.05, the real chance of a false positive is 0.2.
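The inflation described above can be illustrated with a small simulation. This is a minimal sketch rather than the calculation behind the figures quoted in the text: it assumes equally spaced analyses, a two-sided z test at P = 0.05 for every look, and truly equivalent treatments. The function name `false_positive_rate` and its parameters are illustrative choices. Because the exact rates depend on the timing of the looks and on the test used, the simulated values need not match the percentages quoted above; the point is only that the rate rises with the number of looks.

```python
import math
import random


def false_positive_rate(n_looks, n_trials=20000, z_crit=1.96, seed=2006):
    """Monte Carlo estimate of the chance that at least one of `n_looks`
    equally spaced analyses gives |z| > z_crit when the arms are equivalent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each new block of data adds one standard-normal increment
            # under the null hypothesis of no treatment difference.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)  # z statistic on all data so far
            if abs(z) > z_crit:
                hits += 1  # trial would be stopped and reported as significant
                break
    return hits / n_trials


# Exact arithmetic for the two comparisons made in the text.
double_six = 1 / 36         # chance of a double six, lower than 1 in 20
two_trials = 1 - 0.95 ** 2  # at least one of two null trials is "significant"

if __name__ == "__main__":
    print(f"double six: {double_six:.3f}, two trials: {two_trials:.4f}")
    for looks in (1, 2, 5, 10):
        print(looks, "analyses:", round(false_positive_rate(looks), 3))
```

With a single analysis the estimate should sit close to the nominal 0.05, and it should grow as more looks are added.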
For this reason, it is recommended that interim analyses are kept confidential and shown only to an independent data monitoring committee. That way, there can be no suggestion that an investigator closed the trial early just because the result momentarily became significant. Further, the interim analyses need to be performed at a level of significance that keeps the overall chance that a trial will report a significant result when two treatments are equivalent at 0.05. A number of interim analysis and stopping rules have been proposed: one of the simplest is for the data monitoring committee to recommend stopping the trial (it is up to the trial’s steering committee to make the final decision) only if an extreme level of significance is reached (typically P = 0.002 [Peto-Haybittle rule] or similar) and the results seen are likely to change clinical practice. This latter condition is designed to ensure that trials are not stopped early when the long-term outcome of patients, rather than a short-term surrogate marker, is the real test of which treatment is better, or if results appear too good to be believable. The advantages of this simple stopping rule are that it is clear and simple to explain, it allows for clinical judgement, it is not overly formulaic, and the final analyses do not need to be adjusted to take the interim analyses into account. There are more complex methods for carrying out interim analyses which allow stopping for smaller differences as the trial size increases (e.g. Barnard’s sequential ‘t’ test), but these may require adjustment of the final P values. They also may require a fixed schedule of interim analyses.
Although the method proposed above can adapt to a variable number of interim analyses, it is usually a good idea to define the intervals at which interim analyses will typically be carried out in the protocol (although this may change if the data monitoring committee request further interim analyses or other research results come to light).
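The claim that the simple stopping rule leaves the final analysis essentially unadjusted can be checked by simulation. The sketch below is an illustration under stated assumptions, not the procedure of any particular data monitoring committee: the looks are equally spaced, every interim look uses a two-sided threshold of P = 0.002, the final analysis uses P = 0.05, and the clinical-judgement condition is ignored. The function name `overall_false_positive_rate` and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def overall_false_positive_rate(n_interim, interim_p=0.002, final_p=0.05,
                                n_trials=20000, seed=2006):
    """Estimate the overall chance of a 'significant' result under the null
    when interim looks use a strict threshold and the final look uses final_p."""
    normal = NormalDist()
    z_interim = normal.inv_cdf(1 - interim_p / 2)  # about 3.09 for P = 0.002
    z_final = normal.inv_cdf(1 - final_p / 2)      # about 1.96 for P = 0.05
    n_looks = n_interim + 1                        # interim looks plus final analysis
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # null data: no true difference
            z = running_sum / math.sqrt(look)
            threshold = z_final if look == n_looks else z_interim
            if abs(z) > threshold:
                hits += 1  # stopped early or significant at the final analysis
                break
    return hits / n_trials


if __name__ == "__main__":
    for interim in (0, 1, 4):
        print(interim, "interim looks:",
              round(overall_false_positive_rate(interim), 3))
```

Under these assumptions the overall rate should stay close to 0.05 however many interim looks are taken, because the strict interim threshold spends very little of the false-positive allowance.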
So, how do these considerations help us in considering one of the papers in BJOG in February 2006?5 The trial was closed after 5 years of recruitment but had recruited only 100 of an original target sample size of 420. There was one interim analysis. If the results of this analysis were kept confidential by the data monitoring committee, they will have played no role in the decision to stop the trial. In this situation, we can be confident that the decision to stop the trial early was not data driven. If closure was a purely administrative decision, we should take the results of the trial as they stand. The observed result for the clinically important outcome of delivery before 30 weeks of gestation is not statistically significant (metronidazole 11/53 versus placebo 5/46, relative risk 1.91, 95% CI: 0.72–5.09, P = 0.2), and for the less important outcome of delivery before 37 weeks of gestation, the results show a marginally significant adverse result for antibiotic (metronidazole 33/53 versus placebo 18/46, relative risk 1.59, 95% CI: 1.05–2.41, P = 0.02). In the first result, the problems associated with a small sample size manifest themselves since the result is consistent with either a 25% reduction in preterm birth using metronidazole therapy or a five-fold increase. Thus, instead of being able to provide convincing evidence of there being no beneficial effect of metronidazole, this nonsignificant result merely fails to provide any good evidence of an effect (or indeed lack of effect). The other outcome of delivery before 37 weeks shows significant evidence of harm (although since the lower limit of the confidence interval is a relative risk of 1.05, very close to 1.0, perhaps the true conclusion is that metronidazole is not beneficial).
It is beyond the scope of this commentary to argue whether, taken together, these two results suggest convincingly that metronidazole should be avoided or whether further evidence is needed, given the possible 25% reduction or five-fold increase in preterm birth suggested by the first result.
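The relative risks and confidence intervals quoted above can be recomputed from the event counts. The commentary does not say how the intervals were derived; the sketch below assumes the usual large-sample interval on the log scale, and the function name `relative_risk_ci` is illustrative. Under that assumption the results agree with the quoted figures to two decimal places.

```python
import math


def relative_risk_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Relative risk of arm A versus arm B with a log-scale 95% interval."""
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Counts as reported in the text: metronidazole versus placebo.
before_30_weeks = relative_risk_ci(11, 53, 5, 46)
before_37_weeks = relative_risk_ci(33, 53, 18, 46)

if __name__ == "__main__":
    for label, result in (("<30 weeks", before_30_weeks),
                          ("<37 weeks", before_37_weeks)):
        rr, lower, upper = result
        print(f"{label}: RR {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The width of the first interval, from roughly a quarter fewer to five times as many early deliveries, is what makes the small sample size so visible in that result.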
In summary, when a trial closes early purely for administrative reasons, we have to take the results as they stand. What cannot be done is to speculate on whether a result would have been significant had the trial run to completion. A trial which closes early provides us with (possibly tantalisingly little) evidence on the effects of a treatment. Wishful thinking about what might have happened under different circumstances is appealing but runs contrary to the principles of evidence-based medicine. The evidence may be scanty, but we have to use what is there and, if necessary, set out to acquire more. On the other hand, when early trial closure is data dependent, this is a matter of concern and one reason why independent data monitoring committees are so important. However, sometimes even the best-designed trials miss the prevailing clinical mood and under-recruit. Under these circumstances, we must analyse and report such results as they are available but both authors and readers need to recognise the special importance in such cases of looking beyond mere P values. Absence of evidence is not evidence of absence,10 and it is the responsibility of authors and practitioners alike to take measures of uncertainty into account whether or not a trial closes early. Even if a trial has closed with a negative result, therefore, it does not automatically follow that the treatment is entirely ineffective, for the confidence intervals could still include a clinically relevant effect size.
If we instinctively mistrust all trials which do not reach target recruitment, then we must discount the results from many important cancer trials, which have contributed to improving outcomes for thousands of patients. The take-home message to readers is indeed to be vigilant about trials which close early,11–13 but not to dismiss them out of hand: indeed, an ambitious multicentre trial which fails to reach its target is likely to be better run and better monitored, and therefore give more reliable evidence, despite having closed early, than a single-centre trial which reaches a smaller target recruitment.
- 2. Ethical issues in the design and conduct of randomised controlled trials. Health Technol Assess 1998;2:i–vi, 1–132.
- 3. Issues in data monitoring and interim analysis of trials. Health Technol Assess 2005;9:1–238, iii–iv.
- 6. Does a first trimester dating scan using crown rump length measurement reduce the rate of induction of labour for prolonged pregnancy? An uncompleted randomised controlled trial of 463 women. BJOG 2006;113:171–6.