The contribution of randomized trials to the cure of haematological disorders from Bradford Hill onwards


Correspondence: Dr Robert K. Hills, Department of Haematology, Cardiff University School of Medicine, Heath Park, Cardiff CF14 4XN, UK. E-mail: HillsRK@cf.ac.uk

Summary

It is now 75 years since the publication of Sir Austin Bradford Hill's classic textbook on Medical Statistics, and half a century since the formation of the Medical Research Council Working Party on Leukaemia. In the intervening period, trials in haematological malignancies have been at the forefront of cancer research, both in the proportion of patients recruited and in the adoption of novel trial designs. In this paper, the principles propounded by Hill for the reliable evaluation of new treatments are considered and placed in the context of the development and evaluation of novel treatments in the 21st century. Many of the original principles espoused are still highly relevant today, while the emerging heterogeneity of these conditions, in both aetiology and outcome, provides newer challenges of its own, which are discussed here.

This year sees the 75th anniversary of the publication of Sir Austin Bradford Hill's seminal work on medical statistics (Hill, 1937). In the past three-quarters of a century, the process of identifying and validating new treatments across medicine has changed considerably. Hill's work, and its successive editions, laid the foundations of evidence-based medicine. It identified methods of reducing bias in the evaluation of treatments, thus providing a firm basis for their adoption in practice. Since the first randomized trial, begun in 1946 to test the value of pertussis vaccine (and largely overlooked because it was published after the more famous streptomycin trial; Doll, 1992), the landscape of medical research has been altered completely and many useless remedies have been replaced by treatments with evidence of effectiveness (Doll, 1991).

Work in haematological malignancies has been at the forefront of this change in emphasis towards randomized evaluations of treatment. For half a century, national trials in the UK have been run under the auspices of, first, the Medical Research Council (MRC) and, later, the National Cancer Research Institute (NCRI). The process began in the late 1950s with the formation of a Working Party, which first convened in 1959 and included both Sir Austin Bradford Hill and Sir Richard Doll (Christie & Tansey, 2003). Hill retired in 1961, but the Working Party that was set up, and its successors, have remained at the heart of UK research into haematological malignancies ever since. Outcomes have improved markedly; so too has the proportion of patients recruited into trials – from 40% of children with acute lymphoblastic leukaemia (ALL) to over 90% today (Eden et al, 2000); similarly, in acute myeloid leukaemia (AML), the use of efficient factorial designs means that approximately 1000 patients per year (around half of all cases of AML) account for well over 2000 randomizations annually.

Trials in haematological malignancies today represent an important flagship in the identification and testing of new treatments to improve patient outcomes. Many of the points made by Hill remain important: even in the age of individualized medicine and biomarkers there is a need for robust evidence on which to base treatment decisions. Randomized trials in leukaemia have brought some notable successes: for example, using intensification over the period 1980–1997, 5-year survival in children with ALL improved steadily from 72% in UKALL VIII to 77% in UKALL X and 85% in UKALL XI (Eden et al, 2000). Dramatic improvements in survival were seen with the introduction of long-course all-trans retinoic acid (ATRA) therapy in acute promyelocytic leukaemia, with a jump in 5-year survival from 52% to 71% (Burnett et al, 1999), and survival now exceeding 80% (Lo-Coco et al, 2004); and, with the introduction of imatinib for chronic myeloid leukaemia, survival is approaching 90% (Druker et al, 2006). These improvements all rest on significant results from clinical trials. However, improvements in outcome have been seen even without significant advances in therapy. Forty years ago AML was almost uniformly fatal; today, approximately half of all patients in UK trials aged under 60 years are alive after 5 years, with better survival in children (Fig 1A, B). While much of this improvement can doubtless be ascribed to improvements in supportive care, it is likely that such improvements have been accelerated by the discipline of a trial protocol used by most centres in the UK. There is also the added advantage that, with central data collection (and latterly sample collection), prognostic clinical and molecular markers have been extensively explored. However, less progress has been made in older patients, or those with adverse cytogenetics (Fig 1C, D). There also remain groups of patients (such as those with refractory or relapsed disease) with exceedingly poor outcomes: for example, older patients with AML unsuitable for intensive chemotherapy have a median survival of 3–4 months (Burnett et al, 2007). This heterogeneity of outcomes has led to a variety of different trial designs tailored to the particular situation. Recently, too, there have been increases in the number of molecular markers and the development of targeted therapies, and the challenge is to design trials that identify treatments (or combinations of treatments) that may only work in a subgroup of patients. While significant progress has been made over the last 75 years, there are still challenges to be faced today.

Figure 1.

Outcomes in MRC/NCRI AML Trials 1970–2010: (A) Age 15–59 years; (B) Age 0–14 years; (C) Age 60+ years; (D) survival in patients with adverse cytogenetics aged under 60 years, by trial (AML10, 12, 15).

The need for randomized trials

The sine qua non for practising evidence-based medicine is a reliable and firm evidence base. It is crucial to be able to gauge accurately the potential benefit of a new treatment before deciding whether or not to offer it to a patient. Typically, a new intervention represents a moderate advance upon existing therapy: for treatments where the effect is both immediate and dramatic, the evidence can be so striking that early adoption is imperative. But with the need to detect relatively modest improvements, and to balance the potential benefits and risks, comes the need to distinguish the effect of a new treatment from other differences that could confound the analyses, either masking a true difference or giving rise to false positive results. One common way of assessing a new treatment is to use a historical control group, comparing outcomes following the introduction of a new treatment with outcomes prior to its introduction. For beneficial treatments, outcomes should improve. However, this approach can give misleading results. First, current outcomes are compared with outcomes from some time ago. During that period, other changes may have occurred, in supportive care or in experience with a given treatment. Any difference seen may not be due to the new treatment, but rather to a general trend for improving outcomes over time. The pitfalls are eloquently displayed in the assessment of SAB therapy (‘Same As Before’) in AML, where a significant improvement in remission rates was seen between two successive trials with the same underlying treatment (Wheatley, 2002). Thus, any historical comparison will always be confounded by other changes in care that take place over time, and it is generally impossible to allow for this confounding.

Further, it is not always clear that the groups are comparable in their presenting characteristics. This problem is inherent in any non-randomized comparison between two different cohorts of patients, even if treated contemporaneously. If a novel treatment has different eligibility requirements (e.g. on renal or liver function), then comparing patients given the new treatment with a non-randomized control group may mean that the patients given the novel therapy, who satisfy particular criteria, are better risk than those on control, in whom the particular criteria were not required to be met, and may indeed not even have been tested. It is difficult to apply the same selection criteria retrospectively to both cohorts. Indeed, if any selection takes place, then even though baseline characteristics may appear similar, patients could differ in ways that have not been captured. This means that designs where strict randomization is not used, such as the ‘common standard arm’, can lead to unquantifiable biases, which cannot be allowed for (Hills et al, 2003). Even so-called ‘matched-pair’ analyses are not immune – matching can only be performed on those variables that were collected, meaning that patient groups could differ in undocumented ways, which are impossible to adjust for.

Randomization is a method by which these kinds of selection biases can be minimized (Collins et al, 1996). In particular, by introducing randomization, differences at baseline between groups will be due to the play of chance. Interestingly, Hill's book (Hill, 1937; and the series of Lancet articles on which it was based) does not discuss randomization, but instead talks of the then-standard practice of allocating treatments by alternation (Hill, 1990; Doll, 1992). Hill's argument was that to introduce randomization at this early stage could ‘have scared [doctors] off’ (Hill, 1990). However, even then, evidence was available that this method of allocation could be subverted: if clinicians knew the next treatment, then the decision of whether or not to enter a particular patient might be influenced by this knowledge, introducing bias. Thus, hand-in-hand with random allocation comes the requirement that there is no allocation foreknowledge. Methods such as minimization do provide balanced groups, and remain unbiased so long as the clinician cannot use previous allocations to predict the next treatment to be allocated. The advantage of minimization is that it produces not only balanced numbers in each group, but also balance by important stratification variables. Another such approach is the balanced block design introduced to clinical trials by Hill (Hill, 1951). Here, a number of allocation lists are drawn up for different strata, with the aim of ensuring balance across these strata between treatments and reducing the likelihood of chance imbalances in important prognostic factors that might obfuscate the results of the trial. Improving balance, however, can increase the likelihood of correctly predicting the next treatment allocation – while, in an unblinded trial using simple randomization, the chance of correctly predicting which of two treatments will be allocated is 50%, it can be much higher in trials with balanced block or minimization schemes, particularly in single-centre studies or those stratified by doctor or hospital (Hills et al, 2007).
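To make the mechanics concrete, the following is a minimal sketch of minimization for a two-arm trial. The factor names, the 80% probability of following the imbalance-minimizing arm, and the toy patients are illustrative assumptions, not the allocation scheme of any particular MRC/NCRI trial; the random element is included precisely because, as noted above, a fully deterministic scheme would make the next allocation predictable.

```python
# Minimal sketch of minimization for a two-arm trial (illustrative only).
import random
from collections import defaultdict

ARMS = ("A", "B")
counts = defaultdict(int)   # counts[(factor, level, arm)] -> patients already allocated

def minimise(patient_factors, p_follow=0.8):
    """Allocate the next patient, preferring the arm that reduces imbalance.

    patient_factors: dict of stratification factors, e.g. {"age": "60+", "cyto": "adverse"}.
    A random element (p_follow < 1) limits how well the next allocation can be predicted.
    """
    totals = {arm: sum(counts[(f, level, arm)] for f, level in patient_factors.items())
              for arm in ARMS}
    if totals["A"] == totals["B"]:
        arm = random.choice(ARMS)
    else:
        preferred = min(ARMS, key=totals.get)
        other = "B" if preferred == "A" else "A"
        arm = preferred if random.random() < p_follow else other
    for f, level in patient_factors.items():
        counts[(f, level, arm)] += 1
    return arm

# Allocate three hypothetical patients
for patient in [{"age": "<60", "cyto": "adverse"},
                {"age": "<60", "cyto": "intermediate"},
                {"age": "60+", "cyto": "adverse"}]:
    print(patient, "->", minimise(patient))
```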

Other methods intended to approximate randomization have been used in the assessment of sibling allograft in leukaemia. So-called Mendelian randomization (Gray & Wheatley, 1991) compares patients with a sibling donor against those with no donor, under the assumption that the presence of a matched sibling donor is unlikely to be related to disease prognosis and can therefore be viewed as essentially random. However, the methodology does not enable an evaluation of the best time to transplant, as for some patients waiting for failure and then transplanting in second remission may give equivalent survival with less cost and morbidity (Burnett et al, 2012). Further, in the current era of matched unrelated donor allografts, it will require modification to evaluate correctly the effect of allograft from whatever type of donor: Mantel-Byar analysis correctly allows for the time taken to reach transplant (so that early mortality does not bias results against the no-transplant group), but cannot allow for all selection factors in choosing whom to transplant.
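As a rough illustration of the Mantel-Byar idea, the sketch below recasts a donor-versus-no-donor style dataset into counting-process (start, stop] intervals in which transplant is a time-dependent covariate, so that follow-up before transplant counts as unexposed. The column names and toy data are invented purely for illustration.

```python
# Sketch of preparing data for a Mantel-Byar style analysis, in which transplant
# is a time-dependent covariate so that time spent waiting for a donor counts as
# 'no transplant' exposure. Column names and toy values are illustrative.
import pandas as pd

patients = pd.DataFrame({
    "id":        [1, 2, 3],
    "follow_up": [24.0, 6.0, 18.0],   # months from remission
    "died":      [0, 1, 1],
    "tx_time":   [9.0, None, None],   # patient 1 transplanted at 9 months
})

rows = []
for _, p in patients.iterrows():
    if pd.notna(p["tx_time"]) and p["tx_time"] < p["follow_up"]:
        # before transplant: unexposed, and by definition alive at the split
        rows.append(dict(id=p["id"], start=0.0, stop=p["tx_time"], transplant=0, event=0))
        # after transplant: exposed, event status taken at end of follow-up
        rows.append(dict(id=p["id"], start=p["tx_time"], stop=p["follow_up"],
                         transplant=1, event=int(p["died"])))
    else:
        rows.append(dict(id=p["id"], start=0.0, stop=p["follow_up"],
                         transplant=0, event=int(p["died"])))

counting_process = pd.DataFrame(rows)
print(counting_process)
# This (start, stop] format can be fed to any time-dependent Cox or Mantel-Byar
# log-rank routine: it removes the guaranteed survival up to transplant from the
# transplanted group, but cannot remove clinical selection of who was transplanted.
```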

Tied to the concept of randomization is the issue of interpreting single-arm phase II trials. Such trials are designed mainly to demonstrate evidence of activity of a drug. It is a mistake to think of a single-arm phase II trial as a substitute for a larger phase III trial, unless the outcome is so striking that further research would be unethical. As has been seen above, with heterogeneous diseases such as the haematological malignancies, outcome can vary greatly between patient groups; and despite the introduction of molecular markers, and techniques such as microarray analysis, it is still not possible to predict exactly a patient's prognosis from presenting characteristics. For this reason, estimating the scale of treatment benefit from a single-arm study is fraught with danger, and such a study should be used only as a guide to the expected effect size in a properly randomized phase III trial.

Suitable choice of size and endpoints

Having accepted the twin principles of acquiring robust evidence for any treatment policy, and of minimizing bias by means of a randomized controlled trial, the obvious next question is the required size of any trial, and the measure by which superiority, non-inferiority or equivalence will be demonstrated. Historically, this measure has been overall survival: this has the advantage of being objective, and indeed final. Choosing other outcome measures may appear attractive, in that they can be assessed much sooner and generally require fewer patients (because one would tend to wish to see a larger effect on such a marker), but it can also lead to problems in interpretation, for example when an improvement in a surrogate marker does not necessarily translate into an improvement in survival (Burnett et al, 2010a). In the case of stem cell transplant, using relapse-free survival as the primary endpoint does not allow for differences in survival following relapse, especially as prior transplant is an adverse risk factor for patients who relapse; benefits here may not translate into improved survival. However, in patient groups with extremely good outcomes, survival may not be a feasible outcome measure, as there is little or no scope to improve clinical outcomes, and the sample sizes required may well be prohibitive without multinational collaboration. It may therefore be reasonable to change the endpoint to embrace patient quality of life (and patient survivorship issues), so long as improved quality of life is not obtained at the cost of clinical outcome. However, particularly for safety endpoints, it is important to recall that absence of evidence is not the same as evidence of absence (Altman & Bland, 1995) – it is always possible to make a treatment non-significantly worse by making the trial small enough.

Having identified a suitable outcome measure, the next important consideration is the choice of sample size. Hill's earliest comments upon sample size tend to imply that a sample of 50–100 patients would generally be sufficient (Hill, 1951), although to identify the sort of therapeutic advances looked for today, the trials of the 1950s would appear hopelessly underpowered. There are many excellent books and articles on the mathematical basis of a sample size calculation (e.g. Machin et al, 2009), but the fundamental principle is that the smaller the difference one wishes to detect (or refute), the larger the sample size required. For endpoints such as survival, the ability to detect a given proportional reduction in mortality depends on the number of deaths observed, so in trials where outcomes are good, greater numbers of patients are required as the expected number of deaths becomes proportionately fewer. It is partly for this reason that, in good risk diseases, attention is turning away from survival as a primary endpoint. To detect a 30% proportional reduction in mortality (i.e. a hazard ratio of 0·7) with 80% power at a 5% significance level using a standard log-rank test would require about 250 deaths: if baseline survival was 25%, this would equate to an improvement to about 38% and around 350 patients being required, but with a baseline survival of 75% (and an improvement to just under 82%), the required number of patients would be around 1200 (Machin et al, 2009). As haematological malignancies are relatively rare compared to cancers such as breast or colorectal cancer, the challenge is to design trials that do not take too long to complete (by which time the question may be irrelevant), but at the same time are not underpowered, with limited ability to detect a clinically meaningful treatment effect.
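These figures can be checked approximately with Schoenfeld's formula for the number of events needed by a log-rank test. The short sketch below reproduces them under simple assumptions (1:1 allocation, proportional hazards, no censoring beyond the stated survival proportions), so the rounded values differ slightly from published tables.

```python
# Back-of-envelope check of the quoted sample sizes using Schoenfeld's
# approximation for the log-rank test (1:1 allocation, proportional hazards).
from math import ceil, log
from scipy.stats import norm

def required_deaths(hr, alpha=0.05, power=0.80):
    """Deaths needed to detect hazard ratio `hr` with a two-sided test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 4 * z ** 2 / log(hr) ** 2

def required_patients(hr, control_survival, alpha=0.05, power=0.80):
    """Convert deaths to patients, given the control-arm survival proportion."""
    experimental_survival = control_survival ** hr        # proportional hazards
    avg_death_prob = ((1 - control_survival) + (1 - experimental_survival)) / 2
    return required_deaths(hr, alpha, power) / avg_death_prob

print(ceil(required_deaths(0.7)))              # ~247 deaths ('about 250')
print(ceil(required_patients(0.7, 0.25)))      # ~360 patients (baseline survival 25%)
print(ceil(required_patients(0.7, 0.75)))      # ~1140 patients (baseline survival 75%)
```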

While, generally speaking, the results of a significant trial tend to be interpreted as providing evidence of the superiority of one treatment over the other, issues arise when considering the results of trials that are not significant. Strictly speaking, a non-significant result occurs when there is insufficient evidence to conclude that there is a difference, which is very different from concluding equivalence. If two different treatments produce remission rates of 30% and 50%, then the size of difference is definitely one that would be viewed as clinically relevant. But in an underpowered trial of only 10 patients per arm, the P-value for a comparison of 3/10 vs. 5/10 is 0·4, i.e. the trial is not significant. It would not be safe to conclude that there is no difference; rather, what is seen here is absence of evidence of effect (Altman & Bland, 1995). Evidence of absence of effect, that is, equivalence (or non-inferiority), can only be inferred if there is evidence that a clinically meaningful difference is unlikely. Such evidence is derived from the confidence interval. In this instance the point estimate and 95% confidence interval of the odds ratio are 0·45 (0·08–2·59). Thus not only is a halving of the odds of remission possible, so too is a doubling, and indeed even more extreme effects. It clearly would not be safe here to conclude no difference. Reporting of clinical trials needs to utilize confidence intervals and not just P-values; in particular, it is incorrect and potentially dangerous to dichotomize P-values at P = 0·05 into ‘treatment works’ vs. ‘treatment does not work’ (Gardner & Altman, 1996).
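A worked version of this calculation is sketched below using the large-sample (Woolf) confidence interval for the odds ratio; the exact figures depend on the method chosen, so they differ marginally from those quoted above.

```python
# Odds ratio and approximate 95% CI for the 3/10 vs. 5/10 example.
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2x2 table: a/b = remissions/failures on new treatment, c/d on control."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)       # standard error of log(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# 3/10 remissions on the new treatment vs. 5/10 on control
or_, lo, hi = odds_ratio_ci(3, 7, 5, 5)
print(f"OR {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")   # roughly 0.43 (0.07-2.7)
```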

Compliance and timing of randomization

It is difficult to interpret the results of trials where compliance rates are low. Lack of compliance is likely to mean that patients, despite being in different arms of the study, did not receive markedly different treatments. This can arise for a number of reasons, such as toxicity or tolerability, difficulty of administration, the requirement for additional unwelcome clinic appointments, or, in the case of an educational intervention, a ‘halo’ effect, where it may be difficult to deliver a control intervention without aspects of the novel method creeping in. So an intention-to-treat (ITT) analysis, which analyses patients according to the originally allocated arms, will tend to show no difference. ITT is preferable in superiority trials, in that the method is conservative. A significant result here will show a benefit for one or other treatment policy, i.e. there is likely to be a real-world difference in adopting a policy of giving one or other treatment, even allowing for non-compliance, which will tend to dilute any treatment effect. However, large-scale non-compliance will tend to reduce the size of any treatment effect, reducing significance and possibly making interpretation difficult. A non-significant result may in fact mask a genuine difference in those patients who were able to tolerate the treatment. Table 1 shows the effect of non-compliance on an intention-to-treat analysis where only 40% of patients allocated to the novel treatment actually receive it. Here, although patients receiving the treatment see a 75% remission rate, non-compliance means that in an ITT analysis the observed remission rate is only 60%. Thus, in a 100 vs. 100 patient comparison with control (remission rate 50%), the result is non-significant (P = 0·16); however, if those patients likely to be compliant with treatment could be identified a priori, the comparison becomes highly significant (P = 0·0003). In a superiority trial, such lack of compliance would mean that the group of patients who might have derived benefit from the novel treatment would not be identified, and these patients may be denied access to it. But there is perhaps a greater danger. An ITT analysis of a trial with poor compliance will tend to underestimate any actual treatment effect, beneficial or harmful, in those patients who comply. It is therefore possible that a treatment dis-benefit may appear non-significant, and thus a non-inferiority trial may mistakenly brand the two treatments as no worse than each other, even though for a subgroup of patients one treatment is inferior. As it is not always possible at the point of analysis to identify the subgroup of patients on one treatment in whom compliance with the other treatment is likely to be low, non-inferiority trials in which compliance is not high are difficult to interpret (Lewis & Machin, 1993).

Table 1. The effect of non-compliance upon intention-to-treat analysis
                                           Control group   Treated (full compliance)   Treated (40% compliance)
Remission rate on protocol treatment       50%             75%                         75%
Compliance                                 100%            100%                        40%
Actual ITT remission rate                  50%             75%                         60% (= 0·4 × 75% + 0·6 × 50%)
Number of remissions/number of patients    50/100          75/100                      60/100
P-value versus control                     –               0·0003                      0·16
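The P-values in Table 1 can be reproduced with a standard Pearson chi-squared test; the minimal sketch below (assuming no continuity correction, one of several reasonable choices) shows how 40% compliance dilutes a 75% vs. 50% remission-rate difference to 60% vs. 50% and renders it non-significant.

```python
# Reproducing the P-values in Table 1 with a Pearson chi-squared test.
from scipy.stats import chi2_contingency

def remission_p_value(remit_trt, n_trt, remit_ctl, n_ctl):
    table = [[remit_trt, n_trt - remit_trt],
             [remit_ctl, n_ctl - remit_ctl]]
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p

print(remission_p_value(75, 100, 50, 100))   # full compliance: P ~ 0.0003
print(remission_p_value(60, 100, 50, 100))   # 40% compliance (ITT): P ~ 0.16
```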

Clearly, more reliable evidence will arise from minimizing non-compliance wherever possible. For this reason, it is advisable to ensure eligibility criteria are chosen so that patients are likely to comply initially with whatever treatment is allocated, and to carry out randomization as close as possible to the time at which treatments diverge. If a significant proportion of the patients die before treatments diverge, then any apparent effect size will be reduced, as these early deaths will contribute equally to both arms and dilute any treatment effect. At the same time, these deaths artificially narrow the confidence intervals: taken together, this means there could be a misleading impression of equivalence. It can be argued that patients who relapse, or die, before treatment divergence are not in a position to benefit from a novel therapy. For this reason, it does not necessarily make sense to randomize to maintenance or consolidation therapy, or duration of maintenance, up front – by the time maintenance is due to start, patients may be dead, or may have decided not to receive further treatment. The comparison becomes cluttered with noise at the expense of signal. Consequently, care needs to be exercised before embarking upon a trial with an up-front randomization (Hills et al, 2003).

Factorial designs

As has been noted above, the proportion of patients entering trials in haematological malignancies is large, especially compared to that in other cancers (Cameron et al, 2010). However, the incidence of haematological malignancies is considerably lower than, for example, breast or colorectal cancer, so the underlying pool of patients is smaller. Consequently, it can take considerable time to recruit sufficient patients to answer a given clinical question. One of the major challenges in haematological malignancies is to speed up the evaluation of different agents. A number of novel approaches have been put forward, but one approach, which has long been used in the MRC/NCRI trials, particularly in acute leukaemia, is the factorial design.

The factorial design was developed by R. A. Fisher for agricultural experiments (which also used randomization) (Fisher, 1926). It allows the simultaneous evaluation of two or more treatments in the same group of patients. If one is, for example, simultaneously evaluating treatments A and B, then patients are split equally between control, treatment A alone, treatment B alone, and treatments A and B together. An evaluation of treatment A is obtained by combining the comparison of A alone versus control with that of A plus B versus B alone (in both cases the difference between the regimens actually given is treatment A). So all patients contribute to an analysis of treatment A and, equally, mutatis mutandis, all contribute to an analysis of treatment B (Table 2).

Table 2. Design and analysis of factorial randomized trial
                             Treatment A                Control                    Assessment of treatment A
Treatment B                  A + B (¼ of patients)      B alone (¼ of patients)    A + B vs. B, plus A vs. control
Control                      A alone (¼ of patients)    Control (¼ of patients)
Assessment of treatment B    A + B vs. A, plus B vs. control

Generally speaking, factorial trials are designed upon an assumption of no interaction between the treatments (i.e. they are additive in effect). This is generally true when treatments have different mechanisms of action – one would not necessarily use this kind of design to test two drugs of the same class. This design is the cornerstone of the MRC/NCRI AML trials, and no such interactions have so far been discovered (Burnett et al, 2007, 2009, 2010b, 2011a). However, even if treatments do interact, synergy will give the trial additional power: it is also worth pointing out that it is only by the use of factorial designs and properly stratified analyses that treatment interactions can be identified (Collins et al, 1996).
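The sketch below simulates a 2×2 factorial trial to make this concrete: every patient contributes to the marginal assessment of each factor, and a logistic model with an interaction term checks the additivity assumption. The effect sizes, sample size and use of statsmodels are illustrative choices rather than the analysis of any actual trial.

```python
# Simulated 2x2 factorial trial with additive treatment effects (no interaction).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
a = rng.integers(0, 2, n)                         # randomized to treatment A
b = rng.integers(0, 2, n)                         # randomized to treatment B
p_remission = 0.50 + 0.08 * a + 0.05 * b          # additive effects, no interaction
df = pd.DataFrame({"remission": rng.binomial(1, p_remission), "A": a, "B": b})

# Marginal assessment of A: all recipients of A vs. all non-recipients
print(df.groupby("A")["remission"].mean())

# A non-significant A:B coefficient supports analysing the two questions separately
model = smf.logit("remission ~ A * B", data=df).fit(disp=False)
print(model.summary().tables[1])
```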

A variation on the factorial design is to have randomizations at different time-points: for example, an induction randomization, followed by a consolidation and/or maintenance question. As with the classical factorial design, it is crucial to perform proper tests for interaction between treatments; here, however, one can only examine later treatments stratified by earlier ones, as in general one can only stratify by variables that are known before randomization.

Trials in the modern era

One trend in randomized trials over the past 60 years has been for them to become larger. Very few randomized trials, at least in haematological malignancies, have the 50–100 patients of Hill's day; those that do will either have surrogate laboratory endpoints or be hypothesis-generating phase II trials. This growth in trial size stems partly from the medical community's success in treating leukaemia. With better outcomes and, in particular, a treatment that is to a greater or lesser extent effective, incremental steps get smaller and, consequently, trial size needs to increase. Typically, a ‘one size fits all’ trial, where eligibility is wide, will now need to recruit several hundreds, or even thousands, of patients. There is still undoubtedly a place for these trials: recent evidence on gemtuzumab ozogamicin from both the NCRI and Acute Leukaemia French Association (ALFA) groups arises from this sort of large-scale trial (Burnett et al, 2011a,b; Castaigne et al, 2011). The use of meta-analysis, which combines the results of several trials asking broadly the same question, further increases power and reduces random error (Collins et al, 1996). Additionally, a large national trial brings with it central data and sample collection, providing a framework for other research projects. However, it is worth examining these trials in the context of molecularly driven and individualized medicine.

One criticism of randomized trials is the emphasis upon the average benefit seen. Trials provide evidence about the dataset as a whole (Hill, 1966). They do not necessarily tell us the best treatment for a given patient at a particular time-point. Particularly in the era of individualized medicine, treatments might not be expected to benefit all patients equally, so reliance on an overall average is not necessarily appropriate. In this instance, eligibility can be restricted to those possessing an appropriate marker. Indeed, if several such subgroups can be identified, it is possible to test different targeted therapies in different groups within the context of a larger trial, as in the case of the NCRI AML17 study, where randomization options are determined by assessments during and immediately after course 1 of chemotherapy (Fig 2). However, if the target is less clear, a suitably formulated stratified analysis with a test for interaction (Hill, 1966; Early Breast Cancer Trialists' Collaborative Group [EBCTCG], 1990) can be used to identify subsets of patients who might benefit particularly from a given treatment (e.g. the benefit of gemtuzumab ozogamicin appears smaller in patients with adverse risk cytogenetics; Burnett et al, 2011a). While this approach is not new, and indeed has been a feature of trials for some time, the advent of a plethora of molecular markers throws it into particular relief. However, one cannot simply look at subsets of the data and report those groups which are significant, as some subsets may well show a treatment effect by chance. Even when a treatment acts equally across groups, it is likely that some subgroups will show a larger, and some a smaller, effect than the average purely by chance. Tests for interaction give a P-value for the variation being due to chance, and thus give the strength of the evidence that a subgroup behaves differently from another. However, with a threshold of P < 0·05 there is a one in 20 chance of detecting a non-existent interaction, and thus care must be taken in interpreting these results, especially if several subgroups are examined. Generally speaking, such subgroup analyses should be specified in advance, and those not given a priori should be viewed and reported as hypothesis-generating (Assmann et al, 2000; Clarke & Halsey, 2001).

Figure 2.

Design of the NCRI AML17 trial, which uses cytogenetics, molecular results, and outcomes following course 1 to determine eligibility for different treatment randomizations. APL, acute promyelocytic leukaemia; R, randomize; D, daunorubicin; DA, daunorubicin + cytarabine (ara-C) (figures give dose of daunorubicin); D Clofarabine, daunorubicin + clofarabine; CBF, core binding factor; FLT3, fms-related tyrosine kinase 3; GO3, gemtuzumab ozogamicin 3 mg/m2; CEP-701, lestaurtinib; mTOR, mammalian target of rapamycin inhibition (everolimus); FLAG-Ida, fludarabine, ara-C, granulocyte colony-stimulating factor, idarubicin; CR, complete remission.
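One simple form of the test for interaction described above compares the log odds ratios observed in two subgroups with a z-test; the sketch below illustrates the calculation with invented counts (all figures are assumptions for illustration, not trial data).

```python
# Simple test for interaction between treatment effect and a subgroup
# (e.g. adverse vs. non-adverse cytogenetics).
from math import exp, log, sqrt
from scipy.stats import norm

def log_or(a, b, c, d):
    """Log odds ratio and its variance: remissions/failures on treatment (a/b) vs. control (c/d)."""
    return log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

lor1, v1 = log_or(60, 40, 45, 55)     # subgroup 1: 60/100 vs. 45/100
lor2, v2 = log_or(20, 80, 19, 81)     # subgroup 2: 20/100 vs. 19/100

z = (lor1 - lor2) / sqrt(v1 + v2)
p_interaction = 2 * (1 - norm.cdf(abs(z)))
print(f"OR subgroup 1: {exp(lor1):.2f}, OR subgroup 2: {exp(lor2):.2f}")
print(f"P for interaction: {p_interaction:.2f}")   # ~0.24: weak evidence of a real difference
```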

Many targeted therapies have the statistical advantage of being developed to deal with adverse prognostic markers, so that outcomes in this group of patients are relatively poorer than in the main population; thus event rates are higher, the required level of benefit is larger, and fewer patients are required. Even so, it can be a challenge to identify sufficient patients, and a collaboration or meta-analysis of similar sub-studies between groups may be the only way to obtain a large enough sample size for some rarer groups.

One of the major unmet needs is in those patients with disease subtypes that have particularly poor outcomes (relapsed or refractory disease, older patients, etc.). Here, outcomes have improved little, if at all, over time. Many such groups are not easy to define in terms of clinical characteristics alone (e.g. older patients with AML not suitable for intensive treatment), and hence outcomes can vary widely between cohorts. Thus, randomization is essential, as otherwise it becomes impossible to contextualize the results. One novel approach here is the so-called ‘Pick-A-Winner’ design, which seamlessly moves from a randomized phase II study to a phase III trial for treatments that show evidence of promise (Hills & Burnett, 2011). The approach is designed for disease areas where the aim is to identify quickly novel treatments that produce relatively large improvements, and was originally developed in the context of older patients with AML unsuited to intensive treatment, where outcomes are poor and little or no progress has been made. In this design a number of novel treatments are randomized against control. At two or more time-points in the life of the trial, the outcomes in the novel arms are compared with those in the control arm: unless there is sufficient evidence of benefit, the arm is closed. For this design, there needs to be a number of novel treatments available, an early measure of success, and a requirement that the minimum relevant improvement is large. However, within these constraints, there is the option to add new treatment arms to the study at any time, with comparisons taking place only between patients on a new treatment and contemporaneous controls who could have been allocated the novel treatment in question. This rolling programme of drug discovery uses patients efficiently, in that a single control group can be used for a number of novel treatments, and treatments that do not show evidence of promise are discarded early. While it might appear that such trials would be difficult to run, because the agreement of a number of different pharmaceutical companies is needed, the fact that comparisons are made not between novel treatments but with a control ‘standard of care’ has meant that, in practice, these problems have not arisen. Indeed, the recently closed NCRI AML16 trial evaluated four such novel agents fully in the space of 4·5 years, with recruitment markedly increased compared with the previous trial in the same population. The design is now being carried forward in the Leukaemia and Lymphoma Research-supported LI-1 trial, where there are, at present, five novel arms. However, it is not always so easy to set up a trial with multiple novel agents (especially from rival companies), and developing trial designs that satisfy the needs of industry while at the same time answering important clinical and scientific questions poses an increasing challenge to collaborative groups.
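The interim logic can be caricatured as follows: at each look, a novel arm continues only while its results against contemporaneous controls remain promising. The counts and the simple 'observed benefit greater than zero' rule below are illustrative assumptions only, not the stopping boundaries of the published design (Hills & Burnett, 2011).

```python
# Toy version of a 'Pick-A-Winner' style interim look across several novel arms.
def keep_arm_open(cr_new, n_new, cr_ctl, n_ctl, min_observed_benefit=0.0):
    """Continue only if the observed CR advantage over control exceeds a threshold."""
    return (cr_new / n_new) - (cr_ctl / n_ctl) > min_observed_benefit

control = (10, 50)                                    # 10/50 CR among contemporaneous controls
arms = {"drug X": (14, 50), "drug Y": (9, 50), "drug Z": (18, 50)}

for name, (cr, n) in arms.items():
    decision = "continue" if keep_arm_open(cr, n, *control) else "close for lack of promise"
    print(f"{name}: CR {cr}/{n} vs. control {control[0]}/{control[1]} -> {decision}")
```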

Conclusions

Since the earliest trials of 50–100 patients randomized between two arms, much has changed in medicine in general, and in haematology in particular. As the heterogeneity of leukaemia has been unveiled, so the requirement for proper randomized comparison has become clearer. Only in this way is it possible to put results in context. While observational studies give important insights into who receives what treatment in real life and how patients fare, there is still selection by the clinician in deciding whom to treat, and thus such studies can only complement, rather than replace, the randomized trial (Mauri, 2012).

Stratified analyses, or multiple eligibility pathways, can be used to investigate the molecular heterogeneity of the disease. The wide range of outcomes seen requires different approaches, different numbers of patients and perhaps different endpoints. In the case of the worst risk patients, new methodologies have been developed that bridge the phase II/III divide. Indeed, in haematological malignancies, as in other areas, the old phase definitions are becoming more fluid with the development of phase I/II trials that determine and evaluate a dose of drug based not only on toxicity but also on efficacy (Lionberger et al, 2011). So too, Bayesian trial designs open the possibility of speeding up drug discovery. These techniques allow trial designs to be modified in response to accumulating data, either from within the trial or elsewhere, and importantly, when used in trial monitoring, can be used to close arms (or even trials) early for futility in a manner similar to that of the Pick-a-Winner design (Berry, 2006).
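As a minimal sketch of how such Bayesian monitoring might look for a remission endpoint, the code below updates the posterior probability that a new arm beats control under a Beta-Binomial model and applies a futility rule. The priors, interim counts and 10% threshold are assumptions for illustration only.

```python
# Minimal sketch of Bayesian futility monitoring of a remission endpoint.
import numpy as np

rng = np.random.default_rng(0)

def prob_new_better(cr_new, n_new, cr_ctl, n_ctl, prior=(1, 1), draws=100_000):
    """Posterior P(remission rate on new arm > control) with Beta(1, 1) priors."""
    post_new = rng.beta(prior[0] + cr_new, prior[1] + n_new - cr_new, draws)
    post_ctl = rng.beta(prior[0] + cr_ctl, prior[1] + n_ctl - cr_ctl, draws)
    return float((post_new > post_ctl).mean())

p = prob_new_better(cr_new=12, n_new=40, cr_ctl=15, n_ctl=40)
print(f"P(new arm better than control) = {p:.2f}")
if p < 0.10:                                   # illustrative futility rule
    print("Close the arm for futility")
```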

However, even 75 years after Sir Austin Bradford Hill's textbook, many of the principles remain the same; these are summarized in Table 3. Indeed, even individualized medicine seems to have been foretold by Hill in his 1965 Heberden Oration (Hill, 1966). Far from being outdated, the principles established by Hill and his colleagues in the early days of the MRC Working Party remain the cornerstones of research well into the 21st century. Hill himself said that his aim as a medical statistician was ‘professional suicide’ – to educate other clinicians so that he himself could fade away (Hill, 1966). It is testament to his work that, almost 50 years after his retirement, Hill's principles, and those of his colleagues and successors, still underpin medical research today; and that his lasting monument, the randomized controlled trial, remains the one fair and unbiased way of evaluating new treatments.

Table 3. Summary of principles of clinical trial design
Randomization ensures comparable groups and reduces issues with selection biases
Non-randomized studies can be subject to unquantifiable biases, which inhibit reliable interpretation and can lead to misleading results
Trials need to be adequately powered to detect a clinically relevant difference
Outcomes should be reported using confidence intervals wherever possible – a non-significant result is not proof of equivalence or lack of efficacy
Randomization should take place as close to the point of treatment divergence as possible
Analysis methods should be appropriate to the type of trial being performed: depending on compliance, intention-to-treat analysis alone may not be appropriate in a non-inferiority trial
Subgroup analyses should be performed cautiously: a test for interaction should always be reported, and such analyses need to be specified a priori wherever possible.
Factorial designs allow the efficient evaluation of more than one treatment, so long as treatments can be given together. A factorial design is the only way of exploring interactions between treatments; it is possible to design trials that randomize both induction and consolidation or maintenance treatment.
Trial design should be appropriate for the population being studied: for better risk patients, either collaboration or focus upon quality of life or other such endpoints may be required; for patients with extremely poor outcomes, designs such as ‘Pick-A-Winner’ can evaluate several new treatments efficiently.

Acknowledgements

RKH wrote the manuscript and is grateful to various members of the NCRI AML Working Party and the Haematological Oncology Clinical Studies Group for discussions and for their perspectives on the challenges in haematological malignancy trial design.
