Phase II cancer trials are conducted to decide whether an experimental cancer treatment is worth testing in a large, costly phase III trial. Traditionally, cancer agents were cytotoxic, that is, designed to destroy tumour cells, and phase II cancer trials were single-arm trials that compared the anti-tumour activity of the experimental drug with historical control data. For cytotoxic drugs, tumour shrinkage remains a widely used primary endpoint, because a cytotoxic agent would have to display some level of anti-tumour activity in order to have a positive effect on overall survival, the usual primary endpoint in phase III cancer trials. In recent times, cytostatic drugs have become increasingly common. Cytostatic drugs are molecularly targeted agents designed to improve survival through mechanisms other than directly destroying tumour cells, and so in phase II trials they are primarily assessed through progression-free survival. However, whether the tumour increases in size is often an important secondary outcome, because if the agent fails to control tumour growth, survival is likely to be shortened. Thus, in phase II trials of both cytotoxic and cytostatic drugs, change in the size of the tumour is an important outcome. Although phase II cancer trials were traditionally single-arm trials, randomised trials have more recently become common.
The most common way of assessing change in size of the tumour is the Response Evaluation Criteria in Solid Tumors (RECIST). RECIST classifies patients into complete responses (CR), partial responses (PR), stable disease (SD) or progressive disease (PD). Generally, in trials of cytotoxic agents, CR and PR are classed as treatment successes, with SD and PD classed as treatment failures. The proportion of patients that are PR or CR is called the objective response rate (ORR). In trials of cytostatic agents, SD is included in treatment success, and the proportion of patients that are successful is called the disease control rate (DCR). Both ORR and DCR are partly determined from a dichotomisation of the underlying continuous shrinkage in the total diameter of pre-specified target lesions (henceforth referred to as tumour size). To be classed as a success using ORR requires that tumour size shrinks by > 30%, with success using DCR requiring an increase of < 20% or a shrinkage. Generally, using a dichotomised continuous variable loses statistical efficiency, and so the idea of directly using the tumour shrinkage itself as an endpoint has been proposed [5-7]. However, RECIST also classifies patients as PD (and hence treatment failures in both ORR and DCR) if new tumour lesions are observed or if non-target lesions noticeably increase in size. Both of these possible events are associated with a poorer long-term survival prognosis, and using only the tumour shrinkage as the endpoint does not take into account patients who are treatment failures for these important reasons.
In addition, other possible outcomes may be of interest, such as toxicity. Because cytotoxic cancer treatments are toxic, patients in cancer trials often experience toxicities. At phase II, a new treatment would not be considered for a phase III trial if it caused substantial risk of death or toxicity, even if it caused tumour shrinkage. Bryant and Day argue that toxicity should be considered in phase II cancer trials and extend the design of Simon to include toxicities. Toxicities are generally graded from 1 to 4 using the Common Terminology Criteria for Adverse Events (http://ctep.cancer.gov/protocolDevelopment/electronic_applications/ctc.htm), with grades 3 and 4 being considered serious and often resulting in treatment being discontinued. We henceforth refer to grade 3 and 4 toxicities as ‘toxicity’. To complicate matters, once a patient experiences progressive disease or suffers a toxicity, they are usually removed from the trial, and their tumour shrinkage is no longer measured.
To improve precision in estimation of a treatment's ORR or DCR, we consider a composite ‘success’ endpoint determined by (1) the change in tumour size; (2) the appearance of new lesions or increase in non-target lesion size; and possibly also (3) toxicity and/or death. This success endpoint therefore has both continuous and binary components. To be classified as a treatment success, a patient must be a success for the binary component (i.e. not have new tumour lesions), and their continuous component (tumour shrinkage) must be greater than a pre-defined threshold, which will depend on whether the treatment is cytotoxic or cytostatic. The probability of treatment success is equivalent to the ORR if toxicity or death are not considered and the threshold is 30%; similarly, it is equivalent to the DCR if the threshold is − 20%.
In trials comparing two treatments, and those comparing one or two treatments to historical data, it is of interest to estimate the probability of treatment success and to provide some measure of uncertainty for this estimate, for example, a confidence interval (CI). In this paper, we propose a method that we call the augmented binary approach. This uses the actual value of the observed tumour shrinkage (henceforth referred to as continuous tumour shrinkage), rather than just whether it is above a threshold, in order to reduce the uncertainty in the estimate of the success probability. Consequently, the width of the CI for the probability of success can be reduced. This also increases the power to detect differences between arms or to test a hypothesis comparing the treatment to historical data. The idea of testing hypotheses about binary outcomes using continuous data was originally suggested by Suissa. There, the binary endpoint was formed purely by a dichotomisation of a continuous variable, and each individual had an observed value for the continuous variable. The augmented binary method is a generalisation of Suissa's approach to a composite binary endpoint where complete tumour shrinkage data are not available for patients who are treatment failures for reasons (2) or (3) in the previous paragraph. The method leads to valid inference when the probability of dropout depends only on observed information (i.e. when data are missing at random (MAR)). Although trials are often not powered for a comparison of treatment success probabilities in the two arms, such a comparison is often made in randomised trials, and so we also consider the power of the augmented binary approach for this comparison. We compare this power with that of a logistic regression approach and of an approach proposed by Karrison et al., which directly tests the continuous shrinkage using a nonparametric test.
2.1 Estimating success probability using the augmented binary method
We assume that the aim is to estimate the probability of success of a treatment, that is, the proportion of patients that have tumour shrinkage above some critical value (assumed for now to be 30 %) and do not fail for other reasons (i.e. new lesions, non-target lesions increase in size, toxicity or death).
During the trial, n patients are allocated to the treatment under consideration. Each patient has their sum of target lesion diameters measured at baseline. This quantity is measured halfway through the treatment and at the end of treatment for patients who remain in the study at these times. We denote these measurements for patient i as z0i (baseline), z1i (interim) and z2i (end). We define non-shrinkage failure indicators: D1i = 1 if patient i fails for a reason other than tumour shrinkage before the interim measurement, and D2i = 1 if such a failure occurs between the interim measurement and the end of treatment. Henceforth, such failures are referred to as non-shrinkage failures. To allow the distribution of the continuous measurements to be approximated by a multivariate normal distribution, we use the log tumour-size ratio, yji = log(zji ∕ z0i) for j = 1 (interim) and j = 2 (end). Note that complete responses, that is, a complete disappearance of tumour lesions, have an undefined log tumour-size ratio. Instead, the lowest tumour-size ratio of all other patients can be substituted. If the proportion of complete responses is low, as is the case in most applications using RECIST, then the resulting deviation from the normality assumption does not affect the operating characteristics of methods assuming normality; if the proportion of complete responses is higher, then a more sophisticated model, such as one based on the censored normal distribution, could be used instead.
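As a concrete illustration, the transformation and the substitution for complete responses might be coded as follows (a minimal Python sketch with our own variable names; the paper's Supporting information supplies R code):

```python
import numpy as np

def log_tumour_ratio(z, z0):
    """Log tumour-size ratio y_ji = log(z_ji / z_0i).

    z  : follow-up sums of target lesion diameters (0 for complete responses)
    z0 : baseline sums of target lesion diameters
    Complete responses (z == 0) have an undefined log ratio, so they are
    assigned the smallest ratio observed among the other patients.
    """
    z = np.asarray(z, dtype=float)
    z0 = np.asarray(z0, dtype=float)
    ratio = z / z0
    cr = ratio == 0                      # complete responses
    if cr.any():
        ratio[cr] = ratio[~cr].min()     # substitute lowest observed ratio
    return np.log(ratio)
```

A value below log(0.7) then corresponds to a shrinkage of more than 30%.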
We define Si as the observed composite success indicator for patient i. The value of Si is 1 if D1i = 0, D2i = 0 and y2i < log(0.7). In words, Si is equal to 1 if patient i has a tumour shrinkage of more than 30 % at the end of treatment and no non-shrinkage failure. The value of Si is missing if the patient drops out of the trial for a reason other than one of the failure criteria.
For the augmented binary approach, models must be specified for the tumour shrinkage and the probability of non-shrinkage failure. The tumour shrinkage is modelled using a bivariate normal model:
(y1i, y2i)ᵀ ∼ N((μ1i, μ2i)ᵀ, Σ), (1)

where μ1i = α + γz0i and μ2i = β + γz0i. Tumour size measurements that are missing because of non-shrinkage failures are treated as MAR. This is a valid assumption if the probability of non-shrinkage failure depends only on the previously observed tumour size. Additional covariates can be included in the tumour-shrinkage model to make the MAR assumption more plausible. Model (1) assumes that the mean log tumour-size ratio is determined by baseline tumour size and the time of the observation (i.e. interim or end). An unstructured covariance matrix, Σ, is used. This class of model can be fitted in R using the gls function in the nlme library.
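The paper supplies R code using gls; as an illustrative alternative, model (1) can also be fitted by direct maximum likelihood. The sketch below is our own parameterisation (log standard deviations and a Fisher-z correlation to keep the covariance matrix valid), not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def fit_shrinkage_model(y, z0):
    """MLE for model (1): (y1, y2) ~ N((alpha + gamma*z0, beta + gamma*z0), Sigma).

    y  : (n, 2) array of log tumour-size ratios (interim, end), complete cases
    z0 : (n,) baseline tumour sizes
    """
    def nll(par):
        a, b, g, ls1, ls2, rho_z = par
        s1, s2 = np.exp(ls1), np.exp(ls2)      # log scale keeps sds positive
        rho = np.tanh(rho_z)                   # Fisher-z keeps |rho| < 1
        cov = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])
        mu = np.column_stack([a + g * z0, b + g * z0])
        r = y - mu
        ci = np.linalg.inv(cov)
        quad = np.einsum("ni,ij,nj->n", r, ci, r)   # Mahalanobis terms
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * np.sum(quad + logdet + 2 * np.log(2 * np.pi))
    res = minimize(nll, x0=np.zeros(6))
    a, b, g, ls1, ls2, rho_z = res.x
    return {"alpha": a, "beta": b, "gamma": g,
            "sd1": np.exp(ls1), "sd2": np.exp(ls2), "rho": np.tanh(rho_z)}
```

This treats the unstructured 2 × 2 covariance matrix as three free parameters, matching the parameter count given below.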
For the non-shrinkage failure process, we separately model the probability of non-shrinkage failure before the interim (i.e. the probability that D1i = 1) and the conditional probability of non-shrinkage failure between the interim and the end given that the patient survived to the interim (i.e. the probability that D2i = 1 given D1i = 0). Logistic regression is used for both models:

logit Pr(D1i = 1) = αD1 + γD1z0i, (2)
logit Pr(D2i = 1 ∣ D1i = 0) = αD2 + γD2z0i. (3)
Let θ be the vector of parameters from the tumour shrinkage model, (1), and the non-shrinkage failure models, (2) and (3). Using the aforementioned parameterisations, θ is of length 10 (three parameters for μ1 and μ2, three for the covariance matrix Σ and two in each of the two non-shrinkage failure models). The probability of success for patient i, with baseline tumour size z0i, is as follows:

pi(θ) = Pr(Si = 1 ∣ z0i; θ) = {1 − expit(αD1 + γD1z0i)}{1 − expit(αD2 + γD2z0i)} ∫∫_{y2 < log(0.7)} f(y1, y2; μ1i, μ2i, Σ) dy1 dy2, (4)

where f(y1, y2; μ1i, μ2i, Σ) is the pdf of the bivariate normal distribution from Equation (1).
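Because the non-shrinkage failure models depend only on z0i, the double integral in Equation (4) collapses to the marginal normal cdf of y2i, so the success probability for one patient is two logistic terms times a normal tail probability. A Python sketch (parameter names are ours):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def success_prob(z0, theta):
    """Pr(S_i = 1 | z0; theta), as in Equation (4).

    theta: dict with tumour-model parameters (beta, gamma, sd2: the intercept,
    baseline effect and standard deviation of y2) and failure-model
    parameters (aD1, gD1, aD2, gD2).
    """
    mu2 = theta["beta"] + theta["gamma"] * z0               # mean of y2 given z0
    p_no_fail1 = 1 - expit(theta["aD1"] + theta["gD1"] * z0)
    p_no_fail2 = 1 - expit(theta["aD2"] + theta["gD2"] * z0)
    p_shrink = norm.cdf((np.log(0.7) - mu2) / theta["sd2"])  # Pr(y2 < log 0.7)
    return p_no_fail1 * p_no_fail2 * p_shrink
```

The treatment's mean success probability is then the average of `success_prob` over the observed baseline sizes z0i.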
The mean success probability of the treatment is p(θ) = n−1 Σi pi(θ), which can be estimated by p(θ̂), where θ̂ is the maximum likelihood estimator of θ. To obtain a CI for this probability, we transform to the log-odds scale, l(θ) = logit{p(θ)}. A CI for l(θ) requires an estimate of the variance of l(θ̂), which we obtain using the delta method:

Var{l(θ̂)} ≈ ∇l(θ̂)ᵀ Var(θ̂) ∇l(θ̂),

where ∇l(θ̂) is the vector of partial derivatives of l(θ) with respect to θ, evaluated at θ̂. These partial derivatives can be approximated using the finite difference method. The parameters of the tumour size model and the two non-shrinkage failure models are distinct, and so Var(θ̂), the covariance matrix of θ̂, is block diagonal, that is, the covariance between parameter estimates from the different models is zero.
An approximately (1 − α)100% CI for p(θ) is

expit[ l(θ̂) ± Φ−1(1 − α ∕ 2) √Var{l(θ̂)} ],

where Φ−1(1 − α ∕ 2) is the 100(1 − α ∕ 2) percentile of the standard normal distribution and expit is the inverse logit function.
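A generic numerical implementation of this CI (our sketch: central finite differences for the gradient, then the delta method on the log-odds scale):

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

def delta_method_ci(l, theta_hat, var_theta, alpha=0.05, eps=1e-6):
    """CI for expit(l(theta)) via the delta method.

    l         : scalar function of the parameter vector (log-odds scale)
    theta_hat : maximum likelihood estimate of theta
    var_theta : covariance matrix of theta_hat (block diagonal here)
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    # central finite-difference approximation to the gradient of l
    grad = np.empty_like(theta_hat)
    for j in range(len(theta_hat)):
        step = np.zeros_like(theta_hat)
        step[j] = eps
        grad[j] = (l(theta_hat + step) - l(theta_hat - step)) / (2 * eps)
    var_l = grad @ var_theta @ grad                 # delta-method variance
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_l)
    centre = l(theta_hat)
    return expit(centre - half), expit(centre + half)
```

Back-transforming with expit guarantees the interval stays inside (0, 1).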
For comparison, we also estimate the probability of success and a CI using just the binary success indicators (i.e. the Si). The CI is found using a Wilson score interval. This method is referred to as the ‘binary’ method.
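For reference, the Wilson score interval is a standard binomial interval and can be computed directly:

```python
import numpy as np
from scipy.stats import norm

def wilson_ci(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion successes/n."""
    z = norm.ppf(1 - alpha / 2)
    phat = successes / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half
```

Unlike the Wald interval, it behaves sensibly for proportions near 0 or 1, which matters at phase II sample sizes.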
2.2 Testing for a difference in success probability between two treatments
We assume that the trial proceeds as follows: 2n patients are recruited, with n randomised to each treatment. All patients are measured as in Section 2.1, and all definitions remain the same. The only difference is that a parameter representing the effect of treatment is included in models (1)-(3). Thus, the tumour shrinkage is modelled as follows:
(y1i, y2i)ᵀ ∼ N((μ1i, μ2i)ᵀ, Σ), (6)

where i indexes the ith patient, μ1i = μ + β1ti + γz0i, μ2i = μ + δ + β2ti + γz0i, and ti is the treatment indicator for patient i.
The models for non-shrinkage failure are as follows:

logit Pr(D1i = 1) = αD1 + βD1ti + γD1z0i, (7)
logit Pr(D2i = 1 ∣ D1i = 0) = αD2 + βD2ti + γD2z0i. (8)
Let θ be the vector of parameters from the tumour shrinkage model, (6), and the non-shrinkage failure models, (7) and (8). Using the aforementioned parameterisations, θ is of length 14 (five parameters for μ1 and μ2, three for the covariance matrix Σ and three in each of the two non-shrinkage failure models). Using the three models, the probability of success for a patient, pi(θ), is as in Equation (4). We define the true mean difference in success probability, m(θ), as follows:

m(θ) = (2n)−1 Σi=1,…,2n {Pr(Si = 1 ∣ z0i, ti = 1; θ) − Pr(Si = 1 ∣ z0i, ti = 0; θ)}.
Formulating the difference in this way adjusts the analysis for a chance imbalance in baseline tumour size (or other covariates that may affect probability of success, if they are included in any of the models).
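A sketch of how m(θ) could be computed from fitted parameters of models (6)-(8), averaging the modelled success probability with the treatment indicator set to 1 and to 0 for every randomised patient (parameter names are illustrative, and the normal-tail form assumes the same factorisation as Equation (4)):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def mean_success_difference(z0_all, theta):
    """m(theta): average difference in success probability, t = 1 vs t = 0.

    z0_all : baseline tumour sizes of all 2n randomised patients
    theta  : dict of parameters of models (6)-(8) (illustrative names)
    """
    def p_success(z0, t):
        mu2 = theta["mu"] + theta["delta"] + theta["b2"] * t + theta["gamma"] * z0
        q1 = 1 - expit(theta["aD1"] + theta["bD1"] * t + theta["gD1"] * z0)
        q2 = 1 - expit(theta["aD2"] + theta["bD2"] * t + theta["gD2"] * z0)
        return q1 * q2 * norm.cdf((np.log(0.7) - mu2) / theta["sd2"])
    z0_all = np.asarray(z0_all, dtype=float)
    # marginalise over the observed covariate distribution in both arms
    return np.mean(p_success(z0_all, 1) - p_success(z0_all, 0))
```

Averaging over all 2n observed z0i is what adjusts the comparison for a chance covariate imbalance.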
The estimated mean difference is m(θ̂). We use a Wald test of the null hypothesis that the true mean difference in success probabilities is zero. The variance of m(θ̂) is again estimated using the delta method.
R code to apply the augmented binary method is given in the Supporting information.
For comparison, we consider a logistic regression fitted directly to the observed Si, with a parameter for baseline tumour size and a parameter for the treatment effect. A Wald test of the treatment effect is used as the test statistic for the difference in success probability between arms. Patients whose success status is missing are not included in the analysis. This method is subsequently referred to as the ‘logistic-regression’ method.
Also considered is the method of Karrison et al. The log tumour-size ratio at the end of treatment, y2i, is directly compared between arms using a nonparametric test. For patients who suffer a non-shrinkage failure, the log tumour-size ratio is set to the worst value observed among all other patients. Patients who drop out for reasons other than a non-shrinkage failure are not included in the analysis.
3 Simulation study
To compare the operating characteristics of the augmented binary approach with those of estimating the success probability using just the binary success data (for non-comparative trials) and the logistic regression approach and Karrison's method (for comparative trials), we conducted a simulation study. We describe the simulation setup first for non-comparative trials and then for comparative trials. In all simulations, n = 50 or n = 75 patients are randomised to each treatment. This represents sample sizes seen in recent randomised phase II cancer trials.
3.1 Simulation setup for non-comparative trials
We assume each patient's baseline tumour size is uniformly distributed between 5 and 10 cm. We denote the mean log tumour size ratio at the final endpoint as δ1. We generate data assuming that, for a given treatment with treatment effect δ1, the distribution of the log tumour size ratio at the interim and final endpoints is multivariate normal with mean (0.5δ1, δ1) and covariance matrix Σ, parameterised by σ (the values used are given in Table 1).
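One replicate of this data-generating process might be sketched as follows (our illustration: the covariance structure below, with interim variance σ²∕2 and final variance σ², is an assumption on our part, as are the function and variable names):

```python
import numpy as np
from scipy.special import expit

def simulate_trial(n, delta1, sigma, aD, gD, rng):
    """Simulate one trial arm (our sketch, not the authors' code)."""
    z0 = rng.uniform(5, 10, n)                        # baseline tumour size (cm)
    mean = np.array([0.5 * delta1, delta1])
    cov = sigma**2 * np.array([[0.5, 0.5],            # assumed covariance
                               [0.5, 1.0]])           # structure (see lead-in)
    y = rng.multivariate_normal(mean, cov, size=n)    # latent log size ratios
    # non-shrinkage failures before interim and between interim and end
    d1 = rng.random(n) < expit(aD + gD * z0)
    d2 = (~d1) & (rng.random(n) < expit(aD + gD * z0))
    # tumour measurements after a failure are unobserved (missing at random)
    y1 = np.where(d1, np.nan, y[:, 0])
    y2 = np.where(d1 | d2, np.nan, y[:, 1])
    success = (~d1) & (~d2) & (y[:, 1] < np.log(0.7))
    return z0, y1, y2, d1, d2, success
```

Repeating this over many replicates and applying both analysis methods gives the bias, coverage and CI-width comparisons reported below.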
The models used to determine the probabilities of non-shrinkage failure before and after interim for the simulated data are those given by Equations (2) and (3). For all simulation scenarios, the values of γD1 and γD2 are set to the same value, γD; similarly, αD1 = αD2 = αD.
A similar model is independently used to simulate dropout due to non-failure reasons (subsequently referred to as dropout). We denote the intercept and effect of tumour size in this model as αO and γO, respectively. This model determines if an individual's non-shrinkage failure status and continuous tumour shrinkage are missing at future observation times. Thus, if in a particular interval an individual is simulated to suffer a non-shrinkage failure and also to drop out, they are recorded as having dropped out.
In the simulation study, the values of δ1, σ, αD, γD, αO and γO are varied. The performance of the augmented binary method is assessed in terms of bias and coverage over 5000 simulation replicates. We also estimate the reduction in the width of the 95% CI compared with the binary method.
3.2 Simulation setup for comparative trials
For comparative trials, we denote the mean log tumour size ratio at the final endpoint as δ0 and δ1 in the control and experimental arms, respectively. The values for δ0 and δ1 are determined by two parameters, x and ψ:

δ0 = log(0.7) + ψ + x,  δ1 = log(0.7) + ψ − x.
The value of 2x determines the difference in the mean log tumour size ratio of the two treatments, and the value of ψ reflects the effectiveness of the control treatment. When ψ = 0, the two mean shrinkages are symmetric around log(0.7), corresponding to a 30 % shrinkage, which is the dichotomisation point used in the ORR endpoint.
The data were then simulated as in the previous section. The models for simulating non-shrinkage failures and dropout also include treatment effect parameters βD and βO.
The augmented binary approach was compared with fitting a logistic regression model to the binary success data and also with Karrison's method applied using the Wilcoxon rank-sum test. For each simulation study, 5000 datasets were simulated for each parameter combination. For a true type I error rate of 0.05, this gives a Monte Carlo standard error for the estimated type I error rate of 0.003.
3.3 Operating characteristics of augmented binary approach for non-comparative trials
Table 1 summarises the operating characteristics of the augmented binary approach for different parameter values. In most situations, the augmented binary method considerably reduces the width of the CI compared with the binary method. For example, with n = 75 and the baseline simulation scenario, the augmented binary method reduces the average width of the 95 % CI by 17 %. To obtain a similar average width using just the binary data, a sample size 1.17² ≈ 1.37 times bigger, that is, around 103, would be needed. This figure of 37 % is very similar to the loss in information from dichotomising a continuous treatment outcome (e.g. in Wason et al.).
Table 1. Operating characteristics of augmented binary method (Aug Bin) in comparison with just using the binary success data (Bin).
(Table entries are not reproduced in this extract. For each scenario, at n = 50 and n = 75, the table reports the bias and coverage of the augmented binary method and the reduction in 95 % CI width relative to the binary method. Scenarios: baseline; δ1 = 0; δ1 = 0.18; σ = 2; (αD, γD) = (−2.5, 0.2); (αO, γO) = (−2.15, 0); (αO, γO) = (−2.9, 0.1).)
All estimates are based on 5000 replicates. Simulation parameters are described in Section 3.1. The baseline scenario corresponds to δ1 = −0.356, σ = 1, αD = −1.5, γD = 0, αO = −∞, γO = 0 (i.e. no dropout); non-baseline scenarios are as the baseline scenario except for the specified difference.
Although there is a clear reduction in CI width, the augmented binary method appears in two scenarios to have slightly below nominal coverage. These two scenarios are also the scenarios where the power gain is greatest. The worst coverage observed (92.4 %) is when δ1 = 0.18, which corresponds to a median increase in tumour size between baseline and final time points of 20 %. We changed the dichotomisation threshold to 0%, which improved coverage to 94.5 %. In practice, before the trial started, if one expected a treatment to result in a low average tumour shrinkage, or an average increase in tumour size, a more suitable dichotomisation threshold should be used, such as that of the DCR endpoint.
Interestingly, in scenarios with dropout (the last two rows in Table 1), the augmented binary approach gives roughly the same average reduction in 95% CI width as in the analogous scenario with no dropout, despite the decrease in the number of patients with complete data. This suggests that allowing the inclusion of interim data for patients who drop out between the interim and the end improves the precision of the estimated probability of success. This gain requires some correlation between the interim and final measurements.
3.4 Comparison of approaches for testing the treatment effect in comparative trials
3.4.1 Comparison as the mean log tumour size ratio varies
We first investigate the case where the non-shrinkage failure process does not depend on treatment or tumour size, that is, βD = 0 and γD = 0. The value of αD is set to − 1.39, corresponding to a 20 % chance of failing between baseline and interim, and 20 % between interim and the final observation. We assume no dropout. Further, we set ψ to 0, corresponding to the mean log tumour size ratio at the final timepoint of the two treatments being symmetric around log(0.7). We varied x in increments of 0.05 between 0 and 0.40.
Figure 1 shows the power of the three methods as x varies. There is a consistent power advantage from using the augmented binary approach, and Karrison's method has the lowest power. The type I error rate of all three methods (i.e. the power when x = 0) is controlled at the nominal level of 0.05.
Figure 2 shows the power of the three methods for n = 75 and x = 0.35 as the value of ψ changes. The mean shrinkage of both treatments decreases as ψ increases. Because the Wilcoxon rank-sum test statistic uses only ranks, the value of ψ does not affect the power of Karrison's method. For values of ψ < − 0.65, the augmented binary approach has the worst power of the three methods. As ψ increases, the power of the augmented binary approach increases, whereas the power of logistic regression peaks at ψ = 0.25 and then decreases.
Table 1 in the Supporting information summarises the type I error rates as the value of ψ changes. For the augmented binary and logistic regression methods, they are generally slightly inflated for negative ψ and deflated for positive ψ. The deviation from 0.05 is generally greater for the logistic regression than for the augmented binary approach (the type I error rate of the former when ψ = 1 is 0.026 when n = 50 and 0.038 when n = 75). This is consistent with previous research showing that when there are fewer than five ‘events’ (in this case, failures) per parameter in a logistic regression, the standard error can be poorly estimated . Karrison's method controls the type I error rate at the nominal level for all values of ψ.
The aforementioned results show that the augmented binary approach has low power for large negative values of ψ. A negative value of ψ means that both treatments are, on average, effective at shrinking tumours, and so most surviving patients will have a tumour shrinkage far above the threshold for success. In that case, using the exact tumour shrinkage improves estimation of the probability of success less than it does when the mean shrinkage is close to the threshold. However, this argument only explains why the power of the augmented binary approach should approach that of the logistic regression; in Figure 2 it appears slightly lower. This could be because the augmented binary approach requires estimation of a greater number of parameters (14, versus 3 for the logistic regression method).
When both treatments are highly effective, a better dichotomisation threshold would be one that creates a more equal balance between the numbers of successes and failures. We investigated the power of the augmented binary and logistic regression methods for (x,ψ) = (0.35, − 1) as the dichotomisation threshold was varied (Karrison's method was not considered because its power does not depend on the dichotomisation threshold). These parameters correspond to a median shrinkage of around 63 % using the control treatment and around 82 % using the new treatment. The results are shown in Figure 1 of the Supporting information. They show that the power of both approaches increases as the dichotomisation threshold decreases (i.e. a greater tumour shrinkage is required to declare success), and the augmented binary approach becomes more powerful than the logistic regression method. This indicates that if large average tumour shrinkages are expected, the dichotomisation threshold should be lowered: both methods then gain power, and the augmented binary method becomes more powerful than the logistic regression method.
3.4.2 Comparison as probability of non-shrinkage failure varies
We next investigated the relative power of the three methods as the parameters used to generate the non-shrinkage failure process were varied. Both treatments were assumed to have a median tumour shrinkage of 30%, that is, (x,ψ) = (0,0). Figure 3 shows the power as βD, the effect of treatment on the log-odds of non-shrinkage failure, changes. Negative values of βD mean that the probability of non-shrinkage failure is lower when using the new treatment compared with the control treatment. We set γD, the effect of the tumour size on probability of non-shrinkage failure, to be zero. The results show that Karrison's method is the most powerful in this situation, with the augmented binary approach in second place. Karrison's approach of setting all patients who die or suffer toxicity to the worst possible outcome makes the approach very powerful when there is a difference in probability of non-shrinkage failure between the two arms. The augmented binary approach has noticeably higher power than logistic regression. This is likely to be because the augmented binary approach models the probability of non-shrinkage failure before the interim and after the interim separately, whereas the logistic regression method only models whether or not a non-shrinkage failure occurred and does not distinguish between events before and after the interim.
Figure 2 in the Supporting information shows the power for βD = − 1 as γD varies. The power of all three approaches appears to be insensitive to γD (although there is a slight decrease as γD increases). Karrison's method consistently shows the highest power, followed by the augmented binary approach.
We also investigated a scenario where the new treatment has a higher mean tumour shrinkage and also a lower probability of non-shrinkage failure (x = 0.175, βD = − 0.5, γD = 0 and αD = − 1.155). In this scenario, the augmented binary approach has a slightly higher power than Karrison's method (0.688 compared with 0.642). More generally, the most powerful method will depend on the relative magnitudes of the effects of the new treatment on mean tumour shrinkage and on probability of non-shrinkage failure.
3.5 Sensitivity analyses
We wished to assess the sensitivity of the operating characteristics to three assumptions made by the augmented binary method: first, that the probability of non-shrinkage failure depends only on the previous tumour size observation; second, that the various reasons for non-shrinkage failure can be combined into one binary category; and third, that the log tumour size ratio is normally distributed. A full description of the methods, together with simulation results, is given in the Supporting information. Generally, the augmented binary method was robust to deviations from all three assumptions.
4 Case study
To illustrate the use of the augmented binary approach, we applied it to data from the CAPP-IT trial (discussed by Corrie et al.). CAPP-IT was a multi-centre, randomised, placebo-controlled study assessing whether pyridoxine reduces dose modifications in cancer patients treated with capecitabine. Hand-foot syndrome is a common adverse effect of capecitabine, and its occurrence often results in treatment being modified (i.e. delayed or discontinued). In the trial, 106 patients who had been assigned to palliative single-agent capecitabine chemotherapy were randomised to receive pyridoxine or placebo (53 in each arm). The primary outcome was the probability of capecitabine dose modification, with tumour response a secondary outcome. The trial was not powered to detect differences in tumour response between the two arms, so we consider the two arms separately. Patients were assessed every 12 weeks until disease progression, toxicity (including hand-foot syndrome) or dropout for other reasons. We analyse the data as if the endpoint of interest were the probability of success at 24 weeks. We thus have a maximum of three tumour size measurements per patient: at baseline, halfway through treatment and at the end of treatment.
As in the simulation study, we define a patient as successful if no toxicity or death occurs, no new lesions develop and the tumour shrinkage between baseline and the final observation is greater than 30%. Because patients were recorded as treatment failures if their tumour size increased by 20 % or more between the baseline and interim measurements, we also include this as a failure criterion. With this addition, the probability of success is similar to Equation (4), except that success additionally requires a first-stage log tumour size ratio of less than log(1.2):

Pr(Si = 1 ∣ z0i; θ) = {1 − expit(αD1 + γD1z0i)}{1 − expit(αD2 + γD2z0i)} Pr(y1i < log(1.2), y2i < log(0.7); θ).
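With the extra interim criterion, the normal tail probability in the success probability becomes a bivariate rectangle probability, which can be evaluated numerically (a sketch; the mean and covariance passed in would come from the fitted model (1)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_shrinkage_prob(mu, cov,
                         interim_cut=np.log(1.2), final_cut=np.log(0.7)):
    """Pr(y1 < log 1.2, y2 < log 0.7) for bivariate normal (y1, y2)."""
    return multivariate_normal(mean=mu, cov=cov).cdf([interim_cut, final_cut])
```

Multiplying this by the two non-shrinkage-failure survival terms gives the patient-level success probability above.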
Table 2 shows the numbers of patients, successes and patients with unknown success status.
Table 2. Summary of number of successes and failures on each arm.
(Table entries are not reproduced in this extract; the rows give, for each arm, the numbers of successes, failures due to less than the required tumour shrinkage, non-shrinkage failures and patients with unknown success status.)
Only patients who did not drop out of the trial before the baseline tumour measurement are included. Note that categories are not mutually exclusive, that is, patients can fail for both tumour-shrinkage and non-tumour-shrinkage reasons.
We estimated the probability of success, together with a 95 % CI, for each arm separately. The non-shrinkage failure models are as in Equations (2) and (3). The model for tumour shrinkage is the same as in Equation (1). The augmented binary method is compared with estimating the probability of success from the binary data alone. All patients with baseline tumour size data were included in the augmented binary analysis, whereas only complete cases could be considered using the binary method.
Table 3 shows the estimated probability of success and 95 % CI for both arms using the augmented binary and binary methods. We consider three possible dichotomisation thresholds: 0.7, corresponding to a 30 % shrinkage in tumour size required for success; 1, corresponding to any shrinkage required for success; and 1.2, corresponding to a shrinkage or increase of less than 20 % required for success. The first and third of these are the thresholds used in the objective response rate and the disease control rate, respectively.
Table 3. Estimated probability of success and 95 % CIs from binary and augmented binary methods for the two treatments in the case study.
Table 3 shows that the augmented binary method can change the estimate of the success probability considerably in some cases. The largest change is a reduction in the estimated success probability from 0.122 to 0.068 for pyridoxine with a dichotomisation threshold of 0.7. In this case, three successful patients had tumour shrinkages very close to the dichotomisation threshold, whereas just one treatment failure was close to being a success. In other cases, the two methods give similar estimates. In all cases, the 95 % CIs from the two methods overlap. The augmented binary method reduces the width of the CI in all cases. Even in the case where the two estimates are most similar (placebo with a dichotomisation threshold of 0.7), the augmented binary method gives a 25 % reduction in the width of the CI.
In this paper, we have proposed and assessed the augmented binary method, which makes inference about a composite success outcome defined by a continuous outcome and a binary outcome, using the continuous component to improve precision. This method is motivated by phase II cancer trials, where tumour response is a composite endpoint defined by continuous tumour shrinkage and binary non-shrinkage failure indicators, such as whether new lesions are observed. We find that in general, the augmented binary approach improves inference about the probability of success considerably over methods that only consider whether the continuous tumour shrinkage is above a threshold.
There are several issues for consideration before using the augmented binary method. One issue is whether the more complicated methodology is worth applying for the gains seen. We show that the information gain from using the augmented binary method is comparable with the gain seen from modelling a continuous outcome directly rather than dichotomising it. There is a strong consensus amongst statisticians that it is a bad idea to dichotomise continuous outcomes. However, generally this consensus is seen in situations where alternative continuous models are easy to apply, such as use of a linear model instead of a logistic regression. We have included code in the Supporting information, which we hope will reduce the difficulty of implementing the augmented binary method. A second issue is how the sample size for a trial using the augmented binary method should be chosen. Because several endpoints are of interest at phase II, we believe the sample size should be chosen as if the trial were to be analysed using traditional methods. Then, using the augmented binary approach provides extra precision on the estimated success probability or higher power for a comparison between two arms. Because of the number of parameters used, we would suggest that the method only be used for reasonably large sample sizes, at least 50 per arm. A third issue is that in certain situations, the augmented binary approach does not add any power, for instance when the probability of success is very high. Because of this, the dichotomisation threshold for the tumour shrinkage is very important. A suitable dichotomisation threshold should be pre-specified so that the expected probability of success is neither too low nor too high. For example, if few partial responses are expected, the disease control rate would be a more suitable choice of endpoint than the response rate. In these situations, the simulations show clear gains from use of our method.
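The efficiency point can be checked with a small simulation. This illustrative sketch (the distribution and all parameter values are invented, not taken from the paper) compares the sampling variability of the dichotomised estimator of a success probability with a normal-model-based estimator when the normal model holds:

```python
import math
import random

random.seed(2)


def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sd(xs):
    """Sample standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


# Hypothetical setup: true log tumour ratios are N(0, 0.5^2) and success
# means crossing the log(0.7) threshold (>30% shrinkage).
mu_true, sd_true = 0.0, 0.5
theta = math.log(0.7)
n, reps = 50, 2000

binary_ests, continuous_ests = [], []
for _ in range(reps):
    ys = [random.gauss(mu_true, sd_true) for _ in range(n)]
    # Dichotomised estimator: the sample success proportion.
    binary_ests.append(sum(y < theta for y in ys) / n)
    # Continuous-model estimator: plug fitted mean and SD into the CDF.
    m = sum(ys) / n
    s = math.sqrt(sum((y - m) ** 2 for y in ys) / (n - 1))
    continuous_ests.append(phi((theta - m) / s))

# The continuous-model estimator should vary noticeably less.
print(sd(binary_ests), sd(continuous_ests))
```

By the delta method the continuous-model estimator has roughly 20% smaller standard error in this configuration, which is the same order of gain as moving from the binary to the augmented binary analysis.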
For phase II trials with fewer than 50 patients, the number of parameters in the model could be reduced by making additional assumptions, for example, that the effect of the tumour size on the probability of non-shrinkage failure is the same in models (2) and (3). Alternatively, p-values and confidence intervals could be calculated by using a bootstrap procedure.
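A nonparametric bootstrap percentile interval for the composite success probability could be sketched as follows (all data here are simulated, and the simple dichotomised estimator stands in for the full augmented model fit, which would be re-fitted on each resample):

```python
import math
import random

random.seed(3)

# Hypothetical small trial arm: (log tumour ratio, non-shrinkage failure
# indicator) per patient.
n = 40
data = [(random.gauss(-0.05, 0.4), 1 if random.random() < 0.2 else 0)
        for _ in range(n)]
theta = math.log(0.7)  # >30% shrinkage required for success


def estimate(sample):
    """Success proportion under the composite definition."""
    return sum(1 for y, f in sample if y < theta and f == 0) / len(sample)


# Nonparametric bootstrap: resample patients with replacement, re-estimate,
# and take empirical percentiles as an approximate 95% CI.
boots = sorted(estimate([random.choice(data) for _ in range(n)])
               for _ in range(2000))
ci = (boots[49], boots[1949])  # approx. 2.5th and 97.5th percentiles
print(estimate(data), ci)
```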
A recently published paper proposes an alternative method for using continuous tumour information in randomized comparative phase II trials. Unlike in our paper and in other trials using RECIST as an outcome, the outcome of interest is overall survival. Historical data are used to estimate the effect of tumour size change on overall survival in the absence of treatment. It is assumed that the association between tumour shrinkage and overall survival in untreated patients is the same in the historical and current datasets and that any treatment effect on overall survival is captured by the effect of treatment on change in tumour size. These assumptions enable the difference in expected overall survival in the treated and untreated group implied by the observed difference in tumour shrinkage to be derived. Finally, the test statistic for the treatment effect on overall survival is the sum of two test statistics: one based on this difference in expected overall survival and one based on the observed difference in overall survival. The paper shows that the approach is promising, with required sample sizes considerably smaller than those for trials using binary tumour response. Note that the method neither explicitly takes into account treatment failures for non-shrinkage reasons, such as new lesions appearing, nor allows for interim tumour-size measurements to be made, although extensions to allow these may be possible.
One assumption made by the augmented binary approach is that the probability of non-shrinkage failure depends only on the most recently observed value of the tumour size. This is a strong assumption, because the tumour may change in size considerably between observations and thus cause the probability of non-shrinkage failure to change over time. We investigated, using simulations, the effects of deviating from this assumption (Supporting information) and found no evidence that the method was sensitive to this assumption. If there is, nevertheless, still a concern about the effect of a possible violation of this assumption, an alternative to the model we have proposed would be a shared parameter model. In this latter model, the tumour size process and non-shrinkage failure process depend on common unobserved random effects. This enables the hazard of failure to depend on the current underlying tumour size. In the same way, dropout for other reasons could also be allowed to depend on current underlying tumour size.
In many phase II cancer trials, patients are followed up until they progress, with tumour measurements taken at regular intervals. Tumour response is then analysed by considering the best observed response seen before progression. In this paper, we focus on response at a fixed timepoint (for example, when treatment ends). We believe this is a better choice than the best observed response for several reasons: (1) the response at a fixed timepoint has previously been shown to be more informative for overall survival than the best observed response ; (2) there is a high measurement error in assessing tumour size, and the best observed response is likely to be more susceptible to measurement error; (3) the response at a fixed timepoint will often take considerably less time to observe than the best observed response, so trials can be conducted more quickly. If patients are followed up to progression, then instead of analysing the best observed response, a more natural analysis would be to fit a model to the time-to-progression data and to assess tumour response at a fixed timepoint as a secondary analysis. If it is of interest to assess the best observed response, it would be possible to extend our methodology to do this. A similar model, with more timepoints, would be fitted to the continuous tumour data. It would be necessary to make simplifying assumptions to reduce the number of parameters in this model, for example, by imposing additional structure on the covariance matrix. For the non-shrinkage failure data, a time-to-event model such as a Cox model could be fitted. One would then simulate the best observed response by simulating patient data from the two models. A CI for this estimate could be found using a method such as bootstrapping. This would be extremely computationally intensive, and alternative quicker approaches are a topic of further research.
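The proposed simulation approach to the best observed response might be sketched as follows, under strong simplifying assumptions: a random-intercept (compound-symmetry) structure for the repeated log tumour ratios, one way of imposing additional structure on the covariance matrix, and a hypothetical per-visit failure probability that increases with tumour size. Every parameter value here is invented for illustration:

```python
import math
import random

random.seed(4)

# Hypothetical model: log tumour ratio at each of three visits is a shared
# patient-level random intercept plus independent visit-level noise.
mu = [-0.05, -0.10, -0.12]   # mean log ratio at each visit
tau, sigma = 0.3, 0.2        # between-patient and visit-level SDs
theta = math.log(0.7)        # best response counts as success if below this


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_patient():
    """Return the patient's best (lowest) observed log ratio,
    or None if a non-shrinkage failure occurs first."""
    b = random.gauss(0.0, tau)            # shared random intercept
    best = float("inf")
    for m in mu:
        y = b + random.gauss(m, sigma)    # log ratio at this visit
        best = min(best, y)
        # Hypothetical failure hazard that grows with current tumour size.
        if random.random() < logistic(-2.5 + 1.0 * y):
            return None
    return best

sims = [simulate_patient() for _ in range(10000)]
p_best = sum(1 for s in sims if s is not None and s < theta) / len(sims)
print(p_best)
```

Wrapping the whole simulation inside a bootstrap loop over re-fitted model parameters would give the CI described above, which is why the full procedure is computationally intensive.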
This work was funded by the Medical Research Council (grants G0800860 and U105260558). We thank the associate editor and two reviewers for their useful comments that helped improve the paper.
Supporting information may be found in the online version of this article.