## 1 Introduction

Phase II cancer trials are conducted to decide whether an experimental cancer treatment is worth testing in a large, costly phase III trial. Traditionally, cancer agents were cytotoxic, that is, designed to destroy tumour cells, and phase II cancer trials were single-arm trials that compared the anti-tumour activity of the experimental drug with historical control data [1]. For cytotoxic drugs, tumour shrinkage remains a widely used primary endpoint. This is because a cytotoxic agent would have to display some level of anti-tumour activity in order to have a positive effect on overall survival, the usual primary endpoint in phase III cancer trials. In recent times, cytostatic drugs have become increasingly common. Cytostatic drugs are molecularly targeted agents that are designed to improve survival through mechanisms other than directly destroying tumour cells, and so in phase II trials they are primarily assessed through progression-free survival [2]. However, whether the tumour increases in size is often an important secondary outcome, because survival is likely to be shortened if the agent fails to control tumour growth. Thus, in phase II trials of both cytotoxic and cytostatic drugs, change in the size of the tumour is an important outcome. Although phase II cancer trials were traditionally single-arm trials, randomised trials have become more common in recent times.

The most common way of assessing change in the size of the tumour is the Response Evaluation Criteria in Solid Tumors (RECIST) [3]. RECIST classifies each patient as having a complete response (CR), a partial response (PR), stable disease (SD) or progressive disease (PD). Generally, in trials of cytotoxic agents, CR and PR are classed as treatment successes, with SD and PD classed as treatment failures. The proportion of patients with a PR or CR is called the objective response rate (ORR). In trials of cytostatic agents, SD is also counted as a treatment success, and the proportion of patients who are successes is called the disease control rate (DCR). Both ORR and DCR are partly determined by a dichotomisation of the underlying continuous shrinkage in the total diameter of pre-specified target lesions (henceforth referred to as tumour size). To be classed as a success using ORR, tumour size must shrink by > 30%; to be classed as a success using DCR, tumour size must either shrink or increase by < 20%. Generally, using a dichotomised continuous variable loses statistical efficiency [4], and so the idea of directly using the tumour shrinkage itself as an endpoint has been proposed [5-7]. However, RECIST also classifies patients as PD (and hence as treatment failures in both ORR and DCR) if new tumour lesions are observed or if non-target lesions noticeably increase in size. Both of these events are associated with a poorer long-term survival prognosis, and using only the tumour shrinkage as the endpoint does not take into account patients who are treatment failures for these important reasons.

In addition, other possible outcomes may be of interest, such as toxicity. Because cytotoxic cancer treatments are toxic, patients in cancer trials often experience toxicities. At phase II, a new treatment would not be considered for a phase III trial if it caused a substantial risk of death or toxicity, even if it caused tumour shrinkage. Bryant and Day [8] argue that toxicity should be considered in phase II cancer trials and extend the design of Simon [1] to include toxicities. Toxicities are generally graded from 1 to 4 using the Common Terminology Criteria for Adverse Events (http://ctep.cancer.gov/protocolDevelopment/electronic_applications/ctc.htm), with grades 3 and 4 being considered serious and often resulting in treatment being discontinued. We henceforth refer to grades 3 and 4 toxicities as ‘toxicity’. To complicate matters, once a patient experiences progressive disease or suffers a toxicity, they are usually removed from the trial, and their tumour shrinkage is no longer measured.

To improve precision in the estimation of a treatment's ORR or DCR, we consider a composite ‘success’ endpoint determined by (1) the change in tumour size; (2) the appearance of new lesions or increase in non-target lesion size; and possibly also (3) toxicity and/or death. This success endpoint therefore has both continuous and binary components. To be classified as a treatment success, a patient must be a success for the binary component (e.g. not have new tumour lesions), and their continuous component (tumour shrinkage) must be greater than a pre-defined threshold, which will depend on whether the treatment is cytotoxic or cytostatic. The probability of treatment success is equivalent to the ORR if toxicity and death are not considered and the threshold is 30%; similarly, it is equivalent to the DCR if the threshold is −20%.
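The composite endpoint can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `Patient`, `is_success` and `success_rate`, the sign convention (shrinkage expressed as a proportion of baseline tumour size, with negative values meaning growth) and the example cohort are all hypothetical and are not taken from any trial.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds on proportional shrinkage (assumed convention: positive = shrinkage).
ORR_THRESHOLD = 0.30    # shrinkage of more than 30%
DCR_THRESHOLD = -0.20   # any shrinkage, or growth of less than 20%


@dataclass
class Patient:
    # Proportional shrinkage in tumour size; None if it was never measured,
    # e.g. because the patient left the trial after a binary failure.
    shrinkage: Optional[float]
    new_lesions: bool = False   # new lesions or non-target lesion progression
    toxicity: bool = False      # grade 3 or 4 toxicity, or death


def is_success(patient: Patient, threshold: float,
               count_toxicity: bool = False) -> bool:
    """Composite success: no binary failure AND shrinkage above the threshold."""
    if patient.new_lesions:
        return False
    if count_toxicity and patient.toxicity:
        return False
    if patient.shrinkage is None:
        raise ValueError("shrinkage is missing for a patient with no binary failure")
    return patient.shrinkage > threshold


def success_rate(patients: List[Patient], threshold: float,
                 count_toxicity: bool = False) -> float:
    """Observed proportion of composite successes in a cohort."""
    successes = sum(is_success(p, threshold, count_toxicity) for p in patients)
    return successes / len(patients)


# Hypothetical cohort of five patients (illustrative values only).
cohort = [
    Patient(0.45),                     # large shrinkage
    Patient(0.10),                     # modest shrinkage
    Patient(-0.35),                    # tumour grew by 35%
    Patient(None, new_lesions=True),   # failure; shrinkage not measured
    Patient(0.40, toxicity=True),      # shrinkage, but a serious toxicity
]

orr = success_rate(cohort, ORR_THRESHOLD)
dcr = success_rate(cohort, DCR_THRESHOLD)
orr_with_toxicity = success_rate(cohort, ORR_THRESHOLD, count_toxicity=True)
```

With `count_toxicity=False` the two calls reproduce the observed ORR and DCR of the cohort; setting it to `True` additionally treats toxicity as a failure. Checking the binary components first means that a patient whose shrinkage was never measured can still be classified as a failure.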

In trials comparing two treatments, and those comparing one or two treatments to historical data, it is of interest to estimate the probability of treatment success for each treatment and to provide some measure of uncertainty for each estimate, for example, a confidence interval (CI). In this paper, we propose a method that we call the augmented binary approach. This uses the actual value of the observed tumour shrinkage (henceforth referred to as continuous tumour shrinkage), rather than just whether it is above a threshold, in order to reduce the uncertainty in the estimate of success probability. Consequently, the width of the CI for the probability of success can be reduced. This also increases power to detect differences between arms or to test a hypothesis comparing the treatment to historical data. The idea of testing hypotheses about binary outcomes using continuous data was originally suggested by Suissa [9]. There, the binary endpoint was formed purely by a dichotomisation of a continuous variable, and each individual had an observed value for the continuous variable. The augmented binary method is a generalisation of Suissa's approach to a composite binary endpoint where complete tumour shrinkage data are not available for patients who are treatment failures for reasons (2) or (3) in the previous paragraph. The method leads to valid inference when the probability of dropout depends only on observed information (i.e. when data are missing at random (MAR)). Although trials are often not powered for a comparison of treatment success probabilities in the two arms, such a comparison is often made in randomised trials, and so we also consider the power of the augmented binary approach when this is carried out. We compare this power with that of a logistic regression approach and that of an approach proposed by Karrison *et al.* [6], which directly tests the continuous shrinkage using a nonparametric test.
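To give a feel for why using the continuous shrinkage can reduce uncertainty, the following Python sketch compares two estimators of the probability that shrinkage exceeds a threshold in the simpler, fully observed setting (every patient has a measured shrinkage and there is no composite component). It is not the augmented binary method itself. It assumes normally distributed shrinkage, and the function names and all parameter values (`n`, `mu`, `sigma`, `c`) are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def binary_estimate(y, c):
    """Proportion of patients whose shrinkage exceeds the threshold c."""
    return sum(v > c for v in y) / len(y)


def continuous_estimate(y, c):
    """Plug-in estimate of P(Y > c), assuming Y is normally distributed."""
    return 1.0 - NormalDist(mean(y), stdev(y)).cdf(c)


def compare(n=50, mu=0.2, sigma=0.3, c=0.3, n_sims=2000, seed=1):
    """Simulate n_sims trials of n patients and return the empirical standard
    deviation of each estimator across the simulated trials."""
    rng = random.Random(seed)
    binary, continuous = [], []
    for _ in range(n_sims):
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        binary.append(binary_estimate(y, c))
        continuous.append(continuous_estimate(y, c))
    return stdev(binary), stdev(continuous)


sd_binary, sd_continuous = compare()
```

Under these assumptions, `sd_continuous` should come out smaller than `sd_binary`, reflecting the efficiency lost by dichotomising. The gain depends on the normality assumption being reasonable; the sketch does not address missing shrinkage data or the binary failure components.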