Performance Measures in Dose‐Finding Experiments

In the first phase of pharmaceutical development, and assuming that the probability of a positive response increases with dose, the main statistical goal is to estimate a percentile of the dose–response function for a given target value $\Gamma$. We compare the maximum likelihood and centred isotonic regression estimators of the target dose, and we discuss several performance criteria to assess inferential precision, the amount of toxicity exposure and the trade-off between them for a set of exemplary adaptive designs. We compare these designs using graphical tools. Several scenarios are studied by simulation, including the use of several start-up rules, changes in the slope of the dose–toxicity function at the target dose and different theoretical models, such as logistic, normal and skew-normal distribution functions.


Introduction
The goal of dose-finding experiments is to estimate the dose having a targeted expected proportion of positive responses by sequentially assigning doses to cohorts of subjects near the target dose. Such designs are used in many fields. For example, Lagoda & Sonsino (2004) determine failure thresholds in electrical and material engineering; Zera (2017) describes sensory threshold estimation; Chang & Ying (2009) work in adaptive learning. Without loss of generality, we use the language of phase I clinical trials. To develop a new drug in the context of clinical trials, the first step aims to establish a dose for which the rate of toxicity reaches a pre-specified target; subsequent trials are more concerned with efficacy. Ting (2006) and Chevret (2006) provide a comprehensive list of issues to consider in dose-finding experiments in the framework of drug development. Each topic is discussed and analysed in a separate chapter by an expert.
Conflicting desires to avoid imprecise estimates and hazardous assignments while dose-finding raise questions about how various allocation procedures handle the trade-off between these goals. This paper attempts to consolidate a considerable literature on dose-finding designs by considering both goals simultaneously. One challenge in creating useful ways to examine the trade-off is that the two goals are measured on different scales, requiring one to consider how much precision a toxic event is worth. To mitigate this problem, the efficacy and toxicity criteria adopted in this paper are both expressed in terms of numbers of subjects, to assist in comparing designs and their settings simultaneously.
Simulation is the practical approach to studying the performance of dose-finding designs because the exact characteristics of estimators are typically unknown under non-linear models when dependencies are generated by adaptive allocation rules. A set of five papers (viz. Storer, 1989; Stylianou & Flournoy, 2002; Ivanova et al., 2003; Oron & Hoff, 2013; Diniz et al., 2019) illustrates the simulated comparisons existing in the literature, each with different goals. The first three papers study the performance of several designs under different parameter values, assuming responses follow a logistic curve. Storer (1989) presents an unfruitful search for confidence intervals of the target dose; he compares several methods for constructing confidence intervals as applied to several different dose allocation rules. Stylianou and Flournoy (2002) compare the performance of several estimators of the target dose based on one allocation rule, whereas Ivanova et al. (2003) compare different estimators following different designs, finding that the isotonic estimator was superior to others. Oron & Hoff (2013) and Diniz et al. (2019) also consider non-logistic response curves. The former studies the small-sample behaviour of Bayesian and random walk dose-finding, concluding that the latter are more robust. Finally, Diniz et al. (2019) study the loss of information that comes from discretising continuous variables under two Bayesian allocation procedures. The authors conclude that more than nine doses should be used to minimise the loss of information when discretising.
Our main goal is to provide performance measures attending to several criteria, with graphical tools, to evaluate the global performance of designs. Good performance requires accurate estimation of the target dose with a small number of total toxicities.

Design Goals
Dose-finding designs can be classified in a variety of ways, and we select several (described in Section 1.3) to represent this variety in our graphical comparisons. They are long memory (using all past observations for current dose allocations) and short memory (using only recent observations to curtail the drag of early misleading outcomes). Some designs aim to allocate subjects around the target dose to provide information about the dose-response function in the neighbourhood of the target. Others seek to allocate all subjects to the target dose and may use the final dose allocation to estimate the target. Regarding the latter approach, Azriel et al. (2011) proves that, for any adaptive allocation rule operating under any arbitrary monotonic dose-response function, the sequence of allocations cannot converge with probability 1 to the target dose; specifically, any sequence of treatments selected by estimating the target dose cannot converge almost surely to the target dose. Along the same line, in Fedorov et al. (2011) practitioners are warned against so-called best-intention designs (those that allocate the present patient to the dose believed to be the best), because allocations may converge to the wrong point with non-zero probability. Shen & O'Quigley (1996) provide sufficient conditions for convergence of allocations with the continual reassessment method (CRM, see Section 1.3.3). Cheung (2011) revisits these conditions in order to relax them.
In interval designs, the dose assignment is replicated when the estimated toxicity rate is within a prescribed interval around the target rate, whereas if the estimate falls outside this interval, the dose assigned will move toward it. Oron et al. (2011) proves that interval design allocations will converge to the target dose if it is the only dose within the inverse interval prescribed; they will oscillate between the two doses straddling the target dose if no dose levels are in the interval; and if there are multiple doses in the interval, allocations will converge to one of them, but not necessarily to the one closest to the target. A simulation study in Oron et al. (2011) shows that a small number of realistic scenarios meet the conditions of Shen & O'Quigley (1996) for convergence using the CRM or the interval cumulative cohort design (CCD, see Section 1.3.2). Designs to optimally estimate the parameters of quantal response curves are well known; a remarkable example is Ford et al. (1992), who obtain the optimal designs for the canonical version of a wide class of generalised linear models with a sole explanatory variable. With non-linear mean response functions, these optimal designs depend on unknown model parameters, which must be estimated to start the experiment and may then be updated using study data for subsequent assignments. When parameter estimates are updated as a study progresses, the design is called adaptive optimal. Other model-based designs also require parameter estimation to specify dose allocation probabilities. Parametric estimates of the target dose invariably involve estimating a slope parameter, and to do this well, optimal design theory prescribes the need for observations relatively far from the target. In Fedorov & Leonov (2014), constrained optimal designs that reduce the likelihood of assignments with high toxicity potential were shown to be useful in studies with relatively large sample sizes.
However, with small sample sizes, estimating slope parameters is problematic (as is made clear later), and although we study final parametric estimates that are functions of slope parameters, we restrict this paper to designs whose treatment-allocation procedures do not depend on slope parameter estimates.
Designs may also be constructed for dose-selection rather than estimation, a distinction made explicit with notation developed in the following section.

The Model
Assume patients arrive sequentially (or in cohorts). The probability of a toxic event is an increasing function of dose, and the toxicity function is defined as $F(x) = \Pr(\text{Toxicity} \mid \text{Dose} = x)$. The $n$th patient receives dose $X_n \in \{d_1 < d_2 < \cdots < d_L\}$. The $L$ doses are called permissible. The $n$th patient's response and indicators for the dose received are, respectively,
$$Y_n = \begin{cases} 1, & \text{toxicity;} \\ 0, & \text{non-toxicity;} \end{cases} \qquad \delta_{nj} = \begin{cases} 1, & X_n = d_j; \\ 0, & \text{otherwise;} \end{cases} \qquad j = 1, \dots, L.$$
Now the toxicity function can be written as $F(d_j) = \Pr(Y_n = 1 \mid \delta_{nj} = 1)$. The dose with prescribed target toxicity rate $\Gamma$ is $F^{-1}(\Gamma)$.
Observe that the toxicity rate at any permissible dose is unlikely to equal $\Gamma$. If the experiment's goal is dose selection, then one seeks to identify the dose in $\{d_1, \dots, d_L\}$ that is closest to $F^{-1}(\Gamma)$ (or maybe closest and less than); whereas for dose estimation, one wants a good estimator of $F^{-1}(\Gamma)$. We focus on the estimation of $F^{-1}(\Gamma)$, following recommendations given in Oron & Hoff (2013). Some designs require model assumptions to operate, and these are described when the design is described in Section 1.3. The designs' performance is studied by simulated experiments that are described in Section 3; simulations assume underlying logistic, normal and skew-normal models of the dose-response function. For estimation of the target dose, maximum likelihood estimates are calculated assuming a logistic model, while centred isotonic estimators (Oron & Flournoy, 2017) only assume an increasing response function, as described in Section 1.5.
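As a concrete illustration of these definitions, the following sketch evaluates the toxicity function $F(x)$ and the target dose $F^{-1}(\Gamma)$ under a logistic model; the parameter values are hypothetical and purely illustrative, not taken from the paper's simulations.

```python
import math

def logistic_F(x, alpha, beta):
    """Toxicity probability F(x) = Pr(toxicity | dose = x) under a logistic model."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def target_dose(gamma, alpha, beta):
    """Target dose F^{-1}(Gamma) = (logit(Gamma) - alpha) / beta."""
    return (math.log(gamma / (1.0 - gamma)) - alpha) / beta
```

By construction, evaluating `logistic_F` at `target_dose(gamma, ...)` recovers the target toxicity rate `gamma`.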

Designs and Some of Their Properties
The catalogue of phase I designs is quite large, and an exhaustive comparative study with all of them is unfeasible. In Sverdlov et al. (2014), a wide survey on novel adaptive designs for phase I trials is presented. This survey is based on classifying designs as algorithm based or parametric model based. Comparisons in Section 3 are limited to a few exemplary designs, but the performance measures and comparative procedures presented in this work extend to other designs. They are described briefly in this section.
All designs studied use previous allocations and patient outcomes to allocate the next patient or patient cohort. Let $\mathcal{F}_n = (Y_j, \delta_j : j \le n)$ be the accrued information up to the $n$th patient. Then an adaptive allocation rule induces, for each new patient $n$, a set of allocation probabilities $\pi_{nj} := P(\delta_{nj} = 1 \mid \mathcal{F}_{n-1})$, $j = 1, \dots, L$. When patients are allocated in cohorts, these definitions hold after substituting the index $n$ with an index for the cohorts.
For a faster read, one may just note the abbreviations given to each method hereafter and skip to the next section. Abbreviations used are summarised in Appendix A.

The k-in-a-row design
The k-in-a-row designs (kRDs) are Markov chain-based designs that were introduced to sensory studies, where they are widely used, by Wetherill (1963) and Wetherill & Levitt (1966). They go by a variety of names in the literature, including transformed and geometric rules. If a toxicity is observed at a permissible dose, the immediately lower dose is allocated next; otherwise, the same dose is administered until $k$ consecutive non-toxic responses are observed, in which case the next higher dose is assigned to the next individual. This rule allocates patients, asymptotically, unimodally around the target dose, with the most patients assigned to one of the doses straddling the target quantile $\Gamma^* = 1 - (1/2)^{1/k}$ of the toxicity function (Oron & Hoff, 2009). So the kRD with $k = 2$ is adopted to estimate the dose having toxicity rate $\Gamma = 0.293$. Oron and Hoff also show that dose-specific allocation probabilities using the kRD converge faster to their stationary values than those of other Markovian up-and-down designs with the same target toxicity rate.
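The kRD transition just described can be sketched as a single-step function. This is an illustrative reimplementation, not the authors' code; the 0-based dose indexing and the boundary behaviour (staying at the extreme doses) are simplifying assumptions.

```python
def k_in_a_row_next_dose(level, run, toxic, k, L):
    """One transition of the k-in-a-row design (kRD).

    level: current dose index (0 .. L-1)
    run:   current count of consecutive non-toxicities at this level
    toxic: last observed response (True = toxicity)
    Returns (next_level, next_run).
    """
    if toxic:
        return max(level - 1, 0), 0          # step down, reset the run
    run += 1
    if run >= k:
        return min(level + 1, L - 1), 0      # k consecutive non-toxicities: step up
    return level, run                        # stay at the same dose
```

With `k = 2` this rule targets the quantile $1 - (1/2)^{1/2} \approx 0.293$ noted above.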

The cumulative cohort design
The first interval dose-finding design is the CCD of Ivanova et al. (2007). Sample-size- and dose-dependent "no-change" intervals are determined to cluster allocations around the target using the theory of Markovian group up-and-down designs; all dose-specific data to date are used to make treatment decisions, rather than just those from the current cohort of subjects.
Let $R_{nj}$ denote the observed proportion of toxic responses among the $N_{nj}$ subjects that have been assigned to dose $d_j$ up through the $n$th patient. If $d_j$ is the last dose used: if $R_{nj} \le \Gamma - \Delta_{Lnj}$, the next subject is given dose $d_{j+1}$; if $R_{nj} \ge \Gamma + \Delta_{Unj}$, the next subject is given dose $d_{j-1}$; otherwise, the next subject is again given $d_j$. The no-change limits $\{\Delta_{Lnj}, \Delta_{Unj}\}$ are solutions to the equations $c_{Lnj} = (\Gamma - \Delta_{Lnj}) N_{nj}$ and $c_{Unj} = (\Gamma + \Delta_{Unj}) N_{nj}$, where $c_{Lnj}$ and $c_{Unj}$ satisfy the so-called balance equation, which equates the probabilities of increasing and decreasing the dose,
$$\Pr(W_{nj} \le c_{Lnj}) = \Pr(W_{nj} \ge c_{Unj}), \qquad (2)$$
and $W_{nj}$ is a binomial random variable with parameters $(N_{nj}, \Gamma)$. If the rule prescribes a dose outside the range $[d_1, d_L]$, the same dose is administered again. When there is no exact solution to (2) for the given $\Gamma$, a solution for a binomial random variable having a toxicity rate close to $\Gamma$ is used. The study by Liu et al. (2013) comparing six up-and-down designs shows that the CCD has the best overall performance. In Oron et al. (2011), the CCD is shown to meet the criteria for convergence more often than the continual reassessment method (which is discussed next).
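A numerical sketch of the CCD decision at the current dose is given below. The integer cutoff search merely approximates the balance equation on a grid (taking the pair of binomial tail probabilities closest to equal), so it should be read as an illustration of the idea rather than the published algorithm; the target rate and cohort size in the test are arbitrary.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial(n, p) probability mass at k."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def ccd_decision(tox, n, gamma):
    """Cumulative cohort design decision at the current dose.

    tox: toxicities observed among the n subjects treated at this dose so far.
    Returns +1 (escalate), 0 (stay) or -1 (de-escalate).
    The no-change cutoffs c_L <= gamma*n <= c_U are chosen to approximately
    balance P(W <= c_L) and P(W >= c_U) for W ~ Binomial(n, gamma).
    """
    best, c_lo, c_hi = float("inf"), 0, n
    for lo in range(0, int(gamma * n) + 1):
        for hi in range(int(gamma * n), n + 1):
            if lo >= hi:
                continue
            p_lo = sum(binom_pmf(k, n, gamma) for k in range(0, lo + 1))
            p_hi = sum(binom_pmf(k, n, gamma) for k in range(hi, n + 1))
            if abs(p_lo - p_hi) < best:
                best, c_lo, c_hi = abs(p_lo - p_hi), lo, hi
    if tox <= c_lo:
        return +1
    if tox >= c_hi:
        return -1
    return 0
```

For example, with $\Gamma = 0.3$ and ten subjects at the current dose, the search yields cutoffs 1 and 5, so zero or one toxicity escalates and five or more de-escalate.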

Continual reassessment method
The CRM was introduced in O'Quigley et al. (1990) as follows. Consider a dose-response skeleton model $\psi(x, a)$, where $x$ represents the dose and $a$ is an unknown parameter. The a priori distribution of $a$ is exponential with parameter one. Bayes' theorem is applied as data become available to obtain and successively update the a posteriori distribution of $a$. The mean of the a posteriori distribution is denoted by $\hat a_n$ when $n - 1$ patients have participated in the experiment. The first patient is assigned the lowest permissible dose. When $n - 1$ patients have been allocated, the next patient receives the dose $x \in \{d_1, \dots, d_L\}$ for which $|\psi(x, \hat a_n) - \Gamma|$ is minimal. This initial CRM model has been modified to improve its performance; Sverdlov et al. (2014) provide a selected review of the CRM modifications, but a deeper presentation can be found in the book by Cheung (2011), which is focused on CRM variations and their properties. In general, the skeleton is a strictly monotone sequence of prior toxicity probabilities for the $L$ permissible doses that initiates the CRM. Simulations in Section 3 use the step-up skeleton of James et al. (2016), in which toxicity probability increments are slow (i.e. 0.05) until the prior median is reached; beyond the prior median, increments are 0.1. James et al. (2016) showed that this skeleton performed well in a case study.
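A minimal numerical sketch of one CRM update follows. The five-dose skeleton is hypothetical, and the power working model $\psi(d_j, a) = p_j^{\,a}$ is one common CRM choice (not necessarily the exact family used in the original paper); the Exponential(1) prior is integrated on a grid rather than with the package used in the paper's simulations.

```python
import math

# Hypothetical skeleton of prior toxicity probabilities for five doses
SKELETON = [0.05, 0.10, 0.20, 0.30, 0.45]

def crm_next_dose(doses_given, outcomes, skeleton=SKELETON, gamma=0.25):
    """One CRM update with working model psi(d_j, a) = p_j ** a and an
    Exponential(1) prior on a; posterior mean computed by grid integration.

    doses_given: 0-based dose indices of past patients
    outcomes:    1 = toxicity, 0 = no toxicity
    Returns the 0-based index of the next dose.
    """
    def likelihood(a):
        lik = 1.0
        for j, y in zip(doses_given, outcomes):
            p = skeleton[j] ** a
            lik *= p if y else (1.0 - p)
        return lik

    grid = [i * 0.01 for i in range(1, 1000)]          # grid over a > 0
    w = [likelihood(a) * math.exp(-a) for a in grid]   # prior density exp(-a)
    a_hat = sum(a * wi for a, wi in zip(grid, w)) / sum(w)
    # assign the dose whose modelled toxicity is closest to the target
    return min(range(len(skeleton)), key=lambda j: abs(skeleton[j] ** a_hat - gamma))
```

With no data the posterior equals the prior ($\hat a \approx 1$), so the rule picks the skeleton value nearest the target; after observed toxicities the posterior shifts and the recommended dose drops.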
Even though Shen & O'Quigley (1996) conjectured conditions under which CRM allocations converge to the permissible dose closest to $F^{-1}(\Gamma)$, and Azriel et al. (2011) proved them to be true, Oron et al. (2011) showed the conditions to be extremely restrictive, and proved that CRM convergence cannot be guaranteed in general under monotonic sequences of toxicity probabilities.

Escalation with overdose control
The escalation with overdose control (EWOC) design was introduced in Babb et al. (1998) with the goal that allocations quickly approach $F^{-1}(\Gamma)$ under the constraint that the predicted proportion of patients treated above $F^{-1}(\Gamma)$ be equal to or less than a bound $\alpha$. EWOC is driven by Bayesian updates like the CRM, but is constrained to decrease the exposure to highly toxic doses. Following Babb et al. (1998), a two-parameter logistic model for the dose-response curve drives the design, and $\alpha = 0.25$. When the $(n-1)$th patient has been allocated, the posterior cumulative distribution function of the target, say $\Pi_n$, is obtained, and the next patient is allocated the dose $x : \Pi_n(x) = \alpha$. So the $n$th patient receives the 25th percentile of the posterior distribution of the target.
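The overdose-control step itself reduces to taking a posterior quantile. The sketch below assumes the posterior of the target dose is available only through samples (the real design derives $\Pi_n$ from a two-parameter logistic posterior) and, as an additional simplifying assumption, rounds down to the nearest permissible dose.

```python
def ewoc_next_dose(posterior_samples_of_target, permissible, alpha=0.25):
    """Allocate the alpha-quantile of the (sampled) posterior of the target
    dose, rounded down to a permissible dose -- the overdose-control step."""
    s = sorted(posterior_samples_of_target)
    q = s[int(alpha * (len(s) - 1))]          # empirical alpha-quantile
    return max([d for d in permissible if d <= q], default=permissible[0])
```

With $\alpha = 0.25$, the patient is thus treated below the posterior median of the target, which is what curbs exposure to doses above $F^{-1}(\Gamma)$.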
Comparative simulation studies in Babb et al. (1998) show that EWOC and CRM have similar estimation efficiency, but EWOC treats fewer patients at doses greater than $F^{-1}(\Gamma)$.

Benchmark designs
The benchmark designs used in this manuscript are common designs that are well characterised for non-sequential experiments but are unethical in the dose-finding environment. They provide comparative standards in some simulations.
Uniform design (UND): a dose is randomly selected for each subject, so with $L$ permissible doses, each is applied with probability $1/L$. The allocation probabilities do not change; there is no learning.
D-optimal design (OD): doses are selected to maximise the determinant of the information matrix. The D-optimal design for logistic dose-responses was obtained by Wetherill (1963) and Minkin (1987). It prescribes equal numbers of subjects be assigned to the 0.176 and 0.824 quantiles of the logistic function, that is, to $(\pm 1.5434 - \alpha)/\beta$. Later, Ford et al. (1992) found optimal designs for a variety of dose-response models, and Biedermann et al. (2006) addressed restricted design spaces.
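The two support points of this benchmark follow directly from the quantile formula. A small sketch (with illustrative parameter values in the usage below):

```python
def d_optimal_points(alpha, beta):
    """Support points of the two-point D-optimal design for the logistic
    model logit F(x) = alpha + beta * x: the doses at the 0.176 and 0.824
    quantiles, i.e. x = (-/+ 1.5434 - alpha) / beta, with equal weights."""
    return ((-1.5434 - alpha) / beta, (1.5434 - alpha) / beta)
```

For instance, with $\alpha = -6$ and $\beta = 1$ the design places half the subjects at dose 4.4566 and half at 7.5434.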

Start-Up Allocation Rules
Used with procedures that require the next dose to be near the current dose, start-up rules reduce the number of patients allocated to inefficacious doses by accelerating dose assignments into a neighbourhood of $F^{-1}(\Gamma)$. Procedures that assign the next subject as close as possible to the predicted target dose are unreliable early in the study, and for these, start-up rules mitigate the likelihood of early allocations to highly toxic doses (Cheung, 2011). We consider the following start-up rules:
Escalate until first toxicity (ETk): starting at $d_1$, ETk assigns cohorts of size $k$ to escalating doses until one or more toxicities appear. ET1 was studied in Ivanova et al. (2003).
k-in-a-row (kR): starting at $d_1$, kR escalates doses only after $k$ consecutive non-toxicities are observed and stops when a toxicity appears. This contrasts with the primary k-in-a-row design (kRD), which moves to the next lowest dose when a toxicity appears and continues on from there. The 1R and ET1 rules coincide.
3+3: patients are treated in cohorts of size 3, starting at the lowest dose $d_1$ and escalating without skipping any permissible doses. At any dose, if no toxicities are observed among the first three patients, the next three are allocated to the next higher permissible dose; if one toxicity out of three is observed, the next three patients are treated at the same dose; otherwise, the next three patients are treated at the next lower permissible dose. When toxicities are observed in more than one third of the subjects at a dose, the 3+3 stops, and the largest permissible dose that has a toxicity frequency no greater than one third is chosen as the initial dose for the primary design.
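The cohort rule just described can be sketched as a single decision function. This is a literal, illustrative reading of the escalation rule only; a full implementation would also track cumulative per-dose toxicity counts to apply the stopping condition and pick the initial dose for the primary design.

```python
def three_plus_three_step(tox, level, L):
    """Next dose index after one cohort of three at dose index `level`
    (0-based), following the cohort rule: 0/3 toxicities escalate,
    1/3 stay, otherwise de-escalate.

    tox: number of toxicities (0..3) in the latest cohort of three.
    """
    if tox == 0:
        return min(level + 1, L - 1)   # escalate without skipping doses
    if tox == 1:
        return level                   # treat three more at the same dose
    return max(level - 1, 0)           # two or more: step down
```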
The 3+3 plays a crucial role in phase I clinical trials because reviews confirm that more than 90% of trials use it or some variation (Hansen et al., 2014). However, many criticisms have been raised about 3+3 designs. Note that the procedure is independent of the target. Reiner et al. (1999) completely enumerate the possible allocation sequences for the 3+3 and compute exact estimates of the target dose. For a variety of scenarios, they find that the probability of selecting an incorrect dose is excessively high. Ivanova (2006) finds that toxicity rates of the selected dose vary from 0.17 to 0.26, far from the target 0.33. But because the 3+3 rule remains common in practice, we evaluate it as a start-up rule.

Estimators
The performance of estimators of the target $F^{-1}(\Gamma)$ is our primary interest. We denote an arbitrary estimate of the target by $\tilde F^{-1}(\Gamma)$. We consider both the maximum likelihood estimator and the centred isotonic regression estimator of $F^{-1}(\Gamma)$, abbreviated simply as MLE and CIRE, respectively. We introduce these estimators here and briefly discuss some problems that may appear when calculating them. The seriousness of these problems varies considerably by design and by the slope of $F$, as we demonstrate in Section 3. The MLE of the target dose is denoted by $\hat F^{-1}(\Gamma)$. It depends on an assumed parametric model. Although simulations assume different underlying models drive the designs, maximum likelihood estimates are always found assuming the logistic model:
$$\hat F^{-1}(\Gamma) = \frac{\operatorname{logit}(\Gamma) - \hat\alpha}{\hat\beta},$$
where $\operatorname{logit}(\Gamma) = \log[\Gamma/(1 - \Gamma)]$, and $\hat\alpha$ and $\hat\beta$ are the MLEs of $\alpha$ and $\beta$, respectively. Silvapulle (1981) provides necessary conditions for $\hat\alpha$ and $\hat\beta$ to exist. The literature contains kludges to "fix" the problem of non-existent MLEs, but we take the position that failure of Silvapulle's conditions to hold implies failure of the experiment: either more observations are needed or a new study is needed with a different set of permissible doses. That is, algorithmic fixes should not be applied to force the existence of MLEs; important information is contained in this failure. Hence, we consider the failure of Silvapulle's conditions to hold to be an important design performance criterion. In addition, even when Silvapulle's conditions hold, common algorithms used to obtain the MLE may fail (Heinze & Ploner, 2003).
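To make the estimator concrete, here is a minimal Newton-Raphson sketch of the logistic MLE and the resulting target-dose estimate. This is our own illustrative code, not the simulation code used in the paper; note that when the data are separated, so that Silvapulle's conditions fail and no MLE exists, the iteration diverges and the sketch reports no estimate.

```python
import math

def logistic_mle(doses, ys, iters=200, tol=1e-10):
    """Newton-Raphson MLE for logit Pr(Y=1|x) = alpha + beta * x.
    Returns (alpha_hat, beta_hat), or None when the iteration degenerates,
    e.g. under separated configurations where no MLE exists."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(doses, ys):
            z = max(-30.0, min(30.0, a + b * x))   # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            g0 += y - p                            # gradient of log-likelihood
            g1 += (y - p) * x
            w = p * (1.0 - p)                      # negative Hessian terms
            h00 += w; h01 += w * x; h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:                            # information degenerates
            return None
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da; b += db
        if abs(da) + abs(db) < tol:
            return a, b
    return None

def mle_target_dose(doses, ys, gamma):
    """MLE of F^{-1}(gamma) = (logit(gamma) - alpha_hat) / beta_hat."""
    fit = logistic_mle(doses, ys)
    if fit is None:
        return None
    a, b = fit
    return (math.log(gamma / (1 - gamma)) - a) / b
```

For example, responses `[0,0,1,1]` at escalating doses are completely separated, so the sketch returns `None`, mirroring the non-existence of the MLE discussed above.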
Isotonic estimation only requires that the dose-response function increases monotonically, which implies $F(d_1) < \cdots < F(d_L)$. Let $N_{ni}$ and $T_{ni}$ denote the number of subjects and the number of toxic responses observed on treatment $d_i$ up through the $n$th patient, respectively; and let $R_{ni} := T_{ni}/N_{ni}$ denote the observed proportion of toxicities. The isotonic regressors, denoted $\{\check F(d_i)\}$, minimise the weighted least squares expression
$$\sum_{i=1}^{L} N_{ni}\,[R_{ni} - \check F(d_i)]^2 \quad \text{subject to } \check F(d_1) \le \cdots \le \check F(d_L).$$
The Pool Adjacent Violators Algorithm (PAVA) produces the isotonic estimates at the permissible doses (Robertson et al., 1988). Traditionally, these estimates are connected over the range $[d_1, d_L]$ via a step function. Centred isotonic regression modifies PAVA by forcing strict monotonicity and using the allocation frequencies, $\{N_{ni}\}$, to locate increases in the dose-response function estimate between permissible doses (for details, see Oron & Flournoy, 2017). We use $\check F(x)$ to denote the CIRE of $F(x)$ going forward. $\check F^{-1}(\Gamma)$ will exist by inverse estimation methods unless $(\check F(d_1), \check F(d_L))$ fails to span $\Gamma$. In this general setting, there is no closed-form expression for the mean or variance of the CIRE.
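For concreteness, a compact PAVA sketch follows. It is our own illustrative implementation of the standard pooling algorithm; the centring step of CIR, which relocates the pooled estimates between permissible doses, is omitted.

```python
def pava(props, weights):
    """Pool Adjacent Violators: weighted isotonic fit to observed toxicity
    proportions at increasing doses.

    props:   observed proportions R_i at each dose
    weights: sample sizes N_i used as weights
    Returns the isotonic fit, expanded back to one value per dose.
    """
    vals = list(props); w = list(weights)
    idx = [[i] for i in range(len(props))]      # blocks of pooled doses
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:               # violation: pool the blocks
            tot = w[i] + w[i + 1]
            vals[i] = (w[i] * vals[i] + w[i + 1] * vals[i + 1]) / tot
            w[i] = tot
            idx[i] += idx[i + 1]
            del vals[i + 1], w[i + 1], idx[i + 1]
            i = max(i - 1, 0)                   # re-check backwards
        else:
            i += 1
    out = [0.0] * len(props)
    for v, block in zip(vals, idx):
        for j in block:
            out[j] = v
    return out
```

For example, observed proportions `[0.1, 0.3, 0.2, 0.6]` with equal weights pool the middle violation into `[0.1, 0.25, 0.25, 0.6]`.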
The rest of the paper is organised as follows. Section 2 introduces performance measures (in addition to estimators' existence rates) that are used to evaluate the designs. In Section 3, general simulation procedures are presented along with an analysis of the results.

The root-mean-square error
The mean squared error is a standard measure of the inferential performance of an estimator. Its square root is useful because it has the same units of measurement as the estimator and its standard deviation. However, because many acceptable doses may have toxicity rates close to the target $\Gamma$, any estimator having $F[\tilde F^{-1}(\Gamma)]$ close to $\Gamma$ may be acceptable. Therefore, we evaluate the root-mean-square error of the target dose estimator, $\mathrm{RMSE} = \sqrt{E\{[\tilde F^{-1}(\Gamma) - F^{-1}(\Gamma)]^2\}}$.

D-optimal efficiency
The design points $\{d_1, \dots, d_L\}$ and the allocation frequencies $\{N_{n1}/n, \dots, N_{nL}/n : \sum N_{ni} = n\}$ together comprise a design $\xi_n$, which is said to be optimal, and denoted by $\xi^*$, when it maximises (or minimises) a criterion function. A fixed design $\xi_n$ under likelihood $L_n \equiv L_n(\theta)$ has information $M(\xi_n) \equiv E[(\partial \log L_n/\partial \theta)(\partial \log L_n/\partial \theta^{T}) \mid \xi_n]$. In this paper, classical optimality criteria are expressed in terms of a concave (convex) function $\Phi[M(\xi_n)]$ of the information matrix. Our benchmark D-optimal design (Section 1.3.5) maximises the determinant of $M(\xi_n)$.
Using the information of the D-optimal design for reference, a measure of $\xi_n$'s inferential quality is $\mathrm{eff}(\xi_n) \equiv \Phi[M(\xi_n)]/\Phi[M(\xi^*)]$, with $\mathrm{eff}(\xi_n) \in [0, 1]$, and the closer to 1 the better. Because the D-optimality criterion is positively homogeneous [i.e. for any information matrix $M$, $\Phi(\delta M) = \delta\,\Phi(M)$], $\mathrm{eff}(\xi_n)$ is the fraction of the sample size needed using $\xi^*$ to get the same inferential precision as using $\xi_n$. In other words, the percentage increase in the number of patients needed using $\xi_n$ instead of $\xi^*$ to obtain the same criterion value is the percentage loss of information, $[1 - \mathrm{eff}(\xi_n)] \times 100$. When a random rule is used to allocate $n$ patients to doses $d_1, \dots, d_L$ sequentially or in cohorts, the allocation proportions $\{N_{n1}/n, \dots, N_{nL}/n\}$, the information $M(\xi_n)$ and the efficiency $\mathrm{eff}(\xi_n)$ are stochastic processes. We also define the relative mean a posteriori efficiency as $\overline{\mathrm{eff}}_n := E[\mathrm{eff}(\xi_n)]$.
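Under the logistic model this efficiency reduces to a ratio of determinants of $2 \times 2$ information matrices. A sketch follows, taking $\Phi = \det^{1/2}$ so that the criterion is positively homogeneous, and using the two-point D-optimal reference design at $(\pm 1.5434 - \alpha)/\beta$; the parameter values in the test are illustrative.

```python
import math

def logistic_info(doses, weights, alpha, beta):
    """Per-subject Fisher information matrix entries for the logistic model
    at design points `doses` with allocation proportions `weights`."""
    m00 = m01 = m11 = 0.0
    for x, w in zip(doses, weights):
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        v = w * p * (1.0 - p)
        m00 += v; m01 += v * x; m11 += v * x * x
    return m00, m01, m11

def d_efficiency(doses, weights, alpha, beta):
    """D-efficiency of a design relative to the two-point D-optimal design
    at x = (+-1.5434 - alpha)/beta, computed as (det M / det M*)**(1/2)."""
    m00, m01, m11 = logistic_info(doses, weights, alpha, beta)
    det = m00 * m11 - m01 * m01
    xs = [(-1.5434 - alpha) / beta, (1.5434 - alpha) / beta]
    o00, o01, o11 = logistic_info(xs, [0.5, 0.5], alpha, beta)
    det_opt = o00 * o11 - o01 * o01
    return math.sqrt(det / det_opt)
```

The D-optimal design itself has efficiency 1; any other allocation, such as a uniform design over several doses, has efficiency strictly below 1.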

Ethical Measures
The overall proportion of toxicities $p_n$ and its standard deviation $\sigma(p_n)$
From an ethical point of view, minimising the frequency of toxic responses is a natural criterion. We evaluate the overall proportion of observed toxicities through the $n$th patient, $p_n \equiv n^{-1}\sum_{i=1}^{n} Y_i$, and its standard deviation, $\sigma(p_n)$, as ethical criteria. The standard deviation measures the precision of a design with respect to its expected toxicity rate, a measure of ethical reliability so to speak.

Allocation measures $g_n$
An allocation measure $g_n$ summarises the closeness of the allocations to the target dose. Criteria based on an allocation measure are surrogate ethical criteria because they imply that a toxicity occurring far from the target is worse than a toxicity occurring at the target. The use of allocation measures as ethical criteria asserts that a design with patients closer to the target is ethically preferable even if it produces exactly the same overall observed toxicity rate as a design with more diverse dosing.
A reasonable measure of this kind increases with the distance of a patient's allocation from the target (on the toxicity rate scale) while additionally penalising overdosing. Although popular in the literature, we do not use allocation measures as performance criteria because they presume that the dose at which a toxicity occurs is more important than the event itself.

Comparing Designs And Target Dose Estimators
Now, tables and graphs summarising the simulation data are provided to contrast the designs with respect to their inferential and ethical performance.

The Simulation Setup
The graphs and tables presented in Section 3.2 use summary statistics obtained by simulating patients' responses under the primary designs (i.e. 2RD, CCD, CRM and EWOC) as described in Section 1.3. A complete simulated clinical trial is called a run. The dose-response functions used to simulate subjects' responses to treatment are called generating models or generating functions (in contrast to models integral to allocation decisions and models assumed for analyses). Four parametric generating models of $F(x)$ were considered: the logistic and normal cumulative distribution functions, and the skew-normal with shape parameters $-3$ and $3$.
For each primary design and each generating model, the slope of $F$ at the target dose is indexed by a parameter $\theta$ through the model's density (for the normal model via $\phi(x)$, the density function of the N(0,1), and for the skew-normal model via $g(x)$, the density function of the skew-normal with parameters 0, 1 and the corresponding shape). To illustrate the range of scenarios considered, Figure 1 and Table 1 display a sample of logistic generating functions and give their toxicity probabilities for selected values of $\theta$ at $d_6$, $d_7$ and $d_8$, together with the corresponding $\beta$ values. The set of permissible doses is arbitrarily set to $\{1, 2, 3, \dots, 13\}$.
Following simulations with random slopes, this general setup is modified in two ways: (i) in Section 3.3, separate simulations are carried out for fixed values of Â to show the effect of the slope at the target dose on a design's performance, and (ii) in Section 3.4, the effect of using a start-up rule prior to the primary design is studied.
In this paper, simulations of both the CRM and EWOC were performed with the R package bcrm, which is thoroughly documented in Sweeting et al. (2013). R code to implement both the kRD and the CCD can be obtained from the corresponding author upon request.

A Snapshot Comparison of Centred Isotonic Regression and Maximum Likelihood Estimates of the Target Dose
Detailed design comparisons for ethical and inferential criteria are given in Section 3.3. Preliminary to these comparisons, Figure 2 motivates the study of CIR estimators in addition to the more traditional ML estimators of the target dose (simulated with a logistic generating function with random slopes). Figure 2a displays density estimates of CIREs and MLEs for the moderately large sample size $n = 75$ for each of the primary designs: 2RD, CRM, EWOC and CCD. Both estimators cluster around the target dose, but the CIREs are less variable than the MLEs for all designs. MLEs are also more skewed. These observations hold for $n = 50$ and $25$ (not shown), but the curves spread and become increasingly rough.
As discussed previously, not all runs produce valid estimators, and it should be kept in mind that only valid estimators appear in Figure 2a and subsequent tables and graphs. Figure 2b shows the sample size required to obtain valid estimators. Observe that valid CIREs are much more likely than valid MLEs. The CIRE can be obtained with 30 patients almost 100% of the time for any design and generating model (data not shown). By $n = 50$, valid CIREs can be produced from all designs with probability close to 100%, whereas obtaining valid MLEs is still problematic at $n = 100$. Consequently, statistics formed with MLEs in Figures 3 and 4 are made with fewer observations than statistics formed from CIREs.
The 2RD is most likely to produce valid MLEs, followed by the CRM. We elaborate on components required to obtain estimators in Section 3.4.

Compound Ethical and Inferential Criteria
As stated in the introduction, a good design would provide accurate estimates with a small number of total toxicities. However, inferential and ethical criteria usually compete, so that improving one of them entails a worsening of the other. Approaches proposed for balancing opposing criteria involve different strategies for allocating patients to doses. We present a graphical evaluation of the trade-offs between ethics and inference in a design.

Toxicity-efficacy trade-off headlines from inspecting Figures 3 and 4
Figure 3 displays cumulative average values of two competing criteria that are obtained from the simulated experiments with random slopes as the sample size n evolves from 10 to 100 for

Figure 3. Trade-off between ethics and estimation for primary designs as the sample size increases. Colours distinguish blocks of sample sizes: 10-25 (black), 25-50 (red), 50-75 (blue) and 75-100 (green). Symbols distinguish the designs: 2RD, CRM, EWOC and CCD.
each of the primary designs introduced in Section 1.3. The x-axes in all subplots are values of a single ethical criterion, namely, the overall toxicity rate. In Figure 3a, the subplots' y-axes are RMSE values for the CIRE (first row) and MLE (second row); plots in each column are produced under different generating dose-response models; the y-axes in Figure 3b are squared bias and variance, and the y-axis in Figure 3c is bias. Designs in Figure 3 are represented by different symbols. Colours change to distinguish batches of sample sizes: black, red, blue and green for $n \in \{10\text{-}25\}$, $\{26\text{-}50\}$, $\{51\text{-}75\}$ and $\{76\text{-}100\}$, respectively. The designs' performance with respect to the two chosen criteria is contrasted by examining graphs of the trajectories of these pairs of values as $n$ evolves.
The box plots in Figure 4 use the same data as were used to produce the averages in Figure 3 but restricted to n 2 f25; 50; 75g. These box plots provide marginal summaries at fixed sample sizes.
(a) The relative performance of designs does not depend on the generating dose-response model: looking across the columns in Figure 3a, one sees similar patterns for all generating

Figure 4. Box plots for the toxicity rate (left), RMSE of CIRE (middle) and RMSE of MLE (right) assuming a logistic generating function for designs 2RD (red), CRM (blue), EWOC (cyan) and CCD (green). The logistic model is also assumed for ML estimation. [Colour figure can be viewed at wileyonlinelibrary.com]
models. Figure 3a demonstrates that the comparative behaviour of designs can be satisfactorily evaluated with only one generating model. Trajectories under the skew-normal (±3) models in Figure 3a have relatively low toxicity rates with relatively high RMSEs; RMSE(MLE)s start higher than RMSE(CIRE)s, so for small sample sizes, RMSE(MLE)s tend to be slightly larger than RMSE(CIRE)s.
(b) Toxicity rates, in general, increase as the RMSEs decrease. The evolution of each design generally moves from 'left and up' to 'down and right' in this figure. Except for a little early wiggle with the CCD, sequences do not change direction (left-right-left or down-up-down) as sample sizes increase. 2RD's sequences cross the CRM's and CCD's. This suggests a dependence between the precision of target dose estimates and the frequency of toxic events. With large sample sizes (blue and green), however, toxicity rates are less associated with RMSEs, particularly for the CRM.
(c) Toxicity rates and inferential precision change little after sample sizes reach 50 for all the designs: in Figure 3a, green and blue points (marking statistics for n > 50) become increasingly compressed for all the designs, almost overlapping in the blue colour. This suggests that expected toxicity rates converge and little is to be gained in inferential precision (as measured by RMSE) as the sample size continues to increase above 50.
(d) The bias contributes negligibly to the RMSE: Figure 3b displays separate sequences for the squared bias and the variance, again as functions of the observed toxicity rate. After n = 25, sequences of squared biases lie close to zero, whereas sequences of variances asymptote at much higher values. This demonstrates that MSEs are dominated by the variances after n = 25, regardless of the estimator, CIR or ML.
Zooming in, Figure 3c graphs bias versus observed toxicity rates. CIRE bias is negative for all the designs, increasing toward zero with sample size and toxicity rate. Observe that CIRE bias moves within a very short range of values for the CRM. MLE bias behaves similarly, except that with the 2RD it becomes slightly positive after n = 25.
(e) The 2RD has the best inferential performance, while EWOC settles on the smallest expected toxicity: observe in the subplots of Figure 3a that, for any batch of patients, the 2RD symbol (triangle) is closer to the horizontal axis (i.e. zero RMSE) than any other symbol. For instance, RMSEs for sample sizes 26-50 (red) in the 2RD trajectory are contained within CRM's RMSE values for sample sizes 51-75 (blue).
In the top row of Figure 3a, one sees EWOC's complete sequences with RMSE(MLE) falling to the left of other designs', and in the bottom row with RMSE(CIRE), most of EWOC's sequences are to the left of other designs' demonstrating almost uniformly less expected toxicity. However, EWOC's RMSE values are compromised relative to other designs. Note, for example, that with large samples (green and blue), EWOC's RMSE values are exceeded only by the CCD's and are markedly higher than the 2RD's and CRM's. Furthermore, RMSE values in EWOC sequences for moderate sample sizes (red) barely overlap if at all with 2RD's and CRM's. There is considerable overlap among the sequences for small sample sizes (black).
This trade-off is also seen in the box plots of Figure 4 for n = 25, 50 and 75. Note that these box plots are each constructed from 10 000 points; so if 1% of the data were beyond the whiskers, there would be 100 such points. Following Diniz et al. (2019), solid bars are placed at the target ±0.1 across the panel of box plots of toxicity rates. At n = 25, EWOC's toxicity rate box is almost entirely below this line, and while it rises with sample size, it remains well below the target. For n = 75, EWOC's box is below 0.25 while 2RD's box is above 0.3. However, 2RD's and CRM's boxes and whiskers remain within the ±0.1 bounds for n = 25, 50 and 75, while CCD's box comes within bounds at n = 50; but CCD's whiskers extend outside even at n = 75. Asymptotically, 2RD places more subjects on the two doses straddling the target than on any other doses (Oron & Hoff, 2009).
In contrast, for RMSEs, CCD's boxes are highest, followed by EWOC's. 2RD's and CRM's boxes overlap substantially, with 2RD's generally being lower. EWOC's RMSE boxes at n = 50 come to be centred at magnitudes comparable to 2RD's and CRM's at n = 25, but with slightly larger spread. Figure 5a plots the percentage of additional patients needed to have the same inferential precision as the benchmark D-optimal design [(1/e_n − 1) × 100, abbreviated PAD] versus toxicity rates for n = 10, ..., 100. Figure 5b plots the percentage of additional patients needed to have the same inferential precision as the 2RD (abbreviated PA2R) versus toxicity rates for n = 10, ..., 100. Symbols and colours in Figure 5 are the same as in Figure 3. These plots are obtained under logistic generating and analysis models with random slopes. The patterns observed in Figure 5 are similar to those seen using other generating functions (not shown).
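The PAD measure admits a one-line sketch. Taking e_n to be a design's efficiency relative to the benchmark, a design needs n/e_n patients to match what the benchmark achieves with n, hence (1/e_n − 1) × 100 percent more. The function name is ours and the formula is reconstructed from the surrounding definitions, so treat this as an illustration rather than the paper's exact computation:

```python
def percent_additional_patients(efficiency):
    """Percentage of additional patients a design needs to match the
    precision of a benchmark, given its relative efficiency e_n in (0, 1].

    A design with efficiency e_n needs n / e_n patients to match what the
    benchmark achieves with n, i.e. (1/e_n - 1) * 100 percent more.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must lie in (0, 1]")
    return (1.0 / efficiency - 1.0) * 100.0

# A design running at two-thirds of the benchmark's efficiency needs
# about 50% more patients to match the benchmark's precision:
pad = percent_additional_patients(2 / 3)
```

PA2R follows the same recipe with the 2RD's precision in place of the D-optimal benchmark's.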

Designs' efficiency headlines from inspecting Figure 5
(a) The PAD is lowest using the 2RD; PADs generally decrease as the toxicity rate increases. Figure 5a shows general reductions in PADs with increasing toxicity rates and sample sizes. Reductions are steady with 2RDs and CCDs. PADs decrease dramatically over the small sample size segment (black) of 2RD experiments. This decrement continues progressively more slowly in subsequent batches of patients. 2RD attains efficiencies with small sample sizes (black) that are higher than the highest efficiencies obtained by the other designs. For all batches, CRM's PADs are smaller than either CCD's or EWOC's.
(b) Strangely, in the larger segments (green and/or blue), CRM's and EWOC's PAD values slightly increase as the number of patients increases. The range of variation under CRM, both in PAD and in toxicity rate, is very small. Surprisingly, trajectories under the CRM and EWOC designs have a turning point after which their PAD increases (albeit slightly) with increasing sample size; in other words, their MLEs become less accurate. This effect can be seen in the enlargement within Figure 5a.
(c) CCD, EWOC and CRM require substantially more subjects in order to match the estimation precision of 2RDs. Because the 2RD was found to outperform other designs with respect to the D-optimality criterion, and because D-optimal designs are unattainable benchmarks, other designs are compared to the 2RD in Figure 5b. Substantially more subjects are required in CRM and EWOC experiments to match the precision of 2RD target dose estimates. EWOC's PADs increase with toxicity rate and sample size. CRM's PADs increase with sample size but are virtually invariant to toxicity rates. CCD's PADs slightly decrease over small sample sizes but are fairly invariant to sample sizes and toxicity rates for n > 25 (red, green and blue segments). Figure 5b shows that EWOC requires around 45-55% more patients than 2RD to achieve the same precision in estimates of the target dose.
The additional patients required by the CRM rocket from 10% for very small sample sizes to over 30% for n ≥ 50. CCD requires about 40% more patients for n ≥ 20.
(d) The uniform design is clearly outperformed by the other designs according to both ethical and inferential criteria. For the sake of clarity, UND's trajectories were not included in Figures 3 to 5. UND's mean toxicity rate stays at 40% from the first patient, as expected. This high toxicity rate, very far from the target, does not buy better inference: the UND's RMSE and loss of information are outperformed by the other designs. The expected loss of information remains at the same level, 90%, from the first patient. The RMSE decreases slowly from 0.76 with 50 patients to 0.73 with 100, which is outperformed by the 2RD and CRM. Observe that the main difference between the UND and the other designs is that any dose can be applied to any patient, with no learning from previous responses and allocations; this explains its poor toxicity performance.

Influence of the Dose-Response Slope on Designs' Performance
As can be seen in Table 1, varying the slope of the logistic function covers a wide range of toxicity response patterns, including extreme situations. For example, θ = 0.01 corresponds to a dose-response curve that is almost flat, whereas θ > 30 reflects a large change in toxicity rates between doses 6 and 8. In simulations reported in this section, the slopes of the generating logistic functions are not randomly chosen. Instead, separate simulations are performed at several fixed slope values.
In Figures 7 and 6, the upper x-axis β = tan(θ)/[Γ(1 − Γ)] has tick marks corresponding to θ = (0.01 + 2i)°, i = 1, ..., 23, on the lower x-axis. In Figure 7, the y-axis represents the first value n for which at least 90% of runs provide a valid MLE (Figure 7a) and CIRE (Figure 7b). The horizontal dotted line at n = 20 is plotted for reference. In Figure 6, two different values are represented on the y-axis: along the top of the graph are filled circles, which are the proportion of runs for which Silvapulle's conditions hold before the end of an experiment with n = 100. The lines are the expected first patient for which Silvapulle's conditions are fulfilled (when they hold by n = 100). Table 2 summarises the mean (standard deviation) sample size required for Silvapulle's conditions to hold for selected slopes. Figures 8 and 9 show plots of the RMSE(CIRE) and the toxicity rate, respectively, as functions of n. Subfigures are produced from simulations with θ = 1.01, 4.01 and 8.01, selected to represent small, moderate and large slopes at F⁻¹(Γ) (Table 1). Figure 9 includes ±standard-deviation bars at n = 10, 25 and 45, plotted with a slight offset to avoid overlapping bars.

Headlines about the effect of slope on inference and toxicity
(a) To have a high probability of successfully calculating the MLE for small and large slopes, the sample size must be large. Figure 6 shows that, as the slope grows, the expected number of patients required on-study before MLEs exist grows, as does its standard deviation. In addition, the proportion of runs with the MLE existing when the experiment ends at n = 100 decreases as θ increases. This was expected because, as the slope grows, the variation in toxicity rates between consecutive doses also grows; observed non-toxicities and toxicities then tend to separate on the dose scale, and Silvapulle's conditions are more likely to fail.
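In the one-covariate logistic setting used here, Silvapulle's existence condition reduces to an overlap requirement: the dose range of observed toxicities must overlap, in its interior, with the dose range of observed non-toxicities. A sketch of that check (the function name is ours; this is the 1-D specialisation, not the general result):

```python
def silvapulle_holds(tox_doses, nontox_doses):
    """Check a 1-D version of Silvapulle's condition for existence of the
    two-parameter logistic MLE: the dose ranges of observed toxicities
    and non-toxicities must overlap in their interiors.  If all
    toxicities sit at or above all non-toxicities (or vice versa), the
    data are separated and the MLE does not exist.
    """
    if not tox_doses or not nontox_doses:
        return False  # need both kinds of response on-study
    return (max(tox_doses) > min(nontox_doses)
            and max(nontox_doses) > min(tox_doses))

# Separated data (all toxicities at higher doses): condition fails.
print(silvapulle_holds([6, 7, 8], [1, 2, 3]))   # False
# Interlaced responses: condition holds.
print(silvapulle_holds([4, 6], [3, 5, 7]))      # True
```

The strict inequalities also reject quasi-complete separation, where the two response types touch only at a single boundary dose.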
Even if the MLE exists, it may not be obtained because the optimisation algorithm prescribes a dose out of range or fails. Several alternative procedures for estimating the logistic regression parameters have been considered in the literature and implemented in R (R Core Team, 2019), for instance, glm2, based on Marschner (2011), the penalised likelihood algorithm (Firth, 1993), exact logistic regression (Mehta & Patel, 1995) and Markov chain Monte Carlo methods for Bayesian analysis (Hamra et al., 2013). We carried out a simulation study to find out whether any of them was more effective than R's basic glm command in obtaining the MLE. Even though these procedures mitigate the convergence problem, they do not solve it completely. In fact, we confirmed that it is not unusual to obtain estimates outside the prescribed dose range when the slope of the logistic generating function is small, regardless of the procedure used for estimation. We adopted R's glm2 command for maximising the likelihood. The failure of estimates to exist provides a critical warning to investigators that follow-up studies should be performed with more doses in the range where separation, or near separation, occurred. Figure 7a shows the combined effect of all three types of failure. Observe that for very small and very large slopes, a large number of patients is required to have at least a 90% probability of obtaining a valid MLE. For valid MLEs, the smallest number of subjects on-study is required with the 2RD for any slope, followed by the CRM; both are quite competitive when 2 < θ < 12.
(b) Calculating the CIRE is only problematic when slopes are very small. In Figure 7b, observe that using the UND, CRM and 2RD, the CIRE is obtained with probability at least 90% with fewer than 20 patients for slopes θ ≥ 2. Besides this, one expects to obtain the CIRE with the CRM and UND with fewer patients than with the other designs.
This is because the only condition required to successfully calculate the CIRE is that the isotonic regressors (Section 1.5) cross the target toxicity rate. Only designs that escalate dose assignments without restrictions are likely to fulfil this condition early in the experiment.
(c) The RMSE(CIRE) decreases slowly after 30 patients on-study, but at different levels depending on the slope and the design. Two horizontal lines at y = 1 and y = 2 are plotted in Figure 8 for reference. Designs' performance rankings change with the slope. For small slopes, EWOC has the poorest inferential performance, while the UND has the best, followed by the 2RD, which outperforms the CRM near the 10th patient. For intermediate slopes, EWOC's performance improves, with RMSE(CIRE) values similar to CCD's. Other differences between the UND, 2RD and CRM are negligible as the number of patients grows. Finally, for large slopes, the UND's performance deteriorates, while the 2RD and CRM have similar trajectories, outperforming the EWOC and CCD.
(d) Variability in the overall observed toxicity rate decreases as the slope grows, for each n.
The expected toxicity rate changes little from 30 patients upwards, but its value depends on the design. 2RD has an expected toxicity rate slightly over target, but it is least variable and hence most predictable. Figure 9 includes a horizontal line at the target toxicity rate for reference. As expected, the UND produces the highest overall toxicity rate. Also, regardless of slope, variability in overall toxicity rates slowly decreases as the number of patients grows, and the ordering of designs remains constant, except for the 2RD, whose toxicity rate is less than that of the CRM for small samples and becomes larger at about n = 30. Observed overall toxicity rates with the EWOC and CCD remain far below the target, whereas the 2RD and CRM stabilise slightly over and under the target, respectively. Observe that the 2RD has the smallest variability for all examined slopes, indicating that it is more likely to perform as expected than the others. This coincides with the comments on Figure 3.
(e) Penalising for overdosing does not change performance rankings. The g_n measure behaves in the same way as the p_n measure, so they provide the same ranking of designs (figures not shown).
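The CIRE's existence condition and its interpolation step can be illustrated with plain (uncentred) isotonic regression: pool adjacent violators to obtain monotone dose-wise toxicity rates, then interpolate the dose at which they cross the target rate. The sketch below is a simplified stand-in for the centred estimator of Section 1.5, with names of our own choosing:

```python
def isotonic_target_dose(doses, toxicities, patients, target):
    """Estimate the target dose via plain isotonic regression.

    doses      : ordered dose levels
    toxicities : number of toxic responses observed at each dose
    patients   : number of patients treated at each dose
    target     : target toxicity rate (Gamma)

    Fits non-decreasing toxicity rates with the pool-adjacent-violators
    algorithm (PAVA), then linearly interpolates the dose at which the
    fitted rates cross the target.  Returns None when the fitted curve
    never reaches the target, i.e. the existence condition fails.
    """
    vals = [t / m for t, m in zip(toxicities, patients)]
    wts = list(patients)
    sizes = [1] * len(vals)          # doses spanned by each pooled block
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:    # adjacent violator: pool the blocks
            w = wts[i] + wts[i + 1]
            vals[i] = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / w
            wts[i] = w
            sizes[i] += sizes[i + 1]
            del vals[i + 1], wts[i + 1], sizes[i + 1]
            i = max(i - 1, 0)        # pooling may expose an earlier violator
        else:
            i += 1
    fitted = [v for v, s in zip(vals, sizes) for _ in range(s)]
    for j, dose in enumerate(doses):
        if fitted[j] == target:
            return dose
        if j + 1 < len(doses) and fitted[j] < target <= fitted[j + 1]:
            frac = (target - fitted[j]) / (fitted[j + 1] - fitted[j])
            return dose + frac * (doses[j + 1] - dose)
    return None

# Observed rates 0, 0.5, 0.25, 1.0: PAVA pools the middle pair to 0.375,
# and the fit crosses a target of 0.3 between the first two doses.
d = isotonic_target_dose([1, 2, 3, 4], [0, 2, 1, 4], [4, 4, 4, 4], 0.3)
```

When the fitted curve stays entirely below the target, the function returns None, mirroring the cases counted as CIRE failures in Figure 7b.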

The Influence of a Start-Up Rule
In this section, we evaluate the use of several start-up rules before the primary design is applied. In addition, the interaction between the start-up rule and the slope is considered. Let DSU denote the dose where the start-up rule finishes, and let NSU denote the number of patients required to complete the start-up rule. The performance of the start-up rules is evaluated according to the following criteria: (a) the closer DSU is to the target dose the better, because this mitigates the use of inefficacious doses; DSU close to the target provides a good starting dose for the kRD and CCD and improves the prior for the CRM and EWOC; (b) the smaller NSU is the better; and (c) a start-up rule that is more likely than another to provide valid MLE and CIRE calculations is better. Figure 10 plots mean values of DSU (black) and NSU (red) with ±standard-deviation bars by slope for each start-up rule described in Section 1.4. The same scale is used for both DSU and NSU on the y-axis. The blue horizontal line marks the target dose as a reference for the DSU values. Table 3 provides a numerical summary of these graphs and includes two start-up rules not plotted. Figure 11 contains five plots, one for each start-up rule described in Section 1.4 plus one where a start-up rule is not applied. For each, interval bounds (mean ± standard deviation) for the dose assigned to the nth patient are plotted for n = 1, ..., 30 for each primary design.
(b) The 1R/ET1 and 2R have the best properties. With the ET1 and 2R start-up rules, NSU has the smallest expected value and variability, regardless of the slope. Moreover, the ET1 and 2R start-up rules have expected DSUs closest to the target. The variability of DSU is quite similar for all the start-up rules. It may be significant that ET1 and 2R allocate doses to patients one at a time, whereas the other designs use cohorts of two or three patients.
(c) The frequency of obtaining a valid CIRE improves using the ET1 or 2R start-up rules with the CCD and EWOC primary designs. When using the ET1 or 2R as start-up rules, the CCD and EWOC require fewer patients to obtain the CIRE with 90% probability. But start-up rules do not improve this probability for the CRM and 2RD primary designs (figure not shown).
(d) The CCD benefits from the use of a start-up rule by reducing the assignment of patients to very low doses. Figure 11a was produced without a start-up procedure. In this graph, primary designs start, with no variability, at the first dose, except the CRM, which always starts at the second dose. Allocation intervals become centred around the target dose (pink horizontal line) fastest with the CRM, but the CRM's interval width has the highest early variability. Allocation intervals with the 2RD and EWOC primary designs centre around the target with a few more patients but also with smaller variability. Observe also that, as outlined before, the ET1 and 2R start-up designs produce similar allocation behaviour in the primary designs; with the 3+3 start-up, more patients are required for intervals to bound the target. Using the CCD without a start-up, 15 patients are required to centre around the target dose; the use of the ET1 or 2R start-up rules saves 10 patients. When we repeat the study of Sections 3.1 and 3.2 with the ET1 as start-up rule, the performance of the CCD improves, but its rankings relative to the other designs do not change. Nevertheless, trials with fewer than 15 patients seem unusual in practice; and with at least 15 patients, regardless of the start-up chosen, the allocation intervals have similar ranges for all primary designs.
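To make the DSU/NSU bookkeeping concrete, here is a sketch of a single-patient escalation start-up in the spirit of ET1: escalate one dose level per patient until the first toxicity, then hand DSU and NSU to the primary design. All names are ours, and the response generator is supplied by the caller, so the sketch stays design-agnostic:

```python
def escalation_startup(num_doses, observe_toxicity):
    """Run a hypothetical one-at-a-time escalation start-up rule.

    num_doses        : number of dose levels, indexed 1..num_doses
    observe_toxicity : callable dose_level -> bool (True = toxicity)

    Returns (DSU, NSU): the dose where the start-up finishes and the
    number of patients it used.  Escalates one level per patient and
    stops at the first observed toxicity (or at the top dose).
    """
    patients = 0
    for dose in range(1, num_doses + 1):
        patients += 1
        if observe_toxicity(dose):
            return dose, patients      # stop at the first toxicity
    return num_doses, patients         # reached the top without toxicity

# Deterministic illustration: toxicity first occurs at dose level 4,
# so the rule uses 4 patients and finishes at dose 4.
dsu, nsu = escalation_startup(8, lambda d: d >= 4)
```

In a simulation, `observe_toxicity` would draw a Bernoulli response from the generating dose-response curve; cohort-based rules such as 3+3 differ only in treating several patients per level before deciding.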

Discussion
This paper demonstrates useful assessment measures and graphical tools for evaluating the global performance of dose-finding designs. Good performance requires accurate estimation with a small number of total toxicities. Specifically, the trade-off between ethics and inference is contrasted among several designs selected from the literature to have a variety of dose allocation features: (i) long and short memory, (ii) parametric-model and Markov-chain-theory driven and (iii) with and without dose changes restricted to nearest neighbours. Assuming a logistic model for inference, comparisons were found to be substantially invariant to the generating dose-response model, be it logistic, normal or skew-normal (±3). Summary simulation statistics were created by randomly selecting the slope of the generating model at the target dose, while slope-specific statistics reveal substantial differences in expected variability of target dose estimates among designs. Finally, a start-up rule is found useful to warm up long-memory designs.
Centred isotonic regression estimators of the target dose are shown to be much more likely to be valid and consistently less variable than maximum likelihood estimators. Reliance on MLEs is not advised with a small number of patients or when high dose-response slopes are anticipated, because in those settings they are difficult to obtain. MLEs may be useful when they can be obtained (particularly for forming confidence intervals), but attention to the performance of statistical software packages is needed.
Several global conclusions regarding the designs are as follows: (a) among the primary designs, the 2RD and CRM are quite competitive and outperform the other designs. The 2RD has slightly higher toxicity rates with larger sample sizes, while EWOC's toxicity rates are consistently lowest, but these could be equalised by adding a small random hold to 2RD dose changes and/or by increasing the EWOC parameter that makes allocations conservative. The 2RD's toxicity rate is least variable, and hence this design is most predictable, ethically speaking. It also has better inferential properties than the other designs studied, especially using the CIRE; (b) the start-up rules kR, k = 1, 2, warm up the trial nicely, placing fewer patients on inefficacious doses and bringing allocations quickly to doses close to the target dose. When the study in Section 3.1 is repeated with the 1R and 2R as start-up rules, global comparative conclusions change very slightly, even though the CCD's properties improve substantially; (c) consider sample sizes of 15-50. In this case, a start-up rule will not be helpful (except when using the CCD), and the CIRE will be available with a high probability, especially when using the 2RD or CRM. Using more than 50 patients does not provide substantial gains, from either the ethical or the inferential point of view; and (d) the performance measures RMSE, p_n and e_n provide complementary information. Mapping RMSEs to the toxicity scale provided no additional illumination, and overall toxicity rates p_n are easier to interpret than other toxicity measures such as g_n. It is useful to examine the efficiency measure e_n together with toxicity rates because both can be interpreted in terms of numbers of patients.