Dose-finding designs using a novel quasi-continuous endpoint for multiple toxicities

The aim of a phase I oncology trial is to identify a dose with an acceptable safety profile. Most phase I designs use the dose-limiting toxicity, a binary endpoint, to assess the unacceptable level of toxicity. The dose-limiting toxicity might be incomplete for investigating molecularly targeted therapies as much useful toxicity information is discarded. In this work, we propose a quasi-continuous toxicity score, the total toxicity profile (TTP), to measure quantitatively and comprehensively the overall severity of multiple toxicities. We define the TTP as the Euclidean norm of the weights of toxicities experienced by a patient, where the weights reflect the relative clinical importance of each grade and toxicity type. We propose a dose-finding design, the quasi-likelihood continual reassessment method (CRM), incorporating the TTP score into the CRM, with a logistic model for the dose–toxicity relationship in a frequentist framework. Using simulations, we compared our design with three existing designs for quasi-continuous toxicity score (the Bayesian quasi-CRM with an empiric model and two nonparametric designs), all using the TTP score, under eight different scenarios. All designs using the TTP score to identify the recommended dose had good performance characteristics for most scenarios, with good overdosing control. For a sample size of 36, the percentage of correct selection for the quasi-likelihood CRM ranged from 80% to 90%, with similar results for the quasi-CRM design. These designs with TTP score present an appealing alternative to the conventional dose-finding designs, especially in the context of molecularly targeted agents.


Introduction
Phase I oncology trials aim to evaluate the safety of a new agent and to identify a recommended dose (RD) for further evaluation. Toxicity remains the primary endpoint of such trials. Although toxicity is intrinsically multidimensional, with different body systems possibly involved and multiple toxicity events possibly observed in a single patient, conventional phase I trial designs are usually based on a binary endpoint, the dose-limiting toxicity (DLT). The common practice of reducing the toxicity assessment to a binary indicator is perhaps an oversimplification of the multidimensional clinical reality: (i) the relative severity of different DLTs is ignored; (ii) the moderate toxicity events below the DLT threshold are ignored; and (iii) the multiplicity of the events is neglected, as only the worst grade is used to define the DLT. The landscape of oncology drug development has recently changed with the emergence of targeted agents available for testing. These new drugs appear more likely to induce multiple moderate toxicities than DLTs [1,2]. In this context, the use of DLT criteria to guide the dose escalation and determine the RD deserves re-examination.

Definition of the total toxicity profile.
We assessed toxicity for each system and graded it using a standard reference, such as the common terminology criteria for adverse events (common toxicity criteria of the National Cancer Institute) [13]. Multiple toxicity observations are available for each patient, but for the purpose of defining the DLT, we take the dichotomized approach: most dose-finding protocols define as DLT the occurrence of grades 3 and 4 nonhematological and grade 4 hematological toxicities.
The same grade assessed on two different organs can correspond to a very different clinical importance, for example, a grade 3 nephrotoxicity and a grade 3 fatigue. As suggested by Bekele and Thall [4], we define weights reflecting the relative clinical importance of each grade and type of toxicity, rather than directly using the raw common terminology criteria for adverse events grades. Let $W = \{w_{l,j}\}$ be the matrix of weights defined for each grade $j$, $j \in \{0, \ldots, 4\}$, of each toxicity type $l$, $l \in \{1, \ldots, L\}$:

$$W = \begin{pmatrix} w_{1,0} & w_{1,1} & \cdots & w_{1,4} \\ \vdots & \vdots & & \vdots \\ w_{L,0} & w_{L,1} & \cdots & w_{L,4} \end{pmatrix}$$
We elicit the matrix of numerical weights in close collaboration with physicians before the initiation of the trial. We limit the space of toxicity grades to the range $[0, 4]$ because the occurrence of a grade 5, corresponding to death, would require another kind of decision rule, with direct interaction between the safety committee and the physicians to interpret the event and decide whether to continue the trial. The matrix is filled with zeros for the grades that do not exist in the grading system (e.g., headache grade 4).
To capture the relative clinical importance of various toxicity profiles (TPs) in this multidimensional space, we propose a flexible toxicity endpoint, hereafter called the TTP, as an alternative to the total toxicity burden (TTB) proposed by Bekele and Thall [4]. Because the same arithmetic sum of toxicity weights, which defines the TTB, can correspond to clinical situations with different toxicity burdens, we compute the TTP as the Euclidean norm of the weights, which measures the length of the toxicity vector in a multidimensional space.
For patient $i$ treated at dose level $d_k$, the Euclidean norm $\mathrm{TTP}_{i,k}$ is defined by

$$\mathrm{TTP}_{i,k} = \sqrt{\sum_{l=1}^{L} \sum_{j=0}^{4} w_{l,j}^{2}\, \mathbb{1}(G_{i,k,l} = j)}$$

where, for patient $i$ treated at dose $d_k$, $\mathbb{1}(G_{i,k,l} = j) = 1$ if the maximum grade of the observed toxicity of type $l$ is equal to $j$, and 0 otherwise. As there are a limited number of combinations of weights, the resulting score is, by construction, a quasi-continuous variable.
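The definition above can be sketched in a few lines of Python. This is only an illustration: the weight matrix `WEIGHTS` and the function name `ttp` are hypothetical and are not the values or names used in the trial design.

```python
import math

# Hypothetical weight matrix: one row per toxicity type, one column per grade 0..4.
# These values are illustrative only.
WEIGHTS = [
    [0.0, 0.5, 0.75, 1.0, 1.5],  # toxicity type 1
    [0.0, 0.5, 0.75, 1.0, 1.5],  # toxicity type 2
    [0.0, 0.0, 0.0, 0.5, 1.0],   # toxicity type 3
]

def ttp(max_grades, weights=WEIGHTS):
    """Euclidean norm of the weights of the maximum grade observed per toxicity type."""
    if len(max_grades) != len(weights):
        raise ValueError("one maximum grade per toxicity type is required")
    return math.sqrt(sum(weights[l][g] ** 2 for l, g in enumerate(max_grades)))

single = ttp([3, 0, 0])   # one grade 3 toxicity of type 1
double = ttp([2, 2, 0])   # two grade 2 toxicities of types 1 and 2
print(single, double)
```

Because only the maximum grade per toxicity type enters the norm, a patient contributes exactly one weight per row of the matrix.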

Elicitation of the target total toxicity profile, $\theta$.
The use of this new toxicity endpoint leads us to define the target TTP, $\theta$, that is, a TTP value judged acceptable by the clinicians. We interviewed expert clinicians to decide, for a set of various hypothetical cohorts, whether to escalate, repeat, or de-escalate the dose for the next cohort. A classification of the cohorts is consistent when the decisions to escalate correspond to the lowest TTP values, followed by the decisions to repeat and then by the decisions to de-escalate, which correspond to the highest TTP values. Once such a classification is obtained, we compute the target TTP $\theta$ as the mean of the TTPs of the cohorts associated with a decision to repeat the dose.

Normalization.
To work with a toxicity measure ranging from 0 to 1 that can be modeled using a quasi-Bernoulli likelihood [14], we subsequently normalize the TTP (nTTP) as

$$\mathrm{nTTP}_{i,k} = \frac{\mathrm{TTP}_{i,k}}{\nu}$$

where $\nu$ is a normalization constant, $\nu = \mathrm{TTP}_{\max} + \varepsilon$, with $\mathrm{TTP}_{\max}$ equal to the score of the most severe possible TP that we can compute from the matrix of the selected toxicities, and $\varepsilon$ a small positive value. As the matrix of weights must be defined a priori and consequently includes only a selection of expected toxicity items, $\varepsilon$ allows us to integrate adaptively, during the trial, a severe toxicity that was not selected and weighted a priori.
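We model the dose-toxicity relationship with a logistic model,

$$\psi(x_k, b) = E(\mathrm{nTTP} \mid x_k) = \frac{\exp(a + b\,x_k)}{1 + \exp(a + b\,x_k)} \qquad (3)$$

where $a \in \mathbb{R}$, $b \in \mathbb{R}^{+}$, and $x_k$ is the pseudo-dose associated with dose level $d_k$. We obtain the pseudo-doses $x_k$ by backward substitution, which ensures that the dose-toxicity model (3) provides an exact fit to the initial guesses of toxicity scores, reflecting the clinicians' prior belief. For our method, we fixed the intercept $a$, leading to a one-parameter logistic model in which only $b$ is estimated, as recommended by Chevret et al. for the CRM with the binary endpoint [15].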
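The backward substitution can be illustrated with a short sketch. It assumes a fixed intercept of 3 (the value reported in the design parameters), an initial slope `b0 = 1`, and a made-up vector of initial guesses `skeleton`; the initial slope, the skeleton values, and the function names are assumptions introduced here.

```python
import math

def logistic_mean(x, b, a=3.0):
    """One-parameter logistic model with fixed intercept a and slope b > 0."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def pseudo_doses(skeleton, a=3.0, b0=1.0):
    """Backward substitution: solve logistic_mean(x_k, b0, a) = skeleton[k] for x_k."""
    return [(math.log(p / (1.0 - p)) - a) / b0 for p in skeleton]

# Hypothetical initial guesses of the mean nTTP at six dose levels.
skeleton = [0.05, 0.12, 0.28, 0.45, 0.60, 0.72]
doses = pseudo_doses(skeleton)

# By construction, the model evaluated at b0 reproduces the initial guesses.
refit = [logistic_mean(x, b=1.0) for x in doses]
print(doses)
print(refit)
```

The pseudo-doses are labels on the model scale rather than physical doses, so negative values are expected with a positive intercept.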

Variance specification.
Considering the nTTP as a fractional event leads to the quasi-Bernoulli likelihood function as an approximation [6,14], in which a Bernoulli variance is embedded by default:

$$\mathrm{Var}(\mathrm{nTTP}) = \mathrm{nTTP}\,(1 - \mathrm{nTTP})$$

An alternative variance function is the Wedderburn variance [16],

$$\mathrm{Var}(\mathrm{nTTP}) = \mathrm{nTTP}^{2}\,(1 - \mathrm{nTTP})^{2}$$

We may also explicitly derive the analytical variance function by considering the structure of the nTTP score as a function of several toxicity random variables, using the delta method (equation given in Appendix A).
As discussed later, a model based on this analytical variance function leads to an alternative approach that is outside the scope of the current research work.

Likelihood.
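As done by Yuan et al. for their toxicity score [6], we modeled the nTTP using the quasi-Bernoulli likelihood, which can accommodate fractional events. Assume that the last patient $i$, treated at dose level $d_k$, corresponding to the pseudo-dose $x_k$, experiences a toxicity measured by $\mathrm{nTTP}_i$. Writing $\psi(x_k, b)$ for the mean nTTP given by the logistic model at pseudo-dose $x_k$, its contribution to the quasi-Bernoulli likelihood will be

$$\psi(x_k, b)^{\mathrm{nTTP}_i}\,\{1 - \psi(x_k, b)\}^{1 - \mathrm{nTTP}_i} \qquad (4)$$

After the observation of the $i$th patient, we update the quasi-Bernoulli likelihood $L_i$ by

$$L_i(b) = L_{i-1}(b)\;\psi(x_k, b)^{\mathrm{nTTP}_i}\,\{1 - \psi(x_k, b)\}^{1 - \mathrm{nTTP}_i} \qquad (5)$$

We also extended the quasi-likelihood function using the Wedderburn variance function [16].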
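A minimal sketch of the quasi-likelihood estimation of the slope follows. It assumes the logistic model with fixed intercept 3 and maximizes the quasi-Bernoulli log-likelihood by golden-section search, a generic optimizer chosen here for simplicity and not necessarily the one used in the study. The pseudo-doses, the scores, the search bounds, and all function names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def quasi_loglik(b, xs, scores, a=3.0):
    """Quasi-Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p) with logit(p) = a + b*x."""
    total = 0.0
    for x, y in zip(xs, scores):
        eta = a + b * x
        total += y * eta - softplus(eta)
    return total

def fit_slope(xs, scores, a=3.0, lo=1e-3, hi=10.0, tol=1e-8):
    """Maximize the (concave) quasi-log-likelihood over b in [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if quasi_loglik(c, xs, scores, a) > quasi_loglik(d, xs, scores, a):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2.0

# Hypothetical pseudo-doses received by five patients and their observed nTTP scores.
xs = [-5.94, -5.94, -4.99, -4.99, -3.94]
scores = [0.0, 0.2, 0.0, 0.3, 0.4]
b_hat = fit_slope(xs, scores)
print(b_hat)
```

The log-likelihood is linear in the score and concave in the slope, which is why fractional responses pose no difficulty and a one-dimensional search is enough.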

Decision rules.
In the dose escalation phase of the trial, the dose allocated to the next cohort of patients is the dose associated with an estimated nTTP, $\widehat{\mathrm{nTTP}}$, closest to the normalized target, $\theta$. At the end of the trial, we define the RD as the dose that would be allocated to the next cohort, that is, the dose with an estimated nTTP closest to $\theta$.
As we work in a frequentist framework, likelihood-based estimates are not available before any toxicity has been observed. For this reason, the quasi-likelihood CRM (QLCRM) includes two stages: an initial escalation stage [12] applies as long as no toxicity event is observed (all nTTPs equal 0). The second stage of the design, which includes a model-based estimation of the dose-toxicity relationship, starts as soon as some heterogeneity in the toxicity response is observed. One advantage of using the TTP score, rather than the binary DLT endpoint, is that the model can be estimated from the first observation of toxicity, even a mild one, which greatly shortens the first stage of this design.

The quasi-CRM (QCRM) [6] differs from the QLCRM in that it uses a Bayesian framework and models the dose-toxicity relationship using an empiric model, $E(\mathrm{nTTP}) = \alpha_k^{b}$, where the $\alpha_k$ are defined by the working model, with $\alpha_k \in [0, 1]$, and $b$ is the parameter to be estimated.
We can obtain the contribution to the model of the observation of the last patient by formulas (4) and (5).
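Suppose that $g_0(b)$ is the prior distribution for the parameter $b$. After the observation of the $i$th patient, we update the quasi-posterior density for $b$ by

$$g_i(b) = \frac{L_i(b)\,g_0(b)}{\int L_i(u)\,g_0(u)\,du}$$

where $L_i$ is the quasi-Bernoulli likelihood after $i$ patients. The decision rules are similar to those of the QLCRM.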
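The quasi-posterior update can be sketched numerically. The example assumes the empiric model with an exponential prior of mean 1 on $b$, integrates by the trapezoidal rule on a truncated grid, and plugs the posterior mean of $b$ into the model to pick the next dose. The plug-in step, the truncation at 20, the skeleton, and the function names are assumptions made for illustration.

```python
import math

def quasi_lik(b, alphas, scores):
    """Quasi-Bernoulli likelihood under the empiric model E(nTTP) = alpha ** b."""
    lik = 1.0
    for alpha, y in zip(alphas, scores):
        p = alpha ** b
        lik *= (p ** y) * ((1.0 - p) ** (1.0 - y))
    return lik

def posterior_mean_b(alphas, scores, upper=20.0, steps=4000):
    """Posterior mean of b under an exponential(1) prior, by the trapezoidal rule on [0, upper]."""
    h = upper / steps
    num = den = 0.0
    for i in range(steps + 1):
        b = i * h
        w = 0.5 if i in (0, steps) else 1.0
        dens = quasi_lik(b, alphas, scores) * math.exp(-b)  # likelihood times prior
        num += w * b * dens
        den += w * dens
    return num / den

def closest_dose(skeleton, b_hat, target):
    """Index of the dose whose estimated nTTP is closest to the target."""
    est = [alpha ** b_hat for alpha in skeleton]
    return min(range(len(est)), key=lambda k: abs(est[k] - target))

# Hypothetical working model and data: alphas holds the skeleton value of each patient's dose.
skeleton = [0.05, 0.12, 0.28, 0.45, 0.60, 0.72]
alphas = [0.05, 0.05, 0.05, 0.12, 0.12, 0.12]
scores = [0.0, 0.0, 0.1, 0.0, 0.2, 0.1]
b_mean = posterior_mean_b(alphas, scores)
print(b_mean, closest_dose(skeleton, b_mean, target=0.28))
```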

Extended isotonic design.
The design proposed by Chen et al. is based on an isotonic regression for both the dose escalation phase and the identification of the RD [3].
Assume that the last patient $i$ is treated at dose $d_k$. We compute the mean score, $\overline{\mathrm{nTTP}}_k$, at each dose level. If a nondecreasing dose-toxicity relationship is observed, the score estimated by the isotonic regression, $\widehat{\mathrm{nTTP}}_k$, corresponds to the mean score ($\widehat{\mathrm{nTTP}}_k = \overline{\mathrm{nTTP}}_k$). When monotonicity is violated, we use a pool-adjacent-violators algorithm to estimate the scores. If the dose $k+1$ has not been explored yet, its estimated score, $\widehat{\mathrm{nTTP}}_{k+1}$, is equal to the score estimated at the highest dose explored. We define the dose allocation rule, also used for the identification of the RD, as follows:
- If $\widehat{\mathrm{nTTP}}_k \le \theta$ and $\theta - \widehat{\mathrm{nTTP}}_k \ge \widehat{\mathrm{nTTP}}_{k+1} - \theta$, we assign the next patient to dose $d_{k+1}$, where $k < K$. Otherwise, we assign the next patient to dose $d_k$.
- If $\widehat{\mathrm{nTTP}}_k > \theta$ and $\theta - \widehat{\mathrm{nTTP}}_{k-1} < \widehat{\mathrm{nTTP}}_k - \theta$, we assign the next patient to dose $d_{k-1}$, where $k > 1$. Otherwise, we assign the next patient to dose $d_k$.
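The pooling step of the isotonic regression can be sketched as follows. This is a generic weighted pool-adjacent-violators routine, written here for illustration with a hypothetical function name `pava`; it is not taken from the cited design.

```python
def pava(means, counts):
    """Weighted pool-adjacent-violators fit of a nondecreasing sequence of dose means."""
    blocks = []  # each block: [weighted mean, total weight, number of doses pooled]
    for m, n in zip(means, counts):
        blocks.append([m, n, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2, c2 = blocks.pop()
            m1, n1, c1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2, c1 + c2])
    fitted = []
    for m, _, c in blocks:
        fitted.extend([m] * c)
    return fitted

# Hypothetical mean nTTP per explored dose, with the number of patients treated at each dose.
fitted = pava([0.10, 0.30, 0.20, 0.40], [3, 3, 3, 3])
print(fitted)
```

Doses pooled into the same block share one estimate, which is the situation the tie rule for the recommended dose has to handle.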

Unified approach.
The dose escalation algorithm is derived from the up-and-down method [17] and is based on a t -statistic [7].
Assume that the last patient $i$ is treated at dose $d_k$. Let $T_k$ denote the $t$-statistic computed after the observation of patient $i$,

$$T_k = \frac{\overline{\mathrm{nTTP}}_k - \theta}{\sqrt{s_k / n_k}}$$

where $\overline{\mathrm{nTTP}}_k$ and $s_k$ are the mean and the variance of nTTP computed from all available observations, $n_k$, of the patients treated at dose $d_k$, respectively. We define the dose allocation algorithm as follows:
- If $T_k \le -\Delta$, we assign the next subject to dose $d_{k+1}$.
- If $T_k \ge \Delta$, we assign the next subject to dose $d_{k-1}$.
- If $-\Delta < T_k < \Delta$, we assign the next subject to dose $d_k$.
$\Delta$ is the design parameter that is fixed before the beginning of the trial; for further details, see [7]. In contrast to the model-based methods previously described, which use information from all patients treated in the trial up to that point, the dose allocation for the next cohort is based only on the observations of the patients treated at the last dose level.
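The allocation rule based on the $t$-statistic can be sketched as below. The handling of fewer than two observations and of a zero sample variance is not specified in the text, so the choices made here are assumptions, as are the function name and the 0-based dose indexing.

```python
import math
from statistics import mean, variance

def ua_next_dose(k, scores_at_k, target, delta=1.0, n_doses=6):
    """Next dose index (0-based) from the t-statistic of the scores observed at dose k."""
    n = len(scores_at_k)
    if n < 2:
        return k  # assumption: repeat the dose when the variance cannot be estimated
    diff = mean(scores_at_k) - target
    s = variance(scores_at_k)  # sample variance
    if s == 0.0:
        # Assumption: with identical scores, only the sign of the difference is used.
        t = math.copysign(math.inf, diff) if diff != 0 else 0.0
    else:
        t = diff / math.sqrt(s / n)
    if t <= -delta:
        return min(k + 1, n_doses - 1)  # escalate, staying within the dose range
    if t >= delta:
        return max(k - 1, 0)            # de-escalate, staying within the dose range
    return k                            # repeat the dose

print(ua_next_dose(2, [0.0, 0.05, 0.1], target=0.28))
print(ua_next_dose(2, [0.5, 0.6, 0.7], target=0.28))
print(ua_next_dose(2, [0.1, 0.3, 0.5], target=0.28))
```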
At the end of the trial, we perform an isotonic regression using a pool-adjacent-violators algorithm if a nonmonotonic dose-toxicity relationship is observed. We then define the RD as the dose with an estimated nTTP, $\widehat{\mathrm{nTTP}}$, closest to the normalized target, $\theta$. After correction of the violators, two or more doses may be associated with the same estimated nTTP. Following the authors' recommendations, if this estimated nTTP is the closest to $\theta$ and is above $\theta$, we select the lowest of these doses as the RD. If this estimated nTTP is the closest to $\theta$ but is below $\theta$, we select the highest of these doses as the RD.
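The selection of the RD with the tie rule can be sketched as follows. The input is assumed to be the vector of isotonic estimates for the explored doses, in which pooled doses share exactly the same value. The case where the closest estimate equals the target is not covered by the text and is treated here like the below-target case, which is an assumption.

```python
def select_rd(fitted, target):
    """Index (0-based) of the recommended dose among explored doses, applying the tie rule."""
    closest = min(fitted, key=lambda f: abs(f - target))
    tied = [k for k, f in enumerate(fitted) if f == closest]
    # Above the target: lowest tied dose. Otherwise (including equality, by assumption): highest.
    return tied[0] if closest > target else tied[-1]

# Hypothetical isotonic estimates with two pooled doses.
print(select_rd([0.10, 0.25, 0.25, 0.40], target=0.28))
print(select_rd([0.10, 0.30, 0.30, 0.50], target=0.28))
```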

Design parameters
To be consistent through the different methods, we only considered trials of fixed sample size. The trial stops when the prespecified number of patients, n, is exhausted. We did not allow skipping of dose levels during dose escalation.
We used the same definition of the RD in all the methods, that is, the dose associated with an estimated nTTP, $\widehat{\mathrm{nTTP}}$, closest to the normalized target $\theta$, even if this estimate is above $\theta$ or the dose was never allocated. Note that the unified approach (UA) cannot recommend a dose never explored, as the isotonic regression is applied only to the doses already explored. We performed a sensitivity analysis for the QLCRM, QCRM, and extended isotonic design (EID), in which the RD, although close to the normalized target $\theta$, must be among the doses allocated in the trial, as in the design of Chen [3].
In the main analysis, the design parameters of QLCRM were as follows: we obtain the vector of initial guesses, also called working model, with the getprior function of the R-package dfcrm developed by Cheung [18], using dose 3 as the prior guess of RD, 0.04 as the indifference interval parameter, and a logistic model with the intercept set at 3. Please refer to the paper of Lee [19] for more details. For the QCRM, we similarly generated the working model from the getprior function (the prior guess of RD equal to dose 3 and the indifference interval parameter equal to 0.04 and the empiric model). The prior distribution of the empiric model parameter b is the exponential distribution with mean 1, as proposed by the authors [6].
We studied some other parameters that will be discussed later. For the UA method, we fix the design parameter $\Delta$ to 1, as recommended by the author.

Simulations
We used simulations to compare these four designs.

Definition of the matrix of weights and the target of toxicity
Assume that the main expected toxicities related to the treatment can be represented by three types of toxicity, assumed to be independent: renal, neurological, and hematological. For the purpose of our simulation study, we elicited the matrix of weights from an expert clinician. In this setting, we define DLT as the occurrence of a grade 3 or 4 neurological or renal toxicity or a grade 4 hematological toxicity.
With this matrix, a patient with a DLT as the single toxicity has a score equal to 1 for a grade 3 renal, grade 3 neurological, or grade 4 hematological toxicity, and a score equal to 1.5 for a grade 4 renal or neurological toxicity. These patients experiencing a single DLT have almost the same score as a patient experiencing grade 2 renal plus grade 2 neurological toxicities. Lower grades of toxicity are given lower weights.
The maximum TTP computed from this matrix is 2.34, corresponding to a grade 4 neurological toxicity, associated with grade 4 renal and grade 4 hematological toxicities. The TTPs were thus normalized by dividing each value by 2.5.
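As a check on the arithmetic, the maximum score can be recomputed from the weights stated above alone (1.5 for a grade 4 renal or neurological toxicity and 1 for a grade 4 hematological toxicity); the rounding shown is ours.

```latex
\mathrm{TTP}_{\max} = \sqrt{1.5^{2} + 1.5^{2} + 1^{2}} = \sqrt{5.5} \approx 2.345,
\qquad
\mathrm{nTTP}_{\max} = \frac{2.345}{2.5} \approx 0.94 .
```

Dividing by 2.5 rather than by the maximum itself leaves a small margin below 1 for the most severe profile.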
We asked the same expert to order a set of hypothetical cohorts of three patients with various TPs. The cohorts associated with the decision to repeat the dose have a mean nTTP varying from 0.24 to 0.32. Two such cohorts serve as examples: the first is associated with no DLT, and the second, with different TPs, is associated with one DLT, observed in patient 1. In the second cohort:
- Patient 1 presents grade 3 renal (considered as DLT), grade 0 neurological, and grade 0 hematological toxicities. The nTTP of patient 1 is given by $\mathrm{nTTP}_1 = \sqrt{1^{2} + 0 + 0}/2.5 = 0.40$.
- Patient 2 presents grade 0 renal, grade 2 neurological, and grade 1 hematological toxicities.
The target $\theta$ has consequently been set at 0.28.

Definition of the scenarios and generation of the toxicity data
We define each scenario by a given TP at each dose level. For each toxicity type, we defined the matrix of probabilities of observing grades 0 to 4, for the K dose levels. We assumed a unimodal distribution of the probability of observing a given grade across dose levels.
The probability of observing grade 0 (no toxicity) is maximal for d 1 and decreases with the dose; the probability of observing a grade 4 increases with the dose, with a maximum probability at the last dose to be explored.
As these three toxicity types were assumed to be independent, we could compute for each dose level the probability of each of the $5^3$ combinations of the three toxicities (yielding a vector of $5^3$ probabilities corresponding to the $5^3$ TPs). For a given dose level, we then derive the mean nTTP and the probability of observing a DLT. The mean nTTP is the weighted sum of the $5^3$ normalized TTP values, with the combination probabilities as weights. We detail one of the studied scenarios as an example in Appendix A. Table I describes the eight scenarios that we proposed, ordered from the most toxic to the least toxic. Scenarios C, F, and G represent a translation of scenario A to the right. We use these scenarios to study the impact of the position of the RD in the dose scale, for the same slope of the dose-toxicity curve around the RD. For all of them, the nTTP at the true RD is equal to the target nTTP. Scenarios B and E represent a mild variation of scenarios C and F, respectively, with an nTTP at the true RD slightly above the target nTTP. Similarly, scenarios D and H represent a mild variation of scenarios C and G, respectively, with an nTTP at the true RD slightly below the target nTTP.
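The enumeration of the $5^3$ toxicity profiles at one dose level can be sketched as follows. Only the weights 1 and 1.5 for the DLT grades and the normalization constant 2.5 are stated in the text; the other entries of `WEIGHTS` and the grade probabilities are made up for illustration, and the function name is hypothetical.

```python
import math
from itertools import product

# Rows: renal, neurological, hematological; columns: grades 0..4.
# Non-DLT weights are illustrative assumptions.
WEIGHTS = [
    [0.0, 0.5, 0.75, 1.0, 1.5],
    [0.0, 0.5, 0.75, 1.0, 1.5],
    [0.0, 0.0, 0.0, 0.5, 1.0],
]
DLT_FROM_GRADE = [3, 3, 4]  # lowest grade counted as a DLT for each toxicity type
NU = 2.5                    # normalization constant

def dose_summary(grade_probs):
    """Mean nTTP and DLT probability at one dose for independent toxicity types."""
    mean_nttp = p_dlt = 0.0
    for grades in product(range(5), repeat=len(grade_probs)):
        # Independence: the profile probability is the product of the grade probabilities.
        p = math.prod(grade_probs[l][g] for l, g in enumerate(grades))
        score = math.sqrt(sum(WEIGHTS[l][g] ** 2 for l, g in enumerate(grades))) / NU
        mean_nttp += p * score
        if any(g >= DLT_FROM_GRADE[l] for l, g in enumerate(grades)):
            p_dlt += p
    return mean_nttp, p_dlt

# Hypothetical grade probabilities (grades 0..4) for the three toxicity types at one dose.
probs = [
    [0.60, 0.20, 0.10, 0.07, 0.03],
    [0.60, 0.20, 0.10, 0.07, 0.03],
    [0.50, 0.20, 0.15, 0.10, 0.05],
]
print(dose_summary(probs))
```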

Simulation of the trials and metrics of comparison
For each scenario, we simulated 5000 repetitions of a trial of fixed sample size $n$, recruiting patients in cohorts of three and exploring six dose levels. The detailed analysis describes the results for $n = 36$, in terms of the percentage of dose recommendation and the percentage of patients allocated to each dose. We also report the distribution of the patient toxicity scores (nTTP) and the number of DLTs observed in each trial, reflecting the safety of the different designs during the trial. The dose just below the true RD is hereafter called RD − 1, and the dose just below RD − 1 is RD − 2; the dose just above the true RD is called RD + 1, and the dose just above RD + 1 is RD + 2.
Although sample sizes in actual phase I clinical trials are rather small, it is important to check whether the methods converge asymptotically with increasing sample size. If a new method failed to converge asymptotically, its performance would be questionable for the usual small sample sizes. We studied the convergence of the percentage of correct selection (PCS) of the RD for $n$ varying from 15 to 99, using a cohort size of three patients. The study was performed using R v2.11.1 [20].

Main results for n = 36
The distribution of dose recommendation obtained with the QLCRM is very narrow around the RD. The PCS is very high, varying from approximately 80% to 90% according to the scenario (Table II). In all cases, more than 90% of the recommendations correspond to the RD or the next closest dose. The control of overdosing is excellent, with 0% of recommendations at RD + 2 in all scenarios. The chance of underdosing is also very low, and RD − 2 is never recommended.
With the QLCRM, more than 45% of the 36 patients are allocated to the true RD. The dose escalation process looks efficient: the percentage of patients allocated to the dose levels below RD − 1 is very close to 8.33% (= 3/36) in all the scenarios where the true RD is above $d_2$, meaning that the dose is escalated after each cohort in most cases until the allocated dose is RD − 1.
The performance of the other methods is good in all the studied scenarios. The PCS varies from 82% to 91% with the QCRM, from 69% to 85% with the EID, and from 78% to 93% with the UA. The PCS of the QLCRM and QCRM are very similar in the various scenarios, with a difference varying from −6% to +2%, the QCRM being slightly better than the QLCRM in six of the eight studied scenarios. The performance of the QLCRM is greater than that of the EID in all but one scenario, with a difference of PCS equal to or greater than +9% in seven of the eight studied scenarios (up to +16% in scenarios B and E). The difference between the performance of the QLCRM and that of the UA depends upon the scenario, varying from −10% to +7%. These two methods give similar results in four scenarios (A, C, F, and G), with a difference of PCS below 1%.

Table legend: Results at the target dose are in bold. Results at the closest target dose are in italics. QLCRM, quasi-likelihood continual reassessment method (our proposal); QCRM, quasi-continual reassessment method [6]; EID, extended isotonic design [3]; UA, unified algorithm [7].

The distribution of allocated doses across the trial is very similar between the QLCRM and QCRM, with more than half of the patients allocated to the RD in six of the eight scenarios. In all scenarios but one (scenario D), the parametric methods (QLCRM and QCRM) allocate more patients to the RD than the nonparametric methods (EID and UA). The distribution of allocated doses is wider with the EID; in particular, this method allocates more patients to RD + 2 in almost all the studied scenarios. The UA method appears more conservative in the dose escalation process, with many more patients allocated to RD − 1 even when it is not the next closest dose to the target.

The distributions of the observed nTTP scores and of the number of DLTs reflect the safety of the process for the patients included during the trial. These distributions are very similar for the different methods across the eight studied scenarios, except for the UA method, which appears slightly safer in terms of the number of DLTs. Figure 1 shows box plots of the nTTP and DLT distributions for scenarios A, C, and G (other scenarios are available on request).
When the Wedderburn variance function is used in the QLCRM, the percentages of correct selection are numerically similar in most scenarios but can be worse in some. In addition, more patients are allocated to doses higher than the RD in all scenarios (details in Table A.1 of Appendix A).

Convergence study
The convergence study examines the properties of the outlined designs for different numbers of patients. As illustrated in Figure 2, the QLCRM, QCRM, and UA converge toward the true RD in all studied scenarios. For $n = 99$, the PCS is above 90% in all the scenarios. In contrast with the excellent results observed with these three methods, the EID shows poor convergence behavior, with a PCS plateauing very quickly as the number of patients increases in six of the eight studied scenarios. As phase I clinical trials generally include fewer than 50 patients, the operating characteristics of the different methods for a small number of patients are also an important issue. The QLCRM and QCRM show very similar properties, which differ from those of the UA design in three of the eight studied scenarios (the differences being in either direction). For these three designs, the PCS is greater than 65% when the sample size is equal to or greater than 24 in the eight studied scenarios.

Sensitivity analysis
To evaluate the impact of the decision rule of selecting the RD, we performed a sensitivity analysis for which the RD should be among the doses allocated in the trial, as recommended in the paper of Chen et al. [3]. We did not observe any significant impact of the decision rule on the performance of the methods (details given in Appendix A).

Discussion
The emergence of molecularly targeted agents (MTA) in oncology is revolutionizing the current phase I paradigm in a variety of ways [1,2,21–26]. As noted by leaders in the field [25,26], the design of phase I trials is an open issue, and novel approaches are required to better fit the particularities of MTA. One component of the current questioning concerns the choice of the best endpoint to identify the dose to be recommended for further evaluation. Although reasonable expectations come from the integration of alternate endpoints such as plasma drug concentration and dynamic biomarkers able to measure target inhibition in tumor or surrogate tissues [2,21–23,25–29], toxicity response remains a major endpoint in these trials [27,30]. This was recently confirmed in a review of oncology phase I trials published between 1997 and 2008, including 99 trials evaluating MTA: in all study reports, the prespecified primary aim (or co-primary aim) was to determine the RD according to observed toxicity [31]. An analysis of 82 phase I trials evaluating MTA as a single agent showed that DLTs were observed in only approximately half of the trials (43/82) [1].
To overcome these problems, several authors have recently proposed different toxicity scoring systems to measure quantitatively and comprehensively the overall severity of multiple toxicities per patient, considering a quasi-continuous toxicity endpoint [3–5]. Bekele and Thall were the first to combine multiple graded toxicities in a single toxicity measure, the total toxicity burden (TTB) [4]. In 2010, Lee et al. [5] proposed a toxicity burden score. Although they differ in the elicitation process, both scores are computed as an arithmetic sum of toxicity weights. By construction, multiple mild or moderate toxicities may easily lead to a higher score than a single DLT. In 2010, Chen et al. [3] defined an equivalent toxicity score (ETS) that preserves the relative order of the highest adjusted toxicity grade and its classification as a DLT or not, the additional toxicities counting only for the decimal part of the score. These approaches therefore lead to very different rankings of patients experiencing multiple moderate toxicities. Computed as a Euclidean norm, the nTTP is an intermediate alternative. Its mathematical property (the triangle inequality) is appealing: the score of a single patient accumulating two toxicities is lower than the sum of the scores of two different patients, each experiencing one of these toxicity events; in other words, multiple toxicities carry more weight if they come from different patients.
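The inequality can be written out as a short worked case. The weights $w_1 = w_2 = 1$ used below are those reported in the simulation section for a grade 3 renal and a grade 3 neurological toxicity; the two-toxicity setting itself is only an illustration.

```latex
\sqrt{w_1^{2} + w_2^{2}} \;\le\; w_1 + w_2 \quad (w_1, w_2 \ge 0),
\qquad \text{e.g.}\quad \sqrt{1^{2} + 1^{2}} = \sqrt{2} \approx 1.41 \;<\; 1 + 1 = 2 .
```

An arithmetic sum of weights would give 2 in both situations, which is where the Euclidean norm departs from the sum-based scores.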
In this paper, we proposed a design based on a quasi-continuous toxicity score, the nTTP, and compared it with three existing designs [3,6,7] using the same toxicity endpoint. A major aspect of our design is the construction of the matrix of weights and the definition of the toxicity target, which require close collaboration with clinicians prior to the start of each new phase I trial. Ideally, a panel of expert clinicians would define the numerical weights and the target toxicity measure before launching a trial. Just as the definition of the DLT is tailored to the context in a classical design, the weights allocated to each grade and type of toxicity could differ according to the trial population and the evaluated drug. The weights could also accommodate other features of the toxic event, such as its duration or reversibility. We acknowledge that this design requires more collaborative effort both at the design stage, for the elicitation of the matrix of clinical weights and of the target score, and during the conduct of the study, as all toxicity data have to be collected and considered in a timely fashion. Another hurdle is that the clinical meaning of the target score is less straightforward than that of a target percentage of DLT. Illustrating the target score with hypothetical cohorts of patients may facilitate its understanding. From the current results, we think that these issues are offset by the expected improvement in the performance of dose-finding designs incorporating the nTTP endpoint.
Our simulation studies demonstrated that all these designs have good performance characteristics with respect to the identification of the correct RD. In particular, our design, the QLCRM, had a PCS of about 80% or higher in all scenarios for a fixed sample size of 36 patients. For the model-based methods derived from the CRM, the choice of inference (frequentist versus Bayesian) and the type of one-parameter model (logistic versus empiric) have a limited impact on the performance, as illustrated by the very similar results of the QLCRM and the QCRM [6]. The nonparametric UA design [7] and EID [3] both rely on isotonic regression. Whereas the PCS for the UA was similar to that of the model-based methods, the performance of the EID was less than optimal. The difference in the behavior of the evaluated methods in the convergence study was particularly striking: the EID apparently did not converge to the true RD. This was likely because the EID recommends the upper dose in some cases where the model-based or UA designs would have recommended repeating the dose, which arises from the extrapolation of the estimated score at the highest allocated dose to the doses not yet explored. This is consistent with the conclusion of a comparison of several isotonic designs for dose finding with a binary toxicity endpoint [32]: the designs in which the dose closest to the RD is selected at every step on the basis of isotonic regression [33] do not work well compared with the design where the isotonic regression is used only at the end of the trial on the cumulative cohort [34]. On the other hand, the convergence study showed the good performance of the model-based methods and of the UA design when the trial included 24 or more patients.
All the designs yielded good overdosing control, with the UA design being more conservative by allocating more patients to doses lower than the RD.
In the main results, the QLCRM and QCRM used the working models obtained with the getprior function. We compared these results with those obtained with the working model published by Yuan et al. [6]. For both the QLCRM and QCRM, the choice between the two working models has very little impact on the results, except in scenario H, where the performance is greatly impaired when using the working model of Yuan et al. We also compared the QLCRM, based on a one-parameter logistic model, for various values of the fixed intercept. The results are much better with the intercept value $a = 3$ than with other values ($a = 2$ or 5), which lead to incorrect identification of the RD whatever the working model used: with $a$ equal to 2, the RD is always overestimated, whereas with $a$ equal to 5, the RD is underestimated (additional results available on request).
Considering a toxicity score as a fractional event, and thus using a quasi-Bernoulli likelihood in the QLCRM as done by Yuan et al. [6], raises the issue of the choice of the variance function. We studied this issue by comparing the results with those obtained with the alternative Wedderburn variance. The design with the Wedderburn variance leads to similar performance in most scenarios but to poorer performance in some. We also derived the explicit variance function from the different toxicity components using the delta method, leading to a more precise variance than the Bernoulli or Wedderburn variance. As this variance function relies on different random variables, a joint modeling of the different toxicity components would be required, as done by Bekele and Thall in their pioneering work for their toxicity score [4]. However, the model would be more complex for a small potential additional benefit, given the rather good performance of the QLCRM. This could be the subject of future work, as the original idea of our work was to propose an extension of the CRM based on a summary measure of the different toxicities and not on multiple individual toxicity items.
In this work, we have proposed to extend the QCRM design proposed by Yuan et al. [6] to incorporate the TTP score in the QLCRM, using a logistic model in a frequentist framework, because we think that it is more intuitive for clinicians. However, it is worth noting that we have also extended the QCRM to three other variants: using a cloglog link function or an empiric model for the dose-toxicity relationship in a frequentist framework, and using a logistic model in a Bayesian framework. The five variants present, on average, very similar results across the eight studied scenarios (results detailed in Table A.2 of Appendix A). As proposed for the original CRM design for the DLT endpoint in 1996 [10], the main purpose of these variants is not to make claims about which one improves on the others. Instead, the purpose is to add another perspective to the QCRM construction, to make it more general, and to facilitate the understanding of the design and its properties. Because the performance of the proposed method is very similar with the different link functions, the choice of the underlying model may depend on the number of doses and the dose increment setup [11]. A one-parameter logistic model may be more appropriate when a limited number of evenly spaced doses are explored, whereas an empiric function model would a priori fit better in the case of a large number of dose levels with no relative or absolute dose increments. The choice of the inference framework is also a matter of debate despite its very limited impact on the performance of the method, as shown here when using a toxicity score, as well as by several authors for the classical DLT-driven CRM [10,11]. The frequentist design requires two stages because the likelihood equation has no solution if no heterogeneity is observed in the response. An up-and-down escalation schema [12,17] is generally proposed for the first stage, as long as no toxicity event is observed.
Even if the use of the classical schema in the beginning of a DLT-driven trial is reassuring for the clinicians, dose allocation may not be optimal until the model is fitted. One advantage of considering all grade information in the proposed design is that the model can be estimated from the first observation of toxicity, even if it is only a mild toxicity. This leads to a shorter first stage of the design and increases the percentage of patients allocated to the true RD (details available from the author upon request).
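The difference in first-stage length can be illustrated with a minimal sketch. It assumes, as a simplification, that the model becomes estimable as soon as the observed responses are neither all equal to 0 nor all equal to 1. The function names and the patient data are hypothetical.

```python
def is_estimable(responses):
    """Assumed criterion: responses are not all 0 and not all 1."""
    return any(r > 0 for r in responses) and any(r < 1 for r in responses)

def patients_until_estimable(responses):
    """Number of patients needed before the model-based stage can start, or None."""
    for n in range(1, len(responses) + 1):
        if is_estimable(responses[:n]):
            return n
    return None

# Hypothetical trial: toxicity scores in [0, 1] and the corresponding binary DLT indicators.
scores = [0.0, 0.0, 0.0, 0.12, 0.0, 0.0, 0.31, 0.0, 0.58]
dlts = [0, 0, 0, 0, 0, 0, 0, 0, 1]

n_score = patients_until_estimable(scores)  # first mild toxicity is enough
n_dlt = patients_until_estimable(dlts)      # must wait for the first DLT
```

In this hypothetical sequence, the score-based design can leave the first stage at the fourth patient, who experiences a mild toxicity, whereas the DLT-based design has to wait for the ninth patient.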
Our design can be extended to do the following: (i) accommodate other quasi-continuous toxicity endpoints, such as those proposed by Bekele and Thall or by Chen et al. [3,4]; (ii) adaptively integrate and update the toxicity matrix after the trial is launched to incorporate a new toxicity that was not selected and weighted a priori; (iii) integrate other stopping rules, such as those proposed by Chen for the EID design [3] or by Zohar and Chevret for the CRM [35]; and (iv) allow for different cohort sizes and/or skipping of dose levels during dose escalation.

A.1. Analytical variance function using the Delta method
We may explicitly derive the analytical variance function by considering the structure of the nTTP score as a function of several toxicity random variables, using the delta method, where $P(G_{l,j})$ is the probability of grade $j$ for toxicity $l$ and

A.2. Definition of the scenarios
In this section, we detail how we generated scenario F.

Toxicity profile
Let us define a toxicity profile (TP) as the combination of a given grade $j_r$ of renal toxicity, $G_{r,j_r}$, plus a given grade $j_n$ of neurological toxicity, $G_{n,j_n}$, plus a given grade $j_h$ of hematological toxicity, $G_{h,j_h}$, with $j_r$, $j_n$, and $j_h$ being in $\{0, 1, 2, 3, 4\}$. We can define a vector of the $5^3$ TPs.
$$\mathrm{VTP} = \big( (G_{r,0} + G_{n,0} + G_{h,0}), (G_{r,0} + G_{n,0} + G_{h,1}), \ldots, (G_{r,0} + G_{n,1} + G_{h,0}), (G_{r,0} + G_{n,1} + G_{h,1}), \ldots, (G_{r,1} + G_{n,0} + G_{h,0}), \ldots, (G_{r,4} + G_{n,4} + G_{h,4}) \big)$$
We compute the nTTP score of each toxicity profile, yielding a vector of $5^3$ nTTP values,
$$\mathrm{nTTP}(G_{r,j_r}, G_{n,j_n}, G_{h,j_h}) = \sqrt{w_{r,j_r}^2 + w_{n,j_n}^2 + w_{h,j_h}^2}$$
where $w_{r,j_r}$, $w_{n,j_n}$, and $w_{h,j_h}$ are the weights corresponding to the grade $j_r$ of renal toxicity ($G_{r,j_r}$), grade $j_n$ of neurological toxicity ($G_{n,j_n}$), and grade $j_h$ of hematological toxicity ($G_{h,j_h}$), respectively, defined in the weights matrix $W$, and the normalization constant is defined in Equation (2). In the same way, we compute the DLT from the TP, leading to a vector of $5^3$ DLT values equal to 0 or 1.

$$\mathrm{DLT} = \begin{cases} 1 & \text{if } \max(G_{r,j_r}, G_{n,j_n}) \geq 3 \text{ or } G_{h,j_h} \geq 4 \\ 0 & \text{otherwise} \end{cases}$$
where $j_r$, $j_n$, and $j_h$ are in $\{0, 1, 2, 3, 4\}$.
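As an illustration of the construction above, the following minimal sketch enumerates the $5^3$ toxicity profiles and computes the corresponding nTTP and DLT vectors. The weights and the normalization constant are hypothetical placeholders chosen for the example, not the values of the weights matrix $W$ or of Equation (2). The DLT rule assumes grade 3 or higher for renal or neurological toxicity, or grade 4 for hematological toxicity.

```python
import itertools
import math

# Hypothetical weights for grades 0..4 (placeholders, not the paper's matrix W).
W = {
    "renal": [0.0, 0.4, 0.8, 1.2, 1.6],
    "neuro": [0.0, 0.3, 0.6, 1.0, 1.4],
    "hema":  [0.0, 0.1, 0.2, 0.9, 1.3],
}

# Hypothetical normalization constant: the largest attainable Euclidean norm,
# so that the normalized score lies in [0, 1].
NU = math.sqrt(sum(max(w) ** 2 for w in W.values()))

def nttp(j_r, j_n, j_h):
    """Normalized Euclidean norm of the weights of the three observed grades."""
    norm = math.sqrt(W["renal"][j_r] ** 2 + W["neuro"][j_n] ** 2 + W["hema"][j_h] ** 2)
    return norm / NU

def dlt(j_r, j_n, j_h):
    """Assumed DLT rule: renal or neurological grade >= 3, or hematological grade >= 4."""
    return int(max(j_r, j_n) >= 3 or j_h >= 4)

# All profiles (j_r, j_n, j_h); the hematological grade varies fastest.
profiles = list(itertools.product(range(5), repeat=3))
nttp_values = [nttp(*tp) for tp in profiles]
dlt_values = [dlt(*tp) for tp in profiles]
```

The ordering produced by `itertools.product` matches the ordering of the vector of toxicity profiles, with the hematological grade varying first, then the neurological grade, then the renal grade.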

Generation of the scenario
We define a scenario by the probability of each of the $5^3$ TPs for the six dose levels. We can derive this from three matrices corresponding to each of the three toxicity types we consider. For each toxicity type, this matrix is a plausible matrix of the probabilities of observing each grade (from 0 to 4, defining the columns of the matrix) at each dose level (from $d_1$ to $d_6$, defining the rows of the matrix). Let $P_r$, $P_n$, and $P_h$ be the probability matrices defined for the renal, neurological, and hematological toxicities, respectively. Find the matrices used for scenario F in the following. With the preceding three matrices, we can compute the probability of observing, at dose level $d_k$, a given TP as
$$P_k(G_{r,j_r}, G_{n,j_n}, G_{h,j_h}) = P_k(G_{r,j_r}) \, P_k(G_{n,j_n}) \, P_k(G_{h,j_h})$$
where $j_r$, $j_n$, and $j_h$ are in $\{0, 1, 2, 3, 4\}$ and $P_k(G_{r,j_r})$ is the probability of observing a renal toxicity of grade $j_r$ at dose level $d_k$, defined in the matrix $P_r$. We similarly define the probabilities $P_k(G_{n,j_n})$ and $P_k(G_{h,j_h})$ in the matrices $P_n$ and $P_h$, respectively.

Features of the scenario
The mean of the normalized TTP at each dose level $d_k$, $\mathrm{nTTP}_k$, is the weighted sum of the $5^3$ nTTP values:
$$\mathrm{nTTP}_k = \sum_{j_r=0}^{4} \sum_{j_n=0}^{4} \sum_{j_h=0}^{4} P_k(G_{r,j_r}, G_{n,j_n}, G_{h,j_h}) \, \mathrm{nTTP}(G_{r,j_r}, G_{n,j_n}, G_{h,j_h})$$
We can similarly compute the probability of DLT at each dose level $d_k$ as
$$\sum_{j_r=0}^{4} \sum_{j_n=0}^{4} \sum_{j_h=0}^{4} P_k(G_{r,j_r}, G_{n,j_n}, G_{h,j_h}) \, \mathrm{DLT}(G_{r,j_r}, G_{n,j_n}, G_{h,j_h})$$
The probability matrices $P_r$, $P_n$, and $P_h$ for all scenarios are available from the authors on request.
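The two weighted sums can be sketched as follows. The weights, the normalization constant, and the probability matrices are hypothetical placeholders and are not those of scenario F. Only three dose levels are used to keep the example short. The DLT rule again assumes grade 3 or higher for renal or neurological toxicity, or grade 4 for hematological toxicity.

```python
import itertools
import math

# Hypothetical weights for grades 0..4: rows are renal, neurological, hematological.
W = [[0.0, 0.4, 0.8, 1.2, 1.6],
     [0.0, 0.3, 0.6, 1.0, 1.4],
     [0.0, 0.1, 0.2, 0.9, 1.3]]
NU = math.sqrt(sum(max(w) ** 2 for w in W))  # hypothetical normalization constant

def nttp(j_r, j_n, j_h):
    return math.sqrt(W[0][j_r] ** 2 + W[1][j_n] ** 2 + W[2][j_h] ** 2) / NU

def dlt(j_r, j_n, j_h):
    # Assumed DLT rule, see the lead-in.
    return int(max(j_r, j_n) >= 3 or j_h >= 4)

# Hypothetical grade probabilities: one row per dose level, one column per grade 0..4.
P_r = [[0.85, 0.10, 0.03, 0.015, 0.005],
       [0.70, 0.15, 0.08, 0.05, 0.02],
       [0.50, 0.20, 0.15, 0.10, 0.05]]
P_n = [[0.90, 0.06, 0.02, 0.015, 0.005],
       [0.80, 0.10, 0.05, 0.03, 0.02],
       [0.65, 0.15, 0.10, 0.06, 0.04]]
P_h = [[0.70, 0.15, 0.10, 0.04, 0.01],
       [0.55, 0.20, 0.13, 0.08, 0.04],
       [0.40, 0.20, 0.18, 0.14, 0.08]]

def scenario_features(k):
    """Return (mean nTTP, probability of DLT) at dose level index k."""
    mean_nttp, p_dlt = 0.0, 0.0
    for j_r, j_n, j_h in itertools.product(range(5), repeat=3):
        # Joint probability of the profile: product of the three grade probabilities.
        p = P_r[k][j_r] * P_n[k][j_n] * P_h[k][j_h]
        mean_nttp += p * nttp(j_r, j_n, j_h)
        p_dlt += p * dlt(j_r, j_n, j_h)
    return mean_nttp, p_dlt

features = [scenario_features(k) for k in range(len(P_r))]
mean_nttp_by_dose = [f[0] for f in features]
p_dlt_by_dose = [f[1] for f in features]
```

Because the joint probability is a product of the three marginal probabilities, the probability of DLT can be cross-checked against one minus the product of the marginal probabilities of staying below each DLT threshold.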

A.3. Sensitivity analysis
As illustrated in Figure A.1 of Appendix A, the results of the convergence rate were the same when the trial accrues more than 20 patients. In scenarios G and H, for which the true recommended dose is the fifth dose, the PCS is very high with the parametric methods for n = 15, that is, after having treated five cohorts.

Results at the target dose are in bold. QLCRMW, quasi-likelihood continual reassessment method using Wedderburn variance and its corresponding likelihood; QLCRM, quasi-likelihood continual reassessment method (our proposal, with the Bernoulli variance).

Results at the target dose are in bold. Results at the closest target dose are in italics. QLCRM, quasi-likelihood continual reassessment method (our proposal: quasi-CRM with a logistic model in a frequentist framework); QLCRMcl, quasi-continual reassessment method with cloglog model in a frequentist framework; QCRM, quasi-continual reassessment method ([6]: quasi-CRM with an empiric model in a Bayesian framework); QCRM-EF, quasi-continual reassessment method with an empiric model in a frequentist framework; QCRM-LB, quasi-continual reassessment method with a logistic model, with a fixed intercept equal to 3, in a Bayesian framework assuming a normal law for the prior distribution of the slope, with mean equal to 0 and variance equal to 1.34.