A straightforward meta-analysis approach for oncology phase I dose-finding studies

Phase I early-phase clinical studies aim at investigating the safety and the underlying dose-toxicity relationship of a drug or combination. While little may still be known about the compound's properties, it is crucial to consider quantitative information available from any studies that may have been conducted previously on the same drug. A meta-analytic approach has the advantages of being able to properly account for between-study heterogeneity, and it may be readily extended to prediction or shrinkage applications. Here we propose a simple and robust two-stage approach for the estimation of maximum tolerated dose(s) (MTDs) utilizing penalized logistic regression and Bayesian random-effects meta-analysis methodology. Implementation is facilitated using standard R packages. The properties of the proposed methods are investigated in Monte-Carlo simulations. The investigations are motivated and illustrated by two examples from oncology.


INTRODUCTION
Phase I dose-finding studies are the first-in-human studies of clinical development; they aim at estimating the maximum tolerated dose (MTD) of a drug or a combination of molecules. The MTD is usually defined as the dose level associated with a probability of occurrence of treatment-related adverse events, so-called dose-limiting toxicities (DLTs), at a pre-specified level $p^\star$ ($0 < p^\star < 1$; commonly $p^\star = 33\%$ or $p^\star = 25\%$). DLTs are usually composite endpoints comprising a range of adverse events according to their severity grade. These exploratory early-phase studies typically comprise a small sample of healthy volunteers, or of patients (for example, in oncology because of the potentially high toxicity of drugs, or in paediatrics because of the vulnerable population). 1,2 Due to the limited sample size and ethical concerns, simple randomized designs are usually not applied, but rather response-adaptive sequential designs, including some methods that can potentially find the MTD sooner and limit the number of observed DLTs. 3,4 These sequential designs can be divided into two main groups: (1) algorithm-based and (2) model-based methods. In the first group we find the classical 3+3 design, which uses prespecified fixed rules and only information on the patients at one dose (the one given to the previous cohort of patients) to allocate the next cohort to a dose level. 5,6 On the other hand, model-based designs, such as the continual reassessment method (CRM) 7 or the Bayesian logistic regression model (BLRM), 8 use an underlying statistical working model to describe the dose-toxicity relationship and a prespecified toxicity rate (the target toxicity) to allocate the next cohort of patients. While the former are easier to implement, the latter have better operating characteristics, that is, a higher percentage of correct MTD selection and greater chances of promoting effective drugs to subsequent development phases. 9,10,11 As a compromise between these two approaches, model-assisted designs have been proposed, which combine the simplicity of rule-based designs with the estimation of model-based designs. 12,13 Like the model-based approaches, these use a statistical model to develop the design, but fixed dose allocation rules are determined at the pre-planning stage, which makes implementation about as straightforward as for rule-based designs.
Abbreviations: BLRM, Bayesian logistic regression model; CI, credible interval; CRM, continual reassessment method; DLT, dose-limiting toxicity; FLAC, Firth's logistic regression with added covariate; GLM, generalized linear model; MAC, meta-analytic-combined; MAP, meta-analytic-predictive; MED, minimum effective dose; ML, maximum likelihood; MTD, maximum tolerated dose; NNHM, normal-normal hierarchical model; OR, odds ratio.
In all medical fields, and above all in oncology, identifying the appropriate therapeutic dose range is crucial to avoid exposing patients to an unacceptable toxicity profile or to insufficient efficacy. 14 Recently, Harrison 15 reported that phase II and phase III study failures were often due to lack of efficacy (52%) and safety (24%). They failed owing to inadequate patient selection, study design, biomarkers, and schedules, as well as data analysis, but the most evident reason was wrong estimation of the dose-response relationship during exploratory clinical trials. Thus, an improper dose selection at the first stages of clinical development may lead either to abandonment of promising drugs, or to a focus on a wrong dose (too toxic or lacking efficacy). Therefore, using all available data, including other exploratory trials, in order to improve the estimation of the MTD is of utmost importance, and the use of meta-analysis methods could be helpful. Only few meta-analytic methods have been proposed so far that cope with the specifics of phase I trials, i.e., small sample sizes and the consideration of several dose levels. Zohar et al. 16 proposed a common-effect (fixed-effect) method based on the retrospective CRM. Ursino et al. 17 developed a more sophisticated one-stage random-effects approach based on stochastic process priors. Other authors, instead, focused directly on efficacy endpoints, such as Kim et al. 18 who developed random-effects meta-analysis methodology to deal with a relaxed exchangeability assumption and rare events.
Our approach was to adapt and use simpler methods developed for regular meta-analysis, and to check their suitability in realistic phase I scenarios. We considered estimates of the MTD along with their standard errors, which are readily derived from the outcomes of a dose-finding trial based on a simple logistic model. In order to account for the small sample sizes typically encountered in early-phase studies, we suggest the use of a penalization approach correcting for bias in the logistic regression stage. 19,20 We then utilize a univariate two-stage meta-analysis model to synthesize the different estimates. Implementation is based on the logistf and bayesmeta R packages. 21,22 The results of the proposed meta-analysis methods may be utilized not only for an overall summary estimate (e.g., for consideration by regulatory agencies), but also for the design of a prospective trial, 23,24 for the selection of the dose panel, for sample size determination, 25 or as a meta-analytic predictive (MAP) prior for Bayesian inference. 26,27
The organisation of this article is as follows. In Section 2, we motivate the problem by considering two example data sets; in both cases, a literature search on clinicaltrials.gov and PubMed yielded a number of phase I studies investigating a novel drug at a range of dose levels. In Section 3 we introduce the statistical methods for MTD estimation and meta-analysis used in the following. In Section 4, we investigate the proposed methods' performance using Monte Carlo simulations considering the combined estimate as well as shrinkage and prediction. In Section 5, we analyze the example data, and Section 6 finally closes with some conclusions.

Sorafenib example
The first illustrating example concerns Sorafenib (BAY 43-9006), a kinase inhibitor for the treatment of advanced renal cell carcinoma, hepatocellular carcinoma, and radioactive iodine resistant advanced thyroid carcinoma. 28 Thirteen trials with published results, described in 11 manuscripts, were identified in a literature search. The doses used in each trial, the numbers of patients allocated, and the total numbers of DLTs at each dose are summarized in Table 1.
Seven doses (100, 200, 300, 400, 600, 800 and 1000 mg) were investigated across all studies. DLT definitions and Sorafenib administration schedules (about a 28-day cycle) were comparable across studies. The total sample sizes vary from 16 up to 54 patients; the maximum was reached in one of the studies where Sorafenib was tested in 9 different patient subgroups defined by organ dysfunctions. 35 As expected for dose-escalation trials, observed DLT frequencies generally tend to increase with increasing dose, with zero or low event rates at the lower doses placed at the beginning of the dose panel.

Irinotecan / S-1 example
The second illustrating example concerns a combination therapy of Irinotecan (a topoisomerase 1 inhibitor) and S-1 (a combination of three pharmacological compounds, namely, tegafur, gimeracil, and oteracil potassium) that was tested in advanced colorectal and gastric cancer in a Japanese population. 44,45 Data extracted from 12 studies are shown in Table 2. A total of 10 doses, ranging from 40 up to 150 mg/m², were evaluated among all trials. Two or more infusions of Irinotecan were planned in all trials, except for Yamada et al. 46 and Yoshioka et al. 53 with only one infusion on the first day of the cycle. The sample sizes range from 6, in agreement with the well-known 3+3 design, to 51, since, in Inokuchi et al., 48 DLTs were still recorded at the phase II stage of the study.

Dose-finding experiments: assumptions
Within a dose-finding experiment, we have a discrete set of covariable levels $x_i$ ($i = 1, \ldots, m$), usually denoted as doses or exposures, or transformations thereof, which are indexed in increasing order. The response is an event count $y_i$ among a total number $n_i$ of patients that have been exposed to the $i$th dose.
We assume the DLT count $y_i$ among the total of $n_i$ patients exposed at level $x_i$ to follow a binomial distribution,
$$y_i \sim \mathrm{Binomial}(n_i, p_i), \qquad (1)$$
where $p_i$ denotes the DLT probability at dose $x_i$. MTD estimation then aims at determining the dose $x^\star$ for which the DLT probability reaches a certain threshold $p^\star$ (or sometimes also the largest dose with DLT probability $\le p^\star$). Depending on the approach taken, the dose level $x^\star$ may be assumed to be among the set of experimentally considered dose levels ($x_1, \ldots, x_m$) or also in between or beyond the investigated doses.

Logistic regression model
We apply logistic regression models with a logit link for each study. The DLT counts are modelled via a binomial distribution as in (1). The logistic regression model provides a joint parametrization of the DLT probabilities $p_i$ as a parametric function of the dose covariable $x_i$:
$$\mathrm{logit}(p_i) = \beta_0 + \beta_1 x_i, \qquad (2)$$
where the logit link function and its inverse are defined as
$$\mathrm{logit}(p) = \log\!\left(\frac{p}{1-p}\right), \qquad \mathrm{logit}^{-1}(z) = \frac{\exp(z)}{1+\exp(z)}. \qquad (3)$$
Note that the logits of the probabilities correspond to the logarithmic odds, where the odds corresponding to a probability $p \in [0, 1]$ are given by $\frac{p}{1-p} \in [0, \infty]$. The linearity assumption constrains the DLT probabilities and reduces the dimensionality of the problem: only 2 unknown parameters, the intercept $\beta_0$ and the slope $\beta_1$, now determine the dose-response relationship. In addition, it facilitates interpolation or extrapolation beyond the discrete set of covariable ($x_i$) values. Introduction of a parametric model also means that monotonic transformations of the dose levels ($x_i$) may affect the model fit (see also Figure 1). Within the logistic model, the MTD directly results from the regression parameters ($\beta_0$ and $\beta_1$). Inverting equation (2) to derive the dose level $x^\star$ for which the pre-specified toxicity of $p^\star$ is attained (i.e., so that $\beta_0 + \beta_1 x^\star = \mathrm{logit}(p^\star)$) leads to
$$x^\star = \frac{\mathrm{logit}(p^\star) - \beta_0}{\beta_1} \qquad (4)$$
as the corresponding MTD. The logistic regression model, being a very common procedure, is a simple and pragmatic model choice here; a number of other models or parametrisations are also commonly used in dose-response modeling. 58,59 The logistic model also includes the so-called Emax model as a special case (see Appendix C for details).
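As a minimal sketch of the inversion described above, the following Python snippet computes the dose at which a logistic dose-toxicity curve reaches a target DLT probability. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not taken from any of the example studies.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def mtd(beta0, beta1, p_star=0.33):
    """Dose at which intercept + slope * dose equals logit(p_star)."""
    return (logit(p_star) - beta0) / beta1

# Hypothetical intercept and slope (illustration only).
beta0, beta1 = -3.0, 0.6
x_star = mtd(beta0, beta1)

# Plugging the MTD back into the model recovers the target probability.
p_at_mtd = inv_logit(beta0 + beta1 * x_star)
```

Since the MTD is a ratio with the slope in the denominator, a slope estimate close to zero makes the result very sensitive to estimation error.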

Parameter estimation in the logistic model
Simple maximum-likelihood (ML) estimation within the logistic regression model is sometimes problematic, in particular in case of smaller sample sizes. ML estimators are affected by certain biases, and separation issues may arise. 60 Separation refers to "pathological" data constellations where the predictor allows the data to be split perfectly into cases and non-cases. In the estimation stage, the likelihood then does not attain a definite maximum at finite parameter values, which in practice often leads to numerical problems, resulting in large estimates as well as large associated standard errors.
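The separation problem can be illustrated with a small sketch. The data below are hypothetical (no DLTs among 3 patients at the lower dose, 3 DLTs among 3 patients at the higher dose), and the function names are introduced here for illustration only. Steepening the curve around the midpoint between the two doses keeps increasing the binomial log-likelihood, so no finite ML estimate of the slope exists.

```python
import math

def log_sigmoid(z):
    """Numerically stable log of the inverse logit."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def binomial_loglik(beta0, beta1, doses, n, y):
    """Binomial log-likelihood of a logistic model (constant term omitted)."""
    ll = 0.0
    for x, ni, yi in zip(doses, n, y):
        eta = beta0 + beta1 * x
        # log(p) = log_sigmoid(eta), log(1 - p) = log_sigmoid(-eta)
        ll += yi * log_sigmoid(eta) + (ni - yi) * log_sigmoid(-eta)
    return ll

# Hypothetical, perfectly separated data: events only at the higher dose.
doses, n, y = [1.0, 2.0], [3, 3], [0, 3]

# Keep the 50% point at dose 1.5 while making the slope steeper.
slopes = [1.0, 5.0, 10.0, 20.0]
logliks = [binomial_loglik(-1.5 * b, b, doses, n, y) for b in slopes]
```

The log-likelihood values increase towards zero as the slope grows, which is the behaviour that penalized approaches are designed to avoid.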
The use of the Firth correction has been suggested in order to reduce bias and to ensure finite parameter estimates. 61,19 Heinze and colleagues 62,63 showed that the use of the Firth correction avoids separation issues and may improve properties also in small samples. In the present context, our focus is not only on contrast estimates (odds ratios) but also hinges on the accuracy of absolute (DLT) event probability estimates. Puhr et al. 20 introduced Firth's logistic regression with added covariate (FLAC) as a method to further reduce bias in event probability estimates; the FLAC method has been recommended due to improved performance as well as its invariance properties. 20
The logistic model falls within the class of generalized linear models (GLMs), and as such, implementations are readily available. Within R, one may utilize the "glm()" function for simple logistic regression, while the "MASS" library also provides the functionality to retrieve estimates and standard errors for the MTD. 64 Firth logistic regression and FLAC are implemented in the "logistf" R package. 21 Point estimates and their associated variance-covariance matrix may again be utilized to derive corresponding MTD estimates and standard errors.

General remarks
Note that the MTD $x^\star$ does not necessarily fall within the range of investigated doses ($x_1, \ldots, x_m$); the resulting estimate hence is usually an interpolation, or even an extrapolation. The actual task is to use the inverted equation (4) based on estimates of the coefficients $\beta_0$ and $\beta_1$ in order to derive an estimate of the MTD $x^\star$. A common approach is to assume a joint normal distribution for the ML estimates $\hat\beta_0$ and $\hat\beta_1$. The distribution of the estimated MTD then results as a ratio of (correlated) normal variates (see (4)). Such a ratio may often again be reasonably approximated by a normal distribution; in general, however, the distribution may also be bimodal and heavy-tailed. 65 In particular, its first two moments (mean and variance) in general do not exist.
In the following, we will investigate a simple error propagation approach based on the delta method. Note that the eventual objective of this exercise is to provide estimates and standard errors to be passed on to the meta-analysis procedure at a subsequent stage. Estimates and standard errors are then utilized to specify the (approximately) normal likelihood (see Section 3.5 below). The aim hence is to provide an accurate (or reasonably conservative) normal approximation to the uncertainty in the MTD. 66

The delta method
Consideration of standard errors (the variance-covariance matrix) of the regression coefficients $\hat\beta_0$ and $\hat\beta_1$ via the delta method yields a corresponding standard error for the MTD estimate $\hat{x}^\star$. 67 This procedure is described in detail, including R code, by Venables and Ripley. 64 It should be noted, however, that this only works well in case of reasonably small standard errors; in particular, the denominator in equation (4), the regression slope, needs to be somewhat bounded away from zero (i.e., it needs to have a small coefficient of variation). 65
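The delta-method calculation can be sketched as follows, assuming the MTD is obtained as (logit of the target probability minus intercept) divided by slope. The gradient of this expression with respect to intercept and slope is (-1/slope, -MTD/slope), and the approximate variance is the quadratic form of this gradient with the coefficients' covariance matrix. The function name and the numerical inputs are hypothetical, for illustration only.

```python
import math

def mtd_delta_se(beta0, beta1, cov, p_star=0.33):
    """MTD estimate and delta-method standard error.

    cov is the 2x2 variance-covariance matrix of (intercept, slope).
    """
    target = math.log(p_star / (1.0 - p_star))
    x_star = (target - beta0) / beta1
    # Partial derivatives of x_star w.r.t. intercept and slope.
    g0 = -1.0 / beta1
    g1 = -x_star / beta1
    var = (g0 * g0 * cov[0][0]
           + 2.0 * g0 * g1 * cov[0][1]
           + g1 * g1 * cov[1][1])
    return x_star, math.sqrt(var)

# Hypothetical coefficient estimates and covariance matrix.
cov = [[1.20, -0.25],
       [-0.25, 0.06]]
x_star, se = mtd_delta_se(-3.0, 0.6, cov)
```

The negative covariance between intercept and slope, typical for such fits, partly cancels the two variance contributions; with a slope near zero, the factor 1/slope inflates the standard error and the approximation becomes unreliable.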

Two-stage approach to meta-analysis

The normal-normal hierarchical model (NNHM)
With MTD estimates that are provided along with their standard errors, the meta-analysis problem falls into the generic class of problems that may be addressed using the normal-normal hierarchical model (NNHM). 68,69,70,71,72,73 For each study $j$ ($j = 1, \ldots, k$), there is an underlying (unknown) true MTD value $\theta_j$. Analysis of the experimental data yields an MTD estimate $\hat\theta_j$ that has a standard error $\sigma_j$ associated. Utilizing a simple normal model, we assume that
$$\hat\theta_j \mid \theta_j \sim \mathrm{N}(\theta_j, \sigma_j^2),$$
where the study-specific true values $\theta_j$ (the MTDs) are not necessarily identical across the different studies; these also have a certain amount of variability associated, which is implemented using another variance component via
$$\theta_j \mid \mu, \tau \sim \mathrm{N}(\mu, \tau^2).$$
The heterogeneity parameter $\tau$ denotes the between-study variability, and $\mu$ is the overall mean. The model may also be expressed in its marginal form as
$$\hat\theta_j \mid \mu, \tau \sim \mathrm{N}(\mu, \sigma_j^2 + \tau^2).$$
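To make the hierarchical model concrete, the following sketch evaluates the posterior of the overall mean and of the heterogeneity on a grid. It assumes uniform priors on both parameters, truncates the heterogeneity at an arbitrary upper grid limit, and uses hypothetical log-MTD estimates and standard errors; it is a simplified stand-in for dedicated software, not the implementation used in the analyses. For a fixed heterogeneity value, the overall mean's posterior is normal around the inverse-variance weighted average, with weights 1/(squared standard error + squared heterogeneity).

```python
import math

def conditional_mu(est, se, tau):
    """Posterior mean and sd of the overall mean for fixed tau."""
    w = [1.0 / (s * s + tau * tau) for s in se]
    mu_hat = sum(wi * yi for wi, yi in zip(w, est)) / sum(w)
    return mu_hat, 1.0 / math.sqrt(sum(w))

def log_post_tau(est, se, tau):
    """Log marginal posterior of tau, up to a constant (uniform priors)."""
    mu_hat, mu_sd = conditional_mu(est, se, tau)
    lp = math.log(mu_sd)
    for yi, s in zip(est, se):
        v = s * s + tau * tau
        lp += -0.5 * math.log(v) - 0.5 * (yi - mu_hat) ** 2 / v
    return lp

def posterior_means(est, se, tau_max=2.0, n_grid=2001):
    """Grid approximation of the posterior means of mu and tau."""
    taus = [tau_max * i / (n_grid - 1) for i in range(n_grid)]
    lps = [log_post_tau(est, se, t) for t in taus]
    top = max(lps)
    wts = [math.exp(lp - top) for lp in lps]
    total = sum(wts)
    mu_mean = sum(w * conditional_mu(est, se, t)[0]
                  for w, t in zip(wts, taus)) / total
    tau_mean = sum(w * t for w, t in zip(wts, taus)) / total
    return mu_mean, tau_mean

# Hypothetical log-MTD estimates and standard errors from five studies.
log_mtd = [6.2, 6.5, 6.0, 6.8, 6.4]
log_se = [0.30, 0.45, 0.25, 0.90, 0.35]
mu_mean, tau_mean = posterior_means(log_mtd, log_se)
```

Studies with large standard errors receive small weights and hence contribute little to the combined estimate, whatever the heterogeneity value.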

Estimation within the NNHM
In the Bayesian framework, the estimates $\hat\theta_j$ and their standard errors $\sigma_j$ are utilized to define the (approximately) normal data likelihood; inference needs to consider the unknowns $\mu$ and $\tau$, and primary interest is usually in the overall mean $\mu$. For the design and analysis of a future dose-finding study, prediction (of a "new" MTD $\theta_{k+1}$) or shrinkage estimation (of one of the MTDs $\theta_j$) may also be relevant. 26,27 Prior distributions need to be specified for the overall mean ($\mu$) and the heterogeneity ($\tau$), expressing any a-priori information that may be available on these parameters. In the following, we will utilize uniform effect and heterogeneity priors, due to the mostly large sample sizes (numbers of studies $k$) in the examples. In general, or in particular when faced with only a small number of studies, the use of (weakly) informative priors may also be appropriate. 73,74

SIMULATIONS

Aims
The aim is first to investigate the performance of the (1st-stage) regression methods, and then also of the (2nd-stage) meta-analysis. For the logistic regression models, we will check bias, CI coverage probabilities and CI widths; for the meta-analyses, we focus on CI coverages and widths (for overall mean, prediction, shrinkage estimates). To this end, we will utilize Monte Carlo experiments using simulated data from a range of relevant scenarios. Even when assuming that the logistic regression analyses yield relatively unbiased estimates of the regression coefficients, the eventual consideration of derived MTDs (a nonlinear transformation, see equation (4)) means that we again should expect some bias. Considered as a function of the slope $\beta_1$, the transformation in (4) is a convex function, so if we assume that the MTD uncertainty was dominated by uncertainty in the slope, then (following Jensen's inequality) we may expect a negative bias here. Considered as a function of both intercept and slope, however, the MTD is neither convex nor concave.
In addition, it also remains to be seen how well a normal approximation works in a setup where moments may actually not be finite. 65

Simulation scenarios
The simulation scenarios considered are intended to be realistic in terms of the data that are commonly encountered in dose-escalation studies (as, e.g., in the examples from Section 2; see also Appendix A). We will use a mixture of data generated by employing the common "3+3", 5,6 "CRM", 7,75 and "BLRM" 8,76,4 study designs (with equal probabilities). While the algorithmic 3+3 design used to be the most common in the past, alternative designs have recently gained popularity. 77,78 The different designs will then result in differing kinds of data sets, in terms of sample sizes, event rates, investigated dose ranges, etc. 79,80 Implementation details regarding the three designs are provided in Appendix D.
Toxicity profiles: we will use 5 dose-response curves that are shown in Figure 1 and Table 3. All of these are based on 6 dose levels ($x_i \in \{1, 2, \ldots, 6\}$), and they are labeled as "moderate", "steep", "gentle", "convex" or "concave". The first three are linear, while the latter two deviate from the logit model assumed in the analyses. Defining the targeted DLT probability as $p^\star = 0.33$, the MTD is attained within the dose range (between 1 and 6) in all 5 cases; the MTDs are also listed in Table 3.
Design criteria: the differing dose-escalation designs require the specification of certain free parameters. The 3+3 design is initiated at the lowest dose, a cohort size of 3 is used, there is no skipping of doses, and no dose escalation following a DLT event. CRM and BLRM designs utilize the "skeleton" (a prior guess of the response curve) shown in Figure 1 and Table 3. For CRM and BLRM, the maximum sample size was chosen randomly as a multiple of 3 between 15 and 30.
Heterogeneity: heterogeneity in MTDs will be implemented via variability in the intercept term of the dose-response curve. In Figure 1, one can see that shifting the curves in the vertical direction will also increase or decrease the associated MTD. For a logistic dose-response (as in equation (2)), adding a random offset with variance $(\beta_1 \tau)^2$ will imply a heterogeneity variance of $\tau^2$ in the associated MTD. For the non-linear scenarios, the relationship is not as simple; there, the slope at the MTD (as in Table 3) is used in place of the slope $\beta_1$ as an approximation. Heterogeneity levels of $\tau \in \{0.0, 0.5, 1.0\}$ will be considered.
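The relationship between intercept variability and MTD heterogeneity can be checked numerically. The sketch below assumes a logistic curve with hypothetical intercept and slope, adds a normal random offset with standard deviation (slope × heterogeneity) to the intercept, and verifies that the resulting MTDs scatter with a standard deviation close to the intended heterogeneity. All names and values are chosen for illustration.

```python
import math
import random
import statistics

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def mtd(beta0, beta1, p_star=0.33):
    """Dose at which the logistic curve reaches p_star."""
    return (logit(p_star) - beta0) / beta1

# Hypothetical curve and heterogeneity level.
beta0, beta1, tau = -3.0, 0.6, 0.5

rng = random.Random(2024)  # fixed seed for reproducibility
study_mtds = [mtd(beta0 + rng.gauss(0.0, beta1 * tau), beta1)
              for _ in range(20000)]
empirical_sd = statistics.pstdev(study_mtds)

# Deterministic check: raising the intercept by beta1 * tau
# lowers the MTD by exactly tau.
shift = mtd(beta0, beta1) - mtd(beta0 + beta1 * tau, beta1)
```

Because the MTD is linear in the intercept for a fixed slope, the offset is simply rescaled by 1/slope, which is why the two standard deviations differ by the factor given by the slope.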

Numbers of studies: at the meta-analysis stage, totals of $k \in \{5, 10, 20\}$ studies will be considered.
It should be noted that the five different dose-response curves may also be related to one another by considering a "common" curve with differing dose placement; see also the illustration in Figure 1. Panel (C) shows how the five scenarios may result from considering different sets of investigated doses. A denser or wider spacing affects the resulting slope, and constancy of distances between neighbouring doses affects overall linearity. Differences between scenarios do not only relate to the underlying dose-response mechanism, but also to choices made at the experimental design stage. For example, the third ("gentle") scenario may be viewed as a scenario in which only a very narrow dose range is investigated. This eventually (and maybe as expected) makes it harder to identify the exact MTD, while at the same time problems appear amplified, since uncertainty in the MTD in terms of a certain absolute dose range then quickly translates into uncertainty spanning several dose levels. This highlights the importance of the careful choice of the dose (or exposure) scale and of possible transformations (e.g., when considering log-transformed doses). It also emphasizes the difficulties arising in the comparison of results (in particular: standard errors and CI widths quoted in units of doses) across the different simulation scenarios. Seemingly "steep" and "gentle" shapes may simply arise from considering wider or narrower dose ranges.

Simulation scenario properties
We briefly investigated the kinds of data that are returned when implementing the above model assumptions. Table 4 characterizes the data in terms of the mean numbers of doses used, the numbers of patients recruited and the numbers of DLT events under the three designs and overall (on average across designs). Most notably, CRM and BLRM yield very similar data, while 3+3 differs slightly from the other two. The 3+3 design generally yields a smaller total sample size (numbers of patients and events). In the following, we will not distinguish between data generating models, but will consider a mixture of the three. The data characteristics are also roughly similar to those encountered in the Sorafenib and Irinotecan examples above (see also Tables 1, 2, A1 and A2).
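To give an impression of how such data arise, the following sketch simulates a single trial under textbook 3+3 rules: start at the lowest dose, treat cohorts of 3, escalate after 0/3 or at most 1/6 DLTs, expand to 6 patients after 1/3, and stop once 2 or more DLTs are seen at a dose or the highest dose has been completed. These rules and the true DLT probabilities are assumptions made for illustration; the design variant actually used in the simulations may differ in its details.

```python
import random

def simulate_3plus3(p_true, rng):
    """Simulate one 3+3 trial; return per-dose patient and DLT counts."""
    n = [0] * len(p_true)   # patients treated at each dose
    y = [0] * len(p_true)   # DLTs observed at each dose
    last = len(p_true) - 1
    d = 0                   # start at the lowest dose
    while True:
        dlts = sum(rng.random() < p_true[d] for _ in range(3))
        n[d] += 3
        y[d] += dlts
        if n[d] == 3 and y[d] == 1:
            continue        # 1/3 DLTs: expand the cohort at the same dose
        if y[d] >= 2 or d == last:
            break           # too toxic, or highest dose completed
        d += 1              # 0/3 or 1/6 DLTs: escalate without skipping
    return n, y

# Hypothetical true DLT probabilities at six dose levels.
p_true = [0.05, 0.10, 0.20, 0.33, 0.50, 0.65]
n, y = simulate_3plus3(p_true, random.Random(42))
```

The returned counts per dose have the same structure as the study data in the examples (doses, patients, DLTs) and could be passed on to a logistic regression fit.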

FIGURE 1
The five simulation scenarios (dose-response curves) on the probability (A) and logit scales (B). The dashed line indicates the "skeleton" (a prior guess of the curve, which needs to be specified for CRM and BLRM methods). The corresponding MTDs, the doses at which a probability of 0.33 is attained (where logit(0.33) = −0.71), are marked by a dot. See also Table 3 for the actual numbers. Panel (C) illustrates that the five scenarios may also be motivated via a common underlying dose-response curve, but differing sets of dose levels.
[...] methods perform comparably, while FLAC appears to be overall least biased. Overall there appears to be a tendency to underestimate the MTD, i.e., the errors are on the conservative side. In particular the third scenario, characterized by a very gentle slope, leads to relatively large offsets (in terms of dose steps) and a negative bias. This may in fact be expected, as this is the scenario in which the slope estimate should tend to be particularly uncertain. Panel (B) again illustrates the offset in the MTD estimate, this time in terms of the DLT probability at the estimated MTD. On this scale, differences between scenarios do not appear quite as dramatic. The probabilities are centered at their aimed for value of $p^\star = 0.33$ (marked by a red line), and depending on the scenario, they exhibit more or less variability. The FLAC estimate again appears to be closest to the target on average.
TABLE 4 (fragment) Mean numbers of doses / patients / DLT events: [...] .0 / 20.7 / 3.9, 4.7 / 20.9 / 3.3, 4.9 / 20.7 / 3.9, 3.9 / 19.6 / 4.5
Panel (C) shows the uncertainties (estimated standard errors) associated with the DLT estimates. The Firth and FLAC estimates behave similarly, while the simple GLM estimate yields more extreme (both very small or very large) errors. Note that the standard error is in units of doses here, i.e., a standard error > 1 means that the associated CI will span across several doses.
Panel (D) eventually shows the 95% CIs' coverage probabilities for the different combinations; in general we see some undercoverage, which is not actually satisfactory, while in most scenarios (in particular in those where the linearity assumption is met), the FLAC method again tends to perform best.
Overall, the regularized (Firth or FLAC) estimates clearly outperform the "plain" logistic regression for MTD estimation based on relatively small sample sizes. Also, while "plain" regression would fail in about 4% of cases, the other two always yield finite parameter estimates. Firth and FLAC estimates seem to behave more similarly, with a slight advantage apparent for the FLAC method. In the following, we will hence focus on the FLAC method for deriving MTD estimates (and associated standard errors) in the first stage of the analysis.

Investigating MA performance
After checking the performance of MTD estimators for the first stage of the analysis, we will investigate the performance of meta-analyses based on estimated MTDs. In the meta-analysis setup, we will vary the five dose-response scenarios, as well as the amount of heterogeneity and the number of studies considered. As in the previous section, we will be considering the aspects of estimation error (in terms of doses and in terms of DLT probability), as well as CI width and coverage probability.
Figure 3 illustrates the errors in MTD estimation, both on the dose scale as well as the DLT probability scale. It becomes evident that negative biases already apparent in the previous simulations propagate through to the meta-analysis results. Especially in the "gentle" scenario, the MTD gets underestimated, and an increasing number of studies ($k$) reinforces the biased estimate. In this scenario, the MTD estimate is about one dose off, corresponding to a DLT probability of roughly 20-25% instead of the aimed for $p^\star$ of 33% (see also Table 3). For all the other scenarios, however, and in particular those where the linearity assumption is met, the estimates appear to behave reasonably, irrespective of the amount of heterogeneity ($\tau$) or the number of studies ($k$).
Figure 4 illustrates the widths and coverage probabilities of credible intervals for the MTD. The length of CIs behaves as expected; an increasing number ($k$) of studies considered leads to shorter intervals, while increasing the amount of heterogeneity ($\tau$) makes intervals slightly wider. Coverage probabilities of 95% intervals however are substantially off in some cases, which makes sense given the previously considered simulation results; when estimation bias is substantial and a large number of studies is included, so that bias is of the order of the CI width, then the coverage deteriorates. In the majority of scenarios considered, however, coverage probabilities are at reasonable levels.
As outlined in Section 3.5.2, instead of the overall mean $\mu$, in some applications it may also make sense to consider estimation of one of the study-specific means $\theta_j$ (shrinkage estimation), or of a "new" instance $\theta_{k+1}$ (prediction). For analogous results for prediction and shrinkage intervals, see Appendix E. In brief, one can see that the biases observed previously appear to propagate through; the coverage probabilities, however, are substantially closer to their nominal levels.

Sorafenib example

MTD estimation
Now consider MTD estimation in the context of the Sorafenib example (see Table 1), again aiming for a DLT probability of $p^\star = 0.33$. Considering the empirical DLT frequencies encountered, note that a number of studies never actually reach an empirical DLT frequency of 33%, which also makes MTD estimation a bit of an extrapolation exercise here. Also, separation occurs e.g. in the study by Furuse et al., 33 where only two doses are investigated and events were only observed in one of the two groups. We will consider doses on their logarithmic scale for the analysis. Table A1 (in Appendix A) also shows the resulting MTD estimates and their standard errors. The joint analysis of the 13 estimates is illustrated in Fig. 5. Note that the data from some of the studies barely constrain the MTD, which is reflected in correspondingly large standard errors associated with the estimates. The eventual analysis is hence supported by certain studies, while others may not contribute substantially. The contribution of studies to the resulting estimate may be expressed in terms of (percentage) weights, which are also shown in the Figure. 81 The eventual estimate of the overall mean MTD is at 608.1 mg [470.5, 795.6]. In terms of dose selection (among the investigated doses; see Table 1), this range only includes the dose of 600 mg. This is in agreement with the result previously found by Ursino et al., 17 who worked using discrete dose levels. Therefore, the EMA-recommended dose of 400 mg (twice a day) is still considered a safer choice taking into account these new findings. 28
When the meta-analysis is performed at the design stage of a new study, it may be more relevant to look at a prediction interval; this may be useful to help design the study, or to formally include the "historical" information in the eventual analysis. 25,26,27 The prediction interval is wider, as it includes the estimated between-study heterogeneity, and here it ranges from 363.3 to 1044.8 mg.
For any of the 13 included studies, we may also investigate the study-specific MTDs in the light of the combined data via shrinkage estimation; for example the shrinkage estimate for the most recent study by Chen et al., 39 [...]
The heterogeneity ($\tau$) is estimated at 0.13 [0.00, 0.45], which seems to be at a reasonable level for a log-transformed MTD. 82,74
Given the weights that are shown in Figure 5, an obvious question is to what extent the studies with "small" weights are actually relevant to the analysis. The weights are closely related to the standard errors ($\sigma_j$) associated with each estimate, 81 and it seems sensible that those studies with very large uncertainties associated may only help very little in constraining the MTD. If we simply omit those studies with $\sigma_j > 1$ (on the logarithmic scale this means that lower and upper 95% interval bound are more than a factor of 50 apart), we are left with 6 studies, and the overall analysis results remain almost the same; see Figure B1 in the Appendix. We may also check back with the original data (see Tables 1 and A1) and check why these particular studies only seem to provide little evidence; these are all studies either involving only few DLT events, or studies that (empirically) exhibited a gently-sloping response, so that MTD estimates are large and at the same time associated with correspondingly large uncertainties.
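The "factor of 50" quoted above follows from a short calculation, written here with $\hat\theta$ for a log-MTD estimate and $\sigma$ for its standard error (notation assumed for this illustration). A 95% interval on the logarithmic scale is $\hat\theta \pm 1.96\,\sigma$, so that on the original dose scale the ratio of upper to lower bound is

```latex
\frac{\exp(\hat\theta + 1.96\,\sigma)}{\exp(\hat\theta - 1.96\,\sigma)}
  = \exp(2 \times 1.96\,\sigma)
  = \exp(3.92\,\sigma),
\qquad
\exp(3.92 \times 1) \approx 50.4 .
```

A standard error of 1 on the log scale hence corresponds to interval bounds that differ by a factor of about 50, irrespective of the estimate itself.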

Meta-analytic approach to bridging
Suppose one was interested in quantifying the MTD in Japanese patients. Strictly speaking, only two studies are available that investigated this figure 33,34 (see also Table 2). A meta-analysis of this pair of studies is readily performed and only requires the additional specification of a proper heterogeneity prior. Unlike in the case of many studies (large $k$), where an improper uniform heterogeneity prior may be considered appropriate, the case of only $k = 2$ studies requires a proper prior. 83,73,74 For example, one may argue that a half-normal prior with scale 0.2 may be appropriate here, implying that while one anticipates some differences in study-specific MTDs, these are expected to vary around their common mean by a factor ranging mostly between 2/3 and 3/2. 74 However, due to the low precision that the two Japanese studies provide, the meta-analysis also remains rather inconclusive, with an estimated MTD of 1200 mg and a CI ranging from 56 mg up to 26 000 mg. With this, the estimate covers all of the potential doses (from 100 to 1000 mg) that had been considered in any of the experiments.
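One way to read the quoted range of factors, under the assumptions (made here only for illustration) that "mostly" refers to a central 95% range and that the heterogeneity $\tau$ is set equal to the prior scale of 0.2: on the logarithmic scale, study-specific MTDs then deviate from their common mean by at most $1.96 \times 0.2 = 0.392$ with 95% probability, which translates to multiplicative factors of

```latex
\exp(-1.96 \times 0.2) = \exp(-0.392) \approx 0.68 \approx \tfrac{2}{3},
\qquad
\exp(+1.96 \times 0.2) = \exp(0.392) \approx 1.48 \approx \tfrac{3}{2}.
```

Since the half-normal prior also places probability on smaller and larger values of $\tau$, this is a rough guide to the prior's implications rather than an exact statement.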
While evidence from the remaining European / North American studies may not be directly transferable to the Japanese context, it seems obvious that these are still of some relevance here. 41,42 A way to formally consider the external data for estimating the Japanese MTD in a dynamic fashion works via shrinkage estimation. 43 Technically, this may be implemented via several meta-analyses. Two summary estimates resulting from analyzing the groups of Japanese and Western studies separately are then combined in a third meta-analysis. At this second stage, the focus is not on an "overall mean" parameter, but rather on an updated shrinkage estimate (here: the Japanese MTD). 43 The assumption being implemented here is that there is heterogeneity within the Japanese and non-Japanese studies, as well as between the two groups of studies, resulting in an additional hierarchical model stage, and a total of three variance parameters. 84 The second-stage meta-analysis again requires a proper heterogeneity prior; as in the above example, we will use a half-normal prior with scale 0.2. The results are shown in Table 5. Excluding the two Japanese studies from the analysis yields an MTD estimate for the "Western" studies similar to the overall result shown in Table 5. The two Japanese studies alone, on the other hand, yield only a very vague estimate; the standard error (on the logarithmic scale) is roughly 10 times as large as the "Western" standard error. Performing the meta-analysis at the second stage then yields a shrinkage estimate closer to the Western estimate and with a substantially reduced standard error.
The new shrinkage estimate of 617 mg and the associated interval ranging from 337 up to 1179 mg do not allow us to pinpoint a single dose, yet they allow us to narrow down the range of likely and plausible doses considerably, based on explicit and transparent model assumptions regarding the relationships between Japanese and external data. Again, we may also determine the weights with which the Western and Japanese studies contribute to the resulting shrinkage estimate; 81 the Japanese contribution here amounts to only 3.6%, once more highlighting the potential gain from considering the external data.

Irinotecan / S-1 example
Although there is one study fewer in the second example, as well as fewer patients and fewer events involved, all studies saw an empirical DLT rate of at least 11%, and there are eventually more studies that are able to contribute substantial information on the MTD (see also Table 2). The resulting MTD estimates are shown in Table A2 in Appendix A; only two studies yield (log-) MTD estimates with standard errors > 1. One example of an ambiguous data setup is given by Yoshioka et al. 53; the data do not even clearly suggest an increasing or decreasing toxicity with increasing dose (see Table 2), and consequently the resulting MTD estimate ends up with a huge standard error. Note also the noticeable differences between MTD estimates based on simple logistic regression compared to the regularized ones.
Figure 6 shows the results of a meta-analysis of all 12 MTD estimates. The eventual MTD estimate is at 80.3 [67.4, 97.3], which, in terms of dose selection, would include the doses of 70, 80 or 90 mg/m². To our knowledge, the combination of irinotecan and S-1 has not yet received market authorization, but these figures are in agreement with the range determined by Ursino et al. 17 for the MTD. The prediction interval, denoting the expected MTD in a new study, ranges from 47.6 up to 138.1 mg/m².

TABLE 5
Meta-analysis aiming for an estimate of the MTD for a Japanese population. Assuming heterogeneity among Western and Japanese studies, as well as between the two groups of studies, two separate analyses are first performed, one combining the Japanese studies 33,34 and one combining the remaining Western studies. Subsequently the two resulting estimates are analyzed jointly, aiming for a shrinkage estimate of the Japanese MTD.

FIGURE 6
A forest plot illustrating the meta-analysis for the Irinotecan / S-1 example data. MTD estimates are given on the logarithmic as well as on the dose scale.

We may again also consider shrinkage estimation; for example, the most recent study (Goya et al. 57) already contributed a fairly precise estimate of 85.9 [78.1, 94.6] on its own, so that considering the remaining studies in addition does not add much information, and the shrinkage estimate is very similar at 85.6 [77.9, 94.0]. Another interesting example is the one by Yoshioka et al., 53 where one DLT was observed in 12 patients allocated to three doses (100, 125 and 150 mg/m²; see Table 2). The authors eventually determined the highest dose as the recommended one, which seems appropriate considering only the data of the trial. We note, however, that the shrinkage interval for this study, [47.6, 138.1], excludes the dose of 150 (see also Figure 6).

DISCUSSION
The aim of this manuscript is to propose a simple meta-analysis approach for the estimation of the maximum tolerated dose (MTD) from multiple phase I dose-finding studies. The proposed two-stage approach is easier to implement than a one-stage approach that would require additional assumptions on the model, such as the control group response or the implementation of pooling and stratification schemes. 85
Based on our extensive simulation study, this simple approach gave on average good operating characteristics, in terms of MTD standard error (in the first stage) and MTD correct dose selection associated with DLT probability estimated at MTD (at the second stage). Only in the "gentle" scenario did the proposed approach lead to an underestimation of the selected MTD. Unless the therapeutic window of the tested drug is very narrow (that is, if the minimum effective dose, MED, is close to the MTD), an underestimation of the MTD could still be considered acceptable, since a non-toxic, yet efficacious dose is selected. Nevertheless, the "gentle" scenario reflects a case where an inappropriate dose range has been used in the original study. Indeed, if the toxicity probabilities associated with the investigated doses are similar, the first-stage regression (based on small samples) is likely to yield very imprecise MTD estimates, entailing problems with the delta method being applied to a non-linear response curve. The second-stage meta-analysis then may also not improve upon this. The selection of an appropriate set of doses in phase I dose-finding trials is hence important in order to be able to derive useful MTD estimates.
We also investigated the case of bridging studies where information from one population is used to extrapolate knowledge about another population. In the Sorafenib illustration, we considered dynamic shrinkage estimation when using Western data for the estimation of the Japanese MTD. Ollier et al. 86 have proposed a complementary approach to evaluate the similarity between two population distributions of a parameter of interest (MTD, dose-response model parameter(s), etc.) and thus help in deciding whether or not to use and to tailor the available information.
A limitation of this two-stage approach is related to the bias and coverage issues that seem to originate from the first-stage logistic regressions. A one-stage approach might avoid this, but would come at the cost of additional modeling assumptions and a more complicated analysis, as recalled at the beginning of this section. However, the heterogeneity parameter might 'absorb' some of the erratic small-sample-size behaviour at the first stage, which might explain why prediction and shrinkage estimates are not as badly affected.
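The sensitivity of a delta-method standard error to a shallow dose-toxicity slope can be illustrated numerically. The following is a minimal sketch, assuming a logistic model in the logarithmic dose, logit(p) = b0 + b1 * log(dose), so that the log-MTD for a target probability is (logit(target) - b0) / b1. The function name, the coefficient values and the covariance matrix are hypothetical and are not taken from the examples or the simulation study.

```python
import math

def log_mtd_delta(beta0, beta1, cov, target=0.33):
    """Log-MTD and its delta-method standard error for a logistic model in
    log-dose; cov is the 2x2 covariance matrix of (beta0, beta1)."""
    logit_target = math.log(target / (1.0 - target))
    theta = (logit_target - beta0) / beta1      # log-MTD
    g0 = -1.0 / beta1                           # d(theta) / d(beta0)
    g1 = -theta / beta1                         # d(theta) / d(beta1)
    var = (g0 * g0 * cov[0][0]
           + 2.0 * g0 * g1 * cov[0][1]
           + g1 * g1 * cov[1][1])
    return theta, math.sqrt(var)

# Hypothetical coefficient covariance, held fixed to isolate the effect of the slope
cov = [[4.0, -0.6], [-0.6, 0.1]]
steep = log_mtd_delta(beta0=-12.0, beta1=2.0, cov=cov)
gentle = log_mtd_delta(beta0=-3.0, beta1=0.4, cov=cov)
print("steep slope : log-MTD %.2f, SE %.2f" % steep)
print("gentle slope: log-MTD %.2f, SE %.2f" % gentle)
```

Both gradient components scale with 1 / b1, so a slope close to zero inflates the standard error. The linearization itself also becomes less trustworthy in that regime, which is consistent with the behaviour noted for the "gentle" scenario.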
Bayesian approaches involve the specification of prior distributions, which is relevant for the meta-analysis at the second stage of our proposed approach. Here we relied on "uninformative" (uniform) priors for the overall mean and heterogeneity parameters as a default. In particular in cases involving only few studies, the use of (weakly) informative priors is recommended. 73,74 In practice, some prior information may come from earlier phases or from PK/PD considerations or preclinical data, 87 and it may be worth trying to include these in terms of informative priors. 73,74
Another point is associated with the way dose-finding designs are constructed. In the interest of participants, early-phase dose-finding studies are designed with the aim to minimize the number of DLTs occurring, in particular when volunteers are involved. This is why these methods are often sequential, avoiding allocating patients to toxic doses and targeting the MTD. However, such schemes might result in large uncertainty in the eventual MTD estimate; from the estimation perspective, it may be more desirable to also investigate wider ranges of (not necessarily more toxic) doses in order to better probe the actual dose-response curve. 88
Finally, our proposed simple meta-analysis method may be transferable to related estimation problems, e.g., for median lethal doses (LD50), estimation of the MED coming from phase II dose-ranging studies, etc. User-friendly R code is provided in the supplement to help clinical trial statisticians and stakeholders to implement the proposed method.

DATA AVAILABILITY
The data that support the findings of this study are available in the supplementary material of this article.

APPENDIX A EXAMPLE DATA
Tables A1 and A2 below characterize the data sets introduced in Section 2 in terms of the numbers of doses, patients and events, and show the slightly differing MTD estimates (on the logarithmic scale) from "plain" logistic regression, regression using Firth correction, and Firth's logistic regression with added covariate (FLAC) approaches.

FIGURE B1
A sensitivity analysis for the Sorafenib example data. MTD estimates are based on "FLAC" regressions. Omitting studies with "large" standard errors (> 1), which do not contribute much information on the overall mean estimate, yields very similar results (see also Figure 5).

C CORRESPONDENCE OF LOGISTIC AND EMAX MODELS
In the case of a logarithmic dose as covariable ($x = \log(d)$, and $\beta_1 > 0$), the logistic model from Sec. 3 is equivalent to the Emax model. The Emax model is defined as
\[
  p(d) \;=\; E_{\max}\,\frac{d^{h}}{ED_{50}^{\,h} + d^{h}},
\]
where $E_{\max} > 0$ denotes the maximal possible response, and $ED_{50} > 0$ denotes the dose level yielding half the maximum response. 89 In the case of $E_{\max} = 1$, the logistic and Emax models are equivalent; the Emax model's exponent $h$ corresponds to the slope parameter $\beta_1$, and the $ED_{50}$ parameter is related to the intercept $\beta_0$ as
\[
  ED_{50} \;=\; \exp\!\left(-\frac{\beta_0}{\beta_1}\right).
\]
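The stated correspondence can be checked numerically. The sketch below assumes a logistic model with intercept and slope acting on the logarithmic dose and an Emax model with maximal response 1; the function names and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def logistic_prob(dose, beta0, beta1):
    """Logistic model with the logarithmic dose as covariable."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * math.log(dose))))

def emax_prob(dose, ed50, exponent, emax=1.0):
    """Emax model with maximal response emax and half-response dose ed50."""
    return emax * dose ** exponent / (ed50 ** exponent + dose ** exponent)

# Hypothetical intercept and slope (slope must be positive)
beta0, beta1 = -12.0, 2.0
# Half-response dose implied by the logistic coefficients
ed50 = math.exp(-beta0 / beta1)

for dose in (100.0, 200.0, 400.0, 800.0):
    p_logistic = logistic_prob(dose, beta0, beta1)
    p_emax = emax_prob(dose, ed50, exponent=beta1)
    print(f"dose {dose:6.0f}: logistic {p_logistic:.6f}, Emax {p_emax:.6f}")
```

The two columns agree up to floating-point error because the logistic response probability can be rewritten as the dose raised to the slope, divided by the sum of exp(-intercept) and the dose raised to the slope.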

D DATA GENERATION DETAILS
The simulation of data from common dose-finding designs (see Section 4) was facilitated by utilizing existing R libraries. The "3+3" data were generated using the UBCRM package's "sim3p3()" function. 90 The 3+3 simulation only requires specification of the (true) toxicity profile; the dose range is probed starting from the lower end, patients are recruited in cohorts of three, and the decision on increasing, decreasing or proceeding with the current dose is made using fixed rules based on the number of DLTs observed in the previous cohort. 5,6
The "CRM" data generation was based on the dfcrm package's "crmsim()" function, 91 using a cohort size of 3, the first dose as the starting dose, the skeleton as shown in Figure 1 and Table 3, and a targeted DLT probability of 0.33. The default settings include a "no-skipping rule". The sample size was drawn uniformly (among multiples of 3) between 15 and 30. Patients are again recruited in small cohorts, and the decision on the dose to be utilized for the next cohort is based on a parametric (logistic regression) model fitted to the past data. 7,75
The "BLRM" data utilized the bcrm package's "bcrm()" function 92 using settings as above, and in addition using an EWOC criterion of 0.25 and vague independent lognormal priors for the logistic regression parameters (intercept, slope) with means logit(0.01) and 0.0 and variances 4 and 1, respectively. The BLRM works similarly to the CRM, but is based on a Bayesian model and includes consideration of the (posterior) probability (here: 25%) of exposing patients to overly toxic doses. 8,76,4
Table D3 illustrates how the different designs employed in the simulations probe the dose-response curve in terms of the mean numbers of patients assigned to each of the six doses (see Section 4 for more details on the scenarios). Simulations here are based on "fixed" dose-response curves (no heterogeneity).
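To make the rule-based data generation concrete, the 3+3 scheme described above can be sketched as follows. This is a simplified, escalation-only textbook variant written for illustration; it is not the UBCRM implementation, whose exact rules (including any de-escalation) may differ. The function names and the toxicity profile are hypothetical. The helper for the sample size mirrors the uniform draw among multiples of 3 between 15 and 30 used for the model-based designs.

```python
import random

def simulate_3plus3(tox_probs, rng=None):
    """Simulate one trial under a simplified, escalation-only 3+3 design.

    tox_probs: true DLT probabilities at the ascending dose levels.
    Returns (mtd_index, patients, dlts): the index of the declared MTD
    (-1 if even the lowest dose is too toxic) and per-dose counts.
    """
    if rng is None:
        rng = random.Random()
    n_doses = len(tox_probs)
    patients = [0] * n_doses
    dlts = [0] * n_doses
    dose = 0                                   # start at the lowest dose
    while True:
        # treat a cohort of three patients at the current dose
        dlts[dose] += sum(rng.random() < tox_probs[dose] for _ in range(3))
        patients[dose] += 3
        if patients[dose] == 3 and dlts[dose] == 1:
            continue                           # 1/3 DLTs: add three more at this dose
        if dlts[dose] <= 1:                    # 0/3 or at most 1/6 DLTs: tolerable
            if dose == n_doses - 1:
                return dose, patients, dlts    # highest dose reached
            dose += 1                          # escalate
        else:                                  # two or more DLTs: stop
            return dose - 1, patients, dlts    # previous dose is declared the MTD

def draw_sample_size(rng):
    """Sample size drawn uniformly among the multiples of 3 from 15 to 30."""
    return 3 * rng.randint(5, 10)

# Hypothetical toxicity profile for six dose levels
rng = random.Random(2024)
profile = [0.05, 0.10, 0.20, 0.33, 0.50, 0.65]
mtd, patients, dlts = simulate_3plus3(profile, rng)
print("declared MTD index:", mtd)
print("patients per dose :", patients)
print("DLTs per dose     :", dlts)
print("example CRM/BLRM sample size:", draw_sample_size(rng))
```

Because escalation stops as soon as two DLTs are seen at a dose, doses above the MTD are rarely explored. This is one reason why such data can constrain the dose-toxicity curve only weakly.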

E PREDICTION AND SHRINKAGE RESULTS
The four figures below illustrate the meta-analysis performance for prediction and shrinkage estimation. Figures E2 and E3 show the absolute offset in terms of dose levels as well as DLT probability in analogy to Figure 3. Figures E4 and E5 illustrate the credible interval lengths and their coverage probabilities analogously to Figure 4.

Figure 2
Figure 2 illustrates the estimation performance of the three investigated methods ("plain" logistic regression within the GLM framework, Firth correction, and FLAC). Panel (A) shows the error in the estimated MTD; within each scenario, all three meth-

FIGURE 2
Performance of the first-stage logistic regression methods. The top left panel (A) shows the errors in MTD estimates in terms of the offset in doses. The top right panel (B) illustrates the error in terms of the DLT probability at the estimated MTD; the horizontal reference line indicates the target of 33%. Panel (C) shows the standard errors associated with the differing estimates, and panel (D) shows the coverage probabilities of derived 95% confidence intervals. Boxplots indicate the three quartiles and the central 90% range.

FIGURE 5
A forest plot illustrating the meta-analysis results for the Sorafenib example data set. MTD estimates are based on penalized logistic regression (FLAC). The estimates are shown in terms of logarithmic dose as well as on their original scale. The "weight" column indicates each study's contribution to the overall estimate.

TABLE 2
The results of 12 Japanese studies on combination therapy of Irinotecan and S-1. For each dose considered in each trial, the numbers of patients experiencing DLT events and the total numbers of exposed patients are given.

TABLE 3
The DLT probabilities in the five example scenarios, and the associated true MTDs. See also Figure 1.

TABLE 4
Mean numbers of doses / patients / events resulting from the different data scenarios and designs. Note that for CRM and BLRM, the maximum sample size is also random here (between 15 and 30).
which contributed only little information (a rather vague estimate of 3149.5 [0.0, 2146367310.5]), is at 607.0 [364.6, 1046.8]. As expected, this is very close to the prediction, but also substantially shorter than the original interval.

TABLE A1
Characteristics of the Sorafenib example introduced in Section 2 (see also Table 1); the numbers of doses, patients and DLT events are given for each study, as well as the MTD estimates (on the logarithmic scale) from plain logistic regression, regression using Firth correction and the FLAC approach.

TABLE A2
Characteristics of the Irinotecan / S-1 example introduced in Section 2 (see also Table 2); the numbers of doses, patients and DLT events are given for each study, as well as the MTD estimates (on the logarithmic scale) from plain logistic regression, regression using Firth correction and the FLAC approach.

Figure B1 illustrates a sensitivity analysis based on the Sorafenib data (see also Figure 5). A number of studies (those with "large" associated standard errors) contribute little to the analysis and hence have low associated weights. Omitting studies with low weights yields very similar, consistent results.

TABLE D3
Mean numbers of patients assigned to each dose in the different simulation scenarios, and based on the different experimental designs.