Confirmatory adaptive group sequential designs for single‐arm phase II studies with multiple time‐to‐event endpoints

Existing methods concerning the assessment of long‐term survival outcomes in one‐armed trials are commonly restricted to one primary endpoint. Corresponding adaptive designs suffer from limitations regarding the use of information from other endpoints in interim design changes. Here we provide adaptive group sequential one‐sample tests for testing hypotheses on the multivariate survival distribution derived from multi‐state models, while making provision for data‐dependent design modifications based on all involved time‐to‐event endpoints. We explicitly elaborate application of the methodology to one‐sample tests for the joint distribution of (i) progression‐free survival (PFS) and overall survival (OS) in the context of an illness‐death model, and (ii) time to toxicity and time to progression while accounting for death as a competing event. Large sample distributions are derived using a counting process approach. Small sample properties are studied by simulation. An already established multi‐state model for non‐small cell lung cancer is used to illustrate the adaptive procedure.

a single time-to-event endpoint constitutes a major restriction of those approaches. Especially in oncology, the multiple aspects of disease and treatment require a more comprehensive description, which often cannot be reflected adequately by a single time-to-event endpoint. In clinical practice, researchers therefore often tend to capture the different aspects of disease and treatment in a single composite endpoint, for example, time to progression, severe toxicity or death, whatever occurs first. The drawback of the latter approach is a rather coarse description of the course of disease. In oncology, it is instead of interest to describe progression, toxicity and death simultaneously as complementary and intertwined characteristics of disease and treatment. Of course, the relative importance of these events may depend on the characteristics of the disease and the stage of development of a new therapy. However, as soon as it is clarified which relationship shall be investigated in the respective situation, multi-state models are an appropriate tool to depict these aspects properly. While there are well-elaborated multi-state models for the joint modelling of efficacy and toxicity or the joint modelling of progression and death as time-to-event endpoints, there is limited methodology to assess the performance of an experimental treatment as compared to a reference w.r.t. all involved time-to-event endpoints. One-sample testing procedures for single transition probabilities have been suggested in Andersen et al. (1993). Existing methodology to evaluate whole models, as in Tattar and Vaman (2008), is designed to find differences in the matrix of transition probabilities rather than the underlying multivariate survival distribution. Additionally, the aforementioned approaches assume a single analysis at the end of the trial.
To fill this gap, we instead provide methodology for group-sequential one-sample tests for multivariate survival distributions derived from a multi-state model, while making provision for design modifications based on interim data from all involved survival endpoints simultaneously.
Our research is essentially motivated by applications in pediatric oncology. As outlined in Le-Rademacher et al. (2018), multi-state models can be especially useful in clinical oncology trials where patients can experience multiple paths between study enrolment and final outcome. On the one hand, we take up the controversy on the choice between progression-free survival (PFS)/event-free survival (EFS) and overall survival (OS) as primary endpoint in phase II oncology trials. For this purpose, we provide group-sequential one-sample tests for the joint distribution of PFS and OS in the context of an illness-death model. We illustrate this using an example from Fleischer et al. (2009), which was derived from clinical data. On the other hand, the simultaneous assessment of long-term efficacy (PFS) and long-term safety (time to life-threatening toxicity or death) appears likewise crucial in pediatric oncology, because late effects of treatment may persist lifelong. For this purpose we also provide a group-sequential one-sample test for the joint distribution of time to toxicity and time to progression while accounting for death as a competing event, as motivated by our second clinical example, the LOGGIC Europe trial (EudraCT number 2018-000636-10). LOGGIC Europe is an international multicentre trial for children and adolescents with low-grade glioma. During the planning phase, the trial committee asked for a study design with the two primary time-to-event endpoints neurotoxicity-free survival and progression-free survival. Two interim analyses with the option of data-dependent sample size recalculation based on interim data from both endpoints were also requested.
Prompted by this request, we additionally address and solve a common problem in the theory of adaptive designs for survival endpoints. The fundamental statistical challenge in adaptive survival trials is the handling of data from patients who enter the trial before an interim analysis and remain event-free beyond it. This makes it difficult to define stage-wise independent p-values, as required by common adaptive design methodology. Two approaches have so far been proposed in the literature to obtain stage-wise p-values: (i) the independent increments approach, and (ii) the strategy of patient-wise separation. Presently available methodology for the first approach (Schäfer and Müller, 2001; Schmidt et al., 2017; Wassmer, 2006) allows design modifications to be based only on a single time-to-event endpoint, because otherwise type I error rate inflation can occur (Bauer & Posch, 2004). The strategy of patient-wise separation allows design modifications to be based on all available interim data (see Jenkins et al., 2011, or Irle & Schäfer, 2012). However, it has the drawback that either some available primary outcome data must be discarded or some worst-case adjustment has to be performed, resulting in a conservative significance test (Magirr et al., 2016). Here, we aim for a multivariate extension of the independent increments approach. We provide survival tests in which provision is made for data-dependent design modifications based on all involved time-to-event endpoints, while avoiding the problems arising from the strategy of patient-wise separation. Essentially, this is achieved by compensating the multivariate counting process w.r.t. the filtration generated by the multivariate process itself, rather than the filtration generated by a single time-to-event endpoint, as is commonly done. We therefore call our strategy the multivariate independent increments approach.
The paper is organized as follows. To motivate our proposed testing procedure, we start with a presentation of its application in the prominent example of PFS and OS in Section 2. In the following sections we give the required theoretical foundations. In Section 3 we settle the basic notation which we will use in Section 4 to construct suitable counting process martingales and derive their asymptotics. These in turn will be used to derive statistical testing procedures. Foundations of confirmatory sequential and adaptive designs are provided in Section 5. This theoretical part is followed by a brief treatment of several issues which arise in practical implementations of the procedure in Section 6. In particular, we also elaborate tests for the other setting mentioned above. Small sample properties are studied by simulations in Section 7 and we conduct a case study in Section 8 in order to give an example of an application of the adaptive procedure. We conclude with a discussion of our findings and prospects for future research.

[Figure 1: Representation of the PFS/OS scenario as a multi-state model]
Proofs of the mathematical statements and additional supporting information are shifted to the Appendices.

MAIN APPLICATION EXAMPLE: PFS AND OS
We illustrate our procedure using the example of a trial with the primary time-to-event endpoints PFS and OS. OS is defined as the time from start of therapy to death and PFS is defined as the time from start of therapy to progression or death, whatever occurs first. While there is no general consensus about which of these endpoints to choose as the primary endpoint, our methodology makes it possible to use both of them as primary endpoints while exploiting their dependence structure. The first step of outcome improvement in oncology is strictly associated with an extension of the progression-free survival time and an increase of the plateau of patients without tumor progression. The second step is the hard criterion 'death yes or no' without any scope for interpretation. A second-line treatment may or may not lie between both steps; however, the link between PFS and OS is always in the focus of the investigators, and their simultaneous presentation is an important task. As PFS and OS events may occur simultaneously, namely if a patient dies before any progression of the disease has been observed, we can use a multi-state model as in Meller et al. (2019) to establish a corresponding probabilistic structure. The multi-state model is visualized in Figure 1. A patient's history of disease from start of therapy corresponds to a path along the arrows in this figure. At the beginning of the treatment, a patient starts in state 0. The patient may die directly without progression, which is represented by a jump to state 2. Otherwise, the patient may experience a progression of the disease, which is represented by a jump to state 1, and die afterwards, which is represented by a subsequent jump to state 2.
We denote the random time of transition to state 1 by $T^{\{1\}}$ and the time of transition to state 2 by $T^{\{2\}}$. Accordingly, PFS can be defined as the minimum of $T^{\{1\}}$ and $T^{\{2\}}$, which will be denoted by $T^{\{1\}} \wedge T^{\{2\}}$, and OS is given by $T^{\{2\}}$. The probabilistic structure of the multi-state model is given by hazards, also called intensities, for the three transitions. To be consistent with our later notation, we will denote them by:
• $\alpha^{(1,\{1,2\})}$, the hazard for a progression,
• $\alpha^{(2,\{1,2\})}$, the hazard for death without prior progression (e.g., by treatment-related toxicity),
• $\alpha^{(2,\{2\})}$, the hazard for death with prior progression.
Note that the first part of the upper index contains the state to which the corresponding transition refers, while the second part contains the set of states that have not yet been visited. These hazards can be defined by the usual limits of conditional transition probabilities over shrinking time intervals, given the history up to the current time, respectively (resp.). From now on, we will assume that we can continuously monitor progression and death of the patients. Hence, we will only need to consider arguments of the form $(t, \cdot, \cdot)$ for $\alpha^{(1,\{1,2\})}$ and $\alpha^{(2,\{1,2\})}$, resp. arguments of the form $(t, t^{\{1\}}, \cdot)$ with $t^{\{1\}} \leq t$, which refers to the time of progression, for $\alpha^{(2,\{2\})}$. Depending on the assumptions of the model, the hazards can be simplified further. In the easiest case, a time-homogeneous Markov model, each hazard assumes a constant value and hence depends neither on the current time since start of the treatment nor on the timing of observed progressions. An example of such a model will be investigated in Section 8.
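As a concrete illustration of the time-homogeneous Markov case, the following sketch simulates patient paths through the illness-death model of Figure 1 using competing exponential clocks. The function name and all hazard values are our own illustrative assumptions, not quantities taken from the trial examples in this paper.

```python
import random

def simulate_patient(a01, a02, a12, rng):
    """Simulate one path through the illness-death model of Figure 1.

    a01: constant hazard for progression (state 0 -> 1)
    a02: constant hazard for death without prior progression (state 0 -> 2)
    a12: constant hazard for death after progression (state 1 -> 2)
    Returns (pfs_time, os_time); all hazard values are illustrative.
    """
    # Leaving state 0 is governed by two competing exponential clocks.
    t_leave = rng.expovariate(a01 + a02)
    if rng.random() < a02 / (a01 + a02):
        # Direct death: the PFS and OS events coincide.
        return t_leave, t_leave
    # Progression first, then death after an additional Exp(a12) waiting time.
    return t_leave, t_leave + rng.expovariate(a12)

rng = random.Random(42)
paths = [simulate_patient(0.10, 0.05, 0.20, rng) for _ in range(10_000)]
assert all(pfs <= os for pfs, os in paths)          # PFS never exceeds OS
share_direct_death = sum(pfs == os for pfs, os in paths) / len(paths)
# Under these hazards, a share of about a02/(a01+a02) = 1/3 dies without progression.
```

With constant hazards this is exactly the time-homogeneous Markov model; time-inhomogeneous or non-Markov variants would require replacing the exponential draws by draws from the corresponding cumulative hazards.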
In the context of a one-armed clinical trial, we assume that we are given a reference model. The hazards of this reference model may have been estimated using historical data of patients who have been treated under the current standard of care. Estimation of reference hazards will be the content of Section 6.2. The hazards of the reference model will be marked with a $0$ in the lower index. We want to explore possible differences between the reference model and the survival history of patients under the new treatment which is examined in this trial. The null hypothesis can thus be formulated as
$$H_0: \ \alpha^{(1,\{1,2\})} = \alpha_0^{(1,\{1,2\})}, \quad \alpha^{(2,\{1,2\})} = \alpha_0^{(2,\{1,2\})}, \quad \alpha^{(2,\{2\})} = \alpha_0^{(2,\{2\})}. \qquad (3)$$
We are now given $n$ patients entering the trial at calendar times $0 \leq t_1 < t_2 < \dots < t_n < \infty$. For the $i$-th patient, let $T_i^{\{1\}}$ be the time of progression resp. $T_i^{\{2\}}$ the time of death after entry. The patient is censored at the random time $C_i$ after entry, whereby $C_i$ is assumed to be independent of $(T_i^{\{1\}}, T_i^{\{2\}})$. An analysis taking place at calendar time $\tau_1$ can also induce censoring. Hence, we can only observe $T_i^{\{j\}} \wedge C_i \wedge (\tau_1 - t_i)^+$ at calendar time $\tau_1$ for any patient $i \in \{1, \dots, n\}$ and $j \in \{1, 2\}$.
To compute the test statistic, we first need to count three types of events. For any $t \geq 0$:
• $N^{\mathrm{PFS}}(t)$ is the number of observed PFS events up to calendar time $t$ (i.e., progression or death, whatever occurred first),
• $N^{\mathrm{OS}}(t)$ is the total number of deaths observed up to calendar time $t$,
• $N^{\mathrm{PFS,OS}}(t)$ is the number of deaths up to calendar time $t$ without prior observed progression (i.e., the PFS and OS events occur at the same time).
For each of these counts, we need counterparts in which the expected number of these events under the null hypothesis is computed; these will be denoted by $E_0^{\mathrm{PFS}}(t)$, $E_0^{\mathrm{OS}}(t)$ and $E_0^{\mathrm{PFS,OS}}(t)$. We can summarize these quantities in the matrices
$$N(t) := \begin{pmatrix} N^{\mathrm{PFS}}(t) & N^{\mathrm{PFS,OS}}(t) \\ N^{\mathrm{PFS,OS}}(t) & N^{\mathrm{OS}}(t) \end{pmatrix} \quad \text{and} \quad E_0(t) := \begin{pmatrix} E_0^{\mathrm{PFS}}(t) & E_0^{\mathrm{PFS,OS}}(t) \\ E_0^{\mathrm{PFS,OS}}(t) & E_0^{\mathrm{OS}}(t) \end{pmatrix}.$$
Together, we can form the multivariate process $M_0(t) := \bigl(N^{\mathrm{PFS}}(t) - E_0^{\mathrm{PFS}}(t),\ N^{\mathrm{OS}}(t) - E_0^{\mathrm{OS}}(t)\bigr)^\top$, which is a martingale under the null hypothesis $H_0$. In a group-sequential setting, we are given a sequence of analysis dates. For the sake of simplicity, we restrict ourselves here to a design with one interim analysis at calendar time $\tau_1$ and a final analysis at calendar time $\tau_2$. The first-stage test statistic will be based on $M_0(\tau_1)$ and the second-stage test statistic will be based on the increment $M_0(\tau_2) - M_0(\tau_1)$. For both, we need to estimate the covariance matrix under the null hypothesis. The stage-wise estimators are given by
$$\hat{\Sigma}_1 = \frac{N(\tau_1) + E_0(\tau_1)}{2} \quad \text{and} \quad \hat{\Sigma}_2 = \frac{\bigl(N(\tau_2) - N(\tau_1)\bigr) + \bigl(E_0(\tau_2) - E_0(\tau_1)\bigr)}{2} \qquad (6)$$
for the first resp. the second stage. Let $\hat{A}_l \hat{A}_l^\top$ denote the Cholesky decomposition of $\hat{\Sigma}_l$ for $l \in \{1, 2\}$. The test statistic $Z_{0,1}$ for the first resp. $Z_{0,2}$ for the second stage can now be computed as $Z_{0,1} = \hat{A}_1^{-1} M_0(\tau_1)$ resp. $Z_{0,2} = \hat{A}_2^{-1}\bigl(M_0(\tau_2) - M_0(\tau_1)\bigr)$.
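The stage-wise standardization described above can be sketched as follows. The count and expectation matrices are made-up illustrative numbers, not data from any real trial; the averaging of observed and expected counts mirrors the covariance estimator in (6).

```python
import numpy as np

def stagewise_statistic(N_incr, E0_incr, M_incr):
    """Standardize a stage-wise increment of the martingale M_0.

    N_incr, E0_incr: 2x2 stage increments of the observed-count and
        null-expectation matrices; M_incr: stage increment of
        (N^PFS - E0^PFS, N^OS - E0^OS).  Returns A^{-1} M, where
        A A^T is the Cholesky decomposition of the estimator (6).
    """
    sigma_hat = (N_incr + E0_incr) / 2        # mean of observed and expected
    A = np.linalg.cholesky(sigma_hat)
    return np.linalg.solve(A, M_incr)         # avoids forming A^{-1} explicitly

# First stage with made-up interim counts:
N1 = np.array([[30.0, 12.0], [12.0, 18.0]])   # observed event matrix
E1 = np.array([[36.0, 15.0], [15.0, 22.0]])   # expected under H_0
M1 = np.array([30.0 - 36.0, 18.0 - 22.0])     # observed minus expected
z1 = stagewise_statistic(N1, E1, M1)
```

The second-stage statistic is obtained by feeding the same function with the increments between the final and the interim analysis.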
To construct statistical tests with a given level $\alpha$, it is crucial to know that under the null hypothesis $Z_{0,1}$ and $Z_{0,2}$ are bivariate standard normally distributed and that they are independent. There are several possibilities to obtain stage-wise $p$-values from these test statistics. Possible choices are the $\ell_2$- or the $\ell_\infty$-norm of the stage-wise test statistic. This would yield stage-wise $p$-values $p_l$ for $l \in \{1, 2\}$. Of course, this choice has to be made at the planning stage of the study. We can now apply any kind of combination test and sequential decision boundaries to conduct an adaptive level-$\alpha$ test. Such a test is given by a rejection region $\mathcal{R} := \{p_1 \leq \alpha_1\} \cup \{\alpha_1 < p_1 \leq \alpha_0,\ p_2 < \alpha_2(p_1)\}$ for a prefixed level $\alpha_1$ and futility bound $\alpha_0$ of the first stage and a strictly decreasing conditional error function $\alpha_2 : (\alpha_1, \alpha_0] \to [0, 1]$ with $\int_{\alpha_1}^{\alpha_0} \alpha_2(p)\,dp = \alpha - \alpha_1$. At the interim analysis, the design of the trial (e.g., the sample size) may be adapted based on observations concerning PFS and OS events observed until the interim analysis. These topics are covered in more detail in Section 5.

NOTATION FOR THE GENERAL FRAMEWORK
In this section we formalise the framework of our approach and introduce all the components we need to construct the processes and test statistics. While the previous section focused on a particular example concerning two time-to-event endpoints (PFS and OS), the general methodology allows comparisons based on an arbitrary number of composite events in the presence of an absorbing state. Let $(\Omega, \mathcal{A}, \mathbb{P})$ denote the probability space upon which all random variables are defined. For a trial with $n \in \mathbb{N}$ patients we are given a sequence of entry times $0 \leq t_{n,1} < t_{n,2} < \dots < t_{n,n} < \infty$. As $n \to \infty$, the accrual process defined by $\rho_n(t) := \frac{1}{n} \sum_{i=1}^{n} 1\{t_{n,i} \leq t\}$ for all $t \geq 0$ and $n \in \mathbb{N}$ asymptotically behaves according to a cumulative accrual rate $\rho$ with $\rho(0) = 0$ and $\lim_{t \to \infty} \rho(t) = 1$, that is, $\rho_n(t) \to \rho(t)$ for all $t \geq 0$. (9) The asymptotic distribution of our test statistics will be derived in the limit $n \to \infty$. Hence, we will fix $n$ from now on and omit the double indexing for the sake of simplicity. The $i$-th patient in this trial, who enters the trial at calendar time $t_i$, is associated with a $(k+1)$-tuple $(T_i^{\{1\}}, \dots, T_i^{\{k\}}, C_i)$ consisting of the $k$ component event times and the censoring time. We additionally introduce set-valued processes monitoring the set of yet unobserved events, and define a corresponding vector of event-time information for a set $A \subset \{1, \dots, k\}$ and any $t \geq 0$. Together, these objects contain all information about the time-to-event endpoints of patient $i$ available at $t$ time units after entry. We can imagine the situation as a multi-state model with $k + 1$ different states. For now, we restrict ourselves to exactly one absorbing state, which in practice is usually death. Of course, scenarios with several absorbing states or without any absorbing state can be handled similarly. The first state, which we will call state 0, is the initial state. We have $k - 1$ transient states 1 to $k - 1$, which are first entered at times $T^{\{1\}}$ to $T^{\{k-1\}}$, and the absorbing state, which is entered at time $T^{\{k\}}$. Recurrent events can be represented by different states.
As we do not want to restrict ourselves to a particular joint modeling approach for PFS and OS, we need to keep track of the processes introduced above. Of course, this illustration is only meaningful if we suppose that no two events occur simultaneously in finite time $\mathbb{P}$-almost surely. This limitation will be eliminated later, as it is not very reasonable in many constellations, especially when we deal with composite endpoints.
The described reasoning should be reflected by the representation of the hazard rates for the jump processes. For a fixed event, the hazard for this event to occur might differ depending on which events have happened before. Hence, for a set of events $A \subset \{1, \dots, k\}$ which have not yet happened, the hazard function for event $j$ will be denoted by $\alpha^{(j,A)}$. Such conditional hazards, that is, hazards for one event under a certain knowledge about other events, can be computed under $\mathbb{P}$ by (11). Please note that (11) also enables the computation of conditional hazards for cases in which the assessment of some time-to-event variables may not be executable continuously, but only in certain time steps. In the following we assume that we can assess all events continuously. As each state is only visited once, the conditional hazards simplify accordingly. As with other existing methods for one-sample log-rank tests (see, e.g., Schmidt et al., 2017), the asymptotics of our test statistics rely on a martingale central limit theorem. In order to construct such a martingale it is necessary to describe a filtration that comprises exactly the information that is available at any fixed time $t$ of patient $i$'s time in trial. This includes information about whether the events under consideration could have been observed until time $t$ after entry or not and, if yes, when, or whether there has already been a censoring of the patient's data. Hence, let the $\sigma$-algebra $\mathcal{F}_{i,t}$ be generated by the corresponding observable random variables for all $j \in \{1, \dots, k\}$. When it comes to measurability issues later on, one should note that the relevant processes are progressively measurable w.r.t. $(\mathcal{F}_{i,t})_{t \geq 0}$ and that the available event-time information at time $t$ is measurable w.r.t. $\mathcal{F}_{i,t}$ for all $t \geq 0$. Now, as we have fixed the framework for a single patient, we also want to take into account that we have several patients in our trial with different entry times $t_i$. Hence we have to translate all the information, which is so far given in a patient's time in trial, to calendar time.
The events and the censoring of patient $i$ take place at $T_i^{\{j\}} + t_i$ resp. $C_i + t_i$ in calendar time.
The $\sigma$-algebra $\mathcal{F}_t$ containing all information available at calendar time $t$ is hence generated by the patient-wise information up to the individual times in trial $(t - t_i)^+$, where $s^+ = \max(0, s)$ denotes the positive part of a real number $s \in \mathbb{R}$. The filtration $(\mathcal{F}_t)_{t \geq 0}$ now describes the information available in calendar time, w.r.t. which we want to construct a martingale leading to our test statistics.
In clinical practice (as in Sections 2 and 6.1), we often deal with composite events, for example PFS, defined as the time from start of therapy until death or progression, whatever occurs first. We can include composite events in our framework by taking a set $A \subset \{1, \dots, k\}$ and defining the event time $T_i^A := \min_{j \in A} T_i^{\{j\}}$ for all $i \in \{1, \dots, n\}$. The original component events $1, \dots, k$ are still included in this notation by choosing $A$ as a singleton. Finally, we want to introduce the processes with which we can monitor the information we have collected for certain events. This includes expected and observed numbers of the composite events themselves as well as of their simultaneous occurrences. We will need them later to standardize our test statistics. Firstly, there are the counting processes $N^A(t)$ of observed composite events for any $A \subset \{1, \dots, k\}$. Secondly, we want to keep track of the hazard that has been accumulated for any possible component or composite event until any time $t \geq 0$. Therefore, we define corresponding expectation processes $E^{A_1, A_2}(t)$ for any $A_1, A_2 \subset \{1, \dots, k\}$, where the set-valued process is defined as in (10). These can be interpreted as the number of events expected given the information about all of the component events in the model.

CONSTRUCTION OF THE MULTIVARIATE MARTINGALE AND ITS ASYMPTOTICS
Having set up the framework, we now give a quick overview of the theoretical results before we proceed to their applications in relevant scenarios. For this, we need two conditions, (R1) and (R2), regarding the regularity of the joint distribution. Proofs of the statements in this chapter and a more careful mathematical treatment can be found in the Appendix. For any fixed component event and any patient we can now construct a martingale which is adapted to the filtration $(\mathcal{F}_{i,t})_{t \geq 0}$. The only difference to the construction of the martingale for the standard one-sample log-rank test is that it has to take all the information given by the filtration into account. Lemma 1. Let a vector of survival times $T := (T^{\{1\}}, \dots, T^{\{k\}})$ with a joint distribution on $[0, \infty]^k$ fulfilling (R1) and (R2) and an independent censoring variable $C$ be given. Then for all $j \in \{1, \dots, k\}$ the process $(M^{\{j\}}(t))_{t \geq 0}$, obtained by compensating the counting process of event $j$ with the conditional cumulative hazard accumulated along the observed history for all $t \geq 0$, is an $(\mathcal{F}_{i,t})_{t \geq 0}$-martingale, where the set-valued process is defined as in (10).
This formula also holds if we are in a situation with an absorbing state. The quite technical assumption of the Lemma only ensures that there are no simultaneous component events and that all the marginal distributions are absolutely continuous. Of course, a very similar statement, which is an easy consequence of the former, holds for composite events.

Corollary 2. Let a vector of survival times and an independent censoring variable as in Lemma 1 be given. Then for all $A \subset \{1, \dots, k\}$ the stopped process $(M^A(t))_{t \geq 0}$, defined analogously to Lemma 1 for the composite event time $T^A$, is an $(\mathcal{F}_{i,t})_{t \geq 0}$-martingale.
As in the previous chapter, we can collect all the information and translate it to calendar time in order to get a martingale for any event comprising the information from all individuals in the trial.
Before we state the asymptotics, we fix $m \in \mathbb{N}$ composite events defined by sets $A_1, \dots, A_m \subset \{1, \dots, k\}$. To state the asymptotics for our multivariate martingale $(M(t))_{t \geq 0} := (M^{A_1}(t), \dots, M^{A_m}(t))_{t \geq 0}$, we have to establish two assumptions which ensure that the processes behave reasonably in the limit $n \to \infty$. These assumptions refer to the expected numbers of observable events and simultaneously observable events, for which we define a limiting matrix-valued covariance function $V$. As stated in Section II.5.1 of Andersen et al. (1993), we require $V \cdot \rho$, with $\rho$ from (9), to be a continuous $m \times m$ positive semidefinite matrix-valued function with positive semidefinite increments. This is generally fulfilled for any reasonable recruitment and censoring mechanism.
As we can see in the following theorem, increments of this matrix-valued limit function need to be estimated consistently. Possible choices of consistent estimators will be presented afterwards in Equations (19) and (20).
Theorem 4. Let $(T_i, C_i)$ be as in Lemma 1 for all $i \in \mathbb{N}$ and let a set of composite events $\{A_1, \dots, A_m\} \subset \mathcal{P}(\{1, \dots, k\})$ be given. Let $0 =: \tau_0 < \tau_1 < \dots < \tau_L < \infty$ be a sequence of analysis times. Let $\hat{\Delta}_l$ be a consistent estimator for $\Delta_l := V(\tau_l)\rho(\tau_l) - V(\tau_{l-1})\rho(\tau_{l-1})$ with Cholesky decomposition $\hat{\Delta}_l =: \hat{A}_l \hat{A}_l^\top$ for any $l \in \{1, \dots, L\}$. Then the vector of standardized stage-wise increments $\hat{A}_l^{-1} n^{-1/2} \bigl(M(\tau_l) - M(\tau_{l-1})\bigr)$, $l \in \{1, \dots, L\}$, converges in distribution to $\mathcal{N}(0, I_{m \cdot L})$, where $I_{m \cdot L}$ is the identity matrix of size $m \cdot L$. Finally, the choice of the consistent estimators for the increments mentioned in Theorem 4 needs further consideration before we proceed to the practical use cases of our methodology. Theorem II.5.1 from Andersen et al. (1993) directly yields two choices: we can either choose the quadratic variation or the predictable quadratic variation process to estimate the variance of $M$; suitably normalized, both converge uniformly on compact subsets of $[0, \infty)$. It has been studied in Tu and Gross (1996), Sun et al. (2011) and Wu (2014) that neither of the two is a perfect solution in the case of standard one-sample log-rank tests. In particular, for small sample sizes the tails of the distribution of the test statistics differ considerably from the tails of a normal distribution. Unfortunately, the behavior at the tails is even more important for us because we deal with sequential and multivariate testing. Hence we will stick to the suggestion from Wu (2014), namely to use the mean of the two choices mentioned above to estimate the variances. Obviously, this mean then also converges uniformly on compact subsets of $[0, \infty)$. As we will see in Section 7, this yields promising results regarding the empirical distributions of the test statistics and the empirical type I error probability which we obtained from simulations.

TWO-STEP GROUP SEQUENTIAL AND ADAPTIVE DESIGNS
For the sake of simplicity, here and in the following sections we restrict ourselves to two-step designs with one interim and one final analysis, but with an arbitrary number $m \in \{2, 3, \dots\}$ of (composite) endpoints. Let $\mathbb{P}_0$ denote the null distribution, that is, the distribution under which the distribution of $T = (T^{\{1\}}, \dots, T^{\{k\}})$ for the study cohort is equal to the reference distribution. As in Section 2, we will indicate hazards and processes computed under $\mathbb{P}_0$ with an additional $0$ in the lower index. If $F_0$ denotes the joint null distribution function of the survival times $T^{\{1\}}, \dots, T^{\{k\}}$ and $F$ is their true distribution function, the null hypothesis of our testing problem could be formulated as $H_0: F = F_0$. Alternatively, we can also formulate the null hypothesis in terms of the transition hazards, as we did in (3) for our motivating example. For this, let $\alpha$ be the collection of transition hazards in the multi-state model and let $\alpha_0$ be their exact specification under the null hypothesis. Then the null hypothesis can also be written as $H_0: \alpha = \alpha_0$. This formulation also seems more appropriate, as it may be more natural to formulate planning hypotheses on the level of the particular transitions. Note that if the reference distribution is given in terms of the joint distribution function or the joint density, the hazards can easily be computed from these functions. This may be the case for models as in Yuan and Yin (2009) where copulas are used. If such a transformation is necessary, see Appendix A.3 for the corresponding formulas.
It is obvious that the space of alternatives for this simple null hypothesis is very large. Hence, the results of the test have to be interpreted with utmost care.
As already mentioned, the processes $M_0$ are defined by replacing the true unknown hazards $\alpha$ with the known reference hazards $\alpha_0$ everywhere, which makes it possible to derive $M_0$ from the observed data. A two-step test of $H_0$ is defined by a sequence $0 =: \tau_0 < \tau_1 < \tau_2$ of calendar times. At any analysis $l \in \{1, 2\}$ the test statistic $Z_{0,l}$ can be calculated as in (18) with a proper choice of the estimator of the covariance matrix. By Theorem 4, $Z_{0,1}$ and $Z_{0,2}$ are asymptotically independent and standard normally distributed under $H_0$. This asymptotic behavior needs to be exploited to obtain a stage-wise testing procedure.

The rejection region
At each analysis one can measure the distance from the origin to $Z_{0,l}$ and calculate a corresponding stage-wise $p$-value. Convenient choices for this distance are the Euclidean norm $|Z_{0,l}|_2$ or the maximum norm $|Z_{0,l}|_\infty$. The corresponding distributions to calculate the $p$-value would be the distribution of the square root of a $\chi_2^2$-distributed random variable resp. the distribution of the square root of the maximum of two independent $\chi_1^2$-distributed random variables. In these cases we get
$$p_l = \exp\bigl(-|Z_{0,l}|_2^2 / 2\bigr) \qquad (24)$$
resp.
$$p_l = 1 - \bigl(2\Phi(|Z_{0,l}|_\infty) - 1\bigr)^2. \qquad (25)$$
Derivations of these $p$-values can be found in Appendix A.2. The obtained $p$-values can be used with any kind of combination test and decision boundaries to get a sequential testing method. For a comprehensive overview on these topics see, for example, Wassmer and Brannath (2016).
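As a sketch, the stage-wise p-values admit closed forms, since under the null hypothesis the squared Euclidean norm of a bivariate standard normal vector is chi-squared with two degrees of freedom (so P(|Z|_2^2 >= r^2) = exp(-r^2/2)) and the maximum norm satisfies P(max_i |Z_i| >= c) = 1 - (2 Phi(c) - 1)^2. The numeric interim statistic below is an illustrative assumption.

```python
import math

def p_value_l2(z):
    """p-value from the Euclidean norm: |z|_2^2 is chi^2_2 under H_0."""
    return math.exp(-sum(v * v for v in z) / 2)

def p_value_linf(z):
    """p-value from the maximum norm of a bivariate standard normal vector."""
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    c = max(abs(v) for v in z)
    return 1 - (2 * Phi(c) - 1) ** 2

z_interim = (1.8, -0.7)   # illustrative standardized interim statistic
p2 = p_value_l2(z_interim)
pinf = p_value_linf(z_interim)
```

Both functions return 1 at the origin and decrease as the statistic moves away from it, as a valid norm-based p-value should.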
We do not consider stopping for futility here for the sake of simplicity. Therefore, it suffices to specify values $0 < \alpha_1 < \alpha < 1$, where $\alpha$ is the overall significance level and $\alpha_1$ is the level of the first stage, and a conditional error function $\alpha_2 : (\alpha_1, 1] \to [0, 1]$ which is monotonically decreasing such that $\int_{\alpha_1}^{1} \alpha_2(p)\,dp + \alpha_1 = \alpha$.
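A minimal admissible choice, taking "decreasing" in the weak (non-increasing) sense, is a constant conditional error function. The sketch below checks the level condition numerically; the values of alpha and alpha_1 are illustrative assumptions.

```python
# Constant conditional error function alpha_2 on (alpha_1, 1] satisfying the
# level condition: integral from alpha_1 to 1 of alpha_2(p) dp + alpha_1 = alpha.
# The values of alpha and alpha_1 are illustrative, not recommendations.
alpha, alpha_1 = 0.05, 0.01

def alpha_2(p):
    """Conditional error spent at the second stage given first-stage p."""
    return (alpha - alpha_1) / (1 - alpha_1)

# Midpoint-rule check of the level condition:
n_grid = 100_000
h = (1 - alpha_1) / n_grid
integral = sum(alpha_2(alpha_1 + (i + 0.5) * h) * h for i in range(n_grid))
assert abs(integral + alpha_1 - alpha) < 1e-9
```

Strictly decreasing choices, such as the conditional error functions of classical combination tests, satisfy the same integral condition and can be checked the same way.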
This leads to the rejection region $\{p_1 \leq \alpha_1\} \cup \{\alpha_1 < p_1,\ p_2 \leq \alpha_2(p_1)\}$. If one fixes a norm for use, the rejection region in terms of stage-wise $p$-values, which is a subset of $[0, 1]^2$ in our two-stage case, can be translated to a rejection region for the vector of test statistics using (24) resp. (25). This region, which we will call $\mathcal{B}$ in the following, is in turn a subset of $\mathbb{R}^4$. An example of a translation of these principles of sequential testing to rejection regions for our method can be seen in Section 8.
The procedure can be extended to more than two stages. Futility stopping may also be included.

The adaptive test
If the times of analyses $\tau_1$ and $\tau_2$ are deterministic, we have $\mathbb{P}_0[\mathcal{B}] = \alpha$. Since $Z_{0,1}$ and $Z_{0,2}$ are asymptotically independent, this still holds true in the adaptive case, when the calendar time of the final analysis is chosen depending on the interim test statistic $Z_{0,1}$. The exact requirements are given by the following theorem, the proof of which can be found in Appendix A.4.
Theorem 5. Let $Z_0 = (Z_{0,1}, Z_{0,2})$ be defined as in (18) using hazards from $\mathbb{P}_0$ and let $\mathcal{B}$ be a rejection region with level $\alpha$. Furthermore, let a minimum follow-up time $\delta > 0$ be given, which is the time between the end of the recruitment period and the final analysis. Then the asymptotic property $\mathbb{P}_0[\mathcal{B}] = \alpha$ remains true if the final analysis time $\tau_2$ is replaced by any $\mathcal{F}_{\tau_1}$-measurable random variable $\tilde{\tau}_2$ with $\tilde{\tau}_2 \geq \tau_1 + \delta$.
This enables a data-dependent sample size recalculation. On the one hand, the number of events will change with a modification of the final analysis time $\tau_2$. On the other hand, the accrual period may be shortened or extended.
It is also possible to include unplanned interim analyses according to the CRP principle of Müller and Schäfer, see Schäfer and Müller (2001). In this case, the computation of conditional rejection probabilities requires a bit more attention. This is because exact knowledge of the conditional distribution of $M_0(\tau_1)$, given the data collected until an unplanned interim analysis at time $\tau^*$, under the null hypothesis requires knowledge of the covariation matrix $\langle M_0 \rangle(\tau_1)$. This conditional distribution is given by $\mathcal{N}\bigl(M_0(\tau^*),\ \langle M_0 \rangle(\tau_1) - \langle M_0 \rangle(\tau^*)\bigr)$. One can now try to determine $\langle M_0 \rangle(\tau_1) - \langle M_0 \rangle(\tau^*)$ via simulation or direct computation under assumptions about the random censoring and recruitment mechanisms. If the accrual duration is extended or the final analysis is postponed, a more accurate computation can be accomplished at the originally planned analysis date $\tau_1$. In this case the final rejection boundaries can only be computed at the time of the final analysis.
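The simulation-based route can be sketched as follows: given the observed interim value of the martingale and an assumed covariation increment, the conditional rejection probability is a multivariate normal probability that can be estimated by Monte Carlo. All numeric inputs and the simple one-sided rejection rule below are illustrative assumptions, not part of the method's specification.

```python
import numpy as np

def conditional_rejection_prob(m_star, cov_incr, reject, n_sim=100_000, seed=1):
    """Monte Carlo estimate of the conditional rejection probability.

    m_star:   observed value of M_0 at the unplanned analysis time tau*
    cov_incr: assumed covariation increment <M_0>(tau_1) - <M_0>(tau*)
    reject:   callable deciding rejection from a value of M_0(tau_1)
    The conditional law of M_0(tau_1) given the data is N(m_star, cov_incr).
    """
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(m_star, cov_incr, size=n_sim)
    return float(np.mean([reject(m) for m in draws]))

# Illustrative numbers; the one-sided rule on the first component stands in
# for the translated rejection region, it is not the method's actual rule.
crp = conditional_rejection_prob(
    m_star=np.array([-4.0, -2.5]),
    cov_incr=np.array([[6.0, 2.0], [2.0, 4.0]]),
    reject=lambda m: m[0] < -8.0,
)
```

In practice the covariation increment itself would be obtained from assumptions on censoring and recruitment, as described above, before this conditional probability is evaluated.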

PRACTICAL CONSIDERATIONS
After the presentation of our theoretical results and some brief explanations on how to use these results to construct valid adaptive group-sequential tests, we want to address some further problems which may arise upon application of the procedure.

Application to further cases
Before going into another example, we quickly summarize how our motivating example from Section 2 fits into the general framework developed in Section 4. Obviously, there are $k = 2$ component events (progression and death) and $m = 2$ composite endpoints. One of them is PFS, which is represented by $A_1 = \{1, 2\}$, while the other is OS, which is represented by $A_2 = \{2\}$. Thus, we have $M = (M^{\mathrm{PFS}}, M^{\mathrm{OS}}) = (M^{A_1}, M^{A_2})$ and the rest follows as described in Section 4. If another situation with multiple time-to-event endpoints is to be investigated with the presented procedure, it is necessary to evaluate whether there is a non-zero probability of these events happening simultaneously. If this is the case (as for PFS and OS), the cause of this coincidence has to be considered and a set of component events needs to be established in order to build up a latent multi-state model on this new set of events. After that, Theorem 4 can be applied to develop a testing procedure for the joint distribution of all original endpoints. This approach will now be demonstrated with another example, which corresponds to a typical setting of a phase IIa trial in which we want to assess safety and efficacy simultaneously. Similar to the setting of Section 2, efforts have been undertaken to investigate the interplay between efficacy and toxicity while accounting for death as a competing event. This is very important in the early clinical stages of drug development. Our methodology can be seen as a tool to detect in which way this interplay differs between the standard of care, represented by the reference distribution, and the treatment under examination.
As in Lee et al. (2020) we assume that we can observe toxicities (component event 1), progressions (component event 2) and death (component event 3). In the spirit of a competing risks analysis (see, e.g., chapter 3 of Beyersmann et al., 2011) we define two composite events characterized by the sets E_1 = {1, 3} for toxicity and E_2 = {2, 3} for efficacy. The former corresponds to the time from the start of a therapy to toxicity or death, whatever occurs first; the latter corresponds to PFS. This setting is visualized by Figure 2. As this model is slightly more complex than the model in Section 2, we need more hazards to compute the martingale M_0, but, similar to above, these can be derived in analogy to the hazards in Section 2. Please note that we do not need any hazard α(e, A) where A is a singleton in this case, as both composite events will have happened by then.
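To see why the two composite events have a non-zero probability of occurring simultaneously, one can simulate the latent model directly. A small Python sketch (the supplementary code of the paper is in R; constant transition hazards and all function names are hypothetical simplifications) records the first-passage times of both composite events:

```python
import numpy as np

def simulate_composite_times(n, haz_init, haz_after_tox, haz_after_prog, seed=42):
    """Simulate T_E1 = time to toxicity or death and T_E2 = time to
    progression or death in a latent model with constant hazards.

    haz_init = (a_tox, a_prog, a_death): cause-specific hazards from the
    initial state; haz_after_tox = (a_prog, a_death) after a toxicity;
    haz_after_prog = (a_tox, a_death) after a progression.  All hazard
    values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    t_e1, t_e2 = np.empty(n), np.empty(n)
    for i in range(n):
        waits = rng.exponential(1 / np.asarray(haz_init))
        first = int(np.argmin(waits))    # 0 = tox, 1 = prog, 2 = death
        t = waits[first]
        if first == 2:                   # death first: both composites jump
            t_e1[i] = t_e2[i] = t
        elif first == 0:                 # toxicity first
            t_e1[i] = t
            t_e2[i] = t + rng.exponential(1 / np.asarray(haz_after_tox)).min()
        else:                            # progression first
            t_e2[i] = t
            t_e1[i] = t + rng.exponential(1 / np.asarray(haz_after_prog)).min()
    return t_e1, t_e2
```

With all initial hazards equal, death is the first event for one third of the patients, so both composite times coincide with probability 1/3: this is exactly the simultaneity caused by the shared component event "death".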

Derivation of reference distributions
Of course, the creation of an adequate reference model is only possible if corresponding historical data is available. This may, for example, be the case in pediatric oncology thanks to the outstanding role of registries such as the German Childhood Cancer Registry, in which about 95% of cases in pediatric oncology in Germany are registered. Furthermore, about 90% of these registered patients are treated within a clinical trial (see Kaatsch, 2004, or Erdmann et al., 2020). Thus, the sensible use of such a database can prevent selection bias and the neglect of temporal trends. If appropriate historical data is given, the choice of the modeling approach is also crucial. As already mentioned, multi-state models appear suitable. The most prominent multi-state modeling approaches for our main example, which will also be employed in our simulation study in Section 7, are summarized in Meller et al. (2019). A general treatise of multi-state modeling approaches can be found in chapter 6 of Hougaard (2000). In some cases it may be justifiable to assume that the resulting process is a (time-homogeneous) Markov process as in Fleischer et al. (2009). In general, this might not be the case. In Titman and Sharples (2010) advice on how to choose the appropriate approach is provided.
Beyond that, the application of our procedure is not restricted to multi-state models. Frailty- or copula-based approaches, as presented, for example, in chapter 7 of Hougaard (2000) or Burzykowski et al. (2001) for the PFS/OS example resp. in Yuan and Yin (2009) for the efficacy/safety example, can be accommodated as well. In these cases the joint distribution function of both endpoints is given directly. The quantities needed to execute our testing procedure can then be calculated with the formulas provided in Appendix A.3.
Most of the publications cited here provide inference procedures for the transition hazards resp. the joint distribution function. As additional support for users, we provide functions for parameter estimation for the approaches applied in our simulation study in the supplementary R code.
The testing procedure itself relies on a correct specification of the reference distribution and does not account for sampling variability arising from estimates derived from historical data.

One-sided hypotheses
The methodology provided so far yields testing procedures for the null hypothesis as stated in (22) resp. (23). As any kind of deviation from this hypothesis can be detected here, it is commonly called a two-sided hypothesis. Nevertheless, the underlying test statistic M_0 itself is multivariate and the rejection region for this statistic is therefore a subset of ℝ^m. If interest lies in showing superiority over the reference, a corresponding counterpart which qualifies as a one-sided hypothesis would then be given by (26), where the inequality is to be understood componentwise. Obviously, a rejection of such a hypothesis goes along with the assumption that the value of at least one transition hazard is below the corresponding value of the reference model. Rejection regions should be chosen in such a way that the type I error level decreases monotonically as the hazards increase beyond the reference hazards. Given this monotonicity, it would be enough to show that the nominal type I error level is adhered to if equality holds in (26).
For this sake, we propose rejection regions as in Section 5.1 with the additional restriction that all entries of the multivariate martingale M_0 shall be negative. It is worth mentioning that this procedure is not a trivial extension of the preceding methodology. In particular, it is necessary to compute the corresponding rejection probabilities for any analysis date, which is only possible at the time of the analysis itself, when the estimate of the covariance matrix of M_0 is identifiable. For the sake of simplicity, we restrict ourselves to the two-sided hypothesis tests in the remainder of this article.
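In the idealized case of exactly independent standard normal components (in practice the covariance must be estimated at the analysis, which is what makes the extension non-trivial), such a sign-restricted ℓ2 boundary has a closed form: for iid symmetric components, the signs are independent of the squared norm, so P(all components negative and ‖Z‖² > c) = 2^{−m} P(χ²_m > c). A hedged Python sketch of this idealized calculation, not the paper's actual procedure:

```python
import numpy as np
from scipy import stats

def one_sided_l2_boundary(alpha, dim):
    """Boundary c with P(all Z_i < 0 and sum Z_i^2 > c) = alpha for
    Z ~ N(0, I_dim).  Signs and squared norm are independent for iid
    symmetric components, so the chi-square tail probability only needs
    to be inflated by the factor 2**dim."""
    if alpha * 2**dim >= 1:
        raise ValueError("alpha too large for this dimension")
    return stats.chi2.isf(alpha * 2**dim, df=dim)

# Monte Carlo check of the attained level for m = 2, alpha = 0.05.
rng = np.random.default_rng(3)
z = rng.standard_normal((200_000, 2))
c = one_sided_l2_boundary(0.05, 2)
level = float(np.mean((z < 0).all(axis=1) & (np.sum(z**2, axis=1) > c)))
```

For m = 2 and α = 0.05 this gives c = χ²₂-quantile at level 0.2, i.e. c = −2 log(0.2) ≈ 3.22, instead of the two-sided value 5.99.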

Power and sample size calculations
Clinically relevant differences of a new treatment compared with the reference distribution can be set separately for the different transition hazards. This way it is possible to depict different mechanisms of action. As in our simulation study in Section 7 or in the case study in Section 8, these differences can be quantified by separate hazard ratios for the transition hazards. This can be especially useful if the mechanism of action of a new treatment differs from previous ones which can be the case for new cancer immunotherapies as it is outlined in Hoos (2012).
If the planning alternative has been set, power for different sample sizes can be determined via simulation. The provided R code enables power simulations with several sample sizes, so that one can determine the sample size which is necessary to achieve a specified power under the planning alternative.
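The search performed by the supplementary R code can be sketched generically. The following Python analogue (function names and the toy z-test trial are hypothetical; the actual trial simulator of the paper is far more involved) scans a grid of candidate sample sizes and returns the smallest one whose simulated power reaches the target:

```python
import numpy as np

def smallest_n_with_power(simulate_trial, sizes, target=0.8,
                          n_sim=4000, seed=7):
    """Estimate the power for each candidate sample size by simulation
    and return the smallest size whose empirical power reaches `target`
    (None if no candidate does).  `simulate_trial(n, rng)` must return
    True iff the simulated trial rejects the null hypothesis."""
    rng = np.random.default_rng(seed)
    for n in sorted(sizes):
        rejections = sum(simulate_trial(n, rng) for _ in range(n_sim))
        if rejections / n_sim >= target:
            return n
    return None

# Toy stand-in trial: one-sided z-test for a mean shift of 0.5,
# whose exact power is known (about 0.80 at n = 25).
def z_trial(n, rng):
    return np.sqrt(n) * rng.normal(0.5, 1, n).mean() > 1.645
```

Plugging in the multi-state trial simulator in place of `z_trial` reproduces the sample size search described above.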

Interpretation and further analyses
As multiple time-to-event endpoints are evaluated simultaneously with the method presented here, we also obtain a multivariate test statistic (see Section 4). Thus, the interpretation of the results requires a bit more attention and shall be described here for the two examples presented so far.
In general, interpretation is easier on the level of the martingale M_0, as defined in (15). The value of the entry of the vector M_0 which is associated with a composite event E_j is the difference between the number of events of type E_j observed and the number expected given the history concerning component events outside of E_j. As there are no component events outside PFS in the PFS/OS example, the first component is just the difference between the PFS events observed and those expected under the null hypothesis. The second component (OS) does not include the component event of progression and is hence the difference between the observed deaths and the deaths expected given the progression history of all patients. Obviously, differences between the study cohort and the reference model concerning the hazard for the transition to progression only affect the first entry, while differences in the hazard for the transition from progression to death only affect the second component. This is also visualized in our case study in Section 8. Differences concerning the hazard for the transition from the initial state to death affect both entries.
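For the PFS component this observed-minus-expected reading can be made concrete. A minimal Python sketch for the simplified case of a constant reference hazard (the paper's martingale uses general, possibly time-varying reference hazards; the function name is ours):

```python
import numpy as np

def pfs_component(follow_up, had_event, ref_hazard):
    """First entry of M_0 in the PFS/OS example under a constant
    reference hazard: observed PFS events minus the number expected
    from the cumulative reference hazard accrued over each patient's
    time at risk (the integral of the at-risk process times the hazard)."""
    follow_up = np.asarray(follow_up, dtype=float)
    had_event = np.asarray(had_event, dtype=bool)
    expected = ref_hazard * follow_up.sum()
    return float(had_event.sum() - expected)
```

For three patients at risk for 1, 2 and 3 months with two observed events and a reference hazard of 0.2 per month, the component equals 2 − 0.2 · 6 = 0.8, i.e. slightly more events than the reference predicts.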
The same interpretation holds for the more complex example concerning efficacy and safety. The toxicity component compares toxicity events of patients in the trial to their expected number under the null hypothesis given their progression history, and the efficacy component compares progressions of patients in the trial to their expected number given their history concerning toxicities. In both components, we additionally account for death as a competing event. Hence, any deviation from the null distribution concerning one of the hazards is reflected in the corresponding component(s), so interpretation requires a basic understanding of conditional probability. In both cases there is one transition entering both components, namely the direct transition to death. This is because we are accounting for death as a competing event. Unfortunately, this causes a correlation of the components of M_0. The standardization procedure yields asymptotically independent components, which we need to construct valid statistical tests.
While the examination of the entries of M_0 reveals which of the endpoints under investigation is affected by differences between a novel therapy and the control group, one can also go further into detail to examine particular transitions.
TABLE 1 Parameter configurations and expected event rates for scenarios 1-5 of the simulation study

SIMULATION STUDY
The aim of this simulation study is twofold. On the one hand, we conduct it to demonstrate that the asymptotics from Theorem 4 already take effect for practically relevant sample sizes; on the other hand, we assess empirical type I errors for our method in relevant scenarios. As already mentioned, the latter is an important issue because of the missing concordance between the distribution of standard one-sample log-rank statistics and the standard normal distribution at the tails, as treated, for example, in Sun et al.

Design
We consider different scenarios for the PFS/OS main application example from Section 2 with K = 2 analyses (i.e., one interim analysis and a final analysis). We use the scenarios from Meller et al. (2019), in which several commonly used joint modeling approaches are represented. The scenarios comprise processes for the latent progression-death model from Figure 1. More precisely, the transition hazards are of Weibull form, α_ij(t) = λ_ij ν_ij t^(ν_ij − 1), with parameter values as in Table 1. Hence, we have a time-homogeneous Markov process in scenario 1, time-inhomogeneous Markov processes in scenarios 4 and 5 and a semi-Markov process in scenarios 2 and 3, where the hazard does not depend on the current time, but only on the current duration of stay in the state (see, e.g., chapter 6 of Hougaard, 2000).
As the processes introduced and studied in Section 3 and Section 4 are given in calendar time, corresponding trials will be planned in calendar time, too. Please note that, unlike for the classical one-sample log-rank test (see Breslow, 1975, or Schmidt et al., 2017), information time is not a univariate process, but a 2 × 2 symmetric matrix-valued (resp. m × m symmetric matrix-valued in the general case) process. The diagonal entries of these matrices refer to the expected PFS- resp. OS-events under the null hypothesis, while the off-diagonal entry contains the expected number of joint PFS- and OS-events, that is, direct deaths. The calendar time parameters were fixed as follows: accrual period a = 3 with uniform accrual rate, fixed follow-up time f = 2 and analyses at t_1 = 2.5 and t_2 = 5. Hence, the interim analysis takes place after half of the total time of the trial. The expected event rates of PFS and OS under the given parameters are also shown in Table 1 (denoted by p_PFS resp. p_OS). Given the sample size and assuming a uniform recruitment mechanism, one can compute the diagonal entries of the expected information time matrix at the analysis dates with these values.
In the first part of the study, we investigated the behavior of M_0 under the null hypothesis (22) to verify that the asymptotics from Theorem 4 already hold for small sample sizes. For each parameter constellation in combination with each of the sample sizes n ∈ {50, 100, 250, 500}, we executed 100,000 simulation runs. In the second part, we studied power and the resulting sample sizes under different alternatives. The trials were designed for a type I error level of α = 5%. We used inverse normal designs with equal weights and with Pocock and O'Brien-Fleming boundaries (abbreviated by P resp. OF in Table 2). We chose these designs as we consider them well-known and well-established in the wide range of group sequential and adaptive designs. The weights of the inverse normal combination could still be optimized for each scenario and alternative hypothesis separately. For the sake of simplicity, we restrict ourselves to the simple choice of equal weights here. The null hypothesis (22) was then tested using stage-wise p-values derived from the ℓ2-norm or the ℓ∞-norm of the stage-wise test statistics computed as in (18). Hence, the tests can be interpreted as "two-sided". A brief discussion of testing procedures resembling one-sided tests can be found in Section 6.3.
The simulation study was performed with R 3.6.2 (see R Core Team, 2017).

Behavior under the null hypothesis
Please note that each set of four values in Table 2 concerning the same scenario and maximum sample size belongs to the same set of simulated trials; only the computation of p-values and decision boundaries was altered. For the interpretation of the obtained values, notice that the 95%-confidence interval for an underlying true value of 0.05 has a breadth of about 0.0027 for 100,000 simulation runs. The results stem from computations employing the standardization from (21). Using only the (predictable) quadratic (co)variation also yields asymptotically correct testing procedures. However, we can see here why we used the extension of the idea from Wu (2014): the empirical type I errors for a sample size of 50 are in a range between 0.054 and 0.059 for the predictable quadratic (co)variation resp. 0.059 and 0.070 for the quadratic (co)variation. Additionally, we want to mention that the empirical distribution for the former standardization is right-skewed and for the latter left-skewed. Therefore, standardizations based on (19) and (20) were not further used here. Of course, the differences between empirical and nominal error level shrink with increasing sample size. But as we can see in Table 2, the chosen approach based on the standardization from (21) works quite well for sample sizes as small as 50. Furthermore, it is recognisable that the empirical error levels differ slightly between the scenarios but lie in an acceptable range.
FIGURE 3 Empirical distribution of the first component of M_{0,1} and its joint empirical distribution with other entries of the vectors of test statistics for Scenario 1 with a sample size of 50
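The confidence-interval breadths quoted here and in Section 8 are easy to verify: for a proportion p estimated from R independent simulation runs, the 95% Wald interval has breadth 2 · 1.96 · sqrt(p(1 − p)/R). A quick arithmetic check in Python:

```python
from math import sqrt

def ci_breadth(p, runs):
    """Breadth of the 95% Wald confidence interval for a proportion p
    estimated from `runs` independent simulation runs."""
    return 2 * 1.96 * sqrt(p * (1 - p) / runs)

# 100,000 runs at the nominal level 0.05 (this section) and
# 10,000 runs at the target power 0.8 (the case study in Section 8).
b_alpha = ci_breadth(0.05, 100_000)   # about 0.0027
b_power = ci_breadth(0.80, 10_000)    # about 0.016
```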
Additionally, we take a closer look at scenario 1 with a sample size of 50. Exemplarily, we show the histogram of results for the first entry of the vector M_{0,1} and scatterplots illustrating its joint distribution with the other entries of the vectors of test statistics in Figure 3. Concordance with the standard normal distribution and asymptotic independence already seem to take effect at this sample size. This is confirmed by the empirical Pearson correlation coefficients between the different entries of the vector of test statistics. They range between −0.0036 and 0.0007 and hence lie within the 95% confidence interval for a true correlation of zero. Also the tails of the marginal distributions agree to a great extent with the tails of the normal distribution, apart from a very slight left-skew.

Sample size and power
As mentioned in Section 6.4, the power of the procedure under alternatives can be determined via simulation. The supplementary R code provides simulation of trials with an arbitrary number of interim analyses for a set of sample sizes. The smallest sample size achieving the prespecified power can be determined this way to enable a sample size calculation. The results can be found in Table 3. In general, the maximum sample size is lower for the O'Brien-Fleming boundaries than for the Pocock boundaries. One can also observe that the maximum sample size for procedures employing the ℓ2-norm of the stage-wise test statistics is always lower than it would be under application of the ℓ∞-norm. Furthermore, one can see that this difference nearly vanishes if one of the hazard ratios is distinctly higher than the other one (see, e.g., the results for HR_01 = 1.7 and HR_12 = 1.15). While the sample sizes for all approaches are in large part driven by the size of HR_01, this effect is even more distinct for the ℓ∞-norm, whose sample size is hence more stable under small changes in HR_12. In contrast, the application of the ℓ2-norm has clear advantages if there are effects of a similar magnitude for both components.
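This qualitative pattern can be reproduced with a toy mean-shift model: for a bivariate standard normal statistic with mean μ, the ℓ2 rule wins when both components carry a similar effect, while the ℓ∞ rule catches up when the effect concentrates in one component. A hedged Python sketch (an illustrative shift model, not the trial simulation of the paper):

```python
import numpy as np
from scipy import stats

def power_norms(mu, alpha=0.05, n_sim=100_000, seed=11):
    """Monte Carlo power of level-alpha tests based on the l2-norm and
    the max-norm of Z + mu, where Z ~ N(0, I_2)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sim, 2)) + np.asarray(mu)
    crit_l2 = stats.chi2.isf(alpha, df=2)
    # max-norm boundary c with F_{chi2_1}(c^2)^2 = 1 - alpha
    crit_inf = np.sqrt(stats.chi2.ppf((1 - alpha) ** 0.5, df=1))
    p_l2 = np.mean(np.sum(x**2, axis=1) > crit_l2)
    p_inf = np.mean(np.max(np.abs(x), axis=1) > crit_inf)
    return float(p_l2), float(p_inf)
```

For μ = (2, 2) the ℓ2 test is clearly more powerful; for μ = (3.5, 0) the ordering reverses, mirroring the behavior of the two norms in Table 3.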

CASE STUDY
While the simulation study shows how our procedure enables the evaluation of PFS and OS as co-primary endpoints, we now also want to focus on the resulting advantage concerning interim analyses in such a study. Our new methodology permits data-dependent sample size recalculation based on all involved time-to-event endpoints, which is not covered by previous methodology.
Here, we illustrate an adaptive testing strategy using a specific multi-state progression-death model from Fleischer et al. (2009) for patients with non-small cell lung cancer. The joint distribution of PFS and OS is given by a time-homogeneous Markov model, that is, the form of the transition hazards corresponds to those given for scenario 1 in the previous section (see (27)). The parameters were estimated using data from a two-armed study; the arms were pooled due to their similarity. The estimates are given by λ_01 = 0.284, λ_02 = 0.075 and λ_12 = 0.128. As the model is time-homogeneous, we have ν_01 = ν_02 = ν_12 = 1. Please note that the time scale for this model is given in months.
We assume that this model properly represents the course of disease under the standard of care. In this case, the model resp. the joint distribution of PFS and OS resulting from it can be employed as a reference distribution in a fictional trial to investigate the effectiveness of a new treatment. The calendar time parameters of such a trial shall be fixed as follows: accrual period a = 24 with uniform accrual rate, fixed follow-up time f = 12 and analyses at t_1 = 18 and t_2 = 36. The expected PFS and OS event rates under the null hypothesis with this design are given by 63.4% resp. 41.6% at t_1 and 99.8% resp. 91.8% at t_2. In the study, the testing procedure presented here shall be applied to detect differences between the distribution of PFS and OS under the new treatment and under the standard of care. As in the previous section, the trial is designed for a type I error level of α = 5% and stage-wise p-values will be combined via the inverse normal combination function with equal weights. We will apply "two-sided" tests in terms of what we laid out in Section 6.3. As it turned out to be the most powerful method among those investigated in the simulation study in Section 7, we will determine stage-wise p-values via the ℓ2-norm of the stage-wise test statistics (see (24)) and use the sequential O'Brien-Fleming boundaries.
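The quoted expected event rates can be reproduced from the model: under uniform accrual over [0, a], the expected fraction of patients with an event by calendar time t is (1/a) ∫ F(t − u) du over the accrued entry times u, with F the PFS- resp. OS-event distribution of the time-homogeneous illness-death model (whose OS survival function has a well-known closed form). A Python check (the paper's code is in R; function names are ours):

```python
import numpy as np
from scipy.integrate import quad

# Transition intensities from Fleischer et al. (2009), in 1/months.
L01, L02, L12 = 0.284, 0.075, 0.128
L0 = L01 + L02

def surv_pfs(t):
    """PFS survival: no progression and no death by time t."""
    return np.exp(-L0 * t)

def surv_os(t):
    """Closed-form OS survival in the time-homogeneous illness-death model."""
    return np.exp(-L0 * t) + L01 / (L0 - L12) * (np.exp(-L12 * t) - np.exp(-L0 * t))

def expected_event_rate(surv, t, accrual=24.0):
    """Expected fraction of patients with an event by calendar time t,
    with uniform accrual over [0, accrual] and no other censoring."""
    upper = min(accrual, t)
    val, _ = quad(lambda u: 1.0 - surv(t - u), 0.0, upper)
    return val / accrual
```

This reproduces the rates stated above: 63.4%/41.6% (PFS/OS) at t_1 = 18 and 99.8%/91.8% at t_2 = 36.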
As in the simulation study, the group-sequential trial is planned under proportional hazards assumptions for the three transitions, with hazard ratios HR_01 = 1.5, HR_02 = 1 and HR_12 = 1.25. In order to reach a power of 80%, a sample size of 110 patients is needed. Obviously, if at least one of the hazard ratios is misspecified, the actual power may differ from 80%. Thus, we want to apply Theorem 5 and use the complete information about PFS- and OS-events available at calendar time t_1 to reschedule the remaining trial. The adaptation rule at the interim analysis is inspired by the rule applied in the NB2004-HR trial (NCT number 03042429, see Berthold et al., 2020) and is as follows: at the interim analysis, the length of the accrual period is recalculated. This period will always be followed by the follow-up period of 6 months. Recruitment will be continued for a minimum of 3 and a maximum of 30 months, which corresponds to a doubling of the originally planned duration. We consider 10 different options in between these boundary scenarios by increasing the period in steps of 3 months. For each of these options, the conditional power based on the interim estimates for the three (constant) transition hazards is computed (as in Section 7.4.2 of Wassmer & Brannath, 2016). The interim estimates for the three transitions are computed according to the maximum likelihood procedure given in Fleischer et al. (2009), which is also implemented in the supplementary R code. As before, the conditional power is evaluated via simulation and takes into account the complete information about progressions of active patients available at the time of the interim analysis. If at least one of the 10 options reaches a conditional power of 80%, we choose the shortest continuation among these options. If this is not the case, but the continuation by 30 months leads to a conditional power exceeding 50%, we choose this option.
If this is also not the case, we choose the shortest option, which implies a continuation of the accrual process by 3 months.
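The decision rule itself is easy to state in code. A Python sketch (the paper's implementation is in R; `cond_power` stands for the simulation-based conditional power evaluation described above and is assumed given):

```python
def choose_accrual_extension(cond_power, options=range(3, 33, 3),
                             target=0.80, fallback=0.50):
    """Interim adaptation rule: among accrual continuations of 3, 6, ...,
    30 months, pick the shortest one whose estimated conditional power
    reaches `target`; otherwise take the maximum continuation if it at
    least reaches `fallback`; otherwise stop recruitment as early as
    possible (3 months).  `cond_power` maps a continuation length in
    months to its estimated conditional power."""
    options = sorted(options)
    cps = {m: cond_power(m) for m in options}
    reaching = [m for m in options if cps[m] >= target]
    if reaching:
        return min(reaching)
    if cps[options[-1]] >= fallback:
        return options[-1]
    return options[0]
```

In practice, each `cond_power(m)` call is itself a (small) simulation, which is exactly why the adaptive design is computationally intensive to evaluate.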
Afterwards, we compared the outcomes of group-sequential and adaptive trials. For each parameter combination, 10,000 simulation runs of the group-sequential and the adaptive trial were executed. A simulation of the adaptive design is computationally intensive: for any simulated trial in which the null hypothesis is not rejected at the interim analysis, the conditional power of the 10 different rescheduling options needs to be determined via simulation as well. For these "inner" simulations, 100 runs were performed. In Figure 4 one can see the impact of the hazard ratios HR_01 and HR_12 on the stage-wise increments of the bivariate martingale M. These increments are displayed for four different settings (combinations of HR_01 ∈ {1, 1.7} and HR_12 ∈ {1, 1.35}), with the first-stage values in the left and the second-stage values in the right column. As described in Section 6.5, an increase of HR_01 leads to a shift (to the left) of the first component and an increase of HR_12 leads to a (downward) shift of the second component. Furthermore, it can be observed that the effect of a decreased risk of progression becomes obvious rather in the first stage. On the contrary, a decreased risk of death after progression is apparent rather in the second stage.
FIGURE 4 Empirical distribution of the bivariate martingale M under different alternative hypotheses in a group-sequential test of the joint distribution determined by Fleischer et al. (2009), with sample size and design parameters as laid out in Section 8
TABLE 4 Comparison of empirically attained power of group-sequential and adaptive designs for several deviations from the planning hypothesis; the originally planned sample size is set for hazard ratios HR_01 = 1.5 and HR_12 = 1.25. The upper value in each cell refers to the power of the group-sequential design, the lower value to the adaptive design. Numbers in brackets show the average duration of accrual in the respective scenario (in months)
This becomes clear as many patients experience a PFS-event during the first stage, while there are many patients in the progression state during the second stage. Additionally, for the first-stage results, we show how the ℓ2- and ℓ∞-rejection regions for the test statistics translate to the level of the martingale increments if the true covariance matrix (i.e., the matrix from Theorem 4) had been known. In practice, this is not the case and this matrix has to be estimated as explained in Section 4. For the second stage, such a visualization is not possible, as the rejection in the second stage depends on the outcomes of both stages.
Finally, we compared the empirical power of the two procedures. The results can be found in Table 4. The breadth of the 95%-confidence intervals for an underlying true value of 0.8 amounts to about 0.016. While the adaptive procedure attains the preplanned power if the underlying truth has been met by the planning hypothesis, its power does not increase more than the power of the group-sequential design if one of the hazard ratios is underestimated. Additionally, the mean duration of the accrual period and hence also the sample size are similar. On the other hand, some of the power which is lost for the group-sequential design can be recovered by the adaptive design. Especially if HR_01 is overestimated, values of HR_12 which are close to the value used in the planning hypothesis can help to show a difference in the joint distribution of PFS and OS. Furthermore, the adaptive procedure complies with the nominal type I error level (empirical type I error: 5.17%, 95%-confidence interval: [4.74%, 5.60%]).

DISCUSSION
We proposed a procedure to evaluate the performance of an experimental cohort in comparison with a fixed reference distribution regarding several time-to-event endpoints. It allows testing hypotheses on the multivariate survival function of the patients while making provision for the simultaneous use of interim data from all involved time-to-event endpoints for data-driven design modifications. These endpoints and their joint distribution have been embedded in a multi-state model. In comparison to other testing procedures for multi-state models, the presented method does not only focus on particular transitions, but evaluates the whole process with a focus on the relevant endpoints. It qualifies as a multivariate extension of the univariate independent increments approach for single-arm adaptive survival trials based on the one-sample log-rank test. As shown in Section 6.1, the methodology can be applied to relevant scenarios for confirmatory Phase IIa survival trials. If prognostic factors are available and incorporated in the reference distribution, these can of course be adjusted for by computing the martingales with hazards that are additionally conditioned on these factors. From a practical point of view, the bivariate testing problem reflects the requirements of early phase survival trials in oncology very well. In such trials, recommendation of a new treatment for further assessment is often based on simultaneous consideration of treatment efficacy and treatment toxicity. This naturally leads to a bivariate testing problem. If properly applied, our methodology yields several advantages. In the first place, it allows choosing multiple primary time-to-event endpoints. Especially our example from Section 2 addresses the broadly discussed question of whether to choose PFS or OS as the primary endpoint in oncological survival trials.
Beyond that, the result shows the performance of the cohort under consideration in several aspects and may help to detect structural differences compared to the reference model, as illustrated in Section 8. This is a valid point even if the procedure is applied in a non-sequential way, that is, with only one analysis. When used as a sequential adaptive procedure, design changes are allowed to be based on all included time-to-event endpoints. Hence, it also partially addresses the concern raised in Bauer and Posch (2004). Additionally, further refinements in view of the rejection region are possible. One could set different ways to determine the p-values in the different analyses in advance. In view of our case study (cf. Section 8) this could mean that we spend the whole type I error level provided for the first stage on PFS, as we may mostly observe progressions in this stage, and the type I error level provided for the last stage on OS, as we may mostly observe deaths at a later stage.
Choice of the reference distribution will typically be done in the light of historic data on standard of care. This entails a number of challenges. First, estimating a multivariate reference distribution from historic data is mathematically complex. However, several multivariate survival models and corresponding procedures for inference are available in the literature, especially for our main example from Section 2 concerning a joint model for PFS and OS. Apart from the estimation process, it is also essential to ensure that the chosen reference distribution is clinically valid and sensible. In this context, the common challenges in connection with historically controlled trials apply to our methodology as well: It is crucial that the composition of the historical control cohort (used for choosing the reference distribution) is consistent with the inclusion and exclusion criteria of the study in order to ensure that historic and study population are comparable (in particular concerning the distribution of key prognostic factors). Furthermore, sufficient data has to be available for the entire length of follow-up in the planned study in both the historic control group and the experimental treatment group. These conditions are often fulfilled, for example, in pediatric oncology, which is characterized by a great willingness of patients to participate in clinical trials (cf. Kaatsch, 2004, or Erdmann et al., 2020). This implies the existence of population-based data sets on standard of care and thus the best possible prerequisites for choosing a sensible reference distribution. However, even under such ideal conditions, comparisons with a historical reference should be interpreted with the necessary caution, as any superiority over a historical reference may reflect not only the therapeutic effect but also any advances in diagnostics and concomitant therapy. Thus, our methodology appears very attractive for early phase survival trials in oncology.
However, the results must be confirmed in a subsequent phase III and cannot replace a randomized comparison. For future research, we therefore aim for an extension of our methodology from a single-arm setting to a (randomized) multi-arm setting.
Apart from extensions toward a (randomized) multi-arm setting, our results could also be the basis to improve the univariate log-rank test by allowing interim decision making to be based on several (correlated) survival endpoints simultaneously. This is even closer to the scenario envisioned in Bauer and Posch (2004) and requires an estimation of the dependence, which only seems possible under additional assumptions. This makes a careful treatment of this particular problem necessary. Similar mathematical issues arise when study discontinuations occur while death might still be observable through administrative data. If the discontinuation occurs after progression, nothing needs to be changed. Otherwise, the OS-counting process for that particular patient needs to be compensated with the hazard conditional on the fact that progression has not occurred until the time of study discontinuation. The same conditional hazards need to be calculated in the aforementioned case.
Finally, we also addressed the problems arising for small sample sizes in classical one-sample log-rank tests. Deviations of the distribution of the classical one-sample log-rank statistic from the standard normal distribution for small sample sizes gain importance when carried over to our sequential and multivariate procedure. We extended the suggestion from Wu (2014) to the multivariate case, which yielded satisfying results for the scenarios considered in Section 7. However, it can still be deficient for either very short or very long follow-ups and could therefore be further improved.
For the future, we primarily strive to extend our methodology from a single-arm setting to a (randomized) multi-arm setting. We expect that the multivariate counting process martingale derived here can provide a basis for this.

A C K N O W L E D G E M E N T S
The authors thank the two anonymous reviewers whose suggestions helped improve this manuscript, especially in drawing a clear connection to application and applicability. The work of the corresponding author was funded by the German Science Foundation (Deutsche Forschungsgemeinschaft, DFG, grant number 413730122). The NB2004-HR trial was funded by a grant obtained by Frank Berthold from the Deutsche Krebshilfe (grant number 70107712).

C O N F L I C T O F I N T E R E S T
The authors have declared no conflict of interest.

D ATA AVA I L A B I L I T Y S TAT E M E N T
The data that supports the findings of this study are available in the supplementary material of this article.

O P E N R E S E A R C H B A D G E S
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section. This article has earned an open data badge "Reproducible Research" for making publicly available the code necessary to reproduce the reported results. The results reported in this article could fully be reproduced.

A.1
Computation of test statistics in the Phase IIa setting
Analogously to the procedure in Section 2, we present the computation of test statistics for the scenario presented in Section 6.1, which corresponds to a typical Phase IIa trial aiming for a simultaneous assessment of efficacy and toxicity. The following conditional hazards need to be given:

O R C I D
(1,{1,2,3}) the hazard for toxicity without any prior event (2,{1,2,3}) the hazard for a progression without any prior event (3,{1,2,3}) the hazard for death without any prior event ( From here on, the remaining steps which lead to stage-wise -values can be adapted from Section 2. While PFS need to be replaced by 1 resp. OS needs to be replaced by 2 in the computations (4) and (5), the following steps (6), (7) and (8) do not need any further adaptation. Once again, the standardization applied here is the extension of the approach from Wu (2014).
Please note that the effect of direct deaths without any prior event is only incorporated in the first component of the stage-wise test statistics. If it is desired otherwise, one might just interchange 1 and 2 .
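To make the setting concrete, the following sketch simulates the first transition of this Phase IIa multi-state model. The constant cause-specific hazards and their numerical values are purely illustrative assumptions, not values from the paper; the latent-exponential construction is one standard way to simulate competing events under constant hazards.

```python
import numpy as np

# Hypothetical constant cause-specific hazards for the initial state:
# 1 = toxicity, 2 = progression, 3 = death, each without any prior event.
HAZ = {1: 0.04, 2: 0.08, 3: 0.01}   # per month, illustrative values only

def simulate_first_transition(n, rng):
    """Simulate (first event time, cause) for n patients.

    With constant hazards, taking the minimum of independent exponential
    latent times reproduces the joint law of the first transition:
    P(cause = j) = HAZ[j] / sum(HAZ), and the first event time is
    exponential with rate sum(HAZ).
    """
    latent = {j: rng.exponential(1.0 / h, n) for j, h in HAZ.items()}
    t = np.minimum(np.minimum(latent[1], latent[2]), latent[3])
    cause = np.select([latent[1] == t, latent[2] == t], [1, 2], default=3)
    return t, cause

rng = np.random.default_rng(7)
t, cause = simulate_first_transition(100_000, rng)
```

Simulated data of this form can then be fed into the stage-wise test statistics described above.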

A.2
Computation of stage-wise p-values
As indicated in Section 5, there are many options to derive p-values from the stage-wise test statistics. If d different events are assessed, we obtain a standard normally distributed random vector of length d after each stage. We fix the stage and call this vector Z. To derive a p-value from that vector, we suggest applying a function f : ℝ^d → [0, ∞), calculating the distribution of f(Z) and computing the p-value for an observed value z by
p = ℙ(f(Z) ≥ f(z)).
In our opinion, f should at least be a seminorm to yield a meaningful test. Also, for the sake of applicability, the distribution of f(Z) should be easy to calculate. From this point of view, choosing f as the L²-norm or the L^∞-norm on ℝ^d seems suitable.
For the former we have
p = 1 − F_{χ²_d}(‖z‖₂²),
where F_{χ²_d} denotes the distribution function of a χ²_d-distributed random variable. The last equation holds because ∑_{i=1}^d Z_i² ∼ χ²_d. For the latter we have
p = 1 − ℙ(max_i |Z_i| ≤ ‖z‖_∞) = 1 − (F_{χ²_1}(‖z‖_∞²))^d,
where we used that the components of Z are i.i.d. and standard normally distributed and that Z_i² ∼ χ²_1.
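Both choices of f can be implemented in a few lines; a minimal sketch (the stage-wise vector z is hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def p_value_l2(z):
    """Stage-wise p-value with f = L2-norm: ||Z||_2^2 ~ chi-squared, d df."""
    z = np.asarray(z, dtype=float)
    return chi2.sf(np.sum(z**2), df=z.size)

def p_value_sup(z):
    """Stage-wise p-value with f = sup-norm: P(max|Z_i| <= m) = F_{chi2_1}(m^2)^d."""
    z = np.asarray(z, dtype=float)
    m2 = np.max(np.abs(z)) ** 2
    return 1.0 - chi2.cdf(m2, df=1) ** z.size

# Hypothetical stage-wise vector for d = 2 endpoints:
p2, psup = p_value_l2([2.1, -0.4]), p_value_sup([2.1, -0.4])
```

For d = 1 both functions reduce to the usual two-sided p-value of a standard normal statistic, which is a convenient sanity check.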

A.3
Computation of conditional hazards from joint densities and joint distribution functions
As indicated, we want our procedure to be applicable not only to standard multi-state model approaches such as time-homogeneous, time-inhomogeneous or semi-Markov models, but also to other modeling approaches for multivariate survival times using copulas or frailties. In these cases, the conditional hazards which we need are not given directly. All we are given in the first place is the joint distribution function or the joint density of the random vector T. In the following we therefore provide the formulas to compute the conditional hazards from joint distribution functions or densities. These formulas rely on the computation of regular versions of conditional distributions. A proof of the correctness of the following formulas is available from the authors on request.
The expression appearing here is a nesting of the notation introduced above: the j-th entry of the vector takes the first argument, the entries whose index is contained in B take the second, and the entries whose index is not contained in B take the third. Here B^c denotes the complement of B in the set of events {1, …, d} and, for a set A ⊂ {1, …, d}, F_A indicates the joint distribution function of those components of T whose index is in A. Correspondingly, t|_A denotes the projection of the d-dimensional vector t to its components whose index is in A. Although these formulas may appear a bit unwieldy, it is not very difficult to apply them to standard copula or frailty models.
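As a concrete bivariate illustration of extracting a conditional hazard from a joint distribution, the following sketch (copula choice, marginal hazards, and function names are all hypothetical assumptions, not from the paper) specifies the joint survival function via a Clayton survival copula with exponential margins and approximates the conditional hazard of event 1, given that neither event has occurred by time t, with a finite difference.

```python
import numpy as np

L1, L2 = 0.05, 0.08   # hypothetical exponential marginal hazards

def clayton(u, v, theta):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def joint_survival(t1, t2, theta):
    """S(t1, t2) = P(T1 > t1, T2 > t2) via a Clayton survival copula."""
    return clayton(np.exp(-L1 * t1), np.exp(-L2 * t2), theta)

def hazard_1_no_event(t, theta, eps=1e-6):
    """Conditional hazard of event 1 given no event up to time t:
       lambda_1(t) = -[d/dt1 S(t1, t)]_{t1 = t} / S(t, t),
       with the partial derivative approximated by a central difference."""
    dS = (joint_survival(t + eps, t, theta)
          - joint_survival(t - eps, t, theta)) / (2 * eps)
    return -dS / joint_survival(t, t, theta)
```

As theta approaches 0 the Clayton copula approaches independence, in which case the conditional hazard reduces to the marginal hazard L1, which provides a simple check of the construction.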
Clearly, (𝟙{T_{(j,B)} ≤ t})_{t≥0} is an (ℱ_t)_{t≥0}-submartingale. As (R1) holds, all requirements of the Doob–Meyer decomposition theorem are fulfilled (see, e.g., Karatzas & Shreve, 1998). Hence, there exists an increasing (ℱ_t)_{t≥0}-predictable process (Ã_{(j,B)}(t))_{t≥0} such that 𝟙{T_{(j,B)} ≤ ·} − Ã_{(j,B)} is an (ℱ_t)_{t≥0}-martingale. As in Karatzas and Shreve (1998), we will identify this process by a discrete approximation. For a fixed t ≥ 0 we define the partitions Π_n. Please note that we only integrate with respect to the first argument of the corresponding hazard function, while the reference times, that is, the information given concerning the events, remain fixed. The resulting process is also measurable with respect to the filtration (𝒢_t)_{t≥0}, where for any t ≥ 0 the σ-algebra 𝒢_t is generated by {…}. As 𝒢_t is a sub-σ-algebra of ℱ_t, one can apply the tower property to see that it is now also a (𝒢_t)_{t≥0}-martingale. Now, ℱ_t is in turn a sub-σ-algebra of 𝒢_t for any t ≥ 0. This can be seen by rewriting the indicator functions from the generators of the respective σ-algebras.
From (11) and (12), it follows that there is ℙ-almost surely no event happening after T_{(j,B)}. This can be seen from the preceding display, where the second factor of the integrand is the density of the distribution of T_{(j,B)} restricted to (0, ∞) and the first factor can be rewritten accordingly. The resulting process is, for all t ≥ 0, indistinguishable from the process (Ã_{(j,B)}(t))_{t≥0} mentioned above. Together with 𝒢_t being a sub-σ-algebra of ℱ_t for any t ≥ 0, this implies that (M_{(j,B)}(t))_{t≥0} is an (ℱ_t)_{t≥0}-martingale. □
Proof of Corollary 2. For any (j, B), (M_{(j,B)}(t))_{t≥0} is an (ℱ_t)_{t≥0}-martingale. Additionally, min T_{(j,B)} ∧ t is an (ℱ_t)_{t≥0}-stopping time. Hence, the result follows from the Optional Stopping Theorem. □
Proof of Theorem 3. Let B ⊂ {1, …, d} be fixed. We now need to incorporate the censoring induced by calendar time s, that is, the censoring variable of patient i is now C_i ∧ (s − R_i)₊, with R_i the calendar time of recruitment.