RMST-based multiple contrast tests in general factorial designs

Several methods in survival analysis are based on the proportional hazards assumption. However, this assumption is very restrictive and often not justifiable in practice. Therefore, effect estimands that do not rely on the proportional hazards assumption are highly desirable in practical applications. One popular example for this is the restricted mean survival time (RMST). It is defined as the area under the survival curve up to a prespecified time point and, thus, summarizes the survival curve into a meaningful estimand. For two-sample comparisons based on the RMST, previous research found the inflation of the type I error of the asymptotic test for small samples and, therefore, a two-sample permutation test has already been developed. The first goal of the present paper is to further extend the permutation test for general factorial designs and general contrast hypotheses by considering a Wald-type test statistic and its asymptotic behavior. Additionally, a groupwise bootstrap approach is considered. Moreover, when a global test detects a significant difference by comparing the RMSTs of more than two groups, it is of interest which specific RMST differences cause the result. However, global tests do not provide this information. Therefore, multiple tests for the RMST are developed in a second step to infer several null hypotheses simultaneously. Hereby, the asymptotically exact dependence structure between the local test statistics is incorporated to gain more power. Finally, the small sample performance of the proposed global and multiple testing procedures is analyzed in simulations and illustrated in a real data example.

in factorial designs, are desired. Beyond the average hazard ratio, 2,3 concordance and Mann-Whitney effect, [4][5][6] and the median survival time, [7][8][9] the restricted mean survival time (RMST) is becoming increasingly popular and does not rely on the proportional hazards assumption. 10 The RMST is defined as the area under the survival curve up to a prespecified time point and it has an intuitive interpretation as the expected minimum of the survival time and the specified time point. Thus, the RMST reduces the whole survival curve to a meaningful estimand. The asymptotic behavior and statistical inference of the RMST have already been considered in the literature. Horiguchi and Uno 11 detected an inflation of the type I error of the asymptotic test for small samples in the two-sample case and, therefore, proposed an unstudentized permutation approach under exchangeability. Ditzhaus et al 12 extended this approach by developing a studentized permutation test, such that different censoring distributions in the two groups can be handled. A similar approach has been further analyzed in the context of cure models, in both non- and semiparametric models. 13 Such studentized permutation tests could be of interest for more complex factorial designs or more general linear hypotheses in practice, for example, when more than two different treatments are to be compared in a clinical study. Thus, our first aim is the extension of the studentized permutation test of Ditzhaus et al 12 to general factorial designs and general linear hypotheses by employing a Wald-type test statistic. Furthermore, other resampling methods such as the groupwise and, in the supplement, the wild bootstrap are considered for this general setup.
On the other hand, when a global test detects a significant result by comparing the RMSTs of more than two groups, it is of interest which particular RMSTs differ significantly. Unfortunately, global tests do not yield this information. Therefore, multiple linear hypothesis testing (MLHT) procedures are desired. They indicate which of the local hypotheses are rejected in addition to the global one. Moreover, their power is not necessarily lower than the power of a global testing procedure. 14 To the best of our knowledge, the MLHT problem for the RMST in general factorial designs has not been tackled in the literature up until now. The present paper shall fill this gap. For gaining more power, we aim to take the exact asymptotic dependency structure between the different test statistics into account. In order to improve the small sample performance, we propose a groupwise bootstrap procedure for approximating the limiting null distribution and we show its validity.
The remainder of this article is organized as follows. Section 2 includes four subsections. In Section 2.1, the factorial survival setup and a general contrast testing problem are presented. For this testing problem, a suitable test statistic is defined and studied in Section 2.2. The studentized permutation approach of Ditzhaus et al 12 is extended to more general factorial designs in Section 2.3. Furthermore, a groupwise bootstrap procedure is investigated in Section 2.4. In Section 3, multiple contrast tests are constructed and the consistency of the groupwise bootstrap in this setup is shown. Additionally, several subtopics of practical interest are covered in Section 3. This includes the calculation of adjusted P-values, the construction of simultaneous confidence regions, a stepwise extension of the multiple testing procedure and simultaneous non-inferiority and equivalence tests. The small sample performance of the proposed tests is analyzed in extensive simulation studies in Section 4. In Section 5, we illustrate the proposed methodologies by analyzing a real data example. Finally, the results are discussed in Section 6. Detailed simulation results, a wild bootstrap approach and all technical proofs can be found in the supplement.

STATISTICAL INFERENCE
In this section, the factorial survival setup and statistical methodologies for the global testing problem are presented.

Factorial survival setup
We consider the following factorial design as in Ditzhaus et al, 9 that is, a $k$-sample setup, $k \in \mathbb{N}$. We suppose that the survival and censoring times
$$T_{ij} \sim S_i, \qquad C_{ij} \sim G_i, \qquad j \in \{1, \dots, n_i\},\ i \in \{1, \dots, k\},$$
respectively, are mutually independent. Here, $S_i$ and $G_i$ denote the survival functions of the survival and censoring times, respectively, and $n_i \in \mathbb{N}$ represents the number of individuals in group $i$ for all $i \in \{1, \dots, k\}$. Of note, we do not assume the continuity of the survival functions. Consequently, ties in the data are explicitly allowed. However, we assume that the $S_i$ do not have jumps of size 1, that is, the survival times are not deterministic. Moreover, we define the right-censored observable event times $X_{ij} := \min\{T_{ij}, C_{ij}\}$ and the censoring status $\delta_{ij} := \mathbb{1}\{X_{ij} = T_{ij}\}$ for all $j \in \{1, \dots, n_i\}$, $i \in \{1, \dots, k\}$. Furthermore, we assume that the group sizes do not vanish asymptotically, that is, $n_i/n \to \kappa_i \in (0, 1)$ as $n \to \infty$ for all $i \in \{1, \dots, k\}$, where $n := \sum_{i=1}^{k} n_i$ represents the total sample size. The RMST of group $i$ is defined as
$$\mu_i := \int_0^{\tau} S_i(t)\,\mathrm{d}t$$
for all $i \in \{1, \dots, k\}$. Here, $\tau > 0$ should be a pre-specified constant such that $P(X_{i1} \ge \tau) = P(T_{i1} \ge \tau)P(C_{i1} \ge \tau) > 0$ and $P(T_{i1} < \tau) > 0$ holds for all $i \in \{1, \dots, k\}$. By replacing $S_i$ with the Kaplan-Meier estimator $\hat{S}_i$, a natural estimator for the RMST of group $i$ is
$$\hat{\mu}_i := \int_0^{\tau} \hat{S}_i(t)\,\mathrm{d}t.$$
Let $\mu := (\mu_1, \dots, \mu_k)'$ be the vector of the RMSTs and $\hat{\mu} := (\hat{\mu}_1, \dots, \hat{\mu}_k)'$ be the vector of their estimators. In addition, let $r \in \mathbb{N}$, $c \in \mathbb{R}^r$ be a fixed vector and $H \in \mathbb{R}^{r \times k}$ be a contrast matrix, that is, $H 1_k = 0_r$, where here and throughout $1_k \in \mathbb{R}^k$ and $0_r \in \mathbb{R}^r$ denote the vectors of ones and zeros, respectively. Moreover, we assume that $\mathrm{rank}(H) > 0$. Then, we consider the null and alternative hypothesis
$$\mathcal{H}_0: H\mu = c \quad \text{vs.} \quad \mathcal{H}_1: H\mu \neq c. \qquad (2)$$
The formulation of this testing framework is very general. In particular, it includes the null hypothesis of equal RMSTs in all groups by choosing, for example, $c = 0_k$ and the Grand-mean-type contrast matrix 15 $H := P_k := I_k - J_k/k$. Here, $I_k \in \mathbb{R}^{k \times k}$ represents the unit matrix and $J_k := 1_k 1_k' \in \mathbb{R}^{k \times k}$ represents the matrix of ones. Moreover, by splitting up indices, different kinds of factorial structures can be covered. For example, in a two-way design with factors A ($a$ levels) and B ($b$ levels), we set $k := ab$ and split up the group index $i$ in two subindices $(i_1, i_2) \in \{1, \dots, a\} \times \{1, \dots, b\}$. Then, hypotheses about no main or interaction effect can be formulated by choosing $c$ as the zero vector and one of the following contrast matrices, respectively:
$$H_A := P_a \otimes (1_b'/b), \qquad H_B := (1_a'/a) \otimes P_b, \qquad H_{AB} := P_a \otimes P_b.$$
Here, $\otimes$ represents the Kronecker product. Higher-way designs or hierarchically nested layouts can be incorporated similarly as in Pauly et al. 16
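The two-way contrast matrices can be illustrated numerically. The following is a minimal sketch, assuming NumPy and assuming that the $k = ab$ groups are ordered with the level of factor B varying fastest; the function names `centering_matrix` and `two_way_contrasts` are hypothetical and not part of the paper.

```python
import numpy as np

def centering_matrix(k):
    """Grand-mean-type contrast matrix P_k = I_k - J_k / k."""
    return np.eye(k) - np.ones((k, k)) / k

def two_way_contrasts(a, b):
    """Contrast matrices for main effect A, main effect B and interaction AB.

    Assumes groups are ordered as (1,1), (1,2), ..., (1,b), (2,1), ..., (a,b).
    """
    P_a, P_b = centering_matrix(a), centering_matrix(b)
    mean_a = np.ones((1, a)) / a          # row vector 1_a' / a
    mean_b = np.ones((1, b)) / b          # row vector 1_b' / b
    H_A = np.kron(P_a, mean_b)            # averages over B, centers over A
    H_B = np.kron(mean_a, P_b)            # averages over A, centers over B
    H_AB = np.kron(P_a, P_b)              # centers over both factors
    return H_A, H_B, H_AB

H_A, H_B, H_AB = two_way_contrasts(2, 2)
# Each matrix is a contrast matrix: its rows sum to zero.
print(H_A @ np.ones(4), H_B @ np.ones(4), H_AB @ np.ones(4))
```

In the 2-by-2 case each of the three matrices has rank 1, so each local hypothesis corresponds to a single degree of freedom.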

The Wald-type test statistic and its asymptotic behavior
In this section, a suitable test statistic for the testing problem (2) is constructed and its asymptotic behavior is studied. First of all, let us introduce some notation.
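Then, we define the Wald-type test statistic for the testing problem (2) as
$$W_n(H, c) := n\,(H\hat{\mu} - c)'\,(H\hat{\Sigma}H')^{+}\,(H\hat{\mu} - c),$$
where $A^{+}$ denotes the Moore-Penrose inverse of a matrix $A$ and $\hat{\Sigma} := \mathrm{diag}(\hat{\sigma}_1^2, \dots, \hat{\sigma}_k^2)$, with $\hat{\sigma}_i^2$ as in (3) being an estimator of the asymptotic variance of $\sqrt{n}(\hat{\mu}_i - \mu_i)$ for all $i \in \{1, \dots, k\}$. 12 Here and throughout, $\Delta M = M - M_{-}$ denotes the increment and $M_{-}$ denotes the left-continuous version of a monotone function $M$, and we use the convention $0/0 := 0$.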
The following theorem provides the asymptotic distribution of the Wald-type test statistic.
Theorem 1. Under the null hypothesis in (2), we have $W_n(H, c) \xrightarrow{d} \chi^2_{\mathrm{rank}(H)}$ as $n \to \infty$.
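To make the estimator and the test statistic concrete, the following is a minimal sketch in Python with NumPy. It assumes that the Wald-type statistic is the quadratic form in $H\hat{\mu} - c$ with a Moore-Penrose inverse, and it uses a Greenwood-type plug-in estimate of $\mathrm{Var}(\hat{\mu}_i)$, which is an assumption of this sketch and not necessarily identical to the estimator in (3). Because the variance is estimated on the scale of $\hat{\mu}_i$ rather than $\sqrt{n}\hat{\mu}_i$, the factor $n$ does not appear explicitly. The function names are hypothetical.

```python
import numpy as np

def rmst_and_variance(time, event, tau):
    """Kaplan-Meier RMST on [0, tau] and a Greenwood-type estimate of Var(mu_hat)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    grid = np.unique(time[event & (time < tau)])            # distinct event times before tau
    at_risk = np.array([(time >= t).sum() for t in grid])
    deaths = np.array([((time == t) & event).sum() for t in grid])
    surv = np.cumprod(1.0 - deaths / at_risk)               # Kaplan-Meier at each event time
    knots = np.concatenate(([0.0], grid, [tau]))
    steps = np.concatenate(([1.0], surv))                   # value of S_hat on each interval
    areas = steps * np.diff(knots)
    mu_hat = float(areas.sum())
    tail_area = np.cumsum(areas[::-1])[::-1][1:]            # area under S_hat from event time to tau
    denom = at_risk * (at_risk - deaths)
    terms = np.divide(tail_area**2 * deaths, denom,
                      out=np.zeros_like(tail_area), where=denom > 0)  # convention 0/0 := 0
    return mu_hat, float(terms.sum())

def wald_statistic(mu_hat, var_hat, H, c):
    """Quadratic form (H mu - c)' (H V H')^+ (H mu - c) and rank(H)."""
    H = np.atleast_2d(np.asarray(H, dtype=float))
    diff = H @ np.asarray(mu_hat, dtype=float) - np.asarray(c, dtype=float)
    cov = H @ np.diag(var_hat) @ H.T
    return float(diff @ np.linalg.pinv(cov) @ diff), int(np.linalg.matrix_rank(H))

mu1, v1 = rmst_and_variance([1, 2, 3, 4], [1, 1, 1, 1], tau=5.0)
mu2, v2 = rmst_and_variance([2, 3, 4, 6], [1, 0, 1, 1], tau=5.0)
stat, df = wald_statistic([mu1, mu2], [v1, v2], H=[[1.0, -1.0]], c=[0.0])
print(mu1, mu2, stat, df)
```

Under the null hypothesis, `stat` would be compared with a quantile of the chi-squared distribution with `df` degrees of freedom, or with a resampling quantile as in the following sections.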

Studentized permutation test
For two-sample comparisons, Horiguchi and Uno 11 pointed out that RMST-based tests derived from asymptotic methods have an inflated type I error. Hence, we aim to improve the type I error control by extending the studentized permutation approach of Ditzhaus et al 12 to the present general factorial design setting. In the already treated two-sample case, the approach has the advantage that it also works asymptotically without the assumption of exchangeable data. In this section, we will transfer these good properties to general factorial designs to construct a resampling-based test that serves as an alternative for (4). For this purpose, let $(X, \delta) := (X_{ij}, \delta_{ij})_{j \in \{1, \dots, n_i\}, i \in \{1, \dots, k\}}$ denote the observed data and $(X^{\pi}, \delta^{\pi}) := (X^{\pi}_{ij}, \delta^{\pi}_{ij})_{j \in \{1, \dots, n_i\}, i \in \{1, \dots, k\}}$ be the permuted version. That is, the groups of the original data are randomly shuffled in the sense that the data pairs $(X_{ij}, \delta_{ij})$ are permuted. In the following, we denote the permutation counterparts of the statistics $\hat{\mu}$ and $\hat{\Sigma}$ defined in the previous sections with a superscript $\pi$: $\hat{\mu}^{\pi}$ and $\hat{\Sigma}^{\pi}$. Then, we define the permutation counterpart of the Wald-type test statistic as
$$W_n^{\pi}(H) := n\,(H\hat{\mu}^{\pi})'\,(H\hat{\Sigma}^{\pi}H')^{+}\,H\hat{\mu}^{\pi}.$$
Since we do not have convergence in distribution of this statistic for all observations in the conditional space, let $\xrightarrow{d^*}$ denote conditional convergence in distribution in probability given the data $(X, \delta)$. This means that the conditional distribution converges in probability. Equivalently, this convergence can be stated via uniform convergence of the conditional distribution function, as in the following theorem.
To this end, let $\xrightarrow{P}$ denote convergence in probability.
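Theorem 2. Under both hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$, we have
$$W_n^{\pi}(H) \xrightarrow{d^*} \chi^2_{\mathrm{rank}(H)} \qquad (5)$$
as $n \to \infty$. Mathematically, (5) means
$$\sup_{z \in \mathbb{R}} \left| P\left(W_n^{\pi}(H) \le z \mid (X, \delta)\right) - P(Z \le z) \right| \xrightarrow{P} 0$$
as $n \to \infty$, where $Z \sim \chi^2_{\mathrm{rank}(H)}$. From this result, we can construct a permutation test $\varphi_n^{\pi} := \mathbb{1}\{W_n(H, c) > q^{\pi}_{1-\alpha}\}$, where $q^{\pi}_{1-\alpha}$ denotes the $(1-\alpha)$-quantile of the conditional distribution of $W_n^{\pi}(H)$ given $(X, \delta)$. Lemma 1 of Janssen and Pauls 17 ensures that $\varphi_n^{\pi}$ is asymptotically valid.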
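In practice, the conditional quantile of the permutation statistic is approximated by Monte Carlo. The following is a minimal sketch of this loop in Python with NumPy. The callable `statistic` is a placeholder for the studentized Wald-type statistic and is an assumption of this sketch, as is the common `(exceed + 1) / (n_perm + 1)` form of the Monte Carlo P-value; neither is taken from the paper. Note that the pairs $(X_{ij}, \delta_{ij})$ are kept together and only their group membership is shuffled.

```python
import numpy as np

def permutation_pvalue(statistic, time, event, group, n_perm=1999, seed=0):
    """Monte Carlo permutation P-value.

    statistic(time, event, group) -> float, larger values speak against the null.
    """
    rng = np.random.default_rng(seed)
    time, event, group = map(np.asarray, (time, event, group))
    observed = statistic(time, event, group)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(group))
        # Shuffle the (time, event) pairs jointly across the fixed group labels.
        if statistic(time[perm], event[perm], group) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy usage with a placeholder statistic (absolute difference of mean observed times).
toy_stat = lambda t, e, g: abs(t[g == 0].mean() - t[g == 1].mean())
group = np.array([0] * 10 + [1] * 10)
time = np.concatenate([np.arange(1.0, 11.0), np.arange(101.0, 111.0)])
event = np.ones(20, dtype=int)
print(permutation_pvalue(toy_stat, time, event, group, n_perm=199))
```

Rejecting when the P-value is at most $\alpha$ corresponds to comparing the observed statistic with an empirical quantile of the permutation distribution.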

Groupwise bootstrap test
Another possible solution for approximating the limiting distribution is the groupwise bootstrap.An advantage over the studentized permutation approach is that the groupwise bootstrap can mimic the different variance structures in the groups.This ensures that the groupwise bootstrap is also applicable for the multiple testing problem, see Section 3.
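For the groupwise bootstrap, the bootstrap observations are drawn randomly with replacement from the observations of the corresponding group, that is, $(X^*_{ij}, \delta^*_{ij})$, $j \in \{1, \dots, n_i\}$, are drawn with replacement from $(X_{i1}, \delta_{i1}), \dots, (X_{in_i}, \delta_{in_i})$ for each $i \in \{1, \dots, k\}$. Then, we denote the groupwise bootstrap counterparts of the statistics $\hat{\mu}$ and $\hat{\Sigma}$ defined in Section 2.2 with a superscript $*$: $\hat{\mu}^*$ and $\hat{\Sigma}^*$. The groupwise bootstrap test statistic is defined by
$$W_n^*(H) := n\,\left(H(\hat{\mu}^* - \hat{\mu})\right)'\,(H\hat{\Sigma}^*H')^{+}\,\left(H(\hat{\mu}^* - \hat{\mu})\right).$$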
The following theorem provides the consistency of the groupwise bootstrap.
Theorem 3. Under both hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$, we have $W_n^*(H) \xrightarrow{d^*} \chi^2_{\mathrm{rank}(H)}$ as $n \to \infty$.
Hence, we obtain a groupwise bootstrap test $\varphi_n^* := \mathbb{1}\{W_n(H, c) > q^*_{1-\alpha}\}$, where $q^*_{1-\alpha}$ denotes the $(1-\alpha)$-quantile of the conditional distribution of $W_n^*(H)$ given $(X, \delta)$. By Lemma 1 in Janssen and Pauls, 17 $\varphi_n^*$ is an asymptotically valid level-$\alpha$ test. Note that we do not need the property that $H$ is a contrast matrix in the proofs of Theorems 1 and 3. Hence, the groupwise bootstrap test is also valid for general matrices $H \in \mathbb{R}^{r \times k}$ with $\mathrm{rank}(H) > 0$.
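The resampling step of the groupwise bootstrap can be sketched as follows in Python with NumPy. The sketch assumes that the bootstrap statistic is centered at the original estimates, and it takes the bootstrap RMST estimates and their variance estimates as given inputs; the function names `groupwise_bootstrap_indices` and `centered_wald` are hypothetical.

```python
import numpy as np

def groupwise_bootstrap_indices(group, rng):
    """Draw n_i indices with replacement within each group i."""
    group = np.asarray(group)
    idx = np.empty(len(group), dtype=int)
    for g in np.unique(group):
        members = np.flatnonzero(group == g)
        # Resample only among the observations of group g, keeping the group size n_i.
        idx[members] = rng.choice(members, size=len(members), replace=True)
    return idx

def centered_wald(mu_star, var_star, mu_hat, H):
    """Bootstrap quadratic form, centered at the original estimates mu_hat."""
    H = np.atleast_2d(np.asarray(H, dtype=float))
    diff = H @ (np.asarray(mu_star, dtype=float) - np.asarray(mu_hat, dtype=float))
    cov = H @ np.diag(var_star) @ H.T
    return float(diff @ np.linalg.pinv(cov) @ diff)

rng = np.random.default_rng(42)
group = np.array([0, 0, 0, 1, 1, 1, 1])
idx = groupwise_bootstrap_indices(group, rng)
print(idx, group[idx])   # resampled observations keep their group labels
```

Applying `idx` jointly to the observed times and censoring indicators yields one bootstrap sample; repeating this $B$ times and evaluating the centered statistic gives the empirical bootstrap distribution.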

MULTIPLE TESTS
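Let us now interpret the contrast matrix $H$ as a partitioned matrix $H = [H_1', \dots, H_L']'$ with blocks $H_\ell \in \mathbb{R}^{r_\ell \times k}$ and, accordingly, $c = (c_1', \dots, c_L')'$ with $c_\ell \in \mathbb{R}^{r_\ell}$, and assume $\mathrm{rank}(H_\ell) > 0$ for all $\ell \in \{1, \dots, L\}$. In this section, we aim to construct a testing procedure for the multiple testing problem with null and alternative hypotheses
$$\mathcal{H}_{0,\ell}: H_\ell\mu = c_\ell \quad \text{vs.} \quad \mathcal{H}_{1,\ell}: H_\ell\mu \neq c_\ell, \qquad \ell \in \{1, \dots, L\}. \qquad (6)$$
Thereby, we aim to incorporate the asymptotically exact dependence structure between the test statistics of the $L$ local tests to gain more power than, for example, by using a Bonferroni-correction.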
Example 1. A global null hypothesis which is of interest in many applications is the equality of the RMSTs, that is, $\mathcal{H}_0: \mu_1 = \dots = \mu_k$. However, there are different possible choices of the contrast matrix $H$ which lead to this global null hypothesis. 14 A popular choice is the Grand-mean-type contrast matrix $P_k$ as introduced in Section 2.1, where the RMSTs of the different groups are compared with the overall mean of the RMSTs $\bar{\mu} := \frac{1}{k}\sum_{i=1}^{k}\mu_i$ for the different contrasts, respectively. Many-to-one comparisons can be considered by choosing the Dunnett-type contrast matrix 18 $H := [-1_{k-1}, I_{k-1}] \in \mathbb{R}^{(k-1) \times k}$ and $c = 0_{k-1}$, where the RMSTs $\mu_2, \dots, \mu_k$ are compared to the RMST $\mu_1$ of the first group regarding the different contrasts. In order to compare all pairs of RMSTs, the Tukey-type contrast matrix $H \in \mathbb{R}^{k(k-1)/2 \times k}$, whose rows are $(e_j - e_i)'$ for $1 \le i < j \le k$, and $c = 0_{k(k-1)/2}$ can be used, where $e_\ell \in \mathbb{R}^k$ denotes the $\ell$th unit vector. An overview of different contrast tests can be found in Bretz et al. 20 Furthermore, the choice of the considered partition of the matrix $H = [H_1', \dots, H_L']'$ and, therefore, the resulting local hypotheses depend on the question of interest. This general formulation of the multiple testing problem covers the post-hoc testing problem and includes, for example, the local null hypotheses $\mathcal{H}_{0,\ell}: \mu_\ell = \bar{\mu}$, $\ell \in \{1, \dots, k\}$, obtained from the rows $H_\ell = e_\ell' P_k$ of the Grand-mean-type contrast matrix. Analogously, we can perform many-to-one comparisons and all-pair comparisons of the RMSTs simultaneously by considering the $r$ rows of the Dunnett-type and Tukey-type contrast matrix, respectively, as blocks $H_1, \dots, H_r$.
Furthermore, the formulation of this testing problem allows to perform multiple tests with more than one contrast matrix simultaneously.In a two-way design, we may choose H 1 = H A , H 2 = H B , and H 3 = H AB as introduced in Section 2.1, for example.This allows for simultaneous testing of the factors A and B and their interaction.
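The contrast matrices of Example 1 are easy to generate programmatically. The following is a minimal sketch in Python with NumPy, assuming the sign convention that each row encodes a difference "later group minus earlier group"; the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def dunnett_matrix(k):
    """Many-to-one comparisons: groups 2, ..., k against group 1."""
    return np.hstack([-np.ones((k - 1, 1)), np.eye(k - 1)])

def tukey_matrix(k):
    """All-pairs comparisons: one row e_j' - e_i' for each pair i < j."""
    rows = []
    for i, j in combinations(range(k), 2):
        row = np.zeros(k)
        row[i], row[j] = -1.0, 1.0
        rows.append(row)
    return np.vstack(rows)

print(dunnett_matrix(4))
print(tukey_matrix(4))   # 6 rows for k = 4 groups
```

Treating every row as its own block $H_\ell$ turns the global hypothesis of equal RMSTs into $k - 1$ (Dunnett-type) or $k(k-1)/2$ (Tukey-type) local hypotheses.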
For all local hypotheses in (6), we can calculate the Wald-type test statistics $W_n(H_\ell, c_\ell)$, $\ell \in \{1, \dots, L\}$. Since we aim to use the asymptotically exact dependence structure of the test statistics, we have to investigate the joint asymptotic behavior.
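Therefore, let $Z \sim \mathcal{N}_k(0_k, \Sigma)$ with $\Sigma := \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)$ in the following, where here and throughout $\sigma_i^2$ denotes the asymptotic variance of $\sqrt{n}(\hat{\mu}_i - \mu_i)$. In Section S.5 of the supplement of Ditzhaus et al, 12 it is shown that $\sigma_i^2$ is the almost sure limit of (3) for all $i \in \{1, \dots, k\}$.
Theorem 4. Under the null hypotheses in (6), we have
$$\left(W_n(H_\ell, c_\ell)\right)_{\ell \in \{1, \dots, L\}} \xrightarrow{d} \left((H_\ell Z)'(H_\ell \Sigma H_\ell')^{+} H_\ell Z\right)_{\ell \in \{1, \dots, L\}} \qquad (7)$$
as $n \to \infty$.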
Note that $\Sigma$ is generally unknown such that we do not know the exact asymptotic joint limiting distribution of $(W_n(H_\ell, c_\ell))_{\ell \in \{1, \dots, L\}}$. Unfortunately, we cannot use the studentized permutation approach for approximating the joint limiting distribution. That is because
$$\left(W_n^{\pi}(H_\ell)\right)_{\ell \in \{1, \dots, L\}} \xrightarrow{d^*} \left((H_\ell Z^{\pi})'(H_\ell \Sigma^{\pi} H_\ell')^{+} H_\ell Z^{\pi}\right)_{\ell \in \{1, \dots, L\}} \qquad (8)$$
as $n \to \infty$ holds similarly as in the proof of Theorem 2, where $Z^{\pi} \sim \mathcal{N}_k(0_k, \Sigma^{\pi})$. Since the limiting distributions in (8) and (7) generally differ, the studentized permutation approach is not consistent for the multiple testing problem. However, we can approximate the critical values via the groupwise bootstrap as introduced above. The difference here is that the covariance structures of the groups are not altered since the bootstrap observations are drawn within each group. The asymptotic validity is guaranteed by the following theorem.
A naive approach for compatible local and global test decisions would be to calculate a critical value for the maximum statistic $\max_{\ell \in \{1, \dots, L\}} W_n(H_\ell, c_\ell)$ in view of the global hypothesis (2). A local hypothesis in (6) is rejected whenever the corresponding local test statistic exceeds the critical value. In the special case that the ranks $\mathrm{rank}(H_\ell)$ are equal for all $\ell \in \{1, \dots, L\}$, the limiting distributions of the Wald-type test statistics are equal, as we have seen in Section 2.2. Then, the maximum statistic can be used for testing the global hypothesis and every contrast is treated in the same way.
However, if the ranks are not equal, the limiting distributions of the Wald-type test statistics are not equal; compare Section 2.2. Hence, the contrasts are not treated in the same way by considering the maximum statistic. Thus, we adopt the idea for the construction of simultaneous confidence bands proposed by Bühlmann. 21 To this end, let Note that we only have to consider since the quantiles can only take $B$ different values, respectively.
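Additionally, we only have to search for $\beta_n(\alpha)$ within the interval $[\alpha/L, \alpha]$. The lower bound can be interpreted as the Bonferroni bound, which follows from the union bound over the $L$ local tests. The decision rules are constructed as follows, with $q^*_{\ell,1-\beta_n(\alpha)}$ denoting the groupwise bootstrap quantile of the $\ell$th statistic at the local level $\beta_n(\alpha)$:
• For each $\ell \in \{1, \dots, L\}$, we reject the local null hypothesis $\mathcal{H}_{0,\ell}$ in (6) if and only if $W_n(H_\ell, c_\ell)/q^*_{\ell,1-\beta_n(\alpha)} > 1$. Here, we set $0/0 := 0$.
• We reject the global null hypothesis $\mathcal{H}_0$ in (2) whenever at least one of the hypotheses $\mathcal{H}_{0,1}, \dots, \mathcal{H}_{0,L}$ is rejected. Hence, we reject the global null hypothesis $\mathcal{H}_0$ in (2) if and only if $\max_{\ell \in \{1, \dots, L\}} W_n(H_\ell, c_\ell)/q^*_{\ell,1-\beta_n(\alpha)} > 1$.
Each test statistic $W_n(H_\ell, c_\ell)$, $\ell \in \{1, \dots, L\}$, is treated in the same way and has the same impact since we use the same local level of significance $\beta_n(\alpha)$ for each contrast. Moreover, the following theorem provides that the level of significance of the global test and the family-wise type I error rate for the multiple testing problem are controlled asymptotically.
Theorem 6. Let $\mathcal{T} \subset \{1, \dots, L\}$ denote the subset of true hypotheses, that is, let $\mathcal{H}_{0,\ell}$, $\ell \in \mathcal{T}$, in (6) be true. With $B = B(n) \to \infty$ as $n \to \infty$, we have that, asymptotically, the probability of rejecting at least one true local hypothesis $\mathcal{H}_{0,\ell}$, $\ell \in \mathcal{T}$, is bounded by $\alpha$. The inequality becomes an equality if $\mathcal{T} = \{1, \dots, L\}$.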
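The search for a common local level can be sketched in code. The following Python sketch with NumPy is an illustration under stated assumptions and not the paper's exact algorithm: it assumes a matrix `boot` of $B$ bootstrap replicates of the $L$ local statistics, uses empirical order statistics as quantiles, estimates the family-wise error rate by the share of bootstrap replicates in which at least one statistic exceeds its quantile, and picks the largest candidate level $m/B$ in $[\alpha/L, \alpha]$ for which this share is at most $\alpha$. The function names are hypothetical.

```python
import numpy as np

def calibrate_local_level(boot, alpha):
    """Largest local level beta = m/B whose estimated family-wise error is <= alpha.

    boot: array of shape (B, L) holding bootstrap replicates of the L statistics.
    Returns (beta, quantiles), with one critical value per contrast.
    """
    boot = np.asarray(boot, dtype=float)
    B, L = boot.shape
    ordered = np.sort(boot, axis=0)
    m_low = int(np.floor(B * alpha / L + 1e-9))    # Bonferroni-type lower bound
    m_high = int(np.floor(B * alpha + 1e-9))       # local level cannot exceed alpha
    beta, quantiles = None, None
    for m in range(m_low, m_high + 1):
        q = ordered[B - m - 1]                     # empirical (1 - m/B)-quantiles
        fwer = np.mean(np.any(boot > q, axis=1))   # replicates with any exceedance
        if fwer > alpha:
            break                                  # estimate is monotone in m
        beta, quantiles = m / B, q
    return beta, quantiles

def multiple_test(observed, boot, alpha=0.05):
    """Local rejections: the observed statistic exceeds its calibrated quantile."""
    beta, q = calibrate_local_level(boot, alpha)
    return np.asarray(observed, dtype=float) > q, beta

# Perfectly dependent statistics: the local level can be as large as alpha.
boot = np.column_stack([np.arange(1.0, 101.0), np.arange(1.0, 101.0)])
print(multiple_test([99.5, 10.0], boot, alpha=0.05))
```

The stronger the dependence between the local statistics, the larger the calibrated local level, which is where the gain in power over a Bonferroni correction comes from.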

Adjusted P-values
The method described above for constructing multiple test decisions is accompanied by an adjustment of the P-values. The following proposition ensures that the test decisions based on these P-values are unchanged.

Simultaneous confidence regions and intervals
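Furthermore, we can use the constructed multiple testing procedure for defining simultaneous confidence regions for $H_\ell\mu$ with asymptotic global confidence level $1 - \alpha$. To this end, we define the $\ell$th confidence region as $\mathrm{CR}_{n,\ell} := \{\xi \in \mathbb{R}^{r_\ell} : W_n(H_\ell, \xi) \le q^*_{\ell,1-\beta_n(\alpha)}\}$ for all $\ell \in \{1, \dots, L\}$. It can be easily checked that $P(H\mu \in \mathrm{CR}_{n,1} \times \dots \times \mathrm{CR}_{n,L}) \to 1 - \alpha$ as $n \to \infty$. In the case that $H_\ell \in \mathbb{R}^{1 \times k}$, that is, $r_\ell = 1$, we can simplify the confidence regions to confidence intervals $\mathrm{CR}_{n,\ell} := [L_{n,\ell}(\alpha/2), U_{n,\ell}(\alpha/2)]$ by solving the inequality $W_n(H_\ell, \xi) \le q^*_{\ell,1-\beta_n(\alpha)}$ for $\xi \in \mathbb{R}$. This yields
$$L_{n,\ell}(\alpha/2) = H_\ell\hat{\mu} - \sqrt{q^*_{\ell,1-\beta_n(\alpha)}\, H_\ell\hat{\Sigma}H_\ell'/n}, \qquad U_{n,\ell}(\alpha/2) = H_\ell\hat{\mu} + \sqrt{q^*_{\ell,1-\beta_n(\alpha)}\, H_\ell\hat{\Sigma}H_\ell'/n}.$$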
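For single-row contrasts, inverting the quadratic statistic reduces to a symmetric interval around the point estimate. The following is a minimal sketch in plain Python, assuming the statistic has the form $(H_\ell\hat{\mu} - \xi)^2 / \widehat{\mathrm{Var}}(H_\ell\hat{\mu})$ and that the calibrated critical values are already available; the function name and argument names are hypothetical.

```python
import math

def simultaneous_intervals(estimates, variances, quantiles):
    """Invert (estimate - xi)^2 / variance <= quantile for each contrast.

    estimates[l]: H_l mu_hat; variances[l]: estimated Var(H_l mu_hat);
    quantiles[l]: calibrated critical value for the l-th statistic.
    """
    intervals = []
    for est, var, q in zip(estimates, variances, quantiles):
        half_width = math.sqrt(q * var)
        intervals.append((est - half_width, est + half_width))
    return intervals

# Hypothetical inputs for two contrasts.
print(simultaneous_intervals([1.0, -0.4], [0.25, 0.04], [4.0, 4.0]))
```

A local hypothesis $H_\ell\mu = c_\ell$ is then rejected exactly when $c_\ell$ lies outside the corresponding interval, so intervals and test decisions are compatible.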

Simultaneous non-inferiority and equivalence tests
Let us consider again the case $r_\ell = 1$ for all $\ell \in \{1, \dots, L\}$. In this special case, we write $c_\ell$ in non-bold type instead of $\mathbf{c}_\ell$ for all $\ell \in \{1, \dots, L\}$. Based on the previously constructed confidence intervals, we can also define simultaneous non-inferiority and equivalence tests by using the two one-sided test procedure: 22 let $\epsilon_1, \dots, \epsilon_L > 0$ be prespecified equivalence bounds; the hypotheses of interest are for the non-inferiority testing problem and for the equivalence testing problem.

Stepwise extension
For gaining more power, our methodologies can be combined with the closed testing procedure as in Blanche et al 23 if only multiple test decisions but not the construction of (simultaneous) confidence regions are of interest: for each $\ell \in \{1, \dots, L\}$, the hypothesis $\mathcal{H}_{0,\ell}$ in (6) is rejected at level $\alpha$ if and only if for each $\mathcal{I} \subseteq \{1, \dots, L\}$ with $\mathcal{I} \ni \ell$ the intersection hypothesis $\mathcal{H}_{0,\mathcal{I}} := \bigcap_{j \in \mathcal{I}} \mathcal{H}_{0,j}$ is rejected at level $\alpha$. For testing an intersection hypothesis $\mathcal{H}_{0,\mathcal{I}}$, we can use the procedure as described above. To be specific, $\mathcal{H}_{0,\mathcal{I}}$ is rejected at level $\alpha$ whenever  (3, 20).
The survival functions of these censoring times are illustrated in Figure 1.The resulting censoring rates of the different groups are presented in Table S6 in the supplement.The censoring rates ranged from 20% up to 60% in groups 1-3 and from 1% up to 57% in group 4.
Furthermore, $N_{\mathrm{sim}} = 5000$ simulation runs with $B = 1999$ resampling iterations were generated. The level of significance was set to $\alpha = 5\%$ and the upper integration bound to $\tau = 10$.
The following methods were compared:
• asymptotic_global: The global Wald-type test as in Section 2.2,
• permutation: The global studentized permutation test as in Section 2.3,
• asymptotic: Multiple Wald-type tests, where the multivariate limit distribution in Theorem 4 is approximated by using the estimator for $\Sigma$ as defined in Section 2.2 and, then, applying the multiple testing procedure of Section 3,
• wild, Rademacher; wild, Gaussian: Multiple wild bootstrap [25][26][27] tests as in Section A in the supplement with Rademacher and Gaussian multipliers, respectively, by applying the multiple testing procedure of Section 3,
• groupwise: The multiple groupwise bootstrap test as in Section 3,
• asymptotic_bonf: Global Wald-type tests as in Section 2.2 adjusted with the Bonferroni-correction,
• permutation_bonf: Global studentized permutation tests as in Section 2.3 adjusted with the Bonferroni-correction.
Clearly, the first two methods (asymptotic_global, permutation) can only be compared to multiple testing procedures for the global testing problem. However, by using a Bonferroni-correction (asymptotic_bonf, permutation_bonf), we can also obtain test decisions for the local hypotheses. In all figures, one can see that only the permutation approach and the groupwise bootstrap seem to perform well over all simulation settings. Here, the permutation approach yields slightly better values than the groupwise bootstrap. Tables S7 to S42 in the supplement show the global rejection rates of the different settings. Under the null hypothesis, all values in the binomial confidence interval are printed in bold type. The permutation method is exact under exchangeability and, thus, most of the values of the permutation method with equal survival distributions across the groups under the null (exp early, exp late, exp prop, logn, Weib late, Weib prop) and equal censoring distributions fall within that interval. Furthermore, when exchangeability is violated, the permutation method still seems to perform quite accurately in terms of type I error control for all sample sizes. The groupwise bootstrap approach also results in very accurate family-wise error rates, especially for medium and large sample sizes. Moreover, we note that the three asymptotic approaches (asymptotic_global, asymptotic, asymptotic_bonf) and the wild bootstrap approaches are too liberal, as they exhibit too high rejection rates in nearly all settings. In Figures S10 to S12 in the supplement, it is observable that these methods exceed the desired level of significance particularly for settings with small sample sizes. By further analyzing the tables in the supplement, we observe that high censoring rates exacerbate the liberality of the tests. Note that the highest rejection rates occur for small sample size settings, where at least 49% of the data is censored.
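To judge whether an empirical rejection rate is compatible with the nominal level, one can compute a Monte Carlo interval around $\alpha$. The following sketch in plain Python uses the normal approximation to the binomial distribution, which is an assumption of this sketch; the source only states that a binomial confidence interval is used and does not specify its construction. The function name is hypothetical.

```python
import math

def binomial_interval(p, n_sim, z=1.959963984540054):
    """Normal-approximation 95% interval for a rejection rate with true level p."""
    half_width = z * math.sqrt(p * (1.0 - p) / n_sim)
    return p - half_width, p + half_width

low, high = binomial_interval(0.05, 5000)
print(round(low, 4), round(high, 4))
```

With 5000 simulation runs and a nominal level of 5%, empirical rejection rates between roughly 4.4% and 5.6% are within Monte Carlo error under this approximation.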

Simulation results under the null hypothesis
It should be noted that the power of our multiple tests can be improved by using a stepwise procedure as described in Section 3. The power of the Bonferroni corrected methods can also be improved by a stepwise procedure, for example, the Holm-correction. 28 However, stepwise procedures cannot be used for the construction of confidence regions and, hence, we did not focus on these in the simulation study.
We proved that all approaches are asymptotically valid under the null hypothesis. Figures S10 to S12 in the supplement confirm this empirically: all methods seem to tend to the desired level of significance of 5% for increasing sample sizes. However, the convergence rates of the asymptotic and the wild bootstrap approaches appear to be very slow. This observation raises the question of how larger sample sizes influence the type I error control of the naive methods, that is, the three asymptotic approaches. Therefore, further simulations under the null hypothesis were conducted in Section C.1 in the supplement. Specifically, we increased the scaling factor for sample sizes, that is, K ∈ {6, 8, 10}, resulting in sample sizes ranging from 60 to 200 per group.

Simulation results under the alternative hypothesis
In the power assessment, we observed small differences between the different methods. The global asymptotic approach leads to the highest power in most settings, followed by the wild bootstrap with Gaussian and with Rademacher multipliers. However, in view of the bad type I error control of these methods, we cannot recommend their use.
Let us now review the multiple testing problem. Because of the bad type I error control of the wild bootstrap approaches and for the sake of clarity, we do not consider these methods in the following. Moreover, the global approaches (asymptotic_global and permutation) do not yield local decisions. Thus, we only compared the asymptotic, the groupwise bootstrap and the Bonferroni-corrected approaches for the multiple testing problem. Furthermore, only the settings under the alternative hypothesis are considered. Tables S43 to S90 in the supplement provide the rejection rates of the false local hypotheses across all settings for the different sample sizes; they are further illustrated in Figures S13 to S15 in the supplement. Therein, it is apparent that the asymptotic approaches have a higher power for each false hypothesis than the groupwise bootstrap and the studentized permutation approach with the Bonferroni-correction. However, this difference is rather small, especially for large sample sizes. Additionally, by comparing the empirical power of the groupwise bootstrap test and of the studentized permutation test with Bonferroni-correction, the groupwise bootstrap test tends to be slightly more powerful for medium and large sample sizes. For small sample sizes, this trend reverses for the Dunnett-type and Tukey-type contrast matrix. However, it is important to note that the differences between the two methods regarding the empirical power are quite small and mostly not even visible in Figures S13 to S15.
Nevertheless, it is well-known that the Bonferroni-correction might lead to a loss of power. 14 In order to illustrate this, we conducted an additional simulation study under non-exchangeability; see Section C.2 in the supplement for details. Here, we saw that the groupwise bootstrap approach is able to outperform the permutation approach with Bonferroni-corrections in specific scenarios under non-exchangeability. This effect becomes particularly observable for the Tukey-type contrast matrix, where six hypotheses are tested simultaneously.
We conducted further investigations to assess the impact of censoring and sample sizes on the power. As expected, the power increases for larger sample sizes for each method. Additionally, settings with lower censoring rates tend to be more powerful. When comparing the power between the three false hypotheses $\mathcal{H}_{0,3}$, $\mathcal{H}_{0,5}$, and $\mathcal{H}_{0,6}$ of the Tukey-type contrast matrix, it becomes apparent that the fifth hypothesis $\mathcal{H}_{0,5}$ can be rejected more often, see, for example, Figure S14. The reason behind this can be attributed to the unequal sample sizes in the unbalanced design: Groups 1 and 3 contain only K ⋅ 10 observations, respectively, while groups 2 and 4 contain K ⋅ 20 observations each, for K ∈ {1, 2, 4}. Consequently, when comparing the RMSTs of groups 2 and 4, we have a larger dataset compared to other pairwise comparisons leading to more power. This exemplifies how an unbalanced design can boost the power of specific local hypotheses. However, depending on the contrast matrix, this is often done at the cost of a reduced power for testing other local hypotheses.
It should be noted that the empirical power is very low in some scenarios. This is particularly the case for the groupwise bootstrap and the studentized permutation approach with Bonferroni-correction and small sample sizes. Moreover, an increasing number of hypotheses decreases the power for the local hypotheses in general. Consequently, multiple tests based on the Tukey-type contrast matrix have even less power than multiple tests based on the Dunnett-type contrast matrix. Furthermore, small differences to the null hypothesis are difficult to detect. This can be observed for the Grand-mean-type contrast matrix, see Figure S15 in the supplement, where the three null hypotheses $\mathcal{H}_{0,1}: \mu_1 = \bar{\mu}$, $\mathcal{H}_{0,2}: \mu_2 = \bar{\mu}$, and $\mathcal{H}_{0,3}: \mu_3 = \bar{\mu}$ have very low rejection rates under the alternative hypothesis due to the small differences between these RMSTs and $\bar{\mu}$.
In conclusion, we recommend using the studentized permutation method for the global testing problem. For the multiple testing problem, the groupwise bootstrap test and the studentized permutation method with Bonferroni-correction perform similarly and quite well in terms of the type I error control and the empirical power across all simulation scenarios. However, we recommend using the groupwise bootstrap test for testing a large number of hypotheses since the Bonferroni-correction is known to have a lower power in this case. 14

APPLICATION TO REAL DATA ABOUT THE OCCURRENCE OF HAY FEVER
In order to illustrate our novel methods on real data, we consider a data set with data about the occurrence of hay fever of boys and girls with and without contact to farming environments. 29,30 These data derive from an observational study and may be structured in a factorial 2-by-2 design: factor A represents whether the child was growing up on a farm; factor B represents the sex. The event of interest is the age at which hay fever occurred. Ties are present in the data as each measured age was rounded (down) to full years.
The children were included in the survey via primary schools in 2006. Hence, their age was mainly between 6 and 10 years at the beginning of the study. The medical diagnoses of hay fever together with the age at initial diagnosis before study entry were recorded retrospectively. The age at which the diagnosis was made is easy to remember so that no significant recall bias or inaccuracies were assumed here. Follow-up surveys took place in 2010 with retrospective recording of initial diagnoses since the last survey and from then on annually until 2016. For simultaneous testing on a main effect of the two factors as well as on an interaction effect, we define $H := [H_A', H_B', H_{AB}']'$ by using the notation of Section 2.1. Furthermore, we set $\alpha = 5\%$ as the level of significance and chose $\tau = 15$ years.
The data set consists of 2234 participants. In detail, 654 boys and 649 girls not growing up on a farm and 450 boys and 481 girls growing up on farms were observed. Note that we did not adjust for any confounding variables in order to simplify this application of our method to real data. This comes with the limitation that the results may not fully reflect the causal effects of sex or growing up on a farm on the incidence of hay fever. The censoring rates in the different groups ranged from 74% up to 93%. The Kaplan-Meier and Nelson-Aalen curves of all groups are illustrated in Figure 5. Here, it can be seen that the estimated cumulative hazard functions are crossing each other and, thus, the proportional hazards assumption is not justified. If we nevertheless fitted a Cox proportional hazards model, the resulting (unadjusted) P-values for an impact on the occurrence of hay fever would be $p_A < 10^{-8}$ for a main effect of factor A, $p_B = 0.112$ for a main effect of factor B and $p_{AB} = 0.235$ for an interaction effect. By using a Bonferroni- or Holm-correction of the P-values, we could only establish that factor A (growing up on a farm) has a main effect on the occurrence of hay fever at global level 5%.
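The effect of the two corrections on the three reported Cox P-values can be checked with a few lines of plain Python. In this sketch, $p_A$ is only reported as being below $10^{-8}$, so the value `1e-8` is used as an upper bound, which is an assumption; the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]

def holm(pvalues):
    """Holm step-down adjusted P-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest P-value is multiplied by m - rank; enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted

p = [1e-8, 0.112, 0.235]   # p_A (upper bound), p_B, p_AB
print(bonferroni(p))
print(holm(p))
```

Under both corrections only the adjusted P-value for factor A stays below 5%, in line with the statement above.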
However, since the proportional hazards assumption seems violated, we aimed to compare the RMSTs in the different groups. The estimated RMSTs respectively are 14.22 and 14.66 for boys and girls growing up on farms and 13.59 and 13.79 for boys and girls not growing up on a farm. This indicates that boys tend to be more prone to hay fever than girls until the age of 15. Furthermore, growing up on a farm seems to reduce the risk of getting hay fever until the age of 15. Performing the global asymptotic Wald-type test and its global studentized permutation version with B = 19 999 resampling iterations leads to P-values of P < .003 and, thus, the existence of at least one main or the interaction effect on the occurrence of hay fever is highly significant. However, these tests cannot provide the information whether the sex and/or growing up on a farm and/or an interaction of these factors lead to a significant difference of hay fever occurrence. Therefore, we applied multiple testing procedures. The resulting adjusted P-values of our proposed methods with B = 19 999 resampling iterations are shown in Table 1. The P-values of the global asymptotic and permutation approach were adjusted by a Bonferroni-correction for enabling local test decisions. Here, we found that all methods rejected the local hypotheses of no main effect of the two factors simultaneously at the $\alpha = 5\%$ level. However, the interaction effect of the two factors was not significant. The data from this example do not fit perfectly to the simulation design in Section 4 since, here, a 2-by-2 design with different hypothesis matrices and larger sample sizes and censoring rates is considered. Thus, additional simulation results inspired by this data example can be found in Section C.3 in the supplement.

DISCUSSION
In many applications, the proportional hazards assumption is either hard to check by eye or plainly violated. In this case, the RMST can be used for summarizing the survival curve and, therefore, for comparing the survival curves of different groups in factorial designs. We considered a very general linear hypothesis testing problem for RMSTs in general factorial designs. To this end, we constructed a Wald-type test statistic and studied its asymptotic behavior. Furthermore, we proposed resampling procedures for approximating the limiting distribution. These include the studentized permutation approach and the groupwise bootstrap. In addition, we considered the multiple linear hypothesis testing problem for the RMST in general factorial designs, where several local hypotheses are tested simultaneously. Here, it turned out that the groupwise bootstrap can be used for approximating the joint limiting distribution of the test statistics. However, the studentized permutation approach is not able to approximate the joint limiting distribution directly and, thus, a controlling procedure that works under any dependence structure of the individual test statistics, for example, the Bonferroni correction, has to be applied afterwards. In an extensive simulation study, we analyzed the performance of the proposed methods. The results indicate that the groupwise bootstrap approach and the studentized permutation approach with Bonferroni correction perform best in terms of type I error control for the multiple testing problem, and the global studentized permutation approach performs best for the global testing problem. Finally, the proposed methods were applied to a real data set about hay fever.
It should be noted that the studentized permutation approach is finitely exact for the global testing problem under exchangeability. However, for the multiple testing problem, we cannot approximate the joint limiting distribution by the studentized permutation approach; see Section 3 for details. Hence, local test decisions can only be obtained by applying a correction procedure such as the Bonferroni or Holm correction afterwards. These procedures are known to yield a lower power, particularly for a large number of hypotheses and positively correlated test statistics.14 The groupwise and wild bootstrap approaches, on the other hand, can approximate the joint limiting distribution and, thus, the asymptotically exact dependence structure can be taken into account. A further advantage of the bootstrap approaches is that they also work for general hypothesis matrices and are not restricted to the case of contrast matrices.
A more flexible estimand than the usual RMST is the weighted version of the RMST. That is, $\mu_{i,w_i} := \int_0^\tau w_i(t) S_i(t)\,dt$ with estimator $\hat{\mu}_{i,w_i} := \int_0^\tau w_i(t) \hat{S}_i(t)\,dt$ for some weight function $w_i \in \mathcal{L}^1([0,\tau])$ and $i \in \{1, \dots, k\}$, similar to Zhao et al. 27 For the global testing problem, we get a similar statement as in Theorem 1 based on the weighted RMSTs. This can be shown analogously to the proof of Theorem 1 by using that the functional $D[0,\tau] \ni M \mapsto \int_{[0,\tau]} w_i(t) M(t)\,dt$ is continuous for all $i \in \{1, \dots, k\}$. It should be noted that the multivariate limiting distribution of the Wald-type test statistics for the multiple testing problem as in (7) based on the weighted RMSTs would also depend on the weight functions. Additionally, Zhao et al 27 already investigated the case of unknown weight functions for the two-sample case. For future research, their result could be extended to complex factorial designs and to more general linear hypotheses.
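The weighted estimator can be illustrated with a small sketch. Because the Kaplan-Meier estimate is piecewise constant, the integral of the weight times the estimated survival function reduces to a sum over the constant pieces. The sketch assumes that an antiderivative of the weight function is available in closed form; this assumption and the names `weighted_km_rmst` and `weight_antiderivative` are hypothetical and only serve the illustration.

```python
import numpy as np


def weighted_km_rmst(times, events, tau, weight_antiderivative):
    """Integral of w(t) * S_hat(t) over [0, tau] for the Kaplan-Meier S_hat.

    weight_antiderivative : callable W with W'(t) = w(t); assumed to be known
    so that each constant piece of S_hat integrates in closed form.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    W = weight_antiderivative
    event_times = np.unique(times[(events == 1) & (times <= tau)])

    surv, last_t, total = 1.0, 0.0, 0.0
    for t in event_times:
        total += surv * (W(t) - W(last_t))    # piece [last_t, t)
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & (events == 1))
        surv *= 1.0 - deaths / at_risk
        last_t = t
    total += surv * (W(tau) - W(last_t))      # last piece up to tau
    return total


# Toy data without censoring; w(t) = 1 recovers the usual RMST.
plain = weighted_km_rmst([1, 2, 3, 4], [1, 1, 1, 1], 4, lambda t: t)
# w(t) = 2t puts more weight on later time points; W(t) = t^2.
late = weighted_km_rmst([1, 2, 3, 4], [1, 1, 1, 1], 4, lambda t: t ** 2)
```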
Furthermore, it is important to note that our real data example is derived from an observational study but that we did not account for potential confounding variables. These can significantly impact the survival times. The appropriate selection of confounding variables requires careful causal considerations; effective control for confounding also requires a large enough sample size and recorded data on all confounding variables. Hence, it would be interesting to extend our methods in future research such that an adjustment for confounding variables is possible.

For each $\ell \in \{1, \dots, L\}$, let $W_n^{*,b}(H_\ell)$, $b \in \{1, \dots, B\}$, denote $B$ groupwise bootstrap test statistics. For each $b \in \{1, \dots, B\}$, the same bootstrap samples are used for calculating the groupwise bootstrap counterparts $(W_n^{*,b}(H_1), \dots, W_n^{*,b}(H_L))$ for the different contrasts. This reflects the real-world situation that the same original samples are used for testing all local hypotheses. Let $q^*_{\ell,1-\beta}$ denote the $(1-\beta)$-quantile of $W_n^{*,b}(H_\ell)$, $b \in \{1, \dots, B\}$, for all $\ell \in \{1, \dots, L\}$. Our strategy is to adjust the local level $\beta$ such that the level $\alpha$ is controlled globally. To this end, we let

$$\mathrm{FWER}^*_n(\beta) := \frac{1}{B} \sum_{b=1}^{B} 1\left\{ \exists\, \ell \in \{1, \dots, L\} : W_n^{*,b}(H_\ell) > q^*_{\ell,1-\beta} \right\}$$

denote the estimated family-wise type I error rate by using the $(1-\beta)$-quantiles as critical values for all $\beta \in [0, 1]$. Then, we define the local level $\beta_n(\alpha)$ as the largest value such that the family-wise type I error rate is bounded by the level of significance $\alpha$.
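The search for the local level can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: the bootstrap statistics are assumed to be stored in a B-by-L array whose rows share one bootstrap sample, the local level is searched over the grid {0, 1/B, ..., (B-1)/B}, and the empirical (1 - k/B)-quantile is taken as the (B - k)-th order statistic. The names `fwer_hat` and `local_level` are hypothetical.

```python
import numpy as np


def fwer_hat(boot_stats, k):
    """Estimated FWER when each local test uses its (1 - k/B)-quantile.

    boot_stats : array of shape (B, L); row b holds the L bootstrap statistics
                 computed from the same b-th bootstrap sample.
    """
    B = boot_stats.shape[0]
    # (1 - k/B)-quantile per column, taken as the (B - k)-th order statistic
    crit = np.sort(boot_stats, axis=0)[B - k - 1]
    return np.mean(np.any(boot_stats > crit, axis=1))


def local_level(boot_stats, alpha):
    """Largest beta = k/B on the grid {0, 1/B, ..., (B-1)/B} with FWER <= alpha."""
    B = boot_stats.shape[0]
    best = 0
    for k in range(B):
        if fwer_hat(boot_stats, k) > alpha:
            break            # fwer_hat is nondecreasing in k
        best = k
    return best / B


# Toy bootstrap statistics (not from the paper), B = 100 and L = 2.
same = np.column_stack([np.arange(100.0), np.arange(100.0)])        # identical columns
opposite = np.column_stack([np.arange(100.0), np.arange(99.0, -1.0, -1.0)])
beta_same = local_level(same, alpha=0.05)
beta_opposite = local_level(opposite, alpha=0.05)
```

In the toy example, perfectly dependent statistics allow the full local level of 5%, whereas statistics whose large values never coincide force a smaller local level. This is the sense in which the dependence structure enters the procedure.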

Figures 2-4 illustrate the global rejection rates under $\mathcal{H}_0$, which coincide with the family-wise error rates for the multiple tests, over all settings for the different contrast matrices. Here, the dotted line represents the $\alpha$-level of 5% and the dashed lines represent the borders of the binomial confidence interval [4.4%, 5.62%].

FIGURE 2 Rejection rates under $\mathcal{H}_0$ over all settings for the Dunnett-type contrast matrix. The dashed lines represent the borders of the binomial confidence interval [4.4%, 5.62%].

FIGURE 3 Rejection rates under $\mathcal{H}_0$ over all settings for the Tukey-type contrast matrix. The dashed lines represent the borders of the binomial confidence interval [4.4%, 5.62%].

FIGURE 4 Rejection rates under $\mathcal{H}_0$ over all settings for the Grand-mean-type contrast matrix. The dashed lines represent the borders of the binomial confidence interval [4.4%, 5.62%].
FIGURE 5 Kaplan-Meier and Nelson-Aalen curves of the different groups (boys growing up on a farm, girls growing up on a farm, boys not growing up on a farm, girls not growing up on a farm).
To see this, let $w_n(H_\ell, c_\ell)$ be the realization of $W_n(H_\ell, c_\ell)$ for all $\ell \in \{1, \dots, L\}$. First, we determine the local P-values by

$$\beta_{n,\ell} := \frac{1}{B} \sum_{b=1}^{B} 1\left\{ W_n^{*,b}(H_\ell) \geq w_n(H_\ell, c_\ell) \right\}$$

for all $\ell \in \{1, \dots, L\}$. Comparing the local P-values to $\beta_n(\alpha)$ yields multiple test decisions that are consistent with the method described above. Translating this comparison to a comparison with the level of significance $\alpha$ is intuitive due to the definition of $\beta_n(\alpha)$. Hence, by plugging the local P-value into $\mathrm{FWER}^*_n$, the adjusted P-value for the $\ell$th hypothesis can be defined by $p_\ell := \mathrm{FWER}^*_n(\beta_{n,\ell})$ for all $\ell \in \{1, \dots, L\}$ and the global P-value by $p := \min\{p_1, \dots, p_L\}$.
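The construction of the adjusted P-values can be sketched as follows. Again, this is an illustration under assumptions rather than the authors' code: the bootstrap statistics are stored in a B-by-L array whose rows share one bootstrap sample, the empirical (1 - k/B)-quantile is taken as the (B - k)-th order statistic, and a local P-value of 1 is mapped to an adjusted P-value of 1 by convention. The function name `adjusted_pvalues` is hypothetical.

```python
import numpy as np


def adjusted_pvalues(boot_stats, observed):
    """Local and FWER-adjusted P-values from B x L bootstrap statistics.

    boot_stats : array (B, L) of bootstrap statistics; rows share a resample
    observed   : array (L,) of the observed local test statistics
    """
    boot_stats = np.asarray(boot_stats, dtype=float)
    observed = np.asarray(observed, dtype=float)
    B = boot_stats.shape[0]
    sorted_stats = np.sort(boot_stats, axis=0)

    # local P-value counts: bootstrap statistics >= observed value, per column
    counts = np.sum(boot_stats >= observed, axis=0)

    adjusted = np.empty(len(observed))
    for ell, k in enumerate(counts):
        if k >= B:
            adjusted[ell] = 1.0     # convention of this sketch for beta = 1
            continue
        crit = sorted_stats[B - k - 1]   # (1 - k/B)-quantiles of all columns
        # plug the local P-value k/B into the estimated FWER
        adjusted[ell] = np.mean(np.any(boot_stats > crit, axis=1))
    return counts / B, adjusted


# Toy bootstrap statistics (not from the paper), B = 100 and L = 2.
boot = np.column_stack([np.arange(100.0), np.arange(99.0, -1.0, -1.0)])
local_p, adj_p = adjusted_pvalues(boot, observed=[97.5, 50.5])
global_p = adj_p.min()
```

In this toy example, the first hypothesis has local P-value 0.02 and adjusted P-value 0.04, so it is rejected at the 5% level by either comparison: the local P-value against the local level, or the adjusted P-value against the global level.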
Survival functions of the censoring times.
TABLE 1