A nonparametric relative treatment effect for direct comparisons of censored paired survival outcomes

A frequently addressed issue in clinical trials is the comparison of censored paired survival outcomes, for example, when individuals were matched based on their characteristics prior to the analysis. In this regard, a proper incorporation of the dependence structure of the paired censored outcomes is required and, up to now, appropriate methods are only rarely available in the literature. Moreover, existing methods are not motivated by the strive for insights by means of an easy‐to‐interpret parameter. Hence, we seek to develop a new estimand‐driven method to compare the effectiveness of two treatments in the context of right‐censored survival data with matched pairs. With the help of competing risks techniques, the so‐called relative treatment effect is estimated. This estimand describes the probability that individuals under Treatment 1 have a longer lifetime than comparable individuals under Treatment 2. We derive hypothesis tests and confidence intervals based on a studentized version of the estimator, where resampling‐based inference is established by means of a randomization method. In a simulation study, we demonstrate for numerous sample sizes and different amounts of censoring that the developed test exhibits a good power. Finally, we apply the methodology to a well‐known benchmark data set from a trial with patients suffering from diabetic retinopathy.


INTRODUCTION
Testing the stochastic superiority of one distribution to another is a very classical problem in statistics.For two independent and fully observable samples, the -test, the median test, and the Mann-Whitney test are three well-known solutions to it.Likewise, in the case of two dependent samples, i.e., if data consist of (matched) pairs, the -test, the sign test, and the Wilcoxon signed rank test based on the pair-wise differences could be used.Extensions to the case of two dependent and censored samples are not obvious but some have been developed in the literature for a few decades.The developed test statistics are based on ranked absolute within-pair differences of the possibly censored survival times 41 , a difference of counting processes 37 , differences of efficient scores 31 , ranking all censored observations separately from the uncensored ones 3 , a combination of terms for different censoring and pair-wise ordering patterns 9 , integrals of scores with respect to differences of the sample-specific Nelson-Aalen estimators for cumulative hazards 26 , a combination of frailty-based log-rank tests 30 , using further prioritized outcomes, i.e., additional data, if the primary survival endpoint does not offer decisive pair-wise comparisons 33 .Extensions for covariates were also developed 23,25 .Reviews of methods for paired survival data including additional references to other approaches and discussions are also available 42,28 .
While these existing approaches undoubtedly offer many good approaches for powerful statistical inference in the two-sample problem, an additional, easily interpretable quantification of the discrepancy between both samples is usually not available.One notable exception is the popular win ratio method of Pocock et al. 33 which recently had been exploited through win odds in order to take ties into account in the inference method 8 .The perhaps most straightforward approach for such a quantification is to compare the survival chances for both groups at a fixed time point.This would provide a very limited, yet easy-to-interpret summary.A more global impression of the difference between both samples could be obtained by integration over time, leading to the (restricted) mean survival time.We will treat this topic in another forthcoming paper.
Instead, we will pursue an approach which is motivated by another estimand with a very clear interpretation: the relative treatment effect.In brief, it describes the probability that the lifetime under Treatment 1 is bigger than the lifetime under Treatment 2. If this is (significantly) different from 0.5, a solid statistical conclusion can be drawn about the treatment efficacies.At the same time, it is a simple probability which is easy to communicate.
In the present paper, we will develop a nonparametric methodology with an emphasis on the following quality criteria: 1. begin the research with a clear formulation of an estimand of interest; 2. make only very few and weak assumptions for the method to work; 3. in particular, no continuity of survival functions is needed, i.e., instantaneous hazard rates need not exist; 4. guaranteed large sample properties; 5. a good statistical reliability even for small samples, i.e., good control of the type-I error rate and confidence level, as well as a good power and narrow confidence intervals, respectively.
All of these points are of crucial importance, in particular in the light of the ICH E9(R1) 17 guidelines on estimands in trial analysis.At the beginning of Section A.5.1 therein, it is written that: "An estimand for the effect of treatment relative to a control will be estimated by comparing the outcomes in a group of subjects on the treatment to those in a similar group of subjects on the control.For a given estimand, an aligned method of analysis, or estimator, should be implemented that is able to provide an estimate on which reliable interpretation can be based.The method of analysis will also support calculation of confidence intervals and tests for statistical significance.An important consideration for whether an interpretable estimate will be available is the extent of assumptions that need to be made in the analysis."These statements clarify that even the most powerful inference method might not be the preferable one if other criteria are not met, e.g., if no intelligible estimand is available.Also, we wish to point out that most of the methods rely on strong assumptions such as the equality of censoring times for both members of a pair 41,3 or the continuity of survival distributions 41,3,9,30 .In this sense, the power of the test we will develop in this paper is not criterion of greatest importance although a powerful method is of course welcome.
This paper is structured as follows.First, we will introduce the relative treatment effect and explain its estimation in Section 2. At the end of that section, we will relate the present approach to some others from the literature.Second, in Section 3, we will present a method to make statistical inference based on a data re-randomization technique 13 .Third, we will investigate the large sample properties of the new method and explore its small sample performance by means of a simulation study in Section 4. Therein, also the power of the test will be assessed in a comparison to some competitor methods.Next, in Section 5, we will apply the methodology to a well-known benchmark data set from a trial with patients suffering from with diabetic retinopathy, described by Huster et al. 24 .We will conclude with a discussion in Section 6.The Supplementary Material contains all proofs, additional technical details, and additional simulation results.

Model and notation
We denote by ( 1 , 2 ) a bivariate random vector on a probability space (Ω, , ).Each ( = 1, 2) stands for a survival time of a patient who was randomized to receive treatment .For instance, the pair results from a matching of two individuals with a similar physiology.As a consequence, we generally assume 1 and 2 to be dependent.There exist many approaches for the estimation of the bivariate survival distribution of ( 1 , 2 ) in the literature; see the paper by Pruitt 34 for a comparison of six methods that are able to handle bivariate right-censored data; Dai et al. 10 developed an estimator under the more general assumption of bivariate left-truncated and right-censored data.We do not mean to give an exhaustive list of references in that direction and point to the references included in the two just mentioned papers.Instead of estimating the bivariate survival function, we aim at estimating a meaningful summary of it, i.e., a treatment effect measure.
Let us now prepare the introduction of our estimand of interest.Let > 0 denote the maximum follow-up time of a study in which the superiority of Treatment 1 to Treatment 2 shall be analyzed.Here, superiority means that Treatment 1 prolongs the survival times compared to Treatment 2. Hence, it seems constructive to consider the following probability: The second term is important to give equal credit to both treatments in the case of equal outcomes.Furthermore, we say that Treatment 1 is preferable if ̃ > 0.5.However, due to the maximum follow-up time , ̃ is not always estimable; instead, we will focus on the estimand which we will we call the relative treatment effect from now on.Brunner and Munzel 7 introduced this concept for the case of two independent samples.Many more research papers in this context emerged afterwards.Just to mention two, Munzel and Brunner 29 and Konietschke and Pauly 27 considered to the case of two dependent but fully observable samples.Finally, for technical reasons, we assume that ( 1 > , 2 > ) > 0 and that ( 1 = ) = ( 2 = ) = 0.The latter is always achievable by artificially increasing by a very small number.Throughout the paper, we assume that ∈ (0, 1), i.e., no perfect superiority of one treatment over the other.As mentioned before, censoring is omnipresent in many medical studies.We thus assume that survival times are independently right-censored, i.e., it is only possible to observe an event if it occurred before a so-called censoring time, say 1 and 2 , respectively.These are allowed to be dependent, but ( 1 , 2 ) and ( 1 , 2 ) are assumed to be independent.As a consequence, the actually observable data are of the form ( 1 , 1 , 2 , 2 ), where = min( , , ) and = 1{min( , ) ≤ }, = 1, 2; here, 1{⋅} denotes the indicator function.For estimation of the second probability in the relative treatment effect, we consider the case of = as uncensored.On a side note, Efron 14 also assumes the largest observation as uncensored in order to achieve a so-called "self-consistency" property for the Kaplan-Meier estimator; see Section 7 therein.
In the following, we will assume without loss of generality that the censoring times are continuously distributed prior to .In the case of discrete components in their distribution, ties can be broken by adding very small positive random numbers to them.These random numbers can be chosen small enough so that the order of all event times among the censoring times is not altered.Also, all statistical procedures considered below are not affected by these small modifications of the censoring times.

Transformation of paired survival data into competing risks data, and estimation
Let us now describe our novel estimation approach for the relative treatment effect.We assume that our data set consists of independently and identically distributed data points of the kind described above, i.e., of independent paired right-censored data.For a facilitated estimation, we transform the paired survival data into a competing risks data set.Note that our transformation is similar in spirit to the one used by Scheike, Holst, and Hjelmborg 36 : therein, a transformation of a paired competing risks data set into a univariate one facilitated the estimation of a concordance function.Our transformation works as follows: if for a pair 1. the first entry is observed to fail before the second, an event of type 1 occurred; 2. the second entry is observed to fail before the first, an event of type 2 occurred; 3. both entries are observed to fail simultaneously, an event of type 3 occurred.
All other cases are labelled right-censored.In each case, the (censored) event time is set to be the minimum of all four event and censoring times.We summarize the thus obtained competing risks data set as ( , ) = (min( ̌ , ̌ ), ̌ ⋅ 1{ ̌ ≤ ̌ }), = 1, … , .Here, ̌ is the minimum of both (potentially unobservable) event times of pair and ̌ is the minimum of both (potentially unobservable) censoring times of pair .Note that the type of event ̌ is unobservable in the case of a censoring.More technical details about the transformation are given in Section A in the Supplementary Material.
One important consequence of the subsequent Proposition 1 is that the Aalen-Johansen estimators 1 of , say ̂ , , = 1, 2, 3, are available for estimating : The specific structure of these Aalen-Johansen estimators are given in the Supplementary Material.Many statistical properties of the Aalen-Johansen estimator are well-known; cf. 5 , Section IV.4.
(b) (Sufficiency) The above-described data reduction is sufficient for .
Let us now translate two of the most important statistical results about Aalen-Johansen estimators to the relative treatment effect estimator, that is, consistency and asymptotic normality.To this end, we denote convergences in probability and in distribution by → and →, respectively.A more detailed formula for the asymptotic variance 2 is offered in the Supplementary Material.The results of Theorem 1 could be combined with a consistent variance estimator to construct Wald tests and related confidence intervals.However, such inference procedures typically exhibit a suboptimal control of the type-I error rate and the confidence level, respectively, especially for small sample sizes.That is why we propose a resampling-based approach in Section 3 below.

Discussion of related approaches in the literature
Let us review the present approach in the light of existing approaches and suggestions from the literature.Seigel and Podgor 37 proposed to compare the counting processes for the competing risks of type 1 and 2 at time , say 1 ( )− 2 ( ), i.e., McNemar's statistic.A statistical analysis of this difference would require taking the censoring rate into account.In contrast, the relative treatment effect estimator (2) relates to the difference of cumulative incidence functions through a one-to-one mapping: ).The latter may be written as ∫ 0 ( 1 − 2 )( )∕ ̂ ( −), where the denominator is the left-continuous version of the Kaplan-Meier estimator for being uncensored (in the competing risks data set).In that sense, our approach may be called an inverse-probability-of-censoring-weighting (IPCW) version of the suggestion by Seigel and Podgor 37 .
The previous representation also illustrates the difference from the class of test statistics suggested by Dabrowska 9 : she Here ̃ 1 and ̃ 2 are defined like 1 , 2 , just based on the completely uncensored pairs, ̃ 3 and ̃ 4 correspond to the singly censored data points, and and are some scoring processes.
Let us also compare the present method to the Kaplan-Meier estimator-based approach in Dobler 13 .Therein, a different variant of the relative treatment effect is analyzed, say ̄ = ( 11 > 22 ) + 1 2 ( 11 = 22 ).That is, outcomes from different treatments of different individuals are compared.Other names for this parameter are D-value or Mann-Whitney parameter 18,11 , and also the C-index 20 is related.It is known 13 that ̄ can be represented by an intergral that involves only the marginal survival functions.Consequently, estimation could be based on two marginal and dependent Kaplan-Meier estimators.In contrast, the method developed in the present paper is based on direct comparison within each pair.In this sense, confounding is avoided or at least reduced.The novel competing risks-based approach will thus make better use of the available information on the dependence structure within the paired observations.It has been argued 18,19 that it is challenging to draw causal conclusions for ̄ if not paired but only sample-specific measurements are available.In addition, Example 1 illustrates that ̄ might exhibit some undesirable proporties which do not occur for .Finally, the parameter ̃ from (1) is related to the area under the ROC curve (AUC) which is sometimes written in a similar way. 32ample 1.We wish to illustrate some differences between the estimands and ̄ in addition to examples from the literature 18,19 .In particular, we will point out two cases with different implications for subpopulations.Let us suppose that the following data set is fully observable; a variant of the following examples has been kindly provided by Katharina Kramer (University of Augsburg): Here, we also suppose that additional information (a subgroup) is available, e.g., males and females.For simplicity, let = ∞.
In the example above, the above-mentioned estimators of and ̄ yield the estimates ̂ = 1 and ̂ = 10 16 , respectively.These are the estimates for the whole population, i.e., both subgroups combined.We would like to point out that in this completely observable case the estimators simplify to the empirical fractions: ̂ is equivalent to the sign test statistic and ̂ is equivalent to the Mann-Whitney U test statistic.Within the subsamples, however, the estimates are ̂ , = 1 and ̂ , = 3 4 , = 1, 2. Extending this example reveals that ̂ can get arbitrarily close to 1 2 , i.e., basically no Treatment 1 benefit, whereas ̂ , = 3 4 > 1 2 .This cannot happen with ̂ which is a convex combination of the subgroup-specific ̂ 1, and ̂ 2, (in the fully observable case).In contrast, the following example is a case where ̂ 1, = 3 4 > 1 2 and ̂ 2, = 1 2 for the subsamples but ̂ = 7 16 < 1 2 for the whole sample: That is, for each subgroup Treatment 1 seems beneficial or at least not harmful in view of ̄ , whereas is seems harmful for the whole population.In contrast, ̂ = 3 4 which is in line with ̂ 1, = 1 and ̂ 2, = 1 2 .This illustrates that, in some special cases, drawing conclusions for subpopulations can be difficult based on ̄ .As we saw, these phenomenons do not seem to occur for .It should be pointed out though that such comparisons with subgroups are more challenging in the censored case because there is in general no direct connection between ̂ and ( ̂ 1, , ̂ 2, ) any more.It might also be useful to complement the estimand with an average treatment effect, e.g., differences of the restricted mean survival times (RMST) (min( 1 , ) − min( 2 , )) to get additional insight in the effectiveness of a treatment with respect to another one.

INFERENCE ON THE RELATIVE TREATMENT EFFECT
One resampling option is a variant of the classical bootstrap 15,16 , i.e., draw times independently with replacement from the competing risks data pairs ( , ), = 1, … , , and recompute ̂ based on the drawn bootstrap sample.By repeating this procedure a large number of times, the collection of the normalized bootstrapped relative treatment effect estimators, say * , = √ ( ̂ * , − ̂ ), = 1, … , , can be used to estimate different aspects of the distribution of = √ ( ̂ − ), e.g., the (1 − )-quantile which is necessary for a right-tailed test.
Another, perhaps less well-known resampling option, is given by data re-randomization.The general procedure is similar to the bootstrap, except that data points are not drawn with replacement but instead some other random variation is introduced which is related to an algebraic group structure.One popular example is random permutation of the data points in a two independent samples setting which leads to permutation tests.Randomization tests can be shown to be finitely exact if the re-randomization procedure reflects the data generation process.Recent works revisited the finite exactness of randomization tests. 21,22Similarly, finite exactness of confidence intervals can be shown under randomization-invariance (up to the discreteness of the randomization distribution).In addition, Dobler 13 argued the asymptotic exactness of randomization tests (with finite exactness in special cases) even if the data generation process does not exactly match the re-randomization method.
We will apply one such randomization approach is the present paper.To motivate it, consider for a moment the strong null hypothesis that both treatments are completely exchangeable in every respect.In that case, we would have for the relative treatment effect = 0.5.However, under the weak null hypothesis 0 ∶ = 0.5, it is well possible that the treatments are not exchangeable; for example, if the survival functions related to 1 and 2 are allowed to be different.A treatment rerandomization, i.e., a random re-labeling of the assigned treatment within each data pair, would lead to an artificial situation in which both treatments are exchangeable, which in turn implies 0 .That is, the relative treatment effect estimator based on the re-randomized data set targets the value 0.5.Thus, randomization tests and confidence intervals that are based on critical values obtained through the described randomization procedure will be finitely exact under the sharp null hypothesis of treatment exchangeability.Additionally, it will be asymptotically exact under the weak null hypothesis 0 if the normalized relative treatment effect estimators are suitably studentized so that the limit distribution is pivotal.Note that this re-randomization is different from the approach used in classical permutation tests for two independent samples, as the present data consist of paired data, and the randomization is done within each pair.
Let us make the randomization approach more explicit.In terms of a fixed competing risks data set ( , ), = 1, … , , a re-labeling of both treatments would mean that the times remain unchanged but every occurrence of a type-1 and type-2 event will be randomly re-labeled a type-1 or type-2 event, each with probability 50%.Denote a thus obtained randomized data set by ( ̃ , ̃ ), = 1, … , , and the resulting randomized relative treatment effect by ̃ , where the randomization procedure is considered to be random.For realizations of the re-labeling, while the original data are kept fixed, we will again use the index = 1, … , in the subscript to indicate the randomization iteration.It remains to justify the conditional convergence in distribution of the randomized relative treatment effect estimator.To this end, we introduce the notation  for the distribution of a random variable and let denote a distance which metrizes the space of distributions on ℝ, e.g., the Prokhorov distance 6, pp. 72-73 .Finally, let → denote convergence in probability.Let ̂ 2 , and ̃ 2 , be suitable consistent estimators of the asymptotic variances of ̂ and ̃ , respectively; see Section B in the Supplementary Material for this paper and Subsection 4.1 for details.Furthermore, we wish to point out that -lim →∞ ̂ 2 , = 2 ≠ -lim →∞ ̃ 2 , = ̃ 2 in general if the two treatments are not exchangeable.This underlines the necessity to studentize the (randomized) estimator.
Based on Theorem 2, consistent one-sided hypothesis tests for 0 ∶ ≤ 0.5 against ∶ > 0.5 of asymptotic level ∈ (0, 1) are given by rejecting 0 if and only if ̂ exceeds 0.5 + ̂ , ⋅ * (1 − ).Here, * (1 − ) denotes the conditional (1− )-quantile of ( ̃ −0.5)∕ ̃ , given the data.Such tests control the significance level exactly under the sharp null hypothesis of the exchangeability of both treatments (and censoring distributions).Yet, they might still be slightly conservative for very small sample sizes if the non-randomized version † of the randomization tests are used.Equivalently, [ ̂ + ̂ , ⋅ * (1 − ), 1] are one-sided asymptotic level (1 − ) confidence intervals.Similarly, two-sided tests and confidence intervals can be obtained if additionally the lower randomization-based quantiles are used.As mentioned before, these confidence intervals will also be exact under treatment-exchangeability, up to the discreteness of the randomization distribution.

General remarks
The small sample properties are analyzed with the help of simulation studies.Next to simulations for assessing the size of the proposed right-tailed tests under the null hypothesis 0 ∶ = 0.5, we also conducted simulations regarding the power of these tests.All simulations were conducted under R version 4.1.0 35.We also used the R package etm 4 in which the Aalen-Johansen estimators and all related variance and covariance estimators of Greenwood-type are implemented.
The above-indicated rate parameters and of the marginal distributions were found through numerous generations of large data sets and they were chosen such that ≈ 0.5, i.e., the null hypothesis is considered true.The parameters of the censoring distributions resulted in censoring rates of 38% to 42% (strong), 27% to 34% (medium), and 17% to 27% (light); these rates were found through simulations, where truncations at were also considered censorings.
To elaborate a bit more on the simulation steps, we would like to point out that the bivariate copula data are generated first.Next, the quantile functions of the marginal distributions are used to transform the copula data into bivariate data with the pre-selected dependence structure and marginal distributions.Finally, censoring is introduced by taking the minimum of the event and the simulated censoring times.For the purpose of estimating the relative treatment effect, the synthetic data are next transformed into a competing risks data set as described in Section 2.2.It should be pointed out that the cumulative incidence functions underlying the transformed competing risks data set neither have to be specified, nor do they play an important role, except for at time .
The results displayed in Table 1 can be summarized as follows.Among the tests based on the untransformed relative treatment effect ("lin." in the table), the asymptotical test tends to be a bit liberal, while the bootstrap test is somewhat conservative.This is more pronounced for the smaller sample sizes up to = 100.The randomization test is generally closest to the selected significance level.For larger sample sizes ( ∈ {125, 150}), the sizes of all tests approach the 5% level quite accurately.
Table 1 about here.
One notable peculiarity is the combination of strong censoring, the Gumbel-Hougaard copula, and, in particular, Gompertz versus exponential marginals: here, the bootstrap test stays very conservative, and the randomization test is very anti-conservative.There is a slight improvement when the sample size increases.In this scenario, the asymptotic test is also anti-conservative but it surprisingly performs better than the randomization test.For smaller sample sizes ( ∈ {25, 50}), similar observations about the simulation results can be made for the related medium censoring case.Figure 1 illustrates that the above-discussed scenario is indeed a very challenging one: many data points are converted into censorings on the competing risks scale; the censoring rate amounts to about 52%.At the same time, there is a very strong correlation between the survival times of a pair, and two very different marginal distributions.For smaller sample sizes, most of these characteristics are hardly visible.That is why this is by far the most challenging simulation setting.
The log-log-transformation ("tra." in the table) seems to rectify the liberality of the asymptotic test and the conservativeness of the bootstrap test in most scenarios.However, the transformation has nearly no effect on the randomization test.

Power simulations
In addition to the size simulations under 0 ∶ = 0.5, we have also conducted a simulation study to assess the power of the one-tailed version of the developed test.Since the results for the transformed and untransformed test statistics are very much alike, we have solely focused on the latter.We also considered two competitor tests, also in their one-tailed versions: the paired Prentice-Wilcoxon test 31 which was found to be very powerful in the comparative simulation study by 42 ; the stratified log-rank  Upper panels: complete data set (no censoring); lower panels: censored data set.Left panels: raw data; censorings in both coordinates are denoted by "+"; censorings only in the horizontal coordinate are denoted "X"; censorings only in the vertical coordinate are denoted "Y"; completely uncensored data points are denoted by red circles.Right panels: competing risks data after transformation.The event time is the minimum of both coordinates.The symbols "1", "2", and "C" represent whether the data point corresponds to an observed event of type 1, type 2, or to a censoring.test 30 which was more closely analyzed under correlated frailty models.In order to ensure a fair comparison, the randomization versions of all tests were used, such that all of them control the significance level for finite sample sizes under exchangeability.In this subsection, we chose the significance level = 5%.
We considered the same two copulas as in the previous subsection, the sample sizes ∈ {25, 50, 100}, 1,000 test replications, 1,000 randomization iterations, and the following three marginal distribution scenarios: 1. Mixture of the (2)-exponential and the (0, 2)-uniform distribution against the (2)-exponential distribution; the censoring times were independently (0, 2.5)-distributed and = 1.9.This scenario departs from the sharp null hypothesis of exchangeability into alternatives with crossing hazard rates at late time points close to as the mixing parameter puts more and more weight on the uniform distribution.

Mixture of the
(2)-exponential and the Gompertz distribution with shape parameter 0.1 and rate parameter 2 against the (2)-exponential distribution; the censoring times were independently (0, 2.5)-distributed and = 1.8.This scenario departs from the sharp null hypothesis of exchangeability into alternatives with crossing hazard rates at central time points as the mixing parameter puts more and more weight on the Gompertz distribution.

The
(2∕ )-exponential distribution against the (2)-exponential distribution; the censoring times were independently (0, 2)-distributed and = 1.3.This scenario departs from the sharp null hypothesis of exchangeability into alternatives with parallel hazard rates as the scale parameter increases from 1 to 2.
For now, we only focus on the results for Scenario 1 graphically presented in Figure 2; the results for Scenarios 2 and 3 are presented in Section D.2 of the Supplementary Material and they are similar to those presented here.We can see from Figure 2 by comparing each combination of left and right panel that the copula that connects the lifetimes apparently has only little influence on the performance of the tests.Not surprisingly, the power of all tests increases when the sample size increases (top to bottom in the figure) and when we depart from the null hypothesis (from left to right within each panel).The paired Prentice-Wilcoxon test 31 is always the most powerful one.In most cases, the proposed test has the next higher power but its performance is generally very similar to that of the stratified log-rank test 30 .
Multiple comments are in order.First, in this simulation study we could confirm the earlier findings 42 that the paired Prentice-Wilcoxon test is indeed quite powerful.This is also not surprising because it is based on an efficient score approach.Next, the power of the stratified log-rank test could possibly be greater if the optimal combination with the unstratified log-rank test was used. 30However, an implementation of the combination would be beyond the scope of the present paper.In addition, it should not be forgotten that the considered competitor tests were proposed for the sharp null hypothesis of equal survival distributions; it is only natural that such tests potentially have a greater power than the proposed test which was designed for the weak null hypothesis 0 ∶ = 0.5.Another reason for the relatively high power of the log-rank test and the paired Prentice-Wilcoxon test is the fact that the generated lifetimes were not truncated at , whereas smaller values of let the relative treatment effect get closer to 0.5, i.e., closer to the null hypothesis.In the light of these facts, the power of the proposed test procedure is very competitive.In addition, as initially mentioned, our focus was on the development of an easily interpretable estimand-based inference procedure, rather than developing the most powerful test for comparing two survival functions.

DATA EXAMPLE
We illustrate our methodology by re-analyzing a well known benchmark data set which has been published in the R package survival 38 .The data set diabetic, in detail described and analyzed by Huster et al. 24 , contains 394 observations from a trial including 197 patients with "high-risk" diabetic retinopathy, a complication associated with diabetes mellitus that frequently leads to blindness.In this trial each patient acts as its own control: one eye was randomized to a laser photocoagulation, while the other eye received no treatment.
Apart from suffering from diabetic retinopathy, the inclusion criterion of the trial was a visual acuity of at least 20/100 in both eyes.The aim of the study was to investigate the effect of the laser treatment on delaying the onset of blindness, defined by a visual acuity of less than 5/200 at two consecutive visits after four months.Thus, the survival times are the times until blindness for the eyes.Censoring was caused by death, dropout, or the end of the study.Consequently, we consider the censored paired survival outcomes and their relative treatment effect.
The data set consists of two subgroups, defined by the type of diabetes, i.e., patients with juvenile onset diabetes (diagnosis before an age of 20, 114 patients) and adult onset diabetes (83 patients).In these subgroups, respectively 78.9% and 85.1% of the patients had at least one censoring.Further, there are several other covariates, e.g., the laser type and a risk score.
For a better overview of the data, Figure 3(a) visualizes the time until blindness for the total sample, regardless of the age at diagnosis.From the pattern of the observations displayed in the figure, we conclude that the laser treatment seems in general to delay blindness.For a separate investigation of the juvenile and adult sample, respectively, Figure 3 subsamples, we observe a visible difference between the Kaplan-Meier curves and again we conclude that the laser treatment seems to delay the onset of blindness compared to the control, which becomes even more visible in the adult sample.Table 2 about here.In order to confirm our visual findings, we will estimate the relative treatment effect of the laser photocoagulation and the corresponding confidence intervals, as well as performing the corresponding two-sided hypothesis tests in order to assess the effectiveness of this therapy.We fix the maximum follow-up time as = 60 (indicated by the dashed box in Figure 3(a)) and perform the analysis for both samples, that is juvenile and adult onset diabetes, separately.For the juvenile sample, we obtain a relative treatment effect of ̂ = 0.598, for the adult sample we have ̂ = 0.731.The corresponding 95%− confidence intervals and the results of the (two-sided) hypothesis test for the different approaches described in Section 3 are summarized in Table 2.The transformation used for the analysis is given by ( ) = log(− log( )) and all results based on bootstrap were achieved by using =2,000 bootstrap repetitions.For each subsample, all confidence intervals and also all p-values are very similar.In general, the effect of an additional transformation of the test statistic is rather small and yields very similar confidence intervals and test results.As already indicated by Figure 3(b), there is a notable difference of treatment and control eye, which is even larger in the adult sample.This is confirmed by p-values below 0.001 for the adult sample for all tests under consideration.For the juvenile sample, p-values lie between 0.012 and 0.025.Hence, we conclude a significant treatment effect for both samples at the significance level = 5%.
Our results are similar to the findings from Oakes and Feng 30 , who investigated the same data set regarding the treatment effect.In their paper, they propose three different test approaches and conclude that, for the example at hand, all resulting treatment effects are significant.

DISCUSSION
In this paper, we developed a new estimand in the context of paired, right-censored survival data, the so-called relative treatment effect, to compare the effectiveness of two treatments.Such data occur for instance in matched pairs studies.We derived confidence intervals and hypothesis tests and could demonstrate all desirable propoerties by means of a simulation study.
The relative treatment effect quantifies the stochastic ordering of treatment outcomes but not how much bigger one survival times is than the other.In this sense, it is a global measure for the superiority of the first treatment, although other measures, e.g., about the actual size of the differences, might also be of major importance in some applications.One extension of the present method could be the incorporation of additional patient covariates, e.g., those that are used for the matching of individuals into pairs, which could be used to tackle the classification problem of who specifically should receive which treatment.This could be achieved in terms of semiparametric regression models or by involving relative treatment effects in a machine learning algorithm.
There are multiple possibilities for other extensions of the present approach.One open question is how to incorporate additional patients that could not be matched with others or if multiple patients of one treatment group could be matched with just one patient of the other group.The latter problem could potentially be approached by means of an appropriate re-weighting of the within-pair comparisons.However, there is the risk that the celebrated easy interpretability of the relative treatment effect could be lost.Another way to extend the present approach is the incorporation of additional covariates.Due to the favourable competing risks approach, such an extension could be achieved rather straightforwardly, e.g., by means of cause-specific hazard models or subdistribution hazard models.
Finally, we note that the hypothesis test presented in this paper investigates the significance of the treatment effect.However, there might also occur situations where one is rather interested in testing whether the deviation of the treatment effect of 0.5 is not larger than pre-specified values 1 and 2 , respectively.In other words, this requires an equivalence test for 0 ∶ ∉ (0.5 − 1 , 0.5 + 2 ) against ∶ ∈ (0.5 − 1 , 0.5 + 2 ) (see, for example, Wellek 40 ).Such an approach could provide a very flexible framework for statistical inference, address numerous other research questions, and consequently provide a useful addition to the test proposed in this paper.We leave the development of such a procedure for future research.

APPENDIX A TECHNICAL DETAILS FOR THE COMPETING RISKS-BASED ESTIMATION OF THE RELATIVE TREATMENT EFFECT
In this section, we will explain in detail how the transformation of the paired, right-censored data to competing risks data can be achieved and why these competing risks data are usable for valid estimation of the relative treatment effect.
Let us thus assume that a data set consists of independently and identically distributed data points ( 1 , 1 , 2 , 2 ), = 1, … , , as described in Section 2 in the main manuscript.As explained there, we wish to estimate the relative treatment effect with the help of a competing risks approach.Our key strategy is to transform the data into a competing risks data set: where with survival function ( ) = ( ̌ > ) (0 ≤ ≤ ) denote the censoring times, and ̌ ∈ {1, 2, 3} are the event indicators, respectively.Again, ̌ , ̌ , and ̌ are not fully observable but ( , ) is.The three different (artificial) events are defined as follows: 1.If the first treated subject experienced an observable event before the second, i.e., 1 < 2 , 1 = 1 or if 1 = 2 , ( 1 , 2 ) = (1, 0), we say the first type of event ( = 1) took place at time 1 = 1 .Note though that the second just mentioned case is impossible due to the independent and continuous censoring assumptions: 2. Similarly, the second type of event ( = 2) took place at time 2 = 2 if 2 < 1 and 2 = 1.
3. If both events were observed simultaneously, i.e., 1 = 2 and 1 = 2 = 1, the third type of event ( = 3) took place In all other cases, the observation is censored ( = 0) at In the following, we suppress the index for ease of presentation when there is no need to specify the precise subject.We denote the cause-specific cumulative hazard functions in the competing risks framework by , = 1, 2, 3.The following lemma is the crucial step towards estimating the relative treatment effect ; in fact, it shows that no relevant information is lost by the above-described conversion to a competing risks data set.
Lemma 1.The cause-specific Nelson-Aalen estimators converges in distribution to a three-dimensional zero-mean Gaussian process as → ∞.
The proof of Lemma 1 (provided in Section C below) reveals that the transformation of the paired survival data into competing risks data preserves the underlying intensities.As a consequence, other properties of the Nelson-Aalen estimators are retrieved, such as their interpretation as nonparametric maximum likelihood estimators.Another consequence of Lemma 1 is that the Aalen-Johansen estimators 1 for the cumulative incidence functions for all event times are similarly estimable; cf.Dobler 12 for the general case with both continuous and discrete components in the event time distribution.To be specific, estimation of the relative treatment effect is achievable as follows: where Tedious but straightforward calculations revealed the following expression for the asymptotic variance of the relative treatment effect estimator, √ ( ̂ − ): are the asymptotic variance and covariance functions of the normalized cause-specific Nelson-Aalen estimators, respectively, and is the asymptotic variance function of the normalized all-cause Nelson-Aalen estimator.We propose to use the consistent Greenwood-type variance and estimators of 2 and ; see, e.g., formulas (4.4.17) and (4.4.18) in Andersen et al. 5 .For the randomization version of the relative treatment effect estimator, √ ( ̃ − 1 2 ), we discovered a structure similar to 2 , except that all quantities 2 , , are replaced by their randomization counterparts; see the subsequent subsection for details.

C PROOFS
Proof of Lemma 1.We are going to show that the censored competing risks data set exhibits the correct underlying hazard rates.The consistency and asymptotic normality statements then follow from well-known results in the literature, e.g., Andersen et al. 5 .We refer to Dobler 12 for detailed derivations regarding the general case of event times with both discrete and continuous components.
We first consider continuous components of the hazard rate.Thus, at points of continuity < of the distribution of ̌ , we have A similar derivation holds for time points ≤ of discontinuity: Here, 1 ( −) = lim ↑ 1 ( ) denotes the left-continuous version of 1 .The very same arguments can be used for the second competing risk, i.e., ̌ = 2.
For the third competing risk, we similarly have for any point < of continuity that lim Again, a similar derivation holds for times ≤ of discontinuity: This verifies that the Nelson-Aalen estimators estimate the correct quantities.∎ Proof of Proposition 1. Part (a) is obvious.
(b) Lemma 1 revealed that the underlying intensity processes are preserved under the data transformation.In addition, counting and at-risk process are known to be sufficient statistics for the nonparametric intensity processes 2 .It is thus apparent that the -algebra generated by the competing risks data set is sufficient for .
(c) It is well-known that Aalen-Johansen estimators have an interpretation as NPMLEs; we refer to Sections IV.4.1.5 and IV.1.5 in Andersen et al. 5 for a detailed discussion about the concept of NPMLEs, and the NPMLE properties of the Aalen-Johansen and the more fundamental Nelson-Aalen estimators, respectively.Likewise, it clear that ̂ is the NPMLE of .∎ Proof of Theorem 1.The stated convergence in distribution is a consequence of the continuous mapping theorem in combination with the functional delta-method because the relative treatment effect estimator depends on the cause-specific Nelson-Aalen estimators through a combination of the following functionals: • Wilcoxon functional (Section 3.9.4.1 in van der Vaart and Wellner 39 ).For càdlàg functions and functions of bounded variation (bounded by a fixed constant), the functional ( , )  → ∫ 0 ( ) ( ) is Hadamard-differentiable with derivative (ℎ, )  → ∫ 0 ℎ( ) ( ) + ∫ 0 ( ) ( ), where the latter integral is defined via integration by parts: ∫ 0 ( ) ( ) = ( ) ( ) − (0) (0) − ∫ 0 ( −) ( ) if is not of bounded variation.
• Product integral (Section 3.9.4.5 in van der Vaart and Wellner 39 ).For càdlàg functions of bounded variation (bounded by a fixed constant), the functional which is again defined by integration by parts.Here and below, ∏ with a time-continuous indexing stands for the product integral.
Due to the chain rule (Lemma 3.9.3 in van der Vaart and Wellner 39 ), the composition of both functionals is also Hadamarddifferentiable.Thus, since the Kaplan-Meier estimator satisfies ̂ , we obtain the following asymptotic representation of the relative treatment effect estimator: where the first integral can be rewritten as Here, ̂ , is the all-cause Nelson-Aalen estimator.Simplifying the expression in the previous display once again, we get , where − ⊂ [0, ) denotes the set of all discontinuities of • in [0, ).In combination with the well-known limit behaviour of the Nelson-Aalen estimators on function spaces, our previous reasoning and the continuous mapping theorem establishes the asymptotic normality of the normalized relative treatment effect estimator.
It remains to verify the asymptotic variance 2 .First note that, for deterministic càdlàg functions of bounded variation and and for stochastic processes and that satisfy the required integrability conditions, Fubini's theorem implies that This, in combination with the asymptotic variances 2 and covariances of the normalized cause-specific Nelson-Aalen estimators and straightforward but tedious computations reveal the claimed structure of asymptotic variance 2 .∎ Proof of Theorem 2. At first, we verify the conditions of Theorem 2 in Dobler 13 to establish the asymptotic normality of the randomized relative treatment effect.For this, we notice that the (randomized) Nelson-Aalen estimators ̃ , , = 1, 2, 3, are retrieved as functionals of the empiricial process of ( ̃ , ̃ ), = 1, … , , indexed by the class It is easy to see that  is a Vapnik-Cervonenkis class and it hence satisfies the uniform entropy condition (2.5.1) in 39 .As a consequence,  is both ℙ and P-Donsker and both distributions have bounded supremum norms, where ℙ = ( , ) and P = ( ̃ , ̃ ) .Similarly, the class Donsker and ℙ again has a bounded supremum norm with respect to  .Thus, Theorem 2 in 13 yields that the conditional distribution of √ ( ̃ − 0.5) converges weakly to a normal distribution in probability as → ∞.The limit variance, say ̃ 2 , is in general different from 2 .That is why the studentization based on ̃ 2 , is essential.To justify the consistency of ̃ 2 , for ̃ 2 , note that this variance estimator is a continuous functional of the randomization empirical process, i.e., the empirical process based on ( ̃ , ̃ ), = 1, … , .Since the class  is Glivenko-Cantelli, the continuous mapping theorem yields the consistency of the variance estimator.Finally, an application of a conditional version of Slutkzy's theorem concludes the proof.∎

D ADDITIONAL SIMULATION RESULTS
D.1 Sizes under 0 ∶ = 0.5 and nominal significance levels ∈ {1%, 10%} Tables D1 and D2 summarize the simulation results for the size of the proposed one-sided tests under 0 ∶ = 0.5 with nominal significance levels = 1% and 10%, respectively.The remaining simulation scenarios are the same as those which led to the results displayed in Table 1 for = 5% in the main manuscript.As the overall impression is similar to the findings from that Table 1, no additional comments on the outcomes are given here.

D.2 Additional power simulation results
Figures D1 and D2 present the power simulation results of the proposed randomization-based one-sided test in comparison to the paired Prentice-Wilcoxon test and the stratified log-rank test.Here, the marginal distributions are different competing exponential distributions and exponential-Gompertz mixtures versus an exponential distribution, respectively.Since also here the overall findings are similar to those given in Figure 2 in the main manuscript, no further comments are needed.

Figure 1
Figure1Scatterplots of = 200 simulated data points according to a Gumbel-Hougaard copula and, respectively, Gumbel and exponentially distributed marginals.The red line is the diagonal = .The dashed lines illustrate the upper end of the censoring support (0.7) and = 0.6.Upper panels: complete data set (no censoring); lower panels: censored data set.Left panels: raw data; censorings in both coordinates are denoted by "+"; censorings only in the horizontal coordinate are denoted "X"; censorings only in the vertical coordinate are denoted "Y"; completely uncensored data points are denoted by red circles.Right panels: competing risks data after transformation.The event time is the minimum of both coordinates.The symbols "1", "2", and "C" represent whether the data point corresponds to an observed event of type 1, type 2, or to a censoring.

Figure 2
Figure 2 Simulated power of three selected right-tailed randomization-based tests with significance level = 5%.

Figure 3
Figure 3 about here.
denotes the Kaplan-Meier estimator of ( ) = ( ̌ > ) and ̂ , are the Aalen-Johansen estimators of , = 2, 3.The minus sign in an argument indicates the left-continuous version of a function.B ASYMPTOTIC VARIANCES OF THE RELATIVE TREATMENT EFFECT ESTIMATORS, AND CONSISTENT VARIANCE ESTIMATORS B.1 Asymptotic variances of √ ( ̂ − ) and √ ( ̃ − 1 2 ) Time until blindness for the diabetic retinopathy data.The values on the diagonal (denoted by "+") correspond to patients where no blindness occurred throughout the observational period.Red circles indicate blindness of both eyes, values below the diagonal (denoted by "Y") indicate blindness of the control eye only, whereas values above the diagonal (denoted by "X") indicate blindness of the treated eye only.
sample) (b) Kaplan-Meier-curves for the two eyes (treated and control) for the diabetic retinopathy data, fitted separately for the juvenile and the adult sample.

Figure 3
Figure 3 Graphical summaries of the data set.