Estimating and comparing adverse event probabilities in the presence of varying follow-up times and competing events

Safety analyses of adverse events (AEs) are important in assessing benefit – risk of therapies but are often rather simplistic compared to efficacy analyses. AE probabilities are typically estimated by incidence proportions, sometimes incidence densities or Kaplan – Meier estimation are proposed. These analyses either do not account for censoring, rely on a too restrictive parametric model, or ignore competing events. With the non-parametric Aalen-Johansen estimator as the “ gold standard ” , that is, reference estimator, potential sources of bias are investigated in an example from oncology and in simulations, for both one-sample and two-sample scenarios. The Aalen-Johansen estimator serves as a reference, because it is the proper non-parametric generalization of the Kaplan – Meier estimator to multiple outcomes. Because of potential large variances at the end of follow-up, comparisons also consider further quantiles of the observed times. To date, consequences for safety comparisons have hardly been investigated, the impact of using different estimators for group comparisons being unclear. For example, the ratio of two both underestimating or overestimating estimators may not be comparable to the ratio of the reference, and our investigation also considers the ratio of AE probabilities. We find that ignoring competing events is more of a problem than falsely assuming constant hazards by the use of the incidence density and that the choice of the AE probability estimator is crucial for group comparisons.

of the incidence proportion drawing the interpretation into question. 14 To solve this issue all comparisons are not only conducted at the end of follow-up but also at the shorter of the two observed maximum follow-up times in the two groups, which is in line with, for example, common logrank test comparisons. There is also the concern that a low number of observations under study at the end of follow-up leads to increased variances. 15 Hence, the comparisons of the AE probability estimators are also investigated at two different quantiles of the observed times.
Throughout, our estimands will use the intention-to-treat population when comparing groups, based on estimators of the cumulative AE probability. In practice, treatment effects will be investigated using either treatment policy or while on treatment estimands, depending on the type of CEs and the available data, see Section 2.1.
We also note that Bender and Beckmann 16 have recently investigated whether the ratio of incidence densities may serve as an estimator of the hazard ratio even under misspecification. Their investigation was also motivated by AE analyses, also considered variances (via confidence intervals), and found that results depend on the baseline cumulative AE probability. However, these authors did neither consider competing events (which impacts probabilities) nor bootstrapping variances (which may lead to larger confidence intervals).
The paper is organized as follows, Section 2 introduces the AE probability estimand and estimators with corresponding variance estimators. Section 3 presents the comparisons of the estimators at the different follow-up times based on data from an oncology trial. In Section 4 a simulation study addresses the three questions posed above. The paper concludes with a discussion in Section 5.

| Competing risks model and estimand
In the following, we consider data from a two-arm randomized controlled trial with each group following a competing risks model displayed in Figure 1. Every patient starts in the initial state 0 at study entry, that is, at time 0. The event time at which a patient i moves from state 0 to either state 1 or 2, whatever occurs first, is denoted by T i and the event type is denoted by ϵ i as ϵ i = 1 in case of an AE and ϵ i = 2 in case of a CE in a time-to-first-event and type-of-first-event setting. Observation of patient i's data (T i , ϵ i ) is subject to right-censoring at time C i if C i < T i . Only the minimum of the censoring time C i or the event time T i can be observed and (T i , ϵ i ) remain unobserved if C i < T i . Therefore, the observable data consists of i.i.d. replicates of (min(T i , C i ), 1(T i ≤ C i ) Á ϵ i ). Furthermore, (T i , ϵ i ) and C i are assumed to be independent.
Suppressing index i in the notation, the AE hazard is defined as λ t ð Þ ¼ lim
Within groups, our main estimand will be the cumulative probability of a type 1 event, for times τ as defined below. Estimands of group comparison will contrast this estimand between groups. These estimands follow the intention-to-treat principle in the sense that they are based on the intention-to-treat population. If death is the only CE that may possibly occur before an AE, the treatment-policy estimand is adressed. 17 However, in safety analyses as part of the marketing authorization process, regulatory agencies typically consider the effect of the F I G U R E 1 Competing risks setting intention-to-treat on the while on treatment estimand, thus not completely following the intention-to-treat principle. One reason for this choice is that switch or discontinuation of treatment may result in diluting and perhaps anticonservative effects when assessing safety. This is reflected by the fact that AE follow-up is often shorter than follow-up for primary endpoints in oncology such as death. For instance, in oncology, another CE could be progression, in which case we would consider the intention-to-treat during progression-free survival, corresponding to the while-on-treatment estimand when progression leads to end of treatment. 18 These considerations also have a technical aspect, at least for non-parametric survival analyses. Here, a typical technical requirement is that the asymptotic probability of being atrisk is bounded away from zero. 1 All estimators are evaluated at time τ. Later τ will take the values τ which are the maximum follow-up times in the experimental group and control group, respectively. In general, τ max , but comparing both groups by evaluating estimators at the respective maximum follow-up time is what is commonly done in the analyses of adverse events. But to refrain from comparisons being impacted by observing one group much longer than the other, we also consider τ max ¼ min τ P90 the empirical 90% quantiles of min(T i , C i ) in group A and B, respectively, and τ P60 defined in the same way.

| Estimators of event probabilities and their variances
The following five different estimators of the AE probability on (0, τ] are considered and will be compared. The estimators are only displayed for the experimental group A. For the control group B, they are derived analogously. • Incidence proportion: Let d A (u) denote the number of observed AEs at time u in group A and let n A be the total number of patients in group A, then the incidence proportion is defined as where the sum is over all observed, unique event times u. The incidence proportion estimates P (AE in (0, τ], AE observed). The corresponding model-based variance estimator is • Probability transform incidence density: We first define the incidence density as where the denominator is the population time at risk restricted by τ with T i = t i the (realization of the) time of the first event irrespective of the event type of patient i. The model-based variance estimator of the incidence density is The incidence density is an estimator of the hazard λ A and as a consequence not directly comparable to the incidence proportion. But by the assumption of constant hazards and the connection to the exponential distribution it can be transformed to the probability scale by 1 À exp À c ID A τ ð ÞÁτ . If the assumption of constant hazards holds, λ A (t) = λ A 8t, the incidence density is an unbiased estimator of the hazard. The model-based variance estimator of the , which can easily be derived using the delta-method.
• 1-Kaplan-Meier: Let Δ b Λ A u ð Þ be the increment of the Nelson-Aalen estimator of the cumulative AE hazard, that is, where uÀ denotes the event time just before u, and the Nelson-Aalen estimator is given where Y A (t) denotes the number of individuals at risk in group A just prior to t.
Therefore, the increment of the Nelson-Aalen estimator is closely related to c ID A τ ð Þ and can be interpreted as an estimator of the current hazard value times the time increment. Then the 1-Kaplan-Meier estimator is defined as The Kaplan-Meier estimator estimates the "event-specific survival function" CEs as censoring the follow-up time. Its variance can be estimated using the Greenwood variance estimator. 19 Note that S A does not have a proper probability interpretation as a consequence of competing events, see below.
• Aalen-Johansen estimator: As reference we consider here the Aalen-Johansen estimator which is given by is the increment of the Nelson-Aalen estimator of the CE. The model-based variance of the Aalen-Johansen estimator can be estimated using a Greenwood-type estimator. 20 The Aalen-Johansen estimator estimates the cumulative incidence function, that is, the cumulative probability of a type 1 event P T ≤ τ, Here, the Kaplan-Meier estimator in the curly braces of the previous display estimates P (T ≥ u | group A) and the increment of the Nelson-Aalen estimator estimates λ A (u) du. Comparing 1 minus the event-specific survival function and the cumulative incidence function it can be easily shown than 1 minus the event-specific survival function is greater than the cumulative incidence function as long as CEs are present (see Appendix A for details), and the same inequality holds for the 1-Kaplan-Meier estimator and the Aalen-Johansen estimator. It can be shown that in absence of censoring or if censoring is only observed after the last AE, the Aalen-Johansen estimator is equal to the incidence proportion and that else the incidence proportion is smaller than the Aalen-Johansen estimator (see Appendix B for details).
• Probability transform incidence density accounting for CEs: ð Þ the number of observed CEs at time u in group A, is the incidence density of the CE and, hence, c ID A τ ð ÞÁτ is the parametric analog of b Λ A τ ð Þ. Using the incidence density of the CE, the connection between the incidence density and the incidence proportion is c ð Þ=n A in the absence of censoring. 21 In the presence of censoring and under a constant hazards assumption, the second factor of this probability transform, 1 À exp Àτ c , is the estimated probability of experiencing an event of either type until time τ, whereas the first factor, c , is the estimated probability of this event being an AE. Moreover, a model-based variance estimator of the probability transform of the incidence density accounting for CEs can be derived with the delta-method 22 and is provided in Appendix C. In the following, we will also call the probability transform of the incidence density accounting for CEs, somewhat loosely, parametric counterpart of the Aalen-Johansen estimator as the two estimators estimate the same quantity under the parametric assumption of constant hazards. Another way to estimate the variances of the estimators is a non-parametric bootstrap accounting for model misspecifications that may also influence the model-based variances. 13 To be more precise, the bootstrapped variance estimates are obtained by calculating the variance of the estimators of 1000 bootstrap datasets, which are generated by sampling observations of the original dataset with replacement until a dataset of the same size is obtained. The variances of the parametric estimators given above assume that the hazards are constant. This assumption is not made by the non-parametric bootstrap.
Below, we will compare non-parametric bootstrap variance estimators with the closed formula variance estimators from provided above, denoting the latter as "model-based."

| Between group comparisons
The comparison of the two treatment groups can be done in terms of the relative risk wherep A τ ð Þ andp B τ ð Þ are one of the five estimators of the AE probability in the interval (0, τ] in group A and group B which are introduced above. The variance estimator of the relative risk can be constructed based on a logtransformationv wherev ar A τ ð Þ 2 andv ar B τ ð Þ 2 are one of the two suggested variance estimators ofp A τ ð Þ andp B τ ð Þ evaluated at time τ.

| AN EXAMPLE: THE DECIDER TRIAL IN ACUTE MYELOID LEUKEMIA
As an example, we use data from the DECIDER trial (DECItabine, DEacetylase inhibition, Retinoic acid; ClinicalTrials. gov identifier: NCT00867672). 23,24 This randomized, multicenter trial had the objective to investigate the efficacy and safety of valproate (VPA) and all-trans retinoic acid (ATRA) in combination with decitabine in 200 older and nonfit patients with acute myeloid leukemia. The trial had a 2 Â 2 design, in which patients were randomly assigned to one of four treatment arms: decitabine, decitabine + VPA, decitabine + ATRA, or decitabine + VPA + ATRA. The primary endpoint of this phase II trial was objective response, and the trial was planned to show a difference between the combined ATRA and VPA treatment arms and the combined no ATRA and no VPA arms (control) at one-sided level alpha of 0.1. Under the assumption of treatment effects of 40% versus 25% a sample size of 200 was calculated for a power of 80%. The trial showed that the objective response rate as well as the overall survival rate was increased by ATRA. Here, we consider the comparison of the combined ATRA treatment arms (called group A, n A = 96) with the combined no ATRA treatment arms (called group B, n B = 104) with respect to the adverse event severe thrombocytopenia, Common Terminology Criteria for Adverse Events (CTCAE) grade 3-5. This AE was observed in 35 patients in group A and 32 patients in group B. As AEs were recorded only until 28 days after end of treatment, death and end of treatment plus 28 days with no observed AE during that time were considered as CEs. As ATRA prolonged overall survival time (median of 8.2 months in group A and 5.1 months in group B), 24 a CE was experienced by 56 patients in group A and by 69 patients in group B, leading on average to a longer follow-up time for AEs in group A (mean 137 days) as compared to group B (mean 125 days). So in this data set, there is hardly any censoring as only five patients in group A and three patients in group B were censored. The time was measured in days and analyses were performed at τ A ð Þ max ¼ 802, respectively τ B ð Þ max ¼ 980, the maximum follow-up times in the two groups, at τ max = 802 the minimum of the two maximum follow-up times, at τ P90 = 353 and at τ P60 = 67, both chosen according to the quantiles of the observed times.  24 This difference was not regarded as relevant taking into account the superiority of group A with respect to overall survival. Now considering the other estimators, the probability transform of the incidence density results in a higher estimated AE probability than the Aalen-Johansen estimator at all follow-up times. The difference is more pronounced for longer follow-up times as the CEs are observed later in time. The 1-Kaplan-Meier estimator also always obtains a higher estimated AE probability than the Aalen-Johansen estimator but less pronounced than the probability transform of the incidence density. For the latter, the estimated cumulative hazards in Figure 3 illustrate departures from the constant hazard assumption. Moreover, Figure 3 shows that CEs tend to occur later in time than AEs. This is the reason why the difference between the probability transform of the incidence density or the 1-Kaplan-Meier estimator and the Aalen-Johansen estimator is larger at later follow-up times.

| Estimating the AE probability
The other three estimators differ only slightly. The probability transform of the incidence density accounting for CEs is comparable to the Aalen-Johansen estimator. Even though the assumption of constant hazards is arguably not valid, the difference between the two estimators that account for the CE is nearly negligible. The incidence proportion F I G U R E 2 Adverse event probability estimators applied to the DECIDER trial at the different follow-up times. The symbols have been jittered for better readability takes the same value as the Aalen-Johansen estimator as long as there are no censored observations in the data. 21 In these data, there is hardly any censoring. In group A, all censoring is observed after the last AE yielding equality of the Aalen-Johansen estimator and the incidence proportion. In group B, however, there is one AE after all censoring resulting in a slightly lower incidence proportion compared to the Aalen-Johansen estimator. When estimating the probabilities at earlier follow-up times this difference is less pronounced as in these situations all AEs are observed before one observation is censored.

| Variance estimation
The variances of the probability estimators are calculated in two ways, first by using the formulas displayed in Section 2 and second by a non-parametric bootstrap. Table 1 displays the estimated variances of all estimators for both groups evaluated at the different follow-up times. At the maximal follow-up time in each group, two major points can be observed. Firstly, the variance of the 1-Kaplan-Meier estimator is much larger than all other variances. This verifies the expectation that at later follow-up times the variance of the Kaplan-Meier estimator is increased due to a small number of patients still at risk and this is the motivation for comparing also at the other follow-up times. 15 Also for shorter follow-up, except for τ P60 , this increased variance can be observed. At τ P60 all variances are virtually the same.
Secondly, for the probability transform of the incidence density, the model-based variance is smaller than the bootstrapped variance. This is likely due to the model misspecification as the constant hazard assumption might not hold for the AE hazard as Figure 3 suggests. Again this effect is more pronounced for longer follow-up times than for τ P60 .
The two variance estimators of the Aalen-Johansen estimator and of the probability transform of the incidence density accounting for CEs are almost identical as, like the incidence proportion, they aim to estimate the binomial probability P(AE in (0, t]).

| Group comparisons
We investigate how the ratio of two biased estimators of the probabilities compares to the ratio of two unbiased estimators. Table 2 displays the relative risks calculated from the estimators and time points shown in Figure 2. Additionally, the 95% confidence intervals of the relative risk calculated with both variance estimators are displayed.
F I G U R E 3 Nelson-Aalen estimates of the cumulative hazard with 95% pointwise confidence intervals (95% CI) of the adverse event and the competing event in both treatment groups. The vertical lines display the analysis times The relative risks calculated by the incidence proportion and the Aalen-Johansen estimator only differ at the maximal follow-up time. This is a direct consequence of the AE after the late censorings in group B leading to a slightly lower AE probability estimated by the incidence proportion than by the Aalen-Johansen estimator.
At all follow-up times, the relative risk estimated with the probability transform of the incidence density and the one estimated with the 1-Kaplan-Meier estimator are smaller than the relative risk obtained with the reference Aalen-Johansen estimator. At the maximum follow-up time, the direction of the estimated relative risks is different. But note that 1 is always included in the confidence interval indicating no statistically significant therapy effects.
The relative risk estimated by the probability transform of the incidence density accounting for CEs and the relative risk estimated by the Aalen-Johansen are similar at the later follow-up times. At τ P60 the relative risk estimated by the probability transform of the incidence density accounting for CEs is greater than the one estimated by the Aalen-Johansen estimator.
The differences between the confidence intervals obtained with the model-based and with the bootstrapped variances are due to differences in the estimated variances. The possible impact of this is illustrated by the fact that the relative risk estimated with the Aalen-Johansen estimator is not included in the model-based confidence interval of the probability transform of the incidence density at the maximal follow-up time and at τ P90 but it is included in the confidence interval obtained with the bootstrapped variance. Furthermore, the model-based confidence intervals of the incidence proportion and the Aalen-Johansen estimator are wider than the confidence interval where the variances are obtained with a bootstrap. For the probability transform of the incidence density and the 1-Kaplan-Meier estimator at τ   incidence density at τ P90 and for the probability transform of the incidence density accounting for CEs at τ P90 and τ P60 the confidence interval based on the bootstrapped variance is longer than the model-based confidence interval. This is also due to the differences in the estimated variances.
In summary, in this data example, estimators that lead to a higher estimate of the AE probability than the Aalen-Johansen estimator result in a smaller relative risk estimate than that based on the Aalen-Johansen estimator. The model-based CIs of the probability transform of the incidence density may be too small in the sense that they do not cover the reference estimate of the RR, but the bootstrapped CIs do not suffer from this shortcoming.

| SIMULATION STUDY
In this section, we take a closer look at the three sources of bias, namely censoring, CEs, and non-constant hazards. For this purpose, competing risk data with the event of interest and one CE are simulated for two independent groups A and B. 25,26 The simulation algorithm is described in the Supporting Information. Table 3 describes the simulation T A B L E 2 Relative risk (RR), model-based confidence interval (CI), bootstrap CI and the ratio of the lengths of the two confidence intervals (model-based/bootstrap) calculated from the different AE probability estimators at several follow-up times in the DECIDER trial 24 scenarios. The simulation scenarios marked with ? are displayed in the following. These represent all main findings of the investigation. The results of the other simulation scenarios can be found in the Supporting Information. For each scenario N REP = 10,000 datasets are simulated. Scenarios S1, S2, and S3 consider constant hazards for both events. The hazards are equal to the incidence densities estimated in the data example presented in Section 3. The hazards λ A (t) and λ B (t) correspond to the T A B L E 3 Summary of the scenarios considered in the simulation study (N REP = 10,000). The sample size is equal in both groups     AE hazards and the hazards λ A t ð Þ and λ B t ð Þ to the CE hazards in group A and group B, respectively. In contrast, scenarios S4-S10 investigate the case where at least one hazard is time-dependent. In the scenarios S1, S2, S4, S7, and S8 the data are completely observed, that is, without censoring. In scenarios S3, S5, S6, and S9 between 10 and 28 percent of the observations are censored whereas in scenario S10 similar to the data example only very little censoring is present. The censoring times were generated from a uniform distribution and are independent of the event times.

| Probability estimators
As in the data example, the comparisons are also conducted at several follow-up times. Table 4 displays the mean simulated follow-up times, that is, the mean simulated values of τ A ð Þ max , τ B ð Þ max , τ max , τ P90 and τ P60 as they take different values in each simulation run. Therefore, as the true simulated value depends on the follow-up time, it is not the same value over all simulation runs, but different in each simulation run. For example, the true simulated value of the probability of the event of interest in group A is calculated by P . For this reason, we calculate the mean simulated value over all N REP = 10,000 simulation runs. The mean simulated AE probabilities at the different follow-up times of group A and B are displayed in Table 5. For more stable results in the calculation of the mean simulated value, the mean was calculated on the logit-transformed results and back-transformed to the probability scale. For the same reason in the calculation of the relative mean bias in Table 6, the mean was calculated on the log ratios and then exp-transformed back to the original scale. The Tables 5 and 6 show three important quantities of the comparison: the mean simulated AE probabilities at the different follow-up times of group A and B, the absolute mean bias and the relative mean bias.
The incidence proportion underestimates the AE probability if censoring is present, and therefore, at all follow-up times in scenarios S3, S5, and S10. Under a large percentage of censoring as in scenario S3 at the maximum follow-up time, the AE probability estimated by the incidence proportion is on average about 9% smaller than the mean simulated AE probability.
The probability transform of the incidence density and the 1-Kaplan-Meier estimator are higher than the simulated value. For the 1-Kaplan-Meier estimator this difference is the most pronounced in scenario S5 in group A as this is the scenario with the most CEs. At the maximum follow-up time in group A the AE probability estimated by the 1-Kaplan-Meier estimator is on average about 400% increased compared to the simulated value. For smaller follow-up times the differences are less pronounced since fewer CEs are present. Furthermore, it is notable that for group A in scenario S5 there is a huge difference between the estimate of the probability transform of the incidence density and the 1-Kaplan-Meier estimator. Due to the quadratic term in the hazard of the event of interest the assumption of constant hazards for the AE hazard is severely violated resulting in quite different estimates of the two estimators that censor CEs.
In the scenarios S2, S3, and S10 when considering the mean absolute and relative bias the Aalen-Johansen estimator and the probability transform of the incidence density accounting for CEs are both unbiased. But if both hazards are non-constant as in scenario S5 the probability transform of the incidence density accounting for CEs is more biased.
In scenario 5 at the maximum follow-up time all estimators are biased. But the Aalen-Johansen estimator obtains the smallest bias in group A and the second smallest bias in group B. Note that in scenario S5 group A the mean relative bias is marked as NA for all estimators at τ P90 and τ P60 as in most simulation runs no events of interest were observed until then resulting in a ratio of 0.  The difference between the 1-Kaplan-Meier estimator and the simulated values is larger than the one between the parametric counterpart of the Aalen-Johansen estimator or the Aalen-Johansen estimator and the simulated values. Therefore, it seems that ignoring CEs may be more harmful than falsely assuming constant hazards.
In scenarios in which at least one hazard is constant the Aalen-Johansen estimator and the probability transform of the incidence density accounting for CEs perform similar but in a scenario with both non-constant hazards as in scenario S5 the Aalen-Johansen estimator still has a small bias compared to the other estimators. In real data analyses the true value is unknown and if one estimator needs to be chosen to work in any situation, the Aalen-Johansen estimator is the best choice. This makes the Aalen-Johansen the reference estimator in this context. As long as there is no censoring present the incidence proportion and the Aalen-Johansen estimator coincide and therefore have the same bias (scenario S2). The probability transform of the incidence density and the 1-Kaplan-Meier estimator are more biased than the Aalen-Johansen estimator.  F I G U R E 5 Boxplots of the difference between the relative risk of the estimators and the simulated relative risks (mean absolute bias). The following follow-up times are considered τ

| Variance estimators
Analogously to the data example, the estimated variances are calculated model-based with the formulas from Section 2.2 and with a non-parametric bootstrap. The boxplots in Figure 4 display the estimated variances in all simulation scenarios for all considered probability estimators.
Considering the boxplots of the variances of the estimators one immediately notices the increased variances of the Kaplan-Meier estimator for the scenarios S2, S3, and group A of S5. The reason for these outliers is that, if the last event is an AE, the Kaplan-Meier estimator equals 0 and as a consequence, the 1-Kaplan-Meier estimate remains unchanged at 1 near the end of follow-up in some of the bootstrap replications. The model-based variance using the Greenwood estimator is slightly smaller but still increased compared to the other estimators. This problem was already mentioned in Pocock et al. 15 and is the motivation why we also consider some earlier follow-up times. Late censored observations and late CEs as present in group B of scenario S5 and scenario S10 prevent the Kaplan-Meier estimator from dropping to 0 and therefore result in a smaller variance.
Furthermore, the parametric counterpart of the Kaplan-Meier estimator is less sensitive to the type of the last event. The variance of this estimator has only a few outliers at the end of follow-up. In this comparison, the parametric estimator has a smaller variance than the non-parametric Kaplan-Meier estimator.
In scenario S10 in group B an increased variance for earlier follow-up times can be detected. The reason is that censoring, increasing the variance, occurs early and the last event is rarely censored.
T A B L E 8 Simulation results: Mean relative bias of the relative risk estimators compared to the mean true simulated relative risk in Table 7.  However, it is notable that differences between the estimated variances of the non-parametric Aalen-Johansen estimator and its parametric counterpart are small. Furthermore, the bootstrapped variances of the incidence proportion, of the probability transform of the incidence density accounting for CEs and of the Aalen-Johansen estimator are comparable to the model-based ones.

| Estimating the relative risk
In the data example, estimators that overestimate the AE probability underestimate the relative risk. This is further investigated in the simulations. Table 7 displays the mean simulated relative risk. The different scenarios consider situations of a larger probability in group A (scenario S2, S3, and S10 at τ P60 ) and of a smaller probability in group A (scenario S5 and S10 except τ P60 ).
The boxplots in Figure 5 display the mean absolute bias of the relative risk calculated with the estimators compared to the simulated relative risk. Table 8 displays the corresponding mean relative bias. Using the incidence proportion, the probability transform of the incidence density accounting for CEs or the Aalen-Johansen estimator when comparing two treatment groups in terms of the relative risk induced no bias in the scenarios with constant hazards without censoring and time-dependent hazards. In the scenario with constant hazards and censoring (scenario S3) the incidence proportion results in a negative bias whereas the probability transform of the incidence density accounting for CEs and the Aalen-Johansen estimator are unbiased.
Using either the probability transform of the incidence density or the 1-Kaplan-Meier estimator the relative risk is smaller than the simulated relative risk in scenario S2 and S3 where there is a beneficial effect of B but higher in scenario S5 at τ The estimators that perform well in terms of the mean absolute or relative bias in both groups in Table 6, that is, have a negligible bias, also perform well when estimating the relative risk. The combination of the Tables 6 and 8 show that there is no case in which the AE probability estimates are severely biased but the estimated relative risk is unbiased.

| CONCLUSIONS AND DISCUSSION
In this paper, we compared several estimators quantifying the AE probability. The reference estimator in a time-toevent analysis with CEs is the Aalen-Johansen estimator. Simulations and the real data example illustrated that using the probability transform of the incidence density accounting for CEs or the incidence proportion may also provide unbiased estimates as long as the hazards are constant or there is no censoring at the considered follow-up time. In this context, we note that data monitoring committees or data safety monitoring boards will typically face situations characterized by high amounts of censored observations when monitoring adverse events during an ongoing trial.
In AE analyses with many CEs as progression, treatment discontinuation, or death only few observations may be censored in a time-to-first-event (and type-of-first-event) setting. Therefore, the use of the incidence proportion may often be justified if censoring is low and the CEs have been specified appropriately. This emphasizes the need to consider thoroughly how the CEs are defined in the specific trial as this directly impacts the amount of censoring, see Stegherr et al. 18 for guidance on this aspect. As outlined earlier, the type of CEs will also have an impact on the type of estimands, typically either while on treatment or treatment policy, and the type of CEs may be influenced by the design of AE follow-up. Moreover, in the presence of CEs the probability transform of the incidence density and the 1-Kaplan-Meier estimator always overestimate the AE probability. Using the 1-Kaplan-Meier estimator or the probability transform of the incidence density in group comparisons in terms of the relative risk can lead to the opposite conclusion about a therapy's safety as compared to using the Aalen-Johansen estimator.
The incorrect use of the Kaplan-Meier estimator in the presence of CEs also occurs in other disease areas, among others in the context of cardiovascular diseases. 9,10 The assumption of constant hazards has been challenged for events associated with aging, for example, death and some types of cancer, as the hazards increase with time, or for events that occur early, for example, remission, as there the hazards decrease with time. 5 To summarize, in the literature about the analyses of AEs 16,27 mainly the constant hazards assumption is criticized, but one of our main results is that ignoring CEs and treating them falsely as censoring at the event time in a Kaplan-Meier-type calculation may be worse than misspecifying the model by falsely assuming constant hazards.
In practice a common way to report the AE probability is by the use of frequency categories. 28,29 Thereby, the AE probability is categorized to "very common" if the probability is greater than 10%, "common" if the probability is between 1% and 10%, "uncommon" if the probability is between 0.1% and 1%, "rare" if the probability is between 0.01% and 0.1%, and "very rare" if the probability is smaller than 0.01%. For very common AEs as here in the DECIDER trial and in most scenarios of the simulation study often all estimators derive the same category. But for smaller categories the choice of the estimator plays an important role. An absolute bias of 0.0001 may lead to a different frequency category for a very rare AE.
AEs may also be recurrent, 27 but the issue of CEs will remain as relevant as in time-to-first-event analyses. Recent papers by Charles-Nelson et al. 30 and by Andersen et al. 31 emphasize the importance of considering competing (terminal) events also in analyses of recurrent events. See these references 27,32,33 for suggestions on the analysis of recurrent AEs.
Here we emphasize the need to carefully consider the different follow-up times in safety analyses. One reason is that due to a small number of patients still at risk at the end of follow-up the variances of the estimators may be increased. 15 We investigated this aspect using times τ max , τ P90 and τ P60 , defined via quantiles in treatment groups. We note that for times τ P90 and τ P60 this comes with possibly reduced statistical power, especially if AEs occur late in the study period. Another aspect with different follow-up times between groups is to calculate the relative risk using one common time point in both groups as otherwise, the interpretation may be difficult. 14 We reiterate that also the common logrank test only compares two groups while the risk sets are nonempty in both groups, that is, up to time τ max . Here, we only considered the relative risk, since it is frequently used and often considered easy to interpret. But similar conclusions can be drawn for the risk difference (results not shown).
We further compared the model-based and the bootstrapped variance estimates of the estimators. Thereby, for the incidence proportion, the Kaplan-Meier estimator, the probability transform of the incidence density accounting for CEs, and the Aalen-Johansen estimator no relevant differences between the two variance estimators were found. For the probability transform of the incidence density, we found a smaller estimated variance for the model-based approach than for the bootstrapped one. The model-based variance assumes constant hazards whereas the nonparametric bootstrapped variance estimator does not. 13 Here, we focused on the comparison of probabilities and the ratio of probabilities (relative risk) instead of the comparison of hazards. Interpretation of hazards and hazard ratios is often challenging in particular in the presence of CEs. 34 Arguably, the best way to communicate results is by visualizing the estimated probabilities for the two groups. All estimators introduced in Section 2 can easily be visualized in plots (see Figure 2), although theses plots do not provide direct information about the possible dynamic pattern of the treatment effect. 35 Here, to better understand the differences between the estimated AE probabilities in the data example, we also considered a plot of the cumulative hazards estimated by the Nelson-Aalen estimator in Figure 3. Therefore, as most analyses consider both the probabilities and the hazards one other point of future research is to compare different estimators of the hazard ratio, for example, the hazard ratio obtained by the Cox proportional hazards model, the ratio of the Nelson-Aalen estimators and the ratio of the incidence density of both groups.
To continue the discussion on studying probabilities or hazards in a time-to-AE, hence, in a competing events setting, we note than one major motivation for using the cumulative AE probability as our basic estimand, see Section 2.1, was that both the commonly used incidence proportion and the Kaplan-Meier estimator target the probability scale. This is also the scale that informs AE frequency categories and it motivated conversion of the also commonly used incidence density onto the probability scale, either in the spirit of Kaplan-Meier or accounting for CEs. We note that in this context the presentation of the use of incidence density as opposed to incidence proportions is often somewhat technical, focusing on the different denominators, but hardly on the parameter being estimated, see Beyersmann and Schrade 21 for a discussion in the context of AEs during hospital stay.
However, incidence densities are quite instructive for a discussion on studying probabilities or hazards for time-to-AE. An ideal situation when weighing the benefits and harms of a treatment should be a situation where there is no effect on AEs. Formalizing this as no effect on the AE hazard λ (and assuming constant hazards), recall that the anytime AE probability is with CE hazard λ. If the treatment is beneficial in that it reduces the CE hazard, the anytime AE probability will be increased. Therefore, the use of the probability scale is often criticized as "biased," but we believe that the effect at hand is just common sense: The time spent in the initial state of Figure 1 is prolonged, because the all-events-hazard λ þ λ is reduced. During this prolonged time, an unaltered AE hazards acts, leading to the effect on the AE probability just described. However, this scenario does shed a light, firstly, on the subtle relation between hazards and probabilities when there is more than just one event type, and, secondly, on the important question of how to formalize effects in the presence of CEs. Arguably, "no effect on AE" is perhaps most easily formalized on the hazard scale, but the above simple calculation demonstrates that one must account for CEs. Again, the Kaplan-Meier estimator does not achieve this and would consequently provide a misleading picture on the probability scale.
The analyses of this paper are motivated by the Survival analysis for AdVerse events with VarYing follow-up times project (SAVVY). 18 This is a joint project of academic institutions and pharmaceutical companies with the aim to improve standards for the reporting of incidences of adverse events. Whereas this paper is more focused on a methodological comparison of the estimators under different settings, the SAVVY project includes an empirical study calculating estimators mentioned in this paper in several randomized controlled trials and summarizing the results in a metaanalysis to assess the sources of bias in safety analyses in real clinical trials. In the end, both this paper and the SAVVY project intend to reduce the usage of biased estimators in the analysis of AEs.
To sum up and to return to the three questions raised in the introduction of this paper: The answer to question (i) is that the choice of the estimator is also crucial for group comparisons in terms of the relative risk. The bias in calculating the AE probabilities and variances of the AE probabilities do also directly influence the relative risk and the confidence intervals. Regarding question (ii), since the probability transform of the incidence density accounting for CEs is less biased with respect to the Aalen-Johansen estimator than of 1-Kaplan-Meier estimator, ignoring CEs can be worse than falsely assuming constant hazards. When then considering question (iii), for earlier time points of analysis for most AE probability and relative risk estimators the bias with respect to the Aalen-Johansen estimator is smaller. Especially, for the incidence proportion the bias is almost negligible.
A P P END I X A .: PROOF: THE ESTIMAND OF THE 1-KAPLAN-MEIER ESTIMATOR IS GREATER THAN THE CUMULATIVE INCIDENCE FUNCTION For the cumulative incidence function (left-hand side below) that can be estimated by the Aalen-Johansen estimator and the estimand of the 1-Kaplan-Meier estimator (right-hand side) the following inequality holds: where relation ? holds since exp À R u 0 λ A s ð Þds ≤ 1 with equality if λ A s ð Þ ¼ 08s, that is, no CEs are present in the data.
Note that relation ? does not postulate the existence of latent event-specific times. As a consequence, exp À R t 0 λ A s ð Þds has no proper probability interpretation in settings with CEs.

A P P END I X B.: CONNECTION BETWEEN THE INCIDENCE PROPORTION AND THE AALEN-JOHANSEN ESTIMATOR
If we look more closely at the Aalen-Johansen estimator, we see two cases where it is equal to the incidence proportion. For simplicity, we only consider the end of follow-up τ We can distinguish three cases and in two of them the Aalen-Johansen estimator reduces to the binomial estimator, the incidence proportion.
• No censoring: In case of no censoring, there is no i {1, …, m} such that both d A (t i ) and d A t i ð Þ are equal to 0 at t i .
Then, it is easy to see that Equation (B1) is equal to P m k¼1 d A t k ð Þ=n A , the incidence proportion.
• Censoring only after last AE: Assume the first m 1 observations are either AE or CE and let the last m À m 1 observations be either CEs or censored. This is equal to assuming there is no i {1, …, m 1 } such that both d A (t i ) and d A t i ð Þ are equal to 0 at t i . Then Equation (B1) is equal to P m 1 k¼1 d A t k ð Þ=n A and as d A (t i ) = 0 for i = m 1 + 1, …, m, it is also equal to the incidence proportion P m k¼1 d A t k ð Þ=n A .
• Censoring in between AE and CE: If there are no AEs or CEs but only censoring observed at time t i , that is, ð Þ one (more) factor is equal to 1 instead of something smaller than 1. The incidence proportion assumes all of these factors being smaller than 1 resulting in something smaller being added for each AE after a censoring. The consequence is a smaller estimate compared to the Aalen-Johansen estimator which accounts censoring.