Survival analysis for AdVerse events with VarYing follow-up times (SAVVY): Rationale and statistical concept of a meta-analytic study

The assessment of safety is an important aspect of the evaluation of new therapies in clinical trials, with analyses of adverse events being an essential part of this. Standard methods for the analysis of adverse events such as the incidence proportion, i.e. the number of patients with a specific adverse event out of all patients in the treatment groups, do not account for both varying follow-up times and competing risks. Alternative approaches such as the Aalen-Johansen estimator of the cumulative incidence function have been suggested. Theoretical arguments and numerical evaluations support the application of this more advanced methodology, but to our knowledge there is as yet insufficient empirical evidence as to whether these methods would lead to different conclusions in safety evaluations. The Survival analysis for AdVerse events with VarYing follow-up times (SAVVY) project strives to close this gap in evidence by conducting a meta-analytical study to assess empirically the impact of the methodology on the conclusion of the safety assessment. Here we present the rationale and statistical concept of the empirical study conducted as part of the SAVVY project. The statistical methods are presented in unified notation and examples of their implementation in R and SAS are provided.


Introduction
Time-to-event or survival endpoints are commonly encountered in clinical trials, and some literature searches have found that survival analysis is the most common advanced statistical technique in medical research (Horton and Switzer, 2005;Sato et al., 2017). Censoring is arguably the major reason to use survival methodology, since statistical inference that does not account for censoring will, in general, be biased.
to this work beyond presenting concepts of the main study. While basic methodological considerations on AE analyses, censoring, varying follow-up times and competing risks have been discussed elsewhere, see, e.g., Unkel et al. (2019) and Beyersmann and Schmoor (2019), this paper offers additional detailed insights by considering practically relevant questions such as which kind of event should be viewed as competing and which one as a standard censoring event. For instance, Unkel et al. (2019) briefly touch upon the question whether diagnosed progression is a competing event or rather a censoring event, and if the latter, whether such censoring is informative, and, finally, what the impact on AE analyses is. Here, Section 2.2 offers further guidance, also distinguishing between a "hard" competing event definition and a more encompassing one. Furthermore, an important meta-analytical issue is that estimates are typically weighted by the inverse of an estimated variance which is not straightforward in the setting here, since the aim is to compare different methodological approaches performed within one study. As a consequence, the variance of the difference measure between, e.g., Kaplan-Meier and incidence proportion is required. This difficulty has, e.g., also been faced (but not solved) by Lacny et al. (2015, 2018), and Section 4.4 will explain how to bootstrap these variances.
The remainder of the paper is organized as follows. Section 2 discusses in detail the definition of AEs and of competing events and will also briefly consider composite events. The latter will be included to investigate the impact of ignoring censoring without the complication of competing events. Section 3 explains the organization of the data analyses within SAVVY and may serve as a template for future investigations of related questions. Here, an important aspect is that trial level analyses will be run at sponsors' sites, but meta-analyses will be run centrally by the academic project collaborators. The statistical methods on trial level are collected in Section 4. This section, in particular, explains how SAVVY will quantify and account for different lengths of follow-up and makes a connection between the different methods of estimation. For instance, the incidence proportion will equal the Aalen-Johansen estimator evaluated at the largest observed time in the absence of censoring. Details of the meta-analyses to be performed are in Section 5. This section, in particular, addresses the multitude of comparisons to be considered as well as assessment of bias and heterogeneity. A brief discussion is in Section 6, also addressing the question of recurrent AEs, and software code is provided in the online supplement.

Definition of events
Adverse events
According to the Good Clinical Practice (GCP) guideline an AE is defined as "any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have a causal relationship with this treatment" (Committee for Human Medicinal Products, 2016). In this meta-analytic study, the choice of AEs within the selected clinical trials is left to the sponsor. These may be defined as AEs of special interest, as belonging to a specific Medical Dictionary for Regulatory Activities (MedDRA) system organ class or preferred term, as being severe according to a toxicity grading, as being related to the investigational product, as serious, or as a combination of these characteristics.
Often sponsors will select as AEs the adverse drug reactions presented in the core data sheet for a first submission for drug approval, and select the studies that supported the frequency derivation.
These choices may result in a range of frequencies, from common AEs to rare ones. It is expected that differences between the methodological approaches will be less marked with very rare AEs or very common AEs. Grouping different rare AE types into one AE category would therefore be permissible for studying methodological differences. Making use of this frequency range (from rare to more common per selected trial) allows further investigation of the impact of the frequency on the differences between statistical analysis methods. Ideally, AEs of different frequencies per trial should be chosen, e.g. around 30%, around 10%, and around 1%.
The investigation is restricted to the analysis of the occurrence of the first AE of a specific type and will not consider the analysis of recurrent events. However, the relevance of the present investigation for recurrent AEs will also be considered in Section 6.

Competing events definition
Competing events are events that preclude the occurrence of the AE of interest. For instance, if in a clinical trial the focus is on estimating the probability of headache, patients who die without prior headache report will never report headache. It is obvious that death is a competing event with respect to the occurrence of headache. However, defining a general rule as to which events should be treated as a competing event, without specific insight into the specific event of interest and study at hand, is challenging.
As a rule of thumb, any event that both a) would be viewed from a patient perspective as an event of his/her course of disease or treatment, and b) would stop the recording of the AE of interest should be viewed as a competing event. This situation would typically occur when a patient discontinues the treatment due to another AE judged by the investigator as too severe to continue treatment, or when the patient discontinues treatment or study due to progression/lack of efficacy, and, as a consequence, the recording of AEs ends. Hence, if end of follow-up for AEs, withdrawal of consent or discontinuation is disease or treatment-related, this would be handled as a competing event. See Lawless and Cook (2019) for related considerations.
In contrast to a competing event, the time-to-event is censored if the patient reaches the designated end of follow-up without having had the AE of interest or a competing event as defined above. This situation is present with administrative censoring due to the regular end of the trial or the end of follow-up for AEs due to the planned end of treatment (often end of treatment plus an additional fixed time interval, e.g., 30 days) not triggered by the course of disease.
In the analysis, the different competing events will be combined into one composite competing event, as the aim is to compare different methods to quantify the risk of an interesting AE and not the risk of a competing event of a specific type.
There is much debate among statisticians on the question of which events should be analyzed as competing events and which events should lead to censoring the time-to-event. In practice, there is no discussion that death without the previous occurrence of the interesting AE acts as a competing event with respect to AE occurrence. The reason is that after death the AE can definitely not occur any more, and in this sense "death without prior AE" is a so-called "hard" competing event. However, the other events mentioned above, such as loss to follow-up, withdrawal of consent or treatment discontinuation, can be regarded as so-called "soft" competing events in the sense that, thereafter, the interesting AE in principle still could occur, but cannot be observed due to end of follow-up. Discussions therefore arise around the question whether to treat these soft competing events as censoring or as competing events. One extra concern here is that, e.g., treatment discontinuation will likely alter the AE hazard. See also Unkel et al. (2019) on how these aspects connect to the current debate on estimands.
Therefore, two different approaches will be compared in this project. In a first approach (called "all-events approach" in the following), all competing events mentioned above will be combined and analyzed as a single competing event, and only patients reaching the designated end of follow-up with neither a prior AE of interest nor a competing event as defined above will contribute as censored observations. In a second approach (referred to as "death-only approach" in the following), only the hard competing event (i.e. death without prior AE of interest) will be analyzed as a competing event, and the soft competing events will be analyzed as censored observations.
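The two approaches can be sketched as a recoding of the type of first event. The sketch below is illustrative only: the numeric coding (0 = censored, 1 = AE of interest, 2 = death without prior AE, 3 = soft competing event) and the function name `recode` are assumptions made for this example and are not taken from the SAVVY programs.

```python
# Hypothetical first-event codes (an assumption made for this sketch).
CENSORED, AE, DEATH, SOFT = 0, 1, 2, 3


def recode(event_type, approach):
    """Map a first-event code to 'ae', 'competing' or 'censored'.

    approach: 'all-events' treats soft competing events as competing,
              'death-only' treats them as censored observations.
    """
    if approach not in ("all-events", "death-only"):
        raise ValueError("approach must be 'all-events' or 'death-only'")
    if event_type == AE:
        return "ae"
    if event_type == DEATH:
        return "competing"  # hard competing event under both approaches
    if event_type == SOFT:
        return "competing" if approach == "all-events" else "censored"
    if event_type == CENSORED:
        return "censored"
    raise ValueError(f"unknown event type: {event_type!r}")


for code in (CENSORED, AE, DEATH, SOFT):
    print(code, recode(code, "all-events"), recode(code, "death-only"))
```

Only the handling of soft competing events differs between the two approaches; AEs, deaths and administratively censored observations are coded identically.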

Composite events
In addition to the statistical analyses of AEs considering competing events, further analyses encompassing both the AE of interest and the competing event as a composite event will be performed, thereby addressing the composite estimand (Unkel et al., 2019; Rufibach, 2019). The time to the composite event will be defined as time to the interesting AE or to the competing event, whichever occurs first, and patients with neither the interesting AE nor the competing event will contribute a censored time-to-event. The rationale for the inclusion of this approach is to gauge the impact of using time-to-event methodology to account for varying follow-up times without the methodological complication of competing events. To this end, time-to-event analyses accounting for censoring will be compared to the traditionally used incidence proportion.

Organization of the data analysis
The SAVVY project group consists of the academic project collaborators who planned the statistical analyses (in the following referred to as the "analysis center") and the participating sponsors, who contribute randomized clinical trial data for analysis. In the SAVVY project, the data analysis involves the following steps:
1. Pre-registration: Confidential pre-registration of the clinical trials selected by the sponsors with the analysis center and allocation of a SAVVY trial identifier (ID) to registered trials by the analysis center
2. Individual trial analysis: Analysis of the registered trials at sponsor's site using code provided by the analysis center and transfer of aggregated trial level results to the analysis center
3. Meta-analysis: Meta-analysis of trial level results at the analysis center
In the following, these steps will be considered one by one in more detail.
Pre-registration The sponsors select the randomized clinical trials and the AEs they wish to enter into this project. In order to avoid selection bias, these trials have to be confidentially pre-registered with the analysis center before running the analyses. For the identification of the trials, a unique trial ID according to a publicly accessible trial registry, e.g. clinicaltrials.gov or the German Clinical Trials Register, has to be provided together with some characteristics of the trial and the selected AEs (see Table 1). The identification of the trials included will not be disclosed otherwise. In publications or presentations it will only be reported that studies have been identified to the analysis center. The analysis center will handle any information related to trial level data in a confidential manner. Also, the exchange of trial related information between the sponsor and the analysis center will take place in a secured manner following the sponsors' individual policies regarding the secure exchange of confidential information.
To register the trials and the AEs for the SAVVY project, the sponsor fills in a spreadsheet containing one row per AE of interest with the main AE characteristics, e.g. seriousness, severity, MedDRA system organ class (SOC), MedDRA preferred term (PT). If a sponsor does not want to provide a particular piece of information or if the information is not relevant due to the particular grouping applied, "NA" for "not applicable" can be used. Table 1 shows the characteristics of the selected trials and AEs that will be captured.
After receipt of the registration sheet, a SAVVY trial ID will be allocated to the trial by the analysis center. The SAVVY ID will be entered into the study characteristics spreadsheet and returned to the sponsor.
Individual trial analysis The analyses of the individual clinical trials will be done at sponsor's site using SAS or R code provided by the analysis center. Therefore, it is not required to release any individual patient data to the analysis center. Only aggregated data, summarizing the results of the analyses, will be shared with the analysis center. The analysis center will not share the aggregated data of one sponsor with any other sponsor. A manual is provided by the analysis center to the sponsor describing what needs to be done after receipt of the SAVVY trial ID and the program code. As a prerequisite, at the sponsors' sites, the individual clinical trials data sets must be brought into a format which allows the application of the provided SAS or R code. The required data structure is simple, similar to that of a standard survival analysis, and shown in Table 2.
For each trial one dataset is needed in which all AE specific data for the selected AEs are set below each other. The different AEs are distinguished by the AE ID, matching the AE ID given in the study characteristics table filled for trial pre-registration. The treatment group ID will be used by the SAS or R code but is not included in the results where experimental and control groups are coded as "A" and "B", respectively. Observations with missing data, negative event times or type of event not in {0, 1, 2, 3} are automatically excluded from the analyses.
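The automatic exclusion rule can be illustrated with a small filter. This is a sketch only, not the code distributed by the analysis center: the record layout (a pair of event time and event type) and the function name `keep_record` are assumptions made for this example.

```python
import math

VALID_TYPES = {0, 1, 2, 3}  # admissible event-type codes named in the text


def keep_record(time, event_type):
    """Return True if a record passes all three exclusion rules."""
    if time is None or event_type is None:
        return False  # missing data
    if isinstance(time, float) and math.isnan(time):
        return False  # missing data coded as NaN
    if time < 0:
        return False  # negative event time
    return event_type in VALID_TYPES  # unknown type of event


# Hypothetical records: (event time, type of event)
records = [(12.0, 1), (-3.0, 0), (None, 2), (5.5, 4), (0.0, 0), (float("nan"), 1)]
clean = [r for r in records if keep_record(*r)]
print(clean)
```

Reporting the number of excluded records alongside the results would make it easier to spot data preparation problems at the sponsor's site.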
The SAVVY trial ID is inserted in the SAS or R code. The program returns the aggregated data summarizing the results of the statistical analysis methods described in Section 4 and the results of some further descriptive analyses on the AE and the competing event, namely mean, median, minimum and maximum event time by type of event and overall, and the total numbers of AEs, competing events, and censored observations, each overall and per treatment group. The dataset containing all results is automatically given the SAVVY trial ID as its name and is sent to the analysis center for further processing.
Table note: Specify the type(s) of the soft competing event(s) as defined in Section 2.2, e.g., end of treatment, withdrawal of consent, progression.
Meta-analysis Once the results of all registered trials are received, the analysis center performs the meta-analyses described in Section 5 of the estimated parameters described in Section 4. The results of the meta-analyses will be presented and discussed within the SAVVY project group without identifying individual trials and sponsors.
We will use two approaches both to quantify length of follow-up and to study its impact on analysing the occurrence of AEs. The first approach is guided by the implicit choice made when calculating incidence proportions. The second approach is guided by the concern that overly small risk sets late in time may lead to unstable probability estimators (Pocock et al., 2002). The first approach is to choose a time $\tau_A$ as the largest observed time (censored or AE or competing event) which was (if observed AE) or could have been (if censored or competing event) an observed AE time in group A. Time point $\tau_B$ is defined analogously for group B. Then one step to account for different lengths of follow-up between groups is to restrict statistical inference to the smaller of the two time points. Hence, let $\tau = \min(\tau_A, \tau_B)$. The motivation behind this approach is two-fold: Firstly, the commonly used incidence proportions are calculated in the complete data set, i.e., for the data available on $[0, \tau_A]$ and $[0, \tau_B]$, respectively. Secondly, group comparisons for time-to-event data are typically restricted to the smaller of these two time intervals only. For instance, the common log-rank test only compares groups as long as the risk sets are non-empty in both groups.
The second approach follows a suggestion of Pocock et al. (2002). It covers a range of choices for quantifying length of follow-up and will depend on the proportion of patients still at risk. So, additionally, choose $\tilde{\tau}_A(p) = \tilde{\tau}_A$ as the time such that $100 \cdot p\%$ of all patients in group A are still at risk just prior to time $\tilde{\tau}_A$ and may have an observed event at time $\tilde{\tau}_A$, $p \in (0, 1)$. To be precise, let $\tilde{\tau}_A$ be the $100 \cdot p\%$-quantile of the usual empirical distribution function $\hat{F}$ of the observed times (irrespective of censoring status),
$$\tilde{\tau}_A = \inf\{t : \hat{F}(t) \geq p\}.$$
Choose $\tilde{\tau}_B$ analogously in group B and let $\tilde{\tau} = \min(\tilde{\tau}_A, \tilde{\tau}_B)$. We will consider $p \in \{0.3, 0.6, 0.9\}$. The relationship between $\tilde{\tau}_A$ and $\tau_A$ is that both time points coincide for the choice of $p = 1$.
We also note that the different choices of $\tau$ account for different time horizons underlying the estimation methods within groups, but they do not account for differential drop-out between groups. It is, e.g., possible that $\tau_A = \tau_B$, but that drop-out rates differ between groups. Such differential drop-out would therefore be potentially treatment group-related, suggesting that drop-outs be handled as competing risks, see Section 2.2.
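The evaluation time points can be computed as in the following sketch. It follows the verbal quantile definition given above and takes the empirical quantile to be the smallest observed time at which the empirical distribution function reaches p; this reading, the function names and the toy data are assumptions made for illustration.

```python
import math


def tau_max(times):
    """Largest observed time in a group (AE, competing event or censoring)."""
    return max(times)


def tau_quantile(times, p):
    """100*p%-quantile of the empirical distribution of observed times,
    taken as the smallest observed t with F_hat(t) >= p (an assumption)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    ordered = sorted(times)
    # Small tolerance guards against floating-point error in p * n.
    rank = max(1, math.ceil(p * len(ordered) - 1e-12))
    return ordered[rank - 1]


# Hypothetical observed times (irrespective of censoring status)
times_a = [2, 4, 5, 7, 8, 10, 11, 13, 15, 20]
times_b = [1, 3, 6, 9, 12, 14, 16, 18, 19, 25]

# Joint time points: the smaller of the two group-specific values
tau = min(tau_max(times_a), tau_max(times_b))
tau_tilde = {
    p: min(tau_quantile(times_a, p), tau_quantile(times_b, p))
    for p in (0.3, 0.6, 0.9)
}
print(tau, tau_tilde)
```

With p = 1 the quantile equals the largest observed time, which mirrors the stated relationship between the two approaches.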

One-sample estimators
Methods are discussed for group A and for $\tau$ by way of example. The estimators of "absolute AE risk" that we will consider fall into three groups (Allignol et al., 2016). Firstly, the incidence proportion accounts for competing risks but not for censoring (Equation (1) below). Secondly, one minus the Kaplan-Meier estimator accounts for censoring but not for competing risks (Equation (4)), and this is also true for a standard conversion of the incidence density to a probability (Equation (3)). Thirdly, the Aalen-Johansen estimator generalizes the Kaplan-Meier estimator to competing risks and will later serve as a benchmark or method of choice for nonparametric estimation of the cumulative AE probability. The connection to the incidence proportion is that both coincide in the absence of censoring. The connection to one minus the Kaplan-Meier estimator is that both coincide in the absence of competing risks. A parametric counterpart (Equation (7)) of the Aalen-Johansen estimator (Equation (8)) based on incidence densities is also considered.
The incidence proportion divides the number of patients with an observed AE on $[0, \tau]$ in group A by the number $n_A$ of patients in group A. More precisely, introduce individual first-AE-counting processes
$$N_i(t), \quad t \geq 0, \quad i = 1, \ldots, n_A,$$
where $N_i(t) = 1$ denotes that an AE has been observed for patient $i$ in the time interval $[0, t]$ and that no competing event has been observed before the AE. Analogously, let
$$\bar{N}_i(t), \quad t \geq 0, \quad i = 1, \ldots, n_A,$$
denote $i$'s counting process of observed competing events. Because we consider time-to-first-event and type-of-first-event, we have that $N_i(t) + \bar{N}_i(t) \leq 1$ and both $N_i(t)$ and $\bar{N}_i(t)$ change their value from 0 to 1 at most once, when a time-to-first-event has been observed. The aggregated processes are
$$N_A(t) = \sum_{i=1}^{n_A} N_i(t), \qquad \bar{N}_A(t) = \sum_{i=1}^{n_A} \bar{N}_i(t).$$
In the absence of censoring, the sum of the two aggregated processes will eventually be equal to $n_A$, but in general we have $N_A(\infty) + \bar{N}_A(\infty) \leq n_A$. The incidence proportion now is
$$\mathrm{IP}_A(\tau) = \frac{N_A(\tau)}{n_A}. \tag{1}$$
The incidence density has the same numerator, but divides by person-time-at-risk.
Again, to be precise, introduce individual at-risk-processes
$$Y_i(t), \quad t \geq 0, \quad i = 1, \ldots, n_A,$$
where, for $t > 0$, $Y_i(t) = 1$ denotes that the patient is still under observation on $[0, t)$ and that neither an AE nor a competing event has happened on $[0, t)$. Note that the at-risk-processes are left-continuous, such that $Y_i(t)$ denotes the at-risk status just prior to time $t$. If $Y_i(t) = 1$, an event may happen and be observed at time $t$. Otherwise, $Y_i(t) = 0$. The incidence density now is
$$\mathrm{ID}_A(\tau) = \frac{N_A(\tau)}{\int_0^{\tau} \sum_{i=1}^{n_A} Y_i(u)\, \mathrm{d}u}. \tag{2}$$
The incidence density is not a probability, but estimates the AE hazard with values in $[0, \infty)$ under a constant hazard assumption. A typical transformation of this estimator onto the probability scale is
$$1 - \exp\left(-\mathrm{ID}_A(\tau) \cdot \tau\right). \tag{3}$$
Assuming a constant AE hazard, estimator (3) estimates the same quantity as one minus the Kaplan-Meier estimator, which only codes observed AEs as an event and censors anything else. To be precise, introduce increments
$$\Delta N_i(t) = N_i(t) - N_i(t-),$$
which equal one if an AE (before any competing event) is observed for patient $i$ at time $t$, and $\Delta N_i(t) = 0$ otherwise. Defining $\Delta \bar{N}_i(t)$ analogously, the increments of the aggregated processes are
$$\Delta N_A(t) = \sum_{i=1}^{n_A} \Delta N_i(t), \qquad \Delta \bar{N}_A(t) = \sum_{i=1}^{n_A} \Delta \bar{N}_i(t).$$
The size of the risk set is
$$Y_A(t) = \sum_{i=1}^{n_A} Y_i(t),$$
and one minus the Kaplan-Meier estimator which only codes observed AEs as an event and censors anything else can be expressed as
$$1 - \hat{S}_A(\tau) = 1 - \prod_{u \leq \tau} \left(1 - \Delta\hat{\Lambda}_A(u)\right). \tag{4}$$
Here, $\Delta\hat{\Lambda}_A(u)$ is the increment of the nonparametric Nelson-Aalen estimator of the cumulative AE hazard
$$\hat{\Lambda}_A(\tau) = \sum_{u \leq \tau} \Delta\hat{\Lambda}_A(u) = \sum_{u \leq \tau} \frac{\Delta N_A(u)}{Y_A(u)}, \tag{5}$$
where the product in (4) and the sum in (5) are over all observed, unique event times $u$. Also note that we are slightly abusing notation in (4), because $\hat{S}_A(\tau)$ is not estimating a proper survival function because of the presence of competing risks. The Nelson-Aalen estimator, however, is a proper estimator of the cumulative AE hazard. Assuming a constant AE hazard, $\hat{\Lambda}_A(\tau)$ and $\mathrm{ID}_A(\tau) \cdot \tau$ estimate the same quantity. Accounting for competing risks now requires acknowledging that there also is a competing hazard.
To begin, we introduce a competing incidence density
$$\overline{\mathrm{ID}}_A(\tau) = \frac{\bar{N}_A(\tau)}{\int_0^{\tau} Y_A(u)\, \mathrm{d}u}. \tag{6}$$
Also using $\mathrm{ID}_A(\tau)$ as defined above, we obtain an estimator of the cumulative AE probability based on incidence densities and accounting for competing risks,
$$\frac{\mathrm{ID}_A(\tau)}{\mathrm{ID}_A(\tau) + \overline{\mathrm{ID}}_A(\tau)} \left(1 - \exp\left(-\tau \left[\mathrm{ID}_A(\tau) + \overline{\mathrm{ID}}_A(\tau)\right]\right)\right). \tag{7}$$
The connection of this estimator to the incidence proportion is that the leading factor on the right hand side of the previous display equals $\mathrm{IP}_A(\tau)$ in the absence of censoring and if $\tau = \tau_A$ (Beyersmann and Schrade, 2017). In words, both of these quantities estimate the anytime-AE-probability in this situation.
In the presence of censoring, quantity (7) estimates the cumulative AE probability assuming all hazards to be constant. The nonparametric counterpart is the Aalen-Johansen estimator of the so-called cumulative incidence function,
$$\mathrm{CIF}_A(\tau) = \sum_{u \leq \tau} \prod_{v < u} \left(1 - \Delta\hat{\Lambda}_A(v) - \Delta\hat{\bar{\Lambda}}_A(v)\right) \Delta\hat{\Lambda}_A(u), \tag{8}$$
where $\Delta\hat{\bar{\Lambda}}_A(v) = \Delta\bar{N}_A(v) / Y_A(v)$ now is the increment of the competing Nelson-Aalen estimator in analogy to $\overline{\mathrm{ID}}_A$. Note that we have again slightly abused notation, writing $\mathrm{CIF}_A(\tau)$, although this quantity is an estimator.
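The one-sample estimators can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SAS or R code from the supplement: the status coding (0 = censored, 1 = AE, 2 = competing event), the function name and the toy data are hypothetical.

```python
import math


def one_sample_estimators(data, tau):
    """AE probability estimators on [0, tau] for one group.

    data: list of (time, status); status 0 = censored, 1 = AE,
    2 = competing event (hypothetical coding for this sketch).
    """
    n = len(data)
    n_ae = sum(1 for t, s in data if s == 1 and t <= tau)
    n_ce = sum(1 for t, s in data if s == 2 and t <= tau)
    person_time = sum(min(t, tau) for t, _ in data)  # integral of Y_A on [0, tau]

    id_ae = n_ae / person_time  # incidence density of the AE
    id_ce = n_ce / person_time  # competing incidence density
    total = id_ae + id_ce
    prob_id_cr = id_ae / total * (1 - math.exp(-tau * total)) if total > 0 else 0.0

    km_surv = 1.0   # Kaplan-Meier product, competing events coded as censored
    all_surv = 1.0  # probability of no event of any type before u
    cif = 0.0       # Aalen-Johansen estimator
    for u in sorted({t for t, s in data if s != 0 and t <= tau}):
        at_risk = sum(1 for t, _ in data if t >= u)  # left-continuous risk set
        d_ae = sum(1 for t, s in data if t == u and s == 1)
        d_ce = sum(1 for t, s in data if t == u and s == 2)
        km_surv *= 1 - d_ae / at_risk
        cif += all_surv * d_ae / at_risk
        all_surv *= 1 - (d_ae + d_ce) / at_risk

    return {
        "incidence_proportion": n_ae / n,
        "prob_incidence_density": 1 - math.exp(-id_ae * tau),
        "one_minus_km": 1 - km_surv,
        "prob_id_competing": prob_id_cr,
        "aalen_johansen": cif,
    }


# Toy data with AEs, competing events and censoring
mixed = [(1, 1), (2, 0), (3, 2), (4, 1), (5, 0), (6, 2), (7, 1), (8, 0)]
est = one_sample_estimators(mixed, tau=8)
for name, value in est.items():
    print(f"{name}: {value:.4f}")
```

On this toy data set the incidence proportion is smaller than the Aalen-Johansen estimate, which in turn is smaller than one minus the Kaplan-Meier estimate. Removing all censored records makes the first two coincide, and removing all competing events makes the last two coincide, in line with the connections described above.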

Two-sample Comparisons
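In principle, many methods of two-sample comparisons are conceivable. We here aim to consider one method that applies to all one-sample estimators and provides a quantification of risk differences and relative risks, where risk here refers to a probability estimator as defined earlier. To this end, assume that we have estimators
$$\hat{q}_A, \quad \widehat{\mathrm{var}}(\hat{q}_A), \quad \hat{q}_B, \quad \widehat{\mathrm{var}}(\hat{q}_B), \tag{9}$$
where $\hat{q}_A$ and $\hat{q}_B$ are probability estimators within groups and $\widehat{\mathrm{var}}(\cdot)$ denotes the corresponding variance estimators. That is, we have one line of values (9) for each time point for the incidence proportion, one line of values for each time point for incidence densities (transformed onto probability scale) etc., for each estimation method discussed in Section 4.2 and for each evaluation time point defined in Section 4.1. Then, we estimate the risk difference by
$$\widehat{\mathrm{RD}} = \hat{q}_A - \hat{q}_B$$
with
$$\widehat{\mathrm{var}}(\widehat{\mathrm{RD}}) = \widehat{\mathrm{var}}(\hat{q}_A) + \widehat{\mathrm{var}}(\hat{q}_B)$$
and approximate 95% confidence interval
$$\widehat{\mathrm{RD}} \pm z_{0.975} \cdot \sqrt{\widehat{\mathrm{var}}(\widehat{\mathrm{RD}})},$$
where $z_{0.975}$ is the 0.975 quantile of a standard normal distribution. Relative risks are estimated by
$$\widehat{\mathrm{RR}} = \frac{\hat{q}_A}{\hat{q}_B},$$
and we base variance estimation and construction of approximate confidence intervals on a log-transformation. So, e.g., using the delta method,
$$\hat{\sigma}^2 = \widehat{\mathrm{var}}(\log \widehat{\mathrm{RR}}) = \frac{\widehat{\mathrm{var}}(\hat{q}_A)}{\hat{q}_A^2} + \frac{\widehat{\mathrm{var}}(\hat{q}_B)}{\hat{q}_B^2}.$$
This leads to the backtransformed approximate 95% confidence interval $\widehat{\mathrm{RR}} \cdot \exp\left(\pm z_{0.975} \cdot \hat{\sigma}\right)$.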
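A minimal sketch of these two-sample comparisons is given below. It assumes that the two group-specific estimators are independent, so that variances add, and that both probability estimates are strictly positive for the relative risk. Function names and input values are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # 0.975 quantile of the standard normal


def risk_difference(q_a, var_a, q_b, var_b):
    """Risk difference with an approximate 95% confidence interval."""
    rd = q_a - q_b
    se = math.sqrt(var_a + var_b)  # independent groups assumed
    return rd, (rd - Z * se, rd + Z * se)


def relative_risk(q_a, var_a, q_b, var_b):
    """Relative risk with a log-scale (delta method) 95% confidence interval."""
    rr = q_a / q_b
    se_log = math.sqrt(var_a / q_a**2 + var_b / q_b**2)
    return rr, (rr * math.exp(-Z * se_log), rr * math.exp(Z * se_log))


# Hypothetical probability estimates and variance estimates per group
rd, rd_ci = risk_difference(0.30, 0.0021, 0.20, 0.0016)
rr, rr_ci = relative_risk(0.30, 0.0021, 0.20, 0.0016)
print(f"RD = {rd:.3f}, 95% CI ({rd_ci[0]:.3f}, {rd_ci[1]:.3f})")
print(f"RR = {rr:.3f}, 95% CI ({rr_ci[0]:.3f}, {rr_ci[1]:.3f})")
```

Because the interval for the relative risk is built on the log scale, it is symmetric around the point estimate only after taking logarithms.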
The primary comparison of methods will be based on probability estimators as explained. However, because of the omnipresence of hazard ratios for group comparisons, we will also consider comparisons on the hazard scale as detailed below:
• An estimated hazard ratio (output from standard Cox software) only coding AEs as "observed event", together with an estimator of its variance and a confidence interval for the hazard ratio.
• Ditto for the competing event, now only coding competing events as "observed event". One rationale here is to check whether relevant signals on the hazard scale would have been missed by ignoring competing risks. We reiterate that, as for all competing event analyses, this will be done with the hard and the soft definition given in Section 2.2.
• Ratios of incidence densities for AEs, with variance estimation analogous to above. The rationale here is to check whether the simple constant hazard framework, although potentially misspecified, leads to a reasonable approximation of the hazard ratio estimated in semi-parametric fashion.
• Ratios of incidence densities for competing events.
• Ratios of Nelson-Aalen estimators for AEs, with variance estimation analogous to above. The rationale here is to compare the usual hazard ratio estimator not only with a very simple parametric counterpart, but also with a fully non-parametric competitor. Under a proportional hazards assumption, the ratio of Nelson-Aalen estimators also estimates the hazard ratio, but not under non-proportional hazards (Andersen, 1983).
• Ditto for the competing event.

Assessment of differences of estimators
The estimators in (9) and the derived information on estimated risk differences and risk ratios are of standard input form for a meta-analysis. However, the aim is a methodological comparison of different methods for quantifying AE risk when applied to the very same data. To this end, the information on variances so far does not suffice. What we need is an estimator of the variance of the difference between, e.g., the incidence proportion and the Aalen-Johansen estimator when calculated on the same data set. As is obvious from the formulae in Subsection 4.2, such estimators will in general be dependent. Closed form variance estimators might be obtained using the functional delta method (Gill and Johansen, 1990), but one would need to derive and implement such estimators for every single methodological comparison. We have therefore decided to follow the advice of Andersen et al. (2012) and to resort to bootstrap variances, drawing with replacement from the individual patients under an i.i.d. set-up. We also note that in a meta-analysis of published data on the overestimation of the cumulative incidence of revision arthroplasty using a Kaplan-Meier-type estimator, Lacny et al. (2015) used common approximations of the estimated variance of the hazard ratio for this purpose. However, this approach estimates a different variance as the correlation structure is not accounted for.
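The bootstrap idea can be sketched as follows: patients are resampled with replacement, both estimators are recomputed on each resample, and the empirical variance of their difference is taken. The sketch is illustrative only. The status coding (0 = censored, 1 = AE, 2 = competing event), the number of bootstrap replications and the seed are assumptions, and other contrasts such as a log-ratio could be substituted for the simple difference.

```python
import random
import statistics


def incidence_proportion(data, tau):
    """Number of observed AEs on [0, tau] divided by the group size."""
    return sum(1 for t, s in data if s == 1 and t <= tau) / len(data)


def aalen_johansen(data, tau):
    """Aalen-Johansen estimate of the cumulative AE probability at tau."""
    cif, surv = 0.0, 1.0
    for u in sorted({t for t, s in data if s != 0 and t <= tau}):
        at_risk = sum(1 for t, _ in data if t >= u)
        d_ae = sum(1 for t, s in data if t == u and s == 1)
        d_all = sum(1 for t, s in data if t == u and s != 0)
        cif += surv * d_ae / at_risk
        surv *= 1 - d_all / at_risk
    return cif


def bootstrap_var_of_difference(data, tau, n_boot=200, seed=1):
    """Bootstrap variance of (Aalen-Johansen - incidence proportion)."""
    rng = random.Random(seed)
    n = len(data)
    diffs = []
    for _ in range(n_boot):
        # Resample whole patients so the dependence between both
        # estimators on the same data is preserved.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        diffs.append(aalen_johansen(sample, tau) - incidence_proportion(sample, tau))
    return statistics.variance(diffs)


# Hypothetical data: 30 patients, status cycles through 1, 2, 0
data = [(float(i), i % 3) for i in range(1, 31)]
var_diff = bootstrap_var_of_difference(data, tau=30.0)
print(f"bootstrap variance of the difference: {var_diff:.6f}")
```

Resampling patients rather than estimates is what makes the resulting variance reflect the correlation between the two estimators.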

Implementation
The estimators displayed in Sections 4.2 and 4.3 are readily available in the statistical software SAS (SAS Institute, Cary, NC, US) and R (R Core Team, 2018). The estimators of the incidence proportion and the two estimators based on the incidence density can be implemented directly from the formulae. In SAS, proc lifetest calculates one minus the Kaplan-Meier estimator. In R it can either be obtained by the survfit function of the survival package (Therneau and Grambsch, 2000) or, as it is a special case of a competing risk setting, the etm function of the identically named package can also be used to calculate both one minus the Kaplan-Meier estimator and the Aalen-Johansen estimator. Depending on which SAS version is used, the Aalen-Johansen estimator can be computed either by the predefined %CIF macro or, in newer versions of SAS, by proc lifetest with the event of interest specified in the option failcode. The first part of the two-sample comparisons, the risk differences and relative risks with corresponding variances, can be directly calculated by implementing the formulae. The Cox model and therefore the estimated hazard ratio may be obtained by the use of proc phreg in SAS and with the function coxph in R. In order to estimate (event-specific) hazard ratios, e.g., for AE, a well known coding method is to also code observed competing events as "censored". A brief look at the simple incidence densities illustrates correctness of this method for analyzing hazards. The estimator of the cumulative AE probability in (7) demonstrates that all hazards enter probability calculations and hence, the "code as censored" approach is only available on the hazard level. In both software packages, the event of interest can easily be switched, such that the hazard analysis of the competing event can be obtained just as easily.
The ratios of the incidence densities for the adverse as well as for the competing event are easily calculated once the incidence densities for the one-sample estimators have been saved. The proc lifetest with the nelson option gives the Nelson-Aalen estimator. Moreover, the mvna function of the mvna package (Allignol et al., 2008) returns this estimator in R.
SAS macro code and the corresponding function in R are available as supplementary material. The main macro code has been written in SAS 9.4 software and checked in R by one of the authors (RS). It has subsequently been checked in a small scale pilot study (VJ, KR, CS).

Meta-analysis
Once the trial level data have been analysed using the methods described in Section 4, results will be summarized across trials using the approaches listed below. Whereas individual trial data analyses will be run within the sponsor company, meta-analyses will be run on the calculated probability and hazard (ratio) estimates centrally at the analysis center, i.e. the institutions of the academic project group members. Table 3 gives an overview of the planned method comparisons. We will distinguish between the two types of the competing events introduced in Section 2.2, i.e., all comparisons will be performed for both types of competing events (death-only and all-events). Thereby, the main interest is in the 'all-events' competing event. Especially, the comparison of Aalen-Johansen estimators based on the different competing event definitions is of interest. The two Aalen-Johansen estimators will be compared for the AE as well as for the competing event. Moreover, the comparisons in terms of hazard ratios will be conducted for the AE and for the competing event. All comparisons are conducted at the five follow-up times as defined in Section 4.1 and not only at the final time-point.

Assessment of bias
The bias of the estimators will be assessed by comparison with the benchmark estimator. This will be assessed visually using Bland-Altman plots of the AE probability, the risk difference and the (log) relative risks (Altman and Bland, 1983). As we consider a comparison of an estimator to a benchmark, the benchmark estimator is plotted on the x-axis instead of the mean (Krouwer, 2008). With these plots, both

Frequency categories
For the one-sample estimators, the possible change in frequency categories depending on estimation method will also be investigated. According to the European Commission's guideline on summary of product characteristics (SmPC) (EMA, 2009), AE frequencies are classified as 'very rare', 'rare', 'uncommon', 'common' and 'very common' when found to be < 0.01%, 0.01% to < 0.1%, 0.1% to < 1%, 1% to < 10%, and ≥ 10%, respectively. Frequency categories obtained with the estimators will be compared to frequency categories obtained with the benchmark estimator, i.e., the Aalen-Johansen estimator. The conclusions about the therapies' safety derived from the two-sample comparisons of the various approaches will be compared, in frequency tables, with those of the Aalen-Johansen approach as benchmark in terms of statistical significance, clinical relevance and benefit assessment criteria (IQWiG, 2017; Kieser and Hauschke, 2005).
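The classification above can be sketched in a few lines of Python. The function names and the two example probabilities are hypothetical and serve only to show how an estimate close to a threshold may change category depending on the estimation method.

```python
def frequency_category(probability):
    """Map an estimated AE probability (between 0 and 1) to its SmPC category."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= 0.10:      # >= 10%
        return "very common"
    if probability >= 0.01:      # 1% to < 10%
        return "common"
    if probability >= 0.001:     # 0.1% to < 1%
        return "uncommon"
    if probability >= 0.0001:    # 0.01% to < 0.1%
        return "rare"
    return "very rare"           # < 0.01%


def category_changed(estimate, benchmark):
    """True if an estimate falls into a different category than the benchmark."""
    return frequency_category(estimate) != frequency_category(benchmark)


# Made-up numbers: an estimate of 9.5% against a benchmark of 12%.
print(frequency_category(0.095), "vs", frequency_category(0.12))
print("category changed:", category_changed(0.095, 0.12))
```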

Assessment of precision
To assess precision, the standard errors or the widths of the confidence intervals of the estimators will be compared with those of the benchmark. This is done by plotting the ratios of the standard errors for the methods with at most small to moderate bias, since the consideration of precision is deemed useful only in the absence of any substantial bias.

Random effects meta-analysis and meta-regression
A more formal assessment of the differences between estimators, and of possible factors influencing these, is carried out in the form of random effects meta-analyses and meta-regressions. These will model the ratios of the estimators considered in Section 5.1 (i.e., respective estimator divided by benchmark). The standard errors of these ratios will be needed for the meta-analysis. As noted in Section 4.4, the derivation of these standard errors is complicated by the dependence of the estimators. Therefore, they will be obtained with a bootstrap, see Section 4.4.
To be more precise, the estimator of the log-ratio, $\hat\theta_k = \log(\text{estimator}/\text{benchmark})$, is observed with bootstrapped variance $\hat\sigma_k^2$ for each adverse event $k = 1, \ldots, K$. Then a normal-normal hierarchical model (NNHM) (Hedges and Olkin, 2014) of the form $\hat\theta_k \mid \theta_k \sim N(\theta_k, \sigma_k^2)$ with $\theta_k \mid \theta, \rho \sim N(\theta, \rho^2)$, $k = 1, \ldots, K$, is fitted, with $\rho^2$ denoting the between-AE heterogeneity. Thereby, between adverse events variability is introduced via $\theta_k = \theta + \epsilon_k$ with $\epsilon_k \sim N(0, \rho^2)$. As the main interest is in the mean parameter $\theta$, the marginal model $\hat\theta_k \mid \theta \sim N(\theta, \sigma_k^2 + \rho^2)$, $k = 1, \ldots, K$, will be used. The between adverse event variability $\rho^2$ is estimated by the Paule-Mandel estimator as recommended by Veroniki et al. (2016). As we are also interested in exploring any heterogeneity identified, possible sources of heterogeneity are assessed using meta-regression models including, for example, the frequency of the adverse event and competing event recording over time.
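A minimal Python sketch of the marginal model and the Paule-Mandel estimator may help to fix ideas. It assumes the usual characterisation of the Paule-Mandel estimate as the value of ρ² at which the generalised Q statistic equals K − 1 (truncated at zero), and solves for it by bisection. The function names and the input values are made up for illustration and do not come from the study.

```python
import math


def pooled_mean(theta_hat, var_hat, rho2):
    """Inverse-variance weighted mean under the marginal model, with its SE."""
    weights = [1.0 / (v + rho2) for v in var_hat]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, theta_hat)) / total
    return mean, math.sqrt(1.0 / total)


def generalised_q(theta_hat, var_hat, rho2):
    """Generalised Q statistic; decreasing in rho2."""
    mean, _ = pooled_mean(theta_hat, var_hat, rho2)
    return sum((t - mean) ** 2 / (v + rho2) for t, v in zip(theta_hat, var_hat))


def paule_mandel(theta_hat, var_hat, tol=1e-10, max_iter=200):
    """Solve Q(rho2) = K - 1 for rho2 >= 0 by bisection."""
    target = len(theta_hat) - 1
    if generalised_q(theta_hat, var_hat, 0.0) <= target:
        return 0.0  # no excess heterogeneity: estimate truncated at zero
    lo, hi = 0.0, 1.0
    while generalised_q(theta_hat, var_hat, hi) > target:
        hi *= 2.0  # expand the bracket until Q drops below the target
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if generalised_q(theta_hat, var_hat, mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Made-up log-ratios and bootstrapped variances for K = 4 adverse events.
example_theta = [0.1, 0.5, -0.2, 0.8]
example_var = [0.01, 0.02, 0.015, 0.01]

rho2_hat = paule_mandel(example_theta, example_var)
theta_mean, theta_se = pooled_mean(example_theta, example_var, rho2_hat)
print(f"rho^2 = {rho2_hat:.4f}, theta = {theta_mean:.4f} (SE {theta_se:.4f})")
```

In practice an established meta-analysis package would normally be preferred to hand-written code; the sketch only makes explicit how the weights 1/(σ_k² + ρ²) enter both the heterogeneity estimate and the pooled mean.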
The meta-analysis and meta-regression will first be performed on AE level and not on study level, i.e., AEs of the same study will be assumed to be independent. In a next step, as the structure of these data is more complex than in standard meta-analyses, additional hierarchy levels will potentially be considered. In the NNHM described above, it is assumed that it is sufficient to model the between-AE heterogeneity. However, it might be necessary to consider in addition, for instance, heterogeneity between studies or indications. Therefore, random effects not only for AE but also for study or indication are considered in subsequent analyses to explore whether additional hierarchy levels improve model fit.
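One possible way to write such an extension, given here only as a sketch, adds a study-level random effect to the NNHM. The index $s$ for studies, the random effect $u_s$ and the variance $\tau^2$ are notation introduced for this illustration and do not appear elsewhere in the paper; sampling errors of AEs from the same study are assumed to be independent, which is itself a simplification.

```latex
\begin{align*}
\hat\theta_{sk} \mid \theta_{sk} &\sim N(\theta_{sk}, \sigma_{sk}^2), \\
\theta_{sk} &= \theta + u_s + \epsilon_{sk}, \qquad
u_s \sim N(0, \tau^2), \quad \epsilon_{sk} \sim N(0, \rho^2), \\
\operatorname{Var}(\hat\theta_{sk} \mid \theta) &= \sigma_{sk}^2 + \rho^2 + \tau^2, \qquad
\operatorname{Cov}(\hat\theta_{sk}, \hat\theta_{sk'} \mid \theta) = \tau^2 \quad (k \neq k').
\end{align*}
```

Setting $\tau^2 = 0$ recovers the AE-level model described above, so comparing the two fits is one way to explore whether the additional level is needed.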

Discussion
We have presented the rationale and statistical concept of the empirical, meta-analytical SAVVY study which is presently ongoing. The study aims to investigate the impact of commonly used methods to quantify AE incidence which fall short of accounting for both varying follow-up times and competing risks.
The study described in this paper considers time-to-first-AE only, but not recurrent AEs. The reasons are four-fold. Firstly, we kept in mind the ultimate goal of safety evaluation in drug development, i.e., accurately informing the adverse drug reaction section of the product label by providing the most relevant frequency category for the SmPC (Summary of Product Characteristics) or frequency for the US PI (US prescribing information). Secondly, the incidence proportion is only meaningful as an estimator of absolute AE risk for first AEs, but not for recurrent ones. The incidence density could be computed for recurrent events in a meaningful way, but the assumption of a constant AE hazard would then be an even more restrictive parametric model (Windeler and Lange, 1995). Thirdly, censoring, varying follow-up times and competing events will be no less important when AEs can be recurrent. For general recurrent events analyses, this has only very recently been re-emphasized by Andersen et al. (2019). Fourthly, in a time-to-first-event analysis, the absolute AE risk or cumulative AE probability, non-parametrically estimated by Aalen-Johansen, is a natural target quantity or estimand. In a recurrent events setting, the options for statistical modelling become more complex, because intermediate AEs will, in general, impact the incidence of subsequent AEs. One consequence is that in the time-to-first-event setting, the absolute AE risk can be expressed via fully conditional intensities, while the choice between fully conditional and marginal approaches becomes a more pressing question when AEs are recurrent, see again Andersen et al. (2019). It is our intention that the SAVVY project will, in the future, also investigate the analysis of recurrent AEs, and, to begin, such investigations shall be informed by the results of the study described in the present paper.
The current investigations within the SAVVY project focus on analyses of individual studies. In practice, these analyses would be integrated across trials. Particular problems arise when only a small number of trials is combined in a random effects meta-analysis (Bender et al., 2018) or when the events are rare (Günhan et al., 2019). Furthermore, strategies for signal detection in safety analyses are not considered here either, but are the subject of ongoing research (see, e.g., Gould, 2018).