The electronic data generally captured for signal refinement by systems like Mini-Sentinel are primarily administrative and claims based, collected by health plans during the course of routine healthcare practice. Mini-Sentinel uses a distributed data system, in which individual-level data, standardized using a common data model, remain at the local site. For this paper, we will assume that distributed programs summarize event and sample size counts at each site, stratified by exposure group and by confounders, and that these results are then aggregated across sites for analysis. Although in some cases analyses may be based on individual-level data, more often, to protect patient privacy, deidentified information is combined for central analysis; thus, the focus of this discussion remains on aggregate data.

### Data specifications and notation

We assume that accruing data will be analyzed at specific time points (*t* = 1,…,T). We also assume that each individual *i* is either exposed to the MPI, *D*_{i} = 1, or not exposed, *D*_{i} = 0, and either has the outcome of interest occurring before the end of analysis *t*, *Y*_{i}(*t*) = 1, or does not, *Y*_{i}(*t*) = 0. The exposure time, *E*_{i}(*t*), denotes the cumulative exposure time prior to analysis *t*. It could reflect a single time exposure window (e.g., vaccine: *E*_{i}(*t*) = 1 for all individuals) or a chronic exposure (time on either the MPI or the comparator), for which assumptions about the relationship between exposure time and outcome must be made (constant risk, or risk changing with exposure duration). For this manuscript, we censor a participant's exposure time at the date of disenrollment, occurrence of the outcome, or discontinuation of use of the initial prescribed treatment. In the case of discontinuation, we add a certain lag time to allow recognition of outcomes that could biologically be related to the exposure (e.g., 7 days after discontinuation of treatment for the outcome of seizure, because outcomes more than 7 days after discontinuation are unlikely to be related to treatment). Furthermore, participants are censored if they switch exposure groups and begin taking the other medical product (i.e., an exposed individual starts taking the comparator medical product). A lag time also may be added after the date of switching exposures. These design features are consistent with the incident user cohort study design currently in common use in postmarket surveillance.[8]
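As a concrete illustration of these censoring rules, the following is a minimal sketch; the function name, the encoding of dates as days from the index date, and the default 7-day lag applied after discontinuation or switching are all assumptions for illustration:

```python
def censor_date(disenroll, outcome, discontinuation, switch, lag_days=7):
    """Illustrative exposure-time censoring under the incident user design.

    Each argument is a day count from the index date, or None if that event
    never occurred. Censoring occurs at the earliest of: disenrollment,
    outcome occurrence, discontinuation + lag, or exposure switch + lag.
    """
    candidates = [d for d in (disenroll, outcome) if d is not None]
    # a lag is added after discontinuation or switching to capture outcomes
    # that could still be biologically related to the exposure
    candidates += [d + lag_days for d in (discontinuation, switch) if d is not None]
    return min(candidates) if candidates else None
```

For example, a participant who discontinues treatment on day 30 while remaining enrolled through day 100 contributes exposure time through day 37 under a 7-day lag.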

Furthermore, we assume that there is a set of baseline confounders, *Z*_{i}, associated with individual *i*, which can be composed of variables such as age, sex, site, and health conditions. When using aggregate data, these confounders often are categorized to form a set of categorical confounders, *Z*_{i}^{c}. For example, a continuous confounder, such as age, can be categorized into 5- or 10-year age groups. Under this data setup, confounding can be addressed by regression, stratification, or matching.

### Sequential testing framework

In a signal refinement evaluation, the overall hypothesis of interest is whether there is a higher event rate for those on the MPI (*D*_{i} = 1) compared with the unexposed group (*D*_{i} = 0) after accounting for confounding and exposure time. Numerous test statistics (based on the relative risk (RR) or hazard ratio, for example) can be derived to evaluate this hypothesis, thus creating different statistical methods. The chosen hypothesis is tested at each analysis *t*, and if the test statistic at analysis *t* exceeds a pre-defined critical boundary, *c(t)*, it signals a significantly elevated rate of events at analysis *t*; otherwise, the study continues to the next analysis time until the pre-defined end of the evaluation. At each analysis, more new information accumulates, which may include new participants exposed and unexposed to the MPI since the last analysis, as well as more follow-up or exposure time for participants already included in the previous analysis. Different approaches to incorporating updated data induce different assumptions that need to be accounted for in the calculation of the critical boundary. The critical boundary can be chosen in numerous ways, but it must maintain the overall type I error rate across all analyses, taking into account both multiple testing and a skewed testing distribution that conditions on whether earlier test statistics exceeded the specified critical value at previous analysis times. A general review of sequential monitoring boundaries has been presented by Emerson *et al*.[9] and is beyond the scope of this paper, but we will present approaches specific to the observational surveillance setting and one general method used in randomized clinical trials that is applicable to this area.

### Group sequential statistical methods

#### Lan–Demets group sequential approach using error spending

The first method we consider is a general group sequential method, used mainly in randomized clinical trials, developed by Lan and Demets[10] using an error spending approach. An error spending approach uses the concept of cumulative alpha or type I error, *α*(*t*), defined as the cumulative amount of type I error spent at analysis *t* and all previous analyses, 1,…,*t*-1. We assume that 0 < *α*(1) ≤ ∙∙∙ ≤ *α*(T) = *α*, where *α* is the overall type I error to be spent across the evaluation period. The function *α*(*t*) can be any increasing monotonic function that preserves family-wise error, but there are several common approaches, including the Pocock[11] boundary function $\alpha(t) = \alpha \log\left(1 + (e - 1)t/T\right)$, the O'Brien–Fleming[12] boundary function $\alpha(t) = 2\left(1 - \Phi\left(z_{1-\alpha/2}/\sqrt{t/T}\right)\right)$, and the general power boundary function $\alpha(t) = (t/T)^{p}\alpha$ for *p* > 0. The most commonly used boundary function for safety evaluations has been a flat, Pocock-like, boundary on a standardized test statistic scale. This boundary spends *α* approximately evenly across analyses, given that the test statistic is asymptotically normally distributed. Therefore, relative to an O'Brien–Fleming boundary, which is commonly used in efficacy studies, it spends more *α* at earlier analyses given the amount of statistical information, or sample size, observed up to time *t*. Although this flat boundary has been described as Pocock-like, a true Pocock boundary is not completely flat when testing frequently (quarterly or more often). For further discussion of boundary shapes and the statistical trade-offs between them in practice for postmarket surveillance, see Nelson *et al*.[13]
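The three spending functions above can be sketched directly. This is a minimal illustration: the helper name `alpha_spent` is hypothetical, and *t*/*T* is used as the information fraction:

```python
import math
from statistics import NormalDist

def alpha_spent(t, T, alpha=0.05, shape="pocock", p=1.0):
    """Cumulative type I error alpha(t) spent through analysis t of T.

    Shapes follow the boundary functions in the text:
      "pocock": alpha * log(1 + (e - 1) * t/T)
      "obf":    2 * (1 - Phi(z_{1-alpha/2} / sqrt(t/T)))
      "power":  alpha * (t/T) ** p
    """
    frac = t / T
    if shape == "pocock":
        return alpha * math.log(1 + (math.e - 1) * frac)
    if shape == "obf":
        z = NormalDist().inv_cdf(1 - alpha / 2)
        return 2 * (1 - NormalDist().cdf(z / math.sqrt(frac)))
    if shape == "power":
        return alpha * frac ** p
    raise ValueError(f"unknown shape: {shape}")
```

Each function is increasing in *t* and spends exactly *α* at the final analysis, *t* = T, as required of a valid spending function.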

Given the error spending boundary function, Lan and Demets developed an asymptotic conditional sequential monitoring boundary for any asymptotically normal test statistic based on independent increments of data.[10] This boundary can be computed and compared with almost any standardized test statistic, including one that controls for confounding. For example, when interest is in an adjusted RR, $\hat{R}(t)$, or log RR, it can be estimated using Poisson regression, and a standardized test statistic can be calculated, $Zval(t) = \hat{R}(t)/\sqrt{\widehat{\mathrm{Var}}(\hat{R}(t))}$. The value of *Zval(t)* can then be compared with the asymptotic conditional monitoring boundary developed by Lan and Demets,[10] resulting in a decision to stop if *Zval(t)* exceeds the monitoring boundary or to continue collecting additional data. This is an appealing approach because the boundary is very simple to calculate and relies on a well-defined asymptotic distribution. However, in practice, with rare events and frequent testing (a small amount of new information between analyses), the asymptotic properties of the boundary fail to hold. This is similar to the scenario in which an exact test may be preferred to an asymptotically normal test when the sample size is small. The following methods have sought to address the shortcomings of this approach to allow for more precise statistical performance in a wider variety of settings.
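As an illustrative sketch of such a standardized statistic, the following computes a crude (unadjusted, unlike the regression-adjusted RR in the text) Z statistic for a rate ratio from event counts and person-time; the variance approximation 1/Y₁ + 1/Y₀ for the log rate ratio is a standard Poisson result, and the function name is hypothetical:

```python
import math

def rate_ratio_z(y_exp, pt_exp, y_unexp, pt_unexp):
    """Standardized Z statistic for the log rate ratio (crude sketch).

    y_* are event counts and pt_* person-time for the exposed and
    unexposed groups; Var(log RR) is approximated by 1/y_exp + 1/y_unexp.
    """
    log_rr = math.log((y_exp / pt_exp) / (y_unexp / pt_unexp))
    se = math.sqrt(1 / y_exp + 1 / y_unexp)
    return log_rr / se

# in a sequential evaluation, one would signal at analysis t if
# rate_ratio_z(...) exceeds the Lan-Demets monitoring boundary c(t)
```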

#### Group sequential likelihood ratio test

The group sequential likelihood ratio test (LRT) approach is a method that has been used in the Vaccine Safety Datalink project to monitor vaccine safety for a single time vaccine exposure.[3, 6, 7, 14] The approach uses exposure matching with a fixed matching ratio (1:M) to control for confounding and then computes an LRT statistic. The most commonly used method is the binomial maxSPRT,[14] which assumes continuous monitoring (i.e., after each matched set of exposed and unexposed individuals comes into the dataset, the test statistic is compared with the monitoring boundary).

Specifically, for the maxSPRT method, one creates matched exposure strata, *s* (*s* = 1,…,S), such that each exposed individual, with *D*_{s1} = 1, is matched to one or more unexposed individuals (*D*_{s2} = 0,…,*D*_{s(M+1)} = 0) who have the same categorical confounders, *Z*_{i}^{c}. Then, the log LRT statistic at each analysis, *t*, is the following:

$$LLR_1(t) = \log\frac{\left(\frac{Y_{D=1}(t)}{Y(t)}\right)^{Y_{D=1}(t)}\left(\frac{Y_{D=0}(t)}{Y(t)}\right)^{Y_{D=0}(t)}}{\left(\frac{1}{M+1}\right)^{Y_{D=1}(t)}\left(\frac{M}{M+1}\right)^{Y_{D=0}(t)}},$$

where $Y_{D=1}(t) = \sum_{s=1}^{S(t)}\sum_{j=1}^{M+1} Y_{sj}D_{sj}$ and $Y_{D=0}(t) = \sum_{s=1}^{S(t)}\sum_{j=1}^{M+1} Y_{sj}(1-D_{sj})$ are the number of events observed among those exposed and unexposed to the MPI up to time *t*, respectively, and *Y*(*t*) = *Y*_{D = 1}(*t*) + *Y*_{D = 0}(*t*) is the total number of events up to time *t*. Note that *S(t)* is the number of strata up to time *t*, which also is the number of exposed participants because we are assuming a fixed matching ratio of 1:M. This particular LRT, which conditions on the total number of events, *Y*(*t*), is designed for the rare event case in which only one event is expected to be observed per exposure stratum. One can think of this LRT as comparing the observed proportion of exposed (and unexposed) events out of the total number of events with the expected proportion under the null, which is just 1/(M + 1) for the exposed participants and M/(M + 1) for the unexposed participants.
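A minimal sketch of this conditional LRT follows. The one-sided convention of returning 0 when the exposed share of events is at or below its null value 1/(M + 1) reflects the usual maxSPRT formulation, and the name `llr1` is hypothetical:

```python
import math

def llr1(y_exp, y_unexp, M):
    """Conditional binomial maxSPRT log likelihood ratio, LLR_1(t).

    Compares the observed split of y_exp + y_unexp events between exposed
    and unexposed members of 1:M matched strata with the null split
    1/(M+1) vs M/(M+1). Uses the convention 0*log(0) = 0, and returns 0
    (one-sided test) when the exposed proportion is at or below the null.
    """
    y = y_exp + y_unexp
    if y == 0 or y_exp * (M + 1) <= y:
        return 0.0
    def xlogx(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return (xlogx(y_exp, y) + xlogx(y_unexp, y)
            - y_exp * math.log(1 / (M + 1))
            - y_unexp * math.log(M / (M + 1)))
```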

However, when events are not extremely rare, or when the probability of more than one event occurring within a stratum is not small, the assumptions of this LRT are violated, and a more general two-sample binomial likelihood ratio test statistic should be used:

$$LLR_2(t) = \log\frac{\left(\frac{Y_{D=1}(t)}{N_{D=1}(t)}\right)^{Y_{D=1}(t)}\left(1-\frac{Y_{D=1}(t)}{N_{D=1}(t)}\right)^{N_{D=1}(t)-Y_{D=1}(t)}\left(\frac{Y_{D=0}(t)}{N_{D=0}(t)}\right)^{Y_{D=0}(t)}\left(1-\frac{Y_{D=0}(t)}{N_{D=0}(t)}\right)^{N_{D=0}(t)-Y_{D=0}(t)}}{\left(\frac{Y(t)}{N(t)}\right)^{Y(t)}\left(1-\frac{Y(t)}{N(t)}\right)^{N(t)-Y(t)}},$$

where $N_{D=1}(t) = \sum_{s=1}^{S(t)}\sum_{j=1}^{M+1} D_{sj} = S(t)$ and $N_{D=0}(t) = \sum_{s=1}^{S(t)}\sum_{j=1}^{M+1} (1-D_{sj}) = M \times S(t)$ are the number of people exposed and unexposed to the medical product up to time *t*, respectively, and *N*(*t*) = *N*_{D = 1}(*t*) + *N*_{D = 0}(*t*) is the total sample size up to time *t*. Note that this general LRT incorporates the total sample size, unlike the binomial maxSPRT LRT, which is conditional on the total number of events. For rare events, the performance of the two LRTs is similar. Further evaluation needs to be conducted to establish the scenarios in which each LRT has better statistical properties.
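The unconditional statistic can be sketched the same way, again with a one-sided convention (return 0 unless the exposed event proportion exceeds the pooled proportion); the name `llr2` is hypothetical:

```python
import math

def llr2(y_exp, n_exp, y_unexp, n_unexp):
    """Unconditional two-sample binomial log likelihood ratio, LLR_2(t).

    y_* are event counts and n_* sample sizes for the exposed and
    unexposed groups up to time t. Returns 0 (one-sided test) unless the
    exposed event proportion exceeds the pooled proportion.
    """
    y, n = y_exp + y_unexp, n_exp + n_unexp
    if y == 0 or y_exp * n <= y * n_exp:
        return 0.0
    def binom_ll(yy, nn):
        # binomial log-likelihood at the MLE p = yy/nn, with 0*log(0) = 0
        ll = 0.0
        if yy > 0:
            ll += yy * math.log(yy / nn)
        if yy < nn:
            ll += (nn - yy) * math.log(1 - yy / nn)
        return ll
    return binom_ll(y_exp, n_exp) + binom_ll(y_unexp, n_unexp) - binom_ll(y, n)
```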

For the binomial maxSPRT, a Pocock-like boundary has been proposed, c(*t*) = *a*, which is a flat boundary on the log LRT statistic. One common way to solve for the constant, *a*, uses an iterative simulation approach similar to the following:

- Step 1: Simulate data assuming H_{o} and the observed event rate while controlling for confounding (i.e., using a permutation approach: fix *Y*_{s1},…,*Y*_{s(M+1)} (*s* = 1,…,S), and permute *D*_{s1},…,*D*_{s(M+1)} to create *D*_{s1}^{*},…,*D*_{s(M+1)}^{*}, so that the exposure strata relationships are held fixed, thus controlling for confounding).
- Step 2: Calculate LLR(*t*) on the simulated dataset.
- Step 3: If LLR(*t*) ≥ *a*, then set *Signal*_{k} = 1 and stop the loop; otherwise, continue to the next analysis, *t* + 1.
- Step 4: If *t* = T and no signal has occurred, then set *Signal*_{k} = 0.

This process is repeated a large number, *Nsim*, of times, and the estimated *α* level for the boundary is calculated as $\hat{\alpha} = \sum_{k=1}^{Nsim} Signal_k / Nsim$. One solves for *a* by repeating the simulation and changing *a* until $\hat{\alpha} = \alpha$.
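The calibration loop above can be sketched as follows, under a simplifying rare-event assumption: each event in a stratum is reassigned to the exposed member with probability 1/(M + 1), which matches a within-stratum permutation exactly when strata have at most one event. The function name and input layout are illustrative:

```python
import math
import random

def estimate_alpha(a, strata_events_by_analysis, M, nsim=2000, seed=0):
    """Estimate the type I error of a flat boundary c(t) = a by simulation.

    strata_events_by_analysis[t] lists, for each matched stratum entering
    at analysis t, the total number of events in that stratum. Under H_o,
    each event is assigned to the exposed member with probability 1/(M+1).
    Returns alpha-hat; solve for a by adjusting until alpha-hat = alpha.
    """
    rng = random.Random(seed)
    def llr(y1, y0):
        # conditional binomial maxSPRT LLR (one-sided, 0*log(0) = 0)
        y = y1 + y0
        if y == 0 or y1 * (M + 1) <= y:
            return 0.0
        t1 = y1 * math.log(y1 / y) if y1 else 0.0
        t0 = y0 * math.log(y0 / y) if y0 else 0.0
        return t1 + t0 - y1 * math.log(1 / (M + 1)) - y0 * math.log(M / (M + 1))
    signals = 0
    for _ in range(nsim):                          # one realization per loop
        y1 = y0 = 0
        for strata in strata_events_by_analysis:   # loop over analyses t
            for n_events in strata:                # reassign events under H_o
                e = sum(rng.random() < 1 / (M + 1) for _ in range(n_events))
                y1 += e
                y0 += n_events - e
            if llr(y1, y0) >= a:                   # Step 3: compare with a
                signals += 1
                break                              # stop this realization
    return signals / nsim                          # alpha-hat
```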

This approach is a special case of the general unifying boundary approach developed by Kittelson *et al*.[15] To allow for the more general approach, define *c(t)* = *au(t)*, where *u(t)* is a function of the proportion of statistical information (e.g., sample size) accrued up to time *t* and has the form *u(t)* = (*N(T)/N(t)*)^{1-2Δ}, where Δ > 0 is a fixed parameter depending upon the design (e.g., *u(t)* = 1 is Pocock, and *u(t)* = (*N(T)/N(t)*)^{0.5} is O'Brien–Fleming). The same approach is used to solve iteratively for *a*, but the boundary *c(t)* will now be shaped differently depending upon *u(t)*. We have named this more flexible version of the binomial maxSPRT the group sequential LRT (GS LRT). This additional flexibility allows the method to be applied more generally, for example, within the Mini-Sentinel pilot, where data are not available as often (potentially quarterly). Furthermore, the shape of the boundary can be changed to reflect the trade-offs appropriate to the specific safety question of interest. Because the original binomial maxSPRT used a unifying boundary type approach, we have presented it as such here, but as has been shown by others,[16] the error spending and unifying approaches are complementary, and therefore, we could equally have chosen an error spending approach.
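The unifying boundary itself is a one-liner, using the parameterization in the text (the function name is illustrative):

```python
def unified_boundary(a, n_t, n_T, delta):
    """Unifying-family boundary c(t) = a * u(t) with
    u(t) = (N(T)/N(t)) ** (1 - 2*delta).

    delta = 0.5 gives u(t) = 1, a flat Pocock-style boundary; smaller
    delta inflates the boundary early in the evaluation, making early
    signaling harder (O'Brien-Fleming-like behavior).
    """
    return a * (n_T / n_t) ** (1 - 2 * delta)
```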

A potential limitation of the GS LRT method is the fixed matching ratio. In practice, if a strict matching criterion must be implemented because of the need for strong confounding control, it can be difficult to find *M* unexposed matches for each exposed participant, especially in the scenario of frequent monitoring. Frequent monitoring typically implies that an exposed participant should be matched to *M* unexposed participants within the current analysis time frame. This can lead to loss of matched strata, including strata with events. When strata are lost, the results are generalizable only to the subpopulation of the exposed population for which matched controls were found. Often, the matching criterion is then loosened, leading to less confounding control but a larger matched cohort.

#### Conditional sequential sampling procedure

The conditional sequential sampling procedure (CSSP)[17] was specifically developed to handle chronically used exposures, such as drugs that are taken over a period. However, the approach also is able to accommodate a single time exposure such as a vaccine. This method handles confounding using stratification and assumes that the data are aggregated.

Specifically, using the categorical confounders, *Z*_{i}^{c}, one stratifies the entire population under evaluation (unlike GS LRT, which uses a matched sample). Then, at each analysis, *t*, within each confounder stratum, *Z*_{k}^{S} (*k* = 1,…,K), one calculates the exposure time, *E*_{D, k}(*t*), and number of events, *Y*_{D, k}(*t*), among all participants in stratum *k* on medical product *D* (*D* = 0 (unexposed) or *D* = 1 (exposed)) since the previous analysis, *t*-1, where $E_{D,k}(t) = \sum_{i=1}^{N}\left(E_i(t) - E_i(t-1)\right)I(Z_i^c = Z_k^S \text{ and } D_i = D)$ and $Y_{D,k}(t) = \sum_{i=1}^{N}\left(Y_i(t) - Y_i(t-1)\right)I(Z_i^c = Z_k^S \text{ and } D_i = D)$. Under H_{o}, of no relationship between exposure to the MPI and the outcome conditional on strata, the conditional distribution of *Y*_{D = 1, k}(*t*) given *Y*_{D = 1, k}(*t*) + *Y*_{D = 0, k}(*t*) is $\mathrm{Binomial}\left(Y_{D=1,k}(t) + Y_{D=0,k}(t),\ \frac{E_{D=1,k}(t)}{E_{D=1,k}(t) + E_{D=0,k}(t)}\right)$, in which the success probability is the proportion of the total exposure time (exposed plus unexposed) contributed by the exposed group. Using this stratum-specific conditional distribution, one can simulate the distribution of *Y*_{D = 1, k}(*t*), the number of outcomes among those on the MPI within each stratum under H_{o}, given *Y*_{D = 1, k}(*t*) + *Y*_{D = 0, k}(*t*).

The test statistic of interest is then the total number of adverse events observed among those exposed up to time *t* across all strata, $Y_{D=1}(t) = \sum_{k=1}^{K} Y_{D=1,k}(t)$. The CSSP approach uses an error spending approach in combination with the conditional stratum-specific distributions to create the sequential monitoring boundary. Specifically, it uses the following iterative simulation approach:

- Step 1: Create a single realization of the dataset of observed exposed counts under H_{o} for analyses *t* = 1,…,*T* as follows: for each confounder stratum *k*, simulate $\tilde{Y}_{D=1,k}(t) \sim \mathrm{Binomial}\left(Y_k(t),\ \frac{E_{D=1,k}(t)}{E_{D=1,k}(t)+E_{D=0,k}(t)}\right)$ if $Y_k(t) > 0$, where $Y_k(t) = Y_{D=1,k}(t) + Y_{D=0,k}(t)$ is the total number of events in stratum *k* since the previous analysis; otherwise, set $\tilde{Y}_{D=1,k}(t) = 0$. Then calculate $\tilde{Y}_{D=1}(t) = \sum_{j=1}^{t}\sum_{k=1}^{K}\tilde{Y}_{D=1,k}(j)$, the total number of simulated exposed events at analysis *t*.
- Step 2: Repeat Step 1 for a large number of realizations, *Nsim*, to create a distribution of the total number of exposed events at each analysis, $\tilde{Y}_{D=1}^{1}(t),\ldots,\tilde{Y}_{D=1}^{Nsim}(t)$.
- Step 3: Order $\tilde{Y}_{D=1}^{1}(t),\ldots,\tilde{Y}_{D=1}^{Nsim}(t)$ from smallest to largest, and if $Y_{D=1}(t) > \tilde{Y}_{D=1}^{(Nsim(1-\alpha(t)))}(t)$, signal at analysis *t*; otherwise, continue.
- Step 4: Set the simulated event counts that would have signaled at this analysis, $\tilde{Y}_{D=1}^{(Nsim(1-\alpha(t))+1)}(t),\ldots,\tilde{Y}_{D=1}^{(Nsim)}(t)$, to an extreme value, such as 1000, so that these realizations are flagged as having passed the boundary. This allows for a cumulative error spending calculation that incorporates stopping. Otherwise, keep $\tilde{Y}_{D=1}^{j}(t)$ from Step 1, and repeat from Step 1 at the next analysis, *t* + 1.
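A single analysis of this null simulation can be sketched as follows. This simplified version returns the (1 − *α*(*t*)) quantile of the simulated exposed-event total for one analysis only and omits the carry-forward flagging of stopped realizations (Step 4); all names are illustrative:

```python
import random

def cssp_null_quantile(strata, alpha_t, nsim=5000, seed=1):
    """(1 - alpha_t) quantile of the CSSP null distribution (sketch).

    strata: list of (y_k, e1_k, e0_k) tuples for one analysis -- total
    events, exposed exposure time, and unexposed exposure time in each
    stratum. Under H_o each event in stratum k is exposed with probability
    e1_k / (e1_k + e0_k). Signal if the observed exposed-event total
    Y_{D=1}(t) exceeds the returned quantile.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(nsim):
        total = 0
        for y, e1, e0 in strata:
            p = e1 / (e1 + e0)
            # draw Binomial(y, p) exposed events for this stratum
            total += sum(rng.random() < p for _ in range(y))
        totals.append(total)
    totals.sort()
    return totals[min(int(nsim * (1 - alpha_t)), nsim - 1)]
```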

This simulation approach explicitly incorporates the sequential monitoring stopping rules. Any form of the cumulative error spending function, *α*(*t*), can be assumed, as discussed in the section on the Lan–Demets group sequential approach using error spending.

This CSSP approach is especially good when evaluating rare events, but it has limitations when there are too many strata and/or short intervals between analyses. The approach breaks down because the only informative strata are those that meet the following two criteria: (i) at least one observed event, but not an event for every participant; and (ii) at least one exposed and one unexposed participant. Furthermore, each analysis is treated as having separate strata because information from one analysis to the next is treated as independent. Therefore, the true number of independent strata across all analyses is *K* × *T* (the number of confounder strata times the total number of analyses). So as both *K* and *T* increase, very few strata will be informative. As a result, the test statistic is less stable, which can both influence power and potentially inflate or deflate the type I error. Having a small number of informative strata also leads to results being generalizable only to the informative strata population and not to the overall population. Caution should be taken in the interpretation of results in this high dimensional strata situation. Furthermore, this approach assumes a constant relationship between exposure duration and the probability of an event, which may not be valid. Overall, it has nice properties for the rare event case and will be applicable to postmarket surveillance in settings where testing is not performed highly frequently and where a very large number of confounder strata is not required.

#### Group sequential estimating equation approach

The final approach we will present is one that controls for confounding through regression (unweighted or weighted). It can be applied to either the single exposure time or chronic exposure time setting. It has the flexibility to incorporate different exposure duration relationships, but we will focus on a constant relationship (i.e., given exposure duration, one assumes a constant rate of disease based just on exposure time). The approach uses a generalized estimating equation (GEE) framework and a score test statistic. Specifically, assume that the mean regression model under the null hypothesis, H_{o}, of no relationship between the MPI and the event is *g*(*E*(*Y*_{i}(*t*))) = *β*_{0} + *β*_{Z}*Z*_{i} + *f*_{θ}(*E*_{i}(*t*)), where *g*(.) is the mean link function; for example, the logit for a logistic model or the logarithm for a Poisson model. The exposure link function, *f*_{θ}(.), would typically be ignored for a single time exposure or specified as the logarithmic function if using a Poisson model. However, to allow for flexibility, this has been kept general.

Given the mean model, the generalized score statistic,[18] *Sc(t)*, can be calculated, with the additional specification of the family from which the data have arisen; for example, a binomial family for logistic regression and a Poisson family for a log regression model. However, a nice property of GEE when using the generalized score statistic is that it assumes only that the mean model is correctly specified.[19]

To calculate the sequential monitoring boundary, it has been proposed to use the following permutation data distribution:

- Step 1: At each analysis, *t*, simulate data by fixing (*Y*_{N(t-1)+1}, *Z*_{N(t-1)+1}),…,(*Y*_{N(t)}, *Z*_{N(t)}) and permuting *D*_{N(t-1)+1},…,*D*_{N(t)} to create *D*^{*}_{N(t-1)+1},…,*D*^{*}_{N(t)}, and calculate $\tilde{Sc}^{j}(t)$.
- Step 2: Repeat Step 1 for a large number of realizations, *Nsim*, to create a distribution of score statistics under H_{o} at each analysis, *t*: $\tilde{Sc}^{1}(t),\ldots,\tilde{Sc}^{Nsim}(t)$.
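The permutation scheme can be sketched generically, with the score statistic left as a user-supplied function; a real implementation would refit the GEE and compute the generalized score statistic on each permuted dataset, and all names here are illustrative:

```python
import random

def permutation_null_scores(blocks, score_stat, nsim=1000, seed=2):
    """Permutation distribution of a test statistic under H_o (sketch).

    blocks: per-analysis lists of (y_i, z_i, d_i) records. Within each
    block of newly accrued data, outcomes and confounders are held fixed
    and the exposure labels d_i are permuted (Step 1). score_stat maps
    the cumulative permuted records to a scalar statistic, e.g. a GEE
    generalized score statistic. Returns one statistic per realization.
    """
    rng = random.Random(seed)
    null_scores = []
    for _ in range(nsim):
        permuted = []
        for block in blocks:
            d = [rec[2] for rec in block]
            rng.shuffle(d)  # permute exposure within the new data block
            permuted += [(y, z, dj) for (y, z, _), dj in zip(block, d)]
        null_scores.append(score_stat(permuted))
    return null_scores
```

Comparing the observed statistic with quantiles of this null distribution then yields the unifying or error spending boundary described in the text.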

The boundary can be defined following the unifying boundary formulation, as outlined for the GS LRT method, or an error spending approach, as outlined for the Lan–Demets (GS LD) method, except with this permuted dataset and score test statistic. Note that we are not directly estimating the effect of *D*_{i} because the score statistic is calculated under H_{o}. This allows the test statistic to have better statistical properties, such as power, when the interest is in comparing alternative hypotheses that are closer to the null (e.g., better power relative to other methods for detecting RR = 1.5 versus RR = 3.0).[20]

A potential advantage of this approach compared with the other three is that it may provide more flexible confounder control than GS LRT or CSSP, and it does not rely as heavily on asymptotic assumptions as the Lan–Demets error spending approach does. However, a limitation of this approach, and of any regression approach, is that it requires the first analysis to have enough events and observations to estimate the parameters of the mean regression model. This can be difficult in the extremely rare event case, where the GS LRT or CSSP approaches may be preferable. As outlined by Nelson *et al*.,[13] it may be advantageous in safety surveillance to delay the first test of the data until an adequate amount of information has accrued, in which case this method may be applicable in most commonly encountered situations. Furthermore, it requires more computational time than the well-defined asymptotically normal Lan–Demets error spending approach, so in the non-rare event case, the latter approach may be preferable for simplicity. Overall, all four approaches are applicable to the postmarket surveillance setting, and a brief summary of assumptions, limitations, and advantages is outlined in Table 1.

Table 1. Overview of the four statistical methods for sequential monitoring, including potential advantages and limitations

| Method | Exposure type | Confounding control | Test statistic | Boundary | Potential advantages | Potential limitations |
| --- | --- | --- | --- | --- | --- | --- |
| GS LD | Single time or chronic exposure | All: matching, stratification, regression | Any standardized test statistic | Error spending boundary derived using a normal approximation | Easy to apply, flexible confounding control | In very rare event setting, or frequent testing, the normal approximation assumptions may not hold |
| GS LRT | Single time exposure | Matching with fixed matching ratio | LRT | Unifying boundary derived using permutation; potential to extend to error spending boundary | Matching provides an appealing interpretation | Information loss because of restricted sample; potential loss of exposed if matching criteria too strict or insufficient confounding control if criteria too loose |
| CSSP | Single time or chronic exposure | Stratification | Number of events for those on MPI | Error spending boundary derived by conditioning on number of events within strata | Works well for rare adverse events | May not maintain type I error when strata are small or if testing is frequent |
| GS EE | Single time or chronic exposure | Regression | Score statistic | Unifying boundary or error spending boundary derived using permutation | Flexible confounding control with few assumptions | Requires sufficient outcome data at first look to estimate the initial regression parameters |