Statistical approaches to group sequential monitoring of postmarket safety surveillance data: current state of the art for use in the Mini-Sentinel pilot


A. J. Cook, Biostatistics Unit, Group Health Research Institute, 1730 Minor Avenue, Suite 1600, Seattle, WA 98101, USA. E-mail:



This manuscript describes the current statistical methodology available for active postmarket surveillance of pre-specified safety outcomes using a prospective incident user concurrent control cohort design with existing electronic healthcare data.


Motivation of the active postmarket surveillance setting is provided using the Food and Drug Administration's Mini-Sentinel Pilot as an example. Four sequential monitoring statistical methods are presented including the Lan–Demets error spending approach, a matched likelihood ratio test statistic approach with the binomial MaxSPRT as a special case, the conditional sequential sampling procedure with stratification, and a generalized estimating equation regression approach using permutation. Information on the assumptions, limitations, and advantages of each approach is provided, including how each method defines sequential monitoring boundaries, what test statistic is used, and how robust it is to settings of rare events or frequent testing.


A hypothetical example of how the approaches could be applied to data comparing a medical product of interest, drug A, to a concurrent control drug, drug B, is presented including providing the type of information one would have available for monitoring such drugs.


We have described the current state of methodology for postmarket surveillance of pre-specified safety outcomes. We describe the limitations and advantages of the approaches while acknowledging areas for future development. Copyright © 2012 John Wiley & Sons, Ltd.


There is a pressing public health need to monitor the safety of marketed medical products. Therapeutic and prevention products, such as vaccines, drugs, and devices, go through rigorous clinical trials evaluating efficacy and safety before being approved, but these trials are generally not of sufficient size to systematically detect rare adverse events and do not always include representation from all populations that receive them after their marketing. Therefore, the Centers for Disease Control and the Food and Drug Administration (FDA) have begun to utilize large multi-site healthcare databases to conduct postmarket surveillance evaluations for medical product safety. The FDA's Sentinel Initiative is an example of a program designed to improve the evaluation of safety across a large array of FDA-regulated medical products.[1]

This paper describes statistical methods for the evaluation of recently approved products using a prospective cohort observational design with existing electronic healthcare data for pre-specified safety outcomes. The goal of this study design is to quickly detect potential safety concerns by sequentially monitoring effect estimates multiple times throughout an evaluation. The aim is to determine whether, for a prespecified set of safety outcomes, there is an excess rate of observed events in recipients of the medical product of interest (MPI) compared with a single comparison group. The comparison group is important and can be chosen in several ways. In this manuscript, we consider a concurrent control group defined to be comparable to those taking the MPI after controlling for confounders. For example, when evaluating a new diabetes drug for safety, an appropriate comparison group could be those taking an alternative diabetes drug. However, we would need to control for site, and perhaps patient characteristics, because physicians from the various sites contributing data may exhibit differential prescribing habits, and patient characteristics may be associated with choice of diabetes drug.

This type of safety evaluation has been termed “signal refinement” because potential adverse events are predefined based upon the suggestion of a potential risk, which may result from various scenarios including, but not limited to, observation during pre-approval or in adverse event reporting systems,[2] or because of known biologic reasons uncovered during the study of similar medical products. This signal refinement stage can be thought of as a preliminary step before conducting a more extensive phase IV observational study or confirmatory randomized clinical trial because existing healthcare databases, typically constructed for payment or clinical care purposes, tend to have issues such as incomplete data, data errors, and lack of information on potential confounders. Several examples of signal refinement studies have been published, but this is still a relatively new area.[3-7]

Statistical methods used to address hypotheses within postmarket safety evaluation designs must be able to detect both rare and common adverse events, control for confounding, and maintain the overall type I error across multiple tests. This manuscript describes the current state of statistical methods developed to conduct sequential analysis of prospective cohort data for medical product safety. We present four sequential methods that use different approaches to handle confounding, maintain the overall type I error, and have different statistical properties such as time to signal detection and power. Controlling for confounding is a major concern for observational safety surveillance and distinguishes it from the randomized clinical trial setting in which most sequential monitoring methods have been developed. Furthermore, when the outcomes of interest are rare, the inferential properties that hold in randomized trials, such as large sample asymptotics, may not hold in this setting and need to be assessed. We focus on methods already applied to observational safety surveillance evaluations and studies but, for comparison, also introduce one general method used in randomized clinical trials that is applicable to safety surveillance. We discuss potential limitations of these methods and conclude with discussion of the need for future work to develop methods tailored to the setting that we characterize.


The electronic data generally captured for signal refinement by systems like Mini-Sentinel are primarily administrative and claims based, collected by health plans during the course of routine healthcare practice. Mini-Sentinel uses a distributed data system, in which individual level data, standardized using a common data model, remain at the local site. For this paper, we will assume that distributed programs summarize event and sample size counts at each site, stratified by exposure group and by confounders, and these results are then aggregated across sites for analysis. Although in some cases, analyses may be based on individual level data, more often to protect patient privacy, deidentified information is combined for central analysis, and thus, the focus of this discussion remains on aggregate data.

Data specifications and notation

We assume that accruing data will be analyzed at specific time points (t = 1,…,T). We also assume that each individual i is either exposed to the MPI, Di = 1, or not exposed, Di = 0, and either has the outcome of interest occurring before the end of analysis t, Yi(t) = 1, or does not, Yi(t) = 0. The exposure time, Ei(t), denotes the cumulative exposure time prior to analysis t. It could be a single time exposure window (e.g., vaccine: Ei(t) = 1 for all individuals) or a chronic exposure (time on either MPI or comparator), for which assumptions of the exposure time and outcome relationship must be made (constant risk or change in risk because of exposure duration). For this manuscript, we censor a participant's exposure time at the date of disenrollment, occurrence of the outcome, or discontinuation of use of the initial prescribed treatment. In the case of discontinuation, we add a certain lag time to allow recognition of outcomes that could biologically be related to the exposure (e.g., 7 days after discontinuation of treatment for the outcome of seizure because outcomes more than 7 days after discontinuation are unlikely to be related to treatment). Furthermore, participants are censored if they switch exposure groups and begin taking the other medical product (e.g., an exposed individual starts taking the comparator medical product). A lag time also may be added after the date of switching exposures. These design features are consistent with the incident user cohort study design currently in common use in postmarket surveillance.[8]

Furthermore, we assume that there is a set of baseline confounders, Zi, associated with individual i, which can be composed of variables such as age, sex, site, and health conditions. When using aggregate data, these confounders often are categorized to form a set of categorical confounders, $Z_i^c$. For example, a continuous confounder, such as age, can be categorized into 5- or 10-year age groups. Under this data setup, confounding can be addressed by regression, stratification, or matching.
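As a minimal sketch of the categorization step, the following Python snippet maps a continuous age to a fixed-width category and combines it with other categorical confounders into a stratum key. The function names and the choice of site, sex, and age as the stratifying variables are illustrative assumptions, not part of any Mini-Sentinel program.

```python
def age_group(age_years: int, width: int = 5) -> str:
    """Map an age in whole years to a fixed-width category label such as '65-69'."""
    lower = (age_years // width) * width
    return f"{lower}-{lower + width - 1}"


def stratum_key(site: str, sex: str, age_years: int, width: int = 5) -> tuple:
    """Combine categorical confounders into a single key used to aggregate counts."""
    return (site, sex, age_group(age_years, width))


# Hypothetical individual: a 67-year-old woman at a site labeled "site1".
print(stratum_key("site1", "F", 67))
```

Aggregated event and exposure-time counts can then be summed within each such key before being combined across sites.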

Sequential testing framework

In a signal refinement evaluation, the overall hypothesis of interest is whether there is a higher event rate for those on the MPI (Di = 1) compared with the unexposed group (Di  = 0) after accounting for confounding and exposure time. Numerous test statistics (based on the relative risk (RR) or hazard ratio, for example) can be derived to evaluate this hypothesis, thus creating different statistical methods. The chosen hypothesis is tested at each analysis t, and if the test statistic at analysis t exceeds a pre-defined critical boundary, c(t), it signals a significantly elevated rate of events at analysis t; otherwise, the study continues to the next analysis time until the pre-defined end of the evaluation. At each analysis, more new information accumulates, which may include new participants exposed and unexposed to the MPI since the last analysis, as well as more follow-up or exposure time for participants already included in the previous analysis. Different approaches to incorporating updated data induce different assumptions that need to be accounted for in the calculation of the critical boundary. The critical boundary can be chosen in numerous ways, but it must maintain the overall type I error rate across all analyses, taking into account both multiple testing and a skewed testing distribution that conditions on whether earlier test statistics exceeded the specified critical value at previous analysis times. A general review of sequential monitoring boundaries has been presented by Emerson et al.[9] and is beyond the scope of this paper, but we will present approaches specific to the observational surveillance setting and one general method used in randomized clinical trials that is applicable to this area.
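The decision rule itself is the same for every method: compare the test statistic with c(t) at each analysis and stop at the first crossing. A minimal sketch, assuming the statistics and boundary values are already available as lists (the function name is hypothetical):

```python
from typing import Optional, Sequence


def first_signal(test_stats: Sequence[float], boundary: Sequence[float]) -> Optional[int]:
    """Return the 1-based analysis at which the statistic first exceeds c(t), else None."""
    for t, (stat, c_t) in enumerate(zip(test_stats, boundary), start=1):
        if stat > c_t:
            return t  # signal: monitoring stops here
    return None  # no signal by the last available analysis


# Illustrative values: statistic below the boundary at look 1, above it at look 2.
print(first_signal([2.09, 3.48], [2.10, 2.31]))
```

The methods discussed in this paper differ in how the statistic and the boundary values passed to such a rule are derived, not in the rule itself.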

Group sequential statistical methods

Lan–Demets group sequential approach using error spending

The first method we consider is a general group sequential method, used mainly in randomized clinical trials, developed by Lan and Demets[10] using an error spending approach. An error spending approach uses the concept of cumulative alpha or type I error, α(t), defined as the cumulative amount of type I error spent at analysis t and all previous analyses, 1,…,t-1. We assume that 0 < α(1) ≤ ∙∙∙ ≤ α(T) = α, where α is the overall type I error to be spent across the evaluation period. The function α(t) can be any monotonically increasing function that preserves family-wise error, but there are several common approaches including the Pocock[11] boundary function $\alpha(t) = \alpha \log\{1 + (e - 1)t/T\}$, the O'Brien–Fleming[12] boundary function $\alpha(t) = 2\{1 - \Phi(z_{\alpha/2}/\sqrt{t/T})\}$, where $\Phi$ is the standard normal distribution function and $z_{\alpha/2}$ is its upper α/2 quantile, and the general power boundary function $\alpha(t) = \alpha (t/T)^{p}$ for p > 0. The most commonly used boundary function for safety evaluations has been a flat, Pocock-like, boundary on a standardized test statistic scale. This boundary spends α approximately evenly across analyses, given the test statistic is asymptotically normally distributed. Therefore, relative to an O'Brien–Fleming boundary, which is commonly used in efficacy studies, it spends more α at earlier analyses than at later analyses, given the amount of statistical information, or sample size, observed up to time t. This flat boundary has been discussed as Pocock like, but a Pocock boundary when testing more frequently (quarterly or more often) is not completely flat. For further discussion of boundary shapes and statistical trade-offs between them in practice for postmarket surveillance, see Nelson et al.[13]
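To make the shapes of these spending functions concrete, here is a small Python sketch that evaluates them at each analysis. The O'Brien–Fleming form used here is the commonly cited Lan–DeMets O'Brien–Fleming-type spending function and is an assumption about the intended formula; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def pocock_spend(t: int, T: int, alpha: float) -> float:
    """Pocock-type cumulative alpha: alpha * log(1 + (e - 1) * t / T)."""
    return alpha * math.log(1 + (math.e - 1) * t / T)


def obrien_fleming_spend(t: int, T: int, alpha: float) -> float:
    """O'Brien-Fleming-type cumulative alpha (assumed Lan-DeMets form)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t / T)))


def power_spend(t: int, T: int, alpha: float, p: float = 1.0) -> float:
    """Power-family cumulative alpha: alpha * (t / T) ** p."""
    return alpha * (t / T) ** p


if __name__ == "__main__":
    T, alpha = 4, 0.05
    for t in range(1, T + 1):
        print(t, pocock_spend(t, T, alpha), obrien_fleming_spend(t, T, alpha),
              power_spend(t, T, alpha, p=2.0))
```

All three functions reach α at t = T; they differ in how much of it is spent early, with the Pocock-type function spending the most at the first analyses.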

Given the error spending boundary function, Lan and Demets developed an asymptotic conditional sequential monitoring boundary for any asymptotically normal test statistic based on independent increments of data.[10] This boundary can be computed and used to compare with almost any standardized test statistic, including one that controls for confounding. For example, when interest is in an adjusted RR, $\widehat{RR}(t)$, or log RR, it can be estimated using Poisson regression, and a standardized test statistic can be calculated, $Zval(t) = \log\widehat{RR}(t) / \sqrt{\widehat{\mathrm{Var}}(\log\widehat{RR}(t))}$. The value of Zval(t) can then be compared with the asymptotic conditional monitoring boundary developed by Lan and Demets,[10] resulting in a decision to stop if Zval(t) exceeds the monitoring boundary or to continue collecting additional data. This is an appealing approach because the boundary is very simple to calculate and relies on a well-defined asymptotic distribution. However, in practice with rare events and frequent testing (small amount of new information between analyses), the asymptotic properties of the boundary fail to hold. This is similar to the scenario where an exact test may be preferred to an asymptotically normal test when the sample size is small. The following methods have sought to address the shortcomings of this approach to allow for more precise statistical performance in a wider variety of settings.
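As a rough illustration of a standardized Wald statistic on the log RR scale, the sketch below uses an unadjusted rate ratio with the usual large-sample variance 1/y1 + 1/y0 for a log rate ratio. This is a simplifying assumption: the approach described above would take the log RR and its variance from a confounder-adjusted Poisson regression instead. The function name and the counts are hypothetical.

```python
import math


def crude_log_rr_wald(y_exp: int, pt_exp: float, y_unexp: int, pt_unexp: float):
    """Return (rate ratio, Wald z) for exposed versus unexposed event rates.

    Unadjusted sketch: no confounder control, and both event counts must be > 0.
    """
    log_rr = math.log((y_exp / pt_exp) / (y_unexp / pt_unexp))
    se = math.sqrt(1 / y_exp + 1 / y_unexp)  # large-sample SE of the log rate ratio
    return math.exp(log_rr), log_rr / se


# Hypothetical counts: 30 versus 15 events over equal person-time.
rr, z = crude_log_rr_wald(30, 1000.0, 15, 1000.0)
print(rr, z)
```

With only a handful of events the normal approximation behind this statistic is poor, which is the limitation noted above for rare events and frequent testing.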

Group sequential likelihood ratio test

The group sequential likelihood ratio test (LRT) approach is a method that has been used in the Vaccine Safety Datalink project to monitor vaccine safety for a single time vaccine exposure.[3, 6, 7, 14] The approach uses exposure matching with a fixed matching ratio (1:M) to control for confounding and then computes an LRT statistic. The most commonly used method is the binomial maxSPRT,[14] which assumes continuous monitoring (i.e., after each matched set of exposed and unexposed individuals comes into the dataset, the test statistic is compared with the monitoring boundary).

Specifically, for the maxSPRT method, one creates matched exposure strata, s (s = 1,…,S), such that each exposed individual, with $D_{s1} = 1$, is matched to one or more unexposed individuals ($D_{s2} = 0,\ldots,D_{s(M+1)} = 0$) who have the same categorical confounders, $Z^c$. Then, the log LRT statistic at each analysis, t, is the following:

$$LLR(t) = \log\left\{ \frac{\left(\frac{Y_{D=1}(t)}{Y(t)}\right)^{Y_{D=1}(t)} \left(\frac{Y_{D=0}(t)}{Y(t)}\right)^{Y_{D=0}(t)}}{\left(\frac{1}{M+1}\right)^{Y_{D=1}(t)} \left(\frac{M}{M+1}\right)^{Y_{D=0}(t)}} \right\}$$

where $Y_{D=1}(t)$ and $Y_{D=0}(t)$ are the number of events observed among those exposed and unexposed to the MPI up to time t, respectively, and $Y(t) = Y_{D=1}(t) + Y_{D=0}(t)$ is the total number of events up to time t. Note that S(t) is the number of strata up to time t, which also is the number of exposed participants because we are assuming a fixed matching ratio of 1:M. This particular LRT, which conditions on the total number of events, Y(t), is designed for the rare event case in which only one event is expected to be observed per exposure stratum. One can think of this LRT as comparing the observed proportion of exposed (and unexposed) events out of the total number of events to the expected proportion under the null, which is just 1/(M + 1) for the exposed participants and M/(M + 1) for the unexposed participants.
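The comparison of observed and expected event proportions can be sketched in a few lines of Python. This is a minimal illustration of a binomial log likelihood ratio conditional on the total number of events; truncating the statistic at zero when the exposed proportion does not exceed 1/(M + 1) is an assumption reflecting the one-sided safety question, and the function names are hypothetical.

```python
import math


def _xlogy(x: float, y: float) -> float:
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binomial_maxsprt_llr(y_exp: int, y_unexp: int, m: int) -> float:
    """Log likelihood ratio for exposed events given the total events, 1:m matching."""
    y_total = y_exp + y_unexp
    if y_total == 0:
        return 0.0
    p_hat = y_exp / y_total   # observed proportion of events that are exposed
    p0 = 1 / (m + 1)          # expected proportion under the null
    if p_hat <= p0:           # assumed one-sided test: no evidence of excess risk
        return 0.0
    return (_xlogy(y_exp, p_hat) + _xlogy(y_unexp, 1 - p_hat)
            - _xlogy(y_exp, p0) - _xlogy(y_unexp, 1 - p0))


# Hypothetical look: 8 exposed and 2 unexposed events with 1:1 matching.
print(binomial_maxsprt_llr(8, 2, 1))
```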

However, when events are not extremely rare, or when the probability within a stratum of more than one event occurring is not small, the assumptions of this LRT are violated, and a more general two-sample binomial likelihood ratio test statistic should be used:

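$$LLR(t) = \log\left\{ \frac{\hat{p}_1(t)^{Y_{D=1}(t)} \{1 - \hat{p}_1(t)\}^{N_{D=1}(t) - Y_{D=1}(t)} \, \hat{p}_0(t)^{Y_{D=0}(t)} \{1 - \hat{p}_0(t)\}^{N_{D=0}(t) - Y_{D=0}(t)}}{\hat{p}(t)^{Y(t)} \{1 - \hat{p}(t)\}^{N(t) - Y(t)}} \right\}, \quad \hat{p}_1(t) = \frac{Y_{D=1}(t)}{N_{D=1}(t)}, \; \hat{p}_0(t) = \frac{Y_{D=0}(t)}{N_{D=0}(t)}, \; \hat{p}(t) = \frac{Y(t)}{N(t)}$$

where $N_{D=1}(t)$ and $N_{D=0}(t)$ are the number of people exposed and unexposed to the medical product up to time t, respectively, and $N(t) = N_{D=1}(t) + N_{D=0}(t)$ is the total sample size up to time t. Note that this general LRT incorporates the total sample size, unlike the binomial maxSPRT LRT that is conditional on the total number of events. For rare events, the performance of each LRT is similar. Further evaluation needs to be conducted to establish the scenarios in which each LRT has better statistical properties.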
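A two-sample binomial log likelihood ratio of this kind can be sketched as follows. The code is an illustrative assumption of the standard form (separate event proportions under the alternative versus a pooled proportion under the null) and is written two-sided for simplicity; the function names are hypothetical.

```python
import math


def _binom_loglik(y: int, n: int, p: float) -> float:
    """Binomial log-likelihood kernel y*log(p) + (n - y)*log(1 - p), with 0*log(0) = 0."""
    ll = 0.0
    if y > 0:
        ll += y * math.log(p)
    if n - y > 0:
        ll += (n - y) * math.log(1 - p)
    return ll


def two_sample_binomial_llr(y_exp: int, n_exp: int, y_unexp: int, n_unexp: int) -> float:
    """Log likelihood ratio: group-specific event proportions versus a pooled proportion."""
    y, n = y_exp + y_unexp, n_exp + n_unexp
    alt = (_binom_loglik(y_exp, n_exp, y_exp / n_exp)
           + _binom_loglik(y_unexp, n_unexp, y_unexp / n_unexp))
    null = _binom_loglik(y, n, y / n)
    return alt - null


# Hypothetical counts: 10/100 exposed events versus 2/100 unexposed events.
print(two_sample_binomial_llr(10, 100, 2, 100))
```

Unlike the conditional statistic, this version uses the group sample sizes directly, so it remains usable when more than one event per matched stratum is plausible.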

For the binomial maxSPRT, a Pocock-like boundary has been proposed, c(t) = a, which is a flat boundary on the log LRT statistic. One common way to solve for the constant, a, uses an iterative simulation approach similar to the following:

  • Step 1: Simulate data assuming Ho and the observed event rate while controlling for confounding (i.e., using a permutation approach: within each stratum s (s = 1,…,S), fix the outcomes Ys1,…,Ys(M+1) and permute the exposures Ds1,…,Ds(M+1) to create Ds1*,…,Ds(M+1)*, so that the exposure strata relationships are held and confounding is thus controlled).
  • Step 2: Calculate LLR(t) on the simulated dataset.
  • Step 3: If LLR(t) ≥ a, then set $\text{Signal}_k = 1$ for simulation k and stop the loop; otherwise, continue to the next analysis, t + 1.
  • Step 4: If t = T is reached without a signal, then set $\text{Signal}_k = 0$.

This process is repeated a large number, Nsim, of times (k = 1,…,Nsim), and the estimated α level for the boundary is calculated as $\hat{\alpha} = \frac{1}{N_{sim}} \sum_{k=1}^{N_{sim}} \text{Signal}_k$. One solves for a by repeating the simulation and changing a until $\hat{\alpha} \approx \alpha$.
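The search for the constant a can be sketched in Python as below. Two simplifying assumptions are made and should be read as such: null data are generated by treating each event as exposed with probability 1/(M + 1), rather than by permuting exposure labels within matched strata, and a is found directly from the simulated maximum of LLR(t) over analyses, which is equivalent to signaling whenever LLR(t) ≥ a at any look. All function names and the numbers of events per look are hypothetical.

```python
import math
import random


def llr(y_exp: int, y_total: int, m: int) -> float:
    """One-sided binomial log likelihood ratio given exposed and total event counts."""
    if y_total == 0:
        return 0.0
    p_hat, p0 = y_exp / y_total, 1 / (m + 1)
    if p_hat <= p0:
        return 0.0
    out = y_exp * math.log(p_hat / p0)
    if y_total > y_exp:
        out += (y_total - y_exp) * math.log((1 - p_hat) / (1 - p0))
    return out


def simulate_max_llr(events_per_look, m, n_sim, seed=0):
    """For each null realization, record the largest LLR(t) seen across all looks."""
    rng = random.Random(seed)
    p0 = 1 / (m + 1)
    maxima = []
    for _ in range(n_sim):
        y_exp = y_total = 0
        best = 0.0
        for new_events in events_per_look:
            y_exp += sum(rng.random() < p0 for _ in range(new_events))
            y_total += new_events
            best = max(best, llr(y_exp, y_total, m))
        maxima.append(best)
    return maxima


def estimate_alpha(a, maxima):
    """Share of realizations that would signal (LLR(t) >= a at some look)."""
    return sum(x >= a for x in maxima) / len(maxima)


def solve_flat_boundary(maxima, alpha):
    """Smallest simulated value a whose estimated alpha does not exceed alpha."""
    for a in sorted(set(maxima)):
        if estimate_alpha(a, maxima) <= alpha:
            return a
    return float("inf")


if __name__ == "__main__":
    sims = simulate_max_llr([10] * 5, m=1, n_sim=2000, seed=1)
    a = solve_flat_boundary(sims, alpha=0.05)
    print(a, estimate_alpha(a, sims))
```

Because the statistic is discrete, the estimated α at the chosen a is typically somewhat below the nominal level rather than exactly equal to it.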

This approach is a special case of the general unifying boundary approach developed by Kittelson et al.[15] To allow for the more general approach, define c(t) = a u(t), where u(t) is a function dependent upon the proportion of statistical information (e.g., sample size) up to time t and is of the form $u(t) = (N(T)/N(t))^{1-2\Delta}$, where Δ > 0 is a fixed parameter depending upon the design (e.g., u(t) = 1 is Pocock, and $u(t) = (N(T)/N(t))^{0.5}$ is O'Brien and Fleming). The same approach is used to solve iteratively for a, but the boundary c(t) will now be shaped differently depending upon u(t). We have named this more flexible version of the binomial maxSPRT the group sequential LRT (GS LRT). This additional flexibility allows the method to be applied more generally, for example, within the Mini-Sentinel pilot, where data are not available as often (potentially quarterly). Furthermore, the shape of the boundary can be changed to reflect the desired trade-offs appropriate to the specific safety question of interest. Because the original binomial maxSPRT used a unifying boundary type approach, we have presented it as such here, but as has been shown by others,[16] the error spending approach and unifying approach are complementary, and therefore, we could have chosen an error spending approach.
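A short sketch of how the shape function changes the boundary, assuming the constant a has already been solved for and that the cumulative sample sizes at each look are known (the function name and inputs are hypothetical):

```python
def unifying_boundary(a: float, n_cum: list, delta: float) -> list:
    """Boundary c(t) = a * (N(T) / N(t)) ** (1 - 2 * delta) at each look."""
    n_final = n_cum[-1]
    return [a * (n_final / n_t) ** (1 - 2 * delta) for n_t in n_cum]


# delta = 0.5 gives a flat boundary; delta = 0.25 gives the exponent 0.5.
print(unifying_boundary(3.0, [2500, 5000, 10000], delta=0.5))
print(unifying_boundary(3.0, [2500, 5000, 10000], delta=0.25))
```

In practice a would be re-solved by simulation for each choice of Δ, because the constant that preserves the overall type I error depends on the boundary shape.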

A potential limitation of the GS LRT method is the fixed matching ratio. In practice, if there is a need to implement a strict matching criterion, because of the need for strong confounding control, then it can be difficult to find M unexposed matches for each exposed participant especially in the scenario of frequent monitoring. Frequent monitoring typically implies that an exposed participant should be matched to M unexposed participants within the current analysis time frame. This can lead to loss of matched strata including strata with events. When strata are lost, the results are then only generalizable to the subpopulation of the exposed population for which a matching control was found. Often, the matching criterion is then loosened, leading to less confounding control but a larger matched cohort.

Conditional sequential sampling procedure

The conditional sequential sampling procedure (CSSP)[17] was specifically developed to handle chronically used exposures, such as drugs that are taken over a period. However, the approach also is able to accommodate a single time exposure such as a vaccine. This method handles confounding using stratification and assumes that the data are aggregated.

Specifically, using categorical confounders, $Z_i^c$, one stratifies the entire population under evaluation (unlike GS LRT, which uses a matched sample). Then, at each analysis, t, within each confounder stratum, $Z_i^c = z_k$ (k = 1,…,K), one calculates the exposure time, $E_{D,k}(t)$, and number of events, $Y_{D,k}(t)$, among all participants in stratum k on medical product D (D = 0 (unexposed) or D = 1 (exposed)) since the previous analysis t-1, where $E_{D,k}(t) = \sum_{i: D_i = D,\, Z_i^c = z_k} \{E_i(t) - E_i(t-1)\}$ and $Y_{D,k}(t) = \sum_{i: D_i = D,\, Z_i^c = z_k} \{Y_i(t) - Y_i(t-1)\}$. Under Ho, no relationship between exposure to the MPI and the outcome conditional on strata, the conditional distribution of $Y_{D=1,k}(t) \mid Y_{D=1,k}(t) + Y_{D=0,k}(t)$ is $\text{Binomial}\left(Y_{D=1,k}(t) + Y_{D=0,k}(t),\ \frac{E_{D=1,k}(t)}{E_{D=1,k}(t) + E_{D=0,k}(t)}\right)$, which is based on the proportion of exposure time observed for those exposed compared with the total exposure time including exposed and unexposed. Using this stratum-specific conditional distribution, one can simulate the distribution of $Y_{D=1,k}(t)$, the number of outcomes among those on the MPI within each stratum under Ho, given $Y_{D=1,k}(t) + Y_{D=0,k}(t)$.

The test statistic of interest is then the total number of adverse events observed among those exposed up to time t across all strata, $Y_{D=1}(t) = \sum_{t'=1}^{t} \sum_{k=1}^{K} Y_{D=1,k}(t')$. The CSSP approach uses an error spending approach in combination with the conditional stratum-specific distributions to create the sequential monitoring boundary. Specifically, it uses the following iterative simulation approach:

  • Step 1: Create a single realization of the dataset of exposed event counts under Ho for analysis t, t = 1,…,T, as follows:
    1. For all confounder strata k, simulate $Y^*_{D=1,k}(t) \sim \text{Binomial}\left(Y_{D=1,k}(t) + Y_{D=0,k}(t),\ \frac{E_{D=1,k}(t)}{E_{D=1,k}(t) + E_{D=0,k}(t)}\right)$ if $Y_{D=1,k}(t) + Y_{D=0,k}(t) > 0$; else set $Y^*_{D=1,k}(t) = 0$.
    2. Calculate $Y^*_{D=1}(t) = \sum_{t'=1}^{t} \sum_{k=1}^{K} Y^*_{D=1,k}(t')$ (total number of simulated exposed events at analysis t).
  • Step 2: Repeat Step 1 for a large number of realizations, Nsim, to create a distribution of the total number of exposed events at each analysis, $Y^{*(1)}_{D=1}(t),\ldots,Y^{*(N_{sim})}_{D=1}(t)$.
  • Step 3: Order $Y^{*(1)}_{D=1}(t),\ldots,Y^{*(N_{sim})}_{D=1}(t)$ from smallest to largest, and if $\frac{1}{N_{sim}} \sum_{j=1}^{N_{sim}} I\{Y^{*(j)}_{D=1}(t) \ge Y_{D=1}(t)\} \le \alpha(t)$, then signal at analysis t; else continue.
  • Step 4: Set the simulated event counts that would have signaled at this analysis, that is, those $Y^{*(j)}_{D=1}(t)$ that meet the Step 3 criterion, to an extreme value, such as 1000, so that these realizations will be indicated as having passed the boundary. This allows for a cumulative error spending calculation that incorporates stopping. Otherwise, keep $Y^{*(j)}_{D=1}(t)$ from Step 1 and repeat from Step 1 at the next analysis, t + 1.

Using this simulation approach explicitly incorporates the sequential monitoring stopping rules. Any form of the cumulative error spending function, α(t), can be assumed as discussed in the section on the Lan–Demets Group Sequential approach using error spending.
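The simulation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the published implementation: realizations that crossed the boundary at an earlier analysis are tracked with a flag instead of an extreme value, the critical value at each look is taken as the smallest count whose simulated tail proportion is at most α(t), and the input layout and function names are hypothetical.

```python
import random


def binomial_draw(rng, n, p):
    """Draw from Binomial(n, p) using only the standard library."""
    return sum(rng.random() < p for _ in range(n))


def cssp_monitor(looks, observed_exposed, alpha_cum, n_sim=2000, seed=0):
    """Return the 1-based look of the first signal, or None.

    looks[t]: list of strata (total events, exposed time, unexposed time)
              accrued since the previous look.
    observed_exposed[t]: cumulative observed exposed events at look t.
    alpha_cum[t]: cumulative type I error allowed to be spent by look t.
    """
    rng = random.Random(seed)
    sim_cum = [0] * n_sim       # cumulative simulated exposed events
    stopped = [False] * n_sim   # realizations that crossed an earlier boundary
    for t, strata in enumerate(looks):
        for j in range(n_sim):
            if not stopped[j]:
                sim_cum[j] += sum(
                    binomial_draw(rng, y, e1 / (e1 + e0))
                    for y, e1, e0 in strata if y > 0 and e1 + e0 > 0
                )

        def tail(c):
            # Proportion of realizations already stopped or with a count >= c.
            return sum(stopped[j] or sim_cum[j] >= c for j in range(n_sim)) / n_sim

        if tail(observed_exposed[t]) <= alpha_cum[t]:
            return t + 1
        c = 0
        while tail(c) > alpha_cum[t]:   # smallest count that spends at most alpha(t)
            c += 1
        for j in range(n_sim):
            if sim_cum[j] >= c:
                stopped[j] = True
    return None


if __name__ == "__main__":
    # Hypothetical single stratum per look with equal exposed and unexposed time.
    data = [[(20, 100.0, 100.0)], [(20, 100.0, 100.0)]]
    print(cssp_monitor(data, observed_exposed=[18, 30], alpha_cum=[0.025, 0.05]))
```

Counting previously stopped realizations in the tail proportion is what makes the comparison at each look a cumulative error spending calculation rather than a fresh test.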

This CSSP approach is especially good when evaluating rare events, but it has limitations when there are too many strata and/or short intervals between analyses. The approach breaks down because the only informative strata are those that meet the following two criteria: (i) at least one event is observed, but not all participants have an event; and (ii) the stratum contains both an exposed and an unexposed participant. Furthermore, each analysis is treated as having separate strata because information from one analysis to the next is being treated as independent. Therefore, the true number of independent strata is K × T (number of confounder strata times total number of analyses) across all analyses. So as both K and T increase, very few strata will be informative. As a result, the test statistic is less stable, which can both influence power and potentially inflate or deflate the type I error. Having a small number of informative strata also leads to results being generalizable to the informative strata population only and not to the overall population. Caution should be taken in the interpretation of the results in this high dimensional strata situation. Furthermore, this approach assumes a constant relationship between exposure duration and the probability of an event, which may not be valid. Overall, it performs well in the rare event case and will be applicable to postmarket surveillance in settings where testing is not performed highly frequently and where not too many confounder strata are required.

Group sequential estimating equation approach

The final approach we will present is an approach that controls for confounding through regression (unweighted or weighted). It can be applied to either the single exposure time or chronic exposure time settings. It has the flexibility to incorporate different exposure duration relationships, but we will focus on a constant relationship (i.e., given exposure duration, one assumes a constant rate of disease based just on exposure time). The approach uses a generalized estimating equation (GEE) framework and a score test statistic. Specifically, assume that the mean regression model under the null hypothesis, Ho, of no relationship between the MPI and the event is $g(E(Y_i(t))) = \beta_0 + \beta_Z Z_i + f_\theta(E_i(t))$, where g(.) is the mean link function; for example, the logit for a logistic model or the logarithm for a Poisson model. The exposure link function, $f_\theta(.)$, would typically be ignored for a single time exposure or specified as the logarithmic function if using a Poisson model. However, to allow for flexibility, this has been kept general.

Given the mean model, the generalized score statistic,[18] Sc(t), can be calculated, with the additional specification for the family from which the data have arisen; for example, a binomial family for logistic regression and a Poisson family for a log regression model. However, a nice property of GEE when using the generalized score statistic is that it only assumes that the mean model is correctly specified.[19]

To calculate the sequential monitoring boundary, it has been proposed to use the following permutation data distribution:

  • Step 1: At each analysis t, simulate data by fixing $(Y_{N(t-1)+1}, Z_{N(t-1)+1}),\ldots,(Y_{N(t)}, Z_{N(t)})$ and permuting $D_{N(t-1)+1},\ldots,D_{N(t)}$ to create $D^*_{N(t-1)+1},\ldots,D^*_{N(t)}$, and calculate the score statistic on the permuted data, $Sc^*(t)$.
  • Step 2: Repeat Step 1 for a large number of realizations, Nsim, to create a distribution of score statistics, under Ho, at each analysis t, $Sc^{*(1)}(t),\ldots,Sc^{*(N_{sim})}(t)$.

The boundary can be defined following the unifying boundary formulation as outlined for the GS LRT method or an error spending approach as outlined for the Lan–Demets (GS LD) method, except with this permuted dataset and score test statistic. Note that we are not directly estimating the effect of Di because a score statistic is calculated under Ho. This allows for the test statistic to have better statistical properties, such as power, when the interest is in comparing alternative hypotheses that are closer to the null (e.g., better power relative to other methods for detecting RR = 1.5 versus RR = 3.0).[20]
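The permutation idea can be illustrated with a deliberately simplified sketch. It assumes the null-model fitted means are already available for each participant in the newly accrued batch, uses an unstandardized score-type contribution (the sum of exposure times residual) rather than the full generalized score statistic, and permutes exposure over the whole batch; all names and data are hypothetical.

```python
import random


def score_stat(d, y, mu):
    """Unstandardized score-type contribution: sum over i of D_i * (Y_i - mu_i)."""
    return sum(di * (yi - mi) for di, yi, mi in zip(d, y, mu))


def permuted_scores(d, y, mu, n_sim=1000, seed=0):
    """Null distribution of the statistic with outcomes and fitted means held fixed."""
    rng = random.Random(seed)
    d_perm = list(d)
    out = []
    for _ in range(n_sim):
        rng.shuffle(d_perm)          # permute exposure labels only
        out.append(score_stat(d_perm, y, mu))
    return out


if __name__ == "__main__":
    # Hypothetical batch of four participants with null-fitted means of 0.5.
    d, y, mu = [1, 1, 0, 0], [1, 1, 0, 0], [0.5] * 4
    observed = score_stat(d, y, mu)
    null_draws = permuted_scores(d, y, mu, n_sim=200, seed=1)
    print(observed, sum(s >= observed for s in null_draws) / len(null_draws))
```

Holding the outcomes and fitted means fixed while shuffling exposure is what ties the reference distribution to the null hypothesis of no exposure effect given the confounders in the mean model.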

The potential advantages of this approach compared with the other three approaches are that it may provide more flexible confounder control compared with GS LRT or CSSP, and it does not rely as heavily on the asymptotic assumptions as needed for the Lan–Demets error spending approach. However, a limitation to this approach, and any regression approach, is that it requires the first analysis to have enough events and observations to estimate the parameters of the mean regression model. This can be difficult for the extremely rare event case where the GS LRT or CSSP approaches may be preferable. As outlined by Nelson et al.,[13] it may be advantageous in safety surveillance to delay the first test of the data until an adequate amount of information has accrued, in which case, this method may be applicable in most commonly encountered situations. Furthermore, it requires more computational time than the well-defined asymptotically normal Lan–Demets error spending approach, so under the non-rare event case, the latter approach may be preferable for simplicity. Overall, all four approaches are applicable to the postmarket surveillance setting, and a brief summary of assumptions, limitations, and advantages is outlined in Table 1.

Table 1. Overview of the four statistical methods for sequential monitoring including potential advantages and limitations

| Method | Exposure setting | Confounding control | Test statistic | Sequential boundary formulation | Potential advantages | Potential limitations |
|---|---|---|---|---|---|---|
| GS LD | Single time or chronic exposure | All: matching, stratification, regression | Any standardized test statistic | Error spending boundary derived using a normal approximation | Easy to apply, flexible confounding control | In very rare event setting, or frequent testing, the normal approximation assumptions may not hold |
| GS LRT | Single time exposure | Matching with fixed matching ratio | LRT | Unifying boundary derived using permutation; potential to extend to error spending boundary | Matching provides an appealing interpretation | Information loss because of restricted sample; potential loss of exposed if matching criteria too strict or insufficient confounding control if criteria too loose |
| CSSP | Single time or chronic exposure | Stratification | Number of events for those on MPI | Error spending boundary derived by conditioning on number of events within strata | Works well for rare adverse events | May not maintain type I error when strata are small or if testing is frequent |
| GS EE | Single time or chronic exposure | Regression | Score statistic | Unifying boundary or error spending boundary derived using permutation | Flexible confounding control with few assumptions | Requires sufficient outcome data at first look to estimate the initial regression parameters |


In this section, we present a hypothetical sequential monitoring application where the question of interest is as follows: Does the new drug A (the MPI) have a higher rate of myocardial infarction (MI) compared with drug B? The data are from five sites, and the confounders are age, sex, and body mass index (BMI). For deidentification, age is categorized into 5-year categories and BMI into four categories: low (BMI < 18.5 kg/m²), normal (18.5 kg/m² ≤ BMI < 25 kg/m²), overweight (25 kg/m² ≤ BMI < 30 kg/m²), and obese (30 kg/m² ≤ BMI).

The surveillance evaluation is designed to sequentially monitor up to a total sample size of 10 000 participants assuming a flat, Pocock-style, boundary with the first analysis following accrual of 2500 participants and then analyses approximately every 417 participants (19 analyses) (Figure 1). This scenario is akin to a 2-year evaluation with constant accrual of 10 000 participants where the first analysis occurs after 180 days, and each subsequent analysis occurs monthly thereafter. For simplicity, the uptake of each drug is equal, and the expected percent with the event, MI, after 2 years is 5% overall. Table 2 shows an example of such a dataset.
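The analysis schedule follows directly from these numbers, as the short sketch below shows. Capping the final look at the planned total of 10 000 is an assumption made here so that the last analysis coincides with the end of the evaluation; the function name is hypothetical.

```python
def look_schedule(first_look: int = 2500, step: int = 417, total: int = 10_000) -> list:
    """Cumulative sample size at each planned analysis, capped at the total."""
    looks = [first_look]
    while looks[-1] < total:
        looks.append(min(looks[-1] + step, total))
    return looks


schedule = look_schedule()
print(len(schedule), schedule[0], schedule[-1])  # 19 looks, from 2500 to 10000
```

With constant accrual over 24 months, 10 000/24 ≈ 417 participants accrue per month, so the first look at 2500 participants corresponds to about 6 months and the 18 later looks to months 7 through 24.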

Figure 1. Sequential monitoring boundaries for a flat, Pocock-style, boundary with a sample size of 10 000 participants, with the first look after the first 2500 participants and then approximately every 417 participants (19 looks), using (a) GS LD and GS EE boundaries based on a standardized test statistic and (b) CSSP boundary based on the error spending approach

Table 2. The structure of the aggregated data available for analysis in data systems like the Sentinel System
  a. Ycum,s is the total cumulative events observed at and before look t within each stratum s.
  b. Yt,s is the events observed only at look t within each stratum s.
  c. Expcum,s is the total cumulative exposure time observed at and before look t within each stratum s.
  d. Expt,s is the exposure time observed only at look t within each stratum s.


We now apply three of the four methods discussed previously. We will not apply the GS LRT method because it is not applicable outside a single-time exposure setting. For the GS LD and GS EE methods, one uses the stratum-specific cumulative event data, $Y_{cum,s} = \sum_{t'=1}^{t} Y_{t',s}$, and exposure time data, $Exp_{cum,s} = \sum_{t'=1}^{t} Exp_{t',s}$, at each analysis (Table 2: Columns 8 and 10) and fits a Poisson regression model adjusting for age, sex, and BMI categories with log(Expcum,s) as an offset term. The GS LD method then calculates the standardized Wald statistic based on the adjusted RR and compares this with the normal approximation boundary developed by Lan–Demets. The GS EE method calculates the generalized score test statistic and compares this with the permutation-derived critical boundary. The CSSP approach uses the total number of events for those on drug A, $Y_{cum,D=A} = \sum_{s=1}^{S(t)} Y_{cum,s,D=A}$, as the test statistic, where S(t) is the total number of strata at analysis t, and calculates an analysis-specific p-value (i.e., the probability of observing this test statistic, or one more extreme, based on the simulated distribution under the null) and compares this p-value with a Pocock error spending boundary. Figure 1 shows the different boundary shapes for the three methods.
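Building the cumulative stratum-level inputs from per-look counts is a simple running sum. A minimal sketch, assuming each record carries a look index, a stratum label, and the events and exposure time accrued at that look only (the record layout and function name are hypothetical):

```python
from collections import defaultdict


def cumulative_by_stratum(records):
    """Map (look, stratum) to (cumulative events, cumulative exposure time).

    records: iterable of (look, stratum, events_at_look, exposure_at_look).
    A stratum with no record at a given look gets no entry for that look.
    """
    running = defaultdict(lambda: [0, 0.0])
    out = {}
    for look, stratum, events, exposure in sorted(records, key=lambda r: r[0]):
        running[stratum][0] += events
        running[stratum][1] += exposure
        out[(look, stratum)] = tuple(running[stratum])
    return out


# Hypothetical increments for two strata over two looks.
rows = [(1, "s1", 2, 100.0), (2, "s1", 1, 50.0), (1, "s2", 0, 80.0)]
print(cumulative_by_stratum(rows))
```

The cumulative totals feed the regression-based methods, whereas the CSSP simulation works from the per-look increments within each stratum.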

Given these boundaries, Tables 3a and 3b provide an example of the type of monitoring summary one would create for a sequential monitoring evaluation. For this fictitious data example, the actual RR was 2, and all three methods signaled at the second analysis, but results often vary in other data settings. In this case, all methods performed equally well, and there was an indication of an elevated rate of MI for those on drug A compared with drug B even after controlling for confounding.

Table 3a. Examples of monitoring data for the GS LD and GS EE methods when comparing observed test statistics with a standardized test statistic sequential boundary based on outcomes with prevalence 0.05 over the 2-year evaluation and confounding when the actual adjusted relative risk is 2

| Method | Look | Time (months) | Ycum (a) | Ycum,D=A (b) | Expcum (c) (person-days) | Expcum,D=A (d) (person-days) | RRAtoB (e) | TestStat (f) | Test statistic boundary (g) | Signal |
|---|---|---|---|---|---|---|---|---|---|---|
| GS LD | 1 | 6 | 73 | 53 | 193,373 | 96,634 | 1.76 | 2.09 | 2.10 | No |
| GS LD | 2 | 7 | 97 | 75 | 252,559 | 125,366 | 2.38 | 3.48 | 2.31 | Yes |
| GS LD | 3 | 8 | 116 | 88 | 314,716 | 155,774 | 2.12 | | | |
| GS LD | 4 | 9 | 143 | 107 | 379,954 | 187,629 | 2.09 | | | |
| GS LD | 19 | 24 | 514 | 379 | 1,454,836 | 703,747 | 2.06 | | | |
| GS EE | 1 | 6 | 73 | 53 | 193,373 | 96,634 | 1.76 | 2.16 | 2.28 | No |
| GS EE | 2 | 7 | 97 | 75 | 252,559 | 125,366 | 2.38 | 3.65 | 2.28 | Yes |
| GS EE | 3 | 8 | 116 | 88 | 314,716 | 155,774 | 2.12 | | | |
| GS EE | 4 | 9 | 143 | 107 | 379,954 | 187,629 | 2.09 | | | |
| GS EE | 19 | 24 | 514 | 379 | 1,454,836 | 703,747 | 2.06 | | | |

  a. Ycum is the total cumulative events observed at and before look t.
  b. Ycum,D = A is the total cumulative events observed at or before look t for those on drug A.
  c. Expcum is the total cumulative exposure time observed at and before look t.
  d. Expcum,D = A is the total cumulative exposure time observed at and before look t for those on drug A.
  e. RRAtoB is the adjusted relative risk (RR) comparing drug A to drug B at each look adjusting for site, age, sex, and BMI category.
  f. TestStat is the observed test statistic calculated at each look and is the Wald-based test for GS LD and score-based test for GS EE.
  g. Test statistic boundary is the critical boundary against which the test statistic is compared to determine whether a given look has signaled.
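
The signaling decisions in Table 3a can be reproduced with a small Python sketch that walks through the looks in order and stops at the first one whose test statistic exceeds its boundary. The function name `first_signal` and the choice of a strict inequality are assumptions made for illustration; the test statistics and boundaries are those reported in Table 3a.

```python
def first_signal(looks):
    """Return the first look whose test statistic exceeds its boundary.

    looks: list of (look, test_statistic, boundary) tuples in look order.
    Returns None if no look signals. The strict '>' is an assumption here;
    the source only says the statistic is compared with the boundary.
    """
    for look, test_stat, boundary in looks:
        if test_stat > boundary:
            return look  # monitoring stops at the first signal
    return None

# Test statistics and boundaries as reported in Table 3a.
gs_ld = [(1, 2.09, 2.10), (2, 3.48, 2.31)]
gs_ee = [(1, 2.16, 2.28), (2, 3.65, 2.28)]

ld_signal_look = first_signal(gs_ld)
ee_signal_look = first_signal(gs_ee)
```

Both methods signal at look 2, which is why the table reports no test statistics for later looks.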
Table 3b. Example of monitoring data for the conditional sequential sampling procedure method when comparing the estimated probability of observing number of observed outcomes in Drug group A with an error spending sequential monitoring boundary based on outcomes with prevalence of 0.05 over the 2-year evaluation and confounding

| Look | Time (months) | Ycum (a) | Ycum,D=A (b) | Expcum (c) (person-days) | Expcum,D=A (d) (person-days) | Look p-value (e) | Error spending boundary (f) | Signal |
|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 73 | 53 | 193,373 | 96,634 | 0.020 | 0.017 | No |
| 2 | 7 | 97 | 75 | 252,559 | 125,366 | 0.012 | 0.020 | Yes |
| 3 | 8 | 116 | 88 | 314,716 | 155,774 | | | |
| 4 | 9 | 143 | 107 | 379,954 | 187,629 | | | |
| 19 | 24 | 514 | 379 | 1,454,836 | 703,747 | | | |

  a. Ycum is the total cumulative events observed at and before look t.
  b. Ycum,D = A is the total cumulative events observed at or before look t for those on drug A.
  c. Expcum is the total cumulative exposure time observed at and before look t.
  d. Expcum,D = A is the total cumulative exposure time observed at and before look t for those on drug A.
  e. Look p-value is the cumulative probability of observing Ycum,D = A or something more extreme at or before look t.
  f. Error spending boundary is the amount of cumulative alpha one specifies to spend at a given look. Given the error spending boundary, one computes the current p-value at each look, and if that current p-value is less than the error spending boundary, then the given look has signaled.
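
The p-value comparison described in footnote f can be sketched in Python alongside a Pocock-type error spending function. The spending function shown is the standard Lan–DeMets Pocock-type form, alpha × ln(1 + (e − 1)t), where t is the fraction of total information observed. The overall alpha of 0.05 and the use of sample size as the information scale are assumptions made for illustration; this section does not state which values were used, so the computed values need not match the boundary column of Table 3b.

```python
import math

def pocock_type_spending(info_fraction, alpha=0.05):
    """Cumulative alpha spent at a given information fraction.

    Pocock-type spending function: alpha * ln(1 + (e - 1) * t).
    alpha=0.05 is an assumed default, not a value taken from the source.
    """
    if not 0.0 <= info_fraction <= 1.0:
        raise ValueError("info_fraction must lie in [0, 1]")
    return alpha * math.log(1.0 + (math.e - 1.0) * info_fraction)

def look_signals(p_value, boundary):
    """A look signals when its p-value is less than the boundary."""
    return p_value < boundary

# Cumulative alpha at the first look, assuming 2,500 of 10,000 participants.
alpha_first_look = pocock_type_spending(2500 / 10000)

# Look p-values and boundaries as reported in Table 3b.
table_3b = [(1, 0.020, 0.017), (2, 0.012, 0.020)]
decisions = [(look, look_signals(p, b)) for look, p, b in table_3b]
```

With the reported values, look 1 does not signal and look 2 does, matching the Signal column of the table.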


We have presented four different group sequential monitoring approaches that are applicable to active postmarket surveillance of administrative and claims observational data. The theoretical underpinnings of each method have been described and illustrated using a hypothetical application. A formal evaluation of these four approaches still needs to be conducted to assess important statistical properties, such as delineation of scenarios in which a given method is appropriate (i.e., maintains the overall type I error and controls for confounding) or outperforms other methods. Performance is often quantified by the probability of signaling when a true signal exists (power) or by how quickly a method detects a signal (time to detection), both of which are clearly important quantities in safety surveillance.

There are still other methodological issues that need to be addressed. Open questions include developing better approaches to handle distributed data sources with more nuanced confounding control, extensions to the survival context for rare adverse events, and controlling for provider or facility effects. Therefore, the statistical methods presented represent a first step toward a general methodology appropriate for the signal refinement surveillance setting.


The authors declare no conflict of interest.


  • Active postmarket surveillance of pre-defined outcomes requires sequential monitoring approaches that control the overall type I error, or false-positive rate, because of multiple testing over time.
  • There are numerous sequential monitoring methods that can be applied, and these approaches differ based on the test statistic of interest, how the approach controls for confounding (stratification, matching, or regression), and how the approach derives the sequential monitoring boundary.
  • Postmarket surveillance differs in many ways from the randomized controlled trial setting, in which most sequential monitoring methods have been developed. Key differences include the observational cohort design, which creates a need for confounding control; more frequent testing, because data are available more rapidly; and a frequent interest in rare adverse events.
  • The four approaches presented in this manuscript are the current statistical approaches being applied to the postmarket surveillance setting, with the appropriateness of a given approach depending upon the strength of confounding control needed, the frequency of testing desired, and how rare the adverse event of interest is.


Mini-Sentinel is funded by the Food and Drug Administration (FDA) through the Department of Health and Human Services (HHS) Contract Number HHSF223200910006I. The views expressed in this article do not necessarily represent those of the Food and Drug Administration.