Challenges in the design and analysis of sequentially monitored postmarket safety surveillance evaluations using electronic observational health care data


J. C. Nelson, Biostatistics Unit, Group Health Research Institute, 1730 Minor Avenue, Suite 1600, Seattle, WA 98101, USA. E-mail:



Many challenges arise when conducting a sequentially monitored medical product safety surveillance evaluation using observational electronic data captured during routine care. We review existing sequential approaches for potential use in this setting, including a continuous sequential testing method that has been utilized within the Vaccine Safety Datalink (VSD) and group sequential methods, which are used widely in randomized clinical trials.


Using both simulated data and preliminary data from an ongoing VSD evaluation, we discuss key sequential design considerations, including sample size and duration of surveillance, shape of the signaling threshold, and frequency of interim testing.

Results and Conclusions

All designs control the overall Type 1 error rate across all tests performed, but each yields different tradeoffs between the probability and timing of true and false positive signals. Designs tailored to monitor efficacy outcomes in clinical trials have been well studied, but less consideration has been given to optimizing design choices for observational safety settings, where the hypotheses, population, prevalence and severity of the outcomes, implications of signaling, and costs of false positive and negative findings are very different. Analytic challenges include confounding, missing and partially accrued data, high misclassification rates for outcomes presumptively defined using diagnostic codes, and unpredictable changes in dynamically accessed data over time (e.g., differential product uptake). Many of these factors influence the variability of the adverse events under evaluation and, in turn, the probability of committing a Type 1 error. Thus, to ensure proper Type 1 error control, planned sequential thresholds should be adjusted over time to account for these issues. Copyright © 2012 John Wiley & Sons, Ltd.


To bolster postmarket safety surveillance for regulated medical products, innovative systems are being developed to monitor electronic health care records that are routinely collected by insurance plans. Such systems, which include the Vaccine Safety Datalink (VSD) project of the Centers for Disease Control and Prevention (CDC)[1-3] and the Sentinel System of the US Food and Drug Administration (FDA),[4] involve capturing and prospectively analyzing data as they accrue across multiple health plan populations. Thus, they offer promise to provide an active safety monitoring framework that is rapid, statistically powerful, and cost-effective. To date, the primary goal of systems such as the VSD and the Sentinel System has been to evaluate previously suggested safety signals (e.g., those identified during premarket trials or from spontaneous reports) and potentially prompt a more formal confirmatory study. Consequently, they can provide a complementary middle ground between traditional passive reporting databases, which are often used to generate hypotheses by searching for associations among a large number of product–event pairs, and Phase IV observational studies and randomized trials, which provide more in-depth investigations of a single hypothesis using more accurate information and thus are designed to yield more definitive results.

Many statistical approaches have been used to evaluate postmarket safety in this new setting, and these methods have recently been reviewed.[5] One particularly promising approach is sequential testing. The VSD, for example, has routinely used a continuous maximized sequential probability ratio test (MaxSPRT)[6] to monitor the safety of new vaccines in the USA since about 2005.[7-13] Use of this method successfully detected an increased risk of seizure after receipt of the new combination measles–mumps–rubella–varicella (MMRV) vaccine compared with separate injections of measles–mumps–rubella and varicella vaccines and eventually led to national policy changes for MMRV vaccine use.[12] Near-continuous testing using MaxSPRT is just one of many possible sequential method options, however. Other approaches, including classic group sequential methods from clinical trials,[14, 15] which have generally not been used in observational studies, and a newer tool designed specifically for observational safety surveillance,[16] are reviewed in this issue by Cook et al.[17]

Group sequential methods are especially appealing to consider for use in postmarket observational safety surveillance given their established use in randomized trials where it has become customary to repeatedly monitor accruing data over time in order to identify a new efficacious treatment as soon as possible and minimize patient exposure to suboptimal treatments. To facilitate early study termination that balances the scientific and ethical issues, a carefully chosen sequential stopping rule that statistically accounts for the repeated testing and sampling over time is commonly used.[14, 15] In trials, group sequential stopping rule selection is often a collaborative process conducted by study sponsors and a Data Safety Monitoring Board. It involves several choices, including specification of the desired frequency of interim testing and of the critical value that corresponds to rejection of the null hypothesis, and rules customized to monitor efficacy endpoints in clinical trials have been well studied. Numerous challenges arise when selecting and applying group sequential methods in observational settings designed to detect safety signals, as the scientific aims and data settings are very different.

The main objectives of this article were to (i) define the key characteristics of the newly emerging electronic safety surveillance systems, including standard surveillance aims, data, and sequential methods used, and (ii) describe and discuss ways to address the design and analysis challenges that arise when applying sequential approaches, in particular, when adapting trial-based group sequential methods, in this setting. To concretely exemplify these issues, we provide illustrations using simulated data and preliminary data from an ongoing VSD evaluation.


We first explicitly define the safety surveillance setting, including the surveillance aims, electronic health care data, and sequential methods that have been proposed or previously applied in existing systems. To illustrate, we describe an ongoing sequentially monitored VSD evaluation of a combination diphtheria and tetanus toxoids and acellular pertussis, inactivated poliovirus, and Haemophilus b conjugate (DTaP-IPV-Hib) vaccine (trade name Pentacel; Sanofi Pasteur, Swiftwater, Pennsylvania).[18]

Surveillance aims

A comprehensive medical product safety surveillance evaluation should be equipped both to address hypotheses about prespecified product–event pairs of prior concern (signal refinement) and to detect unanticipated adverse events not suspected in advance (signal generation). Emerging electronic database systems are capable of both but thus far have focused on targeted monitoring of specific pairs, typically 5–10 events for a given product. For example, shortly after the licensure of a DTaP-IPV-Hib vaccine in 2008, a VSD evaluation began that monitored a cohort of children 6 weeks to 3 years of age for the occurrence of seven prespecified events, including serious allergic reactions, seizure, and medically attended fever. Targeted surveillance like this is inherently more powerful than large-scale data mining of many (e.g., hundreds or thousands) products and events due to the reduction of the multiple testing problem. Further, it allows focus on the most important (from a risk–benefit stance) and reliably measured products and events.

Electronic data

Several ongoing efforts, including the VSD,[7-13] the HMO Research Network's Center for Education and Research on Therapeutics,[19, 20] and the Mini-Sentinel pilot program,[4] have used or proposed to use electronic health care data to monitor medical product safety. To evaluate a particular safety hypothesis, the following information is typically captured: health plan enrollment, demographics, medication prescriptions or administration of vaccines, and diagnosis codes assigned to outpatient, emergency department, and inpatient medical encounters. Adverse events and potentially confounding chronic conditions are presumptively defined using the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes. Medical product exposures and timing or duration of use are defined using immunization dates and types (for vaccines) or pharmacy dispensing dates, national drug codes, units dispensed, days supplied, and generic names (for drugs). Exposures may also be identified using Healthcare Common Procedure Coding System and Current Procedural Terminology codes. To facilitate analyses across multiple health plan populations, data elements are commonly defined using a standardized data dictionary known as a common data model. To protect patient privacy, a distributed data mechanism is often employed, whereby individual-level patient information is retained at the local health plans and only summarized data are combined centrally for analysis. In the DTaP-IPV-Hib evaluation, seven VSD medical care organizations (MCOs) contributed commonly defined event, vaccine, and confounder data, and the centralized dataset contained aggregated counts of events within predefined categorical vaccine exposure and confounder strata (Table 1).
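The aggregated, distributed structure described above can be sketched in a few lines. The field names and strata below are hypothetical simplifications of a common data model; the point is that only stratum-level counts, never individual records, leave each site.

```python
def aggregate_weekly(records):
    """Collapse individual-level records into stratum-level counts, as a
    site might do before sharing data centrally under a distributed data
    model. Each stratum key is (week, site, age group, gender, vaccine);
    the value is (number of doses, number of fever events)."""
    counts = {}
    for r in records:
        key = (r["week"], r["site"], r["age_group"], r["gender"], r["vaccine"])
        doses, fevers = counts.get(key, (0, 0))
        counts[key] = (doses + 1, fevers + r["fever"])
    return counts
```

Only the resulting dictionary of counts would be transmitted for centralized analysis; the individual records stay behind the health plan's firewall.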

Table 1. Structure of aggregated weekly data from the VSD DTaP-IPV-Hib safety evaluation includes number of vaccines and adverse events for each stratum defined by age group, gender, and site
Columns: Week; MCO site; Age in weeks; Gender; Number of DTaP-IPV-Hib vaccines; Number of DTaP vaccines; Number of days from vaccine to adverse event; Number of fever events following DTaP-IPV-Hib; Number of seizures following DTaP-IPV-Hib; Number of fever events following DTaP; Number of seizures following DTaP
  1. VSD, Vaccine Safety Datalink; DTaP-IPV-Hib, diphtheria and tetanus toxoids and acellular pertussis, inactivated poliovirus, and Haemophilus b conjugate; MCO, medical care organization.


Sequential testing approach

The VSD has pioneered the use of sequential methods to evaluate safety questions in an electronic health care database setting. We briefly review that sequential approach[6] as a foundation for discussing other sequential methods. In the VSD, commonly defined event, vaccine, and confounder data are updated at each MCO site each week. Since about 2005, the VSD has conducted weekly sequential monitoring for many newly licensed vaccines[7-13] using the MaxSPRT method,[6] which involves the following steps: (i) Using the relative risk (RR), specify null (H0: RR = 1) and alternative (HA: RR > 1) hypotheses that compare event risk between new product users and an appropriate comparator group. (ii) Each week, conduct a likelihood ratio test (LRT) by computing a log likelihood ratio (LLR) that weighs the evidence for H0 versus HA and compare it with a predefined signaling threshold (C), computed either analytically[6] or via simulation[21] and designed to maintain a prespecified Type 1 (false positive) error rate across all weekly tests performed. (iii) At each test, a prespecified decision rule is applied: (a) If LLR ≥ C, testing stops and H0 is rejected; (b) if LLR < C and the prespecified end of evaluation has been reached, testing stops and we fail to reject H0; (c) otherwise, if LLR < C, testing continues. Notable features of this sequential approach are the choices of testing frequency (weekly), level of the critical value over time that corresponds to rejection of the null hypothesis (constant or flat signaling threshold shape on the scale of the LLR), and the statistical test (LRT).
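The weekly decision rule above can be sketched as follows, using the Poisson form of the MaxSPRT statistic (observed events compared with the count expected under H0). The counts and the threshold value are illustrative, not values from any VSD evaluation; in practice C is computed analytically or via simulation to control the overall Type 1 error.

```python
import math

def maxsprt_llr(observed, expected):
    """Poisson MaxSPRT log likelihood ratio (LLR) comparing the observed
    adverse event count with the count expected under H0 (RR = 1). The
    alternative is maximized over RR > 1, so the LLR is zero whenever
    the observed count does not exceed the expected count."""
    if observed <= expected:
        return 0.0
    return observed * math.log(observed / expected) - (observed - expected)

def weekly_test(observed, expected, threshold, final_week, week):
    """Apply the prespecified sequential decision rule at one test."""
    llr = maxsprt_llr(observed, expected)
    if llr >= threshold:
        return "signal"       # (a) reject H0, stop surveillance
    if week >= final_week:
        return "no signal"    # (b) end of evaluation, fail to reject H0
    return "continue"         # (c) keep monitoring
```

For example, with 10 observed events against 4.0 expected, the LLR is about 3.16, which would cross an illustrative threshold of C = 3.0 and trigger a signal.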

The VSD experience with MaxSPRT provides an excellent basis for broadly considering other sequential methods in an observational safety surveillance setting. Exploring different sequential design and analysis options is important because differing approaches yield different tradeoffs between the probability and timing of true and false positive signals.[21] In the DTaP-IPV-Hib evaluation, for instance, the primary aim was to determine if the RR of each targeted event was elevated for DTaP-IPV-Hib recipients versus unexposed historical comparators who received DTaP vaccine in the prior 4 years. Concurrent comparator data were also captured for secondary analyses. To address this hypothesis, both near-continuous testing using MaxSPRT and several trial-based group sequential designs were considered. However, and not surprisingly, many challenges were encountered when adapting trial-based methods to an observational safety setting due to major differences between a randomized controlled experiment and an observational environment where data are not collected for scientific purposes (Table 2). In the next sections, we describe these challenges in more detail.

Table 2. Key differences between a randomized trial and observational database surveillance that raise challenges for the adaptation of trial-based sequential methods in the observational database setting
Characteristic | Randomized trial | Observational surveillance
Outcome of interest | Efficacy; accurately measured | Adverse event, which may be rare; can be subject to considerable misclassification
Number of tests (frequency) | Typically few (quarterly, twice annually) | Has been proposed to be many (continuous, weekly, and monthly)
Accrual of participants | Steady accrual of random sample; possible to control | Depends on product uptake: unpredictable; may vary in rate and population composition over time
Confounding | Largely mitigated by randomization | Often present; may be substantial; may vary over time
Evidence and decision at the end of the study or evaluation | Confirmatory evidence; signal means study stops; regulatory action may be taken (e.g. product may not be approved) | Less confirmatory, suggestive evidence; signal means further evaluation, potentially within the same population and dataset


Designing a sequentially monitored medical product safety surveillance evaluation involves many of the same choices as those encountered in a conventional study (or evaluation) with a single analysis at the study's end. In this issue, Gagne et al. described a framework for making standard design choices for safety evaluations, including how to select an appropriate evaluation design (e.g., cohort or self-controlled) given the characteristics of the event, product, and likely confounders.[22] Use of sequential monitoring, which involves testing at multiple time points during the evaluation, raises several additional design questions: (i) How frequently should testing be conducted? (ii) At what level should the signaling threshold be set at each test? (iii) Should the duration of surveillance be defined using calendar time or sample size (e.g., number of accrued new doses)? (iv) Should answers to these questions vary based on characteristics of the product and event? Such choices determine what are often referred to as the operating or performance characteristics of the sequential design: the probability of rejecting the null hypothesis when it is true (i.e., false positive or Type 1 error), the probability of detecting a true signal (i.e., statistical power), the expected time until a signal is detected, and the expected length of the surveillance (whether a signal is detected or not). They are thus important regardless of the setting (e.g., premarket trial or postmarket surveillance) or the study or evaluation design.

Sequential design decisions should be guided by the scientific and clinical goals as well as the ethical concerns relevant to the evaluation or study question and setting.[21] Designs tailored to monitor efficacy outcomes in randomized clinical trials have been widely considered. In observational safety evaluations, however, the primary aims, population, prevalence and severity of the outcomes, implications of signaling, and the costs of false positives and negative findings are different, and less consideration has been given to assessing what sequential designs are optimal in light of these differences. In the next sections, we consider several key sequential design choices and challenges in the context of observational safety surveillance and illustrate ways to navigate these options.

Sample size and duration of surveillance

As in a clinical trial, the planned maximum sample size for a sequentially monitored safety surveillance evaluation ideally should be driven by the statistical power one desires to detect a clinically important difference in adverse event risk (relative to the estimated benefit) between new product users and relevant comparators. Thus, sample size rather than calendar time is the most relevant timescale by which to plan surveillance duration. Unlike in a trial, however, where it is possible to control and carry out enrollment within a planned time frame, the accrual of subjects in an observational surveillance setting depends on the often unpredictable rate of uptake of the new product in the population under evaluation. Therefore, it may be considerably more difficult to know in advance how long surveillance will need to continue to accrue a prespecified sample size designed to achieve statistical power requirements. To deal with this challenge, it may be helpful to wait for a small amount of product use to occur so more is known about the uptake rate, which could in turn facilitate better estimation of the length of surveillance. If, based on preliminary uptake data, the estimated duration of surveillance is too long to be feasible, then other strategies to meet sample size needs could be considered at the evaluation outset, such as adding an additional health care data source to increase the size of the population.
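As a toy illustration of planning on the sample-size timescale, surveillance duration can be projected from uptake observed during a short run-in period; all numbers here are hypothetical.

```python
def weeks_to_target(target_doses, weekly_uptake):
    """Project the total weeks of surveillance needed to accrue a target
    number of doses, extrapolating the average weekly uptake observed
    during an initial run-in period (weekly_uptake is a list of dose
    counts, one per run-in week)."""
    observed = sum(weekly_uptake)
    if observed >= target_doses:
        return len(weekly_uptake)  # target already met during run-in
    avg = observed / len(weekly_uptake)
    return len(weekly_uptake) + (target_doses - observed) / avg
```

If the projected duration is infeasibly long, strategies such as adding another data source could be considered at the outset, as noted above.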

Alternatively, one could specify in advance that a specific period of surveillance based on calendar time (e.g., 1–2 years) is desired. However, because the uptake rate and thus the sample size at the end of a prespecified evaluation period are likely uncertain in advance, defining duration of surveillance solely based on calendar time will not guarantee that adequate power will be achieved at the evaluation's end. To increase power post hoc, one could simply decide at the end of the evaluation to continue to follow the population for a longer period to accrue additional new product users, but this would be at the expense of increased Type 1 error.

Shape of the signaling threshold

Another important sequential design feature is the choice of the signaling threshold shape. The MaxSPRT method[6] uses a flat upper signaling threshold on the scale of the LLR so that the null hypothesis is rejected when the observed LLR reaches a prespecified fixed value. Surveillance using MaxSPRT ends without rejecting the null hypothesis when a prespecified length of surveillance is reached. A flat threshold on the scale of a standardized test statistic like this is sometimes referred to as a Pocock boundary in clinical trials.[23] However, a wide variety of other boundaries, including triangular (linear decreasing) rejection regions[24] and nonlinear decreasing boundaries like that of O'Brien and Fleming,[25] and boundary families[26, 27] have been used in clinical trials and could also be applied in observational sequential safety surveillance.

Figure 1 shows the characteristic shapes of Pocock and O'Brien–Fleming signaling thresholds on the LLR scale, computed using simulated data. In contrast to the historically controlled DTaP-IPV-Hib evaluation, here, simulated data were generated as a prospective cohort of exposed and concurrently matched unexposed subjects (N = 5000 matched pairs). In each of 100 000 simulated datasets, exposed and unexposed individuals were assumed to enter evenly over a 1-year evaluation period. A fixed number of events was generated and randomly assigned to exposed and unexposed subjects according to binomial probabilities pe and pu, respectively. The unexposed event prevalence (pu = 0.01, 0.02, 0.1, 0.2), true RR (pe / pu = 1.2, 1.5, 2.0), threshold shape (Pocock, O'Brien–Fleming), and testing frequency (daily, weekly, monthly, quarterly) were varied. For each scenario, the threshold that maintained a 0.05 overall Type 1 error rate, power, and time to correct detection of an elevated event risk among exposed subjects were computed via simulation.
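A minimal version of this kind of threshold search can be sketched as follows. It uses the matched-pair binomial model (under H0, each event is equally likely to fall in the exposed or unexposed arm) and a flat, Pocock-style threshold on the LLR scale; the event counts, number of replicates, and seed are illustrative, and a production computation would use far more replicates.

```python
import math
import random

def flat_threshold(n_events_per_test, n_tests, alpha=0.05,
                   n_sim=2000, seed=1):
    """Monte Carlo estimate of a flat (Pocock-style) signaling threshold
    on the LLR scale. Under H0 the cumulative exposed event count is
    Binomial(total, 0.5); the threshold is the (1 - alpha) quantile of
    the maximum LLR across all planned interim tests."""
    rng = random.Random(seed)
    max_llrs = []
    for _ in range(n_sim):
        exposed = total = 0
        worst = 0.0
        for _ in range(n_tests):
            exposed += sum(rng.random() < 0.5 for _ in range(n_events_per_test))
            total += n_events_per_test
            p_hat = exposed / total
            if p_hat > 0.5:  # one-sided: only elevated risk can signal
                q = 1.0 - p_hat
                llr = total * p_hat * math.log(2.0 * p_hat)
                if q > 0.0:
                    llr += total * q * math.log(2.0 * q)
                worst = max(worst, llr)
        max_llrs.append(worst)
    max_llrs.sort()
    return max_llrs[int((1 - alpha) * n_sim)]
```

As expected, the threshold that controls the overall Type 1 error grows with the number of interim tests, since more looks at the data give more opportunities for a false signal.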

Figure 1.

Pocock and O'Brien–Fleming signaling thresholds estimated via simulation to control Type 1 error (0.05) under a variety of testing frequency scenarios (daily, weekly, monthly, and quarterly) in a simulated cohort of 5000 exposure-matched pairs over a 1-year evaluation period with event prevalence among unexposed subjects of 0.01

When overall power is of primary importance, thresholds like that of O'Brien and Fleming are appealing as they are high and spend little Type 1 error early (Figure 1), thus reducing the likelihood of a false signal in early analyses and conserving power for later analyses when more data have accrued (Table 3). However, when timeliness is most important, using a lower threshold earlier that spends Type 1 error more evenly throughout the study or evaluation may be preferred. For postmarket surveillance compared with premarket trials, it could be argued that a higher priority should be placed on early signaling because the new product is in wide general use and there is potential to prevent many adverse events if signaling occurs even just a few weeks sooner. Similarly, we might want to signal sooner (with risk of error) for a serious adverse event as opposed to a less serious one. Conversely, one could argue that a primary postmarket aim is to detect events that premarket trials were not powered to find, in which case, overall power is most important. As in clinical trials[21], the specific trade-off between power and timeliness, which in turn dictates the optimal shape of the signaling threshold, should depend on the product–event pair and be examined carefully at the outset of the evaluation.

Table 3. Estimated power for Pocock and O'Brien–Fleming signaling thresholds under a variety of testing frequency scenarios (daily, weekly, monthly, and quarterly) in a simulated cohort of 5000 exposure-matched pairs over a 1-year evaluation period with varying event prevalence among unexposed subjects (0.01, 0.02, 0.1, and 0.2)
  1. Scenarios in bold font are those with power close to the typically desired range.

True RR | 1.2 | 1.5 | 2 | 1.2 | 1.5 | 2 (first three columns: Pocock threshold; last three: O'Brien–Fleming threshold)
pu =0.01      
pu =0.02      
pu =0.1      
pu =0.2      

Frequency of sequential testing

The frequency with which sequential testing is performed also impacts power and signal detection timeliness. For a fixed maximum sample size, power is maximized with one test at the end of the evaluation. Multiple testing reduces power (Table 3) but can also decrease the average time to signal detection. When timeliness is of highest priority, continuous testing methods like MaxSPRT, which are designed to preserve Type 1 error when data are analyzed after each newly exposed participant (or matched set) accrues, are appropriate. Because its published threshold values[6] make it easy to apply, MaxSPRT may also be appealing even when data are analyzed at near-continuous discrete time points (e.g., weekly). However, use of continuous thresholds when testing is not strictly continuous is conservative (i.e., Type 1 error will be lower than planned). The magnitude of the conservatism increases with the amount of new data received between discrete testing points and with increasing adverse event prevalence.[5]

Alternatively, an investigator may prioritize overall power and thus prefer to test less frequently. For example, Table 3 shows that when using a Pocock threshold, power can be increased by about 10% when testing quarterly compared with daily, and the loss in detection timeliness is relatively small (8–25 days, depending on pe). O'Brien–Fleming thresholds are less influenced by testing frequency (Figure 1). If less frequent testing is desired, then group sequential methods, which control the overall Type 1 error rate for a specific number and frequency of tests, should be used. In addition, group sequential designs offer considerable flexibility in that the testing frequency need not be constant over time but can vary and be customized to meet specific scientific goals.[27, 28]
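Power under a given flat threshold and testing schedule can be estimated by simulation in the same spirit. The sketch below again uses the matched-pair binomial model, in which an event falls in the exposed arm with probability RR / (1 + RR) under a true relative risk RR; all numeric inputs are illustrative.

```python
import math
import random

def sequential_power(n_events_per_test, n_tests, threshold, rr,
                     n_sim=2000, seed=2):
    """Estimate the power of a flat LLR threshold by simulating
    matched-pair surveillance under a true relative risk rr. A
    simulation run counts as a detection the first time the cumulative
    LLR crosses the threshold at any interim test."""
    p_exposed = rr / (1.0 + rr)
    rng = random.Random(seed)
    detections = 0
    for _ in range(n_sim):
        exposed = total = 0
        for _ in range(n_tests):
            exposed += sum(rng.random() < p_exposed
                           for _ in range(n_events_per_test))
            total += n_events_per_test
            p_hat = exposed / total
            if p_hat > 0.5:
                q = 1.0 - p_hat
                llr = total * p_hat * math.log(2.0 * p_hat)
                if q > 0.0:
                    llr += total * q * math.log(2.0 * q)
                if llr >= threshold:
                    detections += 1
                    break
        # else: run ends with no signal
    return detections / n_sim
```

Comparing schedules (e.g., many small tests versus a few large ones, each with its own recomputed threshold) makes the power–timeliness trade-off discussed above concrete.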

In addition to scientific factors, testing frequency may also be affected by logistical constraints, including how frequently data can be updated from the data sources. Another potential administrative constraint relates to the testing frequency time scale. As for surveillance duration, the frequency of interim testing can be defined either using calendar time (e.g., quarterly) or accrued sample size (e.g., every 1000 new users). The latter is attractive scientifically as frequency is directly based on a desired amount of newly accrued information, thus ensuring that it will be appropriate to spend Type 1 error at each new test. However, logistically, it may be necessary to schedule tests based on calendar time, in which case the amount of new information, and thus the power, at each test likely will not be known with certainty in advance.

Sequential design for the VSD DTaP-IPV-Hib safety evaluation

In this section, we describe the rationale for and the selection of the DTaP-IPV-Hib safety sequential design. Because DTaP-IPV-Hib was well tolerated in prelicensure studies with few serious adverse events,[29, 30] the first sequential test was not performed until 1 year after DTaP-IPV-Hib uptake began within the VSD (at 33 000 DTaP-IPV-Hib doses). Delaying the first test was also deemed important (i) to allow for some initial uptake to improve sample size and surveillance duration estimates, (ii) to minimize early false positive signals when test statistics are less stable due to small samples, and (iii) to increase power at later tests when sample sizes are larger than were possible in prelicensure trials, thus maximizing the value added by the postmarket DTaP-IPV-Hib evaluation to existing evidence.

After the first test, 11 equally spaced (based on accrued doses) LRT tests and a flat signaling threshold on the scale of the LLR were planned. The between-test sample size for the 11 subsequent tests was computed using simulation to achieve 80% power to detect an RR = 2 or less for each prespecified event, and thus it varied by event prevalence (e.g., every 3500 doses for fever, which is more common; every 10 500 doses for serious allergic reactions, which are rarer). This, in turn, led to variation in the planned total sample size and surveillance duration for different events (e.g., 71 500 doses for fever, 148 500 doses for serious allergic reactions). Although it would have been administratively simpler to have the same testing schedule for all events, it was deemed important to monitor rarer events less frequently and for a longer period of time to accumulate an adequate amount of information between tests and to have enough total power to address these hypotheses. Further, some events (e.g., Guillain-Barré syndrome) were considered too rare to warrant formal testing at all (i.e., events with an expected rate within the VSD population of <0.05/10 000 doses) and were simply monitored descriptively each week for the occurrence of any cases.

In addition to this chosen “delayed start” design, two other options that used a flat LLR threshold were considered: (i) 12 evenly spaced tests with no delayed start and (ii) continuous monitoring (i.e., MaxSPRT). Each design induces a different testing schedule and signaling threshold, which in turn results in different implications for signal detection power and timeliness. For example, for the fever outcome, the LLR threshold that maintained an overall Type 1 error rate of 0.05 was 2.15 for the chosen delayed start design, 2.67 for the evenly spaced testing design with no delayed start, and 4.15 for MaxSPRT.


In addition to sequential design concerns, numerous analytic challenges arise when sequentially monitoring product safety using observational electronic health care data. Many difficulties derive from the lack of a controlled experimental setting, including (i) confounding, (ii) missing and partially accrued (e.g., late-arriving claims) data, (iii) high misclassification rates for presumptive outcomes defined using ICD-9-CM codes, and (iv) unpredictable variation or changes in the data over time (e.g., variable product uptake rate). Other difficulties are due to unique features of postmarket safety surveillance questions and the data environment, such as (i) safety outcomes may be rare and (ii) the distributed data environment prevents pooling of individual-level data, which can limit analytic options. Some challenges and methods to address them have been described elsewhere, including confounding[17] and missing or partially accrued data.[31] Other issues, such as outcome misclassification, are a recognized major source of bias and false positive signals but have received less attention. A full discussion of each problem is beyond the scope of this article. Thus, we instead highlight two issues that relate most directly to the sequential analytic aspects.

Unpredictable changes in electronic data over time

Unlike in a clinical trial, where participant accrual can be relatively steady and controllable, the accrual of individuals in observational safety surveillance depends on the rate of product uptake in the population under evaluation, which may be variable, unpredictable, and differential by confounding factors. In the DTaP-IPV-Hib VSD evaluation, for instance, uptake was initially slow, increased over time, and varied widely by potential confounders like MCO site, from rapid (MCO 1) to slow and steady (MCO 2) to delayed (MCO 3) (Figure 2). One consequence of this is that the spacing between sequential tests, when based on the number of doses, may not be executable as planned. Figure 3 shows the planned between-test sample sizes (every 3500 doses) for fever and seizure versus those that actually occurred in practice for each sequential test. Of note is that the sixth planned test was skipped when an unexpectedly large bolus of new DTaP-IPV-Hib data was received from a single site after a vaccine coding error was corrected.

Figure 2.

Differential uptake of DTaP-IPV-Hib vaccine (trade name Pentacel) within the VSD population at selected MCO sites. DTaP-IPV-Hib vaccine uptake is shown as a percentage of the total number of vaccines received during the evaluation period (i.e., DTaP-IPV-Hib vaccines among exposed plus DTaP vaccines among concurrent unexposed comparators)

Figure 3.

Planned sample sizes for each of the 12 sequential tests for fever and seizure outcomes versus actual (observed) sample sizes at each test in practice for the DTaP-IPV-Hib VSD evaluation

In addition, when electronic health care databases are dynamically accessed in near-real time, unpredictable changes can also occur in data at the individual level. For instance, subject-level exposure and event data can “arrive late” into health plan data systems when care is received at outside hospitals not affiliated with the health plan. Further, if data are accessed for safety monitoring more often (e.g., monthly) than the health plan updates its enrollment data for administrative purposes (e.g., annually), then when enrollment data are periodically updated, both the identification of newly enrolled subjects and the “disappearance” of existing subjects and events can occur. The impact of these changes on safety analyses depends on the magnitude of the data changes and the method for handling them. For instance, at each new test, one could cumulatively refresh all data since the evaluation start or one could freeze all previously analyzed data and only append newly accrued data since the last test. In the DTaP-IPV-Hib evaluation at one example interim test, these two methods led to non-negligible differences in the total number of doses (66 400 vs. 68 826) and fevers (343 vs. 335).
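The two data-handling strategies can be contrasted with a toy example in which records are keyed by an identifier and a later batch corrects a record from an earlier batch; the keys and values are hypothetical.

```python
def refresh_all(batches):
    """Cumulatively refresh: rebuild the analysis dataset from every
    batch received so far, so late corrections to earlier weeks are
    picked up (the latest version of a record wins)."""
    merged = {}
    for batch in batches:
        merged.update(batch)
    return merged

def freeze_and_append(batches):
    """Freeze previously analyzed data and only append records not seen
    at earlier tests; later corrections are ignored (the first version
    of a record wins)."""
    merged = {}
    for batch in batches:
        for key, record in batch.items():
            merged.setdefault(key, record)
    return merged
```

With the same incoming batches, the two strategies can yield different analysis datasets at a given interim test, mirroring the dose and fever count discrepancies observed in the DTaP-IPV-Hib evaluation.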

Collectively, these unpredictable changes in the data over time imply changes in the variability of the events under evaluation, which, in turn, impact the probability of committing a Type 1 error (which investigators want to control). Thus, to correctly maintain the desired overall Type 1 error rate, it is important to account for these changes by recomputing the signaling thresholds at each test based on the actual (versus planned) between-test sample size, the observed distribution of confounders in the population, and the changing individual-level data values. Use of the original thresholds based on planned information could inflate or deflate the overall Type 1 error because the amount and variability of the observed data may differ in practice from what was planned. The recomputed thresholds, however, are based on the actual information at each sequential test and thus will properly maintain Type 1 error at the desired level. In the DTaP-IPV-Hib VSD evaluation, accounting for such changes led to the use of LLR sequential thresholds that ranged from 2.15 to 1.79 for the fever outcome (versus using the planned constant flat threshold of 2.15 at all tests). The recomputed thresholds were lower at later tests primarily because the actual amount of information was greater (i.e., sample sizes were larger) at each sequential test than initially planned because the data arrived in larger weekly batches than expected. More information at each test than planned implied less variability in the data, which implied that a lower-than-planned boundary would maintain the desired Type 1 error. This type of threshold adjustment is also a standard practice in clinical trials[28] but is even more important in an observational context where such changes in the data may be large.

It is also important to note that, in addition to impacting the variability of the data at each sequential test, unpredictable changes in the data over time can also introduce bias. For example, if ICD-9 coding practices change over time and the adverse event rate among new users of a medical product is being compared with an event rate estimated in a historical period prior to the uptake of the new product, temporal bias can occur. To minimize temporal bias in this scenario, one could use a historical period that is as recent (to the introduction of the new product) and as narrow as possible. Alternatively, one could select a concurrent comparator group who receives an alternate product during the same time period as the introduction of the new product. In general, the strategies for reducing bias in observational database surveillance evaluations are similar to those used in conventional pharmacoepidemiologic studies and involve appropriate selection of comparators and proper adjustment for confounding.

Rare outcomes

Another challenge for sequential safety monitoring is that adverse event outcomes can be relatively rare. Statistically, this has implications for the appropriate choice of a test statistic. Exact testing methods may be preferred over standard test statistic choices that rely on normal approximation theory, as large sample properties may not hold. Furthermore, some group sequential methods, such as the alpha spending approach with standardized test statistic boundaries developed by Lan and DeMets[32], rely on the assumption of adequate increments of new information between each new test, which is more difficult to attain when evaluating rare outcomes, particularly if testing frequency is high. Practically, evaluating rare outcomes means that there is considerable instability in the data and test statistics, especially at early tests. Figure 4 shows the dramatic impact in the DTaP-IPV-Hib VSD evaluation of each new seizure event on the trajectory of the LLR. Thus, to avoid early false-positive findings that are driven by just one or two events, investigators should be careful to define thresholds that are robust to this extreme variation.
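The jumpiness of the LLR with rare events can be made concrete with the Poisson maximized log-likelihood ratio used in Kulldorff's maxSPRT, one continuous sequential method employed within the VSD. The expected count below (0.5 expected seizures early in surveillance) is hypothetical, chosen only to show the scale of the jumps.

```python
# Sketch of the Poisson maximized log-likelihood ratio (as in maxSPRT),
# illustrating how a single new rare event moves the statistic sharply
# when the expected count is small. The expected count is hypothetical.
import math

def max_llr(c, mu):
    """Poisson maxSPRT LLR for observed count c vs. expected count mu under H0,
    maximized over relative risks >= 1 (zero when c <= mu)."""
    if c <= mu or mu <= 0:
        return 0.0
    return mu - c + c * math.log(c / mu)

# Early surveillance with ~0.5 expected seizures: each event is a large jump.
for c in range(4):
    print(c, round(max_llr(c, 0.5), 2))  # 0.0, 0.19, 1.27, 2.88
```

With only 0.5 expected events, the third observed event alone would push the LLR past a flat boundary of 2.15 (the planned threshold mentioned earlier), showing why early signals driven by one or two events warrant thresholds robust to this variation.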

Figure 4.

Preliminary data (first 33 000 doses) on the trajectory of the LLR for the seizure outcome in the VSD DTaP-IPV-Hib safety evaluation, illustrating high variability in the early weeks of the evaluation when sample sizes are small


Use of sequential methods to conduct surveillance for medical product safety using electronic health care data is challenging but has strong potential to provide a flexible and robust monitoring approach, as in clinical trials. Careful attention should be paid in both the sequential design and analysis phases because of the many difficulties that arise in this observational setting, difficulties that derive from the lack of a controlled experiment and include confounding, missing and partially accrued data, misclassification, and unpredictable changes in the dynamically accessed data over time. Continued work is needed to better understand many of these issues and to develop more rigorous strategies to overcome them.


The sponsor of this project had the right to comment, but the authors retained the right to accept or reject comments or suggestions.


  • Sequential testing, an established method in clinical trials, shows promise for use in medical product safety evaluations that capture electronic health care data, although many challenges arise when adapting such methods to this observational setting.
  • Key sequential design considerations include the sample size and surveillance duration, the shape of the signaling threshold, and the frequency of interim testing.
  • Differing design choices yield different trade-offs between the probability and timing of true and false positive signals and thus should be guided by the scientific and ethical factors relevant to the safety surveillance question and observational setting.
  • Analytic challenges largely derive from the lack of a controlled experiment and include confounding, missing data, ICD-9-CM coded outcome misclassification, and unpredictable changes in the data over time (e.g., differential product uptake).
  • To account for these factors, which impact outcome variability over time (and thus the probability of committing a Type 1 error, which investigators want to control), planned sequential thresholds should be adjusted at each test in the analysis phase.


This work was funded in part through a VSD subcontract with America's Health Insurance Plans under Contract 200-2002-00732 from the CDC and a Mini-Sentinel subcontract. Mini-Sentinel is funded by the FDA through the Department of Health and Human Services Contract Number HHSF223200910006I. The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the CDC or FDA.