In addition to sequential design concerns, numerous analytic challenges arise when sequentially monitoring product safety using observational electronic health care data. Many difficulties derive from the lack of a controlled experimental setting, including (i) confounding, (ii) missing and partially accrued (e.g., late-arriving claims) data, (iii) high misclassification rates for presumptive outcomes defined using ICD-9-CM codes, and (iv) unpredictable variation or changes in the data over time (e.g., variable product uptake rates). Other difficulties stem from unique features of postmarket safety surveillance questions and the data environment: (i) safety outcomes may be rare, and (ii) the distributed data environment prevents pooling of individual-level data, which can limit analytic options. Some of these challenges, including confounding and missing or partially accrued data, and methods to address them have been described elsewhere. Others, such as outcome misclassification, are recognized as major sources of bias and false-positive signals but have received less attention. A full discussion of each problem is beyond the scope of this article; we instead highlight two issues that relate most directly to the sequential analytic aspects.
Unpredictable changes in electronic data over time
Unlike in a clinical trial, where participant accrual can be relatively steady and controllable, the accrual of individuals in observational safety surveillance depends on the rate of product uptake in the population under evaluation, which may be variable, unpredictable, and differential by confounding factors. In the DTaP-IPV-Hib VSD evaluation, for instance, uptake was initially slow, increased over time, and varied widely by potential confounders like MCO site, from rapid (MCO 1) to slow and steady (MCO 2) to delayed (MCO 3) (Figure 2). One consequence of this is that the spacing between sequential tests, when based on the number of doses, may not be executable as planned. Figure 3 shows the planned between-test sample sizes (every 3500 doses) for fever and seizure versus those that actually occurred in practice for each sequential test. Of note is that the sixth planned test was skipped when an unexpectedly large bolus of new DTaP-IPV-Hib data was received from a single site after a vaccine coding error was corrected.
Figure 2. Differential uptake of DTaP-IPV-Hib vaccine (trade name Pentacel) within the VSD population at selected MCO sites. DTaP-IPV-Hib vaccine uptake is shown as a percentage of the total number of vaccines received during the evaluation period (i.e., DTaP-IPV-Hib vaccines among exposed plus DTaP vaccines among concurrent unexposed comparators)
Figure 3. Planned sample sizes for each of the 12 sequential tests for fever and seizure outcomes versus actual (observed) sample sizes at each test in practice for the DTaP-IPV-Hib VSD evaluation
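The mechanics behind a skipped test can be sketched in a few lines. The function and batch sizes below are hypothetical illustrations (not the actual VSD schedule): tests are triggered by accrued doses rather than calendar time, so a single oversized data batch can leap past a planned boundary and cause a planned test to be skipped, as happened with the sixth test.

```python
def realized_tests(weekly_doses, doses_per_test=3500, n_planned=12):
    """Each week, run at most one sequential test if the cumulative dose
    count has crossed the next untested planned boundary; planned tests
    overtaken within a single large batch are skipped.

    Returns (planned test number reached, actual cumulative doses) pairs.
    """
    realized = []
    cumulative = 0
    next_planned = 1
    for batch in weekly_doses:
        cumulative += batch
        crossed = cumulative // doses_per_test  # boundaries passed so far
        if next_planned <= n_planned and crossed >= next_planned:
            realized.append((min(crossed, n_planned), cumulative))
            next_planned = crossed + 1
    return realized

# Illustrative batches: a week-4 bolus (e.g., after a coding correction)
# crosses two boundaries at once, so planned test 2 is skipped.
print(realized_tests([2000, 2000, 1500, 7000], n_planned=4))
# prints: [(1, 4000), (3, 12500)]
```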
In addition, when electronic health care databases are dynamically accessed in near-real time, unpredictable changes can also occur in data at the individual level. For instance, subject-level exposure and event data can “arrive late” into health plan data systems when care is received at outside hospitals not affiliated with the health plan. Further, if data are accessed for safety monitoring more often (e.g., monthly) than the health plan updates its enrollment data for administrative purposes (e.g., annually), then each periodic enrollment update can both identify newly enrolled subjects and cause existing subjects and their events to “disappear.” The impact of these changes on safety analyses depends on their magnitude and on the method chosen for handling them. For instance, at each new test, one could cumulatively refresh all data since the start of the evaluation, or one could freeze all previously analyzed data and append only the data newly accrued since the last test. In the DTaP-IPV-Hib evaluation, at one example interim test, these two methods led to non-negligible differences in the total number of doses (66 400 vs. 68 826) and fevers (343 vs. 335).
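The contrast between the two data-handling strategies can be made concrete with a small sketch. The record IDs, weeks, and outcomes below are invented for illustration (they are not the actual VSD data), but the mechanics match the two options described above: a full cumulative refresh picks up corrections and disappearances, whereas freeze-and-append preserves previously analyzed records as-is.

```python
def cumulative_refresh(live_source):
    """Re-extract every record from the live database at each test,
    picking up late arrivals, corrections, and 'disappeared' records."""
    return dict(live_source)

def freeze_and_append(frozen, live_source, last_cutoff):
    """Keep previously analyzed records unchanged; append only records
    that accrued after the last test's cutoff."""
    updated = dict(frozen)
    for rec_id, rec in live_source.items():
        if rec_id not in updated and rec["week"] > last_cutoff:
            updated[rec_id] = rec
    return updated

# Records analyzed at the previous test (cutoff = week 2):
frozen = {1: {"fever": True, "week": 1}, 2: {"fever": False, "week": 2}}
# Live data now: record 2 'disappeared' after an enrollment update,
# record 1's fever code was corrected, and record 3 newly accrued.
live = {1: {"fever": False, "week": 1}, 3: {"fever": True, "week": 3}}

refresh = cumulative_refresh(live)
append = freeze_and_append(frozen, live, last_cutoff=2)
print(len(refresh), sum(r["fever"] for r in refresh.values()))  # prints: 2 1
print(len(append), sum(r["fever"] for r in append.values()))    # prints: 3 2
```

The two strategies disagree on both the denominator (record counts) and the numerator (fever counts), mirroring the dose and fever discrepancies observed in the actual evaluation.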
Collectively, these unpredictable changes in the data over time imply changes in the variability of the events under evaluation, which, in turn, affect the probability of committing a Type 1 error (which investigators want to control). Thus, to maintain the desired overall Type 1 error rate, it is important to account for these changes by recomputing the signaling thresholds at each test based on the actual (versus planned) between-test sample size, the observed distribution of confounders in the population, and the changing individual-level data values. Using the original thresholds based on planned information could inflate or deflate the overall Type 1 error, because the amount of information, and hence the variability, observed in practice may differ from what was planned. The recomputed thresholds, by contrast, are based on the actual information at each sequential test and thus properly maintain the Type 1 error at the desired level. In the DTaP-IPV-Hib VSD evaluation, accounting for such changes led to LLR sequential thresholds that ranged from 2.15 to 1.79 for the fever outcome (versus the planned constant flat threshold of 2.15 at all tests). The recomputed thresholds were lower at later tests primarily because the actual amount of information at each sequential test was greater (i.e., sample sizes were larger) than initially planned, as the data arrived in larger weekly batches than expected. More information at each test than planned implied less variability in the data, so a lower-than-planned boundary sufficed to maintain the desired Type 1 error. This type of threshold adjustment is also standard practice in clinical trials but is even more important in an observational context, where such changes in the data may be large.
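One way to recompute a threshold at the actual information levels is Monte Carlo calibration under the null. The sketch below uses an illustrative binomial formulation with a 1:1 matched comparator (null exposure proportion 0.5) rather than the exact method used in the VSD evaluation; the event counts and function names are assumptions for illustration. Given the realized cumulative event counts at each test, it finds a flat LLR boundary whose overall Type 1 error equals the desired alpha.

```python
import math
import random

def llr_binom(c, n):
    """One-sided binomial LLR of c exposed events among n total, versus a
    null exposure proportion of 0.5 (as with a 1:1 matched comparator)."""
    if n == 0 or c / n <= 0.5:
        return 0.0
    out = c * math.log(2 * c / n)
    if n - c > 0:  # avoid 0 * log(0) when every event is exposed
        out += (n - c) * math.log(2 * (n - c) / n)
    return out

def calibrate_flat_threshold(cum_events, alpha=0.05, n_sim=10000, seed=7):
    """Monte Carlo search for a flat LLR boundary with overall Type 1
    error alpha, given the ACTUAL cumulative event counts at each
    sequential test (cum_events must be nondecreasing)."""
    rng = random.Random(seed)
    max_llrs = []
    for _ in range(n_sim):
        # One null trajectory: each event is 'exposed' with probability 0.5;
        # cumulative counts at successive tests are nested, not independent.
        flips = [rng.random() < 0.5 for _ in range(cum_events[-1])]
        max_llrs.append(max(llr_binom(sum(flips[:n]), n) for n in cum_events))
    max_llrs.sort()
    # The (1 - alpha) quantile of the maximum LLR is the flat boundary.
    return max_llrs[int((1 - alpha) * n_sim) - 1]
```

Rerunning this calibration at each test with the realized (rather than planned) counts is what allows the boundary to drop, as with the 2.15-to-1.79 range observed for fever: more information than planned means less variability, so a lower boundary spends the same alpha.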
It is also important to note that, in addition to impacting the variability of the data at each sequential test, unpredictable changes in the data over time can also introduce bias. For example, if ICD-9 coding practices change over time and the adverse event rate among new users of a medical product is being compared with an event rate estimated in a historical period prior to the uptake of the new product, temporal bias can occur. To minimize temporal bias in this scenario, one could use a historical period that is as recent (to the introduction of the new product) and as narrow as possible. Alternatively, one could select a concurrent comparator group who receives an alternate product during the same time period as the introduction of the new product. In general, the strategies for reducing bias in observational database surveillance evaluations are similar to those used in conventional pharmacoepidemiologic studies and involve appropriate selection of comparators and proper adjustment for confounding.
Another challenge for sequential safety monitoring is that adverse event outcomes can be relatively rare. Statistically, this has implications for the appropriate choice of test statistic: exact testing methods may be preferred over standard test statistics that rely on normal approximation theory, because large-sample properties may not hold. Furthermore, some group sequential methods, such as the alpha spending approach with standardized test statistic boundaries developed by Lan and DeMets, rely on the assumption of adequate increments of new information between tests, which is more difficult to attain when evaluating rare outcomes, particularly if testing frequency is high. Practically, evaluating rare outcomes means that the data and test statistics are highly unstable, especially at early tests. Figure 4 shows the dramatic impact in the DTaP-IPV-Hib VSD evaluation of each new seizure event on the trajectory of the LLR. Thus, to avoid early false-positive findings driven by just one or two events, investigators should be careful to define thresholds that are robust to this extreme variation.
Figure 4. Preliminary data (first 33 000 doses) on the trajectory of the LLR for the seizure outcome in the VSD DTaP-IPV-Hib safety evaluation, illustrating high variability in the early weeks of the evaluation when sample sizes are small
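The early instability seen in Figure 4 can be illustrated numerically. The sketch below assumes a simple binomial LLR against a null exposure proportion of 0.5 (a 1:1 matched comparator), which is an illustrative stand-in for the statistic used in the evaluation: when every early event is exposed, each single new event raises the LLR by ln(2) ≈ 0.69, so three or four events alone approach a boundary like 2.15, whereas the same single event barely moves a statistic built on dozens of accumulated events.

```python
import math

def llr(c, n):
    """Binomial LLR for c exposed of n total events versus a null
    exposure proportion of 0.5 (1:1 matched comparator, one-sided)."""
    if n == 0 or c / n <= 0.5:
        return 0.0
    out = c * math.log(2 * c / n)
    if n - c > 0:  # avoid 0 * log(0) when every event is exposed
        out += (n - c) * math.log(2 * (n - c) / n)
    return out

# Early: going from 2 of 2 to 3 of 3 exposed events adds ln(2) to the LLR.
early_jump = llr(3, 3) - llr(2, 2)
# Later: one more exposed event among ~50 accumulated events moves it far less.
late_jump = llr(31, 51) - llr(30, 50)
print(round(early_jump, 2), round(late_jump, 2))  # prints: 0.69 0.19
```

A single event's influence thus shrinks as information accrues, which is why thresholds must be made robust to the jumpy early trajectory rather than tuned only to large-sample behavior.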