An evaluation of the impact of missing deaths on overall survival analyses of advanced non–small cell lung cancer patients conducted in an electronic health records database

Abstract Purpose The aim of this study was to assess the impact of missing death data on survival analyses conducted in an oncology EHR-derived database. Methods The study was conducted using the Flatiron Health oncology database and the National Death Index (NDI) as a gold standard. Three analytic frameworks were evaluated in advanced non-small cell lung cancer (aNSCLC) patients: absolute risk estimates (median overall survival [mOS]) and relative risk estimates conducted within the EHR-derived database, and "external control arm" analyses comparing an experimental group augmented with mortality data from the gold standard to a control group from the EHR-derived database only. The hazard ratios (HRs) obtained within the EHR-derived database (91% sensitivity) and from the external control arm analyses were compared with results when both groups were augmented with mortality data from the gold standard. The above analyses were repeated using simulated lower mortality sensitivities to understand the impact of more extreme levels of missing deaths. Results Bias in mOS ranged from modest (0.6-0.9 months) in the EHR-derived cohort (91% sensitivity) to substantial when lower sensitivities were generated through simulation (3.3-9.7 months). Overall, small differences were observed in the HRs for the EHR-derived cohort across comparative analyses when compared with HRs obtained using the gold standard data source. When only one treatment arm was subject to estimation bias, as in the external control arm analyses, the bias was slightly more pronounced and increased substantially when lower sensitivities were simulated. Conclusions The impact on survival analyses is minimal when mortality sensitivity is high, with only modest bias in external control arm applications.


| INTRODUCTION
Real-world evidence (RWE) generated from real-world data (RWD), including data derived from electronic health records (EHRs), is increasingly important for pharmacoepidemiological research. 1 These data provide opportunities for deriving clinical insights and serve to complement findings from clinical trials. In order to synthesize RWD into high-quality RWE, outcomes data are needed, and mortality-based outcomes (eg, overall survival [OS]) are of central interest in oncology. One common application is comparative effectiveness research (CER), in which outcomes are compared across treatment groups. Although these analyses could involve multiple data sources, a strength of RWE is that datasets are often large enough that these questions can be addressed within a single, harmonized database, which leverages consistent data-generating mechanisms across groups. Another emerging application for RWE is to serve as an external control arm for single-arm trials, where every patient receives the experimental treatment. 1 Although a control arm is not built into the study, researchers often wish to make comparisons between the experimental treatment and contemporaneous control treatments external to the trial. Therefore, this application faces the additional challenge that outcomes data are compared across multiple data sources, conflating any differences in treatment effect (eg, HR estimates) with differences in underlying data. Lastly, real-world mortality data can be used for trial planning purposes, where estimates from real-world populations could be used either for power calculations or for planning the time needed to accrue a given number of events.
Due to its critical role in identifying a survival benefit associated with a treatment regimen, the quality of the underlying mortality data used in RWE studies is of salient interest. In EHR-derived databases, mortality data are collected in a structured format as part of routine clinical care. However, researchers often must augment incomplete EHR mortality data with other sources, such as national death indices and commercial sources. 2 Ideally, the quality of the data source will be benchmarked against a gold standard. While rules can be applied to address specificity and date agreement by flagging potential false positives or improbable dates, there are limited ways to address imperfect sensitivity short of obtaining additional data. Thus, missing deaths are of paramount interest when describing the quality of EHR-derived mortality data.
Once mortality quality benchmarks are in place, researchers are often faced with the question of how good is good enough. There are currently no universally accepted minimum standards or consensus with regard to an acceptable level of mortality completeness; rather, the right standard likely depends on the particular analysis. 3,4 Even if deaths are missing completely at random, incomplete mortality data result in absolute estimates of OS, such as median OS (mOS), being biased upward. 5,6 If missing deaths are equally distributed between comparator arms, the impact on relative risk estimates, such as HRs, should be minimal; however, this assumption may not hold in applications such as external control arms, where the experimental and control arms are drawn from different data sources. Beyond these hypothetical implications, how do missing deaths impact findings in applications of interest to researchers utilizing RWE?
We sought to answer these questions using an oncology EHR-derived data source, where a recent study reported greater than 90% mortality sensitivity in advanced non-small cell lung cancer (aNSCLC) patients. 2 We tested the impact of missing deaths in the EHR-derived data source, both at the high sensitivity levels observed in practice and at artificially reduced levels for illustrative purposes, by comparing the output of descriptive analyses, comparative effectiveness research (CER), and external control arm analyses to those obtained using a gold standard data source. Thus, we aimed not only to understand the impact of missing deaths in the EHR-derived data source utilized here but also, more broadly, to understand what levels of mortality sensitivity are high enough to minimize the impact on analytic results.

| Study overview
The purpose of this retrospective observational study was to evaluate the impact of missing deaths on survival analyses from an EHR-derived RWE database in patients diagnosed with aNSCLC, as compared with a gold standard data source. Institutional Review Board and National Center for Health Statistics approval of the study protocol was obtained prior to study conduct. Informed consent was waived, as this was a non-interventional study using routinely collected data. Flatiron Health's standard methodology for data security and patient privacy was implemented.

| Description of EHR-derived database and gold standard data source
The Flatiron Health database contains real-world clinical data and outcomes collected through EHRs used by cancer care providers, primarily in community oncology clinics across the United States. For patients treated in the Flatiron network, information includes data entered into structured fields and data contained in unstructured documents. The EHR data are subsequently linked with external mortality data. At the time of this study, the database included information from 250 cancer clinics, comprising approximately 775 unique sites of care in the United States, although academic centers were excluded from this analysis. The quality of mortality data in the EHR-derived database has been previously evaluated. 2 The 2015 NDI data served as the gold standard data source. The NDI is a centralized database containing death record information from state vital statistics offices. As a result, the NDI captures more complete death information than the EHR data linked with external death information for patients who have transitioned out of the study's EHR network or were lost to follow-up. 7 The NDI is updated annually with death records from state vital statistics offices, so its recency does not suffice for every use case, but its completeness makes it a good historical resource for benchmarking.

| Cohort selection and classification
Patients from the EHR-derived database with an aNSCLC diagnosis (stage IIIB or metastatic stage IV, or recurrent advanced disease) between January 1, 2011, and December 31, 2015, were selected for inclusion to align with the data cutoff of December 31, 2015, for the gold standard data source. Patients with aNSCLC had an ICD code for lung cancer (ICD-9 162.x or ICD-10 C34.x or C39.9), and advanced disease was confirmed via unstructured documents using "technology-enabled abstraction," which combines clinically trained human abstractors with software that displays portions of the chart. 8 Upon linking the data sources, patients within the EHR-derived database were classified as follows: true positives (cell A) where a death date was in both the EHR-derived and gold standard data; false positives (cell B) where a death date was in the EHR-derived data but not in the gold standard; false negatives (cell C) where a death date was not in the EHR-derived data but was in the gold standard; or true negatives (cell D) where no death date was in either the EHR-derived data or the gold standard ( Figure 1, "Classification by Gold Standard").
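The four-cell classification and the resulting sensitivity calculation can be sketched as follows. This is a minimal illustration; the function names and counts are hypothetical, chosen only to reproduce the reported 90.6% sensitivity, and do not reflect the study's actual cell sizes.

```python
def classify(ehr_death: bool, gold_death: bool) -> str:
    """Classify one patient's EHR-derived death record against the gold standard."""
    if ehr_death and gold_death:
        return "A"  # true positive: death date in both sources
    if ehr_death and not gold_death:
        return "B"  # false positive: death date in EHR only
    if not ehr_death and gold_death:
        return "C"  # false negative: death date missing from EHR
    return "D"      # true negative: no death date in either source

def sensitivity(cells: dict) -> float:
    """Mortality sensitivity = true positives / all gold-standard deaths (A + C)."""
    return cells["A"] / (cells["A"] + cells["C"])

# Hypothetical counts reproducing a 90.6% sensitivity: 906 / (906 + 94)
cells = {"A": 906, "B": 30, "C": 94, "D": 200}
print(round(sensitivity(cells), 3))  # 0.906
```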

| Sampling methods
In addition to the classification described above corresponding to the empirically observed sensitivity in the EHR-derived data source (90.6% as compared with the gold standard 2 ), lower sensitivity datasets were generated through simulation ( Figure 1, "Sampling Methods"). Bootstrap sampling (ie, sampling with replacement) was performed with 1000 iterations to simulate sensitivities of 63.4% (simulation 1) and 72.5% (simulation 2) by randomly selecting 30% and 20% of true positives (cell A), respectively. These patients were then reclassified as false negatives (cell C) as they were no longer considered to have a death in the EHR-derived data source. The reclassified patient groups were denoted based on the proportion of patients that were sampled from cell A and moved to cell C (ie, cells A 30 and C 30 for simulation 1; cells A 20 and C 20 for simulation 2).
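A single iteration of this reclassification scheme can be sketched as below. This is a simplification for illustration: the study performed bootstrap sampling with 1000 iterations, whereas this sketch shows one draw, with hypothetical patient IDs and cell sizes chosen so that dropping 30% of cell A lowers sensitivity from 90.6% to roughly 63.4% (simulation 1).

```python
import random

def simulate_lower_sensitivity(cell_a, cell_c, drop_frac, rng):
    """Move a random drop_frac of true positives (cell A) to false negatives (cell C).

    Reclassified patients no longer have a death in the EHR-derived data, so
    sensitivity is reduced multiplicatively by (1 - drop_frac).
    """
    n_drop = round(len(cell_a) * drop_frac)
    dropped = set(rng.sample(range(len(cell_a)), n_drop))
    new_a = [p for i, p in enumerate(cell_a) if i not in dropped]
    new_c = cell_c + [p for i, p in enumerate(cell_a) if i in dropped]
    return new_a, new_c

rng = random.Random(0)
# Hypothetical IDs: 906 true positives and 94 false negatives (90.6% sensitivity)
a, c = simulate_lower_sensitivity(list(range(906)), list(range(906, 1000)), 0.30, rng)
print(len(a) / (len(a) + len(c)))  # 0.634, matching simulation 1
```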

| Statistical analysis
Descriptive statistics on the demographic and clinical characteristics of the cohort were calculated, stratified by the classification of the EHR-derived mortality variable against the gold standard. Among patients with missing EHR-derived death data (cell C), the last confirmed structured activity date (ie, their last visit or drug administration in the EHR) was compared with the death date in the gold standard data source, and the distribution of differences between dates was visually examined.
Three sets of comparison groups were selected to examine the impact of missing deaths and were chosen based on known prognostic and/or predictive properties to allow for a range of expected effect sizes. 9-14 For the first comparison, we selected treatments commonly administered during the study period (2011-2015) in the first-line setting (ie, before the widespread use of immunotherapy) and expected to show a difference in survival between comparison groups. Patients who received a platinum-based treatment (defined as all regimens containing cisplatin or carboplatin; "experimental group") were compared with patients receiving other chemotherapy (defined as all regimens without a platinum agent but containing any combination of paclitaxel, nab-paclitaxel, docetaxel, gemcitabine, vinorelbine, irinotecan, pemetrexed, or bevacizumab; "control group"). Regimens containing clinical study drugs were excluded (eg, anonymized therapies from clinical trials). The two remaining comparisons evaluated biomarker status using the patient's most recent valid test result (ie, positive or negative), as identified from unstructured documents available in the EHR, where the result was not "results pending," unsuccessful, or indeterminate. One biomarker comparison was chosen based on widespread testing (EGFR+ "experimental group" versus EGFR− "control group"), and another was chosen based on less frequent testing and an expected large difference in survival, in order to test whether the results held for a small sample size (KRAS+ "experimental group" versus KRAS− "control group").

KEY POINTS
• The impact of missing death data on survival analyses and estimates of overall survival is small when mortality capture sensitivity is high (eg, approximately 90% or more).
• The magnitude of bias is increased and, at times, substantial, with lower mortality sensitivities in the 60% to 70% range.
• The direction of the effect estimation may also change with lower mortality sensitivities in the 60% to 70% range.
• Electronic health record mortality data with high sensitivity limit the potential for missing deaths to bias OS estimates, allowing valid inferences to be drawn.
Using these comparison groups, three analytic use cases were evaluated: descriptive statistics of absolute risk, CER, and external control arms. For each analytic use case, a benchmark analysis and an EHR-derived database-only analysis were performed, and the results were compared.
The cohorts remained the same for each analysis, but the date of death was defined according to the mortality data source being evaluated (the gold standard benchmark, the EHR-derived data only, or the simulated lower-sensitivity datasets).

| Impact of missing deaths on absolute survival estimates
An upward bias in mOS of approximately 0.5 months was observed in EHR-derived death data for patients treated with platinum agents in the first-line setting compared with the benchmark (Table 2). For the simulated cohorts with mortality sensitivities of 63.4% and 72.5%, the bias increased substantially to 3.3 and 2.2 months, respectively. A similar trend was observed in patients treated with other chemotherapy, as well as in the biomarker-based groups. Overall, the magnitude of bias in mOS comparing the gold standard data source with the EHR-derived death data ranged from approximately 2.5% to 8.1%, while the bias observed in the simulated lower mortality sensitivities ranged from 36.7% to 53.2% for simulation 1. Observed differences in mOS for KRAS+ were as much as 9.7 months in simulation 1 (63.4% mortality sensitivity) and 6.2 months in simulation 2 (72.5% mortality sensitivity).
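The direction of this bias can be illustrated with a minimal Kaplan-Meier sketch on synthetic data (not the study's patient-level data): when a fraction of true deaths is instead treated as censoring at the same time, the estimated median shifts later.

```python
from collections import Counter

def km_median(times, events):
    """Kaplan-Meier median survival: first time t at which S(t) <= 0.5."""
    n_at_risk = len(times)
    deaths_at = Counter(t for t, e in zip(times, events) if e)
    leaving_at = Counter(times)  # deaths and censorings both leave the risk set
    s = 1.0
    for t in sorted(leaving_at):
        if deaths_at[t]:
            s *= 1 - deaths_at[t] / n_at_risk
            if s <= 0.5:
                return t
        n_at_risk -= leaving_at[t]
    return None  # median not reached

times = list(range(1, 21))           # synthetic survival times in months
events = [1] * 20                    # complete capture: all 20 deaths observed
print(km_median(times, events))      # 10

# "Miss" 30% of deaths: those patients appear censored at the same time
missed = {2, 5, 8, 11, 14, 17}       # hypothetical patients whose deaths are missed
events_missed = [0 if t in missed else 1 for t in times]
print(km_median(times, events_missed))  # 13: the median is biased upward
```

With complete data the survival curve crosses 0.5 at 10 months; converting six deaths to censorings leaves those patients in the risk set without events, inflating the estimated median to 13 months.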

| Impact on CER analyses
When HRs were examined, a lack of systematic bias between exposure groups was observed, including within the two simulated cohorts, with only small differences seen in the HRs for the EHR-derived death data across all three comparisons relative to HRs obtained for the gold standard death data ( Figure 4A-C). Despite large observed differences in mOS between the simulated cohorts and the gold standard cohort, comparisons of relative risk such as HRs were largely unaffected by missing death data.

| Impact on external control arm analyses
Given the systematic differences in mortality capture of the exposure groups artificially introduced in the external control arm analyses, the impact of missing deaths on these analyses was much more pronounced ( Figure 5A-C) than in analyses performed entirely within the EHR-derived database (ie, measures of absolute risk or CER analyses). The impact ranged from modest differences at the empirically observed EHR-derived sensitivity to substantial differences at the simulated lower sensitivities.

| DISCUSSION
RWE sources, including EHR-derived datasets, are valuable analytic platforms for conducting clinical research. 1 Mortality serves as the primary outcome in many analyses across disease areas, and particularly in oncology. However, it is often incomplete because of imperfect data collection systems, workflows not designed to capture mortality data, and patients lost to follow-up. 5,15,16 The purpose of this study was to examine the potential impact of missing death data in an EHR-derived oncology data source, which is of critical importance to establishing a research-grade EHR-derived database and should provide guidance with respect to an acceptable level of completeness. 16-18 In CER analyses, there was little to no impact on the estimated HRs.
One key opportunity for RWE in drug development is to serve as an external control for a single-arm clinical trial. 1 With the EHR's current mortality sensitivity of greater than 90%, creating differential sensitivity in the context of external control arm analyses resulted in only small differences in estimated HRs and would lead to conservative conclusions, biasing against the experimental arm (eg, an absolute change in HR of less than 0.05 towards the null), with correspondingly small increases in the probability of type II errors. In drug development, bias in the direction of the null is preferable to an enhanced risk of a type I error.
Also, decisions on molecule phase advancement within drug development (eg, from a single-arm phase 1b trial to a randomized phase 3 trial) generally would not change based on a 0.05 absolute difference in a phase 1b HR. Conversely, when the sensitivity was lowered, the impact was far more pronounced and much more likely to alter decision making.
Other studies have examined the impact of missing deaths with little evidence to suggest meaningful estimation bias when the mortality outcome is reasonably well captured (ie, 85%-90% sensitivity). 4,15 Some studies have observed systematic differences in comparative analyses likely attributable in part to informative censoring in settings where exposures are related differentially to the mortality outcome. 16,19 Given the absence of any meaningful estimation bias when sensitivity is greater than 90%, why was it important to conduct this study? It was clear from the simulated analyses at lower sensitivities that the impact of missing endpoints such as mortality can have a major effect on analyses, in particular on estimates of absolute risk.
Although there are a number of thresholds that have been discussed with respect to levels of missing outcomes in EHRs, there is a dearth of empirical support. 3,4 Understanding the impact of missing deaths in EHRs is essential to instilling confidence in this rapidly evolving source of clinical evidence. In doing so, researchers will ensure a level of scientific rigor that will allow for sensible use of EHR-derived data for clinical research as an adjunct to the gold standard randomized clinical trials.
There are a number of study limitations that should be considered when evaluating the findings. First, this study leveraged data from community-based oncology clinics in the United States, and patterns of missing data may be different in academic centers or in other countries. This analysis assumes that the NDI is a gold standard for mortality, yet any database at this scale is unlikely to capture every death. 20 Second, this study did not consider the mechanism for missing deaths.
Although we observed little impact on the examples studied here with high-sensitivity mortality data, regardless of mechanism, further work is needed to describe the presence and degree of informative censoring in these data and to understand its impact. Third, despite the minimal impact on most conclusions observed in aNSCLC, it is unclear how this will extend to other cancer types with longer mOS. Lastly, although the comparison groups were chosen to represent a range of common research questions, they are not exhaustive.

FIGURE 5 Impact of missing deaths on analyses that use the EHR-derived data as an external control arm: current mortality sensitivity vs simulated sensitivities compared with the gold standard benchmark. For the external control analyses, the experimental arm in all analyses is composed of the gold standard data, and the control arm is composed of the EHR-derived data only. For simulations 1 and 2, the same approach is taken, with the control arms at their respective simulated lower sensitivities. Each analysis is in turn compared with an analysis conducted using the gold standard data only (the solid red vertical line in Figure 5 represents the HR using the gold standard, with dashed lines representing its corresponding 95% CI).
Strengths of the study include the varying levels of mortality sensitivity and the large sample size. Additionally, the study utilized a gold standard data source; in many examinations of missing data, a proxy for the complete data is never available for comparison. Finally, a variety of analytic use cases were examined, including the novel use of EHR-derived data as an external control.
Although modest bias was observed for absolute estimates and external control analyses when sensitivity was greater than 90%, the bias occurred in a consistent direction and would not likely impact study conclusions or decision making. However, mortality data with lower sensitivity allow for the possibility of more substantial bias entering analyses conducted using EHR-derived data. For analyses of mortality based on external controls, researchers should understand the level of sensitivity of the data and consider the impact on bias. Using EHR-derived mortality data with high sensitivity mitigates the likelihood that analyses performed using the data will be subject to bias of any meaningful magnitude. In fact, based on the findings from the current study, achieving perfect mortality capture (100% sensitivity) in an EHR-derived database would not result in meaningful gains in a researcher's ability to draw conclusions from the data as compared with the greater than 90% sensitivity observed in this dataset.

ETHICS STATEMENT
Institutional Review Board approval was obtained prior to study conduct and included a waiver of informed consent.