A systematic review of how missing data are handled and reported in multi‐database pharmacoepidemiologic studies

Abstract
Purpose: Pharmacoepidemiologic multi‐database studies (MDBS) provide opportunities to better evaluate the safety and effectiveness of medicines. However, the issue of missing data is often exacerbated in MDBS, potentially resulting in bias and precision loss. We sought to measure how missing data are being recorded and addressed in pharmacoepidemiologic MDBS.
Methods: We conducted a systematic literature search in PubMed for pharmacoepidemiologic MDBS published between 1st January 2018 and 31st December 2019. Included studies were those that used ≥2 distinct databases to assess the same safety/effectiveness outcome associated with a drug exposure. Outcome variables extracted from the studies included strategies to execute an MDBS, reporting of missing data (type, bias evaluation) and the methods used to account for missing data.
Results: Two thousand seven hundred and twenty‐six articles were identified, and 62 studies were included, using data from North America (56%), Europe (31%), multiple regions (11%) or East Asia (2%). Thirty‐five (56%) articles reported missing data: 11 of these studies reported that this could have introduced bias and 19 studies reported a method to address missing data. Of these 19, thirteen (68%) carried out a complete case analysis, 2 (11%) applied multiple imputation, 2 (11%) used both methods, 1 (5%) used mean imputation and 1 (5%) substituted information from a similar variable.
Conclusions: Just over half of the recent pharmacoepidemiologic MDBS reported missing data, and just over half of these studies reported how they accounted for them. We should increase our vigilance for database completeness in MDBS by reporting and addressing the missing data that could introduce bias.


| INTRODUCTION
The use of multiple health databases in pharmacoepidemiologic studies can facilitate more robust assessments of drug safety and effectiveness. 1,2 Multi-database studies (MDBS) involve the analysis of routinely collected data from two or more data sources, which may take the form of health insurance claims databases, electronic healthcare records (EHR) or healthcare record linkage systems. 3 MDBS can allow us to investigate specific subgroups of patients or rare outcomes, or to conduct an early post-approval assessment of safety and effectiveness. 4,5 Multi-national MDBS can lead to more generalisable results, owing to the inclusion of heterogeneous patient populations, 1 and allow us to compare the safety and effectiveness of compounds between countries and regions, taking differences in ethnicity and health care systems into account. 5
MDBS are methodologically more complex than single-database studies as a result of inter-database heterogeneity, caused by differences in what information is recorded and how. This brings practical challenges such as how to standardise analyses across a distributed network and how to combine data or results. 2,6,7 In addition, there may be differences in the completeness of information across databases, potentially resulting in missing data, herein defined as any data that are relevant to the analysis or interpretation of a study but are not available to the study investigators at the time of analysis or reporting. Common data models (CDMs) and common protocols (CPs) have been used across database networks to mitigate bias due to heterogeneity in MDBS analyses, 2 but they cannot resolve differences in database completeness. Failure to account for missing data in epidemiologic studies can introduce bias, even altering the direction of treatment effect estimates, and reduce the precision of effect estimates. 8,9 For example, one study showed that risk estimates of venous thromboembolism associated with anti-osteoporotic medications were substantially affected by the use of different strategies for the handling of missing data, leading to differences in the direction of treatment effect estimates. 8
Missing data can arise at several stages within a multi-database pharmacoepidemiologic study. As in a single-database study, data may not be recorded at the stage of data entry into the database. For example, data may be partially (or sporadically) missing because a health professional recorded information in an unstructured manner (e.g., using free text) or did not collect information such as body mass index (BMI) and smoking status for certain patients because they were considered healthy. 10-12 MDBS can have the additional complexity of completely (or systematically) missing variables, where data on a certain variable are missing for all individuals in a database. 12,13 This may occur when certain variables are not recorded in a database because they are not required by the health authorities for reimbursement (in administrative/claims databases) or because the recording is not part of routine clinical practice (in EHR databases). A systematically missing confounder can lead to residual confounding in a study; a systematically missing variable used to define exposure can lead to exposure misclassification in one or more of the databases. There may also be some information loss during the extract, transform and load process (e.g., data which did not meet the quality criteria are omitted) or during the creation of the final analytical variables. This can happen, for example, when components of a composite variable are missing or if time restrictions are applied to a variable, such as a measurement being available within 12 months of the index date. 14
Methods to account for sporadically missing data, such as multiple imputation (MI) and inverse probability weighting, are widely known. 8,15 To handle systematically missing data, a practical approach is to exclude the missing variable from the analyses or to exclude an entire database. 8 A recently proposed alternative is multi-level MI (MLMI), which can account for both sporadically and systematically missing data. This approach utilises information on the covariance of variables in one dataset or database to impute missing information in another. 16-19
The reporting of missing data and of the methods used to address them is specified in the RECORD-PE and STROBE statements. 20,21 Without thorough reporting, we cannot have full confidence in the validity of the study estimates. Our aim for this review was first to measure the extent to which recent MDBS reported missing data and considered them as a potential source of bias. Second, we sought to identify which strategies are being reported in MDBS to deal with missing data.
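The distinction between sporadically and systematically missing variables can be made concrete with a small sketch. The Python below is illustrative only: the database names, variables and records are hypothetical toy values rather than data from any study in this review, and the three-way labelling (complete, sporadic, systematic) is a simplification assumed for the example.

```python
# Hypothetical toy records: one list of patient dicts per database.
# None marks a value that was not recorded.
databases = {
    "claims_db": [
        {"age": 71, "bmi": None, "smoking": None},
        {"age": 64, "bmi": None, "smoking": None},
        {"age": 58, "bmi": None, "smoking": None},
    ],
    "ehr_db": [
        {"age": 69, "bmi": 27.1, "smoking": "never"},
        {"age": 75, "bmi": None, "smoking": "current"},
        {"age": 62, "bmi": 31.4, "smoking": None},
    ],
}


def missing_fraction(records, variable):
    """Fraction of records in one database with no value for `variable`."""
    missing = sum(1 for record in records if record.get(variable) is None)
    return missing / len(records)


def classify(fraction):
    """Label a variable within one database by its missing fraction."""
    if fraction == 0.0:
        return "complete"
    if fraction == 1.0:
        return "systematic"  # missing for every individual in the database
    return "sporadic"        # missing for some individuals only


def missingness_report(dbs, variables):
    """Map (database, variable) to (percent missing, label)."""
    result = {}
    for name, records in dbs.items():
        for variable in variables:
            fraction = missing_fraction(records, variable)
            result[(name, variable)] = (round(100 * fraction, 1), classify(fraction))
    return result


report = missingness_report(databases, ["age", "bmi", "smoking"])
for key, value in sorted(report.items()):
    print(key, value)
```

In this toy example, BMI is systematically missing in the claims-style database and sporadically missing in the EHR-style database, which is the pattern that per-database reporting of missingness is meant to expose.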

| Study design and search criteria
We conducted a systematic literature search in PubMed to identify and report the methods used in recent peer-reviewed, multi-database pharmacoepidemiologic studies. The full list of inclusion and exclusion criteria can be seen in Table 1.

Key Points
• Missing data can lead to biased estimates of drug safety and effectiveness, a problem that can be exacerbated in multi-database studies.
• Forty-four percent of recent multi-database pharmacoepidemiologic studies did not report missing data and 69% did not report accounting for missing data.
• In studies which report missing data, lifestyle variables were most frequently reported missing (14%-29%).
• Most studies which reported a method to address missing data performed a complete case analysis (68%). Multiple imputation was the predominant statistical method used to handle missing data (22%).
• Variables with missing data, the potential bias and accounting for missing data should be thoroughly reported.

| Screening and selection
Title and abstract screening for eligibility were performed by one author (NBH). Another author (RP) independently screened the title and abstract of 100 articles. Any differences in this pilot screening were discussed and resolved. One author (NBH) then screened the full-text of the remaining articles for eligibility. Two authors (NBH and RP) independently reviewed the full list of included studies to confirm eligibility and disagreements were discussed with the other authors.

| Extraction of the general study characteristics
The general characteristics of each study were recorded: the study design (e.g., cohort or case-control), the journal, the exposure and outcome, the size (number of databases, countries, subjects), the database type (administrative/claims, EHR, other) and whether the study provided a pooled estimate. To categorise the strategies used to execute the MDBS, we used criteria described by ENCePP (a full description can be found in Gini et al. 28). We categorised each study as using one of four strategies: a local analysis, where the data extraction and analysis are conducted by individual centres (according to a CP); sharing of raw data, where the local site extracts the raw data and transfers them to a central partner for the analysis; the use of a study-specific CDM; or the use of a general CDM. 28

| Extraction of the outcomes
For the primary outcome, we measured how many of the included studies reported the existence of missing data; in what context this was reported (methods, results or as a limitation); which variables had missing data; the type of missingness (sporadic or systematic, as determined by the authors); the type of variable with missing data (exposure, outcomes or confounders); and the percentage of data missing.
We recorded whether the authors of each study discussed the extent to which the missing data had contributed to bias and whether missing data were missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). 29 For studies that reported missing data, we recorded which method was used to account for missing data. For studies which used MI, we additionally extracted information on the number of imputations per variable, the number of imputed variables and the statistical software used for the imputation.
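To illustrate why the missingness mechanism matters, the simulation below contrasts a complete case estimate of a mean under MCAR and under MNAR. It is a minimal sketch under stated assumptions: the BMI distribution, the missingness probabilities and the seed are hypothetical choices made for the example, and the MNAR rule (lower values are recorded less often) is only loosely motivated by the earlier remark that BMI may go unrecorded in patients considered healthy.

```python
import random
import statistics


def complete_case_mean(values, is_missing):
    """Mean of the values that remain after dropping missing records."""
    observed = [v for v, missing in zip(values, is_missing) if not missing]
    return statistics.fmean(observed)


rng = random.Random(2019)  # fixed seed so the sketch is reproducible

# Hypothetical "true" BMI values for 20 000 patients.
true_bmi = [rng.gauss(27.0, 4.0) for _ in range(20000)]
true_mean = statistics.fmean(true_bmi)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = [rng.random() < 0.3 for _ in true_bmi]

# MNAR: the chance of being missing depends on the unobserved value itself
# (60% when BMI is below 27, 10% otherwise).
mnar_mask = [rng.random() < (0.6 if v < 27.0 else 0.1) for v in true_bmi]

mcar_bias = complete_case_mean(true_bmi, mcar_mask) - true_mean
mnar_bias = complete_case_mean(true_bmi, mnar_mask) - true_mean

print(f"Bias of complete case mean under MCAR: {mcar_bias:+.2f}")
print(f"Bias of complete case mean under MNAR: {mnar_bias:+.2f}")
```

Under these assumptions the complete case mean stays close to the truth when data are MCAR but is shifted upwards when data are MNAR, which is why the assumed mechanism is worth stating alongside the chosen method.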

| Data analysis
Data extraction was piloted in five articles by two authors (NBH and RP). The remaining data extraction was conducted by a single author (NBH) and any uncertainties were discussed with the co-authors.
Extracted data were recorded and where applicable, data were transformed into a pre-specified answer list. Means, medians and ranges were calculated.
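As a worked check of the descriptive statistics, the sketch below recomputes the headline percentages from the counts given in the abstract. The helper name `percent` and the variable names are hypothetical and introduced only for illustration; the counts themselves are those reported in this review.

```python
def percent(numerator, denominator):
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


included_studies = 62
reported_missing_data = 35
reported_possible_bias = 11
reported_method = 19

# Methods used by the 19 studies that reported how missing data were handled.
method_counts = {
    "complete case analysis": 13,
    "multiple imputation": 2,
    "both methods": 2,
    "mean imputation": 1,
    "substitution from a similar variable": 1,
}

print("Reported missing data:", percent(reported_missing_data, included_studies), "%")
print("Discussed possible bias:", percent(reported_possible_bias, included_studies), "%")
print("Reported a method:", percent(reported_method, included_studies), "%")

method_shares = {
    method: percent(count, reported_method)
    for method, count in method_counts.items()
}
for method, share in method_shares.items():
    print(f"{method}: {share}%")
```

Note that the denominators differ: the first three percentages are out of all 62 included studies, whereas the method shares are out of the 19 studies that reported a method.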

| RESULTS
The search identified 2726 publications for title/abstract review (Figure 1). Sixty-two articles from forty-four scientific journals were eligible for inclusion (see supplementary included studies carried out their analysis locally per study site, of which five did not provide a pooled estimate of multiple databases. Ten (16%) reported sharing raw data for a central analysis and 24 (38%) used a CDM. Twenty-three (96%) of these used a general CDM, with 11 using the VSD and 7 using Sentinel's CDM; one study used a study-specific CDM.

Cardiovascular system: 6 (10%)
Blood and blood-forming organs: 6 (10%)
Nervous system: 5 (8%)
Genitourinary system and sex hormones: 2 (3%)
Multiple groups: 5 (8%)

| DISCUSSION
In this systematic review of recently published multi-database pharmacoepidemiologic studies, we found that out of 62 included articles, only 56% reported missing data, 18% reported whether missing data could have biased the study and 31% reported how they dealt with missing data. The reporting of missing data was slightly higher in studies which used a CDM and those which used European and Asian data compared to North-American data.
In contrast to Rioux et al., 30

Location:
In analysis: 22
In limitations: 4
In both: 9

In MDBS which use claims databases, the issue of systematically missing data may be larger than that of sporadically missing data. In a pharmacoepidemiologic MDBS, there is a high likelihood of missing data because the multiple databases involved may record different variables.
Furthermore, a recent study of established European health databases showed that vaccinations are captured in 38% of databases, while inpatient administered (5.8%) and over-the-counter drugs are rarely captured. 3 It is therefore possible that the studies in this review have underreported missing data in their analyses, although this depends on the relevance of these kinds of exposures to the included studies.
Reporting whether missing data could have contributed to bias is important; however, we found that only 11 (18%) studies did this. None of the included studies reported whether the data were missing at random (MAR, MCAR) or not (MNAR). This is a valuable factor when determining whether missing data have contributed to bias in the study and when considering what method to apply to correct for missing data. Under MNAR, the missingness pattern depends on unobserved variables, so it is a potentially greater source of bias. 9
Only 31% of the studies reported a missing data method; the other studies (69%) may have applied a missing data method without reporting it in the publication. In the absence of any additional missing data methods, a complete case analysis (CCA) is likely to have been performed. In 2012, it was reported that 81% of epidemiologic studies carry out a CCA. 31 More recently, it was reported that 70% carry out a CCA and 18% carry out MI. 30 It has already been recommended that missing data assumptions and the rationale for using a CCA should be reported, since a CCA may bias the study estimates because data are often not MCAR. 8,9 Three of the included studies reported a head-to-head comparison: two found that the use of MI compared to CCA did not change the conclusions of the study and one reported difficulty in comparing the methods due to a large quantity of missing data. We recommend that future studies conduct and report sensitivity analyses to clarify the potential impact of methodological choices when addressing missing data in their analyses. 8
One of the other possible solutions to deal with missing data is MI, which allows the inclusion of the patients who have missing data.
It is a method that assumes data are MAR but it can handle data which is MCAR or (under stronger assumptions) MNAR. 15  increase overall confidence in the study. 8,21,35 An overview of the recommended steps in addressing and reporting missing data in MDB pharmacoepidemiologic studies can be found in Figure 2.
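For readers less familiar with MI, its final step is usually to pool the estimates obtained from each imputed dataset. A minimal sketch of the standard pooling formulas (commonly known as Rubin's rules) is given below. This review did not extract how the included studies pooled their estimates, so the sketch illustrates typical practice rather than the method of any included study; the function name and the input values are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed dataset (e.g., a log hazard ratio)
    variances: the squared standard error of each of those estimates
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, total


# Hypothetical log hazard ratios and variances from m = 3 imputed datasets.
log_hazard_ratios = [0.5, 0.7, 0.6]
squared_standard_errors = [0.04, 0.04, 0.04]

pooled_estimate, total_variance = pool_rubin(log_hazard_ratios, squared_standard_errors)
print(f"Pooled estimate: {pooled_estimate:.3f}")
print(f"Total variance: {total_variance:.4f}")
```

The between-imputation term is what distinguishes MI from single imputation methods such as mean imputation: it carries the uncertainty about the missing values into the final standard error.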
In this review, we successfully captured MDB pharmacoepidemiologic studies from multiple regions around the world, expanding on previous work to identify these studies in a systematic search. 1 To our knowledge, a review of the reporting of missing data and the methods used to address them in MDBS has not previously taken place. However, there are some potential limitations to our study. First, we were not able to determine with certainty how much data were missing in each study or database, as we were limited to reviewing only what was reported in published studies. Future research could assess the origin and quantity of missing data by directly examining pharmacoepidemiologic databases, particularly before and after data processing, and compare the findings against what is routinely reported for these databases. Second, we might have missed relevant studies due to difficulties in detecting MDBS from systematic searches, as also indicated in similar studies. 1 For example, studies which use an established database network might not refer to the use of multiple databases in their abstract but instead to the network name. To account for this, we included names of well-known database networks in our search strategy, which were identified in collaboration with an expert.
In addition, we only included publications in English, which could limit the generalisability of our findings beyond Europe and North America.
Multi-database pharmacoepidemiologic studies are deemed to be essential for regulatory and clinical assessments of drug safety and effectiveness; thus, we must increase confidence in the potential that these studies can bring. 36 Missing data are a persistent problem in EHR, and they are underreported in multi-database pharmacoepidemiologic studies. The quantity and type of missing data, the resulting potential bias and the justification of the method used to address them should be reported.

ACKNOWLEDGEMENT
We would like to thank Xiaofeng Zhou (Pfizer Inc., Chair-elect ISPE Database SIG) for advice on specific database networks to search for this review.