Reporting to Improve Reproducibility and Facilitate Validity Assessment for Healthcare Database Studies V1.0

Abstract

Purpose: Defining a study population and creating an analytic dataset from longitudinal healthcare databases involves many decisions. Our objective was to catalogue scientific decisions underpinning study execution that should be reported to facilitate replication and enable assessment of validity of studies conducted in large healthcare databases.

Methods: We reviewed key investigator decisions required to operate a sample of macros and software tools designed to create and analyze analytic cohorts from longitudinal streams of healthcare data. A panel of academic, regulatory, and industry experts in healthcare database analytics discussed and added to this list.

Conclusion: Evidence generated from large healthcare encounter and reimbursement databases is increasingly being sought by decision‐makers. Varied terminology is used around the world for the same concepts. Agreeing on terminology and which parameters from a large catalogue are the most essential to report for replicable research would improve transparency and facilitate assessment of validity. At a minimum, reporting for a database study should provide clarity regarding operational definitions for key temporal anchors and their relation to each other when creating the analytic dataset, accompanied by an attrition table and a design diagram. A substantial improvement in reproducibility, rigor and confidence in real world evidence generated from healthcare databases could be achieved with greater transparency about operational study parameters used to create analytic datasets from longitudinal healthcare databases.


| INTRODUCTION
Modern healthcare encounter and reimbursement systems produce an abundance of electronically recorded, patient-level longitudinal data.
These data streams contain information on physician visits, hospitalizations, diagnoses made and recorded, procedures performed and billed, medications prescribed and filled, lab tests performed or results recorded, as well as many other date-stamped items. Such temporally ordered data are used to study the effectiveness and safety of medical products, healthcare policies, and medical interventions and have become a key tool for improving the quality and affordability of healthcare. 1,2 The importance and influence of such "real world" evidence is demonstrated by commitment of governments around the world to develop infrastructure and technology to increase the capacity for use of these data in comparative effectiveness and safety research as well as health technology assessments. [3][4][5][6][7][8][9][10][11][12] Research conducted using healthcare databases currently suffers from a lack of transparency in reporting of study details. [13][14][15][16] This has led to high profile controversies over apparent discrepancies in results and reduced confidence in evidence generated from healthcare databases. However, subtle differences in scientific decisions regarding specific study parameters can have significant impacts on results and interpretation, as was discovered in the controversies over 3rd generation oral contraceptives and risk of venous thromboembolism or statins and the risk of hip fracture. 17,18 Clarity regarding key operational decisions would have facilitated replication, assessment of validity and earlier understanding of the reasons that studies reported different findings.
The intertwined issues of transparency, reproducibility and validity cut across scientific disciplines. There has been an increasing movement towards "open science", an umbrella term that covers study registration, data sharing, public protocols and more detailed, transparent reporting. [19][20][21][22][23][24][25][26][27][28] To address these issues in the field of healthcare database research, a Joint Task Force between the International Society for Pharmacoepidemiology (ISPE) and the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) was convened to address transparency in process for database studies (e.g. "what did you plan to do?") and transparency in study execution (e.g. "what did you actually do?"). This paper, led by ISPE, focuses on the latter topic: reporting of the specific steps taken during study implementation to improve reproducibility and assessment of validity.
Transparency and reproducibility in large healthcare databases is dependent on clarity regarding 1) cleaning and other pre-processing of raw source data tables, 2) operational decisions to create an analytic dataset and 3) analytic choices (Figure 1). This paper focuses on reporting of design and implementation decisions to define and create a temporally anchored study population from raw longitudinal source data (Figure 1, Step 2). A temporally anchored study population is identified by a sentinel event, an initial temporal anchor. Characteristics of patients, exposures and/or outcomes are evaluated during time periods defined in relation to the sentinel event.
However, understanding how source data tables are cut, cleaned and pre-processed prior to implementation of a research study (Figure 1, Step 1), how information is extracted from unstructured data (e.g. natural language processing of free text from clinical notes), and how the created dataset is analyzed (Figure 1, Step 3) are also important parts of reproducible research. These topics have been covered elsewhere; 14,29-36 however, we summarize key points for those data provenance steps in the online appendix.

| Transparency
Transparency in what researchers initially intended to do protects against data dredging and cherry picking of results. It can be achieved with pre-registration and public posting of protocols before initiation of analysis. This is addressed in detail in a companion paper led by ISPOR. 37 Because the initially planned research and the design and methodology underlying reported results may differ, it is also important to have transparency regarding what researchers actually did to obtain the reported results from a healthcare database study. This can be achieved with clear reporting on the detailed operational decisions made by investigators during implementation. These decisions include how to define a study population (whom to study), and how to design and conduct an analysis (what to measure, when and how to measure it).

| Reproducibility and replicability
Reproducibility is a characteristic of a study or a finding. A reproducible study is one for which independent investigators implementing the same methods in the same data are able to obtain the same results (direct replication 38 ). In contrast, a reproducible finding is a higher order target than a reproducible study, which can be tested by conducting multiple studies that evaluate the same question and estimand (target of inference) but use different data and/or apply different methodology or operational decisions (conceptual replication 38 ) (Table 1).
Direct replicability is a necessary, but not sufficient, component of high quality research. In other words, a fully transparent and directly replicable research study is not necessarily rigorous nor does it necessarily produce valid findings. However, the transparency that makes direct replication possible means that validity of design and operational decisions can be evaluated, questioned and improved. Higher order issues such as conceptual replication of the finding can and should be evaluated as well; however, without transparency in study implementation, it can be difficult to ascertain whether superficially similar studies address the same conceptual question.
For healthcare database research, direct replication of a study means that if independent investigators applied the same design and operational choices to the same longitudinal source data, they should be able to obtain the same results (or at least a near exact reproduction).
In contrast, conceptual replication and robustness of a finding can be assessed by applying the same methods to different source data (or different years from the same source). Here, lack of replicability would not necessarily mean that one result is more "correct" than another, or refutes the results of the original. Instead, it would highlight a need for deeper inquiry to find the drivers of the differences, including differences in data definitions and quality, temporal changes or true differences in treatment effect for different populations. Conceptual replications can be further evaluated through application of different plausible methodologic and operational decisions to the same or different source data to evaluate how much the finding is influenced by the specific parameter combinations originally selected. This would encompass evaluation of how much reported findings vary with plausible alternative parameter choices, with implementation in comparable data sources, or after a flawed design or operational decision is corrected.
However, the scientific community cannot evaluate the validity and rigor of research methods if implementation decisions necessary for replication are not transparently reported.
The importance of achieving consistently reproducible research is recognized in many reporting guidelines (e.g. STROBE, 34 RECORD, 39 PCORI Methodology Report, 40 ENCePP 33 ) and is one impetus for developing infrastructure and tools to scale up capacity for generating evidence from large healthcare database research. 3,[41][42][43][44][45] Other guidelines, such as the ISPE Guidelines for Good Pharmacoepidemiology Practice (GPP), broadly cover many aspects of pharmacoepidemiology from protocol development, to responsibilities of research personnel and facilities, to human subject protection and adverse event reporting. 46 While these guidelines certainly increase transparency, even strict adherence to existing guidance would not provide all the information necessary for full reproducibility. In recognition of this issue, ISPE formed a joint task force with ISPOR specifically focused on improving transparency, reproducibility and validity assessment for database research, and supported a complementary effort to develop a version of the RECORD reporting guidelines with a specific focus on healthcare database pharmacoepidemiology.
Any replication of database research requires an exact description of the transformations performed upon the source data and how missing data are handled. Indeed, it has been demonstrated that when researchers go beyond general guidance and provide a clear report of the temporal anchors, coding algorithms, and other decisions made to create and analyze their study population(s), independent investigators following the same technical/statistical protocol and using the same data source are able to closely replicate the study population and results. 47

| The current status of transparency and reproducibility of healthcare database studies
Many research fields that rely on primary data collection have emphasized creation of repositories for sharing study data and analytic code. 48,49 In contrast to fields that rely on primary data collection, numerous healthcare database researchers routinely make secondary use of the same large healthcare data sources. However, the legal framework that enables healthcare database researchers to license or otherwise access raw data for research often prevents public sharing of both the raw source data and the created analytic datasets due to patient privacy and data security concerns. Access to data and code guarantees the ability to directly replicate a study. However, the current system for multi-user access to the same large healthcare data sources often prevents public sharing of that data. Furthermore, database studies require thousands of lines of code to create and analyze a temporally anchored study population from a large healthcare database. This is several orders of magnitude larger than the code required for analysis of a randomized trial or other dataset based on primary collection. Transparency requires clear reporting of the decisions and parameters used in study execution.
While we encourage sharing data and code, we recognize that for many reasons, including data use agreements and intellectual property, this is often not possible. We emphasize that simply sharing code without extensive annotation to identify where key operational and design parameters are defined would obfuscate important scientific decisions. Clear natural language description of key operational and design details should be the basis for sharing the scientific thought process with the majority of informed consumers of evidence.

| Recent efforts to improve transparency and reproducibility of healthcare database studies
To generate transparent and reproducible evidence that can inform decision-making at a larger scale, many organizations have developed infrastructure to more efficiently utilize large healthcare data sources. 9,[50][51][52][53][54][55][56] Recently developed comprehensive software tools from such organizations use different coding languages and platforms to facilitate identification of study populations, creation of temporally anchored analytic datasets, and analysis from raw longitudinal healthcare data streams. They have in common the flexibility for investigators to turn "gears and levers" at key operational touchpoints to create analytically usable, customized study populations from raw longitudinal source data tables. However, the specific parameters that must be user specified, the flexibility of the options and the underlying programming code differ. Many, but not all, reusable software tools go through extensive quality checking and validation processes to provide assurance of the fidelity of the code to intended action. Transparency in quality assurance and validation processes for software tools is critically important to prevent exactly replicable findings that lack fidelity to intended design and operational parameters.

TABLE 1 Reproducibility and replicability
Even with tools available to facilitate creation and analysis of a temporally anchored study population from longitudinal healthcare databases, investigators must still take responsibility for publicly reporting the details of their design and operational decisions. Due to the level of detail, these can be made available as online appendices or web links for publications and reports.

| Objective
The objective of this paper was to catalogue scientific decisions made when executing a database study that are relevant for facilitating replication and assessment of validity.
We emphasize that a fully transparent study does not imply that reported parameter choices were scientifically valid; rather, the validity of a research study cannot be evaluated without transparency regarding those choices. We also note that the purpose of this paper was not to recommend specific software or suggest that studies conducted with software platforms are better than studies based on de novo code.

| METHODS
In order to identify an initial list of key parameters that must be defined to implement a study, we reviewed 5 macro based programs and software systems designed to support healthcare database research (listed in appendix). We used this as a starting point because such programs are designed with flexible parameters to allow creation of customized study populations based on user specified scientific decisions. 54,[57][58][59][60] These flexible parameters informed our catalogue of operational decisions that would have to be transparent for an independent investigator to fully understand how a study was implemented and be able to directly replicate a study.
Our review included a convenience sample of macro based programs and software systems that were publicly available, developed by or otherwise accessible to members of the Task Force. Although the software systems used a variety of coding languages, from a methodologic perspective, differences in code or coding languages are irrelevant so long as study parameters are implemented as intended by the investigator.
In our review, we identified places where an investigator had to make a scientific decision between options or create study specific inputs to create an analytic dataset from raw longitudinal source data, including details of data source, inclusion/exclusion criteria, exposure definition, outcome definition, follow up (days at risk), baseline covariates, as well as reporting on analysis methods. As we reviewed each tool, we added new parameters that had not been previously encountered and synonyms for different concepts.
After the list of parameters was compiled, the co-authors, an international group of database experts, corresponded about these items and suggested additional parameters to include. In-person discussions took place following the ISPE mid-year meeting in London (2017).
This paper was opened to comment by ISPE membership prior to publication and was endorsed by ISPE's Executive Board on July 20, 2017. The paper was also reviewed by ISPOR membership and endorsed by ISPOR leadership.

| RESULTS
Our review identified many scientific decisions necessary to operate software solutions that would facilitate direct replication of an analytic cohort from raw source data captured in a longitudinal healthcare data source (Table 2). After reviewing the first two comprehensive software solutions, no parameters were added with review of additional software tools (i.e. a "saturation point" was reached). The general catalogue includes items that may not be relevant for all studies or study designs.
The group of experts agreed that the detailed catalogue of scientific decision points would enhance transparency and reproducibility, but noted that even if every parameter were reported, there was room for different interpretation of the language used to describe choices.
Therefore future development of clear, shared terminology and design visualization techniques would be valuable. While sharing source data and code should be encouraged (when permissible by data use agreements and intellectual property), this would not be a sufficient substitute for transparent, natural language reporting of study parameters.

| Data source
Researchers should specify the name of the data source, who provided the data (A1), the data extraction date (DED) (A2), data version, or data sampling strategy (A3) (when appropriate), as well as the years of source data used for the study (A4). As summarized in the appendix, source data may have subtle or profound differences depending on when the raw source data was cut for research use. Therefore, if an investigator were to run the same code to create and analyze a study population from the same data source twice, the results may not line up exactly if the investigator uses a different data version or raw longitudinal source data cut by the data holding organization at different time points.
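The data source items above lend themselves to a small structured record that travels with a study report. The following is a minimal sketch only: the class name `DataSourceReport`, its field names, and the example values are hypothetical and are not taken from any of the reviewed software tools. It records items A1 to A4 and checks whether two analyses refer to the same data cut.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DataSourceReport:
    """Hypothetical record of data source items A1-A4 (illustrative field names)."""
    name: str                         # name of the data source
    provider: str                     # A1: who provided the data
    extraction_date: Optional[date]   # A2: data extraction date (DED)
    version: Optional[str]            # A2: data version, if the holder issues one
    sampling_strategy: Optional[str]  # A3: sampling strategy, when appropriate
    first_year: int                   # A4: first year of source data used
    last_year: int                    # A4: last year of source data used

    def __post_init__(self) -> None:
        # At least one identifier of the data cut must be reported.
        if self.extraction_date is None and self.version is None:
            raise ValueError("report a data extraction date or a data version")
        if self.first_year > self.last_year:
            raise ValueError("first_year must not exceed last_year")

    def same_cut(self, other: "DataSourceReport") -> bool:
        """True when both reports name the same source, extraction date and version."""
        return (self.name, self.extraction_date, self.version) == (
            other.name, other.extraction_date, other.version)


# Hypothetical example values, for illustration only.
original = DataSourceReport(
    name="Example Claims Database",
    provider="Example Vendor",
    extraction_date=date(2016, 3, 31),
    version=None,
    sampling_strategy=None,
    first_year=2010,
    last_year=2015,
)
replication = replace(original, extraction_date=date(2017, 3, 31))
print(original.same_cut(replication))  # False: same source, different data cut
```

A mismatch flagged by `same_cut` does not invalidate a replication attempt; it only signals that differences in results may partly reflect differences in the underlying data cut.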
When a researcher is granted access to only a subset of raw longitudinal source data from a data vendor, the sampling strategy and any inclusions or exclusions applied to obtain that subset should be reported. For example, one could obtain access to a 5% sample of

It is also important for researchers to describe the types of data available in the data source (A5) and characteristics of the data such as the median duration of person-time within the data source. This is important for transparency and ability of decision-makers unfamiliar with the data source to assess the validity or appropriateness of selected design choices. The data type has implications for comprehensiveness of patient data capture. For example, is the data based on administrative or electronic health records? If the latter, does the data cover only primary care, inpatient settings or an integrated health system? Does it include lab tests, results or registry data? Does it contain data on prescribed medications or dispensed medications? Is there linkage between outpatient and inpatient data? Is there linkage to other data sources? (A6) If so, then who did the linkage, when and how?

A.2 Data extraction date (DED)
Description: The date (or version number) when data were extracted from the dynamic raw transactional data stream (e.g. date that the data were cut for research use by the vendor).

Example: We evaluated risk of outcome Z following incident exposure to drug X or drug Y. Incident exposure was defined as beginning on the day of the first dispensation for one of these drugs after at least 180 days without dispensations for either (SED). Patients with incident exposure to both drug X and drug Y on the same SED were excluded. The exposure risk window for patients with Drug X and Drug Y began 10 days after incident exposure and continued until 14 days past the last days supply, including refills. If a patient refilled early, the date of the early refill and subsequent refills were adjusted so that the full days supply from the initial dispensation was counted before the days supply from the next dispensation was tallied. Gaps of less than or equal to 14 days in between one dispensation plus days supply and the next dispensation for the same drug were bridged (i.e. the time was counted as continuously exposed). If patients exposed to Drug X were dispensed Drug Y or vice versa, exposure was censored. NDC codes used to define incident exposure to drug X and drug Y can be found in the appendix. Drug X was defined by NDC codes listed in the appendix. Brand and generic

D.2 Exposure risk window (ERW)
Description: The ERW is specific to an exposure and the outcome under investigation. For drug exposures, it is equivalent to the time between the minimum and maximum hypothesized induction time following ingestion of the molecule.
Synonyms: Drug era, risk window

D.2a Induction period 1
Description: Days on or following study entry date during which an outcome would not be counted as "exposed time" or "comparator time".

Description: The time over which patient covariates are assessed.
Example: We assessed covariates during the 180 days prior to but not including the SED.
Synonyms: Baseline period

G.2 Comorbidity/risk score
Description: The components and weights used in calculation of a risk score.
Example: See appendix for example. Note that codes, temporality, diagnosis position and care setting should be specified for each component when applicable.

G.3 Healthcare utilization metrics
Description: The counts of encounters or orders over a specified time period, sometimes stratified by care setting, or type of encounter/order.
Example: We counted the number of generics dispensed for each patient in the CAP. We counted the number of dispensations for each patient in the CAP. We counted the number of outpatient encounters recorded in the CAP. We counted the number of days with outpatient encounters recorded in the CAP. We counted the number of inpatient hospitalizations in the CAP; if admission and discharge dates for different encounters overlapped, these were "rolled up" and counted as 1 hospitalization.

If the raw source data is pre-processed, with cleaning up of messy fields or missing data, before an analytic cohort is created, the decisions in this process should be described (A7). For example, if the raw data is converted to a common data model (CDM) prior to creation of an analytic cohort, the CDM version should be referenced unilaterally dropped from all relational data tables, this should be documented in meta-data about the data source. If the data is periodically refreshed with more recent data, the date of the refresh should be reported as well as any changes in assumptions applied during the data transformation. 31,32 If cleaning decisions are made on a project specific basis rather than at a global data level, these should also be reported.
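The example exposure definition in the catalogue excerpt above (a 10-day induction period, early refills pushed back so that earlier days supply is used first, and gaps of 14 days or less bridged) can be written down as a short routine. This is a sketch under stated assumptions rather than the method of any reviewed tool: the function name `exposure_risk_window` and its parameters are hypothetical, a dispensation of n days supply is assumed to cover the dispensing date and the following n - 1 days, and censoring on a switch to the comparator drug is not handled.

```python
from datetime import date, timedelta


def exposure_risk_window(dispensations, induction_days=10, grace_days=14):
    """Return (first day, last day) of the exposure risk window for one patient.

    dispensations: list of (dispensing_date, days_supply) tuples for one drug.
    Assumption: n days supply covers the dispensing date plus the next n - 1 days.
    """
    if not dispensations:
        raise ValueError("at least one dispensation is required")
    ordered = sorted(dispensations)
    sed, first_supply = ordered[0]  # study entry date = first dispensation
    covered_until = sed + timedelta(days=first_supply)  # first day NOT covered
    for disp_date, supply in ordered[1:]:
        gap = (disp_date - covered_until).days
        if gap <= 0:
            # Early or on-time refill: count the earlier supply in full first.
            covered_until += timedelta(days=supply)
        elif gap <= grace_days:
            # Short gap: bridge it and continue from the new dispensation.
            covered_until = disp_date + timedelta(days=supply)
        else:
            # Gap longer than the grace period: the exposure episode ends.
            break
    start = sed + timedelta(days=induction_days)
    end = covered_until - timedelta(days=1) + timedelta(days=grace_days)
    return start, end


# Hypothetical dispensing history: an early refill, a 9-day gap, then a long gap.
history = [
    (date(2020, 1, 1), 30),
    (date(2020, 1, 25), 30),
    (date(2020, 3, 10), 30),
    (date(2020, 6, 1), 30),
]
print(exposure_risk_window(history))
```

Each boundary convention in this sketch (whether the dispensing date counts as day 1, whether the grace period is added after the last day of supply) is itself an operational decision of the kind the catalogue asks investigators to report.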

| Design
In addition to stating the study design, researchers should provide a design diagram that provides a visual depiction of first/second order temporal anchors (B1,

Temporal Anchors: Description

Base anchors (calendar time):
Data Extraction Date - DED: The date when the data were extracted from the dynamic raw transactional data stream.
Source Data Range - SDR: The calendar time range of data used for the study. Note that the implemented study may use only a subset of the available data.

First order anchors (event time):
Study Entry Date - SED: The dates when subjects enter the study.

Second order anchors (event time):
Enrollment Window - EW: The time window prior to SED in which an individual was required to be contributing to the data source.
Covariate Assessment Window - CW: The time during which all patient covariates are assessed. Baseline covariate assessment should precede cohort entry in order to avoid adjusting for causal intermediates.
Follow-Up Window - FW: The time following cohort entry during which patients are at risk to develop the outcome due to the exposure.
Exposure Assessment Window - EAW: The time window during which the exposure status is assessed. Exposure is defined at the end of the period. If the occurrence of exposure defines cohort entry, e.g. new initiator, then the exposure assessment may be a point in time rather than a window. If exposure assessment is after cohort entry, follow up must begin after exposure assessment.
Event Date - ED: The date of an event occurrence following cohort entry.
Washout for Exposure - WE: The time prior to cohort entry during which there should be no exposure (or comparator).
Washout for Outcome - WO: The time prior to cohort entry during which the outcome of interest should not occur.

1 Anchor dates are key dates; baseline anchors identify the available source data; first order anchor dates define entry to the analytic dataset, and second order anchors are relative to the first order anchor.

It is important to report on who can be included in a study.
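Because the second order anchors listed above are all defined relative to the Study Entry Date, they can be derived from the SED and a handful of reported window lengths. The sketch below is illustrative only: the function names, the default lengths of 180 days, the choice to start follow-up the day after the SED, and the use of the end of the source data range as a stand-in for the end of follow-up are all assumptions, not recommendations from this paper.

```python
from datetime import date, timedelta


def lookback_window(sed, days):
    """Inclusive window covering the `days` days before the SED, excluding the SED."""
    return sed - timedelta(days=days), sed - timedelta(days=1)


def anchor_windows(sed, sdr_start, sdr_end, enrollment_days=180,
                   covariate_days=180, washout_exposure_days=180,
                   followup_offset_days=1):
    """Map anchor abbreviations to inclusive (start, end) calendar dates.

    All defaults are hypothetical placeholders. Follow-up is capped at the end
    of the source data range; a real study would also end it at the outcome or
    at censoring events.
    """
    windows = {
        "EW": lookback_window(sed, enrollment_days),
        "CW": lookback_window(sed, covariate_days),
        "WE": lookback_window(sed, washout_exposure_days),
        "FW": (sed + timedelta(days=followup_offset_days), sdr_end),
    }
    for name, (start, end) in windows.items():
        # Every window must be observable within the source data range (SDR).
        if start < sdr_start or end > sdr_end or start > end:
            raise ValueError(f"{name} window is not covered by the source data range")
    return windows


windows = anchor_windows(
    sed=date(2015, 7, 1),
    sdr_start=date(2014, 1, 1),
    sdr_end=date(2016, 12, 31),
)
for name, (start, end) in windows.items():
    print(name, start.isoformat(), end.isoformat())
```

Reporting the inputs to a routine like this one (window lengths and whether each boundary is inclusive) conveys the same information as a design diagram in machine-readable form.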
Reporting should include specification of what type of exposure measurement is under investigation, for example prevalent versus incident exposure (D1). 64 If the latter, the criteria used to define incidence, including the washout window, should be clearly specified (C11). For example, incidence with respect to the exposure of interest only, the entire drug class, exposure and comparator, etc. When relevant, place of service used to define exposure should also be specified (e.g. inpatient versus outpatient).
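To illustrate how a washout window turns a dispensing history into an incident exposure date, the following sketch finds the first dispensation preceded by a fully observed washout with no dispensations. The function name `first_incident_date` and its parameters are hypothetical. Whether incidence is defined with respect to the exposure of interest only, the entire drug class, or exposure and comparator is decided by which dispensing dates the caller passes in.

```python
from datetime import date


def first_incident_date(dispensing_dates, enrollment_start, washout_days=180):
    """Return the first dispensing date that qualifies as incident, or None.

    A date qualifies when (a) the patient was enrolled for the full washout
    window before it and (b) no dispensation in `dispensing_dates` falls in
    the `washout_days` days before it.
    """
    previous = None
    for current in sorted(dispensing_dates):
        observable = (current - enrollment_start).days >= washout_days
        clean = previous is None or (current - previous).days > washout_days
        if observable and clean:
            return current
        previous = current
    return None


# Hypothetical history: two early dispensations, then a restart after a long gap.
dates = [date(2019, 3, 1), date(2019, 5, 1), date(2020, 1, 15)]
print(first_incident_date(dates, enrollment_start=date(2019, 1, 1)))  # 2020-01-15
```

In this example the first dispensation fails because the washout window is not fully observed, and the second fails because of the earlier dispensation, so only the third is treated as incident.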
Type of exposure (D1), when exposure is assessed and duration of exposure influence who is selected into the study and how long they

In addition, the statistical software program or platform used to create the study population and run the analysis should be detailed, including specific software version, settings, procedures or packages (I1).
The catalogue of items in Table 2 is important to report in detail in order to achieve transparent scientific decisions defining study populations and replicable creation of analytic datasets from longitudinal healthcare databases. We have highlighted in Table 3 key temporal anchors that are essential to report in the methods section of a paper, ideally accompanied with a design diagram (Figure 2). Other items from

Regardless of whether a study is conducted with software tools or de novo code, as part of a network or independently, a substantial improvement in transparency of design and implementation of healthcare database research could be achieved if specific design and operational decisions were routinely reported. We encourage researchers to prepare appendices that report in detail 1) data source provenance including data extraction date or version and years covered, 2) key temporal anchors (ideally with a design diagram), 3) detailed algorithms to define patient characteristics, inclusion or exclusion criteria, and 4) an attrition table with baseline characteristics of the study population before applying methods to deal with confounding.
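An attrition table of the kind recommended above can be produced as a by-product of applying inclusion and exclusion criteria in a fixed, reported order. The sketch below is a minimal illustration: the function name `attrition_table`, the patient fields, and the three criteria are hypothetical and stand in for a study's own operational definitions.

```python
def attrition_table(patients, criteria):
    """Apply criteria in order; return (rows, remaining patients).

    criteria: list of (label, keep) pairs, where keep(patient) is True when the
    patient satisfies the criterion. Each row is (label, n remaining, n excluded).
    """
    remaining = list(patients)
    rows = [("Source population", len(remaining), 0)]
    for label, keep in criteria:
        kept = [p for p in remaining if keep(p)]
        rows.append((label, len(kept), len(remaining) - len(kept)))
        remaining = kept
    return rows, remaining


# Hypothetical patient records and criteria, for illustration only.
patients = [
    {"id": 1, "age": 45, "enrolled_days": 400, "prior_outcome": False},
    {"id": 2, "age": 17, "enrolled_days": 400, "prior_outcome": False},
    {"id": 3, "age": 60, "enrolled_days": 90, "prior_outcome": False},
    {"id": 4, "age": 70, "enrolled_days": 365, "prior_outcome": True},
    {"id": 5, "age": 52, "enrolled_days": 200, "prior_outcome": False},
]
criteria = [
    ("Age 18 or older at SED", lambda p: p["age"] >= 18),
    ("At least 180 days of enrollment before SED", lambda p: p["enrolled_days"] >= 180),
    ("No outcome during the washout window", lambda p: not p["prior_outcome"]),
]
rows, cohort = attrition_table(patients, criteria)
for label, n_remaining, n_excluded in rows:
    print(f"{label}: {n_remaining} remaining, {n_excluded} excluded")
```

Because the counts depend on the order in which criteria are applied, the order itself is part of what needs to be reported.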
The ultimate measure of transparency is whether a study could be directly replicated by a qualified independent investigator based on publicly reported information. While sharing data and code should be encouraged whenever data use agreements and intellectual property permit, in many cases this is not possible. Even if data and code are shared, clear, natural language description would be necessary for transparency and the ability to evaluate the validity of scientific decisions.
In many cases, attempts from an independent investigator to directly replicate a study will be hampered by data use agreements that prohibit public sharing of source data tables and differences in source data tables accessed from the same data holder at different times.
Nevertheless, understanding how closely findings can be replicated by an independent investigator when using the same data source over the same time period would be valuable and informative. Similarly, evaluation of variation in findings from attempts to conceptually replicate an original study using different source data or plausible alternative parameter choices can provide substantial insights. Our ability to understand observed differences in findings after either direct or conceptual replication relies on clarity and transparency of the scientific decisions originally implemented.
This paper provides a catalogue of specific items to report to improve reproducibility and facilitate assessment of validity of healthcare database analyses. We expect that it will grow and change over time with input from additional stakeholders. This catalogue could be used to support parallel efforts to improve transparency and repro-

Participated in small group discussion and/or provided substantial feedback prior to ISPE/ISPOR membership review