Design considerations, architecture, and use of the Mini-Sentinel distributed data system

Authors


J. S. Brown, Department of Population Medicine, 133 Brookline Ave., 6th floor, Boston, MA 02215, USA. E-mail: jeff_brown@harvardpilgrim.org

ABSTRACT

Purpose

We describe the design, implementation, and use of a large, multiorganizational distributed database developed to support the Mini-Sentinel Pilot Program of the US Food and Drug Administration (FDA). As envisioned by the US FDA, this implementation will inform and facilitate the development of an active surveillance system for monitoring the safety of medical products (drugs, biologics, and devices) in the USA.

Methods

A common data model was designed to address the priorities of the Mini-Sentinel Pilot and to leverage the experience and data of participating organizations and data partners. A review of existing common data models informed the process. Each participating organization designed a process to extract, transform, and load its source data, applying the common data model to create the Mini-Sentinel Distributed Database. Transformed data were characterized and evaluated using a series of programs developed centrally and executed locally by participating organizations. A secure communications portal was designed to facilitate queries of the Mini-Sentinel Distributed Database and transfer of confidential data, analytic tools were developed to facilitate rapid response to common questions, and distributed querying software was implemented to facilitate rapid querying of summary data.

Results

As of July 2011, information on 99 260 976 health plan members was included in the Mini-Sentinel Distributed Database. The database includes 316 009 067 person-years of observation time, with members contributing, on average, 27.0 months of observation time. All data partners have successfully executed distributed code and returned findings to the Mini-Sentinel Operations Center.

Conclusion

This work demonstrates the feasibility of building a large, multiorganizational distributed data system in which organizations retain possession of their data that are used in an active surveillance system. Copyright © 2012 John Wiley & Sons, Ltd.

INTRODUCTION

Mini-Sentinel is a major component of the Sentinel Initiative, the response of the US Food and Drug Administration (FDA) to a congressional mandate to create an active surveillance system accessing electronic health data for 25 million people by 2010 and 100 million people by 2012.[1, 2] The objective of Mini-Sentinel is to inform and facilitate the development of an active surveillance system for monitoring the safety and safe use of medical products, including drugs, biologics, and devices.[2]

Mini-Sentinel uses a distributed data approach in which participating data partners maintain physical and operational control over electronic data at their sites.[3-7] By allowing data partners to maintain control of their data and its uses, the distributed model reduces many security, proprietary, legal, and privacy concerns, such as those related to the Health Insurance Portability and Accountability Act.

In a distributed data environment, analyses may be conducted using a decentralized or centralized approach. In the decentralized approach, each data partner translates a study protocol into analytic programs that are executed against local data. In contrast, the centralized approach relies on analytic code that is developed centrally and distributed to each data partner to execute against data that are stored in a common format.[5] Mini-Sentinel chose the centralized approach because it reduces the potential for the analytic inconsistency that arises when different investigators and analysts implement a study protocol differently.[8] This approach was successful in several other multi-institutional projects, including the Vaccine Safety Datalink[9, 10] project, the HMO Research Network,[11] the Meningococcal Vaccine Safety[6] study, the Observational Medical Outcomes Partnership,[12] and the Post-Licensure Rapid Immunization Safety Monitoring[13] project.

Five data partners were involved in the initial implementation of the Mini-Sentinel common data model (CDM) and the development of the Mini-Sentinel distributed data system: HealthCore, Inc.; the HMO Research Network;[11] Humana; the Kaiser Permanente Center for Effectiveness and Safety Research; and Vanderbilt University. Together, we set out to build a system that would support distributed querying for active safety surveillance. In this report, we describe the initial design of the Mini-Sentinel CDM, the development of the Mini-Sentinel Distributed Database, and the analytic tools developed to facilitate its use. Governance[14] and privacy issues[15] are addressed elsewhere.

METHODS

Design of the Mini-Sentinel CDM

Principles and priorities

The design process began in March 2010 with the review of the FDA Sentinel Initiative reports,[16] the articulation of a set of guiding principles (Table 1), and the establishment of initial priorities. The objectives for the first year were to create a working distributed data system and to develop the capability to respond to FDA queries, both of which required a highly focused yet readily extensible model. We defined a “working” system as one in which data partners could extract, transform, and load source data according to a CDM and implement analytic tools for distributed querying. Furthermore, we required that the system be both transparent and flexible so that problems could be identified and addressed as they arose.

Table 1. Guiding principles for the development of the Mini-Sentinel common data model
  1. MSCDM, Mini-Sentinel common data model; FDA, US Food and Drug Administration.

• Data partners have the best understanding of their data, its origin, and its uses; valid use and interpretation of findings require input from the data partners.
• After thorough testing, distributed programs should be executed with site-specific modifications for file locations only.
• The MSCDM accommodates all requirements of Mini-Sentinel data activities and may change to meet FDA objectives.
• The MSCDM is able to incorporate new data types and data elements as needs indicate.
• Development of the initial MSCDM and all enhancements require input from and acceptance by the data partners.
• Documentation of data-partner-specific issues and qualifiers that may impact use and interpretation of the data is necessary for the effective operation of Mini-Sentinel activities.
• The MSCDM design is transparent, intuitive, well-documented, and easily understood by analysts, investigators, and stakeholders. It is easy for experienced analysts and investigators to use; special skills or knowledge beyond those commonly found among pharmacoepidemiologists and professional analytic staff are not necessary.
• The MSCDM leverages evolving health care coding standards.
• The MSCDM captures values found in the source data. When necessary, mapping to standard vocabularies is transparent. Validated mappings should be used whenever available.
• Only the minimum necessary information should be used and shared with authorized staff of the Mini-Sentinel Coordinating Center.
• Data partners may include site-specific variables in their implementation of the MSCDM.

In setting priorities, we agreed that the initial model should rely on electronic claims-type and administrative data with expansion to other data sources in subsequent years. Administrative and claims data from public (e.g., Medicare) and private (e.g., health insurers) payers are widely used for medical product safety studies and evaluations,[17-19] and their strengths and weaknesses for medical product safety surveillance evaluations are well understood.[20] In addition, there was consensus that the Mini-Sentinel CDM should reflect the guiding principles, leverage the cumulative experience of the data partners, and be consistent with the standards established by the Office of the National Coordinator, such as clinical coding standards typically found in claims systems (e.g., International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM], Current Procedural Terminology [CPT], and National Drug Codes [NDC]).

Data inventory

We inventoried the administrative and claims data holdings of the data partners. The basic data elements available in administrative and claims databases are relatively consistent across payers due to the common business requirements of billing submission systems.[20] For example, all data partners use ICD-9-CM terminology to code diagnoses and ICD-9-CM and CPT codes to code procedures. However, some data partners have access to data elements not currently available to other partners (e.g., vital signs and laboratory results), and those unique data elements may prove valuable for medical product safety surveillance activities.

We also surveyed data partners about other data streams that could be added to the CDM in subsequent years. An inventory of electronic health record systems gathered information on the number of patients, the breadth and longitudinal scope of the data, and the potential for linkage of data across settings and sites. Information on previous validation studies and the use of standardized clinical terminologies was also requested. A separate survey explored the availability and content of device and disease registries maintained by Mini-Sentinel collaborating institutions.

Evaluation of existing CDMs

The evaluation of existing CDMs included both a thorough document review and discussions with key personnel involved in their creation and implementation. Several members of the Mini-Sentinel Coordinating Center and all data partners had been involved in antecedent projects and have extensive experience in designing and implementing distributed models. Included in the review process were CDMs used by the HMO Research Network,[11] Vaccine Safety Datalink,[9, 21] the Meningococcal Vaccine Study,[6] Post-Licensure Rapid Immunization Safety Monitoring,[13] and the Observational Medical Outcomes Partnership.[22] The Mini-Sentinel team also explored other distributed data models, including i2b2,[23] ePCRN,[24] and the Teradata[25] health care data model.

Several themes emerged from the review. First, it is feasible for multiple organizations to adhere to a common data structure and to assemble patient-level files from their source data. Second, organizations can retain complete control of their data while simultaneously using centrally developed programs that generate reports and aggregated files for analyses. Third, although organizations adhere to clinical coding standards in most cases, locally developed codes are occasionally used, and the CDM must account for that coding variability. Fourth, organizations vary in the latency of their data and the frequency with which they can refresh the data. Fifth, extensive and ongoing involvement of data partners is crucial for successful implementation and use of a distributed system. Finally, antecedent distributed models[6, 9, 11, 13, 21] demonstrate that analytic imperatives could be met when patient-level files remained behind organizational firewalls and only aggregated data files were shared.

Development of CDM specifications

The initial version of the data model reflected the guiding principles, the themes that emerged from the review of other distributed models, and the data partners' experiences. Over a 3-month period, each data partner and key FDA staff provided specific, detailed feedback on each data element in every table of the proposed CDM, including information on the specificity of the variable definition and whether the data partner could populate the variable as specified. Data partners also provided recommendations on the contents, structure, definitions, and relationships of tables and variables within the CDM.

Data partners identified many issues, including challenges in identifying unique “encounters,” ascertainment of death outside institutional settings, the presence of various coding sets for diagnoses and procedures, questions about transforming data to the CDM specifications, whether to include utilization data for care provided to individuals who were not health plan members, and whether to include denied, rejected, and adjustment claims. Answers to these questions resulted in improved precision of the definitions of tables and of variables included in tables.

The data partners expressed a preference for a CDM that reflected the concepts and granularity of the data stored in their source files and for excluding concepts that are not intrinsically part of the original data systems. For example, data partners placed a high priority on a data model that minimized the need to routinely calculate derived variables or to maintain secondary data tables, issues that are especially important in Mini-Sentinel because frequent data updates are required. Data partners also expressed concern about the effort required to maintain up-to-date, complex clinical mappings because the underlying data systems are dynamic and would likely require new mappings with each data update. At the conclusion of the review process, data partners verified that the necessary data elements were available in source databases and the requirements were consistent with their expectations.

Mini-Sentinel CDM, Version 1.1

The Mini-Sentinel CDM Version 1.1 (MSCDM V1.1)[26] reflects the guiding principles, contains many of the data elements needed for medical product safety evaluations, and balances the capabilities of the data partners with the needs of the Mini-Sentinel pilot. Eight tables represent specific data domains that are available in administrative and claims data and necessary to accomplish medical product safety surveillance evaluations (Table 2). The overall table structure is designed to facilitate data access while preserving the granularity and nature of the source data. Specifically, the data tables keep similar coding and operational concepts together and, whenever possible, keep distinct data streams separate so that tables can be updated individually and at different intervals, if necessary. For example, outpatient pharmacy dispensings are held separately from other claims sources so that the pharmacy table can be updated without necessarily affecting other tables in the data model.

Table 2. Mini-Sentinel common data model V1.1 tables

| Table | Description | Key data elements |
|---|---|---|
| Enrollment | Contains records for all individuals who were health plan members of the data partner during the period included in the data extract | Unique person identifier; start and end dates of coverage; flags to indicate medical and pharmacy coverage |
| Demographics | Includes everyone in the data partner database and is not limited to members included in the enrollment table | Unique person identifier; date of birth; sex, race, and ethnicity |
| Outpatient pharmacy dispensing | Includes each outpatient pharmacy dispensing picked up by an individual | Unique person identifier; dispensed date; NDC; days supplied and amount dispensed |
| Encounter | Contains one record for each time an individual sees a provider in the ambulatory setting or is hospitalized; multiple encounters per day are possible if they occur in different care settings | Unique person identifier; encounter identifier; encounter type (e.g., inpatient, outpatient, emergency department); start and end date for encounter; discharge status and disposition |
| Diagnosis | Linked to the encounter table in a many-to-one relationship so that all of the associated diagnoses are recorded in the diagnosis table | Unique person identifier; encounter identifier; diagnosis code; type of code (e.g., ICD-9-CM) |
| Procedure | Linked to the encounter table in a many-to-one relationship so that all of the associated procedures are recorded in the procedure table | Unique person identifier; encounter identifier; procedure code; type of code (e.g., ICD-9-CM, CPT4) |
| Death | Contains one record per death | Unique person identifier; date of death; data source (e.g., National Death Index, State) |
| Cause of Death | Contains one record per cause of death | Unique person identifier; cause of death diagnosis code; data source (e.g., National Death Index, State) |

NDC, National Drug Codes; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; CPT, Current Procedural Terminology.

An internal, unique person identifier is included in all tables to allow linkage across the tables and a comprehensive view of patient care during an enrollment period. The unique person identifier is not a true identifier (e.g., Social Security number or medical record number). Rather, the unique person identifier is assigned by the data partner, is unique to that data partner, and is not shared with either the Mini-Sentinel Operations Center or the FDA. However, a crosswalk between the unique person identifier and the true identifier is retained by the data partner in the event that retrieval of medical records is required for a specific surveillance activity.
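The linkage role of the unique person identifier can be illustrated with a minimal sketch. The example below is not the MSCDM V1.1 specification: the table names, column names, and sample values (`demographic`, `enr_start`, `drug_cov`, the NDC string, and so on) are hypothetical stand-ins for the concepts listed in Table 2, and SQLite is used only because it is self-contained. It shows three of the eight tables joined on a partner-assigned identifier to find dispensings that fall within a period of pharmacy coverage.

```python
import sqlite3

# Hypothetical physical names; the paper describes concepts, not column names.
SCHEMA = """
CREATE TABLE demographic (
    person_id   TEXT PRIMARY KEY,  -- partner-assigned, not a true identifier
    birth_date  TEXT,
    sex         TEXT
);
CREATE TABLE enrollment (
    person_id   TEXT REFERENCES demographic(person_id),
    enr_start   TEXT,
    enr_end     TEXT,
    med_cov     INTEGER,           -- 1 = medical coverage
    drug_cov    INTEGER            -- 1 = pharmacy coverage
);
CREATE TABLE dispensing (
    person_id   TEXT REFERENCES demographic(person_id),
    rx_date     TEXT,
    ndc         TEXT,
    days_supply INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database with one synthetic member."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO demographic VALUES ('P1', '1960-05-01', 'F')")
    conn.execute(
        "INSERT INTO enrollment VALUES ('P1', '2008-01-01', '2009-12-31', 1, 1)"
    )
    conn.executemany(
        "INSERT INTO dispensing VALUES (?, ?, ?, ?)",
        [
            ("P1", "2009-03-10", "00000000001", 30),
            ("P1", "2010-02-01", "00000000001", 30),  # after coverage ended
        ],
    )
    return conn


def dispensings_during_enrollment(conn):
    """Link dispensings to enrollment periods through the person identifier."""
    sql = """
        SELECT d.person_id, d.rx_date, d.ndc
        FROM dispensing AS d
        JOIN enrollment AS e
          ON e.person_id = d.person_id
         AND d.rx_date BETWEEN e.enr_start AND e.enr_end
        WHERE e.drug_cov = 1
        ORDER BY d.rx_date
    """
    return conn.execute(sql).fetchall()
```

Because the identifier is meaningful only within one data partner, a join of this kind can run only at that partner's site; nothing in the sketch implies linkage across partners.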

The MSCDM V1.1 can be expanded to accommodate new data domains, typically through the addition of new tables. For example, adding laboratory test results, vital signs, or vaccination records from immunization registries will entail creating the structure and adding new tables to the existing model; there is no need to change existing CDM tables. The structure maintains the information found in the source data and does not require a priori mapping to external standards. This structure can accommodate different algorithmic approaches for mapping coding standards and defining exposures, outcomes, and evaluation eligibility criteria.

Implementation of the Mini-Sentinel CDM

The Mini-Sentinel Operations Center provided guidance on the extract-transform-and-load (ETL) process via weekly calls and individual consultations and in response to questions that arose during data checking. Each data partner designed a process to extract, transform, and load their source data according to the MSCDM V1.1 specifications. Source data varied by data partner. Claims data were the only source used by some data partners (e.g., HealthCore, Humana, and Vanderbilt), whereas other data partners (e.g., Kaiser Permanente, some HMO Research Network sites) had access to claims data and electronic health records. Data partners stored the transformed data as tables within enterprise-level relational database management systems or in SAS data sets (SAS Institute Inc, Cary, North Carolina). Upon completion of the ETL process, each data partner prepared a detailed report that described the local approach to creating each CDM table. Reports included details about local data sources accessed to create the tables, issues encountered during the transformation, and high-level data summaries (e.g., age and sex distributions). Collectively, these local implementations of the MSCDM V1.1 comprise the Mini-Sentinel Distributed Database.

Next, transformed data were evaluated using a series of programs developed by the Mini-Sentinel Operations Center and executed by each data partner. First, the completeness and content of each variable in each table were reviewed to ensure that the data conform to MSCDM V1.1 specifications. Next, the logical relationship and integrity of data values within and across variables and within and across tables were examined. For example, it is permissible for the unique person identifier to occur more than once in the enrollment table, as there can be more than one period of enrollment for an individual. However, in the demographic table, the unique person identifier should occur only once. Finally, the consistency of data distributions was examined over time and across data partners. Examples include assessing trends in outpatient pharmacy dispensings per member per month, total dispensings per month, hospital admissions per member per month, and total encounters by encounter type per month. Data partners used the reports generated by the programs to identify issues in the ETL process and make necessary corrections. Final reports were sent to the Mini-Sentinel Operations Center to enable cross-site comparisons. From these cross-site comparisons, the Mini-Sentinel Operations Center generated partner-specific reports that were provided to and discussed with each data partner. After review, each data partner responded in writing to questions raised, and an agreement was reached between the data partner and the Mini-Sentinel Operations Center regarding necessary modifications to subsequent ETL processes. Most data partners refresh their implementation of the CDM on a quarterly basis. All centrally developed programs were written in SAS Version 9.2.
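Two of the checks described above can be sketched in a few lines. The centrally developed programs were written in SAS and are not reproduced in this report; the Python below is only an illustration, and the row layout (dictionaries with `person_id` and `rx_date` keys, dates as `YYYY-MM-DD` strings) and function names are assumptions made for the example.

```python
from collections import Counter


def duplicate_person_ids(demographic_rows):
    """Return identifiers that occur more than once in the demographic table.

    Repeats are permissible in the enrollment table (several enrollment
    periods per person) but not in the demographic table.
    """
    counts = Counter(row["person_id"] for row in demographic_rows)
    return sorted(pid for pid, n in counts.items() if n > 1)


def dispensings_per_member_per_month(dispensing_rows, members_by_month):
    """Compute monthly dispensings divided by members enrolled in that month.

    members_by_month maps 'YYYY-MM' to a member count that is assumed to
    have been derived separately from the enrollment table.
    """
    per_month = Counter(row["rx_date"][:7] for row in dispensing_rows)
    return {
        month: per_month.get(month, 0) / members
        for month, members in sorted(members_by_month.items())
        if members > 0
    }
```

A sudden change in the monthly rate for one data partner, or a rate far from that of the other partners, would be the kind of finding that prompts a review of the local ETL process.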

In addition, the Mini-Sentinel Operations Center formally assessed the technical environment at each data partner, including operational issues, institutional technical and operational constraints, and the capacity of the staff to respond to requests. The technical report helped to identify technical variations such as servers, server settings, SAS versions, and storage space. Partners reported that executing Mini-Sentinel standard programs in the current data partner environments had minimal impact.

During the technical assessment, concerns were voiced about communication between the Mini-Sentinel Operations Center and the data partners regarding expectations and timelines. In addition, some data partners raised concerns about the length of run time for the data checking programs and the granularity of information requested. Several solutions were developed, including modification of the data checking code, local adjustments to server settings to allocate more memory, and increased use of the secure portal for data checking results.

Querying the Mini-Sentinel Distributed Database

Figure 1 shows the process for querying the Mini-Sentinel Distributed Database. First, an executable program (“query”) is submitted to the Mini-Sentinel Secure Portal. Each data partner is notified of the query, retrieves it from the secure portal, and executes the program on the local implementation of the MSCDM. The data partner reviews the results and transfers them to the secure portal for retrieval by the Mini-Sentinel Operations Center. Throughout the process, individual-level data remain behind institutional firewalls and data partners maintain control of their data. Data partners can review all queries before local execution and can review query results before results are transferred to the secure portal. The system is designed with multiple manual steps to allow partners to review and approve all requests according to local policies, processes, and procedures; different levels of query automation can be set at the discretion of the data partner. The portal is hosted in a private cloud environment in a Tier 3 data center that complies with the Federal Information Security Management Act of 2002.

Figure 1. Querying the Mini-Sentinel Distributed Database

  1. Query (an executable program) is created and submitted by an authorized user on the secure network portal.
  2. Data partners are notified of the query, and they retrieve it from the secure network portal.
  3. Data partners review and run query against their local data.
  4. Data partners review results.
  5. Data partners securely return results to the secure network portal for review by requestor.
  6. Authorized user retrieves results from the secure network portal.
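
The six steps can be mimicked with a small in-process simulation. This is a sketch of the control flow only, not of the portal software: the class names `Portal` and `DataPartner`, the approval callbacks, and the example query `count_by_sex` are all hypothetical, and a real deployment involves authenticated network transfer and human review rather than function calls.

```python
class Portal:
    """Stand-in for the secure portal: stores queries and returned summaries."""

    def __init__(self):
        self.queries = {}   # query_id -> executable program (a callable here)
        self.results = {}   # query_id -> {partner name -> summary}

    def submit(self, query_id, program):            # step 1
        self.queries[query_id] = program
        self.results[query_id] = {}

    def retrieve_results(self, query_id):           # step 6
        return dict(self.results[query_id])


class DataPartner:
    """A site that holds individual-level rows and decides what leaves."""

    def __init__(self, name, local_rows, approve_query=None, approve_result=None):
        self.name = name
        self._local_rows = local_rows               # never handed to the portal
        self.approve_query = approve_query or (lambda query_id: True)
        self.approve_result = approve_result or (lambda summary: True)

    def handle(self, portal, query_id):
        program = portal.queries[query_id]          # step 2: retrieve the query
        if not self.approve_query(query_id):        # step 3: review before running
            return False
        summary = program(self._local_rows)         # step 3: run on local data
        if not self.approve_result(summary):        # step 4: review the results
            return False
        portal.results[query_id][self.name] = summary   # step 5: return summary
        return True


def count_by_sex(rows):
    """Example query: an aggregate count, so no individual rows are returned."""
    counts = {}
    for row in rows:
        counts[row["sex"]] = counts.get(row["sex"], 0) + 1
    return counts
```

In this sketch a partner that declines a query simply contributes no result, which mirrors the point that each partner reviews requests and results according to its own policies.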

Questions posed by the FDA may benefit from an initial assessment of how often a drug is dispensed or how many enrollees have a diagnosis or procedure of interest. To facilitate rapid response to such questions, Mini-Sentinel has implemented the Mini-Sentinel Distributed Query Tool, based on the PopMedNet™ software application, which allows rapid querying of stratified summary counts of each of the enrollment and utilization tables in the local implementation of MSCDM V1.1. The summary tables reside on the data partners' systems, and the software is used to securely transmit queries and post results.[27] Each query includes a description. Data partners can choose to evaluate queries before execution, or these summary table queries can be run automatically. The expectation is that results from summary table queries will be returned to the Mini-Sentinel Operations Center within 2 business days.
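
The idea of answering a question from pre-aggregated counts, rather than from person-level records, can be sketched as follows. This does not reproduce the Distributed Query Tool or the PopMedNet interface; the stratification variables (drug, year, age group, sex), the row layout, and the function names are assumptions chosen for illustration.

```python
from collections import Counter


def build_summary_table(dispensing_rows):
    """Pre-aggregate dispensings into counts by (drug, year, age group, sex)."""
    table = Counter()
    for row in dispensing_rows:
        key = (row["drug"], row["year"], row["age_group"], row["sex"])
        table[key] += 1
    return table


def query_summary(table, drug, years):
    """Answer a query from the summary table alone, without person-level rows."""
    return {
        (year, age_group, sex): n
        for (name, year, age_group, sex), n in table.items()
        if name == drug and year in years
    }
```

Because the summary table is built once per data refresh, a query of this kind touches only a small aggregate, which is consistent with the short turnaround expected for summary table queries.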

To facilitate the rapid execution of common but slightly more complex inquiries, a series of modular programs was developed, tested, and refined by the Mini-Sentinel Operations Center and the data partners (Table 3).[28] Each program allows the requester to specify input parameters, including date ranges, age groups, and exposures and outcomes of interest, and to describe the specific question being addressed. The data partner initializes a set of variables with names and locations of MSCDM tables but does not otherwise alter the program. The output contains summary-level counts (e.g., number of members exposed to a drug, number of newly exposed members, and number of members with a specific diagnosis or condition) stratified by various parameters (e.g., age group, sex, and year). The expectation is that results from modular program requests will be returned to the Mini-Sentinel Operations Center within 5 business days. Additional modular programs will be developed and existing programs enhanced on an ongoing basis in collaboration with FDA and its centers.

Table 3. Modular programs to facilitate rapid querying for common questions

| Modular program | Description | Parameters | MSCDM tables | Example |
|---|---|---|---|---|
| 1. Medication use | Characterizes the use of specified products (defined by NDC) | Date range, drug list, duration of washout period, age groups | Enrollment; Demographic; Outpatient pharmacy dispensing | Use of statins by age group and sex, over time; includes prevalent and incident users |
| 2. Medication use by condition | Characterizes the use of specified products (defined by NDC) among individuals with a specified condition (defined by ICD-9-CM diagnosis codes) | Date range, drug list, duration of washout period, follow-up period, diagnosis codes, care setting, age groups | Enrollment; Demographic; Outpatient pharmacy dispensing; Diagnosis | Use of asthma medications among individuals with an asthma diagnosis by age group and sex, over time; includes prevalent and incident users |
| 3. Incident use and outcomes | Evaluates the rate of specified outcomes (defined by ICD-9-CM diagnosis codes) among individuals using specified products (defined by NDC) | Date range, drug list, duration of washout period for drugs and events, treatment episode, pre-existing conditions, follow-up period, diagnosis codes, care setting, age groups | Enrollment; Demographic; Outpatient pharmacy dispensing; Diagnosis | Rate of inpatient diagnosis of acute myocardial infarction after incident use of an antidiabetes agent among individuals with a diagnosis of diabetes |
| 4. Concomitant medication use | Characterizes the concomitant use of products (defined by NDC) among individuals using specified products, with or without a pre-existing condition (ICD-9-CM diagnosis codes) | Date range, drug lists (study and concomitant drugs), duration of washout periods, treatment episode, pre-existing conditions, diagnosis codes, care setting, age groups | Enrollment; Demographic; Outpatient pharmacy dispensing; Diagnosis | Use of atypical antipsychotic drugs among individuals with a diagnosis of depression and incident use of selective serotonin reuptake inhibitors |

MSCDM, Mini-Sentinel common data model; NDC, National Drug Codes; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification.
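
To make the role of the date range, drug list, and washout parameters concrete, the sketch below separates prevalent from incident users in the spirit of modular program 1. It is not the Mini-Sentinel program, which was written in SAS and is not reproduced here. The definition used (an incident user has no dispensing of a listed product in the washout period before the first in-window dispensing) is a common pharmacoepidemiologic convention adopted as an assumption, and the function name and tuple layout are hypothetical. A production program would also need to confirm enrollment with pharmacy coverage throughout the washout period, which this simplification omits.

```python
from datetime import date, timedelta


def classify_users(dispensings, drug_list, start, end, washout_days):
    """Return (prevalent_users, incident_users) for the window [start, end].

    dispensings: iterable of (person_id, dispensed_date, ndc) tuples.
    A prevalent user has any dispensing of a listed NDC in the window.
    An incident user (assumed definition) additionally has no listed
    dispensing in the washout_days before the first in-window dispensing.
    """
    dates_by_person = {}
    for person_id, dispensed_date, ndc in dispensings:
        if ndc in drug_list:
            dates_by_person.setdefault(person_id, []).append(dispensed_date)

    prevalent, incident = set(), set()
    for person_id, dates in dates_by_person.items():
        dates.sort()
        in_window = [d for d in dates if start <= d <= end]
        if not in_window:
            continue
        prevalent.add(person_id)
        index_date = in_window[0]
        lookback_start = index_date - timedelta(days=washout_days)
        if not any(lookback_start <= d < index_date for d in dates):
            incident.add(person_id)
    return prevalent, incident
```

Only the parameters change between requests; the data partner supplies table locations and otherwise runs the program unaltered, so counts derived this way are comparable across sites.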

RESULTS

More than 99 million health plan members are included in the Mini-Sentinel Distributed Database as of July 2011, including 39 077 330 members actively enrolled with medical and drug coverage as of 1 January 2009. Approximately 63 million (63.8%) individuals included in the database are between 20 and 64 years old, and 8.1 million (8.2%) are 9 years old or younger. There are roughly equal proportions of men and women. In total, the Mini-Sentinel Distributed Database includes 316 009 067 person-years of observation time, with members contributing, on average, 27.0 months of observation time.

We tested the ability of the data partners' local implementation of the MSCDM to respond to specific requests from the FDA. In September 2010, the Mini-Sentinel Operations Center distributed a program that accessed demographic tables to collect counts of enrollees aged ≤19 years as of 1 January 2009. After the initial draft of the program was reviewed and refined by the data partners, the final program was distributed and executed successfully by each data partner within 7 days. Second, to inform the development of a protocol to assess myocardial infarction among users of oral hypoglycemic agents, the Mini-Sentinel Operations Center distributed two programs. One described the use of selected antidiabetic agents, and the other estimated the incidence of myocardial infarction leading to hospitalization in both diabetic and nondiabetic populations. These distributed queries accessed enrollment, demographic, dispensing, and diagnosis tables, and all data partners responded within 11 days. Finally, all data partners successfully executed all of the modular programs.

Querying of the Mini-Sentinel Distributed Database continues. In June and July 2011, over 50 summary table queries were distributed and returned by the data partners, encompassing information on well over 100 drug products, diagnoses, and procedures. Nearly all results were returned by the data partners within 2 days of the distribution, and many were returned within hours. Including the time required to prepare the request for distribution and to summarize the results, the summary table queries generated results within 1–2 weeks. During the same period, four modular program requests were distributed. These requests encompassed 19 separate “runs” of the programs that generated information for over 40 unique comparisons. Average response time was 4 days, with over one half of queries returned within 3 days. Including the time required to prepare the request for distribution and to summarize the results, the modular programs generated results within 3–4 weeks.

DISCUSSION

We set out to develop and implement a distributed data system to support active surveillance across multiple data partners. During an 8-month period, we articulated priorities and principles, inventoried data holdings, and designed an extensible and scalable CDM. During the implementation phase, each data partner transformed local source data using the CDM, and the transformed data were evaluated using centrally developed programs distributed for local execution. Two test queries were successfully fielded, and analytic tools were developed and tested to facilitate rapid execution of common queries. Querying of the distributed system continues with an average response time of 2 days for summary table queries and 4 days for modular program requests. Summary table results are returned to the requestor within 2 weeks of the request, and modular program results are returned within 4 weeks. Taken together, these achievements demonstrate the feasibility of developing a multiorganizational distributed data system in which organizations retain possession of data used in an active surveillance system.

Several important lessons were learned along the way. First, the data partners are invaluable sources of experience and expertise. Their insights have informed and substantially improved the design of the MSCDM and the development of the rapid querying capabilities. Second, definitions and coding of source data are not uniform across sites. The initial focus on claims-type and administrative data minimized, in many respects, the inconsistencies and challenges, yet source data still differed in important ways. For example, among data partners that function as care providers and insurers, data regarding outpatient clinic visits, hospitalizations, and pharmacy dispensings can come from encounter-based systems and pharmacy claims. Other examples include differences in the meaning of the first-listed diagnosis on inpatient and outpatient claims and local standards for claims processing, such as bridging short enrollment gaps and claims adjudication.

Third, ongoing verbal and written communication is essential. A weekly teleconference has been the primary conduit of information between the Mini-Sentinel Operations Center and the data partners, and additional information is shared via e-mail and one-on-one teleconferences. Although manageable for the earliest phase of start-up, the approach is not sustainable. The establishment of a secure portal and a secure project management Web site to serve as conduits for the dissemination of key documents and requests has facilitated communication and helped to track tasks and action items.

Finally, each data partner has working definitions of concepts like a medical encounter and a membership period, and each has business rules for handling administrative and claims data. The local rules and definitions differ across data partners, requiring substantial effort in defining and redefining not only the terminology itself but also the underlying concepts so that all partners use terms and concepts consistently.

The experience of Mini-Sentinel extends to other organizations that are considering the development of a distributed system for safety surveillance. The decision to build the foundation on readily available administrative and claims data enabled us to proceed quickly and achieve ambitious goals in a short timeframe. The extensible and scalable design allows us to respond to needs as they evolve rather than anticipate all possible needs at the outset. Finally, the engagement of data holders as partners and collaborators ensures that uses and interpretations of the transformed data are consistent with the original data sources.

Several additions and enhancements are planned. First, the MSCDM will be expanded to include specifications for selected clinical data. Specifically, tables will be added to the current model, but existing tables will not change. Second, with the FDA, the Mini-Sentinel Operations Center will identify relevant national coding and terminology standards (e.g., LOINC, RxNorm). An assessment of the appropriateness of the identified coding and terminology standards will be undertaken, and a strategy for incorporating standards in the MSCDM will be developed. In addition, the Mini-Sentinel Operations Center will develop and test new capabilities for extracting information from electronic health records and will assess the viability of and barriers to anonymous linkage across data partners. Finally, the Mini-Sentinel Operations Center will develop additional standardized tools for commonly performed, routine operations, such as identifying incident cohorts, calculating rates of outcomes among a specified cohort, checking data, and describing patient natural histories.

As these planned additions and enhancements proceed, we will work to enhance system performance and sustainability. At most data partners, the Mini-Sentinel Distributed Database does not reside in a standalone environment; parallel research and operational projects can reduce the speed at which data can be queried and updated. Thus, computational efficiency of the CDM remains a high priority. Establishing a dedicated technical environment at data partners may facilitate rapid processing. In addition, we will assess the stability of the data in the distributed system. Claims and administrative records may be revised as clerical and billing errors are detected. Typically, these revisions occur soon after the health care encounter, but changes can occur later as well. An assessment of the length of time required for claims and administrative data to stabilize is underway and will inform future ETL processes. Finally, we will actively monitor challenges to sustainability, including changes in source data streams, the burden of frequent updates, and the need to support and maintain expertise at each data partner.
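One way such a stabilization assessment could be framed is sketched below, under stated assumptions. The input layout (a service date paired with the date of the claim's last revision), the function name, and the nearest-rank percentile rule are all hypothetical choices made for illustration; they do not describe the assessment actually underway.

```python
import math
from datetime import date, timedelta

def days_to_stabilize(claims, fraction=0.95):
    """Return the lag in days by which `fraction` of claims reached final form.

    claims: list of (service_date, last_revision_date) tuples.
    Uses the nearest-rank percentile of the revision lags.
    """
    lags = sorted((last_rev - service).days for service, last_rev in claims)
    if not lags:
        raise ValueError("no claims supplied")
    index = math.ceil(fraction * len(lags)) - 1
    return lags[max(index, 0)]

# Made-up claims: most are final quickly, one is revised much later.
service = date(2011, 1, 1)
claims = [(service, service + timedelta(days=d)) for d in (0, 5, 10, 30, 200)]

median_lag = days_to_stabilize(claims, fraction=0.5)
full_lag = days_to_stabilize(claims, fraction=1.0)
```

A summary of this kind could suggest how long an ETL process might wait after the encounter date before treating records as stable, although the appropriate fraction would be a design decision.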

Identifying the needs and expansion priorities for the Mini-Sentinel CDM will continue to be a high priority for the Mini-Sentinel Operations Center. In general, data needs and priorities will be driven by surveillance protocols, questions, and other specific lines of inquiry as defined by FDA. Thus, an assessment of data needs and gaps will be included in the surveillance protocol development. In addition, in discussions with FDA, data partners, and others, we will identify new data resources. These resources might include new data partners, data sources available through linkage, and expanded clinical data capture as data partners work to achieve meaningful use requirements for electronic health records.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

KEY POINTS

  • A multiorganizational distributed database, in which data partners retain control of their data, was created for the Mini-Sentinel Pilot Program
  • Use of a common data model allows centrally developed analytic programs to be executed locally by data partners
  • Software tools and standardized programs enable rapid response to FDA queries
  • The Mini-Sentinel Distributed Database includes information for over 99 million health plan members and 316 million person-years
  • All data partners routinely execute distributed code and return results within expected timeframes

ACKNOWLEDGEMENTS

The authors are indebted to the statistical programmers at each of the sites for their work in designing the Mini-Sentinel distributed data system, extracting data, testing programming algorithms, and providing feedback on programming and system design; Claire Canning and Ashley Wong for their work on the distributed programs and analysis of results; Eric Gravel and Erick Moyneur (Statlog Consulting, Inc.) for statistical programming and guidance; and the Mini-Sentinel Operations Center for its help in overseeing the project. Mini-Sentinel is funded by the FDA through the Department of Health and Human Services Contract Number HHSF223200910006I.
