Data for cancer comparative effectiveness research

Past, present, and future potential


  • Anne-Marie Meyer PhD,

    1. University of North Carolina-Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina
    2. Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina
  • William R. Carpenter PhD,

    Corresponding author
    1. University of North Carolina-Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina
    2. Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina
    3. Department of Health Policy and Management, Gillings School of Global Public Health, Chapel Hill, North Carolina
    • Department of Health Policy and Management, University of North Carolina, Gillings School of Global Public Health, 1102A McGavran Greenberg Hall; CB 7411, Chapel Hill, NC 27599

    • Fax: (919) 966-6961

  • Amy P. Abernethy MD,

    1. Department of Medicine, Division of Medical Oncology, Duke University, Durham, North Carolina
    2. Duke Cancer Institute, Duke University, Durham, North Carolina
  • Til Stürmer MD, PhD,

    1. Cecil G. Sheps Center for Health Services Research, University of North Carolina, Chapel Hill, North Carolina
    2. Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina
  • Michael R. Kosorok PhD

    1. University of North Carolina-Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina
    2. Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina


Comparative effectiveness research (CER) can efficiently and rapidly generate new scientific evidence and address knowledge gaps, reduce clinical uncertainty, and guide health care choices. Much of the potential in CER is driven by the application of novel methods to analyze existing data. Despite its potential, several challenges must be identified and overcome so that CER may be improved, accelerated, and expeditiously implemented into the broad spectrum of cancer care and clinical practice. To identify and characterize the challenges to cancer CER, the authors reviewed the literature and conducted semistructured interviews with 41 cancer CER researchers at the Agency for Healthcare Research and Quality's Developing Evidence to Inform Decisions about Effectiveness (DEcIDE) Cancer CER Consortium. Several data sets for cancer CER were identified and differentiated into an ontology of 8 categories and were characterized in terms of strengths, weaknesses, and utility. Several themes emerged during the development of this ontology and discussions with CER researchers. Dominant among them was accelerating cancer CER and promoting the acceptance of findings, which will necessitate transcending disciplinary silos to incorporate diverse perspectives and expertise. Multidisciplinary collaboration is required among those with expertise in nonexperimental data, statistics, outcomes research, clinical trials, epidemiology, generalist and specialty medicine, survivorship, informatics, data, and methods, among others. Recommendations highlight the systematic, collaborative identification of critical measures; application of more rigorous study design and sampling methods; policy-level resolution of issues in data ownership, governance, access, and cost; and development and application of consistent standards for data security, privacy, and confidentiality. Cancer 2012. © 2012 American Cancer Society.


Rapid advances in cancer care continue through an accelerated pace of scientific discovery and technology development. Timely integration of developments into clinical practice is increasingly challenging, and it is imperative that more immediate, generalizable, and evidence-based information be available. Randomized controlled trials (RCTs) remain the gold standard for developing such information; however, this research design is not always feasible, practical, or sufficiently timely. In addition, RCT designs limit generalizability of findings to heterogeneous patient populations or to specific subgroups seen in clinical practice.1-7

Cancer comparative effectiveness research (CER) holds great promise for addressing many shortcomings of RCTs. Although CER takes many forms, in the current article, we focus on the Institute of Medicine (IOM) definition of CER as the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policy makers to make informed decisions that will improve health care at both the individual and population levels.6, 8

The foundation of CER is understanding effectiveness in the context of large, heterogeneous populations. Propitiously, large population-based data are becoming increasingly available through advances in information technology and research methods in the form of secondary data collected for nonresearch purposes. By increasing our understanding of these data, CER stands to benefit immeasurably from these ever-growing repositories.

For cancer CER, these data originate from many different sources, including electronic health records, registries, administrative data, observational studies, clinical trials, and others. Not all existing or secondary data are adequate, and each data source comes with its own unique challenges. Because secondary data originate from many different sources, they may have missing critical variables or significant and systematic differences in how variables are measured. These differences impede researchers' ability to confidently characterize important care processes and outcomes across data. An additional challenge is the lack of randomization, which makes controlling for relevant confounders critical. Consequently, cancer care stakeholders frequently are uncomfortable acting on CER findings generated from these non-RCT data sources.

A better understanding of data is necessary to improve data collection and methods development and to overcome the challenges facing cancer CER. To further this understanding and help guide federal data and research partners, we reviewed the literature and met with over 40 cancer outcomes researchers and clinicians. Our goals were to 1) develop a conceptual model for examining observational data in cancer CER, 2) characterize the strengths and limitations of current cancer CER data resources, 3) identify barriers in the conduct of cancer CER, and 4) formulate recommendations and guiding principles. Although our focus was on secondary, observational data (ie, nonrandomized, retrospective), the findings we present also are applicable to any prospective data collection.


Data Collection

Literature was reviewed regarding current cancer care, cancer research data, and cancer CER. This information helped inform the development of a conceptual model of data needs for cancer CER9 and frame discussions with a convenience sample of cancer outcomes researchers associated with the Cancer Developing Evidence to Inform Decisions about Effectiveness (Can-DEcIDE) Consortium (the University of North Carolina Can-DEcIDE, from the University of North Carolina at Chapel Hill), Duke University, the Centers for Disease Control and Prevention, the Brigham and Women's Hospital, the University of Virginia, the Epidemiologic Research and Information Center at the Durham Veterans Affairs Medical Center, the North Carolina Central Cancer Registry, Blue Cross and Blue Shield of North Carolina, the Agency for Healthcare Research and Quality (AHRQ), and the National Cancer Institute. Participants were from multiple disciplines and included clinicians, clinical trials experts, epidemiologists, pharmacoepidemiologists, health services researchers, biostatisticians, clinical data managers, state public health workers, and informaticians. The majority of participants relied on federal or academic funding; individuals who relied on funding from industry or nongovernment, third-party payers were not targeted in the initial sampling frame. Applying snowball sampling, participants were asked to identify other researchers who may provide additional insight, and, together, these participants comprised the study sample of 41 discussants.

Discussions were conducted individually and were tailored according to each researcher's area of expertise. Guided by findings in the literature, the discussions centered on the following areas: 1) identification of specific data sets for cancer CER; 2) utility of measures; 3) data access or logistical challenges; 4) population/target and sampling; 5) data-linking capabilities; 6) longitudinal follow-up in data sets; 7) temporality of data/measures; 8) data completeness; 9) data standardization, formatting, and documentation; and 10) data processing and required expertise.

Study Team Review and Development of Recommendations

The primary study team comprised an epidemiologist, a pharmacoepidemiologist, a biostatistician, a health services researcher, and 3 cancer-focused physician researchers, all of whom conduct federally funded, patient-centered cancer outcomes research. The study team met multiple times to summarize key informant interviews, integrate the summary with information from the literature, and organize the findings into categories and themes. These meetings were audiotaped to ensure capture of the entire discussion. Recommendations were developed collaboratively by the study team and reflected broad themes observed in the literature, results from the interviews, and specific issues or examples specified by multiple participants. Draft findings and recommendations subsequently were reviewed by select participants and other outcomes researchers to assure their accuracy and face validity. Finally, the entire article was reviewed by external experts participating in the AHRQ DEcIDE network.


Data Sources for Cancer Comparative Effectiveness Research

We identified 46 relevant data sets from our study sample of cancer outcomes researchers. Participants themselves expressed different opinions with regard to which data were important, adequate, or weak for cancer CER. This variability highlighted the lack of standardized nomenclature associated with these data. Therefore, our first priority was to identify patterns with which we could organize existing data sets and broad themes.10-29 Inspection of the data sets and their purposes revealed that a consistent nomenclature was needed before they could be easily organized to support CER. In response, an ontology was developed that divided the data into 8 categories, including 6 “existing or fixed data” categories and 2 “hybrid data” categories (Table 1).

Table 1. Data Ontology and Definitions
Data Type | Definition

Abbreviations: ARIC, Atherosclerosis Risk in Communities Study; CER, comparative effectiveness research; HMO, health maintenance organization; HPFS, Health Professionals Follow-Up Study; MCBS, Medicare Current Beneficiary Survey; NHS, Nurses Health Study; PHS, Physicians' Health Study; POC, Patterns of Care Study; SEER, Surveillance, Epidemiology, and End Results Program of the National Cancer Institute; WHS, Women's Health Study.

Existing and fixed data
 Experimental studies data | Primary data collected for the purpose of studying the safety and efficacy of health and medical interventions; significant variation exists between types of studies with regard to utility for CER; for example, phase 3-4 randomized trials are more limited for CER versus pragmatic trials or large epidemiologic trials (eg, WHS, WHI, PHS)
 Nonexperimental studies data | Data collected by public health researchers to identify patterns and determinants of diseases and outcomes within a population (eg, NHS, HPFS, ARIC)
 Registry data | Data collected by public health or other clinical health institutions to evaluate disease incidence, morbidity, and outcomes for a population defined by a specific condition or exposure, including interventions (eg, SEER)
 Administrative/claims data | Data collected for the business or programmatic purposes of documenting, delivering, or paying for health care, including insurance companies, health systems, or government entities; may be for the purpose of organizing, tracking, and defining patient health and interactions with the health care system (eg, Medicare, Medicaid, MarketScan)
 Electronic health records | Data collected at point of care to support clinical care delivery, management, and decision making; these data are stored/managed through specific computer-based software and information systems (eg, HMO-network)
 Other data | Various, yet untested for CER; examples include syndromic surveillance and/or pharmacy purchase/market data
Hybrid data
 Linked clinical and claims data | Data sets created by the linkage between 2 unique data sources (often from the above categories) collected by different entities and for different purposes (eg, SEER-Medicare)
 Validation study data | Data collected or obtained for the purposes of overcoming limitations from existing/secondary data sources (eg, MCBS, POC)

Definitive empirical definitions for each category were difficult to establish because the categories are not mutually exclusive. This was complicated by the reality that participants from different clinical and methodological specialties prioritized different characteristics of the data sets. Despite this, the study team reached unanimous agreement on the final ontology, which had face validity to internal and external reviewers and provides a useful classification and characterization of the data sets. Table 2 presents an illustrative sampling of what were perceived as key data sets and categories and a summary of their strengths, limitations, and applicability for cancer CER, as informed by the discussants and study team.

Table 2. Existing and Secondary Data Sources: Illustrative Examples, Strengths, Limitations, and Applicability/Utility to Comparative Effectiveness Research
Examples | Strengths | Limitations | Applicability/Utility to Cancer CER

Abbreviations: ACCENT, Adjuvant Colon Cancer Endpoints; CER, comparative effectiveness research; DUAs, data use agreements; EHR, electronic health record; HMO, health maintenance organization; NCI, National Cancer Institute; PI, principal investigator; PS2, Performance Status 2; SEER, Surveillance, Epidemiology, and End Results Program of the National Cancer Institute; VA, Veterans Administration; WHI, Women's Health Initiative.

Existing and fixed data sources   
 1) Experimental studies (trials) data (eg, clinical trials, pragmatic trials)
   Examples:
    • ACCENT/PS2
    • Women's Health Initiative
    • Physicians' Health Study
    • Women's Health Study
   Strengths:
    • Detailed and unbiased information on treatment, and important clinical covariates
    • Enormous breadth and diversity of data (across 12 NCI cooperative groups)
   Limitations:
    • Limited generalizability
    • Expensive to conduct, requires lengthy follow-up for many outcomes
    • Limited sample sizes
    • Highly specific (ie, usually single treatment/intervention)
    • Limited in essential/important covariates
   Applicability/utility to cancer CER:
    • Utility for CER depends on type of experimental study; phase 3 and 4 clinical trial data potentially may be a valuable resource but currently have significant barriers for CER (eg, data governance issues, secondary data set creation for out-of-scope analyses, and resolving population hyperspecificity measurement limitations, etc); broadly defined (or population-based) trials can be useful for CER but require extensive inclusion of covariates outside of the main trial aims
    • Secondary use of experimental studies for CER could be improved through investments in 1) pragmatic clinical trials and 2) methods/design development
 2) Nonexperimental (observational) studies data
   Examples:
    • North Carolina-Louisiana Prostate Cancer Project
    • Cancer Care Outcomes Research and Surveillance Consortium
    • Health Professionals Follow-Up Study
    • Nurses Health Study
    • American Cancer Society Cohort
   Strengths:
    • Extensive data on diagnosis, procedures, and outcomes
    • Rich in covariates (risk factors, important confounders)
    • Often include patient medical records
    • Can be population-based
   Limitations:
    • Expensive to develop and maintain
    • Logistics of study development limit data availability and addition of new hypotheses
    • Several biases may exist: selection; information, recall, and response
    • Unclear event temporality between data collection waves
    • Limited in scope, statistical power beyond initial study aims
    • Proprietary data requiring extensive protocols, procedures
   Applicability/utility to cancer CER:
    • Can be leveraged for comparative effectiveness depending on data quality and extent of biases
    • Utility for CER also depends on study design, quality/completeness of measures, and broad inclusion covariates
    • Can be strengthened through potential data linkages to claims or EHR data, which can augment or off-set biases/limitations (can provide temporality of events, verification of treatment/outcomes, etc)
 3) Registry data
   Examples:
    • SEER
    • National Program of Cancer Registries
    • National Cancer Data Base
    • National Oncologic Positron Emission Tomography
   Strengths:
    • Rich disease information
    • Clinical information at point of care or diagnosis
    • Simultaneously collected with diagnosis and treatment
    • Opportunity for recruitment into cohorts or trials
    • Can link with administrative data
   Limitations:
    • Potential sampling biases (selection, inclusion, etc)
    • Questionable generalizability
    • Primarily limited to first occurrence of event or disease and limited inclusion of covariates
    • Unknown response, toxicity, patient reported outcomes
    • Challenging for longitudinal data capture
    • Sparse patient identifiers
    • Challenging for selecting controls/comparator populations
   Applicability/utility to cancer CER:
    • Do not provide enough complete data for rigorous CER
    • Linkages to additional data are necessary to provide missing information
    • Dearth of literature on solutions/methods for inherent biases, interoperable study design, and evaluation/application of comparator populations
 4) Administrative and claims data
   Examples:
    • Most health insurance programs, Medicare, Medicaid, Blue Cross/Blue Shield, etc
    • Medstat/MarketScan
    • United Health
   Strengths:
    • Represents large proportion of US population
    • Rich patient-level data: demographics, procedures, treatments
    • Includes temporality of events
    • Some include organizational/provider characteristics
    • Most have unique identifiers enabling linkage to other data
   Limitations:
    • Design/structure often impacts data sensitivity/specificity
    • Missing important clinical etiologic information
    • Includes date or type of testing procedures, but no results (eg, pathology, tumor response, genetics, vital stats, etc)
    • High patient turn-over
    • Complicated data structure requires significant learning curve and programming resources
    • Burdensome and prohibitive DUAs
    • Expensive to obtain
    • Untimely data releases: significant time lags
   Applicability/utility to cancer CER:
    • Missing key CER components, including vital tumor and disease information
    • Linkages can supplement missing information, but costs and/or DUAs often inhibit additional linkages
    • Utility for CER would be greatly improved through institutional and governmental policies that overcome limitations (ie, funding, training, collaboration)
 5) Electronic health records
   Examples:
    • Health care systems: VA, HMO-network, Kaiser, Mayo, Geisinger, US Oncology, UK General Practitioners Research Database
    • Large vendors: GE Health, Allscripts/Misys, Epic, McKesson, NextGen
   Strengths:
    • Includes multiple data components (practice management, electronic patient record, patient portal)
    • Fully integrated EHRs provide clinical information, claims, tumor specifics, longitudinal follow-up, objectively measured events
    • Allows for studies of toxicity, quality of life, natural history
   Limitations:
    • Populations are not generalizable
    • Lack of standardization of patient information and clinical measures between systems (technology, data structure, and coding)
    • Missing or insufficient data elements necessary for CER
    • Imperfect record keeping/follow up; patients not consistently maintained within a single system/EHR
    • Enormous expense to obtain data from private sector/vendors
   Applicability/utility to cancer CER:
    • Currently, there is limited utility for EHR data from private vendors
    • However, examples from VA and universal/national systems (UK, Canada) exemplify potential of EHR sources
    • Future utility depends on: standardization of measures and data systems/interoperability; standard linkage variables; public and private institutional data governance and stewardship
 6) Other data
   Examples:
    • Genetic and genomic data
    • Geospatial data
    • Environmental monitoring data
    • Over-the-counter drug purchasing
    • Health seeking on internet
    • Patient-networking sites (PatientsLikeMe 201030)
    • Syndromic surveillance
   Strengths:
    • Data at both patient and ecological level
    • Information on behavioral and environmental risks
    • Can provide information on disease determinants
    • Self-reported experiences, exposures, outcomes
   Limitations:
    • Unclear how to identify, define and use these data
   Applicability/utility to cancer CER:
    • Utility to CER depends on integration into other data, specifically clinical care data
Hybrid data sources   
 7) Linked clinical and claims data
   Examples:
    • SEER-Medicare
    • State cancer registry–Medicare/Medicaid
    • WHI-Medicare
   Strengths:
    • Includes clinical and health services data
    • Provides temporality of events
    • Large population samples; ability to study rare events/treatments
    • Provides access to controls or comparison populations
    • Allows for adjudication/validation of events (ie, self-reported)
    • Can detect recurrence
   Limitations:
    • Missing information (eg, HMO or supplemental insurance); often highly specific populations (aged >65 y, disabled, etc)
    • Noncovered services are excluded (eg, prescription drugs, long-term care, free screenings)
    • Missing vital clinical information (tumor response)
    • Treatment rationale and test results are unknown
    • Complicated algorithms needed to characterize treatment
    • Large, complex data require advanced training/experience
    • Delay in research access
   Applicability/utility to cancer CER:
    • Powerful for CER studies because of large, generalizable populations
    • Large number of covariates and clinical information
    • Lengthy follow-up available, including information on temporality of treatment and events
    • Could be strengthened by linkages to laboratory and clinical results
 8) Validation study data
   Examples:
    • Internal validation studies (Sturmer 2007,51 Sturmer 200533)
    • External validation studies (Goldberg 200933)
   Strengths:
    • Rich disease information
    • Used to minimize limitations of other data
    • Can give estimates of associations not discernible within data
   Limitations:
    • Lack of validated studies exist for CER
    • Methodologic limitations and lack of model transportability to CER
   Applicability/utility to cancer CER:
    • To be useful for CER, an investment in methodologic work is required, similar to P01 CA 142538 “Statistical Methods for Cancer Clinical Trials” (PI, M. R. Kosorok)
    • Validation studies could lead to immediate return of investment with regard to leveraging existing data for CER

Barriers to the Conduct of Cancer Comparative Effectiveness Research

Several consistent themes emerged through the discussions and analysis: 1) There is a need for systematically identified, standardized measures to fill gaps and enhance data linkage and transferability. 2) Improvements in study design and population sampling are critical for CER studies to be meaningful. 3) Substantial issues exist regarding data ownership, access, governance, and cost. 4) Data security, privacy, and confidentiality remain paramount. 5) Broad multidisciplinary representation is needed to effectively address these CER data needs. These themes were consistent throughout the analysis and resonated with key informants and the study team.


We have developed a novel framework for organizing and characterizing cancer CER data together with relevant research needs. On the basis of the literature and key informant interviews, we propose a practical ontology regarding data resources and availability. The structure of this ontology was defined through a retrospective lens by asking participants to nominate secondary sources of data that could be immediately leveraged or developed.

The retrospective lens provides a starting point from which a rational ontology can be developed. It allows us to define and characterize available data resources ready for cancer CER. Finally, it provides a clearly characterized point of transition from retrospective data models to prospective CER data models. Moving forward, we anticipate a transition to more frequent prospective and real-time data-collection activities (electronic health records, continuously aggregating registries, rapid learning data systems). The ontology proposed from our study provides a foundational nomenclature from which to build future data resources. It is increasingly clear throughout the fields of science and engineering that data must be organized and systematically structured so that information can be extracted maximally to assist in prescribing the right treatment at the right time for a specific patient. Moreover, we need agreement and collaboration from respective stakeholders of the multiple, diverse systems for collecting data. This report highlights these realities and helps to point a practical way forward.

Our approach has several limitations. Development of this ontology was challenged by a lack of mutual exclusivity among data sets and the diverse perspectives of participants. Our federally funded study team was focused on describing data sets and CER opportunities from a government perspective. Our sampling was purposeful but not exhaustive; additional cancer CER data sets probably have been missed, and the relative impact on the ontology is not clear.

Despite these challenges, this work provides a practical ontology that is adaptive and can be upgraded over time. It provides a template for understanding the strengths and limitations of current CER data resources and formulating recommendations and guiding principles to advance cancer CER.

Below, we present recommendations corresponding to the major themes identified in this study with a goal of informing the evolution of the CER data framework, resolving data gaps, and ultimately establishing a national data infrastructure for cancer CER. Our focus was on existing secondary, observational data, although the findings we present also are applicable to prospective data collection and future data resources.

Recommendation 1

There is a need for systematically identified, standardized measures to fill data gaps and enhance linkages and transferability.

Inconsistent, incomplete measures and a lack of data standardization pose a substantial threat to improving public health through CER. Stakeholders (eg, researchers, providers, payers) collect clinical, population, and health services data in numerous ways. Even within the cancer research community, there are substantial differences of opinion on essential variables. This lack of consensus inhibits comparability across and within health data sets.

Recommendation 1a

Systematically identify necessary measures, including uniform definitions and standardization of collection and coding.

Intervention selection, exposure assignment, and outcome measures must be identified and characterized systematically. Therefore, as a starting point, we have recommended a framework for identifying measures across the cancer care continuum.9 Standards for how measures are defined, collected, and coded must be developed and broadly applied, even for very basic measures like race and ethnicity. This issue extends to algorithms for defining meaningful measures and cohorts, or deriving complex treatments or outcomes. Lack of global standardization inhibits data pooling, comparability among multiple sources, and generalizability of findings in the context of population heterogeneity.34

A multidisciplinary panel of CER researchers, stakeholders, and their partners is required to address this diversity of measures and lack of data standardization. A goal of such an effort should be the identification of a minimum, basic set of essential measures in all new data-collection initiatives, including standardized data definitions.

Recommendation 1b

Develop and incorporate new measures and data set crosswalks to address gaps among current data resources.

Additional measures that incorporate advances in medicine and health sciences must be identified. A key example is the enhancement of our national cancer registries' collection of data on genetic markers. These tests, like the KRAS test, are increasingly able to provide predictive insight into intervention effectiveness for individual patients.35-37 Because of the potentially rapid and inconsistent adoption of these markers, multiconcept coding systems are necessary to capture 1) whether the test was used, 2) test results, and 3) test characteristics. In addition to genetic markers, federal and other payers could consider standardization of clinical markers, such as stage, grade, and performance status. The current use of the International Classification of Diseases, 9th Revision and the Healthcare Common Procedure Coding System codes is insufficient in this cancer-specific context. Furthermore, investment in measurement and methods research could facilitate the development of “crosswalks” between existing measures and instruments.38, 39 This will enable the comparison of constructs between data sets and offer potential mechanisms for combining existing data or supplementing missing information.40, 41
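As a purely illustrative sketch of the multiconcept idea, the record below keeps the 3 concepts in separate fields so that "not tested" can be distinguished from "tested, result not recorded." The field names, the example values, and the `summarize` helper are hypothetical and are not drawn from any registry standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarkerTestRecord:
    """Hypothetical registry entry for one genetic marker test."""
    marker: str                   # marker name, eg "KRAS"
    test_performed: bool          # 1) whether the test was used
    result: Optional[str]         # 2) test result; None if unknown or not tested
    assay_method: Optional[str]   # 3) a test characteristic (illustrative only)


def summarize(records):
    """Count all records, tested patients, and tested patients with a result."""
    tested = [r for r in records if r.test_performed]
    with_result = [r for r in tested if r.result is not None]
    return {"n": len(records), "tested": len(tested), "with_result": len(with_result)}


# Made-up example rows, not real patient data.
records = [
    MarkerTestRecord("KRAS", True, "mutated", "PCR"),
    MarkerTestRecord("KRAS", True, None, "PCR"),    # tested, result missing
    MarkerTestRecord("KRAS", False, None, None),    # never tested
]
summary = summarize(records)
```

A single combined code could not separate the second and third rows, which is one reason a multiconcept structure may be preferable when adoption of a marker is uneven.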

Increasingly relevant for cancer CER are intermediate outcomes, including patient-reported outcomes.42-46 Historically, clinical research has focused on mortality; but, through advances in cancer detection and treatment, patients are living longer and may not die from cancer. To enable better comparisons between treatments, new measures are needed that go beyond life expectancy and better quantify side effects, costs, and other trade-offs, such as the probability of continuing to work or attending to family needs.47, 48 Patient treatment decisions are increasingly likely to be informed by such factors. Systems to capture these measures must be integrated better into clinical care data and embedded in future data sets.34

Recommendation 1c

Establish data architecture and systems standards for collecting and communicating these measures among health care delivery and financing organizations and researchers.

To date, health care reform has focused on standards for patient care, transferability (health information exchange), and quality-of-care evaluations; however, CER also needs to be included as a priority component for improving health care. “Meaningful use” regulations offer significant incentives to standardize clinical data for transferability and interoperability, although these efforts still are nascent. Accordingly, CER stakeholder involvement is critical in the discussions between the Centers for Medicare and Medicaid Services and the Office of the National Coordinator for Health Information Technology and must extend beyond meaningful use requirements for health information technology development. The National Cancer Institute's Cancer Biomedical Informatics Grid and Cancer Data Standards Registry and Repository already have developed an interoperable information technology infrastructure that offers standard rules, unified architecture, and common language to develop and use cancer research data.34 It is vital that open-source, open-access tools like these remain at the forefront of integrating with health care data coding, such as Systematized Nomenclature of Medicine Clinical Terms, Logical Observation Identifiers Names and Codes, or with data interoperability, such as Health Level 7.

Recommendation 2

Improvements in study design and population sampling are critical for CER studies to be meaningful.

Many of the problems with existing data sources cannot be solved through data standardization or sophisticated statistical methods. For example, a greater quantity of data will not necessarily make CER studies more generalizable or reproducible; rather, CER study design issues need to be better understood and overcome, resulting in better quality data.

Although the focus of this work is not statistical methods or study design, data and study methodology are inextricably connected. Recognizing this, future studies need to prospectively apply more advanced data collection, better study designs, and sampling frameworks. At the same time, investments need to be made in ways to reduce bias through advanced statistical methods.37 By funding research on study design issues in existing CER studies, we can develop better methods to apply toward future studies and data collection. It is also important to recognize that the advancement of complex methods requires the consistency of measures and the data interoperability described in the first recommendation.

Recommendation 2a

Develop methods to leverage existing data, overcome data limitations, and reduce bias.

The majority of data currently used for CER are collected for nonresearch purposes and are nonexperimental with regard to most CER questions. Consequently, several significant sources of bias exist, some of which are correctable through advanced methods. Other sources of error are quantifiable but cannot be adequately addressed. The development and application of better analytic methods can help overcome the design limitations of existing data. Propensity score matching and instrumental variable analysis are 2 important examples of statistical approaches that capitalize on key data elements to reduce confounding.
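To make the first of these approaches concrete, the following is a minimal sketch of propensity score matching on simulated data, not an analysis from this study. Everything in it is an assumption introduced for illustration: the single standardized covariate, the simulated treatment and outcome models, the 0.05 caliper, and the function names `fit_logistic`, `greedy_match`, and `demo`. It estimates each patient's probability of treatment with a logistic model and then pairs each treated patient with the nearest untreated patient on that score.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, t, lr=0.5, epochs=500):
    """Fit P(treated | covariates) by batch gradient descent.

    Returns [intercept, weight_1, ..., weight_p].
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, ti in zip(X, t):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def propensity_scores(X, w):
    """Estimated probability of treatment for every patient."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def greedy_match(scores, t, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    A treated patient is left unmatched when no remaining control
    has a score within the caliper.
    """
    controls = {i for i, ti in enumerate(t) if ti == 0}
    pairs = []
    for i in (i for i, ti in enumerate(t) if ti == 1):
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs


def demo(seed=0, n=400, true_effect=1.0):
    """Simulate confounded data (hypothetical) and compare two estimates."""
    rng = random.Random(seed)
    X, t, y = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                       # confounder, e.g. standardized age
        treated = 1 if rng.random() < sigmoid(x) else 0
        X.append([x])
        t.append(treated)
        y.append(2.0 * x + true_effect * treated + rng.gauss(0.0, 1.0))
    scores = propensity_scores(X, fit_logistic(X, t))
    pairs = greedy_match(scores, t)
    treated_y = [yi for yi, ti in zip(y, t) if ti == 1]
    control_y = [yi for yi, ti in zip(y, t) if ti == 0]
    naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
    matched = sum(y[i] - y[j] for i, j in pairs) / len(pairs)
    return naive, matched, pairs, scores, t
```

Calling `demo()` returns the unadjusted difference in mean outcomes alongside the mean within-pair difference. Matching can balance only the covariates that are measured, which is why the completeness of the underlying data elements matters as much as the method.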

Many biases or data uncertainties also can be examined using specifically collected data or hybrid data sources. For example, linking administrative data to epidemiologic or clinical data (eg, SEER-Medicare linked data49) creates powerful research resources that serve as models for other such efforts.50 Other approaches include ancillary or validation studies collecting new data on a subgroup of the main population or an external population to supplement missing information or to extrapolate the distribution of an important variable into the study population.31, 33, 51, 52
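As a rough illustration of the linkage idea, the sketch below joins registry records to claims on a shared identifier and reports which registry cases could not be linked. It is a simplified, deterministic example under stated assumptions: the field names (`patient_key`, `stage`, `procedure`) and the function `link_records` are hypothetical, and it does not reflect how SEER-Medicare or any other real linkage is performed, which typically requires probabilistic matching and strict privacy controls.

```python
def link_records(registry, claims, key="patient_key"):
    """Deterministically link registry cases to their claims.

    Returns (linked, unlinked): linked cases carry a "claims" list;
    unlinked cases had no claim with the same key.
    """
    claims_by_key = {}
    for claim in claims:
        claims_by_key.setdefault(claim[key], []).append(claim)

    linked, unlinked = [], []
    for case in registry:
        matches = claims_by_key.get(case[key])
        if matches:
            linked.append({**case, "claims": matches})
        else:
            unlinked.append(case)
    return linked, unlinked


# Hypothetical toy records, for illustration only.
example_registry = [
    {"patient_key": "a1", "stage": "II"},
    {"patient_key": "b2", "stage": "III"},
]
example_claims = [
    {"patient_key": "a1", "procedure": "chemotherapy"},
    {"patient_key": "a1", "procedure": "imaging"},
    {"patient_key": "zz", "procedure": "imaging"},
]
```

Reporting the unlinked cases alongside the linked ones is a deliberate choice: the share of records that fail to link, and how those patients differ from the rest, is itself a source of selection bias that a study should describe.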

Recommendation 2b

Facilitate the conduct and completion of pragmatic trials for CER.

Pragmatic trials can overcome many limitations of randomized clinical trials (namely, limited sample sizes and restrictive inclusion criteria). Pragmatic trials use randomization but aim to make eligibility criteria and treatment decisions representative of “real-world” settings.4 They also collect information on a broader range of risks, determinants, health outcomes, and events, either directly or through the novel and efficient use of other data sources (eg, claims/administrative data). Thus, they can yield more generalizable findings. In addition to these benefits, increased funding for pragmatic trials also may help spur methods development on sampling and design issues commonly observed in traditional CER studies.53

Recommendation 3

Issues of data ownership, access, governance, and cost are substantial.

There are many large data resources and innumerable small data sets relevant to cancer CER. However, there are significant barriers limiting their use, including political obstacles, costs, and administrative burden associated with data access.25 Important and timely data often are closely controlled by those who collect the data. Even data from federally funded studies may languish as the investigative team exhausts its “right of first publication.” The potential benefits from additional data linkages are prevented by lack of access, cost, or tightly constrained data use agreements. For example, developing resources analogous to SEER-Medicare for the population aged <65 years is eminently feasible by linking registry data to private and public payer data. However, efforts to do so commonly have met with reluctance on the part of the payers and even registries. For these groups, research is not a primary concern, and it is perceived that the risks or “unknowns” outweigh the prospective benefits.

Recommendation 3a

Develop systems to facilitate timely data sharing for research that supports the public good.

There are practical solutions to identify CER-relevant data sets and facilitate their acquisition.54 These include the development of codified relationships among federal agencies, their contractors, and many data holders.25 For example, the individual SEER or National Program of Cancer Registries could approve a single data acquisition process to be followed for all federally contracted CER studies, which may relieve administrative burden. The National Cancer Institute's central Institutional Review Board (IRB) may serve as a useful analog, because it was designed to relieve the work of the multitude of institutional IRBs.55 However, it provides a cautionary tale, because the centralized IRB has been criticized for replacing rather than relieving the work needed to open a study.56 Other examples include the broad data use agreements between Medicare and important epidemiologic cohorts, such as the Women's Health Initiative study. Similar agreements could be developed for important cancer studies, making them more accessible to the research community.

Standardized relationships between state and federal agencies help reassure data holders that their data will be used appropriately while distilling data acquisition logistics to a formulaic process. These relationships also would help facilitate the timeliness of data for research and enable quick turn-around on important questions. Regarding access to costly or proprietary data sets, government stakeholders (eg, AHRQ, National Cancer Institute) may consider directly lending their weight to developing special agreements for select restricted or tightly held data sets.

There may be utility in centrally brokered and managed data subscriptions based on standing data use agreements. For example, states like Maine and Oregon have implemented requirements that payers deposit “shadow claims” with public health agencies for purposes of quality improvement and informing policy decisions.57 Formal mechanisms could be established to facilitate regular updating of, and access to, such data for CER.

Recommendation 4

Data security, privacy, and confidentiality remain paramount.

Although access to data must be improved, data security, privacy, and confidentiality are critical and remain top concerns.28 Moreover, there are multiple laws and regulations governing the maintenance, release, and use of many data sets, such as Medicare or Medicaid claims.

Recommendation 4a

Develop systems to assure data security, privacy, and confidentiality with any enhanced access to data for CER.

Two short-term, practical opportunities warrant further exploration. First, at the state level, the health information exchange is focusing on standardization of electronic health records and rules governing data transfer and use. It is prudent that the federal CER agenda be represented as new processes and regulations continue to be defined.58-61 Second, developing a CER data security and use “accreditation” system may help assure compliance with regulations. This would ensure a baseline level of information technology sophistication that facilitates data use while assuring data vendors that accredited research sites are top-tier, “safe” data custodians. Examining the Centers for Medicare and Medicaid Services requirements for their quality-improvement organizations62 may be a first step toward developing such accreditation systems.

Recommendation 5

Broad multidisciplinary representation is necessary to effectively address these CER data needs.

The recommendations (from methods to policy) presented by the discussants in this study are a microcosm of the larger CER discussion and highlight many differences in the cultures, values, terminology, measures, approaches, and priorities relevant for cancer CER.

Recommendation 5a

A collaborative, multidisciplinary approach must be emphasized to successfully address data needs for CER.

Multidisciplinary representation is necessary to adequately capture important differences in the cultures, values, and terminology surrounding perceptions of cancer CER and data issues.26 Accordingly, a critical step will be the identification of individuals who can represent their disciplines (and industries) to optimally advance the cancer CER discussion. To be successful, these individuals not only must be technical experts, but they also must be mavens and translators who can bridge technical and disciplinary gaps to identify and achieve solutions.63 Supporting the identification and ongoing communication of such a group will be important to drive CER data needs forward.

Recommendation 5b

CER stakeholders must be engaged and coordinated in the development of rules and standards to inform health reform.

Beyond informing cancer CER and its data needs, it is important that a multidisciplinary advisory group be well represented in the context of health care policy reform. It will be vital to engage these diverse groups and address these issues in a timely and consistent manner; the recently established Patient Centered Outcomes Research Institute (PCORI) is the obvious choice to lead such an effort.64 The PCORI can identify members of the research community and partner with federal agencies. Both groups can represent research needs and interests and help define the future of CER in the context of health reform. In addition, the PCORI is well positioned to address other needs, such as maintaining a cancer CER data inventory and perhaps similar registries for protocols, including those with null results. The Registry of Patient Registries project is a promising effort that can begin to address this need.50, 65, 66

In conclusion, by leveraging secondary data, we can fill gaps and provide timely, valid, scientific knowledge to systematically conduct CER and improve cancer care and outcomes. However, substantial engagement is required from many organizations to address the issues outlined here. Multidisciplinary individuals within these organizations need to be identified who can help facilitate solutions in order for CER to reach its full potential.

The data ontology and recommendations we present provide guidance for critical discussions between multidisciplinary teams of cancer researchers, methods experts, and other stakeholders. They align with previous calls for infrastructure development to support cancer research and CER.13, 15 Together, they provide a template for systematically addressing cancer CER data needs. By understanding and overcoming weaknesses in current data, we can accelerate the pace of cancer CER and ultimately enhance the adoption of CER findings to improve patient-centered care and outcomes.


We thank Timothy S. Carey, MD, MPH, and Janet K. Freburger, PhD, for their review and feedback, which informed and strengthened this article. We thank the anonymous reviewers from the Agency for Healthcare Research and Quality's Effective Healthcare Program article review system for their constructive suggestions and comments.


This work was supported by funding from AHRQ through the Cancer DEcIDE Comparative Effectiveness Research Consortium, contract HHSA290-205-0040-I-TO4-WA5 to the Data Committee for the DEcIDE Cancer Consortium. This work was also supported in part by NCI grant P01 CA142538 (authors WRC, AA and MK).


Dr. Abernethy has received research funding from the US National Institutes of Health, US AHRQ, Robert Wood Johnson Foundation, Pfizer, Eli Lilly, Bristol-Myers Squibb, Helsinn Therapeutics, Amgen, Kanglaite, Alexion, Biovex, DARA Therapeutics, Novartis, and Mi-Co; all of these funds are distributed to Duke University Medical Center to support research. In the last 2 years, she has had nominal consulting agreements (<$10,000) with Helsinn Therapeutics, Amgen, and Novartis.

Dr. Til Stürmer receives investigator-initiated research funding and support as Principal Investigator (R01 AG023178) from the National Institute on Aging at the National Institutes of Health. He also receives research funding as Principal Investigator of the UNC-DEcIDE center from the Agency for Healthcare Research and Quality. Dr. Stürmer does not accept personal compensation of any kind from any pharmaceutical company, though he receives salary support from the UNC Center of Excellence in Pharmacoepidemiology and Public Health and from unrestricted research grants from pharmaceutical companies to UNC.