Some methodologic lessons learned from cancer screening research


  • Sally W. Vernon, Ph.D. (corresponding author)
    Division of Health Promotion and Behavioral Sciences, Center for Health Promotion and Prevention Research, The University of Texas–Houston School of Public Health, 7000 Fannin, UCT 2560, Houston, TX 77030; Fax: (713) 500-9750

  • Peter A. Briss, M.D., M.P.H.
    Systematic Reviews Section, Community Guide Branch, Centers for Disease Control and Prevention, Atlanta, Georgia

  • Jasmin A. Tiro, M.P.H.
    Center for Health Promotion and Prevention Research, The University of Texas–Houston School of Public Health, Houston, Texas

  • Richard B. Warnecke, Ph.D.
    Center for Health Services Research, University of Illinois–Chicago, Chicago, Illinois


Credible and useful methodologic evaluations are essential for increasing the uptake of effective cancer screening tests. In the current article, the authors discuss selected issues that are related to conducting behavior change interventions in cancer screening research and that may assist researchers in better designing future evaluations to increase the credibility and usefulness of such interventions. Selection and measurement of the primary outcome variable (i.e., cancer screening behavior) are discussed in detail. The report also addresses other aspects of study design and execution, including alternatives to the randomized controlled trial, indicators of study quality, and external validity. The authors conclude that the uptake of screening should be the main outcome when evaluating cancer screening strategies; that researchers should agree on definitions and measures of cancer screening behaviors and assess the reliability and validity of these definitions and measures in different populations and settings; and that the development of methods for increasing the external validity of randomized designs and reducing bias in nonrandomized studies is needed. Cancer 2004. © 2004 American Cancer Society.

Credible and useful evaluations of strategies (e.g., interventions, programs, and policies) are essential for increasing the uptake of effective cancer screening tests. A recent Institute of Medicine report calls for the evaluation of comprehensive strategies to address progress in cancer prevention and early detection.1 In their introductory article in the current supplement, Meissner et al.2 note the growing body of studies of strategies for achieving important outcomes, such as the uptake of effective cancer screening tests. This literature base enables us to assess the strengths and weaknesses of previous evaluation efforts, to identify lessons learned from the past, and to address important outstanding issues that will help inform future work. The current report provides advice regarding ways in which the design of future evaluations can be improved to increase their credibility and usefulness.

For the purposes of the current article, we define an intervention as an activity (or group of related activities) intended to promote cancer screening, a program as an institutionalized system of intervention activities, and a policy as a set of organizational rules to promote screening. Strategy is treated as a more general term that encompasses all of the above.

Evaluation of strategies for increasing the use of cancer screening may serve a variety of purposes. These include: 1) documenting needs for improvement; 2) identifying local barriers to screening; 3) demonstrating the extent to which screening and follow-up are reaching target populations and becoming part of routine health care (e.g., whether individuals are screened at recommended intervals, whether screening is performed in accordance with accepted protocols, whether appropriate diagnosis and treatment occur when an abnormality is detected); 4) assessing the likely improvement in screening uptake that would result from the implementation of various strategies; 5) identifying reasons why certain strategies do not meet their objectives; 6) encouraging modifications aimed at increasing the success of certain strategies; and 7) providing information that promotes the diffusion of effective strategies to new communities, populations, or health care and public health systems. For all of these purposes, evidence from high-quality evaluation studies is more credible than evidence from less rigorous studies.

In the current article, we highlight selected methodologic issues that arise in evaluating whether strategies meet their objectives. We address selection and measurement of the primary outcome variable (i.e., cancer screening behaviors) in detail, but we also touch on other aspects of study design and execution that are relevant to all cancer screening behaviors. These aspects include alternatives to the randomized controlled trial (RCT) and indicators of study quality and external validity. We do not discuss methodologic issues related to the independent variables used to predict the uptake of screening, be they individual-level factors (such as perceived susceptibility, perceived barriers, and intention) or social or system-related variables. Selection, definition, and measurement of those variables are important, but there is far less consensus regarding the factors that are theoretically important and the way in which they should be defined and measured.

Table 1 lists seven lessons learned. These lessons were drawn from experiences in previous cancer screening studies and other behavioral research efforts. The discussion of these lessons is intended to contribute to improvements in future research.

Table 1. Methodologic Lessons Learned from Cancer Screening Research
Lesson 1: The uptake of screening should be considered the main outcome when evaluating cancer screening strategies.
Lesson 2: Agreement regarding conceptual and operational definitions of behavioral outcomes of cancer screening is needed.
Lesson 3: Obtaining reliable and valid information on self-reported cancer screening behaviors is a complex cognitive task, and the growing cultural diversity of the population adds to this complexity.
Lesson 4: Studies using self-reported cancer screening behaviors should assess reliability and validity and should quantify measurement error and bias in a broad range of respondents.
Lesson 5: Randomized controlled trials are not always the gold standard in evaluation design; a variety of study designs are appropriate for answering important evaluation questions.
Lesson 6: The quality of cancer screening intervention studies can be improved substantially.
Lesson 7: External validity deserves more attention in cancer screening intervention research.


Lesson 1: The Uptake of Screening Should Be Considered the Main Outcome When Evaluating Cancer Screening Strategies

In general, the most appropriate outcome measure for strategies aimed at promoting screening is screening uptake. Other factors that may lead to increased uptake (i.e., more ‘upstream’, or indirect, factors, such as measures of client knowledge, attitudes, or intentions; measures of provider knowledge, self-efficacy, or behavior; measures of system performance; and community characteristics) may be useful in selecting a strategy, theory, or process for intervention. On their own, however, these are not complete measures of success, because many upstream measures are not strongly correlated with screening use. For example, efforts to increase screening may increase client knowledge without increasing uptake of screening.

Nonetheless, not all evaluations need measure the same outcomes. Although the main evaluation outcome should be uptake of screening, the appropriateness of other outcomes will depend on the goal of the intervention, the characteristics of the target behavior, the target population, and the strategy. Such outcomes may include, for example, making a decision, making a commitment, preparing to execute a behavior, executing a behavior, or maintaining a behavior. Thus, studies of health care systems or of policies may have different primary outcome measures.

Theory and modeling are important in both intervention design and strategy evaluation, as they can help match strategies to local context, describe how strategies work, assess how well strategies are being implemented, and explain observed patterns of results. Among the primary uses of theory and modeling are the identification of appropriate outcomes and the selection of the timing of their measurement.

Influencing outcomes that are ‘downstream’ of screening (i.e., outcomes that occur after screening, such as mortality and cancer-specific mortality) is the ultimate goal of screening efforts. Typically, the relation between screening and a given downstream outcome has already been demonstrated before initial screening recommendations are made and does not need to be demonstrated repeatedly.2 Furthermore, downstream outcomes frequently cannot be measured in intervention studies, because downstream outcomes occur too far in the future and are uncommon, thus requiring very long-term follow-up or very large sample sizes.

Lesson 2: Agreement Regarding Conceptual and Operational Definitions of Behavioral Outcomes of Cancer Screening Is Needed

Because screening uptake typically should be the primary outcome measured in studies evaluating behavioral strategies, the next three lessons focus on issues related to the definition and measurement of cancer screening behaviors. Sallis et al.3 proposed a framework for classifying research in the behavioral sciences; this framework includes five research categories or phases that, if addressed, will provide a scientific evidence base for public health interventions. Of these five phases of research, the development of behavioral measures is the least studied, despite the fact that high-quality measures are essential for research at all stages, including establishment of the validity and reliability of extant measures, development of new measures, and field-testing of new tools.3 Hiatt and Rimer4, 5 also emphasized the importance of methodologic research—in particular, the need to develop standardized, reliable, and accurate measures to assess the progress of cancer control efforts. Rigorous and comprehensive evaluations of prevalence estimates, patterns of association with predictor variables, and intervention effects are impossible without clearly defined and consistent outcome measures that can be compared across studies.6–8

Experience from 2 decades of research intended to promote mammography use and Papanicolaou (Pap) testing makes it evident that one of the many challenges in synthesizing the results of applied research is that investigators use different conceptual and operational definitions for a given behavioral outcome (e.g., mammography screening completion).7, 8 Conceptual definition refers to the general phenomenon of interest—for instance, compliance with guidelines. Operational definition refers to the way in which the concept was measured—that is, the questions asked or the procedures used (e.g., the coding rules for abstracting data from a medical record). In addition, different data sources (e.g., questionnaires, medical records) have been used to measure outcomes, thereby potentially introducing different types of measurement error and bias.

Early intervention studies often reported initial, first-time, or ‘ever’ screening, but subsequent studies have focused on recency of screening (e.g., the number of months or years since the last test). However, as the prevalence of mammography screening and Pap testing has increased, the focus has shifted to compliance with guidelines or to the number of tests received within a given period. Now, with the availability of additional ways of defining consecutive on-schedule screening use, comparison among studies has become even more challenging.9

In a recent review of the literature on the prevalence of repeat mammography screening, Clark et al.9 found inconsistencies in terms used to describe consecutive on-schedule screening (e.g., repeat, regular, adherence, compliance, annual, rescreen, and maintenance). Conceptual definitions often were vague, and the operational definitions of regular on-schedule mammography use were inconsistent across studies. For example, operational definitions differed in terms of the interval defining sequential (12 months, 15 months, or 24 months), in terms of how many sequential mammograms constituted on-schedule use, or in terms of categories describing mammography use patterns within a given period (more than 1 in a lifetime, 2 in the past 6 years, or some other age-appropriate use pattern). These differences may influence prevalence estimates and intervention effects. They also may affect patterns of association between independent variables and behavioral outcomes. Recently, a limited number of studies have examined the effects of different measures of consecutive on-schedule mammographic screening within the same data set to evaluate the effects on prevalence estimates and patterns of association with correlates.10, 11 Table 2 shows that the proportion of women classified as on-schedule varies greatly depending on the interval used. Different measures had little effect, however, on patterns of association with the independent variables studied. In contrast, Phillips et al.12 compared correlates of 3 different measures of mammographic screening use—ever received a mammogram, received a mammogram in the past 2 years, and received an age-appropriate total number of mammograms—using the same data set and found that correlates varied according to the measure that was used.

Table 2. Prevalence of Repeat Mammographic Screening by Number of Months since Previous Mammogram and Type of Diagnostic Mammogram Exclusion(a)

Mos since previous mammogram   Cumulative proportion
                               No exclusions   Mammograms < 6 mos after previous mammogram excluded
< 12                           0.157           0.127
< 13                           0.428           0.407
< 14                           0.591           0.576

(a) Source: Partin et al.11
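How strongly the choice of interval cutoff drives prevalence estimates of on-schedule use can be sketched in a few lines. This is a minimal illustration only; the months-since-last-mammogram values below are hypothetical and are not data from Partin et al.

```python
def prevalence_by_cutoff(months_since_last, cutoffs):
    """For each candidate cutoff, return the proportion of respondents
    whose interval since the previous mammogram falls below it."""
    n = len(months_since_last)
    return {c: sum(m < c for m in months_since_last) / n for c in cutoffs}

# Hypothetical intervals (in months) for 10 respondents
months = [5, 11, 12, 12, 13, 13, 14, 18, 24, 30]
by_cutoff = prevalence_by_cutoff(months, [12, 13, 24])
```

The same respondents yield very different "on-schedule" prevalences under a 12-month versus a 24-month definition, which is the comparability problem Table 2 documents.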

Although recent systematic reviews and meta-analyses have called attention to the problem of inconsistent definitions and measures, they have not systematically evaluated variations in these definitions as a potential source of heterogeneity in the findings of evaluation studies.6–8, 13–15 Thus, we do not know with any confidence the influence, if any, that differences in measures have on the directions and magnitudes of effects observed in intervention studies, nor do we know whether correlates are similar for different usage measures (e.g., initial, recent, ever, on-schedule).12

The need for consistency and clarity in definition and measurement applies to new and emerging cancer screening behaviors as well. For example, colorectal cancer (CRC) screening is even more complex, as there are more acceptable screening tests (e.g., stool testing for occult blood, sigmoidoscopy, colonoscopy), the time interval for test completion is different for each type of test (and sometimes for the same test), and the guidelines have changed over time.16 Adding to this complexity is the number of available usage measures (e.g., initial, ever, recent). At a minimum, clear definitions and measures are necessary to interpret a body of evidence related to a specific cancer screening behavior.

Another issue arising in intervention research is that different measures of use may have different public health implications. For example, because the reduction of screening intervals from 3 years to 1 year appears to result in only small increases in the number of cervical cancers identified,17 increasing the proportion of women ever screened for cervical cancer is likely to produce larger population health benefits than would increasing the frequency of screening among women who have been screened repeatedly but possibly slightly off schedule. Identifying the measures that are the best predictors of success in improving health or reducing disparities may increase the effectiveness and cost-effectiveness of strategies at the population level and also may provide a focus for dissemination efforts.18

Lesson 3: Obtaining Reliable and Valid Information on Self-Reported Cancer Screening Behaviors Is a Complex Cognitive Task, and the Growing Cultural Diversity of the Population Adds to This Complexity

Assuming that researchers could agree on definitions and measures for cancer screening behaviors, such agreement would not ensure the validity of the information. Because the majority of studies of cancer screening behaviors use self-reported data, we focus here on problems associated with the process through which this information is obtained. In surveys, respondents are required to recall autobiographic information, including the circumstances of the screening event, the frequency of screening, the interval between each screening test, and (sometimes) the date of the most recent test.

Bringing together from memory the information required to respond to questions regarding screening events involves at least four cognitive processes or tasks: comprehension or interpretation of the question, retrieval of information from memory, formation of a judgment regarding how to respond, and revision of a response to adjust the amount of information revealed.19–24 We know relatively little about the cognitive processes respondents use to answer survey questions on cancer screening or about how those processes affect the reliability and validity of responses. It is increasingly well recognized that culture is an important element affecting autobiographic reports.25–30 Thus, the issues that crosscut cognitive tasks are those that result from the growing diversity in culture and ethnicity in the United States. This diversity also raises questions regarding data comparability.29–31

Of the four cognitive tasks, comprehension or interpretation of what is being asked is the most fundamental issue, because it relates to whether a respondent is answering the question asked.32 Variation in interpretation can occur if the conceptual basis for a question does not exist in the language of a respondent,33 if interpretation is mediated by life experiences related to culture or socioeconomic status,26, 27 or if the reading level of a question is too high given the respondent's reading ability.28, 34 Question interpretation also may be affected by translation and language.31

In crosscultural survey research, achieving equivalent responses requires establishing the meaningfulness of a construct within each respondent's cultural context and the capacity to create comparable measures.35 Difficulties in establishing conceptual equivalence can occur if constructs that depend on particular cultural or social contexts are incorrectly believed to be universally meaningful. Cultural syndromes36 or value patterns, such as variations in the concept of time, also influence the interpretation of questions, and these variations may affect conceptual equivalence. In cancer screening, for example, culture may affect how respondents understand issues such as screening intervals or visiting a physician when an individual is not experiencing symptoms. Pasick et al.31 found that questions dealing with visiting a physician in the absence of illness, with notions of risk or fatalism, or with certain screening measures (e.g., the Pap test) may not be relevant concepts or may not even be understood by members of certain cultural groups. Thus, measuring study outcomes using questions based on faulty assumptions about conceptual equivalence may produce misleading results.29, 30, 37

After interpretation, the next cognitive task involved in generating a response is the retrieval of information from memory. Recall can be episodic (i.e., individual events or behaviors are counted) or semantic (i.e., frequency is estimated based on the recency or regularity of event occurrence or on the number of event occurrences).38–41

Episodic events usually are stored in memory as discrete, countable events. They are rare, highly significant, easily counted, and include experiences such as a hospitalization or surgery. Episodic events or behaviors may be retrieved easily from memory, but the details of such events (e.g., location, date, or participants involved) may require cues for recall.41, 42 Cueing is helpful only if those circumstances have been stored along with the memory associated with the event in question.43 Recalling dates is particularly difficult39, 44, 45 and may require extensive cueing.46–49

In contrast, semantic recall is used more often to recall a regular or frequently occurring event or behavior and is often associated with a pattern of events, called a schema, in which a particular behavior occurs.39, 44, 50, 51 With regard to cancer screening, Warnecke et al.40 and Sudman et al.52 found that recall of Pap tests, clinical breast examinations, and mammograms was most accurate when organized around the schema of a regular health maintenance examination; however, most women in the study population were receiving annual examinations. Thus, such conclusions may not be generalizable to other contexts, such as community clinics.

Once a schema is established, deviations from the pattern are difficult to recall.40 The result may be either false remembrance or the forgetting of interim events. In the former instance, a woman may incorrectly assume, for example, that if she received an annual examination, she also received a cancer screening test. In the latter instance, if a woman associates a screening test (e.g., a mammogram) with her annual check-up, then she may forget to report exceptions, such as a follow-up for an abnormal mammogram. In the extreme, a test or procedure may be so embedded in the context of a medical examination that a patient may be unaware that a test was performed. For instance, a Pap test or a prostate-specific antigen (PSA) test may be performed as part of a health maintenance examination without a patient's knowledge unless the physician provides a cue that the test is taking place. Recent data obtained from men who were surveyed immediately after a health maintenance examination revealed that approximately 25% did not know that they had received a PSA test.53 We know very little regarding how these circumstances affect the accuracy of self-reported data.

We also have no data on whether similar recall strategies (episodic vs. semantic) are used to retrieve information regarding different kinds of cancer screening tests. Respondents may use episodic recall to answer questions on sigmoidoscopy and colonoscopy, as these screening behaviors are infrequent, the characteristics of the tests in question are memorable (e.g., their invasiveness, the fact that they require time off from work and may involve a referral), and the tests themselves are not necessarily associated with periodic events such as annual health maintenance examinations.

Recall, like interpretation, may be affected by cultural factors54 via conditioning that influences the value placed on an event and influences whether it is stored in memory. Cultural factors also may cause specific events to be stored together and may cue recall of additional information, such as the date of an event.39, 44 Thus, a schema used for recall may reflect a cultural orientation that may not be consistent with the dominant culture from which the question arises. For example, the concept of timing may possess different meanings to different cultural groups; therefore, accuracy of recall of screening intervals may vary based on the way in which time is interpreted. More research is needed to investigate whether and how cultural variation affects cognitive strategies used to recall cancer screening tests.

In the area of cancer screening, to our knowledge, neither judgment formation nor response editing has been investigated, nor has the question of whether these processes affect the reliability and validity of self-reported behavioral outcome measures. Judgment formation involves interpreting and processing the relevance of recalled information to form a response to a question.22–24 The resulting response may be straightforward or may be a synthesis of information retrieved from memory. Synthesis includes sorting, reconciling, and assessing the relative importance of information retrieved from memory and is conditioned by conflicting values, beliefs, and information content. Information that is accessed frequently (i.e., familiar information) is most likely to be used in making judgments.32, 55

Response editing occurs when respondents become concerned about how their answers will appear to those who are asking the questions.22–24 It also may be affected by cultural factors. Several conditions lead respondents to edit what they report. These conditions include acquiescence, social desirability, self-presentation, characteristics of the interviewer, and the interview language.29 Editing is most likely to occur in situations in which there are social or cultural differences between respondent and interviewer in terms of gender, ethnicity, age, education, or other social factors.56, 57 Editing also occurs when respondents are unsure of the meaning of a question. Under these circumstances, they may ‘play it safe’ and agree with the question rather than risk appearing foolish or demeaned by providing a ‘wrong’ answer.29, 34 Acquiescence (agreement with a statement regardless of content) is influenced by the social distance (as perceived by the respondent) between the respondent and the interviewer.58–60 Similarly, respondents may tend to overreport socially desirable behaviors, such as screening.31 Social desirability has been conceptualized as having two components: the need for social approval (considered to be a personality trait) and trait desirability (a respondent's assessment of whether or not a trait applies to her or him).61, 62 These types of response bias could contribute to the pattern of systematic overreporting of Pap tests and mammograms, as noted in Lesson 4.

Methodologic studies of the effects of asking a given question in different ways or assessing the presence of response biases, such as acquiescence and social desirability, are notably rare in the behavioral literature on cancer screening. Investigators may conduct qualitative studies, such as focus groups or structured interviews, as part of intervention development efforts, but the results of those studies are rarely published in the peer-reviewed literature. Although such studies require significant labor and time, they represent the only way in which certain insights can be gained.63

Lesson 4: Studies Using Self-Reported Cancer Screening Behaviors Should Assess Reliability and Validity and Should Quantify Measurement Error and Bias in a Broad Range of Respondents

Over the past 15 years, numerous studies have assessed the reliability and validity of self-reported measures of mammography use and Pap testing. However, there are very few published reports of systematic attempts to evaluate sources of response error through experimental manipulation of different versions of a given question or cognitive interviewing techniques. This lesson examines the results of reliability and validity studies in light of the existing cognitive research literature (discussed in Lesson 3) to understand the factors that may affect a measure's reliability (i.e., its consistency and stability over time) and validity (i.e., whether the respondent's answer provides the information sought by the question asked).


Reliability

The reliability of self-reported cancer screening behaviors has received scant attention. To date, we are aware of only six published studies that have examined reliability.31, 64–68 All six examined self-reported mammographic screening, but only two examined Pap test self-reports.67, 68 Comparisons across studies are difficult because of variations in the operational definition of agreement (e.g., ever had a mammogram, number of lifetime mammograms, or had a mammogram within 12 months) and in the amount of time elapsed between interviews (range, 5 days–2.6 years).
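Raw agreement between two interviews can overstate consistency when most respondents answer the same way by chance; a chance-corrected statistic such as Cohen's kappa is one common summary of test-retest reliability. A minimal sketch, using hypothetical paired yes/no responses (not data from any study cited here):

```python
def cohens_kappa(pairs):
    """Chance-corrected test-retest agreement for paired yes/no
    self-reports (e.g., 'ever had a mammogram' asked at two interviews).
    pairs: list of (time1, time2) booleans."""
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    p1 = sum(a for a, _ in pairs) / n          # 'yes' rate at interview 1
    p2 = sum(b for _, b in pairs) / n          # 'yes' rate at interview 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)   # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Hypothetical data: 90 of 100 respondents answer consistently
pairs = ([(True, True)] * 40 + [(True, False)] * 5
         + [(False, True)] * 5 + [(False, False)] * 50)
kappa = cohens_kappa(pairs)
```

Here raw concordance is 0.90, but the chance-corrected kappa is noticeably lower, which is why reporting agreement definitions and statistics consistently matters for comparing the six studies above.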

Studies promoting consecutive on-schedule screening use must attend to the reliability of self-reports, because most longitudinal studies involve the collection of overlapping data on screening dates and frequency. Vacek et al.64 suggested that women tended to overestimate the time since their previous mammogram if it was in the more recent past (the authors did not specify the interval) and to underestimate this length of time if a mammogram was further in the past (again, the authors did not specify the interval). Rauscher et al.66 found that women with shorter mammography histories (i.e., fewer total mammograms) provided more reliable responses than did women with longer histories (i.e., more total mammograms). Stein et al.67 suggested that nonwhite women tended to report the prevalence of mammographic and Pap testing less consistently compared with white women.

The poor reliability of self-reports regarding screening use may affect the evaluation of strategies aimed at promoting screening. It has been suggested that unreliable outcome measurement introduces random error, thereby biasing associations toward the null and causing observed effects to be underestimated.69 This effect has not been studied empirically, however, and more research is needed both to understand the factors (including culture) that affect consistency of reporting and to develop more reliable measures of cancer screening behavior.
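The attenuation argument can be made concrete with the standard result for nondifferential misclassification of a binary outcome: if a self-report has sensitivity se and specificity sp, the observed prevalence is p*se + (1 - p)*(1 - sp), so a true between-group difference shrinks by the factor (se + sp - 1). A sketch with purely illustrative values:

```python
def observed_prevalence(p, se, sp):
    """Prevalence of screening as measured by an imperfect self-report
    with sensitivity se and specificity sp (nondifferential
    misclassification: error rates do not depend on study arm)."""
    return p * se + (1 - p) * (1 - sp)

# Illustrative values (not estimates from the literature)
se, sp = 0.90, 0.85
true_control, true_intervention = 0.50, 0.60

obs_control = observed_prevalence(true_control, se, sp)
obs_intervention = observed_prevalence(true_intervention, se, sp)
# The observed 10-point effect shrinks by the factor (se + sp - 1) = 0.75
observed_effect = obs_intervention - obs_control
```

Under these assumptions a true 10-percentage-point intervention effect would be observed as only 7.5 points, illustrating how measurement error can cause effects to be underestimated.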


Validity

In 1997, Warnecke et al.40 reviewed validation studies that compared self-reports of Pap tests and mammograms with medical records (regarded as the ‘gold standard’). They examined four measures of accuracy: concordance (raw agreement rate), sensitivity, specificity, and the report-to-records ratio (a measure of net bias in test reporting). The report-to-records ratio is equal to the number of patients with positive self-reports (true-positive or false-positive) divided by the number of patients who actually received the test according to the record source (i.e., true-positive or false-negative self-reported data). Warnecke et al.40 found a consistent pattern of overreporting. Across studies, the average report-to-records ratio for Pap testing was 2.10, compared with 1.39 for mammographic screening. Although the 12 Pap test validation studies represented a wide range of settings, from health maintenance organizations (HMOs) to random community samples, 6 of the 8 mammography validation studies were conducted within HMOs.52, 70–82
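All four accuracy measures follow directly from the 2 x 2 cross-classification of self-report against the medical record. A minimal sketch, with hypothetical cell counts (not values from any study reviewed here):

```python
def accuracy_measures(tp, fp, fn, tn):
    """Self-report vs. medical record ('gold standard').
    tp: reported and recorded; fp: reported but not recorded;
    fn: not reported but recorded; tn: neither reported nor recorded."""
    n = tp + fp + fn + tn
    return {
        "concordance": (tp + tn) / n,            # raw agreement rate
        "sensitivity": tp / (tp + fn),           # correct recall of a test
        "specificity": tn / (tn + fp),           # correct report of no test
        # net bias in reporting; values > 1 indicate overreporting
        "report_to_records": (tp + fp) / (tp + fn),
    }

# Hypothetical counts for 200 respondents
m = accuracy_measures(tp=80, fp=30, fn=10, tn=80)
```

With these counts, the report-to-records ratio exceeds 1, the overreporting pattern that Warnecke et al. observed across the validation literature.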

Since the review by Warnecke et al. in 1997, a number of validation studies have been published.40, 52, 64–66, 83–102 Table 3 provides updated summaries of the four accuracy measures for Pap tests and mammograms and adds summary measures from validation studies of fecal occult blood test (FOBT), sigmoidoscopy, and PSA. Across all cancer screening behaviors, concordance ranged from 0.39 to 0.89; Pap testing was associated with the lowest level of weighted-average concordance (0.71), and sigmoidoscopy was associated with the highest level of agreement (0.89). Consistent with the earlier report by Warnecke et al.,40 overreporting of all cancer screening behaviors was found. The lowest weighted-average overreporting rate was 1.04 for PSA testing, and the highest was 4.27 for sigmoidoscopy; however, the latter summary measure was based on only 3 studies. Overreporting also was high for Pap testing (1.48) and FOBT (1.44) relative to the other screening tests (Table 3). The highest rates of overreporting (of mammography,78, 79, 82 Pap testing,74, 75, 78, 79, 95, 97, 103–105 and sigmoidoscopy79, 106) were found in county health department populations, public clinic populations, tumor registries, and ethnic populations.

Table 3. Accuracy Measures for Five Cancer Screening Behaviors
Cancer screening behavior | Concordancea | Sensitivitya | Specificitya | Report-to-records ratioa
Each measure is reported as a weighted averagea followed by its range.
  • Pap: Papanicolaou; FOBT: fecal occult blood testing; PSA: prostate-specific antigen.

  • a

    The definitions outlined by Warnecke et al.40 were used. Concordance is the percentage of all individuals who reported receiving a test or who reported no test in agreement with the record source. Sensitivity is the number of individuals who correctly recalled having the test divided by the number of individuals who had a test according to the record source. Specificity is the number of individuals who correctly reported no test divided by the number of individuals with no test documented in the record source. Report-to-records ratio is the number of individuals who reported receiving the test (true-positive plus false-positive reports) divided by the number of tests documented in the record (true-positive plus false-negative reports). Weighted averages were computed by multiplying the accuracy measure for each individual study by the proportion of individuals in that study relative to the total number of individuals in all studies for that cancer screening behavior.

  • b

    Concordance, sensitivity, and specificity calculations were based on data from 16 studies.52, 72, 78–82, 87, 89, 90, 92, 96, 102, 106, 108, 129 Report-to-records ratios were calculated using data from an additional study.77 The studies conducted by Vacek et al.,64 Barratt et al.,65 and Rauscher et al.66 were not included because those studies examined the accuracy of self-reports over time. The studies conducted by Thompson et al.93 and Montaño et al.130 were not included because data for computing estimates were lacking. The study conducted by Johnson et al.104 was not included because of empty cells. The studies conducted by Champion et al.,88 McPhee et al.,91 Zapka et al.,107 Etzi et al.,131 and Kottke et al.132 were not included because of concerns regarding study design. (Participants with negative mammography histories were not contacted, and thus, specificities and report-to-records ratios could not be calculated. Assessments of both negative and positive histories are critical in accurately gauging the degree of overreporting or underreporting.)

  • c

    Concordance, sensitivity, and specificity calculations were based on data from 18 studies.52, 70, 72–76, 78, 79, 94–97, 101, 103–105 Report-to-records ratios were calculated using data from an additional study.77 The studies conducted by Stein et al.67 and Brownson et al.68 were not included because they examined the accuracy of self-reports over time. The studies conducted by Bowman et al.,94 Montaño et al.,130 and Chan et al.133 were not included because data for computing estimates were lacking. The studies conducted by McPhee et al.91 and Kottke et al.132 were not included because of concerns regarding study design. (Participants with negative Pap test histories were not contacted.) The study conducted by Whitman et al.134 was not included because the study cohort was different from the chart abstraction population and thus was not comparable. The studies conducted by Hancock et al.110 and Mamoon et al.135 were not included because self-report survey participants were not matched to the registry database. The study conducted by Kahn et al.136 was not included because that study evaluated the accuracy of Pap test results (abnormal vs. normal) and not the accuracy of recall of screening behavior.

  • d

    Calculated using data from five studies.72, 79, 81, 84, 106 The studies conducted by Baier et al.,83 Manne et al.,85 and Lipkus et al.137 were not included because data for computing estimates were lacking. The study conducted by Schoen et al.86 was not included because of concerns regarding study design (Participants with negative FOBT histories were not contacted).

  • e

    Calculated using data from three studies.72, 79, 106 The studies conducted by Baier et al.,83 Manne et al.,85 and Lipkus et al.137 were not included because data for computing estimates were lacking.

  • f

    Calculated using data from two studies.99, 100 The study conducted by Godley et al.98 was not included because of concerns regarding study design. (That study dealt with the accuracy of data abstractors in distinguishing screening tests from diagnostic tests.)

Pap testingc | 0.71 (0.39–0.95) | 0.90 (0.71–0.99) | 0.40 (0.10–0.73) | 1.48 (0.99–3.32)
PSA testingf | 0.72 (0.71–0.74) | 0.75 (0.74–0.76) | 0.68 (0.65–0.74) | 1.04 (0.95–1.20)
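The weighted averages in Table 3 follow the computation described in footnote a: each study's accuracy measure is weighted by that study's share of the total number of individuals across all studies for a given screening behavior. A sketch of that computation, using invented study values rather than data from the cited studies:

```python
# Weighted average of an accuracy measure across validation studies,
# as described in footnote a of Table 3. Study entries are invented
# (n_participants, accuracy_measure) pairs, for illustration only.

def weighted_average(studies):
    """studies: list of (n_participants, accuracy_measure) pairs."""
    total_n = sum(n for n, _ in studies)
    return sum((n / total_n) * measure for n, measure in studies)

# Three hypothetical Pap test validation studies and their
# report-to-records ratios:
pooled_ratio = weighted_average([(400, 1.25), (250, 1.78), (350, 1.40)])
```

Weighting by study size keeps a small study with an extreme ratio from dominating the pooled estimate, which matters given the wide ranges shown in Table 3.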

Currently, we can only speculate on the possible explanations for why overreporting is higher for Pap testing and sigmoidoscopy, but an examination of possible explanations may serve to direct future research. Possible explanations include cognitive issues related to the tests (i.e., comprehension, recall, judgment formation, and response editing), test characteristics, the setting in which the test was conducted (e.g., HMO vs. community clinic), cultural differences, record source accuracy, and survey administration mode. Moreover, results may be influenced by interactions among these factors.

The scant data available40, 52 indicate that respondents may use schemas, such as the process of receiving a regular medical check-up, to recall mammographic and Pap testing, as is discussed above (Lesson 3). Both tests generally occur in the context of a check-up and thus, arguably, should be equally subject to overreporting or underreporting. Because of the characteristics of these tests, however, women who received a pelvic examination may have assumed that they had undergone Pap testing, whereas this assumed link would be unlikely for women undergoing mammography screening.31 When tests are similar to each other (e.g., sigmoidoscopy and colonoscopy), misreporting may occur, as patients may not understand the differences between the similar tests or be aware which test was administered.

The setting (e.g., HMO, community clinic) in which a test was performed also may be associated with differences in overreporting or underreporting according to test type.107 Most mammography validation studies were conducted in HMO settings, as noted above, but most Pap test validation studies were performed in county health departments or public health clinics.75, 91, 103, 104, 108 To explore the ways in which setting may affect the degree of overreporting, we conducted a stratified analysis of report-to-records ratio as a function of study setting for Pap test validation studies. For studies involving HMO populations, the ratio was 1.38 (range, 1.25–1.78); in contrast, for studies involving public clinic, ethnic minority, or randomly drawn samples, the ratio was 1.60 (range, 1.00–3.32).
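A stratified analysis of this kind amounts to computing the size-weighted pooled ratio separately within each stratum of study setting. A sketch, with invented study entries rather than the actual studies analyzed above:

```python
# Sketch of a stratified (by study setting) pooling of report-to-records
# ratios. The (n, ratio, setting) entries below are fabricated for
# illustration and do not reproduce the stratified analysis in the text.

studies = [
    (300, 1.25, "HMO"), (200, 1.78, "HMO"),
    (250, 1.00, "public clinic"), (150, 3.32, "public clinic"),
]

def pooled_ratio(entries):
    total_n = sum(n for n, _, _ in entries)
    return sum((n / total_n) * ratio for n, ratio, _ in entries)

# Pool within each setting stratum.
by_setting = {
    setting: pooled_ratio([e for e in studies if e[2] == setting])
    for setting in {s for _, _, s in studies}
}
```

Comparing the strata then shows whether pooled overreporting differs by setting, as reported above for HMO versus public clinic populations.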

Setting may also affect the report-to-records ratio because studies conducted in clinics attempted to validate only one screening test (e.g., the most recent one), whereas studies conducted within HMOs usually attempted to validate more than one sequential screening test. Consequently, in the latter type of study, cohorts tended to include only patients who had longer periods of HMO membership (e.g., 5 years).40, 72, 79, 83, 87

Study setting may be confounded with other factors, such as ethnicity and culture. Community clinic populations consisted primarily of individuals belonging to ethnic minorities, as described above. Cultural differences may affect recall strategies or response editing behaviors and thereby contribute to overreporting. Several authors have suggested, for example, that cultural differences in how time is viewed (e.g., dates, schedules), and even in how respondents understand which cancer screening test is being discussed, may reduce the accuracy of self-reported data.31, 107

Participants in managed care programs typically receive most of their medical care from these programs, whereas individuals who obtain medical care from county health departments or public health clinics tend not to use these services consistently for routine care; consequently, medical records kept by managed care programs may be more complete than records kept by public health departments or clinics.87, 91, 108 Thus, biases in report-to-records ratios may be attributable to underreporting in medical records rather than to overreporting by patients.

Another factor that potentially contributes to overreporting in validation studies is the existence of differences in survey formats (e.g., mail-based, telephone-based, interactive computer-based, in-person). In general, it has been found that telephone and mail surveys yield comparable data for nonthreatening questions.109 Few studies are specific to cancer screening behaviors; however, Zapka et al.107 found no difference in the validity of self-reported dates of most recent mammographic screening as reported by telephone or by mail.

Before drawing any conclusions regarding characteristics associated with overreporting or underreporting, additional research on self-reports is required to elucidate the relations that exist among ethnicity, setting, data source, and (perhaps) data collection method. In the meantime, future validation studies should consider how differences in these factors may affect the accuracy of self-reports.

Studies also should evaluate which of the criterion sources or ‘gold standards’ (e.g., medical records, laboratory reports, or administrative databases) is the most accurate. Each criterion source has limitations,110 and these limitations may have different effects on the accuracy of self-reports, depending on which screening test is being evaluated. The limitations of medical records include incompleteness of files due to the receipt of health care from multiple sources, incompleteness of examination records kept by physicians, and incompleteness of coverage of the period between recommended screenings. For all screening tests, it is necessary to capture testing performed outside the context of a specific medical care setting (e.g., testing conducted at a health fair or at another medical facility). It also is important to consider variations in the accepted screening interval. In addition, insurance coverage may encourage physicians to modify the way in which they record procedures (e.g., as screening procedures vs. diagnostic procedures) in the medical records or for billing purposes.

Studies (e.g., meta-analyses) summarizing the literature on cancer screening interventions should attend to issues of reliability and validity. In meta-analyses of strategies aimed at promoting mammographic and Pap testing, authors have acknowledged a consistent pattern of overreporting of these behaviors.6, 14 They argue, however, that because women in both intervention and control groups are equally likely to overreport screening, it is unlikely that the relative estimate of screening compliance will be affected; i.e., there is no differential effect. If there is a differential effect between the intervention group and the control group, however, then estimates could be biased. For example, if members of the intervention group receive educational materials encouraging screening, then they may perceive that this is a desired behavior; therefore, self-reports by intervention group members on follow-up surveys may be affected to a greater extent by a social desirability response bias. In contrast, estimates could be biased in the opposite direction if participants assigned to receive the intervention increased their understanding of the types of screening tests and could report more accurately than control participants could on whether they received a test and on which test they received. These sources of bias are a concern, because they may affect the results and conclusions drawn from intervention studies. If possible, researchers should measure (perhaps in a subpopulation) whether differential reporting is occurring and adjust accordingly for overreporting or underreporting in the analysis of intervention effects.
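The social desirability scenario described above can be made concrete with some simple expected-value arithmetic. All rates below are invented for illustration; they do not come from any cited study.

```python
# Hypothetical arithmetic showing how differential overreporting can
# inflate an estimated intervention effect. Assumed (invented) inputs:
# true screening uptake of 50% (intervention) vs. 40% (control), and
# false reporting of screening by 15% of intervention-arm nonscreeners
# (primed by the educational materials) vs. 5% of control-arm nonscreeners.

def observed_rate(true_rate, overreport_rate, underreport_rate=0.0):
    # observed = true screeners who report correctly, plus
    # nonscreeners who falsely report having been screened
    return true_rate * (1 - underreport_rate) + (1 - true_rate) * overreport_rate

true_diff = 0.50 - 0.40   # true effect: 10 percentage points
obs_diff = observed_rate(0.50, 0.15) - observed_rate(0.40, 0.05)
```

Under these assumptions the observed difference (14.5 points) overstates the true 10-point effect; with the overreporting rates reversed, the effect would instead be understated, which is why measuring differential reporting in a subsample is worthwhile.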

Poor reporting accuracy may also bias associations between predictors and screening behavior. For example, if an outcome measure has poor reliability and validity, then studies aimed at identifying factors that influence the outcome of interest (e.g., correlates or predictors of screening) may yield biased associations. Consequently, investigators may use a flawed set of predictors to design their strategy, and that circumstance could affect their ability to detect ‘true’ effects. This problem may be even more critical when translating strategies to ethnically diverse populations in which self-reports already are subject to acquiescence and social desirability response tendencies. To date, the magnitudes and directions of biases attributable to overreporting or underreporting and the factors associated with these biases have not been investigated.
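One well known consequence of imperfect outcome measurement is attenuation of predictor-outcome associations when the misclassification is nondifferential with respect to the predictor. The following is a hypothetical calculation with invented inputs, not an analysis of any cited data.

```python
# Hypothetical sketch: nondifferential misclassification of the outcome
# (self-reported screening measured with sensitivity 0.9 and specificity
# 0.7) attenuates the odds ratio for a predictor of screening.
# All input values are invented for illustration.

def observed_prevalence(true_p, sens, spec):
    # screeners correctly reporting, plus nonscreeners overreporting
    return true_p * sens + (1 - true_p) * (1 - spec)

def odds_ratio(p1, p0):
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Assumed true screening probabilities: 0.60 among the exposed group,
# 0.40 among the unexposed group (true odds ratio = 2.25).
true_or = odds_ratio(0.60, 0.40)
obs_or = odds_ratio(observed_prevalence(0.60, 0.9, 0.7),
                    observed_prevalence(0.40, 0.9, 0.7))
```

Under these assumptions the observed odds ratio shrinks from 2.25 toward the null (to roughly 1.65), illustrating how a flawed outcome measure can obscure ‘true’ predictor effects.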

Because of the recent implementation of federal legislation limiting access to medical records (specifically, the Health Insurance Portability and Accountability Act of 1996), the use of self-reported data in studies is likely to increase. Lazovich et al.111 investigated the feasibility of using medical records instead of self-reports to assess mammography use in a population-based cohort. Due to the effort now required to obtain patient consent and medical records across a variety of health care settings, the authors did not recommend the use of medical records as a feasible, cost-effective alternative to self-reporting.111 Therefore, it is important to ensure that we have reliable and valid self-report measures to evaluate the effectiveness of behavioral interventions as well as to monitor progress and trends in adherence to cancer screening.

Lesson 5: RCTs Are Not Always the Gold Standard in Evaluation Design; A Variety of Study Designs Are Appropriate for Answering Important Evaluation Questions

RCTs often are considered to be the gold standard in evaluation design, because they possess notable strengths. RCTs generally are considered the best available tool for reducing measured and unmeasured confounding, they may exhibit unique strength in supporting causal inference, and they are easily understood. Moreover, substantial attention has been directed toward instances in which well designed trials appear to have overturned previous prevention- or treatment-related findings derived from observational studies. Recently, for example, randomized trials have suggested that combined estrogen-and-progestin therapy for postmenopausal women increases cardiovascular risk, a finding that is in discordance with previous observational findings highlighting the benefits associated with this type of treatment.112–114

Nonetheless, RCTs are not without their potential flaws. They can have significant shortcomings with respect to both internal and external validity. Potential problems pertaining to internal validity include poor concealment of allocation (which can bias results away from the null),115 secular changes in comparison groups,116 contamination (which can bias results toward the null), and inadequate sample size (which can make it impossible to draw any firm conclusions). In addition, many RCTs are plagued by problems with external validity.117 For example, studies in which highly motivated volunteers are allocated to an intervention and then monitored for the duration of the study by an even more motivated study staff may not translate well to routine practice. Despite some well publicized examples of inconsistent findings between RCTs and observational studies,112–114 several recent systematic comparisons show close agreement between the results of these two types of studies.118–122 This is especially true in the case of high-quality (e.g., prospective) observational studies.

RCTs may be incompatible with many types of group-oriented approaches aimed at increasing screening use (e.g., laws, policies, and system changes). The best feasible design for many policy evaluations probably is a variant on a time series with a concurrent comparison group. The conceptual advantages of RCTs should not prevent the best feasible evaluation from being conducted, nor should they prevent action from being taken on the basis of the results of such evaluations, when RCTs are infeasible or likely to be flawed.
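The logic of a time series with a concurrent comparison group can be sketched as a simple difference-in-differences calculation: the change at the policy site, net of the secular change observed at the comparison site. The quarterly screening rates below are fabricated for illustration.

```python
# Minimal difference-in-differences sketch for a policy evaluation using
# a time series with a concurrent comparison group. All rates are invented.

policy_site = {"before": [0.40, 0.41, 0.42], "after": [0.50, 0.52, 0.54]}
comparison  = {"before": [0.38, 0.39, 0.40], "after": [0.41, 0.42, 0.43]}

def mean(xs):
    return sum(xs) / len(xs)

change_policy = mean(policy_site["after"]) - mean(policy_site["before"])
change_comp   = mean(comparison["after"]) - mean(comparison["before"])
# Effect estimate net of the secular trend seen in the comparison series:
did_estimate = change_policy - change_comp
```

In this fabricated example the policy site improves by 11 points but the comparison series improves by 3 points on its own, so the design attributes only the 8-point difference to the policy.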

Both individual evaluations and scientific syntheses may be able to take greater advantage of the complementary strengths of both types of designs. For example, for areas in which most available data are obtained from observational studies, confounding is an important concern, and thus, when randomized studies are feasible and appropriate, they should be conducted. In instances in which contamination is a significant threat to validity (e.g., in a study involving a rapidly diffusing technology, in which case it would be difficult to recruit an unexposed comparison group), RCTs will provide a lower-bound estimate of effectiveness and may be supplemented usefully with quasi-experimental studies that could provide more reasonable estimates of effectiveness. Similarly, RCTs in which the generalizability of results is in question may be supplemented usefully by studies conducted in less highly selected populations. In summary, attention must be directed toward ways in which to reduce the potential for bias in nonrandomized studies and to increase the external validity of randomized studies.

Lesson 6: The Quality of Cancer Screening Intervention Studies Can Be Improved Substantially

Lately, considerable attention has been directed toward improving the conduct and reporting of intervention studies. Numerous approaches114, 123 involving systematic criteria have been developed to assess the quality of RCTs.113 These approaches focus on many important contributors to study quality, as well as on assessment of the dependent variable. The Consolidated Standards for Reporting of Trials (CONSORT) statement was developed to improve the reporting of randomized trials.124, 125 Some recent studies126, 127 suggest an association between promotion of the CONSORT statement by journals and improvements in reporting.

For other types of study design, fewer quality-assessment approaches exist, and there is less agreement as to how quality should be assessed. The Agency for Healthcare Research and Quality (AHRQ) recently conducted a systematic review of approaches to assessing the quality of randomized and nonrandomized studies.113 They identified 19 approaches that measured a set of quality characteristics that the reviewers considered to be crucial. Table 4 summarizes key quality characteristics derived from the AHRQ review as well as our experience with one of the approaches128 developed to help improve internal validity and the quality of reporting.

Table 4. Issues To Consider To Improve the Quality of Execution and/or Reporting of Intervention Studiesa
  • a

    Source: Adapted from Zaza et al.,128 Agency for Healthcare Research and Quality,113 and Moher et al.125, 126

Elements of study execution that may increase confidence that the intervention actually caused any observed changes in outcome (internal validity):
 Describe the selection of the study population and the comparability of the intervention and comparison groups.
  In observational studies, select the study population in ways that minimize selection bias.
  In randomized studies, adequately conduct and describe the randomization procedure, including random number sequence generation and concealment of allocation until interventions are assigned.
 Measure exposure to the intervention in ways that are reliable and valid. Document reliability and validity.
 Measure relevant outcomes in ways that are reliable and valid. Document reliability and validity.
 Conduct statistical tests whenever appropriate. Choose appropriate statistical tests and perform them correctly.
 Ensure high follow-up rates.
 Use study design and/or analytic techniques to minimize confounding.
 Minimize and explain other potential biases.
  Whenever feasible, blinding of participants, interveners, and/or outcome assessors can be performed to achieve this goal.
Additional measures that may improve quality of reporting:
 Describe any theoretic or conceptual background regarding the development of the intervention.
 Identify sources of funding for the study.
 Communicate results clearly, especially with respect to the linking of conclusions to supporting data.

Lesson 7: External Validity Deserves More Attention in Cancer Screening Intervention Research

Much of the available literature on study quality focuses primarily on internal validity. To the extent that evaluations are performed to inform practice, external validity (i.e., the extent to which the results are likely to apply to additional populations or in other contexts) may deserve more attention, even if that attention comes at some cost to internal validity. In the current supplement, Glasgow et al.18 note that interventions should have significant potential for dissemination. Table 5 lists a number of suggestions that may lead to increases in external validity. These suggestions can be summarized as follows: 1) choose strategies for selecting populations and study samples that reduce unnecessary exclusions and minimize attrition; 2) describe the intervention, population, and context in some detail so that applicability to the target population and to other populations can be assessed; and 3) perform new studies in populations and contexts that represent the populations of interest as closely as possible and that complement (rather than duplicate) existing research. In addition, as discussed above, study designs that are likely to exhibit a high level of external validity (because they can be conducted in populations of interest and are not limited to highly selected volunteer cohorts) may be useful, even if external validity comes at the cost of some internal validity.

Table 5. Issues To Consider To Increase External Validitya
  • a

    Source: Adapted from Zaza et al.,128 Agency for Healthcare Research and Quality,113 and Moher et al.125, 126

Elements of study design and execution that may increase confidence that results may apply to other contexts (external validity):
 Describe the study population in terms of person, place, time, and other relevant characteristics.
 Describe the intervention in enough detail so that it can be replicated; if journal space is a limiting factor, then consider supplementary reports or supporting Internet publications.
 Describe other relevant characteristics of the context in which the intervention was implemented; e.g., a clear description of the setting may be valuable.
 Choose a study population selection or sampling strategy that minimizes threats to external validity; e.g., reduce unnecessary exclusions.
 Perform new studies in populations or contexts selected to complement (rather than duplicate) existing research.


From the evaluations that have been conducted to date, much has been learned regarding ways in which to increase screening and the sophistication of evaluations; however, many opportunities for improvement remain. We conclude that the uptake of screening should be the main outcome when evaluating cancer screening strategies. Researchers should agree on definitions and measures of cancer screening behaviors and should assess the reliability and validity of these definitions and measures in various populations and settings. Accurate ascertainment of cancer screening behaviors—the primary outcome of intervention research—should be a cornerstone of evaluation efforts. Without accurate outcome measurement, we cannot produce a body of knowledge to inform best practices. In addition, we must find ways to increase the external validity of randomized designs, to reduce bias in nonrandomized studies, and to increase study quality. Better evaluations may improve the certainty with which we can make use of experimental findings and apply available science to practice.