Over the past 15 years, numerous studies have assessed the reliability and validity of self-reported measures of mammography use and Pap testing. However, there are very few published reports of systematic attempts to evaluate sources of response error through experimental manipulation of different versions of a given question or cognitive interviewing techniques. This lesson examines the results of reliability and validity studies in light of the existing cognitive research literature (discussed in Lesson 3) to understand the factors that may affect a measure's reliability (i.e., its consistency and stability over time) and validity (i.e., whether the respondent's answer provides the information sought by the question asked).
The reliability of self-reported cancer screening behaviors has received scant attention. To date, we are aware of only six published studies that have examined reliability.31, 64–68 All six examined self-reported mammographic screening, but only two examined Pap test self-reports.67, 68 Comparisons across studies are difficult because of variations in the operational definition of agreement (e.g., ever had a mammogram, number of lifetime mammograms, or had a mammogram within 12 months) and in the amount of time elapsed between interviews (range, 5 days–2.6 years).
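Although the cited studies vary in how they quantify agreement, test-retest reliability for a dichotomous item (e.g., had a mammogram within the past 12 months) is often summarized with Cohen's kappa, which corrects the raw agreement rate $p_o$ for the agreement $p_e$ expected by chance. Whether each of the six studies used this statistic is not stated above, so the formula is offered only as a point of reference:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

For example, two interviews that agree 85% of the time against a chance-agreement rate of 60% yield $\kappa = (0.85 - 0.60)/(1 - 0.60) \approx 0.63$, conventionally regarded as substantial agreement.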
Studies promoting consecutive, on-schedule screening must attend to the reliability of self-reports, because most longitudinal studies collect overlapping data on screening dates and frequency. Vacek et al.64 suggested that women tended to overestimate the time since their previous mammogram if it was in the more recent past (the authors did not specify the interval) and to underestimate this length of time if the mammogram was further in the past (again, the authors did not specify the interval). Rauscher et al.66 found that women with shorter mammography histories (i.e., fewer total mammograms) provided more reliable responses than did women with longer histories (i.e., more total mammograms). Stein et al.67 suggested that nonwhite women tended to report their mammographic and Pap testing histories less consistently than white women did.
The poor reliability of self-reports regarding screening use may affect the evaluation of strategies aimed at promoting screening. It has been suggested that unreliable outcome measurement introduces random error, thereby biasing associations toward the null and causing observed effects to be underestimated.69 This effect has not been studied empirically, however, and more research is needed both to understand the factors (including culture) that affect consistency of reporting and to develop more reliable measures of cancer screening behavior.
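This attenuation can be made concrete for a dichotomous outcome such as screening within a recommended interval (a sketch under the standard assumption of nondifferential misclassification, i.e., reporting errors that do not depend on group membership; the formula is a standard result from the misclassification literature, not taken from reference 69). If self-report identifies true screeners with sensitivity $Se$ and true nonscreeners with specificity $Sp$, the observed between-group risk difference is attenuated to

$$\Delta_{\text{obs}} = (Se + Sp - 1)\,\Delta_{\text{true}},$$

so that, for example, $Se = 0.95$ and $Sp = 0.80$ shrink a true 10-percentage-point difference to 7.5 points.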
In 1997, Warnecke et al.40 reviewed validation studies that compared self-reports of Pap tests and mammograms with medical records (regarded as the ‘gold standard’). They examined four measures of accuracy: concordance (raw agreement rate), sensitivity, specificity, and the report-to-records ratio (a measure of net bias in test reporting). The report-to-records ratio is equal to the number of patients with positive self-reports (true-positive or false-positive) divided by the number of patients who actually received the test according to the record source (i.e., those with true-positive or false-negative self-reports). Warnecke et al.40 found a consistent pattern of overreporting. Across studies, the average report-to-records ratio for Pap testing was 2.10, compared with 1.39 for mammographic screening. Although the 12 Pap test validation studies represented a wide range of settings, from health maintenance organizations (HMOs) to random community samples, 6 of the 8 mammography validation studies were conducted within HMOs.52, 70–82
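These four measures derive from the two-by-two cross-classification of self-report against the record source. The following sketch makes the definitions explicit (the cell counts are hypothetical, not taken from any study reviewed here):

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy of self-report against the record source (the criterion).
    tp: report yes, record yes   fp: report yes, record no
    fn: report no,  record yes   tn: report no,  record no
    """
    n = tp + fp + fn + tn
    return {
        "concordance": (tp + tn) / n,                # raw agreement rate
        "sensitivity": tp / (tp + fn),               # reporters among record-positives
        "specificity": tn / (tn + fp),               # deniers among record-negatives
        "report_to_records": (tp + fp) / (tp + fn),  # net bias; >1 indicates overreporting
    }

# Hypothetical validation study: 70 concordant positives, 20 overreports,
# 5 underreports, and 105 concordant negatives.
print(accuracy_measures(tp=70, fp=20, fn=5, tn=105))
# {'concordance': 0.875, 'sensitivity': 0.933..., 'specificity': 0.84,
#  'report_to_records': 1.2}
```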
Since the review by Warnecke et al. in 1997, a number of validation studies have been published.40, 52, 64–66, 83–102 Table 3 provides updated summaries of the four accuracy measures for Pap tests and mammograms and adds summary measures from validation studies of fecal occult blood test (FOBT), sigmoidoscopy, and PSA. Across all cancer screening behaviors, concordance ranged from 0.39 to 0.89; Pap testing was associated with the lowest level of weighted-average concordance (0.71), and sigmoidoscopy was associated with the highest level of agreement (0.89). Consistent with the earlier report by Warnecke et al.,40 overreporting of all cancer screening behaviors was found. The lowest weighted-average overreporting rate was 1.04 for PSA testing, and the highest was 4.27 for sigmoidoscopy; however, the latter summary measure was based on only 3 studies. Overreporting also was high for Pap testing (1.48) and FOBT (1.44) relative to the other screening tests (Table 3). The highest rates of overreporting (of mammography,78, 79, 82 Pap testing,74, 75, 78, 79, 95, 97, 103–105 and sigmoidoscopy79, 106) were found in county health department populations, public clinic populations, tumor registries, and ethnic populations.
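For readers replicating such summaries, a per-study measure can be pooled across studies as a weighted average; the sketch below assumes weighting by study sample size (Table 3's exact weighting scheme is not specified here, and the values shown are hypothetical):

```python
def weighted_average(values, weights):
    """Pool a per-study accuracy measure (e.g., the report-to-records
    ratio) across studies, weighting each study by its sample size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical: three sigmoidoscopy studies with ratios 3.5, 4.0, and 5.0
# based on samples of 120, 300, and 180 patients, respectively.
print(weighted_average([3.5, 4.0, 5.0], [120, 300, 180]))  # 4.2
```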
Table 3. Accuracy Measures for Five Cancer Screening Behaviors
Currently, we can only speculate about why overreporting is higher for Pap testing and sigmoidoscopy, but examining candidate explanations may help direct future research. These include cognitive issues related to the tests (i.e., comprehension, recall, judgment formation, and response editing), test characteristics, the setting in which the test was conducted (e.g., HMO vs. community clinic), cultural differences, record source accuracy, and survey administration mode. Moreover, results may be influenced by interactions among these factors.
The scant data available40, 52 indicate that respondents may use schemas, such as the process of receiving a regular medical check-up, to recall mammographic and Pap testing, as discussed above (Lesson 3). Both tests generally occur in the context of a check-up and thus, arguably, should be equally subject to overreporting or underreporting. Because of the characteristics of these tests, however, women who received a pelvic examination may have assumed that they had undergone Pap testing, whereas this assumed link would be unlikely for women undergoing mammographic screening.31 When tests resemble each other (e.g., sigmoidoscopy and colonoscopy), misreporting may occur because patients may not understand the differences between the tests or may not know which test was administered.
The setting (e.g., HMO, community clinic) in which a test was performed also may be associated with differences in overreporting or underreporting according to test type.107 Most mammography validation studies were conducted in HMO settings, as noted above, but most Pap test validation studies were performed in county health departments or public health clinics.75, 91, 103, 104, 108 To explore the ways in which setting may affect the degree of overreporting, we conducted a stratified analysis of report-to-records ratio as a function of study setting for Pap test validation studies. For studies involving HMO populations, the ratio was 1.38 (range, 1.25–1.78); in contrast, for studies involving public clinic, ethnic minority, or randomly drawn samples, the ratio was 1.60 (range, 1.00–3.32).
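Computationally, this stratified analysis amounts to grouping the study-level ratios by setting and summarizing each stratum; a minimal sketch with hypothetical study-level values follows (the actual per-study data are not reproduced here):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (setting, report-to-records ratio) pairs for Pap test
# validation studies; the real study-level values are not shown here.
studies = [
    ("HMO", 1.25), ("HMO", 1.30), ("HMO", 1.78),
    ("public/community", 1.00), ("public/community", 1.45),
    ("public/community", 3.32),
]

by_setting = defaultdict(list)
for setting, ratio in studies:
    by_setting[setting].append(ratio)

for setting, ratios in sorted(by_setting.items()):
    print(f"{setting}: mean ratio = {mean(ratios):.2f} "
          f"(range, {min(ratios):.2f}-{max(ratios):.2f})")
```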
Setting may also affect the report-to-records ratio because studies conducted in clinics attempted to validate only one screening test (e.g., the most recent one), whereas studies conducted within HMOs usually attempted to validate more than one sequential screening test. Consequently, in the latter type of study, cohorts tended to include only patients who had longer periods of HMO membership (e.g., 5 years).40, 72, 79, 83, 87
Study setting may be confounded with other factors, such as ethnicity and culture. Community clinic populations consisted primarily of individuals belonging to ethnic minorities, as described above. Cultural differences may affect recall strategies or response editing behaviors and thereby contribute to overreporting. Several authors have suggested, for example, that cultural differences in how time is viewed (e.g., dates, schedules), and even in how respondents understand which cancer screening test is being discussed, may reduce the accuracy of self-reported data.31, 107
Participants in managed care programs typically receive most of their medical care from these programs, whereas individuals who obtain medical care from county health departments or public health clinics tend not to use these services consistently for routine care; consequently, medical records kept by managed care programs may be more complete than records kept by public health departments or clinics.87, 91, 108 Thus, biases in report-to-records ratios may be attributable to underreporting in medical records rather than to overreporting by patients.
Another factor that may contribute to overreporting in validation studies is the survey administration mode (e.g., mail, telephone, interactive computer, in-person). In general, telephone and mail surveys have been found to yield comparable data for nonthreatening questions.109 Few studies are specific to cancer screening behaviors; however, Zapka et al.107 found no difference in the validity of self-reported dates of the most recent mammographic screening as reported by telephone or by mail.
Before drawing any conclusions regarding characteristics associated with overreporting or underreporting, additional research on self-reports is needed to elucidate the relations among ethnicity, setting, data source, and (perhaps) data collection method. In the meantime, future validation studies should consider how differences in these factors may affect the accuracy of self-reports.
Studies also should evaluate which of the criterion sources or ‘gold standards’ (e.g., medical records, laboratory reports, or administrative databases) is the most accurate. Each criterion source has limitations,110 and these limitations may have different effects on the accuracy of self-reports, depending on which screening test is being evaluated. The limitations of medical records include incompleteness of files due to the receipt of health care from multiple sources, incompleteness of examination records kept by physicians, and incomplete coverage of the period between recommended screenings. For all screening tests, it is necessary to capture testing performed outside the context of a specific medical care setting (e.g., testing conducted at a health fair or at another medical facility). It also is important to consider variations in the accepted screening interval: insurance coverage rules tied to these intervals may encourage physicians to modify the way in which they record procedures (e.g., as screening vs. diagnostic procedures) in the medical record or for billing purposes.
Studies (e.g., meta-analyses) summarizing the literature on cancer screening interventions should attend to issues of reliability and validity. In meta-analyses of strategies aimed at promoting mammographic and Pap testing, authors have acknowledged a consistent pattern of overreporting of these behaviors.6, 14 They argue, however, that because women in both intervention and control groups are equally likely to overreport screening, it is unlikely that the relative estimate of screening compliance will be affected; i.e., there is no differential effect. If there is a differential effect between the intervention group and the control group, however, then estimates could be biased. For example, if members of the intervention group receive educational materials encouraging screening, then they may perceive that this is a desired behavior; therefore, self-reports by intervention group members on follow-up surveys may be affected to a greater extent by a social desirability response bias. In contrast, estimates could be biased in the opposite direction if participants assigned to receive the intervention increased their understanding of the types of screening tests and could report more accurately than control participants could on whether they received a test and on which test they received. These sources of bias are a concern, because they may affect the results and conclusions drawn from intervention studies. If possible, researchers should measure (perhaps in a subpopulation) whether differential reporting is occurring and adjust accordingly for overreporting or underreporting in the analysis of intervention effects.
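The distinction between nondifferential and differential reporting error can be illustrated with a small simulation (all rates hypothetical; the sketch assumes overreporting arises only among true non-recipients and uses the risk difference as the effect measure):

```python
import random

def observed_rate(true_rate, overreport_prob, n=100_000, seed=0):
    """Simulated self-reported screening prevalence when true
    non-recipients falsely report screening with some probability."""
    rng = random.Random(seed)
    reports = 0
    for _ in range(n):
        screened = rng.random() < true_rate
        reports += screened or (rng.random() < overreport_prob)
    return reports / n

true_control, true_intervention = 0.40, 0.50  # hypothetical true rates

# Nondifferential overreporting (equal in both arms) shrinks the observed
# difference only modestly; differential overreporting (e.g., social
# desirability concentrated in the intervention arm) exaggerates it.
for label, p_control, p_intervention in [("nondifferential", 0.15, 0.15),
                                         ("differential", 0.10, 0.25)]:
    diff = (observed_rate(true_intervention, p_intervention)
            - observed_rate(true_control, p_control))
    print(f"{label}: observed difference = {diff:.3f} (true = 0.100)")
```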
Poor reporting accuracy may also bias associations between predictors and screening behavior. For example, if an outcome measure has poor reliability and validity, then studies that seek to identify factors influencing that outcome (e.g., correlates or predictors of screening) may yield biased associations. Consequently, investigators may use a flawed set of predictors to design their strategy, which could affect their ability to detect ‘true’ effects. This problem may be even more critical when strategies are translated to ethnically diverse populations in which reporting already is subject to tendencies toward acquiescence and socially desirable responding. To date, the magnitudes and directions of biases attributable to overreporting or underreporting, and the factors associated with these biases, have not been investigated.
Because of the recent implementation of federal legislation limiting access to medical records (specifically, the Health Insurance Portability and Accountability Act of 1996), the use of self-reported data in studies is likely to increase. Lazovich et al.111 investigated the feasibility of using medical records instead of self-reports to assess mammography use in a population-based cohort. Due to the effort now required to obtain patient consent and medical records across a variety of health care settings, the authors did not recommend the use of medical records as a feasible, cost-effective alternative to self-reporting.111 Therefore, it is important to ensure that we have reliable and valid self-report measures to evaluate the effectiveness of behavioral interventions as well as to monitor progress and trends in adherence to cancer screening.