Objectives: Patient-reported outcomes (PROs; self-report assessments) are increasingly important in evaluating medical care and treatment efficacy. Electronic administration of PROs via computer is becoming widespread. This article reviews the literature addressing whether computer-administered tests are equivalent to their paper-and-pencil forms.
Methods: Meta-analysis was used to synthesize 65 studies that directly assessed the equivalence of computer versus paper versions of PROs used in clinical trials. A total of 46 unique studies, evaluating 278 scales, provided sufficient detail to allow quantitative analysis.
Results: Among 233 direct comparisons, the mean difference between modes averaged 0.2% of the scale range (e.g., 0.02 points on a 10-point scale), and 93% were within ±5% of the scale range. Among 207 correlation coefficients between paper and computer instruments (typically intraclass correlation coefficients), the average weighted correlation was 0.90; 94% of correlations were at least 0.75. Because the cross-mode correlation (paper vs. computer) is also a test–retest correlation, with potential variation because of retest, we compared it to the within-mode (paper vs. paper) test–retest correlation. In four comparisons that evaluated both, the average cross-mode paper-to-computer correlation was almost identical to the within-mode correlation for readministration of a paper measure (0.88 vs. 0.91).
Conclusions: Extensive evidence indicates that paper- and computer-administered PROs are equivalent.
Patient-reported outcome (PRO) measures—i.e., self-reported measures of health status—are increasingly being used in medical and drug development studies [1–4]. PRO data are valuable for several reasons: 1) for many outcomes (e.g., pain, depression), patient reports are the best available method for obtaining information on unobservable events; 2) even when an event is observable (e.g., voiding, dietary intake), the patient is often in the best position to assess and report these outcomes; 3) PRO measures may be more reliable and valid than measures completed by a clinician via an interview with the patient; and 4) in the case of health-related quality of life measures, PRO data can uniquely provide information on a patient's perception of both a disorder and the treatment for the disorder. In sum, PRO measures supply valuable information on health status and treatment effects that could not be collected in any other way.
The use of computers to collect PRO data is becoming commonplace. Computerized assessments potentially offer a number of advantages over paper and pencil assessments: 1) missing data within an assessment can be reduced by requiring completion of an item before the patient can move on to the subsequent question; 2) computerized assessments can handle complex skip patterns, which often confound patients and result in incomplete or invalid data; 3) computerized assessments eliminate out-of-range and ambiguous data by allowing the patient to select only one of the on-screen response options; 4) computerized assessments reduce the effort and error involved in entering paper PRO data; 5) in diary studies, electronic diaries can implement sophisticated designs to ensure valid representation of the patient's experience; and 6) electronic data capture can time-tag records to document timely compliance and can increase compliance. Compliance with computerized diaries is often 90% or better, whereas studies have documented only 11% to 20% compliance with paper diaries [7,8]. Thus, there are several reasons why clinicians and researchers may prefer computerized administration to paper and pencil PRO measures.
Despite its promise, the shift to electronic patient-reported outcomes (ePROs) requires establishing the equivalence of PRO measures administered on a computer and the original paper and pencil versions [9,10]. In other words, evidence may be needed to demonstrate that scores derived from a computerized measure do not differ from scores derived from the paper and pencil version. Given that the computer and paper versions of PROs present the same text content and response options, one might expect them to be equivalent. Nevertheless, there are two primary reasons why computerized measures might not be equivalent: 1) differences in how the items and responses are presented to the respondent; and 2) potential difficulties that some individuals may have in interacting with computers. The first category encompasses a number of changes that are required to present a PRO measure on a computer. These changes can range from very minor changes, such as asking the patient to tap a response on a computer screen instead of circling a response on a page, to substantial changes, such as splitting items and responses onto multiple screens because of space constraints. A common change is that items are presented on a computer one at a time, whereas multiple items are generally presented on the same page in a paper and pencil assessment. This could alter responding if the participant refers to previous questions when answering the current item (e.g., referring back to one's responses about symptom intensity when considering overall health state). Although computerized assessments usually allow participants to move back through an assessment to view or change previous items, doing so is more difficult, which could influence responding. Assessments can be implemented on different platforms with varying screen sizes, ranging from small-screen personal digital assistants (PDAs) to large-screen desktop computers.
Because of the smaller screen size, more changes to the presentation of the assessment items may be required with PDAs, which could alter responding. Although migrating an assessment to a computer could adversely affect the instrument, there is also some evidence that computerized assessments can result in more valid data, especially when “sensitive” topics, such as drug use or risky sexual behaviors, are targeted [12,13].
The second concern touches on characteristics of the patient that may impede responding to an assessment completed on a computer. For example, individuals with high levels of “computer anxiety” might report more negative mood when completing a mood assessment on a computer [10,14]. More broadly, patients with little computer experience might have more difficulty completing computerized measures, resulting in a nonequivalent measure.
In this article, we assess the equivalence of ePRO assessments to their paper ancestors. The American Psychological Association (APA) defines equivalence as a demonstration that: 1) the rank orders of scores of individuals tested in alternative modes closely approximate each other; and 2) the means, dispersions, and shapes of the score distributions are approximately the same. Many empirical studies have addressed the equivalence of computerized PRO measures and paper and pencil PRO measures. Consistent with APA guidelines, most studies of PRO equivalence assess the correlation and/or mean differences between computerized and paper and pencil measures. In most cases, this is accomplished using a crossover design, where a patient completes one version of the PRO measure and later completes the other version. Ideally, the order of computer and paper administration is randomized to control for possible order effects. Studies using intraclass correlations, which account for both covariation (as do other correlations) and equivalence of means and variance, provide a particularly strong assessment of equivalence.
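To make concrete why an intraclass correlation is a stronger test of equivalence than an ordinary Pearson correlation, the following sketch computes ICC(2,1), the two-way random-effects, absolute-agreement, single-measure form commonly used for paired mode-comparison data. This is an illustrative implementation of the standard ANOVA-based formula, not code from any of the reviewed studies; the function name and inputs are hypothetical.

```python
def icc_2_1(paper, computer):
    """ICC(2,1): two-way random-effects, absolute-agreement intraclass
    correlation for paired paper/computer scores (one pair per subject).
    Unlike Pearson's r, it is reduced by a systematic mean shift between modes."""
    n, k = len(paper), 2
    scores = list(zip(paper, computer))
    grand = (sum(paper) + sum(computer)) / (n * k)
    row_means = [(p + c) / 2 for p, c in scores]          # per-subject means
    col_means = [sum(paper) / n, sum(computer) / n]       # per-mode means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * sum((rm - grand) ** 2 for rm in row_means) / (n - 1)   # subjects
    msc = n * sum((cm - grand) ** 2 for cm in col_means) / (k - 1)   # modes
    sse = sum((scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                                  # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Note that if every computer score were shifted up by a constant relative to paper, Pearson's r would remain 1.0 while the ICC would fall below 1.0, because the mode variance enters the denominator. This is the sense in which ICCs account for equivalence of means as well as covariation.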
In this article, we use meta-analysis to summarize results from studies quantifying the relationship between computerized and paper measures. Possible moderators were also considered. Migrating an assessment to a computer may differentially impact the responding of older patients. Therefore, studies including an older patient sample might exhibit lower correlations between paper and computerized measures and greater mean differences. We also examined the possibility that studies enrolling patients with less computer experience would exhibit larger mean differences and lower correlations. Further, we addressed whether the platform (PDA vs. PC) on which the computerized assessment was administered influences equivalence.
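A standard way to summarize correlations across studies, sketched below, is to convert each correlation to Fisher's z, take an inverse-variance weighted average (weight n − 3), and back-transform. This is a generic meta-analytic approach offered for illustration; it is not necessarily the exact weighting scheme used in this article, and the function name is hypothetical.

```python
import math

def pooled_correlation(rs, ns):
    """Pool per-study correlations rs (sample sizes ns) via Fisher's
    r-to-z transform, weighting each study by n - 3, the inverse of
    the sampling variance of z; return the back-transformed r."""
    zs = [math.atanh(r) for r in rs]          # Fisher r-to-z
    ws = [n - 3 for n in ns]                  # inverse-variance weights
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return math.tanh(z_bar)                   # back-transform z to r
```

Averaging on the z scale rather than the r scale avoids the downward bias that arises from the skewed sampling distribution of r at high correlations.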
Examining the equivalence of paper and computerized measures is essentially an examination of test–retest or alternate-forms reliability. This sets a high bar for demonstrating equivalence. Correlations between the two modes of administration should not only be high and significant, but should also meet requirements for demonstrating reliability. A test–retest correlation of 0.75 or higher is considered “excellent” [16,17] and was used as the standard of comparison here. It is also important to place the correlation in the context of test–retest correlations for two administrations of the paper measure. Variations in scores between paper and ePRO can occur either because of random variation or because of changes in the construct between assessments, which would also affect repeat administrations of a paper measure. Accordingly, we also compared the paper-to-ePRO correlations with test–retest correlations for paper. If the correlations are similar, this would be very strong evidence for the equivalence of the paper and computer measures.
The results summarized here show that computer and paper measures produce equivalent scores. Mean differences were very small and neither statistically nor clinically significant. Correlations were very high, and were similar to correlations between repeated administrations of the same paper-and-pencil measure.
Administering PRO measures on computer has the potential to improve patient compliance and reduce the data management burden on investigators. Nevertheless, it has been suggested that investigators need to evaluate equivalence when a PRO measure is moved from paper to electronic administration. For example, the FDA Draft Guidance on PRO endpoints suggests that migrating a measure from paper to computer requires validation testing to ensure that the computerized measure is equivalent to the paper measure. We reviewed the substantial literature on the subject to assess the equivalence of paper and electronic administration. The data from almost 300 comparisons yield an unambiguous conclusion: paper and pencil and computerized measures produce equivalent scores.
According to APA guidelines, one method for demonstrating equivalence is to examine differences between the average scores derived from the different administration modes. Mean differences between the two modes of administration were small—the average difference was only 0.2% of the scale range. There were very few instances where the difference exceeded 5% of the scale range. In a particular application, the investigator must evaluate differences associated with method of assessment relative to clinically meaningful “minimally important differences”. Although we could not evaluate the observed differences in relation to the minimally important difference, which differs across measures and populations, the observed mean differences—less than one half point on a 100-point scale—appear to be so small as to be of no practical significance in any context. Moreover, the meta-analysis showed that the mean difference was not significantly different from zero, indicating that even this small observed difference was likely due to random variation. Thus, the mean differences were very small and suggest equivalence.
In addition to mean differences, it is also important to examine the correlation between scores from each administration mode, to determine whether individuals retain their relative “rank” in the score distribution when completing the computerized measure. In these data, the average weighted correlation was 0.90, suggesting that relative position in the distribution is retained when the assessment is completed on a computer. Further, studies using ICC or weighted kappa, which take into account both covariance and score means and variability, yielded equivalence estimates that were almost identical to studies using more traditional correlation coefficients. This provides compelling evidence that there is little change in patient responses when migrating to an electronic platform.
There was substantial heterogeneity in the individual correlations, which was resolved by dropping several outliers, without affecting the overall ES. Nevertheless, we were unable to identify any methodological factors within the outlier studies that explained their extreme ES, and the analysis of moderators did not resolve this heterogeneity. Therefore, it is unclear why these studies produced extreme correlations. The variation may be caused by some factor we could not identify, or it may simply reflect random variability in a distribution. In fact, four of the outlier studies fell above the mean and four below it, as would be expected in a normal distribution. Even with this heterogeneity, only two of the studies produced a correlation that was less than 0.75. There is no reason to believe that the unexplained heterogeneity should temper the overall conclusions drawn from the meta-analysis.
When an assessment is completed on different occasions, there are many reasons why scores may change—among other things, participants may change their minds about how to respond to an item, their actual condition may change even when the interval between assessments is short, or simple random error may alter responses. This is why even repeated administrations of the same test, in the same modality, vary; yielding test–retest correlations lower than 1.0. Because the equivalence tests of paper and electronic assessments also involved two administrations, some of the variability in scores between paper and electronic tests is due to this test–retest variation, rather than to mode of administration. To assess how much of the observed variation was due to the changes in mode of administration, we compared the observed paper-to-electronic correlations to the test–retest correlations from two administrations of the paper assessment. The two measures of retest variation were very similar: in other words, administering a test on a computer is just like readministering the paper test a second time. This suggests that there is little or no variation in scores attributable to mode of administration, and provides compelling evidence that the computerized measures are equivalent to the paper and pencil versions.
The meta-analysis showed a high overall level of agreement between paper and computerized measures. We also examined whether equivalence varied by age, computer experience, and computer platform. Even though items may need to be altered to fit on a PDA screen (e.g., reducing the size of a VAS), there is little evidence that using a PDA decreases concordance with the paper version of the assessment. Mean differences were slightly larger in studies using PDAs, but this effect was small and largely due to one outlier study. Additionally, paper-computer correlations were not moderated by platform type—correlations were above 0.90 for both types of platform. Most importantly, the mean differences were not significantly different from zero for either type of platform, suggesting that both PDAs and larger screen devices produce scores that are equivalent to scores from paper forms. There was no variation by subjects' computer experience. Although increasing age was associated with lower paper-electronic correlations, this association was small and unlikely to be clinically relevant. Even when the average age of the sample is approximately 65 years old, the predicted paper-electronic correlation is 0.86. It is also possible that test–retest reliability is adversely affected by increasing age, regardless of mode of administration, which would explain the observed slight decline in paper-computer concordance.
The uniform results seen in this review have implications for the use of computerized measures in clinical trials: as long as substantial changes are not made to the item text or response scales, equivalence studies should not be necessary to demonstrate anew the equivalence or validity of a computerized measure. The studies we reviewed appeared to use “faithful migrations,” where the exact text of the paper instrument was ported to a computer screen, without making substantive changes in content. However, a limitation of this literature is that little information is provided about alterations that were made to the items to present them via a computer. For example, studies using PDAs would almost certainly have made some minor revision of items (e.g., placing general instructions on an introductory screen, followed by individual items), but these details are generally not reported. Our finding of equivalence cannot be directly generalized to cases where substantial changes are made to item content or where layout changes substantially affect users' ability to respond to the item, such as when questions are separated from response options or when scrolling is required to view an entire item. When substantive changes like these have been made to a computerized measure, equivalence studies such as those reviewed here may be necessary.
Although equivalence testing should not be needed in most cases when migrating an assessment from paper to computer, it may be fruitful to evaluate the changes in formatting, layout, etc., through cognitive interviewing techniques to ensure that the patients are interpreting the items as intended. Cognitive interviewing is a qualitative method for assessing respondents' interpretation of the assessment, using a small sample of patients studied in the laboratory. Through these small-scale studies, it is possible for investigators to determine whether the alterations made in migrating to computer influence the way in which the assessment is understood by patients.
Our conclusions cannot be generalized to all forms of electronic administration of PROs. We specifically addressed the case of written assessments moved from paper to computer administration. None of the studies reviewed here used an IVRS as the mode of electronic assessment. Studies examining IVRS have assessed the equivalence of IVRS measures with clinician-administered assessments (see review in ), not assessments completed directly by the patient. Those studies suggest the equivalence of IVRS and interview measures. However, the equivalence of IVRS and clinician-administered assessments does not indicate that IVRS measures are also equivalent to paper and pencil measures completed by the patient, nor can the conclusions of our review of written measures be generalized to IVRS. IVRS measures are fundamentally different from written measures, in that: 1) they are presented aurally, not visually; 2) the information is presented serially and the patient is required to retain the question text and response categories in working memory as the item is presented; 3) subjects cannot review the item or response array at a glance; and 4) whereas responses on a computer screen are typically presented in a meaningful order that helps subjects place themselves in the response set (e.g., from low to high severity), responding on a telephone keypad may disrupt this ordered physical representation of responses. Because clinician-administered assessments share these characteristics, it is not surprising that they are equivalent to IVRS measures. Nevertheless, because of the substantial differences between IVRS and patient-completed PRO measures, further testing is required before concluding that they are equivalent.
We have demonstrated that written assessments administered on paper and by computer are equivalent. This suggests that scores obtained via the two modalities are directly comparable. This finding should be doubly reassuring to investigators using electronic PROs in the context of randomized trials, where the focus is on comparison across groups that both use electronic assessment (thus making equivalence to paper instruments less of an issue).
The use of computerized measures to collect PRO data is likely to grow, because electronic assessment offers many advantages over paper and pencil measures. This growth need not be impeded by concerns about the equivalence of electronic PRO measures to their paper-and-pencil ancestors.