Patient-Reported Outcome Instrument Selection: Designing a Measurement Strategy


Jeff A. Sloan, Department of Health Sciences Research, Mayo Clinic and Mayo Foundation, 200 First Street SW, Rochester, MN 55905, USA.


Objective:  To discuss issues in the design of a measurement strategy related to the use of patient-reported outcomes (PROs) in support of a labeling claim.

Methods:  In association with the release by the US Food and Drug Administration of its draft guidance on the use of PROs to support labeling claims, the Mayo/FDA Patient-Reported Outcomes Consensus Writing Group was formed. This paper, part of a series of manuscripts produced by the Writing Group, focuses on designing a PRO measurement strategy.

Results:  Developing a PRO measurement strategy begins with a clear statement about the proposed label claim that will derive from the PRO data. Investigators should identify the relevant domains to measure, develop a conceptual framework, identify alternative approaches for measuring the domains, and synthesize the information to design the measurement strategy.

Often, no single existing instrument has been developed and validated for the purposes of a given study. In such cases, investigators may consider supplementing an existing questionnaire with additional scales or questions, modifying existing instruments for a new application or patient population, or developing a new instrument altogether. The level of revalidation required for modifications and adaptations depends on the extent of the changes made. Revalidation requirements may range from cognitive testing/debriefing, to confirm that subjects respond to the new instrument as expected, to full-scale reliability and validity evaluations.

Conclusion:  A position of “reasonable pragmatism” is recommended such that the best available measurement strategy be considered as evidence for labeling.


Many therapeutics carry the potential to benefit people in ways that are best measured by self-report, commonly referred to as patient-reported outcomes (PROs). The US Food and Drug Administration (FDA) released a draft guidance document regarding the incorporation of PROs in the regulatory context of drug development and approval [1]. In this article, we discuss topics related to instrument selection or, more generally, to designing a PRO measurement strategy including issues directly and indirectly related to the guidance document.

In a regulatory context, an outcome measurement strategy begins with a clear statement about the proposed label claim that will derive from the PRO data. The proposed label claim sets the objectives for PRO measurement and guides the design of the measurement strategy. In the meeting pertaining to the guidance document, the FDA stressed that “claims contain concepts,” with the concept being the thing or event to be measured. Thus, “success depends on alignment of product development, PRO development, and clinical trial objectives”[2].

Thus, the measurement strategy should reflect a conceptual framework that represents the hypothesized relationships across the PRO domains of interest [3]. The measurement strategy is the operational realization of the conceptual framework through items or instruments designed to assess the domains of interest. The proposed claim will likely use language that is more or less synchronous with language and concepts in any number of candidate PRO instruments that have been validated or are being validated.

A single “perfect” instrument to measure the PROs targeted by the proposed label claim may not exist. Rather than simply selecting an instrument “off the shelf,” researchers often need to design a defensible measurement strategy from the available approximations to the perfect instrument. This measurement strategy may involve combining previously developed and validated instruments in ways they have been used before, modifying or adapting already existing instruments for a new purpose, or even developing new questions or instruments.

Setting Goals and Framing the Context for PRO Measurement

The conceptual framework guides the PRO measurement strategy, but the strategy need not be fully exhaustive of the framework. Goals may vary from providing a comprehensive profile of the impact on health-related quality of life, to focusing on the impact on specific domains of health-related quality of life, or to investigating other PROs such as treatment satisfaction or patient bother. The scope of the goal relates to the proposed label claim and the concepts in that claim within the conceptual framework.

A PRO claim may encompass an important disease outcome (e.g., a common or disabling symptom) or an important treatment outcome (e.g., reduction in usual side effects observed in standard therapies), and it should include a justification for the importance of that outcome. Temple has defined a clinically meaningful outcome in the context of drug development as “. . . a direct measure of how a patient feels, functions, or survives and is expected to predict the effect of therapy”[4]. The nature of the PRO measure will concern a disease, condition or health state, a population, and a treatment. The effect or outcome will relate to changes in function and in clinical course that correspond to variations in the test results [5].

The importance of the differences in clinical outcomes between treated and control patients will be gauged by the magnitude of the result [6]. A labeling claim could be targeted at single or multiple specific domains (e.g., decreased fatigue in anemia patients or improved vitality in other patients), single or multiple general functioning domains (e.g., physical function and social function in anemia patients), or summary or overall scores (e.g., the Physical Component Summary of the SF-36). The major point is that the claim should focus on an a priori hypothesis of improvement that the data can support.

Investigators should develop and evaluate their PRO measurement strategy and research plan similar to how they devise their clinical end point measurement strategy and research plan. Considerations include biological plausibility; prior clinical research data; evidence of methodological development; clinical perceptions of use, effect, and meaning; knowledge of feasibility in a research setting; understanding of public health consequences; and patients' views concerning importance and meaning [7]. With both PRO end points and clinical end points, “all claims of clinical benefit require substantial evidence” supported by adequate and well-controlled evaluations (i.e., having adequate development and appropriate research design and statistics including a prespecified analysis plan) [8]. These requirements apply regardless of whether the goals of the PRO measurement strategy are broad or narrow, unidimensional or multidimensional, single-item or multi-item. The key point is that the PRO measurement strategy should be designed based on the goals of PRO measurement as outlined in the target claim. The well-controlled research design affords an opportunity to test a PRO measurement strategy, and similar to clinical research, not all PRO concepts need to be tested in the same research program.

The PRO measurement strategy should be designed and evaluated in the same way as the broader clinical measurement strategy, with comparable rigor and with comparable reason and pragmatism. If, using the above example of fatigue and anemia, sponsors and regulators agree that fatigue is an important PRO to document in a drug registration program, then the “best available measurement” of fatigue should be accepted in submission of data. More than one “best available” instrument, or set of instruments, may exist for PRO measurement.

The measurement approach, if well-documented, would logically be acceptable in the labeling claim, even if not every component of the background conceptual framework or measurement strategy is fully satisfied. Such an approach would represent a “level playing field” for PRO and clinical outcome data, as the latter are often as imprecise as (lacking reliability), or even more imprecise than, PRO data [9].

A position of “reasonable pragmatism” would allow for the best available measurement to be considered as evidence for labeling. If done in such a way that promotes further development of the conceptual and measurement frameworks, this practice would truly advance the field. Because the review and acceptance of frameworks and their proposed measurement is highly subjective, to do otherwise risks deeming PRO data as inadequate for consideration based on individual preferences for frameworks and their measurement.

Developing the Measurement Strategy

Investigators can take the following steps to design a PRO measurement strategy: 1) identify the relevant domains to measure; 2) develop a conceptual framework; 3) identify alternative approaches for measuring the domains; and 4) synthesize the information to design the measurement strategy. These steps represent the process that a company could go through to develop its PRO measurement strategy. The FDA spoke strongly in favor of sponsors providing complete and detailed documentation describing the development of any PRO instrument used to support a claim [2]. Nevertheless, an argument can be made for the FDA's receiving only complete information about the claim and the instrument(s) used to support that claim related to the domains to be studied. The necessity of providing a comprehensive listing of every domain affected by the disease and its treatment, or a comprehensive description of alternatives for measuring every domain, adds significantly to cost and documentation required in the claims submission.

Step 1: Identifying the relevant domains to measure.  In determining the relevant domains to measure, investigators should develop a comprehensive list of domains that are affected by the disease itself and also its treatments based on their therapeutic effects and side effects. Next, using all available information, they should create a list of the domains that are expected to be affected by the experimental therapy. In each case, consideration must be given to both positive and negative effects. Then they should narrow the list based on relevance criteria, in particular whether the domain is relevant to the proposed labeling claim.

Regarding the negative impact of therapy, both experimental and control, side effects are often assessed through the provider-completed adverse event reporting system (e.g., using a Common Toxicity Criteria scoring system). This is not PRO measurement as considered in this article. In some cases, investigators may wish to assess these side effects as part of the PRO measurement strategy. They may want to obtain information on the functional impact of the side effect, such as conducting an assessment of the bother of (rather than the frequency or severity of) certain side effects.

Because adverse events and PROs are not synonymous, both have a place in assessing the impact of therapy. Because standard adverse event reporting may fail to detect some differences, PROs may provide a more sensitive measure in some instances. When a sponsor wishes to demonstrate a superior safety profile (i.e., fewer or less severe side effects), getting the information directly from the patient is well-advised, although not mandatory. Thus, although a sponsor may choose at times to assess the negative effects through its PRO measurement strategy, PRO assessment of negative aspects of treatment should not be confused with standard assessment of adverse drug events [10].

Step 2: Development of a conceptual framework.  The next step is to develop a conceptual framework based on the identified domains. The conceptual framework should outline the relationship between the domains and the hypothesized impacts, both positive and negative, of the experimental and control therapies. This conceptual framework may be useful in refining the goals for PRO measurement. Specifically, investigators can use the conceptual framework to identify their final domains of interest and set priorities for the truly important ones, i.e., those for which the company aims to obtain a labeling claim and thus those that will be most important in designing the PRO measurement strategy. For companies that also seek to conduct a more comprehensive PRO measurement as part of the study, the conceptual framework can help identify all of the domains of interest, not just those that will be used to support the labeling claim. For more information on developing a conceptual framework, see the article by Rothman et al. [3].

Step 3: Identifying candidate approaches for measuring the domains.  After investigators identify the relevant domains and develop a conceptual framework, their next step is to identify the most suitable approach for measuring the domains of interest from alternative approaches. It is not uncommon for no single instrument to cover all domains targeted for the labeling claim. In such instances, investigators may need to use multiple instruments, to modify or adapt an existing instrument, or to develop a completely new instrument. Thus, determining how to measure a PRO in a regulatory context involves not simply selecting an instrument, but rather designing a measurement strategy that will address the targeted domains.

Researchers should consider the relative strengths and weaknesses of the alternative instruments in terms of their comprehensiveness and their psychometric performance [11]. The Medical Outcomes Trust (MOT) developed a list of review criteria that can be used to evaluate the performance of candidate measures [12]. The eight review criteria the MOT proposes include the conceptual and measurement model, reliability, validity, responsiveness, interpretability, burden of administration, alternative forms/modes of administration, and cultural and language adaptations. For more information on evaluating the psychometric performance of instruments, see the article by Frost et al. [13].

When considering alternative measurement approaches, one should first determine whether an existing single instrument is an option. The literature should be searched to identify which instruments have been used previously in similar studies to determine how well they performed. The literature may contain potentially useful instruments that have not been used previously in similar studies. An example of the use of a single instrument occurred in the evaluation of Advair Diskus (GlaxoSmithKline, Research Triangle Park, NC, USA) for patients with asthma. The Asthma Quality of Life Questionnaire (AQLQ) was used to assess the patient's perception of asthma and its treatment. Based on the AQLQ results, the label for Advair notes that patients in the Advair Diskus group experienced improvements in their overall asthma-specific quality of life that were clinically meaningful in comparison with the group on placebo [14].

An existing single instrument for the targeted domains may not be available or sufficient, in which case alternative approaches to PRO measurement should be considered. Although the FDA draft guidance states that if an adequate PRO instrument does not exist a new PRO instrument can be developed [1], investigators have several alternative options when an existing instrument is not adequate. They may be able to modify or adapt existing instruments. Also, if an instrument covers most of the domains of interest, it can be used and supplemented with scales or items from other existing instruments or even with scales that are developed for that particular study. Such adaptations and modifications require varying degrees of revalidation work as discussed in the second half of this article.

The evaluation of etanercept for rheumatoid arthritis (RA) illustrates a measurement strategy using multiple instruments to cover the relevant domains. Several PRO measures were used, including the Health Assessment Questionnaire (HAQ), the SF-36, items assessing energy and mental health from the Medical Outcomes Study, and a single-item rating scale assessing current health [15]. The resulting package insert notes that all subdomains of the HAQ improved in patients receiving etanercept in two studies. It also notes that, in the study that included the SF-36, the patients receiving 25 mg etanercept showed significantly more improvement in the SF-36 physical component summary than the patients receiving 10 mg etanercept [16].

Another option when no single existing instrument addresses the relevant domains is to modify or adapt an instrument previously used in other studies and tailor it to the objectives of the proposed study. For example, eflornithine cream was developed to treat unwanted facial hair (hirsutism), but no existing PRO instrument assessed the impact of hirsutism. In this case, the researchers developed the ESTEEM scale (Exchanges of affection, Social interactions, Time spent removing facial hair, Encountering new people, Engaging in work or school, Minimizing overall bother with facial hair) by adapting the Bother Assessment in Skin Conditions scale (BASC), that had been developed and validated in the assessment of hyperpigmentation. The BASC was modified to create the ESTEEM scale by adapting characteristics of bother and discomfort to the setting of hirsutism [17]. The resulting label claim for eflornithine notes that it significantly reduced how bothered patients felt by their facial hair and by the time spent removing, treating, or concealing facial hair [18].

A final alternative is to develop a new instrument specifically tailored to the study. This approach, too, requires documentation that the new instrument is valid and reliable in this setting. For example, the International Index of Erectile Function (IIEF) was developed to detect changes resulting from treatment for erectile dysfunction [19]. The IIEF was used as the primary measure for the clinical efficacy of sildenafil, specifically focusing on two questions related to ability to achieve erections sufficient for sexual intercourse and maintenance of erections after penetration. Patients also used daily diaries on their sexual function and responded to a global question. Results from the IIEF were used to support the labeling claim for sildenafil, which notes that maintenance of erections after penetration was better in the sildenafil-treated patients than in placebo patients [20]. It also notes that sildenafil improved the frequency, firmness, and maintenance of erections; frequency of orgasm; frequency and level of desire; frequency, satisfaction, and enjoyment of intercourse; and overall relationship satisfaction. For more information on instrument development, see the article by Turner et al. [21].

One final consideration when evaluating alternative measurement approaches is whether it is feasible and advisable to include a general health status measure. Adding a general health status measure can identify unanticipated consequences, both positive and negative, of the experimental therapy or comparators. Using such instruments can also promote comparisons across diseases and populations. Including an additional instrument does increase administrative and respondent burden and costs. Although investigators should always consider this alternative, it will not always be appropriate.

Step 4: Synthesizing the information to design the measurement strategy.  After identifying the relevant domains and the alternative approaches for measuring them, the research team needs to consider the trade-offs of the various strengths and weaknesses and determine the best measurement strategy based on the study's priorities. First, they need to include domains targeted for a labeling claim. Second, the instruments used to measure the targeted domains should be psychometrically sound, and the best available, for the given application.

Patient-reported outcome researchers face trade-offs when designing a measurement strategy. On the one hand, using a previously developed and well-validated instrument lends credibility to the measurement strategy and allows for greater comparability across studies. If the instrument does not adequately target the relevant domains, however, it may not be as sensitive and responsive in its measurement properties as desired. On the other hand, using newly developed questions that are specifically tailored to measure the relevant outcomes for a given study may be more sensitive to differences and responsive to changes, but this approach requires significantly more work to demonstrate that the measure is valid and reliable.

Moreover, using such study-specific instruments does not promote cross-study comparisons. When investigators must deal with such trade-offs, they may find it helpful to identify the domains of greatest importance, ensure that those are measured with instruments of the greatest validity and reliability, and allow secondary domains to be measured with instruments that may not have been as well tested. For example, if a company is targeting depression and sexual function for a labeling claim but also wants to measure social function and emotional well-being, depression and sexual function should take priority in designing the measurement strategy, with social function and emotional well-being of secondary concern.

Because trade-offs and compromises will likely be required, the next step is to consider how to strengthen the weaker areas of the selected approach. If an instrument is being supplemented with newly developed items, pilot testing and validating these items before their use may be helpful. Similarly, if a previously developed and validated instrument is being used in new ways or new populations, testing the instrument in the target application or population before the main study is the ideal practice. Validation concurrent with the pivotal Phase III trial is a reasonable strategy, especially considering that the risk in this endeavor lies with the sponsor.

In some circumstances, using pivotal Phase III data to confirm the psychometric properties of an instrument is acceptable. For example, when a sponsor includes an instrument in Phase II studies, analyzes its psychometric properties, and on the basis of those analyses, revises the instrument, it is reasonable and standard practice to administer the revised version of the instrument in Phase III and use those data to confirm the reliability and validity of the final version. The FDA is seemingly still evaluating whether validation should be done before Phase III trials. Requiring validation before Phase III may be incongruous with the agency's stance that, in certain circumstances, molecular biomarkers can be validated during the course of pivotal Phase III trials. Ideally, in a concurrent validation strategy, the data should be collected in parallel in time, in a separate study.

Studies by Damiano et al. illustrate the four-step process for designing a measurement strategy in preparation for clinical trials in Parkinson's disease [11]. First, the researchers identified the areas affected by Parkinson's disease and its treatment. They conducted a literature review and consulted with clinicians and patients to identify the relevant domains. They also identified the two Parkinson's disease-specific questionnaires available at the time. They reviewed how well the two Parkinson's disease measures covered the relevant domains and the evidence available regarding their psychometric performance. Based on this review, they developed and tested a measurement strategy in the target population [22]. The measurement strategy included one of the instruments evaluated in the review (the Parkinson's Disease Questionnaire-39), but because this instrument did not address sexual function, they also included the MOS Sexual Function Scale. Finally, the SF-36 was used to identify the impact on general health status and to identify any unanticipated consequences. They also used this validation study to evaluate two modes of administration (at the study site and over the telephone). Thus, the validation study could be used to support the application of this measurement strategy in a clinical trial for a regulatory submission.
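The coverage review in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instrument names, the domain labels, and the greedy selection rule are hypothetical and are not taken from the Damiano et al. studies. The sketch also ignores psychometric performance, which the text treats as equally important when choosing among candidates.

```python
def uncovered_domains(target_domains, instruments, selected):
    """Return the target domains not covered by the selected instruments."""
    covered = set()
    for name in selected:
        covered |= instruments[name]
    return set(target_domains) - covered


def greedy_strategy(target_domains, instruments):
    """Pick instruments one at a time, each time adding the instrument that
    covers the most still-uncovered target domains (hypothetical rule)."""
    remaining = set(target_domains)
    chosen = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(instruments),
                   key=lambda name: len(instruments[name] & remaining))
        gain = instruments[best] & remaining
        if not gain:
            break  # no candidate covers what is left: a gap remains
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Hypothetical inputs: a disease-specific questionnaire that omits one
# targeted domain, and a supplementary scale that fills the gap.
targets = {"mobility", "emotional well-being", "sexual function"}
candidates = {
    "disease_specific": {"mobility", "emotional well-being"},
    "sexual_function_scale": {"sexual function"},
    "generic_health": {"general health"},
}
chosen, gaps = greedy_strategy(targets, candidates)
```

A nonempty `gaps` set would signal that an existing instrument must be modified or a new one developed. A generic measure such as `generic_health` is never picked by this rule because it adds no targeted domain; including it to detect unanticipated consequences is a separate judgment.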

A New Possibility

The best approach is often to use the best available instrument and to make incremental changes to improve its validity or relevance for the target study. In the future, the selective use of large sets of questions measuring key concepts (“item banks”) to create tailored short forms has great appeal for regulatory use. The tailored short forms might be static (i.e., the same small set of items selected by the researcher in discussion with the FDA) or dynamic (i.e., computer-adaptive tests, or CAT assessments). The advantage of this approach is that the concept under discussion has been well studied and the items in the instrument comprising it have been calibrated to the underlying concept being measured [23].
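How a calibrated item bank might yield static or dynamic short forms can be sketched as follows. The sketch rests on assumptions that do not come from this article: it uses a dichotomous two-parameter logistic model, in which each item has a discrimination `a` and a location `b`, and it selects items by maximum Fisher information at a given level `theta` of the concept. Real item banks often use polytomous models, and the item identifiers and parameter values below are invented for illustration.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(theta, bank, administered):
    """Dynamic (adaptive) selection: the unused item that is most
    informative at the current estimate of the respondent's level."""
    candidates = [it for it in bank if it["id"] not in administered]
    if not candidates:
        return None
    return max(candidates,
               key=lambda it: item_information(theta, it["a"], it["b"]))


def static_short_form(bank, theta_target, n_items):
    """Static selection: the n_items most informative items at a level
    fixed in advance for the target population."""
    ranked = sorted(bank,
                    key=lambda it: item_information(theta_target, it["a"], it["b"]),
                    reverse=True)
    return [it["id"] for it in ranked[:n_items]]


# Invented bank of three calibrated items (hypothetical parameters).
bank = [
    {"id": "item_low", "a": 1.0, "b": -1.0},
    {"id": "item_mid", "a": 1.0, "b": 0.0},
    {"id": "item_high", "a": 1.0, "b": 1.0},
]
first = next_item(0.0, bank, administered=set())
second = next_item(0.9, bank, administered={first["id"]})
form = static_short_form(bank, theta_target=0.0, n_items=1)
```

The static form fixes the item set before the study, which matches the case in which the researcher agrees on the items with the FDA; the dynamic version changes the items from one respondent to the next while relying on the same calibration.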

Through the National Institutes of Health PROMIS network, groups of investigators are currently developing item banks to address pain, fatigue, and other aspects of health-related quality of life. As item banks are developed, PRO researchers may have the opportunity to draw from item banks using previously validated and calibrated items that can be targeted to the relevant outcomes of interest.

Next, we discuss alternatives for modifying and adapting PRO instruments in cases when an existing questionnaire alone is inappropriate or insufficient. Special attention is given to the validation required for such alternatives.

Requirements to Revalidate a Modified Instrument

Often no available instrument assesses the relevant domains in the target population adequately, or existing assessment tools that are developed from the general population or some disease population are not easily transferable to the trial target population. Investigators face pressure to generate instruments that are relevant to the domains of interest in the trial population. Clinical trials are conducted under tight timelines, and developing a new instrument may not always be feasible or even necessary. A viable option is to modify and adapt existing instruments to fit the research needs. The types of modifications that may be made include:

  • Changes in wording or content;
  • Changes in mode of administration;
  • Translation and cultural adaptation; and
  • Application to a different patient population.

Modifications may occur as part of the natural evolution of instruments, and as long as an investigator or sponsor characterizes a viable PRO measurement strategy in a methodologically sound manner, iterative adaptations and modifications are acceptable. Modifying and adapting existing instruments to fit the research needs is an acceptable approach as long as four main conditions hold:

  1) The existing instrument has been adequately validated; measurement properties have been established, albeit for a different application (the advantage of adapting an instrument over developing one de novo).
  2) Questions in the adapted instrument are relevant and appropriate for the target application.
  3) The instrument is implementable; that is, it is logistically able to be utilized in the particular setting.
  4) A new interpretation guideline is developed if necessary.

Alternative positions on the extent to which an instrument's psychometric properties need to be re-evaluated after the instrument has been modified range considerably. For simplicity, this re-evaluation will be referred to as “revalidation.” The range of positions includes the following:

  1) All changes except superficial (cosmetic) changes require comprehensive revalidation, including use of confirmatory factor analysis for a multidomain questionnaire. The psychometric properties of the final instrument must be established before Phase III.
  2) Confirming the basic psychometric properties of the revised instrument in Phase III trials is sufficient. The sponsor bears the risk if questionnaire performance is inadequate. No other revalidation is required.
  3) Focus groups or individual patient interviews are required to confirm content validity. If content validity is confirmed, no additional revalidation is required (with or without confirmatory testing in Phase III).
  4) Cognitive testing/debriefing is required (with or without confirmatory testing in Phase III).

The FDA's position is somewhat unclear. The draft guidance states both that “The extent of additional validation recommended depends on the type of modification made” and that “The FDA intends to consider a modified instrument as a different instrument from the original and will consider measurement properties to be version-specific”[1]. We believe the extent of psychometric revalidation required depends on the degree of modifications made and the research framework used to support the validity of the modification. A novel PRO measurement strategy may require a comprehensive psychometric program, including item generation and reduction with patient input; development of scoring methods; and documentation of validity, reliability, and responsiveness to change. Conversely, adaptations and modifications that represent subtle changes in clinical context or setting but that rely on a well-developed research framework should not require a full psychometric validation. For example, if a domain such as Vitality from the SF-36 is part of a PRO measurement strategy, its psychometric lineage should enable its use with minimal additional validation. Similarly, if an instrument has not been used in the target population but has been used in a similar population (e.g., one cancer versus another), a rationale describing how the two populations are substantially similar should suffice, and additional validation work in the target population should not be required.
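As one concrete instance of the reliability documentation mentioned above, internal consistency is commonly summarized with Cronbach's alpha. The choice of this statistic is ours for illustration; neither the draft guidance nor this article prescribes it. The sketch below computes alpha from its usual definition, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), and the response data are invented.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a multi-item scale.

    responses: list of respondents, each a list of k item scores
    (same k for everyone). Sample variances are used throughout.
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across respondents.
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Variance of each respondent's total score.
    total_var = variance([sum(r) for r in responses])
    if total_var == 0:
        raise ValueError("total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Invented responses from four respondents to a three-item scale.
pilot = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 3, 4]]
alpha = cronbach_alpha(pilot)
```

Under the position taken here, a sponsor adapting a well-established scale might report such a statistic from trial data as confirmation rather than as a precondition for using the scale.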

Subsequent sections of this article address the degree of validation we believe is appropriate for different types of modifications. We pay special attention to highlighting circumstances in which we believe comprehensive revalidation is unnecessary. Before describing these recommendations, we explain below the key considerations and guiding principles on which they are based.

Need for Reasonable Pragmatism

Practical implications of revalidation requirements.  In setting forth recommendations on revalidation requirements, we want to emphasize the practical implications of the recommendations and the effects of these implications on product labeling and promotional claims. Because labeling is intended to provide complete, accurate, and balanced information to health-care practitioners, and because important clinical decisions are made on the basis of the information, the evidence requirements are deliberately rigorous.

On the one hand, setting the evidence threshold so low that it compromises the accuracy or reliability of the conclusions obtained would be detrimental, not only to patients, but also to the credibility and value of PRO research. On the other hand, setting the threshold too high could also undermine both clinical decision-making and PRO research. If the criteria are so demanding that they are rarely met, sponsors may be less likely to collect these data and, when they do, the information will be less likely to reach health-care practitioners. In either case, important PRO data would be excluded from product labeling. For PRO research to remain viable and attractive, practical implications of guidelines and research recommendations should be considered, and a balance sought between the ideal and the practical.

Impact on conclusions regarding treatment effects.  With randomized clinical trials, most or all of the changes in an instrument would be reflected equally in the treatment arms, so such modifications would not be expected to affect the relative comparisons across arms. For example, even though pain ratings have been shown to vary with the orientation of the visual analog scale [24], such changes would affect both treatment groups equally and would not be expected to alter the study's conclusions.
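The arithmetic behind this argument can be shown with a minimal sketch. It assumes, purely for illustration, that a modification shifts every score by the same constant in both arms; the scores and the size of the shift are invented. A modification that changed score variability or interacted with treatment would not cancel in this way.

```python
from statistics import mean


def arm_difference(treated, control):
    """Difference in mean scores between the treated and control arms."""
    return mean(treated) - mean(control)


def apply_shift(scores, shift):
    """Model a modification as a constant additive shift in every score
    (an assumption made only for this illustration)."""
    return [s + shift for s in scores]


# Invented scores for two small arms.
treated = [40.0, 50.0, 60.0]
control = [30.0, 40.0, 50.0]

original = arm_difference(treated, control)
modified = arm_difference(apply_shift(treated, 5.0),
                          apply_shift(control, 5.0))
```

Because the shift is added to both arm means, it cancels in the difference, so `original` and `modified` agree.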

Most decisions to assume the validity of a modification or adaptation based on the validity of the original version increase the risk of Type II error, that is, failure to detect an effect that is actually present. Although this effect is not ideal, in a regulatory approval context it amounts to accumulation of sponsor risk rather than any unfair or unique advantage.
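The two points above can be illustrated with a small Monte Carlo sketch. It is not taken from the Writing Group's work; the sample size, effect size, reliability values, and the function name `rejection_rate` are all assumptions chosen for illustration. The sketch simulates a two-arm trial in which the observed score is a true score plus measurement error, adds an optional constant shift to both arms (as a change in scale orientation might), and counts how often an approximate two-sided z-test rejects at the 5% level.

```python
import random
import statistics


def rejection_rate(delta, reliability, shift=0.0, n=100, sims=1000, seed=0):
    """Share of simulated two-arm trials in which a two-sided z-test rejects.

    delta       -- true treatment effect in true-score SD units (assumed)
    reliability -- var(true score) / var(observed score), between 0 and 1
    shift       -- constant added to every observed score in BOTH arms
    """
    rng = random.Random(seed)
    # True-score variance is fixed at 1, so this error SD yields the
    # requested reliability.
    error_sd = ((1.0 - reliability) / reliability) ** 0.5
    rejections = 0
    for _ in range(sims):
        control = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, error_sd) + shift
                   for _ in range(n)]
        treated = [rng.gauss(delta, 1.0) + rng.gauss(0.0, error_sd) + shift
                   for _ in range(n)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        se = (statistics.variance(control) / n
              + statistics.variance(treated) / n) ** 0.5
        # Normal approximation to the two-sample test (n is large).
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / sims


if __name__ == "__main__":
    for rel in (0.9, 0.5):
        print("reliability", rel,
              "power:", rejection_rate(0.4, rel),
              "false-positive rate:", rejection_rate(0.0, rel))
    print("power with a common shift:", rejection_rate(0.4, 0.9, shift=10.0))
```

Under these assumptions, lowering reliability reduces the proportion of trials that detect a real effect while leaving the false-positive rate near its nominal level, and a shift common to both arms leaves the comparison essentially unchanged. A real evaluation would need to reflect the actual instrument, design, and analysis plan.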

With these guiding principles in mind, we discuss the types of modification and their revalidation requirements below.

Types of Modifications

Change in wording or content.  With continued use of an instrument over time, opportunities arise to improve its performance through relatively small changes in the wording or content of items, response options, or instructions. For example, with an instrument that performs well in one language, the process of developing translations may suggest further refinements to item wording in the original (source) instrument. These changes typically enhance the instrument's properties and do not trigger a need for complete revalidation. Thoughtful cognitive testing and debriefing with a small group of patients can provide added confidence that the modifications are interpreted as intended.

When investigators select an established questionnaire, sponsors may wish to make changes that have benefited other instruments. Such changes can take the form of reducing the number of response options to improve reliability or expanding the number of response options to improve responsiveness. For example, Version 2.0 of the SF-36 improved on the previous version by substituting five-level response options both for the dichotomous response options in some items and for the six-level response options in other items [25]. It is highly probable that other questionnaires using the same set of six-level response options as the earlier version of the SF-36 would benefit from the same revisions.

A PRO instrument may need to be updated to reflect the effects of new treatments. For example, a questionnaire or checklist designed to evaluate specific side effects of treatments for a particular disease might need to be expanded as new treatments emerge. In this case, the modification is necessary to ensure that content validity is preserved.

Change in mode of administration.  A recent issue is how much revalidation is required when an instrument originally developed for paper-and-pencil administration is modified for electronic administration. Electronic administration comprises both interactive voice response systems (IVRS) and computer administration; the form of the latter can vary considerably with the specific device employed. The issues arising from a change in the mode of administration, and the potential for this to influence the data collected, will therefore depend on the specific technology.

A key consideration with IVRS is the additional memory load imposed on respondents, who must remember the instructions, questions, and response options. In contrast, if a standard-size computer monitor is used, a computerized version of the instrument could be constructed to mimic the paper questionnaire in terms of its layout and other key features. Smaller portable electronic devices may require changes to the layout of the questionnaire and may limit the amount of information that is simultaneously available.

To confirm that the psychometric properties of an instrument are retained when a paper instrument is modified for electronic administration, current common practice is to undertake cognitive testing to establish that subjects can navigate the electronic version and to conduct a small study to confirm that the basic psychometric properties of the instrument are unchanged. With additional experience in computer administration, however, conducting even a small validation study may no longer be necessary. After comparing alternative forms of questionnaires for measuring health-related quality of life and utilities, Guyatt concluded that measurement properties are “seldom affected” by changes in method of administration [26]. As experience with electronic administration accumulates, researchers may be able to identify the factors that determine when differences between modes of administration will occur and to apply these principles to facilitate construction of an electronic version that will replicate the psychometric properties of an original paper version. Similarly, moving from either computer or oral administration to paper administration may or may not be relatively straightforward depending on the structure of the questionnaire. In particular, a potential advantage of electronic administration, the use of complex skip patterns, may be very difficult or impossible to reconstruct on paper.
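As one possible illustration of the small study described above, the following sketch compares internal consistency (Cronbach's alpha) and mean total scores between a paper and an electronic version completed by the same subjects. The crossover design, the five-item scale, the sample of 60 subjects, and all data are hypothetical and simulated; the helper names `cronbach_alpha` and `simulate_mode` are introduced here for illustration only.

```python
import random
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [statistics.variance([r[i] for r in rows]) for i in range(k)]
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def simulate_mode(traits, rng, n_items=5, noise_sd=0.8):
    """Simulated five-level item responses driven by each subject's latent trait."""
    rows = []
    for t in traits:
        rows.append([min(5, max(1, round(3 + t + rng.gauss(0, noise_sd))))
                     for _ in range(n_items)])
    return rows


# Hypothetical crossover: the same 60 subjects complete both versions.
rng = random.Random(42)
traits = [rng.gauss(0, 1) for _ in range(60)]
paper = simulate_mode(traits, rng)
electronic = simulate_mode(traits, rng)

alpha_paper = cronbach_alpha(paper)
alpha_electronic = cronbach_alpha(electronic)

# Paired difference in total score (electronic minus paper) for each subject.
paired_diffs = [sum(e) - sum(p) for p, e in zip(paper, electronic)]
mean_diff = statistics.fmean(paired_diffs)

print("alpha (paper):", round(alpha_paper, 2))
print("alpha (electronic):", round(alpha_electronic, 2))
print("mean total-score difference:", round(mean_diff, 2))
```

In practice, investigators would prespecify what size of difference between modes is acceptable and would likely examine additional properties, such as agreement between modes at the item level.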

An ePRO Consensus Development Working Group has been convened to review the state of the science and develop recommendations for ensuring that PRO data quality is maintained when researchers adapt paper instruments to an electronic medium. Until those recommendations are issued, at a minimum, researchers should perform cognitive testing. A small validation study may facilitate acceptance by the FDA, but at the present time, we cannot make firm recommendations about how investigators should proceed as they move from paper to electronic media.

Translation and cultural adaptation.  When investigators need to modify an instrument developed in one language for other languages and/or cultures, the FDA guidance suggests that sponsors use “generally accepted” methods for translation and cultural adaptation and also that they provide evidence that the measurement properties of the translated and culturally adapted versions are comparable [1]. Although the general consensus seems to be that a rigorous translation and cultural adaptation process is important, more than one method can be applied to accomplish this step. Moreover, the incremental value of some steps in the process continues to be debated. Ideally, the psychometric properties of each translation and cultural adaptation would be established before sponsors use them in phase III trials. Nevertheless, because phase III trials are typically much larger than phase II trials, they often include countries that had not been included in phase II. In this case, if the translations and cultural adaptations are not already available, a separate study would need to be conducted to confirm the psychometric properties of those versions, which is a significant undertaking. Moreover, if enrollment to a phase III study is not proceeding as planned, additional countries may be required to meet recruitment targets. If the sponsor's goal is to obtain evidence for a treatment effect, then as explained above, translations and cultural adaptations that fail to replicate the psychometric properties of the original instrument may make detecting an effect more difficult; they are very unlikely to confer an unfair advantage.

Application to a different patient population.  A related question is what amount of revalidation, if any, is needed when an existing instrument is used without revision in a patient population different from that for which it had been originally developed and validated. It is important to step back and reflect on the assumption that a well-studied and validated instrument must be revalidated in the target trial population. The position supporting the need to revalidate an instrument in new samples is grounded in classical test theory, in which estimates of reliability and validity are specific to the sample studied. Cases exist in which special considerations should be made for new applications of validated tools; however, a good generic instrument measuring a common concept is likely to be valid when applied to people across a wide variety of diseases. When a generic measure has previously been demonstrated as valid and reliable across diverse patient populations, extensive revalidation in a new patient population should not be required. Although equivalent measurement across groups is not guaranteed, it can be tested a posteriori. These possibilities should be considered carefully when reviewing a proposed measurement strategy, and a principle of reasonable pragmatism should be included in the deliberation.

Applying a disease-specific instrument to a population other than the one in which it was originally developed and validated may not require any modification or may require only substituting the new disease state for the original one in the question stem. If no significant content changes are needed, and the target population is substantially similar to the population in which the instrument was developed and validated, additional validation work may be unnecessary. In this case, investigators should describe in detail the similarities between the two populations. In other cases, the content may need to be modified to make it more appropriate to the target population. When changes are made to tailor an existing questionnaire to a new patient population, the level of revalidation required will depend on the significance of the modifications made. For example, in the evaluation of abatacept for RA, the research team developed an Activity Limitation Questionnaire (ALQ) to assess the number of days that patients were unable to perform usual activities because of this condition. Because many RA patients are female, of middle age (mean age, 55 years), and unemployed, or because in international trial settings many countries may have high unemployment rates, work-specific questionnaires may not be appropriate. Thus, for this particular study, the work-related questions in the ALQ were modified to use a more general definition of usual activities that included work, whether or not for pay, and any other activities the patient does during the day. Because this modification was relatively minor, the revalidation required was minimal.

Finally, disease-specific questionnaires developed and validated with a sample that is representative of the patient population will generally require no further revalidation for use in subgroups delineated either by demographic variables (e.g., race, sex) or by symptom type or severity. Conversely, a questionnaire developed specifically for a particular disease or demographic subgroup cannot be assumed to retain its psychometric properties when applied to a broader patient population.


Conclusions

Using PRO data to support a labeling claim calls for more than just instrument selection. A well-defined measurement strategy should be devised with careful attention paid to identification of appropriate domains to be measured, instrument(s) to utilize, and, if necessary, revalidation requirements. The FDA requirements for use of PRO data in labeling claims should be no more or less stringent than those used for other clinical end points. As long as the PRO measure represents a valid concept that can be operationalized and tested, conforms to the prespecified claim structure, is supported by evidence from a prespecified statistical plan, and is reported with fair balance, PROs should be eligible to support claims of patient benefit.

Looking to the future, the efficiency with which PRO research is conducted could be improved by further efforts to address the need for revalidation. Steps to accomplish this include:

  • Identifying empirically the boundaries within which changes can be made without substantially altering an instrument's measurement properties. There is a wealth of experts from whom to draw consensus on this issue.
  • Describing a set of principles to use in determining when modifications can be expected to alter measurement properties in such a way that the legitimacy of treatment comparisons is questionable.
  • Conducting simulations to illustrate how changes in the psychometric properties of instruments would affect conclusions drawn from a trial.

At the present time, the basic principle of using the best available measurement of an important concept should prevail in any given decision.

Source of financial support: Funding for the meeting was provided by the Mayo Foundation in the form of unrestricted educational grants; North Central Cancer Treatment Group (NCCTG) (CA25224-27) and Mayo Comprehensive Cancer Center grants (CA15083-32).