Introduction to special issue: Patient outcomes in rheumatology, 2011


  • Patricia P. Katz, corresponding author
    The Arthritis Research Group and the Rosalind Russell Arthritis Center, Department of Medicine, University of California, San Francisco; 3333 California Street, Suite 270, San Francisco, CA 94143-0920

Rheumatology researchers have historically been in the forefront of patient outcomes assessment. In 2003, the Association of Rheumatology Health Professionals (ARHP), in collaboration with the American College of Rheumatology (ACR), identified a need for a collected resource of patient outcomes measures relevant to rheumatology and rheumatology health professionals. Many instruments had been published in different journals, but a single source of up-to-date information about instruments that were particularly useful for rheumatologists and rheumatology health professionals was lacking. To address that need, a special issue of Arthritis Care & Research focusing on reviews of patient outcomes measures in rheumatology was published in 2003.

The selection of potential outcomes measures relevant to rheumatology has continued to expand since the publication of the original special issue. A great deal of work has been done to revise and refine outcomes measures, as well as develop new measures; other measures have fallen out of favor. Most importantly, patient-reported outcomes are now playing a much larger role in research, clinical care, and clinical trials. To respond to these trends, this special issue of Arthritis Care & Research is devoted to updating and expanding the 2003 reviews, again with the goal of providing a single-source reference for rheumatology.

This 2011 special issue contains 35 reviews, covering over 250 measures – twice as many as were covered in the 2003 issue. The 35 reviews cover measures in 4 primary domains: Pathology and Symptoms, Function, Health Status and Quality of Life, and Psychological. All manuscripts were peer-reviewed.

Use of Reviews

In order to make this issue as “user-friendly” as possible, all summaries are presented in a standardized format. The summaries include information regarding how to obtain, administer, and score the instrument, and relevant psychometric properties. A glossary of terms used in the summaries is shown in Table 1.

Table 1. Definition of terms in summary format (1, 2)

Reliability: In general, assessments of reliability indicate the consistency of responses within a scale or the extent to which a response or score is free from random error (precision). Common measures of reliability are measures of stability (the degree to which a measure yields the same result on different occasions), such as test–retest reliability and intrarater/interrater reliability. Reliability defines the upper bounds of validity: a validity correlation cannot be greater than the square root of the reliability of the measure. The standard error of measurement provides an estimate of the error in a score.

  Internal consistency: To what extent do the items of a scale measure the same underlying construct or theme? Internal consistency is usually reported as Cronbach's coefficient alpha (α), which can range from 0 to 1.0. Theoretically, the internal consistency of a measure influences the stability of measurements over time. Various standards for internal consistency have been suggested. Nunnally and Bernstein suggest that α ≥ 0.80 is sufficient for tests used to compare groups, and that α = 0.70 is a reasonable standard in the initial stages of instrument development (3). For measures used to make decisions about individuals, however, α = 0.90 is suggested as a minimum, with 0.95 desirable.

  Test–retest/stability: How stable are scores from the measure from one administration to another? If a measure is intended to measure a trait, condition, or construct that changes, the measure may have low test–retest reliability due to changes in the criterion variable.

  Intrarater/interrater: How likely is someone administering a scale to make the same ratings or judgments on repeated administrations? How likely are 2 raters to make the same ratings or judgments about the same individual? Kappa coefficients are typically used to express agreement between 2 raters on dichotomous assessments, correcting for the amount of agreement that would be expected by chance. Kappa coefficients range from 0 to 1.0, but the ceiling is influenced by the sensitivity and specificity of the measure. Intraclass correlation coefficients (ICCs) may be used to express agreement between raters on continuous measures, in a manner similar to Pearson correlations. However, the ICC is preferable to a Pearson correlation because it measures the average similarity of subjects' scores on 2 ratings rather than just the similarity of the subjects' relative standing.

Validity: Traditionally, validity has been considered to be an expression of the extent to which a question or measure assesses what it is intended to measure. McDowell suggests that validity “describes the range of interpretations that can be appropriately placed on a measurement score” (1). There are a number of different measures of validity, and some may be more appropriate than others for specific instruments. There are 2 situations in which validity can be assessed: 1) other scales or measures of the same or similar attributes are available, and 2) no other measure exists. The method of validation will vary between these 2 situations.

  Content validity: Does the instrument cover the concepts it is intended to measure? Although content validity can be assessed formally, this is more likely in educational settings (e.g., does a test match a curriculum?). Often referred to as face validity.

  Criterion validity (appropriate if other measures are available): Does the measure correlate with the gold standard measure of the concept? Assessment of criterion validity requires a criterion (gold standard) against which a new measure can be compared; e.g., scores on a measure of depression might be compared to clinical diagnoses of depression. The sensitivity, or true positive probability, of a measure refers to the proportion of persons who are correctly classified as having the condition (e.g., depression). The specificity, or true negative probability, refers to the proportion of persons who do not have the condition who are correctly classified by the measure. Receiver operating characteristic (ROC) curves, in which the true positive rate (sensitivity) is plotted against the false positive rate (1 − specificity), may be used in setting scoring cutpoints to illustrate the tradeoff between sensitivity and specificity.

  Concurrent validity (a form of criterion validity): Does the measure correlate with the criterion measure given at the same time?

  Predictive validity (a form of criterion validity): Does the measure correlate with a criterion measure that occurs in the future?

  Construct validity (appropriate if no other measures are available): Does the measure correlate with measures of other variables in hypothesized ways?

  Known groups validity (a form of construct validity): Are scores of groups known to differ in the concept being measured actually different?

  Convergent validity (a form of construct validity): Do measures of the same concept/construct correlate with each other?

  Discriminant or divergent validity (a form of construct validity): Does a measure fail to correlate with measures that are intended to be different?

Responsiveness/sensitivity to change: Do scores on the instrument change to reflect changes in the criterion measure? There are a number of measures of sensitivity or responsiveness. One of the most commonly used is the standardized response mean (SRM), calculated as the mean change in score divided by the SD of change scores. Standardized effect sizes (ES) may also be used to describe how sensitive instruments are to detecting change. Minimum detectable change is an estimate of the minimum amount of change not due to variation in measurement or error. Information regarding minimum clinically important difference (MCID) or change may also be presented to assist users in determining what amount of difference/change in a score constitutes a meaningful difference/change. MCID may be calculated in a variety of ways (4).
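Several of the statistics defined in Table 1 follow directly from their textbook formulas. As an illustrative sketch (the example data and function names below are hypothetical, not drawn from any measure reviewed in this issue), Cronbach's alpha, Cohen's kappa for two dichotomous raters, and the standardized response mean could be computed as follows:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) score matrix:
    (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item across subjects
    total_var = items.sum(axis=1).var(ddof=1)  # variance of subjects' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def cohens_kappa(rater1, rater2) -> float:
    """Chance-corrected agreement between 2 raters on dichotomous (0/1) ratings."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    p_obs = np.mean(r1 == r2)                  # observed agreement
    p1, p2 = r1.mean(), r2.mean()              # each rater's proportion of "1" ratings
    p_exp = p1 * p2 + (1 - p1) * (1 - p2)      # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

def standardized_response_mean(baseline, followup) -> float:
    """SRM: mean change in score divided by the SD of change scores."""
    change = np.asarray(followup) - np.asarray(baseline)
    return change.mean() / change.std(ddof=1)

# Illustrative toy data (hypothetical, for demonstration only)
scores = np.array([[1, 1], [2, 2], [3, 3]])    # two perfectly consistent items
print(cronbach_alpha(scores))                              # → 1.0
print(cohens_kappa([1, 0, 1, 0], [1, 0, 1, 0]))            # perfect agreement → 1.0
print(standardized_response_mean([1, 2, 3], [3, 5, 4]))    # → 2.0
```

Note that these sample-variance conventions (denominator n − 1) match common statistical-software defaults; published instrument manuals should be consulted for the exact scoring rules of any specific measure.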

Readers are encouraged to use this volume as a starting point when selecting outcomes measures. Examination of the original or supplemental literature cited in these reviews may provide additional information useful in selection of measures or the interpretation of results, and the reader is strongly encouraged to refer to these resources. In addition, there are other volumes that contain reviews of some of the measures included here. For example, Measuring Health (1) is an extremely useful resource.


Dr. Katz drafted the article, revised it critically for important intellectual content, and approved the final version to be published.


This work has only been made possible because of the support and collaboration of many people, all of whom I would like to acknowledge and thank. First, the publication of this work was enthusiastically approved by the ACR Committee on Journal Publications. The publisher of Arthritis Care & Research, Wiley-Blackwell, recognized the potential importance of this topic and encouraged the development of the 2011 revision.

A steering committee was formed to guide development of the special issue, and included international representation of both rheumatologists and rheumatology health professionals. Steering committee responsibilities included developing the review format, identifying topics for review, and selecting authors of the reviews; in addition, steering committee members participated in the review process for each article. Steering committee members were Patricia Katz, PhD, Guest Editor and Chair (San Francisco), Hermine Brunner, MD (Cincinnati), Karen Costenbader, MD, MPH (Boston), Aileen Davis, PhD (Toronto), James Irrgang, PhD (Pittsburgh), Sarah Hewlett, PhD, RN (Bristol), Richard Osborne, PhD (Melbourne), and Fred Wolfe, MD (Wichita).

Authors for these manuscripts were chosen for their expertise in the various fields, and were active participants in selecting measures to cover as well as in preparing the evaluations of each measure. Each manuscript was reviewed by at least 2 reviewers, in addition to a member of the steering committee.

Finally, special thanks to Nancy Parker, Managing Editor of Arthritis Care & Research, who guided this issue to publication with her editorial expertise, creativity, and patience.