OA = osteoarthritis; P = pain subscale; S = stiffness subscale; PF = physical function subscale; Sy = symptoms subscale; A = activity limitations–daily living; SP = activity limitations–sport and recreation; Q = quality of life (hip or knee related); ? = no information found; NA = not available; see Table 1 for additional abbreviations.
Psychometric evaluation of osteoarthritis questionnaires: A systematic review of the literature
Version of Record online: 31 MAY 2006
Copyright © 2006 by the American College of Rheumatology
Arthritis Care & Research
Volume 55, Issue 3, pages 480–492, 15 June 2006
How to Cite
Veenhof, C., Bijlsma, J. W. J., van den Ende, C. H. M., Dijk, G. M. v., Pisters, M. F. and Dekker, J. (2006), Psychometric evaluation of osteoarthritis questionnaires: A systematic review of the literature. Arthritis & Rheumatism, 55: 480–492. doi: 10.1002/art.22001
- Issue online: 31 MAY 2006
- Manuscript Accepted: 31 AUG 2005
- Manuscript Received: 15 DEC 2004
- Health Care Insurance Board
Both in clinical practice and in research on patients with osteoarthritis (OA), outcome is evaluated using many different instruments. The Outcome Measures in Rheumatology Clinical Trials (OMERACT) group defined a core set of outcome dimensions for clinical studies in hip and knee OA, which are pain, physical function (the performance of daily activities), and patient global assessment (1, 2). These are in line with the recommendations of several guidelines for outcome measurement in OA trials (European League Against Rheumatism [EULAR], Food and Drug Administration [FDA], and Slow-acting Drugs in Osteoarthritis [SADOA]). However, these guidelines differ in their recommendations of specific instruments, or do not include recommendations of instruments at all (1, 5).
Nowadays, a large number of instruments are available to assess the outcome dimensions of the OMERACT. The issue arises of which instruments are most appropriate to use. The selection of an instrument should depend on the instrument's psychometric qualities (namely, reproducibility, validity, and responsiveness) and on practical considerations (for example, time to complete, ease of scoring, and mode of administration).
Because the majority of the instruments developed for patients with OA are questionnaires, our focus in this article is on questionnaires. Several reviews of OA questionnaires have been published, and recently, a special issue on outcome measurements was published by Arthritis Care & Research (1, 6–8). Sun et al concluded that both the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) and Lequesne Index are recommended as primary measures in treatment studies (7). However, none of these reviews give a complete systematic overview of all available instruments for patients with OA of the hip and/or knee. Before a specific core set of questionnaires can be recommended, a systematic comparison of the descriptive and psychometric qualities of the instruments is required. Recently, systematic reviews of measurement instruments for specific populations have been conducted (9, 10).
The objective of this article was to give an overview of published self-assessment (self-administered and interview-based) instruments on pain, physical function, and patient global assessment for patients with hip and/or knee OA. To evaluate the selected questionnaires, data on the descriptive and psychometric qualities of the instruments were systematically collected and rated using a checklist (9, 11). This overview will facilitate the choice of the most appropriate questionnaires to measure the OMERACT outcome dimensions in patients with OA of the hip and/or knee.
Materials and Methods
An extensive search was conducted in the Medline (1966 through May 2004), CINAHL (1982 through May 2004), and Embase (1988 through May 2004) databases. The broad computerized search strategy combined 4 components: a search strategy for OA of the hip/knee; a search strategy for outcome assessment; a search strategy for the outcome dimensions pain, physical function, and patient global assessment; and a search strategy for psychometric qualities. Furthermore, references of the retrieved articles were screened for relevant articles.
Inclusion of articles was based on the title and abstract, and was decided by 2 independent reviewers (CV and CHME). In case of uncertainty, the full article was read by 2 independent reviewers (CV and MFP). If necessary, disagreements were resolved by a third reviewer (CHME). Inclusion criteria at the level of patients were as follows: patients with OA of the hip and/or knee; in case of surgical interventions, data were only included when collected before surgical interventions (e.g., total knee or total hip replacement) or before other invasive interventions, because patients after surgery were considered a different patient population. At the level of instruments, inclusion criteria were as follows: self-assessment (self-reported or interview-based) questionnaires; questionnaires that contained ≥1 separate dimensions of either pain, physical function, or global perceived effect; both condition-specific and generic questionnaires were included; in case of different language versions of the same questionnaire, only the English version or the native version was included. Finally, at the level of the performed studies, inclusion criteria were as follows: the main focus of the article was the development, construction, or psychometric evaluation of the instrument (only psychometric evaluations using classical test theory were included, evaluations based on item response theory [IRT] were excluded; no checklist is currently available to rate psychometric evaluations based on IRT); data of patients with hip and/or knee OA were published separately in case of mixed populations (e.g., patients with rheumatoid arthritis and patients with OA); results had been published in English as a full report.
Data extraction and quality assessment.
A checklist of specific criteria for quality assessment of instruments was used, consisting of a section with descriptive aspects of the instrument and a section with specific psychometric criteria (Appendix A). The checklist was developed by Bot et al (11) based on the work by Lohr et al (12) and the checklist developed by Bombardier and Tugwell (13). This list of criteria has already been used in several systematic reviews (9, 10). All qualities were rated as either positive, doubtful, or negative. In case no or insufficient information was available on an aspect, no rating was given. The psychometric qualities of each study were independently assessed by 2 reviewers (CV and GMD). Disagreements between the reviewers were resolved by discussion. Because the present review focused on pain, physical function, and patient global assessment, only information on these outcome dimensions was rated in case other dimensions were also included in the instrument (e.g., quality of life, mental functioning). When ≥2 studies were performed on the same psychometric qualities of the same instrument involving the same population (e.g., hip/knee OA or outpatients/inpatients), the highest rating was taken.
Characteristics of the instruments.
The descriptive data provide information about the target population, scales, and format of the instruments. Extracted data included target population, domains to which the scales could be classified (pain, physical functioning, emotional functioning, social functioning, general health, and quality of life), number of scales, number of items, response options, range of score, mode of administration (self administered or interview based), ease of scoring method, and time needed to complete the questionnaire. Three types of self-reported instruments were distinguished: generic scales, which are designed for various populations of patients; condition-specific questionnaires, designed for a specific group of patients; and patient-specific questionnaires, designed for use with individual patients (14).
Data on the characteristics of the study group (diagnosis and clinical features) were reported to reflect for which population the psychometric qualities (validity, reproducibility, responsiveness, and interpretability) were assessed.
Validity is the degree to which an instrument measures the construct it is intended to measure (12). The content validity, internal consistency, and construct validity of the instruments were evaluated. Content validity examines the extent to which the items adequately represent all significant aspects of the construct being measured (15). The rating of content validity was positive if patient consultation was combined with either expert consultation or examination of literature during item selection in the construction phase of the instrument, and was doubtful if only patients were consulted during item selection. The internal consistency, determined by calculating Cronbach's alpha, indicates the homogeneity of the items in a (sub)scale. To determine which selected items cluster together around one aspect (and thus form a separate [sub]scale), factor analysis has to be performed. Questionnaires were rated positive if factor analysis was conducted and Cronbach's alpha for each separate dimension was >0.70 (16). Construct validity refers to the extent to which scores on a particular instrument relate to other assessment tools in a manner that is consistent with theoretically derived hypotheses (17). A positive rating was achieved when hypotheses about the magnitude and direction of relationships of the questionnaire (sub)scales with reference instruments were specified, and when >75% of these hypotheses could be confirmed. If available, descriptive data on the distribution of scores, including information about the presence of floor or ceiling effects, were extracted. Floor and ceiling effects were considered present if >15% of the respondents achieved the lowest or highest possible score, respectively (18).
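As an illustration of two of the criteria above, the following minimal Python sketch computes Cronbach's alpha for one (sub)scale and checks the 15% floor/ceiling cutoff. The function names and the example responses are hypothetical and are not taken from the reviewed studies; the alpha formula is the standard one based on item variances and the variance of the total score.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for one (sub)scale.

    responses: one row per respondent, one column per item.
    """
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def floor_ceiling(scores, lowest, highest, cutoff=0.15):
    """Return (floor_present, ceiling_present) using the >15% criterion."""
    n = len(scores)
    floor_share = sum(s == lowest for s in scores) / n
    ceiling_share = sum(s == highest for s in scores) / n
    return floor_share > cutoff, ceiling_share > cutoff

# Hypothetical responses of 5 respondents to a 3-item scale (items scored 0-4).
pain_items = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 4], [0, 1, 1]]
alpha = cronbach_alpha(pain_items)
meets_alpha_criterion = alpha > 0.70  # part of the rule for a positive rating

# Hypothetical total scores on a 0-10 scale: 2 of 10 respondents at the floor.
totals = [0, 0, 5, 10, 3, 4, 6, 7, 8, 9]
floor_present, ceiling_present = floor_ceiling(totals, lowest=0, highest=10)
```

Note that alpha alone was not sufficient for a positive rating in this review; factor analysis of the scale structure was also required.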
Reproducibility is the extent to which an instrument yields stable scores over time among respondents who are assumed not to have changed on the domains being assessed. Reproducibility was assessed by rating reliability and agreement. Reliability is the degree to which an instrument is free of measurement error. The intraclass correlation coefficient (ICC) for each (sub)scale is preferred to calculate reliability. The test–retest reliability and interobserver reliability were rated positive if the ICC was >0.70 and >0.60, respectively (12).
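To make the reliability criterion concrete, the sketch below computes an ICC from repeated measurements and applies the cutoffs given above. The text does not state which ICC form the included studies used; the two-way random-effects, absolute-agreement, single-measurement form (often written ICC(2,1)) is assumed here purely for illustration, and all names and data are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores: one row per respondent, one column per occasion (or observer).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

def meets_icc_cutoff(icc, kind):
    """Cutoffs from the checklist: >0.70 test-retest, >0.60 interobserver."""
    cutoff = {"test-retest": 0.70, "interobserver": 0.60}[kind]
    return icc > cutoff

# Hypothetical test and retest scores: every retest score is 1 point higher.
shifted = [[1, 2], [2, 3], [3, 4]]
icc_shifted = icc_2_1(shifted)
```

In this toy example the respondents keep exactly the same rank order, yet the constant 1-point shift between occasions lowers the absolute-agreement ICC below the 0.70 cutoff, which is one reason a separate measure of agreement is useful.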
To quantify measurement error and detect systematic differences between 2 measurements, a measure of agreement is calculated. The 95% limits of agreement according to Bland and Altman (19) and the standard error of measurement (SEM) or smallest detectable (real) difference (SDD) (20) were considered to be adequate measures for agreement. Because it was not possible to define adequate cutoff points for the agreement, agreement was rated as positive if one of the adequate measures was presented.
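The agreement measures named above can be sketched as follows. The formulas used are the commonly cited ones (SEM = SD × √(1 − ICC), SDD = 1.96 × √2 × SEM, and limits of agreement = mean difference ± 1.96 × SD of the differences); the included studies may have derived them differently, and the function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def limits_of_agreement(first, second):
    """95% limits of agreement (Bland and Altman) for paired measurements."""
    diffs = [b - a for a, b in zip(first, second)]
    d_mean, d_sd = mean(diffs), stdev(diffs)
    return d_mean - 1.96 * d_sd, d_mean + 1.96 * d_sd

def sem_from_icc(sd, icc):
    """Standard error of measurement from the score SD and the ICC."""
    return sd * sqrt(1 - icc)

def smallest_detectable_difference(sem):
    """Smallest change in an individual exceeding measurement error (95%)."""
    return 1.96 * sqrt(2) * sem

# Hypothetical test and retest scores of 4 respondents.
test_scores = [10, 12, 14, 16]
retest_scores = [11, 13, 13, 17]
lower, upper = limits_of_agreement(test_scores, retest_scores)

# Hypothetical subscale with SD = 10 and ICC = 0.84.
sem = sem_from_icc(sd=10, icc=0.84)
sdd = smallest_detectable_difference(sem)
```

A nonzero mean difference (here 0.5 points) points to a systematic difference between the 2 measurements, while the width of the limits reflects random measurement error.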
Responsiveness is the ability of an instrument to detect real or important change over time in the concept being measured (21). Predefined hypotheses about the relation of change in the instrument to corresponding changes in reference instruments have to be postulated. Responsiveness was rated positive if these hypotheses were presented and if >75% of these hypotheses could be confirmed.
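The hypothesis-based rule used for responsiveness (and, in the same form, for construct validity) can be written as a small check. This is only a sketch of the criterion as worded above: the function name is hypothetical, and how the checklist rates studies whose hypotheses are specified but not sufficiently confirmed is not covered here.

```python
def meets_hypothesis_criterion(confirmed, specified):
    """True if hypotheses were specified and more than 75% were confirmed."""
    if specified == 0:
        return False  # no predefined hypotheses: cannot be rated positive
    return confirmed / specified > 0.75

# Hypothetical studies.
four_of_five = meets_hypothesis_criterion(confirmed=4, specified=5)    # 80%
three_of_four = meets_hypothesis_criterion(confirmed=3, specified=4)   # exactly 75%
no_hypotheses = meets_hypothesis_criterion(confirmed=0, specified=0)
```

Because the cutoff is strict (>75%), a study confirming exactly 3 of 4 hypotheses would not meet it.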
Interpretability is defined as the degree to which (change) scores can be interpreted and a qualitative meaning can be assigned to quantitative scores (12). A minimum clinically important difference (MCID) should be defined to interpret change scores in the target population. Other information that improves interpretation of the scores includes, for instance, presentation of means and standard deviations of patients' scores before and after treatment, data on the distribution of scores in relevant subgroups, and relating changes in the instrument score to patients' global perceived change. A positive rating was achieved when at least 2 types of information were presented.
To obtain an overall score of the instruments, we counted the number of positive ratings for each instrument.
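The counting step can be sketched as below, using 3 rows transcribed from Table 3 (footnote symbols omitted). The sketch assumes that a quality counts once when it was rated positive (+) for at least one population; that rule is inferred from the totals reported in the table rather than stated explicitly, and the function name is hypothetical.

```python
def count_positive(cells):
    """Number of rated qualities with a positive rating in at least one population."""
    return sum("+" in cell for cell in cells)

# Non-empty cells per instrument, transcribed from Table 3.
table3 = {
    "Lequesne modified": ["+", "+", "+", "+, out knee/ in knee", "+, out knee/ in knee"],
    "HOOS": ["+", "ø", "+", "ø, out hip", "+, in hip", "+, in hip", "+, out hip"],
    "KOOS": ["+", "ø", "+", "ø, out knee", "+, out knee",
             "+, out knee/−, out knee", "ø, out knee", "+, out knee"],
}

overall = {name: count_positive(cells) for name, cells in table3.items()}
```

Under this assumption each of the 3 instruments obtains 5 positive ratings, matching the last column of Table 3.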
Selection of the studies.
The search identified a total of 1,930 publications. After screening titles and abstracts, 1,777 studies were excluded. Of the remaining 153 publications, 37 publications were included after reading the full article. Reasons for exclusion were data collection after operation or other invasive interventions (n = 54); no separate data presented for patients with hip or knee OA (n = 25); no psychometric evaluation of the instrument (n = 16); no self-assessment instrument (n = 9); no outcome dimensions of pain, physical function, or patient global assessment (n = 6); no English version of the instrument (n = 3); and no use of classical test theory (n = 3). A total of 32 questionnaires were included in the study, which were divided into 24 condition-specific instruments (22–55), 7 generic questionnaires (25, 28, 34, 42–44, 52, 55–58), and 1 patient-specific instrument (57, 58). The full names of the investigated questionnaires are presented in Table 1. The 24 condition-specific questionnaires included different versions of the same instrument. Five different versions of both the WOMAC and Lequesne Index were investigated, and 3 different versions of the Arthritis Impact Measurement Scales (AIMS) were included. Three versions of the WOMAC differed in response options (visual analog scale [VAS], Likert scale, and numeric scale), and 1 version identified the most important items specific for the individual patient (signal version). The last WOMAC version differed in number of items (modified WOMAC), which also applied to the Lequesne Index hip and knee (modified Lequesne Index) and the AIMS (AIMS2, AIMS2 Short Form [AIMS2-SF]). Finally, the Lequesne Index versions varied in the mode of administration (interview based or self reported).
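The selection numbers reported above can be cross-checked with a few lines of arithmetic. The sketch below only restates figures from the text; the variable names are hypothetical.

```python
identified = 1930
excluded_on_title_abstract = 1777
included = 37

# Reasons for exclusion after reading the full article, as listed in the text.
full_text_exclusions = {
    "data collected after surgery or other invasive intervention": 54,
    "no separate data for hip or knee OA": 25,
    "no psychometric evaluation": 16,
    "no self-assessment instrument": 9,
    "no pain, physical function, or global assessment dimension": 6,
    "no English version": 3,
    "no classical test theory": 3,
}

read_in_full = identified - excluded_on_title_abstract
excluded_on_full_text = sum(full_text_exclusions.values())

# Instruments by type: condition specific, generic, patient specific.
instruments = {"condition specific": 24, "generic": 7, "patient specific": 1}
```

The figures are internally consistent: 153 publications were read in full, the listed reasons account for 116 exclusions, and 153 − 116 = 37 included publications covering 32 questionnaires.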
Because not all descriptive information on a specific instrument was published in the article describing the psychometric study of patients with OA, original articles about the development of the instruments involving other patient populations were consulted (59–67).
|Abbreviation||Full name|
|WOMAC VA3.0||Western Ontario and McMaster Universities Osteoarthritis Index, Visual Analog Scale|
|WOMAC LK||Western Ontario and McMaster Universities Osteoarthritis Index, Likert Scale|
|Lequesne Index||Algofunctional indices for the hip and knee or index of severity for hip/knee disease|
|KOOS||Knee Injury and Osteoarthritis Outcome Score|
|HOOS||Hip Osteoarthritis Outcome Score|
|SMFA||Short Musculoskeletal Function Assessment Questionnaire|
|J-MAP||Joint-Specific Multidimensional Assessment of Pain|
|SF-36||MOS Short Form 36|
|HAQ||Health Assessment Questionnaire|
|AIMS||Arthritis Impact Measurement Scales|
|AIMS2-SF||Arthritis Impact Measurement Scales-Short Form|
|IRGL||Influence of Rheumatic Disease on General Health and Lifestyle|
|QR&S||Questionnaire Rising and Sitting down|
|ADL difficulty scale||Activities of Daily Living difficulty scale|
|ADL pain scale||Activities of Daily Living pain scale|
|VAS||Visual Analog Scale|
|NHP||Nottingham Health Profile|
|SIP||Sickness Impact Profile|
Description of questionnaires.
All included questionnaires and the descriptive items that were rated are presented in Table 2. Most questionnaires were developed to assess pain (n = 22) and/or physical function (n = 26) in separate (sub)scales. Only the Lequesne Index and the Patient Based Measure combine these 2 aspects in 1 single index.
|Questionnaire (references)||Target population†||Domains‡||No. of scales§||No. of items||No. of response options||Range of scores||Time to administer, minutes||Mode of administration|
|WOMAC VA3.0 (22,24,32,37,42,47)||Hip/knee OA||Pain, other symptoms, physical function||3||24||0–100||P: 0–500 S: 0–200 PF: 0–1,700||<10 (paper) 10–15 (computer)||Self administered on paper and computer|
|WOMAC LK (28,30,32,45,52)||Hip/knee OA||Pain, other symptoms, physical function||3||24||5||P: 0–20 S: 0–8 PF: 0–68||<10||Self administered on paper and telephone|
|WOMAC numeric scale (25, 27, 38, 44, 46)||Hip/knee OA||Pain, other symptoms, physical function||3||24||11||P: 0–50; S: 0–20 PF: 0–170||<10 (paper) 10–15 (computer)||Self administered on paper and computer|
|WOMAC signal (35)||Hip/knee OA||Pain, other symptoms, physical function||3||3||<10||Self administered|
|WOMAC VA modified (23)||Hip/knee OA||Pain, physical function||2||14||0–100||P: 0–500 PF: 0–900||6||Self administered|
|Lequesne Index knee (31,33,47)||Knee OA||Pain, physical function||1||11||2-6||0–24||3–4||Interview based|
|Lequesne Index hip (31,33)||Hip OA||Pain, physical function||1||11||2-6||0–24||3–4||Interview based|
|Lequesne-knee self-reported (38)||Knee OA||Pain, physical function||1||11||2-6||0–24||3–4||Self administered|
|Lequesne hip self-reported (38)||Hip OA||Pain, physical function||1||11||2-6||0–24||3–4||Self administered|
|Lequesne modified (51)||Hip/knee OA||Pain, physical function||1||10||2-6||0–23||3.25||Interview based|
|HOOS (50, 53)||Hip OA||Pain, physical function, quality of life||5||40||5||P: 0–40; Sy: 0–20; A: 0–68; SP: 0–16; Q: 0–16||7–10||Self administered|
|KOOS (39, 41, 54)||Knee OA||Pain, physical function, quality of life||5||42||5||P: 0–36; Sy: 0–28; A: 0–68; SP: 0–20; Q: 0–16||10||Self administered|
|A Patient-Based Measure (40)||Knee OA||Pain, physical function||1||12||2–6||0–100||?||Self administered|
|Knee Pain Scale (36)||Knee OA||Pain||4||6||5/6||1–5 or 1–6||?||Self administered|
|SMFA (48)||Musculoskeletal extremity disorders||Physical function||2||46||5||PF: 34–170; bother:12–60||?||Self administered|
|J-MAP (49)||Patients with pain in any joint||Pain||2||9||Varies||0–100||?||Self administered|
|SF-36 disease specific for physical function and role limitations (55)||Every specific disease||Physical function||2||14||2 or 3||0–100||<10||Self administered|
|HAQ (34, 52)||Arthritic conditions||Physical function||8||20||2 or 4||0–3||5–8||Self administered|
|AIMS (34)||Arthritic conditions||Pain, physical function, mental function, social function||9||52||Varies||0–10||15–20||Self administered|
|AIMS2 (29)||Arthritic conditions||Pain, physical function, mental function, social function||12||78||Varies||0–10||23||Self administered|
|AIMS2–SF (26)||Arthritic conditions||Pain, physical function, mental function, social function||5||23||5||0–10||?||Self administered|
|IRGL (43)||Rheumatoid arthritis||Physical function, mental function, social function||11||68||4||Varies||20||Self administered|
|ADL difficulty scale (57)||Arthritic conditions||Physical function||1||8||4||1–4||5||Self administered|
|ADL pain scale (57)||Arthritic conditions||Pain||1||8||4||1–4||5||Self administered|
|Patient global assessment (57, 58)||General population||Physical function||NA||NA||Varies||Varies||?||Self administered|
|Single question pain (VAS) (57, 58)||General population||Pain||NA||NA||0–100||0–100||?||Self administered|
|Single question pain (Likert) (58)||General population||Pain||NA||NA||5||0–4||?||Self administered|
|NHP (43, 56)||General population||Physical function, mental function, social function||6||38||2||0–100||<10||Self administered|
|SF–36 (25,28,42,44,52,55)||General population||Pain, physical function, mental function, social function, general health||8||36||Varies||0–100||10||Self administered|
|SIP (34)||General population||Physical function, mental function, social function||12||136||2||0–100||20–30||Self administered|
|QR&S (43)||General population||Physical function||2||32||2||0–10||?||Self administered|
Some questionnaires have separate versions for hip OA and knee OA, such as the Hip Osteoarthritis Outcome Score (HOOS) and Knee Injury and Osteoarthritis Outcome Score (KOOS), and Lequesne Index Hip and Lequesne Index Knee. With the exception of the Patient Based Measure and the Knee Pain Scale, which were developed specifically for patients with knee OA, all other questionnaires can be used for both hip OA and knee OA. Of the condition-specific questionnaires, the AIMS2 had the largest number of items (n = 78), followed by the Influence of Rheumatic Disease on General Health and Lifestyle questionnaire (n = 68), AIMS (n = 52), and Short Musculoskeletal Function Assessment Questionnaire (SMFA; n = 46), whereas the Knee Pain Scale (n = 6) and Joint-Specific Multidimensional Assessment of Pain (J-MAP; n = 9) had the smallest number of items. Within the generic questionnaires, the number of items varied even more because the single pain questions (VAS and Likert) consisted of only 1 item and the Sickness Impact Profile (SIP) consisted of 136 items. The majority of questionnaires can be completed within 10 minutes.
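The score ranges listed for the 3 full WOMAC versions in Table 2 follow directly from the number of items per subscale and the response format. The sketch below reproduces them under the assumption that the 24 items are split into 5 pain, 2 stiffness, and 17 physical function items and that subscale scores are simple sums; this split is the usual WOMAC layout but is not stated in the table itself, and the names are hypothetical.

```python
# Assumed item counts per subscale (P = pain, S = stiffness, PF = physical function).
WOMAC_ITEMS = {"P": 5, "S": 2, "PF": 17}

# Maximum score per item for each response format in Table 2.
ITEM_MAX = {
    "VAS": 100,     # visual analog scale, 0-100
    "Likert": 4,    # 5 response options, scored 0-4
    "numeric": 10,  # 11 response options, scored 0-10
}

def subscale_ranges(response_format):
    """Summed score range per subscale for a given response format."""
    item_max = ITEM_MAX[response_format]
    return {name: (0, n_items * item_max) for name, n_items in WOMAC_ITEMS.items()}

vas_ranges = subscale_ranges("VAS")
likert_ranges = subscale_ranges("Likert")
numeric_ranges = subscale_ranges("numeric")
```

This also shows why raw subscale scores from different WOMAC versions cannot be compared without rescaling.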
Most studies involved patients with knee OA (n = 18) or both knee OA and hip OA (n = 16); only 3 studies included patients with only hip OA. Concerning the setting of the studies, 25 studies included patients from an outpatient setting (e.g., through general practitioner, hospital mailing), 8 studies included inpatients (mostly from a hospital or rehabilitation center), and 4 studies included both outpatients and inpatients.
The rating of the psychometric qualities of the hip/knee OA questionnaires is presented in Table 3, summarizing each aspect as good, doubtful, or poor quality. An empty spot indicates no or insufficient information about an aspect. Because most results of psychometric qualities are dependent on the population studied, the type of population is presented in the table (a distinction is made between outpatients and inpatients, and between hip OA and knee OA). None of the questionnaires in this review have been adequately tested on all psychometric qualities of the checklist in patients with hip and/or knee OA.
|Questionnaire (references)||Time to administer||Ease of scoring||Readability and comprehension||Content validity||Internal consistency||Construct validity||Floor/ceiling effect||Reliability||Agreement||Responsiveness||Interpretability||MCID||Positively rated qualities, no.|
|WOMAC VA3.0 (22,24,32,37,42,47)||+†/−‡||ø||+||+||ø, out||+, out knee/ in knee ø, out hip/ knee hip||+, out knee§||ø, out/ in knee||+, out||ø, out/in||+, out||+, out||8|
|WOMAC LK (28, 30, 32, 45, 52)||+||+||+||−, in/ø, out||ø, out/in||+, out knee/ in knee||ø, out||+, out/in||ø, out/in||ø, out knee/ in knee||5|
|WOMAC numeric scale (25,27,38,44,46)||+†/−‡||+||+¶||ø, out||ø, out||+, in§||ø, out||+, in||ø, out/in||+, in||+, in||7|
|WOMAC signal (35)||ø||ø, out knee||ø, out knee||0|
|WOMAC VA3.0 modified (23)||+||ø||+||+¶||+, out knee/ in knee||+, out knee/ in knee||5|
|Lequesne Index knee (31, 33, 47)||+||+||+||ø, out knee||ø, out knee||3|
|Lequesne Index hip (31, 33)||+||+||+||ø, out hip||ø, out hip||3|
|Lequesne–knee self-reported (38)||+||+||ø, out knee||ø, out knee||ø, out knee||2|
|Lequesne hip self-reported (38)||+||+||ø, out||ø, out hip||ø, out hip||2|
|Lequesne modified (51)||+||+||+||+, out knee/ in knee||+, out knee/ in knee||5|
|HOOS (50, 53)||+||ø||+||ø, out hip||+, in hip||+, in hip||+, out hip||5|
|KOOS (39, 41, 54)||+||ø||+||ø, out knee||+, out knee||+, out knee/−, out knee#||ø, out knee||+, out knee||5|
|A Patient-Based Measure (40)||−||−||+, out knee||ø, out knee||+, out knee||2|
|Knee Pain Scale (36)||ø||+, out knee||+, out knee||ø, out knee||ø, out knee||2|
|SMFA (48)||ø||+||+||ø, in knee||+, out||3|
|J-MAP (49)||−||+||+, in knee||ø, in knee||2|
|SF-36 disease specific for physical function and role limitations (55)||+||ø, out knee||+, out knee||2|
|HAQ (34, 52)||+||ø||+||+||ø,out knee/ in knee||ø, out knee/ in knee||+, out knee/ in knee**||ø, out knee/ in knee||ø, out knee/ in knee||4|
|AIMS (34)||−||−||+||+||ø, out||ø, out||ø, out||2|
|AIMS2 (29)||−||−||+||ø||ø, out||ø, out||2|
|AIMS2-SF (26)||−||ø||+, out||+||+, out||3|
|IRGL (43)||−||+||ø, out††||ø, out††||1|
|ADL difficulty scale (57)||ø||ø, out knee||ø, out knee||ø, out knee||0|
|ADL pain scale (57)||ø||ø, out knee||ø, out knee||ø, out knee||0|
|Patient global assessment (57, 58)||+||+||−, out knee||ø, out knee||+, out knee||3|
|Single question pain (VAS) (57, 58)||+||ø||ø, out knee||ø, out knee||+, out knee||2|
|Single question pain (Likert) (58)||+||+||ø, out knee||ø, out knee||+, out knee||3|
|NHP (43, 56)||+||−||+||ø, out hip||ø, out††||ø, out††||2|
|SF-36 (25, 28, 42, 44, 52, 55)||+||−||ø, out knee||+, out knee/ ø, out hip||+, out knee‡‡/+, in§§||+, in||ø, out/in||+, out/in||+, in||6|
|SIP (34)||−||ø||+||ø, out||ø, out||ø, out||1|
|QR&S (43)||ø||+||ø, out||ø, out||1|
Almost all instruments were scored positively on content validity, meaning that patients and investigators or experts were involved during the development of the questionnaire. Only one instrument, the Patient Based Measure, was scored negatively on content validity because consultation of patients in the development of the questionnaire was not reported.
For a positive rating of the internal consistency, information was needed on the construct of the questionnaire (investigated by factor analysis) and on Cronbach's α of each (sub)scale. Information on both aspects was available for 6 of the 32 questionnaires, of which only 4 had a positive rating (Patient Based Measure, Knee Pain Scale, J-MAP, and AIMS2-SF). The other 2 questionnaires, WOMAC, Likert scale (WOMAC LK) and HOOS, were rated as doubtful. For the WOMAC LK the a priori dimensions could not be confirmed by factor analysis. The dimensions of the HOOS could be supported, with the exception of the subscale “activity limitations-daily living,” which loaded as 2 factors. Besides this, the activity subscale had a Cronbach's α >0.95.
The dimensionality of 3 other questionnaires was studied by factor analysis, but no information was available on Cronbach's alpha of the subscales. The construct of the WOMAC VAS (3 subscales: pain, stiffness, and physical function; WOMAC-VA3.0) and modified WOMAC (pain and physical function subscales) could not be confirmed. A 2-factor solution for the Lequesne Index was found, while the Lequesne Index claims to measure a single construct (68).
In 9 instruments, information on internal consistency was restricted to information on Cronbach's alpha only, which ranged from 0.70 to 0.96. The exceptions were the HAQ and the subscale “role limitations” of the Short Form 36 (SF-36), which had a Cronbach's alpha <0.70.
Only 7 of 26 studies (for knee OA: WOMAC VA3.0, modified WOMAC, modified Lequesne Index, KOOS, disease-specific SF-36, and SF-36; for hip OA: HOOS) that investigated construct validity presented hypotheses relating to the magnitude and direction of expected correlations with other instruments, which is a condition for a positive rating according to the criteria of the checklist. The correlations between most (subscales of) questionnaires measuring pain were moderate (r = 0.40–0.70). Concerning physical function, the Lequesne Index, the WOMAC physical function subscale, the SF-36 physical function subscale, and the SMFA had high correlations (r > 0.7) with each other. These results also apply to the HOOS, KOOS, and WOMAC because the physical function subscales of the HOOS and KOOS are equal to the WOMAC physical function subscale. Floor and/or ceiling effects were investigated for 10 questionnaires, mainly in outpatients with knee OA. No floor or ceiling effects were found, with the exception of some subscales (e.g., sports and recreation subscale of the KOOS, which showed a floor effect for outpatients with knee OA).
Information on test–retest reliability was found for 13 questionnaires. Because of low ICCs (<0.70), small sample size (<50), or the use of correlation measures other than the ICC, only 3 of the 13 questionnaires had a positive rating for reliability. The modified WOMAC and modified Lequesne Index appeared to be reliable questionnaires for patients with knee OA, whereas the HOOS was reliable for patients with hip OA (ICC between 0.78 and 0.95). Information on agreement was available for 4 instruments (WOMAC VA3.0, WOMAC LK, WOMAC numeric scale, and SF-36). In each case, either the SEM or the SDD was presented.
Responsiveness was investigated for 16 questionnaires. None of these studies presented hypotheses relating to the magnitude of change and/or relationships with change scores of other instruments. Therefore, all questionnaires were rated as doubtful on responsiveness. Responsiveness was quantified as either effect sizes, relative efficiency, or standardized mean scores. Change scores were also calculated, and correlations with change scores of other instruments were presented. Some studies compared the responsiveness of ≥2 questionnaires. In general, the WOMAC appeared to be more responsive than the SF-36 in both patients with hip OA and those with knee OA (25, 42, 44, 58). In patients with knee OA, the responsiveness of the SF-36 appeared to be comparable with that of the HAQ (52), just as the AIMS was as responsive as the SIP in patients with hip OA and knee OA (34).
Interpretability and MCID.
Eight questionnaires were rated positive on interpretability by presenting at least 2 of the 4 types of information. Only 1 study (that of the AIMS2-SF) intentionally paid attention to the interpretability of scores by comparing the scores on the AIMS2-SF in groups of patients that differed in duration of disease, number of comorbidities, and general health perception (26). The MCID was calculated for 2 questionnaires, the WOMAC numeric scale and the SF-36. For 13 questionnaires, means and standard deviations of baseline and followup scores or scores of relevant subgroups were presented.
After counting the total number of positive ratings for each instrument, the WOMAC VA3.0, WOMAC numeric scale, modified Lequesne Index, HOOS, and KOOS had the highest overall scores among the condition-specific instruments, with 8, 7, 5, 5, and 5 positive ratings, respectively. Concerning the generic questionnaires, the SF-36 obtained the highest overall score, with 6 positive ratings.
An extensive search strategy led to the identification of 32 self-assessment questionnaires for the evaluation of pain and physical functioning and patient global assessment in patients with OA of the hip and/or knee, for which descriptive and psychometric qualities had been investigated. Most questionnaires were condition specific (n = 24); the remainder were generic (n = 7) and patient specific (n = 1). Twenty-two instruments were developed to rate pain; physical function was rated by 26 instruments. Concerning patient global assessment, only 1 instrument was found. Most studies included patients with knee OA (n = 18) or both knee and hip OA (n = 16); only 3 studies included patients with only hip OA. Many psychometric qualities were not properly tested for a large number of questionnaires, and none of the questionnaires were rated positive on all aspects of the checklist.
Overall, the condition-specific instruments (WOMAC, VAS version; Lequesne Index hip/knee; and HOOS/KOOS) had the best ratings for their descriptive and psychometric qualities for both pain and physical function. The WOMAC has been the most extensively studied instrument and received the best ratings for its descriptive and psychometric qualities. One should keep in mind that some instruments (such as the HOOS and KOOS) have not been studied extensively or have only been studied in other populations; ratings of these instruments might improve when more studies have been conducted on their psychometric qualities. Concerning generic instruments, the SF-36 has been studied most often and demonstrated, overall, the highest ratings. The psychometric qualities of the patient-specific instrument on patient global assessment have been studied to a limited degree in patients with hip and/or knee OA. Therefore, only a small number of quality criteria could be rated. The same applied to the single pain questions (VAS and Likert scale), which were investigated in a small number of studies.
To compare the results of trials and optimize the transparency of care, a core set of questionnaires for patients with hip and/or knee OA seems to be indicated. For example, a core set of qualified questionnaires will facilitate the comparison and interpretation of the outcome of various treatment modalities in OA. At this time, guidelines for outcome dimensions in OA trials, such as the OMERACT, EULAR, FDA, and SADOA guidelines, differ in their recommendations of instruments or do not include recommendations at all (1). Our results suggest that, at this time, the most appropriate questionnaires to use in patients with hip and/or knee OA are the condition-specific questionnaire WOMAC and the generic questionnaire SF-36. Therefore, it is recommended that these questionnaires, complemented by a patient-specific instrument on patient global assessment, be included in guidelines as a core set of instruments for patients with OA of the hip and/or knee. However, more research is needed on the psychometric qualities of patient global assessment measures before the most appropriate instrument for patients with hip and/or knee OA can be chosen.
Nonetheless, which scale is most appropriate to use always depends on the particular purpose of the assessment. For example, for discriminative purposes, the instrument should have satisfactory ratings for reproducibility and agreement. The modified WOMAC, modified Lequesne Index, and HOOS were the only questionnaires with a positive rating for test–retest reliability. Alternatively, when the purpose is to evaluate changes over time, an instrument should have positive ratings for responsiveness and no floor or ceiling effects. Currently, all questionnaires received the same rating on responsiveness, namely doubtful. In general, the condition-specific instrument WOMAC appeared to be more responsive than the generic SF-36. Both in daily practice and in clinical trials, changes over time are frequently evaluated; therefore, responsiveness of the instrument is an important condition in selecting a questionnaire. The low ratings on responsiveness are remarkable, and more solid research on responsiveness is needed.
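Two of the responsiveness measures named in the checklist, the effect size and the standardized response mean, can be computed as sketched below. The definitions used here are the conventional ones (mean change divided by the SD of baseline scores, and mean change divided by the SD of change scores, respectively); the function names and the patient scores are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

def effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)

def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)

# Hypothetical pain scores (0-100, higher = worse) for five patients.
baseline = [60, 55, 70, 65, 50]
follow_up = [45, 50, 50, 55, 40]

es = effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
print(f"ES = {es:.2f}, SRM = {srm:.2f}")
```

Negative values here indicate a decrease in score, which on this assumed scale corresponds to improvement.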
The dimensionality of only 9 questionnaires was tested using factor analysis. When the dimensionality of a questionnaire has not been analyzed, the internal consistency as reflected by Cronbach's alpha might not be interpretable (69). The theoretical dimensional structure of only 4 questionnaires could be confirmed (namely, Patient Based Measure, Knee Pain Scale, J-MAP, and AIMS2-SF). The factor analysis of the other 5 instruments (WOMAC VA3.0, WOMAC LK, modified WOMAC, Lequesne Index, and HOOS) yielded either more or fewer dimensions than stated a priori; therefore, internal consistency was rated as doubtful. It should be noted that only studies based on classical test theory were included in the present review. Item response theory (IRT) also provides a model to evaluate health status questionnaires. In total, 3 studies on the WOMAC were excluded from this review because Rasch analyses were performed.
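For reference, Cronbach's alpha for a (sub)scale of k items is k/(k − 1) × (1 − sum of item variances / variance of the total score). The sketch below computes it for a single subscale; the function name and the response data are hypothetical, and the calculation presumes that the items form one dimension, which is the assumption that factor analysis is meant to support.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for one subscale.

    item_scores: list of respondents, each a list of k item scores.
    """
    k = len(item_scores[0])
    items = list(zip(*item_scores))  # one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(resp) for resp in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses of five patients to a three-item subscale.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
    [5, 4, 5],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}; exceeds 0.70: {alpha > 0.70}")
```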
Some limitations of this study have to be mentioned. First, some caution is recommended when generalizing the results. We excluded studies of patients after operations (e.g., total hip replacement and total knee replacement) or other invasive interventions. Furthermore, because it is uncertain whether psychometric qualities of translated versions can be generalized to the original version, we only included the English version (or, in the absence of an English version, the native version) of the questionnaires. In total, 20 studies (concerning the WOMAC [n = 12], KOOS [n = 2], AIMS [n = 2], Nottingham Health Profile [n = 2], Lequesne Index [n = 1], and VAS [n = 1]) were excluded because non-English versions were evaluated. The results of this review are therefore only applicable to the included populations and questionnaires. Second, the criteria we used to evaluate the quality of the instruments were helpful in providing information on the practical and psychometric properties to facilitate the choice between questionnaires. However, in our opinion, there is room for improvement of the checklist. For one, no instructions are given on how to determine the overall best instrument. We counted the number of positive ratings to make an overall judgment of the instruments, which implies that all qualities are equally important. In addition, the criteria for construct validity and responsiveness that require specific hypotheses to be postulated can be questioned. The absence of hypotheses in a publication might be due to a shortcoming of the author rather than a lower psychometric quality of the instrument. On the other hand, the need for clearly defined objective cutoff points to rate construct validity and responsiveness is high. Strikingly, the authors of all studies that investigated construct validity and responsiveness of questionnaires concluded that the questionnaires were valid and responsive instruments for patients with hip or knee OA.
The present criteria on postulating hypotheses are a first step towards clearly defining these objective cutoff points. Furthermore, as suggested by Bot et al (9), authors can contribute to a good rating of questionnaires by clearly presenting the results of the studies they performed. The checklist, as used in the present review, might be a good tool for authors to check whether their results are systematically and unambiguously presented.
In conclusion, although the final choice of a questionnaire depends on the purpose of the assessment, the WOMAC VA3.0 and SF-36 currently demonstrate the highest ratings overall for both descriptive and psychometric qualities. Therefore, these questionnaires are recommended for evaluating pain and physical function in patients with hip and/or knee OA. Complemented by a measure of patient global assessment, these instruments could be recommended in guidelines concerning outcome measurement in OA trials.
- 3. World Health Organization, Regional Office for Europe. Guidelines for the clinical investigation of drugs used in rheumatic diseases: European drug guidelines. Series 5. Copenhagen: European League Against Rheumatism; 1985.
- 6. Measures of adult lower extremity function: the American Academy of Orthopedic Surgeons Lower Limb Questionnaire, the Activities of Daily Living Scale of the Knee Outcome Survey (ADLS), Foot Function Index (FFI), Functional Assessment System (FAS), Harris Hip Score (HHS), Index of Severity for Hip Osteoarthritis (ISH), Index of Severity for Knee Osteoarthritis (ISK), Knee Injury and Osteoarthritis Outcome Score (KOOS), and Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC™). Arthritis Care Res 2003; 49 Suppl 5: S67–84.
- 11. Psychometric evaluation of self-report questionnaires: the development of a checklist. In: Ader HJ, Mellenbergh GJ, editors. Proceedings of the second workshop on research methodology 25–27 June 2003. Amsterdam: VU University Amsterdam; 2003. p. 161–8.
- 14. Measurement of functional status, progress, and outcome in orthopaedic clinical practice. Orthopaedic Practice 1999; 11: 14–21.
- 16. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.
- 25. Smallest detectable and minimal clinically important differences of rehabilitation intervention with their implications for required sample sizes using WOMAC and SF-36 quality of life measurement instruments in patients with osteoarthritis of the lower extremities. Arthritis Rheum 2001; 45: 384–91.
- 28. Test-retest reliability of lower extremity functional and self-reported measures in elderly with osteoarthritis. Adv Physiother 2003; 5: 155–60.
- 30. Exploring the factorial validity and clinical interpretability of the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC). Physiother Can 2003; 55: 160–8.
- 38. Comparison of the WOMAC (Western Ontario and McMaster Universities) osteoarthritis index and a self-report format of the self-administered Lequesne-Algofunctional index in patients with knee and hip osteoarthritis. Osteoarthritis Cartilage 1998; 6: 79–86.
- 42. Comparison of the responsiveness and relative effect size of the Western Ontario and McMaster Universities Osteoarthritis Index and the Short-Form Medical Outcomes Study Survey in a randomized, clinical trial of osteoarthritis patients. Arthritis Care Res 1999; 12: 172–9.
- 66. De ontwikkeling van de IRGL: een instrument om gezondheid te meten bij patiënten met reuma. Gedrag Gezond 1990; 18: 78–89.
- 67. Development of a self-report questionnaire to assess the impact of rheumatic diseases on health and lifestyle. J Rehab Sci 1990; 3: 65–70.
|Psychometric quality||Definition||Criteria used to rate the psychometric quality|
|Time to administer||Time needed to complete the questionnaire||Rating: [+] less than 10 minutes [−] more than 10 minutes [ ] no information found on time to administer|
|Ease of scoring||Ease of method used to calculate the questionnaire's score||Rating: [+] easy: summing up of the items [ø] moderate: visual analog scale (VAS) or simple formula [−] difficult: VAS in combination with formula or complex formula [ ] no information found on calculation of score|
|Readability and comprehension||The questionnaire is understandable for all patients||Rating: [+] readability tested; result was good [−] inadequate readability [ ] no information found on readability and comprehension|
|Content validity||The extent to which the domain of interest is comprehensively sampled by the items in the questionnaire||1) Patients were involved during item selection and/or item reduction. 2) Patients were consulted for reading and comprehension. Rating: [+] patients and investigator/expert involved [ø] patients only [−] no patient involvement [ ] no information found on content validity|
|Internal consistency||The extent to which items in a (sub)scale are intercorrelated; a measure of the homogeneity of a (sub)scale||1) Factor analysis was applied in order to provide empirical support for the dimensionality of the questionnaire. 2) Cronbach's alpha >0.70 for each dimension/subscale. Rating: [+] adequate design and method; factor analysis supporting the dimension; α > 0.70 [ø] doubtful method used or no factor analysis [−] inadequate internal consistency (α < 0.70) or dimensions not supported by factor analysis [ ] no information found on internal consistency|
|Construct validity||The extent to which scores on the questionnaire relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the domains that are measured||1) Hypotheses were formulated. 2) Results were acceptable in accordance with ≥75% of hypotheses. 3) An adequate measure was used. Rating: [+] adequate design, method, and result [ø] doubtful method used [−] adequate design and method and inadequate construct validity [ ] no information found on construct validity|
|Floor and ceiling effects||The questionnaire fails to demonstrate a worse score in patients who clinically deteriorated and an improved score in patients who clinically improved||1) Descriptive statistics of the distribution of scores were presented. 2) ≤15% of respondents achieved the highest or lowest possible score. Rating: [+] no floor/ceiling effects [−] >15% in extremities [ ] no information found on floor and ceiling effects|
|Test–retest reliability||The extent to which the same results are obtained on repeated administrations of the same questionnaire when no change in physical functioning has occurred||1) Calculation of an intraclass correlation coefficient (ICC); ICC > 0.70. 2) Time interval and confidence intervals (or n > 50) were presented. Rating: [+] adequate design, method, and ICC > 0.70 [ø] doubtful method [−] inadequate reliability, with adequate design and method [ ] no information found on test–retest reliability|
|Agreement||The ability to produce exactly the same scores with repeated measurements||1) For evaluative questionnaires, reliability agreement should be assessed. 2) Limits of agreement, Kappa, or standard error of measurement was presented. Rating: [+] adequate design, method, and result [ø] doubtful method used [−] inadequate agreement, with adequate design and method [ ] no information found on agreement|
|Responsiveness||The ability to detect change over time in the concept being measured||1) For evaluative questionnaires, responsiveness should be assessed. 2) Hypotheses were formulated and results were in agreement with ≥75% of hypotheses. 3) An adequate measure was used (effect size, standardized response mean, comparison with external standard). Rating: [+] adequate design, method, and result [ø] doubtful method used [−] inadequate responsiveness with adequate design, method [ ] no information found on responsiveness|
|Interpretability||The degree to which one can assign qualitative meaning to quantitative scores||Authors provided information on the interpretation of scores: 1) presentation of means and SD of scores before and after treatment; 2) comparative data on the distribution of scores in relevant subgroups; 3) information on the relationship of scores to well-known functional measures or clinical diagnosis; 4) information on the association between changes in score and patients' global ratings of the magnitude of change they experienced. Rating: [+] ≥2 of the above types of information were presented [ø] doubtful method used or doubtful description; 1 type of information was presented [ ] no information found on interpretability|
|Minimum clinically important difference (MCID)||The smallest difference in score in the domain of interest that patients perceive as beneficial and would mandate a change in a patient's treatment||Information is provided about what (difference in) score would be clinically meaningful. Rating: [+] MCID is presented [ ] no information found on MCID|
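As an illustration of how one of the criteria in this table could be applied to raw data, the sketch below checks the 15% threshold for floor and ceiling effects. It is a hedged example only: the function name, the score range, and the data are hypothetical, and the sketch assumes that the threshold is applied to each extreme separately, which the checklist wording does not state explicitly.

```python
def floor_ceiling(scores, min_score, max_score, threshold=0.15):
    """Return (floor fraction, ceiling fraction, rating).

    The rating is "+" when no more than `threshold` of respondents have
    the lowest possible score and no more than `threshold` have the
    highest possible score; otherwise it is "-".
    Assumption: each extreme is checked separately.
    """
    n = len(scores)
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    rating = "+" if floor <= threshold and ceiling <= threshold else "-"
    return floor, ceiling, rating

# Hypothetical subscale scores on a 0-100 scale for ten respondents.
scores = [0, 0, 0, 12, 25, 40, 55, 60, 80, 100]

result = floor_ceiling(scores, min_score=0, max_score=100)
print(result)
```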