Findings of extensive variation in the types of outcome measures used in hip and knee replacement clinical trials: A systematic review

Abstract

Objective

To describe the extent of variation in outcome measure usage in hip and knee replacement randomized trial literature, and to summarize this variation in the context of the International Classification of Functioning, Disability, and Health (ICF) conceptual model created by the World Health Organization (WHO).

Methods

We used a defined search strategy in Medline and EMBase databases to identify articles published from January 2000 to February 2007. Studies were reviewed if they were randomized trials with a ≥6-week followup and if they used noninvasive outcome measures of impaired joint function or whole-person limitations in daily activities or functional status. The WHO ICF model was used to categorize outcome measures.

Results

Of 972 studies, 160 were included for review. Of these, 82 were conducted on patients with hip replacements, 75 on patients with knee replacements, and 3 on patients with both. The most common outcome measure in knee trials was the American Knee Society score (used in 48% of reviewed studies), and in hip trials was the Harris hip score (52.4%). At least 20 different outcome measures were used in the hip trials, and at least 14 different measures were used in knee trials. The primary outcome was identified in only 24% of trials.

Conclusion

We found extensive variation in outcome measures across trials and saw inconsistency across the components of the WHO ICF model. To improve interpretability, future work should determine whether consensus can be developed for a standardized set of outcome measures for hip and knee replacement trials.

INTRODUCTION

Over the past few decades there has been a proliferation of measures designed to assess outcome following hip or knee replacement surgery. Those involved with the care of these patients have recognized the inaccuracies of single-item rating scales designed to rate outcome, for example, as excellent, good, fair, or poor. These types of scales have traditionally not been well defined and have combined multiple patient attributes, such as pain intensity, function, and range of motion (ROM), into a single categorical judgment (1, 2).

Researchers have developed a multitude of self-report or examiner-completed measures, in part to overcome problems associated with the single-item scales. We now have many measures that can be used specifically for patients following knee replacement (3), hip replacement (4), and revision surgery (5). In addition, we have measures that are generic in nature and can be used not only for patients with hip or knee arthritis but for essentially any type of disorder (6). Clinicians and researchers working with patients with arthritic knees or hips now have dozens of measures from which to choose. Included among the options are joint-specific measures (e.g., quadriceps strength) and performance measures (e.g., 6-minute walk test) (7).

The new outcome measures have moved the science of outcome assessment forward, but the proliferation comes with a cost. Randomized trials designed to answer similar research questions may use different outcome measures to assess the same construct. As a result, studies of similar treatments cannot be combined in a meaningful way in systematic reviews, or by clinicians reading the literature. Clinicians comparing multiple trials of a surgical approach for knee arthroplasty, for example, have to judge whether scores on the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) (8), the American Knee Society scale (AKS) (9), and the Oxford knee score (10) are equivalent across the trials. Confusion is compounded because some measures have superior measurement properties as compared with others.

A lack of consistency in the use of outcome measures is not unique to the hip and knee replacement literature. McNaughton-Collins and colleagues reported that the prostate cancer screening and treatment literature is almost impossible to summarize because of the heterogeneity in outcome measures (11). Similar problems with heterogeneity of outcome measures have been reported, for example, with stroke (12), traumatic brain injury (13), and low back pain (14).

The purpose of our systematic review was 2-fold: to describe the extent of variation in use of outcome measures in the hip and knee replacement literature, and to determine the extent to which the 3 main components in the International Classification of Functioning, Disability, and Health (ICF) model of disablement by the World Health Organization (WHO) are addressed in hip and knee randomized trials.

Because of the proliferation of self-report outcome measures, we hypothesized that trials in hip and knee replacement would demonstrate a large amount of variation in the numbers and types of outcome measures. We also hypothesized that there would be wide variation among trials in the main components of the ICF model that were examined.

MATERIALS AND METHODS

Our study is a systematic review designed to characterize the variation in outcome measures used in randomized trials of hip and knee replacement surgery. We also characterized variation in the context of the WHO ICF model of disablement. Because an extensive number of potential outcome measures are in use for hip and knee replacement, we believed it was important to use a conceptual framework to categorize these measures into distinct subgroups.

Conceptual framework for considering outcome measures.

In 2001, the WHO formally adopted the ICF as standard language for describing health and health-related states and conditions. The ICF is intended to be a multipurpose classification system for making clinical and policy decisions and to guide the description and interpretation of health on a worldwide scale (15).

The framework for the ICF model is illustrated in Figure 1. The term health condition refers to any disorder or disease for which a patient will seek medical care. In the context of joint replacement, health conditions refer, for example, to osteoarthritis (OA), the etiology of OA, and any complications that may arise from treatment of the disorder such as infection or loosening of the prosthesis. As can be seen by the arrows in Figure 1, the disorder can interact either directly or indirectly with all other components in the model.

Figure 1.

Illustration of the International Classification of Functioning, Disability, and Health model created by the World Health Organization. Arrows are used to indicate that there is a dynamic interaction among components in the model.

The body functions and structure component refers to the functioning and structural integrity of specific body systems. When deviations in functioning or structural integrity occur, impairments result. For patients with hip or knee replacement, these impairments can include impairments in system functions (e.g., reduced muscle power, pain, and reduced joint ROM) or impairment of structure (e.g., leg length discrepancy). Activity is the completion of a task or action by an individual. Tasks include walking, lifting, and throwing. If an individual had a reduced ability to walk, he or she would be described as having an activity limitation, a common finding in patients following hip or knee arthroplasty.

The participation component describes the person's involvement in everyday life activities. When considering the societal context in which a person functions, the term participation is appropriate for describing this type of activity. When a person's ability to participate in an everyday life activity is disrupted, the person is described as having a participation restriction. Restrictions are judged by comparison with a generally agreed upon population norm. For example, the generally agreed upon population norm is that a 65-year-old should be expected to be able to shop for food. If a person's ability to shop for food was compromised, then the person would be described as having a participation restriction. Activity and participation components are closely linked (e.g., walking might be an expected part of one's community life).

Two components, together termed contextual factors, impact directly on body structures and functions, activity, and participation components. Contextual factors are those factors that comprise the complete background of an individual's daily life. There are 2 types of contextual factors: environmental and personal. Environmental factors are all the variables external to the person that influence that person's functioning in daily life. These factors include all of the features of the physical world as well as policies, rules, attitudes, values, and systems. Personal factors are the make-up of the individual and his or her background. These include factors such as sex, race, lifestyle, and habits (16). Contextual factors interact with the other components in the model to either increase or decrease the likelihood of impairments in body structures or functions, activity limitations, or participation restrictions. For patients with hip and knee replacement, environmental factors could be the surgeon who conducts the surgery, the hospital in which the surgery is completed, or the costs associated with the procedure. A personal factor that may impact the outcome of patients with hip or knee replacement is the patient's level of satisfaction following surgery. The focus of the current study is on body structure and function, activity, and participation components of the ICF model.

Using the ICF model to conceptualize outcomes for hip and knee replacements has potential advantages. Surgical interventions, for example, might be expected to have the greatest impact on pain (body structure and function components) and functional status (activity and participation components), and physical therapy interventions might have preferential influences on strength, ROM (body structure and function components), and activity limitations. Other trials might be specifically designed to reduce the rate of deep vein thrombosis (DVT) occurrence (health condition component).

Many studies of the ICF model have been published (17, 18), and the ICF is rapidly achieving worldwide application in many areas of medicine (19–21). The ICF, therefore, seems ideally suited for categorizing outcomes of patients following hip or knee replacement surgery.

Literature search.

We conducted Medline and EMBase searches of literature published between January 1, 2000 and February 1, 2007. We chose this approach because the National Library of Medicine database is the most comprehensive in the world, EMBase covers >1,800 journals that are not included in Medline (22), and the timeframe captures only contemporary trials in knee and hip arthroplasty. Our search strategy was the following: (replacement OR arthroplasty) AND (hip OR knee) AND “randomized controlled trial.” All accepted studies were published in English and conducted on humans. We examined only randomized trials because these studies serve as the gold standard for judging treatment efficacy. All references obtained at this point were considered potentially relevant studies for the analysis.

Inclusionary and exclusionary criteria.

Studies were included in the final analysis if they were randomized trials with a ≥6-week followup. Studies with a shorter followup were excluded first because these studies typically examined only inpatient outcomes following surgery. Next, we excluded studies that measured outcome only at the level of pathology (e.g., DVT rate) or that measured only biomechanical changes such as prosthetic loosening. Studies that were conducted on patients other than those with hip or knee replacement surgery (e.g., hip fracture) were also excluded. To be included in the final analysis, studies had to have chosen noninvasive outcome measures of body structure or function (e.g., knee ROM), activity (e.g., walking ability), or participation (e.g., daily functioning). Figure 2 illustrates the process of identifying relevant reports and the numbers of reports excluded based on each of the criteria.
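
The screening sequence described above can be sketched in code. The following is a minimal illustration only, not the data extraction form used in the study; the `Report` class, its field names, and the reason strings are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    """Hypothetical summary of one randomized trial report (field names are assumptions)."""
    followup_weeks: float
    # True if the report used a noninvasive measure of body structure or
    # function, activity, or participation (not only pathology or
    # biomechanical outcomes such as DVT rate or prosthetic loosening).
    has_functional_outcome: bool
    # True if patients had hip or knee replacement (not, e.g., hip fracture).
    hip_or_knee_replacement: bool


def exclusion_reason(report: Report) -> Optional[str]:
    """Return the first exclusion criterion met, in the order given in the text, or None."""
    if report.followup_weeks < 6:
        return "followup shorter than 6 weeks"
    if not report.has_functional_outcome:
        return "only pathology-level or biomechanical outcomes"
    if not report.hip_or_knee_replacement:
        return "patients other than hip or knee replacement"
    return None  # report is eligible for the final analysis
```

A report with a 4-week followup would be excluded at the first step regardless of its outcome measures, which mirrors the order in which the criteria were applied.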

Figure 2.

Illustration of the flow of reports through the study. A total of 812 reports were excluded for the reasons indicated.

Data extraction and analysis.

A standardized data extraction form was used independently by 2 examiners (DLR and DHB) to identify studies that did and did not meet the requisite criteria. After the studies to be included in the final analysis were identified, a second standardized form was used to collect data. The following data were collected: whether the patients had hip arthroplasty, knee arthroplasty, or both; the overall purpose of the study; the specific outcome measures that were used; and the ICF categories that correspond to the outcome measures. In addition, we identified the primary outcome measure(s) reported in the studies. For an outcome measure to be considered as primary, the report either had to indicate that a power analysis was done using the outcome measure to estimate trial sample size, or to specifically state which of the outcome measures was primary.

The overall purpose of each study was categorized into 1 of 3 broad categories based on the type of interventions used: surgical studies, nonsurgical studies that examined a physical intervention such as exercise, or nonsurgical studies that examined a medical intervention such as medication.

The 3 ICF categories of body structure and function, activity, and participation were matched to each outcome measure reported for each study. For example, if knee strength was used as an outcome measure, the study was coded as using a body structure and function outcome. For studies that used measures that cover more than 1 ICF dimension, we used a hybrid category. For example, the Short Form 36 health survey (SF-36) self-report measure includes questions that pertain to all 3 dimensions of the ICF model. The 3 examiners (DLR, PWS, and DHB) each independently reviewed one-third of the reports that met the final eligibility criteria.
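
The coding rule can be illustrated with a short sketch. The mapping below contains only a few measures whose classification is stated in this report (knee strength and knee ROM as body structure and function, the 6-minute walk test as activity, and the SF-36 and WOMAC as hybrid); the dictionary, the function name, and the letter codes S, A, P, and H used as keys are assumptions made for the example and do not represent the extraction form itself.

```python
# Letter codes: S = body structure and function, A = activity,
# P = participation, H = hybrid (measure spans more than 1 ICF dimension).
ICF_CODE = {
    "knee strength": "S",
    "knee ROM": "S",
    "6-minute walk test": "A",
    "SF-36": "H",
    "WOMAC": "H",
}

# Fixed display order for combined codes, e.g. "S & A & H".
CODE_ORDER = "SAPH"


def study_pattern(measures):
    """Combine the ICF codes of all measures in one study into a single label.

    Raises KeyError for a measure that is not in ICF_CODE, so that an
    unclassified measure is never silently ignored.
    """
    codes = {ICF_CODE[m] for m in measures}
    return " & ".join(c for c in CODE_ORDER if c in codes)
```

Under these assumptions, a trial reporting knee strength and the SF-36 would be labeled "S & H", and a trial reporting only the 6-minute walk test would be labeled "A".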

Statistical analysis.

We used the kappa statistic and percentage agreement to describe the reliability of the 2 independent examiners' judgments of whether each publication that met the search criteria was appropriate for inclusion. For a subset of the trials that met the inclusionary criteria, we examined the reliability of judgments of ICF dimension(s) and the specific outcome measures that were used. Each of 3 examiners (DLR, PWS, and DHB) independently examined 15 randomly selected reports not previously evaluated by the examiner. The purpose was to determine the extent to which examiners agreed on judgments of study purpose, specific outcome measures used, and applicable ICF categories.
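
For readers less familiar with these indices, the following sketch shows one standard way to compute percentage agreement and Cohen's kappa from a square table of paired judgments. It is an illustration under the usual textbook definitions and is not the software used for the analysis; the function names are assumptions, and the 2 x 2 counts in the usage note are made up for demonstration and are not study data.

```python
def percent_agreement(n_agree, n_total):
    """Percentage of items on which the 2 examiners gave the same judgment."""
    return 100.0 * n_agree / n_total


def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] is the number of items that examiner 1 placed in category i
    and examiner 2 placed in category j. The result is undefined (division
    by zero) when chance-expected agreement equals 1.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of items on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance-expected agreement from the row and column marginals.
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_expected = sum(row_share[i] * col_share[i] for i in range(k))
    return (p_observed - p_expected) / (1 - p_expected)
```

As a usage example, `percent_agreement(888, 972)` reproduces agreement on 888 of 972 reports, and the made-up table `[[20, 5], [10, 15]]` gives observed agreement of 0.70, chance-expected agreement of 0.50, and kappa of 0.40.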

To describe the variation in ICF dimensions addressed and specific outcome measures used, the proportion of reports that used each outcome is reported. Data are stratified based on study purpose and whether the patients had hip or knee replacement. Data also are reported in aggregate.

RESULTS

We identified 972 studies based on our search strategy. The percentage agreement for judgments of study inclusion was 91.4% (888 of 972) with κ = 0.66 (SE 0.034). For the 84 reports in which the 2 examiners disagreed, a second review of the reports was completed to reach consensus. After this process, a total of 160 reports were judged to have met the criteria for inclusion in the study. A total of 82 of the studies were conducted on patients with hip replacements, and 75 on patients with knee replacements. Three studies admitted patients with hip or knee replacements.

Reliability was also examined for the judgments of specific outcome measures used and for the ICF dimensions that these measures addressed. The percentage agreement for the specific outcome measures (n = 109) for the subset of 45 studies was 83.5% (91 of 109). Kappa could not be calculated for the specific outcome measures because there were an excessive number of blank cells for many of them. For the ICF dimensions, κ = 0.85 (SE 0.063) with an 88.6% agreement.

Table 1 shows the specific outcome measures used in the hip studies and Table 2 summarizes outcome data for the knee studies. The WOMAC, perhaps the most frequently studied (23, 24) and highly cited (25) functional status outcome measure in the hip and knee replacement literature, was used in only 16% of knee trials and 13% of hip trials. The most commonly used outcome measure in knee trials was the AKS (used in 48% of the studies), and the most frequently used outcome measure in hip trials was the Harris hip score (HHS) (52.4%). Table 3 summarizes the ICF components assessed in the hip, knee, and combined trials. For hip studies, 52.4% used body structure or function measures, 34.1% used activity measures, and 82.9% used hybrid measures. Similar estimates were found for knee studies, with 46% using body structure or function measures, 14.7% using activity measures, and 80% using hybrid measures. Table 4 illustrates the more commonly used outcome measures and the ICF components addressed by each measure.
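
The component-level percentages for hip studies can be recovered by summing the combination counts in the hip column of Table 3. The sketch below shows that arithmetic; the counts are copied from Table 3, whereas the variable and function names are assumptions introduced for this illustration.

```python
# Hip-study counts from Table 3 (n = 82), keyed by the combination of
# measure types used. S = body structure and function, A = activity,
# P = participation, H = hybrid.
HIP_PATTERNS = {
    "S": 5,
    "A": 2,
    "H": 29,
    "S & A": 6,
    "S & H": 20,
    "A & H": 7,
    "A & P": 1,
    "S & A & H": 11,
    "S & A & P & H": 1,
}
N_HIP = 82


def share_using(component, patterns, n_studies):
    """Count studies whose pattern includes `component`; return (count, percentage)."""
    count = sum(
        n for pattern, n in patterns.items()
        if component in pattern.split(" & ")
    )
    return count, round(100.0 * count / n_studies, 1)


for code in ("S", "A", "H"):
    print(code, share_using(code, HIP_PATTERNS, N_HIP))
```

With these counts, 43 of 82 hip studies (52.4%) used a body structure or function measure, 28 (34.1%) used an activity measure, and 68 (82.9%) used a hybrid measure, matching the figures given in the text.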

Table 1. Outcome measures used in hip replacement trials (n = 82) based on study purpose*

| Measure | Surgical (n = 57) | Nonsurgical physical intervention (n = 21) | Nonsurgical medical intervention (n = 4) | Totals (n = 82) |
| --- | --- | --- | --- | --- |
| WOMAC | 7 (12.3) | 3 (14.3) | 1 (25.0) | 11 (13.4) |
| SF-36 | 4 (7.0) | 0 | 1 (25.0) | 5 (6.1) |
| Harris hip score | 35 (61.4) | 6 (28.6) | 2 (50.0) | 43 (52.4) |
| Charnley hip score | 1 (1.8) | 1 (4.8) | 0 | 2 (2.4) |
| EQ-5D index | 6 (10.5) | 0 | 0 | 6 (7.3) |
| MACTAR | 1 (1.8) | 0 | 0 | 1 (1.2) |
| Oxford hip score | 2 (3.5) | 1 (4.8) | 0 | 3 (3.7) |
| SF-12 | 3 (5.3) | 0 | 0 | 3 (3.7) |
| AIMS | 0 | 1 (4.8) | 0 | 1 (1.2) |
| Pain scale measure | 18 (31.6) | 5 (23.8) | 1 (25.0) | 24 (29.3) |
| Knee strength | 0 | 3 (14.3) | 0 | 3 (3.7) |
| Balance | 0 | 2 (9.5) | 0 | 2 (2.4) |
| Hip strength | 0 | 5 (23.8) | 0 | 5 (6.1) |
| Hip ROM | 1 (1.8) | 2 (9.5) | 2 (50.0) | 5 (6.1) |
| Proprioception | 1 (1.8) | 0 | 0 | 1 (1.2) |
| Timed Get Up & Go test | 0 | 1 (4.8) | 1 (25.0) | 2 (2.4) |
| 6-minute walk test | 2 (3.5) | 6 (28.6) | 0 | 8 (9.8) |
| Other timed walk test | 5 (8.8) | 7 (33.3) | 0 | 12 (14.6) |
| Merle d'Aubigné and Postel score | 8 (14.0) | 2 (9.5) | 0 | 10 (12.2) |
| Patient satisfaction | 1 (1.8) | 2 (9.5) | 1 (25.0) | 4 (4.9) |
| Other hip outcome measures | 20 | 14 | 1 | 35 |

* Values are the number (percentage). WOMAC = Western Ontario and McMaster Universities Osteoarthritis Index; SF-36 = Short Form 36 health survey; EQ-5D index = EuroQol 5-domain index; MACTAR = McMaster Toronto Arthritis patient preference disability questionnaire; SF-12 = Short Form 12 health survey; AIMS = Arthritis Impact Measurement Scales; ROM = range of motion.

Table 2. Outcome measures used in knee replacement trials based on study purpose*

| Measure | Surgical (n = 53) | Nonsurgical physical intervention (n = 16) | Nonsurgical medical intervention (n = 6) | Totals (n = 75) |
| --- | --- | --- | --- | --- |
| WOMAC | 6 (11.3) | 5 (31.3) | 1 (16.7) | 12 (16.0) |
| SF-36 | 3 (5.7) | 5 (31.3) | 1 (16.7) | 9 (12.0) |
| SF-12 | 2 (3.8) | 1 (6.3) | 0 | 3 (4.0) |
| American Knee Society score | 31 (58.5) | 3 (18.8) | 2 (33.3) | 36 (48.0) |
| HSS score | 20 (37.7) | 2 (12.5) | 1 (16.7) | 23 (30.7) |
| Oxford knee score | 2 (3.8) | 0 | 1 (16.7) | 3 (4.0) |
| Pain scale measure | 14 (26.4) | 3 (18.8) | 4 (66.7) | 21 (28.0) |
| Knee ROM | 15 (28.3) | 11 (68.8) | 3 (50.0) | 29 (38.7) |
| Knee strength | 3 (5.7) | 2 (12.5) | 0 | 5 (6.7) |
| Balance | 1 (1.9) | 0 | 0 | 1 (1.3) |
| Proprioception | 2 (3.8) | 0 | 0 | 2 (2.7) |
| Six-minute walk test | 0 | 2 (12.5) | 0 | 2 (2.7) |
| Other timed walk test | 4 (7.5) | 3 (18.8) | 0 | 7 (9.3) |
| Patient satisfaction | 7 (13.2) | 1 (6.3) | 2 (33.3) | 10 (13.3) |
| Other knee outcome measures | 11 | 2 | 1 | 14 |

* Values are the number (percentage). WOMAC = Western Ontario and McMaster Universities Osteoarthritis Index; SF-36 = Short Form 36 health survey; SF-12 = Short Form 12 health survey; HSS = Hospital for Special Surgery; ROM = range of motion.

Table 3. ICF components addressed in randomized trials of hip or knee replacement surgery*

| Components | Hip studies (n = 82) | Knee studies (n = 75) | Combined studies (n = 3) | All studies (n = 160) |
| --- | --- | --- | --- | --- |
| S | 5 (6.1) | 10 (13.3) | 0 | 15 (9.4) |
| A | 2 (2.4) | 0 | 0 | 2 (1.3) |
| H | 29 (35.4) | 26 (34.7) | 2 (66.7) | 56 (35) |
| S & A | 6 (7.3) | 2 (2.7) | 0 | 6 (3.8) |
| S & H | 20 (24.4) | 25 (33.3) | 0 | 45 (28.1) |
| A & H | 7 (8.5) | 0 | 1 (33.3) | 10 (6.3) |
| A & P | 1 (1.2) | 0 | 0 | 1 (0.6) |
| S & A & H | 11 (13.4) | 9 (12.0) | 0 | 19 (11.9) |
| S & A & P & H | 1 (1.2) | 0 | 0 | 1 (0.6) |

* Values are the number (percentage). ICF = International Classification of Functioning, Disability, and Health; S = body structure and function measure; A = activity measure; H = hybrid measure; P = participation measure.

Table 4. Commonly used outcome measures in hip and knee trials and the ICF components addressed for each measure*

| Measure | Domain: body structure or function | Domain: activity | Domain: participation | Scoring method: separate scores for each domain | Scoring method: total score of all domains |
| --- | --- | --- | --- | --- | --- |
| AKS score | x | x |  | x |  |
| Balance | x |  |  |  |  |
| Charnley hip score | x | x |  | x |  |
| EQ-5D index | x | x | x |  | x |
| HSS score | x | x |  |  | x |
| Harris hip score | x | x | x | x | x |
| Hip strength | x |  |  |  |  |
| Hip ROM | x |  |  |  |  |
| Knee ROM | x |  |  |  |  |
| Knee strength | x |  |  |  |  |
| Merle d'Aubigné and Postel score | x | x |  | x | x |
| Other timed walk test |  | x |  |  |  |
| Oxford hip score | x | x | x |  | x |
| Oxford knee score | x | x | x |  | x |
| Proprioception | x |  |  |  |  |
| SF-36 | x | x | x | x | x |
| SF-12 | x | x | x |  | x |
| Timed Get Up & Go test |  | x |  |  |  |
| Six-minute walk test |  | x |  |  |  |
| WOMAC† | x | x | x | x | x |

* Only instruments that were used in at least 2 studies are included. ICF = International Classification of Functioning, Disability, and Health; AKS = American Knee Society; x = measure present; EQ-5D index = EuroQol 5-domain index; HSS = Hospital for Special Surgery; ROM = range of motion; SF-36 = Short Form 36 health survey; SF-12 = Short Form 12 health survey; WOMAC = Western Ontario and McMaster Universities Osteoarthritis Index.

† WOMAC uses 2 separate scores for these domains (1 for body structure or function and 1 for activity/participation) instead of 3.

Of the 160 studies, a total of 38 (24%) indicated the primary outcome measure or measures. The most common were knee ROM (n = 4), WOMAC (n = 4), a walking test (n = 3), and prosthetic loosening (n = 4).

DISCUSSION

Our hypotheses were supported; we found an extensive amount of variation in the types of outcome measures used in hip and knee replacement randomized trials. We also found that trials designed to examine similar issues (e.g., surgical trials) frequently used measures that represented different ICF components. Hip replacement surgical trials, for example, used hybrid measures in 35.4% of studies; hybrid and body structure and function measures in 24.4% of studies; and activity and hybrid measures in 8.5% of studies. For studies designed to answer similar questions about the efficacy of various surgical approaches, we consider this heterogeneity to be substantial. When comparing across these studies, researchers and clinicians would likely have difficulty interpreting potential treatment effects. A greater consistency in use of outcome measures across studies would likely improve interpretability for both clinicians and researchers.

A surprising finding was that only 24% of trials clearly indicated which of the outcome measures were primary and which were secondary. In the majority of studies the primary outcome was not identified and it was unclear how the sample size was determined. The Consolidated Standards of Reporting Trials group statement, published first in 1996 and revised in 2001, instructs authors to describe and report effects for primary and secondary outcome measures (26). Study efficiency and internal validity would appear to be adversely impacted when primary and secondary outcomes are not specified.

We only examined outcome measures used in randomized trials, not cohort studies. A recently published systematic review of hip and knee replacement cohort studies examined the impact of knee and hip replacements on health-related quality of life (HRQOL) (25). Of the 74 studies examined, 18 used a generic or region-specific HRQOL measure that was not reported in the randomized controlled trial literature that we examined. The Sickness Impact Profile was the most commonly reported generic instrument and was used in 7 studies (27). Six of these reports were published prior to the year 2000. The only region-specific questionnaire not represented in our study was the Impact of Rheumatic Diseases on General Health and Lifestyle instrument (28), which was used in 1 study. The WOMAC and/or the SF-36 were the dominant measures and were used in the majority of studies reviewed by Ethgen et al (25). The work of Ethgen and colleagues suggests that our review captured the great majority of HRQOL instruments in the knee and hip replacement literature. With the exception of the Sickness Impact Profile, the instruments not included in the trials we examined are rarely used.

We chose to use the WHO ICF model of disablement as a conceptual framework because of its worldwide appeal and applications across medical disciplines (29, 30). In addition to finding extensive variation in ICF components examined across studies, we found that the hybrid measures that were used demonstrated an inconsistent pattern in the types of components that were addressed. Some hybrid measures have undergone extensive validation to assure that the various questions in the instrument fit a preconceived conceptual framework. The SF-36, for example, was developed with a theoretical framework in mind that allowed for the measurement of several constructs such as physical functioning, social and role disability, and vitality. A series of studies were conducted to validate that the instrument actually measures these various constructs (31, 32).

A less scientific and more experiential approach was used to develop many of the older measures. The AKS is one such measure. In our study, the AKS was the most commonly used measure in knee replacement trials. Insall and colleagues developed the AKS based on a perceived need for a newer instrument to assess effects of knee replacement surgery. The authors suggested that the then-commonly used Hospital for Special Surgery (HSS) scale was outdated, and because the HSS scale included items related to walking and transfers along with measures such as pain and muscle strength, the scores would tend to worsen as the patient ages in spite of a lack of change in knee function (33). The AKS therefore provides a single score for the combined measures of pain, ROM, and alignment, and a single score for walking and stair-climbing. The AKS, however, did not appear to undergo rigorous psychometric testing similar to the SF-36. The AKS has subsequently been shown to have weak content validity and responsiveness relative to other, more psychometrically sound measures (34).

When Insall and colleagues developed the AKS, theoretical frameworks like the WHO ICF disablement model were not readily available. Instead, the state of the art was to develop instruments based primarily on expert opinion. Current practice in outcome measurement development proceeds systematically through a development and testing course based on a theoretical framework (35). Pollard and colleagues reviewed theoretical frameworks of outcome measures for OA and found that clinician-completed measures such as the AKS had weaker methodologic development than patient self-report measures (36). The AKS was developed at a time when systematic psychometric development was not routinely done. As a result, several measures in the hip and knee replacement trial literature continue to be used in spite of less than ideal psychometric properties. The HHS, another clinician-completed instrument developed using an experiential approach, was the most commonly used hip measure in trials that we reviewed. The quality of trials in hip and knee replacement will likely improve when researchers use more contemporary outcome measures with strong psychometric properties and sound theoretical bases.

We believe that a necessary step toward improving quality and consistency of outcome measures is to develop consensus from an international group of experts who are involved in the care of patients with hip or knee replacement. Consensus conferences have been utilized by rheumatologists to identify optimal measures for trials examining various rheumatoid diseases (37, 38), and a similar approach may be effective for achieving consensus among orthopedists.

Our study has several strengths and limitations. We used a systematic search strategy and examined 2 large databases for relevant studies. We did not include reports written in languages other than English, so we may have missed many trials that were published in non-English-language journals. We searched only a 7-year window of literature; our interest was only in the more contemporary trials, but we may have missed some relevant trials given our search timeframe. We used a systematic approach to include and exclude reports, and we examined the reliability of our approach. In addition, we classified outcome measures using the WHO ICF disablement model, a method for classifying disease consequences that has worldwide appeal.

In conclusion, we found extensive variation in the types of outcome measures used in randomized trials of hip and knee replacement. Future work should determine whether consensus can be developed for a standardized set of outcome measures for hip and knee replacement trials. Until greater consistency is achieved, it is unlikely that clinicians will be able to compare across studies to infer the effects of various surgical and nonsurgical treatments for patients following hip or knee replacement. Researchers also will likely have difficulty combining studies for systematic reviews.

AUTHOR CONTRIBUTIONS

Dr. Riddle had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study design. Riddle, Stratford.

Acquisition of data. Riddle, Bowman.

Analysis and interpretation of data. Riddle, Stratford.

Manuscript preparation. Riddle, Stratford, Bowman.

Statistical analysis. Riddle.
