Abstract

  1. Top of page
  2. Abstract
  3. INTRODUCTION
  4. SUBJECTS AND METHODS
  5. RESULTS
  6. DISCUSSION
  7. AUTHOR CONTRIBUTIONS
  8. REFERENCES
  9. APPENDIX A

Objective

To assess the reliability of the physical examination of the hip in osteoarthritis (OA) among rheumatologists and orthopedic surgeons, and to evaluate the benefits of standardization.

Methods

Thirty-five physical signs and techniques were evaluated using a 6 × 6 Latin square design. Subjects with mild to severe hip OA, based on physical and radiographic signs, were examined in random order prior to and following standardization of physical examination techniques. For dichotomous signs, agreement was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), whereas for continuous and ordinal signs a reliability coefficient was calculated using analysis of variance. A PABAK >0.60 and a reliability coefficient >0.80 were considered to indicate adequate reliability.

Results

Adequate post-standardization reliability was achieved for 25 (71%) of 35 signs. The most highly reliable signs included true and apparent leg length discrepancy ≥1.5 cm; hip flexion, abduction, adduction, and extension strength; log roll test for hip pain; internal rotation and flexion range of motion; and Thomas test for flexion contracture. The standardization process was associated with substantial improvements in reliability for a number of physical signs, although minimal or no change was noted for some. Only 1 sign, Trendelenburg's sign, was highly unreliable post-standardization.

Conclusion

With the exception of gait, a comprehensive hip examination can be performed with adequate reliability. Post-standardization reliability is improved compared with pre-standardization reliability for some physical signs. The application of these findings to future OA studies will contribute to improved outcome assessments in OA.


INTRODUCTION

The hip examination is a key component in the assessment of osteoarthritis (OA) in clinical practice and in research. Recommendations for the physical examination of the osteoarthritic hip include different signs and techniques (1–3). However, the reliability of these physical signs has not been evaluated comprehensively. A search of the literature revealed few studies on the intrarater reliability (4–8) and interrater reliability (9–13) of the hip examination in OA. Among studies evaluating interrater reliability, 2 assessed range of motion (ROM) (10, 12) and 2 reported on a single hip examination sign (9, 11). In these studies, adequate interrater reliability was seen for select examinations. However, many hip examinations have not been assessed for their reliability and only 1 study reported results before and after standardization (13). As a result, it is not clear which physical signs can be examined reliably in hip OA and whether standardization (i.e., the detailed specification of a procedure) is necessary to achieve reliability.

We have previously published a report on the interrater reliability of the knee examination and have demonstrated that standardization can improve the reliability of many knee examinations (14). In the present study, we have similarly assessed a wide range of physical hip examination signs and techniques in subjects with mild to severe radiographic hip OA. The purpose of this study was 1) to determine which signs can be assessed reliably by rheumatologists and orthopedic surgeons and 2) to determine whether standardization can reduce the interobserver variability.

SUBJECTS AND METHODS

Subjects.

The study was approved by the Clinical Research Ethics Board at the University of British Columbia. Subjects provided written informed consent. Six subjects were recruited through advertisements. Subjects were eligible if they met predefined criteria. Inclusion criteria were 1) age 45–79 years; 2) pain, aching, or discomfort in the groin or upper thigh on most days of the month at any time in the past; 3) any pain, aching, or discomfort in the groin or upper thigh during the previous 12 months; and 4) osteophytes on plain radiography. Exclusion criteria were 1) prior total hip or knee arthroplasty, 2) hip surgery within the previous 4 months, 3) fibromyalgia or inflammatory arthritis, 4) severe low back or knee pain, and 5) acute hip injury within the previous 6 months. Subjects were selected by a rheumatologist not involved in the standardization process (KS). The primary criterion for subject selection was the presence of a range of clinical hip and periarticular findings, such as greater trochanteric pain, as well as the severity of radiographic hip OA, assessed by the Kellgren/Lawrence (K/L) scale (15). Subjects completed the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) (16) to assess symptom severity.

Design.

The study involved 6 examiners (4 rheumatologists and 2 orthopedic surgeons) with a mean of 17.5 years of clinical experience (range 8–28 years). A preliminary list of hip examination signs was developed during discussion with coinvestigators and was based on the frequent use of these signs in clinical practice or research. If several techniques were available for the evaluation of a sign, these were included and specified. The final selection of physical signs was determined by group consensus. The framework used to evaluate all physical signs was based on 5 domains: gait, leg length, muscle strength, ROM, and pain/tenderness.

The standardization study was completed over 2 days. On the first day, a prestudy briefing was conducted by one of the authors (JC). The examiners were familiarized with the instructions, equipment, and scoring sheets. The study manual contained a list of physical examinations with a brief description of subject position, examination method, and scoring. Basic misconceptions were clarified, but no detailed discussion of technique was entertained and no attempt to standardize was undertaken. Examiners were encouraged to examine according to their usual practice. Equipment for examinations included a 30.5-cm goniometer and a 150-cm tape measure, each identical in brand and specification.

Physical examinations.

Pre-standardization examinations.

Following the briefing session, the examiners examined each subject according to a 6 × 6 Latin square design (17) (see Appendix A for a description of physical examinations). Examinations were conducted independently, in separate rooms, at a single site. The subject order and the order of physical examinations were randomized for each examiner. For subject comfort, physical examinations were randomized by testing position (standing, sitting, supine, lateral decubitus, prone) and then type of test. Following each examination, the results were entered immediately for analysis by the biostatistician. The schedule included a 5-minute break between each examination and a 20-minute break after the first 3 examinations. Subjects and examiners were instructed not to discuss any examinations during the breaks. On completion of the pre-standardization examinations, subjects were asked to complete a questionnaire to document any perceived differences in examination technique between examiners.
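As an illustration only, an examination schedule of this kind can be sketched in Python. The sketch assumes a cyclic base square that is then shuffled by rows, columns, and subject labels; this yields a valid Latin square but does not sample uniformly from all 6 × 6 Latin squares, and it is not claimed to be the randomization procedure used in the study. The function names and the seed are hypothetical.

```python
import random


def latin_square_schedule(n=6, seed=None):
    """Return an n x n Latin square.

    Rows are examiners, columns are time slots, and each entry is the
    index (0..n-1) of the subject seen by that examiner in that slot.
    """
    rng = random.Random(seed)
    # Cyclic base square: cell (r, c) holds (r + c) mod n.
    base = [[(r + c) % n for c in range(n)] for r in range(n)]
    # Permuting rows, columns, and labels keeps the Latin property.
    rng.shuffle(base)
    cols = list(range(n))
    rng.shuffle(cols)
    labels = list(range(n))
    rng.shuffle(labels)
    return [[labels[row[c]] for c in cols] for row in base]


def is_latin(square):
    """Check that every subject appears once per row and once per column."""
    n = len(square)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == full for c in range(n))
    return rows_ok and cols_ok


schedule = latin_square_schedule(6, seed=1)
for examiner, row in enumerate(schedule, start=1):
    print(f"Examiner {examiner}: subjects in slot order {row}")
```

Because each subject appears exactly once in every row and every column, every examiner sees every subject once and no subject is scheduled with 2 examiners in the same time slot.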

Standardization.

Following the pre-standardization examinations, examiners convened for the standardization meeting. The standardization process was based on 4 elements: subject questionnaires on perceived examination differences, graphic display of data and identification of outliers, discussion of examination technique, and demonstration of physical signs on a healthy volunteer. Although subject questionnaires provided some insight into differences in the examiners' examinations, these were ultimately not helpful in explaining variability. For each physical examination item, a graphic chart of the variability of findings was displayed, followed by a discussion of the examination technique to elucidate the reasons for the variability. Areas of disagreement were resolved by consensus. The main discussion points were the physical examination techniques, measurement landmarks, and the scoring scale. The standardized technique was recorded into the study manual by each examiner.

Post-standardization examinations.

The following day, subjects and examiners returned for the post-standardization examinations. These were performed using a 6 × 6 Latin square design (17) with a different randomization schedule from pre-standardization. The procedures for breaks and subject feedback at the end of the examinations were the same.

Statistical analysis.

The statistical analysis was performed using the S-Plus statistical program (18). Agreement for dichotomous variables was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), which is calculated as follows: PABAK = 2p_o − 1, where p_o is the observed proportion of agreement (19). The observed proportion of agreement (p_o) was obtained by calculating the mean of observed proportions of agreement for all possible pairs of examiners. PABAK measures agreement beyond chance, while taking into account prevalence (proportion of abnormal ratings) and bias (difference between observers' proportions of abnormal ratings) (19). PABAK is thought to be a better estimate for agreement than the standard kappa (19) and has the advantage that the results can be directly compared between different variables and even between studies, when study populations are similar. Although PABAK is adjusted for prevalence and bias, it still needs to be examined in conjunction with prevalence and bias. A high level of bias would need to be investigated to determine its cause. In a situation of low prevalence, insufficient information is available for a precise assessment of agreement, and PABAK values may be relatively uninformative.
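The calculation described above can be sketched in Python as follows. This is a minimal illustration, not the study's S-Plus code: the function names and the example ratings are hypothetical, and taking the absolute difference between 2 examiners' proportions of abnormal ratings as the pairwise bias is an assumption of the sketch.

```python
from itertools import combinations


def pabak(ratings):
    """PABAK = 2 * p_o - 1, with p_o averaged over all examiner pairs.

    ratings[examiner][subject] is 1 (abnormal) or 0 (normal).
    """
    n_subjects = len(ratings[0])
    agreements = []
    for a, b in combinations(range(len(ratings)), 2):
        agree = sum(ratings[a][s] == ratings[b][s] for s in range(n_subjects))
        agreements.append(agree / n_subjects)
    p_o = sum(agreements) / len(agreements)
    return 2 * p_o - 1


def prevalence(ratings):
    """Proportion of abnormal ratings over all examiners and subjects."""
    flat = [r for examiner in ratings for r in examiner]
    return sum(flat) / len(flat)


def mean_pairwise_bias(ratings):
    """Mean over examiner pairs of the difference in proportion abnormal.

    Assumption: the absolute difference is used for each pair.
    """
    props = [sum(e) / len(e) for e in ratings]
    diffs = [abs(props[a] - props[b])
             for a, b in combinations(range(len(props)), 2)]
    return sum(diffs) / len(diffs)


# Hypothetical data: 6 examiners x 6 subjects; one examiner differs on one subject.
example = [[1, 1, 1, 0, 0, 0] for _ in range(5)] + [[1, 1, 0, 0, 0, 0]]
print(round(pabak(example), 2))               # 0.89
print(round(prevalence(example), 2))          # 0.47
print(round(mean_pairwise_bias(example), 2))  # 0.06
```

With 6 examiners there are 15 pairs. In this example, 10 pairs agree on all 6 subjects and 5 pairs agree on 5 of 6, so p_o = 0.944 and PABAK = 0.89.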

For the interpretation of PABAK, we adopted the standard kappa descriptive scale by Landis and Koch (20), which, although arbitrary, is widely used (<0.00 indicates poor agreement, 0.00–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement). By consensus, a PABAK >0.60 was chosen a priori to indicate adequate reliability (14).
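For illustration, the descriptive scale and the a priori cutoff can be expressed as a small lookup. The function names are hypothetical, and treating each upper bound as inclusive (so that a value such as 0.205 is labeled "fair") is an assumption of this sketch, because the published scale is stated only to 2 decimal places.

```python
def landis_koch_label(value):
    """Map a kappa-type statistic to the Landis and Koch descriptive label."""
    if value < 0.0:
        return "poor"
    # Upper bounds are treated as inclusive (assumption, see lead-in).
    for upper, label in [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]:
        if value <= upper:
            return label
    raise ValueError("a kappa-type statistic cannot exceed 1.0")


def pabak_is_adequate(value, cutoff=0.60):
    """Adequate reliability requires PABAK strictly greater than the cutoff."""
    return value > cutoff


for v in (0.06, 0.52, 0.72, 0.88):
    print(v, landis_koch_label(v), pabak_is_adequate(v))
```

Note that a PABAK of exactly 0.60 is labeled "moderate" and does not meet the adequacy criterion, because the cutoff is a strict inequality.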

Continuous and ordinal variables were analyzed using analysis of variance (ANOVA) with patient, doctor, and order of examination as explanatory variables. A number of different forms of the intraclass correlation coefficient may be used, depending on the study question and structure (21). Since the purpose of this study was to examine only the variation due to doctors and attempt to reduce this variation, it was thought that the most appropriate reliability coefficient (Rc) would be calculated as Rc = 1 − variance_doctor, where variance_doctor is the proportion of total variance attributed to doctors (13, 14). By consensus, a reliability coefficient >0.80 was chosen a priori to indicate adequate reliability (14).
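One possible reading of this coefficient is sketched below in Python. The sketch makes 2 simplifying assumptions that are not confirmed by the text: the proportion of total variance attributed to doctors is taken as the doctor sum of squares divided by the total sum of squares, and the order-of-examination term is omitted. The function name and the example scores are hypothetical, so the output should not be read as a reproduction of the study's analysis.

```python
def reliability_coefficient(scores):
    """Rc = 1 - (doctor sum of squares / total sum of squares).

    scores[subject][doctor] holds one measurement per subject-doctor pair.
    Assumption: sums of squares stand in for the variance proportions, and
    the examination-order effect is ignored.
    """
    n_subj = len(scores)
    n_doc = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_subj * n_doc)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    if ss_total == 0:
        raise ValueError("no variation in scores; Rc is undefined")
    doc_means = [sum(scores[s][d] for s in range(n_subj)) / n_subj
                 for d in range(n_doc)]
    ss_doctor = n_subj * sum((m - grand) ** 2 for m in doc_means)
    return 1 - ss_doctor / ss_total


# Hypothetical goniometer readings (degrees): 3 subjects x 2 doctors,
# with doctor 2 reading consistently 2 degrees higher than doctor 1.
rom = [[10, 12], [20, 22], [30, 32]]
print(round(reliability_coefficient(rom), 3))  # 0.985
```

In this example the total sum of squares is 406 and the doctor sum of squares is 6, so Rc = 1 − 6/406. A systematic offset between doctors lowers Rc only slightly when the subjects differ widely from one another.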

For all physical signs, an A to D grading system was applied as follows: grade A = reliability adequate on pre- and post-standardization examinations (i.e., PABAK >0.60 or Rc >0.80); grade B = reliability adequate only on post-standardization examination; grade C = reliability adequate on pre- but not post-standardization examination; grade D = reliability inadequate on both pre- and post-standardization examinations (14). Because the post-standardization reliability was of primary interest, a higher grade (A or B) indicated greater reliability and usefulness of the physical sign, with grade A being most desirable because adequate reliability is present without standardization.
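The grading rule can be written out as a short Python sketch. The function names are hypothetical; the thresholds and grade definitions follow the text above, and the combined grades "A/B" and "C/D" for signs evaluated only after standardization follow the notation used in Table 1.

```python
THRESHOLDS = {"PABAK": 0.60, "Rc": 0.80}


def is_adequate(value, kind):
    """Adequate means strictly above the a priori cutoff for that statistic."""
    return value > THRESHOLDS[kind]


def grade(pre, post, kind):
    """Return the A to D grade; pre=None means no pre-standardization value."""
    post_ok = is_adequate(post, kind)
    if pre is None:
        return "A/B" if post_ok else "C/D"
    pre_ok = is_adequate(pre, kind)
    if pre_ok and post_ok:
        return "A"
    if post_ok:
        return "B"
    if pre_ok:
        return "C"
    return "D"


print(grade(0.90, 0.86, "Rc"))     # A  (hip abduction strength)
print(grade(0.14, 0.72, "PABAK"))  # B  (apparent LLD, difference >=1.0 cm)
print(grade(0.82, 0.71, "Rc"))     # C  (leg length discrepancy: palpation)
print(grade(0.55, 0.80, "Rc"))     # D  (hip external rotation: sitting)
```

The last example shows the effect of the strict inequality: a post-standardization Rc of exactly 0.80 does not count as adequate.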

RESULTS

Subject characteristics.

One man and 5 women participated in the study, with a median age of 63 years (range 49–65 years), median duration of hip pain of 8 years (range 2–50 years), median WOMAC pain on walking of 52 mm (range 21–81 mm), and median body mass index of 23 kg/m² (range 22–27 kg/m²). Two subjects had K/L grade 2 radiographic OA severity, and 4 subjects had K/L grade 3.

Pre- and post-standardization agreement.

The standardization meeting resulted in a change in the scoring of hip flexion strength and hip extension strength from a 4-point to a 3-point scale, similar to other strength assessments. In addition, the position for hip extension strength assessment was changed from prone to lateral decubitus. Two items were added during the standardization meeting (Table 1). Hip extension ROM was added to assess ROM comprehensively in all planes. Hip flexion strength in the supine position was added to assess whether subject position affects reliability of this examination sign.

Table 1. Pre-standardization (pre) and post-standardization (post) reliability coefficients for continuous and ordinal physical examination signs*

Physical sign                                    Reliability coefficient    Grade
                                                 Pre       Post
Muscle strength
  Hip abduction                                  0.90      0.86             A
  Hip adduction                                  0.87      0.86             A
  Hip flexion: sitting                           0.83      0.95             A
  Hip extension: prone                           0.85      0.86             A
  Hip flexion: supine†                           -         0.90             A/B
Leg length discrepancy
  Leg length: true (right)                       0.94      0.95             A
  Leg length: true (left)                        0.94      0.94             A
  Leg length difference: true (right-left)       0.90      0.95             A
  Leg length: apparent (right)                   0.97      0.97             A
  Leg length: apparent (left)                    0.98      0.97             A
  Leg length difference: apparent (right-left)   0.87      0.84             A
  Leg length discrepancy: palpation              0.82      0.71             C
Range of motion
  Hip external rotation: sitting                 0.55      0.80             D
  Hip internal rotation: sitting                 0.95      0.94             A
  Hip flexion: supine                            0.91      0.91             A
  Hip external rotation: supine                  0.87      0.80             C
  Hip internal rotation: supine                  0.87      0.94             A
  Hip abduction: supine                          0.91      0.88             A
  Hip adduction: supine                          0.72      0.56             D
  Hip extension†                                 -         0.66             C/D

* For a description of physical examination items, see Appendix A. For a description of the grading system of reliability (grades A–D), see Subjects and Methods.
† Variable evaluated post-standardization only.

The results for physical examinations are summarized in Table 2 for dichotomous signs and Table 1 for continuous or ordinal signs, as well as in Figure 1, which shows the post-standardization reliability, and Figure 2, which shows the pre- and post-standardization measures. Overall, 18 (55%) of 33 physical signs achieved adequate reliability prior to standardization, whereas 25 (71%) of 35 physical signs were reliable following standardization (grade A or B). Of the 10 physical examinations with inadequate post-standardization reliability, most were close to the cutoff for reliability, whereas 1 item, Trendelenburg's test, showed poor reliability (PABAK = 0.06) (Figure 1). The process of standardization was associated with substantial improvements in reliability for many physical examination signs, although for some, standardization had minimal or no effect (Figure 2). For a few items (Trendelenburg's test, true leg length discrepancy ≥1.5 cm, leg length discrepancy by palpation, and hip adduction ROM), the post-standardization reliability was considerably lower than the pre-standardization reliability.

Table 2. Pre- and post-standardization prevalence-adjusted bias-adjusted kappa (PABAK) for dichotomous physical examination signs*

Physical sign                              Pre-standardization           Post-standardization          Grade
                                           PABAK   Prevalence   Bias     PABAK   Prevalence   Bias
Gait
  External rotation gait                   0.46    0.33         0.24     0.52    0.33         0.18     D
  Gait                                     0.06    0.44         0.31     0.52    0.67         0.11     D
Muscle strength
  Trendelenburg's test                     0.36    0.25         0.23     0.06    0.39         0.42     D
LLD
  True LLD (difference ≥1.0 cm)            0.36    0.31         0.23     0.48    0.19         0.14     D
  True LLD (difference ≥1.5 cm)            0.88    0.03         0.06     0.72    0.08         0.14     A
  Apparent LLD (difference ≥1.0 cm)        0.14    0.42         0.30     0.72    0.14         0.06     B
  Apparent LLD (difference ≥1.5 cm)        0.56    0.11         0.16     0.88    0.03         0.06     B
Pain/tenderness
  Hip pain: flexion                        0.46    0.67         0.20     0.82    0.78         0.09     B
  Hip pain: external rotation              0.24    0.61         0.29     0.72    0.58         0.14     B
  Hip pain: internal rotation              0.60    0.56         0.20     0.52    0.61         0.24     D
  Hip pain: log roll test                  0.42    0.50         0.18     0.88    0.64         0.06     B
  Patrick's test for hip pain              0.78    0.83         0.11     0.80    0.75         0.10     A
  Tenderness: greater trochanter           0.40    0.36         0.19     0.68    0.61         0.16     B
Range of motion
  Thomas test (hip flexion contracture)    0.60    0.39         0.20     0.88    0.36         0.06     B
  Ober test (iliotibial band tightness)    0.38    0.22         0.24     0.80    0.08         0.10     B

* Prevalence = proportion of abnormal ratings; bias = mean bias for 15 examiner pairs (for a definition of bias, see Subjects and Methods); LLD = leg length discrepancy. For a description of the grading system of reliability (grades A–D), see Subjects and Methods. For a description of physical examination items, see Appendix A.

Figure 1. Post-standardization reliability for 35 physical signs (♦). Vertical lines indicate the cutoff values for acceptable prevalence-adjusted bias-adjusted kappa (PABAK; >0.60) and reliability coefficient (Rc; >0.80). For a description of the grading system (A–D), see Subjects and Methods section. ER = external rotation; IR = internal rotation; LL = leg length; LLD = leg length discrepancy; diff = difference; ROM = range of motion; Gr = greater. * Evaluated post-standardization only.

Figure 2. Effect of standardization for 33 physical signs. Improvement/worsening in reliability as a result of standardization are indicated by symbols above/below the diagonal line. The degree of improvement or worsening is reflected by the vertical distance from the diagonal. Prevalence-adjusted bias-adjusted kappas (PABAKs) are indicated by lower case letters; PABAK is acceptable if >0.60. Reliability coefficients (Rc) are indicated by upper case letters; Rc is acceptable if >0.80. ER = external rotation; IR = internal rotation; LL = leg length; LLD = leg length discrepancy; diff = difference; ROM = range of motion; Gr = greater. * Evaluated post-standardization only.

Results by domain.

Gait.

Both gait and external rotation gait assessments were unreliable, with post-standardization PABAKs of 0.52 (grade D). For gait assessment, it was specified during the standardization meeting that individual components of gait should be assessed explicitly, including decreased stance phase, decreased stride length, and truncal lurch, with the presence of any of these abnormalities indicating an abnormal gait (Appendix A). Although agreement improved considerably following standardization, the reliability remained below our a priori determined cutoff for adequacy (Table 2). There was a small improvement in external rotation gait following standardization (Table 2), but this was insufficient to meet our criteria for reliability.

Muscle strength.

Hip abduction, adduction, flexion, and extension strength were all found to be reliable on both pre- and post-standardization assessments, with reliability coefficients ranging from 0.86 to 0.95 following standardization (Table 1). Hip flexion strength was reliably measured in the sitting or supine position. In contrast, Trendelenburg's test, a measure of hip abduction strength, was of inadequate reliability and remained so after standardization with a PABAK of 0.06 (grade D). The bias for Trendelenburg's test was high on both pre- and post-standardization examinations, suggesting that the examiners' bias for an abnormal finding was not altered by the standardization process (Table 2).

Leg length discrepancy.

The measurement of leg length, as well as the difference between the 2 leg length measurements, was highly reliable for true and apparent leg length, with reliability coefficients ranging from 0.84 to 0.97 post-standardization (all grade A). However, the dichotomous outcome of leg length discrepancy ≥1 cm between right and left leg measurements, which was the a priori determined cutoff for leg length discrepancy, was reliable only for apparent leg length discrepancy (grade B) and not for true leg length discrepancy (grade D) (Table 2). It should be noted that the post-standardization prevalence of true leg length discrepancy ≥1 cm was low at 0.19; therefore, this finding needs to be interpreted with caution. The use of a higher cutoff of ≥1.5 cm for leg length discrepancy was found to be of adequate reliability, with post-standardization PABAKs of 0.72 and 0.88 for true and apparent leg length discrepancy, respectively, although this was associated with an even lower prevalence of a positive finding (Table 2). The assessment of leg length discrepancy by bilateral palpation of the pelvic rim was found to be of inadequate post-standardization reliability (Rc = 0.71, grade C).
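To illustrate how reliable continuous measurements can still produce disagreement once they are dichotomized, the following Python sketch classifies one subject's leg length difference at the 2 cutoffs. The measurements are hypothetical and are not study data; the function name and the use of the absolute right-left difference are assumptions of the sketch.

```python
def has_lld(diff_cm, cutoff_cm):
    """Classify a right-left leg length difference as abnormal at a cutoff."""
    return abs(diff_cm) >= cutoff_cm


# Hypothetical right-left differences (cm) recorded by 6 examiners for one
# subject; the readings span only 0.4 cm but straddle the 1.0 cm cutoff.
measurements = [0.8, 0.9, 1.0, 1.1, 0.9, 1.2]

at_1_0 = [has_lld(m, 1.0) for m in measurements]
at_1_5 = [has_lld(m, 1.5) for m in measurements]

print(sum(at_1_0), "of 6 examiners call LLD at >=1.0 cm")  # 3 of 6
print(sum(at_1_5), "of 6 examiners call LLD at >=1.5 cm")  # 0 of 6
```

When readings cluster near a cutoff, small measurement differences split the examiners evenly; at a cutoff farther from the readings, the same measurements yield a unanimous classification.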

Pain/tenderness.

Six evaluations were performed to assess hip pain (Table 2). Patrick's test for hip pain was reliable with pre- and post-standardization PABAKs of 0.78 and 0.80, respectively (grade A). Pain on flexion, external rotation, log roll test, and greater trochanter tenderness all had grade B reliability, with the log roll test having the highest post-standardization reliability (PABAK = 0.88). The assessment of internal rotation pain was of inadequate reliability, with pre- and post-standardization PABAKs of 0.60 and 0.52, respectively (grade D). The bias for finding the presence or absence of internal rotation pain was high on both pre- and post-standardization examinations, indicating that this bias was not altered by the standardization process.

Range of motion.

ROM assessments included goniometric measurements in all planes as well as an assessment for hip flexion contracture (Thomas test) and iliotibial band tightness (Ober test) (Tables 1 and 2). The goniometric measurement of internal rotation was highly reliable in the sitting and supine positions (Rc = 0.94 for both positions; grade A). In contrast, external rotation measurement was of borderline reliability (Rc = 0.80 for both sitting and supine; grades D and C, respectively). Flexion and abduction were reliable (grade A), whereas adduction was unreliable (grade D). Extension ROM measurement was added during the standardization meeting, but was of inadequate reliability (Rc = 0.66). Both Thomas test and Ober test required standardization to achieve reliability, with post-standardization PABAKs of 0.88 and 0.80, respectively. However, the prevalence of Ober test abnormality was low, suggesting that the PABAK value is not a precise estimate and needs to be interpreted with caution.

DISCUSSION

The principal elements of the hip examination include evaluations for gait, leg length discrepancy, muscle strength, pain/tenderness, and ROM. The availability of reliable physical examination signs from within each of these domains is crucial for the ability to assess the hip joint comprehensively in future OA outcome studies. With the exception of gait, all domains could be assessed reliably in this study by using the most reliable tests, which are summarized in Table 3. To our knowledge, this is the first study to provide comprehensive interrater reliability data for the hip examination.

Table 3. Summary of post-standardization values for the most reliable physical examination techniques in each domain*

Domain                    Physical examination sign                                   Reliability
Gait                                                                                  Unreliable
Leg length discrepancy    True leg length discrepancy ≥1.5 cm                         0.72 (PABAK)
                          Apparent leg length discrepancy ≥1.5 cm                     0.88 (PABAK)
Muscle strength           Hip flexion strength: sitting                               0.95 (Rc)
                          Hip abduction strength: sitting                             0.86 (Rc)
                          Hip adduction strength: sitting                             0.86 (Rc)
                          Hip extension strength: lateral decubitus                   0.86 (Rc)
Pain/tenderness           Hip pain: log roll test                                     0.88 (PABAK)
Range of motion           Hip internal rotation range of motion: sitting or supine    0.94 (Rc)
                          Hip flexion range of motion: supine                         0.91 (Rc)
                          Hip flexion contracture (Thomas test)                       0.88 (PABAK)

* PABAK = prevalence-adjusted bias-adjusted kappa; Rc = reliability coefficient.

Our results agree with some findings from previous studies (9, 10, 12). Theiler et al (12) reported moderate reliability >0.60 only for intermalleolar distance, Patrick's sign, and supine internal rotation ROM (measured by short-arm goniometer), while Croft et al (10) reported good reliability for flexion ROM (intraclass correlation coefficient 0.87) using a plurimeter. However, internal and external rotation ROM measurements in the sitting position were not reliable with the use of a plurimeter (10, 12). We found good reliability for internal rotation ROM (regardless of position), as well as for flexion and abduction ROM. In addition, external rotation ROM, measured in the supine position, was reliable pre-standardization (Rc = 0.87) and was of borderline reliability post-standardization (Rc = 0.80). The difference in findings from earlier studies may be due to the use of a long-arm goniometer versus a short-arm goniometer or plurimeter. Similar to Theiler et al (12), we demonstrated low reliability for adduction and extension ROM. Whether this relates to patient position or whether further standardization is necessary to achieve reliability will require additional research. We also demonstrated reliability for hip extension strength testing, which was previously reported by Perry et al (9), who evaluated this examination in the supine position. In our study, a lateral decubitus position was used for ease of application. In addition to hip extension strength, we tested hip abduction, adduction, and flexion strength (supine and sitting), all of which were reliable, and Trendelenburg's test, which was unreliable. Despite a detailed discussion and specification of features to be evaluated for Trendelenburg testing (Appendix A), we were unable to improve reliability following standardization. It is possible that the use of more detailed methodology, including a timed component as described by Hardcastle and Nade (22), would have improved reliability.

We were also unable to demonstrate reliability for external rotation gait, which was previously reported to be reliable (11). The difference in findings may be related to a difference in scoring of this examination and the reliability statistic used. It is interesting to note that gait assessment was also unreliable, despite specification of features to be evaluated (Appendix A), although it is possible that reliability might have been improved if the criterion for abnormal gait had been based on the presence of all rather than any specified gait abnormalities. Individual leg length measurements were highly reliable for both true and apparent assessments, as were the calculated leg length differences. However, dichotomization of leg length difference into normal versus abnormal using a cutoff of ≥1 cm was associated with considerable variability. Our results suggest that the use of a 1.5-cm cutoff for leg length discrepancy may be more appropriate. The finding that internal rotation pain, which is a key variable in hip OA assessments, was unreliable post-standardization is of interest. In contrast, the log roll test, which measures both internal and external rotation pain, was highly reliable after standardization. Although all end-of-range pain assessments are meant to be performed on passive movement, subjects are often not able to completely relax their muscles on internal and external rotation in the supine position with the hip and knee at 90° angles. Incomplete relaxation can result in pain originating from muscles and likely increases the variability of a positive finding. In contrast, the log roll test is more conducive to muscle relaxation and therefore may result in greater reliability.

Overall, this study demonstrated that the majority of physical signs can be assessed reliably. It is noteworthy that adequate reliability was more difficult to achieve with examinations that require a subjective assessment of abnormality, such as gait, Trendelenburg's test, and leg length discrepancy, whereas objective measurements tended to be more reliable. Standardization improved reliability for most signs. Small changes <0.05 in either direction in reliability coefficient or PABAK are of little clinical importance and are likely due to random error resulting from the dynamic interactions that occur within and between subjects as well as within and between assessors. For a few signs, a greater decrease in reliability coefficient or PABAK was observed. This may relate to several issues. Not all physical measures are equally responsive to simple standardization procedures. There may also be a conflict between what the assessor normally does compared with imposed study requirements. In addition, more extensive assessor training may be required for some examinations to achieve reliability.

As in our previous study on the reliability of knee examinations (14), the choice of cutoff values for PABAK and reliability coefficient was arbitrary. In addition, the ANOVA-generated coefficients are characterized by an interplay between the error due to doctors and the error due to subjects and residuals, such that these latter 2 sources of variation can have a profound influence on the magnitude of the error due to doctors and therefore the reliability coefficient. Our finding of adequate reliability may therefore need to be interpreted more or less strictly, depending on the application of these results. The small sample size of both subjects and assessors may also be seen as a limitation. However, the prevalence of a positive finding was adequate for the majority of physical signs, thereby allowing for an appropriate assessment of reliability. For those signs where the prevalence was low, increasing the number of subjects may improve the assessment of reliability. Inclusion of subjects with and without OA may also affect reliability, which requires further investigation. Assessors were selected based on their expertise in OA and therefore the study results may not be generalizable to other rheumatologists or orthopedic surgeons or to other settings such as clinical patient encounters. Because no patients with K/L grade 4 were included in this study, the findings may not apply to advanced hip OA. However, the application of the standardized techniques developed in this study will likely prove useful to further evaluate the reliability of the hip examination in future OA studies pertaining to both rheumatologic and orthopedic investigations. Finally, intraobserver reliability was not assessed in this study, because doing so would have required more repetitions. Because our primary goal was to evaluate interobserver reliability, we designed the study to minimize repeat hip examinations in order to avoid subject/examiner fatigue and to avoid reinforcement of memory, which could potentially bias the findings for interobserver reliability. In addition, the long-term stability of interobserver agreement was not assessed. This is an important consideration for clinical trials and longitudinal OA studies.

In summary, this is the first study to demonstrate that a comprehensive hip examination, with the exception of gait, can be conducted reliably (Table 3). The majority of physical examinations can be performed reliably without standardization. Even in highly reliable signs, standardization can further improve the reliability. If hip examination techniques are to be included in future studies of OA, the use of these key signs will be important and will allow for reliable and therefore improved outcome assessments.

AUTHOR CONTRIBUTIONS

Dr. Cibere had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study design. Cibere, Thorne, Bellamy, Greidanus, Chalmers, Mahomed, Kopec, Esdaile.

Acquisition of data. Cibere, Bellamy, Greidanus, Chalmers, Mahomed, Shojania, Esdaile.

Analysis and interpretation of data. Cibere, Thorne, Bellamy, Chalmers, Mahomed, Kopec.

Manuscript preparation. Cibere, Thorne, Bellamy, Greidanus, Chalmers, Mahomed, Shojania, Kopec, Esdaile.

Statistical analysis. Thorne.

REFERENCES

  1. Graham GP, Fairclough JA. The hip. In: Klippel JH, Dieppe PA, editors. Rheumatology. 2nd ed. London: Mosby International; 1998. p. 4.11.1–14.
  2. Evans RC. The hip joint. In: Evans RC. Illustrated orthopedic physical assessment. 2nd ed. St. Louis: Mosby; 2001. p. 677–746.
  3. Magee DJ. Hip. In: Magee DJ. Orthopedic physical assessment. 4th ed. Philadelphia: Saunders; 2002. p. 607–59.
  4. Holm I, Bolstad B, Lutken T, Ervik A, Rokkum M, Steen H. Reliability of goniometric measurements and visual estimates of hip ROM in patients with osteoarthrosis. Physiother Res Int 2000; 5: 241–8.
  5. Gajdosik RL, Sandler MM, Marr HL. Influence of knee positions and gender on the Ober test for length of the iliotibial band. Clin Biomech (Bristol, Avon) 2003; 18: 77–9.
  6. Reese NB, Bandy WD. Use of an inclinometer to measure flexibility of the iliotibial band using the Ober test and the modified Ober test: differences in magnitude and reliability of measurements. J Orthop Sports Phys Ther 2003; 33: 326–30.
  7. Rao KN, Joseph B. Value of measurement of hip movements in childhood hip disorders. J Pediatr Orthop 2001; 21: 495–501.
  8. Ross MD, Nordeen MH, Barido M. Test-retest reliability of Patrick's hip range of motion test in healthy college-aged men. J Strength Cond Res 2003; 17: 156–61.
  9. Perry J, Weiss WB, Burnfield JM, Gronley JK. The supine hip extensor manual muscle test: a reliability and validity study. Arch Phys Med Rehabil 2004; 85: 1345–50.
  10. Croft PR, Nahit ES, Macfarlane GJ, Silman AJ. Interobserver reliability in measuring flexion, internal rotation, and external rotation of the hip using a plurimeter. Ann Rheum Dis 1996; 55: 320–3.
  11. Goligher EC, Ratzlaff CR, Gillies JH. External rotation of hip during stance phase of gait: grading system validation and reliability study [abstract]. J Rheumatol 2004; 31: 1427.
  12. Theiler R, Stucki G, Schutz R, Hofer H, Seifert B, Tyndall A, et al. Parametric and non-parametric measures in the assessment of knee and hip osteoarthritis: interobserver reliability and correlation with radiology. Osteoarthritis Cartilage 1996; 4: 35–42.
  13. Bellamy N, Carette S, Ford PM, Kean WF, Le Riche NG, Lussier A, et al. Osteoarthritis antirheumatic drug trials. I. Effects of standardization procedures on observer dependent outcome measures. J Rheumatol 1992; 19: 436–43.
  14. Cibere J, Bellamy N, Thorne A, Esdaile JM, McGorm KJ, Chalmers A, et al. Reliability of the knee examination in osteoarthritis: effect of standardization. Arthritis Rheum 2004; 50: 458–68.
  15. Kellgren JH, Lawrence JS. Radiological assessment of osteo-arthrosis. Ann Rheum Dis 1957; 16: 494–502.
  16. Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW. Validation study of WOMAC: a health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol 1988; 15: 1833–40.
  17. Box GE, Hunter WG, Hunter JS. Designs with more than one blocking variable. In: Statistics for experimenters: an introduction to design, data analysis, and model building. New York: John Wiley & Sons; 1978. p. 245–80.
  18. S-Plus 2000 Professional 1988-2000. Seattle (WA): MathSoft, Insightful Corporation; 2000.
  19. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993; 46: 423–9.
  20. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–74.
  21. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–7.
  22. Hardcastle P, Nade S. The significance of the Trendelenburg test. J Bone Joint Surg Br 1985; 67: 741–6.

APPENDIX A

PHYSICAL EXAMINATION DESCRIPTION AND SCORING (ABBREVIATED)
1. External rotation gait. Subject walks barefoot away from examiner. Examiner evaluates degree of external rotation by evaluating the number of toes visible lateral to the lower limb. Scoring: normal = <4 toes visible, abnormal = ≥4 toes visible.
2. Gait. Subject walks away from and toward examiner. Examiner inspects for gait abnormalities including decreased stance phase, decreased stride length, and truncal lurch over the affected side while in stance phase. Scoring: normal = no gait abnormality, abnormal = any gait abnormality.
3. Leg length discrepancy: inspection. Subject stands facing away, feet 30 cm apart, bearing weight equally. Examiner places hands on iliac crests and inspects the level of the crests. Scoring: normal = crests level to within 1 cm, right iliac crest lower by >1 cm, left iliac crest lower by >1 cm.
4. Trendelenburg's test. Subject stands facing away, raises opposite leg 10 cm off the ground. Examiner inspects for change in level of pelvis. Scoring: normal (i.e., negative) = pelvis stays level or becomes elevated on the unsupported side; abnormal (i.e., positive) = pelvis drops on the unsupported side or trunk shifts to the stance side.
5. Hip abduction strength. Subject sits at edge of examining table, legs dangling, knees ∼30 cm apart, pushing thighs out, while examiner applies a counterforce. Full force to be achieved without movement of the thighs. Scoring: 0 = severe weakness, 1 = mild weakness, 2 = full strength.
6. Hip adduction strength. Subject sits at edge of examining table, legs dangling, knees ∼30 cm apart, pushing thighs in, while examiner applies a counterforce. Full force to be achieved without movement of the thighs. Scoring: 0 = severe weakness, 1 = mild weakness, 2 = full strength.
7. Hip flexion strength: sitting. Subject sits at edge of examining table, legs dangling, raising thigh off examining table, while examiner applies a counterforce. Scoring: 0 = severe weakness, 1 = mild weakness, 2 = full strength.
8. Hip flexion strength: supine. Subject is supine. With hip and knee bent to 90°, subject pushes upward while examiner produces counterforce. Scoring: 0 = severe weakness, 1 = mild weakness, 2 = full strength.
9. Hip extension strength. Subject is lateral decubitus, tested extremity uppermost, lowermost extremity flexed at hip to 45° and at knee to 90°, uppermost extremity extended at the knee. Subject pushes posteriorly, while examiner produces counterforce. Scoring: 0 = severe weakness, 1 = mild weakness, 2 = full strength.
10. Hip external rotation range of motion (ROM): sitting. Subject sits at edge of examining table, legs dangling. Examiner externally rotates hip, keeping thigh in neutral position, i.e., avoiding abduction. Goniometer is centered at lower border of patella. Arm of goniometer is aligned along patellar tendon. Other arm is aligned vertically. Scoring: degree external rotation.
11. Hip internal rotation ROM: sitting. Subject sits at edge of examining table, legs dangling. Examiner internally rotates hip, keeping thigh in neutral position, i.e., avoiding adduction. Goniometer is centered at lower border of patella. Arm of goniometer is aligned along patellar tendon. Other arm is aligned vertically. Scoring: degree internal rotation.
12. Hip external rotation ROM: supine. Subject is supine. Examiner flexes hip and knee to 90° (zero-degree position). Examiner externally rotates hip, keeping thigh in neutral position, i.e., avoiding abduction. Goniometer is centered at lower border of the patella. Arm of goniometer is aligned along patellar tendon. Other arm is aligned along zero-degree position. Scoring: degree external rotation.
13. Hip internal rotation ROM: supine. Subject is supine. Examiner flexes hip and knee to 90° (zero-degree position). Examiner internally rotates hip, keeping thigh in neutral position, i.e., avoiding adduction. Goniometer is centered at lower border of patella. Arm of goniometer is aligned along patellar tendon. Other arm is aligned along zero-degree position. Scoring: degree internal rotation.
14. Hip flexion ROM. Subject is supine. Examiner flexes hip as far as possible with opposite leg maintained in extended position. Examiner centers goniometer at greater trochanter, aligns one arm of goniometer along center of thigh, aligns other arm horizontally. Scoring: degree flexion.
15. Hip abduction ROM. Subject is supine, legs extended (zero-degree position). Examiner centers goniometer midway between anterior superior iliac spine and pubic symphysis, aligning one arm of goniometer centrally over thigh. Examiner abducts extended leg as far as possible without pelvic tilt. Examiner aligns other arm of goniometer in zero-degree position. Scoring: degree abduction.
16. Hip adduction ROM. Subject is supine, legs extended (zero-degree position). Examiner centers goniometer midway between anterior superior iliac spine and pubic symphysis, aligning one arm of goniometer centrally over thigh. Examiner adducts extended leg as far as possible without pelvic tilt. Examiner aligns other arm of goniometer in zero-degree position. Scoring: degree adduction.
17. Hip extension ROM. Subject is lateral decubitus, tested extremity uppermost, lowermost extremity flexed at hip to 45° and at knee to 90°. Examiner passively extends hip as far as possible. Examiner centers goniometer at greater trochanter, aligns one arm of goniometer over center of thigh and other arm along zero-degree position. Scoring: degree extension.
18. True leg length discrepancy. Subject is supine, pelvis and legs straight. Examiner measures distance from anterior superior iliac spine to center of medial malleolus. Scoring: centimeters right and left true leg length to nearest first decimal.
19. Apparent leg length discrepancy. Subject is supine, pelvis and legs straight. Examiner measures distance from umbilicus to center of medial malleolus. Scoring: centimeters right and left apparent leg length to nearest first decimal.
20. Hip pain on flexion. Subject is supine. Examiner flexes hip passively to end of range with knee at 90°, asking for discomfort in or around hip. Scoring: 0 = pain absent, 1 = pain present.
21. Hip pain on external rotation. Subject is supine. Examiner flexes hip and knee to 90° and externally rotates hip passively to end of range, keeping thigh in neutral position, i.e., avoiding abduction, asking for discomfort in or around hip. Scoring: 0 = pain absent, 1 = pain present.
22. Hip pain on internal rotation. Subject is supine. Examiner flexes hip and knee to 90° and internally rotates hip passively to end of range, keeping thigh in neutral position, i.e., avoiding adduction, asking for discomfort in or around hip. Scoring: 0 = pain absent, 1 = pain present.
23. Hip pain on log roll test. Subject is supine, knees extended. Examiner rolls leg passively from external to internal rotation end of range, asking for discomfort in or around hip. Scoring: 0 = pain absent, 1 = pain present.
24. Hip pain on Patrick's test. Subject is supine. Subject flexes, externally rotates, and abducts hip with foot placed on opposite thigh above knee. Examiner stabilizes contralateral pelvis and gently pushes down on ipsilateral knee. Scoring: 0 = inguinal pain absent, 1 = inguinal pain present.
25. Thomas test (hip flexion contracture). Subject is supine, flexing both hips, then maintaining contralateral hip flexed, while straightening out ipsilateral hip. Examiner evaluates full extension of ipsilateral hip for contact of thigh with examination table. Scoring: 0 = negative (i.e., normal), 1 = positive (i.e., abnormal).
26. Greater trochanter tenderness. Subject is supine, knees extended. Examiner palpates 2.5-cm area around greater trochanter, asking for tenderness. Scoring: 0 = tenderness absent, 1 = tenderness present.
27. Ober test (iliotibial band contracture). Subject is lateral decubitus, lowermost extremity flexed at hip to 45° and at knee to 90°, uppermost knee flexed to 90°. Examiner maintains pelvis in vertical position while passively abducting and extending uppermost hip, then passively adducting uppermost hip and observing for achievement of adducted position (beyond horizontal). Scoring: 0 = negative (i.e., normal), 1 = positive (i.e., abnormal).
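
Two of the scoring rules above can be written as a short Python sketch, which may help when recording examination data. The 1.5 cm threshold is the cutoff reported for leg length discrepancy in the Results summary, and the 4-toe rule comes from item 1. The function names are hypothetical, and treating the discrepancy as the absolute right-left difference is an assumption rather than a procedure stated in this appendix.

```python
def leg_length_discrepancy(right_cm, left_cm, threshold_cm=1.5):
    """Return (discrepancy in cm, flag) from right and left leg lengths.

    Lengths are recorded to one decimal place (items 18 and 19), so the
    difference is rounded to one decimal before it is compared with the
    threshold, which avoids floating-point noise at the cutoff.
    """
    diff = round(abs(right_cm - left_cm), 1)
    return diff, diff >= threshold_cm


def external_rotation_gait(toes_visible):
    """Item 1 scoring: abnormal when 4 or more toes are visible."""
    return "abnormal" if toes_visible >= 4 else "normal"


# Hypothetical measurements, not study data.
print(leg_length_discrepancy(86.4, 84.9))
print(leg_length_discrepancy(86.0, 85.2))
print(external_rotation_gait(3), external_rotation_gait(4))
```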