To assess the interrater reliability of hip examination tests used to assess femoroacetabular impingement (FAI) among clinicians from different disciplines.
Twelve subjects were examined by 9 clinicians using 12 hip tests drawn from a review of the literature and consultation with experts in hip pain and FAI. Examiners assessed both hips of each subject and were blinded to subject history. The order in which subjects were seen, the order of tests, and order of examination of the 2 hips within each subject were all randomized. Interrater reliability (IRR) for the 10 categorical tests was summarized using overall raw agreement (ORA), positive agreement (agreement on abnormal findings), and negative agreement (agreement on normal findings). An ORA of >0.75 was considered to indicate adequate reliability. For the 2 range of motion (ROM) outcomes, IRR was summarized using the median of the absolute difference (MAD) in measurements obtained by any 2 examiners on any patient. MAD reflects the “typical” difference (in degrees) between 2 raters.
Adequate reliability (ORA >0.75) was achieved for 6 of the 10 hip examination tests with categorical outcomes. Positive agreement ranged from 0.35 to 0.84, while negative agreement ranged from 0.62 to 0.99. For the ROM outcomes, examiners were, on average, within 5° of each other for flexion and 7° for internal rotation.
The results provide evidence that the most common hip examination tests would likely be sufficiently reliable to allow agreement between examiners when discriminating between painful FAI and normal hips in a clinical setting.
Hip osteoarthritis (OA) is among the most prevalent health problems and is a major cause of disability, health care utilization, and diminished quality of life (). Modifiable risk factors have not been well defined and effective preventive and interventional strategies have yet to be described. Recently, it has been suggested that femoroacetabular impingement (FAI) may be one of the most important modifiable precursors to OA ([2, 3]). Despite an almost exponential rise in publications relating to FAI () since the term was first used in 1999 (), the majority of studies to date have been surgical case series and reviews (). There remains limited epidemiologic evidence of the prevalence, incidence, and natural history of FAI ([6-8]). Additional information is needed to understand interventions that may alter the course from FAI to global hip OA ([3, 9]).
Large population-based studies are required to investigate these issues, and an important clinical and research question is whether common physical examination tests for FAI are reliable. Our group has previously reported on the reliability of the hip examination in OA () and other groups have reported on the reliability of component tests of the hip examination ([11-20]). FAI tests described in the rheumatologic, sports medicine, family practice, and physical therapy literature ([3, 21-26]) have correlated strongly with magnetic resonance arthrogram and surgical observations ([2, 21, 27]). However, few reports exist on the reliability of clinical examination tests for FAI ([12, 16]). None of these assessed a sample containing both normal and painful (FAI-confirmed) hips using multiple raters.
A hip screening examination that could discriminate between normal hips and painful hips with FAI would be of value to clinicians and researchers from various disciplines. An important first step is to determine which tests are reproducible among examiners. The purpose of this study was to assess the interrater reliability of hip physical examination tests related to painful FAI using multiple clinicians from different disciplines.
Twelve subjects were recruited for the study. Since it was important to evaluate the reliability of both positive and negative tests, we employed a stratified sampling method, seeking to include subjects with symptomatic FAI-confirmed hips and pain-free healthy hips. To recruit subjects with symptomatic FAI, e-mails were sent from an orthopedic surgery service to patients with a history of hip pain and radiographic () and/or magnetic resonance imaging–confirmed FAI (). To recruit healthy volunteers, friends and colleagues of Arthritis Research Centre of Canada employees were contacted via e-mail and personal communication. Inclusion criteria were age 20–49 years and white or Chinese descent. Subjects were excluded if they reported 1) previous hip surgery, 2) fibromyalgia or inflammatory arthritis, 3) severe low back or knee pain, or 4) acute hip injury within the previous 6 months. Subjects were also excluded if their hip status or medical history was known to any of the examiners. Interested participants were asked the following questions related to hip pain: 1) have you experienced pain in the upper thigh, inner thigh, or groin area? 2) did the pain last longer than 6 weeks in a row in the past 12 months? and 3) did this pain occur on more than 3 occasions in the past 12 months?
Subjects underwent a preexamination briefing on the day of the examinations to provide them with logistical information about the procedures. They were instructed to have no discussion of the examination procedures or findings from previous examinations with the examiner or with other subjects. If any concerns or issues arose during a given examination session (e.g., too much pain), the patient was instructed to let the examiner know. Subjects provided written informed consent and the study was approved by the Clinical Research Ethics Board at the University of British Columbia.
The examiners included 9 clinicians (2 rheumatologists and 7 physiotherapists) with varying degrees of experience in musculoskeletal (MSK) practice and examination of the hip joint for FAI. To emulate clinical practice and the role of multiple examination screeners in our larger epidemiologic study, examiners did not undergo a formal prestudy standardization session. However, examiners were provided with a written physical examination protocol, which provided descriptions of each physical examination test and the scoring scheme for each test. An information session for all examiners was held an hour prior to the physical examination sessions. This included demonstrations of each of the tests with opportunity for questions and answers. Standard procedures were agreed upon and the examiners followed a verbal script in conversing with examination subjects. For example, on questions that required the subject to respond to the presence or absence of pain or discomfort, the examiners were provided with a written script with the following wording: “Does this give you pain or discomfort in the upper thigh, inner thigh, or groin area?” On questions with binary outcomes, equivocal findings were recorded as negative/absent.
Subjects and examiners completed questionnaires prior to the beginning of the examinations. The subject questionnaire included information regarding demographics such as age, race, self-reported weight and height, and socioeconomic variables. General health information including doctor-diagnosed MSK/rheumatic conditions and pain (aching or discomfort in the low back or knee) was also collected. A hip-specific section included items on hip pain, stiffness, clicking, and surgical history, in addition to the Copenhagen Hip and Groin Outcome Score (HAGOS). The HAGOS is a validated outcome tool used for assessing hip pain in young to middle-aged, physically active patients with longstanding hip and/or groin pain (). The examiner questionnaire included sociodemographic information, educational history, previous clinical training and expertise, and experience in hip examination, including familiarity with the tests performed in this study.
Each subject was examined by all 9 examiners. Examiners were blinded to subject history. Examiners were instructed not to converse with subjects apart from the agreed upon script. The schedule included a 5-minute break between each examination, a 20-minute break after the third and ninth examinations, and a 1-hour lunch break after the sixth examination. Subjects and examiners were instructed and regularly reminded not to discuss any examination procedures or findings during the breaks. Each examiner assessed both hips of each subject using 12 hip tests related to the detection of FAI (see Supplementary Appendix A, available in the online version of this article at http://onlinelibrary.wiley.com/doi/10.1002/acr.22036/abstract). Results were recorded on a standardized score sheet (see Supplementary Appendix B, available in the online version of this article at http://onlinelibrary.wiley.com/doi/10.1002/acr.22036/abstract).
Examiners submitted each score sheet to a study administrator upon completion and data were immediately entered into an electronic spreadsheet by a data analyst. The tests included were determined by a panel of experts in rheumatology, orthopedics, physical therapy, and MSK epidemiology following a review of the literature and in consultation with clinical experts in FAI, and were chosen based on their use in common practice across disciplines. They included range of motion (ROM) tests for flexion and internal rotation (IR) using standard goniometer assessment and provocative tests for the detection of FAI. All tests were performed in the supine position and were conducted passively. The response of pain present or pain absent was recorded by the examiner for all tests. A description of all tests and the full physical examination protocol is available in Supplementary Appendix A (available in the online version of this article at http://onlinelibrary.wiley.com/doi/10.1002/acr.22036/abstract). The order of hip examinations was randomized by selecting 9 columns (examiners) from a 12 × 12 Latin square design. The order of examination of the 2 hips within each subject was also randomized.
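The Latin square step can be illustrated with a minimal sketch. It assumes a cyclic 12 × 12 square whose rows (time slots) are shuffled and from which 9 columns (examiners) are drawn at random; the construction actually used in the study is not described, and the function names, the seed, and the mapping of rows to time slots are hypothetical.

```python
import random


def cyclic_latin_square(n):
    """Return an n x n Latin square: row r, column c holds (r + c) mod n."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]


def examination_schedule(n_subjects=12, n_examiners=9, seed=0):
    """Build schedule[examiner][slot] -> subject index.

    Assumption (not from the paper): rows are time slots, columns are
    examiners, and cell values are subjects.
    """
    rng = random.Random(seed)
    square = cyclic_latin_square(n_subjects)
    rng.shuffle(square)  # permuting rows keeps the Latin property
    columns = rng.sample(range(n_subjects), n_examiners)  # 9 of 12 columns
    # Reading one column top to bottom gives one examiner's subject order.
    return [[square[slot][c] for slot in range(n_subjects)] for c in columns]


schedule = examination_schedule()
```

Under these assumptions every examiner sees each of the 12 subjects exactly once, and no subject is assigned to 2 examiners in the same time slot, so 3 subjects rest in each slot.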
Interrater reliability for the 10 categorical test outcomes (absence of pain/presence of pain or full ROM/restricted ROM) was summarized using the proportion of overall agreement and the proportions of specific agreement. Overall raw agreement (ORA) is the proportion of paired ratings on which the 2 raters agree, regardless of category. For a standard 2 × 2 table (Table 1), ORA is (a+d)/(a+b+c+d). An ORA of >0.75 was arbitrarily chosen to indicate adequate reliability. While ORA is informative, it does not distinguish between agreement on positive and negative tests. Therefore, we also computed the proportions of specific agreement by rating category ([25, 26]), i.e., positive agreement (PA; agreement on abnormal findings) and negative agreement (NA; agreement on normal findings). Specific agreement is the probability that, when 1 of the 2 raters assigns a given category, the other rater assigns the same category. In Table 1, specific agreement for a positive test is 2a/(2a+b+c), while the specific agreement for a negative test is 2d/(2d+b+c).
|Rating||Rater 2, test positive||Rater 2, test negative|
|Rater 1, test positive||a||b|
|Rater 1, test negative||c||d|
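The agreement formulas can be sketched in a few lines. This is an illustration only: the cell counts in the usage line are hypothetical, and `pooled_table` assumes that ratings from more than 2 raters are pooled over every unordered pair of raters on each hip, which the text does not specify.

```python
from itertools import combinations


def agreement_indices(a, b, c, d):
    """Return (ORA, PA, NA) for a 2 x 2 table with cells a, b, c, d."""
    n = a + b + c + d
    ora = (a + d) / n
    # Specific agreement is undefined when a category is never used.
    pa = 2 * a / (2 * a + b + c) if (2 * a + b + c) else float("nan")
    na = 2 * d / (2 * d + b + c) if (2 * d + b + c) else float("nan")
    return ora, pa, na


def pooled_table(ratings_by_hip):
    """Tally a 2 x 2 table over every unordered pair of raters on each hip.

    ratings_by_hip: one inner list of 0/1 ratings per hip (1 = abnormal).
    Discordant pairs are all counted in b, since only b + c enters the
    formulas (an assumption made for this sketch).
    """
    a = b = d = 0
    for ratings in ratings_by_hip:
        for x, y in combinations(ratings, 2):
            if x == y == 1:
                a += 1
            elif x == y == 0:
                d += 1
            else:
                b += 1
    return a, b, 0, d


ora, pa, na = agreement_indices(40, 5, 5, 50)  # hypothetical counts
```

With these hypothetical counts, ORA is 0.90, PA is approximately 0.89, and NA is approximately 0.91.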
For the 2 continuous (flexion and IR ROM) outcomes, interrater reliability was summarized using the median of the absolute difference (MAD) in measurements obtained by any 2 examiners on any patient. MAD is a reflection of the “typical” difference (in degrees) between 2 raters. Confidence intervals were obtained through bootstrapping ().
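As a minimal sketch of this summary, the code below pools absolute differences over all examiner pairs on each hip and takes the median. The bootstrap shown is an assumption: it resamples hips with replacement and uses percentile limits, whereas the text states only that bootstrapping was used. Function names and the example angles are hypothetical.

```python
import random
from itertools import combinations
from statistics import median


def median_abs_difference(measurements):
    """measurements: one inner list of angles (degrees) per hip, one per examiner."""
    diffs = [abs(x - y) for hip in measurements for x, y in combinations(hip, 2)]
    return median(diffs)


def bootstrap_ci(measurements, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the MAD, resampling hips (assumed scheme)."""
    rng = random.Random(seed)
    stats = sorted(
        median_abs_difference([rng.choice(measurements) for _ in measurements])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


example = [[120, 125, 130], [110, 112, 118]]  # hypothetical flexion angles
mad = median_abs_difference(example)
ci = bootstrap_ci(example)
```

Resampling whole hips rather than individual differences keeps the measurements from one hip together, which seems the more cautious choice when the same 9 examiners measure every hip.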
Seven women and 5 men participated in the study (Table 2). They were predominantly white, ages 24–48 years (mean 36 years), had an average body mass index of 25.2 kg/m² (with 1 subject underweight, 7 normal weight, 3 overweight, and 1 obese), and all had some postsecondary education. Seven of the 12 subjects reported hip (upper thigh, inner thigh, or groin) pain over the past 12 months in at least 1 hip. Four of the subjects reported bilateral hip pain, giving a total of 11 symptomatic hips out of the 24 hips examined. Five of the subjects had never had hip pain or symptoms.
|Characteristics||No. (%) or mean ± SD (range)|
|Age, years||36 ± 8.2 (24–48)|
|Body mass index, kg/m²||25.2 ± 6.1 (19–42)|
|South Asian||1 (8)|
|Trade certificate, vocation||3 (25)|
|Some post-secondary||1 (8)|
|Bachelor's degree||3 (25)|
|University above bachelor's degree||5 (42)|
|Physician-diagnosed inflammatory disease (RA, gout, fibromyalgia, SLE, AS)||0 (0)|
|Reported in past 12 months, subjects||7 (58)|
|Reported in past 12 months, hips||11 (46)|
|Most days of month in past, subjects||2 (17)|
|Symptoms||74.4 ± 21.6|
|Pain||82.5 ± 19.4|
|Function, ADL||92.7 ± 9.3|
|Function, sport||78.3 ± 18.9|
|Function, participation||94.8 ± 4.8|
|Quality of life||85.0 ± 13.9|
There were 9 examiners (2 rheumatologists and 7 physical therapists) who had been in practice for an average of 23.7 years (range 4–38 years) (Table 3). The physical therapists were from various subdisciplines, with 4 from MSK-related practice and 3 from non-MSK areas. All examiners were familiar with ROM testing, although only 2 used goniometers routinely. Four of the examiners routinely performed impingement tests for FAI, with 2, 3, 4, and 12 years of experience, respectively, in doing so.
|Characteristics||No. (%) or mean ± SD (range)|
|Medical doctor||2 (22)|
|Physical therapist||7 (78)|
|Years in practice||23.7 ± 10.5 (4–38)|
|Routinely perform ROM tests||9 (100)|
|If yes, how many years||22.8 ± 10.4 (6–38)|
|Routinely use goniometer to measure ROM||2 (22)|
|Routinely perform FAI impingement tests||4 (44)|
|If yes, how many years||5.3 ± 4.6 (2–12)|
|Familiar with test (below) prior to study|
|Hip IR pain||8 (89)|
|Hip IR ROM (using goniometer)||8 (89)|
|Hip flexion ROM||9 (100)|
|Log roll test||7 (78)|
|FABER test||9 (100)|
|Anterior impingement test (flexion 90°/adduction/IR)||7 (78)|
|Flexion 90°/adduction/IR ROM||6 (67)|
|Flexion 120°/adduction/IR pain||7 (78)|
|Flexion 120°/adduction/IR ROM||6 (67)|
|Flexion 90°/adduction compression pain||6 (67)|
|Flexion 120°/adduction compression pain||5 (56)|
|Posterior impingement test||3 (33)|
Adequate reliability by our arbitrary threshold of ORA >0.75 was achieved for 6 of the 10 hip examination tests with categorical outcomes: the anterior impingement test; the posterior impingement test; the flexion, abduction, and external rotation (FABER) test; the log roll test; hip IR (presence or absence of pain); and the flexion 120°/adduction/IR test (presence or absence of pain). Most of these tests had a blend of normal and abnormal findings (Table 4), adding precision to the ORA reliability estimate as reflected in reasonably high specific agreement for positive and negative tests. One exception was the log roll test, which was almost universally normal, as reflected by the extreme NA (1.00) and PA (0.00) values. Results for the continuous test outcomes are presented in Table 5. On average, examiners were within 5° of each other for flexion and 7° for IR.
|Examination||Prevalence (proportion abnormal)||Overall raw agreement (95% CI)||Negative agreement, normal result (95% CI)||Positive agreement, abnormal result (95% CI)|
|Log roll test, pain||0.01||0.99 (0.96–1.00)||1.00 (0.98–1.00)||0.00 (0.00–0.25)|
|FABER test, pain||0.26||0.84 (0.75–0.95)||0.89 (0.79–0.97)||0.70 (0.46–0.89)|
|Hip IR, pain||0.45||0.84 (0.74–0.96)||0.90 (0.81–0.98)||0.70 (0.32–0.90)|
|Posterior impingement test (extension 0°/ER)||0.14||0.81 (0.69–0.94)||0.89 (0.79–0.97)||0.35 (0.06–0.71)|
|Flexion 120°/adduction/IR pain||0.75||0.78 (0.68–0.92)||0.68 (0.27–0.91)||0.84 (0.71–0.94)|
|Anterior impingement test (flexion 90°/adduction/IR)||0.45||0.76 (0.66–0.91)||0.79 (0.60–0.93)||0.73 (0.58–0.89)|
|Flexion 90°/adduction/compression pain||0.26||0.70 (0.59–0.87)||0.81 (0.66–0.93)||0.41 (0.19–0.65)|
|Flexion 120°/adduction/compression pain||0.55||0.69 (0.61–0.85)||0.72 (0.54–0.89)||0.67 (0.45–0.85)|
|Flexion 90°/adduction/IR ROM||0.34||0.67 (0.60–0.83)||0.75 (0.61–0.89)||0.51 (0.30–0.73)|
|Flexion 120°/adduction/IR ROM||0.51||0.58 (0.52–0.75)||0.62 (0.44–0.82)||0.52 (0.33–0.76)|
|Examination||Mean ± SD ROM, °||SEM (95% CI), °||Median absolute difference (95% CI), °|
|Hip flexion ROM||127 ± 13||7 (5–9)||5 (3–6)|
|Hip internal rotation ROM||31 ± 11||6 (4–8)||7 (4–9)|
This study reported on the interrater reliability of FAI-related physical examination tests among 9 examiners from varying clinical disciplines and backgrounds in a group of symptomatic and asymptomatic adults between the ages of 24 and 48 years. The tests with the best reliability, above our arbitrary standard of ORA >0.75, included those most commonly validated against surgically confirmed FAI: the log roll test (0.99), the FABER test (0.84), hip IR pain (0.84), and the anterior impingement test (0.76). However, the log roll test was almost universally normal, limiting the usefulness of its estimate. The 4 tests with an ORA <0.75 were not as familiar to most clinicians in this study and are not as well described in the literature or correlated with surgically confirmed FAI. Generally, examiners had slightly better agreement on normal findings (NA) than on abnormal findings (PA). Excluding the log roll test, NA ranged from 0.62 to 0.90, while PA ranged from 0.35 to 0.84. The anterior impingement test is the most commonly reported test in the literature and is positive in 88–99% of patients whose FAI is subsequently confirmed surgically ([21, 27]). In our sample, this test had an ORA of 0.76, an NA of 0.79, a PA of 0.73, and an abnormal test prevalence of 45%, increasing the precision of our estimate and reducing the likelihood of a chance-related inflation or bias. Similarly, the FABER test, positive in 69–97% of patients whose FAI is subsequently confirmed surgically ([21, 27]), was also among the most reliable tests.
This evidence for interrater reliability is important for future studies where screening large numbers of patients for FAI will be carried out by different examiners. Our results provide evidence that even without a formal standardization training session, examiners from varying backgrounds, some of whom were unfamiliar with and/or did not routinely use FAI testing, can reliably apply the most common FAI tests. Reliability for FAI tests would likely be enhanced by using examiners who are familiar with and routinely use FAI tests.
For goniometrically measured ROM tests, the MAD was 5° (95% confidence interval [95% CI] 3°–6°) for flexion and 7° (95% CI 4°–9°) for IR, providing evidence that these ROM tests have reliability that is likely to be acceptable in the setting of case discrimination in FAI. Normal population ROM values are approximately 135° and 35° for flexion and IR, respectively, in the age group of our sample, whereas patients with surgically confirmed FAI have reported presurgical average flexion ROM of 97°–111° (±9°–18°) ([21, 27]) and IR (at 90° flexion) of 9° (±8°) (). These differences between FAI and normal hips are much greater than the MAD reported here, and would likely have sufficient reliability to obtain agreement between examiners in discriminating FAI from normal hips.
Strengths of this study include a sample containing both symptomatic FAI and normal hips, and with the exception of the log roll test, a mix of normal and abnormal test findings, reducing the likelihood of a chance-inflated agreement and allowing for an appropriate assessment of reliability. We used 9 examiners across different disciplines with varying degrees of MSK examination and FAI testing experience. While examiners averaged 23 years in practice, some did not routinely use the FAI tests in their clinical practice and most rarely used goniometers. This demonstrates that adequate reliability is possible without extensive training or experience with these tests.
Other strengths included the blinding of examiners to subject hip health status and the randomization of order in which subjects were seen, the order of tests, and the order of examination of the 2 hips within each subject, all to reduce potential systematic bias. The limiting and structuring of communication between examiner and subject to the agreed upon script, as well as a prohibition on communication during the examination day (during examinations or breaks) between any of the study participants or examiners was also a strength, reducing the likelihood of factors other than the test influencing a given finding.
There are several caveats. The choice of cutoff values for ORA (>0.75) was arbitrary, as with other rating scales such as kappa or intraclass correlation coefficient (ICC). The relatively small sample size of subjects may be seen as a limitation. However, the prevalence of a positive finding was adequate for all examination tests except for the log roll test, allowing for a more precise assessment of reliability. A bigger potential concern was that patients were subjected to 9 consecutive hip examinations within a span of several hours; it is possible that physiologic changes secondary to repetitive manipulation, particularly from provocative FAI tests designed to induce pathologic bony collision and reproduce painful symptoms, could result in variation in pain, muscle tone, and ROM during the course of the examinations. This may have decreased reproducibility between examiners due to true changes in the parameter being measured, obfuscating agreement and underestimating true reliability. We did not assess intraobserver reliability, since this would have resulted in a doubling of the number of tests undergone at each hip. However, since our principal purpose was to investigate interobserver reliability, we designed the study to minimize repeat hip examinations in order to avoid subject/examiner fatigue and to avoid reinforcement of memory, which could potentially bias the findings for interobserver reliability. Several studies (using fewer examiners) have examined intraobserver reliability for hip ROM ([11, 13, 20]), but no studies were found that examined the intraobserver reliability of FAI tests. In addition, the long-term stability of interobserver agreement was not assessed. This is an important consideration for clinical trials and longitudinal OA studies. Since our sample was primarily white, our results may not be generalizable to other ethnic groups.
We also did not evaluate validity of the tests. As noted in a recent systematic review, there are currently no physical examination tests available that can reliably confirm or discard the diagnoses of FAI in clinical practice (). Since most of the reported tests are relatively new in a condition only recently identified, and since there is a lack of epidemiologic evidence for diagnostic tests, a first step is to assess the degree that clinical tests are reproducible. A test that is not reliable cannot be valid; however, the opposite is not true (it is possible to have a reliable but invalid test). Most of the tests we assessed are widely used by orthopedic surgeons and other MSK clinicians. A reasonable first step is to determine reliability (which is often easier to do) and then test validity only for tests that are reliable. Many of the published papers on hip examination tests report only on reliability ([10, 12, 15, 16, 18-20, 30]).
We chose to use indices of raw agreement to describe reliability as they have unique common sense value (). Most previous studies have used versions of the ICC or kappa statistic. However, these more complex, less intuitive statistics have the potential to mislead readers as to true agreement for a given test (). For example, the kappa statistic, which attempts to account for chance agreement, can misinform readers since it mixes different sources of disagreement (bias, prevalence, and chance) in a single omnibus index and the assumption that raters “guess” most of the time is unrealistic (). We therefore selected measures of reliability that were less complex and more intuitive. By reporting both PA and NA along with the prevalence of abnormal findings, it is possible to see where an extremely high reliability score (e.g., log roll test) is due to the extremely low prevalence of an abnormal finding. If both PA and NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-predicted agreement using a kappa statistic (). Regardless, PA and NA provide more information relevant to understanding and improving ratings than a single omnibus index ([25, 31]). Likewise, ICCs depend on the variability in values across patients in the sample, such that ICC can be high if there is a heterogeneous sample even if the multiple measurements on a given patient are not that similar. By reporting the MAD, we offer a value that can be used to determine if agreement or discord between raters is likely to have clinically useful meaning in a given situation.
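The point about prevalence can be made concrete with a short worked sketch. The two tables below are hypothetical (they are not study data) and share the same overall raw agreement of 0.98; the kappa shown is the standard Cohen's kappa for 2 raters, computed from the 2 × 2 cell counts.

```python
def raw_and_specific_agreement(a, b, c, d):
    """Return (ORA, PA, NA) for a 2 x 2 table with cells a, b, c, d."""
    n = a + b + c + d
    return (a + d) / n, 2 * a / (2 * a + b + c), 2 * d / (2 * d + b + c)


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for 2 raters from a 2 x 2 table of counts."""
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the marginal totals of each rater.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)


rare = (0, 1, 1, 98)       # hypothetical: abnormal finding almost never seen
balanced = (49, 1, 1, 49)  # hypothetical: abnormal finding in half of ratings

kappa_rare = cohen_kappa(*rare)
kappa_balanced = cohen_kappa(*balanced)
indices_rare = raw_and_specific_agreement(*rare)
```

In the rare-finding table, kappa is approximately -0.01 while ORA is 0.98, PA is 0.00, and NA is approximately 0.99; in the balanced table, kappa is 0.96. Reporting PA and NA alongside prevalence shows directly that the high overall agreement in the first table rests entirely on normal findings, which a single index does not reveal.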
Comparison to other studies should be considered cautiously due to differences in study design, subject recruitment, hip health status of the sample, tests examined, and reliability measures used. We found only 2 other articles examining the interrater reliability of FAI tests. Prather et al investigated the interrater reliability (16 raters) of 4 provocative tests for FAI in 56 normal subjects but were not able to report reliability estimates due to the low prevalence of abnormal findings (). Martin and Sekiya studied the interrater reliability (2 raters) of the FABER test, the anterior (flexion 90°/adduction/IR) impingement test, and the log roll test, reporting acceptable levels of interrater reliability for the FABER test and log roll test (kappa scores of 0.63 and 0.61, respectively), but not for the anterior impingement test (). The kappa for the impingement test was 0.58 and not significantly different from 0.40, likely due in part to a high prevalence of abnormal tests. As part of reliability tests for a whole hip examination, Theiler et al () and Cibere et al () reported good or excellent reliability for the FABER (Patrick's) test in OA patients (reliability coefficients of 0.7 and 0.8, respectively).
Some loss in the strength of our reliability indices may have been due to aspects of the study design. First, since 7 of 12 subjects (affecting 11 of 24 hips) reported hip pain or discomfort and all participants were subjected to 9 consecutive hip examinations involving pain provocation tests within the space of several hours (and in some cases within several minutes), physiologic changes secondary to repetitive manipulation such as muscle spasm and guarding may have resulted in variation in pain and range of motion. Second, we did not employ a prestudy standardization training session as has been done in previous studies ([10, 12]), which has been shown to significantly increase reliability (). Further, while our examiner group had an average of 23 years of clinical experience, less than half regularly performed FAI impingement tests as part of their clinical practice. It is probable that limiting examiners to those who regularly practice in MSK settings and routinely perform physical examinations for the hip including FAI impingement tests would augment reliability of these tests. Future studies could examine this, and studies relying on reliable clinical case finding for FAI should consider this.
Using many raters across disciplines to assess reliability is not common. However, as we sought to establish the reliability of an FAI screening examination for a larger study that will require examination of subjects at a variety of locations and times, we were interested in determining the FAI tests that were the most reliable across disciplines while requiring a limited amount of training. Previous studies have also successfully studied reliability among multiple raters ([10, 12]), including among multiple medical disciplines at the trunk and the hip, where the purpose was to establish a quick and reliable screening examination ([12, 34]).
In summary, this is the first study to document interrater reliability for FAI tests among multiple examiners from different clinical backgrounds. While not all tests met our standard for reliability, the anterior impingement test, IR ROM test, and flexion ROM test would likely be sufficiently reliable to allow agreement between examiners when discriminating between FAI and normal hips in a clinical setting. The reliability described was achieved without a prestudy examination standardization session and with examiners who did not all routinely perform FAI tests or use goniometers. Reliability may be enhanced in settings where these are addressed.
All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Ratzlaff had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Ratzlaff, Simatovic, Wong, Li, Esdaile, Cibere.
Acquisition of data. Ratzlaff, Simatovic, Ezzat, Langford, Esdaile, Kennedy, Embley, Caves, Hopkins, Cibere.
Analysis and interpretation of data. Ratzlaff, Simatovic, Wong, Cibere.
The authors thank Helen Prlic for her invaluable work in coordinating this study.