In clinical practice and in research, the knee examination is a key component in the assessment of patients with osteoarthritis (OA). Recommendations for the physical examination of the OA knee include many different signs and techniques (1–9). In the American College of Rheumatology (ACR) clinical diagnostic criteria for knee OA, assessments for crepitus, bony swelling, bony tenderness, and warmth are required (10). However, the reliability of these and other signs by physical examination has not been evaluated comprehensively. A search of the literature revealed very few studies on the intra- and interrater reliability of the knee examination in OA (11–15) (Table 1). Although the data in Table 1 suggest apparently large between-study differences in reliability, a comparison of these values is not appropriate because they are not all based on the same measure of reliability and because the kappa statistic is very sensitive to differences in bias and prevalence. As a result, it is not clear from these studies which, if any, physical signs can be examined reliably in knee OA. Additional limitations of prior studies are that some OA knee examination techniques have not been evaluated, and the effect of standardization has only been assessed in 1 study (15).
Table 1. Summary of published findings on the interobserver reliability of osteoarthritis knee examination signs
|Knee examination sign||Cushnaghan et al (11)*||Hart et al (12)*||Jones et al (13)*||Hauzeur et al (14)†||Bellamy et al (15)‡|
|Crepitus|| || || || || || |
| General crepitus||–||0.14||0.23||–||–||–|
| Tibiofemoral crepitus||0.64||–||0.09||–||–||–|
| Patellofemoral crepitus||0.24||–||0.10||–||–||–|
|Inflammation|| || || || || || |
| Nonbony swelling||0.28||0.25||0.13||–||–||–|
| Synovial fluid||–||–||0.22||0.45||–||–|
| Popliteal cyst||–||–||0.21||–||–||–|
|Instability|| || || || || || |
| Mediolateral instability||0.23||–||–||–||–||–|
| Anteroposterior instability||0.00||–||–||–||–||–|
|Tenderness/pain|| || || || || || |
| General tibiofemoral||–||0.74||–||–||–||–|
| Medial tibiofemoral||0.40||–||0.48||–||–||–|
| Lateral tibiofemoral||0.43||–||0.44||–||–||–|
| Pain on movement||–||0.85||–||–||–||–|
|Range of motion||–||–||–||–||0.58–0.94||0.89–0.95|
Because kappa is sensitive to prevalence and bias, a prevalence-adjusted bias-adjusted kappa (PABAK), described by Byrt et al (16), was used in this study. PABAK measures agreement beyond chance while adjusting for both the prevalence of a positive finding and the bias of each observer toward reporting a positive finding. PABAK is thought to be a better estimate of agreement than the standard kappa (16). In addition, PABAK has the advantage that results can be compared directly between different variables, and even between studies when the study populations are similar.
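As a rough illustration of why this adjustment matters, the sketch below computes the conventional kappa and the PABAK for a binary sign scored by 2 observers. It assumes the 2 × 2 formulation in which PABAK reduces to 2Po − 1 (Po being the observed proportion of agreement), together with the prevalence and bias indices. The function name, the cell labels, and the example counts are hypothetical and are not taken from this study.

```python
def agreement_stats(a, b, c, d):
    """Agreement statistics for a 2 x 2 table of two observers.

    Hypothetical cell labels (assumed for this sketch):
      a = both observers positive    b = observer 1 positive only
      c = observer 2 positive only   d = both observers negative
    """
    n = a + b + c + d
    p_obs = (a + d) / n  # observed proportion of agreement (Po)
    # Agreement expected by chance from the two observers' marginal totals
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    # Kappa is undefined when chance agreement is already complete
    kappa = float("nan") if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)
    return {
        "kappa": kappa,
        "pabak": 2 * p_obs - 1,             # depends only on Po
        "prevalence_index": (a - d) / n,    # imbalance of positive vs negative agreement
        "bias_index": (b - c) / n,          # imbalance between the two kinds of disagreement
    }


# Two made-up tables with the same observed agreement (90 of 100 knees)
balanced = agreement_stats(40, 5, 5, 50)  # sign present in about half the knees
skewed = agreement_stats(85, 5, 5, 5)     # sign present in almost every knee

print(round(balanced["kappa"], 3), round(balanced["pabak"], 3))  # kappa about 0.80, PABAK 0.8
print(round(skewed["kappa"], 3), round(skewed["pabak"], 3))      # kappa about 0.44, PABAK 0.8
```

In this made-up example the observers agree on 90% of knees in both tables, yet the conventional kappa falls from about 0.80 to about 0.44 when the sign is highly prevalent, whereas the PABAK is unchanged. This is the behavior that makes kappa values difficult to compare across signs or studies.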
In this study, we assessed a wide range of knee physical examination signs and techniques in patients with mild to severe radiographic knee OA. The purpose of this study was 2-fold, in that we sought to determine 1) which signs can be assessed reliably by rheumatologists, and 2) whether standardization can reduce the interobserver variability.
The principal elements of the knee examination include evaluations for alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The availability of reliable physical examination signs from within each of these domains is crucial for the ability to assess the knee joint comprehensively in future outcome studies of OA.
With regard to examinations for crepitus, our study findings were variable. Compartment-specific crepitus was not assessed with consistent reliability by using active or passive movement. Passive movement with stress was reliable only for the medial tibiofemoral and the patellofemoral compartments and may be more difficult to implement in clinical research, since it is not usual practice, is not easily performed, and is more time consuming. Therefore, given that passive crepitus is generally assessed in clinical practice, and since its reliability coefficients were either acceptable or close to the cutoff of 0.80, passive movement without stress would seem the most feasible technique for future studies in which compartment-specific crepitus is of importance. For general (non–compartment-specific) crepitus, the assessment was most reliable using passive movement.
With regard to alignment, inflammation, and muscle strength, several interchangeable signs were evaluated in this study, with at least 1, and frequently more than 1, physical sign achieving good reliability. This allows for selection of appropriate physical examination signs in future studies on the basis of not only reliability, but also suitability and preference. On the other hand, for some domains, such as instability, tenderness/pain, and range of motion, the individual physical signs evaluated represent different dimensions of these domains and are therefore not interchangeable. With regard to instability, we were only able to reliably assess posterior instability. The reliability of medial and lateral instability, which depends on adequate assessments at both 0° and 30° of flexion, was not established. As a result, the assessment of instability as a whole was found to be unreliable and may need to be investigated in further studies. In contrast, the assessments for tenderness/pain and range of motion were found to be reliable, since all physical signs within those groups were deemed to be reliable.
Overall, this study showed that the majority of physical signs can be assessed reliably. Furthermore, standardization improved reliability for most of the signs/techniques. However, for some physical examinations, reliability decreased following standardization. For most of these, the decrease in the reliability coefficient or PABAK was less than 0.05. Such small changes, in either direction, may not be clinically important and are likely due to random error resulting from the dynamic interactions that occur within and between subjects as well as within and between assessors. For a few signs, a greater decrease in the reliability coefficient or PABAK was observed. This is likely because not all physical measures are equally responsive to a simple standardization procedure, and some require more intense or repeated training. In particular, a conflict between what assessors normally do and what they are obligated to do under imposed study requirements is likely to influence the reliability. In this study, the standardization meeting included demonstration of physical technique for the purpose of reaching an agreement on standardization, but no extensive assessor training was undertaken, and this may have adversely affected the reliability of some physical examination findings. This possibility needs to be considered in future studies.
Only 3 signs were clearly unreliable for the physical examination of knee OA and were not remedied by standardization. These were warmth, lateral instability at 30° flexion, and medial instability at 30° flexion. It is not surprising that the latter 2 were unreliable, since some mediolateral movement is invariably present at 30° of flexion and the decision of what degree of movement constitutes instability is subjective and difficult to standardize. It is interesting to note that standardization did achieve substantial improvement in agreement for lateral instability at 30° flexion, but not for medial instability at 30° flexion. This discrepancy is difficult to explain. It is possible that mediolateral instability represents a single inseparable sign, which may achieve adequate reliability, if examined as such. This will need to be explored in future studies.
As with instability, we found poor reliability for the assessment of warmth and were unable to standardize for it. Because a finding of warmth could be affected by repeated joint examinations, we evaluated the order of examinations and whether later examinations were associated with findings of warmth. However, no effect due to order was found, and therefore it is likely that the poor reliability of warmth was due to the highly subjective nature of its assessment.
The interpretation of our results also requires an understanding of the inherent limitation of dichotomizing a continuum. The cutoff values of PABAK and reliability coefficients, although sensible, were arbitrary. The adequacy or inadequacy of reliability is not clearly discriminated by values that fall immediately above or below the cutoff points. Thus, our findings of adequate reliability may need to be interpreted more or less strictly, depending on the application of these results. Particularly with regard to the PABAK, which has been utilized in few studies, the appropriateness of a cutoff value of 0.60 is uncertain. However, given that a conventional kappa greater than 0.60 is interpreted as at least substantial agreement (21), and given that the PABAK provides more reliable values for an index of agreement than does the conventional kappa, we think that such a cutoff is indeed appropriate for the purpose of our study, which attempted to identify physical signs that are highly reliable for use in future OA studies.
In addition, the magnitude of these statistical values depends very much on the statistical methods being applied. The ANOVA-generated coefficients are characterized by an interplay between the error due to doctors and the error due to patients and residuals, such that these latter 2 sources of variation can have a profound influence on the magnitude of the error due to doctors and thus the reliability coefficient.
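The interplay described above can be made concrete with a minimal sketch. The formula used for the reliability coefficient is not restated here, so the code below assumes a common choice: a two-way layout (patients × doctors, 1 score per cell) in which the coefficient is the patient variance divided by the sum of the patient, doctor, and residual variances. The function name, the clamping of negative variance estimates to 0, and the example scores are all assumptions made for illustration.

```python
def anova_reliability(ratings):
    """Reliability coefficient from a complete patients x doctors table.

    ratings[i][j] is the score given to patient i by doctor j.
    Assumed model (not taken from the source): two-way ANOVA without
    replication, with reliability = var_patients / (var_patients +
    var_doctors + var_residual).
    """
    n = len(ratings)       # number of patients
    k = len(ratings[0])    # number of doctors
    grand = sum(sum(row) for row in ratings) / (n * k)
    patient_means = [sum(row) / k for row in ratings]
    doctor_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for each source of variation
    ss_patients = k * sum((m - grand) ** 2 for m in patient_means)
    ss_doctors = n * sum((m - grand) ** 2 for m in doctor_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_residual = ss_total - ss_patients - ss_doctors

    # Mean squares
    ms_patients = ss_patients / (n - 1)
    ms_doctors = ss_doctors / (k - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))

    # Variance components; negative estimates are clamped to 0 (a design choice)
    var_patients = max(0.0, (ms_patients - ms_residual) / k)
    var_doctors = max(0.0, (ms_doctors - ms_residual) / n)
    var_residual = max(0.0, ms_residual)

    total = var_patients + var_doctors + var_residual
    if total == 0:
        raise ValueError("no variation in the ratings; reliability is undefined")
    return var_patients / total


# Made-up scores: 3 patients (rows) examined by 3 doctors (columns)
perfect = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]   # doctors agree exactly
shifted = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   # same ranking, systematic doctor offsets

print(anova_reliability(perfect))  # 1.0
print(anova_reliability(shifted))  # 0.5
```

In the second made-up table the doctors rank the patients identically, yet the coefficient drops to 0.5 because the systematic differences between doctors are as large as the differences between patients. Under this assumed model, a narrower spread among patients would lower the coefficient further even if the doctors' behavior were unchanged.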
The small sample of both patients and assessors may also be seen as a limitation. However, because of potential patient and assessor fatigue with repeated examinations, this type of work, by necessity, involves small samples. Furthermore, the selection of patients was carried out in such a way that they were representative of patients with mild to severe radiographic OA with a range of physical examination findings, and thus they were the kind of patients typically seen in clinical practice and in research. More importantly, the prevalence of a positive finding was adequate for the majority of physical signs, thereby allowing for an appropriate assessment of reliability. For those signs in which the prevalence was low, increasing the number of subjects may improve the assessment of reliability.
Assessors were selected based on their expertise in OA and in clinical assessments, and therefore the study results may not be generalizable to other rheumatologists. However, the application of the standardized techniques developed in this study will likely prove useful to further evaluate the reliability of the knee examination in other OA studies.
Finally, the intraobserver reliability was not assessed in this study, because doing so would have required more repetitions. Since our primary aim was to evaluate interobserver reliability, we designed the study to minimize the repetitions of knee examinations in order to avoid patient and examiner fatigue and, in particular, to avoid reinforcement of memory, which could potentially bias the findings for interobserver reliability. In addition, the long-term stability of interobserver agreement was not assessed. As a result, it is uncertain whether the improvement in reliability achieved during the standardization study is maintained over time. This is an important consideration for clinical trials and other OA studies, since long-term followup is often required. Further studies are necessary to evaluate long-term reliability of the physical examination and the frequency at which assessor training needs to be carried out in order to reliably perform the knee examination in OA.
Despite these potential limitations, the following key findings can be summarized. The majority of physical examinations can be performed reliably even without standardization (grade A signs). Even with highly reliable signs/techniques, standardization can further improve the reliability. Some physical examinations require assessor training and should not be used otherwise. The examination techniques with the highest reliability coefficients or PABAKs will likely be of most value in clinical research and possibly in the evaluation of early OA, in which more subtle findings are expected.
The most reliable physical examinations of knee OA are listed in Table 4 and include alignment by goniometer, bony swelling, general passive crepitus, gait, effusion bulge sign, quadriceps atrophy, medial and lateral tibiofemoral tenderness, patellofemoral tenderness by grind test, and flexion contracture. If knee examination techniques are to be included in future studies of OA, the inclusion of these key signs will be important and will allow for reliable, and therefore improved, outcome assessments.
Table 4. Summary of poststandardization values for the most reliable physical examination techniques in each domain
|Domain||Physical examination sign||Reliability|
|Alignment||Alignment by goniometer||0.99*|
|Crepitus||General passive crepitus||0.96*|
|Inflammation||Effusion bulge sign||0.97*|
|Muscle strength||Quadriceps atrophy||0.97*|
|Tenderness/pain||Medial tibiofemoral tenderness||0.94*|
|Tenderness/pain||Lateral tibiofemoral tenderness||0.85*|
|Tenderness/pain||Patellofemoral tenderness by grind test||0.94*|
|Range of motion||Flexion contracture||0.95*|