Reliability of the knee examination in osteoarthritis: Effect of standardization




To assess the reliability of physical examination of the osteoarthritic (OA) knee by rheumatologists, and to evaluate the benefits of standardization.


Forty-two physical signs and techniques were evaluated using a 6 × 6 Latin square design. Patients with mild to severe knee OA, based on physical and radiographic signs, were examined in random order prior to and following standardization of techniques. For those signs with dichotomous scales, agreement among the rheumatologists was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), while for the signs with continuous and ordinal scales, a reliability coefficient (Rc) was calculated using analysis of variance. A PABAK of >0.60 and an Rc of >0.80 were considered to indicate adequate reliability.


Adequate poststandardization reliability was achieved for 30 of 42 physical signs/techniques (71%). The most highly reliable signs identified by physical examination of the OA knee included alignment by goniometer (Rc = 0.99), bony swelling (Rc = 0.97), general passive crepitus (Rc = 0.96), gait by inspection (PABAK = 0.78), effusion bulge sign (Rc = 0.97), quadriceps atrophy (Rc = 0.97), medial tibiofemoral tenderness (Rc = 0.94), lateral tibiofemoral tenderness (Rc = 0.85), patellofemoral tenderness by grind test (Rc = 0.94), and flexion contracture (Rc = 0.95). The standardization process resulted in substantial improvements in reliability for evaluation of a number of physical signs, although for some signs, minimal or no effect of standardization was noted. After standardization, warmth (PABAK = 0.14), medial instability at 30° flexion (PABAK = 0.02), and lateral instability at 30° flexion (PABAK = 0.34) were the only 3 signs that were highly unreliable.


With the exception of physical examinations for instability, a comprehensive knee examination can be performed with adequate reliability. Standardization further improves the reliability for some physical signs and techniques. The application of these findings to future OA studies will contribute to improved outcome assessments in OA.

In clinical practice and in research, the knee examination is a key component in the assessment of patients with osteoarthritis (OA). Recommendations for the physical examination of the OA knee include many different signs and techniques (1–9). In the American College of Rheumatology (ACR) clinical diagnostic criteria for knee OA, assessments for crepitus, bony swelling, bony tenderness, and warmth are required (10). However, the reliability of these and other signs by physical examination has not been evaluated comprehensively. A search of the literature revealed very few studies on the intra- and interrater reliability of the knee examination in OA (11–15) (Table 1). Although the data in Table 1 suggest apparently large between-study differences in reliability, a comparison of these values is not appropriate because they are not all based on the same measure of reliability and because the kappa statistic is very sensitive to differences in bias and prevalence. As a result, it is not clear from these studies which, if any, physical signs can be examined reliably in knee OA. Additional limitations of prior studies are that some OA knee examination techniques have not been evaluated, and the effect of standardization has only been assessed in 1 study (15).

Table 1. Summary of published findings on the interobserver reliability of osteoarthritis knee examination signs
Knee examination sign | Cushnaghan et al (11)* | Hart et al (12)* | Jones et al (13)* | Hauzeur et al (14) | Bellamy et al (15)

  • * Reliability assessed by kappa.
  • Reliability assessed by weighted kappa.
  • Reliability assessed by reliability coefficient.

Bony swelling: 0.55, 0.10, 0.65, 0.69
 General crepitus: 0.14, 0.23
 Tibiofemoral crepitus: 0.64, 0.09
 Patellofemoral crepitus: 0.24, 0.10
 Nonbony swelling: 0.28, 0.25, 0.13
 Synovial fluid: 0.22, 0.45
 Popliteal cyst: 0.21
 Mediolateral instability: 0.23
 Anteroposterior instability: 0.00
Muscle strength
 General tibiofemoral: 0.74
 Medial tibiofemoral: 0.40, 0.48
 Lateral tibiofemoral: 0.43, 0.44
 Pain on movement: 0.85
Range of motion: 0.58–0.94, 0.89–0.95

Because kappa is known to be a measure that is sensitive to prevalence and bias, a prevalence-adjusted bias-adjusted kappa (PABAK), described by Byrt et al (16), was used in this study. PABAK measures agreement beyond chance, while taking into account both the prevalence of a positive finding and the bias of each observer for reporting a positive finding. PABAK is thought to be a better estimate for agreement than the standard kappa (16). In addition, PABAK has the advantage that the results can be directly compared between different variables and even between studies, when study populations are similar.
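The prevalence sensitivity described above can be illustrated with a small two-rater sketch. The 2 × 2 counts below are invented for illustration and are not study data; the function name and the prevalence and bias indices (taken here as |a − d|/n and |b − c|/n, following the usual presentation of Byrt et al) are assumptions of this example. PABAK is computed as 2 × observed agreement − 1.

```python
def agreement_stats(a, b, c, d):
    """Agreement statistics for a two-rater 2 x 2 table.

    a = both raters positive, d = both raters negative,
    b and c = the two kinds of discordant ratings.
    """
    n = a + b + c + d
    p_o = (a + d) / n                      # observed proportion of agreement
    p_pos1 = (a + b) / n                   # rater 1 proportion positive
    p_pos2 = (a + c) / n                   # rater 2 proportion positive
    p_e = p_pos1 * p_pos2 + (1 - p_pos1) * (1 - p_pos2)  # chance agreement
    return {
        "p_o": p_o,
        "kappa": (p_o - p_e) / (1 - p_e),  # Cohen's kappa
        "pabak": 2 * p_o - 1,              # prevalence-adjusted bias-adjusted kappa
        "prevalence_index": abs(a - d) / n,
        "bias_index": abs(b - c) / n,
    }


# Hypothetical tables with identical observed agreement (90%)
balanced = agreement_stats(45, 5, 5, 45)   # positive finding in about half
skewed = agreement_stats(85, 5, 5, 5)      # positive finding very common

for name, stats in (("balanced", balanced), ("skewed", skewed)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

With the same 90% observed agreement, kappa falls from 0.80 to about 0.44 when the finding becomes very common, whereas PABAK stays at 0.80 in both tables.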

In this study, we assessed a wide range of knee physical examination signs and techniques in patients with mild to severe radiographic knee OA. The purpose of this study was 2-fold, in that we sought to determine 1) which signs can be assessed reliably by rheumatologists, and 2) whether standardization can reduce the interobserver variability.



The study was approved by the local institutional review board and all participating subjects provided their written informed consent. Six subjects were selected from a database of patients with knee OA. Subjects were eligible if they met predefined criteria. Inclusion criteria were 1) age 40–79 years, 2) knee pain on most days of the month at any time in the past, 3) any knee pain during the previous 12 months, and 4) osteophytes on plain radiography. Exclusion criteria were 1) prior total knee arthroplasty, 2) knee surgery within the previous 4 months, 3) fibromyalgia or inflammatory arthritis, 4) knee pain derived from the hips or back, and 5) history of acute injury to the knee within the previous 6 months.

Patients were selected by an independent rheumatologist who was not involved in the standardization process (KS). The primary criterion for patient selection was the presence of a full range of physical examination signs, including a range of severity of different physical signs. A secondary criterion was the selection of patients on the basis of radiographic severity of OA as assessed by the Kellgren/Lawrence scale (17). Patients completed the Western Ontario and McMaster Universities OA (WOMAC) Index, version VA3.1 (18), but they were not selected on the basis of symptom severity as reported on the WOMAC. Rheumatologists who participated in the standardization process were blinded to these patient selection criteria.


The study involved 6 rheumatologists who were experienced in the conduct of knee OA research studies, and a biostatistician. The selection of physical examination signs was based on the frequency with which they are recorded in rheumatology clinical practice or research and their potential to be useful in the evaluation of knee pain (1–6, 9, 10). If several techniques were available for the evaluation of a sign, these were included and specified. The framework used to evaluate all physical signs was based on 9 domains, comprising alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The final selection of physical signs/techniques was determined by group consensus (Figure 1). A pilot knee examination was conducted by 1 of the investigators (JC) to estimate the time required for the evaluation and to remedy any difficulties with the data collection forms or their scoring.

Figure 1.

Assessment of poststandardization reliability (♦) for 42 physical examination signs evaluated in patients with knee osteoarthritis. Thick vertical bars indicate the cutoff values for an acceptable prevalence-adjusted bias-adjusted kappa (PABAK) (cutoff >0.60) and reliability coefficient (Rc) (cutoff >0.80). For a description of the grading system of reliability (grades A–D), see Patients and Methods. The effusion–balloon test was evaluated poststandardization only.

The standardization study was completed over 2 days. On the first day, a prestudy briefing was conducted. The rheumatologists were familiarized with the instructions, equipment, and scoring sheets for the 2-day program. The study manual contained a list of physical examinations with a brief description of patient position, examination method, and scoring. Basic misconceptions were clarified, but no detailed discussion of technique was entertained and no attempt to standardize was undertaken. Rheumatologists were encouraged to perform the examinations in accordance with their usual practice. Equipment for examinations included a 30.5-cm goniometer and a 150-cm tape measure, each identical in brand and specification.

Physical examinations.

Prestandardization examinations.

Following the briefing session, the rheumatologists examined each patient according to a 6 × 6 Latin square design (19). Examinations were conducted independently, in separate rooms, at a single site. The patient order and the order of physical examinations were randomized for each rheumatologist. For physician convenience and patient comfort, physical examinations were randomized by position of test (standing, sitting, lying) and then type of test. Patients wore shorts, such that there was adequate bilateral exposure of knees and thighs. Following each examination, all data forms were checked for completeness by the study coordinator, and the results were entered immediately for analysis by the biostatistician. The schedule included a 5-minute break for patients and rheumatologists between each examination and a 20-minute break after the first 3 examinations. All participants (patients and rheumatologists) were instructed not to discuss any examinations or techniques during the breaks. On completion of the prestandardization examinations, patients were asked to complete a 1-page questionnaire to document any perceived differences in examination technique between rheumatologists. Although analgesic medications were allowed in the study, none of the patients had taken any on the day of the examinations.
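A schedule of this kind can be sketched in a few lines. The article does not describe how its Latin square was generated, so the approach below (a cyclic square with shuffled rows, columns, and patient labels) and the function names are assumptions for illustration only. Rows stand for rheumatologists, columns for examination slots, and entries for patient numbers.

```python
import random


def random_latin_square(n, seed=None):
    """Return an n x n Latin square with entries 1..n.

    Starts from a cyclic square and shuffles rows, columns, and labels.
    This does not sample uniformly from all Latin squares; it is only a
    simple way to obtain a valid randomized schedule.
    """
    rng = random.Random(seed)
    square = [[(row + col) % n for col in range(n)] for row in range(n)]
    rng.shuffle(square)                      # permute rows (rheumatologists)
    col_order = list(range(n))
    rng.shuffle(col_order)                   # permute columns (time slots)
    square = [[row[c] for c in col_order] for row in square]
    labels = list(range(1, n + 1))
    rng.shuffle(labels)                      # relabel patients
    return [[labels[cell] for cell in row] for row in square]


def is_latin_square(square):
    """Check that every row and every column contains each label exactly once."""
    n = len(square)
    expected = set(range(1, n + 1))
    rows_ok = all(len(row) == n and set(row) == expected for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == expected for c in range(n))
    return rows_ok and cols_ok


schedule = random_latin_square(6, seed=1)
for doctor, row in enumerate(schedule, start=1):
    print(f"Rheumatologist {doctor}: patients in order {row}")
```

Each rheumatologist examines every patient exactly once, and in any given slot each patient is with exactly one rheumatologist, which is what allows doctor, patient, and order effects to be separated.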


Following the prestandardization examinations, rheumatologists convened for the standardization meeting, chaired by one of us (NB). The standardization process was based on 4 elements: 1) patient responses to the questionnaires on perceived differences in examinations; 2) graphic display of data and identification of outliers by the biostatistician; 3) discussion of examination techniques to elucidate reasons for variability; and 4) demonstration of physical signs on a healthy volunteer. Although the patient questionnaires provided some insight into differences in rheumatologists' examinations, these were ultimately not helpful to identify outliers or explain variability. For each physical examination item, a graphic chart of the variability of findings was displayed by the biostatistician, followed by a discussion of the examination technique to elucidate the reasons for the variability. Areas of disagreement were resolved by consensus. The main discussion points were the physical examination techniques, measurement landmarks, and the scoring scale. The standardized technique was recorded in the study manual by each rheumatologist. Immediately following the standardization meeting, the poststandardization data collection forms were updated to incorporate all changes.

Poststandardization examinations.

The following day, patients and rheumatologists returned for the poststandardization examinations. These were performed using a 6 × 6 Latin square design (19) with a different randomization schedule from that at prestandardization. The procedures for breaks, data checking, and patient feedback at the end of the examinations were the same.

Statistical analysis.

The statistical analysis was conducted by the biostatistician using the S-Plus statistical program (20). Interobserver agreement with regard to signs with dichotomous scales was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), which is calculated as 2p_o − 1, where p_o is the observed proportion of agreement (16); this observed proportion of agreement (p_o) was obtained by calculating the mean of observed proportions of agreement for all possible pairs of rheumatologists. Although the PABAK is adjusted for prevalence and bias, it still needs to be examined in conjunction with prevalence and bias. A high level of bias would need to be investigated to determine its cause. In a situation of low prevalence, insufficient information is available for a precise assessment of agreement, and PABAK values may be relatively uninformative. For the interpretation of PABAKs, we adopted the standard kappa descriptive scale by Landis and Koch (21), which, although somewhat arbitrary, is widely used: <0.00 = poor agreement, 0.00–0.20 = slight agreement, 0.21–0.40 = fair agreement, 0.41–0.60 = moderate agreement, 0.61–0.80 = substantial agreement, and 0.81–1.00 = almost perfect agreement. A PABAK of >0.60 was chosen a priori to indicate adequate reliability. In view of the fact that the PABAK represents an improvement on Cohen's kappa, the consensus among the study rheumatologists and the biostatistician was that it was reasonable to use the same descriptive scale. Similarly, the choice of the cutoff value was based on a decision by consensus of all study rheumatologists and the biostatistician.
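The multi-rater calculation described above can be sketched as follows. The ratings are invented for illustration and are not study data, and the function names are hypothetical. The observed agreement is the mean of the pairwise agreement proportions over all pairs of raters, PABAK is twice that value minus 1, and the descriptive labels follow the Landis and Koch bands quoted in the text.

```python
from itertools import combinations


def pabak(ratings):
    """PABAK for several raters.

    ratings[r][p] is the dichotomous (0/1) rating given by rater r to patient p.
    """
    pair_agreements = []
    for rater_a, rater_b in combinations(ratings, 2):
        matches = sum(x == y for x, y in zip(rater_a, rater_b))
        pair_agreements.append(matches / len(rater_a))
    p_o = sum(pair_agreements) / len(pair_agreements)  # mean pairwise agreement
    return 2 * p_o - 1


def landis_koch(value):
    """Descriptive label using the bands quoted in the text."""
    if value < 0.0:
        return "poor"
    if value <= 0.20:
        return "slight"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical data: 6 raters, 6 patients; one rater differs on one patient
ratings = [[1, 0, 0, 1, 0, 0] for _ in range(5)] + [[1, 0, 0, 0, 0, 0]]
value = pabak(ratings)
print(round(value, 2), landis_koch(value), "adequate" if value > 0.60 else "inadequate")
```

In this example, 10 of the 15 rater pairs agree on all 6 patients and 5 pairs agree on 5 of 6, so the mean observed agreement is about 0.94 and the PABAK about 0.89.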

Interobserver agreement with regard to signs with continuous and ordinal scales was assessed using analysis of variance (ANOVA). A number of different forms of the intraclass correlation coefficient may be used, depending on the study question and structure (22). Since the purpose of this study was to examine only the variation due to the differences between doctors, and because this study attempted to reduce this variation, it was thought that the most appropriate reliability coefficient (Rc) would be calculated as 1 − variance_doctor, where variance_doctor is the proportion of total variance attributed to doctors (15). By consensus, a reliability coefficient of >0.80 was chosen a priori to indicate adequate reliability.
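One way to picture this coefficient is the sketch below, which partitions the total sum of squares of a Latin square into patient, doctor, order, and error parts. It rests on an assumption that should be stated loudly: the article does not say whether the "proportion of total variance" was taken from sums of squares or from estimated variance components, and the share of the total sum of squares is used here only as a simple stand-in. The data and function names are hypothetical.

```python
def variance_proportions(scores, patient_of):
    """Share of the total sum of squares due to patient, doctor, order, error.

    scores[d][k]     = score recorded by doctor d in examination slot k
    patient_of[d][k] = index of the patient seen by doctor d in slot k
    ASSUMPTION: sum-of-squares shares stand in for variance proportions.
    """
    n = len(scores)
    cells = [(d, k, patient_of[d][k], scores[d][k])
             for d in range(n) for k in range(n)]
    grand_mean = sum(cell[3] for cell in cells) / (n * n)

    def sum_of_squares(index):
        totals = [0.0] * n
        for cell in cells:
            totals[cell[index]] += cell[3]
        return sum(n * (total / n - grand_mean) ** 2 for total in totals)

    ss_total = sum((cell[3] - grand_mean) ** 2 for cell in cells)
    if ss_total == 0:
        raise ValueError("all scores are identical; proportions are undefined")
    ss_doctor, ss_order, ss_patient = (sum_of_squares(i) for i in range(3))
    ss_error = ss_total - ss_doctor - ss_order - ss_patient
    return {"patient": ss_patient / ss_total, "doctor": ss_doctor / ss_total,
            "order": ss_order / ss_total, "error": ss_error / ss_total}


def reliability_coefficient(scores, patient_of):
    """Rc = 1 - proportion of total variance attributed to doctors."""
    return 1 - variance_proportions(scores, patient_of)["doctor"]


# Hypothetical 3 x 3 example: patient effects 0/10/20; doctor 3 scores 3 units high
patient_of = [[(d + k) % 3 for k in range(3)] for d in range(3)]
doctor_offset = [0, 0, 3]
scores = [[10 * patient_of[d][k] + doctor_offset[d] for k in range(3)]
          for d in range(3)]
print(round(reliability_coefficient(scores, patient_of), 3))
```

In this toy example nearly all variation comes from true differences between patients, so the doctor share is small (18 of 618 units of the total sum of squares) and the coefficient is close to 1.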

For the 7 signs in which it was necessary to change from an ordinal to a dichotomous scale after standardization, reliability coefficients were calculated for both the pre- and poststandardization data. Although the change in scale means that differences between pre- and poststandardization values must be interpreted with caution, it was believed that it would be more appropriate to compare 2 reliability coefficients than a reliability coefficient and a PABAK.

For all physical signs/techniques, an A–D grading system was applied as follows: grade A = reliability adequate on both pre- and poststandardization examinations (i.e., PABAK >0.60 or Rc >0.80); grade B = reliability adequate only on poststandardization examination; grade C = reliability adequate on pre- but not poststandardization examination; and grade D = reliability inadequate on both pre- and poststandardization examinations. Since the poststandardization reliability is of primary interest, a higher grade (A or B) indicates greater reliability and usefulness of the physical sign, with grade A being the most desirable since adequate reliability is present without standardization.
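The grading rule can be written as a short function. This is a sketch with hypothetical names; the cutoffs are those stated in the text (PABAK >0.60, Rc >0.80), and the example values are taken from Tables 2 and 3 except for the grade C case, which is invented because no such row is shown.

```python
CUTOFFS = {"PABAK": 0.60, "Rc": 0.80}  # adequate reliability means strictly above


def is_adequate(value, statistic):
    """True if the value exceeds the a priori cutoff for its statistic."""
    return value > CUTOFFS[statistic]


def grade(pre, post, statistic):
    """Assign grade A-D from pre- and poststandardization reliability."""
    pre_ok = is_adequate(pre, statistic)
    post_ok = is_adequate(post, statistic)
    if pre_ok and post_ok:
        return "A"   # adequate before and after standardization
    if post_ok:
        return "B"   # adequate only after standardization
    if pre_ok:
        return "C"   # adequate before but not after standardization
    return "D"       # inadequate on both occasions


print(grade(0.78, 0.66, "PABAK"))  # popliteal cyst
print(grade(0.74, 0.90, "Rc"))     # anserine bursa tenderness
print(grade(0.60, 0.54, "PABAK"))  # drawer test row graded D in Table 2
print(grade(0.85, 0.78, "Rc"))     # hypothetical values, not from the study
```

Because the cutoffs are strict inequalities, a prestandardization PABAK of exactly 0.60 does not count as adequate, which is consistent with the drawer test row graded D in Table 2.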


Patient characteristics.

Three male and 3 female patients participated in the study. Their median age was 62 years (range 44–74 years), median duration of knee pain was 8 years (range 3–20 years), median WOMAC score for pain on walking was 4 mm (range 0–25 mm), and median body mass index was 23.9 kg/m² (range 22.4–26.7). Three of the patients underwent examination of the right knee, and 3 underwent examination of the left knee. The radiographic severity of OA was Kellgren/Lawrence grade 2, grade 3, and grade 4 in 2 patients each.

Pre- and poststandardization agreement.

The standardization meeting resulted in changes to the scoring in some of the examination domain items. All 6 items evaluating pain or tenderness were changed from a 4-point scale (none, mild, winces, withdraws) to a 2-point scale (absent, present), mainly because the 2 extreme descriptions of “winces” and “withdraws” were rarely elicited. Similarly, for the effusion bulge sign, the 4-point scale (none, mild, moderate, severe) was replaced by a 2-point scale (absent, present) in the poststandardization examinations. Scoring of bony swelling was changed from a 3-point scale (none, mild, severe) to a 4-point scale (none, mild, moderate, severe). The 3-point scale for the 11 crepitus signs was not changed, although the scale description was modified from “none, mild, severe” to “none, fine, coarse,” which was considered to more accurately reflect the nature of the abnormalities being detected. The scoring and description of all other signs/techniques remained the same. In addition, it was decided to score any marginal findings as normal or absent.

The results for all physical examinations/signs within each domain are summarized in Figure 1, which shows the poststandardization reliability, and Figure 2, which shows the effect of standardization. Overall, adequate reliability was achieved for 30 (71%) of 42 physical signs/techniques, with a grade A or B following standardization. Although most of the other signs/techniques were close to the cutoff value for reliability, a few, including warmth, medial instability at 30° flexion, and lateral instability at 30° flexion, were well below the cutoff and thus were highly unreliable even after standardization (Figure 1). The process of standardization resulted in substantial improvements in reliability for many of the physical examination signs, although for some, standardization had minimal or no effect (Figure 2).

Figure 2.

Effect of standardization of examination for 41 physical signs/techniques. Improvement and worsening of reliability as a result of standardization are indicated by symbols positioned either above or below the diagonal line, respectively. The degree of improvement or worsening is reflected by the vertical distance from the diagonal. Those signs/techniques assessed for reliability by prevalence-adjusted bias-adjusted kappas (PABAK) are denoted by lower case letters; the PABAK was considered acceptable if >0.60. Those signs/techniques assessed with reliability coefficients are denoted by upper case letters and numbers; the reliability coefficient was considered acceptable if >0.80. The effusion–balloon test was evaluated poststandardization only. A = active; P = passive; TF = tibiofemoral; PS = passive with stress; PF = patellofemoral.

Detailed results on the 12 signs with dichotomous scales are shown in Table 2, which lists the PABAK values, prevalence, and bias for both pre- and poststandardization as well as the letter grade (A–D). One dichotomous item (effusion by balloon test) was added during the standardization meeting, and therefore only poststandardization data are reported for this item. Table 3 shows detailed results on the 29 signs with continuous or ordinal scales, listing the percentage variance due to patient, doctor, order, and error, and the reliability coefficients for both pre- and poststandardization as well as the letter grade (A–D).

Table 2. Pre- and poststandardization prevalence-adjusted bias-adjusted kappa (PABAK) for dichotomous physical examination signs
Physical sign/technique | Scale | Prestandardization (PABAK, prevalence, bias) | Poststandardization (PABAK, prevalence, bias) | Grade*

  • * For a description of the grading system of reliability (grades A–D), see Patients and Methods.
  • Variable only evaluated poststandardization.

 Effusion–balloon test | Present/absent | | 0.88, 0.19, 0.06 | A/B
 Effusion–patellar tap | Present/absent | 0.88, 0.19, 0.06 | 0.78, 0.17, 0.11 | A
 Popliteal cyst | Present/absent | 0.78, 0.06, 0.09 | 0.66, 0.08, 0.14 | A
  0° flexion | Normal/abnormal | 0.56, 0.11, 0.20 | 0.88, 0.03, 0.06 | B
  30° flexion | Normal/abnormal | 0.08, 0.42, 0.21 | 0.34, 0.28, 0.27 | D
  0° flexion | Normal/abnormal | 0.48, 0.14, 0.14 | 0.66, 0.08, 0.14 | B
  30° flexion | Normal/abnormal | 0.02, 0.39, 0.31 | 0.02, 0.50, 0.18 | D
  Drawer test | Normal/abnormal | 0.60, 0.28, 0.16 | 0.54, 0.19, 0.17 | D
  Drawer test | Normal/abnormal | 0.82, 0.11, 0.09 | 0.82, 0.11, 0.09 | A
Range of motion

Table 3. Pre- and poststandardization components of variance and reliability coefficients for continuous and ordinal physical examination signs
Physical sign/technique | Scale (poststandardization) | % of variance, prestandardization (patient, doctor, order, error) | % of variance, poststandardization (patient, doctor, order, error) | Reliability coefficient (pre, post) | Grade*

  • * For a description of the grading system of reliability (grades A–D), see Patients and Methods.
  • Prestandardization scale = none, mild, severe.
  • Prestandardization scale = none, mild, moderate, severe.

 Inspection | Normal, varus, valgus | 0.79, 0.
 Intercondylar distance | Centimeters | 0.86, 0.
 Intermalleolar distance | Centimeters | 0.45, 0.22, 0.03, 0.30 | 0.59, 0.
Bony swelling
 Palpation | None, mild, moderate, severe | 0.54, 0.
  Active | None, fine, coarse | 0.54, 0.
  Passive | None, fine, coarse | 0.56, 0.
 Lateral tibiofemoral
  Active | None, fine, coarse | 0.
  Passive | None, fine, coarse | 0.
  Passive with stress | None, fine, coarse | 0.
 Medial tibiofemoral
  Active | None, fine, coarse | 0.41, 0.17, 0.07, 0.34 | 0.49, 0.
  Passive | None, fine, coarse | 0.35, 0.25, 0.07, 0.33 | 0.34, 0.22, 0.06, 0.39 | 0.75, 0.78 | D
  Passive with stress | None, fine, coarse | 0.
  Active | None, fine, coarse | 0.57, 0.
  Passive | None, fine, coarse | 0.69, 0.
  Passive with stress | None, fine, coarse | 0.
 Effusion–bulge sign | Present, absent | 0.32, 0.18, 0.03, 0.47 | 0.
Muscle strength
 Hamstring strength | Poor, moderate, full | 0.
 Quadriceps strength | Poor, moderate, full | 0.
 Extension lag | None, mild, moderate, severe | 0.53, 0.
 Quadriceps atrophy | None, mild, severe | 0.62, 0.
 Lateral tibiofemoral tenderness | Present, absent | 0.50, 0.15, 0.04, 0.31 | 0.50, 0.15, 0.02, 0.33 | 0.85, 0.85 | A
 Medial tibiofemoral tenderness | Present, absent | 0.33, 0.17, 0.06, 0.44 | 0.44, 0.06, 0.14, 0.36 | 0.83, 0.94 | A
 Patellofemoral tenderness by grind test | Present, absent | 0.70, 0.
 Anserine bursa tenderness | Present, absent | 0.31, 0.26, 0.10, 0.33 | 0.45, 0.10, 0.03, 0.42 | 0.74, 0.90 | B
 Patellar tendon tenderness | Present, absent | 0.34, 0.16, 0.16, 0.34 | 0.67, 0.
 End-of-range stress pain | Present, absent | 0.34, 0.11, 0.06, 0.48 | 0.40, 0.13, 0.07, 0.40 | 0.89, 0.87 | A
Range of motion
 Flexion range of motion | Degrees | 0.55, 0.
 Flexion contracture | Degrees | 0.58, 0.

Results by domain.

Alignment.

With regard to assessment of alignment, 3 of the 4 physical signs/techniques (inspection, goniometer, and intercondylar distance) were of adequate reliability (all grade A), whereas intermalleolar distance was not reliable even following standardization (grade D) (Table 3 and Figures 1 and 2). Alignment measured by goniometer was most reliable (Rc = 0.99). However, simple inspection for varus, valgus, or normal alignment also achieved a reliability coefficient of 0.94 poststandardization; because of its simplicity, this technique could also be useful in future research.

Bony swelling.

The assessment for bony swelling, which is a key component of the ACR clinical diagnostic criteria for OA (10), was found to be highly reliable (grade A) in contrast to that observed in previous studies (11, 12, 15). Even with a change from a 3-point to a 4-point scale, reliability further improved after standardization, achieving a reliability coefficient of 0.97 (Table 3 and Figures 1 and 2).

Crepitus.

Crepitus was assessed as general and compartment-specific crepitus, and was assessed using active, passive, and passive with stress movement. The latter evaluation has been suggested to correlate with early arthroscopic findings of cartilage damage (9). The findings for reliability were inconsistent. Although general crepitus was reliably assessed using passive movement (Rc = 0.96), assessment of general crepitus with active movement did not achieve adequate reliability (Rc = 0.67). For compartment-specific crepitus, adequate reliability was present for the lateral compartment only on passive movement (Rc = 0.91), and for the medial and patellofemoral compartments, adequate reliability was achieved only on passive with stress movement (Rc = 0.94 and Rc = 0.87, respectively). All other crepitus evaluations achieved a grade C or D, although it should be noted that most of the poststandardization reliability coefficients were close to the cutoff value of 0.80 (Table 3 and Figures 1 and 2).

Gait.

Gait was assessed by simple inspection and was found to be reliable, with a poststandardization PABAK of 0.78. However, standardization was required to achieve adequate reliability (grade B) (Table 2 and Figures 1 and 2).

Inflammation.

Signs of inflammation included joint effusion, popliteal cyst, and warmth. Effusion was assessed by bulge sign, balloon test, and patellar tap. Of these, the bulge sign was most reliable (Rc = 0.97) (Table 3). However, assessment of effusion by balloon test also achieved a poststandardization PABAK of 0.88, although it is uncertain whether standardization was required for such an achievement, since this sign was only assessed poststandardization (Table 2). The examination for popliteal cyst was also reliable (grade A). However, the PABAK for assessment of popliteal cyst unexpectedly decreased from 0.78 on prestandardization to 0.66 on poststandardization (Table 2). Agreement on assessment of warmth was low and remained low despite standardization, with a poststandardization PABAK of 0.14 (grade D) (Table 2 and Figures 1 and 2).

Instability.

Lateral and medial instability were assessed reliably at 0° of flexion after standardization (both grade B), although the prevalence of a positive finding for these 2 items was very low, and therefore these results must be interpreted with caution. In contrast, at 30° of flexion, agreement on assessment of both lateral and medial instability was poor (both grade D), particularly for medial instability, which achieved a poststandardization PABAK of only 0.02. It is also of interest that the bias for both of these signs was high on both pre- and poststandardization examinations, suggesting that, overall, the rheumatologists' bias for finding instability was not altered by the standardization process (Table 2). Assessment of posterior instability by posterior drawer test and posterior sag was reliable (grade A), but also had a low prevalence, whereas anterior instability assessment by anterior drawer test was found to be unreliable (grade D) (Table 2).

Muscle strength.

Good agreement was achieved for all assessments of muscle strength, with reliability coefficients of 0.86, 0.86, 0.88, and 0.97 for quadriceps strength, hamstring strength, extension lag, and quadriceps atrophy, respectively. All signs achieved grade A, except for extension lag, which achieved grade B (Table 3 and Figure 1).

Tenderness/pain.

Good reliability was achieved for assessment of all signs of articular and periarticular tenderness/pain (grade A or B). Although the change in scoring from a 4-point to a 2-point scale may have improved the agreement among rheumatologists, good reliability was already present prior to standardization for most tenderness/pain signs (Table 3 and Figures 1 and 2).

Range of motion.

Assessment of range of motion of the knee was subdivided into flexion, flexion contracture, and hyperextension. All 3 assessments were of adequate reliability (grade A, A, and B, respectively) (Tables 2 and 3). The low PABAK and high bias for the examination of hyperextension before standardization were related to a difficulty in the interpretation of the scale, which was coded as absent (0) or present (1). Some rheumatologists interpreted the absence of hyperextension as abnormal, and therefore coded their finding as 1 instead of 0. Poststandardization, this coding error was eliminated by changing the scale to normal/abnormal. This resulted in an improved PABAK of 0.88 and a much lower bias of 0.06. However, it should also be noted that the poststandardization prevalence was very low, such that this PABAK needs to be interpreted with caution (Table 2 and Figures 1 and 2).


The principal elements of the knee examination include evaluations for alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The availability of reliable physical examination signs from within each of these domains is crucial for the ability to assess the knee joint comprehensively in future outcome studies of OA.

With regard to examinations for crepitus, our study findings were variable. Compartment-specific crepitus was not assessed with consistent reliability by using active or passive movement. Passive movement with stress was reliable only for the medial tibiofemoral and the patellofemoral compartments and may be more difficult to implement in clinical research, since it is not usual practice, is not easily performed, and is more time consuming. Therefore, given that passive crepitus is generally assessed in clinical practice, and since the reliability coefficients were either acceptable or close to the cutoff of 0.80, this technique would seem most feasible for use in future studies, if compartment-specific crepitus is of importance. For general (non–compartment-specific) crepitus, the assessment was most reliable using passive movement.

With regard to alignment, inflammation, and muscle strength, several interchangeable signs were evaluated in this study, with at least 1, and frequently more than 1, physical sign achieving good reliability. This allows for selection of appropriate physical examination signs in future studies on the basis of not only reliability, but also suitability and preference. On the other hand, for some domains, such as instability, tenderness/pain, and range of motion, the individual physical signs evaluated represent different dimensions of these domains and are therefore not interchangeable. With regard to instability, we were only able to reliably assess posterior instability. The reliability of medial and lateral instability, which depends on adequate assessments at both 0° and 30° of flexion, was not established. As a result, the assessment of instability as a whole was found to be unreliable and may need to be investigated in further studies. In contrast, the assessments for tenderness/pain and range of motion were found to be reliable, since all physical signs within those groups were deemed to be reliable.

Overall, this study showed that the majority of physical signs can be assessed reliably. Furthermore, the effect of standardization was to improve reliability for most of the signs/techniques. However, for some physical examinations, there was a decrease in reliability following standardization. For most of these, a decrease of less than 0.05 in the reliability coefficient or PABAK was seen. Such small changes, in either direction, may not be clinically important and are likely due to random error resulting from the dynamic interactions that occur within and between subjects as well as within and between assessors. For a few signs, a greater decrease in the reliability coefficient or PABAK was observed. This is likely because not all physical measures are equally responsive to a simple standardization procedure; some may require more intense or repeated training. In particular, a conflict between what assessors normally do and what they are obligated to do on the basis of imposed study requirements is likely to influence reliability. In this study, the standardization meeting included demonstration of physical technique for the purpose of reaching an agreement on standardization, but no extensive assessor training was undertaken, and this may have adversely affected the reliability of some physical examination findings. This possibility needs to be considered in future studies.

Only 3 signs were clearly unreliable for the physical examination of knee OA and were not remedied by standardization. These were warmth, lateral instability at 30° flexion, and medial instability at 30° flexion. It is not surprising that the latter 2 were unreliable, since some mediolateral movement is invariably present at 30° of flexion and the decision of what degree of movement constitutes instability is subjective and difficult to standardize. It is interesting to note that standardization did achieve substantial improvement in agreement for lateral instability at 30° flexion, but not for medial instability at 30° flexion. This discrepancy is difficult to explain. It is possible that mediolateral instability represents a single inseparable sign, which may achieve adequate reliability, if examined as such. This will need to be explored in future studies.

As with instability, we found poor reliability for the assessment of warmth and were unable to standardize its assessment. Because a finding of warmth could be affected by repeated joint examinations, we evaluated whether the order of examinations had an effect, that is, whether later examinations were more often associated with findings of warmth. No effect of order was found; therefore, the poor reliability of warmth was likely due to the highly subjective nature of its assessment.

The interpretation of our results also requires an understanding of the inherent limitation of dichotomizing a continuum. The cutoff values of PABAK and reliability coefficients, although sensible, were arbitrary. The adequacy or inadequacy of reliability is not clearly discriminated by values that fall immediately above or below the cutoff points. Thus, our findings of adequate reliability may need to be interpreted more or less strictly, depending on the application of these results. Particularly with regard to the PABAK, which has been utilized in few studies, the appropriateness of a cutoff value of 0.60 is uncertain. However, given that a conventional kappa greater than 0.60 is interpreted as at least substantial agreement (21), and given that the PABAK provides more reliable values for an index of agreement than does the conventional kappa, we think that such a cutoff is indeed appropriate for the purpose of our study, which attempted to identify physical signs that are highly reliable for use in future OA studies.
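To make the 0.60 cutoff concrete, the following is a minimal sketch of how a PABAK can be computed for a dichotomous sign. It assumes the common two-category form, PABAK = 2 × Po − 1, with the observed agreement Po taken as the mean pairwise agreement across assessors; the exact multi-assessor computation used in this study is not restated here, so this averaging step is an assumption. The function name `pabak`, the constant `ADEQUATE_PABAK`, and the example ratings are hypothetical and are not study data.

```python
from itertools import combinations

# Cutoff for adequate reliability of dichotomous signs, as stated in the text.
ADEQUATE_PABAK = 0.60


def pabak(ratings):
    """Prevalence-adjusted bias-adjusted kappa for a dichotomous sign.

    ratings: one list per patient, each holding a 0/1 call per assessor.
    Assumption: observed agreement is averaged over all assessor pairs.
    """
    n_assessors = len(ratings[0])
    pairs = list(combinations(range(n_assessors), 2))
    agreements = 0
    comparisons = 0
    for patient_calls in ratings:
        for i, j in pairs:
            agreements += int(patient_calls[i] == patient_calls[j])
            comparisons += 1
    observed_agreement = agreements / comparisons
    # With two categories, chance agreement is fixed at 0.5,
    # so kappa reduces to 2 * Po - 1.
    return 2 * observed_agreement - 1


# Hypothetical calls from 3 assessors on 4 patients (illustration only).
example_ratings = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
example_value = pabak(example_ratings)
is_adequate = example_value > ADEQUATE_PABAK
```

Because the PABAK is a linear function of observed agreement, a cutoff of 0.60 corresponds to an observed agreement of 0.80, which is one way to judge whether a value just above or below the cutoff differs meaningfully.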

In addition, the magnitude of these statistical values depends very much on the statistical methods being applied. The ANOVA-generated coefficients are characterized by an interplay between the error due to doctors and the error due to patients and residuals, such that these latter 2 sources of variation can have a profound influence on the magnitude of the error due to doctors and thus the reliability coefficient.
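The interplay described above can be illustrated with a small sketch. It assumes a two-way layout (patients × doctors, one score per cell) and the standard random-effects variance-component form Rc = σ²(patients) / [σ²(patients) + σ²(doctors) + σ²(residual)]; the exact formula used in this study, and the examination-order factor of the Latin square, are not modeled here. The function name `reliability_coefficient` and the example scores are hypothetical.

```python
def reliability_coefficient(scores):
    """ANOVA-based reliability coefficient from a patients x doctors table.

    scores: one list per patient, each holding one score per doctor.
    Assumption: two-way layout without replication; negative variance
    estimates are truncated at zero.
    """
    n = len(scores)      # number of patients
    k = len(scores[0])   # number of doctors
    grand = sum(sum(row) for row in scores) / (n * k)
    patient_means = [sum(row) / k for row in scores]
    doctor_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for patients, doctors, and the residual.
    ss_patients = k * sum((m - grand) ** 2 for m in patient_means)
    ss_doctors = n * sum((m - grand) ** 2 for m in doctor_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_patients - ss_doctors

    ms_patients = ss_patients / (n - 1)
    ms_doctors = ss_doctors / (k - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))

    # Variance components estimated from the expected mean squares.
    var_patients = max((ms_patients - ms_residual) / k, 0.0)
    var_doctors = max((ms_doctors - ms_residual) / n, 0.0)
    var_residual = ms_residual

    total = var_patients + var_doctors + var_residual
    return var_patients / total if total > 0 else float("nan")


# Hypothetical scores: doctors agree exactly on every patient.
perfect = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Same patient spread, but each doctor is systematically offset.
biased = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

rc_perfect = reliability_coefficient(perfect)
rc_biased = reliability_coefficient(biased)
```

In this sketch the patient variance sits in both the numerator and the denominator, so a sample with little between-patient spread yields a low coefficient even when doctors differ only slightly, which is consistent with the dependence on patient and residual variation noted above.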

The small sample of both patients and assessors may also be seen as a limitation. However, because of potential patient and assessor fatigue with repeated examinations, this type of work, by necessity, involves small samples. Furthermore, the selection of patients was carried out in such a way that they were representative of patients with mild to severe radiographic OA with a range of physical examination findings, and thus they were the kind of patients typically seen in clinical practice and in research. More importantly, the prevalence of a positive finding was adequate for the majority of physical signs, thereby allowing for an appropriate assessment of reliability. For those signs in which the prevalence was low, increasing the number of subjects may improve the assessment of reliability.

Assessors were selected based on their expertise in OA and in clinical assessments, and therefore the study results may not be generalizable to other rheumatologists. However, the application of the standardized techniques developed in this study will likely prove useful to further evaluate the reliability of the knee examination in other OA studies.

Finally, the intraobserver reliability was not assessed in this study, because doing so would have required more repetitions. Since our primary aim was to evaluate interobserver reliability, we designed the study to minimize the repetitions of knee examinations in order to avoid patient and examiner fatigue and, in particular, to avoid reinforcement of memory, which could potentially bias the findings for interobserver reliability. In addition, the long-term stability of interobserver agreement was not assessed. As a result, it is uncertain whether the improvement in reliability achieved during the standardization study is maintained over time. This is an important consideration for clinical trials and other OA studies, since long-term followup is often required. Further studies are necessary to evaluate long-term reliability of the physical examination and the frequency at which assessor training needs to be carried out in order to reliably perform the knee examination in OA.

Despite these potential limitations, the following key findings can be summarized. The majority of physical examinations can be performed reliably even without standardization (grade A signs). Even with highly reliable signs/techniques, standardization can further improve the reliability. Some physical examinations require assessor training and should not be used otherwise. The examination techniques with the highest reliability coefficients or PABAKs will likely be of most value in clinical research and possibly in the evaluation of early OA, in which more subtle findings are expected.

The most reliable physical examinations of knee OA are listed in Table 4 and include alignment by goniometer, bony swelling, general passive crepitus, gait, effusion bulge sign, quadriceps atrophy, medial and lateral tibiofemoral tenderness, patellofemoral tenderness by grind test, and flexion contracture. If knee examination techniques are to be included in future studies of OA, the inclusion of these key signs will be important and will allow for reliable and therefore improved outcome assessments.

Table 4. Summary of poststandardization values for the most reliable physical examination techniques in each domain

Domain           Physical examination sign                Reliability
Alignment        Alignment by goniometer                  0.99*
Bony swelling    Palpation                                0.97*
Crepitus         General passive crepitus                 0.96*
Inflammation     Effusion bulge sign                      0.97*
Muscle strength  Quadriceps atrophy                       0.97*
Tenderness/pain  Medial tibiofemoral tenderness           0.94*
Tenderness/pain  Lateral tibiofemoral tenderness          0.85*
Tenderness/pain  Patellofemoral tenderness by grind test  0.94*
Range of motion  Flexion contracture                      0.95*

* By reliability coefficient.
† By prevalence-adjusted bias-adjusted kappa.


We would like to thank our patient volunteers for their participation in this research study.