The reliability of ultrasonographic examination in rheumatology is a matter of ongoing debate. In an article published recently in Arthritis Care & Research, Cheung et al systematically reviewed the reliability of B-mode and power Doppler (PD) ultrasonography (US) for detecting synovitis in rheumatoid arthritis (RA) across 35 studies comprising 1,415 patients, and reported high interobserver and intraobserver reading reliability, especially for PD (1). However, the "reliability of the measures of reliability" is not beyond question.
The intraclass correlation coefficient (ICC), among other measures, has been used to assess reliability. Two challenges exist in the use and interpretation of the ICC. First, the ICC is highly dependent on the heterogeneity of the study sample, and as a consequence is generalizable only to samples with a similar variation (2). The ICC is basically a signal-to-noise ratio. This may be difficult to comprehend conceptually, but it can be clarified by looking at the following equation:
ICC = Variance (patients) / [Variance (patients) + Variance (observers) + Variance (error)]
This equation shows that the heterogeneity of the patients under investigation largely determines the value of the ICC. When the variance between patients, Variance (patients), is low, the ICC is likely to be low as well, and vice versa. This also applies to rheumatologic US: when only a few of the joints under investigation show signs of synovitis, meaning a low variance between patients, the ICC will probably also be low, largely independent of the level of variance between observers. The value of the ICC therefore does not necessarily provide a reliable expression of reliability, nor of the variance between observers or the variance due to error.
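The dependence on sample heterogeneity can be illustrated with a small simulation, which is our own sketch and not part of the reviewed studies: two observers score patients with the same measurement error, but in one sample the patients vary widely and in the other they are nearly homogeneous. The function and parameter names (`icc_oneway`, `simulate`, `patient_sd`, `error_sd`) are illustrative choices.

```python
# Illustrative simulation (not from the article): identical observer error
# yields very different ICC values depending on between-patient variance.
import numpy as np

rng = np.random.default_rng(0)

def icc_oneway(scores):
    """One-way random-effects ICC from an n_patients x n_raters array."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    # Mean square between patients (the "signal")
    msb = k * ((row_means - grand) ** 2).sum() / (n - 1)
    # Mean square within patients (the "noise")
    msw = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def simulate(patient_sd, error_sd=5.0, n=200, k=2):
    """Two raters scoring n patients with given between-patient spread."""
    true = rng.normal(50, patient_sd, size=n)
    return true[:, None] + rng.normal(0, error_sd, size=(n, k))

print(icc_oneway(simulate(patient_sd=20)))  # heterogeneous sample: high ICC
print(icc_oneway(simulate(patient_sd=3)))   # homogeneous sample: low ICC
```

With identical rater error (error_sd = 5) the heterogeneous sample yields an ICC near 0.9, while the homogeneous sample yields an ICC well below 0.5, mirroring the situation in which few joints show synovitis.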
The second issue regarding the ICC is that its value depends on which ICC equation is used. For different study designs there are several different ICC equations (3, 4). The most important distinction to be made is between the ICC for agreement and the ICC for consistency, which are based on different equations (2, 4). It is possible that observers fully agree on ranking patients into a low, intermediate, or high level of pathology according to the assessed scores, resulting in a high ICC for consistency among observers. However, this does not necessarily mean that the observers also reach agreement in the raw values of the scores given, which is the basis of the ICC for agreement. This is illustrated in Table 1. Two observers are asked to score the disease activity of 3 patients on a scale of 0–100. This situation would give an "almost perfect reliability" when calculating the ICC for consistency, since both observers rank the 3 patients in the same order by disease activity score. However, when calculating the ICC for agreement, the result will be "poor reliability," since the scores given by the two observers clearly differ, by ∼10-fold.
| Patient | Observer A | Observer B |
|---------|------------|------------|
In studies using the ICC, the extent of heterogeneity within the study population should be analyzed and described, as heterogeneity clearly influences the ICC. An example of how this can be done is described in an Outcome Measures in Rheumatology Clinical Trials article on magnetic resonance imaging (5). Furthermore, authors using an ICC should report which method was used to calculate it and the rationale for that choice. In this way, readers can better appreciate the reported reliability. Of the articles reviewed by Cheung et al, only some state which ICC formula was used (6), and a measure of heterogeneity, as a determinant of the ICC, is not reported in any of them. These issues strike at the root of the robustness of the review by Cheung and colleagues and should, in our opinion, have been acknowledged and discussed in their article.