To perform a systematic review on the reliability of ultrasonographic (US) synovitis detection in rheumatoid arthritis (RA) by B-mode and power Doppler (PD) in image acquisition and still-image interpretation. US is a sensitive method for synovitis detection. However, reliability is still a key concern.
Articles reporting any US reliability results for synovitis in RA in PubMed, EMBase, the Cochrane Library, and meeting abstracts were selected. Data were extracted on US synovitis detection (scored either qualitatively [binary] or semiquantitatively [0–3 scale]), for intraobserver and interobserver reliability in B-mode and PD, and for image acquisition and still-image interpretation. The type of joints tested, the experience of the ultrasonographer, and the quality of the studies were assessed. Data analysis involved descriptive and graphic interpretation of reliability and its potential determinants.
Thirty-five studies (12 for B-mode, 11 for PD, and 12 for both) with a total of 1,415 patients were analyzed. Intraobserver and interobserver reliability for still images in B-mode and PD was high (κ = 0.5–1.0 [14,991 joints] for intraobserver reliability for B-mode, κ = 0.59–1.0 [14,934 joints] for PD, κ = 0.49–1.0 [3,138 joints] for interobserver for B-mode, and κ = 0.66–1.0 [3,325 joints] for PD). Interobserver reliability for image acquisition in both US modes was lower than for still-image interpretation (κ = 0.22–0.95). Few studies reported intraobserver image acquisition reliability.
Intraobserver and interobserver reliability of still-image interpretation was high, especially for PD, in published studies involving highly trained observers. However, reliability of acquisition of US should be further assessed.
Rheumatoid arthritis (RA) is characterized by synovitis, a process of joint inflammation that leads to joint destruction and, ultimately, physical disability. The extent of this inflammatory process has been shown to predict the development of bone erosions (1). Therefore, early detection and treatment of synovitis is desirable since it reduces radiographic progression (2).
The gold standard for synovitis detection has traditionally been the clinical assessment of swollen joints by physicians. However, swollen joint counts are poorly reproducible (3), with interobserver reliability estimated as poor to good (intraclass correlation coefficients [ICCs] 0.14–0.74) (4). New imaging modalities such as ultrasonography (US) are a focus of attention since they appear to be more sensitive than clinical examination at synovitis detection (5, 6). For example, in a study of patients with oligoarthritis, almost two-thirds of the patients had evidence of subclinical disease and one-third could be reclassified as having polyarticular disease by using US (6).
US can evaluate synovitis at the anatomic and vascular level. The B-mode setting enables visualization of synovial hypertrophy and effusion, while the power Doppler (PD) setting allows visualization of the movement of blood within vessels, thereby detecting increased microvascular blood flow in synovitis. However, for US to be used as a validated modality for synovitis assessment, it must satisfy the Outcome Measures in Rheumatology Clinical Trials (OMERACT) filter, which states that an outcome measure must be truthful, feasible, and discriminatory (7). The last of these includes reliability, a feature that has been a key concern for US. Large gaps in reliability testing of US in synovitis were noted by the OMERACT US interest group (8), especially in image acquisition (when the observer is responsible for acquiring and interpreting the image). This is important in US because it is highly operator dependent. Image reading reliability, including interpretation of still images (reading and interpretation of images already acquired), is also problematic due to the lack of standardized synovitis grading. Studies have attempted to address the question of reliability; however, the data appear conflicting (9–11).
The purpose of this study was to obtain an overall view of the reliability of US detection of synovitis in RA by B-mode and PD in both image acquisition and still-image interpretation, within and between observers, through a systematic literature review.
MATERIALS AND METHODS
Inclusion criteria were articles reporting any reliability result for detection of synovitis by US in RA patients. A systematic search of the literature, limited to humans, adults, and the English language, was performed in the PubMed, EMBase, and Cochrane Library databases on March 31, 2009 and then updated on June 30, 2009. The following exploded medical subject heading (MeSH) terms were used in PubMed: arthritis rheumatoid and ultrasonography and either synovial membrane or synovitis or joints or reproducibility of results or observer variation. The following terms were used in EMBase: rheumatoid arthritis/exp and ultrasonography (Figure 1). Meeting abstracts were included from the meetings of the European League Against Rheumatism from 2007 to 2009 and from the American College of Rheumatology from 2007 to 2008. A hand search of references was also performed.
Reviews, editorials, and comments were excluded. Letters to the editor were included only if there were sufficient data (i.e., any numerical result for reliability). The selection process was initially based on the title and abstract of the article, then on full texts.
One investigator (PPC) selected the articles using the MeSH terms described. These terms were formulated with a university librarian. Data were obtained using a predetermined form on year of publication, study design, US mode (i.e., B-mode, 2-dimensional [2-D] PD, 3-dimensional [3-D] PD, contrast-enhanced PD), as well as machine settings. The number of ultrasonographers, their level of experience, and the type and number of joints involved in reliability testing were recorded. Demographic data such as sex, age, disease duration, and rheumatoid factor status were collected.
Intraobserver (within the operator) and interobserver (between operators) reliability was grouped either into image acquisition or still-image interpretation. The methodology of reliability testing was assessed and the time interval for retesting for intraobserver reliability was noted. The various ways of expressing reliability included the kappa statistic, ICC, coefficient of variation (CV), overall agreement (percentage of observed exact agreements), or Kendall's W coefficient. If values were given only on reliability of each joint or joint group, the metacarpophalangeal (MCP) joints, proximal interphalangeal (PIP) joints, and wrists were included in the final results since they were more relevant for RA patients.
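As an illustration of two of the simpler statistics listed above, the following minimal sketch computes overall agreement and a coefficient of variation. It assumes that the CV is the sample standard deviation divided by the mean, expressed as a percentage; individual studies may have used other definitions. The function names are hypothetical and do not come from any of the reviewed studies.

```python
import statistics


def overall_agreement(scores_a, scores_b):
    """Percentage of joints on which two readings agree exactly."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both readings must cover the same non-empty set of joints")
    exact = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * exact / len(scores_a)


def coefficient_of_variation(measurements):
    """CV in percent, assumed here to be sample SD / mean * 100."""
    mean = statistics.mean(measurements)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(measurements) / mean
```

Overall agreement does not correct for agreement expected by chance, which is one reason the kappa statistic is usually preferred when both are available.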
When a study provided several different statistics, the kappa statistic was the one considered. Kappa measures agreement between pairs of observers, eliminating random concordance. Kappa values of <0.40 reflect poor agreement, values of 0.40–0.75 reflect fair to good agreement, and values of >0.75 reflect excellent agreement (12). An ICC >0.8 is also considered to be good agreement.
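The kappa statistic and the interpretation thresholds above can be sketched as follows. This is a minimal illustration that assumes the unweighted (Cohen) kappa for 2 observers; studies using semiquantitative grades may instead have reported a weighted kappa, which is not shown here. The function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two observers scoring the same joints."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both observers must score the same non-empty set of joints")
    n = len(rater_a)
    # Observed agreement: proportion of joints given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each observer's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


def interpret_kappa(kappa):
    """Apply the cutoffs used in this review (<0.40, 0.40-0.75, >0.75)."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "fair to good"
    return "excellent"
```

For example, if 2 observers give binary scores to 10 joints, agree on 8 of them, and each scores 5 joints as positive, then observed agreement is 0.8, chance agreement is 0.5, and kappa is (0.8 − 0.5)/(1 − 0.5) = 0.6, which falls in the fair to good range.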
Quality of studies.
Due to the lack of a standardized tool for assessment of quality in nonrandomized prospective studies on reliability, the authors (PPC and LG) developed a set of 6 predefined criteria. These criteria were based on concepts from reviews of quality assessment tools used in systematic reviews of observational studies (13), with consideration of what would be important in a well-conducted reliability study on US. The predefined criteria were the following, each assessed in a binary mode (yes/no): 1) Was the recruitment of patients well-defined in the methods section of the publication? 2) Was calculation of the sample size mentioned (i.e., were there attempts at estimating the number of joints or patients needed to achieve adequate power)? 3) Was there a description of US scanning technique (settings used, type of machine, and protocol for scanning)? 4) Was there a description of attempted blinding of observers? 5) Was there a description of synovitis scoring (i.e., was the reliability testing on a binary score [yes/no] for synovitis or a semiquantitative score using levels of grading for degree of synovitis, and which source was this scoring based on)? 6) Were statistics adequately explained and results completely given? Quality was reported on a scale of 0–6, with higher results indicating higher quality.
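The 6-item checklist can be represented as a simple sum of yes/no answers. The sketch below assumes equal weighting of the criteria, which is what a 0–6 scale implies; the class and field names are hypothetical labels for the 6 questions and are not taken from the original data extraction form.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityChecklist:
    """One yes/no answer per predefined criterion (names are illustrative)."""
    recruitment_defined: bool
    sample_size_calculated: bool
    scanning_technique_described: bool
    observer_blinding_described: bool
    synovitis_scoring_described: bool
    statistics_adequately_reported: bool

    def score(self) -> int:
        # Each criterion met adds 1 point, giving a total between 0 and 6.
        return sum(bool(getattr(self, f.name)) for f in fields(self))
```

Under this scheme, a study that meets every criterion except the sample size calculation would score 5.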
The specific method of synovitis scoring for B-mode and PD was recorded as 1) binary (yes/no) for presence of synovial hypertrophy or PD signal flow, 2) semiquantitative (usually using a 4-point grade) for degree of synovial hypertrophy or PD signal flow, and/or 3) quantitative (i.e., in millimeters or pixel count area).
Data were analyzed descriptively. Factors potentially explaining the variability in reliability results were tested graphically by plotting the kappa statistic on one axis versus the potential continuous variable on the other axis. For indicative purposes, linear regression was performed. These analyses were only performed for studies reporting the kappa statistic. The following potential factors were analyzed: number of joints, quality scoring for the articles, and number of observers for interobserver reliability.
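The indicative regression described above can be sketched as an ordinary least-squares fit of the kappa value on one candidate factor, such as the number of joints tested. This is a minimal sketch under that assumption; the exact regression procedure and software used are not specified here, and the function name is hypothetical.

```python
def least_squares_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of the factor (e.g., number of joints).
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the factor must vary across studies")
    # Sum of cross-products between the factor and the kappa values.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope close to zero across studies would be consistent with no apparent relation between the factor and the reported kappa values.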
Article selection and study type.
Of the 348 articles identified, 33 articles and 2 abstracts were included in the analysis, which corresponded to a total of 1,415 patients (Figure 1). Characteristics of the studies are shown in Table 1. A similar number of studies reported reliability of synovitis detection using B-mode (n = 24) and PD (n = 23), of which 12 studies analyzed both (Table 1). Studies on intraobserver reliability (n = 20) included the following: image acquisition (n = 3) and still-image (n = 11) for B-mode, and image acquisition (n = 2) and still-image (n = 11) for PD. Interobserver reliability studies (n = 30) included the following: image acquisition (n = 15) and still-image (n = 9) for B-mode, and image acquisition (n = 8) and still-image (n = 10) for PD. Several studies reported both image acquisition and still-image reliability.
Table 1. Description of the 35 studies reporting reliability of synovitis detection by ultrasonography*
Twenty-seven studies (77%) tested reliability of US in the hand, of which 21 (60%) included MCP joints (usually the 2nd to 5th joints), and less frequently the wrist (n = 20) and PIP joints (n = 14). For the lower extremity, the knee (n = 14) and the metatarsophalangeal joint (n = 10) were the most common joints tested. Overall, the patients' characteristics were typical of active RA with high disease activity, with available Disease Activity Scores in 28 joints ranging from 4.9 to 5.9. For these scores, see Supplemental Table 1 (available in the online version of this article at http://www3.interscience.wiley.com/journal/77005015/home).
Nine (26%) of 35 studies assessed reliability as their primary objective, while the remaining studies assessed other aspects of synovitis, with reliability data in a subset of patients.
Of the 11 studies using binary scoring, 8 (73%) provided US definition of synovitis (mainly described as hypoechoic areas). Four studies stated they followed OMERACT definitions (46). Twenty-four studies that described reliability used different semiquantitative scores (4-point grade). For these scores, see Supplemental Tables 2 and 3 (available in the online version of this article at http://www3.interscience.wiley.com/journal/77005015/home). Quantitative scoring used in 2 studies included pixel-count surface area (44) and resistive index (36).
Machine settings and frequency of the probe were mainly high frequency (5–15 MHz). Investigators used the same brand of machine for reliability testing in all but 3 studies (9, 27, 32). Observers were either rheumatologists or radiologists and experienced in musculoskeletal US in all studies, except one that specifically looked at training, using rheumatology fellows who had limited US experience (19).
Quality of publications.
All ultrasonographers were blinded to the results of clinical and patient data. However, no studies reported calculation of sample size for reliability testing. A description of the scanning procedure was provided in all full-text articles in variable detail. The quality scores ranged from 3 to 5 (on a scale of 0–6) (Table 1).
Intraobserver reliability results.
For B-mode, image acquisition reliability was reported in 3 studies (35 joints), and results varied from a kappa value as low as 0.2 (9) to excellent reliability (CV 1.9–2.6%), although the latter was based on only 9 joints (17). Of the 8 studies that reported kappa values (3,471 joints) for still images, results were higher than for image acquisition, particularly in the knee (range 0.74–1.0). Large studies (11,240 joints), such as those by Naredo et al (10, 32), reported excellent ICCs (range 0.956–0.985).
For PD, only 2 studies (80 joints) reported image acquisition data, but results were excellent (ICC 0.95; CV 4.5%) (41, 36). There were 8 studies (3,060 joints) on still images with available kappa results. Results ranged from good to excellent, with a range of 0.59–1.0. PD still-image reading reliability was excellent in the MCP and PIP joints with all the scoring methods. The studies by Naredo et al reported excellent ICCs (range 0.958–0.99) in a total of 11,800 joints, including joints other than the hands (10, 32, 39).
Time interval for retesting.
Time interval for retesting, a feature of intraobserver reliability, was reported in only 3 studies (60%) on acquisition reliability and ranged from 30 minutes to 1 week. The time interval for still-image interpretation reliability was longer with a mean of 17 weeks (range 0.5 day to 52 weeks).
For interobserver reliability, the number of observers involved ranged from 2 to 23, with 19 studies having only 2 observers.
For B-mode, in the 8 studies (1,104 joints) with kappa values, results of acquisition reliability were wide ranging (κ = 0.22–0.868). Excluding observers with limited experience, the minimum kappa value increased to 0.31. Studies using semiquantitative scoring had kappa values that were fair to good (κ = 0.43–0.87), with better results in the knee (κ = 0.65–0.87). In the 6 studies with kappa results (3,138 joints) on still images, less variation was observed (κ = 0.49–1.0). Similar results for binary and semiquantitative scoring were noted.
Five studies (641 joints) reported PD image acquisition reliability results by the kappa statistic. Results were fair to excellent (κ = 0.42–0.95). The knee and anterior axillary recess of the shoulder had the best results, although sample sizes were small in those studies. Reliability for still images was better than image acquisition. Of the 8 studies (3,325 joints) with kappa values, the range was good to excellent (κ = 0.66–1.0). In addition, interobserver reliability for still images was better in PD than B-mode. Studies using contrast enhancement or 3-D PD seemed to be as reliable as PD without contrast enhancement.
Factors explaining variability of reliability data.
Figure 2 is a graphic representation of reliability results whereby available kappa results were plotted against the total number of joints tested. A linear regression was performed on the kappa results with number of joints tested, quality score, and number of observers for interobserver reliability data. The indicative analyses (data not shown) did not show evidence of a relation between these variables.
The present review has demonstrated that US reliability was good in still-image interpretation (both intraobserver and interobserver), particularly in the 2-D PD mode, when performed by experienced ultrasonographers. However, image acquisition was less reliable with few studies assessing intraobserver image acquisition reliability. Reliability in semiquantitative and binary scoring appeared similar, and the knee was the most reliably assessed joint, including image acquisition. The small joints of the hands, which are the most studied in US reliability studies, had good reliability results in still-image interpretation, but image acquisition was variable. Results in the feet were poor and understudied.
US is becoming a realistic tool for synovitis assessment in RA. It has already fulfilled several criteria from the OMERACT filter, including “truth” (7), which captures the issues of face, content, construct, and criterion validity (15, 47–49), and “feasibility,” which addresses the pragmatic reality of the measure through the cost and ease of operating the machine. However, one criticism that remains is reliability. From this review, it appeared that still-image interpretation was more reliable than image acquisition, which was expected. However, results of acquisition reliability were variable and sometimes poor, with kappa values as low as 0.2, and very few studies assessed intraobserver acquisition reliability in either B-mode or PD. Acquisition reliability poses a particular challenge because it is complicated by the multiplanar capability of US. The scanning technique for each joint, including patient position, setting of the US, gel quantity, probe positioning, and amount of pressure applied, contributes to potential variation in results (50). Differences in the scanning technique and a lack of familiarity with the US machine may also explain the poor results in some reliability studies (e.g., involving a group of experts) (9). Standardization of scanning technique, such as development of imaging guidelines, may improve reliability (51).
In this review, it was apparent that few studies looked at the image acquisition reliability. Intraobserver reliability in this domain has been the least studied, and the time interval for retesting was as short as 30 minutes. This is subject to recall bias. Therefore, future reliability studies would need to focus on intraobserver image acquisition reliability with an attempt to prolong the time interval for retesting, with obvious consideration to testing only on patients with stable disease.
PD interobserver reliability was higher than B-mode in still-image interpretation. This may be due to the fact that grading and detection of signal flow on still images would be less liable to variation than identification of hypoechoic structures. Presence or absence of color signal is also easier to differentiate. However, there was a lack of strong acquisition reliability data, a factor which is of paramount importance to PD since results are dependent on acquisition of proper signal flow.
In this review, different synovitis scoring methods were used, particularly for semiquantitative scoring. In order to reduce this variation, one of the challenges would be to develop consensus on a standardized way of semiquantitatively measuring synovitis. Since definitions on US pathology were formalized by OMERACT in 2005, a standardized way of defining US synovitis was evident in recent studies (46). Studies are ongoing within the OMERACT group to obtain consensus on scoring methods.
Reliability, particularly for acquisition, has the potential to improve with standardized teaching programs, development of consensus guidelines, and improvement in machine quality (52, 53). Another important factor determining reliability, both for acquisition and still-image interpretation, is the experience of the ultrasonographer. In studies where observers had limited knowledge of US, there was noted improvement in acquisition reliability after standardization and training (19). Experience may explain the excellent results noted in the review, e.g., in studies by Naredo et al (10, 32). In light of the findings from this review, studies should focus on the acquisition reliability of the small joints of the hands and feet in addition to the potential improvement after training.
This review has limitations. Reliability data were often incompletely reported, and RA patients were often included as a study subgroup and thereby may have been missed in the literature search. This deficiency was addressed by searching in 2 widely used databases with increased hand searching and reading of full-text articles. Therefore, we believe results presented here are a comprehensive representation of the published data. Statistical heterogeneity made pooling of data impossible; therefore, results were mostly descriptive. However, studies that provided kappa values were graphically analyzed to check for factors affecting the results for reliability. Study design in reliability testing varied in quality, mainly because reliability was not the primary objective in the majority of studies. Gaps in the description of reliability testing, small sample sizes, and incomplete reliability statistics were inevitable. The quality score, although not validated, was an attempt to address this issue. It was interesting that no relationship was found between kappa results and study quality, or any of the other factors tested. Even so, publication bias or other biases cannot be excluded.
Synovitis assessment by clinicians through detection of swollen joints has previously been assumed to be the gold standard by many rheumatologists. However, it is poorly reproducible and insensitive (3). There is a need for a better modality of assessment that can be applied in daily clinical practice. US has the potential to fulfill this role and aid in clinical decision making, provided that issues of reliability are fully resolved, particularly for acquisition reliability, since obtaining images is crucial for subsequent interpretation of radiologic findings.
This review showed that the reliability of synovitis detection on still images in B-mode and in PD was high, particularly in PD. However, more studies are required for reliability testing, especially in image acquisition reliability, and there is a continuing need for more standardization of scanning technique and training.
All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Cheung had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Cheung, Dougados, Gossec.
Acquisition of data. Cheung.
Analysis and interpretation of data. Cheung, Dougados, Gossec.