Abstract

Objective

To evaluate and improve the reliability of power Doppler ultrasonography (PDUS) for detecting and scoring enthesitis in patients with spondylarthritis, using a 3-step procedure.

Methods

In the first step, we evaluated the reliability of 5 sonographers by bilaterally scanning 5 entheses twice in 5 patients. In the second step, starting from disagreements observed during the first step, we established consensus guidelines. The sonographers' implementation was further evaluated in 2 reliability exercises: one on 60 PDUS enthesitis images and the other by scanning 5 new patients. In the third step, we performed a final reliability evaluation of 5 additional patients after 1 year. Kappa coefficients (κ) as well as variance component analysis (VCA) and generalizability theory (GT) were used to assess reliability.

Results

The initial intra- and interobserver reliability were poor, especially for detecting and scoring Doppler signal. VCA and GT showed that most variability was accounted for by interaction between sonographer and enthesis. Implementation of consensus guidelines was associated with a significant improvement in Doppler reliability between the first and second steps (mean interobserver κ increased from 0.13 to 0.51 for binary Doppler scoring in patients; P < 0.005), which persisted in the third step (mean interobserver κ = 0.57). The high GT coefficients reached in the last steps supported such improvement.

Conclusion

The 3-step procedure used in this study to standardize PDUS technique was associated with a significant improvement in interobserver reliability for detecting enthesitis in spondylarthritis patients. Such an approach can be useful to standardize PDUS assessment of musculoskeletal disorders.


INTRODUCTION

Spondylarthritis (SpA) is a group of disorders that are characterized by inflammation of the enthesis and the adjacent bone (1, 2). Peripheral enthesitis is observed in all SpA subtypes, and may sometimes be present for a long period of time as an isolated clinical manifestation (3). Thus the development of imaging methods adapted to the detection of enthesitis is strongly needed. This would be especially helpful for the early diagnosis of SpA, considering the low specificity of clinical findings in this disease and the long period of time often required to observe significant radiographic changes (4–6).

Over the last few years, ultrasonography (US) has proved to be a highly sensitive tool, especially in the assessment of tendon and joint pathology (7, 8). Several studies have described, using B mode, the US aspect of lower limb enthesitis in SpA, revealing the high frequency of clinically asymptomatic, abnormal US findings (9, 10). A recent cross-sectional study using US in B mode combined with power Doppler US (PDUS) underlined the high frequency of peripheral enthesitis among SpA patients, as compared with controls with rheumatoid arthritis or degenerative spinal disease (11). The landmark of US enthesitis in SpA patients was the presence of abnormal vascularization at the insertion of the entheses into the cortical bone. These original results have now been confirmed by other studies outlining the capability of PDUS to reveal inflammation of the enthesis in SpA patients, leading to the proposal of several different scoring systems (12–14).

Despite promising results, PDUS has been evaluated less often for the management of SpA than other innovative imaging techniques such as magnetic resonance imaging, which has been widely promoted for the detection of axial inflammation (15–19). This discrepancy is probably due to the perception that US remains an unreliable imaging technique, and to the greater difficulty of assessing vascular blood flow with Doppler in the entheses than in other tissues, such as the synovium (20–24). The latter difference can be explained by the greater abundance of vessels in the inflamed synovium than in the enthesis (23–25), and by the greater number of Doppler artifacts at the enthesitic site, due to the close proximity of a highly reflecting surface, the cortical bone (26–28). In this context, sonographers critically need 2 competencies to optimize enthesitis assessment by PDUS: specific knowledge of the anatomy of each enthesis (in particular, the localization of normal nutrition vessels) and the capacity to distinguish very slow vascular flow (which is the hallmark of the inflammatory process in the enthesis) from artifacts on power Doppler. Another important factor to consider when optimizing this technique is the characteristics of the Doppler, which depend on the type of device used.

To date, only limited studies have addressed scanning method, definitions of PDUS enthesitis, and quantification of these abnormalities (11–13, 29). Thus, technical and anatomic issues, combined with a lack of standardization, may have hampered the development and validation of the US technique in clinical practice, or in multicenter studies, in SpA.

The purpose of our study was 1) to standardize the use of PDUS for detecting, scoring, and scanning enthesitis in SpA patients, with particular attention to the detection of inflammatory aspects (defined by the presence of vascularization at the cortical enthesis insertion) and their scoring, and 2) to prospectively evaluate the intra- and interobserver reliability of a group of sonographers with different levels of experience in the PDUS examination of enthesitis.

PATIENTS AND METHODS

Design.

The study was planned in 3 steps.

Baseline sonographic reliability.

The first step aimed at evaluating the baseline intra- and interobserver reliability of sonographers with different levels of experience in enthesis scanning for the detection and scoring of enthesitis. This step was conducted in November 2004 using a standardized protocol. Five SpA patients were scanned by each sonographer during 1 session. Each patient was assigned to 1 US machine. Sonographers moved from one patient to another and examined 5 enthesitic sites, bilaterally, according to a latin square design. A random set of 25 of the 50 sites examined were scanned twice in order to assess the intraobserver reliability.
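The rotation of sonographers across patients can be illustrated with a small sketch. The text only states that a latin square design was used; the cyclic construction below, the function name `latin_square`, and the numbering of sonographers and patients are assumptions made for illustration, not the authors' actual schedule.

```python
def latin_square(n):
    """Return a cyclic n x n Latin square.

    Row = sonographer, column = time slot, value = patient index (0-based).
    Each sonographer sees each patient once, and no patient is assigned
    to two sonographers in the same slot.
    """
    return [[(row + slot) % n for slot in range(n)] for row in range(n)]


schedule = latin_square(5)
for sonographer, row in enumerate(schedule, start=1):
    patients = " ".join(str(p + 1) for p in row)
    print(f"Sonographer {sonographer}: patient order {patients}")
```

With 5 sonographers and 5 patients (each patient staying at 1 US machine), such a schedule needs 5 time slots and keeps every machine occupied in every slot.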

Impact of consensus guidelines on reliability.

Consensus guidelines were then elaborated and implemented. Disagreements related to the detection, scoring, and scanning of enthesitis encountered during the first step were collected by e-mail from sonographers and analyzed. Consensus rules about enthesis scanning, the definition of abnormal findings, and the scoring system were then reached by using a nominal group procedure (30). Each examiner then applied the consensus guidelines in his or her daily practice.

After a 1-month period, all participants were invited to take part in a reliability session on static images, organized on a dedicated Web site using the consensus guidelines, in order to evaluate their performance in recognizing elementary lesions. Sixty images were used for this Web site session: 20 images were randomly taken from the set of images recorded during the first reliability session performed with patients, and the other 40 were randomly selected from a set in which all PDUS findings were equally represented. Thirty of these 60 images were submitted twice in order to evaluate the intraobserver reliability. Images were presented independently to each sonographer, 1 per minute, and only once (antireturn system).

In January 2005, a second intra- and interobserver reliability session with patients was organized to evaluate whether progress was demonstrable after implementation of the guidelines. The design of this session was identical to the first session. Twenty-five of 50 sites were examined twice in order to assess the intraobserver reliability.

Long-term assessment of reliability improvement.

The third step was conducted after another year, in May 2006. The objective of this last part was to verify the acquired reliability and to substantiate whether the training practice modified reliability. For this reason, a new intra- and interobserver reliability session with 5 patients was organized for 1 day. The design of this session was identical to the previous sessions.

Sonography settings.

All US examinations were performed using standardized equipment. The machine type was an Esaote Technos MPX (ESAOTE, Genoa, Italy), with a 13-MHz linear array transducer. Power Doppler settings were standardized with a frequency of 10 MHz, a pulse repetition frequency of 500 Hz, a gain of 113 dB, and medium wall filter. Patients were asked not to talk to the sonographers. The following entheseal sites were examined bilaterally in both planes: insertion of quadriceps femoris into patella, superior insertion of patellar ligament into the patella apex, insertion of extensor common tendons into lateral epicondyle of the elbow, Achilles tendon, and plantar fascia insertion into the calcaneus.

PDUS was initially performed in B mode to detect morphologic and structural abnormalities, and subsequently with power Doppler to detect abnormal vascularization at the cortical insertion. The examination of each enthesis lasted ∼5 minutes. The processing settings (i.e., transducer orientation, positioning of the probe and enthesis, adherence to standard planes, and machine settings) remained constant throughout all sessions. The scan pictures were registered by one of the sonographers (MADA).

PDUS enthesitis scoring.

Sonographers looked for the following elementary lesions in B mode, and by using Doppler technique (any of the lesions was considered a feature of enthesitis): morphologic abnormalities (i.e., increased thickness and/or hypoechogenicity of enthesis insertion), structural abnormalities (i.e., calcifications and/or enthesophytes, erosions), and vascularization at the enthesis insertion into the cortical bone as an inflammatory sign (Figure 1A).

Figure 1. Power Doppler ultrasonography patterns of enthesis in spondylarthritis patients by using a longitudinal scan. A, Achilles tendon enthesitis: erosions of cortical bone (a); abnormal vascularization at the cortical junction associated with hypoechogenicity of enthesis insertion (b); small enthesophyte (c). B, Enthesis of extensor common tendons on lateral epicondyle of the elbow: power Doppler finding detected at cortical enthesis insertion (v). The interpretation of that finding located at the limit of the normal nutrition vessel position is doubtful. Sonographers conservatively considered it as normal. C, Achilles tendon enthesitis: hyperechoic line that could be interpreted as an enthesophyte or as a calcification (this was the reason for scoring those findings together as a unique feature), but also as normal (a); increased thickness of tendon insertion with a hypoechoic aspect that could alternatively be interpreted as an anisotropic aspect (b).

These elementary lesions were defined and scored as follows. First, tendon thickness and/or echogenicity were scored as 0 if normal and 1 if increased thickness and/or hypoechogenicity was present at the bone insertion by using a comparative criterion (i.e., contralateral site and/or personal experience). Second, calcification was defined as a hyperechoic spot with or without acoustic shadow in the area of the enthesis insertion; enthesophyte was defined as an ossification with irregularity of enthesis cortical bone aspect. These abnormal findings were scored together as 0 if absent and 1 if any were present. Third, bone erosion, defined as a cortical break with a step down defect of bone contour in the longitudinal and transversal axis, was scored 0 if absent and 1 if present. Finally, vascularization of enthesis insertion into the cortical bone was assessed with power Doppler. The Doppler signal was scored 0 if absent, 1 if minimal (only 1 color spot detected), 2 if moderate (2 spots), or 3 if severe (≥3 spots). It was also scored as a binary item: 0 if absent and 1 if any signal was present.
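The Doppler scoring rule described above can be written as a short sketch. The function names `doppler_grade` and `doppler_binary` are hypothetical, and the input is assumed to be the number of color spots counted at the cortical insertion.

```python
def doppler_grade(spots):
    """Semiquantitative Doppler score: 0 absent, 1 minimal (1 spot),
    2 moderate (2 spots), 3 severe (3 or more spots)."""
    if spots < 0:
        raise ValueError("number of color spots cannot be negative")
    return min(spots, 3)


def doppler_binary(spots):
    """Binary Doppler score: 0 if no signal, 1 if any signal is present."""
    return 1 if doppler_grade(spots) > 0 else 0


for spots in (0, 1, 2, 5):
    print(spots, doppler_grade(spots), doppler_binary(spots))
```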

A consensus was reached for scoring increased thickness and hypoechogenicity of enthesis insertion together as a unique feature, because both are signs of acute inflammation. A comparable agreement was reached for scoring calcification and enthesophyte together as a unique lesion, owing to the difficulty of differentiating small enthesophytes from calcifications.

Concerning vascularization, participants agreed upon predefined Doppler settings and the way to discriminate inflammatory vessels from nutrition vessels with distinctive localization.

Sonographers.

Sonographers were 4 rheumatologists and 1 radiologist experienced in musculoskeletal US but with different levels of training for studying enthesis in SpA. One was an expert who contributed to the development of this technique and performed several pilot studies in this field, and one was trained by the former and had 3 years of experience in that application. The others were sonographers with more limited experience in that specific context.

Patients.

Characteristics of the 15 patients included in this study (5 different patients per session) are shown in Table 1. All patients fulfilled the SpA criteria by Amor et al and the European SpA Study Group criteria, and/or the New York modified criteria (31–33).

Table 1. Demographic and clinical characteristics of spondylarthritis (SpA) patients (n = 15)*
Characteristic                        Value
Sex ratio, men:women                  1.1
Age, mean ± SD years                  47 ± 13
Disease duration, mean ± SD years     10 ± 9
HLA–B27 positive                      13 (86.7)
SpA subtype
  AS                                  8 (53)
  Undifferentiated SpA                3 (20)
  PsA                                 2 (13)
  AIBD                                2 (13)
BASDAI score, mean ± SD               6.1 ± 1.9
Current treatment
  NSAID                               15 (100)
  Anti-TNFα                           8 (53)

* Values are the number (percentage) unless otherwise indicated. AS = ankylosing spondylitis; PsA = psoriatic arthritis; AIBD = arthritis associated with inflammatory bowel disease (the 2 patients had Crohn's disease); BASDAI = Bath Ankylosing Spondylitis Disease Activity Index; NSAID = nonsteroidal antiinflammatory drug; anti-TNFα = anti–tumor necrosis factor α.
Anti-TNFα biologic agents: 6 infliximab and 2 adalimumab.

Patients were selected according to their high level of disease activity, as defined by a Bath Ankylosing Spondylitis Disease Activity Index score ≥4 of 10 at the time of examination (34). All patients were asked to stop their nonsteroidal antiinflammatory drug treatment 3 days before PDUS examination. All patients gave a written informed consent to participate in this study and the protocol was approved by the institutional ethics committee.

Statistical analysis.

Intra- and interobserver reliability were assessed according to standard kappa (κ) coefficient and weighted κ coefficient with absolute weighting (κ[w]) exclusively for semiquantitative Doppler signal. While intraobserver coefficients were evaluated on pairs of measures performed by the same sonographer at each site, calculation of interobserver coefficient was exclusively based on the first measure of those pairs. Percentage of observed agreement (i.e., percentage of observations that obtained the same score) was also calculated. Interobserver reliability was studied by calculating the mean kappa for all pairs (i.e., Light's kappa) (35). Kappa coefficients were interpreted according to Landis and Koch (36).
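As an illustration of the coefficients named above, the following sketch computes Cohen's κ, a weighted κ with absolute (linear) weights, and Light's κ as the mean over all pairs of raters. It is a minimal pure-Python version written for this text, not the authors' R code; the function names and the toy ratings are assumptions. The convention of returning 1 when κ is not estimable follows the footnote of Table 2.

```python
from itertools import combinations


def cohen_kappa(a, b, categories, weighted=False):
    """Cohen's kappa for two raters; linear (absolute) weights if weighted."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)
    counts = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        counts[index[x]][index[y]] += 1
    row = [sum(r) for r in counts]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        # Agreement weight: 1 on the diagonal, decreasing with |i - j|.
        if weighted and k > 1:
            return 1.0 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    po = sum(w(i, j) * counts[i][j] for i in range(k) for j in range(k)) / n
    pe = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k)) / n**2
    if abs(1.0 - pe) < 1e-12:
        # All observations fall in one cell: kappa is not estimable, set to 1.
        return 1.0
    return (po - pe) / (1.0 - pe)


def light_kappa(ratings, categories, weighted=False):
    """Light's kappa: mean of the pairwise kappas over all rater pairs."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b, categories, weighted) for a, b in pairs) / len(pairs)


# Toy binary Doppler scores from 3 hypothetical sonographers on 4 sites.
raters = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]
print(round(light_kappa(raters, [0, 1]), 3))
```

For the semiquantitative Doppler score, the same functions would be called with `categories=[0, 1, 2, 3]` and `weighted=True`.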

We assessed the sources of variability in interobserver reliability exercises by a variance component analysis (VCA) using an analysis of variance model with random effects (enthesis and sonographer) to quantify the variability attributable to the enthesis, to the sonographer, and to the interaction between enthesis and sonographer for all PDUS lesions observed. Then we analyzed the interobserver reliability results using the generalizability theory (GT), which permits a multifaceted perspective on measurement error and its components. In GT, elementary sources of variance are called facets. In the interobserver exercises, there was 1 facet, the sonographer, while enthesis was the object of measurement and the interaction term sonographer × enthesis was confounded with the error term. We calculated the overall reliability coefficient over all facets, which is called the φ-coefficient (or index of dependability). This shows the reliability of the method (sonographer) with all sources of variation included, in the context of an absolute decision (i.e., each PDUS lesion considered as exactly absent, minimal, moderate, or severe). This coefficient ranges from 0 (not reliable at all) to 1 (maximum reliability). Statistical analysis was performed using the R software (http://www.r-project.org/).
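The variance decomposition and the φ-coefficient can be sketched as follows. This is an assumption-laden illustration, not the authors' R analysis: it uses the classic expected-mean-squares estimators for a crossed enthesis × sonographer design with 1 score per cell (so the interaction is confounded with error, as stated above), truncates negative estimates at 0, and computes φ for a single sonographer as the enthesis variance divided by the total variance. The function names and toy scores are hypothetical.

```python
def variance_components(scores):
    """scores[e][s] = score given to enthesis e by sonographer s.

    Returns (var_enthesis, var_sonographer, var_residual), where the
    residual contains the enthesis x sonographer interaction.
    """
    n_e, n_s = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_e * n_s)
    e_means = [sum(row) / n_s for row in scores]
    s_means = [sum(scores[e][s] for e in range(n_e)) / n_e for s in range(n_s)]

    ss_e = n_s * sum((m - grand) ** 2 for m in e_means)
    ss_s = n_e * sum((m - grand) ** 2 for m in s_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_e - ss_s

    ms_e = ss_e / (n_e - 1)
    ms_s = ss_s / (n_s - 1)
    ms_res = ss_res / ((n_e - 1) * (n_s - 1))

    var_res = max(ms_res, 0.0)
    var_e = max((ms_e - ms_res) / n_s, 0.0)
    var_s = max((ms_s - ms_res) / n_e, 0.0)
    return var_e, var_s, var_res


def phi_coefficient(var_e, var_s, var_res):
    """Index of dependability for an absolute decision by 1 sonographer."""
    total = var_e + var_s + var_res
    return var_e / total if total > 0 else float("nan")


# Toy semiquantitative Doppler scores: 4 entheses x 3 sonographers.
toy = [[0, 0, 1], [1, 1, 1], [3, 2, 3], [0, 0, 0]]
components = variance_components(toy)
total = sum(components)
print([round(100 * v / total, 1) for v in components])
print(round(phi_coefficient(*components), 2))
```

Under these assumptions, φ is close to 1 only when nearly all of the variance comes from true differences between entheses rather than from the sonographer or the residual term.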

RESULTS

The prevalence of detected PDUS abnormalities, the observed agreement, and the kappa coefficient concerning intraobserver reliability exercises for the 3 steps are shown in Table 2, while Table 3 displays the interobserver reliability results.

Table 2. Results of intraobserver reliability exercises throughout study

                                      Step 1         Step 2         Step 2         Step 3
                                      (patients)     (images)       (patients)     (patients)
Doppler (0–3)
  Mean prevalence (range of %)*       20             68–78          0–4            10–15
  Observed agreement (range)          0.6–0.8        0.78–0.95      0.9–1          0.85–0.95
  Kappa (range)                       −0.15–0.61     0.83–0.96      0.9–1          0.32–0.88
Doppler (0–1)
  Mean prevalence (range of %)*       20             68–78          0–4            10–15
  Observed agreement (range)          0.6–0.8        1              0.9–1          0.9–1
  Kappa (range)                       −0.17–0.41     1              0.9–1          0.46–1
Morphologic (0–1)
  Mean prevalence (range of %)*       10–50          61–76          0–37           15–28
  Observed agreement (range)          0.8–1          0.74–0.95      0.75–1         0.85–0.9
  Kappa (range)                       0.61–1         0.3–0.89       0.47–1         0.48–0.63
Calcifications/enthesophytes (0–1)
  Mean prevalence (range of %)*       40             30–76          17–83          53–80
  Observed agreement (range)          0.8            0.84–0.94      0.5–0.83       0.7–0.85
  Kappa (range)                       0.58–0.6       −0.05–0.48     −0.14–0.4      0.21–0.70
Erosions (0–1)
  Mean prevalence (range of %)*       35–40          26–32          8–33           10–50
  Observed agreement (range)          0.9–1          0.89           0.67–1         0.65–0.9
  Kappa (range)                       0.78–1         0.73–0.76      0.25–1         0.3–0.8

* Mean prevalence of 2 paired measures per sonographer; the denominator used is the number of examined sites.
Ranges are the range of values across participating sonographers.
Kappa was set to 1, although not estimable, when observed agreement was perfect and all observations were in only 1 cell of the contingency table.

Table 3. Results of interobserver reliability exercise throughout study

                                      Step 1         Step 2         Step 2         Step 3
                                      (patients)     (images)       (patients)     (patients)
Doppler (0–3)
  Prevalence (range of %)*            12–52          72–75          3–8            5–10
  Observed agreement (mean)           0.57           0.5            0.95           0.9
  Kappa (mean)                        0.08           0.59           0.51           0.57
Doppler (0–1)
  Prevalence (range of %)*            12–52          72–75          3–8            5–10
  Observed agreement (mean)           0.62           0.99           0.97           0.93
  Kappa (mean)                        0.13           0.97           0.65           0.58
Morphologic (0–1)
  Prevalence (range of %)*            100            58–70          0–28           13–21
  Observed agreement (mean)           1              0.71           0.77           0.83
  Kappa (mean)                        1              0.38           0.15           0.41
Calcifications/enthesophytes (0–1)
  Prevalence (range of %)*            53–65          0–13           30–73          55–73
  Observed agreement (mean)           0.45           0.89           0.60           0.68
  Kappa (mean)                        −0.008         0.09           0.25           0.30
Erosions (0–1)
  Prevalence (range of %)*            35–58          22–24          10–30          8–48
  Observed agreement (mean)           0.72           0.95           0.80           0.65
  Kappa (mean)                        0.45           0.81           0.33           0.24

* Range of values across participating sonographers.
Means are the mean of the n (n − 1)/2 pairwise agreement coefficients between each pair of the n sonographers; the denominator used was the number of examined sites.

The prevalence of detected lesions in sessions performed with patients was quite variable among sonographers along the 3 steps of the study. Abnormal vascularization was consistently less frequently detected than B mode abnormalities (Tables 2 and 3). There was also some variability in the prevalence of elementary lesions according to the enthesitic site (Table 4).

Table 4. Prevalence of elementary lesions by enthesis location observed across the reliability exercises on patients throughout the study*

                               Step 1      Step 2      Step 3
Power Doppler signal
  Achilles tendon              0–25        0           0
  Plantar fascia               50          0–17        0
  Quadriceps tendon            0           10          10–25
  Patellar ligament            25          0           0–10
  Lateral epicondyle           0           0           30–60
Morphologic abnormalities
  Achilles tendon              0           0–33        0–30
  Plantar fascia               50–75       83–100      40–50
  Quadriceps tendon            0–25        0–10        12–50
  Patellar ligament            0–27        54–67       0–90
  Lateral epicondyle           0           50–83       50–70
Calcifications/enthesophytes
  Achilles tendon              0           10–11       60–100
  Plantar fascia               62–100      17–67       40–90
  Quadriceps tendon            0–25        25          25
  Patellar ligament            75          0–67        60–90
  Lateral epicondyle           0–50        33–83       60–90
Erosions
  Achilles tendon              0           0–50        10–50
  Plantar fascia               75–100      0–17        0–60
  Quadriceps tendon            0           0–10        25
  Patellar ligament            75          0–17        10–60
  Lateral epicondyle           0           33–50       20–80

* Results are expressed as a range of mean prevalence of 2 paired measures per sonographer, across participating sonographers; the denominator used is the number of examined sites.
The prevalence of Doppler signal refers to binary scoring (0–1).

In the first step, intraobserver variability was strongly dependent on the type of lesion detected (Table 2). The lowest reliability concerned Doppler signal scoring (both in the binary and in the semiquantitative methods), with kappa coefficients between −0.15 and 0.61. In contrast, good or excellent intraobserver agreement was generally observed for B mode evaluations, with kappa coefficients ranging between 0.58 and 1.

Initial interobserver reliability also varied strongly among different types of lesions (Table 3). Agreement for detecting vascularization was slight (κ = 0.13 and 0.08 for binary and semiquantitative scoring, respectively). In contrast, reliability for detecting morphologic abnormalities and erosions in B mode was much better than for detecting Doppler signal, with kappa coefficients of 1 and 0.45, respectively. Finally, agreement for detecting calcifications/enthesophytes was poor (κ = −0.008). It is noteworthy that the intra- and interobserver reliability levels were independent of the enthesitic site (data not shown).

The second step started with the elaboration and implementation of consensus guidelines followed by 2 reliability exercises: one with static images that virtually assessed the level of guideline efficiency, and the second with patients in order to evaluate their practical impact on PDUS scanning. Overall, the best intra- and interobserver reliability was observed in the Web site exercise, with kappa values ranging from 0.59 to 1 for Doppler signal recognition (Tables 2 and 3). In the exercise performed with patients, there was a strong improvement in reliability for detecting and grading Doppler signal, as compared with the first step (Tables 2 and 3). Excellent intraobserver kappa values (range 0.9–1) were obtained for scoring Doppler signal, both in binary and in semiquantitative methods (Table 2), whereas more variability was observed for scoring B mode lesions, especially for enthesophytes/calcifications (κ range −0.14 to 0.4) (Table 2). Interobserver kappa values ranged from 0.15 for morphologic abnormalities to 0.65 for Doppler signal in the binary method (Table 3), which resulted in a statistically significant improvement in detecting Doppler signal as compared with step 1 (P < 0.005).

The third step took place after 1 year of personal practice. The results of this last session performed with patients showed a statistically significant improvement of intra- and interobserver reliability for Doppler signal detection, whatever the scoring modality (i.e., binary or semiquantitative), as compared with step 1 (P < 0.005). No statistically significant change was observed between the second and third steps. However, the intraobserver reliability for Doppler signal scoring was more variable between sonographers than in step 2 (step 3: κ = 0.46–1 and κ = 0.32–0.88 for binary and semiquantitative scoring, respectively). In the last step, as in the second step, the interobserver reliability for Doppler signal was better than for B mode lesions (Table 3). Across the 3 steps, we observed an improvement in intra- and interobserver agreement for all PDUS lesions, particularly for Doppler signal, which reached values above 0.9 in the second and third steps (Tables 2 and 3). Figures 1B and 1C show some of the elementary lesions that were difficult to evaluate during the first step.

The VCA showed that a large proportion of the variance depended on the way in which sonographers scored lesions at each enthesitic site (i.e., sonographer × enthesis), especially in the first step (Table 5). The GT absolute coefficient (φ coefficient) confirmed the low level of interobserver reliability for all PDUS findings in the first step, ranging from 0.06 for Doppler signal with semiquantitative scoring to 0.44 for erosions (Table 5). The variance tended to decrease markedly for Doppler signal detection in the second step, but not for B mode lesions. Nevertheless, reliability tended to improve, as shown by the GT coefficients (Table 5), which especially increased for Doppler signal detection (0.66 for both scoring modalities in step 2). VCA confirmed a marked improvement in the reliability of Doppler signal in the third step as compared with the first step, but not as compared with step 2. Furthermore, the evolution of GT coefficients from step 1 to step 3 showed a trend toward reliability improvement for the detection of all lesions except erosions, for which reliability worsened (Table 5). Results of the reliability assessment did not vary among enthesitic sites (data not shown).

Table 5. Variance component analysis and φ-coefficient through all 3 interobserver reliability sessions with patients*

                                      Step    Enthesis    Sonographer    E × S    φ
Doppler (0–3)                         1       6.8         8.2            85       0.06
                                      2       66.2        0              33.8     0.66
                                      3       70.4        0              29.6     0.7
Doppler (0–1)                         1       12.4        19             68.6     0.12
                                      2       66.2        0.8            33       0.66
                                      3       63.8        0              36.2     0.64
Morphologic (0–1)                     1       14          39.1           46.8     0.14
                                      2       20.6        15.0           64.4     0.21
                                      3       36.2        5.8            58       0.36
Calcifications/enthesophytes (0–1)    1       0           2.2            97.8     0
                                      2       21.3        26.6           52.1     0.11
                                      3       27          8.6            64.4     0.42
Erosions (0–1)                        1       44.2        4.2            51.6     0.44
                                      2       37.4        8.8            53.8     0.37
                                      3       17.5        21.7           60.8     0.18

* Values are the percentage. E × S = enthesis × sonographer.
Enthesis, Sonographer, and E × S columns: percentage of variance attributable to enthesis, sonographer, and interaction between enthesis and sonographer by the variance component analysis.
φ column: φ-coefficient, which shows the reliability of the sonographer with all sources of variation included, in the context of an absolute decision (range 0 [not reliable at all] to 1 [maximum reliability]).

DISCUSSION

The main objective of this study was to evaluate, and eventually improve, the reliability of PDUS for detecting and scoring enthesitis in SpA. Standardization of enthesitis assessment by PDUS would facilitate the dissemination of this technique in daily practice and allow adequately trained sonographers to participate in multicenter research studies. To our knowledge, this is the first study that tried to establish whether inflammation and structural changes that comprise the anatomic aspects of enthesitis in SpA could reliably be assessed by PDUS. To achieve this goal, it was necessary to identify and standardize each component involved in the reliability of this technique.

Few studies have previously evaluated the overall reliability of PDUS in rheumatology. All of these previous studies concerned joint examination (37–39). Their results were strongly dependent on the type of lesion and joint studied. All have underscored the need to achieve an agreement on the definition of lesions and their scoring in order to obtain reliable results (37–42). Here, using a pragmatic methodology based on evaluation and consensus, we proposed a standardized definition of PDUS enthesitis for detecting and scoring each elementary lesion.

Based on previous studies, we considered power Doppler as especially important because it enables the distinction between inflammatory enthesitis, a hallmark of SpA, and enthesitic lesions of purely mechanical origin (11). Thus, we used standardized PDUS equipment that had previously been recognized as highly sensitive for the detection of small vessels, in concordance with histology (43).

The first reliability session, which took place prior to any special training, revealed a great variability in PDUS technique, especially concerning the Doppler findings. A high level of interobserver variability was expected with such a technique that includes several possible sources of discrepancy between sonographers: definition of elementary lesions, detection and scoring of lesions, and data acquisition. Intraobserver reliability also showed a high variability of results, which depended on the type of lesions studied.

The kappa values for Doppler detection were initially lower than those for B mode. This could reflect the relative lack of experience of some of the sonographers in the detection of the former finding, which is highly dependent on the correct positioning of the probe, the proper setting of Doppler parameters, knowledge of normal vessel anatomy, and the time taken for examination. Interestingly, however, a consensus was reached most easily for Doppler signal, perhaps because agreement before training was quite low. This was followed by a striking improvement in Doppler reliability in the second step, which persisted in the long-term evaluation (third step).

In contrast, the detection of morphologic and structural findings showed less variability in the initial evaluation, except for the interobserver identification of calcifications/enthesophytes, which was poorly reliable. It is noteworthy that these are the findings that all sonographers routinely search for in their daily practice. There was no obvious change in the level of reliability for the detection of these lesions during the 3-step procedure, except again for calcifications/enthesophytes, which improved. This could reflect a relative difficulty in improving the skills of sonographers for the detection of findings that they are already very familiar with.

The prevalence of lesions was variable across the patient series and was sometimes quite low, which might have rendered reliability more difficult to assess because Cohen's kappa is artificially low in that case. This is known as one of the kappa paradoxes (44). Notwithstanding this limitation, the observed agreement was generally very good, especially in the second and third steps (>60% for all lesions). This situation underlines the practical difficulty of recruiting patients with a large spectrum of lesions, which is not the case when practicing on registered images. In the Web site exercise, we managed to have a balanced prevalence of each lesion, and the results were better overall than for patients. The use of prevalence-adjusted, bias-adjusted kappa (PABAK) as an additional indicator for measuring observer agreement has been advocated (45). The use of PABAK in our study would have improved the reliability results for lesions with the lowest prevalence. For instance, in the second session with patients, the kappa values for interobserver reliability of Doppler and erosion detection (for which the prevalences were in the range of 3–8% and 10–33%, respectively) would have increased from 0.51 to 0.9 and from 0.33 to 0.6, respectively.
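The PABAK figures quoted above can be checked with a short sketch, assuming the usual 2-category form PABAK = 2 × (observed agreement) − 1. The function name `pabak` is illustrative, and pairing these values with the mean observed agreements of 0.95 (Doppler) and 0.80 (erosions) reported in Table 3 for the second session with patients is an assumption about which rows the authors used.

```python
def pabak(observed_agreement):
    """Prevalence-adjusted, bias-adjusted kappa for a 2-category score.

    Assumes the standard binary form: PABAK = 2 * Po - 1, where Po is
    the proportion of observations on which the two raters agree.
    """
    if not 0.0 <= observed_agreement <= 1.0:
        raise ValueError("observed agreement must be between 0 and 1")
    return 2.0 * observed_agreement - 1.0


# Mean observed agreement, second session with patients (Table 3).
for lesion, po in (("Doppler", 0.95), ("Erosions", 0.80)):
    print(f"{lesion}: Po = {po:.2f}, PABAK = {pabak(po):.2f}")
```

Because PABAK depends only on the observed agreement, it is unaffected by low prevalence or by unequal marginal distributions, which is why it rises well above Cohen's κ for rarely detected lesions.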

Another problem related to the kappa statistic is that it also depends on the degree of discordance of the marginal distributions between observers (i.e., the kappa is artificially low when one observer yields systematically higher or lower evaluations than the other in a pairwise comparison). This was the case in our study for all PDUS lesions in the first step and for the B mode lesions in the second and third steps.

Such bias was also identified by the VCA in the magnitude of the sonographer variance component for most of the B mode lesions in all steps. For example, in the second interobserver reliability session with patients, we observed a large discrepancy in the prevalence of calcifications/enthesophytes estimated by sonographers (range 30–73%), which was associated with a strong variance in the sonographer facet (26.6%). The same kind of pattern was observed for erosions in the third step. By using GT, we observed that the best interobserver reliability was achieved when most of the variability was related to the enthesis facet and not to the sonographer facet, nor to the sonographer × enthesis interaction. This was only the case in the second and third steps for Doppler detection.

In conclusion, implementation of strict consensus guidelines allowed us to achieve reliable PDUS assessment of the enthesis, independently of the sonographers' initial experience. The evaluation of reliability at baseline was essential to identify disagreements and to elaborate successful guidelines. The use of static images on a Web site allowed us to check the efficiency of those guidelines, which could then be tested further in a practical reliability exercise. Such an approach could be applied to other PDUS applications. Even so, some variability remained after long-term practice, a fact that sonographers performing PDUS should be aware of. However, the variability observed in the last step of our study was acceptable, especially for Doppler detection and scoring, which we consider the most critical finding for the evaluation of enthesitis in SpA.

AUTHOR CONTRIBUTIONS

Dr. D'Agostino had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study design. D'Agostino, Aegerter, Breban, Landais.

Acquisition of data. D'Agostino, Jousse-Joulin, Chary-Valckenaere, Lecoq, Gaudin, Brault, Schmitz, Dehaut, Le Parc.

Analysis and interpretation of data. D'Agostino, Aegerter, Le Parc, Breban, Landais.

Manuscript preparation. D'Agostino, Aegerter, Breban, Landais.

Statistical analysis. D'Agostino, Aegerter.

REFERENCES
