Intra- and interobserver agreement when describing adnexal masses using the International Ovarian Tumor Analysis terms and definitions: a study on three-dimensional ultrasound volumes

Authors


Correspondence to: Dr P. Sladkevicius, Department of Obstetrics and Gynecology, Skåne University Hospital Malmö, S-20502 Malmö, Sweden (e-mail: Povilas.Sladkevicius@med.lu.se)

ABSTRACT

Objectives

To estimate intraobserver repeatability and interobserver agreement in: (1) describing adnexal masses using the International Ovarian Tumor Analysis (IOTA) terms and definitions; (2) the risk of malignancy calculated using IOTA logistic regression model 1 (LR1) and model 2 (LR2); and (3) the diagnosis made on the basis of subjective assessment of ultrasound images.

Methods

One hundred and three adnexal masses were examined by transvaginal gray-scale and power Doppler ultrasound. Three-dimensional ultrasound volumes of each mass were saved. After 12–18 months the volumes were analyzed twice, 1–6 months apart, by each of two independent experienced sonologists, who used the IOTA terms and definitions to describe the masses. The risk of malignancy was calculated using LR1 and LR2. The sonologists also classified the masses as benign or malignant using subjective assessment.

Results

Eighty-four masses were benign, eight were borderline and 11 were invasively malignant. There was substantial variability within and between observers in the results of measurements included in LR1 and LR2 and some variability also when assessing categorical variables included in the models (agreement = 51–100% and kappa = 0.42–1.00). This resulted in substantial variability in the calculated risk of malignancy, the limits of agreement indicating that the calculated risk of malignancy could vary by a factor of 5–20 within and between observers. The reliability of the calculated risk of malignancy was moderate (LR1) or poor (LR2) when the calculated risk of malignancy was > 10% (intraclass correlation coefficients varied from 0.21 to 0.73). Interobserver agreement when classifying tumors as benign or malignant using the predetermined risk of malignancy cut-off of 10% was fair to good (agreement = 85% and kappa = 0.61 for LR1; agreement = 81% and kappa = 0.52 for LR2). Intra- and interobserver agreements for subjective assessment were 96%, 96% and 96% with kappa values of 0.89, 0.87 and 0.88, respectively.

Conclusions

Intra- and interobserver agreement in classifying tumors as benign or malignant using the risk of malignancy cut-off of 10% for LR1 and LR2 was fair or good, whilst the reproducibility of subjective assessment was excellent. The reliability of calculated risks > 10% was poor, and calculated risk > 10% cannot be used to discriminate between individuals at different risk. These results cannot be extrapolated to real-time ultrasound examinations.

INTRODUCTION

Subjective assessment of ultrasound findings (also called pattern recognition) is an excellent method for discriminating between benign and malignant adnexal masses[1-3] and also for making a specific diagnosis in an adnexal mass[4-6]. As an alternative to subjective assessment the International Ovarian Tumor Analysis (IOTA) group has created a logistic regression model containing 12 variables, logistic regression model 1 (LR1), and its simpler version, LR2, containing six variables. These models calculate the risk of malignancy for each individual adnexal mass. A risk of malignancy of more than 10% has been suggested to indicate malignancy[7].

On external validation LR1 has been shown to have diagnostic performance with regard to discrimination between benign and malignant adnexal masses similar to subjective assessment[2]. Some of the variables in LR1 and LR2 are based on subjective evaluation of ultrasound findings, for example, the color score, the presence of blood flow within a solid papillary projection, acoustic shadows and cyst-wall irregularity. The models also include measurements, for example the maximum diameter of the lesion and the maximum diameter of the largest solid component. Because ultrasound examiners may evaluate an adnexal mass differently with regard to qualitative variables (e.g. the color content of the tumor scan and the presence of irregularity) and because there may be variability in measurement results, the risk of malignancy calculated using LR1 or LR2 may vary both within and between ultrasound examiners. It would be of interest to estimate both intraobserver repeatability and interobserver agreement in the assessment of defined ultrasound characteristics and in measurements of adnexal masses. A reproducibility study can easily be performed by using saved tumor volumes obtained by three-dimensional (3D) ultrasound.

The aims of this study were to estimate intraobserver repeatability and interobserver agreement in: (1) describing adnexal masses using IOTA terms and definitions; (2) the risk of malignancy calculated using the IOTA logistic regression models LR1 and LR2; and (3) the diagnosis made on the basis of subjective assessment of ultrasound images when analyzing 3D ultrasound volumes of adnexal masses.

SUBJECTS AND METHODS

The patients included in this study are a subgroup of patients examined within the frame of the IOTA Study Phase 2[2]. The Ethics Committee of Lund University approved the study protocol. Informed consent was obtained from all participants, after the nature of the procedures had been fully explained. Patients with a pelvic mass that clinically and at ultrasound examination was judged to be of adnexal origin were examined with two-dimensional (2D) and 3D transvaginal gray-scale and power Doppler ultrasound by the last author. Before the ultrasound examination a standardized history was taken from each patient and noted in the IOTA research protocol. All patients were operated on within 120 days after the ultrasound examination. The excised tissues underwent histological examination and tumors were classified according to the criteria recommended by the International Federation of Gynecology and Obstetrics[8]. Borderline tumors were classified as malignant.

The patients were examined in the lithotomy position and with an empty bladder using the standardized IOTA examination technique[9]. The equipment used was a GE Voluson 730 Expert ultrasound system (GE Medical Systems, Zipf, Austria) with a 5–9-MHz transvaginal transducer. For Doppler ultrasound examinations, identical power Doppler ultrasound settings were used: frequency, 6–9 (‘normal’) MHz; pulse repetition frequency, 0.6 kHz; gain, 0.8; and wall motion filter, ‘low 1’ (40 Hz). After completion of the 2D ultrasound examination, 3D gray-scale and power Doppler ultrasound volumes of the tumor were acquired. Attempts were made to include the whole adnexal mass (or if this was not possible, as much of it as possible) in the 3D ultrasound volume. If the whole adnexal mass could not be included in the volume, care was taken to include the most representative parts of the mass. The patient was asked to remain still during acquisition of the ultrasound volumes. After acquisition, the resultant three-orthogonal-plane display of the images was examined to ensure that a complete volume of the adnexal mass, or as large a part of it as possible, had been captured. The volumes were transferred from the ultrasound system to a computer-server for storage in the database ViewPoint™ (GE Medical Systems) and then stored on a USB 2.0 memory stick for easier transport between computers when analyzing the volumes.

The ultrasound volumes were not analyzed until 12–18 months had elapsed since the volumes were acquired. The volumes were analyzed twice, 1–6 months apart, by each of the two authors, both of whom are experienced sonologists. The sonologists analyzed the volumes independently of each other. The sonologists did not read the patients' case notes, previous ultrasound reports or research protocols and so were not aware of any clinical information other than the age of the patient (patient dates of birth were available in the stored volumes) when analyzing the volumes. They did not discuss the IOTA terminology or how to assess the ultrasound volumes of the adnexal masses either before or during the reproducibility study.

When evaluating the volumes the sonologists used any-plane slicing. Analyses were performed on personal computers using the software 4D-view™, version 8 (GE Medical Systems). The size of the lesion and that of its largest solid component were measured (the largest diameter and the mean of three orthogonal diameters) using calipers on the longitudinal, transverse or coronal planes of the three-orthogonal-plane display. The largest longitudinal plane through the tumor, as assessed subjectively, was used as the reference image (Plane A). A color score was assigned on the basis of subjective assessment of the power Doppler volume of the tumor. A color score of 1 indicated absence of color Doppler signals, a color score of 2 indicated a minimal amount of color Doppler signals, a color score of 3 indicated a moderate amount of color Doppler signals and a color score of 4 indicated a large amount of color Doppler signals in the tumor[9]. Both sonologists used the IOTA terms and definitions[9] to describe the mass and noted their results in a dedicated research form. The variables assessed with regard to reproducibility are shown in Table 1. The sonologists were also obliged to classify the adnexal mass as benign or malignant (borderline tumors were classified as malignant) using subjective assessment and to state the confidence with which the diagnosis was made (benign, uncertain or malignant; or certainly benign, probably benign, uncertain, probably malignant or certainly malignant). Moreover, they suggested a specific diagnosis (Table 1) using subjective assessment. If more than one specific diagnosis was suggested (e.g. serous cystadenoma, serous cystadenofibroma), the first one noted in the research form was used for statistical calculations.

Table 1. Variables for which intra- and interobserver reproducibility were estimated

Variable | Measurement parameter
Variables included in logistic regression models LR1[7] and LR2[7] or of importance for these models
  Maximum diameter of adnexal mass | mm
  Maximum diameter of largest solid component | mm
  Maximum diameter of largest solid component | ≤ 50, > 50 mm
  Presence of an entirely solid adnexal mass | Yes, no
  Papillary projections present | Yes, no
  Irregular internal cyst walls | Yes, no
  Acoustic shadows | Yes, no
  Color Doppler signals present in papillary projection | Yes, no
  Color score | 1, 2, 3, 4
Other IOTA variables
  Continuous variables
    Mean diameter of adnexal mass | mm
    Mean diameter of largest solid component | mm
    Height of largest papillary projection | mm
    Thickness of thickest septum where it appeared to be at its thickest | mm
  Categorical variables
    Type of tumor | Unilocular, unilocular solid, multilocular, multilocular solid, solid
    Mean diameter of adnexal mass | ≤ 40, 41–60, 61–80, 81–100 and > 100 mm
    Number of cyst locules | 0, 1, 2, 3, 4, 5, 6–10, 11–20 and > 20
     | ≤ 10, > 10
    Septum present | Yes, no
    Incomplete septum present | Yes, no
    Solid component | Yes, no
    Papillary projections | 0, 1, 2, 3, ≥ 4
     | 1, 2, 3, ≥ 4
     | < 4, ≥ 4
    Surface of papillary projection | Smooth, irregular, uncertain
    Echogenicity of cyst fluid | Anechoic, low level, ground glass, mixed, no cyst fluid
    Color Doppler signals detectable | Yes, no
Diagnosis based on LR1 and LR2
  Calculated risk of malignancy using LR1 | %
   | ≤ 10%, > 10%
   | < 8.3%, 8.3–25.6%, > 25.6% (a)
  Second-stage test required | Yes, no (b)
  Calculated risk of malignancy using LR2 | %
   | ≤ 10%, > 10%
Diagnosis based on subjective assessment
  Diagnosis of tumor type | Benign, malignant
  Confidence in diagnosis (three confidence levels) | Benign, uncertain, malignant
  Confidence in diagnosis (five confidence levels) | Benign, probably benign, uncertain, probably malignant, malignant
  Specific diagnosis | Benign cyst, endometrioma, dermoid cyst, serous cystadenoma, mucinous cystadenoma, myoma/fibroma, cystadenofibroma, borderline tumor, invasive malignancy

(a) A calculated risk of 8.3–25.6% using LR1 corresponds to an uncertain diagnosis[15].
(b) A second-stage test is required if the calculated risk of malignancy is 8.3–25.6%[15] or if the sonologist is uncertain whether the tumor is benign or malignant on the basis of subjective assessment.
IOTA, International Ovarian Tumor Analysis; LR1, logistic regression model 1; LR2, logistic regression model 2.

Calculation of risk of malignancy using LR1 and LR2

LR1 contains 12 variables: (a) personal history of ovarian cancer (yes = 1, no = 0); (b) current hormonal therapy (yes = 1, no = 0); (c) age of the patient (in years); (d) maximum diameter of the lesion (in mm); (e) presence of pain during the examination (yes = 1, no = 0); (f) presence of ascites (yes = 1, no = 0); (g) presence of blood flow within a solid papillary projection (yes = 1, no = 0); (h) presence of a purely solid tumor (yes = 1, no = 0); (i) maximal diameter of the solid component (expressed in mm, but with no increase if > 50 mm); (j) irregular internal cyst walls (yes = 1, no = 0); (k) presence of acoustic shadows (yes = 1, no = 0); and (l) color score (1, 2, 3 or 4). The risk of malignancy is derived as $y = 1/(1 + e^{-z})$, where $z = -6.7468 + 1.5985(a) - 0.9983(b) + 0.0326(c) + 0.00841(d) - 0.8577(e) + 1.5513(f) + 1.1737(g) + 0.9281(h) + 0.0496(i) + 1.1421(j) - 2.3550(k) + 0.4916(l)$ and $e$ is the mathematical constant and base value of natural logarithms. The risk of malignancy using the LR2 model is derived as $y = 1/(1 + e^{-z})$, where $z = -5.3718 + 0.0354(c) + 1.6159(f) + 1.1768(g) + 0.0697(i) + 0.9586(j) - 2.9486(k)$. A calculated risk of malignancy of more than 10% classified the mass as malignant[7].
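The two formulas can be illustrated with a minimal Python sketch. The coefficients are transcribed from the text above; the function names, the dictionary layout and the example mass are hypothetical and serve only as an illustration. The sketch assumes that the 50-mm cap on variable (i) applies in LR2 as well as in LR1. It is not the software used in the study.

```python
import math

# Coefficients transcribed from the LR1 and LR2 formulas in the text;
# keys are the variable letters (a)-(l) used there.
LR1 = {"intercept": -6.7468, "a": 1.5985, "b": -0.9983, "c": 0.0326,
       "d": 0.00841, "e": -0.8577, "f": 1.5513, "g": 1.1737, "h": 0.9281,
       "i": 0.0496, "j": 1.1421, "k": -2.3550, "l": 0.4916}
LR2 = {"intercept": -5.3718, "c": 0.0354, "f": 1.6159, "g": 1.1768,
       "i": 0.0697, "j": 0.9586, "k": -2.9486}


def risk_of_malignancy(model, values):
    """Return y = 1 / (1 + exp(-z)) for one adnexal mass."""
    v = dict(values)
    # Variable (i) is entered in mm "with no increase if > 50 mm".
    # Assumption: the same cap applies when (i) is used in LR2.
    v["i"] = min(v["i"], 50.0)
    z = model["intercept"] + sum(
        coef * v[name] for name, coef in model.items() if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-z))


def classify(risk, cutoff=0.10):
    """A risk above the 10% cut-off classifies the mass as malignant."""
    return "malignant" if risk > cutoff else "benign"


# Hypothetical mass for illustration only (not a study patient):
# age 50 years, 60-mm lesion, no solid component, color score 1.
example = {"a": 0, "b": 0, "c": 50, "d": 60, "e": 0, "f": 0,
           "g": 0, "h": 0, "i": 0, "j": 0, "k": 0, "l": 1}
lr1 = risk_of_malignancy(LR1, example)
lr2 = risk_of_malignancy(LR2, example)
print(f"LR1 {lr1:.1%} -> {classify(lr1)}; LR2 {lr2:.1%} -> {classify(lr2)}")
```

Because the risk is a logistic function of z, a change in a single categorical variable (for example acoustic shadows, with a coefficient of about −2.4 to −2.9) shifts z by a fixed amount and can therefore change the calculated risk several-fold.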

The IOTA3 study screen (Astraia GmbH, Munich, Germany) was used to calculate the risk of malignancy according to LR1. The risk of malignancy according to LR2 was calculated using the Statistical Package for the Social Sciences (SPSS program, PASW version 18.0; IBM Corp., New York, NY, USA). When the risk of malignancy was calculated using LR1 and LR2, the following information collected at the real-time ultrasound examination was used because it could not be obtained from the saved ultrasound volumes: patient's age; personal history of ovarian cancer; current hormonal therapy; pain during the ultrasound examination; and presence of ascites.

Statistical analysis

Statistical calculations were undertaken using the Statistical Package for the Social Sciences (SPSS program, PASW version 18.0; IBM Corp.).

For estimation of interobserver agreement the results of the first analysis of each observer were used.

Intra- and interobserver agreement in the assessment of categorical variables was estimated by calculating the percentage agreement. Cohen's kappa (κ) was used to estimate by how much the observed agreement exceeded that expected by chance[10]. Weighted κ values are presented where appropriate[11, 12]. Intra- and interobserver reproducibility of measurement results, including the risks of malignancy calculated using LR1 and LR2, was described as the difference between two measurement results. The differences between the measured values were plotted against the mean of the two measurements (Bland–Altman plots) to assess the relationship between the differences and the magnitude of the measurements[13]. If intra- or interobserver differences increased with the magnitude of the measurements, the values were subjected to logarithmic transformation, whereupon the correlation disappeared[13]. In these cases, the statistical analyses were made using the logarithmically (ln) transformed data and the results presented are those obtained after antilogarithmic transformation. When the results were based on logarithmic transformation, intraobserver differences are presented as the ratio between the first and second measurement results of the same sonologist, and interobserver differences are presented as the ratio between the results of Sonologist 1 and Sonologist 2 using their first measurement.
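As an illustration of the agreement statistics for categorical variables, the following Python sketch computes percentage agreement and Cohen's κ from two series of ratings. The function names and the toy ratings are hypothetical. The weighting scheme used for weighted κ is not stated in the text; linear weights are assumed here purely for illustration.

```python
def percentage_agreement(r1, r2):
    """Percentage of cases on which the two ratings coincide."""
    return 100.0 * sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2, categories, weighted=False):
    """Cohen's kappa for two ratings of the same cases.

    categories must list all categories in their natural order.
    With weighted=True, linear agreement weights are used (an assumption;
    the weighting scheme is not specified in the text) and at least two
    categories are required.
    """
    n = len(r1)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions (rows: first rating, columns: second rating).
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[index[a]][index[b]] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if not weighted:
            return 1.0 if i == j else 0.0
        return 1.0 - abs(i - j) / (k - 1)

    p_obs = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    # Undefined (division by zero) if chance agreement is already complete.
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy ratings (hypothetical): acoustic shadows yes = 1, no = 0.
first = [1, 1, 0, 0]
second = [1, 0, 0, 0]
print(percentage_agreement(first, second), cohen_kappa(first, second, [0, 1]))
```

In this toy example the observed agreement is 75% and the agreement expected by chance is 50%, so κ = (0.75 − 0.50)/(1 − 0.50) = 0.5.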

Systematic bias between two measurements was estimated by calculating the 95% CI of the mean difference (mean difference ± 2 SE). If 0 (for real values) or 1 (for logarithmically transformed values) lay within this interval, no bias was assumed to exist between the two measurements. Intraobserver repeatability and interobserver agreement were expressed as the mean difference and limits of agreement[13]. Ninety-five per cent of differences between any future measurements are estimated to fall between the lower and upper limits of agreement.
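The calculations described above can be sketched in Python as follows. The function name and the toy measurements are hypothetical. The 95% CI of the mean difference follows the text (mean ± 2 SE); the multiplier for the limits of agreement is not stated in the text, and mean ± 2 SD is assumed here.

```python
import math
from statistics import mean, stdev


def bland_altman(first, second, log_ratio=False):
    """Mean difference, its 95% CI (mean +/- 2 SE) and limits of agreement.

    With log_ratio=True the data are ln-transformed before analysis and the
    results are back-transformed, so they are expressed as ratios
    (first / second) rather than as differences.
    Assumption: limits of agreement are taken as mean +/- 2 SD.
    """
    if log_ratio:
        diffs = [math.log(a) - math.log(b) for a, b in zip(first, second)]
    else:
        diffs = [a - b for a, b in zip(first, second)]

    m = mean(diffs)
    sd = stdev(diffs)
    se = sd / math.sqrt(len(diffs))
    ci = (m - 2 * se, m + 2 * se)
    loa = (m - 2 * sd, m + 2 * sd)

    if log_ratio:
        m = math.exp(m)
        ci = tuple(math.exp(x) for x in ci)
        loa = tuple(math.exp(x) for x in loa)

    # No bias is assumed if 0 (differences) or 1 (ratios) lies within the CI.
    null_value = 1.0 if log_ratio else 0.0
    bias = not (ci[0] <= null_value <= ci[1])
    return {"mean": m, "ci": ci, "loa": loa, "bias": bias}


# Toy diameters in mm (hypothetical), two analyses of the same four masses.
result = bland_altman([10.0, 12.0, 14.0, 16.0], [9.0, 12.0, 15.0, 16.0])
print(result)
```

When the ratio form is used, limits of agreement of, for example, 0.2 to 5 mean that one measurement may be as little as one-fifth or as much as five times the other.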

Intra- and interobserver reliability of measurement results was estimated by calculating the intraclass correlation coefficient (ICC) using analysis of variance (two-way random model, absolute agreement; this allows generalization of the results to a population of observers). The ICC indicates the proportion of the total variance in measurement results that can be explained by differences between the individuals examined. The more variable the population investigated, the greater the ICC; in contrast, the less variable the population, the smaller the ICC[14].
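A minimal Python sketch of a two-way random-effects, absolute-agreement ICC is given below. The text does not state whether single-measure or average-measure coefficients were reported; the single-measure form is assumed here. The function name and the toy data are hypothetical.

```python
def icc_absolute_agreement(data):
    """Single-measure ICC, two-way random model, absolute agreement.

    data: one row per mass, one column per measurement occasion or observer.
    Assumption: single-measure form,
    ICC = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n).
    """
    n = len(data)        # number of masses
    k = len(data[0])     # number of measurements per mass
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-mass mean square
    msc = ss_cols / (k - 1)                # between-measurement mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data (hypothetical): the second observer always measures 1 mm more.
offset = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(icc_absolute_agreement(offset))
```

In the toy data the two observers rank the masses identically, but the constant offset lowers the coefficient to 2/3, which illustrates that the absolute-agreement form penalizes systematic differences between observers. Restricting the data to a narrower range of values reduces the between-mass variance and hence the ICC, as noted in the text.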

RESULTS

Volumes were saved from 98 patients, of whom five had bilateral masses. Thus, ultrasound volumes from 103 adnexal masses were included in the study. Patient age ranged between 21 and 88 years (median = 47 years), and 47 (48%) patients were postmenopausal. According to the histopathological examination of the surgical specimens, there were 84 benign, eight borderline and 11 invasively malignant adnexal masses. Specific histological diagnoses are shown in Table 2.

Table 2. Histological diagnoses of the tumors included

Histological diagnosis | n
Benign lesion | 84
  Benign simple cyst | 6
  Endometrioma | 23
  Dermoid cyst | 15
  Serous cystadenoma | 6
  Mucinous cystadenoma | 7
  Myoma/fibroma | 8
  Cystadenofibroma | 13
  Paraovarian cyst | 1
  Sactosalpinx, chronic salpingitis | 4
  Paraneural cyst | 1
Borderline tumor | 8
  Serous | 4
  Mucinous | 4
Invasive malignancy | 11

Intra- and interobserver agreement when assessing ultrasound variables included in the IOTA models LR1 and LR2 are shown in Tables 3 and 4, and those when assessing ultrasound variables not included in the models are shown in Tables 5 and 6. Limits of agreement were wide for all measurements, but the measurements of Sonologist 2 were more reproducible than were those of Sonologist 1 (Tables 3 and 5). There were also systematic differences between the two sonologists, with Sonologist 1 obtaining higher measurement values for the size of the mass and the thickness of the thickest septum. The least reliable measurements were those of the height of the largest papillary projection and the thickness of the thickest septum (Table 5). For most categorical ultrasound variables intra- and interobserver agreement was good or very good (Tables 4 and 6). Interobserver agreement was poorest for color score (interobserver agreement = 51%; κ = 0.53), irregular wall (interobserver agreement = 80%; κ = 0.51) and number of papillary projections when papillary projections were judged to be present on both analysis occasions (interobserver agreement = 58%; κ = 0.51).

Table 3. Reproducibility of continuous variables used in IOTA risk-calculation models LR1[7] and LR2[7]

Variable | Measurement results (mm): median | Range | Difference between two measurements (mm) (a): mean | 95% CI | LOA | ICC point estimate (95% CI)
Intraobserver reproducibility, Sonologist 1
  Maximum diameter of:
    Adnexal mass | 61.85 (n = 206) | 22.90–189.20 | −0.27 (n = 103) | −1.42 to 0.89 | −12.00 to 11.47 | 0.985 (0.978–0.990)
    Largest solid component | 32.15 (n = 76) | 4.70–185.90 | −2.71 (n = 38) | −7.79 to 2.37 | −34.03 to 28.61 | 0.924 (0.860–0.960)
Intraobserver reproducibility, Sonologist 2
  Maximum diameter of:
    Adnexal mass | 58.00 (n = 206) | 21.00–180.00 | 0.08 (n = 103) | −1.09 to 1.25 | −11.78 to 11.93 | 0.984 (0.976–0.989)
    Largest solid component | 38.50 (n = 64) | 7.00–175.00 | −0.59 (n = 32) | −3.78 to 2.59 | −18.59 to 17.41 | 0.974 (0.947–0.987)
Interobserver reproducibility
  Maximum diameter of:
    Adnexal mass | 60.00 (n = 206) | 21.00–189.20 | 3.26 (n = 103) | 1.77 to 4.74 | −11.82 to 18.33 | 0.985 (0.971–0.991)
    Largest solid component | 33.60 (n = 66) | 5.00–185.40 | 0.84 (n = 33) | −3.93 to 5.61 | −26.57 to 28.25 | 0.970 (0.939–0.985)

(a) For intraobserver reproducibility the measurements of the second analysis were subtracted from the measurements of the first analysis, and for interobserver reproducibility the first measurements of Sonologist 2 were subtracted from the first measurements of Sonologist 1.
ICC, intraclass correlation coefficient; IOTA, International Ovarian Tumor Analysis; LOA, limits of agreement; LR1, logistic regression model 1; LR2, logistic regression model 2.

Table 4. Intra- and interobserver agreement for categorical variables included in IOTA risk-calculation models LR1[7] and LR2[7] or of importance for these models

Variable | Intraobserver, Sonologist 1: agreement (%) | Kappa | Intraobserver, Sonologist 2: agreement (%) | Kappa | Interobserver: agreement (%) | Kappa
Presence of entirely solid tumor (yes, no) | 100 | 1.00 | 100 | 1.00 | 100 | 1.00
Maximum diameter of largest solid component (≤ 50 mm, > 50 mm) | 96 | 0.81 | 98 | 0.90 | 97 | 0.85
Maximum diameter of largest solid component (≤ 50 mm, > 50 mm) (a) | 90 | 0.76 | 94 | 0.86 | 91 | 0.80
Papillary projections (yes, no) | 93 | 0.76 | 95 | 0.74 | 92 | 0.71
Solid component (yes, no) | 94 | 0.88 | 94 | 0.87 | 89 | 0.77
Irregular cyst wall (yes, no) | 85 | 0.69 | 93 | 0.79 | 80 | 0.51
Acoustic shadows (yes, no) | 85 | 0.64 | 90 | 0.69 | 87 | 0.66
Color Doppler signals in papillary projections (yes, no) | 95 | 0.42 | 99 | 0.80 | 96 | 0.48
Color Doppler signals in papillary projections (yes, no) (b) | 79 | 0.43 | 100 | 1.00 | 92 | 0.75
Color score (1, 2, 3, 4) | 64 | 0.64 (c) | 80 | 0.81 (c) | 51 | 0.53 (c)

The first analysis of each sonologist was used for calculating interobserver agreement.
(a) Includes only cases in which solid components were seen at both analyses.
(b) Includes only cases in which papillary projections were seen at both analyses.
(c) Weighted kappa index.
IOTA, International Ovarian Tumor Analysis; LR1, logistic regression model 1; LR2, logistic regression model 2.

Table 5. Reproducibility of continuous variables not included in IOTA risk-calculation models LR1[7] and LR2[7] but used to describe the ultrasound appearance of adnexal masses

Variable | Measurement results (mm): median | Range | Difference between two measurements (a): mean | 95% CI | LOA | ICC point estimate (95% CI)
Intraobserver reproducibility, Sonologist 1
  Mean diameter of:
    Adnexal mass | 50.95 (n = 206) | 18.40–177.11 | 0.06 (n = 103) | −1.09 to 1.21 | −11.09 to 12.20 | 0.980 (0.971–0.987)
    Largest solid component | 24.25 (n = 76) | 4.40–177.10 | −0.98 (n = 38) | −4.03 to 2.07 | −19.77 to 17.81 | 0.960 (0.925–0.979)
  Height of largest papillary projection | 9.75 (n = 28) | 3.80–25.00 | 0.80 (n = 14) | 0.67 to 0.95 | 0.41 to 1.56 | 0.774 (0.350–0.926)
  Thickness of thickest septum† | 3.35 (n = 88) | 1.10–33.80 | 1.01 (n = 44) | 0.87 to 1.18 | 0.37 to 2.90 | 0.756 (0.593–0.859)
Intraobserver reproducibility, Sonologist 2
  Mean diameter of:
    Adnexal mass | 46.35 (n = 206) | 19.00–152.00 | −0.23 (n = 103) | −1.15 to 0.69 | −9.55 to 9.09 | 0.985 (0.978–0.990)
    Largest solid component | 28.15 (n = 64) | 6.00–152.00 | −0.77 (n = 32) | −2.27 to 0.74 | −9.29 to 7.76 | 0.991 (0.981–0.995)
  Height of largest papillary projection | 12.00 (n = 16) | 4.00–19.00 | 0.89 (n = 8) | 0.77 to 1.03 | 0.58 to 1.38 | 0.877 (0.532–0.974)
  Thickness of thickest septum† | 1.90 (n = 86) | 0.70–8.50 | 0.93 (n = 43) | 0.83 to 1.04 | 0.44 to 1.94 | 0.811 (0.679–0.893)
Interobserver reproducibility
  Mean diameter of:
    Adnexal mass | 58.50 (n = 206) | 18.40–177.10 | 3.24 (n = 103) | 1.85 to 4.63 | −10.89 to 17.37 | 0.981 (0.963–0.989)
    Largest solid component | 23.60 (n = 66) | 5.00–177.10 | 1.15 (n = 33) | −2.17 to 4.47 | −17.92 to 20.23 | 0.978 (0.956–0.989)
  Height of largest papillary projection | 9.00 (n = 24) | 4.00–25.00 | 0.99 (n = 12) | 0.77 to 1.26 | 0.41 to 2.34 | 0.627 (0.088–0.878)
  Thickness of thickest septum† | 2.45 (n = 84) | 0.70–33.80 | 1.92 (n = 42) | 1.54 to 2.38 | 0.47 to 7.89 | 0.306 (0.048–0.584)

(a) For height of largest papillary projection and thickness of thickest septum, differences are expressed as a ratio between the first and second measurements of each sonologist (intraobserver reproducibility) or between Sonologist 1 and Sonologist 2 (interobserver reproducibility); for the other variables the measurements of the second analysis were subtracted from those of the first analysis when estimating intraobserver reproducibility, and the first measurements of Sonologist 2 were subtracted from the first measurements of Sonologist 1 when estimating interobserver reproducibility.
† Where septum appeared to be at its thickest.
ICC, intraclass correlation coefficient; IOTA, International Ovarian Tumor Analysis; LOA, limits of agreement.

Table 6. Intra- and interobserver agreement for categorical variables not included in IOTA risk-calculation models LR1[7] or LR2[7] but used to describe the ultrasound appearance of adnexal masses

Variable | Intraobserver, Sonologist 1: agreement (%) | Kappa | Intraobserver, Sonologist 2: agreement (%) | Kappa | Interobserver: agreement (%) | Kappa
Tumor type (unilocular, unilocular solid, multilocular, multilocular solid, solid) | 91 | 0.88 | 94 | 0.92 | 82 | 0.75
Mean diameter of tumor (≤ 40, 41–60, 61–80, 81–100, > 100 mm) | 85 | 0.85 (c) | 92 | 0.93 (c) | 91 | 0.93 (c)
Number of locules
  0, 1, 2, 3, 4, 5, 6–10, 11–20, > 20 | 82 | 0.71 (c) | 82 | 0.92 (c) | 79 | 0.65 (c)
  ≤ 10 (a), > 10 | 99 | 0.94 | 99 | 0.95 | 98 | 0.90
Septum (yes, no) | 97 | 0.94 | 98 | 0.96 | 91 | 0.83
Incomplete septum (yes, no) | 94 | 0.47 | 100 | 1.00 | 97 | (d)
Number of papillary projections
  0, 1, 2, 3, ≥ 4 | 89 | 0.67 (c) | 92 | 0.74 (c) | 87 | 0.64 (c)
  1, 2, 3, ≥ 4 (b) | 71 | 0.62 (c) | 63 | 0.56 (c) | 58 | 0.51 (c)
  < 4 (a), ≥ 4 | 95 | 0.64 | 99 | 0.85 | 95 | (d)
Surface of papillary projections (b) (regular, irregular, uncertain) | 86 | 0.68 | 100 | 1.00 | 58 | (d)
Echogenicity of cyst fluid (anechoic, low level, ground glass, mixed, no fluid) | 90 | 0.88 | 86 | 0.82 | 83 | 0.79
Color Doppler signals detectable (yes, no) | 88 | 0.61 | 92 | 0.75 | 83 | 0.46

The first analysis of each sonologist was used to calculate interobserver agreement.
(a) Includes 0.
(b) Includes cases in which papillary projections were seen at both analyses.
(c) Weighted kappa index.
(d) Absent kappa values are explained by asymmetric field tables, making it impossible to calculate them.
IOTA, International Ovarian Tumor Analysis; LR1, logistic regression model 1; LR2, logistic regression model 2.

Intra- and interobserver differences in the calculated risk of malignancy when using LR1 and LR2 are shown in Table 7. The limits of agreement were wide, reflecting large variation in the calculated risk of malignancy between two analyses performed by the same sonologist and between two sonologists. Reliability, reflected by the ICC values, was good or very good for the whole dataset and for adnexal tumors where the calculated risk of malignancy was ≤ 10%, with ICC values ranging from 0.75 to 0.92. Reliability was moderate or poor for tumors where the calculated risk of malignancy was > 10%, with ICC values varying between 0.21 and 0.73 (Table 7). When classifying tumors as having a risk of malignancy of ≤ 10% (benign) or >  10% (malignant) using LR1 or LR2, the intraobserver agreement was good and the interobserver agreement was moderate (LR2) or good (LR1), the percentage agreement varying between 81% and 94% and the κ values varying between 0.52 and 0.85 (Table 8). The classification of masses as benign or malignant using LR1 or LR2 was less consistent than that using subjective assessment, the percentage agreement for subjective assessment being 96% with κ values of 0.87–0.89. Intra- and interobserver agreement with regard to whether a second-stage test was needed (i.e. the calculated risk of malignancy was 8.3–25.6%[15] or the sonologist was uncertain whether the tumor was benign or malignant on the basis of subjective assessment) varied between 80% and 87%, with κ values ranging from 0.29 to 0.55.

Table 7. Reproducibility of risk of malignancy calculated using IOTA risk-calculation models LR1[7] and LR2[7]

Variable | Calculated risk of malignancy (%): median | Range | Difference between two measurements (b): mean | 95% CI | LOA | ICC point estimate (95% CI)
Intraobserver reproducibility, Sonologist 1
  Risk calculated using LR1
    For all data | 3.30 (n = 206) | 0.10–95.50 | 0.82 (n = 103) | 0.69–0.97 | 0.14–4.72 | 0.892 (0.842–0.926)
    For risks ≤ 10% (a) | | | 0.86 (n = 70) | 0.70–1.05 | 0.16–4.68 | 0.813 (0.715–0.879)
    For risks > 10% (a) | | | 0.74 (n = 33) | 0.53–1.02 | 0.11–4.82 | 0.484 (0.184–0.704)
  Risk calculated using LR2
    For all data | 3.67 (n = 206) | 0.05–91.85 | 0.95 (n = 103) | 0.77–1.16 | 0.11–7.19 | 0.864 (0.806–0.906)
    For risks ≤ 10% (a) | | | 0.96 (n = 70) | 0.74–1.24 | 0.11–8.32 | 0.750 (0.627–0.837)
    For risks > 10% (a) | | | 0.93 (n = 33) | 0.66–1.29 | 0.14–6.28 | 0.439 (0.114–0.678)
Intraobserver reproducibility, Sonologist 2
  Risk calculated using LR1
    For all data | 3.60 (n = 206) | 0.10–98.80 | 1.00 (n = 103) | 0.86–1.15 | 0.23–4.25 | 0.921 (0.886–0.946)
    For risks ≤ 10% (a) | | | 1.02 (n = 76) | 0.85–1.21 | 0.22–4.74 | 0.815 (0.722–0.878)
    For risks > 10% (a) | | | 0.94 (n = 27) | 0.75–1.19 | 0.29–3.62 | 0.731 (0.491–0.868)
  Risk calculated using LR2
    For all data | 3.38 (n = 206) | 0.06–97.49 | 0.96 (n = 103) | 0.82–1.14 | 0.18–5.08 | 0.904 (0.861–0.934)
    For risks ≤ 10% (a) | | | 0.95 (n = 74) | 0.78–1.16 | 0.18–5.16 | 0.813 (0.718–0.878)
    For risks > 10% (a) | | | 1.00 (n = 29) | 0.74–1.35 | 0.20–5.02 | 0.556 (0.238–0.765)
Interobserver reproducibility
  Risk calculated using LR1
    For all data | 3.05 (n = 206) | 0.10–95.30 | 0.85 (n = 103) | 0.72–1.01 | 0.15–4.82 | 0.889 (0.840–0.924)
    For risks ≤ 10% (a) | | | 0.85 (n = 74) | 0.71–1.02 | 0.18–4.10 | 0.823 (0.732–0.885)
    For risks > 10% (a) | | | 0.85 (n = 29) | 0.57–1.26 | 0.10–7.12 | 0.455 (0.114–0.701)
  Risk calculated using LR2
    For all data | 3.62 (n = 206) | 0.06–87.81 | 0.98 (n = 103) | 0.78–1.24 | 0.09–10.56 | 0.812 (0.734–0.869)
    For risks ≤ 10% (a) | | | 0.99 (n = 70) | 0.78–1.25 | 0.13–7.28 | 0.768 (0.650–0.849)
    For risks > 10% (a) | | | 0.97 (n = 33) | 0.57–1.65 | 0.05–20.76 | 0.209 (−0.150 to 0.515)

(a) The cut-off level of 10% is the mean of two calculated risks (corresponding to the x-axis in a Bland–Altman plot[13]).
(b) Difference expressed as a ratio between the first and second measurements of each sonologist or between the first measurements of Sonologists 1 and 2.
ICC, intraclass correlation coefficient; IOTA, International Ovarian Tumor Analysis; LOA, limits of agreement; LR1, logistic regression model 1; LR2, logistic regression model 2.

Table 8. Intra- and interobserver agreement in classifying adnexal masses as benign or malignant using IOTA risk-calculation models LR1[7] and LR2[7] and subjective assessment

Variable | Intraobserver, Sonologist 1: agreement (%) | Kappa | Intraobserver, Sonologist 2: agreement (%) | Kappa | Interobserver: agreement (%) | Kappa
Risk of malignancy using LR1
  ≤ 10%, > 10% | 88 | 0.72 | 94 | 0.85 | 85 | 0.61
  < 8.3%, 8.3–25.6%, > 25.6% (a) | 80 | 0.67 (c) | 92 | 0.89 (c) | 81 | 0.68 (c)
  Second-stage test needed (b) (yes, no) | 83 | 0.44 | 87 | 0.55 | 80 | 0.29
Risk of malignancy using LR2
  ≤ 10%, > 10% | 89 | 0.74 | 92 | 0.80 | 81 | 0.52
Diagnosis using subjective assessment
  Benign, malignant | 96 | 0.89 | 96 | 0.87 | 96 | 0.88
  Confidence in diagnosis (three levels) | 88 | 0.80 | 83 | 0.64 | 73 | 0.51
  Confidence in diagnosis (five levels) | 81 | 0.72 | 78 | 0.57 | 65 | 0.42
  Specific diagnosis (see Table 1) | 79 | 0.92 | 80 | 0.76 | 81 | 0.78

The first analysis of each sonologist was used to calculate interobserver agreement.
(a) A calculated risk of 8.3–25.6% using LR1 corresponds to an uncertain diagnosis[15].
(b) A second-stage test is needed if the calculated risk of malignancy is 8.3–25.6%[15], or if the sonologist is uncertain whether the tumor is benign or malignant on the basis of subjective assessment.
(c) Weighted kappa index.
IOTA, International Ovarian Tumor Analysis; LR1, logistic regression model 1; LR2, logistic regression model 2.

The sensitivity and specificity with regard to malignancy when using LR1, LR2 and subjective assessment are shown in Table 9. The sensitivity of LR1 and LR2 was slightly higher than that of subjective assessment, whilst the specificity was lower. The diagnostic accuracy with regard to the specific histological diagnosis on the basis of subjective assessment varied between 61% and 65% (Table 9).

Table 9. Sensitivity and specificity, with regard to malignancy, of IOTA logistic regression models LR1[7] and LR2[7] and of subjective assessment, and accuracy of subjective assessment with regard to specific histological diagnosis

| Diagnostic method | Sonologist | Analysis | Sensitivity (% (n)) | Specificity (% (n)) | Accuracy with regard to histological diagnosis* (%) |
| --- | --- | --- | --- | --- | --- |
| LR1 | 1 | First | 78.9 (15/19) | 85.7 (72/84) | Not applicable |
| LR1 | 1 | Second | 84.2 (16/19) | 77.4 (65/84) | Not applicable |
| LR1 | 2 | First | 84.2 (16/19) | 86.9 (73/84) | Not applicable |
| LR1 | 2 | Second | 84.2 (16/19) | 85.7 (72/84) | Not applicable |
| LR2 | 1 | First | 73.7 (14/19) | 79.8 (67/84) | Not applicable |
| LR2 | 1 | Second | 84.2 (16/19) | 72.6 (61/84) | Not applicable |
| LR2 | 2 | First | 89.5 (17/19) | 88.1 (74/84) | Not applicable |
| LR2 | 2 | Second | 78.9 (15/19) | 82.1 (69/84) | Not applicable |
| Subjective assessment | 1 | First | 73.7 (14/19) | 86.9 (73/84) | 62 |
| Subjective assessment | 1 | Second | 73.7 (14/19) | 85.7 (72/84) | 62 |
| Subjective assessment | 2 | First | 63.2 (12/19) | 89.3 (75/84) | 61 |
| Subjective assessment | 2 | Second | 63.2 (12/19) | 91.7 (77/84) | 65 |

* For histological diagnoses, see Table 1.
IOTA, International Ovarian Tumor Analysis.
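The percentages in Table 9 follow directly from the counts given in parentheses. As a minimal sketch, the calculation for the first analysis of Sonologist 1 is shown below; the counts are taken from Table 9, while the function name and the rounding to one decimal place are assumptions made for illustration.

```python
def percent(numerator, denominator):
    """Proportion expressed as a percentage, rounded to one decimal place."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, 1)


# Counts from Table 9, Sonologist 1, first analysis:
# (malignant masses correctly classified, all malignant masses,
#  benign masses correctly classified, all benign masses)
table_9_counts = {
    "LR1": (15, 19, 72, 84),
    "LR2": (14, 19, 67, 84),
    "Subjective assessment": (14, 19, 73, 84),
}

results = {}
for method, (true_pos, n_malignant, true_neg, n_benign) in table_9_counts.items():
    sensitivity = percent(true_pos, n_malignant)
    specificity = percent(true_neg, n_benign)
    results[method] = (sensitivity, specificity)
    print(f"{method}: sensitivity {sensitivity}%, specificity {specificity}%")
```

With only 19 malignant masses, a change in the classification of a single mass shifts sensitivity by about five percentage points, which is worth keeping in mind when comparing the rows of Table 9.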

DISCUSSION

We have shown substantial variability within and between observers in the results of measurements included in LR1 and LR2 and also some variability when assessing the categorical variables included in the models. This resulted in substantial variability in the risks of malignancy calculated using LR1 and LR2. There was fair to good interobserver agreement when classifying tumors as benign or malignant using the predetermined risk-of-malignancy cut-off of 10%. The interobserver agreement when categorizing the tumors into three risk groups, as suggested by Daemen et al.[15], was also good. The most robust method for classifying tumors as benign or malignant, however, was subjective assessment of ultrasound findings.
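The classification rules referred to above can be written as a short sketch. The 10% cut-off and the 8.3–25.6% interval come from the text and the footnotes to Table 8. The function names, the group labels ("low", "uncertain", "high") and the treatment of the interval boundaries as inclusive are assumptions made for illustration.

```python
def classify_by_cutoff(risk_percent, cutoff=10.0):
    """Benign if the calculated risk is <= cutoff, malignant if it is > cutoff."""
    return "malignant" if risk_percent > cutoff else "benign"


def risk_group(risk_percent, low=8.3, high=25.6):
    """Three risk groups: < 8.3%, 8.3-25.6% (uncertain diagnosis), > 25.6%."""
    if risk_percent < low:
        return "low"
    if risk_percent <= high:
        return "uncertain"
    return "high"


def needs_second_stage_test(risk_percent, sonologist_uncertain):
    """Second-stage test if the risk falls in the uncertain interval,
    or if the sonologist is uncertain on subjective assessment."""
    return risk_group(risk_percent) == "uncertain" or sonologist_uncertain


# Hypothetical calculated risks (%), not taken from the study data.
for risk in (3.6, 9.0, 12.5, 40.0):
    print(risk, classify_by_cutoff(risk), risk_group(risk),
          needs_second_stage_test(risk, sonologist_uncertain=False))
```

Because both rules apply fixed thresholds to a calculated risk, any variability in the risk near a threshold translates directly into disagreement in classification, which is consistent with the lower agreement seen for the "second-stage test needed" row of Table 8.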

To the best of our knowledge, no published studies have evaluated the reproducibility of assessment of the IOTA ultrasound variables used to describe adnexal masses and none has estimated the reproducibility of the calculated risk of malignancy using LR1 or LR2. The fact that we used 3D ultrasound volumes to study reproducibility can be considered both a strength and a weakness of the current study. It may be considered a strength because both sonologists were exposed to identical ultrasound information and their assessments were not biased by any clinical information other than the patient's age. On the other hand, it may be considered a weakness because the reproducibility of live examinations might be superior to that of evaluation of 3D ultrasound volumes: image quality is likely to be better during a live scan, because the visualization of different parts of the tumor can be optimized by scanning from different angles, and the interactive nature of a live scan facilitates proper evaluation of ultrasound findings. Alternatively, the reproducibility of live examinations might be poorer because examination conditions may change, even if live examinations are performed only a few minutes apart.

It is a limitation of our study that we could not estimate either the reproducibility of the assessment of ascites or pain during the examination or that of retrieving information on personal history of ovarian cancer and current hormonal therapy. Some may argue that the results could be biased by the volumes having been acquired by the second author 12–18 months before analysis and that the second author may therefore have remembered some of the clinical information about the patients or the ultrasound appearance of the masses. However, the second author cannot recall a single patient for whom this was the case.

There are few reproducibility studies on ultrasound assessment of adnexal masses. Alcazar et al.[16, 17] found good intra- and interobserver agreement with regard to classifying adnexal tumors as benign or malignant using subjective evaluation of 3D ultrasound volumes. Yazbek et al.[18, 19] found interobserver agreement to be good in classifying adnexal masses as non-invasive or invasive malignant[18] when evaluating saved 2D ultrasound images. They also found that interobserver agreement depended on the confidence with which the diagnosis was suggested[19]. Other studies on adnexal masses have assessed the reproducibility of various types of Doppler examinations and most reported acceptable reproducibility[20-25].

In our study, interobserver agreement for the categorical variables was poorest for irregular internal cyst walls and color score, both of which rely heavily on subjective evaluation. Intra- and interobserver agreements in measurement results were also rather poor, i.e. the measurements were imprecise. Agreement was poorest for measurements taken on selected parts of the tumor (the largest solid component, the height of the largest papillary projection and the thickness of the thickest septum). This might be explained by the sonologists having measured different structures.

We think that the poorer-than-expected diagnostic performance of subjective assessment, and of LR1 and LR2, in the current study is explained by our evaluation of ultrasound findings being performed on 3D volumes rather than on live examinations. First, some image quality was lost when our volumes were transferred via the hospital intranet to our server and then copied to our personal computers. Second, when evaluating 3D volumes one cannot make use of the interactive nature of a real-time examination, which facilitates, for example, discrimination between a blood clot and solid tissue. Third, knowledge of clinical information, which apart from the patient's age was not available to the sonologists in this study, is likely to affect the interpretation of ultrasound findings in a way that improves the diagnostic performance of subjective assessment, and perhaps also that of LR1 and LR2. A study comparing diagnostic accuracy between real-time scanning and evaluation of static 2D images showed that ‘the preoperative diagnosis of an adnexal mass made on the basis of a real-time ultrasound examination is more precise than that made on the basis of saved static ultrasound images.’[26]

Calculated risks of malignancy of > 10% were unreliable. This means that if the risk of malignancy is calculated using information from 3D ultrasound volumes, any individual with an estimated risk of malignancy of > 10% must be regarded as being at high risk of ovarian cancer, but the magnitude of this risk cannot be determined with any precision. How can the reproducibility of the calculated risk using LR1 or LR2 be improved? In our study, both sonologists used the IOTA terms and definitions[9] to describe their findings. However, before the start of the study they had not discussed how to take measurements or how to assess the different ultrasound features in adnexal masses. This was a deliberate decision because we wanted our analysis to reflect the situation in real life, where most examiners using the IOTA terminology have no information other than that in the published paper describing the IOTA terms[9]. After completion of the study it became clear that even seemingly unambiguous definitions can be understood differently. Reproducibility of risk calculations could possibly be improved by reaching a consensus on how to evaluate the ultrasound variables included in the models. However, there is no gold standard representing the truth for any of the ultrasound variables.

The reproducibility of subjective assessment was superior to that of the calculated risk, reflecting the fact that looking at a tumor in its entirety instead of at single features provides the most reproducible results. Our findings are pertinent to analysis of ultrasound volumes and cannot be extrapolated to real-time examinations.

ACKNOWLEDGMENTS

This work was supported by the Swedish Medical Research Council (grant nos K2001-72X-11605-06A, K2002-72X-11605-07B, K2004-73X-11605-09A and K2006-73X-11605-11-3); funds administered by Skåne University Hospital; Allmänna Sjukhusets i Malmö Stiftelse för bekämpande av cancer (the Malmö General Hospital Foundation for fighting against cancer); and Landstingsfinansierad regional forskning and ALF-medel (two Swedish governmental grants from the region of Scania).
