The quest to find an accurate and simple test for discriminating between benign and malignant adnexal tumors has become the Holy Grail of gynecological ultrasound research. A plethora of diagnostic models have been developed over the last few decades, but none has gained universal acceptance in routine clinical practice. Few tests have endured a level of scrutiny similar to that of the logistic regression models for the diagnosis of ovarian cancer developed by the International Ovarian Tumor Analysis (IOTA) Group. Ten years after the standardized IOTA approach to morphological analysis of ovarian tumors was first described, we welcome the very first study investigating the intra- and interobserver agreement of the proposed diagnostic parameters. In addition, Sladkevicius and Valentin assessed the level of agreement in classifying adnexal tumors as malignant or benign, both by calculating the risk of malignancy using the IOTA logistic regression models LR1 and LR2 and on the basis of subjective assessment of ultrasound images. Assessment of repeatability and observer agreement is an essential part of the development of any diagnostic test, as it indicates how likely the test is to give identical or closely similar results each time it is conducted.
The authors created an arbitrary dataset consisting of a large number of stored three-dimensional ultrasound volumes, which were all recorded by one of the investigators. The observers were therefore faced with identical datasets. The observers were leading international experts in gynecological ultrasound, with very strong academic records, who had been working closely in the same institution for more than a decade. In view of all this, one would expect to see a very high level of intra- and interobserver agreement. The findings were quite the opposite: there was substantial variability in their assessment of various measurements and categorical variables included in the IOTA logistic regression models. Consequently, there was also substantial variability in the calculated risk of malignancy.
These findings are unexpected and they appear to be at odds with the results of the IOTA validation studies, which showed that the diagnostic models performed well when used by sonographers of varying ability in different institutions around the world. How do we reconcile these seemingly contradictory results?
Although the current study is methodologically and statistically robust, the use of stored three-dimensional volumes does not reflect the conditions of a standard pelvic ultrasound examination. This could have affected the confidence of the operators, causing them to depart from their usual examination techniques and leading to inaccuracies in the assessment of different morphological and Doppler variables. It is interesting, however, that the agreement between the operators was much better when they tried to determine the nature of adnexal tumors based on subjective assessment of ultrasound findings. This supports the previously expressed opinion that the very good performance of the IOTA models in prospective studies could be partly explained by the ability of the operators to determine the nature of an adnexal lesion subjectively whilst collecting the data using the IOTA protocol. These issues could be resolved by carrying out a similar study during live ultrasound examinations. Such a study, however, would be more difficult to organize and may itself not be entirely free of methodological imperfections.