Improving strategies for diagnosing ovarian cancer: a summary of the International Ovarian Tumor Analysis (IOTA) studies

Correspondence to: Prof. D. Timmerman, Department of Obstetrics and Gynecology, University Hospitals KU Leuven, Herestraat 49, 3000 Leuven, Belgium (e-mail: dirk.timmerman@uzleuven.be)

ABSTRACT

In order to ensure that ovarian cancer patients access appropriate treatment to improve the outcome of this disease, accurate characterization before any surgery on ovarian pathology is essential. The International Ovarian Tumor Analysis (IOTA) collaboration has standardized the approach to the ultrasound description of adnexal pathology. A prospectively collected large database enabled previously developed prediction models like the risk of malignancy index (RMI) to be tested and novel prediction models to be developed and externally validated in order to determine the optimal approach to characterize adnexal pathology preoperatively. The main IOTA prediction models (logistic regression model 1 (LR1) and logistic regression model 2 (LR2)) have both shown excellent diagnostic performance (area under the curve (AUC) values of 0.96 and 0.95, respectively) and outperform previous diagnostic algorithms. Their test performance almost matches subjective assessment by experienced examiners, which is accepted to be the best way to classify adnexal masses before surgery. A two-step strategy using the IOTA simple rules, supplemented with subjective assessment of ultrasound findings when the rules do not apply, also reached excellent diagnostic performance (sensitivity 90%, specificity 93%) and misclassified fewer malignancies than did the RMI. An evidence-based approach to the preoperative characterization of ovarian and other adnexal masses should include the use of LR1, LR2 or IOTA simple rules and subjective assessment by an experienced examiner. Copyright © 2012 ISUOG. Published by John Wiley & Sons, Ltd.

BACKGROUND

Correctly discriminating between benign and malignant adnexal masses is the essential starting point for optimal management. Most women with an adnexal mass do not have cancer[1]. Identifying women with benign pathology is important in order to avoid unnecessary morbidity as well as unnecessary costs[2]. Conversely, recognizing cancer means that treatment is not delayed and appropriate staging can be carried out in specialized surgical centers[3-6].

To characterize ovarian pathology as benign or malignant, biomarkers or various prediction models have been used to try to optimize diagnostic accuracy. These include simple scores based on the morphological appearance of a mass using ultrasonography[7-11]; an index including information on serum CA 125 levels, menopausal status and ultrasound findings (the risk of malignancy index (RMI))[12]; and more advanced mathematical models using logistic regression[13], neural networks[14] and other complex computational approaches[15].

The RMI[12] remains the most widely used prediction model for characterizing ovarian pathology in many countries. Although the RMI is based on several ultrasound markers, the serum CA 125 level heavily influences the predictions. A systematic review in 2009 concluded that the RMI was the best available test to triage patients with ovarian tumors for referral to tertiary oncology units[16].

In the USA, the American College of Obstetricians and Gynecologists (ACOG) guidelines for selecting patients for referral to a gynecologic oncology center rely on the use of biomarkers. However, although useful for predicting advanced-stage ovarian cancer, the ACOG protocol performs poorly for the detection of early-stage disease and in the subgroup of premenopausal women[17]. When the multivariate biomarker assay, OVA-1, was incorporated instead of serum CA 125 into these referral guidelines[18], the false-positive rate reached unacceptably high levels[19].

Several prediction models, other than those discussed above, have been developed with the aim of improving preoperative diagnostic tests for ovarian cancer. Most did not retain their original accuracy when subjected to external validation[16, 20-24]. This can be explained by the relatively small sample size of most studies in which models were developed, the use of single-center populations for model development, the heterogeneity of the tumors studied, variations in the definitions of ultrasound terms used and a lack of consistency regarding the reporting of histological findings. To minimize these shortcomings and to develop robust rules and prediction models that can be used by different examiners in various clinical settings, the International Ovarian Tumor Analysis (IOTA) study was established.

AIM OF THE IOTA STUDIES

The principal IOTA investigators set out to study a large cohort of patients with a persistent adnexal mass in different clinical centers, using a standardized ultrasound protocol[25]. The primary aim of the study was to develop rules and models to characterize ovarian pathology and subsequently to demonstrate their utility by both temporal and external validation in the hands of examiners with different levels of ultrasound expertise. Other aims related to the different IOTA projects included establishing the role of measurements of CA 125 and other serum tumor markers for diagnosis, and developing a better understanding of the characteristics of ovarian tumors difficult to classify as benign or malignant using ultrasound. Both a strength and a limitation of the early IOTA phases is that final histology is required for inclusion. This is common to the majority of studies seeking to characterize ovarian pathology. An important ongoing phase for IOTA is studying the long-term behavior of adnexal masses characterized as benign in order to validate IOTA models or rules also in patients in whom surgery is not performed (Figure 1).

Figure 1.

Objectives of the different phases of the International Ovarian Tumor Analysis (IOTA) study. 3D, three dimensional; LR1, logistic regression model 1; LR2, logistic regression model 2; RMI, risk of malignancy index.

DEVELOPMENT AND PERFORMANCE OF PREDICTIVE MODELS

The first step was to agree on standardized terms and definitions that could be used to describe adnexal pathology. These were published in 2000[25]. Subsequently, in Phase 1 of the study (1999–2002), ultrasound data from 1066 non-pregnant women with at least one persistent adnexal mass were collected from nine clinical centers in five countries. A training set of 754 patients (70.7%) was used for model development, and a test set containing the remaining 312 patients was used for internal validation of the models[26].

Between 2002 and 2005 (IOTA Phase 1b), we recruited 507 new consecutive patients, at three centers participating in Phase 1, for prospective temporal validation of the models that seemed to perform best on internal validation in Phase 1[27]. The aim of IOTA Phase 2 (2005–2007) was to externally validate the models. This involved the recruitment of a further 997 patients in 12 new centers that did not take part in Phase 1, and of 941 patients in seven centers from Phase 1 for further temporal validation (Figure 1)[28, 29].

Initially, 11 prediction models were derived from the IOTA 1 dataset. Scoring systems, simple ultrasound rules, logistic regression analysis, artificial neural networks (ANN) and kernel methods, such as support vector machine models, were developed[26, 30-33]. We found that more complex statistical modeling did not improve test performance appreciably in comparison with simpler statistical approaches, such as logistic regression[27]. Accordingly, we designated two relatively simple logistic regression models (logistic regression model 1 (LR1) and logistic regression model 2 (LR2)) as our principal models. LR1 included 12, and LR2 included six, demographic and ultrasound variables. The 12 variables used in LR1 were: (1) personal history of ovarian cancer; (2) current hormonal therapy; (3) age of the patient; (4) maximum diameter of the lesion; (5) pain during examination; (6) ascites; (7) blood flow within a solid papillary projection; (8) a purely solid tumor; (9) the maximum diameter of the solid component; (10) irregular internal cyst walls; (11) acoustic shadows; and (12) color score. The six variables in LR2 were: (1) age; (2) ascites; (3) blood flow within a solid papillary projection; (4) maximum diameter of the solid component; (5) irregular internal cyst walls; and (6) acoustic shadows. Any qualified ultrasound examiner scanning women with adnexal masses should be able to retrieve information on the variables required for both models.
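For readers less familiar with logistic regression, the general form shared by models such as LR1 and LR2 can be written as below. This is the textbook form only: the symbols (the intercept β0, the coefficients βi and the predictor values xi) are generic placeholders, and the fitted coefficients are not reproduced in this summary. Here k would be 12 for LR1 and six for LR2.

```latex
\[
  z = \beta_0 + \sum_{i=1}^{k} \beta_i x_i ,
  \qquad
  \hat{p}(\text{malignant}) = \frac{1}{1 + e^{-z}}
\]
```

The estimated probability is then compared with a chosen risk threshold to classify the mass as benign or malignant.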

Both models had excellent diagnostic performance on both the training and test data[26] and retained their accuracy at prospective temporal validation in three clinical centers using the IOTA 1b dataset (Table 1)[27]. In the IOTA study we have emphasized that good sensitivity is more important than specificity. However, interpreting indices of diagnostic performance is dependent upon the prevalence of pathology in the studied population. In the IOTA study, the overall prevalence of cancer was 28%, which implies that at a fixed specificity level of 75% with a sensitivity of 90%, for every five patients who undergo surgery for a presumed malignant mass only two will have a benign histology.
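The arithmetic behind the last statement can be checked with a short sketch. It uses only the three figures quoted above (prevalence 28%, sensitivity 90%, specificity 75%); the function name `surgical_case_mix` is our own label for illustration, not part of any IOTA software.

```python
def surgical_case_mix(prevalence, sensitivity, specificity):
    """Return (fraction malignant, fraction benign) among test-positive patients."""
    true_pos = prevalence * sensitivity                    # malignant and test positive
    false_pos = (1.0 - prevalence) * (1.0 - specificity)   # benign but test positive
    ppv = true_pos / (true_pos + false_pos)                # positive predictive value
    return ppv, 1.0 - ppv


# Figures quoted in the text: prevalence 28%, sensitivity 90%, specificity 75%.
ppv, benign_fraction = surgical_case_mix(0.28, 0.90, 0.75)
benign_per_five = 5 * benign_fraction

print(f"PPV = {ppv:.3f}; benign per five operated = {benign_per_five:.2f}")
```

With these inputs about 58% of test-positive patients have a malignancy, so roughly two of every five have benign histology. At a lower prevalence the same sensitivity and specificity would give a larger benign share.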

Table 1. Diagnostic performance of the main predictive models and rules for discrimination between benign and malignant adnexal masses derived by the International Ovarian Tumor Analysis (IOTA) study and of the risk of malignancy index (RMI)
| Model or rules | IOTA phase | Type of validation | n | Sensitivity (%) | Specificity (%) | LR+ | LR− | DOR | AUC |
|---|---|---|---|---|---|---|---|---|---|
| LR1 (cut-off 10%) | 1 | Development data[26] | 754 | 93 | 77 | 4.01 | 0.10 | 42.1 | 0.95 |
| | 1 | Internal (test set)[26] | 312 | 93 | 76 | 3.81 | 0.09 | 45.6 | 0.94 |
| | 1b | Temporal[27] | 507 | 95 | 74 | 3.68 | 0.07 | 55.8 | 0.95 |
| | 2 | Temporal[28, 29] | 941 | 93 | 81 | 4.77 | 0.09 | 52.8 | 0.94 |
| | 2 | External[28, 29] | 997 | 92 | 87 | 6.84 | 0.09 | 75.7 | 0.96 |
| LR2 (cut-off 10%) | 1 | Development data[26] | 754 | 92 | 75 | 3.71 | 0.10 | 35.5 | 0.93 |
| | 1 | Internal (test set)[26] | 312 | 89 | 73 | 3.36 | 0.15 | 23.1 | 0.92 |
| | 1b | Temporal[27] | 507 | 95 | 74 | 3.64 | 0.07 | 55.0 | 0.95 |
| | 2 | Temporal[28, 29] | 941 | 89 | 80 | 4.42 | 0.14 | 32.7 | 0.92 |
| | 2 | External[28, 29] | 997 | 92 | 86 | 6.36 | 0.10 | 66.1 | 0.95 |
| Simple rules with subjective expert assessment (a) | 1 | Development data[32] | 1066 | 91 | 90 | 8.84 | 0.10 | 84.4 | N/A |
| | 1b | Temporal[32] | 507 | 92 | 90 | 9.08 | 0.09 | 106 | N/A |
| | 2 | Temporal[37] | 941 | 92 | 93 | 12.28 | 0.09 | 142 | N/A |
| | 2 | External[37] | 997 | 90 | 93 | 12.63 | 0.11 | 120 | N/A |
| Subjective expert assessment | 1 | N/A | 1066 | 88 | 95 | 18.52 | 0.13 | 147 | N/A |
| | 1b | N/A | 507 | 90 | 93 | 12.63 | 0.11 | 120 | N/A |
| | 2 | N/A | 941 | 93 | 93 | 14.15 | 0.07 | 190 | N/A |
| | 2 | N/A | 997 | 87 | 92 | 11.00 | 0.14 | 80.7 | N/A |
| RMI† (cut-off 200) | 2 | External[28] | 997 | 67 | 95 | 12.7 | 0.34 | 36.8 | 0.91 |

(a) Results are shown for simple rules supplemented with subjective assessment of ultrasound findings when the rules did not apply.

† Missing values for CA 125 were handled using multiple imputation, and missing values for metastases were handled as explained in our external validation study[28].

AUC, area under the receiver–operating characteristics curve; DOR, diagnostic odds ratio; LR1, IOTA logistic regression model 1; LR2, IOTA logistic regression model 2; LR+, positive likelihood ratio; LR−, negative likelihood ratio; N/A, not applicable.
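The likelihood ratios and diagnostic odds ratios in Table 1 follow from sensitivity and specificity through the standard definitions LR+ = sensitivity / (1 − specificity), LR− = (1 − sensitivity) / specificity and DOR = LR+ / LR−. The sketch below applies these definitions to the rounded RMI percentages from the table; the function name is ours. Because the inputs are rounded to whole percentages, the output only approximates the tabulated values, which were presumably computed from unrounded counts.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) from sensitivity and specificity given as fractions."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    dor = lr_pos / lr_neg  # diagnostic odds ratio
    return lr_pos, lr_neg, dor


# RMI at cut-off 200 on external validation: sensitivity 67%, specificity 95%.
lr_pos, lr_neg, dor = likelihood_ratios(0.67, 0.95)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}, DOR = {dor:.1f}")
```

The computed values (about 13.4, 0.35 and 38.6) are close to, but not identical with, the 12.7, 0.34 and 36.8 reported in the table, as expected from the rounding.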

A problem with diagnostic models is that they are prone to produce good results on the populations on which they were developed. Therefore, a crucial step before incorporating any diagnostic test or prediction model into clinical practice is establishing whether they work in different patient populations and in different clinical settings. This involves external validation in centers unrelated to those in which the tests were developed[34, 35].

Phase 2 of the IOTA project (2005–2007) was designed to externally validate LR1 and LR2 and to compare their test performance with the RMI and other previously published non-IOTA models[28]. A useful tool to evaluate test performance is to construct receiver–operating characteristics (ROC) curves and to compare the area under the curve (AUC) for different diagnostic tests. The advantage of this approach is that the AUC is independent of the cut-off value applied and is therefore a more useful description of how a test performs. On external validation, LR1 outperformed other tests, such as RMI, for estimating the risk of malignancy in an ovarian mass, with an AUC of 0.96 (95% CI, 0.94–0.97) and, at a risk threshold of 10%, sensitivity and specificity of 92% and 87%, respectively. LR2 achieved an AUC of 0.95, with a sensitivity of 92% and specificity of 86% using the same 10% risk threshold. In contrast, the RMI achieved an AUC of 0.91 and, at its cut-off of 200, a sensitivity of 67% and specificity of 95% (Table 1). The adopted risk threshold of 10% means that a tumor predicted by the model as having a risk of 10% or more should be classified as malignant. Altering the risk threshold according to personal preference affects test performance. If a 5% risk threshold is considered more appropriate, the sensitivity for cancer would increase but at the expense of increasing the false-positive rate.
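The effect of moving the risk threshold can be illustrated with a small sketch. The risk values below are made up purely for illustration and are not IOTA data; the function name `sens_spec` is also ours. A mass is called malignant when its predicted risk is at or above the threshold, as described above.

```python
def sens_spec(risks, is_malignant, threshold):
    """Sensitivity and specificity when risk >= threshold is classified as malignant."""
    pairs = list(zip(risks, is_malignant))
    tp = sum(1 for r, m in pairs if m and r >= threshold)
    fn = sum(1 for r, m in pairs if m and r < threshold)
    tn = sum(1 for r, m in pairs if not m and r < threshold)
    fp = sum(1 for r, m in pairs if not m and r >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up predicted risks for illustration only; these are NOT IOTA data.
toy_risks = [0.92, 0.40, 0.12, 0.07, 0.03,                      # malignant masses
             0.30, 0.08, 0.06, 0.04, 0.02, 0.01, 0.01, 0.005]   # benign masses
toy_malignant = [True] * 5 + [False] * 8

sens_10, spec_10 = sens_spec(toy_risks, toy_malignant, 0.10)
sens_05, spec_05 = sens_spec(toy_risks, toy_malignant, 0.05)
print(f"10% threshold: sensitivity {sens_10:.2f}, specificity {spec_10:.3f}")
print(f" 5% threshold: sensitivity {sens_05:.2f}, specificity {spec_05:.3f}")
```

On this toy set, lowering the threshold from 10% to 5% raises sensitivity and lowers specificity. The direction of the trade-off is general, but the size of the change depends entirely on the data.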

The ROC curves for LR1, LR2, RMI and serum CA 125 for both premenopausal and postmenopausal women are shown in Figure 2. Figure 2a demonstrates an important diagnostic advantage for both LR1 and LR2 for characterizing adnexal tumors in premenopausal patients compared with the current reference test RMI or using CA 125 alone. The difference in AUC between LR1 and LR2 was small, irrespective of menopausal status, and this small difference is unlikely to be of clinical importance. This implies that a model with only six variables (LR2) has diagnostic performance very similar to one using 12 variables (LR1). The lower number of variables needed for LR2 and its excellent test performance may lead to clinicians favoring the use of LR2 over LR1 in clinical practice.

Figure 2.

Receiver–operating characteristics (ROC) curves of the International Ovarian Tumor Analysis (IOTA) logistic regression model 1 (LR1), IOTA logistic regression model 2 (LR2), risk of malignancy index (RMI) and CA 125 for premenopausal (a) and postmenopausal (b) patients using pooled data (n = 2757) from IOTA Phases 1, 1b and 2, excluding patients derived from our training set (n = 754). Missing values for CA 125 were handled using multiple imputation, and missing values for metastases were handled as explained in our external validation study[28]. (a) Area under the ROC curve (AUC) of LR1, 0.945 (95% CI, 0.928–0.959); LR2, 0.922 (95% CI, 0.901–0.940); RMI, 0.865 (95% CI, 0.827–0.896); and CA 125, 0.741 (95% CI, 0.701–0.777). (b) AUC of LR1, 0.928 (95% CI, 0.910–0.942); LR2, 0.915 (95% CI, 0.895–0.931); RMI, 0.909 (95% CI, 0.889–0.926); and CA 125, 0.886 (95% CI, 0.862–0.906).

In postmenopausal patients (Figure 2b), there is little difference in diagnostic performance between the main IOTA models and RMI. However, the ultrasound-based models have the advantage of offering an instant diagnosis.

The performance of any test for the detection of primary Stage I disease is of particular interest, because treatment for early-stage disease is associated with high survival rates. For Stage I tumors, we found that the logistic regression models had a higher detection rate than the RMI[28]. LR2 missed fewer malignancies of any kind than did the RMI or CA 125 alone when each was applied at its most commonly used clinical cut-off. Table 2 shows the number and types of malignancies that were missed by LR2, RMI and serum CA 125. Of course, the number of false negatives depends on the cut-off adopted.

Table 2. False-negative test results with regard to malignancy for logistic regression model 2 (LR2), the simple rules combined with subjective assessment, risk of malignancy index (RMI) and CA 125 when applying them to International Ovarian Tumor Analysis (IOTA) Phase 1b and Phase 2 data (n = 2445)
False negative (n (%))

| Classification approach | Invasive Stage I (a) (n = 122) | Invasive Stage II–IV (b) (n = 354) | Borderline (n = 131) | Metastatic (n = 78) | All malignancies (n = 685) |
|---|---|---|---|---|---|
| LR2, 10% risk threshold | 9 (7.4) | 13 (3.7) | 33 (25.2) | 4 (5.1) | 59 (8.6) |
| Simple rules combined with subjective assessment (c) | 12 (9.8) | 17 (4.8) | 26 (19.8) | 4 (5.1) | 59 (8.6) |
| RMI, threshold 200 | 58 (47.5) | 38 (10.7) | 87 (66.4) | 29 (37.2) | 212 (30.9) |
| CA 125, thresholds 65 IU/L and 35 IU/L for pre- and postmenopausal patients | 55 (45.1) | 31 (8.8) | 77 (58.8) | 26 (33.3) | 189 (27.6) |

(a) Includes rare primary invasive Stage I tumors.

(b) Includes rare primary invasive Stage II–IV tumors.

(c) Results are shown for simple rules supplemented with subjective assessment of ultrasound findings when the rules did not apply.

Missing values for CA 125 were handled using multiple imputation, and missing values for metastases were handled as explained in our external validation study[28].

We have retrospectively compared the RMI-based triage system advocated by the Royal College of Obstetricians and Gynaecologists (RCOG) with an IOTA-based alternative protocol using LR2. The IOTA protocol classified women as being at high risk if the estimated probability of malignancy according to LR2 was at least 25%, at low risk if the estimated probability was below 5% and at intermediate risk if the estimated probability was at least 5% but less than 25%[36]. This analysis suggests that if implemented, the IOTA alternative is likely to be better at avoiding unnecessary surgery or unnecessarily extensive surgery in benign disease whilst selecting more patients with cancer for appropriate referral to an oncological surgeon than the RCOG system[36]. This result held true, irrespective of the menopausal status of the patients and the type of unit in which the patient was examined[36].
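The LR2-based triage bands described above can be written as a small function. This is a minimal sketch of the cut-offs quoted in the text (below 5%, 5% to less than 25%, at least 25%); the function name and the string labels are our own choices, and the LR2 risk itself is assumed to be computed elsewhere.

```python
def lr2_triage(risk):
    """Map an LR2-estimated probability of malignancy (0 to 1) to a risk group."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if risk >= 0.25:
        return "high"          # at least 25%
    if risk >= 0.05:
        return "intermediate"  # at least 5% but less than 25%
    return "low"               # below 5%


for example_risk in (0.02, 0.05, 0.249, 0.25):
    print(example_risk, lr2_triage(example_risk))
```

The boundary cases matter: a risk of exactly 5% falls in the intermediate group and a risk of exactly 25% in the high-risk group, following the wording of the protocol.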

DEVELOPMENT, VALIDATION AND ROLE OF THE SIMPLE ULTRASOUND-BASED RULES

Subjective assessment of ultrasound images by experienced clinicians is the best way of characterizing ovarian pathology[24]. Many adnexal masses have a typical ultrasound appearance and will therefore be instantly correctly classified even by relatively inexperienced ultrasound examiners. To take this idea forward we established simple rules based on a number of clearly defined ultrasound features that can guide examiners without the need for a computer[32]. Using these simple rules no risk estimates are produced, but tumors are classified as benign, malignant or unclassifiable.

The simple rules consist of five ultrasound features of malignancy (M-features) and five ultrasound features suggestive of a benign mass (B-features). These features with corresponding ultrasound images are presented in Figure 3. A mass is classified as malignant if at least one M-feature and none of the B-features are present, and as benign if at least one B-feature and none of the M-features are present[32]. If no B- or M-features are present, or if both B- and M-features are present, then the rules are considered inconclusive (unclassifiable mass) and a different diagnostic method should be used. The simple rules were temporally and externally validated using the IOTA Phase 2 dataset of 1938 patients[37]. The rules could be applied to 77% of ovarian tumors. Because the simple rules have three possible outcomes (benign, inconclusive or malignant), ROC points can be constructed for them, which facilitates comparison with the prediction models LR1 and LR2. When we do this using the IOTA Phase 1b and 2 datasets (Figure 4), the performance of LR1, LR2 and the simple rules is similar. We suggest subjective assessment by an experienced examiner as a second-stage test for cases where the simple rules yield an inconclusive result. On external validation, this two-step strategy reached a sensitivity of 90% and a specificity of 93% to detect ovarian malignancy (Table 1)[37]. Moreover, the simple rules combined with subjective assessment when the rules did not apply misclassified fewer Stage I ovarian malignancies than did RMI and measurement of serum CA 125 (Table 2). In 2011 the RCOG included the simple rules in their guideline for evaluating ovarian pathology in premenopausal women[38]. However, in postmenopausal patients, serum CA 125 may play a role as a second-stage test, especially in centers with less-experienced ultrasound examiners. This hypothesis is currently being tested as part of the ongoing IOTA 4 project.
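The decision logic of the simple rules, and of the two-step strategy that falls back on subjective assessment, can be sketched as follows. The sketch takes only the number of M-features and B-features observed, since the individual features are defined in Figure 3 rather than in the text. The function names and the `expert_assessment` argument are hypothetical labels for illustration.

```python
def simple_rules(n_m_features, n_b_features):
    """Classify a mass from the counts of malignant (M) and benign (B) features present."""
    if n_m_features > 0 and n_b_features == 0:
        return "malignant"
    if n_b_features > 0 and n_m_features == 0:
        return "benign"
    # No features at all, or both M- and B-features: the rules do not apply.
    return "inconclusive"


def two_step_strategy(n_m_features, n_b_features, expert_assessment):
    """Use the simple rules first; fall back on an expert's classification if inconclusive."""
    result = simple_rules(n_m_features, n_b_features)
    return expert_assessment if result == "inconclusive" else result


print(simple_rules(2, 0))                    # malignant
print(simple_rules(0, 1))                    # benign
print(simple_rules(1, 1))                    # inconclusive
print(two_step_strategy(0, 0, "benign"))     # benign (taken from the expert)
```

Note that the expert's opinion is consulted only when the rules are inconclusive; when the rules apply, their result stands.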

Figure 3.

Ultrasound features used in the International Ovarian Tumor Analysis (IOTA) simple rules, illustrated by ultrasound images. B1–B5, benign features; M1–M5, malignant features.

Figure 4.

Receiver–operating characteristics (ROC) curves for logistic regression model 1 (LR1) and logistic regression model 2 (LR2), with ROC points for the simple rules superimposed. The red (simple rules: benign/inconclusive vs malignant) and green (simple rules: benign vs inconclusive/malignant) ROC points represent situations in which the ‘inconclusive tumors’ are classified as either benign or malignant, respectively. The results were obtained using pooled data (n = 2445) from International Ovarian Tumor Analysis (IOTA) Phases 1b and 2.

AN INTUITIVE APPROACH TO ULTRASOUND CHARACTERIZATION: THE USE OF ‘INSTANT’ DESCRIPTORS

An important learning point from the IOTA study is that almost half of ovarian masses have features that enable them to be characterized relatively easily (43% of the masses from Phases 1, 1b and 2). For example, ‘typical’ dermoid cysts, ‘typical’ endometriomas and late-stage ovarian cancer have very characteristic ultrasound features that should be recognized almost instantly by any ultrasound examiner. We retrospectively defined six ‘easy descriptors’[39] that should enable an examiner to make an ‘instant’ diagnosis of an ovarian mass without needing to use models, second-stage tests or seek a second opinion: four described features of common benign tumors, whilst two described features of malignancies (Figure 5).

Figure 5.

International Ovarian Tumor Analysis (IOTA) ‘easy descriptors’ illustrated by ultrasound images. BD1–BD4, benign descriptors; MD1–MD2, malignant descriptors.

When applied retrospectively to IOTA data, each one of these six descriptors had excellent diagnostic accuracy to predict whether a mass was benign or malignant. For the masses to which the descriptors could be applied, they had a sensitivity of 98% and a specificity of 97%[39]. If none of the six descriptors could be used, or if both a descriptor of a benign and a malignant mass were applicable, we considered the diagnosis as ‘non-instant’. In clinical practice, a second-stage test or an expert opinion is needed only for tumors where an instant diagnosis cannot be made. As a second-stage test for such masses, we retrospectively applied our previously developed simple rules, with real-time subjective assessment by an expert examiner as a third step when the simple rules could not be applied. This protocol gave a sensitivity of 92% and a specificity of 92% when retrospectively applied to all 1938 patients from IOTA Phase 2. It detected more ovarian cancers than if expert ultrasonography alone had been used in the whole study cohort, without increasing the false-positive rate (sensitivity and specificity of expert ultrasonography = 90% and 93%)[39]. Clearly, prospective external validation is needed before we can suggest incorporation of this approach into clinical protocols.
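The three-step protocol described above (easy descriptors, then simple rules, then expert assessment) can be sketched as a cascade. This is an illustrative outline only: the function name, argument names and returned labels are hypothetical, the descriptors are reduced to two booleans indicating whether any benign or any malignant descriptor applies, and the simple rules are reduced to feature counts.

```python
def three_step_protocol(benign_descriptor, malignant_descriptor,
                        n_m_features, n_b_features, expert_assessment):
    """Return (diagnosis, step) for the descriptor -> simple rules -> expert cascade."""
    # Step 1: an 'instant' diagnosis requires exactly one kind of descriptor to apply.
    if benign_descriptor != malignant_descriptor:
        diagnosis = "benign" if benign_descriptor else "malignant"
        return diagnosis, "easy descriptors"
    # Step 2: simple rules for 'non-instant' masses.
    if n_m_features > 0 and n_b_features == 0:
        return "malignant", "simple rules"
    if n_b_features > 0 and n_m_features == 0:
        return "benign", "simple rules"
    # Step 3: subjective assessment by an expert when the rules are inconclusive.
    return expert_assessment, "expert assessment"


print(three_step_protocol(True, False, 0, 0, "benign"))
print(three_step_protocol(False, False, 1, 0, "benign"))
print(three_step_protocol(True, True, 1, 1, "malignant"))
```

Each mass exits the cascade at the first step that gives a conclusive answer, so the expert is needed only for masses that are both non-instant and unclassifiable by the simple rules.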

RELEVANCE OF BIOMARKERS (CA 125 AND HUMAN EPIDIDYMIS SECRETORY PROTEIN-4) AND RISK OF OVARIAN MALIGNANCY ALGORITHM

The role of biomarkers, and in particular CA 125, in the diagnosis of ovarian cancer is controversial. Although widely used as part of the assessment of ovarian pathology, the results of the IOTA study suggest that measurements of serum CA 125 have a limited role in characterizing ovarian pathology, especially in premenopausal women[40]. Incorporating serum CA 125 measurements into logistic regression models has no significant impact on performance of the model for women of any age[40]. Moreover, when subjective assessment by an experienced ultrasound examiner was compared with serum CA 125 for discrimination between benign and malignant adnexal masses in the IOTA 1 dataset, subjective assessment performed significantly better[41]. This was independent of menopausal status, the specific histological diagnosis and the serum CA 125 cut-off level used. We have also shown that adding information on the serum CA 125 level to subjective assessment of ultrasound findings does not improve the diagnostic performance of experienced ultrasound examiners, irrespective of the diagnostic confidence of the examiner[42]. Upon further scrutiny, we found that single fixed cut-off values for serum CA 125 levels can reliably discriminate only between Stage II–IV invasive tumors and benign tumors that are not an abscess or endometrioma[43]. For all other types of tumor, serum CA 125 values overlap considerably.

On the other hand, if malignancy is suspected, preoperative measurement of the serum CA 125 level may be useful for postoperative follow up using serum CA 125 as a biomarker to detect progression during chemotherapy[44].

A great deal of research has been carried out to identify new biomarkers that can be used together with, or instead of, CA 125. Human epididymis secretory protein-4 (HE4) was found to be complementary to CA 125 for the detection of malignant disease. An initial report suggested that combining these two biomarkers increased overall sensitivity and specificity compared with the use of a single biomarker[45]. Based on these preliminary findings, HE4 was combined with CA 125 and menopausal status to form the Risk of Ovarian Malignancy Algorithm (ROMA)[46]. This algorithm showed significantly higher sensitivity for epithelial ovarian cancer than did the original RMI at a fixed specificity level of 75%[47]. Numerous studies have validated this new diagnostic method, with contradictory results. Some reports have confirmed the accuracy of ROMA[48-52], whilst others have found that ROMA did not outperform established diagnostic tests that incorporate CA 125 or HE4 for detection of cancer in an ovarian mass[53-55]. In a study by Van Gorp et al.[56], which used data from the IOTA database, it was demonstrated that expert subjective assessment of ultrasound findings was superior to ROMA for distinguishing benign from malignant adnexal masses.

FUTURE DIRECTIONS FOR THE IOTA PROJECT

Predicting subtypes of ovarian malignancy: use of polytomous models

Most ultrasound-based prediction models used in clinical practice have a dichotomous outcome (i.e. cancer or no cancer). However, different subtypes of malignancy (metastatic, primary invasive or borderline malignancy) are managed differently with implications in relation to type of surgery, length of hospitalization and financial cost. To achieve a more fine-tuned categorization, we developed polytomous (or multiclass) prediction models using logistic regression on the IOTA Phase 1 data to characterize ovarian pathology as benign, borderline malignant, primary invasive or metastatic[57]. The polytomous model had a test performance (AUC = 0.95) similar to that of LR1 and LR2 for discriminating between benign and malignant adnexal masses. In addition, the model was able to distinguish benign tumors from borderline, primary invasive and metastatic tumors (AUCs of 0.91, 0.95 and 0.93, respectively). These data are promising, but temporal and external validation revealed that the polytomous model could not discriminate between primary and metastatic invasive tumors[57]. To address this limitation we are now focusing on developing more robust multiclass models using a larger dataset (IOTA 1, 1b and 2). We aim to also differentiate between Stage I and Stage II–IV primary invasive malignancies.

Presentation of results of risk prediction models to facilitate interpretation

Two main approaches for the prediction of malignancy have emerged from the IOTA study. The first uses mathematical models that provide risk estimates. However, it is not straightforward to understand exactly how mathematical models work to obtain risk estimates. This is particularly the case when advanced state-of-the-art algorithms (e.g. support vector machines) are used, but even in the case of a logistic regression prediction model a detailed understanding is difficult. The second approach is based on simple rules. Such rules are appealing to clinicians as their working mechanism is clearer. However, such rule-based approaches are more susceptible to oversimplification, even though validation demonstrated very good test performance for adnexal mass characterization[37].

Using the IOTA dataset as a case study, we combined the advantages of advanced risk modeling techniques with the interpretability and attractiveness of simple scoring systems into the novel Interval Coded Scoring (ICS) system methodology for the development of scoring systems[58]. The ICS system combines elements from state-of-the-art techniques such as support vector machines, splines and L1 regularization[59] but presents the results as color bars that are easy to interpret. The ICS can be implemented in software packages, smartphone applications or on paper, which could be useful for bedside medicine[58]. The color-based representation is suitable for increasing interpretability by both the patient and the doctor, and might improve informed and shared decision making[60, 61]. Although highly promising, the ICS method should be successfully applied to several different diagnostic problems in order to demonstrate its ability to combine good performance with interpretability and user-friendliness.

Assessment of second-stage tests

In IOTA Phase 3 (2009–2012) we focus on improving diagnosis in ‘difficult adnexal tumors’[62, 63] by adding second-stage tests to conventional gray-scale and Doppler ultrasound. These include evaluation of the vascular tree of tumors using three-dimensional power Doppler[64] and novel biomarkers, such as HE4.

Performance of IOTA rules and models in the hands of examiners with different training and levels of experience

In a validation study, Nunes et al.[65] demonstrated that LR2 retains its diagnostic performance (AUC = 0.93) in the hands of a less-experienced operator. In IOTA Phase 4 (2012–2013), we will evaluate the performance of the IOTA model LR2, the simple rules, the easy descriptors and the RMI as first-stage tests in the hands of ultrasound examiners with less experience than those participating in the published IOTA studies and with different types of ultrasound training (sonographers and doctors).

Long-term behavior of ovarian masses managed expectantly

In IOTA Phase 5 (2012–2017), the main goal is to study the natural history of at least 3000 ovarian masses with benign ultrasound morphology managed conservatively. This should establish the risk of complications such as torsion and cyst rupture and the need for surgical intervention, and give an indication of the risk of malignant transformation. These results will enable us to suggest algorithms for suitable management – expectant management or surgery – of all types of adnexal pathology.

SUMMARY

Since the start of the IOTA project in 1999 we have examined a large number of patients with ovarian masses in IOTA Phases 1, 1b and 2 using ultrasonography with a rigorously defined protocol to prospectively collect detailed information about these tumors. The study has been carried out in over 20 different centers in different countries, in both district and oncology referral hospitals. The results have been consistent and so are likely to be robust and generalizable. To date, the IOTA study is the largest study in the literature on ultrasound diagnosis of ovarian pathology.

We have found that pattern recognition of the ultrasound features of an ovarian mass by an experienced clinician is the best way of characterizing ovarian pathology. Papillary projections are characteristic of borderline tumors and Stage I primary invasive epithelial ovarian cancer. A small proportion of solid tissue at ultrasound examination makes a malignant mass more likely to be a borderline tumor or a Stage I epithelial ovarian cancer than an advanced ovarian cancer, a metastasis or a rare type of tumor[66]. Our data suggest that information on serum CA 125 does not improve the diagnostic performance of subjective assessment by experienced ultrasound operators, and that it is not a necessary variable in multivariable prediction models developed to help classify ovarian masses as benign or malignant.

Two main approaches to the classification of ovarian masses have been developed using the IOTA database (Figure 6). The first uses risk-prediction models. The models LR1 and LR2, discussed above, have shown very good test performance on external validation. The second approach involves the use of either simple ultrasound-based rules or ‘easy descriptors’. These are based on ultrasound features that are virtually pathognomonic of either a benign or a malignant mass. The simple rules have been shown to apply to over 75% of masses, and have been successfully externally validated and taken up in a national protocol[38]. The use of the easy descriptors has yet to undergo external validation. For masses to which the simple rules do not apply it seems best to refer the patient for examination by an experienced ultrasound examiner. Such an approach is also applicable to risk-prediction models, for example using the LR2-based triage protocol described earlier[36]. Patients classified as being at intermediate risk of malignancy (i.e. LR2 risk of a malignant tumor is between 5 and 25%)[36] may be referred to an experienced examiner.
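
The LR2-based triage described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name `triage_lr2` and the category labels are hypothetical, the LR2 risk is taken as an already-computed probability (the model coefficients are not reproduced here), and treating the 5% and 25% cut-offs as part of the intermediate band is our assumption, since the text only says 'between 5 and 25%'.

```python
def triage_lr2(risk: float, low: float = 0.05, high: float = 0.25) -> str:
    """Assign a triage category from an LR2 risk estimate (a probability in [0, 1]).

    Assumption: risks exactly equal to a cut-off fall in the intermediate
    band; the source only states 'between 5 and 25%'.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if risk < low:
        return "low risk"
    if risk > high:
        return "high risk"
    # Intermediate-risk patients may be referred to an experienced examiner.
    return "intermediate risk: refer to experienced examiner"


if __name__ == "__main__":
    for r in (0.02, 0.10, 0.60):
        print(f"LR2 risk {r:.0%}: {triage_lr2(r)}")
```

Keeping the cut-offs as parameters makes it explicit that they are protocol choices rather than properties of the model itself.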

Figure 6. Flow chart showing different approaches using ultrasonography in the assessment of women with adnexal masses to estimate risk of malignancy, incorporating the evidence base of the International Ovarian Tumor Analysis (IOTA) study. LR1, IOTA logistic regression model 1; LR2, IOTA logistic regression model 2.

We have also created multiclass models[57] that can help clinicians to predict multiple outcomes in patients with an ovarian mass. Multiclass models can distinguish between benign, borderline, primary invasive and metastatic malignant disease and might play a central role in future decision-making. Their use is a logical extension of the current evolution toward an ‘individualized’ or ‘personalized’ approach to cancer treatment and healthcare in general.

RECOMMENDATIONS FOR CLINICAL PRACTICE BASED ON THE FINDINGS OF THE IOTA STUDY

These recommendations are based on validation of non-IOTA tests for classifying ovarian pathology and of tests developed from data in the IOTA study. Both non-IOTA tests and IOTA tests have undergone external validation in 12 different IOTA centers[28, 29, 37]. LR2 and the simple rules have also been validated outside the framework of the IOTA study[65, 67].

  1. The IOTA simple rules can be used to classify 75% of all adnexal masses as benign or malignant. The main advantage of these rules is their ease of use. This should make it easy to implement them in clinical practice.
  2. A two-step strategy with referral to a specialist in gynecological ultrasonography for subjective assessment of masses unclassifiable using the simple rules has excellent diagnostic test performance on external validation.
  3. A viable alternative to the simple rules is the logistic regression model, LR2. This model provides a benefit over the simple rules in that it can be applied to an entire tumor population and produces a risk estimate for ovarian malignancy. The latter is a key element in personalized healthcare and shared decision-making. As with the simple rules, some patients can be referred to a specialist if the risk estimate is considered intermediate or inconclusive.
  4. LR2 or the simple rules should be adopted as the principal test to characterize masses as benign or malignant in premenopausal women, in whom both perform much better than the RMI.
  5. Measurements of serum CA 125 are not necessary for the characterization of ovarian pathology in premenopausal women and are unlikely to improve the performance of experienced ultrasound examiners, even in the postmenopausal group. Ongoing studies are investigating its value in less-experienced hands.
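
The two-step strategy in recommendations 1 and 2 can be illustrated with a minimal sketch. The function names, string labels and the idea of passing in counts of benign (B) and malignant (M) features are hypothetical conveniences. The decision logic assumes the usual formulation of the simple rules: a mass is classified only when features of one type are present and features of the other type are absent, and is otherwise inconclusive. The expert's subjective assessment is represented simply as an optional input.

```python
from typing import Optional


def simple_rules(n_benign_features: int, n_malignant_features: int) -> str:
    """Step 1: classify a mass from counts of B and M ultrasound features.

    Assumption: the rules yield a result only when features of one type are
    present and none of the other type; otherwise they are inconclusive.
    """
    if n_benign_features < 0 or n_malignant_features < 0:
        raise ValueError("feature counts cannot be negative")
    if n_malignant_features > 0 and n_benign_features == 0:
        return "malignant"
    if n_benign_features > 0 and n_malignant_features == 0:
        return "benign"
    return "inconclusive"


def two_step(n_benign_features: int,
             n_malignant_features: int,
             expert_assessment: Optional[str] = None) -> str:
    """Step 2: fall back on subjective assessment when the rules do not apply."""
    result = simple_rules(n_benign_features, n_malignant_features)
    if result != "inconclusive":
        return result
    if expert_assessment is None:
        return "refer to experienced examiner"
    return expert_assessment


if __name__ == "__main__":
    print(two_step(2, 0))               # rules apply
    print(two_step(1, 1))               # rules do not apply, no expert opinion yet
    print(two_step(0, 0, "malignant"))  # expert opinion settles the case
```

The sketch keeps the two steps as separate functions so that the proportion of masses to which the rules apply can be inspected independently of the final classification.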

ACKNOWLEDGMENTS

The authors thank all participating centers, the principal investigators and the study participants for their contributions.

Recruitment Centers

University Hospitals Leuven (Belgium); Ospedale S. Gerardo, Università di Milano Bicocca, Monza (Italy); Ziekenhuis Oost-Limburg (ZOL), Genk (Belgium); Medical University in Lublin (Poland); University of Cagliari, Ospedale San Giovanni di Dio, Cagliari (Italy); Malmö University Hospital, Lund University (Sweden); University of Bologna (Italy); Università Cattolica del Sacro Cuore, Rome (Italy); DCS Sacco University of Milan (Milan A, Italy); General Faculty Hospital of Charles University, Prague (Czech Republic); Chinese PLA General Hospital, Beijing (China); King's College Hospital London (UK); Università degli Studi di Napoli, Napoli (Naples A, Italy); IEO, Milano (Milan B, Italy); Lund University Hospital, Lund (Sweden); Macedonio Melloni Hospital, University of Milan (Milan C, Italy); Università degli Studi di Udine (Italy); McMaster University, St. Joseph's Hospital, Hamilton, Ontario (Canada); and Istituto Nazionale dei Tumori, Fondazione Pascale, Napoli (Naples B, Italy).

IOTA Steering Committee

D. Timmerman, Leuven, Belgium; L. Valentin, Malmö, Sweden; T. Bourne, London, UK; A. C. Testa, Rome, Italy; S. Van Huffel, Leuven, Belgium; Ignace Vergote, Leuven, Belgium; and B. Van Calster, Leuven, Belgium.

IOTA Principal Investigators (in alphabetical order)

A. Czekierdowski, Lublin, Poland; Elisabeth Epstein, Lund, Sweden; Daniela Fischerová, Prague, Czech Republic; Dorella Franchi, Milano, Italy; Robert Fruscio, Monza, Italy; Stefano Greggi, Napoli, Italy; S. Guerriero, Cagliari, Italy; Jingzhang, Beijing, China; Davor Jurkovic, London, UK; Francesco P. G. Leone, Milano, Italy; A. A. Lissoni, Monza, Italy; Henry Muggah, Hamilton, Ontario, Canada; Dario Paladini, Napoli, Italy; Alberto Rossi, Udine, Italy; L. Savelli, Bologna, Italy; A. C. Testa, Rome, Italy; D. Timmerman, Leuven, Belgium; Diego Trio, Milan, Italy; L. Valentin, Malmö, Sweden; and C. Van Holsbeke, Genk, Belgium.

Details of ethics approval

The IOTA study protocol was approved by the Central Ethics Committee for Clinical Studies at the University Hospitals KU Leuven, Belgium, and by the Local Ethics Committee at each recruitment center.

Funding

The IOTA study was supported by the Research Council KUL: GOA MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC); Research Foundation – Flanders (FWO): projects G.0302.07 (SVM), G.0341.07 (Data fusion); IWT: TBM070706-IOTA3; Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, ‘Dynamical systems, control and optimization’, 2007–2011); IBBT (Flemish Government); Swedish Medical Research Council: grant nos K2001-72X 11605-06A, K2002-72X-11605-07B, K2004-73X-11605-09A and K2006-73X-11605-11-3; funds administered by Malmö University Hospital; and two Swedish governmental grants (ALF-medel and Landstingsfinansierad Regional Forskning). Ben Van Calster is a postdoctoral fellow of the Research Foundation – Flanders (FWO). For the IOTA 5 project we received a project grant from the FWO (grant G049312N). Tom Bourne is supported by the Imperial Healthcare NHS Trust NIHR Biomedical Research Center.
