To develop flexible classifiers that predict malignancy in adnexal masses using a large database from nine centers.
The database consisted of 1066 patients with at least one persistent adnexal mass for which a large amount of clinical and ultrasound data were recorded. The outcome of interest was the histological classification of the adnexal mass as benign or malignant. The outcome was predicted using Bayesian least squares support vector machines in comparison with relevance vector machines. The models were developed on a training set (n = 754) and tested on a test set (n = 312).
Twenty-five percent of the patients (n = 266) had a malignant tumor. Variable selection resulted in a set of 12 variables for the models: age, maximal diameter of the ovary, maximal diameter of the solid component, personal history of ovarian cancer, hormonal therapy, very strong intratumoral blood flow (i.e. color score 4), ascites, presumed ovarian origin of tumor, multilocular-solid tumor, blood flow within papillary projections, irregular internal cyst wall and acoustic shadows. Test set area under the receiver–operating characteristics curve (AUC) for all models exceeded 0.940, with a sensitivity above 90% and a specificity above 80% for all models. The least squares support vector machine model with linear kernel performed very well, with an AUC of 0.946, 91% sensitivity and 84% specificity. The models performed well in the test sets of all the centers.
Bayesian kernel-based methods can accurately separate malignant from benign masses. The robustness of the models will be investigated in future studies. Copyright © 2007 ISUOG. Published by John Wiley & Sons, Ltd.
Ovarian cancer is a common and highly lethal cancer of the female reproductive system. In 2003, 14 657 women died of ovarian cancer in the US [1]. Early detection of the cancer is important for survival [2–4]. An important aspect of the management of adnexal masses is planning the appropriate surgery. For example, it is important to recognize Stage I ovarian cancer and to avoid any spilling of cyst contents in order not to worsen the prognosis [2].
Previous scoring systems, logistic regression models, artificial neural networks, and support vector machines [5–12] for predicting the outcome of ovarian masses were often developed from small single-center samples using non-standardized data collection [13]. Such models are usually not robust and may not perform well when they are tested in new populations [14–18].
The present paper aims to construct mathematical models using least squares support vector machines [19] and relevance vector machines [20] to predict malignancy in adnexal masses. These methods can construct non-linear models and are thus an interesting alternative or addition to logistic regression models.
The International Ovarian Tumor Analysis (IOTA) study group collected the data using a standardized protocol. Women with an ovarian mass were recruited at nine different centers in Sweden, Belgium, Italy, France and the UK. Patients presenting with at least one overt persistent adnexal mass and who underwent ultrasound examination by a principal investigator at one of the participating centers were eligible for inclusion. A standardized examination technique and standardized terms and definitions were used [21]. When two masses were present, data from the most complex mass were used. Pregnant women were excluded. A web-based data entry module secured by a secure socket layer certificate was used for data collection [22]. Investigators were encouraged to measure serum CA-125 levels in peripheral blood for all patients, but information on CA-125 was not necessary for inclusion. The main outcome was the classification of the mass as either benign (0) or malignant (1). Other diagnostic information was also collected, i.e. the specific histological diagnosis and, in case of malignancy, the surgical stage. All tumors were classified following the criteria recommended by the International Federation of Gynecology and Obstetrics (FIGO) [23]. The IOTA study has been described in detail elsewhere [24].
The IOTA dataset consisted of information regarding 1066 tumors, 266 of which were malignant (25%). Among the malignant masses, 169 were primary invasive (63%), 55 were borderline malignant (21%) and 42 were metastatic (16%). Descriptive statistics of the dataset are given in Tables 1–3.
|Variable||Benign: n||Benign: median (range)||Benign: %*||Malignant: n||Malignant: median (range)||Malignant: %*|
|Age (years)||800||42 (17–90)||266||56 (17–94)|
|Number of years postmenopause||229||10 (1–40)||29||145||12 (0–44)||55|
|Parity||800||1 (0–10)||266||2 (0–7)|
|Maximal diameter of ovary (mm)||800||61 (11–320)||266||100 (13–410)|
|Maximal diameter of lesion (mm)||800||63 (11–320)||266||100.5 (8–410)|
|Volume of lesion (mL)||800||73 (0.2–7781)||266||303 (0.1–11829)|
|Fluid in pouch of Douglas (mm)||140||12 (2–61)||18||144||24 (3–100)||54|
|Septum thickness (mm)||343||2.1 (1–20)||43||143||4.0 (1–20)||54|
|Height of papillation (mm)||156||7 (2–62)||20||121||14 (3–62)||45|
|Maximal diameter of papillation (mm)||156||10 (3–90)||20||121||21 (4–110)||45|
|Volume of papillation (mL)||156||0.2 (0.006–70)||20||121||2 (0.008–226)||45|
|Ratio of papillation volume to lesion volume||156||0.003 (0–0.456)||20||121||0.006 (0–0.42)||45|
|Number of papillations†||156||1 (1–> 3)||20||121||> 3 (1–> 3)||45|
|Number of locules‡||800||1 (0–> 10)||266||3 (0–> 10)|
|Maximal diameter of largest solid component (mm)||309||21 (3–230)||39||244||50 (4–214)||92|
|Volume of the largest solid component (mL)||309||1.6 (0.006–1978)||39||244||34 (0.008–2291)||92|
|Ratio solid to lesion§||309||0.03 (0–1)||39||244||0.24 (0–1)||92|
|Pulsatility index (PI)||506||0.95 (0.13–5.8)||63||246||0.74 (0.25–2.26)||92|
|Resistance index (RI)||506||0.59 (0.12–1.0)||63||246||0.50 (0.17–1.0)||92|
|Peak systolic velocity (PSV, cm/s)||506||11.4 (2.0–85.5)||63||246||24.30 (3.9–202)||92|
|Time-averaged maximum velocity (TAMXV, cm/s)||498||6.9 (1.0–60.0)||62||241||17.0 (3.0–137)||91|
|CA-125 (U/mL)||567||17 (1–1409)||71||242||167 (4–31610)||91|
|Variable||Benign: n||Benign: % yes||Benign: %*||Malignant: n||Malignant: % yes||Malignant: %*|
|Family history of ovarian cancer||800||2.5||266||4.9|
|Family history of breast cancer||800||10.8||266||12.4|
|Personal history of ovarian cancer||800||0.8||266||3.0|
|Personal history of breast cancer||800||2.9||266||5.6|
|Postmenopausal bleeding within the year before ultrasound examination||229||14.9||29||145||17.9||55|
|Pelvic pain during the examination||800||28.8||266||19.6|
|Incomplete septum present||800||9.3||266||4.1|
|Suspected ovarian origin||800||81.9||266||83.5|
|Flow within at least one papillation||156||33.3||20||121||84.3||45|
|Presence of an irregular papillation||156||49.4||20||121||82.6||45|
|Irregular internal walls in the lesion||800||32.8||266||81.6|
|Venous blood flow†||800||9.0||266||3.4|
|Variable||Benign, n (%)||Malignant, n (%)|
|Unilocular||311 (38.9)||2 (0.8)|
|Unilocular, solid||88 (11.0)||44 (16.5)|
|Multilocular||176 (22.0)||20 (7.5)|
|Multilocular, solid||168 (21.0)||116 (43.6)|
|Solid||52 (6.5)||84 (31.6)|
|Not classifiable||5 (0.6)||0 (0.0)|
|Anechogenic||303 (37.9)||107 (40.2)|
|Low level||149 (18.6)||60 (22.6)|
|‘Ground glass’||192 (24.0)||33 (12.4)|
|Hemorrhagic||8 (1.0)||2 (0.8)|
|Mixed||114 (14.3)||18 (6.8)|
|No cyst fluid||34 (4.3)||46 (17.3)|
|Color score, blood flow|
|None (1)||222 (27.8)||11 (4.1)|
|Minimal (2)||311 (38.9)||42 (15.8)|
|Moderately strong (3)||219 (27.4)||109 (41.0)|
|Very strong (4)||48 (6.0)||104 (39.1)|
We used Bayesian least squares support vector machines (LS-SVMs) [25] and relevance vector machines (RVMs) [20] to classify tumors as benign or malignant. These methods are well suited to classification. They are mathematically complex but flexible: unlike, for example, logistic regression, they can create a non-linear decision boundary. An opinion paper recently published in this journal compared the characteristics of support vector machines with those of logistic regression analysis [26]. We constructed six models: Bayesian LS-SVM and RVM models with either a linear, radial basis function (RBF), or additive RBF kernel. Technical details about these methods are given in the appendix.
The dataset was randomly split into a training set containing 70% of the cases and a test set containing 30% of the cases. The split was stratified for outcome and center, so that the proportion of malignant cases and the proportion of cases from each center were the same in both sets. This training- and test-set division was the same as that used for developing and evaluating a logistic regression model [24]. The models were trained on the training set and their performance was then evaluated on the test set. Each model (including the logistic regression model [24]) was applied to the test-set data and the probability of malignancy for each individual was computed. These probabilities were then used to create a receiver–operating characteristics (ROC) curve [27]. The statistical significance of differences in areas under the ROC curve (AUC) was determined using a nonparametric method [28]. Confidence intervals (CI) for the AUCs were obtained using 100 bootstrap samples of the test set. This method takes into account that AUCs are bounded at 1, so that the CI for a high AUC should be asymmetric. The AUC for each bootstrap sample was calculated and the 95% CI was defined as the interval from the 2.5th to the 97.5th percentile. The models were also compared with the widely used risk of malignancy index (RMI) [5] by creating ROC curves restricted to the test-set cases for which CA-125 information was available.
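The percentile bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the fixed random seed, the nearest-rank percentile rule and the toy data are all hypothetical, and the AUC is computed directly as the proportion of malignant/benign pairs that are ranked correctly.

```python
import math
import random

def auc(labels, scores):
    """AUC as the fraction of (malignant, benign) pairs in which the malignant
    case has the higher score; tied pairs count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=100, seed=0):
    """Percentile 95% CI: resample cases with replacement, recompute the AUC,
    and read off the 2.5th and 97.5th percentiles (nearest-rank rule)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes to define an AUC
            aucs.append(auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[math.ceil(0.025 * n_boot) - 1]
    upper = aucs[math.ceil(0.975 * n_boot) - 1]
    return lower, upper

# Toy data (hypothetical): 20 benign and 10 malignant cases.
toy_labels = [0] * 20 + [1] * 10
toy_scores = [i / 40 for i in range(20)] + [0.3 + i / 20 for i in range(10)]
ci_low, ci_high = bootstrap_auc_ci(toy_labels, toy_scores)
```

Because the interval is read from the empirical distribution of resampled AUCs, it cannot extend beyond 1 and is free to be asymmetric around the point estimate.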
The models were also evaluated with respect to the obtained sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), and negative likelihood ratio (LR−). To obtain these measures, a cut-off had to be chosen on the probability of malignancy as estimated by the Bayesian model. The cut-off was always chosen based on the training-set results: because it was considered very important to identify malignant cases correctly, we chose the cut-off giving the highest possible sensitivity while keeping an acceptable specificity (i.e. 80%). The statistical significance of differences in sensitivity and specificity levels was investigated using McNemar's test for paired proportions. Because PPV and NPV levels depend on the number of tumors that are classified as malignant and benign, McNemar's test cannot be used to determine the statistical significance of differences in PPV or NPV. Instead, bootstrapping analyses were used to construct a 95% CI for the difference in PPV or NPV between models. If zero fell outside the 95% CI, evidence for a significant difference was obtained.
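The performance measures listed above follow directly from the four cells of the 2 × 2 table at a given cut-off. The sketch below is illustrative only: the function names are hypothetical, and the example counts are not reported in the paper but are inferred here, as an assumption, from the rounded Table 4 percentages for the LS-SVM model with linear kernel and the test-set sizes (75 malignant, 237 benign).

```python
def confusion_counts(labels, probs, cutoff):
    """Count TP, FP, TN, FN when malignancy is predicted for prob >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def diagnostic_measures(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV and likelihood ratios from counts."""
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "lr_pos": _ratio(sens, 1 - spec),
        "lr_neg": _ratio(1 - sens, spec),
    }

# Counts inferred (assumption) from Table 4, LS-SVM with linear kernel.
measures = diagnostic_measures(tp=68, fp=37, tn=200, fn=7)
```

With these inferred counts the measures round to the values in the LS-SVM (linear) row of Table 4, which suggests, but does not prove, that they are the underlying counts.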
The IOTA dataset is almost complete. Most missing values are structurally missing (the variable does not apply to the case) and were imputed with zero. The blood flow indices are the exception: when intratumoral blood flow was not sufficient to compute the peak systolic velocity, time-averaged maximum velocity, pulsatility index and resistance index, these indices were imputed with 2 cm/s, 1 cm/s, 3, and 1, respectively. In addition, CA-125 information was unavailable for 257 cases (24%). Therefore, we decided not to include CA-125 in the variable selection process.
To select the predicting variables for the models, a forward selection method was used within the context of Bayesian LS-SVM models with either a linear or an RBF kernel. In each step of the selection process, the variable that most increased the Bayesian model evidence (see appendix) was added to the model. The selection process was stopped when no additional variable increased the model evidence further. Because this forward selection method may select many variables, an additional backward elimination procedure may be necessary in order to obtain a more reliable set of predicting variables. Such backward elimination can be done subjectively based on knowledge of the variables (e.g. knowledge about the subjectivity and measurement accuracy of variables and about associations between variables) and also by investigating which variables decrease the model evidence least when removed.
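The greedy forward step can be sketched generically as below. This is an illustration under assumptions, not the procedure as implemented by the authors: `forward_select` and `toy_score` are hypothetical names, and the toy score is a stand-in for the Bayesian model evidence, which is not computed here.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    `score(subset)` must return a number where larger is better; in the paper
    this role is played by the Bayesian model evidence. Variables are added
    one at a time until no addition improves the score.
    """
    selected = []
    remaining = list(candidates)
    best = score(selected)
    while remaining:
        trials = [(score(selected + [v]), v) for v in remaining]
        top_score, top_var = max(trials, key=lambda t: t[0])
        if top_score <= best:  # no candidate improves the score: stop
            break
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected

# Hypothetical stand-in for model evidence: each variable brings a fixed gain,
# and every included variable costs a complexity penalty of 0.5.
TOY_GAINS = {"age": 2.0, "ascites": 1.0, "parity": 0.1}

def toy_score(subset):
    return sum(TOY_GAINS[v] for v in subset) - 0.5 * len(subset)

chosen = forward_select(["age", "ascites", "parity"], toy_score)
```

In this toy setting the third variable is rejected because its gain does not offset the penalty, mirroring the stopping rule described above.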
The adopted variable selection procedure selected 12 variables: three continuous variables (maximal diameter of the solid component, age, maximal diameter of the whole ovary) and nine binary variables (presence of ascites, blood flow within papillary projections, acoustic shadows, irregular internal cyst walls, very strong intratumoral blood flow, hormonal therapy, a multilocular-solid tumor, personal history of ovarian cancer, and suspicion at ultrasound examination of ovarian origin of the tumor). This subset of selected variables served as the variable set for all models that were fitted. The maximal diameter of the solid component was bounded at 50 mm (all values > 50 mm were replaced with 50 mm). The logistic regression model also used this bounding [24].
The performance of the models on the whole test set (n = 312) is shown in Table 4. The performance of a logistic regression model developed using the IOTA dataset [24] and of the RMI [5] are shown for comparison. The test-set AUCs of the six LS-SVM and RVM models (ranging from 0.943 to 0.949; Figures 1 and 2) did not differ significantly (all P > 0.15), nor did they differ significantly from the AUC of the logistic regression model (AUC = 0.942; all P > 0.25). However, when applied to the test-set cases with CA-125 information (n = 236; Table 4), the AUC of all six LS-SVM and RVM models was significantly larger than the AUC of the RMI (all P < 0.004).
|Model (kernel)||AUC (SE)||95% CI||AUC (CA-125)*||Cut-off||Sensitivity (%)||Specificity (%)||PPV (%)||NPV (%)||LR+||LR−|
|LS-SVM (linear)||0.946 (0.019)||0.905–0.974||0.940 (0.020)||0.15||91||84||65||97||5.81||0.11|
|LS-SVM (RBF)||0.945 (0.019)||0.906–0.976||0.941 (0.020)||0.12||93||81||61||97||4.91||0.08|
|LS-SVM (addRBF)||0.943 (0.019)||0.901–0.977||0.937 (0.021)||0.12||91||82||62||97||5.12||0.11|
|RVM (linear)||0.949 (0.018)||0.909–0.977||0.942 (0.020)||0.20||92||82||62||97||5.19||0.10|
|RVM (RBF)||0.946 (0.019)||0.905–0.975||0.940 (0.021)||0.15||92||81||60||97||4.74||0.10|
|RVM (addRBF)||0.946 (0.019)||0.905–0.977||0.941 (0.020)||0.15||93||81||61||97||5.03||0.08|
|LR [24]||0.942 (0.017)||0.901–0.973||0.936 (0.020)||0.10||93||76||55||97||3.81||0.09|
The models also reached the sensitivity and specificity goals that were set in advance. The kernel methods developed in the present paper even obtained a test-set sensitivity of more than 90% and a specificity of more than 80%. A McNemar test to compare the performance for benign test-set cases (n = 237) revealed that all six kernel methods outperformed the logistic regression model with respect to specificity (all P < 0.0075). Similar tests for the malignant test-set cases (n = 75) did not yield statistically significant differences with respect to sensitivity (all P > 0.15). Bootstrap analyses indicated that the kernel methods also outperformed the logistic regression model with respect to PPV (results not shown). On the present test set, a linear model performed best: the LS-SVM model with linear kernel achieved 91% sensitivity, 84% specificity, 65% PPV and 97% NPV.
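A paired comparison of this kind depends only on the discordant cases, i.e. those that one model classifies correctly and the other does not. The sketch below uses the exact (binomial) form of McNemar's test; whether the authors used the exact or the chi-squared form is not stated, so this choice is an assumption, and the function name is hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P-value.

    b = cases that model A classifies correctly and model B incorrectly,
    c = cases that model A classifies incorrectly and model B correctly.
    Under the null hypothesis each discordant case is equally likely to fall
    in either group, so min(b, c) follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(0, 5)
```

For specificity the test is applied to the benign cases only, and for sensitivity to the malignant cases only, as in the comparisons reported above.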
Figure 3 displays the LS-SVM classifier with linear kernel, the behavior of which is very similar to that of the RVM model with linear kernel. We could have made similar figures for the models using additive RBF kernels, but since the models with linear kernels performed as well as those with non-linear kernels, this was not necessary. For the continuous variables, the maximal diameter of the ovary (which is often identical to the maximal diameter of the lesion), the maximal diameter of the largest solid component and the woman's age, this linear model associates higher values with a higher probability of malignancy. Dichotomous variables that raise the probability of malignancy are ascites, blood flow within papillary projections, an irregular internal cyst wall, very strong intratumoral blood flow, personal history of ovarian cancer, and a tumor that is suspected by the sonologist to be of ovarian origin. On the other hand, acoustic shadows, hormonal therapy, and a multilocular-solid tumor decrease the model-based probability of malignancy.
The six Bayesian models were also applied to the test-set cases of each center separately. The best and worst of the six test-set AUCs are depicted in Figure 4. In general, the difference between the best and worst AUC values was smaller for centers that contributed most to the dataset. For the two centers that contributed most to the dataset, Malmö and Leuven, the difference was very small. In most centers the models had a test-set AUC exceeding 0.900. The two largest centers in the study had no test-set AUC below 0.950. In the BFR and MIT centers (see Figure 4), all AUCs exceeded 0.950.
The IOTA database provides a good starting point for the development of robust decision support systems that perform well in various situations. In a first step, a logistic regression model was developed [24]. This model performed very well on the test data. We applied Bayesian LS-SVMs and RVMs to the database. These models are able to capture non-linearity in the classification task and are specifically designed to avoid over-fitting of the training data, which impairs model performance on new data. Despite their appealing rationale, neither SVMs nor RVMs are yet frequently used in medical applications. Some exceptions can be found in work by De Smet et al. [29], Bowd et al. [30] and Majumder et al. [31].
The kernel-based models that were developed in this paper had very good test-set performance, with AUCs varying from 0.943 to 0.949. The logistic regression model achieved a test-set AUC of 0.942 [24]. None of the differences in AUC was statistically significant. Sensitivity and specificity were very high for all models, but the kernel-based models obtained a significantly better specificity level than did the logistic regression model. Of course, this difference needs to be confirmed in new prospective studies. To illustrate the results as obtained on the IOTA dataset, the well-performing LS-SVM model with linear kernel would yield around 218 true positives, 641 true negatives, 119 false positives, and 22 false negatives in a set of 1000 patients. The logistic regression model would result in 224, 574, 186, and 16 patients, respectively. The significant difference in specificity between the kernel-based model and the logistic regression model may have arisen owing to small differences in the way cut-offs were chosen. In the present paper we focused on a specificity of at least 80%, while Timmerman et al. [24] focused on a specificity of 75%. Tables in Timmerman et al. [24] show that a cut-off of 0.15 (91% sensitivity and 82% specificity on the training set) instead of 0.10 (93% and 77%, respectively) would have resulted in 92% sensitivity and 81% specificity on the test set, a performance similar to that of the kernel-based models [24].
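The per-1000 figures quoted above can be reproduced by rescaling test-set counts to a cohort of 1000 patients. The sketch below is a hedged reconstruction: the counts are not reported in the paper and are inferred here, as an assumption, from the rounded Table 4 percentages and the test-set sizes (75 malignant and 237 benign cases); the function name is hypothetical.

```python
def per_thousand(tp, fp, tn, fn, total=1000):
    """Rescale the four confusion-matrix counts to a cohort of `total` patients."""
    n = tp + fp + tn + fn
    cells = (("TP", tp), ("TN", tn), ("FP", fp), ("FN", fn))
    return {name: round(total * count / n) for name, count in cells}

# Inferred test-set counts (assumption; n = 312 in both cases).
lssvm_linear = per_thousand(tp=68, fp=37, tn=200, fn=7)
logistic = per_thousand(tp=70, fp=58, tn=179, fn=5)
```

Under these assumed counts the rescaling returns the figures given in the text for both models, and the gap between them lies almost entirely in the false positives, consistent with the reported difference in specificity.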
Test-set performance varied between centers, but was good to very good in all of them. From a modeling perspective, our results indicate that a non-linear model is not necessary because it adds complexity without any obvious gain in efficiency. In that sense, a Bayesian LS-SVM model with linear kernel is appealing because it is linear, has a unique solution (as opposed to RVM models), avoids over-fitting of the training set, and may have better specificity than the logistic regression model.
Interestingly, all models developed on the IOTA database clearly outperform the RMI. It must be emphasized, however, that the RMI was tested prospectively while the other models were all developed on the training set of the database of this study, and so the performance of the LS-SVMs and RVMs is probably slightly overestimated.
Despite Table 3 showing multilocular-solid tumors to be malignant more often than tumors that are not multilocular-solid (relative risk = 2.2), in the LS-SVM and RVM models a multilocular-solid tumor decreased the risk of malignancy. The models use several variables simultaneously, some of which may be interrelated. This can cause results one would not expect from univariate data analysis. For example, patients with a multilocular-solid tumor more often had very strong intratumoral blood flow, irregular internal cyst walls, blood flow within at least one papillary projection, ascites, and large maximal diameters of the ovary and the solid component. All of these variables are associated with higher risk of malignancy (Tables 1–3, Figure 3).
It is possible to build stand-alone software for clinical use of the LS-SVM and RVM models.
A crucial next step is to collect new data at the same centers as in this study as well as at completely new centers in order to evaluate prospectively the performance of the LS-SVM and RVM models.
This research was supported by Research Council KUL: GOA-AMBioRICS, CoE EF/05/006 Optimization in Engineering, several PhD/postdoc & fellow grants, Flemish Government: FWO (research communities (ICCoS, ANMMM)), Belgian Federal Science Policy Office IUAP P5/22 (‘Dynamical Systems and Control: Computation, Identification and Modeling’), EU: BIOPATTERN (FP6-2002-IST 508803), ETUMOUR (FP6-2002-LIFESCIHEALTH 503094), Healthagents (IST–2004-27214), the Swedish Medical Research Council: (grants nos. K2001-72X-11605-06A, K2002-72X-11605-07B and K2004-73X-11605-09A), Swedish governmental grants: (Landstingsfinansierad regional forskning (Region Skåne and ALF-medel)), and Funds administered by Malmö University Hospital.
Support vector machines (SVMs) [25] transform the ‘input space’ (i.e. the multidimensional scatter plot of the predicting variables) into a high-dimensional ‘feature space’ (Figure A1). In the feature space a linear separation is sought (Figure A1) that aims for a good trade-off between maximization of the margin between both classes (this limits the model's complexity) and minimization of the number of training-set misclassifications. This trade-off (finding a simple model that is still able to distinguish between benign and malignant tumors) enhances the generalizing ability of the model to new data. How the feature space is constructed depends on the kernel function used. The kernel function is a measure of similarity between the data of two patients. A linear kernel results in a linear separation between the two classes in the original input space (Figure A2). On the other hand, non-linear kernels such as the radial basis function (RBF) kernel create a non-linear decision boundary in the input space (Figure A3).
Let us denote the data values for a patient by $x$. The SVM model makes a prediction for the patient based on the following model formulation: $\alpha_1 K(x, x_1) + \dots + \alpha_N K(x, x_N) + b$, where $K(\cdot, \cdot)$ is the kernel function giving the similarity between the data $x$ of our patient and the data $x_i$ of training-set patient $i$ ($i = 1, \dots, N$), $\alpha_i$ is the weight for training case $i$, reflecting the importance of that case in the model, and $b$ is a constant. The sign of the result of the model formulation determines the hard prediction (benign or malignant) made by the model: if the result is negative, the tumor is predicted to be benign, while a positive result leads to the prediction of malignancy. This can be compared to the logistic regression model, in which the hard prediction depends on whether the predicted probability of the outcome of interest (e.g. malignancy) is above or below 0.50. Of course, this cut-off (0 for SVMs, 0.50 for logistic regression) can be adapted in order to aim for desired sensitivity and specificity levels.
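The model formulation above can be written out in a few lines. The following is a minimal sketch with hypothetical weights and data; the RBF parameterization $\exp(-\lVert x - z \rVert^2 / (2\sigma^2))$ is one common convention and is an assumption here, since the appendix does not specify it.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the inner product of the two patients' data."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel; similarity decays with squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def decision_value(x, train_x, alphas, b, kernel):
    """alpha_1 K(x, x_1) + ... + alpha_N K(x, x_N) + b."""
    return sum(a * kernel(x, xi) for a, xi in zip(alphas, train_x)) + b

def hard_prediction(value, cutoff=0.0):
    """Values above the cut-off are predicted malignant, otherwise benign."""
    return "malignant" if value > cutoff else "benign"

# Hypothetical two-case training set with hand-picked weights.
train_x = [[1.0, 0.0], [0.0, 1.0]]
alphas = [1.0, -1.0]
value = decision_value([2.0, 0.5], train_x, alphas, b=0.0, kernel=linear_kernel)
label = hard_prediction(value)
```

Swapping `linear_kernel` for `rbf_kernel` changes only the similarity measure; the weighted-sum form of the prediction stays the same.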
SVMs end up using only a few training-data cases for constructing the decision boundary. These cases are called the support vectors. The other training cases have zero weight ($\alpha_i = 0$). The support vectors are cases that usually lie close to the decision boundary. Also, the process to find the $\alpha_i$ values leads to a unique solution, which is a major advantage over other flexible methods such as artificial neural networks or relevance vector machines (RVMs). A variant of standard SVMs has been developed [19, 32] in which the training process involves solving a set of linear equations. This variant, called least squares SVMs (LS-SVMs), greatly simplifies SVM training while retaining the attractive advantages of SVMs described above.
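For orientation, the set of linear equations takes the following form in the usual LS-SVM classifier formulation. The notation is not defined in this appendix and is introduced here as an assumption: class labels $y_i \in \{-1, +1\}$ collected in the vector $y$, a regularization constant $\gamma > 0$, the $N \times N$ identity matrix $I_N$ and the all-ones vector $1_N$. In this formulation the decision function is $\sum_i \alpha_i y_i K(x, x_i) + b$, so the class label is written separately rather than absorbed into the weight $\alpha_i$ as in the formulation above.

```latex
\begin{bmatrix} 0 & y^{\top} \\ y & \Omega + \gamma^{-1} I_N \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 1_N \end{bmatrix},
\qquad
\Omega_{ij} = y_i \, y_j \, K(x_i, x_j)
```

Solving this single linear system yields $b$ and all $\alpha_i$ at once, which is what makes training simpler than the quadratic programming problem of a standard SVM.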
A disadvantage of (LS-)SVM classifiers is that they do not provide class probabilities. Applying a Bayesian framework to LS-SVMs can overcome this drawback [33, 34]. Bayesian analysis accounts for uncertainty in the estimates of the model parameters by seeking a posterior probability distribution on those parameters. The prior distribution reflects the prior knowledge or belief about likely values of the model parameters; it is combined with the information about the parameters in the collected data to give the posterior distribution. This involves solving highly complex integrals. The procedure of MacKay [33] avoids this complexity by making approximations in order to obtain the posterior distribution. Class probabilities are obtained by mathematically taking into account the full posterior distribution, i.e. by taking into account the uncertainty in the estimated model parameters.
The Bayesian perspective contains methods for model comparison by computing the probability of the data obtained when a particular model is assumed to underlie the data. If this probability is computed for two competing models then the model for which this probability is higher is said to have higher evidence. This part of the Bayesian methodology was used for predictor variable selection.
In relevance vector machines (RVMs) [20], a model $y(x) = \alpha_1 K(x, x_1) + \dots + \alpha_N K(x, x_N) + b$ (cf. supra) is also approached from a Bayesian perspective using MacKay's procedure. A fundamental difference between SVMs and RVMs is that RVMs do not use a feature space, so the basis function $K(\cdot, \cdot)$ in the model can be any function (therefore, RVMs are not truly kernel based). The RVM applies MacKay's procedure by using a normal (Gaussian) prior probability distribution for each training sample's weight $\alpha_i$ ($i = 1, \dots, N$). This Gaussian prior has mean zero and variance $\beta_i$. Fitting the RVM model to the data results in many of the $\beta_i$ approaching zero. The posterior distribution for the corresponding $\alpha_i$ then has mean zero and variance zero, so that the $i$th data sample has no influence and need not be used. The remaining data samples are similar to the support vectors in SVM models, but they are not the samples close to the decision boundary (cf. SVMs); rather, they can be seen as prototypical examples of the classes. An important disadvantage of RVM models is that there is no unique solution for the $\alpha_i$. Since RVMs are formulated within a Bayesian framework they also yield class probabilities. The Bayesian framework for LS-SVMs is applied in a different way, but is too complex to describe concisely in this appendix. Readers are referred to the paper by Van Gestel et al. [34].
A disadvantage of using non-linear kernels is that it is difficult to disentangle the influence of each variable on the outcome. Linear models do not have this disadvantage since linear models generally use a weighted sum of the predicting variables to predict the outcome such that we can examine the effect of each predicting variable separately. Therefore, additive methods have been developed for kernel techniques [35] and were applied in the present paper. Thus, six models are presented in this paper: Bayesian LS-SVMs and RVMs with either a linear, RBF, or additive RBF kernel.
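One common way to build an additive kernel is to sum one-dimensional RBF kernels, one per predicting variable. Whether this is exactly the construction used in the paper is an assumption; the sketch below, with hypothetical names, weights and data, only illustrates why such a kernel lets the decision value be split into per-variable contributions.

```python
import math

def rbf_1d(a, b, sigma=1.0):
    """One-dimensional RBF similarity between two values of one variable."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))

def additive_rbf_kernel(x, z, sigma=1.0):
    """Additive RBF kernel: the sum of per-variable RBF kernels."""
    return sum(rbf_1d(a, b, sigma) for a, b in zip(x, z))

def variable_contributions(x, train_x, alphas, sigma=1.0):
    """Contribution of each variable to the decision value (constant b excluded).

    Because the kernel is a sum over variables, the weighted sum over training
    cases can be regrouped variable by variable.
    """
    return [
        sum(a * rbf_1d(x[d], xi[d], sigma) for a, xi in zip(alphas, train_x))
        for d in range(len(x))
    ]

# Hypothetical training cases and weights.
train_x = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
alphas = [0.8, -0.5]
x_new = [0.5, 0.5, 2.0]
contributions = variable_contributions(x_new, train_x, alphas)
total = sum(a * additive_rbf_kernel(x_new, xi) for a, xi in zip(alphas, train_x))
```

The per-variable contributions sum to the full kernel expansion, so each variable's effect can be inspected separately even though each effect may be non-linear.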