Preoperative diagnosis of ovarian tumors using Bayesian kernel-based methods

Authors


Abstract

Objectives

To develop flexible classifiers that predict malignancy in adnexal masses using a large database from nine centers.

Methods

The database consisted of 1066 patients with at least one persistent adnexal mass for which a large amount of clinical and ultrasound data were recorded. The outcome of interest was the histological classification of the adnexal mass as benign or malignant. The outcome was predicted using Bayesian least squares support vector machines in comparison with relevance vector machines. The models were developed on a training set (n = 754) and tested on a test set (n = 312).

Results

Twenty-five percent of the patients (n = 266) had a malignant tumor. Variable selection resulted in a set of 12 variables for the models: age, maximal diameter of the ovary, maximal diameter of the solid component, personal history of ovarian cancer, hormonal therapy, very strong intratumoral blood flow (i.e. color score 4), ascites, presumed ovarian origin of tumor, multilocular-solid tumor, blood flow within papillary projections, irregular internal cyst wall and acoustic shadows. The test-set area under the receiver–operating characteristics curve (AUC) exceeded 0.940 for all models, with a sensitivity above 90% and a specificity above 80%. The least squares support vector machine model with linear kernel performed very well, with an AUC of 0.946, 91% sensitivity and 84% specificity. The models performed well in the test sets of all the centers.

Conclusions

Bayesian kernel-based methods can accurately separate malignant from benign masses. The robustness of the models will be investigated in future studies. Copyright © 2007 ISUOG. Published by John Wiley & Sons, Ltd.

Introduction

Ovarian cancer is a common and highly lethal cancer of the female reproductive system. In 2003, 14 657 women died of ovarian cancer in the US [1]. Early detection of the cancer is important for survival [2–4]. An important aspect of the management of adnexal masses is planning the appropriate surgery. For example, it is important to recognize Stage I ovarian cancer and to avoid any spilling of cyst contents in order not to worsen the prognosis [2].

Previous scoring systems, logistic regression models, artificial neural networks and support vector machines for predicting the outcome of ovarian masses [5–12] were often developed on small samples collected at a single center using non-standardized data collection [13]. Such models are usually not robust and may not perform well when tested in new populations [14–18].

The present paper aims to construct mathematical models using least squares support vector machines [19] and relevance vector machines [20] to predict malignancy in adnexal masses. These methods can construct non-linear models and are thus an interesting alternative or addition to logistic regression models.

Patients and methods

Data

The International Ovarian Tumor Analysis (IOTA) study group collected the data using a standardized protocol. Women with an ovarian mass were recruited at nine different centers in Sweden, Belgium, Italy, France and the UK. Patients presenting with at least one overt persistent adnexal mass and who underwent ultrasound examination by a principal investigator at one of the participating centers were eligible for inclusion. A standardized examination technique and standardized terms and definitions were used [21]. When two masses were present, data from the most complex mass were used. Pregnant women were excluded. A web-based data entry module secured by a secure socket layer certificate was used for data collection [22]. Investigators were encouraged to measure serum CA-125 levels in peripheral blood for all patients, but information on CA-125 was not necessary for inclusion. The main outcome was the classification of the mass as either benign (0) or malignant (1). Other diagnostic information was also collected, i.e. the specific histological diagnosis and, in case of malignancy, the surgical stage. All tumors were classified following the criteria recommended by the International Federation of Gynecology and Obstetrics (FIGO) [23]. The IOTA study has been described in detail elsewhere [24].

The IOTA dataset consisted of information regarding 1066 tumors, 266 of which were malignant (25%). Among the malignant masses, 169 were primary invasive (63%), 55 were borderline malignant (21%) and 42 were metastatic (16%). Descriptive statistics of the dataset are given in Tables 1–3.

Table 1. Descriptive statistics for continuous/ordinal variables
Variable | Benign: n | Median (range) | %* | Malignant: n | Median (range) | %*
Age (years) | 800 | 42 (17–90) | | 266 | 56 (17–94) |
Number of years postmenopause | 229 | 10 (1–40) | 29 | 145 | 12 (0–44) | 55
Parity | 800 | 1 (0–10) | | 266 | 2 (0–7) |
Maximal diameter of ovary (mm) | 800 | 61 (11–320) | | 266 | 100 (13–410) |
Maximal diameter of lesion (mm) | 800 | 63 (11–320) | | 266 | 100.5 (8–410) |
Volume of lesion (mL) | 800 | 73 (0.2–7781) | | 266 | 303 (0.1–11829) |
Fluid in pouch of Douglas (mm) | 140 | 12 (2–61) | 18 | 144 | 24 (3–100) | 54
Septum thickness (mm) | 343 | 2.1 (1–20) | 43 | 143 | 4.0 (1–20) | 54
Height of papillation (mm) | 156 | 7 (2–62) | 20 | 121 | 14 (3–62) | 45
Maximal diameter of papillation (mm) | 156 | 10 (3–90) | 20 | 121 | 21 (4–110) | 45
Volume of papillation (mL) | 156 | 0.2 (0.006–70) | 20 | 121 | 2 (0.008–226) | 45
Ratio of papillation volume to lesion volume | 156 | 0.003 (0–0.456) | 20 | 121 | 0.006 (0–0.42) | 45
Number of papillations† | 156 | 1 (1 to > 3) | 20 | 121 | > 3 (1 to > 3) | 45
Number of locules‡ | 800 | 1 (0 to > 10) | | 266 | 3 (0 to > 10) |
Maximal diameter of largest solid component (mm) | 309 | 21 (3–230) | 39 | 244 | 50 (4–214) | 92
Volume of the largest solid component (mL) | 309 | 1.6 (0.006–1978) | 39 | 244 | 34 (0.008–2291) | 92
Ratio solid to lesion§ | 309 | 0.03 (0–1) | 39 | 244 | 0.24 (0–1) | 92
Pulsatility index (PI) | 506 | 0.95 (0.13–5.8) | 63 | 246 | 0.74 (0.25–2.26) | 92
Resistance index (RI) | 506 | 0.59 (0.12–1.0) | 63 | 246 | 0.50 (0.17–1.0) | 92
Peak systolic velocity (PSV, cm/s) | 506 | 11.4 (2.0–85.5) | 63 | 246 | 24.30 (3.9–202) | 92
Time-averaged maximum velocity (TAMXV, cm/s) | 498 | 6.9 (1.0–60.0) | 62 | 241 | 17.0 (3.0–137) | 91
CA-125 (U/mL) | 567 | 17 (1–1409) | 71 | 242 | 167 (4–31610) | 91

*The percentage of cases for which a value was observed. Except for PI, RI, PSV, TAMXV and CA-125, missing values arose because no value could be observed. Missing data were imputed by 0 (e.g. in a unilocular cyst there are by definition no septa or solid components, and so the thickness of the thickest septum will be 0 and the diameters of the largest solid component will also be 0).
†Number of separate papillary projections (1, 2, 3, > 3).
‡Number of locules (0, 1, 2, 3, 4, 5–10, > 10).
§Ratio of the volume of the largest solid component to the volume of the lesion.
Table 2. Descriptive statistics for binary variables
Variable | Benign: n | % yes | %* | Malignant: n | % yes | %*
Family history of ovarian cancer | 800 | 2.5 | | 266 | 4.9 |
Family history of breast cancer | 800 | 10.8 | | 266 | 12.4 |
Personal history of ovarian cancer | 800 | 0.8 | | 266 | 3.0 |
Personal history of breast cancer | 800 | 2.9 | | 266 | 5.6 |
Nulliparous | 800 | 41.9 | | 266 | 25.2 |
Hysterectomy | 800 | 6.3 | | 266 | 10.5 |
Postmenopausal | 800 | 32.9 | | 266 | 63.5 |
Hormonal therapy | 800 | 23.5 | | 266 | 17.7 |
Postmenopausal bleeding within the year before ultrasound examination | 229 | 14.9 | 29 | 145 | 17.9 | 55
Bilateral masses | 800 | 16.6 | | 266 | 31.2 |
Pelvic pain during the examination | 800 | 28.8 | | 266 | 19.6 |
Ascites | 800 | 2.9 | | 266 | 42.1 |
Incomplete septum present | 800 | 9.3 | | 266 | 4.1 |
Suspected ovarian origin | 800 | 81.9 | | 266 | 83.5 |
Papillation | 800 | 19.5 | | 266 | 45.5 |
Flow within at least one papillation | 156 | 33.3 | 20 | 121 | 84.3 | 45
Presence of an irregular papillation | 156 | 49.4 | 20 | 121 | 82.6 | 45
Irregular internal walls in the lesion | 800 | 32.8 | | 266 | 81.6 |
Acoustic shadows | 800 | 13.0 | | 266 | 1.5 |
Venous blood flow† | 800 | 9.0 | | 266 | 3.4 |

*The percentage of cases for which a value was observed. Missing values arose because no value could be observed, and missing data were imputed as 0.
†No arterial blood flow detected; venous flow only.
Table 3. Descriptive statistics for categorical/ordinal variables
Variable | Benign: n (%) | Malignant: n (%)
Locularity
  Unilocular | 311 (38.9) | 2 (0.8)
  Unilocular, solid | 88 (11.0) | 44 (16.5)
  Multilocular | 176 (22.0) | 20 (7.5)
  Multilocular, solid | 168 (21.0) | 116 (43.6)
  Solid | 52 (6.5) | 84 (31.6)
  Not classifiable | 5 (0.6) | 0 (0.0)
Echogenicity
  Anechogenic | 303 (37.9) | 107 (40.2)
  Low level | 149 (18.6) | 60 (22.6)
  ‘Ground glass’ | 192 (24.0) | 33 (12.4)
  Hemorrhagic | 8 (1.0) | 2 (0.8)
  Mixed | 114 (14.3) | 18 (6.8)
  No cyst fluid | 34 (4.3) | 46 (17.3)
Color score, blood flow
  None (1) | 222 (27.8) | 11 (4.1)
  Minimal (2) | 311 (38.9) | 42 (15.8)
  Moderately strong (3) | 219 (27.4) | 109 (41.0)
  Very strong (4) | 48 (6.0) | 104 (39.1)

Bayesian least squares support vector machines and relevance vector machines

We used Bayesian least squares support vector machines (LS-SVMs) [25] and relevance vector machines (RVMs) [20] to classify tumors as benign or malignant. These methods are well suited to classification tasks. They are mathematically complex but flexible models, able to create a non-linear decision boundary, unlike, for example, standard logistic regression. Recently, an opinion paper was published in this journal comparing the characteristics of support vector machines with those of logistic regression analysis [26]. We constructed six models: Bayesian LS-SVM and RVM models with either a linear, radial basis function (RBF) or additive RBF kernel. Technical details about these methods are given in the Appendix.

Evaluation of the performance of the models

The dataset was randomly split into a training set containing 70% of the cases and a test set containing 30% of the cases. The split was stratified for outcome and center such that both sets contained the same proportion of malignant cases and the same proportion of cases from each center. This training- and test-set division was the same as that used for developing and evaluating a logistic regression model [24]. The models were trained on the training set and their performance was then evaluated on the test set. Each model (including the logistic regression model [24]) was applied to the test-set data and the probability of malignancy was computed for each individual. These probabilities were then used to create a receiver–operating characteristics (ROC) curve [27]. The statistical significance of differences in areas under the ROC curve (AUC) was determined using a nonparametric method [28]. Confidence intervals (CI) for the AUCs were obtained using 100 bootstrap samples of the test set. This method accounts for the fact that AUCs are bounded at 1, so that the CI for a high AUC should be asymmetric. The AUC was calculated for each bootstrap sample and the 95% CI was defined as the interval from the 2.5th to the 97.5th percentile. The models were also compared with the widely used risk of malignancy index (RMI) [5] by creating ROC curves using only the test-set cases with CA-125 information.
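A minimal sketch of this evaluation step, assuming arrays y_test (histological outcome, 0/1) and p_test (model-based probability of malignancy) and using scikit-learn for the AUC; the names and the simulated data are illustrative only and are not part of the IOTA analysis. The percentile interval becomes asymmetric by construction when the AUC is close to 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc_ci(y_test, p_test, n_boot=100, alpha=0.05):
    """AUC with a percentile bootstrap CI (2.5th to 97.5th percentile)."""
    aucs = []
    n = len(y_test)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample the test set with replacement
        if len(np.unique(y_test[idx])) < 2:      # skip resamples containing only one class
            continue
        aucs.append(roc_auc_score(y_test[idx], p_test[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_test, p_test), (lo, hi)

# Illustration with simulated data only:
y_test = rng.integers(0, 2, 312)
p_test = np.clip(0.5 * y_test + rng.uniform(0, 0.6, 312), 0, 1)
auc, (ci_lo, ci_hi) = bootstrap_auc_ci(y_test, p_test)
print(f"AUC = {auc:.3f}, 95% CI {ci_lo:.3f} to {ci_hi:.3f}")
```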

The models were also evaluated with respect to sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+) and negative likelihood ratio (LR−). To obtain these measures, a cut-off level had to be chosen for the probability of malignancy estimated by the Bayesian model. The cut-off level was always chosen based on the training-set results: because correctly identifying malignant cases was considered very important, we chose the cut-off that gave the highest possible sensitivity while keeping specificity at a decent level (i.e. at least 80%). The statistical significance of differences in sensitivity and specificity levels was investigated using McNemar's test for paired proportions. Because PPV and NPV levels depend on the number of tumors classified as malignant and benign, McNemar's test cannot be used to determine the statistical significance of differences in PPV or NPV. Instead, bootstrapping analyses were used to construct a 95% CI for the difference in PPV or NPV between models; if zero fell outside the 95% CI, this was taken as evidence of a significant difference.
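The cut-off based measures can be sketched as follows; the function and array names are placeholders, and the 80% specificity constraint simply encodes the rule described above rather than the authors' exact implementation.

```python
import numpy as np

def metrics_at_cutoff(y_true, p_malignant, cutoff):
    """Sensitivity, specificity, PPV, NPV, LR+ and LR- at one probability cut-off."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(p_malignant) >= cutoff).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return {"sensitivity": sens, "specificity": spec,
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
            "LR+": sens / (1 - spec), "LR-": (1 - sens) / spec}

def choose_cutoff(y_train, p_train, min_specificity=0.80):
    """Training-set cut-off: highest sensitivity subject to specificity >= 80%."""
    best = None
    for c in np.unique(p_train):
        m = metrics_at_cutoff(y_train, p_train, c)
        if m["specificity"] >= min_specificity and (
                best is None or m["sensitivity"] > best[1]["sensitivity"]):
            best = (c, m)
    return best
```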

Variable selection

The IOTA dataset is almost complete. Most missing values are structural and were sensibly imputed with zero. The exceptions are the blood flow indices: when intratumoral blood flow was insufficient to compute the peak systolic velocity (PSV), time-averaged maximum velocity (TAMXV), pulsatility index (PI) and resistance index (RI), these indices were imputed with 2 cm/s, 1 cm/s, 3 and 1, respectively. In addition, CA-125 information was unavailable for 257 cases (24%); we therefore decided not to include CA-125 in the variable selection process.
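A minimal pandas sketch of these imputation rules, assuming hypothetical column names (PSV, TAMXV, PI, RI, CA125); it only restates the rules above and is not the authors' code.

```python
import pandas as pd

DOPPLER_FILL = {"PSV": 2.0, "TAMXV": 1.0, "PI": 3.0, "RI": 1.0}  # cm/s, cm/s, unitless, unitless

def impute_iota(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.fillna(value=DOPPLER_FILL)                  # flow indices missing because no flow was measurable
    other = [c for c in df.columns if c not in list(DOPPLER_FILL) + ["CA125"]]
    df[other] = df[other].fillna(0)                     # structural missings, e.g. no solid component -> diameter 0
    return df                                           # CA125 left as-is; it is not used for variable selection
```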

To select the predicting variables for the models, a forward selection method was used within the context of Bayesian LS-SVM models with either a linear or an RBF kernel. In each step of the selection process, the variable that most increased the Bayesian model evidence (see Appendix) was added to the model. The selection process stopped when no additional variable was able to increase the model evidence further. Because this forward selection method may select many variables, an additional backward elimination procedure may be necessary in order to obtain a more reliable set of predicting variables. Such backward elimination can be done subjectively, based on knowledge of the variables (e.g. their subjectivity and measurement accuracy, and the associations between them), and also by investigating which variables least decrease the model evidence.
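The greedy forward step can be sketched generically as follows; `evidence` stands in for the Bayesian LS-SVM model-evidence computation (not reproduced here), so the names and structure are illustrative assumptions rather than the authors' implementation.

```python
def forward_select(X, y, candidates, evidence):
    """Greedy forward selection driven by a model-evidence score."""
    selected = []
    best = evidence(X, y, selected)                    # evidence of the empty model
    improved = True
    while improved:
        improved = False
        remaining = [v for v in candidates if v not in selected]
        scores = {v: evidence(X, y, selected + [v]) for v in remaining}
        if scores:
            v_best = max(scores, key=scores.get)
            if scores[v_best] > best:                  # stop when no variable raises the evidence
                selected.append(v_best)
                best = scores[v_best]
                improved = True
    return selected
```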

Results

The adopted variable selection procedure selected 12 variables: three continuous variables (maximal diameter of the solid component, age and maximal diameter of the whole ovary) and nine binary variables (presence of ascites, blood flow within papillary projections, acoustic shadows, irregular internal cyst walls, very strong intratumoral blood flow, hormonal therapy, a multilocular-solid tumor, personal history of ovarian cancer, and suspicion at ultrasound examination of an ovarian origin of the tumor). This subset of selected variables served as the variable set for all models that were fitted. The maximal diameter of the solid component was bounded at 50 mm (all values > 50 mm were replaced with 50 mm). The same bounding was used for the logistic regression model [24].

The performance of the models on the whole test set (n = 312) is shown in Table 4. The performance of a logistic regression model developed using the IOTA dataset [24] and of the RMI [5] are shown for comparison. The test-set AUCs of the six LS-SVM and RVM models (ranging from 0.943 to 0.949; Figures 1 and 2) did not differ significantly (all P > 0.15), nor did they differ significantly from the AUC of the logistic regression model (AUC = 0.942; all P > 0.25). However, when applied to the test-set cases with CA-125 information (n = 236; Table 4), the AUC of all six LS-SVM and RVM models was significantly larger than the AUC of the RMI (all P < 0.004).

Figure 1.

Test set (n = 312) receiver–operating characteristics curves for the least squares support vector machine (LS-SVM) models with linear kernel and radial basis function (RBF) kernel, and for the previously fitted logistic regression model.

Figure 2.

Test set (n = 312) receiver–operating characteristics curves for the relevance vector machine (RVM) models with linear kernel and radial basis function (RBF) kernel, and for the previously fitted logistic regression model.

Table 4. Performance of the models on the test set. The performances of a previously published logistic regression (LR) model developed on the same dataset [24] and of the risk of malignancy index (RMI) [5] are shown for comparison
Model (kernel) | AUC (SE) | 95% CI | AUC (CA-125)* | Cut-off | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | LR+ | LR−
LS-SVM (linear) | 0.946 (0.019) | 0.905–0.974 | 0.940 (0.020) | 0.15 | 91 | 84 | 65 | 97 | 5.81 | 0.11
LS-SVM (RBF) | 0.945 (0.019) | 0.906–0.976 | 0.941 (0.020) | 0.12 | 93 | 81 | 61 | 97 | 4.91 | 0.08
LS-SVM (addRBF) | 0.943 (0.019) | 0.901–0.977 | 0.937 (0.021) | 0.12 | 91 | 82 | 62 | 97 | 5.12 | 0.11
RVM (linear) | 0.949 (0.018) | 0.909–0.977 | 0.942 (0.020) | 0.20 | 92 | 82 | 62 | 97 | 5.19 | 0.10
RVM (RBF) | 0.946 (0.019) | 0.905–0.975 | 0.940 (0.021) | 0.15 | 92 | 81 | 60 | 97 | 4.74 | 0.10
RVM (addRBF) | 0.946 (0.019) | 0.905–0.977 | 0.941 (0.020) | 0.15 | 93 | 81 | 61 | 97 | 5.03 | 0.08
LR [24] | 0.942 (0.017) | 0.901–0.973 | 0.936 (0.020) | 0.10 | 93 | 76 | 55 | 97 | 3.81 | 0.09
RMI [5] | | | 0.870 (0.028) | 100 | 78* | 80* | 61* | 90* | 3.84* | 0.27*

*Using only the test-set cases with CA-125 data (n = 236). addRBF, additive radial basis function kernel; AUC, area under the receiver–operating characteristics curve; LR+, positive likelihood ratio; LR−, negative likelihood ratio; LS-SVM, least squares support vector machine; NPV, negative predictive value; PPV, positive predictive value; RBF, radial basis function kernel; RVM, relevance vector machine; SE, standard error.

The models also reached the sensitivity and specificity goals that were set in advance: all six kernel methods obtained a test-set sensitivity above 90% and a specificity above 80%. A McNemar test comparing performance on the benign test-set cases (n = 237) revealed that all six kernel methods outperformed the logistic regression model with respect to specificity (all P < 0.0075). Similar tests for the malignant test-set cases (n = 75) did not yield statistically significant differences with respect to sensitivity (all P > 0.15). Bootstrap analyses indicated that the kernel methods also outperformed the logistic regression model with respect to PPV (results not shown). On the present test set, a linear model performed best: the LS-SVM model with linear kernel achieved 91% sensitivity, 84% specificity, 65% PPV and 97% NPV.

Figure 3 displays the LS-SVM classifier with linear kernel, the behavior of which is very similar to that of the RVM model with linear kernel. We could have made similar figures for the models using additive RBF kernels, but since the models with linear kernels performed as well as those with non-linear kernels, this was not necessary. For the continuous variables, the maximal diameter of the ovary (which is often identical to the maximal diameter of the lesion), the maximal diameter of the largest solid component and the woman's age, this linear model associates higher values with a higher probability of malignancy. Dichotomous variables that raise the probability of malignancy are ascites, blood flow within papillary projections, an irregular internal cyst wall, very strong intratumoral blood flow, personal history of ovarian cancer, and a tumor that is suspected by the sonologist to be of ovarian origin. On the other hand, acoustic shadows, hormonal therapy, and a multilocular-solid tumor decrease the model-based probability of malignancy.

Figure 3.

Least squares support vector machine (LS-SVM) model with linear kernel. For each predicting variable, the black line shows its influence on the model outcome y(x) (i.e. before the probability of malignancy is computed). For continuous variables, the upper (lower) plus signs represent the malignant (benign) training cases. For binary variables, the size of the upper (lower) circles represents the number of malignant (benign) training cases. Max diam, maximal diameter; Pap proj, papillary projections; Pers hist, personal history; solid comp, largest solid component.

The six Bayesian models were also applied to the test-set cases of each center separately. The best and worst of the six test-set AUCs are depicted in Figure 4. In general, the difference between the best and worst AUC values was smaller for centers that contributed most to the dataset. For the two centers that contributed most to the dataset, Malmö and Leuven, the difference was very small. In most centers the models had a test-set AUC exceeding 0.900. The two largest centers in the study had no test-set AUC below 0.950. In the BFR and MIT centers (see Figure 4), all AUCs exceeded 0.950.

Figure 4.

Overview of test set area under the curve (AUC) values of the six kernel models for different centers. The best (▵) and worst (▿) AUC result for each center is shown. MSW, University Hospital Malmö (Sweden); LBE, University Hospital K.U. Leuven (Belgium); KUK, King's College Hospital London (UK); BFR, Hôpital Boucicaut Paris (France); MFR, Centre Medical des Pyramides Maurepas (France); RIT, Università Cattolica del Sacro Cuore Rome (Italy); MIT, Dipartimento di Scienze Cliniche ‘Luigi Sacco’, Università degli Studi di Milano (Italy); OIT, Ospedale San Gerardo (Università degli Studi di Milano) Monza (Italy).

Discussion

The IOTA database provides a good starting point for the development of robust decision support systems that perform well in various situations. In a first step, a logistic regression model was developed [24]; this model performed very well on the test data. We then applied Bayesian LS-SVMs and RVMs to the database. These models are able to capture non-linearity in the classification task and are specifically designed to avoid over-fitting of the training data, which would impair model performance on new data. Despite their appealing rationale, neither SVMs nor RVMs are yet frequently used in medical applications; some exceptions can be found in the work of De Smet et al. [29], Bowd et al. [30] and Majumder et al. [31].

The kernel-based models developed in this paper had very good test-set performance, with AUCs varying from 0.943 to 0.949. The logistic regression model achieved a test-set AUC of 0.942 [24]. None of the differences in AUC was statistically significant. Sensitivity and specificity were very high for all models, but the kernel-based models obtained a significantly better specificity level than did the logistic regression model. Of course, this difference needs to be confirmed in new prospective studies. To illustrate the results obtained on the IOTA dataset, the well-performing LS-SVM model with linear kernel would yield around 218 true positives, 641 true negatives, 119 false positives and 22 false negatives in a set of 1000 patients; the logistic regression model would yield 224, 574, 186 and 16 patients, respectively. The significant difference in specificity between the kernel-based model and the logistic regression model may have arisen owing to small differences in the way cut-offs were chosen. In the present paper we aimed for a specificity of at least 80%, while Timmerman et al. [24] aimed for a specificity of 75%. Tables in Timmerman et al. [24] show that a cut-off of 0.15 (91% sensitivity and 82% specificity on the training set) instead of 0.10 (93% and 77%, respectively) would have resulted in 92% sensitivity and 81% specificity on the test set, a performance similar to that of the kernel-based models.
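The per-1000-patients figures follow from the test-set prevalence of malignancy (75/312) combined with the observed sensitivity and specificity; the following is a back-of-the-envelope check, not a calculation taken from the paper.

```python
prevalence = 75 / 312                    # test-set prevalence of malignancy, about 24%
sens, spec = 0.91, 0.84                  # LS-SVM with linear kernel (rounded values from Table 4)
n = 1000
tp = n * prevalence * sens               # about 219 true positives with the rounded inputs
fn = n * prevalence * (1 - sens)         # about 22 false negatives
tn = n * (1 - prevalence) * spec         # about 638 true negatives
fp = n * (1 - prevalence) * (1 - spec)   # about 122 false positives
print(round(tp), round(tn), round(fp), round(fn))
# The figures quoted in the text (218, 641, 119 and 22) correspond to the
# unrounded test-set sensitivity and specificity rather than to 0.91 and 0.84.
```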

Test-set performance varied between centers but was good to very good in all of them. From a modeling perspective, our results indicate that a non-linear model is not necessary, because it adds complexity without any obvious gain in performance. In that sense, a Bayesian LS-SVM model with linear kernel is appealing: it is linear, has a unique solution (as opposed to RVM models), avoids over-fitting of the training set, and may have better specificity than the logistic regression model.

Interestingly, all models developed on the IOTA database clearly outperform the RMI. It must be emphasized, however, that the RMI was tested prospectively while the other models were all developed on the training set of the database of this study, and so the performance of the LS-SVMs and RVMs is probably slightly overestimated.

Despite Table 3 showing multilocular-solid tumors to be malignant more often than tumors that are not multilocular-solid (relative risk = 2.2), in the LS-SVM and RVM models a multilocular-solid tumor decreased the risk of malignancy. The models use several variables simultaneously, some of which may be interrelated. This can cause results one would not expect from univariate data analysis. For example, patients with a multilocular-solid tumor more often had very strong intratumoral blood flow, irregular internal cyst walls, blood flow within at least one papillary projection, ascites, and large maximal diameters of the ovary and the solid component. All of these variables are associated with a higher risk of malignancy (Tables 1–3, Figure 3).

It is possible to build stand-alone software for clinical use of the LS-SVM and RVM models.

A crucial next step is to collect new data at the same centers as in this study as well as at completely new centers in order to evaluate prospectively the performance of the LS-SVM and RVM models.

Acknowledgements

This research was supported by Research Council KUL: GOA-AMBioRICS, CoE EF/05/006 Optimization in Engineering, several PhD/postdoc & fellow grants, Flemish Government: FWO (research communities (ICCoS, ANMMM)), Belgian Federal Science Policy Office IUAP P5/22 (‘Dynamical Systems and Control: Computation, Identification and Modeling’), EU: BIOPATTERN (FP6-2002-IST 508803), ETUMOUR (FP6-2002-LIFESCIHEALTH 503094), Healthagents (IST–2004-27214), the Swedish Medical Research Council: (grants nos. K2001-72X-11605-06A, K2002-72X-11605-07B and K2004-73X-11605-09A), Swedish governmental grants: (Landstingsfinansierad regional forskning (Region Skåne and ALF-medel)), and Funds administered by Malmö University Hospital.

Appendix

Least squares support vector machines

Support vector machines (SVMs) [25] transform the ‘input space’ (i.e. the multidimensional scatter plot of the predicting variables) into a high-dimensional ‘feature space’ (Figure A1). In the feature space a linear separation is sought (Figure A1) that aims for a good trade-off between maximization of the margin between the two classes (this limits the model's complexity) and minimization of the number of training-set misclassifications. This trade-off, finding a simple model that is still able to distinguish between benign and malignant tumors, enhances the model's ability to generalize to new data. How the feature space is constructed depends on the kernel function used. The kernel function is a measure of similarity between the data of two patients. A linear kernel results in a linear separation between the two classes in the original input space (Figure A2). On the other hand, non-linear kernels such as the radial basis function (RBF) kernel create a non-linear decision boundary in the input space (Figure A3).

Figure A1.

Graphical representation of the underlying rationale for support vector machines. The squares represent class 1, the circles represent class 0. Two variables, X1 and X2, are used to predict whether a case belongs to class 1 or class 0. The input space (a) is transformed into a high-dimensional feature space (b) in which a linear separation is constructed. When using a non-linear transformation this coincides with a non-linear decision boundary in the input space.

Figure A2.

Support vector machines with a linear kernel produce a linear decision boundary between both classes. In this example, two predictor variables X1 and X2 were used to construct the decision boundary.

Figure A3.

When using support vector machines with a non-linear kernel such as the radial basis function kernel, a non-linear decision boundary between both classes is produced. X1 and X2, predictor variables.

Let us denote the data values for a patient by x. The SVM model makes a prediction for the patient based on the following model formulation: y(x) = α1 · K(x, x1) + … + αN · K(x, xN) + b, where K(·, ·) is the kernel function giving the similarity between the data x of our patient and the data xi of training-set patient i (i = 1, …, N), αi is the weight for training case i, reflecting the importance of that case in the model, and b is a constant. The sign of y(x) determines the hard prediction, benign or malignant, made by the model: if the result is negative, the tumor is predicted to be benign, while a positive result leads to the prediction of malignancy. This can be compared with the logistic regression model, in which the hard prediction depends on whether the predicted probability of the outcome of interest (e.g. malignancy) is above or below 0.50. Of course, this cut-off (0 for SVMs, 0.50 for logistic regression) can be adapted in order to aim for desired sensitivity and specificity levels.
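A minimal sketch of this decision rule; the weights alpha and the bias b are assumed to come from an already trained model (training is not shown), and all names are placeholders.

```python
def decision_value(x, X_train, alpha, b, kernel):
    """y(x) = alpha_1*K(x, x_1) + ... + alpha_N*K(x, x_N) + b."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train)) + b

def predict_malignant(x, X_train, alpha, b, kernel, threshold=0.0):
    # Positive decision value -> malignant (1); the threshold can be shifted
    # away from 0 to trade sensitivity against specificity, as described above.
    return int(decision_value(x, X_train, alpha, b, kernel) > threshold)
```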

SVMs end up using only a few training-data cases for constructing the decision boundary. These cases are called the support vectors. The other training cases have zero weight (αi = 0). The support vectors are cases that usually lie close to the decision boundary. Also, the process of finding the αi values leads to a unique solution, which is a major advantage over other flexible methods such as artificial neural networks or relevance vector machines (RVMs). A variant of standard SVMs has been developed [19,32] in which the training process involves solving a set of linear equations. This variant, called least squares SVMs (LS-SVMs), greatly simplifies SVM training while retaining the advantages of SVMs described above.
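One common way to see why LS-SVM training reduces to a set of linear equations is the regression-style formulation with +1/−1 class labels as targets; the sketch below illustrates that idea under that assumption and is not necessarily the exact variant used in the paper.

```python
import numpy as np

def lssvm_train(K, y, gamma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T.

    K is the n x n kernel matrix of the training cases, y holds the +1/-1
    labels, and gamma trades model complexity against training errors.
    """
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.asarray(y, float)))
    sol = np.linalg.solve(A, rhs)        # one linear solve instead of a quadratic program
    return sol[0], sol[1:]               # bias b, weights alpha

def lssvm_decision(K_test_train, alpha, b):
    return K_test_train @ alpha + b      # y(x) for each test case
```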

Bayesian LS-SVMs: providing class probabilities

A disadvantage of (LS-)SVM classifiers is that they do not provide class probabilities. Applying a Bayesian framework to LS-SVMs can overcome this drawback [33,34]. The main aim of Bayesian analysis is to account for uncertainty in the estimated model parameters by deriving a posterior probability distribution over them. The posterior distribution is obtained by confronting a prior probability distribution with the information in the data: the prior distribution reflects the prior knowledge or belief about likely values of the model parameters, and it is combined with the information about the model parameters contained in the collected data to give the posterior distribution. Doing this exactly involves solving highly complex integrals; the procedure of MacKay [33] avoids this complexity by using approximations to obtain the posterior distribution. Class probabilities are obtained by mathematically taking the full posterior distribution into account, i.e. by taking into account the uncertainty in the estimated model parameters.
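One generic illustration of how a class probability can reflect parameter uncertainty is MacKay's approximation, in which the logistic of the posterior-mean output is "moderated" towards 0.5 as the posterior variance grows. This is a sketch of that general idea only, not the exact Bayesian LS-SVM formulation of Van Gestel et al. [34].

```python
import math

def moderated_probability(mean, variance):
    """Approximate E[sigmoid(f)] when f has a Gaussian posterior N(mean, variance)."""
    kappa = 1.0 / math.sqrt(1.0 + math.pi * variance / 8.0)
    return 1.0 / (1.0 + math.exp(-kappa * mean))

print(moderated_probability(2.0, 0.0))    # no uncertainty: plain sigmoid, about 0.88
print(moderated_probability(2.0, 10.0))   # same mean, large uncertainty: closer to 0.5
```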

The Bayesian perspective contains methods for model comparison by computing the probability of the data obtained when a particular model is assumed to underlie the data. If this probability is computed for two competing models then the model for which this probability is higher is said to have higher evidence. This part of the Bayesian methodology was used for predictor variable selection.

Relevance vector machines

In relevance vector machines (RVMs) [20], a model y(x) = α1 · K(x, x1) + … + αN · K(x, xN) + b (cf. supra) is also approached from a Bayesian perspective using MacKay's procedure. A fundamental difference between SVMs and RVMs is that RVMs do not use a feature space, so the basis function K(·, ·) in the model can be any function (therefore, RVMs are not truly kernel based). MacKay's procedure is applied by using a normal (Gaussian) prior probability distribution for each training sample's weight αi (i = 1, …, N). This Gaussian prior has mean zero and variance βi. Fitting the RVM model to the data results in many of the βi approaching zero, which means that the posterior distribution for the corresponding αi has mean zero and variance zero, so that the ith data sample has no influence and need not be used. The remaining data samples are similar to the support vectors in SVM models, but the samples used in RVM models are not those close to the decision boundary (cf. SVMs); rather, they can be seen as prototypical examples of the classes. An important disadvantage of RVM models is that there is no unique solution for the αi. Since RVMs are formulated within a Bayesian framework, they also yield class probabilities. The Bayesian framework for LS-SVMs is applied in a different way, but is too complex to describe concisely in this appendix; readers are referred to the paper by Van Gestel et al. [34].

Kernel functions used in this paper

A disadvantage of using non-linear kernels is that it is difficult to disentangle the influence of each variable on the outcome. Linear models do not have this disadvantage: they use a weighted sum of the predicting variables to predict the outcome, so the effect of each predicting variable can be examined separately. Therefore, additive methods have been developed for kernel techniques [35] and were applied in the present paper. Thus, six models are presented in this paper: Bayesian LS-SVMs and RVMs with either a linear, RBF or additive RBF kernel.
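The three kernel types named above can be sketched as follows; the scaling and hyperparameter conventions are assumptions for illustration, not the paper's exact settings. The point is that an additive RBF kernel sums one-dimensional RBF kernels, one per predictor, so each variable's contribution can still be inspected separately.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def rbf_kernel(x, z, sigma=1.0):
    x, z = np.asarray(x, float), np.asarray(z, float)
    return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

def additive_rbf_kernel(x, z, sigma=1.0):
    x, z = np.asarray(x, float), np.asarray(z, float)
    # One 1-D RBF term per variable, summed; keeps the model additive per predictor.
    return float(np.sum(np.exp(-((x - z) ** 2) / (2 * sigma ** 2))))
```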
