Support vector machines versus logistic regression: improving prospective performance in clinical decision-making
Article first published online: 19 MAY 2006
Copyright © 2006 ISUOG. Published by John Wiley & Sons, Ltd.
Ultrasound in Obstetrics & Gynecology
Volume 27, Issue 6, pages 607–608, June 2006
How to Cite
Pochet, N. L. M. M. and Suykens, J. A. K. (2006), Support vector machines versus logistic regression: improving prospective performance in clinical decision-making. Ultrasound Obstet Gynecol, 27: 607–608. doi: 10.1002/uog.2791
- Issue published online: 19 MAY 2006
In this issue of the Journal, De Smet et al. [1] describe the use of new models to predict depth of infiltration in endometrial carcinoma based on transvaginal sonography. Their paper compares standard logistic regression models with the more advanced least squares support vector machine (LS-SVM) models [2] with linear and radial basis function (RBF) kernels. While classical logistic regression analysis is the standard method used for clinical classification problems [3–5], we believe that this newer and more advanced method may improve performance further.
Traditional statistical methods such as multivariate logistic regression aim to build a classification model that fits a set of patients (‘training’ set) optimally. Unfortunately, this strategy may easily result in a model that fits these training patients too well and is therefore not capable of making good predictions for previously unknown patients (‘independent’, ‘prospective’ or ‘test’ set). This problem is often referred to as overfitting the training patients, and it leads to poor generalization to previously unknown patients. Support vector machines (SVMs) are a relatively new method, based on the principles of statistical learning theory [6], for solving classification and regression problems. The method explicitly tries to generalize well when building a model from a given set of patients. In this way, SVMs perform reasonably well on a training set, but not at the expense of performance when making predictions for previously unseen patients.
Logistic regression tries to fit a model as well as possible to the patients of the training set. When the training set contains outliers, i.e. samples that do not follow the general underlying distribution, logistic regression fits these as well, which can lead to a substantial number of misclassified patients when the model is applied prospectively. SVMs, in contrast, try to generalize well when building a model from the given set of patients. With SVMs, optimization of the generalization performance is achieved by controlling two terms simultaneously: the classification error on the training set and the complexity of the model, both of which are minimized. The trade-off between the two is governed by a regularization parameter (γ) in the LS-SVM formulation.
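For readers who want to see the two terms explicitly, the LS-SVM classifier is commonly written as the constrained optimization problem below. This is a sketch of the standard formulation rather than a quotation from De Smet et al.; apart from γ, the symbols are notation assumed here: w is the weight vector, b the bias, e_i the error variable of patient i, φ the feature map induced by the kernel, y_i ∈ {−1, +1} the class label and N the number of training patients.

```latex
\min_{w,\,b,\,e}\; \frac{1}{2}\, w^{T} w \;+\; \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2}
\qquad \text{subject to} \qquad
y_i \left[ w^{T} \varphi(x_i) + b \right] = 1 - e_i , \quad i = 1, \dots, N .
```

The first term penalizes model complexity and the second the training error. A large γ puts more weight on fitting the training patients, while a small γ favors a simpler model.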
A further disadvantage of logistic regression is that, in its standard form, the technique is not able to identify possible nonlinear structures in a set of patients. When nonlinear relationships exist, a nonlinear decision boundary may result in better overall performance. Unlike logistic regression, SVMs are designed to generate more complex decision boundaries. An LS-SVM with a simple linear kernel function corresponds to a linear decision boundary. Instead of a linear kernel, more complex kernel functions, such as the commonly used RBF kernel, can be chosen. An RBF kernel requires optimization of the kernel parameter (σ), which controls the curvature of the decision boundary. Figure 1 shows an example in which an LS-SVM with an RBF kernel would be more appropriate than an LS-SVM with a simple linear kernel. The more complex decision boundary describes the nonlinearity in this set of patients better than would be possible with a linear decision boundary.
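To make the two kernel choices concrete, the following minimal Python sketch defines a linear and an RBF kernel for two patients described by lists of numbers. The function names are hypothetical, and the RBF scaling exp(−‖x − z‖² / (2σ²)) is one common convention assumed here; software packages may scale σ differently.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, sigma):
    """RBF kernel, assuming the exp(-||x - z||^2 / (2 * sigma^2)) convention."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Two hypothetical patients with two measurements each (illustrative values only).
patient_a = [1.0, 2.0]
patient_b = [2.0, 4.0]

# A small sigma makes similarity decay quickly with distance (a more curved
# boundary); a large sigma makes the kernel flatter and the boundary smoother.
similarity_narrow = rbf_kernel(patient_a, patient_b, sigma=0.5)
similarity_wide = rbf_kernel(patient_a, patient_b, sigma=5.0)
```

With these values the narrow kernel treats the two patients as almost unrelated, whereas the wide kernel treats them as very similar, which is the sense in which σ controls the curvature of the decision boundary.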
LS-SVMs are reformulations of the standard SVMs and are qualitatively similar to them. They have already been used extensively for various classification problems, including medical ones [7]. LS-SVM models can be trained easily using LS-SVMlab version 1.5 [5, 8] for MATLAB, as shown by De Smet et al. [1]. These authors aimed to predict the depth of infiltration in endometrial carcinoma based on transvaginal sonography data from 97 training-set patients that included the ultrasound parameters, the number of fibroids detected during ultrasound examination, the degree of differentiation of the cancer, the presence of a clear cell component, and the presence of a serous papillary component. Stepwise logistic regression selected the degree of differentiation, the number of fibroids and the endometrial thickness and volume. Subsequently, these variables were used to train a logistic regression model and LS-SVMs with linear and RBF kernels. Compared with the area under the receiver–operating characteristics curve (AUC) of the subjective assessment (72%), prospective evaluation of the mathematical models on 76 test-set patients resulted in an equally good AUC for the LS-SVM model with a linear kernel, and in a better AUC (77%) for the LS-SVM model with an RBF kernel, although this difference was not significant. The performance of the standard logistic regression model (66%) was significantly worse, although its training-set performance was similar to that of the LS-SVM model with a linear kernel. This shows that the level of overfitting was higher for the standard logistic regression model than for the LS-SVM model with a linear kernel (the logistic regression model also reflects some accidental characteristics of the training set that would not recur in an independent set of patients).
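As a reminder of what the AUC figures above measure, the sketch below computes the AUC as the probability that a randomly chosen positive patient receives a higher model score than a randomly chosen negative patient, counting ties as one half. The function name and the scores are hypothetical and serve only as illustration; they are not data from De Smet et al.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels use 1 for the positive class; any other value counts as negative.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l != 1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive patient ranked above negative patient
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Illustrative scores for four hypothetical test patients (not real data).
example_scores = [0.9, 0.3, 0.8, 0.1]
example_labels = [1, 1, 0, 0]
example_auc = auc(example_scores, example_labels)  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a model whose AUC drops markedly between training and test patients is showing the overfitting described above.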
When training LS-SVM models with a linear or an RBF kernel, it is necessary to optimize the regularization parameter γ. Using an RBF kernel also requires tuning of the kernel parameter σ. These parameters can be tuned in LS-SVMlab with the ‘tunelssvm’ function, using a ‘linesearch’ approach for the LS-SVM with a linear kernel (tuning of γ) and a ‘gridsearch’ approach for the LS-SVM with an RBF kernel (tuning of σ and γ), with the ‘leave-one-out cross-validation’ performance on the training set as the criterion to optimize. The resulting parameter settings can then be used to train the definitive model with the ‘trainlssvm’ function. The ‘simlssvm’ function makes predictions for new patients using the previously built model. Since regularization is performed in LS-SVM models, this technique can be expected to generalize better to an independent set of patients than standard logistic regression.
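The tune, train and predict steps can be illustrated with a small self-contained Python sketch. It is not LS-SVMlab and does not reproduce its interface: all function names are hypothetical, the RBF scaling exp(−‖x − z‖² / (2σ²)) is an assumed convention, the tuning criterion is plain leave-one-out accuracy, and the search is an exhaustive grid rather than the toolbox's own routines. Training solves the usual LS-SVM linear system for the bias b and the support values α.

```python
import math

def rbf_kernel(x, z, sigma):
    """RBF kernel, assuming the exp(-||x - z||^2 / (2 * sigma^2)) convention."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def train_lssvm(X, y, gamma, sigma):
    """Return (b, alpha) from the system [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(X)
    A = [[0.0] + [float(t) for t in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * rbf_kernel(X[i], X[j], sigma)
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    solution = solve(A, [0.0] + [1.0] * n)
    return solution[0], solution[1:]

def predict_lssvm(X, y, b, alpha, sigma, x_new):
    """Classify x_new as +1 or -1 from the sign of the kernel expansion."""
    score = b + sum(a * t * rbf_kernel(x_new, xi, sigma) for a, t, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1

def loo_accuracy(X, y, gamma, sigma):
    """Leave-one-out accuracy: retrain without each patient, then predict that patient."""
    hits = 0
    for i in range(len(X)):
        X_rest, y_rest = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        b, alpha = train_lssvm(X_rest, y_rest, gamma, sigma)
        hits += predict_lssvm(X_rest, y_rest, b, alpha, sigma, X[i]) == y[i]
    return hits / len(X)

def grid_search(X, y, gammas, sigmas):
    """Return the (gamma, sigma) pair with the best leave-one-out accuracy."""
    grid = [(g, s) for g in gammas for s in sigmas]
    return max(grid, key=lambda pair: loo_accuracy(X, y, pair[0], pair[1]))

# Toy data (illustrative only): two small clusters with labels +1 and -1.
X_toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y_toy = [1, 1, 1, -1, -1, -1]
best_gamma, best_sigma = grid_search(X_toy, y_toy, [0.1, 10.0], [0.5, 1.0])
b_toy, alpha_toy = train_lssvm(X_toy, y_toy, best_gamma, best_sigma)
new_patient_class = predict_lssvm(X_toy, y_toy, b_toy, alpha_toy, best_sigma, [0.5, 0.5])
```

The tuned parameters are chosen using the training patients only; the definitive model is then trained once on the full training set and applied to new patients, mirroring the sequence described above.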
Although special software (such as LS-SVMlab) is needed to train LS-SVM models from a dataset, the models can be used easily by clinicians once their parameters have been determined, as shown by De Smet et al. [1], using regular software packages that allow implementation of elementary calculations. Both the standard logistic regression model and the LS-SVM with a linear kernel have been stated explicitly as simple equations in four variables, and they can be evaluated immediately with a simple calculator, if necessary. Note that, in principle, the number of terms in an LS-SVM model is equal to the number of patients in the training set plus one. However, by rearranging the terms it is possible to write an LS-SVM with a linear kernel as a simple linear equation in its variables. This is not possible for an LS-SVM with an RBF kernel, but each term in the sum has a very simple form, so an LS-SVM model with an RBF kernel can still be implemented easily in any software package that supports simple calculations (e.g. Microsoft Excel).
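The rearrangement for the linear kernel can be checked numerically. In the Python sketch below, the training patients, support values α and bias b are hypothetical numbers chosen for illustration, not the coefficients published by De Smet et al. The model is first evaluated as a sum with one term per training patient plus the bias, and then as a single linear equation obtained by collecting the terms into a weight vector w. An RBF version (using the assumed exp(−‖x − z‖² / (2σ²)) convention) is included to show that it remains a sum of simple terms.

```python
import math

# Hypothetical trained model (illustrative values only).
X_train = [[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]]   # three training patients, two variables
y_train = [1, -1, 1]                              # class labels
alpha = [0.4, 0.7, 0.3]                           # support values, one per patient
b = -0.2                                          # bias term

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def score_as_sum(x):
    """Linear-kernel model written with one term per training patient plus the bias."""
    return b + sum(a * t * dot(xi, x) for a, t, xi in zip(alpha, y_train, X_train))

# Collect the terms: w_k = sum_i alpha_i * y_i * x_ik, giving one weight per variable.
w = [sum(a * t * xi[k] for a, t, xi in zip(alpha, y_train, X_train))
     for k in range(len(X_train[0]))]

def score_as_equation(x):
    """The same model as a simple linear equation: w . x + b."""
    return dot(w, x) + b

def score_rbf(x, sigma):
    """RBF-kernel model: cannot be collapsed, but each term is a simple expression."""
    return b + sum(a * t * math.exp(-sum((p - q) ** 2 for p, q in zip(xi, x)) / (2.0 * sigma ** 2))
                   for a, t, xi in zip(alpha, y_train, X_train))

new_patient = [2.0, 1.5]
difference = abs(score_as_sum(new_patient) - score_as_equation(new_patient))
```

The two linear evaluations agree up to rounding error, which is why a linear-kernel LS-SVM can be handed to clinicians as one short equation, whereas the RBF model has to be kept as a sum over the training patients.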
To conclude, we have summarized here two advantages of the recent and more advanced SVMs over traditional logistic regression. Unlike logistic regression, SVMs have means to prevent the model from being overly sensitive to outliers in the data, resulting in a model that is capable of making good predictions in prospective analyses. Moreover, SVMs are able to cope with nonlinearity in the data by using nonlinear kernel functions instead of a simple linear kernel.
This work was supported by: Research Council KUL: GOA-AMBioRICS, IDO (IOTA Oncology, Genetic networks), several PhD/postdoc and fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects G.0115.01, G.0407.02, G.0413.03, G.0388.03, G.0229.03 and IWT: PhD Grants, STWW-Genprom, GBOU-McKnow, GBOU-SQUAD, GBOU-ANA; Belgian Federal Government: DWTC [IUAP V-22 (2002–2006)]; EU: CAGE; Biopattern.