Keywords:

  • nomograms;
  • artificial neural networks;
  • risk groupings;
  • probability tables;
  • classification and regression trees

Abstract

  1. Top of page
  2. Abstract
  3. INTRODUCTION
  4. ‘KATTAN-TYPE’ NOMOGRAMS
  5. DECISION-TOOL REQUIREMENTS
  6. COMPARISON OF NOMOGRAMS WITH OTHER PREDICTION TOOLS
  7. CONCLUSIONS
  8. CONFLICT OF INTEREST
  9. REFERENCES
  10. EDITORIAL COMMENT
  11. REFERENCES
  12. REPLY
  13. REFERENCES

Accurate estimates of the likelihood of treatment success, complications and long-term morbidity are essential for counselling and informed decision-making in patients with urological malignancies. Accurate risk estimates are also required for clinical trial design, to ensure homogeneous patient distribution. Nomograms, risk groupings, artificial neural networks (ANNs), probability tables, and classification and regression tree (CART) analyses represent the available decision aids that can be used for these tasks. We critically reviewed these decision aids and compared their ability to predict the outcome of interest. Of the available decision aids, nomograms provide individualized, evidence-based and highly accurate risk estimates that facilitate management-related decisions. We suggest the use of nomograms for the purpose of evidence-based, individualized decision-making.


Abbreviations
RCT, randomized clinical trial; RP, radical prostatectomy; c-index, concordance index; CART, classification and regression tree; ANN, artificial neural network.

INTRODUCTION

Accurate estimates of the likelihood of treatment success, complications and long-term morbidity are essential for counselling and informed decision-making in patients with urological malignancies. Informed consent that is based on accurate estimation of the likelihood of various treatment outcomes might improve treatment satisfaction, particularly when complications arise [1]. Accurate risk estimates are also required for clinical trial design, to ensure a homogeneous distribution of risk factors across trial arms.

There are several options for risk estimation in medicine, including urological oncology. In the complete absence of data, physicians can decline to make any prediction. This might be the case with extremely rare pathologies. Alternatively, predictions can be made based on personal knowledge and experience. This method is frequently used in clinical practice, where clinicians rely on a mix of objective and subjective experience [2,3]. Quoting an average outcome probability represents another option. The use of data from randomized clinical trials (RCTs) can illustrate this situation. For example, based on the report by Bill-Axelson et al.[4] of an RCT of radical prostatectomy (RP) vs watchful waiting, a 60-year-old patient could be told that relative to watchful waiting, RP will decrease his risk of death from prostate cancer by virtually 50%, from 20% to 10%, at 10 years of follow-up. Although such data are based on high-quality evidence, they do not appeal universally to patients or clinicians. The reluctance to use RCT data stems from the patients’ and clinicians’ awareness that differences in individual patient characteristics might undermine the validity of such results in individual cases. Based on this limitation, various predictive models have been proposed [3,5]. In this review we provide a systematic approach allowing an assessment of the strengths and weaknesses of the available models.

‘KATTAN-TYPE’ NOMOGRAMS

The statistical definition of a nomogram is a graphical representation of a mathematical formula. The predictors of such a formula might be modelled as continuous or categorical variables to predict a particular endpoint. The statistical methods can consist of multivariable logistic regression or Cox proportional hazards analyses [6]. Nomograms consist of several axes; each variable is represented by a scale, with each value of that variable corresponding to a specific number of points according to its prognostic significance. For example, the nomogram in Fig. 1[6] assigns to each PSA level a unique point value that represents its prognostic significance. In a final pair of axes, the total point value from all the variables is converted to the probability of reaching the endpoint of interest. Nomograms provide the probability of a particular outcome on a continuous scale, which is usually 0–100%.

Figure 1. A preoperative nomogram based on 983 patients treated at The Methodist Hospital, Houston, TX, for predicting PSA recurrence after RP [6].
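
The point-assignment logic described above can be sketched in a few lines of code. This is a minimal illustration that assumes a logistic regression model; the coefficients, the intercept and the value ranges are hypothetical and are not taken from the nomogram in Fig. 1 or from any published model. The predictor with the largest effect range is scaled to span 0–100 points, and the total is converted back to a probability.

```python
import math

# Hypothetical logistic regression model, for illustration only.
INTERCEPT = -3.0
PREDICTORS = {
    # name: (coefficient, minimum value on the axis, maximum value on the axis)
    "psa": (0.08, 0.0, 50.0),
    "gleason_sum": (0.60, 4.0, 10.0),
}

def points_per_unit(predictors):
    """Points per unit of linear predictor; the widest axis spans 0-100 points."""
    widest = max(abs(b) * (hi - lo) for b, lo, hi in predictors.values())
    return 100.0 / widest

def reference_value(b, lo, hi):
    """Axis end with the lowest risk, which receives 0 points."""
    return lo if b >= 0 else hi

def total_points(patient, predictors):
    """Sum of the points read off each predictor axis."""
    scale = points_per_unit(predictors)
    total = 0.0
    for name, value in patient.items():
        b, lo, hi = predictors[name]
        total += b * (value - reference_value(b, lo, hi)) * scale
    return total

def probability_from_points(points, predictors, intercept):
    """Final pair of axes: convert total points to a predicted probability."""
    scale = points_per_unit(predictors)
    base = intercept + sum(b * reference_value(b, lo, hi)
                           for b, lo, hi in predictors.values())
    linear_predictor = base + points / scale
    return 1.0 / (1.0 + math.exp(-linear_predictor))

patient = {"psa": 10.0, "gleason_sum": 7.0}
pts = total_points(patient, PREDICTORS)                      # about 65 points
prob = probability_from_points(pts, PREDICTORS, INTERCEPT)   # about 0.88
```

Because the points are only a rescaling of the regression terms, reading the nomogram by hand and evaluating the formula directly should give the same probability.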

DECISION-TOOL REQUIREMENTS

The discriminant properties (ability to predict the outcome of interest) of nomograms and other decision tools are measured using the concordance index (or c-index), which is virtually synonymous with the area under the receiver operating characteristic curve. While the area under the curve requires binary outcomes (e.g. cure/fail), the c-index can be applied in the presence of case censoring and is used in analyses of time-to-event data [7]. In such analyses, the c-index indicates how accurate the model predictions are. A model with 50% concordance is as good as the toss of a coin. The c-index of most models is 70–85%; few models exceed 80% accuracy (concordance). Concordance provides an overall measure of a model’s ability to discriminate between individuals with and without the endpoint of interest. However, a model might predict very well in low-risk patients yet extremely poorly in high-risk individuals. If the proportion of low-risk individuals is high relative to high-risk individuals, the overall accuracy might be very good and the c-index might not alert the potential model user to the model’s deficiencies when high-risk patients are assessed.
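
As a rough illustration of how concordance handles censoring, the sketch below computes a pairwise c-index in the style usually attributed to Harrell. The function name and the toy data are assumptions made for this example; a pair of patients is counted only when the patient with the shorter follow-up actually had the event.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of usable pairs in which the higher predicted risk
    belongs to the patient whose event occurred earlier.

    times       : follow-up time for each patient
    events      : 1 if the event was observed, 0 if the patient was censored
    risk_scores : model output, higher means higher predicted risk
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a usable pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count one half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

# Toy data: the third patient is censored at 6 time units.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk_scores)  # 4 of 5 usable pairs, 0.8
```

A value of 0.5 corresponds to the coin toss mentioned above, and 1.0 to perfect ranking of patients.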

Model calibration circumvents this limitation of the c-index. Calibration plots graphically depict the model’s calibration and should always accompany c-index-derived accuracy estimates. Calibration plots show the model’s predicted probability on the x-axis and the observed rate of the event of interest on the y-axis. Perfect calibration is achieved when predicted probabilities match the observed rates 1:1, which is indicated by a 45° line. Most good models will show minor departures from ideal predictions, manifested by underestimation of the observed rate (calibration curve above the ideal 45° line) or by overestimation of the observed rate (calibration curve below the ideal 45° line).
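
The numbers behind a calibration plot can be sketched as follows. This is a simplified version that groups patients into equal-width bins of predicted probability; the function name, the bin count and the toy data are assumptions for illustration, and published calibration plots often use smoothed curves instead of bins.

```python
def calibration_table(predicted, observed, n_bins=4):
    """Return (mean predicted probability, observed event rate, n) per bin.

    Each row is one point of a calibration plot: x is the mean prediction,
    y is the observed rate. Points on the 45-degree line have x == y.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_predicted, observed_rate, len(members)))
    return rows

predicted = [0.1, 0.2, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
observed = [0, 0, 1, 0, 1, 0, 1, 1]
rows = calibration_table(predicted, observed, n_bins=2)
# Low-risk bin: mean prediction 0.20, observed rate 0.25 (point above the line,
# so the model underestimates). High-risk bin: 0.75 predicted, 0.75 observed.
```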

Both metrics, discrimination and calibration, should be reported with each model. The provision of just one of these two metrics is insufficient to warrant the implementation of a model into clinical practice. Moreover, both metrics should ideally be applied to an external validation sample, which is needed to ultimately confirm the performance characteristics of a model. Unfortunately, external data might not always be available. Under such circumstances it is imperative to perform an internal validation, where bootstrapping simulates the application of the model under novel testing conditions. Other forms of internal validation, such as cross-validation or data splitting, are less efficient and provide substantially more biased estimates of the model’s true accuracy. Failure to subject a model to external or internal validation tests might result in spuriously elevated accuracy.
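
One common way to carry out a bootstrap internal validation is the optimism correction sketched below. The text does not specify a particular procedure, so this scheme is an assumption; the small gradient-descent logistic regression, the simulated data and all function names are hypothetical stand-ins for a real model and cohort.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def auc(scores, outcomes):
    """Area under the ROC curve by pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_logistic(rows, ys, steps=150, lr=0.5):
    """Tiny logistic regression fitted by gradient descent; w[0] is the intercept."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, ys):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def bootstrap_corrected_auc(rows, ys, n_boot=30, seed=0):
    """Apparent AUC and AUC corrected for the average bootstrap optimism."""
    rng = random.Random(seed)
    n = len(rows)
    w = fit_logistic(rows, ys)
    apparent = auc([predict(w, x) for x in rows], ys)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]      # resample with replacement
        b_rows, b_ys = [rows[i] for i in idx], [ys[i] for i in idx]
        if len(set(b_ys)) < 2:
            continue                                     # resample lacks one outcome
        wb = fit_logistic(b_rows, b_ys)                  # refit on the resample
        auc_boot = auc([predict(wb, x) for x in b_rows], b_ys)
        auc_orig = auc([predict(wb, x) for x in rows], ys)
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - sum(optimism) / len(optimism)

def simulate(n, seed):
    """Simulated cohort: the first predictor is informative, the second is noise."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(1.5 * x[0]) else 0 for x in rows]
    return rows, ys

rows, ys = simulate(60, seed=1)
apparent, corrected = bootstrap_corrected_auc(rows, ys)
```

The apparent value is measured on the same patients used to fit the model, while the corrected value subtracts the average amount by which refitted models flatter themselves on their own resample.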

Taken together, three criteria are required to assess a predictive or prognostic model: (i) the presence of an internal or external validation; (ii) a measure of accuracy (discrimination); and (iii) an assessment of the model’s calibration.

COMPARISON OF NOMOGRAMS WITH OTHER PREDICTION TOOLS

Nomogram vs risk groupings

Physicians often use risk groups to predict the treated natural history of the disease. Patients with similar characteristics are aggregated in risk strata such as high, intermediate or low. Despite its apparent logic, this method is statistically inefficient and might reduce accuracy. Its flaw lies in the assumption that all patients within a risk group are equal, whereas risk groups might be heterogeneous.

A commonly used risk-grouping tool was developed by D’Amico et al.[8,9] for predicting biochemical recurrence in patients treated with RP, external-beam radiotherapy or brachytherapy. It places patients into mutually exclusive risk groups based on clinical stage, biopsy Gleason sum and pretreatment PSA level.

Various studies have documented the superior performance of nomograms compared with risk groupings [10–12]. This might stem from within-group heterogeneity or from the effect of spectrum bias, which reduces the influence of values that are significantly lower or higher than those observed for most patients. In contrast to risk groups, nomograms rely on the most informative type of data coding. For example, continuously coded values are not categorized, which avoids spectrum bias and heterogeneity within strata.

Nomogram vs look-up tables

The superior predictive accuracy of multivariable nomograms vs look-up tables is illustrated by comparisons of the ‘Partin tables’[13,14], which predict the pathological features of prostate cancer, with a suite of nomograms. The Partin tables combine serum PSA level (four categories), clinical stage (seven categories), and biopsy Gleason sum (five categories) to predict organ-confined disease, established extracapsular extension, seminal vesicle invasion, and lymph node involvement at RP. Several studies compared the Partin tables with nomograms predicting the same endpoints, and showed that nomograms had superior accuracy in predicting organ-confined disease, seminal vesicle invasion and lymph node invasion [10,15].

The differences in accuracy might not be trivial; Chun et al.[16] showed that a logistic regression-based nomogram predicting Gleason sum upgrading was 80.4% accurate, vs 52.3% accuracy for a look-up table (P < 0.001). Moreover, the nomogram showed virtually ideal performance, whereas important departures from ideal prediction were recorded for the look-up table. This said, a formal external validation of the Chun et al. nomogram is pending; such a validation will provide ultimate proof of the accuracy of this model. Until then, the advantage of the nomogram over the look-up table rests on a comparison between external and internal validation, which favours the nomogram format.

Nomogram vs tree analysis

Classification and regression tree (CART) analysis is another type of predictive model that uses nonparametric techniques to evaluate data, account for complex relationships, and present the results in a clinically useful form. CARTs rely on progressive splitting of the population (Fig. 2). Splits are based on considerations of statistical significance. The variables that are chosen for each split, the discriminatory values, and the order in which the splitting occurs are all produced by the underlying mathematical algorithms that maximize the model’s discriminant ability. This type of approach can inflate type I error, as it relies heavily on data-based associations. In such cases internal or external validation is particularly important [17].
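
A single step of the progressive splitting described above can be sketched as a search for the cut-off that best separates patients with and without the outcome. The sketch uses Gini impurity as the split criterion, which is a common choice in CART software but is an assumption here, as the text refers only to statistical significance; the PSA values and outcomes are invented.

```python
def gini(labels):
    """Gini impurity of a group with binary labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best single binary split."""
    best = None
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0  # candidate cut-off between neighbours
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Invented example: PSA level and biochemical recurrence (1 = yes).
psa = [2, 4, 6, 8, 12, 15, 20, 30]
recurrence = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, impurity = best_split(psa, recurrence)  # threshold 10.0, impurity 0.1875
```

A full tree repeats this search within each resulting subgroup. Because the cut-off is chosen from the data, the apparent separation is optimistic, which is one way to see why validation matters for tree-based models.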

Figure 2. A simplified example of CART recursive partitioning based on a logistic regression analysis. In this analysis, the clinician simply follows the path of the tree that describes the characteristics of the patient being evaluated, and arrives at the prediction of the outcome of interest for that particular patient.

As with look-up tables, CART analyses are limited by variable coding, as only categories are allowed. This results in within-group heterogeneity and in spectrum bias. Several studies have shown that traditional statistical methods perform better than CART analysis. Kattan et al.[10] showed that Cox proportional hazards regression models provided predictive accuracy superior to that of four tree-based methods. Artificial neural networks (ANNs) are also not bound by the CART boundaries and offer an advantage. Chun et al.[11] showed that a logistic regression-based nomogram was superior to a CART model for predicting the side of extracapsular extension (84% vs 70%). Moreover, the nomogram calibration plot was substantially better than that of the CART model.

Nomogram vs ANNs

In the last 10 years a new class of techniques, the ANN, has been proposed as a supplement or alternative to standard statistical techniques. An ANN classifies a patient’s cancer risk through a complex learning process, where data are manipulated in system layers of hidden software ‘neurones’ (Fig. 3). The data are divided into three parts; the training set, the validation (or testing) set and the verification set. The sample sizes of the sets can be variable. In general, the sample sizes depend on the number of predictions, the homogeneity of patient characteristics, the distribution of prediction, and the strength of the association between predictors and the outcome. The network is then presented with the training set, which provides inputs (one or many risk factors) and desired output (e.g. the presence or absence of prostate cancer). The network then assigns weights and interactions to input variables to promote self-learning and to maximize discriminant ability. The validation set is used to decide when to stop training the network and the verification set is used to report the performance of the ANN. The validation set is not synonymous with an internal or external validation of a nomogram.
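
The roles of the three data sets can be sketched without committing to any particular network library. In this illustration the 60/20/20 split fractions, the patience rule for stopping, the function names and the validation losses are all assumptions; the text states only that the set sizes can vary and that the validation set decides when training stops.

```python
import random

def three_way_split(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the records and divide them into training, validation and verification sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_validation = round(len(shuffled) * fractions[1])
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    verification = shuffled[n_train + n_validation:]  # used only to report performance
    return training, validation, verification

def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

training, validation, verification = three_way_split(range(10))
# Hypothetical validation losses recorded after each training epoch.
stop_at = early_stopping_epoch([0.70, 0.55, 0.50, 0.52, 0.53, 0.40])  # epoch 2
```

The verification set plays no part in either step, which is why it can be used to report performance. As the text notes, the validation set here only controls training and is not the same thing as internal or external validation of a nomogram.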

Figure 3. Example of the interconnections of an ANN (prostate cancer biopsy).

Theoretically, an ANN should have considerable advantages over standard statistical approaches, as there are no rules for variable coding or interactions. This represents an advantage over standard statistical models, which at most can account for one or two interactions. Additionally, ANNs do not require explicit distributional assumptions (such as normality).

However, ANNs are not without their drawbacks. The primary disadvantage of an ANN is its ‘black box’ quality. It is impossible to detect internal modelling flaws within an ANN model. Despite the significant and often fruitful efforts aimed at opening the ‘black box’, several practical disadvantages of the ANN persist. For example, sequential model-reduction techniques, such as those based on the Akaike information criterion, allow stepwise and user-defined removal of variables of lesser significance in regression models, but are not available for ANNs. Likewise, individual regression coefficients and their significance, which indicate the magnitude, direction and importance of each variable, cannot be examined in an ANN.

In line with these considerations, a review of 28 studies comparing ANN and regression modelling concluded that regression-based techniques are similar to ANNs in several characteristics. ANNs offer an alternative method, and their performance and accuracy can equal those of regression-based models, especially if the sample size is large [18].

In a literature review, Schwarzer et al.[19] concluded that machine-learning methods have often failed to perform better than traditional statistical methods, and they outlined numerous design flaws in studies that suggested the superiority of ANNs. For example, in several analyses the investigators failed to test the ANN using a verification dataset, which is equivalent to assessing the discriminant properties of a regression model without accounting for the effect of overfitting bias, which is addressed by either internal or external validation.

Several authors directly compared ANNs and regression-based models; Terrin et al.[20], Kattan et al.[10] and Chun et al.[21] found that regression models outperformed ANNs when both methods were subjected to formal external validations with equally strict testing criteria. However, others did not confirm this: first, in a large review of 72 studies comparing ANNs and regression, ANNs were better than regression in 51% of studies and there was no difference in 42%[22]; and second, in the same tested population [23] as used by Chun et al.[21].

CONCLUSIONS

The use of prognostic and decision aids in urology, e.g. nomograms, look-up tables, risk grouping, CART analysis, and ANNs, has grown rapidly in the last decade. Most such tools provide accurate risk estimation and potentially complement clinical judgement. Continuous multivariable models such as nomograms have consistently shown better performance characteristics than the other options. Moreover, nomograms provide a user-friendly interface, which does not require computer software for interpretation/prediction. Nonetheless, virtually all available nomograms are accessible in either web-based or palm-top computer formats for those interested in electronic application (http://www.nomograms.org and http://www.nomogram.org).

In the future, the predictive ability of nomograms might be improved by incorporating biological markers of disease, quality-of-life measures, or applying prognostic modelling to individual physician or group datasets. At present, nomograms allow clinicians to standardize clinical decision-making using evidence-based and fully individualized tools. Most importantly, nomograms provide accurate and reproducible predictions for patients, who deserve the most accurate prediction of their disease characteristics.

EDITORIAL COMMENT

Shariat et al. reviewed different multivariate models for the diagnosis and staging of prostate cancer, and recommended nomograms as tools for decision-making. We agree that logistic regression (LR)-based nomograms have several advantages over risk groupings, look-up tables or CART analysis. However, there were several studies not considered in their review [1–3], which could not confirm a better performance of LR-based models compared with ANNs. The example of the highly praised study by Chun et al.[4], where the advantage of the nomogram disappears when simply considering the different PSA assays [5], shows that many factors such as analytical, computational and, most importantly, population-based differences must be considered before one can state that LR-based nomograms are superior to ANNs, as has unfortunately been done in the past [6,7].

From the mathematical point of view it would be surprising if LR-based models outperformed ANN models, which can more easily handle complex biological systems. An important review found no differences in 28 studies comparing both LR and ANN, especially when the number of patients is large [8]. ANN and LR models both clearly have an advantage over single variables. It is not a question of which model performs somewhat better, but of how to convince physicians that multivariable models work better than one marker alone, e.g. %free PSA or total PSA.

Carsten Stephan*, Henning Cammann and Klaus Jung*

*Department of Urology and Institute of Medical Informatics, Charité– Universitätsmedizin Berlin, and Berlin Institute for Urological Research, Berlin, Germany

REPLY

We agree with the comments of Stephan et al. that the modelling type matters little, while the ability to predict with accuracy matters the most. This said, it is not surprising that various modelling techniques that are based on data-driven relationships might outperform one another. For example, a nomogram based on cubic splines (data-driven) can be as powerful in predicting the outcome of interest as are ANNs. Although it might be of interest to some to debate which model is most accurate, once the maximum accuracy has been reached and it is not 100% perfect, some clinicians might reject the option of model-derived predictions. For now, no model in urological oncology can provide 100% accurate predictions. Residual sources of error account for a misclassification rate of 15–25%, when for example biochemical recurrence (BCR) or other endpoints are examined. Should this error rate discourage clinicians from relying on multivariable predictor and prognostic tools? The answer is definitely not. Although predictive and prognostic tools are not perfect, their ability to foretell the outcome of interest exceeds that of expert clinicians substantially and statistically significantly. For example, clinician experts at Memorial Sloan-Kettering Cancer Center were 54% accurate in predicting lymph node metastases, vs 72% for nomogram-based predictions [1]. Similarly, clinicians were, on average, 68% accurate in predicting the life-expectancy of patients with prostate cancer treated with either RP or radiotherapy, vs 84.3% for a nomogram predicting the same outcome [2,3]. Of various predictive models, nomograms recently emerged as the preferred format when the opinions of North American urologists were polled [4]. Despite the established benefit of various multivariable models in predicting BCR and other prostate cancer endpoints, scepticism might be encountered when their implementation into routine clinical practice is suggested. 
Lack of a prospectively confirmed benefit in patient outcomes is commonly used as an argument against the use of predictive and prognostic models. At first glance, clinicians who are accustomed to randomized prospective trials that quantify the benefits and detriments of a standard of care relative to those of a novel molecule tend to agree that data supporting the usefulness of nomograms in clinical practice are absent. However, a randomized prospective evaluation of nomogram-based decision-making vs standard-of-care (clinician-based) decision-making is neither practically nor ethically possible. First, ethical considerations would not allow the randomization of patients to management that is purely based on information technology. Second, nomograms and other predictive or prognostic tools are merely meant to assist the clinician in decision-making. In that regard, they provide the clinician with a probability of a given endpoint, say BCR. When the nomogram states that the patient is 80% likely to have BCR after RP, the clinician might, for example, decide to (i) do nothing, (ii) increase the frequency of follow-up, (iii) initiate adjuvant radiotherapy, or (iv) start androgen deprivation. Although the nomogram-derived prediction of elevated risk of BCR prompts the clinician to select one of the four options, which of them is selected depends entirely on the clinician’s preference. It is therefore impossible to objectively test the effect of nomograms on health outcomes, as clinicians’ choices remain the decisive factors in diagnostic and/or therapeutic decision-making.

Pierre I. Karakiewicz*, Umberto Capitanio, Hendrik Isbarn*, Claudio Jeldres* and Shahrokh S. Shariat

*Cancer Prognostics and Health Outcomes Unit, University of Montreal, Montreal, Quebec, Canada; Department of Urology, Vita-Salute San Raffaele, Milan, Italy; Department of Urology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA

REFERENCES

  1. Specht MC, Kattan MW, Gonen M, Fey J, Van Zee KJ. Predicting nonsentinel node status after positive sentinel lymph biopsy for breast cancer: clinicians versus nomogram. Ann Surg Oncol 2005; 12: 654–9
  2. Walz J, Gallina A, Perrotte P et al. Clinicians are poor raters of life-expectancy before radical prostatectomy or definitive radiotherapy for localized prostate cancer. BJU Int 2007; 100: 1254–8
  3. Walz J, Gallina A, Saad F et al. A nomogram predicting 10-year life expectancy in candidates for radical prostatectomy or radiotherapy for prostate cancer. J Clin Oncol 2007; 25: 3576–81
  4. Capitanio U, Jeldres C, Shariat SF, Karakiewicz P. Clinicians are most familiar with nomograms and rate their clinical usefulness highest, look-up tables are second best. Eur Urol 2008; doi:10.1016/j.eururo.2008.04.082