Assessing the accuracy and generalizability of the preoperative and postoperative Karakiewicz nomograms for renal cell carcinoma: results from a multicentre European and US study



What's known on the subject? and What does the study add?

  • The preoperative and postoperative Karakiewicz models for RCC are considered among the best prognostic tools available for clinical counselling. Nevertheless, their predictive accuracy has been externally validated in only two papers: one by the same author and one in an independent sample of Asian patients. These models have not been externally validated in a truly independent multicentre series of patients.
  • Our study demonstrated that these models 1) provide robust prognostic information; 2) were robustly built; 3) are also useful in populations far from the original series. The present results are the first to show the validity and generalizability of Karakiewicz nomograms, which are based on surgical series from European centres, for large-, mid- and small-volume European and American centres.


Objective

  • To assess the accuracy and generalizability of the pre- and postoperative Karakiewicz nomograms for predicting cancer-specific survival (CSS) in patients with renal cell carcinoma (RCC).

Patients and Methods

  • This retrospective study included 3230 patients from European and US centres, who were treated by radical or partial nephrectomy for RCC between 1992 and 2010.
  • Prognostic scores for each patient were calculated and the primary endpoint was CSS.
  • Discriminating ability was assessed by Harrell's c-index for censored data. The ‘validation by calibration’ method proposed by Van Houwelingen was used for checking the calibration of covariate effects. Calibration was graphically explored.


Results

  • Local and systemic symptoms were present in 23.2% and 9.1% of the patients, respectively.
  • The median follow-up (FU) was 49 months. At the last FU, 408 cancer-related deaths were recorded. Kaplan–Meier estimates of CSS (with 95% confidence intervals [CIs]) at 5 and 10 years were 0.86 (0.84–0.87) and 0.77 (0.75–0.80), respectively.
  • Both nomograms discriminated well. Stratified c-indices for CSS were 0.784 (95% CI 0.753–0.814) for the preoperative nomogram, and 0.842 (95% CI 0.816–0.867) for the postoperative one, with a significant difference between the two values (P < 0.001).
  • The covariate-based predictions on our data for both nomograms were valid. The calibration plots showed no relevant departures from ideal predictions.


Conclusions

  • The results suggest that the postoperative Karakiewicz nomogram discriminates substantially better than the preoperative one.
  • These nomogram-based predictions may be used as benchmark data for pretreatment and postoperative decision-making in patients at various stages of RCC.

Abbreviations

CSS, cancer-specific survival; PA, predictive accuracy; LP, linear predictor


Introduction

Over the last two decades, the management options for patients with RCC at all stages have increased [1]. At the same time, several prognostic factors and tools have been identified, developed and validated to help with clinical decision-making [2]. In general, all these prognostic tools are more accurate than the standard TNM classification or Fuhrman grade in predicting survival outcomes. A substantial advantage of prognostic tools is the ability to measure their predictive accuracy (PA), which allows an objective evaluation of their performance [3]. Although the first models provided a PA that was only modestly better than would be expected by chance (∼65–67%) [4-6], models with improved PA have recently been devised and validated [7, 8]. These models integrate readily available variables such as patient age, gender, presence and type of symptoms at presentation, tumour size, TNM stage and Fuhrman grade, to allow accurate prognostication (preoperative PA of 84–88% and postoperative PA of 80–91%) [3]. Such models can be used in clinical practice to aid patient counselling, to guide appropriate postoperative surveillance programmes and also, most recently, to classify patients for clinical trials. For example, whereas the UK Medical Research Council SORCE trial recruits patients with intermediate- and high-risk Leibovich scores for randomization between sorafenib and placebo [9], other trials, the Adjuvant Sorafenib or Sunitinib for Unfavourable Renal Carcinoma and Sunitinib Treatment of Renal Adjuvant Cancer trials, select patients based on variables included in a different nomogram [9, 10]. One of the barriers to the wide use of nomograms, despite the fact that they outperform risk grouping and tables [11-15], is the lack of robust and widespread assessment of their accuracy and generalizability.

The preoperative and postoperative Karakiewicz nomograms that integrate readily available clinical, radiological and pathological information accurately predict cancer-specific survival (CSS) after nephrectomy [7, 8, 16]. The PA of these nomograms has been externally validated by Karakiewicz and in an independent sample of Asian patients [7, 8, 10], but they have not been externally validated in a truly independent multicentre series of patients. The aim of the present study, therefore, was to externally validate the accuracy (discrimination and calibration) and to assess the generalizability of the pre- and postoperative Karakiewicz nomograms in a multicentre European–US series of patients surgically treated for RCC.

Patients and Methods

Study Subjects

Consecutive patients who underwent surgery for RCC between 1992 and 2010 in 11 centres in Europe (nine centres) and the USA (two centres) were pooled in one database which provided a total of 3911 patients. The hospitals involved did not provide data on patients for the complete time period, but contributed to the database by adding data on all consecutive patients undergoing surgery during the hospital's period of participation. Data acquisition was approved by the local ethical committees before assessment. None of the patients received neoadjuvant treatment before or adjuvant treatment after surgery. Nevertheless, in case of simultaneous metastasis or disease progression during follow-up (FU), different treatment concepts based on current guidelines were applied, i.e. surgical treatment of metastases, immunotherapy and/or targeted therapy [1, 17].

Inclusion criteria mirrored the eligibility criteria common to the two nomograms: we excluded the patients with benign lesions (n = 280), patients with lack of information on variables used in either nomogram (n = 305), patients who died from surgical complications (perioperative death) in the first month after nephrectomy (n = 49) and patients with a FU < 1 month (n = 299). A total of 3230 patients were thus available for analysis.

Study Variables

From each institutional database, anonymous patient demographic, surgical and pathological data were collected and centrally pooled. Patients were staged preoperatively using abdomino-pelvic CT or MRI, and either a chest CT scan or X-ray. Pathological staging was determined in accordance with the 2002 TNM classification [18]. For patients diagnosed before 2002, clinical data were converted to the 2002 (TNM) staging system. Classifications T3a, b and c were pooled. Tumour size of pathological specimens was determined as the largest dimension in cm. The Heidelberg classification was used to stratify histological subtypes and the Fuhrman grading scheme was used to determine the nuclear grade of tumours [19]. Clinical presentation at time of diagnosis was categorized as incidental or symptomatic (local or systemic), as described by Patard et al. [20].

The preoperative nomogram included patient age, gender, clinical stage, presence of metastases, tumour size and symptom classification. The postoperative nomogram included TNM stages, tumour size, Fuhrman grade, histological subtype and local symptoms. For each patient, the probability of CSS was then estimated from each nomogram.


Follow-up

Patients were followed according to protocols established at each institution. Duration of FU was assessed from the date of surgery until last FU. Death was defined as either cancer-specific or not related to the tumour. The primary endpoint of the study was CSS. Patient survival data were received from the patient's GP, by direct phone call or from the institutional cancer registries. FU assessment was completed in 2011.

Statistical Analysis

The appropriate scores and predicted probabilities were attributed to each patient according to original nomograms and the details of the nomograms are described elsewhere [7, 8]. For descriptive purposes only, individual predicted probability values were arbitrarily categorized into five classes (<0.5, 0.5–0.7, 0.7–0.8, 0.8–0.9, 0.9–1.0). The endpoint was CSS, defined as the time from surgery to death attributable to cancer according to clinicians; patients who died from causes other than RCC were censored at the date of death and surviving patients were censored at the date of last available information. The median FU was estimated using the inverse Kaplan–Meier method [21]. Survival curves were estimated by the product-limit method of Kaplan–Meier and compared using the log rank statistic.
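The censoring rules and the two Kaplan–Meier uses described above can be illustrated with a minimal Python sketch. It is not the authors' SAS or R code: the function names and the toy data are hypothetical, and the reverse Kaplan–Meier median is computed simply by swapping the roles of events and censoring.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(t)) at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing this time (events and censorings).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def median_follow_up(times, events):
    """Reverse Kaplan-Meier: censoring becomes the 'event'; return the first
    time at which the resulting curve drops to 0.5 or below."""
    reversed_events = [1 - e for e in events]
    for t, s in kaplan_meier(times, reversed_events):
        if s <= 0.5:
            return t
    return None  # median follow-up not reached


# Hypothetical toy data (not from the study): months from surgery;
# 1 = death attributed to RCC, 0 = censored (alive at last contact,
# or dead from another cause, censored at the date of death).
months = [6, 12, 12, 20, 31, 40, 48, 60, 72, 90]
css_event = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

css_curve = kaplan_meier(months, css_event)
median_fu = median_follow_up(months, css_event)
```

Deaths from other causes enter this sketch as censored observations, which mirrors the endpoint definition in the text.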

The discriminating ability of the prognostic nomograms was assessed using Harrell's c-index for censored data [22]. Harrell's c-index was calculated stratified by centre [23] and 95% CIs were calculated according to Pencina et al. [24]. A comparison of the two stratified c-index values was performed using the method described by Haibe-Kains et al. [25]. The c-index and 95% CI for each centre were also calculated. We assessed heterogeneity of the c-index between centres using the Q and I² statistics, the latter as a measure of the proportion of total variation in estimates that was attributable to heterogeneity.
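As an illustration of the discrimination measure, the sketch below computes Harrell's c-index with comparisons restricted to pairs of patients from the same centre. This is one common way to stratify and is an assumption here; the exact procedure of reference [23] may differ. Pairs with tied survival times are simply skipped, and the function names, the `risk` score (higher means worse predicted prognosis) and the toy data are hypothetical.

```python
def concordance_counts(times, events, risk):
    """Count concordant, tied-in-risk and usable pairs (Harrell's definition).

    A pair is usable when the subject with the shorter time had the event.
    A higher `risk` value is expected to go with shorter survival.
    """
    concordant = tied = usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a usable pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    return concordant, tied, usable


def stratified_c_index(times, events, risk, centre):
    """Pool pair counts within each centre; never compare across centres."""
    totals = [0, 0, 0]
    for c in sorted(set(centre)):
        idx = [k for k, v in enumerate(centre) if v == c]
        counts = concordance_counts(
            [times[k] for k in idx],
            [events[k] for k in idx],
            [risk[k] for k in idx],
        )
        totals = [a + b for a, b in zip(totals, counts)]
    concordant, tied, usable = totals
    return (concordant + 0.5 * tied) / usable


# Hypothetical toy data: two centres, three patients each.
times = [5, 10, 15, 4, 8, 12]
events = [1, 1, 0, 1, 0, 1]
risk = [0.9, 0.5, 0.2, 0.3, 0.6, 0.1]
centre = ["A", "A", "A", "B", "B", "B"]

c_index = stratified_c_index(times, events, risk, centre)
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect ranking of patients within centres.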

Calibration was explored graphically, comparing the mean predicted values with observed survival estimates by decile of predicted values. The ‘validation by calibration’ method proposed by Van Houwelingen [26] was used for checking the calibration of covariate effects, testing if the relative effects of covariates were actually stronger or weaker than originally estimated. For both nomograms, we fitted a Cox regression model, stratified by centre, in which the linear predictor (LP) from each nomogram was entered as a covariate. The corresponding regression coefficient allowed us to test the hypothesis βLP = 1, which, if not rejected, indicates the validity of covariate-based predictions using the new data [27]. We assessed heterogeneity of βLP between centres using the likelihood ratio of two centre-stratified models, one with centre-specific LP estimates and one with an overall LP estimate, as suggested by Smith et al. [28]. Under the null hypothesis of no heterogeneity, this statistic approximately follows a chi-squared distribution on J − 1 degrees of freedom, where J is the total number of centres [28].
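The ‘validation by calibration’ step can be sketched as a centre-stratified Cox model with the nomogram LP as its only covariate, followed by a test of βLP = 1. The sketch below is a simplified illustration under stated assumptions: it uses Newton–Raphson on the partial likelihood with Breslow handling of ties, and a Wald test, whereas the text does not say which test the authors used. All names and the toy data are hypothetical.

```python
import math


def cox_score_info(beta, strata):
    """Score and observed information of the stratified Cox partial
    likelihood for a single covariate (Breslow handling of tied times)."""
    score = info = 0.0
    for times, events, x in strata:
        for i in range(len(times)):
            if not events[i]:
                continue
            # Risk set: subjects of the same stratum still observed at times[i].
            w = [(math.exp(beta * x[j]), x[j])
                 for j in range(len(times)) if times[j] >= times[i]]
            s0 = sum(e for e, _ in w)
            s1 = sum(e * v for e, v in w)
            s2 = sum(e * v * v for e, v in w)
            score += x[i] - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def fit_beta_lp(strata, beta=1.0, tol=1e-10, max_iter=50):
    """Newton-Raphson estimate of the slope on the LP and its standard error."""
    for _ in range(max_iter):
        score, info = cox_score_info(beta, strata)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    _, info = cox_score_info(beta, strata)
    return beta, 1.0 / math.sqrt(info)


def wald_p_value(beta, se, null_value=1.0):
    """Two-sided Wald test of H0: beta = null_value (normal approximation)."""
    z = (beta - null_value) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical toy data: one (times, events, LP) triple per centre.
toy_strata = [
    ([2, 4, 6, 8, 10, 12], [1, 1, 0, 1, 1, 0], [1.8, 0.4, 1.0, 1.2, -0.3, -0.8]),
    ([1, 3, 5, 7, 9], [1, 0, 1, 1, 0], [0.2, 1.5, 0.9, -0.5, 0.1]),
]

beta_hat, se_hat = fit_beta_lp(toy_strata)
p_value = wald_p_value(beta_hat, se_hat)
```

A slope below 1 would suggest that the original covariate effects are too strong for the new data, and a slope above 1 that they are too weak. The heterogeneity test described above would compare this model with one that fits a separate slope in each centre.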

Statistical analyses were performed using SAS version 9.2 (SAS Institute, Cary, NC, USA) and R version 2.13.2 (R Foundation for Statistical Computing, Vienna, Austria).


Results

The clinicopathological characteristics of 3230 patients are shown in Table 1. The median (range) patient age at surgery was 62 (17–97) years. Male gender was predominant (64%). Only 9% of patients had systemic symptoms, and 8% showed metastatic disease at the time of presentation. Although radical nephrectomy was the most common treatment at all centres, the rate of partial nephrectomy was still 22% with pT1a stage disease found in 36% of cases. A nephron-sparing procedure was used in 18 and 82% of the cases for imperative and elective indications, respectively. Conventional clear-cell histology was the most frequently observed (78%).

Table 1. Baseline characteristics*.

Variable                     Whole dataset (N = 3230)
Date of surgery              1992–2010
Median (range) age, years    62 (17–97)
Gender, %
Symptoms, %
Nephrectomy type, %
Histology, %
Stage T, %
Metastases, %
Tumour size, %
  ≤4.0 cm                    38.8
  4.1–7.0 cm                 34.1
  >7.0 cm                    27.1
Fuhrman's grade

*Table entries are percentages of sample but for date of surgery and age.

The distribution of risk categories of the patient population according to both the preoperative and postoperative nomograms is shown in Table 2. In brief, despite differences among the centres, most patients had a good or very good prognosis (5-year predicted probability values 80–100%): ∼75% of cases according to the preoperative nomogram and ∼80% according to the postoperative nomogram.

Table 2. Distribution of risk categories by centre.

5-year predicted probability values, %

Centre                <50     50–70   70–80   80–90   90–100
Preoperative nomogram
 1 (n = 211)          6.6     2.4     10.0    15.2    65.9
 2 (n = 344)
 3 (n = 164)
 4 (n = 214)          7.5     8.9     12.1    28.0    43.5
 5 (n = 808)
 6 (n = 152)          5.3     16.4    13.2    17.8    47.4
 7 (n = 193)          11.9    11.4    13.0    17.1    46.6
 8 (n = 498)
 9 (n = 62)           1.6     9.7     11.3    19.4    58.1
10 (n = 237)          18.6    10.5    8.4     16.9    45.6
11 (n = 286)          7.3     14.
Overall (N = 3169)
Postoperative nomogram
 1 (n = 213)
 2 (n = 341)
 3 (n = 165)
 4 (n = 210)
 5 (n = 821)
 6 (n = 155)          5.2     2.6     14.8    28.4    49.0
 7 (n = 199)
 8 (n = 202)
 9 (n = 62)
10 (n = 243)          17.7    10.3    6.2     16.9    49.0
11 (n = 288)          8.3     6.9     14.6    25.3    44.8
Overall (N = 2899)

The median FU was 49 months. During the first 5 years, 488 patients died and 1015 patients were alive at 5-year FU. At last FU, 622 subjects had died from any cause (416 cancer-related deaths). Kaplan–Meier estimates of CSS at 1, 3, 5 and 10 years were 0.95 (95% CI 0.94–0.96), 0.92 (95% CI 0.91–0.93), 0.86 (95% CI 0.84–0.87) and 0.77 (95% CI 0.75–0.80), respectively.

Both nomograms discriminated well (Fig. 1). Stratified c-indices for CSS were 0.784 (95% CI 0.753–0.814) for the preoperative (3169 patients) and 0.842 (95% CI 0.816–0.867) for the postoperative nomogram (2899 patients), and there was a significant difference between the two values (P < 0.001; Table 3). Surprisingly, in the US centres (#8 and #10) the nomograms, originally developed for European patients, discriminated very well. A high degree of heterogeneity of c-indices between the centres was found for both nomograms (P < 0.001) with I² values of 82 and 76% for the preoperative and postoperative nomograms, respectively (Table 3).

Figure 1. Kaplan–Meier curves of CSS for the two prognostic nomograms.

Table 3. Results of discrimination c-index value by centre.

           Harrell's c-index (95% CI)
Centre     Preoperative nomogram     Postoperative nomogram
 1         0.874 (0.735–1.000)       0.871 (0.718–1.000)
 2         0.903 (0.858–0.949)       0.921 (0.889–0.953)
 3         0.845 (0.760–0.931)       0.897 (0.842–0.953)
 4         0.806 (0.734–0.877)       0.848 (0.789–0.907)
 5         0.754 (0.712–0.796)       0.825 (0.791–0.859)
 6         0.671 (0.536–0.806)       0.787 (0.671–0.904)
 7         0.914 (0.876–0.951)       0.937 (0.910–0.964)
 8         0.903 (0.825–0.981)       0.910 (0.877–0.942)
 9         0.766 (0.638–0.894)       0.790 (0.664–0.916)
10         0.895 (0.843–0.947)       0.890 (0.843–0.938)
11         0.788 (0.736–0.840)       0.844 (0.800–0.888)
Overall    0.784 (0.753–0.814)       0.842 (0.816–0.867)
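The heterogeneity statistics reported for the centre-specific c-indices can be approximated from the published table alone. The sketch below is a rough check, not the authors' computation: it assumes symmetric normal-theory intervals, so each standard error is taken as the CI width divided by 2 × 1.96 (which is only approximate for centre 1, whose upper limit is truncated at 1.000), and it uses the usual inverse-variance form of Cochran's Q with I² = (Q − df)/Q. The function and variable names are hypothetical.

```python
def cochran_q_i2(rows, z=1.96):
    """Cochran's Q, its degrees of freedom and I^2 from (estimate, lower, upper)."""
    estimates = [est for est, _, _ in rows]
    ses = [(upper - lower) / (2 * z) for _, lower, upper in rows]
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(rows) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Centre-specific c-indices with 95% CIs, centres 1 to 11, as published.
preoperative = [
    (0.874, 0.735, 1.000), (0.903, 0.858, 0.949), (0.845, 0.760, 0.931),
    (0.806, 0.734, 0.877), (0.754, 0.712, 0.796), (0.671, 0.536, 0.806),
    (0.914, 0.876, 0.951), (0.903, 0.825, 0.981), (0.766, 0.638, 0.894),
    (0.895, 0.843, 0.947), (0.788, 0.736, 0.840),
]
postoperative = [
    (0.871, 0.718, 1.000), (0.921, 0.889, 0.953), (0.897, 0.842, 0.953),
    (0.848, 0.789, 0.907), (0.825, 0.791, 0.859), (0.787, 0.671, 0.904),
    (0.937, 0.910, 0.964), (0.910, 0.877, 0.942), (0.790, 0.664, 0.916),
    (0.890, 0.843, 0.938), (0.844, 0.800, 0.888),
]

q_pre, df_pre, i2_pre = cochran_q_i2(preoperative)
q_post, df_post, i2_post = cochran_q_i2(postoperative)
```

On these rounded inputs the sketch gives I² values close to the reported 82% and 76%, so the published centre-level estimates appear consistent with the stated degree of heterogeneity.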

Table 4 shows the results of the analysis of the calibration of covariate effects. The null hypothesis βLP = 1 on the validity of LP was not rejected for the preoperative (P = 0.441) or for the postoperative nomogram (P = 0.876). No significant heterogeneity of βLP among centres was found for the preoperative (P = 0.109) or for the postoperative nomogram (P = 0.430), with most estimates close to the null hypothesis βLP = 1 (Table 4). These results indicated the validity of covariate-based predictions on our data for both nomograms. The calibration plots of the two prognostic nomograms (Fig. 2) are shown for 1-, 2-, 5-, and 10-year predictions. The graphs showed no relevant departures from ideal predictions.
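The graphical calibration check (mean predicted probability against observed Kaplan–Meier survival within groups of predicted values) can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, the toy data are invented, and the toy call uses two groups instead of deciles because of the tiny sample.

```python
def km_survival_at(times, events, horizon):
    """Kaplan-Meier survival probability at `horizon` (same units as times)."""
    surv = 1.0
    for t in sorted(set(times)):
        if t > horizon:
            break
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
    return surv


def calibration_table(predicted, times, events, horizon, n_groups=10):
    """Rows of (mean predicted, observed KM survival), one per quantile group."""
    order = sorted(range(len(predicted)), key=lambda k: predicted[k])
    n = len(order)
    rows = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:
            continue  # fewer patients than groups
        mean_pred = sum(predicted[k] for k in idx) / len(idx)
        observed = km_survival_at(
            [times[k] for k in idx], [events[k] for k in idx], horizon
        )
        rows.append((mean_pred, observed))
    return rows


# Hypothetical toy data: predicted 5-year CSS, FU in months, RCC death flag.
predicted = [0.55, 0.60, 0.65, 0.70, 0.90, 0.92, 0.95, 0.97]
months = [20, 70, 35, 80, 90, 50, 85, 100]
rcc_death = [1, 0, 1, 0, 0, 1, 0, 0]

rows = calibration_table(predicted, months, rcc_death, horizon=60, n_groups=2)
```

Plotting the second element of each row against the first, with the identity line as reference, gives a calibration plot of the kind described; points near the line indicate good agreement between predicted and observed survival.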

Figure 2. Calibration plots of the two prognostic nomograms predicting CSS at 1, 2, 5 and 10 years.

Table 4. Results of ‘validation by calibration’ by centre.

           β estimates of LP (95% CI)
Centre     Preoperative nomogram     Postoperative nomogram
 1         1.09 (0.61–1.57)          0.97 (0.57–1.37)
 2         0.87 (0.67–1.08)          0.85 (0.66–1.04)
 3         1.08 (0.70–1.46)          1.12 (0.78–1.47)
 4         1.05 (0.75–1.35)          1.10 (0.84–1.36)
 5         0.94 (0.80–1.08)          1.04 (0.92–1.15)
 6         0.50 (0.01–1.00)          0.81 (0.34–1.28)
 7         1.22 (0.98–1.47)          1.12 (0.92–1.33)
 8         1.17 (0.69–1.66)          0.77 (0.19–1.34)
 9         0.51 (0.10–0.92)          0.60 (0.17–1.03)
10         1.05 (0.63–1.46)          0.96 (0.61–1.32)
11         0.96 (0.71–1.20)          1.03 (0.82–1.25)
Overall    0.97 (0.89–1.05)          1.01 (0.93–1.08)


Discussion

Despite several prognostic models having been developed to predict the outcome of patients with RCC, only a few of them have been fully assessed with calibration, external validation, assessment of transportability and reproducibility [3]. A possible explanation is the tendency of many investigators to develop new models based on their own datasets, instead of comparing and improving existing models [10, 29, 30].

The accuracy and generalizability of prognostic systems are related concepts. Accuracy, defined as the degree to which predictions match outcomes, consists of discrimination and calibration. Generalizability is the ability of the system to provide accurate predictions in a different sample of patients, which depends on its reproducibility and transportability. In the present study, we evaluated both the accuracy and the generalizability of two strong prognostic tools.

Based on the results of the present study, and as expected, the postoperative nomogram, which is based on pathological variables, discriminates better than the preoperative one (P < 0.001). Moreover, calibration plots demonstrated that both nomograms were well calibrated at 1-, 2-, 5- and 10-year FU. Finally, there was no evidence that the relative effects of the covariates were inadequately estimated for either nomogram. Recently, Tan et al. [10] compared the prediction of survival outcomes of several prognostic systems (the Karakiewicz, Kattan and Sorbellini nomograms, and the Leibovich model) and concluded that the postoperative Karakiewicz nomogram was the most useful in terms of individual counselling, providing excellently calibrated CSS estimates.

As far as generalizability is concerned, a model's reproducibility is usually assessed using bootstrapping techniques in the development sample, as Karakiewicz did in the original studies [7, 8]. Moreover, the assessment of transportability requires external validation in different populations. The present dataset, based on a multi-institutional collection of data provided within a wide timeframe, ensured heterogeneity of the study cohort that was suitable for transportability assessment. The heterogeneity among centres may be considered a strength of the study, rather than a weakness. To generalize a prognostic model, a high grade of heterogeneity, accounting for the entire spectrum of disease, is desirable, mirroring the clinical world (Table 2).

Notably, none of the patients in the present dataset was included in either the development series or in the validation cohort of the original studies [7, 8]. Differences in the methods of patient selection and data collection between the development study and clinical practice are common when multiple independent investigators attempt to apply any prognostic system. The more heterogeneous the settings in which the system is tested and found to be accurate, the more likely it is to generalize to an untested setting [31]. Consistency of results among centres favours a valid conclusion. In this regard, the present study indicates that the nomograms achieved robust geographic transportability, despite some degree of heterogeneity, as confirmed by the excellent performance recorded in the US centres in terms of c-index and by the validity of covariate-based predictions using our data for both nomograms (‘validation by calibration’ proposed by Van Houwelingen) [26].

In 1999, Justice et al. [31], in order to help clinicians and researchers become more familiar with prognostic technology, proposed a five-level hierarchy of external validation for prognostic systems. Following their description, a system achieves the maximum level of validation if it is assessed for reproducibility, historical, geographical and methodological spectrum, and FU period transportability (level 5); however, if it is evaluated only for reproducibility, it achieves internal validation alone (level 0). In the present study, for the first time, we performed multiple independent validations in the hands of diverse independent investigators at diverse geographical sites, reaching level 4 of validation.

The present study has some potential limitations. Its retrospective design can be considered the main one, but the excellent calibration shown suggests that this is probably of relatively minor concern. The multi-institutional nature of the cohort may be interpreted as a further limitation; however, to evaluate the accuracy and generalizability of prognostic models, heterogeneity of baseline characteristics, rather than homogeneity, is desirable. Lack of central pathology review might also represent a weakness. However, although central pathology review ideally increases validity by minimizing interobserver variability, it is of limited relevance from a clinical viewpoint, since such variability is common in clinical practice. In the present series we recorded a 49-month median FU, which could appear too short for long-term predictions, but it does exceed the FU periods of the original articles [7, 8].

With the expanding range of potential therapeutic interventions including surveillance, ablation and systemic therapies, physicians and patients need more accurate and generalizable prediction tools to help them in their clinical decision-making. The present results are the first to show the validity and generalizability of Karakiewicz nomograms, which are based on surgical series from European centres, for large-, mid- and small-volume European and US centres.

In conclusion, the present study better defines the general applicability of Karakiewicz's nomograms for predicting survival in patients with RCC treated with nephrectomy. These nomogram-based predictions may be used as benchmark data for pretreatment and postoperative decision-making in patients with various stages of RCC.


Acknowledgements

L. Cindolo personally thanks A. Nista for his support.

Conflict of Interest

None declared.