Validation by calibration of the UCLA integrated staging system prognostic model for nonmetastatic renal cell carcinoma after nephrectomy
Article first published online: 12 MAY 2008
Copyright © 2008 American Cancer Society
Volume 113, Issue 1, pages 65–71, 1 July 2008
How to Cite
Cindolo, L., Chiodini, P., Gallo, C., Ficarra, V., Schips, L., Tostain, J., de La Taille, A., Artibani, W. and Patard, J. J. (2008), Validation by calibration of the UCLA integrated staging system prognostic model for nonmetastatic renal cell carcinoma after nephrectomy. Cancer, 113: 65–71. doi: 10.1002/cncr.23517
- Issue published online: 20 JUN 2008
- Article first published online: 12 MAY 2008
- Manuscript Accepted: 22 FEB 2008
- Manuscript Revised: 15 FEB 2008
- Manuscript Received: 24 JAN 2008
- Italian Ministry of Education, University and Research
- renal cancer;
- University of California at Los Angeles (UCLA);
- Integrated Staging System (UISS);
- validation by calibration
To the authors' knowledge, calibration of the University of California at Los Angeles (UCLA) Integrated Staging System (UISS) prognostic score in patients nephrectomized for nonmetastatic renal cell carcinoma (RCC) has never been specifically addressed. The objective of the current study was to evaluate the calibration of the UISS prognostic score in a European multicenter retrospective study.
Six European centers participated in the study. According to the UISS, the endpoint was overall survival (OS). Survival curves were estimated by the Kaplan-Meier method. For calibration assessment, the approach of ‘validation by calibration’ first proposed by Van Houwelingen was used. The original prognostic score is embedded in a ‘calibration model’ that allows testing, in the validation cohort, of the baseline hazards function as well as the model linear predictor. Estimates of the ‘calibration model’ were used to recalibrate the UISS score.
Of the 2471 available subjects, 399 had died of any cause within the first 5 years. The observed OS curves were compared with the corresponding expected model-based curves. The UISS model did not adequately predict OS, particularly in the extreme categories (P < .0001). Patients in the validation sample fared systematically better than patients in the development cohort. There was no evidence, however, of a change in the relative effect of the prognostic covariates. After recalibration, the UISS score worked well in the validation cohort.
The UISS score has good discrimination accuracy and is based on an adequately developed risk function. However, it systematically underestimates OS. At least in a European cohort of RCC patients, the use of the recalibrated UISS model could improve prediction accuracy. Cancer 2008. © 2008 American Cancer Society.
Renal cell carcinoma (RCC) is the most frequent malignancy of the adult kidney and is considered the most lethal genitourinary neoplasm, with approximately 40% of patients dying of metastatic disease progression.1
Localized RCC is surgically treated with either radical nephrectomy or nephron-sparing options.2 Complete tumor ablation is curative in the majority of patients; nevertheless, approximately 30% of patients who undergo nephrectomy for localized RCC will experience disease recurrence during their lifetime.3 Recently, substantial therapeutic progress has been achieved in metastatic RCC with antiangiogenic drugs.4–6 However, to our knowledge to date no treatment has proven effective in the adjuvant setting for high-risk localized RCC.7, 8 The identification of factors that will predict the course of the disease and the response to current therapeutic agents is therefore an important challenge. Designing accurate prognostic models should aid in optimizing care for individual patients, namely, choices regarding adjuvant and neoadjuvant treatments, counseling about life expectancy, patient classification into groups for international therapy study protocols, and optimization of a cost-effective and evidence-based schedule for postoperative surveillance.
Several prognostic systems have been proposed in the literature9–14 to predict the prognosis of patients with nephrectomized nonmetastatic RCC, but they differ as to number and type of covariates, tool properties (nomogram or prognostic categories), and endpoints (overall survival, cancer-specific survival, recurrence-free survival). Among these models, the University of California at Los Angeles (UCLA) Integrated Staging System (UISS), a postoperative model based on TNM stage, Fuhrman grade, and Eastern Cooperative Oncology Group (ECOG) performance status (PS), has gained important credibility by being applied with success in a large international cohort.15
In a previous article,16 we compared the discriminating accuracy of 4 prognostic scores in a large multicenter European cohort (ie, the ability of models to separate those who develop events from those who do not). We found 1) that postoperative models discriminated substantially better than preoperative ones, and 2) that both Kattan et al9 and the UISS10 models worked well in a clinical perspective.
To our knowledge, calibration has been evaluated for the Karakiewicz et al,14 Hupertan et al,19 and Kattan20 nomograms, but has never been investigated for the UISS score. Thus, the objective of the current study was to evaluate the calibration of the UISS prognostic score in a large multicenter study involving patients from 6 centers in 3 European countries (Italy, France, and Austria).
MATERIALS AND METHODS
The UISS-UCLA integrated staging system10 for nonmetastatic RCC is a decision box integrating TNM stage, ECOG PS, and Fuhrman grade. In the initial article, variables were selected on the basis of a Cox proportional hazards model.21 Patients were categorized into 3 groups at low, intermediate, and high risk, and predicted probabilities of overall survival (OS) were reported yearly up to 5 years.
The study sample of this multicenter retrospective cohort study consisted of subjects who underwent surgery for RCC between 1984 and 2002 in 6 urologic centers in Italy (Verona and Naples), France (Rennes, St Etienne, and Créteil), and Austria (Graz). Institutional databases were available in the participating clinical centers and covered 3151 subjects. Patient management and follow-up were described elsewhere.16 In the current study, we included subjects who mirrored the eligibility criteria of the article in which the UISS model was first developed6; therefore, we excluded patients with distant metastasis (M+) or histologically confirmed lymph node-positive disease (N+), as well as patients with benign lesions, bilateral disease, or Bellini duct carcinoma. We also excluded subjects who died after surgical complications and those with a follow-up of <1 month. An additional 37 subjects were excluded because the lack of information regarding Fuhrman grade precluded an attribution to a UISS risk category. A flow chart of patient selection is reported in Figure 1. Information regarding demographic, surgical, and pathologic data was reported elsewhere.16 Pathologic staging was determined in accordance with the TNM staging system (1997)22; tumor size was determined on pathologic specimens as the greatest dimension in centimeters (cm); and general health status was measured according to the ECOG PS score, categorized as ECOG 0 versus ≥1.
Using the UISS, the endpoint was OS, defined as the time from surgery to death for any cause or, for surviving patients, to the last available information. To be consistent with information provided by Zisman et al,21 only the events occurring within the first 5 years were considered. Survival curves were estimated by the product-limit method of Kaplan-Meier. These ‘observed’ OS curves were compared with the ‘expected’ OS values predicted by the UISS model for each risk category. Peculiar to the UISS model is the availability of an expected OS ‘profile,’ rather than just a single value, as usual; in the main article, 5 estimates for the first 5 years were provided.10 As a consequence, the assessment of calibration was improved, but a more complicated statistical approach was needed.
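The product-limit estimate and the 5-year truncation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' SAS or R code; the function names `kaplan_meier` and `censor_at` and the toy follow-up data are hypothetical, and follow-up beyond 5 years is assumed to be treated as censored at 5 years.

```python
from collections import Counter

def censor_at(times, events, horizon=5.0):
    """Censor follow-up at `horizon` years so that only earlier events count (assumed handling)."""
    new_times = [min(t, horizon) for t in times]
    new_events = [e if t <= horizon else 0 for t, e in zip(times, events)]
    return new_times, new_events

def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(time)) at each distinct death time.

    times  : years from surgery to death or last contact
    events : 1 = death from any cause, 0 = censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)          # each subject leaves the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / at_risk   # subjects censored at t still count as at risk at t
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve

# Toy data (hypothetical, not from the study cohort)
toy_times = [1, 2, 2, 3, 4, 6, 7]
toy_events = [1, 0, 1, 1, 0, 1, 0]
full_curve = kaplan_meier(toy_times, toy_events)
five_year_curve = kaplan_meier(*censor_at(toy_times, toy_events))
```

In the toy data the death at year 6 appears in `full_curve` but not in `five_year_curve`, which is the effect of restricting the analysis to the first 5 years.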
In this study, we adopted the approach of ‘validation by calibration’ proposed by Van Houwelingen.23 For each risk category, a Weibull proportional hazards model was fitted using the OS values predicted by the UISS model. These expected curves were plotted against the observed Kaplan-Meier curves, and possible differences were assessed by a ‘calibration model,’23 which evaluated how much the original prognostic score was valid on the new data by testing 3 different parameters (α, β, and γ). If the joint null hypothesis on α = 0, β = –1, and γ = 1 was rejected (ie, if discrepancies were found between observed and expected curves), estimates of the calibration model were used to recalibrate predicted probabilities. Note that recalibration does not affect the model's discrimination accuracy. Specific details of this approach are reported in the articles by Van Houwelingen23 and Miceli et al.24
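For orientation, one way to write the calibration model that is consistent with the null values quoted above is sketched below. This is an assumed formulation, not taken from this article: the symbols $H_0$ (baseline cumulative hazard of the Weibull fit), $PI$ (prognostic index of the original model), $\lambda$, $k$, and $e$ are introduced here for illustration, and the exact parameterization should be checked against Van Houwelingen23 and Miceli et al.24

```latex
\begin{aligned}
\text{original model:}\quad & S(t \mid PI) = \exp\{-H_0(t)\, e^{PI}\}, \qquad H_0(t) = \lambda t^{k} \\
\text{implied by it:}\quad & \log H_0(T) = -PI + e, \qquad e = \log E,\ E \sim \text{Exponential}(1) \\
\text{calibration model:}\quad & \log H_0(T) = \alpha + \beta\, PI + \gamma\, e \\
\text{null hypothesis:}\quad & \alpha = 0,\quad \beta = -1,\quad \gamma = 1 \\
\text{recalibrated survival:}\quad & S^{*}(t \mid PI) = \exp\!\left\{-\exp\!\left(\frac{\log H_0(t) - \alpha - \beta\, PI}{\gamma}\right)\right\}
\end{aligned}
```

Under this form, the coefficient of $PI$ on the log-hazard scale is $-\beta/\gamma$, so a value of 1 means the relative effects of the covariates carry over unchanged to the new data, whereas $\alpha$ and $\gamma$ describe changes in the baseline hazards function.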
Statistical analyses were performed using SAS (version 8.1; SAS Institute Inc, Cary, NC) and R 2.4.1 (R Foundation for Statistical Computing, Vienna, Austria) software packages.
RESULTS
In all, 2471 RCC patients who underwent surgical resection between 1984 and 2002 and had complete baseline and follow-up data were available for analysis (Fig. 1).
Baseline characteristics of patients are reported in Table 1. Fifty percent of patients were recruited after the end of 1996. The median age of all patients was 62 years (range, 10–91 years), and 64.6% of the patients were male. Approximately 57% of patients had pT1 disease, and a good ECOG PS (0) was reported in 75.0% of subjects. In the vast majority of cases (86.3%), a conventional clear cell carcinoma was found, followed by the papillary (9.4%) and chromophobe (3.4%) histologic subtypes. Radical nephrectomy was prevalent (90.0%), and only 8.6% of patients had systemic symptoms at the time of presentation. According to the UISS model, 1121 patients (45.4%) were classified as low risk, 1068 (43.2%) as intermediate risk, and 282 (11.4%) as high risk (Table 2). The relative frequency of low risk was higher than the 27% reported in the UISS cohort,10 whereas fewer patients were included in the high-risk group. The distribution of UISS categories differed among the 6 centers: the first center had a higher proportion of high-risk subjects (28.3%), whereas a higher prevalence of patients with good prognosis was found in the fifth and sixth centers (Table 2).
|Variable||Entire dataset (N=2471)|
|Date of surgery||1984-2002|
|Age, y, median (range)||62 (10-91)|
|ECOG performance status|
|Tumor size, cm|
At the time of analysis, 569 patients (23.0%) had died, with an overall 5-year Kaplan-Meier estimate of 0.791. During the first 5 years, 399 patients had died and 1010 were alive at the time of 5-year follow-up. The observed Kaplan-Meier OS estimates in the validation sample, stratified for the 3 UISS groups, are reported in the first column of Table 3, together with the expected values predicted by the UISS model in the second column. The observed 5-year OS estimates were 0.901, 0.743, and 0.523, respectively, in the low-risk, intermediate-risk, and high-risk groups, and were systematically higher than the corresponding expected values predicted by the UISS model.
|Year||UISS category of risk|
|(1) Observed||(2) Predicted||(3) Recalibrated||(1) Observed||(2) Predicted||(3) Recalibrated||(1) Observed||(2) Predicted||(3) Recalibrated|
To assess the calibration of the whole score, a UISS-Weibull proportional hazards model was fitted using the ‘expected’ OS values (points in Fig. 2) and the model-based curves were estimated (dashed lines in Fig. 2). The goodness-of-fit of the model was excellent (R² = 0.98).
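A rough sketch of how a Weibull proportional hazards curve can be fitted to yearly predicted survival values is given below. The article does not state the fitting procedure, so this assumes ordinary least squares on the linearized form log(−log S) = a_group + k·log(t), with a shape k shared by all risk groups; the R² is therefore computed on that transformed scale, which may differ from the authors' definition. The function names and the demonstration values are hypothetical and are not the UISS predictions.

```python
import math

def weibull_survival(t, k, a):
    """Weibull survival S(t) = exp(-exp(a) * t**k)."""
    return math.exp(-math.exp(a) * t ** k)

def fit_weibull_ph(predicted):
    """predicted: {group: [(year, survival_probability), ...]}.

    Least-squares fit of log(-log S) = a_group + k * log(year),
    with one intercept per group and a common shape k (proportional hazards).
    Returns (k, {group: a_group}, r_squared on the log(-log S) scale).
    """
    pts = {g: [(math.log(t), math.log(-math.log(s))) for t, s in rows]
           for g, rows in predicted.items()}
    means = {g: (sum(x for x, _ in p) / len(p), sum(y for _, y in p) / len(p))
             for g, p in pts.items()}
    # Common slope from within-group deviations (equivalent to OLS with group dummies)
    sxy = sum((x - means[g][0]) * (y - means[g][1]) for g, p in pts.items() for x, y in p)
    sxx = sum((x - means[g][0]) ** 2 for g, p in pts.items() for x, _ in p)
    k = sxy / sxx
    intercepts = {g: means[g][1] - k * means[g][0] for g in pts}
    all_y = [y for p in pts.values() for _, y in p]
    y_bar = sum(all_y) / len(all_y)
    ss_tot = sum((y - y_bar) ** 2 for y in all_y)
    ss_res = sum((y - intercepts[g] - k * x) ** 2 for g, p in pts.items() for x, y in p)
    return k, intercepts, 1.0 - ss_res / ss_tot

# Hypothetical yearly predictions generated from known parameters (NOT the UISS table)
demo = {g: [(t, weibull_survival(t, 1.2, a)) for t in range(1, 6)]
        for g, a in {"low": -3.0, "intermediate": -2.0, "high": -1.0}.items()}
k_hat, a_hat, r2 = fit_weibull_ph(demo)
```

Because the demonstration values are generated from an exact Weibull model, the fit recovers the generating shape and intercepts; real tabulated predictions would leave some residual, as the reported R² of 0.98 suggests.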
Comparison of the observed Kaplan-Meier survival curves in the UISS subgroups and the corresponding expected model-based curves is depicted in Figure 2. Overall, the UISS model underestimated OS, particularly in the extreme categories. Using the “calibration model,” the likelihood ratio test for the joint null hypothesis on baseline hazard and linear predictor was highly statistically significant (P < .0001), supporting the evidence that the UISS model was not well calibrated (ie, did not predict OS accurately). The 3 parameter estimates were 0.210, −1.038, and 0.965, respectively, for α, β, and γ. Individual tests on the baseline hazards function and the linear predictor showed that the calibration model captured a change in the baseline hazards function, but there was no evidence of a change in the relative effect of the prognostic covariates. The likelihood ratio test for the joint null hypothesis on the baseline hazards function was statistically significant (P = .001), whereas the validity of the linear predictor was not rejected (–β/γ = 1.076; P = .3088), meaning that patients in our validation sample fared systematically better than patients in the cohort in which the model was first developed.
The estimates of the calibration model were used to recalibrate the UISS-Weibull fit curves (Fig. 3); the agreement with the Kaplan-Meier survival curves was improved, but in the high-risk group the curves crossed, suggesting that a more complex recalibration may be needed. In the third column in Table 3, we report the new (recalibrated) predicted OS probabilities for each UISS risk category, as derived from the recalibrated UISS-Weibull fit curves. Recalibrated probabilities were higher than the original ones, with a maximum relative change of 16%, and reduced the underestimation of OS by the UISS score in the validation cohort. The heterogeneity of calibration among centers was visually investigated (data not shown). In 5 centers the UISS model clearly underestimated survival in low-risk patients; furthermore, survival in high-risk patients was underestimated in approximately half of the centers. A good agreement between observed and predicted curves was found only in the sixth center.
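To make the recalibration step concrete, the sketch below applies the reported estimates (α = 0.210, β = −1.038, γ = 0.965) to a model-based survival probability. It assumes the calibration model has the form log H0(T) = α + β·PI + γ·e, which is consistent with the null values α = 0, β = −1, γ = 1 but is not spelled out in this article. The function `recalibrate` is hypothetical, and the input survival and prognostic index values are placeholders, because the article does not report the prognostic index of each UISS category.

```python
import math

# Calibration-model estimates reported in the text
ALPHA, BETA, GAMMA = 0.210, -1.038, 0.965

def recalibrate(surv, pi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Recalibrated survival at a given time for a subject with prognostic index `pi`.

    surv : original model-based survival, assumed to equal exp(-H0(t) * exp(pi))
    Assumed calibration model: log H0(T) = alpha + beta * pi + gamma * e,
    with e the log of a unit exponential variable.
    """
    log_h0 = math.log(-math.log(surv)) - pi   # recover log H0(t) from the original prediction
    return math.exp(-math.exp((log_h0 - alpha - beta * pi) / gamma))

# Coefficient of the prognostic index on the log-hazard scale
slope = -BETA / GAMMA

# Placeholder inputs (hypothetical): original 5-year OS of 0.50, prognostic index 1.0
new_s = recalibrate(0.50, 1.0)

# With the null values the prediction is returned unchanged
unchanged = recalibrate(0.70, 0.5, alpha=0.0, beta=-1.0, gamma=1.0)
```

With these placeholder inputs the recalibrated probability is higher than the original, in line with the direction of the correction described above. Because `recalibrate` is strictly increasing in `surv` for fixed `pi` (γ > 0), the ordering of predictions within a risk group is preserved at every time point.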
DISCUSSION
The development of accurate prediction models for patients with RCC after definitive treatment is of crucial importance for patient counseling, follow-up, and treatment planning.
Ideally, every model should be fully externally validated before its clinical use, but this rarely happens. If accuracy is inadequately preserved, the system may not be transportable to other clinical environments and may not find its way into clinical practice. For a complete evaluation of predictive accuracy, both discrimination and calibration should be assessed. Their relative importance depends on the intended application. In patient counseling, the accuracy of the numeric probability (calibration) is possibly more important, whereas discrimination appears to be most influential in comparing prognostic scores or using stages of disease severity to plan clinical trials.17
Unfortunately, a suitable validation process is hard to implement, so that many risk prediction scores are produced, but very few of them are properly validated.25
The UISS prognostic system is widely used in urology, is the most widely tested tool,26 and is currently used as a stratification criterion for 2 ongoing adjuvant phase 3 trials in intermediate-risk or high-risk localized RCC.27 The discrimination accuracy of the UISS score has been externally validated in a large international cohort,15 and it ranked high in comparison with 3 other prognostic models for nonmetastatic RCC, performing only slightly less well than the individual Kattan nomogram.16
However, to our knowledge, the current study is the first focusing on calibration of the UISS prognostic system. We found that the UISS model did not predict OS well, specifically in the extreme categories. We found also that the inconsistency was mainly due to an underestimation bias, because patients in our validation sample fared systematically better than the patients in the cohort in which the model was first developed. There was no evidence, instead, that the relative effects of the covariates in the model were inadequately estimated. Thus, we attempted to recalibrate (ie, to improve) the predicted probabilities.
Although we substantially used the same multicenter cohort as previously,16 we could not compare calibration performances of the different prognostic systems because they were developed with different outcomes and methods.
The current study provides a good example of the different meanings of discrimination and calibration. UISS categories discriminated well among RCC patients with different risks, but the intermediate-risk category appeared to be particularly heterogeneous and open to improvement by factors such as tumor size and clinical presentation.16 Conversely, the intermediate-risk category, on average, provided the most accurate prediction of OS probabilities.
Calibration is routinely assessed by using calibration curves, in which predicted probabilities are plotted versus the observed outcomes. Nevertheless, the presence of an OS prediction ‘profile’ suggested the use of a more efficient, but more complicated, statistical method. Unlike the usual visual test, the ‘validation by calibration’ approach23 avoids multiple comparisons, uses all information available, and performs a single calibration test for the whole model. Finally, we improved the overall prediction accuracy by recalibrating the model to increase generalizability in European patients.
Distribution of the 3 UISS risk classes in our cohort was different from the original cohort of Zisman et al,10 in which, of 468 nonmetastatic patients, 128 (27%) were classified as low risk, 190 (41%) as intermediate risk, and 150 (32%) as high risk. In our larger cohort we found an overall different distribution (45.4%, 43.2%, and 11.4%, respectively), with remarkable differences noted among centers. High-risk patients ranged from 3.0% to 28.3% among the different centers. This heterogeneity may be explained by differences in the frequency of advanced disease (T3b,c, T4) (range, 18.9‒36.8%), of Fuhrman grade >1 (range, 33.8‒91.1%), and of ECOG >0 (range, 6.1‒58.2%). However, the UISS model behaved similarly across the centers, which makes it unlikely that the lack of calibration was due to the difference in risk distribution between our cohort and the original cohort of Zisman et al.10 Furthermore, the best agreement was found in the sixth center, in which the differences were most pronounced. Different survivals among centers could be explained by differences in referral pattern, accuracy of preoperative staging, and biases introduced by the subjective nature of tumor grading and assignment of ECOG PS.15
Rather than a weakness of the study, heterogeneity among centers, possibly due to differences in tumor staging, in variability of the ECOG PS assessment, and in surgical and oncologic approaches and pathology assessment may be considered as a strength of the study. To generalize a prognostic model, a high grade of heterogeneity, accounting for the entire spectrum of disease, is desirable, mirroring the clinical real world. A further strength of the study is the high number of subjects available for analysis, 2471 nonmetastatic patients, which allowed one to achieve risk classes of adequate size.
As previously highlighted,16 retrospective design is the main shortcoming of this study insofar as loss of information and the heterogeneous quality of data could follow from differences among centers in surveillance protocols and diagnostic assessment. Because OS was the study endpoint, a survival selection bias could operate in early recruited patients. It is unlikely, instead, that treatment in high-stage disease (which ranged from ‘wait and see’ to interferon or to interleukin-2 regimens) could affect the recorded survival data.
In the current study, we simply recalibrated, but did not change, the original UISS model. This approach did not modify discrimination accuracy, but improved prediction accuracy by changing the baseline survival of the model. From a clinical perspective, the new estimates may better guide patients and clinicians in making informed decisions about renal cancer, and at least in a European setting the use of our recalibrated UISS model could improve prediction accuracy.
In conclusion, the results of the current study confirm the general applicability of the UISS score for predicting survival in patients with RCC, because it demonstrates good discrimination accuracy16 and because it is based on an adequately developed risk function. Recalibrated UISS estimates may be a useful tool in clinical practice. Finally, the ‘validation by calibration’ approach could be of great value for the generalization of a prognostic model to other populations.
- 24. Miceli et al. Revising a prognostic index developed for classification purposes: an application to gastric cancer data. J Appl Stat. 2004; 31: 817–830.