Letter to the Editor
Prognostic Markers: Data Misinterpretation Often Leads to Overoptimistic Conclusions
Article first published online: 6 JAN 2012
Issue published online: 28 MAR 2012
© 2011 The American Society of Transplantation and the American Society of Transplant Surgeons
American Journal of Transplantation, Volume 12, Issue 4, pages 1060–1061, April 2012
How to cite: Foucher, Y., Combescure, C., Ashton-Chess, J. and Giral, M. (2012), Prognostic Markers: Data Misinterpretation Often Leads to Overoptimistic Conclusions. American Journal of Transplantation, 12: 1060–1061. doi: 10.1111/j.1600-6143.2011.03889.x
To the Editor:
Despite a wealth of literature reports on risk factors for patient and graft loss in transplantation, predicting transplant outcome remains elusive to physicians. This is frequently caused by the incorrect or loose use of statistical terminology, resulting in overoptimistic interpretation of results. To illustrate this issue, we consider three articles published in the American Journal of Transplantation: Le Treut et al. (1), Rana et al. (2) and Mannon et al. (3). All three proposed markers for the prediction of long-term survival after transplantation, but their statistical results were based only on survival differences, which are necessary but not sufficient to establish the clinical utility of a marker. Further analyses would have been required to support the authors' conclusions. In this letter, we provide a viewpoint on this complex issue, together with a short explanation that may help avoid such misinterpretations in the future.
In the type of prognostic study described in the three papers cited, the markers are measured at the time of inclusion and the patients are then followed until failure or the end of observation. Because failure is not observed for all patients by the end of the follow-up period, and because the length of follow-up can differ among patients, the statistical analyses must take this censoring into account. For this reason, investigators often use the Kaplan–Meier estimator to calculate survival curves, and the significance of the difference between curves is evaluated using the log-rank test. However, the resulting p value depends not only on the predictive capacity of the marker but also on the sample size: a large sample will enable even small differences between survival curves to be detected. A statistically significant difference only demonstrates that the correlation between the marker and the outcome is unlikely to be observed by chance (i.e. as a result of sample-to-sample fluctuation); it does not inform on prognostic accuracy. The corresponding hazard ratio is often reported to quantify the magnitude of the difference, but it too only indicates an increase in failure risk, not prognostic accuracy.
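As a concrete illustration (ours, not from the letter), the Kaplan–Meier estimator mentioned above can be sketched in a few lines of Python; the function name and the toy data are invented for the example.

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of the survival probability S(t).

    times  -- observed follow-up time for each patient
    events -- 1 if failure was observed at that time, 0 if censored
    t      -- time point at which survival is estimated
    """
    surv = 1.0
    # Product over distinct observed failure times t_i <= t of (1 - d_i / n_i),
    # where d_i is the number of failures at t_i and n_i the number still at risk.
    for ti in sorted({ti for ti, e in zip(times, events) if e and ti <= t}):
        at_risk = sum(1 for tj in times if tj >= ti)
        deaths = sum(1 for tj, e in zip(times, events) if e and tj == ti)
        surv *= 1 - deaths / at_risk
    return surv

# Six patients: failures at times 1, 2, 3 and 5; censoring at times 2 and 4.
times = [1, 2, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0, 1]
print(kaplan_meier(times, events, 3))  # (5/6) * (4/5) * (2/3) = 0.444...
```

Note how the patients censored at times 2 and 4 still contribute to the risk sets of earlier failure times, which is precisely how censoring is taken into account.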
In diagnostic medicine, the standard criteria for assessing the predictive capacity of a marker are sensitivity and specificity. When the prediction is made up to a specific time point after the biomarker measurement, the sensitivity represents the proportion of patients, among all those who experience failure before that time point, who are correctly classified as high risk. The specificity, on the other hand, represents the proportion of patients, among all those who have not failed before the time point, who are correctly classified as low risk. The main challenge in longitudinal analyses is that not all patients are followed up to the same time point. These proportions therefore cannot simply be estimated from the patients whose follow-up reaches the time point: doing so introduces considerable selection bias, leading to an overrepresentation of patients with failure. Heagerty et al. proposed a mathematical solution to this problem (4), based on an adaptation of the Kaplan–Meier estimator.
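To make the selection bias concrete, the following sketch (invented toy data, not from the letter) compares one naive complete-case estimate of the failure probability — computed only from patients whose status at the time point is known — with the Kaplan–Meier estimate that uses all patients:

```python
# Toy cohort: (observed time, event indicator); 1 = failure, 0 = censored.
data = [(1, 1), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0)]
t = 5  # prediction horizon

# Naive estimate: drop patients censored before t, then take the proportion
# of failures among those who failed before t or were followed past t.
kept = [(ti, e) for ti, e in data if (e == 1 and ti <= t) or ti > t]
naive_fail = sum(1 for ti, e in kept if e == 1 and ti <= t) / len(kept)

# Kaplan-Meier estimate: product over failure times <= t of (1 - d_i / n_i).
surv = 1.0
for ti in sorted({ti for ti, e in data if e == 1 and ti <= t}):
    at_risk = sum(1 for tj, _ in data if tj >= ti)
    d = sum(1 for tj, e in data if e == 1 and tj == ti)
    surv *= 1 - d / at_risk
km_fail = 1 - surv

print(naive_fail)  # 0.5   -- failures are overrepresented
print(km_fail)     # 0.375 -- censored patients contribute while at risk
```

Dropping the two patients censored before the horizon inflates the apparent failure probability, which is the overrepresentation of failures described above.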
In the study by Le Treut et al. (1), the authors endeavored to predict long-term survival after liver transplantation using a score based on primary tumor location and liver size. The high-risk group comprised 23 patients with 2 points (5-year survival rate of 12%) and the low-risk group comprised 55 patients with fewer than 2 points (5-year survival rate of 68%). The authors concluded that their analysis enabled the development of a useful patient selection tool based on this score. However, if one applies the method of Heagerty et al. (4), these data yield a sensitivity of 55% and a specificity of 93% at 5 years. This illustrates a high number of false negatives, i.e. a large number of patients who go on to fail but are incorrectly classified as being at low risk, whereas a clinically useful predictor should make few such errors to avoid clinicians taking erroneous decisions. In the present example, high-risk patients would potentially be deprived of necessary medical attention.
In the study by Rana et al. (2), the authors presented the Survival Outcomes Following Liver Transplant score, based on 18 factors, as a predictor of 3-month recipient survival after liver transplantation. Five groups were defined by the score. The low-risk group included 17 896 patients with a low or low-moderate score (3-month survival rate of 95%); the high-risk group included 3777 patients with a high-moderate, high or futile score (3-month survival rate of 82%). These data give a sensitivity of 45% and a specificity of 85% (values not available in the paper). Using this score, therefore, more than half of the patients who died within the first 3 months posttransplantation would have been misclassified as being part of the low-risk group. Nevertheless, the authors concluded that this tool can accurately predict 3-month survival after liver transplantation.
Finally, in the study by Mannon et al. (3), the authors reported that inflammation in areas of tubular atrophy in kidney allograft biopsies is a strong predictor of transplant survival. The high-risk group was made up of 232 patients with the presence of inflammation (iatr > 0, 2-year survival rate of 68%). The low-risk group was made up of 105 patients with the absence of inflammation (iatr = 0, 2-year survival rate of 89%). For a prognosis at 2 years, this corresponds to a sensitivity of 86% and a specificity of 37% (values not available in the paper). Here again, around 63% of patients without graft failure would be wrongly considered at risk.
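For a binary high/low-risk marker, sensitivities and specificities close to those quoted above can be recovered directly from the group sizes and the published within-group survival rates, by applying Bayes' rule to the definitions given earlier. The following sketch does this; the function name is ours, and the small discrepancies with the figures quoted above reflect rounding of the published survival rates (the letter's values were computed from the original data).

```python
def sens_spec(n_hi, surv_hi, n_lo, surv_lo):
    """Time-dependent sensitivity/specificity of a binary marker.

    n_hi, n_lo       -- number of patients classified high / low risk
    surv_hi, surv_lo -- within-group survival probability at the horizon
    """
    fail_hi, fail_lo = n_hi * (1 - surv_hi), n_lo * (1 - surv_lo)
    alive_hi, alive_lo = n_hi * surv_hi, n_lo * surv_lo
    sensitivity = fail_hi / (fail_hi + fail_lo)     # P(high risk | failure by t)
    specificity = alive_lo / (alive_lo + alive_hi)  # P(low risk | no failure by t)
    return sensitivity, specificity

# Le Treut et al. (1): 23 high-risk (12% 5-year survival), 55 low-risk (68%).
print(sens_spec(23, 0.12, 55, 0.68))       # ~ (0.53, 0.93)
# Rana et al. (2): 3777 high-risk (82% 3-month survival), 17896 low-risk (95%).
print(sens_spec(3777, 0.82, 17896, 0.95))  # ~ (0.43, 0.85)
# Mannon et al. (3): 232 with iatr > 0 (68% 2-year survival), 105 with iatr = 0 (89%).
print(sens_spec(232, 0.68, 105, 0.89))     # ~ (0.87, 0.37)
```

In a real analysis the within-group survival probabilities would themselves be Kaplan–Meier estimates at the horizon, as in the method of Heagerty et al. (4); here the published rounded rates stand in for them.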
In conclusion, we have briefly demonstrated that the p value (or the corresponding hazard ratio) does not indicate whether a given variable will be a good predictor. Markers in transplantation are often mistakenly described as “predictors”, whereas they are in fact only correlated (albeit strongly) with graft outcome. Although we focus here on three particular studies, this is a common issue in the transplant literature. We strongly recommend the use of appropriate methodology in future studies of predictors, so as to provide clinicians with a clearer understanding of their true clinical utility. We recommend the method of Heagerty et al. (4), although additional statistical literature is available on this subject (5–8).
The authors of this manuscript have no conflicts of interest to disclose as described by the American Journal of Transplantation.