Improved survival prediction and comparison of prognostic models for patients with hepatocellular carcinoma treated with sorafenib

Abstract Background The ‘Prediction Of Survival in Advanced Sorafenib‐treated HCC’ (PROSASH) model addressed the heterogeneous survival of patients with hepatocellular carcinoma (HCC) treated with sorafenib in clinical trials but requires validation in daily clinical practice. This study aimed to validate, compare and optimize this model for survival prediction. Methods Patients treated with sorafenib for HCC at five tertiary European centres were retrospectively staged according to the PROSASH model. In addition, the optimized PROSASH‐II model was developed using the data of four centres (training set) and tested in an independent dataset. These models for overall survival (OS) were then compared with existing prognostic models. Results The PROSASH model was validated in 445 patients, showing clear differences between the four risk groups (OS 16.9‐4.6 months). A total of 920 patients (n = 615 in training set, n = 305 in validation set) were available to develop PROSASH‐II. This optimized model incorporated fewer and less subjective parameters: the serum albumin, bilirubin and alpha‐foetoprotein, and macrovascular invasion, extrahepatic spread and largest tumour size on imaging. Both PROSASH and PROSASH‐II showed improved discrimination (C‐index 0.62 and 0.63, respectively) compared with existing prognostic scores (C‐index ≤0.59). Conclusions In HCC patients treated with sorafenib, individualized prediction of survival and risk group stratification using baseline prognostic and predictive parameters with the PROSASH model was validated. The refined PROSASH‐II model performed at least as good with fewer and more objective parameters. PROSASH‐II can be used as a tool for tailored treatment of HCC in daily practice and to define pre‐planned subgroups for future studies.


| INTRODUC TI ON
Hepatocellular carcinoma (HCC) is the most common primary liver cancer and the second leading cause of cancer-related death worldwide. 1 Most patients with HCC present with, or eventually progress to, advanced stage disease which bears a poor prognosis. Sorafenib, a multikinase inhibitor, was the first treatment to show a survival benefit in patients with advanced stage HCC. In two randomizedcontrolled trials, sorafenib improved the median overall survival (OS) by 2-3 months compared with placebo. 2,3 Since then, sorafenib has been the standard treatment for patients with advanced stage HCC who are ineligible for loco-regional treatment and have preserved (Child-Pugh A) liver function.
However, there is significant heterogeneity in outcomes in patient treated with sorafenib with an OS ranging from <3 months to 2-3 years. [2][3][4] This indicates that the survival benefit offered by sorafenib varies between individual patients. Select subgroups may have similar or more benefit from alternative options such as lenvatinib, 5 best supportive care or clinical trials.  6 Previous studies have identified markers of liver function (ie albumin, bilirubin), clinical parameters (ie performance status, body composition) and tumour characteristics (ie alpha-foetoprotein [AFP], macrovascular invasion, tumour extent) that may aid in prognostic stratification prior to sorafenib treatment. [7][8][9][10][11][12][13][14][15] Predictive factors, that is, those associated with improved survival benefit over placebo, included absence of extrahepatic spread, presence of hepatitis C virus and a low neutrophil-to-lymphocyte ratio (NLR). 16 Based on the combination of baseline factors, several scoring systems have been proposed for survival stratification of patients with advanced HCC treated with sorafenib. [17][18][19][20] Limitations of these models include the use of factors that either have a degree of subjectivity (ie infiltrative tumour growth, ascites) or are not commonly available (ie Des-gamma-carboxypro-

thrombin [DCP]). A recently proposed model, the 'Prediction Of
Survival in Advanced Sorafenib-treated HCC' (PROSASH), provided individualized survival prediction with excellent risk group discrimination based on nine parameters (age, macrovascular invasion, extrahepatic spread, performance status, disease aetiology, albumin, creatinine, aspartate transaminase (AST) and AFP). 21 The PROSASH model was built and validated on the data from patients treated with sorafenib in two and tested in an independent dataset. These models for overall survival (OS) were then compared with existing prognostic models.
Results: The PROSASH model was validated in 445 patients, showing clear differences between the four risk groups (OS 16.9-4.6 months). A total of 920 patients (n = 615 in training set, n = 305 in validation set) were available to develop PROSASH-II. This optimized model incorporated fewer and less subjective parameters: the serum albumin, bilirubin and alpha-foetoprotein, and macrovascular invasion, extrahepatic spread and largest tumour size on imaging. Both PROSASH and PROSASH-II showed improved discrimination (C-index 0.62 and 0.63, respectively) compared with existing prognostic scores (C-index ≤0.59).

Conclusions:
In HCC patients treated with sorafenib, individualized prediction of survival and risk group stratification using baseline prognostic and predictive parameters with the PROSASH model was validated. The refined PROSASH-II model performed at least as good with fewer and more objective parameters. PROSASH-II can be used as a tool for tailored treatment of HCC in daily practice and to define pre-planned subgroups for future studies. and existing prognostic models were compared to determine the utility for clinicians to predict the survival of these patients.

| Study population
Patients with HCC treated with sorafenib were recruited consecutively at five tertiary European centres with specialist multidisciplinary services for HCC management: Bordeaux (n = 306) and Rennes (n = 129), France; Freiburg (n = 183), Germany; Amsterdam (n = 156) and Rotterdam (n = 167), the Netherlands. The data were collected after obtaining the relevant authorization from the institutional review boards and this retrospective study was performed under ethically approved protocols (REC reference 12/LO/1088 and W17_420#17.488). Patients were diagnosed with HCC by histological or radiological criteria in accordance with international guidelines. 6,29 Exclusion criteria included patients receiving combination treatments (ie selective internal radiation therapy [SIRT] with sorafenib) or those with an Eastern Cooperative Oncology Group performance status (ECOG PS) >2. Patients received sorafenib with a target dose of 400 mg BID, with toxicity-adjusted dosing and patient management according to the local practice.

| Data collection and outcomes
Commonly available clinical, imaging and serum variables prior to sorafenib treatment were collected by members of the research team. Imaging parameters were obtained from the most recent radiological imaging prior to first dose of sorafenib. Radiological staging included a multiphasic contrast-enhanced computed tomography (CT) or dynamic magnetic resonance imaging (MRI). The main outcome measure, OS, was defined from the date of start of treatment to date of death or censored on the date of last follow-up.
Patients were staged according to the PROSASH model. 21 To assess whether improved prediction may be possible using data from daily practice, a new model was built and validated (PROSASH-II, detailed below). The utility of both models was compared with existing prognostic scores that could be assessed in the dataset, including the BCLC staging system, Child-Pugh classification, albumin-bilirubin (ALBI) grade, 30 Japan Integrated Staging (JIS) score, 31 hepatoma arterial-embolization prognostic (HAP) 32 and the Sorafenib Advanced HCC Prognostic (SAP) score. 18 With the exception of BCLC stage and Child-Pugh classification, which are commonly used in daily practice and were coded by the individual investigators, all prognostic scores were calculated using the raw data.

| Statistical methods
Continuous variables were described as means with standard deviation (SD) or medians with interquartile range in case of highly skewed distributions. Categorical variables were described as absolute and relative frequencies. The Kaplan-Meier method was used to generate and compare survival curves, and to estimate median OS with 95% confidence interval (95% CI). For all analyses, a two-tailed P < .05 was considered statistically significant. Statistical analysis was performed using SPSS Statistics for Windows Version 24.0 (IBM Corp) and STATA/SE 14.1 (StataCorp).

| Model building, testing and external validation
For the building of a prognostic model from patients treated in daily practice, the data of four centres were clustered into a training dataset and the largest independent dataset (Bordeaux) was used as an external validation set. Baseline variables that were considered clinically relevant and available in both datasets were included in the model building process (Table S1). Highly skewed variables were log-transformed. BCLC stage and Child-Pugh grade were excluded from the model building process owing to multicollinearity with factors used in these scoring systems. Multiple imputations (10x) using chained equations were performed to account for missing key parameters that were missing at random in the training dataset. 33,34 Model performance, derived coefficients and P values of imputed data were compared with complete case data.
In the training set, the association between OS and baseline variables was assessed in an exploratory univariable and subsequent multivariable flexible parametric survival analysis. [35][36][37] The advantages of a flexible parametric analysis over the more commonly used Cox proportional hazard analysis were previously described. 21,37 Risk factors were reported with hazard ratio (HR) and corresponding P values.
The multivariable model was built using a stepwise forward selection procedure of variables significant at the 5% level. The model was reported according to the TRIPOD guidelines 38 as well as tested, optimized and validated using the methods described by Royston and Altman. 39 Any time-dependent effects and potential proportional hazard violations by variables in the model were examined using the likelihood ratio (LR) test. 37 The LR test was also used to optimize the degrees of freedom (number of knots) for the restricted cubic spline function. 37 Lastly, Martingale residuals were plotted against continuous variables to check the functional form and non-linearity.
A linear predictor was derived using the coefficients of the model variables. Four risk groups were generated by applying the previously suggested cut-offs at the 16th, 50th and 84th centiles of the training set's linear predictor. 39 The model, including the linear predictor and the centile-based risk group stratification, was applied to the external validation set.
The calibration of survival prediction was visually assessed by comparing the similarity between the observed and predicted survival curves in both the training and validation set. The observed and predicted survival-percentage at 12 months were also compared.
Model discrimination was visually inspected by examining the separation survival curves of the four risk groups. In addition, survival rates between the risk groups were compared using HRs or log-rank test and the accompanying P values. Lastly, subgroup analyses of the new model were performed in patients with Child-Pugh A or Child-Pugh B because current guidelines recommend selecting patients with Child-Pugh A patients only. 6,29

| Model comparison
The PROSASH model incorporates the variable 'aspartate transaminase (AST)' which was not available in the Rennes (training) and Bordeaux (validation) datasets. Therefore, model comparisons were performed in three subgroups of patients: 1. The imputed training dataset, 2. The external validation set, with complete data for all prognostic models except for the PROSASH model and.

Patients with complete data for all prognostic scores.
For each prognostic model, the utility and discriminative performance was quantified using the Akaike Information Criterion (AIC) Harrell's Cindex and Royston-Sauerbrei's R 2 D 40,41 A lower AIC indicates a better goodness of fit, whereas a higher Harrell's C-index indicates a larger proportion of patient pairs has agreement between the survival prediction and observed survival outcome in terms of rank. A higher R 2 D reflects a better explained variation on the log relative hazard scale. Most prognostic models consist of a linear predictor or point-based system with a risk group categorization which can lead to loss of information (ie ALBI-score and ALBI grade 1, 2 and 3). To assess the difference, the performance of each model as a linear predictor or points and risk groups was assessed.
Because of lacking data, the number of Child-Pugh points could not be calculated, thus only the Child-Pugh classes (A, B and C) were assessed.

| Study population
In total, 941 patients who received sorafenib for advanced HCC between February 2003 and December 2016 were identified for this study. Of these, 21 patients (2%) were excluded because they received a combination of sorafenib with loco-regional treatment (n = 20) or due an ECOG PS >2 (n = 1). Subsequently, 920 patients were included in this study, of whom 615 (67%) patients were included in the training cohort and 305 patients (33%) in the external validation cohort. The baseline characteristics of both cohorts are summarized in Table 1.
Both cohorts had similar baseline features except that in the external validation cohort, more patients had ECOG PS 0 (65% vs 45%, P < .001) and alcohol-induced liver disease was more common (64% vs 35%, P < .001) compared with the training cohort, respectively. The median OS was 8.3 months (95% CI 7.6-9.2) in all patients. There was no statistically significant difference in survival between the training and validation cohort (HR 1.05, 95% 0.91-1.21, P = .128; Figure S1).

| Validation of the PROSASH model in routine clinical practice
The PROSASH model could be applied to 445/615 (73%) of patients from the training set who had a median OS of 8.0 months (95% CI 6.7-9.1). None of the patients from the external validation set were available owing to missing AST (Table S1)

| Prognostic factors and improved model: PROSASH-II
First, multiple imputation was performed on the training set to account for missing data (Table S1). An exploratory univariable analysis showed that albumin, Ln(bilirubin), ECOG PS, macrovascular invasion, extrahepatic spread, largest tumour size, number of liver lesions, Ln(AFP) and receiving prior HCC treatments were associated with OS (Table S2).

| PROSASH-II performance in training and validation set
There were clear and statistically significant survival differences between the PROSASH-II risk groups in the training set (Figure 2A Figure 4); however, overall, the predicted and observed survival were closely matched.

Pugh class
In a subgroup analysis of Child-Pugh A patients (n = 767), who had a median OS of 9.1 months, there were clear survival differences  (log-rank P = .002).

| PROSASH-II model performance and comparison
The performance of the different prognostic models was compared and summarized in Tables 4 and 5. Comparisons were performed in the training set with imputed missing data (n = 615), the validation set with complete data (n = 290) and a subgroup of 438 patients with complete data for all prognostic models. Across the various prognostic models, there was a slight loss in discriminative power when patients were categorized in risk groups or prognostic classes.
Moreover, there was a trend towards a higher C-index and R 2 D and lower AIC across all assessed prognostic models in the validation set compared with the training set. In all different subsets, the models with the lowest predictive performance in terms of AIC, C-index and R 2 D were the BCLC, Child-Pugh and JIS. The HAP and SAP score performed very similarly in the different subsets.
In the training set, the higher C-index (0.65, IQR 0.64-0.65) and R 2 D (0.12, 95% CI 0.08-0.17) of the PROSASH-II indicated improved discriminative performance and explained variation compared with the currently available models. Likewise, the PROSASH-II had a lower AIC (1684) which indicated a better goodness of fit.
In the validation set, the PROSASH-II model had a higher Cindex (0.68, 95% CI 0.65-0.72) and lower AIC (828) than commonly used scores such as BCLC and Child-Pugh. It also had the highest R 2 D (0.16, 95% CI 0.08-0.24) of all tested models, reflecting better explained variation. However, the model appeared to have a similar prognostic performance as the HAP and SAP scores, the latter showing a slightly higher C-index (0.69, 95% CI 0.66-0.72) and lower AIC (817) than the PROSASH-II model.   (Table 6).

| D ISCUSS I ON
Interestingly, most of these models were not specifically built for sorafenib-treated HCC patients and none of them performed optimal. 9,18,19,[45][46][47] Lack of consensus, easy applicability and external validation have hampered implementation of these prognostic scores in clinical practice.
We were able to compare eight different prognostic models: proposed PROSASH-II model ( Table 6). All tested models included parameters for liver function (ie albumin, bilirubin, AST), most of them included tumour-related parameters (ie AFP, tumour size, macrovascular invasion) and some included 'other' baseline parameters (age, HCC aetiology, ECOG PS). Only a few scores have incorporated predictive parameters that were associated with increased benefit of sorafenib over placebo (extrahepatic spread, NLR and hepatitis C virus infection). 16 This may reflect the modest impact of sorafenib on the natural history of advanced HCC. The well-known prognostic impact of the severity of the underlying liver disease was confirmed in this study, reflected by multivariable significance and incorporation of albumin in the PROSASH and PROSASH-II models. In accordance with prior studies, 9,48,49 we showed that despite using less parameters, ALBI has a better discrimination than the Child-Pugh classification.
Although initially developed to stratify HCC patients treated with TACE, the HAP score showed that a further improvement of predictive accuracy is possible by combining liver function (albumin, bilirubin) and tumour-related (AFP, tumour size) parameters. 18 The set of candidate predictors. 50,51 In this study, we aimed to report on the optimized statistical associations in daily clinical practice of sorafenib-treated patients guided by two main principles in prognostic model building. Firstly, the parameters should be commonly available in centres treating patients with HCC. Secondly, models should be widely validated and universally applicable. For this purpose, we used large international datasets that have inevitable differences in data availability. As suggested by Royston et al, this was handled by multiple imputation of randomly missing data (Table S1) and by balancing data availability (ie parameter selection) and analytic power (ie patient numbers). 50 Using this approach, we were able to build the PROSASH-II model which required fewer and only highly reproducible parameters while it performed better in terms of C-index, AIC and R 2 D than its predecessor. Disease aetiology and ECOG PS are less objective parameters which may lead to inter-and intra-user variability in daily practice, favouring PROSASH-II as a tool that can aid clinicians in providing patienttailored treatment. Moreover, PROSASH-II was built and tested on a daily clinical practice population in which it will be applied.
Currently, guidelines recommend to consider all patients with wellpreserved liver function (Child-Pugh-A) who are unsuitable for loco-regional therapy for sorafenib treatment.