Prediction of Survival Among Patients Receiving Transarterial Chemoembolization for Hepatocellular Carcinoma: A Response‐Based Approach

Background and Aims The heterogeneity of intermediate‐stage hepatocellular carcinoma (HCC) and the widespread use of transarterial chemoembolization (TACE) outside recommended guidelines have encouraged the development of scoring systems that predict patient survival. The aim of this study was to build and validate statistical models that offer individualized patient survival prediction using response to TACE as a variable. Approach and Results Clinically relevant baseline parameters were collected for 4,621 patients with HCC treated with TACE at 19 centers in 11 countries. In some of the centers, radiological responses (as assessed by modified Response Evaluation Criteria in Solid Tumors [mRECIST]) were also accrued. The data set was divided into a training set, an internal validation set, and two external validation sets. A pre‐TACE model (“Pre‐TACE‐Predict”) and a post‐TACE model (“Post‐TACE‐Predict”) that included response were built. The performance of the models in predicting overall survival (OS) was compared with existing ones. The median OS was 19.9 months. The factors influencing survival were tumor number and size, alpha‐fetoprotein, albumin, bilirubin, vascular invasion, cause, and response as assessed by mRECIST. The proposed models showed superior predictive accuracy compared with existing models (the hepatoma arterial embolization prognostic score and its various modifications) and allowed for patient stratification into four distinct risk categories whose median OS ranged from 7 months to more than 4 years. Conclusions A TACE‐specific and extensively validated model based on routinely available clinical features and response after first TACE permitted patient‐level prognostication.

Clinic Liver Cancer (BCLC) intermediate stage (B) or for those at the BCLC 0/A stage who are not candidates for percutaneous ablation, liver resection, or transplantation by virtue of the tumor location, portal hypertension, or comorbidity. (1,2) This recommendation was based on two randomized trials and subsequent studies. (3-7) However, the heterogeneity of this "intermediate" population has been extensively documented, and the unmet need of stratification according to baseline features has been emphasized. (8,9) Among those in the cohort who are classified as "ideal candidates" for TACE, an expected median survival in the order of 30 months is quoted, but even within this patient group, there is a wide variation in survival. (5,6,10) However, in practice, many patients receive TACE outside the guideline criteria. For example, vascular invasion (VI) is not always considered a contraindication to TACE (11); therefore, in this expanded population, variation in survival may be even greater. This wide variability in survival has led to attempts to define the prognostic features and combine these into scores (or "models") that can be applied to assess prognosis at a subgroup or individual patient level. One frequently quoted aim is to identify that subgroup of patients who respond poorly to TACE and may be considered for systemic therapies. (8,12) Among the first prognostic scores to be developed was the hepatoma arterial embolization prognostic (HAP) score, which is based on a simple points system involving tumor size, alpha-fetoprotein (AFP), bilirubin, and albumin. (13) The HAP score (which was enhanced by Kim et al. (14) by adding tumor number [referred to as the modified HAP-II {mHAP-II}]) has the advantage of easy applicability and simplicity but does not permit individual patient-level prognostication.
This limitation was overcome by Cappelli et al., who developed the modified HAP-III (mHAP-III) to include HAP variables, together with tumor number in their continuous (as opposed to dichotomized) form. (15) mHAP-III permits individual patient-level prognostication expressed as the likelihood of survival at a specific period of time after the first TACE.
A second, and more important, limitation of current scores is that they may be HCC-specific rather than TACE-specific.
In this study, it was confirmed that the HAP score is HCC-specific rather than TACE-specific, and we present TACE-specific models that permit accurate individualized patient survival prediction.

Patients and Methods
This analysis was reported according to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis guidelines. (16) As a prelude to the main study, the specificity of the HAP score for patients undergoing TACE was examined in 3,556 patients with early HCC who underwent resection and in 967 patients with advanced HCC who received sorafenib within clinical trials. (17,18) In the main study, the reported TACE cohort (19) was expanded by collecting further cases in which the response to TACE according to the modified Response Evaluation Criteria in Solid Tumors (mRECIST) (20,21) was recorded. This analysis involved only patients who were classified by the local investigator as undergoing TACE as their primary and first treatment. Patients whose TACE was used as a bridge to transplantation or other potentially curative treatment options were excluded, as were patients with extrahepatic metastasis. The study protocol conformed to the ethical guidelines of the 1975 Declaration of Helsinki as reflected in a priori approval by the appropriate institutional review committee.
All participating centers had specific expertise in the management of HCC and the practice of TACE. There were 19 centers representing 11 different countries, including a reported multicenter cohort (22,23) that comprised patients from London (United Kingdom), Osaka (Japan), Seoul (Korea), and Novara (Italy) (Tables 1 and 2). Most centers used "conventional" TACE, although several moved to drug-eluting bead (DEB)-based TACE after 2008. In all centers, patients were followed up by computed tomography (CT) or magnetic resonance imaging scans once every 3 months after stable disease (SD) had been attained.
The "other" cause comprised mainly patients with nonalcoholic fatty liver disease (NAFLD), other types of chronic liver disease, and more than one cause. The first TACE procedure was undertaken within 6 weeks of diagnosis, and laboratory data were recorded during that period.
VI (including portal vein, hepatic vein, and inferior vena cava involvement) was assessed in the portal phase of CT and supplemented where appropriate by arterial portography and classified as "present" or "absent." Response assessments according to mRECIST (20,21) were made within the 6 to 9 weeks following the first TACE treatment. mRECIST response was categorized as complete response (CR), partial response (PR), SD, and progressive disease (PD). mRECIST data were available in eight of the 17 cohorts (2,688 patients). This analysis did not take into account further TACE treatments undertaken after the first one. Liver function was assessed by the Child-Pugh grade (as graded by the local investigator) and the albumin-bilirubin (ALBI) score, the latter being graded according to the published cut-off points. (24) Grades 1, 2, and 3 refer to good, intermediate, and poor liver function, respectively. Data on treatment of hepatitis C with direct-acting antivirals (DAAs) were not collected, but an estimate of the number who might
After generation of the models, as described below, they were externally validated in independent data sets from China and Germany, representing "Eastern" and "Western" cohorts, respectively. External validation and calibration were undertaken using methods described by Royston and Altman. (25,26)
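The ALBI grading mentioned above can be illustrated with a short sketch. The formula and cut-off points used here are those commonly cited from the original ALBI publication; they are not stated in this text, so they should be read as an assumption and checked against reference 24. The function names and the units (bilirubin in µmol/L, albumin in g/L) are likewise illustrative choices.

```python
import math


def albi_score(bilirubin_umol_l: float, albumin_g_l: float) -> float:
    """ALBI linear score (assumed form: 0.66*log10(bilirubin) - 0.085*albumin).

    Units are assumed to be umol/L for bilirubin and g/L for albumin.
    """
    return 0.66 * math.log10(bilirubin_umol_l) - 0.085 * albumin_g_l


def albi_grade(score: float) -> int:
    """Map an ALBI score to grade 1 (good), 2 (intermediate), or 3 (poor).

    Cut-offs (-2.60 and -1.39) are assumed from the commonly cited publication.
    """
    if score <= -2.60:
        return 1
    if score <= -1.39:
        return 2
    return 3


if __name__ == "__main__":
    # Hypothetical patient values, for illustration only.
    score = albi_score(bilirubin_umol_l=30.0, albumin_g_l=30.0)
    print(round(score, 3), albi_grade(score))
```

Grading from the continuous score, rather than from Child-Pugh components, avoids the subjective elements (ascites, encephalopathy) that depend on the local investigator.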

Statistical Methods
Analysis was carried out using Stata/SE 14.1 (StataCorp, TX). Continuous variables were reported as the mean (with standard deviation) or median (with interquartile range), the latter for variables with skewed distributions. Categorical variables were presented as percentages. Logarithmic transformation (log10) was applied to skewed variables. Overall survival (OS) was calculated from date of treatment to date of death. Patients who were still alive were censored at date of last follow-up. Survival curves were plotted using the Kaplan-Meier (KM) method. For the Post-TACE-Predict model, which considers mRECIST response, OS was calculated from the date of response assessment rather than from the date of treatment. Patients with missing data were excluded.
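The KM estimate and the censoring rule described above can be illustrated with a minimal sketch. The study itself used Stata; the pure-Python functions below (`kaplan_meier`, `median_survival`) are hypothetical names introduced only to show how deaths and censored follow-up times enter the estimate, and the example data are invented.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times:  follow-up in months (from treatment, or from response assessment)
    events: 1 if the patient died at that time, 0 if censored at last follow-up
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Both deaths and censored patients leave the risk set after time t.
        n_at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the KM estimate drops to 0.5 or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    # Invented toy data: five patients, two censored.
    curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
    print(curve, median_survival(curve))
```

Censored patients contribute to the risk set up to their last follow-up but never trigger a drop in the curve, which is why the median may be undefined when too few deaths are observed.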
All patients, excluding those from the largest Eastern (Xi'an, n = 786) and Western (Freiburg, n = 407) cohorts, were randomly split into two equally sized groups (n = 1,714), one for deriving the model(s) and one for internal validation of the model (Supporting Fig. S1A). Patients were randomly split by generating a pseudorandom number from a uniform distribution (0, 1) for each patient, followed by shuffling patients by sorting these random numbers. Subsequently, the first half of the patients was labeled as the "training set," and the second half was labeled as the "internal validation set." External validation was then conducted using Xi'an and Freiburg data sets. Before construction of the models, the applicability of the original HAP and the subsequent mHAP-III models (13,15) was tested on all four subgroups.
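The random split described above (draw a uniform pseudorandom number per patient, sort by it, then cut the list in half) can be sketched as follows. This is an illustration rather than the authors' Stata code; the function name, the `seed` parameter, and the use of integer patient identifiers are assumptions.

```python
import random


def split_training_validation(patient_ids, seed=0):
    """Shuffle patients by sorting on uniform(0, 1) draws, then halve the list.

    Returns (training_set, internal_validation_set).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    keyed = [(rng.random(), pid) for pid in patient_ids]
    keyed.sort(key=lambda pair: pair[0])  # sorting the random keys shuffles the patients
    shuffled = [pid for _, pid in keyed]
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # 3,428 patients give two groups of 1,714, matching the group size reported above.
    training, validation = split_training_validation(range(3428))
    print(len(training), len(validation))
```

Fixing the seed makes the partition reproducible, which matters because the cutoffs for the risk categories are later derived from the training half only.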
The clustering structure of the data set (i.e., the correlation between observations within a center) was taken into account in the statistical analysis. Robust estimates of the standard errors and variance-covariance matrix were obtained by considering the
The linear predictor was derived using the coefficients of each model. To generate four risk categories, reported cutoffs were applied to the linear predictor of the training set at its sixteenth, fiftieth, and eighty-fourth centiles. (25) The same cutoffs were used for subsequent groupings in the other cohorts. KM survival curves according to the risk categories were plotted for each of the training and validation sets. Median OS (with 95% CIs), HR, and P values comparing the HR of the reference group (least risk category) to the others were also reported. Prognostic performance of the models (using the nonstratified linear predictor) was measured by Harrell's C, Gönen and Heller's K, and Royston-Sauerbrei's R²D. (25,27,28) Models were calibrated by comparing model-predicted versus observed survival curves. Model-predicted mean survival curves were generated by applying fractional polynomial regression to approximate the log baseline cumulative hazard function as a smooth function of time. (25) Model-predicted versus KM estimates were then plotted according to each risk category in the derivation and validation sets.
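Two of the steps above lend themselves to a short sketch: deriving the three cutoffs from the training-set linear predictor, and measuring discrimination with Harrell's C. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the centile interpolation rule of Python's `statistics.quantiles` may differ slightly from Stata's, and the concordance function is the simple pairwise definition without adjustments.

```python
from statistics import quantiles


def centile_cutoffs(training_lp):
    """16th, 50th, and 84th centiles of the training-set linear predictor."""
    pct = quantiles(training_lp, n=100)  # 99 cut points: pct[k - 1] is the kth centile
    return pct[15], pct[49], pct[83]


def risk_category(lp, cutoffs):
    """Category 1 (lowest risk) to 4 (highest); values equal to a cutoff stay in the lower group."""
    return 1 + sum(lp > c for c in cutoffs)


def harrell_c(times, events, lp):
    """Harrell's C: share of usable pairs in which the higher-risk patient dies first.

    A pair is usable when the patient with the shorter follow-up had an observed death.
    Tied linear predictors count as half-concordant.
    """
    concordant = 0.0
    usable = 0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                usable += 1
                if lp[i] > lp[j]:
                    concordant += 1.0
                elif lp[i] == lp[j]:
                    concordant += 0.5
    return concordant / usable


if __name__ == "__main__":
    # Invented toy values, for illustration only.
    cuts = centile_cutoffs([x / 10 for x in range(1, 101)])
    print(cuts, risk_category(2.0, cuts))
    print(harrell_c([2, 4, 6, 8], [1, 1, 1, 0], [4.0, 3.0, 2.0, 1.0]))
```

Because the cutoffs are computed once on the training set and then reused, the validation cohorts can have very uneven category sizes, which is consistent with the small lowest-risk groups reported for the external sets.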

Results
Within the substudy, the HAP score could clearly identify four distinct prognostic subgroups, both in patients undergoing resection and in those receiving sorafenib for advanced HCC (Supporting Fig. S2A,B). The median OS according to each HAP score and the HR and P values are shown in Supporting Table S1.
The baseline demographics of the patients from each center are shown in Tables 1 and 2. The percentage of patients who had undergone TACE treatments before January 1, 2012, and January 1, 2013, was 68% and 75.5%, respectively. The percentage of patients with missing data in at least one of the model variables was 14% (training set). For each variable individually, the percentage of missing data was ≤5%.
mRECIST assessments were undertaken within 9 weeks after first TACE for the majority of patients (94.6%) with a mean (standard deviation) of 5.5 weeks (6.8).
The overall median survival for the entire group of patients who underwent TACE was 19.9 months.

Application of the HAP and mHAP-III Scores
The HAP score and the mHAP-III score were applied to the present data set. The latter score does not categorize patients into risk categories but provides individual-level prognostication, and this will be compared with HAP later (see the Model Comparisons section). The HAP score stratified the patients into four risk categories in all four subgroups (Supporting Fig. S3A-D). The median OS according to each HAP score as well as the HR and P values are shown in Supporting Table S1.

Univariable Cox Regressions
The results from the univariable Cox regression analysis based on the training set are shown in Supporting Table S2. Sex, cause, tumor number, tumor size, VI, AFP, and bilirubin were found to be statistically significant prognostic variables. When survival was assessed from date of response assessment (instead of date of treatment), mRECIST response (following first TACE), cause, tumor number, tumor size, VI, AFP, and bilirubin significantly influenced prognosis.

Pre-TACE-Predict
The model confirmed the prognostic influence of the variables in the mHAP-III model, namely tumor number, tumor size, AFP, albumin, and bilirubin, in addition to VI and cause (Table 3). It produced four distinct risk categories in each of the four subgroups (Fig. 1A-D). There was no statistically significant difference between the two lowest risk categories in the external validation sets, probably attributable to the low patient numbers in risk category 1 (n = 40-44) (Table 4). Median OS ranged from 35 to 47 months in risk category 1 to 8 to 9 months in risk category 4 (Table 4). The formula used to generate the curves in Fig. 1 was as follows:
where HCV is the reference group for cause.
To calculate the probability of survival at t months for a given patient, the following equation was used: where S0(t) is 0.89, 0.74, 0.48, and 0.32 for probability at 6, 12, 24, and 36 months, respectively.
10.13) in those with PD (Fig. 2), although these figures should be treated with caution because the different response cohorts had different baseline features that would also influence survival. Nonetheless, in the Post-TACE-Predict model, response was clearly an independent prognostic factor (Table 3), in addition to tumor number, tumor size, AFP, bilirubin, and VI.

Post-TACE-Predict Model
Four distinct risk categories were observed in each of the four subgroups (Fig. 3A-D); however, there was some overlap between the two lowest risk categories in the Western external validation set, in which the patient numbers were again very low, with only 9 patients in risk category 1. The median OS of the risk categories ranged from 25 to 56 months in risk category 1 to 7 to 10 months in risk category 4 (Table 4). The formula to generate the curves in Fig. 3 was as follows: where CR is the reference group for mRECIST. To generate the four risk categories, the following cutoffs were applied (as determined by the sixteenth, fiftieth, and eighty-fourth centiles): ≤1.82 (risk category 1), >1.82 to ≤2.49 (risk category 2), >2.49 to ≤3.37 (risk category 3), and >3.37 (risk category 4).
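The cutoffs quoted above translate directly into a lookup from linear predictor to risk category. The sketch below assumes the Post-TACE-Predict linear predictor has already been computed (its formula is not reproduced in this text); the function and constant names are hypothetical.

```python
from bisect import bisect_left

# Cutoffs on the Post-TACE-Predict linear predictor, as quoted in the text.
POST_TACE_CUTOFFS = (1.82, 2.49, 3.37)


def post_tace_risk_category(linear_predictor: float) -> int:
    """Return risk category 1-4; a value equal to a cutoff falls in the lower category."""
    return bisect_left(POST_TACE_CUTOFFS, linear_predictor) + 1


if __name__ == "__main__":
    for lp in (1.50, 1.82, 2.00, 2.49, 3.00, 3.37, 4.10):
        print(lp, post_tace_risk_category(lp))
```

Using `bisect_left` rather than `bisect_right` reproduces the "≤" boundaries: a linear predictor of exactly 1.82 is placed in category 1, not category 2.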
To calculate the probability of survival at t months for a given patient, the following equation was used: where S0(t) is 0.92, 0.79, 0.52, and 0.36 for probability at 6, 12, 24, and 36 months, respectively.
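The survival equation itself is not reproduced in this text, so the following is only a sketch of the general Cox form, S(t) = S0(t)^exp(LP − c), using the baseline values S0(t) quoted for the two models. The reference value c at which S0(t) applies is not given here; it appears as the required, hypothetical parameter `reference_lp` and must be taken from the original equations or the online calculator. The function and dictionary names are illustrative, and this sketch should not be used for clinical prediction.

```python
import math

# Baseline survival S0(t) at 6, 12, 24, and 36 months, as quoted in the text.
BASELINE_SURVIVAL = {
    "pre": {6: 0.89, 12: 0.74, 24: 0.48, 36: 0.32},
    "post": {6: 0.92, 12: 0.79, 24: 0.52, 36: 0.36},
}


def survival_probability(model, months, linear_predictor, reference_lp):
    """Generic Cox-model survival: S0(t) ** exp(LP - reference_lp).

    model:        "pre" or "post"
    months:       6, 12, 24, or 36
    reference_lp: ASSUMPTION, not given in this text; the linear predictor
                  value at which the baseline survival S0(t) applies
    """
    s0 = BASELINE_SURVIVAL[model][months]
    return s0 ** math.exp(linear_predictor - reference_lp)


if __name__ == "__main__":
    # Purely illustrative numbers; reference_lp=2.0 is a placeholder, not a published value.
    for lp in (1.5, 2.0, 2.5):
        print(lp, round(survival_probability("post", 12, lp, reference_lp=2.0), 3))
```

Whatever the reference value, the form guarantees that a higher linear predictor gives a lower predicted survival at every time point, and that a patient whose linear predictor equals the reference value is assigned the baseline survival exactly.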
For routine clinical application, a simple online calculator (based on Equations 1-4) that takes the variables from the model(s) and returns the scores, the risk category, and survival likelihood at six monthly

Model Calibration
Plots of KM estimates versus pre-TACE-predicted and post-TACE-predicted survival curves were, overall, very similar (Supporting Figs. S4 and S5A-D), although it should be noted that there was an overlap in the CIs for the KM estimates in the lowest two risk categories of the external validation sets. This was reflected by the non-statistically significant HRs, as stated above; low patient numbers may have contributed to this. Table 5 summarizes the comparisons between the different models by Harrell's C, Gönen and Heller's K, and Royston-Sauerbrei's R²D. It confirms that mHAP-III performs better than the HAP score. It also shows a trend of increasingly better survival prediction performance from mHAP-III to the pre-TACE and then post-TACE models.

Discussion
These models, based on TACE response, stratify survival better than the currently available HAP and mHAP-III models. The median OS was 19.9 months, almost identical to the figure of 19.4 months reported by Lencioni in a large systematic review of published trials involving TACE between 1980 and 2013. (29) This suggests that this cohort is representative of the current international practice of TACE for HCC. Furthermore, the clear demonstration that the degree of response has a major and independent impact on survival strongly supports the contention that TACE is indeed altering the natural history. (29) The heterogeneity of intermediate-stage HCC and the widespread use of TACE outside recommended guidelines have encouraged the development of scores that can predict survival after TACE using baseline clinical features. (10,12,14,30-32) The first of these, the HAP score, has been internationally validated and enhanced by the addition of a fifth variable, namely tumor number. (13,23,33) Recognizing the limitations of points-based scores, Cappelli et al. built a model (known as mHAP-III) based on the mHAP-II score but using the same variables in their continuous form, which permitted individual patient prognostication. (15) Sposito et al. subsequently validated the mHAP-III model in an independent data set of 298 patients and confirmed its superiority to both HAP and mHAP-II. (34) The reported STATE and START scores (8) also appear to be valuable in identifying patients as poor or good candidates for TACE but require variables such as C-reactive protein, which were not routinely measured in the centers involved in the present study. Similarly, the ABCR score (35) that combines four variables (AFP, BCLC stage, change in Child-Pugh score, and tumor response) aims to identify those with poor prognosis who may not achieve benefit from further TACE.
Again, the variables were not available to make a direct comparison (particularly the actual CP scores), but in the follow-up prospective study, an attempt will be made to collect the requisite variables to permit comparison of STATE, START, and ABCR with the current models. It will also be possible to investigate other and potentially valuable additional variables, such as performance status and presence or absence of cirrhosis. Nonetheless, the additional significant variables, the individual patient prognostication, and the extensive international validation are likely to represent a real improvement on existing scores. The online calculator (TACE-Predict) provides a simple utility for individual patient-level prognostication. It also permits easy graphical assessment of the importance of the various prognostic variables on ultimate survival. The model involves readily available, routinely recorded clinical variables. The clear correlation of survival with degree of response (as assessed by mRECIST) is consistent with past findings. (36) Using these calculators, clinicians will be able to predict the probability of survival at the individual patient level, thereby furthering the ultimate aim of matching "personalized prognosis" to "personalized therapy." For example, either before proposed first TACE or at the time of first response assessment, the clinician will be able to consider if the predicted survival is appropriate in the light of the potential side effects and toxicities of TACE. This may be particularly clinically valuable in the situation where the predicted outcome is poor, and consideration might be given to systemic therapy. Moreover, all the models were validated on large cohorts of patients to demonstrate the applicability of this approach to both the Eastern and Western practice.
It is acknowledged that the TACE procedure is unlikely to be entirely consistent across centers. However, this limitation applies equally to all TACE studies, including those on which current guidelines are based. Similarly, there must be interobserver variation in mRECIST classification. Although such variation may be overcome in the clinical trial setting by centralized review of relevant scans, this cannot be a solution in clinical practice. Hence, we made the pragmatic decision that mRECIST classification, as assessed by the local investigator, would be used in the present study.
Nonetheless, there is considerable heterogeneity in achievement, for example, of CR. The most likely explanation is that those centers with the highest CR (Italy and Egypt) had smaller tumors, more early-stage disease, less VI, and more solitary nodules. The very clear separation of survival according to mRECIST (Fig. 2) suggests that a valid parameter is indeed being measured. It is recognized that calculating OS from mRECIST assessment introduces a degree of variability into the post-TACE model because of the differing times of imaging between patients. This source of variability is, however, intrinsic to the time at which mRECIST is assessed, which is patient-specific, and would affect any model that includes mRECIST, regardless of whether OS is calculated in the model from date of mRECIST response or date of treatment.
The inherent limitations of a retrospective study are also acknowledged. First, there are several other baseline features that are likely to impact OS and could be included in the analysis, specifically, the extent of VI (11) (as opposed to a simple binary classification of present or absent), the structure of the tumor (pseudocapsule versus infiltrative), or liver function kinetics. However, such parameters are not routinely collected, and their inclusion in the study would have limited the applicability of the models. Second, only the first TACE in this study was considered. Assessment of the response after the second TACE or using the "best response" are also options, but both would limit the applicability of the model. Furthermore, patients were excluded who had received TACE as a "bridge to transplantation." An alternative approach would have been to recruit such patients and censor at the time of transplantation, but, given the usually short period of time between TACE and transplantation, this alternative approach would only have minimal impact on the models. In the prospective study, the investigation of the impact of all the above limitations will be feasible.
As in many areas of hepatology, the recent availability of curative therapies for HCV will have a broad impact on predictive and therapeutic studies. At present, it is not known whether patients who have developed HCC after a DAA-induced sustained virological response should be classified as HCV-positive in the models, but the number of such cases is likely to be relatively small. The great majority of patients in the present study were recruited before DAAs became widely available. The question of how to assign cause as a variable remains challenging, even in a prospective study. Although cause was shown to be an important prognostic factor, with patients who were HCV-positive surviving longer, several of the cases had multiple causes; however, even with a large data set of more than 4,000 cases, the numbers in individual subgroups, such as those with HCV and alcohol excess or both HBV and HCV, remain too small for meaningful statistical analysis. NAFLD is an increasingly important causal factor in HCC development; however, there are no internationally agreed-on criteria for diagnosis of NAFLD in the setting of HCC. Furthermore, it is acknowledged that the diagnosis of NAFLD is difficult in the setting of cirrhosis (which is the case in most HCCs) because the characteristic features of NAFLD have often "burned out" and are unrecognizable by the time consequential cirrhosis has developed. For all these reasons, it is concluded that the fairest statement of cause is, as used here, simply HBV or HCV or "other."
Many programs offer DEB-TACE as opposed to conventional TACE. This has the advantage of offering a better pharmacokinetic profile by means of sustained and controlled drug release. (37) Published meta-analyses, however, suggest that there is little difference in terms of impact on outcome, (38-42) albeit with a decreased need for repeat sessions. (43) The type of TACE was therefore not included in the analysis.
International guidance and expert reviews quote overall post-TACE survival of more than 30 months. (1) If the analysis of the data set is confined to those that strictly align with TACE guidelines, survival is indeed in the order of 30 months, and in the model, just using baseline features, some subgroups surviving more than 40 months are identified. The overall median survival of 19.9 months is also similar to that reported in a recent review, (29) suggesting that TACE is often prescribed for patients beyond BCLC B. The model and online calculator can help rationalize the use of TACE and avoid interventions with an expected poor prognosis and the associated risks.
In summary, an extensively validated and TACE-specific model based on routinely available clinical features and response after first TACE is presented. The model and its associated online calculator permit patient-level prognostication and may help clinicians rationalize the use of TACE by avoiding intervention in patients with a predicted poor prognosis.