Predicting survival after transarterial chemoembolization for hepatocellular carcinoma using a neural network: A Pilot Study

Deciding when to repeat and when to stop transarterial chemoembolization (TACE) in patients with hepatocellular carcinoma (HCC) can be difficult even for experienced investigators. Our aim was to develop a survival prediction model for such patients undergoing TACE using novel machine learning algorithms and to compare it to conventional prediction scores, ART, ABCR and SNACOR.


| INTRODUCTION
Hepatocellular carcinoma (HCC) is one of the most common cancers worldwide and the fourth leading cause of cancer death. 1,2 According to the Barcelona Clinic Liver Cancer (BCLC) classification, transarterial chemoembolization (TACE) is the recommended treatment for patients with intermediate-stage HCC (BCLC-B). 3 However, deciding when to repeat and when to cease TACE treatment and possibly change to systemic treatment, or even to best supportive care, can be difficult even for experienced investigators. Several conventional scoring systems have been developed to provide decision support regarding retreatment with TACE in patients with HCC, including the Assessment for Retreatment with TACE (ART) score, 4 the ABCR score 5 and the SNACOR score. 6 The ART score comprises the following parameters: an increase of aspartate aminotransferase, an increase of Child-Pugh score from baseline and radiologic tumour response. The ABCR score consists of the following parameters: alpha-fetoprotein (AFP) and BCLC at baseline, increase in Child-Pugh score from baseline and radiological tumour Response. The SNACOR score comprises the parameters tumour Size, tumour Number, baseline AFP, Child-Pugh class and Objective radiological Response. However, none of these scoring systems is widely used in clinical practice. Several attempts at external validation by us and other working groups have failed, showing only poor to moderate predictive ability. [7][8][9][10]

In recent years, machine learning techniques, in particular artificial neural networks (ANNs), have been increasingly used for prediction purposes. ANNs are complex and flexible nonlinear computing systems. They were devised in an attempt to build artificial systems based on the characteristics of neurones of the brain, both structurally and functionally. Such networks are trained in a supervised manner by exposure to paired input-output data; once trained, they are able to make predictions based on new input data. 11 ANNs have shown promising results compared to conventional statistical approaches. In the field of hepatology, ANNs were superior to the model for end-stage liver disease (MELD) in predicting mortality of patients with end-stage liver disease, and superior to a conventional linear model in predicting HCC tumour grade and microvascular invasion. 12,13 Recently, Meek et al suggested stronger implementation of such techniques in interventional oncology. 14 To the best of our knowledge, no attempt has yet been made to develop a survival prediction model for patients with HCC undergoing TACE using neural networks. Therefore, the purpose of this study was to implement such a novel approach for treatment stratification and to compare it to conventional prediction scores.

| PATIENTS AND METHODS
We used the TRIPOD guidelines when writing our manuscript. 17 The study was approved by the responsible ethics committee (permit number 2018-13619). Patient records and clinical information were de-identified before analysis.

| Patients
We performed a database search and identified a total of 860 patients who underwent TACE for HCC at our tertiary referral centre between January 2005 and December 2017. To ensure at least 1 year of follow-up, the final evaluation date was December 31, 2018. The study included only TACE-naïve patients with HCC confined to the liver who then underwent at least two TACE treatments. The study excluded patients who underwent liver transplantation within the follow-up period and patients who developed a portal venous tumour thrombus before the second TACE treatment. After applying these inclusion and exclusion criteria, 282 patients were included in the final analysis (Figure 1).

| Diagnosis, treatment and follow-up
Hepatocellular carcinoma was diagnosed by histological or radiological evaluation according to the guidelines of the American Association for the Study of Liver Diseases (AASLD) or the European Association for the Study of the Liver (EASL). 18,19 Treatment was performed in a standardized manner described in detail elsewhere. 20,21 All patients underwent CT or MRI prior to their first and second TACE treatment.
These examinations were the basis for the radiological assessment of tumour response, which was evaluated by applying the Modified Response Evaluation Criteria in Solid Tumours. 22 Objective tumour response was defined as a partial response or complete response.
Stable disease and progressive disease were assessed as a lack of radiological response. Overall survival (OS) was the primary endpoint of this study. To ensure comparability with the conventional scoring systems, the interval was defined as the day prior to the second TACE treatment until death or last follow-up.

Key points
• Predicting survival in patients with primary liver cancer is essential for deciding further treatment.
• Conventional scoring systems have fallen short of expectations; thus, we used artificial intelligence to improve prediction.
• The artificial neural network we developed achieved good prediction and outperformed the three most widely known conventional scoring systems.
• The difference reached significance for the ART score (P < .001); for ABCR and SNACOR, significance was not reached (P = .143 and P = .201).

| Data acquisition
Data were acquired from the laboratory database and clinical registry software specially developed for the characterization of patients with HCC. 23 Baseline characteristics including demographic data, aetiology of liver disease, liver function parameters, TACE-related parameters and relevant comorbidities were documented. Tumour load was represented by the total number of lesions, the lesions' sizes and the tumour growth pattern.

| Design of the neural network
An ANN is trained by exposure to paired input-output data. It learns by modifying the weights of the connections between its nodes according to feedback. The final network applies these previously determined weights to new input data and thus makes predictions. 13,24,25 Our ANN was built using Python 3.7.3 with scikit-learn 0.19.2 (https://scikit-learn.org/stable/). It consisted of 46 input nodes.
During fine-tuning, we found that a network architecture with three fully connected hidden layers with sizes 20, 12 and 4 performed best. We used ReLU as activation function on all hidden layers and softmax classification for the final fully connected layer. We used stringent L2-regularization to prevent overfitting. The ANN was used to predict the OS after 1 year, starting from the day prior to the second TACE treatment. Therefore, the final two output nodes represented survival (=1) and death (=2).
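The layer layout described above can be illustrated with a forward pass. The following is a minimal sketch in plain Python and not the scikit-learn implementation used in the study: the weights are random placeholders, training and the L2 penalty are omitted, and the names `LAYER_SIZES`, `init_layer` and `forward` are hypothetical.

```python
import math
import random

# Layer widths taken from the text: 46 inputs, hidden layers of 20, 12 and 4
# nodes, and 2 output nodes (survival vs death at 1 year).
LAYER_SIZES = [46, 20, 12, 4, 2]


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) filled with small random placeholder values."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def relu(values):
    """Rectified linear unit, applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw outputs to probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Propagate one standardized input vector through all layers."""
    activation = x
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        # ReLU on hidden layers, softmax on the final layer.
        activation = softmax(z) if is_output else relu(z)
    return activation


rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]
example_input = [rng.gauss(0.0, 1.0) for _ in range(LAYER_SIZES[0])]
probabilities = forward(example_input, layers)  # [P(survival), P(death)]
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
```

With these layer widths, a fully connected network with one bias per node has 1254 trainable parameters, which helps explain why regularization matters for a cohort of a few hundred patients.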
Each of the 46 parameters formed one input node. The selected input data comprised the general demographic parameters age and gender, type of TACE (conventional TACE [cTACE], drug eluting bead TACE [DEB-TACE]), type of imaging before the second TACE (CT, MRI) and all parameters used by the abovementioned risk scoring systems ART, ABCR and SNACOR. These scoring systems comprise the parameters BCLC stage, alpha-fetoprotein level, tumour size and number, Child-Pugh score, radiological tumour response and aspartate aminotransferase level. We also included other potentially clinically meaningful parameters regarding aetiology (alcohol abuse, hepatitis B and C, non-alcoholic steatohepatitis 26 ), comorbidities (nicotine abuse, 27 obesity, 28 diabetes 26 ) and tumour growth pattern. 29 In addition, we included the following potentially meaningful parameters indicating liver function: MELD, 30 bilirubin, 30,31 albumin 31 and the international normalized ratio. 30 TACE-related parameters (alanine aminotransferase 32 ), other laboratory values (thrombocyte count, 33 sodium level 34 ) and sarcopenia 28 were also evaluated. Sarcopenia was measured by means of the skeletal muscle index, which was calculated at the level of L3 as described elsewhere. 28 Measurements were performed using a dedicated Picture Archiving and Communication System (Sectra®, Linköping, Sweden). All parameters were captured before the first and second TACE treatment.
All continuous input parameters were standardized by subtracting the mean and dividing by the standard deviation. The design and architecture of our ANN, including an input layer, hidden layers and an output layer, is provided in Figure S1. Patient characteristics are shown in Table 1.
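The standardization step can be sketched as follows. This is an illustration under stated assumptions, not the study code: the bilirubin values are invented, the function names are hypothetical, and the use of the population standard deviation and of training-set statistics for new patients are assumptions (the text does not specify either).

```python
import statistics


def fit_standardizer(values):
    """Return the mean and standard deviation of one training column."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population SD; an assumption
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std; a constant column maps to zeros."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical bilirubin values (mg/dL), for illustration only.
train_bilirubin = [0.8, 1.2, 2.0, 1.0, 3.0]
mean, std = fit_standardizer(train_bilirubin)
train_z = standardize(train_bilirubin, mean, std)

# A new patient is scaled with the training statistics, not with their own.
new_z = standardize([1.6], mean, std)
```

After this transformation, each training column has mean 0 and standard deviation 1, so parameters measured on very different scales contribute comparably to the weighted sums in the first hidden layer.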

| Training and validation of the ANN
The training and holdout validation cohorts were similar regarding baseline characteristics except for age (P = .020), albumin (P = .019), type of TACE (P < .001) and type of imaging prior to second TACE (P < .001).
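The split described in the Figure 2 caption (the earlier 80% of patients for training with five-fold cross validation, the more recently treated 20% held out) can be sketched as below. The records are synthetic, the field name `tace_date` and both function names are hypothetical, and the use of contiguous, unshuffled folds is an assumption, since the text does not state how the folds were drawn.

```python
import datetime
import random


def temporal_split(records, train_fraction=0.8):
    """Sort by treatment date and hold out the most recently treated patients."""
    ordered = sorted(records, key=lambda r: r["tace_date"])
    # Rounding down yields 225 training and 57 holdout patients for n = 282.
    n_train = int(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Synthetic stand-in cohort: 282 hypothetical records with made-up dates.
rng = random.Random(0)
start_date = datetime.date(2005, 1, 1)
records = [
    {"patient_id": i, "tace_date": start_date + datetime.timedelta(days=16 * i)}
    for i in range(282)
]
rng.shuffle(records)

training, holdout = temporal_split(records)
folds = list(k_fold_indices(len(training), k=5))
```

Holding out the most recent patients rather than a random sample mimics prospective use, but it also means that shifts over time, such as the move towards DEB-TACE and MRI, end up concentrated in the validation cohort.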

| Statistical analysis
Continuous data were described by medians and ranges and compared using a two-tailed unpaired Student's t test or Wilcoxon test where appropriate. Categorical data were described as percentages and compared using the chi-squared test or Fisher's exact test. The predictive performance of the ANN was measured using the area under the receiver operating characteristic curve (AUC); the same approach was used to compare the ANN to ART, ABCR and SNACOR. The AUC ranges from 0 to 1, and values can be interpreted as follows: 0.9-1, 'excellent prediction'; 0.8-0.9, 'good prediction'; 0.7-0.8, 'fair prediction'; 0.6-0.7, 'poor prediction'; and 0.5-0.6, 'very poor prediction'. 35,36 A value <0.5 indicates 'anti-prediction'. Cumulative/dynamic receiver operating characteristic (ROC) curves were obtained using Python
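The AUC and its verbal interpretation can be illustrated with a short sketch. It computes the standard (static) AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case; it is not the cumulative/dynamic ROC analysis mentioned above. The labels and scores are toy values, the function names are hypothetical, and assigning boundary values (for example, exactly 0.70) to the higher category is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def interpret_auc(auc):
    """Map an AUC to the verbal categories listed in the text."""
    if auc < 0.5:
        return "anti-prediction"
    if auc < 0.6:
        return "very poor prediction"
    if auc < 0.7:
        return "poor prediction"
    if auc < 0.8:
        return "fair prediction"
    if auc < 0.9:
        return "good prediction"
    return "excellent prediction"


# Toy example: 1 = event of interest, scores = predicted probabilities.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs ranked correctly
toy_category = interpret_auc(toy_auc)
```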

| RESULTS
In the diagnostic training cohort, the 1-year OS was 71.5%, and in the holdout validation cohort, the 1-year OS was 63.1% (P = .283).

| Predictive performance of the neural network
For predicting 1-year survival, the ANN had a mean AUC in the diagnostic training cohort of 0.77 ± 0.13 (Figure 3).
These results were further verified in the holdout validation cohort, which had a mean AUC of 0.83 ± 0.06 (Figure 4), a positive predictive value of 87.5%, and a negative predictive value of 68.0%.
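For readers less familiar with the predictive values reported above, the following sketch shows how they are derived from a confusion matrix. The counts are toy values and are not taken from the study; treating predicted 1-year survival as the positive class is likewise an assumption, and `predictive_values` is a hypothetical helper.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP): share of positive predictions that are correct.
    NPV = TN / (TN + FN): share of negative predictions that are correct.
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("PPV and NPV need at least one prediction of each class")
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Toy counts, for illustration only (not the study's confusion matrix).
ppv, npv = predictive_values(tp=30, fp=10, tn=40, fn=20)
```

Unlike the AUC, both values depend on the chosen probability threshold and on the proportion of survivors in the cohort, so they do not transfer directly to populations with a different 1-year survival rate.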

| Predictive performance of the neural network compared to conventional scoring systems
In the last step, the performance of the ANN was compared to the existing scoring systems ART, ABCR and SNACOR. [4][5][6] The AUCs were 0.54 ± 0.08, 0.70 ± 0.07 and 0.73 ± 0.07, respectively, and therefore lower than the AUC of our ANN (Figure 5). This difference reached significance in case of the ART score (P < .001); for ABCR and SNACOR significance was not reached (P = .143 and P = .201, respectively).
The complete ANN is publicly available online in the Mendeley repository. 37 To facilitate easy implementation of the model, the repository comprises the Python script, a sample data file and a detailed manual.

FIGURE 2 Diagram defining the different datasets and visualizing the process of building and validating the model. In the first step, 80% of the patients (n = 225) were allocated to the diagnostic training dataset. This dataset was used for training following a five-fold cross validation approach. In the second step, the more recently treated 20% of the patients (n = 57) were used for validation.

TABLE 1 footnotes: a The sum of aetiologies could exceed 100% because patients could have more than one aetiology. b We considered only therapies performed within the observation period of 12 months. c All 24 patients received sorafenib; in eight patients sorafenib therapy had been started between the first and second TACE. d Two patients received nivolumab, the remainder received sorafenib.

| DISCUSSION
This is the first study applying an ANN for survival prediction in patients with HCC undergoing TACE. The ANN achieved a promising performance at predicting 1-year survival in patients with HCC prior to the second TACE treatment. With an AUC of 0.83, the ANN outperformed conventional scoring systems.
As there is a dire need for more objective decision-making, several prediction tools using a conventional score-based approach have been developed for treatment stratification in patients with HCC. The ART, ABCR and SNACOR scores aim to improve treatment stratification for patients with HCC prior to their second TACE treatment. [4][5][6] They achieved AUCs/Harrell's C indices between 0.60 and 0.75 for 1-year survival. 5,6 However, in external validations by several study groups, their predictive ability could not be reproduced. 7-10 For example, in our own external validations, we obtained Harrell's C indices between 0.54 and 0.59. 7,8 This difference is probably at least partially due to the so-called 'overfitting' effect, which has been described as 'a phenomenon occurring when a model maximizes its performance on some set of data, but its predictive performance is not confirmed elsewhere due to random fluctuations of patients' characteristics in different clinical and demographical backgrounds'. 38 The phenomenon of non-replicability has been recognized as a common problem, particularly in the life sciences.
It describes how different factors including inherent characteristics of the systems, bias in reporting, or problems in study design, execution, or interpretation lead to different results between the original study and the replication attempt. 39 Only very few studies have used machine learning approaches in patients with HCC in the setting of TACE. 15 We used a fully connected ANN based on a multitude of clinical parameters (n = 46) to predict OS prior to the second TACE. Similar approaches using an ANN have already been used following tumour resection. [40][41][42] Regarding its use in interventional oncology, Wu et al tried to predict disease-free survival in patients with HCC after radiofrequency ablation. 43 Until now, it has never been tried in the setting of TACE.

FIGURE 5 Receiver-operating characteristic analysis comparing the predictive ability of the artificial neural network (ANN) to ART, ABCR and SNACOR. The respective areas under the curve (AUCs) were 0.83 for the ANN, 0.73 for SNACOR, 0.70 for ABCR and 0.54 for ART. These AUCs correspond to 'good prediction' for the ANN, 'fair prediction' for SNACOR and ABCR and 'very poor prediction' for ART.
Using this novel approach, we achieved a promising predictive performance with an AUC of 0.83. Once trained, an ANN like ours can easily be implemented in clinical routine and might help to determine further treatment; an explanatory figure can be found in Figure S2. A head-to-head comparison of our ANN with the ART, ABCR and SNACOR scores yielded the highest AUC for the ANN, corresponding to 'good prediction', followed by SNACOR and ABCR ('fair prediction') and the lowest for ART ('very poor prediction'). However, the predictive performance was only significantly better than that of the ART score; for ABCR and SNACOR, significance was not reached. As this head-to-head comparison is based solely on the small holdout validation group comprising only 57 patients, the non-significance might be due to insufficient statistical power.
One of the main advantages of such an ANN is that it can include a broad choice of variables. In our case, we used a total of 46 input variables covering most evidence-based prognostic variables used in daily clinical practice to characterize patients with cirrhosis and/or HCC. Moreover, ANNs are easily scalable when the complexity, the number of patterns and the number of inputs of the dataset increase, and might therefore carry advantages over classical machine learning techniques such as random forest classifiers. 44 However, the use of an ANN is associated with several shortcomings. In contrast to traditional statistical approaches, it is to some extent a 'black box', which does not allow for easy identification of parameters associated with good predictive ability. Furthermore, an ANN cannot deal with missing values. Therefore, to avoid multiple imputations, the data have to be as complete as possible. Unfortunately, medical patient data are often incomplete or difficult to retrieve from different existing data sources (eg separate radiology, hospital and laboratory information systems). Another issue is the lack of digitization, as some information is still paper-based.
In the future, the broad introduction of novel tools, such as structured reporting, may improve data quality, completeness and availability, facilitating the training and application of neural networks.
Our analysis has several limitations. Firstly, and most importantly, our study lacks an independent external validation cohort.
Although we used a holdout patient cohort not used for training as validation, further external validation is mandatory. To encourage independent study groups to verify our results and to address the issue of non-reproducibility, we provide the ANN for download in our Mendeley repository. 37 Secondly, the study design was retrospective and the final sample size (n = 282) was only moderate. Most likely, the dataset used herein is too small to use the ANN to its full capacity. Therefore, its performance might not be superior to that of classical approaches in this case; however, it is very unlikely that it falls behind that of any classical approach. Ideally, training and subsequent validation would be performed with a sufficiently large patient cohort using a multicenter approach. Such a multicenter approach would increase the robustness of the model and also tackle the problem of 'overfitting'. Thirdly, we included all variables used by all three scoring systems, including some handcrafted variables (eg Child-Pugh, BCLC). Using such handcrafted variables still requires additional user input; however, the parameters used herein are commonly applied clinical parameters to characterize patients with liver disease, which should be readily available in most liver centres.
Even though we included a broad variety of potentially clinically meaningful parameters, it is possible that we missed variables that would have further improved the prediction. Moreover, some variables were not included because they were not available for all patients; for example, most patients lacked tumour grading and status of small vessel infiltration because they were diagnosed non-invasively. Furthermore, the inclusion of other more advanced parameters, for example radiomic data including texture analysis, could further enhance prediction. [45][46][47] Another possible source of bias is the long recruitment period of 13 years. During this time, several technical improvements were made to TACE. Although we introduced DEB-TACE at our institution in 2006, and both TACE regimes were therefore used throughout almost the whole recruitment period, the distribution between cTACE and DEB-TACE differed significantly between the training and holdout validation groups due to a shift towards DEB-TACE in recent years. However, both techniques were equally effective in the two largest multicenter RCTs. 20,48 Consequently, both techniques are equally endorsed by the most recent EASL guideline and the choice is left to the operator. 18 Additionally, several new systemic drugs became available, influencing the switch from TACE to systemic therapy. 49 We used CT and MRI for measuring tumour load as well as for determination of tumour response. In recent years, there was a shift towards MRI; consequently, the proportion of patients receiving MRI was greater in the validation group. However, both imaging modalities are accepted for HCC imaging as well as for determination of tumour response. 18,22 Further, we included primarily Caucasian patients. Due to fundamental differences in patient characteristics, for example regarding aetiology of liver disease, these results may not be transferable to Asian patients.
Lastly, we decided on a neural network with three hidden layers. As there is no commonly accepted design of such networks for similar purposes, the current design may not be the best one available. An optimally designed network could therefore allow for even better prediction.

| CONCLUSIONS
Neural networks could be better at predicting patient survival after TACE for HCC compared to existing scoring systems using a conventional statistical approach. Once established, such prediction models could easily be deployed into clinical routine and help determine optimal patient care. Especially less experienced investigators might profit from support mechanisms based on such machine learning algorithms.
Nevertheless, clinical reality is more complex than such a network can capture. Therefore, it may only serve as one of several components in decision-making and cannot replace a clinician's long-standing practical experience. Inclusion of additional parameters in the prediction model could potentially further increase its performance.
This could not only include clinical parameters that have already demonstrated predictive value, such as postembolization syndrome after first TACE, 50 but also novel predictive parameters such as texture analysis. [45][46][47] Potentially, a combination of our ANN with a convolutional neural network using pattern recognition might further enhance prediction. However, to avoid the problem of overfitting and to enhance generalizability, the network needs to be built on a broader database including clinical data from several institutions.

ACKNOWLEDGEMENTS
The study includes data from the doctoral thesis of one of the authors (FW). DPdS and RK contributed equally to the manuscript. We thank Lukas Müller for his support with data acquisition.