Predicting survival in metastatic non‐small cell lung cancer patients with poor ECOG‐PS: A single‐arm prospective study

Abstract Background Patients with advanced non-small cell lung cancer (NSCLC) are a heterogeneous population with a short life expectancy. We aimed to develop methods to better distinguish patients who would survive more than 90 days from those who would not. Methods We evaluated 83 characteristics of 106 treatment-naïve, stage IV NSCLC patients with Eastern Cooperative Oncology Group Performance Status (ECOG-PS) >1. Automated machine learning was used to select a model and optimize hyperparameters. 100-fold bootstrapping was performed for dimensionality reduction for a second ("lite") model. Performance was measured by the C-statistic and accuracy in an out-of-sample validation cohort. The "lite" model was validated on a second independent, prospective cohort (N = 42). Network analysis (NA) was performed to evaluate differences in the centrality and connectivity of features. Results The selected method was the ExtraTrees Classifier, with a C-statistic of 0.82 (p < 0.01) and accuracy of 0.81 (p = 0.01). The "lite" model had 16 variables and obtained a C-statistic of 0.84 (p < 0.01) and accuracy of 0.75 (p = 0.039) in the first cohort, and a C-statistic of 0.706 (p < 0.01) and accuracy of 0.714 (p < 0.01) in the second cohort. The networks of patients with shorter survival were more interconnected. Features related to cachexia, inflammation, and quality of life had statistically different prestige scores in NA. Conclusions Machine learning can assist in the prognostic evaluation of advanced NSCLC. The model generated with a reduced number of features showed high accessibility and reasonable metrics. Features related to quality of life, cachexia, and performance status had increased correlation and importance scores, suggesting that they play a role at later disease stages, in line with the biological rationale already described.


| INTRODUCTION
Recent data suggest wide variation in survival among patients with metastatic non-small cell lung cancer (NSCLC), the leading cause of cancer-related death worldwide, largely attributable to the clinical and molecular heterogeneity of this population. 1 Most prospective studies exclude patients with poor PS (ECOG-PS > 1), especially in the late-stage setting, even though poor PS is self-reported by 48% of NSCLC patients. 2 The risk of aggressive treatment causing harm in NSCLC patients with poor ECOG-PS is not negligible. 3,4 Moreover, the early introduction of palliative care has been shown to improve clinical outcomes and reduce financial burden. 5,6 Predicting survival is crucial for determining the best resource allocation, treatment objectives, and advance healthcare directives. 7 Clinician-made survival estimates alone are reported to be overly optimistic. 4,8,9 Several prognostic models have been developed to estimate the survival of terminally ill cancer patients. Currently established models generally lack clinical applicability or external validation, 10 were not developed for a specific primary cancer, or do not consider the nuances of cancer stage. [11][12][13] In addition, most models rely solely on assessing a single clinical dimension (e.g. symptoms, functionality, or disease burden).
More recently, machine learning (ML) models and nomograms have become popular tools for outcome prediction. These methods allow for assessing complex associations between variables, surpassing the performance of previous models crafted through analog means, or even providing treatment recommendations. [14][15][16][17] However, there is limited evidence supporting accurate prediction for NSCLC patients with poor PS.
Thus, more extensive research is required to develop a model capable of assisting healthcare providers in better caring for this vulnerable population of poor PS advanced NSCLC patients and allocating resources to reduce the disease's burden on patients, families, and taxpayers.
The aim of this study is to extensively assess demographic, clinical, and laboratory features of metastatic NSCLC patients with poor PS, analyze the interaction between them, and evaluate the performance of a patient classification model to predict overall survival (OS) ≤90 days using automated ML optimization strategies.

| Patients
Patients enrolled in our institution from 19 November 2017, to 1 February 2021 (cut-off date) were included in this prospective, single-center study. Eligible patients were adults 18 years and older, with histologically confirmed, treatment-naïve, advanced NSCLC (American Joint Committee on Cancer [AJCC] 7th edition IVa or IVb stages), with ECOG-PS 2, 3, or 4.
Patients were enrolled at two different times: the first cohort (Cohort 1 [C1]) was collected as a discovery cohort for model development. The second (Cohort 2 [C2]) was collected after the development of our ML model was complete, as a second out-of-sample dataset for independent validation.

| Patient features
We collected 83 baseline features covering demography, histology, metastatic sites, and EGFR mutational status; nutritional status (body mass index [BMI] and weight loss in the last 6 months) and body composition (mean hand grip strength, 18 dominant mid-arm muscle circumference, and measurement of the psoas as well as abdominal intermuscular adipose tissue, subcutaneous adipose tissue, and visceral adipose tissue 19,20 ); medical history (comorbidities, Charlson Comorbidity Index 21,22 ) and smoking status; PS (ECOG-PS 23 and Karnofsky scales 24 ); symptoms (Edmonton Symptom Assessment System [ESAS] 25 ); quality-of-life and cachexia questionnaires (EORTC QLQ-C30 and QLQ-CAX24); prognostic scores (Palliative Prognostic Index [PPI] 13 and Palliative Prognostic Score [PaP] 12 ); and laboratory values (blood cell counts, renal function, C-reactive protein, and albumin). We excluded C-reactive protein and albumin because of a high percentage of missing values.

K E Y W O R D S
machine learning, network analysis, non-small cell lung cancer, poor performance status, prognostic markers

| Survival analysis
The median OS (mOS) was determined by performing a Kaplan-Meier OS analysis of all 106 patients.

| Data pre-processing
We scaled the data of C1 by minimal and maximum feature values (MinMax scaling), ranging from 0 to 1. We imputed the missing data using the K-nearest neighbors (KNN) method. KNN-based imputation is a classic method that selects the K patients nearest to the patient with missing data. Distance was measured as the Manhattan distance between patients' positions in a multidimensional space in which each dimension is a feature. We then used a weighted average of the neighbors' values to estimate the missing value, weighting the K patients by the similarity of their features to those of the patient of interest. 29 After imputation, scaling was undone to avoid data leakage. We labeled the data by OS status via one-hot encoding: patients with OS >90 days were labeled "0" and the remainder "1." We split off a validation set (Cohort 1.5 [C1.5]) by randomly selecting 16 (15.09%) patients.
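The scale-impute-unscale sequence described above can be sketched with scikit-learn. The toy matrix and neighbor count are illustrative; note that scikit-learn's `KNNImputer` uses a NaN-aware Euclidean metric, so this approximates rather than reproduces the study's Manhattan-distance setup.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

# Hypothetical toy matrix standing in for the C1 feature table
# (rows = patients, columns = features); np.nan marks missing values.
X = np.array([
    [1.0, 200.0, 3.0],
    [2.0, np.nan, 4.0],
    [1.5, 220.0, np.nan],
    [3.0, 260.0, 6.0],
    [2.5, 240.0, 5.0],
])

# 1) Scale each feature to [0, 1] so no feature dominates the distances.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# 2) Impute with a distance-weighted average of the K nearest patients.
#    (The study used Manhattan distance; KNNImputer uses NaN-aware Euclidean.)
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X_scaled)

# 3) Undo the scaling so features return to their original units.
X_restored = scaler.inverse_transform(X_imputed)
```

Observed entries survive the scale/unscale round trip unchanged; only the missing cells are filled.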

| Outlier removal
After separating C1.5, we removed the outliers among the 90 patients remaining in C1. We did this by grouping patients with the same label (i.e. 1 or 0) and clustering them using KNN (after another round of MinMax scaling). We then used Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 29 to remove patients who did not belong to any cluster. This process removed six patients (three from each label group), leaving 84 patients in C1.
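A minimal illustration of density-based outlier removal with scikit-learn's `DBSCAN`, which labels points belonging to no dense cluster as -1. The data and the `eps`/`min_samples` values are invented for the example, not the study's settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical data: one dense cluster of "patients" plus one clear outlier.
cluster = rng.normal(loc=0.5, scale=0.02, size=(20, 2))
outlier = np.array([[5.0, 5.0]])
X = np.vstack([cluster, outlier])

# Scale to [0, 1], as in the pipeline, before density-based clustering.
X_scaled = MinMaxScaler().fit_transform(X)

# DBSCAN assigns -1 ("noise") to points that belong to no dense cluster.
labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(X_scaled)

# Keep only points assigned to some cluster.
X_clean = X[labels != -1]
```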

| Data balancing
To reduce possible imbalances within C1, the 84 data points underwent the Synthetic Minority Over-Sampling Technique (SMOTE). This approach amplifies data synthetically, which is especially useful in small datasets, by generating new, artificial patients similar to those of the minority group. 31 This method yielded 12 new data points, resulting in 96 patients for training and testing. After SMOTE, MinMax scaling was undone, returning all features to their standard units.
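SMOTE is typically applied via the `imbalanced-learn` package; the following is a from-scratch sketch of its core idea, interpolating between a minority sample and one of its nearest neighbors. The `smote_oversample` helper and the minority-class data are hypothetical, for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between each sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # neighbor indices, dropping column 0 (each point is its own neighbor)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # pick a minority sample
        j = rng.choice(neigh[i])       # pick one of its neighbors
        gap = rng.random()             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical minority-class feature vectors (already scaled to [0, 1]).
X_min = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.2]])
X_new = smote_oversample(X_min, n_new=12)
```

Because each synthetic point is a convex combination of two real points, all generated values stay within the observed range of each feature.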

| Model selection
We used Hyperopt (documentation available at https://hyperopt.github.io/) to select and optimize the ML classification model, the preprocessing methods (scaling and/or normalization), and their hyperparameters. 32
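Hyperopt performs Bayesian (TPE) search over models and hyperparameters. As a simplified, self-contained stand-in, the sketch below runs a plain random search over candidate classifiers with scikit-learn on synthetic data; the search space, trial count, and dataset are illustrative and are not the study's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset (96 samples, as in the balanced C1).
X, y = make_classification(n_samples=96, n_features=20, random_state=0)
rng = np.random.default_rng(0)

# Candidate model factories; hyperparameters are sampled on each trial.
candidates = [
    lambda: ExtraTreesClassifier(n_estimators=int(rng.integers(50, 300)),
                                 random_state=0),
    lambda: RandomForestClassifier(n_estimators=int(rng.integers(50, 300)),
                                   random_state=0),
    lambda: LogisticRegression(C=float(10 ** rng.uniform(-2, 2)),
                               max_iter=1000),
]

best_score, best_model = -np.inf, None
for _ in range(10):  # 10 random trials (Hyperopt would search adaptively)
    model = candidates[rng.integers(len(candidates))]()
    score = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    if score > best_score:
        best_score, best_model = score, model
```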

| Training and test sets
We split C1 randomly 100 times into different training and test dataset iterations to estimate the model's performance and eliminate sample selection bias. The test sets, used to assess the model's performance immediately after training, contained a random population consisting of 20% of the C1 (n = 19). We assigned the remaining 80% (n = 77) to the training set. We used the training set to tune the parameters of the model.
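The repeated 80/20 splitting can be expressed with scikit-learn's `ShuffleSplit`. The placeholder matrix mirrors the paper's 96 patients and 19-patient test sets; the 16-feature width is arbitrary.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Placeholder feature matrix: 96 patients (post-SMOTE C1), arbitrary width.
X = np.zeros((96, 16))

# 100 independent random splits with 19 test / 77 train patients each.
splitter = ShuffleSplit(n_splits=100, test_size=19, random_state=0)
splits = list(splitter.split(X))
train_idx, test_idx = splits[0]
```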

| Model performance analysis
The model defined and optimized by HyperOpt was trained with each of the 100 different training sets and tested with the respective test sets. We used the mean area under the receiver operating characteristic (ROC) curve (C-statistic) and accuracy to measure model performance. ML metrics with p-values <0.05, as calculated through the Monte Carlo method, were considered statistically significant.
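This evaluation scheme can be sketched on synthetic data: the mean C-statistic over repeated splits, plus a Monte Carlo p-value obtained by re-running the evaluation on permuted labels. The iteration counts here are far smaller than the study's, for speed, and the data are artificial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 96-patient training pool.
X, y = make_classification(n_samples=96, n_features=20, random_state=0)

def mean_auc(X, y, n_splits=20, seed=0):
    """Mean test-set AUC over repeated random 80/20 splits."""
    aucs = []
    for i in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed * 1000 + i, stratify=y)
        clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs))

observed = mean_auc(X, y)

# Monte Carlo p-value: how often does the same pipeline, trained on
# shuffled labels, reach the observed performance by chance?
rng = np.random.default_rng(0)
null = [mean_auc(X, rng.permutation(y), n_splits=5) for _ in range(10)]
p_value = (1 + sum(s >= observed for s in null)) / (1 + len(null))
```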

| Significant feature selection for "lite" model building
A smaller number of features (ideally the fewest possible) is preferred to increase the model's ease of use and its generalization to previously unseen data. For this purpose, each iteration of the selected model voted for the most significant features, and we used those selected by over 60% of the iterations for further analysis. Collinearity was assessed by calculating variance inflation factors and a Spearman's correlation coefficient matrix, excluding the least-voted features of highly correlated groups. The remaining variables composed the final selection for the "lite" model. Its performance was measured by the mean area under the ROC curve (C-statistic) and accuracy in the 100-fold bootstrap of C1 and in validation with C1.5. We performed a Cox analysis to verify the relationship between each variable selected for the "lite" model and patient survival. Figure 1 presents a diagram of the data pipeline from raw data to results in C1 and C1.5.
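The voting scheme can be sketched as follows. The iteration count, the number of features each iteration votes for (`top_k`), and the data are illustrative; the 60% retention threshold follows the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with a handful of genuinely informative features.
X, y = make_classification(n_samples=96, n_features=20, n_informative=4,
                           random_state=0)
n_iter, top_k = 25, 5
votes = np.zeros(X.shape[1])

for i in range(n_iter):
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=i)
    clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    # Each iteration votes for its top_k most important features.
    votes[np.argsort(clf.feature_importances_)[-top_k:]] += 1

# Keep features chosen by >60% of iterations.
selected = np.where(votes / n_iter > 0.6)[0]
```

In the study, the surviving features were additionally screened for collinearity (variance inflation factors, Spearman correlation) before entering the "lite" model.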
After validation with C1.5, the "lite" model was trained with a dataset comprising the modified C1 (after imputation, outlier removal, and SMOTE) and C1.5 (the pure, unadulterated validation data removed from C1 prior to modifications). This version of the "lite" model was used to predict the survival of the patients prospectively collected in C2. Model performance was measured by the mean area under the ROC curve (C-statistic) and accuracy.

| Network analysis
We used the raw C1 dataset to avoid false discoveries caused by deleting any feature with more than 5% missing values and the consequent data imputation. Each node represents a feature. The correlation between all feature pairs was calculated by Spearman's correlation coefficient. We corrected the p-values for multiple tests using the false discovery rate (FDR) Benjamini-Hochberg method. 33 After correlation selection, we plotted network graphs using force-directed (Fruchterman-Reingold) and circular layouts. Node centrality in the constructed networks was assessed by eigenvector centrality analysis. A higher prestige score indicates that a node (or feature) correlates with other influential nodes, and vice versa. 34 Several fields of information science, such as modern search engines, use this method. 35
We built one network for each group of patients (OS ≤90 days and OS >90 days). We used BioNetStat to verify whether there were statistical differences between the groups' networks. To this end, we statistically compared their structures (spectral entropy) and the prestige scores of features (eigenvector centrality). 36 We also performed an edge-wise statistical comparison between the networks of the OS groups. After determining the significant correlations in each network (corrected p-value <0.05), we compared these correlations between the two survival groups using bootstrap analysis. Correcting the bootstrap p-values by FDR yielded the correlations that differed statistically between the two groups.

FIGURE 1 Diagram of the data pipeline from raw data to machine learning model validation metrics in the C1 dataset. The raw data were submitted to imputation, followed by splitting of 16 patients into the validation set (C1.5). We performed Synthetic Minority Over-Sampling Technique (SMOTE) on the remainder of the dataset for data balancing and then performed a 100× bootstrap, generating 100 different combinations of training and test datasets. The original dataset with synthetic data points was also used to train the Bayesian model selection and hyperparameter tuning library (HyperOpt). The selected and tuned model was trained and tested on each bootstrapped dataset with all features (full model), with selection of the most relevant features. Validation on C1.5 was also performed with the full model. The same classifier and hyperparameters were trained and tested on filtered versions of the bootstrapped datasets, containing only the features selected during the training of the full model. The model with a reduced number of features ("lite" model) was then tested on a filtered C1.5.
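The core NA computations — pairwise Spearman correlations, Benjamini-Hochberg FDR filtering, and eigenvector centrality (computed here via simple power iteration rather than a graph library) — can be sketched on hypothetical data. The feature matrix and the correlated "hub" features are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical feature matrix (rows = patients, columns = features);
# features 0-2 are made correlated to create a connected hub.
n, p = 95, 6
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=n)
X[:, 2] = X[:, 0] + 0.3 * rng.normal(size=n)

rho, pvals = spearmanr(X)  # pairwise Spearman correlations and p-values

# Benjamini-Hochberg FDR correction on the upper-triangle p-values.
iu = np.triu_indices(p, k=1)
flat = pvals[iu]
order = np.argsort(flat)
ranked = flat[order] * len(flat) / (np.arange(len(flat)) + 1)
passed = np.zeros(len(flat), dtype=bool)
passed[order] = np.minimum.accumulate(ranked[::-1])[::-1] <= 0.05

# Build the network: adjacency = |rho| on FDR-significant edges only.
A = np.zeros((p, p))
A[iu] = np.abs(rho[iu]) * passed
A = A + A.T

# Eigenvector centrality ("prestige" score) by power iteration.
v = np.ones(p)
for _ in range(200):
    v = A @ v
    v = v / np.linalg.norm(v)
centrality = v
```

The correlated hub (features 0-2) ends up with the highest prestige scores, mirroring how highly interconnected clinical features dominate the OS ≤90 days network.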
To reduce the number of variables, we sorted them into groups: QLQ C30, CAX 24, Prognostic Scores, Laboratory Exams, Muscle Strength, Symptoms (ESAS), Demographic Characteristics, and Tumor Status. Each group is a node. Two groups are connected if features from both groups are connected. The weight of an edge between two groups is the sum of the weights of the edges connecting their features. A loop edge connecting a group node to itself indicates connected variables within the same group.

| Ethical considerations
This prospective study "Factors Determining the Prognosis in NSCLC Stage IV (LUPROG)" (NCT04306094) was approved by the local ethics committee. Informed consent was obtained from all participants. The present study follows the Guidelines for Developing and Reporting ML Predictive Models in Biomedical Research for the transparent report of an ML architecture for individual prognosis estimation. 38 The data generated and analyzed during this study are available from the corresponding author upon reasonable request.

| Model selection
The model selected and tuned by the automated ML library was the Extremely Randomized Trees Classifier (ExtraTreesClassifier), and no preprocessing (scaling/normalization) was necessary. Figure S1 shows the performance of the 20 best models assessed.

| "Lite" model feature selection
The initial model was built with 83 features (listed in the Methods section), and only 16 were included after feature selection. Those judged most important after bootstrapping were: need for supplementary oxygen, strong opioid use, and EORTC QLQ-C30 physical functioning and role functioning. Hematocrit, ECOG-PS, wild-type EGFR status, EORTC CAX24 "dry mouth," and PaP score were selected as significant by at least 93% of iterations (Table 2).
Patients from C2 classified as predicted OS >90 days had a mOS (95% CI) of 165 days (103-280), while that of those classified as predicted OS ≤90 days was 61 days. Log-rank analysis shows a significant difference between the estimated survival of the two groups (p = 0.01). Figure 2 shows a Kaplan-Meier plot of the survival of the patients in C2.
In the Cox analysis with 16 features selected for the "lite" model, six were related to patients' survival (p-value <0.05): CAX24 features (dry mouth, loss of control, and food aversion), strong opioid use, ECOG-PS, and PaP score (Table 3).

| Network analysis
Features with more than 5% missing values were removed from the analysis, and patients with any remaining missing data were then excluded (Figure S2). The remaining dataset consisted of 73 features and 95 patients. For patients who survived >90 days, 28 correlations between 34 variables were statistically significant. These formed a sparse network of 11 separate small networks, each with two to five components. For the cohort of patients who survived ≤90 days, 99 correlations between 54 variables were statistically significant. The correlations among features in this group formed one large network with 51 nodes. The 10 most "important" variables by prestige score (eigenvector centrality) were QLQ C30 items (fatigue and physical functioning) and CAX24 items (forcing oneself to eat, loss of control, and physical decline). The remainder were of diverse natures: KPS, PPI, hematocrit, hemoglobin, and strong opioid use.

FIGURE 2 Kaplan-Meier survival estimates for all C2 patients (blue) with its 95% confidence interval (light blue area) and survival lines of C2 patients classified by the "lite" model as predicted OS >90 days (green) and ≤90 days (red). Combined with the log-rank test performed, this image illustrates how machine learning-assisted assessment can be an important factor for the prediction of patient survival.

| Network comparison
Statistical comparison of the networks of patients with OS ≤90 days and OS >90 days showed a significant difference in network structure (spectral distribution, p = 0.01) but not in overall network centrality (p = 0.21); the prestige scores (eigenvector centralities) of individual features nonetheless differed between groups (Table S1).
When comparing the weights of edges between the two networks, nine edges (correlation lines between nodes) remained statistically significant (Figure S3). QLQ C30 social functioning, CAX24 items (eating difficulties, eating and weight-loss worry, and forcing oneself to eat), blood counts (hematocrit, hemoglobin, leucocyte, and neutrophil counts), prognostic scores, weight loss, and strong opioid use showed changes in correlations.

| Feature groups networks
The grouping analysis of networks shows that the OS ≤90 days network (Figure 3A,C) has more connections than the OS >90 days network (Figure 3B,D). The QLQ C30 and CAX24 groups share a high-weighted edge in the OS ≤90 days network, whereas QLQ C30 and ESAS Symptoms share a high-weighted edge in the OS >90 days network.

| DISCUSSION
In this study, we gathered multidimensional data from 106 patients with metastatic NSCLC and ECOG-PS >1 and created an ML model with a C-statistic of 0.843 and an accuracy of 0.750 in the discovery cohort, and 0.706 and 0.714, respectively, in an out-of-sample, fully prospective cohort, while using only 16 easily assessable patient characteristics, mainly related to quality of life, symptoms, and cachexia. The results indicate that modern data imputation and amplification methods, together with hyperparameter optimization, can compensate for a limited number of training data points and yield a model with acceptable performance.
Here, we present a prospective design, seldom seen in ML studies. In addition, this is a comprehensive assessment of several biological, clinical, and psychosocial aspects of patients. It also uses numerous modern data analysis tools to study this population, producing valuable results in an out-of-sample dataset. The discrepancy between testing and validation metrics in the complete model can be explained by sample selection bias: due to the small size of the dataset, large variations are prone to occur, and the validation sample may coincidentally have been one on which the selected model performed better than its 95% CI. After feature selection, this phenomenon appeared to be corrected, with accuracy returning to within the 95% CI of the training set and remaining within reasonable metrics. This might be explained by the removal of statistical noise, which possibly contributed to reducing selection bias. The reduction in dimensionality also contributed to the accessibility of the model. Finally, a drop in C-statistic and accuracy occurred when testing on C2, as is to be expected in external validation groups, especially considering the size of the prospective cohort (39.6% of C1). Another noteworthy point is the possibility of pragmatic variable selection for prognostic evaluation, which often results in optimal ML performance with a more compact set of features. 39 The model we developed shows that patients who are clinically deteriorated, cachectic, have inferior quality-of-life scores, wild-type EGFR, and poor prognostic predictors tend to be classified as short OS (≤90 days), showing biological plausibility related to cancer-induced systemic inflammation. 40 Of note, when analyzing the "lite" model features in a multivariate model, only 6 of the 16 selected variables had a statistically significant hazard ratio. However, this analysis only shows differences across the whole cohort. Decision-tree-based methods such as ExtraTreesClassifier change the population composition after each decision branch, so features that are initially nonsignificant might become significant at later stages.

TABLE 3 Hazard ratios and 95% CIs for the features selected by the "lite" model. Variables marked with "-yes" took their absence as reference.
Network analysis (NA) also unveiled interesting information: the network of patients with OS ≤90 days displayed significantly more connections than that of patients with OS >90 days, with a distinct density of nodes in the graphs of the two populations. Of note, the appearance of significant edges between other domains and the groups "Prognostic Scores," cachexia ("CAX24"), and quality of life ("C30") shows how these patients may be more prone to clinical complications that affect prognosis, functionality, and wellbeing.
This work also presents a plausible way to facilitate the development of ML models in the clinical healthcare setting. Robust population modeling is hindered by decentralized data storage, incompatibility between electronic health record systems, and several barriers to the sharing of patient information. [40][41][42] Synthetic data generation may help mitigate these limitations.
Moreover, the fact that only simple variables remained in the "lite" model makes the result of this research an accessible tool for the evaluation of borderline cases. It might help reduce the number of patients with ECOG-PS >1 submitted to therapies with inherent toxicities, in addition to better allocating resources.

FIGURE 3 Network analysis representation for patients with OS ≤90 days (A, C) and >90 days (B, D). All features are represented in panels A and B without names for easier visualization. The network with feature names is in Figure S4. Edge color and width represent the sign and magnitude of correlations, respectively. Node radius represents prestige score. The numbers in panels A and B mark the 10 highest prestige scores of features in the OS ≤90 days network: 1: CAX24 Forcing Self to Eat; 2: KPS; 3: PPI; 4: CAX24 Loss of Control; 5: Hematocrit; 6: Hemoglobin; 7: QLQ-C30 Fatigue; 8: QLQ-C30 Physical Functioning; 9: CAX24 Physical Decline; 10: Strong Opioid. In panels C and D, the nodes are features grouped by similarity. Edge width represents the sum of the absolute values of correlations between features.

| CONCLUSIONS
Machine learning and NA can be helpful tools for analyzing complex issues such as patient survival prediction. The encouraging results achieved by the ML models may be evidence that the methods used are helpful in improving current data analysis and potentially refining medical decision-making. This is supported by biological rationale achieved both for NA and the pragmatically chosen features. Finally, the model with fewer input variables may present an accessible tool for evaluating NSCLC patients with poor PS. More extensive validation is needed before use in the clinical setting.