A systematic review and meta‐analysis of predictive and prognostic models for outcome prediction using positron emission tomography radiomics in head and neck squamous cell carcinoma patients

Abstract
Background: Positron emission tomography (PET) images of head and neck squamous cell carcinoma (HNSCC) patients can assess functional and biochemical processes at the cellular level. Therefore, PET radiomics-based prediction and prognostic models have the potential to help understand tumour heterogeneity and assist clinicians with diagnosis, prognosis and management of the disease. We conducted a systematic review of published modelling information to evaluate the usefulness of PET radiomics in the prediction and prognosis of HNSCC patients.
Methods: We searched bibliographic databases (MEDLINE, Embase, Web of Science) from 2010 to 2021 and included 31 studies that met pre-defined inclusion criteria. We followed the CHARMS checklist for data extraction and performed quality assessment using the PROBAST tool. We conducted a meta-analysis to estimate the accuracy of the prediction and prognostic models using the diagnostic odds ratio (DOR) and average C-statistic, respectively.
Results: Manual segmentation was the most commonly used approach, followed by thresholding at 40% of the maximum standardised uptake value (SUVmax). The area under the receiver operating characteristic curve of externally validated prediction models ranged from 0.60 to 0.87, 0.65 to 0.86 and 0.62 to 0.75 for overall survival, distant metastasis and recurrence, respectively. Most studies showed an overall high risk of bias (outcome definition, statistical methodologies and external validation of models) and unclear concern in terms of applicability. The meta-analysis showed an estimated pooled DOR of 6.75 (95% CI: 4.45, 10.23) for prediction models and a C-statistic of 0.71 (95% CI: 0.67, 0.74) for prognostic models.
Conclusions: Both prediction and prognostic models using clinical variables and PET radiomics demonstrated reliable accuracy for detecting adverse outcomes in HNSCC, suggesting the prospect of PET radiomics in clinical settings for diagnosis, prognosis and management of HNSCC patients. Future studies of prediction and prognostic models should emphasise the quality of reporting, external model validation, generalisability to real clinical scenarios and enhanced reproducibility of results.


| INTRODUCTION
Head and neck squamous cell carcinoma (HNSCC) is the sixth most common malignancy globally. 1 HNSCC constitutes a diverse group of cancers originating from the mucosal epithelium of the oral cavity, pharynx, sinonasal tract and larynx. 2 Despite advances in evaluation and treatment, HNSCC outcomes have improved only marginally over the past decades, which is attributed to delayed diagnosis and recurrence. 3 Understanding tumour heterogeneity is key in cancer management as it has implications for tumour development, therapeutic outcomes and survival. 4 Non-invasive medical imaging techniques such as magnetic resonance imaging, computed tomography (CT) and positron emission tomography (PET) provide information about tumours. 4 Tumours exhibiting high intratumoral heterogeneity have been found to have a less favourable prognosis, which may be due to either inherent aggressive characteristics or treatment resistance. 5 PET outperforms other imaging modalities as an ideal tool for characterising the tumour biology at the macroscopic level. 5,6 PET with 2-deoxy-2-[fluorine-18]fluoro-d-glucose (18F-FDG), a glucose analogue with metabolism similar to that of glucose, provides valuable functional information based on the increased glucose uptake and glycolytic activity of cancer cells. Hence, PET provides information about functional and biochemical changes in bodily tissues that precede anatomical changes. 7,8 Clinical implications of PET are already evident in brain tumour, thyroid cancer, non-small cell lung cancer, breast cancer, oesophageal cancer, pancreatic cancer, colorectal cancer, cervical cancer, sarcoma and lymphoma in addition to head and neck cancer. 9 [11][12][13] Standardised uptake value (SUV), metabolic tumour volume (MTV) and total lesion glycolysis (TLG) provide information useful for diagnosis, earlier evaluation and treatment response evaluation. 4 PET radiomic features have been found to outperform SUV parameters for survival outcome prediction in some types of cancer. 6
Textural analysis has been widespread in PET since the late 2000s. 14 Due to their functional nature and close link to tumour biology, the radiomic features extracted from PET images have the potential to capture the phenotypic differences across tumours that are correlated with the stage and prognosis of the disease. 14 Machine learning models can be trained to recognise patterns in complex PET radiomics data, assisting clinicians with risk assessment, diagnosis and prognosis, thus improving patient care. 15 By the middle of 2020, only 16% of the published radiomics studies were based on PET or PET/CT. 16 Owing to the valuable information provided by PET images about tumour heterogeneity, it is essential to perform a systematic review evaluating the current status and potential of PET radiomic feature-based models in HNSCC outcomes to direct the course of future research in diagnosis, prognosis and management of HNSCC.
In this study, we present a systematic review to assess the current status of prediction and prognostic models based on pretreatment PET images in HNSCC studies. The objectives of the systematic review are to evaluate the implemented segmentation methods, identify essential radiomic feature-based predictors, assess model development strategies and estimate the overall performance using meta-analysis.

| MATERIALS AND METHODS
This review was conducted according to the guidance of the preferred reporting items for systematic reviews and meta-analyses (PRISMA) 17 and the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS). 18 The protocol for the study was registered on the international prospective register of systematic reviews (PROSPERO 2021, registration number CRD42021287832).

KEYWORDS
head and neck squamous cell carcinoma, positron emission tomography, prognosis, radiomics, systematic review

| Eligibility criteria

| Outcomes of interest
The outcomes of interest and their definitions are as follows:
Overall survival (OS)/All-cause mortality (ACM): The time from diagnosis to death from any cause or last date of follow-up.
Recurrence (R): The time from the end of treatment to local or locoregional recurrence, disease progression, death from any cause or last date of follow-up. We considered disease-free survival (DFS), relapse-free survival and recurrence-free survival (RFS) as recurrence. 20
Progression-free survival (PFS): The time from the end of primary treatment to the date of disease progression, death or last follow-up.
Distant metastasis (DM): The time to first clinical or pathological evidence of disease spread to distant organs or lymph nodes.
Disease-specific survival (DSS): The time from diagnosis to death due to HNSCC.

| Inclusion criteria
Studies were included if they met the following criteria:
1. Patients were pathologically diagnosed with HNSCC (including anatomic subtypes).
2. An FDG-PET/CT or FDG-PET scan was performed before treatment.
3. Radiomic features were considered.
4. The clinical outcomes of interest were OS/ACM, R, DM, PFS and DSS.
5. Patients were treated with chemotherapy, radiotherapy, surgery, brachytherapy or a combination of these.
6. The minimum follow-up time was 1 year; there was no restriction on the last follow-up time at which the outcome was measured.
7. Details of the predictive or prognostic models used were provided along with their performance measures.
[...] S1-S3).
We categorised the included studies into prediction models (binary outcome) and prognostic models (time-to-event).
A prediction model considers a binary outcome; that is, the outcome can take only one of two values, such as treatment failure or success, or mortality (dead or alive). 21 Hence, the prediction model is a binary classification problem, which is usually assessed by performance metrics such as the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity and specificity. 22 Time-to-event analysis is used to analyse the time to disease remission, progression or death for cohorts of patients when the time to event is either recorded or censored. 23 [25][26] Kaplan-Meier plots, the log-rank test and the Cox proportional hazards model are the commonly used techniques for analysing time-to-event data. 25,27 The common performance metric used for the prognostic model is the concordance index or C-statistic. 28
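The performance metrics named above can be illustrated with a minimal sketch. It is not taken from any of the included studies: the function names are hypothetical, the AUC is computed through its rank (Mann-Whitney) interpretation, and the C-statistic follows Harrell's pairwise definition with the simplifying assumption that pairs with tied event times are skipped.

```python
from itertools import combinations


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels coded 1 (event) and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def harrell_c(times, events, risks):
    """Harrell's C-statistic for right-censored data.

    A pair is comparable when the subject with the shorter time had an
    observed event; it is concordant when that subject also has the
    higher predicted risk. Assumption: pairs with tied times are skipped.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored: ordering unknown
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    return concordant / comparable
```

Both the AUC and the C-statistic equal 0.5 for a model with no discrimination and 1.0 for perfect ranking, which is why they are pooled on comparable scales in the meta-analysis.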

| Meta-analysis
We performed a random-effects meta-analysis to evaluate the overall performance metrics of both predictive models (binary outcome) and prognostic models (time-to-event outcome), accounting for the heterogeneity between studies. For the meta-analysis, we considered all outcomes (OS, recurrence, DSS, DM and PFS) within each predictive and prognostic modelling framework. We performed additional meta-analyses to compare the performance metrics between manual and threshold-based segmentation methods for studies involving prediction and prognostic models. Further details are presented in Supporting Information.
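As an illustration of random-effects pooling, the sketch below uses the DerSimonian-Laird estimator of the between-study variance. This estimator is an assumption made for the example; the text above does not state which estimator was used, and the function name and return keys are hypothetical.

```python
import math


def random_effects_pool(estimates, std_errors, z=1.96):
    """Pool study-level estimates (e.g. log DOR) with a DerSimonian-Laird
    random-effects model. Returns the pooled estimate, its 95% CI,
    Cochran's Q and the between-study variance tau^2."""
    w = [1.0 / se ** 2 for se in std_errors]            # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0     # truncated at zero
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci": (pooled - z * se_pooled, pooled + z * se_pooled),
        "q": q,
        "df": df,
        "tau2": tau2,
    }
```

When Q does not exceed its degrees of freedom, tau^2 is truncated to zero and the result coincides with the fixed-effect (inverse-variance) estimate.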

| Study selection
The initial database search identified 231 papers published between 2010 and December 2021. The study selection process resulted in 31 model development studies eligible for the systematic review (Figure 1). The studies included in the review represented about 4500 HNSCC patients, including its subtypes. The main findings of this study are summarised in Tables 1 and 2.

| Study design and patient characteristics
All selected studies were retrospective cohort studies except the study by Lafata et al. 29 The sample size varied across the studies between 52 30 and 707, 31 and 17 of the studies recruited more than 100 patients. The mean age of the patients was ≥50 years in 10 studies. Among the selected studies, 15 considered all disease stages, and 25 studies included T and N grade of the tumour. Chemoradiotherapy was the most frequently mentioned treatment modality (n = 26), often combined with other treatment strategies such as radiotherapy (n = 20), surgery (n = 6) and biotherapy. The Supporting Information provides further details of sample design and study characteristics (Table S4).

| Segmentation methods
PET-CT was the preferred imaging modality in 30 studies, and in one study, PET images alone were considered. 29 [...] [38][39][40] Other segmentation methods were 42% SUVmax, 30% SUVmax, SUV >1.5 times liver SUVmean, 50% iso-contour of SUVpeak, gradient-based auto-segmentation, Nestle's adaptive thresholding, 50% of SUVpeak and graph-based methods. One study did not report the employed segmentation method. 41 Tables 1 and 2 present a summary of the segmentation methods used by the included studies.
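The threshold-based methods listed above share one mechanism: a voxel belongs to the tumour volume if its SUV meets a cut-off. The following minimal sketch assumes a single 2D slice of SUV values stored as nested lists and already cropped to a region around the lesion; the function names are hypothetical, and real pipelines operate on 3D volumes.

```python
def relative_threshold_mask(suv, fraction=0.40):
    """Mask of voxels at or above a fraction of SUVmax (e.g. 40% SUVmax)."""
    suv_max = max(max(row) for row in suv)
    cutoff = fraction * suv_max
    return [[value >= cutoff for value in row] for row in suv]


def absolute_threshold_mask(suv, cutoff=2.5):
    """Mask of voxels at or above a fixed SUV cut-off (e.g. SUV 2.5)."""
    return [[value >= cutoff for value in row] for row in suv]


def segmented_volume(mask, voxel_volume_ml):
    """Volume of the segmented region: voxel count times voxel volume."""
    return sum(sum(1 for inside in row if inside) for row in mask) * voxel_volume_ml
```

Because the relative cut-off scales with SUVmax, the same lesion can yield different contours under relative and absolute thresholds, which is one reason the segmentation method matters for downstream radiomic features.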

| Types of features
In addition to demographic, pathological and clinical variables, a few studies considered other variables like smoking status, alcohol consumption, treatment/dose, family history of cancer, and body mass index. For radiomic features, 19 studies reported histogram-based features and 13 discussed shape-based features. Grey level co-occurrence matrix (GLCM), grey level run length matrix (GLRLM), grey level size zone matrix/grey level zone length matrix (GLSZM/GLZLM) and neighbourhood grey tone difference matrix (NGTDM) features are essential second-order features. The number of features considered for model development varied between 14 and 6294, with 15 studies incorporating more than 50 features.
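To make the notion of a second-order feature concrete, the sketch below builds a GLCM for one pixel offset on a small image that has already been discretised into integer grey levels, and derives the entropy feature from it. It is a simplified illustration under stated assumptions (2D image, single offset, symmetric counting, base-2 logarithm); the function names are hypothetical and dedicated radiomics software aggregates over several directions in 3D.

```python
import math


def glcm(image, levels, offset=(0, 1), symmetric=True):
    """Normalised grey level co-occurrence matrix for one pixel offset.

    `image` holds integer grey levels in the range 0..levels-1.
    """
    d_row, d_col = offset
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # count the pair in both directions
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


def glcm_entropy(p):
    """Entropy of the co-occurrence distribution (in bits)."""
    return -sum(v * math.log2(v) for row in p for v in row if v > 0)
```

A perfectly homogeneous region concentrates all co-occurrences in one cell and gives zero entropy, whereas heterogeneous uptake spreads the counts and raises it.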

| Feature engineering, feature selection and dimensionality reduction
We noted that 29 studies did not report the handling of outliers and missing values. Some studies identified skewness in the data and addressed it by logarithmic transformation 42 or by incorporating median values. 42,43 The min-max scaler was employed by one study. 35 Only a single study 44 [...] elimination, least absolute shrinkage and selection operator (LASSO)-based embedded methods, 31,33,34,41 and ridge regularisation. 3,45 Principal component analysis 46 and factor analysis 3 were two common dimension reduction methods reported. Lafata et al. 29 implemented an unsupervised clustering algorithm for feature selection. A few studies 43,44,49 discussed hyperparameter optimisation using cross-validation. Three studies 35,36,47 implemented the synthetic minority oversampling technique (SMOTE), while two others employed a standard oversampling method 36,42 to deal with class imbalances. Almost half of the studies did not discuss the resampling method used. K-fold cross-validation and bootstrap resampling were the popular techniques utilised in the remaining studies.
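Two of the preprocessing steps mentioned above, min-max scaling and standard (random) oversampling, can be sketched as follows. This is an illustrative sketch, not the procedure of any included study: the function names and the fixed seed are assumptions, and SMOTE differs in that it synthesises new minority samples by interpolation rather than duplicating existing ones.

```python
import random


def min_max_scale(values):
    """Rescale a feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

In a validation setting, both steps are normally fitted on the training folds only, so that scaling limits and duplicated samples do not leak into the data used to estimate performance.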

| Prognostic model with a time-to-event outcome
The Cox proportional hazards (CoxPH) model was the most commonly used prognostic model for the time-to-event (survival) data, and the C-statistic was the preferred performance metric (Table 2). The CoxPH models reported the best performance for recurrence, PFS and OS with C indices of 0.78, 32 0.76 33 and 0.83, 32 respectively. Models were externally validated by five studies. 3,41,42,46,50 For OS prognosis, the random survival forest model 42 exhibited a performance similar to that of the CoxPH model (C-index = 0.76). The random survival forest model documented the best performance for DM prognosis with a C-index of 0.88. 42

| Important radiomic and non-radiomic features
Table 1 highlights the important features associated with the prediction model. T stage 42,49 and tumour volume 35,42 were significant non-radiomic predictors for OS. However, except GLCM correlation, 35 no unique radiomic feature was consistently identified across the studies for OS prediction. For DM, non-radiomic features such as age, N stage, T stage, tumour volume, Karnofsky Performance Status (KPS) and head and neck (H&N) type were important. 42,49 KPS, shape-based compactness, age, H&N type, N stage and T stage were critical non-radiomic features for recurrence. 42,43,49 Second-order features like NGTDM Strength, GLSZM GLN, GLCM Entropy, GLSZM LZLGE and GLGLM SGE (grey level gap length matrix) features were reported as the key radiomic features. 42,43 The features predictive of PFS were MTV, SUVmin and the radiomic features GLSZM SZLGE and histogram-kurtosis. 44
Table 2 presents significant features reported for the prognostic models. For the prediction of OS, important predictors were age, 33,42,53 T stage, 42,49 primary tumour site, 35,42 EBV DNA, 30,53 and HPV status. 40,42 Other crucial features included PET quantitative features like MTV, 2,34,35 SUV, 3,37 TLG, 30 GLCM-based features 2,35 and shape-based features. 3 For PFS, first-order feature uniformity 39,40 and age 40,52 were significant features. For recurrence prediction, age, 42,50 tumour volume 34,54 and second-order radiomic features were predominant features. For DSS, NGLCM uniformity was a key feature, 39,40 while for DM, clinical variables and N stage were crucial for prognostic models. 42,49

| Risk of bias in the studies
We assessed the quality of the included studies using PROBAST. The assessment of risk of bias (ROB) and applicability is presented in Figure 2, and additional details are provided in Supporting Information (Table S6; Figure S1). The overall ROB was low or unclear in 10 studies and high in 21 studies. Within ROB, high bias was observed in the 'analysis' domain in 25% of the studies and low bias in the 'participant' domain (93.5%). In terms of overall applicability, there was low concern in 8 studies, unclear concern in 22 and high concern in 1 study. The responses to signalling questions in PROBAST are presented in Supporting Information (Table S6).

| Meta-analysis
We conducted a meta-analysis of performance metrics of both predictive and prognostic models. For predictive models, there was no evidence of heterogeneity between models (Q = 10.97 with 11 degrees of freedom, p = 0.446). Figure 3A presents the estimates of the logarithm of the diagnostic odds ratio and the summary estimate, with the corresponding 95% confidence interval (CI). The estimated log DOR ranged from 0.78 to 4.33, with the pooled estimate being 1.91 (95% CI: 1.49, 2.32). The pooled estimate was equivalent to an estimated DOR of 6.75 (95% CI: 4.45, 10.23), suggesting that PET feature-based models showed good predictive performance. We also conducted separate meta-analyses of studies that used manual and SUV-based segmentation methods. The pooled estimate of DOR of studies that employed the manual method (8.36; 95% CI: 5.47, 12.78) was higher than that of studies that used the threshold-based segmentation method (4.89; 95% CI: 1.79, 13.38) (Figure S2).
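The relation between the log DOR and the DOR reported above is a simple exponential back-transformation, which the sketch below makes explicit. The helper names are hypothetical, and the 0.5 continuity correction for empty cells is a common convention assumed here, not something stated in the text.

```python
import math


def dor_from_counts(tp, fp, fn, tn, correction=0.5):
    """Diagnostic odds ratio from a 2x2 table: (TP * TN) / (FP * FN).

    Assumption: if any cell is zero, `correction` is added to all cells.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    return (tp * tn) / (fp * fn)


def dor_from_sens_spec(sensitivity, specificity):
    """Equivalent form: odds of a positive test in cases divided by the
    odds of a positive test in non-cases."""
    return (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity)


def back_transform(log_estimate, log_lower, log_upper):
    """Convert a pooled log DOR and its CI limits to the DOR scale."""
    return math.exp(log_estimate), math.exp(log_lower), math.exp(log_upper)


# Pooled log DOR and 95% CI as reported in the text.
dor, lower, upper = back_transform(1.91, 1.49, 2.32)
```

Exponentiating the rounded log values gives approximately 6.75, 4.44 and 10.18; the small differences from the reported limits (4.45, 10.23) are consistent with the log-scale values having been rounded to two decimals before reporting.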
The meta-analysis of prognostic models did not suggest evidence of heterogeneity between models (Q = 24.47 with 16 degrees of freedom, p = 0.080). Figure 3B presents the estimated and pooled C-statistic and the 95% CI. The estimated C-statistic ranged from 0.60 to 0.88, with the pooled estimate being 0.71 (95% CI: 0.67, 0.74), suggesting that the overall performance metric of PET-based prognostic models was reasonable. No single outcome variable exhibited consistently higher (or lower) performance metrics for either prediction or prognostic models among all studies. The estimate of the pooled C-statistic was smaller in studies that employed the manual segmentation method (0.68; 95% CI: 0.63, 0.71) compared to studies that used the threshold-based segmentation method (0.75; 95% CI: 0.70, 0.78) (Figure S3).

| DISCUSSION
Over the past few years, there has been an increase in interest in exploring the relevance of PET-based radiomics in HNSCC outcome management. PET-based radiomics enable the quantitative evaluation of tumour heterogeneity, thereby facilitating personalised treatment approaches. Various predictive and prognostic models have been employed to predict adverse outcomes in HNSCC. This systematic review aimed to evaluate the role of pretreatment PET radiomics in predicting adverse outcomes in HNSCC patients.
FIGURE 2 Quality assessment using PROBAST for (A) the overall risk of bias at participants, predictors, outcome and analysis levels and the overall pooled data; (B) the overall applicability of the included studies at participants, predictors and outcome levels and the overall pooled data. 55
We systematically reviewed 31 studies to assess the current status of the modelling framework for predicting outcomes in HNSCC patients. We identified important segmentation methods and critical radiomic features predictive of outcome and evaluated the predictive and prognostic performance of the reported models using meta-analysis.
Manual segmentation was the preferred segmentation method, followed by threshold-based methods (40% SUVmax and SUV 2.5). The accuracy of manual segmentation is reported to be high, and it is a widely accepted standard if done by an expert radiologist. It is, however, often time-consuming and operator-dependent. 56 Although determining the optimal threshold value is challenging, the recommended threshold values are between 41% and 50% of SUVmax. The 2.5 absolute SUV method is software- and observer-independent, and easy to use. 57 As automatic segmentation is an active field of research, automatic segmentation methods are recommended in future studies. 58
We observed that feature engineering techniques, class imbalance adjustment techniques and hyperparameter tuning were minimally explored in the included studies. [61] Based on externally validated models, the ensemble logistic regression for OS and recurrence prediction and the random forest classifier for DM produced the best performance metrics among prediction models. [64][65] For time-to-event data, survival analysis using the CoxPH model with the C-index as a performance metric was widely used and is recommended by several researchers. 66 The proportional hazards assumption of the Cox survival model is crucial; however, only Cheng et al. 53 confirmed checking the assumption. If the assumption is not met, modelling approaches should include an appropriate stratified analysis or extend the model by incorporating time-dependent predictors. The random survival forest model has limited assumptions with a wide range of applications. 67
Four studies reported external validation of fitted predictive models, and five studies reported external validation of prognostic models, consistent with the recommendation by Nikolas et al. 58 An exhaustive assessment of model performance metrics was lacking in most studies. Performance should be assessed in external validation for discrimination (C-statistic, AUC), calibration (calibration in the large, calibration slope) and overall performance (Brier score, scaled Brier score). The validation of the developed model in a new patient set structurally different from the training cohort is necessary for confirming the developed model's generalisability and reproducibility, and for wider implementation in clinical practice. 68
Our results suggest that the overall ROB was high in more than 60% of studies and that applicability was of high concern in a single study. The high ROB is primarily due to high bias in the following areas: handling of missing data, appropriate use of prespecified or standard outcome definitions, the number of participants, accounting for the complexities in the data, evaluation of appropriate model performance measures and lack of external validation.
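The overall-performance and calibration measures recommended above can be sketched for a binary outcome as follows. The function names are hypothetical. Calibration in the large is shown in its crude form, as observed event rate minus mean predicted risk; the formal version is the intercept of a logistic recalibration model with the slope fixed at one, and the calibration slope requires fitting such a model, so neither is implemented here.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def scaled_brier_score(y_true, probs):
    """1 - Brier / Brier_ref, where the reference model predicts the
    observed prevalence for everyone (1 is perfect, 0 is uninformative)."""
    prevalence = sum(y_true) / len(y_true)
    reference = prevalence * (1 - prevalence)
    return 1 - brier_score(y_true, probs) / reference


def calibration_in_the_large(y_true, probs):
    """Crude version: observed event rate minus mean predicted risk.
    Positive values indicate that risk is under-predicted on average."""
    return sum(y_true) / len(y_true) - sum(probs) / len(probs)
```

A model can rank patients well (high AUC or C-statistic) and still be poorly calibrated, which is why discrimination alone is not sufficient evidence for clinical use.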
The meta-analysis of performance metrics of both predictive and prognostic models demonstrated reasonable performance accuracy. It is important to emphasise that most of these models were not externally validated; therefore, the performance metrics of some models could be overly optimistic. However, considering the direction of these metrics and associated uncertainties, predictive and prognostic models incorporating PET features and other clinical attributes illustrated promising opportunities for further development and refinement of these models toward clinical application.

| Limitations and recommendations
The current systematic review has some limitations. Most studies are retrospective cohorts with varying sample sizes. These differences limit comparison in terms of predictive features and the robustness of the model. Key comparisons of model development and validation are also limited as the majority of the studies did not report detailed methodologies. Although PET-based models demonstrated satisfactory performance, the literature suggests that combining PET and CT-based features might improve model performance in head and neck cancer prognosis. 41,42,52 The review and evaluation of models incorporating CT radiomics were outside the scope of the current systematic review.
We recommend more prospective studies with larger sample sizes focussing on different imaging modalities, including PET and CT-based radiomics, and the underlying mechanism of tumour heterogeneity. All studies should report outliers, acknowledge the handling of missing values and provide a detailed account of the implemented pipeline (such as feature selection, dimensionality reduction, techniques to address class imbalance, hyperparameter tuning, model development, and internal and external validation) for enhanced reproducibility. Studies should incorporate appropriate performance metrics adhering to prediction and prognostic model reporting tools like CONSORT, STROBE and STARD. 69
FIGURE 3 (A) Forest plot of the summary estimate of logarithmic DOR and the corresponding 95% confidence interval (CI) of prediction models (performance metrics were based on external validation except for Ghosh et al. 35 and Peng et al., 43 where the performance metrics were based on internal validation). (B) Forest plot of pooled C-statistic and the corresponding 95% CI of prognostic models (performance metrics were based on internal validation except for Bogowicz et al., 46 Lv et al., 50 Martens et al. 3 and Vallières et al., 42 where the performance metrics were based on external validation).
TABLE 2 Summary of included studies that reported prognostic models of time-to-event outcomes.

| CONCLUSION
The systematic review explored the current status of existing prediction and prognostic models using clinical variables and PET radiomics in managing HNSCC. Both prediction and prognostic models demonstrated reliable diagnostic accuracy for detecting adverse outcomes, suggesting the prospect of using PET radiomics in clinical settings for diagnosis, prognosis and management of HNSCC patients. Future studies should emphasise the quality of reporting, external model validation, generalisability to real clinical scenarios and enhanced reproducibility of results. 19