Artificial intelligence for predicting acute appendicitis: a systematic review

Paediatric appendicitis may be challenging to diagnose, and outcomes difficult to predict. While diagnostic and prognostic scores exist, artificial intelligence (AI) may be able to assist with these tasks.


Introduction
Numerous clinical scoring systems have been developed to assist with the diagnosis and prognostication of paediatric appendicitis. 4,5 None have reached universal clinical acceptability and applicability.
Artificial Intelligence (AI) is defined as the ability of machines, in particular computer systems, to simulate human intelligence and complete tasks without human intervention. 6 Machine learning is a subset of AI that evolved in the late 1950s from the study of pattern recognition and computational learning theory. 7 Machine learning explores the construction and study of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions. 8 The use of machine learning techniques for the creation of prediction models is increasingly recognized as a valid application of AI in healthcare. 9 This systematic review focussed on all studies using AI in the diagnosis and prognostication of paediatric appendicitis, with the aims of: (1) determining the types of clinical parameters that have been employed, (2) examining the algorithms that have been used and (3) ascertaining the level of performance that has been achieved (both in derivation/validation studies and, if available, in implementation studies).

Study design, search strategy and selection criteria
This study was developed, conducted and reported as described in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (see checklist in Supplementary Information 1). 10 The systematic review was registered prospectively with the PROSPERO registry (reference number CRD42022376390). EMBASE, PubMed and the Cochrane Library were searched from inception to November 2022 for primary studies relating to the application of AI to the diagnosis or prognostication of paediatric appendicitis. Search strings are outlined in Supplementary Information 2, and included the following: (appendicitis OR appendectomy OR appendicectomy) AND (machine learning OR artificial intelligence OR predictive analytics). As an extension of the search strategy, the reference lists of articles that met the inclusion criteria were also reviewed for further eligible studies.
The application of inclusion and exclusion criteria to the results of the search was conducted in duplicate with a standardized form (AL, ES, ST and JN). The inclusion criteria applied were: (1) English-language; (2) primary peer-reviewed publication (abstracts, posters, reviews and editorials were excluded); (3) presented results separately for paediatric patients (< 18 years of age); (4) presented data on the application of AI to the diagnosis or prognostication of appendicitis specifically in this paediatric cohort; and (5) available in full-text. Studies that included both adult and paediatric populations were excluded if they did not present results specifically for the paediatric cohort. Titles and abstracts were initially screened, prior to full-text review. Disagreements regarding study eligibility were resolved through discussion or consultation with a third reviewer (SB).

Data extraction and analysis
Risk of bias analysis was undertaken in duplicate using the Prediction model Risk Of Bias Assessment Tool (PROBAST) criteria. Data extraction was also performed using a standardized form. Data that were extracted included: country of study, study design, the number of included patients, demographics of included patients, nature of appendicitis diagnosis, time from appendicitis symptom onset, severity of appendicitis, parameters used in model development (including demographic, clinical, laboratory and imaging features), algorithms used in model development (including deep learning and non-deep learning AI algorithms), and model performance (including whether comparator methods were evaluated).
The application of the PROBAST tool showed that the risk of bias of the included studies was low, with a few exceptions (see Supplementary Information 3). Aspects of the methodology that raised concern for bias primarily related to a lack of clarification regarding the method by which the diagnosis of appendicitis was made. This methodological issue at times resulted in a lack of clarity as to whether predictors were excluded from the outcome definition. The majority of studies clearly defined the study population and the relevant inclusion and exclusion criteria.
Table 1 illustrates the aims, model parameters and artificial intelligence models utilized by the 10 articles included in this systematic review. 7/10 studies shared a similar main aim of improving the diagnosis of appendicitis. 3/10 studies additionally aimed to better differentiate between complicated and uncomplicated appendicitis. Other aims included the utilization of triage data to predict the need for further diagnostic testing and the use of artificial intelligence to improve appendix visualization rates.
None of the included studies reported a main outcome of improving patient outcomes, recovery or length of hospital stay. There were also no 'implementation' studies identified, that is, studies that evaluated patient or system outcomes following the deployment of an AI model for the diagnosis or prognostication of paediatric appendicitis in a clinical setting. Methods used in the included studies to confirm the diagnosis of appendicitis included histopathological confirmation, clinical evaluation, and the use of clinical risk evaluation scores (for patients diagnosed with acute appendicitis and managed non-operatively). 17,19 Conversely, other studies employed ICD10 codes, as recorded in medical/administrative records, as the ground truth by which the diagnosis of appendicitis was established, 20 which may be inaccurate. There were also studies that provided few, if any, details regarding the means of establishing the diagnosis. 13,14 The most commonly used parameters in AI models for the diagnosis and classification of severity of paediatric appendicitis included varying combinations of demographic, history, physical examination, laboratory and ultrasound parameters. Some studies also incorporated natural language processing (NLP) to extract parameters from unstructured free text in emergency department triage notes. 18,20 Other studies used AI to create models based on more restricted datasets, both for indirectly improving the diagnosis of appendicitis and for differentiating between simple and complicated appendicitis. Hayashi et al. used abdominal ultrasound images as input data to develop algorithms that help to identify the appendix on ultrasound scans. 14 Reismann et al. explored the application of AI to gene expression data in order to predict subtypes of appendicitis (namely phlegmonous or gangrenous necrotizing appendicitis). 16 The clinical rationale for this study was based on emerging evidence that 'phlegmonous' and 'gangrenous' appendicitis have different immunological bases and underlying risks of complications.
3/10 studies created prediction models for the differentiation between complicated (gangrenous/perforated) and uncomplicated (phlegmonous) appendicitis. All of these studies used histopathological reporting of the surgical specimens to confirm the classification of complicated and uncomplicated appendicitis. 12,16,17 They also used very similar histopathological features for defining complicated appendicitis. Aydin et al. further classified 'complicated' appendicitis into sub-categories (with and without abscess) while testing different prediction models. 12

AI algorithms
The different algorithms employed for the creation of prediction models (Table 1) are among the common machine learning (ML) algorithms that fall under the categories of 'supervised', 'unsupervised', 'semi-supervised' and 'reinforcement' learning. 21 Most (6/10) studies created multiple prediction models using different algorithms. Logistic regression and linear models were frequently employed. 12,15,16,18,20 Deep learning methodologies, including artificial neural networks, were used in 4 studies. 11,13,14,18 Su et al. also used a natural language processing (NLP) algorithm for extracting features from unstructured text and a 'K nearest neighbour' approach for imputation of missing data prior to creating prediction models using 'binary logistic regression' and 'random forest' algorithms. 20
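To make the modelling workflow concrete, the sketch below is a minimal, assumption-laden illustration (on synthetic data, not the pipeline of any included study) of the approach several studies describe: impute missing tabular values with a k-nearest-neighbours imputer, then fit logistic regression and random forest classifiers and compare them by AUROC.

```python
# Illustrative sketch only: synthetic stand-in for tabular clinical data,
# not any included study's actual dataset or pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# 500 'patients', 6 numeric features, with ~10% of values missing at random.
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X[rng.random(size=X.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Two candidate models, each preceded by KNN imputation of missing values.
models = {
    "logistic regression": make_pipeline(
        KNNImputer(n_neighbors=5), LogisticRegression(max_iter=1000)
    ),
    "random forest": make_pipeline(
        KNNImputer(n_neighbors=5), RandomForestClassifier(random_state=0)
    ),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Wrapping the imputer and classifier in one pipeline ensures the imputation is learned on the training fold only, avoiding leakage into the held-out evaluation.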

Model performance
All studies included in this review, with the exception of one, reported the performance of their prediction models by calculating the area under the receiver operating characteristic (ROC) curve (AUC). Other standard comparison measures (sensitivity, specificity and accuracy) were used to compare different models in most studies, with some reporting additional measures (PPV, TPR and FPR) (Table 2).
Only one study reported results based on calculation of the area under the precision-recall curve (AUC-PR) for models created to differentiate between uncomplicated and complicated appendicitis. AUC-PR is useful for assessing classification performance with 'unbalanced binary responses' 22 and is therefore more suitable for prediction models in which the data are heavily skewed towards one of two possible outcomes (e.g. simple versus complicated appendicitis). While some studies reported sensitivity and specificity, only one study reported PPV, TPR and FPR. No study reported positive likelihood ratios, negative likelihood ratios or the number needed to misdiagnose (NNM). 23
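The distinction between the two measures can be shown with a small pure-Python sketch (toy data, not taken from any included study): AUROC is the probability that a randomly chosen positive case outranks a randomly chosen negative one, while average precision approximates AUC-PR and penalizes false positives among the rare class more heavily.

```python
def roc_auc(labels, scores):
    """AUROC as the rank statistic: fraction of (positive, negative) pairs
    where the positive case receives the higher score (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Approximate AUC-PR: mean of precision evaluated at each positive's rank."""
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    tp, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank
    return ap / sum(labels)

# Imbalanced toy set: 2 'complicated' cases among 8 patients.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(roc_auc(labels, scores))            # 11/12 ≈ 0.917
print(average_precision(labels, scores))  # 5/6 ≈ 0.833
```

A single mis-ranked negative costs this classifier little AUROC but noticeably more AUC-PR, which is why AUC-PR is preferred when one outcome (e.g. complicated appendicitis) is rare.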

Discussion
AI is an umbrella term that can be considered to comprise different subsets, including techniques such as machine learning (ML), deep learning, neural networks, natural language processing, robotics and genetic algorithms. Different AI techniques can be used in clinical research to: (1) automate data collection, (2) monitor data quality, (3) adjudicate outcome events, (4) analyse large, multidimensional or sparse datasets, and (5) discover novel biological features. 24 There are numerous non-AI scoring systems that have previously been developed to predict acute appendicitis, such as the Alvarado score, the Tzanakis score and the paediatric appendicitis score (PAS), with varying sensitivities (71.9%-100%) and specificities (66.6%-100%). 19 While such scores are easy to calculate, upwards of 70% of patients may be assigned an intermediate risk score, which neither rules in nor rules out appendicitis, resulting in the need for further diagnostic imaging.
The more recently developed paediatric Appendicitis Risk Calculator (pARC) has been shown by Cotton et al. to perform better than the PAS in classifying patients into clinically actionable groups. 25 However, even in this prospective study, 42% of all patients were still classified as 'intermediate risk' (15%-84% chance of having appendicitis). 25 Key differences between conventional statistical methods and machine learning as applied to clinical research have been outlined previously. 26 Conventional statistical modelling is a 'top-down' approach that relies on 'a priori' knowledge (or assumptions) about the relationship between input and output parameters based on previous evidence. Machine learning, on the other hand, is a 'bottom-up' approach that begins with analysis of raw data using different ML algorithms and no 'pre-hoc' biases, to develop models far more complex than is possible with conventional statistical techniques. This makes machine learning an effective tool for risk assessment and for predicting outcomes of interest.
The other key advantage of using ML is the ability to create 'self-learning' models that can be deployed on 'learning platforms' configured to continually improve model performance as more data are acquired. 27 This systematic review identified only 10 eligible studies that have developed AI models to predict the diagnosis of paediatric appendicitis using varying combinations of demographics, clinical parameters, laboratory data and imaging findings. All of these studies have been published within the last 3 years, in keeping with the exponential rise in the total number of AI-related clinical research publications since 2018. 28 Despite variations in the inputted model parameters, almost all of the included studies reported high levels of model performance (as estimated by AUROC > 80%). However, very few studies reported other important measures of predictive model performance (Table 2). When considering the clinical application of prediction models, other performance measures such as accuracy, F1 score, recall and precision, as well as the number needed to diagnose and the number needed to misdiagnose, should be presented in addition to an AUROC. 29 Additionally, only one study addressed the issue of 'model explainability' by reporting on individual feature importance in the prediction model using Shapley Additive Explanations (SHAP) value plots. 30 It is therefore difficult to assess the clinical usability of the remaining models given the lack of further statistical analysis.
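All of the performance measures mentioned above follow directly from a 2x2 confusion matrix. The sketch below (a generic illustration with hypothetical counts, not data from any included study) computes them in plain Python using the standard definitions, including LR+ = sensitivity/(1 − specificity), LR− = (1 − sensitivity)/specificity, number needed to diagnose (NND) = 1/(sensitivity + specificity − 1), and number needed to misdiagnose (NNM) = 1/(1 − accuracy).

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive standard diagnostic performance measures from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)                  # sensitivity (recall, TPR)
    spec = tn / (tn + fp)                  # specificity
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    acc = (tp + tn) / (tp + fp + fn + tn)  # accuracy
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PPV": ppv,
        "accuracy": acc,
        "F1": 2 * ppv * sens / (ppv + sens),
        "LR+": sens / (1 - spec),          # positive likelihood ratio
        "LR-": (1 - sens) / spec,          # negative likelihood ratio
        "NND": 1 / (sens + spec - 1),      # number needed to diagnose
        "NNM": 1 / (1 - acc),              # number needed to misdiagnose
    }

# Hypothetical example: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
m = diagnostic_metrics(80, 10, 20, 90)
```

Reporting this full set alongside the AUROC, as argued above, gives readers the prevalence-sensitive measures (PPV, NNM) that an AUROC alone cannot convey.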
There are several limitations of this review that should be acknowledged. These include the exclusion of articles published in languages other than English, the exclusion of articles due to full-text unavailability, and the potential for publication bias. Additionally, because of the inclusion criteria cut-off at 18 years of age, studies relating to young adults that would have value in the diagnosis of older paediatric patients may have been excluded.
Further research in this area should include implementation studies of models aiming to assist with diagnosis and the development of further models to aid in paediatric appendicitis prognostication.

Table 1
Study characteristics

Table 1 Continued
Stiel et al., Marcinkevics et al., and Akgül et al. all employed a combination of clinical, laboratory and imaging factors in AI algorithms for the diagnosis of paediatric appendicitis. 11,15,19 In contrast, Grigull et al. and Aydin et al. utilized only demographic and laboratory data, citing that clinical and radiological data may vary between practitioners. 12,13 Singh et al. used emergency department triage data to predict which investigations patients visiting a paediatric emergency department would require, including the requirement for abdominal ultrasound in the setting of suspected appendicitis.

Table 2
Comparative table showing the best