Artificial intelligence in histopathological image analysis of central nervous system tumours: A systematic review

The convergence of digital pathology and artificial intelligence could assist histopathology image analysis by providing tools for rapid, automated morphological analysis. This systematic review explores the use of artificial intelligence for histopathological image analysis of digitised central nervous system (CNS) tumour slides. Comprehensive searches were conducted across EMBASE, Medline and the Cochrane Library up to June 2023 using relevant keywords. Sixty‐eight suitable studies were identified and qualitatively analysed. The risk of bias was evaluated using the Prediction model Risk of Bias Assessment Tool (PROBAST) criteria. All the studies were retrospective and preclinical. Gliomas were the most frequently analysed tumour type. The majority of studies used convolutional neural networks or support vector machines, and the most common goal of the model was for tumour classification and/or grading from haematoxylin and eosin‐stained slides. The majority of studies were conducted when legacy World Health Organisation (WHO) classifications were in place, which at the time relied predominantly on histological (morphological) features but have since been superseded by molecular advances. Overall, there was a high risk of bias in all studies analysed. Persistent issues included inadequate transparency in reporting the number of patients and/or images within the model development and testing cohorts, absence of external validation, and insufficient recognition of batch effects in multi‐institutional datasets. Based on these findings, we outline practical recommendations for future work including a framework for clinical implementation, in particular, better informing the artificial intelligence community of the needs of the neuropathologist.

MPJ and ZQ shared first co-authorship. SB and HJM shared senior co-authorship.

INTRODUCTION
Benign and malignant tumours of the central nervous system (CNS) encompass over 100 distinct entities. CNS tumours (both malignant and non-malignant) are the most common tumour site in children (0-15 years), and the second most common tumour site in adolescents and young adults (15-39 years). 1 The diagnostic pathway for CNS tumours involves multidisciplinary input, with the integration of clinical, demographic, imaging and pathological parameters. Pathological assessment, in particular, is the gold standard for precise, evidence-based classification of CNS tumours, with the 2021 World Health Organisation (WHO) Classification of Tumours of the CNS acting as the current reference for taxonomic classification. 2
The emergence of artificial intelligence (AI) has the potential to provide tools for automated, rapid analysis of medical data, improving diagnostic workflow efficiency. AI refers to the use of machines (computers) to solve complex tasks that typically require human cognition and analysis. Within the diagnostic pathway for CNS tumours, the application of AI to radiological image analysis has been reviewed, with demonstrable benefits in predicting tumour grade and molecular profile. 3 Similarly, DNA methylation profiling by AI-based classifiers (machine learning algorithms) has become a well-established tool for classification based on epigenetic parameters. 2,4 However, the potential benefits of AI in interpreting histopathological features on slides of CNS tumour specimens remain unclear. ][7][8][9][10] Advances have also been made in histopathological tasks where interobserver variation exists, such as Gleason grading of prostate cancer, and in time-consuming tasks, such as identifying and counting mitotic figures in tumour cells. 6,11 Indeed, some of these capabilities are available as FDA-approved products (e.g. Paige AI for prostate cancer detection). 12
Unique challenges, however, exist in CNS tumour classification by slide image analysis algorithms, namely the large number of tumour subtypes and the frequent overlap of morphological phenotypes across diagnostic entities, particularly in many low-grade glial and glioneuronal tumour types. It remains unclear whether these unique challenges have been accounted for in the existing literature.
A systematic analysis of AI-based histopathological image analysis of CNS tumours is lacking despite a growing body of relevant literature. The objective of this study is to survey the scope of AI employed in histopathological slide image analysis of CNS tumours, with the goal of identifying future directions in this field.

METHODS
The review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines and prospectively registered with the International Prospective Register of Systematic Reviews (PROSPERO) (registration ID: CRD42023434059). 13 We systematically interrogated the EMBASE, Medline and Cochrane Library databases up to June 2023 to identify studies utilising AI in the histopathological image analysis of CNS tumour tissue. A combination of MeSH terms and relevant keywords was used in the search strategy, including AI, machine learning, deep learning, brain neoplasms, pathology and computer-assisted image processing (Table S1). We limited the scope of the review to studies focussing on conventional, clinically well-established histopathological image analysis (i.e. haematoxylin and eosin (H&E) and/or immunohistochemically stained tissue), excluding studies exploring experimental (currently unvalidated) techniques such as Raman spectroscopy. We excluded studies not published as full-text articles in English.
Full-text articles meeting the inclusion criteria were independently assessed by two investigators (MPJ and ZQ). Information extracted from each study included the following: publication year; study stage; purpose of the AI algorithm; tumour type studied; use of H&E staining and/or immunohistochemical markers; characteristics and source of the training and testing datasets; data pre-processing techniques; details of internal and external validation; feature extraction and dimensionality reduction techniques; code availability; summary of the AI algorithm and model architecture; interpretability considerations; and AI model outcome measures. Risk of bias assessment was performed by two investigators (MPJ and ZQ) using the Prediction model Risk of Bias Assessment Tool (PROBAST). 14 A narrative synthesis was conducted to provide a comprehensive summary of the study characteristics, AI techniques employed, and key findings.

Key points
• This review explores the use of AI for image analysis of central nervous system tumour slides.
• The field is at an early stage and poorly aligned with current diagnostic challenges.
• Practical recommendations for future work are outlined.

RESULTS
The literature search identified 68 studies meeting the eligibility criteria for inclusion (Figure 1). All studies were retrospective and preclinical (Tables 1 and S2, and Figure 2). Studies were published between 1995 and 2023, half of which were published from 2020 onwards (Table 1 and Figure 2).

CNS tumour types
Gliomas were the most frequently analysed tumour type (52 studies) (Table 1 and Figure 2). Although glioblastoma was analysed in 33 studies, only eight of the 28 studies published after 2016 specified isocitrate dehydrogenase (IDH) gene mutation status (as per recommended classification systems). 261,62 Brain metastases from breast, lung or melanoma primaries were analysed in four studies. 39,43,58,60 Ependymomas (subtype not specified) were investigated in three studies. 34,43,73 CNS lymphoma was investigated in one study. 43 The exact CNS tumour type studied was unclear in one study. 65

FIGURE 1 Preferred reporting items for systematic reviews and meta-analysis (PRISMA) flow diagram outlining study selection process. The primary search strategy yielded 1072 results, of which 68 studies were suitable for inclusion in the systematic review.
TABLE 1 Overview of the included studies.

Dataset characteristics
One study utilised a mouse model of disseminated malignancy, and all other studies utilised human tissue. 60 The studies utilising human tissue covered adult and paediatric populations, ranging in size from 4 to 1185 patients and from 10 to 97,252 digitised images (Figure 2). All studies were retrospective and cross-sectional (i.e. samples were analysed at a single point in time rather than over several points in time, as in longitudinal analyses). The most commonly used dataset was derived from The Cancer Genome Atlas, used in model development for 31 studies and external validation for two studies (Table 1 and Figure 2). Across the studies using this dataset, the number of cases ranged from 52 to 1185, whereas the number of images varied from 200 to 3611. The Guwahati Neurological Research Centre was another recurrently used dataset, albeit constrained by smaller sample sizes, with a maximum of 204 images or 20 patients included. 26,27,31,51,52 Six studies did not report the source of their datasets. 19,46,58,59,63,65 One study used a dataset with a simulated population derived from published literature. 28 Only two studies conducted exploratory analyses to examine the impact of sample size on the predictive performance of the model, aiming to address the challenge of requiring extensive labelled data for model training.
Among them, only one study discussed methodologies for sample-size determination, employing inverse power law functions. 36

AI algorithm usage
AI algorithms can be classified into classical machine learning and deep learning. Classical machine learning algorithms tend to be computationally simpler and advantageous when dealing with structured data, such as tabular data. Deep learning algorithms are more computationally demanding and are better suited to analysing complex, unstructured data such as images and natural language. In Figure 3, we summarise key algorithm types used by included studies, and whether they fall under the classical machine learning or deep learning type. The most frequently employed classical machine learning algorithms were support vector machines, which identify the widest margin of separation between data points of different classes in high-dimensional space (Figure 3), featured in 21 studies. The most frequently employed deep learning algorithms were convolutional neural networks, which employ hierarchical operations to process data and identify important features in an image (Figure 3), featured in 30 studies. Classical machine learning algorithms dominated the landscape in earlier years, being the choice for 90% of studies published before 2013 (Table 1 and Figure 2). In contrast, deep learning algorithms were used in 67.2% of studies published after 2013.
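The included studies do not all describe their support vector machine implementations, so the following is only a minimal sketch of the idea, assuming two hypothetical, standardised morphometric features per image (for example, nuclear density and mean nuclear area) and class labels of -1 and +1. It trains a soft-margin linear SVM with the Pegasos stochastic sub-gradient method, a training procedure chosen here for brevity rather than one reported by any reviewed study; the function names and toy data are illustrative.

```python
import random

def train_linear_svm(X, y, lam=0.1, epochs=100, seed=0):
    """Soft-margin linear SVM trained with Pegasos (stochastic sub-gradient descent).

    X: list of feature vectors; y: labels in {-1, +1}.
    A constant 1.0 is appended to every vector so the bias is learned as the last
    weight (it is therefore regularised too, a common simplification).
    """
    rng = random.Random(seed)
    data = [(list(x) + [1.0], label) for x, label in zip(X, y)]
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, label in data:
            t += 1
            eta = 1.0 / (lam * t)  # Pegasos step size, decreasing over time
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            # Gradient step on the L2 penalty (lam / 2) * ||w||^2 shrinks the weights.
            w = [(1.0 - eta * lam) * wi for wi in w]
            # Hinge-loss sub-gradient: only points inside the margin move the boundary.
            if margin < 1.0:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    # Hypothetical standardised features: class -1 near (-1, -1), class +1 near (1, 1).
    X = [(-1.0, -0.8), (-0.7, -1.2), (-1.3, -0.9), (-0.9, -1.1),
         (1.1, 0.9), (0.8, 1.2), (1.2, 1.0), (0.9, 0.7)]
    y = [-1, -1, -1, -1, 1, 1, 1, 1]
    w = train_linear_svm(X, y)
    print([predict(w, x) for x in X])
```

Because the hinge loss only penalises points that fall inside the margin or on the wrong side of it, the final boundary is shaped by a small set of such points (the support vectors), which is the "best margin of separation" idea described above.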

Image analysis goal
The reviewed studies encompassed a range of image analysis goals (Figure 2). For each goal, we describe the performance metrics used.

GOAL 1: IMAGE GENERATION
Tissue image generation was the focus of two studies, aiming to develop tools for dataset augmentation and education. 80,81 Both studies adopted Turing tests (i.e. asking pathologists to assess whether the images were artificially generated or real) to show that distinguishing real from synthetic images was somewhat challenging (in both studies, just over half of the images were deemed 'real').

GOAL 2: MORPHOLOGY RECOGNITION
6][17][18][19][20][21][22][23][24][25] Microvascular characteristics such as vessel circularity and area were considered in one study; however, whether vessels were normal or pathological was not explicitly determined. 18 The area under the receiver operating characteristic curve (AUC) [18][22][23][24][25] Two studies did not report performance measures. 19,20 In one study, the AI model performed similarly to human observers, particularly in detecting microvascular proliferation (AUC 0.994), geographic necrosis (AUC 0.994) and palisading necrosis (AUC 0.964). 15 However, given the rapidity with which pathologists can screen slides for these features and the relatively short time it takes to diagnose common CNS tumours (such as glioblastoma, meningioma, and most instances of ependymoma, astrocytoma and oligodendroglioma), the time- and cost-benefit of implementing AI for this purpose is debatable.,24 However, only two of these studies conducted comparisons with human pathologists' opinions. 15,24 Features guiding the model were identified, including the observation that cells in IDH-mutant cases were larger and more circular than their wild-type counterparts; however, the clinical relevance of these features was not explored in the context of existing literature. 20
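Several of the morphology-recognition studies report the AUC. As a minimal sketch of what that number means, the function below computes it through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. This is equivalent to the normalised Mann-Whitney U statistic. The labels and scores are hypothetical tile-level outputs, for example for the presence of microvascular proliferation.

```python
def roc_auc(labels, scores):
    """AUC computed as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for positive, 0 for negative; scores: model outputs (higher = more positive).
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Hypothetical tiles: 1 = feature present according to the reference annotation.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.95, 0.80, 0.40, 0.55, 0.20, 0.10]
    print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of examples; it is written this way for readability, since it mirrors the definition directly.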

GOAL 3: IMMUNOHISTOCHEMISTRY DETECTION AND QUANTIFICATION
61,62 Among them, five studies quantified cellular proliferation hotspots using Ki67 immunohistochemistry and performed grading as per the WHO 2007 classification system. 35,58,59,61,62 In one study, AI was used to quantify cells immunohistochemically labelled for CD276, a putative glioblastoma stem cell marker. 57 The algorithm's intricacy demanded a labour-intensive training process, involving the manual labelling of 31,947 cells across eight WSIs. Subsequent external validation using an independent cohort yielded a reported accuracy of 97.7%; however, the cohort was small relative to the number of cells in the training process (12,211 CD276-stained cells only). As such, the clinical applicability (and general utility) of the model is highly questionable, given the extensive human labelling process required to capture sufficient variance in the data.
Model outputs were commonly compared with those of human pathologists or conventional image analysis software, and concordance was demonstrated using measures of correlation such as Spearman's rho. 58,59,61 Only one of these studies demonstrated that the AI model had less variability than manual annotations between pathologists for Ki67 quantification. 58 In this study, the algorithm was used to align Ki67-stained WSIs to H&E-stained sections, facilitating automated region of interest selection and reducing interobserver variability for Ki67 quantification. 58

GOAL 4: NUCLEUS SEGMENTATION
4][65] Sikpa and others applied nucleus detection to quantify metastatic breast cancer in the brain using an animal model with disseminated cancer spread, serving as an indicator of disease burden. 60 However, whether the results would translate to humans is unclear; the model used (representing hundreds of micrometastases in the mouse brain) is not representative of the typical human counterpart (a single large metastasis). In Nalisnik et al., an AI nucleus detection model was employed to quantitatively characterise glioma microvascular structures, such as hypertrophy and hyperplasia. 64 Increased hyperplasia was found to be associated with higher grades within each molecular subtype (IDH-wild-type astrocytoma, IDH-mutant astrocytoma and oligodendroglioma). A regression analysis model was trained using these phenotypes across 781 WSIs, achieving a concordance index of 0.76 and demonstrating some ability to rank patient survival based on these phenotypes. However, this is unsurprising, as these phenotypes are those chosen by the WHO classification as prognostically relevant; hence, the conclusions are somewhat circular. Generalisation to other datasets was not performed and would be necessary for clinical validation. Meanwhile, Xing et al. proposed a generalisable model of nucleus detection applicable across multiple staining and tissue preparation methods, in an attempt to address the problem of batch effects in multicentre datasets. 65 Model outputs for nucleus segmentation were generally in agreement with manual annotations or simpler computational techniques, as demonstrated through statistical analyses such as Pearson's correlations and false-positive area ratios. 60,63 Segmentation margins were examined in all four studies to assess interpretability.
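The concordance index reported by Nalisnik et al. summarises how well a predicted risk orders patients by their observed survival: 0.5 corresponds to random ordering and 1.0 to perfect ordering. A minimal sketch of Harrell's C-index follows, assuming right-censored data (an event flag of 0 means the patient was censored) and, for simplicity, skipping pairs with tied follow-up times. The data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up times; events: 1 if death observed, 0 if censored;
    risk_scores: higher score = predicted shorter survival.
    A pair (i, j) is comparable when patient i has the shorter time and an
    observed event; it is concordant when patient i also has the higher risk.
    Tied risk scores count as half-concordant; tied times are skipped here.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

if __name__ == "__main__":
    # Hypothetical cohort: months of follow-up, death indicator, model risk score.
    times = [2, 4, 6, 8]
    events = [1, 0, 1, 1]
    risk = [0.9, 0.8, 0.2, 0.5]
    print(harrell_c_index(times, events, risk))
```

Censored patients can still serve as the longer-surviving member of a pair, which is how the index uses partial follow-up information without assuming when those patients died.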

GOAL 5: TUMOUR CLASSIFICATION AND GRADING
Thirty-two studies focussed on tumour classification or grading directly from H&E-stained tissue sections. Eighteen of these studies focussed on grading gliomas, the majority of which aimed to distinguish glioblastoma from lower-grade counterparts. 32,36,40,42,44,45,48,49,54,55,82 Several studies did not specify the subtype of tumour classified (e.g. astrocytoma subtype unspecified, oligodendroglioma/astrocytoma subtype unspecified); thus, their inclusion criteria, and therefore their clinical utility, are questionable.
8][49][50]54,55 Three studies (all published in 2022 or 2023) adopted the latest WHO integrated classification for gliomas, incorporating new molecular markers. 38,39,43 43 Subsequently, molecular parameters were imputed to formulate an integrated diagnosis using a decision tree classification algorithm, a simple classical machine learning method (see Figure 3). Although this approach acknowledges the significance of both morphological characteristics and molecular features, it did not show discernible improvements compared with the established pathology pipeline.
Five studies subtyped paediatric medulloblastoma into classic, nodular, desmoplastic or large cell. 26,27,31,51,52 Two studies delineated anaplastic from non-anaplastic medulloblastoma. 29,53 However, given that molecular stratification of medulloblastoma is becoming increasingly important, the diagnostic value of such histological classification without integration of molecular parameters is debatable. 2 Nonetheless, anaplasia in medulloblastoma is still regarded as a high-risk feature, and although its significance is diminishing in certain molecular subtypes, such an algorithm would be helpful if clinically validated. Two studies focussed on subtyping meningiomas by tissue features into meningothelial, fibroblastic, transitional or psammomatous. 33,46 Although this may demonstrate the ability of image recognition algorithms to discern distinct features, the diagnostic value is again limited, as these subtypes are of lesser importance and have been superseded by molecular stratification algorithms. 84 Three studies performed a broad classification of CNS tumours, including astrocytoma, ependymoma and oligodendroglioma. 28,34,41 However, all of these classification models were based on morphological categories, with no clear demonstration of time or cost benefit relative to pathologist review and no comparison of accuracy against the final molecular diagnosis, leaving their additional clinical and prognostic utility unclear. This is particularly relevant to tumour types such as meningiomas, for which current classifications are primarily at the genomic and epigenomic level. 2
The most commonly utilised performance metrics were accuracy, sensitivity, specificity and F1 score (see Figure 3 for definitions of these performance metrics). Studies reported accuracy rates ranging from 85% to 100%; however, none conducted comparative analyses against human pathologist assessment (indeed, 85% would be considered poor performance relative to the accuracy required in clinical practice). Only eight studies investigated interpretability. 29,32,34,36,44,45,47,48 This included the use of representation spaces to illustrate morphological features learned during training, such as edges, nuclear stains and cellular orientations, and visualisations with limited apparent clinical utility. 29,32 Other studies generated probabilistic heatmaps to highlight the model's attention during the decision-making process, which included tumour cell clusters, suggesting the plausibility of the proposed models. 37,45
7][68][69] One study used nuclear morphology to predict the transcriptional subtype of glioblastoma: classical, proneural, neural or mesenchymal. 66 However, this classification has since been superseded by other systems in light of emerging evidence, including the importance of IDH status. Jungo et al. predicted the 1p/19q co-deletion status of IDH-mutant tumours, reporting an accuracy of 88.6%, arguably lower than that acceptable in clinical practice 39 and probably inferior to morphological examination by an experienced neuropathologist. Another study sought to predict mutational status in glioblastoma and achieved AUC values above 0.7 for four genes of interest (IDH1, ATRX, TP53 and RB1). Two studies assessed interpretability. 66,67 For example, human-recognisable features predictive of IDH mutational status were revealed using dimensionality reduction methods that make predictions on complex datasets understandable. These characteristics included oligodendroglial cytomorphology and the extent of pleomorphism. 67 However, during external validation, the model showed reduced performance (accuracy 0.809 vs 0.936 at internal testing), suggesting failure to generalise to independent datasets. The value of AI-based prediction of molecular status needs to be justified where relatively rapid, cost-efficient methods already exist (e.g. widely utilised immunohistochemical tests for IDH mutations).
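The classification studies above mostly report accuracy, sensitivity, specificity and F1 score. A minimal sketch of how these follow from a binary confusion matrix is given below, assuming labels of 1 (for example, glioblastoma) and 0 (lower grade); the names and example predictions are hypothetical. Returning 0.0 when a denominator is empty is a convention chosen here, not a universal rule.

```python
from dataclasses import dataclass

@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # recall, true-positive rate
    specificity: float  # true-negative rate
    precision: float    # positive predictive value
    f1: float           # harmonic mean of precision and sensitivity

def _safe_div(a, b):
    return a / b if b else 0.0

def binary_metrics(y_true, y_pred):
    """Compute common metrics from paired binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    precision = _safe_div(tp, tp + fp)
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    accuracy = _safe_div(tp + tn, len(pairs))
    return BinaryMetrics(accuracy, sensitivity, specificity, precision, f1)

if __name__ == "__main__":
    # Imbalanced hypothetical test set: 10 positives, 90 negatives,
    # and a classifier that always predicts the majority class.
    y_true = [1] * 10 + [0] * 90
    y_pred = [0] * 100
    print(binary_metrics(y_true, y_pred))
```

The usage example shows why accuracy alone can be misleading: a classifier that never predicts the minority class still reaches 90% accuracy on this imbalanced set, while its sensitivity and F1 score are zero.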

GOAL 7: SURVIVAL AND OUTCOME PREDICTION
1][72][73][74][75][76][77][78][79] Most studies adopted a multimodal approach, integrating histological data with other modalities such as radiological, genomic or clinical data. Patients were stratified into survival probability groups, or survival predictions were derived through regression analysis. Evaluation metrics included accuracy, AUC and the concordance index. 8][79] No study explicitly showed that histopathology data alone performed as well as or better than multimodal data.
]79 Factors such as the percentage of hypertriploid nuclei and the frequency of small, dense chromatin clumps were found to be relevant in stratifying anaplastic astrocytoma patients into prognostic groups. 70 Three studies considered interpretability by defining molecular pathways and genetic expression features linked to survival. 73,76,77 However, the histopathological features associated with survival were mainly demonstrated using representative images from the long and short survival groups, without explicit evaluation of which morphological features guided AI decision-making.
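Stratifying patients into survival groups is commonly visualised with Kaplan-Meier curves. The sketch below assumes a hypothetical model output (one risk score per patient), splits patients at the median score into high- and low-risk groups, and computes a Kaplan-Meier survival estimate for each group. The median split, the function names and the cohort are illustrative assumptions rather than the method of any reviewed study.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) steps at each event time.

    events: 1 = death observed, 0 = censored. Patients censored at an event
    time are still counted as at risk at that time (the usual convention).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients leave the risk set
    return curve

def split_by_median_risk(risk_scores):
    """Hypothetical stratification: indices above the median risk form the high-risk group."""
    ordered = sorted(risk_scores)
    n = len(ordered)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
    high = [i for i, r in enumerate(risk_scores) if r > median]
    low = [i for i, r in enumerate(risk_scores) if r <= median]
    return high, low

if __name__ == "__main__":
    times = [5, 8, 12, 20, 24, 30]   # hypothetical months of follow-up
    events = [1, 1, 0, 1, 0, 1]      # 1 = death observed, 0 = censored
    risk = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
    high, low = split_by_median_risk(risk)
    for name, group in (("high risk", high), ("low risk", low)):
        curve = kaplan_meier([times[i] for i in group], [events[i] for i in group])
        print(name, curve)
```

Separated curves alone do not show that the risk score adds information beyond established prognostic factors; that question needs comparison against such factors, ideally in an independent cohort.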

Internal and external validation
Internal validation refers to reserving a proportion of the original dataset to assess AI model reliability. Internal validation plays a crucial role in selecting the optimal model among candidate models and in estimating whether the model will be able to generalise to unseen data. Robust but computationally expensive methods such as k-fold cross-validation were used in 37 studies, and leave-one-out cross-validation was utilised in four studies. 16,34,41,82 Seven studies relied solely on the train-test split approach, which is computationally simple but less representative of the model's true generalisability. 18,19,28,32,64,65,72 Nine studies did not provide details about internal validation. 35,39,45,58,60,61,63,78,81 See Figure 3 for detailed definitions of internal validation techniques employed.
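k-fold cross-validation, as described above, can be sketched as a generator of train/test index splits. This is a minimal illustration, not the procedure of any particular study: the shuffling seed and names are assumptions, and model fitting is left out. Setting k equal to the number of samples gives leave-one-out cross-validation.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Each sample appears in exactly one test fold; fold sizes differ by at most one.
    With k == n_samples this reduces to leave-one-out cross-validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_indices(10, k=3)):
        # A model would be fitted on `train` and evaluated on `test` here.
        print(fold, sorted(test))
```

When several images or tiles are derived from the same patient, folds would usually be assigned per patient rather than per image, so that material from one patient never appears in both the training and test folds; otherwise performance estimates can be optimistic.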
External validation evaluates model performance using entirely new and independent data that were not part of the model's training or validation process.It is essential in determining a model's reproducibility and applicability in real-world clinical settings.Only seven studies conducted external validation. 18,28,32,57,67,74,81Only three studies within this subset reported model performance on the corresponding unseen datasets. 18,67,74See Figure 3 for detailed definitions of external validation methods employed.

Risk of bias assessment
Using the PROBAST evaluation tool, most studies displayed a high overall risk of bias (61 studies) and limited applicability (66 studies) (Table S3 and Figure 4). In this context, risk of bias refers to flaws in a study's design, execution or analysis that may result in systematically skewed assessments of a model's predictive accuracy. Applicability refers to whether the model will be representative of the population to which it will ultimately be applied. Forty-six studies scored a high risk of bias in the 'Participants' domain. In 39 studies, this was attributed to sourcing participant data from pre-existing datasets, where data are typically collected for a purpose other than model development or validation and often without an appropriate protocol. 14 Six studies did not provide clear information regarding the data source used. 19,46,58,59,63,65 In the 'Predictors' domain, a high risk of bias was identified in 16 studies because of the use of manual annotation for ground truth labelling, which can introduce inter-observer bias, as manual techniques may vary across observers. Within the 'Outcomes' domain, although a high risk of bias was infrequent, most studies (56) demonstrated low applicability because of a lack of accessible published source code.
The majority of studies (43) scored a high risk of bias in the 'Analysis' domain. In 37 studies, this was attributed to a lack of reporting of the number of patients and/or images within the development and testing cohorts, impeding assessment of whether an adequate number of participants with the investigated outcome were included and whether the analysis covered all enrolled participants. Only two studies described methods for handling missing data. 72,76 Four studies did not provide any model performance information. 19,20,22,28 Except for the seven studies that conducted external validation, the risk of model overfitting to training data was largely overlooked.

Summary of findings
This review highlights the status of AI-driven histopathology image analysis in neuro-oncology. This is an evolving field, with half of the 68 reviewed studies published from 2020 onwards. The field is at an early stage; all of the studies were in the preclinical phase and retrospective in nature. AI-driven image analysis for CNS tumour histopathology lags behind several other disciplines. For example, the capacity of AI-driven histopathology image analysis to achieve diagnostic accuracies on par with human pathologists has been prospectively demonstrated in other cancer types, such as gastric and colonic cancer. 85,86 The use of AI in prostate cancer grading is already at the clinical evaluation stage. 87 In the field of neuro-oncology, AI applied to radiomic and tumour DNA methylation data is also at a more advanced stage. For example, AI algorithms applied to magnetic resonance imaging (MRI) images of pituitary neuroendocrine tumours to predict Ki67 proliferation indices have been tested in clinical settings. 88

Challenges facing AI-driven image analysis of CNS tumours
This review reveals an absence of clinical integration of the AI image analysis algorithms. Achieving accurate CNS tumour classification through AI algorithms presents a multifaceted challenge. In contrast to many somatic tumours, CNS tumours, particularly low-grade gliomas, encompass a broad spectrum of subtypes, with either considerable morphological heterogeneity even within a single tumour type or considerable morphological overlap between distinct molecular subtypes. 89 There is often a poor correlation between morphological fea-

Clinical recommendations
Currently, the literature appears skewed towards using AI to classify gliomas into morphological subtypes which are no longer listed in the 2021 WHO Classification (and have been superseded by molecular classifications), so it is unclear how they could assist current clinical workflows.Indeed, genetic and epigenetic parameters have now superseded the importance of histological subtyping in low-grade glioneuronal tumours, as they show considerable morphological overlap which may not be addressed with histological image analysis alone.
The use of AI for image analysis in CNS tumour histopathology requires application to tasks which could be more usefully integrated into existing diagnostic workflows. For example, specific labour-intensive tasks, including determining mitotic and Ki67 indices to inform prognosis and stratify aggressive subtypes, have demonstrated convincing performances when executed by AI algorithms compared to human counterparts. 93 These tasks require significant time investment and are prone to interobserver disagreement and human error.
Historically, these tasks have been difficult to automate (i.e. using rule-based software, which operates on a set of predefined rules) and may benefit from AI assistance (which can iteratively improve by learning from data and make predictions on new data). 94,95 AI-guided image analysis may also help inform and/or streamline requests for molecular testing based on a preliminary morphological diagnosis, although again this would require a demonstrable time- and/or cost-benefit relative to neuropathologist review. Finally, AI could be used to 'mine' histopathological imaging data for 'subvisual' morphological features, unapparent to the pathologist, that are useful in diagnosis or prognostication. This may be particularly helpful in cases deemed unsolvable after pathologist review and available molecular testing, including DNA methylation arrays and genome sequencing. 96 Relevant to prognostication, AI has been used to predict the survival of breast cancer patients from H&E-stained slides with greater accuracy than standard pathologist grading, based on stromal morphological structures previously unrecognised as prognostically relevant. 97 Similarly, AI models have been shown to extract prognostic information and make molecular predictions from tissue morphology in colorectal and bladder cancer, with greater accuracy than pathologists. 98,99 Improved communication between clinicians and engineers is imperative to achieve these advancements, given the unique challenges in developing AI models for image analysis of CNS tumours. Tumour Atlas) could be exploited for AI analysis whilst overcoming some of the issues associated with batch effects. 102

Strengths and limitations
Through a systematic review of the literature, the present study offers an up-to-date exploration of AI-driven applications for CNS tumour histopathology image analysis. The findings are critically evaluated in the context of clinical utility, with the provision of practical recommendations (Figure 5). However, certain limitations should be acknowledged. Although the identification of studies was comprehensive, it was constrained by the search strategies employed. Only full-text articles in the English language were considered, which could result in the omission of certain studies. Whilst an array of databases in the biomedical domain has been examined, future investigations could encompass databases within computer science and related disciplines, including resources such as the IEEE Xplore Digital Library.

CONCLUSION
We present a systematic review of the literature concerning the use of AI for the analysis of neuro-oncological histopathological images.

Data pre-processing pipelines help render raw data suitable for training AI models. Eighteen studies, especially those using publicly available datasets, implemented quality control measures such as removing images with inferior resolution or processing artefacts. Image augmentation describes the technique of artificially expanding the training dataset to enhance model generalisability and mitigate class imbalances. This was implemented in 20 studies through a range of techniques, including flipping, rotation and other geometric transformations, with some reported benefits for model performance.38,68 Image normalisation, a process whereby image pixel values are standardised to a common scale to improve training efficiency, was described in 17 studies. These studies used a variety of methods, including contrast adjustment, colour adjustment and normalisation techniques to overcome inconsistencies in the staining process.4,17,23,29,38,42,45,48,59,63,64,68,72,76–80 Four studies used predeveloped, open-source image pre-processing pipelines. Two of these used the Whole Slide Image (WSI) Pre-processing pipeline from https://github.com/deroneriksson/python-wsi-preprocessing, which performs a range of operations including colour correction, image tiling and tissue identification.38,64,68,78

Furthermore, dimensionality reduction, the technique of reducing the number of input features whilst retaining essential information from the training data, was primarily used in studies adopting classical machine learning algorithms. This was carried out to enhance training efficiency and to reduce the risk of overfitting. Overfitting occurs when a model performs well on the training dataset but this performance is not reproduced on an independent external dataset. Deep learning typically does not involve explicit dimensionality reduction because of its intrinsic capacity to learn hierarchical features from raw data; accordingly, only one study using a deep learning algorithm performed dimensionality reduction.78

… and whether model interpretability was considered. Model interpretability involves identifying the features that contribute most to a model's output in order to understand its decision-making process. It is crucial for trusted clinical integration, as it protects against errors during model training and may reveal new insights through the recognition of previously undiscovered patterns.
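The augmentation and normalisation steps described above can be illustrated with a minimal Python sketch. It assumes that image tiles are held as NumPy arrays of shape (height, width, 3) with 8-bit RGB values. The function names `dihedral_augment` and `match_channel_statistics` are hypothetical and are not taken from any reviewed study.

The normalisation shown is a simplified, Reinhard-style matching of each channel's mean and standard deviation, carried out in RGB space. The original Reinhard method operates in the lαβ colour space, and the stain normalisation methods used in the included studies may differ.

```python
import numpy as np


def dihedral_augment(tile: np.ndarray) -> list[np.ndarray]:
    """Return the 8 flip/rotation variants of an H x W x C image tile.

    Horizontal flips combined with rotations by multiples of 90 degrees
    do not alter tissue morphology, so each variant can act as an extra
    training example.
    """
    variants = []
    for base in (tile, np.fliplr(tile)):
        for k in range(4):  # rotate by 0, 90, 180 and 270 degrees
            variants.append(np.rot90(base, k=k, axes=(0, 1)))
    return variants


def match_channel_statistics(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Rescale each colour channel of `source` to the mean and standard
    deviation of `target` (simplified Reinhard-style normalisation in RGB)."""
    src = source.astype(np.float64)
    tgt = target.astype(np.float64)
    out = np.empty_like(src)
    for c in range(src.shape[2]):
        s_mean, s_std = src[..., c].mean(), src[..., c].std()
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        # Guard against a flat (zero-variance) channel in the source tile.
        scale = t_std / s_std if s_std > 0 else 1.0
        out[..., c] = (src[..., c] - s_mean) * scale + t_mean
    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Hypothetical tiles: a source tile and a reference tile from another site.
rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
reference = rng.integers(100, 200, size=(64, 64, 3), dtype=np.uint8)

augmented = dihedral_augment(tile)
normalised = match_channel_statistics(tile, reference)
print(len(augmented))  # 8
```

In this sketch the reference statistics come from a single tile. In practice they are often taken from a chosen reference slide, so that every contributing site is mapped to the same colour distribution.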
Summary results of included studies, including study design, clinical and dataset characteristics, and artificial intelligence algorithm type and goal.
FIGURE 3 Development pipeline for artificial intelligence in digital histopathology, with relevant definitions (not an exhaustive list; see Goodfellow et al. 2016 for a detailed review).
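The dimensionality reduction step, used mainly in classical machine learning pipelines, can be illustrated with principal component analysis (PCA) computed through NumPy's singular value decomposition. PCA is chosen here only as a familiar example, since this section does not specify which methods the included studies used. The feature matrix, standing in for hand-crafted texture features per image tile, is synthetic and hypothetical, as are the function names.

The projection is fitted on training data only and then applied unchanged to held-out data. This ensures that no information from the test cohort influences model development.

```python
import numpy as np


def fit_pca(train: np.ndarray, n_components: int):
    """Learn a PCA projection from an (n_samples, n_features) training matrix.

    Returns the training mean, the top principal axes (one per row) and the
    fraction of total variance explained by each retained axis.
    """
    mean = train.mean(axis=0)
    # Rows of vt are principal axes ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(train - mean, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return mean, vt[:n_components], explained[:n_components]


def apply_pca(features: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project samples using a projection learned on the training set only."""
    return (features - mean) @ components.T


# Hypothetical data: 300 tiles x 50 features driven by 2 latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 50))
features = latent @ mixing + 0.01 * rng.normal(size=(300, 50))
train, held_out = features[:200], features[200:]

mean, components, explained = fit_pca(train, n_components=2)
train_reduced = apply_pca(train, mean, components)
held_out_reduced = apply_pca(held_out, mean, components)
print(train_reduced.shape, held_out_reduced.shape)  # (200, 2) (100, 2)
```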

Liechty et al. and Liu et al. focussed on predicting IDH status from H&E-stained slides.67,68 Although Liechty et al.'s model did not outperform human pathologists when assessed using the AUC metric, combining the decisions of humans and the AI model within a man-machine hybrid framework achieved superior performance compared with the consensus of two expert neuropathologists.67

… nature, and most failed to conduct direct comparisons with human pathologist assessment or to validate their outcomes with molecular tests. Moreover, all studies displayed a high risk of bias and/or limited applicability, and thus limited potential clinical utility.

Persistent issues included:
- inadequate reporting of dataset characteristics, including the number of patients and/or images used for model development/validation and the methods for handling missing data;
- absence of external validation;
- insufficient recognition of batch effects in multi-institutional datasets, or of normalisation approaches to correct them;
- lack of published source code.

Together, such issues prevent independent research groups from testing model performance in independent patient cohorts. Such testing is critical in judging a model's safety, reliability and generalisability.
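The AUC metric and the concern about batch effects in multi-institutional datasets can be made concrete with a small, dependency-free Python sketch. It computes the area under the ROC curve as the probability that a randomly chosen positive case (e.g. IDH-mutant) receives a higher model score than a randomly chosen negative case, with ties counting as one half. It then repeats the calculation separately for each contributing site.

The function names, labels, scores and site identifiers are hypothetical and do not reproduce any included study's evaluation. The pairwise loop is quadratic in cohort size and is written for clarity rather than efficiency.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative case
    (Mann-Whitney formulation; ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_site(sites, labels, scores):
    """Compute AUC separately for each site that contributed cases."""
    grouped = defaultdict(lambda: ([], []))
    for site, y, s in zip(sites, labels, scores):
        grouped[site][0].append(y)
        grouped[site][1].append(s)
    return {site: roc_auc(ys, ss) for site, (ys, ss) in grouped.items()}


# Hypothetical held-out predictions pooled from two sites (1 = positive class).
sites = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.4, 0.6, 0.7, 0.5]
print(roc_auc(labels, scores))             # 0.875
print(auc_by_site(sites, labels, scores))  # {'A': 1.0, 'B': 0.5}
```

In this toy example, the pooled AUC of 0.875 hides chance-level discrimination at site B. Reporting performance by site is intended to expose this kind of between-site variation.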
Moreover, an essential prerequisite for the implementation of any AI algorithm on CNS tumour histopathology is a clinically validated digital pathology workflow integrated within the neuropathology department. This should include dedicated scanners for routine real-time digitisation of WSIs, image management software, and real-time access of AI algorithms to digitised images. The requirement for dedicated equipment imposes financial hurdles. Granting external image analysis systems access to stored histology datasets raises data privacy and logistical hurdles.

Engineering recommendations

Studies to date are largely of low quality, with a high risk of bias and limited applicability. Key issues include inadequate documentation of dataset attributes and of how missing data were handled. Only a critically small number of studies are externally validated, although external validation is essential for demonstrating a model's ability to generalise to unseen datasets. Only a limited number of studies share their model source code, a practice that enhances research reproducibility, facilitates collaboration and enables peer validation. Finally, AI models should be evaluated using clinically relevant, appropriate metrics (e.g. relevant online tools).100

Several studies in the current literature use multi-centre datasets. These can introduce batch effects (non-biological factors that create variation in the data) at various stages, from tissue collection to image digitisation. As a result, AI models may learn the unique WSI signatures of individual sites rather than inherent biological attributes.101 Recommendations for studies using multi-centre datasets include reporting variations in outcomes observed across sites and implementing pre-processing steps such as stain normalisation.101 These steps are often omitted in the reviewed studies and should be considered. Comprehensive, freely available single-centre histopathology datasets (e.g. The Digital Brain …

FIGURE 5 Recommendations for the clinical and engineering communities to help bridge the gap between preclinical studies (the current state of the field) and clinical implementation in the field of AI-driven histopathology image analysis of CNS tumours.

Despite a growing body of relevant literature, the field remains at an early stage. All of the studies were retrospective and preclinical, and poorly aligned with current diagnostic neuropathology workflows. A high risk of bias was identified across the majority of studies; persistent issues included an absence of external validation and inadequate reporting of study characteristics.

Based on these findings, we propose specific clinical and engineering recommendations:
- adopting up-to-date integrated classification systems;
- improving transparency in reporting the number of patients and/or images within the model training and testing cohorts;
- rigorous external validation;
- better consideration of model interpretability.

Implementing such changes, alongside better cross-disciplinary collaboration among clinicians, computer scientists, image analysts and engineers, is needed to create robust AI models. Such models should be able to transition from preclinical development into clinical trials, with structured evaluation as per published guidance (e.g. DECIDE-AI, CONSORT-AI).68,90