Artificial intelligence for diagnostic and prognostic neuroimaging in dementia: A systematic review

Artificial intelligence (AI) and neuroimaging offer new opportunities for diagnosis and prognosis of dementia.

in AI and medicine will help achieve the promising potential of AI tools in practice.

KEYWORDS
artificial intelligence (AI), Alzheimer's disease, dementia, machine learning (ML), neurodegenerative diseases, neuroimaging

Highlights
• There has been a rapid expansion in the use of machine learning for diagnosis and prognosis in neurodegenerative disease
• Most studies (71%) relied on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, with no other individual dataset used more than five times
• There has been a recent rise in the use of more complex discriminative models (e.g., neural networks) that performed better than other classifiers for classification of AD vs healthy controls
• We make recommendations to address methodological considerations, key clinical questions, and validation
• We also make recommendations for the field more broadly to standardize outcome measures, address gaps in the literature, and monitor sources of bias

INTRODUCTION
There is a pressing need to improve diagnosis and prognosis for people with dementia. Up to 20% of people may receive the wrong diagnosis, 1 and differentiating between early symptoms in dementia based on clinical information and neuropsychological testing alone is subjective and prone to error. There is large geographic variability in the likelihood of receiving a diagnosis, even within a single country. 2 Diagnostic investigations such as neuroimaging and cerebrospinal fluid (CSF) tests can support clinical diagnosis; however, it can take years to receive a diagnosis from the initial onset of symptoms. 3 Receiving a timely and accurate diagnosis is critical for people with dementia, their carers, and families: 4,5 it provides the opportunity for forward planning, and with the advent of disease-modifying treatments an early, accurate diagnosis will guide treatment selection, working toward a precision medicine approach. 6
Neuroimaging is a non-invasive investigation used in routine clinical practice to support the diagnosis of dementia. 7,8 [17] Human clinical judgment has traditionally been used to interpret clinical neuroimaging. 9 Visual rating scales may support this assessment using features such as medial temporal lobe atrophy 18 and white matter hyperintensity load. 19,20 However, the development of more sophisticated approaches and richer data may mean that the most informative features are not amenable to human measurement or observation. For example, resting-state functional MRI can be used to derive a variety of connectivity metrics between thousands of nodes that are amenable to machine learning (ML) approaches. 21 Deep learning methods have also demonstrated superiority to human neuroimaging interpretation. 22,23 [26]
Neuroimaging data are particularly well suited to analysis using ML, particularly deep learning, given their high dimensionality, non-linear nature, and high covariance. A large and growing number of ML studies have investigated how neuroimaging features can be used to predict cognitive diagnoses and conversion to dementia, fueled by the availability of large datasets, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI). 27 However, uncertainty remains about which ML approaches have the greatest potential to inform clinical decision making and how their performance compares to human decision making.
We therefore conducted a systematic review to establish: (1)   28 , drug discovery and trials optimization 29 , genetics and omics 30 , biomarkers 31 , neuroimaging (this article), prevention 32 , applied models and digital health 33 , and methods optimization 34 .

METHODS
We conducted a systematic review to investigate the use of ML methods for neuroimaging in dementia, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, 35 and the protocol was registered with PROSPERO (ID: CRD42021232249) prior to the screening of abstracts.

Search strategy
The databases MEDLINE (via Ovid), Embase (via Ovid), Cochrane Library, BNI (via ProQuest), PsycINFO (via EBSCOhost), CINAHL (via EBSCOhost), and Emcare (via Ovid) were searched using the title, abstract, keyword, and MeSH term fields from inception to January 8, 2021, with the support of the Cambridge University Clinical School Library. Results were limited to English language studies. Full search terms for each database can be found in Supplementary Material 1. Studies which were known to the authors and met the inclusion/exclusion criteria of the review, but were not initially identified using the search strategy, were also included.

PICOS framework
Outline of the parameters of this systematic review according to the PICOS framework:

Inclusion & exclusion criteria
The inclusion and exclusion criteria used during the screening process to determine which studies would be included in the systematic review can be found below:
Inclusion criteria:
1. Primary research studies only.

Study selection
The initial records were identified using the search criteria. These records underwent de-duplication using a Zotero (https://zotero.com) automation tool, which flagged possible duplicate studies, and were manually screened by a reviewer to merge genuine duplicates. Following de-duplication, all studies were screened across two stages.
During the first stage, each abstract was independently reviewed by two reviewers to determine its eligibility for inclusion based on the outlined criteria using the screening tool Rayyan (https://www.rayyan.ai/). Once both reviewers had screened their allocated abstracts, inclusion/exclusion decisions were unblinded. For abstracts where there was disagreement between screeners, a third independent reviewer assessed the abstract and made the final decision as to (1) progression to the full-text screening stage or (2) exclusion.
The second stage involved full-text screening of all included studies by one reviewer per paper. For studies where the reviewer was unsure if the study met the outlined criteria, a second opinion was sought and a joint decision made after discussion with the second reviewer.

Data extraction
One reviewer per paper manually collected data from each report independently into an Excel spreadsheet without the use of automation tools. The following data were extracted from the included studies:
1. Article information: First author, year, journal, country of first author's affiliated institution.
2. Study method: Patient population(s), neuroimaging modality, source of data. Where a data source comprised several related datasets, the specific dataset used was recorded where possible. For example, for ADNI studies, the specific dataset used (ADNI-1, ADNI-2, ADNI-GO, J-ADNI) was identified and recorded where available.

Risk of bias assessment
Following the second stage of screening, all included studies were assessed for risk of bias by one reviewer using a hybrid version of the Joanna Briggs Institute (JBI) Critical Appraisal checklist covering the areas we deemed most relevant to this area of research. 36 The specific questions used for risk of bias assessment and their outcome for each study can be found in Supplementary Material 2. We only excluded studies exhibiting clear methodological concerns, such as lack of reporting of basic participant demographics, in order to accurately depict and identify current barriers in the literature limiting translation to clinical practice.

Data synthesis and approaches to classification
Given a training set of labeled features, there are multiple ways to learn a classifier that can then be used to predict class membership for new, unlabeled instances. We categorized classifiers according to the object they seek to learn or model.
1. Generative classifiers learn the joint distribution of the features and labels. 37 Examples include naïve Bayes and linear/quadratic discriminant analysis. After training, it is possible to generate (hence the name) new pairs of features and labels by sampling from the learned joint distribution.
2. Discriminative classifiers learn the conditional distribution of the labels given the features. 38 Examples include logistic and Gaussian process regression with potential regularization, k-nearest neighbors, and most ensemble methods (such as random forests).
3. Non-probabilistic, algorithmic classifiers directly learn the decision boundary in feature space. 39 Examples include maximum margin classifiers and support vector machines.
We note that some non-probabilistic classifiers can be reframed in a probabilistic light. 40,41 For this reason, some authors consider these methods to be discriminative in nature and draw less of a distinction between our types (2) and (3).
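The distinction between generative and discriminative classifiers can be made concrete with a minimal sketch. The example below is illustrative only and is not drawn from any of the reviewed studies: it assumes a single hypothetical feature (a standardized volume score), made-up toy values, and hypothetical function names. It fits a one-feature Gaussian generative model, from which new feature-label pairs can be sampled, alongside a logistic regression that models the label probability directly. Non-probabilistic classifiers (type 3) are not shown.

```python
import math
import random


def fit_generative(xs, ys):
    """Learn p(x, y) = p(y) * p(x | y), with one Gaussian per class."""
    model = {}
    for label in sorted(set(ys)):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        model[label] = {"prior": len(vals) / len(xs), "mean": mean, "var": var}
    return model


def log_joint(model, x, label):
    """Log of the learned joint density p(x, label)."""
    p = model[label]
    return (math.log(p["prior"])
            - 0.5 * math.log(2 * math.pi * p["var"])
            - (x - p["mean"]) ** 2 / (2 * p["var"]))


def predict_generative(model, x):
    """Classify by picking the label with the highest joint density."""
    return max(model, key=lambda label: log_joint(model, x, label))


def sample_pair(model, rng):
    """Generate a new (feature, label) pair from the learned joint."""
    labels = list(model)
    weights = [model[label]["prior"] for label in labels]
    label = rng.choices(labels, weights=weights)[0]
    x = rng.gauss(model[label]["mean"], math.sqrt(model[label]["var"]))
    return x, label


def fit_discriminative(xs, ys, lr=0.1, epochs=500):
    """Logistic regression: learn p(y = 1 | x) directly by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data (assumed): label 0 = control, label 1 = patient.
xs = [1.2, 0.8, 1.5, 0.9, 1.1, -1.0, -0.7, -1.3, -0.9, -1.2]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

gen_model = fit_generative(xs, ys)
w, b = fit_discriminative(xs, ys)
new_x, new_label = sample_pair(gen_model, random.Random(0))
```

Only the generative model supports `sample_pair`; the logistic regression yields a probability for a label given a feature value but has no model of the features themselves.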
In order to determine how well a classifier generalizes to new data, models are typically evaluated using a validation set consisting of labeled data withheld from the training process. The model's predictions in the validation data can be compared to known labels using a variety of different metrics: precision, recall, accuracy, AUC, and F-scores are all estimated in this way. If a classifier performs much better on training data than on validation data, this can indicate overfitting. In such a case, the model may be refitted with regularization terms or priors that penalize model complexity.
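As an illustration of how such a comparison might be made, the sketch below computes the AUC as the proportion of patient-control pairs that a model ranks correctly (the Mann-Whitney formulation) and contrasts training and validation performance. The scores and labels are made-up values, and the function name is an assumption for this example.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predicted scores (assumed): 1 = patient, 0 = control.
train_labels = [1, 1, 1, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
val_labels = [1, 1, 1, 0, 0, 0]
val_scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.3]

train_auc = auc(train_scores, train_labels)
val_auc = auc(val_scores, val_labels)

# A large positive gap between the two is one warning sign of overfitting.
gap = train_auc - val_auc
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but would normally be replaced by a rank-based calculation for large datasets.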
Following data extraction, we conducted a meta-analysis. [44] We attempted to overcome these barriers by running a focused evaluation of the performance of ML algorithms, measured with AUC values, for a specific task: classification of AD versus healthy controls. This was achieved using a Stratified Weighted Random Method (SWRM) approach by assigning weights to the datasets and features (see further methodological details in Supplementary Material 1).

RESULTS
The initial search strategy yielded 2709 studies, which underwent abstract screening following de-duplication. Three additional studies which were not picked up in the initial search strategy but met the inclusion criteria were identified by experts in the field and underwent full-text screening. The studies were consolidated to 255 studies after full-text screening (full list of references in Supplementary Material France (6%), and South Korea (6%).
Risk of bias assessment resulted in exclusion of three studies which exhibited clear methodological concerns, such as lack of reporting of basic participant demographics (Supplementary Material 2). The majority of studies used clearly defined inclusion criteria (95%) with detailed descriptions of participants and settings (91%). Only 41% of studies explicitly identified potential confounding factors.

Datasets
Few studies used more than a single dataset, with 233 studies using one dataset, 18 using two datasets, and the remainder using three or more datasets. The most commonly used dataset was ADNI (see Figure 2). In the majority of the studies using data from ADNI, the specific cohort used (ADNI-1, ADNI-2, ADNI-GO, J-ADNI) was not stated (129 of 181) (Table 2 in Supplementary Material 1). Where the cohort was available

AI methods
The classifier type most frequently used was a non-probabilistic algorithmic approach (48%), an example of which is support vector machines (SVM), followed by discriminative classifiers (32%), which include most neural networks. Generative classifiers and "other" methods each constituted 10% of the literature; the latter mainly consisted of studies that combined multiple AI algorithms into novel or complex classification tools and were difficult to categorize. Most of these studies focused heavily on computational methods which are not easily accessible to a clinical audience.
The number of studies which used algorithmic classifiers (mainly SVM) increased considerably between 2013 and 2015, after which their use stabilized. In contrast, there was a sharp rise in the number of studies using discriminative approaches (e.g., neural networks) starting in 2017, with discriminative studies outnumbering algorithmic studies for the first time in 2019 (Figure 3).
In order to unveil potential differences in performance between ML methods, we examined AUC values for classifying AD versus healthy controls across studies (Figure 4). Of note is that only 13% (11 of 84) of these studies reported a confidence interval for the AUC value. Of these 11 studies, 5 did not report the level of the confidence interval (e.g., 90% or 95%).
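To illustrate what reporting an interval with an explicit level involves, the sketch below uses the Hanley and McNeil (1982) approximation for the standard error of an AUC. This is one common approach among several (bootstrapping is another) and is not necessarily the method used in the studies reviewed. The function name, sample sizes, and AUC value are assumptions for the example.

```python
import math


def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate confidence interval for an AUC (Hanley and McNeil, 1982).

    z sets the confidence level: 1.96 for 95%, 1.645 for 90%.
    Returns (lower, upper, standard_error), with bounds clipped to [0, 1].
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    se = math.sqrt(variance)
    return max(0.0, auc - z * se), min(1.0, auc + z * se), se


# Hypothetical study: AUC of 0.90 with 50 patients and 50 controls.
lower_95, upper_95, se = hanley_mcneil_ci(0.90, n_pos=50, n_neg=50, z=1.96)
lower_90, upper_90, _ = hanley_mcneil_ci(0.90, n_pos=50, n_neg=50, z=1.645)

# The same AUC from a larger sample gives a narrower interval.
lower_big, upper_big, _ = hanley_mcneil_ci(0.90, n_pos=500, n_neg=500)
```

Because the 90% and 95% intervals differ in width for identical data, an interval reported without its level cannot be compared across studies.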
We employed a meta-analytic approach using the stratified weighted random method (SWRM) to weight results based on the dataset, imaging modality, and type of ML method used (methodological details in Supplementary Material 1). We found that for classification of AD versus healthy controls (i) discriminative models
We identified four studies that used transfer learning for classification; [45][46][47][48] the models were trained on ImageNet, 45 ADNI (normal controls and AD), 46 Human Connectome Project (HCP), 47 and generic images, 48 and were transferred to ADNI, 45 ADNI (stable and progressive MCI), 46 ADNI, 47 and ADNI (sMRI). 48

MRI
[51] In total, 68.6% (175 of 255) of studies relied on volumetric structural MRI measurements. In the few studies that tested traditional and AI approaches head-to-head, AI methods outperformed raw volumetric measurements, for example, against hippocampal volume for diagnosis 52,53 and for predicting conversion of MCI to AD. 54 The reported accuracy of AI methods for the diagnosis of AD varied between 60.2% and 99.3%. Of note, estimates in the lower range were found when using a multi-class classifier (i.e., AD vs. MCI vs. healthy controls, rather than AD vs. healthy controls) 55,56 or where an independent validation group was used. 57 Contributing to heterogeneity, the aim of "diagnosis" differed between studies using structural MRI. For example, there were 17 studies specifically targeting early diagnosis in which "early" disease was variably defined by: MMSE score < 24 [58][59][60]; CDR 0.5-1 48,[61][62][63]; [72]
Studies using longitudinal structural MRI measures (n = 6) 69,[73][74][75][76][77] suggest that multiple timepoints may be more accurate than baseline measures alone for the diagnosis of AD, 62 and were particularly useful when applied to the prediction of MCI to AD conversion. 69,75,77 Of interest, longitudinal changes in volumetric MRI may need to be considered in the context of baseline volumetry to be meaningful. 74
Twenty-eight studies investigated the use of non-volumetric The accuracy for differentiating between AD patients and healthy controls ranged from 79.2% to 99.1%. [81][82] As expected, differentiating MCI subtypes and between MCI and AD cohorts was a more difficult task, which is also often the case in clinical practice. [85]
Twenty-six studies (the first published in 2012) used resting-state MRI (rsMRI); we did not identify any studies using task-based MRI. All but 4 studies 51,52,86,87 focused on diagnosis and the majority (20 of 26) used ADNI data, either as the primary dataset or as a replication dataset. Graph measures were often used to summarize network characteristics. Overall, the accuracy of discriminating between AD and controls ranged between 85% and 97%, but dropped when discriminating between MCI and controls (70-88%).
[90][91]
FIGURE 4 Forest plot depicting AUC values for classifications of AD patients versus healthy controls. Confidence intervals are shown where these were reported. Studies were stratified according to the type of machine learning method used, including algorithmic (orange), discriminative (blue), generative (green), and other (red). Unweighted average AUC values for each type of machine learning method are depicted with a diamond.

Neurophysiological imaging
[94] The majority of the studies (n = 21) used quantitative EEG, while the remainder used either MEG, 95 event-related potential EEG, 96 or combined EEG with SPECT. 97 Although half (n = 12) of these studies have been published since 2018, this cohort of publications also included some of the earliest studies identified in this review, starting in 2005. 98,99 All neurophysiological studies used data from their local institution, the largest of which included EEG recordings from 272 participants, 100 although most studies (n = 13) included fewer than 50 participants. In a manner similar to other imaging modalities, SVM was the most common (n = 12) ML tool used and no other algorithm was used in more than three studies. Accuracy of discrimination between AD and healthy controls varied from 69% in the single MEG study 95 up to 100% in one study using four EEG features. 101
[133][134][135] An additional approach used PET and structural MRI data in combination with other markers (i.e., apolipoprotein E4 [APOE4] status and cognitive scores) to train a classifier, then selected neuroimaging features for classification, showing better performance when neuroimaging data (gray matter density, amyloid burden, APOE4 status; r = −0.68) were used to predict individualized rate of cognitive decline in MCI, compared to cognitive predictors (depression, memory and executive function scores; r = −0.4). 136 Similarly, three studies showed that SPECT is able to classify MCI and AD, but its predictive value for MCI conversion improved when combined with other imaging modalities or cognitive assessments. 97,137,138

Approaches to prognosis in AD
Fifty-four studies investigated either prognosis or a combination of diagnosis and prognosis. The majority were retrospective designs (51 of 54). Of 54 studies, 47 (87%) looked at prognosis in terms of MCI to AD conversion. Of these studies, two approaches were used to evaluate the performance of prognostic predictions; some exclusively used baseline data (fixed), while others used multiple imaging time points (continuous) and related these to time to conversion.
MRI alone was the main imaging modality used (36 of 54 studies) with an additional six studies combining MRI and PET. Nine studies used only PET data, 102,[111][112][113],116,122,126,139 one used SPECT, 137 and two used EEG data. 93,95 The main outcome measure for these studies was conversion to AD from MCI over a prespecified period of time (47 of 54 studies). A smaller proportion of studies (n = 4) used cognitive decline as an outcome measure. Similar to the diagnostic studies discussed in this review, the majority of the neuroimaging data came from the ADNI database (78%, 42 of 54 studies). An additional three studies combined local datasets with ADNI.
Thirty-eight studies used only baseline imaging data to predict a future diagnosis with a range of accuracy between 65% and 96% (mean AUC 0.79, standard deviation 0.09). Seven used multiple imaging timepoints to make predictions with accuracies between 73% and 92% (mean AUC 0.81, standard deviation 0.10). One paper found a substantial improvement with longitudinal data (AUC 0.93) compared to baseline data alone (AUC 0.54), 111 and a second paper achieved a high level of accuracy using baseline neuroimaging information with longitudinal cognitive scores (AUC = 0.90). 70 Time to conversion was divided into two categories: conversion within a fixed timeframe (42 of 47), or a continuous measure of time of conversion (5 of 47) and MCI to AD conversion within 24 months. 77,142 Additionally, two of seven studies predicted conversion from cognitively normal to AD in 7 59 and 2 years. 64,143 Finally, only one paper examined prognosis in non-AD neurodegenerative diseases, namely PD and DLB, 93 with an AUC of 0.87.

Non-Alzheimer's dementias
The majority of studies that included patients with non-Alzheimer's dementia used neuroimaging features to improve the differential diagnosis between different dementia diagnoses. In total, 17 studies included a non-AD dementia group, 14 featured a non-AD dementia as the diagnosis of interest, with the remaining 3 using the non-AD groups as a control group. [146][147][148] These studies attempted differential diagnosis of FTD (from AD and/or LBD) most often using neuropsychological data and structural imaging (four of seven studies), with two studies using EEG 92,94 and one using structural MRI for classification based on post-mortem pathology. 147 [151][152] Structural MRI was the most frequently used imaging modality (11 of 17 studies). Two studies focused on the differential diagnosis between PD and LBD, 93,153 and only two on vascular dementia. 78,154 The majority of studies used data from local hospitals or memory clinics (14 of 17 studies); one paper used local data combined with ADNI, 57 and three studies used multi-center or cohort data. 144,148,150 Since the majority of studies utilized prospective or retrospective data from local clinics, datasets were relatively small compared to multi-center studies like ADNI, with most studies including 60 to 100 patients and some as few as 15 patients in a single diagnostic category. 78 The studies with larger patient numbers tended to come from multi-center studies 144,150 or used retrospective data over a long period of time. 147

DISCUSSION
In this systematic review, we examined 255 published studies using neuroimaging alone for the diagnosis or prognosis of neurodegenerative disease. The vast majority of studies (71%) used the ADNI dataset, which primarily uses MRI and focuses on the conversion from MCI to AD. The dominance of ADNI means that this emphasis is reflected in the published literature, with the majority of studies using structural MRI alone or in combination with another MRI modality or PET, almost all of which focused on AD. The size of the ADNI data has led to a rapid rise since 2017 in the use of more complex discriminative AI methodologies, including deep learning models. These more complex models have in general outperformed simpler algorithmic and generative models, although comparison between studies is challenging given differences in diagnostic criteria and outcome measures. Most studies of diagnosis published ROC curve analysis results; however, there were marked differences between studies in definitions such as "early" dementia, and in the outcome measures used in prognostic studies. There remain significant gaps in the literature including non-Alzheimer's neurodegenerative diseases (most strikingly vascular dementia with only two studies), the limited application of promising neurophysiology methods, and validation in clinically relevant populations.
ML methods have been successfully applied to almost every aspect of neurodegenerative disease. 155 A previous review of ML for neuroimaging in dementia included studies up to 2016, 42 and the field has expanded rapidly since. Approximately 60% of the studies we included (n = 152) have been published since 2016. Some progress has been made on the concerns raised by Pellegrini and colleagues, including the overreliance on SVM classifiers and MRI. SVM was still the most frequently used classifier in our cohort, which is unsurprising given that it was one of the first widely adopted methods. Nevertheless, the overreliance on SVM classifiers has reduced, reflecting the rapid growth of this field and a move toward the use of a range of ML methodologies, as well as PET and/or multimodal approaches. Despite this surge in studies, several barriers prevent the integration of these novel methods into everyday clinical practice. Below we discuss three critical issues identified from this systematic review: (1) reporting and reproducibility of methodology, (2) addressing clinically relevant questions, (3) validation of results.

Methodological considerations
While it is encouraging to see a wide range of methods applied to neuroimaging data, the multiplicity of approaches creates a challenge in assessing the validity of each method, comparing between differing models, and independently reproducing the results. Although we did not systematically review reproducibility, in general we found limited descriptions of many models, and only a minority of studies reported the availability of code to enable replication.
Reproducibility and transparency in neuroimaging research are increasingly prominent issues, most clearly outlined by Poldrack and colleagues. 156 The neuroimaging field has led the way in open science efforts, such as large data sharing platforms pioneered by the Human Connectome Project, 157 and introducing best practice for analysis and data sharing through the COBIDAS guidelines. 158,159 To increase the reliability of results, pre-registering analysis through platforms such as the Open Science Framework 160 has been advocated for in both neuroimaging studies 161 and ML methodologies. 162 More generally, staged approaches to model validation in ML are available to improve confidence in model performance. 25
We found that the combination of multiple imaging modalities, such as MRI and PET, improved the performance of ML models for classification tasks related to AD. We speculate that using features from multiple modalities enables the models to train on several different biomarkers which provide a more holistic representation of the underlying disease mechanisms, such as changes in structure (volumetric MRI), network-connectivity metrics (resting-state fMRI), and metabolic physiology (PET). Although the results suggest this approach may be beneficial, the limited number of studies identified here using this method means that it is difficult to suggest which combinations of modalities will be best at improving the performance of ML models.

Addressing key clinical questions
Relevant clinical questions can be split into early diagnosis, differential diagnosis, prognosis, and predicting response to treatment. There were no studies investigating the response to treatment, perhaps unsurprisingly given that the currently widely available treatments for dementia are symptomatic rather than disease modifying. The majority of studies considered the diagnosis of AD, or the prognostic prediction of MCI conversion to early AD. However, variability in definitions such as "early Alzheimer's disease" limited comparison between studies. This partly reflects the wider field where, for example, a clear definition of MCI has remained elusive despite recent efforts to reach such a consensus. 163
We found no studies that assessed the common clinical challenge of differential diagnosis from among multiple (>2) possible diagnoses. This is a much harder problem to solve for ML algorithms because it requires a multi-class classifier, which is computationally more challenging and typically yields lower accuracy than a binary classifier.
The lack of appropriate multiclass data is a major limitation, particularly given the reliance on the ADNI dataset that consists almost exclusively of amnestic MCI or AD patients. The National Alzheimer's Coordinating Center dataset has Alzheimer's and non-Alzheimer's dementia patients from a real-world setting, 164 but is much more variable in scanning sequences (including MRI field strengths), and reports clinically defined diagnoses rather than research diagnostic criteria.
ROC curve analysis was widely used to characterize diagnostic classification performance. In particular, we found the AUC is often reported as the main measure of classification between groups, usually accompanied by the PPV and NPV. The PPV and NPV are more relevant to clinical practice, providing interpretation of the proportion of correct positive and negative results for a classification. The outcome measure for prognostic studies is more challenging. We found that studies predicting prognosis usually grouped outcomes and applied ROC curve analysis. This is particularly relevant for predicting MCI to AD conversion; however, it is not applicable to other situations, such as predicting the rate of cognitive decline in established dementia.
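The clinical relevance of the PPV and NPV can be seen from their standard definitions in terms of sensitivity, specificity, and disease prevalence. The sketch below applies those definitions; the function name and the input values are hypothetical and are not taken from any reviewed study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from test characteristics and disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical classifier (90% sensitivity and specificity) applied to
# a balanced research cohort and to a setting where 10% have the disease.
ppv_balanced, npv_balanced = predictive_values(0.9, 0.9, prevalence=0.5)
ppv_low_prev, npv_low_prev = predictive_values(0.9, 0.9, prevalence=0.1)
```

Unlike the AUC, both values change with prevalence, so a PPV estimated in a balanced research cohort will generally not carry over to a setting where the disease is less common.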

Validation of results
We found that studies using an independent dataset for validation, as opposed to cross-validation or other similar methods, reported much lower accuracy, particularly when a community-based population was used. For instance, an SVM classifier trained on ADNI and applied to memory clinic data showed markedly reduced accuracy in the clinical setting (AUC = 0.76 for AD diagnosis) compared to that in the training dataset (AUC = 0.96). 57 A few recent studies have addressed the risk of overfitting by assessing generalizability in unseen independent research datasets, 104,165,166 collectively demonstrating the value of this approach in identifying methodological issues relevant to the overall model performance. Therefore, validation studies are critical, particularly those in a memory clinic setting where the tools are ultimately to be used.
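The difference between the two validation strategies is structural. In k-fold cross-validation, every validation sample is drawn from the same dataset as the training samples, as in the minimal sketch below (the function name and sizes are assumptions for illustration). External validation instead fits the model once on the full development dataset and evaluates it on a separately collected cohort.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Return k (train_indices, validation_indices) pairs for one dataset.

    Each sample is held out for validation exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        ]
        splits.append((train_idx, val_idx))
    return splits


# Ten hypothetical participants from a single cohort, split into five folds.
splits = k_fold_splits(n_samples=10, k=5)
```

Because the held-out folds share scanners, recruitment criteria, and demographics with the training folds, cross-validated performance can overstate how a model will behave in a different population.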
The over-reliance on a single dataset such as ADNI introduces potential ethnic and socio-economic biases to models that may hamper generalization, an issue that has been specifically raised in the ADNI dataset. 167 Concerns have been raised more generally about bias in ML models, 168 including in the context of health applications. 169 This is of particular concern in marginalized ethnic groups who have poorer health indicators in general, 170 and who may miss out on access to health services due to socio-demographic, cultural, or religious beliefs, 171 including dementia services. 172,173 More representative datasets are critical for models to translate reliably to all parts of the population, to inform risk prediction models, and work toward closing gaps in health inequality related to dementia. Addressing bias in these collected datasets, differences between genetic or ethnic groups in model performance, and applicability to different socio-economic populations will be critical in ongoing data collection. It is unlikely that a single study or a single dataset can properly address these challenging issues, so collaboration between studies and between countries is required. This is happening to some extent in initiatives such as J-ADNI in Japan, which is almost identical to the North American protocol and has been used to compare diagnosis and progression in dementia between both cohorts. 174 Other examples include the Longitudinal Aging Study in India (LASI-DAD) 175 and initiatives such as the Genetic Frontotemporal dementia Initiative (GeNFI), 176 which recruits multi-nationally. Federated learning may also help address this issue by providing broader accessibility to datasets from diverse backgrounds and international sources.
A number of methodological approaches are available for measuring or mitigating bias. 177 Examples include the geometric solution to learn fair representations (He et al. 2020), 178 which removes correlations between the data and specified protected features, as well as IBM's AI Fairness 360 toolkit (Bellamy et al. 2019), 179 which provides an accessible set of fairness metrics for a model and accompanying explanations to help mitigate bias. We did not find the issue of bias to be discussed or addressed in the studies we reviewed.
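As a simple illustration of measuring one form of bias, the sketch below compares a classifier's sensitivity across two subgroups; the difference between groups is sometimes called an equal opportunity difference. This is a plain-Python sketch with made-up labels and predictions. It does not use the AI Fairness 360 API and does not implement the method of He et al.; the function and group names are hypothetical.

```python
def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity (true positive rate) computed separately per subgroup."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth != 1:
            continue  # sensitivity only considers true cases
        hits, total = counts.get(group, (0, 0))
        counts[group] = (hits + (1 if pred == 1 else 0), total + 1)
    return {group: hits / total for group, (hits, total) in counts.items()}


# Toy data (assumed): 1 = dementia, 0 = control, two demographic groups.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
groups = ["A"] * 6 + ["B"] * 6

rates = sensitivity_by_group(y_true, y_pred, groups)

# A non-zero gap means cases are missed more often in one group.
sensitivity_gap = max(rates.values()) - min(rates.values())
```

Reporting performance stratified in this way requires that datasets record the relevant demographic variables and include enough participants from each group.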

Challenges for the field
Some of the issues we have highlighted can be addressed by individual researchers, but others require engagement from the neuroimaging, ML, and clinical communities more generally. This kind of collaboration has proven successful in initiatives such as ADNI. Although ADNI is a powerful dataset and has facilitated the use of more complex methodologies, similar collaborations for data collection and curation are required to help address ML for non-Alzheimer's neurodegenerative disease, and for EEG data.
Given the challenges of comparisons between studies using different methodologies and definitions, we suggest the field move toward consensus on outcome measures. Diagnostic criteria exist for the major neurodegenerative disorders, but better definitions of 'early' disease, and standard methods to assess prognosis would facilitate model selection. We outline our recommendations in Box 1.
In addition to overcoming these barriers related to transparency, establishing large, diverse datasets, external validation and consensus definitions, we will also need to address translational challenges more broadly to implement AI into real-world clinical settings. 180 Integrating AI will require overcoming technical obstacles, including the different types of bias and artifacts that arise when data are pooled from various sources and institutions, 181 while ensuring the security and privacy of sensitive health records for storage and sharing. 182 Several factors currently limit the adoption of AI tools by clinicians including identity threat, 183,184 disruption of clinical workflow, and the uncertainty surrounding the basis of "black box" algorithms, particularly when the output disagrees with their own clinical judgement. 185 By improving interpretability, explainable AI may be the most amenable approach to building trust and understanding in the medical profession. 186 Furthermore, social and legal issues will require significant attention if implementation of AI into clinical practice is to be successful. For example, there remains uncertainty about which party is responsible when the use of AI tools results in harm from both legal 187 and patient 188 perspectives, while patients in general may prefer human supervision over AI. 189

Limitations
This systematic review has three main limitations. First, although we aimed to provide an informed and broad overview of the existing literature on this subject, our exclusion of reports not written in English and those where the full text was not available meant that some studies which would have otherwise met the inclusion criteria may not have been covered in this review. Two key additional exclusion criteria were the decisions not to include studies using linear regression for classification, and studies combining neuroimaging with other biomarkers without reporting the model performance for the neuroimaging features in isolation. Our motivation was to focus specifically on neuroimaging, and specifically on recognized ML methods, but it is possible we excluded studies with high clinical value and translational potential.
Second, the heterogeneity in classification tasks, ML methods used, and statistical reporting across studies may have introduced bias when deciding which tasks and results to extract. This was a particular issue with the more technical studies, which compared multiple (often > 5) ML methods across three or more classification groups, introducing a large number of comparisons and results to consolidate and extract. For this reason, we decided to run our meta-analysis on a very specific task from which we could extract the AUC value for classifying AD versus healthy controls. This heterogeneity in AI methods, imaging modalities, and patient cohorts also meant that we were unable to provide insight into which features performed best for specific classification tasks. We also do not address significant ethical issues in big data analysis, such as data security, consent to data sharing, and the acceptability of AI methods to clinicians and the general public.
Third, we employed a risk of bias screening tool that depended on a subjective judgment for each paper's inclusion or exclusion, and there may have been heterogeneity in this assessment between screeners.
We chose a low threshold for inclusion based on study quality in order to accurately depict and identify current barriers in the literature limiting translation to clinical practice. We only excluded studies exhibiting clear methodological concerns, such as lack of reporting of basic participant demographics. The screening tool had a binary outcome (inclusion/exclusion), and we were unable to investigate the potential relationship between study quality and ML performance.

CONCLUSIONS
In this systematic review, we generate a number of recommendations to facilitate translation of ML methods for patient benefit in the diagnosis and prognosis of dementia. We highlight issues of methodological heterogeneity, clinical relevance of results, and validation/replication of findings. We offer a set of recommendations to address key gaps in the literature, including the importance of addressing key clinical questions, providing sufficient details of AI methods, and validating findings in independent datasets which are clinically relevant. Looking forward, the field is likely to move toward the establishment of real-world datasets, multi-modal imaging methods, and complex ML algorithms, emphasizing the importance of providing sufficient methodological details to enable independent replication. We are optimistic that addressing these concerns will accelerate the translation of ML methods for patient benefit in neurodegenerative disease.

and organized the symposium from which this paper and others in the series originated, obtained funding, contributed to the conception of the work, revised the manuscript for intellectual content, and harmonized the manuscript with other papers in the series. Ilianna Lourida revised the manuscript for intellectual content and harmonized the manuscript with other papers in the series. All authors read and approved the final manuscript.

• Participants: Patients with cognitive disorders due to neurodegenerative diseases.
• Index: Neuroimaging data assessed with ML for diagnosis and/or prognosis.
• Comparator: Traditional manual/subjective diagnostic/prognostic assessment.
• Outcome: Accuracy of diagnosis and/or prognosis.
• Study design: Controlled study.

RESEARCH IN CONTEXT
1. Systematic Review: We conducted comprehensive searches of MEDLINE, Embase, Cochrane Library, BNI, PsycINFO, CINAHL, and Emcare to identify studies that examine the potential of artificial intelligence (AI) and machine learning methods applied to neuroimaging to inform clinical diagnosis and prognosis in dementia and other neurodegenerative diseases.
2. Interpretation: The use of AI in neuroimaging is expanding rapidly with the evidence base being dominated by studies conducted using the ADNI dataset, algorithmic classifiers, and structural MRI focusing on Alzheimer's disease. Improved diagnostic accuracy was observed when a combination of neuroimaging modalities was used, e.g., PET and structural MRI. Findings also suggest superior performance of discriminative models compared to algorithmic and generative classifiers for the classification of Alzheimer's disease vs healthy controls.
3. Future Directions: We highlight gaps in knowledge, current challenges, and issues to be addressed in future research around reproducibility and reporting, relevant clinical questions, and validation of results. We advocate wider collaboration between clinical, neuroimaging, and data science teams, and present recommendations to move toward clinically useful, machine learning methods applied to neuroimaging for dementia.
We used descriptive statistics to determine the following characteristics of the extracted dataset: source of neuroimaging data, type of neuroimaging used, ML methods, focus on diagnosis and/or prognosis, accuracy of diagnostic/prognostic classifications, and global distribution of first authors' institutions. Studies using MRI were labeled according to the types of features used for the classification task including volumetric structural, non-volumetric structural, and functional MRI. Volumetric structural imaging was defined as MRI methods measuring the volume of specific regions using voxel-based segmentation techniques. Studies were classified as using non-volumetric structural MRI if the features used for classification were related to cortical thickness, texture, or surface area using T1- or T2-weighted images and/or diffusion tensor imaging (DTI) data. The type of AI algorithm used for the diagnostic/prognostic classification task was extracted. Studies which used AI methods for feature extraction but not classification were excluded.

A flow chart of the screening process reported according to the PRISMA 2020 guidelines 35 is shown in Figure 1. The publication time period ranged from 2005 to 2021. The included studies were classified by country based on the institutional affiliation of the first author. The most common countries included China (26%), USA (17%), Italy (7%),

(n = 52), 36 (69.2%) studies used a single cohort, 8 (15.4%) used two cohorts, and 8 (15.4%) used three cohorts. Of those that used ADNI-2 and ADNI-GO (n = 11), a majority (n = 9) also used ADNI-1. Apart from using the ADNI dataset alone, 19 studies used data from ADNI combined with other datasets including the UK Biobank and AIBL. The majority (n = 11) of these combination studies used a local dataset in addition to the ADNI dataset.

FIGURE 1 PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 flow diagram for systematic review outlining the number of studies identified and excluded at each stage.

FIGURE 2 Datasets used across included studies. The majority of studies (n = 181, 71.0%) used the ADNI dataset alone or in combination with another dataset. Local data were used in 69 (27.1%) studies. Multiple studies used a combination of two datasets or more resulting in an overlap between the categories listed here. ADNI = Alzheimer's Disease Neuroimaging Initiative, OASIS = Open Access Series of Imaging Studies, AIBL = Australian Imaging, Biomarker & Lifestyle study of ageing, Bdx-3C = Bordeaux 3 Cities study, BLSA = Baltimore Longitudinal Study of Aging, CADDementia = Computer-Aided Diagnosis of Dementia challenge, NACC = National Alzheimer's Coordinating Center.
Transfer learning was typically used for fine-tuning neural networks, particularly when the authors felt the dataset was not large enough to properly train the neural network algorithm. Accuracy varied between these studies, including for the following classification tasks: AD versus healthy controls (90.4-99.1), MCI versus healthy controls (83.2-99.2), and MCI converters versus non-converters (70.6-81.6).

FIGURE 3 Changes in classification methods over time. This figure shows the rise in the use of discriminative classifiers in the last 4 years. The use of algorithmic classifiers increased up to 2015 and has remained steady since. The use of generative models has stayed relatively stable since their first use in 2005.

Figure 5 .
Structural MRI and PET/SPECT were the most frequently used imaging modalities for diagnosis and prognosis of dementia, being used in approximately 71% and 25% of studies respectively. Around half of studies leveraged structural MRI alone (134 of 255) and those making use of multiple modalities (49 of 255) often used sMRI and PET (35 of 49) together. It is only since 2020 that studies incorporating three or more different modalities have begun to [...] structural imaging features for diagnosis (n = 24) and/or prognosis/conversion (n = 7). The input consisted of T1- or T2-weighted images, DTI data, or a combination thereof, to estimate non-volumetric features such as cortical thickness, texture, and surface area. These studies focused on (i) optimization of image pre-processing techniques, (ii) investigation of feature selection methods, and (iii) optimization of classifiers and subsequent validation of the developed method.

BOX 1: Recommendations to move toward clinically useful, machine learning methods applied to neuroimaging for dementia

Recommendations for machine learning studies

Methodological considerations
• Provide sufficient description of the methods, with available code, to enable independent replication
• Use a staged approach to model validation
• Pre-register analysis
• Consider using multiple modalities

Addressing key clinical questions
• Clearly state the diagnostic criteria used
• For diagnosis, report performance in terms of ROC curve analysis, including PPV and NPV, and confidence intervals
• Clearly define measures of prognosis, and consider the use of odds ratios and survival analysis

Validation
• Independently validate models in at least one independent dataset
• Validate findings in a real-world dataset (e.g., memory clinics)

Recommendations for the field more broadly
• Work toward consensus on outcome measures for diagnosis and prognosis
• Establish large datasets of non-AD and/or multiple types of dementia
• Establish open datasets for EEG comparable to those with MRI
• Monitor ethnic and sociodemographic bias in data collection and encourage cross-study collaboration to address these biases
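
As an illustration of the reporting recommended in Box 1 for diagnostic studies, the following minimal Python sketch computes an AUC, PPV, and NPV with confidence intervals. It is not taken from any of the reviewed studies. The function names, the choice of a rank-based AUC estimate, and the use of the Wilson score interval are assumptions made for this example; other interval methods (e.g., bootstrap) may be equally appropriate.

```python
import math


def auc_rank(scores_pos, scores_neg):
    """Rank-based AUC: the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Ties count as half a win."""
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.
    z = 1.96 corresponds to an approximate 95% interval."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def predictive_values(tp, fp, tn, fn):
    """PPV and NPV from confusion-matrix counts, each paired with its
    Wilson interval. Assumes at least one predicted positive and one
    predicted negative."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "ppv": (ppv, wilson_interval(tp, tp + fp)),
        "npv": (npv, wilson_interval(tn, tn + fn)),
    }


# Hypothetical classifier scores and counts, for illustration only.
example_auc = auc_rank([0.9, 0.8, 0.35], [0.7, 0.3, 0.2])
example_pv = predictive_values(tp=40, fp=10, tn=45, fn=5)
```

PPV and NPV depend on the prevalence of disease in the sample, so values obtained in a research cohort with balanced groups may not carry over to a memory clinic population. This is one reason to report them alongside validation in a real-world dataset.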