Predictive modelling of brain disorders with magnetic resonance imaging: A systematic review of modelling practices, transparency, and interpretability in the use of convolutional neural networks

Abstract

Brain disorders comprise several psychiatric and neurological disorders which can be characterized by impaired cognition, mood alteration, psychosis, depressive episodes, and neurodegeneration. Clinical diagnoses primarily rely on a combination of life history information and questionnaires, with a distinct lack of discriminative biomarkers in use for psychiatric disorders. Symptoms across brain conditions are associated with functional alterations of cognitive and emotional processes, which can correlate with anatomical variation; structural magnetic resonance imaging (MRI) data of the brain are therefore an important focus of research, particularly for predictive modelling. With the advent of large MRI data consortia (such as the Alzheimer's Disease Neuroimaging Initiative) facilitating a greater number of MRI‐based classification studies, convolutional neural networks (CNNs)—deep learning models well suited to image processing tasks—have become increasingly popular for research into brain conditions. This has resulted in a myriad of studies reporting impressive predictive performances, demonstrating the potential clinical value of deep learning systems. However, methodologies can vary widely across studies, making them difficult to compare and/or reproduce, potentially limiting their clinical application. Here, we conduct a qualitative systematic literature review of 55 studies carrying out CNN‐based predictive modelling of brain disorders using MRI data and evaluate them based on three principles—modelling practices, transparency, and interpretability. We propose several recommendations to enhance the potential for the integration of CNNs into clinical care.

Brain disorders comprise several neurological and psychiatric conditions characterized by a variety of features, including impaired cognition, altered mood states, psychosis, neurodegeneration, and memory loss (American Psychiatric Association, 2013). These phenotypes incur public and personal health burdens through reduced quality of life, social stigma, and increased mortality (American Psychiatric Association, 2013; James et al., 2018) and are therefore the focus of intense research. In particular, there is significant interest in building predictive models designed to differentiate conditions and their subtypes. Biomarkers identified using predictive modelling approaches could yield mechanistic insights into these diseases (Kupfer et al., 2008; Taber et al., 2010), offering the potential for early intervention and improved disease management (Shah & Scott, 2016).
Magnetic resonance imaging (MRI) provides non-invasive measures of brain structure, and the increasing availability of large-scale collections of MRI data has enabled a wealth of predictive modelling studies (Grover et al., 2015; Milham et al., 2017).
Previously, machine learning and classical statistical approaches have been used to highlight differential neuroanatomical patterns across several conditions, including subcortical structure volume reduction in bipolar disorder and Alzheimer's disease (Hibar et al., 2016; Roh et al., 2011). However, incorporating such information into clinical systems is non-trivial, as the dynamics and limitations of a particular biomarker must be addressed prior to use (Carroll, 2013; Furiea & Gisele, 2009). Additionally, the methods used to identify discriminative features have their own considerations, such as requiring preprocessing tools to derive tabular brain summary information (Jenkinson, 2012; Reuter et al., 2012). These tools can produce variable results depending on the parameters chosen, even when applied to the same dataset, highlighting the importance of domain expertise in justifying data processing decisions (Botvinik-Nezer et al., 2020). Additionally, statistical modelling often requires formal specification of expected variable relationships and is generally unsuited to high-dimensional imaging data structures. Traditional machine learning approaches are also limited by their inability to consider spatial dependencies between groups of pixels, making it necessary to use tabular summary data. Deep learning algorithms have therefore become a popular methodology: they can capture arbitrarily complex relationships, providing greater model flexibility, and do not require expected variable relationships to be specified in advance. Convolutional neural networks (CNNs) are deep learning models designed to detect spatial patterns in imaging data and have shown impressive predictive performance in various classification tasks. They have also been widely applied in the field of medical imaging for segmentation and prediction, particularly in the context of ageing and psychiatric/neurological disorder diagnosis (Kamnitsas et al., 2017; Simonyan & Zisserman, 2015; Ueda et al., 2019; Zou et al., 2017). These recent developments have been enabled by access to large standardized neuroimaging data collections, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Jack Jr et al., 2008) and the UK Biobank (Littlejohns et al., 2020). However, a few caveats bear consideration. Firstly, deep learning models can suffer from multiple limitations, such as high parameter dimensionality, lack of interpretability, random weight initialization, lack of uncertainty quantification, and difficulty of training (LeCun et al., 1995; LeCun et al., 2012; Yam & Chow, 2000; Zhang et al., 2018). Secondly, clinical decision systems require rigorous validation and reporting frameworks designed for more interpretable models; the use of opaque deep learning algorithms makes validation and transparency more difficult to achieve (Collins et al., 2015; Haibe-Kains et al., 2020). Clinical decision systems that offer no explanation of an output are less likely to be incorporated into patient care frameworks. These factors combine to make the application of deep learning in clinical settings challenging.
As the number of studies applying deep learning to brain disorder prediction using neuroimaging data increases, the opportunity arises to examine factors which may limit their potential for clinical application. In this work, we systematically review 55 papers which report on such approaches. While many of the studies examined have been designed to demonstrate predictive capability, we sought to assess the existing literature with the aim of identifying key principles that can maximize the potential clinical value of future work. These principles are: (1) modelling practices, (2) transparency, and (3) interpretability. Below, we first provide a brief overview of CNNs and their workflow in the context of brain disorder imaging-based models; we subsequently detail our motivation for focusing on these three principles. We then analyze the selected articles in the context of these principles and propose several recommendations for future studies based on our results.

| Convolutional neural networks
CNNs are a popular deep learning algorithm for many areas of biomedical imaging research, particularly those utilising MRI data (Hosseini-Asl et al., 2018; Kamnitsas et al., 2017; Simonyan & Zisserman, 2015; Zou et al., 2017). Their structure is designed to account for spatial data patterns; this is accomplished through the use of filters and feature maps. A feature map is derived via convolutional operations, in which a weight matrix (filter) of arbitrary window size is multiplied element-wise with an input image patch of the same size and the products are summed. Owing to the fact that many existing CNN models have been applied to 2D data domains, studies in the medical imaging field can adapt their data to fit existing architectures and make use of existing weights via transfer learning. Alternatively, researchers can train new models that operate on 3D data, as structural MRI scans are natively 3D (Billones et al., 2016; Weiss et al., 2016). Some studies also train custom architectures on 2D data (Aderghal et al., 2017; Barbaroux et al., 2020; Pelka et al., 2020). The CNN output is usually presented as a probability, which is then used to calculate performance metrics, such as the area under the receiver operating characteristic curve (AUC) and accuracy.
In the following sections, we define and justify our emphasis on modelling practices, transparency, and interpretability in the context of brain disorder classification using neuroimaging data. We note that these principles are domain-agnostic and overlap with recent recommendations for improving the translational potential of machine learning and deep-learning models (Walsh et al., 2021).

| Modelling practices
Modelling practices here refers to the reliability of the methodology used. Studies that can be reproduced and that have attempted to mitigate factors that can influence the reliability of results are more likely to be integrated into clinical care settings. We examine the use of repeat experiments, the data splitting procedure, the reported accuracy, and the data representation strategy to evaluate this principle.
Repeat experiments ensure that the reported performance metrics are trustworthy across multiple random weight initializations and that the system as a whole can be expected to perform well if retrained. This is pertinent given that CNNs are parameter-dense, making them more prone to overfitting. A useful type of repeat experiment is k-fold cross-validation, whereby data are split into k folds and k − 1 folds are used to train the model, with the k-th fold serving as the testing set. This procedure is repeated k times until every fold has served as the testing set, providing an estimate of model performance across multiple data splits.
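The k-fold procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration only: the `train_and_score` callable is a hypothetical stub standing in for an actual CNN training and evaluation run.

```python
# Minimal sketch of k-fold cross-validation, stdlib only.
# `train_and_score` is a hypothetical stub for a real CNN training run.
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and partition them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_and_score, seed=0):
    """Each fold serves once as the test set; the other k-1 folds train."""
    folds = kfold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return scores

# Toy scoring stub: report the test-fold size as a stand-in for accuracy.
scores = cross_validate(10, 5, lambda tr, te: len(te))
```

Because every sample appears in exactly one test fold, the k per-fold scores can be summarized as a mean and standard deviation, giving a performance estimate across multiple data splits.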
The reported accuracy is the final performance of the model as estimated from an evaluation strategy, which can include k-fold cross-validation, performance estimation on a separate test split within the same population, or estimation on a separate population. The overall capacity of a model to classify a brain disorder with fidelity in a generalizable manner will ultimately dictate its potential for clinical use.
The data representation strategy is of specific importance for CNN models in this domain, as structural MRI data are natively 3D, with each intensity value represented by a voxel. Modelling entire volumes can thus be computationally expensive, and some studies may opt to split data into individual 2D slices. This comes with a set of considerations. Firstly, each 2D slice is treated as an individual instance during conventional training procedures, meaning that performance metrics can either be reported per slice or combined to derive patient-level quantities, prompting consideration of voting strategies. Secondly, 2D data are more prone to information leakage: if train-test splitting is carried out after 2D slice derivation, the model may be exposed to data from the same patient during both training and evaluation, which can inflate performance estimates. Studies that take multiple 3D patches per patient face similar issues (Goldacre et al., 2019).
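The leakage consideration above amounts to splitting at the patient level before deriving slices. The following stdlib-only sketch illustrates that ordering; the patient IDs, slice counts, and the `slices_for` helper are hypothetical stand-ins for real volume handling.

```python
# Hedged sketch: derive 2D slices only AFTER a patient-level train/test
# split, so no patient contributes slices to both sets.
import random

def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Split patient IDs (not slices) into train and test groups."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]          # train, test

def slices_for(patients, n_slices=3):
    """Stand-in for extracting n axial slices per patient volume."""
    return [(pid, s) for pid in patients for s in range(n_slices)]

patients = [f"sub-{i:03d}" for i in range(10)]   # illustrative patient IDs
train_p, test_p = split_patients(patients)
train_slices, test_slices = slices_for(train_p), slices_for(test_p)
```

Splitting in the opposite order (slice first, then split) would allow slices from one patient to land in both sets, which is precisely the leakage scenario described above.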

| Transparency
There are many hyperparameters associated with deep learning models, the choice of which can greatly impact predictive performance; clear and detailed reporting of these values and the mechanism of their choice is therefore an important aspect of model transparency, as are comprehensive descriptions of model architecture and training schedules. Where possible, direct sharing of model weights is encouraged, as it not only improves transparency but can also mitigate the large computational overhead of model training. In addition to model transparency, an explicit description of the data sources and demographics which yielded the reported results is vital not only to the reproducibility of a study but also to understanding any potential biases contained in the data.

| Interpretability
Interpretability refers to the efforts made to explain the features driving model predictions. Deep learning systems can be difficult to interpret, but efforts can be made to highlight image regions that are used during prediction to determine whether or not that information is relevant. This is particularly important as CNNs are prone to overfitting and can make use of any image feature, in turn making algorithmic biases likely (Hooker, 2021; Lepri et al., 2018). Ensuring that CNNs are using relevant information can increase confidence in the system. Models can be interpreted by saliency methods such as gradient-based class activation mapping (Selvaraju et al., 2016; Simonyan & Zisserman, 2015).
These approaches rely on deriving the gradient of a model's output with respect to its input and weighting that quantity by the input; the final metric is then overlaid on the input for visualisation. This can indicate which regions are most 'important' for prediction, but such importances are not directly comparable to coefficients from classical regression models. Another approach to understanding model decisions is the use of counterfactuals, which involves measuring changes in the predictive performance of a model when it is exposed to inputs with known qualities. An example would be noting the change in model output when a patient image with a thicker amygdala is used as the input (Keane & Smyth, 2020). We assess the interpretability of a study based on the use of methods that produce a saliency map (such as Selvaraju et al., 2016 or Simonyan & Zisserman, 2015), which are gradient-based, or which provide visualisation of internal feature map outputs.
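The gradient-times-input idea can be made concrete on a toy linear scoring function, where the gradient is available in closed form (for a real CNN it would come from backpropagation). The weights and flattened "image" below are invented for illustration only.

```python
# Toy illustration of 'gradient x input' saliency on a linear score w·x.
# The gradient of w·x with respect to x is simply w, so the saliency of
# each pixel is w_i * x_i. Weights and image values are hypothetical.
def saliency_map(weights, image):
    return [w * x for w, x in zip(weights, image)]

weights = [0.5, -1.0, 2.0, 0.0]   # hypothetical model weights
image = [2.0, 1.0, 3.0, 4.0]      # flattened toy 'image'
sal = saliency_map(weights, image)

# The pixel with the largest absolute saliency drives the prediction most.
most_important = max(range(len(sal)), key=lambda i: abs(sal[i]))
```

Note that the last pixel has a large intensity but zero weight, so its saliency is zero; this is the sense in which saliency highlights what the model *uses*, not what is merely bright in the input.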

| METHODS
We conducted a systematic literature review according to PRISMA guidelines (Page et al., 2021), the details of which are provided below.

| Inclusion/exclusion criteria
We limited our search to studies making use of traditional CNN architectures exclusively, whereby convolutional layer outputs, or other model outputs, are not used to train separate machine learning models. As these are the most common types of architecture, this better enables comparisons across studies. We also focused our attention on studies that use structural MRI data, as functional MRI data structures can often have different modelling requirements, including the use of time series methodologies that make them more difficult to compare between studies.

| Search details
We performed a Web of Science (all databases) and PubMed search with the following keywords. The Web of Science search returned 77 results and the PubMed search returned 114. Titles and abstracts were screened for relevance to the research question, and duplicates across both databases were removed, leaving a total of 74 papers. Nineteen studies were excluded for using functional MRI data or for applying hybrid models where CNNs were not the primary modelling method; this resulted in a total of 55 papers remaining for review. A flowchart of this process is presented in Figure 2.

| Desired variables
A standardized questionnaire was designed to evaluate the methodological details of the studies considered, including the presence or absence of repeat experiments, the overall data representation strategy, the reported accuracy, the sample size, the data source, whether or not an interpretability method was applied, and whether or not code was made available. To obtain accuracy measures, we recorded the highest performance in testing experiments as reported by authors for the main classification task. For example, certain studies made use of 3-way classifiers for Alzheimer's disease, cognitive impairment, and controls; in these cases, we took only the Alzheimer's versus control reported accuracy. We marked sample size as NA where patient-level data numbers were not reported. We used t-tests to determine whether or not accuracy varied statistically across binary categories and carried out a linear regression to examine the relationship between accuracy and sample size. We also carried out a linear regression of accuracy against all measured variables (except code availability) to query their relationships with reported accuracy.
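The accuracy-versus-sample-size regression can be sketched with ordinary least squares using only the standard library. The data points below are invented for illustration; the review's actual values come from the 55 surveyed studies.

```python
# Stdlib-only ordinary least squares: slope, intercept, and R^2 for a
# simple accuracy-vs-sample-size regression. Data are hypothetical.
def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

sizes = [100, 300, 500, 900, 1500]       # hypothetical study sample sizes
accs = [95.0, 93.0, 90.0, 88.0, 85.0]    # hypothetical reported accuracies (%)
slope, intercept, r2 = ols(sizes, accs)  # slope < 0: accuracy falls with n
```

A negative slope here corresponds to the inverse accuracy/sample-size relationship discussed in the Results; in practice a p-value for the slope would also be computed to assess significance.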

| RESULTS
We organise our findings according to our three principles: modelling practices, transparency, and interpretability. The selected articles and their attributes can be found in Table 1, and a numerical summary of all results can be found in Table 2.

| Modelling practices across studies
We found that 24 out of 55 papers represented data in 2D format (Table 2). While this is more computationally efficient than 3D, it can make information leakage more likely. Accuracy calculation can be carried out per slice or per patient, introducing issues surrounding optimal voting strategies. Of the 24 studies making use of 2D slices, only one referred to voting methods (Ahmed et al., 2020). Several studies made use of single slices per patient (Aderghal et al., 2017; Herzog & Magoulas, 2021; Mendoza-Léon et al., 2020). One paper making use of 2D slices provided code detailing how individual 3D patient volumes were split into collections of 2D images (Sarraf et al., 2019).
We noted that 24 out of 55 studies made use of multiple models for training and prediction, with some papers using the output of one trained CNN as the input to another (Cui & Liu, 2019a; Li & Liu, 2018; Lian, Liu, Pan, & Shen, 2020; Liu, Zhang, Adeli, & Shen, 2018; Liu, Zhang, Nie, et al., 2018). This may impact generalization by increasing the chances of overfitting. Several studies used statistical tests to pre-select informative image patches, which can introduce bias by focusing the model on regions that may not be informative in full models (Liu, Zhang, Adeli, & Shen, 2018; Liu, Zhang, Nie, et al., 2018; Mendoza-Léon et al., 2020). Furthermore, pre-selecting regions based on accuracy metrics in one population may influence generalization capacity in another. In several studies, one model was trained on the whole dataset and the weights from that model were then used for transfer learning of another model with the same dataset, leading to potential leakage or overfitting (Ahmed et al., 2020; Folego et al., 2020; Lin et al., 2018; Mendoza-Léon et al., 2020; Pelka et al., 2020). We note that while overfitting mitigation strategies can be employed, in cases where weight training has been informed by access to testing labels, no degree of post-leakage mitigation can remedy these specific effects. Thirty out of 55 studies employed repeat experiments, 10 of which reported only point estimate performance metrics. We found that the mean accuracy across all studies was 89.36 ± 8.694% (mean ± SD). There was no significant difference between reported accuracy metrics across any questionnaire category based on t-tests and linear regression; however, accuracy did appear to be inversely correlated with sample size (Figure 3). Further, a regression of accuracy against all variables yielded non-significant test statistics for every coefficient. The mean sample size of the 55 studies was ≈828 ± 691.

| Transparency and interpretability considerations across studies
We found that 49 out of 55 papers did not provide code or model weights, meaning that the majority of studies relied on textual methods summaries. This implies limited methodological transparency, which is an issue considering how modelling choices can impact system performance. Studies providing code facilitate clear, reproducible experimental practices (Böhle et al., 2019; Folego et al., 2020; Hu et al., 2021; Qiu et al., 2020; Sarraf et al., 2019; Spasov et al., 2019).
Forty-four of the 55 studies made use of the ADNI dataset during either training or testing. Every study made explicit mention of the database from which their samples came and textually described their preprocessing pipeline.

| DISCUSSION
Below, we discuss the findings summarized in Table 2 and propose several recommendations to maximize the potential clinical value of future studies making use of CNNs to predict brain disorders from structural neuroimaging data.

F I G U R E 2 Flowchart detailing the article selection process.
T A B L E 1 Tabular presentation of the studies considered for this systematic literature review.

| Modelling practices
While 3D modelling can consider the spatial dependencies across all three axes of brain data, it is unclear what the benefits of focusing on smaller 3D regions are compared to entire 2D images along one axis. A significant minority of papers made use of 2D data structures (24/55), which provide an attractive alternative considering the high computational burden of modelling in 3D and the ability to capture all brain information along one dimension. Workflows making use of multiple 2D images per patient can, however, be prone to information leakage and may report testing accuracies by different means, thus requiring a greater level of care. We found no statistically significant differences between reported accuracies across data representations (2D = 91.5 ± 6%, 3D = 88 ± 9%). This suggests that if performance inflation has occurred, it does not appear to be enriched for a specific data representation strategy. Nevertheless, researchers should be cognizant of the individual limitations associated with each experimental approach and proactively address issues where possible. For example, information leakage can be mitigated by ensuring slice conversion is carried out only after patient-level data splitting.

| Repeat experiments
Most studies implemented repeat experiments via cross-validation, which can account for performance estimation variation caused by weight initialization stochasticity and fold splitting. While studies varied in the amount of data available for training, and consequently in the number of folds they considered during cross-validation, evidence of repeat experiments greatly increases the reliability of reported performance metrics. As highlighted in Hutson (2018), reproducibility is not guaranteed even when code is provided, making repeat experiments particularly important. Twenty-five of the 55 considered articles did not employ repeat experiments, which reduces confidence in their reported results. A number of studies using repeat experiments reported only point estimates, which do not fully describe the range of performance metrics, potentially leading the reader to underestimate the variation in performance. Code inaccessibility exacerbates this issue, leaving the reader unclear as to the procedure followed.
We again found no significant difference between accuracy metrics reported across repeat experiment procedures (repeat experiment studies = 87.944 ± 10.43%, non-repeat experiment studies = 91.58 ± 5.025%), although this does not diminish the importance of carrying out repeat experiments. We recommend that researchers continue to employ repeat experiments and report their results with means and standard deviations.
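Reporting mean and standard deviation rather than a single point estimate is straightforward; a minimal sketch, with invented fold accuracies standing in for real repeat-experiment results:

```python
# Summarize repeat-experiment results as mean +/- standard deviation.
# The five fold accuracies below are hypothetical 5-fold CV results.
import statistics

fold_accuracies = [88.2, 91.5, 86.9, 90.1, 89.3]
mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)   # sample standard deviation
report = f"{mean_acc:.2f} +/- {std_acc:.2f}%"
```

The standard deviation conveys the spread across folds that a point estimate hides, which is exactly the information a reader needs to judge performance stability.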

| Code availability
Most studies did not provide code. As detailed in Wen et al. (2020), the principles of fairness, accountability, and transparency are of paramount importance for deep learning modelling studies, and code inaccessibility is a significant obstacle to their realization. The construction of deep learning systems requires many algorithmic decisions which can influence performance, introduce bias, and impact reproducibility. Deep learning models optimize an objective function over a set of arguments, meaning that any decisions taken in preprocessing and model construction can affect the capabilities of the system as a whole and propagate subjective choices throughout ostensibly objective models (Hooker, 2021). For instance, several studies have examined algorithmic biases against underrepresented and/or marginalised groups (Bagdasaryan et al., 2019; Buolamwini & Gebru, 2018; Diakopoulos, 2015). Aside from domain-specific benefits of code sharing, the larger scientific community has recently shifted towards open science frameworks, with several high-profile journals requiring methodological transparency (Eglen et al., 2017; Stodden, 2011; Nature editorial policies, 2021; Science editorial policies, 2021). Therefore, we believe that code availability and other literate programming tools should be embraced to enhance understanding and reproducibility (Eglen et al., 2017; Google, 2018; Kluyver et al., 2016).
This would also encourage accountability by allowing researchers to examine pipelines interactively and to identify any potential 'blind spots' that the model authors may have overlooked in their modelling decisions (Wen et al., 2020). Additionally, because model training is often computationally intensive, having access to models trained in similar domains could enable transfer learning approaches, accelerating scientific discovery in this domain. Therefore, we recommend that authors share model weights and code to facilitate reproducibility and increase the potential for clinical translation.

| Saliency and interpretability
We found that many studies did not interrogate their presented models to ensure that relevant information is being used to make predictive decisions. Where irrelevant information is included, clinical utility will be severely limited. Deep learning studies have traditionally focused on prediction as opposed to inference, meaning that mechanistic understanding of relationship dynamics is often secondary to test accuracy. This is challenging in both discovery and clinical settings. Furthermore, saliency methods have their own limitations arising from their algorithmic derivation of importance, which can affect interpretation (Adebayo et al., 2018). Similarly, while counterfactuals are promising, they are difficult to empirically measure and require significant computational overhead. Nonetheless, interpretability efforts allow researchers to visually evaluate model 'attention', which can serve to increase confidence and reduce bias overall, a topic of concern regarding the application of models to society at large (Diakopoulos, 2015; Hooker, 2021).

| Accuracy metrics, sample sizes, and data sources
We found that, overall, studies reported impressive predictive accuracies in their primary modelling questions (89.36 ± 8.694%), underscoring the potential of deep learning models to aid clinical decision making. This makes careful consideration of the principles outlined here all the more important. Despite observing no significant differences in accuracies stratified by questionnaire categories, we highlight the importance of applying these principles from a qualitative standpoint.
Studies with high accuracies that have applied repeat experiments and carefully considered data representation strategies can elicit more trust. This trust can be further enhanced by making code available so that results can be reproduced, with the additional benefit of allowing researchers to apply trained models to their own data. One significant barrier to full reproducibility in this context is data privacy concerns, which may limit the potential for release of a fully reproducible paper.
On average, sample sizes were large, although there was a high degree of variation (828 ± 691). While there was no significant relationship between sample size and accuracy, there appears to be a weak negative correlation between the two variables, even in spite of large database crossover between studies, with 44 studies making use of ADNI (R² = 0.055, p = 0.09). This suggests that increased sample size may help to reduce bias in performance estimation.
Additionally, we note that every study detailed their preprocessing pipeline textually and explicitly stated their database source. This is an essential aspect of data transparency that should continue to be universally embraced by future authors. Our observation of 44 studies of a total pool of 55 making use of the same database speaks to the importance of the ADNI consortium, but may also indicate that this cohort of patients is overrepresented in the literature. This may be an issue when considering site effects in neuroimaging studies (Bayer et al., 2022). Future efforts to diversify data sources in this domain are dependent on establishing accessible and robust data consent frameworks that represent different data demographics.

| Future perspectives and commentary
This systematic literature review highlights areas of focus across modelling practices, transparency, and interpretability in the context of maximizing the potential for clinical utility and reproducibility.
These points underscore long-standing differences between deep learning and classical statistics, whereby the former is usually concerned with predictive performance and the latter with making inferential statements. We encourage future studies to embrace the principles of reproducibility, transparency, and interpretability for predictive models. This will increase confidence in such methods and accelerate the path to future clinical integration.
We summarise our key recommendations in Table 3.

| LIMITATIONS
This work reviewed studies from two database sources but is not guaranteed to have evaluated all available relevant research. This study also did not consider studies making use of functional neuroimaging data sources, which comprise a large corpus of research. We did not endeavour to comprehensively identify potential information leakage; an in-depth consideration of this concept is explored in Wen et al. (2020). Additionally, while we encourage the use of interpretability methods, we acknowledge the multiple drawbacks which may limit their utility and application. We further note that it is difficult to identify a unified set of optimal experimental parameters across every context; our commentary is designed to draw attention to the limitations arising from specific procedures and to encourage researchers to carry out experiments that mitigate these issues as much as possible. We also note that fair comparison of reported accuracies across a myriad of diverse studies is extremely challenging.
Finally, we have endeavoured to ensure that our evaluation is neither reflective of overall study quality nor reductive with respect to the three nuanced principles introduced; our binary descriptors are intended to serve as a vehicle to discuss important concepts and to encourage continued careful research into brain conditions using CNN-based predictive models.
We conducted a systematic literature review of 55 studies carrying out CNN-based predictive modelling of brain disorders using structural brain imaging data and evaluated them in the context of their modelling practices, transparency, and interpretability. We provided recommendations that we believe will increase the potential clinical value of deep learning systems in this domain. Careful consideration of these concepts can help to inform a clinical framework that can effectively incorporate deep learning into diagnostic and prognostic systems, improving patient care.
T A B L E 3 Key recommendations arising from the results of this systematic literature review, their benefits, and the risks associated with non-adherence.
Every value in the input window is multiplied by the corresponding value in the filter and the products are summed, providing the value of one pixel in the feature map at the next layer. Convolving the same filter over every patch of the input image generates the entire output feature map, which, when padding is used, can be the same size as the input image. Multiple feature maps are used in CNN architectures, each with its own filters, which, throughout model training, can come to detect distinct data patterns such as shapes and/or edges. CNNs build increasingly abstract representations of input data through iterative transformation operations, with each variable at successive layers being a weighted sum of the previous layer's outputs. Terminal fully connected layers provide a predictive output. Weight initialization is often random, and training is carried out via backpropagation. A more in-depth consideration of neural networks and their training can be found in LeCun et al. (1995) and LeCun et al. (2012).
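The window-wise multiply-and-sum described above can be sketched directly. This is a minimal stdlib-only illustration: no padding is used, so the feature map shrinks ('valid' convolution), whereas real CNN layers often pad the input to preserve size; the toy image and edge filter are invented.

```python
# Stdlib sketch of a 2D convolution: each value in the input window is
# multiplied by the corresponding filter weight and the products summed,
# yielding one feature-map value per window position.
def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

# A vertical-edge filter on a toy image with a sharp left/right contrast.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
kernel = [[1, -1],
          [1, -1]]
fmap = convolve2d(image, kernel)   # strongest response at the edge column
```

The feature map responds only where the bright and dark regions meet, illustrating how a trained filter can come to detect edges or shapes.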

Figure 1 or a variant thereof. Preprocessing is usually applied to skull strip, register raw input images, crop, resize, and/or contrast normalize. The preprocessed inputs are then used as training data for a CNN.

Transparency refers to how clearly a study's methods are reported, including the use of code and model sharing. Several important advantages of code sharing have been described previously, including the

FIGURE 1 General experimental workflow. The preprocessed input image, in either 2- or 3-dimensional format, is passed to a CNN model (or ensemble of CNN models) for training and prediction. The weight vector, w, is updated via backpropagation at each epoch (training iteration), minimizing the error of the loss function chosen to train the model.
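The epoch-wise update of the weight vector w described in the Figure 1 caption can be illustrated with a deliberately minimal single-layer model trained by gradient descent (a hypothetical toy sketch on synthetic data, not a CNN from the reviewed studies):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification: 2 features, labels determined by a linear rule
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = rng.normal(size=2)   # weight vector w, randomly initialized
b = 0.0
lr = 0.5                 # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):                 # one epoch = one training iteration
    p = sigmoid(X @ w + b)               # forward pass (predicted probabilities)
    grad_w = X.T @ (p - y) / len(y)      # gradient of cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)
    w -= lr * grad_w                     # gradient-descent update of w
    b -= lr * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

In a deep network the same update is applied to every layer's weights, with the gradients propagated backwards through the chain of transformations.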

(or patch) conversion after data splitting at the patient level, which can be verified by providing well-annotated code. Additionally, where multiple slices or patches have been used per patient, the voting strategy should be explicitly detailed. Utilizing single 2D slices may inflate performance estimates, because there is no guarantee that the same biological information is being considered for every patient at the same slice index. Several studies also made use of model stacking, whereby the input of one model is the output of another trained model. This may impair the model's ability to generalize to different data by increasing the chance of overfitting, because the first model in a stacking configuration has already derived a representation of the data informed by test labels. This bias is distinct from using traditional unsupervised dimensionality reduction techniques to derive an input for a subsequent predictive model. Additionally, deep learning systems can be opaque, making it difficult to understand the first model's data representation and, consequently, the properties of the input used for the final predictive model.
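The patient-level splitting requirement can be made concrete with a short sketch. The patient and slice counts below are hypothetical; the key point is that the random split operates on patients, and slices inherit their patient's assignment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 patients, 5 axial slices each. Splitting must
# happen at the patient level BEFORE slice extraction, so that no
# patient contributes slices to both the training and test sets.
n_patients, slices_per_patient = 10, 5
patients = np.arange(n_patients)

shuffled = rng.permutation(patients)
n_test = int(round(0.3 * n_patients))
test_patients = set(shuffled[:n_test])
train_patients = set(shuffled[n_test:])

# Each slice is tagged with its patient ID and inherits the patient split
slice_patient = np.repeat(patients, slices_per_patient)
train_mask = np.isin(slice_patient, list(train_patients))
test_mask = ~train_mask

assert train_patients.isdisjoint(test_patients)   # no leakage across sets
print(f"{train_mask.sum()} training slices, {test_mask.sum()} test slices")
```

Splitting at the slice level instead would scatter a patient's near-identical slices across both sets, inflating test performance through leakage; group-aware utilities (e.g. scikit-learn's GroupShuffleSplit) implement the same idea.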
= 38), Yes (n = 17). Are there repeat experiments? No (n = 25), Yes (n = 30).

Transparent methodological descriptions are important aspects of deep learning experiments in this domain, independent of potential clinical applications. Within a patient-care context, we underscore the importance of constructing reproducible systems to increase trust, from both a clinician and a patient perspective. We further encourage the exploration of minimal Jupyter/Google Colab notebooks.

FIGURE 3 Plots of accuracy variation across binary categories (a = interpretability, b = representation, c = repeat experiments) and sample size (d). In plots a-c, studies where accuracy was not reported were excluded. (a) Violin plot of accuracy across interpretability categories (t-test p > 0.05). (b) Violin plot of accuracy across data representation categories (t-test p > 0.05). (c) Violin plot of accuracy across repeat experiment categories (t-test p > 0.05). (d) Scatter plot of sample size (x-axis) versus accuracy (y-axis), where the correlation was non-significant (p > 0.05).

When modelling Alzheimer's disease neurodegeneration, for example, it is important to verify that factors such as skull thickness are not significantly weighted by the model. Even in cases where known irrelevant information can be removed by preprocessing, visual maps can draw attention to global patterns that may highlight the biases of models. As previously stated, algorithmic biases in predictive settings are concerning, and saliency methods can help researchers identify sources of bias. Additionally, attempting to understand the image features driving model predictions can help to relate new models to previous findings. These methods may also be used to generate new hypotheses and discover novel biomarkers, for example by highlighting neuroanatomical regions discriminative for particular conditions, which may suggest they have mechanistic relevance. In our review, 17 studies investigated the neuroanatomical features driving model predictions via interpretability methods, thus increasing the potential to highlight sources of bias in model training. Interpretability methods, however, have several limitations that may restrict their utility, requiring careful consideration of how best to understand opaque models. Most existing methods deriving a saliency map return an 'importance' value per pixel, which has no direct link to human-interpretable neuroanatomy. Usually, this value represents the degree of change in the output relative to a small perturbation of the input pixel, collapsing a potentially non-linear relationship to a single value. While it provides an empirical assessment of captured patterns and is a useful visual aid, it offers little interpretative value compared to the coefficients returned by classical statistical models. The deep learning field in general has been historically