Artificial intelligence for dementia genetics and omics

Genetics and omics studies of Alzheimer’s disease and other dementia subtypes enhance our understanding of underlying mechanisms and pathways that can be targeted. We identified key remaining challenges: First, can we enhance genetic studies to address missing heritability? Can we identify reproducible omics signatures that differentiate between dementia subtypes? Can high-dimensional omics data identify improved biomarkers? How can genetics inform our understanding of causal status of dementia risk factors? And which biological processes are altered by dementia-related genetic variation? Artificial intelligence (AI) and machine learning approaches give us powerful new tools in helping us to tackle these challenges, and we review possible solutions and examples of best practice. However, their limitations also need to be considered, as well as the need for coordinated multidisciplinary research and diverse deeply phenotyped cohorts. Ultimately AI approaches improve our ability to interrogate genetics and omics data for precision dementia medicine.


INTRODUCTION
Dementia results from a variety of heterogeneous pathologies, such as Alzheimer's disease (AD), Parkinson's disease dementia (PDD), dementia with Lewy bodies (DLB), frontotemporal dementia (FTD), and cerebrovascular disease. 1 The number of people living with dementia worldwide is around 45 million and, as life expectancy increases and populations age, this number is expected to increase. 2][5][6][7][8][9][10] However, even with established bona fide associations, the task of characterizing variants and genes in the context of complex disease molecular pathophysiology, as well as their interacting genes and pathways, remains a daunting challenge. 11 Recent progress in cutting-edge genetic and omics technologies, such as epigenomics, transcriptomics, proteomics, and metabolomics, which refer to the comprehensive assessment of a set of specific types of biological molecules, allied with emerging computational methods, holds promise of faster discoveries. However, because of the large number of associations investigated in most omics-scale studies, it is necessary to have large sample sizes collected in a consistent manner. Scaling up multidisciplinary dementia studies, such as those using omics approaches, comes with challenges and requires coordinated efforts from clinicians and from basic and computational scientists. Appropriate funding and infrastructure capable of dealing with large numbers of biological samples and big data are also needed.
As the omics field continues to expand in dementia research, artificial intelligence (AI)-powered technologies, and in particular machine learning (ML) and deep learning (DL), are well suited to detecting undiscovered patterns in high-dimensional data and could advance dementia research in unprecedented ways (Figure 1). Coronavirus disease 2019 (COVID-19) demonstrated that progress can rapidly be made toward tackling a disease when certain scientific practices are altered. 12 Coordinated action across interested parties can result in extraordinary progress within short periods of time. Significant progress could be made rapidly in dementia research if interested parties were able to organize such that we could tackle the systemic problems that hold back the field, some of which are discussed below.
Here we identify and discuss five unresolved key questions in dementia research, which could be addressed using omics combined with advanced AI approaches: (1) How can we enhance genetic studies to inform our understanding of dementia risk? (2) Can we find reproducible omics brain signatures that differentiate between dementia subtypes? (3) Can high-dimensional omics data identify improved molecular biomarkers for dementia compared to single marker approaches? (4) How do we use genetics to inform our understanding of causal risk factors? And (5) Which biological processes are altered by genetic risk for dementia-related diseases? Tackling these questions is crucial to improving our understanding of dementia, and involves coordinating a multitude of players whose expertise goes well beyond omics. It also involves improving the availability of bioresources and clinical data as well as developing analytical tools and ML algorithms to deal with high-dimensional and heterogeneous data. We note some of the challenges that must be surmounted to answer these questions within the next decade. In each instance, we highlight possible solutions and exemplar projects and communities that have set good examples that can be used to improve our performance as a dementia research community.
This review is one of a series of eight articles in a Special Issue

2.1
How can we enhance genetic studies to inform our understanding of dementia risk?

State of the science
The majority of GWAS rely upon logistic or linear regression-based approaches to test for associations between individual genetic variants (single nucleotide polymorphisms; SNPs) and a binary or continuous outcome. 13,14 This process is repeated until an estimate of association has been generated separately for each genetic variant. Then p-values are used to gauge whether any of these individual associations are strong enough to be considered genome-wide significant when correcting for multiple testing (a conventional threshold for 'hits' is 5 × 10^-8). 15 After a GWAS has been conducted it is often then possible to construct a polygenic risk score (PRS) by summing the value for each genetic variant weighted by the effect size from the initial GWAS. 16 PRS have important applications as research tools, in clinical trials, and in clinical practice, as they can facilitate causal inference modeling and genetic risk stratification on an individual level. Despite twin study heritability estimates of around 60%-80% for AD, 17 recent SNP-based estimates of common variant heritability of AD from GWAS and PRS are much lower (up to 20%), 18 suggesting that much of the genetic contribution to dementia risk remains unexplained. Other approaches, such as integrating multi-omics data or non-linear modeling, are needed to uncover this missing heritability.
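The PRS construction described above can be illustrated with a minimal sketch. It assumes per-variant effect sizes and p-values are already available from a GWAS; the function name `polygenic_risk_score`, the toy dosages, and the toy effect sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Conventional genome-wide significance threshold mentioned in the text.
GENOME_WIDE_P = 5e-8

def polygenic_risk_score(dosages, betas, p_values, p_threshold=GENOME_WIDE_P):
    """Sum allele dosages weighted by GWAS effect sizes.

    dosages  : (n_individuals, n_variants) array of 0/1/2 risk-allele counts
    betas    : per-variant effect sizes from the discovery GWAS
    p_values : per-variant association p-values from the discovery GWAS
    Only variants with p < p_threshold contribute to the score.
    """
    dosages = np.asarray(dosages, dtype=float)
    betas = np.asarray(betas, dtype=float)
    keep = np.asarray(p_values) < p_threshold
    return dosages[:, keep] @ betas[keep]

# Toy data (hypothetical): two individuals, three variants.
dosages = np.array([[0, 1, 2],
                    [2, 0, 1]])
betas = np.array([0.10, -0.05, 0.30])
p_values = np.array([1e-9, 0.2, 3e-8])  # the second variant is not a "hit"

scores = polygenic_risk_score(dosages, betas, p_values)
```

In practice, PRS pipelines usually also handle linkage disequilibrium (for example by clumping) and often use more lenient p-value thresholds; this sketch omits both.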

What problems need addressing?
The diagnosis of dementia and its subtypes is imprecise. 19 Current GWAS are based on cases for whom diagnosis of a specific dementia subtype has been largely made based upon clinical signs and symptoms. Thus, although current dementia GWAS are likely to be enriched for pathology related to the dementia subtype of interest, they will inevitably also contain other dementia subtypes and pathologies in their cases. This is problematic since etiology and risk factors are likely to differ for each dementia subtype, so genetic markers with small effect sizes that are specific to a single dementia subtype will be harder to detect than generalized dementia pathways.
There is currently a marked lack of diversity within dementia genetics studies, with GWAS discovery being largely confined to the genetics of AD in non-Hispanic White adults of European ancestry.
Although some small GWAS have been conducted in non-European samples, 20,[21][22][23] have measured non-AD dementias, 6,9,10 and incorporated dementia-related intermediate quantitative phenotypes or endophenotypes (such as amyloid-beta and cerebral small vessel disease), [24][25][26] The study of both coding and non-coding rare/structural variants, which are thought to be important contributors to missing heritability in dementia, needs to be further pursued through short- and long-read sequencing technologies. 27 Under the hood, long-read sequencing is powered by DL, using GPU-powered alignment algorithms to better characterize the genome.
Other potential reasons for missing heritability include unmeasured interactions between genes (epistasis) and failing to account for correlations between genetic variants due to population structure, dynastic effects, assortative mating or functional relationships. 28
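To make the epistasis point concrete, the following toy simulation shows how a purely additive model can miss a gene-gene interaction entirely. It is a sketch under stated assumptions: the two simulated SNPs, the interaction-only phenotype, and the helper `r_squared` are hypothetical and not drawn from any study cited here.

```python
import numpy as np

def r_squared(design, y):
    """Fraction of variance in y explained by an ordinary least squares fit."""
    X = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 5000

# Two independent SNPs with allele frequency 0.5 (dosage 0, 1, or 2).
g1 = rng.binomial(2, 0.5, size=n)
g2 = rng.binomial(2, 0.5, size=n)

# Phenotype driven only by the interaction of the mean-centered dosages,
# so neither SNP has a marginal (additive) effect.
y = (g1 - 1) * (g2 - 1) + rng.normal(0, 0.5, size=n)

r2_additive = r_squared(np.column_stack([g1, g2]), y)
r2_interaction = r_squared(np.column_stack([g1, g2, g1 * g2]), y)

print(f"additive model R^2:    {r2_additive:.3f}")
print(f"interaction model R^2: {r2_interaction:.3f}")
```

Under this setup the additive model explains almost none of the variance, while adding the interaction term recovers roughly half of it (the remainder is simulated noise). Real epistasis is rarely this clean, and testing all pairwise interactions genome-wide carries a heavy multiple-testing burden.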

Possible solutions
Perhaps the simplest way to enhance future GWAS is to further increase sample sizes and the diversity of these samples. This has been the main strategy so far, and has been reasonably successful in identifying additional genetic variants and, to a lesser degree, improving the phenotypic variance explained. It is reasonable to assume that by further increasing sample sizes (essentially more of the same) further discoveries will be made. Increasing sample sizes considerably will involve enhancing existing research studies or establishing new studies. It is also important to consider the existence of different dementia subtypes and how to distinguish them. It may be possible to take advantage of existing well-characterized samples that have not previously been genotyped due to resource limitations, such as gold standard post mortem brain bank material with linked clinical data. That said, the cost of new studies which include clinical characterization is likely to remain high, and the number of existing samples is finite, raising practical concerns. Although there is no theoretical upper limit, in practice a predictive accuracy plateau, in part limited by heritability, is often reached, beyond which additional training data is not helpful. Given the large amount of missing heritability remaining, it is likely that increasing sample sizes will be needed but will not be sufficient in future GWAS, and alternative approaches will be required. 29
Leveraging population diversity, rather than omitting it, can both improve statistical power and better detect causal variants. For example, a transfer learning approach was used to enhance the findings from a modestly sized GWAS in a Japanese population using summary statistics from a larger European ancestry GWAS. 21 Conversely, transancestry cohorts can also be used to improve genetic variant discovery and localization in European ancestry GWAS. Transfer learning heuristics can also potentially be employed with different rates across global and local admixture levels in some populations for higher accuracy.
As an alternative to the standard linear approaches employed in traditional GWAS, advanced ML approaches may offer various benefits 30 (Table 1), including the ability to: (1)  cancer. 31 Similarly, improvements have been observed by applying DL to predict survival in age-related macular degeneration 32 and reduce multiple testing burden. 33 The tool DeepWAS 34 was used to identify genetic variants associated with multiple sclerosis and major depressive disorder while simultaneously predicting their cell-type-specific regulatory effects using multi-omics data integration. DeepNull 35 is a DL-based tool that models non-linear associations between the phenotype and non-genetic covariates. It improved the detection of GWAS hits by 6% and phenotypic prediction by 23% on average across 10 different UK Biobank traits, while also substantially reducing the false positive rate. Despite these advances, few attempts have so far been made to apply these techniques to dementia. While early attempts to apply ML-based methods to improve AD risk variant prediction have yet to find substantial improvements over traditional GWAS, the cohorts in which these models have been applied are extremely underpowered, 36,37 leaving ample opportunities to fully leverage ML-based methods on large-scale genomic data. 38 These ML approaches may provide the key to the development of PRS with greater predictive accuracy and specificity. 39 However, the degree of improvement offered by ML methods may be partly dependent on the complexity and inter-individual heterogeneity of the genetic architecture underlying the disease of interest. For instance, DeepPRS, 40 a novel DL-based model that does not rely only on the additive effect of risk SNPs, outperformed more traditional PRS models across a variety of disease phenotypes, including AD. Thus, we anticipate that further improvements in these approaches will unlock some of the unexplained heritability observed in prior GWAS, enhancing future research, trials, and clinical practice.

Examples of best practice
The Global Parkinson's Genetics Program (GP2) 41 is in the process of collecting 100,000 European Parkinson's Disease cases, and a further 50,000 cases from under-represented populations around the world.
They are primarily achieving this through collaborations and partnerships with researchers and organizations in other countries across the world, highlighting that large collaborative efforts are crucial for success.
Recent work in multi-ancestry PRS is a good first step in the right direction, 42 but with larger sample sizes of participant-level data, an ML approach could perform well. Lake and colleagues leverage genetically quantified admixture and random effects models in a population with complex substructure, using both random-effects-derived risk scores and a risk heuristic that leverages the rates of genetic admixture to build a better predictive model. 22

2.2
Can we find reproducible omics brain signatures that differentiate between dementia subtypes?

State of the science
4][45][46] Similarly to the GWAS described in the previous section, the largest brain omics studies have focused exclusively on AD.
For example, a meta-analysis of the AD human brain transcriptome, 47 using gene expression data from over 2000 samples, identified 30 coexpression modules as the major source of AD transcriptional perturbations. Additionally, a meta-analysis of AD epigenome-wide association studies, 48 using deoxyribonucleic acid (DNA) methylation data from over 2000 individuals, identified 334 differentially methylated positions associated with AD neuropathology across cortical regions. Yet, robust disease-specific omics signatures or signatures shared across diseases are lacking. Neurodegenerative diseases are heterogeneous entities and there is extensive clinical, pathological, and genetic overlap. 49 Co-pathologies alongside a dominant condition are frequent (e.g., presence of Lewy bodies in AD patients). 50 Cross disease/pathology studies are starting to emerge, for example, addressing epigenetic changes across neurodegenerative diseases, 51,52 and disentangling amyloid-β and tau-pathology-associated transcriptomic profiles in AD. 53 However, to find distinguishing molecular signatures we require large well-powered trans-diagnostic cohorts, with a range of primary co-pathologies, and need to develop powerful unsupervised ML methods to cluster omics data. 54 Although the increasing availability of single-disease datasets has opened the way to meta-analysis and multiple-cohort reanalysis, [55][56][57][58][59][60] much more is needed to assess which mechanisms are conserved across pathologies and which are disease-specific.

What problems need addressing?
It is yet to be understood how and why selective vulnerability occurs in different brain regions and cell types across different neurodegenerative diseases. However, findings from omics studies are often not replicable at the gene/effect level even within a single disease.
How then can replicability be enhanced? Several issues need to be addressed: First, studies are often undertaken in small cohorts, which lack statistical power to detect significant molecular changes, and may reflect sampling bias and disease heterogeneity. 59 Availability of brain tissue, especially for rare diseases and for matched cognitively normal controls, 61 is a limiting factor. Second, phenotype definitions are not unified. The dominant pathology (e.g., AD or Parkinson's disease) is often used as the label, but variable degrees of co-pathologies impact molecular signatures. Instead, multiple pathologies could be combined as a quantitative "polypathology score." Third, hemispheric asymmetry in neuronal processes is a fundamental feature of the human brain and drives symptom lateralization (e.g., Parkinson's disease and FTD), which is reflected molecularly. 62,63 This interferes with histopathology to omics comparisons, mostly investigated in opposite hemispheres. 62 Fourth, genetic variability between individuals is often not accounted for in omics studies. Fifth, there is considerable heterogeneity across studies including differences in brain regions, brain cell type compositions, protocols and platforms to generate the molecular data, and analytic pipelines used. Sixth, the influence of confounding factors, such as batch effects, post mortem interval, or ribonucleic acid (RNA)/DNA quality, can vary substantially between brain banks due to distinct standard procedures.

Possible solutions
The ML paradigm may be useful in multiple ways for the identification of reliable and discriminatory brain omics signatures. There is a clear need to integrate omics data generated for samples both from different brain regions and different cohorts, thus enabling the latent space modeling of multimodal brain omics, 67 different brain regions, different cell types, 68,69 and different neurodegenerative phenotypes or diseases. This latent space will allow the uniform treatment of samples and a seamless creation of ML models for downstream tasks, such as diagnosis or interpretation.
Multi-omics data in well-characterized pathology samples will allow us to refine dementia subtyping. AI can play a huge role in this. DL and computer vision can be used for generating harmonized digital pathology datasets. 70 These datasets and samples can then be input into the pipeline for omics characterization. Data from such pathology-based omics studies will be harmonized across sites using a number of unsupervised learning methods. At its core, single-cell resolution using tools like scVI 71 relies on ML to annotate and quantify cellular components of multi-omics datasets, which can then be used for multimodal subtyping at the intersection of genomics and pathology.
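As a deliberately simple stand-in for the latent space modeling discussed above, the sketch below projects simulated omics profiles onto principal components. It is not scVI or any method cited here: the data, group sizes, and the helper `pca_scores` are hypothetical, and linear PCA is used only because it is the most basic unsupervised embedding.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

rng = np.random.default_rng(1)
n_per_group, n_features, n_signal = 60, 500, 20

# Two simulated "subtypes": 500 features, of which only the first 20 differ.
labels = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, n_features))
X[labels == 1, :n_signal] += 3.0

scores = pca_scores(X)

# How strongly the first latent dimension tracks the (hidden) subtype label.
pc1_label_corr = abs(np.corrcoef(scores[:, 0], labels)[0, 1])
print(f"|corr(PC1, subtype)| = {pc1_label_corr:.2f}")
```

In this idealized setting the first component separates the two groups without ever seeing the labels. Real brain omics data add batch effects, cell-type composition differences, and co-pathologies, which is why more expressive (often non-linear) latent models are being explored.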

Examples of best practice
ML approaches applied to dementia brain omics data, such as epigenomics, transcriptomics, and proteomics data, have started to emerge and illustrate the promise of using such methods to maximize findings from existing data. Huang and colleagues have recently developed EWASplus, a computational method that uses a supervised ML strategy to extend EWAS coverage to the entire genome, 38 and implicates additional epigenetic loci for AD that are not found using array-based AD EWASs. Wang and colleagues implemented a DL method that analyzes RNA-seq data from brain donors to characterize post mortem brain transcriptome signatures associated with amyloid-β plaques, tau neurofibrillary tangles, and clinical severity in multiple AD and related dementia populations. 58 In the proteomics space, Tasaki and colleagues applied a deep neural network approach to predict protein abundance from mRNA expression, in an attempt to track the early protein drivers of AD and related dementia subtypes. 72 These approaches demonstrate how such methodologies can be used to identify potential early protein drivers and possible drug targets for preventing or treating AD and related dementias.

2.3
Can high-dimensional omics data identify improved molecular biomarkers for dementia compared to single marker approaches?

State of the science
Technological advances and large, shared, international datasets allow a new approach to understanding diseases, including biomarker identification. Single molecule assays, such as Simoa, allow accurate measurement of plasma proteins. 73 Notably, plasma neurofilament light (NfL) has been comprehensively shown by many research groups to be substantially increased in a diverse array of neurological brain conditions when compared with age-matched controls, leading to the proposal of NfL being the first established blood biomarker for neurological and cognitive decline. 74 Targeted biomarkers such as NfL have begun to be translated into clinical settings but the use of multi-omics data has so far been limited. However, omics modalities present opportunities for the identification and application of new biomarkers. For example, most dementias appear to have a considerable polygenic component, which presents potential as multi-assay risk biomarkers. Genome sequences comprising petabytes of data can be resolved to common single nucleotide variation, rare variants, and structural variants, all with potential as markers of disease risk. RNA expression data are currently used in biomarker discovery though not yet achieving the accuracy of blood proteins in disease prediction. 75,76 DNA methylation data can provide a route to identify non-recorded environmental exposures through imputation of these risk factors from published predictors. 77 This strategy could help validate epidemiological reports of environmental risk factors and help stratify patients across diagnostic boundaries, which may provide stimuli for additional analyses and clinical follow-up. 78 Genes where DNA methylation is altered by specific environmental factors could identify molecular pathways of relevance across dementias. In addition to markers of aging, they have also been used as predictors of cognitive function. 79 However, before these markers can be translated to the clinic, they would need to demonstrate stringent accuracy in independent validation cohorts.
While the multimodal datasets described above can contribute to biomarker discovery, many diagnostics companies and regulatory bodies prefer a single readout approach. This is contrary to the basic concept that multimodal data can more accurately reflect complex biological systems.

What problems need addressing?
The development of large harmonized omics datasets is challenging.
The first challenge relates to the issue of data quality: high-dimensional omics data are acquired from different sources, in distinct formats and over multiple sites, and accompanied by patient medical records. Errors that occur during measurement or processing (e.g., batch effects) risk compromising the reproducibility and the usability of the generated data. The second challenge is of a computational nature: the preliminary analyses of multi-omics data require a data harmonization process and the development of integration, clustering, functional characterization, and visualization tools. Beyond this step, one of the goals in biomarker studies is the inference and the prediction of biological systems. 80 The statistical methods traditionally deployed for inference require explicit assumptions, which are not necessarily intuitive in large omics datasets. 81 Finally, given dimensionality constraints posed by integrating multiple large omics datasets, the computational burden and storage space requirements can be limiting. The last challenge is to make these datasets sharable and accessible to a large community. 82 The development of a large omics dataset therefore requires establishing standardized protocols for the acquisition, transfer, and analysis of clinical and omics data that can be used by the scientific research community.
At its core, the issue with the multimodal datasets needed for building the next generation of complex biomarkers is both a wide-data and a sparsity problem. Studies are simply not large enough or similar enough, nor are their data accessible enough, to identify better biomarkers with clinical relevance.

Possible solutions
85]39 Application of DL approaches on large-scale omics datasets allows researchers to detect new disease relationships in the data. Translating these discoveries into multi-panel tests will be key in applying potential biomarkers. As the costs of omics assays continue to drop, the standard use of high-throughput DNA, RNA, protein, and metabolomics biomarkers in the clinic needs to become a reality. Large-scale sequencing initiatives that focus on the genomic underpinnings of neurodegenerative diseases 41,[86][87][88][89][90] will aid in the development of more targeted and cost-effective tests such as PRSs and metabolite panels. 91 Collectively, these initiatives will create many opportunities for biomarker identification and for validation in both diagnosis and early disease detection, while also raising important ethical and technical challenges.
In its simplest terms, information theory dictates that adding impactful and independent features to a model should improve its predictive performance, although limiting analyses to such features may be difficult due to wide data issues in genomics. In ML, it is relatively common to face high-dimensionality problems in which the number of features is much greater than the number of samples. That is why the problem of feature selection has become more pressing in recent decades. 92,93 In addition, techniques such as federated learning 94 are likely to be useful in analyzing biomarkers across datasets that cannot safely be combined for ethical or practical reasons.
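The idea of analyzing data that cannot be pooled can be illustrated with a simplified sketch. This is not the federated learning method cited above, which typically involves iterative model updates; instead, each hypothetical site shares only summary matrices for a linear model, and a coordinator combines them. All names (`local_summary`, `combine_and_solve`) and the simulated sites are assumptions made for illustration.

```python
import numpy as np

def local_summary(X, y):
    """Computed inside each site: only these summaries leave the site."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    return X1.T @ X1, X1.T @ y

def combine_and_solve(summaries):
    """Coordinator: add up site summaries and solve the normal equations."""
    xtx = sum(s[0] for s in summaries)
    xty = sum(s[1] for s in summaries)
    return np.linalg.solve(xtx, xty)

rng = np.random.default_rng(2)
true_beta = np.array([0.5, -1.0, 2.0])

# Three simulated sites of different sizes measuring the same three biomarkers.
sites = []
for n in (80, 120, 200):
    X = rng.normal(size=(n, 3))
    y = 1.0 + X @ true_beta + rng.normal(0, 0.1, size=n)
    sites.append((X, y))

federated_coef = combine_and_solve([local_summary(X, y) for X, y in sites])

# Reference: the fit that would be obtained if raw data could be pooled.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
pooled_design = np.column_stack([np.ones(len(y_all)), X_all])
pooled_coef, *_ = np.linalg.lstsq(pooled_design, y_all, rcond=None)
```

For ordinary least squares the combined estimate matches the pooled fit exactly, which is what makes this a convenient teaching example. Note that sharing summary matrices is not a privacy guarantee in itself, and non-linear or deep models require iterative schemes rather than a single aggregation step.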

Examples of best practice
Analyzing datasets from independent cohorts and then combining them in a meta-analysis can improve statistical power and the ability to detect significant associations. For example, a meta-analysis of 569 lipidomics species measured in the Australian Imaging, Biomarkers and Lifestyle (AIBL) cohort and the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort identified multiple lipids from several species predictive of prevalent and incident AD. 95 Within-cohort integration of data modalities can also yield novel disease markers; for example, coexpression networks of metabolite and gene expression data from the ADNI cohort identified new metabolite candidate markers. 96 The

2.4
How do we use genetics to inform our understanding of causal risk factors?

State of the science
It was recently estimated that reducing modifiable risk factors could prevent around 40% of all-cause dementia cases. 97 However, the evidence base for most hypothesized risk factors being causal is weak, with conflicting findings across studies depending on study design, time of risk factor measurement, type of outcome, sample size, and study population. 97,98 is assigned randomly at conception, it is largely independent of confounding factors that often cause bias in observational research. The genome also cannot be modified by subsequent disease, making bias due to reverse causation unlikely. Mendelian randomization (MR) is a widely used method and can be a useful tool for understanding the etiology of risk factors, [100][101][102][103][104] but it also has limitations that should be carefully considered. 105,106 Despite the clear advantages of MR studies, few other methods have been developed that can explore the causal relationships between risk factors and dementia-related outcomes.
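For readers less familiar with how MR turns GWAS summary statistics into a causal estimate, the following minimal sketch computes per-variant Wald ratios and a fixed-effect inverse-variance weighted (IVW) estimate. The function names and the toy summary statistics are hypothetical, and the standard error uses the common first-order approximation that ignores uncertainty in the SNP-exposure effects.

```python
import math

def wald_ratios(beta_exposure, beta_outcome):
    """Per-variant causal estimates: SNP-outcome effect / SNP-exposure effect."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its first-order standard error."""
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Toy summary statistics (hypothetical) for three independent variants.
beta_exposure = [0.10, 0.20, 0.15]    # SNP effects on the risk factor
beta_outcome = [0.05, 0.10, 0.075]    # SNP effects on the outcome
se_outcome = [0.01, 0.02, 0.015]      # standard errors of the outcome effects

ratios = wald_ratios(beta_exposure, beta_outcome)
estimate, std_error = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print(f"IVW causal estimate: {estimate:.3f} (SE {std_error:.3f})")
```

Because every toy variant implies the same ratio, the IVW estimate equals that common value. In real data the ratios disagree, and that heterogeneity is one signal of the pleiotropy problems discussed in the next subsection.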

What problems need addressing?
There are several common problems that can impact causal inference if they are not duly addressed and can lead to unreliable conclusions being made. Power is problematic in many MR studies examining causality of risk factors on dementia. 100 Confidence intervals are often wide, so meaningful effects in either direction cannot be excluded. This is often the case for risk factors that are difficult to measure (e.g., sleep disturbance and physical inactivity). 107,108 Weak instruments (i.e., those with an F-statistic <10) can introduce bias. 109 Examples of strong instruments that have been used in MR of dementia risk include plasma glucose, 110 educational attainment and intelligence, 111 type-2 diabetes mellitus and glycated hemoglobin (HbA1c), 112 but these only represent a small fraction of dementia risk factors.
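The weak-instrument rule of thumb above can be checked with a short calculation. The sketch below uses the standard approximation of the first-stage F-statistic from the variance in the exposure explained by the instrument; the function name and the example values of variance explained and sample size are hypothetical.

```python
def instrument_f_statistic(r_squared, n_samples, n_instruments=1):
    """Approximate first-stage F-statistic for a genetic instrument.

    r_squared     : proportion of exposure variance explained by the instrument
    n_samples     : sample size of the exposure GWAS
    n_instruments : number of variants in the instrument
    """
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return ((r_squared / n_instruments)
            * (n_samples - n_instruments - 1) / (1 - r_squared))

# Hypothetical single-variant instruments in a GWAS of 5,000 people.
f_weak = instrument_f_statistic(r_squared=0.001, n_samples=5000)    # about 5
f_strong = instrument_f_statistic(r_squared=0.01, n_samples=5000)   # about 50

for name, f in (("weak", f_weak), ("strong", f_strong)):
    flag = "below" if f < 10 else "above"
    print(f"{name} instrument: F = {f:.1f} ({flag} the conventional threshold of 10)")
```

The same variance explained yields a larger F-statistic in a larger GWAS, which is one reason hard-to-measure risk factors with small GWAS tend to have weak instruments.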
Collider bias can also be introduced into causal analyses when an included sample suffers from selection bias, for example, due to differential patterns of survival associated with the risk factor of interest. 113 Individuals need to live long enough to obtain a dementia diagnosis, so observed causal effects of any risk factor associated with premature mortality (e.g., smoking) on dementia risk are likely biased. 114 Very few studies attempt to identify and, if necessary, correct for survival bias, despite it being demonstrated to produce spurious protective effects in MR studies of causal risk factors for AD and Parkinson's disease. 115,116 Causal analyses may also be biased by population effects that confound the relationship between the genetic instrument and outcome variable (violating the 'independence' MR assumption 117 ). Certain dementia risk factors, such as educational attainment, have been shown to be highly influenced by assortative mating (i.e., non-random mating) within populations, 117 but this has not yet been systematically assessed in studies of dementia risk factors, so we do not know the extent to which current causal estimates are being biased by these population effects.
Confounding due to horizontal pleiotropy is especially problematic in MR studies that measure the causal association between a complex risk factor (i.e., a phenotype which is highly polygenic) and an outcome.
It is becoming increasingly apparent that many SNPs in the genome causally influence multiple traits, making the "exclusion restriction" MR assumption (i.e., that the only path between the genetic instrument and the outcome is via the exposure) less likely to be upheld.
In addition, even though many dementia risk factors are genetically inter-correlated 118 and co-occurrence of multiple risk factors within an individual increases dementia risk more than being exposed to a single risk factor, 119 most studies only measure the causality of one risk factor on dementia. By only measuring bivariate relationships, we are likely overlooking synergistic effects or overlapping causal pathways between dementia risk factors, reducing our ability to identify shared biological pathways that are especially central in raising dementia risk and to characterize the patterns of pleiotropic effects between risk factors. There are methods to disentangle this, such as genomic or transcriptomic structural equation modeling (SEM), 120,121 but they require well-powered GWAS, which are not available for all risk factors.
Aside from MR, few causal modeling methods have been developed for use with genetic data. Even in cases where new causal methods have been proposed, such as Bayesian network analysis (BN), 122 latent causal variable analysis (LCV), 123 and the multi-SNP mediation intersection-union test (SMUT), 124 these have not yet been applied in dementia risk factor research and there is a noticeable lack of causal ML modeling in the genomics field.

Possible solutions
One of the key ways that AI methods could be harnessed to improve causal analyses in dementia research is to use ML/DL to strengthen genetic instruments for MR. Traditionally, instruments are created from GWAS summary statistics that are estimated using logistic regression and defined p-value thresholds, whereas COMBI 28 and DeepCOMBI 33 use Support Vector Machines (SVM) and deep neural networks, respectively, to identify SNPs related to a phenotype. In particular, DeepCOMBI has been shown to replicate known disease loci, as well as identify novel ones. DeepMR integrates ML with MR by using multi-task DL models to initially learn the relationship between different sets of genomic marks (e.g., chromatin marks) associated with a pathway or phenotype of interest and then uses MR to examine causal relationships between them, 125 which could help to identify more functionally relevant SNPs for inclusion in the exposure instrumental variable.
Existing methods that quantify and correct for known sources of bias should also be routinely implemented. Automated AI methods could help support this, for example, MR-MoE (MR-Mixture of Experts), which is an ML framework that applies random forest learning algorithms to MR results to identify the method for a given analysis that is least likely to be biased by horizontal pleiotropy. 126 Several of the associations between dementia and its risk factors are likely non-linear. For example, the association between sleep duration and dementia is likely to be U-shaped: both too little and too much sleep have been associated with increased dementia risk. 97,127,128 In this instance, sleep duration is measured as a discrete categorical phenotype rather than a truly continuous one, and its genetic instruments are weak in comparison with other risk factors. 110 Non-linear MR accounts for non-linearity between continuous exposures and outcomes, 129 but it has scarcely been applied to MR studies of dementia risk. One recent study used non-linear MR to assess the causal influence of sleep duration on dementia-related cognitive outcomes. 130 Thus, to use MR to understand non-linear relationships between risk factors and dementia, we should focus future GWAS efforts on improving the modeling of continuous risk factors in situations where observational evidence suggests that there is a non-linear causal relationship with dementia.
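Why a U-shaped relationship defeats a linear model can be seen in a toy curve-fitting example. This sketch is an observational illustration only and is not non-linear MR: the simulated sleep durations, the risk score centered on 7 hours, and the helper `poly_r_squared` are all hypothetical.

```python
import numpy as np

def poly_r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    fitted = np.polyval(coefs, x)
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
n = 2000

# Simulated sleep duration (hours) and a risk score that is lowest at 7 hours.
sleep_hours = rng.uniform(4, 10, size=n)
risk = 0.1 * (sleep_hours - 7.0) ** 2 + rng.normal(0, 0.1, size=n)

r2_linear = poly_r_squared(sleep_hours, risk, degree=1)
r2_quadratic = poly_r_squared(sleep_hours, risk, degree=2)

# Location of the minimum of the fitted parabola: -b / (2a).
a, b, _ = np.polyfit(sleep_hours, risk, 2)
lowest_risk_hours = -b / (2 * a)

print(f"linear R^2:    {r2_linear:.3f}")
print(f"quadratic R^2: {r2_quadratic:.3f}")
print(f"estimated lowest-risk sleep duration: {lowest_risk_hours:.1f} h")
```

A linear fit reports essentially no association because the harmful effects of short and long sleep cancel out, while the quadratic fit recovers the shape. The same cancellation can make a standard linear MR estimate misleadingly close to zero.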
Room for future improvement includes the potential leveraging of tree-based, boosted, bagged, or other ML algorithms to create interpretable model cascades of causal risk. This could increase the value of previous MR studies while at the same time addressing their shortcoming of generally focusing on only a single exposure at a time. AI has the power to model multiple potentially connected causal risk factors at scale.

Examples of best practice
Recently, a multivariate GWAS was performed using random forest regression to predict causal SNPs for 56 neuroimaging phenotypes. It identified the APOE SNP rs429358 as the top locus, as well as additional lead SNPs that mapped to genes relevant to brain disorders and were not identified by traditional linear regression methods. 131 Another study introduced the MR-based Structure Learning (MRSL) algorithm, which used graph theory combined with multivariable MR to uncover causal and mediating pathways between 44 diseases and 26 biomarkers using publicly available GWAS summary statistics. 132 Together, these results highlight the potential benefits of utilizing ML-based multivariate approaches to model the genetics underlying inter-correlated risk factor traits when performing causal analyses in dementia research.
Noyce and colleagues previously assessed the impact of survival bias on estimates of the causal effect of body mass index (BMI) on Parkinson's disease. 116 They performed simulations to estimate the likely effect that their MR analysis would show if survival bias was present, when assuming that BMI was not truly related to Parkinson's disease. The objective was to see if the likely magnitude of the survival bias was large enough to explain the MR results estimated from the real data. They demonstrated that the seemingly protective effect of higher BMI on Parkinson's disease risk was likely due to survival bias related to increased frailty in people with lower BMI, rather than being the true causal driver. Since effects from survival bias are likely to be especially important for causal analysis of risk factors in dementia research, it is crucial that we start to consistently test for this and other common forms of bias in future studies to minimize the impact of spurious findings within our field.
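The general logic of such a simulation can be sketched as follows. This is a generic illustration of selection on survival, not a reproduction of the cited analysis: the effect sizes, the selection rule, and the function names are all assumptions. The exposure has no true effect on the outcome liability, yet restricting the sample to survivors induces an association between the genetic score and the outcome.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, seed=1):
    """Return (genetic score, outcome liability) pairs for everyone and for survivors."""
    rng = random.Random(seed)
    everyone, survivors = [], []
    for _ in range(n):
        score = rng.gauss(0.0, 1.0)             # genetic instrument for the exposure
        exposure = score + rng.gauss(0.0, 1.0)  # exposure partly driven by the score
        liability = rng.gauss(0.0, 1.0)         # outcome liability, independent of exposure
        everyone.append((score, liability))
        # Assumed selection rule: high exposure and high liability both reduce survival.
        if exposure + liability < 0.0:
            survivors.append((score, liability))
    return everyone, survivors


everyone, survivors = simulate()
slope_all = ols_slope(*zip(*everyone))
slope_survivors = ols_slope(*zip(*survivors))
print(f"all: {slope_all:.3f}, survivors only: {slope_survivors:.3f}")
```

Comparing the size of the survivor-only slope with an estimate from real data gives a rough sense of whether selection alone could plausibly explain an observed result.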

2.5
Which biological processes are altered by genetic risk for dementia-related diseases?

State of the science
Highly penetrant variants in APP, PSEN1, or PSEN2 have pointed to a central role of amyloid-β in early-onset AD. 133 Separately, GWAS for late-onset AD identified several biological processes enriched for genes associated with disease risk, including amyloid-β processing, lipid metabolism, and immune responses. 134,135 Although most AD GWAS associations are non-coding, rare coding variants have implicated key microglial genes such as TREM2 and PLCG2. 135,136 Follow-up experiments in cellular and animal models confirmed the effects of these genes on microglial activation and lipid processing. 137,138 Epigenomic maps from purified cell populations 139 or single cells 140 have localized non-coding AD risk variants to microglia-specific enhancers, regulating genes including BIN1 and RIN3. An alternative way of linking risk variants to genes is to identify quantitative trait loci (QTLs) that influence gene expression, followed by a test for statistical colocalization with nearby GWAS loci. A variation on the previously discussed topic of MR, called summary-data-based MR (SMR), is often used to draw causal inferences about the function of these QTLs in disease risk at the level of individual genes. Recent studies in purified microglia from living 141 or post mortem 142,143 donors have nominated some AD and Parkinson's disease risk genes, but so far they are underpowered relative to bulk brain datasets. Thus, while genetic studies of AD indicate a clear role of microglia, 144,135,136,141,145 the roles of specific cell types are still being discovered in other neurodegenerative conditions, such as Parkinson's disease 139,146 and amyotrophic lateral sclerosis. 147

What problems need addressing?
GWAS for different dementias have so far mainly used a case-control framework to identify genetic loci associated with a clinical diagnosis.
However, this approach ignores the complexity of neuropathological changes that occur in patients, which usually predate clinical symptoms by years or decades, and which may involve multiple distinct pathologies. 54,148 The decoupling of genetic associations from specific pathologies makes it difficult to identify the most relevant cellular model for a given locus. Without this link, most cellular models have focused on a single cell type, and thereby fail to elucidate the probable interplay between different cell types that leads to neurodegeneration.
Furthermore, identifying and validating the causal genes at GWAS loci remains challenging, due to uncertainty in both the specific causal variants and the cell types through which they act. 149 Additionally, GWAS loci may arise only in a specific cellular state, such as response to a pathology, as has been recently shown for the UNC13A amyotrophic lateral sclerosis/FTD locus. 150,151 As a result, the genes and biological processes that are identified as relevant have depended largely upon the prior hypotheses of investigators and on the cellular models and analysis methods that were used. Although the scale and resolution of single-cell transcriptomic and epigenomic datasets are increasing, there is not yet a robust and reproducible catalog of all cell types and cell states relevant to brain function and disease processes. Additionally, curated resources cataloging genes involved in many biological processes are often subject to bias arising from publication and funding issues, as well as reporting bias.

Possible solutions
New technologies have the potential to improve our understanding of neurodegenerative diseases, if applied systematically and at scale.
Single-cell technologies are beginning to reveal the cell type diversity of the human brain, 152 and to identify cell type-specific gene expression changes in disease. 140,153 The GTEx project 154 was transformative in describing gene regulation across human tissues, enabling others to link these genetic effects to human disease risks. However, its sampling of bulk tissues limits its use for understanding biological mechanisms.
Single-cell technologies now make it possible to envision a cell type-specific gene regulatory atlas of the human brain. Such an atlas should be built in a robust way across multiple labs, and include both healthy and diseased donors of different ages.
We must also seek to recapitulate the spatial dimension of cell type localization and gene expression. Only by probing gene expression directly in a tissue section can we reliably establish organ-wide patterns of gene expression, reconstruct cell-cell interactions and assess how neuropathology affects local gene expression. Mouse models have highlighted how amyloid plaques influence oligodendrocyte and microglia gene expression across disease stages. 155 Going forward, a brain-wide, spatially-resolved gene expression atlas, possibly integrating splicing information, 156 would be a rich complement to a standard gene regulatory atlas.
To understand the molecular mechanisms of neurodegenerative disease genetic associations, we need to perturb the function of candidate genes and measure their effects in relevant cellular models. However, an ad-hoc approach in the most accessible cell types will not lead to robust conclusions. With CRISPR-based tools these perturbations can be done at genome-wide scale, in specific cell types derived from human induced pluripotent stem cells (iPSCs), and with high-throughput phenotyping assays. As a community, we should coordinate to systematically investigate a broad set of candidate genes, across multiple cellular phenotypes and in a range of cellular models. Additionally, as part of therapeutic development, these perturbation screens will likely need to be carried out across networks upstream of known targets.

Examples of best practice
For psychiatric disease, the PsychENCODE project set an example by collecting multiple types of omic data from over a thousand post mortem brains across three diseases and three brain regions. 157,46,158 Crucially, integrative analyses need to leverage these multiple omic layers to generate novel insights, as demonstrated in previous studies of bulk brain. 46,159 Recent studies have used scRNA-seq methods to examine specific brain regions in disease and control individuals for AD, 153,160 amyotrophic lateral sclerosis and FTD, 161 revealing cell type-specific effects of disease pathology. For all of these datasets and analyses to be most useful, robust ML methods are needed to integrate distinct omics modalities and to ensure reproducible results. Promising approaches in this direction have recently been applied to large-scale single-cell data from mouse motor cortex, 162 and the human immune system. 163
As genetic studies of dementias increase in size, so does the need to identify the causal genes at associated loci. New methods enable enhanced fine-mapping using functional genomic data (e.g., PolyFun 164 ), and better prediction of enhancer-promoter connections (e.g., activity-by-contact score). One such example is the identification of USP6NL as the putative causal gene within the AD GWAS locus "ECHDC3" by linking a functionally fine-mapped variant within a microglia enhancer with the USP6NL promoter. 142 This finding was further supported by strong colocalization between the GWAS and eQTL signals. This methodology has also been applied to Parkinson's disease. 165 DL models have also shown dramatic improvements in predicting the effects of genetic variants on splicing, pathogenicity (coding variants), and gene expression. Along with experimental data, both variant effect predictions and fine-mapping data can be used as input to ML methods that directly predict the most likely causal genes at GWAS loci.
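The idea of functionally informed fine-mapping can be sketched with a simplified calculation that assumes a single causal variant per locus: each variant receives a Wakefield-style approximate Bayes factor from its z-score, which is then weighted by a prior reflecting functional annotation. This is not PolyFun's actual procedure; the variant names, z-scores, standard errors, prior weights, and the prior effect-size standard deviation are all invented for illustration.

```python
import math


def approx_bayes_factor(z, se, prior_sd=0.2):
    """Approximate Bayes factor for association versus the null (Wakefield-style)."""
    v = se ** 2        # sampling variance of the effect estimate
    w = prior_sd ** 2  # assumed prior variance of the true effect
    r = w / (v + w)
    return math.sqrt(1.0 - r) * math.exp(0.5 * z ** 2 * r)


def posterior_inclusion(variants):
    """Posterior inclusion probabilities under a single-causal-variant assumption.

    Each variant is (name, z, se, prior_weight); weights need not sum to 1
    because the result is normalized across the locus.
    """
    weighted = {
        name: prior * approx_bayes_factor(z, se)
        for name, z, se, prior in variants
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Hypothetical locus: var_b sits in a cell type-specific enhancer, so it gets
# a larger prior weight in the functionally informed analysis.
uniform = [("var_a", 5.2, 0.02, 1.0), ("var_b", 5.0, 0.02, 1.0), ("var_c", 4.0, 0.02, 1.0)]
functional = [("var_a", 5.2, 0.02, 0.2), ("var_b", 5.0, 0.02, 0.6), ("var_c", 4.0, 0.02, 0.2)]

pip_uniform = posterior_inclusion(uniform)
pip_functional = posterior_inclusion(functional)
print(max(pip_uniform, key=pip_uniform.get), max(pip_functional, key=pip_functional.get))
```

In this toy locus the top-ranked variant changes once the functional prior is applied, which is the behavior that makes annotation-aware fine-mapping useful when several variants in linkage disequilibrium have similar association statistics.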
Beyond cellular maps and genetic associations, a systematic approach to model systems is needed. A National Institutes of Health (NIH)-funded project, the iPSC Neurodegenerative Disease Initiative (iNDI), 166 is creating more than 100 isogenic iPSC lines with mutations associated with dementias. How these are used to model neurodegeneration in specific derived cell types will be up to the creativity and vision of the research community.
Clustered regularly interspaced short palindromic repeats (CRISPR) based studies and methods such as perturbSeq and CROPseq have pushed the boundaries of what can be assayed rapidly with edited cell lines. 167 These techniques are already being sought after by biotechs looking to quantify upstream and downstream effects of genetic and genomic therapeutic targets. Enough of this type of data, combined with DL to recognize patterns of functionally connected genes or with graph-based network models, could identify communities of risk factors that are functionally connected to disease risk. 168 These new communities could serve as less biased pathways derived from the appropriate tissues and cell types.
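As a rough sketch of the graph-based idea, genes can be linked when their perturbation signatures are similar, and groups of linked genes then form candidate communities. The gene names, signature vectors, and similarity threshold below are invented, and connected components are used as the simplest possible stand-in for a proper community detection method.

```python
import math
from collections import deque

# Hypothetical perturbation signatures (e.g., expression changes after knockdown).
SIGNATURES = {
    "gene_a": [1.0, 0.9, 0.0, 0.1],
    "gene_b": [0.9, 1.0, 0.1, 0.0],
    "gene_c": [1.1, 0.8, 0.0, 0.0],
    "gene_d": [0.0, 0.1, 1.0, 0.9],
    "gene_e": [0.1, 0.0, 0.9, 1.1],
}
SIMILARITY_THRESHOLD = 0.8  # assumed cut-off for drawing an edge


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_graph(signatures, threshold):
    """Connect genes whose signatures are at least `threshold` similar."""
    genes = sorted(signatures)
    graph = {gene: set() for gene in genes}
    for i, first in enumerate(genes):
        for second in genes[i + 1:]:
            if cosine(signatures[first], signatures[second]) >= threshold:
                graph[first].add(second)
                graph[second].add(first)
    return graph


def connected_components(graph):
    """Group genes into connected components using breadth-first search."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in sorted(graph[node] - seen):
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(sorted(members))
    return components


communities = connected_components(build_graph(SIGNATURES, SIMILARITY_THRESHOLD))
print(communities)
```

With real screens, the edge definition and the grouping algorithm would both need to be chosen and validated against the data at hand.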

LIMITATIONS OF AI AND ML IN THE DEMENTIA OMICS FIELD
High-throughput methods, such as the full suite of omics platforms, including genomic, transcriptomic, epigenomic, proteomic, metabolomic, and related technologies, have inaugurated a new era of systems biology. This provides abundant and detailed data, which conventional analytical and statistical approaches are often not capable of dealing with. AI and ML algorithms, which are designed to automatically mine data for insights into complex relationships in these massive datasets, are still in their infancy in dementia genetics and omics research, and far from being explored to their full capacity.
Despite major strengths and achievements so far, it is worth keeping in mind possible caveats of AI models in the omics field, including the following examples: (1) Interpretation (the black box), as often the complexity of certain models makes it difficult to understand the learned patterns and consequently it is challenging to infer the causal relationship between the data and an outcome; (2) "Curse" of dimensionality: omics datasets represent a huge number of variables and often a small number of samples, as mentioned in multiple sections of this paper; (3) Imbalanced classes: most models applied to omics data deal with disease classification problems (e.g., use of major pathology labels in the presence of co-pathologies, as mentioned in section 2.2); and (4) Heterogeneity and sparsity: data from omics applications is often heterogeneous and sparse since it comes from subgroups of the population (e.g., as highlighted in section 2.1), different platforms (e.g., multiple array and sequencing based platforms), multiple omics modalities (e.g., transcriptomics, epigenomics, proteomics) and is often resource intensive to generate. Many of these limitations, however, can be overcome with improvements to data generation (e.g., larger, more diverse, harmonizable studies) and analysis (e.g., using dimensionality reduction strategies and interpretable ML approaches).
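The imbalanced-classes caveat can be made concrete with a small example. The labels and class proportions below are invented: a classifier that always predicts the majority diagnosis looks accurate overall, while balanced accuracy (the mean of per-class recall) shows that it has learned nothing about the minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical cohort: 90 samples of one diagnosis and 10 of another.
y_true = ["AD"] * 90 + ["FTD"] * 10
# A trivial model that always predicts the majority class.
y_pred = ["AD"] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

Reporting a class-aware metric alongside overall accuracy is one simple safeguard when subtype labels are unevenly represented.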

CONCLUDING REMARKS
In conclusion, omics technologies, including genomics, epigenomics, transcriptomics, proteomics, and metabolomics, can provide increasingly comprehensive high-dimensional insights into the biological system of each individual when combined with AI approaches. This in turn can contribute immensely to a better understanding of AD and other forms of dementia, and to the development of personalized medicines.
However, a number of thorny issues hamper the use of omics technologies and AI in dementia research. These include the need for better, more comprehensive, and less biased genetics and omics dementia-related data resources, the development of improved AI algorithms, and the need for more multidisciplinary collaboration.
Increased funding, a more coordinated collaborative global effort, and a greater number of diverse and deeply phenotyped cohorts, together with innovative AI methods, have the potential to overcome these challenges and to increase the pace of discovery that we are able to achieve.
Ultimately, this would have a major impact on our understanding of the underlying disease processes and help to improve the prevention, diagnosis, and treatment of dementia.
FIGURE 1 Illustration of multiple aspects of dementia research that can be enhanced by the use of appropriate genetics and omics data allied with the implementation of artificial intelligence approaches. Recoverable panel text: multi-omics data allows learning at different levels of data granularity and facilitates unprecedented advances in dementia research.
Examples of artificial intelligence methods to potentially address current challenges in the study of dementia genetics and omics.
Challenge: [...] by predictive accuracy and hampered by heritability. AI approach: Novel DL-based models that do not rely only on the additive effect of risk SNPs may outperform more traditional PRS models across a variety of disease phenotypes.
Challenge: Causal inferences are often underpowered and limited in scope. AI approach: DeepMR 125 approaches integrate ML with MR by using multi-task DL models to learn the relationship between different sets of genomic marks associated with a pathway or phenotype of interest and then use MR to examine causal relationships between them.
Abbreviations: AI, artificial intelligence; DeepMR, deep Mendelian randomization; DL, deep learning; GWAS, genome-wide association studies; ML, machine learning; MR, Mendelian randomization; PRS, polygenic risk score.
Many studies are prone to bias by unmeasured or residual confounding, reverse causation due to dementia's long latency period, and survival bias. Traditionally, randomized controlled trials (RCTs) have been necessary to confirm causal pathways between a risk factor and an outcome. However, these are notoriously challenging for dementia research because they would require monitoring participants over many decades due to the long and ill-defined prodromal period of dementia. In addition, it would be impractical or unethical to conduct an RCT of harmful risk factors such as air pollution and traumatic brain injury. These limitations make it difficult to ascertain which risk factors would be the most useful to target in interventions, and at what point in life such interventions would be most efficacious.