The Parkinson's Disease DNA Variant Browser

ABSTRACT Background: Parkinson's disease (PD) is a genetically complex neurodegenerative disease with ~20 genes known to contain mutations that cause PD or atypical parkinsonism. Large‐scale next‐generation sequencing projects have revolutionized genomics research. By applying these data to PD, studies have reported many genes that contain putative disease‐causing mutations. In most instances, however, the results remain quite limited and rather preliminary. Our aim was to assist researchers in their search for PD‐risk genes and variant candidates by providing an easily accessible and open summary‐level genomic data browser for the PD research community. Methods: Sequencing and imputed genotype data were obtained from multiple sources, harmonized, and aggregated. Results: We included a total of 102,127 participants, comprising 28,453 PD cases, 1650 proxy cases, and 72,024 controls. Conclusions: We present here the Parkinson's Disease Sequencing Browser: a Shiny‐based web application that presents comprehensive summary‐level frequency data from multiple large‐scale genotyping and sequencing projects (https://pdgenetics.shinyapps.io/VariantBrowser/). Published © 2021 This article is a U.S. Government work and is in the public domain in the USA. Movement Disorders published by Wiley Periodicals LLC on behalf of International Parkinson and Movement Disorder Society.


Supporting Data
Additional Supporting Information may be found in the online version of this article at the publisher's web-site.
Parkinson's disease (PD) is a neurodegenerative disease hallmarked by dopaminergic neuron degradation and Lewy-body inclusions in the brain. The exact molecular mechanisms underlying PD remain largely unknown, but the disease is influenced by age, environmental, and complex genetic factors. Putative deleterious and highly functional variants in more than 20 genes and 90 common genetic risk variants have been associated with PD or atypical parkinsonism. However, the population risk of known mutations and risk loci only represents a fraction of the known detectable heritable component of disease, suggesting that additional genetic influence is yet to be identified. 1,2 Most genes associated with PD have been discovered through linkage mapping in large families, such as SNCA 3 and LRRK2. 4-6 Some studies include validation analyses in large sequencing cohorts, such as the one nominating VPS13C. 7 However, the majority of recent studies that nominate potential PD genes lack replication of their results.
Current research aggregation resources such as ClinVar are useful for searching known pathogenic variants, but the information presented often misses the context behind the clinical interpretation and lacks large case-control frequency information.
Other resources such as MDSGene (https://www.mdsgene.org/) provide in-depth genotype-phenotype information but likewise lack case-control frequencies from large studies. 8 Next-generation sequencing has produced petabytes of genomic data and has transformed genomic medicine. However, databases housing these data, such as gnomAD 9 and the BRAVO variant browser, 10 do not (yet) contain disease-specific data, and there is a need for accessible resources that specifically include allele frequencies per disease group. Here, we aggregated multiple genomic data sets of PD cases and controls and created a user-friendly browser of exonic summary data, https://pdgenetics.shinyapps.io/VariantBrowser/.

Data Aggregation
We collected sequencing data from multiple sources (Table 1). The PD Genome Project includes publicly available whole-genome sequencing data from AMP-PD (https://amp-pd.org/) and other sources. The International Parkinson's Disease Genomics Consortium (IPDGC) cohort from Parkinson's Disease Genetics Sequencing Consortium (PDGSC) data was downloaded in November 2019 and was processed using a previously described pipeline, https://github.com/ipdgc/pdgsc. The IPDGC resequencing project is a data set that includes a large number of monogenic genes (ATP13A2, FBXO7, GBA, LRRK2, MAPT, PARK7 [DJ-1], PINK1, PLA2G6, PRKN, SNCA, and VPS35) and genome-wide association study (GWAS) loci regions from a previous PD GWAS. 11 The IPDGC genotype data were processed using a previously described quality control pipeline, https://github.com/neurogenetics/GWAS-pipeline. 1,12 They were imputed using the Haplotype Reference Consortium Panel and filtered with an estimated r2 (RSQ) threshold of 0.8. UK Biobank (UKB) exome data (field 23160, "Population-level FE variants, PLINK format") were downloaded in May 2019. 13 The PD status of the UKB participants was based on UKB field number 42033, "Source of Parkinson's disease report," which determines PD status from 3 sources: self-report, hospital admission, and death registries. UKB proxy cases were defined as participants without PD but with a parent with PD based on UKB field numbers 20107 and 20110, "Illnesses of father" and "Illnesses of mother." Additional quality control was done to remove participants without case-control status and those with a mean depth of less than 20. Note that the vast majority of the data are from participants of European ancestry. Data were trimmed to exome calling regions identical to those used in gnomAD, specifically bait-covered regions plus 50 bp upstream and downstream.
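The two variant-level filters described above (trimming to bait regions plus a 50 bp flank, and the imputation-quality threshold of 0.8) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study. The function names, the single-chromosome interval representation, and the choice to keep variants with RSQ exactly equal to 0.8 are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Assumption: intervals are 1-based, inclusive (start, end) on a single chromosome.
Interval = Tuple[int, int]


def pad_and_merge(baits: List[Interval], flank: int = 50) -> List[Interval]:
    """Extend each bait interval by `flank` bp on both sides and merge overlaps."""
    padded = sorted((max(1, start - flank), end + flank) for start, end in baits)
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or directly adjacent: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def in_regions(pos: int, regions: List[Interval]) -> bool:
    """True if the position falls inside any calling region."""
    return any(start <= pos <= end for start, end in regions)


def keep_variant(pos: int, regions: List[Interval],
                 rsq: Optional[float] = None, rsq_min: float = 0.8) -> bool:
    """Keep a variant if it lies in a calling region and, when imputed,
    has an imputation quality of at least `rsq_min`.

    `rsq=None` marks a directly sequenced variant (hypothetical convention).
    Whether the original threshold was inclusive is not stated; >= is assumed.
    """
    if rsq is not None and rsq < rsq_min:
        return False
    return in_regions(pos, regions)


# Hypothetical bait coordinates, for illustration only.
regions = pad_and_merge([(1000, 1100), (1180, 1250), (5000, 5100)])
print(regions)
```

A linear scan over regions is enough for a sketch; a genome-scale implementation would more likely use an interval tree or a tool that operates directly on BED files.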
Before merging, all hg38 data were mapped to hg19 using CrossMap v0.4.0, 14 and each data set was filtered for relatedness by excluding individuals with PIHAT values > 0.125. After merging, duplicate samples were removed based on either sample ID or PIHAT values > 0.8 using PLINK v1.9 15 (Fig. S1). The data were merged, and allele count and frequency were generated using PLINK. Merged data were annotated using ANNOVAR 16 (Fig. 1).
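The relatedness and duplicate filters can be sketched as a simple pass over pairwise PIHAT estimates. The study used PLINK v1.9 for this step. The Python below only illustrates the thresholding logic on a hypothetical list of (sample A, sample B, PIHAT) tuples. The text does not say which member of a related pair was excluded, so dropping the second-listed sample is an assumption.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation of pairwise relatedness: (sample_a, sample_b, pi_hat).
Pair = Tuple[str, str, float]


def samples_to_drop(pairs: Iterable[Pair], threshold: float) -> Set[str]:
    """Greedily pick one sample to exclude from every pair above `threshold`.

    Pairs are visited from most to least related. A pair is skipped when one
    of its members has already been excluded, so no more samples are removed
    than needed to break each related pair.
    """
    dropped: Set[str] = set()
    for a, b, pi_hat in sorted(pairs, key=lambda p: -p[2]):
        if pi_hat > threshold and a not in dropped and b not in dropped:
            dropped.add(b)  # assumption: drop the second-listed sample
    return dropped


# Illustrative values only; these are not from the study.
pairs = [
    ("s1", "s2", 0.50),  # first-degree relatives
    ("s2", "s3", 0.20),  # second-degree relatives
    ("s4", "s5", 0.05),  # unrelated
    ("s6", "s7", 0.95),  # likely duplicate sample
]

related = samples_to_drop(pairs, threshold=0.125)   # per-data-set relatedness filter
duplicates = samples_to_drop(pairs, threshold=0.8)  # post-merge duplicate filter
print(sorted(related), sorted(duplicates))
```

In practice the choice of which sample to drop is often guided by call rate or case status rather than by listing order.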

Browser Design
The IPDGC Sequencing Browser was designed using the Shiny library under R version 3.6.1. All data present in the browser are nonidentifiable aggregate summary-level data. The design was inspired by the gnomAD and BRAVO variant browsers, featuring a gene-level information panel and separate variant-level windows. However, this browser increases the information density by presenting search results in a single-page format with collapsible information panels, which facilitates visualization and interpretation by the user. It also contains an integrated tutorial function to guide new users. The browser is an open-source project, and the code is available on our GitHub platform, https://github.com/kimjonggeolj/ipdgc_exome_browser.

Results
After quality control, we included a total of 6,126,909 variants from 102,446 participants, specifically 29,454 PD cases, 1650 proxy cases from UKB, and 71,342 controls. Of the 3,581,869 exonic variants (~58%), 2,144,315 (~60%) were nonsynonymous variants, and 1,078,658 (~30%) were synonymous variants (Table S1). As a positive control, we assessed the allele frequency of LRRK2 p.G2019S (rs34637584). This variant is one of the most common PD genetic factors associated with both familial and sporadic forms of disease. 17 Our browser shows the minor allele frequency of this variant at 0.007716 for cases, 0.0003201 for controls, and 0.001212 for proxy cases. Zygosity distributions show a similar pattern. Of 23,068 cases and 64,036 controls, there are 354 heterozygous cases and 41 heterozygous controls. There is 1 homozygous carrier among the cases and none among the controls. An association test (chi-square allelic test) without adjustment for any covariate and excluding proxy cases showed an odds ratio of 24.28 (95% CI, 17.57-33.6) and P = 1.76 × 10^-179, which is in line with previous reports for this variant. Another example is the SNCA p.A53T (rs104893877) variant, which was the first pathogenic SNCA variant described resulting in autosomal-dominant PD. 3 The browser shows 2 cases carrying this variant and no controls. The browser can also be used for autosomal-recessive disorders, for example, with PRKN (PARK2) p.R275W (rs34424986), one of the most common PRKN pathogenic variants. 18 Six cases were homozygous for this variant, and no controls were identified in a homozygous state. These results give confidence that the data set can be used to identify or provide evidence for a potential PD causal variant.
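The LRRK2 p.G2019S figures can be reproduced from the genotype counts given above. The following Python sketch is not the authors' analysis code. It assumes an uncorrected 1-degree-of-freedom chi-square test on the 2 × 2 allele table and a Wald 95% confidence interval on the log odds ratio, and all function names are hypothetical. Under those assumptions it yields values consistent with the reported frequencies, odds ratio, confidence interval, and P value.

```python
import math
from typing import Tuple


def allele_counts(n_het: int, n_hom_alt: int, n_samples: int) -> Tuple[int, int]:
    """Return (alternate, reference) allele counts for a diploid autosomal site."""
    alt = n_het + 2 * n_hom_alt
    return alt, 2 * n_samples - alt


def allelic_test(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int):
    """Allelic association on a 2x2 allele-count table.

    Returns (odds_ratio, ci_low, ci_high, chi2, p_value).
    Assumptions: Wald interval on log(OR), Pearson chi-square without
    continuity correction, 1 degree of freedom.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df, via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, ci_low, ci_high, chi2, p_value


# Genotype counts reported in the text for LRRK2 p.G2019S.
case_alt, case_ref = allele_counts(n_het=354, n_hom_alt=1, n_samples=23068)
ctrl_alt, ctrl_ref = allele_counts(n_het=41, n_hom_alt=0, n_samples=64036)

case_maf = case_alt / (case_alt + case_ref)
ctrl_maf = ctrl_alt / (ctrl_alt + ctrl_ref)
odds_ratio, ci_low, ci_high, chi2, p_value = allelic_test(
    case_alt, case_ref, ctrl_alt, ctrl_ref
)

print(f"MAF cases={case_maf:.6f} controls={ctrl_maf:.7f}")
print(f"OR={odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), P={p_value:.3g}")
```

Because the test works on allele counts, it treats the two alleles of each participant as independent observations and ignores covariates such as age, sex, and ancestry, which is consistent with the unadjusted analysis described in the text.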

Discussion
Here we present the Parkinson's Disease DNA Variant Browser, a public platform for the scientific community that allows rapid querying of specific genes and variants in several large case-control cohorts. Provided with a gene name or gene boundaries, the browser will present the user with summarized information on the variants found within the gene, such as the distribution of variants categorized by their functional consequences. Given a specific variant, the browser will present the user with annotated information on the variant including allele
Although we performed extensive quality control to ensure that high-quality information was used, the data presented in the browser have inherent limitations. The merged data sets were generated with different technologies, including whole-genome sequencing, whole-exome sequencing, resequencing, and imputed array-based genotyping, and were aligned to different genome builds (hg19 and hg38). This leaves gaps from low-imputation regions and cross-mapping failures, although the cross-mapping introduced less than 0.1% reduction in the total number of variants (Fig. S1). Although no sequencing or genotyping method can guarantee perfect data, imputed data by their nature carry additional uncertainty. Our quality control filters reduce this uncertainty, but users should nevertheless always consult the specific study-level breakdown of the variant frequency and count. The presented data only include autosomal data and do not contain any variants from sex chromosomes. Furthermore, the data presented only include exonic regions and their immediate flanks; thus, they cannot provide information on the majority of noncoding variants. Although this database contains multiple whole-exome and whole-genome sequencing data sets, it may still be difficult to identify significant differences in allele frequency and count for very rare variants, especially those with recessive inheritance.
Researchers should use the annotated information such as ClinVar significance to critically assess any candidate variants. In addition, we only included a very limited amount of phenotypic data, namely case-control status, which creates potential bias from age-related penetrance. In future versions, we aim to include age of onset and other phenotype data if available. Of note, the majority of the data included are from individuals of European ancestry. We hope to increase the diversity of the data in future versions. Last, some genes, regions of interest, and structural variants are very complicated to genotype and sequence (including the GBA gene because of its high sequence similarity with its pseudogene), and therefore interpretation of these complex regions should be done with caution. Future larger-scale and targeted studies will hopefully resolve the issue with complex genomic regions.
In summary, we present here an online resource developed for the PD research community to quickly retrieve annotated genomic information on genes and variants in a user-friendly manner, without any required data science or coding experience. Users can access the browser to get information on reported PD risk factors or supplement their own research with data from a large-scale data set. We envisage this browser to be the first step toward easily sharing genomic information that will be continuously updated as new data become available.
Acknowledgments: We thank all the subjects who donated their time