Similar burden of pathogenic coding variants in exceptionally long‐lived individuals and individuals without exceptional longevity

Abstract Centenarians (exceptionally long‐lived individuals—ELLI) are a unique segment of the population, exhibiting long human lifespan and healthspan, despite generally practicing similar lifestyle habits as their peers. We tested disease‐associated mutation burden in ELLI genomes by determining the burden of pathogenic variants reported in the ClinVar and HGMD databases using data from whole exome sequencing (WES) conducted in a cohort of ELLI, their offspring, and control individuals without antecedents of familial longevity (n = 1879), all descendent from the founder population of Ashkenazi Jews. The burden of pathogenic variants did not differ between the three groups. Additional analyses of variants subtypes and variant effect predictor (VEP) biotype frequencies did not reveal a decrease of pathogenic or loss‐of‐function (LoF) variants in ELLI and offspring compared to the control group. Case–control pathogenic variants enrichment analyses conducted in ELLI and controls also did not identify significant differences in any of the variants between the groups and polygenic risk scores failed to provide a predictive model. Interestingly, cancer and Alzheimer's disease‐associated variants were significantly depleted in ELLI compared to controls, suggesting slower accumulation of mutation. That said, polygenic risk score analysis failed to find any predictive variants among the functional variants tested. The high similarity in the burden of pathogenic variation between ELLI and individuals without familial longevity supports the notion that extension of lifespan and healthspan in ELLI is not a consequence of pathogenic variant depletion but rather a result of other genomic, epigenomic, or potentially nongenomic properties.


| INTRODUC TI ON
Exceptionally long-lived individuals (ELLI) are a unique segment of the population who exhibit not only long human lifespan but also long healthspan, and seemingly often overcome the adverse environmental effects on their physiological health (Ismail et al., 2016;Milman & Barzilai, 2015;Sebastiani et al., 2013). For this reason, they represent an extreme phenotype of successful aging (Tesi et al., 2018). The prevalence of centenarians is estimated to be approximately 1/3000 individuals in the United States (US) and Europe (Teixeira, Araújo, Jopp, & Ribeiro, 2017), and this rare group is studied all around the globe (Nebel & Schreiber, 2004;Teixeira et al., 2017) with the aim of identifying the biological mechanisms for healthy aging.
Interestingly, somatic mutations are also known to accumulate with age (Milholland, Auton, Suh, & Vijg, 2015;Ye et al., 2013), challenging the physiological homeostasis and relative health observed in ELLI. Consequently, it could be hypothesized that one would expect to find a higher number of pathogenic variants in ELLI than in unrelated controls. A contradicting hypothesis is that ELLI possess "the perfect genome," containing a lower burden of pathogenic variation compared to the general population (Freudenberg-Hua et al., 2016;Milman & Barzilai, 2015;Stevenson et al., 2015;Ye et al., 2013). Both hypotheses require gathering of additional evidence in support or contradiction of them.
We aimed to test the hypothesis of whether the ELLI genomes are relatively depleted of coding pathogenic variants compared to individuals without genetic predisposition to exceptional longevity in a cohort of ELLI, offspring of ELLI, and unrelated controls without familial longevity, using a cohort (differing from the above mentioned) from a founder population of Ashkenazi Jews (Table 1). Using a population with a strong founder effect increases statistical power to identify genetic factors responsible for traits of interest (Carmi et al., 2014;Freudenberg-Hua et al., 2014;Lencz et al., 2018). The Ashkenazi Jewish population in the United States is among the largest founder populations in the world, and as such, it has substantial potential for natural variation and offers sufficient genetic and phenotypic diversity (Carmi et al., 2014;Lencz et al., 2018). The coding disease-associated variants that were investigated were sourced from two well-established databases, ClinVar (Landrum et al., 2017), and the Human Gene Mutation Database (HGMD ® ) (Stenson et al., 2017). ClinVar is a publicly available database that compiles and aggregates interpretations of clinically relevant genetic variants. It is one of the largest publicly available databases for clinically relevant variation and provides a reliable and updated source for analyses of pathogenic variation burden in genomic samples. HGMD is a curated commercial database that catalogues genetic variation reported as associated with human diseases. We chose to assess pathogenic variants since it was demonstrated that higher burden of disease-associated variants is correlated with higher disease risk (Bick et al., 2012;Milholland et al., 2015;Patel et al., 2017). Together, the variants from both databases comprise a comprehensive list of pathogenic variants and were used to assess, using various approaches, the difference in disease-causing mutation load (defined as the amount of potentially harmful mutations per individual) between the three groups in our cohort.

| Annotation of pathogenic exome variants
A total of 777,023 coding variants (623,003 in ELLI, 656,599 in offspring, and 609,864 in controls, Figure S1) passed the QC stage ( Figure S2) and were queried using the compiled list of disease-associated variants. The dispersion of the groups was homogenous as can be seen in Figure S3. The three groups had a large portion of variants in common. Among all the variant identified 64.9% (504,861) of the variants were shared between all 3 groups and among the pathogenic variants annotated 74.5% (6262) of the variants were shared between all 3 groups ( Figure S1 and Figure 1, diagrams generated using VennDiagram R package (62)). The total number of variants recognized by Ensemble VEP was 7288 in ELLI, 7470 in offspring, and 7062 in controls. The distributions of variants by biotype and coding consequences were very similar between the cohort groups ( Figures S4 and S5). The frequencies of VEP biotypes also did not differ by group ( Figure S8).

| Case-control association analysis
In order to identify candidate pathogenic variants that were differentially enriched in any of the groups, we performed case-control association tests. These tests did not reveal any variants that were significantly associated with ELLI status (Figures S3, S6-S8 and Table   S1).

| Mutation load
The median mutation load per individual of all pathogenic variants (Table S2) was not significantly different between the three groups for both heterozygous and homozygous variants (KW p = 0.2 and 0.29, for hetero-and homozygous variants, respectively, Table S3, Figure 2 and Figure S9).
Interestingly, the difference between ELLI and controls was sig-  Figure 2 and Figure S9). Analysis of disease-specific suscepti-  (Table S5) did not yield any significant differences between the groups (Table S3 and Figure S10). An additional categorization of the strict filtering into autosomal recessive (AR), autosomal dominant (AD), and both autosomal recessive and dominant (AR/AD) modes of inheritance did not highlight any differences between the groups either (Table S6). These results did not vary by gender.

| Variant effect predictions, eQTL characterization, and polygenic risk scores
MAFs (Minor Allele Frequency) were evaluated in order to assess the frequency of rare variants between the three groups, revealing no statistically significant differences between ELLI, offspring, and controls (Figures S11 and S12). To gain deeper biological insights alluding to possible molecular function of variants, variant effect predictions and biotypes were queried and found almost identical between the three groups ( Figure S5). The pathogenic variants that Polygenic risk scores showed low predictive value of SNPs in our data set. The highest R2 values were all below 0.025 indicating very low predictive value for the longevity phenotype (Figures S13-S15 and Table S7).
In an effort to explore factors that may be responsible for these unique characteristics, we conducted a study to test whether the genomes of ELLI are depleted, or not, of pathogenic variants com- The similarity in amounts of variants was striking to us, especially given the expectation of somatic mutation accumulation previously reported (Milholland et al., 2015;Ye et al., 2013). Keeping in mind the chronological age gap between our cohort groups, a similar mutation accumulation between ELLI and controls suggests a different aging rate for the ELLI. Very low polygenic risk scores, obtained using longevity as the trait tested, indicate no predictive value and elude away from a gene coding interaction underlying the exceptional longevity phenotype. With 490-503 variants included in the analyses, there was no prediction of the longevity phenotype among our functional variants. These findings suggest that exceptional lifespan and healthspan are not attributable to a relative depletion of pathogenic gene variants.
Noteworthy are two significant differences we observed. (1) When looking into specific age-associated diseases, we found that the ELLI group carried less pathogenic variants associated with cancer and with Alzheimer's disease, in contrast to our expectations based on Milholland et al. (2015) and Ye et al. (2013). This result is intriguing in light of the vast evidence linking somatic mutation accumulation and those two age-associated diseases (Dapeng, Wang, & Di, 2016;Lodato & Walsh, 2019;Milholland et al., 2015;Park et al., 2019). Further, it is possible that the mutation accumulation in our ELLI group is slower than the accumulation in the control group; however, this rate was not examined in this study. That said, the similarity in amounts of pathogenic variants and the specific significance in difference in the cancer and Alzheimer's disease variants hint at this and can provide a lead for a follow-up study.
In the context of other findings, this result is not surprising.  (Stevenson et al., 2015). These results are also consistent with another study that characterized the whole exome of a pair of ELLI brothers and did not find any significant difference between them and the population genome (Tindale et al., 2015). The consistency of our results with these studies steers away from the "perfect genome" hypothesis.
The findings of this study support phenotypic and lifestyle stud- This hypothesis requires additional in depth assessment and testing, not presented in our current study.
The use of whole exome sequencing data for this analysis allowed for the comparison of both common and rare functional coding variants in a large cohort of ELLI, offspring, and controls from the same founder population, strengthening the genetic homogeneity of this study. Thus, the lack in pathogenic variant differences between the study groups is unlikely to be confounded by differences among ELLI and controls. Additionally, focusing on the "Pathogenic/Likely Pathogenic" ClinVar variants together with HGMD variants and further subjecting these to annotation by the Enselmbl VEP, allowed us

| Sequencing and alignment, variant identification, and genotype assignment
Whole exome sequencing (WES) of the cohort was performed in collaboration with the Regeneron Genetics Center (RGC) following methods previously described (Strauss et al., 2018  We continued with 777,023 autosomal only variants with minimum allele count (MAC) of 1 that were divided into three sets for ELLI (N = 623,003 variants), offspring (N = 656,599 variants), and controls (N = 609,864) ( Figure S1) some of which are unique to each group, or shared by 2 of 3 groups. These variant sets were used to compare against a master pathogenic dataset (comprised of HGMD variants and ClinVar variants as described below) for downstream analyses.

| Quality control for case-control association analysis
GRCh38 ClinVar database (downloaded April 7, 2019 from ftp:// ftp.ncbi.nlm.nih.gov/pub/clinv ar/) was filtered by clinical significance and only "likely pathogenic" or "pathogenic" annotations were retained. Variants containing conflicting evidence were removed (Landrum et al., 2017). These variants were merged according to chromosome and position with HGMD variants filtered for "High" confidence and "DM" (disease-causing) classifications. This merged list contained 225,492 pathogenic variants. After extraction of these variants from our exome data, we obtained a dataset of 8853 pathogenic variants that was used for all analyses and will be referred to as "pathogenic variants" (Table S2).
Datasets for case-control association analysis, containing only autosomal chromosomes, were prepared for each pair of groups.
Within each set of case-control pair, we performed extended sample and variant QC, according to Anderson et al. (2010). First, samples in the case-control pairs were filtered based on sample missingness (>5%), cryptic relatedness using Identity-By-Descent analysis (pi_hat > 0.1785) and outliers' removal using Eigensoft smartPCA.
The two latter analyses were performed on a LD-pruned subsets of variants with minor allele frequency (MAF) > 1% (Anderson et al., 2010). Since we sampled 159 direct offspring of ELLI, we wanted to check whether including them would affect the analyses; therefore, we filtered the offspring-ELLI pair for cryptic relatedness with a less stringent pi_hat (0.43), removing only first-degree relatives.
The variants in the case-control pairs were filtered by missingness (>10%), MAF > 0.1%, differential missingness between case and con-  (Price et al., 2006) in order to characterize population substructure prior to proceeding with statistical and bioinformatic analyses ( Figure S3).
The final preparation for the case-control association analysis was the extraction of the pathogenic variants from our case-control pairs. This extraction resulted in 7288, 7470, and 7062 variants for ELLI, offspring, and controls, respectively. Case-control association analysis using allelic model in Plink 1.9 software (Chang et al., 2015;Marees et al., 2018) was conducted on the resulting pairs. Inflation was tested using Q-Q plots (Clayton, 2020) revealing a slightly deflated genomic inflation factor with small variation from expected distribution (0.832-1.03) that likely resulted from the inclusion of rare variants in the analysis (MAF > 0.1%) ( Figure S6). Manhattan plots were created using the R package qqman (Turner, 2014).

| Variant annotation
Overlapping variants between each of our group variant sets and the pathogenic variants were further annotated using Variant Effect Predictor (VEP) by Ensembl (Yates et al., 2016)

| Mutation load and eQTL characterization
This analysis was performed on the three groups' (ELLI, offspring, and control) data that were filtered only for autosomal variants and MAC = 1. We tested mutation load in 3 sets of our pathogenic variants data: (1) the full pathogenic variants dataset, (2) common age-associated disease (T2D, Stroke, Cancer, CVDs and myocardial infarction, Alzheimer's, Parkinson, Dementia, and COPD) susceptibility variants, which were filtered for out of the pathogenic variants set (Table S8), and (3) a more strict filter of pathogenic variants including only pathogenic variants with at least 2 literature reports (2*) for ClinVar and only high confidence disease-causing variants from HGMD (Table S5). The age-associated diseases (2)  We thank Andrew Blumenfeld for bioinformatics support.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data products from this study will be made available to researchers, collaborators and analysts without cost and upon request. Tables S2,   S5, S8 and a.BIM format file available for sharing through a Google Drive link, which may be accessed via email to the corresponding author. Shared Google Drive link is required to access or download files. As part of the sharing process, users must agree to the conditions of use governing access to the public release of data, including restrictions against attempting to identify study participants, necessity of destruction of data after analyses are completed, reporting responsibilities, restrictions on redistribution of data to third parties, and proper acknowledgement of the data resource. Authorized users will receive user support as well as information related to errors in the data, notice of future releases, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.