Integrated genetic analyses revealed novel human longevity loci and reduced risks of multiple diseases in a cohort study of 15,651 Chinese individuals

Abstract There is growing interest in studying the genetic contributions to longevity, but limited relevant genes have been identified. In this study, we performed a genetic association study of longevity in a total of 15,651 Chinese individuals. Novel longevity loci, BMPER (rs17169634; p = 7.91 × 10−15) and TMEM43/XPC (rs1043943; p = 3.59 × 10−8), were identified in a case–control analysis of 11,045 individuals. BRAF (rs1267601; p = 8.33 × 10−15) and BMPER (rs17169634; p = 1.45 × 10−10) were significantly associated with life expectancy in 12,664 individuals who had survival status records. Additional sex‐stratified analyses identified sex‐specific longevity genes. Notably, sex‐differential associations were identified in two linkage disequilibrium blocks in the TOMM40/APOE region, indicating potential differences during meiosis between males and females. Moreover, polygenic risk scores and Mendelian randomization analyses revealed that longevity was genetically causally correlated with reduced risks of multiple diseases, such as type 2 diabetes, cardiovascular diseases, and arthritis. Finally, we incorporated genetic markers, disease status, and lifestyles to classify longevity or not‐longevity groups and predict life span. Our predictive models showed good performance (AUC = 0.86 for longevity classification and explained 19.8% variance of life span) and presented a greater predictive efficiency in females than in males. Taken together, our findings not only shed light on the genetic contributions to longevity but also elucidate correlations between diseases and longevity.


| INTRODUCTION
The average human life expectancy has been rising for decades (Greene, 2001; Oeppen & Vaupel, 2002), and it was recently estimated that the number of long-lived individuals (more than 90 years old) was 63.5 million worldwide as of 2020 (United Nations, 2019). Longevity is a complex trait that is influenced by genetic and environmental factors and their interactions (Passarino et al., 2016). Twin studies (Herskind et al., 1996; Skytthe et al., 2003) have estimated that the heritability of longevity is approximately 20%-30% in modern societies, and the proportion increases to approximately 40% for long-lived individuals (Hjelmborg et al., 2006; Perls et al., 2000; Terry et al., 2007; van den Berg et al., 2019). Although longevity is considered to exhibit relatively high heritability, few genetic loci related to this trait have been identified in previous genome-wide association studies (GWAS; Deelen et al., 2014; Joshi et al., 2017; McDaid et al., 2017; Sebastiani et al., 2012; Zeng et al., 2016). Apolipoprotein E (APOE) is the only gene that has been replicated by multiple independent GWAS meta-analyses (Joshi et al., 2017; McDaid et al., 2017). One recent meta-analysis revealed rs7676745 near GPR78 as a novel locus. In European populations, other GWAS meta-analysis studies have replicated several longevity genes, including CHRNA3/5, CDKN2A/B, SH2B3, and FOXO3A (McDaid et al., 2017). Our previous GWAS in a Chinese population additionally identified IL6 and ANKRD20A9P (Zeng et al., 2016). One possible reason for the lack of replication could be the variation in phenotype definitions. Some studies have compared old cases with young controls, and the selection of age cutoffs varies among studies. A recent study conducted on multiethnic datasets used the 90th/99th survival percentile as the age cutoff.
Some other studies have used more extreme cutoffs, with only centenarians being included among the cases (Sebastiani et al., 2012, 2017; Zeng et al., 2016). Therefore, CLHLS can provide an ideal dataset for analyzing the association of genetic and non-genetic data with life span in humans.
In addition to studying the genetics of longevity and life span, age-related diseases and their correlations with longevity have attracted much attention (Sakaue et al., 2020). In both human centenarians and long-lived animals, it has long been observed that longevity and the occurrence of diseases, such as cardiovascular disease and cerebral stroke, are inversely correlated either genetically or experimentally (Altmann-Schneider et al., 2013; Hammond et al., 1971; Rosa et al., 2019; van der Lee et al., 2019). A previous study, using genetic data of parental life span, reported genetic correlations between several complex traits and mortality in a general population of European ancestry. Our study defined individuals with ages greater than 90 as the longevity group, rather than using parental survival, which is a debatable longevity phenotype. Therefore, a systematic exploration of the correlation between longevity and complex diseases in the current study may reveal more information.
Another research interest is to predict longevity and life span based on age-related diseases and genetic markers. The polygenic risk score (PRS) generated from the summary statistics of association studies is a commonly used predictor for genetic factors. For example, a recent genetic study reported that a polygenic score could identify people with the top 10% parental survival PRS, who might outlive those with the bottom 10% parental survival PRS by an average of 5 years (Timmers et al., 2019). In addition, circulating glucuronic acid levels (Ho et al., 2019) or telomeres (Whittemore et al., 2019) have been used as biomarkers for life span prediction. To date, very few studies have explored the potential of life span prediction by using a combination of genetic data, disease conditions, and lifestyle factors.
Here, we performed a large-scale integrated analysis based on 15,651 Chinese individuals from CLHLS to identify longevity genes, to explore the relationships between diseases and longevity, and to apply these longevity-related factors to life span prediction. This study, which includes 2,509 centenarians, is one of the largest centenarian studies in the world (Sebastiani et al., 2017; Timmers et al., 2019). We first designed a customized SNP chip using a carefully selected set of SNPs that captured 27,656 candidate variants correlated with longevity, age-related diseases, and immunity. Next, we carried out a candidate-gene association analysis on the age-stratified phenotype ("cases" were defined as individuals surviving 90 years or over, while the "controls" had an age of less than 75, which is the average life span in China) and on life span, respectively. Then, we performed a meta-analysis incorporating the current dataset and our previously published GWAS dataset after removing the overlapping samples. Moreover, we evaluated the polygenic prediction of diseases on longevity using polygenic risk score (PRS) analysis and inferred causal relationships between longevity and diseases using the bidirectional Mendelian randomization method. Finally, we built predictive models for longevity and life span by integrating genetic factors, disease status, and lifestyles. Overall, this study aimed to reveal the sex-combined and sex-specific longevity genes/pathways and investigate their predictive effectiveness on longevity and life span.

| Participants and phenotypes
This study included a total of 15,651 individuals from the Chinese Longitudinal Healthy Longevity Surveys (CLHLS), which were conducted in 1998, 2000, 2002, 2005, 2008, 2011, and 2014 in a randomly selected half of the counties and cities in 22 out of 31 provinces in China. The primary dataset (dataset 1) included 13,228 individuals with ages ranging from 30 to 114. All individuals were genotyped using a customized chip targeting approximately 27 K longevity-related SNPs. These candidate SNPs were selected based on previously published associations with longevity, chronic diseases, and health indicators. For replication purposes, dataset 2 included 4,477 individuals from our previous study (Zeng et al., 2016). A total of 2,054 samples overlapped between the two datasets, giving 13,228 + 4,477 − 2,054 = 15,651 unique individuals.
Demographic and clinical information (i.e., diseases) was recorded for participants in this study. Phenotypic data were collected using internationally standardized questionnaires adapted to the Chinese cultural and social context. The CLHLS study was approved by the Biomedical Ethics Committee of Peking University (IRB00001052-13074). All participants or their legal representatives signed written consent forms in the baseline and follow-up surveys.

| Customized SNP chip design
We customized a SNP chip containing 27,656 selected longevity- and disease-related SNPs for targeted genotyping (Table S1). The selected SNPs could be characterized as corresponding to five major components (Tables S2 and S3): (1) 11,893 SNPs associated with longevity based on our previous CLHLS GWAS study on 4,477 Chinese individuals (Zeng et al., 2016); (2) 1,881 reported longevity SNPs based on the other previously published longevity studies, including the European Union (EU) longevity (Deelen et al., 2014)

| Sample filtering
The samples were required to meet three selection criteria: (1) a genotype calling rate >90%; (2) no population stratification according to a multidimensional scaling (MDS) procedure implemented in PLINK v1.07, based on which individuals deviating from the main population cluster were removed; and (3) no duplicates or first-degree relatives, as evaluated pairwise through identity by descent (IBD). After sample filtering, 12,664 samples were included in dataset 1.

| Variant filtering
To obtain high-quality genotypes, we applied a conservative inclusion threshold for variants: (1) minor allele frequency >5%; (2) genotype calling rate >90%; and (3) Hardy-Weinberg equilibrium (HWE) p > 10−5. To further confirm the quality of the genotypes, we calculated the concordance rate of the genotypes using 2,054 samples that overlapped between dataset 1 and dataset 2. Then, we removed the variants with a concordance rate <0.9 (Figure S7), which largely eliminated the bias caused by two different arrays (Illumina ZhongHua and Affix arrays). After variant filtering, 23,769 out of the 27,656 variants remained in dataset 1, and 818 K out of the 900 K variants remained in dataset 2.
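The three variant-level thresholds above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study (which relied on PLINK): the function name `variant_qc`, the 0/1/2 genotype coding, and the use of a 1-df chi-square test for HWE (rather than an exact test) are assumptions made for illustration only.

```python
import math

def variant_qc(genotypes, maf_min=0.05, call_rate_min=0.90, hwe_p_min=1e-5):
    """Apply MAF, call-rate, and HWE filters to one variant.

    genotypes: list of allele counts (0, 1, 2), with None for a missing call.
    Hypothetical helper; thresholds mirror those stated in the text.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "keep": False}

    freq = sum(called) / (2 * n)          # frequency of the counted allele
    maf = min(freq, 1 - freq)

    # Assumption: 1-df chi-square goodness-of-fit test against HWE proportions.
    observed = [called.count(0), called.count(1), called.count(2)]
    expected = [n * (1 - freq) ** 2, 2 * n * freq * (1 - freq), n * freq ** 2]
    if min(expected) == 0:
        hwe_p = 1.0                       # monomorphic variant: test undefined
    else:
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square (1 df) upper tail

    keep = maf > maf_min and call_rate > call_rate_min and hwe_p > hwe_p_min
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "keep": keep}
```

A variant with only heterozygous calls, for example, has MAF 0.5 but fails the HWE filter and would be removed.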

| Imputation
We performed imputation analysis by pre-phasing genotypes with SHAPEIT v2.5 (Delaneau et al., 2011), and then imputing variants using IMPUTE2 v2.3.1 (Howie et al., 2009) with the 1000 Genomes Project release of October 2014 (2,504 samples; http://1000genomes.org) as a reference panel. SNPs with a quality score (R2) >0.9 were included after imputation. After further quality control filtering for SNPs as described above, we eventually obtained 287 K SNPs from 12,664 individuals in dataset 1 and 5.6 M SNPs from 4,477 individuals in dataset 2 for the subsequent genetic association analyses.

| MHC analysis
To identify potential MHC associations for longevity, 2,656 MHC tag SNPs were included in the 27 K arrays for dataset 1. Then, we used Beagle 5 (Browning et al., 2018) with the HAN-MHC datasets as a reference panel to impute MHC alleles, and the imputation accuracy was 0.96 at the two-digit level as previously described (Zhou et al., 2016). In dataset 2, the samples were genotyped using Illumina HumanOmniZhongHua-8 BeadChips tagging 900,015 SNPs, among which 8,350 SNPs were located in the MHC region. We imputed the MHC alleles using the same procedure applied for dataset 1 and obtained 104 imputed HLA alleles present in both datasets. For each dataset, 104 tests were performed in the cases and controls. In each test, one allele was compared with the other 103 alleles grouped together. The allelic 2 × 2 contingency table for a specific HLA allele contained the counts of that allele and the counts of the other 103 alleles in cases and controls. We next performed a meta-analysis of the two datasets for the 104 imputed HLA alleles for longevity. Finally, a Bonferroni-corrected threshold of p < 0.0005 (0.05/104, for 104 alleles) was defined as significant.
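The one-allele-versus-the-rest comparison can be sketched as follows. The text does not state which test was applied to the 2 × 2 table, so the Pearson chi-square test used here is an assumption, and the function name and the example counts in the usage note are hypothetical.

```python
import math

N_ALLELES = 104
ALPHA = 0.05 / N_ALLELES  # Bonferroni threshold, about 0.00048

def allelic_test(a_case, other_case, a_ctrl, other_ctrl):
    """Pearson chi-square test (1 df) on a 2 x 2 table of allele counts.

    a_*:     copies of the HLA allele under test in cases / controls
    other_*: copies of the other 103 alleles, grouped together
    Assumes all four counts are non-zero. Returns (odds_ratio, p_value).
    """
    n = a_case + other_case + a_ctrl + other_ctrl
    cross = a_case * other_ctrl - other_case * a_ctrl
    denom = ((a_case + other_case) * (a_ctrl + other_ctrl)
             * (a_case + a_ctrl) * (other_case + other_ctrl))
    chi2 = n * cross ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square (1 df) upper tail
    odds_ratio = (a_case * other_ctrl) / (other_case * a_ctrl)
    return odds_ratio, p_value
```

For instance, `allelic_test(120, 880, 60, 940)` yields an odds ratio above 2 and a p-value below `ALPHA`, so that made-up allele would be declared significant under this threshold.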

| Association analysis for longevity
We performed genetic association analysis of 287 K imputed SNPs in dataset 1. More specifically, after sample filtering, a total of 8,490 individuals (4,662 cases with an age ≥90 and 3,828 controls with an age <75; 75 is the average life span of Chinese individuals) in dataset 1 were used for a case-control association analysis. We then performed association analysis in dataset 2 (Zeng et al., 2016). Because 1,922 individuals in dataset 2 overlapped with the 8,490 case/control samples in dataset 1, the case-control association analysis was performed in the 2,555 independent samples that remained after removing these 1,922 individuals from the 4,477 samples of dataset 2. The 2,555 independent samples included 1,105 centenarian cases and 1,450 controls with age <65. For each dataset, we applied logistic regression to calculate the p-values and odds ratios (ORs) of the SNPs, adjusting for sex and the top two MDS dimensions, using PLINK 1.07.
Next, a meta-analysis was performed on the two case-control association results, using an inverse-variance weighted fixed-effect meta-analysis.
To investigate the correlations between the identified longevity-related SNPs and diseases and other traits, we reviewed the disease GWAS performed in this study (see Section 2.12 below). Then, we downloaded the summary statistics data from the Japan BioBank, a study of 300,000 Japanese citizens suffering from cancers, diabetes, rheumatoid arthritis, and other common diseases (Triendl, 2003).
Similarly, we searched for the longevity SNPs in the summary statistics data from the Japan BioBank to examine their associations with metabolic traits and diseases.
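The inverse-variance weighted fixed-effect combination of the two case-control results can be sketched in a few lines. This is a generic illustration of the estimator, not the software used in the study; the function name is hypothetical, and the inputs are per-dataset log odds ratios with their standard errors.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    betas: per-study effect estimates (e.g., log odds ratios)
    ses:   their standard errors
    Returns (pooled_beta, pooled_se, two_sided_p) under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in ses]          # w_i = 1 / SE_i^2
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value
    return pooled_beta, pooled_se, p_value
```

The larger dataset dominates the pooled estimate because its standard error is smaller, which is the intended behavior when one study (here, dataset 1) contributes most of the samples.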

| Sex-specific association analysis for longevity
We performed sex-specific genetic association analyses in males and females separately. Male-specific variants were identified as those that (1) were significantly associated with longevity in males (p_male < 5 × 10−8) but not significant in females (p_female > 0.05), and (2) exhibited a nominally significant sex difference (p-value testing for difference in sex-specific effect estimates, p_difference < 0.05; formula 1). Female-specific variants were identified as those that (1) were significantly associated with longevity in females (p_female < 5 × 10−8) but not significant in males (p_male > 0.05), and (2) exhibited a nominally significant sex difference (p_difference < 0.05; formula 1). For each variant, we calculated p_difference, testing for the difference between the male-specific and female-specific beta estimates β_m and β_f using the t-statistic (formula 1), where SE_m and SE_f are the standard errors of the beta estimates in the two sex groups.
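Formula 1 itself did not survive in this text. A common form of such a statistic, assuming independent male and female samples, is t = (β_m − β_f) / sqrt(SE_m² + SE_f²); some studies subtract an additional correlation term in the denominator, and it is not known from the text which variant was used. The sketch below implements the simpler form with a normal approximation for the p-value; the function name is hypothetical.

```python
import math

def sex_difference_test(beta_m, se_m, beta_f, se_f):
    """Test for a difference between male and female effect estimates.

    Assumption: the two estimates are independent, so the variance of the
    difference is SE_m^2 + SE_f^2 (no correlation term).
    Returns (t_statistic, two_sided_p) using a normal approximation.
    """
    t_stat = (beta_m - beta_f) / math.sqrt(se_m ** 2 + se_f ** 2)
    p_difference = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_difference
```

Under the criteria in the text, a variant would be called male-specific only if `p_difference < 0.05` in addition to the per-sex thresholds on p_male and p_female.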

| Functional annotation and enrichment analysis
The significant loci with p < 10 −5 identified in the sex-combined genetic association analysis with longevity were mapped to genes using (1) To identify sex-specific longevity pathways, the best-fit p-value cutoffs of 5 × 10

| Observational correlation analysis
We had detailed questionnaire information including sex, age, diseases, cognition, and lifestyle factors. The observational correlation analysis was performed to assess the statistical relationship (i.e., the correlation) between longevity and these influencing factors, which was evaluated by multivariable linear regression analysis while adjusting for sex and the top two MDS dimensions (Table S13). In the multivariable linear regression model, the phenotype was the longevity trait, with longevity cases coded as 1 and middle-aged controls as 0. The variables included diseases, cognition, and lifestyle factors; for example, suffering from a disease or not was coded as 1 and 0, respectively.

| Polygenic risk score (PRS) analysis
In this cohort study, we calculated weighted polygenic risk scores based on 3,966 known susceptibility markers from the GWAS catalog for many age-related diseases (Table S5). We imputed the missing risk alleles and corresponding beta weights whenever possible by checking the details in the original reports. Markers were coded additively, and the logarithms of the reported odds ratios were used as weights. All markers were clumped by pairwise linkage disequilibrium (r2 > 0.8) prior to constructing the polygenic risk score. Each disease with at least five SNPs was used to generate the PRS for each individual, and 87 diseases were ultimately included (Table S14). PRS analysis was performed to calculate the correlations of longevity and disease risks not only in all individuals but also in males and females separately.
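The weighted score described above (additive coding, log odds ratios as weights, at least five SNPs per disease) can be sketched as follows. The function name, the dictionary-based inputs, and the SNP identifiers in the usage note are hypothetical, and LD clumping is assumed to have been done beforehand.

```python
import math

MIN_SNPS = 5  # diseases with fewer markers are not scored, as stated in the text

def polygenic_risk_score(dosages, odds_ratios):
    """Weighted PRS for one individual and one disease.

    dosages:     {snp_id: number of risk alleles carried (0, 1, or 2)}
    odds_ratios: {snp_id: reported odds ratio of the risk allele}
    Returns None if the disease has fewer than MIN_SNPS markers.
    Markers without a genotype are skipped here (an illustrative choice).
    """
    if len(odds_ratios) < MIN_SNPS:
        return None
    score = 0.0
    for snp_id, odds_ratio in odds_ratios.items():
        if snp_id in dosages:
            score += math.log(odds_ratio) * dosages[snp_id]  # weight = ln(OR)
    return score
```

With five made-up markers that each have an odds ratio of 1.2, an individual carrying one risk allele at each would score 5 × ln(1.2), roughly 0.91.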

| Mendelian randomization (MR) analysis
We had detailed disease records for each individual. Therefore, we performed a GWAS for each of the disease types using the same approach that we applied for the longevity association analysis, and we used the summary statistics for longevity and the various diseases for MR analysis. To calculate the causal effect of longevity on diseases, as well as the effects of diseases on longevity, we performed a bidirectional MR analysis using four different MR methods for robust validation: the GSMR method (Zhu et al., 2018) in the GCTA tool, and the inverse-variance weighting (IVW), weighted median, and MR-Egger regression methods implemented in the "TwoSampleMR" R package. A consistent effect across the four methods is less likely to be a false positive. In the MR analysis, we selected independent SNPs as instrumental variables, setting a linkage disequilibrium threshold of r2 < 0.2 in a 500-kb window. We explored multiple settings for instrument strength, with p < 10−3, p < 10−4, and p < 10−5, respectively. We used the MR-Egger intercept implemented in MR-Egger regression to test for the presence of directional pleiotropy.
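Of the four MR methods, IVW is the simplest to write down. The sketch below shows the textbook fixed-effect IVW estimator computed from per-SNP summary statistics; it is an illustration only, and the TwoSampleMR and GSMR implementations may differ in details such as how the standard error is scaled. The function name is hypothetical.

```python
import math

def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate.

    beta_exposure: SNP effects on the exposure (e.g., longevity)
    beta_outcome:  effects of the same SNPs on the outcome (e.g., a disease)
    se_outcome:    standard errors of the outcome effects
    Returns (causal_estimate, standard_error, two_sided_p).
    """
    triples = list(zip(beta_exposure, beta_outcome, se_outcome))
    numerator = sum(bx * by / se ** 2 for bx, by, se in triples)
    denominator = sum(bx ** 2 / se ** 2 for bx, by, se in triples)
    estimate = numerator / denominator
    std_err = math.sqrt(1.0 / denominator)
    p_value = math.erfc(abs(estimate / std_err) / math.sqrt(2))
    return estimate, std_err, p_value
```

A negative estimate with longevity as the exposure corresponds to the direction reported in this study, where genetically predicted longevity is linked to lower disease risk. Running the function a second time with exposure and outcome swapped gives the reverse direction of the bidirectional analysis.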

| Survival analysis
By the last interview, 3,040 of the 12,664 individuals had been reported by their families to have died. To study the relationship between genotypes and life span, we used age and live/dead status as phenotypes, and we used a multivariable Cox proportional hazards regression model to perform association analysis. The model was implemented with the "coxph" function from the survival package in R 3.5.1. In the model, individuals were coded as dead (status 1) or alive (status 0), and those missing at the follow-up interview were treated as censored. Survival was calculated according to the age and censoring parameters. Then, the Cox regression was performed using all genotypes and sex as independent variables and the survival status as the dependent variable. The survival curves were plotted using the "survfit" function in R 3.5.1.
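To make the status coding and censoring concrete, the sketch below computes a Kaplan-Meier survival curve, a standard non-parametric estimate of the kind usually plotted per genotype group. It is not the Cox model itself and not the R code used in the study; the function name is hypothetical, age is used as the time scale, and left truncation at study entry is ignored for simplicity.

```python
def kaplan_meier(ages, died):
    """Kaplan-Meier survival estimate.

    ages: age at death, or age at last interview for censored individuals
    died: 1 if the individual died (event), 0 if censored
    Returns a list of (age, survival_probability) at each observed death age.
    """
    curve = []
    survival = 1.0
    event_ages = sorted({a for a, d in zip(ages, died) if d == 1})
    for t in event_ages:
        at_risk = sum(1 for a in ages if a >= t)   # still under observation at t
        deaths = sum(1 for a, d in zip(ages, died) if a == t and d == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Censored individuals contribute to the risk set up to their last known age but never trigger a drop in the curve, which is how the alive and lost-to-follow-up participants enter the analysis.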

| Lasso regression prediction
The prediction analysis was completely independent from the association analyses, and all the 23,769 SNPs incorporated with 19 disease statuses, five lifestyle measurements, and sex were entered into the least absolute shrinkage and selection operator (Lasso).
Lasso is a supervised machine learning method that can select a subset of SNPs to achieve the best prediction efficiency. The disease status, lifestyle measurements, and SNP genotypes were imputed separately. Disease status and lifestyles were imputed using the MICE package in R 3.5.1. All the SNPs remaining after QC (n = 23,769) were imputed internally without using any reference panel, because the reference panel-based imputation leverages linkage disequilibrium information (e.g., highly correlated SNPs will be imputed), which is redundant information in terms of prediction.
The missing genotypes were imputed using Beagle 5.0 (Browning et al., 2018). Together with sex, 23,824 predictors were entered into the Lasso regression (Tibshirani, 1996). The whole dataset was split into 80% and 20% subsets for model training and testing. Fivefold cross-validation was conducted for the training dataset. The training process for Lasso regression included feature selection and model fitting. Only one of the features with redundant information was selected for modeling, such as SNPs in high linkage disequilibrium. For longevity prediction, AUCs were calculated to evaluate prediction efficiency. For the life span prediction, the explained variance for life span was estimated using a linear model. The predictions were also performed in the male and female groups separately.
FIGURE 1 Study design and workflow. To investigate the longevity-associated genes/pathways, we first carefully selected and designed a customized SNP chip that captured 27,656 candidate variants mainly for longevity as well as disease, health indicator, and immunity. Then, we genotyped these SNPs in a large sample of 13,228 individuals. Next, we carried out the genetic association analyses using age-stratified phenotype ("cases" were defined as individuals surviving past 90 years of age and "controls" with age less than average life span of 75) as well as incorporating all individuals' age and the survival status as phenotype, respectively. Furthermore, we performed meta-analysis together with previous dataset (removing overlapped samples) to identify the longevity genes in gender-combined and gender-stratified groups, respectively. In addition, we evaluated polygenic prediction of diseases on longevity using polygenic risk score (PRS) analysis and inferred causal relationships between longevity and diseases using bidirectional Mendelian randomization (MR) method. Finally, we built the prediction model for longevity and life span using all existing factors.
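The two evaluation metrics used for the prediction models, AUC for the longevity classification and explained variance for life span, can be sketched without any external library. The function names are hypothetical, and the explained variance is computed here as the usual coefficient of determination, which is an assumption about how the linear-model estimate was obtained.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for longevity cases, 0 for controls
    scores: predicted scores from the model (higher = more likely a case)
    Equals the probability that a random case outscores a random control,
    counting ties as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def explained_variance(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Both metrics should be computed on the held-out 20% test subset, so that the reported performance is not inflated by the features selected during training.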

| Study subjects and design
This study was composed of two datasets including a total of 15,651 individuals. The first included 13,228 individuals with 27,656 longevity- and disease-related SNPs genotyped using our customized SNP chip (Tables S1-S3, see also Section 2). After implementing the standard quality control procedures, 12,664 samples and 23,769 out of the 27,656 SNPs remained for subsequent analysis (Figure S1A). Reference-based imputation expanded the dataset to 287,000 SNPs. The second dataset was the GWAS set that we previously published, including 4,477 samples (2,178 centenarians and 2,299 middle-aged controls, Figure S1B) and 5.6 M imputed SNPs (Zeng et al., 2016). The two datasets included 2,054 overlapping individuals, and the genotype concordances of SNPs for the same individual were measured for quality control. The discordant SNPs (genotype concordance <0.9) were removed, and the remaining SNPs were imputed for association analysis. The analysis flow is presented in Figure 1.
We first performed a case-control association analysis on 8,490 individuals (4,662 cases with age ≥90 and 3,828 controls with age <75, Figure S1A) from dataset 1. Twelve SNPs achieved significance after Bonferroni's correction (Table S4; p < 1.81 × 10−6 = 0.05/27,656). In dataset 2, the 1,922 individuals who overlapped with the 8,490 case/control samples of dataset 1 were removed. We then investigated the significance of these 12 SNPs in the independent dataset 2 with 2,555 individuals. Two SNPs were also nominally significant in the same direction (Table S4; p < 0.05). Therefore, we performed both sex-combined and sex-stratified meta-analyses of these two independent datasets including 11,045 individuals to further identify potential longevity genes and pathways. Meanwhile, we carried out survival analysis, polygenic risk score (PRS) prediction, and

| Novel longevity genes revealed by metaanalysis
As shown in Figure 1, we identified three loci that were significantly associated with longevity (p < 5 × 10−8) in the meta-analysis of the two datasets. Among these three identified longevity loci, one is the well-known locus located near the TOMM40/APOE region. The two newly identified genes/loci are BMPER and TMEM43 (Table 1; Figure 2A; Figure S2). The top signal, rs17169634 (p = 7.91 × 10−15), is located in the intronic region of the BMPER gene and has been reported to be associated with Alzheimer's disease (Nelson et al., 2014; Table S5). The second significant signal is the well-known longevity locus TOMM40/APOE (rs2075650; p = 6.17 × 10−10). The third top SNP, rs1043943 (p = 3.59 × 10−8), is located in the 3'-UTR of TMEM43 and is in strong linkage disequilibrium (LD) with rs2228001 (r2 = 0.95; p = 1.13 × 10−7), a missense mutation in the XPC gene. It has been reported that rs1043943 might regulate the expression of XPC, which is a nucleotide excision repair (NER) gene involved in DNA damage repair, and the deletion of XPC leads to the development of lung tumors in mice (Hollander et al., 2005). We investigated these three identified longevity-associated SNPs in the results of the two largest relevant meta-analyses (Timmers et al., 2019). For rs17169634 in BMPER, although Deelen's study showed nominal significance (p = 0.033), the effect of the minor allele G was in the opposite direction to our results. We further performed association analyses for diseases in dataset 1 and found that the BMPER locus was associated with arthritis (p = 3.76 × 10−6) and prostate cancer (p = 6.32 × 10−3), with the G allele of rs17169634 associated with a decreased risk of both arthritis and prostate cancer. The TOMM40/APOE locus was linked to dementia (p = 2.40 × 10−4) and arthritis (p = 3.01 × 10−3), and TMEM43 was associated with Parkinson's disease (p = 0.045).
TABLE 1 Three loci associated with longevity at genome-wide significance
Interestingly, our three longevity SNPs have also been linked to multiple metabolic traits in GWASs on the Japan BioBank dataset (Triendl, 2003; Figure 2B). Specifically, BMPER is associated with body mass index (BMI); TOMM40/APOE is associated with low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), total cholesterol (TC), total triglyceride (TG), and colorectal cancer (CRC), and TMEM43 is associated with aspartate aminotransferase (AST) and blood uric acid (UA).
Since we enriched SNPs located in the MHC region, we next im-
Note: Male-specific analysis was performed in 4,413 males (3,758 males in dataset 1 and 655 independent males in dataset 2). Female-specific analysis was performed in 6,632 females (4,732 females in dataset 1 and 1,900 independent females in dataset 2). EA, effect allele; NEA, not effect allele; the sex-differential p-values tested for difference between the male-specific and female-specific beta estimates using the t-statistic.

TABLE 3 Observational correlation and bidirectional MR analysis for longevity and diseases
To further explore the biological functions of these identified longevity-associated signals, we investigated 300 SNPs associated with longevity at a suggestive significance level (p < 10−5) using the FUMA tool (Watanabe et al., 2017). Interestingly, we found that these longevity-associated genes presented significantly up-regulated expression patterns in multiple specific cerebral regions, including the brain substantia nigra and brain amygdala (Figure S3). Through cross-referencing with the GWAS catalog, we found that longevity-associated SNPs showed significant enrichment related to the cerebrospinal total tau (T-tau) levels, cerebrospinal phosphorylated tau (P-tau 181p) levels, cerebral amyloid deposition (PET imaging), and various age-related diseases, such as Alzheimer's disease, age-related macular degeneration, type 2 diabetes, and ischemic stroke (Figure S4). Based on KEGG analysis, four enriched pathways included type 2 diabetes mellitus, Alzheimer's disease, and the two previously identified calcium and MAPK signaling pathways (Figure S5). Taken the anti-correlation between these age-related diseases and longevity.

| Sex-specific genes associated with longevity
To examine sex differences in the genetics of longevity, we performed a sex-stratified genetic association analysis. We compared the three genome-wide significant SNPs between sexes and found that only the APOE/TOMM40 SNP showed a sex difference (Table 2; p_male = 0.008; p_female = 5.12 × 10−9; p_sex difference = 0.049). The other two SNPs were not significantly different between males and females (p < 0.05 in both males and females, and p_difference > 0.05). Notably, we identified two male-specific associations (Table 2; Figure 3A; p < 5 × 10−8 in males and p_sex difference < 0.05) and four female-specific associations (p < 5 × 10−8 in females and p_sex difference < 0.05) linked to longevity. The male-specific longevity locus, rs2308910 at HLA-DPA1, is a missense mutation at amino acid position 59, where Glu is converted to Asp. HLA-DPA1, an HLA class II gene, plays a central role in the immune system by presenting peptides derived from extracellular proteins. The other male-specific longevity SNP, rs16981095, regulates the expression of the gene TPM4 in whole blood (Westra et al., 2013); the pathways related to TPM4 include dilated cardiomyopathy and cardiac muscle contraction.
We next investigated sex-specific pathways based on sex-specific loci. The best-fit p cutoffs of 0.0178 and 0.0004 according to PRSice software were used to select SNPs for pathway analyses in males and females, respectively. The selected longevity-associated SNPs were significantly enriched in 9 pathways for males (false discovery rate, FDR <0.05; Table S9). These pathways were mainly enriched in DNA replication and mismatch repair-related pathways, including mismatch repair, nucleotide excision repair, and base excision repair, and in pathways related to amino acid metabolism, including ATP-binding cassette transporters (ABC transporters), the beta-alanine metabolism pathway, and the arginine and proline metabolism pathway. In females, 10 pathways were enriched and clustered into cancer-related pathways, including the glioma pathway, melanoma pathway, chronic myeloid leukemia pathway, JAK-STAT signaling pathway, and B-cell receptor signaling, and pathways related to the metabolism of terpenoids, such as terpenoid backbone biosynthesis, porphyrin and chlorophyll metabolism, valine, leucine and isoleucine degradation, and glyoxylate and dicarboxylate metabolism (Table S9).

| Replication of previously identified loci for human longevity
For validation purposes, we investigated 1,881 SNPs previously reported in multiple longevity GWAS (SNP sources listed in Table S2). After quality control, 1,305 of the 1,881 SNPs were available for investigation. Nine SNPs in three genes were well replicated in this study and showed statistical significance after multiple testing correction (Table S10; p < 3.8 × 10−5 = 0.05/1,305).
Notably, we replicated 11 out of the 22 previously reported sex-specific SNPs but not the rare SNPs (Table S12). These 11 associations were nominally significant in one sex (p < 0.05) but not significant in the other sex (p > 0.05), with a sex-differential p < 0.05. SNP rs4972778 at KIAA1715 showed the most significant difference between sexes (p_male = 0.49; p_female = 6.62 × 10−6; p_sex difference = 5.13 × 10−4).

| Observational, PRS, and MR analyses identify the correlations of diseases with longevity
On the basis of detailed questionnaire information including sex, age, diseases, cognition, and lifestyle factors, we systematically analyzed the effects of these factors on longevity. First, the observational correlations of longevity with diseases were investigated (Table S13). Interestingly, we found that the most influential factors related to an increased probability of longevity were being female, exhibiting a lower education level, exhibiting a lower career status, not smoking, not drinking, and an absence of diseases such as hypertension, type 2 diabetes mellitus (T2D), cardiovascular disease (CVD), dyslipidemia, gastroenteric ulcer, arthritis or cholelithiasis.
At the organ and tissue aging levels, we found that long-lived individuals are more likely to suffer from cataracts, glaucoma, and dementia diseases, and to exhibit lower activities of daily living (ADL) and Mini-Mental State Examination (MMSE) scores. In addition, we found that T2D and CVD were significantly inversely correlated with longevity in females.
Based on a total of 3,966 disease markers from the GWAS catalog genotyped in this study, we constructed the polygenic risk scores (PRS) of each participant for 87 disease traits, and we calculated their correlations with longevity. Seven nominally significant correlations were identified, but none of them passed the multiple testing adjusted threshold (Table S14; p < 0.0006).
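A polygenic risk score is commonly computed as a weighted sum of risk-allele counts, with weights taken from published effect sizes. The sketch below illustrates that general idea only; the SNP identifiers, weights, and the handling of missing genotypes are assumptions and do not reproduce the scoring procedure of this study. For reference, a Bonferroni threshold for 87 traits is 0.05/87 ≈ 0.00057, in line with the cutoff quoted above.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages for one participant.

    dosages: dict mapping SNP id -> risk-allele count (0, 1, 2) or None if missing
    weights: dict mapping SNP id -> effect size (e.g. log odds ratio) for one trait
    SNPs with missing genotypes are skipped (an assumption of this sketch).
    """
    score = 0.0
    for snp, weight in weights.items():
        dosage = dosages.get(snp)
        if dosage is None:
            continue
        score += weight * dosage
    return score


# Hypothetical SNP ids and effect sizes for a single disease trait.
trait_weights = {"snp_a": 0.10, "snp_b": 0.25, "snp_c": -0.05}
participant = {"snp_a": 2, "snp_b": 1, "snp_c": None}
prs = polygenic_risk_score(participant, trait_weights)
print(round(prs, 3))  # 0.45
```

Scores computed this way for each participant and trait could then be correlated with longevity status, one trait at a time.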
Moreover, longevity showed direct observational correlations with T2D (p observation = 2.54 × 10−12) and CVD (p observation = 1.58 × 10−7 ; individuals showed a causal effect on a decreased risk of arthritis (p GCTA-GSMR = 8.22 × 10−5), stroke (p GCTA-GSMR = 2.57 × 10−3), and hypertension (p GCTA-GSMR = 3.29 × 10−3), presenting no horizontal pleiotropy. Conversely, we also investigated the potential effect of disease on longevity, and we observed that T2D and dementia negatively affected longevity when taking p < 10−3 as the instrumental cutoff, while stroke and cataract negatively affected longevity when taking p < 10−4 as the instrumental cutoff (Table S15). Taken together, our series of analyses, including observational correlation, PRS, and bidirectional MR analysis, revealed that long-lived individuals tend to exhibit a lower genetic risk of T2D, CVD, and arthritis, and in turn that the absence or delayed onset of diseases such as T2D, dementia, and stroke leads individuals to live longer and to have higher odds of reaching longevity (Table 3; Table S15).
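The MR results above were obtained with GCTA-GSMR, which involves steps that are not reproduced here. To illustrate the underlying logic of summary-data MR only, the sketch below implements the simpler fixed-effect inverse-variance-weighted (IVW) estimator, which combines per-SNP effects on the exposure and the outcome into a single causal estimate. All input values are hypothetical, and the sketch assumes independent instruments with no pleiotropy.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate.

    Inputs are per-SNP summary statistics for independent instruments:
    effect on the exposure, effect on the outcome, and the standard error
    of the outcome effect. Returns (estimate, standard error, two-sided p).
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    p_value = math.erfc(abs(estimate / std_error) / math.sqrt(2))
    return estimate, std_error, p_value


# Hypothetical instruments: exposure = longevity, outcome = a disease.
bx = [0.10, 0.20, 0.30]
by = [-0.05, -0.10, -0.15]
se = [0.01, 0.02, 0.01]
estimate, std_error, p_value = ivw_estimate(bx, by, se)
print(f"estimate = {estimate:.2f}, se = {std_error:.4f}")
```

A negative estimate in this setting would correspond to a genetically predicted increase in longevity being associated with a lower disease risk. Swapping exposure and outcome gives the reverse direction of a bidirectional analysis.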

| Determinants of life span identified by survival analysis
During the follow-ups, approximately 24% (3,040/12,664) of the individuals died (with an age of death recorded). We performed survival analyses to identify genetic variants associated with life span using the Cox regression model (Table S16). The results showed that rs1267601 in BRAF was the variant most associated with life span, and carriers of the CC genotype had significantly higher survival rates at age 100+ compared with CT/TT carriers (28.6% for CC, 16.9% for CT, and 11.8% for TT; hazard ratio (HR) for survival 1.35; p = 8.33 × 10 −15 ; Figure 4A). In addition, we found that the Alzheimer's disease-related SNP rs17169634 in BMPER was the lon-

of the minor allele G exhibiting a substantially longer life span than noncarriers (28.3% for GG, 16.8% for GA, and 11.3% for AA; survival HR = 1.25; p = 1.45 × 10−10; Figure 4B). Apart from the BRAF and BMPER loci, we did not observe any other loci that reached genome-wide significance. The APOE/TOMM40 locus was correlated with life span at nominal significance (p = 0.0013).
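The genotype-specific survival rates reported in this section come from the Cox regression analysis and the curves in Figure 4. As a simplified illustration of how a survival probability at a given age can be estimated when some participants are still alive at last follow-up, the sketch below implements the nonparametric Kaplan-Meier estimator. It is not the Cox model used in the study, it ignores the age at which participants entered the study, and the data are hypothetical.

```python
def kaplan_meier(ages, died):
    """Kaplan-Meier survival curve.

    ages: age at death, or age at last follow-up for participants still alive
    died: True if the age is an observed death, False if censored
    Returns a list of (age, survival probability) at each age with a death.
    """
    records = sorted(zip(ages, died))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        age = records[i][0]
        deaths = 0
        leaving = 0
        # Group all records sharing the same age.
        while i < len(records) and records[i][0] == age:
            deaths += 1 if records[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((age, survival))
        at_risk -= leaving
    return curve


# Hypothetical carriers of one genotype: three deaths, two censored.
curve = kaplan_meier([90, 95, 95, 100, 102], [True, True, False, True, False])
for age, prob in curve:
    print(age, round(prob, 2))
```

Running the estimator separately for each genotype group would give curves comparable in spirit to Figure 4, although without the covariate adjustment that a Cox model provides.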

| Predictions of longevity and life span
One of the ultimate objectives of identifying factors contributing to longevity is to predict longevity and life span. Based on all the factors we identified from both previous studies and our own association studies, we constructed a predictive model for longevity (age ≥90 vs. age <75) and life span through Lasso regression (Tibshirani, 1996). The prediction was independent of the association study; all the SNPs that we designed on the customized SNP chip (n = 23,800 after quality control) as well as 19 disease phenotypes and five lifestyle factors were entered into the construction of the prediction model. We constructed three models using (1) Figure 5A). We further investigated the significance of the SNPs selected by Lasso regression for the prediction in our genetic association study. We found that those SNPs that effectively contributed to the prediction were significantly enriched for low p-values (Figure S6).
For the prediction of life span, we used 3,023 individuals who had an exact age of death and detailed phenotypic records in our datasets. All three predictive models yielded good performance prediction models are listed in Tables S17 and S18.
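Lasso regression selects predictors by shrinking coefficients toward zero and setting uninformative ones exactly to zero. The sketch below shows this behavior for a linear outcome such as life span, using cyclic coordinate descent with soft-thresholding. It is a generic illustration and not the software or settings used in this study; it assumes that predictors and outcome have been centered, so no intercept is fitted, and the toy data are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + penalty*||b||_1 by cyclic coordinate descent.

    X: list of rows (centered predictors), y: list of centered outcomes.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_iter):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue
            # Correlation of predictor j with the partial residual.
            rho = sum(v * (r + v * coef[j]) for v, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty) / scale
            delta = updated - coef[j]
            if delta:
                residual = [r - v * delta for v, r in zip(column, residual)]
                coef[j] = updated
    return coef


# Toy data: the outcome depends on the first predictor only.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2, 2, -2, -2]
coefficients = lasso_coordinate_descent(X, y, penalty=0.5)
print(coefficients)  # [1.5, 0.0]
```

In the toy example the informative predictor keeps a shrunken coefficient while the uninformative one is dropped, which is the property that makes the method suitable for selecting a subset from a large number of candidate SNPs. For a binary outcome such as longevity versus non-longevity, the same penalty is typically applied to a logistic model.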

| DISCUSSION
In this study, we present several findings regarding the genetic contributions to longevity and their gender differences based on 15,651 individuals from the cohort of the Chinese Longitudinal Healthy Longevity Survey. We designed an informative SNP chip for studying the genetics of longevity. Longevity case-control analysis (n = 11,045) and survival analysis (n = 12,664) were performed in different subsets of the cohort. In addition to previously published longevity-related studies, we included SNPs for relevant diseases, such as CVD and T2D, intending to obtain results that were comparable to those of previous studies. The main findings and several highlights of this work are described below.
First, we identified two novel loci (BMPER and TMEM43/XPC) and replicated three loci (TOMM40/APOE, FGD3, and AKT1) associated with longevity in Chinese populations. Interestingly, these five longevity-associated loci have been linked to diseases, especially age-related diseases. For example, BMPER has been associated with aging and its related diseases, such as Alzheimer's disease (Nelson et al., 2014), and it is also involved in the regulation of the proinflammatory phenotype of the endothelium (Helbing et al., 2011), functioning primarily in the vascular (Lockyer et al., 2017) and respiratory systems (Helbing et al., 2013). XPC is involved in DNA damage repair and is associated with diseases characterized by an extreme sensitivity to ultraviolet rays from sunlight, such as xeroderma pigmentosum, complementation group C and xeroderma pigmentosum, variant type, and the deletion of XPC leads to lung tumors in mice (Hollander et al., 2005). The TOMM40/APOE locus has been reported to be associated with longevity in multiple studies among diverse populations, and the locus contributes to Alzheimer's disease (Seshadri et al., 2010), age-related macular degeneration (Cipriani et al., 2012), cardiovascular disease (Middelberg et al., 2011), cognitive decline (Davies et al., 2014), immunity (Reiner et al., 2008), and lipid metabolism/dyslipidemia (Aulchenko et al., 2009). FGD3, a putative regulator of cell morphology and motility, was associated with longevity in the NECS study, and its expression plays a prognostic role in breast cancer (Renda et al., 2019). AKT1 is relevant to longevity (Deelen et al., 2013; Nojima et al., 2013), and the dysregulation of AKT signaling leads to diseases for which there are major unmet medical needs, such as cancer, diabetes, and cardiovascular and neurological diseases (Hers et al., 2011).
The two novel signals were also linked to multiple age-related phenotypes not only in this cohort but also in GWASs from the Japan BioBank, whose cohort is ethnically closer to the Chinese population. The BMPER locus was associated with arthritis, prostate cancer, and BMI. The TMEM43 locus was associated with Parkinson's disease, AST, and UA. These findings consistently revealed the genetic overlap between exceptional longevity and age-related diseases and traits (Fortney et al., 2015). Notably, five SNPs (rs10757274, rs4977574, rs2891168, rs10965235, and rs944797) located in the well-known CDKN2B locus, which is associated with CVD, were also associated with longevity at nominal significance in our study. Future fine-mapping with denser markers or genome sequencing will be required to illuminate the hidden information. SNP rs17169634 was not significantly associated with longevity in our dataset 2; therefore, the direction of effect in dataset 2 could not be determined, because the confidence interval of the effect size included zero. The significant signal was driven by dataset 1, where we also tested its association with complex diseases. The G allele of rs17169634 in BMPER was associated with a reduced risk of arthritis (p = 3.76 × 10−6) and prostate cancer (p = 6.32 × 10−3). Taken together, our data showed that the G allele of rs17169634 in BMPER increased the probability of longevity in our logistic regression, was associated with longer life expectancy in the survival analysis, and reduced the risks of age-related diseases. Notably, a causal effect of an increased chance of longevity on a reduced risk of arthritis was also identified in our MR analysis. These directions of effect are as expected, given that long-lived individuals show a delay in overall morbidity through beneficial effects on diseases (Andersen et al., 2012).
As for our previously reported SNPs in IL6 and ANKRD20A9P, they did not pass quality control in our current analysis and therefore could not be replicated. We further checked the frequency of the reported SNPs in these two genes. The minor allele of rs2069837 in IL6 has a lower frequency (0.075) than in the dbSNP Asian population (0.179), and this allele has a reducing effect on longevity. Since the proportion of centenarians is much higher in our previous study (48%; Zeng et al., 2016) than in other GWAS of longevity, the underrepresentation of this allele in our dataset is plausible. The inconsistency of the results could be caused by the differences in the proportions of centenarians between our two datasets and also among different ethnic populations. As for SNP rs2440012 in ANKRD20A9P, the minor allele G was overrepresented in our previous study (0.076) compared with Asians in dbSNP (G = 0). It was filtered out in the meta-analysis by Deelen et al., possibly because of multi-allelic problems in the European population (C = 0.90, A = 0.0015, G = 0.098). Additional independent datasets are needed for a detailed look into these loci.  (Austad & Fischer, 2016; Candore et al., 2006; Ostan et al., 2016; Yuan et al., 2020); however, very few studies have reported sex-differential genetic effects on longevity. TOMM40/APOE is a well-characterized longevity locus that can be split into four LD blocks. We found that two of these four LD blocks were associated with longevity in females but not in males (p difference < 0.05), in line with our previous study (Zeng et al., 2018).
This may indicate that the sex-specific genetic associations with longevity are caused by differences during meiosis between males and females. Differences in recombination rates between the sexes have been reported in both humans and animals (Li & Merila, 2010; Tapper et al., 2005). Since recombination interacts closely with natural selection (Schumer et al., 2018), differences in recombination could plausibly lead to sex or population stratification, leaving a small group of people enriched for evolutionarily beneficial alleles. Therefore, strand-specific, long-segment sequencing technologies or family studies will be needed in future studies to examine the LD structure of long-lived individuals in detail.
Interestingly, the predictive effectiveness of SNPs for longevity is slightly better in females (AUC = 0.732) than in males (AUC = 0.707).
For life span predictions, SNPs could explain 7% of the variance for life span in females (p = 1.25 × 10 −6 ) but failed to provide a significant prediction for life span in males (p = 0.10). All these results are consistent with our previous finding (Zeng et al., 2018) that the genetic association with longevity is stronger in females than in males.
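The two performance measures compared here can be stated compactly. The AUC is the probability that a randomly chosen long-lived individual receives a higher predicted score than a randomly chosen control, and the proportion of variance explained is taken below as the coefficient of determination. The sketch is illustrative, uses hypothetical inputs, and does not assume that the study computed variance explained in exactly this way.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    cases = [s for label, s in zip(labels, scores) if label == 1]
    controls = [s for label, s in zip(labels, scores) if label == 0]
    if not cases or not controls:
        raise ValueError("both classes are required")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def variance_explained(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical predictions: 1 = long-lived, 0 = control.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)  # 0.75
```

Computing both measures separately in males and females, as done in the study, allows the predictive value of the same SNP set to be compared between the sexes.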
Notably, we found that some diseases also presented sex-differential patterns associated with longevity. For example, T2D and CVD were more significantly correlated with longevity in females. Previous studies have reported sex differences between cardiovascular diseases and aging, in which it is assumed that genetic traits and sex hormones play the key roles (Rodgers et al., 2019).
Our PRS and MR analyses revealed negative correlations between longevity and multiple diseases, including CVD, T2D, and arthritis. The results were generally consistent with those of a meta-analysis of the European population.
However, other studies reached different conclusions. One publication based on the Leiden Longevity Study (LLS) suggested that disease risk alleles do not compromise human longevity (Beekman et al., 2010). The authors considered only 30 disease risk SNPs, while our analyses included more carefully selected SNPs for age-related diseases (Erikson et al., 2016), and the obtained polygenic risk scores reflected an overall significant decrease in genetic disease risk in exceptionally long-lived individuals. Taken together, these findings suggest that some disease risk SNP alleles might increase the chance of longevity, but that more disease risk SNP alleles are associated with earlier mortality (Erikson et al., 2016; Joshi et al., 2017). The benefit of polygenic risk scores is that they sum the effects of multiple alleles instead of counting each risk allele separately. Therefore, when the additive effects of all risk alleles are aggregated, the genetic risks of multiple diseases were found to be reduced in long-lived populations.
There is growing interest in predicting the risks for diseases and complex traits using polygenic risk scores (Khera et al., 2018).
Previous studies have predicted longevity and life span based mainly on animal models (Huang et al., 2004; Shen et al., 2014; Swindell et al., 2008) or on a single biomarker (Ho et al., 2019; Whittemore et al., 2019). One recent study using the UK Biobank A limitation of the present study is the candidate-gene approach, which might preclude the discovery of new possible causative genes or biological pathways. However, our selection of candidates was primarily based on our previous genome-wide association studies conducted in 4,477 Chinese individuals from the Chinese Longitudinal Healthy Longevity Survey (CLHLS). The SNPs with p-values smaller than 0.015 in our previous GWAS were all selected for inclusion on our customized chip. We collected additional candidate SNPs from existing studies, including studies not only on longevity genetics but also on other age-related complex diseases and traits. Moreover, by leveraging imputation technology, the candidate SNP sets were further expanded. By incorporating all these SNPs together, we performed multi-candidate-gene association analyses, which are suboptimal compared with genome-wide association analyses but still very informative.
Secondly, in order to identify sex-specific genetic markers associated with longevity, we stratified our sample into male and female groups. The benefit of stratifying the sample is an increased chance of finding sex-specific SNPs that tag different causal variants in the two sexes. However, this strategy also has the drawback that the reduced sample size of each analysis group decreases power. We replicated only half of the previously identified sex-specific loci, and more replication studies are required in the future. Thirdly, we noted that the predicted life span was generally shorter than the true life span, indicating that unidentified factors (genetic and other confounding factors or their interactions) contribute to life span. Future genetic studies of longevity based on affordable exome and whole-genome sequencing might help to identify a larger number of longevity-associated genetic variants through the analysis of rare genetic and copy number variants. Together, these findings provide a benchmark for the development of longevity- and life span-predictive models.
Further studies are warranted to improve the models through the identification of an additional panel of predictive variables and the development of innovative computational approaches.
In summary, our results not only identified novel longevity genes but also depicted the landscape of genetic contributors to longevity and life span through a complex of sex-differential and disease-related interactive circuits, which could be more precisely predicted in the near future.

ACKNOWLEDGMENTS
We thank for the support from the National Key Research

CONFLICT OF INTEREST
We declare no competing interests. All authors read and approved the final version of the manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study have been deposited