Targeted sequencing of the 9p21.3 region reveals association with reduced disease risks in Ashkenazi Jewish centenarians

Abstract Genome‐wide association studies (GWAS) have pinpointed the chromosomal locus 9p21.3 as a genetic hotspot for various age‐related disorders. Common genetic variants in this locus are linked to multiple traits, including coronary artery diseases, cancers, and diabetes. Centenarians are known for their reduced risk and delayed onset of these conditions. To investigate whether this evasion of disease risks involves diminished genetic risks in the 9p21.3 locus, we sequenced this region in an Ashkenazi Jewish centenarian cohort (centenarians: n = 450, healthy controls: n = 500). Risk alleles associated with cancers, glaucoma, CAD, and T2D showed a significant depletion in centenarians. Furthermore, the risk and non‐risk genotypes are linked to two distinct low‐frequency variant profiles, enriched in controls and centenarians, respectively. Our findings provide evidence that the extreme longevity cohort is associated with collectively lower risks of multiple age‐related diseases in the 9p21.3 locus.


Subjects
The Ashkenazi Centenarian cohort has been described in our previous studies (Shlush et al. 2008;Ryu et al. 2016).Briefly, 450 centenarians (mean age: 98) and 500 controls (mean age: 73) with no family record of longevity were involved in the study.Genomic DNA was extracted from the participants' whole blood samples of participants and subjected to whole genome amplification using GenomiPhi V2 DNA Amplification kits (GE Healthcare Life Sciences) prior to capture genome sequencing.

Pooled Target Capture Sequencing
The pooled capture-seq technique was performed as described previously (Ryu et al. 2018).In short, genomic DNA samples were divided into groups of 25 and pooled equimolarly.This resulted in 18 pools of centenarians and 20 pools of controls.Using one microgram of pooled DNA, the sequencing library was prepared following Illumina's Truseq library protocol.The 9p21.3 region (chr9:21950000-22180000, hg19) was specifically captured using the Nimblegen SeqCap EZ Choice library target capture kit according to the manufacturer's instructions.The capture and library preparation procedures were performed in a single batch experiment for all 38 pools.Sequencing was performed on the Illumina HiSeq2500 platform with an average depth of 30x per individual.Sequencing results were aligned to the human genome hg19 using BWA.After pre-processing with PICARD tools (http://broadinstitute.github.io/picard),multisample variant calling was achieved using CRISP (Bansal 2010).The variants were annotated to dbSNP149 and refGene for functional annotation/prediction.ANNOVAR was used for functional annotation (Wang et al. 2010).

Genotyping
Genomic regions with variants subject to genotyping were PCR amplified from genomic DNA using FastStart Taq DNA polymerase (Roche).Genotyping was performed on the Sequenom MassARRAY platform following the manufacturer's instructions.Genotype calls were performed using TYPER 4.0, the coupled analysis software for the assay, with default call thresholds.

Statistical Analysis
Allele frequencies for each variant in the centenarian and control groups were calculated, and statistical inference of association was determined using Fisher's exact tests.For rare variant analysis, the Sequence Kernel Association Test (SKAT) (Wu et al. 2011) and its derivative methods (SKAT-O, SKAT-C) were applied to the entire sequenced interval, as well as subdivisions based on gene locations.As previously described (Ryu et al. 2021), analysis was performed by using minor allele frequency (MAF) of variants in each pool instead of allele counts in each individual.Reported p-values were adjusted for multiple testing using the Benjamini-Hochberg procedure.For permutation tests for combined depletion of risk variants, identification tags for pools were permuted 10,000 times.For each permuted dataset, the Z-scores of the five GWAS SNPs shown in Figure 1B were calculated for the centenarian group.The number of incidences in which the sum of Z-scores was equal to or less than that of the actual centenarian group were counted and then divided by the total number of trials to calculate p-values.
The Python Sklearn package was utilized for the feature selection study.Before initiating the learning process, the allele frequency matrix was pre-processed by 1) adjusting the allele count to the minor allele (for cases where the major allele was the alternative), 2) removing singletons and doubletons, and 3) normalizing to Zscore.Random Forest and Gradient Tree Boosting classifiers were implemented using the RandomForestClassifier (n_estimator=10000) and GradientBoostingClassifier (n_estimator=5000, learning_rate=0.02) in the ensemble module, respectively.Principal Component Analysis (PCA) was performed using the PCA function in the decomposition module.A linear Support Vector Classifier (SVC) was built to separate control and centenarian pools using the first 3 PC values with the SVC function in the svm module.The significance of separation was tested by permuting the pool's identification tags 10,000 times and counting instances where classification accuracy was equal to or higher than the actual data.The python seaborn.heatmapfunction was used to generate a correlation matrix heatmap.A threshold of 0.05 was applied for PC2 and PC3 to select the important feature SNPs.

Data Usage
The Linkage Disequilibrium (LD) information for the European population (CEU) was obtained from the 1000 Genomes Project.The LD matrices for selected variants in Figures S2 and S4 were plotted using the online LDlink tool (Machiela and Chanock 2015).The same-batch exon sequencing results for the Ashkenazi Jewish cohort were obtained from the authors of a previous study (Ryu et al. 2021).

Figure S1
Genotyping validation of variant allele frequency called from pooled capture sequencing for (A) centenarians and (B) controls.See Table S1 for list of assayed variants.Table S1: List of variants for validation analysis by genotyping.
Table S2 List of variants validated by genotyping (attached xlsx separately) Table S3: Adjusted p value from SKAT analysis of sequenced region and subdivisions.

Figure
Figure S2 (A) Distribution of variants with nominally significant (p < 0.05) difference in MAFs between centenarians and controls in 9p21.3 region.(B) Pearson correlation (R2) for the allele frequency of significant in all 38 sequencing pools.(C) Linkage disequilibrium map of these variants in European (CEU) population.

Figure S3
Figure S3Feature importance for all variants in (A) random forest classifier and (B) stochastic gradient boosting classifier for centenarian association.

Figure S4
Figure S4 LD matrix for the five representative GWAS SNPs in all populations from LDLink database.Numbers in the squares indicate R squared correlation.

Figure S5
Figure S5 Comparison of the result between 2,216 9p21.3 variants in this study and 34,954 variants on 360 gene exons that were sequenced together in the same batch.(A) Cluster map of the sample pools by Pearson correlation.(B) Permutation test of Euclidian distance separation between control and centenarian groups by the PC2 and PC3 from the PCA analysis of the two datasets.(C) Boxplot showing the distribution of PC1 in control and centenarian groups.Significance of the difference was calculated by Mann-Whitney U test.

Figure S6
Figure S6 2D plots denoting feature weight (y xias) for all SNPs in the first 3 principle components versus their (A) minor allele frequency and (B) position.

Figure S7
Figure S7 Volcano plots for (A) important feature SNPs in the principle components 2 and 3 and (B) CDKN2A-downstream SNPs.