Jonathan L. Haines, Center for Human Genetics Research, Vanderbilt University Medical Center, 519 Light Hall, Nashville, TN 37232-0700. Tel: 615-343-5851; Fax: 615-343-8619; E-mail: firstname.lastname@example.org
To identify novel late-onset Alzheimer disease (LOAD) risk genes, we have analysed Amish populations of Ohio and Indiana. We performed genome-wide SNP linkage and association studies on 798 individuals (109 with LOAD). We tested association using the Modified Quasi-Likelihood Score test and also performed two-point and multipoint linkage analyses. We found that LOAD was significantly associated with APOE (P= 9.0 × 10−6) in all our ascertainment regions except for the Adams County, Indiana, community (P= 0.55). Genome-wide, the most strongly associated SNP was rs12361953 (P= 7.92 × 10−7). A very strong, genome-wide significant multipoint peak [recessive heterogeneity multipoint LOD (HLOD) = 6.14, dominant HLOD = 6.05] was detected on 2p12. Three additional loci with multipoint HLOD scores >3 were detected on 3q26, 9q31 and 18p11. Where the linkage and association results converged, the most significantly associated SNP under the 2p12 peak was rs2974151 (P= 1.29 × 10−4). This SNP is located in CTNNA2, which encodes catenin alpha 2, a neuronal-specific catenin known to function in the developing brain. These results identify CTNNA2 as a novel candidate LOAD gene, and implicate three other regions of the genome as novel LOAD loci. These results underscore the utility of using family-based linkage and association analyses in isolated populations to identify novel loci for traits with complex genetic architecture.
Late-onset Alzheimer disease (LOAD) is a neurodegenerative disorder causing the majority of dementia cases in the elderly. A complex combination of genetic and environmental components likely determines susceptibility to LOAD (Bertram et al., 2010). The APOE E4 allele is a well-established genetic risk factor for LOAD. Additional risk genes have been difficult to detect and replicate until recent successes using large consortia-derived genome-wide association study (GWAS) datasets, which have added CR1, CLU, PICALM, BIN1, EPHA1, MS4A, CD33, CD2AP and ABCA7 to the list of confirmed LOAD susceptibility genes, each with modest effect (Harold et al., 2009; Lambert et al., 2009; Seshadri et al., 2010; Hollingworth et al., 2011; Naj et al., 2011).
Despite these recent successes the majority of the genetic risk for LOAD remains unknown. The remaining genetic risk may in part lie in additional loci with small effects at the population level, making most datasets underpowered. The use of a genetically isolated founder population, such as the Amish, represents an alternative to the use of large population based consortia-derived datasets in the search for genetic risk factors. In the case of a founder population, the number of disease variants is hypothesised to be fewer, thereby decreasing heterogeneity and increasing power.
We have taken this approach to discover at least one novel LOAD risk gene by studying the Amish communities of Holmes County, Ohio, and Adams, Elkhart and LaGrange Counties, Indiana (Hahs et al., 2006; McCauley et al., 2006). These communities are collectively part of a genetically isolated founder population originating from two waves of immigration of Swiss Anabaptists into the United States in the 1700s and 1800s. The first wave of immigration brought the Anabaptists to Pennsylvania. In the early 1800s some of these immigrants moved to Holmes County, Ohio (Beachy, 2011), while a second wave of immigration from Europe established more Amish communities in Ohio (including Wayne County but not Holmes County) and Indiana (including Adams County) (Hostetler, 1993). Starting in 1841, the Elkhart and LaGrange Counties Amish community was founded by Amish families primarily from Somerset County, Pennsylvania, and from Holmes and Wayne Counties, Ohio, who were seeking new farmland to settle (Amish Heritage Committee, 2009). The Amish marry within their faith, limiting the amount of genetic variation introduced to the population. Not only are the Amish more genetically homogeneous, but because of their strict lifestyle, environmental exposures are also more homogeneous. The Amish have large families and a well-preserved comprehensive family history that can be queried via the Anabaptist Genealogy Database (AGDB) (Agarwala et al., 1999; Agarwala et al., 2003), making the Amish a valuable resource for genetic studies.
Our current study undertook a genome-wide approach, in a population isolate, using complementary linkage and association analyses to further elucidate the complex genetic architecture of LOAD. We utilised linkage analysis to look for sharing of genomic regions among affected individuals, while also using association analysis to look for differences in allele frequencies between affecteds and unaffecteds. We previously performed a genome-wide linkage study using microsatellites genotyped in only a small subset of the individuals included in this study (Hahs et al., 2006). Here we use a much larger dataset with a much denser panel of markers using a genome-wide single nucleotide polymorphism (SNP) chip. The results indicate that several novel regions likely harbour LOAD genes in the Amish, underscoring the genetic heterogeneity of this phenotype.
Materials and Methods
Methods for ascertainment were reviewed and approved by the individual Institutional Review Boards of the respective institutions. Participants were identified from published community directories, referral from other community members or due to close relationship with other participants, as previously described (Edwards et al., 2011). Informed consent was obtained from participants recruited from the Amish communities in Elkhart, LaGrange and surrounding Indiana counties, and Holmes and surrounding Ohio counties with which we have had established working relationships for over 10 years.
For individuals who agreed to participate, demographic, family and environmental information was collected, informed consent was obtained, and both a functional assessment and the Modified-Mini-Mental State Exam (3MS) were administered (Teng & Chui, 1987; Tschanz et al., 2002). Those scoring ≥87 on the 3MS were considered cognitively normal and were considered unaffected in our study. Those scoring <87 were re-examined with further tests from the CERAD neuropsychological battery (Morris et al., 1989). Depression was also evaluated using the geriatric depression scale (GDS). Diagnoses for possible and probable AD were made according to the NINCDS–ADRDA criteria (McKhann et al., 1984). A yearly consensus case conference was held to confirm all diagnoses.
SNPs for APOE were genotyped for 823 individuals (127 with LOAD). To identify the six APOE genotypes determined by the APOE*E2, *E3 and *E4 alleles, two SNPs were assayed using the TaqMan method [Applied Biosystems Inc. (ABI), Foster City, CA, USA]. SNP-specific primers and probes were designed by ABI (TaqMan genotyping assays) and assays were performed according to the manufacturer's instructions in 5 μl total volumes in 384-well plates. The polymorphisms distinguish the *E2 allele from the *E3 and *E4 alleles at amino acid position 158 (NCBI rs7412) and the *E4 allele from the *E2 and *E3 alleles at amino acid position 112 (NCBI rs429358).
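The mapping from the two assayed SNPs to the six APOE genotypes can be sketched as follows. This is a minimal illustration, not the authors' calling pipeline: the function name `call_apoe` is hypothetical, and the nucleotide coding is an assumption based on the commonly used forward-strand convention (rs429358 C marks *E4; rs7412 T marks *E2), which the text above does not state. Because the genotypes are unphased, a double heterozygote is reported as E2/E4 on the assumption that the very rare haplotype carrying both marker alleles is absent.

```python
def call_apoe(rs429358: str, rs7412: str) -> str:
    """Return an unphased APOE genotype (e.g. 'E3/E4') from two SNP genotypes.

    Assumed allele coding (not given in the source text):
      rs429358: T = Cys112 (*E2/*E3), C = Arg112 (*E4)
      rs7412:   C = Arg158 (*E3/*E4), T = Cys158 (*E2)
    """
    a, b = rs429358.upper(), rs7412.upper()
    for genotype in (a, b):
        if len(genotype) != 2 or set(genotype) - {"C", "T"}:
            raise ValueError(f"unexpected genotype: {genotype!r}")

    n_e4 = a.count("C")        # each C at rs429358 marks one *E4 allele
    n_e2 = b.count("T")        # each T at rs7412 marks one *E2 allele
    n_e3 = 2 - n_e4 - n_e2     # whatever remains is *E3

    if n_e3 < 0:
        # Would require a haplotype carrying both marker alleles at once.
        raise ValueError("genotype combination not explained by E2/E3/E4")

    return "/".join(["E2"] * n_e2 + ["E3"] * n_e3 + ["E4"] * n_e4)


if __name__ == "__main__":
    for pair in [("TT", "CC"), ("CT", "CC"), ("CT", "CT")]:
        print(pair, call_apoe(*pair))
```

Counting marker alleles rather than enumerating genotype pairs keeps the sketch short and makes the double-heterozygote assumption explicit in one place.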
Genome-wide genotyping was performed on 830 DNA samples using the Affymetrix 6.0 GeneChip® Human Mapping 1 million array set (Affymetrix®, Inc., Santa Clara, CA, USA). DNA for this project was allocated by the respective DNA banks at both the Hussman Institute of Human Genomics (HIHG) at the University of Miami and the Center for Human Genetics Research (CHGR) at Vanderbilt University. Genomic DNA was quantitated via the ND-8000 spectrophotometer and DNA quality was evaluated via gel electrophoresis. The genomic DNA samples (250 ng/5 μl) were processed according to standard Affymetrix procedures for the Affymetrix 6.0 GeneChip assay. The arrays were then scanned using the GeneChip Scanner 3000 7G operated by the Affymetrix® GeneChip® Command Console® (AGCC) software. The data were processed for genotype calling with the Affymetrix® Power Tools (APT) software using the Birdseed calling algorithm version 2.0 (Affymetrix®, Inc., Santa Clara, CA, USA) (Korn et al., 2008).
We applied a number of quality control (QC) procedures to both samples and SNPs to ensure the accuracy of our genotype data prior to linkage and association analyses. Specific sample QC included: (1) Each individual DNA sample was examined via agarose to ensure that the sample was of high quality prior to inclusion on the array; (2) CEPH samples were placed across multiple arrays to ensure reproducibility of results across the arrays; (3) Samples with call rates <95% were re-examined individually to ensure quality of genotypes and (4) Ultimately if the sample call rate remained below 95% after further evaluation, attempts were made to rerun the array with a new DNA sample. If the sample still failed, it was dropped. Nine samples were dropped due to low genotyping efficiency. Three samples were excluded because they did not connect into a pedigree with the rest of the samples, and therefore, relationships of those individuals could not be accounted for. Sixteen samples with questionable gender based on X chromosome heterozygosity rates were eliminated. Three samples appearing to be aberrantly connected in the pedigree based on the genotype data were also excluded.
Specific SNP QC included: (1) Dropping 76,816 SNPs with call rates <98%. (2) Dropping 206,970 SNPs with minor allele frequencies (MAF) ≤0.05. We additionally excluded 7849 SNPs with a MAF less than 0.05 after adjusting for pedigree relationships using MQLS (see later). Due to the relatedness in this dataset we did not check SNPs for Hardy-Weinberg equilibrium. Following this extensive QC, 798 samples (109 with LOAD, see Table 1) and 614,963 SNPs were analysed. Because APOE genotyping and QC were done separately from genome-wide genotyping and QC, the sample sizes are different and the datasets are mostly, but not completely, overlapping. All 798 samples belong to one 4998-member pedigree with many consanguineous loops. The AGDB provided the pedigree information using an “all common paths” database query with all genotyped individuals (Agarwala et al., 2003).
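The two per-SNP filters described above (call rate and minor allele frequency) can be sketched as below. This is only an illustration under stated assumptions: genotypes are assumed to be coded as minor-allele counts 0/1/2 with `None` for a missing call, the function names are hypothetical, and the MAF here is the naive sample estimate. It does not reproduce the pedigree-adjusted MAF that the study obtained from MQLS.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 copies of one allele; None = no call


def snp_call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype at this SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def naive_maf(genotypes: Sequence[Genotype]) -> float:
    """Unadjusted minor allele frequency (ignores relatedness)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def passes_snp_qc(genotypes: Sequence[Genotype],
                  min_call_rate: float = 0.98,
                  maf_cutoff: float = 0.05) -> bool:
    """Keep SNPs with call rate >= 98% and MAF strictly above 0.05."""
    return (snp_call_rate(genotypes) >= min_call_rate
            and naive_maf(genotypes) > maf_cutoff)


if __name__ == "__main__":
    common = [0] * 80 + [1] * 20   # MAF 0.10, fully called
    rare = [0] * 90 + [1] * 10     # MAF 0.05, dropped by the <= 0.05 rule
    print(passes_snp_qc(common), passes_snp_qc(rare))
```

The thresholds are taken from the text; everything else in the block is illustrative.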
Table 1. Genome-wide dataset. Ages of exam and onset averages and standard deviations were calculated for the 798 samples—Late-onset AD (LOAD) samples, cognitively normal (unaffected) samples and unclear or unknown samples—which passed QC for genome-wide genotyping.
Average age of exam (standard deviation)
Average age of onset (standard deviation)
Unclear or unknown
We used the Modified Quasi-Likelihood Score (MQLS) test (software version 1.2) to correct for pedigree relationships (Thornton & McPeek, 2007). MQLS is analogous to a χ² test, the most common approach for case-control data analysis with a binary trait, but MQLS incorporates kinship coefficients to correct for correlated genotypes of all the pedigree relationships. This test allows all samples to be included without dividing the pedigree. The MQLS test cannot be applied to X chromosome data, which were, therefore, eliminated from the association analysis. Because we previously found that Adams County has a lower APOE-E4 allele frequency than the general population (Pericak-Vance et al., 1996), we performed a stratified association analysis for APOE, analysing Adams County separately from the combined Elkhart, LaGrange and Holmes Counties. Using the same stratification, we also re-analysed our most significant SNPs from the GWAS analysis. To test the validity of the MQLS test in our pedigree, we performed simulation studies using this same pedigree structure to assess the type 1 error rate using MQLS for association. Type 1 error rates were not inflated (unpublished data).
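For orientation, the uncorrected allelic χ² test to which MQLS is compared above can be sketched as follows. This is explicitly not MQLS: it treats every allele as independent and so ignores the kinship correction that motivates MQLS in a pedigree such as this one. The function name and the allele counts in the usage example are hypothetical.

```python
import math


def allelic_chi_square(case_minor: int, case_major: int,
                       control_minor: int, control_major: int):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    Returns (statistic, p_value). Assumes independent alleles, which does
    NOT hold for related individuals.
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical counts: 20/100 minor alleles in cases, 10/100 in controls.
    stat, p = allelic_chi_square(20, 80, 10, 90)
    print(f"chi-square = {stat:.3f}, P = {p:.4f}")
```

In related samples this naive statistic would tend to overstate significance, which is the reason a kinship-aware test was used in the study.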
Because of the large size and substantial consanguinity of the pedigree, we used PedCut (Liu et al., 2008) to find an optimal set of subpedigrees including the maximal number of subjects of interest within a bit-size limit (24 in this study) conducive to linkage analysis. This procedure resulted in 34 subpedigrees for analysis with an average of seven genotyped individuals (three genotyped affected) per subpedigree. Parametric heterogeneity two-point LOD (HLOD) scores were computed assuming affecteds-only autosomal dominant and recessive models using Merlin (Abecasis et al., 2002). A disease allele frequency of 10% was used to approximate Alzheimer disease prevalence. For the dominant model, penetrances of 0 for no disease allele and 0.0001 for one or two copies of the disease allele were used; for the recessive model, penetrances of 0 for zero or one disease allele and 0.0001 for two disease alleles were used. Because the underlying genetic model is unknown, we tested both dominant and recessive models to maximise our ability to find a disease locus. SNPs on the X chromosome were analysed using MINX (Merlin in X). Regions showing evidence for linkage, that is, containing at least one two-point HLOD ≥ 3.0, were followed up with parametric multipoint linkage analysis (also using Merlin). For the multipoint analyses, SNPs were pruned for linkage disequilibrium (LD) in each region so that all pair-wise r² values were <0.16 between all SNPs (Boyles et al., 2005). LD estimates from the HapMap CEPH samples (parents only) were used for pruning. Because the HapMap CEPH samples may not be an exact representation of LD in our Amish population, we also tested pruning using the data from this Amish dataset, but linkage results did not change using this approach (data not shown). Because linkage analyses can be biased when breaking larger pedigrees into a series of smaller ones (Liu et al., 2006, 2007), we performed simulation studies assuming no linkage (i.e. the null distribution) and using the same large pedigree structure and the same pedigree splitting method. We determined empirical cut-offs for significance in our linkage studies to maintain a nominal type I error rate. We found that after 1000 replications, only 2.5% of the multipoint linkage scans generated a maximum HLOD > 3.0 (unpublished data).
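To make the heterogeneity LOD concrete, the sketch below computes an HLOD at a single map position from per-subpedigree LOD scores, using the standard admixture formulation HLOD = max over α of Σ log10(α·10^LOD_i + 1 − α), where α is the proportion of linked pedigrees. It is a simplified illustration rather than Merlin's implementation: the function name, the grid search over α and the example LOD values are all assumptions.

```python
import math
from typing import Sequence, Tuple


def hlod(family_lods: Sequence[float], steps: int = 1000) -> Tuple[float, float]:
    """Return (HLOD, alpha) maximised over a grid of alpha in [0, 1].

    family_lods holds one LOD score per subpedigree at the same position
    under the same genetic model.
    """
    best_score, best_alpha = 0.0, 0.0   # alpha = 0 always gives a score of 0
    for k in range(steps + 1):
        alpha = k / steps
        score = sum(math.log10(alpha * 10 ** lod + (1 - alpha))
                    for lod in family_lods)
        if score > best_score:
            best_score, best_alpha = score, alpha
    return best_score, best_alpha


if __name__ == "__main__":
    # Hypothetical subpedigree LODs: one linked family, two unlinked ones.
    score, alpha = hlod([2.0, -1.0, -1.0])
    print(f"HLOD = {score:.2f} at alpha = {alpha:.2f}")
```

When only some subpedigrees support linkage, the summed LOD can be zero or negative while the HLOD stays positive with α below 1, which is how alpha is read in the results that follow.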
All computations were done using either the Center for Human Genetics Research computational cluster or the Advanced Computing Center for Research and Education (ACCRE) cluster at Vanderbilt University.
We found that LOAD was significantly associated with APOE (MQLS P= 9.0 × 10−6) in our Amish population except for the Adams County, Indiana, community (MQLS P= 0.55). In Elkhart, LaGrange and Holmes Counties, the E4 allele frequency, adjusted for pedigree relationships, was 0.18 for affected individuals compared to 0.11 for unaffected individuals (Table 2). This compares to an E4 allele frequency of 0.38 in Caucasian AD individuals (0.14 for controls) (alzgene.org). We also saw a progressively younger average age of onset with each additional copy of the E4 allele (Table 3), consistent with other populations. We did not see evidence for linkage with APOE in our subpedigrees (dominant HLOD = 0.50, recessive HLOD = 0.29).
Table 2. MQLS-corrected APOE allele frequencies. APOE allele frequencies of Late-onset AD (LOAD) affected individuals versus cognitively normal individuals (unaffecteds) were calculated using MQLS to correct for pedigree relationships. Frequencies were calculated in the Adams County individuals separately from Elkhart, LaGrange and Holmes Counties.
APOE allele frequencies
Elkhart, LaGrange and Holmes Counties
Table 3. Average ages of onset and standard deviations by APOE genotype and number LOAD affected and unaffected by APOE genotype.
Average age of onset (standard deviation)
Number LOAD affected
Number cognitively normal
In the GWAS, the most significant MQLS P value (7.92 × 10−7), which did not surpass a Bonferroni-corrected genome-wide significance threshold of 8.13 × 10−8, was at rs12361953 on chromosome 11 in LUZP2 (leucine zipper protein 2) (Table 4, Fig. 1). The pedigree-adjusted minor allele frequency was 0.26 for affected individuals versus 0.15 for unaffected individuals. Fourteen additional SNPs had P values <1.0 × 10−5 (Table 4). According to our simulation analyses, we have >80% power to detect a P≤ 0.005 under an additive model with an odds ratio of 2.0 (data not shown). After stratifying, each of the fifteen top SNPs had a more significant P value in the non-Adams County dataset. Although some of the SNPs have very different minor allele frequencies in the two strata, the less significant P values for the Adams County dataset can be explained mostly by the lack of power in that stratum (9 LOAD affected). All SNPs showed the same direction of effect in the two strata except for rs472926, rs12361953 and rs472926 (Table S1). These association results did not fall within a megabase of any of the other nine previously verified LOAD genes (CR1, CLU, PICALM, BIN1, EPHA1, MS4A, CD33, CD2AP and ABCA7). However, four SNPs (rs10792820, rs11234505, rs10501608 and rs7131120) in PICALM generated nominally significant P values (P < 0.05). Rs11234505 is only ∼3.0 kb from rs561655, the most significant SNP published by Naj et al. (2011) and rs10501608 is only ∼10.5 kb from rs541458 the most significant SNP published by Harold et al. (2009) and Lambert et al. (2009). We also have a nominally significant SNP, rs6591625, in the MS4A10 gene. The SNP is ∼0.5 Mb from rs4938933, the most significant SNP published by Naj et al. (2011).
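The genome-wide threshold quoted above follows from a Bonferroni correction over the 614,963 SNPs that passed QC. The short check below reproduces that arithmetic; the nominal significance level of 0.05 is an assumption (it is the value that yields the stated threshold), and the function name is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    n_snps = 614_963   # SNPs analysed after QC (from the text)
    top_p = 7.92e-7    # most significant MQLS P value (rs12361953)

    threshold = bonferroni_threshold(0.05, n_snps)
    print(f"threshold = {threshold:.2e}")
    print("genome-wide significant:", top_p < threshold)
```

The top association signal is roughly tenfold short of this threshold, consistent with the statement that it did not reach genome-wide significance.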
Table 4. Top genome-wide association results calculated using MQLS. Minor allele frequencies (MAF) are MQLS-corrected for pedigree relationships. A gene is only listed if the SNP falls within the specified gene. Megabase pair (Mbp) positions are based on NCBI Build 36.
MQLS P value
1.22 × 10–06
8.44 × 10–06
9.02 × 10–06
4.97 × 10–06
1.82 × 10–06
8.67 × 10–06
1.49 × 10–06
1.06 × 10–06
1.94 × 10–06
7.92 × 10–07
3.28 × 10–06
7.00 × 10–06
5.64 × 10–06
7.88 × 10–06
9.31 × 10–06
In the genome-wide analysis, forty-five regions, among all chromosomes except 17, 21 and X, had at least one two-point HLOD ≥ 3.0 (data not shown). Multipoint linkage analysis for these regions resulted in four regions, one each on chromosomes 2, 3, 9 and 18 with a multipoint peak HLOD > 3 (Table 5, Fig. 2). The highest peak occurred on chromosome 2 with a recessive peak HLOD of 6.14 (90.91 Mbp) and a dominant peak HLOD of 6.05 (81.03 Mbp). The most significant association results within the recessive and dominant ±1-LOD-unit support interval were rs1258411 (P= 5.29 × 10−2) and rs2974151 (P= 1.29 × 10−4), respectively. Rs1258411 is not located in a gene, but rs2974151 is located in an intron of CTNNA2 (catenin, alpha 2). In addition to rs2974151, 10 other SNPs in this gene had P < 0.05 (data not shown). While this is less than 5% of the analysed SNPs in CTNNA2, it still warrants attention.
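The ±1-LOD-unit support interval used to delimit each linkage region can be sketched as below. This is a grid-based illustration under stated assumptions: it takes HLOD scores evaluated at ordered map positions, returns the contiguous stretch around the peak where the score stays within one unit of the maximum, and does no interpolation between grid points. The function name and the example scores are hypothetical and are not the study's data.

```python
from typing import Sequence, Tuple


def one_lod_support_interval(positions: Sequence[float],
                             lods: Sequence[float],
                             drop: float = 1.0) -> Tuple[float, float, float]:
    """Return (start, peak, end) positions of the 1-LOD-drop interval."""
    if len(positions) != len(lods) or not lods:
        raise ValueError("positions and lods must be non-empty and equal length")

    peak = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak] - drop

    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1           # extend left while still within `drop` of the peak

    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1          # extend right in the same way

    return positions[left], positions[peak], positions[right]


if __name__ == "__main__":
    # Hypothetical multipoint scan (positions in Mbp).
    mbp = [80.0, 81.0, 82.0, 83.0, 84.0]
    scores = [4.0, 5.5, 6.05, 5.2, 3.0]
    print(one_lod_support_interval(mbp, scores))
```

Association results falling between the returned start and end positions are the ones that would be reported as lying under the linkage peak.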
Table 5. Most significant multipoint linkage results. Parametric dominant (Dom) and recessive (Rec) multipoint maximum heterogeneity (HLOD) scores were calculated using Merlin. Regions are determined by ±1-LOD-unit support intervals.
Dom peak HLOD
Dom peak alpha
Lowest MQLS P value in region
Rec peak HLOD
Rec peak alpha
Lowest MQLS P value in region
The next highest multipoint result was on chromosome 3 with a dominant HLOD of 5.27 and a recessive HLOD of 3.53. The peak for both models is at 168.43 Mbp, and the most significant association result in the ±1-LOD-unit support interval was at rs9812366 (P= 4.00 × 10−2), which is intergenic. The linkage peak on chromosome 9 reached an HLOD of 4.44 (107.76 Mbp) under the dominant model and 3.77 (101.7 Mbp) under the recessive model. This peak overlaps with the suggestive linkage peak found in the joint linkage analysis published by Hamshere et al. (2007); however, this region has not been consistently replicated in other studies. For both models the most significant association result in the ±1-LOD-unit support interval was at rs9969729 (P= 1.94 × 10−6), which is intergenic. On chromosome 18 the dominant and recessive results both peaked at 8.77 Mbp with HLOD = 3.97 for the dominant model and HLOD = 4.43 for the recessive model. The most significant association result in this ±1-LOD-unit support interval was at rs632912 (P= 8.80 × 10−4), which is intergenic. None of these regions overlap the linkage peaks found in our previous genome-wide microsatellite linkage study, which used only a subset of the individuals in the current dataset (Hahs et al., 2006). As with our association results, these multipoint peaks did not encompass the previously known LOAD genes.
APOE was clearly associated with dementia in our population; however, it did not explain the majority of affected individuals. In the Adams County communities, there were only 8/74 individuals who carried at least one APOE-E4 allele. In the remaining Amish communities, the APOE-E4 allele was more common, but still less common than in the general population. In addition, the majority of affected individuals (81/127, 64% for all counties; 45/115, 39% for non-Adams counties) did not carry an APOE-E4 allele. The specific deficit of the APOE-E4 allele in Adams County as well as differences in allele frequencies for some of the top GWAS SNPs indicates at least some level of locus heterogeneity underlying LOAD in the Amish population.
Additional support for locus heterogeneity arises from the linkage results. Examination of the subpedigree-specific LOD scores for the four significant loci indicates that 13 of the 34 subpedigrees generate no LOD scores >0.50 for any of the loci, and 14/21 (67%) of the remaining subpedigrees generate LOD scores >0.50 for only one of the four loci. In addition, the vast majority of the remaining SNPs across the genome generated HLOD scores with alpha values (proportion of linked pedigrees) <1.0. Finally, the suggestion of locus heterogeneity is consistent with the societal differences across church districts, which can further restrict marriages even within the Amish.
Because of the relatedness of individuals in our dataset we could take advantage of both linkage and association approaches to identify potential LOAD loci. In our examination, we found that our most significant association results did not fall under any of the four linkage peaks. However, under the linkage peaks we did see some evidence of association. Within our most significant region of linkage lies CTNNA2, which also had suggestive evidence for association. In addition to the result at rs2974151 (P= 1.29 × 10−4), multiple SNPs in CTNNA2 had P <0.05, decreasing the likelihood of a false positive association for this gene. However, because of the relatedness in our dataset it was difficult to get an accurate measurement of LD structure to determine if the SNPs in this region were more highly correlated due to a founder effect.
CTNNA2 encodes the catenin alpha 2 protein, which is a neuronal-specific catenin. Catenins are cadherin-associated proteins and are thought to link cadherins to the cytoskeleton to regulate cell–cell adhesion. Catenin alpha 2 can form complexes with other catenins such as beta-catenin, which interacts with presenilin. Mutations in presenilin lead to destabilisation of beta-catenin, which potentiates neuronal apoptosis (Zhang et al., 1998). Catenin alpha 2 is also thought to regulate morphological plasticity of synapses and cerebellar and hippocampal lamination during development in mice (Park et al., 2002). It also functions in the control of startle modulation in mice (Park et al., 2002).
It was not completely unexpected to see some discordance between the linkage and association results, as was demonstrated in our APOE results where we saw evidence for association but not for linkage. Because we needed to divide the pedigree to facilitate linkage analysis and because we used an affecteds-only analysis, only a subset of the individuals analysed in association analysis were analysed in linkage analysis. The breaking of the pedigree likely reduces the observed genomic sharing between relatives as the tracking of the natural flow of alleles was somewhat disrupted, as we saw when we tested APOE for linkage. Also, the very nature of association analysis versus linkage analysis will provide some different results. Linkage analysis locates shared genomic regions between affected individuals in the same pedigree by testing for cosegregation of a chromosomal segment from a common ancestor. Association using MQLS tests for differences in allele frequencies between affected and unaffected individuals while correcting for the pedigree relationships. Association analysis is more powerful in detecting protective effects as well as smaller effects in the population compared to affecteds-only linkage analysis but is underpowered when sample sizes are small and genetic heterogeneity is present. Conversely, linkage analysis is more suitable for finding large effects in a small number of related individuals and is more robust to genetic heterogeneity.
Our results confirmed the complex genetic architecture of LOAD even in this more homogeneous set of individuals. Multiple genes appeared to be significantly contributing to LOAD risk in the Amish. We replicated the effect of APOE, replicated the evidence for linkage on 9q22, and also found modest evidence for association of both PICALM and MS4A in this population. Most importantly, this unique population allowed us to find additional candidate loci, particularly in the CTNNA2 region in which we saw strong evidence for both linkage and association. The role of CTNNA2 in the brain also makes this gene a promising candidate. The CTNNA2 region, in addition to other potential risk regions, needs to be more closely examined to identify the underlying responsible variants and their functional consequences.
We thank the family participants and community members for graciously agreeing to participate, making this research possible. This study is supported by the National Institutes of Health grants AG019085 (to JLH and MAP-V) and AG019726 (to WKS). Some of the samples used in this study were collected while WKS, JRG and MAP-V were faculty members at Duke University. The authors would like to thank Gene Jackson of Scott & White for his effort and support on this project. Additional work was performed using the Vanderbilt Center for Human Genetics Research Core facilities: the Genetic Studies Ascertainment Core, the DNA Resources Core and the Computational Genomics Core. The authors have no conflict of interest to declare.