A Genome-wide Association Study of Autism Reveals a Common Novel Risk Locus at 5p14.1


*Corresponding authors: Margaret A. Pericak-Vance, Ph.D., Dr. John T. Macdonald Foundation Professor of Human Genomics, Director, Miami Institute for Human Genomics, 1120 NW 14th Street, CRB-819 (M860), Miami, Florida 33136. Tel: (305) 243-2308; Fax: (305) 243-2396; E-mail: mpericak@med.miami.edu Jonathan L. Haines, Ph.D., T.H. Morgan Professor of Human Genetics, Professor, Molecular Physiology & Biophysics, Director, Center for Human Genetics Research, 519 Light Hall, Vanderbilt University Medical Center, Nashville, TN 37232-0700. Tel: (615) 343-5851; Fax: (615) 343-8619; E-mail: jonathan@chgr.mc.vanderbilt.edu


Although autism is one of the most heritable neuropsychiatric disorders, its underlying genetic architecture has largely eluded description. To comprehensively examine the hypothesis that common variation is important in autism, we performed a genome-wide association study (GWAS) using a discovery dataset of 438 autistic Caucasian families and the Illumina Human 1M beadchip. Ninety-six single nucleotide polymorphisms (SNPs) demonstrated strong association with autism risk (p-value < 0.0001). The validation of the top 96 SNPs was performed using an independent dataset of 487 Caucasian autism families genotyped on the 550K Illumina BeadChip. A novel region on chromosome 5p14.1 showed significance in both the discovery and validation datasets. Joint analysis of all SNPs in this region identified 8 SNPs with more significant p-values (3.24E-04 to 3.40E-06) than in either dataset alone. Our findings demonstrate that in addition to multiple rare variations, part of the complex genetic architecture of autism involves common variation.


Autism is a neurodevelopmental disorder characterized by impairments in social interaction and communication, and the presence of restricted and repetitive patterns of interest or behavior (Centers for Disease Control, 2008). It is among a spectrum of disorders (ASDs) with symptoms that may range from quite severe (autism) to relatively mild (Asperger syndrome). With improved surveillance and a broadening of the diagnostic criteria, the most recent prevalence studies suggest that ASDs may affect as many as 1 in 150 children in the U.S., making them among the most common neurodevelopmental disorders (NCBI, 2008). ASDs are most often diagnosed before age four, and are at least three to four times more frequent in males than females (NCBI, 2008).

Overwhelming evidence from twin and sibling studies demonstrates that autism is highly heritable (Steffenburg et al., 1989; Bolton et al., 1994; Bailey et al., 1995), but there is no consensus on the underlying genetic architecture. There are two alternative proposals, one involving numerous rare genetic mutations and the other involving fewer but more common genetic variations. Supporting the rare mutation hypothesis, mutations in several genes and rare structural DNA variations have been identified, although the pervasiveness of these effects remains controversial (Sebat et al., 2007; Weiss et al., 2008). Data supporting the effect of common variation have been more difficult to find. Several genome-wide linkage screens and focused candidate gene association studies have been performed in autism (International Molecular Genetic Study of Autism Consortium (IMGSAC), 2001; Shao et al., 2002; Szatmari et al., 2007), but the results have been disappointing and no universally accepted susceptibility polymorphism has yet emerged. Collectively these data have suggested that the common variant hypothesis may not be relevant to autism genetics.

A recent study by Arking et al. (2008) combining linkage and genome-wide association in 72 multiplex autism families identified a common variant in the CNTNAP2 gene that was associated with autism primarily in families where all affected individuals were male (male-only families). This association was also seen by Alarcón et al. (2008) and, as in Arking et al. (2008), the effect was primarily in male-only autism families. However, this association has not been widely replicated.

Materials and Methods

Ascertainment and Sample Description

We ascertained autism patients and their affected and unaffected family members as part of the Collaborative Autism Project (CAP) through four clinical groups at the Miami Institute for Human Genomics (MIHG, Miami, Florida), University of South Carolina (Columbia, South Carolina), W.S. Hall Psychiatric Institute (Columbia, South Carolina) and Vanderbilt Center for Human Genetics Research (Vanderbilt University, Nashville, Tennessee).

Participating families were enrolled through a multi-site study of autism genetics and recruited via support groups, advertisements, and clinical and educational settings. All participants and families were ascertained using a standard protocol. These protocols were approved by appropriate Institutional Review Boards. Written informed consent was obtained from parents and assent from minors was obtained whenever possible.

Core inclusion criteria were as follows: (1) chronological age between 3 and 21 years of age; (2) presumptive clinical diagnosis of autism; (3) expert clinical determination of autism diagnosis using DSM-IV criteria supported by the Autism Diagnostic Interview-Revised (ADI-R) in the majority of cases and all available clinical information. The ADI-R is a semi-structured diagnostic interview which provides a diagnostic algorithm for classification of autism (Autism Genetics Resource Exchange, 2008). All ADI-R interviews were conducted by formally trained interviewers who have achieved reliability according to established methods. Thirty-eight individuals were missing an ADI-R. For those cases we implemented a best estimate procedure to determine a final diagnosis using all available information from the research record and data from other assessment procedures. This information was reviewed by a clinical panel led by an experienced clinical psychologist and included two other psychologists and a pediatric medical geneticist—all of whom were experienced in autism. Following review of case material the panel discussed the case until a consensus diagnosis was obtained. Only those cases in which a consensus diagnosis of autism was reached were included; (4) minimal developmental level of 18 months as determined by the Vineland Adaptive Behavior Scale (VABS) (Sparrow et al., 1984) or the VABS-II (Sparrow et al., 2005) or IQ equivalent > 35. These minimal developmental levels assure that ADI-R results are valid and reduce the likelihood of including individuals with severe mental retardation only. We excluded participants with severe sensory problems (e.g., visual impairment or hearing loss), significant motor impairments (e.g., failure to sit by 12 months or walk by 24 months), or identified metabolic, genetic, or progressive neurological disorders.

A total of 487 Caucasian families (1537 individuals) were genotyped. This dataset consisted of 80 multiplex families (more than one affected individual) and 407 singleton (parent-child trio) families. In addition, GWAS data were obtained from the Autism Genetic Resource Exchange (AGRE) (Autism Genetics Resource Exchange, 2008) for use as a validation dataset. The full AGRE dataset is publicly available and contains families with the full spectrum of autism spectrum disorders. We selected only families with one or more individuals diagnosed with autism (using DSM-IV and ADI-R); affected individuals with a non-autism diagnosis within these families were excluded from the analysis. This resulted in a confirmation dataset of 680 multiplex families (3512 individuals) from the AGRE ‘SingleAllAgre’ beadstudio file (Autism Genetics Resource Exchange, 2008). Family and individual identifiers for all AGRE samples which passed our quality control are listed in Table S3.

Genotyping of the Discovery Dataset

Genomic DNA was purified from whole blood using Puregene chemistry on the Qiagen Autopure LS according to standard automated Qiagen protocols (Qiagen, Valencia, CA).

DNA samples were quantitated via the ND-8000 spectrophotometer and DNA quality was evaluated via gel electrophoresis on a 0.8% agarose gel. The concentration for all qualified samples was normalized to 50 ng/µl and samples were arrayed in Matrix 0.5 ml 2D barcoded tubes in racks of 96. Sample identity was confirmed by genotyping 8 SNPs using Taqman allelic discrimination assays (Applied Biosystems, Foster City, CA) and assessing for concordance with historical data.

Samples that passed the above exclusion criteria were genotyped using Illumina's Human 1M v1 Beadchip, containing 1,072,820 SNPs (of which 258,665 loci are in reported and new CNV regions). The samples were processed according to Illumina's procedures for the Infinium II® assay (Illumina Inc., San Diego, CA).

The above protocol was automated using the Tecan EVO-1 to further enhance the efficiency and consistency of the assay (Tecan Group Ltd., Männedorf, Switzerland). Samples were processed in batches of 48. The same Quality Control DNA sample was repeated during each run to ensure reproducibility of results between runs. Data were extracted by the Illumina® Beadstudio software from data files created by the Illumina BeadArray reader. Samples and markers with call rates below 95% were excluded from analysis and a GenCall cutoff score of 0.15 was used for all Infinium II® products.
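The 95% call-rate filter can be illustrated with a minimal Python sketch. It assumes genotypes are held as one list per sample with None marking a no-call; that layout, the function names, and the choice to filter samples before markers are illustrative assumptions, not the BeadStudio workflow itself.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls in a list (None = no-call)."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(genotypes, min_rate=0.95):
    """Return (kept_sample_indices, kept_snp_indices).

    genotypes: list of per-sample lists, one entry per SNP.
    Samples are filtered first and markers are then assessed on the
    retained samples; this ordering is an assumption of the sketch.
    """
    kept_samples = [i for i, row in enumerate(genotypes)
                    if call_rate(row) >= min_rate]
    n_snps = len(genotypes[0]) if genotypes else 0
    kept_snps = [
        j for j in range(n_snps)
        if call_rate([genotypes[i][j] for i in kept_samples]) >= min_rate
    ]
    return kept_samples, kept_snps
```

With real data the default threshold of 0.95 would apply; a lower value is only convenient for toy inputs.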

Sample Quality Control

After genotyping, samples were subject to a battery of quality control (QC) tests. We used the same protocol for both the discovery and validation datasets. Reported and genetic gender were compared using X-chromosome linked SNPs. Relatedness between samples, sample contamination, mis-identification and duplication were tested using genome-wide identity-by-descent (IBD) estimation; inconsistent samples were dropped from the analysis. The numbers of remaining samples are listed in Table S1.
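One common way to compare reported and genetic gender is to look at heterozygosity on X-linked SNPs, since males carry a single X and should show almost no heterozygous calls. The sketch below assumes genotypes coded as 0, 1 or 2 copies of one allele, pseudo-autosomal SNPs already removed, and two hypothetical cutoffs (male_max, female_min) chosen purely for illustration; the thresholds actually used in the study are not stated.

```python
def x_heterozygosity(x_genotypes):
    """Share of called X-linked genotypes that are heterozygous (coded 1)."""
    called = [g for g in x_genotypes if g is not None]
    if not called:
        return None
    return sum(g == 1 for g in called) / len(called)


def infer_sex(x_genotypes, male_max=0.02, female_min=0.10):
    """Classify a sample from X heterozygosity; cutoffs are assumptions."""
    h = x_heterozygosity(x_genotypes)
    if h is None:
        return "unknown"
    if h <= male_max:
        return "male"
    if h >= female_min:
        return "female"
    return "ambiguous"


def sex_mismatch(reported, x_genotypes):
    """True when the genetic call is decisive and contradicts the record."""
    genetic = infer_sex(x_genotypes)
    return genetic in ("male", "female") and genetic != reported
```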

As a next step we tested for Mendelian inconsistencies on all SNPs and samples. Mendelian errors (ME) can emerge from sample mis-identification, DNA contamination, copy-number variation (CNV), genotype calling errors and other reasons. The median of ME per family in both investigated cohorts was below 0.005%. More than 99% of the discovery families and 98% of the validation families had ME below 0.02%. We excluded families with ME >2% from the analysis. This threshold would still allow for small deletions and duplications that are common in the human genome.
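A Mendelian inconsistency check for a parent-child trio can be sketched as follows. The example assumes autosomal biallelic SNPs with genotypes coded as the number of copies of one allele (0, 1 or 2) and None for missing calls; the function names are illustrative and the code is not the PLINK implementation.

```python
def is_mendelian_error(father, mother, child):
    """True if the child's genotype cannot arise from the parents' genotypes."""
    if None in (father, mother, child):
        return False  # incomplete trios are not scored

    def transmissible(genotype):
        # Number of copies of the counted allele a parent can pass on
        return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

    possible = {a + b
                for a in transmissible(father)
                for b in transmissible(mother)}
    return child not in possible


def mendel_error_rate(trio_genotypes):
    """Fraction of fully genotyped SNPs showing a Mendelian error.

    trio_genotypes: iterable of (father, mother, child), one tuple per SNP.
    """
    complete = [t for t in trio_genotypes if None not in t]
    if not complete:
        return 0.0
    return sum(is_mendelian_error(*t) for t in complete) / len(complete)
```

A family whose rate exceeds the chosen threshold (2% in the text above) would then be dropped.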

SNP Quality Control

SNPs were subject to QC before analysis. We removed SNPs with minor allele frequencies below 5% because of restricted power in the discovery sample.

As expected, we observed a negative correlation between the proportion of ME per SNP and the p-value for Hardy-Weinberg equilibrium (HWE). To minimize genotyping errors we excluded SNPs with p-value < 10⁻⁶ for HWE and ME > 4%. Remaining erroneous genotypes were set as missing. PLINK software was used for the quality control steps described above (Purcell et al., 2007).
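The minor allele frequency and HWE filters can be illustrated from founder genotype counts. This is a sketch under stated assumptions: a one-degree-of-freedom chi-square goodness-of-fit test is swapped in here for whatever HWE test PLINK applied, the Mendelian error criterion is left out, and all function names are hypothetical.

```python
import math


def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if any(e == 0 for e in expected):
        return 1.0  # monomorphic marker: the test is not informative
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of a chi-square variable with 1 df
    return math.erfc(math.sqrt(chi2 / 2))


def passes_snp_qc(n_aa, n_ab, n_bb, min_maf=0.05, min_hwe_p=1e-6):
    """Apply the MAF and HWE thresholds quoted in the text."""
    return (maf(n_aa, n_ab, n_bb) >= min_maf
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)
```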

Illumina provides information on which 1M BeadChip SNPs were located within known common CNV regions. We compared the distribution of ME per family and per SNP. No significant differences between ME per SNP in the known CNV regions and the remaining markers were identified. The same quality criteria were used for both the discovery and the validation datasets. The summary of SNPs is presented in Table S2.

Population Stratification

Although population substructure does not cause type I error in family-based association tests, multiple founder effects could result in reduced power to detect an association in a heterogeneous disease such as autism. Thus we conducted EIGENSTRAT (Patterson et al., 2006) analysis on all parents from analyzed families for evidence of population substructure using the 491,664 SNPs genotyped in both the discovery and validation datasets. To ensure the most homogeneous groups for association screening and replication, we excluded all families with outliers, defined by EIGENSTRAT (Patterson et al., 2006) as lying beyond 4 standard deviations on principal components 1 and 2. After all QC steps, 1,390 samples from 438 autistic families remained in the final discovery dataset and 2,390 samples from 457 autistic families (Tables S1 and S3) in the validation dataset. The average genotyping rate in the remaining individuals was 99.8%.
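The family-level exclusion rule can be sketched once principal component scores are available for each parent. The example assumes the scores have already been computed elsewhere and are supplied as a dictionary; it performs a single pass against the sample mean and standard deviation, which is a simplification and not the EIGENSTRAT outlier procedure itself. All names are hypothetical.

```python
import statistics


def pc_outliers(pc_scores, n_sd=4.0):
    """Samples lying more than n_sd standard deviations out on PC1 or PC2.

    pc_scores: dict mapping sample id -> (pc1, pc2).
    """
    ids = list(pc_scores)
    outliers = set()
    for k in range(2):
        values = [pc_scores[i][k] for i in ids]
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        if sd == 0:
            continue  # no spread on this component
        outliers |= {i for i in ids
                     if abs(pc_scores[i][k] - mean) > n_sd * sd}
    return outliers


def families_to_exclude(pc_scores, family_of, n_sd=4.0):
    """Families with at least one outlying parent.

    family_of: dict mapping sample id -> family id.
    """
    return {family_of[i] for i in pc_outliers(pc_scores, n_sd)}
```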

Genotype Imputation

Since the validation dataset was genotyped on a different GWAS SNP panel with a smaller number of SNPs (558,183), the genotypes from our data and the data from the AGRE were imputed independently by the program IMPUTE (Marchini et al., 2007) using a phased CEU HapMap dataset as a reference (International HapMap Consortium et al., 2007). Individual genotypes with probability less than 0.90 were not included. All individuals were treated independently during imputation. Mendelian inconsistencies were zeroed out in PLINK (Purcell et al., 2007). The results for the imputation are found in Table 1. Results on imputed SNPs missing more than 10% of the genotypes are labeled in Table 1 and should be interpreted with caution because of possible bias.
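The two thresholds in this step (a genotype probability of at least 0.90, and a flag for SNPs missing more than 10% of genotypes) can be sketched as below. The input is assumed to be a triple of posterior probabilities per individual; this layout and the function names are illustrative and do not reproduce IMPUTE's file format.

```python
def call_imputed_genotype(probs, threshold=0.90):
    """Return the most probable genotype (0, 1 or 2) or None if uncertain.

    probs: (P(AA), P(AB), P(BB)) for one individual at one SNP.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def flag_low_coverage(prob_rows, threshold=0.90, max_missing=0.10):
    """True if more than max_missing of individuals lack a confident call."""
    calls = [call_imputed_genotype(p, threshold) for p in prob_rows]
    missing = sum(c is None for c in calls) / len(calls)
    return missing > max_missing
```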

Table 1.  Association results on top 96 SNPs.
Chromosome | SNP | Position | MAF | p-value HWE | Allele | p-value# discovery | p-value# validation | p-value joint#
  1. MAF: minor allele frequency in discovery dataset.

  2. p-value HWE: Hardy-Weinberg Equilibrium test p-value in discovery dataset.

  3. In italic and bold are the p-values for markers not genotyped on 550 K Illumina panel. Genotypes for these markers were imputed.

  4. “-”: data could not be imputed because no genotypes are available for the reference dataset.

  5. Allele: Minor Allele/Major Allele in discovery dataset.

  6. Shaded: Marker with improved p-value in validation dataset.

  7. #: The Pedigree Disequilibrium Test (PDT) was performed on all SNPs for association testing.

  8. *: the imputed markers missing more than 10% of genotypes.


Association Analysis

Association analysis was performed using the pedigree disequilibrium test (PDT) (Martin et al., 2000, 2001). This method provides valid and robust tests for allelic association across both trios and extended families. Only autosomal markers were tested for association. The estimation of odds ratios and 95% confidence interval calculations were performed using UNPHASED (Dudbridge, 2008). Power calculations for association analysis were performed using the Genetic Power Calculator (Purcell et al., 2008).
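The PDT generalizes the transmission disequilibrium test (TDT) to extended pedigrees. As a simplified stand-in, the sketch below implements only the basic trio TDT, not the PDT: it counts how often heterozygous parents transmit a given allele and compares the two counts with a McNemar-type chi-square statistic. Genotype coding (0, 1 or 2 copies of allele A) and function names are assumptions of the example.

```python
import math


def tdt_counts(trios):
    """Count transmissions of allele A from heterozygous parents.

    trios: iterable of (father, mother, child) genotypes; None = missing.
    Returns (b, c): times A was transmitted / not transmitted.
    """
    b = c = 0
    for father, mother, child in trios:
        if None in (father, mother, child):
            continue
        n_het = (father == 1) + (mother == 1)
        from_hom = (father == 2) + (mother == 2)  # A alleles from AA parents
        t = child - from_hom  # A alleles passed on by heterozygous parents
        if n_het == 0 or not 0 <= t <= n_het:
            continue  # uninformative trio or Mendelian inconsistency
        b += t
        c += n_het - t
    return b, c


def tdt(trios):
    """Return (chi-square statistic, p-value) with 1 degree of freedom."""
    b, c = tdt_counts(trios)
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

Because the statistic depends only on transmissions within families, it is not inflated by population stratification, which is the property the family-based design relies on.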

Linkage Disequilibrium

Linkage disequilibrium (LD) patterns and haplotype block delineation were determined by using Haploview 4.1 (Choi et al., 2001). Blocks were defined using the confidence interval method described by Gabriel et al. (2002). Pair-wise LD measures (r²) were calculated in the 3822 unrelated founders of the joint sample.
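The pair-wise r² measure can be written out from haplotype frequencies as r² = D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA·pB. The sketch below assumes phased two-SNP haplotypes are already available as pairs of 0/1 alleles; Haploview estimates haplotype frequencies from unphased genotypes, so this is an illustration of the statistic and not of that software.

```python
def ld_r2(haplotypes):
    """Squared correlation between alleles at two biallelic SNPs.

    haplotypes: list of (a, b) pairs, each allele coded 0 or 1.
    Returns None when either SNP is monomorphic (r2 is undefined).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom
```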


To more comprehensively test the common variant hypothesis, we performed an unbiased genome-wide association study of common variation using as a discovery dataset the Caucasian autistic families from the Collaborative Autism Project (CAP). We validated our findings using an independent publicly available family-based Genome-Wide Association Study (GWAS) dataset from the Autism Genetic Resource Exchange (AGRE) (Autism Genetics Resource Exchange, 2008). Quality-control (QC) procedures were applied to the more than 1,000,000 single nucleotide polymorphisms (SNPs) in the discovery dataset and 550,000 SNPs in the validation dataset.

After applying QC filters, 775,311 common autosomal SNPs remained in the discovery dataset with an average genotyping rate of 99.80% and 500,100 common autosomal SNPs remained in the validation dataset with an average genotyping rate of 99.82%. To account for possible population stratification, we excluded families if the values for the top two principal components for either of the probands’ parents were >4 standard deviations from the core Caucasian cluster generated in EIGENSTRAT (Patterson et al., 2006). The final datasets included 1390 samples from 438 autistic families in the discovery dataset and 2390 samples from 457 autistic families in the validation dataset. For any SNP of interest in the discovery dataset not directly genotyped in the validation dataset, imputation of genotypes was performed in the validation dataset using the program IMPUTE (Marchini et al., 2007). The Pedigree Disequilibrium Test (PDT) (Martin et al., 2000, 2001) was used for all association analyses. The distribution of p-values examined in the discovery dataset demonstrated a close match to that expected for a null distribution except at the extreme tail of low p-values (Fig. 1). This is expected if there is little residual error in the data and common variants of modest effect sizes are acting in autism. In the discovery dataset, none of the p-values met the stringent and overly conservative Bonferroni correction for genome-wide significance (Fig. 2).

Figure 1.

Quantile-Quantile (Q-Q) plot of PDT p-values for the discovery dataset. The Q-Q plot measures deviation from the expected distribution of p-values. The diagonal (red) line represents the expected (null) distribution. The slight deviation of the observed values above expected values at the tail of the distribution is consistent with modest genetic effects.
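The coordinates of such a Q-Q plot can be computed by pairing each sorted observed p-value with the quantile expected under a uniform null. The sketch below uses the plotting position (i + 0.5)/n, which is one common convention and an assumption here; it also assumes every p-value is strictly greater than zero.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    Points falling on the diagonal match the null distribution; points
    above it at the upper end indicate an excess of small p-values.
    """
    observed = sorted(p_values)
    n = len(observed)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]
```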

Figure 2.

Genome-wide plot of association p-values in the discovery dataset. The −log10(p-value) of each of the 775,311 tested SNPs in 438 families is plotted against genomic location. Ninety-six SNPs have p-values < 10⁻⁴ (horizontal red line) and 6 SNPs have p-values < 10⁻⁵ (blue horizontal line). Individual chromosomes are demarcated by different colors.

Examination of the 651 SNPs in the CNTNAP2 gene (Arking et al., 2008; Bakkaloglu et al., 2008) in our discovery dataset revealed only eight genotyped SNPs that were nominally significant (p-values = 0.002–0.04). The results did not significantly improve in male only families (data not shown). The tagging SNP, rs270102, reported by Alarcón et al. (2008), was not significant in either the overall or male only family dataset. SNP rs7794745 showing linkage in the Arking et al. (2008) study was not genotyped in our dataset. Association of imputed genotypes for this SNP was not significant (p = 0.62). None of the tested markers met gene-wide (CNTNAP2) significance after correction (data not shown).

Despite no genome-wide significant association, 96 SNPs showed strongly suggestive association with autism risk (Table 1, p < 0.0001) and met our initial criteria for follow-up. Among the 96 top hits, 2 SNPs residing in 5p14.1 had improved p-values in the joint analysis and also had nominally significant association signals in the validation dataset, encouraging us to look at this region in more detail. Therefore, we examined every SNP (n = 46) genotyped in this region (25,830 kb to 26,100 kb) in both datasets regardless of initial p-value. Analyses of these data revealed a cluster of 19 SNPs, including 8 imputed SNPs, showing nominally significant association (p < 0.05) in the validation dataset (data not shown). Eight SNPs on chromosome 5p14.1 (Table 2) showed improved association signals in the joint dataset. Risk was associated with the same allele for these eight SNPs in both datasets and the p-values became more significant (p-values: 3.24E-04 to 3.40E-06) in the joint analysis, with the most significant p-value coming from rs10038113, one of the top 96 hits. The odds ratios for the major alleles ranged from 0.75 to 1.32 (Table 2).

Table 2.  Association statistics for validated SNPs on chromosome 5p14.1.
SNP number | SNP | Position | Allele | MAF | p-value HWE | p-value discovery | OR-discovery | p-value validation | p-value joint | OR-joint
  1. MAF: minor allele frequency in discovery dataset.

  2. p-value HWE: Hardy-Weinberg Equilibrium test p-value in the discovery dataset.

  3. Allele: minor allele/major allele based on discovery dataset.

  4. OR: Odds Ratios for joint sample for major allele, minor allele used as a reference allele.

  5. Note: Eight SNPs in 5p14.1 had p-values < 0.05 in both the discovery and the validation datasets and generated improved p-values in the joint analysis.


To determine whether we might have missed a strong signal by using only the CAP dataset for discovery, we also reversed the roles of the discovery and validation datasets and applied the same two-stage approach. Twenty-one SNPs had p-values < 0.0001 in the AGRE dataset, but none of them could be replicated in the CAP dataset even at a nominal significance of p < 0.05.

We computed the power of the Transmission Disequilibrium Test (TDT) in 438 triad families, which approximates a lower bound for the power of the PDT in our discovery sample. Given a prevalence of autism of 0.0066 (Chakrabarti & Fombonne, 2005) and a SNP in LD (D′ = 1) with a risk allele frequency of 0.6, we expect 84% power to detect an association at p = 0.0001 under a recessive model (GRR_AA = 2, GRR_Aa = 1) and 33% under an additive model (GRR_AA = 2, GRR_Aa = 1.5). These are consistent with the allelic GRRs estimated for the chromosome 5 region. The power to detect a Bonferroni-corrected genome-wide significance (P = 0.05/775,311 SNPs = 6.4 × 10⁻⁸) drops to 30% and 2.5%, respectively, for recessive and additive models.
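The multiple-testing arithmetic quoted here is easy to reproduce. The short sketch below computes the Bonferroni threshold for 775,311 tests and, as an added illustration, the number of SNPs expected to fall below p = 0.0001 by chance alone; that second figure assumes independent tests with uniformly distributed null p-values, which real SNPs in LD only approximate.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def expected_null_hits(n_tests, p_cutoff):
    """Expected count of tests below p_cutoff if every null hypothesis holds."""
    return n_tests * p_cutoff


threshold = bonferroni_threshold(0.05, 775_311)   # about 6.4e-8
by_chance = expected_null_hits(775_311, 1e-4)     # about 78 SNPs
```

Under those assumptions roughly 78 SNPs would pass the 0.0001 screen by chance, which is why the follow-up leans on replication instead of the discovery p-values alone.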


We examined the linkage disequilibrium (LD) pattern among the eight replicated SNPs with improved p-values (Fig. 3) to gain a better understanding of the association. Seven of these SNPs form two tightly linked LD blocks. Given that none of these SNPs reside within known genes or known regulatory sequences, the clustering of association signals suggests that one or more nearby functional variants is responsible for the signal. A survey of the genomic landscape surrounding this region of association reveals several interesting avenues for further molecular investigation. There are numerous sequence segments exhibiting a high degree of evolutionary conservation, suggesting potential regulatory, but currently undetermined, functions. In addition, there are three known copy number variants (CNVs) in proximity to the most significant SNPs (Table 2). Preliminary investigation of these CNVs in the discovery dataset is not suggestive of a causal relationship with autism (data not shown). Exhaustive molecular analysis of the candidate region is ongoing. In addition, although the immediate 1 Mb vicinity of the association region contains no known genes, flanking the region are CDH9 and CDH10, two genes belonging to the cadherin family, a group of proteins containing members that are involved in calcium-dependent cell-cell junctions in the nervous system (Liu et al., 2006; Pokutta & Weis, 2007) and which are possible targets of regulatory action.

Figure 3.

Linkage Disequilibrium pattern among validated SNPs on chromosome 5p14.1. Linkage disequilibrium (LD) was measured as r² values, which range from 0 (no correlation) to 1 (complete correlation). LD was calculated between each pair of SNPs. Two blocks of strong LD were observed and span 3 kb (SNPs 2–4) and 28 kb (SNPs 5–8). SNP numbers correspond to the order in Table 2.

Our power calculation shows that stringent adjustments for multiple testing provide power only to detect loci with large effects given our sample size. Lowering the threshold for significance allows detection of loci with relatively small effects (such as the chromosome 5 locus), while also relying on replication to limit the false positives. We note that this region of 5p14.1 did not generate exceptional p-values in our initial GWAS, suggesting that a strong single gene association, such as those seen with the APOE gene in Alzheimer disease and the CFH gene in age-related macular degeneration (International Multiple Sclerosis Genetics Consortium et al., 2007), is highly unlikely in autism. The absence of a large effect is consistent with the results of previously published linkage studies (Ma et al., 2007; Allen-Brady et al., 2008). Only through the analysis of the validation dataset were we able to identify this replicated signal, highlighting the value of both a validation dataset and of joint analyses. Two additional datasets have shown association of autism at 5p14.1. These include a cohort of 1241 ASD cases and 6491 control subjects and a cohort of 108 ASD cases and 540 controls. The combined p-values for SNPs in the 5p14.1 region in these datasets combined with ours, which includes over 10,000 subjects, range from 7.4 × 10⁻⁸ to 2.1 × 10⁻¹⁰. These results survive stringent Bonferroni correction (Wang et al., 2009).

Our approach, which uses a validation set as an indication of a true association, has proven successful in other GWAS, as exemplified by the identification of IL7RA and IL2RA susceptibility alleles in multiple sclerosis (MS), where no SNPs in either gene met genome-wide significance in the discovery dataset but were confirmed through validation in an additional dataset (International Multiple Sclerosis Genetics Consortium et al., 2007). These MS findings have been confirmed recently across numerous datasets (International Multiple Sclerosis Genetics Consortium (IMSGC), 2008). We also note that other such common variants are likely to exist in autism and further GWAS are warranted.

Our identification and replication of common variation on chromosome 5p14.1 associated with autism is a promising development in the struggle to understand the genetics of autism. It also highlights the power of GWAS for detecting moderate genetic effects in neurobehavioral phenotypes. Our results, in combination with the multiple rare variants already identified, suggest that the genetic architecture of autism is as exquisitely complex as is its clinical phenotype.


We thank the patients with autism and their family members who participated in this study and personnel at the Miami Institute for Human Genomics (MIHG) including Sol Kissner from the MIHG Genetic Epidemiology and Statistical Genetics Core; Rachel Henson and Daniela Martinez from the MIHG Genotyping Core; staff at the MIHG Biorepository especially Sandra West; members of the MIHG and the Vanderbilt Center for Human Genetics Research autism ascertainment teams especially Laura Nations, Sandra Brinkley, Shannon Donnelly and Genea Crockett, Noelle Blackburn as well as Mary Margaret Welch for her expert editing and proofing of this paper. We would also like to thank Dr. Michael Schmidt for his contributions to the data analysis and Drs. Jeffery M Vance, and Stephan Zuchner for their helpful comments and advice. This research was supported by grants from the National Institutes of Health (NIH) (NS26630, NS36768 and MH080647) and by a gift from the Hussman Foundation. Data management and analysis were performed in part using the Computational Genomics Core of the Vanderbilt Center for Human Genetics Research. We also acknowledge the partial support of the Autism Genome Project (AGP) which is supported by Autism Speaks. We also wish to gratefully acknowledge the resources provided by the AGRE consortium and the participating Autism Genetic Resource Exchange (AGRE) families. The AGRE resource is supported by Autism Speaks. A subset of the participants was ascertained while Dr Pericak-Vance was a faculty member at Duke University.