T. C. Gilliam, Columbia Genome Center, 1150 St. Nicholas Avenue, Room 508, New York, NY 10032, USA. E-mail: email@example.com
Common genetic disorders are believed to arise from the combined effects of multiple inherited genetic variants acting in concert with environmental factors, such that any given DNA sequence variant may have only a marginal effect on disease outcome. As a consequence, the correlation between disease status and any given DNA marker allele in a genomewide linkage study tends to be relatively weak and the implicated regions typically encompass hundreds of positional candidate genes. Therefore, new strategies are needed to parse relatively large sets of ‘positional’ candidate genes in search of actual disease-related gene variants. Here we use biological databases to identify 383 positional candidate genes predicted by genomewide genetic linkage analysis of a large set of families, each with two or more members diagnosed with autism, or autism spectrum disorder (ASD). Next, we seek to identify a subset of biologically meaningful, high priority candidates. The strategy is to select autism candidate genes based on prior genetic evidence from the allelic association literature to query the known transcripts within the 1-LOD (logarithm of the odds) support interval for each region. We use recently developed bioinformatic programs that automatically search the biological literature to predict pathways of interacting genes (pathwayassist and geneways). To identify gene regulatory networks, we search for coexpression between candidate genes and positional candidates. The studies are intended both to inform studies of autism, and to illustrate and explore the increasing potential of bioinformatic approaches as a complement to linkage analysis.
Autism is a pervasive neurodevelopmental disorder that severely impairs development of normal social and emotional interactions and related forms of communication. Disease symptoms characteristically include unusually restricted and stereotyped patterns of behaviors and interests. Autism describes the most severe manifestation of a broad spectrum of disorders, known as autism spectrum disorders (ASD) that share these essential features, but vary in their degree of severity and/or age of onset. While it is difficult to accurately estimate the prevalence of ASD, due to an apparent increase over the past few decades (Chakrabarti & Fombonne 2001; Gillberg & Wing 1999; Prior 2003), recent studies suggest that ASD affects 34–60 individuals per 10 000 (Charman 2002; Fombonne 2003; Yeargin-Allsopp et al. 2003).
Twin and epidemiological studies show that autism is a highly heritable disorder. When one monozygotic (MZ) twin is diagnosed with autism or ASD, the disease concordance is 70–90%, compared to 0–25% concordance among same-sex dizygotic twins (Bailey et al. 1995; Folstein & Rutter 1977; Lauritsen & Ewald 2001; Rutter 2000). The estimated heritability of ASD is believed to be approximately 90%, which is extremely high relative to other complex genetic diseases (Hyttinen et al. 2003; Ju et al. 2000). The impact of genetic determinants on disease liability is further substantiated by comparing the disease risk for a sibling of a proband diagnosed with ASD (2–6%) with the population prevalence of ASD (0.04–0.1%) (Smalley 1997; Smalley et al. 1988; Szatmari et al. 1998), yielding a relative risk of 50–100 for ASD (Lamb et al. 2000). The rate by which autism and ASD incidence drops among first, second and third degree relatives provides another indication that disease susceptibility arises from the combined effects of multiple, possibly interacting, genes (Lamb et al. 2000; Rutter 2000). Therefore, even though autism is clearly among the most heritable of all psychiatric disorders, the likely interaction of multiple genes that increase susceptibility to autism, rather than directly cause it, presents formidable challenges for genetic studies.
The search for genetic linkage between DNA markers spanning the entire genome and single-gene disorders with clear Mendelian patterns of inheritance has been enormously successful, in many cases leading to the identification of disease genes and their causal mutations despite years of failure using non-genetic, hypothesis-driven approaches (Botstein & Risch 2003). The success of such studies depends upon the identification of clear recombinant breakpoints that define the boundaries of the disease locus, and typically demarcate a minimal genetic region that harbors the disease gene along with dozens of non-disease related, positional candidate genes (Riordan et al. 1989; Rommens et al. 1989). Whereas ‘single-gene’ disorders are typically quite rare, common heritable disorders are believed to arise from the combined effects of multiple predisposing gene variants, presumably in combination with environmental factors. Consequently, the influence of any single gene-variant upon disease status is likely to be small, and therefore difficult to detect using genetic linkage strategies. Moreover, the population prevalence of gene variants with small or negligible individual effects upon reproductive fitness will follow the same stochastic course as neutral polymorphisms, in some instances reaching significant frequencies. This explains in part how heritable disorders with multiple gene etiologies become common, and also why they are elusive gene mapping targets, i.e., it becomes difficult to detect enhanced sharing of disease-related alleles among affected individuals when the same gene variant is prevalent among control individuals. For these reasons and others (Altmuller et al. 2001; Lander & Kruglyak 1995; Lander & Schork 1994; Weiss & Terwilliger 2000), evidence for linkage between a common heritable disorder and DNA marker alleles tends to be weak and difficult to distinguish from the type of random statistical fluctuations that inevitably accompany a full genome scan. 
Consequently, a conservative survey of positional candidate genes based upon whole genome scan analysis typically requires the analysis of positional candidate genes within multiple, broad linkage peaks, often spanning 10–40 million base pairs, and comprising upwards of 50–100 genes.
Consistent with these rather dire predictions, we recently completed the largest whole genome linkage scan of ASD reported to date, and found no statistically significant evidence for linkage between DNA marker alleles and disease status (Yonan et al. in press). We did, however, detect ‘suggestive’ evidence for ASD predisposing loci on chromosomes 17, 5, 11, 4 and 8. Such moderate linkage signals may reflect the marginal contribution to disease risk arising from a given genetic locus, or alternatively, false positive findings that reflect random statistical fluctuation. While independent replication is the standard to distinguish between the two possibilities, the criteria required to declare replication are model and disease dependent, and thus necessarily vague, and at least in theory, replication of a specific linkage finding is many times more complex than detection of any one among several predisposing genetic loci (Lander & Kruglyak 1995).
For reasons outlined above, whole genome linkage analysis of common heritable disorders identifies a large and unmanageable number of positional candidate genes, the vast majority of which are unrelated to the disease target. We propose the use of genomic data-mining strategies to parse these relatively large candidate gene sets with the purpose of identifying a subset of biologically meaningful genes that map to predetermined genetic loci. To illustrate this approach, we have surveyed the top five ASD-linked regions in a recent genomewide linkage study (Yonan et al. in press). The strategy is to predict a subset of likely candidate genes mapping to each putative linkage peak. Such candidates would then become the focus of further genetic and biological testing.
There is substantial interest in using bioinformatic resources in conjunction with linkage methodologies to identify the most promising candidate loci within large and sometimes unconfirmed linkage regions, so that they may be examined further (Baron 2002). We chose to use positively associated genes to query known transcripts within peak linkage regions using several complementary bioinformatic methods. We examined several different bioinformatic approaches in order to identify convergent evidence for specific candidate genes, as well as to explore the future potential and current limitations of these approaches.
Materials and methods
Characterization of putative ASD-linked chromosomal regions
The chromosomal regions examined in this study are shown in Fig. 1. Beginning with 345 families that had two or more siblings diagnosed with either autism or ASD, we used affected sib pair analysis to identify genomewide linkage to ASD (Yonan et al. in press). Five chromosomal regions from the genome scan met a cutoff of a pointwise P-value of < 0.01, which we interpreted as being ‘moderately suggestive’. Here we examine the chromosomal regions defined by the 1-LOD support interval of the 5 most significant peaks. Details of the analysis that led to the identification of these regions have been described previously (Liu et al. 2001; Yonan et al. in press).
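As an illustration of how a 1-LOD support interval can be read off a linkage curve, the following minimal Python sketch takes the contiguous run of map positions around the peak whose LOD score lies within one unit of the maximum. The function name, the grid of positions and the LOD values are hypothetical and are not data from the genome scan; boundaries are taken at grid points without interpolation, which is a simplifying assumption.

```python
def one_lod_support_interval(positions_cm, lod_scores):
    """Return (left, right) map positions bounding the 1-LOD support interval.

    The interval is the contiguous run of grid points around the peak whose
    LOD score is no more than 1 unit below the maximum. Boundaries fall on
    grid points (no interpolation), an assumption of this sketch.
    """
    if len(positions_cm) != len(lod_scores) or not positions_cm:
        raise ValueError("need equal-length, non-empty inputs")
    peak = max(range(len(lod_scores)), key=lambda i: lod_scores[i])
    threshold = lod_scores[peak] - 1.0
    left = peak
    while left > 0 and lod_scores[left - 1] >= threshold:
        left -= 1
    right = peak
    while right < len(lod_scores) - 1 and lod_scores[right + 1] >= threshold:
        right += 1
    return positions_cm[left], positions_cm[right]


# Hypothetical multipoint LOD curve sampled every 5 cM (illustrative values only).
positions = [0, 5, 10, 15, 20, 25, 30, 35, 40]
lods = [0.2, 0.6, 1.3, 1.9, 2.4, 2.0, 1.5, 0.9, 0.3]
interval = one_lod_support_interval(positions, lods)
print(interval)
```

In practice the genetic boundaries obtained this way would still have to be mapped to physical coordinates before genes can be extracted from an annotation database.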
Association and linkage tables
We performed a search for allelic association between candidate gene allelic variants and autism or ASD using the PubMed database (http://www.ncbi.nlm.nih.gov/). This search strategy was augmented by personal knowledge of the literature and by references from key publications (Table 1). A similar strategy was used to compile the list of genomewide linkage studies for autism and ASD (Table 2).
Table 1. Summary of association studies for autism
Table summarizes current positive and negative association studies for specific genes and autism disorder or related phenotypes. Positive allelic associations are shown in bold type. Also shown are any whole genome linkage peaks that overlap with a gene tested for association, and their linkage scores. *MLS = Multipoint LOD score; †TDT = Transmission Disequilibrium Test; ‡LD = Linkage Disequilibrium; ¶PDT = Pedigree Disequilibrium Test; §MTDT = Multiallelic TDT; **DQ = Development Quotient
Table 2. Summary of genomewide linkage studies for autism and ASD
Table summarizes genomewide linkage studies for autism or ASD, organized by chromosomal position and showing sample size used. Only the linkage regions with an MLS > 1.4 are shown for consistency of comparison. Linkage regions from Yonan et al. (in press), that the current study is based upon, are shown in bold. Liu et al. (2001) is not shown since the complete sample (110 families) is included and reanalyzed in Yonan et al. (in press).
Peak position = position of the highest point/marker in Kosambi centimorgans from pter = 0.
Physical location = position of the highest point/marker as mapped onto the Human Genome Browser.
LOD score = usually MLS score, however, Z demarks an NPL Z score.
83 + 69 = 89 families were used in the initial genomewide scan and then 69 families were added to follow up in 13 candidate regions. §PSD = Phrase Speech Delay
We compiled a comprehensive list of genes (known and predicted from transcripts) in our five most significant regions using the Celera Discovery System (http://www.celeradiscoverysystem.com) and the NCBI Human Genome Project (UCSC Genome Browser; http://www.genome.ucsc.edu/ version 24 (hg15) April 2003 Freeze) databases. This exhaustive gene list was created by performing database queries against the UCSC Human Genome Browser's annotation database. The table definitions and data of two MySQL (http://www.mysql.com) tables, refGene and refLink, were downloaded from the public FTP site at UCSC (ftp://genome.ucsc.edu/goldenPath/10april2003/database/) and recreated locally. Genes that mapped to the corresponding intervals in the Celera map were downloaded manually. All genes located within the physical boundaries defined by the 1-LOD unit support intervals on each chromosome were then extracted; the complete list of these 383 genes is available as supplementary material accompanying this paper (see Supplementary material section). This list was then further evaluated using several online databases. The Celera database annotates category and family for each gene using the Panther Protein Function. The Human Genome Project provides a gene ‘index’, a set of links to multiple annotation databases, for each RefSeq transcript, including to the Online Mendelian Inheritance in Man (OMIM), LocusLink, PubMed, GeneLynx, GeneCards and AceView databases. A short list of ‘neural-related’ genes was identified based upon evidence of their involvement in neuronal development/control, neurotransmitter function, transcription regulation and similar functions that made them logical disease-related candidates for the autism spectrum disorders.
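The interval extraction described above can be sketched as a simple overlap query. The example below is a minimal illustration only: it uses Python's built-in SQLite module as a stand-in for the local MySQL copy, reproduces only a few columns of refGene and refLink, assumes the two tables are joined on the transcript accession (refGene.name to refLink.mrnaAcc), and counts any gene that overlaps the interval. All accessions, gene names and coordinates are hypothetical.

```python
import sqlite3

# In-memory stand-in for the locally recreated annotation database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE refGene (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
    CREATE TABLE refLink (name TEXT, mrnaAcc TEXT);
    """
)
# Hypothetical transcripts (accession, chromosome, start, end); not real annotations.
conn.executemany(
    "INSERT INTO refGene VALUES (?, ?, ?, ?)",
    [
        ("NM_HYP0001", "chr17", 24000000, 24050000),
        ("NM_HYP0002", "chr17", 24010000, 24050000),  # second transcript of GENE_A
        ("NM_HYP0003", "chr17", 29900000, 30100000),  # straddles the right boundary
        ("NM_HYP0004", "chr17", 40000000, 40020000),  # outside the interval
        ("NM_HYP0005", "chr5", 25000000, 25010000),   # different chromosome
    ],
)
# Hypothetical gene symbol for each transcript accession.
conn.executemany(
    "INSERT INTO refLink VALUES (?, ?)",
    [
        ("GENE_A", "NM_HYP0001"),
        ("GENE_A", "NM_HYP0002"),
        ("GENE_B", "NM_HYP0003"),
        ("GENE_C", "NM_HYP0004"),
        ("GENE_D", "NM_HYP0005"),
    ],
)


def genes_in_interval(connection, chrom, start, end):
    """Return sorted, de-duplicated gene symbols overlapping chrom:start-end."""
    rows = connection.execute(
        """
        SELECT DISTINCT l.name
        FROM refGene AS g
        JOIN refLink AS l ON l.mrnaAcc = g.name
        WHERE g.chrom = ? AND g.txEnd >= ? AND g.txStart <= ?
        ORDER BY l.name
        """,
        (chrom, start, end),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical physical boundaries of one support interval.
genes = genes_in_interval(conn, "chr17", 23000000, 30000000)
print(genes)
```

Selecting distinct gene symbols rather than transcripts keeps genes with several RefSeq transcripts from being counted more than once.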
Gene ontology methods
Gene Ontology (GO) is a controlled vocabulary designed to describe key aspects of the molecular function, biological process and cellular component of gene products (Bard 2003). Using the complete list of all 383 positional candidate genes (see above) we screened genes for neural-related GO terms in an effort to identify likely candidates for ASD. Screening was performed with the program pathwayassist (version 1.1, Stratagene Corp, La Jolla, CA) and the FatiGO website (http://www.fatigo.bioinfo.cnio.es/).
Pathwayassist and ResNet database
The pathwayassist software (Ariadne Genomics, Rockville, MD) allows the user to explore gene interaction networks represented in the ResNet (tm) database. ResNet (tm) is a comprehensive database of molecular networks compiled by proprietary natural language processing techniques applied to the whole PubMed database. The database contains more than 100 000 events of regulation, interaction and modification between 15 000 proteins, cell processes and small molecules. The architecture of ResNet and pathwayassist has been described (http://www.ariadnegenomics.com). pathwayassist provides a ‘front end’ that allows the user to query the database, and to direct the construction of specific networks relative to genes of interest.
The complete list of all 383 positional candidate genes was loaded into pathwayassist. Of those genes, 203 were recognized by the software, and were thus subjected to subsequent analysis. The ‘Expand Pathway’ feature of pathwayassist was used to build a network of connections starting with these 203 genes and including all available categories of interaction. This expanded list was then searched to find genes that interacted with neural-related positional candidate genes in the following manner. The genes in the expanded set that had interesting GO terms were identified, and then their interacting ‘neighbors’ were selected using the ‘Select Neighbors’ command. Set operations were used to reduce the list to only those genes that were among the original list of 203 positional candidate genes. Nine genes not found in the manual search described above were identified in this manner for further evaluation. Of these, four appeared to be logical candidates, and to have been correctly identified by pathwayassist as having valid interactions (Method 4, in Table 3) after manual inspection.
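The neighbor selection and set operations described above can be sketched with ordinary set arithmetic. The snippet below is an illustration only and does not reproduce pathwayassist or the ResNet database: the network is a hypothetical adjacency mapping, and all gene labels (P for positional candidates, N for genes with neural-related GO terms, X for other network members) are invented.

```python
def candidates_near_neural_genes(network, neural_genes, positional):
    """Positional candidates that are direct neighbors of a neural-related gene.

    network: dict mapping a gene to the set of genes it interacts with.
    neural_genes: genes in the expanded network carrying neural-related GO terms.
    positional: the recognized positional candidate genes.
    """
    neighbors = set()
    for gene in neural_genes:
        neighbors |= network.get(gene, set())  # analogous to selecting neighbors
    return neighbors & positional  # keep only the original positional candidates


# Hypothetical expanded interaction network.
network = {
    "N1": {"P1", "X1"},
    "N2": {"P3", "N1"},
    "X1": {"P2"},
}
positional = {"P1", "P2", "P3", "P4"}
neural_genes = {"N1", "N2"}

hits = sorted(candidates_near_neural_genes(network, neural_genes, positional))
print(hits)
```

Candidates recovered this way would still require the manual inspection described in the text, since an interaction extracted from the literature may be spurious.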
Table 3. Semi-automated search for candidate genes
Table shows all candidate genes within our linkage regions that were found by different search strategies.
pathwayassist was also used to search for pathway relationships beginning with the 13 genes that have been reported to be positively associated with autism in at least one previous study (Table 1). The pathwayassist ‘Build Pathway’ function was used to search for pathways beginning with these genes. Next, the pathway was expanded to examine the connections to any of the positional candidate genes of the current study. As before, 203 of the positional candidates were recognized by the program and used in this analysis, only a few of which showed connections to this pathway (Method 5 in Table 3). Interactions among the 203 positional candidates were excluded from the analysis, as these interactions were unrelated to our hypothesis.
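The logic of this search can be sketched as a graph traversal that starts from the associated genes and records any positional candidate it reaches, while ignoring interactions between two positional candidates. This is a hedged illustration, not the algorithm used by pathwayassist: the edge list, gene labels and the max_steps cutoff are all hypothetical choices made for the example.

```python
from collections import deque


def reachable_candidates(edges, seeds, positional, max_steps=2):
    """Positional candidates reachable from seed genes within max_steps edges."""
    adjacency = {}
    for a, b in edges:
        if a in positional and b in positional:
            continue  # candidate-candidate interactions are excluded
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    found = set()
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, dist = queue.popleft()
        if dist == max_steps:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in positional:
                found.add(nxt)  # record the hit, but do not expand through it
            else:
                queue.append((nxt, dist + 1))
    return found


# Hypothetical network: S = associated (seed) genes, M = intermediates,
# P = positional candidates.
edges = [("S1", "M1"), ("M1", "P1"), ("P1", "P2"), ("M2", "P3")]
connected = reachable_candidates(edges, {"S1", "S2"}, {"P1", "P2", "P3"})
print(connected)
```

Here P2 is not reported because its only link is to another positional candidate, and P3 is not reported because its neighbor M2 is not connected to any seed gene.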
Geneways pathway prediction system
geneways is a program that uses a natural language processing algorithm to extract relationships between molecules or molecular processes by digesting published research literature and building these relationships into pathways (Rzhetsky et al. 2000). Electronic copies of the full text of research articles are downloaded to a local database where biologically important concepts such as names of genes, proteins, processes, small molecules and diseases are extracted from the text (Krauthammer et al. 2000) and clarified in relation to the many synonyms and homonyms and other ambiguities that are often applied to an identical term (Hatzivassiloglou et al. 2001). An associated program, genies, is a natural language processing parser (Friedman et al. 2001). The output of genies is represented with semantic trees. A separate module unwinds these complex output trees into simple binary statements that are saved into the geneways knowledge base. The geneways system extracts some percentage of incorrect, redundant or contradictory statements that continue to pose bioinformatic challenges (Krauthammer et al. 2002), and currently requires manual curation and annotation. The user can conveniently request information about each interaction and retrieve the complete articles from which the information was extracted.
The pathway built with geneways was based on two sets of genes. The first consisted of about 20 genes that had been previously identified in the literature as playing a role in autism, either from positive association findings (Table 1), known chromosomal abnormalities or similar methods. The second list was the complete list of 383 positional candidate genes. geneways was then used to try to identify connections between these two groups of genes and to observe how those potential candidates might interact with each other and with other pathways. Currently, it is only possible to examine the geneways database by building a pathway out from a single gene, rather than having an exhaustive algorithm systematically identify all possible interactions. geneways was used to identify and visualize all the meaningful connections from the first list of known autism candidates to any information stored in the database. Several of the identified genes in this pathway were located within our linkage regions. Next, additional positional candidate genes were tested to see if they were connected with the same pathway (Method 6 in Table 3). We added an additional 30 positional candidates that we deemed most likely to contribute to ASD. These were the genes that, based on the manual search, appeared most plausibly involved in ASD phenotypes. Of the 30 genes that we examined, only six had direct connections to other genes in the pathway. Only those 30 candidates were examined using this strategy because our experience with this software suggests that it is important to limit the number of genes examined in order to produce an informative pathway that provides testable connections rather than an exhaustive but unwieldy pathway. Each arrow in Fig. 2 represents either a physical or a logical interaction. Logical connections may represent multistep processes that include intermediaries not shown in the diagrams.
Transcription microarray meta-analysis
Whole genome gene expression arrays were used to identify possible functional relationships by searching for genes that are coexpressed with key autism candidate genes and positional candidate genes, based on mRNA expression microarray data. To increase the reliability of coexpression detection, only patterns of coexpression that were consistent in multiple data sets were used, since a coexpression relationship that is found in two or more independent studies is less likely to be an artifact. Because we did not have access to sufficient quantities of high-quality human brain gene expression data, we analyzed the homologs of our candidate genes in a set of seven independently collected mouse brain gene expression data sets. Of the 383 candidate genes, 170 had known mouse homologs, many of which are curated orthologs, which were then used for further analysis.
Of the seven mouse brain gene expression data sets used for Transcription Microarray Meta-Analysis, five were from unpublished in-house data and two were from published data sets (Sandberg et al. 2000; Zhao et al. 2001). Except for the dataset of Sandberg, which included data from six brain regions, all samples were from the hippocampus. Zhao et al. compared the subfields of the hippocampus. The additional data sets from our group are currently unpublished and consist primarily of test-control studies, with between 8 and 24 microarrays per data set, distributed as biological replicates of each condition. The conditions studied in each of these data sets were as follows: Young vs. old mice (M. Verbitsky, A.L. Yonan, G. Malleret, E.R. Kandel, T.C. Gilliam & P. Pavlidis, submitted); protein kinase C-gamma knockout vs. control mice; mice expressing a dominant negative protein kinase A regulatory subunit (R(AB); Abel et al. 1997) vs. control; a separate experiment using R(AB) and control animals to examine the effects of context-cued fear conditioning; and an analysis of mice expressing a dominant-negative inhibitor of CCAAT/enhancer-binding protein-family member transcription factors, compared to controls (Chen et al. 2003). Each data set was filtered to remove genes clearly lacking detectable expression, removing 30% of genes with the smallest maximal expression in each data set. Each gene was then analyzed to identify genes it was coexpressed with. For each gene, the Pearson correlation coefficient of all pairs of gene expression profiles in the data set was calculated. A P-value was calculated for the Pearson correlation assuming the null distribution follows a t-distribution (Zar 1999). P-values for each correlation were Bonferroni corrected, and genes with corrected P-values < 0.01 were considered coexpressed with the query gene. We note that this method does not make use of the experimental grouping of the samples (e.g., young vs. old), and thus genes which are coexpressed do not necessarily (indeed, typically do not) have expression patterns that are ‘relevant’ to the originally defined experimental groups. Pairs of genes that meet the criteria for coexpression were entered in a database. From the seven data sets, for all genes examined by the microarrays (∼10 000), we extracted ∼200 000 gene pairs (< 0.1% of all possible pairs). We then screened this database for pairs involving a positional candidate gene homolog that was identified in at least two of the seven data sets. We also attempted to identify genes that were coexpressed with the 13 genes implicated by positive findings from association studies (Table 1). However, we were unable to identify any genes in our linkage regions that were coexpressed with these genes (data not shown).
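The main steps of this procedure can be sketched in a few lines of Python. The sketch below is illustrative rather than a reproduction of the analysis pipeline: it computes a Pearson correlation, converts it to the standard t statistic with n - 2 degrees of freedom, applies a Bonferroni correction to a supplied P-value, and keeps gene pairs that were declared coexpressed in at least two data sets. Function names, expression values and gene labels are hypothetical, and the P-value itself is left to a statistics library (for example the survival function of the t-distribution in SciPy) rather than computed here.

```python
import math
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation (n - 2 d.f.)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def reproducible_pairs(significant_pairs_per_dataset, min_datasets=2):
    """Gene pairs declared coexpressed in at least min_datasets data sets."""
    counts = Counter()
    for pairs in significant_pairs_per_dataset:
        # frozenset makes (A, B) and (B, A) count as the same pair
        counts.update({frozenset(pair) for pair in pairs})
    return {tuple(sorted(p)) for p, c in counts.items() if c >= min_datasets}


# Hypothetical expression profiles across five microarrays.
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
t = t_statistic(r, 5)

# Hypothetical per-data-set lists of significantly coexpressed pairs.
datasets = [{("A", "B"), ("C", "D")}, {("B", "A")}, {("C", "E")}]
shared = reproducible_pairs(datasets)
print(r, t, shared)
```

Requiring a pair to recur across independent data sets is what guards against artifacts specific to a single experiment, at the cost of missing relationships that are only detectable under one set of conditions.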
Table 1 summarizes results from studies that have sought to detect allelic association between candidate genes and autism or autism-related phenotypes. A total of 13 genes and three markers spanning 10 distinct cytogenetic regions purportedly show positive evidence for allelic association to autism. Of these 10 regions only 17q11 is concordant with the linkage regions identified in Yonan et al. in press (Fig. 1).
Table 2 summarizes the results from nine genomewide linkage studies for autism and ASD. Interpretation of genetic linkage to common heritable disorders is fraught with uncertainty and cross-study comparisons are not straightforward (Altmuller et al. 2001). All other factors being equal, larger sample studies are less prone to both false positive and false negative errors, thus we focused on the five strongest linkage signals from the large Yonan et al. study rather than, for example, choosing the five strongest linkage signals across all nine genomewide scans, or the five regions most supported by independent studies. As shown in Table 2, the Yonan et al. study (345 multiplex families) is more than three times the size of other reported genomewide studies. When comparing the results from Yonan et al. (in press) with those of other published studies in which evidence for linkage exceeded an MLS > 1.4 (P < 0.01; Nyholt 2000), overlap was identified on 17q (IMGSAC 2001a). The five putative ASD linkage regions selected for study are indicated in Fig. 1 (also shown as bold in Table 2).
Semi-automated search for ASD candidate genes
In a first attempt to parse positional candidate genes, we used public and commercial biological databases, together with Gene Ontology formalisms (see Materials and methods) to predict a subset of ‘neural related’ genes of potential relevance to ASD (Table 3). Candidates were selected from the 383 positional candidate genes based upon information gathered by manual search of the public UCSC Human Genome Browser and the proprietary Celera Discovery System together with their related links (Method 1, Table 3). A further search using neural-related GO terms (see Materials and methods) identified 11 additional genes (TIAF1, TNFAIP1, TRAF4, CARD6, CCL28, ITGA2, CHRM4, MDK, CXCL1, FGF5, UNC5C) not already identified by the manual search (Method 2, Table 3). Finally, an additional four candidate genes (IL6ST, LIFR, EIF4E, IL8) were identified using the pathwayassist computational software based upon their predicted network association with neural-related pathway genes (Method 4, Table 3; see Materials and methods).
Computational pathway prediction methods
In the present paper, we have attempted to leverage what little information is available about the genes that may contribute to autism in order to identify additional candidate genes for autism based on the results from our genomewide linkage study. Our hypothesis was that by constructing pathways between the genes already suspected to be involved in autism and our positional candidate genes, we could identify a subset of those positional candidates more likely to be involved in autism.
geneways' predictions regarding the connections between several of the positional candidate genes and a short list of genes suspected to be involved in autism (including both genes positively associated with autism and biological inferences) are shown in Fig. 2. Interactions among three of the genes positively associated with autism (GLUR6, HRAS1 and SLC6A4; shown as circles with red letters) together with connecting pathway genes (blue circles), molecules (red triangles) and processes (yellow rectangle), and 10 positional candidate genes (brown circles) were discovered (Fig. 2; Method 6, Table 3). When using the geneways program, each connecting line is a ‘clickable’ link that displays the underlying text that supports the interaction.
Gene networks illustrated in Fig. 3 were developed using a conceptually similar strategy, using pathwayassist instead of geneways. The pathwayassist ‘Build Pathway’ function found valid connections (as determined by manual inspection) between 2 of the 13 genes that have been positively associated with autism (GLUR6 and UBE3A; Table 1) and a subset of the positional candidate genes. Positional candidates that were found to have valid connections to this pathway are shown as Method 5, Table 3.
We analyzed patterns of whole genome gene expression across multiple microarray data sets to identify possible gene regulatory interactions between the selected set of autism candidate genes and a subset of positional candidate genes. Of the 383 candidate genes analyzed, murine homologs for 170 genes were identified, which we then used to query seven independent mouse brain expression data sets. No reliable coexpression patterns were detected among the 13 positively associated autism candidates and the subset of 170 positional candidates. However, 10 of the 170 positional candidates showed highly reliable coexpression with one or more genes that were detected in multiple gene expression data sets (Table 4). A total of 107 genes were coexpressed with the set of 10 query genes. Based on their functions and annotations, we determined that a subset of these 107 genes showed potential relevance to neurodevelopmental disorders (Table 4).
Table 4. Genes co-expressed with positional candidates based on gene expression data from mouse brain
Columns: Gene accession ID; Linkage region (chromosome); Number of matches; Co-expressed candidate genes
Genes that are located within the 1-LOD support interval of our QTL regions (Index Genes) and that belong to classes of coexpressed genes. First the mouse homologue of each index gene was identified (when available). In the absence of appropriate human gene expression data, we utilized 7 independently collected sets of mouse brain gene expression data, consisting of 8–24 microarrays each, to develop classes of coexpressed genes. We identified genes that were reproducibly coexpressed (in two or more of the data sets) with the mouse homologue of the index gene. When an index gene belonged to a functional expression class, the other genes in that class were identified (total # of matches), and the likely candidates from that expression class identified. Candidate genes so identified may be downstream targets of a transcriptional activation pathway common to the index gene and the candidate, with the index gene acting either as a transcription factor (for example, zinc-fingers and homeoboxes 1), or as the modulator of a transcription factor.
In this study we have sought to apply emerging bioinformatic tools to a problem that characterizes nearly all gene-mapping studies that target common, heritable disorders. Common heritable disorders are characteristically multigenic and heterogeneous in nature. Consequently, linkage peaks tend to be broad and weakly significant such that subsequent positional mapping and gene identification is greatly complicated. In a minority of cases, follow-up allelic association analysis has apparently been used successfully to delimit the disease gene region and to identify the disease related genetic variation (Horikawa et al. 2000; Ogura et al. 2001). The recent sequencing of the human genome, along with the genomes of other well-researched organisms, now makes identification of positionally mapped genes a straightforward bioinformatic exercise. However, knowledge of which genes reside within an interval alone does not significantly change the complexity of gene mapping.
Positional mapping poses unique challenges that are well suited for computational data-mining approaches. Peak linkage findings demarcate chromosomal regions most likely to harbor disease-related genetic variation, yet positional candidate genes pose unique bioinformatic problems: some portion of peak regions will be false positives and harbor no disease related genes, some peaks that do harbor disease related genetic variation will consist of only one disease-related gene among other genes that bear no relationship to the disease, other peaks might obtain their prominence due to the contribution of more than one disease-related gene, and some portion of disease related genes will likely reside outside the identified peak regions.
In addition to positional candidate genes, other types of genetic evidence are typically used to identify common disease causing alleles. Allelic association, or linkage disequilibrium, is used to detect historical association between a candidate gene variant and disease phenotype. Association studies are vulnerable to many of the same genetic complexities that confound genetic linkage studies with the following difference: association studies are robust to locus heterogeneity (since they only test one locus at a time), but confounded by allelic heterogeneity. Association studies are also believed to be quite vulnerable to genotypic differences related to population substructure (background genotypic differences that are unrelated to phenotype) (Hoggart et al. 2003). Thus, the ‘candidate status’ of most candidate genes is subject to uncertainty. Nevertheless, the subset of genes contained within a suggestive linkage peak is likely to be enriched for actual disease genes compared to the genome as a whole.
Both geneways and pathwayassist are pathway prediction methods that are designed to recognize written language and to extract key phrases that describe basic biological relationships between genes, small molecules, cellular processes and similar phenomena. pathwayassist reads abstracts, whereas geneways reads the entire article. Both programs use sophisticated algorithms to predict pathway interactions. Because of problems with interpretation of language, figures and tables, a number of oversights and erroneous conclusions are inevitable in these programs. Thus, human interpretation and curation of these databases and their output remain critically important. Another aspect of natural language processing algorithms is that they must discriminate between physical interactions (binding or cleavage, oxidation, etc.), and logical interactions (e.g., the effect of a drug on gene expression). In the first case, two molecules are known to interact directly, whereas in the latter case, the mechanism of interaction may involve a multistep pathway. Thus, the pathways identified by these algorithms must be carefully filtered and/or checked by an expert user in order to establish the type of experimental data that was collected and to determine what biological experiments are required to further test the proposed pathways.
Identification of candidates by the ‘manual’ search strategy, the GO strategies and by the pathwayassist and geneways pathway prediction programs all depend upon data from the published literature. In contrast, the transcription microarray meta-analysis is largely based on unpublished experimental data, and thereby provides a completely independent bioinformatic approach to the same positional mapping problem. Since criteria to distinguish correct from incorrect bioinformatic predictions are often lacking, it is desirable to employ independent computational strategies and identify convergent pathway predictions (Eisenberg et al. 2000). Yeast whole genome gene expression studies show that coexpressed sets of genes are enriched for functionally related or physically interacting genes (Eisen et al. 1998; Ge et al. 2001). Genes that are coexpressed may be coregulated by a common transcription factor. Alternatively, one of the genes in a group of transcriptionally coregulated genes may be the transcription factor that drives the expression of the others (e.g., ZHX1, Table 4) or it may simply be upstream in a transcriptional cascade of genes that influence the expression of downstream members of the same cascade. Thus, we used this method to identify genes with known function in the brain that may be relevant to ASD, and which are coexpressed with positional candidate genes identified in the genome scan. Such genes might be the downstream targets of transcription factors that reside within the linkage regions and possess functional polymorphisms.
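The coexpression search described above can be illustrated with a minimal sketch. It assumes that coexpression is scored as a Pearson correlation across expression samples and that pairs are kept when the absolute correlation exceeds a cutoff; the gene labels, expression values and the 0.9 threshold are hypothetical placeholders, and the scoring used in the actual meta-analysis may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no coexpression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(positional, functional, threshold=0.9):
    """Return (positional gene, functional gene, r) with |r| >= threshold."""
    pairs = []
    for p_name, p_profile in positional.items():
        for f_name, f_profile in functional.items():
            r = pearson(p_profile, f_profile)
            if abs(r) >= threshold:
                pairs.append((p_name, f_name, round(r, 3)))
    # Strongest correlations first
    return sorted(pairs, key=lambda t: -abs(t[2]))


# Hypothetical toy profiles (one value per sample); not real data.
positional = {
    "POS_GENE_1": [1.0, 2.0, 3.0, 4.0],
    "POS_GENE_2": [2.0, 2.1, 1.9, 2.0],
}
functional = {
    "BRAIN_GENE_A": [2.0, 4.1, 5.9, 8.0],
    "BRAIN_GENE_B": [5.0, 1.0, 4.0, 2.0],
}

pairs = coexpressed_pairs(positional, functional)
for p_name, f_name, r in pairs:
    print(p_name, f_name, r)
```

A pair that passes the cutoff only marks the two genes as candidates for a shared regulatory relationship. The direction of any transcriptional cascade cannot be read from the correlation itself.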
The ability to predict gene expression pathways identified using the method presented in this paper depends heavily on the quality and applicability of the underlying expression data. In the present example, we utilized expression data from mouse brains, rather than human brains, due to availability – an obvious shortcoming. Another limitation is that six of the seven datasets were derived from the hippocampus, rather than the entire brain, or a brain region more relevant to autism (e.g., amygdala). Development of large, well-characterized databases that can store and manage gene expression data and integrate with a range of other heterogeneous data sets, will likely overcome these shortcomings in the near future (e.g., Bader et al. 2003). More sophisticated computational programs to predict regulatory motifs (e.g., Bussemaker et al. 2001), together with high throughput experimental paradigms for selective and systematic perturbation of well-characterized biological systems (Barstead 2001; Elbashir et al. 2001; Ideker et al. 2001; McCaffrey et al. 2002) will likewise increase the power and scope of this approach. Perhaps the most promising strategies are those that combine the rigor of high throughput experimental paradigms with the speed, power and scope of computational data-mining approaches. Whole genome yeast and Drosophila ‘two-hybrid arrays’ test every permutation of protein–protein interaction, and despite a high false positive rate, are ideally suited for integrated computational analyses (von Mering et al. 2002). Mass spectrometry analysis of purified protein complexes (Ho et al. 2002) and the exploration of genetic interactions by identification of synthetic lethal gene combinations in yeast (Tong et al. 2001) are both potentially powerful complementary approaches to the prediction of interacting gene networks and pathways.
It is estimated that upwards of 90% of an individual's liability to develop autism or ASD is determined by genetic factors, yet the disease liability attributable to any single genetic variant may be so small that it is undetectable by current gene mapping strategies. This problem may be addressed to some extent by strategies to predict biological pathways since these strategies may identify interacting sets of genes that together account for a significant portion of heritable disease liability. The role of additive vs. epistatic gene interactions in the etiology of common heritable disorders is unclear at this point, as is the importance of this distinction in the mapping of such traits and disorders (Carrasquillo et al. 2002; Cox et al. 1999; Holmans 2002; Templeton 2000). Computational pathway predictions together with Gene Ontology annotations, gene regulatory information and other molecular interaction data should inform the characterization of additive vs. epistatic gene–gene interactions in ways that complement genetic studies. Thus, it is hoped that computational and bioinformatic approaches will lead to the identification of ‘candidate gene networks’ that encompass a significant fraction of a given disease's heritable component.
Table 3 summarizes the candidate gene predictions based upon six bioinformatic methods. With the exception of LIFR and EIF4E, all candidate genes detected by two or more of the automated search strategies were likewise detected by manual searches, suggesting that convergent findings from automated strategies are more reliable. The identified candidate genes are biased in favor of neurobiological disease etiology due to our search strategies. However, the etiology of autism may depend on susceptibility to environmental insults, rather than primary neurological deficits. For example, four candidates with known immunological function (IL6ST, LIFR, CD44 and IL8) were only detected by predicted pathway relationships, reflecting the lack of bias inherent in these pathway prediction approaches (Table 3).
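The idea of convergent findings can be made concrete with a short sketch that tallies how many independent strategies nominate each gene and keeps those supported by at least two. The method labels and gene names below are hypothetical placeholders, not the contents of Table 3, and the cutoff of two methods is an assumption chosen to mirror the text.

```python
from collections import defaultdict


def convergent_candidates(hits_by_method, min_methods=2):
    """Map each gene to the methods that nominated it, keeping genes
    supported by at least `min_methods` distinct methods."""
    support = defaultdict(set)
    for method, genes in hits_by_method.items():
        for gene in genes:
            support[gene].add(method)
    return {
        gene: sorted(methods)
        for gene, methods in support.items()
        if len(methods) >= min_methods
    }


# Hypothetical nominations per search strategy (placeholder gene names).
hits_by_method = {
    "manual": {"GENE_A", "GENE_B"},
    "gene_ontology": {"GENE_A", "GENE_C"},
    "literature_pathway": {"GENE_A", "GENE_D", "GENE_C"},
    "coexpression": {"GENE_E"},
}

convergent = convergent_candidates(hits_by_method)
for gene in sorted(convergent):
    print(gene, convergent[gene])
```

Because the literature-based strategies share a common data source, agreement among them is weaker evidence than agreement between a literature-based strategy and the expression-based one. A fuller treatment might weight methods by their independence rather than count them equally.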
In the present study we identified several genes using multiple bioinformatic approaches. Most notably the serotonin transporter (SLC6A4 a.k.a. 5-HTT) was identified by all but one of our search strategies (Table 3) including allele specific association studies (Table 1). Of the 408 microsatellite markers genotyped for linkage to ASD in the study by Yonan et al. (in press), the single most significant linkage was detected by a marker that maps less than one megabase distal to SLC6A4. It is also noteworthy that SLC6A4 is located in the only linkage region identified by Yonan et al. (in press) that overlaps with the findings of another linkage study (Table 2). Other studies have indicated that autism patients and their unaffected first degree relatives have elevated blood serotonin levels and there is evidence that drugs that selectively target the serotonin transporter can ameliorate some autism related symptoms (Cook & Leventhal 1996; Gingrich & Hen 2001). Thus SLC6A4 appears to be a particularly promising candidate gene for ASD, although it is not clear that the new data substantially bolster pre-existing data. Both pathway prediction programs predict relationships between glutamate receptor 6 (GLUR6; which has been positively associated with autism; Table 1) and positional candidates glial cell derived neurotrophic factor (GDNF) and SLC6A4, though not obviously via common pathways. Finally, the prolactin receptor (PRLR), and zinc-fingers and homeoboxes 1 (ZHX1), were identified by the transcriptional pathways prediction method (Table 4) as well as by the manual and GO strategies (Table 3).
Piccolo (PCLO) was identified by the transcriptional pathways prediction method as being coexpressed with the positional candidate, heterogeneous nuclear ribonucleoprotein D-like (HNRPDL) (Table 4). While PCLO itself is not located in our linkage region, and thus is not a positional candidate, this finding raises the possibility that HNRPDL may be upstream of PCLO in a transcriptional cascade. Based on its function, PCLO has been suggested as a possible candidate gene for autism (Fenster & Garner 2002), although this is not substantiated by allelic association (Table 1) (Nabi et al. 2003). The coexpression data suggest an alternative possibility: a polymorphism within HNRPDL may lead to differential expression of downstream targets that include PCLO. Thus the present results suggest that HNRPDL might be a viable candidate for autism, a conclusion that would not have been reached by any of the strategies whose results are reflected in Table 3.
The current study was designed to explore emerging bioinformatic technologies for the purpose of parsing large sets of genetically mapped (positional) candidate genes in search of disease-related genetic variation. Using a large family study of autism and ASD, we show that sophisticated bioinformatic approaches can be applied to this task, and that convergent approaches might be used to offset the biases inherent in any given approach. The aim is ultimately to identify a subset of genes that is enriched for disease-related genetic variation in the study sample, thus providing testable hypotheses. We further note with optimism that the genesis of integrative databases, powerful whole genome computational data-mining approaches, and high-throughput experimental paradigms to evaluate molecular interactions and pathway associations bodes well for the merger of bioinformatic and gene mapping approaches in the future.
The supplementary material contains the complete list of all 383 positional candidate genes in the top five regions from Yonan et al. (2003). There is a detailed explanation of the positions in cM and Mb for each candidate region, as well as their LOD scores in each region. Also, each chromosomal region is shown separately with only the genes that reside in that region listed for ease of comparison.
We gratefully acknowledge the Autism Genetic Resource Exchange (AGRE) families who made this study possible and the Cure Autism Now Foundation, which founded and continues to support AGRE. This research was funded by MH64547 (TCG) and a generous donation from Judith P. Sulzberger, MD. We are grateful to the AMDeC Bioinformatics Core Facility for assistance with all bioinformatic studies. Finally, we would like to thank Adina Grunn for technical assistance in determining the genotypes that led to this analysis.