Bioinformatic analysis of autism positional candidate genes using biological databases and computational gene network prediction


T. C. Gilliam, Columbia Genome Center, 1150 St. Nicholas Avenue, Room 508, New York, NY 10032, USA. E-mail:


Common genetic disorders are believed to arise from the combined effects of multiple inherited genetic variants acting in concert with environmental factors, such that any given DNA sequence variant may have only a marginal effect on disease outcome. As a consequence, the correlation between disease status and any given DNA marker allele in a genomewide linkage study tends to be relatively weak and the implicated regions typically encompass hundreds of positional candidate genes. Therefore, new strategies are needed to parse relatively large sets of ‘positional’ candidate genes in search of actual disease-related gene variants. Here we use biological databases to identify 383 positional candidate genes predicted by genomewide genetic linkage analysis of a large set of families, each with two or more members diagnosed with autism, or autism spectrum disorder (ASD). Next, we seek to identify a subset of biologically meaningful, high priority candidates. The strategy is to select autism candidate genes based on prior genetic evidence from the allelic association literature and to use them to query the known transcripts within the 1-LOD (logarithm of the odds) support interval for each region. We use recently developed bioinformatic programs that automatically search the biological literature to predict pathways of interacting genes (pathwayassist and geneways). To identify gene regulatory networks, we search for coexpression between candidate genes and positional candidates. The studies are intended both to inform studies of autism, and to illustrate and explore the increasing potential of bioinformatic approaches as a complement to linkage analysis.

Autism is a pervasive neurodevelopmental disorder that severely impairs development of normal social and emotional interactions and related forms of communication. Disease symptoms characteristically include unusually restricted and stereotyped patterns of behaviors and interests. Autism describes the most severe manifestation of a broad spectrum of disorders, known as autism spectrum disorders (ASD) that share these essential features, but vary in their degree of severity and/or age of onset. While it is difficult to accurately estimate the prevalence of ASD, due to an apparent increase over the past few decades (Chakrabarti & Fombonne 2001; Gillberg & Wing 1999; Prior 2003), recent studies suggest that ASD affects 34–60 individuals per 10 000 (Charman 2002; Fombonne 2003; Yeargin-Allsopp et al. 2003).

Twin and epidemiological studies show that autism is a highly heritable disorder. When one monozygotic (MZ) twin is diagnosed with autism or ASD, the disease concordance is 70–90%, compared to 0–25% concordance among same-sex dizygotic twins (Bailey et al. 1995; Folstein & Rutter 1977; Lauritsen & Ewald 2001; Rutter 2000). The estimated heritability of ASD is believed to be approximately 90%, which is extremely high relative to other complex genetic diseases (Hyttinen et al. 2003; Ju et al. 2000). The impact of genetic determinants on disease liability is further substantiated by comparing the disease risk for a sibling of a proband diagnosed with ASD (2–6%) with the population prevalence of ASD (0.04–0.1%) (Smalley 1997; Smalley et al. 1988; Szatmari et al. 1998), yielding a relative risk of 50–100 for ASD (Lamb et al. 2000). The rate at which autism and ASD incidence drops among first, second and third degree relatives provides another indication that disease susceptibility arises from the combined effects of multiple, possibly interacting, genes (Lamb et al. 2000; Rutter 2000). Therefore, even though autism is clearly among the most heritable of all psychiatric disorders, the likely interaction of multiple genes that increase susceptibility to autism, rather than directly cause it, presents formidable challenges for genetic studies.

The search for genetic linkage between DNA markers spanning the entire genome and single-gene disorders with clear Mendelian patterns of inheritance has been enormously successful, in many cases leading to the identification of disease genes and their causal mutations despite years of failure using non-genetic, hypothesis-driven approaches (Botstein & Risch 2003). The success of such studies depends upon the identification of clear recombinant breakpoints that define the boundaries of the disease locus, and typically demarcate a minimal genetic region that harbors the disease gene along with dozens of non-disease related, positional candidate genes (Riordan et al. 1989; Rommens et al. 1989). Whereas ‘single-gene’ disorders are typically quite rare, common heritable disorders are believed to arise from the combined effects of multiple predisposing gene variants, presumably in combination with environmental factors. Consequently, the influence of any single gene-variant upon disease status is likely to be small, and therefore difficult to detect using genetic linkage strategies. Moreover, the population prevalence of gene variants with small or negligible individual effects upon reproductive fitness will follow the same stochastic course as neutral polymorphisms, in some instances reaching significant frequencies. This explains in part how heritable disorders with multiple gene etiologies become common, and also why they are elusive gene mapping targets, i.e., it becomes difficult to detect enhanced sharing of disease-related alleles among affected individuals when the same gene variant is prevalent among control individuals. For these reasons and others (Altmuller et al. 2001; Lander & Kruglyak 1995; Lander & Schork 1994; Weiss & Terwilliger 2000), evidence for linkage between a common heritable disorder and DNA marker alleles tends to be weak and difficult to distinguish from the type of random statistical fluctuations that inevitably accompany a full genome scan. 
Consequently, a conservative survey of positional candidate genes based upon whole genome scan analysis typically requires the analysis of positional candidate genes within multiple, broad linkage peaks, often spanning 10–40 million base pairs, and comprising upwards of 50–100 genes.

Consistent with these rather dire predictions, we recently completed the largest whole genome linkage scan of ASD reported to date, and found no statistically significant evidence for linkage between DNA marker alleles and disease status (Yonan et al. in press). We did, however, detect ‘suggestive’ evidence for ASD predisposing loci on chromosomes 17, 5, 11, 4 and 8. Such moderate linkage signals may reflect the marginal contribution to disease risk arising from a given genetic locus, or alternatively, false positive findings that reflect random statistical fluctuation. While independent replication is the standard to distinguish between the two possibilities, the criteria required to declare replication are model and disease dependent, and thus necessarily vague, and at least in theory, replication of a specific linkage finding is many times more complex than detection of any one among several predisposing genetic loci (Lander & Kruglyak 1995).

For reasons outlined above, whole genome linkage analysis of common heritable disorders identifies a large and unmanageable number of positional candidate genes, the vast majority of which are unrelated to the disease target. We propose the use of genomic data-mining strategies to parse these relatively large candidate gene sets with the purpose of identifying a subset of biologically meaningful genes that map to predetermined genetic loci. To illustrate this approach, we have surveyed the top five ASD-linked regions in a recent genomewide linkage study (Yonan et al. in press). The strategy is to predict a subset of likely candidate genes mapping to each putative linkage peak. Such candidates would then become the focus of further genetic and biological testing.

There is substantial interest in using bioinformatic resources in conjunction with linkage methodologies to identify the most promising candidate loci within large and sometimes unconfirmed linkage regions, so that they may be examined further (Baron 2002). We chose to use positively associated genes to query known transcripts within peak linkage regions using several complementary bioinformatic methods. We examined several different bioinformatic approaches in order to identify convergent evidence for specific candidate genes, as well as to explore the future potential and current limitations of these approaches.

Materials and methods

Characterization of putative ASD-linked chromosomal regions

The chromosomal regions examined in this study are shown in Fig. 1. Beginning with 345 families that had two or more siblings diagnosed with either autism or ASD, we used affected sib pair analysis to identify genomewide linkage to ASD (Yonan et al. in press). Five chromosomal regions from the genome scan met a cutoff of a pointwise P-value of < 0.01, which we interpreted as being ‘moderately suggestive’. Here we examine the chromosomal regions defined by the 1-LOD support interval of the five most significant peaks. Details of the analysis that led to the identification of these regions have been described previously (Liu et al. 2001; Yonan et al. in press).
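The notion of a 1-LOD support interval can be illustrated with a minimal sketch. It assumes the interval is read directly off a grid of multipoint scores, without interpolation between map positions; the function name and the sample scores are hypothetical and are not data from the genome scan.

```python
from typing import List, Tuple


def one_lod_support_interval(
    positions_cm: List[float], scores: List[float], drop: float = 1.0
) -> Tuple[float, float, float]:
    """Return (left, right, peak_position) for the support interval.

    The interval is the contiguous run of map positions around the peak
    whose score stays within `drop` units of the maximum score.
    """
    if not positions_cm or len(positions_cm) != len(scores):
        raise ValueError("positions and scores must be non-empty and equal length")
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = scores[peak] - drop
    left = peak
    while left > 0 and scores[left - 1] >= threshold:
        left -= 1
    right = peak
    while right < len(scores) - 1 and scores[right + 1] >= threshold:
        right += 1
    return positions_cm[left], positions_cm[right], positions_cm[peak]


# Hypothetical scores sampled every 2 cM (illustrative values only).
positions = [40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60]
mls = [0.3, 0.9, 1.6, 2.1, 2.6, 2.83, 2.5, 1.9, 1.7, 0.8, 0.2]
print(one_lod_support_interval(positions, mls))
```

Because the boundaries are snapped to sampled positions, a denser grid or linear interpolation would give slightly different interval edges.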

Figure 1.

The 1-LOD intervals of the five most significant multipoint Maximum Likelihood Score (MLS) regions from genomewide Affected Sib Pair analysis of ASD (Yonan et al. in press). The x-axis depicts genetic distance in Kosambi centimorgans from pter (zero coordinate) to qter; the y-axis represents MLS. The thick line and shading of the peaks demarcate the 1-LOD interval that defined each region. The sizes of the 1-LOD intervals are shown in Kosambi centimorgans. The physical distance, as well as the number of transcripts, was taken from the Human Genome Browser for each region, as described in Materials and methods.

Association and linkage tables

We performed a search for allelic association between candidate gene allelic variants and autism or ASD using the PubMed database. This search strategy was augmented by personal knowledge of the literature and by references from key publications (Table 1). A similar strategy was used to compile the list of genomewide linkage studies for autism and ASD (Table 2).

Table 1. Summary of association studies for autism

Table summarizes current positive and negative association studies for specific genes and autism disorder or related phenotypes. Positive allelic associations are shown in bold type. Also shown are any whole genome linkage peaks that overlap with a gene tested for association, and their linkage scores. *MLS = Multipoint LOD score; †TDT = Transmission Disequilibrium Test; LD = Linkage Disequilibrium; ¶PDT = Pedigree Disequilibrium Test; §MTDT = Multiallelic TDT; **DQ = Development Quotient

| Gene | Location | Association | Phenotype | Study size and design | Reference | Linkage score | Linkage reference |
|---|---|---|---|---|---|---|---|
| DRD5 | 4p16 | No | Autistic disorder | 38 families, | et al. (2002) | | |
| DRD2 | 4q15 | No | Autistic disorder | 38 families, | et al. (2002) | | |
| HLA | 6p21 | No | Autistic disorder | 20 patients vs. 709 controls | Stubbs et al. (1980) | | |
| HLA-DR beta 1 | 6p21 | Yes | Autistic disorder | 50 patients vs. 79 controls | Warren et al. (1996) | | |
| GluR6 | 6q21 | Yes | Autistic disorder | 107 trios, TDT and 51 families, | Jamain et al. (2002) | | |
| HOXA1 | 7p15 | No | Autistic spectrum disorder (ASD) | case (n = 35) vs. control (n = 35) | et al. (2002) | | |
| DLX6 | 7q21-q22 | No | Autistic disorder | 196 families, | Nabi et al. | 2.2; 3.2 | CLSA; IMGSAC (2001a) |
| PCLO | 7q21-q22 | No | Autistic disorder | 196 families, | Nabi et al. | 2.2; 3.2 | CLSA (1999); IMGSAC (2001a) |
| PAI-1 | 7q22 | No | Autistic disorder | 167 trios, linkage and | et al. (2001) | | |
| RELN | 7q22 | Yes | ASD – with delayed phrase | 126 families | Zhang et al. (2002) | | |
| RELN | 7q23 | No | Autistic disorder | 167 families, | et al. (2002) | | |
| FOXP2 | 7q31 | No to FOXP2 gene; yes to | Specific language impairment (SLI) | 96 families, linkage and | et al. (2003) | | |
| GRM8 | 7q31 | Yes (haplotype) | Autistic disorder | 196 families, TDT | Serajee et al. (2003) | | |
| WNT2 | 7q31–33 | Yes | Autistic with severe language abnormality | 50 families | Wassink et al. (2001) | | |
| WNT2 | 7q31–33 | No | Autistic or language | 135 singleton and 82 multiplex | et al. (2002) | | |
| COPG2 | 7q32 | No | Autistic disorder | 169 families, | et al. (2002) | | |
| CPA1 | 7q32 | No | Autistic disorder | 169 families, | et al. (2002) | | |
| CPA5 | 7q32 | No | Autistic disorder | 169 families, | et al. (2002) | | |
| D7S1804 | 7q32 | Yes | Autistic spectrum | 170 multiplex families, TDT with 76 markers | | | |
| PEG1/MEST | 7q32 | No | Autistic disorder | 169 families, TDT | Bonora et al. (2002) | 2.55–3.55 | IMGSAC (1998) |
| D7S2533 | 7q33 | Yes | Autistic spectrum | 170 multiplex families, TDT with 76 markers | | | |
| EN2 | 7q36 | No | Autistic spectrum | 204 AGRE families, TDT | et al. (2003) | 3.66 | Auranen et al. |
| PENK | 8q11-q12 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| BDNF | 11p13 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| HRAS | 11p15 | Yes | Autistic disorder | case (n = 55) vs. control (n = 55) | et al. (1995) | | |
| TH | 11p15 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| NCAM | 11q22 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| | | | Autistic disorder | 115 trios, | Kim et al. | | |
| GABRA5 | 15q11-q13 | No | Autistic disorder | 226 families, | Menold et al. | | |
| GABRB3 | 15q11-q13 | Yes | Autistic disorder | 80 families, | Buxbaum et al. | | |
| GABRB3 | 15q11-q13 | No | Autistic disorder | 226 families, PDT | Menold et al. (2001) | | |
| GABRG3 | 15q11-q13 | Yes | Autistic disorder | 226 families, | Menold et al. | | |
| ATP10C | 15q11-q13 | No | Autistic disorder | 115 trios, | Kim et al. | | |
| UBE3A | 15q11–q13 | Yes | Autistic disorder | 94 multiplex families, LD | Nurmi et al. | | |
| NF1 | 17q11 | No | Autistic disorder | 204 patients vs. 200 controls | et al. (2001) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| OMGP | 17q11 | Yes | Autistic disorder (DQ** > 30) | case (n = 37) vs. control (n = 101) | et al. (2003) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| BLMH | 17q11 | Yes | Autistic disorder | 81 trios, TDT | Kim et al. (2002a) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| 5-HTT/SLC6A4 | 17q11 | Yes | Autistic disorder | 81 trios, TDT | Kim et al. (2002a) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| 5-HTT/SLC6A4 | 17q11 | No | Hyperserotoninemia in autistic patients | 134 autistic patients vs. 291 1st degree | Persico et al. (2002) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| 5-HTT/SLC6A4 | 17q11 | No | 5-HT blood levels | 96 families, TDT | Betancur et al. (2002) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| 5-HTT/SLC6A4 | 17q11 | No | Autistic disorder | 98 trios, TDT | Persico et al. (2000a) | 2.34, 2.83 | IMGSAC (2001a), Yonan et al. (2003) |
| HOXB1 | 17q21 | No | Autistic spectrum | case (n = 35) vs. control (n = 35) | et al. (2002) | | |
| PCSK2 | 20p11 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| PDYN | 20p12 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| ADA | 20q13 | Yes | Autistic disorder | 118 patients vs. 126 controls | et al. (2001) | | |
| ADA | 20q13 | No | Autistic disorder | 91 families, 44 trios, TDT and 91 patients vs. 152 | et al. (2000b) | | |
| MAO A | Xp11 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| MAO B | Xp11 | No | Autistic disorder | 38 families, TDT | Philippe et al. (2002) | | |
| GRPR | Xp22 | No | Rett syndrome | case (n = 25) vs. control (n = 100) | et al. (1998) | | |
| HOPA | Xq13 | No | Autistic disorder | 155 patients vs. 157 controls | et al. (2002) | | |
| DXS287 | Xq23 | Yes | Infantile autism | case control | Petit et al. (1996) | | |
| FMR-1 | Xq27 | No | Autistic disorder | 123 families | Klauck et al. (1997) | | |

Table 2. Summary of genomewide linkage studies for autism

Table summarizes genomewide linkage studies for autism or ASD, organized by chromosomal position and showing sample size used. Only the linkage regions with an MLS > 1.4 are shown for consistency of comparison. Linkage regions from Yonan et al. (in press), upon which the current study is based, are shown in bold. Liu et al. (2001) is not shown since the complete sample (110 families) is included and reanalyzed in Yonan et al. (in press).

*Peak position = position of the highest point/marker in Kosambi centimorgans from pter = 0.
Physical location = position of the highest point/marker as mapped onto the Human Genome Browser.
LOD score = usually MLS; however, Z denotes an NPL Z score.
83 + 69 = 83 families were used in the initial genomewide scan, and 69 families were then added to follow up 13 candidate regions.
§PSD = Phrase Speech Delay

| Top regions | Peak position* | Physical location | LOD score | References | n (families) |
|---|---|---|---|---|---|
| 1p13 | 149 cM | 113 Mb | 2.15 | Risch et al. (1999) | 90 |
| 1q23 | 164 cM | 154 Mb | 2.63 | Auranen et al. (2002) | 38 |
| 2p12 | 96 cM | 76 Mb | 1.60 | IMGSAC (2001a) | 83 + 69 |
| 2q31 | 181 cM | 175 Mb | 3.74 | IMGSAC (2001a) | 83 + 69 |
| 2q31 | 186 cM | 183 Mb | 2.39–3.32 (Z) | Buxbaum et al. (2001), PSD§ | |
| 3p25 | 36 cM | 11 Mb | 1.51 | Shao et al. (2002) | 99 |
| 3q26 | 191 cM | 180 Mb | 4.81 | Auranen et al. (2002) | 38 |
| 4p16 | 4.6 cM | 3.5 Mb | 1.55 | IMGSAC (1998) | 99 |
| 4q21 | 94 cM | 85 Mb | 1.72 | Yonan et al. (in press) | 345 |
| 5p13 | 58 cM | 40 Mb | 2.54 | Yonan et al. (in press) | 345 |
| 6q13 | 83 cM | 70 Mb | 2.23 | Philippe et al. (1999) | 51 |
| 7q21 | 104 cM | 91 Mb | 2.20 | CLSA (1999) | 75 |
| 7q22 | 112 cM | 100 Mb | 3.20 | IMGSAC (2001a) | 83 + 69 |
| 7q32 | 142 cM | 128 Mb | 2.55–3.55 | IMGSAC (1998) | 99 |
| 7q36 | 170 cM | 153 Mb | 3.66 | Auranen et al. (2002) | 38 |
| 8q24 | 132 cM | 125 Mb | 1.50 | Yonan et al. (in press) | 345 |
| 11p13 | 46 cM | 34 Mb | 2.24 | Yonan et al. (in press) | 345 |
| 13q12 | 21 cM | 30 Mb | 2.30 | CLSA (1999) | 75 |
| 13q22 | 55 cM | 73 Mb | 3.40 | CLSA (1999) | 75 |
| 16p13 | 19 cM | 10 Mb | 1.51–1.97 | IMGSAC (1998) | 99 |
| 16p13 | 25 cM | 12 Mb | 2.93 | IMGSAC (2001a) | 83 + 69 |
| 17q11 | 50 cM | 28 Mb | 2.34 | IMGSAC (2001a) | 83 + 69 |
| 17q11 | 52 cM | 29 Mb | 2.83 | Yonan et al. (in press) | 345 |
| Xq21 | 63 cM | 94 Mb | 2.54 | Shao et al. (2002) | 99 |

Manual search strategy

We compiled a comprehensive list of genes (known and predicted from transcripts) in our five most significant regions using the Celera Discovery System and the NCBI Human Genome Project (UCSC Genome Browser; version 24 (hg15), April 2003 Freeze) databases. This exhaustive gene list was created by performing database queries against the UCSC Human Genome Browser's annotation database. The table definitions and data of two MySQL tables, refGene and refLink, were downloaded from the public FTP site at UCSC (http://www. 2003/database/) and recreated locally. Genes that mapped to the corresponding intervals in the Celera map were downloaded manually. All genes located within the physical boundaries defined by the 1-LOD unit support intervals on each chromosome were then extracted; the complete list of these 383 genes is available as supplementary material accompanying this paper (see Supplementary material section). This list was then further evaluated using several online databases. The Celera database annotates category and family for each gene using the Panther Protein Function. The Human Genome Project provides a gene ‘index’, a set of links to multiple annotation databases, for each RefSeq transcript, including the Online Mendelian Inheritance in Man (OMIM), LocusLink, PubMed, GeneLynx, GeneCards and AceView databases. A short list of ‘neural-related’ genes was identified based upon evidence of their involvement in neuronal development/control, neurotransmitter function, transcription regulation and similar functions that made them logical disease-related candidates for the autism spectrum disorders.
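The extraction of genes lying within each support interval might look like the following sketch. It assumes that a transcript counts as ‘within’ an interval when it overlaps it at all, which the text does not specify; the record fields are loosely modeled on transcript start and end coordinates, and every name and coordinate here is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Transcript:
    symbol: str    # gene symbol (hypothetical placeholder names below)
    chrom: str
    tx_start: int  # transcript start, in base pairs
    tx_end: int    # transcript end, in base pairs


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int
    end: int


def genes_in_intervals(
    transcripts: Iterable[Transcript], intervals: Iterable[Interval]
) -> List[str]:
    """Return sorted, de-duplicated symbols of transcripts overlapping any interval."""
    intervals = list(intervals)
    hits = set()
    for tx in transcripts:
        for iv in intervals:
            # Assumption: any overlap counts, not only full containment.
            if tx.chrom == iv.chrom and tx.tx_start <= iv.end and tx.tx_end >= iv.start:
                hits.add(tx.symbol)
                break
    return sorted(hits)


# Illustrative coordinates only; not the boundaries used in the study.
regions = [Interval("chr17", 25_000_000, 33_000_000), Interval("chr5", 35_000_000, 45_000_000)]
records = [
    Transcript("GENE_A", "chr17", 27_000_000, 27_050_000),
    Transcript("GENE_B", "chr17", 40_000_000, 40_020_000),
    Transcript("GENE_C", "chr5", 36_000_000, 36_010_000),
    Transcript("GENE_D", "chr17", 24_990_000, 25_010_000),
]
print(genes_in_intervals(records, regions))
```

Collapsing to gene symbols means that several RefSeq transcripts of the same gene contribute a single entry to the candidate list.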

Gene ontology methods

Gene Ontology (GO) is a controlled vocabulary designed to describe key aspects of the molecular function, biological process and cellular component of gene products (Bard 2003). Using the complete list of all 383 positional candidate genes (see above) we screened genes for neural-related GO terms in an effort to identify likely candidates for ASD. Screening was performed with the program pathwayassist (version 1.1, Stratagene Corp, La Jolla, CA) and the FatiGO website.
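A GO-based screen of this kind can be sketched as a keyword filter over each gene's annotation terms. This is only an assumed approximation of what the screening tools do; the keyword list, gene names and annotations are illustrative placeholders.

```python
from typing import Dict, List, Sequence, Set

# Illustrative keyword stems; the terms actually used in the screen are not listed in the text.
NEURAL_KEYWORDS = ("neuro", "synap", "axon", "nervous system")


def screen_by_go_terms(
    annotations: Dict[str, Set[str]], keywords: Sequence[str] = NEURAL_KEYWORDS
) -> List[str]:
    """Return genes with at least one GO term containing any keyword stem."""
    hits = []
    for gene, terms in annotations.items():
        if any(k in term.lower() for term in terms for k in keywords):
            hits.append(gene)
    return sorted(hits)


# Hypothetical annotations for placeholder genes.
toy_annotations = {
    "GENE_A": {"synaptic transmission", "ion transport"},
    "GENE_B": {"lipid metabolism"},
    "GENE_C": {"Neurogenesis"},
}
print(screen_by_go_terms(toy_annotations))
```

Substring matching is deliberately permissive, so hits from such a screen would still need the kind of manual inspection described in the text.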

Pathwayassist and ResNet database

The pathwayassist software (Ariadne Genomics, Rockville, MD) allows the user to explore gene interaction networks represented in the ResNet™ database. ResNet™ is a comprehensive database of molecular networks compiled by proprietary natural language processing techniques applied to the whole PubMed database. The database contains more than 100 000 events of regulation, interaction and modification between 15 000 proteins, cell processes and small molecules. The architecture of ResNet and pathwayassist has been described previously. pathwayassist provides a ‘front end’ that allows the user to query the database, and to direct the construction of specific networks relative to genes of interest.

The complete list of all 383 positional candidate genes was loaded into pathwayassist. Of those genes, 203 were recognized by the software, and were thus subjected to subsequent analysis. The ‘Expand Pathway’ feature of pathwayassist was used to build a network of connections starting with these 203 genes and including all available categories of interaction. This expanded list was then searched to find genes that interacted with neural-related positional candidate genes in the following manner. The genes in the expanded set that had interesting GO terms were identified, and then their interacting ‘neighbors’ were selected using the ‘Select Neighbors’ command. Set operations were used to reduce the list to only those genes that were among the original list of 203 positional candidate genes. Nine genes not found in the manual search described above were identified in this manner for further evaluation. Of these, four appeared to be logical candidates, and to have been correctly identified by pathwayassist as having valid interactions (Method 4, in Table 3) after manual inspection.
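The set operations described above can be sketched as follows. This is an assumed reading of the procedure rather than the software's internal logic: the neighbors of neural-annotated genes are collected, restricted to the recognized positional candidates, and then stripped of genes already found by the manual search. All gene names are placeholders.

```python
from typing import Dict, Set


def pathway_only_candidates(
    recognized: Set[str],
    neural_like: Set[str],
    neighbors: Dict[str, Set[str]],
    manual_hits: Set[str],
) -> Set[str]:
    """Positional candidates that neighbor a neural-annotated gene
    and were not already found by the manual search."""
    neighbor_set: Set[str] = set()
    for gene in neural_like:
        neighbor_set |= neighbors.get(gene, set())
    return (neighbor_set & recognized) - manual_hits


# Hypothetical toy network.
recognized = {"G1", "G2", "G3", "G4", "G5"}
neural_like = {"N1", "G1"}
neighbors = {"N1": {"G2", "G3", "Q9"}, "G1": {"G4"}}
manual_hits = {"G3"}
print(sorted(pathway_only_candidates(recognized, neural_like, neighbors, manual_hits)))
```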

Table 3. Semi-automated search for candidate genes

Table shows all candidate genes within our linkage regions that were found by different search strategies.

Method:
1 = Manual search of biological databases
2 = Gene Ontology (GO) query
3 = Positive association study (Table 1)
4 = pathwayassist interaction with neural-related genes (see Materials and methods)
5 = pathwayassist predicted pathway candidates
6 = geneways predicted pathway candidates

| Gene name | Full name | Chromosome | Method |
|---|---|---|---|
| ACCN1 | Neuronal amiloride-sensitive cation channel 1 | 17q | 1, 2 |
| BLMH | Bleomycin hydrolase | 17q | 1, 3 |
| CENTA2 | Centaurin-alpha 2 protein | 17q | 1, 2 |
| GIT1 | G protein-coupled receptor kinase-interactor 1 | 17q | 1, 2, 6 |
| NF1 | Neurofibromin | 17q | 1, 6 |
| OMG | Oligodendrocyte myelin glycoprotein | 17q | 1, 3 |
| SLC6A4 | Solute carrier family 6 (serotonin transporter) | 17q | 1, 2, 3, 5, 6 |
| TIAF1 | TGFB1-induced antiapoptotic factor 1 isoform 1 | 17q | 2 |
| TNFAIP1 | Tumor necrosis factor, alpha-induced protein 1 | 17q | 2 |
| TRAF4 | TNF receptor-associated factor 4 isoform 1 | 17q | 2 |
| CARD6 | Caspase recruitment domain family, member 6 | 5p | 2 |
| CCL28 | Small inducible cytokine A28 precursor | 5p | 2 |
| GDNF | Glial cell derived neurotrophic factor | 5p | 1, 2, 5, 6 |
| GHR | Growth hormone receptor | 5p | 1, 6 |
| IL6ST | Interleukin 6 signal transducer | 5p | 4 |
| IL7R | Interleukin 7 receptor precursor | 5p | 1, 2 |
| ITGA2 | Integrin alpha 2 precursor | 5p | 2 |
| LIFR | Leukemia inhibitory factor receptor | 5p | 4, 6 |
| FYB | FYN binding protein | 5p | 6 |
| PRLR | Prolactin receptor | 5p | 1, 2 |
| Nup155 | Nucleoporin 155 kDa | 5p | 1 |
| SLC1A3 | Solute carrier family 1, member 3 (glutamate transporter) | 5p | 1, 2 |
| DAB2 | Disabled homolog 2, mitogen-responsive phosphoprotein | 5p | 5 |
| API5 | Apoptosis inhibitor 5 | 11p | 1, 2 |
| CAT | Catalase | 11p | 1, 2 |
| CHRM4 | Cholinergic receptor, muscarinic 4 | 11p | 2 |
| ELF5 | E74-like factor 5 (ets domain transcription) | 11p | 1, 2 |
| MC7 | Transcription factor in neuroblasts and developing neurons | 11p | 1 |
| MDK | Midkine (neurite growth-promoting factor 2) | 11p | 2 |
| MAPK8IP1 | Mitogen-activated protein kinase 8 interacting protein 1 | 11p | 5 |
| CD44 | CD44 antigen | 11p | 6 |
| SLC1A2 | Solute carrier family 1, member 2 (glutamate transporter) | 11p | 1, 2 |
| TRAF6 | TNF receptor-associated factor 6 | 11p | 1, 2, 6 |
| ATOH1 | Atonal homolog 1 | 4q | 1, 2 |
| BIKE | BMP-2 inducible kinase | 4q | 1 |
| CDS1 | Phosphatidate cytidylyltransferase 1 | 4q | 1 |
| CNOT6L | CCR4-NOT transcription complex, subunit 6-like | 4q | 1 |
| CXCL1 | Chemokine (C-X-C motif) ligand 1 | 4q | 2 |
| EIF4E | Eukaryotic translation initiation factor 4E | 4q | 4, 5, 6 |
| FGF5 | Fibroblast growth factor 5 isoform 1 precursor | 4q | 2 |
| GRID2 | Glutamate receptor, ionotropic, delta 2 | 4q | 1, 2 |
| PTPN13 | Protein tyrosine phosphatase, non-receptor type 13 | 4q | 5 |
| IL8 | Interleukin 8 precursor | 4q | 4 |
| NFKB1 | Nuclear factor of kappa light polypeptide gene | 4q | 1, 2 |
| NK16-1 | NK6 transcription factor related, locus 1 | 4q | 1, 2 |
| NUP54 | Nucleoporin 54 kDa | 4q | 1, 2 |
| SHRML | Shroom-related protein | 4q | 1 |
| SNCA | Alpha-synuclein isoform NACP140 | 4q | 1, 2 |
| SPBP | DNA-binding protein amplifying expression of | 4q | 1, 2 |
| RAP1GDS1 | RAP1, GTP-GDP dissociation stimulator 1 | 4q | 5 |
| PKD2 | Polycystic kidney disease 2 | 4q | 6 |
| TACR3 | Tachykinin receptor 3 | 4q | 1 |
| UNC5C | Unc-5 homolog C | 4q | 2 |
| MTBP | Mdm2, transformed 3T3 cell double minute 2, p53 binding protein | 8q | 5 |
| TAF2 | TBP-associated factor 2 | 8q | 5 |
| ZHX1 | Zinc-fingers and homeoboxes 1 | 8q | 1, 2 |

pathwayassist was also used to search for pathway relationships beginning with the 13 genes that have been reported to be positively associated with autism in at least one previous study (Table 1). The pathwayassist ‘Build Pathway’ function was used to search for pathways beginning with these genes. Next, the pathway was expanded to examine the connections to any of the positional candidate genes of the current study. As before, 203 of the positional candidates were recognized by the program and used in this analysis, only a few of which showed connections to this pathway (Method 5 in Table 3). Interactions among the 203 positional candidates were excluded from the analysis, as these interactions were unrelated to our hypothesis.
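The filtering rule in this step, keeping links between associated genes and positional candidates while discarding candidate-to-candidate interactions, can be sketched as below. The sketch assumes an undirected edge list and considers only direct links; the edge data and gene names are hypothetical, and the function is not part of pathwayassist.

```python
from typing import Iterable, Set, Tuple


def seed_to_candidate_links(
    edges: Iterable[Tuple[str, str]], seeds: Set[str], candidates: Set[str]
) -> Set[Tuple[str, str]]:
    """Return (seed, candidate) pairs joined by a direct interaction.

    Edges joining two positional candidates, or touching neither group,
    are ignored.
    """
    links = set()
    for a, b in edges:
        if a in seeds and b in candidates:
            links.add((a, b))
        elif b in seeds and a in candidates:
            links.add((b, a))
    return links


# Hypothetical interactions.
edges = [
    ("SEED_1", "CAND_A"),
    ("CAND_A", "CAND_B"),  # candidate-to-candidate: excluded
    ("CAND_C", "SEED_2"),
    ("SEED_1", "OTHER"),   # partner is not a positional candidate
]
seeds = {"SEED_1", "SEED_2"}
candidates = {"CAND_A", "CAND_B", "CAND_C"}
print(sorted(seed_to_candidate_links(edges, seeds, candidates)))
```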

Geneways pathway prediction system

geneways is a program that uses a natural language processing algorithm to extract relationships between molecules or molecular processes by digesting published research literature and building these relationships into pathways (Rzhetsky et al. 2000). Electronic copies of the full text of research articles are downloaded to a local database where biologically important concepts such as names of genes, proteins, processes, small molecules and diseases are extracted from the text (Krauthammer et al. 2000) and disambiguated with respect to the many synonyms, homonyms and other ambiguities that affect such terms (Hatzivassiloglou et al. 2001). An associated program, genies, is a natural language processing parser (Friedman et al. 2001). The output of genies is represented with semantic trees. A separate module unwinds these complex output trees into simple binary statements that are saved into the geneways knowledge base. The geneways system extracts some percentage of incorrect, redundant or contradictory statements that continue to pose bioinformatic challenges (Krauthammer et al. 2002), and currently requires manual curation and annotation. The user can conveniently request information about each interaction and retrieve the complete articles from which the information was extracted.

The pathway built with geneways was based on two sets of genes. The first consisted of about 20 genes that had been previously identified in the literature as playing a role in autism, either from positive association findings (Table 1), known chromosomal abnormalities or similar methods. The second list was the complete list of 383 positional candidate genes. geneways was then used to identify connections between these two groups of genes and to observe how those potential candidates might interact with each other and with other pathways. Currently, it is only possible to examine the geneways database by building a pathway out from a single gene, rather than having an exhaustive algorithm systematically identify all possible interactions. geneways was used to identify and visualize all the meaningful connections from the first list of known autism candidates to any information stored in the database. Several of the identified genes in this pathway were located within our linkage regions. Next, additional positional candidate genes were tested to see if they were connected with the same pathway (Method 6 in Table 3). We added an additional 30 positional candidates that we deemed most likely to contribute to ASD. These were the genes that, on the basis of the manual search, appeared most plausibly involved in ASD phenotypes. Of the 30 genes that we examined, only six had direct connections to other genes in the pathway. Only those 30 candidates were examined using this strategy because our experience with this software suggests that it is important to limit the number of genes examined in order to produce an informative pathway that provides testable connections rather than an exhaustive but unwieldy pathway. Each arrow in Fig. 2 represents either a physical or a logical interaction. Logical connections may represent multistep processes that include intermediaries not shown in the diagrams.
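Building a pathway outward from a single gene, and then checking which added candidates connect directly to it, can be sketched as a bounded breadth-first expansion. This is an illustration under stated assumptions and not the geneways algorithm: the graph is an undirected adjacency map, the step limit is an invented parameter, and all names are placeholders.

```python
from collections import deque
from typing import Dict, Set


def expand_from_gene(graph: Dict[str, Set[str]], start: str, max_steps: int) -> Set[str]:
    """Collect every gene reachable from `start` within `max_steps` interactions."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        gene, depth = queue.popleft()
        if depth == max_steps:
            continue
        for nxt in graph.get(gene, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def directly_connected(
    graph: Dict[str, Set[str]], pathway: Set[str], candidates: Set[str]
) -> Set[str]:
    """Candidates with at least one direct link to another pathway member."""
    return {c for c in candidates if graph.get(c, set()) & (pathway - {c})}


# Hypothetical interaction graph.
graph = {
    "SEED": {"X"},
    "X": {"SEED", "Y", "CAND_1"},
    "Y": {"X"},
    "CAND_1": {"X"},
    "CAND_2": {"Z"},
    "Z": {"CAND_2"},
}
pathway = expand_from_gene(graph, "SEED", max_steps=2)
print(sorted(pathway))
print(sorted(directly_connected(graph, pathway, {"CAND_1", "CAND_2"})))
```

Keeping the step limit small mirrors the point made above: a tightly bounded pathway stays readable and testable, whereas an exhaustive expansion quickly becomes unwieldy.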

Transcription microarray meta-analysis

Whole genome gene expression arrays were used to identify possible functional relationships by searching for genes that are coexpressed with key autism candidate genes and positional candidate genes, based on mRNA expression microarray data. To increase the reliability of coexpression detection, only patterns of coexpression that were consistent in multiple data sets were used, since a coexpression relationship that is found in two or more independent studies is less likely to be an artifact. Because we did not have access to sufficient quantities of high-quality human brain gene expression data, we analyzed the homologs of our candidate genes in a set of seven independently collected mouse brain gene expression data sets. Of the 383 candidate genes, 170 had known mouse homologs, many of which are curated orthologs, which were then used for further analysis.

Of the seven mouse brain gene expression data sets used for Transcription Microarray Meta-Analysis, five were from unpublished in-house data and two were from published data sets (Sandberg et al. 2000; Zhao et al. 2001). Except for the dataset of Sandberg, which included data from six brain regions, all samples were from the hippocampus. Zhao et al. compared the subfields of the hippocampus. The additional data sets from our group are currently unpublished and consist primarily of test-control studies, with between 8 and 24 microarrays per data set, distributed as biological replicates of each condition. The conditions studied in each of these data sets were as follows: Young vs. old mice (M. Verbitsky, A.L. Yonan, G. Malleret, E.R. Kandel, T.C. Gilliam & P. Pavlidis, submitted); protein kinase C-gamma knockout vs. control mice; mice expressing a dominant negative protein kinase A regulatory subunit (R(AB); Abel et al. 1997) vs. control; a separate experiment using R(AB) and control animals to examine the effects of context-cued fear conditioning; and an analysis of mice expressing a dominant-negative inhibitor of CCAAT/enhancer-binding protein-family member transcription factors, compared to controls (Chen et al. 2003). Each data set was filtered to remove genes clearly lacking detectable expression, removing 30% of genes with the smallest maximal expression in each data set. Each gene was then analyzed to identify genes it was coexpressed with. For each gene, the Pearson correlation coefficient of all pairs of gene expression profiles in the data set was calculated. A P-value was calculated for the Pearson correlation assuming the null distribution follows a t-distribution (Zar 1999). P-values for each correlation were Bonferroni corrected, and genes with corrected P-values < 0.01 were considered coexpressed with the query gene. We note that this method does not make use of the experimental grouping of the samples (e.g., young vs. old), and thus genes which are coexpressed do not necessarily (indeed, typically do not) have expression patterns that are ‘relevant’ to the originally defined experimental groups. Pairs of genes that meet the criteria for coexpression were entered in a database. From the seven data sets, for all genes examined by the microarrays (∼10 000), we extracted ∼200 000 gene pairs (< 0.1% of all possible pairs). We then screened this database for pairs involving a positional candidate gene homolog that was identified in at least two of the seven data sets. We also attempted to identify genes that were coexpressed with the 13 genes implicated by positive findings from association studies (Table 1). However, we were unable to identify any genes in our linkage regions that were coexpressed with these genes (data not shown).
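The coexpression screen described above can be sketched end to end in a few functions. Several choices here are assumptions rather than details taken from the analysis: the Bonferroni factor is taken to be the number of pairs tested within a data set, the t-distribution tail is obtained by simple numerical integration to keep the example dependency-free (a statistics library would normally be used), and all profile values are invented.

```python
import math
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Profiles = Dict[str, List[float]]
Pair = Tuple[str, str]


def drop_low_expression(profiles: Profiles, fraction: float = 0.3) -> Profiles:
    """Remove the given fraction of genes with the smallest maximal expression."""
    ranked = sorted(profiles, key=lambda g: max(profiles[g]))
    return {g: profiles[g] for g in ranked[int(len(ranked) * fraction):]}


def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def t_two_sided_p(t: float, df: int, steps: int = 20000) -> float:
    """Two-sided Student t tail probability by midpoint integration.

    Uses the substitution s = |t| + u / (1 - u) to map the infinite tail onto [0, 1).
    """
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    h, total = 1.0 / steps, 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        s = abs(t) + u / (1 - u)
        total += math.exp(log_c - (df + 1) / 2 * math.log1p(s * s / df)) / (1 - u) ** 2
    return min(1.0, 2 * total * h)


def correlation_p(r: float, n: int) -> float:
    """P-value for a Pearson r from n samples via t = r * sqrt((n - 2) / (1 - r^2))."""
    if abs(r) >= 1.0:
        return 0.0
    return t_two_sided_p(r * math.sqrt((n - 2) / (1 - r * r)), n - 2)


def coexpressed_pairs(profiles: Profiles, alpha: float = 0.01) -> Set[Pair]:
    kept = drop_low_expression(profiles)
    pairs = list(combinations(sorted(kept), 2))
    hits = set()
    for a, b in pairs:
        p = correlation_p(pearson(kept[a], kept[b]), len(kept[a]))
        if min(1.0, p * len(pairs)) < alpha:  # Bonferroni over pairs (assumed)
            hits.add((a, b))
    return hits


def replicated_pairs(per_dataset: Iterable[Set[Pair]], min_sets: int = 2) -> Set[Pair]:
    """Keep pairs called coexpressed in at least `min_sets` data sets."""
    counts: Dict[Pair, int] = {}
    for hits in per_dataset:
        for pair in hits:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, c in counts.items() if c >= min_sets}


# Hypothetical expression profiles across eight arrays.
toy = {
    "g1": [1, 2, 3, 4, 5, 6, 7, 8],
    "g2": [3, 5, 7, 9, 11, 13, 15, 17],
    "g3": [5, 1, 4, 2, 5, 1, 4, 2],
    "low": [0.1] * 8,
}
print(coexpressed_pairs(toy))
```

As in the text, nothing in this sketch uses the experimental grouping of the samples; only the shape of each expression profile across arrays matters.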


Results

Table 1 summarizes results from studies that have sought to detect allelic association between candidate genes and autism or autism-related phenotypes. A total of 13 genes and three markers spanning 10 distinct cytogenetic regions purportedly show positive evidence for allelic association to autism. Of these 10 regions, only 17q11 is concordant with the linkage regions identified by Yonan et al. (in press) (Fig. 1).

Table 2 summarizes the results from nine genomewide linkage studies for autism and ASD. Interpretation of genetic linkage to common heritable disorders is fraught with uncertainty and cross-study comparisons are not straightforward (Altmuller et al. 2001). All other factors being equal, larger studies are less prone to both false positive and false negative errors. We therefore focused on the five strongest linkage signals from the large Yonan et al. study rather than, for example, choosing the five strongest linkage signals across all nine genomewide scans, or the five regions most supported by independent studies. As shown in Table 2, the Yonan et al. study (345 multiplex families) is more than three times the size of other reported genomewide studies. When comparing the results from Yonan et al. (in press) with those of other published studies in which the evidence for linkage exceeded an MLS of 1.4 (P < 0.01; Nyholt 2000), overlap was identified on 17q (IMGSAC 2001a). The five putative ASD linkage regions selected for study are indicated in Fig. 1 (also shown in bold in Table 2).

Semi-automated search for ASD candidate genes

In a first attempt to parse positional candidate genes, we used public and commercial biological databases, together with Gene Ontology formalisms (see Materials and methods) to predict a subset of ‘neural related’ genes of potential relevance to ASD (Table 3). Candidates were selected from the 383 positional candidate genes based upon information gathered by manual search of the public UCSC Human Genome Browser and the proprietary Celera Discovery System together with their related links (Method 1, Table 3). A further search using neural-related GO terms (see Materials and methods) identified 11 additional genes (TIAF1, TNFAIP1, TRAF4, CARD6, CCL28, ITGA2, CHRM4, MDK, CXCL1, FGF5, UNC5C) not already identified by the manual search (Method 2, Table 3). Finally, an additional four candidate genes (IL6ST, LIFR, EIF4E, IL8) were identified using the pathwayassist computational software based upon their predicted network association with neural-related pathway genes (Method 4, Table 3; see Materials and methods).
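The GO-based search (Method 2) can be pictured as selecting every gene annotated with a chosen neural-related term or with any term that descends from it. The following Python sketch illustrates this idea under stated assumptions: the term names, the parent-to-child hierarchy, the gene labels (GENE_A and so on) and the function names are all hypothetical and do not reproduce the actual GO terms or annotations used in this study.

```python
def descendants(term, children):
    """Return `term` together with every term reachable through child links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen

def genes_under_terms(query_terms, children, annotations):
    """Genes annotated with any query term or with one of its descendants."""
    wanted = set()
    for term in query_terms:
        wanted |= descendants(term, children)
    return sorted(gene for gene, terms in annotations.items()
                  if wanted & set(terms))

# Hypothetical hierarchy (parent -> child terms), for illustration only.
children = {
    "neural process": ["synaptic transmission", "neuron development"],
    "neuron development": ["axon guidance"],
}
# Hypothetical gene annotations, for illustration only.
annotations = {
    "GENE_A": ["axon guidance"],
    "GENE_B": ["lipid metabolism"],
    "GENE_C": ["synaptic transmission", "ion transport"],
}
neural_genes = genes_under_terms(["neural process"], children, annotations)
```

Here GENE_A is retrieved even though it is annotated only with a grandchild of the query term, which is how a hierarchy-aware search can recover genes that a keyword-based manual search overlooks.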

Computational pathway prediction methods

In the present paper, we have attempted to leverage what little information is available about the genes that may contribute to autism in order to identify additional candidate genes for autism based on the results from our genomewide linkage study. Our hypothesis was that by constructing pathways between the genes already suspected to be involved in autism and our positional candidate genes, we could identify a subset of those positional candidates more likely to be involved in autism.
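The underlying idea, connecting previously implicated genes to positional candidates through chains of reported interactions, can be sketched as a shortest-path search over an interaction graph. This Python sketch is only an illustration and does not describe the internal algorithms of pathwayassist or geneways. The gene labels, the interaction list, the treatment of interactions as undirected, and the max_links cutoff are all assumptions.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search for the shortest chain of interactions."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # no chain of interactions connects the two genes

def connect(seeds, candidates, interactions, max_links=3):
    """Map (seed, candidate) pairs to a connecting path of <= max_links edges."""
    graph = {}
    for a, b in interactions:  # interactions are treated as undirected here
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    links = {}
    for seed in seeds:
        for candidate in candidates:
            path = shortest_path(graph, seed, candidate)
            if path is not None and len(path) - 1 <= max_links:
                links[(seed, candidate)] = path
    return links

# Hypothetical literature-derived interactions, for illustration only.
interactions = [("SEED_1", "X"), ("X", "CAND_1"), ("SEED_1", "Y"), ("CAND_2", "Z")]
links = connect(["SEED_1"], ["CAND_1", "CAND_2"], interactions)
```

In this toy graph only CAND_1 is linked to the seed gene (through the intermediate X), so it alone would be carried forward for manual inspection of the supporting text.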

geneways' predictions regarding the connections between several of the positional candidate genes and a short list of genes suspected to be involved in autism (including both genes positively associated with autism and biological inferences) are shown in Fig. 2. Interactions among three of the genes positively associated with autism (GLUR6, HRAS1 and SLC6A4; shown as circles with red letters) together with connecting pathway genes (blue circles), molecules (red triangles) and processes (yellow rectangle), and 10 positional candidate genes (brown circles) were discovered (Fig. 2; Method 6, Table 3). When using the geneways program, each connecting line is a ‘clickable’ link that displays the underlying text that supports the interaction.

Figure 2.

GeneWays pathway showing the interrelationships of several positional candidate genes. This pathway was based on previously identified candidate genes of autism disorder and then built out to show how some of our positional candidate genes may interact. Small molecules are shown as pink triangles, processes are shown as yellow boxes, positional candidate genes found from genomewide linkage study for ASD are shown in brown circles, and all other genes that connect the pathway are shown in turquoise circles. The three gene names that are shown in red text are genes that have been identified as positively associated with autism (Table 1). Note that 5-HTT (serotonin transporter) is the same gene as SLC6A4.

Gene networks illustrated in Fig. 3 were developed using a conceptually similar strategy, using pathwayassist instead of geneways. The pathwayassist ‘Build Pathway’ function found valid connections (as determined by manual inspection) between 2 of the 13 genes that have been positively associated with autism (GLUR6 and UBE3A; Table 1) and a subset of the positional candidate genes. Positional candidates that were found to have valid connections to this pathway are shown as Method 5, Table 3.

Figure 3.

A pathway built using pathwayassist between genes positively identified in association studies for autism and 203 of the 383 positional candidates. Two of the 13 such positively associated genes (ovals with yellow centers) were found to interact with positional candidate genes (ovals with green centers) via pathwayassist. The subset of interactions shown here was chosen as being relevant to the pathway originally built out from the positively associated genes.

Co-expression data, transcription microarray meta-analysis

We analyzed patterns of whole genome gene expression across multiple microarray data sets to identify possible gene regulatory interactions between the selected set of autism candidate genes and a subset of positional candidate genes. Of the 383 candidate genes analyzed, murine homologs for 170 genes were identified, which we then used to query seven independent mouse brain expression data sets. No reliable coexpression patterns were detected among the 13 positively associated autism candidates and the subset of 170 positional candidates. However, 10 of the 170 positional candidates showed highly reliable coexpression with one or more genes that were detected in multiple gene expression data sets (Table 4). A total of 107 genes were coexpressed with the set of 10 query genes. Based on their functions and annotations, we determined that a subset of these 107 genes showed potential relevance to neurodevelopmental disorders (Table 4).
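The replication requirement used here, that a coexpressed pair involve a positional candidate and recur in at least two data sets, can be sketched in Python as follows. This is a hedged illustration and not the study's database query. The gene labels, the example pair lists and the function name are hypothetical.

```python
from collections import Counter

def replicated_pairs(datasets, candidates, min_datasets=2):
    """Count candidate-containing pairs seen in >= min_datasets data sets."""
    counts = Counter()
    for pairs in datasets:
        # frozenset treats (a, b) and (b, a) as the same pair; the enclosing
        # set ensures each pair is counted at most once per data set.
        for pair in {frozenset(p) for p in pairs}:
            if len(pair) == 2 and pair & candidates:
                counts[pair] += 1
    return {tuple(sorted(pair)): n
            for pair, n in counts.items() if n >= min_datasets}

# Hypothetical coexpressed pairs from three data sets, for illustration only.
datasets = [
    [("CAND_1", "G1"), ("G2", "G3")],
    [("G1", "CAND_1"), ("CAND_2", "G4")],
    [("G2", "G3"), ("CAND_2", "G5")],
]
candidates = {"CAND_1", "CAND_2"}
reliable = replicated_pairs(datasets, candidates)
```

In this example the pair G2 and G3 recurs in two data sets but involves no positional candidate, and CAND_2 never pairs with the same partner twice, so only the pair CAND_1 and G1 is retained.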

Table 4.  Genes co-expressed with positional candidates based on gene expression data from mouse brain

| Index gene | Description | Accession ID | Chr | BP position | Mouse homologue | Total # of matches | Likely candidates |
|---|---|---|---|---|---|---|---|
| HNRPDL | heterogeneous nuclear ribonucleoprotein D-like | NM_005463 | 4 | 83737143 | Mm.195310 | 17 | piccolo (presynaptic cytomatrix protein)*; matrin 3 |
| PPP3CA | protein phosphatase 3 (formerly 2B), catalytic | NM_000944 | 4 | 102337365 | Mm.293 | 4 | Mm.6150 (Highly similar to HAPP_RAT |
| PKD2 | polycystin 2 | NM_000297 | 4 | 89321599 | Mm.6442 | 2 | |
| PELO | CGI-17 protein | NM_015946 | 5 | 52066463 | Mm.3241 | 2 | glutamine synthase |
| NDUFS4 | NADH dehydrogenase (ubiquinone) Fe-S protein 4 | NM_002495 | 5 | 52827009 | Mm.14442 | 1 | potassium voltage-gated channel, Shal-related family, member 2 |
| PRLR | prolactin receptor | NM_000949 | 5 | 35064208 | Mm.2752 | 1 | ectonucleotide phosphodiesterase 2 |
| ZHX1 | zinc-fingers and homeoboxes 1 | NM_007222 | 8 | 123929781 | Mm.37216 | 25 | aquaporin 4; quaking; cerebellar postnatal development protein 1 |
| ENPP2 | ectonucleotide pyrophosphatase/phosphodiesterase | NM_006209 | 8 | 120238123 | Mm.28107 | 14 | prolactin receptor; calmodulin-like 4; SLC4A2 |
| ALDOC | aldolase C, fructose-bisphosphate | NM_005165 | 17 | 26752009 | Mm.7729 | 40 | Calmodulin; neurochondrin-1; thyroid hormone receptor alpha; protein; procholecystokinin hippocampal amyloid precursor (CCK) |
| JJAZ1 | joined to JAZF1 | NM_015355 | 17 | 30113956 | Mm.21964 | 1 | |

  • Genes that are located within the 1-LOD support interval of our QTL regions (Index Genes) and that belong to classes of coexpressed genes. First, the mouse homologue of each index gene was identified (when available). In the absence of appropriate human gene expression data, we utilized 7 independently collected sets of mouse brain gene expression data, consisting of 8–24 microarrays each, to develop classes of coexpressed genes. We identified genes that were reproducibly coexpressed (in two or more of the data sets) with the mouse homologue of the index gene. When an index gene belonged to a functional expression class, the other genes in that class were identified (total # of matches), and the likely candidates from that expression class identified. Candidate genes so identified may be downstream targets of a transcriptional activation pathway common to the index gene and the candidate, with the index gene acting either as a transcription factor (for example, zinc-fingers and homeoboxes 1), or as the modulator of a transcription factor.

  • * Same gene as PCLO in Table 1 (Nabi et al. 2003).

  • These genes are found as both index genes and coexpressed candidates.

  • Genes also identified in Table 3.


In this study we have sought to apply emerging bioinformatic tools to a problem that characterizes nearly all gene-mapping studies that target common, heritable disorders. Common heritable disorders are characteristically multigenic and heterogeneous in nature. Consequently, linkage peaks tend to be broad and weakly significant such that subsequent positional mapping and gene identification is greatly complicated. In a minority of cases, follow-up allelic association analysis has apparently been used successfully to delimit the disease gene region and to identify the disease related genetic variation (Horikawa et al. 2000; Ogura et al. 2001). The recent sequencing of the human genome, along with the genomes of other well-researched organisms, now makes identification of positionally mapped genes a straightforward bioinformatic exercise. However, knowledge of which genes reside within an interval alone does not significantly change the complexity of gene mapping.

Positional mapping poses unique challenges that are well suited for computational data-mining approaches. Peak linkage findings demarcate chromosomal regions most likely to harbor disease-related genetic variation, yet positional candidate genes pose unique bioinformatic problems: some portion of peak regions will be false positives and harbor no disease related genes, some peaks that do harbor disease related genetic variation will consist of only one disease-related gene among other genes that bear no relationship to the disease, other peaks might obtain their prominence due to the contribution of more than one disease-related gene, and some portion of disease related genes will likely reside outside the identified peak regions.

In addition to positional candidate genes, other types of genetic evidence are typically used to identify common disease causing alleles. Allelic association, or linkage disequilibrium, is used to detect historical association between a candidate gene variant and disease phenotype. Association studies are vulnerable to many of the same genetic complexities that confound genetic linkage studies with the following difference: association studies are robust to locus heterogeneity (since they only test one locus at a time), but confounded by allelic heterogeneity. Association studies are also believed to be quite vulnerable to genotypic differences related to population substructure (background genotypic differences that are unrelated to phenotype) (Hoggart et al. 2003). Thus, the ‘candidate status’ of most candidate genes is subject to uncertainty. Nevertheless, the subset of genes contained within a suggestive linkage peak is likely to be enriched for actual disease genes compared to the genome as a whole.

Both geneways and pathwayassist are pathway prediction methods that are designed to recognize written language and to extract key phrases that describe basic biological relationships between genes, small molecules, cellular processes and similar phenomena. pathwayassist reads abstracts, whereas geneways reads the entire article. Both programs use sophisticated algorithms to predict pathway interactions. Because of problems with interpretation of language, figures and tables, a number of oversights and erroneous conclusions are inevitable in these programs. Thus, human interpretation and curation of these databases and their output remain critically important. Another aspect of natural language processing algorithms is that they must discriminate between physical interactions (binding or cleavage, oxidation, etc.), and logical interactions (e.g., the effect of a drug on gene expression). In the first case, two molecules are known to interact directly, whereas in the latter case, the mechanism of interaction may involve a multistep pathway. Thus, the pathways identified by these algorithms must be carefully filtered and/or checked by an expert user in order to establish the type of experimental data that was collected and to determine what biological experiments are required to further test the proposed pathways.

Identification of candidates by the ‘manual’ search strategy, the GO strategies and by the pathwayassist and geneways pathway prediction programs all depend upon data from the published literature. In contrast, the transcription microarray meta-analysis is largely based on unpublished experimental data, and thereby provides a completely independent bioinformatic approach to the same positional mapping problem. Since criteria to distinguish correct from incorrect bioinformatic predictions are often lacking, it is desirable to employ independent computational strategies and identify convergent pathway predictions (Eisenberg et al. 2000). Yeast whole genome gene expression studies show that coexpressed sets of genes are enriched for functionally related or physically interacting genes (Eisen et al. 1998; Ge et al. 2001). Genes that are coexpressed may be coregulated by a common transcription factor. Alternatively, one of the genes in a group of transcriptionally coregulated genes may be the transcription factor that drives the expression of the others (e.g., ZHX1, Table 4) or it may simply be upstream in a transcriptional cascade of genes that influence the expression of downstream members of the same cascade. Thus, we used this method to identify genes with known function in the brain that may be relevant to ASD, and which are coexpressed with positional candidate genes identified in the genome scan. Such genes might be the downstream targets of transcription factors that reside within the linkage regions and possess functional polymorphisms.

The ability to predict gene expression pathways using the method presented in this paper depends heavily on the quality and applicability of the underlying expression data. In the present example, we utilized expression data from mouse brains, rather than human brains, due to availability, an obvious shortcoming. Another limitation is that six of the seven data sets were derived from the hippocampus, rather than the entire brain, or a brain region more relevant to autism (e.g., amygdala). Development of large, well-characterized databases that can store and manage gene expression data and integrate with a range of other heterogeneous data sets will likely overcome these shortcomings in the near future (e.g., Bader et al. 2003). More sophisticated computational programs to predict regulatory motifs (e.g., Bussemaker et al. 2001), together with high throughput experimental paradigms for selective and systematic perturbation of well-characterized biological systems (Barstead 2001; Elbashir et al. 2001; Ideker et al. 2001; McCaffrey et al. 2002), will likewise increase the power and scope of this approach. Perhaps the most promising strategies are those that combine the rigor of high throughput experimental paradigms with the speed, power and scope of computational data-mining approaches. Whole genome yeast and Drosophila ‘two-hybrid arrays’ test every permutation of protein–protein interaction and, despite a high false positive rate, are ideally suited for integrated computational analyses (von Mering et al. 2002). Mass spectrometry analysis of purified protein complexes (Ho et al. 2002) and the exploration of genetic interactions by identification of synthetic lethal gene combinations in yeast (Tong et al. 2001) are both potentially powerful complementary approaches to the prediction of interacting gene networks and pathways.

It is estimated that upwards of 90% of an individual's liability to develop autism or ASD is determined by genetic factors, yet the disease liability attributable to any single genetic variant may be so small that it is undetectable by current gene mapping strategies. This problem may be addressed to some extent by strategies to predict biological pathways since these strategies may identify interacting sets of genes that together account for a significant portion of heritable disease liability. The role of additive vs. epistatic gene interactions in the etiology of common heritable disorders is unclear at this point, as is the importance of this distinction in the mapping of such traits and disorders (Carrasquillo et al. 2002; Cox et al. 1999; Holmans 2002; Templeton 2000). Computational pathway predictions together with Gene Ontology annotations, gene regulatory information and other molecular interaction data should inform the characterization of additive vs. epistatic gene–gene interactions in ways that complement genetic studies. Thus, it is hoped that computational and bioinformatic approaches will lead to the identification of ‘candidate gene networks’ that encompass a significant fraction of a given disease's heritable component.

Table 3 summarizes the candidate gene predictions based upon six bioinformatic methods. With the exception of LIFR and EIF4E, all candidate genes detected by two or more of the automated search strategies were likewise detected by manual searches, suggesting that convergent findings from automated strategies are more reliable. The identified candidate genes are biased in favor of neurobiological disease etiology due to our search strategies. However, the etiology of autism may depend on susceptibility to environmental insults, rather than primary neurological deficits. For example, four candidates with known immunological function (IL6ST, LIFR, CD44 and IL8) were only detected by predicted pathway relationships, reflecting the lack of bias inherent in these pathway prediction approaches (Table 3).

In the present study we identified several genes using multiple bioinformatic approaches. Most notably the serotonin transporter (SLC6A4 a.k.a. 5-HTT) was identified by all but one of our search strategies (Table 3) including allele specific association studies (Table 1). Of the 408 microsatellite markers genotyped for linkage to ASD in the study by Yonan et al. (in press), the single most significant linkage was detected by a marker that maps less than one megabase distal to SLC6A4. It is also noteworthy that SLC6A4 is located in the only linkage region identified by Yonan et al. (in press) that overlaps with the findings of another linkage study (Table 2). Other studies have indicated that autism patients and their unaffected first degree relatives have elevated blood serotonin levels and there is evidence that drugs that selectively target the serotonin transporter can ameliorate some autism related symptoms (Cook & Leventhal 1996; Gingrich & Hen 2001). Thus SLC6A4 appears to be a particularly promising candidate gene for ASD, although it is not clear that the new data substantially bolster pre-existing data. Both pathway prediction programs predict relationships between glutamate receptor 6 (GLUR6; which has been positively associated with autism; Table 1) and positional candidates glial cell derived neurotrophic factor (GDNF) and SLC6A4, though not obviously via common pathways. Finally, the prolactin receptor (PRLR), and zinc-fingers and homeoboxes 1 (ZHX1), were identified by the transcriptional pathways prediction method (Table 4) as well as by the manual and GO strategies (Table 3).

Piccolo (PCLO) was identified by the transcriptional pathways prediction method as being coexpressed with the positional candidate, heterogeneous nuclear ribonucleoprotein D-like (HNRPDL) (Table 4). While PCLO itself is not located in our linkage region, and thus is not a positional candidate, this finding raises the possibility that HNRPDL may be upstream of PCLO in a transcriptional cascade. Based on its function, PCLO has been suggested as a possible candidate gene for autism (Fenster & Garner 2002), although this is not substantiated by allelic association (Table 1) (Nabi et al. 2003). The coexpression data suggest an alternative possibility: a polymorphism within HNRPDL may lead to differential expression of downstream targets that include PCLO. Thus the present results suggest that HNRPDL might be a viable candidate for autism, a conclusion that would not have been reached by any of the strategies whose results are reflected in Table 3.

The current study was designed to explore emerging bioinformatic technologies for the purpose of parsing large sets of genetically mapped (positional) candidate genes in search of disease-related genetic variation. Using a large family study of autism and ASD, we show that sophisticated bioinformatics approaches can be applied to this task and that convergent approaches might be used to offset the biases inherent in any given approach and ultimately to identify a subset of genes that is enriched for disease-related genetic variation in the study sample, thus providing testable hypotheses. We further note with optimism that the genesis of integrative databases, powerful whole genome computational data-mining approaches, and high-throughput experimental paradigms to evaluate molecular interactions and pathway associations bodes well for the merger of bioinformatic and gene mapping approaches in the future.

Supplementary material

The following material is available from: http://www.blackwellpublishing.com/products/journals/suppmat/GBB/GBB041/GBB041sm.htm

The supplementary material contains the complete list of all 383 positional candidate genes in the top five regions from Yonan et al. (2003). There is a detailed explanation of the positions in cM and Mb for each candidate region, as well as their LOD scores in each region. Also, each chromosomal region is shown separately with only the genes that reside in that region listed for ease of comparison.


We gratefully acknowledge the Autism Genome Resource Exchange (AGRE) families who made this study possible and the Cure Autism Now Foundation, which founded and continues to support AGRE. This research was funded by MH64547 (TCG) and a generous donation from Judith P. Sulzberger, MD. We are grateful to the AMDeC Bioinformatics Core Facility for assistance with all bioinformatic studies. Finally, we would like to thank Adina Grunn for technical assistance in determining the genotypes that led to this analysis.