Keywords:

  • Human inparalogues;
  • gene duplication;
  • bioinformatics method;
  • human evolution

Summary

Gene duplications represent an important class of evolutionary events that is likely to have contributed to the unique human phenotype in the short evolutionary time since the human-chimpanzee divergence. With the availability of both human and chimpanzee genome drafts in high coverage re-sequencing assemblies and the high annotation quality of most human genes, it should now be possible to identify all human lineage-specific gene duplication events (human inparalogues), and a few pioneering studies have attempted to do that. However, the different levels of coverage in the human and chimpanzee genome assemblies, and the differing levels of gene annotation, have led to problematic assumptions and oversimplifications in the algorithms and the datasets used to detect human lineage-specific gene duplications. In this study, we have developed a set of bioinformatic tools to overcome a number of the conceptual problems that are prevalent in previous studies and have collected a reliable and representative set of human inparalogues.


Introduction

The divergence of the human and chimpanzee lineages has been estimated to have occurred between 4.30 and 7.02 million years ago (mya) (Stauffer et al., 2001; Kumar et al., 2005; Steiper & Young, 2006; Patterson et al., 2006). In this relatively short evolutionary time, humans have acquired a range of unique phenotypic features; notably obligate bipedalism (Hunt, 1994), tripling of brain size (Nieuwenhuys et al., 2007), and complex culture, including language (Enard et al., 2002). Gene duplication is a class of genomic event that is likely to have contributed to the evolution of the modern human phenotype. A seminal work on the fate of duplicated genes proposes that in most cases one copy maintains the original functionality of the gene, while the other copy “escapes” the constraint of purifying selection, and thus becomes “free” to accumulate mutations that might give rise to novel functionalities (neo-functionalisation) or loss of function (nonfunctionalisation) (Ohno, 1970). Later experiments on duplicated gene expression levels have shown that extant gene pairs might partition between them the functions of the single ancestral gene (Prince & Pickett, 2002). The subfunctionalisation model (also called the duplication-degeneration-complementation model) proposes that the two gene copies acquire complementary loss of function, and together they produce the full functionality of the ancestral gene (Force et al., 1999).

The first step in identifying gene duplications in a specific species and for a particular species pair is the identification of corresponding orthologues in the reference species. This allows a distinction to be made between “out-” and “in-” paralogues; duplications that happened before or after speciation, respectively. However, until recently most studies focused on paralogues, without the distinction between inparalogues and outparalogues (Remm et al., 2001), while some studies combined paralogy with segmental duplications research (Bailey & Eichler, 2006).

The main bioinformatics tool available to identify species’ inparalogues is InParanoid (In-paralogue and Orthologue Identification) (Remm et al., 2001; O’Brien et al., 2005), an algorithm and a database that differentiates various species’ outparalogues from inparalogues. InParanoid is the only comprehensive bioinformatics tool that is designed specifically to detect inparalogues (rather than paralogues in general), which makes it a potentially ideal tool for detecting human inparalogues. The algorithm identifies orthologues and inparalogues between any given pair of genomes (two species comparisons only), while the software MultiParanoid detects orthology and paralogy among multiple species, making it conceptually more similar to COG (Clusters of Orthologous Groups), the first platform created to identify large-scale clusters/groups of orthologues and paralogues (Berglund et al., 2008). This is distinct from previous methods that identified smaller and separate sets of orthologues and paralogues (Tatusov et al., 1997).
Given the proteomes (in this case – exactly one protein from each coding gene for a nonredundant dataset) of two given species, B and C, with a most recent common ancestor, A, the InParanoid algorithm (Remm et al., 2001; O’Brien et al., 2005) proceeds as follows: (1) find all pairwise sequence similarities between B-C, C-B, B-B, and C-C using the amino acid sequence similarity search tool BLASTP (Altschul et al., 1990); (2) mark two-way best hits as potential orthologues (these are termed seed-orthologues); (3) add potential inparalogues for each seed-orthologue pair, by assuming that two inparalogues (which are, by definition, from the same species) are closer to each other than the distance between the seed-orthologues, otherwise the gene duplication is assumed to be before the divergence of B and C, and thus the two sequences are considered to be outparalogues; (4) calculate confidence value scores for the potential inparalogues using the averaged BLASTP bit scores; and (5) resolve overlapping groups of orthologues and inparalogues.
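Steps (2) and (3) of the procedure above can be illustrated with a minimal Python sketch. It is not the InParanoid implementation: the bit scores are invented toy values, the function names are hypothetical, and steps (4) and (5) (confidence values and resolution of overlapping groups) are omitted.

```python
def similarity(scores, a, b):
    """Mean of the two directional bit scores (a missing hit counts as 0)."""
    return (scores.get((a, b), 0.0) + scores.get((b, a), 0.0)) / 2.0


def seed_orthologues(scores, prots_b, prots_c):
    """Step 2: keep pairs that are each other's best hit across species."""
    def best(query, targets):
        return max(targets, key=lambda t: similarity(scores, query, t))

    seeds = []
    for b in prots_b:
        c = best(b, prots_c)
        if similarity(scores, b, c) > 0 and best(c, prots_b) == b:
            seeds.append((b, c))
    return seeds


def add_inparalogues(scores, seeds, prots_b, prots_c):
    """Step 3: a same-species protein joins a cluster when it is more similar
    to the seed than the seed is to its orthologue in the other species."""
    clusters = []
    for b, c in seeds:
        cutoff = similarity(scores, b, c)
        clusters.append({
            "seed": (b, c),
            "inparalogues_B": [p for p in prots_b
                               if p != b and similarity(scores, b, p) > cutoff],
            "inparalogues_C": [p for p in prots_c
                               if p != c and similarity(scores, c, p) > cutoff],
        })
    return clusters


# Toy bit scores (hypothetical): B2 is a recent duplicate of B1.
raw = {("B1", "C1"): 900, ("B1", "B2"): 950, ("B2", "C1"): 850, ("B3", "C2"): 700}
scores = {**raw, **{(b, a): s for (a, b), s in raw.items()}}
prots_b, prots_c = ["B1", "B2", "B3"], ["C1", "C2"]

seeds = seed_orthologues(scores, prots_b, prots_c)
clusters = add_inparalogues(scores, seeds, prots_b, prots_c)
```

In this toy input, B2 is more similar to the seed B1 than B1 is to C1, so it is added to the B1-C1 cluster as an inparalogue; had the B1-B2 score been lower than the B1-C1 score, B2 would have been treated as an outparalogue.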

We have identified several problems associated with using InParanoid's database of human-chimpanzee orthologues/inparalogues (O’Brien et al., 2005), and with attempting to use InParanoid locally with the human and chimpanzee proteomes (Hubbard et al., 2009). In the Results section, we detail the problems that we have encountered in the different categories of the genomes’ annotations and the InParanoid database. In summary, the use of a nonmodel organism's proteome as one of the species when performing inparalogue prediction using InParanoid, or any other published inparalogue prediction algorithm, underestimates the nonmodel organism's inparalogue count (for the purpose of this study, we define a “model organism” as a species with an experimentally validated annotation). This, and the other problems that we detail in the Results section, present major obstacles in detecting human lineage-specific gene duplications, and have necessitated the development of a new algorithm. In this work, we describe an algorithm to detect lineage-specific duplications in any case where one genome is that of a model organism, while the other is nonmodel (Fig. 1). The algorithm first filters the data by removing ambiguous sequences and resolving gene overlaps, then selects human inparalogue candidates using InParanoid and the mouse proteome. The inparalogue candidates are used to identify potential chimpanzee orthologues and human inparalogues on the chimpanzee and human genomes, respectively. The full duplication lengths of these candidates are identified, and phylogenetic trees inferred, while removing topologies that indicate human outparalogues and filtering for molecular clock violations. The final step of the algorithm is the removal of gene conversion (GC) candidates to provide a human inparalogue candidate set. The algorithms are implemented in scripts that are available from the authors on request.
We perform this analysis and report a set of human-specific gene duplications that have occurred since the human-chimpanzee split.

Figure 1. The filtering and analyses stages in the human lineage gene duplication detecting algorithm. Each stage provides the input for the next stage, while the initial input is the full human and mouse proteomes. H1 and H2 represent the human orthologue and human inparalogue candidate sequences, respectively.

Materials and Methods

The algorithm we present (Fig. 1) requires the availability of genome assemblies for both species of interest and the proteome of an outgroup model organism. The algorithm was implemented using the programming language Perl (http://www.perl.org/). Other applications that were used will be described in the relevant sections.

Choosing an Outgroup and Filtering Data

The first part of the algorithm is to identify potential inparalogues between the two model organisms using proteome data and InParanoid software. The first model organism is the target species – in this case human – and the second should be the phylogenetically most closely related model organism with an available annotated genome. In this case, mouse is the most closely related model organism to human (Waterston et al., 2002; Benton & Donoghue, 2007). Using the biological data mining website BioMart (Smedley et al., 2009), the following Ensembl annotation features for all 21,388 human and 23,019 mouse protein-coding genes were obtained: (1) chromosome number/symbol; (2) start location; (3) end location. Genes were removed if the chromosome's symbol was either ambiguous or indicated haplotype data. Genes were also removed if they were overlapping, following the logical rule:

  • [Equation image: logical rule for classing two genes as overlapping]

This rule ensures that if two genes are located on the same chromosome and the start location of one gene is located within the other gene then it is classed as overlapping and thus removed. This step was repeated until there were zero overlaps. From each human and mouse nonambiguous and nonoverlapping known coding gene, the longest peptide sequence was obtained. Altogether, the final set of acquired peptides representing the human and mouse proteomes consisted of 18,522 human and 21,043 mouse amino acid sequences; the above filtering process removed 2866 human peptides and 1976 mouse peptides.
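The overlap rule described above can be sketched in Python as follows. This is an illustration only, under stated assumptions: the dictionary field names are hypothetical, and the choice to drop the later-starting gene of an overlapping pair is an assumption, since the text does not state which gene of the pair was removed.

```python
def overlaps(a, b):
    """True if a and b are on the same chromosome and the start of one
    gene lies within the other gene."""
    if a["chrom"] != b["chrom"]:
        return False
    return (b["start"] <= a["start"] <= b["end"]
            or a["start"] <= b["start"] <= a["end"])


def remove_overlapping(genes):
    """Repeat the overlap check until no overlapping pair remains."""
    kept = sorted(genes, key=lambda g: (g["chrom"], g["start"]))
    changed = True
    while changed:
        changed = False
        for i in range(len(kept) - 1):
            # After sorting, checking neighbours is enough to find any overlap.
            if overlaps(kept[i], kept[i + 1]):
                del kept[i + 1]  # assumption: keep the earlier-starting gene
                changed = True
                break
    return kept


# Hypothetical coordinates for illustration.
genes = [
    {"name": "A", "chrom": "1", "start": 100, "end": 200},
    {"name": "B", "chrom": "1", "start": 150, "end": 300},  # starts inside A
    {"name": "C", "chrom": "1", "start": 250, "end": 400},
    {"name": "D", "chrom": "2", "start": 150, "end": 300},  # other chromosome
]
kept_names = [g["name"] for g in remove_overlapping(genes)]
```

Here gene B is removed because its start lies within gene A; once B is gone, C no longer overlaps any remaining gene, which is why the check has to be repeated rather than applied in a single pass over all pairs.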

Human-Mouse InParanoid Analysis

As mentioned above, InParanoid (Remm et al., 2001; O’Brien et al., 2005) will not collect a reliable list of human inparalogues with the human and the projected chimpanzee proteomes. However, InParanoid provides a very robust and accurate platform for detecting inparalogues between two model organisms such as human and mouse (van Noort et al., 2003). Performing an InParanoid analysis with the 18,522 human peptide sequences and 21,043 mouse peptide sequences resulted in 16,227 clusters of human-mouse seed orthologues, of which 305 contained one or more human inparalogues (occurring since the human-mouse divergence around 61.5 mya; Benton & Donoghue, 2007). The human-chimpanzee human inparalogue set should nest within this set of 305 genes.

Human-Chimpanzee BLAT Analysis

BLAT (BLAST Like Alignment Tool) is a software tool that identifies highly similar DNA or peptide sequences in a database, such as a full genome assembly (Kent, 2002). To detect gene duplications, the cDNA sequence was used to search the human and chimpanzee genomes (the chimpanzee's assembly has a coverage of ×6, a reasonably high level for a nonmodel organism). For each of the 305 human-mouse clusters containing one or more human inparalogues, the cDNA sequence of the human seed-orthologue peptide was acquired using BioMart (Smedley et al., 2009), then BLAT was used to identify the chimpanzee orthologues and human paralogues. BLAT's characteristics are tailored to identify DNA sequence duplications on genomes with a high degree of similarity. This makes BLAT suited for comparisons of closely related species such as human and chimpanzee, and consequently suited for finding human inparalogues, which are assumed to have a smaller distance from their human paralogue than the distance between the human and chimpanzee orthologues. The BLAT run of the 305 human cDNA sequences against the human and chimpanzee genomes on the UCSC web server was automated using a Perl script that is available at: http://genomewiki.ucsc.edu/index.php/Image:BlatBot_pl.txt (importantly, Ensembl and UCSC have, at the time of writing, similar genome assemblies for human and chimpanzee, so there was no discrepancy between the use of both Ensembl and UCSC systems in this work). The chimpanzee orthologues and human inparalogue candidates were collected from all BLAT hits by applying the following criteria:

  1. Highest bit scores (which the BLAT algorithm uses to determine the best match). For human inparalogue detection, a minimum threshold of half of that of the best hit was applied (i.e., a hit was considered as a paralogue if its score was more than half of the best hit).
  2. Sequence length similarity of at least 50%, since local alignment may capture various regions of the BLATed cDNA sequence scattered on huge regions of the chromosome.

The human inparalogue candidates were then filtered for overlaps to avoid redundancies in the BLAT hits.
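The two selection criteria and the overlap filter can be sketched as below. This is a hedged illustration and not the script used in the study: the hit fields are hypothetical, and "sequence length similarity" is assumed here to mean the ratio between the shorter and the longer of the query length and the genomic span of the hit.

```python
def select_paralogue_hits(hits, query_len, min_score_frac=0.5, min_len_sim=0.5):
    """Keep hits scoring more than half of the best hit, with a length
    similarity of at least 50%, and drop hits overlapping a better hit."""
    if not hits:
        return []
    best = max(h["score"] for h in hits)
    selected = []
    for h in sorted(hits, key=lambda h: -h["score"]):
        span = h["end"] - h["start"]
        len_sim = min(span, query_len) / max(span, query_len)
        if h["score"] <= min_score_frac * best or len_sim < min_len_sim:
            continue
        redundant = any(
            h["chrom"] == s["chrom"] and h["start"] < s["end"] and s["start"] < h["end"]
            for s in selected
        )
        if not redundant:
            selected.append(h)
    return selected


# Hypothetical hits for a 2000 bp query.
hits = [
    {"chrom": "chr1", "start": 1000, "end": 3000, "score": 2000},  # best hit
    {"chrom": "chr1", "start": 1100, "end": 2900, "score": 1900},  # redundant
    {"chrom": "chr5", "start": 500, "end": 2400, "score": 1500},   # candidate
    {"chrom": "chr7", "start": 0, "end": 100000, "score": 1200},   # too scattered
    {"chrom": "chr9", "start": 10, "end": 1500, "score": 900},     # score too low
]
selected = select_paralogue_hits(hits, query_len=2000)
selected_chroms = [h["chrom"] for h in selected]
```

Processing hits in descending score order means that, when two hits overlap, the higher-scoring one is always the one retained.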

Finding the Full Extent of Human Duplicated Regions

The human cDNA sequences collected using the above protocol will usually represent only a portion of the actual segmental sequence duplicated. To find the full extent of each duplicated region, the Ensembl Perl API (http://www.ensembl.org/info/data/api.html) was employed. The dataset was divided into triplets of (1) human orthologue, (2) human inparalogue candidate, and (3) chimpanzee orthologue. Upstream from the start of each of the three sequences, sliding windows of 100 base pair slices were obtained and compared to each other. Where the similarity was greater than 90% (a heuristic value, greater than the similarity between two random sequences and lower than the 95%–100% expected in comparisons of human inparalogues and of human-chimpanzee orthologues (Britten, 2002; Mikkelsen et al., 2005)), another 100 base pair slice upstream of the previous slice was obtained and the same similarity check was made. The window continued its upstream slide until similarity dropped below 90%. The same process was performed downstream of the end of each human and chimpanzee gene sequence. Importantly, as genomes are represented by only one strand, whenever a sequence was on the opposite strand the complementary sequence was inferred and the upstream–downstream directions were reversed. By checking for nonoverlap between the extended tandem duplications, the full human duplications and their full length chimpanzee orthologue DNA sequences were obtained.
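A minimal sketch of the upstream sliding-window extension is given below. It makes several simplifying assumptions that are not taken from the text: similarity is measured as ungapped per-position identity, all pairs within the triplet must exceed the threshold, sequences are held in memory as plain strings rather than fetched through the Ensembl API, and strand handling is left out.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def extend_upstream(seqs, starts, window=100, threshold=0.9):
    """Return how many bases upstream of each start remain similar.

    seqs   -- one chromosome sequence per member of the triplet
    starts -- 0-based start index of the gene copy within each sequence
    """
    extension = 0
    while True:
        offset = extension + window
        if any(s - offset < 0 for s in starts):
            break  # ran off the start of a sequence
        slices = [seq[s - offset:s - extension] for seq, s in zip(seqs, starts)]
        if all(identity(x, y) > threshold for x, y in combinations(slices, 2)):
            extension = offset  # slide one more window upstream
        else:
            break
    return extension


# Toy triplet: 200 shared bases upstream of the gene, preceded by
# 100 bases that differ completely between the three sequences.
shared = "ACGTTGCA" * 25
gene = "ATGAAATAG"
seqs = [prefix * 100 + shared + gene for prefix in ("A", "C", "G")]
starts = [300, 300, 300]
upstream_extension = extend_upstream(seqs, starts)
```

The downstream extension would follow the same logic with slices taken after the end coordinate of each sequence.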

Alignment, Phylogenetic Trees, and Molecular Clock Testing

At this stage, the human-human-chimpanzee orthologue set contains inparalogue candidates, as they were identified as potential human-chimpanzee inparalogues only by comparison of human and mouse proteomes. The human-mouse divergence has been estimated to have occurred some ∼61.5 mya, while the human-chimpanzee split occurred (according to a recent estimate) some ∼6.6 mya (Steiper & Young, 2006), so the majority of duplications identified at this stage are expected to be outparalogues with respect to human-chimpanzee divergence. Phylogenetic tree inference offers a means of categorising homology types, although it is very computationally demanding to infer phylogenies from full genome data (Koonin, 2005; Yang, 2006; Altenhoff & Dessimoz, 2009). However, the InParanoid clustering and various filtering procedures described above reduce the number of potential inparalogue candidate sets to a few hundred human-human-chimpanzee gene triplets and this has made possible the differentiation of human inparalogues from outparalogues using a phylogenetic approach. The first step in any phylogenetic inference is multiple sequence alignment. The software we chose to do this was MAFFT – Multiple sequence Alignment employing Fast Fourier Transform (Katoh et al., 2002). The advantage of this method is its flexibility and reliability tradeoff as it automatically optimises the alignment according to the lengths of the different DNA sequences used. This is particularly useful when alignments need to be performed on sequences from a few hundred to hundreds of thousands of base pairs in length. A benchmark test (Katoh et al., 2005) has shown high performance of MAFFT when compared to other well-established methods, including MUSCLE (Edgar, 2004), T-Coffee (Notredame et al., 2000), and ClustalW (Thompson et al., 1994) (note that the benchmark was, as with most alignment benchmarks, testing protein alignments. 
In manual testing that we performed, MAFFT also showed favourable performance for DNA sequences). We automated the application of the DNAML and DNAMLK maximum likelihood phylogeny inference programs, which are a part of the Phylip package (Felsenstein, 1989), with the aligned sequences as input. DNAMLK assumes a molecular clock, whereas DNAML does not. Both tree topology and maximum likelihood score were obtained from each set of sequences. The first filtering process kept only the trees with the clock tree topology ([H1,H2],C1) – representing two human inparalogues and their chimpanzee orthologue (see Fig. 2 for the three possible tree topologies). Then a likelihood ratio test of the molecular clock was applied (Felsenstein, 1981), as follows: $LR = 2(ML_0 - ML_1)$, where $ML_1$ is the DNAMLK (clock) log maximum likelihood score and $ML_0$ is the DNAML (no clock) log maximum likelihood score. Because the no-clock model is the less constrained of the two, $ML_0$ should be at least as large as $ML_1$, so LR is non-negative. In cases where LR > 3.84 (p-value < 0.05 for a $\chi^2$ distribution with 1 degree of freedom: df = s − 2, where s is the number of sequences) the molecular clock was considered to be violated.
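As a worked illustration of the clock test, the sketch below computes the likelihood ratio statistic as twice the difference between the no-clock and clock log-likelihoods and compares it with the 5% critical value for one degree of freedom. The log-likelihood values are invented for the example, and the function name is hypothetical.

```python
CHI2_CRITICAL_DF1 = 3.84  # 5% critical value, chi-square with 1 degree of freedom


def clock_test(lnl_clock, lnl_no_clock, critical=CHI2_CRITICAL_DF1):
    """Return (LR, violated) for a three-sequence alignment (df = 3 - 2 = 1)."""
    lr = 2.0 * (lnl_no_clock - lnl_clock)
    return lr, lr > critical


# Hypothetical log-likelihoods from DNAMLK (clock) and DNAML (no clock).
lr_a, violated_a = clock_test(lnl_clock=-10234.7, lnl_no_clock=-10231.2)
lr_b, violated_b = clock_test(lnl_clock=-5000.0, lnl_no_clock=-4999.0)
```

In the first case LR is about 7.0, which exceeds 3.84, so the clock would be rejected; in the second case LR is 2.0 and the triplet would be retained.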

Figure 2. The tree space of two human sequences and one chimpanzee sequence. The left tree is the only one that represents two human inparalogues and their chimpanzee orthologue. In the central and right trees, the human sequences are outparalogues.

Gene Conversion

GC may cause true outparalogues to be categorised as inparalogues as one of its main effects is to reduce sequence difference between adjacent paralogues. However, an important feature of gene conversion is that it occurs more frequently in sequences with high Guanine Cytosine content. Various studies give the range of 60%–90% (Galtier et al., 2001; Galtier, 2003; Marais, 2003; Spencer et al., 2006; Chen et al., 2007). Combining this with the observation that gene converted sequences are usually located a short genetic distance from each other (Chen et al., 2007), it was possible to construct, to a first order of approximation, a GC filter. The GC detection software GENECONV (Sawyer, 1989) (http://www.math.wustl.edu/~sawyer/geneconv) is the most well-established computational tool for this purpose. However, GENECONV does not differentiate between recent gene duplications and GC and does not take into account Guanine Cytosine content, so for the specific purpose of our study we needed to develop a new GC detection tool. For each human inparalogue candidate pair on the same chromosome, the genetic distance between the sequences was calculated using the DNADIST program with the F84 substitution matrix (Felsenstein, 1989). After calculating the genetic distances, the Guanine Cytosine content of the inparalogue candidates was calculated. Pairs where the genetic distance was smaller than 0.02 on a scale of 0–0.1 and their Guanine Cytosine content was greater than 60% were considered to be GC candidates. Figure 3 shows that all inparalogue candidates with a high Guanine Cytosine content also have a very short genetic distance from each other and cluster into one well-defined group, which should be enriched for GC sequences. All inparalogue candidates that were identified as GC candidates by the above criteria were removed. 
It is important to note that this candidate gene converted set may consist of false positives – genes that were assigned gene converted status but that are genuine inparalogues. However, since only nine such sequences were removed, and since a reliable final dataset of true inparalogues is of a greater importance, this should not be considered as a problem.
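The filter described above reduces to two thresholds applied to same-chromosome pairs, as in the following sketch. The genetic distance is taken as an input (in the study it came from DNADIST with the F84 model, which is not reimplemented here), and the function names and example sequences are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases over the full sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_conversion_candidate(distance, candidate_seq, same_chromosome,
                            max_distance=0.02, min_gc=0.60):
    """Flag a pair as a gene conversion candidate: same chromosome,
    genetic distance below 0.02, Guanine Cytosine content above 60%."""
    return (same_chromosome
            and distance < max_distance
            and gc_content(candidate_seq) > min_gc)


# Hypothetical examples.
flag_close_gc_rich = is_conversion_candidate(0.01, "GGCCGGCCAT", True)
flag_distant = is_conversion_candidate(0.05, "GGCCGGCCAT", True)
flag_at_rich = is_conversion_candidate(0.01, "ATATATGC", True)
flag_other_chrom = is_conversion_candidate(0.01, "GGCCGGCCAT", False)
```

Only the first example satisfies all three conditions and would be removed from the inparalogue candidate set.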

Figure 3. Guanine Cytosine content and divergence on same chromosome human inparalogues; a set of gene conversion candidates. For each duplication event, genetic distance was calculated between the human orthologue and its inparalogue candidate using the F84 substitution matrix in the Phylip package (Felsenstein, 1989). Guanine Cytosine content was calculated by counting the G and C bases in each inparalogue candidate, then dividing by the full sequence length. The red circle shows that all duplications having high Guanine Cytosine content also have a short genetic distance from their orthologues, distinguishing them as likely to have undergone gene conversion.

Results

The Human Inparalogues Set

After applying the full process that we have described in this work, 138 human inparalogues were identified: 104 are duplications that occurred on the same chromosome, and 34 are duplications found on different chromosomes. In Table S1, we present this full set, including the orthologue's Gene Ontology function (or other BioMart resources when GO was not available) (Ashburner et al., 2000) and the chromosome, start position, end position, and strand for each duplication. See Table 3 for a summary of Table S1, including a gene enrichment score for each functional group.

Table 3.  Summary of human lineage gene duplications
Inparalogue GO function | Number of duplications | Orthologues' Ensembl IDs (1) | Functional group's enrichment score (2)
Cellular regulation | 12 | ENSP00000325681, ENSP00000340685, ENSP00000341674, ENSP00000361400, ENSP00000348915, ENSP00000379097, ENSP00000332124, ENSP00000386810, ENSP00000377304 | 2.29
Immune system | 22 | ENSP00000282633, ENSP00000289473, ENSP00000297496, ENSP00000303532, ENSP00000324633, ENSP00000333329, ENSP00000335281, ENSP00000335307, ENSP00000340685, ENSP00000367605, ENSP00000374785, ENSP00000367605, ENSP00000374785, ENSP00000374789, ENSP00000374792, ENSP00000374808, ENSP00000382951 | 5.33
Inter/intra cellular signalling | 16 | ENSP00000056217, ENSP00000289473, ENSP00000328013, ENSP00000328230, ENSP00000341674, ENSP00000342143, ENSP00000344026, ENSP00000352498, ENSP00000358942, ENSP00000370076, ENSP00000386810 | 0.17
Membrane protein | 11 | ENSP00000226272, ENSP00000277634, ENSP00000328443, ENSP00000329825, ENSP00000333071 | 1.07
Metabolic and catabolic processes | 19 | ENSP00000255845, ENSP00000305847, ENSP00000333329, ENSP00000347318, ENSP00000348915, ENSP00000350939 | 2.15
Nucleus activity | 3 | ENSP00000358518, ENSP00000368039 | 0.00
Sensory perception | 11 | ENSP00000305469, ENSP00000324687, ENSP00000329982, ENSP00000331774, ENSP00000335529 | 3.02
Transcription and translation regulation | 15 | ENSP00000305857, ENSP00000322697, ENSP00000334501, ENSP00000338561, ENSP00000346045, ENSP00000346045, ENSP00000361377, ENSP00000366381, ENSP00000375129, ENSP00000378953, ENSP00000302756, ENSP00000375129 | 0.02
Unknown | 10 | ENSP00000329982, ENSP00000340685, ENSP00000361377, ENSP00000374792, ENSP00000377554, ENSP00000382514 | -
Noncoding duplication | 19 | ENSP00000236937, ENSP00000259216, ENSP00000329982, ENSP00000333329, ENSP00000346022, ENSP00000348915, ENSP00000352498, ENSP00000359991, ENSP00000361377, ENSP00000374789, ENSP00000374792, ENSP00000387009 | -

  1. Note that since an orthologue was often copied more than one time, there are cases where the number of orthologue IDs is smaller than the number of duplications.
  2. Gene enrichment score for functional groups was calculated using DAVID (http://david.abcc.ncifcrf.gov/). The score is the geometric mean, on a negative log scale, of the p-values of the group's members, so a higher score means lower p-values, i.e., a more strongly enriched group.

The Problems in Nonmodel Organisms’ Annotations

Chimpanzee is a nonmodel organism whose genome sequence (Mikkelsen et al., 2005) has been annotated through projection from the human genome by Ensembl. The majority of annotated genomes available from Ensembl and BioMart are of nonmodel organisms, such as the orangutan, macaque, horse, cat, platypus, and more (Mikkelsen et al., 2005; Hubbard et al., 2009). One major implication of annotating a nonmodel organism's genome is that only a very low proportion of genes will have been characterised experimentally. The annotation of a nonmodel organism's unknown genome is carried out by projection, which involves aligning its transcripts to the known genes from the evolutionary nearest genome(s) (Table 1), human in the case of chimpanzee. Note that for low-coverage genomes or where genes cannot be annotated by projection from a model organism, Ensembl apply another annotation category termed “novel genes.” Unlike projection, this process allows the content of the original assembly to be changed according to the model organism's gene sequence (see: http://www.ensembl.org/info/docs/genebuild/genome_annotation.html). In this study, we will term both “projected” and “novel” genes as “projected.” The majority of annotated chimpanzee genes are projected from known human genes. As a consequence, comparing a human genome/proteome with its projected chimpanzee counterpart is essentially the same as comparing human genes with “more poorly annotated” versions of themselves.

Table 1.  Gene categories in model and nonmodel organisms’ genomes and proteomes
Organism | Number known protein-coding genes | Number projected and novel protein-coding genes | Number ambiguous genes | Number overlapping genes | Number haplotype genes
Human (Homo sapiens)21388282212125741
Chimpanzee (Pan troglodytes)26471718212681226-
Orangutan (Pongo pygmaeus abelii)3813162551245100788
Macaque (Macaca mulatta)8742103111231854-
Mouse (Mus musculus)23019982731976-
Cow (Bos taurus)204715832745874-
Horse (Equus caballus)723195991531024-

The Ensembl database (version 52) contains 21,416 human annotated coding genes – mostly “known,” and 19,829 chimpanzee coding genes – mostly “projected” and “novel” (see also Table 1 for the number of known genes for each species) – about 1600 more human annotated genes. The results of this difference can be illustrated by performing InParanoid analysis using the human and chimpanzee proteomes as input, after applying various filtering steps that are detailed in the Materials and Methods section. The output of this analysis is the full set of human and chimpanzee orthologue groups (including those that contain inparalogues). We identified cases of human and chimpanzee orthologous groups where one species has inparalogues, while the other species has no inparalogues (in other words, human- or chimpanzee-specific inparalogue groups). This gives an indication of how balanced the two genomes’ annotation is, under the null hypothesis that human and chimpanzee have a similar number of species-specific gene duplications. The numbers of human- and chimpanzee-specific gene duplications show a massive bias towards human duplications – 192 human-specific inparalogues groups, and only 33 in chimpanzee (Table 2), an almost sixfold difference. While at first sight this may appear to indicate that the human lineage has had a significantly accelerated gene duplication rate compared to chimpanzee (or alternatively that chimpanzee had a significant deceleration), performing InParanoid analysis with the human proteome and that of several other species indicates that this is not the case. Rather there seems to be a bias in finding inparalogues stemming from the nature of the specific species’ genome annotation. When running the human proteome against organisms for which the majority of the genes are well annotated (e.g., mouse or cow, see Tables 1 and 2) the tendency to detect more human lineage-specific gene duplications was reversed. 
We detected differences of 1.27 and 1.57 times more lineage-specific inparalogue-containing groups in cow and mouse, respectively, than in human. The number of human peptide sequences used is only 1.07 times larger than the chimpanzee’s, while the number of mouse peptides is 1.1 times larger than human and the cow's is 0.9 times the size of the human’s, so differences in the number of peptide sequences among different species are unlikely to account for the bias seen in the human-chimpanzee comparison. Performing similar InParanoid runs and species-specific duplication analyses of the human proteome against other nonmodel primates revealed similar patterns to those seen with chimpanzee: 3.64-fold more human-specific inparalogue groups than orangutan-specific, and 1.84-fold more human-specific inparalogue groups than macaque-specific inparalogue groups. Conversely, performing the analyses described above for the human proteome against horse, a nonprimate nonmodel organism, revealed more similarity in the number of species-specific inparalogue groups between the two species – 196 human-specific versus 204 horse-specific (Table 2). However, the Ensembl genome annotation for horse is projected from all known mammalian genes, and – at a lower priority – from nonmammalian vertebrates (http://www.ensembl.org/Equus_caballus/Info/Index). Also, the horse genome assembly coverage in Ensembl version 52 was ×6.79 (where the assembly coverage unit represents the average number of times that each unit of the genome was sequenced), a relatively high level of coverage (Ensembl unofficially defines “low coverage” as about ×2.5 or less). The similar number of human- and horse-specific inparalogue groups, together with the horse annotation process and its high coverage, show that when all available amniotes’ (i.e., mammals, reptiles, and birds) genomes are taken into consideration, there is little numeric bias.
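The counting behind these comparisons (a cluster is specific to one species when that species has at least one inparalogue in it and the other species has none) can be sketched as follows. The cluster representation and function name are hypothetical; only the counts 192 and 33 in the final line are taken from the human-chimpanzee comparison reported in the text.

```python
def count_species_specific(clusters, j, k):
    """Count clusters with inparalogues in one species only.

    clusters -- list of dicts mapping species name to its number of
                inparalogues in that orthologous cluster
    """
    j_only = sum(1 for c in clusters if c.get(j, 0) > 0 and c.get(k, 0) == 0)
    k_only = sum(1 for c in clusters if c.get(k, 0) > 0 and c.get(j, 0) == 0)
    return j_only, k_only


# Toy clusters (hypothetical inparalogue counts).
clusters = [
    {"human": 2, "chimpanzee": 0},  # human-specific
    {"human": 0, "chimpanzee": 1},  # chimpanzee-specific
    {"human": 1, "chimpanzee": 1},  # inparalogues in both: not counted
    {"human": 0, "chimpanzee": 0},  # no inparalogues: not counted
    {"human": 3, "chimpanzee": 0},  # human-specific
]
human_only, chimp_only = count_species_specific(clusters, "human", "chimpanzee")

# Fold difference for the reported human-chimpanzee counts.
fold_human_vs_chimp = 192 / 33
```

With the reported counts the fold difference is about 5.8, which corresponds to the "almost sixfold" bias discussed for the human-chimpanzee comparison.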

Table 2.  The number of orthologous clusters having species-specific inparalogues, detected by InParanoid
Organisms tested | N estimated human-specific duplications | N estimated species-specific duplications
Human-chimpanzee* | 192 | 33
Human-mouse | 207 | 326
Human-orangutan* | 171 | 47
Human-macaque* | 208 | 111
Human-cow | 220 | 279
Human-horse* | 196 | 204

  1. Nonmodel organisms are identified by “*.” For hypothetical species j and k, a cluster was detected as having species-specific inparalogues by counting the number of inparalogues for j and k; if the number of j inparalogues is greater than 0 and the number of k inparalogues is equal to 0, then the cluster is considered as having j-specific inparalogues (and vice versa for k-specific inparalogues).

Identifying the Problems in Using Human Haplotype Data

As a part of the effort to map human genomic variants that may be associated with susceptibility to common diseases, two projects were conducted to identify haplotypes of two of the Major Histocompatibility Complex (MHC) genes on Human chromosome 6 (COX and QBL). Susceptibility to more than 100 diseases has been mapped to this region (Stewart et al., 2004; Traherne et al., 2006). The Ensembl annotated human genome database (Hubbard et al., 2009) includes 246 COX and 234 QBL haplotypes, as well as 741 protein variants (Hubbard et al., 2009; Smedley et al., 2009, Table 1). The InParanoid database of orthologues and inparalogues (http://inparanoid.sbc.su.se/cgi-bin/index.cgi) was collected using the InParanoid algorithm with the full known proteome of each species. It attempts to identify inparalogues using the longest protein sequence from each human coding gene. However, the InParanoid database input is not filtered for human haplotype data and so COX and QBL protein sequences are included. The result is many variants of the same genes collected from different genomes, which leads to false detection of inparalogues (i.e., false positives) as haplotypes and protein variants are automatically categorised as inparalogues, even though they are actually variants of the same gene among different individuals. The use of haplotype data in InParanoid's inparalogue detection procedure has an effect similar to artificially adding hundreds of almost identical copies of the same genes to the human genome database.

The Problems in Gene Conversion (GC) Data

Following gene duplication, adjacent paralogues are prone to reciprocal unequal crossovers by virtue of the high degree of homology between them (Chen et al., 2007). As a consequence of these unequal crossovers, an “acceptor” sequence is replaced, wholly or partly, by a sequence that is copied from the “donor” sequence, whereas the sequence of the donor remains unaltered. This process is termed gene conversion (GC) (Chen et al., 2007). As a consequence of GC, the two copies of the gene can revert to a very high degree of similarity, even though their duplication event may predate a speciation event. When a gene duplication that occurred before the most recent speciation event is followed by GC, any currently available inparalogue detection method is likely to identify the two copies as inparalogues, when in fact they are outparalogues. No bioinformatics filters for GC detection that are suitable for the task of this study are currently available, and so we expect that all currently available inparalogue detection methods will include false-positive inparalogues.

The Problems in Ambiguous Data

As part of the Ensembl (Hubbard et al., 2009) gene annotation process for each species, transcripts are aligned to the whole sequenced genome to identify the chromosomal location of each gene. Due to low sequence coverage or low transcript quality, there are cases where a transcript cannot be mapped to a specific chromosomal region, and consequently the gene's chromosomal location is categorised as “random” (when a specific chromosome is identified), “Un” (when the chromosome is unknown), or “NT” (like “Un,” but with the original contig's name specified as a chromosome). Here, we refer to these three classes of annotation data as “ambiguous.” The numbers and distributions of these “ambiguous” genes vary among the different annotated species. The Ensembl database (http://www.ensembl.org/index.html) includes 221 ambiguous human genes and 1268 ambiguous chimpanzee genes (Table 1) that were used to assemble the current InParanoid human-chimpanzee database (O’Brien et al., 2005). Although the sequence quality for some of these genes may be adequate, the fact that they are “ambiguous” (as described above) makes it difficult to detect whether they overlap with other genes. In addition, if they are identified as gene duplications, it is difficult to know whether they are tandem duplications (on the same chromosome) or gene duplications located on different chromosomes. Altogether, the fact that these genes cannot be traced to a specific location suggests a problem in the quality of the genome annotations, and so using these genes makes the current inparalogue dataset less reliable.
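
The three “ambiguous” classes can be screened for with a simple name check, sketched below. The exact spellings tested here (a `_random` suffix, the name `Un` or an `Un_` prefix, and an `NT_` prefix) are assumptions about how such locations are labelled and should be checked against the annotation release being used.

```python
def is_ambiguous(chromosome_name):
    """Return True if a chromosome label falls in one of the three
    "ambiguous" classes (random, Un, NT). Naming patterns are assumed."""
    name = chromosome_name.strip()
    return (
        name.endswith("_random")   # chromosome known, position unplaced
        or name == "Un"            # chromosome unknown
        or name.startswith("Un_")
        or name.startswith("NT_")  # unplaced contig named as a chromosome
    )


if __name__ == "__main__":
    for label in ["6", "1_random", "Un", "NT_113923", "X"]:
        print(label, is_ambiguous(label))
```

Genes flagged this way could then be excluded from the input before inparalogue detection.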

Discussion

This study describes the problems encountered in detecting a reliable set of human inparalogues from human-chimpanzee comparisons when using the currently available homology detection methods. These problems include the use of nonmodel organisms, human haplotype data, proteome data, and gene conversion (GC). We go on to describe an algorithm that we have developed to overcome these problems and find a set of human inparalogues.

The algorithm that we have developed and the filtering processes applied are relevant for any model/nonmodel organism inparalogue detection project. For example, in a proposed future project the algorithm could be applied to find inparalogues between cow (an organism with a majority of genes known) and dog (an organism with a majority of genes projected), using rat (a model organism with a majority of known genes) as an outgroup, where cow and dog diverged about 62.3 mya, while cow-dog diverged from rat about 95.3 mya (Benton & Donoghue, 2007). In this example, since cow and dog are more distantly related than human and chimpanzee, we suggest the use of BLAST instead of BLAT, as it is more sensitive to more distant homologies. We see no reason why the method we describe should not be used to identify the full inparalogue datasets for all model/nonmodel organism pairs for which high-coverage genome data are available. However, when applying the algorithm to detect other species-specific gene duplications, the quality of the genome assembly should be taken into account. Besides inaccuracies in the genomic sequence, which are the obvious consequence of a low-quality genome assembly, such genomes also contain large-scale errors. For example, the coverage of the chimpanzee genome assembly is inferior to that of the human assembly, and as a result of genome mis-assembly many false duplications appear in the chimpanzee genome (Kelley & Salzberg, 2010). Thus, any attempt to discover chimpanzee-specific gene duplications would need to account for these false duplications. This problem of the current chimpanzee assembly did not affect our results because we used only the chimpanzee orthologues, of which there was only one copy.

The main problem that we detected in the InParanoid database is the abundance of false positives, as elaborated in the Materials and Methods and the Results sections. To estimate the potential for false positives and false negatives in our method, we compared our orthologue cluster results with the results of running InParanoid on the human and chimpanzee proteomes (the longest peptide from each Ensembl coding gene, as in the InParanoid human-chimpanzee inparalogue database). For orthologue clusters that contain human inparalogues, the number of orthologues and inparalogues in the human-chimpanzee InParanoid run is 1609, while our results contain a total of 185, giving a maximum of 1424 false positives in the InParanoid database. We estimate that the maximum number of false negatives (true human inparalogues that were not detected in our study) is 76. This number was estimated from the number of orthologues and inparalogues that were detected in the InParanoid run and not in our study. However, we stress that the true number of false negatives is likely to be significantly smaller, as the vast number of false positives in the human-chimpanzee InParanoid run is likely to have affected the quality of clustering and inparalogue detection. We estimate that the maximum number of false positives (false human inparalogues that were detected in our study) is 101. This number was estimated from the number of orthologues and inparalogues that were detected in our study but not in the InParanoid run. We stress again that, given the vast number of false positives in the human-chimpanzee InParanoid run, this number is likely to be significantly lower: the main effect of our method is to filter false positives from the results, and so our method is much more prone to false negatives than to false positives. Consequently, we believe that the true false-positive number from our analysis is close to zero. The number of orthologues and inparalogues that were detected in our study and also detected in the InParanoid run is 84.
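
Two of the bounds quoted above follow directly from the reported totals: 185 − 84 = 101 (detected only in our study) and 1609 − 185 = 1424. The sketch below shows how such a comparison could be made from two sets of gene identifiers. The function `compare_runs` and the toy identifiers are illustrative assumptions, not the comparison code used in the study.

```python
def compare_runs(ours, reference):
    """Count identifiers shared between two runs and unique to each."""
    ours, reference = set(ours), set(reference)
    return {
        "shared": len(ours & reference),
        "only_ours": len(ours - reference),            # upper bound on our false positives
        "only_reference": len(reference - ours),       # candidates missed by our run
    }


if __name__ == "__main__":
    # Toy identifiers, for illustration only.
    print(compare_runs({"g1", "g2", "g3"}, {"g2", "g3", "g4", "g5"}))

    # Arithmetic check on the totals reported in the text.
    ours_total, shared, inparanoid_total = 185, 84, 1609
    print(ours_total - shared)            # 101
    print(inparanoid_total - ours_total)  # 1424
```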

Some improvements that we envisage for future versions of the algorithm include a more robust process for detecting GC, and the creation of a fully automated pipeline. Such an application would take as input the proteomes of two evolutionarily neighbouring species and of their outgroup species, and provide as output the full set of inparalogues after performing the filtering, clustering, and tree inference procedures.

To detect the full regions of the gene duplications (rather than only the region represented by the protein sequence), InParanoid could be adapted to use BLASTN instead of BLASTP. The input would then be a nonredundant filtered genome, rather than the proteome data of the species. However, testing this option has revealed that current conventional computing power is insufficient for such a task. A test run of 3000 human and chimpanzee sequences (1500 from each species), with an upper threshold of 300,000 base pairs on the length of any one sequence, required about 4 GB of RAM. A full genome InParanoid run with about 20,000 genes from each species and no sequence length threshold (which may include sequences of a million base pairs or more) would require about 50 GB of RAM (as an approximation, assuming that the full nonredundant genomes will be more than 10–15 times larger than the 3000-sequence human and chimpanzee dataset that was tested). Adapting InParanoid for distributed computing may enable such a task within a feasible timeframe (i.e., a number of weeks or less) and amount of computer memory.
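
The memory estimate above can be reproduced with back-of-envelope arithmetic. The sketch below assumes that memory grows linearly with dataset size, which is a simplification made for this example; the 4 GB test figure and the 10–15-fold range come from the text, and the helper name `scaled_ram_gb` is hypothetical.

```python
def scaled_ram_gb(test_ram_gb, scale_factor):
    """Extrapolate memory use linearly from a smaller test run (assumed model)."""
    return test_ram_gb * scale_factor


if __name__ == "__main__":
    test_ram_gb = 4  # reported for the 3000-sequence test run
    low = scaled_ram_gb(test_ram_gb, 10)
    high = scaled_ram_gb(test_ram_gb, 15)
    print(f"Estimated range: {low}-{high} GB")

    # Ratio of sequence counts alone: 2 x 20,000 genes versus 3000 sequences.
    print(round(40_000 / 3_000, 1))
```

Under this linear assumption the 10–15-fold range brackets the figure of about 50 GB. Removing the sequence length threshold could push the real requirement higher.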

In addition to gene duplications, identifying other classes of “all human-lineage” genomic events is also likely to be informative about the evolution of the unique human phenotype. Such events include pseudogenization (Wang et al., 2006), regulatory region changes (Montgomery, 2009), retroviral insertions and sequence deletions (Costantini & Bernardi, 2009), and genomic rearrangement (Zhang et al., 2009). For all such searches, differences in genome annotation quality would need to be accounted for.

Acknowledgements

We thank Ziheng Yang, Neil Bradman, Richard Emes, and Jessica Vamathevan for their very important contribution, as well as the Ensembl (particularly Bert Overduin) and BLAT teams for their ongoing support and patience. YI was funded by the UCL ORS, UCL Graduate School, B’nai B’rith and the Anglo-Jewish Association scholarships.

References

  • Altenhoff, A. M. & Dessimoz, C. (2009) Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol 5, e1000262.
  • Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) Basic local alignment search tool. J Mol Biol 215, 403–410.
  • Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–29.
  • Bailey, J. A. & Eichler, E. E. (2006) Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet 7, 552–564.
  • Benton, M. J. & Donoghue, P. C. (2007) Paleontological evidence to date the tree of life. Mol Biol Evol 24, 26–53.
  • Berglund, A. C., Sjolund, E., Ostlund, G. & Sonnhammer, E. L. (2008) InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Res 36, D263–D266.
  • Britten, R. J. (2002) Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc Natl Acad Sci U S A 99, 13633–13635.
  • Chen, J. M., Cooper, D. N., Chuzhanova, N., Ferec, C. & Patrinos, G. P. (2007) Gene conversion: mechanisms, evolution and human disease. Nat Rev Genet 8, 762–775.
  • Costantini, M. & Bernardi, G. (2009) Mapping insertions, deletions and SNPs on Venter's chromosomes. PLoS One 4, e5972.
  • Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792–1797.
  • Enard, W., Przeworski, M., Fisher, S. E., Lai, C. S., Wiebe, V., Kitano, T., Monaco, A. P. & Paabo, S. (2002) Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418, 869–872.
  • Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368–376.
  • Felsenstein, J. (1989) PHYLIP – Phylogeny inference package (Version 3.2). Cladistics 5, 164–166.
  • Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L. & Postlethwait, J. (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545.
  • Galtier, N. (2003) Gene conversion drives GC content evolution in mammalian histones. Trends Genet 19, 65–68.
  • Galtier, N., Piganeau, G., Mouchiroud, D. & Duret, L. (2001) GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159, 907–911.
  • Hubbard, T. J., Aken, B. L., Ayling, S., Ballester, B., Beal, K., Bragin, E., Brent, S., Chen, Y., Clapham, P., Clarke, L., Coates, G., Fairley, S., Fitzgerald, S., Fernandez-Banet, J., Gordon, L., Graf, S., Haider, S., Hammond, M., Holland, R., Howe, K., Jenkinson, A., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Kulesha, E., Lawson, D., Longden, I., Megy, K., Meidl, P., Overduin, B., Parker, A., Pritchard, B., Rios, D., Schuster, M., Slater, G., Smedley, D., Spooner, W., Spudich, G., Trevanion, S., Vilella, A., Vogel, J., White, S., Wilder, S., Zadissa, A., Birney, E., Cunningham, F., Curwen, V., Durbin, R., Fernandez-Suarez, X. M., Herrero, J., Kasprzyk, A., Proctor, G., Smith, J., Searle, S. & Flicek, P. (2009) Ensembl 2009. Nucleic Acids Res 37, D690–D697.
  • Hunt, K. D. (1994) The evolution of human bipedality: ecology and functional morphology. J Hum Evol 26, 183–202.
  • Katoh, K., Kuma, K., Toh, H. & Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33, 511–518.
  • Katoh, K., Misawa, K., Kuma, K. & Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30, 3059–3066.
  • Kelley, D. R. & Salzberg, S. L. (2010) Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol 11, R28.
  • Kent, W. J. (2002) BLAT–the BLAST-like alignment tool. Genome Res 12, 656–664.
  • Koonin, E. V. (2005) Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 39, 309–338.
  • Kumar, S., Filipski, A., Swarna, V., Walker, A. & Hedges, S. B. (2005) Placing confidence limits on the molecular age of the human-chimpanzee divergence. Proc Natl Acad Sci U S A 102, 18842–18847.
  • Marais, G. (2003) Biased gene conversion: implications for genome and sex evolution. Trends Genet 19, 330–338.
  • Mikkelsen, T. S., Hillier, L. W., Eichler, E. E., Zody, M. C., Jaffe, D. B., Yang, S. -P., Enard, W., Hellmann, I., Lindblad-Toh, K., Altheide, T. K., Archidiacono, N., Bork, P., Butler, J., Chang, J. L., Cheng, Z., Chinwalla, A. T., Dejong, P., Delehaunty, K. D., Fronick, C. C., Fulton, L. L., Gilad, Y., Glusman, G., Gnerre, S., Graves, T. A., Hayakawa, T., Hayden, K. E., Huang, X., Ji, H., Kent, W. J., King, M. -C., Kulbokas, E. J., III, Lee, M. K., Liu, G., Lopez-Otin, C., Makova, K. D., Man, O., Mardis, E. R., Mauceli, E., Miner, T. L., Nash, W. E., Nelson, J. O., Paabo, S., Patterson, N. J., Pohl, C. S., Pollard, K. S., Prufer, K., Puente, X. S., Reich, D., Rocchi, M., Rosenbloom, K., Ruvolo, M., Richter, D. J., Schaffner, S. F., Smit, A. F. A., Smith, S. M., Suyama, M., Taylor, J., Torrents, D., Tuzun, E., Varki, A., Velasco, G., Ventura, M., Wallis, J. W., Wendl, M. C., Wilson, R. K., Lander, E. S. & Waterston, R. H. (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87.
  • Montgomery, S. (2009) Current computational methods for prioritizing candidate regulatory polymorphisms. Methods Mol Biol 569, 89–114.
  • Nieuwenhuys, R., Voogd, J. & Van Huijzen, C. (2007) The human central nervous system: a synopsis and atlas. Berlin: Springer-Verlag.
  • Notredame, C., Higgins, D. G. & Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205–217.
  • O’Brien, K. P., Remm, M. & Sonnhammer, E. L. (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33, D476–D480.
  • Ohno, S. (1970) Evolution by gene duplication. Berlin: Springer-Verlag.
  • Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. & Reich, D. (2006) Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103–1108.
  • Prince, V. E. & Pickett, F. B. (2002) Splitting pairs: the diverging fates of duplicated genes. Nat Rev Genet 3, 827–837.
  • Remm, M., Storm, C. E. & Sonnhammer, E. L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314, 1041–1052.
  • Sawyer, S. (1989) Statistical tests for detecting gene conversion. Mol Biol Evol 6, 526–538.
  • Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G. & Kasprzyk, A. (2009) BioMart – biological queries made easy. BMC Genomics 10, 22.
  • Spencer, C. C., Deloukas, P., Hunt, S., Mullikin, J., Myers, S., Silverman, B., Donnelly, P., Bentley, D. & Mcvean, G. (2006) The influence of recombination on human genetic diversity. PLoS Genet 2, e148.
  • Stauffer, R. L., Walker, A., Ryder, O. A., Lyons-Weiler, M. & Hedges, S. B. (2001) Human and ape molecular clocks and constraints on paleontological hypotheses. J Hered 92, 469–474.
  • Steiper, M. E. & Young, N. M. (2006) Primate molecular divergence dates. Mol Phylogenet Evol 41, 384–394.
  • Stewart, C. A., Horton, R., Allcock, R. J., Ashurst, J. L., Atrazhev, A. M., Coggill, P., Dunham, I., Forbes, S., Halls, K., Howson, J. M., Humphray, S. J., Hunt, S., Mungall, A. J., Osoegawa, K., Palmer, S., Roberts, A. N., Rogers, J., Sims, S., Wang, Y., Wilming, L. G., Elliott, J. F., De Jong, P. J., Sawcer, S., Todd, J. A., Trowsdale, J. & Beck, S. (2004) Complete MHC haplotype sequencing for common disease gene mapping. Genome Res 14, 1176–1187.
  • Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997) A genomic perspective on protein families. Science 278, 631–637.
  • Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
  • Traherne, J. A., Horton, R., Roberts, A. N., Miretti, M. M., Hurles, M. E., Stewart, C. A., Ashurst, J. L., Atrazhev, A. M., Coggill, P., Palmer, S., Almeida, J., Sims, S., Wilming, L. G., Rogers, J., De Jong, P. J., Carrington, M., Elliott, J. F., Sawcer, S., Todd, J. A., Trowsdale, J. & Beck, S. (2006) Genetic analysis of completely sequenced disease-associated MHC haplotypes identifies shuffling of segments in recent human history. PLoS Genet 2, e9.
  • Van Noort, V., Snel, B. & Huynen, M. A. (2003) Predicting gene function by conserved co-expression. Trends Genet 19, 238–242.
  • Wang, X., Grus, W. E. & Zhang, J. (2006) Gene losses during human origins. PLoS Biol 4, e52.
  • Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., Antonarakis, S. E., Attwood, J., Baertsch, R., Bailey, J., Barlow, K., Beck, S., Berry, E., Birren, B., Bloom, T., Bork, P., Botcherby, M., Bray, N., Brent, M. R., Brown, D. G., Brown, S. D., Bult, C., Burton, J., Butler, J., Campbell, R. D., Carninci, P., Cawley, S., Chiaromonte, F., Chinwalla, A. T., Church, D. M., Clamp, M., Clee, C., Collins, F. S., Cook, L. L., Copley, R. R., Coulson, A., Couronne, O., Cuff, J., Curwen, V., Cutts, T., Daly, M., David, R., Davies, J., Delehaunty, K. D., Deri, J., Dermitzakis, E. T., Dewey, C., Dickens, N. J., Diekhans, M., Dodge, S., Dubchak, I., Dunn, D. M., Eddy, S. R., Elnitski, L., Emes, R. D., Eswara, P., Eyras, E., Felsenfeld, A., Fewell, G. A., Flicek, P., Foley, K., Frankel, W. N., Fulton, L. A., Fulton, R. S., Furey, T. S., Gage, D., Gibbs, R. A., Glusman, G., Gnerre, S., Goldman, N., Goodstadt, L., Grafham, D., Graves, T. A., Green, E. D., Gregory, S., Guigo, R., Guyer, M., Hardison, R. C., Haussler, D., Hayashizaki, Y., Hillier, L. W., Hinrichs, A., Hlavina, W., Holzer, T., Hsu, F., Hua, A., Hubbard, T., Hunt, A., Jackson, I., Jaffe, D. B., Johnson, L. S., Jones, M., Jones, T. A., Joy, A., Kamal, M., Karlsson, E. K., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.
  • Yang, Z. (2006) Computational molecular evolution. Oxford: Oxford University Press.
  • Zhang, F., Carvalho, C. M. & Lupski, J. R. (2009) Complex human chromosomal and genomic rearrangements. Trends Genet 25, 298–307.

Supporting Information

Table S1. Human lineage gene duplications.

Filename: AHG_609_sm_t1.doc; Size: 234K; Description: Supporting info item