How to usefully compare homologous plant genes and chromosomes as DNA sequences

Authors


(fax +1 501 642 4995; e-mail elyons@nature.berkeley.edu).

Summary

There are four sequenced and publicly available plant genomes to date. With many more slated for completion, one challenge will be to use comparative genomic methods to detect novel evolutionary patterns in plant genomes. This research requires sequence alignment algorithms to detect regions of similarity within and among genomes. However, different alignment algorithms are optimized for identifying different types of homologous sequences. This review focuses on plant genome evolution and provides a tutorial for using several sequence alignment algorithms and visualization tools to detect useful patterns of conservation: conserved non-coding sequences, false positive noise, subfunctionalization, synteny, annotation errors, inversions and local duplications. Our tutorial encourages the reader to experiment online with the reviewed tools as a companion to the text.

Introduction

Comparative genomics is founded on the assumption that much of life’s language is contained in its linear DNA sequence. Comparisons of genomic DNA sequences – be they from different species or, as with polyploids, within the same nucleus – present one way to understand the syntax and vocabulary of this language. One great advantage of using whole rather than partial genome sequence is that comparisons may be made between the most closely related genes or regions in the genomes compared. Homologous genes or chromosomal regions are similar because they share a common ancestor, but the closest homolog can be inferred only by finding homologs within regions that contain a similar pattern of gene content. If these best homologous DNA sequences are from different organisms, they are called orthologs (with some exceptions). If homologous sequences are within one genome, they are called paralogs. A special case of paralogy results from polyploidy. Duplicated genes or chromosomal regions derived from polyploidy are called homeologs. Comparisons among homeologs are routine when working with angiosperms. The fundamentals of comparative genomics and its nomenclature have been reviewed elsewhere (Koonin, 2005). Some definitions particularly important for plant scientists are given in Table 1.

Table 1.   Comparative genomic definitions of special relevance to angiosperms
Orthologs: A pair of homologous genes or chromosomal regions derived from the same syntenous chromosomal positions in different species. Additional gene duplications and/or losses following speciation may result in complex relationships between sets of orthologous genes.
Plant CNS: A protocol for identifying conserved non-coding sequences (CNSs) in plants using a pair-wise blastn (Altschul et al., 1990) to identify high-scoring segment pairs (HSPs) between the non-protein-coding sequences near usefully diverged, orthologous or homeologous genes. These sequences are at least 15-bp long with an e-value equal to or more significant than a 15/15 exact nucleotide match (Inada et al., 2003; Kaplinsky et al., 2002). Other equally sensitive alignment algorithms can be substituted, as long as the 15/15 exact match significance cut-off is used.
Homeologs: A pair of genes retained following polyploidy, identified by residing in syntenic regions of the chromosome. Being duplicates within the same organism, homeologs are a special case of paralogs, but all homeologs occurred contemporaneously, whereas the history of local gene duplicates is obscured by gene conversion.
Fractionation: The mechanism by which a duplicated gene, chromosomal segment or genome tends to return to pre-duplication gene content, but not necessarily retain its pre-duplication gene order. Fractionation is the loss of one or the other of the initial homeologs, but not of both. The process of fractionation is associated with chromosomal rearrangements and ‘transcriptome shock’ (Wang et al., 2006), and may help cluster dose-sensitive genes (Thomas et al., 2006).
Plant αCNS: As CNS above, but the chromosomal regions are homeologous (syntenous and paralogous) remnants of the most recent tetraploidy event (α) in the lineage (Thomas et al., 2006). blast results for α pairs in Arabidopsis are displayed and may be researched in a custom viewer: http://synteny.cnr.berkeley.edu/AtCNS. Subfunctionalization, defined in the text, is expected of homeologous pairs, but not in orthologous pairs. Furthermore, homeologs are under different selective constraints as compared with orthologs.
Genespace: Genespace is defined here as the space of an individual gene, which is a computational surrogate for ‘cistron’, where the total genespace of a genome is the sum of all of the genespaces of its genes. This gene-level genespace is computed after CNSs have been identified for a syntenic genomic region, and each CNS has been sorted to a gene: the segment of genome between the most 5′ (upstream) and most 3′ CNS, untranslated region (UTR) or feature, plus approximately 500 bp on each side (depending on neighboring features; Thomas et al., 2007). Within a genespace are exons, UTRs, CNSs, known motifs, positions where specific transcription factor binding sites reside and any feature that is fixed at a chromosomal locus. This non-standard term has little use in mammals because CNSs are difficult to sort to individual genes, but is particularly useful for plant research.
Phylogenetic footprint: The most inclusive term for the conserved sequence between two or more sequences without stipulations as to the extent of divergence. A CNS is a type of phylogenetic footprint.
Local alignment algorithm: Computational method to identify local regions of sequence similarity between two or more biological sequences, where hits may or may not be collinear, and may be on either strand.
Global alignment algorithm: Computational method to find the best possible alignment between two or more biological sequences that extends across the entire length of all sequences on the same DNA strand. If settings are not stringent enough, noise can look like syntenic conserved regions because global algorithms make all alignments collinear.

Comparison of biological parts to identify similarities is an ancient preoccupation originally used to order the natural world. The Linnaean classification system makes a fine example. Later, these comparisons were used to identify possible evolutionary trends. One such eukaryotic trend is towards increasing maximums of morphological complexity (Freeling and Thomas, 2006), but there are many more. The modern synthesis of Darwin’s natural selection, genetic laws and some principles of population genetics (see Mayr, 1993) provides the most popular logic by which genomes are compared. It is known that genomic DNA sequences have regions with and without function, and different functional regions confer their function through different means. DNA may function by indirectly encoding protein, directly encoding RNA, binding macromolecules, directing or modifying the movement of regulatory molecules or being epigenetically modified. Compounding the matter, some DNA may have multiple simultaneous functions. Reducing a primary DNA sequence to a particular set of biologically meaningful structures is daunting (Pearson, 2006). However, it is sometimes possible to find something meaningful about the biological function of DNA with incomplete structural knowledge by comparison with a related DNA sequence. It is at this juncture that comparative genomics can be useful because DNA that functions, without regard to mechanism, tends to have its primary sequence evolutionarily conserved (Hardison, 2000, 2003).

Our purpose is limited. We have prepared a tutorial of DNA sequence comparison algorithms and data visualization tools commonly used by plant researchers. Using these tools, we identify the types of information that can be acquired, show how the ability to change alignment algorithms and parameters is crucial for discovery and illustrate how visualization of the results is almost as important as the resolution of the alignment algorithm itself.

Plant (angiosperm) genomes are known to be different from mammalian or any other animal genomes in several important ways. Recent and ancient polyploidy is widespread among angiosperms. The former may be deduced from chromosome counts (Adams and Wendel, 2005), whereas detecting the latter requires a nearly complete genome sequence. Within the fully-sequenced genomes of Arabidopsis thaliana, poplar, rice and grape are the remnants of at least two ancient tetraploidies (Adams and Wendel, 2005; Bowers et al., 2003; De Bodt et al., 2005; Jaillon et al., 2007; Paterson et al., 2005; Tuskan et al., 2006). Ancient tetraploidies cannot be inferred from chromosome counts because fractionation, the mechanism of genomic content loss that naturally follows all types of DNA duplications, often returns a polyploid to a chromosomal number and gene count more like that of its pre-polyploid ancestor. In addition to polyploidy, plant genomes contain much transposon-derived DNA. Such DNA is usually only a few million years old, and is often found both in locally repeated blocks and spread throughout the genome. Researchers must be aware that sequences of this highly repetitive nature often obfuscate comparisons among and within plant genomes. Finally, the region around genes that contains additional non-protein coding functional sequence is structured differently in angiosperms as compared with mammals. These sequences are identified by comparison of duplicated chromosomal regions, and are often called conserved non-coding sequences (CNSs; Table 1).

Mammalian CNSs are approximately 10 times larger and are much more numerous than plant CNSs when using alignment cut-offs appropriate for plant CNS discovery (Kaplinsky et al., 2002). Were the most popular CNS alignment cut-off used in animal research (100-bp long with >70% identity; Loots et al., 2000) applied to plants, plants would have nearly zero CNSs (Gao and Innan, 2004; Inada et al., 2003; Thomas et al., 2007). It follows that CNSs are more deeply conserved in the vertebrate lineage than in the angiosperm lineage. Vertebrates have over a thousand enhancer-like conserved non-coding sequences that have been conserved since the divergence of human and fish 450 Mya (Goode et al., 2005; Ovcharenko et al., 2005b; Siepel et al., 2005; Woolfe et al., 2005). Although those most-conserved plant CNSs may also operate as enhancers, plants do not have such deeply conserved CNSs (Freeling et al., 2007). As originally observed by Kaplinsky et al. (2002), mammalian CNSs often occur continuously down a chromosome, so the assignment of any one of them to a particular gene is not possible using spacing alone. Work on maize–rice (Inada et al., 2003; Guo and Moose, 2003), Brachypodium–rice (Bossolini et al., 2007) and especially alignments of the two most recent post-tetraploid genomes within Arabidopsis (Freeling, 2007; Thomas et al., 2007) all demonstrate that almost all plant CNSs cluster near one gene: this cluster of conservation has been used to estimate what we call a single ‘genespace’. This non-standard term is defined in Table 1. The reason why plants, as compared with mammals, have less conserved sequence between genes is not known, but this articulated pattern of conservation permits assigning CNSs to genes, and this information is powerful. For example, the most CNS-rich genes in Arabidopsis are transcription factors known to be necessary for response to environmental signals (Freeling et al., 2007).

The concept of ‘synteny’ is essential to any comparison of homologous genes or chromosomes. Given the inherent complexity of this term, the definition that follows is simply how we use this term. Two or more once-duplicated sequences are said to be ‘syntenic’ when it is possible, using extant genomic data, to reconstruct a valid ancestral sequence from which the sequences originated. When two chromosomal regions have mainly co-linear genes or other features, they are obviously derived from a common ancestral genomic region and are considered syntenic. In reality, duplication is followed by an evolutionary winnowing process (called ‘fractionation’, see Table 1) that includes gene loss, inversions, translocations, insertions, deletions and epigenetic marks. This results in a loss of collinearity of genes and other features, but it is often possible to reconstruct a putative ancestor nevertheless. An outgroup genome is often necessary to prove synteny, especially if the remaining duplicate regions share zero or few conserved sequences. When this reconstruction is possible, the duplicate regions are called ‘syntenous’ or ‘syntenic’. Only the post-duplication movement of single genomic features to another genomic region destroys our ability to detect synteny. Because of ancient polyploidies in all angiosperms, synteny is evidenced within plant genomes, and not just between them. The tutorial that follows uses synteny between homeologous (i.e. syntenic, paralogous; Table 1) genomic features in several ways to identify patterns of evolution in plant genomes.

Duplication may be of varying degrees of completeness: local (tandem), segmental, whole chromosome and whole genome (polyploidy). Each sort of duplication has very different selective constraints (Koonin, 2005) and dosage effect/compensation expectations (Birchler et al., 2005; see Freeling and Thomas, 2006). For example, prevalent gene conversion makes comparisons among locally duplicated genes challenging because it unlinks their date of origination from their observed degree of divergence (exemplified in yeast; Gao and Innan, 2004).

Some homologous DNA sequence comparisons are meaningful only if the DNA sequences have diverged to a ‘useful’ level. In theory, DNA sequence without specific function will either accumulate point mutations at the background rate of the region or may be deleted altogether (if such a mechanism operates). The former is true for many third codon position base substitutions in the protein coding sequence, and is true for all of the non-functional sequence. As the non-functional sequence changes more quickly than the functional sequence, there is a point in evolutionary time that conservation of the sequence is evidence of function. Conversely, if the level of sequence divergence is small, conservation is expected because of carry-over. Although adequate divergence is essential, it is important to realize that there can be too much. For example, homologous regulatory sites are known to lose sequence similarity even though binding function is conserved (called ‘binding site turnover’; Ludwig et al., 2005; Moses et al., 2006).

There is certainly a window of useful divergence when comparing plant non-coding sequences. The three first papers on plant CNS discovery (Guo and Moose, 2003; Inada et al., 2003; Kaplinsky et al., 2002) established the maize–rice divergence time as being appropriate for CNS detection, and argued that maize–rice diverged to approximately the same extent as mouse–man. Sufficient divergence for detecting CNSs in plants is indicated when the average blast high-scoring segment pair (HSP) between orthologous/homeologous coding regions is approximately 85% identical in nucleotide sequence (unpublished rule-of-thumb, M. Freeling). However, the definitive ‘find plant CNS’ settings may always be adjusted so that only significant non-coding alignments are detected.
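As a rough illustration of the 85% rule of thumb above, the following Python sketch computes percent identity for coding HSPs that have already been aligned, and pools the result across HSPs. The function names, the toy sequences and the convention of counting gap columns as non-identical are assumptions made for this example; they are not taken from the cited papers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two aligned strings of equal length ('-' marks a gap).

    Assumption: gap columns count towards the alignment length, so a gap
    lowers identity just as a mismatch does.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def pooled_identity(hsps) -> float:
    """Length-weighted percent identity over several (aligned_a, aligned_b) HSPs."""
    total = sum(len(a) for a, _ in hsps)
    if total == 0:
        return 0.0
    matched = sum(percent_identity(a, b) * len(a) / 100.0 for a, b in hsps)
    return 100.0 * matched / total


# Toy coding HSPs (hypothetical sequences, for illustration only).
coding_hsps = [
    ("ACGTACGTAC", "ACGTTCGTAC"),  # 9 of 10 columns identical
    ("ACGT-CGT", "ACGTACGT"),      # 7 of 8 columns identical
]
print(round(pooled_identity(coding_hsps), 1))
```

A pooled value near 85% would suggest, under the rule of thumb quoted above, that the pair has diverged enough for non-coding conservation to be informative.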

Subfunctionalization is the natural process whereby duplicate cis-acting units of function (e.g. exons and enhancers) tend to lose dispensable sequences in a compensatory fashion (Force et al., 1999; Lynch and Force, 2000). This results in the full set of functions of the ancestral gene being divided between both duplicates so no one gene is complete. Subfunctionalization of cis-acting regulatory DNA sequences has been noticed in plants (Haberer et al., 2004; Langham et al., 2004). When paralogs are aligned, subfunctionalized regions of the sequence cannot be seen because they exist in only one of the two duplicates. An appropriate outgroup capable of better representing the ancestor of the duplicates is required to identify subfunctionalized sequences.

Comparing genomic sequence

The workhorses of comparative genomics are sequence alignment algorithms. Sequence alignment algorithms break into two major classes: global and local. Global alignments (Needleman and Wunsch, 1970) generate the best alignment across the whole length of the sequences, whereas local alignments (Smith and Waterman, 1981) find as many best-subsequence alignments as possible. The usefulness of results generated by these two classes of alignment algorithms depends on the type of genomic region analyzed, and how much false-positive noise is retained in the results. In general, if the compared regions are believed to be similar across their entire lengths, then global alignment algorithms are preferred. Cases of inversion and local duplication – both common in syntenic regions – violate this assumption of collinearity, and local alignment algorithms are generally preferred. Also, different algorithms in each class are optimized for different alignment tasks. To highlight these differences in algorithm classes and optimizations, we use six alignment algorithms (three global and three local) in the tutorial. These are listed in Table 2.

Table 2.   Alignment and visualization tools used in the tutorial
Alignment algorithm | Algorithm type | Visualization | Web service
Avid (Bray et al., 2003) | Global | Vista (Mayor et al., 2000) | http://genome.lbl.gov/vista
blastn (Altschul et al., 1990; Tatusova and Madden, 1999) | Local | GeLo* | http://synteny.cnr.berkeley.edu/CoGe/GEvo.pl
blastz (Schwartz et al., 2000, 2003) | Local | GeLo* | http://synteny.cnr.berkeley.edu/CoGe/GEvo.pl
DiAlign (Morgenstern, 1999; Morgenstern et al., 1998; Pohler et al., 2005) | Global | abc (Couper et al., 2004) | http://dialign.gobics.de
Lagan/Shuffle Lagan (Brudno et al., 2003a,b) | Global | Vista | http://genome.lbl.gov/vista
Mulan (Ovcharenko et al., 2005a) | Local | Mulan | http://mulan.dcode.org
*GeLo is our own visualization package (to be published elsewhere).
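To make the global versus local distinction described in this section concrete, the Python sketch below computes the optimal score under a Needleman and Wunsch style recurrence and a Smith and Waterman style recurrence. It is a teaching sketch only: it uses a simple linear gap penalty (an assumption; the tools in Table 2 use more elaborate scoring and heuristics), the default match and mismatch scores are chosen for illustration, and the test sequences are invented.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -2, gap: int = -2) -> int:
    """Best score for aligning all of a against all of b (global alignment)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -2, gap: int = -2) -> int:
    """Best score over all pairs of subsequences of a and b (local alignment)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# A short conserved block embedded in unrelated flanking sequence (invented data).
motif = "GATTACAGATTACA"
seq_a = "TTTTTTTT" + motif + "CCCCCCCC"
seq_b = "GGGGGGGG" + motif + "AAAAAAAA"
print("global:", global_score(seq_a, seq_b))
print("local: ", local_score(seq_a, seq_b))
```

The global score is dragged down by the unrelated flanks, whereas the local score reflects only the conserved block, which is the behavior that matters when short conserved segments are scattered through otherwise diverged sequence.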

Visualization software

We are beginning to see the development of modular alignment visualization software that can be used with the output from any sequence alignment algorithm. Vista (Mayor et al., 2000) is a prime example of this paradigm and has been used for visualizing alignment results from several algorithms. Similarly, we have developed our own genome visualization module, GeLo, which we use in the tutorial for visualizing blast results, and which is now being used to display the results from several alignment algorithms.

Tutorial

In the tradition of online tutorials for web applications, the following short manual is written colloquially. The reader becomes ‘you’ at this point in the discourse and we provide you links to our web application for regenerating our examples and figures, as well as generating sequence and annotation sets for import into other comparative genomic tools (Table 3). There are many tools available and these have been reviewed elsewhere (recently by Pollard et al., 2006). To illustrate differences and similarities between different commonly used alignment algorithms, we chose six of them for our examples (Table 2). Please note that there are several ways to import sequences (and associated annotations) into our, and most other, web applications: retrieval from a local database, import from GenBank via an accession number, or directly submitting a sequence in FASTA or GenBank format.

Table 3.   Links for regenerating and modifying the blastn and blastz examples used in the tutorial, and for obtaining sequence and annotations files in FASTA and GAF (Gene Annotation Format) format, respectively
Figure 1. CNS detection | a: http://tinyurl.com/386hqz | b: http://tinyurl.com/38l482
Figure 2. HSP spike filter | a: http://tinyurl.com/2rf4zc | b: http://tinyurl.com/398sq4 | c: http://tinyurl.com/2kkx7t | d: http://tinyurl.com/3ajl3m
Figure 3. Subfunctionalization | http://tinyurl.com/2jyy3k
Figure 4. Synteny | a: http://tinyurl.com/3cb2pp | b: http://tinyurl.com/2nd3ar | c: http://tinyurl.com/3agtrp
Figure 5. Annotation error | a: http://tinyurl.com/2ngomn | b: http://tinyurl.com/35p2bw
Figure 6. Inversion | a: http://tinyurl.com/2o5krw
Figure 7. Local duplication | a: http://tinyurl.com/3yec97 | b: http://tinyurl.com/2psnck
These output files may be exported to other sequence analysis applications.

Detecting CNSs

Figure 1 shows the results from the algorithms in Table 2 applied to a pair of genespaces in Arabidopsis: these genespaces from chromosomes 2 and 4, which contain several kb of sequence, were chosen because they are homeologous and, using blastn terminology (Figure 1a), they contain 14 short HSPs and three coding sequence (CDS; reading frame) HSPs. Almost all HSPs are collinear. These are typical plant CNSs (except that HSP2 is actually a microRNA gene), and any detailed analysis of these genes should detect these dispersed, short stretches of non-coding DNA sequence conservation. This blastn analysis (Figure 1a), using the settings and noise filter defined in Table 1, is our baseline for comparison with the other five alignment algorithms. For each of the other algorithms, only the alignment to the genespace on chromosome 4 is shown.

Figure 1.

 Detecting conserved non-coding sequences (CNSs) in plants with various sequence comparison algorithms, settings and visualization software.
Each analyzes the genespace from an Arabidopsis pair of transcription factor genes (TAIR version 7 At2g18550 and At4g3740) derived from its most recent polyploidy.
(a) Shows alignment to both genomic regions; (b–g) shows alignment to the genespace of At4g3740 only.
(a) blastn using CNS discovery settings for plants (-W 7 -G 5 -E 2 -q -2 -r 1) and a 15-bp spike sequence; GeLo visualization. CNSs identified by Thomas et al. (2007) are highlighted by blue double arrows.
(b) blastz (default settings) and GeLo visualization.
(c) Mulan using mammalian CNS discovery settings of 100 bp, 70% sequence identity.
(d) Mulan using plant CNS discovery settings of 15 bp, 70% sequence identity.
(e) Chaos-DiAlign (default) and abc visualization.
(f) Avid (default settings) and Vista visualization.
(g) Lagan (default settings) and Vista visualization. Although alignment figures were aligned with respect to one another for easy cross-comparison of identified regions of similarity, the output from DiAlign introduced gaps in the genomic region from chromosome 4 that extended the length of this region.

blastz (Figure 1b) identified the pair of homeologous coding sequences with a single HSP that covered the entire gene model and extended into its 5′ non-coding sequence. Although this 5′ extension covered one CNS identified by blastn (HSP8, Figure 1a), it failed to find any distal CNSs. Although we can conclude that blastz can easily identify putative gene homologs, it is not appropriate for finding plant CNSs.

Mulan, another local alignment tool, provides a Vista-like visualization of results for identifying CNSs that can filter the alignments based on the minimum length of CNSs and their percentage sequence similarity. Applying animal CNS settings (100-bp length, 70% identity), Mulan identified the 5′-most cluster of CNSs (Figure 1c), but did not find the intervening CNSs identified by blastn. We changed the Mulan filter to be similar to plant CNS settings by lowering its minimum length to 15 bp (Figure 1d). Although the 5′ CNS cluster covered more sequence, Mulan still missed the same CNSs as with the animal settings.
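The effect of a length and identity cut-off can be sketched in a few lines of Python. This is a simplified stand-in, not Mulan's implementation: each conserved segment is reduced to a length and a percent identity, the thresholds are the two settings quoted above, and the segments themselves are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    name: str
    length: int      # aligned length in bp
    identity: float  # percent identity over that length


def passes(segment: Segment, min_length: int, min_identity: float) -> bool:
    """True if the segment meets both the length and the identity cut-off."""
    return segment.length >= min_length and segment.identity >= min_identity


# Cut-offs quoted in the text; the dictionary names are our own labels.
ANIMAL_SETTINGS: Dict[str, float] = {"min_length": 100, "min_identity": 70.0}
PLANT_LIKE_SETTINGS: Dict[str, float] = {"min_length": 15, "min_identity": 70.0}

# Hypothetical conserved segments, for illustration only.
segments: List[Segment] = [
    Segment("cluster_5prime", 120, 78.0),
    Segment("short_1", 18, 94.4),
    Segment("short_2", 22, 86.4),
    Segment("weak", 40, 62.5),
]

for label, settings in (("animal", ANIMAL_SETTINGS), ("plant-like", PLANT_LIKE_SETTINGS)):
    kept = [s.name for s in segments if passes(s, **settings)]
    print(label, kept)
```

In this toy data set, short but well-conserved segments survive only the 15-bp setting. A filter of this kind can only keep or discard segments that the underlying alignment reported in the first place, so lowering the cut-off does not recover segments the aligner never found.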

Figure 1(e–g) show the results of the global alignment algorithms DiAlign (which uses the local alignment algorithm Chaos for anchors; Brudno and Morgenstern, 2002), Avid and Lagan. All these algorithms identified the pair of homeologous genes and the 5′ distal CNS cluster. Chaos-DiAlign and Lagan found several of the intervening CNSs identified by blastn (although the Chaos-DiAlign server did not support adding gene annotations to the abc visualization). Although blastn may not be the most appropriate alignment algorithm for all comparative genomics problems, this comparison shows that it performs well for detecting plant CNSs.

Where is the noise?

blastn reduces false-positive noise in its alignments (i.e. HSPs, blast ‘hits’) by using the concept of an expect value (e-value). An in-depth discussion of the e-value calculation in blast is beyond the scope of this review, so please see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html for full details. To facilitate CNS research, Thomas et al. (2007; see also the Arabidopsis CNS website, http://synteny.cnr.berkeley.edu/AtCNS) devised a heuristic method to efficiently filter noise from genespaces of various lengths. These workers (from this laboratory) added an identical sequence of known length (called a ‘spike’ sequence) to the 3′ end of the compared sequences. Using blastn to generate HSPs, they identified the HSP containing the spike sequence and removed all other HSPs with a greater (less significant) e-value. They found that a 15-bp spike sequence eliminated most of the noise from their analyses. Figure 2 shows a short syntenic region within Arabidopsis subjected to various levels of noise filtration using spike sequences of various lengths. Note that the filter with a 15-bp spike sequence eliminates all noise from the analysis (Figure 2c,d), leaving four CNSs. We leave it to you to try various other spike sequence lengths and gap/mismatch penalties using the links in Table 3.
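The spike filter described above amounts to using the e-value of one known HSP as the cut-off for all others. The Python sketch below shows that logic under stated assumptions: HSPs are already parsed into records, the spike HSP is marked with a flag (in practice it would be recognized by its coordinates at the 3′ end), the spike HSP itself is dropped from the output, and all e-values are invented.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    name: str
    evalue: float
    is_spike: bool = False  # hypothetical flag marking the HSP that covers the spike


def spike_filter(hsps: List[HSP]) -> List[HSP]:
    """Keep HSPs whose e-value is no greater than that of the spike HSP."""
    spike = next((h for h in hsps if h.is_spike), None)
    if spike is None:
        raise ValueError("no spike HSP found; cannot derive an e-value cut-off")
    # The spike is an artificial match, so it is removed from the result.
    return [h for h in hsps if not h.is_spike and h.evalue <= spike.evalue]


# Invented e-values, for illustration only.
hits = [
    HSP("CDS_exon", 1e-40),
    HSP("cns_a", 2e-5),
    HSP("spike_15bp", 3e-3, is_spike=True),
    HSP("noise_1", 0.8),
    HSP("noise_2", 4.1),
]
print([h.name for h in spike_filter(hits)])
```

Because e-values depend on the lengths of the compared sequences, deriving the cut-off from a spike placed in the same comparison adapts the threshold to each pair of genespaces, which is the point of the heuristic.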

Figure 2.

 Rising above the noise.
Conserved non-coding sequence (CNS) discovery in Arabidopsis between the homeologous pair of genes At1g01030 and At4g01500, including 5000 nucleotides upstream and downstream of each gene. blastn was used to find regions of sequence similarity (-W 7 -G 5 -E 2 -q -2 -r 1). These comparisons use a high-scoring segment pair (HSP) filter devised by Thomas et al. (2007), based on e-value cut-off values calculated by spiking the sequences with a known exact match sequence of variable length, and removing any HSP with an e-value greater than the HSP containing the spike sequence. Empirically evaluating the results shows that the 15-bp spike sequence is appropriate for removing noise from the analysis.
(a) e-value cut-off based on a 12-bp spike sequence.
(b) e-value cut-off based on a 13-bp spike sequence.
(c) e-value cut-off based on a 14-bp spike sequence.
(d) e-value cut-off based on a 15-bp spike sequence.

Subfunctionalization of CNSs

Now that we can detect CNSs in homeologous genespaces, we will extend this to include an outgroup sequence for the purpose of identifying CNSs that are shared or fractionated/subfunctionalized. Figure 3 shows blastn comparisons with a 15-bp spike sequence of two homeologous maize genes (liguleless-like transcription factors retained from a tetraploidy that happened approximately 15 Ma) to a rice ortholog. In this example, each maize gene has several CNSs in common with rice that are not shared in the homeologous genespace (CNSs highlighted with purple ovals), which is evidence for subfunctionalization of the non-coding sequence of the maize genes.
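The comparison logic in this example can be sketched as interval bookkeeping on the outgroup (reference) sequence. In the Python sketch below, each homeolog contributes the reference intervals where it has a CNS in common with the outgroup; intervals seen from both homeologs are shared, and intervals seen from only one are candidates for fractionated or subfunctionalized CNSs. The function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the reference sequence, inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def partition_cns(hits_a: List[Interval], hits_b: List[Interval]):
    """Split reference-anchored CNS hits into shared, only-in-A and only-in-B."""
    shared = [iv for iv in hits_a if any(overlaps(iv, o) for o in hits_b)]
    only_a = [iv for iv in hits_a if not any(overlaps(iv, o) for o in hits_b)]
    only_b = [iv for iv in hits_b if not any(overlaps(iv, o) for o in hits_a)]
    return shared, only_a, only_b


def complementary_loss(hits_a: List[Interval], hits_b: List[Interval]) -> bool:
    """True if each homeolog keeps at least one outgroup CNS that the other lacks."""
    _, only_a, only_b = partition_cns(hits_a, hits_b)
    return bool(only_a) and bool(only_b)


# Hypothetical reference coordinates for CNSs found in each homeolog.
homeolog_1 = [(100, 130), (400, 420), (900, 940)]
homeolog_2 = [(105, 128), (600, 625), (905, 930)]

shared, only_1, only_2 = partition_cns(homeolog_1, homeolog_2)
print("shared:", shared)
print("only homeolog 1:", only_1)
print("only homeolog 2:", only_2)
print("complementary loss:", complementary_loss(homeolog_1, homeolog_2))
```

A complementary pattern of this kind is consistent with subfunctionalization but does not prove it; a missing hit may also reflect alignment sensitivity or a transposon insertion.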

Figure 3.

 Subfunctionalization of conserved non-coding sequences (CNSs) illustrated via GenBank accession numbers.
blastn comparison of three homologous gene regions using plant CNS settings and a 15-bp spike. Two maize homeologs are compared with a rice outgroup ortholog (GenBank accessions AY180106, AY180107 and AP003287, respectively). GEvo permits using the reverse complement of any sequence (along with its annotations), and permits selecting a reference sequence (in this case, rice). blast high-scoring segment pairs (HSPs) are blue and green numbered boxes. CNSs that have subfunctionalized are indicated within purple ovals. Subfunctionalization within these genespaces has been discovered previously (Langham et al., 2004). Note the ‘holes’ in the 5′ region of the upper maize homeolog (AY180106): these are probably recent transposon insertions.

Synteny demonstration

Figure 4(a,b) visualize synteny between two intragenomic regions of Arabidopsis using blastn and blastz, respectively. In both analyses, there are six pairs of genes that share a high degree of sequence similarity (blue double arrows) and are collinear, demonstrating synteny. However, the results from blastz are easier to interpret visually. There are several genes in each region that do not have a corresponding homeolog (purple ovals), which we assume to be the result of fractionation.
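Collinearity of matched gene pairs, the visual cue used here, can also be checked numerically. The Python sketch below takes matched pairs as (position in region A, position in region B) and returns the length of the longest chain that increases in both regions, using a standard longest-increasing-subsequence routine. The positions are invented, and the sketch assumes that both regions are given in the same orientation (one region may first need to be reverse complemented).

```python
from bisect import bisect_left
from typing import List, Tuple


def longest_collinear_chain(pairs: List[Tuple[int, int]]) -> int:
    """Length of the longest chain of pairs strictly increasing in both coordinates."""
    # Sort by position in A; for ties in A, sort B in descending order so that
    # two pairs sharing an A position can never both enter the same chain.
    ordered = sorted(pairs, key=lambda p: (p[0], -p[1]))
    tails: List[int] = []  # tails[k] = smallest B end of any chain of length k + 1
    for _, pos_b in ordered:
        k = bisect_left(tails, pos_b)
        if k == len(tails):
            tails.append(pos_b)
        else:
            tails[k] = pos_b
    return len(tails)


# Invented gene-order positions for six matched pairs, all collinear.
collinear_pairs = [(1, 2), (3, 5), (4, 6), (7, 9), (8, 12), (10, 13)]
# The same idea with one pair moved out of order.
rearranged_pairs = [(1, 2), (3, 5), (4, 1), (7, 9)]

print(longest_collinear_chain(collinear_pairs))
print(longest_collinear_chain(rearranged_pairs))
```

Gaps in the position numbers stand for genes without a partner in the other region, so a long chain with gaps corresponds to a syntenic pair of regions that has lost genes.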

Figure 4.

 Detecting synteny and fractionation.
(a) blastn and (b) blastz sequence comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis. Upper region, chromosome 1 identified by gene At1g07300; lower region, chromosome 2 identified by gene At2g29640. These regions comprise six pairs of homeologous genes (blue double arrows). The upper and lower regions have five and six genes, respectively, that do not have homeologs (purple ovals), and the lower region has one annotated pseudogene (orange oval). Blue numbered boxes mark regions of sequence similarity identified by blast.
(c) Comparison of the two intragenomic syntenic regions from (a) with a syntenic outgroup sequence from Vitis vinifera anchored by gene GSVIV00024149001. Green and red numbered boxes are blastz high-scoring segment pairs (HSPs) between the intragenomic regions of the in-group and the outgroup sequence. Red and green ovals and arrows identify genes and their orthologous regions in the outgroup sequence. Purple ovals identify genes not present in the outgroup sequence; the orange oval is a pseudogene. By comparison with an outgroup sequence, fractionation of gene content becomes apparent. All homeologous gene pairs and the majority of singlet genes are represented in the outgroup sequence. Notice that many annotated grape genes are not represented in the two Arabidopsis chromosomes shown. This is expected because two tetraploidies occurred along the Arabidopsis lineage, whereas none happened along the grape lineage; there is another equally syntenic pair of Arabidopsis chromosomal regions that are the fractionation products of this segment (Jaillon et al., 2007).

To demonstrate fractionation, comparison with an outgroup sequence is necessary. For understanding intragenomic fractionation, such an outgroup would ideally have diverged before the intragenomic duplication event and not undergone a duplication event of its own. Figure 4(c) shows an example of this using Vitis vinifera (grape; Jaillon et al., 2007). In this example, although the intragenomic regions share a subset of their gene content, the unannotated outgroup contains the majority of the gene content and evidences fractionation. In addition, the outgroup comparison allows us to infer the pre-duplication ancestral state of the intragenomic syntenic regions, and track which genes have been preserved as singlets or retained as duplicates.
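The bookkeeping behind this inference can be sketched as follows. Given the ordered genes of the outgroup region and, for each duplicated region, the set of outgroup genes with a detected ortholog, each outgroup gene is labeled as retained in both regions, retained as a singlet, or absent from both. The gene names are placeholders and the function is a hypothetical helper, not part of any tool in Table 2.

```python
from typing import Dict, Iterable, Set


def classify_retention(
    outgroup_genes: Iterable[str], hits_a: Set[str], hits_b: Set[str]
) -> Dict[str, str]:
    """Label each outgroup gene by its presence in duplicated regions A and B."""
    labels: Dict[str, str] = {}
    for gene in outgroup_genes:
        in_a, in_b = gene in hits_a, gene in hits_b
        if in_a and in_b:
            labels[gene] = "retained as duplicate"
        elif in_a:
            labels[gene] = "singlet in A"
        elif in_b:
            labels[gene] = "singlet in B"
        else:
            labels[gene] = "absent from both"
    return labels


# Placeholder gene names in outgroup order (hypothetical data).
outgroup = ["g1", "g2", "g3", "g4", "g5"]
region_a_hits = {"g1", "g2", "g4"}
region_b_hits = {"g1", "g3", "g4"}

for gene, label in classify_retention(outgroup, region_a_hits, region_b_hits).items():
    print(gene, label)
```

Genes absent from both regions are not necessarily lost from the genome: when more than one polyploidy lies between the outgroup and the in-group, they may reside in another syntenic pair of regions.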

Expect errors in genomic sequences and annotation

If you examine the homeologous gene pair At1g07300 and At2g29640 from the previous example (Figure 4b, yellow exons in gene models), you will notice that these sequences have been assigned very different exon structures with respect to one another. Looking at their shared sequence similarity, you will notice that the blastz HSP covers and extends beyond the 5′ end of At1g07300, and partially covers the gene model of At2g29640. This difference may indicate that the genes are evolving in a unique fashion or that an annotation error was made. In either case, this pair of homeologs needs closer inspection. Although Arabidopsis gene models are certainly the best current models in plants, many Arabidopsis gene models are incorrect (Thomas et al., 2007).

Figure 5(a) shows a pair-wise analysis of the two Arabidopsis regions using blastn. There are two clusters of HSPs: one set (HSPs 1 and 2) covers the entire coding region of At1g07300 and the 3′ exon of At2g29640, and the other set (HSPs 3, 4, 5, 6 and 7) covers the 5′ region of At1g07300 and the intronic region of At2g29640, with one HSP overlapping a middle coding exon. The general lack of congruence between the gene models for these homeologs, together with the odd placement of their sequence similarity, suggests an annotation error. Checking their annotations at TAIR (http://www.arabidopsis.org), there is full-length cDNA support for At1g07300 and none for At2g29640. This implies that the gene model for At1g07300 is correct and that At2g29640 has an annotation error.
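One way to make this kind of inspection less subjective is to ask, for each HSP, how much of it falls inside annotated exons in each region. The following Python sketch illustrates the idea under stated assumptions: coordinates are half-open intervals on one sequence, exons do not overlap each other, and the numbers in the example are invented rather than taken from At1g07300 or At2g29640.

```python
def exon_overlap_fraction(hsp, exons):
    """Return the fraction of an HSP that lies inside annotated exons.

    hsp: (start, end) half-open interval on one sequence.
    exons: list of non-overlapping (start, end) half-open intervals
           from the gene model on the same sequence.
    """
    start, end = hsp
    covered = 0
    for ex_start, ex_end in exons:
        # Length of the intersection of the HSP and this exon (0 if disjoint).
        covered += max(0, min(end, ex_end) - max(start, ex_start))
    return covered / (end - start)


# Invented coordinates: the HSP is fully exonic in one region but only
# partly exonic in the other, the kind of incongruence worth inspecting.
frac_region_1 = exon_overlap_fraction((100, 200), [(50, 250)])
frac_region_2 = exon_overlap_fraction((100, 200), [(150, 250)])
```

An HSP that is exonic in one homeolog and largely intronic or intergenic in the other does not prove an error, but it flags the pair for the sort of cDNA check described above.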

Figure 5.

 Detecting annotation errors.
(a) Alignment of a homeologous gene pair in Arabidopsis with an annotation error using blastn. The regions analyzed included 2500 nucleotides of the 5′ and 3′ regions of genes At1g07300 and At2g29640.
(b) Alignment of two syntenic intragenomic regions from (a) to the syntenic region of an outgroup (Vitis vinifera) using blastn. Here it is evident that the Arabidopsis gene model for At2g29640 probably encompasses two genes, the 5′ section of which has been lost from the other Arabidopsis syntenic region.

For further analysis of annotation errors uncovered through comparison of syntenic regions, an outgroup sequence is needed. Figure 5(b) shows the two intragenomic regions with an annotation error compared with an outgroup sequence (grape, V. vinifera). Here, we can see that both At1g07300 and At2g29640 have some 3′ sequence similarity to the outgroup. However, At2g29640 also has 5′ sequence similarity to the outgroup that is not present in the 5′ region of At1g07300. Also, the 5′ cluster of Arabidopsis HSPs in the non-coding sequence is not present in the outgroup. This suggests that the HSPs in the 5′ cluster are CNSs and that At2g29640 may represent two genes, one of which has been retained in the syntenic Arabidopsis genomic region, and one that has been fractionated. In addition, you will notice an ‘HSP stack’ (HSPs 5–8) in the comparison between At1g07300 and V. vinifera. This results from a simple sequence repeat (in this case a GAGA repeat) and the way in which blast identifies regions of sequence similarity.
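Simple sequence repeats of this kind are easy to locate before interpreting an alignment. The sketch below is a minimal illustration using a regular expression; the function name, the default motif and the minimum repeat count are assumptions chosen for the example, and this is not how blast itself filters low-complexity sequence.

```python
import re


def find_simple_repeats(seq, motif="GA", min_repeats=5):
    """Return (start, end) half-open spans where `motif` repeats in tandem.

    seq: DNA sequence as a string (case-insensitive).
    motif: repeat unit to look for, e.g. "GA" for a GAGA repeat.
    min_repeats: smallest number of tandem copies to report (arbitrary
                 illustrative threshold).
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif.upper()), min_repeats))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


# Toy sequence: six tandem copies of GA flanked by unrelated bases.
spans = find_simple_repeats("ACGT" + "GA" * 6 + "TTT")
```

HSPs that fall entirely within such spans in both sequences are good candidates for the stacked, repeat-driven hits described above, and can be set aside when looking for CNSs.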

Inversions happen frequently, and break collinearity

Figure 6 compares two alignment tools, blastz and Shuffle-Lagan, for their ability to identify an inversion within a syntenic region. Although both algorithms were able to identify an inversion containing at least four genes as well as a putatively missed gene in one region, identifying regions of similarity is easier using the GeLo visualization for blastz.
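Whatever tool produces the alignment, an inversion shows up as a run of adjacent HSPs whose orientation is opposite to that of their neighbors. The Python sketch below illustrates that logic only; the tuple layout, the strand labels and the coordinates are hypothetical and do not correspond to the output format of blastz or Shuffle-Lagan.

```python
from itertools import groupby


def inverted_blocks(hsps):
    """Return query spans covered by consecutive reverse-orientation HSPs.

    hsps: list of (q_start, q_end, s_start, s_end, strand) tuples, where
          strand is "+" for same orientation and "-" for reverse orientation
          (hypothetical layout chosen for this example).
    """
    ordered = sorted(hsps, key=lambda h: h[0])  # walk along the query region
    blocks = []
    for strand, run in groupby(ordered, key=lambda h: h[4]):
        run = list(run)
        if strand == "-":
            # One candidate inverted block spans the whole run of "-" HSPs.
            blocks.append((run[0][0], max(h[1] for h in run)))
    return blocks


# Invented coordinates: two reverse-orientation HSPs flanked by collinear ones.
blocks = inverted_blocks([
    (0, 100, 0, 100, "+"),
    (200, 300, 500, 600, "-"),
    (350, 450, 350, 450, "-"),
    (500, 600, 700, 800, "+"),
])
```

This sketch assumes that the two regions are otherwise compared in the same orientation; if the whole comparison is reversed, the majority orientation should be treated as the baseline and the minority runs as candidate inversions.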

Figure 6.

 Detecting inversions.
Sequence comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis with an inversion in one region. The upper region is from chromosome 1 identified by gene At1g02690, and the lower region is from chromosome 4 identified by gene At4g02150.
(a) blastz.
(b) Shuffle-Lagan. Blue arrows highlight homeologs, red arrows highlight homeologs in an inverted chromosomal region and the orange arrow represents a putative non-annotated gene on chromosome 1. Both algorithms identified the inversion event.

Local duplications are very common and can greatly clutter alignment graphics

Duplications of two general types are shown in Figure 7 using blastz. Figure 7(a) shows a region with a local duplication compared with itself, and Figure 7(b) shows a region with a local duplication compared with its syntenic region containing 12 local duplicates (two of which are pseudogenes). Notice that HSP1 in Figure 7(a) covers nearly the entire sequence. This is the result of comparing a genomic region against itself. Also, notice the HSP stacks in Figure 7(b) that happen when one gene is present in many copies in the other region. Although the HSP numbers overlap and are difficult to interpret, which is a limitation of this type of visualization, the expansion of this gene family by local duplication is apparent.
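When the graphics become too cluttered to read, the same information can be summarized numerically. The sketch below shows two small helpers, both hypothetical and written only for illustration: one discards the trivial HSP produced by aligning a region to itself, and the other counts how many distinct intervals in the other region are hit by HSPs overlapping a single gene. Coordinates are invented half-open intervals.

```python
def drop_self_hits(hsps):
    """Remove HSPs that align an interval to exactly the same interval."""
    return [h for h in hsps if (h[0], h[1]) != (h[2], h[3])]


def copies_hit_by_gene(hsps, gene_span):
    """Count distinct subject intervals hit by HSPs overlapping one query gene.

    hsps: list of (q_start, q_end, s_start, s_end) half-open intervals.
    gene_span: (start, end) of the gene on the query sequence.
    Overlapping subject intervals are merged so that several HSPs landing on
    the same copy are counted once.
    """
    g_start, g_end = gene_span
    subject_hits = sorted(
        (h[2], h[3]) for h in hsps if h[0] < g_end and h[1] > g_start
    )
    merged = []
    for s_start, s_end in subject_hits:
        if merged and s_start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], s_end))
        else:
            merged.append((s_start, s_end))
    return len(merged)


# Self-comparison: the first HSP is the trivial full-length diagonal.
self_hsps = [(0, 1000, 0, 1000), (100, 200, 600, 700), (600, 700, 100, 200)]
informative = drop_self_hits(self_hsps)

# Cross-region comparison: one query gene hits two separate subject copies.
cross_hsps = [(100, 200, 50, 150), (100, 200, 400, 500),
              (120, 180, 420, 480), (900, 950, 10, 20)]
n_copies = copies_hit_by_gene(cross_hsps, (100, 200))
```

A count well above one for a single gene is the numerical counterpart of an HSP stack and points to a locally expanded gene family in the other region.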

Figure 7.

 Detecting local duplications.
(a) blastz comparison of a genomic region with itself (chromosome 1) identifies a local duplicate (red double arrows) consisting of genes At1g07440 and At1g07450.
(b) blastz comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis. The upper region is from chromosome 1 and the lower region is from chromosome 2 identified by genes At1g07440 and At2g29300, respectively. Blue numbered boxes mark regions of sequence similarity identified by blastz. Blue arrows highlight syntenic paralogous gene pairs. Regions highlighted in red denote the expansion or contraction of a gene family.

Conclusion

Looking forward

Plant biologists face an exciting future. There will be about a dozen plant genomes completed over the next few years, bringing opportunities to characterize new patterns of similarity and change in the structure and content of plant genomes. A challenge will be to associate phenotypes with specific patterns of sequence conservation. For example, many sequence motifs identified in CNSs are under active selection, but we know little else about their function, either biochemically or phenotypically. Although we know that there are different selective environments for genomes arising from polyploidy versus speciation, we do not yet understand the evolutionary constraints and consequences. However, we know that after polyploidy, genes are retained or lost based on their family type and their ancestral genomic region. Apart from knowing that gene dosage is primarily important for retention, and that subfunctionalization is particularly important after retention (Freeling, 2007), we do not understand exactly how bias of gene content occurs as a consequence of duplications. Comparative genomics is a young and vibrant field, and is especially so for plants because plant genespace is relatively less complex than that in mammals, and because tetraploidies offer many advantages for analysis. At the core of this enterprise are DNA alignment and visualization tools, some of which are reviewed here.
