There are four sequenced and publicly available plant genomes to date. With many more slated for completion, one challenge will be to use comparative genomic methods to detect novel evolutionary patterns in plant genomes. This research requires sequence alignment algorithms to detect regions of similarity within and among genomes. However, different alignment algorithms are optimized for identifying different types of homologous sequences. This review focuses on plant genome evolution and provides a tutorial for using several sequence alignment algorithms and visualization tools to detect useful patterns of conservation: conserved non-coding sequences, false positive noise, subfunctionalization, synteny, annotation errors, inversions and local duplications. Our tutorial encourages the reader to experiment online with the reviewed tools as a companion to the text.
Comparative genomics is founded on the assumption that much of life’s language is contained in its linear DNA sequence. Comparisons of genomic DNA sequences – be they from different species or, as with polyploids, within the same nucleus – present one way to understand the syntax and vocabulary of this language. One great advantage of using whole rather than partial genome sequence is that comparisons may be made between the most closely related genes or regions in the genomes compared. Homologous genes or chromosomal regions are similar because they share a common ancestor, but the closest homolog can be inferred only by finding homologs within regions that share a similar pattern of gene content. If these best homologous DNA sequences are from different organisms, they are called orthologs (with some exceptions). If homologous sequences are within one genome, they are called paralogs. A special case of paralogy results from polyploidy. Duplicated genes or chromosomal regions derived from polyploidy are called homeologs. Comparisons among homeologs are routine when working with angiosperms. The fundamentals of comparative genomics and its nomenclature have been reviewed elsewhere (Koonin, 2005). Some definitions particularly important for plant scientists are given in Table 1.
Table 1. Comparative genomic definitions of special relevance to angiosperms
Ortholog
A pair of homologous genes or chromosomal regions derived from the same syntenous chromosomal positions in different species. Additional gene duplications and/or losses following speciation may result in complex relationships between sets of orthologous genes.
Conserved non-coding sequence (CNS)
A protocol for identifying conserved non-coding sequences (CNSs) in plants using a pair-wise blastn (Altschul et al., 1990) to identify high-scoring segment pairs (HSPs) between the non-protein-coding sequences near usefully diverged, orthologous or homeologous genes. These sequences are at least 15-bp long with an e-value equal to or more significant than a 15/15 exact nucleotide match (Inada et al., 2003; Kaplinsky et al., 2002). Other equally sensitive alignment algorithms can be substituted, as long as the 15/15 exact match significance cut-off is used.
Homeolog
A pair of genes retained following polyploidy, identified by residing in syntenic regions of the chromosome. Being duplicates within the same organism, homeologs are a special case of paralogs, but all homeologs occurred contemporaneously, whereas the history of local gene duplicates is obscured by gene conversion.
Fractionation
The mechanism by which a duplicated gene, chromosomal segment or genome tends to return to pre-duplication gene content, but not necessarily retain its pre-duplication gene order. Fractionation is the loss of one or the other of the initial homeologs, but not of both. The process of fractionation is associated with chromosomal rearrangements and ‘transcriptome shock’ (Wang et al., 2006), and may help cluster dose-sensitive genes (Thomas et al., 2006).
As CNS above, but the chromosomal regions are homeologous (syntenous and paralogous) remnants of the most recent tetraploidy event (α) in the lineage (Thomas et al., 2006). blast results for α pairs in Arabidopsis are displayed and may be researched in a custom viewer: http://synteny.cnr.berkeley.edu/AtCNS. Subfunctionalization, defined in the text, is expected of homeologous pairs, but not in orthologous pairs. Furthermore, homeologs are under different selective constraints as compared with orthologs.
Genespace
Genespace is defined here as the space of an individual gene, which is a computational surrogate for ‘cistron’, where the total genespace of a genome is the sum of all of the genespaces of its genes. This gene-level genespace is computed after CNSs have been identified for a syntenic genomic region, and each CNS has been sorted to a gene: the segment of genome between the most 5′ (upstream) and most 3′ CNS, untranslated region (UTR) or feature, plus approximately 500 bp on each side (depending on neighboring features; Thomas et al., 2007). Within a genespace are exons, UTRs, CNSs, known motifs, positions where specific transcription factor binding sites reside and any feature that is fixed at a chromosomal locus. This non-standard term has little use in mammals because CNSs are difficult to sort to individual genes, but is particularly useful for plant research.
Phylogenetic footprint
The most inclusive term for the conserved sequence between two or more sequences without stipulations as to the extent of divergence. A CNS is a type of phylogenetic footprint.
Local alignment algorithm
Computational method to identify local regions of sequence similarity between two or more biological sequences, where hits may or may not be collinear, and may be on either strand.
Global alignment algorithm
Computational method to find the best possible alignment between two or more biological sequences that extends across the entire length of all sequences on the same DNA strand. If settings are not stringent enough, noise can look like syntenic conserved regions because global algorithms make all alignments collinear.
Comparison of biological parts to identify similarities is an ancient preoccupation originally used to order the natural world. The Linnaean classification system makes a fine example. Later, these comparisons were used to identify possible evolutionary trends. One such eukaryotic trend is towards increasing maximums of morphological complexity (Freeling and Thomas, 2006), but there are many more. The modern synthesis of Darwin’s natural selection, genetic laws and some principles of population genetics (see Mayr, 1993) provides the most popular logic by which genomes are compared. It is known that genomic DNA sequences have regions with and without function, and different functional regions confer their function through different means. DNA may function by indirectly encoding protein, directly encoding RNA, binding macromolecules, directing or modifying the movement of regulatory molecules or being epigenetically modified. Compounding the matter, some DNA may have multiple simultaneous functions. Reducing a primary DNA sequence to a particular set of biologically meaningful structures is daunting (Pearson, 2006). However, it is sometimes possible to find something meaningful about the biological function of DNA with incomplete structural knowledge by comparison with a related DNA sequence. It is at this juncture that comparative genomics can be useful because DNA that functions, without regard to mechanism, tends to have its primary sequence evolutionarily conserved (Hardison, 2000, 2003).
Our purpose is limited. We have prepared a tutorial of DNA sequence comparison algorithms and data visualization tools commonly used by plant researchers. Using these tools, we identify the types of information that can be acquired, show how the ability to change alignment algorithms and parameters is crucial for discovery and illustrate how visualization of the results is almost as important as the resolution of the alignment algorithm itself.
Plant (angiosperm) genomes are known to be different from mammalian or any other animal genomes in several important ways. Recent and ancient polyploidy is widespread among angiosperms. The former may be deduced from chromosome counts (Adams and Wendel, 2005), whereas detecting the latter requires a nearly complete genome sequence. Within the fully-sequenced genomes of Arabidopsis thaliana, poplar, rice and grape are the remnants of at least two ancient tetraploidies (Adams and Wendel, 2005; Bowers et al., 2003; De Bodt et al., 2005; Jaillon et al., 2007; Paterson et al., 2005; Tuskan et al., 2006). Ancient tetraploidies cannot be inferred from chromosome counts because fractionation, the mechanism of genomic content loss that naturally follows all types of DNA duplications, often returns a polyploid to a chromosomal number and gene count more like that of its pre-polyploid ancestor. In addition to polyploidy, plant genomes contain much transposon-derived DNA. Such DNA is usually only a few million years old, and is often found both in locally repeated blocks and spread throughout the genome. Researchers must be aware that sequences of this highly repetitive nature often obfuscate comparisons among and within plant genomes. Finally, the region around genes that contains additional non-protein coding functional sequence is structured differently in angiosperms as compared with mammals. These sequences are identified by comparison of duplicated chromosomal regions, and are often called conserved non-coding sequences (CNSs; Table 1).
Mammalian CNSs are approximately 10 times larger and are much more numerous than plant CNSs when using alignment cut-offs appropriate for plant CNS discovery (Kaplinsky et al., 2002). Were the most popular CNS alignment cut-off used in animal research (100-bp long with >70% identity; Loots et al., 2000) applied to plants, plants would have nearly zero CNSs (Gao and Innan, 2004; Inada et al., 2003; Thomas et al., 2007). It follows that CNSs are more deeply conserved in the vertebrate lineage than in the angiosperm lineage. Vertebrates have over a thousand enhancer-like conserved non-coding sequences that have been conserved since the divergence of human and fish 450 Mya (Goode et al., 2005; Ovcharenko et al., 2005b; Siepel et al., 2005; Woolfe et al., 2005). Although those most-conserved plant CNSs may also operate as enhancers, plants do not have such deeply conserved CNSs (Freeling et al., 2007). As originally observed by Kaplinsky et al. (2002), mammalian CNSs often occur continuously down a chromosome, so the assignment of any one of them to a particular gene is not possible using spacing alone. Work on maize–rice (Inada et al., 2003; Guo and Moose, 2003), Brachypodium–rice (Bossolini et al., 2007) and especially alignments of the two most recent post-tetraploid genomes within Arabidopsis (Freeling, 2007; Thomas et al., 2007) all demonstrate that almost all plant CNSs cluster near one gene: this cluster of conservation has been used to estimate what we call a single ‘genespace’. This non-standard term is defined in Table 1. The reason why plants, as compared with mammals, have less conserved sequence between genes is not known, but this articulated pattern of conservation permits assigning CNSs to genes, and this information is powerful. For example, the most CNS-rich genes in Arabidopsis are transcription factors known to be necessary for response to environmental signals (Freeling et al., 2007).
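The gene-level genespace defined in Table 1 amounts to a simple interval computation once CNSs have been sorted to a gene. The sketch below is only an illustration, not the published implementation: the function name `genespace_bounds`, the coordinate layout and the fixed 500-bp flank are assumptions made for this example (Table 1 notes that the real flank depends on neighboring features).

```python
def genespace_bounds(features, flank=500, seq_length=None):
    """Return (start, end) spanning all features plus a flank on each side.

    features:   list of (start, end) coordinates (1-based, inclusive) for the
                exons, UTRs and CNSs already assigned to one gene
                (hypothetical input layout).
    flank:      padding in bp; fixed at ~500 bp here for simplicity.
    seq_length: optional length of the underlying sequence, used to clip
                the 3' end.
    """
    if not features:
        raise ValueError("at least one feature is required")
    start = min(s for s, _ in features) - flank
    end = max(e for _, e in features) + flank
    start = max(start, 1)              # do not run off the 5' end
    if seq_length is not None:
        end = min(end, seq_length)     # do not run off the 3' end
    return start, end


# Example: a distal 5' CNS, the gene body and a 3' CNS (made-up coordinates)
print(genespace_bounds([(1200, 1215), (3000, 4500), (6100, 6120)]))
```

Summing the lengths of these intervals over all genes would give the total genespace of a genome in the sense used above.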
The concept of ‘synteny’ is essential to any comparison of homologous genes or chromosomes. Given the inherent complexity of this term, the definition that follows is simply how we use this term. Two or more once-duplicated sequences are said to be ‘syntenic’ when it is possible, using extant genomic data, to reconstruct a valid ancestral sequence from which the sequences originated. When two chromosomal regions have mainly co-linear genes or other features, they are obviously derived from a common ancestral genomic region and are considered syntenic. In reality, duplication is followed by an evolutionary winnowing process (called ‘fractionation’, see Table 1) that includes gene loss, inversions, translocations, insertions, deletions and epigenetic marks. This results in a loss of collinearity of genes and other features, but it is often possible to reconstruct a putative ancestor nevertheless. An outgroup genome is often necessary to prove synteny, especially if the remaining duplicate regions share zero or few conserved sequences. When this reconstruction is possible, the duplicate regions are called ‘syntenous’ or ‘syntenic’. Only the post-duplication movement of single genomic features to another genomic region destroys our ability to detect synteny. Because of ancient polyploidies in all angiosperms, synteny is evidenced within plant genomes, and not just between them. The tutorial that follows uses synteny between homeologous (i.e. syntenic, paralogous; Table 1) genomic features in several ways to identify patterns of evolution in plant genomes.
Duplication may be of varying degrees of completeness: local (tandem), segmental, whole chromosome and whole genome (polyploidy). Each sort of duplication has very different selective constraints (Koonin, 2005) and dosage effect/compensation expectations (Birchler et al., 2005; see Freeling and Thomas, 2006). For example, prevalent gene conversion makes comparisons among locally duplicated genes challenging because it unlinks their date of origination from their observed degree of divergence (exemplified in yeast; Gao and Innan, 2004).
Some homologous DNA sequence comparisons are meaningful only if the DNA sequences have diverged to a ‘useful’ level. In theory, DNA sequence without specific function will either accumulate point mutations at the background rate of the region or may be deleted altogether (if such a mechanism operates). The former is true for many third codon position base substitutions in the protein coding sequence, and is true for all of the non-functional sequence. As the non-functional sequence changes more quickly than the functional sequence, there is a point in evolutionary time after which conservation of the sequence is evidence of function. Conversely, if the level of sequence divergence is small, conservation is expected simply because of carry-over from the common ancestor. Although adequate divergence is essential, it is important to realize that there can be too much. For example, homologous regulatory sites are known to lose sequence similarity even though binding function is conserved (called ‘binding site turnover’; Ludwig et al., 2005; Moses et al., 2006).
There is certainly a window of useful divergence when comparing plant non-coding sequences. The three first papers on plant CNS discovery (Guo and Moose, 2003; Inada et al., 2003; Kaplinsky et al., 2002) established the maize–rice divergence time as being appropriate for CNS detection, and argued that maize–rice diverged to approximately the same extent as mouse–man. Sufficient divergence for detecting CNSs in plants is indicated when the average blast high-scoring segment pair (HSP) between orthologous/homeologous coding regions is approximately 85% identical in nucleotide sequence (unpublished rule-of-thumb, M. Freeling). However, the definitive ‘find plant CNS’ settings may always be adjusted so that only significant non-coding alignments are detected.
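The 85% rule of thumb can be checked with a few lines of code once you have the coding-region HSPs for a gene pair. The sketch below is an assumption-laden illustration: the function name and the input layout are invented for this example, and the average is weighted by alignment length, which the text does not specify.

```python
def mean_hsp_identity(hsps):
    """Length-weighted mean percent identity over a set of HSPs.

    hsps: list of (identical_positions, alignment_length) pairs taken from
          the coding-region HSPs of an orthologous or homeologous gene pair
          (hypothetical input layout).
    """
    total_matches = sum(matches for matches, _ in hsps)
    total_length = sum(length for _, length in hsps)
    if total_length == 0:
        raise ValueError("no aligned positions")
    return 100.0 * total_matches / total_length


# Two coding HSPs with made-up counts
coding_hsps = [(170, 200), (85, 100)]
print(mean_hsp_identity(coding_hsps))
```

A value far above 85% suggests that non-coding conservation may simply reflect carry-over; a value far below it suggests that true CNSs may have diverged beyond detection.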
Subfunctionalization is the natural process whereby duplicate cis-acting units of function (e.g. exons and enhancers) tend to lose dispensable sequences in a compensatory fashion (Force et al., 1999; Lynch and Force, 2000). This results in the full set of functions of the ancestral gene being divided between both duplicates so no one gene is complete. Subfunctionalization of cis-acting regulatory DNA sequences has been noticed in plants (Haberer et al., 2004; Langham et al., 2004). When paralogs are aligned, subfunctionalized regions of the sequence cannot be seen because they exist in only one of the two duplicates. An appropriate outgroup capable of better representing the ancestor of the duplicates is required to identify subfunctionalized sequences.
Comparing genomic sequence
The workhorses of comparative genomics are sequence alignment algorithms. Sequence alignment algorithms break into two major classes: global and local. Global alignments (Needleman and Wunsch, 1970) generate the best alignment across the whole length of the sequences, whereas local alignments (Smith and Waterman, 1981) find as many best-subsequence alignments as possible. The usefulness of results generated by these two classes of alignment algorithms depends on the type of genomic region analyzed, and how much false-positive noise is retained in the results. In general, if the compared regions are believed to be similar across their entire lengths, then global alignment algorithms are preferred. Cases of inversion and local duplication – both common in syntenic regions – violate this assumption of collinearity, and local alignment algorithms are generally preferred. Also, different algorithms in each class are optimized for different alignment tasks. To highlight these differences in algorithm classes and optimizations, we use six alignment algorithms (three global and three local) in the tutorial. These are listed in Table 2.
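The difference between the two classes is easiest to see in the dynamic programming itself. The following is a minimal sketch of textbook Needleman–Wunsch and Smith–Waterman scoring, not the code used by any tool in Table 2; the scoring values (match +1, mismatch −1, linear gap −1) are arbitrary assumptions chosen for illustration.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: Needleman-Wunsch (global); both sequences must be aligned
                 end to end, so terminal gaps are penalized.
    local=True:  Smith-Waterman (local); scores are floored at zero and the
                 best-scoring cell anywhere in the matrix is returned.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)      # a local alignment may restart
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# A short conserved core (ACGT) embedded in unrelated flanking sequence
x, y = "AAAACGTAAAA", "CCCACGTCCC"
print(alignment_score(x, y, local=True), alignment_score(x, y))
```

The local score reflects only the conserved core, whereas the global score is dragged down by the unrelated flanks, which is why local algorithms suit short, dispersed conservation.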
Table 2. Alignment and visualization tools used in the tutorial
*GeLo is our own visualization package (to be published elsewhere).
We are beginning to see the development of modular alignment visualization software that can be used with the output from any sequence alignment algorithm. Vista (Mayor et al., 2000) is a prime example of this paradigm and has been used for visualizing alignment results from several algorithms. Similarly, we have developed our own genome visualization module, GeLo, which we use in the tutorial for visualizing blast results, and which is now being used to display the results from several alignment algorithms.
In the tradition of online tutorials for web applications, the following short manual is written colloquially. The reader becomes ‘you’ at this point in the discourse and we provide you links to our web application for regenerating our examples and figures, as well as generating sequence and annotation sets for import into other comparative genomic tools (Table 3). There are many tools available and these have been reviewed elsewhere (recently by Pollard et al., 2006). To illustrate differences and similarities between different commonly used alignment algorithms, we chose six of them for our examples (Table 2). Please note that there are several ways to import sequences (and associated annotations) into our, and most other, web applications: retrieval from a local database, import from GenBank via an accession number, or directly submitting a sequence in FASTA or GenBank format.
Table 3. Links for regenerating and modifying the blastn and blastz examples used in the tutorial, and for obtaining sequence and annotations files in FASTA and GAF (Gene Annotation Format) format, respectively
These output files may be exported to other sequence analysis applications.
Figure 1 shows the results from the algorithms in Table 2 applied to a pair of genespaces in Arabidopsis: these genespaces from chromosomes 2 and 4, which contain several kb of sequence, were chosen because they are homeologous and, using blastn terminology (Figure 1a), they contain 14 short HSPs and three coding sequence (CDS; reading frame) HSPs. Almost all HSPs are collinear. These are typical plant CNSs (except that HSP2 is actually a microRNA gene), and any detailed analysis of these genes should detect these dispersed, short stretches of non-coding DNA sequence conservation. This blastn analysis (Figure 1a), using the settings and noise filter defined in Table 1, is our baseline for comparison with the other five alignment algorithms. For each of the other algorithms, only the alignment to the genespace on chromosome 4 is shown.
blastz (Figure 1b) identified the pair of homeologous coding sequences with a single HSP that covered the entire gene model and extended into its 5′ non-coding sequence. Although this 5′ extension covered one CNS identified by blastn (HSP8, Figure 1a), it failed to find any distal CNSs. Although we can conclude that blastz can easily identify putative gene homologs, it is not appropriate for finding plant CNSs.
Mulan, another local alignment tool, provides a Vista-like visualization of results for identifying CNSs that can filter the alignments based on the minimum length of CNSs and their percentage sequence similarity. Applying animal CNS settings (100-bp length, 70% identity), Mulan identified the 5′-most cluster of CNSs (Figure 1c), but did not find the intervening CNSs identified by blastn. We changed the Mulan filter to be similar to plant CNS settings by lowering its minimum length to 15 bp (Figure 1d). Although the 5′ CNS cluster covered more sequence, Mulan still missed the same CNSs as with the animal settings.
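A length-and-identity filter of the kind described above is simple to express. The sketch below is not Mulan's code; the function name, the dictionary keys and the example values are hypothetical, and treating the identity threshold as inclusive is an assumption.

```python
def filter_by_length_and_identity(hsps, min_length, min_identity):
    """Keep alignments that meet both a length and an identity threshold.

    hsps: list of dicts with 'length' (bp) and 'identity' (percent);
          the key names are assumptions for this sketch.
    """
    return [h for h in hsps
            if h["length"] >= min_length and h["identity"] >= min_identity]


# Made-up alignments for illustration
hits = [
    {"name": "distal_cluster", "length": 140, "identity": 78.0},
    {"name": "short_cns", "length": 22, "identity": 95.0},
    {"name": "too_short", "length": 12, "identity": 100.0},
]
animal_like = filter_by_length_and_identity(hits, 100, 70.0)
plant_like = filter_by_length_and_identity(hits, 15, 70.0)
print([h["name"] for h in animal_like], [h["name"] for h in plant_like])
```

Note that a filter can only keep what the alignment algorithm reported in the first place: relaxing the thresholds does not recover CNSs that the aligner never found, which is consistent with the Mulan result above.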
Figure 1(e–g) show the results of the global alignment algorithms DiAlign (which uses the local alignment algorithm Chaos for anchors; Brudno and Morgenstern, 2002), Avid and Lagan. All these algorithms identified the pair of homeologous genes and the 5′ distal CNS cluster. Chaos-DiAlign and Lagan found several of the intervening CNSs identified by blastn (although the Chaos-DiAlign server did not support adding gene annotations to the abc visualization). Although blastn may not be the most appropriate alignment algorithm for all comparative genomics problems, this comparison shows that it performs well for detecting plant CNSs.
Where is the noise?
blastn reduces false-positive noise in its alignments (i.e. HSPs, blast ‘hits’) by using the concept of an expect value (e-value). An in-depth discussion of the e-value calculation in blast is beyond the scope of this review, so please see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html for full details. To facilitate CNS research, Thomas et al. (2007; see also the Arabidopsis CNS website, http://synteny.cnr.berkeley.edu/AtCNS) devised a heuristic method to efficiently filter noise from genespaces of various lengths. These workers (from this laboratory) added an identical sequence of known length (called a ‘spike’ sequence) to the 3′ end of the compared sequences. Using blastn to generate HSPs, they identified the HSP containing the spike sequence and removed all other HSPs of greater e-value. They found that a 15-bp spike sequence eliminated most of the noise from their analyses. Figure 2 shows a short syntenic region within Arabidopsis subjected to various levels of noise filtration using spike sequences of various lengths. Note that the filter with a 15-bp spike sequence eliminates all noise from the analysis (Figure 2c,d), leaving four CNSs. We leave it to you to try various other spike sequence lengths and gap/mismatch penalties using the links in Table 3.
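The spike-sequence filter can be sketched in a few lines, given a list of HSPs that has already been parsed from blastn output. This is an illustration of the idea only, not the code used by Thomas et al. (2007): the function name, the dictionary keys and the rule for choosing among several spike-covering HSPs are all assumptions.

```python
def spike_filter(hsps, spike_start, spike_end):
    """Drop HSPs that are less significant than the HSP aligning the spike.

    hsps: list of dicts with 'qstart', 'qend' (query coordinates) and
          'evalue'; the key names are assumptions for this sketch.
    spike_start, spike_end: query coordinates of the appended spike.
    """
    def covers_spike(h):
        return h["qstart"] <= spike_start and h["qend"] >= spike_end

    spike_hits = [h for h in hsps if covers_spike(h)]
    if not spike_hits:
        raise ValueError("no HSP covers the spike sequence")
    # Assumption: if several HSPs cover the spike, use the most significant.
    cutoff = min(h["evalue"] for h in spike_hits)
    # Keep real HSPs at least as significant as the spike; the artificial
    # spike HSP itself is removed from the result.
    return [h for h in hsps
            if not covers_spike(h) and h["evalue"] <= cutoff]
```

Because the spike is aligned in the same search space as the real sequence, its e-value scales with the length of the compared regions, which is what lets one cut-off serve genespaces of various lengths.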
Subfunctionalization of CNSs
Now that we can detect CNSs in homeologous genespaces, we will extend this to include an outgroup sequence for the purpose of identifying CNSs that are shared or fractionated/subfunctionalized. Figure 3 shows blastn comparisons with a 15-bp spike sequence of two homeologous maize genes (liguleless-like transcription factors retained from a tetraploidy that happened approximately 15 Ma) to a rice ortholog. In this example, each maize gene has several CNSs in common with rice that are not shared in the homeologous genespace (CNSs highlighted with purple ovals), which is evidence for subfunctionalization of the non-coding sequence of the maize genes.
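One way to tabulate such a comparison is to project each homeolog's CNSs onto the outgroup coordinates and ask which outgroup intervals are hit by both homeologs and which by only one. The sketch below assumes exactly that representation; the function names and the overlap criterion are hypothetical, and this is not the procedure used to produce Figure 3.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_outgroup_cns(cns_a, cns_b):
    """Sort outgroup CNS intervals into shared and homeolog-specific sets.

    cns_a, cns_b: lists of (start, end) intervals on the outgroup sequence
                  where homeolog A or homeolog B shares a CNS with it.
    """
    shared = [x for x in cns_a if any(overlaps(x, y) for y in cns_b)]
    only_a = [x for x in cns_a if not any(overlaps(x, y) for y in cns_b)]
    only_b = [y for y in cns_b if not any(overlaps(y, x) for x in cns_a)]
    return {"shared": shared, "only_a": only_a, "only_b": only_b}


# Made-up outgroup coordinates
result = classify_outgroup_cns([(100, 120), (300, 330)], [(105, 125), (500, 520)])
print(result)
```

Intervals reported as `only_a` or `only_b` are candidates for subfunctionalized non-coding sequence and deserve a closer look in the alignment graphics.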
Figure 4(a,b) visualize synteny between two intragenomic regions of Arabidopsis using blastn and blastz, respectively. In both analyses, there are six pairs of genes that share a high degree of sequence similarity (blue double arrows) and are collinear, demonstrating synteny. However, the results from blastz are easier to interpret visually. There are several genes in each region that do not have a corresponding homeolog (purple ovals), which we assume is the result of fractionation.
To demonstrate fractionation, comparison with an outgroup sequence is necessary. For understanding intragenomic fractionation, such an outgroup would ideally have diverged before the intragenomic duplication event and not undergone a duplication event of its own. Figure 4(c) shows an example of this using Vitis vinifera (grape; Jaillon et al., 2007). In this example, although the intragenomic regions share a subset of their gene content, the unannotated outgroup contains the majority of the gene content and evidences fractionation. In addition, the outgroup comparison allows us to infer the pre-duplication ancestral state of the intragenomic syntenic regions, and track which genes have been preserved as singlets or retained as duplicates.
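Tracking singlets and retained duplicates against an outgroup reduces to a membership test per outgroup gene. The sketch below assumes that you have already decided, from the alignments, which outgroup genes have a syntenic homolog in each duplicated region; the function name, labels and input format are hypothetical.

```python
def classify_retention(outgroup_genes, homologs_a, homologs_b):
    """Label each outgroup gene by where its syntenic homolog survives.

    outgroup_genes: ordered list of gene names in the outgroup region.
    homologs_a, homologs_b: sets of outgroup gene names with a syntenic
        homolog in duplicated region A or B (hypothetical input format).
    """
    labels = {}
    for gene in outgroup_genes:
        in_a, in_b = gene in homologs_a, gene in homologs_b
        if in_a and in_b:
            labels[gene] = "retained duplicate"
        elif in_a:
            labels[gene] = "singlet in A"
        elif in_b:
            labels[gene] = "singlet in B"
        else:
            labels[gene] = "no homolog detected"
    return labels


# Made-up gene names
print(classify_retention(["g1", "g2", "g3", "g4"], {"g1", "g2"}, {"g1", "g3"}))
```

Genes labeled "no homolog detected" are not explained by fractionation as defined in Table 1, which removes one homeolog but not both; they may reflect gene movement, gain in the outgroup lineage or a missed alignment.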
Expect errors in genomic sequences and annotation
If you examine the homeologous gene pair At1g07300 and At2g29640 from the previous example (Figure 4b, yellow exons in gene models), you will notice that these sequences have been assigned very different exon structures with respect to one another. Looking at their shared sequence similarity, you will notice that the blastz HSP covers and extends beyond the 5′ end of At1g07300, and partially covers the gene model of At2g29640. This difference may indicate that the genes are evolving in a unique fashion or that an annotation error was made. In either case, this pair of homeologs needs closer inspection. Although Arabidopsis gene models are certainly the best current models in plants, many Arabidopsis gene models are incorrect (Thomas et al., 2007).
Figure 5(a) shows a pair-wise analysis of the two Arabidopsis regions using blastn. There are two clusters of HSPs: one set (HSPs 1 and 2) covers the entire coding region of At1g07300 and the 3′ exon of At2g29640, and the other set (HSPs 3, 4, 5, 6 and 7) covers the 5′ region of At1g07300 and the intronic region of At2g29640, with one HSP overlapping a middle coding exon. The general lack of congruence between the gene models for these homeologs and the odd placement of their sequence similarity suggest an annotation error. Their annotations at TAIR (http://www.arabidopsis.org) show full-length cDNA support for At1g07300 and none for At2g29640. This implies that the gene model for At1g07300 is correct and that At2g29640 has an annotation error.
For further analysis of annotation errors uncovered through comparison of syntenic regions, an outgroup sequence is needed. Figure 5(b) shows the two intragenomic regions with an annotation error compared with an outgroup sequence (grape, V. vinifera). Here, we can see that both At1g07300 and At2g29640 have some 3′ sequence similarity to the outgroup. However, At2g29640 also has 5′ sequence similarity to the outgroup that is not present in the 5′ region of At1g07300. Also, the 5′ cluster of Arabidopsis HSPs in the non-coding sequence is not present in the outgroup. This suggests that the HSPs in the 5′ cluster are CNSs and that At2g29640 may represent two genes, one of which has been retained in the syntenic Arabidopsis genomic region, and one that has been fractionated. In addition, you will notice an ‘HSP stack’ (HSPs 5–8) in the comparison between At1g07300 and V. vinifera. This results from a simple sequence repeat (in this case a GAGA repeat) and the way in which blast identifies regions of sequence similarity.
Inversions happen frequently, and break collinearity
Figure 6 compares two alignment tools, blastz and Shuffle-Lagan, for their ability to identify an inversion within a syntenic region. Although both algorithms were able to identify an inversion containing at least four genes as well as a putatively missed gene in one region, identifying regions of similarity is easier using the GeLo visualization for blastz.
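Because local alignment tools report hits on either strand, a candidate inversion shows up as a run of consecutive reverse-strand HSPs inside an otherwise forward-strand syntenic region. The sketch below illustrates that idea only; it is not how blastz, Shuffle-Lagan or GeLo work internally, and the function name and dictionary keys are assumptions.

```python
from itertools import groupby


def inverted_blocks(hsps):
    """Return query intervals covered by runs of reverse-strand HSPs.

    hsps: list of dicts with 'qstart', 'qend' and 'strand' (+1 or -1,
          the orientation of the subject relative to the query);
          the key names are assumptions for this sketch.
    """
    ordered = sorted(hsps, key=lambda h: h["qstart"])
    blocks = []
    # groupby collects consecutive HSPs that share the same strand
    for strand, run in groupby(ordered, key=lambda h: h["strand"]):
        run = list(run)
        if strand == -1:
            blocks.append((run[0]["qstart"], max(h["qend"] for h in run)))
    return blocks


# Made-up HSPs: forward, two reverse, forward
example = [
    {"qstart": 100, "qend": 200, "strand": 1},
    {"qstart": 300, "qend": 400, "strand": -1},
    {"qstart": 450, "qend": 600, "strand": -1},
    {"qstart": 700, "qend": 800, "strand": 1},
]
print(inverted_blocks(example))
```

A single isolated reverse-strand hit is reported as well, so in practice you would apply noise filtering first and confirm that a block spans several genes before calling it an inversion.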
Local duplications are very common and can greatly clutter alignment graphics
Duplications of two general types are shown in Figure 7 using blastz. Figure 7(a) shows a region with a local duplication compared with itself, and Figure 7(b) shows a region with a local duplication compared with its syntenic region containing 12 local duplicates (two of which are pseudogenes). Notice that HSP1 in Figure 7(a) nearly covers the entire sequence. This is the result of comparing a genomic region against itself. Also, notice the HSP stacks in Figure 7(b) that happen when one gene is present in many copies in the other region. Although the HSP numbers overlap and are difficult to interpret, which is a limitation of this type of visualization, the expansion of this gene family by local duplication is apparent.
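When the graphics become cluttered, HSP stacks can be counted instead of read by eye. The sketch below clusters HSPs that overlap on the query sequence and reports clusters above a chosen depth; the function name and the default depth of 3 are assumptions for illustration, not settings from any of the reviewed tools.

```python
def hsp_stacks(hsps, min_depth=3):
    """Find query intervals hit by many HSPs (candidate 'HSP stacks').

    hsps: list of (qstart, qend) query intervals, one per HSP.
    min_depth: minimum number of HSPs in a cluster to report it
               (arbitrary default chosen for this sketch).
    Returns a list of (start, end, count) tuples.
    """
    clusters = []                          # each entry: [start, end, count]
    for start, end in sorted(hsps):
        if clusters and start <= clusters[-1][1]:
            # overlaps the current cluster: extend it and count the HSP
            clusters[-1][1] = max(clusters[-1][1], end)
            clusters[-1][2] += 1
        else:
            clusters.append([start, end, 1])
    return [(s, e, n) for s, e, n in clusters if n >= min_depth]


# One gene matched by three copies elsewhere, plus an unrelated single hit
print(hsp_stacks([(100, 900), (120, 880), (150, 910), (2000, 2100)]))
```

The count for a stacked gene gives a rough estimate of the number of local duplicates in the other region, although pseudogenes and simple sequence repeats can inflate it.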
Plant biologists face an exciting future. There will be about a dozen plant genomes completed over the next few years, bringing opportunities to characterize new patterns of similarity and change in the structure and content of plant genomes. A challenge will be to associate phenotypes with specific patterns of sequence conservation. For example, many sequence motifs identified in CNSs are under active selection, but we know little else about their function, either biochemically or phenotypically. Although we know that there are different selective environments for genomes arising from polyploidy versus speciation, we do not yet understand the evolutionary constraints and consequences. However, we know that after polyploidy, genes are retained or lost based on their family type and their ancestral genomic region. Apart from knowing that gene dosage is primarily important for retention, and that subfunctionalization is particularly important after retention (Freeling, 2007), we do not understand exactly how bias of gene content occurs as a consequence of duplications. Comparative genomics is a young and vibrant field, and is especially so for plants because plant genespace is relatively less complex than that in mammals, and because tetraploidies offer many advantages for analysis. At the core of this enterprise are DNA alignment and visualization tools, some of which are reviewed here.