Illegitimate recombination is a major evolutionary mechanism for initiating size variation in plant resistance genes

Authors


(fax +41 1 634 8204; e-mail bkeller@botinst.uzh.ch).

Summary

Current models for the evolution of plant disease resistance (R) genes are based on mechanisms such as unequal crossing-over, gene conversion and point mutations as sources for genetic variability and the generation of new specificities. Size variation in leucine-rich repeat (LRR) domains was previously mainly attributed to unequal crossing-over or template slippage between LRR units. Our analysis of 112 R genes and R gene analogs (RGAs) from 16 different gene lineages from monocots and dicots showed that individual LRR units are mostly too divergent to allow unequal crossing-over. We found that illegitimate recombination (IR) is the major mechanism that generates quasi-random duplications within the LRR domain. These initial duplications are required as seeds for subsequent unequal crossing-over events which cause the observed rapid increase or decrease in LRR repeat numbers. Ten of the 16 gene lineages studied contained such duplications, and in four of them the duplications served as a template for subsequent repeat amplification. Our analysis of Pm3-like genes from rice and three wheat species showed that such events can be traced back more than 50 million years. Thus, IR represents a major new evolutionary mechanism that is essential for the generation of molecular diversity in evolution of RGAs.

Introduction

As a defense against pathogens, plant genomes contain a large number of resistance (R) genes whose products directly or indirectly detect specific pathogen effectors and are able to trigger a defense reaction (Belkhadir et al., 2004). Many R genes are embedded in clusters of similar genes called R gene analogs (RGAs). These may not be directly involved in resistance against pathogens but are believed to provide the genetic raw material from which new varieties of R genes can emerge through unequal crossing-over and gene conversion. Such sequence exchange between homologs can lead to the rapid generation of new hybrid genes (Kuang et al., 2004; Noël et al., 1999; Smith and Hulbert, 2005; Van der Hoorn et al., 2001). In this study, we will use the term ‘RGA’ to refer to both R genes and RGAs.

Most RGAs described are of the NBS-LRR type and encode proteins with a nucleotide binding domain (NBS), followed by a series of leucine-rich repeats (LRR). The NBS domain is highly conserved and is probably involved in signal transduction (Jones and Jones, 1997) whereas the LRR domain is the determining factor for specificity of pathogen recognition (Dodds et al., 2001; Jones and Jones, 1997). The LRR domain consists of a repetitive core motif in which conserved leucines (or other small hydrophobic amino acids) alternate with variable amino acid residues. The LRR units are usually 20–30 amino acids long. The leucine residues presumably form the hydrophobic backbone of a horseshoe structure with arrays of solvent-exposed amino acids which have the potential to bind a vast number of structurally unrelated proteins (reviewed by Kedzierski et al., 2004). Thus, identification of the mechanisms that cause genetic diversity in this domain is crucial to our understanding of plant resistance.

The number of LRR units is highly variable between different RGA lineages, but is usually in the range of one to two dozen. Sometimes, the number of LRR units can vary dramatically even between members of the same RGA family and in closely related species (Caicedo and Schaal, 2004). Current models explain this variation in size by unequal crossing-over or template slippage between individual LRR units of an RGA (Caicedo and Schaal, 2004; Kuang et al., 2004; Noël et al., 1999). Such a recombination event requires two highly similar stretches of at least dozens (usually hundreds) of base pairs in size. Mispairing between such repeat units and subsequent crossing-over then leads to the observed duplications or deletions (i.e. increase or decrease in LRR unit number).

However, as our study will show, sequence conservation between LRR units can be limited to the LRR backbone motif. Such divergent repeat units share almost no DNA sequence identity and cannot serve as a template for unequal crossing-over. Therefore, an additional mechanism is required that creates an initial duplication providing two very similar repeat units that allow unequal crossing-over. The question of what causes these initial duplications inspired this study.

Our hypothesis was that such initial duplications within LRR domains are caused by illegitimate recombination (IR), which has been previously described as a frequent source for deletions and duplications in plant genomes (Devos et al., 2002; Ma et al., 2004; Wicker et al., 2003, 2005). The molecular mechanism of IR is poorly understood and it is possible that several independent mechanisms are involved (reviewed by Van Rijk and Bloemendal, 2003). Thus, the term IR describes a commonly observed phenomenon and not a particular molecular mechanism, but it nevertheless characterizes a well-defined group of events which produce specific signatures.

Illegitimate recombination can occur between dispersed homologous sequences of only a few (usually 1–10) base pairs (Daley et al., 2005; Devos et al., 2002). This asymmetric pairing followed by sequence exchange can result in either duplications or deletions. A schematic example of an IR event between two dispersed 3-bp motifs is depicted in Figure 1. Typical evidence for IR is thus the presence of short direct repeats flanking duplicated units (Devos et al., 2002; Ma et al., 2004; Wicker et al., 2003), a characteristic which we refer to as the ‘IR signature’. However, occasionally, IR can also occur in the absence of sequence homology (Daley et al., 2005). Deletions caused by IR can only be detected if a homologous sequence (e.g. another gene from the same cluster) that does not contain the deletion is available (Figure 1b). Because short homologous motifs can occur by chance, IR events have an apparently random nature.

Figure 1.

 Example for illegitimate recombination (IR).
This is a purely schematic depiction and does not suggest a molecular mechanism or actual pairing of the short repeats.
(a) A 3-bp motif (TAG) that occurs twice by chance serves as template for IR, resulting in either a duplication or a deletion. The two repeat units overlap by the 3-bp template and are also flanked by it.
(b) A deletion can only be detected if a homologous sequence without the deletion sequence is available.
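
The schematic in Figure 1 can also be expressed as a small string operation. The following is a minimal sketch and not a model of the molecular mechanism: the function name `ir_products`, the example sequence and the choice of the first two motif occurrences are illustrative assumptions.

```python
def ir_products(seq: str, motif: str) -> tuple[str, str]:
    """Return the (duplication, deletion) products of a schematic IR event.

    The event is placed between the first two occurrences of `motif` in `seq`
    (an illustrative simplification).
    """
    first = seq.find(motif)
    second = seq.find(motif, first + len(motif)) if first >= 0 else -1
    if first < 0 or second < 0:
        raise ValueError("motif must occur at least twice")
    # The repeat unit starts with the motif and ends just before its second copy.
    unit = seq[first:second]
    # Duplication: the unit is present twice, followed by the motif again.
    duplication = seq[:second] + unit + seq[second:]
    # Deletion: the unit is lost and a single copy of the motif remains.
    deletion = seq[:first] + seq[second:]
    return duplication, deletion


dup, dele = ir_products("AATAGCCGGTAGTT", "TAG")
print(dup)   # AATAGCCGGTAGCCGGTAGTT
print(dele)  # AATAGTT
```

In the duplication product, each repeat unit begins with the TAG motif and the array is followed by a further TAG, which corresponds to the IR signature described above.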

In this study, we test the hypothesis that IR is the cause of duplications in LRR domains that can then trigger expansion (or contraction) through subsequent unequal crossing-over events. Through comparative analysis of a large variety of RGA families, we found strong evidence supporting our hypothesis and indicating that IR is the major cause for the initiation of size variation in LRR domains of R genes. Illegitimate recombination was found to cause apparently random duplications which serve as templates for subsequent unequal crossing-over and lead to the observed rapid increase and decrease in the number of LRR units. Illegitimate recombination was also found to be an important source of deletions in RGAs.

Results

As a starting set we chose 16 previously described R genes (Table 1). Based on phylogenetic analysis of their predicted protein sequences, they could be separated into 16 different clades which are hereafter referred to as evolutionary lineages. Fourteen lineages are of the NBS-LRR type and represent a wide variety of sequences. The two most closely related ones (RGH2b from barley, Hordeum vulgare, and Pi-ta from rice, Oryza sativa) are only 53% similar at the amino acid level while several cannot be aligned at all because of strong sequence divergence. Two lineages (Cf-2 and Cf-9) do not have an NBS domain but instead encode proteins with a transmembrane and an LRR domain. To search for sequence variation caused by IR, we tried to identify multiple homologs for each of the 16 initial R genes. All RGA sequences described here were found in public databases, except Pm3 homologs from tetraploid and hexaploid wheat (Triticum durum and T. aestivum), which are described here.

Table 1.   Overview of resistance gene analogs and their homologs used in this study. The first two letters of the gene name indicate the Latin species names.

| Gene | Common name | Reference | Paralogs (a) | Homologs (b) |
|---|---|---|---|---|
| AetLr21 | Aegilops tauschii | Huang et al., 2003 | 0 | 2 |
| AtRPP5 | Arabidopsis | Noël et al., 1999 | 9 | 0 |
| AtRPP8 | Arabidopsis | McDowell et al., 1998 | 1 | 0 |
| AtRPM1 | Arabidopsis | Grant et al., 1995 | 0 | 0 |
| HvMla | Barley | Halterman et al., 2001 | 0 | 8 |
| HvRGH2a | Barley | Wei et al., 2002 | 0 | 5 |
| HvRGH3a | Barley | Wei et al., 2002 | 0 | 5 |
| LsRGC2A | Lettuce | Meyers et al., 1998 | 17 | 0 |
| SpCf-9 | Tomato | Parniske et al., 1997 | 16 | 0 |
| OsPi-ta | Rice | Bryan et al., 2000 | 1 | 0 |
| OsXa1 | Rice | Yoshimura et al., 1998 | 4 | 0 |
| SpCf-2 | Tomato | Dixon et al., 1998 | 8 | 0 |
| TaPm3 | Wheat | Yahiaoui et al., 2006; this work | 6 | 12 |
| TmLr10 | Wheat | Feuillet et al., 2003 | 0 | 4 |
| TmRGA2 | Wheat | Feuillet et al., 2003 | 0 | 5 |
| ZmRp1 | Maize | Ramakrishna et al., 2002 | 3 | 6 |
| Total | | | 65 | 47 |

Aet, Aegilops tauschii; At, Arabidopsis thaliana; Hv, Hordeum vulgare; Ls, Lactuca sativa; Os, Oryza sativa; Sp, Solanum pimpinellifolium; Ta, Triticum aestivum; Tm, Triticum monococcum; Zm, Zea mays.

(a) Includes paralogs from the same species as well as homologs from closely related varieties or species.

(b) Number of homologs identified in the rice genome.

One lineage (RPM1) is represented by only a single gene (AtRPM1), as we did not find a homolog with significant sequence identity at the DNA level. Nine lineages are represented by multiple genes (paralogs) from the same species (e.g. Cf-9 or RPP8, Table 1). Most paralogs of a given lineage were found in a single cluster. For eight evolutionary lineages (all RGAs from grasses other than rice), we could identify 2–12 homologs in the rice genome (Table 1). These were identified by blastn, using as a cutoff blast alignments with >60 bp and >75% DNA sequence identity.
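
The cutoff described above can be applied to blastn results with a short filter. The sketch below assumes the hits were saved in BLAST tabular format (`-outfmt 6`) with the default column order, in which the third column is percent identity and the fourth is alignment length; the function name and the example hit lines are hypothetical.

```python
def passes_cutoff(hit_line: str, min_length: int = 60,
                  min_identity: float = 75.0) -> bool:
    """Check one tab-separated BLAST hit against the length/identity cutoff."""
    fields = hit_line.rstrip("\n").split("\t")
    identity = float(fields[2])   # percent identity (column 3)
    length = int(fields[3])       # alignment length in bp (column 4)
    return length > min_length and identity > min_identity


# Invented example hits, for illustration only.
hits = [
    "query1\tsubjectA\t78.20\t350\t70\t3\t1\t350\t1200\t1549\t1e-60\t250",
    "query1\tsubjectB\t91.00\t55\t5\t0\t10\t64\t300\t354\t1e-10\t80",
    "query1\tsubjectC\t72.00\t400\t112\t0\t1\t400\t1\t400\t1e-30\t150",
]
kept = [h.split("\t")[1] for h in hits if passes_cutoff(h)]
print(kept)  # ['subjectA']
```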

The RGAs from the dicotyledonous species lettuce (Lactuca sativa) and tomato (Solanum pimpinellifolium) were used for blastn searches against the Arabidopsis genome but no homologs could be identified based on DNA sequence homology using the same stringency as above. It was also not possible to identify homologs of the grass RGAs in the Arabidopsis genome based on DNA sequence conservation.

In summary, our test set contained 112 RGAs from seven species. All lineages except RPM1 are represented by 2–19 homologs (Table 1). Homologs from the same lineage are 65% to >99% identical to each other at the DNA level. The lineage with the most members is Pm3, which contains seven RGAs from wheat and 12 from rice. About half of the identified RGAs are pseudogenes which contain frameshifts, in-frame stop codons, deletions or transposable element insertions. For our analysis, we did not differentiate between pseudogenes and intact RGAs.

The majority of evolutionary lineages show variable LRR domain size whereas some are highly conserved in size

To search for events that may have been caused by IR, we screened the RGA set for the presence of duplications that are detectable at the DNA level. Here, we considered all duplications that were at least 25 bp long with duplicated units >70% identical at the DNA level. Such duplications are clearly visible in a dot-plot alignment of the sequence against itself and allow reliable sequence alignments with common software. Additionally, we searched for differences (insertions, deletions) between members of the same lineage.
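
The screening criterion (duplicated units of at least 25 bp that are >70% identical) can be illustrated with a simple ungapped self-comparison along the diagonals of a dot-plot. This is only a sketch under stated assumptions: it ignores gaps, uses a fixed window, and the function names and the test sequence are invented for illustration. It is not the dot-plot software used for the analysis.

```python
def window_identity(seq: str, start: int, offset: int, length: int) -> float:
    """Fraction of identical bases between a window and the window `offset` bp downstream."""
    matches = sum(seq[i] == seq[i + offset] for i in range(start, start + length))
    return matches / length


def find_direct_repeats(seq: str, min_len: int = 25,
                        min_identity: float = 0.70) -> list[tuple[int, int, float]]:
    """Return (start, offset, identity) for every window that passes the criterion.

    `offset` is the distance between the two compared windows; it is kept at
    `min_len` or more so that the two windows do not overlap.
    """
    hits = []
    n = len(seq)
    for offset in range(min_len, n - min_len + 1):
        for start in range(0, n - offset - min_len + 1):
            identity = window_identity(seq, start, offset, min_len)
            if identity > min_identity:
                hits.append((start, offset, identity))
    return hits


# Invented 30-bp unit, duplicated in tandem between unrelated flanks.
unit = "ACGTTGCAAGCTAGGCTTACCGATGGATCA"
test_seq = "TTTTTTTTTT" + unit + unit + "GGGGGGGGGG"
repeat_hits = find_direct_repeats(test_seq)
print((10, 30, 1.0) in repeat_hits)  # True
```

Neighbouring windows on the same diagonal are all reported, so in practice overlapping hits would be merged into a single duplicated region.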

Within the five lineages Cf-4/Cf-9, Lr21, Mla, RGA2 and RPP8, the size of the LRR domain is highly conserved. For example, the Mla lineage which contains one RGA from barley and eight from rice is highly conserved in size (2827–2946 bp). Most of that variation comes from variable positions of the stop codon. Other than that, the LRR domains only differ in small indels of a few to a few dozen base pairs.

In these five lineages as well as in the Arabidopsis RPM1 gene, no repeats are detectable at the DNA level in the LRR domain (Supplementary Figure S1). Their LRR units are so divergent that basically only the leucine residues are conserved between individual LRR units, corresponding to <30% DNA sequence identity. Therefore, these LRR units cannot serve as a template for unequal crossing-over.

In 10 lineages, we found variation in the size of the LRR domain between individual members. As variations, we considered all size differences between LRRs of different genes that were larger than 50 bp (a size difference that is easily detectable with visual alignment programs such as dotter). We used a minimal size difference of 50 bp as some homologs from the same lineage are sometimes only 65–70% identical at the DNA level and local variable regions may sometimes differ slightly in size but are difficult to align. The observed size variations included short deletions or insertions as well as dramatic differences in LRR numbers which could result in a size difference of several hundred base pairs. All of these 10 lineages also contain repeat structures that are detectable at the DNA level.

Most duplications in LRR domains show illegitimate recombination signatures

The repeat structures identified in the above 10 lineages were divided into two classes, ‘simple’ and ‘complex’ duplications. As ‘simple duplications’ we define those duplications that consist of only two highly similar repeat units, whereas ‘complex duplications’ refer to arrays of multiple repeat units (see below). We identified 26 simple duplications that range in size from 30 to 498 bp (Table 2). Some are exclusive to just one gene and some occur in multiple genes, indicating that the duplication event pre-dated gene divergence (Table 2). Only two genes, OsRGA2_8-1 and TdPm3-2, contain duplications within the NBS domain; all others were found in the LRR domain.

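Table 2.   Simple duplications in leucine-rich repeat domains

| Gene | IRS, bp (a) | Mismatch (b) | Length, bp (c) | Identity (%) |
|---|---|---|---|---|
| AtRPP5 | 2 | 0 | 40 | 90 |
| AtRPP5 | 10 | 1 | 280 | 93 |
| AtRPP5-6 | 2 | 0 | 335 | 96 |
| AtRPP5-α (d) | | | 294 | 79 |
| AtRPP5-β (d) | | | 388 | 86 |
| HvRGH3 | 5 | 0 | 110 | 97 |
| LsRGC2A | 26 | 4 | 206 | 81 |
| LsRGC2 K | | | 468 | 85 |
| LsRGC2_LC3 (d) | 3 | 0 | 291 | 76 |
| LsRGC2_LC4 (d) | 5 | 1 | 294 | 90 |
| LsRGC2_TC4 (d) | 13 | 2 | 298 | 80 |
| LsRGC2_TC7 (d) | | | 498 | 80 |
| LsRGC2_TM9 (d) | 5 | 0 | 299 | 84 |
| LsRGC2_TCZ (d) | | | 229 | 78 |
| LsRGC2_LA6 (d) | 1 | 0 | 360 | 74 |
| OsLr10_11-2 | 2 | 1 | 62 | 95 |
| OsPm3_3-3 | 6 | 2 | 30 | 83 |
| OsPm3_10-2 | 4 | 0 | 30 | 93 |
| OsRGA2_8-1 | | | 119 | 80 |
| OsRGH3_11-3 | 5 | 0 | 110 | 97 |
| OsRp1-5 | 5 | 1 | 67 | 88 |
| OsXa1-3 | 4 | 1 | 284 | 75 |
| SpCf2-α (d) | | | 131 | 75 |
| TaPm3 | | | 90 | 84 |
| TdPm3-2 | 12 | 2 | 36 | 88 |
| TdPm3-3 | 6 | 1 | 435 | 83 |

(a) Size of the illegitimate recombination signature (IRS), a stretch of microhomology that served as template for the duplication.

(b) Number of mismatches in the IRS motif.

(c) Size of duplicated unit.

(d) Duplication is conserved in multiple genes.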

According to our hypothesis, a majority of these simple duplications should show an IR signature (i.e. the duplications are flanked by the short repeat motifs that served as templates for the IR). Indeed, in 16 of the 26 duplications, such an IR signature could be identified. The IR signatures have sizes of 2–26 bp. Seven IR signatures are perfect repeats whereas nine contain mismatches (Table 2). Apparently, some simple duplications are more recent than others, as the level of sequence identity between duplicated units ranges from 74 to 97%. The number of mismatches in the IR signatures is usually in the range that would be expected from the overall sequence identity of the two duplicated units (e.g. the duplicated units in the LsRGC2A gene are 81% identical and the 26-bp IR signature contains four mismatches). Two additional putative IR signatures were 1 bp, and 2 bp with one mismatch, respectively (Table 2). However, these could have occurred simply by chance.

The duplications where an IR signature could be identified are usually those with higher sequence identity, whereas those without IR signature are among the most divergent ones (Table 2). Thus, it is possible that some IR signatures were eliminated by sequence divergence or subsequent deletions. For the more divergent duplications, we also found that they are often not immediately adjacent to one another but separated by a few dozen base pairs, suggesting an additional deletion that eliminated the actual border of the duplication. Although another explanation for the absence of IR signatures is that some IR events require no sequence homology at all (Daley et al., 2005), we considered only those duplications which contain an IR signature as supporting our hypothesis.

Interestingly, several independent duplications occurred in the same regions of LRR domains. In RPP5 genes, we identified three independent duplications which occurred within a 500-bp region in the central part of the LRR domain. Multiple duplications were previously described in a very narrow region of the RGC2 homologs (Kuang et al., 2004). However, our analysis shows that the 13 independent duplications described by Kuang et al. (2004) are in fact only seven events that occurred in different evolutionary lineages. For each of the seven events, one example is given in Table 2 and a multiple sequence alignment with precise locations of the duplications is shown in Supplementary Figure S2.

In addition, most RGA lineages contain ancient duplications which can only be detected at the protein level as the underlying DNA sequence repeats have diverged too much. In most cases, the repeat units are so divergent that it was not possible to determine the exact borders of the duplications (data not shown).

Simple duplications can serve as template for subsequent unequal crossing-over

According to our hypothesis, complex duplications result from the amplification of an initial simple duplication: once a simple duplication has occurred, it can serve as template for repeat amplification through unequal crossing-over or template slippage. Four RGA lineages (RGH2b, Cf-2, Xa1 and Pm3) contain arrays of multiple repeat units that were presumably produced in this way.

The repeat array in RGH2b illustrates the basic principle of IR-initiated repeat amplification: A 30-bp fragment was duplicated through IR between two 5-bp repeats. The two repeat units were amplified in two subsequent events of unequal crossing-over or template slippage, resulting in the current array of four repeat units. Comparison of the four repeat units led to a detailed model for the evolution of the array, including IR, unequal crossing-over and point mutations to explain the current sequence arrangement (Figure 2).

Figure 2.

 Model for the evolution of a repeat array in the leucine-rich repeat domain of the barley RGH2b gene.
An initial duplication is caused by illegitimate recombination between two 5-bp repeat motifs (bold letters). The initial duplication is amplified in two steps by unequal crossing-over or template slippage. Mutations were introduced as late as possible.
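
The amplification step can be illustrated with a toy model in which a repeat array is a list of unit labels and two copies of the array pair out of register before recombining. This is a sketch only: the function name, the labels `u1`/`u2` and the chosen pairing positions are assumptions, and it does not reproduce the specific sequence arrangement shown in Figure 2.

```python
def unequal_crossover(array_a: list[str], array_b: list[str],
                      i: int, j: int) -> tuple[list[str], list[str]]:
    """Recombine two tandem arrays in which unit i of A has paired with unit j of B.

    The crossover is placed directly after the paired units, so one product
    gains the units that the other product loses.
    """
    product_1 = array_a[: i + 1] + array_b[j + 1:]
    product_2 = array_b[: j + 1] + array_a[i + 1:]
    return product_1, product_2


# Start from a simple duplication (two units), as created by an IR event.
two_units = ["u1", "u2"]
three_units, one_unit = unequal_crossover(two_units, two_units, 1, 0)
four_units, back_to_two = unequal_crossover(three_units, three_units, 2, 1)
print(three_units)  # ['u1', 'u2', 'u2']
print(four_units)   # ['u1', 'u2', 'u2', 'u2']
```

Two out-of-register events are enough to go from two to four units, and each event also yields a reciprocal product with fewer units, which is how the same process can decrease the number of repeats.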

The tomato Cf-2 gene contains a more complex array of direct repeats which was previously described as consisting of two types of LRR units (A-type and B-type; Dixon et al., 1996). Our analysis showed that the current structure of the LRR domain has its origin in two independent IR events. A first duplication of a 144-bp fragment was caused by IR between two 8-bp motifs (Figure 3). This initial duplication served as template for several unequal crossing-over or template slippage events which resulted in an array of nine duplicated units. A second IR event took place within repeat unit 5 when a 19-bp imperfect repeat (two mismatches) presumably served as template for the duplication of a 72-bp fragment (corresponding to the B-type LRR unit; Dixon et al., 1996). These two newly duplicated units were also amplified through unequal crossing-over or template slippage, resulting in four subrepeats within the underlying array of 144-bp repeats (Figure 3b). Thus, the A- and B-type LRR units evolved from two IR events that affected different parts of the LRR domain. This highly repetitive structure is obviously very unstable and caused frequent unequal crossing-over events, resulting in a wide variety of Cf-2-like genes with different numbers of LRR units (Caicedo and Schaal, 2004; Dixon et al., 1998).

Figure 3.

 Illegitimate recombination (IR) signatures in the repeat array of the Cf-2 gene.
(a) The leucine-rich repeat domain contains two repeat arrays, previously described as A and B type (Dixon et al., 1996). The larger A-type repeat units (white arrows) represent the more ancient duplications. More recently, amplification of part of the ancient repeat units led to a second, superimposed, array of shorter repeats (grey arrows). Locations of IR signatures are indicated above the alignment with colors corresponding to those in (b).
(b) Sequence of the repeat array, corresponding to positions 451–2000 of the Cf-2 coding region (accession number AY793347). The IR signature for the ancient duplication is in red. The IR templates for the recent duplication are shown in blue and yellow. The first unit of the recent repeat array (underlined) is printed in bold letters.

The Xa1 lineage also shows the characteristics of IR-based repeat evolution. The Xa1 cluster in rice contains five RGAs (OsXa1-1 through OsXa1-5). OsXa1-1 and OsXa1-2 contain no detectable repeat structures. OsXa1-3 contains a simple duplication of 284 bp close to its 3′ end with a 4-bp IR signature. This duplication gave rise to a repeat array in OsXa1-4 and OsXa1-5, which both contain four repeat units. Comparison of these four repeat units showed that units 1–3 of OsXa1-4 and OsXa1-5 are most similar to the first unit of the simple duplication in OsXa1-3 whereas unit 4 is most similar to the second unit of OsXa1-3, thus indicating that units 1–3 evolved from amplification of the first duplication unit in OsXa1-3 (data not shown).

Specific duplication events can be traced back more than 50 million years

The most comprehensive set of RGAs available for this study were those of the Pm3 lineage. The Pm3 gene cluster is found on wheat chromosome 1A. One member of this gene cluster encodes a series of seven recently cloned alleles conferring resistance to wheat powdery mildew (Srichumpa et al., 2005; Yahiaoui et al., 2004, 2006). Here, we have sequenced bacterial artificial chromosome (BAC) clones from the Pm3 locus of tetraploid wheat (Triticum turgidum) and hexaploid wheat (Triticum aestivum cv. Chinese Spring) on which we identified three and two homologs, respectively. Among these, the TaPm3CS gene is allelic to the cloned Pm3 resistance alleles. A sixth Pm3-like sequence was previously found in diploid wheat (Triticum monococcum; Wicker et al., 2003). In addition, we also identified 12 Pm3 homologs from the rice genome. The three wheat species analyzed (T. monococcum, T. turgidum and T. aestivum) diverged within the last 3 million years (Wicker et al., 2003) whereas rice diverged from wheat approximately 50 million years ago (Paterson et al., 2004).

Five RGAs from wheat and four from rice are apparently pseudogenes as they contain frameshifts and in-frame stop codons. For subsequent analysis, frameshifts were removed manually, to obtain hypothetical protein sequences for all of them.

Phylogenetic analysis of the predicted protein sequences of the NBS domain from all 18 wheat and rice Pm3-like genes showed that the Pm3-like genes are separated into three main sublineages (I, II and III, Figure 4a). These sublineages also largely correspond to four loci consisting of one to five genes on different rice chromosomes (Figure 4a). Interestingly, the cluster of five genes on rice chromosome 1 contains members of all three sublineages, suggesting that it represents the ancestor locus from which individual genes (or groups of genes) were moved to other chromosomes. The Pm3 genes characterized in wheat are most closely related to the genes on rice chromosome 3 (Figure 4a).

Figure 4.

 Duplication events during the evolution of Pm3-like genes.
(a) Phylogenetic tree of genes from rice and wheat. Species to which genes belong are indicated by a prefix (Os: Oryza sativa, Ta: Triticum aestivum, Td: Triticum durum, Tm: T. monococcum). For rice genes, the designation ‘OsPm3’ is followed by the chromosome number and individual copy number of loci. Values for 100 bootstrap replicates are given for all branches. The closest Pm3 homolog from Arabidopsis (At3g14460) served as the outgroup. Letters in circles indicate in which branches of the phylogenetic tree duplications occurred. Putative IR events are indicated with a ‘D’ and unequal crossing-over with an ‘X’. Two main duplications (α- and β-duplication) are labeled. The triangle indicates an event where the β-duplication was presumably reversed through unequal crossing-over.
(b) Repeat structures in Pm3-like genes. Direct repeats are indicated by triangles. White triangles indicate repeat units that originated from the α-duplication and grey triangles those from the β-duplication. Independent duplication events that are exclusive to single genes are indicated as black triangles.
(c) DotPlot alignment of the predicted protein sequence of the rice gene OsPm3_3-2. Both the α- and β-duplication are detectable at the protein level.
(d) DotPlot alignment of the DNA sequence of TaPm3-9 from T. aestivum. Only the more recent β-duplication is detectable at the DNA level. The units of the β-duplication have gone through two rounds of amplification through unequal crossing-over or template slippage, resulting in an array of four repeat units.

Most Pm3-like genes from rice and wheat contain duplications; some are simple duplications and some are arrays of multiple duplicated units (Figure 4b). Two distinct simple duplication events (referred to as α- and β-duplication) have occurred during the early evolution of Pm3-like genes. The α-duplication is more ancient and its repeat units are so divergent that they can only be detected at the protein level (Figure 4c). The repeat units are about 67% similar to each other and encode approximately 70 amino acids. Because of the ancient nature of the duplication, the exact sizes and degrees of similarities of the protein sequence differ between individual genes. The α-duplication units were amplified by unequal crossing-over or template slippage independently in two sublineages. The most divergent gene, OsPm3_1-3, carries three repeat units. Apparently, the latest amplification step is a relatively recent event, as the resulting repeat units are still easily identifiable at the DNA level. Similarly, all Pm3-like genes from wheat contain four α-duplication units, indicating that two additional rounds of unequal crossing-over or template slippage occurred after the wheat/rice divergence (Figure 4b).

The β-duplication must be more recent as it can be detected at the DNA level by DotPlot. The duplicated region has a size of approximately 170 bp but the two repeat units are too divergent for a reliable sequence alignment. The β-duplication apparently also served multiple times as template for unequal crossing-over or template slippage: all wheat Pm3-like genes possess one additional β-duplication unit (Figure 4b), indicating that an additional amplification step must have occurred within the last 50 million years. A second amplification step took place more recently in the TaPm3-9 gene as it contains four β-duplication units. Indeed, DotPlot alignment shows that the four repeat units in TaPm3-9 form a succession of decreasing similarities between repeat units, with the more divergent units representing the more ancient duplication events (Figure 4d). Interestingly, the β-duplication is absent from the entire evolutionary sublineage III although the phylogenetic tree suggests that the founder gene for sublineage III must have possessed it. This suggests that the β-duplication was eliminated from that sublineage through an unequal crossing-over event that reversed the initial duplication.

Interestingly, the OsPm3_10-2 gene from rice chromosome 10 contains three repeat units in almost exactly the same region as the β-duplication (Figure 4b). However, the breakpoints of that duplication are different and thus represent an independent duplication that is exclusive to OsPm3_10-2.

Illegitimate recombination caused several deletions in RGA lineages

Ten evolutionary lineages contain RGAs which carry deletions compared with other members of the same lineage. By comparison of RGAs from the same evolutionary lineage, we identified 57 deletions (i.e. a stretch of DNA that was absent in one or more of the homologs but was present in others). Most deletions are exclusive to one gene while some are conserved in multiple genes, indicating that the event occurred before their divergence. The deletions range in size from 9 to 825 bp with 52 of them being smaller than 60 bp (Supplementary Table S1). For this analysis, only those RGAs were used which are >80% identical at the DNA level (usually they were all from the same species or even the same cluster) so that reliable sequence alignments between homologs in the region of the deletions could be obtained.

Of the 57 deletions, 23 showed IR signatures comprising perfect direct repeats of 2–8 bp. In 20 cases, an IR signature of 3 bp or more carrying one mismatch was identified. Seven IR signatures have a size of only 1 bp. In 11 cases, no IR signature at all was found (Supplementary Table S1). In 13 cases, the identified perfect IR signature was also part of a larger IR signature with a mismatch (Supplementary Table S1).

To study the extent to which IR signatures occur by chance, we wrote a simulation program which introduces random deletions of 300, 200, 100, 50 and 20 bp into an input sequence and then tests for the presence of IR signatures. The simulation was run 10 000 times for each deletion size on an RGA representative of each of the 10 lineages in which deletions were detected, totaling 500 000 runs (10 000 simulations of five deletion sizes on 10 different RGAs). Table 3 shows that IR signatures of 2 bp or more are over-represented at P = 0.05. Illegitimate recombination signatures longer than 2 bp are highly over-represented (P = 0.01). These data indicate that the observed IR signatures (especially those longer than 2 bp) are indeed statistically highly unlikely to have occurred by chance. Curiously, 1-bp IR signatures in our sample of 57 deletions are much less frequent than would be expected from a purely random distribution.
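
A simulation of this kind could look roughly as follows. This is a minimal sketch and not the original program: the definition of a signature used here (perfect microhomology spanning the deletion junction, with mismatches ignored), the function names and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def ir_signature_size(seq: str, start: int, length: int) -> int:
    """Size of the perfect microhomology at the junction of a deletion.

    The deletion removes seq[start:start + length]. Bases are counted while
    the start of the deleted segment matches the sequence just after it, and
    while the end of the deleted segment matches the sequence just before it.
    """
    end = start + length
    right = 0
    while (right < length and end + right < len(seq)
           and seq[start + right] == seq[end + right]):
        right += 1
    left = 0
    while (left < length and start - 1 - left >= 0
           and seq[start - 1 - left] == seq[end - 1 - left]):
        left += 1
    return left + right


def simulate_deletions(seq: str, sizes=(300, 200, 100, 50, 20),
                       runs: int = 10_000, seed: int = 0) -> Counter:
    """Tally signature sizes over `runs` random deletions of each size."""
    rng = random.Random(seed)
    counts = Counter()
    for size in sizes:
        if size >= len(seq):
            raise ValueError("sequence must be longer than every deletion size")
        for _ in range(runs):
            start = rng.randrange(0, len(seq) - size + 1)
            counts[ir_signature_size(seq, start, size)] += 1
    return counts
```

Calling `simulate_deletions` on one representative sequence per lineage and summing the resulting counters would give a tally of signature sizes in randomly placed deletions, against which the observed signatures can be compared.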

Table 3.   Sizes of illegitimate recombination signatures (IRS) in 500 000 simulated random deletions and in a sample of 57 deletions identified in resistance gene analogs (RGAs)

| IRS size (bp) | Simulation | RGA | Expected (a) |
|---|---|---|---|
| 1 | 183 911 | 7 (b,c) | 21.966 |
| 2 | 48 162 | 10 (c) | 5.490 |
| 3 | 12 422 | 5 (b,c) | 1.416 |
| 4 | 3634 | 3 (b,c) | 0.414 |
| 5 | 799 | 2 (b,c) | 0.091 |
| 6 | 379 | 2 (b,c) | 0.043 |
| 7 | 125 | 0 | 0.014 |
| 8 | 31 | 1 (b,c) | 0.004 |

(a) Number of IR signatures expected in a sample of 57 deletions based on the simulation.

(b) Significantly different from expected value (P = 0.01).

(c) Significantly different from expected value (P = 0.05).
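
The expected values in Table 3 follow from scaling the simulated counts to a sample of 57 deletions. The sketch below shows that scaling and, as an assumption, an exact binomial upper-tail probability as one possible way to judge over-representation; the statistical test actually used is not specified here, and the function names are illustrative.

```python
from math import comb


def expected_in_sample(simulated_count: int, total_simulations: int = 500_000,
                       sample_size: int = 57) -> float:
    """Scale a simulated count to the number expected in the observed sample."""
    return simulated_count / total_simulations * sample_size


def binomial_upper_tail(observed: int, sample_size: int, prob: float) -> float:
    """P(X >= observed) for X ~ Binomial(sample_size, prob) (assumed test)."""
    return sum(comb(sample_size, k) * prob ** k * (1 - prob) ** (sample_size - k)
               for k in range(observed, sample_size + 1))


print(round(expected_in_sample(48_162), 3))  # 5.49  (2-bp signatures)
print(round(expected_in_sample(3_634), 3))   # 0.414 (4-bp signatures)

# Tail probability for observing 10 or more 2-bp signatures among 57 deletions.
p_two_bp = binomial_upper_tail(10, 57, 48_162 / 500_000)
```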

Discussion

The biological function of the LRR domain is widely believed to be the recognition of pathogen effectors. Current models of evolution therefore focus on mechanisms that can introduce variability in the LRR domain, allowing for new R gene specificities to evolve. The birth-and-death model (Michelmore and Meyers, 1998), which describes how new varieties of RGAs arise and disappear, includes sequence exchange through unequal crossing-over and gene conversion as well as point mutations as sources for genetic variability. Amplification of direct repeats is believed to be caused by unequal crossing-over or template slippage (Caicedo and Schaal, 2004; Kuang et al., 2004; Noël et al., 1999). The name ‘leucine-rich repeat’ suggests a repetitive structure of the LRR domain and implies that unequal crossing-over or template slippage between LRR units is not surprising. Consequently, the question concerning the origin of duplications in LRR domains was not raised previously. However, often only the leucine residues of the LRR backbone are conserved between LRR units, making them too divergent to allow unequal crossing-over or template slippage. Examples are the six evolutionary RGA lineages Cf-4/Cf-9, Lr21, Mla, RGA2, RPM1 and RPP8 which show the typical LRR backbone motifs but do not contain any duplications detectable at the DNA level.

In this study, we tested the hypothesis that IR is a major source of simple duplications in LRR domains of RGAs. In almost half (39/83) of the simple duplications and deletions examined, we found perfect IR signatures. This ratio even increases to more than three-quarters (64/83) if IR signatures of only 1 bp and those with mismatches are included. These data suggest that in fact all of the simple duplications and deletions investigated are the product of IR events. Those incidents where no IR signature is found could be explained by the subsequent loss of an IR signature through deletions or mutations. Alternatively, some IR events might not require sequence homology at all (Daley et al., 2005). However, despite that latter possibility, our data show that in most cases evidence for IR in the form of (perfect or imperfect) IR signatures can be found.

Thus, we propose IR as a new major evolutionary mechanism that is at the basis of variability of LRR domains. The addition of IR allows us to close a gap in current models by providing an explanation as to how initial variation that is essential for subsequent sequence evolution, according to previous models, occurs in LRR domains. We were able to explain and completely resolve complex repeat arrays in four RGA lineages (Cf-2, RGH2, Pm3 and Xa1) by tracing them back to initial simple duplications caused by IR.

Duplications caused by IR will diverge with time until they reach the point at which unequal crossing-over can no longer occur. In most LRR domains, traces of ancient duplications are still detectable at the protein level. One can therefore assume that most LRR units are the product of ancient cycles of duplication and divergence. The repeat regions in the Pm3-like genes show how such duplications can shape the evolution of R gene families for at least 50 million years.

Functional constraints to illegitimate recombination events

Because IR requires only a few base pairs of homology, each genomic region has, in principle, an equal chance of being affected. Illegitimate recombination creates new genomic DNA through duplications but also removes it from the genome by causing deletions. Previous studies in plants showed that repetitive and other intergenic sequences (believed to be free from selection pressure) are constantly and rapidly reshuffled, resulting in a complete turnover of intergenic sequences within a few million years (Devos et al., 2002; SanMiguel et al., 2002; Wicker et al., 2003). This turnover of genetic material can be seen as the basic background rate of genomic rearrangements that applies to sequences which are not under selection pressure. If functionally important sequences are affected, natural selection will rapidly eliminate those changes that are disadvantageous to the organism. In coding regions of genes, only those changes that do not disrupt the function of a protein will be maintained in the genome. Consequently, several studies have reported IR in intergenic sequences (Devos et al., 2003; Wicker et al., 2003; Ma et al., 2004; Wicker et al., 2005) but only a few studies have described how new variations of genes arose from IR events (Arguello et al., 2006). In contrast, our data show IR to be a frequent source of variability in the evolution of multiple gene families.

There must be considerable constraints on duplications and deletions caused by IR as they should not interrupt the open reading frame or distort the three-dimensional structure of the protein. We observed that certain areas of RGAs allow more modifications whereas others are strongly conserved. It is striking that we found a total of 12 independent duplications occurring within a very narrow region in the central part of the LRR domains of the RPP5, RGC2 and Pm3 genes. In contrast, we found only two duplications within NBS domains. Further studies should investigate the functional significance of IR events in LRR domains for the specificity of recognition of pathogen effectors.

Conclusion and outlook

From our data, we conclude that IR-based duplications and deletions have played a major role in the creation of molecular diversity in RGAs, and therefore have contributed substantially to the evolution of new R gene specificities. However, the precise molecular mechanisms underlying the IR phenomenon are still unknown and a variety of different mechanisms could be responsible. Our data indicate that IR seems to mostly occur between short 0–8 bp stretches of homology. Possible molecular mechanisms could include slipped strand replication which was shown in Escherichia coli to occur between dispersed repeats of 8 bp or more (Lovett, 2004) or it could be the result of double strand break repair mechanisms (Daley et al., 2005). It is essential for our understanding of evolution to identify and characterize the molecular mechanisms behind IR. It is also unknown how frequently IR actually occurs in plant genomes and what its overall impact on genome evolution is. Future large surveys should therefore investigate how much evidence for IR can be found in gene families other than LRR-encoding ones.

Finally, as numerous repeat structures can still be detected at the protein level in several of the RGAs analyzed, one can assume that many of them reflect ancient IR events. Thus it would be fascinating to investigate whether quasi-random duplications caused by IR in the coding regions of genes might ultimately be responsible for the evolutionary origin of all proteins with repetitive motifs.

Experimental procedures

Shotgun sequencing

Bacterial artificial chromosome clones from hexaploid wheat T. aestivum cv. Chinese Spring (Allouis et al., 2003) were identified with probes 294D11-31 and TmRGL1_pro (Yahiaoui et al., 2004). The BAC DNA was isolated with the Qiagen large construct kit (Qiagen, http://www.qiagen.com/), mechanically sheared into fragments of 3–5 kb and ligated into the Topo-Blunt vector (Invitrogen, http://www.invitrogen.com//). For transformation, electrocompetent DH-10B E. coli were used. Plasmid DNA was isolated in 96-well format on a QIAROBOT (Qiagen) and sequenced on an ABI3730 automated sequencer (Applied Biosystems, http://www.appliedbiosystems.com/). Base calling and quality trimming were performed using phred (Ewing et al., 1998) and the sequence assembly was performed with phrap (version 0.990319, http://www.phrap.org). Gaps in the BAC sequences were closed by resequencing shotgun clones spanning the gaps with 1 M betaine added to the sequencing reaction, by primer walking on shotgun clones or by PCR.

Sequence analysis

For sequence analysis, programs from the emboss package (http://emboss.sourceforge.net/), clustalw (Thompson et al., 1994) and dotter (Sonnhammer and Durbin, 1995) were used. For analysis of rice sequences, datasets from TIGR rice genome (version 4, http://www.tigr.org) were used. Rice homologs of RGAs from other species were identified by blastn (Altschul et al., 1997). Pairwise sequence alignments were performed with water (emboss package) using a gap creation penalty of 30.0 and a gap extension penalty of 0.1 for all alignments. water was used to align duplicated units as well as homologs for the identification of deletions.
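
As an illustration of the alignment settings given above, the following Python sketch assembles a command line for the emboss program water with a gap creation penalty of 30.0 and a gap extension penalty of 0.1. The option names (-asequence, -bsequence, -gapopen, -gapextend, -outfile) are those of the emboss command-line interface; the file names and the helper functions water_command and run_water are hypothetical, and the sketch is not the procedure actually scripted for this study.

```python
import subprocess


def water_command(seq_a, seq_b, out_file, gap_open=30.0, gap_extend=0.1):
    """Build the argument list for a pairwise Smith-Waterman alignment with water."""
    return [
        "water",
        "-asequence", seq_a,
        "-bsequence", seq_b,
        "-gapopen", str(gap_open),      # gap creation penalty used in this study
        "-gapextend", str(gap_extend),  # gap extension penalty used in this study
        "-outfile", out_file,
    ]


def run_water(seq_a, seq_b, out_file):
    """Run water; requires a local emboss installation."""
    subprocess.run(water_command(seq_a, seq_b, out_file), check=True)


if __name__ == "__main__":
    # Hypothetical input files holding two duplicated units of one RGA.
    print(" ".join(water_command("unit1.fasta", "unit2.fasta", "unit1_vs_unit2.water")))
```

A high gap creation penalty combined with a low extension penalty favors a few long gaps over many short ones, which suits the detection of single large deletions or duplications between otherwise similar sequences.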

For processing of large datasets and for the simulation of deletions, programs were written in Perl. One of these programs simulates deletions in RGA sequences: it introduces a 100-bp deletion at a random position in the input sequence and examines the breakpoints for the presence of IR signatures. It runs the simulation a user-defined number of times and summarizes how often IR signatures of each size occur. All coding sequences, deduced proteins and multiple alignments as well as original Perl programs used for the analysis are available upon request. Sequences were deposited at GenBank (accession numbers AY146587 and DQ251490).
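
The logic of such a deletion simulation can be sketched as follows. The sketch is written in Python rather than Perl and is not the original program. It assumes that an IR signature is scored as the number of identical bases shared by the two breakpoints of a deletion, counted by extending the match in both directions from the breakpoints; this scoring rule, the function names and the random placeholder sequence are assumptions made for illustration.

```python
import random
from collections import Counter


def ir_signature_length(seq, start, size):
    """Bases of homology shared by the breakpoints of the deletion seq[start:start+size]."""
    end = start + size
    # Extend rightwards: first bases of the deleted segment vs. bases just after it.
    right = 0
    while end + right < len(seq) and right < size and seq[start + right] == seq[end + right]:
        right += 1
    # Extend leftwards: bases just before the deletion vs. last bases of the deleted segment.
    left = 0
    while start - left - 1 >= 0 and left < size and seq[start - left - 1] == seq[end - left - 1]:
        left += 1
    return left + right


def simulate_deletions(seq, size, runs, rng=None):
    """Place `runs` random deletions of `size` bp and count the signature lengths found."""
    rng = rng or random.Random()
    counts = Counter()
    for _ in range(runs):
        start = rng.randrange(0, len(seq) - size + 1)
        counts[ir_signature_length(seq, start, size)] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(1)
    # Placeholder input: a random 3-kb sequence, not a real RGA.
    demo_seq = "".join(rng.choice("ACGT") for _ in range(3000))
    for size in (300, 200, 100, 50, 20):
        counts = simulate_deletions(demo_seq, size, runs=10_000, rng=rng)
        print(size, sorted(counts.items()))
```

Counting matches on both sides of the breakpoints makes the score independent of where, within the region of homology, the deletion is placed.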

Acknowledgements

We thank the joint consortium of the John Innes Centre, the Biotechnology and Biological Research Council and INRA (Unité de Recherches en Génomique Végétale) for providing BAC clones of Triticum aestivum. This work was supported by the Swiss National Science Foundation (SNF) grant 3100A0-105620.
