Homoeologous nonreciprocal recombination in polyploid cotton


Author for correspondence:
Jonathan F. Wendel
Tel: +1 515 294 7172
Email: jfw@iastate.edu


  • Polyploid formation and processes that create partial genomic duplication generate redundant genomic information, whose fate is of particular interest to evolutionary biologists. Different processes can lead to diversification among duplicate genes, which may be counterbalanced by mechanisms that retard divergence, including gene conversion via nonreciprocal homoeologous exchange.
  • Here, we used genomic resources in diploid and allopolyploid cotton (Gossypium) to detect homoeologous single nucleotide polymorphisms provided by expressed sequence tags from G. arboreum (A genome), G. raimondii (D genome) and G. hirsutum (AD genome), allowing us to identify homoeo-single nucleotide polymorphism patterns indicative of potential homoeologous exchanges.
  • We estimated the proportion of contigs in G. hirsutum that have experienced nonreciprocal homoeologous exchanges since the origin of polyploid cotton 1–2 million years ago (Mya) to be between 1.8% and 1.9%. To address the question of when the intergenomic exchange occurred, we assayed six of the genes affected by homoeo-recombination in all five Gossypium allopolyploids using a phylogenetic approach.
  • This analysis revealed that nonreciprocal homoeologous exchanges have occurred throughout polyploid divergence and speciation, as opposed to saltationally with polyploid formation. In addition, some genomic regions show multiple patterns of homoeologous recombination among species.


Plant genomes offer unique opportunities to study the coexistence of duplicated gene copies within the same nucleus. These duplicated copies originate via several mechanisms that generate tandem and dispersed duplicates, including replication errors, retropositioning and other transposable element-mediated processes, and, perhaps most importantly, polyploidy, which causes the simultaneous duplication of all genes. These duplications set in motion a spectrum of evolutionary responses, caused by the opportunities afforded by genic redundancy and the accompanying alteration of functional and selective constraints on one or both duplicated genes. Thus, there has long been interest in the creative aspect of gene duplication (‘neofunctionalization’), with a focus on its potential role in the generation of evolutionary novelty and hence its relevance to adaptation and diversification (Stephens, 1951; Ohno, 1970; Doyle et al., 2008). More recently, a broadened perspective has emerged on the fuller set of evolutionary outcomes of gene duplication, including pseudogene formation and various kinds of sub- and neofunctionalization (Lynch & Conery, 2000; Conant & Wolfe, 2008; Flagel & Wendel, 2009). These diversification processes may be counterbalanced by mechanisms that retard the divergence of duplicates, including preservation or maintenance of redundant gene function as a result of strong selection, and gene conversion mediated by nonreciprocal homoeologous exchange. Although these topics have received relatively less attention than those that generate divergence between duplicates, they are particularly amenable to study in relatively recently formed allopolyploid species. 
In such cases, two diverged genomes that evolved independently in diploid lineages are reunited into a common nucleus at the time of polyploid formation, thus providing a natural context for direct comparisons of sequence, expression and function among genes in diploid antecedents and their allopolyploid derivatives.

In principle, recombination between homoeologous loci may arise during meiosis or mitosis in cell lineages destined to become germinal. Recent analysis of segregating populations using molecular markers in Brassica napus highlighted the presence of homoeologous exchanges (Parkin et al., 1995; Udall et al., 2005), leading to the observation in progenies of nonreciprocal transpositions between homoeologous genomes (discussed in Nicolas et al., 2008). Of particular relevance here are the two mechanisms leading to nonreciprocal exchanges of homologous chromatids: via crossovers (COs) and the subsequent segregation of one parental and one recombinant chromatid, or via noncrossovers (NCOs, previously named gene conversion events). The most commonly invoked mechanism leading to NCOs is the synthesis-dependent strand annealing pathway, rather than the double-strand break repair model (Szostak et al., 1983) after a double-strand break event (Paques & Haber, 1999; Hollingsworth & Brill, 2004), involving the displacement of one homologous chromatid with the other, leading to the synthesis of homologous DNA (reviewed in de Massy, 2003; Chen et al., 2007; Mezard et al., 2007; Martinez-Perez & Colaiacovo, 2009). In addition to simple gene conversion between homologs, gene conversion may also include the conversion of paralogs (in the case of homogenized gene families; Teshima & Innan, 2004), or even a form of diversifying conversion when a short segment of a novel allele or pseudogene (Chen et al., 2007) is used as a conversion template, potentially leading to a chimeric product. These same processes, classically modeled to describe the segregation of homologous chromatids, are applicable to homoeologous sequences in allopolyploids, and can both lead to the observation in their derivative progenies of nonreciprocal homoeologous recombination (NRHR).

Cotton (Gossypium) allopolyploids (AD genomes; 2n = 52) originated 1.5 million years ago (Mya) (Senchina et al., 2003; Wendel & Cronn, 2003) from hybridization between an A genome (2n = 26), African species much like modern G. arboreum and G. herbaceum, and a D genome (2n = 26), American species similar to modern G. raimondii. Presently, five AD-genome polyploids are widely recognized, although a sixth species has been proposed recently (Krapovickas & Seijo, 2008). These species (and their genomic designations) include G. hirsutum (AD1), or ‘Upland cotton’, accounting for 90% of the global cotton produced (Wendel & Cronn, 2003), G. barbadense (AD2), also cultivated and commonly referred to as ‘Pima’ or ‘Egyptian’ cotton, and three other exclusively wild polyploid species endemic to coastal and island habitats: G. tomentosum (AD3), G. mustelinum (AD4) and G. darwinii (AD5). The presence of nucleotide synapomorphies (shared derived character states) among these five allopolyploid species indicates that it is likely that they originated from a single polyploidization event (Small et al., 1998; Small & Wendel, 1999).

The fate of duplicated genes has previously been studied in allopolyploid cotton, but this was before the recent explosion in the availability of expressed sequence tags (ESTs) and other sequence information. Cronn et al. (1999) used PCR and the sequencing of isolated clones to study homoeologous copies of 16 low-copy-number nuclear genes. Phylogenetic analyses indicated that, for this limited sampling of the genome, single- or low-copy-number genes were evolving independently. By contrast, homoeologous interactions were described for rDNA internal transcribed spacer sequences (Wendel et al., 1995), which were inferred to have experienced bidirectional concerted evolution, meaning that four polyploid species exhibited only D-like copies, whereas one (G. mustelinum) contained mostly or exclusively A-like copies. Such homoeologous interactions among rDNA arrays have since been documented in a variety of other polyploid genera, e.g. Nicotiana (Kovarik et al., 2004, 2008) and Tragopogon (Kovarik et al., 2005).

Our goal in the present study was to utilize a vastly expanded EST database (Udall et al., 2006a; J. A. Udall, unpublished) to assess the extent of NRHR in allopolyploid Gossypium. These EST resources were developed using cDNA libraries from the two diploid progenitor genomes (A, D) of allopolyploid cotton and from multiple cDNA libraries derived from diverse tissues of allopolyploid cotton, thereby allowing homoeologous single nucleotide polymorphisms (homoeo-SNPs) to be identified in the allotetraploid genome and assigned to the appropriate diploid parent of origin (Udall et al., 2006b). By analyzing the homoeo-SNP patterns in the allopolyploid ESTs and contigs, putative homoeologous recombined regions were detected, whereby ESTs in the allopolyploid apparently derived from one parental EST displayed at least one homoeo-SNP signature from the other parent. An NRHR was tentatively scored when one homoeologous sequence displayed such recombination, whereas the other homoeolog displayed a parental copy. To validate these putative homoeo-recombination events, we cloned and resequenced a subset of them, revealing, in the process, several of the artifacts that could lead to the erroneous inference of homoeologous gene conversion, such as autapomorphies (derived character states unique to either one of the parents or the polyploids) and EST assembly errors. To ensure that recombinant ESTs were not caused by post-transcriptional events such as co-splicing of exons from homoeologous A- and D-genome mRNA copies, a subset of primers was designed to amplify both exonic and intronic regions. Finally, we used the phylogenetic context provided by the five natural allopolyploid species to address the question of whether gene conversion is a process that is most probable during the early stages of polyploid stabilization or, instead, is a phenomenon that arises more slowly on an evolutionary timescale.

Materials and Methods

EST assemblies

The detection of recombination events between homoeologous loci was facilitated using the published cotton EST assembly (Udall et al., 2006a) to which 1 750 000 new EST reads were added using a next-generation sequencing method (Roche-454 sequencing). This EST assembly is deep and extensive in terms of presumed gene content coverage, consisting of aligned sequences from G. arboreum (A genome), G. raimondii (D genome) and G. hirsutum (AD polyploid genome), generated from multiple cDNA libraries extracted from different organs, at different developmental stages, and in response to different biotic and abiotic stresses. In the assembly process, ESTs from both diploids and from the allotetraploid were included, creating a multispecies assembly, which allows for the detection of homoeo-SNPs. This new cotton EST database contains 19 874 contigs assembling A-, D- and AD-genome ESTs from which homoeo-SNPs can be detected. All EST sequences and the assembly are publicly available at http://www.agcol.arizona.edu/cgi-bin/pave/Cotton/index.cgi.

Detection of homoeologous recombinants

A Python-Biopython custom script was used to detect homoeologous recombinants. This program traverses alignments of putatively orthologous EST sequences from the diploid A and D genomes to detect diagnostic SNPs. SNPs in the diploids that have been transmitted from the time of polyploidization and have remained unaltered since polyploid formation comprise diagnostic homoeo-SNPs in the allopolyploid. We used a conservative approach to identify genome-specific SNPs by requiring that all available A- or D-genome diploid sequences be in complete agreement with regard to all diagnostic nucleotide positions. Inferred genome-specific SNPs were compared with ESTs from allopolyploid cotton to infer the fate and linkage of homoeo-SNPs. When an EST from an allopolyploid displayed both A- and D-genome homoeo-SNPs, it was flagged as a potential homoeologous recombinant. An example of the inference of homoeologous recombinants is illustrated in Fig. 1, which, for clarity, shows both types of unaltered homoeologs (At, Dt), as well as one that might be expected from a recombination event between homoeologous loci. A region presenting a conserved homoeologous sequence (At or Dt) and a homoeologous recombinant event was inferred as revealing an NRHR.
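The detection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: sequences are assumed to be pre-aligned, equal-length plain strings (the original worked on Biopython alignments), gaps and ambiguous bases are treated as missing data, and the function names `diagnostic_sites` and `classify_est` are hypothetical.

```python
def diagnostic_sites(a_seqs, d_seqs):
    """Return {position: (A base, D base)} for genome-specific SNPs.

    A column is diagnostic only if all A-genome reads agree, all D-genome
    reads agree, and the two bases differ (the conservative rule).
    Assumption: gaps ('-') and 'N' are ignored as missing data.
    """
    sites = {}
    for pos in range(len(a_seqs[0])):
        a_bases = {s[pos] for s in a_seqs} - {"-", "N"}
        d_bases = {s[pos] for s in d_seqs} - {"-", "N"}
        if len(a_bases) == 1 and len(d_bases) == 1 and a_bases != d_bases:
            sites[pos] = (a_bases.pop(), d_bases.pop())
    return sites


def classify_est(est, sites):
    """Label a polyploid EST and return its homoeo-SNP pattern."""
    pattern = ""
    for pos in sorted(sites):
        a_base, d_base = sites[pos]
        if est[pos] == a_base:
            pattern += "A"
        elif est[pos] == d_base:
            pattern += "D"
        # any other base (gap or novel substitution) is skipped
    if not pattern:
        return "uninformative", pattern
    if set(pattern) == {"A"}:
        return "At", pattern
    if set(pattern) == {"D"}:
        return "Dt", pattern
    return "recombinant", pattern


# Toy alignment (invented data): three diagnostic sites, as in Fig. 1.
a_reads = ["ACGTACGTAC", "ACGTACGTAC"]
d_reads = ["ATGTATGTAT", "ATGTATGTAT"]
sites = diagnostic_sites(a_reads, d_reads)
print(classify_est("ACGTATGTAC", sites))  # ('recombinant', 'ADA')
```

In this toy case the flagged EST carries A-genome bases at the first and last diagnostic sites and a D-genome base in the middle, which mirrors the NRHRt sequence illustrated in Fig. 1.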

Figure 1.

 Illustration of the detection of homoeologous single nucleotide polymorphisms (homoeo-SNPs) and nonreciprocal homoeologous recombination (NRHR) in diploid and allopolyploid cotton. Genome designations and sequences without gene conversion are as follows: A, Gossypium arboreum; D, G. raimondii; At, G. hirsutum, A-genome homoeolog; Dt, G. hirsutum, D-genome homoeolog; NRHRt, putative gene conversion event in G. hirsutum. The last sequence exhibits homoeo-SNPs corresponding to the A genome in the first and last SNP positions, but to the D genome in the middle SNP position. This is inferred to be an A homoeolog that has undergone NRHR in its middle section by the D homoeolog. 1, homoeo-SNP; 2, parental autapomorphy; 3, allopolyploid autapomorphy.

In addition to estimating the prevalence of NRHR throughout the cotton EST assembly, a second, more conservative script was used that considered the possible effects of autapomorphic substitutions (Fig. 1), in the 1–2 Myr since polyploid formation, as these polymorphisms can generate false positives with respect to NRHR estimation. The application of this script led to a decreased proportion of inferred NRHR events.

Validation of bioinformatically inferred homoeologous recombination events

We used traditional Sanger sequencing to verify putative NRHR events from genomic DNA. This work was viewed as essential in order to estimate the effect of EST assembly and other possible artifacts on inferences of recombination events between homoeologous loci. Accordingly, we selected a subset of these putative events and designed amplification primers for PCR, followed by amplicon cloning and sequencing.

Plant material

We sampled putative NRHR events from the species used in the EST sequencing, and representatives of the four other AD-genome polyploid species. These lines, which included G. arboreum (A genome), G. raimondii (D genome), G. hirsutum cv. Acala maxxa (AD1 genome), G. barbadense cv. Pima S6 (AD2 genome), G. tomentosum accession WT936 (AD3 genome), G. mustelinum accession 15C (AD4 genome) and G. darwinii accession PW45 (AD5 genome), were grown in the Pohl Conservatory at Iowa State University. In addition, for the purposes of phylogenetic reconstruction, we included the outgroup species Gossypioides kirkii (Seelanan et al., 1997). Leaf punches were collected and DNA was extracted following a CTAB protocol (Doyle, 1991). The concentration of the DNA solutions was measured and normalized to 50 ng μl−1.

Primer design, cloning and sequencing

Our goal was to design primers that would amplify genomic DNA from genes corresponding to EST sequences in different Gossypium species, and that would include either just exons or both exons and introns from regions displaying a putative homoeologous recombinant in G. hirsutum. To estimate exon–intron boundaries, we blasted our EST and contig sequences against the Arabidopsis genome (TAIR8 genomic sequences release, http://www.arabidopsis.org/), and aligned the cotton transcriptomic sequence to the closest Arabidopsis genomic sequence using Spidey (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html). Gaps in the resulting alignments were inferred to correspond to putative introns. Primers were designed employing Primer3 v. 1.1.4 (Rozen & Skaletsky, 2000), using default parameters, and were selected using a script to find primer pairs that allowed amplification in both the A and D genomes, either within a single exon or between two or more exons.
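The inference of putative introns from alignment gaps can be illustrated with a short sketch. It assumes the transcript row of a transcript-to-genome alignment is available as a string with `-` wherever the genomic sequence has no transcript counterpart. The function name `putative_introns` and the `min_len` threshold are hypothetical and are not taken from Spidey or from the authors' pipeline.

```python
import re


def putative_introns(aligned_cdna, min_len=20):
    """Return (transcript position, gap length) for each long gap run.

    The position is the number of transcript bases that precede the gap.
    Assumption: min_len is an arbitrary cutoff used to ignore short indels.
    """
    introns = []
    for match in re.finditer(r"-+", aligned_cdna):
        gap_len = match.end() - match.start()
        if gap_len >= min_len:
            bases_before = match.start() - aligned_cdna[:match.start()].count("-")
            introns.append((bases_before, gap_len))
    return introns


# Invented example: one 30-column gap (putative intron) and one 2-column indel.
row = "ATGGCC" + "-" * 30 + "GGTACC" + "--" + "TTGA"
print(putative_introns(row))  # [(6, 30)]
```

Primer pairs placed on either side of such a position would be expected to span an intron in genomic DNA, which is the situation exploited here to distinguish genomic recombination from post-transcriptional artifacts.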

Each PCR contained 0.5 μM of each primer, 200 μM of deoxynucleoside triphosphates, 1.5 mM of MgCl2, 1 × Taq DNA polymerase buffer (Invitrogen), 1 unit of Taq DNA polymerase (Invitrogen) and 50 ng of DNA, in a final volume of 20 μl. To ensure thermocycling conditions that limit PCR recombination events (Judo et al., 1998; Cronn et al., 2002), we used a limited number of cycles (20–25) and long elongation times. The amplification program was 94°C for 2 min, followed by 20–25 cycles at 94°C for 30 s, 60°C for 30 s and 72°C for 1 min 30 s, and a final extension at 72°C for 5 min. The primer pairs selected for each targeted region are presented in Table S1 (see Supporting Information).

Amplified products were visualized following agarose gel electrophoresis. Bands of the expected size were excised from the gel, purified using a Qiaquick gel extraction kit (Qiagen, Hilden, Germany) and cloned using the TOPO TA cloning kit (Invitrogen). From the 20 regions of interest displaying a putative recombination between homoeologs that were selected for validation, six were selected and cloned to check both the A-genome and D-genome EST sequence quality and to obtain parental intronic sequences. These six regions were also cloned from the four other AD-polyploid genomes (AD2, AD3, AD4 and AD5) and the outgroup Gossypioides kirkii. For each region of interest and in each species in which a gene was cloned, 10–12 clones were sequenced by Sanger sequencing at the Iowa State University core sequencing facility.

Sequence analysis

Sequencing chromatograms were checked visually and alignments were processed using ClustalW v.2 (Larkin et al., 2007). Each alignment and each SNP were verified visually. Open reading frames (ORFs) were detected using the ORF-Finder program (http://www.ncbi.nlm.nih.gov/projects/gorf/). Alignment editing was made using Jalview (Waterhouse et al., 2009). For each species, we required at least two independent sequences to remove potential PCR recombination artifacts (Judo et al., 1998; Cronn et al., 2002). Regions displaying recombinants were blasted against genomic databases, and separately to the Arabidopsis genome database (TAIR9 cDNA sequences release), to identify putative gene type and function.


Detection of nonreciprocal homoeologous recombination events using cotton homoeologous ESTs

The search for homoeologous recombinants allowed us to detect 485 putative events (disallowing parental autapomorphies) from the 19 874 contigs displaying A-, D- and AD-genome ESTs in the Cotton32 assembly. Of these 485 gene conversion candidates, 341 exhibited a single recombination breakpoint (which may be caused by either homoeologous CO and subsequent segregation of one parental copy and one recombinant, or NCO – gene conversion events), whereas 100 showed two recombination breakpoints, and 44 showed a pattern of three or more recombination breakpoints (Table 1). As our inferences rely on the use of EST assemblies, we may underestimate the actual number of recombination events found at the DNA level. The recombined regions displaying two recombination breakpoints ranged from 1 to 698 nucleotides in length (N50 < 50 bp). By considering a total of 356 contigs with only one recombinant copy (either D genome, A recombined copies or the reciprocal situation) in the AD1 genome (Table 1), we could conservatively extrapolate that NRHR had modified 1.8% of the cotton transcriptome. The 129 other candidate regions represented patterns for which 74 corresponded to reciprocal homoeologous recombinants (0.4%), 44 to nonreciprocal homoeologous recombinants for both homoeologous copies (0.2%), in which the two homoeologous copies displayed different recombination patterns, and five to recombinants from different regions within a contig (0.02%).
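The single, double and three-or-more breakpoint classes can be read directly from a homoeo-SNP pattern string by counting switches between A and D calls. The sketch below assumes patterns are written as in Table 3 (upper case for exonic sites, lower case for intronic sites). The function names are hypothetical and the code is an illustration rather than the authors' script.

```python
def count_breakpoints(pattern):
    """Count A/D switches along a homoeo-SNP pattern, ignoring letter case."""
    calls = [c for c in pattern.upper() if c in "AD"]
    return sum(1 for prev, cur in zip(calls, calls[1:]) if prev != cur)


def breakpoint_class(pattern):
    """Map a pattern to the breakpoint classes used in Table 1."""
    n = count_breakpoints(pattern)
    if n == 0:
        return "parental"
    if n == 1:
        return "single"
    if n == 2:
        return "double"
    return "three or more"


# Patterns copied from Table 3.
for pattern in ["AAAA", "DDDA", "DAD", "DADDDA"]:
    print(pattern, breakpoint_class(pattern))
```

Counting switches this way gives a minimum number of breakpoints, because exchanges that fall between two homoeo-SNPs of the same origin leave no trace in the pattern.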

Table 1.   Number of homoeologous recombinant copies detected per contig and patterns of recombination
 Number of homoeologous recombinant copies per contig
  1. 1, single; 2, double; ≥ 3, three or more recombination breakpoints.

Number of recombination breakpoint(s)
 ≥ 3 | 19 | 20 | 4 | 1 | 44

As EST assembly may greatly influence our approach to the detection of NRHR, we chose to calculate the proportion of NRHR events following allopolyploidy using a second more conservative assembly. Analysis of this assembly should minimize the number of EST assembly errors (e.g. paralogous copies assembled in the same contig, leading to the detection of a spurious homoeologous exchange). This reassembled EST database (Cotton41 assembly) contained 17 386 contigs from which homoeo-SNP signatures were inferred. Using this comparative approach, we estimated that the proportion of NRHRs in G. hirsutum (disallowing parental autapomorphies and taking only into consideration the contigs displaying nonreciprocal homoeologous exchanges affecting one homoeolog) was from 1.8% (357/19 874) to 1.9% (335/17 386). The overall proportion of homoeologous exchanges in the cotton transcriptome (considering all the contigs displaying recombinant ESTs) was from 2.4% (486/19 874) to 3.5% (608/17 386).
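The percentages quoted above follow from simple ratios of contig counts. As a worked check, the sketch below recomputes them from the numbers given in the text; the dictionary layout and the helper `percent` are illustrative choices only.

```python
def percent(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


# Contig counts as reported in the text for the two assemblies.
assemblies = {
    "Cotton32": {"contigs": 19874, "nrhr_one_homoeolog": 357, "all_recombinant": 486},
    "Cotton41": {"contigs": 17386, "nrhr_one_homoeolog": 335, "all_recombinant": 608},
}

for name, c in assemblies.items():
    nrhr = percent(c["nrhr_one_homoeolog"], c["contigs"])
    overall = percent(c["all_recombinant"], c["contigs"])
    print(f"{name}: NRHR {nrhr}%, all homoeologous exchanges {overall}%")
# Cotton32: NRHR 1.8%, all homoeologous exchanges 2.4%
# Cotton41: NRHR 1.9%, all homoeologous exchanges 3.5%
```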

Validation of putative nonreciprocal homoeologous recombination events

Because bioinformatic inferences of homoeologous exchanges are subject to several possible sources of error (e.g. assembly artifacts, paralogy, sequencing error, autapomorphic homoplasy, RNA editing), it is important to validate at least some of the suspected cases using an independent method. This validation process has value not only in refuting or verifying putative intergenomic recombination, but also in shedding light on the quality of the assembly and the sequence alignments. Here, we chose to validate putative NRHRs and parental autapomorphies for 20 regions (contigs), selected on the basis of several criteria. These criteria included the number of ESTs (both from the diploids and the polyploid) included in the alignment that provided homoeo-SNP evidence, and the number and distribution of putatively converted homoeo-SNPs. The selected contigs (Table 2) ranged in length from 900 to 2203 nucleotides, and each displayed an ORF. The BLAST results presented in Table 2 correspond to alignments with Arabidopsis thaliana, for which the best functional annotation is available. These contigs corresponded to various kinds of genes, some (Cotton16_07872_01, Cotton16_00080_04, Cotton16_14915_01, Cotton16_00255_03) displaying best homology to members of larger gene families (α-tubulin, β-tubulin, ankyrins and strubbelig-receptors, respectively), and others having homologies to single- or low-copy-number genes in Arabidopsis. The primer design strategy used allowed the amplification of intronic regions for 11 of the 20 targeted genomic regions, one containing more than one intron (Cotton16_00255_03, with three introns).

Table 2.   Description of the regions selected to validate putative nonreciprocal homoeologous recombination (NRHR)
Contig | Length (nt) | ORF (start–end) | Intron start | Intron length | Start point (FP) | End point (RP) | cDNA length | Locus | Symbol | Gene | E value
Cotton16_00047_02 | 1467 | 79–1248 |  |  | 966 | 1381 | 437 | AT5G13930.1 | CHS | Naringenin-chalcone synthase | 0.0
Cotton16_00080_04 | 1607 | 64–1407 | 729 | 90 | 474 | 874 | 420 | AT5G23860.1 | TUB8 | Tubulin β 8 | 0.0
Cotton16_00255_03 | 1219 | 220–1215 | 322–462–534 | 425–93–528 | 305 | 553 | 268 | AT4G03390.1 | SRF3 | Strubbelig-receptor family 3 | 6E-110
Cotton16_01545_01 | 972 | 400–969 |  |  | 126 | 546 | 421 | AT1G64142.1 | CPuORF23 | Conserved peptide upstream open reading frame 23 | 8E-21
Cotton16_01609_01 | 1303 | 82–1299 | 532 | 80 | 385 | 884 | 500 | AT2G26890.1 | GRV2 | Gravitropism defective 2/heat shock protein binding | 0.0
Cotton16_01666_01 | 985 | 99–758 |  |  | 696 | 945 | 250 | AT2G33120.1 | SAR1 | Synaptobrevin-related protein 1 | 5E-115
Cotton16_01765_02 | 1816 | 277–1629 |  |  | 1218 | 1647 | 430 | AT3G49260.1 | Iqd21 | Iqd21 (IQ-domain 21); calmodulin binding | 5E-41
Cotton16_01955_02 | 1001 | 51–999 | 216 | 341 | 140 | 623 | 484 | AT1G80350.1 | ERH3 | Ectopic root hair 3 | 8E-137
Cotton16_02054_02 | 1082 | 39–1016 |  |  | 470 | 955 | 486 | AT3G16150.1 |  | l-asparaginase, putative | 0.0
Cotton16_07872_01 | 1581 | 94–1446 |  |  | 872 | 1339 | 468 | AT1G04820.1 | TUA4 | Tubulin α 4 | 0.0
Cotton16_10388_01 | 900 | 86–899 | 392 | 132 | 280 | 466 | 187 | AT1G31780.1 |  | Unknown expressed protein | 3E-144
Cotton16_13240_01 | 1543 | 90–1335 |  |  | 1068 | 1552 | 485 | AT1G79340.1 | AtMC4 | Metacaspase 4; cysteine-type peptidase | 3E-157
Cotton16_13273_01 | 2203 | 68–1999 | 1711 | 288 | 1567 | 1981 | 415 | AT5G03340.1 | CDC48 | Cell division cycle protein 48, putative | 0.0
Cotton16_14915_01 | 1222 | 149–1135 | 825 | 335 | 546 | 1000 | 455 | AT5G65860.1 |  | Ankyrin repeat family protein | 4E-172
Cotton16_27501_01 | 1912 | 621–1649 |  |  | 1002 | 1278 | 277 | AT4G28300.1 |  | Hydroxyproline-rich glycoprotein family protein | 2E-43
Cotton16_34868_01 | 1559 | 146–1240 | 962 | 100 | 939 | 1432 | 494 | AT4G34160.1 | CYCD3;1 | Cyclin-dependent protein kinase regulator | 2E-91
Cotton16_34881_01 | 1183 | 191–889 | 479 | 71 | 192 | 626 | 435 | AT5G24890.1 |  | Unknown protein | 2E-28
Cotton28_21020 | 2171 | 239–790 | 1674 | 241 | 1637 | 2133 | 497 | AT1G67840.1 | CSK | Chloroplast sensor kinase | 1E-172
Cotton28_21491 | 1039 | 22–630 |  |  | 434 | 909 | 476 | AT3G26935.1 |  | Zinc finger (DHHC type) family protein | 2E-78
Cotton28_24190 | 1408 | 1637–1903 | 1043 | 360 | 979 | 1467 | 489 | AT2G20890.1 | PSB29 | Photosystem II binding | 3E-60

  1. FP, forward primer; ORF, open reading frame; RP, reverse primer. Locus, Symbol, Gene and E value are BLAST results against Arabidopsis cDNAs.

For each sampled genomic region, both A- and D-genome homoeologous copies (AD1-A and AD1-D) were recovered from G. hirsutum. Inclusion into our final validated sequence dataset required that both homoeologous copies be detected, and that each be present at least twice. Cases in which a sequence was obtained only once were considered to be potential PCR recombination or sequencing artifacts, and were removed from the analysis.
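The replication rule described above (each retained sequence must be recovered at least twice) amounts to a simple count filter over the sequenced clones. A minimal sketch, assuming each clone is reduced to a string such as its homoeo-SNP pattern; the function name `validated_sequences` and the clone counts in the example are invented for illustration.

```python
from collections import Counter


def validated_sequences(clones, min_copies=2):
    """Keep sequences recovered from at least `min_copies` clones.

    Singletons are treated as potential PCR recombination or sequencing
    artifacts and are dropped.
    """
    counts = Counter(clones)
    return {seq: n for seq, n in counts.items() if n >= min_copies}


# Hypothetical clone set for one region: two replicated copies and one singleton.
clones = ["AAAA"] * 5 + ["DDDA"] * 4 + ["ADDA"]
print(validated_sequences(clones))  # {'AAAA': 5, 'DDDA': 4}
```

With 10–12 clones sequenced per region and species, a chimeric molecule generated during PCR is unlikely to be recovered twice with identical breakpoints, which is the rationale for the threshold.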

Based on these de novo genomic sequences, the bioinformatically detected NRHR events were confirmed for 14 of the 20 regions analyzed (Table 3). Two of the remaining six contigs (Cotton16_01545_01 and Cotton28_24190) were not validated as having experienced a homoeologous recombination, but, instead, were shown to be mistaken inferences arising from the presence of paralogous or highly pseudogenized copies in the AD1 genome that were erroneously stitched together during EST assembly (these contigs showed two clearly distinct pools of sequences). The remaining four contigs (Cotton16_00047_02, Cotton16_01765_02, Cotton16_02054_02, Cotton28_21020) did not agree with the sequence of their respective contig, and may represent situations in which it was not possible to rule out autapomorphies in the diploid parents after polyploid formation c. 1.5 Mya. That is, a homoeo-SNP between modern A- and D-genome diploids may, in principle, reflect true homoeo-SNPs between the A and D genomes at the time of polyploid formation. These cases could be used as indicators or markers of NRHR in modern allopolyploid cotton. Alternatively, this homoeo-SNP may reflect an autapomorphic substitution after polyploid formation (Fig. 1), leading to an erroneous inference of intergenomic recombination (a false positive). Our detection method discriminated between these two types of SNP evidence.

Table 3.   Description of the sequences cloned from regions displaying putative homoeologous recombination events in Gossypium hirsutum
Contig | Genome of origin | Type of event | Homoeo-SNP origin | Rec breakpoint positions | Nonsynonymous conversions (SNP position in ORF)
Cotton16_00047_02 | AD1 A | Autapomorphy |  |  | 
 | AD1 D | Autapomorphy |  |  | 
Cotton16_00080_04 | AD1 A |  | AAAA |  | 
 | AD1 D | Single recBP | DDDA | 705 | Q to L (689)
Cotton16_00255_03 | AD1 A | Double recBP | aaaaaaaaaaAAAaaaaaddddddaaaaaa | 1217–1387 | 
 | AD1 D | 3 or more recBP | aaaaaaaaddDDDdaaaaddddddaaaaaa | 562–919–1217–1387 | 
Cotton16_01545_01 | AD1 A | Paralogy |  |  | 
 | AD1 D | Paralogy |  |  | 
Cotton16_01609_01 | AD1 A | Single recBP | AaaAD | 928 | 
 | AD1 D |  | DddDD | - | 
Cotton16_01666_01 | AD1 A |  | AAAAAA |  | 
 | AD1 D | Single recBP | AAADDD | 765 | V to I (616)
Cotton16_01765_02 | AD1 A | Autapomorphy |  |  | 
 | AD1 D | Autapomorphy |  |  | 
Cotton16_01955_02 | AD1 A | Single recBP | AaaaaadDDD | 389 | 
 | AD1 D | Single recBP | DddddddDDA | 705 | 
Cotton16_02054_02 | AD1 A | Autapomorphy |  |  | 
 | AD1 D | Autapomorphy |  |  | 
Cotton16_07872_01 | AD1 A |  | AAA |  | 
 | AD1 D | Double recBP | DAD | 1025–1122 | 
Cotton16_10388_01 | AD1 A | Single recBP | AAAdddd | 363 | 
 | AD1 D | Single recBP | DDDaaaa |  | 
Cotton16_13240_01 | AD1 A | Single recBP | DAAAA | 1339 | 
Cotton16_13273_01 | AD1 A |  | AAAAAA |  | 
 | AD1 D | 3 or more recBP | DADDDA | 1676–2012–2158 | 
Cotton16_14915_01 | AD1 A |  | AAaaaaaaaaaaA |  | 
 | AD1 D | Single recBP | AAddddddddddD | 859 | 
Cotton16_27501_01 | AD1 A | 3 or more recBP | DDDDDDDADAAD | 1100–1155–1172–1199 | 
 | AD1 D | 3 or more recBP | DDDDDDDADADD | 1100–1155–1172–1178 | R to P (536); Y to D (559)
Cotton16_34868_01 | AD1 A |  | AA |  | 
 | AD1 D | Single recBP | DA | 1152 | 
Cotton16_34881_01 | AD1 A | Single recBP | AAAADDddD | 351 | P to S (88)
Cotton28_21020 | AD1 A | Autapomorphy |  |  | 
 | AD1 D | Autapomorphy |  |  | 
Cotton28_21491 | AD1 A | Double recBP | AAAADA | 751–766 | L to P (725)
 | AD1 D | Double recBP | DDDDAD | 751–766 | P to L (725)
Cotton28_24190 | AD1 A | Paralogy |  |  | 
 | AD1 D | Paralogy |  |  | 

  1. Homoeo-SNPs in capital letters refer to exonic sites, whereas those in lower case letters denote intronic sites.

  2. Homoeo-SNP, homoeologous SNP origin; ORF, open reading frame; recBP, recombination breakpoint.

Intronic sequences were obtained for six regions displaying confirmed recombination events between homoeologous copies (Cotton16_00255_03, Cotton16_01609_01, Cotton16_01955_02, Cotton16_10388_01, Cotton16_14915_01 and Cotton16_34881_01). The intron and intron–exon data provided two novel insights into illegitimate recombination within polyploid cotton. First, they confirmed that these events happened at the DNA level, with the intronic sequences being part of the recombined regions, as opposed to the alternative explanation, whereby the EST evidence reflects post-transcriptional interactions (e.g. RNA editing). Second, the intronic sequences revealed a case in which an NRHR happened within an intron (Cotton16_00255_03; positions 1217–1387, Table 3).

Of the 14 regions in which homoeologous exchanges were confirmed, 10 regions displayed NRHR, six being D homoeologs that had experienced conversion of a region by the A homoeolog, and the four others exhibiting the reciprocal situation. Finally, four additional regions exhibited more complex patterns, reflecting a history of more than two recombination breakpoints. Two of these regions involved NRHR patterns among the two copies detected: the Cotton16_00255_03 region had two copies, one displaying an intronic recombination pattern and the other displaying two recombinant regions, and the Cotton16_01955_02 region presented a suspected A-genome copy with an intronic recombination breakpoint (position 389), whereas the D-genome copy displayed an exonic recombination breakpoint (position 705). The two others (Cotton16_10388_01 and Cotton28_21491) involved the presence of reciprocal A-converted and D-converted copies that could originate from reciprocal double COs.

A logical first step in assessing the likelihood that homoeologous exchanges might have physiological relevance or functional consequences is to ask whether the conversions observed lead to amino acid alterations in the predicted translation products. In our dataset, such amino acid replacements were observed for six regions (nonsynonymous conversions in Table 3 for Cotton16_00080_04, Cotton16_01666_01, Cotton16_01765_02, Cotton16_27501_01, Cotton16_34881_01, Cotton28_21491).

Timing of homoeologous recombination following allopolyploid formation

In an effort to address the question of whether homoeologous recombination is a phenomenon restricted to the early stages of allopolyploid genome stabilization or, instead, occurs more evenly during and after speciation, we assayed for the presence of putative homoeologous recombinants by amplifying, cloning and sequencing both homoeologs from the four remaining allopolyploid species. In addition, we also resequenced the homologous regions from both parental diploids, confirming the presence of only one allelic type for each region in each parental species, and confirming the homoeo-SNPs detected in the EST assembly. For each of these six regions, all A- and D-genome homoeologs were recovered, with the single exception of the AD3 D homoeolog of the Cotton16_01955_02 region.

For all six regions studied in this broader sampling of species, at least one other allopolyploid exhibited a homoeologous recombinant (Table 4). To place these in a phylogenetic context, we superimposed the placement of these putative events onto a cladogram (Fig. 2) that depicts organismal relationships (Wendel & Cronn, 2003), numbering the six contigs I–VI, as shown in Table 4.

Table 4. Sequences cloned from six selected regions displaying a validated homoeologous recombination event in the four other AD-genome allopolyploids

Contig | Genome origin | Type of event | Homoeo-SNP origin pattern | Rec. breakpoint positions | Nonsynonymous conversions (position in ORF)
Cotton16_00255_03 | AD3 D | Single recBP | ddddddddddDDDddaaaaaaaaaaaaaaa | 1088 |
Cotton16_01609_01 | AD2 D | Single recBP | AddDD | 427 | A to V (347)
Cotton16_01609_01 | AD3 D | Single recBP | AddDD | 427 | A to V (347)
Cotton16_01955_02 | AD2 A | Single recBP | DdaaaaaAAA | 239 |
Cotton16_01955_02 | AD3 D | Missing copy | | |
Cotton16_10388_01 | AD2 A | Single recBP | DDDaaaa | 363 | G to D (272)
Cotton16_14915_01 | AD3 D | Single recBP | AAaaaaddddddD | 926 |
Cotton16_34881_01 | AD5 A | Double recBP | AAAADDaaA | 351–438 |

  1. Taxon designations are as follows: AD2, Gossypium barbadense; AD3, G. tomentosum; AD4, G. mustelinum; AD5, G. darwinii. Homoeo-SNPs in capital letters refer to exonic sites, whereas those in lower case letters denote intronic sites.

  2. Homoeo-SNP, homoeologous SNP origin; ORF, open reading frame; RecBP, recombination breakpoint.

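
The relationship between the homoeo-SNP origin patterns and the event types in Table 4 can be illustrated with a minimal sketch. It assumes, for illustration only, that each change of genome origin (A to D or D to A) along a pattern marks one recombination breakpoint, and that letter case (exonic versus intronic site) is irrelevant to that count. The function names `count_breakpoints` and `classify_event` are hypothetical and are not part of the original analysis pipeline.

```python
from itertools import groupby


def count_breakpoints(pattern: str) -> int:
    """Count changes of genome origin (A <-> D) along a homoeo-SNP pattern.

    Assumption: case only encodes exon (upper) vs intron (lower), so the
    pattern is upper-cased before consecutive identical origins are merged.
    """
    origins = pattern.upper()
    if not origins or set(origins) - {"A", "D"}:
        raise ValueError(f"pattern must contain only A/a/D/d: {pattern!r}")
    # groupby collapses each run of identical letters into one block.
    blocks = [origin for origin, _ in groupby(origins)]
    return len(blocks) - 1


def classify_event(pattern: str) -> str:
    """Label a pattern in the style of the 'Type of event' column."""
    n = count_breakpoints(pattern)
    labels = {0: "No recBP", 1: "Single recBP", 2: "Double recBP"}
    return labels.get(n, f"{n} recBP")


# Patterns transcribed from Table 4 (contig, genome origin) -> pattern.
table4_patterns = {
    ("Cotton16_00255_03", "AD3 D"): "ddddddddddDDDddaaaaaaaaaaaaaaa",
    ("Cotton16_01609_01", "AD2 D"): "AddDD",
    ("Cotton16_01609_01", "AD3 D"): "AddDD",
    ("Cotton16_01955_02", "AD2 A"): "DdaaaaaAAA",
    ("Cotton16_10388_01", "AD2 A"): "DDDaaaa",
    ("Cotton16_14915_01", "AD3 D"): "AAaaaaddddddD",
    ("Cotton16_34881_01", "AD5 A"): "AAAADDaaA",
}

for (contig, origin), pattern in table4_patterns.items():
    print(contig, origin, classify_event(pattern))
```

Under these assumptions, the sketch labels every Table 4 pattern as a single-breakpoint event except Cotton16_34881_01 in AD5, which has two changes of origin.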
Figure 2. Phylogenetic relationships among the five AD-genome polyploid species (A-genome homoeologs in grey; D-genome homoeologs in black) and homoeologous recombination events and their patterns detected in six regions and described in Table 4 (I, Cotton16_00255_03; II, Cotton16_01609_01; III, Cotton16_01955_02; IV, Cotton16_10388_01; V, Cotton16_14915_01; VI, Cotton16_34881_01). White and grey boxes denote A-genome and D-genome single nucleotide polymorphisms (SNPs), respectively, whereas the box size denotes location (large, exon; small, intron).

Perhaps the most striking aspect of this expanded analysis was that not a single homoeologous recombinant detected in G. hirsutum appeared to be shared with the other species. Instead, all recombinations between homoeologous regions corresponded to homoeo-SNP patterns that were different from those originally detected in the G. hirsutum genome. Eight recombinant copies were judged to be specific to G. hirsutum, whereas three were specific to G. tomentosum (AD3), its closest relative. For these regions, only two recombinants were specific to G. barbadense (AD2), one was restricted to G. darwinii, and not a single novel recombinant was detected in G. mustelinum. It is important to note that these data do not imply any overall difference in the prevalence of recombination between homoeologous loci in these species, as there is a strong ascertainment bias in the regions chosen (i.e. they were selected on the basis of evidence for homoeologous recombination in G. hirsutum). Yet, the data do underscore the fact that none of the regions selected for phylogenetic analysis diagnosed an ancient event. In fact, only a single copy was phylogenetically mapped as shared (IIa, Fig. 2). This event, shared by AD2 and AD3, may be phylogenetically complex, in that it occupies a position basal to their common ancestor. Hence, the AD1 recombinant copy may be derivative of this earlier recombination event.
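
As a rough illustration of how shared and species-specific recombinants can be told apart, the sketch below groups the Table 4 entries by contig, homoeolog and homoeo-SNP pattern. It assumes that an identical pattern in the same homoeolog of two species is a proxy for a shared event; the inference in the text rests on phylogenetic mapping, which this simple grouping does not reproduce. The G. hirsutum patterns (Table 3) are not included, and the names `TABLE4` and `shared_events` are hypothetical.

```python
from collections import defaultdict

# (contig, species, homoeolog, pattern), transcribed from Table 4.
# The AD3 D copy of Cotton16_01955_02 is missing and therefore omitted.
TABLE4 = [
    ("Cotton16_00255_03", "AD3", "D", "ddddddddddDDDddaaaaaaaaaaaaaaa"),
    ("Cotton16_01609_01", "AD2", "D", "AddDD"),
    ("Cotton16_01609_01", "AD3", "D", "AddDD"),
    ("Cotton16_01955_02", "AD2", "A", "DdaaaaaAAA"),
    ("Cotton16_10388_01", "AD2", "A", "DDDaaaa"),
    ("Cotton16_14915_01", "AD3", "D", "AAaaaaddddddD"),
    ("Cotton16_34881_01", "AD5", "A", "AAAADDaaA"),
]


def shared_events(rows):
    """Return recombinant patterns carried by more than one species.

    Assumption: the same pattern in the same homoeolog of the same contig
    is treated as one candidate shared event.
    """
    carriers = defaultdict(list)
    for contig, species, homoeolog, pattern in rows:
        carriers[(contig, homoeolog, pattern)].append(species)
    return {key: spp for key, spp in carriers.items() if len(spp) > 1}


for (contig, homoeolog, pattern), species in shared_events(TABLE4).items():
    print(contig, homoeolog, pattern, "shared by", ", ".join(species))
```

With the Table 4 data, the only pattern carried by more than one species is AddDD in the D homoeolog of Cotton16_01609_01 (AD2 and AD3), which corresponds to the event mapped as IIa in Fig. 2.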


Polyploid speciation is a prominent mode of cladogenesis in flowering plants, and one that is now widely recognized to variously involve multiple non-Mendelian processes, including the loss of DNA (Ozkan et al., 2001; Shaked et al., 2001; Gaeta et al., 2007; Grover et al., 2008; Tate et al., 2009), chromosomal rearrangements (Pires et al., 2004; Pontes et al., 2004; Udall et al., 2005), repatterning of epigenetic marks (Madlung et al., 2002; Liu & Wendel, 2003; Wang et al., 2004; Rapp & Wendel, 2005; Salmon et al., 2005; Lukens et al., 2006; Chen, 2007) and biased gene expression (Adams et al., 2003, 2004; Albertin et al., 2006; Hegarty et al., 2006; Tate et al., 2006; Wang et al., 2006; Gaeta et al., 2007; Flagel et al., 2008; Ha et al., 2009; Rapp et al., 2009). One aspect of polyploidy that has received relatively little attention is the possibility of nonreciprocal homoeologous recombination throughout the genome, with only rDNA loci being commonly surveyed in this respect (Wendel et al., 1995; Kovarik et al., 2008).

Our analysis of homoeologous recombinants in the AD-genome polyploid cottons allowed us to estimate the proportion of both homoeologous reciprocal and nonreciprocal recombinants in the cotton transcriptome. We took into consideration several artifacts that could affect this inference from EST assemblies (autapomorphies, EST assembly artifacts and post-transcriptional recombinations) by validating recombination events using de novo genomic DNA sequencing and intronic information. One other limitation of the use of EST assemblies to detect gene conversion events lies in the fact that they represent only the expressed part of the genome (thus missing all nonexpressed conversions or recombinants).

The recombinant homoeologous copies detected correspond to the formation of chimeric copies that represent, in some cases, validated nonsynonymous recombinants when compared with the parental copies. This result highlights recombination between homoeologous copies as a putative source of novelty for duplicated genes (in contrast with the concerted evolution observed in specific gene families displaying gene conversion). In addition, the size of NRHR events, ranging from a few nucleotides to 700 nucleotides in length (for copies displaying two recombination breakpoints), is consistent with a previous report in rice (Oryza sativa indica), in which gene-converted regions had an average length of 130 nucleotides, ranging in size from 4 bp to more than 1 kb (Xu et al., 2008). This latter observation, superimposed on the fact that reciprocal homoeologous recombination has not been observed in cotton and, indeed, is not evident in their genetic maps (Endrizzi et al., 1985; Rong et al., 2004), suggests that most of the NRHRs detected are a result of gene conversion events (NCOs) rather than COs between homoeologous chromatids. Wang et al. (2009) have also recently reported a comparison of rice and Sorghum genomes using phylogenetic analysis, which revealed gene conversion among paralogs within each species, leading to an accelerated divergence for each genome. The estimates of gene conversion among duplicated genes in rice and sorghum (5.5% and 4.1%, respectively) and in mammals (1–3%; McGrath et al., 2009) appear to be in accordance with the estimate reported here, although the previous estimates involve paralogous copies in a diploid, whereas our study focused on homoeologs in a polyploid.

Homoeologous exchanges in the six regions identified in G. hirsutum were also detected in at least one of the other AD-genome polyploid species. It also appeared that the number of recombinant copies in the same region decreased in correspondence with the phylogenetic distance from G. hirsutum. This could reflect the gradual (as opposed to episodic) accumulation of homoeologous exchanges when coupled with speciation among these five polyploid species over the last 1.5 Myr. Homoeologous pairing during meiosis is suspected to be avoided in the first generations following polyploid formation (Ramsey & Schemske, 2002), because it can lead to infertility as a result of unbalanced gamete formation following meiosis. It is also clear, however, that some recombinations of homoeologous loci may in fact be part of polyploid stabilization in some lineages (Udall et al., 2005). Our finding of a gradual accumulation of homoeologous exchanges is not surprising, as homoeologous pairing is likely to be rarely tolerated. In addition, the parental genomes of the allopolyploid cottons differ significantly in size (by a factor of approximately two; Hendrix & Stewart, 2005), which has been shown to be mainly a result of transposable element insertions in the A genome (Hawkins et al., 2006; Grover et al., 2007). This large size discrepancy probably contributes to the difficulty in homoeologous pairing, and could further explain the late arrival and gradual accumulation of homoeologous exchanges during the evolutionary history of allopolyploid cotton. Although it is true that homoeologous pairing is rarely observed among modern stabilized cotton allopolyploids, classic cytogenetic experiments have demonstrated that it is more frequent in Gossypium neopolyploids (reviewed in Endrizzi et al., 1985). Thus, one might expect that temporal analysis of recombination of homoeologous loci in Gossypium would have led to the discovery of primarily basal, shared events.
Yet, our data do not show this effect. Instead, it appears that homoeologous recombinations have arisen sporadically, and have continued even after allopolyploid speciation. This observation is in agreement with gene conversion events rather than reciprocal homoeologous recombination, as NCOs are resolved before chiasmatic associations of homoeologous chromosomes during meiosis (Allers & Lichten, 2001).

Clearly, more work is needed on the pace and timing of such homoeologous exchanges over a longer timescale during and after polyploid formation. The most informative studies are likely to involve a combination of synthetic hybrids and allopolyploids, as well as natural allopolyploids, for which a well-understood phylogenetic framework exists. These types of studies may ultimately permit a model to be developed that would tie together homoeologous pairing over the generations, NRHR leading to gene conversion and the process of polyploid stabilization. Beyond these experiments, further research is necessary to evaluate the provocative speculation that low sequence divergence between duplicated copies after polyploidization and conservation of duplicate gene expression may be causally connected to homoeologous gene conversion (Chapman et al., 2006). Finally, and in spite of the generally homogenizing consequences of gene conversion, it remains to be demonstrated that the process leads to evolutionary novelty, either by generating chimeric gene copies, or by fixing across two homoeologs favored amino acids that previously were restricted to only one of the two gene copies.

In this study, we found NRHRs in a small percentage of the cotton genome. It is not clear whether the function of these genes is related to the frequency of gene conversion. Continued sequencing of ESTs and the cotton genome will improve our ability to relate the significance of gene conversion to genome evolution and adaptation.


We gratefully acknowledge the National Research Initiative of the USDA Cooperative State Research, Education and Extension Service (2005-35301-15700 to J.A.U. and J.F.W.), the National Science Foundation Plant Genome Research Program (0638418 to J.F.W.) and the Department of Ecology, Evolution, and Organismal Biology for a Graduate Research Award to L.F. We thank K. Grupp and A. Balu for technical assistance.