The fate of duplicated genes in a polyploid plant genome




Polyploidy is generally not tolerated in animals, but is widespread in plant genomes and may result in extensive genetic redundancy. The fate of duplicated genes is poorly understood, both functionally and evolutionarily. Soybean (Glycine max L.) has undergone two separate polyploidy events (13 and 59 million years ago) that have resulted in 75% of its genes being present in multiple copies. It therefore constitutes a good model to study the impact of whole-genome duplication on gene expression. Using RNA-seq, we tested the functional fate of a set of approximately 18 000 duplicated genes. Across seven tissues tested, approximately 50% of paralogs were differentially expressed and thus had undergone expression sub-functionalization. Based on gene ontology and expression data, our analysis also revealed that only a small proportion of the duplicated genes have been neo-functionalized or non-functionalized. In addition, duplicated genes were often found in collinear blocks, and several blocks of duplicated genes were co-regulated, suggesting some type of epigenetic or positional regulation. We also found that transcription factors and ribosomal protein genes were differentially expressed in many tissues, suggesting that the main consequence of polyploidy in soybean may be at the regulatory level.


Angiosperms represent the largest group of plants, with 350 000 known taxa (Van de Peer et al., 2009). They underwent diversification in the mid-Cretaceous period (i.e. 100 million years ago, MYA), and, in contrast to pteridophytes and gymnosperms, maintained a high radiation rate over a long period of time (Lidgard and Crane, 1988; Crane and Lidgard, 1989; Crepet and Niklas, 2009). Defined by Darwin as an ‘abominable mystery’, this prominence of flowering plants on earth has been extensively studied. Recent theories suggest that carpel evolution, double fertilization and flower development, as well as additional innovations such as reduced cost of seed production and short generation time, contributed to the explosive success of angiosperms (Stuessy, 2004; Lord and Westoby, 2011). Because many genes involved in reproduction and flower development were duplicated before the monocot/dicot radiation (Jiao et al., 2011), whole-genome duplications (WGDs) are believed to be at the origin of angiosperm radiation (De Bodt et al., 2005). Polyploidy, or WGD, is a process that recurrently shaped eukaryotic genomes. Although, in animals, this process is mainly restricted to amphibians and fish (Otto and Whitton, 2000), polyploidy has played a major evolutionary role in plants. Complete genome analyses strongly support the conclusion that, in addition to lineage-specific WGDs, a triplication (γ) and two WGD (ρ and σ), respectively, occurred in eudicots and monocots (Vision, 2000; Jaillon et al., 2007; Lyons et al., 2008; Tang et al., 2010). Recent work also demonstrated that two rounds of WGD occurred 319 and 192 MYA, respectively, shortly before seed plant and angiosperm radiation (Jiao et al., 2011).

Models have been proposed in which duplicated genes are pseudogenized (loss of regulatory sub-function; Moore and Purugganan, 2005), sub-functionalized (partitioning of the function between daughter copies; Cusack and Wolfe, 2007) and/or neo-functionalized (functional diversification; Blanc and Wolfe, 2004). These provide testable hypotheses suggesting that neo-functionalized gene copies undergo positive selection (Ka/Ks > 1), whereas sub-functionalized gene copies, showing transcriptional divergence across tissues or cell types, are expected to undergo purifying selection (Ka/Ks < 1). Additionally, because redundancy allows one of the copies to accumulate mutations without an immediate effect on the fitness of the organism, polyploidy may give rise to new allelic variants (Feldman and Levy, 2009), gene family expansion (Zahn et al., 2005; Veron et al., 2007) and changes in gene expression (Levy and Feldman, 2004; Rapp et al., 2009; Wang et al., 2012). Moreover, polyploidy has been shown to be associated with methylation changes (Yu et al., 2010; Hegarty et al., 2011; Kenan-Eichler et al., 2011; Zhao et al., 2011) and potentially activation of transposable elements (Kenan-Eichler et al., 2011). These genetic and epigenetic modifications undoubtedly lead to evolution of new traits and thus increased adaptability. This may explain why some newly established polyploids have a competitive advantage compared to their diploid parents. As an example, the recent allopolyploid Spartina anglica, which occurred in the late 19th century, rapidly invaded the British and French coasts and has gradually replaced its diploid parents (Baumel et al., 2001, 2002). Fawcett et al. (2009) also argued that this capacity to adapt and colonize new niches may have been responsible for angiosperm radiation and their survival during the Cretaceous–Tertiary (KT) crisis.
Nevertheless, after duplication, many polyploids gradually return to diploid-like chromosome pairing through accumulation of rearrangements and sequence deletions (Blanc and Wolfe, 2004; Levy and Feldman, 2004; Feldman and Levy, 2009; Hufton and Panopoulou, 2009; Tate et al., 2009; Buggs et al., 2010; Xiong et al., 2011), a process referred to as diploidization. However, paleopolyploid genomes may retain several copies of the same gene (Tate et al., 2009), and thus provide an opportunity to study the fate of duplicated genes.

Soybean, Glycine max, is part of the legume family, which is one of the largest families in flowering plants, with more than 20 000 species (Doyle and Luckow, 2003). Due to its agricultural prominence, the Glycine max genome has been fully sequenced (Schmutz et al., 2010), and provides an opportunity to explore the molecular effects of genome duplication, as it has experienced at least two polyploid events, the most ancient being 59 MYA. The more recent event was probably an allopolyploid (Gill et al., 2009), and is the result of a merger of two genomes that diverged approximately 13 MYA and reunited approximately 5–10 MYA when the genus was formed, as this polyploidy event is found only in the genus Glycine (Doyle et al., 2003; Straub et al., 2006; Innes et al., 2008; Stefanović et al., 2009). These duplication events result in a genome with approximately 46 430 ‘high-confidence’ genes, of which 75% are present as more than one copy. We took advantage of the availability of these genomic resources to study the evolution of duplicated genes. It is likely that differential expression of duplicates among tissues varies and contributes to phenotypic variation in polyploids (Buggs et al., 2010). However, relatively few studies have investigated this process (Adams et al., 2003; Flagel et al., 2008; Chaudhary et al., 2009; Buggs et al., 2010; Flagel and Wendel, 2010). Using RNA-seq, we tested for transcriptional divergence of 17 500 gene pairs using expression data from seven tissues. Focusing on genes present in only two copies (8910 gene pairs derived from the latest WGD), we estimated the number of neo-functionalized and sub-functionalized genes and compared these data with selection pressures (Ka/Ks ratio) and functional annotation. Our study provides a comprehensive view of gene evolution following a relatively recent duplication event.


Differential expression of all duplicated genes

We were able to map a large number of the total transcripts (82–90%) to the soybean genome, with a smaller percentage (34–64%) that mapped uniquely due to short read length and the duplicated nature of the genome. The green pod library is aberrant in the low number of reads aligning uniquely; this is due to a large number of ribosomal (26S) sequences. Considering all the duplicated genes in Glycine max (genes with 2–6 copies), the expression data show that 34–54% of the paralog gene pairs harbor evidence of transcriptional divergence in any one of the seven tissues (Table 1). Therefore, on average, approximately 50% of the paralog pairs were transcriptionally diverged, or, conversely, 50% showed no evidence of divergence of transcriptional levels between the duplicates. Interestingly, 1210 gene pairs transcriptionally diverged in the same direction across all seven tissues. We computationally identified 720 homeologous segments around the soybean genome (Figure 1a). Maintaining the false discovery rate at 0.05, co-expression of duplicated genes within blocks (gene pairs in a given homeolog that are differentially expressed in the same direction) was observed in 3.5–7.7% of cases of homeologous regions (Table 2 and Figure S1).

Figure 1.

Homeologous relationships between the 20 soybean chromosomes. Homeologous relationships based on (a) the 17 547 gene pairs, (b) the 8910 strictly duplicate gene pairs, and (c) the 611 gene pairs over- or under-expressed in the same direction in the seven tissues. Red and green lines in front of genes (outside the circle) indicate whether the gene is over- or under-expressed, respectively.

Table 1. Percentage of differentially expressed paralogs by tissue sample for the whole gene set (2–6 copies) and for strictly duplicated genes

Tissue             Whole gene set (%)    Strictly duplicated genes (%)
Apical meristem    53.6                  45.4
Green pods         33.5                  25.2
Root tip           48.8                  38.5

Table 2. Distribution of homeologous blocks enriched for over- or under-expressed gene pairs

Tissue             Blocks enriched for differential expression    Percentage of total blocks
Apical meristem    46                                             6.4
Green pods         55                                             7.6
Root tip           25                                             3.5

Also of interest were differences in homeologous expression across tissues. As an ‘all by all’ tissue comparison would produce too complex a picture to analyze, we used only two pairs of tissues that we considered the most similar (i.e. root versus nodule tissues and flower versus leaf tissues) to test for differential expression for each gene set. As an example, when comparing gene expression between root and nodule samples, we found 16 539 genes out of 32 250 that were differentially expressed. Both tissues had approximately the same number of over- and under-expressed genes (Figure 2). However, when placing genes into the context of homeologous blocks of duplicated genes, we found 22 blocks that were over-expressed in roots versus nodules, but none that were over-expressed in nodules versus roots. A different picture was obtained when comparing flower versus leaf tissues. In that case, 10 753 genes of 32 388 were differentially expressed. In Figure 2, more dots are present above the line at zero than are under it, suggesting an overall over-expression of genes in flower. This was further corroborated by the fact that we found four homeologous blocks that were over-expressed in flowers versus leaves and only one that was over-expressed in leaves versus flowers.

Figure 2.

Differential expression of duplicates across tissues. The log fold change of the normalized gene expression is plotted on the y axis, and the mean log expression is plotted on the x axis. Blue dots represent genes that were significantly differentially expressed; gray dots represent genes that were equally expressed. The red horizontal line is at zero, providing a visual check for symmetry. Top: root versus nodule. The plot appears symmetrical, suggesting that, overall, both tissues have approximately the same number of over- and under-expressed genes. Bottom: leaf versus flower. This graph is non-symmetrical (more blue dots above zero), suggesting that, overall, genes in flower tissue are over-expressed compared to leaf tissue.

Analysis of recently duplicated genes

In order to understand the role of neo- or sub-functionalization on the short-term evolution of duplicated genes in soybean, we focused our further analyses on genes harboring only two copies in the genome (strictly duplicated genes from the recent WGD, Figure 1b). Expression data showed that a large proportion of strictly duplicated genes were differentially expressed (from 25% in green pods to 45% in apical meristems, Tables 1 and 3). When performing pairwise comparisons of tissues, we found that most of the genes were over- or under-expressed in the same direction. However, overall, only 611 pairs of strictly duplicated genes were over- or under-expressed in the same direction across all seven tissues (Figure 1c).

Table 3. Matrix of differential expression of strictly duplicated genes (8910 pairs)

Tissue                          Apical meristem   Flower   Green pods   Leaves   Nodule   Root   Root tip
Apical meristem (4045 pairs)    –                 2391     1691         2207     1913     2229   1946
Flower (3890 pairs)             186               –        1510         2386     1834     2087   1666
Green pods (2242 pairs)         85                116      –            1409     1238     1506   1281
Leaves (3546 pairs)             199               137      120          –        1711     1987   1562
Nodule (3640 pairs)             356               370      178          340      –        2129   1613
Root (3959 pairs)               256               265      124          234      247      –      2127
Root tip (3434 pairs)           263               331      132          300      355      167    –

The upper diagonal of the matrix gives the number of gene pairs differentially expressed in the same manner in the compared tissues (over- or under-expressed in both tissues). The lower diagonal gives the number of gene pairs differentially expressed in opposite directions in the compared tissues (over-expressed in one tissue but under-expressed in the other tissue). Numbers in parentheses indicate the number of gene pairs that are differentially expressed per tissue.

The Ks distribution for all duplicated genes showed two obvious peaks, centered around 0.1 and 0.6 (Figure 3). Strictly duplicated genes had a Ks < 0.4, and 92% of those genes had a Ka/Ks ratio <0.5. Only 29 of the 8910 gene pairs have Ka/Ks ratios > 1 (Figure 4). A logistic regression model showed that ln (Ka/Ks) values had a statistically significant effect on the odds of differential expression of duplicate gene pairs in all the tissue samples (Table 4). As the sign of the logistic regression coefficients for ln (Ka/Ks) is negative, the odds of differential expression of duplicate gene pairs decreases with increasing ln (Ka/Ks) values. β values were similar across tissues, suggesting that the ln (Ka/Ks) value has a similar effect on differential gene expression across all tissues.

Figure 3.

Ks distributions. Histogram showing pairwise Ks values, converted into million years (My), for gene families harboring 2–6 copies in the soybean genome. Ks values between 0 and 0.39 (gray) correspond to the duplication 13 MYA; Ks values >0.4 (white) correspond to the duplication 59 MYA. The inset histogram shows pairwise Ks values calculated for strictly duplicated genes only, converted into million years (My).

Figure 4.

Ka/Ks values for the 8910 strictly duplicated genes.

Table 4. Summary of the seven model fits with ln (Ka/Ks) as the predictor variable

Tissue sample      Decrease in odds per unit increase in ln (Ka/Ks)    β value    P value
Apical meristem    1.036863                                            −0.0362    0.00087
Green pods         1.053586                                            −0.0522    9.20E−006
Root tip           1.056118                                            −0.0546    4.20E−007

The table shows the decrease in odds of differential expression of a duplicate gene pair for a unit increase in the ln (Ka/Ks) value, the β value, which represents the effect of ln (Ka/Ks) on the odds of differential expression of duplicate gene pairs, and the P value for the logistic regression hypothesis test for the relationship between the ln (Ka/Ks) value and the odds of differential expression. A unit increase in ln (Ka/Ks) is equivalent to an approximately 2.72 times increase in Ka/Ks value.

Gene ontologies (GO terms) were used to classify genes according to their molecular function and the pathway in which they are involved. Overall, more than 70% of the genes were associated with the GO term ‘molecular function’. We found that 96% of strictly duplicated genes maintained the same function, but 4% (351 pairs of 8910) showed different functions between duplicates at the annotation level. Ka/Ks values for these 4% of neo-functionalized genes were statistically higher than those from genes with the same function (0.36 versus 0.24, respectively, P < 2.2e−16). When analyzing expression data per class of annotated function, we found that the terms ‘transcription factors’, ‘DNA binding’, ‘magnesium ion binding’ and ‘structural constituents of ribosomes’ contained a significant number of differentially expressed genes compared to other functions even though they are not the most represented classes, excepting the regulation of transcription pathway (Figure 5).

Figure 5.

Gene ontology annotation. The histogram at the top represents the number of genes associated with a function, for functions represented by more than 100 gene sequences or that contain a significant number of differentially expressed duplicate gene pairs. Dotted bars represent functions that contain a significant number of differentially expressed duplicate gene pairs in at least one of the following tissues: AM, apical meristem; F, flower; L, leaf; N, nodule; GP, green pod; R, root; RT, root tip. The histogram at the bottom represents the number of genes associated with a pathway, for pathways represented by more than 100 gene sequences. Pathway denominations are identical to the ones in the GO annotation output, thus regulation of transcription appears in two categories: DNA-dependent and DNA-independent.


Differential expression between duplicated genes is postulated to contribute to phenotypic variation (Buggs et al., 2010), especially in polyploids in which copy number may increase rapidly. The prevalence of expression sub-functionalization after polyploidization (variation in relative expression of homeologs among tissues in the polyploids) has been assessed in only a few studies. In allopolyploid cotton, it was shown that, of 63 genes expressed in 24 tissues, 40% showed biased expression between homeologs (Chaudhary et al., 2009). Similar patterns of biased expression have been observed in Arabidopsis (Blanc and Wolfe, 2004; Groszmann et al., 2011), Tragopogon mirus (Buggs et al., 2010), Paramecium tetraurelia (Arnaiz et al., 2010), Xenopus laevis (Sémon and Wolfe, 2008) and humans (Gu et al., 2002), but generally only for a limited number of genes. In addition to previous studies performed using whole-transcriptome analysis (Galbraith and Birnbaum, 2006; Higgins et al., 2012), our work presents some new insights into duplicated gene expression in plants.

Soybean underwent two rounds of duplication, 59 and 13 MYA (Figure 3), and retained 17 547 gene pairs, of which 8910 are strictly duplicated genes (two copies only). We found that, on average, 50% of duplicated genes showed differential expression between duplicates, regardless of the age of the duplication (Table 1). Focusing on genes with only two copies in the genome (all duplicated 13 MYA), we found that the vast majority have a Ka/Ks ratio <0.4 (Figure 4), indicative of a highly purifying selective pressure at the nucleotide level. In addition, transcriptional divergence and Ka/Ks ratios were negatively correlated (Table 4) in genes across the sampled tissues, suggesting that sub-functionalization across tissues has increased evolutionary pressures to maintain gene function. This is in agreement with previous results showing that, in some Arabidopsis ecotypes and in rice, in contrast to what was predicted, the single nucleotide polymorphisms (SNPs) result in less radical amino acid changes in genes for which a duplicated copy exists in the genome (Chapman et al., 2006). This reinforces theories of the retention of duplicates through sub-functionalization, where the partitioning of genes across tissues implies that both copies must remain functional (Force et al., 1999; Lynch and Force, 2000). It also demonstrates that, if expression sub-functionalization is established rapidly after polyploidization in soybean, as in cotton (Chaudhary et al., 2009), this process has been maintained over time.

Although the soybean genome has undergone diploidization/fractionation (Schlueter et al., 2006; Schmutz et al., 2010), there are still large regions of homeology that remain from both duplication events (Figure 1). Thus, we were able to determine whether evolutionary pressures to retain gene activity, or to allow transcriptional divergence, occurred at the level of individual genes, or at larger chromosome-level regions. Overall, 720 homeologous blocks of duplication were computationally identified. A small proportion of the total homeologous blocks, between 3.5 and 7.7%, showed evidence of biased, or differential, expression across the entire block of homeologs/paralogs (Table 2). Significant clusters of co-expressed genes were also observed across tissues. As an example, although both tissues harbor approximately the same number of over- and under-expressed genes (Figure 2), 22 blocks (3% of the total homeologous blocks) were over-expressed in roots but under-expressed in nodules, and none of the blocks were over-expressed in nodules but under-expressed in roots. In contrast, more genes were over-expressed in flowers than in leaves (Figure 2). However, only four blocks were over-expressed in flowers versus leaves and only one was over-expressed in leaves versus flowers. It is possible that our analyses under-estimate the number of blocks showing evidence of coordinated gene expression, as, in some instances, a block may be large enough to mask its sub-structure, i.e. groups of genes within the larger block that show coordinated expression. Although the experimental design included three biological replicates, they were pooled and sequenced as a single library rather than bar-coded by replicate. This design precluded a statistical analysis that could leverage the replicated nature of the experiment.
As noted by Auer and Doerge (2010), when RNA-seq data are analyzed for differential expression in lieu of biological replication, true false-discovery rates (FDRs) are often higher than the nominal (i.e. reported) FDRs. Accordingly, the lack of replication should be noted as a limitation of this study due to the possibility that true FDRs are inflated above the reported levels. However, our results do suggest that coordinated regulation of genes within homeologous blocks does occur in soybean, but at a low occurrence, and is variable between tissues. Previous studies documented epigenetic re-patterning after polyploidization as shown in synthetic wheat (Qi et al., 2010; Zhao et al., 2011), in the recent polyploid Spartina anglica (Salmon et al., 2005) and in Senecio (Hegarty et al., 2011). However, epigenetic mutations have been shown to be reversible within a few generations in plants (Becker et al., 2011). We do not yet know to what extent epigenetic patterning affects differential gene expression of duplicated genes in a polyploid such as Glycine max. However, the presence of biased differential expression across entire blocks of homeologs/paralogs provides candidate regions in which to investigate the role of epigenetic mutation in differential expression.

When comparing the gene expression of strictly duplicated genes between two tissues, the majority of genes were expressed in the same direction, i.e. over- or under-expressed in both tissues (Table 3). In fact, 1210 of the whole set of duplicated genes (17 547 gene pairs) and 611 gene pairs of the 8910 strictly duplicated genes showed evidence of transcriptional divergence in the same direction across all seven tissues (Figure 1c). These may be candidates for genes that are being non-functionalized, although further tissue sampling may reveal tissues in which the other copy is more highly expressed.

Gene ontology (GO) analysis performed on the 8910 strictly duplicated gene pairs showed that nearly all gene pairs retained function (96.6%), at least at the level of GO annotation. Only 4% of the strictly duplicated genes have putatively neo-functionalized over the past 13 million years, even though their Ka/Ks values, while slightly higher than those for gene pairs that retained function (0.36 versus 0.24), were below 1. Overall, only 29 gene pairs, 0.3% of the total, had Ka/Ks values > 1. However, these genes showed the same function for both copies and therefore constitute candidates to study the process of neo-functionalization. Neo-functionalization has been shown to be responsible for important evolutionary features. For instance, in Arabidopsis, Toc75, a gene whose product is involved in the import of nucleus-encoded proteins into chloroplasts, originated from a duplication in an ancient eukaryotic organism more than 1.2 billion years ago and evolved through neo-functionalization (Töpel et al., 2012). Regulatory neo-functionalization (i.e. gain of a new expression pattern) has played a role in pollen evolution in Arabidopsis (Liu et al., 2011). Evolution through neo-functionalization has also been demonstrated in animals. In Drosophila melanogaster, functional investigations showed that CG11700, a gene specifically expressed in males, was neo-functionalized from a polyubiquitin gene, and its product has evolved as a factor that is responsible for the trade-off between male fecundity and lifespan (Zhan et al., 2012). Genes involved in sex determination in vertebrates also evolved through neo-functionalization (Mawaribuchi et al., 2011), as did genes involved in venom production in marine cone snails (Chang and Duda, 2012). However, neo-functionalization remains a relatively rare event when compared to the number of duplicated genes.
As this process involves selection of beneficial mutations (Freeling, 2008), neo-functionalization is a slow evolutionary mechanism (Freeling, 2009). This may explain why, in the case of recent polyploids, only a few cases of neo-functionalization have been reported, while in ancient paleopolyploids, such as Arabidopsis, more duplicates have been shown to have acquired new functions (Blanc and Wolfe, 2004).

A large proportion of strictly duplicated genes are involved in the regulation of transcription (Figure 5) at the functional or pathway level. Interestingly, only four categories (transcription factors, DNA binding, structural constituents of ribosomes and magnesium ion binding) were significantly differentially expressed in at least one tissue (Figure 5). Following the last tetraploidization event in Arabidopsis, transcription factors and ribosomal protein genes were shown to be enriched in the genome (Blanc and Wolfe, 2004; Seoighe and Gehring, 2004; Maere et al., 2005). Ribosomal protein genes are also enriched in yeast (Papp et al., 2003), while transcription factors became enriched in poplar (Rodgers-Melnick et al., 2011) and after the pre-grass polyploidy in rice (Tian et al., 2005). Here, we show that transcription factors and ribosomal protein genes, as in other species, were retained after the polyploidization 13 MYA, and that these genes are more likely to be differentially expressed. This is in agreement with the theory of Ohno, who first suggested that the main consequence of polyploidization is to increase the complexity of regulatory networks (Ohno, 1970).


Among the many models that attempt to explain how/why duplicated genes are retained after polyploidy (for review, see Freeling, 2009), sub-functionalization is the most popular hypothesis even though it remains controversial (Freeling, 2008). Here, we demonstrated that, after two rounds of polyploidization, a large proportion of duplicated genes exhibit differential expression, most of which show evidence of sub-functionalization at the expression level. As this is a relatively rapid process compared to the classical model of sub-functionalization through mutations (Doyle et al., 2008), it may be one of the major evolutionary effects of polyploidization. This is especially true in soybean, where transcription factors and ribosomal protein genes are the two main gene classes that are differentially expressed in several tissues. Therefore, polyploidization events provide raw material to enhance network complexity and organismal adaptability to new environmental conditions. Sub-functionalization, by increasing the pressure to keep both copies of the same gene, may constitute a transitional step to neo-functionalization for some genes, as supported by the small number of putatively neo-functionalized genes in soybean.

Experimental Procedures

Plant culture

All tissues described below were isolated from soybean Glycine max (L.) Merr. cultivar Williams 82. For each tissue, three independent biological replicates were performed on different sets of plants to ensure reproducibility of the plant tissues analyzed (i.e. seeds were sown three times on different days, and tissues were harvested as described below). Soybean seeds were surface-sterilized and germinated for 3 days between moist Whatman filter paper (Whatman). Root tips were harvested from 3-day-old seedlings. To produce other tissues, germinated seedlings were transferred to the greenhouse under long-day conditions (16 h light/8 h dark) at 27°C on Promix Bx soil (Premier Horticulture). Fourteen-day-old shoot apical meristems (V2 stage), 18-day-old trifoliate leaves and roots (V2 stage), flowers (R2 stage) and pods (R6 stage) were harvested. Nodules were harvested 32 days after inoculation of 3-day-old seedlings with 1 ml of B. japonicum suspension (OD600 = 0.1).

RNA extraction and sequencing

Total RNA was isolated as described by Libault et al. (2010). For each tissue, equal amounts of RNA isolated from the three independent biological replicates were pooled and sequenced using the Solexa platform (Illumina, Inc.) after first- and second-strand cDNA synthesis. Between 4.18 and 6.84 million reads of approximately 36 bp were generated for each tissue.

Duplicate gene analysis for differential expression within tissue samples

Sequence filtering and alignment were processed as described by Libault et al. (2010). Read counts used in expression analyses were based on the subset of uniquely aligned reads that also overlapped the genomic spans of the Glyma1 gene predictions. Read counts for a given sample were normalized by using values for a gene's uniquely aligned read counts per million reads uniquely aligning within that sample. For a given gene, the measured expression, i.e. the level at which it was transcribed, was proportional to its length. For each tissue sample, differential expression was tested for each pair of duplicate genes as defined by Schmutz et al. (2010), using the exact conditional test (Gu et al., 2008). For each pair of genes, P values were computed from the exact conditional test and adjusted to maintain the false discovery rate (FDR) at 0.05 across gene pairs using the Benjamini–Hochberg method (Benjamini and Hochberg, 1995). Only gene pairs whose expression differed from 0 were included in the analysis. For further information about the statistical analyses, see Methods S1.
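The per-pair procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the way gene length enters the null proportion (p0 = len1 / (len1 + len2)) is an assumption based on the statement that measured expression is proportional to gene length. The exact form of the test of Gu et al. (2008) is given in Methods S1 and may differ.

```python
from math import exp, lgamma, log


def reads_per_million(count, total_unique_reads):
    """Normalize a gene's uniquely aligned read count to reads per million."""
    return count * 1_000_000 / total_unique_reads


def exact_conditional_p(x1, x2, len1, len2):
    """Two-sided exact conditional test for one pair of duplicates.

    Conditional on n = x1 + x2, x1 is Binomial(n, p0) under the null of
    equal expression. ASSUMPTION: p0 is the length share of copy 1.
    """
    n = x1 + x2
    p0 = len1 / (len1 + len2)

    def log_pmf(k):
        # Binomial log-probability, computed in log space for stability.
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                + k * log(p0) + (n - k) * log(1 - p0))

    log_obs = log_pmf(x1)
    # Sum all outcomes that are no more likely than the observed one.
    total = sum(exp(log_pmf(k)) for k in range(n + 1)
                if log_pmf(k) <= log_obs + 1e-9)
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A pair would then be called differentially expressed in a tissue when its adjusted P value is at most 0.05.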

Homeologous block analysis within tissue samples

Identification of homeologous blocks was performed using i-ADHoRe version 2.1 (Simillion et al., 2008), and pairs of duplicate genes were structured into pre-defined blocks of duplicate genes (i.e. homeologs). For each tissue, we analyzed whether the gene pair homeologs present on a given block were statistically differentially expressed in the same direction, as described in Methods S1. P values were calculated for every defined duplicate block, and adjusted to maintain the FDR across duplicate blocks at 0.1.

Differential expression of genes and homeolog blocks across tissues

Differential expression between similar tissues, i.e. root versus nodule tissues and flower versus leaf tissues, was tested for every gene using Fisher's exact test, with a small P value corresponding to a statistically significant association between tissue type and expression of the gene. P values were subsequently adjusted to maintain the FDR at 0.05 across genes. For each pre-defined block of duplicate genes, we tested whether the genes were statistically differentially expressed in the same direction between roots and nodules as well as between flower and leaf tissues. P values were calculated following a similar protocol to that described in the previous section (details in Methods S1). P values were subsequently adjusted to maintain the FDR at 0.05 across genes.
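As an illustration of the per-gene test between two tissues, the sketch below computes a two-sided Fisher's exact P value for a 2 × 2 table. The table layout (reads from the gene versus all other uniquely aligned reads, in each tissue) is an assumption made for this example, and the function name is hypothetical; the authors' exact procedure is described in Methods S1.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    ASSUMED layout: a = reads from the gene in tissue 1, b = all other
    reads in tissue 1; c and d are the same quantities for tissue 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lo, hi = max(0, col1 - row2), min(col1, row1)

    def log_p(x):
        # Hypergeometric log-probability of x gene reads in tissue 1.
        return (_log_comb(row1, x) + _log_comb(row2, col1 - x)
                - _log_comb(n, col1))

    log_obs = log_p(a)
    # Sum all tables that are no more likely than the observed one.
    total = sum(exp(log_p(x)) for x in range(lo, hi + 1)
                if log_p(x) <= log_obs + 1e-9)
    return min(1.0, total)
```

The resulting per-gene P values would then be adjusted across genes (for example with the Benjamini–Hochberg procedure) before applying the 0.05 FDR threshold.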

Ka and Ks calculation and functional annotation

The number of non-synonymous (Ka) and synonymous (Ks) mutations and Ka/Ks ratio values were calculated using PAML software (Yang, 2007), based on alignment of both nucleotide and protein sequences of the gene and its duplicate. Ks values were calculated for the whole set of duplicated genes (genes harboring two copies or more, i.e. 17 547 pairs), and used to estimate time using a value of 0.056 synonymous transversions per site (Schmutz et al., 2010). The Ka/Ks ratio was calculated for the 8910 pairs of strictly duplicated genes (genes harboring two copies only), and, for each copy, the function and the pathway in which the gene is involved were annotated based on similarity searches using the Blast2GO Gene Ontology (GO) tool (Conesa et al., 2005).
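Once Ka and Ks are available for a pair, the pair can be assigned to a duplication event and to a selection regime. The sketch below is only illustrative: the Ks boundary of 0.4 is read from the Figure 3 legend, the Ka/Ks thresholds follow the hypotheses stated in the Introduction, and the function names and labels are hypothetical.

```python
# Boundary between the two Ks peaks, as read from the Figure 3 legend.
RECENT_KS_MAX = 0.4


def duplication_event(ks):
    """Assign a pair to the duplication ~13 MYA or ~59 MYA from its Ks."""
    return "recent (~13 MYA)" if ks < RECENT_KS_MAX else "ancient (~59 MYA)"


def selection_regime(ka, ks):
    """Classify a pair by its Ka/Ks ratio."""
    if ks == 0:
        return "undefined"  # ratio cannot be computed without synonymous changes
    ratio = ka / ks
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral"
```

For example, a pair with Ka = 0.024 and Ks = 0.1 would fall in the recent duplication and under purifying selection.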

Relationship between the Ka/Ks value and differential expression of duplicate gene pairs

A natural logarithm transformation was applied to Ka/Ks values, with a boundary correction of 1E−10 for a null Ka/Ks value. The relationship between the ln (Ka/Ks) value and the odds of differential expression of duplicate gene pairs for the seven tissue samples was modeled using the logistic regression equation:

\[
\ln\left(\frac{P_g}{1 - P_g}\right) = \beta_0 + \beta_1 \ln\left(K_a/K_s\right)_g
\]

where Pg is the probability of differential expression of duplicate gene pair ‘g’, β0 is the intercept of the logistic regression model, and β1 is the coefficient of the ln (Ka/Ks) value for duplicate gene pair ‘g’, i.e. the effect of ln (Ka/Ks) on the odds of differential expression of duplicate gene pair ‘g’.
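The link between the fitted coefficient and the 'decrease in odds' column of Table 4 can be checked numerically: for a negative β1, the odds of differential expression are divided by exp(−β1) for each unit increase in ln (Ka/Ks). The sketch below is a minimal illustration with hypothetical function names. The intercept β0 is not reported in the text, so any value passed for it is a placeholder, and the handling of a null Ka/Ks value (replacing it with 1E−10) is our reading of the boundary correction.

```python
from math import exp, log

BOUNDARY = 1e-10  # boundary correction stated in the text for a null Ka/Ks


def ln_ka_ks(ka_ks):
    """Natural log of Ka/Ks; ASSUMES a null ratio is replaced by 1e-10."""
    return log(ka_ks if ka_ks > 0 else BOUNDARY)


def prob_differential(beta0, beta1, ka_ks):
    """Probability of differential expression under the logistic model."""
    z = beta0 + beta1 * ln_ka_ks(ka_ks)
    return 1 / (1 + exp(-z))


def odds_decrease_factor(beta1):
    """Factor by which the odds shrink per unit increase in ln(Ka/Ks)."""
    return exp(-beta1)
```

With the β values of Table 4, `odds_decrease_factor(-0.0362)` gives approximately 1.036863 (apical meristem) and `odds_decrease_factor(-0.0546)` approximately 1.056118 (root tip), matching the tabulated values.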

Relationship between functional annotation and differential expression of strictly duplicated gene pairs

For each of the 1201 GO functional classes identified and each tissue sample, the relationship between differential expression of duplicate genes and their biological function (as defined in the Blast2GO analysis) was tested using a hyper-geometric test (Fisher's exact test) (Fisher, 1966). P values were subsequently adjusted to maintain the FDR at 0.05. Duplicate gene pairs were assigned to a GO functional class if at least one of the two genes was involved in that function. Duplicate gene pairs were permitted to be present in multiple functional classes if the duplicate gene pair was involved in multiple functions.
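The assignment rule and the enrichment test can be sketched as follows. This is an illustration under stated assumptions: the test is written as a one-sided (upper-tail) hypergeometric probability, which is one common way to test for enrichment and may not match the exact sidedness used in the study; the function names are hypothetical and the GO identifiers in the usage note are illustrative only.

```python
from math import comb


def pair_classes(go_copy1, go_copy2):
    """GO classes of a duplicate pair: every class in which at least one
    of the two copies is annotated."""
    return set(go_copy1) | set(go_copy2)


def hypergeom_upper_tail(k, class_size, n_diff, n_total):
    """P(X >= k) for a hypergeometric draw.

    k          : differentially expressed pairs observed in the class
    class_size : pairs assigned to the class
    n_diff     : differentially expressed pairs in the tissue
    n_total    : all pairs tested
    """
    upper = min(class_size, n_diff)
    # Exact integer arithmetic; comb returns 0 for impossible draws.
    favorable = sum(comb(class_size, x) * comb(n_total - class_size, n_diff - x)
                    for x in range(k, upper + 1))
    return favorable / comb(n_total, n_diff)
```

For instance, `pair_classes({"GO:0003677"}, {"GO:0003677", "GO:0003735"})` places the pair in both classes, and the P values obtained per class and tissue would then be adjusted to control the FDR at 0.05.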


Coding sequence information (start and end points as well as chromosome locations) was retrieved from the Phytozome database. In order to visualize duplicated regions in the soybean genome, lines were drawn between matching genes using Circos (Krzywinski et al., 2009), first for the whole set of genes and then for the 8910 strictly duplicated gene pairs only. An additional circular layout was drawn for the strictly duplicated pairs of genes that were over/under-expressed in the same direction in the seven tissues. The results are displayed in Figure 1, with over-expressed copies in the seven tissues shown in red and under-expressed copies shown in green.


We would like to acknowledge funding from the US National Science Foundation (grant numbers MCB1229956 and DBI0836196 to S.A.J., and grant number DBI-0421620 to G.S.).