Correlated evolution of genome size and seed mass

Authors


Author for correspondence: Jeremy M. Beaulieu Tel: +1 (805) 756–5062 Fax: +1 (805) 756–1419 Email: jbeaulie@calpoly.edu

Summary

  • Previous investigators have identified strong positive relationships between genome size and seed mass within species, and across species from the same genus and family.
  • Here, we make the first broad-scale quantification of this relationship, using data for 1222 species, from 139 families and 48 orders. We analyzed the relationship between genome size and seed mass using a statistical framework that included four different tests.
  • A quadratic relationship between genome size and seed mass appeared to be driven by the large genome/seed mass gymnosperms and the many small genome size/large seed mass angiosperms. Very small seeds were never associated with very large genomes, possibly indicating a developmental constraint. Independent contrast results showed that divergences in genome size were positively correlated with divergences in seed mass.
  • Divergences in seed mass have been more closely correlated with divergences in genome size than with divergences in other morphological and ecological variables. Plant growth form is the only variable examined thus far that explains a greater proportion of variation in seed mass than does genome size.

Introduction

Nuclear DNA amount varies over four orders of magnitude in plants (Bennett & Leitch, 2005). Recent phylogenetic studies have revealed the dynamic nature of genome size evolution, where both increases and decreases have taken place within lineages (Leitch et al., 1998, 2005; Soltis et al., 2003; Johnston et al., 2005; Price et al., 2005). The amplification of transposable elements (Bennetzen, 2002; Kidwell, 2002; Bennetzen et al., 2005) and polyploidy are both thought to be pervasive mechanisms for increasing bulk nuclear DNA amount (2C DNA amount). The mechanistic basis for genome reduction is still poorly understood (Petrov et al., 2000; Bennetzen et al., 2005) but in plants is partly associated with the re-diploidization of the polyploid genome with accompanied downsizing of the monoploid genome (1Cx DNA amount; Leitch & Bennett, 2004). Genome size reduction may require strong selection pressures (Petrov et al., 2000), which implies that there may be some cost associated with large genomes, or benefits associated with small genome size. Therefore, there has been a growing interest in the phenotypic consequences of variation in genome size (Knight et al., 2005). There is a strong positive relationship between cell size and 2C DNA amount (Bennett, 1972, 1973; Edwards & Endrizzi, 1975; Sugiyama, 2005). In addition to cell size, 2C DNA amount is positively correlated with cell cycle duration (Rees et al., 1966; Baetcke et al., 1967; Bennett et al., 1983; Lawrence, 1985). Based on these cellular correlations, it is conceivable that many other morphological and physiological traits may scale with DNA content.

Previous studies have shown consistent, positive associations between genome size and seed mass in comparisons between populations of the same species and across groups of species within the same genus or family (Table 1). Seed mass varies over nearly 12 orders of magnitude, from the dust-like seeds of Orchidaceae to the 20 kg double coconut. Seed mass variation carries significant agronomic and ecological consequences, and therefore understanding the genetic basis of seed mass variation is of great interest. In our view, genome size may be related to seed mass through cell size effects within seed organs (i.e. endosperm and embryo). Step increases in genome size may lead to larger endosperm cells, resulting in increased seed volume and mass. Increased cell size in any other seed organ (cotyledons or hypocotyls, for example) could also lead to increased seed mass. Here we report results of a large analysis involving 1222 species (by far the largest study to date) where we test the hypothesis that variation in genome size is positively correlated with seed mass variation.

Table 1.  Previous studies on the relationship between genome size and seed mass
Correlation (a)  Level (b)  Description                      Authors
+                Pop.       15 Dasypyrum villosum            Caceres et al. (1998)
+                Pop.       12 Soybean strains               Chung et al. (1998)
+                Sp.        131 British angiosperms          Thompson (1990)
+                Sp.        43 British plants                Grime et al. (1997)
+                Sp.        22 Crepis spp.                   Jones & Brown (1976)
+                Sp.        12 Allium spp. and 6 Vicia spp.  Bennett (1972)
+                Sp.        85 Pinus spp. worldwide          Grotkopp et al. (2004)
+                Sp.        19 Mediterranean annuals         Maranon & Grubb (1993)
+                Sp.        148 species in California flora  Knight & Ackerly (2002)
+                Sp.        Several Poaceae and Fabaceae     Mowforth (1985)
NS               Sp.        16 grassland species, UK         Leishman (1999)

  • (a) Correlations are either positive (+) or not significant (NS).
  • (b) Studies were classified into different levels – those dealing with different populations of the same species (Pop.) or multiple species (Sp.).

To analyze the relationship between genome size and seed mass we used a statistical framework that included four different tests, each of which asked a different question of the data:

  • Is there a predictable statistical association between genome size and seed mass? To answer this question we use simple regression statistics.
  • Is the relationship between genome size and seed mass polygonal, and can we detect boundaries or limits to the bivariate relationship? We use quantile regression to provide a more complete view of the relationship than what is captured by the median regression statistics alone.
  • Are divergences in genome size associated with divergences in seed mass? Has there been correlated evolution of these traits? From our dataset we construct a consensus ‘mega-tree’ phylogeny with the most current molecular diversification times (Wikström et al., 2001) and use independent contrast analyses (Felsenstein, 1985) to answer this question.
  • Which evolutionary divergences contribute the most to present-day variation in genome size and seed mass, and do nodes that contribute significant variation in genome size also contribute significantly to extant variation in seed mass? To answer this we use contribution index scores, which estimate the amount of present-day variation explained by divergences at each node in the phylogeny (Moles et al., 2005a). We use a rank correlation statistic between contribution index scores for each trait to test this question.

By performing all four of these analyses we provide a comprehensive view of the relationship between genome size and seed mass for 1222 species.

Materials and Methods

Genome size and seed mass

The term genome size has been widely used in the literature to refer either to the total DNA amount in the nucleus or, in a more restricted sense, to the DNA content of the monoploid genome. To avoid confusion Greilhuber et al. (2005) proposed that ‘genome size’ should continue to be used in the broad sense as a covering term but proposed the terms ‘holoploid genome size’ or ‘2C value’ to refer to the DNA content of the unreplicated nucleus, and ‘monoploid genome size’ or ‘1Cx value’ to refer to the DNA content of the monoploid genome with chromosome number x. The 1Cx value (calculated by dividing the 2C value by the ploidy level) is predicted to be similar between a diploid and autopolyploid race of the same species, while the 2C DNA content should show step increases. However, it appears that polyploid formation may be accompanied by genome downsizing, which results in a smaller 1Cx amount compared with the diploid progenitor species (Leitch & Bennett, 2004). Here we test for associations between both 2C DNA and 1Cx DNA content with seed mass.

Estimates of 2C DNA content were taken from the Plant DNA C-values database (prime estimates; Bennett & Leitch, 2005) and were combined with seed mass estimates from the Seed Information Database (Flynn et al., 2004); both databases are maintained at the Royal Botanic Gardens at Kew. There were 1222 species with known 2C DNA content that also had a seed mass in the seed mass dataset. Where there were multiple estimates for seed mass, the geometric mean was used as the species value. We calculated 1Cx DNA content for species with known ploidy by dividing the 2C value by the ploidy level (i.e. 2x, 4x, etc.). Because many species have a range of ploidy, which can confound the calculation of the monoploid genome size, we only used species where one ploidy level was reported. Therefore 1Cx values were calculated for only 999 species.
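The two data-preparation rules described above (geometric mean of multiple seed mass estimates, and 1Cx obtained by dividing 2C by a single reported ploidy level) can be sketched as follows. This is a minimal illustration only; the function names and the example values are hypothetical and are not taken from either database.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive seed mass estimates for one species."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def monoploid_genome_size(two_c, ploidy_levels):
    """Return 1Cx = 2C / ploidy level.

    Returns None when no ploidy level, or more than one distinct ploidy
    level, is reported, mirroring the exclusion rule described in the text.
    """
    if len(set(ploidy_levels)) != 1:
        return None
    return two_c / ploidy_levels[0]

# Hypothetical species with two seed mass estimates and a tetraploid count.
seed_mass = geometric_mean([1.0, 100.0])          # about 10
one_cx = monoploid_genome_size(8.0, [4])          # 2.0
excluded = monoploid_genome_size(8.0, [2, 4])     # None: mixed ploidy
```

Under this rule a species with mixed ploidy reports simply drops out of the 1Cx dataset, which is consistent with the smaller 1Cx sample (999 species) relative to the 2C sample (1222 species).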

For 2C DNA amount, our dataset comprised 1087 angiosperms and 135 gymnosperms. 2C DNA amounts ranged 525-fold from 0.284 to 148.95 Gbp. The angiosperm sample was well distributed phylogenetically; it included 139 families out of 443 families currently known (Stevens, 2005), including representatives of the basal angiosperms as well as species from each of the major angiosperm clades (i.e. monocot and eudicot). Our dataset provided an adequate representation of the full range of 2C values currently known for angiosperms (2C = 0.134–254.8 Gbp). The gymnosperm sample included nine families from all four gymnosperm orders (i.e. Cycadales, Ginkgoales, Gnetales, Pinales) and the 2C DNA values ranged nearly sixfold from 12.8 to 70.6 Gbp (the range for 207 gymnosperms currently in the Plant DNA C-values database is 15-fold from 4.6 to 70.6 Gbp).

Constructing a ‘mega-tree’

We constructed a ‘mega-tree’ hypothesis using Phylomatic (tree version: R20040402; Webb & Donoghue, 2005). This online software is a compilation of previously published phylogenies and its ordinal ‘backbone’ and family resolutions are based on the Angiosperm Phylogeny Website (APweb; Stevens, 2005), the best current estimate of relationships among higher plants. Currently, Phylomatic has complete familial representation. The program first matches a species by genus, and then by family. If one genus is missing within a family, the entire set of genera for that family is returned as a polytomy within the ‘mega-tree’. Because our dataset spans many genera within many families, most relationships were placed as polytomies. However, many of these polytomies could be resolved by consulting the current literature (supplementary material, Table S1). As a rule for resolving these polytomies, when there were conflicting branching patterns in the literature, a polytomy at the most ancestral node of the family was maintained.

Phylomatic assumes extant gymnosperms are monophyletic. While this view is controversial (morphological data support Gnetales being sister to angiosperms; Donoghue & Doyle, 2000; Friedman, 2006), molecular data generally support it (Chaw et al., 2000; Soltis et al., 2002; Burleigh & Mathews, 2004). However, placement of the four orders of gymnosperms (Cycadales, Ginkgoales, Gnetales, Pinales) is inconsistent. For this reason, we conservatively maintain the phylogeny output by Phylomatic that contains a basal polytomy across the four orders of gymnosperms. For comparison when analyzing gymnosperms alone, we also tested three alternative phylogenies of gymnosperms based on current molecular data (Chaw et al., 2000; Soltis et al., 2002; Burleigh & Mathews, 2004).

Branch length information for our ‘mega-tree’ phylogeny was taken from age estimates published by Wikström et al. (2001). These authors applied a nonparametric rate-smoothing algorithm (which allows different clades to evolve at different rates) to a three-gene dataset that spanned nearly 75% of all angiosperm families. Estimates were then calibrated at a single point within the fossil record (the Fagales–Cucurbitales divergence, 84 million yr ago (Mya)), to obtain the first comprehensive hypothesis of angiosperm diversification times. Recently, Bell et al. (2005) incorporated multiple fossil calibrations using Bayesian relaxed clock (BRC) and penalized likelihood (PL) methods to derive divergence times of several major basal groups, which included the origin of angiosperms. The age estimates of Bell et al. (2005) were not significantly different from estimates by Wikström et al. (2001). Because the Wikström et al. (2001) ages were more comprehensive, we used those age estimates here.

Dated nodes from Wikström et al. (2001) matched 49 divergences in our phylogeny. We then used the branch length adjustment algorithm in Phylocom (BLADJ; Webb et al., 2006) to estimate the age for undated nodes. BLADJ sets a root node at a specified age and fixes all other known aged nodes. Branch lengths for undated nodes are then interpolated by evenly distributing them between nodes with known ages, which minimizes variance in branch length (Webb et al., 2006). The ages within our phylogeny should be treated as approximations.
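The interpolation idea can be illustrated on a single root-to-tip lineage. The sketch below is not the BLADJ implementation, which operates on whole trees; it only shows how undated nodes lying between two dated nodes receive evenly spaced ages. The function name and the example path are hypothetical.

```python
def interpolate_ages(path_ages):
    """Fill undated nodes (None) on a root-to-tip path by even spacing.

    path_ages lists node ages from the root (oldest) to the tip (age 0).
    The first and last entries must be dated.
    """
    ages = list(path_ages)
    if ages[0] is None or ages[-1] is None:
        raise ValueError("the first and last nodes on the path must be dated")
    last = 0  # index of the most recent dated node seen so far
    for i in range(1, len(ages)):
        if ages[i] is not None:
            gap = i - last
            step = (ages[last] - ages[i]) / gap
            # Spread the undated nodes evenly between the two dated nodes.
            for k in range(1, gap):
                ages[last + k] = ages[last] - k * step
            last = i
    return ages

# Hypothetical lineage: two dated nodes (325 and 154 Mya), three undated
# nodes, and an extant tip at 0 Mya.
example = interpolate_ages([325, None, None, 154, None, 0])
```

In this toy lineage the two undated nodes between 325 and 154 Mya are placed at 268 and 211 Mya, and the undated node between 154 Mya and the tip is placed at 77 Mya.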

Analytical approach

We analyzed the relationship between genome size (which includes both 2C and 1Cx DNA content) and seed mass using four methods: regression, quantile regression, independent contrasts, and contribution index scores (described later in this section). Since our dataset includes two distinct groups of seed plants, gymnosperms and angiosperms, we also determined the influence that each group had on the overall relationships independently. We also analyzed the relationship within well-represented (in terms of sample size) families and genera of both angiosperms and gymnosperms (Table S2). All variables violated the assumptions of normality; therefore we log-transformed the data before analysis.

Regression.  We used least-squares regression to test for an association between genome size and seed mass across all species. While we advocate using independent contrasts to infer adaptive or correlated evolutionary hypotheses (see later), regression analyses can provide predictive power, even in the absence of significant independent contrast results, and therefore should be performed in conjunction with independent contrast analyses. In addition, the results of the regression test can be directly compared to quantile regression analyses (see following section).
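As a minimal sketch of this step, an ordinary least-squares fit on log-transformed values can be written with the standard library alone. The base of the logarithm (base 10 here) and the function name are assumptions made for illustration; the input values are invented.

```python
import math

def log_log_regression(x, y):
    """Least-squares fit of log10(y) on log10(x).

    Returns (slope, intercept, r_squared).
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Invented data in which seed mass is exactly proportional to genome size.
slope, intercept, r2 = log_log_regression([1, 10, 100], [2, 20, 200])
```

On the log-log scale a slope of 1 corresponds to direct proportionality, so the invented data above return a slope of 1 and an r2 of 1.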

Quantile regression.  We used quantile regression to identify limits, boundaries, and shifting relationships within our bivariate distributions. Quantile regression extends classical least-squares regression by estimating slopes not only through the mean or median, but also through each quantile (or percentile) of the bivariate relationship. Significant changes in slope, or quadratic coefficients, through the quantiles of a bivariate distribution imply that the distribution is polygonal, or filled unimodal, rather than linear. Such bivariate distributional shapes are common in ecological and evolutionary analyses, yet only recently have statistical methods been available to quantify them. Quantile regression thus provides a more complete view of the relationship between x and y than what is captured by median least-squares regression alone.

We estimated the quantile regression coefficients for the 5th through the 95th quantiles. For example, the 65th quantile regression is calculated by minimizing residual errors around a line where 65% of the observations fall below the line and 35% fall above the line. Residuals for points that fall above the line are weighted by the quantile (0.65 for the 65th quantile), while the points falling below the line are weighted by one minus the quantile (corresponding to 0.35 for the same 65th quantile). The 50th quantile corresponds to median regression, the counterpart of the traditional least-squares estimate, where an equal number of points fall above and below the line. Koenker & Hallock (2001), Knight & Ackerly (2002), and Cade & Noon (2003) all provide a detailed discussion of quantile regression methods. We used the ‘quantreg’ package (Koenker et al., 2005) to perform our quantile regression analyses.
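The asymmetric weighting described above can be made concrete with a short sketch. It is not the ‘quantreg’ algorithm; it fits only a constant (a horizontal line) by brute force, which is enough to show that the weighting pulls the fitted value towards the requested quantile. The function names and the data are hypothetical.

```python
def check_loss(residuals, tau):
    """Weighted absolute error used in quantile regression.

    Positive residuals (points above the line) are weighted by tau,
    negative residuals (points below the line) by 1 - tau.
    """
    return sum(tau * r if r >= 0 else (tau - 1) * r for r in residuals)

def best_constant(y, tau):
    """Observed value that minimises the check loss when used as a flat line."""
    return min(sorted(y), key=lambda c: check_loss([v - c for v in y], tau))

values = list(range(1, 11))                 # hypothetical responses 1..10
q65 = best_constant(values, 0.65)           # 7
median = best_constant(list(range(1, 10)), 0.5)   # 5
```

With ten evenly spaced responses the 65th quantile fit settles on the seventh value, leaving six points below and three above; a full quantile regression applies the same loss while also estimating slope (and, here, quadratic) coefficients.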

Independent contrasts.  We used the analysis of traits (AOT) module (developed by Ackerly, 2006) of Phylocom (Webb et al., 2006) to perform independent contrasts on our ‘mega-tree’ phylogeny. The AOT algorithm calculates standardized divergences of extant species and estimates internal node averages and divergences incorporating branch lengths (Felsenstein, 1985). A unique feature of AOT is that it can handle polytomies; our ‘mega-tree’ phylogeny contained many. AOT uses the method developed by Pagel (1992) to calculate independent contrasts with phylogenies that contain polytomies. AOT takes a particular polytomy and ranks species based on the value of the independent variable (in this case 2C DNA content or 1Cx DNA content), where the median value is then used to create two groups. Mean values are calculated for each trait between the two groups and the difference between these means is treated as a single independent contrast.
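The polytomy rule described above can be sketched as a median split on the independent variable. This is an illustration only, not the AOT code: branch-length standardization is omitted, and the handling of odd group sizes (the middle lineage is assigned to the upper group) is an assumption.

```python
from statistics import mean

def polytomy_contrast(x, y):
    """Single contrast across a polytomy.

    x: independent trait (e.g. log genome size) for each descendant lineage.
    y: dependent trait (e.g. log seed mass) for the same lineages.
    Lineages are ranked by x and split at the median rank; the contrast is
    the difference between the group means (upper group minus lower group).
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(order) // 2
    low, high = order[:half], order[half:]
    dx = mean([x[i] for i in high]) - mean([x[i] for i in low])
    dy = mean([y[i] for i in high]) - mean([y[i] for i in low])
    return dx, dy

# Hypothetical four-way polytomy.
dx, dy = polytomy_contrast([4, 1, 3, 2], [50, 10, 40, 20])
```

Because the groups are defined by rank on x, the contrast in x is never negative, and the sign of the contrast in y shows whether the dependent trait changed in the same direction.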

The consistency in the direction of subtraction when performing independent contrasts is important. AOT is useful in that it sets the sign of the contrast for X (here we set genome size as X) to always be positive, and all other traits (seed mass) are then compared in the same direction across the node (Ackerly, 2006). Since the direction of subtraction is clearly arbitrary, reversing the direction of subtraction will result in a contrast of the opposite sign. Thus, all contrasts inherently have a mean value of zero and regression analysis of independent contrasts must be forced through the origin to account for this property (Garland et al., 1992). We utilized the output of our standardized contrasts from AOT and used R (R Development Core Team, 2005) to obtain slope estimates and r2 from a regression analysis forced through the origin.
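A regression forced through the origin has a closed form, sketched below. The sketch assumes the uncentered definition of r2 (total sum of squares taken about zero), which is the convention R reports for models without an intercept; the function names and contrast values are hypothetical.

```python
def positivize(dx, dy):
    """Flip both contrasts at a node whenever the X contrast is negative."""
    pairs = [(-a, -b) if a < 0 else (a, b) for a, b in zip(dx, dy)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def origin_regression(dx, dy):
    """Slope and uncentered r2 of a least-squares line through the origin."""
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared

# Hypothetical standardized contrasts for genome size (X) and seed mass (Y).
px, py = positivize([1.0, -2.0, 3.0], [2.0, -4.0, 6.0])
slope, r2 = origin_regression(px, py)
```

Flipping the sign of both contrasts at a node leaves every product and square unchanged, so the slope and r2 through the origin do not depend on the arbitrary direction of subtraction.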

AOT also calculates an absolute measure of trait radiation (divergence width) at each node, which is analogous to a standard deviation. We used the divergence width instead of trait differences (independent contrasts) because the standard deviation can be used when polytomies are present in the phylogeny (Moles et al., 2005a; Ackerly, 2006). To examine the pattern of trait evolution through time, we plotted the divergence width in genome size and seed mass against the age estimates of Wikström et al. (2001) and the age interpolations from BLADJ (see earlier section on Constructing a ‘mega-tree’). We then fit a loess curve to the data, with 5% of the points influencing the smoothness of the line, to uncover any particular geologic times that may have been more divergent than others.

Contribution index.  The contribution index is a measure of how much a divergence at a particular node in the ‘mega-tree’ explains present-day variation within a trait (Moles et al., 2005a). The contribution index is the product of two variance components: (i) the amount of variation within a focal clade resulting from a focal divergence; and (ii) the amount of the total variation within that focal clade compared to the whole tree (Moles et al., 2005a). Each component is calculated from different decompositions of the sum of square deviations from internal node averages estimated by Phylocom. The decomposition of the sum of squares for trait divergences at each node was obtained from AOT to calculate each component, and subsequently the contribution index. The contribution index was calculated for genome size (2C and 1Cx) and seed mass separately. A Spearman rank correlation was used to determine if nodes with high contribution to genome size variation were also nodes with a high contribution to seed mass variation.
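The two-component product and the rank correlation can be sketched as follows. This is a simplified illustration, not the AOT calculation: sums of squares are taken directly from tip values rather than from Phylocom's internal node averages, and all function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean

def contribution_index(daughters, all_tips):
    """Product of (i) share of focal-clade variation due to the focal
    divergence and (ii) share of whole-tree variation held by the clade.

    daughters: one list of tip values per daughter clade of the focal node.
    all_tips: tip values for the whole tree.
    """
    clade = [v for d in daughters for v in d]
    clade_mean = mean(clade)
    ss_clade = sum((v - clade_mean) ** 2 for v in clade)
    tree_mean = mean(all_tips)
    ss_tree = sum((v - tree_mean) ** 2 for v in all_tips)
    if ss_clade == 0 or ss_tree == 0:
        return 0.0
    ss_divergence = sum(len(d) * (mean(d) - clade_mean) ** 2 for d in daughters)
    return (ss_divergence / ss_clade) * (ss_clade / ss_tree)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    return cov / sqrt(sum((p - ma) ** 2 for p in ra) * sum((q - mb) ** 2 for q in rb))

ci = contribution_index([[1.0, 1.0], [3.0, 3.0]], [1.0, 1.0, 3.0, 3.0, 5.0, 5.0])
rho = spearman([0.6, 0.07, 0.04], [0.3, 0.001, 0.01])
```

In the toy tree the focal divergence accounts for all variation within its clade, and that clade holds a quarter of the variation in the whole tree, giving a contribution index of 0.25.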

Results

Regression and quantile regression

Across all species, the relationship between 2C DNA content and seed mass appears curvilinear and concave. Small 2C DNA content species have a wide range of seed masses, while species with large 2C DNA content tend to have larger seeds (Fig. 1a). Interestingly, the space occupied by mid-range 2C DNA content is depopulated for large seed mass species; the relationship looks quadratic (Fig. 1a). Therefore, we performed a partial F-test, a posteriori, to determine if quadratic regression is more appropriate than a ‘straight line’ linear regression. Across all species, the quadratic term reduced the squared errors (from 1582.2 to 1562.6; F = 15.66, P < 0.001). However, with the data separated into the different groups, gymnosperms appear to drive the quadratic pattern. The addition of a quadratic term to the gymnosperms alone also significantly reduced the squared errors (from 72.9 to 66.2; F = 13.56, P < 0.001). For angiosperms alone the quadratic term did not add any more explanatory power (F = 0.392, P = 0.532) and the slope of a ‘straight line’ linear regression was not significantly different from zero (r2 < 0.001, slope = −0.015, P = 0.845; but see quantile regression results later in this section).
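The partial F-test for one added term can be written out as a short sketch. It assumes that the quoted ‘squared errors’ are residual sums of squares and that the full model has three fitted parameters (intercept, linear and quadratic terms); the function name is hypothetical. Evaluated on the rounded sums quoted above, it gives values close to, but not identical with, the reported F statistics.

```python
def partial_f(sse_reduced, sse_full, n, p_full, extra_terms=1):
    """F statistic for adding `extra_terms` predictors to a nested model.

    sse_reduced: residual sum of squares of the simpler model.
    sse_full: residual sum of squares of the model with the extra terms.
    n: number of observations.
    p_full: number of fitted parameters in the full model, intercept included.
    """
    numerator = (sse_reduced - sse_full) / extra_terms
    denominator = sse_full / (n - p_full)
    return numerator / denominator

f_all = partial_f(1582.2, 1562.6, n=1222, p_full=3)   # about 15.3
f_gym = partial_f(72.9, 66.2, n=135, p_full=3)        # about 13.4
```

The statistic is compared with an F distribution on 1 and n − 3 degrees of freedom, which matches the denominator degrees of freedom (1219 and 132) reported for the second-order models.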

Figure 1.

(a) The relationship between 2C DNA content and seed mass. Data are split into gymnosperms (black circles) and angiosperms (gray circles). Angiosperms alone showed no trend, while gymnosperms are significantly positive. However, when correcting for influence of phylogeny, the overall relationship is significant, angiosperms are significantly positive, and gymnosperms show no trend (see Table 2). Each gray line in (a) corresponds to a different quantile of the quadratic coefficient, with the 50th quantile (least-squares estimate) highlighted (solid black line). (b) Quantile regression analysis of the quadratic coefficient (a) showing an increasing quadratic coefficient (solid line) with increasing quantiles. Gray is the standard error of each quadratic coefficient estimate. The horizontal lines in (b) represent the least-squares line (dashed line) with standard error (dotted lines).

Considering the results above, we fitted a second-order polynomial regression model for data involving all species, which was highly significant (r2 = 0.026, F2,1219 = 16.29, P < 0.001). The least-squares quadratic coefficient was positive (slope = 0.409, t2,1219 = 3.96, P < 0.001). Quantile regression also showed that, from the 5th through the 95th quantiles, the quadratic coefficient was significantly different from zero and positive. In general, the magnitude of the quadratic coefficient steadily increased throughout the quantiles (Fig. 1b), indicating that the quadratic relationship became more pronounced for species with the largest seed mass. Between the 50th and 85th quantiles the quadratic coefficient was significantly different from the least-squares quadratic regression, highlighting the utility of quantile regression.

When analyzing angiosperms alone, the concavity of the relationship was no longer apparent (see F-test earlier, Fig. 2a). Therefore a first-order regression model was fitted. The model was not significant (r2 < 0.001, F1,1085 = 0.038, P = 0.845); however, quantile regression analyses indicated that the linear coefficient steadily declined as the quantiles increased (Fig. 2b). Only the coefficients corresponding to the 50th through the 61st quantiles were not significantly different from zero. Therefore, there was a shift from a significantly positive slope in the lower quantiles to a significantly negative slope in the upper quantiles.

Figure 2.

(a) Scatter plot of 2C DNA content and seed mass in angiosperms. Each gray line in (a) corresponds to a different quantile regression result, with the 50th quantile (least-squares estimate) highlighted (solid black line). (b) Quantile regression analysis of (a) showing a decreasing slope with increasing quantiles. Gray is the standard error of each quantile regression estimate. The horizontal lines in (b) represent the least-squares line (dashed line) with standard error (dotted lines) and slope = 0 (solid line).

The gymnosperm data alone did exhibit some concavity. A second-order linear regression model was significant (r2 = 0.214, F2,132 = 17.98, P < 0.001; Fig. 3a). The quadratic coefficient was positive (slope = 7.14, t2,132 = 3.68, P < 0.001) and from the 16th through the 95th quantiles the quadratic coefficient was significantly different from zero, but never significantly different from the least-squares quadratic regression. Nevertheless, these results indicated increasing concavity with increasing seed mass (Fig. 3b).

Figure 3.

(a) Scatter plot of 2C DNA content and seed mass in gymnosperms. Each gray line in (a) corresponds to a different quantile dependent result using the quadratic coefficient, with the 50th quantile (least-squares estimate) highlighted (solid black line). (b) Quantile regression analysis of (a), the quadratic coefficient showing an increase in slope with increasing quantiles. Gray is the standard error of each quadratic coefficient estimate. The horizontal lines in (b) represent the least-squares line (dashed line) with standard error (dotted lines) and slope = 0 (solid line).

Regression coefficients for the 1Cx analyses had a greater magnitude than the coefficients of the 2C DNA analyses (except across all species, see Table 2). Quantile regression results for 1Cx DNA content and seed mass paralleled results found when using 2C DNA content.

Table 2.  Results for the regression and independent contrast analyses across all species, and for angiosperms and gymnosperms analyzed separately, for 2C and 1Cx DNA content with seed mass
                     Regression                Independent contrasts
                     n     r2        Slope     N cont  r2        Slope
2C
  All species        1222  0.026**   0.409**   686     0.033**   0.382**
  Angiosperms alone  1087  < 0.001   −0.015    590     0.033**   0.381**
  Gymnosperms alone  135   0.214**   7.14**    95      0.023     0.620
1Cx
  All species        999   0.041**   0.304**   550     0.062**   0.594**
  Angiosperms alone  886   0.004     0.163     467     0.062**   0.592**
  Gymnosperms alone  113   0.209**   7.47**    82      0.047*    1.01*

  • Regressions for the independent contrasts were forced through the origin.
  • N cont, number of contrasts in the independent contrasts analyses.
  • , quadratic slope;
  • *, P < 0.05;
  • **, P < 0.01.

Independent contrasts

There was a significant positive relationship between 2C DNA content and seed mass across all species when using independent contrasts (r2 = 0.033, slope = 0.382, P < 0.001; Table 2; Fig. 4a). We found a significant positive relationship when analyzing angiosperms alone (r2 = 0.033, slope = 0.381, P < 0.001), but not for gymnosperms alone (r2 = 0.023, slope = 0.620, P = 0.137). When testing 1Cx DNA content and seed mass across all species, the relationship was highly significant and explained nearly twice as much variation as 2C DNA content (r2 = 0.062, slope = 0.594, P < 0.001; Fig. 4b). Angiosperms alone also had a highly significant positive independent contrast result when testing 1Cx (Table 2). For 1Cx DNA content, gymnosperms alone showed a marginally significant trend (Table 2); however, we found no significant correlated evolution for gymnosperms, for either 2C or 1Cx DNA content, when using three differing resolutions of the gymnosperm phylogeny (Table 3).

Figure 4.

The relationship between divergences in genome size and divergences in seed mass. For both the 2C (a) and 1Cx DNA datasets (b), there was a significant and positive relationship with divergences in seed mass, which were primarily driven by divergences within angiosperms (divergences in gymnosperms were relatively small). 1Cx divergences explained more variation (6.2%) in seed mass evolution than did divergences in 2C DNA content (3.3%; see Table 2). Regression lines (solid lines) are forced through the origin and the horizontal slope (dashed line separating the two quadrants) is also shown.

Table 3.  Independent contrasts results for alternative resolutions of the gymnosperm phylogeny
Differing topologies within gymnosperms                                  2C DNA                 1Cx DNA
                                                                         N cont  r2     Slope   N cont  r2     Slope
Ginkgoales sister to all other gymnosperms; Gnetales within conifers     97      0.016  0.558   84      0.040  0.716
Ginkgoales sister to all other gymnosperms; Gnetales sister to conifers  97      0.015  0.718   84      0.041  0.541
Ginkgoales sister to Cycadales; Gnetales sister to conifers              97      0.016  0.581   84      0.045  0.738

  1. N cont, number of contrasts.
  2. Regressions for the independent contrasts were forced through the origin. None of the analyses was significant.

Across the history of seed plants, divergences in both 2C DNA and seed mass have remained relatively constant (Fig. 5). However, the largest divergences in both 2C DNA content and seed mass have occurred more recently in geologic time (Fig. 5). The pattern when examining the divergence width in 1Cx DNA content was identical to results when using 2C DNA content.

Figure 5.

The history of 2C DNA content and seed mass divergences through time. The divergence width is the absolute measure of trait radiation at each node. Estimated time is taken from age estimates of Wikström et al. (2001) and interpolations using a branch length adjustment algorithm (BLADJ; Webb et al., 2006). The regression line is a loess curve with 5% of points influencing the smoothness. For both 2C DNA (a) and seed mass (b), the average divergence width has remained constant; however, the largest divergences have been relatively recent. Mya, million yr ago.

Contribution index

Nodes that made large contributions to present-day genome size variation also made large contributions to present-day seed mass variation (Fig. 6). Spearman rank correlation showed that contribution scores for 2C DNA content were positively associated with contribution scores for seed mass (n = 686, Spearman's r = 0.422, P < 0.001). This was also true for 1Cx DNA content and seed mass (n = 550, Spearman's r = 0.400, P < 0.001).

Figure 6.

Our mega-tree phylogeny to the order level, displaying the 20 largest contributions to present-day variation for both DNA content (1Cx and 2C DNA content) and seed mass. Black ovals represent the 20 divergences with the highest contribution score for genome size. White diamonds represent the 20 divergences with the highest contribution score for seed mass. Black ovals within white diamonds represent divergences that were in both of the above sets. Diamonds and ovals at the tips of this tree represent divergences within orders.

For our dataset, the single most important contribution to present-day 2C DNA content was the divergence between angiosperms (1087 spp., mean 2C value = 3.53 Gbp) and gymnosperms (estimated at 325 Mya; Judd et al., 2002; 135 spp., mean 2C value = 35.6 Gbp; Table 4). This divergence was also the most important for present-day seed mass variation. Angiosperms have smaller seeds (mean seed mass = 4.26 mg) compared with the large-seeded gymnosperms (mean seed mass = 21.6 mg). The second most important contribution to 2C DNA content variation was a three-way polytomy at the base of Poales (estimated at 72 Mya; Wikström et al., 2001), which included Typhaceae (two spp., mean 2C value = 0.810 Gbp), Cyperaceae/Juncaceae (24 spp., mean 2C value = 1.26 Gbp), and Xyridaceae/Poaceae (161 spp., mean 2C value = 7.63 Gbp). However, this node did not show a large contribution to seed mass variation; all nodes within this polytomy led to clades that produced relatively small seeds (Typhaceae, mean seed mass = 0.901 mg; Cyperaceae/Juncaceae, mean seed mass = 0.314 mg; Xyridaceae/Poaceae, mean seed mass = 2.16 mg).

Table 4.  The 20 divergences making the largest contribution to present-day 2C DNA content variation (ranked 1–20) with accompanying contribution to seed mass variation explained by these nodes (seed mass rank)
Rank  2C DNA contribution  Divergences making the largest contribution                          Seed mass rank  Seed mass contribution
1     0.601                Angiosperms vs gymnosperms (c. 325 Mya)                              1               0.321
2     0.072                Polytomy at the base of Poales (c. 72 Mya)                           92              0.001
3     0.043                Monocots vs the rest of the angiosperms (c. 154 Mya)                 20              0.010
4     0.014                Asparagales vs commelinids (c. 116 Mya)                              3               0.033
5     0.011                Austrobaileyales vs rest of angiosperms (c. 161 Mya)                 12              0.016
6     0.011                Saxifragales vs Vitales and the rosids (c. 114 Mya)                  7               0.020
7     0.011                Dioscoreales vs Liliales, Asparagales and commelinids (c. 125 Mya)   79              0.002
8     0.011                Brachypodium spp. vs rest of the pooids; Poaceae (c. 8 Mya)          210             < 0.001
9     0.010                Iridaceae vs the rest of Asparagales (c. 76.4 Mya)                   110             0.001
10    0.008                Polytomy across Fabales, Rosales, Cucurbitales, Fagales (c. 87 Mya)  16              0.013
11    0.008                Trifolium spp. vs rest of Fabaceae (c. 36 Mya)                       31              0.007
12    0.007                Acorales vs rest of the monocots (c. 144 Mya)                        408             < 0.001
13    0.006                Polytomy at the base of core eudicots (c. 117 Mya)                   8               0.019
14    0.006                Xyridaceae vs rest of Poales (c. 42 Mya)                             2               0.019
15    0.006                Arecaceae vs Poales, Commelinales, Zingiberales (c. 95 Mya)          9               0.111
16    0.006                Polytomy at base of Saxifragales (c. 111 Mya)                        32              0.007
17    0.005                Ranunculales vs other eudicots (c. 137 Mya)                          6               0.021
18    0.005                Oryza spp. vs rest of Poaceae (c. 11 Mya)                            59              0.003
19    0.004                Divergence at the base of the robinioids; Fabaceae (c. 48 Mya)       69              0.002
20    0.004                Basal divergence in Fabaceae (c. 56 Mya)                             19              0.011

The most significant contribution to present-day 1Cx DNA variation was also the divergence between angiosperms (886 spp., mean 1Cx = 1.47 Gbp; mean seed mass = 4.21 mg) and gymnosperms (113 spp., mean 1Cx = 16.6 Gbp; mean seed mass = 20.8 mg; Table 5). The second most important contribution to present-day 1Cx DNA content was a divergence (estimated at 154 Mya; Wikström et al., 2001) at the node that led to the monocots (215 spp., mean 1Cx = 2.50 Gbp) and the rest of the angiosperms (magnoliids and the eudicots; 670 spp., mean 1Cx = 1.24 Gbp). There was an opposite trend in the contribution to present-day seed mass at this node; the monocots produced slightly smaller seeds (mean seed mass = 3.59 mg) than the combined mean of the magnoliids and eudicots (mean seed mass = 4.43 mg).

Table 5.  The 20 divergences making the largest contribution to present-day 1Cx DNA content variation (ranked 1–20) with accompanying contribution to seed mass variation explained by these nodes (seed mass rank)
Rank | 1Cx DNA contribution | Divergences making the largest contribution | Seed mass rank | Seed mass contribution
1 | 0.670 | Angiosperm vs gymnosperm (c. 325 Mya) | 1 | 0.280
2 | 0.060 | Monocots vs the rest of the angiosperms (c. 154 Mya) | 161 | < 0.001
3 | 0.039 | Polytomy at the base of Poales (c. 72 Mya) | 58 | 0.002
4 | 0.024 | Iridaceae vs the rest of Asparagales (c. 76 Mya) | 29 | 0.006
5 | 0.021 | Alstroemeriaceae vs rest of Liliales (c. 125 Mya) | 238 | < 0.001
6 | 0.017 | Polytomy at base of core eudicots (c. 117 Mya) | 3 | 0.052
7 | 0.013 | Brachypodium spp. vs rest of the pooids; Poaceae (c. 8.3 Mya) | 212 | < 0.001
8 | 0.012 | Trifolium spp. vs rest of Fabaceae (c. 36 Mya) | 17 | 0.009
9 | 0.008 | Xyridaceae vs rest of Poales (c. 42 Mya) | 8 | 0.024
10 | 0.008 | Divergence near the base of Pinus; Pinaceae (c. 141 Mya) | 40 | 0.004
11 | 0.008 | Divergence near the base of Poaceae (c. 11 Mya) | 175 | < 0.001
12 | 0.007 | Saxifragales vs Vitales and the rosids (c. 114 Mya) | 36 | 0.004
13 | 0.007 | Asparagales vs commelinids (c. 116 Mya) | 5 | 0.029
14 | 0.007 | Arecaceae vs Poales, Commelinales (c. 95 Mya) | 2 | 0.100
15 | 0.006 | Ranunculales vs rest of eudicots (c. 137 Mya) | 7 | 0.026
16 | 0.006 | Polytomy at the base of Saxifragales (c. 111 Mya) | 26 | 0.006
17 | 0.006 | Polytomy across Fabales, Rosales, Cucurbitales, Fagales (c. 87 Mya) | 13 | 0.012
18 | 0.004 | Divergence at the base of the robinioids; Fabaceae (c. 48 Mya) | 57 | 0.002
19 | 0.003 | Polytomy at the base of Ranunculaceae (c. 55 Mya) | 144 | < 0.001
20 | 0.003 | Medicago spp. vs the rest of vicoids; Fabaceae (c. 39 Mya) | 87 | 0.001

Discussion

Because the relationship between genome size and seed mass is complex (not straight-line linear) and exists within a phylogenetic framework, we used a variety of statistical techniques to describe it. We found that across extant species there was a significant positive relationship between genome size and seed mass. However, regression analysis explained only a small percentage of the variation (Table 2). The r² values of our regressions were much lower than we expected based on the uniformly positive and significant regression and correlation results from 10 previous studies (Table 1). Furthermore, the slope for angiosperms alone was nearly zero. This may be because divergences are correlated within the groups studied by other investigators (which were often congeners), whereas the large leaps between groups do not necessarily follow the same evolutionary trend. Across all species, independent contrast analyses showed that divergences in genome size are positively correlated with divergences in seed mass (Table 2; Fig. 4).

Genome size may set a minimum seed mass if there is a developmental relationship between genome size and seed mass. Large seed masses have evolved in species with both small and large genomes, but large-genome species rarely have small seeds (Figs 1–3). Interestingly, there is also an absence of species with mid-range genome sizes and large seed masses (Fig. 1a). The reason for this absence is unclear.

While the ‘straight line’ linear slope within angiosperms was nearly zero, the slope changed across quantiles, shifting from a significantly positive slope in the lower quantiles to a significantly negative slope in the upper quantiles. This suggests that the bivariate distribution forms a filled triangle (Fig. 2). The edges of this triangle represent limits to the distribution. For example, large-genome species are unlikely to have small seeds (note the lack of species in the lower right-hand quadrant in Fig. 1a). However, large-genome angiosperms do not have the largest seeds. Small-genome angiosperms like Cocos nucifera (Arecaceae; 2C = 6.96 Gbp), Castanospermum australe (Fabaceae; 2C = 1.11 Gbp), and Mangifera indica (Anacardiaceae; 2C = 0.88 Gbp) hold that distinction. Small-genome angiosperms also have the smallest seeds, however. Again, it appears that genome size may set a minimum seed mass that increases with increasing genome size, but the maximum seed mass for any given genome size may be determined by other factors. Plant height and seed mass are coordinated life-history traits (Moles et al., 2005b,c), and this coordinated life-history variation may work in opposition to the developmental constraint imposed by genome size on seed mass. Also, large-seeded angiosperms may do best in environments that are less favorable to large-genome species (Knight & Ackerly, 2002).
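How a near-zero overall slope can coexist with opposite slopes in the lower and upper quantiles may be easier to see in a small sketch. The Python code below is not the procedure used in this study. It fits quantile lines by brute force to hypothetical, synthetic points arranged as a filled wedge, and every name and value in it (`pinball_loss`, `quantile_line`, the x and y ranges) is an assumption made for illustration only.

```python
from itertools import combinations


def pinball_loss(points, intercept, slope, tau):
    """Sum of quantile (check) losses for the line y = intercept + slope * x."""
    total = 0.0
    for x, y in points:
        r = y - (intercept + slope * x)
        # Points above the line cost tau per unit, points below cost (1 - tau).
        total += tau * r if r >= 0 else (tau - 1.0) * r
    return total


def quantile_line(points, tau):
    """Brute-force quantile regression with a single predictor.

    An optimal line for the check loss can always be chosen to pass through
    two data points with different x values, so every such pair is tried.
    The cost is cubic in the number of points: small data sets only.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = pinball_loss(points, intercept, slope, tau)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical data (not from the study): x stands in for log genome size,
# y for log seed mass. At each x the y values fill the range between a rising
# lower edge (0.5 * x) and a falling upper edge (10 - 0.5 * x).
points = [
    (float(x), 0.5 * x + (k / 4.0) * (10.0 - x))
    for x in range(9)
    for k in range(5)
]

# Slope of the fitted line at a low, middle and high quantile.
slopes = {tau: quantile_line(points, tau)[1] for tau in (0.1, 0.5, 0.9)}
```

In this toy data set the lower-quantile line follows the rising lower edge, the upper-quantile line follows the falling upper edge, and the median line is flat, which is the pattern expected when the lower and upper limits of a distribution converge.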

Analysis using independent contrasts showed that divergences in genome size are positively correlated with divergences in seed mass across all species. This result was driven primarily by significant independent contrast results within angiosperms; independent contrast results were not significant within gymnosperms alone. The discrepancy in these results could be explained if seed mass scales with polyploidy, as several investigations have shown (Stebbins, 1971; Halloran & Pennell, 1982; Van Dijk & Van Delden, 1990; Bretagnolle et al., 1995). Polyploidy is common in angiosperms (Stebbins, 1950, 1971; Wendel, 2000), whereas in gymnosperms it is uncommon (Delevoryas, 1980; Otto & Whitton, 2000). Within monocots, for example, we found strong correlated evolution between genome size and seed mass (Table S2). It has been suggested that most, if not all, monocots are either current polyploids or re-diploidized ancient polyploids (Goldblatt, 1980).
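The direction of subtraction at each node is arbitrary, so correlations among independent contrasts are conventionally computed through the origin rather than around the mean. A minimal Python sketch of that calculation follows. It uses made-up contrast values, not data from this study, and the function and variable names are hypothetical.

```python
import math


def origin_correlation(cx, cy):
    """Correlation of paired contrasts computed through the origin.

    No mean is subtracted, so reversing the sign of both contrasts at any
    node leaves the result unchanged.
    """
    sxy = sum(a * b for a, b in zip(cx, cy))
    sxx = sum(a * a for a in cx)
    syy = sum(b * b for b in cy)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical standardized contrasts at four nodes (illustrative values only).
genome_contrasts = [1.0, 2.0, -1.0, 0.5]
seed_contrasts = [0.5, 1.5, -1.0, 0.0]
r = origin_correlation(genome_contrasts, seed_contrasts)

# Reverse the direction of subtraction at the first and third nodes.
flipped_genome = [-1.0, 2.0, 1.0, 0.5]
flipped_seed = [-0.5, 1.5, 1.0, 0.0]
r_flipped = origin_correlation(flipped_genome, flipped_seed)
```

Because both traits change sign together when a node is flipped, `r` and `r_flipped` are identical. An ordinary mean-centred correlation would not have this property.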

Not only have there been correlated divergences, but divergences that contribute significantly to extant seed mass variation also contributed significantly to extant genome size variation – further strengthening the association between these traits. The split between angiosperms and gymnosperms held the most explanatory power for present-day variation in genome size and seed mass. Interestingly, across both the 2C and 1Cx datasets, there were more nodes within monocots that held high contribution index scores (i.e. were important for present-day genome size and seed mass variation; Fig. 6). Again, this may be explained by frequent polyploidy within the evolutionary history of monocots (Goldblatt, 1980). The divergence of the large-seeded palms from the mostly small-to-medium seeded commelinids (a divergence previously shown to contribute significantly to seed mass evolution; Moles et al., 2005a) ranked high in explaining present-day genome size variation (Tables 4, 5). In eudicots, the earliest divergence, the split between Ranunculales and the rest of the eudicots, also ranked high in explaining both present-day 2C DNA content and seed mass variation (Tables 4, 5).

Our regression results for angiosperms contradict previous studies (Table 1). However, when the data were analyzed using independent contrasts, divergences in the two traits were significantly correlated (Table 2; Fig. 4a,b). In the history of genome size and seed mass evolution, the average divergence width has remained relatively constant, but the largest divergences in both traits have occurred more recently in geologic time (Fig. 5). Therefore, within angiosperms, deep nodes seldom show correlated evolution, and the relationship between genome size and seed mass within presumably more recent diversifications at the order and family levels drove our independent contrast results (Tables 4, 5). This is consistent with knowledge of the fossil record, where the first half of angiosperm existence was marked by a relatively long period of stasis (c. 85 million years) in seed mass, followed by a gradual diversification before the Cretaceous–Tertiary boundary (Tiffney, 1984; Eriksson et al., 2000). These results may also reflect the prevalence of polyploidy in speciation, if polyploidy leads to larger seed mass (Bretagnolle et al., 1995; Wendel, 2000).

Gymnosperms with smaller genomes, relative to the rest of the group, have a range of seed mass; however, species with larger genomes have increasingly larger seeds (Fig. 3). Our independent contrast results indicated that gymnosperm divergences in 1Cx DNA were weakly correlated with divergences in seed mass; however, there was not a significant correlation between divergences in 2C DNA and seed mass. These results are based on a gymnosperm phylogeny with uncertainty at the base, and no significant result was obtained for either the 1Cx or 2C DNA datasets when using any of the three published alternate basal resolutions (Table 3). Overall, our gymnosperm results suggest there is a high degree of phylogenetic signal within this group. In contrast to angiosperms, divergences deep in the gymnosperm phylogeny drove the regression analysis, and subsequent divergences between the two traits are comparatively small (Table 5; Fig. 6). However, when examining the relationship within Pinales, the best-represented group in our gymnosperm sample, 1Cx DNA content and seed mass did show significant correlated evolution. Further examination within Pinales showed that this result was driven by divergences within Pinaceae, which in turn were largely determined by the significant relationship within Pinus (Table S2). The significant correlated evolution found within Pinus is consistent with the results of Grotkopp et al. (2004), but our results are slightly weaker. Although we used the same phylogenetic tree, we had a smaller sample size (51 vs 83) and included age-estimated branch lengths, whereas Grotkopp et al. (2004) set all branches equal to 1. Despite significance across all three taxonomic levels (order, family and genus), there is strong phylogenetic signal within Pinales. The clear discrepancy in the strength of the regression and independent contrast results (Table S2) can be traced to the inclusion of large influential divergences deep within Pinales, but also to a more recent divergence at the base of Pinus (Table 5, Fig. 6).

Our results show that 1Cx DNA content holds greater explanatory power than 2C DNA content. The monoploid genome (1Cx DNA content) explained 6.2% of the variation in seed mass across all seed plants. This was more than the variation explained by seed dispersal syndrome, which Moles et al. (2005b) reported at 2.7% and ranked as the second most important factor for seed mass evolution (changes in growth form held the greatest explanatory power; Table 6). Therefore, 1Cx DNA content may be a driver of genome size/seed mass correlated evolution. 1Cx DNA content has been shown to have greater explanatory power than 2C DNA content in a number of studies, including studies of both meiosis (Bennett, 1971) and cell cycle duration (Shuter et al., 1983). The basis for this 1Cx effect is puzzling because it challenges our mechanistic hypothesis for bulk DNA content effects. Perhaps it is not the quantity of DNA that matters but the basal monoploid genome size.

Table 6.  Rank of predictor variables for seed mass variation within seed plants (ranked by r²)
Rank | Variable | Variation explained (r²)
1 | Growth form | 9.7%
2 | 1Cx DNA content | 6.2%
3 | 2C DNA content | 3.3%
4 | Dispersal syndrome | 2.7%
5 | Leaf area index | 1.6%
6 | Net primary production | 1.2%
7 | Precipitation | 0.8%
8 | Latitude | 0.7%
9 | Temperature | 0.3%

However, this result may have a statistical explanation. Central to independent contrast statistics is the estimation of ancestral states from descendant values. The use of 2C DNA content, which is irrespective of the level of ploidy, to reconstruct ancestral trait values may lead to misinterpretations because 2C values include both diploid and polyploid species. If reconstruction of ancestral nodes were based on diploid species, ancestral nodes would never be greater (polyploid) than the average of the descendant species. Conversely, estimating 2C DNA content from polyploid descendants would result in the inflation of a diploid progenitor. The cumulative effects of many overestimations and underestimations could have contributed to the lower variation explained by 2C DNA content. Nevertheless, this does not eliminate any possible biological implications for the differing results for 1Cx and 2C DNA content. Therefore, we advocate using both 1Cx and 2C DNA content when testing for genome-size dependent relationships, and a continued search for a mechanistic model to explain why 1Cx is important.
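A two-species example may make this statistical point concrete. The Python sketch below uses the branch-length-weighted average from Felsenstein's independent contrasts method to estimate the value at an ancestral node. The species values, ploidy levels and branch lengths are hypothetical, chosen only to show how a polyploid descendant raises the reconstructed 2C value while the 1Cx value is unaffected. It is not a reanalysis of our data.

```python
import math


def ancestral_estimate(x1, v1, x2, v2):
    """Branch-length-weighted average of two descendant trait values.

    x1, x2 are descendant values; v1, v2 are the branch lengths leading to
    them. Shorter branches receive more weight.
    """
    return (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)


def standardized_contrast(x1, v1, x2, v2):
    """Difference between two descendants scaled by its expected standard deviation."""
    return (x1 - x2) / math.sqrt(v1 + v2)


def one_cx(species):
    """Monoploid genome size: 2C DNA content divided by ploidy level."""
    return species["two_c"] / species["ploidy"]


# Hypothetical sister species (illustrative values in arbitrary units).
diploid = {"two_c": 2.0, "ploidy": 2}
tetraploid = {"two_c": 4.0, "ploidy": 4}
branch = 1.0  # assumed equal branch lengths

anc_2c = ancestral_estimate(diploid["two_c"], branch, tetraploid["two_c"], branch)
anc_1cx = ancestral_estimate(one_cx(diploid), branch, one_cx(tetraploid), branch)

contrast_2c = standardized_contrast(diploid["two_c"], branch, tetraploid["two_c"], branch)
contrast_1cx = standardized_contrast(one_cx(diploid), branch, one_cx(tetraploid), branch)
```

Here the reconstructed 2C value of the ancestor (3.0) exceeds that of the diploid descendant, even though the ancestor was presumably diploid, whereas the reconstructed 1Cx value (1.0) matches both descendants and the 1Cx contrast is zero.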

We used a suite of comparative methods to uncover a significant evolutionary association between genome size and seed mass. Because seed mass has well-described ecological effects, the relationship between genome size and seed mass perhaps represents a genotype/phenotype/selection relationship that does not involve genes per se (however, specific genes may influence seed mass in concert). Further analysis should focus on testing for a more direct role of genome size in seed mass variation. This should involve examining cell sizes in specific seed organs (i.e. embryo, endosperm, cotyledons) that may be affected by genome size. In addition, genome size may be related to other life-history traits (i.e. seed dispersal syndrome, growth form, plant height; Moles et al., 2005b) that have also been shown to be important in the evolution of seed mass. Answering these questions requires a continued effort to join plant functional trait databases (Seed Information Database, Flynn et al., 2004; Glopnet, Wright et al., 2004) with the Plant DNA C-values database (Bennett & Leitch, 2005). This effort is important for uncovering other physiological, ecological, and evolutionary associations with the profound variation in plant genome size.

Acknowledgements

We thank J. Connolly, D. Ackerly, and two anonymous reviewers for comments that greatly improved the manuscript. JBD would like to thank R. Smith, J. Tweddle, and S. Flynn for their contributions to the Seed Information Database. The Millennium Seed Bank Project is funded by the UK Millennium Commission, The Wellcome Trust and Orange plc. Royal Botanic Gardens, Kew is partially funded by the UK Department for Environment, Food and Rural Affairs.
