Genome size variation in wild and cultivated maize along altitudinal gradients


Author for correspondence:

Brandon Gaut

Tel: +1 949 824 2564



  • It is still an open question as to whether genome size (GS) variation is shaped by natural selection. One approach to address this question is a population-level survey that assesses both the variation in GS and the relationship of GS to ecological variants.
  • We assessed GS in Zea mays, a species that includes the cultivated crop, maize, and its closest wild relatives, the teosintes. We measured GS in five plants of each of 22 maize landraces and 21 teosinte populations from Mexico sampled from parallel altitudinal gradients.
  • GS was significantly smaller in landraces than in teosintes, but the largest component of GS variation was among landraces and among populations. In maize, GS correlated negatively with altitude; more generally, the best GS predictors were linked to geography. By contrast, GS variation in teosintes was best explained by temperature and precipitation.
  • Overall, our results further document the size flexibility of the Zea genome, but also point to a drastic shift in patterns of GS variation since domestication. We argue that such patterns may reflect the indirect action of selection on GS, through a multiplicity of phenotypes and life-history traits.


Nuclear DNA content, or genome size (GS), varies > 1000-fold among angiosperms (Muñoz Diez et al., 2012). Beyond its structural impacts on the genome, DNA content variation may have ecological and evolutionary significance (Biemont, 2008), because GS correlates with a range of phenotypes, including flowering time, flower size, seed mass, leaf size and photosynthetic rate (e.g. Meagher & Vassiliadis, 2005; Beaulieu et al., 2007a,b, 2008). Moreover, several studies have reported within-species correlations between GS and ecological variables, such as altitude, latitude and temperature (Knight et al., 2005). These phenotypic and ecological correlates suggest that GS affects the properties of species, such as regional abundances (Herben et al., 2012), colonization rates and invasiveness (Lavergne et al., 2010).

It thus seems likely that GS is shaped by natural selection, but nonselective alternatives have also been proposed to explain GS variation. For example, the skewed distributions of eukaryotic GSs can be explained by a purely mechanistic model in which GS evolves stochastically at a rate proportional to size (Oliver et al., 2007). In vertebrates, recombination rates, rather than natural selection, seem to drive changes in GS (Nam & Ellegren, 2012). It has also been argued that correlation between GS and population or biological parameters may be blurred by the phylogenetic signal. After correcting for this signal, Whitney et al. (2010) reported that GS in angiosperms does not correlate with effective population size (Ne). Because the efficacy of selection is expected to scale with Ne, the lack of relationship between GS and Ne may indicate that selection has had little impact on the broad-scale evolution of GS in higher plants.

Population-level analyses offer the best opportunity to infer selection on GS (Petrov, 2001), but most analyses of the evolutionary processes acting on GS have taken place on an interspecies scale. Within the plant kingdom, GS has perhaps been best studied within the genus Zea. The genus includes the species Zea mays, which is typified by the domesticated subspecies maize (Zea mays ssp. mays) and two prominent wild subspecies – Zea mays ssp. mexicana and Zea mays ssp. parviglumis. These two wild taxa, both of which are native to Mexico and collectively referred to as ‘teosinte’, are geographically and ecologically distinct. Subspecies mexicana is restricted primarily to the highlands in the states of Chihuahua, Durango and Puebla, as well as the Central Plateau that surrounds Mexico City (Iltis & Doebley, 1980; Fukunaga et al., 2005). At an average elevation of 2135 m (Hufford et al., 2012), mexicana exhibits adaptation to variable temperature conditions and high altitudes. By contrast, ssp. parviglumis is found at lower elevations in the Balsas River Basin and the state of Jalisco, tropical regions with relatively stable temperature regimes and lower average altitudes (1095 m; Hufford et al., 2012). The one location in which the two teosinte taxa overlap is the Balsas River Basin (Fukunaga et al., 2005), which is the presumed location of the domestication of maize from ssp. parviglumis c. 9000 yr ago (Matsuoka et al., 2002; van Heerwaarden et al., 2011).

Although there is still debate as to whether GS commonly varies within species (Knight et al., 2005; Biemont, 2008), there is little doubt that GS varies within Zea mays sensu lato. Among cultivated accessions, which include both open-pollinated landraces and inbred lines, GS varies by at least 30% (Muñoz Diez et al., 2012). This variation exhibits inconsistent patterns with geography. In maize, for example, GS and indirect measures of GS (i.e. the number of chromosomal knobs and C-bands) typically decrease with increasing latitude (Rayburn et al., 1985; but see Rayburn & Auger, 1990), but may be less consistent as a function of altitude (Bennett & Smith, 1976; Rayburn & Auger, 1990). Fewer studies have examined correlates between GS and ecological variables in ssp. mexicana and ssp. parviglumis, but at least one study has found a positive correlation between GS and altitude (Laurie & Bennett, 1985). To the extent that the pattern holds, a negative correlation between GS and altitude/latitude may reflect the need for rapid growth and early flowering in the shorter growing seasons typical of cooler regions (Rayburn et al., 1994; Poggio et al., 1998).

Although it is clear that GS varies within and among the subspecies of Zea mays, there has not yet been a systematic survey of GS variation among taxa and populations. Here, we investigate GS and ecological correlates in large samples of both cultivated landraces of maize and in annual Mexican teosintes. In doing so, we have overcome some of the common limitations of previous studies. The first of these is the sampling; most previous surveys of Zea have assessed GS from c. 20 total isolates. Here, we survey 22 maize landraces and 21 populations of annual teosinte, assessing GS variation from five individuals within each landrace and population. A second limitation is that the geographic scale of sampling has often been so broad that the effects of environmental factors have been blurred. Our sampling along parallel altitudinal gradients is designed specifically to reduce this limitation. Finally, environmental information has been limited mostly to altitude or latitude, without consideration of additional ecological variables that may impact more directly on GS, such as temperature and rainfall.

With GS and environmental data for multiple taxa and specific locations, we address three sets of questions. First, what is the distribution of GS among taxa, populations and plants? Is there convincing evidence that GS varies within taxa in a consistent fashion? Second, does GS correlate with bioclimatic variables? If so, are the correlates consistent across parallel gradients and also among taxa, suggesting a clear relationship between GS and environment? Or might relationships vary between the domesticate and its wild ancestors, as suggested by previous studies? Finally, what does this information reveal about the potential evolutionary and ecological forces that act on GS? Ultimately, we seek to determine whether variation in GS is shaped by, and limited by, ecological and evolutionary factors.

Materials and Methods

Plant materials and GS measurements

We sampled teosintes from two altitudinal gradients, hereafter called the ‘Balsas’ and ‘Jalisco’ gradients. ‘Balsas’ was located in the states of Guerrero, Morelos and Estado de México (Fig. 1). The Guerrero and Morelos states include part of the Balsas River Basin, whereas Estado de México represents part of the highlands of Central Mexico. The ‘Jalisco’ gradient was located in the Central Western region of Mexico, including portions of the states of Jalisco, Michoacán and Guanajuato that represent portions of the Bajío and Río Lerma Santiago Basins (Fig. 1). The two gradients ranged between 343 and 2581 m (Fig. 1, Table 1). We sampled 21 total teosinte populations from these gradients, with 10 and 11 populations from the Balsas and Jalisco gradients, respectively. At least 30 adult plants and their seeds (20–50 per adult) were collected from each population. Each gradient encompassed geographic regions typical of both parviglumis and mexicana; we made a provisional designation of subspecies status based on distribution maps (Hufford et al., 2012).

Table 1. Information about maize (Cultivated) and teosinte (Wild) samples included in this study, as well as the sampling locations
| Name | Status^a | Bank accession | Common name | Latitude | Longitude | Elevation (m) | Place | State | Gradient | GS^b | SD^b |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GUER152 | Cultivated | 19 406 | Breve prieto | 17.017 | −99.533 | 300 | Tierra Colorada | Guerrero | Balsas | 1.08 | 0.016 |
| GUER17 | Cultivated | 230 | Conejo | 16.95 | −99.8 | 399 | El Treinta | Guerrero | Balsas | 1.054 | 0.032 |
| GUER222 | Cultivated | 2037 | Maiz ancho | 18.4 | −99.31 | 1330 | Cacahuamilpa | Guerrero | Balsas | 1.114 | 0.028 |
| GUER306 | Cultivated | 22 287 | Rojo itzihuini | 18.133 | −99.1 | 626 | Atenango del Rio | Guerrero | Balsas | 1.084 | 0.025 |
| JALI127 | Cultivated | 14 052 | Rojo sol | 19.933 | −103.733 | 1800 | Tapalpa | Jalisco | Jalisco | 1.078 | 0.029 |
| JALI396 | Cultivated | 22 270 | Tempranillo | 19.94 | −105.2 | 50 | Tomatlan | Jalisco | Jalisco | 1.092 | 0.029 |
| JALI397 | Cultivated | 22 283 | Cristalino tempranillo | 19.94 | −105.2 | 50 | Tomatlan | Jalisco | Jalisco | 1.065 | 0.026 |
| JALI44 | Cultivated | 2208 | Ahumado | 19.48 | −104.16 | 1402 | Los Tocotes | Jalisco | Jalisco | 1.172 | 0.02 |
| MEXI05 | Cultivated | 2233 | Palomero toluqueño | 19.283 | −99.633 | 3073 | Metepec | Mexico | Balsas | 0.99 | 0.01 |
| MEXI211 | Cultivated | 1940 | Palomero legitimo | 19.283 | −99.65 | 2620 | Toluca | Mexico | Balsas | 0.962 | 0.008 |
| MEXI238 | Cultivated | 2074 | Maiz de Tehuacan | 18.8 | −99.917 | 1800 | Llano Grande | Mexico | Balsas | 1.068 | 0.018 |
| MORE112 | Cultivated | 2050 | Ancho | 18.617 | −99.35 | 970 | San Gabriel de las Palmas | Morelos | Balsas | 1.107 | 0.057 |
| OAXA166 | Cultivated | 100 | Costeño | 16.283 | −98.267 | 50 | El Ciruelo | Oaxaca | Oaxaca | 1.11 | 0.044 |
| OAXA244 | Cultivated | 5847 | Tepecintle amarillo | 16.51 | −98.12 | 250 | Costatitlan (El Limón) | Oaxaca | Oaxaca | 1.084 | 0.01 |
| OAXA258 | Cultivated | 5849 | Olotillo de Chaguas | 16.81 | −97.99 | 520 | El Tapanco | Oaxaca | Oaxaca | 1.121 | 0.015 |
| OAXA266 | Cultivated | 5850 | Tehuacanero azul | 17.033 | −97.95 | 850 | Santa María Putla | Oaxaca | Oaxaca | 1.154 | 0.016 |
| OAXA339 | Cultivated | 6075 | Taan gacia | 17.13 | −97.5 | 1600 | San Juan Copala | Oaxaca | Oaxaca | 1.125 | 0.037 |
| OAXA496 | Cultivated | 17 443 | De cajete | 17.467 | −97.567 | 2200 | Yodomano | Oaxaca | Oaxaca | 1.115 | 0.018 |
| OAXA522 | Cultivated | 22 024 | Olote colorado | 16.433 | −95.017 | 50 | Col. Albaro Obregón | Oaxaca | Oaxaca | 1.261 | 0.026 |
| YUCA117 | Cultivated | 843 | Dzit bacal | 21.01 | −89.54 | 10 | Mérida | Yucatan | | 1.155 | 0.017 |
| CM01 | Wild | | | 18.35 | −99.841 | 1649 | Teloloapan | Guerrero | Balsas | 1.21 | 0.039 |
| CM02 | Wild | | | 18.411 | −99.908 | 1439 | Alcholoa | Guerrero | Balsas | 1.186 | 0.042 |
| CM03 | Wild | | | 17.172 | −99.541 | 343 | Tierra Colorada | Guerrero | Balsas | 1.008 | 0.016 |
| CM04 | Wild | | | 16.981 | −99.285 | 581 | Teconapa | Guerrero | Balsas | 1.062 | 0.021 |
| CM05 | Wild | | | 17.392 | −99.478 | 1201 | Chilpancingo | Guerrero | Balsas | 1.098 | 0.102 |
| CM06 | Wild | | | 17.422 | −99.451 | 1251 | Mazatlán | Guerrero | Balsas | 1.121 | 0.031 |
| CM07 | Wild | | | 17.46 | −99.368 | 1107 | Mochitlán | Guerrero | Balsas | 1.075 | 0.051 |
| CM08 | Wild | | | 18.238 | −99.218 | 1101 | Paso Morelos | Guerrero | Balsas | 1.135 | 0.029 |
| CM09 | Wild | | | 18.974 | −99.07 | 1669 | Huilotepec | Morelos | Balsas | 1.052 | 0.017 |
| CM10 | Wild | | | 19.407 | −99.627 | 2581 | Villa Seca | Mexico | Balsas | 1.19 | 0.027 |
| CM11 | Wild | | | 20.139 | −102.068 | 1846 | Churintzio | Michoacán | Jalisco | 1.06 | 0.03 |
| CM12 | Wild | | | 20.627 | −104.408 | 1426 | Guachinango | Jalisco | Jalisco | 1.178 | 0.034 |
| CM13 | Wild | | | 19.715 | −104.804 | 504 | Talpitita | Jalisco | Jalisco | 1.229 | 0.032 |
| SMH564 | Wild | | | 19.91 | −104.173 | 1407 | Ejutla | Jalisco | Jalisco | 1.094 | 0.019 |
| SMH565 | Wild | | | 19.896 | −104.177 | 1317 | Ejutla | Jalisco | Jalisco | 1.108 | 0.01 |
| SMH570 | Wild | | | 19.934 | −104.008 | 944 | La Labor | Jalisco | Jalisco | 1.123 | 0.015 |
| SMH571 | Wild | | | 19.992 | −100.885 | 1861 | San Jose de las Pilas | Guanajuato | Jalisco | 1.108 | 0.033 |
| SMH575 | Wild | | | 20.053 | −101.088 | 1849 | San Rafael | Michoacán | Jalisco | 1.139 | 0.031 |
| SMH576 | Wild | | | 20.161 | −101.373 | 1856 | Tejocote de Calera | Guanajuato | Jalisco | 1.142 | 0.031 |
| SMH577 | Wild | | | 19.732 | −104.872 | 572 | El Llanito | Jalisco | Jalisco | 1.229 | 0.055 |
| SMH578 | Wild | | | 19.535 | −104.058 | 1369 | El Rodeo | Jalisco | Jalisco | 1.168 | 0.021 |

a. All cultivated materials were provided by Centro Internacional de Mejoramiento de Maiz y Trigo (CIMMYT); teosinte samples were sampled in situ by, and are available from, the authors.

b. GS, average genome size among five plants and three replicates per plant, with standard deviation (SD).

Figure 1.

Map of the collection locations of cultivated maize and wild teosinte samples in Mexico. Elevation gradients are denoted by separate colors. Wild and landrace samples are denoted by triangles and circles, respectively. The YUCA117 sample is not represented on the map and is not included in any analyses that compare gradients.

We also analyzed 22 landraces of maize (Zea mays ssp. mays L.). These were sampled along three altitudinal gradients. The first was the ‘Balsas’ gradient, consisting of nine landraces; the second was a gradient in the state of Oaxaca (with eight landraces); and the third, for which there was sampling of four landraces, was the ‘Jalisco’ gradient (Fig. 1, Table 1). Our sampling included two landraces (MEXI05 and YUCA117) that have been reported previously as having a small (Vielle-Calzada et al., 2009) and a large (J. P. Vielle-Calzada, pers. comm.) genome, respectively. In total, we assessed 22 primitive landraces with the altitudinal range of collection sites varying from 10 to 3073 m (Table 1). All landrace seeds were provided by Centro Internacional de Mejoramiento de Maiz y Trigo (CIMMYT).

Five plants per population or landrace were germinated and grown under controlled glasshouse conditions between February and August of 2011 at UC Irvine, CA, USA (33°38′49″N, 117°50′43″W; altitude, 25 m). Leaf samples were sent to Plant Cytometry Services (Schijndel, the Netherlands) where relative DNA measurements were performed by flow cytometry using 4′,6-diamidino-2-phenylindole (DAPI) staining and Ilex crenata ‘Fastigiata’ (2C = 2.2 pg) as an internal standard. This technique is particularly well suited for the detection of small DNA variation among samples. In order to limit technical error, we included the reference maize inbred line B73 in every batch as a second internal control. In addition, the GS measurements were made on three technical replicates per plant on a single leaf harvested at the same level of maturity for each plant. The GS measurement for each technical replicate was calculated as the ratio of the measure to that of maize reference B73; the estimate for each plant was the average among three technical replicates. Because we are interested in ranges of GS variation among Zea individuals, for analyses we expressed GS values as a ratio relative to B73, where B73 has a GS value of 1.0. However, we also provided estimates of GS, in picograms (pg) per 2C nucleus, for comparison with other species (Fig. 2, Supporting Information Table S1); these picogram estimates were calculated assuming that the B73 reference has a GS of 5.64 pg/2C, as based on propidium iodide (PI) staining against the I. crenata standard.
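
The normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names and the fluorescence readings are hypothetical, and only the B73 value of 5.64 pg/2C comes from the text.

```python
from statistics import mean

# B73 genome size in pg/2C, as stated in the text (PI staining against I. crenata).
B73_PG_PER_2C = 5.64


def plant_gs_ratio(sample_signals, b73_signals):
    """Mean over technical replicates of the sample/B73 ratio (B73 = 1.0)."""
    if len(sample_signals) != len(b73_signals):
        raise ValueError("each replicate needs a matching B73 reading")
    return mean(s / r for s, r in zip(sample_signals, b73_signals))


def ratio_to_pg(ratio, b73_pg=B73_PG_PER_2C):
    """Convert a B73-relative GS ratio into picograms per 2C nucleus."""
    return ratio * b73_pg


# Hypothetical readings for one plant: three technical replicates.
ratio = plant_gs_ratio([2.20, 2.22, 2.18], [2.00, 2.00, 2.00])
print(ratio, ratio_to_pg(ratio))
```

Working with ratios keeps comparisons among Zea samples independent of the absolute calibration; the picogram conversion is applied only for comparison with other species.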

Figure 2.

Genome size (GS) estimates for each maize landrace and each teosinte population sample measured either as a ratio over the B73 maize reference genome or in picograms per 2C nucleus (pg/2C). Populations are arranged by mean GS values. Cultivated, yellow boxes; wild, blue boxes. The boxes indicate the first quartile (leftmost line), the median (central line) and the third quartile (rightmost line). The whiskers represent the standard deviation, with the outliers represented by dots.

Statistical analysis of GS within and among populations

The GS data satisfied the assumption of homogeneity of variances (O'Brien test, P > 0.05), but not the assumption of normality (Shapiro–Wilk test, P < 0.0001) required to apply an ANOVA. For this reason, the partitioning of GS variance among populations (Pop) and among plants within populations (Pop : Plant) was determined by a nonparametric multivariate analysis of variance (PERMANOVA; Anderson, 2001), using the function ‘adonis’ of the package ‘vegan 1.17-6’ (Oksanen, 2007) in R statistical software (R Development Core Team, 2008). The ‘adonis’ function performs an ANOVA on matrices of Euclidean distances, testing significance by the analysis of 999 permuted matrices. We also compared variation in GS between gradients (Gradient) and among populations within gradients (Gradient : Pop) using PERMANOVA, based on landrace samples from the Balsas and Oaxaca gradients and teosinte samples from the Balsas and Jalisco gradients.
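
The permutation logic behind such a test can be illustrated with a simplified one-way case. The sketch below is not the ‘adonis’ implementation and does not handle the nested Pop : Plant design; it relies on the fact that, for a single variable and Euclidean distances, the PERMANOVA pseudo-F coincides with the classical ANOVA F. Function names and data are hypothetical.

```python
import random
from statistics import mean


def f_and_r2(values, labels):
    """One-way F statistic and R2 (between-group share of the total SS)."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand = mean(values)
    ss_between = 0.0
    ss_within = 0.0
    for members in groups.values():
        m = mean(members)
        ss_between += len(members) * (m - grand) ** 2
        ss_within += sum((v - m) ** 2 for v in members)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, ss_between / (ss_between + ss_within)


def permutation_anova(values, labels, n_perm=999, seed=0):
    """Permute group labels to obtain a P-value for the observed F."""
    rng = random.Random(seed)
    f_obs, r2 = f_and_r2(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if f_and_r2(values, shuffled)[0] >= f_obs:
            hits += 1
    # The observed arrangement counts as one permutation.
    return f_obs, r2, (hits + 1) / (n_perm + 1)


# Hypothetical GS ratios for five plants in each of three populations.
gs = [0.96, 0.97, 0.95, 0.96, 0.98,
      1.10, 1.12, 1.09, 1.11, 1.13,
      1.25, 1.27, 1.24, 1.26, 1.28]
pops = ["A"] * 5 + ["B"] * 5 + ["C"] * 5
print(permutation_anova(gs, pops))
```

With 999 permutations, the smallest attainable P-value is 0.001, which is why permutation-based P-values are reported at that resolution.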

To assess the extent of spatial autocorrelation among the GS of populations and landraces, we estimated Moran's I statistic for all possible ‘neighborhoods’ defined by distances between locations that varied from 15 km up to 1000 km, with 5-km increases between the two extremes. The significance of the I statistic was determined by 10 000 random permutations using spdep v. 0.5-56 (Bivand et al., 2012) in R statistical software.
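
As an illustration of the neighborhood-based statistic, the following sketch computes Moran's I with binary weights (1 for pairs of sites within a distance threshold, 0 otherwise) and a permutation P-value. The binary weighting scheme, the great-circle distance and all names are assumptions made for illustration; the weighting used with spdep in the study may differ.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def morans_i(values, coords, max_km):
    """Moran's I with weight 1 for distinct sites no farther apart than max_km."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    cross, w_sum = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j and haversine_km(*coords[i], *coords[j]) <= max_km:
                cross += dev[i] * dev[j]
                w_sum += 1
    denom = sum(d * d for d in dev)
    if w_sum == 0 or denom == 0:
        return float("nan")  # no neighbors, or no variation in the values
    return (n / w_sum) * (cross / denom)


def morans_i_test(values, coords, max_km, n_perm=999, seed=0):
    """One-sided permutation test: shuffle values across fixed locations."""
    rng = random.Random(seed)
    observed = morans_i(values, coords, max_km)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, coords, max_km) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical sites: two tight clusters with similar GS within each cluster.
sites = [(17.0, -99.5), (17.1, -99.5), (20.0, -104.0), (20.1, -104.0)]
gs = [1.0, 1.0, 1.2, 1.2]
print(morans_i(gs, sites, max_km=50))
```

Scanning thresholds, for example `range(15, 1001, 5)` km as in the text, then gives a profile of I against neighborhood size.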

Comparison of GS with bioclimatic variables

For the collection site of each natural population and landrace, we recorded passport data, including geographical coordinates and altitude. These data were used to obtain climatic information for each natural population and landrace from the current bioclimatic grid layers of WorldClim v. 1.4 (30 arc-seconds, c. 1 km resolution; Hijmans et al., 2005). The 19 bioclimatic variables represent summaries of means and variations in temperature and precipitation, and characterize the dimensions of climate considered to be particularly relevant in the determination of species distribution at a regional scale (Hijmans et al., 2005).

Given the history in the literature of comparing GS with elevation, we first focused on this relationship, employing a linear quantile regression (QR) model to explore the effect of elevation on percentiles of GS values (Knight & Ackerly, 2002; Knight et al., 2005). QR estimates the relationship between two variables for all portions (quantiles) of a probability distribution, rather than just the mean, as in standard regression analyses (Cade & Noon, 2003). As a result, QR models can provide insights into predictive relationships at the extremes of the distribution of a response variable. We estimated the QR functions from the 15th to the 95th quantile using the package ‘quantreg’ (Koenker, 2008) in R statistical software. Confidence intervals were calculated by 1000 bootstrap resamplings.
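
To make the idea of a quantile fit concrete, here is a brute-force sketch for a single predictor. It is not the algorithm used by ‘quantreg’; it relies on the property that, for a straight line, the minimum of the quantile (check) loss is attained by a line passing through two data points, so every pair of points can simply be tried. Names and data are hypothetical, and the cubic cost makes it suitable only for small samples.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Asymmetric absolute loss minimized by the tau-th quantile."""
    return residual * tau if residual >= 0 else residual * (tau - 1)


def quantile_line(x, y, tau):
    """Return (intercept, slope) of the line minimizing total check loss."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical pair does not define a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - (intercept + slope * xk), tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Hypothetical elevations (m) and GS ratios.
elevation = [50, 300, 970, 1330, 1800, 2200, 2620, 3073]
gs = [1.15, 1.08, 1.11, 1.12, 1.07, 1.10, 0.97, 0.99]
for tau in (0.15, 0.30, 0.45, 0.60, 0.75, 0.90):
    print(tau, quantile_line(elevation, gs, tau))
```

Comparing slopes across values of `tau` shows whether elevation constrains the upper or lower edge of the GS distribution differently from its center.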

To assess the relationships among GS and bioclimatic variables, we applied Spearman rank correlation tests and principal component analysis (PCA) using R. We also integrated the bioclimatic variables into a linear model based on partial least-squares regression (PLSR). The goal of PLSR was to assess the predictive value of the models relative to observed GS values. The PLSR method combines aspects of PCA and multiple regression, and is particularly useful when there are many predictor variables that may be nonindependent (Maestre, 2004). To determine the number of components to be included in the PLSR model, we performed a standard cross-validation procedure that also estimates the root-mean-square error of prediction (RMSEP; Maestre, 2004; Mevik & Wehrens, 2007). We retained only those components that contributed substantially to decreasing RMSEP; these components also happened to explain > 5% of the original variance (R2) in the response variable (Carrascal et al., 2009). The sum of the R2 of the retained components provided the total explanatory capacity of the PLSR model (Mevik & Wehrens, 2007). PLSR analysis was performed using the package ‘pls’ (Mevik & Wehrens, 2007) in the R statistical package.
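
For readers unfamiliar with the rank-based coefficient, a Spearman correlation is simply a Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below illustrates this; function names and data are hypothetical, and significance testing is omitted.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical elevations (m) and GS ratios.
elevation = [50, 300, 970, 1330, 1800, 2200, 2620, 3073]
gs = [1.15, 1.08, 1.11, 1.12, 1.07, 1.10, 0.97, 0.99]
print(spearman(elevation, gs))
```

Because only ranks enter the calculation, the coefficient is insensitive to the non-normality of the GS data noted earlier.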


Results

GS variation among and within populations

We measured GS in five plants from each of 21 natural populations of teosinte and 22 maize landraces, most of which were collected along altitudinal gradients in Mexico. With three technical replicates per plant, the dataset comprised 645 GS measurements. For each plant, the GS value is reported as the average of three technical replicates based on comparisons with the maize inbred line B73, which has a normalized value of 1.0 in this study.

GS values varied widely among individuals, from 0.948 for an individual from the landrace ‘Palomero Legitimo’ (MEXI211) to 1.299 for an individual from the landrace ‘Olote colorado’ (OAXA522; Fig. 2). The average GS value per plant was 1.111, but these averages varied between wild and cultivated samples; the average GS of teosintes (1.129) was significantly larger (P < 0.001, Kruskal–Wallis test) than the average GS of cultivated maize (1.095). The overall coefficient of variation was nearly identical between the wild and cultivated samples, however, at 6.07% and 6.06%, respectively.

The distribution of GS among and within populations was evaluated independently in the wild and cultivated groups by a PERMANOVA test. The major proportion of the variance was significantly (P < 0.001) distributed among populations in both groups (Table 2). This proportion was similar, but higher, in the cultivated (R2 = 84.4%) than in the wild (R2 = 73.2%) group. Conversely, the proportion of the variance within populations in the wild group (R2 = 23.8%; P < 0.001) was higher than that in the cultivated group (R2 = 13%; P < 0.001). The residual represented the error among the three technical replicates for each plant, and was c. 3% (R2) in both groups.

Table 2. Nonparametric multivariate analysis of variance (PERMANOVA) for teosinte and maize landrace samples
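| Sample | Category | df | SS | MS | F | R2 | P (> F) |
|---|---|---|---|---|---|---|---|
| Teosinte | Pop : Plant | 84 | 0.3593 | 0.0043 | 19.536 | 0.2376 | 0.001 |
| Teosinte | Residuals | 210 | 0.0460 | 0.0002 | | 0.0304 | |
| Teosinte | Total | 314 | 1.5120 | | | 1 | |
| Teosinte | Gradient : Pop | 19 | 1.0368 | 0.0546 | 39.588 | 0.6857 | 0.001 |
| Teosinte | Residuals | 294 | 0.4053 | 0.0014 | | 0.2680 | |
| Teosinte | Total | 314 | 1.5120 | | | 1 | |
| Maize landraces | Pop : Plant | 88 | 0.1920 | 0.0022 | 12.310 | 0.1297 | 0.001 |
| Maize landraces | Residuals | 220 | 0.0390 | 0.0002 | | 0.0263 | |
| Maize landraces | Total | 329 | 1.4800 | | | 1 | |
| Maize landraces | Gradient : Pop | 15 | 0.8461 | 0.0564 | 71.842 | 0.6674 | 0.001 |
| Maize landraces | Residuals | 238 | 0.1869 | 0.0008 | | 0.1474 | |
| Maize landraces | Total | 254 | 1.2677 | | | 1 | |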

One source of differences among wild populations could be a difference between ssp. mexicana and parviglumis. We therefore estimated the variance attributable to subspecies (Table S2). It was significant but quite low (R2 = 2.2%; P < 0.001), with a much larger proportion of the variance attributable to populations (R2 = 70.9%; P < 0.001). Given that mexicana and parviglumis do not differ substantially by this analysis, many of our analyses combine samples from the two teosinte taxa into a single dataset. However, we do accentuate taxonomic differences when appropriate.

We sampled separate elevation gradients for both the landrace and teosinte samples with the thought that these gradients provide a measure of pseudo-replication. We therefore contrasted the Balsas and Jalisco gradients for the teosinte samples and the two best sampled gradients (Balsas and Oaxaca) for the landraces. For both maize and teosinte, the variance caused by differences between the two gradients was significant (P < 0.001), but far lower for the wild samples (R2 = 4.6%) than for the cultivated samples (R2 = 18.5%). The highest component of variance again reflected differences among populations (R2 = 68.6%) or among landraces (R2 = 66.7%).

Finally, GS variation showed some spatial structure, because population GS values were spatially autocorrelated for both cultivated and wild samples (Fig. S1). Moran's I was significant for neighborhoods of 40 km for all GS data, 65 km for cultivated data alone and 75 km for teosinte data, with the I statistic remaining positive up to neighborhood sizes of 120 km (Fig. S1). In sum, our GS estimates differ significantly among populations and among landraces, with lower GS variances attributable to either gradients or individuals within populations.

The relationship between GS and elevation

We have established that populations and landraces vary significantly in GS. But does this variation follow a discernible pattern with regard to environment? We first approached this question with an analysis of elevation, both because elevation has been used as a proxy for a suite of environmental variables and because its relationship with GS has long been investigated (Rayburn & Auger, 1990; Poggio et al., 1998; Knight & Ackerly, 2002; Knight et al., 2005).

We examined elevation by applying linear QR to both cultivated and wild samples (Fig. 3). In the cultivated group, GS and elevation showed a significant negative relationship (P < 0.001) for all six quantiles considered (0.15, 0.30, 0.45, 0.60, 0.75, 0.90). This negative relationship was consistent across the Oaxaca and Balsas gradients and thus appears to be a general feature of our sample of maize landraces. It should also be noted that the slope of the 95% quantile was noticeably more negative than that of other quantiles (Fig. 3).

Figure 3.

Scatter plot and quantile regression showing the relationship between genome size (GS) and elevation for the cultivated maize landraces and wild teosinte groups of samples. The dashed gray lines correspond to the quantiles (0.15, 0.30, 0.45, 0.60, 0.75, 0.90), the median fit is in solid blue and the least-squares estimate of the conditional mean function is the dashed (red) line.

By contrast, the pattern for the wild samples, taken as a whole, was not clear. The lower quantiles (0.15 and 0.30) exhibited a significant positive relationship (P < 0.05) between elevation and GS, but the upper quantiles (0.45, 0.60, 0.75, 0.90) showed nonsignificant (P > 0.05) correlations. When we partitioned the samples by gradient, the picture became even murkier (Fig. 3). There was a consistent, significant and positive relationship at all quantiles for the Balsas gradient, but the exact opposite held for the Jalisco gradient. When individual subspecies were considered, the positive correlation held in both subspecies for the 50% quantile, but the slopes for all other quantiles differed in sign between subspecies (Fig. S2). Thus, the correlation between GS and elevation varied as a function of the teosinte sample.

Correlations between GS and bioclimatic variables

In theory, elevation is a proxy for several correlated variables, and hence we investigated correlations between GS and bioclimatic variables directly. Several geographic and climatic variables correlated significantly with GS based on Spearman rank correlations (Table 3). In maize, for example, the approach recapitulated the negative correlation with elevation. Several additional correlations were significant for the entire landrace sample, including variables representing longitude, temperature (BIO5 and BIO6) and precipitation (e.g. BIO13). Significance based on the total maize sample seems to be primarily a function of the Balsas gradient, because correlations were generally not significant for the Oaxaca gradient, even though the two gradients had a similar number of landraces sampled (n = 8 vs 9 populations for Balsas vs Oaxaca). The only correlations that were both shared and significant between gradients were those with elevation (Table 3; Fig. 3) and seasonal precipitation (BIO15).

Table 3. Environmental variables, their Spearman correlation with genome size (GS) and their coefficients in the linear model inferred by partial least-squares regression (PLSR; results are shown for gradients in maize landraces and teosintes separately)
| Environmental variable^a | Landraces: R (All)^b | Landraces: R (Balsas) | Landraces: R (Oaxaca) | Landraces: PLSR (All)^c | Teosintes: R (All) | Teosintes: R (Balsas) | Teosintes: R (Jalisco) | Teosintes: PLSR (All) |
|---|---|---|---|---|---|---|---|---|
| BIO1 = Annual Mean Temperature | 0.235* | 0.430** | 0.035 | 0.051* | −0.127 | −0.541*** | 0.414** | 0.015 |
| BIO2 = Mean Diurnal Range (mean of monthly (max temp − min temp)) | −0.012 | 0.103 | 0.059 | 0.182 | −0.151 | −0.326* | −0.466*** | −0.474*** |
| BIO3 = Isothermality (BIO2/BIO7) (× 100) | −0.028 | 0.104 | −0.308 | −0.053 | −0.238* | −0.764*** | 0.568*** | −0.593*** |
| BIO4 = Temperature Seasonality (standard deviation × 100) | −0.147 | 0.073 | 0.024 | −0.068 | 0.284** | 0.652*** | −0.355** | 0.186** |
| BIO5 = Max Temperature of Warmest Month | 0.250** | 0.558*** | −0.070 | 0.113** | −0.098 | −0.345* | 0.162 | 0.112 |
| BIO6 = Min Temperature of Coldest Month | 0.297** | 0.400** | 0.262 | 0.032 | −0.100 | −0.471*** | 0.497*** | 0.137 |
| BIO7 = Temperature Annual Range | −0.087 | 0.084 | −0.089 | 0.117 | 0.164 | 0.428** | −0.486*** | −0.108 |
| BIO8 = Mean Temperature of Wettest Quarter | 0.228* | 0.406** | 0.035 | 0.030 | −0.120 | −0.542*** | 0.363** | −0.002 |
| BIO9 = Mean Temperature of Driest Quarter | 0.253** | 0.465** | 0.035 | 0.076* | −0.106 | −0.397** | 0.345** | −0.209 |
| BIO10 = Mean Temperature of Warmest Quarter | 0.228* | 0.430** | 0.164 | 0.053 | −0.120 | −0.493*** | 0.326* | 0.081 |
| BIO11 = Mean Temperature of Coldest Quarter | 0.253** | 0.430** | 0.035 | 0.061** | −0.103 | −0.537*** | 0.476*** | 0.043 |
| BIO12 = Annual Precipitation | 0.253** | 0.400** | 0.135 | −0.055 | 0.071 | −0.338* | 0.476*** | 0.211* |
| BIO13 = Precipitation of Wettest Month | 0.325*** | 0.395** | 0.199 | 0.059 | 0.014 | −0.406** | 0.607*** | −0.033 |
| BIO14 = Precipitation of Driest Month | −0.052 | −0.366* | 0.262 | 0.328** | 0.194* | 0.497*** | −0.738*** | −0.263 |
| BIO15 = Precipitation Seasonality | 0.203* | 0.446** | 0.369* | 0.274*** | −0.034 | −0.278 | 0.454*** | −0.121 |
| BIO16 = Precipitation of Wettest Quarter | 0.248** | 0.395** | 0.199 | −0.023 | 0.065 | −0.425** | 0.509*** | 0.212*** |
| BIO17 = Precipitation of Driest Quarter | 0.016 | −0.311* | 0.313* | 0.318** | 0.357*** | 0.498*** | −0.068 | 0.400* |
| BIO18 = Precipitation of Warmest Quarter | 0.247** | 0.063 | −0.156 | −0.328* | 0.105 | −0.509*** | 0.589*** | 0.262 |
| BIO19 = Precipitation of Coldest Quarter | 0.014 | −0.174 | −0.042 | 0.100 | 0.326*** | 0.126 | 0.618*** | −0.023 |

a. Bioclimatic variables defined as: BIO1, Annual Mean Temperature; BIO2, Mean Diurnal Range (mean of monthly (max temp − min temp)); BIO3, Isothermality (BIO2/BIO7) (× 100); BIO4, Temperature Seasonality (standard deviation × 100); BIO5, Max Temperature of Warmest Month; BIO6, Min Temperature of Coldest Month; BIO7, Temperature Annual Range (BIO5–BIO6); BIO8, Mean Temperature of Wettest Quarter; BIO9, Mean Temperature of Driest Quarter; BIO10, Mean Temperature of Warmest Quarter; BIO11, Mean Temperature of Coldest Quarter; BIO12, Annual Precipitation; BIO13, Precipitation of Wettest Month; BIO14, Precipitation of Driest Month; BIO15, Precipitation Seasonality (coefficient of variation); BIO16, Precipitation of Wettest Quarter; BIO17, Precipitation of Driest Quarter; BIO18, Precipitation of Warmest Quarter; BIO19, Precipitation of Coldest Quarter.

b. Spearman correlation coefficient R for the sample given in parentheses: the entire sample (All) or a particular gradient (Balsas, Oaxaca, Jalisco). Significance values are denoted by asterisks: *, P < 0.05; **, P < 0.01; ***, P < 0.001.

c. The PLSR coefficient based on the total sample. Significance levels are denoted by asterisks: *, P < 0.01; **, P < 0.001; ***, P < 0.0001.

The teosintes exhibited different trends from maize (Table 3). For the complete sample, only a few variables were strongly correlated with GS – i.e. longitude and precipitation in the driest and coldest quarters (BIO17 and BIO19). Otherwise, the results depended heavily on sample partitions. For example, for the two gradients, the patterns of correlations with GS were significant, but in opposite directions, for several variables (e.g. BIO3, BIO6, BIO11, BIO14, BIO16 and BIO18; Table 3). When the samples were partitioned by subspecies (Table S3), the correlations became slightly more regular, with the strongest and most consistent correlations between GS and longitude, temperature seasonality (BIO4) and levels of precipitation in the driest and coldest quarters (BIO17 and BIO19).

Pairwise correlations suffer from several weaknesses (see 'Discussion'), and so we turned to PLSR, which is a technique to build a linear model that decomposes the data into (in this case) orthogonal components. Once the model is estimated, it can then be used to predict the dependent variable, GS, from potentially nonindependent predictor variables. We applied PLSR separately to cultivated and wild GS measurements with the 22 predictor variables (Table 3). For cultivated data, we included GS data from the two main gradients sampled (Balsas and Oaxaca); for the teosinte data, both gradients were included. PLSR identified four and five components for cultivated and wild samples, respectively, each of which explained > 5% of the variance in GS; these components were retained in the PLSR model. Once the PLSR model was determined, it was used to demonstrate the fit of the predicted values to observed values (Fig. 4). Together, the retained components explained 78.6% and 62.7% of the observed variance in GS for cultivated and wild samples, respectively. In other words, GS can be predicted partially on the basis of climatic and geographic variables.
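
The component-wise decomposition can be illustrated with a bare-bones PLS regression for a single response (the NIPALS form of PLS1). This is only a sketch of the general technique, not the ‘pls’ package: it omits predictor scaling, cross-validation and the jackknife, and the function names and data are hypothetical. Because successive score vectors are orthogonal, the R2 contributions of the components add up to the total R2 of the fit.

```python
def center(column):
    """Subtract the mean from every entry of a column."""
    m = sum(column) / len(column)
    return [v - m for v in column]


def pls1_r2(X, y, n_components):
    """Fraction of the variance in y explained by each PLS component.

    X is a list of rows (samples) of predictor values; y is the response.
    """
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    resid = center(y)
    ss_total = sum(v * v for v in resid)
    r2 = []
    for _ in range(n_components):
        # Weights: covariance of each (deflated) predictor with the residual.
        w = [sum(c[i] * resid[i] for i in range(n)) for c in cols]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left for the predictors to explain
        w = [v / norm for v in w]
        # Scores: projection of the samples onto the weight vector.
        t = [sum(w[j] * cols[j][i] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        q = sum(t[i] * resid[i] for i in range(n)) / tt
        loadings = [sum(c[i] * t[i] for i in range(n)) / tt for c in cols]
        # Deflate predictors and response before extracting the next component.
        cols = [[c[i] - t[i] * loadings[j] for i in range(n)]
                for j, c in enumerate(cols)]
        resid = [resid[i] - q * t[i] for i in range(n)]
        r2.append(q * q * tt / ss_total)
    return r2


# Hypothetical data: two correlated predictors and a response.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [2.0, 2.5, 5.0, 5.5, 8.0]
per_component = pls1_r2(X, y, 2)
kept = [v for v in per_component if v > 0.05]  # the > 5% rule from the text
print(per_component, sum(kept))
```

In this toy case the response is an exact linear function of the two predictors, so the two components together account for essentially all of the variance, with the first component carrying most of it.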

Figure 4.

Comparison between the observed and predicted genome size (GS) for maize (cultivated) and teosintes (wild), based on the linear model inferred using partial least-squares regression (PLSR). The R² of the Spearman correlation for both fits is also provided.

PLSR is usually implemented to build a regression model, but it can also provide insights into the weight and significance of predictor variables to the model. Table 3 presents the coefficients (or loadings) of each predictor variable, as summed across orthogonal components. The magnitude of each coefficient provides insight into its importance in the model; the significance of coefficients was calculated by a jackknife procedure. The weights of coefficients yielded three unique insights. First, in maize, the suite of most important variables is related to geography (elevation, longitude and, to a lesser extent, latitude). Because we were puzzled by the relationship between geographic measures and GS, we reapplied the PLSR model without latitude and longitude (Table S4), with a commensurate reduction in predictive value (Fig. S3). Second, the most highly weighted predictors in teosintes are related to temperature and precipitation. Finally, and perhaps most surprisingly, the variables that predict GS have shifted dramatically between cultivated landraces in Mexico and wild teosinte populations, even though the landrace and wild samples originate from similar regions and sampling strategies.
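The jackknife step can be sketched generically. The details of the procedure used for Table 3 are not given in this section, so the example below is only an assumption-laden illustration: `jackknife` and `ols_slopes` are hypothetical names, the data are simulated, and ordinary least squares stands in for the PLSR fit (for PLSR, the estimator passed in would refit the PLSR model on each leave-one-out sample).

```python
import numpy as np

def jackknife(estimator, X, y):
    """Leave-one-out jackknife: full-sample estimate and standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    full = np.asarray(estimator(X, y), dtype=float)
    keep = ~np.eye(n, dtype=bool)            # row k drops observation k
    loo = np.array([estimator(X[k], y[k]) for k in keep])
    # Jackknife variance: (n - 1) / n times the sum of squared deviations
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean(axis=0)) ** 2, axis=0))
    return full, se

def ols_slopes(X, y):
    """Stand-in estimator (assumption): least-squares slopes, intercept dropped."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

# Simulated data in which only the first predictor affects the response
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

coef, se = jackknife(ols_slopes, X, y)
ratio = coef / se                            # large |ratio| suggests a stable coefficient
```

A coefficient that changes little when single observations are removed has a small jackknife standard error and hence a large ratio; this is the sense in which the resampling attaches significance to the weight of a predictor.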

PCA reveals distinct groups of teosintes

We conducted a PCA to group populations on the basis of the complete dataset – that is, GS data plus the 22 predictor variables for both maize and teosintes (Figs 5, S4). The PCA approach is helpful because variables may be represented diagrammatically as vectors of particular magnitude and direction in the space defined by principal components. For example, it is possible to infer that variables related to temperature seasonality (BIO1, BIO5, BIO8, BIO9, BIO10, BIO11 and BIO15) have similar (correlated) effects on the grouping of populations because their associated vectors have similar direction (Fig. 5).
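The vector interpretation can be illustrated with a small sketch. It assumes that variables are standardized before decomposition, which is not stated in this section, and uses simulated variables; the function name `pca` and the four synthetic columns are hypothetical. Each variable is represented by its loadings on PC1 and PC2, and two variables with similar effects have loading vectors pointing in nearly the same direction (cosine close to 1).

```python
import numpy as np

def pca(data):
    """PCA of standardized variables via singular value decomposition."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)      # fraction of total variance per component
    scores = u * s                           # coordinates of populations
    loadings = vt.T                          # one row per variable, one column per PC
    return scores, loadings, explained

# Simulated stand-in data (assumption): two temperature-like variables that
# track the same underlying factor, one precipitation-like variable and GS
rng = np.random.default_rng(2)
n = 43
temp = rng.normal(size=n)
precip = rng.normal(size=n)
data = np.column_stack([
    temp + 0.1 * rng.normal(size=n),
    temp + 0.1 * rng.normal(size=n),
    precip + 0.1 * rng.normal(size=n),
    0.5 * temp - 0.5 * precip + 0.5 * rng.normal(size=n),
])

scores, loadings, explained = pca(data)
v0, v1 = loadings[0, :2], loadings[1, :2]    # variable vectors in the PC1-PC2 plane
cosine = v0 @ v1 / (np.linalg.norm(v0) * np.linalg.norm(v1))
```

The entries of `explained` correspond to the percentages of variance reported for each principal component, and `scores[:, :2]` gives the positions of populations in the plane of the first two components.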

Figure 5.

Principal component analysis (PCA) including the 22 cultivated maize and 21 wild populations of teosinte included in this study. The gray arrows represent the different variables considered in the model. PC1 and PC2 explain 12.7% and 3.4%, respectively, of the variance.

In our analysis, the first principal component (PC1) explained 12.7% of the total variance, and the second component (PC2) explained 3.4% of the variance. Together, they grouped particular subsets of populations. For example, the Balsas teosinte populations formed a distinct group relative to the remaining teosintes. By contrast, the PCA produced no obvious grouping of landraces, but instead illustrated the remarkable ecological flexibility of cultivated maize, because landraces were distributed across the space defined by the first two principal components. We note, however, that most of the landraces from the Jalisco gradient grouped near the wild populations collected in the same area, as is expected if bioclimatic variables dictate landrace usage.


GS variation within species

Our study was designed to investigate variation in GS, the relationship of GS with climatic variables and, ultimately, the evolutionary processes that shape GS variation. With regard to the first, there is still debate as to whether GS varies within species and, if so, the extent of that variation (Knight et al., 2005). In order to overcome technical difficulties that may create false variation among individuals (Nardon et al., 2003; Biemont, 2008), we used plants grown and harvested under similar conditions, estimated GS by the same technique on all samples and employed extensive technical replication.

Our analyses clearly indicate that GS varies both among and within subspecies of Z. mays. Among subspecies, our sample of maize landraces has a moderately (c. 3%), but significantly, lower average GS than our sample of teosintes. The GS of elite inbred lines of maize is smaller still (Laurie & Bennett, 1985), suggesting that average GSs have decreased through the processes of domestication and subsequent crop improvement, perhaps because of strong selection (Rayburn et al., 1994). Average GS also varies significantly between the wild parviglumis and mexicana subspecies, but this variation pales in comparison with the amount of variation among populations. Indeed, among-population variation is the largest component of GS variation, larger even than differences between subspecies or altitudinal gradients (Table 2).

Given GS variation, it is interesting to speculate on the genomic components that contribute to size variation – that is, features such as genes, transposable elements (TEs), centromeric repeats, telomeres, rDNAs and heterochromatic knobs. There is already substantial evidence that maize and teosinte genomes vary both in gene content (Swanson-Wagner et al., 2010) and TE complement (Wang & Dooner, 2006). Maize accessions are also known for varying in their number of heterochromatic knobs (McClintock, 1978). The measurement of knobs and TE repeats has been enhanced by the use of high-throughput sequencing (HTS; Tenaillon et al., 2011). For example, a study of 38 teosinte and maize inbred lines used HTS data to show that intraspecies differences in GSs are driven primarily by the amount of knob repeats (Chia et al., 2012). Given that knobs constitute only 8% of the maize genome (Ananiev et al., 1998), whereas TEs represent c. 85% (Schnable et al., 2009), this observation is somewhat surprising. Even more surprising is the observation that the overall TE content of these lines actually correlates negatively with GS (Chia et al., 2012).

These observations seem to imply some sort of genomic ‘trade-off’ between heterochromatic knobs and TEs, as if they balance actively to maintain limits on GS. Although it is difficult to imagine a precise molecular mechanism for this trade-off, there are some hints that optimal GSs may indeed exist. For example, it has long been suggested that large genomes are disadvantageous (Knight et al., 2005) and stabilizing selection has been shown to act on GS (Smarda et al., 2010) over short time scales (Smarda et al., 2007). There are hints of similar GS limitations in maize data (Poggio et al., 1998). Take, for example, our QR analysis between GS and elevation (Fig. 3). Although there is a clear and consistent negative association for both maize gradients, the higher quantiles have lower slopes, as might be expected if there are limits to GS. Moreover, the visual distribution of points suggests a quadratic relationship between GS and elevation; that is, the curvature of points suggests that GS reaches a maximum size at moderate elevations, with upper and lower limits (e.g. Knight & Ackerly, 2002). We tested a quadratic QR on these data, but the quadratic fits no better than the linear relationship (data not shown). Nonetheless, the pattern of observations is superficially consistent with a limit on maize GS as a function of elevation or variables correlated with elevation. We note that such limits on GS in maize have been argued previously based on the abundance of B chromosomes at different altitudes (Lia et al., 2007).

By contrast, there is no consistent evidence for constraints on GS in teosintes on the basis of QRs with elevation, whether the wild samples are treated as a whole (Fig. 3), in different gradients (Fig. 3) or in different subspecies (Fig. S2).

Climatic variables and GS

A related question is whether GS varies predictably as a function of climatic factors, suggesting ecological constraints on GS. Our study has uncovered a wealth of pairwise correlations between GS and bioclimatic variables (Table 3), but the pairwise correlations themselves are inconsistent among subspecies and gradients (Tables 3, S3), making it difficult to interpret their meaning and importance. Pairwise correlations may be inconsistent for a variety of reasons. One reason is sampling, both in terms of the depth of sampling and the climatic range over sampled locations. Although we have sampled extensively, our sampling could very well be too sparse to be able to identify consistent patterns between gradients or subspecies. Another sampling issue is climatic range; without sufficient variation in climate, it may be difficult to identify correlations between individual variables and GS. Our sampling of altitudinal gradients was nonetheless intended to cover climatic ranges within the limited distribution of the annual teosintes.

Another potential cause for inconsistencies in pairwise correlations may be that there are real but multifaceted relationships between environmental factors and GSs. Accordingly, we applied PLSR to our data. PLSR is designed to model relationships when there are many potentially nonindependent predictor variables and a limited number of observations (Mevik & Wehrens, 2007). The resulting models are predictive, explaining 78.6% and 62.7% of GS variation for landraces and teosintes, respectively (Fig. 4). Although PLSR is not specifically designed to identify the contribution of particular variables to the model, it does provide insight into the suites of variables that contribute to GS (Table 3). For example, the PLSR indicates that several variables are predictors of GS in the teosintes, including temperature-related variables (BIO2, BIO3 and BIO4) and precipitation in the wettest quarter (BIO16). Both BIO4 and BIO16 are also determinants of the climatic distribution of parviglumis and mexicana (Hufford et al., 2012).

In addition to elevation, it is notable that PLSR models also detect latitude and longitude as strong predictors of GS in maize (Table 3). We assume that these geographic coordinates somehow encapsulate a complex combination of climatic (and perhaps even sociological) factors that are otherwise unrepresented by the bioclimatic data. For example, higher latitudes may correlate with cooler growing seasons, as is often assumed for elevation. In this particular study, longitude is related to distance from the coast, and thus may capture aspects of climate (such as humidity) or even historical factors that vary from coastal areas to the highlands.

Does selection drive variation in GS?

We ultimately seek to determine whether GS variation is influenced by natural selection. To that end, as already noted, we find that within-population variation in GS is small relative to variation among populations, and that such variation is spatially autocorrelated. Although this pattern does not itself implicate the action of natural selection (population divergence can also be a function of genetic drift and isolated populations), it is a necessary precursor to infer that natural selection affects GS.

Although stabilizing and directional selection have been shown to affect GS in experimental settings (Rayburn et al., 1994; Cullis, 2005; Smarda et al., 2010), the appropriate null hypothesis for a survey like ours is that natural selection does not affect GS. This null hypothesis is difficult to reject, but we nonetheless suggest that the predictive performance of the PLSR models, which demonstrate dependences between GS and both geographical (elevation, longitude) and environmental (temperature and precipitation) factors, tends to discount it (Fig. 4, Table 3).

We also designed our sampling under a secondary hypothesis: if natural selection has shaped GS, we assume a priori that there should be consistent relationships between GS and environmental factors across taxa and gradients. This secondary hypothesis is not supported. For example, the environmental factors that correlate most noticeably with GS in teosintes (e.g. BIO2, BIO3, BIO4, BIO16; Table 3) do not correlate consistently with GS in maize, and vice versa (Table 3, Fig. 3).

We are left to conclude that, if GS is affected by selection, the relationship is defined by at least two salient features. The first is complexity, born of a shifting mix of environmental factors and phenotypes. The phenotypes that may affect GS include growth rate, flowering time, flower size, seed mass, leaf size and photosynthetic rate (e.g. Rayburn et al., 1994; Meagher & Vassiliadis, 2005; Beaulieu et al., 2007a,b, 2008). Each phenotype may be under different selection pressures in different biotic and abiotic environments; in turn, selection on multiple phenotypes may affect GS in synergistic or opposing fashions. It should also be noted that chromosomal knobs may be under selection through the effects of genetic factors, such as Abnormal chromosome 10 (Ab10), which may be less common in regions with short growing seasons. Ab10 is significant because it is associated with meiotic drive for increased knob heterochromatin (Buckler et al., 1999). The Ab10-carrying genotype may decrease with altitude because of fitness costs associated with genes in tight linkage (Buckler et al., 1999), generating an indirect effect on both knob frequency and GS.

The second is that shifts in the relationship can be very rapid. We have shown, for example, that GS in our sample of maize, which was domesticated c. 9000 yr ago, exhibits different relationships to both climate and geographic factors than does GS in teosintes (Fig. 3). It is possible that strong artificial selection on maize has fundamentally exaggerated, altered or skewed relationships between GS and phenotypes. This point may be best illustrated by our PCA (Fig. 5): given that maize was domesticated from parviglumis in the lowlands of the Balsas region (Matsuoka et al., 2002; van Heerwaarden et al., 2011), we expect that the first domesticates of maize would have clustered with the Balsas teosinte group. Thereafter, maize was quickly distributed to the Mexican highlands and beyond (Fukunaga et al., 2005), across a range of climates (Hufford et al., 2012), so that maize samples of different GSs are now distributed throughout the climatic space defined by the PCA (Figs 5, S4). A final note is that, if our inferences are correct – that is, that the relationship between GS and selection is defined by complexity and rapid shifts – it may not be surprising that global analyses, like those of Whitney et al. (2010), fail to find convincing evidence of a strong effect of natural selection despite the fact that experimental analyses clearly do (Rayburn et al., 1994; Cullis, 2005; Smarda et al., 2010).


We would like to thank S. Taba at CIMMYT and J.-P. Vielle-Calzada for providing landrace seeds and information. We would like to acknowledge the support of a UC-MEXUS grant #49298 to B.S.G. and L.E.E., as well as the Agence Nationale de la Recherche, ANR-12-ADAP-002, to M.I.T. C.M.D. was supported by a fellowship funded by the project P09-AGR-5010 of the Consejería Economía, Innovación Ciencia y Empleo de la Junta de Andalucía, Spain. E.M. was supported by the Balsells Fellowship at UC Irvine and by NIH Training Grant P50GM76516.