Over 3.5 million expressed sequence tags from the major cereal taxa were used to electronically mine over 176 000 putative single nucleotide polymorphisms (SNPs). The density, distribution and degree of linkage between these SNPs were compared among the different taxa. The frequency of sequence polymorphism was lowest in diploid taxa (rice, barley and sorghum), intermediate in tetraploid maize and highest in allohexaploid wheat and octoploid sugarcane. SNPs were further categorized as either intravarietal (differences between gene family members and homoeologues) or varietal (differences between two varieties), and as either co-segregating or non-co-segregating with neighbouring polymorphisms. Varietal co-segregating SNPs represent the best candidates for molecular markers as they show variation between varieties and have a high probability of being validated, as sequencing errors are unlikely to co-segregate with one another. This elite class of SNPs was most abundant in barley and least abundant in wheat and rice. Despite the large number of observed sequence polymorphisms in allohexaploid wheat, only a fraction of those available are likely to make good molecular markers. In addition, we found that rice SNPs up to 10 kb apart were in linkage disequilibrium (LD), but that high levels of LD attributable to population structure confounded the tracking of LD over greater distances.
Single nucleotide polymorphisms (SNPs) are the most common form of sequence variation between individuals within the same species (reviewed in Brookes, 1999). Recently, the exponential increase in the number of available expressed sequence tags (ESTs) has suggested that SNP-based marker systems are likely, in the near future, to become the marker system of choice for plant breeders and academics alike. As such, there has been a great deal of interest in identifying SNPs in the major cereal taxa, including wheat (Somers et al., 2003), barley (Russell et al., 2004), maize (Ching et al., 2002) and rice (Feltus et al., 2004). A particular challenge for SNP discovery and exploitation in crops is that most, if not all, will have been subjected to intense artificial selection over many generations, which will have inevitably reduced the number of SNPs relative to those of ancient wild populations. The methods used to identify SNPs in plants have been reviewed recently in Edwards et al. (2008), but briefly fall into the categories of targeted discovery effort and in silico mining of SNPs from existing molecular datasets. The latter approach is attractive for many crops because of the availability of significant public EST datasets. Where sequence trace files are available, PolyBayes software (Marth et al., 1999) can make use of the quality scores to aid in the automated identification of SNPs. Where trace files are not available (as is most often the case), SNPs can be mined directly from aligned EST sequences using a combination of redundancy and haplotype information (Barker et al., 2003; Tang et al., 2006).
The frequency of SNPs within a population reflects a number of factors, including forward/reverse mutation rates, the number of generations and the number of progenitors. In addition, both the number of SNPs and their distribution are influenced by any selection pressures which might have been applied to the population, including any bottlenecks through which the species has passed. Hence, although mutations probably occur at similar frequencies, the number present within a species or specific population can show considerable variation. In addition, several studies have shown that, as a result of highly specific selection pressure, for instance through selective sweeps, SNP frequencies can vary within different regions of the genome (Edwards et al., 2008).
Within plants, and specifically within crops, the task of identifying useful SNPs is made more problematical because of the occurrence of polyploidy, with the result that most SNPs identified through EST-based screens of polyploid species are intravarietal and not varietal, and thus represent sequence variants between homoeologous genomes and not between homologous genomes from the different varieties. This inability to detect varietal SNPs currently limits the use of SNPs for large association genetics-based studies in allohexaploid wheat. To examine the occurrence of useful varietal SNPs in the wheat genome and other cereal taxa, and to understand the factors that have affected their frequency and distribution, we employed a bioinformatics approach to electronically mine SNPs from several cereal EST databases. The density, distribution and degree of linkage of these SNPs within both diploid and polyploid cereals were examined with a view to comparing and contrasting the SNP frequencies within wheat and its related cereal species.
Results and discussion
Between 189 000 and 1.2 million cereal EST sequences were analysed for each taxon under study (Table 1). These sequences were assembled into contigs, and those with eight or more members were analysed further for putative SNPs. For the purposes of this analysis, SNPs were called as present in the aligned ESTs when each allele was represented in at least two independent ESTs and when the minor allele frequency was 10% or higher. The percentage of genes analysed was estimated on the basis of each taxon having a haploid genome content of 41 000 genes as for rice (Sterck et al., 2007) and assuming that homoeologues cluster together in a single contig where present. On the basis of these assumptions, the percentage coverage ranged from 10% of genes for sorghum to over 50% for maize. The average sequencing depth was between 20 (sorghum) and 30 (barley) sequence reads per contig, with the average number of named varieties per contig ranging from 3.2 to 6.7 for sorghum and wheat, respectively. There were too few named varieties in Saccharum to estimate this value accurately.
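The calling rule described above can be illustrated with a minimal Python sketch. It is not the custom script used in the study: the function name `call_snps`, the treatment of gaps and ambiguous bases as missing data, and the toy alignment are all assumptions made for illustration. A column is reported when at least two alleles are each seen in two or more reads and each reaches a frequency of 10% or more.

```python
from collections import Counter

def call_snps(alignment, min_copies=2, min_maf=0.10):
    """Return 0-based alignment columns that look like putative SNPs.

    alignment: list of equal-length strings, one per aligned EST.
    Characters other than A, C, G, T (gaps, N) are ignored, which is
    an assumption of this sketch rather than a documented rule.
    """
    snps = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in "ACGT")
        depth = sum(counts.values())
        if depth == 0:
            continue
        # An allele counts only if it is seen often enough in absolute
        # terms (min_copies) and relative terms (min_maf).
        passing = [base for base, n in counts.items()
                   if n >= min_copies and n / depth >= min_maf]
        if len(passing) >= 2:
            snps.append(col)
    return snps

# Toy contig of eight reads, the minimum depth analysed in the study.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGTACGT",
    "ACGTACGT",
    "ACGTACGT",
    "ACTTACGT",
    "ACTTACGT",
    "ACGTACGA",  # single-read difference in the last column
]
print(call_snps(reads))
```

In this toy example the third column is reported because the T allele occurs in two of eight reads, whereas the last column is not, because its variant base is seen only once and is therefore indistinguishable from a sequencing error.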
Table 1. Summary of cereal single nucleotide polymorphism (SNP) analysis (updated and redrawn from Edwards et al., 2008)
Column headings (taxon, with subfamily and tribe): Wheat (Pooideae, Triticeae); Barley (Pooideae, Triticeae); Maize (Panicoideae, Andropogoneae); Saccharum (Panicoideae, Andropogoneae); Sorghum (Panicoideae, Andropogoneae)
Footnote: 5′ expressed sequence tags (ESTs) as a percentage of all ESTs (5′ and 3′) which could be orientated by blast match to the non-redundant protein sequence database at the National Center for Biotechnology Information (NCBI).
Footnote: Assuming 41 000 genes in each taxon and that homoeologues, where present, are assembled into a single contig.
Wheat was found to have the highest SNP frequency at 16.5 SNPs per kilobase (kb), and rice had the lowest frequency at 4.2 SNPs/kb. These values are derived by dividing the number of SNP sites by the total length of sequence in SNP-containing contigs. Slightly lower estimates were obtained when those contigs with no SNPs at all were included. It is obvious but worth stating that, as these values are derived from ESTs, they relate exclusively to the transcribed regions of the taxa studied, including some untranslated regions. For all taxa, we found that the SNP frequency was elevated in putative 5′ and 3′ untranslated regions relative to coding regions, as defined by a blast search of the contig consensus to known proteins (Figure 1).
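As a worked illustration of how these densities relate to the "1 SNP in N bp" form used later in the text, the following Python sketch converts between the two. The function names are hypothetical; the arithmetic is simply the division described above (number of SNP sites over total sequence length in SNP-containing contigs).

```python
def snps_per_kb(n_snps, total_bp):
    """SNP density: SNP sites divided by total sequence length, per kilobase."""
    return 1000.0 * n_snps / total_bp

def bp_per_snp(freq_per_kb):
    """Average spacing between SNPs, in base pairs, for a given density."""
    return 1000.0 / freq_per_kb

# Densities reported in the text for the two extremes.
for taxon, freq in [("wheat", 16.5), ("rice", 4.2)]:
    print(f"{taxon}: {freq} SNPs/kb = 1 SNP per {bp_per_snp(freq):.0f} bp")
```

This reproduces the figures of approximately 1 in 61 bp for wheat and 1 in 238 bp for rice quoted in the comparison with published frequencies.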
In total, we found 190 453 putative SNPs in the taxa studied. These are available to browse and search (by blast and text) at http://www.cerealsdb.uk.net. We have made many improvements to the display of SNPs compared with the previously published version of autoSNP (Barker et al., 2003), including the presentation of a protein-based view of SNP changes, colour coding and sorting of variety names, FASTA format versions of the aligned contig sequences, rice genome hits for the current contig and a simplified overview of SNP haplotypes for each report. To our knowledge, this is the largest collection of taxonomically diverse cereal expressed SNPs available.
The data presented in Table 1 show a clear link between SNP frequency and ploidy level in the cereal taxa studied. The extant polyploids wheat (allohexaploid) and Saccharum (octoploid or higher) have the highest frequencies at 16.5 and 10.6 SNPs/kb, respectively, closely followed by the ancient tetraploid maize with 8.9 SNPs/kb. The diploid taxa barley, sorghum and rice have significantly lower SNP frequencies at 6.3, 5.5 and 4.2 SNPs/kb, respectively. The SNP-finding algorithm used will fail to detect SNPs in which the minor allele frequency is less than 10% or the minor allele is represented only once in a contig, and so, in this sense, it may be considered as an underestimation of SNP frequency caused by ‘false-negative’ errors. Conversely, a proportion of the putative SNPs discovered will actually be sequence errors (false-positive errors). The effect of these two types of error on our estimate of SNP frequency will, to some extent, cancel out. In an effort to place a lower boundary on the SNP frequency by minimizing false-positive errors, SNPs were categorized as either co-segregating or non-co-segregating. Co-segregating SNPs were defined as those in which there was a significant association between a given SNP position and other SNP positions within the same sequence alignment. The degree of association was tested by randomization tests, with the expectation that real SNPs will generally co-segregate, whereas sequencing errors will not. A P value was calculated for each pair-wise combination of SNPs within an alignment by permutation, and those SNPs for which the average P value was less than 0.05 were deemed to be co-segregating. SNPs may further be defined as intravarietal (occurring between members of a gene family or between homoeologous chromosomes within a polyploid crop variety) and varietal (occurring between different varieties of the taxa). 
For marker-assisted selection, breeders will generally require varietal SNPs to act as markers for the variety-specific traits under selection. Intravarietal SNPs, which occur between gene family members or homoeologous genes, will not generally be useful for breeding purposes because most, if not all, varieties will be scored as monomorphic for such markers; however, they may be useful in discriminating between gene family members or homoeologous genes, for instance during transcriptomics-based studies. SNPs were categorized as varietal if no variation was observed to occur at that position within any of the cultivars sampled. Figure 2 shows a summary of the SNP frequencies for the cereal taxa, indicating the frequencies of varietal and co-segregating SNPs. When the intravarietal SNPs are excluded, the SNP frequency drops for all taxa; however, this fall is most pronounced for allohexaploid wheat and ancient tetraploid maize, with just 26% and 42% of SNPs being varietal, respectively. The proportion of varietal SNPs could not be accurately determined for the polyploid Saccharum as only a small proportion of the available ESTs were annotated with cultivar information. The diploid taxa have a higher proportion of useful varietal SNPs, ranging from 45% in rice through 69% in sorghum to 92% in barley. Combining information for intravarietal variation and co-segregation, the last column for each taxon (Figure 2) shows the mean number of SNPs/kb, which satisfies both criteria, i.e. they have a high probability of being validated and show variation between varieties but not within them. Such SNPs would be the best candidates for use in molecular breeding programmes. The taxon with the highest number of such readily exploitable SNPs using these criteria appears to be barley, with 2.0 SNPs/kb, which is 32% of all putative barley SNPs. 
At the other end of the spectrum, just 6% of putative SNPs in wheat appear to meet these criteria, a result which suggests that, when using many conventional SNP discovery programs, a very high proportion of candidate SNPs will not turn out to be useful for breeders. The situation in maize is similar, with just 19% of SNPs co-segregating with no intravarietal variation. With just 0.6 and 0.9 co-segregating varietal SNPs/kb, respectively, rice and wheat offer the lowest number of useful SNPs/kb sequenced. This latter result is surprising, given that rice and wheat have the two largest sequence datasets used in this analysis. Given that, in this sequence data sample, there were almost twice the number of cultivars in the average wheat cluster (6.7) compared with the average rice cluster (3.5), wheat may, in reality, have the lowest level of marker-exploitable SNP diversity of all the cereal taxa studied, whereas its closest relative in this study, barley, has the highest. The average number of sequences and varieties per contig for each taxon, as well as the proportion of 5′ and 3′ EST sequences, are noted in Table 1, and show clearly that differences in these parameters do not explain the variation in SNP frequencies observed. We found that, averaged over all taxa, the SNP frequency was 1.5-fold higher in putative 5′ and 3′ untranslated regions than in neighbouring coding sequences, as defined by having a blast match to the non-redundant protein database. The observed differences in SNP frequency did not correlate with breeding systems: barley and wheat at the opposite ends of the spectrum in SNP frequency terms are both inbreeders, whereas the intermediate maize is an outbreeder. We found no evidence of a phylogenetic trend in SNP diversity as the close relatives sorghum and Saccharum (Kellogg, 1998) showed markedly different frequencies, as did wheat and barley. 
More likely, the results here represent the low level of residual germplasm diversity, resulting from the recent bottleneck of the polyploidization events that created wheat and Saccharum, followed by subsequent selective breeding. We speculate that selection has probably been particularly severe in wheat, where, in addition to the usual drivers of disease resistance and yield, there has also been consistent selection for bread-making quality.
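The distinction between varietal and intravarietal SNPs used in this section can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the study's code: the function name `classify_snp`, the `(cultivar, allele)` input format and the "undetermined" outcome for positions without usable cultivar annotation are all hypothetical. A position is treated as intravarietal as soon as any single cultivar shows more than one allele.

```python
from collections import defaultdict

def classify_snp(observations):
    """Classify one putative SNP position.

    observations: iterable of (cultivar, allele) pairs, one per EST.
    ESTs without cultivar annotation (cultivar is None) are skipped.
    """
    alleles_by_cultivar = defaultdict(set)
    for cultivar, allele in observations:
        if cultivar is not None:
            alleles_by_cultivar[cultivar].add(allele)

    all_alleles = set().union(*alleles_by_cultivar.values())
    if len(all_alleles) < 2:
        # Too little annotated data to see both alleles.
        return "undetermined"
    if any(len(alleles) > 1 for alleles in alleles_by_cultivar.values()):
        # Variation inside one cultivar: homoeologues or gene family members.
        return "intravarietal"
    return "varietal"

# Variety names taken from the barley comparison in the text; alleles are made up.
print(classify_snp([("Optic", "A"), ("Morex", "G"), ("Morex", "G")]))
print(classify_snp([("Optic", "A"), ("Optic", "G"), ("Morex", "A")]))
```

The "undetermined" branch mirrors the situation described for Saccharum, where too few ESTs carried cultivar information for the proportion of varietal SNPs to be estimated.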
Comparison with published SNP frequencies
In order to check the validity of our SNP mining approach, the SNP frequencies reported here were compared with those found in previous studies.
Published estimations of SNP frequency in wheat can be categorized into those which use all available sequences and those which have attempted to exclude SNPs between different homoeologues. Somers et al. (2003) reported an SNP frequency of 1 in 24 base pairs (bp) for SNPs between homoeologues, whereas our upper estimate is 1 in 61 bp, falling to 1 in 123 bp for SNPs which co-segregate. Previous estimates for varietal SNPs range from 1 in 613 bp, based on 16 gene fragments (Schnurbusch et al., 2007), through 1 in 335 bp (Ravel et al., 2006) to 1 in 223 bp (Ravel et al., 2007). Our estimate of one varietal SNP per 233 bp is in good agreement with these studies, whereas our value for co-segregating varietal SNPs at 1 in 1111 bp is rather low by comparison. This indicates that, although this latter class of SNPs has a high probability of validation, the value for varietal SNPs is possibly conservative.
Previous studies based on small focused (Bundock et al., 2003; Russell et al., 2004) and larger (Rostoks et al., 2005, 2006) datasets have estimated SNP frequencies in barley ranging from 1 per 78 to 1 per 200 bp, which is similar to our observation of 1 per 159 bp to 1 per 400 bp for total and co-segregating SNPs, respectively. Rostoks et al. (2005) have previously validated SNPs in eight barley varieties by extensive bidirectional re-sequencing. We downloaded this entire dataset from http://bioinf.scri.ac.uk/barley_snpdb and checked for overlap between those regions re-sequenced and those containing putative SNPs according to this study. Taking each of our putative SNPs, together with 10 nucleotides of the up- and downstream sequence, we found 105 orthologous matches against the Rostoks’ dataset, in 83 of which our SNPs were immediately confirmed. The remaining 22 non-confirmed SNPs were examined in more detail. We found that 15 of these were clearly real SNPs that were not reproduced in the Rostoks’ dataset because its sample of available germplasm was narrower than that encompassed in our EST-based study. In all of these cases, where the varieties Optic, Morex or Golden Promise appeared in both datasets, the alleles were concordant. Finally, we found that seven of our putative SNPs appeared to be possible sequencing errors: all in this group were SNPs represented by only a few example ESTs and in only one case was the suspect SNP called as co-segregating. From this comparison, we estimate that our false-positive call rate is approximately 7% (7 of 105) for all barley SNPs and 1% (1 of 105) for co-segregating barley SNPs.
The broad agreement between our estimates of SNP frequency and those published previously, plus the independent confirmation provided by comparison with the published barley SNPs, indicates that our highly automated and unsupervised SNP discovery pipeline and the inferences we drew from the resulting data are valid, and provide a dataset in which several important taxa can be compared.
The SNP frequency in maize has been estimated at between 1 in 104 bp for two randomly paired landraces (Tenaillon et al., 2001) and 1 in 124 bp in 36 inbred lines representing US germplasm (Ching et al., 2002) based on sequence data from 21 and 18 loci, respectively. Our frequency estimates of 1 in 112 bp (total) and 1 in 256 bp (co-segregating) are somewhat lower again, reflecting the slightly conservative approach taken here, but also probably caused by the lower genetic diversity among the publicly available nucleotide accessions compared with landraces or lines carefully selected to represent the whole US genetic base. A similar in silico mining approach to find SNPs in maize EST alignments has been used previously (Batley et al., 2003). That study used the 102 551 sequences available at the time and estimated an SNP frequency of between 1 in 600 for contigs of five or fewer members and 1 in 100 for contigs of 20 or more members. These data are consistent with the estimates presented in this study, where our frequencies are based solely on contigs with eight or more members. The independent SNP validation carried out in the 2003 study showed that, in all cases in which SNPs were shown to co-segregate, they were subsequently validated by re-sequencing. This provides further validation of the approach to SNP discovery used in this study. Finally, a recent pyrosequencing study by Barbazuk et al. (2007) in two maize lines generated 540 000 sequence reads which were used to identify SNPs. Their frequency estimate of between 1 in 300 bp and 1 in 214 bp is lower than that presented here, which is not surprising given that only two varieties were compared.
Alignment of the complete genomes for indica and japonica subspecies of rice has revealed SNP frequencies of between 1 in 943 bp and 1 in 268 bp using conservative (Feltus et al., 2004) and non-conservative (Shen et al., 2004) approaches. Our conservative (co-segregating) and non-conservative estimates of 1 in 1111 bp and 1 in 238 bp are in excellent agreement with those published previously.
We hypothesize and others have shown (Clark et al., 2004) that selection and selective sweeps can lead to a loss of SNP diversity in regions under intense selection, as selection for the favoured trait(s) drags surrounding linked alleles into fixation. To test for evidence of a loss of SNP diversity in specific regions of the wheat genome, we mapped both SNP-containing and non-SNP-containing contigs of wheat to chromosome bins using blast searches to previously mapped wheat ESTs (Rota and Sorrells, 2004). We found that, for wheat, there was no excess or deficit of SNP-containing contigs compared with total contigs in any of the chromosome arms across the seven chromosomes each of the A, B and D genomes. There was, of course, considerable variation in the number of contigs mapping to each chromosome arm bin as a result of the difference in size between these bins: chromosome 1AS had just 154 contig hits, whereas 5BL had 697. A more detailed analysis was performed whereby the mean number of SNPs per base was calculated for each chromosome bin based on those contigs for which the consensus sequence could be blast matched against an EST previously matched to the National Science Foundation (NSF) nulli/tetrasomic deletion lines (Rota and Sorrells, 2004). A total of 4588 contigs matched one or more mapped ESTs, with between 571 and 769 hits to each bin in total, making this a reasonably large dataset from which to draw conclusions about SNP density. In order to avoid multiple testing problems, we pre-selected SNP frequency as the variable to test for association against the factor of chromosome number. Our null hypothesis was that SNP frequency within SNP-containing contigs would be equal for all chromosomes. We were able to reject this null hypothesis using a one-way analysis of variance (anova, F(6,35) = 10.91, P < 0.001).
There was clearly a small but significant increase in SNP frequency in chromosomes 4 and 7 (12.5 and 12.4 SNPs/kb, respectively) compared with chromosomes 1, 2, 3, 5 and 6 (10.2, 11.0, 10.7, 11.2 and 10.3 SNPs/kb, respectively). The reasons for the observed differences in SNP density between chromosomes are currently unclear, but the statistics indicate that they are unlikely to be explained by chance. It is possible either that selection has played a part in reducing SNP diversity in the other chromosomes relative to that seen in chromosomes 4 and 7, or that balancing selection for different traits is maintaining marginally higher levels of diversity in these two chromosomes.
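The structure of the one-way anova reported above can be sketched in pure Python. This is an illustration only: the study used MINITAB and other tools, the function name `one_way_anova` is hypothetical, and the within-chromosome spread in the example data is invented (only the chromosome means come from the text). The sketch assumes one observation per chromosome arm per genome, i.e. seven chromosome groups with six values each, which gives the reported degrees of freedom of 6 and 35.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Chromosome means (SNPs/kb) for chromosomes 1-7 as reported in the text.
chromosome_means = [10.2, 11.0, 10.7, 12.5, 11.2, 10.3, 12.4]
# Hypothetical deviations for the six arm-by-genome bins of each chromosome.
offsets = [-0.5, -0.3, -0.1, 0.1, 0.3, 0.5]
groups = [[m + o for o in offsets] for m in chromosome_means]

f_stat, df_between, df_within = one_way_anova(groups)
print(f"F({df_between},{df_within}) = {f_stat:.2f}")
```

Because the spread within groups is invented, the F value printed here is not expected to match the published 10.91; converting F to a P value additionally requires the F distribution, which is left to a statistics package.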
We repeated this process for rice SNP data, in this case blast searching contigs not to mapped ESTs, but to the complete genome sequence. One-way anova revealed no difference in either the mean percentage of SNP-containing contigs or the mean number of SNPs per base between rice chromosomes. Feltus et al. (2004) have shown previously that SNP frequency varies across individual rice chromosomes. A lack of genome sequence data prevented us from attempting a similarly detailed analysis for the other taxa studied here.
Linkage disequilibrium (LD)
Several studies have examined the extent of LD in cereals (Kraakman et al., 2004; Morrell et al., 2005; Somers et al., 2007). This is an important issue as the distance over which LD decays determines the limitations of the use of molecular markers in a given taxon. When LD decays rapidly, a large number of markers will be needed to provide genome coverage, but, if these are available, fine-scale mapping is possible. Conversely, extended LD means that few markers are needed to provide genome coverage, but the resulting bins will be large and thus traits may not be mapped with such precision. In an attempt to assess the extent of LD, we mapped the SNP loci from all of our taxa studied on to the rice genome, with the expectation that genes sufficiently close to be in LD are likely to have retained their syntenic gene order. We assessed the effect of population structure on our LD measurements by re-sampling the original rice dataset 1000 times, whilst assigning loci to a randomly selected chromosome number. Population structure will result in apparent LD between loci which do not lie on the same chromosome, and so the randomization approach reveals the proportion of such artefactual LD. The greatest distance over which significant LD was recorded for rice was 10 kb. The randomization tests revealed a high level of artefactual LD in the rice data, such that, for genes 20 kb apart, LD had fallen to the background level. For all the other taxa studied, the observed LD was never significantly higher than that attributable to population structure. The high levels of LD detected in the randomized data for all taxa studied should serve as a caution to researchers hoping to use association studies to link SNPs to phenotypes in existing datasets. In accordance with our results, previous studies have shown significant LD extending throughout ten loci in rice (Zhu et al., 2007).
The results presented here show that valuable SNP data can be mined in quantity from the existing and ever-growing cereal EST collections. For crops which arose by polyploid hybridization, this process is clearly hampered by the presence of homoeologues/paralogues, which dominate the in silico SNP data collection. Fortunately, a technological solution is at hand, given the decreased cost and increased throughput of so-called ‘next-generation’ sequencing. Given a dataset of sufficient size, it will be possible to identify and exclude homoeologous SNPs from those found between varieties, and yet still produce sufficient SNPs to provide dense genome coverage. We anticipate, therefore, that 454 data will replace ESTs as the major resource for in silico SNP discovery in the foreseeable future. In order to exploit the SNP data generated for wheat, we have placed 25-mer probes, flanking representative SNPs from each consensus sequence, on an Agilent microarray. We hope to use this array to map large numbers of SNPs on to aneuploid wheat lines, and also to investigate the relative contributions of the A, B and D genomes to the wheat transcriptome.
All EST sequences were downloaded in GenBank format from the EST data collection held at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/dbEST). Data for all taxa were downloaded in June 2006. Sorghum data were downloaded again in November 2008 after a significant increase in the number of sequences available. GenBank-formatted sequences were converted to FASTA format using a PERL script, with cultivar, tissue and developmental stage information parsed (where possible) from the GenBank header lines. The resulting FASTA multiple sequence files were screened for vector contamination using cross_match (Gordon et al., 1998) against the UniVec database from NCBI. Short and low-complexity sequences were also removed from the datasets. Cluster/contig builds were produced for each species using the TIGR Gene Indices Clustering Tool (TGICL, currently available at http://compbio.dfci.harvard.edu/tgi/software/). SNP discovery was performed using a custom PERL script (available on request) employing an approach similar to that described by Barker et al. (2003). An SNP was called in the sequence alignment when each alternative SNP allele was present in at least two example sequences, or in more than 10% of sequences for contigs with more than 20 members. The 10% threshold is required to prevent an exponential increase in false discovery rate as sequence alignments increase in size (simulation data not shown). SNPs were tested for co-segregation using a permutation approach in which each pair-wise combination of SNPs within an alignment was tested for linkage. The maximum frequency of each observed allele combination was noted in the original data, and then the allele pairings were randomized 1000 times, with the maximum frequency being noted again in each case. If the observed frequency fell in the top 5% tail of the permuted values, the allele combination was deemed to be in significant LD.
Individual SNP positions were deemed to be co-segregating if the average pair-wise permuted P value was less than 0.05. Simulations showed that this approach was applicable to alignments of eight or more sequences, as the false-negative rate for calling significant LD in datasets with perfect LD was then < 0.0001 (data not shown). For this reason, only contigs containing eight or more sequences were used for the analyses presented here. Contigs comprising more than 250 sequences were excluded from the analysis as the computing time required to perform the LD permutations was excessive (empirical observation) and the resulting HTML report files were difficult to visualize effectively.
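The permutation test for co-segregation can be sketched in Python as follows. This is one possible reading of the procedure described above, not the custom PERL script itself: the function names, the use of "greater than or equal to the observed value" to define the upper tail, and the toy data are assumptions. For a pair of SNP columns, the count of the most common allele combination is compared with its distribution after shuffling one column 1000 times; a position is then called co-segregating if its mean pair-wise P value is below 0.05.

```python
import random
from collections import Counter

def max_pair_count(col_a, col_b):
    """Count of the most frequent allele combination across reads."""
    return max(Counter(zip(col_a, col_b)).values())

def pair_p_value(col_a, col_b, n_perm=1000, rng=None):
    """Fraction of shuffles whose top combination is at least as common as observed."""
    rng = rng or random.Random(0)
    observed = max_pair_count(col_a, col_b)
    shuffled = list(col_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real pairing between the two SNPs
        if max_pair_count(col_a, shuffled) >= observed:
            hits += 1
    return hits / n_perm

def cosegregating_positions(columns, alpha=0.05, n_perm=1000, seed=0):
    """Indices of SNP columns whose mean pair-wise P value is below alpha."""
    rng = random.Random(seed)
    kept = []
    for i, col in enumerate(columns):
        p_values = [pair_p_value(col, other, n_perm, rng)
                    for j, other in enumerate(columns) if j != i]
        if p_values and sum(p_values) / len(p_values) < alpha:
            kept.append(i)
    return kept

# Toy alleles at three SNP positions across eight reads.
snp_a = "AAAAGGGG"
snp_b = "CCCCTTTT"   # perfectly linked to snp_a
snp_c = "ACACACAC"   # unrelated to snp_a, as scattered errors might be
print(pair_p_value(snp_a, snp_b), pair_p_value(snp_a, snp_c))
print(cosegregating_positions([snp_a, snp_b]))
```

With eight reads and two balanced, perfectly linked SNPs, only 2 of the 70 possible pairings reproduce the observed pattern, so the expected P value is about 0.03; this is consistent with the statement that eight sequences is the smallest alignment for which the test is usable. Note that averaging P values across all pairs means that a single unlinked neighbour can pull a genuinely linked position above the threshold.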
Consensus sequences were mapped on to the rice genome using a PERL script to blast each sequence in turn and note the location of the top hit. For each rice chromosome, the distances between the sites of matching EST sequences were noted and used to calculate the proportion of SNP loci in LD. To control for population structure, this process was repeated 1000 times with each sequence randomized to one of the 12 rice chromosomes.
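The control for population structure can be illustrated with a rough Python sketch. It rests on several assumptions that are not stated in the text: loci are represented by a chromosome number and a position, the set of locus pairs already found to be in significant LD is supplied as input (`ld_pairs`), positions are kept while chromosome labels are randomized, and the background is summarized as a mean. All names are hypothetical.

```python
import random

def ld_fraction(chroms, positions, ld_pairs, max_dist):
    """Fraction of same-chromosome locus pairs within max_dist bp that are in LD.

    ld_pairs: set of (i, j) index pairs with i < j judged to be in LD.
    Returns None when no pair of loci qualifies.
    """
    total = linked = 0
    n = len(chroms)
    for i in range(n):
        for j in range(i + 1, n):
            same_chrom = chroms[i] == chroms[j]
            if same_chrom and abs(positions[i] - positions[j]) <= max_dist:
                total += 1
                linked += (i, j) in ld_pairs
    return linked / total if total else None

def background_ld(positions, ld_pairs, max_dist, n_chrom=12, n_rep=1000, seed=0):
    """Mean LD fraction when each locus is assigned to a random chromosome."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_rep):
        random_chroms = [rng.randint(1, n_chrom) for _ in positions]
        frac = ld_fraction(random_chroms, positions, ld_pairs, max_dist)
        if frac is not None:
            values.append(frac)
    return sum(values) / len(values) if values else None

# Four toy loci: two close together on chromosome 1, two far apart on chromosome 2.
chroms = [1, 1, 2, 2]
positions = [1_000, 6_000, 2_000, 500_000]
ld_pairs = {(0, 1)}
print(ld_fraction(chroms, positions, ld_pairs, max_dist=10_000))
print(background_ld(positions, ld_pairs, max_dist=10_000))
```

Observed LD at a given distance is only informative when it exceeds the background value obtained from the randomized chromosome assignments, which is the comparison described for rice at 10 kb and 20 kb.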
Mapped wheat ESTs were downloaded from http://wheat.pw.usda.gov/wEST as a gzipped FASTA file (wEST_mapped.fasta.gz, September 2007). The non-redundant protein sequence database and the rice chromosome sequence data were obtained from NCBI (posted date: 8 February 2008). The Prosite database and search tools were downloaded from ftp://ftp.expasy.org/databases/prosite/ in November 2008. All blast searches were performed locally using the blastall executable version 2.2.9 available from NCBI. All scripts were developed using PERL 5.8.6 under Apple OS X 10.5.5 and Redhat Linux 9. Sequence clustering was performed on the University of Bristol ‘BlueCrystal’ high-performance cluster (http://www.acrc.bris.ac.uk/acrc/hpc.htm) and a local dual quad core Mac-pro server. The clustering times varied for each taxon, but the rice dataset (the largest at more than 1 million ESTs) required approximately 1200 cpu hours to cluster and a further 400 cpu hours for all subsequent analysis and annotation. Statistical analysis was performed using Microsoft Excel, PERL scripts and MINITAB (http://www.minitab.co.uk).
We thank Jane Coghill for proofreading the manuscript and the Biotechnology and Biological Sciences Research Council (BBSRC) for providing the funding for this work via the following awards: BBE0001261, BBE0063371, BBF0075231 and BBF0103701. We also thank anonymous reviewer 2 for detailed comments and suggestions which were most helpful.