Between 189 000 and 1.2 million cereal EST sequences were analysed for each taxon under study (Table 1). These sequences were assembled into contigs, and those with eight or more members were analysed further for putative SNPs. For the purposes of this analysis, SNPs were called as present in the aligned ESTs when each allele was represented in at least two independent ESTs and when the minor allele frequency was 10% or higher. The percentage of genes analysed was estimated on the basis of each taxon having a haploid genome content of 41 000 genes as for rice (Sterck et al., 2007) and assuming that homoeologues cluster together in a single contig where present. On the basis of these assumptions, the percentage coverage ranged from 10% of genes for sorghum to over 50% for maize. The average sequencing depth was between 20 (sorghum) and 30 (barley) sequence reads per contig, with the average number of named varieties per contig ranging from 3.2 to 6.7 for sorghum and wheat, respectively. There were too few named varieties in Saccharum to estimate this value accurately.
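The SNP-calling thresholds described above can be sketched as follows. This is a minimal illustration, not the autoSNP implementation: the function names, the handling of gaps and ambiguous bases, and the use of per-position read depth as the denominator of the minor allele frequency are all assumptions.

```python
from collections import Counter

MIN_CONTIG_MEMBERS = 8   # contigs with eight or more ESTs were analysed
MIN_ALLELE_COUNT = 2     # each allele must be seen in at least two ESTs
MIN_MINOR_FREQ = 0.10    # minor allele frequency of 10% or higher


def call_snp(column):
    """Return True if one alignment column qualifies as a putative SNP.

    `column` holds one base per EST covering the position. Gaps and
    ambiguous bases are ignored (an assumption of this sketch).
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    counts = Counter(bases).most_common()
    if len(counts) < 2:
        return False  # monomorphic or uncovered position
    minor_count = counts[1][1]  # count of the second most common allele
    return (minor_count >= MIN_ALLELE_COUNT
            and minor_count / len(bases) >= MIN_MINOR_FREQ)


def putative_snp_positions(aligned_ests):
    """Return 0-based SNP positions for a contig of equal-length aligned ESTs."""
    if len(aligned_ests) < MIN_CONTIG_MEMBERS:
        return []  # contig too shallow to be analysed
    return [i for i, col in enumerate(zip(*aligned_ests))
            if call_snp("".join(col))]
```

For example, a contig of six reads `ACGT` and two reads `ATGT` yields a single putative SNP at position 1, whereas the same variant seen in only one read would be rejected.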
Wheat was found to have the highest SNP frequency at 16.5 SNPs per kilobase (kb), and rice had the lowest frequency at 4.2 SNPs/kb. These values were derived by dividing the number of SNP sites by the total length of sequence in SNP-containing contigs. Slightly lower estimates were obtained when contigs with no SNPs at all were included. Because these values are derived from ESTs, they relate exclusively to the transcribed regions of the taxa studied, including some untranslated regions. For all taxa, we found that the SNP frequency was elevated in putative 5′ and 3′ untranslated regions relative to coding regions, as defined by a blast search of the contig consensus to known proteins (Figure 1).
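The frequency calculation described above, and the effect of including SNP-free contigs, can be illustrated with a short sketch. The function name, the input format and the example numbers are hypothetical and are not taken from the dataset.

```python
def snps_per_kb(contigs, include_snp_free=False):
    """Compute SNP sites per kilobase of contig sequence.

    `contigs` is an iterable of (length_in_bp, number_of_snp_sites) pairs.
    By default only SNP-containing contigs contribute to the denominator.
    """
    used = [(length, n) for length, n in contigs
            if include_snp_free or n > 0]
    total_length = sum(length for length, _ in used)
    if total_length == 0:
        return 0.0
    total_snps = sum(n for _, n in used)
    return 1000 * total_snps / total_length


# Illustrative values only: two SNP-containing contigs and one SNP-free contig.
example = [(1000, 5), (500, 1), (1500, 0)]
print(snps_per_kb(example))                         # SNP-containing contigs only
print(snps_per_kb(example, include_snp_free=True))  # all contigs, lower estimate
```

Adding SNP-free contigs increases the denominator without adding SNP sites, which is why the second estimate can only be equal to or lower than the first.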
In total, we found 190 453 putative SNPs in the taxa studied. These are available to browse and search (by blast and text) at http://www.cerealsdb.uk.net. We have made many improvements to the display of SNPs compared with the previously published version of autoSNP (Barker et al., 2003), including the presentation of a protein-based view of SNP changes, colour coding and sorting of variety names, FASTA format versions of the aligned contig sequences, rice genome hits for the current contig and a simplified overview of SNP haplotypes for each report. This represents the largest collection of taxonomically diverse cereal expressed SNPs available as far as we are aware.
The data presented in Table 1 show a clear link between SNP frequency and ploidy level in the cereal taxa studied. The extant polyploids wheat (allohexaploid) and Saccharum (octoploid or higher) have the highest frequencies at 16.5 and 10.6 SNPs/kb, respectively, closely followed by the ancient tetraploid maize with 8.9 SNPs/kb. The diploid taxa barley, sorghum and rice have significantly lower SNP frequencies at 6.3, 5.5 and 4.2 SNPs/kb, respectively. The SNP-finding algorithm used will fail to detect SNPs in which the minor allele frequency is less than 10% or the minor allele is represented only once in a contig, and so, in this sense, the reported frequency may be considered an underestimate caused by ‘false-negative’ errors. Conversely, a proportion of the putative SNPs discovered will actually be sequence errors (false-positive errors). The effect of these two types of error on our estimate of SNP frequency will, to some extent, cancel out. In an effort to place a lower boundary on the SNP frequency by minimizing false-positive errors, SNPs were categorized as either co-segregating or non-co-segregating. Co-segregating SNPs were defined as those in which there was a significant association between a given SNP position and other SNP positions within the same sequence alignment. The degree of association was tested by randomization tests, with the expectation that real SNPs will generally co-segregate, whereas sequencing errors will not. A P value was calculated for each pair-wise combination of SNPs within an alignment by permutation, and those SNPs for which the average P value was less than 0.05 were deemed to be co-segregating. SNPs may further be defined as intravarietal (occurring between members of a gene family or between homoeologous chromosomes within a polyploid crop variety) and varietal (occurring between different varieties of a taxon).
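The co-segregation test described above can be sketched as a pair-wise permutation test. The text does not specify the association statistic or the number of permutations, so the chi-square statistic, the 999 permutations, the fixed random seed and all function names below are assumptions made for illustration only.

```python
import random
from collections import Counter
from itertools import combinations


def chi_square(xs, ys):
    """Chi-square association statistic for two equal-length allele lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    stat = 0.0
    for a in count_x:
        for b in count_y:
            expected = count_x[a] * count_y[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


def pair_p_value(xs, ys, n_perm=999, rng=None):
    """Permutation P value: shuffle alleles at one position among the ESTs."""
    if rng is None:
        rng = random.Random(0)
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise in ties.
        if chi_square(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def cosegregating_snps(snp_columns, alpha=0.05, n_perm=999, seed=0):
    """Return SNP positions whose average pair-wise P value is below alpha.

    `snp_columns` maps a SNP position to its alleles, one per EST in the
    alignment, with None where an EST does not cover the position.
    """
    rng = random.Random(seed)
    p_values = {pos: [] for pos in snp_columns}
    for p, q in combinations(snp_columns, 2):
        pairs = [(a, b) for a, b in zip(snp_columns[p], snp_columns[q])
                 if a is not None and b is not None]
        if len(pairs) < 2:
            continue  # too few ESTs cover both positions
        xs, ys = zip(*pairs)
        p_val = pair_p_value(xs, ys, n_perm, rng)
        p_values[p].append(p_val)
        p_values[q].append(p_val)
    return {pos for pos, vals in p_values.items()
            if vals and sum(vals) / len(vals) < alpha}
```

In this sketch, two positions whose alleles are perfectly linked across twelve ESTs are both reported as co-segregating, whereas a position whose alleles are distributed independently of its partner is not. Because the criterion averages P values over all pairs, a single unassociated position in an alignment raises the average for every other SNP in that alignment.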
For marker-assisted selection, breeders will generally require varietal SNPs to act as markers for the variety-specific traits under selection. Intravarietal SNPs, which occur between gene family members or homoeologous genes, will not generally be useful for breeding purposes because most, if not all, varieties will be scored as monomorphic for such markers; however, they may be useful in discriminating between gene family members or homoeologous genes, for instance during transcriptomics-based studies. SNPs were categorized as varietal if no variation was observed to occur at that position within any of the cultivars sampled. Figure 2 shows a summary of the SNP frequencies for the cereal taxa, indicating the frequencies of varietal and co-segregating SNPs. When the intravarietal SNPs are excluded, the SNP frequency drops for all taxa; however, this fall is most pronounced for allohexaploid wheat and ancient tetraploid maize, with just 26% and 42% of SNPs being varietal, respectively. The proportion of varietal SNPs could not be accurately determined for the polyploid Saccharum as only a small proportion of the available ESTs were annotated with cultivar information. The diploid taxa have a higher proportion of useful varietal SNPs, ranging from 45% in rice through 69% in sorghum to 92% in barley. Combining information for intravarietal variation and co-segregation, the last column for each taxon (Figure 2) shows the mean number of SNPs/kb that satisfy both criteria, i.e. SNPs that have a high probability of being validated and that vary between varieties but not within them. Such SNPs would be the best candidates for use in molecular breeding programmes. The taxon with the highest number of such readily exploitable SNPs using these criteria appears to be barley, with 2.0 SNPs/kb, which is 32% of all putative barley SNPs.
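The varietal criterion described above can be sketched as follows. The function name and input format are hypothetical, and two choices are assumptions of this sketch rather than details given in the text: ESTs without a cultivar annotation are ignored, and a SNP with no annotated ESTs is reported as undetermined.

```python
from collections import defaultdict


def is_varietal(calls):
    """Classify one SNP position as varietal (True) or intravarietal (False).

    `calls` is an iterable of (cultivar, allele) pairs, one per EST at the
    SNP position; cultivar is None when the EST carries no annotation.
    Returns None when no EST at the position is annotated (undetermined).
    """
    alleles_by_cultivar = defaultdict(set)
    for cultivar, allele in calls:
        if cultivar is not None:
            alleles_by_cultivar[cultivar].add(allele)
    if not alleles_by_cultivar:
        return None
    # Varietal: no cultivar shows more than one allele at this position.
    return all(len(alleles) == 1 for alleles in alleles_by_cultivar.values())


# Placeholder cultivar labels, for illustration only.
print(is_varietal([("cv1", "A"), ("cv1", "A"), ("cv2", "G")]))  # alleles split by cultivar
print(is_varietal([("cv1", "A"), ("cv1", "G"), ("cv2", "G")]))  # cv1 carries both alleles
```

A SNP would then count as readily exploitable in the sense used here when it is both varietal under this rule and co-segregating under the permutation criterion.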
At the other end of the spectrum, just 6% of putative SNPs in wheat appear to meet these criteria, a result which suggests that, when using many conventional SNP discovery programs, a very high proportion of candidate SNPs will not turn out to be useful for breeders. The situation in maize is similar, with just 19% of SNPs co-segregating with no intravarietal variation. With just 0.6 and 0.9 co-segregating varietal SNPs/kb, respectively, rice and wheat offer the lowest number of useful SNPs/kb sequenced. This latter result is surprising, given that rice and wheat have the two largest sequence datasets used in this analysis. Given that, in this sequence data sample, there were almost twice as many cultivars in the average wheat cluster (6.7) as in the average rice cluster (3.5), wheat may, in reality, have the lowest level of marker-exploitable SNP diversity of all the cereal taxa studied, whereas its closest relative in this study, barley, has the highest. The average number of sequences and varieties per contig for each taxon, as well as the proportion of 5′ and 3′ EST sequences, are noted in Table 1, and show clearly that differences in these parameters do not explain the variation in SNP frequencies observed. We found that, averaged over all taxa, the SNP frequency was 1.5-fold higher in putative 5′ and 3′ untranslated regions than in neighbouring coding sequences, as defined by having a blast match to the non-redundant protein database. The observed differences in SNP frequency did not correlate with breeding systems: barley and wheat, at opposite ends of the spectrum in SNP frequency terms, are both inbreeders, whereas the intermediate maize is an outbreeder. We found no evidence of a phylogenetic trend in SNP diversity as the close relatives sorghum and Saccharum (Kellogg, 1998) showed markedly different frequencies, as did wheat and barley.
More likely, the results here represent the low level of residual germplasm diversity, resulting from the recent bottleneck of the polyploidization events that created wheat and Saccharum, followed by subsequent selective breeding. We speculate that selection has probably been particularly severe in wheat, where, in addition to the usual drivers of disease resistance and yield, there has also been consistent selection for bread-making quality.