Australian Centre for Plant Functional Genomics, School of Land, Crop and Food Sciences, ARC Centre of Excellence for Integrative Legume Research, University of Queensland, Brisbane, Qld 4072, Australia
Molecular markers are used to provide the link between genotype and phenotype, for the production of molecular genetic maps and to assess genetic diversity within and between related species. Single nucleotide polymorphisms (SNPs) are the most abundant molecular genetic marker. SNPs can be identified in silico, but care must be taken to ensure that the identified SNPs reflect true genetic variation and are not a result of errors associated with DNA sequencing. The SNP detection method autoSNP has been developed to identify SNPs from sequence data for any species. Confidence in the predicted SNPs is based on sequence redundancy, and haplotype co-segregation scores are calculated for a further independent measure of confidence. We have extended the autoSNP method to produce autoSNPdb, which integrates SNP and gene annotation information with a graphical viewer. We have applied this software to public barley expressed sequences, and the resulting database is available over the Internet. SNPs can be viewed and searched by sequence, functional annotation or predicted synteny with a reference genome, in this case rice. The correlation between SNPs and barley cultivar, expressed tissue type and development stage has been collated for ease of exploration. An average of one SNP per 240 bp was identified, with SNPs more prevalent in the 5′ regions and simple sequence repeat (SSR) flanking sequences. Overall, autoSNPdb can provide a wealth of genetic polymorphism information for any species for which sequence data are available.
Molecular genetic markers are based on variation in the genome that can be assayed and monitored between individuals and across generations. They are used to define a genotype without the requirement to sequence the entire DNA content of the genome. The association of markers with heritable traits provides a link between the genotype of an organism and the expressed phenotype. Markers are used in agricultural breeding programmes to incorporate genetically characterized traits in place of field trials or glasshouse screens (Gupta et al., 2001). In human studies, they are frequently used for the identification of genes underlying inherited disorders (Thorisson et al., 2005). The generation of novel markers allows the production of high-density genetic maps and enables the genotype–phenotype link to be defined with greater precision.
Simple sequence repeats (SSRs), also known as microsatellites, and single nucleotide polymorphisms (SNPs) are the modern genetic markers currently being used in plant genetic analysis, together with anonymous marker systems, such as amplified fragment length polymorphisms (AFLPs). With advances in genome sequencing technologies, SNPs are becoming the marker of choice. SNPs are single base changes between individuals that can be used as DNA-based molecular genetic markers. There are three different forms of SNP: transitions (C/T or G/A), transversions (C/G, A/T, C/A or T/G) and small insertions–deletions (indels). SNPs are direct markers, as the sequence information provides the exact nature of the allelic variants. Furthermore, this sequence variation can have a major impact on how the organism develops and responds to the environment. SNPs represent the most frequent type of genetic polymorphism, and may therefore provide a high density of markers near a locus of interest (Picoult-Newberg et al., 1999). They have a fine resolution, are highly stable and reliable (Syvanen, 2001) and are capable of ultra-high-throughput automation and detection. The high density of SNPs makes them valuable for genome mapping and, in particular, they allow the generation of ultra-high-density genetic maps and haplotyping systems for genes or regions of interest, and map-based positional cloning of genes. SNPs are used routinely in crop breeding programmes, for genetic diversity analysis, cultivar identification, phylogenetic analysis, characterization of genetic resources and association with agronomic traits.
As with the majority of molecular markers, one of the limitations of SNPs is the initial cost associated with their discovery. Traditionally, the discovery of SNPs was a laboratory-based procedure involving the polymerase chain reaction (PCR)-based amplification of specific genome fragments in individuals of interest, followed by dedicated sequencing. This was both time consuming and costly. The reduced cost of DNA sequencing and the increasing quantities of DNA sequence in the public databases have led to the use of in silico methods for SNP discovery. In silico SNP discovery provides a much cheaper method of finding SNPs, both in terms of time and expense. The mining of available large sequence datasets remains the cheapest source of novel SNPs (Thorisson et al., 2005). Expressed sequence tags (ESTs) are particularly valuable sources of SNPs because of the high redundancy of sequences, diversity of genotypes represented and the fact that each identified SNP is associated with a functional gene (Picoult-Newberg et al., 1999).
Several methods have been developed for SNP discovery from sequence data. However, the challenge of in silico SNP discovery is not the identification of polymorphic bases, but the differentiation of true SNP polymorphisms from the often more abundant sequence errors. The most common source of sequence error is from the automated reading of raw data, because of the fine balance between obtaining the greatest sequence length and the confidence that bases are called correctly. Phred is the most widely adopted software used to call bases from Sanger chromatogram data (Ewing and Green, 1998; Ewing et al., 1998), providing a statistical estimate of the accuracy of calling each base, and therefore a primary level of confidence that a sequence difference represents true genetic variation. There are several software packages that take advantage of this feature to estimate the confidence of sequence polymorphisms within alignments. PolyBayes and PolyPhred (Marth et al., 1999) are the methods of choice to differentiate between true SNPs and sequence error when sequence quality information and trace files are available. However, this is rarely the case when collating sequences from public sequence repositories. Furthermore, sequence quality scores do not identify errors in sequences incorporated prior to the base calling process. The principal cause of these prior errors is the inherently high error rate of the reverse transcription process required for the generation of cDNA libraries for EST sequencing. Similar errors are also inherent, although to a lesser extent, in any PCR amplification process that may be part of a sequencing protocol. In these cases, redundancy-based SNP discovery methods can be highly efficient.
autoSNP (Barker et al., 2003; Batley et al., 2003a) uses a redundancy-based method for SNP confidence measurement, combined with SNP co-segregation which provides a second independent measure of confidence. Within a sequence assembly, polymorphisms must be represented by two or more sequences for an SNP to be called. Although this discards many of the possible true polymorphisms, it is essential to ensure that the identified SNPs have a high confidence of representing true genetic variation. The co-segregation score is a measure of whether a predicted SNP contributes to the definition of a haplotype, and this score is weighted to account for missing data in the assembly. We have implemented the SNP discovery software autoSNP within a relational database to enable the efficient mining of the identified SNP and indel polymorphisms and the detailed interrogation of the data. The implementation of autoSNPdb (Duran et al., 2009) allows researchers to identify SNPs between specific groups of individuals or within genes of predicted function. The system is flexible, and researchers may add additional levels of annotation and novel queries specific to their area of interest. Within autoSNPdb, sequence comparison of the consensus sequences with other sequence data allows annotation with predicted gene function and associated gene ontology (GO). The autoSNPdb system can be applied to any species for which sequence data are available, including next-generation pyrosequence data, and does not require quality scores or sequence trace files.
This article reports the results of the autoSNPdb method applied to the cereal crop barley. Barley (Hordeum vulgare) has been domesticated for ~10 000 years and is used for the production of whisky and beer. It is an important cereal crop in Australia, with an annual production of around 6.6 million tonnes (http://www.barleyaustralia.com.au). Barley has a relatively small diploid genome of approximately 5000 million base pairs. Barley autoSNPdb includes the ability to identify SNPs that discriminate between selected barley cultivars, and to search by sequence annotation or by sequence similarity. Analysis of SNP frequency is presented, including association with position within the sequence.
Results and discussion
A total of 466 800 barley sequences was downloaded from GenBank. Of these, 191 sequences were removed that were longer than 4000 bases and found to be genomic sequence, and 3411 sequences were removed as they had fewer than 100 bases. During the masking of repeats, 8209 sequences were removed as they had been masked for over 50% of their total length and were considered to be poor-quality sequence containing large amounts of sequence error, which would confound the SNP discovery process. The remaining sequences numbered 454 989 and, from these sequences, 25 674 assemblies and 68 565 singletons were generated using cap3. The EST sequences were processed using stringent parameters to limit the alignment of multiple genes from gene families and to identify polymorphisms between homologues from different barley lines. This is a similar number of contigs to that obtained in the barley HarvEST database (http://www.harvest-web.org/); however, autoSNPdb assembly identifies a greater number of singletons. This difference in assembly results can be attributed to the pre-clustering used in the HarvEST method, which greatly reduces the assembly time, but can also reduce the quality of the assemblies. The autoSNPdb pipeline does not pre-cluster sequences prior to cap3 assembly and therefore requires large-scale compute infrastructure, but it does not compromise assembly quality.
Of the 25 674 assemblies, 16 127 (62.8%) contained four or more sequences and could be used for the identification of candidate polymorphisms. By removing from the analysis the 37% of contigs that contained fewer than four aligned sequence reads, we restricted the number of potentially polymorphic loci that could be detected. However, this is necessary if we are to use redundancy to measure confidence in the validity of SNPs from the remaining loci.
A total of 29 447 candidate SNPs was identified within the 16 127 assemblies that contained four or more sequences. The percentage of assemblies that contained SNPs increased with the number of sequences they contained. Only 11% of assemblies containing four sequences possess identified SNPs. By 10 sequences, 37% had SNPs, and this increased to 62% for assemblies containing 30 sequences. This increase in the proportion of contigs which contain SNPs with an increase in contig size suggests that the number of SNP loci identified would increase with even larger datasets.
The mean number of SNPs in an assembly also increased with the number of sequences within the assembly (Figure 1), indicating that larger datasets with increased contig sizes would identify more SNPs per locus. There was also an observed increase in mean SNP score with contig size, and the majority of the additional polymorphisms in larger assemblies appeared to be true SNPs, as they tended to co-segregate to define haplotypes. This indicates that larger datasets would also provide greater confidence in the validity of the predicted polymorphisms. The number of SNPs per assembly was less than that previously identified in maize (Batley et al., 2003a), which may be a result of differences in genetic diversity in these species or may possibly reflect the ancient tetraploid nature of the maize genome.
Although the use of a redundancy-based approach to distinguish between sequence errors and true SNPs is highly efficient, the non-random nature of sequence error may lead to certain sequence errors within complex DNA structures being repeated between runs. In order to eliminate these errors, the minimum SNP redundancy score was weighted according to the number of sequences in the alignment, as it was found that errors at these loci have a relatively high SNP redundancy score and appear to be confident SNPs. Sequencing errors at complex loci are random between runs, whereas SNPs which represent divergence between homologous genes will co-segregate with haplotype; therefore, the SNP co-segregation score, based on the frequency of an SNP pattern occurring at multiple loci in an alignment, was developed to identify non-co-segregating SNPs. This score was weighted to account for missing sequence data at the SNP position within an alignment and for the number of SNP loci. The SNP score and co-segregation score together provide a means of estimating confidence in the validity of SNPs within aligned sequences.
The SNP identification methodology is also suitable for short-read, second-generation sequencing data. It is recognized that short-read sequencing is highly error prone, and a much greater level of redundancy would be required to call an SNP than with the more reliable Sanger or 454 sequence data. The types of error found in 454 data are very predictable and associated with polynucleotide tracts. These regions would not form the basis for confident SNP prediction in autoSNPdb.
In total, 29 447 SNPs were identified over a total region of 7 062 177 bp, which amounts to one SNP every 240 bp. Previous predictions of SNP frequency in barley vary: one SNP per 27 bp found by re-sequencing of a single gene from multiple varieties (Bundock and Henry, 2004); one SNP per 131 bp for a set of cytochrome P450 genes (Bundock et al., 2003); and one SNP per 200 bp in genes responsive to abiotic stress (Rostoks et al., 2005). The number of predicted SNPs found in our study is therefore lower than that found through targeted re-sequencing methods. This is probably a result of the differences in the length of amplicons, the number of cultivars used in the re-sequencing studies and the polymorphisms of the particular loci. The lower abundance of detected SNPs would also be expected as the autoSNPdb method discards some true polymorphisms because of a lack of redundancy, and may not find others where sequence for the discriminating varieties is not present in the public database.
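As a quick check of the arithmetic, and to put the previously published frequencies on a common scale, the conversion between "one SNP per N bp" and SNPs per kilobase can be sketched as follows. The function names are illustrative only and are not part of autoSNPdb.

```python
def bp_per_snp(total_bp: int, snp_count: int) -> float:
    """Average spacing between SNPs, in base pairs."""
    return total_bp / snp_count


def snps_per_kb(total_bp: int, snp_count: int) -> float:
    """SNP density expressed per 1000 bp of sequence."""
    return 1000.0 * snp_count / total_bp


# Figures reported in this study: 29 447 SNPs over 7 062 177 bp.
spacing = bp_per_snp(7_062_177, 29_447)
density = snps_per_kb(7_062_177, 29_447)
print(f"one SNP per {spacing:.0f} bp, or {density:.2f} SNPs/kb")

# Published frequencies (one SNP per N bp) converted to SNPs/kb.
for n_bp in (27, 131, 200):
    print(f"one SNP per {n_bp} bp = {1000.0 / n_bp:.1f} SNPs/kb")
```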
Analysis of base changes
The SNP base changes were recorded. As the directionality of the change cannot be inferred from the data, polymorphisms were grouped alphabetically, i.e. A → G and G → A are grouped as A → G (Table 1). A greater number of transitions (A → G or C → T) (15 755) than transversions (A → C, A → T, C → G or G → T) (13 692) were identified. This is in accordance with previous computational and laboratory-based SNP discovery studies (Garg et al., 1999; Deutsch et al., 2001) and reflects the high frequency of C to T mutation following methylation (Coulondre et al., 1978). The relative abundance of the C/G transversions compared with A/T, A/C and G/T transversions was unexpected and remains to be explained.
Table 1. Base changes and the number of occurrences within single nucleotide polymorphisms (SNPs) in barley autoSNPdb
A → G
C → T
A → C
A → T
C → G
G → T
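The alphabetical grouping of base changes and the transition/transversion classes used in Table 1 can be sketched as below. This is a minimal illustration; the function names are hypothetical and do not come from the autoSNPdb code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def group_change(base1: str, base2: str) -> tuple:
    """Order the two alleles alphabetically, so G/A and A/G both become (A, G)."""
    first, second = sorted((base1.upper(), base2.upper()))
    if first == second:
        raise ValueError("a base change needs two different bases")
    return (first, second)


def change_class(base1: str, base2: str) -> str:
    """Classify a base change as a transition or a transversion."""
    pair = {base1.upper(), base2.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


for alleles in (("G", "A"), ("T", "C"), ("G", "C"), ("T", "A")):
    first, second = group_change(*alleles)
    print(f"{first} → {second}: {change_class(*alleles)}")
```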
Of the 94 239 assemblies, 71 398 (76%) have UniRef annotation, 75 321 (80%) have GenBank annotation and 69 690 (74%) have a significant match with the rice genome, with 11 973 (12.7%) having no annotation (Figure 2). It can be seen that 61 763 assemblies have all annotations (65.5%), with individual assemblies more likely to have a GenBank annotation. Those sequences that have no annotation may be novel genes in barley or may be too short to identify their orthologues in rice. This level of annotation is similar to that identified for a small SNP study in barley, in which 77% of markers had a blastx annotation (Kota et al., 2008). The annotation can be utilized to search for SNPs in genes of predicted function of interest or in a reference genome location of interest.
The position of the polymorphisms in relation to the open reading frame (ORF) was analysed and normalized for the amount of sequence present. It was found that SNP density was greatest in the 5′ untranslated region (UTR), with an average of 4.2 SNPs/kb. This was followed by averages of 3.5 SNPs/kb and 1.7 SNPs/kb identified in the 3′ UTR and ORF regions, respectively; however, this result may be biased by the relatively few SNP-containing contigs for which 5′ regions could be clearly identified.
The positions of polymorphisms were also assessed in SSR flanking regions at 10-bp intervals, up to 300 bp from the SSR. It was found that SNPs were more prevalent in SSR flanking regions, decreasing away from the SSR (Figure 3). This reflects the results found in maize (Mogg et al., 2002; Batley et al., 2003b) and rice (Davierwala et al., 2000) that polymorphisms are prevalent in SSR flanking regions; however, this is the first time that it has been demonstrated that they are more abundant close to the SSR and decrease with distance from the SSR.
A total of 45 candidate SNPs, from 10 loci, was validated using direct sequencing of PCR products. The SNPs were randomly chosen from the first 50 contigs. Of the 45 candidate SNPs, 41 (91%) were shown to be true polymorphisms (Table 2). The four candidate SNPs that were shown to be false appeared to come from multigene families or had variance within the Haruna nijo line within the contig. Within this validation dataset, there appeared to be no correlation of valid SNPs and SNP or co-segregation scores, with some SNPs with scores of two being confirmed. In all cases, the validated SNPs with low SNP, co-segregation and weighted co-segregation scores were in contigs in which many different haplotypes were present. This validation rate of predicted SNPs is similar to the level observed in maize (Batley et al., 2003a).
Table 2. Details of the 10 loci genotyped and single nucleotide polymorphisms (SNPs) validated. Forty-five candidate SNPs from 10 loci, with a range of redundancy and co-segregation scores, were validated in the five lines Morex, Haruna nijo, Barke, Optic and Sloop
Thirteen were from a polymorphic simple sequence repeat (SSR)
Only first two SNPs assayed
Only first four SNPs assayed
Predict a multigene family or different Haruna nijo genotypes
Multiple haplotypes present
The false SNP had lower redundancy and co-segregation
Predict a multigene family or different Haruna nijo genotypes
We have developed an SNP discovery and annotation pipeline and database, autoSNPdb, and applied this to the public barley expressed sequence dataset. This identified 29 447 SNPs with high confidence, representing true genetic variation in barley, a species with relatively little genetic diversity. These predicted SNPs are maintained within a custom, web-accessible database providing a valuable source of annotated markers for applications such as genetic diversity analysis, high-resolution genetic map construction, cultivar identification, phylogenetic analysis, comparative genomics and the characterization of genetic resources for barley and for the comparison of genetic variation in other cereals. This system may be applied to any species for which sequence data are available. Future work will include the development of rice, wheat, Brassica and Brachypodium autoSNPdb, characterizing polymorphic marker conservation across related species and associating this with gene annotation and genomic location.
Sequence download and filtering
A total of 466 800 barley sequences was downloaded from GenBank (release 159). Sequences of less than 100 bp or greater than 4000 bp were removed (sequences of less than 100 bp cannot be repeat masked and sequences of longer than 4000 bp were considered to be genomic fragments).
RepeatMasker was used to soft mask repeat sequences which may otherwise interfere with the assembly process, and repeat sequences were set to lower case. The sequences were assembled with the cap3 program (Huang and Madan, 1999) using the parameters -p 90 and -o 50, and the assembly process assigned a consensus sequence to each assembly. Where a sequence does not assemble with any other sequence, it is referred to as a singleton, and it is stored as an assembly of just one sequence. The assemblies and constituent sequences were parsed into a custom MySQL database.
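The length and repeat-masking filters described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the 50% masking threshold is the one reported in the Results, soft-masked bases are assumed to be lower case, and all names are hypothetical.

```python
MIN_LEN = 100     # shorter sequences cannot be repeat masked
MAX_LEN = 4000    # longer sequences are treated as genomic fragments
MAX_MASKED = 0.5  # reads masked over more than half their length are dropped


def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft masked (lower case)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def keep_sequence(seq: str) -> bool:
    """Apply the length filter first, then the masked-fraction filter."""
    if len(seq) < MIN_LEN or len(seq) > MAX_LEN:
        return False
    return masked_fraction(seq) <= MAX_MASKED


# Toy reads: a clean read, a short read and a heavily masked read.
reads = {
    "clean": "ACGT" * 50,
    "short": "ACGT" * 10,
    "repetitive": "ACGT" * 10 + "acgt" * 40,
}
kept = [name for name, seq in reads.items() if keep_sequence(seq)]
print(kept)
```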
Candidate SNPs were identified using the autoSNP method (Barker et al., 2003; Batley et al., 2003a). Assemblies were examined for small gaps, which were introduced during the assembly process, and sequence polymorphisms. Gaps were classified as insertion–deletions (indels). Polymorphisms were defined as a position in the assembly in which a number of sequences have a different nucleotide to the consensus. The number of sequences that contain the base change must be at least the minimum redundancy defined for the assembly, which is related to the number of sequences that are assembled at that nucleotide position. For an assembly that contains up to seven sequences, the minimum redundancy is two; for between eight and 11 sequences, the minimum redundancy is three; for between 12 and 19 sequences, the minimum redundancy is four; and for 20 or more sequences, the minimum redundancy is five. These conservative values were selected to limit the effect of accumulated sequence errors in large assemblies.
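The redundancy thresholds above can be written as a small lookup, together with a simplified column-wise SNP call. This is a sketch rather than the autoSNP implementation. It assumes that the threshold is inclusive (consistent with "two or more sequences" in the Introduction), that a gap character counts as an allele so that indels are reported, and that every reported allele must itself reach the threshold. The function names and the column representation are hypothetical.

```python
from collections import Counter


def minimum_redundancy(depth: int) -> int:
    """Minimum number of reads that must share a variant base at a position."""
    if depth >= 20:
        return 5
    if depth >= 12:
        return 4
    if depth >= 8:
        return 3
    return 2


def call_snp(column: str, min_depth: int = 4):
    """Return the supported alleles at one alignment column, or None.

    `column` holds one character per aligned read: A, C, G, T or '-' (gap).
    Any other character (for example N, or a space for no coverage) is ignored.
    """
    bases = [b.upper() for b in column if b.upper() in "ACGT-"]
    if len(bases) < min_depth:
        return None  # too few reads to apply the redundancy measure
    threshold = minimum_redundancy(len(bases))
    supported = {b: n for b, n in Counter(bases).items() if n >= threshold}
    # A polymorphism needs at least two alleles, each with enough support.
    return supported if len(supported) >= 2 else None


print(call_snp("AAGG"))      # two reads support each allele
print(call_snp("AAAG"))      # the variant is seen only once
print(call_snp("AAAAAAGG"))  # depth 8 requires three supporting reads
```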
Each SNP was compared with every other SNP in the assembly to calculate the SNP co-segregation score. When SNPs combine to define haplotypes, they have a greater confidence score. When SNPs do not co-segregate, there is a reduced confidence in their ability to represent true genetic variation. The co-segregation score was weighted in relation to present/missing data to prevent high confidence scores where the majority of data are absent.
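The exact scoring formula is not given here, so the following is only one plausible reading of the co-segregation idea, with every detail an assumption. Each SNP is represented as a mapping from read identifier to allele. Two SNPs are taken to co-segregate when, across the reads covering both positions, their alleles pair one-to-one. The weighted score is taken as the fraction of comparable SNPs that co-segregate, and pairs with too few shared reads are treated as missing data.

```python
from typing import Dict, List, Optional, Tuple


def cosegregates(snp_a: Dict[str, str], snp_b: Dict[str, str],
                 min_shared: int = 4) -> Optional[bool]:
    """True if the alleles of two SNPs pair one-to-one over shared reads.

    Returns None when fewer than `min_shared` reads cover both positions,
    so that missing data neither raises nor lowers the score.
    """
    shared = set(snp_a) & set(snp_b)
    if len(shared) < min_shared:
        return None
    pairs = {(snp_a[read], snp_b[read]) for read in shared}
    alleles_a = {a for a, _ in pairs}
    alleles_b = {b for _, b in pairs}
    # Both SNPs must vary among the shared reads, and each allele of one
    # SNP must always occur with the same allele of the other.
    return len(alleles_a) > 1 and len(pairs) == len(alleles_a) == len(alleles_b)


def cosegregation_scores(snps: List[Dict[str, str]]) -> List[Tuple[int, float]]:
    """Return (raw, weighted) scores for each SNP in one assembly."""
    scores = []
    for i, snp in enumerate(snps):
        outcomes = [cosegregates(snp, other)
                    for j, other in enumerate(snps) if j != i]
        comparable = [o for o in outcomes if o is not None]
        raw = sum(comparable)  # number of co-segregating partner SNPs
        weighted = raw / len(comparable) if comparable else 0.0
        scores.append((raw, weighted))
    return scores


# Toy assembly: the first two SNPs define the same two haplotypes,
# whereas the third splits the reads differently.
snp1 = {"r1": "A", "r2": "A", "r3": "G", "r4": "G"}
snp2 = {"r1": "C", "r2": "C", "r3": "T", "r4": "T"}
snp3 = {"r1": "A", "r2": "T", "r3": "A", "r4": "T"}
print(cosegregation_scores([snp1, snp2, snp3]))
```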
Each consensus sequence was compared with GenBank (release 159) and UniRef90 using blast (Altschul et al., 1990), and significant matches were parsed into the database. GO annotation was derived from matching UniRef90 entries. Consensus sequences were also compared with the rice reference genome to identify potentially syntenic gene regions. Several features from the original sequence records were maintained. For the barley dataset, this included the barley cultivar, tissue type and development stage.
Translation region annotations for assembled contigs were derived from the alignment of the best UniRef90 match using blastx. Indels introduced by contig assembly are corrected for in the alignment. When a start or stop codon could not be clearly identified, the regions were flagged as having an unknown translation and omitted from subsequent analysis. SNPs were classified as being in 5′ UTR, ORF, 3′ UTR or unknown regions, and the SNP density was calculated for these regions.
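Once ORF boundaries are available for a contig, the region assignment and the density normalization can be sketched as below. The coordinate convention (0-based positions, ORF as a half-open interval, contig in sense orientation) and the function names are assumptions made for illustration.

```python
from typing import Optional


def classify_position(pos: int, orf_start: Optional[int],
                      orf_end: Optional[int]) -> str:
    """Assign a 0-based contig position to a region.

    The ORF occupies [orf_start, orf_end). If either boundary is missing
    (no clear start or stop codon), the region is reported as unknown.
    """
    if orf_start is None or orf_end is None:
        return "unknown"
    if pos < orf_start:
        return "5' UTR"
    if pos >= orf_end:
        return "3' UTR"
    return "ORF"


def density_per_kb(snp_count: int, region_bp: int) -> float:
    """SNPs per 1000 bp, normalized by the amount of sequence in the region."""
    return 1000.0 * snp_count / region_bp if region_bp else 0.0


# Toy contig of 900 bp with an ORF from position 100 up to (not including) 700.
snp_positions = [20, 350, 820, 860]
regions = [classify_position(p, 100, 700) for p in snp_positions]
print(regions)
print(density_per_kb(regions.count("3' UTR"), 900 - 700))
```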
SSR positions within contigs were determined using SSRPrimer (Robinson et al., 2004). The SNP frequency in each flanking sequence was then determined using a custom Perl script.
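The original analysis used a custom Perl script that is not reproduced here. As a rough Python sketch of the same idea, SNPs can be counted in 10-bp distance bins on either side of an SSR, up to 300 bp, and normalized by the number of flanking bases that actually fall in each bin. The coordinate convention and names are assumptions.

```python
from typing import Iterable, List


def flank_snp_frequency(snp_positions: Iterable[int], ssr_start: int,
                        ssr_end: int, contig_len: int,
                        bin_size: int = 10, max_dist: int = 300) -> List[float]:
    """SNPs per bp in each distance bin around an SSR at [ssr_start, ssr_end).

    Bin 0 covers bases 1 to 10 bp from the SSR on either side, bin 1 covers
    11 to 20 bp, and so on. Positions are 0-based contig coordinates.
    """
    n_bins = max_dist // bin_size
    snp_counts = [0] * n_bins
    bp_counts = [0] * n_bins
    snps = set(snp_positions)
    for pos in range(contig_len):
        if ssr_start <= pos < ssr_end:
            continue  # inside the repeat itself
        dist = ssr_start - pos if pos < ssr_start else pos - ssr_end + 1
        if dist > max_dist:
            continue
        index = (dist - 1) // bin_size
        bp_counts[index] += 1
        if pos in snps:
            snp_counts[index] += 1
    return [s / n if n else 0.0 for s, n in zip(snp_counts, bp_counts)]


# Toy contig of 100 bp with an SSR at positions 40 to 49.
freqs = flank_snp_frequency([20, 38, 55], ssr_start=40, ssr_end=50, contig_len=100)
print(freqs[:3])
```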
Ten contigs were selected randomly for the validation of 45 SNPs/indels. These predicted SNPs had a range of redundancy and co-segregation scores. Genomic DNA was isolated from the five cultivars Morex, Haruna nijo, Barke, Optic and Sloop, using the Qiagen DNeasy Kit, according to the manufacturer's instructions (Qiagen, Valencia, CA, USA). Amplification of the 10 loci was performed using primers designed to the conserved sequence surrounding the SNPs, employing the primer design program primer version 0.5 (Whitehead Institute, Cambridge, MA, USA). Amplifications were carried out in a 25-µL reaction volume containing 25 ng DNA, 2.5 µL 10 × PCR reaction buffer (Qiagen), 15 pmol forward and reverse primers, 200 µM of each deoxynucleotide triphosphate (dNTP) and 2 U HotStar Taq polymerase (Qiagen). After an initial hot start at 95 °C for 15 min, the following cycling parameters were employed: denaturation at 94 °C for 1 min, annealing at 55 °C for 1 min and extension at 72 °C for 1 min. After 35 rounds of amplification, a final extension step was performed at 72 °C for 10 min. Following amplification, PCR products were purified using the Purelink Kit (Invitrogen, Carlsbad, CA, USA). The purified PCR products were sequenced using BigDye 3.1 (Applied Biosystems, Carlsbad, CA, USA), employing forward PCR primers, and analysed using an ABI3730xl. Allele sequences from each locus and line were aligned and compared using Sequencher (GeneCode, Ann Arbor, MI, USA), and each of the SNPs was assessed.
A custom web interface was developed to allow user query and visualization of the SNP and annotation data (http://autosnpdb.qfab.org.au/). The maintenance of the SNP and annotation data within this relational database enables multiple query options. Annotation of the sequence is searchable by keyword, sequence ID, GO term or through similarity to defined regions of the reference genome. The identification of SNP-containing sequences through sequence similarity is performed using the blast interface. The flexibility of the database allows the identification of SNPs between defined groups of individuals, such as SNPs that differentiate between cultivars, tissue type or library, providing a valuable resource for genetic mapping and association studies.
To aid in the interpretation of the predicted SNP data, SNPs are viewed graphically as vertical bars, where the position of the bar along the x axis reflects the relative position of the SNP in the consensus sequence, the height of the bar represents the SNP redundancy score and the bar colour reflects the SNP weighted co-segregation score. Information on each SNP can be displayed by moving the cursor over the bar, and selection of a bar centres the sequence assembly at that position. The sequence assembly can be toggled between the full sequence assembly and an SNP summary. Labels to the left of the sequence may also be toggled between defined sequence information specific to each species. For barley autoSNPdb, this includes: cultivars; GenBank accession numbers; tissue type; and developmental stage. The interface is documented with help pages and database build information.
Support from The Australian Partnership for Advanced Computing (APAC) and Queensland Facility for Advanced Bioinformatics (QFAB) is gratefully acknowledged.