Shotguns and SNPs: how fast and cheap sequencing is revolutionizing plant biology


  • Steven D. Rounsley,

    1. School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
    2. BIO5 Institute, University of Arizona, Tucson, AZ 85721, USA
  • Robert L. Last

    Corresponding author
    1. Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
    2. Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA

  • Note: The authors were employees of Cereon Genomics LLC, a wholly owned subsidiary of Monsanto Co., when they participated in the generation and analysis of the Landsberg erecta shotgun sequence.

For correspondence (fax 517 353 9334; e-mail


In 1998 Cereon Genomics LLC, a subsidiary of Monsanto Co., performed shotgun sequencing of the Arabidopsis thaliana Landsberg erecta genome to a depth of twofold coverage using ‘classic’ Sanger sequencing. This sequence was assembled and aligned to the Columbia ecotype sequence produced by the Arabidopsis Genome Initiative. The analysis provided tens of thousands of high-confidence predictions of polymorphisms between these two varieties of A. thaliana, and the predicted polymorphisms and Landsberg erecta sequence were subsequently made available to the not-for-profit research community by Monsanto. These data have been used for a wide variety of published studies, including map-based gene identification from forward genetic screens, studies of recombination and organelle genetics, and gene expression studies. The combination of resequencing approaches with next-generation sequencing technology has led to an increasing number of similar studies of genome-wide genetic diversity in A. thaliana, including the 1001 genomes project. Similar approaches are becoming possible in any number of crop species as DNA sequencing costs plummet and throughput rapidly increases, promising to lay the groundwork for revolutionizing our understanding of the relationship between genotype and phenotype in plants.


In 1998, the Arabidopsis Genome Initiative was well underway in its international, coordinated effort to generate the first plant genome sequence. The consortium focused on the Columbia (Col) ecotype of Arabidopsis thaliana using a bacterial artificial chromosome (BAC)-by-BAC sequencing approach, and their goal was to create a high-quality reference sequence of the euchromatic regions of the genome (Arabidopsis Genome Initiative 2000). At the same time, scientists (including these two authors) at Cereon Genomics LLC, a subsidiary of Monsanto Co., embarked on a project to sequence a second ecotype, Landsberg erecta (Ler), using a whole-genome shotgun approach. Although this effort was thought by many to be competing with the public project, in reality it was complementary: in addition to serving the company’s internal needs, it led to the first large-scale, genome-wide polymorphism database for any plant species. Thus, it provided a first glimpse at the nature of large-scale genomic variation that exists within a plant species. Here, we review its immediate impact, and how similar approaches using today’s technologies are advancing our understanding of plant biology and evolution.

Using the Ler sequence data at Cereon

The sequencing and analysis of the Ler genome was among the first projects at Cereon. Although we appreciated that the sequence would be of long-term broad utility to the community, there were two short-term goals for the project: accessing the majority of the genes for a flowering plant and developing markers for map-based cloning. Both goals were part of a broad functional genomics strategy to use Arabidopsis to find genes for Monsanto’s transgenic and molecular breeding programs. The strategy also included forward genetics screening for a diverse set of mutants altered in phenotypes directly related to Monsanto’s commercial targets. Thus, in addition to providing gene sequences directly, a primary goal of the Ler project was to provide tools to enable map-based cloning on mutants with a wide variety of phenotypes, including some that required analytical chemistry techniques, and which were therefore difficult to score (e.g. seed metabolite traits).

The sequencing phase of the project generated over 700 000 Sanger sequencing reads. Although this seems modest by today’s standards, it was very ambitious in 1998, requiring over 7000 96-lane sequencing runs, which after filtering for mitochondrial and chloroplast contamination represented approximately 2x coverage of the Ler genome. Along with the large quantity of sequence data came an assembly challenge. Not only was the coverage relatively low, but software tools for attempting whole-genome assembly of large genomes were still in their infancy. Ultimately, with various pre- and post-processing strategies and the phrap assembler (Green, 1996), these data were assembled into a total of 92.1 Mbp that contained at least a portion of over 95% of genes in the genome (Jander et al., 2002). This collection of Ler sequences was regularly aligned against the Col sequences from the public sequencing project as they were produced, to identify putative polymorphisms that could be used as markers (Figure 1). Two distinct types of polymorphisms were predicted: single nucleotide polymorphisms (SNPs) and insertion–deletion polymorphisms (indels). Because of the low coverage of the Ler sequence data, stringent thresholds were used to maximize the quality of the predicted polymorphisms, and thereby minimize the resources spent on markers that were unlikely to be useful. Indeed, a random sampling of the SNPs showed a surprisingly high validation rate (Jander et al., 2002).

Figure 1.

 A schematic representation of the process by which putative Col–Ler polymorphisms were identified. (a) The publicly funded Arabidopsis Genome Initiative sequenced the Col ecotype in a stepwise BAC-by-BAC manner (Arabidopsis Genome Initiative, 2000). Each clone was sequenced and assembled independently, and combined to form the high-quality reference genome sequence. (b) Cereon Genomics used a whole-genome shotgun approach to sequence the Ler ecotype. Shotgun sequence reads from the entire genome were assembled to form short contigs of overlapping sequences. The two collections of sequences were then aligned to each other at high stringency to identify either (c) putative single nucleotide polymorphisms or (d) putative insertion/deletions. The entire collection of predicted polymorphisms was provided to the not-for-profit research community through the TAIR database (Jander et al., 2002; Rounsley, 2003).
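The comparison step in panels (c) and (d) can be illustrated with a minimal Python sketch. It assumes that a Col segment and a Ler contig have already been aligned into two equal-length gapped strings. The function name `call_polymorphisms` and the `flank` filter (a candidate is kept only when a fixed number of perfectly matching columns surrounds it) are hypothetical stand-ins for the kind of stringent threshold described above, not the actual criteria used at Cereon.

```python
def call_polymorphisms(ref_aln, qry_aln, flank=10):
    """Scan a gapped pairwise alignment and report candidate SNPs and indels.

    ref_aln, qry_aln: equal-length aligned strings; '-' marks a gap.
    flank: hypothetical stringency filter. A candidate is kept only if the
           `flank` alignment columns on each side of it match exactly.
    Returns a list of (kind, ref_position, ref_allele, qry_allele), where
    ref_position is the 0-based ungapped reference coordinate at which the
    difference starts.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    n = len(ref_aln)
    mismatch = [r != q for r, q in zip(ref_aln, qry_aln)]

    def clean_flanks(start, end):
        # Both flanks must be full length and free of mismatches or gaps.
        left = mismatch[max(0, start - flank):start]
        right = mismatch[end:end + flank]
        return (len(left) == flank and len(right) == flank
                and not any(left) and not any(right))

    calls = []
    ref_pos = 0  # ungapped coordinate in the reference sequence
    i = 0
    while i < n:
        if not mismatch[i]:
            ref_pos += 1
            i += 1
            continue
        # Extend over a run of consecutive differing columns.
        j = i
        while j < n and mismatch[j]:
            j += 1
        ref_allele = ref_aln[i:j].replace("-", "")
        qry_allele = qry_aln[i:j].replace("-", "")
        if clean_flanks(i, j):
            if j - i == 1 and ref_allele and qry_allele:
                kind = "SNP"
            elif not ref_allele or not qry_allele:
                kind = "indel"
            else:
                kind = "complex"
            calls.append((kind, ref_pos, ref_allele, qry_allele))
        ref_pos += len(ref_allele)
        i = j
    return calls


col = "ACGTACGTAC--GTACGT"  # toy reference (Col) alignment row
ler = "ACGAACGTACTTGTACGT"  # toy query (Ler) alignment row
relaxed = call_polymorphisms(col, ler, flank=3)
strict = call_polymorphisms(col, ler, flank=5)
```

In this toy alignment the relaxed setting reports one SNP and one 2-bp insertion, whereas the stricter setting discards the SNP because it lies too close to the end of the alignment. The trade-off is the same as the one described for low-coverage data: fewer predictions, each better supported.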

To adapt map-based cloning to an industrialized ‘one size fits all’ environment, a strategy was implemented that minimized the number of plants that were phenotyped. Instead, relatively large numbers of plants were genotyped, and recombinant progeny were tested for their phenotype (Jander et al., 2002). Genes affected in dozens of mutants were identified, including unannotated genes for enzymes involved in seed amino acid (Jander et al., 2004; Lee et al., 2008), glucosinolate (Kim et al., 2004; Kliebenstein et al., 2007) and tocopherol (Van Eenennaam et al., 2003; Valentin et al., 2006) metabolism, as well as a gene encoding a key enzyme of ascorbate biosynthesis (Jander et al., 2002), and ESK genes, the loss-of-function mutants of which are constitutively freezing tolerant (Xin et al., 2007).
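The logic of genotyping first and phenotyping only the informative plants can be sketched in a few lines of Python. The plant identifiers, the genotype codes ("C" for homozygous Col, "L" for homozygous Ler, "H" for heterozygous) and the function name are all hypothetical; the sketch assumes two markers flanking the interval of interest and is meant only to illustrate the selection step, not the full Cereon workflow.

```python
from typing import Dict, List, Tuple


def select_recombinants(genotypes: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return IDs of plants whose two flanking markers differ in genotype.

    genotypes maps a plant ID to (left_marker, right_marker) calls.
    A plant whose flanking calls differ carries at least one crossover
    between the markers, so only these plants need to be phenotyped.
    """
    return sorted(pid for pid, (left, right) in genotypes.items()
                  if left != right)


# Hypothetical F2 genotype calls at two flanking markers.
f2_plants = {
    "p1": ("H", "H"),
    "p2": ("C", "H"),  # crossover between the markers
    "p3": ("L", "L"),
    "p4": ("H", "L"),  # crossover between the markers
    "p5": ("C", "C"),
}

to_phenotype = select_recombinants(f2_plants)
assays_saved = len(f2_plants) - len(to_phenotype)
```

For traits that require analytical chemistry to score, the saving in phenotyping effort grows with the size of the population, because the fraction of plants recombinant in a small interval is low.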

Making the data available to the public

As the internal successes grew, there was increasing realization that placing the polymorphism data set in the hands of the academic community would have a mutually beneficial outcome. Individual scientists would benefit by being able to clone their genes of interest more rapidly, and Monsanto would benefit by access to that knowledge through the scientific literature. In total, the knowledge generated by the community was likely to greatly complement what Monsanto could generate internally, and at a much reduced cost. Finding the ideal mechanism and legal framework for providing access to the data was not trivial, but ultimately a partnership with The Arabidopsis Information Resource (TAIR) provided access with a ‘click agreement’ to a license that protected only the data set as a whole, and allowed polymorphisms to be used and published freely (Rounsley, 2003). The final data set contained 56 670 polymorphisms, including 37 344 SNPs and 18 579 indels, at an average density of one SNP per 3.3 kb of genome, and one indel per 6.6 kb of genome. This was the largest set of genetic markers available for any plant species at the time, and remained so until similar resources for rice (Oryza sativa) were made available in 2008 via an array-based resequencing platform (McNally et al., 2009). Following the release of the polymorphism database, Monsanto also made the full set of Ler sequence contigs available – also through the TAIR website.
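As a quick consistency check on these figures, the marker counts and average spacings can be multiplied out in Python. The genome span computed here is only what the quoted numbers imply; it is not a value stated in the text, and the variable names are illustrative.

```python
# Figures quoted in the text.
total_polymorphisms = 56_670
snps = 37_344
indels = 18_579
snp_spacing_kb = 3.3    # one SNP per 3.3 kb
indel_spacing_kb = 6.6  # one indel per 6.6 kb

# Genome span implied by count x average spacing (an inference, not a
# number given in the text).
span_from_snps_mb = snps * snp_spacing_kb / 1000
span_from_indels_mb = indels * indel_spacing_kb / 1000

# Polymorphisms in the total that fall in neither listed category;
# the text does not break these down further.
unlisted = total_polymorphisms - snps - indels

print(f"Span implied by SNP density:   {span_from_snps_mb:.1f} Mb")
print(f"Span implied by indel density: {span_from_indels_mb:.1f} Mb")
print(f"Polymorphisms not itemized:    {unlisted}")
```

Both densities imply a span of roughly 123 Mb, so the two spacing figures are mutually consistent.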

Use of the data by the broader Arabidopsis community – examples

The tens of thousands of predicted polymorphisms have been used in many published studies, ranging from mapping of interesting mutations to studying genome structure and function. Most notably, more than one hundred papers have appeared describing genetic mapping and map-based cloning approaches using the Cereon/Monsanto collection of markers. Not surprisingly, the fields impacted span a wide range of plant biology, with recent examples including mutants altered in cytokinesis (Thiele et al., 2009), root tropic responses (Miyazawa et al., 2009), female gametophyte development (Moll et al., 2008) and disease resistance (Wawrzynska et al., 2008). Identification of alleles (including quantitative trait loci, QTLs) controlling phenotypes observed in crosses between Col and Ler has also benefited from the availability of dense marker maps (Edwards et al., 2005; Hoekenga et al., 2006; Staal et al., 2008). The high-density marker map facilitates studies of genetic mechanisms such as genome-wide patterns of recombination. For example, Drouaud and co-workers used these markers to study the dynamic pattern of meiotic recombination across chromosome 4 of Arabidopsis, and characterized sequences associated with ‘hot’ and ‘cold’ spots in euchromatin (Drouaud et al., 2006). Sites that are polymorphic between Col and Ler are also a good starting place for efficiently finding genetic differences between Col/Ler and other ecotypes; for a recently published example, see the paper by Huang et al. (2008).

Because the shotgun Ler sequence contained deep coverage of organelle DNA, it has been useful for studies of structure and expression of organelle genomes. In a recently published example, Forner and co-workers used the Ler-derived markers and reciprocal crosses to analyze the genetic basis for differences in mitochondrial mRNA terminus processing (Forner et al., 2008). Both maternal effects (likely cis-acting effects of mitochondrial DNA polymorphism) and trans-effects arising from differences in nuclear genes were found. On a utilitarian but important note, polymorphisms that ‘uniquely’ identify an ecotype or mutant allele are very useful in confirming the identity of seed stocks, individual plants or cell cultures. This is especially useful for a plant like Arabidopsis, where the tiny seeds and availability of tens of thousands of mutants and hundreds of wild accessions can lead to a nearly limitless level of contamination and confusion of lab stocks.

The expanding universe of Arabidopsis genetic polymorphisms

Although the Ler–Col comparison was the first published example of a genome-wide set of indel and SNP markers, the resources for studying and using Arabidopsis sequence variation are expanding at a rapidly increasing rate. An early effort to mine expressed sequence tags (ESTs) and sequence tagged sites (STSs) led to the identification of nearly 9000 polymorphisms across 12 A. thaliana accessions (Schmid et al., 2003). This general approach was expanded by Nordborg et al., who sequenced more than 870 fragments in 96 different accessions of A. thaliana, providing an early view of the overall pattern of genetic variation across many loci and a substantial number of isolates of the species (Nordborg et al., 2005). These data provided insight into the overall population structure of A. thaliana from around the world. Their results also indicated that linkage disequilibrium decays over 25–50 kb, suggesting that genetic association mapping could be used in this or similar populations of A. thaliana.
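Linkage disequilibrium between two polymorphic sites is commonly summarized by the statistic r², and its decay is assessed by relating r² to the physical distance between sites. The following Python sketch computes r² for a pair of biallelic sites. It assumes each accession can be treated as a single haplotype (a common simplification for a largely homozygous selfing species) with alleles coded 0 and 1; the data are invented and the function name is illustrative, not taken from the cited study.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites scored on the same haplotypes.

    site_a, site_b: equal-length sequences of 0/1 allele codes, one entry
    per accession. Returns NaN if either site is monomorphic.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n  # frequency of allele 1 at site A
    p_b = sum(site_b) / n  # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b   # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else float("nan")


# Invented allele calls for four accessions at three sites.
site_1 = [0, 0, 1, 1]
site_2 = [0, 0, 1, 1]  # perfectly correlated with site_1
site_3 = [0, 1, 0, 1]  # independent of site_1

ld_linked = r_squared(site_1, site_2)
ld_unlinked = r_squared(site_1, site_3)
```

A marker is useful for association mapping when r² with the causal variant remains high. Decay over 25–50 kb therefore bounds how far apart genotyped markers can be while still tagging nearby variants.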

The use of information about genome-wide and transcriptome-wide variation in a diverse set of individuals to discover genes of interest continues to gain popularity in plants (Nordborg and Weigel, 2008), and is being applied in a wide range of studies in A. thaliana. For example, high-resolution mapping of the Col × Ler recombinant inbred lines (RILs) by array hybridization provided detailed information about recombination behavior, and created a durable tool for fast and high-resolution mapping of QTLs in Arabidopsis (Singer et al., 2006). This idea was taken a step further by West and co-workers, who combined genome-wide DNA polymorphisms and gene expression markers to comprehensively characterize RILs from a cross between the Bay-0 and Sha ecotypes (West et al., 2006).

Such tools that allow efficient pan-genome surveys are becoming increasingly important in harnessing ‘natural variation’ to associate genes with traits. Screening of diverse germplasms (linkage disequilibrium mapping) promises to become increasingly important for associating genes with phenotypes. A recent report described a combination of ecotype screening, genetic segregation analysis, and resequencing of a candidate gene in 92 ecotypes to demonstrate genetic association of the MOT1 gene and shoot molybdenum content (Baxter et al., 2008). These types of results also facilitate studies in evolutionary and population genetics. For example, Toomajian et al. (2006) were able to rigorously ask whether the FRIGIDA locus, controlling the requirement for vernalization for flowering, was under selection in A. thaliana. By comparing polymorphism patterns at FRI to more than a thousand other loci in 96 accessions, they concluded that this locus was under strong selection: such a conclusion is made compelling because of the sampling of a very large number of other loci. The increasing availability of DNA sequences from related taxa allows comparisons of evolutionary and population biological processes. For instance, Foxe and co-workers examined rates of purifying (stabilizing) and positive selection in the outcrossing species Arabidopsis lyrata and the self-pollinating A. thaliana (Foxe et al., 2008).

Recent breakthroughs in resequencing throughput and affordability have led to efforts to sequence 1001 isolates of A. thaliana. A step in this direction was reported using high-density array resequencing of several dozen accessions (Borevitz et al., 2007; Clark et al., 2007). These studies provide an unprecedented view of genome evolution and relationships among individuals and populations: for example, huge variations in patterns of genetic polymorphism including large regions of low polymorphism consistent with recent ‘selective sweeps’ (Clark et al., 2007). Recently, deep coverage short-read sequencing approaches have been used for several technical breakthroughs in Arabidopsis research: the resequencing of three ecotypes (Ossowski et al., 2008); the sequencing of the floral epigenome (Lister et al., 2008); and the demonstration of its application to simultaneous mapping and mutation identification (Schneeberger et al., 2009). Plummeting costs and longer-read next-generation sequencing technologies should provide a staggering quantity of sequence data for Arabidopsis over the next few years.

Applications to crops

Until recently, studies of genetic diversity in crops and other larger genome plant species focused on sequencing of cDNAs and analysis of specific genomic regions (Ganal et al., 2009). This was because of the large size and complexity of their genomes, as well as because of the lack of reference sequences for plants other than Arabidopsis and rice. A common approach was to develop primers for specific (typically evolutionarily conserved) genomic or cDNA sequences, and to perform PCR amplification and sequencing of these amplicons on diverse cultivars, natural isolates or related species of interest. Random sequencing of EST collections from diverse germplasms and bioinformatic detection of polymorphisms in orthologous genes is also commonly employed. High levels of redundancy in plant genomes, such as large gene families or polyploidization, present special challenges for distinguishing between polymorphic alleles and paralogous gene family members. Conversely, for diploid outcrossing species, analysis of a single individual can identify genome-wide collections of polymorphisms because of the wide-spread heterozygosity in the genome.

As with Arabidopsis, second generation sequencing technologies are revolutionizing crop genomics, including our understanding of genome diversity, development of mapping resources, and studies of ecological and evolutionary biology. By lowering costs and increasing the rate of sequence acquisition, these technologies are causing researchers to rethink how crop genomes and transcriptomes are analyzed. Some of these approaches represent natural extensions of past approaches. For example, the sequencing of ESTs with the 454 sequencing platform in species as phylogenetically diverse as maize (Zea mays) (Barbazuk et al., 2007) and the tree Eucalyptus (E. grandis) (Novaes et al., 2008) led to the discovery of SNPs in thousands of genes at much lower cost than with dideoxy sequencing. These markers can then be deployed in molecular breeding for variety improvement or gene discovery by genetic mapping.

Both RILs and nearly isogenic lines (NILs) are widely used in genetic mapping of naturally occurring variation, and in plant breeding. In the past, genetic analysis of RILs and NILs in crops and model plants required tedious and expensive methods for marker discovery and genotyping of the lines with these individual markers, and the resulting maps were often of relatively low resolution (Eshed and Zamir, 1995; Loudet et al., 2002; Simon et al., 2008). The use of multiplexing strategies combined with fast and cheap DNA sequence analysis enables very high resolution genetics. In a recently published example (Huang et al., 2009), low-pass Illumina sequencing was performed on 150 rice RILs derived from a cross between the cultivars O. sativa ssp. indica and japonica. This approach permitted the construction of a high-resolution map of the recombination events in these RILs, and allowed the efficient mapping of traits associated with individual RILs.
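With low-pass data, individual SNP calls in a RIL are sparse and error-prone, so parental origin is usually inferred from groups of neighboring SNPs rather than from single sites. The Python sketch below illustrates one simple way to do this, using a majority vote in a sliding window followed by detection of genotype switches. The window size, the "A"/"B" parental codes and the function names are assumptions made for illustration; the sketch is not claimed to reproduce the pipeline of the cited study.

```python
from collections import Counter


def window_genotypes(calls, window=5):
    """Majority-vote parental genotype for each sliding window of SNP calls.

    calls: per-SNP parental assignments ('A' or 'B'), ordered along a
           chromosome, for one recombinant inbred line.
    window: number of consecutive SNPs per window; use an odd number so a
            two-way vote cannot tie.
    """
    smoothed = []
    for start in range(len(calls) - window + 1):
        counts = Counter(calls[start:start + window])
        smoothed.append(counts.most_common(1)[0][0])
    return smoothed


def breakpoints(window_calls):
    """Indices of windows at which the inferred parental genotype switches."""
    return [i for i in range(1, len(window_calls))
            if window_calls[i] != window_calls[i - 1]]


# Invented calls for one RIL: a block from parent A followed by a block
# from parent B, each containing one isolated miscall.
ril_calls = list("AAAAAABAAA" + "BBBBABBBBB")

smoothed = window_genotypes(ril_calls, window=5)
switches = breakpoints(smoothed)
```

The two isolated miscalls are absorbed by the vote, leaving a single inferred recombination breakpoint. Larger windows tolerate more genotyping error but locate breakpoints less precisely.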

Whole-genome scans of genetic polymorphism are increasingly being used to search for sequences under natural and artificial selection. Although pioneering work is being carried out using maps with millions of polymorphisms to analyze genetic variation in humans (Sabeti et al., 2007), this approach is also being successfully applied to studying complex traits in crop plants. Scanning populations derived from the Illinois high and low kernel oil lines with nearly 500 genetic markers revealed more than 50 QTLs influencing the trait, with each locus responsible for small levels of genetic variance (Laurie et al., 2004). This study indicates the value of genome-wide genetic analysis in revealing the genetic basis for complex traits in maize, although the existence of large numbers of small-effect alleles precludes identification of the genes underlying the effects. Whole-genome scan association mapping with approximately 8500 SNP markers was used to analyze the genetic basis of kernel oleic acid (18:1) content (Beló et al., 2008). In this case the fatty acid desaturase gene fad2 was found very close to the SNP marker genetically associated with the trait, confirming the value of this approach in gene discovery in maize. Studies using genome scans of maize and its progenitor teosinte are revealing candidates for genes under artificial selection. In an early study, analysis of DNA sequences of 774 gene fragments in 14 maize and 16 teosinte inbreds led Wright and co-workers to estimate that more than 1000 genes have been influenced by artificial selection in the evolution of teosinte to the varieties that they analyzed (Wright et al., 2005). A similar study of genetic variation in cultivated and wild sunflower (Helianthus annuus) accessions revealed evidence for several dozen regions being under selection during the period since sunflower domestication (Chapman et al., 2008). These studies suggest that genome-wide genetic scans will be a useful approach to identifying genes influencing important traits in a variety of crop plants.

Concluding comments

In the last 10 years we have seen the dramatic impact that an available reference genome sequence can have on an entire scientific community. In addition to all the intrinsic information that is present in that reference, it also provides a framework to which other data resources can be added. In particular, the sequencing of additional related genomes can provide practical utility and immensely rich data sets for studying genetic variation. With the unprecedented changes in sequencing technologies over the last few years, production of the Cereon Col–Ler data set now seems almost trivial. However, its value has been long lasting, and it has seeded a burgeoning field whose current proposals would have seemed outrageous just a few years ago. It is staggering to consider where sequencing technologies may be in 5 years’ time, and the potential volume of sequence data that will be collected from complex crop genomes and from the biota of complex ecosystems. With these new data sets will come tremendous challenges associated with their analysis and presentation.


We thank Ivan Baxter for helpful comments on the manuscript. Research in RLL’s group is supported by NSF grants DBI-0604336 and MCB-0519740, and research in SDR’s group is supported by NSF grants DBI-0822284 and DEB-0918758.