• Open Access

Characterizing homologues of crop domestication genes in poorly described wild relatives by high-throughput sequencing of whole genomes


(Tel +61 7 3346 0551; fax +61 7 3346 0555; email robert.henry@uq.edu.au)


Wild crop relatives represent a source of novel alleles for crop genetic improvement. Screening biodiversity for useful or diverse gene homologues has often been based upon the amplification of targeted genes using available sequence information to design primers that amplify the target gene region across species. The crucial requirement of this approach is the presence of sequences with sufficient conservation across species to allow for the design of universal primers. This approach is often not successful with diverse organisms or highly variable genes. Massively parallel sequencing (MPS) can quickly produce large amounts of sequence data and provides a viable option for characterizing homologues of known genes in poorly described genomes. MPS of genomic DNA was used to obtain species-specific sequence information for 18 rice genes related to domestication characteristics in a wild relative of rice, Microlaena stipoides. Species-specific primers were available for 16 genes compared with 12 genes using the universal primer method. The use of species-specific primers had the potential to cover 92% of the sequence of these genes, while traditional universal primers could only be designed to cover 80%. A total of 24 species-specific primer pairs were used to amplify gene homologues, and 11 primer pairs were successful in capturing six gene homologues. The 23 million 36-base pair (bp) paired end reads, equating to an average of 2X genome coverage, facilitated the successful amplification and sequencing of six target gene homologues, illustrating an important approach to the discovery of useful genes in wild crop relatives.


Plant domestication resulted in changes in the appearance, biochemical properties and genetics of crops to assist in the ease of cultivation. In their review, Tang et al. (2010) summarized the genomic changes that are associated with the domestication of crops, such as amino acid substitution, deletion and truncation, transposon insertion, regulatory change, splice site mutation and gene duplication. Farmers and plant breeders have been selecting beneficial characteristics to create better crops to feed the ever-increasing world population. Selections involved in crop domestication, whether natural or artificial, result in genetic bottlenecks (Doebley et al., 2006; Tang et al., 2010). Wild crop relatives can provide alternative gene pools to counteract the genetic bottlenecks (Dillon et al., 2007; Singh et al., 2008); however, genomic characterization of wild species is generally poor (Zeid et al., 2010). Variants of known genes with potential novel functionality have been found by exploring related species by utilizing synteny and homology (Feltus et al., 2006; Zeid et al., 2010). The common method of screening biodiversity for useful gene homologues has often been based upon the amplification of targeted genes using available sequence from cultivated or model species to design universal primers that amplify the target gene region universally across species (Ramirez et al., 2004; McIntosh et al., 2005; Shapter et al., 2009).

Amplification of gene homologues using universal primers is made possible by designing and positioning primers in the regions with conserved sequence between two or more species, usually within the same family. This universal primer method (UPM) is useful for amplifying a known gene in another related species, where the sequence is unknown (Ramirez et al., 2004; McIntosh et al., 2005; Shapter et al., 2009; Zeid et al., 2010). The crucial requirement of this approach is the presence of sequences with sufficient conservation across species to allow for the design of universal primers. The UPM is therefore often not successful with diverse organisms or highly variable genes. First, homologues of the desired genes may have been characterized in only one species, making it impossible to identify conserved regions for the design of universal primers; secondly, sequence conservation between gene homologues in related species may not be sufficient for the design of effective universal primers, leading to failure to amplify the gene homologues in the target species.

Traditionally, gene homologues amplified by polymerase chain reaction (PCR) have their identities confirmed by Sanger sequencing (Sanger et al., 1977). Automation fluorescence technology, capillary electrophoresis and computer-driven laser detection have enabled this technology to achieve high-throughput and high-quality sequencing, making this method the gold standard sequencing platform to date (Men et al., 2008). Each high-quality Sanger read can extend to 1 kilobase pair (kbp) in length. There are several drawbacks associated with this method, such as the difficulty of obtaining accurate sequence data for polyploids without cloning, the lack of ability to handle and analyse allele frequencies and the cost of sequencing, which, at $1 per kbp, makes it impossible for researchers with limited budget to sequence whole genomes (Men et al., 2008).

Improvements in massively parallel sequencing (MPS) make it an increasingly viable method for whole genome sequencing (Amaral et al., 2009; Hyten et al., 2010; Rubin et al., 2010). There are a number of platforms available for this sequencing method. The most commonly used platforms are Roche 454 (454 Life Sciences, Branford, CT), Illumina Genome Analyzer (Illumina Inc., San Diego, CA) and ABI SOLiD (Applied Biosystems, Foster City, CA). They generate a large number of short reads, typically 450 base pairs (bp) for Roche 454 and 50–100 bp for both Illumina Genome Analyzer and ABI SOLiD (DiGuistini et al., 2009). Reviews comparing the different platforms indicate that high-quality genomewide sequence with a broad range of applications can be rapidly achieved using MPS (Mardis, 2008; Harismendy et al., 2009; Metzker, 2010). Although genome assembly remains difficult with this technology because of a number of factors such as poor ability to deal with repetitive sequences (DiGuistini et al., 2009; Metzker, 2010), it provides a viable option for characterizing homologues of known genes in poorly described genomes. Sequence data generated by MPS can be used to design species-specific primers that allow for the amplification of homologues of known genes in related species.

There are over 10 000 species in the grass family Poaceae (Hsiao et al., 1999). Major crop plants belonging to Poaceae, such as rice, maize and sorghum, have been well characterized at the genomic level (Goff et al., 2002; Yu et al., 2002; Paterson et al., 2009; Schnable et al., 2009), and research is underway to characterize barley, a complex diploid (Schulte et al., 2009) and complex polyploid grasses such as sugar cane and wheat (Paux et al., 2008; Bundock et al., 2009). Members of the Poaceae are highly conserved at the genomic level. A number of researchers have used the DNA sequence conservation between species within the Poaceae to identify homologues of known genes from one species in others (Campbell et al., 2007; Shapter et al., 2009; Zeid et al., 2010). Recently, Brachypodium distachyon became the first wild grass to have its whole genome sequenced, allowing it to be used as another model system for grasses (The International Brachypodium Initiative, 2010). Completed genome sequences can be used as templates for the design of molecular markers in poorly described species (Feltus et al., 2006).

There is a need to genetically characterize poorly known species for a number of purposes such as biodiversity, biodiscovery and breeding for crop improvement. The availability of molecular markers for these species is very limited because marker development is laborious, time-consuming and expensive (Zeid et al., 2010). In general, resources and research are directed towards major crops. Very few Poaceae taxa in the world have been well characterized. In those species that have been well characterized, the sample sizes have been small, not reflecting the total diversity of each individual species. Understanding the genetic diversity of known taxa will also aid the discovery of unknown species (David et al., 2007). Characterizing poorly known species will contribute to the knowledge of the gene pool available for use for crop improvement through advancing plant breeding techniques and genetically modified organism technology (Dillon et al., 2007; Singh et al., 2008; Tang et al., 2010).

Microlaena stipoides, a wild relative of rice, Oryza spp. (Figure 1), was used as the target species in this study. It responds well to irrigation and nitrogen application (Chivers and Aldous, 2005) and has been targeted for commercialization and domestication as a perennial grain crop (Davies et al., 2005a). Microlaena stipoides seeds have similar grain ultrastructure to rice (Shapter et al., 2008), and in some ecotypes, the seed size approaches that of rice (Davies et al., 2005b). Microlaena stipoides is a tetraploid (Murray et al., 2005) with a reported genome size of approximately 869 mega base pairs (Nock et al., 2011).

Figure 1.   A cladogram showing the relationship between the target species, Microlaena stipoides, and five major grass species, adapted from Kellogg, 2009, 2001; Shapter et al., 2009.

In this study, we propose a novel method for characterizing homologues of known genes in poorly described genomes. We attempted to amplify homologues of well-characterized genes from a model species in a related but poorly described species by utilizing species-specific primers derived from MPS genomic data of the poorly described species. MPS of genomic DNA was used to obtain M. stipoides sequence information for 18 genes related to domestication characteristics in Oryza (Table 1). The characteristics controlled by these genes are photoperiod sensitivity, heading date, grain yield (grain weight, size, number and filling rate), tillering, seed shattering, prostrate growth, pericarp colour, waxy grains, dwarfism and grain fragrance (Table 1). Species-specific primers were designed based on the alignments of M. stipoides sequence and the 18 Oryza reference gene sequences. A comparison between characterizing homologues of known genes in poorly described genomes using the UPM and the species-specific primer method (SSPM) is presented here.

Table 1.   Details of 18 annotated rice genes related to domestication traits analysed in this study

| Gene/protein name | Abbreviation | Domestication trait | Publication |
|---|---|---|---|
| QTL for heading date 1 | Hd1 | Photoperiod sensitivity and heading date | Yano et al. (2000) |
| QTL of seed shattering in chromosome 1 | qSH-1 | Seed shattering | Konishi et al. (2006) |
| QTL for grain number 1A | Gn1a | Grain number | Ashikari et al. (2005) |
| Red pericarp | Rc | Red pericarp | Sweeney et al. (2006) |
| QTL for seed width on chromosome 5 | qSW5 | Grain width | Shomura et al. (2008) |
| Waxy*, also known as granule bound starch synthase I (GBSSI) | Wx | Waxy grains | Hirano et al. (1998) |
| Semi dwarf 1 | Sd-1 | Dwarfism | Ashikari et al. (2002) |
| Monoculm 1 | MOC1 | Tillering | Li et al. (2003) |
| QTL for heading date 3a | Hd3a | Heading date | Kojima et al. (2002) |
| Early heading date 1* | Ehd1 | Heading date | Doi et al. (2004) |
| QTL for heading date 6 | Hd6 | Heading date | Yamamoto et al. (2000) |
| QTL for grain size 3 | GS3 | Grain length and weight | Fan et al. (2006) |
| QTL for shattering 4, also known as shattering 1 (SHA1) | sh4 | Seed shattering | Li et al. (2006) |
| Betaine aldehyde dehydrogenase 2 | BADH2 | Fragrance | Bradbury et al. (2005) |
| QTL for grain width and weight 2 | GW2 | Grain weight and width | Song et al. (2007) |
| Grain incomplete filling 1 | GIF1 | Grain filling rate | Wang et al. (2008) |
| QTL for grain number, plant height and heading date 7 | Ghd7 | Grain number, plant height and heading date | Xue et al. (2008) |
| Prostrate growth 1 | PROG1 | Prostrate growth | Jin et al. (2008) |

  1. *Indicates that first exon and intron are located on the UTR region of the gene.


Maximum possible primer placement for UPM

Seventeen of the 18 genes had homologous sequences identified within Poaceae, the exception being Ghd7 (Table 2). Ehd1 and qSW5 did not have sufficient areas of conservation to allow for the alignment of the available species with the Oryza reference gene for any universal primer placement (Table 2). Rc, GS3 and PROG1 had only limited areas of conservation, which restricted the proportion of the gene potentially able to be captured to such an extent that universal primer amplification was deemed to be uninformative (Table S1); therefore, the total number of genes where universal primers could not be placed was six. There was sufficient conservation between species for universal primer design within the remaining 12 genes (Hd1, qSH-1, Gn1a, Wx, Sd-1, MOC1, Hd3a, Hd6, sh4, BADH2, GW2 and GIF1). Figure S1 shows the possible universal primer pair placements indicating the maximum gene capture.

Table 2.   Comparison of mining homologues of rice genes in Microlaena stipoides using UPM and SSPM outlining the type of primers available for PCR amplification

| Gene | Genbank entry | UPM: relevant putative homologues in other species within Poaceae* | UPM: best minimum match percentage at 20-bp overlap (Sequencher®) | UPM: primer site(s) | UPM: number of PCR amplicons§ | SSPM: percentage of reference sequence covered by MPS | SSPM: minimum match percentage at 20-bp overlap (Sequencher®) | SSPM: primer site(s) | SSPM: number of PCR amplicons§ |
|---|---|---|---|---|---|---|---|---|---|
| Hd1 | AB041837 | Phyllostachys edulis**, Brachypodium distachyon, Zea mays, Sorghum bicolor, S. arundinaceum, Triticum aestivum, Hordeum vulgare, Lolium perenne, Schedonorus pratensis and Festuca arundinacea | 67 | ND | 1 | 34.7 | 82 | ND | 1 |
| qSH-1 | AB071333 | T. aestivum**, H. vulgare and Z. mays | 63 | C | 1 | 43.9 | 85 | ND | 1 |
| Gn1a | AB205193 | S. bicolor**, P. edulis, Z. mays, T. aestivum and Bambusa oldhamii | 72 | NS | N/A | 51.1 | 85 | ND | 2 |
| Rc | AB250059 | S. bicolor** and T. aestivum | <60 | NS | N/A | 35.4 | 81 | ND | 2 |
| qSW5 | AB433345 | S. bicolor** and P. edulis | 63 | NS | N/A | 37.5 | 82 | ND | 1 |
| Wx | AF031162 | Found in other major Poaceae including Sorghum spp., Z. mays, Triticum spp. and Hordeum spp.** Also found in various other Poaceae (too many to list here), including target species M. stipoides | 75 | C | 1 | 45.3 | 83 | ND | 2 |
| Sd-1 | AF465255 | Luziola leiocarpa, Chikusichloa aquatica, Rhynchoryza subulata, Ehrharta erecta, Leersia tisserantii, Z. mays** and S. bicolor | 78 | ND | 1 | 35.6 | 82 | ND | 1 |
| MOC1 | AY242058 | S. bicolor** | 78 | ND | 1 | 62.8 | 89 | ND | 1 |
| Hd3a | BD169069 | H. vulgare**, Triticum turgidum, T. aestivum, Z. mays, L. perenne, T. monococcum, S. bicolor, Phyllostachys meyeri, Streptogyna crinita, Streptogyna americana, Aulonemia subpectinata, Sasa senanensis, Dendrocalamus asper, Sasa kurilensis, Guaduella foliosa, G. marantifolia, Sasa tsuboiana, Sasa jotanii, Diandrolyra bicolor and Lithachne pauciflora | 80 | ND | 1 | 34.7 | 82 | ND | 1 |
| Ehd1 | BD407931 | S. bicolor** | <60 | NS | N/A | 29.8 | 81 | NS | N/A |
| Hd6 | DQ157464 | Z. mays**, S. bicolor, H. vulgare and L. perenne | 81 | C | 2 | 32.5 | 76 | ND | 2 |
| GS3 | DQ355996 | Z. mays and S. bicolor** | <60 | NS | N/A | 34.3 | 83 | NS | N/A |
| sh4 | DQ383373 | Echinochloa spp.**, Z. mays and S. bicolor | 75 | ND | 1 | 42.3 | 83 | ND | 1 |
| BADH2 | DQ910546 | S. bicolor**, Z. mays, H. vulgare, Leymus chinensis, Zoysia tenuifolia, T. aestivum, Pascopyrum smithii, Schedonorus arundinaceus, Elymus trachycaulus, Phalaris arundinacea, Psathyrostachys juncea and Agropyron cristatum | 79 | C | 2 | 58.1 | 88 | ND | 3 |
| GW2 | EF447275 | H. vulgare**, Z. mays and S. bicolor | 81 | ND | 3 | 73.2 | 90 | ND | 2 |
| GIF1 | EU095553 | Dendrocalamopsis oldhamii**, Z. mays, S. bicolor, T. aestivum, P. edulis, Saccharum hybrid, L. perenne and B. distachyon | 77 | C | 2 | 59.2 | 87 | ND | 2 |
| Ghd7 | EU286801 | Not found | N/A | N/A | N/A | 30.4 | 82 | ND | 1 |
| PROG1 | FJ155665 | H. vulgare**, P. edulis, Z. mays, T. aestivum and S. bicolor | 62 | NS | N/A | 53.1 | 87 | ND | 1 |

  1. MPS, massively parallel sequencing; SSPM, species-specific primer method; UPM, universal primer method.

  2. *BLASTN search was carried out using the Genbank accession number. The parameters were set as: choose search set: database others, programme selection: optimize for highly similar sequences (megablast). The algorithm parameters were set as default with the exception of max target sequences changed to 20 000.

  3. Sequencher® allows for 0–100 bp minimum match percentage to perform alignment. The default is set at 20 bp. Alignment was not possible for match below 60%.

  4. Types of primer possible were NS, not sufficient, ND, sufficient for nondegenerate primers, C, sufficient for combination of nondegenerate and degenerate primers and N/A, not applicable.

  5. §Number of PCR amplicons for amplifying the gene, maximum amplicon size is limited to 4 kbp. N/A, not applicable.

  6. Calculation is based on the length of consensus sequence divided by the length of reference sequence.

  7. **Indicates the BLASTN result that gave the best alignment and used for universal primer placement for UPM.

A complete M. stipoides homologue (EF600044) was already published for the Waxy gene (Shapter et al., 2009). For the purpose of this study, the M. stipoides Waxy homologue from Genbank was disregarded and the search for homologue(s) of Wx gene in other Poaceae species was performed as per the other genes.

Maximum possible primer placement for SSPM

Massively parallel sequencing performed by Nock et al. (2011) produced over 23 million, 36-bp paired end reads with a total of 1.6 giga base pairs (gbp). Based on the size of M. stipoides genome (Murray et al., 2005), the maximum possible coverage of the whole genome was 2X. Whole genome sequencing resulted in patchy coverage of the M. stipoides genome when aligned back to rice as a reference. Coverage of putative homologues of targeted genes varied as shown in Table 2. On average, 44% of the M. stipoides sequence could be aligned to the targeted reference gene sequences. Despite the low coverage, it was sufficient for species-specific nondegenerate primer placement within the putative exon boundaries for all targeted genes except one, Ehd1, where the capture of exons was too low (three exons of the total of five exons). The remaining 17 genes had possible placements of nondegenerate species-specific primer pairs for maximum gene capture (Figure S1).
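The 2X figure can be checked with a short calculation. The sketch below is illustrative only: it assumes that "23 million paired end reads" means 23 million read pairs (two 36-bp reads each), which is the reading consistent with the reported 1.6 gbp total, and it uses the approximately 869 mega base pair genome size reported earlier in the text. The function and constant names are hypothetical.

```python
def genome_coverage(read_pairs: float, read_length_bp: int, genome_size_bp: float) -> float:
    """Theoretical fold coverage from paired end reads (two reads per pair)."""
    total_bases = read_pairs * 2 * read_length_bp
    return total_bases / genome_size_bp


# Assumptions: "over 23 million" is taken as exactly 23 million pairs,
# and the genome size is taken as 869 Mbp.
READ_PAIRS = 23_000_000
READ_LENGTH_BP = 36
GENOME_SIZE_BP = 869e6

total_gbp = READ_PAIRS * 2 * READ_LENGTH_BP / 1e9
coverage = genome_coverage(READ_PAIRS, READ_LENGTH_BP, GENOME_SIZE_BP)

print(f"total sequence: {total_gbp:.2f} Gbp")
print(f"theoretical coverage: {coverage:.1f}X")
```

Under these assumptions the result is just under 2X, which agrees with the rounded maximum coverage quoted above. It is an upper bound because it ignores duplicate reads, organellar sequence and reads that fail quality filtering.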

On average, 94.6% of the rice reference sequences could potentially be captured with the SSPM. Maximum possible gene capture by SSPM ranged from a minimum of 32.3% for Ehd1 to a maximum of 100% for GW2 (Table S1). For comparison, across the 15 genes for which areas of sequence conservation were identified between an Oryza spp. and other Poaceae sufficient for universal primer design, an average 85.3% of the rice reference gene sequence could be captured (Table S1). This ranged from 100% capture of the coding sequence for Hd1 and GW2 to a minimum capture of 19.6% and 37.6% in PROG1 and Rc, respectively.

UPM PCR primers

Primer design was not attempted for Rc, qSW5, Ehd1, GS3, Ghd7 and PROG1 for the reasons that were mentioned earlier. PCR primer design was attempted within the identified maximum placement of UPM primers for the remaining 12 genes (Hd1, qSH-1, Gn1a, Wx, Sd-1, MOC1, Hd3a, Hd6, sh4, BADH2, GW2 and GIF1). Using the criteria as described in the materials and methods section, nondegenerate primers were successfully designed for six genes (Hd1, Sd-1, MOC1, Hd3a, sh4 and GW2), whereas a combination of nondegenerate primers and degenerate primers was needed for five genes (qSH-1, Wx, Hd6, BADH2 and GIF1). No forward primer meeting the set criteria was found for Gn1a; however, a primer pair encompassing exons 2–4 was identified. In summary, the UPM method allowed for PCR primer design encompassing the gene coding region for 11 genes of the 18 genes (Figure S2).

SSPM PCR primers

Primer design was not attempted for Ehd1 for the reason mentioned earlier. PCR primers were designed within the identified maximum placement of SSPM primers for the remaining 17 genes (Hd1, qSH-1, Gn1a, Rc, qSW5, Wx, Sd-1, MOC1, Hd3a, Hd6, GS3, sh4, BADH2, GW2, GIF1, Ghd7 and PROG1). Using the criteria as described in the materials and methods section, nondegenerate species-specific primers were successfully designed for all of the genes with the exception of GS3. Compatible primers were identified only on exon one and exon five for GS3, and the size of the PCR amplicon exceeded 4 kbp. There was insufficient coverage of exons 2, 3 and 4 to allow for primer design. GS3 was omitted from the subsequent steps. In summary, the SSPM allowed for PCR primer design encompassing the gene coding region for 16 of the 18 genes (Figure S2).

Figure S1 shows the position of primers using both UPM and SSPM within the identified maximum possible primer placements. SSPM primers covered a larger portion of the genes than UPM primers. SSPM primers covered all exons for Gn1a, Rc, qSW5, Ehd1, GS3, Ghd7 and PROG1, whereas UPM primers for these genes were either not possible, did not cover all exons or provided coverage that was too low to give meaningful sequence information. Table 3 shows the percentage of each gene covered by the UPM and SSPM primers designed. Using the UPM, the highest proportion of a gene covered by the primers was 99.2% (Wx), the lowest was 17.1% (PROG1) and, on average, 79.7% of the gene sequence was covered (Table 3). Using the SSPM, the whole of GW2 could be covered by the primers, followed by Rc (99.9%), Wx (99.8%) and Hd6 (99.5%); the lowest was 32.3% (Ehd1) and, on average, 92% of the gene sequence was covered (Table 3).
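The averages and extremes quoted here can be re-derived from the per-gene percentages in Table 3. The following sketch transcribes those values by hand (so any transcription slip is ours, not the authors') and assumes that genes with no UPM percentage in the table are simply left out of the UPM average, which reproduces the reported figure.

```python
from statistics import mean

# Percentage of each gene covered by the designed primers, transcribed from
# Table 3 as (UPM, SSPM). None marks genes with no UPM value in the table.
TABLE3 = {
    "Hd1": (92.9, 97.7), "qSH-1": (82.9, 89.2), "Gn1a": (83.1, 96.3),
    "Rc": (32.7, 99.9), "qSW5": (None, 77.3), "Wx": (99.2, 99.8),
    "Sd-1": (92.1, 98.4), "MOC1": (61.9, 95.9), "Hd3a": (92.3, 97.2),
    "Ehd1": (None, 32.3), "Hd6": (98.6, 99.5), "GS3": (60.0, 94.2),
    "sh4": (90.2, 89.5), "BADH2": (98.8, 99.6), "GW2": (97.7, 100.0),
    "GIF1": (96.1, 98.2), "Ghd7": (None, 94.7), "PROG1": (17.1, 96.0),
}

upm = {gene: u for gene, (u, _) in TABLE3.items() if u is not None}
sspm = {gene: s for gene, (_, s) in TABLE3.items()}


def summarize(label: str, values: dict) -> float:
    """Print the mean, highest and lowest coverage and return the mean."""
    best = max(values, key=values.get)
    worst = min(values, key=values.get)
    avg = mean(list(values.values()))
    print(f"{label}: n={len(values)}, mean={avg:.1f}%, "
          f"highest={best} ({values[best]}%), lowest={worst} ({values[worst]}%)")
    return avg


upm_mean = summarize("UPM", upm)
sspm_mean = summarize("SSPM", sspm)
```

With this reading, the UPM average is taken over 15 genes and the SSPM average over all 18, so the two means are not computed over the same gene set.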

Table 3.   Comparison of percentage of gene covered by primers designed using universal primer method (UPM) and species-specific primer method (SSPM)

| Gene | Genbank entry | UPM: percentage gene captured by primers | UPM: exons covered | SSPM: percentage gene captured by primers | SSPM: exons covered |
|---|---|---|---|---|---|
| Hd1 | AB041837 | 92.9 | All (1–2) | 97.7 | All (1–2) |
| qSH-1 | AB071333 | 82.9 | All (1–4) | 89.2 | All (1–4) |
| Gn1a | AB205193 | 83.1 | Exons 2–4 of 4 | 96.3 | All (1–4) |
| Rc | AB250059 | 32.7 | Exons 5–8 of 8 | 99.9 | All (1–8) |
| qSW5 | AB433345 | | None | 77.3 | All (1) |
| Wx | AF031162 | 99.2 | All (2–14) | 99.8 | All (2–14) |
| Sd-1 | AF465255 | 92.1 | All (1–3) | 98.4 | All (1–3) |
| MOC1 | AY242058 | 61.9 | All (1) | 95.9 | All (1) |
| Hd3a | BD169069 | 92.3 | All (1–4) | 97.2 | All (1–4) |
| Ehd1 | BD407931 | | None | 32.3 | Exons 4–6 of 6 |
| Hd6 | DQ157464 | 98.6 | All (1–10) | 99.5 | All (1–10) |
| GS3 | DQ355996 | 60.0 | Exons 1–3 of 5 | 94.2 | All (1–5)* |
| sh4 | DQ383373 | 90.2 | All (1–2) | 89.5 | All (1–2) |
| BADH2 | DQ910546 | 98.8 | All (1–15) | 99.6 | All (1–15) |
| GW2 | EF447275 | 97.7 | All (1–8) | 100.0 | All (1–8) |
| GIF1 | EU095553 | 96.1 | All (1–7) | 98.2 | All (1–7) |
| Ghd7 | EU286801 | | None | 94.7 | All (1–2) |
| PROG1 | FJ155665 | 17.1 | Poor alignment of exon 1 of 1 | 96.0 | All (1) |

  1. *For GS3, good alignments were only available for exons 1 and 5 and primers were available for these exons. Using primers from exons 1 and 5 would result in a PCR amplicon that exceeds 4 kbp in size and may not be achievable using our method. Exons 2, 3 and 4 did not have enough Microlaena stipoides sequence information to allow for SSPM primer design. GS3, along with Ehd1, which did not have enough coverage to allow for all exons to be amplified, was excluded from further analysis.

UPM PCR amplification and Sanger sequencing results

Of the 11 genes with universal primers designed, PCRs were conducted for three genes, qSH-1, sh4 and GW2 (Table S2). Post-PCR agarose gel separation and visualization of the qSH-1 amplicon (exons 1–4) produced a distinct band at the expected size. Partial sequencing and alignment to exons 1 and 3 of the reference gene showed 85% and 86% homology, respectively, to rice qSH-1 (Table S3). The primer pair for sh4 produced a distinct major band at the expected size with three smaller faint bands that were separated by gel extraction. The major band was excised and sequenced, and exons 1 and 2 showed 81% homology to the sh4 shattering gene in rice.

Three pairs of primers were designed for GW2: the first pair encompassed exons 1–5; the second pair, exons 5–6; and the third pair, exons 7–8. PCR amplification of exons 1–5 produced two distinct fragments at the expected size. These two fragments were sequenced and showed two distinct but homologous sequence reads in M. stipoides. Sequencing of both fragments showed homology of 92% with rice GW2 exons 1–5 (Table S3). PCR amplifications of exons 5–6 and 7–8 both produced single fragments. The exon 5/6 amplicon had 94% and the exon 7/8 amplicon had 86% homology to their respective rice GW2 exons (Table S3). In summary, all five universal primer pairs tested produced amplicons that were successfully Sanger-sequenced and confirmed as putative homologues in M. stipoides.

SSPM PCR amplification and Sanger sequencing results

Species-specific primers were designed for 16 of the 18 genes using the MPS M. stipoides data. A total of 24 PCR primer pairs were designed, with some genes requiring more than one primer pair for full coverage (Figure S2 and Table S2). PCR amplification using 13 of the 24 primer pairs yielded multiple nonspecific fragments that could not be sequenced. The remaining 11 primer pairs produced amplicons that showed >75% sequence homology in putative coding regions when aligned back to Oryza. Two of these 11 primer pairs produced additional fragments that showed low homology (49% and 55%) to Oryza (Table S3).


The utility of UPM is well documented (Ramirez et al., 2004; McIntosh et al., 2005; Shapter et al., 2009; Zeid et al., 2010); however, there are three major limitations to this design method: first, identifying sufficient areas of conservation between species for potential primer design which capture enough of the target gene homologue; secondly, whether these areas of conservation meet the minimum criteria for PCR primer design; and lastly, whether the universal primers are specific enough to amplify the target homologue successfully. The use of degenerate primers in areas of mediocre conservation is a common strategy for working around the first limitation of the UPM; however, this increases the risk of amplification of unrelated sequences and/or other members of a gene family, because of the reduced specificity of the primers (Linhart and Shamir, 2005).
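The loss of specificity with degenerate primers can be made concrete by counting how many distinct oligonucleotides a degenerate primer represents. The sketch below uses the standard IUPAC nucleotide ambiguity codes; the function name and the example primer are hypothetical and are not taken from this study.

```python
from math import prod

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct sequences represented by a (degenerate) primer."""
    return prod(len(IUPAC_BASES[base]) for base in primer.upper())


# Hypothetical primers, for illustration only.
exact_primer = "ATGGCATTCGAACC"
degenerate_primer = "ATGGCRTTYGANCC"   # R = A/G, Y = C/T, N = any base

print(degeneracy(exact_primer), degeneracy(degenerate_primer))
```

Each ambiguous position multiplies the size of the primer pool, so only a fraction of the primer molecules in a reaction match any one template exactly, while the pool as a whole can anneal to more off-target sites.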

In this study, primer sites were identified for 11 of the 18 putative gene homologues using the UPM. Owing to the established utility of the UPM, we performed amplifications on just three of the genes, qSH-1, sh4 and GW2. Of the five primer pairs utilized, all produced PCR fragments that were identified as putative homologues in M. stipoides. Universal primer design was possible for six gene homologues using nondegenerate primers and five gene homologues using a combination of nondegenerate and degenerate primers. Although the UPM might not have amplified all of the gene homologues for which degenerate primers were designed, the successful amplification using all of the primer pairs tested suggests that homologue capture by the UPM is highly successful where primer location is not a major limitation.

Using the SSPM, 16 of the 18 genes targeted had nondegenerate primer pairs identified, allowing for the primer design for an additional five genes compared with the UPM and theoretically improved specificity for all primers designed. However, of the 24 PCR primer pairs that were tested in M. stipoides, only 11 were successful, capturing six putative homologues of the desired genes in M. stipoides. Sanger sequencing confirmed more than 75% sequence homology with the reference gene exons for all 11 primer pairs (Table S2), although two additional fragments with low sequence homology were identified.

Massively parallel sequencing is a rapidly evolving technology, particularly in terms of the capacity of the short read platforms such as Illumina GAII. In the past year, sequence output potential of the platform has more than doubled in terms of length of individual reads and total read number per lane. The coverage bias currently associated with the short read platforms, which leads to parts of the genome or genes being over-represented and other parts consequently being under-represented (Harismendy et al., 2009), is also a limiting factor for this methodology. While complete genome or gene coverage of target species is not necessary to mine gene homologues using the SSPM, the efficacy of the SSPM will be greatly enhanced as the read length, read number and evenness of genomic coverage improve over the next generation of chemistry and hardware. We were able to capture six genes from a theoretical 2X genome coverage using 36-bp paired end reads on the Illumina Genome Analyzer platform. The latest improvements in the platform allow for 150-bp paired end read lengths and up to 640 million paired end reads per flow cell (http://www.illumina.com/systems/genomeanalyzer_iix.ilmn, last accessed 7th December 2010), equating to 80 million paired end reads per lane, which would result in a theoretical coverage of the M. stipoides genome of 27X. The greater the proportion of a gene sequence available for primer design, the more rigorous the design criteria can be, and this should improve the amplification success from this methodology.
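The projected 27X figure follows from the same arithmetic as the 2X figure for this study. The sketch below assumes that each "paired end read" is a pair of two reads, that the genome is approximately 869 mega base pairs, and that the stated per-lane and per-flow-cell numbers imply eight lanes; the names are hypothetical.

```python
GENOME_SIZE_BP = 869e6  # assumed M. stipoides genome size (~869 Mbp)


def paired_end_coverage(read_pairs: float, read_length_bp: int,
                        genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Theoretical fold coverage, counting two reads per pair."""
    return read_pairs * 2 * read_length_bp / genome_size_bp


PAIRS_PER_FLOW_CELL = 640e6
PAIRS_PER_LANE = 80e6
lanes_per_flow_cell = PAIRS_PER_FLOW_CELL / PAIRS_PER_LANE

this_study = paired_end_coverage(23e6, 36)             # 36-bp reads, 23 million pairs
projected = paired_end_coverage(PAIRS_PER_LANE, 150)   # 150-bp reads, one lane

print(f"lanes per flow cell implied: {lanes_per_flow_cell:.0f}")
print(f"this study: {this_study:.1f}X, one lane at 150 bp: {projected:.1f}X")
print(f"fold increase: {projected / this_study:.1f}")
```

Under these assumptions a single lane would give a little over 27X, roughly a fourteen-fold increase over the data used here, before any allowance for coverage bias.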

Rice as a model cereal plant has a relatively small genome size, approximately 420–466 mega base pairs (Goff et al., 2002; Yu et al., 2002). Other crop and model plants have much bigger genomes, such as maize at 2.3 gbp (Schnable et al., 2009), barley at more than 5 gbp (Schulte et al., 2009) and hexaploid wheat at 17 gbp (Paux et al., 2008). The improvements in the MPS technology may not be sufficient to produce gap-free whole genome sequence of these species at low cost, especially with the presence of coverage bias as previously mentioned. There are a number of methods to increase the coverage of gene regions in whole genome sequencing. Some of the commonly used methods are methylation filtering, high Cot selection and microarrays. Whitelaw et al. (2003) successfully combined methylation filtering and high Cot selection in the enrichment of gene-coding sequences in maize. Chou et al. (2010) used a high-density oligonucleotide microarray to enrich DNA samples for a targeted gene. The combination of gene enrichment methods and MPS technology has the potential to improve the coverage of targeted gene sequences, which will allow for more species-specific primer design and the capture of desired gene homologues in the target species.

The UPM has been used for the rapid amplification of gene homologues in another species when the target species has little or no known DNA sequence data. When the target species has DNA sequence data generated from MPS, the SSPM can be used as a method for the rapid amplification of important gene homologues. Even with the limitations of the platform that were current at the time of data collection for this study, the SSPM allowed for more primer combinations with greater gene coverage of most of the target genes. The SSPM also eliminated the need for obtaining target gene sequence in more than one species and the need to design degenerate primers with their associated complications (Linhart and Shamir, 2005). Another advantage of being able to design primers outside highly conserved areas of the gene is that such primers are less susceptible to mispriming with members of a gene family or closely related genes. SSPM primers were tested under ‘first pass’ conditions and with a single primer pair per amplicon; therefore, the success rate of the PCR amplifications reported here for SSPM represents a minimum baseline of success, which should be easily improved upon with the design of alternative primer pairs available from the MPS data and optimization of PCR reagents and conditions.

The SSPM is expected to produce species-specific DNA sequence; however, our data show that this method still produced nonspecific DNA sequence. This could be explained by a number of possible causes. Low coverage of our target genome may have led to poor initial alignment with the reference sequences leading to incorrect primer site identification. It is expected that improved MPS technology will result in higher coverage and partially resolve this problem. Primer design stringency may have been too low; however, higher MPS coverage would allow for the design of longer and more specific primers.

Massively parallel sequencing data can be used to mine desired genes from poorly characterized species using a single well-described species’ sequence data as a reference. Universal primers designed from sequence from multiple species have a number of limitations and involve more steps than utilizing MPS. Improvements in the MPS technology will enable more and better quality data to be generated rapidly. This method will be applicable for gene homologue mining in any poorly described species where an appropriate single reference gene is available. In this study, MPS facilitated the amplification of homologues of six previously undescribed genes in a species with limited genetic data available, illustrating that the SSPM will be a valuable complementary approach to the discovery of useful genes in wild crop relatives.

Experimental procedures

Plant material and DNA extraction

Native Seeds Pty Ltd (http://nativeseeds.com.au/, last accessed 26th May 2010) provided M. stipoides seed from their AR1 breeding line. Fresh leaf tissue was harvested from a plant for DNA extraction. The method of DNA extraction was based on Carroll et al. (1995) with the following exceptions: 3 g of leaf tissue was ground to a fine powder in liquid nitrogen using a mortar and pestle, and the quantity of reagents was adjusted proportionally.

Identifying conserved regions between rice and other Poaceae for maximum possible primer placement for UPM

Sequences for 18 rice domestication genes (Table 1) were obtained from the NCBI GenBank website (http://www.ncbi.nlm.nih.gov/genbank/, last accessed 10th June 2010). Searches using nucleotide blast (blastn) optimized for highly similar sequences (megablast) (http://blast.ncbi.nlm.nih.gov/Blast.cgi, last accessed 9th August 2010) were performed on each gene to find homologue(s) in other species within Poaceae. Alignments between the exons from the rice reference gene and the homologue(s) found through blastn were made using Sequencher® V4.6 (Gene Codes Corporation, Ann Arbor, MI). Where more than one homologue was found, the alignment with the most sequence matches was accepted. Regions at the start and the end of the coding region of each gene with at least 17 bases of conserved sequence and a maximum of two degenerate bases were identified as possible sites for primer placement to allow for the amplification of putative gene homologues of the rice reference genes in M. stipoides.
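The window criterion described above can be illustrated with a minimal Python sketch. It is not the procedure run in Sequencher; it assumes a pairwise alignment supplied as two equal-length strings with `-` for gaps, counts each mismatched column as one degenerate base, and uses the hypothetical function name `candidate_primer_sites`.

```python
def candidate_primer_sites(ref, hom, min_len=17, max_degenerate=2):
    """Return 0-based start columns of alignment windows usable as primer sites.

    Assumptions (not from the original method): `ref` and `hom` are rows of a
    pairwise alignment of equal length, '-' marks a gap, and a column where
    the two bases differ counts as one degenerate position.
    """
    if len(ref) != len(hom):
        raise ValueError("aligned sequences must have the same length")
    ref, hom = ref.upper(), hom.upper()
    sites = []
    for start in range(len(ref) - min_len + 1):
        columns = list(zip(ref[start:start + min_len], hom[start:start + min_len]))
        # A window that contains a gap cannot host a primer.
        if any(a == "-" or b == "-" for a, b in columns):
            continue
        mismatches = sum(a != b for a, b in columns)
        if mismatches <= max_degenerate:
            sites.append(start)
    return sites


rice = "ACGTACGTACGTACGTACGT"
other = "TGCTACGTACGTACGTACGT"  # differs from rice in the first three columns
print(candidate_primer_sites(rice, other))
```

In practice only windows near the start and the end of the coding region would be retained, so the returned positions would be filtered by their distance from each end.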

Identifying regions in Microlaena stipoides aligned to rice reference genes for maximum possible primer placement for SSPM

Sequences for 18 rice domestication genes (Table 1) were obtained from the NCBI GenBank website. M. stipoides sequence data were obtained from a study by Nock et al. (2011). Single and paired end short reads produced by the MPS method were assembled against reference Oryza gene sequence data using CLC Genomics Workbench 3.6.5 (CLC bio, Aarhus N, Denmark). Consensus sequences identified were exported to Sequencher® V4.6 and aligned back to the reference genes from rice. Regions of M. stipoides sequence with at least 17 consecutive bases at the start and at the end of the coding region of each gene were identified as possible sites for primer placement to allow for the amplification of putative gene homologues of the rice reference genes in M. stipoides.
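As a rough illustration of the 17-base requirement, the following Python sketch lists stretches of a consensus sequence that are long enough to host a species-specific primer. It assumes that positions without read support appear as `N` or `-` in the exported consensus, which may differ from the actual CLC Genomics Workbench output; the function name `covered_runs` is hypothetical.

```python
import re


def covered_runs(consensus, min_len=17):
    """Return (start, end) pairs, 0-based and end-exclusive, for every run of
    at least `min_len` consecutive called bases in `consensus`.

    Assumption: uncovered positions are written as 'N' or '-'.
    """
    pattern = re.compile(r"[ACGT]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(consensus.upper())]


# Toy consensus: a 20-base run, a 12-base run (too short) and a 17-base run.
toy = "NN" + "ACGT" * 5 + "N" + "ACGT" * 3 + "-" + "A" * 17
print(covered_runs(toy))
```

Runs that lie close to the start or the end of the coding region would then be passed to primer design.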

PCR primer design

Clone Manager Professional 9 (Scientific & Educational Software, Cary, NC) was used for both UPM and SSPM primer design. PCR amplicons were limited to 4 kbp. Clone Manager Professional 9 primer criteria for both UPM and SSPM primers were set as follows: a minimum of 17 bases at 45%–65% GC content, Tm between 55 and 65 °C with a Tm difference between forward and reverse primers within ± 4 °C, stability of 5′ versus 3′ at 1.2 kcal, fewer than three matches for 3′ dimers, fewer than seven matches for any dimer, base runs of fewer than five, fewer than three repeats of dinucleotide pairs, protein degeneracy of less than eight within 6 bp of the 3′ end and no more than one G or C at the 3′ end. For UPM primers, a maximum of two bases of degeneracy was allowed. Primers that met all of these criteria were chosen whenever available; otherwise, a maximum of two unmet criteria was accepted.
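A subset of these criteria can be expressed as a short Python sketch. It does not reproduce Clone Manager: the Tm is taken as an input because the software's Tm formula is not stated here, dimer, stability and degeneracy checks are omitted, "base runs" is interpreted as runs of a single repeated base, and the two-unmet-criteria allowance is applied per primer. All function names are hypothetical.

```python
from itertools import groupby


def gc_percent(seq):
    """Percentage of G and C bases in a primer."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq):
    """Length of the longest run of one repeated base."""
    return max(len(list(group)) for _, group in groupby(seq.upper()))


def primer_failures(seq, tm_celsius):
    """List the criteria (from the subset modelled here) that a primer fails."""
    failures = []
    if len(seq) < 17:
        failures.append("length")
    if not 45.0 <= gc_percent(seq) <= 65.0:
        failures.append("gc")
    if not 55.0 <= tm_celsius <= 65.0:
        failures.append("tm")
    if longest_run(seq) >= 5:  # assumed reading of "less than five base runs"
        failures.append("base_run")
    return failures


def pair_acceptable(fwd, fwd_tm, rev, rev_tm, max_unmet=2, max_tm_diff=4.0):
    """Accept a pair if the Tm values are close enough and each primer has
    at most `max_unmet` failed criteria (assumed to apply per primer)."""
    if abs(fwd_tm - rev_tm) > max_tm_diff:
        return False
    return all(len(primer_failures(seq, tm)) <= max_unmet
               for seq, tm in ((fwd, fwd_tm), (rev, rev_tm)))


print(primer_failures("ACGTACGTACGTACGTAC", 60.0))
print(primer_failures("AAAAATTTTTAAAAATTTTT", 40.0))
```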

PCR amplification and sequencing preparation

Universal primers and species-specific primers used for PCR amplification are listed in Table S2. PCRs were conducted using Platinum®Taq DNA Polymerase (Invitrogen Australia Pty. Ltd., Mulgrave, Vic., Australia) at 1 unit/reaction with 1× PCR Buffer, primers at 0.2 μM each, 3% DMSO, 3% glycerol, 2 mM magnesium chloride and 10 ng of genomic DNA. The PCR programme was set for one cycle of denaturation at 96 °C for 2 min, followed by 35 cycles of denaturation at 96 °C for 30 s, annealing at the temperature required by the respective primer pair for 30 s and extension at 68 °C for 1 min per kbp of PCR product. A single cycle at 72 °C for 5 min allowed for further extension. Optimization of PCR was carried out using a temperature gradient for the annealing step, ranging from 5 °C below to 5 °C above the average annealing temperature of the primer pair. The PCR products were separated by electrophoresis on a 1% agarose gel stained with ethidium bromide to confirm amplification and then sequenced to confirm homology. Where a single PCR product was amplified in a reaction, ExoSAP-IT (GE Healthcare Bio-Sciences Pty. Ltd., Rydalmere, NSW, Australia) was used to clean up the PCR product prior to direct Sanger sequencing. Multiple PCR products in a reaction were separated via gel electrophoresis, individually excised and then cleaned using the QIAquick Gel Extraction Kit (Qiagen Pty Ltd, Doncaster, Vic., Australia) prior to direct Sanger sequencing. Figure S2 summarizes the workflow involved in comparing the two methods for characterizing homologues of known genes in poorly described genomes using UPM and SSPM.
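The cycling conditions above scale with amplicon length, which the following Python sketch makes explicit. It encodes only the hold times stated in the text and ignores ramp times; the number of gradient steps is an assumption that depends on the thermal cycler, and the function names are hypothetical.

```python
def pcr_programme(amplicon_bp, annealing_c, cycles=35):
    """Return the programme as (step, temperature in °C, seconds, repeats)."""
    extension_s = 60 * amplicon_bp / 1000  # 1 min per kbp of PCR product
    return [
        ("initial denaturation", 96.0, 120, 1),
        ("denaturation", 96.0, 30, cycles),
        ("annealing", annealing_c, 30, cycles),
        ("extension", 68.0, extension_s, cycles),
        ("final extension", 72.0, 300, 1),
    ]


def total_hold_seconds(programme):
    """Sum of hold times only; ramping between temperatures is not modelled."""
    return sum(seconds * repeats for _, _, seconds, repeats in programme)


def annealing_gradient(mean_annealing_c, half_range_c=5.0, steps=5):
    """Evenly spaced annealing temperatures from 5 °C below to 5 °C above the
    mean annealing temperature. `steps` is an assumed cycler setting (>= 2)."""
    low = mean_annealing_c - half_range_c
    increment = 2 * half_range_c / (steps - 1)
    return [low + i * increment for i in range(steps)]


programme = pcr_programme(amplicon_bp=2000, annealing_c=60.0)
print(total_hold_seconds(programme) / 60, "min of hold time")
print(annealing_gradient(60.0))
```

For the 4 kbp amplicon limit used here, the extension step alone is 4 min per cycle, so the hold time is dominated by extension for long targets.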


The authors thank Catherine Nock for providing MPS assembly data and for providing assistance in data analysis, Timothy Sexton for valuable discussions in data analysis, Daniel Waters, Gopala Subbayiyan and Peter Bundock for discussions in methodologies, Southern Cross Plant Genomics for performing the sequencing and Jared Baxter for producing Figure S1. Funding was provided by the Australian Research Council-Linkage Project ID LP0776409.