Editor: Michael Galperin
De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads
Article first published online: 10 DEC 2008
© 2008 Federation of European Microbiological Societies. Published by Blackwell Publishing Ltd. All rights reserved
FEMS Microbiology Letters
Volume 291, Issue 1, pages 103–111, February 2009
How to Cite
Farrer, R. A., Kemen, E., Jones, J. D.G. and Studholme, D. J. (2009), De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads. FEMS Microbiology Letters, 291: 103–111. doi: 10.1111/j.1574-6968.2008.01441.x
- Issue published online: 5 JAN 2009
- Received 5 September 2008; accepted 4 November 2008. First published online 10 December 2008.
- genome sequencing;
- de novo sequence assembly;
- Pseudomonas syringae;
Illumina's Genome Analyzer generates ultra-short sequence reads, typically 36 nucleotides in length, and is primarily intended for resequencing. We tested the potential of this technology for de novo sequence assembly on the 6 Mbp genome of Pseudomonas syringae pv. syringae B728a with several freely available assembly software packages. Using an unpaired data set, velvet assembled >96% of the genome into contigs with an N50 length of 8289 nucleotides and an error rate of 0.33%. edena generated smaller contigs (N50 of 4192 nucleotides) with comparable error rates. ssake and vcake yielded shorter contigs with very high error rates. Assembly of paired-end sequence data carrying 400 bp inserts produced longer contigs (N50 up to 15 628 nucleotides), but with increased error rates (0.5%). Contig length and error rate were very sensitive to the choice of parameter values. Noncoding RNA genes were poorly resolved in de novo assemblies, while >90% of the protein-coding genes were assembled with 100% accuracy over their full length. This study demonstrates that, in practice, de novo assembly of 36-nucleotide reads can generate reasonably accurate assemblies from about 40× deep sequence data sets. These draft assemblies are useful for exploring an organism's proteomic potential at very low cost.
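The two quantities the abstract leans on, N50 contig length and sequencing depth, can be illustrated with a minimal sketch. It assumes the conventional definition of N50 (the contig length at which the cumulative length of the longest contigs first reaches half of the total assembled bases) and a round 6 000 000 bp genome; the abstract does not state exactly how the authors computed either figure, and the function names `n50` and `reads_for_depth` are hypothetical.

```python
import math


def n50(contig_lengths):
    """Return the N50 of a set of contig lengths.

    Assumed definition: sort contigs from longest to shortest and return
    the length of the contig at which the running total first reaches
    half of the summed contig length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


def reads_for_depth(genome_size, read_length, depth):
    """Number of reads needed so that reads * read_length / genome_size >= depth."""
    return math.ceil(genome_size * depth / read_length)


if __name__ == "__main__":
    # Toy contig set, not data from the study.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
    # Assumed round genome size of 6 Mbp, 36-nucleotide reads, 40x depth.
    print(reads_for_depth(6_000_000, 36, 40))
```

Under these assumptions, 40× depth on a 6 Mbp genome with 36-nucleotide reads corresponds to roughly 6.7 million reads. Note that N50 says nothing about correctness, which is why the abstract reports an error rate alongside each N50 value.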