Charles Cannon studies the ecology, evolution, and conservation of tropical Asian forests using a variety of approaches. Currently, he is committed to extending genomic techniques to the study of tropical biodiversity. Chai-Shian Kua has a background in human cancer genetics and is now focused on the application of cutting-edge technologies in non-model systems. Zhang Di helped develop the software for this analysis and has a growing interest in bioinformatics and genomics. John Harting is working on general multivariate evolutionary theory, using stochastic models based upon Price’s theorem. He continues to be involved in conservation research in Indonesia.
Assembly free comparative genomics of short-read sequence data discovers the needles in the haystack
Article first published online: 10 FEB 2010
© 2010 Blackwell Publishing Ltd
Special Issue: Next Generation Molecular Ecology
Volume 19, Issue Supplement s1, pages 147–161, March 2010
How to Cite
CANNON, C. H., KUA, C.-S., ZHANG, D. and HARTING, J.R. (2010), Assembly free comparative genomics of short-read sequence data discovers the needles in the haystack. Molecular Ecology, 19: 147–161. doi: 10.1111/j.1365-294X.2009.04484.x
- Issue published online: 10 FEB 2010
- Received 15 June 2009; revision received 28 September 2009; accepted 13 October 2009
Keywords: simulated SRS data; stone oaks
Most comparative genomic analyses of short-read sequence (SRS) data rely upon the prior assembly of a reference sequence. Here, we present an assembly free analysis of SRS data that discovers sequence variants among focal genomes by tabulating the presence and frequency of ‘complex’ fragments in the data. Using data from nine tree species, we compare genomic diversity from populations to families. As a control, we simulated SRS data for three known plant genomes. The results provide insight into the quality and distributional bias of the sequencing reaction. Three main types of informative complexmers were identified, each possessing unique statistical properties. Type I complexmers are unique to a genome but suffer from a high false positive rate, being highly dependent on read coverage and distribution. Type II complexmers are shared between two genomes and can highlight potential copy-number differences. Type III complexmers are exclusive to a subset of genomes and can be useful for associating genetic differences with phenotypic or geographic variation. At the population level in an endangered timber species, numerous markers were identified that could potentially determine geographic origin of individuals and regulate international trade. We observed that the genomic data for the four fig species were more divergent than for stone oak species, possibly due to their complex pollination syndrome and high rates of gene flow. Our approach greatly enhances the application of SRS technology to the study of non-model organisms and directly identifies the most informative genetic elements for more detailed study and assembly.
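The tabulation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' software. The fragment length `k`, the complexity filter `is_complex` (here simply "at least three distinct bases and no N"), and all function names are assumptions made for illustration, since the abstract does not state them. The classification is one possible reading of the three categories: a fragment found in exactly one genome is labelled Type I, one found in exactly two genomes Type II, and one found in more than two but not all genomes Type III.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-length substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def is_complex(kmer, min_distinct=3):
    """Hypothetical complexity filter (an assumption, not the paper's rule):
    keep fragments with no ambiguous base and at least min_distinct bases."""
    return "N" not in kmer and len(set(kmer)) >= min_distinct


def tabulate(reads, k):
    """Count how often each complex fragment occurs in one genome's reads."""
    counts = Counter()
    for read in reads:
        for km in kmers(read.upper(), k):
            if is_complex(km):
                counts[km] += 1
    return counts


def classify(tables):
    """Label each fragment by the set of genomes in which it was observed.

    tables maps a genome name to the Counter returned by tabulate().
    Returns a dict: fragment -> (label, tuple of genome names).
    """
    n = len(tables)
    result = {}
    for km in set().union(*tables.values()):
        present = tuple(sorted(g for g, t in tables.items() if km in t))
        if len(present) == 1:
            label = "I"        # seen in a single genome only
        elif len(present) == 2:
            label = "II"       # shared by exactly two genomes
        elif len(present) < n:
            label = "III"      # exclusive to a larger subset of genomes
        else:
            label = "all"      # present everywhere, uninformative here
        result[km] = (label, present)
    return result
```

Calling `classify({"A": tabulate(reads_a, 4), "B": tabulate(reads_b, 4)})` on two lists of read strings returns the label and genome set for every retained fragment. Because each table keeps the counts, the frequencies of a Type II fragment in its two genomes can then be compared as a rough indicator of the copy-number differences mentioned in the abstract. As the abstract notes for Type I, absence from a genome may only reflect low read coverage, so presence/absence calls from a sketch like this would need a coverage-aware threshold before being trusted.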