Heuristic optimization for global species clustering of DNA sequence data from multiple loci
Article first published online: 14 AUG 2013
© 2013 The Authors. Methods in Ecology and Evolution © 2013 British Ecological Society
Methods in Ecology and Evolution
Volume 4, Issue 10, pages 961–970, October 2013
How to Cite
Chesters, D., Yu, F., Cao, H.-X., Dai, Q.-Y., Wu, Q.-T., Shi, W., Zheng, W., Zhu, C.-D. (2013), Heuristic optimization for global species clustering of DNA sequence data from multiple loci. Methods in Ecology and Evolution, 4: 961–970. doi: 10.1111/2041-210X.12104
- Issue published online: 7 OCT 2013
- Accepted manuscript online: 24 JUL 2013 10:02AM EST
- Manuscript Accepted: 15 JUL 2013
- Manuscript Received: 4 FEB 2013
- Chinese Academy of Sciences. Grant Number: KSXC2-EW-B-02
- National Science Foundation of China. Grant Numbers: 30870268, 31172048, J1210002
- Ministry of Agriculture of the People's Republic of China. Grant Number: 201103024
- Program of Ministry of Science and Technology of the People's Republic of China. Grant Number: 2012FY111100
- heuristic clustering;
- MOTU;
- multilocus clustering;
- simulated annealing;
- species delineation
- Hierarchical clustering of molecular data is commonly used for estimation of species diversity in all forms of life. Parameters appropriate for species-level clustering are usually derived from reference data and applied for the delineation of sequences of unknown species membership, although it is not clear how this should be carried out in a multilocus scenario.
- We introduce a novel means of concurrent clustering parameter optimization and delineation for multilocus data. A simulated annealing heuristic search is performed, whereby clustering thresholds are varied independently for each locus but optimized according to the recovery of expected taxonomic species globally over loci. At each iteration of the search, one or more loci are randomly selected and a different threshold is separately proposed to cluster each; the loci are then linked to form global species units. Where the set of thresholds groups the reference (species-labelled) data with high taxonomic congruence, it is adopted for clustering the subject (nonlabelled) sequences into global molecular operational taxonomic units (global MOTU). Four mined test data sets composed of both reference and subject sequences are combined with a newly sequenced three-gene Apoidea data set and subjected to the proposed method.
- Even when optimizing four loci and thousands of sequences, the approach rapidly converges on a set of parameters with maximal optimality score, although the method masks a high degree of incongruence and does not always converge on a single set of thresholds. For example, from the 476 Apoidea sequences, 70 global MOTU were inferred over the heuristic search, although the number of single-gene MOTU was much lower for the 28S RNA locus, and a range of equally optimal clustering thresholds was observed for the CytB gene.
- We demonstrate the approach as a scalable species delineation solution for heterogeneous data sets composed of incompletely and inconsistently labelled data from public DNA databases, for newly sequenced multilocus data, or both. Delineation over a heuristic search of clustering parameters facilitates the estimation of species diversity in multilocus data, giving species estimates that take into account uncertainty regarding the choice of clustering thresholds.
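
The search described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names (`global_units`, `congruence`, `anneal`) are hypothetical. It also assumes single-linkage clustering on precomputed pairwise distances, a congruence score that simply counts reference species recovered as exactly one global unit, uniform threshold proposals in the range 0 to 0.2, and a geometric cooling schedule.

```python
import math
import random


def find(parent, x):
    # Union-find lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def global_units(distances, thresholds, specimens):
    """Cluster each locus at its own threshold (single linkage, an assumption),
    then link loci through shared specimens to form global units."""
    parent = {s: s for s in specimens}
    for locus, pairs in distances.items():
        for (a, b), d in pairs.items():
            if d <= thresholds[locus]:
                parent[find(parent, a)] = find(parent, b)
    return {s: find(parent, s) for s in specimens}


def congruence(units, labels):
    """Hypothetical score: number of reference species whose labelled members
    form exactly one global unit containing no other labelled species."""
    by_species, by_unit = {}, {}
    for s, sp in labels.items():
        by_species.setdefault(sp, set()).add(units[s])
        by_unit.setdefault(units[s], set()).add(sp)
    return sum(1 for us in by_species.values()
               if len(us) == 1 and len(by_unit[next(iter(us))]) == 1)


def anneal(distances, labels, specimens, steps=2000, t0=1.0,
           cooling=0.995, max_threshold=0.2, seed=0):
    """Simulated annealing over per-locus thresholds, scored globally."""
    rng = random.Random(seed)
    loci = list(distances)
    current = {l: rng.uniform(0.0, max_threshold) for l in loci}
    cur_score = congruence(global_units(distances, current, specimens), labels)
    best, best_score = dict(current), cur_score
    temp = t0
    for _ in range(steps):
        # Propose a new threshold for one or more randomly selected loci.
        proposal = dict(current)
        for l in rng.sample(loci, rng.randint(1, len(loci))):
            proposal[l] = rng.uniform(0.0, max_threshold)
        score = congruence(global_units(distances, proposal, specimens), labels)
        # Accept improvements and ties; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if score >= cur_score or rng.random() < math.exp((score - cur_score) / temp):
            current, cur_score = proposal, score
        if cur_score > best_score:
            best, best_score = dict(current), cur_score
        temp *= cooling
    return best, best_score
```

In this sketch, `distances` maps each locus name to a dictionary of specimen pairs and their distances, `labels` holds only the reference (species-labelled) specimens, and unlabelled subject specimens are assigned to global units by calling `global_units` with the returned thresholds. Because ties are accepted, the search can wander across equally optimal threshold sets rather than settling on a single one.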