Heuristic optimization for global species clustering of DNA sequence data from multiple loci

Authors

  • Douglas Chesters,

    1. Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing, China
    2. Guangzhou Institute of Advanced Technology, Chinese Academy of Sciences, Guangzhou, China
  • Fang Yu,

    1. Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  • Huan-Xi Cao,

    1. Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  • Qing-Yan Dai,

    1. Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  • Qing-Tao Wu,

    1. Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  • Weifeng Shi,

    1. School of Basic Medical Sciences, Taishan Medical College, Taian, Shandong, China
  • Weimin Zheng,

    1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
  • Chao-Dong Zhu

    Corresponding author
    1. University of Chinese Academy of Sciences, Beijing, China
    2. Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing, China

Correspondence author. E-mail: zhucd@ioz.ac.cn

Summary

  1. Hierarchical clustering of molecular data is commonly used for estimation of species diversity in all forms of life. Parameters appropriate for species-level clustering are usually derived from reference data and applied for the delineation of sequences of unknown species membership, although it is not clear how this should be carried out in a multilocus scenario.
  2. We introduce a novel means of concurrent clustering parameter optimization and delineation for multilocus data. A simulated annealing heuristic search is performed, whereby clustering thresholds are independently varied for each locus, but optimized according to the recovery of expected taxonomic species globally over loci. For each iteration of the search, one or more loci are randomly selected and a different threshold is separately proposed to cluster each, then the loci are linked to form global species units. Where the set of thresholds groups the reference (species-labelled) data with high taxonomic congruence, the thresholds are adopted for clustering of the subject (nonlabelled) sequences into global molecular operational taxonomic units (global MOTU). Four mined test data sets composed of both reference and subject sequences are combined with a newly sequenced three-gene Apoidea data set, and subjected to the proposed method.
  3. Even when optimizing four loci and thousands of sequences, the approach rapidly converges on a set of parameters with maximal optimality score, although the method masks a high degree of incongruence, and does not always converge on a single set of thresholds. For example, of the 476 Apoidea sequences, 70 global MOTU were inferred over the heuristic search, although the number of single-gene MOTU was much lower for the 28S RNA locus, and a range of equally optimal clustering thresholds was observed for the CytB gene.
  4. We demonstrate the approach as a scalable species delineation solution for heterogeneous data sets composed of incompletely and inconsistently labelled data from public DNA databases, for newly sequenced multilocus data, or both. Delineation over a heuristic search of clustering parameters facilitates the estimation of species diversity in multilocus data, giving species estimates that take into account uncertainty regarding the choice of clustering thresholds.
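
The search described in point 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy distances, the candidate thresholds, the linear cooling schedule, and the congruence score (the count of reference species whose labelled specimens exactly match one global MOTU) are all assumptions chosen for illustration. Single-linkage clustering and linking of loci through shared specimens are likewise assumed here.

```python
import math
import random

def global_motus(pairs_by_locus, thresholds, specimens):
    """Single-linkage cluster each locus at its own threshold, then link
    loci through shared specimens (union-find) to form global units."""
    parent = {s: s for s in specimens}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for locus, pairs in pairs_by_locus.items():
        for a, b, dist in pairs:
            if dist <= thresholds[locus]:
                parent[find(a)] = find(b)
    groups = {}
    for s in specimens:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())

def congruence(motus, labels):
    """Assumed score: number of reference species whose labelled specimens
    exactly match one global MOTU (unlabelled members are ignored)."""
    species = {}
    for specimen, name in labels.items():
        species.setdefault(name, set()).add(specimen)
    labelled = set(labels)
    recovered = {frozenset(m & labelled) for m in motus}
    return sum(frozenset(members) in recovered for members in species.values())

def anneal(pairs_by_locus, labels, specimens, candidates, steps=500, t0=1.0, seed=0):
    """Simulated annealing over per-locus thresholds, scored globally."""
    rng = random.Random(seed)
    loci = sorted(pairs_by_locus)
    current = {locus: rng.choice(candidates) for locus in loci}
    cur_score = congruence(global_motus(pairs_by_locus, current, specimens), labels)
    best, best_score = dict(current), cur_score
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # assumed linear cooling
        proposal = dict(current)
        # Randomly select one or more loci and propose a new threshold for each.
        for locus in rng.sample(loci, rng.randint(1, len(loci))):
            proposal[locus] = rng.choice(candidates)
        score = congruence(global_motus(pairs_by_locus, proposal, specimens), labels)
        # Accept improvements and ties; accept worse moves with decaying probability.
        if score >= cur_score or rng.random() < math.exp((score - cur_score) / temp):
            current, cur_score = proposal, score
            if cur_score > best_score:
                best, best_score = dict(current), cur_score
    return best, best_score

# Toy data (hypothetical): s1-s4 are reference specimens, s5-s6 are subject.
specimens = ["s1", "s2", "s3", "s4", "s5", "s6"]
labels = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
pairs_by_locus = {
    "COI": [("s1", "s2", 0.01), ("s3", "s4", 0.02), ("s1", "s3", 0.10),
            ("s2", "s4", 0.11), ("s5", "s1", 0.015), ("s6", "s5", 0.30)],
    "28S": [("s1", "s2", 0.0), ("s3", "s4", 0.001), ("s1", "s3", 0.004),
            ("s5", "s6", 0.05)],
}
candidates = [0.0, 0.005, 0.02, 0.05, 0.15]

best, best_score = anneal(pairs_by_locus, labels, specimens, candidates)
motus = global_motus(pairs_by_locus, best, specimens)
print(best, best_score, len(motus))
```

In this toy case two COI thresholds score equally well, a small-scale analogue of the range of equally optimal thresholds noted in point 3. A fuller treatment would record every optimal threshold set visited rather than only the first one found.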
