Heuristic optimization for global species clustering of DNA sequence data from multiple loci
Correspondence author. E-mail: firstname.lastname@example.org
- Hierarchical clustering of molecular data is commonly used to estimate species diversity in all forms of life. Parameters appropriate for species-level clustering are usually derived from reference data and applied to delineate sequences of unknown species membership, although it is not clear how this should be carried out in a multilocus scenario.
- We introduce a novel means of concurrent clustering parameter optimization and delineation for multilocus data. A simulated annealing heuristic search is performed, whereby clustering thresholds are varied independently for each locus but optimized according to the recovery of expected taxonomic species globally over loci. In each iteration of the search, one or more loci are randomly selected and a different threshold is proposed separately for clustering each; the loci are then linked to form global species units. Where a set of thresholds groups the reference (species-labelled) data with high taxonomic congruence, it is adopted for clustering the subject (nonlabelled) sequences into global molecular operational taxonomic units (global MOTU). Four mined test data sets composed of both reference and subject sequences are combined with a newly sequenced three-gene Apoidea data set and subjected to the proposed method.
- Even when optimizing four loci and thousands of sequences, the approach rapidly converges on a set of parameters with maximal optimality score, although the method masks a high degree of incongruence and does not always converge on a single set of thresholds. For example, from the 476 Apoidea sequences, 70 global MOTU were inferred over the heuristic search, although the number of single-gene MOTU was much lower for the 28S RNA locus, and a range of equally optimal clustering thresholds was observed for the CytB gene.
- We demonstrate the approach as a scalable species delineation solution for heterogeneous data sets composed of incompletely and inconsistently labelled data from public DNA databases, for newly sequenced multilocus data, or both. Delineation over a heuristic search of clustering parameters facilitates the estimation of species diversity in multilocus data, giving species estimates that take into account uncertainty regarding the choice of clustering thresholds.
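
The search described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study, and several details are assumptions made only for illustration: single-linkage clustering at each locus, linking loci by merging any specimens that cluster together at one or more loci, a congruence score equal to the number of reference species whose labelled members form exactly one global unit, a geometric cooling schedule, and a small invented set of distances. The names `global_units`, `congruence` and `anneal` are hypothetical.

```python
import math
import random
from collections import defaultdict


def find(parent, x):
    # Union-find root lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def global_units(distances, thresholds):
    """Assumed linking rule: merge two specimens if their distance at any
    locus is within that locus's threshold (single linkage per locus)."""
    parent = {}
    for locus, pairs in distances.items():
        for (a, b), d in pairs.items():
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            if d <= thresholds[locus]:
                parent[find(parent, a)] = find(parent, b)
    units = defaultdict(set)
    for x in parent:
        units[find(parent, x)].add(x)
    return list(units.values())


def congruence(units, labels):
    """Assumed score: number of reference species whose labelled members
    form exactly one global unit (unlabelled specimens are ignored)."""
    species = defaultdict(set)
    for specimen, name in labels.items():
        species[name].add(specimen)
    labelled = [{x for x in unit if x in labels} for unit in units]
    return sum(1 for members in species.values() if members in labelled)


def anneal(distances, labels, grid, steps=2000, temp=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    loci = list(distances)
    current = {locus: rng.choice(grid) for locus in loci}
    score = congruence(global_units(distances, current), labels)
    best, best_score = dict(current), score
    for _ in range(steps):
        # Propose new thresholds for one or more randomly selected loci.
        proposal = dict(current)
        for locus in rng.sample(loci, rng.randint(1, len(loci))):
            proposal[locus] = rng.choice(grid)
        new_score = congruence(global_units(distances, proposal), labels)
        # Accept improvements and ties; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if new_score >= score or rng.random() < math.exp((new_score - score) / temp):
            current, score = proposal, new_score
            if score > best_score:
                best, best_score = dict(current), score
        temp *= cooling
    return best, best_score


# Invented toy data: two reference species (A, B) and one subject specimen (u1).
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
distances = {
    "COI": {("a1", "a2"): 0.01, ("b1", "b2"): 0.02,
            ("a1", "b1"): 0.10, ("a1", "b2"): 0.10,
            ("a2", "b1"): 0.10, ("a2", "b2"): 0.10,
            ("u1", "a1"): 0.015, ("u1", "a2"): 0.015,
            ("u1", "b1"): 0.11, ("u1", "b2"): 0.11},
    "28S": {("a1", "a2"): 0.0, ("b1", "b2"): 0.001,
            ("a1", "b1"): 0.005, ("a1", "b2"): 0.005,
            ("a2", "b1"): 0.005, ("a2", "b2"): 0.005},
}
grid = [0.0, 0.001, 0.005, 0.01, 0.02, 0.05, 0.2]

best, best_score = anneal(distances, labels, grid)
motu = global_units(distances, best)  # global MOTU, including subject specimen u1
print(best, best_score, motu)
```

In this toy example several threshold combinations recover both reference species, so the thresholds returned depend on the random seed. This mirrors the observation that the search does not always converge on a single set of thresholds.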