Make every species count: fastachar software for rapid determination of molecular diagnostic characters to describe species

Only a fraction of the species found so far has been described, and this is particularly true of cryptic species uncovered by molecular data. Such species might require molecular data for their diagnosis, but it is important to make use of the diagnostic content of the molecular data itself. The molecular character-based model provides discrete molecular diagnostic characters within DNA sequences that can be used in species descriptions, fulfilling the requirement of most codes of nomenclature for a character-based description of species. Here, we introduce fastachar, software developed to extract molecular diagnostic characters from one or several taxonomically informative DNA markers of a selected taxon compared with those of other taxa in a single step. The input data consist of a single file with aligned sequences in the fasta format, which can be created using alignment software such as mega or geneious. fastachar is easy to use and has a graphical interface, so the user does not need any knowledge of the underlying programming environment (Python). We hope this software, based on the method proposed by Jörger and Schrödl (Frontiers in Zoology, 10, 59, 2013) to describe cryptic species, will encourage researchers to take the final step in taxonomy: the formal description of species. We also propose the use of this method and fastachar for the inclusion of molecular data in the description of any species. fastachar is released as open-source software under the GNU General Public License V3 and is freely available for all major operating systems from https://github.com/smerckel/FastaChar.


| Introduction
Taxonomy encompasses identifying, discovering and formally describing species and other taxa (Valdecasas et al., 2008), with the major objective of inventorying all life on Earth. Although taxonomic research has been carried out for over 250 years, only a fraction of the species found has been described (Coleman, 2015). The introduction of the DNA barcode concept by Hebert et al. (2003) has breathed new life into taxonomy in the form of so-called molecular taxonomy. In a strict sense, DNA barcoding aims at identification by comparing DNA barcodes of unknown specimens with DNA barcodes of a priori defined taxonomic entities in databases such as GenBank (https://www.ncbi.nlm.nih.gov/genbank/) or BOLD systems (http://www.boldsystems.org/). Several methods and software programs, referred to as Sequence Identification Engines (SIDEs), were developed to accomplish identification in a robust way (e.g., blast; Altschul et al., 1990; dna-bar; DasGupta, Konwar, Mǎndoiu, & Shvartsman, 2005; bronx; Little, 2011; blog 2.0; Weitschek et al., 2013). As an identification tool, DNA barcoding was considered extremely exciting and potent, but a poor approach to species discovery and description (Wheeler, 2004; Rach et al., 2008). However, there is no denying the utility of DNA barcodes in species discovery, where they first flag putative new species, such as cryptic species (morphologically identical but genetically distinct), in many nominal taxa. Examples can be found in disparate groups of organisms such as Rotifera (Fontaneto et al., 2011), Mollusca (Borges et al., 2012; Jörger et al., 2012; Neusser et al., 2011), Arthropoda (Hebert et al., 2004), fish (Azpelicueta et al., 2019) and turtles (Reid et al., 2011).
The discovery of putative new species is usually followed by molecular species delimitation. The goal is to build a taxonomic scheme for DNA sequence data, which can be single locus or multilocus, to infer a de novo delimitation of the putative new species, sometimes referred to as molecular operational taxonomic units (MOTUs; Blaxter et al., 2005). The distance-based method advanced by Hebert et al. (2003) has become the workhorse of DNA barcoding (Reid et al., 2011), aiming for the detection of a 'barcoding gap' between intra- and interspecific variation (Hebert et al., 2004; Puillandre et al., 2012). However, this approach has been criticized because of the high variation and the overlap between intra- and interspecific variation in many groups of organisms (Meyer & Paulay, 2005). More recently, several analytical methods were developed for molecular species delimitation. Some support species delimitation with single-locus data, partitioning sequences into MOTUs without adopting a rigid sequence threshold (Kekkonen & Hebert, 2014). One popular approach is the general mixed Yule-coalescent (GMYC; Fujisawa & Barraclough, 2013; Pons et al., 2006), which discriminates between population and speciation processes to delineate species. Other popular methods are the Automatic Barcode Gap Discovery (ABGD; Puillandre et al., 2012), the Barcode Index Numbers (BIN; Ratnasingham & Hebert, 2013) and the Poisson tree processes (PTP; Kapli et al., 2017; Zhang, Kapli, Pavlidis, & Stamatakis, 2013). The ABGD and BIN methods apply clustering algorithms to distinguish partitions in the genetic distance to delineate species that are clustered into MOTUs. The PTP method is based on a gene tree inferred from molecular sequences.
In the case of new species in cryptic complexes, the multilocus approach has been favoured to provide more information and stability in the species descriptions (Neusser et al., 2011; Jörger et al., 2012; Borges & Merckelbach, 2018; Trevisan et al., 2020). The multilocus approach also addresses the well-known fact that gene trees are not necessarily identical to species trees, for instance because of incomplete lineage sorting or horizontal gene transfer (Jörger & Schrödl, 2013). Several methods and software programs have been developed to analyse multilocus data, for example the Bayes factor delimitation (Grummer, Bryson, & Reeder, 2014; Leaché, Fujita, Minin, & Bouckaert, 2014) and phylogeographic inference using approximate likelihoods (Jackson, Carstens, Morales, & O'Meara, 2017). Many newly discovered species remain undescribed, however, even in cases where the species hypothesis is highly supported by numerous lines of evidence (Pante et al., 2015), hindering estimates of biodiversity, conservation efforts and attempts to control diseases and invasive species (Schlick-Steiner et al., 2007). Taxonomy remains incomplete if species are flagged as merely putative rather than formally described and thus fully established. In many cases, the transition from species delimitation to species description is the major task to be achieved (Jörger & Schrödl, 2013). Recently, new methods have been proposed to speed up species description using a combination of molecular data, concise morphological descriptions and other sources of information such as geographic distribution (Butcher et al., 2012; Summers et al., 2014). However, molecular data have been included in species descriptions in various forms, such as lists of GenBank accession numbers, entire DNA barcode sequences or raw distance measures, but seldom in a way that makes use of the diagnostic content of the molecular data itself (Goldstein & DeSalle, 2011; Jörger & Schrödl, 2013).
In this context, the molecular character-based model proposed by DeSalle et al. (2005) provides discrete molecular diagnostic characters (MDCs) within DNA sequences that are especially useful for the diagnosis of species, in particular of cryptic species, which cannot be diagnosed from morphological characters (Jörger & Schrödl, 2013). The molecular diagnosis involves listing the DNA (usually from a combination of mitochondrial and nuclear markers) or protein characters of a species that are different (apomorphic) from those of its closest relatives (e.g., congeneric species). This allows for a direct comparison between species and standardizes the use of molecular data (Goldstein & DeSalle, 2011), which can then serve as diagnostic features in the same way that traditional morphological diagnostic characters are used for species diagnosis (Bauer et al., 2011; Bergmann et al., 2009). In addition, using MDCs allows the detection of differences that reflect lineage independence, decreasing taxonomic ambiguity and instability (Bauer et al., 2011).
In this work, we present fastachar, a software package designed to extract MDCs from one or several taxonomically informative DNA markers of a selected taxon compared with those of other taxa in a single step. The software was developed to determine MDCs for the description of Lyrodus mersinensis (Borges & Merckelbach, 2018) because none of the existing software packages met our needs.

| A Brief Review of Existing Software
The Character Attribute Organization System (CAOS; Sarkar et al., 2008) was developed following previous research (Sarkar et al., 2002; Rach et al., 2008) for species identification. The authors pointed out, however, that its first function (P-Gnome) could also be useful for establishing diagnostics from DNA sequences for species descriptions. Because the software was not designed for this purpose, the workflow to determine MDCs requires a time-consuming iterative process (Jörger & Schrödl, 2014). Furthermore, Kühn and Haase (2019) note that the software considers masked entries in the DNA as valid characters, yielding potentially incorrect results. spider (Brown et al., 2012) is an open-source software package implemented in r that provides a number of functions for use in analyses of DNA barcoding studies, including determining MDCs (Azpelicueta et al., 2019; Escobar et al., 2019). spider, however, does not adhere to the definition of MDCs as proposed by Jörger and Schrödl (2013) and does not treat masked entries in the DNA correctly, as pointed out by Kühn and Haase (2019). In addition, the use of spider requires a working knowledge of the programming environment R, and for users who lack it, the time and effort required to become proficient in R seem disproportionate.
A third software package, quiddich (Kühn & Haase, 2019), is presented as a tool specifically for determining MDCs, but this too, being an r package, presents a conundrum for non-R users.
Compared to caos and fastachar, for example, quiddich defines two additional types of molecular diagnostic characters. These types (3 and 4) are plesiomorphic (not unique to the species being analysed) and therefore cannot be used as diagnostic or defining characters of that species. Thus, they are not meaningful for the purpose of a formal description and do not adhere to the definition proposed by Jörger and Schrödl (2013) either.

| Implementation
fastachar is implemented as a package for the scripting language Python. We chose Python mainly because (a) Python code is easy to read by design, so that nonexperts in programming can still work out how the algorithms work, (b) the Python environment can be installed easily, and (c) the Python environment includes by default a robust cross-platform toolkit to create graphical user interfaces.
The fastachar package consists of a number of Python modules, which implement the necessary data types and functionality to read and compare DNA sequences, and report the results. Using these modules, simple Python scripts can be written that process prealigned DNA sequences, generating output in text or spreadsheet files for further analysis.

| Algorithms
Within the context of fastachar, MDCs, as defined by Jörger and Schrödl (2013), are determined by comparing two lists consisting of sequences of a single alignment. List A contains sequences belonging to one taxon A, whereas list B can contain those of one or more taxa.
Per position $k$ in the alignment, (mathematical) sets $A_k$ and $B_k$ are created from the characters of the sequences in the lists A and B, respectively. The characters specified in the IUPAC ambiguity code for nucleotides (http://www.bioinformatics.org/sms/iupac.html) are expanded into all the base nucleotides they represent. For example, a set containing M expands into a set containing both A and C, and N expands to A, C, G and T. This is equivalent to the 'single pure characters' used in caos and 'type 1 diagnostic characters' used in quiddich (Kühn & Haase, 2019).
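
As an illustration, the set comparison described above can be sketched in a few lines of Python. This is not fastachar's own code: the function names (`expand`, `position_set`, `find_mdcs`) are hypothetical, and the two conditions tested (that $A_k$ contains exactly one character, and that $A_k$ and $B_k$ share no character) are an assumed reading of the definition by Jörger and Schrödl (2013). Gap characters are simply kept as characters in their own right.

```python
# IUPAC nucleotide ambiguity codes and the base nucleotides they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(char):
    """Return the set of base nucleotides represented by one character."""
    # Characters outside the IUPAC table (e.g. a gap '-') are kept as they are.
    return set(IUPAC.get(char.upper(), char))


def position_set(sequences, k):
    """Collect the expanded characters found at 0-based position k."""
    result = set()
    for seq in sequences:
        result |= expand(seq[k])
    return result


def find_mdcs(list_a, list_b):
    """Return (position, character) pairs, with 1-based positions."""
    mdcs = []
    for k in range(len(list_a[0])):
        a_k = position_set(list_a, k)
        b_k = position_set(list_b, k)
        # Assumed Condition 1: A_k holds exactly one character.
        # Assumed Condition 2: A_k and B_k have no character in common.
        if len(a_k) == 1 and a_k.isdisjoint(b_k):
            mdcs.append((k + 1, next(iter(a_k))))
    return mdcs


# Toy alignment: the M in list B expands to {A, C}, so position 2 is not
# diagnostic for taxon A, whereas position 4 is.
print(find_mdcs(["ACGT", "ACGT"], ["ATGA", "AMGC"]))
```

The sketch assumes that all sequences have the same length, as is the case for a single alignment.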

| Evaluation criterion for potential molecular diagnostic characters
If a taxon A has one or more ambiguous character codes at a given position $k$, then that position is disqualified from yielding a molecular diagnostic character. In some of these cases, however, the user might still consider taxon A to have an MDC at this position. To single out these cases for further inspection, a second algorithm yields the potential MDCs by replacing Condition 1 with the condition:

$A_k$ contains at least two characters.
This is equivalent to the 'private characters' used in caos and 'type 2 diagnostic characters' used in quiddich (Kühn & Haase, 2019).
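
The relaxed criterion can be sketched in the same way. The code below is a hypothetical illustration rather than fastachar's implementation: `find_potential_mdcs` is an invented name, and it assumes that the requirement of no shared character between $A_k$ and $B_k$ still applies once the first condition has been replaced.

```python
# IUPAC nucleotide ambiguity codes and the base nucleotides they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def position_set(sequences, k):
    """Collect the expanded characters found at 0-based position k."""
    return set().union(*(IUPAC.get(s[k].upper(), s[k]) for s in sequences))


def find_potential_mdcs(list_a, list_b):
    """Return (position, characters) pairs, with 1-based positions."""
    hits = []
    for k in range(len(list_a[0])):
        a_k = position_set(list_a, k)
        b_k = position_set(list_b, k)
        # Replacement condition from the text: A_k has at least two characters.
        # Assumption: A_k must still share no character with B_k.
        if len(a_k) >= 2 and a_k.isdisjoint(b_k):
            hits.append((k + 1, "".join(sorted(a_k))))
    return hits


# Toy alignment: R expands to {A, G}; list B only shows C and T at position 2.
print(find_potential_mdcs(["ARGT", "AAGT"], ["ACGT", "ATGT"]))
```

Positions reported this way are candidates for manual inspection only; whether they count as diagnostic remains the user's decision.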

| Noninformative characters
Not all sequences obtained for a certain marker are of the same length because of low-quality readings, usually at the beginning and/or at the end of the sequence. After aligning, the shorter sequences are padded, yielding continuous blocks of noninformative dash characters at the beginning and/or the end of the sequence.
Masking programs are also sometimes used to mask parts of sequences that are highly variable or of bad quality, using (mostly) N as the masking character. To prevent noninformative characters from being treated as informative, which could lead to erroneous results, a dash (or N) is excluded from the sets $A_k$ and $B_k$ if it appears in a continuous block of dashes (or N's) at the start and/or at the end of a sequence.
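
A minimal sketch of this exclusion rule is given below. It is an illustration under stated assumptions, not fastachar's code: the names `informative_range` and `position_chars` are hypothetical, and terminal runs that mix dashes and N's are treated as a single noninformative block, which is a simplification.

```python
def informative_range(sequence, fillers="-N"):
    """Return (start, end) so that sequence[start:end] excludes terminal fillers."""
    seq = sequence.upper()
    start = 0
    # Skip the continuous block of filler characters at the start.
    while start < len(seq) and seq[start] in fillers:
        start += 1
    end = len(seq)
    # Skip the continuous block of filler characters at the end.
    while end > start and seq[end - 1] in fillers:
        end -= 1
    return start, end


def position_chars(sequences, k):
    """Collect the characters at 0-based position k, ignoring terminal fillers."""
    chars = set()
    for seq in sequences:
        start, end = informative_range(seq)
        if start <= k < end:
            chars.add(seq[k])
    return chars


# The leading dashes of the first sequence are padding and are ignored,
# whereas the internal gap of the second call is kept as a character.
print(position_chars(["--ACG", "TTACG"], 0))
print(position_chars(["AC-GT"], 2))
```

Internal dashes and N's are left untouched, so a gap inside a sequence still counts as a character at that position.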

| Preliminary considerations on how to use molecular data in species descriptions
The use of molecular data as a partial or even the major source of a species description requires consideration of the best practice to follow. It is important to keep in mind that each species description is a hypothesis (not a fact) about the discontinuous distribution of unique combinations of characters (Nixon & Wheeler, 1990; Valdecasas et al., 2008), be they morphological and/or molecular. The final decision of the taxonomist may always be falsified and is testable forever (Haszprunar, 2011). Given that many species have wide and complex ranges, sampling thousands of specimens from many populations of several species is possible but prohibitively time-consuming and cost-inefficient (Wheeler, 2004). Thus, when using molecular sequence data (or morphological data) for species descriptions, the hypotheses of the apomorphic characters (derived traits referred to in the fastachar context as MDCs) are theorized (Haszprunar, 2011) and may contain errors in the form of incorrectly assumed apomorphic character states (Jörger & Schrödl, 2013). This may be the case when working with sparsely sampled species or when there is a low number of sequences per species. When new sequences, for instance from new cryptic species or other closely related congeneric species, are added to the analysis at a later stage, the putative molecular apomorphies of described species may have to be reconsidered as plesiomorphies (ancestral traits not unique to the species being analysed). It may also happen that, when sequence data of new specimens of the same species are included in the analysis, some putative molecular apomorphies vanish in intraspecific variation. Thus, it is important to include in the analysis two or more independently evolving markers (at least two independent and informative markers per species) and several sequences of each species to account for intraspecific variation. This is to ensure that at least some of the potentially apomorphic nucleotides found in different markers are truly unique mutations accumulated due to the absence of gene exchange (Jörger & Schrödl, 2013).

| How to select and prepare data for the analysis
The most important decision in an analysis using fastachar is the selection of the most adequate evolutionary group to serve for comparison. One possibility is to compare the species of interest with some or all of its congeners. If that is not possible, it can be compared with other confamilial species. In the latter case, however, care must be taken not to include distant confamilial species in the comparison. When distant species are compared, there is a higher probability of including homoplastic character states, which increases with evolutionary distance (Rach et al., 2008), given that there are only four possible character states at a homologous position of a nucleotide alignment. This is particularly important when dealing with fast-evolving markers. On the other hand, comparing a species only to its known direct sister species increases the risk of including plesiomorphies as MDCs. Such a comparison also depends entirely on the validity of this phylogenetic sister group relationship (Jörger & Schrödl, 2014).
After selecting the group to be analysed, the sequences must be aligned with a dedicated alignment program, such as muscle (Edgar, 2004) or geneious (Kearse et al., 2012). Since the determination of the diagnostic characters relies on the positional homology assumption, it is important to compare the results of different alignment programs in fastachar and choose the one that takes the most conservative approach. This results in fewer diagnostic characters, but their reliability will be higher (Jörger & Schrödl, 2013). The aligned sequences are then exported into a standard fasta file format, where each sequence must be accompanied by a species name, and any associated laboratory code or GenBank information.
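
To make the expected input concrete, the sketch below reads aligned sequences in fasta format and groups them by species. It is a hypothetical illustration and does not reproduce fastachar's own parser. In particular, the header layout `>Genus_species_code` and the sample codes (`LM01`, `LP07`, `LM02`) are assumptions made for this example only; the convention actually expected by the software should be taken from its documentation.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse fasta-formatted text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the sequence is filled in by later lines.
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def group_by_species(records):
    """Group aligned sequences by species name taken from the header."""
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("sequences differ in length; is the file aligned?")
    groups = defaultdict(list)
    for header, seq in records:
        # Assumed header layout: Genus_species_code (hypothetical convention).
        genus, species = header.split("_")[:2]
        groups[f"{genus} {species}"].append(seq)
    return dict(groups)


example = """>Lyrodus_mersinensis_LM01
ACGT
AC
>Lyrodus_pedicellatus_LP07
AC-TAC
>Lyrodus_mersinensis_LM02
ACGTAC
"""
print(group_by_species(read_fasta(example)))
```

The length check is a cheap safeguard: sequences of unequal length indicate that the file was exported before alignment, in which case positional homology cannot be assumed.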

| How to process the data using fastachar's GUI
When fastachar is started, a GUI is brought up. The user first opens the fasta file prepared in the previous step, and all species it contains are listed in the text field 'Unselected Species' (see Figure 1). From this field, the required species are dragged into the text fields 'Selected species list A' and 'Selected species list B', respectively. Then, the desired operation is selected, choosing between determining MDCs or potential MDCs. After clicking the 'Process' button, the results are shown in the bottom text field, which allows for a direct inspection of the results.
If desired, the results can be saved as an ASCII file or as a spreadsheet file for further analysis.
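
For users who prefer to post-process results in a script, a list of MDCs can be written to a spreadsheet-readable CSV file with Python's standard `csv` module. The sketch below is hypothetical: the function name `mdcs_to_csv` and the column layout are assumptions and do not describe the file format that fastachar itself writes.

```python
import csv
import io


def mdcs_to_csv(marker, taxon, mdcs):
    """Return CSV text with one row per (position, character) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical column layout, chosen for illustration only.
    writer.writerow(["marker", "taxon", "position", "character"])
    for position, character in mdcs:
        writer.writerow([marker, taxon, position, character])
    return buffer.getvalue()


print(mdcs_to_csv("COI", "Lyrodus mersinensis", [(4, "T"), (27, "G")]))
```

Recording the marker and the taxon in every row keeps the table self-describing when results for several markers are concatenated.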

| Demonstration of fastachar using six data sets with several taxonomically informative markers
The software can be used to determine MDCs from one or several taxonomically informative molecular markers for any taxon. Thus, below we provide examples of determination of MDCs from several molecular markers in disparate taxa including algae, fungi, invertebrates and vertebrates. Each data set is associated with a published peer-reviewed article (Table 1). The results are shown in Annex 1.
For demonstrational purposes only, we determine MDCs for only one species in each data set (Annex 1). The 18S rRNA and 28S rRNA sets, from the work by Jörger and Schrödl (2013), were used to compare the performance of fastachar with that of caos. Similarly, we used the COI data set from Azpelicueta et al. (2019) to compare the performance of fastachar with that of spider (Brown et al., 2012). The MDCs determined by fastachar were identical to those determined by caos and spider.

| Presentation of the results
The presentation of the results obtained from the analysis in fastachar is of great importance. This is particularly the case when MDCs are a partial or even the major data source used in a species description. Thus, the researcher has to ensure that the data are traceable, reproducible and testable. In their method, Jörger & Schrödl (2013) advised that, to make a diagnostic position in a sequence traceable and the results reproducible by other researchers, the following should be observed: (a) the primers and alignment program used should be reported; (b) it should be stated how position 1 was determined (for instance the first base after the primer sequence); (c) the sequences should be deposited in public databases (e.g., GenBank or the BOLD system) or provided as additional material in the article; and (d) in new species descriptions, the provided reference sequences should be generated from type material. For further details, see Jörger & Schrödl (2013). We further advise that MDCs and their respective positions should be presented not only for the species of interest but also for all species to which it was compared (see Tables S1-S6 in Annex). This will ensure that future researchers can reproduce and test the results.

| Discussion and Conclusions
Molecular taxonomy is a burgeoning field, and several methods and software programs have been developed to facilitate the three major processes: identification, molecular species delimitation (discovery) and formal description of species. Most programs are used for only one of these functions, for example blog 2.0 for identification (Weitschek et al., 2013) and BPP for delimitation (Yang & Rannala, 2010). Others, such as spider (Brown et al., 2012) and caos, may be used in more than one process. For example, spider can be used for identification, delimitation and description, whereas caos was designed for species identification but can also be used for species description. In this context, fastachar could be viewed as software not only for species description but also for species delimitation. This is because the presence of MDCs may reflect differences that indicate separately evolving lineages (species) according to the concept of species proposed by De Queiroz (2007). However, using fastachar for species delimitation would be inferior to using, for example, a coalescent-based delimitation method such as BPP (Yang & Rannala, 2010), as there is no objective criterion for the number of MDCs needed to delimit a species (Jörger & Schrödl, 2013). In addition, there is no simple formula that can predict the length of the sequence that must be analysed to ensure species diagnosis, because rates of molecular evolution vary between different segments of the genome and across taxa (Hebert et al., 2003). Therefore, fastachar should be used only to extract MDCs to formally describe species when the hypotheses for those species have been congruently supported across delimitation methods. This is consistent with the principles of integrative taxonomy, in that taxonomic inference should be based on congruence across analyses that utilize multiple sources of data (Carstens et al., 2013).
Software packages to extract MDCs exist, but they either have a cumbersome and time-consuming workflow or require a working knowledge of the programming environment R, which not everyone may have. This may be one of the reasons why molecular data are used in many species descriptions, but not in the form of MDCs.
fastachar is easy and quick to install, and runs on all major operating systems. The intuitive graphical user interface provides a convenient way to determine the MDCs from any number of sequences of a selected taxon compared with those of other taxa (as many as required by the user) in a single step, while removing the requirement to have any knowledge of a programming environment. Therefore, we hope it helps to standardize the use of molecular data and stimulates researchers to proceed to the final step of molecular taxonomy, that is, describing the new species, particularly cryptic species, after the exploratory step of delimitation, making them available for biodiversity research.

Table 1. Data sets used to determine the diagnostic molecular characters using fastachar.

ACKNOWLEDGEMENTS
We thank the three anonymous reviewers for their constructive feedback that helped improve the manuscript.