A molecular‐based identification resource for the arthropods of Finland

To associate specimens identified by molecular characters to other biological knowledge, we need reference sequences annotated by Linnaean taxonomy. In this study, we (1) report the creation of a comprehensive reference library of DNA barcodes for the arthropods of an entire country (Finland), (2) publish this library, and (3) deliver a new identification tool for insects and spiders, as based on this resource. The reference library contains mtDNA COI barcodes for 11,275 (43%) of 26,437 arthropod species known from Finland, including 10,811 (45%) of 23,956 insect species. To quantify the improvement in identification accuracy enabled by the current reference library, we ran 1000 Finnish insect and spider species through the Barcode of Life Data system (BOLD) identification engine. Of these, 91% were correctly assigned to a unique species when compared to the new reference library alone, 85% were correctly identified when compared to BOLD with the new material included, and 75% with the new material excluded. To capitalize on this resource, we used the new reference material to train a probabilistic taxonomic assignment tool, FinPROTAX, scoring high success. For the full‐length barcode region, the accuracy of taxonomic assignments at the level of classes, orders, families, subfamilies, tribes, genera, and species reached 99.9%, 99.9%, 99.8%, 99.7%, 99.4%, 96.8%, and 88.5%, respectively. The FinBOL arthropod reference library and FinPROTAX are available through the Finnish Biodiversity Information Facility (www.laji.fi) at https://laji.fi/en/theme/protax. Overall, the FinBOL investment represents a massive capacity‐transfer from the taxonomic community of Finland to all sectors of society.


KEYWORDS
COI, DNA barcodes, probabilistic taxonomic assignment, PROTAX, reference library, species identification

| INTRODUCTION
Over the past decade, DNA-based identification has been adopted as a key tool for characterizing biological specimens (Hebert, Ratnasingham, et al., 2016). To compare species composition among sites, to describe community organization, or to access previous knowledge related to the taxa encountered, specimens must first be identified. A quick and efficient approach is to cluster specimens into molecular operational taxonomic units or MOTUs (Blaxter et al., 2005). Indeed, the clustering of sequences combined with an interim taxonomy enables efficient characterization of biodiversity (Smith et al., 2013) and of species interactions (Clare et al., 2019).
Yet, full realization of the value of such data relies on connecting as many MOTUs as possible to Linnaean taxonomy, because this makes it possible to connect species detected in DNA-based surveys to prior biological knowledge. Thus, the most efficient avenue for combining molecular data with taxon-specific knowledge involves populating reference databases with DNA barcodes annotated with Linnaean taxonomy (Hebert et al., 2003). By definition, such progress can only be achieved through the active involvement of taxonomists.
Once populated, DNA barcode reference libraries can be used to establish the likely identity of a query sequence and to partition the millions of reads from a high-throughput sequencing run to their likely source species. In such use cases, the reference sequence with the highest similarity (Altschul et al., 1990) is often assumed to represent the likeliest taxon, and its taxonomic tag becomes the relevant identification (BOLD Team, 2019). Importantly, any taxonomic placement made through this approach comes with uncertainty, since both the query and reference sequences may contain read errors, and because there is variation among sequences within a species. A further important source of uncertainty arises from the fact that reference sequence databases are incomplete, and they contain some incorrectly identified records (Pentinsaari et al., 2020). The best way to reduce uncertainty and improve performance involves extending species coverage and improving the quality of the reference databases (Meiklejohn et al., 2019; Pentinsaari et al., 2020; Wilson et al., 2011).
To arrive at comprehensive, reliable reference libraries, several nations and campaigns have constructed DNA barcode databases. Approaches range from barcoding all macroscopic species in an arctic region (Wirta et al., 2016) or a coral atoll (Andersen et al., 2019) to campaigns taking a taxonomic or geographic focus (e.g., Dincă et al., 2021; Miller et al., 2016; Zhou et al., 2016; for a summary, see http://www.ibol.org/phase1/about-us/campaigns/).
Among the most ambitious initiatives to date are Fauna Bavarica, striving for coverage of all species in this German state (https://barcoding-zsm.de/bfb), and the intensive work in the Area de Conservacion Guanacaste, northwestern Costa Rica, with efforts to barcode all species in this nation (Miller et al., 2016). Other campaigns strive to generate comprehensive DNA barcode libraries for the biota of a country, including, for example, Austria (ABOL, https://www.abol.ac.at/), Belgium (BeBOL, http://bebol.myspecies.info/), Germany (GBOL, https://bolgermany.de/home/en/german-barcode-of-life-2; see Morinière et al., 2019), Norway (NorBOL, http://www.norbol.org/), and Switzerland (SwissBOL, http://www.swissbol.ch); see Figure 1. In each case, the combination of high species coverage with reliable taxonomic annotations is a key objective.
In this study, we consolidate knowledge from several sources to create a new tool that enables the taxonomic identification of more than 10,000 species, linking molecular samples with taxonomic collections and expertise. Specifically, we report the creation of a DNA barcode library for the arthropods of Finland, one built through a nationwide network of taxonomic experts. Ultimately, the Finnish Barcode of Life initiative (FinBOL, https://www.finbol.org/) will establish a DNA barcode reference library for all ~48,000 species of multicellular organisms that occur in Finland. The present study represents important progress toward this goal as it releases a reference data set for 11,275 arthropod species. We describe the approach employed to build this reference library, the current success rate, and the improvement in species identification resulting from its use. To capitalize on its development, we trained PROTAX, a probabilistic taxonomic assignment tool (Somervuo et al., 2016), to identify arthropod sequences from Finland while accounting for gaps in knowledge and available reference material. Implemented as a web-based service, this new resource (FinPROTAX; https://laji.fi/en/theme/protax) allows both the accurate taxonomic placement of insects and the evaluation of the uncertainty associated with placements at each level in the taxonomic hierarchy.

| Approach
FinBOL was built using a crowdsourcing approach, as its progress reflects contributions by a network of about 150 Finnish taxonomists who contributed identified arthropod specimens for sequence analysis. This network involves both professional researchers and amateur naturalists. For details regarding the organization and activities of the network, see Appendix S1.
To maximize efficiency, FinBOL adopted a flexible strategy to obtain tissue samples of good quality for DNA barcoding. This flexibility has involved the utilization of both museum and private collections, the latter of which are many and of high quality in Finland, with different collectors focusing on different taxa. In return, FinBOL provided all participants with open access to the resultant data. Following this principle, FinBOL has not required that voucher specimens be deposited in public collections, but rather that they are maintained in known collections and only eventually (when the owner is deceased or discontinues the collection) donated to museums.
Specimens obtained for DNA barcoding were mostly tissue sampled and photographed at the Zoological Museum of the University of Oulu and subsequently returned to their source public or private collection. To a lesser degree, some or all of these stages were carried out by the Finnish Museum of Natural History Luomus, by individual contributors, or, for photography, at the Centre for Biodiversity Genomics (CBG) at the University of Guelph, Ontario, Canada. Tissue samples (usually one or two legs, or a part of a leg, depending on the size of the specimen) were placed in 96-well microplates prefilled with ethanol and shipped to the CBG for DNA extraction and sequencing. For some minute species of Diptera, Coleoptera, or Acari, for which the regular tissue sampling approach was not feasible, plates containing whole specimens were assembled, and vouchers were recovered from the plates after DNA extraction using nondestructive methods.
DNA extraction followed a standard high-throughput protocol (Ivanova et al., 2006). A cocktail of the Folmer (Folmer et al., 1994) primers and LepF1 and LepR1 (Hebert et al., 2004) was then used to PCR amplify the target region of cytochrome c oxidase I in most specimens. The resultant amplicons were Sanger sequenced on an ABI 3730XL. Additionally, sequences of some taxa were produced in FinBOL participants' individual research labs, largely employing the protocols outlined above. The barcode sequences, as well as photographs and metadata for specimens, were uploaded to the Barcode of Life Data system (BOLD) (Ratnasingham & Hebert, 2007).
Within the global database, all FinBOL projects are listed under the FinBOL campaign in the BOLD project list.
The current reference resource includes all FinBOL arthropod data uploaded and validated by October 31, 2019. All sequences with a minimum length of 500 bp and a validated taxonomic identification were compiled into a data set on BOLD (dx.doi.org/10.5883/DS-FINPRO). The full set of sequence data was downloaded as a time-stamped version and used to train the probabilistic taxonomic classifier PROTAX (see Section 2.3). As data will continue to accumulate well into the future, albeit at a slower rate, new sequences will be continuously uploaded to BOLD where they will remain connected to FinBOL through project identity (see above). All sequences that are not flagged as misidentified or contaminated are automatically included in the identification engine on BOLD (BOLD Team, 2019).

| Validating the taxonomic resolution achieved
To examine how much the national reference library improved the identification success of Finnish arthropods, we adopted a user's perspective. In brief, we examined the impact of the FinBOL arthropod material on species identifications generated by the BOLD Identification Engine, a web-based tool that queries all sequences uploaded to BOLD from public and private projects to locate the closest match. Based on 1000 query sequences, we evaluated identification success when the query material was compared to BOLD under three scenarios: (1) BOLD without the new records, (2) BOLD with the new records added, and (3) BOLD restricted to the new records alone.
The BOLD ID Engine accepts sequences from the 5′ region of the mitochondrial COI gene and returns a list of closest matches to the query sequences. The user can choose between querying the full COI database or limiting the query to reference records with species-level identifications. For identification, BOLD uses the BLAST algorithm (Altschul et al., 1990) to identify single base indels before aligning the protein translation through profile to a hidden Markov model of the COI protein (BOLD Team, 2019). As our reference library, we used the Species Level Barcode Records Database, that is, every COI sequence of >500 bp with species level identification uploaded on BOLD.
To assess whether and how much species-level taxonomic assignment improved with access to the FinBOL arthropod reference library, we selected a test set of 1000 Finnish arthropod species stratified by order. Species were chosen in rough proportion to national species-level diversity per order (Figure 2), with an important exception: to reduce the dominance of the four most diverse orders (Coleoptera, Lepidoptera, Diptera, and Hymenoptera) and increase representation of other taxa, we included only 150 species for each of these orders. The test set was assembled from the full FinBOL arthropod data set by first reducing the data set to only those species for which at least two sequences were obtained, and then randomly drawing the predetermined number of representatives from each order (with one sequence per species). This set of 1000 sequences was compared against the COI Species database on BOLD, that is, only including reference sequences with species-level identifications, using the Batch ID Engine tool (accessed on March 8, 2021) with the default parameters: a minimum of 80% sequence similarity, and a minimum overlap of 300 bp between query and reference sequence.
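The sampling scheme described above can be illustrated with a minimal sketch. It assumes the records are available as (order, species, sequence ID) tuples and that the per-order quotas have already been decided; the function name `draw_test_set` and the data layout are hypothetical and not part of the FinBOL workflow.

```python
import random
from collections import defaultdict


def draw_test_set(records, quota_per_order, seed=0):
    """Draw one query sequence per species, stratified by order.

    records: iterable of (order, species, sequence_id) tuples (assumed layout).
    quota_per_order: dict mapping order -> number of species to draw.
    """
    rng = random.Random(seed)

    # Group sequence IDs by species within each order.
    by_order = defaultdict(lambda: defaultdict(list))
    for order, species, seq_id in records:
        by_order[order][species].append(seq_id)

    test_set = []
    for order, quota in quota_per_order.items():
        # Keep only species with at least two sequences, so that a reference
        # remains once the query itself is excluded from the comparison.
        eligible = sorted(
            sp for sp, ids in by_order[order].items() if len(ids) >= 2
        )
        # Draw at most `quota` species (fewer if the order is small).
        chosen = rng.sample(eligible, min(quota, len(eligible)))
        for species in chosen:
            # One randomly chosen sequence represents each species.
            seq_id = rng.choice(sorted(by_order[order][species]))
            test_set.append((order, species, seq_id))
    return test_set
```

Capping the quota for the four most diverse orders (150 species each in the study) is expressed here simply through the values passed in `quota_per_order`.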
The Batch ID Engine outputs a list of the top 100 closest matches to the query sequence in the database which fulfill these criteria, excluding self-match. If fewer than 100 matches for a given query sequence meet the criteria, the resulting list of matching sequence records is shorter. Each query sequence was assigned to the species with the best-matching reference sequence. Identification success (true or false) was evaluated assuming that the original identification of the query sequence was correct, a reasonable assumption since in each case, the identification had been made by the best available national expert (see section Approach, above).
Since the Batch ID Engine excludes comparisons to the query sequence itself, our restriction of the query material to species with at least two reference sequences in the FinBOL material ensured the existence of at least one reference sequence in the FinBOL data set. Our three scenarios thus correspond to asking: how accurate an identification would we have achieved if querying the global database without the FinBOL arthropod records (scenario 1: comparison to BOLD with the new records excluded); how accurate an identification will we achieve now, when the FinBOL records are added (scenario 2: comparison to BOLD with the new records included); and how accurate an identification will we achieve if we take the regional origin of the sequence into account, restricting our reference library to the national library alone (scenario 3: BOLD restricted to the new records only).

FIGURE 1
Complementarity in BIN composition of arthropod faunas between FinBOL and other regional DNA barcoding campaigns. Shown is the number of arthropod BINs unique to the Nordic countries (a) or FinBOL (a-c) with the arrows joining the two regions for which the comparison is made. Shades from pink to brown refer to the total number of BINs contributed to BOLD by each regional campaign. Regions in grey are not considered in this comparison (although some have contributed DNA barcodes to BOLD). Since all campaigns are in progress, the numbers for Finland refer to the current data release whereas the numbers for other areas refer to records on BOLD on 21 January 2021. In total, the FinBOL data release contained 13,777 BINs of which 1713 BINs had not been previously contributed to BOLD. For exact numbers, see Appendix S2.

For this purpose, we queried the Batch ID Engine for the taxonomic annotation associated with the best-matching sequence while excluding or including FinBOL sequences from the result list (corresponding to scenarios 1 and 2, respectively), or excluding all non-FinBOL sequences from the list (for scenario 3). From the closest matches, we then determined the proportion of query sequences assigned to a single correct species ("Correct"), to several alternative species with the same sequence similarity of which at least one represented the correct one ("Several alternatives") or to the wrong species ("False").
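The scoring logic of the three scenarios can be sketched as follows. This is a minimal illustration, assuming each match is available as a (species, similarity, from_finbol) tuple; the tuple layout and function names are hypothetical, and the "No match" outcome for an empty result list is an assumption that is not described in the text.

```python
def select_references(matches, scenario):
    """Filter a result list according to the three scenarios.

    matches: list of (species, similarity, from_finbol) tuples (assumed layout),
    with the self-match already excluded.
    """
    if scenario == 1:   # BOLD without the FinBOL records
        return [m for m in matches if not m[2]]
    if scenario == 2:   # BOLD with the FinBOL records included
        return list(matches)
    if scenario == 3:   # FinBOL records only
        return [m for m in matches if m[2]]
    raise ValueError("scenario must be 1, 2 or 3")


def score_identification(true_species, matches):
    """Classify one query as Correct, Several alternatives or False."""
    if not matches:
        return "No match"  # assumption: this case is not covered in the text
    best = max(similarity for _, similarity, _ in matches)
    # All species tied at the highest similarity.
    top_species = {sp for sp, similarity, _ in matches if similarity == best}
    if top_species == {true_species}:
        return "Correct"
    if true_species in top_species:
        return "Several alternatives"
    return "False"
```

For a query of species A whose result list contains a FinBOL record of A and a non-FinBOL record of B at the same similarity, the outcome is "False" under scenario 1, "Several alternatives" under scenario 2, and "Correct" under scenario 3.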

| Training PROTAX to identify Finnish arthropods
To capitalize on the new resource, we trained PROTAX (Somervuo et al., 2016), a probabilistic taxonomic assignment tool, to identify arthropod sequences from Finland. PROTAX is a taxonomic classifier which establishes the probability that a query sequence can be assigned to a given taxon. In comparison to the identification engine used by BOLD, PROTAX has the advantage of recording how much the suggested identification can be trusted at each taxonomic level. PROTAX uses the sequences in the reference library to parameterize a statistical model of the probability with which a query sequence belongs to any particular taxonomic level (class, order, family, subfamily, tribe, genus, and species), or to a previously unknown taxon at the same taxonomic level. The latter probability should explicitly be interpreted as "a taxon not represented in the reference library", and thus explicitly accounts for current gaps in coverage.

| Taxonomy
Since PROTAX is based on hierarchical assignment to nested taxonomic levels, it builds on a fully resolved taxonomical hierarchy.
However, it also accounts for the fact that there may be unknown taxa at each taxonomic level (see above) so the taxonomy represents both known and unknown insects of Finland. To train PROTAX, we used the names and taxonomic hierarchy of known taxa in the 2019 edition of the national checklist of Finnish species (FinBIF, 2020).
Under each known genus, tribe, subfamily, family, order, class, and phylum in the taxonomy tree, there is also a branch corresponding to unknown taxa.
The taxonomic tree used to train PROTAX was constructed based on the full hierarchy and taxonomic names of 26,437 species. The root node of the tree represents the phylum Arthropoda, with a total of 48,801 nodes in the full tree covering the seven levels. Since the usage of taxonomic ranks varies greatly among taxa and taxonomists, the reference taxonomy used (FinBIF, 2020) contained some missing values for particular combinations of taxonomic levels and taxa. In those cases where a name was missing from the full taxonomic classification at a certain taxonomic rank, a dummy name was created. For example, the bird louse genus Actornithophilus belongs to the family Menoponidae in the order Phthiraptera, but subfamily and tribe ranks are not used in the reference checklist. Therefore, two dummy names were created to link this genus to its family. To create a single fully connected taxonomy tree, a total of 17,783 dummy names were introduced. For unknown taxa, that is, branches not included in the known taxonomy, an additional node was created under each internal tree node (see previous paragraph). These nodes allow for branches possibly missing from the set of known taxonomic names. For example, at the species level there were 33,335 nodes, of which 26,437 represented known species and the remaining 6898 nodes represented unknown species under 6898 known genera. Similarly, nodes representing unknown branches were added to the tree at all other levels of the taxonomy (but these nodes did not have further child nodes in our taxonomy tree).
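The construction of a fully connected tree with dummy names and "unknown" branches can be sketched as below. This is only an illustration under stated assumptions: the checklist is represented as a list of dictionaries, nodes are identified by their path from the root, and the dummy naming scheme (parent name plus rank) is hypothetical, since the text does not specify how dummy names were formed.

```python
RANKS = ["class", "order", "family", "subfamily", "tribe", "genus", "species"]


def build_taxonomy(checklist, root_name="Arthropoda"):
    """Build a seven-level tree below the root from checklist entries.

    checklist: list of dicts mapping rank -> name; ranks that are not used
    for a taxon are absent or None (assumed input layout).
    Returns (children, dummies): children maps a node path to its child paths.
    """
    root = (root_name,)
    children = {root: set()}
    dummies = set()
    for entry in checklist:
        path = root
        for rank in RANKS:
            name = entry.get(rank)
            if not name:
                # Hypothetical dummy name filling a rank missing in the checklist.
                name = f"{path[-1]}_dummy_{rank}"
                dummies.add(path + (name,))
            child = path + (name,)
            children.setdefault(path, set()).add(child)
            path = child
    # Add one "unknown" branch under every internal node; these branches
    # get no children of their own.
    for node in list(children):
        children[node].add(node + ("unknown",))
    return children, dummies


def nodes_per_level(children):
    """Count nodes at each level below the root (1 = class ... 7 = species)."""
    counts = {}
    for kids in children.values():
        for kid in kids:
            level = len(kid) - 1
            counts[level] = counts.get(level, 0) + 1
    return counts
```

With a single entry for the genus Actornithophilus (family Menoponidae, order Phthiraptera, no subfamily or tribe), the sketch creates two dummy names, matching the example in the text. The species-level count reported above follows the same logic: 26,437 known species plus one unknown species under each of 6898 genera gives 33,335 nodes.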
To train the classifier in taxonomic assignment, we used 37,422 sequences from the FinBOL arthropod data set accompanying this study (Table 1). Of these sequences, 2798 sequences were only assigned to a genus or to an interim species, whereas 34,624 sequences were assigned to valid species. Out of 26,437 known species in the taxonomy, the data set used to train PROTAX included at least one reference sequence for 10,985 species. Out of 6898 known genera, the data set included at least one reference sequence for 3910 genera.

| Modelling approach
In PROTAX, classification starts at the root node of the taxonomic hierarchy, where a query sequence belongs with probability of one, and proceeds to leaf nodes passing through all ranks. Probability assignment from a parent node to its child nodes is achieved by means of a multinomial regression model. The parameters of the model are estimated using reference sequences to mimic query sequences coming from different parts of the taxonomy. A detailed description of PROTAX can be found in Somervuo et al. (2016), and a detailed description of the current implementation in Appendix S3. For the present purpose, the software has been rewritten in C to maximize its performance and is available in the GitHub repository https://github.com/psomervuo/protaxA.
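The step from a parent node to its child nodes can be illustrated with a generic multinomial (softmax) regression. The sketch below is a simplification and not the PROTAX implementation: the predictors, the coefficient values, and the fixed score of zero for the "unknown" branch are assumptions chosen for illustration; the actual predictors and weighting are described in Somervuo et al. (2016) and Appendix S3.

```python
import math


def child_probabilities(parent_prob, child_features, beta, unknown_score=0.0):
    """Split the probability of a parent node among its child nodes.

    child_features: dict child name -> list of predictor values, for example
        [1.0, max_similarity_to_references] (assumed predictors).
    beta: regression coefficients, same length as each predictor list.
    The branch for taxa missing from the reference library is added under
    the name "unknown" with a fixed linear score (an assumption).
    """
    scores = {
        name: sum(b * x for b, x in zip(beta, feats))
        for name, feats in child_features.items()
    }
    scores["unknown"] = unknown_score
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: parent_prob * w / total for name, w in weights.items()}
```

Because the child probabilities always sum to the probability of the parent, applying this step repeatedly from the root (probability one) down to the species level yields a consistent probability for every node in the tree.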
To match typical data types, we constructed two versions of PROTAX, one for the full-length (658 bp) Folmer region (Folmer et al., 1994), adopted as the standard DNA barcode for animals (Hebert et al., 2003), and another for the Leray region (313 bp), as amplified by primers mlCOIintF (Leray et al., 2013) and jgHCO2198 (Geller et al., 2013). Parameter estimation was done using MCMC as explained in Somervuo et al. (2016). Model parameterization was done separately for each level of taxonomy as in Somervuo et al. (2017). For each of the seven levels of the taxonomy, 10,000 training sequences were generated from reference sequences and 2000 iterations of MCMC were performed. The first half of the iterations was used for adapting the proposal distribution, whereas MAP estimates of parameters were selected from the second half of the iterations where the proposal distribution was fixed. The probabilistic taxonomic assignment tool parameterized for Finnish arthropods is henceforth referred to as FinPROTAX.

FIGURE 2
Taxonomic composition of the known Finnish arthropod fauna, its representation in reference libraries, and the identification success achieved by molecular tools. Shown on the left is the number of arthropod species per order in Finland, with the total length of each horizontal bar indicating total species richness and the sections within each bar showing the fraction of Finnish species for which DNA barcodes only occur in the FinBOL material (maroon); in FinBOL and in other BOLD material (dark blue); in BOLD but lacking from FinBOL (cyan); or completely missing from BOLD (green). The right-hand part of the figure identifies the improvement in identification success resulting from the FinBOL records. Shown from right to left is identification success under three scenarios: (1) BOLD without the FinBOL records, (2) BOLD with the FinBOL records added, and (3) identification by comparison to the FinBOL records alone. Sequences were assigned to species based on sequence similarity with reference sequences in BOLD using the BOLD ID Engine. Identification success was scored assuming that the original identification was correct. Sections within each horizontal bar show the proportion of query sequences assigned to a single correct species (column "Correct"), to several alternative species with the same sequence similarity, of which at least one represented the correct one (column "Several alternatives") or to the wrong species (column "False"). The composition of the overall fauna is taken from the Finnish national checklist of species (FinBIF, 2020). This checklist was also used to query the representation of Finnish species on BOLD. Due to possible differences between reference checklists used by different BOLD users when submitting data (e.g., in delimitation of genera), a single species may appear on BOLD under more than one name. As a result, our coverage counts for many orders are likely to be slight underestimates.

TABLE 1
Extent and coverage of the FinBOL arthropod reference library compared to overall arthropod species richness in Finland

| Validating FinPROTAX performance
To validate the performance of the parameterized PROTAX model, we used two approaches: First, we used 10,000 sequences from the FinBOL arthropod data set as query sequences. Since the correct assignment was known to the species level for each of these sequences, we could validate the accuracy with which a query sequence was attributed to the correct taxon at each level in the taxonomic hierarchy (class, order, family, subfamily, tribe, genus and species). In this validation study, we calculated the probabilities from each query sequence against all taxa and took the taxon corresponding to the highest probability. During this process, the query sequence was removed from the reference sequences so the query sequence was not allowed to match itself.
The assignment was deemed correct if the taxon with the highest probability matched the given taxonomic label of the sequence. In this way, we were able to compare how the best probability given by PROTAX corresponds to the correct classification and verify that the probabilities provided by PROTAX are unbiased (Somervuo et al., 2016). To speed the search, we excluded all taxa with negligible identification probabilities, using 0.01 as the threshold. If a parent node had a probability below the threshold, all child nodes of the parent were excluded from the further search. This test was run separately for the full barcoding region and the Leray region.
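The pruning rule described above can be sketched as a top-down traversal. The example assumes that a function returning the conditional probabilities of the children of a node is available (in practice these would come from the trained model); the function name `classify`, the toy tree, and the use of unique node names are illustrative assumptions.

```python
def classify(root, conditional, threshold=0.01):
    """Top-down search that skips branches with negligible probability.

    conditional(node) returns a dict child -> P(child | node), or an empty
    dict for leaf nodes (assumed interface). Node names must be unique.
    Returns a dict node -> probability for every node that was kept.
    """
    kept = {root: 1.0}          # the query belongs to the root with probability one
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p_cond in conditional(node).items():
            p = kept[node] * p_cond
            # A node below the threshold is dropped, and so are all of its
            # descendants, because it is never pushed onto the stack.
            if p >= threshold:
                kept[child] = p
                stack.append(child)
    return kept


# Toy conditional probabilities, invented purely for illustration.
toy_tree = {
    "Arthropoda": {"Insecta": 0.995, "Arachnida": 0.005},
    "Insecta": {"Diptera": 0.9, "Hymenoptera": 0.1},
    "Arachnida": {"Araneae": 1.0},
}
result = classify("Arthropoda", lambda node: toy_tree.get(node, {}))
```

In the toy example, Arachnida falls below the 0.01 threshold, so Araneae is never evaluated even though its conditional probability is high.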
Second, we assessed the taxonomic confidence a user of FinPROTAX will achieve in identifying an environmental sample [...] Malaise trap (Geiger et al., 2016; Malaise, 1937; Townes, 1972) [...] Sanger sequenced using the methods described in Appendix S4. Of the resulting 6486 sequences, we ascertained the fraction that could be assigned to a given taxonomic level. Realizing that different researchers will be satisfied with different levels of confidence, we carried out this analysis at two probabilities: 0.9 and 0.5. For interpreting these cutoffs as "reliable" (0.9) and "plausible" (0.5), we refer the reader to section Training PROTAX to identify Finnish arthropods and to Somervuo et al. (2016, 2017), noting explicitly that these probabilities naturally build on sequence similarity, but otherwise have nothing in common with a simplistic cutoff of, say, 98% or 99% sequence similarity (e.g., Clare et al., 2019).

[...] (Figure 1a). Viewed at a regional scale, it is complementary to the modest DNA barcoding efforts in Sweden and the extensive campaigns in Norway (Figure 1b). Considering national barcoding campaigns in Europe, the Finnish effort is substantial and complementary in terms of BIN coverage (Figure 1b; for exact numbers see Appendix S2). Needless to say, the fact that one in eight of the BINs contributed to BOLD is new also indicates that the rest had been sequenced earlier elsewhere. Thus, BIN overlap was also substantial (Figure 1).

| Taxonomic resolution achieved
The new records contributed by FinBOL substantially improve upon the accuracy achieved by the global identification resources.
From our query material of 1000 insect and spider species, 73% of insects and 91% of spiders were correctly assigned a single, unequivocal best match by BOLD when the Finnish material was removed (Figure 2) versus 84% for insects and 96% for spiders once it was added. This was mainly due to fewer false assignments, with a smaller reduction in the fraction of taxa yielding multiple equally well-matching identifications (Figure 2). The highest accuracy was achieved when comparison was restricted to the national reference library alone (Figure 2). Identification success varied substantially among orders, but reached 100% in several well-represented orders (Figure 2).
The taxonomic coverage of the FinBOL arthropod reference data (Figure 2, left-hand part) allowed accurate training of FinPROTAX.
Because certain classes were absent from the training set (Figure 2, right-hand parts; Table 1), the training of FinPROTAX was restricted to spiders (order Araneae) and insects (class Insecta) only. For these taxa, the taxonomic classifier achieved high accuracy in assigning query sequences to the correct taxon at all taxonomic levels. For the 10,000 query sequences of known taxonomic affinity, the probabilities proved unbiased sensu Somervuo et al. (2016). In other words, among a large number of query sequences that PROTAX each assigns to a taxon with a probability of 0.9, about one in ten will be incorrectly classified while nine in ten will be correct.
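Whether reported probabilities are unbiased in this sense can be checked by comparing, within bins of predicted probability, the mean predicted probability with the observed fraction of correct assignments. The sketch below shows one generic way to tabulate this; the function name, the input layout, and the choice of ten equal-width bins are assumptions and do not describe the validation code used in the study.

```python
def calibration_table(predictions, n_bins=10):
    """Compare predicted probabilities with observed accuracy per bin.

    predictions: list of (probability, is_correct) pairs, one per query,
    where probability is that of the best-supported taxon (assumed layout).
    Returns rows of (bin_low, bin_high, n, mean_probability, accuracy).
    """
    bins = [[] for _ in range(n_bins)]
    for prob, correct in predictions:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, correct))

    table = []
    for i, items in enumerate(bins):
        if not items:
            continue  # skip empty bins
        n = len(items)
        mean_prob = sum(p for p, _ in items) / n
        accuracy = sum(1 for _, ok in items if ok) / n
        table.append((i / n_bins, (i + 1) / n_bins, n, mean_prob, accuracy))
    return table
```

For unbiased probabilities, the last two columns of each row should agree closely once a bin holds many sequences; ten queries at probability 0.9 of which nine are correct give a mean probability and an accuracy that both equal 0.9.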

| With what accuracy can we identify an environmental sample?
The accuracy achieved by FinPROTAX is further demonstrated by its classification of an independent data set from a Malaise trap. Among the 6486 sequences, some were assigned to known taxa with a high probability whereas others were not (Figure 3, for an interactive graph see Appendix S4). As expected, the probability of correct assignment was greatest for higher levels in the taxonomic hierarchy, for which more training data was available (Table 2; Appendix S4).
Differences in assignment success between the two cutoff levels (0.9, 0.5) were surprisingly small: if a sequence was attributed to an unknown branch at a lower level, it was likewise consistently attributed to an unknown branch at a higher level (Table 2).
Among insects, a substantial proportion of sequences were assigned to the category of "unknown order" (Figure 3). This category included sequences for which the highest probability included taxa not represented in the reference sequences. They included both taxa not represented in the current taxonomy and known taxa lacking reference sequences. Closer investigation revealed that most of these sequences were derived from two species, Lepidocyrtus lanugi- [...]
The level of confidence in identifications varied substantially among orders. For Lepidoptera, sequences were relatively evenly and reliably assigned to a species, reflecting its high coverage in the reference database (Figure 1; Table 2). For Diptera, out of 1918 sequences that were classified to the order level with a probability greater than 0.9, 915 sequences (48%) were assigned to a known species with a probability exceeding 0.9. For Hymenoptera, the corresponding numbers were 1088 and 130 (12%), much lower than those for Lepidoptera (123/150 = 82%), Oribatida (106/112 = 95%), and Araneae (60/71 = 85%).
Among mites in Oribatida, most sequences were classified into a single species, Diapterobates humeralis. By contrast, a large proportion of sequences for Hymenoptera were identified with substantial uncertainty (see large grey circle among Hymenoptera in Figure 3).

| DISCUSSION
Building on a decade of work, the Finnish Barcode of Life initiative has delivered a unique reference library for specimen identification and a parameterized tool for probabilistic taxonomic assignment.
The main outcome is a massive transfer of identification capacity from the taxonomic community to society at large. In the process, the project provided five valuable lessons about how national resources for biodiversity research may be built. The following text discusses each of these aspects.

| A national reference library for arthropods: Extent and coverage
Of the 26,437 arthropod species known from Finland, 23,956 of which are insects (FinBIF, 2020), the current FinBOL data release provides reference sequences for 11,275 and 10,811 species, respectively, i.e., 43% and 45% of the known national fauna. In well-studied groups, barcode coverage is high, with 92% of 2616 species of Lepidoptera and 94% of 218 known species of Trichoptera.
However, this relatively high coverage also extends to many "difficult", globally poorly known groups such as selected families of Diptera and Hymenoptera (Figure 2).
Despite the high coverage, the proportion of arthropod taxa which were contributed uniquely through FinBOL to the global database BOLD (Ratnasingham & Hebert, 2007) averaged just 7%, but ranged from 0 to 54% of the species per order (Table 1).
The current figures are slightly inflated because a few species collected outside Finland were included in the data set. The relatively low uniqueness of the Finnish fauna reflects the recent deglaciation of the target area along with general rules of biogeography. The latest glaciation in Finland, the Weichselian, peaked 22,000 BP and although deglaciation began by 13,000 BP, the northern parts of Finland only became ice-free about 10,500 BP (Donner, 1995). As a result, almost all current species have colonized since then, and very few species are endemic. Species range size also tends to increase with latitude, a pattern known as Rapoport's rule (Ruggiero & Werenkraut, 2007; Stevens, 1996). Combined with a general decline in species richness with latitude (e.g., Hillebrand, 2004; Schemske & Mittelbach, 2017), this rule leads to fewer unique species per unit area with latitude. For a high-latitude country such as Finland Russia (data scarce and not shown), FinBOL has generated a valuable regional resource.
Against this general setting, the building of the national reference library has helped to clarify which species are truly shared with other regions. For example, a large-scale comparison of Austrian and Finnish Lepidoptera demonstrated that several arctic-alpine species thought to be shared between high latitudes and high elevations are actually different species (Huemer et al., 2014; Mutanen, Hausmann, et al., 2012). A similar comparison at an intercontinental level revealed many overlooked species shared between Palearctic and Nearctic regions (Landry et al., 2013; Pentinsaari et al., 2020).
Comparisons of national barcode libraries regularly reveal cases of deep sequence divergence requiring detailed taxonomic study, and such work often leads to the description of new species. One recent example is the split of the charismatic moth Pyralis regalis (Denis & Schiffermüller, 1775) into two distinct but morphologically similar species (Wikström et al., 2020). Such results make clear the need to barcode representatives of presumptively widespread species across their range. While a reference sequence for a given species from any point in its range is valuable, a DNA barcode reference library based on restricted geographic sampling will only contain a fraction of the variation within most species (Bergsten et al., 2012; Huemer et al., 2014; Mutanen, Hausmann, et al., 2012), weakening its capacity to generate reliable taxonomic assignments on a global scale. For this reason, it is important to construct regionally comprehensive reference libraries of widespread species, from both a national and a global perspective.

| Taxonomic resolution achieved
When added to the identification engine in BOLD, the barcode records generated by FinBOL substantially improved identification success. This improvement occurred against a background of already high success (see Figure 2): even with all Finnish records excluded, a full 74% of the species in our query set were assigned to a single, correct taxon when queried through the BOLD identification engine.
This attests to the power of the global barcoding effort, which has populated the global reference library with extensive, taxonomically annotated reference sequences, including massive European material (Figure 1).
For every major order except Hymenoptera, the sequences contributed by FinBOL produced a significant increase in identification success (Figure 2). One aspect of this success involved a general reduction in the number of false assignments, that is, cases where the best-matching sequence was erroneously annotated.
Although some of these mismatches may reflect disagreement among taxonomists regarding the correct species name, such cases probably represent a minority of the "false" species assignments. In

FIGURE 3
Resolution achieved in the taxonomic assignment of environmental data. Barcode sequences >500 bp were recovered from 88% (6,486/7,414) of the arthropods in an annual Malaise trap sample. For these organisms, we show the proportions of taxa within the phylum Arthropoda assigned to a given taxon. Each sequence was assigned to the taxon with the highest probability (excluding sequences where the highest probability was <0.1). Grey coloration indicates all sequences while red indicates the number of sequences assigned with a probability exceeding 0.9 (the sizes of the different diagrams are not internally comparable). The central diagram shows the classification to an order level. All samples were classified into two classes, Insecta and Arachnida. Five diagrams surrounding the central one show the classifications from the order to the species level. Araneae and Oribatida were the two largest orders for Arachnida while Diptera and Hymenoptera were the two largest orders for Insecta. To visualize the full contents of the sample, we provide an interactive Krona wheel (Ondov et al., 2011)

World, cases that await taxonomic revision and the declaration of synonyms (e.g., van Nieukerken et al., 2016). More importantly, since BOLD is not only a reference database, but also a workbench for analysing and curating DNA barcode data, some misidentified, contaminated, chimeric or otherwise erroneous sequences make their way into the database and may not be immediately excluded (see Pentinsaari et al., 2020). In addition, the species-level resolution of the COI barcode region is not always perfect as closely related species may share identical haplotypes or form mixed sequence clusters (Hausmann et al., 2013; Huemer et al., 2014; Prous et al., 2016). In such cases, multilocus approaches sometimes improve identification success (Meiklejohn et al., 2019), and assignment based on whole genomes may improve success even further (e.g., Ji et al., 2020).
Yet, for establishing a national reference library of the current size, coverage, and curation since 2010, under then prevailing analytical costs, no realistic alternatives to the current single-locus approach were available. Here, the improvement in correct taxonomic assignment shows the value of a highly curated reference database, as achieved by the large network of skilled taxonomists contributing expert-identified material to FinBOL.
Overall, the improvement in identification success enabled by the FinBOL records seemed roughly proportional to species diversity in the target group (which affects the number of choices) and to the representation of species in FinBOL versus the rest of BOLD (which affects the added precision brought by the FinBOL effort). Clearly, the highest identification success is attained when the query sequences are matched to the national database alone (Figure 2). It should be noted, however, that the exact comparison performed here is restricted to taxa represented by at least two sequences in the FinBOL data; otherwise, we could not compare the match of query sequences to FinBOL versus non-FinBOL material.
For some 15,000 arthropod species, of which 13,000 are insects, no reference material exists in FinBOL, and the lack of coverage is as high as 100% in some arthropod orders (Table 1; Figure 2).
Thus, for a random sequence generated de novo from an unknown query sample, identification success varies substantially depending on the order (compare Figure 2). While the national checklist of arthropods is itself incomplete, it provides a vivid illustration of the challenges caused by the lack of reference sequences from the national barcoding library. For about one-half of all Finnish arthropod and insect species, no reference exists in FinBOL, and for about one-third none exist elsewhere in BOLD (Figure 2). Thus, if we assume that the reference sequence providing the best match to the query sequence will equal the correct identification, we will be off in a substantial proportion of cases. To control for this strong bias, our PROTAX implementation, FinPROTAX, quantifies the likelihood that the query sequence represents a species currently missing from the reference library.
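The size of the remaining gap can be derived from the species counts reported for the national checklist and the FinBOL release. The following sketch simply redoes that arithmetic; the variable names are illustrative.

```python
# Species counts reported in the text (FinBIF checklist vs. FinBOL release).
known = {"arthropods": 26437, "insects": 23956}      # species known from Finland
barcoded = {"arthropods": 11275, "insects": 10811}   # species with FinBOL barcodes

# Coverage as a rounded percentage, and the number of species still lacking
# a reference sequence in FinBOL.
coverage = {group: round(100 * barcoded[group] / known[group]) for group in known}
missing = {group: known[group] - barcoded[group] for group in known}

for group in known:
    print(f"{group}: {coverage[group]}% covered, {missing[group]} species missing")
```

The result reproduces the 43% and 45% coverage figures, and the missing counts (15,162 arthropods, 13,145 of them insects) correspond to the "some 15,000" and "13,000" species mentioned above.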

| A new tool for taxonomic assignment
Some species will always be missing even from comprehensive DNA barcode reference libraries, simply because they are excessively rare or hard to obtain. Such gaps are caused by incomplete sampling and/or by the fact that communities at all spatial scales show changes over time (e.g., Antão et al., 2020). Gaps in the reference library can be forgotten when taxonomic assignment is based on highest sequence similarity alone, even though these gaps may have a critical impact on the identifications achieved (see Somervuo et al., 2017). For FinBOL, the fact that nearly half of the national arthropod fauna has now been covered also implies that more than half awaits sequencing. The PROTAX implementation based on the Finnish reference library, FinPROTAX, allows researchers to account for such influences, and provides intuitive measures of uncertainty to help them evaluate the reliability of taxonomic assignments.
Implemented as a web-based service (https://laji.fi/en/theme/protax), this new resource allows the accurate taxonomic placement of insect samples and the evaluation of the uncertainty associated with their placement at each level in the taxonomic hierarchy.

TABLE 2 Taxonomic resolution achieved for a full-season sample of arthropods from a Malaise trap
Notes: Shown is the proportion of taxa assigned to a given taxonomic level with a probability exceeding a particular cutoff value, either 0.9 (left-hand columns) or 0.5 (right-hand columns). "known%" refers to branches with a reference sequence in the training set while "unknown%" refers to branches for which no reference sequences are available. For details on the specific data set and for an interactive visualization of the full contents of the sample, see Appendix S4.

When validated by query sequences from the national reference library, the accuracy of FinPROTAX was high, with 88.5% of test sequences being assigned to the correct species as the most likely match. The same unbiased result was achieved both for the full-length Folmer region and the Leray region, despite the latter being only half the length of the former (313 bp vs. 658 bp) and thereby including less nucleotide variation. Yet, beyond taxonomic assignment (that is, likely names with which to label the species), FinPROTAX also provides a measure of uncertainty (that is, of the probability with which this label is correct). The importance of this added consideration was highlighted by our application of FinPROTAX to a diverse sample of arthropods collected by a Malaise trap.
Notably, no algorithm can reliably attribute a sequence to a named taxon for which it has seen no training data. Thus, when PROTAX reports "unknown" taxa with high probability, it does not necessarily mean that the sequence originates from a previously unknown taxon not included in the taxonomy. Instead, it means that no good match was present among the existing reference sequences.
To understand the implications of this outcome, it is important to consider what PROTAX does: it converts sequence distances into taxon probabilities. Hence, the quality of the sequences (both query and reference sequences) is key to accurate taxon assignments, and uncertainty can be due to sequencing errors and/or to taxa being absent from the reference sequences. An extreme case occurs when the taxon is included in the taxonomy but lacks any reference sequences. When PROTAX reports unknown taxa, it includes in that category also known taxa for which no reference sequences are available (Somervuo et al., 2016).
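To make the idea of converting sequence distances into taxon probabilities concrete, the toy sketch below maps distances to probabilities that include an explicit "unknown" outcome. It is not the PROTAX model: the exponential weighting, the function name, and the parameters `beta` and `unknown_distance` are assumptions chosen purely for illustration.

```python
import math


def toy_taxon_probabilities(distances, beta=40.0, unknown_distance=0.10):
    """Turn per-taxon sequence distances into probabilities that sum to 1.

    distances: dict mapping taxon name -> distance (0..1) between the query
    and the closest reference sequence of that taxon.
    beta and unknown_distance are made-up illustration parameters; they are
    NOT fitted PROTAX parameters.
    """
    # Smaller distance -> larger weight.
    scores = {taxon: math.exp(-beta * d) for taxon, d in distances.items()}
    # The "unknown" outcome behaves like a taxon at a fixed reference distance.
    scores["unknown"] = math.exp(-beta * unknown_distance)
    total = sum(scores.values())
    return {taxon: s / total for taxon, s in scores.items()}


# A query with an exact match in the reference library...
close = toy_taxon_probabilities({"species A": 0.00, "species B": 0.08})
# ...and a query that is distant from every reference sequence.
far = toy_taxon_probabilities({"species A": 0.15, "species B": 0.18})

print(max(close, key=close.get), max(far, key=far.get))
```

In this toy setting, a query with a close match is assigned to a named species with high probability, whereas a query that is distant from all references is placed mostly in the "unknown" category, mirroring the behaviour described above.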
This consideration is further illustrated by the large uncertainty associated with specific taxa in the Malaise trap material (Figure 3).
Here, sequences from classes, orders, or species missing from the FinBOL material were consistently assigned to the category of "unknown" taxa, just as they logically should be, since they are by definition unknown to the taxonomic classifier. In this context, we note that we only trained the classifier on material containing insects (class Insecta) and spiders (class Arachnida, order Araneae). This is because insects and spiders have been more popular among contributing taxonomists than any other arthropods (Table 1), for which reason they are also the ones for which classification needs will most often arise.
By definition, this implies that the current FinPROTAX classifier has been explicitly optimized for these two taxonomic groups, whereas it will face challenges in classifying other sequences (i.e. those from taxonomic groups never shown to FinPROTAX). In practice, arthropods outside of Insecta will either be attributed to the "unknown" category under the root node, or to "unknown Insecta".
Within classes, there was much variation in the probability of assignment for individual taxa. Between the levels of class and order, the differences in uncertainty were relatively small ( Chalcidoidea (see Figure 3, Appendix S4). Again, this is a logical outcome: the less knowledge we have of what taxa to choose from and of the molecular variation within versus between taxa, the more difficult it is to attribute a DNA sequence to a specific taxon.
Importantly, the current taxonomic resolution achieved by FinPROTAX is based on the reference data alone, and neglects additional considerations. An obvious source of uncertainty not modelled by PROTAX involves errors resulting from PCR and sequencing.
Therefore, when using PROTAX with sequences coming from error-prone high-throughput sequencing platforms, third-party software such as DADA2 (Callahan et al., 2016) should be used to reduce the incidence of errors in reads. The presence of sequencing errors does not prevent the use of FinPROTAX, but the errors will result in a higher uncertainty, i.e., less specific taxon assignments.
Another complication to taxonomic assignment is barcode sharing, i.e., cases where two valid species share the same sequence for the gene region examined. Such cases will naturally increase uncertainty as there is no way to tell such species apart. In ambiguous cases (see examples in Appendix S4), any additional data will be worth evaluating, including cases where extant ecological knowledge may extend the insights from molecular data. Where a resolved species-level taxonomy is missing for key groups, applications can be strengthened by mapping the geographic distributions of sequence-based species proxies (see Pentinsaari et al., 2020). As a future improvement to PROTAX, we propose to add priors informed by the spatial, temporal, and ecological context of the sample. As a simple solution, the user may a priori weight their belief that a given species may occur in the sample based on, for example, digital maps of the distribution of the target taxa, their host plants, and habitats. If implemented as a simple dichotomy (e.g., probable vs. almost impossible species), this can be done for even a large number of species with reasonable effort.
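The proposed dichotomous prior can be illustrated with a small reweighting sketch. This is one possible reading of the proposal, not an existing PROTAX feature: the function name, the `implausible_weight` value, and the example probabilities are all hypothetical.

```python
def apply_occurrence_prior(probabilities, plausible, implausible_weight=0.01):
    """Reweight sequence-based probabilities by a dichotomous occurrence prior.

    probabilities: dict taxon -> probability from sequence data alone.
    plausible: set of taxa considered probable at the sampling site.
    implausible_weight: hypothetical down-weighting factor (> 0) applied to
    taxa considered almost impossible at the site.
    """
    weighted = {
        taxon: p * (1.0 if taxon in plausible else implausible_weight)
        for taxon, p in probabilities.items()
    }
    total = sum(weighted.values())
    # Renormalize so that the reweighted probabilities again sum to 1.
    return {taxon: w / total for taxon, w in weighted.items()}


# Two species sharing a barcode receive equal probability from sequence data.
sequence_only = {"species A": 0.45, "species B": 0.45, "unknown": 0.10}

# Suppose distribution maps indicate that only species A occurs in the region.
posterior = apply_occurrence_prior(sequence_only, plausible={"species A", "unknown"})

for taxon, p in posterior.items():
    print(f"{taxon}: {p:.3f}")
```

In this example the barcode-sharing pair is separated by the prior alone: most of the probability shifts to the species considered plausible at the site, while the "unknown" category retains its relative weight.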

| Lessons learnt
Obtaining funding for a national project like FinBOL was difficult.
Assessing our progress, we believe the FinBOL initiative has catalyzed national biodiversity research on a broad scale. A uniting factor for this scientific community is a need for species-level understanding to examine the emergent features of overall diversity.
In this context, the project has overcome key aspects of the taxonomic impediment. It has truly been an investment in the construction of a national infrastructure for biodiversity science, adding an accurate and cost-effective tool to Finnish science, just as one might regard investment in a telescope or particle accelerator (Hebert, Hollingsworth, et al., 2016; Hebert, Ratnasingham, et al., 2016).

| Future directions
The FinBOL reference library can already deliver reliable taxonomic assignments for many taxa, but much remains to be done. In evaluating the finer details of the taxonomic assignments possible for our environmental sample, we note that the current evaluation applies strictly to a specific sample of a specific type. While highly diverse in species composition, any single sample naturally comes with its particular features. For example, the sampling site chosen for analysis will have an effect since some regions in Finland have been comprehensively sampled while other areas lacking museums or universities have not. The sampling technique employed will also affect the outcome.
For example, the insects dominating a Malaise trap catch are better represented in the current reference library than are soil arthropods.
To gain both regional and ecological coverage, efforts will be directed toward expanding the FinBOL arthropod library and the FinPROTAX tool based on new records, thereby improving its performance across all types of samples. Current plans call for 50% coverage for the esti-
Two methodological advances are aiding FinBOL's progress. One is the move to high-throughput sequencing platforms, such as SEQUEL, which reduce analytical costs and increase throughput. A second advance involves the improved ability to recover DNA barcodes from old specimens, enabling the use of museum collections to fill gaps in barcode coverage for rare species. FinBOL is currently sequencing large numbers of museum specimens representing rare taxa, including type material, to fill gaps. In the same way as type specimens were introduced to anchor the species name unequivocally to morphology, this approach will allow us to add a sequence to the species description (Miraldo et al., 2014; Prosser et al., 2016).
While technological progress is accelerating the acquisition of sequence information, the ranks of taxonomic specialists are thinning. Many of the older members of Finland's taxonomic community feel that their contribution to the current reference library forms a lasting contribution to science. Capturing such knowledge is essential because there are, for example, no young Finnish taxonomists who can critically identify species in many key groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges, most mite lineages). Hence, the annotated barcode records assembled by FinBOL participants represent a tremendous intergenerational transfer of taxonomic knowledge. The ultimate aim of FinBOL is to facilitate the rapid, accurate identification of specimens regardless of life stage, sample size, and quality. Until now, this has demanded access to experts, but the accurate identification of voucher organisms is a demanding, time-intensive process. However, the time contributed by current taxonomists in identifying and contributing voucher specimens represents a great gift to future generations who will benefit from their expertise when they are no longer able to process new material. Thus, the current contribution offers a major capacity donation from the taxonomist community to science. The many taxonomists among us rejoice at this opportunity to contribute while the many molecular biologists, ecologists, and arthropod enthusiasts feel gratitude for this contribution.

ACKNOWLEDGEMENTS
FinBOL could not have achieved success without the contributions from many people with species expertise in diverse arthropod groups. While most significant contributors of FinBOL are co-authors on this publication, many others with smaller yet important contributions are not. They helped by providing specimens for tissue sampling often from their private collections and by identifying samples, but also in multiple other ways. We are most grateful to all these persons.
Also, FinBOL's work has been supported by a number of technicians, research assistants and students who have contributed to tissue sampling, photography and databasing of barcoded specimens. We are especially grateful to Piia Partanen and Riikka Jarkko for their major input in this regard. Staff at the Centre for Biodiversity Genomics