Image-based taxonomic classification of bulk insect biodiversity samples using deep learning and domain adaptation

Complex bulk samples of insects from biodiversity surveys present a challenge for taxonomic identification, which could be overcome by high-throughput imaging combined with machine learning for rapid classification of specimens. These procedures require that taxonomic labels from an existing source data set are used for model training and prediction of an unknown target sample. However, such transfer learning may be problematic for the study of new samples not previously encountered in an image set, for example, from unexplored ecosystems, and requires methods of domain adaptation that reduce the differences in the feature distribution of the source and target domains (training and test sets). We assessed the efficiency of domain adaptation for family-level classification of bulk samples of Coleoptera, as a critical first step in the characterization of biodiversity samples. Neural network models trained with images from a global database of Coleoptera were applied to a biodiversity sample from understudied forests in Cyprus as the target. Within-dataset classification accuracy reached 98% and depended on the number and quality of training images, and on dataset complexity. The accuracy of between-datasets predictions (across disparate source-target pairs that do not share any species or genera) was at most 82% and depended greatly on the standardization of the imaging procedure. An algorithm for domain adaptation, domain adversarial training of neural networks (DANN), significantly improved the prediction performance of models trained with non-standardized, low-quality images. Our findings demonstrate that existing databases can be used to train models and successfully classify images from unexplored biota, but the imaging conditions and classification algorithms need careful consideration.


INTRODUCTION
Tomochika Fujisawa and Víctor Noguerales contributed equally to this study.
Biological identifications increasingly rely on machine learning algorithms that use photographic images to place unidentified specimens into a taxonomic classification. As these methods are proving to be very powerful, especially for identification of the species-rich and morphologically diverse insects, it is now possible to place a specimen with high confidence against curated image libraries, for example, those obtained from pinned museum collections (Buschbacher et al., 2020; Hansen et al., 2020). With the rapid increase of such images, machine learning can greatly increase the capacity for species identification without putting demand on scarce taxonomy experts (Høye et al., 2021; Valan et al., 2019). The methodology, therefore, is likely to play a major role in the taxonomic endeavour in future, and deep learning potentially can have similar impacts on the practice of taxonomy as the revolution of DNA barcoding and metabarcoding some 20 years ago, or it could work in concert with these molecular approaches (Høye et al., 2021; Wührl et al., 2022; Yang et al., 2022).
However, the true potential and possible limitations of algorithmic methods for exploiting the information contained in specimen images remain to be established, as the various applications and choice of machine learning algorithms continue to be refined (Romero et al., 2020; Valan et al., 2019).
The greatest challenge for modern taxonomy is probably the study of highly diverse and poorly studied biotas and geographic regions around the world, harbouring many undescribed species (Costello et al., 2013). In particular, in studies of insect diversity, such as those from tropical forest canopy or the soil, huge numbers of specimens are collected and subsequently need to be classified and counted as part of ecological and environmental studies (Novotny et al., 2007; Caruso et al., 2019). In these circumstances, specimens are often assigned to high taxonomic ranks at order and family levels (Karlsson et al., 2020), for example, for broad ecological comparisons (Stork & Grimbacher, 2006) and ecological status assessment using bulk-sample specimens in freshwater ecosystems (Escribano et al., 2018). Thus, despite the lack of taxonomic resolution, family-level assignments are a critical first step to characterize biodiversity samples and demand rarely available broad knowledge of insects across taxonomic groups and geographic regions. Imaging of these specimens is comparatively fast with the help of recently described automated imagers (Ärje et al., 2020; Wührl et al., 2022) or by taking high-resolution images of large sets of specimens in a single photo, which can then be cropped to represent single individuals for subsequent classification (Hudson et al., 2015; Bian et al., 2022). Automated classification based on these images would remove the need for manual identification by taxonomic experts who individually can handle only a small portion of the diversity spectrum usually encountered in such studies (Basset et al., 2012), and thus may help to provide rapid assessment of threatened insect assemblages, where speed is a priority.
In machine learning, images are classified against a set of defined objects, for example, the images of a particular taxonomic group. A model trained to separate the types of images in this source is used to classify unlabelled objects in the target, such as an unknown set of specimens. Most recent studies used convolutional neural networks (CNN, LeCun et al., 2015) for the task of image classification. Because of the lack of images for training the full parameters of a CNN model, approaches like fine-tuning of the existing CNN (Ärje et al., 2020) or feature transfer from the pre-trained CNN (Valan et al., 2019) are commonly used in biodiversity studies, following the successful applications of pre-trained CNN outputs as generic image features (Donahue et al., 2013; Razavian et al., 2014). These methods of transfer learning (sensu Valan et al., 2019; see Table S1 for a detailed terminology) have already shown great power in taxon annotation of insect specimens, and in some cases, surpass the capabilities of trained taxonomists (Valan et al., 2021).
Yet, applications of image classification algorithms for insect biodiversity research have mainly been limited to narrow tasks and specific target sets, such as pinned museum specimens (Hansen et al., 2020; Valan et al., 2019), aligned body parts (Buschbacher et al., 2020; Klasen et al., 2022), or small target groups of a few species (Ärje et al., 2020; Popkov et al., 2022). In most of these studies, the unlabelled (target) set is from the same dataset, that is, the target taxa at species or higher hierarchical levels are included in the training set. However, as hitherto unsampled specimen sets are included, the feature space of source and target domains no longer has similar distributions. Thus, aligning the disparity between domains requires a trained model that can be generalized across the entire feature space of the domains, using procedures of 'domain adaptation' (e.g., Pan & Yang, 2010; Farahani et al., 2020; see Table S1). Methods for domain adaptation have been successfully applied to fields such as medical image classification (Guan & Liu, 2021), and they may also be useful for analysis of biodiversity samples and the classification of insect specimens from unexplored areas whose components are unlikely to be present in the training set.
Building an image-based classification system may be further complicated by several factors affecting the feature distribution of source and target datasets. Capture bias is a well-established problem in machine learning, as objects appear in different contexts (location, lighting, background, etc.) or are taken on different imaging devices.
Images of insects may be from collection specimens taken in fairly standardized positions and lighting conditions (Hansen et al., 2020; Valan et al., 2019), or may be obtained directly from bulk samples and photographed either singly (Raitoharju et al., 2018; Valan et al., 2019; Wührl et al., 2022) or cropped from large-field composite images (Buschbacher et al., 2020; Hansen et al., 2020). Images thus display different aspects of the specimens and differ in illumination and magnification, which affects the recognition of key features (Ärje et al., 2020; Raitoharju et al., 2018). The performance of a model trained in one dataset can be compromised if this deviation of a prediction target from the training source is not controlled correctly (Torralba & Efros, 2011; Tommasi et al., 2017), and such performance reduction has already been reported in applications for biodiversity research (Knyshov et al., 2021; Popkov et al., 2022).
Other issues are unrelated to differences in image acquisition, but result from the biases of defining the semantic categories or classes recognized in the source and target domains (Tommasi et al., 2017).
Such 'category bias' may arise from inconsistent labelling, either due to the application of different taxon concepts used for classifying species and higher taxa, or due to specimen misidentification. The resulting noisy or incorrect data labels then reduce the effectiveness of the model. In addition, particularly in higher taxonomic categories, the same name may be assigned to visually different images due to the distributional shift of subclasses (e.g., different genera representing a family in the source and target). Furthermore, in general cross-dataset applications, the model can encounter a category which is missing in the source training data, for example, a new family may be present. The treatment of such anomalous (or 'out-of-distribution'; Tabak et al., 2019; see Table S1) samples affects the reliability of the biodiversity assessment. As more variation is encountered, to fully learn the structure of the data, the model should scale with the size and complexity of the training data.
In practice, due to these problems of intra-class variability and the inconsistencies of the photographs, the success of deep learning in taxonomy to date has been in situations where a bespoke image library is available that holds a narrow representation of the query taxa and images under the same aspect and imaging conditions (brightness, angle, magnification, etc.; Buschbacher et al., 2020; Valan et al., 2021).
In addition, the performance evaluation in these studies often has been limited to the training-testing procedure within a single dataset, and the generalization capability of the models across datasets was not explicitly examined. The utility of these methods remains largely untested in the application to samples from poorly characterized species, such as those from previously unseen bulk insect samples in unexplored areas. Ideally, such samples would be identifiable against images drawn from other sources, for example, an image database of well characterized regional communities and taxa obtained elsewhere, while minimizing the adverse effects of biases in these datasets.
Here, we use deep learning approaches for classification of insects based on bulk-sample images from high-throughput biodiversity surveys, testing the possibility of domain transfer between unrelated image sets. We characterize bulk samples of poorly known communities of Coleoptera (beetles) collected in different sites around the world, whose high species diversity and complex morphological variation provide a challenging, but realistic situation for machine-based classification. The aim was a classification at the taxonomic level of family, which is a frequent goal of initial identification in world-wide biodiversity surveys where results may be used for subsequent specimen counts, assignment to functional groups, or further in-depth taxonomic identification by specialists. Using a specimen database from sites of various geographic origin globally, identified at family level, we attempt the classification of specimens in a geographically and ecologically disparate community (Cyprus). We use this setup to address the question of the transferability of identifications across communities from different habitats and continents, that is, when the input subclass (species and genera within a family) is not present in the training data. Various parameters are tested that may affect the prediction accuracy, including: (i) the size of the training set; (ii) the complexity of the training set, which may be affected by the level of intra-class variability, noise from misidentifications, or the presence of out-of-distribution samples; and (iii) the quality of images, for example, the resolution of the image using standard macrophotography versus high-resolution stacking technology. The error from these factors may be reduced by the use of advanced methods for domain adaptation. We here apply one such method, the domain adversarial neural network (DANN) algorithm, which includes unlabelled images from the target dataset in its training process to improve the target prediction. As we show, the use of deep learning with the specific domain adaptation algorithm is a powerful approach for classifying unknown samples, but the prediction success depends on the composition of the training set and may vary between classes (i.e., some beetle families are more easily predicted).

Sample collection and taxon selection
As the target for classification, we used a collection of leaf-litter bulk samples from a total of 46 sites distributed across five forest habitats of the Troodos mountain range of Cyprus (Figure 1). These samples were processed as described by Noguerales et al. (2021) to extract bulk Coleoptera specimens from the substrate using a Berlese apparatus.
During bulk-sample processing, a subset of individual specimens, representing all different morphospecies encountered in the samples, were separated and processed alongside the remainder of the bulk samples.
The two sets of samples (bulk samples vs. single specimens) were preserved in 100% ethanol and subsequently photographed following two different imaging protocols (L_H and L_L, respectively, see below). For more details on soil sampling and habitat descriptions, see Arribas et al. (2016) and Noguerales et al. (2021), respectively. During sample processing and imaging, the most common families/subfamilies, with 5 or more photographs per taxonomic rank and dataset, were identified and used for downstream analysis. The chosen families were: Brentidae, Carabidae, Chrysomelidae, Cryptophagidae, Curculionidae, Latridiidae, Leiodidae, Melyridae, Ptiliidae, Staphylinidae:Scaphidiinae, Staphylinidae (excluding Scaphidiinae) and Tenebrionidae (Table S2).

Local high quality (L_H) dataset
Bulk samples were air-dried and specimens placed at regular distances onto filter paper in a Petri dish. In cases of large disparity in body size, we split the bulk samples into different size categories which were separately photographed in order to improve the focus and resolution across all specimens regardless of their body size. As much as possible, specimens were positioned for photography in dorsal view.
Bulk-sample photographs were taken using a Zeiss AXIO Zoom.V16 Stereo Zoom Microscope equipped with a Zeiss AxioCam HRc (High Resolution 13 Megapixels Colour Microscope) camera at the Imaging and Analysis Centre at the Natural History Museum (NHM) in London, United Kingdom. This instrument has a motorized focus drive and motorized stage for generating large high-resolution images by dividing the field into regular tile-images that are subsequently xyz stitched.
Depending on the sample size, photographs were taken by dividing them into 16-64 tiles, each with 25-30 slices (z-stacks) using the Zeiss NEO 2 Blue Edition software. We rendered z-stack images with the Helicon Focus v.5.3.14 software (https://www.heliconsoft.com) using the pyramid-based algorithm ('Method C') and default parameters. Focus stacking was also performed using the depth-map algorithm ('Method B') in Helicon Focus with a radius value of 8 and a smoothing parameter […]. Finally, we manually cropped individual specimen photos from the bulk-sample images using INSELECT v.0.1.35 software (Hudson et al., 2015). After some minor corrections of bounding edges, cropped single-specimen images were exported and taxonomically identified at the family/subfamily level by the authors. Only whole-bodied specimens were considered for further analyses. The cropped images were resized to 255 × 255 pixels for subsequent classification tasks. When an image was not an exact square, the edges were padded using the average pixel value of the outermost portions of the image to enforce a square shape.
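The padding step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pad_to_square`, the `border` width used to define the "outermost portions" of the image, and the centring of the original image are all assumptions. The subsequent resizing to 255 × 255 pixels is omitted because it would depend on the image library used.

```python
import numpy as np

def pad_to_square(img: np.ndarray, border: int = 1) -> np.ndarray:
    """Pad an H x W (x C) image to a square, filling with the mean border value.

    `border` is a hypothetical parameter: the number of outermost rows and
    columns whose pixels are averaged to obtain the fill value.
    """
    h, w = img.shape[:2]
    # Boolean mask selecting the outermost rows and columns of the image.
    mask = np.zeros((h, w), dtype=bool)
    mask[:border, :] = True
    mask[-border:, :] = True
    mask[:, :border] = True
    mask[:, -border:] = True
    fill = img[mask].mean(axis=0)  # one mean value per colour channel

    side = max(h, w)
    out = np.empty((side, side) + img.shape[2:], dtype=float)
    out[...] = fill
    # Centre the original image inside the square canvas (assumed placement).
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img
    return out

# Toy 4 x 6 RGB image: bright interior (100) with a dark border (2).
demo = np.full((4, 6, 3), 100.0)
demo[0, :] = 2.0
demo[-1, :] = 2.0
demo[:, 0] = 2.0
demo[:, -1] = 2.0
squared = pad_to_square(demo)
```

In this toy case the padded rows take the border value, so the specimen is surrounded by a background that matches its original margin rather than by black or white bars.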
The individual frames cropped from the bulk samples were denoted the Local High Quality (L_H) data set, referring to the fact that they were obtained from a local area and thus represent a small taxonomically confined set, and taken at high image resolution. The L_H dataset represented the best-case scenario, where high-resolution training images of local samples are obtained under controlled conditions with high-performance imaging equipment. This set was the primary target in measuring the success of transfer learning.

Local low quality (L_L) dataset
A subset of single specimens (taken from the bulk samples) were individually photographed using a conventional stereoscope NIKON SMZ1270i equipped with a NIKON DS-Fi3 Microscope Camera (5.9 megapixels) controlled by the NIKON DS-L4 v.1.5.0.3 control unit.
These images were denoted the Local Low Quality (L_L) dataset. These photographs were intended to represent a more realistic scenario of local specimens being photographed during field sampling and sample sorting in local laboratory facilities using conventional instruments, and were used to address the question about 'capture bias', that is, the effect of imaging conditions on classification accuracy.

Global high quality (G_H) dataset
We also obtained a wider sample of images from a global catalogue of Coleoptera specimens available at https://www.flickr.com/photos/site-100/. These images had been obtained from local sampling campaigns at 11 sites throughout Central America, Africa and Southeastern Asia (see Table S3) and photographed in bulk using the Zeiss AXIO Zoom, as described above, while others were individually taken at high resolution on a single-lens reflex (SLR) camera (Canon EOS 500D) and macro lens (Canon MP-E 65 mm f/2.8 1-5x Macro). Helicon Focus software was used to render z-stack images, as described above. This dataset was denoted the Global High Quality (G_H) dataset. For each of the selected families, all specimen photographs available for the respective sites were used. Relative numbers of available specimens per family were usually correlated across sites, with greatest numbers in Staphylinidae. The numbers of images in the three data sets are shown in Table S2. The G_H dataset mainly consists of samples from tropical forest interception traps and leaf litter, and does not share lower taxonomic groups with the target dataset sampled from Mediterranean forest soils. These collections were the source for the test of domain adaptation protocols applied to the unknown Cypriot target.

Feature transfer and neural network classifier
We employed the strategy of feature transfer from the pre-trained convolutional neural network (CNN) proposed by Valan et al. (2019).
We chose the outputs of the fifth convolutional block of the VGG19 model after 2-dimensional average pooling as a set of features for an image, based on the results of Valan et al. (2019) and our pilot analyses. These 512-dimensional image features were used for the classification with a neural network classifier.
The neural network classifier consisted of two fully connected (FC) layers with ReLU activation and a softmax output layer (Figures 1 and S1). Dropout was applied after the FC layers with a dropout rate of 0.6. The neural network was trained with the stochastic gradient descent algorithm with the softmax cross-entropy loss for 300 epochs.
We used a batch size of 10 and a fixed learning rate of 0.01, and the convergence of loss was visually assessed. The numbers of units in the two FC layers (512 and 256 for the first and second FC layers, respectively) and the dropout rate were determined by five-fold cross-validation with 200 randomly selected images of the G_H dataset, and these hyperparameters were used throughout all classification tasks in this study.
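The feature pooling and the classifier head described in this section can be sketched as a forward pass. This is an illustrative sketch only, under several assumptions: the 7 × 7 × 512 feature map is a random stand-in for the output of a pre-trained VGG19, the weights are randomly initialised rather than trained, the 12 output classes correspond to the 12 taxa listed above, and inverted dropout is one common formulation that may differ from the one used in the study. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_average_pool(feature_map):
    """Collapse an H x W x C convolutional feature map to a C-dimensional vector."""
    return feature_map.mean(axis=(0, 1))

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate, training):
    """Inverted dropout (assumed variant): only active while training."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

# Layer sizes from the text: 512 features -> FC(512) -> FC(256) -> 12 classes.
N_FEATURES, N_FC1, N_FC2, N_CLASSES = 512, 512, 256, 12

# Random weights stand in for trained parameters (assumption).
params = {
    "W1": rng.normal(0.0, 0.05, (N_FEATURES, N_FC1)), "b1": np.zeros(N_FC1),
    "W2": rng.normal(0.0, 0.05, (N_FC1, N_FC2)), "b2": np.zeros(N_FC2),
    "W3": rng.normal(0.0, 0.05, (N_FC2, N_CLASSES)), "b3": np.zeros(N_CLASSES),
}

def classify(features, params, training=False, rate=0.6):
    """Two FC layers with ReLU and dropout, followed by a softmax output."""
    h = dropout(relu(features @ params["W1"] + params["b1"]), rate, training)
    h = dropout(relu(h @ params["W2"] + params["b2"]), rate, training)
    return softmax(h @ params["W3"] + params["b3"])

# Stand-in for the pooled output of the fifth VGG19 block (shape assumed).
feature_map = rng.random((7, 7, 512))
features = global_average_pool(feature_map)
probs = classify(features, params)
```

Because only the small head is trained while the convolutional features stay fixed, the number of trainable parameters remains modest relative to the full CNN, which is the motivation for feature transfer given earlier in the text.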

Metrics for prediction accuracy
We evaluated the performance of the models with the following metrics throughout the subsequent classification experiments. The accuracy of the prediction was measured as the proportion of successful predictions in the test set, $\mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = y_i)$, where $\hat{y}_i$ is the predicted class of the i-th image, $y_i$ the true class, $n$ the number of test images and $\mathbb{1}(\cdot)$ the indicator function, which equals 1 when its argument is true and 0 otherwise. The classification performance for each class was measured by the multiclass recall rate, multiclass precision and the F1-score (see Table S1). Recall rate of class c is defined as the proportion of correct predictions of c out of the actual number of images of c, $\mathrm{Recall}_c = \sum_{i} \mathbb{1}(\hat{y}_i = c \wedge y_i = c) \,/\, \sum_{i} \mathbb{1}(y_i = c)$. Multiclass precision is defined as the proportion of correct predictions of c out of the number of images predicted as c, $\mathrm{Precision}_c = \sum_{i} \mathbb{1}(\hat{y}_i = c \wedge y_i = c) \,/\, \sum_{i} \mathbb{1}(\hat{y}_i = c)$. The F1-score is the harmonic mean of the multiclass recall rate and precision, $\mathrm{F1}_c = 2\,\mathrm{Precision}_c\,\mathrm{Recall}_c \,/\, (\mathrm{Precision}_c + \mathrm{Recall}_c)$. Thus, the recall rate is interpreted as the fraction of images of a class present in the sample that are correctly selected, while precision quantifies the fraction of the images predicted as members of a class that are actually correct. The F1-score represents the overall performance of a classifier with respect to these two measures.
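The metrics defined above can be computed with a few lines of code. The following is a minimal sketch in plain Python; the function names and the toy labels are hypothetical, and returning 0 when a denominator is zero (a class with no true or no predicted images) is an assumed convention that the text does not specify.

```python
def accuracy(y_true, y_pred):
    """Proportion of images whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_metrics(y_true, y_pred, c):
    """Return (recall, precision, F1) for class c."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    n_actual = sum(t == c for t in y_true)     # images that truly belong to c
    n_predicted = sum(p == c for p in y_pred)  # images predicted as c
    # Assumed convention: a metric with a zero denominator is reported as 0.
    recall = tp / n_actual if n_actual else 0.0
    precision = tp / n_predicted if n_predicted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1

# Toy example with three families (labels are illustrative only).
y_true = ["Carabidae", "Carabidae", "Ptiliidae", "Ptiliidae", "Melyridae"]
y_pred = ["Carabidae", "Ptiliidae", "Ptiliidae", "Ptiliidae", "Ptiliidae"]

acc = accuracy(y_true, y_pred)
ptiliidae = class_metrics(y_true, y_pred, "Ptiliidae")
melyridae = class_metrics(y_true, y_pred, "Melyridae")
```

In this toy case Ptiliidae has perfect recall but a precision of only 0.5, because two images of other families were also assigned to it. A high recall combined with a low F1-score therefore signals that a class is absorbing misclassified images from other classes.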
We evaluated the transferability of learning by measuring the reduction of accuracy when a model trained with a source data set predicts target images. The target accuracy, Acc_T, was measured as the proportion of successful predictions of the target images (see Table S1). The baseline accuracy within the source dataset, Acc_S, measured by the within-dataset classification was compared with Acc_T.
The accuracy reduction, $\Delta\mathrm{Acc}(S, T) = \mathrm{Acc}_S - \mathrm{Acc}_T$, was recorded as a measure of transferability between the datasets. High ΔAcc indicates a large reduction of accuracy, hence difficulty in transfer.
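As a purely illustrative calculation, the accuracy reduction can be worked through with the rounded averages reported in the Results for 250 training images, where the within-dataset accuracy of the low-quality local set is 89% and its accuracy on the high-quality local target is 71%. The per-replicate values underlying Figure 3 may differ from this rounded example.

```latex
\Delta\mathrm{Acc}(L_L, L_H) = \mathrm{Acc}_S - \mathrm{Acc}_T \approx 0.89 - 0.71 = 0.18
```

A value of this size means that roughly one in five target images that would be classified correctly within the source dataset is lost when the model is transferred.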
Divergence between the source and target datasets was measured with a dataset classification error. A linear support vector machine (SVM) was trained to classify images to the source or target dataset with the features of 200 randomly selected images from both datasets. In contrast to the above analyses, here the model was trained to classify datasets instead of taxa. Then, a classification error of the SVM, ε_source-target, was measured as the proportion of incorrect predictions of 200 test images sampled from the two datasets. An intuitive interpretation of this measure is that the dataset classification task is harder when the feature distributions between two datasets are more similar. Therefore, a large classification error indicates high similarity between source and target datasets. This approach is commonly used to measure the dataset bias (Tommasi et al., 2017).
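The dataset classification error can be illustrated with a small self-contained sketch. Several assumptions apply: the linear SVM is fitted here by plain subgradient descent on the regularised hinge loss (the study does not state its solver or settings), the regularisation strength, learning rate and number of epochs are arbitrary, and the 16-dimensional Gaussian features are synthetic stand-ins for the 512-dimensional image features. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss.

    y holds dataset labels coded as -1 (source) and +1 (target).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1  # samples violating the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def dataset_classification_error(X_train, d_train, X_test, d_test):
    """Proportion of test images assigned to the wrong dataset."""
    w, b = train_linear_svm(X_train, d_train)
    pred = np.where(X_test @ w + b >= 0, 1.0, -1.0)
    return float(np.mean(pred != d_test))

def make_split(mean_source, mean_target, n_per_dataset=100, dim=16):
    """Synthetic features: one Gaussian cloud per dataset (illustration only)."""
    X = np.vstack([rng.normal(mean_source, 1.0, (n_per_dataset, dim)),
                   rng.normal(mean_target, 1.0, (n_per_dataset, dim))])
    d = np.concatenate([-np.ones(n_per_dataset), np.ones(n_per_dataset)])
    return X, d

# Clearly shifted feature distributions: the datasets are easy to tell apart.
err_distinct = dataset_classification_error(*make_split(0.0, 3.0), *make_split(0.0, 3.0))
# Identical feature distributions: the classifier can do no better than chance.
err_similar = dataset_classification_error(*make_split(0.0, 0.0), *make_split(0.0, 0.0))
```

The two synthetic cases bracket the interpretation given above: an error near 0 indicates datasets with distinct feature distributions, whereas an error approaching 0.5 indicates datasets that the classifier cannot distinguish.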

Within-dataset classification
To evaluate the baseline performance, Acc_S, of the CNN model, we first conducted bulk image classification within datasets (assessing the effect of intra-class variability). This was performed by testing the effect of the number of training images on prediction accuracy, whereby the CNN model was trained with N images randomly selected from the dataset and predicted the class (family label) of n test images randomly selected from the rest.
N ranged between 100 and 700 for L_H (with intervals of 100 images), between 50 and 250 for L_L (with intervals of 50 images) and between 100 and 900 for G_H (with intervals of 100 images). The number of test images n was set to 200 for L_H and G_H, and 50 for L_L due to the small size of the dataset. To evaluate the consistency of prediction accuracy, 10 replicates were generated for each scenario of N images. The effects of the number of images and difference of prediction accuracy between datasets were assessed by a linear regression model.
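The replicated sampling scheme described above can be sketched as follows. This is an illustration under stated assumptions: the function name `split_replicates`, the seed and the dataset size of 1000 images are hypothetical (the actual image counts are given in Table S2), and images are represented only by their indices.

```python
import random

def split_replicates(n_images, n_train, n_test, n_replicates=10, seed=0):
    """Yield (train, test) index lists for repeated random subsampling.

    Each replicate draws n_train + n_test distinct images without replacement,
    so the test images always come from the remainder of the dataset.
    """
    rng = random.Random(seed)
    for _ in range(n_replicates):
        idx = rng.sample(range(n_images), n_train + n_test)
        yield idx[:n_train], idx[n_train:]

# Hypothetical dataset of 1000 images; N runs from 100 to 700 in steps of 100,
# with n = 200 test images, as described for the high-quality local dataset.
N_IMAGES = 1000
design = {
    n_train: list(split_replicates(N_IMAGES, n_train, n_test=200))
    for n_train in range(100, 701, 100)
}
```

Each entry of `design` holds the 10 replicate splits for one training-set size, so a model would be trained and evaluated once per split and the resulting accuracies summarised per value of N.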

Between-datasets classification
For the between-dataset prediction, the CNN model was trained with a source dataset to predict images from a different target dataset. The NN was trained with N images randomly selected from the source dataset, which was then used to predict all images of the target dataset.

Effects of datasets and the number of images
The accuracy of within-dataset classification and the effect of the number of training images varied among datasets. The accuracy for the L_H samples of specimens collected from Cyprus generally improved with an increasing number of training images and reached an average of 96% with 700 images (Figure 2a). The maximum classification accuracy for the L_H was 98%.
The within-dataset classification accuracy of the L_L images, taken by a conventional stereoscope and camera, was generally lower compared to the L_H dataset (9.1% lower for L_L, linear regression p-value, p < 0.001). The accuracy increased monotonically with the increasing number of images and reached an average of 89% with 250 images (Figure 2b). As expected, the within-dataset classification accuracy of the G_H images obtained from diverse sites around the world was significantly lower (16% lower for G_H, p < 0.001) compared to the L_H obtained from the single area of Troodos. The improvement of accuracy was slower than for the other datasets, and the average accuracy was 84% with the maximum number of 900 images (Figure 2c), consistent with the greater heterogeneity of the global set. Loss and accuracy development during training of models are reported in Figures S2, S3.

Performance of between-dataset classification
The accuracy of cross-dataset predictions was first assessed in regard to the effect of image quality. When the L_L images were used to train the NN and then to predict the L_H images, the accuracy remained largely constant at 71% for 250 images (Figure 2b). The accuracy reduction ΔAcc, that is, the reduction in success of predictions compared to the predictions expected from within-dataset classification, rapidly increased with the number of images (Spearman rho = 0.72, p < 0.001), indicating that the training with L_L images did not improve the prediction of the L_H images (Figure 3).
Next, we considered the critical question about the power of the global dataset to predict the local data, using the G_H and the L_H as a source-target pair. The prediction accuracy for this comparison was close to the within-G_H predictions, with the average accuracy being 79% and the maximum 82% with 900 images (Figure 2c), indicating that the local set from the Cyprus collection (L_H) behaved in a similar way as the other local sets contributing to the G_H dataset. The accuracy reduction from G_H to L_H was on average 0.04 and remained almost constant after 300 images (rho = 0.13, p = 0.28, Figure 3). The power of the G_H dataset required the high image quality exhibited by the target (L_H); when the G_H-trained model was used to predict the L_L images, the accuracy was significantly lower (Figure 2c). This was also evident from the increased accuracy reduction with increased number of images; whereas the G_H → G_H predictions improved with more images, the G_H → L_L predictions did not (Figure 3). The dataset classification errors (ε_source-target) were 0.20 (G_H → L_H), 0.06 (G_H → L_L) and 0.01 (L_H → L_L), indicating high similarity between the G_H and L_H images and the distinctiveness of the L_L.

The performance of the domain adversarial training
The DANN significantly improved the target accuracy of the L_L → L_H prediction, which involves images from the different photographic setups (Figure 4a,b). A linear regression model showed that the target accuracy increased by 6.2% (p < 0.001, Figure 4b) and the accuracy reduction decreased by 0.060 when the DANN model was used with labelled L_L and unlabelled L_H images (Figure 3). The average target accuracy was 79% with 200 labelled L_L images and 400 unlabelled L_H images (Figure 4b), approaching the same level of accuracy as G_H → L_H predictions.
On the contrary, the DANN did not improve the target accuracy when the G_H was used as a source dataset (Figures S4 and S5). The G_H → L_H target accuracy was on average 0.75 with 940 labelled G_H images and 460 unlabelled L_H images (in total 1400 images), and overall target accuracy was significantly lower than the between-dataset predictions by the plain NN model (2.2% reduction for DANN, p = 0.0002, Figure S4). In the G_H → L_L prediction, a similar trend was observed (Figure S5) and the target accuracy was not significantly different from the NN (0.015% reduction for DANN, p = 0.98). Loss and accuracy development during training of models are reported in Figures S2, S3.

Classification error
Classification error was visualized as a scaled confusion matrix. In a within-dataset trial with 400 training images randomly sampled from the L_H, the large taxonomic groups were correctly classified in most cases (Table S4). For example, four families (Carabidae, Curculionidae, Ptiliidae and Staphylinidae) were classified with more than 95% recall rate, whereas the remaining taxa had widely different recall rates ranging from 0% to 82% (Figure 5a). In the extreme case of the family Melyridae, with the lowest number of available images (n = 5), no images were predicted correctly (Figure 5a).
When a taxon had >50 images, its recall rate and precision approached 1.0 (Figure 5a,c). The F1-scores showed a similar pattern, that is, for those images that were called to be members of a taxon, these predictions were generally correct (Figure 5e). Class-wise recall rates and F1-scores showed a strong positive correlation with the number of images (recall rates: rho = 0.81, p = 0.0014; F1-scores: rho = 0.85, p = 0.0005; Figure 5a,e). The effect of the number of images on the class-wise precision was also positive, but slightly weaker (rho = 0.41, p = 0.187, Figure 5c). Failed predictions included ventral views of insect bodies, specimens with missing body parts or multiple specimens in a single image (see Figure S6). Apart from these irregular images, most failed predictions were for taxa represented by <20 images (Figure 5a,c,e). Prediction probabilities for the successful predictions (average 0.98) were overall higher than for the failed predictions (average 0.79, Figure 6a), when using the L_H dataset with 400 training images.
For the between-dataset analysis, a confusion matrix of the G_H → L_H prediction trained on 800 images equally showed a high accuracy of predictions (Table S5). Misclassification mostly affected morphologically similar taxa, for example, the reciprocal confusion of Brentidae and Curculionidae (Table S5). Chrysomelidae, Curculionidae and Staphylinidae had recall rates >0.90 (Figure 5b), but more taxa were incorrectly classified than in the case of the L_H → L_H prediction.
No image of Leiodidae or Scaphidiinae, each with fewer than 50 available training images, was predicted correctly (Figure 5b).
The success of the class-wise recall rates was strongly correlated with the number of images in the source dataset (rho = 0.77, p = 0.0036, Figure 5b). Three taxa with >300 images had recall rates >0.95, whereas the taxa with <40 images had recall rates <0.4 (Figure 5b). The effect of the number of images on the class-wise precision and F1-score was also positive, but the effects were not significant (class-wise precision: rho = 0.16, p = 0.618; F1-score: rho = 0.42, p = 0.171; Figure 5d,f). Surprisingly, the F1-scores were greatly reduced relative to the recall score for the Chrysomelidae, indicating that the precision of the prediction was low even if the recall was high (Figure 5d,f), that is, the true Chrysomelidae were correctly classified, but many other taxa were incorrectly classified as Chrysomelidae.

Prediction probabilities and out-of-distribution samples
In order to test the effect of the presence of unknown inputs (out-of-
successful predictions (average 0.83, Figure 6a). However, four samples were predicted with high probabilities of >0.95, including three images of Coccinellidae, Hydrophilidae and Phalacridae that were classified as Ptiliidae (Figure 6a).
A similar test was also conducted for the between-dataset classifications by using a G_H-trained model to predict L_H images. Similar to the within-dataset classification, average prediction probabilities of successful predictions (0.98) were consistently higher than those of the failed predictions (0.84) and out-of-distribution samples (0.77). However, failed predictions more frequently had probabilities >0.95 than the out-of-distribution samples (Figure 6b).
For both tests involving the within- and between-dataset predictions, to detect the failed predictions, we set conservative threshold values for the prediction probabilities and marked samples below the threshold as potential misclassifications. When the threshold value was set to 0.95, 92% of successful predictions were retained while 76% of failures and 75% of out-of-distribution samples were correctly detected as misclassifications (Figure 6a,b).
set. We envision that mixed trap samples will in future be routinely photographed with high-resolution cameras, producing huge numbers of valuable images, but unlike most existing studies that use pinned or cardboard-glued specimens, these images present specimens in diverse angles, habitus, magnification and lighting (Schneider et al., 2022; Wührl et al., 2022). We show that these images provide sufficient information for specimens to be identified as members of particular families of Coleoptera. This finding is of special relevance in the context of large-scale biodiversity surveys where higher-rank taxonomic classification is a mandatory first step prior to more refined classification by expert taxonomists (Karlsson et al., 2020).

Within a local dataset, classification accuracy regularly reached 95% or more, which is similar to findings from other studies using more standardized photographs from museum collections (e.g., ~92% and 96% for Diptera and Coleoptera, respectively; Valan et al., 2019). We also confirm that classification performance depends on the number of images used for training (Figures 2-5), as widely seen in image recognition applications generally (Donahue et al., 2013) and in insect classification in particular (e.g., >90% recall rates were obtained for taxa with >50 images; Valan et al., 2019, 2021). We find that the prediction accuracy generally does not increase further after about 200 images in each of the three datasets used here. However, the degree of accuracy is greatly affected by the image quality and the complexity of the dataset: both the L_L (low image quality) and in particular the G_H (high complexity) datasets show comparatively lower accuracy of predictions if trained on themselves.

Utility of global databases for classifying local faunas
The critical question in this study is about the success of transfer learning in a situation where the source and target data are from different faunas. We here used the challenging case of the soil fauna of a Mediterranean island as the target domain for a model trained on a set of mixed trap samples from altogether 11 tropical forest sites across the globe (the G_H set), which presumably do not share any species or genera with the target. However, most local bulk samples, even from such disparate ecosystems, share a similar set of taxa at the family level, especially for a few species-rich families which are found in similar relative proportions in most samples. The complexity of the data may allow the CNN model trained on this broad set to capture general family traits of the global fauna and thus make it suitable for a greater range of classification tasks at local level. However, this broad scope comes at a certain price, as the source accuracy is fairly low (comparing G_H → G_H with L_H → L_H, Figure 2), but if we accept the slightly lower accuracy, our study confirms the possibility of classifying local samples against this global set. We conclude that it is not strictly necessary to create local reference databases for training, when targeting higher taxonomic levels. This finding opens the way for local biodiversity assessment studies around the globe using a universal training set. Global databases have the additional advantage of offering high numbers of images per taxon, which is more difficult to obtain locally, although it is critical for increasing the performance of the CNN-based classification (Figure 2; Donahue et al., 2013; Valan et al., 2019, 2021).
Despite the high prediction accuracy of the dataset as a whole, some taxa may show consistently lower classification performance. The primary factor affecting recall and precision is the number of images per family. The required quantity was available only for the largest families (which were also available for the greatest number of countries globally, as a measure of complexity of the training set). The fact that most taxa accumulate fewer single-specimen photographs as a result of rarity and low abundances may be artificially addressed by data augmentation techniques, which have been successfully applied to specimen identification tasks (Klasen et al., 2022). However, a few taxa, including the widely sampled Chrysomelidae, showed low F1-scores even with a large number of images (Figure 5b,d,f). Interestingly, Chrysomelidae also showed low classification performance in the study of Valan et al. (2019). The family is composed of morphologically rather distinct subfamilies, and an increased number of images may help to unveil the subclasses generating low performance models.

Lessons from combining DANN with differing databases
We show that photographs taken from similar imaging setups (G_H and L_H) are readily used for between-region image classifications, whereas images taken by a conventional stereoscope (L_L) exhibited a large accuracy reduction for the prediction of the local high-quality dataset. Considering the nearly identical taxonomic composition of the L_H and L_L datasets, the large accuracy reduction indicates a negative impact of the original image quality and the lack of standardization between the target-source pairs. The overall dissimilarity of L_H from G_H and L_L measured by dataset classification errors also suggests a negative effect of non-standardized imaging on prediction performance. These results are in accordance with the reduction in classification accuracy observed by other studies comparing different imaging procedures, for example, training with high-resolution museum specimens to predict field images (Knyshov et al., 2021). The application of alternative algorithms may overcome limitations resulting from the usage of highly different images taken under unstandardized imaging conditions.
In the current study, we could successfully ameliorate the accuracy reductions between L_H and L_L using DANN, a method designed for improving domain adaptation (Ganin et al., 2016). However, in other combinations of datasets such as G_H and L_H, the DANN did not improve the target prediction performance. This may be due to poor hyperparameter tuning or insufficient training of the model with a complex loss function (Kouw & Loog, 2021). Nevertheless, our study offers some evidence that DANN (or domain adaptation techniques in general) can be considered a method of choice when a standardized image acquisition is not available.

Improvements from using alternative metrics for model performance
Although CNN-based image classification for biodiversity assessment is becoming increasingly popular, its performance is not always assessed with a broad set of performance metrics. As observed in Chrysomelidae, the reduction of performance was only detectable in the multiclass precisions and F1-scores, but not in the recalls, which revealed a specific difficulty in the classification of this group. Given the inferential power of these performance metrics, we encourage their integration in biodiversity-related applications.
Another overlooked metric is the confidence of predictions. We could detect failed predictions and potential out-of-distribution samples by setting a threshold value on the probabilities. In accordance with Hendrycks and Gimpel (2017), such misclassified or out-of-distribution samples were predicted with consistently lower prediction probabilities. Because out-of-distribution samples are common in biodiversity surveys, detection of unknown target samples based on low prediction confidence is particularly useful. A potential difficulty of this approach is that calibration of the threshold requires extra data. Conventional deep neural networks can be uncalibrated, that is, prediction probabilities do not precisely reflect prediction accuracy (Guo et al., 2017). Such uncalibrated models can make an incorrect prediction with excessively high confidence. This overconfident failure is noticeable in our analysis (Figure 6b). Therefore, additional labelled samples are required to set a robust threshold for the identification of failure and out-of-distribution samples. Methods for explicit calibration of prediction probabilities or detection of out-of-distribution samples without additional data (e.g., Hsu et al., 2020; Mukhoti et al., 2020) are being actively developed in the machine learning field, and applying those methods is a potential future direction. As DANN could remove the dataset biases caused by the imaging instruments, the purpose-specific models will expand the possibility of machine learning applications to biodiversity surveys (see Høye et al., 2021).
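The thresholding of prediction probabilities discussed in this section can be sketched as follows, assuming each image comes with a vector of softmax probabilities over the trained families. The function name, the family list and the probability values are hypothetical; only the 0.95 threshold is taken from the text.

```python
from typing import List, Sequence, Tuple


def flag_low_confidence(
    probabilities: Sequence[Sequence[float]],
    labels: Sequence[str],
    threshold: float = 0.95,
) -> List[Tuple[str, float, bool]]:
    """Return (predicted label, top probability, accepted) for each image.

    An image is accepted when its highest class probability reaches the
    threshold; otherwise it is marked as a potential misclassification or
    out-of-distribution sample.
    """
    results = []
    for probs in probabilities:
        best = max(range(len(probs)), key=lambda k: probs[k])
        confidence = probs[best]
        results.append((labels[best], confidence, confidence >= threshold))
    return results


# Hypothetical softmax outputs for three images.
families = ["Chrysomelidae", "Ptiliidae", "Staphylinidae"]
softmax_outputs = [
    [0.98, 0.01, 0.01],  # confident prediction, retained
    [0.10, 0.60, 0.30],  # low confidence, flagged for checking
    [0.02, 0.97, 0.01],  # retained; would be an overconfident error if the
                         # specimen belonged to a family absent from training
]
flags = flag_low_confidence(softmax_outputs, families)
```

The third row illustrates the limitation noted above: a fixed threshold cannot catch an uncalibrated model that assigns a high probability to a wrong class, which is why labelled validation samples are needed to choose the cut-off.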

Building the global database for CNN-based classification
As new images become available for ever more species, the reference library for taxonomic identification is rapidly growing. Given the geographic and taxonomic distance of our reference set from tropical forests, the family category is the only meaningful level exhibiting overlap of source and target, but conceivably the methodology could be applied at lower levels, for example, genera, if more similar samples
ognizing the overall gestalt of a family. These family labels were straightforward for most groups, but identification of some beetle families may be compromised due to images that obscured appendages or other key traits, especially in small-bodied Leiodidae, Latridiidae or Cryptophagidae, which may have contributed to the prediction errors seen in these families (Table S4). Thus, corrections to the family labels in the database may be required, possibly by DNA barcoding and phylogenetic placement methods that confirm the family membership. Likewise, combining image acquisition for biodiversity assessment with metabarcoding could be instrumental for validating or improving genetic-based inferences (Yang et al., 2022) or estimating biomass and abundance (e.g., Høye et al., 2021; Schneider et al., 2022). Metabarcoding studies often lose morphological information of specimens, but imaging could be accommodated as a routine step before the DNA extraction of bulk insect samples.

CONCLUSIONS
To our knowledge, this is the first attempt at domain adaptation for taxonomic classification of an entirely unknown dataset, as a key element of using image-based identification in biodiversity studies at the global scale. We show that the approach is highly feasible, but needs careful consideration of the imaging procedure, the algorithmic approach and the choice of training sets. We envisage that an increasingly complete set of images, covering the diversity of major taxonomic groups, will become available as a global database in future, against which samples from any ecosystem and biogeographic region can be classified at a certain hierarchical level (e.g., families of beetles). In our approach, we lack the close alignment of the feature space in source and target that would guarantee high transferability, albeit at the expense of lower generalization capability when encountering unknown samples.

photos from 'Method C' were used for downstream analyses.
set and Acc_T and ΔAcc were measured. We ran the above procedures for three source-target pairs (training dataset → predicted dataset), G_H → L_H, G_H → L_L and L_L → L_H. These settings simulate two alternative scenarios: (i) a global image database is used to predict local samples (G_H → L_H and G_H → L_L) and (ii) conventional images, as those representing single-specimen photographs by local taxonomists, are used to predict local high-resolution images (L_L → L_H).

Between-datasets classification with domain adversarial training
In addition to the standard CNN setups described above, we employed the domain adversarial training of neural networks (DANN, Ganin et al., 2016) which incorporates a certain portion of the unknown targets in the model. The DANN model jointly predicts the class (family label) of the source images and the dataset (domain) of all input images (as in the previous section) by adding layers for the dataset classification to the classifier (Figure S1). The training procedure then optimizes the model parameters in the shared part of the network to not only minimize the loss of the label classifier (taxon prediction) but at the same time to maximize the loss of the domain classifier (dataset prediction). This adversarial training procedure optimizes shared intermediate features to be invariant between the two domains, and hence the model can generalize across them, which potentially improves the accuracy in target predictions. In this study, a softmax layer with binary cross entropy loss was added as a dataset classifier to the NN after the second FC layer. The regularization parameter, λ, which controls the relative importance of the two classifiers, was set to λ = 0.1, 0.5 and 1.0, and the best performing results (λ = 0.1) were reported. The performance of the DANN method was measured with procedures similar to those in the previous section. A mixed set of images of size N was randomly selected from target and source datasets, and training was done using taxon labels from the source images and dataset labels for all images. Next, 400 mixed test images were predicted, and their Acc_S, Acc_T and ΔAcc were recorded. We applied the DANN to the three pairs from the previous section. The total number of images N ranged between 300 and 800 for L_L → L_H, 400 and 1400 for G_H → L_H, and 300 and 1000 for G_H → L_L. The proportions of source images were 0.3, 0.67 and 0.83 for L_L → L_H, G_H → L_H and G_H → L_L, respectively, which yielded training images from the source similar in number to the other training setups. The effect of DANN on target accuracy was tested using linear regression with the model type and the number of images as explanatory variables. Models of neural networks were implemented in Python with Keras 2.5.0 (https://keras.io) and TensorFlow 2.5.0 (https://www.tensorflow.org) libraries, and all statistical analyses were conducted with R 4.1.0 (R Core Team, 2021).
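The adversarial objective described in this section can be written compactly. The following is a sketch of the general formulation in Ganin et al. (2016), given up to normalisation constants and not as the exact loss implemented in this study. The symbols are notation introduced here for illustration: G_f is the shared feature extractor with parameters θ_f, G_y the family classifier with parameters θ_y, G_d the dataset (domain) classifier with parameters θ_d, L_y and L_d the respective losses, S the source images with family labels y_i, T the target images, and d_i the dataset label of image x_i.

```latex
E(\theta_f, \theta_y, \theta_d)
  = \sum_{i \in S} L_y\bigl(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\bigr)
  \;-\; \lambda \sum_{i \in S \cup T} L_d\bigl(G_d(G_f(x_i; \theta_f); \theta_d),\, d_i\bigr)

(\hat{\theta}_f, \hat{\theta}_y)
  = \arg\min_{\theta_f,\, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d
  = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

Maximising E over θ_d trains the dataset classifier to separate the domains, while minimising E over θ_f pushes the shared features towards being uninformative about the domain. With λ = 0 the objective reduces to ordinary supervised training on the source labels, and larger λ gives the domain term more weight relative to the taxon term.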
Effect of increasing the number of training images on prediction accuracy. Training the convolutional neural network (CNN) on a subset of images and prediction of the class (family label) of images. (a) Local high quality (L_H) images for training and predicting the class of L_H images, (b) local low quality (L_L) images for training and predicting the class of either L_L or L_H images and (c) global high quality (G_H) images for training and predicting the class of either L_L, L_H or G_H images.
distribution samples) on the classification, we first used an L_H-trained model to predict the class of 16 L_H images belonging to eight families/subfamilies, Coccinellidae, Elateridae, Endomychidae, Hydrophilidae, Laemophloeidae, Phalacridae, Scarabaeidae and Scydmaeninae, which were not present in the training data, but were present in the target sample (Cyprus) in small numbers. For these images, the (incorrect) prediction probabilities were also lower on average than for the
FIGURE 3 The effect of increasing numbers of training images on the accuracy reduction in across-dataset predictions. Subsets of randomly selected images of one dataset are used for training and predicting the class (family label) of another set, as indicated by different colours. Lines in light blue refer to the comparison involving tests of locality, that is, when using global high quality (G_H) images for training and predicting the class of local high quality (L_H) images. Lines in green refer to comparisons involving tests of image quality, that is, when using local low quality (L_L) images for training and predicting the class of local high quality (L_H) images. Lines in dark blue refer to comparisons involving differences in both locality and image quality, that is, when using global high quality (G_H) images for training and predicting the class of local low quality (L_L) images. The x-axis representing the number of training images is on a logarithmic scale. The vertical dotted bars indicate 95% confidence interval of the average accuracy reduction. Higher accuracy reduction indicates a worse performance on prediction compared to the within-dataset prediction accuracy. The solid and dashed lines represent results of the convolutional neural network (CNN) and domain adversarial neural network (DANN), respectively. Note that only the L_L to L_H prediction accuracy improved with the use of DANN.
Effect of the number of images on prediction accuracy of the convolutional neural network (CNN, panels a and c) and the domain adversarial neural network (DANN, panels b and d) training for the local low quality (L_L) and local high quality (L_H) images. Top panels (a and b) represent between-dataset predictions (L_L → L_H) and bottom panels (c and d) indicate within-dataset predictions (L_L → L_L). Solid lines represent regression lines between the number of images and accuracy. For both between- and within-dataset predictions, models using DANN were trained with a mixed set of randomly selected images from the L_L and L_H datasets. For other dataset comparisons, see Figures S3 and S4.
FIGURE 5 Effect of the increasing number of images on recall rates (panels a and b), multiclass precision (panels c and d) and F1-scores (panels e and f). We used 400 randomly selected local high quality (L_H) images for training and predicting the class (family label) of L_H images (within-dataset classification) (left panels), and 800 randomly selected global high quality (G_H) images for training and predicting the class of L_H images (right panels). Note that x-axes representing the number of images are on a logarithmic scale. Circle sizes represent the number of countries where samples of a given family were collected from (as a proxy of intra-family morphological variation).
Downloaded from https://resjournals.onlinelibrary.wiley.com/doi/10.1111/syen.12583 by Readcube (Labtiva Inc.), Wiley Online Library on [06/06/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

DISCUSSION
This work adds to the growing number of studies demonstrating the power of CNNs in image-based taxonomic classification. Specifically, we tested the possibility of classifying specimens from bulk samples of beetles, whereby unknown local samples were classified using a model trained on similarly photographed bulk samples from a global

FIGURE 6 Prediction probabilities for the successful, failed and out-of-distribution predictions at a 0.95 threshold (horizontal line). (a) Intra-dataset predictions of L_H images using 400 randomly selected images for training. (b) Predictions of L_H images using 800 G_H images for training.
had been used. The current set of images is limited with regard to the number of families (classes) and number of images per family (intraclass variability), resulting in out-of-distribution errors and prediction errors, respectively. Both issues can be addressed with a wider selection of images, for example, those available from the SITE-100 project (Bian et al., 2022) taken with similar equipment. Based on our results, any future image collection should consider the need for standardization, including that imaging should use the same aspect, for example, dorsal view for Coleoptera (also see Hansen et al., 2020), uniform background across images (preferably a light colour without texture), clear separation of specimens in the photographs, and similar optical equipment and magnification. The exact parameters remain to be explored within and across studies, but standardization of imaging is critical to transferability when rolling out large-scale efforts for image-based classification in biodiversity studies. As part of this effort, image segmentation should be improved and automated (Schneider et al., 2022; Schwartz & Alfaro, 2021), to increase our capability for rapidly generating 'clean' and individual-based image databases extracted from bulk samples. A potential bottleneck is the need to expand the training set gradually, which generally requires recomputation of the model when new classes are added, although recent updated methods may simplify this process (Hadsell et al., 2020). A second issue affecting the accuracy of predictions is the 'category bias' from inconsistent categorisation and labelling of the training set itself. In the current study, images in the training set were classified from the images by rec-

Further studies are required to assess the trade-offs of broadening the source domain and to establish best practice for the specific research question at hand. Once a stable expanded image database has been created, it can be used for wider applications in biodiversity research and monitoring, potentially building a global model applicable to any sampling site and possibly used while still in the field.

ACKNOWLEDGMENTS
This work was supported by the iBioGen project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 810729. We are grateful to Richard Turney and Thomas J. Creedy (Natural History Museum, London) for advice and support during bulk-sample imaging, and Takashi Imai (Shiga University) for helpful advice on deep learning methods. We also thank three anonymous referees for their constructive and valuable comments on an earlier version of the manuscript. We would like to extend our gratitude to Andreas Dimitriou for help during sample imaging, and Konstantinos Ntatsopoulos for support in the taxonomy of Cyprus beetles. Víctor Noguerales was supported by a postdoctoral contract under the iBioGen project and a "Juan de la Cierva-Formación" postdoctoral fellowship (grant: FJC2018-035611-I) funded by MCIN/AEI/10.13039/501100011033. Tomochika Fujisawa was supported by JSPS KAKENHI (grant number: 20K06824).