Classifying the unknown: Insect identification with deep hierarchical Bayesian learning

Classifying insect species involves a tedious process of identifying distinctive morphological insect characters by taxonomic experts. Machine learning can harness the power of computers to potentially create an accurate and efficient method for performing this task at scale, given that its analytical processing can be more sensitive to subtle physical differences in insects, which experts may not perceive. However, existing machine learning methods are designed to only classify insect samples into described species, thus failing to identify samples from undescribed species. We propose a novel deep hierarchical Bayesian model for insect classification, given the taxonomic hierarchy inherent in insects. This model can classify samples of both described and undescribed species; described samples are assigned a species while undescribed samples are assigned a genus, which is a pivotal advancement over just identifying them as outliers. We demonstrated this proof of concept on a new database containing paired insect image and DNA barcode data from four insect orders, including 1040 species, which far exceeds the number of species used in existing work. A quarter of the species were excluded from the training set to simulate undescribed species. With the proposed classification framework using combined image and DNA data in the model, species classification accuracy for described species was 96.66% and genus classification accuracy for undescribed species was 81.39%. Including both data sources in the model resulted in significant improvement over including image data only (39.11% accuracy for described species and 35.88% genus accuracy for undescribed species), and modest improvement over including DNA data only (73.39% genus accuracy for undescribed species). 
Unlike current machine learning methods, the proposed deep hierarchical Bayesian learning approach can simultaneously classify samples of both described and undescribed species, a functionality that could become instrumental in biodiversity monitoring across the globe. This framework can be customized for any taxonomic classification problem for which image and DNA data can be obtained, thus making it relevant for use across all biological kingdoms.


| INTRODUCTION
Understanding biodiversity for insects requires both discovery and identification. Insects are one of the largest and most diverse animal groups on the planet, with an estimated 5.5 million species, yet only 20% are described (Stork, 2018) and many are disappearing faster than they can be identified (Costello et al., 2013), making it difficult to assess biodiversity. Once an insect is collected, a taxonomist will identify it to the lowest taxonomic level possible using identification keys that describe distinctive morphological characters (Buck et al., 2009). This presents a real-world challenge because undescribed species cannot be uniquely identified by existing characters; only through a comprehensive analysis of characters can one distinguish undescribed from described species.
DNA-based technologies, such as barcoding (sequencing certain conservative yet sufficiently variable regions of the genome) (Hebert et al., 2003), have helped confirm new species in cases where the DNA sequence variation exceeds the established intraspecific variation, or in cases where species are not distinguishable by their phenotypic characters (cryptic species) (Burns et al., 2008). While such powerful DNA-based methods are able to provide an estimate of biodiversity, they do not alone contribute to the knowledge base.
The DNA barcode database BOLD (Ratnasingham & Hebert, 2007) yields, for a search of the order Diptera, 2.4 million records (DNA sequences) and 126,000 BINs (Barcode Index Numbers, a representative measure of sequence diversity based on distance thresholds). However, only 25,000 species have been identified out of the 126,000 BINs represented in the BOLD Diptera record set. This indicates that while DNA is facilitating the discovery of new species, identification is occurring at a much slower rate. Species identification is hampered by a lack of taxonomists relative to the vast diversity of insects, and by the fact that the art of traditional taxonomy is in decline (Hopkins & Freckleton, 2002; Lee, 2000; Orr et al., 2020). Therefore, a novel way to efficiently scale both the discovery and identification of existing as well as new species is crucial for making the assessment of biodiversity feasible.
Machine learning methods can be leveraged to find intricate patterns and relationships in labelled data, where labels indicate group membership (e.g. genus and species), for classification and outlier-detection tasks. When combined with images, thus forming a computer vision task, machine learning can extract subtle insect morphological characters, which are then used to classify described species and identify undescribed species.
Classification models that use only images are enticing because images are significantly easier to obtain than DNA samples. While image-based classification models cannot yet compete with DNA-based methods, recent studies demonstrated that machine learning approaches for image-based taxonomic identification could eventually achieve human-expert-level accuracy (Milošević et al., 2020; Raitoharju & Meissner, 2019; Valan et al., 2019). Advances in machine learning have led to a surge of interest in entomology, a domain with many challenges that machine learning methods could help overcome.
Specifically, deep learning approaches (a subset of machine learning) have been utilized in pest detection (Ding & Taylor, 2016; Sun et al., 2018), digitization of museum collections (Hedrick et al., 2020; Meineke et al., 2020), measuring invertebrate biodiversity (Mayo & Watson, 2007; Wang et al., 2012), investigating plant-insect interactions (Tran et al., 2018) and many more applications (Høye et al., 2021). Deep learning methods have also been employed for the more challenging task of automatically detecting species in video and time-lapse images (Pegoraro et al., 2020). The main drawback of these methods is that they are focused on specific insect groups and consider only a very small number of subgroups.
Traditional machine learning models are inherently limited by the incomplete insect data repositories available for training; it is often impossible to create a training repository with a complete set of insect species for a given taxon. For example, some insect species are rare or not yet described, and thus well-characterized training images of insects from these species cannot be obtained. Moreover, insects pose an additional challenge due to their morphologically distinct life stages: in some cases, an insect's immature stages look significantly different from its adult morphology.
At the heart of the issue is that insect identification requires a method that can both classify samples from described species and identify samples from undescribed species, that is, species not included in the training data. Many existing methods assume that all possible species are represented in the training data set (Geng et al., 2020); such methods, therefore, cannot identify samples from rare or unidentified species. Additionally, current classification methods have been used with relatively small data sets and would not scale well to problems with a large number of classes (Geng et al., 2020). Furthermore, such approaches were restricted to detecting whether an insect sample is an outlier and could not differentiate between different types of outliers (Bendale & Boult, 2016; Perera & Patel, 2019; Scheirer & Boult, 2016). This limits their usefulness in entomology, as the Insecta class contains a large number of similar species and the hierarchy of its taxonomy necessitates outlier differentiation.

KEYWORDS
biodiversity, classification, computer vision, deep learning, machine learning, undescribed species
In order to accomplish both tasks, we adapted the generalized zero-shot learning (ZSL) setting (Xian et al., 2018), with the genus and species taxonomic levels as auxiliary information, to test whether this method could be used to facilitate the identification of new insect species. ZSL indicates that the model will be tested not only on samples from species seen during training (referred to as described species throughout) but also on samples from species not seen during training (referred to as undescribed species throughout); we analysed both the model's accuracy in assigning samples from described species to their correct species, and its accuracy in assigning samples from undescribed species to their most likely genus of origin. In brief, we sought to answer whether recent advances in deep learning and computer vision can extract subtle yet potentially discernible morphological characters that, when combined with DNA sequence data, facilitate more accurate identification of insects of described species and aid the discovery of insects of undescribed origin, further grouped by genera (see Figure 1).

| MATERIALS AND METHODS
In this section, we first describe the DNA and image data used in this study. Next, we describe the use of deep learning models for extracting information-rich feature vectors from insect images and DNA barcodes, followed by a description of how the feature data were split into training, validation and test sets, and of how undescribed species were simulated. We then lay out the details of our novel hierarchical Bayesian classifier, including a description of the transductive and inductive machine learning approaches, which join the image and DNA feature vectors to boost model accuracy over using either image-only or DNA-only feature vectors as input to the classifier. Finally, we detail a bioinformatics baseline model that was included for comparison.

| Data collection
Our study used paired insect image and DNA sequence data obtained from the Barcode of Life Data System (BOLD) (Ratnasingham & Hebert, 2007) from four major Insecta orders: Diptera, Coleoptera, Lepidoptera and Hymenoptera (see Tables 1 and 5). The raw images were in full colour (3 colour channels: red, green and blue) and generally had a width of 640 pixels and a height of 300-1000 pixels. Only images that had matching DNA barcodes were included, and each image was manually inspected so that low-quality images, duplicate images, images containing incomplete insect bodies or immature pigmentation, and missing images (e.g. just a label was present) were removed. Only species with a minimum of 10 images within a single barcode index number (BIN) were included.
BOLD differs from other genetic databases (e.g. Agarwala et al., 2018) in that it accepts data for unidentified or unknown organisms. BOLD's DNA-based grouping algorithms will first assign a BIN to the unidentified sample; the BINs are closely (but not perfectly) aligned with species groupings. The BOLD database then translates the sample's DNA sequence to its protein sequence and searches its database for a species or genus match. BOLD will assign the sample to a species if its sequence contains less than 1% divergence from a reference specimen. While the BOLD database is essential for the discovery of new species, it has a consequential limitation: it does not facilitate the identification of such new species beyond the measures described above.

FIGURE 1 Deep hierarchical Bayesian classification with described and undescribed species. (a) Image feature vectors (2048-dimensional) were obtained from a pretrained ResNet-101 (He et al., 2016) model. (b) DNA feature vectors (500-dimensional) were obtained from a custom CNN model. (c) The optimal way to merge the image and DNA features was to first map image features to DNA feature space as learned by transductive ridge regression. (d) The hierarchical Bayesian model was trained on the merged training set X̂_train ∪ X_train and used for classification. A test sample was then either assigned to one of the described species or identified as a new species belonging to one of the described genera (indicated with the genus name followed by sp.).

| Feature extraction
Prior to implementing the hierarchical Bayesian method, we employed deep learning models to extract meaningful features from the raw insect image and DNA barcode data. Humans can look at an insect and identify many distinguishing morphological features such as "lime-green scales" or "setose antenna"; computers are given an insect image file, which is a width × height × 3 data matrix representing pixels filled with RGB values between 0 and 255, and are asked to do the same. Humans come up with a list of text describing the insect features, whereas computers produce feature vectors, a set of numeric values that have been learned to best distinguish one class from another when used in classification models. In addition to being discriminative, representations learned from deep learning are often significantly smaller in size than the original raw data representation (especially for images), which leads to better scalability when used in downstream machine learning models.
In particular, we used a pretrained ResNet-101 (He et al., 2016) model. See Supporting Information for further detail.

| Handcrafted versus deep features
Handcrafted features are predominantly data-agnostic and manually designed by experts to overcome specific challenges, like occlusion and variations in scale and illumination (Nanni et al., 2017), or to characterize a priori known characteristics (shape, colour etc.), whereas deep features are more generic and data-driven, given that they are learned directly from input images (Bora et al., 2016; LeCun et al., 2015). Human experts evaluate features qualitatively, whereas computers require quantitative features. In the case of a large-scale insect classification task, handcrafting features to capture subtle characteristics of insect species represented by dozens of dichotomous keys may not be very practical.
On the other hand, a deep network is by default trained to learn quantitative features that maximize the classification accuracy of the network for a specific task. Deep features can be extracted at multiple levels of abstraction: the initial layers of the neural network (NN) resemble Gabor filters and tend to learn low-level image features such as edges and blobs (Figure 3a) that are transferable to many different categories of objects and tasks (Yosinski et al., 2014), while deeper layers learn more complex relationships that can represent high-level semantics (Figure 3b). Capturing semantics shared across instances of the same objects in the feature space mitigates intraclass variability in the image space (Chan et al., 2015), which is vitally important for fine-grained image classification tasks involving thousands of classes and a limited number of samples per class.

TABLE 1 A breakdown of the data set by order (# genera, # species and # samples per order).
From a taxonomist's perspective, the ultimate goal would be an algorithm that both identifies samples from undescribed species and lists their distinguishing morphological character(s), to aid the process of new species discovery and the creation of an identification key. However, efforts to match semantic descriptors (e.g. dichotomous keys) with quantitative features (e.g. deep features) often introduce significant feature redundancy and noise, which may negatively impact classifier performance. Although the lack of interpretability of learned features remains a major hurdle confronting deep learning models, recent advances in self-supervised representation learning (Caron et al., 2021) and attention and saliency maps (Simonyan et al., 2013) are expected to gradually close this interpretability gap between deep-learned and hand-crafted features.
Such methods can learn to identify distinguishing areas in images, some without the help of image labels, and it is plausible that, in the future, deep learning algorithms will be able to identify such fine-grained discriminative areas in images.

FIGURE 2 Phylogenetic tree of the four orders from the dataset. Two species were randomly chosen from each order, with their complete taxonomic hierarchy illustrated.

| Merging image and DNA data
Different data modalities can be combined within a single deep network (Nanni et al., 2017; Yang et al., 2019). Under the transductive setting, the unlabelled image and DNA data of test samples from undescribed species are available when learning the mapping between the image and DNA feature spaces, which gives transductive methods the theoretical and, as detailed later, empirical advantage over inductive methods in the task of identifying undescribed species.
Prior to using either the merged or individual data sets in the model, we reduced their dimension to 500 using principal component analysis (PCA) to ensure the input data dimensions were low enough to be feasible for use in the hierarchical Bayesian model. The transductive approach is outlined in Figure 1c and further detail regarding these processes is in Supporting Information.
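A minimal sketch of this merging step, assuming scikit-learn's Ridge and PCA; the toy feature dimensions below are illustrative stand-ins for the paper's 2048-dimensional image and 500-dimensional DNA features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-ins for the deep features: 300 paired samples with 64-d "image"
# and 20-d "DNA" features (the paper uses 2048-d and 500-d, respectively).
img_feats = rng.normal(size=(300, 64))
dna_feats = rng.normal(size=(300, 20))

# Ridge regression maps image features into DNA feature space; under the
# transductive setting this mapping can also exploit unlabelled test pairs.
mapper = Ridge(alpha=1.0).fit(img_feats, dna_feats)
img_in_dna_space = mapper.predict(img_feats)          # (300, 20)

# Stack the mapped image features with the DNA features and reduce the
# dimension with PCA (500 components in the paper; 10 here for the toy data).
merged = np.vstack([img_in_dna_space, dna_feats])     # (600, 20)
reduced = PCA(n_components=10).fit_transform(merged)
print(reduced.shape)  # (600, 10)
```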

| Training, validation and test data
Machine learning classification models are generally built by an iterative process of tuning the model with training data and measuring the trained model's accuracy on validation data, until the best final model has been found. This final model is then tested by measuring its accuracy on a test data set, which the model has not seen previously, to gauge the model's generalizability to future data sets. To demonstrate that our model was a viable method for identifying undescribed species, it was validated and tested with data sets containing samples from species the model had not seen before. Since the BOLD data we collected, by design, contained no true undescribed species, test undescribed species had to be simulated as described in Figure 4. For validation, the training species were split in a similar manner into described and undescribed species. Some insect species had multiple images, each capturing a different view of the insect (e.g. ventral and dorsal views). All insect species with multiple images were restricted to the training set, leaving 27 of the described species with no representatives during testing. In the test data set, there were a total of 4965 samples from 770 described species and 8463 samples from 243 undescribed species (see Table 2).
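The undescribed-species simulation of Figure 4 can be sketched as follows; this is a toy illustration, and the helper split_described_undescribed and its inputs are ours.

```python
import random
from collections import defaultdict

def split_described_undescribed(species_to_genus, seed=0):
    """For genera with >= 3 species, randomly hold out one-third of the
    species as simulated 'undescribed'; everything else stays 'described'."""
    rng = random.Random(seed)
    by_genus = defaultdict(list)
    for species, genus in species_to_genus.items():
        by_genus[genus].append(species)
    described, undescribed = [], []
    for genus, species_list in by_genus.items():
        if len(species_list) >= 3:
            rng.shuffle(species_list)
            k = len(species_list) // 3
            undescribed += species_list[:k]
            described += species_list[k:]
        else:
            described += species_list
    return described, undescribed

toy = {f"sp{i}": "Aedes" for i in range(6)}  # 6 species in one genus
toy["sp6"] = "Culex"                         # genus with a single species
desc, undesc = split_described_undescribed(toy)
print(len(desc), len(undesc))  # 5 2
```

Only samples of the held-out species appear in the test set, so the model never sees them during training.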

| Hierarchical Bayesian model
Insect species have a predefined taxonomic hierarchy (order → family → subfamily → genus → species). Despite the shared morphological characters at each level of the hierarchy, which carry valuable information for classification tasks, this taxonomy is often overlooked in machine learning methods. A hierarchical Bayesian model was recently introduced in computer vision for zero-shot classification of object classes using visual attributes (Badirli et al., 2020). The underlying assumption of our model (see Figure S2) is that species sharing similar haplotypes would group in the same phenotypic space characterized by feature vectors. The generative model design is given below (see Figure S3 for a graphical depiction), where j, i, k represent indices for genus priors, described species and data instances, respectively.
We assume that the data instance x_jik comes from a Gaussian distribution defined by mean μ_ji and covariance matrix Σ_j; note that species from the same genus share the same covariance matrix Σ_j to preserve conjugacy. The data instances are generated independently, conditioned on the hyperparameters of both global and genus priors. The hyperparameter κ_1 is a scaling constant that adjusts the dispersion of the described species means (μ_ji) around the centre of their corresponding genus prior. A larger κ_1 leads to smaller variations in species means from the mean of their corresponding genus prior, suggesting a fine-grained (harder to distinguish) relationship among species sharing the same genus. Conversely, a smaller κ_1 dictates coarse-grained (easier to distinguish) relationships among species sharing the same genus.

FIGURE 4 For genera with ≥3 species, one-third of the species were randomly assigned as undescribed while the rest were assigned as described; only the test set contained undescribed species, to test the model's ability to identify them.
Each genus prior is Gaussian and characterized by the parameters μ_j and Σ_j. The mean vectors of the genus priors are in turn distributed according to a Gaussian prior, and κ_0 is a scaling constant that adjusts the dispersion of these mean vectors around the mean vector μ_0 of the global prior. A smaller value for κ_0 suggests that genus centres are expected to be farther apart from each other, whereas a larger value suggests they are expected to be closer together. On the other hand, Σ_0 and m dictate the expected shape of the described species distributions: under the inverse Wishart distribution, the expected covariance matrix is Σ_0/(m − D − 1), where D is the dimension of the data. The minimum feasible value of m is D + 2; the larger m is, the less individual covariance matrices will deviate from the expected shape.
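The generative process just described can be sketched in plain NumPy/SciPy; the hyperparameter values below are illustrative choices, not the paper's tuned values.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
D = 3                        # data dimension (500 after PCA in the paper)
m = D + 2                    # minimum feasible inverse-Wishart degrees of freedom
kappa0, kappa1 = 0.1, 10.0   # small kappa0: distant genus centres; large kappa1: tight species
mu0, Sigma0 = np.zeros(D), np.eye(D)

def sample_genus(n_species, n_per_species):
    # One covariance matrix Sigma_j shared by all species of the genus.
    Sigma_j = invwishart.rvs(df=m, scale=Sigma0, random_state=rng)
    mu_j = rng.multivariate_normal(mu0, Sigma_j / kappa0)        # genus centre
    data = []
    for _ in range(n_species):
        mu_ji = rng.multivariate_normal(mu_j, Sigma_j / kappa1)  # species mean
        data.append(rng.multivariate_normal(mu_ji, Sigma_j, size=n_per_species))
    return np.vstack(data)

X = sample_genus(n_species=4, n_per_species=10)
print(X.shape)  # (40, 3)
```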

| Hyperparameters and statistics
The described species and genus prior posterior predictive distributions (PPDs), derived in Supporting Information, are a function of both hyperparameters and sufficient statistics. The hyperparameter μ_0 (mean of the global prior) is the mean of the described species means, while the hyperparameter Σ_0 is the mean of the described species covariance matrices scaled by s (also referred to as the pooled covariance); each was calculated with training data. The species-specific sufficient statistics, also calculated with training data, are: x̄_jc (mean vector), n_jc (number of samples) and Σ_jc (covariance matrix), where c indexes the current described species. The algorithm, with pseudo-code, for deriving these values is in Supporting Information.
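Computing these sufficient statistics and global-prior quantities from training features can be sketched as follows; the function name is ours and the scaling constant s is omitted (i.e. set to 1).

```python
import numpy as np

def sufficient_stats(features, species_labels):
    """Per-species sufficient statistics plus the global-prior quantities:
    mu_0 (mean of species means) and the pooled covariance matrix."""
    stats = {}
    for sp in np.unique(species_labels):
        Xc = features[species_labels == sp]
        # (mean vector x_bar_jc, sample count n_jc, covariance Sigma_jc)
        stats[sp] = (Xc.mean(axis=0), len(Xc), np.cov(Xc, rowvar=False))
    mu_0 = np.stack([mean for mean, _, _ in stats.values()]).mean(axis=0)
    pooled = np.mean([cov for _, _, cov in stats.values()], axis=0)
    return stats, mu_0, pooled

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))                 # 60 samples, 4-d features
y = np.repeat(["sp_a", "sp_b", "sp_c"], 20)  # three described species
stats, mu_0, pooled = sufficient_stats(X, y)
print(mu_0.shape, pooled.shape)  # (4,) (4, 4)
```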
It is worthwhile to note here that the genus prior PPD formulations relied upon the quantity of described species data available to the model during training, which did not encompass all possible species, both as a result of simulating undescribed species by removing some species from the training set and as a result of having an incomplete data set. Therefore, the genus prior, while described as fully as possible given the training data, was not a complete representation of all its member species.

| Classification
Bayesian models classify a sample by assigning it the label of the class whose distribution maximizes its likelihood. The set of class labels included both the described species and the genera, so classification involved the simultaneous comparison of likelihoods across all classes (see Figure S3a and Figure 5). If the sample was assigned to a genus by the model, we predicted that the sample originated from an undescribed species; the genus labels were used to assess how accurate the model was in assigning undescribed species to their correct genus. If the sample was assigned a species by the model, we predicted that the sample originated from that described species.

TABLE 2 Train-test split.
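The decision rule can be illustrated with a toy sketch; here plain Gaussians stand in for the actual posterior predictive distributions (derived in Supporting Information), and the class names and parameters are invented for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy class-conditional distributions: two described species from genus "g1"
# plus a broader genus-level distribution as an additional candidate class.
classes = {
    ("species", "g1_sp1"): multivariate_normal([0.0, 0.0], np.eye(2)),
    ("species", "g1_sp2"): multivariate_normal([4.0, 4.0], np.eye(2)),
    ("genus", "g1"): multivariate_normal([2.0, 2.0], 6 * np.eye(2)),
}

def assign_label(x):
    """Pick the class whose distribution maximizes the likelihood of x;
    a genus-level winner flags x as a putative undescribed species."""
    return max(classes, key=lambda c: classes[c].logpdf(x))

print(assign_label([0.2, -0.1]))  # ('species', 'g1_sp1')
print(assign_label([2.0, 2.0]))   # ('genus', 'g1') -- undescribed species call
```

A point near a described species mean is assigned to that species, while a point between species means is better explained by the broader genus-level distribution.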

| Optimization
The goal of our hierarchical Bayesian model (HBM) was to classify test samples from described species to their respective species, and test samples from undescribed species to their respective genus. Classification performance was assessed by the average described species accuracy (referred to throughout simply as described species accuracy) and the average undescribed species genus accuracy (referred to throughout simply as undescribed species accuracy), as well as by their harmonic mean, as shown in the following equations:

acc = (1/J) Σ_j (y_j / n_j),    H = 2 · acc_described · acc_undescribed / (acc_described + acc_undescribed)

where, for class j, y_j is the number of correctly classified samples, n_j is the total number of samples and J is the number of classes.
The hyperparameters κ_0, κ_1, m and s were tuned through cross-validation on the validation set to produce the maximum harmonic mean of the described and undescribed species validation accuracies, which ensures the model would be capable of doing both tasks well when given test data (see Supporting Information for evaluated hyperparameter values). The harmonic mean, a standard measure (Xian et al., 2018) for evaluating model performance, combines the described and undescribed species accuracies into one overall accuracy, and is more representative than the usual average because these two accuracy measures do not share a common denominator; since there are more described species (770) than genera (134), the described species accuracies would have dominated the overall measure had the usual average been used, skewing model performance.
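The two accuracy measures and their harmonic mean can be computed as follows (a straightforward sketch; the function names are ours). Plugging in the reported 96.66% and 81.39% gives a harmonic mean of roughly 88%.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Average over classes j of y_j / n_j (correct / total within class j)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean([(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)])

def harmonic_mean(acc_described, acc_undescribed):
    return 2 * acc_described * acc_undescribed / (acc_described + acc_undescribed)

acc_d = per_class_accuracy(["a", "a", "b", "b"], ["a", "a", "b", "a"])
print(acc_d)                                    # 0.75 = (2/2 + 1/2) / 2
print(round(harmonic_mean(0.9666, 0.8139), 4))  # 0.8837
```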

| Baseline bioinformatics approach
To show our HBM's performance against a more traditional distance-based method, we included a bioinformatics baseline. The Jukes-Cantor distance (Jukes & Cantor, 1969) between the test sample's sequence and the consensus sequence of each described species was computed. Test samples were assigned to the described species with the minimum distance, but only if that minimum distance was smaller than a designated threshold. If the minimum distance was larger than this threshold, the test sample was predicted to be from an undescribed species and assigned to the genus of the species with the minimum distance. The distance threshold was chosen by cross-validation.
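A minimal sketch of this baseline, using the standard Jukes-Cantor formula d = −(3/4)·ln(1 − (4/3)·p), where p is the proportion of differing aligned sites; the sequences, names and threshold below are toy values (a 3% threshold echoes BOLD's genus cut-off), not the cross-validated choice.

```python
import math

def jukes_cantor(seq1, seq2):
    """JC69 distance d = -(3/4) * ln(1 - (4/3) * p), with p the proportion
    of differing sites among aligned A/C/G/T positions."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

def baseline_assign(seq, consensus, genus_of, threshold=0.03):
    """Nearest described species if within the distance threshold; otherwise
    an undescribed species of the nearest species' genus."""
    best = min(consensus, key=lambda sp: jukes_cantor(seq, consensus[sp]))
    d = jukes_cantor(seq, consensus[best])
    return ("species", best) if d <= threshold else ("genus", genus_of[best])

consensus = {"Aedes vexans": "ACGTACGTACGTACGTACGT",
             "Culex pipiens": "ACGTTCGTACGAACGTACCT"}
genus_of = {"Aedes vexans": "Aedes", "Culex pipiens": "Culex"}
print(baseline_assign("ACGTACGTACGTACGTACGT", consensus, genus_of))  # ('species', 'Aedes vexans')
print(baseline_assign("ACGTACGTACGTACGTACGA", consensus, genus_of))  # ('genus', 'Aedes')
```

With the toy 20-site sequences, a single mismatch already yields d ≈ 0.052, above the 3% threshold, so the second query is flagged as a putative undescribed species of the nearest genus.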

| Experimental design
Several models were developed and tested for their classification accuracy.

| RESULTS
Classification accuracies for all model variations for both undescribed and described accuracies, and their harmonic means, are reported in Table 3.
It is clear from the results that classifiers using image data alone achieved minimal accuracy (39.11% described accuracy). As expected, DNA data proved informative for species classification. The bioinformatics baseline method had the highest described classification accuracy at nearly 99%, while achieving an undescribed accuracy of 72%, a 10% reduction compared to the transductive model (HBM-DIT). In comparison to the bioinformatics method, HBM-DNA yielded a better undescribed accuracy but slightly lower described accuracy.
Combining image and DNA data in all five HBM scenarios increased accuracy over image-only and DNA-only models, particularly for undescribed species test samples. Transductive (HBM-DIT) and heuristic inductive likelihood (HBM-DIL) methods performed best, with >88% harmonic means and 81% undescribed accuracies.

While both inductive methods (HBM-DIC and HBM-DIL) performed reasonably well, they present a real-world challenge in that they require test samples to have both image and DNA data present, which may be difficult to obtain.
Under the transductive setting, the quantity of undescribed species available in the test data for learning the mapping between image and DNA feature spaces impacted accuracy. The last two rows of Table 3 display model performance utilizing decreasing portions of undescribed species data in the test data set.
Note that the model was not tuned for these configurations and used the same optimal model parameters as the other HBM-DIT models. These findings show how the transductive method can outperform non-combined-data models even when only 25% of the available undescribed species data was included in the test set, and model performance increased as more undescribed testing data were included.
The transductive model, which included all of the available undescribed testing data, yielded 96.66% overall described classification accuracy with 4827/4965 correct classifications (Table 5).
Unsurprisingly, accuracy declined for undescribed samples, though three of the four orders still achieved >81% undescribed sample accuracy, which is remarkably good. HBM-DIT misclassified the genera of many undescribed Diptera samples (Table 5). When examining the different family groups and their classification accuracies (Table 5), the Culicidae (the mosquitoes), Syrphidae (the hover flies) and Tipulidae (the crane flies) had the most misclassifications.
Within Culicidae, 45/58 of the misclassifications were Aedes vexans records classified to the genus Culex. As a semi-independent test, the DNA sequence of a random Aedes vexans record from our dataset was passed through BLASTn (Altschul et al., 2021).

| DISCUSSION
While the most successful method in this study employed both image and DNA barcode data, the use of image-only methods or DNA-only methods each had varying levels of success. DNA lends strong support for new species identification if the sequence variation falls outside of the normal bounds of intraspecific variation.
BOLD uses a cut-off of <1% sequence divergence to identify species to a reference specimen (which is itself identified as having <2% sequence divergence for three or more records) and <3% to assign to a genus (Ratnasingham & Hebert, 2007). In some cases, DNA barcodes have been integral in differentiating between morphologically indistinguishable species, confirmed through additional nuclear DNA sequencing (Janzen et al., 2017). DNA has proven to be a powerful method, yet it does not allow for the development of morphologically based identifications for any future work, and depends on DNA-based approaches, which can be expensive.
Image-only analysis has shown promise in real-time insect species monitoring, but has commonly suffered when image-background extraction is necessary. Furthermore, in applications of these image-only methods, only described species were monitored; two examples are pest management (Van Horn et al., 2018; Wu et al., 2019) and biodiversity surveys (Schneider et al., 2022). When deep learning methods were used with images to identify described classes of insects, accuracies reached 90% or greater (Milošević et al., 2020; Raitoharju & Meissner, 2019; Valan et al., 2019; Visalli et al., 2021), and in some cases approached or surpassed taxonomist accuracies (He et al., 2015). However, these methods were tested either on coarse-grained datasets (where it is easier to distinguish between classes) or with a limited number of species (generally <15).
Furthermore, the lingering issues of identifying rare or undescribed species, and the inherent data imbalance this entails, continue to hinder the development of more efficient means of identifying new species. This is especially true within the class Insecta, where the majority of species remain undescribed; identifying them would represent a significant advancement for entomology. More broadly, identifying undescribed species helps us to better understand ecosystems and their processes, in which insects likely play a significant role (Yang & Gratton, 2014).

| ResNet model complexity
We experimented with extracting image features from ResNet architectures of increasing complexity (ResNet-50, 101 and 152) to evaluate whether a more complex model would be optimal (see Table 4). We found that ResNet-34 produced a lower harmonic mean of accuracies. A related comparison is shown in Figure 6: HBM-DIT was able to predict seven of the 11 genera that HBM-DNA missed, with an average undescribed-species accuracy of 87%, a marked improvement.

| Striking morphological similarity between species belonging to the same genus
Physical variation in some insects is nearly invisible to the human eye, especially if one is not a specialized expert and the species are closely related. Nevertheless, machine learning models can extract these subtle differences from images and, when combined with DNA data, can classify these difficult cases correctly. To illustrate, we present a simple challenge in Figure 7.

| Effect of background noise on model performance
High-quality images are an integral part of any successful machine learning approach and heavily impact model performance, as observed in studies of other insects (Divilov et al., 2017; Junior et al., 2013) and mosquitoes (Zacarés et al., 2018).

FIGURE 6 HBM-DIT improvement over HBM-DNA undescribed species accuracy. HBM-DNA did not accurately predict the genera for any undescribed species samples from the 11 genera listed in the table, while HBM-DIT, which combines both DNA and image features, was able to accurately predict seven out of the 11 with an average accuracy of 87%.

FIGURE 7 Striking morphological similarity between four species from genus Agabus. The figure shows that deep learning is able to extract very subtle discriminative features from images, and when combined with DNA features, can improve performance over DNA-only models for the task of identifying samples from undescribed species. To underline the difficulty in classifying images from such similar species, imagine you are given just these four images (not labelled). You are told three images are from described species, whose names you are given, and one is from an undescribed/unknown species. The figure above shows which species was randomly assigned as undescribed in our model. You look up known images of each described species to help you. How successful are you in assigning the described species images to the correct species and identifying which sample is from an undescribed species?

| CONCLUSIONS
This study developed a novel framework to facilitate the discovery and identification of insect species, a group with much unknown biodiversity, at scale. The proposed model is the first in the literature to tackle this problem by leveraging image and DNA data together, the first to be tested on more than a thousand species, and the first able to also classify undescribed species to genus. Our best performing hierarchical Bayesian classification model, trained with image and DNA feature data obtained from their respective deep learning models and merged using the transductive linear mapping approach, classified described species with greater than 96% accuracy and identified the correct genus of undescribed species with 81% accuracy.

Considering that the transductive approach was built on a regularized linear mapping, there is great potential for achieving better performance by utilizing nonlinear mappings and/or more sophisticated approaches such as generative adversarial networks (GANs) (Goodfellow et al., 2014) or variational autoencoders (VAEs) (Kingma & Welling, 2014). Integrating a GAN/VAE would create an end-to-end representation learning method that could potentially mitigate the shortcomings of supervised pretrained models such as ResNet-101. The HBM could also be extended to consider genera/species as subclasses and higher taxonomic levels, such as family, as superclasses; such a classifier would readily deal with missing or unobserved genera. Given the large inter-species variation in DNA barcodes, a deep learning CNN model with a hierarchical loss function that considers information not just from species but also genus, family and order could produce more robust DNA features, given a significantly larger dataset covering more genera and families. We are currently investigating what impact image resolution has on insect image classification accuracy, with the hypothesis that higher-resolution images could yield stronger model performance.
We are also investigating using im-
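The regularized linear mapping underlying the transductive approach can be sketched as ridge regression between the two feature spaces. The following toy example is an assumption-laden illustration (variable names, dimensions, and the closed-form solver are ours, not the paper's): it learns W minimizing ||XW - Y||² + λ||W||², i.e. W = (XᵀX + λI)⁻¹XᵀY.

```python
import numpy as np

# Sketch of a regularized linear mapping between feature spaces:
# map image features X (n x d_img) to DNA features Y (n x d_dna)
# via the closed-form ridge solution W = (X^T X + lam*I)^{-1} X^T Y.
def ridge_mapping(X: np.ndarray, Y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # toy "image" features
W_true = rng.normal(size=(16, 8))
Y = X @ W_true                          # toy "DNA" features, exactly linear here
W = ridge_mapping(X, Y, lam=1e-6)
print(np.allclose(W, W_true, atol=1e-3))  # True: mapping recovered on this toy data
```

Replacing this closed-form linear map with a nonlinear mapping or a GAN/VAE-based representation, as suggested above, would target exactly the cases where the true cross-modal relationship is not linear.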

ACKNOWLEDGEMENTS
We thank our anonymous reviewers for their valuable reviews, which significantly improved the final version of this manuscript.
Murat Dundar and Sarkhan Badirli were sponsored by the National Science Foundation (NSF) grant IIS-1252648 (CAREER). George Mohler was sponsored by NSF grant ATD-2124313. Frannie Richert was supported in part by funding from the IUPUI Institute of Artificial Intelligence. The content is solely the responsibility of the authors and does not necessarily represent the official views of NSF.

CONFLICT OF INTEREST STATEMENT
None to report.