DiversityScanner: Robotic handling of small invertebrates with machine learning methods

Invertebrate biodiversity remains poorly understood although invertebrates comprise much of the terrestrial animal biomass and most species, and supply many ecosystem services. The main obstacle is specimen-rich samples obtained with quantitative sampling techniques (e.g., Malaise trapping). Traditional sorting requires manual handling, while molecular techniques based on metabarcoding lose the association between individual specimens and sequences and thus struggle with obtaining precise abundance information. Here we present a sorting robot that prepares specimens from bulk samples for barcoding. It detects, images and measures individual specimens from a sample and then moves them into the wells of a 96-well microplate. We show that the images can be used to train convolutional neural networks (CNNs) that are capable of assigning the specimens to 14 insect taxa (usually families) that are particularly common in Malaise trap samples. The average assignment precision for all taxa is 91.4% (75%–100%). This ability of the robot to identify common taxa then allows for taxon-specific subsampling, because the robot can be instructed to only pick a prespecified number of specimens for abundant taxa. To obtain biomass information, the images are also used to measure specimen length and estimate body volume. We outline how the DiversityScanner can be a key component for tackling and monitoring invertebrate diversity by combining molecular and morphological tools: the images generated by the robot become training images for machine learning once they are labelled with taxonomic information from DNA barcodes. We suggest that a combination of automation, machine learning and DNA barcoding has the potential to tackle invertebrate diversity at an unprecedented scale.


| INTRODUCTION
Biodiversity science is currently at an inflection point. For decades, biodiversity loss had been mostly an academic concern, although many biologists had already predicted that the decline would eventually threaten whole ecosystems (May, 2011). Unfortunately, we are now at this stage, which explains why the World Economic Forum considers biodiversity loss as one of the top three global risks based on likelihood and impact for the next 10 years (World Economic Forum's Global Risk Initiative 2020). This new urgency is also leading to a reassessment of research priorities. Most biologists had traditionally focused on charismatic taxa (birds, mammals, butterflies, etc.) with a preference for endangered or even extinct species (Ceballos et al., 2017). However, with regard to quantitative arguments, this focus has always been poorly motivated. If one were to adopt a quantitative point of view of terrestrial animal diversity, invertebrates would receive most attention. They contribute more than 45 times the biomass of wild vertebrates (table S23 in Bar-On et al., 2018), contain >90% of the estimated species diversity (e.g., Groombridge, 1992), and comprise much of the functional and evolutionary diversity. In 2011 Robert May (2011) stated that "[w]e are astonishingly ignorant about how many species are alive on earth today, and even more ignorant about how many we can lose (and) yet still maintain ecosystem services that humanity ultimately depends upon." This situation can only be changed if new and efficient tools for assessing and monitoring invertebrate biodiversity are developed. Such tools need to be particularly suitable for clades that have come to be referred to as "dark taxa" and which Hartop et al. (2021) defined as those "for which the undescribed fauna is estimated to exceed the described fauna by at least one order of magnitude and the total diversity exceeds 1000 species." 
Biomonitoring of these taxa with morphological tools or DNA barcoding is very time-consuming because it requires processing thousands of typically small specimens. This goes some way to explain why metabarcoding of bulk invertebrate samples is increasingly applied. It allows for fast processing of bulk samples and yields information on species composition. However, using this method comes at the cost of not being able to obtain precise abundance data (Creedy et al., 2019), although monitoring population declines is important (Ceballos et al., 2017). Furthermore, the missing association between DNA sequences and individual specimens constrains follow-up research. For example, species new to science will remain undescribed because specimens belonging to the undescribed species cannot be readily located. Similarly, specimen- or species-specific studies addressing the role of species in the ecosystem cannot be carried out, although much could be deduced by, for example, sequencing the microbiome (e.g., Six, 2013) or gut content (e.g., Reeves et al., 2018). Overall, it is therefore desirable not only to develop bulk sequencing strategies for mass samples, but also to modernize specimen-based processing.
We here argue that three technical developments can help with achieving this goal. The first is the availability of cost-effective methods for obtaining and sequencing specimen-specific barcode amplicons with second- and third-generation sequencing technologies (Hebert et al., 2018; Srivathsan et al., 2019a, 2021; Wang et al., 2018). Indeed, today's consumable cost for barcoding a sample with 1000 specimens is <100 USD and portable sequencers produced by Oxford Nanopore Technologies are democratizing access to DNA sequences (Buchner et al., 2021; Pomerantz et al., 2018; Srivathsan et al., 2021; Watsa et al., 2020). Unfortunately, automation and data processing with neural networks, which represent the other two developments, remain underutilized. Currently, automation mostly exists in the form of pipetting robots in molecular laboratories. However, the main challenge posed by bulk samples is the imaging and movement of individual specimens into microplates.
Neural networks are currently widely used for identifying plant and charismatic vertebrate taxa (Fairbrass et al., 2019; Milošević et al., 2020; Stowell et al., 2019; Tabak et al., 2019), but invertebrates in bulk samples have benefited very little (but see Ärje et al., 2020b). Yet, thousands of samples are collected every day. They include plankton samples in marine biology, macroinvertebrate samples used for assessing freshwater quality, and mass insect samples (Borkent & Brown, 2015; Brown, 2005; Brown et al., 2018; Karlsson et al., 2020b). Here, well-trained convolutional neural networks (CNNs) would be important because they could use images to (a) identify specimens to species, (b) provide specimens for follow-up research (e.g., microbiome), (c) yield precise abundance information and (d) measure biomass. All this would enable semiautomated biomonitoring of invertebrates when samples obtained from the same place at different times are processed.
Computer-based identification systems for invertebrates are starting to yield promising results (Feng et al., 2016; Knyshov et al., 2021; Perre et al., 2016). For example, a recently developed system can size and identify stoneflies (Plecoptera) that are routinely used for freshwater quality assessment (Sarpola et al., 2008). Another system processes samples consisting of soil mesofauna (Chamblin et al., 2011). However, this system is comparatively expensive because it uses a robotic arm. Other robots have been designed for specific, commercial insect sorting purposes. This includes one that can separate intact mealworm larvae (Tenebrio molitor) from skins, faeces and dead worms (Kim, 2014) and one that sorts mosquitoes (Lepek et al., 2020) and is capable of distinguishing males from females. However, all these machines lack the ability to recognize a wide variety of insect specimens in bulk invertebrate samples. The machine closest to this capability is the BIODISCOVER by Ärje et al. (2020a), which can identify ethanol-preserved specimens, which have to be fed into the machine manually one by one. In addition, all specimens are returned into the same container after identification.
Most of these systems use deep CNNs with transfer learning (Ärje et al., 2020b) and thus require large sets of training images.
Arguably, the lack of such sets is the main obstacle for applying CNNs to invertebrates. Here, robotics could have a major impact if automatic imaging were to be combined with DNA barcoding. Robots could provide the images, which would then be labelled with taxonomic information obtained with DNA barcodes. Such barcodes can be used to sort specimens to putative species ("MOTUs" [molecular operational taxonomic units]) with good overall congruence to morphospecies (e.g., Wang et al., 2018). Comparing the barcodes with public databases will then reveal for which specimens a preliminary MOTU ID can be replaced with a scientific name.
Imaging combined with labelling at species-level resolution will thus be able to yield training sets for CNNs.
We here describe a new robot (DiversityScanner) that recognizes insect specimens based on an overview image of a sample and processes small specimens (<3 mm), which contribute >60% of all specimens in Malaise traps. This estimate is based on analyses of taxonomic compositions of samples from Sweden (Karlsson et al., 2020a) and Neotropical countries (Brown, 2005). The analyses revealed that 64%-84% of the specimens belonged to Diptera families that contain predominantly small species (e.g., Chironomidae, Sciaridae, Phoridae, Cecidomyiidae, Mycetophilidae). Furthermore, another >5% of all specimens in the Swedish samples were small parasitoid wasps (<3 mm: Diaprioidea, Chalcidoidea, Platygastroidea).
After identifying specimens of appropriate size and distance to other specimens, the DiversityScanner images each suitable specimen and moves it into a microplate. The robot then uses these images to assign the specimens into 14 common "classes" of insects (usually family-level) using a CNN. Lastly, the images are used to estimate biomass based on insect length and an estimated volume.

| CONCEPT AND METHODS
We here present a compact insect sorting robot (Figure 1) that assigns objects (mostly insects) to different classes. Note that we here use the term "class" in the machine-learning sense rather than as a taxonomic rank. Indeed, most of the classes in our study are families in the Linnean system (N = 10), two contain two families and two are of higher rank (Calyptratae and the paraphyletic acalyptrate Diptera). To ensure accessibility, our robot relies mostly on standard, commercially available components that are connected via parts that can be printed on a commercial (Fused Deposition Modeling, FDM) 3D printer. The basic design uses a cube-shaped frame (50 × 50 × 50 cm) as well as three linear drives with accurately positioning stepper motors and is based on a zebrafish embryo handling robot that was developed earlier (Pfriem et al., 2012). The robot is equipped with two high-resolution cameras ("overview" and "specimen" camera) with customized lenses, LED lighting and image recognition software. Furthermore, a specimen transport system using a suction pump is integrated to transfer insects into the wells of a standard 96-well microplate. Thus, the robotic system can be divided into: (a) the transport system, (b) the image acquisition system and (c) the image processing system. All are operated by a graphical user interface (GUI) on a touchscreen.

| Transport system
The x- and y-axes of the robot are realized by LEZ1 linear drives (Isel AG) and connected to the outer frame of the robot at half height.
Both linear drives are driven by high-precision stepper motors to ensure good positioning accuracy. The y-axis is moved by the orthogonally connected shaft slide of the x-axis. The shaft slide of the y-axis transports the specimen camera and the z-axis with the suction hose. In order to move the suction hose in the z-direction (=up and down), the z-axis is driven by an AR42H50 spindle drive with stepper motor (Nanotec Electronic). All three axes are controlled by a single TMCM-3110 motor controller (Trinamic) that allows for precise, fast and smooth movements. The motor controller and the other electronics are protected from water and ethanol droplets by a box at the bottom of the robot. The transport system is controlled by a Raspberry Pi 4 (Model B, 4 GB) single-board computer that was programmed in Python. In order to pick up insects from a Petri dish and transfer them into a well of a 96-well microplate, a suction hose with a pipette tip is positioned above the target insect by the transportation system. The hose is connected to an LA100 syringe pump (Landgraf Laborsysteme) that is also controlled by the Raspberry Pi.
The sorting process is illustrated in Figure 2.
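To make the transfer step concrete, the following is a minimal sketch of how a well index could be mapped to target coordinates for the x- and y-axes. It is not the authors' control code. The 9 mm centre-to-centre pitch is the standard spacing of 96-well microplates; the function names, the row-by-row filling order and the position of well A1 in robot coordinates (`a1_x_mm`, `a1_y_mm`) are assumptions made for illustration.

```python
# Sketch only: map a well index (0..95) of a 96-well microplate to x/y targets.
# Assumptions: wells are filled row by row, and the centre of well A1 sits at
# (a1_x_mm, a1_y_mm) in robot coordinates (hypothetical calibration values).

WELL_PITCH_MM = 9.0   # standard centre-to-centre spacing of 96-well plates
ROWS, COLS = 8, 12    # rows A-H, columns 1-12


def _check(index):
    if not 0 <= index < ROWS * COLS:
        raise ValueError("well index must be in 0..95")


def well_name(index):
    """Return the conventional label (e.g. 'A1', 'H12') for a well index."""
    _check(index)
    row, col = divmod(index, COLS)
    return f"{'ABCDEFGH'[row]}{col + 1}"


def well_position(index, a1_x_mm=0.0, a1_y_mm=0.0):
    """Return (x, y) in mm of the centre of the well with the given index."""
    _check(index)
    row, col = divmod(index, COLS)
    return (a1_x_mm + col * WELL_PITCH_MM, a1_y_mm + row * WELL_PITCH_MM)
```

In such a scheme the controller would only need a counter of specimens already transferred; the conversion from millimetres to motor steps depends on the drives and is left out here.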

| Image acquisition system
The sorting system includes two cameras with different lenses: the overview camera (C1) and the specimen camera (C2). The overview camera is a Ximea MQ042CG-CM camera using a CK12M1628S11 lens (Lensation) with a focal length of 16 mm and an aperture of f/2.8. It is positioned directly above the Petri dish to take a detailed overview image of the sample. This image is used for detecting insects and their position within the Petri dish (Figure 3a).

The specimen camera (C2) is a Ximea MQ013CG-E2 using a telecentric Lensation TCST-10-40 lens with a magnification of 1×. This camera is moved by the x- and y-axes of the robot to a position above the insect to take a detailed image for the purpose of classification and measuring size (Figures 4 and 7).

FIGURE 1 DiversityScanner with 1: x-axis; 2: y-axis; 3: z-axis; 4: Petri dish; 5: microwell plate; 6: overview camera (C1); 7: specimen camera (C2). The touch screen provides updates about the sorting process (e.g., insect detection, position of pipette tip and specimen camera) and can be used to start and stop the robot. The Raspberry Pi, motor control unit and syringe pump are hidden behind the display panel.

| Image processing system
Three different software algorithms are used. The first algorithm determines the position of each object within the Petri dish, the second estimates the biomass of each insect and the third is an artificial neural network to classify insects into different classes.

| Determining position of insects
Most objects in a Malaise trap sample are insects, but bulk samples also include insect parts and debris. After the overview image is taken, several image processing operations are used to detect only insects that are suitable for processing: (i) a median filter removes noise from the image, (ii) the RGB image is converted to greyscale, (iii) an adaptive threshold filter segregates the objects and (iv) a contour finder identifies the boundaries of all objects. Three conditions must be met for an object to be considered for imaging and transfer: (i) the size must be within a specified interval, (ii) the object has to be >10 mm away from the Petri dish edge (blue line in Figure 3b) and (iii) its distance to other objects must exceed a minimum threshold value set by the user. For efficient operation, only small specimens (body length <3 mm) should be placed into the Petri dish. Size presorting of whole samples can be manual or employ the efficient sieving methods described by Buffington and Gates (2013). Furthermore, it is desirable to distribute insects more or less evenly in the Petri dish because clumping reduces the number of insects that are available for sorting.
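The three selection conditions can be sketched as a simple filter over the detected objects. This is an illustrative sketch under stated assumptions, not the robot's actual software: it assumes that contour detection has already produced a centroid and an area for each object in millimetre units, that the Petri dish is circular with a known centre and radius, and that object size is expressed as area. All names (`DetectedObject`, `select_targets`) and parameters are hypothetical.

```python
import math
from typing import NamedTuple


class DetectedObject(NamedTuple):
    """Centroid and area of one contour (hypothetical structure, mm units)."""
    x_mm: float
    y_mm: float
    area_mm2: float


def select_targets(objects, dish_center, dish_radius_mm,
                   min_area_mm2, max_area_mm2,
                   min_separation_mm, edge_margin_mm=10.0):
    """Return the objects that satisfy all three conditions from the text."""
    cx, cy = dish_center
    selected = []
    for i, obj in enumerate(objects):
        # (i) size must lie within the specified interval
        size_ok = min_area_mm2 <= obj.area_mm2 <= max_area_mm2
        # (ii) object must be more than edge_margin_mm away from the dish edge
        dist_to_center = math.hypot(obj.x_mm - cx, obj.y_mm - cy)
        edge_ok = dish_radius_mm - dist_to_center > edge_margin_mm
        # (iii) distance to every other detected object must exceed the minimum
        separation_ok = all(
            math.hypot(obj.x_mm - other.x_mm, obj.y_mm - other.y_mm)
            > min_separation_mm
            for j, other in enumerate(objects) if j != i
        )
        if size_ok and edge_ok and separation_ok:
            selected.append(obj)
    return selected
```

Note that the separation check runs against all detected objects, including debris and oversized specimens, because any neighbour could be drawn into the pipette tip together with the target.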
After detection, the coordinates of the objects are stored in a list and used to determine where the pipetting tip and the specimen camera should be for processing a specimen. After each insect is [...] new insects can also be added to the Petri dish during this work step.
Given that only the Petri dish and the suction tube touch specimens, only these parts need cleaning between samples. The dish and suction tube can be cleaned with bleach or replaced.

| Biomass estimation
Several image processing operations are needed to measure the length and volume of each insect. First, the contour is determined using [...] additional perpendicular straight lines are drawn which must be within the body contour. The distance and length of the straight lines are then used to determine the volume, one slice at a time. The lengths and volumes of the individual and its body parts are displayed on the screen of the sorting robot and the measurements are stored. Figure 4 shows an example of a detailed picture (a) before and (b) after the volume estimation, as well as the necessary steps (i-vi). All operations use the free OpenCV library (version 4.5.1) and Python scripts (version 3.8.6). Currently, volume estimates perform best for body parts that are rotationally symmetrical; that is, the method works better for insects with rotationally symmetrical morphology (e.g., many Hymenoptera).

FIGURE 3 Overview of Petri dish with evenly distributed insects before (a) and after processing (b). (b) A region of interest has been defined (blue line 10 mm from the edge). Circles represent detected objects (green = meet size and distance conditions for imaging and movement; red = size too large and/or distance too small).

FIGURE 4 Specimen images obtained with the specimen camera before (a) and after processing (b). (i-vi) Image processing steps used to distinguish head, thorax/mesosoma and abdomen/metasoma. (i, ii) Contour determination; (iii) connecting surfaces; (iv) placing random points; (v) regression; (vi) defining dividing lines.
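The slice-wise volume estimate can be illustrated with a short sketch. It assumes, as the text suggests for rotationally symmetrical body parts, that each slice is a thin cylinder whose diameter is the length of one perpendicular line and whose height is the distance between neighbouring lines. The function names and the uniform spacing are assumptions for illustration; the exact procedure used by the robot may differ.

```python
import math


def volume_from_slices(widths_mm, spacing_mm):
    """Approximate the volume of one body part as a stack of thin cylinders.

    widths_mm:  lengths of the perpendicular lines inside the contour
    spacing_mm: distance between neighbouring lines (assumed uniform)
    """
    if spacing_mm <= 0:
        raise ValueError("spacing must be positive")
    return sum(math.pi * (w / 2.0) ** 2 * spacing_mm for w in widths_mm)


def total_volume(parts):
    """Sum the volumes of the body parts (e.g. head, thorax, abdomen).

    parts: dict mapping a part name to (widths_mm, spacing_mm)
    """
    return sum(volume_from_slices(w, s) for w, s in parts.values())
```

For a perfect cylinder of diameter 1 mm and length 1 mm, ten slices of 0.1 mm each reproduce the exact volume π/4 mm³; for tapering body parts the estimate improves as the spacing decreases.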

| Classification with artificial neural network
We apply machine-learning algorithms based on CNNs to assign in-

| RESULTS
To test how fast the DiversityScanner sorts, we used 192 specimens (=two microplates). The average time per specimen was 37 s for the first and 38 s for the second plate, with some specimens taking much longer (e.g., #1, #8, #35: Figure 6a). The reported time consists of the time needed for activating the GUI, the write operations on the SD card, the movement time of the axes, the runtimes of the algorithms for object detection and classification, and the times for using the syringe pump. Faster sorting is feasible, but reduces quality because the specimens need to settle before high-quality images can be taken. In addition, the specimens have to "sink" within the pipette tip before they can be safely expelled into a well of the microplate. In contrast, object recognition and classification are fast (Figure 6b,c). The average time for object detection is <1.3 s and classification ~4 s. Currently, the robot classifies the detected insects into 14 different classes. All other insects and noninsect objects are combined in a class labelled "other" (Table 2). The best classification results are obtained for "Hymenoptera Diapriidae" and "Hemiptera Cicadellidae," where all insects were correctly classified (100%), whereas insects of the class "Hymenoptera Ichneumonidae" had the lowest correct classification rate (75%). The performance details for the different classes are summarized in a "confusion matrix" (Table 3) that compares results of the "predicted" (CNN) identification with the "true" labels (taxonomists). Note that the good performance of the CNN allowed for the implementation of taxon-specific processing. The robot then only sorts insects belonging to a predefined class.
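As an illustration of how per-class rates follow from a confusion matrix, the sketch below computes precision and recall for each class. It assumes that rows hold the "true" labels and columns the "predicted" labels; the orientation of Table 3 and the exact metric reported in the text should be checked against the paper. The function name and the example counts are hypothetical and are not taken from the study.

```python
def per_class_rates(confusion, labels):
    """Return {label: (precision, recall)} from a square confusion matrix.

    confusion[i][j] = number of specimens with true class i predicted as j
    (assumed orientation).
    """
    n = len(labels)
    rates = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        row_total = sum(confusion[k])                        # truly class k
        col_total = sum(confusion[i][k] for i in range(n))   # predicted as k
        recall = true_pos / row_total if row_total else float("nan")
        precision = true_pos / col_total if col_total else float("nan")
        rates[label] = (precision, recall)
    return rates


# Made-up two-class example (not data from the study).
example = per_class_rates([[8, 2], [1, 9]], ["A", "B"])
```

Recall answers how many specimens of a class were found, whereas precision answers how many specimens assigned to a class really belong to it; the latter matters most when the robot is instructed to pick only specimens of a predefined class.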
Biomass estimation is the slowest process and performing it during sorting adds significantly to processing times because the Raspberry Pi requires almost 2 min per specimen (Figure 6d: 108.06 s). We therefore recommend that the images be exported to another computer before the algorithm is applied. On a notebook with an Intel Core i7-4510U at 2.0 GHz, the average processing time per specimen is ~12 s. Note that total volume is estimated as the sum of the volumes for head, thorax/mesosoma and abdomen/metasoma and that our tests used only Hymenoptera Diapriidae because they have clearly separated body parts.
Currently, the sorting robot handles only insects up to 3 mm in length (Figure 7a-o), because larger insects do not fit through the pipetting tip. However, solutions for larger insects are in development.
Lower-bound size limits for specimens do not exist, but very small specimens may not be detectable on the overview image.

| DISCUSSION
The use of CNNs for the identification of charismatic species is becoming routine (Fairbrass et al., 2019; Milošević et al., 2020; Stowell et al., 2019; Tabak et al., 2019). However, these methods have been largely unavailable for small invertebrates, even though they comprise much of the multicellular animal species diversity (Groombridge, 1992; Stork et al., 2015) and contribute many ecosystem services (Wagner, 2020) [...] trained CNNs, which cannot be obtained without first producing sets of training images for thousands of species. We believe that the best strategy for obtaining these sets is combining automated specimen imaging with DNA barcoding. Each DiversityScanner can image 1000 specimens per day so that a laboratory equipped with a few DiversityScanners will be able to process several full Malaise trap samples per day. Each contains thousands of specimens that can be imaged with minimal manual labour. After imaging, the specimens are automatically transferred to microplates for DNA barcoding. Once barcoded, the images can be relabelled using the taxonomic information obtained from DNA barcodes. This can produce image training sets that have approximately species-level resolution given that specimen sorting with DNA barcodes yields MOTUs that are mostly congruent with morphospecies even when rigorously as- [...] treatment. This ability to only find and move some taxa helps with implementing clade-specific molecular recipes (e.g., different DNA extraction or PCR recipes for taxa that are difficult to barcode: e.g., Hymenoptera) or restricting barcoding to either males or females given that often only one sex has species-specific morphological differences (Eberhard, 2010). Overall, we would thus predict that the DiversityScanner will prove useful for many studies using the toolkit of molecular ecology. The robot can rapidly generate barcodes for an unknown fauna, which helps with improving the quality of barcode databases and the interpretation of metabarcoding data.
The robot can also prepare large numbers of specimens for molecular work on microbiomes or species interactions that have been sorted semi-automatically to the species level. By facilitating the barcoding of all specimens, the DiversityScanner furthermore highlights which common species belonging to dark taxa should be prioritized for taxonomic treatment.
One of the unresolved issues is whether CNNs will be sufficiently powerful to yield species-level identifications for closely related species (but see Ärje et al., 2020b; Knyshov et al., 2021). It is likely that the main limitation will be the number, quality and orientation of training images. Figure 7 illustrates the latter problem. Insects are imaged from many different angles and each will require enough training images before the CNNs will have a realistic chance for achieving accuracy at high taxonomic resolution. One solution for this problem is imaging specimens in many orientations. Fortunately, this is now feasible because modern, high-quality cameras can acquire large numbers of images at different magnifications and orientations. This is particularly straightforward once specimens have been presorted to putative species based on DNA barcodes. As illustrated by the BIODISCOVER robot, inserting these specimens into a cuvette allows imaging from many sides. This is why we predict that once large numbers of species have been extensively imaged and included in CNNs, robots such as the DiversityScanner should be able to identify many specimens to species based on images only.
Note also that not all would be lost if CNNs were eventually found to be incapable of distinguishing closely related species. Specimens identified to genus- or species-group level would still be suitable for many biomonitoring purposes.
Eventually, DNA barcoding might become restricted to those specimens that are not identifiable based on visual information; that is, the DiversityScanner would learn how to sort specimens to species, but also learn how to identify those specimens that still require barcoding. This will make the robot a powerful tool for discovering rare or new species in large samples. This ability would be particularly important in the 21st century, given that new species continue to arrive at well-characterized sampling sites (Parmesan, 2006). These new arrivals are due to both distribution shifts in response to climate change (Fartmann et al., 2021; Wilson et al., 2007) and anthropogenic introductions (Bertelsmeier, 2021). For both, it would be desirable to have an early-alert system based on automated workflows.
With regard to the classification accuracy rates of our current CNN, we observe only a very weak correlation between the number of training images, morphological heterogeneity and classification accuracy (Figure S1). There are classes with large numbers of training images that perform better than classes with lower numbers (e.g., "Diptera Calyptratae," 57 training images: 83% vs. "Diptera Phoridae," 64 training images: 97%), but the better performance of "Diptera Phoridae" could also be due to higher morphological uniformity. However, this is not in line with the observation of a comparatively high classification accuracy that was obtained for the class "other" that has the highest morphological heterogeneity. Indeed, this class performed better (81%) than "Hymenoptera Ichneumonidae" (75%, [...] In addition, object detection can also be applied to detailed specimen images in order to detect cases where the image contains more than one specimen. The latter would also avoid instances where an insect was detected but not picked up by the pipette so that a well of the microplate remains empty. Particularly high on the list of development needs is also the handling of specimens larger than 3 mm. [...] help fill these gaps (e.g., Reeves et al., 2018; Six, 2013; Srivathsan et al., 2019b; Yeo et al., 2018).

ACKNOWLEDGEMENTS
We would like to thank in particular Daniel Moser and Stefan Vollmannshauser for their support with manufacturing the mechanical parts and helping us with connecting the electronic circuits. Mr Leshon Lee prepared the video documenting the working principles of the DiversityScanner. Funding was provided by the Center for Integrative Biodiversity Discovery at the Museum für Naturkunde Berlin. Open access funding enabled and organized by Projekt DEAL.

DATA AVAILABILITY STATEMENT
All image data used for training and testing are accessible at the media repository of the Museum für Naturkunde Berlin: https://doi.org/10.7479/4tbx-qm72. All files for printing the robot parts and the software code are available from the repository of the Open Science Framework: https://osf.io/en594/. Benefits from this research accrue from the sharing of our data, software and robot assembly information on the public databases as described above.