The potential of Jellytoring 2.0 smart tool as a global jellyfish monitoring platform

Abstract Despite the recent recognition of jellyfish as an important component of marine ecosystems and concerns about their potential population increase, they are rarely monitored at the appropriate spatial and temporal scales. Traditional jellyfish monitoring techniques are costly and generally restrict the spatial-temporal resolution, limiting the quantity and quality of monitoring data. We introduce Jellytoring 2.0, an automatic recognition tool for jellyfish species based on convolutional neural networks (CNNs). We trained Jellytoring 2.0 to identify 15 jellyfish species with a global distribution. Our aim is to offer Jellytoring 2.0 as an open-access tool to serve as the backbone of a system that promotes the creation of large-scale and long-term jellyfish monitoring data. Results reveal that Jellytoring 2.0 performed well in the identification of the 15 species, with average precision values ranging between 90% and 99% for most of the species; four species presented slightly lower values (75%-80%). Our system was trained on a relatively small dataset, meaning that the integration of additional image data will further improve the performance of the CNN. We show how the application of CNNs to image data can deliver a tool that enables the cost-effective collection of jellyfish data on larger spatial and temporal scales. For Jellytoring 2.0 to become a truly global automatic identification system, we ask scientists and nonscientists alike to actively contribute jellyfish image data to extend the number of species it can identify.

The logistical difficulties of sampling and monitoring these organisms in time and space are largely responsible for this deficient level of information, hindering our understanding of jellyfish population dynamics (Lamb et al., 2019).
Generally, jellyfish are monitored by visual observations from the shore and/or boats, with data logging approaches ranging from quantitative counts to presence/absence and relative abundance indices (Condon et al., 2013). Data generated in this way tend to have limited spatial and temporal resolution due to the high sampling effort, labor intensity, and cost involved.
Furthermore, human observer data collection often requires special training and is susceptible to observer bias.
Here, we explore the application of convolutional neural networks (CNNs), a type of deep learning methodology, for the identification of different species of jellyfish. The successful application of this methodology could serve as a basis for the cost-effective acquisition of data in a continuous, standardized way, with the possibility of moving toward a global-scale methodology. Identification bias associated with the observer would be reduced, since networks are trained by a multitude of observers and a high volume of images, which levels out potential individual biases. Recent studies have successfully applied CNNs to the identification of different jellyfish species (Gauci et al., 2020; Han et al., 2021; Martin-Abadal et al., 2020).
The majority focused on the local scale (Gauci et al., 2020; Martin-Abadal et al., 2020), while one study used jellyfish images from different geographical areas without a spatially explicit specification (Han et al., 2021). Yet, if the objective is to significantly advance the quality, quantity, and spatial and temporal coverage of jellyfish data, the first step is to develop a cost-effective method that allows for efficient data collection at large scales.
Here, we demonstrate and evaluate the use of the Jellytoring deep learning neural network as the basis for a global jellyfish identification system.

| MATERIALS AND METHODS
The present work builds on the technological advances previously achieved by the authors in setting the basis for an automatic jellyfish identification system. Jellytoring, a convolutional neural network, was developed and trained for the identification and quantification of three common jellyfish species in the Western Mediterranean area, with successful results (Martin-Abadal et al., 2020). The tool, however, has the potential to be expanded to identify a wide range of jellyfish species at a global scale.

| Convolutional neural networks
Convolutional neural networks (CNNs) are a type of deep learning approach that allows the automatic extraction of information from an image. Neural networks are inspired by the way neurons in a mammalian brain respond to stimuli (Goodfellow et al., 2016).
Training examples are given to the network so that it learns how to complete a task (Goodfellow et al., 2016). Here, we used underwater images containing different species of jellyfish to train a CNN to detect and classify them. Once the CNN has "learned," it can detect and classify jellyfish in new images, assigning confidence values to each detection. Finally, the species with the highest confidence value is associated with the detection.
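The final classification step described above, assigning to each detection the species with the highest confidence value, can be sketched as follows. This is a minimal illustration only; the species names and scores are invented, not actual network output.

```python
# Minimal sketch of the classification step: each detection carries
# per-species confidence values, and the species with the highest
# confidence is assigned to the detection. Values are illustrative.

def assign_species(confidences):
    """Return the (species, confidence) pair with the highest confidence."""
    species = max(confidences, key=confidences.get)
    return species, confidences[species]

detection = {"Pelagia noctiluca": 0.12, "Rhizostoma pulmo": 0.81, "Aurelia aurita": 0.07}
print(assign_species(detection))  # ('Rhizostoma pulmo', 0.81)
```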

| Image acquisition to train the CNN
Jellytoring was originally trained to identify three common Mediterranean species, namely Pelagia noctiluca, Cotylorhiza tuberculata, and Rhizostoma pulmo. To expand the capabilities of Jellytoring (hereafter Jellytoring 2.0) and make its application global, underwater video recordings of a range of jellyfish species were used to obtain additional images. The platform YouTube was searched for publicly available jellyfish underwater videos. Searches were performed using the keyword "jellyfish," followed by a snowballing technique. A total of 226 videos corresponding to 15 jellyfish species were identified; the species and their respective numbers of videos are detailed in Table 1. Relevant videos were downloaded and still images were extracted from each video, on average one frame every 5 s. To log the presence of the different jellyfish species, an annotation file was generated using LabelImg (Tzutalin, 2018). For each frame, a bounding box was drawn around every jellyfish occurrence and classified according to the jellyfish species, and an .xml file was generated containing the position and classification of each instance within the image. Overall, a total of 3808 occurrences were recorded, corresponding to the 15 jellyfish species (Table 1).
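The frame-sampling scheme (one still image roughly every 5 s) amounts to selecting evenly spaced frame indices from each video. The sketch below assumes the frame rate is known and leaves the actual video decoding to an external tool; the duration and frame-rate values are illustrative.

```python
# Sketch of the frame-sampling step: extract one still roughly every
# `interval_s` seconds by computing evenly spaced frame indices.
# Decoding the frames themselves (e.g., with ffmpeg) is omitted.

def frame_indices(duration_s, fps, interval_s=5.0):
    """Indices of the frames to extract, one every `interval_s` seconds."""
    step = max(1, round(fps * interval_s))   # frames between consecutive stills
    total = int(duration_s * fps)            # total frames in the video
    return list(range(0, total, step))

# A 60 s clip at 25 fps yields 12 stills (t = 0, 5, ..., 55 s).
print(len(frame_indices(60, 25)))  # 12
```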
In an initial step, Jellytoring 2.0 was trained with all jellyfish species grouped into one model. Performance, however, was not entirely satisfactory (Tables S1 and S2): for similar-looking species (e.g., Rhizostoma pulmo vs. Rhizostoma luteum), the network misidentified them in up to 30% of the instances. To reduce the number of misidentifications and improve the network's performance, a different approach was adopted, and the CNN was trained to generate different models for different geographical areas.
The WoRMS database (www.marinespecies.org) was used to assess the spatial distribution of the jellyfish species, which were assigned to the different areas of the world where they had been recorded (Figure 1). The CNN was therefore trained to generate four different models, one for each subarea, each trained using data from that specific area only. This approach ensured that the network could not confuse two similar-looking species that had been recorded only in different geographical areas.
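Partitioning the species pool into the four regional models can be sketched as inverting the species-to-region table. The two entries below are examples only, not the full distribution of Table 1.

```python
# Sketch of assigning species to regional models from their recorded
# distributions: invert {species: regions} into {region: [species]}.
# The region labels follow Table 1; the example distribution is illustrative.

REGIONS = ("Atl/Med", "Pacific", "Arctic/Baltic", "Indian/China")

def species_per_region(distribution):
    """Invert a {species: set(regions)} mapping into {region: [species]}."""
    models = {r: [] for r in REGIONS}
    for species, regions in sorted(distribution.items()):
        for r in regions:
            models[r].append(species)
    return models

dist = {"Aurelia aurita": {"Atl/Med", "Pacific"}, "Nemopilema nomurai": {"Indian/China"}}
models = species_per_region(dist)
print(models["Atl/Med"])  # ['Aurelia aurita']
```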

| Framework network selection
A variety of deep learning frameworks based on CNNs can be used to extract instance information from images (Dai et al., 2016; Girshick, 2015; He et al., 2016; Lin et al., 2017). Our objective was to select a framework capable of detecting and classifying a range of jellyfish species, without the need for pixelwise segmentation of the detected instances or any extra feature that could slow down the process. Moreover, because jellyfish move slowly, real-time inference speed is not critical, so an architecture prioritizing detection performance was deemed suitable. Considering these requirements, we selected the Faster R-CNN-based implementation of the Inception ResNet v2 architecture (Szegedy et al., 2016).

| Jellytoring workflow
As a first step, a set of images containing jellyfish is used as input for a frozen version of a trained deep object-detection neural network model. During inference, the network starts the process of jellyfish detection. Subsequently, detection is optimized by a nonmaxima suppression (nms) algorithm (Neubeck & Van Gool, 2006). The final predictions are obtained by deleting instances with an associated confidence lower than a selected threshold value (Cthr). Figure 2 shows a representation of identification bounding boxes for several of the jellyfish species.

TABLE 1 Summary of the jellyfish species included in the study, number of videos analyzed, total number of still images extracted from the videos, and geographical distribution of the different species according to the WoRMS database. Abbreviation: x, recorded in the region.

Species          # videos   # files   Atl/Med   Pacific   Arctic/Baltic   Indian/China
Aurelia aurita   14         210       x

FIGURE 1 World oceans' map considered in the creation of regional neural networks to reduce error in the identification of species with similar morphology.
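The post-processing in the workflow above, greedy non-maxima suppression followed by the confidence threshold Cthr, can be sketched as follows. Box coordinates and scores are illustrative.

```python
# Sketch of the post-processing stage: greedy non-maxima suppression keeps
# the highest-confidence box among overlapping ones, then a confidence
# threshold (Cthr) removes weak predictions. Boxes are (x1, y1, x2, y2, conf).

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(detections, iou_thr=0.5, c_thr=0.5):
    """Suppress overlapping boxes, then apply the confidence threshold."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det[:4], k[:4]) < iou_thr for k in kept):
            kept.append(det)
    return [d for d in kept if d[4] >= c_thr]

boxes = [(10, 10, 50, 50, 0.9), (12, 12, 52, 52, 0.6), (80, 80, 120, 120, 0.3)]
print(nms(boxes))  # [(10, 10, 50, 50, 0.9)]
```

The second box is suppressed because it overlaps the first, and the third, although non-overlapping, falls below Cthr.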

| Evaluation metrics
To evaluate the performance of a trained model, the bounding-box predictions generated on the test dataset are classified as either a true positive (TP) or a false positive (FP). To do so, we use the Intersection over Union (IoU) metric, defined as the area of the intersection between a predicted and a ground-truth bounding box divided by the area of their union, IoU = area(Bp ∩ Bgt)/area(Bp ∪ Bgt) (Equation 1).
A prediction is classified as a TP if its IoU with any ground-truth bounding box is greater than a threshold IoU and the predicted species class matches the class specified in the ground-truth box. When these conditions are not met, the prediction is classified as an FP (Equation 2). Ground-truth instances that do not reach the threshold IoU with any prediction are considered undetected instances (false negatives, FN). Following the criteria applied in the PASCAL VOC challenge (Everingham et al., 2010), the IoU threshold was set at 0.5. Once the numbers of TP, FP, and FN are obtained, the average precision (AP) is calculated (Zhu, 2004); AP is one of the most frequently used metrics in object detection applications and is defined as the area under the max(precision)-recall curve. Once the AP is obtained for each class, a mean average precision (mAP) over all classes is computed. In addition, the confidence score generated by the network for each prediction can be used to fine-tune the network: the aim is to find a confidence value such that, when predictions below it are deleted, more FP than TP are removed, thus improving the detection performance of the network.
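A minimal sketch of this evaluation logic, matching predictions to ground-truth boxes by IoU and class to count TP, FP, and FN. Coordinates and species labels are illustrative.

```python
# Sketch of the TP/FP/FN accounting: a prediction is a TP if its IoU with an
# unmatched ground-truth box of the same class exceeds the threshold (0.5),
# otherwise an FP; unmatched ground truths count as FNs.

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def evaluate(predictions, ground_truths, iou_thr=0.5):
    """predictions/ground_truths: lists of (box, species). Returns (TP, FP, FN)."""
    matched = set()
    tp = fp = 0
    for box, cls in predictions:
        hit = next((i for i, (g, gcls) in enumerate(ground_truths)
                    if i not in matched and gcls == cls and iou(box, g) > iou_thr),
                   None)
        if hit is None:
            fp += 1
        else:
            matched.add(hit)
            tp += 1
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn

gts = [((0, 0, 10, 10), "Pelagia noctiluca")]
preds = [((1, 1, 10, 10), "Pelagia noctiluca"), ((50, 50, 60, 60), "Aurelia aurita")]
print(evaluate(preds, gts))  # (1, 1, 0)
```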
To do this, a threshold sweep on the prediction confidence from 0% to 100% was performed in 1% steps (Cthr). At each step, predictions with an associated confidence lower than the Cthr were deleted and the detection metrics were recomputed; the Cthr value that maximized the F1-score was selected as the optimal confidence threshold.
For each of the four models corresponding to the different geographical areas, a 5-fold cross-validation was performed (Geisser, 1975). Through this method, the dataset is split into five equally sized subsets and the network is trained five times, each time using a different subset as the test data (20% of the dataset) and the remaining four subsets as training data (80%). The final results (i.e., AP, mAP, Cthr, recall, precision, and F1-score) are thus an average over the five trainings. This reduces the variability of the results, making them less dependent on the particular test and training split and thereby providing a more accurate performance estimate.
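The 5-fold splitting scheme can be sketched as follows: the dataset is divided into five roughly equal subsets, and each fold uses one subset (20%) for testing and the other four (80%) for training.

```python
# Sketch of the 5-fold cross-validation split described above.
# `items` stands for the annotated dataset; integers are used here
# as illustrative placeholders.

def k_fold_splits(items, k=5):
    """Yield (train, test) pairs, one per fold."""
    folds = [items[i::k] for i in range(k)]   # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(100))
splits = list(k_fold_splits(data))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 80 20
```

Reported metrics would then be averaged over the five (train, test) pairs.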

| RESULTS
For each of the four geographical regions, the AP, mAP, Cthr, and F1-score metrics are presented, along with a multiclass confusion matrix. The confusion matrix allows us to identify pairs of classes that the network has difficulty telling apart.
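A row-normalized multiclass confusion matrix of this kind can be built from (true, predicted) species pairs; a sketch with invented pairs, not the study's results:

```python
# Sketch of a row-normalized multiclass confusion matrix: entry [t][p] is
# the fraction of instances of true class t predicted as class p. The
# (true, predicted) pairs below are illustrative.

from collections import Counter

def confusion_matrix(pairs, classes):
    """Build {true_class: {predicted_class: fraction}} from labeled pairs."""
    counts = Counter(pairs)
    matrix = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix[t] = {p: counts[(t, p)] / row_total if row_total else 0.0
                     for p in classes}
    return matrix

pairs = [("R. pulmo", "R. pulmo")] * 9 + [("R. pulmo", "R. luteum")]
m = confusion_matrix(pairs, ["R. pulmo", "R. luteum"])
print(m["R. pulmo"]["R. luteum"])  # 0.1
```

Off-diagonal entries well above zero flag pairs of classes the network confuses.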

| Atlantic-Mediterranean region
The spatial distribution assessment indicated that 12 of the 15 jellyfish species had been recorded in the Atlantic-Mediterranean region (Figure 3).
Moreover, the training data for R. pulmo were three times those for R. luteum, biasing the network toward R. pulmo (Table 3).

| Pacific region
Seven of the 15 jellyfish species had been detected in the Pacific region; the lowest AP values ranged between 78% and 83%. The mAP had a value of 90.1% and the F1-score was 83.5% after applying an optimal confidence threshold of 44.8% (Table 4). The multiclass confusion matrix indicates that the network performed very well in distinguishing between species: all values were above 98%, with the exception of N. nomurai, which was correctly identified in 96.7% of the cases and only occasionally (3.3%) misidentified as C. capillata (Table 5).

| Arctic-Baltic region
Four of the 15 species were recorded in the Arctic/Baltic region. All AP values for this region were above 80%. The values for C. capillata and C. lamarckii were both over 97%, while the AP values for A. aurita and C. fuscescens were in the lower eighties (81.5% and 83.2%).
The average AP across all the species was slightly over 90%. After applying the optimal confidence threshold of 55.4%, the F1-score reached 82.2% (Table 6). The performance of the network model was very high, as it was able to distinguish between the different species: even very similar-looking species belonging to the same genus (i.e., C. capillata and C. lamarckii) were correctly identified in 98% of the cases (Table 7).

| Indian Ocean-South China Sea Region
The model for the Indian Ocean and the South China Sea was developed with the four jellyfish species recorded in the region. The AP values for two of the species, N. nomurai and C. quinquecirrha, were very high, ranging between 96.7% and 97.5%. AP values for A. aurita and P. noctiluca were slightly lower, 82.4% and 79.2%, respectively.
The mAP had a value of 89%, while the F1-score was 80.8% after applying a 64.8% confidence threshold (Table 8). As with the other models, the network performed very well in distinguishing between species, as all values were above 99.5% (Table 9).
Overall, the performance metrics are good and consistent across the four regions. Furthermore, the network rarely mistakes one jellyfish species for another, demonstrating the effectiveness of the strategy of dividing the species into four regions.

| DISCUSSION
Thus far, jellyfish have proven difficult to monitor, due to the high cost of the associated logistics and the need for humans to analyze visual data (direct observations or video recordings), limiting both the spatial and temporal scales of monitoring programs (but see Condon et al., 2015). The results of this paper demonstrate that the application of CNNs to image data has the potential to deliver a tool that enables the cost-effective collection of jellyfish data on larger spatial and temporal scales. In our initial publication, we described the successful identification and quantification of three common jellyfish species in the Western Mediterranean (Martin-Abadal et al., 2020).

The results were highly promising, but the question remained whether the CNN would be able to distinguish a much larger number of species and thus serve as the backbone for the development of a global tool. Here, we demonstrate that the algorithms developed are robust and that most species can be distinguished by the system with high precision and an acceptable level of confusion. The development of the CNN as a single model would be the ideal objective, but the aim is to have a system that facilitates the identification of jellyfish species, which is still achieved with the CNN working as separate regional submodels.
Within the Atlantic-Mediterranean subarea model, the most notable misidentifications involved species whose appearance varies strongly between images. For instance, P. noctiluca is a translucent organism whose body can adopt markedly different configurations due to the movement of its tentacles, while A. aurita tends to appear in large blooms, causing overlap in the images and making it hard to identify single organisms. While the classification weaknesses for some species may be improved by integrating additional training data, for others there may be morphological or behavioral aspects that will make it difficult to increase the overall classification precision. Nevertheless, in general, the incorporation of additional training data is expected to significantly improve the detection and correct classification of objects (Goodfellow et al., 2016).

ACKNOWLEDGMENTS
We thank those who contributed to the web application platform and Aenne Douwes for her contributions during the early stages of the project. We also thank two anonymous reviewers whose comments contributed to a clearer manuscript.

DATA AVAILABILITY STATEMENT
The dataset and code used are deposited in a GitHub repository: https://github.com/srv/jf_object_detection/tree/regions.