Improving the accessibility and transferability of machine learning algorithms for identification of animals in camera trap images: MLWIC2

Abstract Motion‐activated wildlife cameras (or "camera traps") are frequently used to remotely and noninvasively observe animals. The vast number of images collected from camera trap projects has prompted some biologists to employ machine learning algorithms to automatically recognize species in these images, or at least filter out images that do not contain animals. These approaches are often limited by model transferability: a model trained to recognize species from one location might not work as well for the same species in different locations. Furthermore, these methods often require advanced computational skills, making them inaccessible to many biologists. We used 3 million camera trap images from 18 studies in 10 states across the United States of America to train two deep neural networks: one that recognizes 58 species (the "species model") and one that determines if an image is empty or contains an animal (the "empty‐animal model"). Our species model and empty‐animal model had accuracies of 96.8% and 97.3%, respectively. Furthermore, the models performed well on some out‐of‐sample datasets, as the species model had 91% accuracy on species from Canada (accuracy range 36%–91% across all out‐of‐sample datasets) and the empty‐animal model achieved an accuracy of 91%–94% on out‐of‐sample datasets from different continents. Our software addresses some of the limitations of using machine learning to classify images from camera traps. By including many species from several locations, our species model is potentially applicable to many camera trap studies in North America. We also found that our empty‐animal model can facilitate removal of images without animals globally. We provide the trained models in an R package (MLWIC2: Machine Learning for Wildlife Image Classification in R), which contains Shiny Applications that allow scientists with minimal programming experience to use trained models and train new models in six neural network architectures with varying depths.

KEYWORDS
computer vision, deep convolutional neural networks, image classification, machine learning, motion-activated camera, R package, remote sensing, species identification


| INTRODUCTION
Motion-activated wildlife cameras (or "camera traps") are frequently used to remotely observe wild animals, but images from camera traps must be classified to extract their biological data (O'Connell, Nichols, & Karanth, 2011). Manually classifying camera trap images is an encumbrance that has prompted scientists to use machine learning to automatically classify images (Willi et al., 2019), but this approach has limitations. We address two major limitations of using machine learning to automatically classify animals in camera trap images. First, machine learning models trained to recognize species from one location and in one camera trap setup might perform poorly when applied to images from camera traps in different conditions (i.e., these models can have low "out-of-sample" accuracy; Schneider, Greenberg, Taylor, & Kremer, 2020). This transferability, or generalizability, problem is thought to arise because different locations have different backgrounds (the part of the picture that is not the animal) and most models evaluate the entire image, including the background (Beery, Morris, & Yang, 2019; Miao et al., 2019; Norouzzadeh et al., 2019; Terry, Roy, & August, 2020; Wei, Luo, Ran, & Li, 2020). By including images from 18 different studies in North America, our objective was to train models with more variation in the backgrounds associated with each species. Furthermore, by training an additional model that distinguishes between images with and without animals, we provide an option that could be broadly applicable to camera trap studies worldwide.
Second, the use of machine learning in camera trap analysis is often limited to computer scientists, yet the need for image processing exceeds the availability of computer scientists in wildlife research. For example, several researchers have provided excellent Python repositories for using computer vision to analyze camera trap images (Beery, Wu, Rathod, Votel, & Huang, 2020; Norouzzadeh et al., 2018; Schneider et al., 2020).
These software packages enable programmers to use and train models to detect, classify, and evaluate the behavior of animals in camera trap images. However, these packages require extensive programming experience in Python, a skill that is often lacking on wildlife research teams. To facilitate the use of this type of model by biologists with minimal programming experience, Machine Learning for Wildlife Image Classification (MLWIC2) includes an option to train and use models in user-friendly Shiny Applications (Chang, Cheng, Allaire, Xie, & McPherson, 2019), allowing users to point-and-click instead of using a command line. This also eases site-specific model training when our models do not perform to expectations.
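For example, the Shiny interfaces can be launched from the R console with one function call. The following is a minimal sketch; `runShiny()` and its app names follow our reading of the package's GitHub README and should be checked against the current documentation:

```r
# Install MLWIC2 from GitHub and launch its point-and-click interfaces.
# The "classify" and "train" app names are taken from the package README;
# treat them as assumptions if the package has since changed.
# install.packages("devtools")
devtools::install_github("mikeyEcology/MLWIC2")
library(MLWIC2)

runShiny("classify")  # app for classifying images with a trained model
runShiny("train")     # app for training a new model on labeled images
```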

| Camera trap images
Images were collected from 18 studies using camera traps in 10 states in the United States of America (California, Colorado, Florida, Idaho, Minnesota, Montana, South Carolina, Texas, Washington, and Wisconsin; Appendix S1). Images were either classified by a single wildlife expert or classified independently by two biologists, with discrepancies settled by a third. An image was classified as containing an animal if it contained any part of an animal. Our initial dataset included 6.3 million images but was unbalanced, with most images coming from a few species (e.g., 51% of all images were Bos taurus). We rebalanced the number of images by species and site to ensure that no single species or site dominated the training process. Previous work suggested that training a model with 100,000 images per species produces good performance (Tabak et al., 2019); therefore, we limited the number of images for a single species from one location to 100,000. When >100,000 images for a single species existed at one location, we randomly selected 100,000 of these images to include in the training/testing dataset. After rebalancing the data, we had a total of 2.98 million images; 90% were randomly selected for training, while 10% were used for testing. Images used in this study were either already a part of or were added to the North American Camera Trap Images dataset (lila.science/datasets/nacti; Tabak et al., 2019).
Images from Canada were not used for training but were used to evaluate model transferability as an out-of-sample dataset.
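The rebalancing step described above could be reproduced along these lines. This is a hedged sketch, not the authors' actual pipeline; `labels`, `species`, and `study_site` are hypothetical names for a data frame with one row per classified image:

```r
library(dplyr)

set.seed(1)
balanced <- labels %>%
  group_by(species, study_site) %>%
  mutate(rank = sample(n())) %>%  # random order within each species-site group
  filter(rank <= 100000) %>%      # cap at 100,000 images per species per site
  ungroup() %>%
  select(-rank) %>%
  mutate(split = ifelse(runif(n()) < 0.9, "train", "test"))  # approximate 90/10 split
```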

| Training models
We trained deep convolutional neural networks using the ResNet-18 architecture (He, Zhang, Ren, & Sun, 2016) in the TensorFlow framework (Abadi et al., 2016) on a high-performance computing cluster, "Teton" (Advanced Research Computing Center, 2018). Models were trained for 55 epochs, using a ReLU activation function at every hidden layer, a softmax function in the output layer, mini-batch stochastic gradient descent with a momentum hyperparameter of 0.9 (Goodfellow, Bengio, & Courville, 2016), a batch size of 256 images, and learning rates and weight decays that varied by epoch (described in Appendix S2). We trained a species model, which contained classes for 58 species or groups of species and one class for empty images (Table 1). We also trained an empty-animal model that contained only two classes, one for images containing an animal, and the other for images without animals.
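In R, an equivalent optimizer and loss configuration might look like the following keras sketch. It is illustrative only: the authors trained in a Python/TensorFlow pipeline, ResNet-18 is not bundled with keras, and the paper's learning rate and weight decay varied by epoch (Appendix S2), so fixed placeholder values are shown:

```r
library(keras)

# `model` stands in for an assumed ResNet-18 graph ending in a softmax layer.
model %>% compile(
  optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9),  # mini-batch SGD, momentum 0.9
  loss = "categorical_crossentropy",  # pairs with the softmax output layer
  metrics = "accuracy"
)

model %>% fit(
  x = train_images, y = train_labels,
  batch_size = 256,  # batch size reported in the paper
  epochs = 55        # number of training epochs reported in the paper
)
```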

| Model validation and transferability
We first evaluated our trained models by applying them to predict species in the 10% of images that were withheld from training.
Models were evaluated for each species using recall, top-5 recall, and precision, which summarize the numbers of true positives (TPs), false positives (FPs), and false negatives (FNs): recall = TP/(TP + FN) and precision = TP/(TP + FP). Recall is the proportion of images of each species that were correctly classified; top-5 recall is the proportion of images for each species in which one of the model's top five guesses is the correct species. We also calculated confidence intervals for recall and precision rates (Appendix S3). To evaluate transferability of the model, we conducted out-of-sample validation by applying our trained models to images from locations where the model was not trained.
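Given a table of predictions, these per-species metrics reduce to a confusion matrix. This is a minimal sketch; `preds`, with columns `truth`, `top1`, and `top5`, is a hypothetical data frame, and `truth` and `top1` are assumed to share the same factor levels:

```r
# Confusion matrix: rows are true species, columns are predicted species.
cm <- table(truth = preds$truth, pred = preds$top1)

recall    <- diag(cm) / rowSums(cm)  # TP / (TP + FN), per true species
precision <- diag(cm) / colSums(cm)  # TP / (TP + FP), per predicted species

# Top-5 recall, assuming `top5` is a list-column holding each image's five guesses:
top5_hit    <- mapply(function(g, t) t %in% g, preds$top5, preds$truth)
top5_recall <- tapply(top5_hit, preds$truth, mean)
```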
To evaluate the effect of using multiple training datasets on model generalizability, we iteratively trained models using varying numbers of datasets (i.e., 1 dataset, 3 datasets, 6 datasets, … all 18 datasets) and tested each model on the out-of-sample datasets.

MLWIC2 can classify images at a rate of 2,000 images per minute on a laptop with 16 gigabytes of random-access memory and without a graphics processing unit. MLWIC2 will optionally write the top guess from each model and the confidence associated with these guesses to the metadata of the original image file; the function "write_metadata" and the associated R Shiny Application use Exiftool (Harvey, 2016) to accomplish this. In addition, if scientists have labeled images, MLWIC2 has a Shiny app that allows users to train a new model to recognize species using one of six different convolutional neural network architectures (AlexNet, DenseNet, GoogLeNet, NiN, ResNet, and VGG) with different numbers of layers. We also trained models in these other architectures for comparison. Note that the time required to train a model depends on the number of images used for training and on computing resources; operating MLWIC2 on a high-performance computing cluster requires programming experience.
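A command-line session mirroring the Shiny apps might look like the following. This is a hedged sketch: the argument names for `classify()` and `write_metadata()` follow our reading of the package README and should be treated as assumptions, as should the 59-class count (58 species/groups plus one empty class) and all file paths:

```r
library(MLWIC2)

# Classify a folder of images with the trained species model
# (paths and argument names are illustrative assumptions).
classify(
  path_prefix = "/home/user/images",           # folder containing images
  data_info   = "/home/user/image_labels.csv", # file listing image names
  model_dir   = "/home/user/MLWIC2_helpers",   # location of the trained model files
  log_dir     = "species_model",               # use the 58-species model
  num_classes = 59                             # 58 species/groups + 1 empty class
)

# Optionally push each image's top guess and confidence into the image
# file's metadata via Exiftool (argument names assumed).
write_metadata(output_file = "MLWIC2_output.csv", model_type = "species_model")
```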
When we iteratively trained the model on varying numbers of datasets, we found that accuracy on out-of-sample images increased with the number of datasets used to train the model (Figure 3).

| DISCUSSION
In MLWIC2, we provide two trained machine learning models, one classifying species and another distinguishing between images with animals and those that are empty, with 97% accuracy, which can facilitate removal of empty images in datasets globally. For many research projects, the task of simply removing empty images can save thousands of hours of labor.

TABLE 2 Mean recall and precision rates (along with 95% confidence intervals) for predicting species using the species model on the validation dataset (the 10% of images that were withheld from training)
We propose a workflow for how users can apply these models to filter out empty images and train new models as necessary (Figure 4).
By providing Shiny Applications to train models and classify images, we make this technology accessible to more scientists with minimal programming experience. Our finding that high recall (>95%) can be achieved with fewer than 2,000 images for some species (Table 2; Figure 1) suggests that smaller labeled image datasets can potentially be used to train models with this software.
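Within that workflow, the empty-image filtering step might be scripted as follows. This is a hedged sketch; the output file name and its `guess` and `confidence` columns are hypothetical stand-ins for the model output format:

```r
library(dplyr)

out <- read.csv("MLWIC2_output.csv")  # hypothetical model output, one row per image

# Set aside images confidently classified as empty; keep animal images and
# low-confidence empties for manual review.
empty  <- out %>% filter(guess == "empty", confidence >= 0.95)
review <- out %>% filter(guess == "empty", confidence <  0.95)
animal <- out %>% filter(guess == "animal")
```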
Other researchers have developed models for recognizing animals in camera traps, with some success in out-of-sample identification. For example, Zilong software accurately removed 85% of empty images (Wei et al., 2020), MegaDetector had a precision of 89%–99% at detecting animals (Beery, Morris, & Yang, 2019), and MLWIC achieved an accuracy of 82% at out-of-sample species classification (Tabak et al., 2018, 2019). We hypothesize that our models performed well on some out-of-sample datasets (Snapshot Serengeti, Snapshot Karoo, Wellington, and Saskatchewan; Table 3) because they were trained using camera trap images from multiple locations with different camera placement protocols, allowing the model to develop a search image for each species in multiple backgrounds (Figure 3).

FIGURE 1 Within-sample validation of the species model revealed high recall and precision for most species. Median values across datasets are presented along with 95% confidence intervals. The number of datasets for each species is included in the circle next to the species name (circle sizes are proportional to the number of datasets containing each species)
Transferability of machine learning models remains a complication in applying these models more broadly to camera trap data, and, in many cases, it is most productive for scientists to build models that are trained directly on their study sites (see Figure 4 for more details). While such models will have less broad applicability (they are unlikely to be accurate globally), they can have high study-specific accuracies, thus reducing the burden of manual image classification. Our finding that models become more generalizable when more datasets are used to train the model (Figure 3) indicates that, by including more diverse datasets when we train future models, we may be able to train a model that is accurate in more locations.

| Future directions
As this new technology becomes more widely available, ecologists will need to decide how it will be applied in ecological analyses. For example, when using machine learning model output to design occupancy and abundance models, we can incorporate accuracy estimates that were generated when conducting model testing. Such models can also account for the tendency of some species to avoid detection by cameras when they are present (Tobler, Zúñiga Hartley, Carrillo-Percastegui, & Powell, 2015).
Another area in need of consideration is how to group taxa when few images are available for a species. When few images were available for model training, we generally grouped species using an arbitrary cutoff of approximately 1,000 images per group (Table 2).
Nevertheless, we had relatively few images of grizzly bears (Ursus arctos horribilis; n = 843), but we included this species because it is of conservation concern, and found high rates of recall and precision (99% for each). We grouped members of Mustelidae (Mustela erminea, Mustela frenata, unknown Mustela spp., Neovison spp., and Taxidea taxus) together, and this group had relatively low recall and precision (89% and 91%, respectively). When researchers develop new models and decide which species to include and which to group, they will need to consider the available data, the species or groups in their study, and the ecological question that the model will help address.

CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.

DATA AVAILABILITY STATEMENT
The trained models described in this work are available in the MLWIC2 package (https://github.com/mikeyEcology/MLWIC2).

Images used to train models are available in the North American Camera Trap Images dataset (lila.science/datasets/nacti). Data from validation tests are available from the Dryad Digital Repository.