Animal Scanner: Software for classifying humans, animals, and empty frames in camera trap images

Abstract Camera traps are a popular tool to sample animal populations because they are noninvasive, detect a variety of species, and can record many thousands of animal detections per deployment. Cameras are typically set to take bursts of multiple photographs for each detection and are deployed in arrays of dozens or hundreds of sites, often resulting in millions of photographs per study. The task of converting photographs to animal detection records from such large image collections is daunting, and made worse by situations that generate copious empty pictures from false triggers (e.g., camera malfunction or moving vegetation) or pictures of humans. We developed computer vision algorithms to detect and classify moving objects to aid the first step of camera trap image filtering: separating the animal detections from the empty frames and pictures of humans. Our new work couples foreground object segmentation through background subtraction with deep learning classification to provide a fast and accurate scheme for human–animal detection. We provide these programs as both a Matlab GUI and a command line interface developed with C++. The software reads folders of camera trap images and outputs images annotated with bounding boxes around moving objects, plus a text file summarizing the results. The software maintains high accuracy while reducing execution time 14-fold, taking about 6 s to process a sequence of ten frames (on a 2.6 GHz CPU). For cameras with excessive empty frames due to camera malfunction or blowing vegetation, the software automatically removes 54% of the false-trigger sequences without affecting the human/animal sequences. We achieve 99.58% accuracy on image-level empty-versus-object classification of the Serengeti dataset. We offer the first computer vision tool for processing camera trap images, providing substantial time savings for large image datasets and thus improving our ability to monitor wildlife across large scales with camera traps.


KEYWORDS
background subtraction, camera trap images, deep convolutional neural networks, human-animal detection, wildlife monitoring

| INTRODUCTION
Motion-sensitive wildlife cameras, commonly referred to as camera traps, are an increasingly popular survey tool for animal populations because they are noninvasive and increasingly easy to use (Kays, 2016).
Comparisons with other wildlife monitoring methods have shown camera traps to be the most effective and cost-efficient approach for many species (Bowler, Tobler, Endress, Gilmore, & Anderson, 2016). Ambitious projects are increasing the scale at which cameras are used on the landscape, now rotating hundreds of sensors across thousands of sites (Steenweg et al., 2016), sometimes with the assistance of citizen scientists (McShea, Forrester, Costello, He, & Kays, 2016). Deep convolutional neural networks (DCNNs) learn a feature hierarchy all the way from pixels to classifier, with training supervised by stochastic gradient descent (Lee, Xie, Gallagher, Zhang, & Tu, 2014). In this paper, the output of the classification layer is three scores, corresponding to the human, animal, and background classes.
Computer vision has the potential to offer an automated tool for processing camera trap images if it can, first, detect the moving object within the image by subtracting the background and, second, identify that object. Although these problems are solved for many indoor environments (Huang, Hsieh, & Yeh, 2015), the challenge is much greater with camera trap images because of their dynamic background scenes with waving trees, moving shadows, and sun spots.
Previous efforts to distinguish animals from background in camera trap images have focused on foreground detection. In general, foreground areas are selected through one of two means: pixel-by-pixel, in which an independent decision is made for each pixel, and region-based, in which a decision is made on an entire group of spatially close pixels (Dong, Wang, Xia, Liang, & Feng, 2016).
Unfortunately, the success of these efforts has been limited by large numbers of false positives and difficulty distinguishing between animal and human objects.

| METHODS
Our system (Figure 1) starts by detecting where the moving objects (human, animal, or moving vegetation) are within the images using a background subtraction method. Unlike many other image processing and vision analysis tasks, detecting and segmenting humans and animals in camera trap images is very challenging since natural scenes in the wild are often highly cluttered due to heavy vegetation and highly dynamic due to waving trees, moving shadows, and sun spots. Next, these moving objects are identified with classifiers to distinguish them as human, animal, or moving background. After describing these algorithms, we explain how we reduce the false positives (background patches mistakenly identified as animals or people) using cross-frame verification, and present a study of the complexity-accuracy trade-off of DCNNs to propose a fast and accurate scheme for human-animal classification. Finally, we describe our GUI and command line data input and output.

| Object region segmentation
The first step in processing camera trap images is to distinguish the moving objects in the foreground (aka foreground object proposals) from the fixed background. We rescale each frame from a given camera trap sequence to a specific width and height and then divide the rescaled image into 736 (or 32 × 23) regular blocks. We then extract features from each block. To determine which block(s) contain moving animals, we use the minimum feature distance (MFD) to all other co-located blocks. These features include intensity, Local Binary Pattern (LBP) (Ojala, Pietikäinen, & Mäenpää, 2001), Gray Level Co-occurrence Matrix (Baraldi & Parmiggiani, 1995), and Histogram of Oriented Gradient (HOG) (Dalal & Triggs, 2005). Given a sequence of 10 frames, each of the 736 blocks is compared with the nine co-located blocks in the other frames to find the background block, which has the MFD. Any block whose feature histogram difference from the co-located blocks is larger than the MFD is classified as a moving object (i.e., foreground). Our experiments found that HOG (Dalal & Triggs, 2005) is the feature vector that most efficiently represents the block information.
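As a concrete illustration, the block-wise minimum-feature-distance (MFD) test can be sketched as follows. This is a minimal sketch, not the released code: it uses plain intensity histograms as a stand-in for the HOG features the paper selects, and the distance threshold `thresh` is a hypothetical value.

```python
import numpy as np

def block_histograms(frames, grid=(23, 32), bins=16):
    """Split each grayscale frame into grid (rows x cols) blocks and
    compute a normalized intensity histogram per block (a stand-in for
    the HOG features used in the paper)."""
    rows, cols = grid
    hists = []
    for f in frames:
        h, w = f.shape
        bh, bw = h // rows, w // cols
        per_frame = np.empty((rows, cols, bins))
        for r in range(rows):
            for c in range(cols):
                block = f[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                hist, _ = np.histogram(block, bins=bins, range=(0, 256))
                per_frame[r, c] = hist / hist.sum()
        hists.append(per_frame)
    return np.stack(hists)  # shape: (n_frames, rows, cols, bins)

def foreground_mask(hists, frame_idx, thresh=0.2):
    """A block is foreground when even its closest co-located block in
    the other frames (the minimum feature distance, MFD) differs by
    more than the threshold, i.e., no other frame offers a matching
    background appearance for that block."""
    n, rows, cols, _ = hists.shape
    mask = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            ref = hists[frame_idx, r, c]
            mfd = min(np.abs(ref - hists[j, r, c]).sum()
                      for j in range(n) if j != frame_idx)
            mask[r, c] = mfd > thresh
    return mask
```

In this sketch, a static block matches its co-located blocks closely (MFD near zero), while a block containing a moving animal has no close match anywhere else in the sequence.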
We compare consecutive images in a sequence to find the moving object by subtracting feature histograms computed at the same region position in subsequent images. The regions with the highest differences are then connected contiguously to form the moving objects. The difference threshold should be robust enough to reduce the number of false alarms, yet sensitive enough to detect any animal or human as precisely as possible despite the challenges of camera trap images. Because some camera brands record only 3 images per trigger, we initially use information from three consecutive frames to find the moving object.
In a second method, we use all frames of the sequence to compose a background frame, and then subtract each frame's feature histograms from this composite background. After we subtract a given frame from the background frame, we apply a threshold value that determines whether each block belongs to the background or the foreground. The foreground blocks are then connected to represent the foreground region(s). These foreground regions are the region proposals which need to be verified as human, animal, or background in order to label them with tagged bounding boxes.

FIGURE 1 Flow chart of the proposed system. In the training stage, we generate the training patches to train the classification model. In the detection stage, we use joint background modeling with the pre-trained classification model from the training stage to detect the human and animal objects.
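The composite-background variant can be sketched in the same block-histogram setting. The text does not specify the composition operator, so a per-block median across the sequence is assumed here; `thresh` is likewise a hypothetical value.

```python
import numpy as np

def composite_background(hists):
    """Per-block median histogram across all frames of the sequence
    (an assumed composition operator; hists has shape
    (n_frames, rows, cols, bins))."""
    return np.median(hists, axis=0)

def segment_frame(hists, frame_idx, thresh=0.2):
    """Subtract a frame's block histograms from the composite
    background; blocks whose total difference exceeds the threshold
    are foreground, and their block-grid bounding box is returned."""
    diff = np.abs(hists[frame_idx] - composite_background(hists)).sum(axis=-1)
    mask = diff > thresh
    if not mask.any():
        return mask, None
    rs, cs = np.where(mask)
    return mask, (rs.min(), cs.min(), rs.max(), cs.max())
```

The bounding box here is in block-grid coordinates; scaling it back to pixel coordinates is a matter of multiplying by the block width and height.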

| Region proposals verification
Before we proceed to the final step of identifying the moving object as an animal or person, we first verify the region proposals to determine whether they come from the foreground or the background. We observe that some of the false positive foregrounds generated by background subtraction are caused by intensity changes within the same sequence. We therefore compute an SHL value over the intensity histogram of each patch, where L is the total number of intensity levels in the patch. Figure 2 shows how the false alarms are detected through the SHL value.
Region verification through SHL requires less than 50 ms for each patch.

| Foreground proposals classification
After finding objects within a given frame (aka proposals), the next step is to classify them as human, animal, or background. We created a training dataset of images from these three classes (human/animal/background) by cropping rectangular regions (aka patches) from 459,427 camera trap images and manually labeling them.
These images were all from Reconyx or Bushnell brand cameras and included color and black/white pictures, mainly at two image resolutions: 1,024 × 1,536 pixels and 1,920 × 2,048 pixels. The original images come from three countries (Panama, Netherlands, and USA) and thus represent a great variety of types of animals and people. We use this dataset to train and test three different classifiers: Bag of visual words (BOW) (Fei-Fei & Perona, 2005), AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), and our own DCNN model (AlexNet-96). The input image size is 256 × 256 pixels for AlexNet and BOW, and 96 × 96 pixels for AlexNet-96. Our software can accept camera trap images of any size (e.g., 1,024 × 1,536 and 1,920 × 2,048). It should be noted that cropped region proposals (image patches), which have different sizes and aspect ratios, need to be rescaled to match the classifier's required input size. The training set is completely separate from the testing data, and both include randomly chosen sequences covering different camera trap circumstances, including color and black/white, trail road, grass, and top-tree images. The training and testing datasets each contain 30,000 image patches, 10,000 per class.
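Rescaling arbitrary-size patches to the fixed classifier input can be done with any image library; a dependency-free nearest-neighbour version is sketched below. As in the text, aspect ratio is not preserved.

```python
import numpy as np

def resize_patch(patch, size=(96, 96)):
    """Nearest-neighbour rescale of an arbitrary-size crop to a fixed
    classifier input (96x96 for AlexNet-96). Works for grayscale
    (H, W) and color (H, W, 3) arrays; aspect ratio is stretched."""
    h, w = patch.shape[:2]
    out_h, out_w = size
    # Map each output row/column back to its nearest source index.
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]
```

For training a real model, a library resampler with anti-aliasing (e.g., bilinear interpolation) would normally be preferred; the point here is only the shape contract between the proposal stage and the classifier.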
We evaluate the performance of our human-animal detection method on 200 camera trap sequences, each consisting of 10 images. We manually labeled all animals and persons with bounding boxes in these 2,000 images. To evaluate detection performance, we compared each segmented output patch with the manually labeled patches using intersection over union (IoU).
We consider our classifier as accurate (true positive = TP) if the patch has an IoU ≥ 0.5 and is classified correctly (person, animal).
Any background patch classified as human or animal, or any detection with IoU < 0.5, is counted as a false positive.
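The IoU test described above can be written directly; boxes are given as (x1, y1, x2, y2) corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, pred_label, gt_box, gt_label):
    """A detection counts as a true positive when IoU >= 0.5 and the
    predicted class (person or animal) matches the ground truth."""
    return iou(pred_box, gt_box) >= 0.5 and pred_label == gt_label
```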

| Software
We implemented our algorithms in two forms, a graphical user interface and command line, to facilitate use by camera trappers and also to make the individual components available to other computer programmers who want to modify it or incorporate it into other software.

| Graphic user interface
We have packaged our algorithms with a user-friendly graphical user interface to allow ecologists to easily use our algorithms without detailed programming knowledge.

| Command line interface
We have developed a C/C++ command line interface program for fast human-animal detection. The only input arguments required to run this program are the program name and an input file that lists all the images of a sequence. Running the program on a batch of sequences requires a single text file containing the names of the sequence list files. For example, with 10 sequences there are 11 files: 10 files, each listing the names and paths of the images in one sequence, and one file listing the names and paths of those 10 files.
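This batch input layout can be generated with a short script. The file and image names below are invented for illustration; the actual program name and image paths depend on your deployment.

```python
import os
import tempfile

def write_sequence_lists(sequences, out_dir):
    """Write one text file per sequence (one image path per line) plus
    a master file listing the per-sequence files, matching the batch
    input layout expected by the command line program."""
    seq_files = []
    for i, images in enumerate(sequences, start=1):
        path = os.path.join(out_dir, f"sequence_{i:02d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(images) + "\n")
        seq_files.append(path)
    master = os.path.join(out_dir, "all_sequences.txt")
    with open(master, "w") as f:
        f.write("\n".join(seq_files) + "\n")
    return master
```

The returned master file is what would be passed to the command line program when processing a batch of sequences.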

| Fast DCNN analysis
We studied the relationship between complexity and classification accuracy by gradually reducing the input size of each image and changing the number of filters. Reducing the input size from 256 × 256 pixels to 96 × 96 pixels had little effect on classification accuracy (Figure 5a,b), although lower-resolution pictures were less accurate. However, this reduces the complexity (and thus processing time) by 10 times, with a relatively small loss of classification accuracy (2.2%). Figure 5c shows the complexity analysis associated with reducing the number of filters in each convolutional layer of the 96 × 96 input AlexNet (AlexNet-96) (Yousif, Yuan, Kays, & He, 2017b).

| Object detection evaluation
Our proposed background modeling outperforms other published alternatives in both recall and precision (Table 3), and works even with difficult images typical of camera trapping (Figure 6). In Table 4, we compare our detection results with the other state-of-the-art methods.

TABLE 3 Performance comparison on background subtraction in the Camera Trap dataset with other methods

| Sequence-level evaluation
Although our algorithm evaluates individual images, this information can be pooled across sequential frames to classify the contents of a sequence and then remove the empty sequences and people. We classify a camera trap sequence as (a) background when no human or animal is detected, (b) human when all detected objects are humans, and (c) animal if any animal is detected. We evaluated the performance of our detection method based on sequence labeling using six deployments that reflect different camera circumstances that often result in many non-animal pictures (Table 5).
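This sequence-labeling rule transcribes directly to code; each frame contributes a (possibly empty) list of detected object labels.

```python
def label_sequence(frame_detections):
    """Pool per-frame detections into one sequence label: 'animal' if
    any animal is detected, 'human' if all detected objects are
    humans, and 'background' when nothing is detected at all."""
    labels = [lab for frame in frame_detections for lab in frame]
    if not labels:
        return "background"
    if "animal" in labels:
        return "animal"
    return "human"  # objects were detected and all of them are human
```

Sequences labeled 'background' or 'human' can then be filtered out before the remaining animal sequences are reviewed.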
We evaluate the performance using three different metrics: recall, true negative rate (TNR), and false negative rate (FNR).

| Image-level classification evaluation
For this task, we use the camera trap images from the Snapshot Serengeti project and have included these with the GUI. The Snapshot Serengeti project is a study with 225 camera traps running continuously in Serengeti National Park, Tanzania, since 2010 (Norouzzadeh et al., 2018; Swanson et al., 2015).

| CONCLUSION AND FUTURE WORK
With the growing reliance on camera traps for wildlife research, there is increasing interest in developing computer vision tools to overcome the challenges associated with big data projects. Our tool offers an important advance in this effort by helping biologists remove useless images. This is a time-consuming task, especially acute in grassy or canopy habitats where 98% or more of the pictures contain only moving vegetation. The removal of humans is also useful for busy hiking trails where they make up the majority of pictures. Our process for automatically identifying people in pictures could also aid in situations where the privacy of photographed people is a concern, or in educational programs where schoolchildren run camera traps and look through the pictures.
Our model works on both color and infrared photos and has been trained and tested with difficult and challenging images.

Note: High values of recall and TNR indicate better object detection and specificity, respectively, while low values of FNR indicate fewer objects misclassified as background.
FIGURE 7 Detailed confusion matrices for six deployments; see Table 5. Row a shows the correct identification of an animal (purple bounding box) toward the back of a scene. Rows b and c correctly identify the moving objects as background grass and leaves, respectively, showing no bounding boxes. Row d shows four frames correctly identified as human (yellow bounding box), with the fifth mistakenly classifying the object as an animal. Row e shows moving grass that was classified as an animal and a human. Row f shows an animal that was not detected because it did not move during the sequence.

Other state-of-the-art detection methods, such as RRC (Ren et al., 2017), use only single-frame information to find the target object, which makes them inefficient for the highly cluttered scenes of camera trap images (Norouzzadeh et al., 2018). Our sequence-level background subtraction is an effective approach to localizing the moving objects. Most recent papers on camera trap images aim to classify the whole image using available DCNN models. Image-level classification is unable to (a) localize the object, (b) differentiate between an image containing a small animal and a background image, and (c) classify an image with multiple objects.
This work introduces near real-time software with outstanding performance compared with the other state-of-the-art methods. In unpublished work, we have improved the performance further, but that model cannot yet run on CPU-only computers. Work is ongoing to deploy this software, together with animal species classification for eastern North American camera trap images, as a cloud service.

ACKNOWLEDGMENTS
This work has been supported by the National Science Foundation under grant CyberSEES-1539389.

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
Hayder Yousif constructed the main ideas of the research, carried out most experiments, and drafted the original manuscript. Jianhe Yuan performed the DCNN training and testing. Roland Kays and Zhihai He offered useful suggestions for improving the accuracy and revising the manuscript.

DATA ACCESSIBILITY
The software used in this paper has been archived on figshare (https://figshare.com/s/cfc1070ca5a9bdda4cd8).