Automatic counting of birds in a bird deterrence field trial

Abstract

Decreasing costs of high-quality digital cameras, image processing, and digital storage allow researchers to generate and store massive amounts of digital imagery. The time needed to manually analyze these images will always be a limiting factor for experimental design and analysis. Implementing computer vision algorithms to automate the detection and counting of animals reduces the manpower needed to analyze field images. In this paper, we assess the ability of computer vision to detect and count birds in images from a field test that was not designed for computer vision. Using video stills from the field test and Matlab's Computer Vision Toolbox, we designed and evaluated a cascade object detection method employing Haar and Local Binary Pattern feature types. Without editing the images, we found that the Haar feature can achieve a recall over 0.5 at an Intersection over Union threshold of 0.5. However, using this feature, 86% of the frames without birds had false-positive bird detections. Reducing the false positives could allow these detection methods to be incorporated into a fully automated system for detecting and counting birds. Accurately detecting and counting birds using computer vision will reduce the manpower needed for field experiments, both in experimental design and in data analysis. Improvements in automated detection and counting will allow researchers to design extended trials without the added step of optimizing the experimental setup and/or captured images for computer vision.


… of future field testing by reducing the manpower needed to analyze the data.
Computer vision is an increasingly useful tool for environmental and biological data. Advancements in computer processing, imaging quality, and the availability of large data sets containing thousands of labeled images, such as iNaturalist (Van Horn et al., 2017), NABirds (Van Horn et al., 2015), and Caltech-UCSD Birds 200 (Wah, Branson, Welinder, Perona, & Belongie, 2011), have made it possible to create and test computer vision schemes for a variety of applications (Weinstein, 2018). Some applications include the use of satellite and digital imagery for analysis of land coverage and sediment profiles, monitoring plant phenology throughout the year, and automatic identification and tracking of animals (Brown et al., 2016; Burton et al., 2015; Chabot & Francis, 2016; Gauci, Abela, Austad, Cassar, & Adami, 2018; O'Connell & Merryl, 2016; Romero-Ramirez, Grémare, Desmalades, & Duchêne, 2012; Weinstein, 2018).
Computer vision techniques typically include two stages: feature extraction and classification (Wäldchen & Mäder, 2018). The feature extraction stage uses a set of training images to train an algorithm with a specific feature or series of features. There are a variety of feature types available for computer vision, often categorized into spectral, spatial, and temporal features (Bouwmans et al., 2018).
After the feature extraction stage, the classification stage determines if each image in a testing set contains the object of interest.
Detecting animals in their natural environment, in most cases, is trivial for humans, but the finite attention span of human investigators will always limit the amount of imagery that can be analyzed.
Computer vision can reduce the amount of human analysis needed, but detecting patterns and determining the foreground of a complex image is a nontrivial task for computers. The backgrounds of fieldwork images are generally cluttered and highly variable due to wind and lighting changes. Furthermore, the animal of interest often has little contrast with the background. To overcome these difficulties, testing conditions are often constrained. Examples include limiting the types of images used for training, focusing on species with high contrast from the background, or imposing a background that reduces clutter in the image. A common way of constraining the data is by using images where the animal comprises the majority of the pixels. This is common for studies that use training data sets to develop species identification techniques. The high-quality labeled images that allow for detection of fine-grain details between and within species are often zoomed in such that there is little background in the image (Berg et al., 2014). Other studies have selected species of interest because they have high contrast with the background, for example, Snowy Egrets and White Pelicans against dark backgrounds (Bohn, Möhringer, Kőrösi, & Hovestadt, 2015; Huang, Boom, & Fisher, 2015; Nadimpalli, Price, Hall, & Bomma, 2006).

FIGURE 1 These images show some of the commercial installations of SonicNets. These installations include a directional system installed on the roof of a strip mall (a), a custom system in the superstructure of a coal power plant (b), an omnidirectional system installed on the roof of a meat processing plant (c), an omnidirectional system installed at a catfish farm (d), and an omnidirectional system installed in a plant nursery (e). Each of these systems was successful at deterring birds from the targeted region, but only qualitative and anecdotal data are available. All images were provided by Midstream Technology.
Another way to make computer vision more effective is by decreasing the possible orientations of the animals, often accomplished by imaging the animals from above (Abd-Elrahman, Pearlstine, & Percival, 2005; Chabot & Francis, 2016; Jalil, Smith, & Green, 2018; Mammeri, Zhou, & Boukerche, 2016; Marti-Puig et al., 2018; Pérez-Escudero, Vicente-Page, Hinz, Arganda, & Polavieja, 2014; Stern, Zhu, He, & Yang, 2015). Constraining the images in these ways can be extremely helpful for computer vision, but limits the range of possible experimental designs for understanding animal behavior.
Optimizing computer automated detection for real-world conditions will reduce the manpower needed for data analysis in a wide variety of useful experiments. To demonstrate this possibility, we trained a computer automated detection method to detect and count birds in video frames from a field test intended to be analyzed by humans. We used a cascade object detector based on the Viola–Jones algorithm, originally developed to detect human faces (Viola & Jones, 2001). We gathered over three million images and analyzed approximately twenty thousand. Of the images analyzed, birds were present in only 2,555 frames. While there are many computer vision techniques available, we wanted to determine the effectiveness of this algorithm for this application due to its speed and robustness.
Implementing an accurate bird detection algorithm will allow real-time automated bird counting to be incorporated into a wide range of future projects, saving time and manpower.

| Field trial design
For this study, we used video stills from a field experiment conducted to replicate an aviary experiment in an open environment (Mahjoub et al., 2015). We used an acoustic bird deterrent that leverages the understanding of avian communication to design noise fields that make auditory stream segregation difficult for birds (Dent, Martin, Flaherty, & Neilans, 2016). This encourages the birds to relocate without hazing or harming the animal. The field test ran for 33 days. Two days were removed from the automatic detection testing set because multiple hours of video files were missing from the raw data. This field test resulted in over 3,800,000 similar images (Figure 2), intended to be analyzed manually to determine the presence of birds.
To create the testing set of images, we captured one video frame per minute, then clipped the videos to include the 357 min from 9:53 a.m. to 3:51 p.m. These were the times when no humans were present at the field test site and all the equipment was functioning correctly. This resulted in 714 images from each day of testing, for a total of 22,134 video frames analyzed.
The automated detection algorithms were trained using video stills from the cameras focused on the tops of the food tables. The setup of this system is similar to that of a baited camera trap: there is little variation in the background from image to image, and the birds can move freely in and out of frame. In contrast to a camera trap, we recorded video continuously rather than only when an animal triggered the camera. This resulted in thousands of images without birds. For our analysis, we look at table1B and table2B independently, which allows us to account for any difference in the field of view of the cameras. The camera placement also adds to the complexity for computer vision in this application: since it was not optimized for computer vision, the birds take up only a small area of the frame, appear in a variety of orientations, and sometimes blend into the background.
Introducing the food tables into the images reduced some of the background clutter, but also introduced some complexity into the images in the form of reflections and shadows. The reflective nature of the aluminum food tables frequently created clear reflections of similar shape, size, and color saturation to the real bird. The center of Figure 3 shows such a reflection created by a Mourning Dove.
In addition to reflections, the birds and birdseed containers often create shadows on the food tables. Most of the birds that visited the food tables were Brown-headed Cowbirds, like the one seen inside the birdseed container in Figure 3. These birds have high contrast to the food table and background of the image, but have similar color saturation and tone to the shadows.
The birds we were interested in for this application are diurnal, so imaging was only necessary during daylight hours. Although we imaged during daylight hours, the weather can change dramatically throughout the day in the summer in Virginia, creating significant lighting changes. Many of these complexities could have been avoided if the experiment had been designed to use computer vision. Even so, the images have some visual advantages compared with other studies, including daylight imaging, a consistent background, imaging mostly parallel to the ground, and a reduction in background clutter from the food tables. However, these advantages are not sufficient to overcome the disadvantages and do not make computer automated detection of birds in the images a trivial task.

| Automated counting design
To detect and count the birds in the images, we used the Matlab Computer Vision Toolbox cascade object detector. This detector uses the Viola–Jones algorithm, which has been shown to be faster and to require fewer training images than other cascade object detectors (Reese, Zheng, & Elmaghraby, 2012). We chose this algorithm over a deep learning technique such as Faster R-CNN or YOLO due to the small number of positive images in our data set. Training a neural network requires significantly more positive training images than a cascade object detector (Bowley, Andes, Ellis-Felege, & Desell, 2017). We determined that reserving enough positive images to train a neural network would have left the testing set too small. Another computer vision detection method that could be implemented for this application is background subtraction. However, we decided that the numerous background changes that occurred throughout the trial, including camera positioning, lighting variation, birdseed container placement, movement due to wind and rain, and reflections on the food table, would have reduced the effectiveness of this method. We also believe that the speed of the Viola–Jones algorithm will allow for real-time bird detection in future projects.

FIGURE 5 This figure demonstrates some of the complexity created by the positioning of the camera for this field experiment. Each image was created by cropping the full frame to just the area around the bird, and the birds were not resized. These images show the inconsistent orientation of the birds. They also show that the number of pixels needed to show the entire bird varied from frame to frame and bird to bird. The images also show that the birdseed containers often obscure parts of the birds.
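The attentional-cascade idea behind the Viola–Jones detector can be sketched in a few lines: each stage applies a cheap test, and a window is accepted only if it passes every stage, so most background windows are rejected after only a few evaluations. The stages and thresholds below are toy placeholders for illustration, not the Toolbox's actual classifiers.

```python
# Illustrative sketch of the attentional cascade: each stage is a cheap
# test; a window must pass every stage to be accepted, so most background
# windows are rejected early. Stage functions and thresholds are toys.

def cascade_detect(stage_fns, window, stage_thresholds):
    """Return True only if `window` passes every stage in order."""
    for score_fn, threshold in zip(stage_fns, stage_thresholds):
        if score_fn(window) < threshold:
            return False  # early rejection: later (costlier) stages are skipped
    return True

# Toy stages: mean brightness first, then contrast (range of values).
stages = [
    lambda w: sum(w) / len(w),   # stage 1: cheap brightness test
    lambda w: max(w) - min(w),   # stage 2: costlier contrast test
]
thresholds = [50, 80]

bright_varied = [10, 200, 30, 180]  # passes both stages
dark_flat = [5, 6, 5, 7]            # rejected immediately by stage 1

print(cascade_detect(stages, bright_varied, thresholds))  # True
print(cascade_detect(stages, dark_flat, thresholds))      # False
```

The early-rejection structure is what makes the algorithm fast enough for the real-time use discussed above: the vast majority of birdless windows never reach the expensive later stages.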
We first determined the testing set of images from the video files.
Most of the images analyzed did not contain birds. Of the 22,134 video frames analyzed in our testing set, only 2,555 (11%) had at least one bird present. We trained the detector separately with two feature types, Haar and Local Binary Pattern (LBP); other common feature types, such as the Histogram of Oriented Gradients, are also available (Dalal & Triggs, 2005). The Haar feature compares color intensity between rectangular regions in the images. The sum of the pixels in one region is subtracted from the sum of the pixels in the next region. This type of feature gets more complex in each stage by using smaller regions and comparing different orientations and numbers of rectangles (Viola & Jones, 2001). The LBP feature converts the images into a histogram of grayscale values. This is done by comparing a central pixel in a region to the surrounding pixels. Each surrounding pixel is assigned a 0 or 1 depending on the difference between it and the center pixel. This method is often used in identifying textures in images (Ojala, Pietikainen, & Maenpaa, 2002).
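As a rough illustration of the two feature types (a sketch over a toy 4×3 image, not the Toolbox's implementation), a two-rectangle Haar feature can be computed cheaply from an integral image, and an LBP code compares one pixel with its eight neighbors:

```python
# Toy sketches of the two feature types: a two-rectangle Haar feature
# computed via an integral image, and an 8-neighbor LBP code.

def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle (x0,y0)-(x1,y1)."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total

def lbp_code(img, x, y):
    """8-bit LBP: 1 where a neighbor >= the center pixel, clockwise bits."""
    c = img[y][x]
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                 (1, 1), (1, 0), (1, -1), (0, -1)]  # (dy, dx)
    code = 0
    for dy, dx in neighbors:
        code = (code << 1) | (1 if img[y + dy][x + dx] >= c else 0)
    return code

img = [
    [10, 10, 90, 90],
    [10, 10, 90, 90],
    [10, 10, 90, 90],
]
ii = integral_image(img)
# Haar two-rectangle feature: bright right half minus dark left half.
haar = rect_sum(ii, 2, 0, 3, 2) - rect_sum(ii, 0, 0, 1, 2)
print(haar)                  # 480 (= 6*90 - 6*10)
print(lbp_code(img, 2, 1))   # 124: three dark neighbors give the zero bits
```

The integral image is what lets the cascade evaluate thousands of Haar features per window in constant time per feature, regardless of rectangle size.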

| RESULTS
For this paper, we wanted to determine the feasibility of using computer vision to detect birds in an image from an experiment that was designed to be analyzed by humans. Using a cascade object detector trained separately with a Haar feature and LBP feature, we first wanted to determine its ability to detect birds without editing the images. Figure 6 shows the same frame of table2B with one bird present. The top image is the detection using the LBP feature and the bottom using the Haar feature. The yellow rectangles labeled "Bird" are the bounding boxes where birds were detected. In both of these images, the object detector appears to correctly identify the bird, while also flagging areas without birds.
To determine the success of our model, we calculated precision, recall, F-Measure, false-negative rate (FNR), and false alarm rate (FAR) for each detector and food table with an Intersection over Union (IoU) threshold of 0.5, as seen in Table 2. Precision is the fraction of detections that match a ground truth, TP/(TP + FP). In each case, the precision value for the full images was very low.
The highest precision on the full images is 0.016, using the LBP feature detector on table1B. The precision for the LBP feature detector was 0.012 for table2B. The Haar feature detector had slightly lower precision: 0.01 for table1B and 0.003 for table2B. The recall for these detectors is much higher than the precision, between 0.29 and 0.54. Recall is a measure of how well the ground truths are identified, calculated as TP/(TP + FN). The recall values seen in Table 2 show that the object detectors correctly detect each bird in the frame about half the time. The F-Measure, 2 × precision × recall/(precision + recall), is an estimate of the accuracy of the system. Table 2 shows that the highest F-Measure using the full images was 0.083, for the LBP feature detector on table1B. We also calculated the FNR, FN/(TP + FN), a metric of how likely it is that a bird will be missed.
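These metrics follow their conventional definitions; a small sketch with hypothetical counts (not the values behind Table 2) makes the relationships between them concrete:

```python
# Conventional detection metrics from true-positive (tp), false-positive
# (fp), and false-negative (fn) counts. The counts below are hypothetical.

def detection_metrics(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    fnr = fn / (tp + fn) if tp + fn else 0.0  # chance a bird is missed
    far = fp / (tp + fp) if tp + fp else 0.0  # chance a detection is false
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "fnr": fnr, "far": far}

# Hypothetical counts for illustration only:
m = detection_metrics(tp=50, fp=4950, fn=50)
print(round(m["precision"], 3))  # 0.01
print(round(m["recall"], 2))     # 0.5
print(round(m["far"], 2))        # 0.99
```

Note that precision and FAR sum to 1, so a detector that flags background clutter in nearly every frame can still have moderate recall while its precision collapses, which is exactly the pattern in Table 2.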
The FAR is the number of false positives relative to the total identifications by the model, FP/(TP + FP), and is a metric of how likely it is that a detection by the model is a false positive. Table 2 shows that the FAR for each detector was high; many of the false positives occurred in the clouds seen in Figure 6. To reduce these false positives, we cropped the images to contain just the area surrounding the food table, then ran the feature detector over that image. Figure 9 shows the cropped versions of the video frame used in Figure 6. Cropping the image reduced the number of false positives. Cropping the images also reduced the total number of frames without birds that had false positives, as seen in Table 3. Reducing the number of false-positive detections resulted in an increase in precision using the cropped images, as seen in Table 2. Table 3 also shows the number and percent of frames with a true-positive detection when a bird was on the food table using an IoU threshold of 0.5. From Table 3, it can be seen that the LBP feature detector had fewer frames with true positives than the Haar feature detector. In Figure 6, it appears that both detectors correctly identify the bird in the frame; however, using an IoU threshold of 0.5, only the Haar feature detector has a true-positive detection.

FIGURE 6 Images of the same video frame captured by the camera pointed directly at table2B. The yellow rectangles labeled "Bird" in each of the images show the bounding boxes where birds were detected using the object detection algorithm. The top image used the LBP feature detector, and the bottom image used the Haar feature detector. In both of these images it appears the automated bird detection method correctly identified the bird, but using an IoU threshold of 0.5 only the Haar feature detector produced a true-positive detection. These images also show how weather and lighting affect the bird detection. The cloud coverage resulted in many false-positive detections using both features.
Figure 7 shows the zoomed-in area around the bird seen in Figure 6. While the precision increased by cropping the images, the recall slightly decreased. This can be seen in Table 2, where the recall for each detector on the two food tables decreased on average by 0.012. The recall decreased because the number of true-positive detections slightly decreased when the images were cropped. Table 3 shows the number of frames with true-positive detections using an IoU threshold of 0.5; cropping the images decreased the number of frames with true positives, which reflects the decrease in recall. The frequent false-positive detections caused by the object detector omitting parts of the birds led us to explore smaller IoU thresholds. We found that, by reducing the IoU threshold to 0.35, the recall value increased to over 0.6 in all cases but table2B with the LBP detector, and the precision increased by between 0.001 and 0.02. Figure 11 shows the precision-recall plots for each detector type. The plot on the left is for the Haar feature detector, and the plot on the right is for the LBP feature detector. Each line is labeled with the food table and the size of the image. The dots represent the IoU thresholds: the dot closest to the origin is the IoU threshold of 0.5, and the threshold decreases in increments of 0.05 from left to right. This figure shows that in all cases both the precision and recall increase as the IoU threshold decreases.
Note that the recall values are sometimes higher than 1 because the detector correctly identified the bird in more than one bounding box. Figure 11 shows that the Haar feature detector has higher recall than the LBP feature detector for each food table and image size, but the LBP detector has higher precision. This is because the Haar feature detector more accurately detected the birds while the LBP feature detector had fewer false-positive detections per frame. This figure also shows that reducing the false positives created by background clutter via cropping the images increased the precision for every detector on both food tables. There is a slight reduction in recall from cropping the images. This is because reducing the amount of space between the edge of the photograph and the birds reduced the number of true-positive identifications by each detector.
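The effect of the IoU threshold described above can be sketched directly. The box coordinates here are hypothetical, in (x, y, width, height) form, and illustrate how a detection that omits part of a bird (for example, the portion inside a birdseed container) falls below the 0.5 threshold but survives at 0.35:

```python
# Intersection over Union of two axis-aligned boxes given as
# (x, y, width, height). Coordinates below are hypothetical.

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

ground_truth = (0, 0, 10, 10)
partial = (0, 0, 10, 6)   # detection missing the bird's lower portion
tighter = (0, 0, 10, 4)   # detection missing even more of the bird

print(iou(ground_truth, partial))  # 0.6 -> true positive at threshold 0.5
print(iou(ground_truth, tighter))  # 0.4 -> rejected at 0.5, kept at 0.35
```

Because a box covering only the visible part of a bird shrinks the intersection without shrinking the ground truth, even visually "correct" detections can be scored as false positives at strict thresholds.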

| DISCUSSION
Finite human attention span will always be a factor in visual analysis of video files, limiting the amount of data that can be collected and analyzed. Creating an automated detection system can reduce the amount of manpower needed, extend the feasible length of future experiments, allow for real-time detection, and allow a single investigator to run multiple experiments simultaneously. For this study, we focused on the effectiveness of using a cascade object detector for automated bird detection to determine the presence of birds on food tables. We have shown that this technology can be used to detect bird presence, although improvements will need to occur before the system can be fully automated.
While this technology is not yet suited to replace human counting, combining automated detection with human observers could reduce the time and the number of images that need to be examined manually. Possible schemes for this include setting a baseline of false-positive detections to reduce the total number of images a researcher needs to view, or using the automated detector to alert a researcher to take a second look at a specific area within an image.

FIGURE 8 Images of the same video frame from the camera pointed directly at table1B with no birds in the frame. The yellow rectangles labeled "Bird" in each image show the bounding boxes where birds were detected using the object detection algorithm. Both feature detection methods had false positives. In contrast to Figure 6, these false-positive detections occurred on the food table and were mostly due to reflections and shadows. This shows that lighting changes affect the precision of the object detector.
Reducing the amount of human analysis opens up many possibilities for further understanding of birds in natural environments.
This testing also provided us with some insights for designing future SonicNets experiments. On the positive side, the Audio Spotlight speaker, designed for indoor use, remained functional after running at 100% duty cycle for 8 hr a day over 33 days of the humid Virginia summer. However, we also confirmed that more birds must be present to quantify the effectiveness of the acoustic deterrent. Designing and verifying the accuracy of the computer automated detection techniques gave us accurate counts of the birds visiting the food tables each day, but we found no statistically significant impact of the bird deterrent because very few birds visited the food tables, and never in groups larger than five. This field test had been designed to replicate earlier aviary experiments (Mahjoub et al., 2015), but no flocks consistently visited either food table. Nevertheless, we were able to show that an automated computer detection system can be used to detect birds in a field experiment not designed for computer vision.

| CONCLUSION
The decreasing cost of cameras, image processing, and digital storage allows researchers to generate and store massive amounts of digital imagery. The time needed to manually analyze these images will always be a limiting factor for experimental design and analysis.

TABLE 3 Note: The table also shows the percent and number of frames with true-positive detections using an IoU threshold of 0.5 when birds were in the frame.
FIGURE 9 These are the images created by cropping the images in Figure 6 to include just the area surrounding table2B. The yellow rectangles labeled "Bird" in each of the images show the bounding boxes where birds were detected using the object detection algorithm. The top image used the LBP feature detector, and the bottom image used the Haar feature detector. Cropping these images reduced the total number of false positives in the frame by removing the text and clouds from the top of the image that were detected as birds in the full image. Both of these images appear to show positive detections of the bird, but the IoU in each case is <0.5.

Reducing the amount of human analysis opens up many possibilities for further understanding of birds in natural environments.
One way this technology could further our understanding of birds is by using imagery of convenience created by bird feeder cameras to track migratory behavior. The introduction of products like the Nest Hello doorbell camera (https://store.google.com/us/product/nest_hello_doorbell?hl=en-US) allows people to create enormous amounts of digital imagery that they are often willing to share freely. These cameras are inexpensive, and it is only a matter of time until like-minded people start using them to record their bird feeders.

FIGURE 10 These images are the same video frame of table1B. For both the full and cropped images, the left side shows the detections using the LBP feature detector and the right side the Haar feature detector. The yellow rectangles labeled "Bird" are the bounding boxes where birds were detected using the object detection algorithm. In each image, only the portion of the bird outside of the birdseed container is inside the bounding box. This resulted in an IoU less than the 0.5 threshold in each case.

FIGURE 11 These plots show precision versus recall for both detectors on the full and cropped images. The left plot used the Haar detector and the right plot the LBP detector. On each line, the filled circle closest to the origin is the IoU threshold of 0.5, and the thresholds decrease in increments of 0.05 moving left to right. Each line is labeled with the food table and the size of the image used. From these plots, it can be seen that the LBP feature detector had consistently higher precision values than the Haar feature detector because it had fewer false positives. It can also be seen that the Haar feature detector had higher recall than the LBP feature detector, meaning that there were more true-positive detections using the Haar feature detector. In all cases, the precision increased by cropping the images to include just the area around the food table, due to the reduction in false positives created by clouds and background clutter in the image. The recall decreased slightly by cropping the images because the object detector was trained using the full images and therefore expected some amount of space between the bird and the edge of the image; cropping decreased this space and resulted in fewer true-positive detections of birds near the edges of the images.
Machine learning is also becoming more accessible to the general public, and a cascade object detector like the one used in this study would be straightforward to implement on this kind of imagery to detect the presence of birds at the feeders. Using this bird detection on connected cameras worldwide could give researchers real-time information on bird distribution. Adding species labels, either by humans or by improved automatic machine learning techniques, would increase our understanding of species migration. This technology could be used in coordination with programs like the Cornell Project FeederWatch (https://feederwatch.org/about/project-overview/) to create a robust amount of data not reliant solely on volunteer data collectors.
There are some simple steps that can be taken to make the images better suited for computer vision detection of birds. The first is to have a training set with more images of birds in the frame. The data used in this study were highly skewed toward images without birds, resulting in many false-positive detections. Recording only when motion is detected would significantly increase the proportion of frames with birds, allowing the positive training set to be much larger. Another way to improve the images for computer vision would be to image the birds in a way that reduces the range of bird orientations, background clutter, reflections, shadows, and obstructions.
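A motion-triggered scheme like the one suggested above could be as simple as frame differencing: keep a frame only when it differs enough from the previous one. This is a sketch with toy grayscale frames and a hypothetical threshold, not a production trigger:

```python
# Sketch of motion-triggered frame selection via frame differencing:
# keep a frame only when its mean absolute difference from the previous
# frame exceeds a threshold. Frames here are toy grayscale grids.

def mean_abs_diff(frame_a, frame_b):
    total = sum(abs(a - b)
                for row_a, row_b in zip(frame_a, frame_b)
                for a, b in zip(row_a, row_b))
    return total / (len(frame_a) * len(frame_a[0]))

def motion_filter(frames, threshold=5.0):
    """Return indices of frames that differ noticeably from the one before."""
    kept = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            kept.append(i)
    return kept

empty = [[100] * 4 for _ in range(4)]
bird = [row[:] for row in empty]
bird[1][1] = bird[1][2] = 0  # a dark bird-sized blob appears

frames = [empty, empty, bird, bird, empty]
print(motion_filter(frames))  # [2, 4] - the frames where the scene changed
```

In practice the threshold would need tuning against wind and lighting changes, which this field test showed can themselves mimic motion; even so, discarding unchanged frames would shrink the heavily birdless data set described above.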
Since the images were intended to be analyzed by human investigators, the positioning of the camera and the low camera angle to the food tables did not take any of these factors into account, creating a more difficult imaging scheme. If the field of view of the cameras includes areas the birds are not likely to visit, the simple step of cropping the images can improve the accuracy of the automated detection; by cropping the images in this data set, we increased the precision of the detector. All of these steps together could dramatically increase the ability of the algorithm to detect birds and could easily be applied to the images of convenience shared using Wi-Fi connected cameras.
The cost to acquire and store large amounts of video imagery continues to plummet. Crowdsourcing data could allow for data points throughout the world to better understand the distribution and migration of birds throughout the year. As technologies improve for automated analysis of images from field data, we can look forward to being able to easily re-analyze imagery previously acquired for some other purpose and perhaps filed away in a drawer (Rosenthal, 1979).

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
E. Simons collected and performed the data analysis, advised by M. Hinders. All authors contributed to the preparation of the manuscript.

DATA AVAILABILITY STATEMENT
The data set will be archived via the William & Mary Applied Science Department web pages at https://www.as.wm.edu/Nondestructive.html.