Measuring agreement among experts in classifying camera images of similar species

Abstract Camera trapping and solicitation of wildlife images through citizen science have become common tools in ecological research. Such studies collect many wildlife images for which correct species classification is crucial; even low misclassification rates can result in erroneous estimation of the geographic range or habitat use of a species, potentially hindering conservation or management efforts. However, some species are difficult to tell apart, making species classification challenging—but the literature on classification agreement rates among experts remains sparse. Here, we measure agreement among experts in distinguishing between images of two similar congeneric species, bobcats (Lynx rufus) and Canada lynx (Lynx canadensis). We asked experts to classify the species in selected images to test whether the season, background habitat, time of day, and the visible features of each animal (e.g., face, legs, tail) affected agreement among experts about the species in each image. Overall, experts had moderate agreement (Fleiss’ kappa = 0.64), but experts had varying levels of agreement depending on these image characteristics. Most images (71%) had ≥1 expert classification of “unknown,” and many images (39%) had some experts classify the image as “bobcat” while others classified it as “lynx.” Further, experts were inconsistent even with themselves, changing their classifications of numerous images when they were asked to reclassify the same images months later. These results suggest that classification of images by a single expert is unreliable for similar‐looking species. Most of the images did obtain a clear majority classification from the experts, although we emphasize that even majority classifications may be incorrect. We recommend that researchers using wildlife images consult multiple species experts to increase confidence in their image classifications of similar sympatric species. 
Still, when the presence of a species with similar sympatrics must be established conclusively, physical or genetic evidence should be required.


| INTRODUCTION
Ecological research is experiencing an explosion in the use of wildlife imagery. Camera trapping has become a common noninvasive survey technique (Burton et al., 2015; O'Connell, Nichols, & Karanth, 2011; Rowcliffe & Carbone, 2008), especially for rare and elusive forest-dwelling species (Furnas, Landers, Callas, & Matthews, 2017; Stewart et al., 2016), and has been used to obtain crucial ecological information (Caravaggi et al., 2017). Landscape-scale camera grids or transects are increasing across the globe (McShea, Forrester, Costello, He, & Kays, 2016), and such sampling may be used to monitor global biodiversity in the future (Steenweg et al., 2017). For example, the project Snapshot Wisconsin currently has over 1,000 registered volunteers maintaining over 1,200 remote cameras and has collected over 22 million images since it was established. Similarly, numerous websites and mobile phone applications encourage people to submit wildlife images for the purpose of assessing species' distributions. For example, the United Kingdom Mammal Tracker application allows the general public to submit geo-located images of 39 wildlife species (Mammal Watch South East, 2018).
Such camera networks and image-solicitation projects can collect substantial data across broad scales, but the data may be of limited utility because of the need to classify the animals that the images contain (Newey et al., 2015; Wearn & Glover-Kapfer, 2017). Researchers are typically interested in classifying each animal to the species level and in many cases even to individuals (Rich et al., 2014; Weingarth et al., 2012). However, classifying images is difficult when they are blurry, taken in poor lighting, show only part of the animal, or when only one image is available for a given animal (Meek, Vernes, & Falzon, 2013).
Further, even high-quality images may be difficult to classify if the species has similar sympatrics (Swanson, Kosmala, Lintott, & Packer, 2016; Yu et al., 2013), especially if classifiers have a bias toward one sympatric species over another, perhaps based on the location or background habitat of an image. For example, rare species can have higher false-positive and false-negative errors than common species (McKelvey, Aubry, & Schwartz, 2008; Swanson et al., 2016). Similar concerns have also been raised for classification of acoustic records for groups such as bats, cetaceans, amphibians, and birds (Chambert, Waddle, Miller, Walls, & Nichols, 2017). Correct species classification is crucial; even low misclassification rates can lead to significant over- or underestimation of the occupancy, habitat preferences, or distribution of a species (Costa, Foody, Jiménez, & Silva, 2015; Miller et al., 2011; Molinari-Jobin et al., 2012; Royle & Link, 2006), which could hinder conservation efforts (McKelvey et al., 2008).
Camera-trapping and image-solicitation studies have used various methods for image classification; manual classification by the lead researchers, hired technicians, or volunteer students is most common, but crowdsourcing from the general public (Swanson et al., 2016;Wisconsin Department of Natural Resources, 2018) and automated classification by computer software (Hiby et al., 2009;Jiang et al., 2015) have also been used. In our experience and observations of studies where images are manually classified, most images are classified by only a single person, but the number of classifiers and their expertise are rarely reported. Despite the fact that even highly trained experts are not always correct (Alexander & Gese, 2018;Austen, Bindemann, Griffiths, & Roberts, 2016;Gibbon, Bindermann, & Roberts, 2015;Meek et al., 2013;Swanson et al., 2016), the accuracy of image classifications is rarely questioned.
Classification of images by a single person may be adequate when classifying high-quality images of species that are distinctive, such as mountain goats (Oreamnos americanus), porcupines (Erethizon dorsatum), and snow leopards (Panthera uncia), but may be unreliable for sympatric species that are similar in size, shape, or coloration (Meek et al., 2013).
Many species across the globe fall into this category, such as bears, deer, lemurs, some mustelids, felids, and antelopes, as well as many bats.

Here, we use bobcats (Lynx rufus) and Canada lynx (Lynx canadensis; hereafter lynx) as a case study to measure agreement among experts in their classifications of images of similar sympatrics. Bobcats and lynx are congeneric felids similar in size and appearance that are sympatric across southern Canada and the northern United States (Gooliaff, Weir, & Hodges, 2018; Hansen, 2007; McKelvey, Aubry, & Ortega, 2000). Although bobcats and lynx look similar, they have slight anatomical differences (Hansen, 2007; Lewis, 2016). Lynx have larger paws, longer legs, and a more arched back compared to the straighter profile of bobcats. Lynx have more pronounced facial ruffs and longer ear-tufts, as well as shorter, solid black-tipped tails, as opposed to the longer, black-and-white-tipped tails of bobcats. Bobcats also have black heel marks that are absent on lynx, and usually have more brownish and spotted pelage compared to the gray-silver pelage of lynx.
Bobcats are common and are legally harvested in both countries, but lynx are federally listed as threatened in the contiguous US (US Fish & Wildlife Service, 2000). Classification of felid images in the contiguous US thus has direct conservation implications for lynx; bobcats falsely classified as lynx could result in false occupancy or distribution maps, or protection of areas that are not in fact used by lynx, whereas lynx misclassified as bobcats could result in under-protection.

| MATERIALS AND METHODS
We measured agreement among experts in their classifications of bobcat and lynx images that we collected through citizen science.
In a separate study, we solicited 4,399 images of bobcats and lynx from the public across British Columbia, Canada to examine the provincial distribution of each species (Figure 1; Gooliaff et al., 2018).
We subsampled those images to create six trials of images, each designed as a separate experiment to investigate a different factor that we thought might affect agreement among the experts in their classifications of images; we tested the (a) season, (b) background habitat, (c) visible features of the animal, and (d) time of day in images, and (e) whether we provided the location of images to the experts (Table 1). The sixth trial was a retest of the first set of images, to assess whether experts were consistent in their classifications of the same images months later. We divided images into trials rather than providing them all at once both to make it easier for the experts and so that each factor that we tested was isolated in one set of images. Within each trial there were multiple categories of images (e.g., "summer" and "winter" categories in the "season" trial); we compared agreement among the experts in their classifications of the images between these different categories (Table 1).
To select images for the different categories, we first chose images from the entire set that were of good photographic quality (i.e., the animal was in focus and not distant), were of single, alive, adult individuals that showed no bait or prey, and that were not submitted by any participating experts. We did not crop, edit or modify the images. We then randomly selected images to populate each category (Table 1). Within each category, all image characteristics (i.e., season, background habitat, visible features, and time of day) were consistent.
Each image was used only once, except for images in the "season" trial which were repeated as the "consistency" trial. We also mistakenly included one image twice in the "legs and tail" category. We disregarded the second classifications from the experts for this image in all analyses, which resulted in the "legs and tail" category containing 19 images rather than 20. Multiple images that were taken by the same remote camera, and thus that had the same background, were not included in the same trial. If the ratio of what we thought were bobcat and lynx images was below 4:1 for either species in any category, we randomly replaced images until at least that ratio was achieved, except for the "northern" images in the "location" trial because bobcats are likely absent in northern BC ( Figure 1; Gooliaff et al., 2018). In total, we selected 299 images: 116 images (39%) from remote cameras and 183 images (61%) from conventional cameras.
We created weblinks for the six trials (Table 1) using FluidSurveys (www.fluidsurveys.com). We released trials online sequentially, two weeks apart, between January and April 2017. In each trial, experts were prompted to classify the species in each image by selecting "bobcat," "lynx," or "unknown." Experts were not able to zoom in on images; this ensured that all experts based their classifications on the same view and detail of each image. The order of images in each trial was random, but was the same for all experts. Experts could not proceed to the next image without selecting an answer, and once selected, experts could not view previous images. However, experts were allowed to save unfinished trials and complete them at a later time. Trials were password protected, and we instructed experts to not consult with others; the experts did not know who else was participating in the experiment. Our study obtained ethics approval from the University of British Columbia (certificate # H16-03169).
FIGURE 1 Images of bobcats (white circles; n = 805) and lynx (black circles; n = 807) taken during 2008-2017. These images were solicited from the public across British Columbia and here we map points based on our own classifications of the images. We also show our boundary between northern and southern BC (dotted line).

Experts were aware that we were measuring agreement among them in their classifications of images, but they were unaware of the conditions that we were testing in each trial. Experts were unaware of image locations to ensure that experts based their classifications on the images themselves and not on any contextual information.
We provided the location for only half of the images in the "location" trial (Table 1), to test whether knowing such information affected agreement among the experts in their classifications of the images.
These images were accompanied with a map of BC showing the location of the image with a red star. The map also included cities and highways to help orient the experts. Although location information would almost always be available for images collected in actual camera-trapping or image-solicitation studies, and thus, our experiment does not reflect a realistic scenario in this regard, we wanted to determine whether knowing such information might bias expert classification in these actual studies.
We selected 27 experts from across western North America to classify the images; we chose experts from (a) northern BC and the Yukon (n = 9), where lynx are common but bobcats are likely absent, (b) southern BC (n = 9), where both species are present, and (c) the northwestern United States (n = 9), where both species are present but lynx are rare (Hansen, 2007; McKelvey et al., 2000). We considered people as bobcat or lynx experts if they were biologists who had field or image-classification experience on either species. Even if somebody had experience working with only one species, we felt that they should be able to distinguish the species more familiar to them from the less-familiar species. For example, if somebody had experience working with lynx but not bobcats, they should be able to tell that an image of a bobcat is "not a lynx." All of the people who participated in our experiment agreed that they had relevant experience to be considered an expert. Our panel of experts represented people likely to participate in studies on one or both species, or who would likely be asked to classify bobcat or lynx images. The experts consisted of mesocarnivore and furbearer biologists from provincial, state, and federal government agencies, as well as private consultants and academics.

| Statistical analysis
Our response variable was the number of experts that classified each image as "bobcat," "lynx," or "unknown." Because we used images that were contributed by the public, we were unable to independently verify the species in each image and thus could not conclude whether expert classifications were accurate. Instead, we measured agreement among experts in their classifications of the images (hereafter agreement) using Fleiss' kappa (K), which measures reliability among a group of classifiers. We calculated K using the R package irr (Gamer, Lemon, Fellows, & Singh, 2014) and calculated 95% confidence intervals based on 1,000 bootstrap iterations using the R package boot (Canty & Ripley, 2017). K is bound between −1 and 1; a value of 1 indicates perfect agreement, 0 indicates agreement that would occur by chance, and −1 indicates perfect disagreement (Fleiss, 1971).
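The paper computed K with the R package irr; as a cross-check of the mechanics, Fleiss' kappa can also be computed directly. The sketch below is ours (Python rather than R, and the function name is our own), not the paper's code:

```python
from collections import Counter

def fleiss_kappa(ratings, categories=("bobcat", "lynx", "unknown")):
    """Fleiss' kappa for a set of images, each classified by the same
    number of experts. `ratings` is a list of per-image label lists."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of experts assigning category j to image i
    counts = [[Counter(r)[c] for c in categories] for r in ratings]
    # p_j: overall proportion of all classifications falling in category j
    total = n_items * n_raters
    p = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    # P_i: observed pairwise agreement among experts on image i
    P = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
         for row in counts]
    P_bar = sum(P) / n_items          # mean observed agreement
    P_e = sum(pj * pj for pj in p)    # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Unanimous panels on every image give kappa = 1
print(fleiss_kappa([["lynx"] * 27, ["bobcat"] * 27]))  # 1.0
```

A bootstrap confidence interval, as computed with the boot package in the paper, would resample images (rows of `ratings`) with replacement and recompute kappa on each resample.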
K is commonly used in medical fields to measure agreement among clinicians in their diagnosis of certain conditions from images (Barnett, Glickman, Umorin, & Jalali, 2018;Farr, Guitton, & Ring, 2018;Vandenberk et al., 2018), but has also been used to measure agreement among biologists in identifying individual cougars (Puma concolor) from remote-camera images (Alexander & Gese, 2018).
There is no standardized method for interpreting or comparing K beyond relative differences between groups (Gwet, 2010). Many medical studies consider values >0.60 to represent "substantial" agreement (Landis & Koch, 1977); however, such studies often ask experts to rate the severity or progression of a disease or condition, whereas we asked experts to classify an animal species. Thus, in our study, we interpreted K more critically because experts were selecting from fewer and more distinct categories, conditions that typically increase K values (Sim & Wright, 2005).
We determined whether agreement varied between images with different characteristics (i.e., season, background habitat, visible features, and time of day) by comparing K between categories of images within each trial (Table 1). We also determined the combination of image characteristics that resulted in the highest and lowest agreement by pooling images with the same combination of characteristics from all categories. We determined whether knowing the location of an image affected agreement by comparing K when experts knew the location of an image to when they did not for images taken in northern and southern BC (Figure 1). We also determined whether agreement varied depending on the rarity of a species where experts lived by comparing K between experts from the three regions, and determined whether knowing the location of an image affected agreement within expert groups differently for either northern or southern images. We also determined whether experts were consistent in their classifications by having them unknowingly reclassify images from the first trial ("season" trial) 10 weeks later and calculating K between their first and second classifications of the same images.
In addition to calculating K across different kinds of images, we also calculated the proportion of agreement for individual images as

proportion of agreement = [bobcat(bobcat − 1) + lynx(lynx − 1) + unknown(unknown − 1)] / [n(n − 1)]

where bobcat, lynx, and unknown are the number of experts that classified an image as "bobcat," "lynx," and "unknown," respectively, and n is the total number of experts. With three classification options, the proportion of agreement had an upper bound of 1.00, indicating perfect agreement, and a lower bound of 0.31, indicating perfect disagreement (i.e., of 27 experts, nine each classified an image as "bobcat," "lynx," and "unknown").
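The stated bounds (1.00 for a unanimous panel; 0.31 for a 9/9/9 split among 27 experts) correspond to pairwise agreement, i.e., the fraction of expert pairs that selected the same classification. A minimal sketch (the function name is ours, not from the paper):

```python
def proportion_agreement(bobcat, lynx, unknown):
    """Per-image proportion of agreement: the fraction of expert pairs
    that selected the same classification for the image."""
    n = bobcat + lynx + unknown  # total number of experts
    same_pairs = sum(k * (k - 1) for k in (bobcat, lynx, unknown))
    return same_pairs / (n * (n - 1))

print(round(proportion_agreement(27, 0, 0), 2))  # 1.0: unanimous panel
print(round(proportion_agreement(9, 9, 9), 2))   # 0.31: perfect disagreement
```

Note that 0.31 = 3 × 9 × 8 / (27 × 26), so a three-way split among 27 experts still leaves some expert pairs in agreement, which is why the lower bound is above zero.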
Finally, we calculated the number of experts required to classify an image to reach a final classification (i.e., the number of experts at which the majority classification was unlikely to change by asking more experts). We calculated the mean probability that the majority classification (i.e., the classification of the greatest number of experts) of a randomly selected subset of one to 27 experts matched the majority classification of all 27 experts.
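The paper does not give implementation details for this calculation; one plausible realization is Monte Carlo resampling of expert subsets for each image. The function names and the vote split below are hypothetical illustrations:

```python
import random

def majority(labels):
    """Most frequent label, or None on a tie."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    best = max(counts.values())
    winners = [lab for lab, c in counts.items() if c == best]
    return winners[0] if len(winners) == 1 else None

def p_subset_matches(all_votes, k, n_sims=2000, seed=1):
    """Probability that the majority vote of a random subset of k experts
    matches the majority vote of the full panel, for one image."""
    rng = random.Random(seed)
    full = majority(all_votes)
    hits = sum(majority(rng.sample(all_votes, k)) == full
               for _ in range(n_sims))
    return hits / n_sims

# Hypothetical image: 20 experts voted "lynx", 6 "bobcat", 1 "unknown"
votes = ["lynx"] * 20 + ["bobcat"] * 6 + ["unknown"]
print(p_subset_matches(votes, 1))  # a lone expert often misses the panel majority
print(p_subset_matches(votes, 5))  # small panels track the full majority better
```

Averaging such probabilities across all images for each subset size from 1 to 27 would produce a curve like Figure 5; the dips at even subset sizes arise because split votes, which `majority()` reports as ties, cannot match the full-panel majority.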

| RE SULTS
All 27 experts completed each of the six trials (Table 1). The following results refer to all images in the first five trials (n = 259 images); this set excludes the 40 images in the "location" trial for which we provided locations. The total number of individual expert classifications was 6,993 (27 experts × 259 images); the experts classified the images as "unknown" in 11% (n = 753) of classifications and as "bobcat" or "lynx" in 89% (n = 6,240) of classifications.
Of these 259 images, 71% (n = 185) had ≥1 expert classify that image as "unknown." Experts reached a majority classification of "unknown" for 3% (n = 9) of images, but experts did not unanimously classify any images as "unknown." Experts unanimously classified 24% (n = 61) of images as being either "bobcat" or "lynx," while 39% (n = 101) of images had ≥1 expert classify that image as "bobcat" and ≥1 as "lynx." Overall, the 27 experts had moderate agreement in their classifications of the 259 images (K = 0.64, 95% CI = 0.60-0.68). The majority of images did not have a unanimous classification by the experts (76%; Figure 2a); the mean proportion of agreement score for individual images was 0.79 (SD = 0.19), but was highly variable (Figure 2b; Table 2). However, experts appeared to have similar agreement for each species; the mean proportion of agreement score was 0.84 (SD = 0.18, n = 92) and 0.77 (SD = 0.19, n = 167) for images that we had classified as "bobcat" and "lynx," respectively. Experts had greater agreement when the location of an image was provided, and experts had greater agreement for southern images than northern images when they knew the location of an image (Table 3). Some images had conflicting majority classifications among the three expert groups: in some cases, one or two groups had a majority classification of "bobcat" or "lynx" while the other group(s) had a majority classification of "unknown," and in two cases, different groups had majority classifications of "bobcat" and "lynx." The three expert groups had similar levels of agreement for images for which we provided locations, and also had similar consistency for retested images.
Experts did reach a clear majority classification for most images (2,481 individual expert classifications of either "bobcat" or "lynx" did not match our classification).
Finally, we note that the two of us, as authors, had image sequences for many images and contextual information for all images, and we agreed with the majority classification for all but two of the 288 images in our experiment that had a majority classification of "bobcat" or "lynx." For those two images, one majority classification by the experts was "lynx" and the other was "bobcat," while we held the opposite views. One of those images is the top right image in Figure 4a.

| DISCUSSION
We demonstrate far from perfect agreement among experts in distinguishing between images of two similar sympatric and congeneric species. Further, we were surprised at the frequent use of "unknown" as a classification by the experts in our study: experts classified images as "unknown" in 11% of classifications, and a striking 71% of images had ≥1 expert classify that image as "unknown." Thus, in many cases, experts were not confident enough to classify the species in the image.
These results are particularly troubling given that the images were all of high photographic quality. We do not know whether experts or novices would be more likely to classify images as "unknown"; experts may be aware of pitfalls in classification that novices do not know to look for, which could mean that experts use "unknown" more often than novices when images do not include critical defining features.
Alternatively, novices may doubt their ability to classify a species, thus using "unknown" more frequently. Regardless, we provided the option of classifying each image as "unknown" rather than forcing experts to choose between "bobcat" and "lynx" to allow for such cases of genuine uncertainty. If we had forced experts to assign a species to each image, our calculated minimum misclassification rate of 4% would likely have been much higher. We recommend that researchers honor and trust cases of uncertainty where they cannot confidently classify the species in an image.
Expert agreement varied among different kinds of images. The largest difference was that experts had much lower agreement for images in which fewer features of the animal were visible; it may be easiest to distinguish between the two species when both the face and legs are visible. Surprisingly, experts had slightly higher agreement for images taken at night than images taken during the day. Perhaps experts found it easier to distinguish between the two species at night because they were forced to focus on the physical features of each animal, rather than taking the color of an animal into account.

FIGURE 4 (a) We show the number of experts that classified each image as "bobcat," "lynx," and "unknown." (b) Both images are of the same animal taken near Prince George, British Columbia and have the same image characteristics. The image on the left was not included in our experiment but had a 4:4 split vote between bobcat and lynx among local biologists who were asked to classify the image. We included the image on the right in our experiment without providing its location; 26 experts classified the image as "bobcat", and one expert classified the image as "unknown". Images provided by (from top to bottom row): BC Parks, Emre Giffin, James Gagnon.

FIGURE 5 Mean probability that the majority classification of a randomly selected subset of experts matched the majority classification of all 27 experts, calculated across all images excluding the 40 images for which we provided locations (n = 259 images). Bars represent 95% confidence intervals. Probabilities are lower for even numbers of experts because of the likelihood of drawing a split vote, which is not possible for odd numbers of experts.
Expert agreement also depended on the background of images.
Experts may have cued in on certain background features to aid in their classifications, for example, associating tree species or habitat with one species over the other. Some of the experts spontaneously commented to us after the study was complete that for some images they had based their classifications on the vegetation. Experts had lower agreement for images with a forest background than images with a background of grassland or human infrastructure, likely because grassland and developed habitats are more characteristic of bobcats, but both species use forests.
Further, we showed that the location of an image can also affect expert classification; experts had greater agreement when they were provided with the location of an image. Again, spontaneous post-study comments from experts revealed that some experts used the location of an image to "confirm" their selections. However, while we expected experts to have greater agreement for images that we provided locations for, we were surprised to find that experts had greater agreement for southern images than northern images when the location was provided. We expected the opposite because bobcats are likely absent in northern BC; thus, there was essentially only one choice for northern images, whereas knowing the location of southern images should have provided little help since both species are common there (Figure 1; Gooliaff et al., 2018). Instead, some experts classified images from northern parts of the province as "bobcat," counter to our expectation. This result suggests that those experts were not familiar with the distribution of bobcats in BC.
Still, we strongly suspect that the location of an image can bias its classification if the person classifying the image has a preconceived idea of the species' distribution, which can lead to misclassification of similar species if one species is thought to be extremely rare or absent in a particular area, when in fact it is present. As some species suffer range contractions and population declines, while others expand ranges with climate change, we think this possible location bias is worth further study.
For example, the left image in Figure 4b was taken near Prince George in 2016, and sent to us as part of our citizen science search for images (Gooliaff et al., 2018). At the time, there had never been a confirmed bobcat record that far north. We classified the image as "bobcat," but the image was widely circulated on social media and the local news station, which sparked an intense debate among hunters, trappers, and naturalists as to whether the animal was a bobcat or lynx. Biologists in Prince George were asked by the local news station to classify the image, and initially four biologists thought "bobcat" and four biologists thought "lynx." After additional images showing the animal's paws were shared, those biologists shifted toward classifying the image as "bobcat" or "possible hybrid" (K. Otter, University of Northern British Columbia, personal communication).
The right image in Figure 4b is of the same animal and shares the same characteristics (i.e., season, background habitat, visible features, and time of day) as the left image. We asked experts to classify the right image in our experiment without providing its location; 26 experts classified the image as "bobcat" and one as "unknown."

Despite the fact that experts unanimously classified only 24% (n = 61) of images, experts did reach a clear majority classification for most images. Thus, while classifications of an image by a single expert were unreliable, we believe that the final majority classifications were correct for most images. Our findings suggest that the location of an expert did not matter, as long as many experts were asked. We found only slight differences in agreement between experts from northern BC and the Yukon, southern BC, and the northwestern US, suggesting that experts were not biased by the rarity of a species in the area where they live.

| Implications for studies using wildlife images
As photographic data become increasingly used in ecological studies for many groups of species (Rowcliffe & Carbone, 2008; Burton et al., 2015; Steenweg et al., 2017), we note that in our public solicitation for images (Gooliaff et al., 2018), approximately half of detections from remote cameras (44%) and conventional cameras (52%) still had only single images.
Although here we measured agreement among people in classifying single images, our experiment was based on the best-case scenario of experts classifying images of high photographic quality.
While we randomly selected images for our experiment to ensure that we did not consciously or subconsciously choose images that were easy or difficult to classify, our initial screen of using only high-quality images meant that our collection of images was likely far easier to classify than images that would typically be collected in camera-trapping or image-solicitation studies, as blurry images or those with animals more distant from the camera would be more difficult to classify (Meek et al., 2013).
Misclassification rates would also likely be higher when images are classified by non-experts, such as volunteers and crowdsourced participants (Swanson et al., 2016; Wisconsin Department of Natural Resources, 2018). Image classifications by non-experts may be suitable for species that are distinctive, but we strongly suggest caution when classifying images for species with similar sympatrics; we recommend that such images be flagged for classification by multiple experts. Specifically, we recommend that studies using wildlife images consult at least five species experts when classifying images showing species with similar sympatrics. Still, we stress that the majority classification of even five experts is not necessarily correct, only that the majority classification is unlikely to change by asking more experts.
Further, we recommend that researchers be explicit about their methods for classifying images. If researchers employ a design where most images are classified by one or two individuals who then consult with colleagues on difficult images, we urge such information to be provided, for example, specifying whether the main classifiers disagreed on the classification of such images, or whether all images that met a certain profile were flagged for further scrutiny. When sharing metadata from camera-trapping and image-solicitation studies, we recommend that researchers include information on the number of people who classified each image, whether those people were experts or non-experts, and the individual classifications of each researcher. We also suggest that researchers make available the raw images to provide the option of reclassifying certain images in future studies, at least images that are highly influential to the final conclusions that are drawn (e.g., images from range edges, or images that are the only record of a species in a given locality).
Images have been described as being conclusive evidence for the presence of a species, even when that species is thought to be absent or extinct (McKelvey et al., 2008), but that is only true if the species in the image can be conclusively classified. We show that experts find it difficult to distinguish between images of similar species. This implies that images collected from camera trapping or public solicitation should not be taken as definitive evidence of species presence for any species that may be readily misclassified as a similar sympatric, such as bobcats and lynx; rather, such images should serve as an initial subjective screen, followed by definitive, objective survey methods such as noninvasive DNA sampling or live-trapping.

ACKNOWLEDGMENTS
We thank everyone who submitted images and all experts who participated in our study. We also thank R. Weir, J. Pither, K. McKelvey, and anonymous reviewers for comments that improved the manuscript.

CONFLICT OF INTEREST
None declared.