Automated facial recognition for wildlife that lack unique markings: A deep learning approach for brown bears

Abstract Emerging technologies support a new era of applied wildlife research, generating data on scales from individuals to populations. Computer vision methods can process large datasets generated through image‐based techniques by automating the detection and identification of species and individuals. With the exception of primates, however, there are no objective visual methods of individual identification for species that lack unique and consistent body markings. We apply deep learning approaches of facial recognition using object detection, landmark detection, a similarity comparison network, and a support vector machine‐based classifier to identify individuals in a representative species, the brown bear Ursus arctos. Our open‐source application, BearID, detects a bear’s face in an image, rotates and extracts the face, creates an “embedding” for the face, and uses the embedding to classify the individual. We trained and tested the application using labeled images of 132 known individuals collected from British Columbia, Canada, and Alaska, USA. Based on 4,674 images, with an 80/20% split for training and testing, respectively, we achieved a facial detection (ability to find a face) average precision of 0.98 and an individual classification (ability to identify the individual) accuracy of 83.9%. BearID and its annotated source code provide a replicable methodology for applying deep learning methods of facial recognition applicable to many other species that lack distinguishing markings. Further analyses of performance should focus on the influence of certain parameters on recognition accuracy, such as age and body size. Combining BearID with camera trapping could facilitate fine‐scale behavioral research such as individual spatiotemporal activity patterns, and a cost‐effective method of population monitoring through mark–recapture studies, with implications for species and landscape conservation and management.
Applications to practical conservation include identifying problem individuals in human–wildlife conflicts, and evaluating the intrapopulation variation in efficacy of conservation strategies, such as wildlife crossings.


KEYWORDS
deep learning, face recognition, grizzly bear, individual ID, machine learning, wildlife monitoring

| INTRODUCTION
Conservation Technology is an emerging field that aims to address large-scale conservation challenges with innovative tools.
With biodiversity conservation a global concern, computational approaches that enable wildlife monitoring at larger spatial scales, but with finer resolution, are recognized as a priority (Arts et al., 2015). Scaling-up biodiversity monitoring (Steenweg et al., 2017), however, requires automation of data processing and analysis to increase reproducibility, while reducing time, cost, and labor (Weinstein, 2018).
Computer vision increasingly supports analyses of big data collected from image-based ecological studies (Weinstein, 2015).
One challenge is the inability to distinguish among individuals within species that lack unique markings (Rowcliffe et al., 2008).
Addressing this gap and taking a similar approach to human individual identification, face recognition (in various forms) has been developed for nonhuman primates (great apes Hominidae spp. (Ernst & Küblbeck, 2011; Loos & Pfitzer, 2012; Freytag et al., 2016; Schofield et al., 2019), lemurs Lemuroidea spp. (Crouse et al., 2017), and macaques Macaca mulatta (Witham, 2018)). For species other than primates, one of the only references to facial recognition of unmarked species in the peer-reviewed literature focuses on domestic dogs Canis familiaris (Moreira et al., 2017).
Facial recognition approaches could prove useful to the suite of nonprimate wildlife species that lack distinctive body markings.
Knowledge of unique individuals can facilitate the use of established techniques such as mark-recapture and thereby inform management.
Deep learning techniques automatically detect and extract learned features from data, and provide a powerful alternative to traditional methods of feature extraction (see Christin et al., 2019 and Schneider et al., 2019 for ecological applications). Face recognition using deep learning has recently achieved an accuracy of up to 92.5% for chimpanzees Pan troglodytes (Schofield et al., 2019) and 96.3% for giant pandas Ailuropoda melanoleuca (Chen et al., 2020); the latter possessing distinctive eye patch markings that could aid identification. A primary challenge, however, is that deep learning requires large labeled datasets for training and testing, which are difficult to acquire for wild populations, especially at the individual level (Schneider et al., 2019). Training on images of captive individuals offers a useful approach when such conditions exist for species of interest, but it is unclear how well these networks generalize to images taken of wild individuals in situ; controlled environments may provide inadequate training data for real-world application (Wearn et al., 2019). Long-term individual-based ecological studies of wild populations (sensu Clutton-Brock & Sheldon, 2010) provide an alternative and more common context that can support image databases collected over years. Moreover, variability contained within these images, such as fluctuations in body weight, may be more representative of the external morphology of wild animals.
Here, we describe our application BearID, which uses deep learning and facial images to detect and identify individual brown bears Ursus arctos, a species that lacks consistent, unique pelage markings.
Brown bears provide an ideal candidate for expanding facial recognition beyond primates as they present opportunities and challenges likely spanning a wide variety of taxa: (a) They vary in morphology across their range (Hilderbrand et al., 1999), and (b) they experience extreme weight fluctuations between seasons and as they age and grow (Kingsley et al., 1983).
Using a programming pipeline of face detection and reorientation, face encoding, and face classification (Schroff et al., 2015), we trained and tested an object detection network, landmark detection network, similarity comparison network, and support vector machine (SVM)-based classifier. We provide the methodological details for building the application, as well as our initial results and annotated source code. Although trained on a single species by design, BearID is transferable to other mammals and certain parts of the pipeline may be particularly transferable to other caniforms due to facial similarities. Given the number of species and the broad terrestrial and marine distribution of this important suborder of Carnivora, the frequency with which representative populations are studied, the expense of current identification methods (e.g., genetic tagging), and their relative ease of photographing, they comprise a well-suited study system for this approach. BearID thereby provides an important step in applying deep learning methods of facial recognition to a variety of wild animals beyond primates that lack distinctive markings.
We followed Dlib's deep learning example programs to provide an outline for building bearface, bearembed, and bearsvm (see Data Accessibility).
To develop BearID, we custom-built a computer system with a graphics processing unit capable of deep learning by providing parallel computing (see Appendix S1 for hardware details). We primarily used Python and C++ (below; Data Accessibility).

| Training and test data
From 4,675 images, we created a fully labeled "golden dataset" that included a bounding box for each face, the locations of landmarks, and identification of each bear (see Appendix S2 for details). We randomly split the golden dataset into 3,740 (80%) images for training and 935 (20%) for testing using a Python script, generate_partition.py.
One image had to be removed from the test split due to an incorrect label, for a total of 934 images in the test set. We used the data splits to train and test the various networks in our application (below).
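The random split can be sketched as follows (a hypothetical reimplementation of what generate_partition.py does; the seed, function name, and signature are assumptions, and the released script may differ):

```python
import random

def partition(image_paths, train_frac=0.8, seed=42):
    """Shuffle image paths deterministically and split them into
    train/test lists at the given fraction."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = round(len(paths) * train_frac)
    return paths[:cut], paths[cut:]

# With the 4,675-image golden dataset this yields 3,740/935 images.
train, test = partition([f"img_{i:04d}.jpg" for i in range(4675)])
```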
For face detection, we scaled the resolution of all training and test images down to 2,000 × 1,500 pixels to avoid overloading our hardware. For the test set, if scaling caused a face to become too small (<200 × 200 pixels), we scaled until the face was 200 × 200 pixels and then cropped the overall image to 2,000 × 1,500 pixels (examples: Photographs S1 and S2).
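The scaling rule above can be expressed as a small helper (a sketch of the rule as described, not the authors' code; the function name and interface are ours):

```python
def detection_scale(img_w, img_h, face_w, face_h,
                    max_w=2000, max_h=1500, min_face=200):
    """Return the scale factor applied before face detection:
    shrink the image to fit within max_w x max_h, but never let the
    smaller face dimension drop below min_face pixels. When the face
    constraint forces a larger scale, the scaled image is cropped to
    max_w x max_h afterwards (cropping is not performed here)."""
    s = min(max_w / img_w, max_h / img_h, 1.0)
    if min(face_w, face_h) * s < min_face:
        s = min_face / min(face_w, face_h)
    return s
```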

| Face detection
Bearface finds faces and landmarks (eyes, tip of the nose, ears, and top of the forehead) in images. It consists of two networks: an object detector (OD) and a shape predictor (SP). The OD uses a sliding window (Dalal & Triggs, 2005) and a convolutional neural network (CNN) trained with Dlib's max-margin object detection loss function (King, 2015). We selected this approach as Dlib's example model trained on domestic dogs performed sufficiently to expedite labeling for the golden dataset (see Appendix S2). The CNN was trained using the bounding box labels in the golden dataset (see Appendix S3 for training procedure). The SP uses Dlib's implementation of face alignment with an ensemble of regression trees (Kazemi & Sullivan, 2014) and was trained using the landmark labels in the golden dataset. The bearface application takes as input: an image file or list of images as an XML file and a network weights file. JPEG and PNG are both accepted as input file types, but raw or other format images would first need to be manually converted. It outputs an XML file with a list of images and corresponding face and landmark information.

| Face reorientation and cropping
This stage uses the facial landmarks in the XML created by bearface to reorientate and extract the bear faces (or "chips": Schroff et al., 2015). The application, bearchip, centers and rotates the face to optimal orientation. The current implementation uses only the eyes to align and center images (Table S1). It then scales and crops (150 × 150 pixels) the faces and writes each face chip as a JPEG file (Figure 3).
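The eye-based alignment reduces to computing a rotation angle, center, and scale from the two eye landmarks (a geometric sketch; the canonical eye positions and chip fraction here are assumptions, and the actual bearchip tool relies on Dlib's chip extraction):

```python
import math

def eye_alignment(left_eye, right_eye, chip_size=150, eye_x_frac=0.3):
    """Return (angle_deg, center, scale) for an affine transform that
    levels the eyes, centers the chip between them, and scales the
    interocular distance to span the chip between x-fractions
    eye_x_frac and 1 - eye_x_frac."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))  # rotate by -angle to level the eyes
    center = ((left_eye[0] + right_eye[0]) / 2,
              (left_eye[1] + right_eye[1]) / 2)
    scale = chip_size * (1 - 2 * eye_x_frac) / math.hypot(dx, dy)
    return angle, center, scale
```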

| Face encoding
Face encoding forms the core process that facilitates facial recognition in the pipeline. It uses a similarity metric (Schneider et al., 2020) to learn a function that maps an input image (bear face chip) into a target space (Chopra et al., 2005). The metric loss function (Dlib toolkit: King, 2009) drives the similarity metric to be small for face chips of the same bear and large for face chips from different bears.
The output is an embedding, which is a numeric vector representation of a facial image that can be compared to other embeddings to identify individuals using a face classifier (below).
For the implementation, bearembed, we trained a similarity comparison network using a deep CNN with a ResNet-34 architecture (He et al., 2016), following the Dlib example "deep face recognition" implementation (King, 2009), to produce a 128-dimensional Euclidean embedding per face chip image (Schroff et al., 2015). The bearembed application has three modes: training, testing, and embedding. To generate the face chip training data, we used bearface on the training portion of the golden dataset. For metric learning, we used pairwise hinge loss, rather than triplet loss (as in Schroff et al., 2015), implemented by the Dlib metric loss layer. We used hard negative mining to ensure a balanced ratio within mini-batches of positive (same individual) and negative (different individuals) pairs from the training data (see Appendix S4 for training procedure).
We augmented the training dataset by applying color perturbation and jittering each time a face chip was included in a mini-batch (Appendix S4). Bearembed can be used with the trained network for testing or to generate embeddings for a set of chips.
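The pairwise hinge loss can be sketched as a scalar function (a simplified version in the spirit of Dlib's loss_metric layer; 0.6 and 0.04 are Dlib's default distance threshold and margin, which may differ from the values used in training bearembed):

```python
import math

def pairwise_hinge_loss(emb_a, emb_b, same_bear, threshold=0.6, margin=0.04):
    """Zero loss once same-bear pairs are closer than threshold - margin
    and different-bear pairs are farther apart than threshold + margin;
    otherwise the loss grows linearly with the violation."""
    d = math.dist(emb_a, emb_b)  # Euclidean distance between embeddings
    if same_bear:
        return max(0.0, d - (threshold - margin))
    return max(0.0, (threshold + margin) - d)
```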

| Face classification
Face classification is the process of assigning an individual ID label to an embedding created by bearembed, by comparing it against embeddings from an existing dataset of known individuals. We implemented this stage as bearsvm, a multiclass support vector machine (SVM) classifier trained on the embeddings of the known individuals in the training split (see Data Accessibility).
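As a minimal, testable stand-in for this stage (a nearest-centroid classifier over embeddings, named here for illustration; the actual bearsvm uses a multiclass SVM):

```python
import math
from collections import defaultdict

def train_centroids(embeddings, labels):
    """Average the embeddings of each known bear into one centroid."""
    grouped = defaultdict(list)
    for emb, bear_id in zip(embeddings, labels):
        grouped[bear_id].append(emb)
    return {bear_id: [sum(col) / len(embs) for col in zip(*embs)]
            for bear_id, embs in grouped.items()}

def classify(embedding, centroids):
    """Return the ID whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda b: math.dist(embedding, centroids[b]))
```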

| Full application
The full BearID application was implemented as a Python script, bearid, which runs the subapplications in sequence: bearface detects and aligns faces, bearchip extracts the face chips, bearembed produces an embedding per chip, and bearsvm assigns an individual ID.
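The flow of the script can be sketched as function composition (the callable interface here is an assumption for illustration; the released script invokes the compiled tools and passes XML and JPEG files between stages):

```python
def bearid_pipeline(image, detect, chip, embed, classify):
    """Run the four stages in sequence for every face found in an
    image: detect faces, extract an aligned chip per face, embed
    each chip, and classify each embedding to an individual ID."""
    ids = []
    for face in detect(image):
        face_chip = chip(image, face)
        ids.append(classify(embed(face_chip)))
    return ids
```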

| Testing methodology
We initially tested subapplications (bearface, bearembed, bearsvm) independently to gain an accurate representation of performance using the golden dataset. We then tested the full application from input image to ID classification to assess cumulative error on classification accuracy. We focus on subapplication results as our focal measures of performance.

| Face detection
The OD and SP were analyzed separately for the bearface application. We evaluated the OD using precision, recall, and interpolated average precision, with an intersection over union (IoU) threshold of 0.5, comparing all predicted faces to those in the test split of the golden dataset (n = 934; Table 2). Precision indicates how often a detection was actually a face; recall indicates how many of the faces present were detected; and average precision, the area under the precision-recall curve, comprises our key performance metric. We evaluated the SP by finding the distance between the predicted landmarks and those in the test split (Table 2). We normalized the distance for each landmark by scaling by the interocular distance, and report the mean normalized distances of all landmarks across all faces (Table 2).
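The IoU criterion used for scoring detections can be sketched as follows (boxes represented as (x1, y1, x2, y2) corner tuples is an assumption for illustration, not the evaluation script itself):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2); a detection counts as a true positive when its
    IoU with a ground-truth face box is at least 0.5."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```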

| Face encoding
Bearembed was evaluated based on the "Labeled Faces in the Wild" method (Huang et al., 2007). We used k-fold cross-validation to measure how accurately matched (same individual) and unmatched (different individuals) pairs of embeddings were distinguished, splitting the data either by face chip or by ID label (Table 3).

TABLE 2 Testing results for the object detector and shape predictor that comprise bearface
Note: Average precision calculated as area under the curve of the precision-recall curve with interpolated precision and an IoU (intersection over union) at 0.5. OD is object detector and SP is shape predictor; values are reported ± standard deviation.

| Face encoding and classification
Bearembed had predictive utility when classifying between matched (same individual) and unmatched (different individuals) pairs (Figure 4). The error for the training data was nominal (Figure S1), which could indicate overfitting (the network learns to distinguish the specific training images rather than something more general).
Higher accuracy occurred when splitting data by face chip, rather than ID label (Table 3). For visualization purposes, we created a subset of bears (n = 16: Knight Inlet) with > 3 images per bear. The resulting embeddings created by bearembed showed variation among and within individuals; images of some individuals were consistently clustered, whereas others were clustered with images of multiple individuals ( Figure 5).
Using bearsvm, an ID prediction was made for each embedding in the golden dataset test split (n = 934). Accuracy was determined by dividing the number of correct predictions (according to the ground-truthed ID labels) by the total number of predictions. Two bears had single embeddings in the test set but none in the training set, so they could not be classified. Of the remaining 932 embeddings, bearsvm produced 782 correct predictions, yielding an accuracy of 83.9%. A confusion matrix was generated to further investigate classification performance by indicating which bears were confused when ID predictions were made among 16 individuals using bearsvm (Table 4).
Classification performance varied in a similar way to embedding performance ( Figure 5); some individuals were consistently identified accurately, whereas others were more likely to be confused with other bears (Table 4).

| Initial testing of the full application
We evaluated the full application, BearID, to assess the cumulative effects of subapplication error on overall identification accuracy (i.e., the effect of errors/imprecision in facial detection on individual classification). We ran 934 test images through the bearid script to receive an ID classification for each image in which a face was detected (n = 929; Table 5). Two images of bears not represented in the training set were disregarded for the bearsvm and final bearid results.
The overall accuracy for the full BearID pipeline was only slightly reduced (82.4%; Table 5) compared to when the classifier (bearsvm) was tested independently on the golden dataset (83.9%).

| Transferability to other populations and species
The bearembed (face encoder) requires the "face chip" to create vector embeddings to test matching/nonmatching pairs of images. It can be used on brown bear individuals not currently in the dataset, but may have lower accuracy (see "by ID label" embedding results). It has not been tested on other species; bearsvm (classifier) compares embeddings to those already in a database to return a matching ID label and therefore can only be used for brown bears in the current dataset, requiring training on specific known individuals.
Although not designed primarily for this function, BearID can be used to conduct face verification (Deb et al., 2018) for "unknown" brown bears by running images through bearface and bearchip, using the embedding mode of bearembed to create embeddings for the images, and then the test mode of bearembed to test between the images. Results will indicate whether the bears in the images are matching (same individual) or nonmatching (different individuals). Accuracy will be approximately 71.3% (the "by ID label" accuracy) or possibly lower due to potential regional differences in morphology, as the network has only been trained on bears from two populations.
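Verification then reduces to thresholding the distance between two embeddings (0.6 is Dlib's default decision threshold; the trained network's operating threshold may differ):

```python
import math

def same_individual(emb_a, emb_b, threshold=0.6):
    """Declare two face embeddings a match (same bear) when their
    Euclidean distance falls below the decision threshold."""
    return math.dist(emb_a, emb_b) < threshold
```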

| DISCUSSION
Images of the same individuals across years should result in a more robust identification network (Schofield et al., 2019), but could reduce accuracy (e.g., in humans: Rashmi et al., 2017). Further investigation of the influence of aging and weight gain on facial biometrics of species is needed for increased inference (insights into deep learning: Miao et al., 2019). These changes could also explain why some individuals were more consistently recognized than others. In addition, images of wild animals include variation in image quality due to distance from the focal animal, background, lighting, and pose. Using CNNs, Freytag et al. (2016) found lower accuracy for wild (77%) compared to captive chimpanzees (92%). Assessing the parameters that contribute to an "optimal facial image" and the impact of changes in facial appearance (e.g., facial trauma) could increase both the detection and recognition accuracy in future studies.

TABLE 5 Evaluating the impact of detection (bearface) and classification (bearsvm) error on full pipeline accuracy
Note: Accuracy of classifications from source images varies compared to when bearsvm was tested on the golden dataset (above).
Whereas methods of pattern recognition applied to individual ID are well established (see Kühl & Burghardt, 2013), most mammals do not possess stable, unique markings. This inability to identify individuals objectively can restrict scientific inquiry and limit methods. BearID thus provides an important step in harnessing facial recognition techniques to address a broad spectrum of ecological questions that require individual ID, from fine-scale behavior (e.g., individual activity patterns: Hertel et al., 2017) to landscape-level population assessments (e.g., spatial mark-recapture using camera traps). We also see potential for this technology within conservation practice, such as to identify problem individuals in human-wildlife conflicts (see Swan et al., 2017) and to evaluate intrapopulation variation in efficacy of conservation strategies such as the use of wildlife crossing structures (e.g., Dexter et al., 2018), with implications for connectivity.
BearID may require additional species-specific training for other taxa, but our pipeline provides an open-source and replicable method as a foundation for discovery more broadly.

ACKNOWLEDGMENTS
We thank and are grateful to the Da'naxda'xw Awaetlala First Nation for permitting this research in their traditional territory. Special thanks to Stephanie O'Donnell and WILDLABS for facilitating our collaboration.

CONFLICT OF INTEREST
The author(s) declare no competing interests.

DATA AVAILABILITY STATEMENT
BearID is an open-source application available on GitHub at https://github.com/hypraptive/bearid (version 20.05).