Deep learning‐based pose estimation for African ungulates in zoos

Abstract The description and analysis of animal behavior over long periods of time is one of the most important challenges in ecology. However, most such studies are limited by the time and cost required of human observers. Collecting data via video recordings allows observation periods to be extended, but their evaluation by human observers remains very time-consuming. Automated evaluation, using suitable deep learning methods, is a forward-looking approach to analyzing even large amounts of video data in an adequate time frame. In this study, we present a multistep convolutional neural network system that detects three typical stances of African ungulates in zoo enclosures with high accuracy. An important aspect of our approach is the introduction of model averaging and postprocessing rules that make the system robust to outliers. Our trained system achieves an in-domain classification accuracy of >0.92, which is improved to >0.96 by a postprocessing step. In addition, the whole system performs well even in an out-of-domain classification task with two unseen species, achieving an average accuracy of 0.93. We provide our system at https://github.com/Klimroth/Video-Action-Classifier-for-African-Ungulates-in-Zoos/tree/main/mrcnn_based so that interested users can train their own models to classify images and conduct behavioral studies of wildlife. The use of a multistep convolutional neural network for fast and accurate classification of wildlife behavior facilitates the evaluation of large amounts of image data in ecological studies and greatly reduces the effort of manual image analysis. Our system also shows that postprocessing rules are a suitable way to make species-specific adjustments and to substantially increase the accuracy with which single behavioral phases (number, duration) are described.
The results in the out‐of‐domain classification strongly suggest that our system is robust and achieves a high degree of accuracy even for new species, so that other settings (e.g., field studies) can be considered.


| General
Describing and analyzing animal behavior is a central element in ecology, ethology, and the neurosciences. In order to characterize animal behavior more closely and identify general behavioral patterns, it makes sense to include longer periods of time, different habitats, and many individuals (Burger et al., 2020). While this is often a highly demanding task in natural habitats, studies in zoos allow researchers to develop, improve, and evaluate methods that help to understand the behavioral patterns of various species (Kögler et al., 2020; Ryder & Feistner, 1995).
Advances in digital infrastructure make it possible to collect and process observational data on a larger scale. However, the timely evaluation and extraction of meaningful information from the mass of recorded behavioral data represent a major challenge that can hardly be met by humans. Consequently, to provide means of automatic evaluation of animal behavior, computer vision and deep learning techniques have emerged in recent years in behavioral biology and ecology (Chakravarty et al., 2020; Dell et al., 2014; Eikelboom et al., 2019; Valletta et al., 2017).
Over the last decade, deep learning techniques have become a crucial factor in computer vision applications (Ng et al., 2015; Zha et al., 2015). Many state-of-the-art models that hold current benchmarks in computer vision tasks like object detection or semantic segmentation use convolutional neural networks (CNNs) (Russakovsky et al., 2015). Two deep learning approaches are common for inference tasks on video data. For video action classification, neural networks can be trained on sequences of consecutive frames to leverage temporal features like motion that can be strong cues for predicting actions. These approaches work best with a medium to high frame rate and high resolution. Unfortunately, gathering such data over a longer period of time can be costly and may not be suitable for every research application. Another common practice is to use a neural network for inference on single frames and inject temporal higher logic to combine these predictions. This is the approach we take in the research presented here.

| Our contribution
We present a deep learning approach to video action classification of four different behavioral states of various African ungulates: standing, lying-head up, lying-head down, being absent (cf. Section 1.5).
The goal of our approach is to use a few manually annotated videos of individuals in a certain setting in order to subsequently automatically evaluate a large video dataset of this individual. This will be tackled by a three-stage deep learning-based framework.
The first phase is an object recognition phase carried out by a Mask R-CNN neural network (He et al., 2017). It serves three purposes. Firstly, it reduces background information by localizing the regions of interest that mostly consist of pixels filled by animals. This increases the similarity of sample images taken from different enclosures, which dramatically increases the power of transfer learning across enclosures (Yosinski et al., 2014). Secondly, object detection can be used to distinguish between individuals within the same enclosure as long as the individuals do not occlude each other too severely. Lastly, it provides a clean way of detecting whether an animal is present or absent.
The second phase carries out a canonical classification task on the clean-cut images from phase 1. Our approach is governed by an ensemble of two EfficientNetB3 (Tan & Le, 2019) image classifiers. One network predicts actions based on single-frame inputs, and we accumulate the predictions into one prediction per time interval (7 s). The second classifier includes the temporal dimension of the video by predicting the behavior shown in this time interval directly. To this end, the consecutive frames of the interval are concatenated into a single input (so-called multiframe encoding; Ji et al., 2013; Karpathy et al., 2014). Subsequently, the final prediction per interval is an average over the predictions of the ensemble of classifiers. Finally, to further smooth predictions, we apply carefully chosen rolling averages during this process.
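The accumulation of single-frame predictions into one prediction per 7-s interval can be sketched as follows. With 1-fps footage, one interval spans 7 frames; the function name and the handling of trailing frames are our illustrative choices, not the exact implementation:

```python
import numpy as np

def accumulate_to_intervals(frame_probs, frames_per_interval=7):
    """Average per-frame class probabilities (n_frames x n_classes) into
    one probability vector per 7-s interval.  With 1-fps footage, one
    interval spans 7 frames; trailing frames that do not fill a
    complete interval are dropped (an illustrative choice)."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    n = (len(frame_probs) // frames_per_interval) * frames_per_interval
    chunks = frame_probs[:n].reshape(-1, frames_per_interval,
                                     frame_probs.shape[1])
    return chunks.mean(axis=1)

# 14 frames of a 4-class problem collapse into 2 interval predictions
probs = np.tile([0.1, 0.7, 0.1, 0.1], (14, 1))
print(accumulate_to_intervals(probs).shape)  # (2, 4)
```

The per-interval label is then the argmax of each averaged probability vector.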
In the third phase, an application-driven postprocessing step takes place. After calculating predictions for each time interval, we apply postprocessing rules that, for instance, filter out very short activity phases of behaviors which are very unlikely to appear within the evaluated behavioral states or use information about the position of the animal in its enclosure.

| Video action classification using CNNs
Among the first applications of CNNs to video action classification, Ji et al. (2013) and Karpathy et al. (2014) discovered that encoding multiple frames performs marginally better than frame-by-frame classification. The first milestone was reached by incorporating the temporal dimension of a video into the classification approach by different means of so-called optical flow calculations (Ng et al., 2015; Simonyan & Zisserman, 2014; Zha et al., 2015). The current state of the art for video action classification is a two-stream approach (Feichtenhofer et al., 2016; Zhao et al., 2020) in which each frame is fed into a CNN and predicted by a frame-by-frame classifier that gathers the spatial features of an image. In parallel, a sequence of consecutive frames is classified by a second CNN that captures the temporal dependencies of the video. The final prediction per frame is a fusion of the features given by these two streams.

KEYWORDS: animal behavior states, automated monitoring, convolutional neural networks, deep learning tools, ecology of savannah animals, image classification

| Deep learning approaches for action classification in behavioral studies
In recent years, the use of computer vision and deep learning techniques has emerged in behavioral biology tasks (Christin et al., 2019; Dell et al., 2014; Valletta et al., 2017). Work of this kind can be clustered by the nature of the data used. One class of experiments was performed under laboratory conditions: high frame-rate videos with high contrast. A prominent example is the JAABA toolbox (Kabra et al., 2012) for video classification of behaviors of mice and Drosophila flies. Another example is DeepBehaviour (Graving et al., 2019), which is used to detect and track the trajectories of mice in a laboratory. Moreover, Stern et al. (2015) present a system for object detection and behavior classification: they predict with great accuracy whether a Drosophila fly is on some substrate or not.
Other projects need to process data recorded in the wild, where the recorded image or video material poses a much greater challenge, as variations in background, brightness, weather, camera specifics, recording angle, etc., lead to highly complex datasets. For instance, Porto et al. (2013) present a computer vision-based classifier using the Viola-Jones detection algorithm to distinguish lying behavior of dairy cows in free-stall stables. Norouzzadeh et al. (2018) use camera traps in the Serengeti to answer research questions on the numbers, types, and behavior of recorded (larger) African mammals. Their behavior classification task is to distinguish between the five activities standing, resting, moving, eating, and interacting for each detected individual. They apply a deep learning system harnessing the 1.4 million images of the Snapshot Serengeti Dataset (Swanson et al., 2015) available to them. A main challenge posed by the high variation in background is the failure of standard transfer learning techniques, as deep learning classifiers are sensitive to typical backgrounds (Beery et al., 2018; Quiñonero-Candela et al., 2009). One approach to tackle this variety, which, as already mentioned, we take as well, is to increase the similarity between images by image segmentation. An active learning system for identifying species and counting individuals in camera-trap material uses such segmentation techniques and is studied extensively by Norouzzadeh et al. (2021).

| Our objectives
Understanding the behavior of animals is a key element of ecology.
For example, behavioral studies can improve our understanding of the habitat requirements or migration patterns of species, which in turn have important implications for nature conservation (Melzheimer et al., 2020; Teitelbaum et al., 2015). However, animal behavior is complex, contextual, and species-specific, so approach and analysis must differ depending on the thematic focus, the environmental variables, or even the species themselves. In this context, videography is an inexpensive, noninvasive method for documenting animal behavior. Although manual methods of video evaluation allow for differentiated behavioral analysis, they are also very time-consuming, so that longer quantitative analyses are limited. Under controlled laboratory conditions, valid solutions based on computer vision algorithms are available today that allow behavioral analyses to be performed routinely (cf. Section 1.3.2). For data recorded in setups where the environment variables are much more complex or the available image material is of lower quality, on the other hand, automating the evaluation process has so far posed a major challenge for researchers.
A key objective of this work is to combine recent successes of deep learning with domain knowledge and expertise from behavioral biology. Our overall objective is to establish a pipeline that produces high-quality action classification with only little human labeling effort involved. In this study, the main objective is to build an accurate automatic pipeline to classify behaviors of animals recorded in zoo enclosures. We aim to achieve this using open-source software and low-budget technical equipment, and we make our code openly available on GitHub so that it may be easily reproduced by other research groups. We showcase a procedure that allows manual labeling endeavors to be significantly reduced while maintaining high-quality labels in a controlled manner. The procedure goes as follows:
• Let a researcher manually label a small set of nights of an unknown individual.¹
• Split these into train, test, and validation sets, that is, reserve at least one night as a holdout test set.
• Fine-tune the object detection and classification networks on the training data, for instance via backpropagation, and evaluate the performance on the test set.
• If the performance is not satisfactory, the accuracy can further be improved by adding more labeled nights and tuning a postprocessor.
Given a pool of existing labeled data from 10 different species of the order Cetartiodactyla, we aim to further predict unlabeled nights from the same or other individuals of these species.
We therefore split nights into single-activity time intervals (seven seconds long) and predict one out of four stances: standing, lying-head up, lying-head down, or being absent, which are explained in Section 1.5. On the one hand, we are interested in the performance of neural networks on the task of inferring these states per interval.
On the other hand, it is crucial that the entire system is also capable of predicting the behavioral phases of entire recording nights in such a way that typical biological parameters such as the number and duration of the phases are sufficiently accurate in order to use these predictions for behavioral research studies. Finally, we also investigate a slightly easier task: distinguishing standing from lying (independent from the head's position), which is of great interest for the identification of rhythmic activity patterns in nocturnal behavior.
We will refer to this as the task of binary classification.

| Background
In order to keep studies comparable, behavioral research works with standardized ethograms that allow comparisons within a species or a related systematic group (Stanton et al., 2015). The definitions of the annotated behavioral states are therefore given below. Our study focuses on the three basic behavioral categories standing, lying-head up, and lying-head down, which are defined in the following ethogram.
• Standing: The animal stands in an upright position on all four hooves. It does not matter what the animal is doing in this position, so, for example, it could be feeding, resting, walking, or ruminating.
• Lying-head up (LHU): The animal's body lies on the ground, and the head is lifted. We do not distinguish between being awake and being in non-REM sleep; furthermore, the animal could also be feeding, ruminating, or resting.
• Lying-head down (LHD): The animal is lying with its head rested. The resting head lies on the ground and is placed beside the body or sometimes in front of it.

A visualization of each state can be found in Figure 1. Additionally, if the animal cannot be seen in a frame, the desired label is being absent.

FIGURE 1 For three different species (top to bottom): Common Eland (Taurotragus oryx), Common Wildebeest (Connochaetes taurinus), and Waterbuck (Kobus ellipsiprymnus), the three behavioral states (left to right) standing, LHU, and LHD are shown
At this point, we briefly stress that LHD is a valid indicator of REM sleep. Indeed, identifying REM sleep by a characteristic posture is common practice in behavioral studies based on image and video material (Ternman et al., 2014). This is because postural atonia is a characteristic of REM sleep (Lima et al., 2005; Zepelin et al., 2005); due to the lack of muscle tone, any body part (including the animal's head) needs to be laid down. Furthermore, at least for cows, it is well known that this kind of behavioral estimation of REM sleep is highly sensitive (Ternman et al., 2014).

| The deep learning approach
Deep learning has three key drivers: algorithms, data, and computational resources. For the first two stages of our prediction pipeline, we apply deep learning algorithms developed over the last few years, resulting in an ensemble of three neural networks.
However, we strongly believe that the specifics of their design are of lesser relevance and that they could easily be exchanged for other neural networks deemed state of the art for the respective tasks (Bochkovskiy et al., 2020; Tan et al., 2020; Touvron et al., 2020). In contrast, the data used for training the neural networks and evaluating their performance play a crucial role in the experiments; hence, we dedicate Section 2.2 to discussing them in detail. Lastly, we were able to perform all experiments on a single mid-range GPU (RTX 2070). For all three models, the total training time amounted to 840 hr, and the entire pipeline now predicts behaviors for 1 hr of video material in 15 min.

| Data
The data for this project span 209 nights (2,926 hr) of recordings of 65 individuals of 10 different species, see Table 1. The videos were taken over the last three years with either a Lupus LE139HD or a Lupus LE338HD camera in zoo enclosures of one Dutch and ten German zoos. They have a frame rate of 1 fps and a resolution of either 1080p or 720p. The recording time mostly ranges from 5 p.m. to 7 a.m., that is, the time when the animal keepers are mostly absent, with night vision provided by the built-in infrared emitters of the cameras.
Compared to previous studies in behavioral biology (Graving et al., 2019;Kabra et al., 2012;Stern et al., 2015) recorded under laboratory conditions, our data are much more complex and noisy.
Installing the cameras properly poses major problems, as the enclosure structure and husbandry are given by the zoos; that is, the existing, limited installation options must be used, and the animals should not be disturbed by the cameras. This leads to huge differences from enclosure to enclosure in the position and angle at which the cameras can be installed. Furthermore, the angle of a camera might change due to external influences, visibility might worsen because of dirt sticking to the lens, and the cameras must be out of the animals' reach (sometimes the installation needs to be outside of the enclosure box), leading to a high degree of occlusion or to truncation effects (blind spots in the enclosures).
Some edge cases are illustrated and further elaborated on in the Appendix A.
For the task of object detection, we manually annotated bounding boxes for nearly 26k randomly sampled images; a detailed per-species listing is provided in Table 2. A subset of 10% of these images is used as a test set, and the remaining 90% form the training set of the object detector. For the main task of classifying behavior, we have complete labels for all 209 nights. For one common wildebeest, one bongo, and three common elands, we keep a holdout set of some nights for testing (these are the same nights containing the test images for object detection). From all other nights, we randomly select a training set of about 95k images such that the three classes standing, LHU, and LHD are almost balanced in number. For further evaluation of the single-frame and single-interval performance of the neural network predictors, we proceed similarly with the test nights to obtain 6k images for the Common Elands and 4k images for each of the other two species (cf. Table 1).
We refer to this subset of the test set as the validation set.
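The class-balanced random selection of training images can be sketched as follows; the helper name and interface are ours, the actual sampling code is part of the repository:

```python
import random
from collections import defaultdict

def balanced_sample(labelled_frames, per_class, seed=0):
    """Draw an (almost) class-balanced random subset of labelled frames,
    as done for the ~95k-image training set with the classes standing,
    LHU, and LHD.  `labelled_frames` is an iterable of (frame_id, label)
    pairs; `per_class` caps the number of frames drawn per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for frame_id, label in labelled_frames:
        by_class[label].append(frame_id)
    sample = []
    for label, frames in by_class.items():
        k = min(per_class, len(frames))  # a rare class may have fewer frames
        sample.extend((f, label) for f in rng.sample(frames, k))
    rng.shuffle(sample)
    return sample
```

Fixing the seed keeps the train/validation split reproducible across runs.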

| Phase 1: Object detection
The objective of phase 1 is to localize individuals by drawing a minimal rectangular bounding box around them, which can be cut-out and further classified into the action classes in phase 2. If no individual is detected, we can already predict the class as being absent.
For object detection on single image frames, we fine-tune a Mask R-CNN with a ResNet-101 backbone that was pretrained on the MS COCO database (Lin et al., 2014), which contains animal object classes like zebras, elephants, and dogs and is hence a good basis for transfer learning to our dataset. More precisely, we use the Matterport implementation (Waleed, 2017) of Mask R-CNN and fine-tune it on the training data described above for 50 epochs to detect animals of the listed 9 species. Due to some severe truncation occurring in our data, we further ran one round of offline hard example mining (Felzenszwalb et al., 2010): for each animal, we ran the trained model on 400 images from the nights used for training and inspected the obtained predictions. The model failures, that is, the poorly predicted bounding boxes, were then re-annotated by hand, and finally the network was re-trained for 15 epochs including these additional annotations.
After the per-image prediction, we apply the following postprocessing steps, which helped to make the overall predictions more robust to edge cases in the data and to erroneous localization predictions. We only keep bounding box predictions whose confidence is at least 97%. We also allow a maximum of one box per image. At first glance, this approach looks tailored to enclosures with one individual, but it can, in fact, easily be extended to detect and distinguish multiple individuals within the same enclosure.
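The confidence threshold and the one-box rule amount to a few lines; this is a sketch with names of our choosing, the thresholds are taken from the text:

```python
def select_detection(boxes, scores, min_confidence=0.97):
    """Keep only bounding boxes whose confidence is at least 97% and at
    most one box per image (the most confident one).  Returning None
    corresponds to predicting the class 'being absent'."""
    kept = [(score, box) for score, box in zip(scores, boxes)
            if score >= min_confidence]
    if not kept:
        return None
    return max(kept, key=lambda sb: sb[0])[1]

# two confident boxes: the higher-scoring one wins
print(select_detection([[0, 0, 10, 10], [5, 5, 20, 20]], [0.98, 0.99]))
```

For multiple individuals, the same filter could be applied per detected identity instead of per image.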

| Phase 2: Action classification
In phase 2, we predict the action displayed in short sequences of cut-out frames. We follow a successful approach to video action classification (cf. Section 1.3.1) based on a two-stream system: the image frame is input to the first stream, and motion cues from the temporal context are fed into the second stream. For this second input, optical flow is a common choice, which we tried as well but found to perform worse than the model we describe below. A small ablation study and discussion can be found in Appendix B; as a result, multiframe encoding was chosen as an alternative way to input pixel motion information.
The inputs to stream 1 are the cut-out boxes from phase 1, resized to a resolution of 300 × 300 pixels. A single input for the second stream consists of a four-frame encoding of a 7-s time interval.² The corresponding four cut-out boxes are resized to 150 × 150 pixels each and then combined to the same input size as stream 1. Both networks were trained with an initial learning rate of 10⁻³ and an exponential decay of 0.9. We further applied the following input augmentation steps during training: random center cropping by 0-16 px, random horizontal flipping, random Gaussian blurring, brightness and contrast augmentation, and finally, random rotation by −25 to +25 degrees.
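The multiframe encoding for stream 2 can be sketched as below. The 2 × 2 grid layout is our assumption; the text only states that the four resized frames are combined into one input of stream 1's size:

```python
import numpy as np

def multiframe_encode(frames):
    """Tile the four 150x150x3 cut-outs of one 7-s interval into a
    single 300x300x3 input (2x2 grid, row-major order), matching the
    input resolution of stream 1."""
    assert len(frames) == 4
    top = np.concatenate(frames[:2], axis=1)      # 150 x 300 x 3
    bottom = np.concatenate(frames[2:], axis=1)   # 150 x 300 x 3
    return np.concatenate([top, bottom], axis=0)  # 300 x 300 x 3
```

This lets the same backbone architecture serve both streams without changes to the input layer.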

| Phase 3: Postprocessing
Finally, we apply a series of postprocessing operations to make our prediction pipeline more robust, to fit the task of predicting accurate time intervals of animal behavior, and to leverage our knowledge of the temporal consistency of the data. To begin with, we average model predictions between the two streams of phase 2 and between consecutive intervals by applying a rolling average. An overview of the prediction pipeline up to this stage is illustrated in Figure 2, and the details of the implementation can be found in Appendix C.
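A minimal sketch of this fusion step follows; the window size of the rolling average is illustrative only, the actual values are described in Appendix C:

```python
import numpy as np

def fuse_and_smooth(probs_stream1, probs_stream2, window=3):
    """Average the per-interval class probabilities of both streams,
    smooth them with a centred rolling average over consecutive
    intervals, and return the argmax label per interval."""
    fused = (np.asarray(probs_stream1) + np.asarray(probs_stream2)) / 2.0
    kernel = np.ones(window) / window
    smoothed = np.column_stack([
        np.convolve(fused[:, c], kernel, mode="same")
        for c in range(fused.shape[1])
    ])
    return smoothed.argmax(axis=1)
```

Because the smoothing is applied to probabilities rather than hard labels, single noisy intervals rarely flip the final prediction.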
Next, we incorporate application-driven rules to smooth predictions over time and include our domain knowledge of the animals' behaviors. As the preceding steps introduce only a weak temporal context, we still observe flickering of the predictions due to small misclassifications or edge cases in the data. For example, if an individual is heavily truncated or occluded, the predictions of consecutive intervals might jump between absent and other actions. Furthermore, we reject certain types of transitions that would lead to unrealistically short intervals of activity, such as a short sequence of standing between LHD events, and simply keep the previous behavior in such cases. In summary, we obtain the final predictions by following the transition rules listed in Table 3.
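The rejection of unrealistically short phases can be sketched as follows; the per-behavior minimum lengths passed in are placeholders for the values of Table 3:

```python
def apply_transition_rules(labels, min_intervals):
    """Replace every run of a behaviour that is shorter than its minimum
    number of 7-s intervals by the preceding behaviour, mirroring the
    (previous, current, next) transition rules of Table 3."""
    out = list(labels)
    i = 0
    while i + 1 < len(out) and out[i + 1] == out[i]:
        i += 1  # the first run has no predecessor and is always kept
    i += 1
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1  # find the end of the current run
        if (j - i) < min_intervals.get(out[i], 1):
            out[i:j] = [out[i - 1]] * (j - i)  # too short: keep previous
        i = j
    return out
```

A single interval of standing between two long LHD phases is thus absorbed into one continuous LHD phase.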

| Evaluation
Our objectives stated in Section 1.4 require us to extend the usual testing ground for classification tasks: we are highly interested in the overall performance of the system on complete videos of known individuals. For the remaining two levels, we tackle a prediction task more challenging than usually performed in statistical learning. In the classical setup, the entire dataset is split randomly, that is, train and test sets follow the same distribution.

| Evaluating the deep learning components
Before analyzing and discussing the core target evaluation measures introduced in Section 2.6, let us first state the results for the single deep learning components. The results of the object detection component can be found in Figure 3. It achieves an AP@75 of more than 0.95 on the whole testing set and on the class of elands in the testing set.

Note to Table 3: If the system detects a sequence of (previous behavior, current behavior, next behavior) where the current behavior is shorter than described, it will replace it by the previous behavior.
For the action classification task, we first report performance on the balanced validation set, which leads to a testing environment compatible with common practices in deep learning. We achieve a testing accuracy of 0.881 for stream 1 and 0.954 for stream 2. Due to the specifics of the classification task and the data, these numbers can hardly be compared with typical benchmark classification tasks like the ImageNet Large Scale Visual Recognition Challenge. Consequently, to better assess the performance of our models, we conducted a human study in which experts (E, n = 11) and novices (N, n = 11) received 100 randomly chosen single frames from the validation set. Both groups were given the same images, once cut out and later as the original entire frame. The results of the participants versus stream 1 and stream 2 are listed in Table 4.
The standalone performance of stream 1 can easily be compared with the cut-out performance of the human predictors. We see that it clearly outperforms the novices and slightly outperforms the experts, except for the LHD class, where some experts perform better. Compared with the human predictors on enclosure level, stream 1 still outperforms the novices, while it performs on par with the experts. As the humans lose around 5%-10% in performance through the cut-out process, stream 1 has to compensate for this imprecision of the object detection phase and still achieves human expert performance. Moreover, we add stream 2 to this table as well, knowing that its input spans 7-s time intervals, which gives it a clear advantage over both humans and stream 1. Nevertheless, it is remarkable that this is enough to clearly outperform stream 1 and even the experts' predictions on the entire frame in all but the standing class. This underlines the benefits of including temporal information in the model. Still, stream 1 is a useful addition as it has different strengths than stream 2, such as classifying standing; hence, we see below that model averaging improves the overall prediction quality of the pipeline significantly. To conclude, the validation accuracy of both streams can be considered quite high and verifies that the model generalizes quite well, even more so considering the data quality and possible label ambiguities (cf. Appendix A).

| Performance of the overall pipeline
In the following, we present test results for time-interval predictions of stream 1, stream 2, the fusion step, and after postprocessing for three levels of generalization performance as outlined in Section 2.6. The results are presented in Table 5, subdivided into the performance for individual animals. We furthermore report average recall, precision, and f-score for the overall predictions and the accuracy for the binary classification task.

Note to Table 4: Those values are reported for the group of experts (E, n = 11) as well as the group of novices (N, n = 11) and for the two streams of the deep learning system.

[...] slightly worse, as we see in Figure 4, Column A. When comparing the f-scores for the three action classes, we see for all three elands that performance is weakest for LHD. In contrast, the f-scores for the binary task are much higher (Figure 4).

FIGURE 4 Overview of recall, precision, and f-score for each behavioral class and each individual (EL = Eland, B = Bongo, W = Wildebeest) produced by the classification system on the testing dataset

[...] average number of phases per video. Corresponding results will be stated in the next section.

| On behavioral biological key figures
Finally, we turn to our last objective, namely predicting the number of activity phases per night and their total duration. Table 6 reports the corresponding accuracies; for completeness, the underlying values can be seen in Table D1 of Appendix D.

| DISCUSSION
The first part of our model pipeline succeeds strikingly in detecting individuals in their enclosures. As object detection is a showcase task for deep learning, this was to be expected, but our results are still notably high for such a task. State-of-the-art models on the COCO dataset (Lin et al., 2014), such as the very recent YOLOv4 (Bochkovskiy et al., 2020) or EfficientDet (Tan et al., 2020), achieve an AP@75 of less than 60, however across many object classes and in very diverse scenes. Moreover, phase 2 of our deep learning pipeline may still predict actions correctly even if phase 1 performs slightly erroneous localization; that is, failures with respect to the AP metric may still produce cut-out images from which actions can be predicted reliably, for example, if the bounding box is slightly too big or part of the animal is truncated, which also occurs naturally due to truncation at the image borders. We conclude that our model performs the detection phase with great accuracy and robustness.
To put our action classification results into context, it is crucial to compare data variety and complexity. The data for our [...] Finally, for the biological key figures, our model recognizes most sequences correctly. More precisely, the few errors occurring during prediction seem to average out very well over multiple videos, also in the weak in-domain classification task. On this basis, our model can be used to automatically label raw data recordings from Elands 1-3 without further human supervision. With the application presented here, the bottleneck of many behavioral biology studies, namely manually evaluating a huge stock of recorded raw data, could be overcome.
We are confident that our methods transfer well to different studies; our high out-of-domain accuracy is a good indication of this. Hence, our approach may be used in black-box fashion by only adapting the postprocessing rules to the specifics of the animal's behavior. Moreover, once a well-performing system has been established, the amount of labeled data needed for fine-tuning is likely to decrease significantly, as is usual in transfer learning.

TABLE 6 Overview of the accuracy of the deep learning system in predicting the number of phases and the average duration per night for the three behavioral states. Note: The row "w/o pp" is the output after fusing streams 1 and 2, while row "pp" lists the prediction after postprocessing was applied. We report these quantities and their SEM for the pure in-domain, the weak in-domain, and the out-of-domain classification.
Machine learning applications have the potential to greatly expand the scope of ecological behavioral studies in this area (Christin et al., 2019), as large amounts of data can be analyzed in a reasonable time frame and the effort of manual analysis is drastically reduced (Tabak et al., 2018). [...] (Beery et al., 2018; Schneider et al., 2020); thus, we conjecture that a similar system might perform well in free-range observations but would require additional training data.

CONFLICT OF INTEREST
The authors declare that there are no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The Python code is available at https://github.com/Klimroth/Video-Action-Classifier-for-African-Ungulates-in-Zoos/tree/main/mrcnn_based and is also stored at figshare: https://doi.org/10.6084/m9.figshare.13526171.

ENDNOTES
1 We provide small Python scripts in our GitHub repository which might be useful for this task.

APPENDIX A DATA QUALITY
This section contains examples of image frames that are challenging for the deep learning pipeline. Figure A1 shows a blind spot in the enclosure of Eland 1, which is compensated for by the postprocessor: if the eland stays below the red line for at least 70 s, the behavioral state is assumed to be lying.
Further hard examples are given in Figure A2. Even when the object detector predicts the bounding box accurately, the high amount of truncation, resulting from a poor camera placement (due to a lack of better mounting options), makes the images challenging to classify.

FIGURE A1 Example of an event of high truncation. The image is likely to be misclassified as standing without postprocessing. The left-hand side shows the recorded image; the right-hand side shows the result of the object detection phase.

FIGURE A2 Example of images that are hard to classify due to the camera's position.

APPENDIX B ON THE USE OF OPTICAL FLOW AS SECOND STREAM INPUT
The current state-of-the-art approach to video action classification would use optical flow calculations in Stream 2 of the system to explicitly feed motion cues to the classifier. In the following, we report our results when applying this approach to the setting at hand. For calculating the optical flow, we used OpenCV's implementation of the Farnebäck algorithm for dense optical flow (Bradski, 2000; Farnebäck, 2003) with different parameter settings, that is, various window sizes (blurring vs. robustness) and Gaussian filters. The classification task was performed by a ResNet-101 CNN trained on the same training set as the system at hand. The validation accuracy only reached 0.57, and even the training accuracy did not pass 0.84.
At first glance, it is surprising that the optical flow stream was significantly outperformed by the multiframe-encoded setting, as the different postures we try to classify clearly deviate in the amount of motion the individual shows. However, we found that, with the large temporal gap of one second between two consecutive frames, background motion, such as floating dust, hay or straw, crossing insects, and brightness changes due to infrared emitters, produces plenty of spurious motion cues. See, for example, Figure B1, which shows the optical flow over five consecutive frames. As a result, the training signal stemming from the optical flow tended to be very brittle, which we believe is the reason for the poor performance, especially the poor validation performance. One might be able to improve on this with outlier rejection and other preprocessing steps, but such tuning would likely introduce a strong bias toward specifics of environmental variables such as the enclosure and the camera, and hence might generalize poorly to nights of new individuals. Therefore, we chose to continue with the multiframe encoding as the second stream instead, which proved to be more flexible and robust.

APPENDIX C MODEL AVERAGING
This section gives the implementation details of the model averaging described in Figure 2.

Single-frame classification
After the object detection phase, we are left with up to four images per 7-s time interval. Each of those images p is predicted by the first EfficientNet B3, yielding a distribution x_p = (x_{p,0}, x_{p,1}, x_{p,2}), where x_{p,0} is the probability that the animal is standing in the image, x_{p,1} that it shows LHU, and x_{p,2} that it shows LHD. Let x'_p = (x_{p,0}, x_{p,1}, x_{p,2}, 0) be the adjusted distribution in which the probability of being absent is set to zero. If the animal is detected in frame i during phase 1, we generate x'_i as described; otherwise, we set x'_i = (0, 0, 0, 1).
Next, a rolling average of order 16 is applied to x', which captures local temporal dependencies. Formally, the rolling average of order k generates a sequence of distributions x̃ such that x̃_i = (x̃_{i,0}, x̃_{i,1}, x̃_{i,2}, x̃_{i,3}) is given by

x̃_i ∝ Σ_{l=i−k+1}^{i} x'_l,

where ∝ stands for being proportional up to normalizing x̃_i back to a probability distribution. Finally, the prediction for time interval j is given as the average over the predictions on its contained images.
Thus, if interval j consists of frames i, i + 2, i + 4, i + 6, we set

y_j = (x̃_i + x̃_{i+2} + x̃_{i+4} + x̃_{i+6}) / 4.

Now, Stream 1 outputs a sequence y_1, …, y_m such that y_j describes the predicted probabilities for each behavior in time interval j.
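The two averaging steps above can be sketched in a few lines of NumPy. This is a minimal illustration with names of our own choosing; it is not taken from the released code.

```python
import numpy as np

def rolling_average(dists, k):
    """Rolling average of order k over a sequence of probability
    distributions (n x 4 array). Each output row is proportional to
    the mean of the current and up to k-1 preceding rows, renormalized
    back to a probability distribution."""
    n = len(dists)
    out = np.empty_like(dists, dtype=float)
    for i in range(n):
        window = dists[max(0, i - k + 1): i + 1]
        avg = window.mean(axis=0)
        out[i] = avg / avg.sum()  # normalize back to a distribution
    return out

def interval_prediction(smoothed, frame_ids):
    """Average the smoothed per-frame distributions belonging to one
    7-s interval, e.g. frames i, i+2, i+4, i+6."""
    return smoothed[frame_ids].mean(axis=0)

# Toy example: 8 frames, classes (standing, LHU, LHD, absent).
x = np.tile([0.7, 0.2, 0.1, 0.0], (8, 1))
x[3] = [0.0, 0.0, 0.0, 1.0]        # animal not detected in frame 3
smoothed = rolling_average(x, k=4)  # the paper uses order 16; 4 here for brevity
y = interval_prediction(smoothed, [0, 2, 4, 6])
print(np.round(y, 3))
```

Note how the single "absent" frame is smoothed out rather than dominating the interval prediction, which is the intended robustness effect of the rolling average.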

Four-frame classification
For the second stream, a second EfficientNet B3 produces a distribution x̂_j = (x̂_{j,0}, x̂_{j,1}, x̂_{j,2}, 0) per time interval j by predicting behaviors on four-frame-encoded input images, where x̂_j = (0, 0, 0, 1) if and only if the animal is not detected during phase 1 on any of the four images. As above, we then apply a rolling average, but now of order 4, so that in total it accounts for a similar time period as the rolling average in Stream 1, processing x̂ = (x̂_1, …, x̂_m) into the stream's outcome ŷ = (ŷ_1, …, ŷ_m).
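The two streams are subsequently fused before postprocessing (cf. Table 6). The exact fusion rule is not restated here, so the sketch below simply averages the two per-interval probability sequences and renormalizes; this is an assumption for illustration, and the released system may combine the streams differently.

```python
import numpy as np

def fuse_streams(y1, y2):
    """Hedged sketch: combine the per-interval distributions of the two
    streams by element-wise averaging and renormalization. The exact
    fusion rule of the released system may differ."""
    fused = (np.asarray(y1) + np.asarray(y2)) / 2.0
    return fused / fused.sum(axis=1, keepdims=True)

# Two intervals, classes (standing, LHU, LHD, absent).
y1 = np.array([[0.6, 0.3, 0.1, 0.0],
               [0.2, 0.5, 0.3, 0.0]])
y2 = np.array([[0.8, 0.1, 0.1, 0.0],
               [0.1, 0.7, 0.2, 0.0]])
fused = fuse_streams(y1, y2)
labels = fused.argmax(axis=1)  # 0=standing, 1=LHU, 2=LHD, 3=absent
print(labels)  # [0 1]
```

The fused sequence of labels is then what the postprocessing rules operate on.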

Postprocessing details
Besides the postprocessing rules listed in Table 3, enclosure-specific settings were incorporated into the postprocessor of Eland 1 (cf. Section 3.1). As the installed camera left a blind spot, the animal can be highly truncated (see Figure A1). As one would expect, the corresponding images were prone to misclassification. Since the object detection phase gives access to the coordinates of the drawn bounding box, it is possible to mark any frame in which the bounding box starts below a certain line (sketched as a red line in Figure A1) as truncated. Now, if a sequence of truncation is shorter than 10 time intervals, the sequence is labeled with the previously shown behavior. Otherwise, the assigned label is set to LHU, as it is very unlikely, due to the enclosure's design, that the animal was standing in the blind spot for a longer period of time.
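The blind-spot rule just described can be sketched as follows. Function and parameter names are ours; the threshold of 10 intervals and the fallback to LHU follow the text, while the label encoding is an assumption for illustration.

```python
def apply_blind_spot_rule(labels, truncated, max_short_run=10, lhu_label=1):
    """Enclosure-specific rule for Eland 1 (sketch, names are ours):
    a run of truncated intervals shorter than max_short_run keeps the
    previously shown behavior; a longer run is relabeled as LHU, since
    standing in the blind spot for a long time is unlikely."""
    labels = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if truncated[i]:
            j = i
            while j < n and truncated[j]:  # find end of truncated run
                j += 1
            if j - i < max_short_run:
                # short run: propagate the previously shown behavior
                prev = labels[i - 1] if i > 0 else lhu_label
                labels[i:j] = [prev] * (j - i)
            else:
                # long run: assume the animal is lying (LHU)
                labels[i:j] = [lhu_label] * (j - i)
            i = j
        else:
            i += 1
    return labels

# Example with labels 0=standing, 1=LHU, 2=LHD: a short truncated run
# inherits the preceding 'standing' label.
labels    = [0, 0, 2, 2, 0, 0]
truncated = [False, True, True, False, False, False]
print(apply_blind_spot_rule(labels, truncated, max_short_run=3))
# [0, 0, 0, 2, 0, 0]
```

A frame would be flagged as truncated whenever its bounding box starts below the red-line y-coordinate from the object detection output.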
FIGURE B1 Optical flow of a lying eland over five consecutive frames. While the spatial dimension does not change notably, the optical flow is very sensitive to the high amount of background noise.

APPENDIX D FURTHER EVALUATION RESULTS
Finally, we report the statistical key quantities of interest in the binary classification task in Table D1.

TABLE D1 Overview of the accuracy of the deep learning system predicting the number of phases as well as the average duration of each behavioral state in the binary classification task.

              Standing              Lying
              Real    Prediction    Real    Prediction
Pure in-domain classification Eland 1