A comparison of techniques for classifying behavior from accelerometers for two species of seabird

Abstract The behavior of many wild animals remains a mystery, as it is difficult to quantify behavior of species that cannot be easily followed throughout their daily or seasonal movements. Accelerometers can solve some of these mysteries, as they collect activity data at a high temporal resolution (<1 s), can be relatively small (<1 g) so they minimally disrupt behavior, and are increasingly capable of recording data for long periods. Nonetheless, there is a need for increased validation of methods to classify animal behavior from accelerometers to promote widespread adoption of this technology in ecology. We assessed the accuracy of six different behavioral assignment methods for two species of seabird, thick‐billed murres (Uria lomvia) and black‐legged kittiwakes (Rissa tridactyla). We identified three behaviors using tri‐axial accelerometers: standing, swimming, and flying, after classifying diving using a pressure sensor for murres. We evaluated six classification methods relative to independent classifications from concurrent GPS tracking data. We used four variables for classification: depth, wing beat frequency, pitch, and dynamic acceleration. Average accuracy for all methods was >98% for murres, and 89% and 93% for kittiwakes during incubation and chick rearing, respectively. Variable selection showed that classification accuracy did not improve with more than two (kittiwakes) or three (murres) variables. We conclude that simple methods of behavioral classification can be as accurate for classifying basic behaviors as more complex approaches, and that identifying suitable accelerometer metrics is more important than using a particular classification method when the objective is to develop a daily activity or energy budget. 
Highly accurate daily activity budgets can be generated from accelerometer data using multiple methods and a small number of accelerometer metrics; therefore, identifying a suitable behavioral classification method should not be a barrier to using accelerometers in studies of seabird behavior and ecology.

energetics, and the environment. Many GPS tracking studies collect locations at very short intervals to obtain detailed tracks, then infer animal behavior from path geometry (Mendez et al., 2017; Ryan, Petersen, Peters, & Grémillet, 2004; Wakefield, Phillips, & Matthiopoulos, 2009; Weimerskirch, Le Corre, & Bost, 2008). Pairing GPS and accelerometer sensors could reduce the frequency of required GPS fixes, extending the battery life for longer deployments without sacrificing detailed behavioral data.
Satellite and light-based tracking methods record locations with low temporal resolution (geolocators) and at irregular intervals (satellite transmitters), which precludes inference about detailed behavior. If these methods were coupled with accelerometers, then it would be possible to track species over large spatial scales for extended time periods with high temporal resolution. This type of detailed, long-term tracking of animal movements and behaviors will allow robust inference about animal ecology and how species interact with their environments (Cagnacci et al., 2010; Wakefield et al., 2009).
The ease with which biologists can deploy tracking devices to study the movements of wild animals has exceeded the ability of biologists to categorize, analyze, and interpret the volume of data these efforts have generated. Widespread adoption of accelerometers to measure animal behavior is inhibited by limited validation, which has contributed to a lack of consensus on analysis methods. A host of methods have been proposed for classifying animal behavior from accelerometer data (Appendix S1), including movement thresholds (Brown et al., 2013;Moreau, Siebert, Buerkert, & Schlecht, 2009;Shamoun-Baranes et al., 2012), histogram analysis (Collins et al., 2015), k-means (KM) cluster analysis (Angel, Berlincourt, & Arnould, 2016;Sakamoto et al., 2009), k-nearest neighbor analysis (Bidder et al., 2014), classification and regression trees (Shamoun-Baranes et al., 2012), neural networks (NN; Nathan et al., 2012;Resheff, Rotics, Harel, Spiegel, & Nathan, 2014), random forests (Bom, Bouten, Piersma, Oosterbeek, & van Gils, 2014;Nathan et al., 2012;Pagano et al., 2017), hidden Markov models (HMM; Leos-Barajas et al., 2016), expectation maximization (EM; Chimienti et al., 2016), and super machine learning (Ladds et al., 2017). At least three custom software applications are available for classifying animal behavior from trained accelerometer data: AcceleRater (Resheff et al., 2014), G-sphere (Wilson et al., 2016), and Ethographer (Sakamoto et al., 2009). Many of these methods use machine-learning techniques that are difficult to interpret because underlying processes are opaque. Numerous accelerometer-derived metrics have been employed as predictors in classification models, often without any critical evaluation of their value in improving classification accuracy. We reviewed 15 similar studies that classified animal behavior from accelerometers, to identify common accelerometer metrics used in classifications (Appendix S1). 
These studies used between 1 and 147 different variables in their classification models; the median number of parameters included was seven. Using large numbers of predictor variables may make classifications unnecessarily complex, potentially discouraging biologists from adopting this tool, and make methods developed on one data set less generalizable to other studies. Simpler approaches may appear inadequate in comparison to sophisticated analyses, while many complex methods can be difficult for most ecologists to implement.
Identifying an appropriate classification technique is further complicated because most methods are based on small sample sizes, with limited or no validation of classification accuracy. In a sample of 15 studies, only 10 attempted to validate their classifications, only six had sample sizes of more than 10 individuals from the same species, and five studies used data from <5 individuals from some species for analysis (Appendix S1). Many classification methods rely on training data acquired through direct observation of free-living (Nathan et al., 2012), domesticated (Moreau et al., 2009), or captive (Pagano et al., 2017) animals. Training data can be challenging or impossible to collect for wide-ranging species like seabirds, with some species travelling hundreds of kilometers in a single foraging trip. Observations of captive animals are unlikely to represent the full range of animal behavior for species that move over large spatial scales (Pagano et al., 2017). There is a need for robust unsupervised classification methods and for alternative approaches to developing training and validation data sets for species, such as most seabirds, that cannot be observed directly in the wild.
We compared six different methods for classifying behavior using accelerometer data from two seabird species: thick-billed murres (Uria lomvia) and black-legged kittiwakes (Rissa tridactyla). In this study, we focus on comparing methods for classifying the main behaviors (flying, swimming, on colony, and diving) that comprise a daily activity budget for two seabird species. Daily activity budgets have been widely used in studies of seabird behavior, energetics (Birt-Friesen, Montevecchi, Cairns, & Macko, 1989), and ecology (Furness & Camphuysen, 1997); identifying robust methods for calculating daily activity budgets from accelerometer data should contribute to wider application of this technology. Accelerometer deployments were paired with GPS data loggers, and GPS tracks were used to validate the accuracy of accelerometer-based classifications. High-resolution GPS data are already widely used for behavioral classification in free-living birds; thus, these data provide a good option for validating classifications on a large number of individuals engaging in a full range of natural activities. Our analysis focused on identifying coarse-scale behaviors: resting on colony, flying, swimming, and diving (for murres).
Quantifying these behaviors is useful for many seabird studies and these behaviors can be inferred from high-resolution GPS tracks. We compared overall accuracy and behavior-specific accuracy for each species. We also considered the effect of breeding stage (incubation vs. chick rearing) on classification accuracy; although behavior in general should not change between breeding stages, the frequency of different behaviors can change, and factors such as level of activity and posture while at the nest could change, affecting our ability to accurately identify these behaviors. To determine if classification method affects estimates of energy expenditure we also used daily activity budgets from each classification to calculate daily energy expenditure (DEE). Finally, we used variable selection to assess whether or not models using more predictor variables perform better than models with fewer variables and to identify the variables that make the greatest contribution to improvements in classification accuracy for each species.

| Tagging methods
We deployed GPS-accelerometers (Axy-trek; Technosmart, Rome, Italy; 18 g) on 21 incubating and 19 chick-rearing murres breeding at Coats Island, in 2018. Murres were captured using a noose pole and biologgers were attached to the back feathers using TESA tape (TESA 4651, Hamburg, Germany). Murres were released at the capture site and re-captured between 2 and 4 days later to retrieve data loggers. The biologgers were programmed to collect GPS locations at 1 min intervals, depth at 0.1 m resolution and 1 Hz intervals, acceleration in three axes at 25 Hz, and temperature at 1 Hz. Note that deployment of similar tags altered dive duration, flight costs, and chick feeding rates (Elliott, Davoren, & Gaston, 2007; Elliott, Vaillant, et al., 2014). As all individuals should be similarly impacted, these tag effects should not affect the results of this study.
We deployed tri-axial accelerometers (Axy-3; Technosmart; 3.2 g), paired with GPS biologgers (CatTraQ; Catnip Technologies, USA; 14 g), on black-legged kittiwakes at Middleton Island, Alaska, USA, in 2013. Data were collected from 17 incubating and 19 chick-rearing kittiwakes. Both biologgers were attached to the back feathers of kittiwakes using TESA tape (TESA 4651). Kittiwakes were released at the capture site and re-captured between 1 and 3 days later to retrieve data loggers. The biologgers were programmed to collect GPS locations at 30 s intervals and tri-axial acceleration at 25 Hz. Deployment of these tags had no impact on reproductive success and survival, but altered flight duration (Chivers, Hatch, & Elliott, 2016). As all individuals should be similarly impacted, these tag effects should not affect the results of this study.

| Accelerometer-derived metrics
We focused on three types of accelerometer-derived metrics for behavior classifications: wing beat frequency (WBF), pitch, and dynamic acceleration. We chose variables that we thought would be related to the target behaviors based on our prior knowledge of the study species. We calculated WBF by extracting the dominant frequency in the Z-axis using a Fast Fourier Transform (FFT) over a 5-s moving window. The FFT was performed using the "fft" function

TABLE 1 Accelerometer-derived metrics calculated prior to behavioral classifications. Only pitch, SD Z, SD ODBA, WBF, and depth were used in classifications; other statistics shown were calculated to obtain final classification parameters

Statistic: Description
Static acceleration: Average acceleration in each axis, calculated over a 2-s moving window
Pitch: Vertical orientation of the body angle
SD Z: Variation in the dynamic acceleration in the Z-axis
SD ODBA: Standard deviation of overall dynamic body acceleration

Pitch measures vertical body angle based on the static acceleration (acceleration averaged over time) of all three axes (Table 1).
We expected pitch to change between different behaviors, because the body angle of a bird will change between time on land, swimming, and flight. All pitch values were corrected for differences in device orientation by standardizing acceleration measurements to a pitch of 0° for periods of presumed flight (WBF between 6-9 Hz for murres and 3-6 Hz for kittiwakes) (Elliott, Chivers, et al., 2014), when all birds should have a similar and consistent body orientation (Chimienti et al., 2016;Watanuki et al., 2003).
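As a rough illustration of the pitch calculation described above, the following Python sketch derives static acceleration with a moving average and converts it to a body angle. The arctangent formula, the choice of X as the head-to-tail axis, the function names, and the use of a simple offset subtraction for the flight correction are all assumptions made for illustration; the equations from Table 1 are not available here, and the study itself was carried out in R.

```python
import math

def moving_mean(values, window):
    """Centered moving average; windows are truncated at the ends of the series."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def pitch_degrees(static_x, static_y, static_z):
    """Body angle in degrees from static acceleration.

    Assumed formula: atan(X / sqrt(Y^2 + Z^2)), with X the head-to-tail axis.
    """
    return [math.degrees(math.atan2(sx, math.hypot(sy, sz)))
            for sx, sy, sz in zip(static_x, static_y, static_z)]

def correct_pitch(pitch, in_flight):
    """Shift pitch so its mean during presumed flight is 0 degrees.

    This offset subtraction is a simplification of standardizing the
    acceleration measurements themselves.
    """
    flight = [p for p, f in zip(pitch, in_flight) if f]
    offset = sum(flight) / len(flight)
    return [p - offset for p in pitch]
```

With 25 Hz data, a 2-s window corresponds to `moving_mean(axis, 50)` applied to each axis before calling `pitch_degrees`.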
Dynamic body acceleration integrates the amount of dynamic acceleration (i.e., after removing the static component due to gravity and associated with posture) over a fixed time period, and can be used as an index of movement. Dynamic body acceleration can be measured along each axis individually, or as a composite of all three axes using overall dynamic body acceleration (ODBA, Table 1). For murres, we used the standard deviation of the overall dynamic body acceleration (SD ODBA) as a measure of overall activity level. For kittiwakes, initial data exploration indicated that there was greater relative variability in the Z-axis than in the ODBA; therefore, we used the standard deviation in the Z-axis (SD Z) to measure activity level. Table 1 describes the metrics calculated from the accelerometer data; all of these metrics have been used in prior studies classifying animal behavior from accelerometers (Chimienti et al., 2016; Pagano et al., 2017; Shamoun-Baranes et al., 2012). Murre classifications also used depth to identify periods of diving. We calculated pitch and dynamic acceleration using a 2-s moving window (Shepard, Wilson, Halsey, et al., 2008) and WBF using a 5-s window, for both species. Once accelerometer statistics were calculated, we subsampled all data to 1 s intervals to reduce processing time during classification, and because our behaviors of interest occurred at intervals >1 s. All summary statistics are reported as mean ± SD.
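The WBF extraction can be sketched as finding the frequency with the greatest spectral power in one window of Z-axis data. The Python example below uses a plain discrete Fourier transform from the standard library rather than R's FFT, so it is slower but gives the same dominant frequency for a single window; the function name and the mean-centering step are illustrative assumptions.

```python
import cmath
import math

def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) with the highest power in one window.

    The mean is removed first so that the static (gravity) component
    does not dominate the spectrum.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    best_k, best_power = 0, 0.0
    # Only frequencies up to half the sampling rate can be resolved.
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate_hz / n
```

For a 5-s window at 25 Hz (125 samples) the frequency resolution is 0.2 Hz, which is fine enough to separate the flight ranges reported for the two species.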

| Accelerometer track segmentation
We used a behavior-based track segmentation approach for classification (Bom et al., 2014; Collins et al., 2015). Cliff-nesting murres and kittiwakes must fly to travel between their nest site and foraging areas at sea; therefore, periods of flying should separate colony behavior from swimming behavior. For murres, dives are separated from flights by periods of swimming. We used this prior knowledge of seabird behavior to segment tracks into periods of consistent behavior. We first classified diving (murres) and flying (murres and kittiwakes) from the 1-s sampled data using each method. Any behavior that occurred for <3 s was re-assigned to the previous behavior class and each period of presumed behavior was assigned a unique segment ID. For practical reasons, we imposed a maximum length of 120 s on each segment. This ensured that if a transition between behaviors was missed, the error would not propagate beyond 120 s. This upper limit also ensured that each type of behavior was represented proportionally in the data. Incubation bouts typically last for many hours, while bouts of flying or diving could last seconds or minutes, so although most of the birds spend a majority of their time at the nest, there would be relatively few bouts of colony behavior relative to other types of behavior. Within each segment, we recalculated movement metrics using mean pitch and mean dynamic acceleration.
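The segmentation rules above (bouts shorter than 3 s re-assigned to the previous behavior, and a 120-s cap on segment length) can be sketched as follows for 1-s data. This is a minimal Python illustration with hypothetical function names; how a short bout at the very start of a track is handled is not stated in the text, so here it is simply left unchanged.

```python
def smooth_short_bouts(labels, min_len=3):
    """Re-assign runs shorter than min_len samples to the preceding behavior.

    A short run at the very start of the track has no preceding behavior
    and is left unchanged (an assumption of this sketch).
    """
    out = list(labels)
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        if j - i < min_len and i > 0:
            out[i:j] = [out[i - 1]] * (j - i)
        i = j
    return out

def assign_segments(labels, max_len=120):
    """Give each run of identical labels a segment ID, splitting long runs."""
    ids = []
    seg_id = -1
    run_len = 0
    for i, lab in enumerate(labels):
        if i == 0 or lab != labels[i - 1] or run_len == max_len:
            seg_id += 1
            run_len = 0
        ids.append(seg_id)
        run_len += 1
    return ids
```

Mean pitch and mean dynamic acceleration can then be computed by grouping the 1-s records on the returned segment IDs.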

| Histogram segregation
We adapted a HS approach from Collins et al. (2015). We used density plots to visualize the distribution of each variable sequentially.
Characteristic peaks and valleys in the distribution were used to identify break points for different behaviors. Each behavior was classified using a stepwise approach: once locations had been assigned to a behavior, these locations were not considered for the next variable. We first classified "diving" (murres only) and "flying" using depth and WBF.
Accelerometer data were then broken down into segments of continuous behavior and we calculated average pitch and average dynamic acceleration within each segment. Remaining "unknown" segments were classified to "swimming" and "colony" based on peaks in histograms for these two variables. Each track was classified individually.
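In the study, break points were chosen by inspecting density plots. As a hedged illustration of the same idea, the Python sketch below locates the valley between the two main peaks of a histogram automatically. The peak-picking heuristic, the bin count, and the function name are assumptions of this sketch and not part of the published method.

```python
def histogram_breakpoint(values, n_bins=20):
    """Return a break point at the valley between the two main histogram peaks."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be identical")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    # First peak: the tallest bin. Second peak: tall and far from the first.
    first = max(range(n_bins), key=lambda i: counts[i])
    second = max(range(n_bins), key=lambda i: counts[i] * (i - first) ** 2)
    left, right = sorted((first, second))
    between = range(left + 1, right)
    if not between:
        # Adjacent peaks: use the boundary between the two bins.
        return lo + right * width
    lowest = min(counts[i] for i in between)
    tied = [i for i in between if counts[i] == lowest]
    valley = tied[len(tied) // 2]
    return lo + (valley + 0.5) * width
```

Applied to WBF, values above the returned break point would be labelled "flying"; the same function could then be applied to segment-level pitch or dynamic acceleration for the remaining "unknown" segments.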

| Neural network
We used the classifications from the HS method to train the NN models. We did not use the GPS data for training the model because we wanted to test classification approaches that could be applied when GPS data are not available for model training. We randomly chose ten tracks for each species, then randomly selected 1,000 data points within each behavior class from each of these tracks to make a training data set. This trained model was used to predict classifications for all tracks within each data set. NN models were run with five hidden nodes using the R package "nnet," version 7.3-12 (Venables & Ripley, 2002).
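The construction of the training data set can be sketched as a two-stage random sample: tracks first, then a fixed number of points per behavior class within each track. The Python example below is a minimal illustration; the data layout (a dictionary of tracks holding (features, label) rows), the fixed seed, and the handling of classes with fewer points than requested are assumptions.

```python
import random

def build_training_set(tracks, n_tracks=10, n_per_class=1000, seed=1):
    """Sample tracks, then sample rows within each behavior class.

    tracks: dict mapping track ID -> list of (features, label) rows.
    Classes with fewer than n_per_class rows contribute all of their rows.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(tracks), min(n_tracks, len(tracks)))
    training = []
    for track_id in chosen:
        by_class = {}
        for row in tracks[track_id]:
            by_class.setdefault(row[1], []).append(row)
        for label in sorted(by_class):
            rows = by_class[label]
            training.extend(rng.sample(rows, min(n_per_class, len(rows))))
    return training
```

Sampling an equal number of points per class keeps rare behaviors, such as short flights, from being swamped by long bouts at the colony.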

| Random forest (RF)
The random forest (RF) method used the same training data set described above for the NN model. We ran the RF models using the R package "randomForest," version 4.5-14 (Liaw & Wiener, 2002).

| Accelerometer classification-unsupervised
We also used three unsupervised classification methods: KM cluster analysis, EM, and HMM. For each method, we ran analysis with between three and six classes and visually examined the classifications to decide on the number of classes that best identified the behaviors of interest. When we identified more than three (kittiwakes) or four (murres) behavior classes, classes were grouped into the behaviors of interest based on expected patterns in behavior.

| k-Means
The KM classification was performed in two steps. For murres, dives were identified manually by classifying all data with depth below −1 m as "diving." A KM classification was performed on WBF to identify two classes, and the class with higher WBF was labelled as "flying." We then segmented all data into bouts of "diving" (murres only), "flying" and "unknown" behavior. Within segments of continuous behavior, we calculated the average pitch and dynamic acceleration. A second KM classification was performed on the remaining "unknown" segments with average pitch and dynamic acceleration as input variables. We used the natural logarithm of dynamic acceleration, and both variables were scaled to their range prior to analysis. The KM classification was performed on all tracks at once. Analysis was run using the "kmeans" function in base R.
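The first KM step (two classes on WBF, with the higher-WBF class labelled "flying") and the range scaling used before the second step can be sketched in Python as below. This is a plain one-dimensional k-means written for illustration, not R's "kmeans" function; the quantile-based starting centers and the function names are assumptions.

```python
def kmeans_1d(values, k=2, n_iter=100):
    """Simple one-dimensional k-means; returns (centers, assignments)."""
    srt = sorted(values)
    # Start from evenly spaced quantiles so the result is deterministic.
    centers = [srt[int((i + 0.5) * len(srt) / k)] for i in range(k)]
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    assign = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
    return centers, assign

def label_flying(wbf):
    """Label the cluster with the higher mean WBF as flying."""
    centers, assign = kmeans_1d(wbf, k=2)
    flying_class = max(range(2), key=lambda c: centers[c])
    return [a == flying_class for a in assign]

def scale_to_range(values):
    """Rescale values to the interval 0-1 using their observed range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

The second step would apply a two-variable k-means to range-scaled mean pitch and log dynamic acceleration for the "unknown" segments.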

| Expectation maximization
The EM classification was performed in two steps. For murres, dives were identified manually by classifying all data with depths below −1 m as "diving." An EM classification was performed on WBF to identify two classes; the class with higher WBF was labelled as "flying." We then segmented all data into bouts of "diving" (murres only), "flying" and "unknown" behavior. Within segments of continuous behavior, we calculated the average pitch and dynamic acceleration.
A second EM classification was performed on the remaining "unknown" segments, with average pitch and dynamic acceleration as input variables. We used the natural logarithm of dynamic acceleration, and both variables were scaled to their range prior to analysis.
EM classification was performed on all tracks for each species at once. EM analysis was conducted using the R package "Rmixmod," version 2.1.1 (Langrognet, Lebret, Poli, & Iovleff, 2016). We considered Gaussian models with free proportions; BIC was used to identify the best model.
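To show what the EM step does, the following Python sketch fits a two-component Gaussian mixture with free proportions to a single variable such as WBF. It is a bare-bones stand-in for the Rmixmod fit: model selection by BIC is not implemented, and the starting values, the floor on the standard deviation, and the function names are assumptions.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_gaussians(values, n_iter=200, min_sd=1e-3):
    """Fit a two-component 1-D Gaussian mixture by expectation maximization."""
    n = len(values)
    srt = sorted(values)
    means = [srt[n // 4], srt[(3 * n) // 4]]
    grand_mean = sum(values) / n
    sd0 = max(math.sqrt(sum((v - grand_mean) ** 2 for v in values) / n), min_sd)
    sds = [sd0, sd0]
    props = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp = []
        for v in values:
            w = [props[c] * normal_pdf(v, means[c], sds[c]) for c in range(2)]
            total = sum(w) or 1e-300
            resp.append([wc / total for wc in w])
        # M-step: update proportions, means, and standard deviations.
        for c in range(2):
            weight = sum(r[c] for r in resp)
            if weight == 0:
                continue
            props[c] = weight / n
            means[c] = sum(r[c] * v for r, v in zip(resp, values)) / weight
            var = sum(r[c] * (v - means[c]) ** 2 for r, v in zip(resp, values)) / weight
            sds[c] = max(math.sqrt(var), min_sd)
    return props, means, sds, resp
```

Each point is then assigned to the component with the larger responsibility, and the component with the higher mean WBF is labelled "flying."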

| Hidden Markov models
Hidden Markov models require data that are sampled at equal intervals; for this reason, we did not use the track segmentation approach described above. Instead, average accelerometer values for WBF, pitch, dynamic acceleration and depth were taken for 5-s intervals (murres) and 10-s intervals (kittiwakes).
Because the GPS data were collected at a lower temporal resolution (60 s for murres and 30 s for kittiwakes) than the accelerometer analysis (1 s), the GPS classification would be slower to respond to a change in behavior. For example, a murre that transitions from flying to swimming halfway between two GPS fixes would still be classified as flying at the next location, whereas the accelerometer could detect this change in behavior at the time it occurred.
To deal with this difference in sampling rate, we identified periods when the GPS indicated a transition from one behavior to another.
All data points within 60 s of a GPS transition between colony, flying, or swimming were labelled as transitions and excluded from further analysis. Transitions between diving and swimming were not excluded, because the pressure sensor collected depth data at 1 s intervals. In total, 11.0% of GPS locations were excluded for murres because they were identified as periods of transition between behaviors.
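The exclusion of transition periods can be sketched as a buffer around each change in GPS-derived behavior, with dive and swim changes exempt. In this Python illustration, the transition time is taken to be the time of the first fix showing the new behavior, which is an assumption; the function and argument names are also hypothetical.

```python
def flag_transitions(times_s, behaviors, buffer_s=60.0,
                     exempt=frozenset({"diving", "swimming"})):
    """Flag points within buffer_s of a change in GPS-classified behavior.

    Changes between the two behaviors in `exempt` are not treated as
    transitions, because depth was recorded at 1-s intervals.
    """
    transition_times = []
    for i in range(1, len(behaviors)):
        pair = {behaviors[i - 1], behaviors[i]}
        if len(pair) == 2 and pair != exempt:
            transition_times.append(times_s[i])
    return [any(abs(t - tt) <= buffer_s for tt in transition_times)
            for t in times_s]
```

Flagged points would be dropped before comparing accelerometer and GPS classifications.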

| Black-legged Kittiwake
GPS data were used to validate kittiwake behavior classifications. Locations requiring a ground speed >20 m/s or with more than 10 min between fixes were excluded from analysis (0.4% of locations), because these were potential GPS errors. Locations

| Classification accuracy
We subsampled the accelerometer data to 1 min (murres) and 30 s (kittiwakes) to match the resolution of the GPS data and used a confusion matrix to calculate the overall accuracy and the balanced accuracy for each behavior. Confusion matrices and measures of accuracy were calculated using the R package caret (Kuhn, 2016). We used mixed-effects models, with bird identity as a random effect, to test for differences in the classification accuracy among methods and between breeding stages. Accuracy data were logit transformed prior to analysis. We used the R package nlme (Pinheiro, Bates, DebRoy, Sarkar, & R Core Team, 2018) to run the models and the lsmeans package (Lenth, 2016) to calculate parameter estimates and 95% confidence intervals (CI) and to make pairwise comparisons.
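The accuracy measures can be illustrated with a short Python sketch. Balanced accuracy for a behavior is taken here as the mean of sensitivity and specificity for that behavior against all others, which is how it is commonly defined; the function names and the small clamp applied before the logit transform are assumptions.

```python
import math
from collections import Counter

def confusion_counts(reference, predicted):
    """Count (reference, predicted) pairs; this is the confusion matrix."""
    return Counter(zip(reference, predicted))

def overall_accuracy(reference, predicted):
    """Proportion of points where the two classifications agree."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

def balanced_accuracy(reference, predicted, behavior):
    """Mean of sensitivity and specificity for one behavior versus the rest.

    Assumes the behavior and at least one other behavior occur in reference.
    """
    tp = fn = fp = tn = 0
    for r, p in zip(reference, predicted):
        if r == behavior:
            tp += p == behavior
            fn += p != behavior
        else:
            fp += p == behavior
            tn += p != behavior
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def logit(p, eps=1e-6):
    """Logit transform, clamped so that accuracies of 0 or 1 stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))
```

Here `reference` would hold the GPS-based classification and `predicted` the accelerometer-based classification at matching times.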

| Daily energy budget
We used an estimate of DEE to look at the overall variation among classification methods. DEE (in kJ/d) for murres was calculated fol- from CO2 production rates (ml CO2 g−1 hr−1) to kJ using an energetic equivalent of 27.33 kJ/L CO2, assuming an average kittiwake mass of 416 g (Jodice et al., 2003; Speakman, 1997). We used mixed-effects models, with bird ID as a random effect, to test for differences in DEE estimates among methods.
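The unit conversion implied by the values above (a mass-specific CO2 production rate, a body mass of 416 g, and 27.33 kJ per litre of CO2) can be written out as a short Python sketch. The function name and the example rate of 3.0 ml CO2 g−1 hr−1 are hypothetical and chosen only to show the arithmetic.

```python
def co2_rate_to_dee_kj_per_day(rate_ml_per_g_hr, mass_g=416.0, kj_per_litre=27.33):
    """Convert a CO2 production rate (ml CO2 per g per hr) to kJ per day."""
    ml_per_day = rate_ml_per_g_hr * mass_g * 24.0  # whole-animal daily volume
    litres_per_day = ml_per_day / 1000.0
    return litres_per_day * kj_per_litre

# Hypothetical rate, for illustration only.
example_dee = co2_rate_to_dee_kj_per_day(3.0)
```

For the hypothetical rate, the steps are 3.0 × 416 × 24 = 29,952 ml/d, or 29.952 L/d, which multiplied by 27.33 kJ/L gives about 818.6 kJ/d.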

| Variable selection
We chose 42 accelerometer statistics used in previous studies (Appendix S1) to consider in our variable selection analysis; these included raw acceleration values, static acceleration, dynamic acceleration, minimum, maximum, range, skew, and kurtosis for each axis. We also calculated the trend, as the slope coefficient from a linear regression, and autocorrelation, as the value of the first order autocorrelation function. Each of these statistics was calculated over a 2-s moving window. Finally, we included the dominant frequency for each axis calculated over a 5-s moving window.
We used random forests models to identify which variables contributed the most to classification accuracy and how much adding additional variables improved accuracy. To simulate a realistic training data set, acquired through paired GPS-accelerometer deployments, we trained and tested data from the classified GPS tracks using a random subset of 10 individual tracks for each species. From these tracks, we sub-sampled 1,000 locations from each behavior class to ensure each behavior was adequately represented in the training data. We used forward selection to identify which accelerometer variables provided the greatest improvement in classification accuracy for models with between 1 and 10 vari-
Flying segments had high WBF (8.1 ± 0.25 Hz). Diving segments were characterized by depths below −1 m (−20.5 ± 9.0 m). Figure 2 shows the hierarchical process and average breakpoints used for assigning behaviors with the HS method. We used five total classes in the KM classification for murres: two colony, one diving, one flying, and one swimming class. For the EM and HMM classifications, only four classes were necessary to obtain a clear separation of all four behaviors, based on visual examination of the classifications.

| Kittiwakes
Colony segments were characterized by high pitch (29.9 ± 11.7°; Figure 3) and low SD Z (0.04 ± 0.02 g). Swimming was characterized by low pitch (5.7 ± 2.9°) and high SD Z (0.18 ± 0.04 g). Flying segments had high WBF (4.16 ± 0.16 Hz). The HS method began by classifying flight with WBF, then colony with SD Z, and finally swimming with pitch (Figure 4). We used four total classes in the KM, EM, and HMM classifications for kittiwakes: two colony classes, one flying class, and one swimming class.

| Murres
Mean classification accuracy for each method was >98.3% and accuracy for each individual track was above 92.7% for all methods (Figure 5). There was no statistical support for a difference in accuracy among classification methods (F 5,190 = 1.28, p = 0.28).
Averaging across breeding status, accuracy was highest using the
There was a significant interaction between method and behavior (F 15,894 = 23.6, p < 0.001; Figure 6).

| Kittiwakes
There was strong evidence for a difference in classification accuracy among methods (F 5,170 = 6.21; p < 0.001) and between breeding stages (F 1,34 = 9.41; p = 0.004), but there was no support for an interaction between method and breeding stage (F 5,170 = 0.41; p = 0.84; Figure 5). Averaging across all methods, accuracy during

| Thick-billed murres
There was a significant difference in estimates of DEE among methods (F 5,190

| Kittiwakes
Breeding status had a significant effect on DEE (F 1,37 = 23.5,
and high values during swimming segments (0.7 ± 0.12). As with the initial classification methods, WBF was high during periods of flight and low during periods of swimming or periods on the colony.

| Thick-billed murres
Our original model using WBF, pitch and SD Z had comparable accuracy, 92.5% (CI = 90.4%-93.6%), to the top two-variable model identified through variable selection. ACF Z appeared to measure differences in activity in kittiwake behavior that were not apparent in pitch or SD Z. For both pitch and SD Z, average values for colony were more similar to those for swimming than to those for flying, whereas average values of ACF Z for colony and swimming were more distinct from each other. Since our original model had lower accuracy for swimming and colony behavior, at least during incubation, ACF Z may provide better classification for these behaviors.

| DISCUSSION
We found high classification accuracy using a small number of accelerometer-derived metrics to identify coarse-scale animal behavior.
Accuracy was robust to choice of classification method. Although there were statistically significant differences in classification accuracy for the methods tested, average accuracy of all methods was high (98% murres, 91% kittiwakes). There were no differences in mean accuracy among methods for murres and relatively small differences in mean accuracy among methods for kittiwakes. Choice of classification method appears to have little impact on classification results. Any of the methods described here should provide a robust classification of the principal behavior types for murres and kittiwakes. We expect these results to be largely transferable to other species in the same families, and potentially more broadly applicable to other waterbirds that use flapping flight.
We were able to achieve highly accurate and consistent results across all methods using a small set of predictor variables. For both species, including more than two or three predictor variables gave no significant improvement in classification accuracy. Many other studies, particularly those using machine learning methods, include large numbers of predictor variables (Ladds et al., 2017;Nathan et al., 2012). We found that limiting the number of variables greatly reduced analysis time, because files are smaller and models are simpler. Resulting classifications are easier to interpret, especially for unsupervised classifications, because they are based on fewer predictors with an a priori relationship to behavior.
More importantly, we have shown that similar variables (pitch, dynamic acceleration, and WBF) can be used to classify the behavior of two different seabird species. The predictor variables we selected are likely to be useful in classifying coarse-scale behaviors for a wide range of species, because changes in pitch, dynamic acceleration, and periodicity are fundamental components of all activity. Even in non-flying species, locomotion (walking, running, and swimming) should have a distinct signature in the frequency domain, which would help identify this type of behavior.
Measures of pitch, dynamic acceleration, and frequency should be a good starting point in any behavioral classification. Although our variable selection identified another variable, ACF Z, that performed slightly better in classifying behavior for kittiwakes, the difference in average accuracy from using this variable was minimal. In the absence of training data to conduct similar variable selection, the types of accelerometer statistics we selected a priori for our models are likely to be effective in classifying basic behavior for a range of species.

FIGURE 7 Change in thick-billed murre (left) and black-legged kittiwake (right) behavior classification accuracy with additional variables included in random forest models using a forward selection procedure. Black points are medians and error bars are 95% confidence intervals

That classification accuracy was consistently high is perhaps not a surprising result. Many studies have found higher accuracy when only a small number of general behaviors is considered (Hammond, Springthorpe, Walsh, & Berg-Kirkpatrick, 2016; Ladds et al., 2017; Shamoun-Baranes et al., 2012).
(Elliott et al., 2013; Pennycuick, 1987). As a result, murres only use flapping flight, which is easily defined from accelerometer profiles.
Kittiwakes have much lower wing loading and lower wing beat frequencies (Jodice et al., 2006;Pennycuick, 1987). Murres make rapid, directed flights with few landings on the water, which helps to distinguish flight from swimming in GPS tracks. The more agile kittiwakes change direction and make short, frequent landings while visually searching for prey, which would create more overlap in ground speeds measured by GPS. Simultaneous deployments of GPS-accelerometers with salinity loggers or a magnetometer could help improve validation of kittiwake behavior classifications and identify accelerometer measures characteristic of gliding flight.
In principle, there should be no difference in the behaviors we classified between incubation and chick rearing, because all of these behaviors occur in all stages of the annual cycle. However, we did find it was more difficult to classify swimming and colony behavior accurately for incubating kittiwakes than for chick-rearing kittiwakes.
For both species, swimming was primarily differentiated from colony using differences in dynamic acceleration and pitch. Kittiwakes build a nest structure to hold their eggs and can be quite active in shifting positions and turning eggs within their nest cup. This activity at the nest and changes in pitch during incubation may have made it more difficult to differentiate incubation from swimming consistently.
Additionally, during incubation kittiwakes may spend more time resting on the water, which would have relatively low dynamic acceleration compared to active foraging on the water, making it more difficult to discern from time spent at the nest. Variable selection analysis found that ACFz was a stronger predictor of behavior for kittiwakes than either pitch or SD Z . ACF Z showed strong differentiation between swimming and colony, making it potentially a more useful variable in classifications for kittiwakes.
For any behavioral classification, the position of the data logger on the animal could influence the utility of certain acceleration measures. For example, a logger mounted on the tail or legs would have a different pitch signature than a logger mounted on the back or stomach, and may show different patterns of dynamic acceleration from the main body. Additionally, variation in how loggers are attached to individual animals can influence the ability to identify different behaviors between tracks. Indeed, in our data the differences in classification accuracy among individuals were significantly larger than the differences in classification accuracy among methods. Therefore, there should be careful consideration of logger position, and consistency in logger attachment, during study design, implementation, and data analysis.
By using a training data set for the RF and NN methods that only included a sub-sample of individuals, we demonstrated that data from a small number of individuals were transferable to a larger sample of individuals. Acquiring training data for wide-ranging species like seabirds is an impediment to using supervised classification methods for labelling behaviors. We have demonstrated that a simple supervised classification method can be used to build a training data set for basic behaviors in seabirds. The NN and RF approaches have the advantage that classifications can be fully automated without any user input once a training data set has been developed. The use of machine learning techniques for classification of wide-ranging species can be limited by the challenges of developing a training data set. With large data sets, a training data set could be developed based on a subsample of data using any of the other four methods described here, and a model based on this training data could be used to classify remaining data.
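The idea of training on a few individuals and classifying the rest can be sketched as follows. This is a minimal illustration in Python, not the archived R workflow: the data are synthetic, the feature means in `BEHAVIOR_MEANS` are hypothetical values chosen only for the example, and a nearest-centroid classifier stands in for the RF and NN models.

```python
import random
from collections import defaultdict

# Hypothetical per-behavior feature means, for illustration only:
# (wing beat frequency in Hz, pitch in degrees, SD of z-axis acceleration in g).
BEHAVIOR_MEANS = {
    "flying": (8.0, 10.0, 0.60),
    "swimming": (0.0, 0.0, 0.15),
    "colony": (0.0, 35.0, 0.03),
}


def simulate_bird(rng, n_per_behavior=50):
    """Return a list of (features, label) rows for one synthetic individual."""
    pitch_offset = rng.gauss(0.0, 3.0)  # mimics variation in logger attachment
    rows = []
    for label, (wbf, pitch, sd_z) in BEHAVIOR_MEANS.items():
        for _ in range(n_per_behavior):
            features = (
                max(0.0, wbf + rng.gauss(0.0, 0.3)),
                pitch + pitch_offset + rng.gauss(0.0, 5.0),
                max(0.0, sd_z + rng.gauss(0.0, 0.02)),
            )
            rows.append((features, label))
    return rows


def fit(rows):
    """Standardize features, then store one centroid per behavior."""
    columns = list(zip(*(features for features, _ in rows)))
    means = [sum(c) / len(c) for c in columns]
    sds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5
           for c, m in zip(columns, means)]
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append([(x - m) / s for x, m, s in zip(features, means, sds)])
    centroids = {label: [sum(c) / len(c) for c in zip(*scaled)]
                 for label, scaled in grouped.items()}
    return means, sds, centroids


def predict(model, features):
    """Assign the behavior whose centroid is closest in standardized space."""
    means, sds, centroids = model
    z = [(x - m) / s for x, m, s in zip(features, means, sds)]
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(z, centroids[label])))


rng = random.Random(1)
birds = [simulate_bird(rng) for _ in range(10)]
train_rows = [row for bird in birds[:3] for row in bird]  # 3 training individuals
test_rows = [row for bird in birds[3:] for row in bird]   # 7 held-out individuals

model = fit(train_rows)
correct = sum(predict(model, features) == label for features, label in test_rows)
accuracy = correct / len(test_rows)
print(f"held-out accuracy: {accuracy:.3f}")
```

Splitting by individual rather than by observation is the relevant design choice here: it checks whether a model transfers to birds it has never seen, which is the situation faced when only a few individuals can be labelled.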
Wing beat frequency was an important variable in our classifications. Estimating WBF from accelerometer data requires a sampling frequency that is at least twice the expected WBF (or equivalent movement pattern) of the focal species (the Nyquist rate). WBF also has many ecological applications, such as estimating changes in mass after a foraging bout (Sato, Daunt, Watanuki, Takahashi, & Wanless, 2008) and measuring changes in flight costs associated with environmental conditions (Elliott et al., 2013). Flapping flight is one of the most energetically expensive behaviors for seabirds, so accurately quantifying this behavior is important for energetic estimates. We recommend that accelerometer studies on seabirds use a sampling frequency that will allow estimation of WBF, which is consistent with other authors' recommendations for sampling frequencies to adequately sample dynamic body acceleration (Gómez Laich, Wilson, Gleiss, Shepard, & Quintana, 2011). For behavioral classifications, we cannot perceive any strong rationale for sampling at frequencies higher than 2-3 times the expected WBF of a focal species.
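The sampling requirement can be illustrated with a small Python sketch. It assumes a synthetic z-axis signal with a hypothetical 8 Hz wing beat and estimates WBF as the dominant peak of a discrete Fourier transform; the function names and signal parameters are illustrative and are not taken from the archived scripts.

```python
import cmath
import math


def dominant_frequency(signal, fs):
    """Return the frequency (Hz) of the largest non-zero-frequency DFT peak."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the static (gravity) component
    best_k, best_power = 1, -1.0
    # Only bins up to fs / 2 carry unique information for a real signal.
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs / n


def simulate_heave(wbf, fs, seconds=4.0):
    """Synthetic z-axis acceleration (g): gravity plus a sinusoidal wing beat."""
    n = int(fs * seconds)
    return [1.0 + 0.5 * math.sin(2 * math.pi * wbf * t / fs + 0.3)
            for t in range(n)]


true_wbf = 8.0  # Hz, hypothetical value for illustration
est_ok = dominant_frequency(simulate_heave(true_wbf, fs=25.0), fs=25.0)
est_aliased = dominant_frequency(simulate_heave(true_wbf, fs=10.0), fs=10.0)
print(est_ok, est_aliased)
```

Sampled at 25 Hz, the 8 Hz wing beat is recovered. Sampled at 10 Hz, which is below twice the wing beat frequency, the peak appears at the aliased frequency of 2 Hz, so the estimated WBF is wrong even though the signal looks periodic.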
Coarse-scale behavior identification, like the approaches demonstrated here, could be a first step in a hierarchical process of identifying fine-scale behaviors (Leos-Barajas et al., 2017). Several studies have been successful in distinguishing general behaviors, like the behaviors identified in this paper, but have been less successful in effectively classifying finer-scale behaviors associated with prey capture, prey handling and self-maintenance (Hammond et al., 2016; Ladds et al., 2017; Shamoun-Baranes et al., 2012). An initial partitioning into general behavior classes may simplify the process of defining detailed behavior profiles, especially where these behaviors occur as a subset within more coarse-scale behavior. While our results show that accurate classification of basic seabird behaviors can be developed using simple methods and a small group of accelerometer statistics, identifying fine-scale behavior may require independently collected training data, and a larger suite of predictor variables, to capture the unique characteristics of less common behaviors.

CONCLUSION
Obtaining reliable activity budgets from free-ranging animals is important for addressing a wide range of questions in wildlife ecology and animal behavior. Combined with methods for tracking animal location, behavioral classification from accelerometers could be used to examine the relationship between behavior and environmental conditions over large spatial and temporal scales. We believe that uncertainty about how to classify behavior from accelerometers has been a barrier to wider use of this technique. Our results demonstrate that general behaviors of seabirds can be classified from acceleration profiles using a range of techniques and a small number of predictor variables. Choice of classification method had a negligible effect on accuracy; therefore, researchers should not be impeded by a need to develop and apply the most advanced classification method, as multiple methods can provide similar results when classifying a small number of common behaviors. However, this finding may not hold in cases where the objective is to identify more detailed types of behavior than the broad classes considered here. Where the goal of classification is to develop a daily activity budget or estimate DEE, simple classification methods are likely adequate, at least for waterbirds that primarily use flapping flight. Where the goal is to examine how different factors affect behavior, the HMM approach may be preferable because this approach can be used to directly test the effect of predictor variables on behavior.
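As a small illustration of the final step, a daily activity budget follows directly from the classified labels. The Python sketch below assumes one behavior label per fixed-length interval; the function name, the 60 s interval, and the example day are hypothetical and not drawn from our data.

```python
from collections import Counter


def activity_budget(labels, interval_s):
    """Hours spent in each behavior, given one label per fixed-length interval."""
    counts = Counter(labels)
    return {behavior: n * interval_s / 3600.0 for behavior, n in counts.items()}


# Hypothetical day: one label per 60 s interval (1,440 intervals in 24 h).
labels = ["colony"] * 900 + ["flying"] * 180 + ["swimming"] * 360
budget = activity_budget(labels, interval_s=60)
print(budget)
```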

ACKNOWLEDGEMENTS
R Foundation. We also thank two anonymous reviewers for their comments, which greatly improved this manuscript.

CONFLICT OF INTEREST
None declared.

DATA ACCESSIBILITY
Data used in this analysis and R scripts for behavioral classifications have been archived at https://datadryad.org/ (https://doi.org/10.5061/dryad.2hf101c).