Sperm whale response to tag boat presence : biologically informed hidden state models quantify lost feeding opportunities

Animal-attached sensors provide invaluable data to describe behavior of cryptic species, such as cetaceans, and are increasingly used to assess anthropogenic disturbance effects. Tag deployment and handling may itself alter the behavior of study animals and there is a need to assess if and when behavior recovers to an undisturbed level. Not all behavioral changes have fitness consequences, and our goal is to derive metrics that can be linked to fitness implications, such as time and energy allocation to different functional behaviors. Here we detail an approach that incorporates biological knowledge and multiple streams of tag-recorded data in a hidden state-switching model to estimate time series of functional behavioral st ates for 12 sperm whales off Norway. Foraging, recovery and resting states were specified in the hidden state model by state-dependent likelihood structures. Comparison of hidden state models revealed a parsimonious set of input time series, and supported the inclusion of a less informed ‘silent active’ state. There was a high agreement between state estimates and expert classifications. We then used the estimated states in time series models to test three hypotheses for behavioral change during suction-cup tag deployment procedures: change in behavioral states, change in prey capture attempts and locomotion cost, given behavioral state. Sperm whales spent 34% less time at the sea surface and 60% more time in non-foraging silent active state in the presence of the tag boat (‘‘tagging period’’ 0.1–2.8 h) than during post-tagging baseline period (1.8–20.8 h). No comparable pre-tagging baseline data were available. Nevertheless, time-decaying models of tagging effects were not retained in model selection, indicating a short-term effect that ceased immediately after the tagging period. We did not find changes in energetic proxies, given behavioral state, however changes in functional state budget indicate costs in terms of lost feeding opportunities and recovery time at surface. These results are useful to quantitatively identify data periods that should not be considered baseline behavior within tag recordings. This functional state approach proves effective to quantify disturbance in terms of time and energy allocation that is based upon general principles that can be applied to other species and biologging applications.


INTRODUCTION
Animal-attached sensors have become an important means to monitor individual behavior for a wide range of species and habitats in the wild.With technological advances in miniaturization, resolution and longevity of bio-logging sensors and transmitters, there is scope for a more integrated understanding of how individual behavior and physiology interact with their environment and anthropogenic stressors (Cooke et al. 2004, Johnson et al. 2009).As such, biologging science can provide first clues of individual-level mechanisms that could drive anthropogenic impacts on populations (Cooke et al. 2004, Tyack 2009, Berger-Tal et al. 2011, Miller et al. 2012).For population-level inferences to be reliably made, it is important to consider how representative the tagged individuals' baseline behavior (such as time spent foraging) or response to stimuli (such as probability of avoidance) are of the wild, non-tagged population of conservation interest.Evaluation of possible effects of bio-logging experimental procedures is therefore important when considering how representative tag data might be to the entire population (''measurement affects performance''; Wilson et al. 1986, Miller et al. 2009).
Research effects of biologging studies comprise both the effects elicited by the tag deployment procedures, such as approach, physical contact or capture (hereafter collectively termed as 'handling') and the presence of the device itself upon the animal.Documented tagging and marking effects range from injury, physiological stress and behavioral changes to reproductive success and survival rates (Murray and Fuller 2000, Godfrey and Bryant 2003, Barron et al. 2010, Walker et al. 2012).The relative importance of handling and device effects depends upon their relative invasiveness, duration and repetition that may allow for habituation or sensitization.The effects of tag presence are of particular concern for flying and swimming species that may be more sensitive to alterations to their streamlining, such as taginduced drag (Bannasch et al. 1994, Barron et al. 2010, Hazekamp et al. 2010), and subsequent increases in transport costs (Wilson et al. 1986, Ropert-Coudert et al. 2000, Wilson and McMahon 2006, Fossette et al. 2007).These effects are reduced by use of relatively smaller and more aero-and hydrodynamic tag shapes (e.g., Bannasch et al. 1994).Locomotion costs can also be expected to increase if the tag significantly increases the mechanical loading (weight), buoyancy or center of gravity of an individual (Wilson et al. 1986).Tag attachment method (e.g., harness vs. glue) may also impair movement (Barron et al. 2010), but also have more subtle physiological effects, such as changes in the distribution of animal surface temperature (McCafferty et al. 2007).
In marine mammals, most studies have reported short-term behavioral effects of tagging with little evidence of impacts on survival (McMahon et al. 2008, Walker et al. 2012).While extensive research on tagging effects have helped to guide deployment practices and tag development (e.g., Fossette et al. 2007), generalizing the devicespecific and mostly qualitative results to different species and constantly evolving telemetry set ups is challenging (Murray and Fuller 2000).Not only are tagging effects likely to depend upon specific handling procedures and tag design but also individual (age, sex, condition) and behavioral and environmental context (e.g., nursing, prey availability) (Murray andFuller 2000, Walker et al. 2012).Reliable estimation of tagging effects therefore requires case-by-case assessment.However, with limited availability and cost of alternative study platforms, tagging studies are rarely able to empirically cross-validate tag data with data from a 'pre-tagging' period or data from non-tagged individuals (Murray and Fuller 2000, Godfrey and Bryant 2003, Walker et al. 2012).Most studies therefore assume that tagging has negligible or no influence on parameters of interest after some cut-off recovery time since handling ('baseline' period; Murray andFuller 2000, Godfrey andBryant 2003, but see definition of baseline period based upon affected dive parameters in Miller et al. 2009).
An alternative and quantitative approach is to compare tagged individual behavior between different available 'doses' of tagging procedures, such as varying tag size (Wilson et al. 1986) or handling intensity (Engelhard et al. 2002).Such an approach could be used to back-calculate true population parameters (Wilson et al. 1986, Wilson andMcMahon 2006).For example, Ropert-Coudert et al. (2007) compared diving and movement behavior of Adelie penguins between two different tag sizes to extrapolate effects on penguins with tags of negligible size.Based upon their results on tagged individuals, the authors were able to predict that non-tagged penguins would maintain similar energy expenditure than tagged animals but be able to swim faster, dive deeper, and range farther in pursuit of prey.Similarly, data can be compared within each tag record under the assumption that handling effects are strongest at the time of attachment and decrease afterwards.For example, Miller et al. (2009) found that the first dive after tagging was shorter than subsequent dives of sperm whales.Such a 'during-after' comparison can reduce confounding individual variability, but does assume that tag records are long enough to allow at least partial recovery.
In this paper, we develop and apply a novel approach to quantitatively assess the effects of suction-cup tag deployment procedures ('handling') on sperm whales for which no pretagging control was available.Our goal was to compare whale behavior in the presence vs. absence of the tag boat, and to evaluate different models of recovery from effects due to tag attachment and tag boat presence.We evaluate three classes of possible behavioral effects: (1) change in behavioral time-budget, (2) reduction in prey capture attempts (proxy for foraging success), given behavioral state and (3) increase in movement cost, given behavioral state.
To obtain behavioral states for hypothesis testing, we used multiple streams of tag data in a hidden state-switching model to estimate biologically informed states and their uncertainty.As well as classification of sperm whale behavior, our goal was to formulate 'functional' states that could be generalized to other species and used to assess a range of disturbance stimuli.Conceptually, our analytical approach follows the movement ecology paradigm (Nathan et al. 2008) and functional state framework (Isojunno and Miller 2014).Functional state decomposes behavioral time series into behavioral states as units of 'effort' that are associated with a goal or set of goals, combining both the ultimate and proximate drivers of behavior (Nathan et al. 2008, Isojunno andMiller 2014).For example, these goals could be mating, information, breathing or shelter.The achievement of these goals can be measured using currencies (e.g., prey capture for feeding goal) or proxy indicators of the currency (e.g., terminal echolocation as an indicator of foraging success) and expressed as success or cost rate within each functional state (Isojunno and Miller 2014).The states capture mean differences across different behavioral states of the currencies of interest.
We used adult male sperm whales in a subarctic foraging ground in Northern Norway as a relatively simple model system where individuals spend most of their time solitary and feeding (Teloni et al. 2008, Oliveira andWahlberg 2013).Sperm whales perform deep (200-1000 m) and long (30-60 min) echolocation-based foraging dives (Watwood et al. 2006), facing trade-offs between time spent foraging at depth and recovering oxygen stores at the sea surface (Boyd 1997).These trade-offs formed the conceptual basis for our functional state model for sperm whales.We considered two bio-energetic currencies, foraging success and movement cost, that vary across the foraging dive cycle (surfacing, descending transit, layer-restricted search, ascending transit).Terminal echolocation buzzes (Miller et al. 2004) and dynamic body acceleration (ODBA; Halsey et al. 2009) were quantified as proxies for prey capture attempts (;foraging success) and locomotion cost, respectively.Besides foraging dive cycles, we also expected sperm whales to spend time in shallower dives for other purposes, such as resting or 'silent active' swimming.Sperm whale resting dives occur in consecutive bouts of variable duration, are typically shallower than foraging dives, and are stereotypically characterized by a vertical 'head-up' or 'head-down' posture (Miller et al. 2008).Non-foraging but active behaviors are also described for sperm whales (Miller et al. 2008), and likely reflect social or anti-predatory functions (Cure ´et al. 2013).We were able to test how many non-foraging functional states are utilized by sperm whales by comparing models with five (foraging states þ resting) versus six states (þ active non-foraging) (Fig. 1).

METHODS
We first estimate time series of functional states and then use the resulting state classification to test for behavioral disturbances likely linked to individual fitness.Behavioral states were esti-mated in a hidden state model in order to formalize our prior expectations of functional behavior (surfacing, transiting, layer-restricted search, resting, and other 'silent active') and utilize multiple input data time-series.The state estimates and uncertainty were next used as data in a second analysis step that tested for time or energetic costs of tag deployment procedures with different models of recovery from disturbed to post-tagging baseline behavior.

Data
Data were collected for 12 individual sperm whales tagged with an audio and movementrecording bio-logging device (Dtag; Johnson et al. 2009).Four whales were tagged in 2005 (Teloni et al. 2008) and eight whales were tagged in 2008-2010 (Miller et al. 2012) near Lofoten Islands in Northern Norway.Sperm whales were localized at sea visually and acoustically by monitoring their echolocation clicks with a towed hydrophone array.The protocol included initial observations at 200-1000 m from a main observation vessel (MS Stronstad, 29 m).A smaller tag boat (rigid-hulled inflatable boat or similar) was launched to approach each whale and deploy tags with a pole that varied in length each year of research (Table 1).
Tag data were processed to calculate depth as well as whale-frame acceleration and magnetometer data which was converted to pitch, roll and heading time-series (Miller et al. 2012).Timeseries data from the tag was down-sampled to one sample per minute to reduce computational time and concentrate analysis efforts on dive phase scale rather than fine-scale behavior, such as thrusting strokes.Depth was sampled at the start of each 1-min interval, while mean pitch and 'overall dynamic body acceleration' (ODBA) were calculated over the entire 1-min interval.
ODBA was calculated as the sum over each minute of the two-norm of high-pass filtered acceleration (finite impulse response filter, cut-on frequency 0.05 Hz).To account for effects of tag position on ODBA, ODBA values for each whale were divided (normalized) by its median value and then multiplied by the median ODBA across whales.Surface periods were detected using a depth threshold of 2 m for accepting a dive, and a threshold of 1 m for reaching the surface.Time (min) since the last surface period was calculated for the start of each 1-min interval (minFrom-Surf ).Audio data (stereo at 96 kHz) were monitored aurally and visually using spectrograms for echolocation click trains (regular and buzz clicks) and marked for their start time and duration in each record.The presence or absence of these Fig. 1.We specified five or six functional states for sperm whales in their foraging ground: (1) surfacing, oxygen replenishment and physiological recovery at the surface; (2) descending transit, transiting to a deeper prey layer; (3) layer restricted search (LRS), searching at a prey layer; (4) ascending transit, transiting to a shallower depth or the surface; (5) resting and sleep underwater and ( 6) active non-foraging, which could encompass multiple functions.States 1-4 are considered to be functional states for foraging.Solid arrows show transitions that were expected to be likely and dashed arrows highlight the uncertainty related to the transition probability to and from state 6.These expectations and uncertainties were incorporated in the model as respective informative and uniform priors for the transition probabilities (Appendix A: Fig. A1).
v www.esajournals.orgaurally monitored clicks in each 1-min interval was used in the hidden state models in conjunction with the depth and accelerometer data.Other types of clicks (slow clicks, codas) were not included in the analysis.
Six whales were subject in a controlled exposure experiment that included up to five 20-30 min exposure sessions (Miller et al. 2012, Cure ´et al. 2013).Two of these six whales were exposed to just two sessions, followed by a secondary suction-cup tag deployment 1.2 h after all experiments ended.All data from all 12 individual sperm whales were used to parameterize the hidden state model, but non-tagging baseline periods excluded all exposure sessions and post-exposure periods.For tagging effects analysis, tag handling periods were defined as the time period between tag deployment and recovery of tag boat to the main research vessel or movement of the tag boat (.1 km) from the tagged whale.
A calibration data set of behavioral states was used to compare with the hidden model state estimates.''Bottom phases'' were defined by the period between the first positive and the last negative pitch in a dive for nine whales (Miller et al. 2004; Table 1).Dive types were classified by consensus of three experts, including the authors and Dr. Stacy DeRuiter.The resulting consensus comprised 11 dive types (Appendix C: Table C4).

Hidden state model
Our state-switching model for sperm whale behavior consisted of four functional foraging states and either one or two additional states for non-foraging related behavior (Fig. 1).Alternative model structures were considered to assess how many states (five or six; Fig. 1) and which combinations of input data (depth, clicking, minFromSurf, ODBA and/or pitch) should be included to classify the behavioral time series most effectively.Each model consisted of a fiveby-five or six-by-six state transition probability matrix and state-dependent likelihoods for the input data.
Depth was modeled as a random walk Gaussian variable with a state-specific mean and variance (Langrock et al. 2013, Photopoulou 2013): where d t denotes depth at time step t and s denotes the hidden state at time step t À 1. Descent and ascent states were modeled as a directional random walk ('bias' parameter p s estimated 6 ¼ 0), and all other states a nondirectional random walk (p s ¼ 0).A separate variance for depth changes (r 2 s ) was estimated for each state.A step function was used to constrain predicted depths to be .0m.
To relax the Markov assumption that state transitions depend only upon the previous time Notes: Total sample duration (h) refers to data that was used to fit hidden state models, while tagging analysis show durations of data retained for tagging and post-tagging datasets (see text).Tag boat shows the total number of hours that the boat remained near the whale after tag deployment.Expert classified dives are given as dives with clicking and usual dive profile (1-4), dives with clicking and unusual dive profile (5-7), dives without clicking and drifting behavior (8-9) and dives without clicking and silent active swimming (10-11) (Appendix C: Table C4).
v www.esajournals.orgstep, all models allowed the probability of surfacing at time t to increase with decreasing depth at time t À 1 in a multinomial logistic regression (see Langrock et al. 2013 for a similar formulation of feed-back in transition probabilities).minFromSurf (x 1 ) was an additional covariate in the regression for the probability of transition to LRS (state 3).The linear predictor for the probability of state s at time t was therefore: where intercept b 0 was specific to a statetransition, coefficient b 1 was associated with transitions to surface (state 1), and b 2;S t associated with staying in LRS (state 3).The coefficients were fixed at zero for other transitions, i.e., when s t 6 ¼ 1, then b 1 was set to zero, and when s t 6 ¼ 3 and s tÀ1 6 ¼ 3, then b 2 was set to zero.The presence/ absence of clicking (c) was estimated a state-specific probability (c t ; Bernoulli(c s[t] )).ODBA (o) was similarly modeled as a Gamma distributed variable with state-dependent shape and rate parameters (o t ; Gamma(u s[t] , x s[t] )).
The absolute value of the pitch angle p was modeled in a logistic Beta regression (Ferrari and Cribari-Neto 2004) so that within mobile states (i.e., not surfacing or resting), pitch was related to vertical step length in a linear predictor: Here, the coefficient for vertical step a 1,s[t] was specific to each state so that all mobile states were estimated a single coefficient which was fixed at zero for surface and resting.Pitch during surfacing and resting were estimated statedependent means (a 0,1 , a 0,5 ), while mobile states were assigned a common intercept.
The joint likelihood for the full model (all five data streams) was the product of their conditionally independent likelihoods (for a similar where h denotes the included set of statedependent parameters.The full model (all four model components) had 54 estimable parameters in addition to the hidden states that were estimated for each data point.
After initial inspection of model performance, one additional parameter was introduced for the seven best DIC model structures.In the forementioned models, we had assumed a timeconstant average step length within each state by estimating a state-specific r 2 s .Inspection of the data revealed that step lengths increased as a function of the depth during foraging dives (dives consisting of only descent, LRS, and ascent).The observed relationship appeared to be linear when depth was square root transformed (see Fig. 4, middle panel).We therefore specified a time-varying r 2 s for LRS state and time-varying drift for descent and ascent states by setting: Fig. 2. Illustration of the log-linear model probability of state transition log(P(s)) ¼ a þ b 3 x with five different hypotheses for tagging dose.Blue and red tick marks on x-axis show Tagging period, with start of tagging data in blue and end of Tagging in red.The first hypothesis for dose was a presence/absence effect of tag boat, Tagging, shown as shaded gray.Four timedecaying explanatory variables were tested for hypotheses of recovery from either tag deployment (blue; minFromTd and minFromTd 2 ) or end of Tagging period (red; minFromTagging and minFromTagging 2 ).The variables were calculated as linear or squared time since tag deployment or Tagging, representing either exponential (dashed lines f(x); minFromTd and min-FromTagging) or exponential with delayed (dashed lines f(x 2 ); minFromTd 2 and minFromTagging 2 ) speed of recovery.In this illustration example, the intercept a was set at À0.5 and coefficient b at À0.005.v www.esajournals.org Here r 2 s and p s are the time-constant intercepts for variance and drift for the random walk, r 2 s,t and p s,t are the respective time-varying parameters, and l the increase in step length for every square root unit increase in depth.We therefore specified that the relationship between step length and depth was constant across the three foraging states (descent, LRS, ascent).For an exhaustive list of model parameters, see Appendix A.
In order to incorporate prior information on whale behavior, a Bayesian approach was taken to parameterize the models.A Gibbs sampling algorithm was used to sample from the joint posterior distribution of the model.We used freely available jags software (2003) within r (coda package, Plummer 2003 and R2jags package Su and Yajima 2012).Descent and ascent rate were specified with informative priors using Gamma distribution with a mean and variance parameter from literature (Watwood et al. 2006).A lower mean and variance for ODBA was used to construct a Gamma prior for resting.Probability of clicking was also informed, with a higher mean for foraging states (descent, LRS, ascent).Pitch regression coefficients had uninformative priors with no parameter difference between states except that the coefficient for vertical step was fixed at zero for surface and resting as explained above.Uniform (uninformative) priors were specified for most transition probabilities (state-specific intercepts).Coefficients for the probability of transition to surface and LRS were assigned uninformative normal priors.The probability of transition to surface was constrained to be negative by truncating its prior distribution.See Appendix A for a comprehensive list (illustration in Appendix D), and example model scripts and data in the Supplement.
All models were sampled in three independent chains, each with an initial 16,000 iterations.Model convergence was assessed at this point, 'Base' structure here refers to depth þ clicking that were included in all of the converged set of models; 'TS' refers to time-varying step length models.Left panel: overall state uncertainty for each model (total bar width) with contributing states color-coded.Overall state uncertainty was calculated for each model as the total proportion of posterior samples that were not the most prevalent state.Gray circles show DIC (from Table 2).Middle panel: percentage of time estimated in each state during pre-classified bottom phases.Contributing states are colorcoded so that green shows sensitivity of layer-restricted search to pre-classified bottom phases.Right panel: percentage of time estimated in each state during expert classified silent active dives.Models in all panels are shown in ascending order for overall state uncertainty.
v www.esajournals.organd a subset of models that were deemed to reach convergence in terms of state classification were updated a further 20,000 times.Initial values were set manually for all state parameters (Appendix A: Table A1).Brooks-Gelman-Rubin diagnostic (BGR; Brooks andGelman 1998, Gelman et al. 2003) was used to assess model convergence, which was rejected based on its poorest converging parameter (BGR estimate 1.05).Detailed methods and results for model convergence can be found in Appendix B.
Four criteria were used to rank models that were deemed to have converged: (1) goodness of fit relative to model complexity (deviance information criterion DIC), (2) uncertainty in state classification, and (3) sensitivity and specificity to pre-classified bottom phases and (4) comparison to pre-classified resting and silent-active dives.Detailed methods for model selection can be found in Appendix C.

Use of state classifications for assessment of tagging effects
We used the top-ranked time series estimate of hidden states as data, and their uncertainty as weights, in a second analysis step that tested the effects of tagging on three response variables: (1) estimated activity state (; multinomial, proxy for functional state), (2) presence/absence of buzzing (; Bernoulli, proxy for foraging success), and (3) overall dynamic body acceleration (ODBA ; Gamma, proxy for locomotion cost).1).The total numbers of states with and without clicking are given on the bottom gray x-axis, and the posterior estimate for the probability of clicking on the top black x-axis.Middle panel: vertical steps (m/min) predicted as a function of depth (m).Posterior mean steps as a function of depth were predicted based on the posterior mean (solid lines) and 95% quantile (dashed lines) for the random walk parameters r 2 s , l, p 2 and p 4 (Appendix D: Table D1).Predictions for each state are color-coded; vertical step predictions for descents (red) and ascents (blue) include drift (bias, p) and are slightly asymmetric around zero because descent and ascent drift were estimated separately as p 2 and p 4 in the model.Vertical step predictions for states 1 (surfacing, in black), 3 (LRS, in green), 5 (resting, in indigo) and 6 (active-silent, in pink) did not include drift (i.e., not signed) but for illustration, are overlaid here symmetrically with observations both above and below zero.Right panel: Absolute value of pitch (deg) predicted as a function of vertical step length (m/min) (right), each overlaid with observed data.Pitch values were predicted based on the posterior mean (solid lines) and 95% quantile (dashed lines) values the pitch regression intercept a 0,s and coefficient for depth a 1,s (Appendix D: Table D1).
v www.esajournals.orgProbability of state, given previous state, was modeled by including previous state (prevState) as factor baseline covariate.State was used as a factor baseline covariate in models for ODBA and buzz.We also allowed for mean differences in all three response variables across individuals by including tag id (whale) of the record as a factor covariate.The binomial model for buzz was fitted to a subset of data that only included foraging states (descent, LRS and ascent).No buzzing was observed in the non-foraging states (surfacing, drifting or silent active), so estimating standard errors for their coefficients would have not been possible.
Candidate exposure covariates were assessed for inclusion using model selection, and were designed to test between different hypotheses of the time-course of possible behavioral responses to tag deployment procedures (Fig. 2).Presence/ absence of tag boat was included either as a main effect (Tagging), or interaction with year (Tagging : year) or pole length (Tagging : poleL) to assess any differences in level of response across years or as a function of pole length, respectively.
We chose a maximum likelihood framework for fitting these models for ease of model selection using AIC.Multinomial log-linear regression models were fit using function multinom in r library nnet, while binomial (logit link) and Gaussian (identity link) regression models were fit using function glm in r library stats.Multinomial models were weighted by the posterior probability of the state estimate, thus accounting for the uncertainty in state estimation.AIC unit difference of DAIC , À2 was considered support for candidate tagging covariates compared to the baseline models for each response variable (state ; prevState þ whale, ODBA ; state þ whale, and buzz ; state þ whale).All tagging effects models included the baseline covariates and up to two tagging-related explanatory variables.To avoid spurious relationships, only one of the four time-decay covariates (minFromTd, minFromTd 2 , minFrom-Tagging, or minFromTagging 2 ) were included in any one model, and were not included in the same model with Tagging:Year.
The lowest AIC models were diagnosed for influential individuals and data, goodness of fit, distributional assumptions, and serial correlation in residuals (Appendix D).Models that were diagnosed with serial correlation of residuals were re-fit within a generalized estimating equation (GEE) in SAS 9.3 (procedure 'genmod').In multinomial models, the state that appeared to change most in response to tagging was used as a binomial response variable in the GEE.Any tagging effects were re-assessed using the empirical standard error estimates that do not assume any particular working correlation within the GEE, but account for the smaller effective sample size of correlated data within clusters.Small empirical standard errors (estimates . 2 3 SE) and significant type 3-tests (p , 0.05) were considered as support for candidate covariates.GEE models included whale as a cluster variable rather than an explanatory variable, and therefore explicitly estimated the parameters of the model for the group of whales rather than separately for each individual.

Data
A total of 175.37 hours of DTAG data were analyzed, an average of 14.6 hours of data recorded per whale (Table 1).All data from the 12 deployments were used to parameterize the hidden state model.Data for tagging effects analysis included nine DTAG deployments (87.62 h) from the time of first tag-on to the first experiment or end of the full tag record.For two of these whales, we also included the period between the start of secondary tag deployment until the end of the full DTAG record.Three whales (sw05_199a-c) were excluded completely due to incidental exposures to unidentified sonar at the beginning of the tag records (0-3 hours from tag deployment) (Table 1).

Hidden state model selection
Based on their state classification convergence at 6k-16k iterations, eight fixed-step length (FS) models were rejected and 10 accepted for further updates.All six-state FS models that did not include pitch failed to converge in terms of state classification, suggesting that pitch was important in discriminating between resting (state 5) and active-silent state (state 6).In the 10 FS models selected to be updated, all parameters converged adequately (BGR estimate , 1.05) after 16,000 iterations.v www.esajournals.orgAs described in the methods, the seven lowest DIC model structures were also fitted with timevarying step length (TS) during foraging states descent, LRS and ascent (Table 2).Of the seven TS models, two models with five states (models 2 and 6) failed to converge in terms of state classification; the remaining five TS models and their parameters appeared to converge sufficiently.The converged set of models improved within-chain correlation of all posterior transition probabilities from state 3 to states 2-6 compared to the same models without TS (Fig. B3).
Six-state and TS models outperformed respective five-state and FS models, both in terms of lower DIC, lower state uncertainty and higher sensitivity to pre-classified bottom phases (Fig. 3).Six-state models estimated most of the time in expert classified 'silent active swimming' dives as silent active state, and models with pitch were further able to discriminate between expert classified drifting and silent active swimming dives (Fig. 3).When vertical step was allowed to vary with depth (TS models), inclusion of ODBA appeared to somewhat improve overall state certainty and sensitivity to pre-classified bottom phases (Fig. 3).
Minimum DIC was obtained for model structures 6 (base þ pitch) and 9 (base þ pitch þ minFromSurf ) both within five-and six-state models.However, both uncertainty in state classification and sensitivity to pre-classified bottom phases ranked three models slightly above the lowest DIC model (six states, TS and pitch): full six-state TS model, and six-state TS models pitch þ minFromSurf and pitch þ ODBA (Fig. 3).Including ODBA in the best DIC model with pitch changed only 2.8% of its state estimates, a magnitude similar to their overall state uncertainty (;3%), and had only small contribution on the state classifications of the full TS model (Appendix C: Fig. C6).In the interests of model parsimony therefore, we selected against ODBA in the hidden state model.Including minFromSurf in the best DIC model changed the state classification even less, by 0.6%.Without minFromSurf, TS model posterior samples, transition probabilities from state 3 in particular, had a greater (.400) effective sample size.Therefore, it was the lowest DIC model 6 (base þ pitch) with six states and time-varying step length that was selected for interpretation and further analyses of tagging effects.

Description of selected hidden state model
The posterior distributions of the selected hidden state model were consistent with our prior expectation of behavior.A high probability of clicking was estimated for the foraging states (posterior means for descent: 0.90, layer-restricted search: 0.99, and ascent: 0.56) while a low probability of clicking was estimated for surface, resting and silent active states (,0.02).Descent and ascent rates overall were very similar when accounting for their variability and effects of  v www.esajournals.orgdepth (Fig. 4).During foraging states (descent, LRS and ascent), step length was estimated to increase by 1.47 (SD 0.02) m/min for every unit increase in square root transformed depth.The posterior mean absolute value of pitch was 1.3 (SD 45.6) degrees during surfacing and 80.5 (SD 45.8) degrees during resting.See Appendix D for complete description of the selected model.

Effects of tagging
When the tag boat remained near the whales in tagging operations (n ¼ 8.1 h), the whales spent no time resting, and across individuals, an average of 1.63 more time in the silent active state (10.1%,SD ¼ 13.2) and less time surfacing (12.4%, SD ¼ 10.2) compared to baseline periods when the boat was recovered (n ¼ 79.4 h; 60% increase from 6.3%, SD ¼ 15.1 and 34% decrease from 18.8%, SD ¼ 5.5, respectively) (Figs. 5 and  6).
The most prolonged tagging period was for whale sw08_152a that was approached by the tag boat for 2.8 hours after tag attachment attempting to photograph the whale (Fig. 5).During those 2.8 hours, the whale spent only 1.8% of the time in surfacing state 1, and 31.4% of the time in silent-active state.With most of the silent-active state comprised silent diving, the whale spent only 12.3% of its time near surface (,10 m).This compared to 12.7-27.8% of time spent in the surface state across the deployments during posttagging (Appendix E: Table E3).Immediately after the tag boat left the whale, it spent eight minutes in the surfacing state, which was the longest period the whale spent in the surfacing state during the entire DTAG record (posttagging individual average surface duration was 6.8 min SD 2.0).
The lowest AIC model for state transitions included prevState þ whale þ Tagging, which improved the baseline model prevState þ whale by 9.5 AIC units (Fig. 7).Tagging covariate was also supported by a likelihood ratio test between the two models (df ¼ 5, p ¼ 0.002, function anova.nnet()).The model estimated 86.5% of the post-tagging baseline states and 79.2% of the Tagging period states correctly.The model fit best to LRS and drifting states (92.7% and 88.5% correct predictions, respectively), and worst to silent active state (64.8%) (Appendix E: Fig. E1).
The binomial GEE model for silent active state with prevState and Tagging as covariates and whale as a cluster variable improved the QIC of the baseline GEE model by 28.9 units (Table 3).The GEE model with Tagging estimated the odds of silent active state to increase by a factor of 3.70 [95% CI 1.3, 10.1] during the tagging period (Appendix E: Table E5), slightly greater but Fig. 5. Time series of state budget and dive profile for whale sw08_152a during the Tagging and post tagging baseline period.X-axis shows time since tag-on time (tot).Bottom graph shows posterior probabilities for each state (color-coded as in Fig. 3).Top graph shows 1-minute depth data (gray) overlaid with presence/absence of clicking (black) and presence/absence of buzzing (green).
v www.esajournals.orgwithin the confidence intervals (2 3 SE) of the respective coefficient estimate from the multinomial model (Appendix E: Table E4).The GEE model results confirmed that the detected change in state transitions of the multinomial model was not merely a by-product of serial correlation.We detected both positive and negative residual correlation in the best multinomial model (Appendix E: Fig. E2).
In the AIC model selection, there was little support for an overall tagging effect on probability of buzzing, given the foraging states (descent, LRS and ascent states) and whale as a factor covariate.Tagging improved the baseline model state þ whale by only 0.74 AIC units (Fig.  v www.esajournals.org7).The coefficient estimate for Tagging was small (À0.32,SE ¼ 0.20) with no evidence that it was different from zero (z ¼ À1.61, p ¼ 0.107).
The best AIC model for ODBA included state þ whale þ minFromTd.minFromTd improved the baseline model by 30.0 AIC units (Fig. 7), however, the estimated effect was very small (À0.18 decrease in mean ODBA for every hour).When fitted within a GEE which accounts for serial correlation in the data, neither minFromTd, Tagging nor Tagging:state were supported with respective QIC increases of 543.7, 18.5 and 1353.22 units compared to the base model (Table 3).We therefore concluded that there was no evidence for a change in ODBA as a function of time since tag-on time or tag boat presence.

DISCUSSION
In this study, we were able to estimate a timeseries of functional behavioral states that most fully captured the variability in diverse data streams recorded by an animal-attached movement and sound recording tag.Our results demonstrate that a 'silent active' state can be identified despite lack of a prior functional description for this state, and that including this state along with a defined resting state improved the functional state models for behavior of Norwegian sperm whales.We then used the output of the model to evaluate three possible behavioral responses to tagging procedures: (1) change in behavioral time-budget, (2) reduction in prey capture attempts, given behavioral state and (3) increase in movement cost, given behavioral state.We demonstrate that whales spent more time in the silent active state when the tag boat was present, and that a simple present versus absent response explained the data better than time-decaying models of behavioral response.This enables quantitative determination of post-handling periods that should be excluded to retain periods more likely to reflect baseline behavior.

Hidden state models
The hidden state models were able to estimate both very stereotyped states (surfacing, resting) and states with highly variable data signatures (layer-restricted search LRS, other non-foraging) (Fig. 4).Although there was uncertainty in formal model selection in this Bayesian framework, different hidden state models arrived at similar state classifications, which all agreed well with expert classifications (Fig. 3; Appendix C: Tables C1 and C2).The hidden state models succeeded to identify and characterize states that could be interpreted in terms of functional behaviors previously documented for sperm whale foraging.The accepted hidden state model included six states, time-varying vertical step length for foraging states (descent, LRS and ascent), clicking, and log-linear relationships between vertical step and the absolute value of pitch.This model had the lowest DIC score, and was selected as the most parsimonious model amongst the models with the lowest uncertainty and agreement with expert opinion (Fig. 3).
Allowing step length to increase with depth (TS models) improved the DIC, state uncertainty, and sensitivity to pre-classified bottom phases compared to models with a simple random walk with fixed step length (FS models) (Fig. 3).TS models also appeared to capture an active foraging mode better than FS models that estimated the highest average ODBA during descent rather than LRS.We do not postulate that sperm whales are intrinsically more mobile at depth, but rather that the time-varying formulation for step length was more flexible by accepting a wider distribution of step lengths for LRS, and was subsequently able to more fully capture an active foraging mode.Such high variability in step length across foraging phases could be expected when prey layers vary in vertical thickness, quality and/or prey species that in turn influence whales' hunting and searching strategy.

Functional time budget of foraging male sperm whales
Layer-restricted search (LRS) was estimated as the most prevalent state in the post-tagging data (47.5% of all data, and 51.2% of all foraging states 1-4), consistent with the high proportion of time spent in foraging and high diving efficiencies (foraging phase duration: dive cycle duration) reported for sperm whales both at high-and low latitudes (Jaquet et al. 2000, Watwood et al. 2006).Unlike studies using bottom phase (defined by the first descent and final ascent of a dive) or foraging phase (defined by the first and last buzz of a dive) (Watwood et al. 2006) alone as a measure of foraging time, we were able to estimate multiple foraging phases within a dive (e.g., Appendix D: Fig. D3b).Thirty of 119 (25.2%) ''usual'' foraging dives (expert dive types 1-4 in Appendix C: Table C4) contained more than one foraging (LRS) phase.
There was strong support for a sixth 'silent active' state, with six-state models outperforming five-state models in terms of higher overall posterior probability of states (Fig. 3) and a better fit of state-dependent likelihoods to the data (Appendix C: Fig. C3).Furthermore, there was high concordance between the state 6 estimates and expert classified ''silent active'' dives (Fig. 3).Nevertheless, state transitions appeared to be relatively weak predictors of state 6 compared to other states (Appendix C: Fig. C5), with wide posterior credible intervals for the transition probability of staying in state 6 (Appendix D: Fig. D1).However, variability related to state transitions is expected as state 6 likely encompassed several non-foraging behaviors that may have been associated with different functional behaviors and contexts, such as socializing, avoiding the tag boat near surface, or horizontal transit.Future work with larger datasets could consider the potential to divide state 6 into more specific functional states.
State 5 (resting/drifting) was estimated for 3.8% of post-tagging baseline data, most of which coincided with expert classified drifting dives based upon the description of this behavior by Miller et al. (2008).For two whales (sw05_196a and sw10_150a) state 5 also identified drifting to the sea surface that occurred at the end of foraging dives (max depth 306 m; Appendix D: Fig. D3a).Drifting had a very distinct data signature featuring little vertical movement yet nearly vertical pitch (posterior mean and 95% CRI was estimated for step length as 8.5 [7.9, 9.0] m, and for pitch as 80.5 [79.8, 81.0] degrees), consistent with stereotyped vertical posture driftdives documented for sperm whales world-wide (Miller et al. 2008).

Effects of tagging
Using the estimated states and uncertainty to assess tagging effects, we found that sperm whales increased time in non-foraging silent active state (Table 3, Figs.6-7).Within each behavioral state, we found no evidence of changes in a proxy for locomotion cost (ODBA) or a proxy for foraging success (probability of buzzing) (Table 3, Figs.6-7).Sw08_152a was reapproached by the tag boat for the longest duration (2.8 h), and during this time, spent 31.4% of time in silent active state compared to no time in this state in the post-tagging period that consisted of three foraging dives and longer surface periods (Fig. 5).These results indicate an evasion or vigilance reaction to the tag boat that disrupted behavior, rather than a direct reduction in prey capture attempts or change in locomotion cost within behavioral states.No longer-term effects could be detected on the time scale of each tag record (;15-20 h in duration), suggesting that whales recovered to a posttagging level of behavior relatively soon after the tag boat left.
We were not able to collect comparable data during the pre-tagging period, and therefore could not establish with certainty that behavior was resumed to a completely undisturbed (nontagged) level.However, two whales (sw10_149a and sw10_150a) were re-approached for a secondary tag deployment and had a response profile consistent to whales that were tagged only once.Both whales spent time in the silent active state near the sea surface during first and second tagging periods, while full foraging dive cycles were resumed soon after the tag boat left the whales.Within both tag records, the first and second post-tagging deployment periods consisted of near identical time and depth profile of foraging (descent, LRS and ascent states) with little apparent effect of the presence of a secondary tag (Appendix D: Fig. D3k-l).We conclude that the presence of tags alone on the animal was likely to have little effect on whale behavior compared to tag deployment procedures ('handling').Indeed the DTAG only weighs 300 g, which is less than 0.01% of an adult sperm whale mass (14-50 t).Little is known about the effects of suction-cup tag attachment, however tags typically detach if a sperm whale performs a breach, indicating that an uncomfortable tag can be removed by the whale (Johnson et al. 2009).Nevertheless, multiple tagging of the same individual appears a promising approach to decompose the effects of handling vs. tag presence on the animal, as well any sensitization or habituation to tag deployment procedures.Future studies could address these effects with a larger sample size of secondary tag deployments and include tags of varying sizes.
Although we did not find evidence for changes in energetic proxies within states, the increased probability of non-foraging silent active behavior and reduced time at surface suggests an energetic cost of tag boat presence.Miller et al. (2009) found similar short-term changes in sperm whale foraging behavior during the first dive of the tag record but not subsequent dives.These changes included reduced buzz and pitching rates during the bottom phase, and shorter dive duration compared to the subsequent dive.However, the presence/absence of tag boat was a more important predictor of effects than time since tag-on time, suggesting a lack of a specific cut-off period after tag attachment.This result is expected when the tagging procedure, including re-approach for photo-identification, varies across tagging occasions.In such cases it is important to collect detailed data on tagging effort to describe the 'dose' of handling, such as tag boat distance to the whale, with focus on recording the intensity and duration of approach both before and after a successful tag attachment.

Methods considerations
Although we did find evidence of short-term tag boat effects (tagging period duration ranged between 0.1 and 2.8 h), the sensitivity of our test for subtle longer-term effects was likely to be limited due to the relatively small number of tags (n ¼ 9) and high variability in state budgets and buzz rates across the tag records.Variability in the tagging procedure across years and different tagging crews was also likely to affect the probability and level of the individual responses.We did not find evidence for a different level of response to shorter pole length (Fig. 7), however due to the small sample size we could not account for other factors that could have been equally important, such as tag boat handling and targeting tag placement near the head vs. the back of the animal.Tagging protocol in 2005 and 2008 aimed to place tags at the anterior end of the head, while in 2009-2010 whales were typically approached from behind and tags were placed on the back of the whale anterior to the dorsal fin.
Tagging periods after tag attachment ranged from just 6 minutes up to 2.8 hours (1.1-61.4% of tag records).Our time-series approach explicitly modeled this unbalanced sample, and we also contrasted results from GLMs that estimated individual level differences with GEEs that estimated individual average and between-individual variability in the response data.Nevertheless, it was possible that a few individuals that responded strongly to the presence of tag boat were influential in the estimation of a population effect.We tested the influence of individuals by re-fitting the baseline and the tagging effects GLM:s for state without each individual, and found that excluding either sw08_152a or sw10_150a lowered the AIC difference below our DAIC threshold of À2 (Appendix E: Fig. E4).sw08_152a was exposed to the longest tagging v www.esajournals.orgperiod (2.8 h), whereas sw10_150a was approached by the tag boat twice for shorter periods, including secondary tag deployment.Both sw08_152a and sw10_150a spent the longest durations in state 6 (an average of 6.5 and 4.0 minutes, respectively) compared to any other whale during the tagging period, and were not estimated to return to state 6 in the baseline period.Therefore, had we not sampled these two individuals, we would have not been able to estimate a tag boat effect overall.
As one of the first attempts at multivariate hidden state modeling of individual behavior, we simplified the hidden state model structure by assuming mostly Markov transitions, no individual effects or spatial memory for prey layers.Despite the relatively simple process model, a sufficiently strong signal in the input data allowed for robust state classification and estimation of time budgets that were highly variable across individuals.A more realistic (complex) process model would be required if disturbance was incorporated and tested within the hidden state model.For example, a hidden state model with individual as a random effect could estimate population-average effects by incorporating tag boat presence as an explanatory variable for buzzing within each state.

Implications and future steps
The functional state approach appears to be able to effectively estimate behavioral disturbances that can be linked to individual fitness.We showed that after tag deployment, whales can remain vigilant to the presence of tag boat and thus trade off foraging time for perceived risk at surface.During-after comparisons of functional states and currency proxies were influenced by individuals that were exposed to tag boat repeatedly or for extended durations, highlighting the importance of consistent deployment procedures and minimizing handling time.Nevertheless, we succeeded to estimate a cut-off point (tag boat recovery) after which whales were likely to have returned to a post-tagging level of behavior, and recommend before-during comparison of behavior where pre-tagging baseline data is not available.Our results also lend support for the exclusion of handling periods to better capture post-tagging baseline behavior.However, comparable pre-tagging data would be needed to quantify tagging effect as a deviation from the non-tagged population of interest.An optimal design would monitor behavior during all phases of tag deployment (before approach, during tag deployment procedures, during attachment, and after the device is detached), and from a platform that minimized research effects.For cetaceans such as sperm whales that use biosonar to locate prey, remote visual and passive acoustic tracking could be used to monitor foraging and movement before-duringafter tag deployment, as well as complement fine-scale on-board acoustic and orientation data during the tag record.
Our concept model and hidden state approach was based upon first principles of searching behavior (transiting vs. encamped search) and central-place foraging (surface vs. diving) that are transferrable across species, and we show that not all such functional states need strict definition a priori to be estimated.The hidden state model also incorporated speciesspecific expectations of behavior (echolocation, drifting posture), combined multiple sources of data to estimate biologically interpretable states and parameters (such as descent rate), and allowed modeling of currency proxies within the relevant behavioral contexts.Unlike expert classifications, hidden state modeling is automated and quantifies uncertainty.As well as for offline-analysis, these are also desirable properties for an on-board data compression algorithm, and state-estimation of fine-resolution archival tags could guide the development and use of such algorithms in data-relaying tags that aim to collect data for months or even years.With recent advances in deviance-based selection for Bayesian mixture models (e.g., Plummer 2003), there is also more scope to incorporate and test a range of disturbance effects as explanatory variables within hidden state models, rather than as a second AIC-based analysis step.Behavioral context is increasingly highlighted as the key to understanding the fitness trade-offs of behavioral decision making in response to anthropogenic stimuli (Gill et al. 2001, Beale 2007) and within such flexible hierarchical estimation frameworks, could be explicitly modeled by conditioning disturbance effects by behavioral state.

Hidden state model convergence
Methods.-All models were sampled in 3 independent chains, with an initial 16 000 iterations each.Initial values were set manually for all state parameters (Table A1).The first 6k iterations were discarded for adaptation and burn-in.The remaining 10k iterations were down-sampled (thinned) to reduce autocorrelation in the analyzed samples, and monitored for convergence of state classification to accept a subset of models for a further 20k iterations (Fig. B1).A lower down-sampling rate was used initially (every 18th and 6th sample for iterations 6-12k and 12-16k) to explore serial correlation in chains, while a factor of 50 was used to downsample the additional (20k) iterations.
The model assessment after 16k iterations was done to reduce computation time on models that were deemed unlikely to converge at all.Two criteria were set for rejecting a model for further updates: (1) for at least two states, state proportions were so diverged that the samples did not overlap, (2) the state proportions appeared stationary across the iterations (6-16k) (Fig. B1).Such models were considered to be poor representations of the data and were not included in further model selection.The remaining models were assessed visually for parameter convergence, stationarity and serial correlation of the chains across all iterations (6-36k).Brooks-Gelman-Rubin (BGR; Brooks andGelman 1998, Gelman et al. 2003) diagnostics were calculated separately for iterations 16-36k and 26-36k using a 95% credible interval in function gelman.diag(r package 'coda').Model convergence was rejected based on its poorest converging parameter.All convergence assessment accounted for downsampling rate.The convergence assessment procedure was repeated for models with variable step length, but with more iterations (48k) and a down-sampling rate at 18.An accurate estimate of the posterior distribution requires a sufficient number of independent samples.Conventionally, an effective sample size Fig.A1.Beta prior densities (y-axes) illustrated for transition probability intercepts (x-axes).Rows show state (1-6) at time t, and columns state (1-6) at time t þ 1. Green shows transitions with uninformative prior density Beta(shape1 ¼ 1, shape2 ¼ 1), and other colors show different types of informative Beta priors: from state 2 to 1, and state 1 to 4 (;Beta(1, 1eþ06)); from states 1-4 to state 5, and states1-5 to 6 (;Beta(1, 10)); and staying in states 1, 2, 5 and 6 (;Beta(2, 1)).
v www.esajournals.org of 400 is used (Lunn et al. 2013).Posterior summary statistics were therefore based on the number of iterations that contained at least 400 independent samples for each parameter and state proportion in the converged set (function effectiveSize in package coda).
Results.-Based on their state classification convergence at 6k-16k iterations, eight FS models were rejected and 10 accepted for further updates (Fig. B2).No five-state model state classification converged unless they included clicking (Model 2), however otherwise it was less clear how fivestate model structures contributed to state convergence at 6-16k iterations.While 'clicking with minFromSurf' or 'ODBA' models (3 and 7) failed to converge, clicking did converge with 'minFromSurf þ ODBA' (Model 4).In addition, convergence of state classification was satisfactory for all five-state models that included pitch (models 5, 6, 8 and 9).Near-surface behaviors appeared to be the most challenging to converge into states 1, 5 and/or 6 across model structures.The base model (Model 1, depth) and Model 3 (clicking þ minFromSurf ) indicated a failure to classify these behaviors both with and without the sixth state: in all four models, the coefficient for the probability of surfacing (T.beta11) and precision for depth during states 1 and 5 were divergent.Similarly, it appeared that ODBA alone could not discriminate between active and non-active behaviors near surface, such as shallow resting, unless the model was limited to Fig. B2.Brooks-Gelman-Rubin (BGR) estimate (solid circles) and 95% CI (lines) for each fixed step length model (y-axis, number of states in brackets) at iterations 16-36k.BGR estimate of 1.05 was used as an acceptance threshold for parameter convergence.Left-hand panel includes three models that were rejected based on their poor state classification convergence (1(5), 1(6) and 2( 6)) at 6-16k iterations.These models were updated to check that the rejection based on state classification was conservative.As expected, model parameters remained poorly converged.Right-hand panel only includes models that were accepted for state classification convergence.five states and the duration of submergence was accounted for (minFromSurf ).For both five-and six-state Model 7, state 1 and 5 parameters for depth precision (tau) and state 5 and 6 parameters for ODBA (a.mean, a.rate) diverged.
In the 10 updated FS models, the Brooks-Gelman-Ruben (BGR) estimates were less than or equal to 1.05 for all parameters and state proportions at 16-36k.However, in seven of the 10 FS models, the upper 95% confidence intervals of BGR estimates exceeded the threshold (max 1.14) (Fig. B2).Most of the values (15/20) that were over the threshold were related to transition probabilities from state 3 to states 2-6 that were also more serially correlated (i.e., mixed slower) than other parameters, indicating that withinchain correlation increased uncertainty in parameter convergence.
While a sufficient number of effective posterior samples (..400) could be achieved for all other parameters at 16-36k iterations, transition probabilities from state 3 had effective sample sizes less than 400 in all FS models (min 150).We did not find any improvement in the chains when minFromSurf was included as a covariate for the probability of staying in state 3, indicating the slow mixing of state 3 transition probabilities was not due to inaccuracies of the 1st order Markov assumption.Nevertheless, even the most serially correlated posteriors appeared stationary (no trend) across the iterations.We therefore used a wide sample window for all posterior summary statistics (16-36k) to improve accuracy.However, the converged set of time-varying step length models appeared to improve within-chain correlation compared to FS models, at least for the transition probabilities from state 3 (Fig. B3).

APPENDIX C
Hidden state model selection Methods.-For measuring goodness of fit relative to model complexity, we used deviance information criterion (DIC).The DIC is an extension of Akaike's Information Criterion (AIC) and is particularly useful for models that have been fitted outside of a maximum likelihood (ML) framework.Similarly to the AIC, the DIC is based upon both model fit and model complexity (Spiegelhalter et al. 2002, Lunn et al. 2013): where D is the posterior mean deviance of the model, and p D is the effective number of model parameters.Jags calculates the deviance as the sum of the deviances of all observed random variables defined in the model (i.e., ''stochastic nodes'' in BUGS terminology), and p D as the difference between the expected deviance D and the deviance evaluated at the posterior means (DðhÞ; Spiegelhalter et al. 2002).However, p D cannot be evaluated for discrete parameters such as hidden states (Lunn et al. 2013).We used an alternative measure of effective number of parameters p v instead, which is invariant to reparameterization but assumes that the information in the likelihood dominates that of the prior (Gelman et al. 2003, Lunn et al. 2013): We also assessed how well the posterior statedependent likelihoods fitted to data.The joint probability density of data (''emission probability'') and probability of state transitions were calculated for each model given the posterior parameter samples.The emission and transition probabilities were calculated for each data point as per model specification, but ignoring prior distributions.Transition probability matrix was updated at each time step to incorporate the linear predictor with data on depth and time since surfacing (minFromSurf ).The ''emission only'' prediction was calculated by selecting the state that maximized the sum of the loglikelihoods for data (i.e., in the full model, the likelihood for depth, clicking, pitch and ODBA).The ''Viterbi sequence'' accounted for both emission and transition probabilities by calculating the likelihood for the entire sequence using the Viterbi algorithm (Viterbi 1967, see Supplement script).The predicted states were then compared to the posterior state estimates to assess the contribution of state-dependent likelihoods vs. transition probabilities in the state classification.The two estimates are expected to differ because the posterior state-dependent likelihoods (data) should not always support v www.esajournals.orgthe expected states based on the sequence of states, e.g., a data point resembling drifting (state 5) in the middle of an estimated layer-restricted search phase.
To measure the contribution of data in a given state classification, we re-calculated the most likely states based on a sub-set of emission probabilities from the full model.The full model was chosen in order to compare the contributions of all data streams.The predicted states based on emission probabilities were compared to the model's state estimates, and the percentage of correct predictions for each state was contrasted across the sub-sets.
Layer-restricted search (LRS) state estimates were compared to pre-classified bottom phases and drifting state and silent active state estimates to expert classification of dives (Table C4) to assess their concordance to existing methods of behavioral classification.Unlike LRS state, bottom phases were limited to a single phase within each dive that started and ended with changes in descend and ascend pitch.Therefore a higher sensitivity of LRS state to bottom phases could have also indicated a classification that was less sensitive to multi-layered dives.18.9 14 41.9 12 6.5 6.7 TS 8(6) 18.9 13.8 43.1 10.8 6.5 6.9 TS 9(6) 18.9 13.8 42.1 11.9 6.5 6.7 Note: Blank cells are unavailable state 6 estimates in five-state models.
Table C2.Pair-wise similarity in state classification between the converged set of models Model no.
Fixed step length models Variable step length models 2(5) 4(5) 5(5) 6(5) 8( 5) 9(5) 5( 6) 6( 6) 8( 6) 9(6) 9( 5) 5( 6) 6( 6) 8( 6 Notes: The number of model structures is followed by the number of states in parentheses.Percentages of the time series of state estimates that agreed between the two models are given above the diagonal.The diagonal (values in boldface) shows the total proportion of states estimated as LRS state for each model.Percentages of LRS state estimates (one or both models estimated LRS state) that agreed between the two models are given below the diagonal.v www.esajournals.orgMeasures of accuracy and diagnostic odds ratio (DOR, Glas et al. 2003) were used to compare LRS state to pre-classified bottom phases.The sensitivity of LRS state estimates to pre-classified bottom phases was calculated as the total proportion of LRS state estimates within bottom phases.The specificity of LRS state estimates was calculated as the proportion of non-LRS state estimates within the whole timeseries that was not classified as outside bottom phases.
DOR combines sensitivity and specificity into one discriminatory test performance diagnostic (Glas et al. 2003): Thus, DOR was the ratio of the odds of LRS state in a bottom phases to the odds of LRS state outside the bottom phase.The higher the value, the better the estimated state could discriminate between the human classified states.Standard errors were calculated for the logarithm of DOR that follows approximately a normal distribution (Glas et al. 2003).Sensitivity and specificity of the LRS state classification to bottom phases were calculated based on the proportion of bottom phases ('true conditions') and intervals between bottom phases ('false conditions') that were estimated as LRS state.The number of bottom phases were therefore accounted for in SE(log(-DOR)).
Results.-Six-state models had both lower posterior mean deviance and DIC than their respective five-state model structures.Conversely, posterior mean deviances were higher for models with ODBA despite increased model complexity (Table 3).The effective number of parameters was estimated small for models with smaller deviance (model structures 6 and 9) and higher for models with higher deviance (model structures 4, 5 and 8) both by p D and p v (Table 3, Notes: the given probabilities were calculated as the average posterior proportion of states that were not the most prevalent state within each state estimate (i.e., accounting for the prevalence of state estimates in data, unlike Fig. C2).Blank cells are unavailable state 6 estimates in five-state models.Fig. C1).Therefore, deviance and DIC arrived at a similar ranking of models.
Time series of state estimates were calculated for each model as the most prevalent state in the posterior sample at iterations 16-36k for fixed step length (FS) models and samples 24-48k for time-varying step length (TS) models.The state estimates from all the 15 models agreed on 77.9% of data, yielding similar time budgets (Tables C1  and C2).Six-state model classifications were more consistent within FS models (93.9%) and within TS models (93.9%) than across (86.7%).TS models estimated the highest proportions of data as LRS state than any FS model (.40%; Table C2).
'Overall state uncertainty' was designed to measure the average residual or overall lack of support for the estimated sequence of states.Overall state uncertainty ranged between 3.3-4.5% of samples across all models.Allowing for step length to increase with depth improved the overall state certainty in all converged model structures (Fig. C2).TS models 5 (full model) and 8 (pitch þ ODBA) had the lowest overall state uncertainty (3.25% and 3.33%).
Based on emission probability of data alone (depth, clicking, pitch, ODBA), the models' discriminatory power broadly mirrored that of their overall state uncertainty (Figs.C2 and C3), indicating that any lack of support for the most prevalent states was driven by the state-dependent likelihoods.Emission probabilities predicted a posterior average of 89.1-93.45% of state estimates across models.When accounting for transition probabilities (Viterbi algorithm), the models' ability to discriminate states was improved and less variable between models (97.8-  v www.esajournals.orgFig. C2.Overall state uncertainty for each model (total bar width) with contributing states (states 1-6 colorcoded from left to right).Overall state uncertainty was calculated for each model as the total proportion of posterior samples that were not the most prevalent state.Models are shown in ascending order of overall uncertainty with the lowest (best) values at the bottom.Model structure numbers are given before number of states in brackets.'Base' structure here refers to depth þ clicking that were included in all of the converged set of models; 'TS' refers to time-varying step length models.Fig. C3.Percentage of state estimates predicted correctly when the predicted state was based on emission probability alone (black) and on both emission and transition probabilities (Viterbi algorithm, blue).Percentages for each posterior sample (small dots), mean percentage (large dots) and 95% quantiles (intervals) are shown.Models are shown in ascending order of average percentage of correct predictions based on emission alone, with the highest (best) values at the bottom.Emission probabilities were calculated as the sum of the state-dependent posterior log densities for each data stream.The predicted state in a time step, given an emission, was found by maximising the emission probability across the states.Transition probabilities were calculated based on the posterior transition matrix and linear predictor at each time step for each model ('TS': time-varying time step length models).The most likely path was found by Viterbi algorithm that minimizes both the emission and transition probabilities across sequences of states.See Supplement for r script.98.9%).Viterbi algorithm improved the state predictions only 7.0% on average, highlighting how variable and relatively little the (mostly) Markov state-transitions contributed to the state classification (Fig. C3).
Surface and drifting states had the lowest average state uncertainties (0.44% and 0.70%) and descent and ascent states the highest across all models (6.08% and 5.48%) (Table C3, Fig. C4).Silent active state had similarly high average uncertainty both within FS and TS models (5.74% and 6.07%, Table C3).Although the emission probabilities predicted silent active state better than the foraging states 2-4 (Fig. C4), silent active state was predicted worse than any other state when accounting for transition probabilities (Fig. C5).However, excluding the sixth state from the model appeared to decrease the contribution of state-dependent likelihoods in descent state estimation (Fig. C4).The statedependent likelihoods for TS models were further better able to discriminate descent and ascent states than FS models (Fig. C4).The overall lower state uncertainty of TS models therefore appeared to be driven by the foraging states (descent, LRS and ascent).Contribution of data was measured for the state classification of the full TS model structure 5 with six states.Compared to the full set of likelihoods, the percentage of correct predictions decreased most for LRS state and silent active state when clicking was excluded from the predictions.In contrast, removing ODBA changed the percentage of correct predictions least (Fig. C6).
There were only small differences in the estimated time budgets for expert classified (Table C4) drifting dives between the models (Fig. C7).All models estimated drifting dives to consist more than an average of 76% (76.3-82.9%) of time in drifting state, and at least an average of 13% (12.9-18.1%) of time in state 2 (descending).
Five-state models with pitch estimated drifting dives to also contain ascending (3.9-4.4.6%) while six-state models estimated silent active (silent active state, 7.0-7.9%)(Fig. C7).Expert classified silent active swimming dives had more variable time budgets across models than drifting dives.Five-state models without pitch estimated these dives to consist mostly of drifting state (Model 2 average: 75.2%, Model 4 average: 68.7%) while five-state models with pitch estimated these dives to consist mostly of ascend (state 4 averages 61.2-62.6%)(Fig. C7).Six-state classifications were more consistent, with ;75% in silent active state and ;3% in drifting state.
LRS state estimates of model structure 6 and 9 with six states were most sensitive to the preclassified bottom phases (0.79 and 0.78, respectively; Fig. C8).With little differences in specificity between the models, also the diagnostic odds ratio (DOR) selected for these two models as the best match for pre-classified bottom phases.In terms of the 95% confidence intervals for DOR, all FS models appeared to be significantly poorer classifiers of bottom phases than TS models (Fig. C8).Notes: Posterior summary statistics were calculated using 'summary.mcmc'function in package 'coda' in r.Time-series standard error (SE) was based on an estimate of the spectral density at zero.Parameter names are given as in BUGS code (Supplement), with numbers in square brackets referring to the state(s) (1-6) with which the parameter is associated.v www.esajournals.orgFig. D2.Residual plots for depth and pitch in the best hidden state model.Raw residuals were calculated as posterior mean (''fitted'') minus observed value for each time step.In order to obtain sign for the posterior predicted depth, we used the posterior mean of samples that were monitored for the random walk in the model (parameter 'mu' in Supplement Jags script).For pitch, posterior mean was calculated based on the posterior means of the beta regression coefficients p.beta0 and p.beta1, as well as the observed vertical step length.

Fig. 3 .
Fig. 3. Hidden state model selection.Model structure numbers are given before number of states in brackets.'Base'structure here refers to depth þ clicking that were included in all of the converged set of models; 'TS' refers to time-varying step length models.Left panel: overall state uncertainty for each model (total bar width) with contributing states color-coded.Overall state uncertainty was calculated for each model as the total proportion of posterior samples that were not the most prevalent state.Gray circles show DIC (from Table2).Middle panel: percentage of time estimated in each state during pre-classified bottom phases.Contributing states are colorcoded so that green shows sensitivity of layer-restricted search to pre-classified bottom phases.Right panel: percentage of time estimated in each state during expert classified silent active dives.Models in all panels are shown in ascending order for overall state uncertainty.

Fig. 4 .
Fig. 4. Characteristics of the selected functional state model.Left panel: sample size and posterior 95% quantile for probability of clicking by functional state (Fig.1).The total numbers of states with and without clicking are given on the bottom gray x-axis, and the posterior estimate for the probability of clicking on the top black x-axis.Middle panel: vertical steps (m/min) predicted as a function of depth (m).Posterior mean steps as a function of depth were predicted based on the posterior mean (solid lines) and 95% quantile (dashed lines) for the random walk parameters r 2 s , l, p 2 and p 4 (Appendix D: TableD1).Predictions for each state are color-coded; vertical step predictions for descents (red) and ascents (blue) include drift (bias, p) and are slightly asymmetric around zero because descent and ascent drift were estimated separately as p 2 and p 4 in the model.Vertical step predictions for states 1 (surfacing, in black), 3 (LRS, in green), 5 (resting, in indigo) and 6 (active-silent, in pink) did not include drift (i.e., not signed) but for illustration, are overlaid here symmetrically with observations both above and below zero.Right panel: Absolute value of pitch (deg) predicted as a function of vertical step length (m/min) (right), each overlaid with observed data.Pitch values were predicted based on the posterior mean (solid lines) and 95% quantile (dashed lines) values the pitch regression intercept a 0,s and coefficient for depth a 1,s (Appendix D: TableD1).

Fig. 7 .
Fig. 7. AIC model selection for tagging effects on state transitions (left) probability of buzzing (middle) and ODBA (right).Baseline models are shown on top of each Fig. and candidate Tagging covariate combinations on the left.Black solid circles show AIC for each model and vertical line AIC value for the baseline model.Models were considered to have performed better than the baseline model if their AIC was at least two units lower (horizontal grid length).

Fig. 6 .
Fig. 6.Behavioral time budgets (left) and proxies of foraging success (probability of buzzing, center) and locomotion cost (mean ODBA, right) averaged across individuals during the Tagging condition (top, 8.1 h) and post Tagging baseline (bottom, 79.4 h).Each state is color-coded as in Fig. 3.

Fig. B1 .
Fig. B1.Example output used to visually assess convergence of state classifications.Convergence of state classification was approximated by total proportion of states in the posterior sample.Top left panel shows proportions in each chain (color-coded) as a function of iteration history (x-axis).Top right panel shows posterior density distribution pooled across chains.Bottom left panel shows shrink factor (Brooks-Gelman-Rubin Diagnostic, BGR) as a function of the upper limit of samples that were used in the calculation of the diagnostic (function gelman.plot in package coda in r).The BGR value in the title was calculated for the whole posterior sample, with 95% confidence interval in the brackets.Bottom right plot shows autocorrelation for each chain as a function of lag; the total effective sample size for the samples is shown in the title (function effectiveSize in package coda in r).The examples show model 1 with six states (a) that was rejected based on its poor convergence of state proportions, and model 2 with five states (b) that appeared to converge adequately.

Fig. B3 .
Fig. B3.Effective posterior sample size (x-axis) for transition probabilities from state 3 to states 2-6 (yaxis) in fixed-step length (FS) models (black lines) versus time-varying step length (TS) models (blue).Effective posterior sample sizes are given per 20 k iterations; for effective size analysis, posterior samples were thinned by a factor of 50 for FS models and by a factor of 54 for TS models.

Fig. C1 .
Fig. C1.Comparison of the two estimates for the effective number of parameters for each converged model (n ¼ 15): pD (returned by jags, difference between the posterior mean and the deviance evaluated at posterior means) and pV (variance of posterior deviance divided by two).The lowest deviance model is circled in blue.

Fig. C4 .
Fig.C4.The average (solid circle) and 95 quantile for the percentage of correct predictions for each state estimate when predicted by emission probability alone.Emission probabilities were calculated as the sum of the state-dependent posterior log densities for each data stream.The predicted state in a time step, given an emission, was found by maximising the emission probability across the states.'TS' refers to time-varying step length models.

Fig. C6 .
Fig. C6.The average (solid circle) and 95 quantile for the percentage of correct predictions for each state estimate when predicted by a sub-set of state dependent likelihoods for each data (y-axis) in the full time-varying step length (TS) model 5 with six states.

Fig. C7 .
Fig. C7.Comparing state estimates of the converged set of models to expert classified drifting dives and silent active dives.Bar plots show time budget of state estimates (color-coded) within each type of dive.Models are shown in ascending order of state proportion for silent active dives, then by drifting dives, with the highest values at the bottom.'TS' refers to time-varying step length models.

Fig
Fig. C8.Comparing LRS state estimates of the converged set of models (y-axis) to pre-classified bottom phases.Bar plots show proportion of state estimates (color-coded) in pre-classified bottom phases (left) and intervals between pre-classified bottom phases (middle).Green bar widths therefore represent sensitivity and specificity of LRS state estimates to the bottom phases.Logarithm of the diagnostic odds ratios (DOR) are shown in the far right figure, with 95% Gaussian confidence intervals.Sample size for standard errors were based on number of bottom phases (n ¼ 171).Models are shown in ascending order of DOR with the highest (best) values at the bottom.Model structure numbers are given before state number in brackets.'Base' structure here refers to depth þ clicking that were included in all of the converged set of models; 'TS' refers to time-varying step length models.
Expert classified dive types (rows, 1-11) are as shown in TableC4.Only data from post-tagging baseline period are included.Duration and maximum depth are means.Percentage of dive cycle time values are mean with SD below.

Fig. D1 .
Fig. D1.Prior and posterior densities for the best model (Model 5 with 6 states and time-varying step length).Posterior densities are shown both at range [0, 1] (shaded gray) and zoomed in to the posterior range (blue).Prior densities are shown as dashed line at [0, 1] range.Thinned posterior samples (1334 out of 20k iterations) are given as rug plot at the bottom of each graph at the scale of the prior.

Fig. D3 .
Fig. D3.Time series of posterior state probabilities for each individual (color-coded), overlaid with dive profile (white) and presence/absence of clicking (black dots) at bottom half of the graph.The top half of the graph shows absolute value for pitch scaled to 0-1 (gray), and for sonar exposures, sound exposure levels (black; dB re 1 lPa 2 s).Sonar pings from an unidentified source are given as yellow vertical lines.Black vertical lines labelled at top x-axis show received sound exposure level (SEL).

Fig. E3 .
Fig. E3.Residuals and fitted values for the GEE model (state6 ; prevState þ Tagging).(a) Raw binomial residuals for the post-tagging baseline data as a function of time since the end of the tagging period (vertical line at 0 hours), and for the Tagging condition as a function of time since tag deployment (vertical line at À3 hours), colored by previous state (pink ¼ state 6).A positive residual indicates state 6 in the data, and a smaller positive value indicates that the GEE fitted a higher probability.(b) Autocorrelation of the raw residuals (y-axis) as a function of lag (x-axis).(c) Fitted probability of state 6 as a function of posterior probability of state 6.Circles and triangles show data from post-tagging baseline and tagging periods, respectively, colored by previous state.(d) Individual average fitted probability vs. average observed presence of state 6 within the post-tagging baseline (black solid circles) and Tagging (red triangles) periods.Segments join data from the same individual.

Fig
Fig. E4.The best tagging effects model for state (prevState þ whale þ Tagging) re-fit without each whale (xaxis) and checked for AIC against the baseline model (prevState þ whale).Right y-axis (red) shows duration of the tagging period in minutes.Horizontal line shows our cut-off DAIC ¼ À2.

Table 1 .
Summary of tag records.
formulation, see McClintock et al. 2013): Þlðb t jh; s t Þlðc t jh; s t Þlðu t jh; s t Þ lðx t jh; s t Þlða t jh; s t Þlðs t jh; s t Þlðs t jh; s tÀ1 Þ

Table 2 .
Deviance, effective number of parameters ( p v ) and deviance information criterion (DIC, based on p v ) for the 15 converged models in the last 10,000 iterations.
Notes: Models labelled 'FS' (fixed step length) estimated a constant step length within each state; models labelled 'TS' (timevarying step length) allowed vertical step to increase as a function of depth during foraging states.See Appendix C for calculation of p v .

Table 3 .
The lowest AIC models (generalized linear models; no random effects) and the corresponding GEE models with QIC (with whale as a random effect).

Table A1 .
Estimable parameters in full model with prior and initial value specification.

Table C1 .
Estimated time budget for each fixed step (FS) and time-varying step (TS) length model.

Table C3 .
State uncertainty (probability that the estimated state was not the true state) in each model.

Table C4 .
Expert dive classification.Dive types were divided into two main categories based on the presence of echolocation clicks.Sub-categories were determined by the shape of the dive profile and active swimming versus drifting.

Table D2 .
Descriptive dive cycle (diving þ post surfacing) statistics for each state estimate within expert classified dive types.