Designing observational biologging studies to assess the causal effect of instrumentation

Authors


Correspondence author. E-mail: authierm@gmail.com

Summary

  1. Biologging has improved ecological knowledge on an increasing number of species for more than two decades. Most studies looking at the impact of tags on behavioural, physiological or demographic parameters rely on ‘control’ individuals chosen randomly within the population, assuming that they will be comparable with equipped individuals. This assumption is usually untestable and untenable since biologging studies are observational rather than experimental, and often involve small sample sizes. Notably, the background characteristics of wild animals are, most of the time, unknown. Consequently, investigating any causal effect of instrumentation is a difficult task, subject to hidden biases.
  2. We describe the counterfactual model of causal inference, which was implicit in early biologging studies. We adopted methods developed in the social and political sciences to construct a posteriori an appropriate control group. Using biologging data collected on Scopoli's shearwaters (Calonectris diomedea) from a small Mediterranean island, we used this method to achieve objective causal inference on the effect of instrumentation on breeding performance and divorce.
  3. Our method revealed that the sample of instrumented birds was nonrandom. After identification of a relevant control group, we found no carry-over effects of instrumentation on breeding performance (taking into account imperfect detection probability) or divorce rate in Scopoli's shearwaters.
  4. Randomly chosen control groups can be both counterproductive and ethically dubious via unnecessary additional disturbance of populations. The counterfactual approach, which can correct for selection bias, has wide applicability to biologging within long-term studies.

Introduction

There is no controversy over the beneficial impact of the biologging revolution on wildlife ecology (Ropert-Coudert et al. 2012, but see Hebblewhite & Haydon 2010). Biologging is the ‘use of miniaturized animal-attached tags for logging and/or relaying data about an animal's movements, behaviour, physiology and/or environment’ (Rutz & Hays 2009). Knowledge of the ecology of elusive animals, in particular marine species, has greatly increased over the last two decades, as epitomized by seabird research (Wilson et al. 2002). Seabirds have been the topic of a sustained wealth of biologging studies (Vandenabeele, Wilson & Grogan 2011). The percentage of publications addressing the potential for detrimental effects of tags on seabirds, however, did not increase over the same period (Vandenabeele, Wilson & Grogan 2011).

Although guidelines and suggestions on instrumentation and animal welfare have been issued over the years (Wilson, Grant & Duffy 1986; Phillips, Xavier & Croxall 2003; Hawkins 2004; Wilson & McMahon 2006; Casper 2009), a shortcoming of impact studies is often the control group. Biologging is plagued by a catch-22 effect (Barron, Brawn & Weatherhead 2010): behaviours upon which we expect an adverse effect from tags may not be observable without the latter. Wilson, Grant & Duffy (1986) identified this issue and proposed to use linear regression to infer the behaviour of untagged animals. If the underlying statistical model is correct, this approach may predict values for the same animal, as if it had not been equipped. The conditional tense betrays a counterfactual in the terminology of causal inference (Rubin 2006). Indeed, causal inference concerns what would happen following an intervention, hypothetical or real (Gelman & Hill 2007).

Sample size is another issue. Random sampling guarantees that the background characteristics of animals will be balanced on average between a control group and an instrumented group. Yet any specific study, especially a small-sample one, will have some bias due to imbalance (Gelman & Hill 2007, p. 172). A small sample cannot reproduce all the essential features of the target population, although belief in the contrary is widespread (Tversky & Kahneman 1971). Because only a small number of expensive tags can typically be deployed (Hebblewhite & Haydon 2010), the assumption of no selection bias is a strong one. The healthy-looking bird with shiny feathers is more likely to be instrumented with an expensive tag than an emaciated, shabby-plumaged one. Ethics and animal welfare considerations actually forbid instrumenting the second bird. Assessing the impact of instrumentation demands a meaningful control sample, that is, a group of birds that could have been equipped but were not. Causal questions on instrumentation can only be unambiguously addressed if such a control group exists. In general, a potent threat to causal inference is selection bias, that is, bias due to the inadequate choice of a control sample.

Some studies of the impact of instrumentation reported no short- or long-term effects, for example on large animals (McMahon et al. 2008). A recent meta-analysis on birds reported an overall negative impact, but also that breeding success and survival were larger for birds equipped with larger tags (Barron, Brawn & Weatherhead 2010). Similarly, Grémillet et al. (2005) found that the resighting rate of Arctic Great Cormorants (Phalacrocorax carbo) 2 years after instrumentation was higher for birds which had been equipped with internal heart-rate data loggers. It is difficult to believe the causal interpretation of these results, which may rather be statistical artefacts, such as Simpson's or Lord's paradox (see Appendix S1).

Simpson's paradox occurs whenever the relationship between two categorical variables differs depending upon whether subgroups are accounted for in an analysis or not. Its resolution often lies in causal reasoning: only variables that are unaffected by the treatment should be accounted for (see Appendix S1). This truism stems from the definition of a cause for an effect: (i) the putative cause happened before the effect (spatio-temporal contiguity); (ii) putative cause and effect covary; and (iii) other potential putative causes that may affect the phenomenon are ruled implausible (confounders are neutralized). Lord's paradox is more subtle (see Appendix S1, Lord 1967; Holland & Rubin 1983), but illustrates the importance of defining relevant comparisons and clearly stating all assumptions underlying estimates implied to be causal (Rubin, Stuart & Zanutto 2004). Lord's paradox occurs whenever a control group is missing: conclusions on the cause of effects may result more from seemingly innocuous statistical assumptions rather than data (King & Zeng 2007; Arah 2008). Causal inference aims at predicting what would have happened, had a treated unit been left as a control. That is, inference proceeds by estimating some unobserved outcomes, either implicitly (classical approach) or explicitly (Rubin, Stuart & Zanutto 2004; Rubin 2006). In both cases, assumptions and definitions are required to avoid paradoxical results.
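To make Simpson's paradox concrete, the toy counts below (invented for illustration, not shearwater data) show a ‘treatment’ that looks beneficial within each subgroup yet harmful once subgroups are pooled:

```python
# Toy counts illustrating Simpson's paradox (numbers invented for illustration).
def rate(success, total):
    return success / total

# Subgroup A: tagged 81/87 vs untagged 234/270 successful.
# Subgroup B: tagged 192/263 vs untagged 55/80 successful.
assert rate(81, 87) > rate(234, 270)        # tagged do better within A
assert rate(192, 263) > rate(55, 80)        # tagged do better within B

pooled_tagged = rate(81 + 192, 87 + 263)    # 273/350
pooled_untagged = rate(234 + 55, 270 + 80)  # 289/350
assert pooled_tagged < pooled_untagged      # ...yet worse once pooled
```

The reversal arises because tagging is unevenly distributed across the subgroups, which is why only variables unaffected by the treatment should be conditioned on.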

Methods

Our focus is two-fold: (i) how to find a control group in an observational study; and (ii) how to ensure that the control group is adequate for causal inference. These steps will help assess whether unambiguous causal inference is possible. Observational studies and randomized experiments have often been opposed in ecology (Sagarin & Pauchard 2010). Rather than a dichotomy, they represent two ends of a continuum of suitability for inferring the cause of effects (Rubin 2007). In a randomized control trial (RCT), treatment is randomly allocated to units such that both known and unknown confounders are evenly distributed between control and treated units: each unit has a nonzero probability of receiving each treatment, independently of other units. This is usually not the case in an observational study, but the latter may be conceptualized as a ‘broken RCT’ (Rubin 2006). Below, we detail how to mend a posteriori an observational study as if it were an RCT, drawing from methods developed in the political and social sciences (Rubin 2006, 2007, 2008; Sekhon 2009; Austin 2011; Sekhon 2011). Methods are summarized in Fig. 1.

Figure 1.

Flowchart for designing an observational study to mimic a randomized control trial. Dotted arrows symbolize feedback loops. ATT stands for ‘Average Treatment effect on the Treated’ and is the causal effect of interest.

Treatment

The first step in any causal analysis is to define the treatment of interest. The aim of causal inference is to investigate what would happen to an outcome variable following a potential intervention or manipulation. A clearly defined treatment enables the identification of appropriate comparisons, irrespective of the technical methods used to estimate the effects of a cause on an outcome (Rubin, Stuart & Zanutto 2004).

Potential outcomes

Causal inference has a fundamental problem (Rubin 1978): a unit (individual) is either treated (Z_i = 1) or not (Z_i = 0) but cannot be both. Its observed response is as follows:

Y_i = Z_i Y_i(1) + (1 − Z_i) Y_i(0)    (eqn 1)

Y_i(1) and Y_i(0) are potential outcomes, of which only one will effectively materialize. Either Y_i(1) is observed and Y_i(0) becomes the counterfactual, or Y_i(0) is observed and Y_i(1) becomes the counterfactual. The counterfactual model (eqn 1, Fig. 2a) has two core characteristics (Rubin 1978): first, it defines a causal effect as a comparison of potential outcomes on a common set of units. The causal effect for a unit can be the difference Y_i(1) − Y_i(0), and the average causal effect is E[Y(1) − Y(0)]. Second, the counterfactual model stresses the importance of study design by insisting on the assignment mechanism. The assignment mechanism is the hypothetical or real rule that guided the decision whether to treat a unit (Fig. 2a). It describes which potential outcome is observed: Y_i(1) or Y_i(0). Causal inference in an observational study is a doubly missing data problem, with both the assignment mechanism and one of the potential outcomes missing (Fig. 2b).
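The counterfactual model of eqn 1 can be sketched on simulated data; everything below (sample size, effect size, outcome model) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Both potential outcomes exist for every unit, but only one is observed.
y0 = rng.normal(0.0, 1.0, n)      # Y_i(0): outcome if left as control
y1 = y0 + 0.5                     # Y_i(1): outcome if treated; true effect 0.5
z = rng.integers(0, 2, n)         # randomized assignment mechanism
y_obs = z * y1 + (1 - z) * y0     # eqn 1: the observed response

# The unit-level effect y1 - y0 is never observable, but under randomization
# the difference in observed group means estimates the average causal effect.
estimate = float(y_obs[z == 1].mean() - y_obs[z == 0].mean())
```

In an observational study, `z` is not randomized and must instead be reconstructed, which is the purpose of the propensity score below.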

Figure 2.

The Rubin Causal Model or counterfactual model. (a) A randomized control trial detailing the two steps of (i) assigning units to either the control or treatment conditions before (ii) recording outcomes of causal interest. At the design stage, both potential outcomes are still possible. (b) An observational study: the missing red arrow emphasizes that the assignment mechanism is missing and must be inferred to construct a valid control group for causal inference.

Stable unit treatment-value assumption

Data (Y_i(0), Y_i(1)) are assumed fixed, and randomness stems from the assignment mechanism: this is the Stable Unit Treatment-Value Assumption (SUTVA, Gelman et al. 2003, p. 201). SUTVA entails (i) only one version of the treatment and (ii) no interference between units: the potential outcome observed for a given unit is independent of the treatment assignment of other units (Sekhon 2009). If SUTVA is violated, there are more than two potential outcomes, which complicates the identification of causes (Fig. S1). Instances where SUTVA does not hold are beyond the scope of this study (see Chapter 6 of Gelman et al. 2003).

Potential outcomes stress the importance of time for causal inference: before exposure to a treatment, two outcomes are possible. The familiar regression notation E[Y_i | X_i] eclipses the assignment mechanism. Familiar regression modelling implicitly relies on counterfactuals, but does not necessarily correct for selection bias (Gelman & Hill 2007; King & Zeng 2007). Counterfactual outcomes may be predicted with this approach, but may also be very sensitive to modelling assumptions (see Appendix S1, Gelman & Hill 2007; King & Zeng 2007).

Propensity score

Designing an observational study means reconstructing the assignment mechanism with a probability model for treatment Z_i given covariates X_i. Covariates are any variables unaffected by instrumentation and include pretreatment variables such as age or sex. If intermediate outcomes are included, Simpson's paradox may arise. Assuming that no confounder is omitted:

e(X_i) = Pr(Z_i = 1 | X_i)    (eqn 2)

where e(X_i) is the propensity score, or the probability of a unit receiving treatment as a function of its observed covariates (Rubin 2008).
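As an illustration of eqn 2, a propensity score model can be fitted by maximum likelihood on simulated data; the covariates, coefficients and logistic link below are invented for illustration (the case study later uses a robit link with Bayesian shrinkage instead):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical covariates: standardized body mass and sex (0/1); the true
# assignment favours heavier birds, mimicking a selection-biased deployment.
mass = rng.normal(0.0, 1.0, n)
sex = rng.integers(0, 2, n).astype(float)
X = np.column_stack([np.ones(n), mass, sex])
beta_true = np.array([-2.0, 1.0, 0.5])
z = rng.binomial(1, expit(X @ beta_true))        # treatment indicator Z_i

def nll(beta):
    """Negative Bernoulli log-likelihood of the propensity model (eqn 2)."""
    p = np.clip(expit(X @ beta), 1e-9, 1 - 1e-9)
    return -(z * np.log(p) + (1 - z) * np.log(1 - p)).sum()

beta_hat = minimize(nll, np.zeros(3), method="BFGS").x
e_hat = expit(X @ beta_hat)                      # estimated propensity scores
```

The fitted `e_hat` values are what matching later operates on; the outcome of interest plays no role at this stage.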

Matching on propensity score

The propensity score is a balancing score, such that the (statistical) distribution of covariates for a given value of e(X) is the same whether a unit received treatment or not (Rubin 2006). The propensity score is the coarsest many-to-one balancing score, meaning that covariate balance between control and treatment can be achieved by matching solely on e(X) (Rubin 2007). Given a single value of e(X), a suitable control for a treated individual is simply an untreated one with a similar value of e(X).

While e(X) is known in an RCT, it is missing in an observational study and must be estimated (ê(X)). In biologging, animals that had no chance of being instrumented (e(X) = 0) or were bound to be equipped (e(X) = 1) cannot be used for causal inference. Realistic counterfactuals entail 0 < e(X) < 1. For example, in burrow-nesting seabirds, nest accessibility (burrow depth) affects trap-ability. An additional complication stems from imperfect detection: some animals may be more trappable than others, and such unobserved heterogeneity may give rise to Simpson's paradox.

Placebo tests

Propensity score matching mimics randomization after data collection, but still assumes no hidden bias. Matching methods have the potential to correct for selection bias, but may nevertheless fail. A painful, but potentially valid, conclusion may be that the data at hand are not informative on some relevant causal effects. In order to check that matching has indeed corrected bias, one may test that no important variable has been omitted from eqn 2 by comparing control and treated samples for a difference that should be null by design. Thus, the strong ignorability assumption behind propensity score estimation may be checked with placebo tests (Sekhon 2009). A positive placebo test may suggest that further assumptions are required, to which results may be sensitive (see Appendix S1).
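A placebo test of the kind described above can be sketched as a likelihood-ratio comparison of two pretreatment detection proportions; the counts are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def placebo_lrt(k1, n1, k2, n2):
    """Likelihood-ratio test that two detection proportions are equal.
    As a placebo, the tested difference should be null by design.
    Assumes 0 < k < n in each group (no log(0))."""
    p1, p2, p0 = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)

    def ll(k, n, p):  # binomial log-likelihood up to a constant
        return k * np.log(p) + (n - k) * np.log(1 - p)

    lrt = 2 * (ll(k1, n1, p1) + ll(k2, n2, p2)
               - ll(k1, n1, p0) - ll(k2, n2, p0))
    return lrt, chi2.sf(lrt, df=1)

# Hypothetical counts: 25/29 equipped vs 18/29 control birds detected
# the year before deployment.
lrt, p_value = placebo_lrt(25, 29, 18, 29)
```

A small p-value on such a pretreatment quantity flags residual imbalance, prompting a revised matching scheme rather than a causal conclusion.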

Observed outcomes

One crucial aspect of an RCT is that outcomes are unavailable when the study is implemented. To mimic this important feature, outcomes of interest should be withheld until a suitable control group is identified to avoid data snooping (Rubin 2008). Matching does not require knowledge of the observed outcomes (see examples in Sekhon 2009).

Causal effect

Causal effects are average treatment effects on the treated (ATT):

ATT = E[Y(1) − Y(0) | Z = 1]    (eqn 3)

With a suitable control group, a consistent estimate of the counterfactual mean E[Y(0) | Z = 1] is the average observed outcome of the matched controls. Alternatively, a model can be used to predict the counterfactuals Y_i(0) of treated units, which are then compared with the observed values Y_i(1) (Rubin 1978; Gelman & Hill 2007).
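Given one-to-one matched pairs, the ATT of eqn 3 reduces to an average within-pair difference; the binary outcomes below are hypothetical:

```python
import numpy as np

# After one-to-one matching, each treated unit is paired with one control.
# The ATT of eqn 3 is then the average within-pair difference in outcomes.
y_treated = np.array([1, 0, 1, 1, 0, 1])  # hypothetical outcomes, equipped birds
y_matched = np.array([1, 0, 1, 0, 0, 1])  # outcomes of their matched controls

att_hat = float((y_treated - y_matched).mean())
# Five of the six pairs agree, so the estimate is 1/6.
```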

We will now illustrate propensity score matching to correct for selection bias with an investigation of the impact of tags on Scopoli's shearwaters (Calonectris diomedea). Managers and scientists may be concerned whether instrumenting seabirds causes divorce or interferes with breeding performance the following year. Divorce, a potentially costly event (Choudhury 1995), is defined as a bird pairing with a new partner (at time t + 1) in spite of its former mate (at time t) being simultaneously present and alive (Choudhury 1995).

Material

Field work

Deployments were carried out between mid-July and mid-September 2011 on Riou Island (43° N, 5° E), off Marseille, France. Thirty-four GPS loggers were deployed, one on a single partner from each of 34 different active nests of Scopoli's shearwaters. Population size is estimated at 280–300 breeding pairs (Anselme & Durand 2012). Rats (but not cats) are present on the island, but subject to a regulation program. Breeding activity was determined as part of the long-term demographic monitoring program run since 1976 by the Conservatoire d'Espaces Naturels de Provence-Alpes-Côtes d'Azur (CEN-PACA).

Birds were caught inside their underground burrows, at night, after chick feeding. GPS were attached to back feathers using Tesa® tape (Tesa s.a.s., Savigny le Temple, France). Total weight of a GPS was 20 g (4·0 cm × 2·2 cm × 0·8 cm), corresponding to 3·1% and 3·6% of average body mass for males and females, respectively. Equipped birds were weighed with a spring scale at deployment. In addition to GPS, time-depth recorders (TDR, ⊘ 8 mm × 11 mm, weighing 2·7 g) were attached with Tesa® tape to tail feathers on all but six birds. The average body mass of equipped males was 630 g (range 580–760 g) and that of equipped females 550 g (range 490–600 g). Out of the 34 deployed GPS, 31 were subsequently recovered, usually within 4 days and after at least one foraging trip at sea (maximum: 4 trips). Upon recapture, GPS and TDR were retrieved, and the tips of two primary feathers were clipped for isotopic analyses. Subsequently, 21 (out of 31) birds were equipped with a geolocator (GLS, ⊘ 8 mm × 35 mm, weighing 3·6 g) mounted on a plastic ring. Handling time was kept ≤10 minutes. Among the initial 34 equipped birds, three individuals were not metal-banded and were excluded from the analysis.

Instrumentation had no obvious short-term impacts on birds: they all performed foraging trips and returned to their nesting burrows at night. We did not assess short-term impacts because (i) such studies already exist (Igual et al. 2005; Villard, Bonenfant & Bretagnolle 2011); (ii) they are potentially biased by an inadequate choice of control birds; and (iii) using a control group is both logistically and ethically challenging when working on small, vulnerable populations because it doubles the disturbance of animals. Finally, we had already examined the data at the end of the 2011 field season: no instrumented bird was a failed breeder. Because outcomes for the 2012 breeding season (divorce, breeding decision and success) were unknown, we could objectively design an observational study.

From the CEN-PACA database, we extracted the life histories of all shearwaters breeding on Riou Island in 2011. Sex was behaviourally determined from calls. Most birds were ringed as adults, and only one instrumented bird was ringed as a chick. Body masses correspond to the average adult mass of birds across all resighting events before 2011. Birds with missing sex or adult body mass information were excluded, yielding a total of 183 birds with no missing information.

Propensity score estimation

To control for trap-ability, we modelled the propensity score (the probability of equipping a bird with tags) as:

e_i = Pr(Z_i = 1 | X_i) = F(X_i β)    (eqn 4)

Because instrumentation is a rare event, we used as the link function F the cdf of a Student t distribution with location 0, scale 1·5484 and 7 degrees of freedom, a robust (‘robit’) link (Fig. S2; Liu 2004). Using data augmentation (Albert & Chib 1993), we used shrinkage regression with a horseshoe prior (Carvalho, Polson & Scott 2010) to achieve automatic variable selection in propensity score estimation (Greenland 2008). Balance was graphically assessed (see Table S1 and Fig. S3 for results without shrinkage). Model fitting was performed with WinBUGS (Lunn et al. 2000) called from R (R Development Core Team 2012). Prior specifications are available as Supporting Information.
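The robit link can be sketched with standard scientific-Python tools; this only illustrates Liu's (2004) point that the chosen t cdf closely tracks the logistic link while having heavier tails (it is not the WinBUGS model used in the study):

```python
import numpy as np
from scipy.stats import t
from scipy.special import expit

x = np.linspace(-6.0, 6.0, 201)
robit = t.cdf(x, 7, loc=0.0, scale=1.5484)  # robust 'robit' link (Liu 2004)
logit = expit(x)                            # standard logistic link

# The two links agree closely over the whole range, but the robit's heavier
# tails make the regression less sensitive to outlying observations.
max_gap = float(np.max(np.abs(robit - logit)))
```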

Matching

We matched individuals (without replacement) according to their estimated linear propensity scores [robit(ê_i)] with the R package Matching (Sekhon 2011): we used Mahalanobis metric matching within a propensity score caliper (Rubin 2006). Caliper width was set to 1/4 of the standard deviation of the estimated linear propensity scores. Only birds whose linear propensity score fell within the caliper around that of a bird equipped with tags were considered suitable matches for it. We excluded as a potential match any partner of an equipped bird.
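Greedy Mahalanobis matching within a propensity score caliper, in the spirit of (but much simpler than) the Matching package, might look as follows; the toy propensity scores and covariates are invented:

```python
import numpy as np

def match_within_caliper(lp_t, lp_c, X_t, X_c, caliper):
    """Greedy one-to-one matching without replacement: Mahalanobis metric on
    covariates, restricted to controls within a linear-propensity-score
    caliper. Returns, per treated unit, its control index (or None)."""
    cov_inv = np.linalg.pinv(np.cov(np.vstack([X_t, X_c]).T))
    used, matches = set(), []
    for i, lp in enumerate(lp_t):
        best, best_d = None, np.inf
        for j, lp_j in enumerate(lp_c):
            if j in used or abs(lp - lp_j) > caliper:
                continue                     # outside caliper or already used
            diff = X_t[i] - X_c[j]
            d = float(diff @ cov_inv @ diff) # squared Mahalanobis distance
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            used.add(best)
        matches.append(best)
    return matches

# Toy data: two equipped birds, three candidate controls, caliper of 0.5.
lp_t = np.array([0.0, 2.0])
lp_c = np.array([0.05, 1.9, 5.0])
X_t = np.array([[0.0, 0.0], [2.0, 2.0]])
X_c = np.array([[0.1, 0.0], [1.9, 2.1], [5.0, 5.0]])
matches = match_within_caliper(lp_t, lp_c, X_t, X_c, caliper=0.5)
```

Here the third control falls outside the caliper of both equipped birds and is never used, mirroring how unmatched units drop out of the analysis.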

Initially, 29 birds out of the 31 equipped could be matched. No matches were found for the two individuals corresponding to the two rightmost red strips in Fig. 3. In order to check that the matching procedure worked, we performed a placebo test (Sekhon 2011). A placebo test targets a causal effect that is null by definition (Fig. 1). We assessed whether the probability of detecting a bird in 2010, that is, the year prior to tag deployment, differed between equipped and control birds. This placebo revealed that equipped birds were more likely to be detected in 2010 than control birds (difference in proportions: 0·20 ± 0·13, Likelihood Ratio Test = 2·4, P = 0·12). Although not statistically significant at the 5% level, this result suggested a biased sample due to (unobserved) trap-ability.

Figure 3.

Results from the robit regression for propensity score estimation of the 183 breeding Scopoli's shearwaters detected in 2011 on Riou Island. Red strips correspond to birds that were instrumented in 2011 and light-coloured strips to birds that were not tagged. Individuals represented as light-coloured bands on the leftmost part correspond to birds breeding in 2011 that had an extremely small probability of being equipped and for which there is no similar equipped bird. Likewise, the four rightmost red bands correspond to equipped birds that had the largest probability of being equipped and for which no matches were available (lower panel).

To remedy this, we required equipped birds that were not detected in 2012 to be matched with control birds that were also not detected in 2012, and likewise for detected individuals. Detection in 2012 was not our outcome of interest. Moreover, this covariate did not enter the estimation of propensity scores since, when birds were instrumented in 2011, it was impossible to tell whether they would be detected in 2012. We think that conditioning on detection in 2012 is adequate because (i) it is not one of our outcomes of interest; and (ii) the monitoring of the Riou population is performed by a dedicated field team independent of our research team. Any conscious or unconscious bias that we may have had in looking for previously equipped birds did not affect data collection in 2012.

With this constraint, 27 equipped birds were matched (lower panel of Fig. 3). The placebo test revealed no obvious bias (difference in proportions: 0·12 ± 0·14, Likelihood Ratio Test = 0·8, P = 0·37). Covariate balance between equipped and control samples was satisfactory (Fig. 4).

Figure 4.

One-to-one matching with the Mahalanobis metric within a propensity score caliper (Rubin 2006). Covariate balance is illustrated by means of Tukey plots for the 183 breeding birds in 2011, the 27 equipped birds and the corresponding 27 identified controls. The point represents the median, and the thick line the interquartile range. Thin-lined fences were computed as in Dümbgen & Riedwyl (2007) to illustrate asymmetry. Covariates were standardized.

Observed outcomes

With this suitable control group, we addressed two causal questions:

  1. Does instrumentation affect the breeding performance of a bird the following year?
  2. Does instrumentation cause a bird to change partner the following year?

Imperfect detectability of individuals remains an issue. We used a multistate capture–recapture model to predict counterfactuals. Our sample consists of 54 birds that were alive in 2011. Their life histories spanned 2004–2012. We assumed that all birds survived in 2012. Because survival is perfect until 2011 by design, death can only occur in 2012, but is confounded by imperfect detection. For any year, a bird could be either (i) nonbreeding, (ii) a failed breeder, (iii) a successful breeder or (iv) not seen. There are thus three different states and nine possible transitions (Fig. S4). A bird was considered a successful breeder if its chick fledged, a failed breeder if it failed to do so after laying an egg. Birds caught on the colony, but for which no egg or chick was found in the nest, were assumed to be nonbreeding.

We deleted all 2012 observations for equipped birds in order to predict them from the model. These predictions correspond to what would have been observed if these birds had not been equipped. We compared predicted (ŷ) and observed (y) values with both a χ² test and a likelihood ratio test:

χ² = Σ_s (y_s − ŷ_s)² / ŷ_s  and  LRT = 2 Σ_s y_s log(y_s / ŷ_s)    (eqn 5)

A Bayesian p-value (Gelman, Meng & Stern 1996) close to 0·5 reflects no causal effect of instrumentation on breeding performance the following year: observed data are similar to predicted counterfactuals. A p-value close to either 0 or 1 betrays model misfit, suggesting a causal effect.
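A posterior predictive check of this kind can be sketched as follows; the state counts, the Dirichlet stand-in for posterior draws, and the χ² discrepancy are all illustrative assumptions, not the fitted multistate model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: 27 birds whose 2012 state (nonbreeder, failed
# breeder, successful breeder) is predicted; the Dirichlet draws below
# stand in for posterior draws of the state probabilities.
n_birds, n_draws = 27, 4_000
observed = np.array([5, 8, 14])                  # invented observed counts

chi2_obs = np.empty(n_draws)
chi2_rep = np.empty(n_draws)
for s in range(n_draws):
    probs = rng.dirichlet([2.0, 3.0, 5.0])       # stand-in posterior draw
    expected = n_birds * probs
    replicated = rng.multinomial(n_birds, probs) # predicted counterfactual
    chi2_obs[s] = ((observed - expected) ** 2 / expected).sum()
    chi2_rep[s] = ((replicated - expected) ** 2 / expected).sum()

# Posterior predictive p-value: near 0.5 means observed outcomes resemble
# predicted counterfactuals; near 0 or 1 suggests misfit.
p_bayes = float((chi2_rep >= chi2_obs).mean())
```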

We checked the goodness-of-fit (GOF) of the multistate model with U-CARE (Choquet et al. 2009); fit was adequate (global test: GOF = 9·4, df = 21, P = 0·97). Model fitting was performed with WinBUGS (Lunn et al. 2000) called from R (R Development Core Team 2012). Prior specifications are available as Supporting Information. Finally, we compared the proportion of control birds which changed partner in 2012 to that of equipped birds.

Results

There was no causal effect of instrumenting a bird with tags on its breeding performance the following year. Results from the multistate capture–recapture model are summarized in Tables S2, S3 and Fig. S6. Bayesian p-values were 0·7 and 0·5 for the χ² and likelihood ratio tests, respectively: observed outcomes were not different from their predicted counterfactuals.

Among the 13 control birds with a known partner in 2012, two changed partners compared with 2011. Among the 12 equipped birds with a known partner in 2012, none changed partners compared with 2011. In the latter case, the zero numerator is problematic for classical inference (Winkler, Smith & Fryback 2002), but informative priors offer a solution (Seaman, Seaman & Stamey 2012). Using data from Swatschek, Ristow & Wink (1994) from a colony of Scopoli's shearwaters in Crete, we elicited an informative prior. Divorce rate in this Cretan population was between 3·6% (perfect detection scenario) and 18·8% (conservative scenario). In contrast, divorce rate on Lavezzi Island, Corsica, where black rats were present, was 23·1% (Thibault 1994). Black rats also occur on Riou, yet we could not determine whether they were also present in the study population of Swatschek, Ristow & Wink (1994). We elicited an informative Beta prior by matching its first quartile to the value 3·6% and its third quartile to the value 18·8%. The resulting prior is an informative Beta(0·86, 5·77) distribution (Fig. 5) with an effective sample size of about 7 (0·86 + 5·77 ≈ 6·6).
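The quartile-matching elicitation can be reproduced numerically by solving for the Beta parameters whose quartiles equal the two scenarios; the solver set-up below is one possible implementation, not necessarily the authors' code:

```python
from scipy.optimize import least_squares
from scipy.stats import beta

# Match the first quartile of a Beta(a, b) prior to 3.6% and its third
# quartile to 18.8%, the two divorce-rate scenarios for the Cretan colony.
q1_target, q3_target = 0.036, 0.188

def quartile_gap(params):
    a, b = params
    return [beta.ppf(0.25, a, b) - q1_target,
            beta.ppf(0.75, a, b) - q3_target]

fit = least_squares(quartile_gap, x0=[1.0, 5.0],
                    bounds=([1e-3, 1e-3], [50.0, 50.0]))
a_hat, b_hat = fit.x
# The text reports Beta(0.86, 5.77) for this elicitation.
```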

Figure 5.

Posterior distributions of the divorce rate for control and equipped birds. The informative prior that was used in the analysis is depicted in grey. Points symbolize the median, thick lines a 50% credible interval and thin lines a 95% credible interval.

Posterior mean divorce rates among equipped birds and control birds were, respectively, inline image and inline image (means are bracketed by a 95% credible interval following Louis & Zeger 2009). The difference was inline image, indicating no causal effect of instrumentation on divorce rate the following year (Fig. 5).
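Under a simple beta-binomial model with the elicited prior, the posterior for each group's divorce rate is available in closed form; this sketch uses the counts reported above and is only an approximation to the full analysis:

```python
from scipy.stats import beta

a0, b0 = 0.86, 5.77                   # elicited informative prior

# Conjugate beta-binomial update: posterior is Beta(a0 + y, b0 + n - y).
y_eq, n_eq = 0, 12                    # equipped birds: no divorce observed
y_ct, n_ct = 2, 13                    # control birds: two divorces observed

post_eq = beta(a0 + y_eq, b0 + n_eq - y_eq)
post_ct = beta(a0 + y_ct, b0 + n_ct - y_ct)

# The informative prior keeps the equipped-group posterior away from an
# implausible point estimate of exactly zero despite the zero numerator.
mean_eq, mean_ct = float(post_eq.mean()), float(post_ct.mean())
lo_eq, hi_eq = post_eq.ppf([0.025, 0.975])  # a 95% credible interval
```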

Discussion

After explicitly correcting for selection bias, we found no effect on mate fidelity of instrumenting Scopoli's shearwaters from Riou Island with tags for 3–10 days during the chick-rearing period. Taking into account imperfect detection probability, we found no effect on breeding performance one year after instrumentation either.

Assumptions and limits

Four instrumented birds could not be matched (Fig. 3): our causal estimate does not cover the whole possible range of observations, even if none of these four birds divorced. We also assumed that GPS instrumentation, rather than geolocators or the clipping of two feather tips, was the sole differential treatment, in keeping with the Stable Unit Treatment-Value Assumption (SUTVA). Without SUTVA, there are more than two potential outcomes, which complicates the identification of causes. We deployed several tags on single individuals, a potential SUTVA violation. Because all tags were externally attached and the GPS, which was systematically fitted, was the largest device, we assumed that SUTVA holds. Our study is not atypical with respect to other published ones. Our estimated ATT was imprecise because of small sample size, another characteristic of biologging studies (Hebblewhite & Haydon 2010). To increase precision, many-to-one propensity score matching may be used, although it may also cause further attrition in the sample if k matches are unavailable for each treated unit. Matching with replacement is another possibility (Sekhon 2011), but beyond the scope of the present study.

Both Igual et al. (2005) and Villard, Bonenfant & Bretagnolle (2011) investigated the impact of instrumentation on Scopoli's shearwaters. Their causal effect was the impact of instrumenting at least one mate of a breeding pair with tags. SUTVA does not hold since the probability of equipping a bird may depend on whether its mate was instrumented or not (Fig. S1A). As John Tukey famously declared ‘an approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem’. We should nevertheless strive to provide precise answers to ethics committees and managers for them to make the best possible decisions (Wilson & McMahon 2006).

Biologging studies can have several goals (studying foraging and the impact of tags), thereby raising the possibility that none of them can be attained satisfactorily. Learning about the effects of tags and the foraging ecology of animals simultaneously may not be possible with the same data. A neat distinction between the numerical representation (the ATT) and the empirical representation of a phenomenon (a bird seen with a different mate following its instrumentation) is essential for a fruitful discussion between scientists and managers. Defining the aim of the study and the causal effect of interest before examining data is paramount. To further guarantee objectivity, outcomes of interest must be kept hidden from the analyst until a suitable control group has been found. Designing observational studies as if they were RCTs is important for the credibility of researchers with ethics committees and wildlife managers.

Design versus analysis

Scientists deploy expensive telemetric tags to collect data on the ecology and physiology of wild animals in their natural environment. The sample of equipped animals may be unconsciously biased towards good-quality or easily recapturable individuals. Valid inferences may be drawn from this sample, but extrapolation to the larger population involves additional assumptions (Gelman & Hill 2007). Instrumentation is a rare event, concerning a potentially nonrepresentative fraction of the population. In our study of Scopoli's shearwaters, the dual goals of estimating representative demographic rates and estimating a causal effect are not attainable because of selection bias. The low precision of the demographic estimates (Tables S2 and S3) makes them of little use. Suppose for example that, in a capture–recapture study, fewer than 5% of animals were instrumented. A multistate model is fitted to the observed data, with an indicator variable (or a stratum) for instrumented individuals. Suppose further that the model is deemed acceptable if it accommodates 95% of the life histories. If the 5% of misfits are precisely the instrumented animals, estimated vital rates are still reasonable, but it is risky to give a causal interpretation to the regression coefficient for instrumentation because the model does not provide an adequate fit for these animals.

Our aim with capture–recapture modelling was to account for imperfect detection probability among instrumentable birds. Because the estimated causal effect concerns ‘instrumentable’ animals, we cannot generalize results to the whole Riou population, nor determine the causal effect of instrumentation on a ‘typical’ individual without first defining ‘typical’.

Ethical implications

The ethical issue raised by our work is whether it is worth assessing the causal effect of instrumentation by sampling ‘control’ individuals, when ‘control’ is a strong and untestable assumption. Our study highlights the necessity of finding a suitable control group before collecting one-off data (data that are not part of a systematic monitoring effort) on random individuals if the aim is to test for instrumentation effects. One must explicitly spell out an assignment mechanism before carrying out the study (for example, tossing a fair coin). In the case of one-off instrumentation within a long-term monitoring study, background characteristics can also be used prior to deployment to define a set of similar individuals which will either be equipped or serve as controls. Causal inference is then straightforward because the assignment mechanism is specified a priori. A power analysis should also be carried out to assess whether meaningful effects can be detected given the planned number of tag deployments (Igual et al. 2005).
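Such a power analysis can be sketched by simulation. The example below (with made-up breeding-success figures, not the Riou data) estimates the power of a two-sided two-proportion z test to detect a drop in breeding success among tagged birds, given a planned number of deployments.

```python
import numpy as np

rng = np.random.default_rng(42)

def power_two_proportions(p_control, p_tagged, n_per_group,
                          n_sim=5000):
    """Simulated power of a two-sided two-proportion z test
    (alpha = 0.05) for a difference in breeding success."""
    rejections = 0
    for _ in range(n_sim):
        x1 = rng.binomial(n_per_group, p_control)
        x2 = rng.binomial(n_per_group, p_tagged)
        p_pool = (x1 + x2) / (2 * n_per_group)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_group)
        if se == 0:          # degenerate sample, no test possible
            continue
        z = (x1 - x2) / (n_per_group * se)
        if abs(z) > 1.96:    # two-sided test at alpha = 0.05
            rejections += 1
    return rejections / n_sim

# Hypothetical figures: 75% breeding success in controls vs. 60%
# in tagged birds, with 20 birds per group.
print(power_two_proportions(0.75, 0.60, 20))
```

Under these assumed figures, 20 tags per group yields power well below the conventional 80% target, which is exactly the kind of finding that should temper conclusions of ‘no tag effect’ from small deployments.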

The scope for correcting selection bias after data collection is limited without detailed knowledge of the animals' background characteristics. Propensity score methods are intrinsically a posteriori: they may only be useful within long-term studies. In the case of a one-off biologging study on a population of unknown characteristics, propensity scores cannot be used to reconstruct the assignment mechanism. Collecting data on wild animals must be scientifically and ethically justified. Collecting data on a random control group may be unjustified when causal inference is not guaranteed: an ill-defined control group may imply unnecessary disturbance of particularly vulnerable individuals. Randomly sampling a control group may nonetheless be useful to detect large detrimental effects that may occur during fieldwork (a marked increase in trip duration or mass loss, for instance). Fortunately, with the ongoing miniaturization of data loggers, such large effects are becoming less likely.
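Where background characteristics are available, the core of the propensity score approach can be sketched as follows. The covariates, effect sizes and seed below are ours, purely for illustration: fit a logistic model of instrumentation on background characteristics, then match each instrumented animal to the most similar non-instrumented one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated background data for 200 birds: a standardized body-
# condition index and breeding experience (years). Birds in better
# condition are more likely to be instrumented (selection bias).
n = 200
condition = rng.normal(0, 1, n)
experience = rng.poisson(3, n)
logit_true = -1.5 + 1.0 * condition + 0.2 * experience
treated = rng.random(n) < 1 / (1 + np.exp(-logit_true))

# Propensity model: logistic regression fitted by Newton-Raphson.
X = np.column_stack([np.ones(n), condition, experience])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (treated - p))
score = 1 / (1 + np.exp(-X @ beta))

# 1:1 nearest-neighbour matching (with replacement) on the score:
# each instrumented bird gets the most similar non-instrumented one.
pool = np.where(~treated)[0]
matched = np.array([pool[np.argmin(np.abs(score[pool] - score[i]))]
                    for i in np.where(treated)[0]])

# Covariate balance on body condition, before vs. after matching:
# the raw difference is large; the matched difference shrinks.
print(condition[treated].mean() - condition[~treated].mean())
print(condition[treated].mean() - condition[matched].mean())
```

Balance checks like the final two lines, run for every background covariate, are what justify treating the matched animals as an a posteriori control group; any covariate that remains imbalanced flags a potential hidden bias.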

Conclusion

We detailed how to assess the causal impact of biologging on instrumented animals by trying to recover a posteriori a suitable control group. The grim picture of limited research on the impact of biologging (Vandenabeele, Wilson & Grogan 2011) may partly result from the lack of guidelines for identifying meaningful control groups. Vandenabeele, Wilson & Grogan (2011) and Wilson & McMahon (2006) briefly mentioned this issue but offered no guidelines. Our incremental contribution is to suggest existing methods to fill that gap.

Fig. 1 details how to design an observational study to explicitly assess the impact of biologging on animals. Propensity score matching is, however, not a panacea: it assumes no hidden bias, and it cannot easily be used in catch-22 cases such as the study of foraging efficiency or heart-rate frequency, where different modelling approaches (hydrodynamics, flight mechanics) may be more appropriate (Hazekamp, Mayer & Osinga 2010). A pluralistic approach is clearly needed, within which the counterfactual model should be seriously considered.

Acknowledgements

The long-term monitoring study of Scopoli's shearwaters on Riou Island is approved by the Centre de Recherches par le Baguage des Populations d'Oiseaux (CRBPO, Paris). Access to protected areas and tag deployments were approved by the ethics board of the Conservatoire d'Espaces Naturels de Provence-Alpes-Côte d'Azur. Bird instrumentation was carried out under personal animal experimentation permits #34–369 (D. Grémillet) and #34–505 (C. Péron) delivered by the Direction Départementale de la Protection des Populations. We thank CEN-PACA staff in charge of long-term demographic monitoring of Scopoli's shearwaters on Riou Island: Jean Patrick Durand, Célia Pastorelli, Nicolas Bazin, Timothée Cuchet and Lorraine Anselme. We thank Pierrick Giraudet and Léo Martin for tag deployment, Emmanuelle Cam for multistate models' goodness-of-fit tests and Olivier Gimenez for multistate capture–recapture model BUGS code. Emmanuelle Cam, Olivier Gimenez and Christophe Barbraud offered suggestions on an early version of the manuscript. We thank Jarrod Hadfield and two anonymous reviewers for helpful and constructive comments. The authors declare no conflict of interest.