#### Study population and data collection

The population of Weddell seals breeding in Erebus Bay (southwestern Ross Sea, Antarctica [77.62°–77.87°S, 166.3°–167.0°E]) has been the subject of a mark–recapture program since 1971 (Siniff et al. 1977), and since 1982, every pup born in the study area has been systematically marked shortly after birth. Each year, during the pupping season (October–November), seal colonies were visited every 2–3 days to tag newborn pups and untagged mothers, and five to eight surveys were conducted throughout the entire study area. During surveys, every encountered animal was recorded along with its sex and reproductive status. Animal handling involved in the collection of these data followed a protocol that was approved by Montana State University's Animal Care and Use Committee (Protocol #2011–38).

In this study, we used data from 1982 through 2011 to build encounter histories of individually marked females that were (i) of known age (i.e., tagged as pups inside the study area), (ii) part of the breeding population (i.e., females that had bred at least once), and (iii) resighted at least once after their year of first reproduction (recruitment) to provide information on reproductive rates (here defined as the “probability of producing a pup subsequent to recruitment”). The detection rate of mother–pup pairs is virtually 1.0 on the ice (Hadley et al. 2006), such that every year all females giving birth in Erebus Bay are detected. Moreover, female Weddell seals display strong philopatry, making it extremely rare for a female recruited inside the study area to later reproduce outside the study area (Cameron et al. 2007; Hadley et al. 2007a). Here, we restricted our analyses to data collected on females born inside Erebus Bay and known to have given birth there. Thus, we could reasonably assume that all reproductive events of this subpopulation of seals were recorded in our data, and any female not seen in a given year could be assumed to have skipped reproduction that year.

#### Statistical modeling

Encounter histories started at the first reproductive event (state F) of an individual and thereafter consisted of two possible states: experienced breeder (E; as opposed to “first-time breeder” F) and skip-breeder (S). Probabilities of reproduction were defined as rates of transition from any state *k* (F, S, or E), in year *t*, into state E in year *t* + 1 (*ψ*^{kE}). The complementary transition rates into state S corresponded to probabilities of skipping reproduction (1 − *ψ*^{kE}). Based on previous analyses and knowledge of this population of seals (Hadley et al. 2007a; Rotella et al. 2009; Chambert et al. 2012), we modeled *ψ*^{kE} as a function of reproductive state in year *t* − 1 and of year *t* (modeled as a random effect, *η*_{t}). We also included the standardized age of individual *i* in year *t* (*A*_{i, t}) as a covariate in our models; however, given that this study did not focus on age effects, our primary goal in doing so was to account for the potential confounding effect of age when evaluating our competing hypotheses about individual heterogeneity. Because of its generality and biological relevance, we modeled the age effect as a quadratic relationship and did not test simpler alternatives (e.g., linear trend, no trend). To investigate our three a priori hypotheses (H1, H2, and H3), we considered the influence of two types of individual random effects: (i) a “baseline” individual effect (*α*_{i}) expressed in “normal” environmental conditions; and (ii) an individual effect (*β*_{i}) expressed in “iceberg” years. Accordingly, we built a set of three competing models:
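A reconstruction of the three model equations, consistent with the parameter definitions given in the following paragraph, might take the form below (writing the state dependence as a state-specific intercept *μ*^{k} is our assumption, not confirmed by the text):

```latex
% H1: year and age effects only (no individual heterogeneity)
\mathrm{logit}\left(\psi^{kE}_{i,t}\right) = \mu^{k} + \gamma_{1} A_{i,t} + \gamma_{2} A_{i,t}^{2} + \eta_{t}

% H2: fixed individual heterogeneity (alpha_i expressed in all years)
\mathrm{logit}\left(\psi^{kE}_{i,t}\right) = \mu^{k} + \gamma_{1} A_{i,t} + \gamma_{2} A_{i,t}^{2} + \eta_{t} + \alpha_{i}

% H3: condition-dependent heterogeneity (alpha_i in normal years, beta_i in iceberg years)
\mathrm{logit}\left(\psi^{kE}_{i,t}\right) = \mu^{k} + \gamma_{1} A_{i,t} + \gamma_{2} A_{i,t}^{2} + \eta_{t} + (1 - X_{t})\,\alpha_{i} + X_{t}\,\beta_{i}
```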

Here, *μ* represents a theoretical mean reproductive rate (on the logit scale), *γ*_{1}, *γ*_{2} are the two parameters of the quadratic age effect, and *X*_{t} is a binary covariate indicating whether year *t* was an iceberg year (*X*_{t} = 1) or not (*X*_{t} = 0). We also note that in model H2, *α*_{i} corresponds to a unique random intercept for each individual *i* that is expressed in all years in keeping with the fixed heterogeneity hypothesis. On the other hand, in model H3, *α*_{i} represents the individual effect in noniceberg years (i.e., “normal” years), and *β*_{i} corresponds to the individual effect expressed during iceberg years (i.e., “disturbed” years).

In the analyses presented here, modeling was focused on the sequence of reproductive states during the time an individual was known to have been alive, that is, between its first reproductive event (state F) and its last encounter (in state E or S), a period we refer to here as the “minimal lifetime window” (MinLifeWin) of the animal. Models were thus conditional on the first and the last detection of each individual and did not include a survival parameter. This approach was sensible for our objective of evaluating possible differences among individuals in their frequency of reproduction, and not in their survival rates. Given the very high detection rates inside colonies (Hadley et al. 2007a; Rotella et al. 2009; Chambert et al. 2012) and the high philopatry of locally born animals in colonies (Cameron and Siniff 2004; Cameron et al. 2007), estimates of reproductive rates appeared to be very robust to this right censoring of encounter histories. Indeed, when we performed analyses (not presented in this study) using non-right–censored data in which survival and detection were explicitly modeled, we found that estimates of model parameters were the same. For ease of interpretation, we thus decided to present the results of the simpler approach focusing solely on reproductive rates. Furthermore, as noted before, any nondetection event inside an individual's MinLifeWin necessarily corresponded to the skip-breeding state. Reproductive states were thus known for all years within an individual's MinLifeWin, such that we did not need to include a detection parameter in our models.
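The censoring logic described above — condition on the first (F) and last encounters, and treat any nondetection inside the MinLifeWin as a known skip (S) — can be sketched as follows (a minimal illustration; the function and variable names are ours, not the authors' code):

```python
def states_within_minlifewin(detections):
    """Convert a per-year record (1 = recorded with a pup, 0 = not) into
    known reproductive states inside the minimal lifetime window.

    The window runs from the first pupping record (first reproduction,
    state 'F') to the last record; because detection of mothers is ~1.0,
    any year without a pupping record inside the window is a known skip
    ('S'), and every other year is an experienced-breeder year ('E')."""
    first = detections.index(1)
    last = len(detections) - 1 - detections[::-1].index(1)
    states = ['F']
    for seen in detections[first + 1:last + 1]:
        states.append('E' if seen else 'S')
    return states

# Example: a female first recorded breeding in year 2, skipping twice
history = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
print(states_within_minlifewin(history))  # ['F', 'E', 'S', 'E', 'S', 'E']
```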

A Bayesian approach was used for inference and implemented in the software program OpenBUGS (Lunn et al. 2009). Markov chain Monte Carlo (MCMC) methods were used to sample, and thus approximate, the posterior distributions of the parameters of interest. The year- and individual-varying parameters *η*_{t}, *α*_{i}, and *β*_{i} were modeled hierarchically following independent normal distributions with mean 0 and model- and parameter-specific standard deviations: *η*_{t} ~ *N*(0, *σ*_{η}), *α*_{i} ~ *N*(0, *σ*_{α}), and *β*_{i} ~ *N*(0, *σ*_{β}). The standard deviations *σ*_{α} and *σ*_{β} measure the magnitude of interindividual variability in reproductive rates under their respective environmental conditions and are thus of primary interest to our question. We chose to model the two individual effects (*α*_{i} and *β*_{i}) as independent between the two time periods, rather than explicitly modeling a common correlation for all individuals as a parameter of a multivariate normal distribution. If such a correlation were explicitly modeled, its estimated magnitude would be driven by the individuals behaving similarly in iceberg and noniceberg years, and would not represent the contrasting reproductive patterns seen in some individuals. This specification allowed model H3 to be as distinct as possible from the fixed heterogeneity hypothesis represented in model H2. We nevertheless investigated and quantified correlation under this independence assumption by calculating a correlation coefficient (*ρ*_{α, β}) directly from the joint posterior distribution of the *α*_{i}'s and *β*_{i}'s.
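One way to quantify correlation from the joint posterior, as described above, is to compute a correlation coefficient across individuals at each MCMC draw, yielding a posterior distribution for *ρ*_{α, β}. A minimal numpy sketch with simulated stand-ins for the MCMC samples (all names and the generating values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_ind = 1000, 200

# Stand-ins for MCMC samples of the individual effects:
# alpha[d, i] = draw d of alpha_i, beta[d, i] = draw d of beta_i.
alpha = rng.normal(0.0, 0.8, size=(n_draws, n_ind))
beta = 0.5 * alpha + rng.normal(0.0, 0.6, size=(n_draws, n_ind))

# Correlation across individuals, computed separately for each posterior
# draw, gives a posterior distribution for rho_{alpha,beta}.
rho = np.array([np.corrcoef(alpha[d], beta[d])[0, 1] for d in range(n_draws)])

# Posterior mean and 95% credible interval for the correlation
print(rho.mean(), np.quantile(rho, [0.025, 0.975]))
```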

A prior for *μ* was specified through a uniform U(0, 1) distribution on the mean of *ψ* on the original scale (i.e., on logit^{−1}(*μ*)). The age-effect parameters *γ*_{1} and *γ*_{2} were assigned diffuse normal prior distributions N(0, 1000) on the logit scale. Uniform prior distributions U(0, 10) were used for the hyperparameters *σ*_{η}, *σ*_{α}, and *σ*_{β}. To assess the sensitivity of inferences about the standard deviation of individual effects (*σ*_{α}) to the choice of prior, we compared these results to those obtained using an inverse-gamma(4, 0.05) prior on the variance (see Supporting Information). This latter prior distribution has very high density at values close to zero and thus strongly penalizes high values of the variance. This prior therefore penalizes, a priori, the heterogeneity hypothesis (H2), which is a way to assess the support for this hypothesis conservatively.
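The penalizing behavior of the inverse-gamma(4, 0.05) prior can be made concrete by sampling the implied prior on the standard deviation and comparing it with U(0, 10) (a numpy sketch under the usual shape/scale parameterization of the inverse-gamma; the summaries shown are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# If V ~ inverse-gamma(shape=4, scale=0.05), then 1/V ~ Gamma(shape=4,
# rate=0.05), i.e. numpy's gamma with scale = 1/0.05 = 20.
var_ig = 1.0 / rng.gamma(4.0, 1.0 / 0.05, size=n)
sd_ig = np.sqrt(var_ig)

# Uniform U(0, 10) prior placed directly on the standard deviation.
sd_unif = rng.uniform(0.0, 10.0, size=n)

# The inverse-gamma prior concentrates sigma near zero, so substantial
# heterogeneity (e.g. sigma > 1) is heavily penalized a priori.
print((sd_ig > 1).mean(), (sd_unif > 1).mean())
```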

For each of the three competing models, we ran two chains in parallel with different sets of initial values. The first 5000 MCMC samples were discarded (burn-in period), after having checked that convergence was satisfactory. Convergence was visually assessed using sample path plots in conjunction with the Brooks–Gelman–Rubin diagnostic “R” (Brooks and Gelman 1998), with values close to 1.00 indicating adequate convergence. A total of 150,000 MCMC samples after burn-in were used for inference. All parameters described above were defined on the logit scale in the models, but summaries of the posterior distributions provided later in the text, figures, and tables were transformed back to the scale of a probability of reproduction (i.e., the interval [0, 1]) to ease interpretation. Back-transformed values are hereafter denoted with a star (e.g., *μ**).
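In its basic form, the Brooks–Gelman–Rubin diagnostic compares the pooled posterior-variance estimate to the average within-chain variance; the following is a minimal sketch of that computation (illustrative only, not the OpenBUGS implementation, and without the refinements of the full diagnostic):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R for a set of equal-length MCMC
    chains of one scalar parameter, shaped (m chains, n samples)."""
    chains = np.asarray(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    w = chains.var(axis=1, ddof=1).mean()      # within-chain variance W
    b = n * chain_means.var(ddof=1)            # between-chain variance B
    v_hat = (n - 1) / n * w + b / n            # pooled variance estimate
    return np.sqrt(v_hat / w)

rng = np.random.default_rng(2)
# Two chains sampling the same distribution: R should be close to 1.00
well_mixed = rng.normal(0.0, 1.0, size=(2, 5000))
print(gelman_rubin(well_mixed))
```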

#### Model comparison and hypothesis selection

In addition to making inference directly from parameter posterior distributions, we adopted a model comparison approach. In ecology, the evaluation and comparison of competing models, generally analyzed under a likelihood approach, have traditionally been based on generic criteria of model accuracy (Burnham and Anderson 2002), dominated by the use of the Akaike Information Criterion (AIC). In Bayesian statistics, there is currently no consensus on the best way to select and compare competing models (Link and Barker 2010). Generic criteria, such as AIC, the Bayesian Information Criterion, and the Deviance Information Criterion, are used (Spiegelhalter et al. 2002; Barnett et al. 2010; Cubaynes et al. 2012), but are also widely criticized, especially in the context of hierarchical (random effects) models (Link and Barker 2010). Here, we chose to implement posterior predictive checking to compare the performance of our three competing models (Gelman et al. 2004; Schofield and Barker 2011). The principle of posterior predictive checking (see Gelman et al. 2004, p. 159) is straightforward: if a given model represents a good approximation of the true process that generated the data, then replicated data generated under this model should have very similar features to the observed data. This approach allows assessment of the goodness of fit of each model and provides an explicit tool for model comparison.

For each particular model, the implementation of posterior predictive checking took place as follows. First, 10,000 replicate data sets (*y*^{rep}) were simulated under different draws from the joint posterior distribution of all parameters to account for uncertainty. As in the statistical models used to analyze the data (see above), the simulation of each individual's reproductive history (i) was conditional on its first and last encounters (i.e., its MinLifeWin was fixed) and (ii) started with state F. Subsequent states were simulated using year- and individual-specific reproductive rates calculated from the set of parameters relevant to each model, thus including the effects of state, age, and year for all models. Second, a relevant function of the data (*T*(·)) was derived for each replicate (*T*(*y*^{rep})), and the distribution of *T*(*y*^{rep}), called the posterior predictive distribution, was compared with the observed value *T*(*y*^{obs}) derived from the observed data set (*y*^{obs}). One-sided posterior predictive *P*-values (i.e., *Pr*[*T*(*y*^{rep}) ≥ *T*(*y*^{obs})] or *Pr*[*T*(*y*^{rep}) ≤ *T*(*y*^{obs})]) were then calculated as a summary statistic of the lack of fit between replicated and observed data.
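Given the replicate statistics, the one-sided posterior predictive *P*-value reduces to the fraction of replicates at or beyond the observed value. A sketch (the function name and example values are illustrative):

```python
import numpy as np

def posterior_predictive_pvalue(t_rep, t_obs, direction="greater"):
    """One-sided posterior predictive P-value: the proportion of replicated
    statistics T(y_rep) at least as extreme as the observed T(y_obs)."""
    t_rep = np.asarray(t_rep)
    if direction == "greater":
        return (t_rep >= t_obs).mean()   # Pr[T(y_rep) >= T(y_obs)]
    return (t_rep <= t_obs).mean()       # Pr[T(y_rep) <= T(y_obs)]

# Example with fabricated replicate values of some statistic T:
rng = np.random.default_rng(3)
t_rep = rng.normal(2.0, 0.3, size=10_000)
# An observed value far in the tail gives a P-value near 0 (lack of fit)
print(posterior_predictive_pvalue(t_rep, t_obs=2.9))
```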

We chose to derive and compare data features directly related to our question of interest. Notably, it has been emphasized in the recent literature that the existence of underlying individual heterogeneity should not be claimed if the observed level of variation in individual performance (e.g., lifetime reproductive success or temporal persistence in a given reproductive state) is not larger than that simply predicted by random chance alone (Tuljapurkar et al. 2009). We therefore decided to explicitly compare the competing models in predicting the observed level of realized interindividual variation in (i) reproductive success and (ii) measures of temporal persistence in the experienced breeder state (E). As a measure of individual reproductive success, we used the observed reproductive output (RepOutput), that is, the number of pups produced by an individual within its MinLifeWin. The measures of individual persistence in the breeder state were defined as follows: (i) the longest time of persistence in the reproductive state (PersistRep), defined as the maximum number of consecutive years an individual remained in state E without skipping a year of reproduction (i.e., length of the longest series of E's for each individual); and (ii) the number of consecutive reproductive events (ConsecRep), defined as the total number of 2-year sequences “EE” in an individual's history. These measures of reproductive performance were calculated for each individual and the level of variability across individuals was investigated through two statistics of interest: (i) the standard deviation (SD) and (ii) the maximum value (Max). To summarize, we thus investigated the SD and Max of the distribution of these three variables (RepOutput, PersistRep, and ConsecRep) over all individuals for each posterior replicate of simulated data (i.e., 10,000 SD's and Max's).
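The three individual-level measures defined above can be computed directly from each observed or simulated state sequence, after which the SD and Max are taken across individuals (a sketch; function and key names are ours):

```python
import numpy as np

def individual_measures(states):
    """RepOutput, PersistRep, and ConsecRep for one reproductive history
    over its MinLifeWin, given as a sequence of states, e.g. 'FESEE'."""
    s = ''.join(states)
    return {
        # RepOutput: pups produced = breeding years (first-time F or experienced E)
        'RepOutput': s.count('F') + s.count('E'),
        # PersistRep: longest unbroken run of consecutive E years
        'PersistRep': max(len(run) for run in s.replace('F', 'S').split('S')),
        # ConsecRep: number of (overlapping) 2-year 'EE' sequences
        'ConsecRep': sum(a == 'E' == b for a, b in zip(s, s[1:])),
    }

# Across-individual variability of one measure: SD and Max over individuals
histories = ['FESEE', 'FEEEE', 'FSSES', 'FEESE']
rep_out = [individual_measures(h)['RepOutput'] for h in histories]
print(np.std(rep_out, ddof=1), max(rep_out))
```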

Within each individual's reproductive history, the number of stochastic state transitions is a direct function of the MinLifeWin of this particular individual. Therefore, the level of discrepancy between simulated and observed data is better informed by individuals with long MinLifeWin. To better expose potential lack of fit of each model, the comparison between *T*(*y*^{rep}) and *T*(*y*^{obs}) was thus based on the 645 individuals (of a total of 954) having a MinLifeWin of at least 5 years, which we considered a reasonable compromise between sample size and amount of informative data from each individual. Nevertheless, we note that comparisons from samples defined by a different MinLifeWin threshold (e.g., including all individuals) gave similar results and led to the same conclusions.

#### Simulations of expected reproductive output