marked: an R package for maximum likelihood and Markov Chain Monte Carlo analysis of capture–recapture data




  1. We describe an open-source R package, marked, for analysis of mark–recapture data to estimate survival and animal abundance.
  2. Currently, marked is capable of fitting Cormack–Jolly–Seber (CJS) and Jolly–Seber models with maximum likelihood estimation (MLE) and CJS models with Bayesian Markov Chain Monte Carlo methods. The CJS models can be fitted with MLE using optimization code in R or with Automatic Differentiation Model Builder. The latter allows incorporation of random effects.
  3. Some package features include: (i) individual-specific time intervals between sampling occasions, (ii) generation of optimization starting values from generalized linear model approximations and (iii) prediction of demographic parameters associated with unique combinations of individual and time-specific covariates.
  4. We demonstrate marked with a commonly analysed European dipper (Cinclus cinclus) data set.
  5. The package will be most useful to ecologists with large mark–recapture data sets and many individual covariates.


A diverse array of software programs is currently available to analyse mark–recapture data. The most comprehensive software package is arguably program MARK (White & Burnham 1999), which uses FORTRAN code to estimate demographic parameters where sources of variation in those parameters (e.g. survival, detection probability) are manually specified within a graphical user interface (GUI). RMark (Laake 2013) is a package for R (R Core Development Team 2012) that constructs models for MARK with user-specified formulas to replace GUI-based model creation.

A number of stand-alone R packages have been developed to analyse capture–recapture data, usually with a narrower focus than MARK (e.g. Rcapture, Baillargeon & Rivest 2007; mra, McDonald et al. 2005; secr, Borchers & Efford 2008; btspas, Schwarz et al. 2009; SPACECAP, Gopalaswamy et al. 2012; BaSTA, Colchero, Jones & Rebke 2012). Each package is designed for a unique niche or model structure. We believe these alternative packages are useful because they expand the analyst's toolbox, and the code is open source which enables the user to understand fully what the software is doing.

Here, we describe marked, a free open-source mark–recapture package that runs in the R environment (R Core Development Team 2012). The original impetus for the package was to improve on the execution times of RMark/MARK for fitting of Cormack–Jolly–Seber (CJS) models to data sets with thousands of animals and many time-varying individual (animal-specific) covariates by making the model and data structures more efficient. Subsequently, we also implemented the CJS model using Automatic Differentiation Model Builder (ADMB; Fournier et al. 2012) and incorporated individual heterogeneity via random effects with admbre (Skaug & Fournier 2006). ADMB provides a flexible framework for fitting models of capture–recapture data and has been recently used by Ford, Bravington & Robbins (2012) to incorporate random effects in multistate models. Instructions for interfacing marked with ADMB are provided in the marked package help for function crm. We also added a Bayesian Markov Chain Monte Carlo (MCMC) implementation of the CJS model based on the approach used by Albert & Chib (1993) for analysing binary data with a probit regression model. In addition, we implemented the Jolly–Seber (JS) model with the Schwarz & Arnason (1996) Population Analysis (POPAN) structure by extending the hierarchical approach to likelihood construction of Pledger, Pollock & Norris (2003) to the entry of animals into the population.

Below, we provide a brief background on the models currently implemented in marked and depict the work flow with regard to data formatting, processing and model fitting. We illustrate use of marked with the European dipper (Cinclus cinclus) mark–recapture data analysed in Lebreton et al. (1992). Further information, including help files, example data and analysis, and a vignette with more technical explanation of the statistical methods, code structure and package usage, can be obtained by downloading the marked package from CRAN.



Capture–recapture data are typically represented as an encounter history. For CJS and JS models, the encounter history is a sequence of zeros and ones, where a 1 codes for an encounter and a 0 means the animal was not observed. Each position in the capture history represents a sampling occasion that occurs at a sequence of times $t_1, \ldots, t_m$ for m occasions. For example, a sequence of 101 means the animal was initially encountered (or released) on the first occasion at $t_1$, not encountered on the second occasion at $t_2$ and encountered on the third and last occasion at $t_3$.
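As a minimal sketch of this representation, the base R snippet below splits a capture-history string into a 0/1 vector and finds the first and last occasions with an encounter. The helper name split_history is hypothetical and is not part of marked.

```r
# Hypothetical helper: split a capture-history string into a 0/1 integer vector
split_history <- function(ch) as.integer(strsplit(ch, "")[[1]])

y <- split_history("101")
first_seen <- min(which(y == 1))  # occasion of initial encounter (release)
last_seen  <- max(which(y == 1))  # last occasion with an encounter
```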

The probability of observing a particular capture history depends on the parameters in the model. For CJS models, the probability of each history is conditioned on initial release, and apparent survival is the parameter of interest. Specifically, the parameters are as follows: (i) $\phi_{ij}$, the probability that animal i survives the interval between occasions j and j + 1 ($j = 1, \ldots, m - 1$) and does not emigrate, and (ii) $p_{ij}$, the probability that animal i was encountered on occasion j. For example, the probability of the history 110 can be given as $\phi_{i1} p_{i2} [(1 - \phi_{i2}) + \phi_{i2}(1 - p_{i3})]$. Notice that the last portion is a sum representing two possible times of death: (i) death during the second interval and (ii) death at some time after the third occasion when the sampling ended. In marked, we adopt a hierarchical perspective (cf. Pledger, Pollock & Norris 2003) and view the time of death as a latent variable. For maximum likelihood implementation, we simply integrate (sum) over possible times of death to construct the probability of each history (we use MCMC to integrate over possible death times in Bayesian implementations). This formulation is often more efficient than other approaches to likelihood construction for mark–recapture data. Additional details are described in the vignette with the marked package.
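The idea of summing over possible times of death can be sketched in a few lines of base R. This is an illustration only, not the code used inside marked; the function name cjs_history_prob and the parameter values are hypothetical.

```r
# Probability of one capture history under CJS, conditional on first release.
# y:   0/1 vector over m occasions
# phi: survival probabilities for intervals 1..m-1
# p:   capture probabilities for occasions 1..m (entries up to the release occasion are unused)
cjs_history_prob <- function(y, phi, p) {
  m <- length(y)
  f <- min(which(y == 1))   # release occasion
  l <- max(which(y == 1))   # last occasion seen
  total <- 0
  # d is the last occasion on which the animal is alive (the latent variable)
  for (d in l:m) {
    pr <- 1
    if (d > f) {
      for (j in (f + 1):d) {
        pr <- pr * phi[j - 1] * (if (y[j] == 1) p[j] else 1 - p[j])
      }
    }
    if (d < m) pr <- pr * (1 - phi[d])  # death in the interval d to d + 1
    total <- total + pr
  }
  total
}

phi <- c(0.6, 0.7)          # hypothetical values
p   <- c(NA, 0.9, 0.8)      # p[1] is never used for an animal released on occasion 1
pr_110 <- cjs_history_prob(c(1, 1, 0), phi, p)
# Closed form for history 110: phi1 * p2 * ((1 - phi2) + phi2 * (1 - p3))
closed_form <- 0.6 * 0.9 * ((1 - 0.7) + 0.7 * (1 - 0.8))
```

Because the histories are conditioned on release, the probabilities of all histories that start with a 1 (100, 101, 110 and 111 for three occasions) sum to one, which is a convenient check on any implementation.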

Survival probabilities are associated with a specific time period (usually the midpoint of one encounter occasion to the midpoint of the next encounter occasion). However, in some situations, the interval of time between encounter occasions varies. To compare survival estimates on the same scale or to test the parsimony of reduced models such as those with time-invariant survival, one needs to adjust for unequal time intervals (e.g. using $\phi_{ij}^{\Delta_{ij}}$, where $\Delta_{ij}$ is the length of the interval, such that $\phi_{ij}$ represents a standardized survival probability). Time intervals can vary not only by encounter occasion (which is permitted in MARK), but also by individual (which is not allowed in MARK). For instance, in modelling data of California sea lions (Zalophus californianus; Melin et al. 2011), we were confronted with a situation in which young-of-the-year were marked at ages from 3·5 to 7 months depending on the cohort, leading to survival intervals that varied from 5 to 8·5 months. In marked, we allow for time intervals to vary for each occasion and animal to handle these types of situations.
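A small numerical sketch of this adjustment, assuming a hypothetical standardized (annual) survival of 0.8 and interval lengths expressed in years, is:

```r
phi_std <- 0.8                # hypothetical survival per unit time (here, per year)
delta <- c(5, 8.5) / 12       # interval lengths in years for two animals
# survival over each animal's own interval
phi_interval <- phi_std ^ delta
```

The animal with the shorter interval has the higher interval-specific survival, even though both share the same standardized survival probability.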

Survival and encounter probabilities can vary by time (interval/occasion) and can differ for individuals based on their attributes (e.g. cohort, sex, age and weight). In MARK, time and group-specific effects are handled differently from covariates measured on individual animals. This artificial difference can confuse the user as noted by McDonald et al. (2005). In addition, the way in which individual covariates are handled can lead to slow execution times, particularly with large numbers of time-varying individual covariates. In marked, we use an alternative approach that is similar to McDonald et al. (2005), where a separate design matrix is stored for each parameter, unique encounter history and encounter occasion. These design matrices are easily created using regression-like formulae in R.

Data Format

The data format is identical to the structure used in RMark (Laake 2013). The data contain a record for each of the n capture histories. The capture history is contained in a character string named ch with a character for each of the m occasions. An optional field named freq can be included if the capture history represents more than a single animal. The data can contain any number of numeric or categorical (factor) covariates to be used in the models. For example, the dipper data contain a factor variable named sex with values M and F.

A special naming structure is used for time-varying covariates. The covariate name is the base portion of the name, which is followed by a suffix with the time value and not the occasion value. For example, with a CJS model, if we wanted to use age as a survival covariate and time was specified with values 1990–1992 for occasions 1–3, then we would need two covariates named AGE1990 and AGE1991 for the two intervals. The values of the covariates would be the ages of the animals at the beginning of the survival time interval (e.g. age at time 0 for survival from time 0 to time 1). An age covariate for p would be named AGE1991 and AGE1992, which are the ages of the animals at the recapture times 1 and 2. Example data are shown in Table 1 with annual occasions and the first occasion specified as time 1990. If one wanted to model trap dependence (td) in capture probability (Pradel 1993), the covariates would be named td1991 and td1992. The values of td1991 and td1992 used as covariates for recapture probability for the second and third occasions (times 1991 and 1992) would be the capture history values at the first and second occasions (times 1990 and 1991).
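The naming convention for the trap-dependence covariate can be sketched in base R as follows. The three capture histories are hypothetical and the code only illustrates how the column names and values line up; it is not taken from the package.

```r
ch <- c("101", "011", "110")                 # hypothetical capture histories
begin_time <- 1990
y <- do.call(rbind, lapply(strsplit(ch, ""), as.integer))
times <- begin_time + seq_len(ncol(y)) - 1   # 1990, 1991, 1992

# td at a recapture time is the capture indicator at the previous occasion
td <- y[, -ncol(y), drop = FALSE]
colnames(td) <- paste0("td", times[-1])      # suffix is the time value, not the occasion
dat <- data.frame(ch = ch, td, stringsAsFactors = FALSE)
```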

Table 1. Small example of capture–recapture data with three capture histories and three occasions, with the beginning time set to 1990 for the first occasion
ch    Sex    AGE1990    AGE1991    AGE1992
Note: The capture history (ch) is a string of 1 (caught) or 0 (not caught) values for each occasion. AGE is a time-varying individual covariate. Age is more easily handled by specifying the initial.age field but is used here as an example of a user-specified time-varying variable. All capital letters were used for AGE so that it would not conflict with the default Age variable that is created in the design data.


Data Processing

In marked, two steps are required to process and create the necessary data structures prior to fitting a model. The first step is performed with the function process.data, which has arguments for the capture–recapture data frame, the type of model for the data (currently CJS, JS or probitCJS) and several other optional arguments that set data attributes such as beginning time and time intervals. The result of process.data is a list containing the data, model and the various attributes of the data. By default, identical data records (i.e. those with the same ch and combination of covariates) are accumulated and represented by a single record with the added field freq, which is the count of accumulated records. The accumulation is not performed in some circumstances (e.g. probitCJS).
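The accumulation of identical records can be illustrated with base R. This sketch uses aggregate on a hypothetical data frame and only mimics the idea; the implementation inside marked may differ.

```r
# Hypothetical data: four animals, two of which share ch and covariate values
dat <- data.frame(ch  = c("101", "101", "110", "101"),
                  sex = c("F", "F", "M", "M"),
                  stringsAsFactors = FALSE)
dat$freq <- 1
# one record per unique combination of ch and covariates, with freq as the count
acc <- aggregate(freq ~ ch + sex, data = dat, FUN = sum)
```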

The next step, with the function make.design.data, uses the result of process.data to create a list containing design data for each parameter in the model. These design data are used to create design matrices from formulas for each parameter. This step automates the fairly burdensome process used by the mra package (McDonald et al. 2005) in which every covariate must be time-varying and requires the construction of an n(m − 1) matrix for each numeric covariate and a matrix for each level of a factor variable (e.g. time). Instead, we allow covariates to be either static or time-varying, and we use the model.matrix function to create the design matrix.

The design data for each parameter contain a record for each animal, for each modelled occasion. The number of records depends on the model and the parameter. For example, with capture probability, there are n(m − 1) records for CJS models and nm records for JS models. Some of the fields are created automatically, such as time, cohort (first capture occasion) and age. If initial.age (age at first capture) is specified in the data, then age is the animal's age at each occasion; otherwise, age is 'time since initial marking' because the default value of initial.age is 0. Ageing is based on the time intervals between occasions. Values of variables related to time (time and age) will differ for interval parameters (e.g. survival) and occasion parameters (e.g. capture). Interval parameters use the values of time and age based on the time at the beginning of the interval, and occasion parameters use the value of time and age for the occasion. Both factor and numeric variables are created for time, age and cohort. A capital first letter (e.g. Time) is used for the numeric variable.
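To make the animal-by-occasion layout concrete, the following base R sketch builds survival design data for two hypothetical animals and three occasions. The column names and release occasions are assumptions for illustration and do not reproduce the exact fields created by the package.

```r
n <- 2; m <- 3; begin_time <- 1990
first <- c(1, 2)                            # hypothetical first-capture occasions

# one record per animal and survival interval: n * (m - 1) rows
dd <- expand.grid(occ = 1:(m - 1), id = 1:n)
dd$time   <- begin_time + dd$occ - 1        # time at the beginning of the interval
dd$cohort <- begin_time + first[dd$id] - 1  # time of first capture
dd$Age    <- dd$time - dd$cohort            # time since marking (initial age of 0)
dd$Time   <- dd$time - begin_time           # numeric time with a 0 origin
```

In this sketch the row for animal 2 in the first interval precedes its release, so it would not contribute to a CJS likelihood.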

Additional variables from the capture–recapture data to be included in the design data are specified as either static or time.varying separately in a list for each parameter in the model. Static variables have a single value per capture history and that value is copied for each occasion. Variables that are listed as time.varying must conform to the naming convention described above. Table 2 shows the design data for a CJS model based on the example data (Table 1).

Table 2. Design data for ϕ and p for a Cormack–Jolly–Seber model using example data from Table 1
Parameter    ID    time    cohort    age    Sex    AGE
ϕ            1     1990    1990      0      F      0
p            1     1991    1990      1      F      1
Note: The numeric variables Time, Cohort and Age (not shown here) have a 0 origin (e.g. time 1990 is Time 0). The field ID is the record number for the capture history. The variable Age is the number of years since first caught, whereas AGE is the actual age of the animal.

Model Fitting

Currently, there are three types of models implemented in the marked package: CJS, JS and probitCJS. The first two are based on maximum likelihood estimation (MLE), and the third is an MCMC implementation. A model is fitted to the data with the function crm (capture–recapture model). Its first two arguments are the processed data list and design data list constructed in the data processing steps. Many of the crm arguments are the same as the arguments for the data processing steps because the data processing steps can be skipped and performed in crm; however, it is more efficient to do them separately so as to avoid repeating those steps for each fitted model. The primary argument for crm is model.parameters, a list of formulas for the parameters and an optional matrix of fixed real parameter values. The formula for a parameter is applied to the design data in model.matrix to produce a design matrix with the default of treatment contrasts for factor variables. Tables 3 and 4 show the resulting design matrices from the example data with the formulas specified as

$$\phi = {\sim}\,\text{sex} + \text{AGE}, \qquad p = {\sim}\,\text{time} + \text{AGE}.$$
Table 3. Design matrix $X_\phi$ for $\phi_{ij}$ with the example data and formula ~sex + AGE
i    j    Sex    AGE    Intercept    Sex:M    AGE
Note: The values of i (animal) and j (occasion) and covariate values are shown on the left (columns 1–4) and the design matrix on the right (columns 5–7). For the Cormack–Jolly–Seber model, $\phi_{i1}$ for an animal first released on occasion 2 is constructed but is not used in the likelihood calculation because the animal is not released until occasion 2.
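The way a formula and design data produce a design matrix with treatment contrasts can be reproduced directly with model.matrix. The small data frame below is hypothetical and stands in for the design data of a single parameter.

```r
# Hypothetical design data: one row per animal-occasion
dd <- data.frame(sex = factor(c("F", "F", "M", "M")),
                 AGE = c(0, 1, 2, 3))
X <- model.matrix(~ sex + AGE, data = dd)
colnames(X)   # "(Intercept)" "sexM" "AGE"
```

With treatment contrasts, the first factor level (F) is absorbed into the intercept and the sexM column is an indicator for males, giving three columns and therefore three β parameters.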


The model would have three parameters (three design matrix columns) for both $\phi$ and $p$ to construct each of the probabilities. Following the notation of MARK (White & Burnham 1999), X is the design matrix, β is the vector of parameters, and the real parameters (e.g. ϕ and p) are computed using the inverse link function. For probabilities, the inverse logit link is used for MLE, expressed for ϕ as:

$$\phi = \frac{1}{1 + \exp(-X\beta)}$$
Table 4. Design matrix $X_p$ for $p_{ij}$ with the example data and formula ~time + AGE
i    j    Time    AGE    Intercept    Time:1992    AGE
Note: The values of i (animal) and j (occasion) and covariate values are shown on the left (columns 1–4) and the design matrix on the right (columns 5–7). For the Cormack–Jolly–Seber model, $p_{i2}$ for an animal first released on occasion 2 is constructed but is not used in the likelihood calculation because the animal is not released until occasion 2.


A probit link is used for probabilities with MCMC. For JS, a log link is used to constrain the number of animals never captured to be non-negative. The estimate of super-population size is the total number of individual animals caught plus the estimated number never captured. Also, for JS, a multinomial logit link is used for the probabilities of entry into the population.
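The two inverse links for probabilities can be compared with base R functions. The design matrix and β values below are hypothetical; the point is only that the same linear predictor maps to different probabilities under the two links.

```r
X <- cbind(1, c(0, 1))        # hypothetical design matrix (intercept and one covariate)
beta <- c(0.5, -1)            # hypothetical parameter vector
eta <- drop(X %*% beta)       # linear predictor X beta

phi_logit  <- plogis(eta)     # inverse logit: 1 / (1 + exp(-eta))
phi_probit <- pnorm(eta)      # inverse probit: standard normal CDF
```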

Maximum likelihood estimates are obtained numerically by finding the values of the β parameters that minimize the negative log-likelihood (equivalently, maximize the log-likelihood) using one or more optimization methods provided through the R package optimx (Nash & Varadhan 2011). If the argument use.admb is set to TRUE, ADMB (Fournier et al. 2012) will be used instead of optimx. ADMB will often be a faster and more reliable method because it uses automatic differentiation rather than numerical derivatives to find the minimum. Use of ADMB also enables inclusion of individual heterogeneity in survival and capture probability (Gimenez & Choquet 2010) by setting the argument re to TRUE. Incorporation of random time variation could be accomplished by modifying the ADMB TPL files provided with the package.
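As a rough sketch of numerical MLE, the code below minimizes a CJS negative log-likelihood with constant ϕ and p on the logit scale. It uses the base R function optim as a stand-in for optimx or ADMB, a standard recursion for the probability of not being seen again, and six hypothetical capture histories, so it illustrates the principle rather than the package's code.

```r
# Hypothetical capture histories (rows are animals, columns are occasions)
y <- rbind(c(1, 1, 0), c(1, 0, 1), c(1, 1, 1),
           c(1, 0, 0), c(0, 1, 1), c(0, 1, 0))

# Negative log-likelihood for constant phi and p; beta holds logit-scale values
nll <- function(beta) {
  phi <- plogis(beta[1]); p <- plogis(beta[2])
  m <- ncol(y)
  ll <- 0
  for (i in seq_len(nrow(y))) {
    f <- min(which(y[i, ] == 1)); l <- max(which(y[i, ] == 1))
    if (f == m) next                      # released on the last occasion: no information
    # observed part: survive and be seen or missed on occasions f + 1 to l
    seen <- y[i, seq_len(m) > f & seq_len(m) <= l]
    pr_obs <- prod(phi * ifelse(seen == 1, p, 1 - p))
    # chi: probability of never being seen after occasion l
    chi <- 1
    for (j in seq_len(m - l)) chi <- (1 - phi) + phi * (1 - p) * chi
    ll <- ll + log(pr_obs) + log(chi)
  }
  -ll
}

fit <- optim(c(0, 0), nll)                # Nelder-Mead, started at beta = 0
est <- plogis(fit$par)                    # back-transform to phi and p
```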

Initial values for MLE and MCMC methods can either be provided as a constant (e.g. 0) or as a vector from the results of a previously fitted similar model. If initial values are not specified, then they are computed using generalized linear models (GLM) that provide approximations. Using the method described in Manly & Parr (1968), we compute initial estimates for capture probability p for occasions 2 to m − 1 with a binomial GLM and the formula for p that is fitted to a sequence of Bernoulli random variables that are a subset of the capture history values $y_{ij}$, i = 1, …, n and $j = f_i + 1, \ldots, l_i - 1$, where $f_i$ and $l_i$ are the first and last occasions the ith animal was seen. A similar but more ad hoc idea is used for survival, ϕ. We know that an animal is alive between the first ($f_i$) and last ($l_i$) occasions it was seen. We assume that each animal dies in the interval following the last time it was seen ($l_i + 1 \le m$). We use a binomial GLM with the formula for ϕ fitted to a sequence of Bernoulli random variables $z_{ij}$, i = 1, …, n and $j = f_i + 1, \ldots, l_i + 1 \le m$, where $z_{ij}$ = 1 for $j = f_i + 1, \ldots, l_i$ and $z_{ij}$ = 0 for $j = l_i + 1 \le m$. The initial value is set to 0 for any β that is not specified or estimated with the GLM approximations above.
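The GLM approximation for capture probability can be sketched with an intercept-only model in base R. The capture histories are hypothetical, and the code is a simplified illustration of the idea rather than the package's initialization routine.

```r
# Hypothetical capture histories for five animals and five occasions
y <- rbind(c(1, 0, 1, 1, 0), c(1, 1, 0, 1, 0), c(0, 1, 1, 0, 1),
           c(1, 1, 1, 0, 0), c(0, 1, 0, 1, 1))

# Keep only occasions strictly between first and last sightings,
# when the animal is known to be alive
rows <- lapply(seq_len(nrow(y)), function(i) {
  f <- min(which(y[i, ] == 1)); l <- max(which(y[i, ] == 1))
  if (l - f < 2) return(NULL)             # no intermediate occasions
  occ <- (f + 1):(l - 1)
  data.frame(id = i, occ = occ, seen = y[i, occ])
})
pdat <- do.call(rbind, rows)

# Binomial GLM with a constant formula for p; the coefficient is on the logit scale
fit_p <- glm(seen ~ 1, family = binomial, data = pdat)
beta_init <- coef(fit_p)
p_init <- plogis(beta_init)
```

With a constant formula, the fitted probability is simply the proportion of intermediate occasions on which animals were seen; richer formulas would use the same Bernoulli data with covariates from the design data.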

Dipper data example

We use mark–recapture data for the European dipper (Lebreton et al. 1992) provided with the marked package to show some examples of its use. The package vignette contains a more detailed explanation and the code used to create the examples. To demonstrate the use of static and time-varying covariates, we added an imaginary static covariate weight (set to a random value between 1 and 10) and a time-varying covariate Flood to model survival ϕ. Flood is the same for all dippers but varies by time with a value of 0 for times 1981 and 1984–1986 and a value of 1 for times 1982 and 1983. The Flood covariate could have also been added into the design data (after it was created) rather than in the data frame because it is constant for each animal. For p, we added a time-varying individual covariate td, which is 1 if the dipper was caught on the previous occasion and 0 if not.

The following code processes the data for the CJS model and makes the design data. The covariate weight is specified as static and Flood is time-varying for Phi (ϕ). For p, sex is a static covariate and td is a time-varying covariate.

    > # Process data
    > dipper.proc=process.data(dipper,model="cjs",begin.time=1981)
    > # Create design data with static and time-varying covariates
    > design.Phi=list(static=c("weight"),time.varying=c("Flood"))
    > design.p=list(static=c("sex"),time.varying=c("td"))
    > design.parameters=list(Phi=design.Phi,p=design.p)
    > ddl=make.design.data(dipper.proc,parameters=design.parameters)

Next, we define the models for ϕ and p that we want to fit and call crm using MLE with ADMB (model_admb) and MCMC (model_mcmc).

    > # Formulas for survival (Phi) and capture probability (p)
    > Phi.sfw=list(formula=~Flood+weight)
    > p.ast=list(formula=~sex+td)
    > # MLE fit with ADMB, using the processed data and design data
    > model_admb=crm(dipper.proc,ddl,hessian=TRUE,
    + model.parameters=list(Phi=Phi.sfw,p=p.ast),use.admb=TRUE)
    > # MCMC fit; data processing is done inside crm
    > model_mcmc=crm(dipper,model="probitCJS",begin.time=1981,
    + design.parameters=design.parameters,
    + model.parameters=list(Phi=Phi.sfw,p=p.ast),
    + burnin=1000,iter=10000)

[Correction added on 5 September 2013 after first online publication: quote marks added around ‘probitCJS’.]

The β parameter estimates (Table 5) will not match because the MLE model uses a logit link and the MCMC model uses a probit link. However, we can compare estimates of ϕ and p, which are generated automatically for each unique combination of covariates used in the model for the parameter (Fig. 1, Table 6). In addition, a predict function can be used to obtain predictions for any set of covariates.

Figure 1. Survival probability estimates for each value of weight for flood (F) and non-flood (N) years for the example model with dipper data. Points are Markov Chain Monte Carlo mode estimates and lines are the equivalent maximum likelihood estimates.

Table 5. Parameter estimates for maximum likelihood estimation (MLE) and Markov Chain Monte Carlo (MCMC) fitted models for dipper data with survival model Flood + weight and capture probability model sex + td
Note: For the MCMC model, the estimate is the posterior mode, SE is the standard deviation of the posterior, and the lower (LCL) and upper (UCL) limits are the bounds of the 95% highest posterior density interval.


A sequence of models can be fitted with the function crm.wrapper which is similar to mark.wrapper in RMark. It fits each combination of the parameter models and provides a model selection table (Table 7). In this example, we also show how a polynomial spline can be fitted to a numeric time covariate with the splines package in R (Fig. 2).

Figure 2. Survival probability estimates and confidence intervals for the dipper data using a polynomial spline for variation in survival over time.

Table 6. Capture probability estimates for maximum likelihood estimation (MLE) and Markov Chain Monte Carlo (MCMC) fitted models for dipper data with survival model Flood + weight and capture probability model sex + td
Model    Sex    td    Estimate    SE    LCL    UCL
Note: For the MCMC model, the estimate is the posterior mode, SE is the standard deviation of the posterior, and the lower (LCL) and upper (UCL) limits are the bounds of the 95% highest posterior density interval.

    > library(splines)
    > do.example=function()
    + {
    + # Two candidate models each for Phi and p
    + Phi.1=list(formula=~1)
    + Phi.2=list(formula=~bs(Time))
    + p.1=list(formula=~1)
    + p.2=list(formula=~time)
    + # All combinations of the Phi.* and p.* objects defined above
    + crmodel.list=create.model.list(c("Phi","p"))
    + return(crm.wrapper(crmodel.list,data=dipper,begin.time=1981,
    + hessian=TRUE,use.admb=TRUE,external=FALSE))
    + }
    > example=do.example()
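The number of parameters implied by each formula can be checked without fitting any model by counting design matrix columns. The sketch below assumes seven sampling occasions beginning in 1981 (six survival intervals and six recapture occasions) and uses hypothetical stand-ins for the design data; the resulting counts agree with the #par column of Table 7.

```r
library(splines)

Phi_dd <- data.frame(Time = 0:5)                    # six survival intervals, 0 origin
p_dd   <- data.frame(time = factor(1982:1987))      # six recapture occasions

n_phi_bs <- ncol(model.matrix(~ bs(Time), Phi_dd))  # intercept plus three spline basis columns
n_p_time <- ncol(model.matrix(~ time, p_dd))        # one parameter per recapture occasion

# parameter counts for the four model combinations
npar <- c(bs_const   = n_phi_bs + 1,
          const      = 1 + 1,
          bs_time    = n_phi_bs + n_p_time,
          const_time = 1 + n_p_time)
```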

Table 7. Model selection summary table for four example models fitted to the dipper data
Model                      #par    −2LnL     AIC       Delta AIC    Model weight
Phi(~bs(Time))p(~1)        5       660·69    670·69    0·00         0·51
Phi(~1)p(~1)               2       666·84    670·84    0·15         0·47
Phi(~bs(Time))p(~time)     10      658·19    678·19    7·50         0·01
Phi(~1)p(~time)            7       664·48    678·48    7·79         0·01
Note: The formula ~1 is a constant model, ~bs(Time) is a polynomial spline over time, and ~time has separate parameters for each sampling occasion.
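The AIC, Delta AIC and model weight columns of Table 7 follow from the −2LnL values and parameter counts by the usual formulas, as this short base R check shows (the variable names are arbitrary).

```r
neg2lnl <- c(660.69, 666.84, 658.19, 664.48)   # -2LnL values from Table 7
npar    <- c(5, 2, 10, 7)                      # number of parameters from Table 7

aic   <- neg2lnl + 2 * npar                    # AIC = -2LnL + 2 * (number of parameters)
delta <- aic - min(aic)                        # difference from the best model
w     <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights
round(w, 2)
```

The two top models differ by only 0·15 AIC units, so the data give little basis for preferring the spline model for survival over the constant model.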


At present, the marked package will likely be most useful to analysts with data for thousands of animals with long capture histories and many individual covariates. Beyond faster run times for large analyses (see marked package vignette for efficiency comparison), the marked package has a few other advantages over MARK including:

  1. Individual-specific time intervals between occasions (except for MCMC methods);
  2. Ability to control when the Hessian is calculated;
  3. Automatic parameter initialization that speeds up convergence; and
  4. Easier prediction with individual covariates using the animal-occasion data structure.

It is not our goal to replace MARK but to develop an open-source platform to provide new models, particularly those with random effects or hierarchical extensions. We are currently developing a Bayesian MCMC and MLE version of the multi-state CJS model (Brownie et al. 1993; Dupuis 1995) and have plans to develop a model that allows estimates of survival while accounting for tag loss.


We thank Alexey Altuhkov and Eli Gurarie for suggesting use of the splines package. The findings and conclusions in the paper are those of the authors and do not necessarily represent the views of the National Marine Fisheries Service, NOAA. Reference to trade names does not imply endorsement by the National Marine Fisheries Service, NOAA.