Compound Poisson-gamma vs. delta-gamma to handle zero-inflated continuous data under a variable sampling volume


Summary

  1. Ecological data such as biomasses often contain a high proportion of zeros together with possibly skewed positive values. The delta-gamma (DG) approach, which models presence–absence and positive biomass separately, is commonly used in ecology. A less well-known alternative is the compound Poisson-gamma (CPG) approach, which essentially mimics the process of capturing clusters of biomass during a sampling event.
  2. Regardless of the approach, the effort involved in obtaining a sample (henceforth called the sampling volume, although it could also be a swept area, a sampling duration, etc.), which can vary considerably between samples, needs to be taken into account when modelling the resulting sample biomass. This is achieved empirically for the DG approach (using a generalized linear model with sampling volume as a covariate), and theoretically for the CPG approach (by scaling a parameter of the model). In this study, the consequences of this disparity between approaches were explored first using theoretical arguments, then using simulations and finally by applying the approaches to catch data from a commercial groundfish trawl fishery.
  3. The simulation results show that the DG approach can lead to poor estimates when sampling departs from standard idealized assumptions. In contrast, the CPG approach is much more robust to variable sampling conditions, confirming theoretical predictions. These results were confirmed by the case study, in which model performance was weaker for the DG.
  4. Given the results, care must be taken when choosing an approach for dealing with zero-inflated continuous data. The DG approach, which is easily implemented using standard statistical software, works well when the sampling volume variability is small. However, better results were obtained with the CPG model when dealing with variable sampling volumes.

Introduction

Ecological data for species population densities are often characterized by a large proportion of zero values accompanied by a skewed distribution of remaining values, including occasional extremes (Pennington 1996; Martin et al. 2005). Ignoring these features could lead to incorrect estimates of quantities of interest (e.g. mean biomass, probability of presence) and their associated uncertainty, and possibly to incorrect conclusions (Martin et al. 2005). Zero values in species population densities can originate from two general sources, with consequences for the appropriate analytical approach used to make inferences [see review in Martin et al. (2005)]. True zeros can occur as a direct result of the effect under study (e.g. suitability of a given habitat) or as a stochastic result of sampling from areas of low density. On the other hand, false zeros can occur as a result of detection limits or observer effects. Our interest here lies in true zeros.

Standard continuous probability distributions such as the normal, gamma or log-normal are often inappropriate for the analysis of zero-inflated biomass data, even with ad hoc assumptions such as the addition of constants to create a mass at zero. A better approach is to use so-called two-part, hurdle or delta models, which assume that zero and nonzero data arise from separate processes (Stefansson 1996; Punt et al. 2000; Ortiz & Arocha 2004; Maunder & Punt 2004). This method does not require the addition of a constant, which can introduce a bias in the data. It is also very flexible, as covariates can be added to the zero and nonzero parts of the model using conventional generalized linear modelling techniques. However, the break between zero and nonzero values presents a particularly unnatural discontinuity in density data, where many zeros are actually stochastic clues of a strong gradient of decreasing biomass quantities. A second approach is the use of a positive distribution that simultaneously incorporates zeros and positive quantities. Jorgensen (1987) proposed the exponential dispersion model with a power variance function. This model, also known as the Tweedie distribution, handles zero-inflated data without treating the zero and nonzero values separately. The Tweedie model and its variants have been applied to fisheries data (Candy 2004; Shono 2008; Foster & Bravington 2012; Lecomte et al. 2013). In this article, we rely on a gamma-marked compound Poisson model, named the compound Poisson-gamma (CPG) model, a member of the Tweedie family. Foster & Bravington (2012) extended it to be more flexible when covariates can affect parameters. They showed that the CPG mean–variance relationship is not necessarily constant, in contrast to the Tweedie distribution (Foster & Bravington 2012). A parsimonious variant of this distribution, using exponential rather than gamma variables, has also been used [e.g. Ancelet et al. (2010)].

In many studies, the effort involved in obtaining a sample (henceforth called the sampling volume, although it could also be a swept area, a sampling duration, etc.) can vary among sampling events. These differences in the sampling volume have to be accounted for in the analysis. In the CPG approach, variable sampling volume is accounted for directly in the model by scaling a parameter, whereas the delta-gamma (DG) approach requires a generalized linear model in which the sampling volume enters as a covariate or an offset (Maunder & Punt 2004). Such different ways of dealing with variable sampling volumes are likely to affect the reliability of estimates of quantities of interest (e.g. mean quantity, probability of presence).

This study evaluates the relative robustness of the DG and CPG approaches for estimating biomasses and presence probabilities under variable sampling volume conditions in three ways. First, the form and analytical properties of the two models are presented and contrasted from a theoretical perspective. Second, simulations are used to evaluate the robustness of the proposed models and to compare their fitting abilities when sampling volumes vary with different variances. Third, the two approaches are applied to catch data from a commercial groundfish trawl fishery. Theory and the analyses of simulated and observed data all indicate that the CPG approach outperforms the DG approach under variable sampling volumes.

Materials and methods

The delta-gamma model

The delta modelling approach is based on the specification of two submodels to represent the biomass (Stefansson 1996). Let X be a binary variable that equals 1 if the species of interest is present and 0 otherwise.

$$X \sim \mathrm{Bernoulli}(\pi) \qquad \text{(eqn 1)}$$

π being the probability of the presence of the species. Conditionally, let Y be a positive sampled quantity of interest (e.g. species density or biomass) after a sampling event:

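$$Y \mid X \sim \begin{cases} \mathrm{Gamma}(\alpha, \beta) & \text{if } X = 1 \\ \delta_0 & \text{if } X = 0 \end{cases} \qquad \text{(eqn 2)}$$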

with shape and rate parameters, (α,β) and math formula the Dirac distribution at zero. This yields the DG model, DG(π,α,β), with other distributional assumptions for strictly positive quantities yielding other models in the delta family, such as the delta log-normal. The expected value for the biomass under the DG model is math formula, and the other main derived quantities (e.g. variance of the biomass, probability of presence) are summarized in Table 1.

Table 1. Quantities of interest (probability of presence, expected positive biomass, expected biomass and variance of biomass) for the DG and the CPG model under a standard sampling volume
Quantity | DG | CPG
Probability of presence | π | math formula
Expected positive biomass | math formula | math formula
Expected biomass | math formula | math formula
Variance of the biomass | math formula | math formula
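As a rough aid to reading Table 1, the sketch below computes the four quantities from the usual closed forms for a Bernoulli-gamma mixture and for a Poisson sum of gamma masses. The function names are illustrative, and the formulas are the standard ones for these distributions, not a transcription of the table's cells.

```python
import math


def dg_quantities(pi, alpha, beta):
    """Derived quantities for DG(pi, alpha, beta), with beta a rate parameter."""
    mean_positive = alpha / beta
    mean = pi * mean_positive
    # Law of total variance: E[Var(Y | X)] + Var(E[Y | X]).
    variance = pi * alpha / beta**2 + pi * (1.0 - pi) * mean_positive**2
    return {"p_presence": pi, "mean_positive": mean_positive,
            "mean": mean, "variance": variance}


def cpg_quantities(lam, a, b):
    """Derived quantities for a CPG with Poisson intensity lam and Gamma(a, b) masses."""
    p_presence = 1.0 - math.exp(-lam)      # at least one patch is captured
    mean = lam * a / b                     # E[N] * E[mass]
    variance = lam * a * (a + 1.0) / b**2  # E[N] * E[mass^2] for a compound Poisson
    return {"p_presence": p_presence, "mean_positive": mean / p_presence,
            "mean": mean, "variance": variance}


if __name__ == "__main__":
    print(dg_quantities(0.5, 2.0, 1.0))
    print(cpg_quantities(1.0, 200.0, 2.0))
```

With λ = 1, a = 200 and b = 2, the CPG forms give a mean of 100 and a probability of presence of about 0·63.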

A useful model property in statistical ecology is additivity with regard to the sampling volume, in which the sum of two independent sampling events follows the same type of distribution as each sampling event. For example, the biomass sampled during two hours follows the same distribution as the sum of two sampling events of one hour each. This property makes it possible to combine data with different sampling volumes in the same model, as their sum follows a distribution in the same family. Unfortunately, the DG model is not additively coherent, as pointed out by Stefansson (1996). As a consequence, it is not clear how the DG parameters vary in time or space when sampling volumes vary among sampling events. In practice, a simple way to deal with a non-constant sampling volume is to pre-standardize the data by dividing the biomass collected by the sampling volume. The problem with this method is that it only standardizes the positive data and leaves the presence–absence part unscaled, ignoring the fact that, as sampling volume increases, the probability of observing zero biomass should decrease when the species is present. A more relevant solution uses generalized linear modelling (e.g. Zuur et al. 2009) with the sampling volume as a covariate in each part of the DG model. The probability of presence is usually modelled with a logistic regression:

display math(eqn 3)

where V is the sampling volume. A log-linear function is used to include the sampling volume in the expected positive biomass, given that the species is present:

display math(eqn 4)

A criticism of the delta approach is the separate modelling of the presence–absence and the strictly positive quantities. Consequently, gradients of biomass including low or null densities of the species are modelled disjointly in practice, which is a rather unnatural representation of the phenomenon being modelled.
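The lack of additivity of the DG model can be illustrated with moments alone. A gamma distribution always has a skewness equal to twice its coefficient of variation. The sketch below, a minimal check with arbitrary illustrative parameter values, computes that ratio for the positive part of the sum of two independent DG(π, α, β) draws. That positive part is a mixture of Gamma(α, β) (one draw positive) and Gamma(2α, β) (both positive), so the ratio moves away from 2 and the sum cannot be a DG variable again.

```python
import math


def gamma_raw_moments(shape, rate):
    """First three raw moments of a Gamma(shape, rate) variable."""
    m1 = shape / rate
    m2 = shape * (shape + 1.0) / rate**2
    m3 = shape * (shape + 1.0) * (shape + 2.0) / rate**3
    return m1, m2, m3


def skewness_over_cv(m1, m2, m3):
    """Skewness divided by coefficient of variation; equals 2 for any gamma."""
    variance = m2 - m1**2
    third_central = m3 - 3.0 * m1 * m2 + 2.0 * m1**3
    skewness = third_central / variance**1.5
    cv = math.sqrt(variance) / m1
    return skewness / cv


def positive_part_of_two_dg(pi, alpha, beta):
    """Raw moments of (Y1 + Y2 | Y1 + Y2 > 0) for two independent DG draws."""
    p_positive = 1.0 - (1.0 - pi) ** 2
    w_one = 2.0 * pi * (1.0 - pi) / p_positive  # exactly one draw is positive
    w_two = pi**2 / p_positive                  # both draws are positive
    one = gamma_raw_moments(alpha, beta)
    two = gamma_raw_moments(2.0 * alpha, beta)
    return tuple(w_one * x + w_two * y for x, y in zip(one, two))


ratio_single = skewness_over_cv(*gamma_raw_moments(2.0, 1.0))
ratio_sum = skewness_over_cv(*positive_part_of_two_dg(0.5, 2.0, 1.0))

if __name__ == "__main__":
    print(ratio_single, ratio_sum)
```

The same check can be repeated for other parameter values; the ratio for the sum equals 2 only in degenerate cases such as π = 1.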

The compound Poisson-gamma model

Conceptually, the CPG mimics the process involved when sampling most living organisms in nature for which the observed variable of interest is a continuous variable, such as the total biomass captured during a sampling event (Foster & Bravington 2012; Lecomte et al. 2013). Simply put, the model assumes that a Poisson distributed number N of aggregations (i.e. patches or lumps) of organisms are collected, each patch containing a mass math formula modelled using a gamma distribution. It should be noted that an aggregation could contain only one organism (Foster & Bravington 2012). The sum of the individual masses of captured aggregations yields the total observed biomass Y:

display math(eqn 5)

The CPG is characterized by three parameters: λ the Poisson intensity, a and b the shape and rate gamma parameters:

display math(eqn 6)

The main derived quantities for the CPG model are summarized in Table 1.

Due to additivity properties (Jorgensen 1987), the sampling volume V may be straightforwardly incorporated in a CPG model by scaling the Poisson intensity parameter:

$$Y \mid V \sim \mathrm{CPG}(\lambda V, a, b) \qquad \text{(eqn 7)}$$

The CPG approach jointly models the probability of presence and the nonzero sampled quantity. This capacity allows one to model a gradient of decreasing biomass in the distribution of the targeted species due to low density of organisms or low detectability.

There is no disjoint treatment of null and positive values as in the DG model. Foster & Bravington (2012) note that when no covariates are included in either the Poisson or gamma latent components, the CPG model belongs to the Tweedie family, and, in addition, a reviewer has noted that this is still the case when the set of covariates is identical in each of the Poisson and gamma components.
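The patch-capture mechanism and the volume scaling described in this section can be sketched with the standard library alone. This is an illustrative simulation, not the estimation code used in the study: `poisson_draw` and `cpg_draw` are hypothetical helper names, and the Poisson count is drawn by counting unit-rate exponential arrivals, which is adequate for the small intensities considered here.

```python
import math
import random


def poisson_draw(rng, mean):
    """Poisson(mean) draw: number of unit-rate exponential arrivals in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def cpg_draw(rng, lam, a, b, volume=1.0):
    """One CPG biomass: Poisson(lam * volume) patches with Gamma(a, rate b) masses."""
    n_patches = poisson_draw(rng, lam * volume)
    # random.gammavariate takes a shape and a SCALE, so rate b becomes scale 1 / b.
    return sum(rng.gammavariate(a, 1.0 / b) for _ in range(n_patches))


if __name__ == "__main__":
    rng = random.Random(2013)
    draws = [cpg_draw(rng, 1.0, 200.0, 2.0, volume=2.0) for _ in range(20000)]
    zero_fraction = sum(d == 0.0 for d in draws) / len(draws)
    print("zeros:", zero_fraction, "expected:", math.exp(-2.0))
    print("mean:", sum(draws) / len(draws), "expected:", 2.0 * 200.0 / 2.0)
```

Doubling the volume doubles the expected number of patches, which both lowers the proportion of zeros and raises the mean biomass, without any change to the gamma parameters.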

A simulation study to compare the impact of variable sampling volume

The abilities of the DG and CPG models to reliably estimate quantities of interest when sampling volume is variable were compared using simulations. Each trawl haul is divided into small fractions, or microvolumes, which can conceptually be thought of as the sweeping of one unit of area by the trawl. Each microvolume contains a small amount of biomass produced according to a DG process. The observed sampled biomass is the sum of the biomass collected over those microvolumes. Because the DG does not possess additivity, a biomass amount summed over all microvolumes constituting a complete trawl haul does not conform with either the DG or the CPG model. The simulation proceeded as follows:

  1. Biomass values were generated with a DG model of parameters math formula, math formula and math formula corresponding to a sampled fraction math formula. These biomasses are denoted as ‘microbiomasses’.
  2. The total collected biomass of a sample is the sum of math formula microbiomasses captured across all sampled microvolumes for that sample to result in a total volume V:
    display math
  3. The total volumes V are simulated according to a log-normal distribution:
    display math

with one of several variances math formula which varied between simulations: 0·1, 0·2, 0·3, 0·5, 0·7, 0·9, 1·00, corresponding, respectively, to coefficients of variation of the sampling volumes math formula of 0·10, 0·20, 0·31, 0·53, 0·80, 1·12, 1·31, and a constant median of 1. For each math formula value, n = 100 data sets, each composed of 150 full samples, were generated from the population of microvolumes.
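The three steps above can be sketched as follows. Two points are assumptions of this sketch: the number of microvolumes in a haul is taken to be the volume divided by the microvolume, rounded to an integer, and the log-normal spread is passed as a log-scale standard deviation. On the second point, the coefficient of variation of a log-normal variable is sqrt(exp(σ²) − 1), and the listed coefficients of variation are reproduced when the listed values 0·1 to 1·00 are read as σ; reading them as σ² would give 0·32 for the first value. The function names are illustrative.

```python
import math
import random

LISTED_VALUES = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.0]


def lognormal_cv(sigma):
    """Coefficient of variation of a log-normal variable with log-scale sd sigma."""
    return math.sqrt(math.exp(sigma**2) - 1.0)


def simulate_haul(rng, alpha0, beta0, pi0, v, sigma):
    """One haul: draw a volume, then sum DG microbiomasses over its microvolumes."""
    volume = rng.lognormvariate(0.0, sigma)  # log-mean 0 gives a median of 1
    n_micro = max(1, round(volume / v))      # assumption: N_V = V / v, rounded
    biomass = 0.0
    for _ in range(n_micro):
        if rng.random() < pi0:               # Bernoulli presence in this microvolume
            # gammavariate takes shape and scale, so rate beta0 becomes 1 / beta0
            biomass += rng.gammavariate(alpha0, 1.0 / beta0)
    return volume, biomass


if __name__ == "__main__":
    print([round(lognormal_cv(s), 2) for s in LISTED_VALUES])
    rng = random.Random(42)
    data_set = [simulate_haul(rng, 200.0, 2.0, 0.001, 0.001, 0.5) for _ in range(150)]
    print(sum(b == 0.0 for _, b in data_set), "zeros out of", len(data_set))
```

A data set in the sense of the text corresponds to 150 calls to `simulate_haul` with a fixed spread.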

Three quantities of interest can be expressed analytically as a function of the microvolume parameters. The expected biomass collected over math formula microvolumes, each producing a microbiomass from a DG distribution with parameters (math formula, math formula, math formula), equals:

display math

The probability π of presence is obtained by noticing that:

display math(eqn 8)

Finally, the strictly positive expected biomass is given by:

display math

To simulate zero-inflated biomass data with a variation in the sampling volume, a very small microvolume math formula·001 and large numbers math formula were considered. According to equation 8, math formula has to be chosen very small to simulate a realistic probability of presence. Three contrasting sets of the parameters (math formula, math formula, math formula) were considered: (200,2,0·001), (200,2,0·0005) and (20,2,0·001). In those cases, if math formula, the resulting sampled volume is V = 1. Thus, when math formula, the mean biomass of a data set generated with parameters (200,2,0·001) was Q = 100, and the probability of presence was π = 0·63. This data set presented a reasonable proportion of zeros associated with large positive biomasses, as is often encountered in ecological surveys. Data sets generated with parameters (200,2,0·0005) were intended to investigate a higher proportion of zeros (Q = 50 and π = 0·39), whereas data sets simulated with parameters (20,2,0·001) were representative of a situation with lower quantities of biomass (Q = 10 and π = 0·63). Summing over a large number of microvolumes math formula made it possible to simulate realistic continuous zero-inflated data with a variation in the sample volume. However, one may object that such a large sum of microvolumes could unduly favour the additively consistent CPG model. For this reason, a fourth set of parameters with a larger math formula was chosen to test the robustness of the CPG model in a situation far from the addition of a large number of microvolumes. In this case, a larger microvolume was chosen, math formula·3, and a small number of microvolumes math formula were summed to ensure a realistic overall probability of presence. When math formula, data sets generated with this set of parameters (200,2,0·15) presented a mean biomass Q = 45 and a probability of presence π = 0·39.
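The quoted values of Q and π can be checked from the three analytical expressions given earlier in this section, read here as Q = N·π₀·α₀/β₀, π = 1 − (1 − π₀)^N and QP = Q/π for N microvolumes with microvolume parameters (α₀, β₀, π₀). The microvolume counts used below (1000 for the first three parameter sets, 3 for the fourth) are inferred for this sketch because they reproduce the reported figures; they are not quoted from the text, and the symbol names are illustrative.

```python
def aggregate_quantities(alpha0, beta0, pi0, n_micro):
    """Mean biomass, presence probability and mean positive biomass over n_micro microvolumes."""
    mean = n_micro * pi0 * alpha0 / beta0
    p_presence = 1.0 - (1.0 - pi0) ** n_micro  # at least one non-empty microvolume
    mean_positive = mean / p_presence
    return mean, p_presence, mean_positive


# (alpha0, beta0, pi0, assumed number of microvolumes at unit volume)
PARAMETER_SETS = [
    (200.0, 2.0, 0.001, 1000),
    (200.0, 2.0, 0.0005, 1000),
    (20.0, 2.0, 0.001, 1000),
    (200.0, 2.0, 0.15, 3),
]

if __name__ == "__main__":
    for alpha0, beta0, pi0, n_micro in PARAMETER_SETS:
        q, p, qp = aggregate_quantities(alpha0, beta0, pi0, n_micro)
        print(f"Q = {q:.2f}, pi = {p:.2f}, QP = {qp:.2f}")
```

For the first parameter set this also gives a mean positive biomass of about 158·15 and a probability of absence of about 0·37, the reference values against which the estimates in Table 2 are compared.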

Bayesian inference

We chose Bayesian inference, with computation by Markov chain Monte Carlo methods. For both models, the Bayesian model specification requires prior distributions. Vague normal distributions, with mean zero and standard deviation 100, were chosen for all regression parameters. For the positive parameters, weakly informative gamma prior distributions, Gamma(1,0·001), were chosen. The inference was carried out using OpenBUGS, the open version of WinBUGS (Ntzoufras 2011). For each model, three chains were run for 60 000 iterations, with a burn-in period of 30 000 iterations. The chains were thinned by keeping one iteration in every 100 to reduce autocorrelation within each chain. Convergence was assessed using the Gelman–Rubin convergence test. Maximum likelihood estimation of both models can be found in Foster & Bravington (2012).
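To make the chain settings concrete, the sketch below counts the draws retained under the stated settings and computes a basic Gelman–Rubin potential scale reduction factor. It is a textbook version of the statistic written for illustration, not the diagnostic code of OpenBUGS, and it assumes that the 60 000 iterations are counted before thinning. Both function names are hypothetical.

```python
import statistics


def retained_draws(n_iter, burn_in, thin, n_chains):
    """Number of posterior draws kept after burn-in and thinning, over all chains."""
    return n_chains * ((n_iter - burn_in) // thin)


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n  # pooled posterior variance estimate
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    chains = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(3)]
    print(retained_draws(60000, 30000, 100, 3), gelman_rubin(chains))
```

Values of the factor close to 1 indicate that the chains are sampling the same distribution; a chain stuck in a different region inflates the between-chain term and pushes the factor well above 1.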

Model evaluation

The effect of variable sampling volume was examined for three quantities of interest θ, namely mean biomass, mean strictly positive biomass and probability of absence. The results obtained for the two models were explored using four performance indices for each quantity. The first was the root mean squared error, which accounts for the common trade-off between variance and bias of the posterior mean of the quantity for the ith data set, math formula. It is defined as:

display math(eqn 9)

where math formula is the 'true' value of θ used in the simulations. The second was the estimated average coefficient of variation computed for each unknown quantity of interest, which highlights the relative estimated dispersion, and is defined as:

display math(eqn 10)

where math formula is the posterior standard deviation of math formula related to data set i. The third is the recovery ratio, math formula (sometimes called the confidence coefficient), which is obtained by counting, over the 100 data sets, how many times the true value falls within the 90% credible interval. It highlights the fitting capacity of the model. Finally, the average posterior median, math formula, over the n = 100 replicated data sets was computed as an estimator of the three quantities of interest.
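The four indices can be sketched as follows, under one reading of the verbal definitions above: the RMSE is taken over the posterior means, the average coefficient of variation is the mean of posterior standard deviation divided by posterior mean, and the recovery ratio is expressed as a percentage. The function name, its arguments and the toy inputs are illustrative only.

```python
import math


def performance_indices(post_means, post_medians, post_sds, intervals, true_value):
    """Performance indices over replicated data sets for one quantity of interest.

    intervals holds one (lower, upper) 90% credible interval per data set.
    """
    n = len(post_means)
    rmse = math.sqrt(sum((m - true_value) ** 2 for m in post_means) / n)
    average_cv = sum(sd / m for sd, m in zip(post_sds, post_means)) / n
    recovery = 100.0 * sum(lo <= true_value <= hi for lo, hi in intervals) / n
    average_median = sum(post_medians) / n
    return {"rmse": rmse, "cv": average_cv,
            "recovery": recovery, "median": average_median}


if __name__ == "__main__":
    # Two made-up replicates around a true value of 10.
    print(performance_indices(
        post_means=[9.0, 11.0],
        post_medians=[9.5, 10.5],
        post_sds=[0.9, 1.1],
        intervals=[(8.0, 10.5), (10.2, 12.0)],
        true_value=10.0,
    ))
```

A well-calibrated model should give a recovery ratio close to 90 when 90% credible intervals are used.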

Case study: commercial fishery groundfish data

The consequences of how the CPG and DG approaches deal with variable sampling volumes were explored by applying the methods to commercial fishery catches, which are known to present variable sampling volumes between sampled sites and a high proportion of zeros. This case study is particularly pertinent because the synthesis of commercial fishery catches is routinely used to assess relative stock abundance in fisheries worldwide (e.g. Maunder & Punt 2004).

The data consisted of bottom trawl catches for two years, 2006 and 2009, from a commercial fishery that covered the continental shelf off the west coast of Canada. The two years were chosen because they contrasted in the annual dispersion of the sampling duration. The mean duration of a sampling event for both years was 120 minutes, and all sampling volumes were scaled accordingly so that one unit of sampling effort corresponds to two hours of towing. Histograms of the sampling duration after rescaling by the mean are provided in Fig. 1. The variation observed in these fisheries is commensurate with variation observed in other fisheries elsewhere (e.g. Fig. 2). Scaling by the mean led to the following contrasting variances between the selected years:

  • 2006: empirical variance 0·31 and empirical coefficient of variation 0·56,
  • 2009: empirical variance 0·14 and empirical coefficient of variation 0·37.

Figure 1. Histograms of the duration of sampling events of the groundfish commercial catches after rescaling by its mean for the years (a) 2006 and (b) 2009.

Figure 2. Histogram of the fishing effort (hours) in the bottom-trawl fisheries of the southern Gulf of St Lawrence (Canada) for Atlantic cod and American plaice, in 1992, the year prior to a moratorium on cod fishing.

The two species analysed differed in mean sampled density: the Dover sole (Microstomus pacificus, math formula in kg per tow) and the Pacific ocean perch (Sebastes alutus, math formula in kg per tow). Both models were applied to data from each species and year separately. Depth (in metres) was added to both models as a covariate to account for its well-known effect on catch rates. Depth, which ranged from 50 to 500 m, was split into three classes to account for a possible nonlinear response, with bin cut points at 125 m and 200 m. The most prevalent class (50, 125) was defined as the baseline effect. The resulting model for the delta approach was as follows:

display math(eqn 11)

where math formula and math formula account for the depth effect. The depth was incorporated via the Poisson intensity parameter in the CPG (consistent with Lecomte et al. 2013) although the effect of covariates can be added to either or both of the CPG parameters (Foster & Bravington 2012). The resulting model was as follows:

display math(eqn 12)

where μ is the intercept and math formula denotes the depth effect. The same priors and estimation procedure as those used in the simulation study were used for the Bayesian inference of the case study. The fitting ability of the two approaches was compared using the deviance information criterion (DIC) (Spiegelhalter et al. 2002). The posterior coefficients of variation math formula, 90% credible intervals CI, and the posterior medians math formula were computed for both approaches.
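For readers unfamiliar with the criterion, the sketch below computes the DIC in its usual form: the posterior mean deviance plus the effective number of parameters, the latter being the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters. The function name and the toy deviance values are illustrative; in practice these quantities are reported by the sampling software.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, pD) from posterior deviance draws and the plug-in deviance."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean  # effective number of parameters
    return mean_deviance + p_d, p_d


if __name__ == "__main__":
    # Made-up deviance draws, for illustration only.
    print(dic([10.0, 12.0, 14.0], 11.0))
```

When two models are fitted to the same data, the one with the lower DIC is preferred.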

Results

Simulation study

The simulation results for the data sets generated with the parameter set (200,2,0·001) are presented in this section. Results for the three other parameter sets, (200,2,0·0005), (20,2,0·001) and (200,2,0·15), are provided in Tables S1–S3, as they are very similar to those of the first.

When sampling volume variability was small (math formula·8), the estimates for the three quantities of interest were good and quite similar for both models, with well-calibrated math formula, small RMSE and math formula (Table 2). It is worth noting that for a log-normally distributed sampling volume with a unit median and variance math formula, the mean is an increasing function of the variance math formula. Consequently, for the data sets with a small variance, the results did not differ much from those obtained for a constant sampling volume equal to 1. As math formula increases, the DG approach overestimates the probability of absence and the positive biomass, its recovery ratios decrease, and the relative uncertainty surrounding its estimates, as measured by math formula, increases. In contrast, the CPG approach estimated the simulated values correctly, with recovery ratios that remained generally correct and constant. Overall, RMSE values were lower for the CPG than for the DG model, even when math formula was small. These general patterns remained for different choices of simulated parameters (see Tables S1–S2 in Supporting information).

Table 2. Estimation of mean biomass Q, mean positive biomass QP and probability of absence 1−π with a variable sampling volume, for the simulated parameter set (math formula
Note: math formula is the coefficient of variation of the simulated sampling volume. math formula is the value used to produce the simulations, math formula is the posterior median, math formula is the recovery ratio and should be 90%, RMSE is the root mean squared error and math formula is the average estimated coefficient of variation, which should be as low as possible. Values in bold denote the best fit.

Volume | Quantity | True value | Median CPG | Median DG | R CPG | R DG | RMSE CPG | RMSE DG | CV CPG | CV DG
math formula | Q | 100 | 95·83 | 95·6 | 78 | 73 | 8·05 | 8·15 | 0·08 | 0·08
 | QP | 158·15 | 155·42 | 154·2 | 77 | 81 | 5·41 | 7·62 | 0·03 | 0·05
 | 1−π | 0·37 | 0·38 | 0·38 | 77 | 88 | 0·03 | 0·04 | 0·08 | 0·11
math formula | Q | 100 | 95·82 | 95·56 | 85 | 84 | 7·06 | 7·28 | 0·08 | 0·08
 | QP | 158·15 | 155·41 | 155·06 | 86 | 78 | 4·66 | 7·69 | 0·03 | 0·05
 | 1−π | 0·37 | 0·38 | 0·38 | 85 | 87 | 0·03 | 0·03 | 0·08 | 0·11
math formula | Q | 100 | 96·33 | 95·67 | 87 | 86 | 6·71 | 7·19 | 0·08 | 0·09
 | QP | 158·15 | 155·75 | 156·04 | 88 | 91 | 4·45 | 6·49 | 0·03 | 0·05
 | 1−π | 0·37 | 0·38 | 0·39 | 87 | 85 | 0·03 | 0·04 | 0·08 | 0·11
math formula | Q | 100 | 96·3 | 95·86 | 88 | 84 | 6·45 | 7·58 | 0·08 | 0·09
 | QP | 158·15 | 155·59 | 158·8 | 87 | 86 | 4·32 | 6·43 | 0·03 | 0·06
 | 1−π | 0·37 | 0·38 | 0·4 | 87 | 85 | 0·02 | 0·04 | 0·08 | 0·11
math formula | Q | 100 | 96·27 | 97·28 | 84 | 86 | 6·37 | 8·64 | 0·07 | 0·1
 | QP | 158·15 | 155·62 | 164·61 | 83 | 83 | 4·23 | 10·67 | 0·03 | 0·06
 | 1−π | 0·37 | 0·38 | 0·41 | 84 | 71 | 0·02 | 0·05 | 0·07 | 0·11
math formula | Q | 100 | 94·04 | 99·77 | 76 | 90 | 7·13 | 8·34 | 0·07 | 0·11
 | QP | 158·15 | 154·12 | 173·66 | 76 | 53 | 4·74 | 17·45 | 0·03 | 0·06
 | 1−π | 0·37 | 0·39 | 0·42 | 75 | 69 | 0·03 | 0·06 | 0·06 | 0·12
math formula | Q | 100 | 95·99 | 106·95 | 88 | 82 | 5·88 | 12·27 | 0·07 | 0·11
 | QP | 158·15 | 155·36 | 186·94 | 88 | 23 | 3·97 | 31·81 | 0·03 | 0·06
 | 1−π | 0·37 | 0·38 | 0·43 | 86 | 64 | 0·02 | 0·06 | 0·06 | 0·12

Case study: commercial fishery groundfish data

The probability of absence of Dover sole estimated by the two models was high, and the two models agreed closely in both years (Table 3). Estimated depth parameters were in accordance between models, with depth classes (125, 200) and (200, 500) having a positive effect on the probability of presence relative to shallow depths, as was also observed with the CPG parameters (Tables 4 and 5). No depth effects were detected with the DG approach for the modelling of the positive biomass (Table 5). In contrast to the absence probability, estimates of the overall mean and of the mean positive biomass differed between models for the 2006 data, although not for the 2009 data (Table 3). Recall that sampling volumes were more variable in 2006 than in 2009.

Table 3. Estimation of mean biomass Q, mean positive biomass QP and probability of absence 1−π for the Dover sole sampled in 2006 and 2009
Note: math formula is the posterior median, CI is the credible interval at 95% and math formula is the coefficient of variation.

Year | Quantity | Median CPG | Median DG | CI CPG | CI DG | CV CPG | CV DG
2006 | Q | 4·69 | 7·86 | 2·95–7·03 | 4·6–12·09 | 0·27 | 0·28
 | QP | 100·22 | 154·09 | 88·36–113·21 | 129·38–183·36 | 0·07 | 0·1
 | 1−π | 0·95 | 0·95 | 0·93–0·97 | 0·92–0·97 | 0·01 | 0·01
2009 | Q | 40·9 | 38·06 | 31·04–53·54 | 27·21–52·54 | 0·17 | 0·2
 | QP | 154·81 | 157·76 | 131·47–184·35 | 122·24–199·08 | 0·11 | 0·14
 | 1−π | 0·74 | 0·76 | 0·68–0·79 | 0·7–0·8 | 0·05 | 0·04

Table 4. Parameter estimates for the CPG model fitted to the Dover sole biomass data sampled in 2006 and 2009

Parameter | Term | 2006 Mean | 2006 SD | 2009 Mean | 2009 SD
Intercept | | −3·047 | 0·255 | −1·187 | 0·157
Depth | (125, 200) | 3·13 | 0·273 | −3·401 | 0·811
Depth | (200, 500) | 2·985 | 0·269 | 0·206 | 0·226
a | | 0·528 | 0·047 | 0·93 | 0·138
b | | 0·005 | 0·001 | 0·007 | 0·001

Table 5. Parameter estimates for the DG model fitted to the Dover sole biomass data sampled in 2006 and 2009

Part | Parameter | Term | 2006 Mean | 2006 SD | 2009 Mean | 2009 SD
Bernoulli | Intercept | | −3·431 | 0·346 | −5·711 | 0·882
 | Volume | | 0·485 | 0·176 | 1·128 | 0·345
 | Depth | (125, 200) | 3·459 | 0·304 | 3·439 | 0·799
 | Depth | (200, 500) | 3·339 | 0·297 | 3·753 | 0·807
Gamma | Intercept | | −0·72 | 0·131 | −0·272 | 0·245
 | Volume | | 0·221 | 0·091 | 0·21 | 0·189
 | Depth | (125, 200) | −0·071 | 0·127 | −0·044 | 0·195
 | Depth | (200, 500) | −0·02 | 0·12 | −0·087 | 0·225

Results for the Pacific ocean perch were similar to those for Dover sole. Depth parameter estimates were in accordance between models (Tables 6 and 7). Depth classes (125, 200) and (200, 500) had a positive effect on the presence of the Pacific ocean perch relative to shallow depths for both years.

Table 6. Parameter estimates for the CPG model fitted to the Pacific ocean perch biomass data sampled in 2006 and 2009

Parameter | Term | 2006 Mean | 2006 SD | 2009 Mean | 2009 SD
Intercept | | −3·434 | 0·31 | −8·452 | 2·932
Depth | (125, 200) | 4·009 | 0·317 | 7·194 | 2·939
Depth | (200, 500) | 4·759 | 0·319 | 7·849 | 2·933
a | | 0·405 | 0·039 | 1·568 | 0·23
b | | 0·001 | 0 | 0·002 | 0

Table 7. Parameter estimates for the DG model fitted to the Pacific ocean perch biomass data sampled in 2006 and 2009

Part | Parameter | Term | 2006 Mean | 2006 SD | 2009 Mean | 2009 SD
Bernoulli | Intercept | | −3·035 | 0·495 | −10·817 | 3·634
 | Volume | | −0·077 | 0·317 | 1·293 | 0·344
 | Depth | (125, 200) | 4·393 | 0·359 | 8·356 | 3·591
 | Depth | (200, 500) | 7·409 | 0·647 | 9·053 | 3·612
Gamma | Intercept | | −0·498 | 0·101 | 0·042 | 0·212
 | Volume | | 0·17 | 0·075 | 0·531 | 0·147
 | Depth | (125, 200) | 0·014 | 0·102 | −0·306 | 0·173
 | Depth | (200, 500) | 0·032 | 0·098 | 0·18 | 0·173

Both models estimated a similarly high probability of absence (Table 8), but estimates of the overall mean and of the mean positive biomass differed dramatically between models for the 2006 data, and to a much lesser extent for the 2009 data. DIC scores were lower for the CPG than for the DG model for both years (Table 9), indicating that the fitting capacity of the CPG model was better than that of the DG model. The CPG model remains a model of choice even in situations where the observed biomass is the sum of a small number of DG-distributed microvolume biomasses, as shown in Table S1, Supporting information.

Table 8. Estimation of mean biomass Q, mean positive biomass QP and probability of absence 1−π for the Pacific ocean perch sampled in 2006 and 2009. math formula is the posterior median, CI is the credible interval at 95% and math formula is the coefficient of variation

Year | Quantity | Median CPG | Median DG | CI CPG | CI DG | CV CPG | CV DG
2006 | Q | 17·35 | 69·32 | 10·11–27·2 | 41·06–107·74 | 0·29 | 0·3
 | QP | 538·73 | 1604·82 | 473·58–610·62 | 1417·96–1835·28 | 0·08 | 0·08
 | 1−π | 0·97 | 0·96 | 0·95–0·98 | 0·94–0·98 | 0·01 | 0·01
2009 | Q | 281·14 | 222·35 | 207·97–358·88 | 160·41–296·36 | 0·16 | 0·18
 | QP | 1136·87 | 943·78 | 980·41–1309·96 | 755·52–1149·25 | 0·09 | 0·13
 | 1−π | 0·75 | 0·76 | 0·7–0·81 | 0·71–0·81 | 0·04 | 0·04

Table 9. Deviance information criterion (DIC) scores related to the DG and CPG models fitted to the data sets of the two species collected in 2006 and 2009 by commercial fisheries

Model | Perch 2006 | Perch 2009 | Sole 2006 | Sole 2009
DG | 7946 | 2066 | 4319 | 1539
CPG | 7092 | 1597 | 3490 | 1100

Discussion

The simulations used in this study allowed a comparison of two statistical approaches for continuous zero-inflated data by relying on simulated data that mimic the catches of organisms in a uniform habitat, with zero-inflation and continuous values of abundance. Based on the simulations, variable sampling volumes were found to produce inference challenges for the DG but not for the CPG distribution. This is consistent with the theoretical arguments we presented concerning the additivity property.

The case study and simulations confirmed that under a variable sampling duration, as is often encountered in fisheries and other ecological data, the CPG model outperforms the DG overall, providing better fits to data and correct inferences on estimated quantities. The DG model in such situations tends to overestimate mean biomass values, potentially leading to incorrect conclusions, which in the case of fisheries may mean incorrect stock management recommendations. These differences in fitting capacity could be explained by the structure of the CPG model, which can handle variable sampling volumes easily because of the additivity property, whereas the DG approach accounts for variable sampling volume empirically with the help of a generalized linear model. However, when the sampling volume variability is small, the models performed comparably in the simulation. Fortunately, small sampling volume variability is more the rule than the exception in standardized surveys, and the DG approach therefore remains a valid standard practice in those cases. Model choice becomes very important when data do not come from planned surveys, or when data from two or more surveys with different sampling durations are to be analysed jointly. This choice can have important ecological and economic consequences. For example, commercial fishery catch-rate data, such as those analysed here for groundfish or those exemplified in Fig. 2 for cod, provide the data required to estimate relative abundance indices. These indices form the basis for a large number of stock assessments worldwide, including tuna and cod fisheries that are both highly lucrative and that pose important conservation concerns (e.g. Ahrens 2010; Carruthers et al. 2011). Incorrect inferences drawn from the data are liable to lead to incorrect stock assessment advice and to a risk that conservation or economic objectives for a fishery will not be achieved.

In the case study, sampling volume and depth were modelled to affect only the number of patches for the CPG models as, for example, increasing the duration of a sampling event results in an increased number of captured patches. Patch size should vary randomly with respect to changes in sampling volume if a sample is taken in a generally homogeneous habitat. Of course, if increasing sampling volume causes a sample to span more than one area of homogeneous habitat, then both patch number and size can vary in complex ways, and the underlying assumptions of both the CPG and DG could be violated.

Ancelet et al. (2010) pointed out a high correlation between the two quantities (number of patches, biomass in one patch) in a special case of the CPG approach. This result suggests that when the CPG distribution is used to model the effect of covariates on the property of interest, such as in generalized linear models (Stefansson 1996; Shono 2008; Zuur et al. 2009; Foster & Bravington 2012) or additive models (Zuur et al. 2009), it is appropriate to link only one of these two hidden quantities to the explanatory covariates. We suggest that it is most appropriate to model the effect of covariates on the number of patches only, because it tunes both the presence–absence and the quantity of biomass sampled. The parameters are heuristically defined as the number and biomass of patches, although these ecological properties are not actually being estimated. Foster & Bravington (2012) used a data set composed of biomass and abundance data to explore the relationship between patch size and the size of one typical fish coming from this patch. They showed that the size and the number of patches could have a different relationship to those for the size and number of individual fish. However, such a hypothesis about the size and number of patches collected during a sampling event needs to be checked. Even if the conjunction of the parameters yields a distribution of biomass values possessing the properties of interest, that is, zero-inflation as well as continuous values with occasional extremes and additivity with respect to variable sampling volume, one must not over-interpret the ecological meaning of the individual parameters.

We conclude with practical recommendations arising from this work. When facing zero-inflated data with a constant sampling volume or a sampling volume with low variability, the DG approach is understandably likely to be preferred by many because of its ease of implementation. However, when working with variable sampling volumes, the analyst should be wary of the DG model. We suggest the CPG structure as a better alternative, even at the cost of some increased complexity of implementation. Otherwise, as the simulation study shows, DG estimates, unlike CPG estimates, may lead to misleading conclusions by overestimating the biomass quantities.

Acknowledgements

We are indebted to two anonymous reviewers for their insightful comments, which greatly improved the manuscript. We also want to thank an anonymous reviewer for proposing the use of the sampling volume as a covariate in both parts of the DG approach, which allows a fairer and more balanced comparison.
