Modelling excess zeros in count data: A new perspective on modelling approaches

We consider the analysis of count data in which the observed frequency of zero counts is unusually large, typically with respect to the Poisson distribution. We focus on two alternative modelling approaches: Over-Dispersion (OD) models, and Zero-Inflation (ZI) models, both of which can be seen as generalisations of the Poisson distribution; we refer to these as Implicit and Explicit ZI models, respectively. Although sometimes seen as competing approaches, they can be complementary; OD is a consequence of ZI modelling, and ZI is a by-product of OD modelling. The central objective in such analyses is often concerned with inference on the effect of covariates on the mean, in light of the apparent excess of zeros in the counts. Typically the modelling of the excess zeros per se is a secondary objective and there are choices to be made between, and within, the OD and ZI approaches. The contribution of this paper is primarily conceptual. We contrast, descriptively, the impact on zeros of the two approaches. We further offer a novel descriptive characterisation of alternative ZI models, including the classic hurdle and mixture models, by providing a unifying theoretical framework for their comparison. This in turn leads to a novel and technically simpler ZI model. We develop the underlying theory for univariate counts and touch on its implication for multivariate count data.

they suggest reasons why it will often in practice be very difficult to distinguish, from data, which of the alternative models will be 'best' in any useful sense. These insights lead to a new form of ZI which we term 'Logistic ZI'. Some argue that the essential difference between these approaches is that there are situations where at least some of the zeros (and only the zeros) arise from a process that differs essentially from that which generates the counts, including counts of zero. This prescribes an explicit ZI approach, in which some of the observed zeros are deemed to reflect an unreported structural variable; they may even be deemed 'false'. But, although not strictly necessary, this concept of true and false zeros may impose a burden on some users, for an over-dispersed distribution can often achieve the same descriptive effect; see Blasco-Moreno et al. (2019) and Martin et al. (2005). By contrast, OD distributions can lead to excess low counts such as zero. Zeros are no longer special: low counts may be inflated with respect to the Poisson. The key distinction is the upper tail of OD distributions is also inflated with respect to the Poisson.
Our fundamental concern in this review is with concepts, rather than with, say, algorithms, power and tests. We defer consideration of vector counts, which introduce new challenges, but not new concepts. We motivate the paper in Section 2, by referring to a concrete example.
We then introduce our general approach including our new type of ZI, in Section 3. Section 4 examines, from the same perspective, the implicit ZI induced by the Negative Binomial. Our concluding thoughts are in Section 5.

Motivating example
In this section we use a simple dataset to illustrate a brief and conventional overview of some models that deal explicitly or implicitly with excess zeros in the context of regression. Specifically, we adopt the classic nomenclature and motivation of the models, but defer a discussion of theoretical aspects until Sections 3 and 4. The datatset is highly structured, permitting the fitting of saturated models for the mean structure. It also has, perhaps somewhat atypically, extensive replication allowing the calculation of natural and non-parametric estimates of, not just, the means but also of the probability of zeros. This allows us to see clearly that a basic Poisson regression model deals inadequately with the incidence of zero counts. It also allows us to evaluate the different explicit approaches to the handling of the excess zeros in terms of both the fitting of the probabilities of zeros and the fitting of the means. Both of these are challenging in more general regression settings. But we argue in this paper that the opaqueness, most especially of explicit ZI modelling, is a contributory factor. In the same way we can study implicit ZI models, and here we focus on two versions of the negative binomial regression model as examples of over-dispersed count models.
We consider the Trajan apple data of Ridout et al. (1998). The response variable is the number of roots produced by 270 micropropagated shoots of the apple cultivar Trajan under two different photoperiods (8 and 16 hours) and four different concentrations of the growth hormone cytokinin BAP in a completely randomised design with multiple replication at each of the settings. Here we treat both explanatory variables as factors and restrict attention to the full interaction model for the eight different photoperiod by hormone settings, so in essence fitting the frequency distribution of counts for the replicates at each setting. We refer to these eight distinct settings as C (k) , k = 1, . . . , 8 and index any associated model parameters and data summaries in the same way; writingȳ (k) and p We start with the basic Poisson model with π y (λ) = λ y e −λ /y!, y ∈ N. Fitting a Poisson log-linear full interaction model corresponds to estimating a separate parameter λ (k) for each of the settings C (k) with the cell meansȳ (k) being the maximum likelihood estimators. Effectively the data are treated as iid, within each cell. So here the observed cell means are reproduced by the fitted model (see Figure 1

Explicit ZI extensions
Here we revisit the two classical approaches: the hurdle model of Mullahy, in two variants, and the mixture model of Lambert. In Section 3 these will be referred to as types A, B and C, and a fourth Type D will be introduced. Results from all four are presented in Figure 1 We denote these explicit ZI extensions byπ y (λ, γ), where here we focus on one parameter extensions of the underlying Poisson base model π y (λ). Section 3 will provide a unified treatment of the way that γ can be seen as parameterising different types of explicit ZI. In the brief overview below of the classical treatment of hurdle and mixture models, γ implicitly parameterises terms specific to these models. The key step is thatπ y arises from a specific alteration of the Poisson π 0 toπ 0 ; and that, for y > 0,π y = ρπ y where ρ re-normalises to ensure that yπ y = 1. So for y > 0 the modified distribution retains the same structure as the Poisson for successive terms, i.e.π y+1 /π y = π y+1 /π y .
The first class of ZI extensions that we consider are the Hurdle models of Mullahy (1986).
The basic idea of the hurdle model is that the zeros and non-zeros arise from separate processes, allowing the zeros to be modelled as binary outcomes and the non-zero counts as outcomes from a zero-truncated count distribution. The hurdle name comes from the idea that the binary process for the zero/non-zero values essentially models the probability of crossing over the hurdle from zero to non-zero counts. Then conditional on having crossed over this hurdle the model is a standard count data model for the non-zero counts, typically taken to be a zerotruncated count model. The simplest hurdle model is one in which the zeros are modelled as coming from a Bernoulli distribution with a constant probabilityπ 0 =π 0 (γ) and the non-zero counts coming from a zero-truncated Poisson distribution; we refer to this in Section 3 as Type (1−e −λ ) . In our notation this hurdle model has the form with the re-normalising ρ = (1−π 0 ) (1−e −λ ) and E[Y ] = (1−π 0 ) (1−e −λ ) λ = ρλ. Fitting this simple model to the Trajan data with a single additional ZI parameter, the commonπ 0 , with separate parameters λ (k) for each cell C (k) , the overall proportion of observed zeros (0.237) is necessarily reproduced by the fitted model. For under this model, the indicator variable E[I{Y = 0}] =π 0 irrespective of cell and is estimated by the overall proportion of zeros p 0 . But as the eight individual cells have different numbers of zero observations, π 0 = p (k) 0 in general; we elaborate in Section 3. As the zero-truncated Poisson distribution is in the Exponential Family, the fitted model reproduces the means of the zero-truncated data; thus E[Y (k) |Y (k) > 0] is estimated by the truncated meanȳ More general hurdle models, in particular whereπ 0 is itself modelled via covariates in a regression context, involve specifying a particular form for the binary model. In addition to any specific linear predictor, this specification includes the choice of a link function for the zero probabilityπ 0 , or equivalently for the probability of non-zero countsπ + = 1 −π 0 , whereπ + can be thought of as the probability of crossing the hurdle. The link function can be taken as any standard binary/binomial model link but can also be motivated by considering the zeros to have come from a particular (truncated) count distribution; for example, Mullahy (1986) considers the use of the Poisson distribution leading to a complementary log-log link for the hurdle crossing probabilityπ + .
In the general form of Mullahy's Poisson-hurdle the zero and non-zero counts are both assumed to come from Poisson distributions but with different (modelled) parameters. A simplified version of this, with a single additional ZI parameter, takes the Poisson parameters as proportional withπ 0 (λ, α) = π 0 (αλ) with α < 1 for zero-inflation and where α = 1 reduces to a common Poisson model for zero and non-zero-counts (to fit in with our general notation forπ 0 (λ, γ) in Section 3 we take α = e γ ) ; we refer to this as Type B. In this shared parameter version of the hurdle model we no longer have the separation of the zero and non-zero processes and consequent simplification of the estimation. Here both zero and non-zero counts contribute to the estimation of the regression parameters and not surprisingly this model does better than Type A at reproducing the overall means and also manages to pick up some of the differences in zero proportions.
The other broad class of explicit ZI approaches are the zero-inflated mixture models, exemplified by the zero-inflated Poisson (ZIP) model of Lambert (1992). Here the zero probability is extended to a mixture of the base zero probability and degenerate point mass at zero, where the mixing probability provides the additional parameter and corresponds to the specific inflation of the zero probability. In the usual mixture-like notation we haveπ 0 (λ, q) = q + (1 − q)π 0 for 0 ≤ q ≤ 1, with ρ = 1 − q. This can be written in our generic extended formπ 0 (λ, γ) by parameterizing q as a function of γ; specifically, for reasons that will become clear in Section 3, we take q = 1 − e γ giving γ = log(1 − q). We refer to this as Type C. For the Trajan data we use a constant mixing probability q (equivalently constant γ) and separate λ (k) values for each of the eight classes C (k) . From Figure 1(b) we see that here the results are very similar to those for Type A, this is because the cell means are generally large and so the base Poisson model zero probabilities, exp(−λ (k) ), are small and contribute little to the fitted zero probabilities that are dominated by the estimated constant zero inflationq.
As a further novel explicit ZI model in Section 3 we introduce Type D, where the alteration of the zero probability is made through its odds-ratio. As discussed in Section 3, the resulting extended distribution is in the Exponential Family and for the Trajan data interaction model with the eight distinct λ (k) and a single ZI parameter the sufficient statistics are the individual cell means,ȳ (k) , and the overall proportion of zero counts. Consequently, the maximum likelihood fitted model reproduces the eight cell means and the overall proportion of zeros and, here, also seems to recover much of the individual cell zero probability structure, all with a single additional parameter.
While we would not claim that the pattern of behaviour across Types A -D seen with the Trajan data is going to hold in general, we do feel that there are some important messages.
First, the choice of ZI model may have unforeseen consequences on the overall fit, although the specific ways in which this may happen are rather subtle. Second, mis-specification of the ZI parameter (as here in treating it as constant when then is a large difference across the two photoperiod settings) can severely impact the estimation of the mean component of the model and covariate parameters of interest as seen here for types A and C.
Note that for the Trajan data if a full interaction model is also used for the additional ZI parameter, then all four types (A, B, C, and D) are equivalent, as the likelihood reduces to one based on eight independent samples with their own parameters. These comments suggest that in practice if interest is in the mean model then it would be sensible to use a rich model for the ZI parameter, which will provide more robust inferences. Of course, if there is also explicit interest in the covariates affecting the additional zero part then some care may be needed in choosing an appropriate model and the development of some targeted diagnostics to help in this would be valuable.

Implicit ZI extensions
We now consider the alternative approach of using over-dispersed models with apparently zeroinflated data. In general, many over-dispersed models are one-parameter extensions of the basic Poisson model and we write their probability function as π y (λ, φ) with an additional dispersion parameter φ, where often φ = 0 corresponds to the Poisson. Here we restrict our attention to fitting two different forms of negative binomial model. The canonical negative binomial model arises from a Poisson-gamma mixture taking Y ∼ Poisson(λV ) where V is a Gamma(1/φ, 1/φ) distributed random variable with mean 1 and variance φ. The resulting negative binomial distribution, which we refer to as NB-quad, has mean µ = λ and a quadratic variance function µ + φµ 2 and for a fixed value of φ it is in the Exponential Family. The probability of a zero observation is π 0 (λ, φ) = (1 + φλ) −φ −1 ≥ e −λ with equality for φ = 0, showing that, for φ > 0, this model does indeed inflate the probability of a zero observation compared to the Poisson distribution. For the Trajan data fitting the NB-quad with the full eight parameter interaction model and a single dispersion parameter φ, we see from Figure 1(a) that this model also reproduces the cell means and hence has some robustness for inference on regression parameters.
However, while the single additional over-dispersion parameter accounts for some of the zeroinflation, it fails to model the zero proportions well. Of course, as with the explicit ZI models, by including a model for the over-dispersion parameter we can substantially improve the fit of the zeros, at the price of added complexity.
For comparison we also consider a different parameterization of the negative binomial model (arising from a different version of the Poisson-gamma mixture) that has a linear variance function, µ(1 + φ) and which we refer to as NB-lin. As with all Poisson mixture models, as well as over-dispersion, the probability of zero is inflated compared to the Poisson with now over-dispersion models that we could consider, but none are going to be a panacea for fitting all zero-inflated data.
The intention here is not to present a definitive analysis, but rather to consider a simple common model where differences are apparent between the different implicit and explicit ZI models. In particular, we restrict our attention to a single ZI or over-dispersion parameter. In practice, as suggested above, we may want this parameter to depend on covariates and then many of the ZI models can be very similar, or indeed identical, and over-dispersion models can also give similar fits, as noted by Warton (2005).
Here for explicit ZI models we have taken the base model to be Poisson, but this is not necessary and we could use any suitable count distribution that could itself be a one (or more) parameter extension of the Poisson. Such combinations of explicit and implicit approaches lead to two parameter extensionsπ y (λ, φ, γ) of the Poisson model and offer yet more flexibility.
Additionally, in principle, it is also possible to have regression models for each of the three parameters. However, such flexibility and complexity comes at a price and it would require a very rich dataset to distinguish between different models.

Explicit Models for Excess Zeros
Here we consider formally the explicit ZI modelling of excess zeros in count data regression.
We revisit and generalise the analysis of the Trajan data. Our objective in this is to put extant models into a theoretical framework which seems to be novel. This emphasises the functional form of what was described above as the specific alteration of the Poisson π 0 toπ 0 . This apparently new perspective may facilitate the (under-developed) constructive criticism of model fit, generally, using the Trajan analysis as a primitive example. Further, a new ZI type emerges naturally from the theory, which can inherit the attractive properties of Exponential Family probability distributions.
Formally, we discuss a wide family of one parameter extensions of the count distribution π y (λ) underlying a generalized linear model (GLM), leading via coefficients to new estimates of the mean and thus of the probability of zero counts. The extensions are based on a new model π y differing from the base π y via a parameter, here termed γ. This itself could be modelled as depending on covariates via further coefficients. This is indeed routine, but is not our immediate concern. Here, by studying a very wide range of functional options forπ y (λ, γ), we hope to assist in the search for parsimonious ZI alternatives to the base pmf.
We first formalise the modelling in the previous section. The immediate focus is with models which explicitly address the apparent excess of zeros, but only zeros. The base could in fact be provided by one of many two (or more) parameter over-dispersed pmfs, denoted here by π y (λ, φ); often these are themselves generalisations of the Poisson. But for simplicity of notation, we typically suppress the second parameter. Here we refer to a zero-altered distributionπ y as defined by a functionπ 0 (π 0 , γ) with consequent implications forπ y , y = 0. As we develop below the functional form defines a type of ZI, parameterised in the notation below by γ. Three such types (A,B,C) were discussed in Section 2 and a fourth (type D) mentioned.
The a priori need for such an extension was established there by a comparison of proportions where π y denoted a Poisson GLM, leading to the maximum likelihood estimate for each λ (k) . These plots re-appear in Figure 2. In each of the panels, the Poisson GLM is represented by the diagonal; the points are pairs π . Note now that the horizontal axis in Figure 1 denotesπ 0 for the given model whereas in Figure 2 it denotes the (Poisson) base π 0 . Note also the full unit square is used for the plots, in contrast to Figure 1, to allow for examination of the functions underpinning explicit ZI modelling. As in the previous section, a full explanation of these functional forms is provided below and should be read in conjunction with Figure 2. Furthermore the OD models, which are also presented in Figure 2, should be read in conjunction with Section 4.
The details of these generalisations are the subject of this Section, as we now develop, firstly focussing on the theory of the ZI typesπ y , and subsequently by considering the implications for likelihood inference.
The specific functionπ 0 (π 0 , γ) of π 0 characterises the type of ZI and is parameterised by γ; it is such functions that are illustrated in Figure 2. Typically, but not necessarily, γ = 0 denotes the null caseπ 0 = π 0 . Observe that parameters (λ, φ) enterπ 0 and ρ only via π 0 . Note also that the relative magnitudes ofπ y for non-zero y are the same as those for the base π y . Equivalently, the conditional pmfs, given y > 0 (that is, the zero-truncated distributions), are the same, whether based onπ y or the base π y . In particular It is such conditional pmfs that distinguish explicit and implicit approaches to the modelling of excess zeros.
We now consider special cases. Three are relatively common. We will see below that many of the interesting cases can be written as They are all summarised in Table 1.

ZI Type A
The simplest model has constantπ 0 ∈ [0, 1]. With this restriction we could writeπ 0 = γ. But here, as for other types, γ may of course depend on covariates; we thus write g(π 0 ) = γ for some link function g(.). In this form γ is typically unconstrained.
This form is the hurdle model discussed in Section 2; it could be described as the constant type of ZI. But its use in this very basic form (constantπ 0 ) is almost degenerate. Its inadequacy for the Trajan data is apparent (Figure 2 top right panel). Note that here, and equivalently in the other panels, the value of γ used for the function is the MLEγ for the Trajan data under Type A, as discussed in Section 2. In this case it is that value of γ that rendersπ 0 = p 0 , the overall proportion. Further, for this and other panels, the 'data points' are pairs (π 0 (λ (k) ), p 0 ) under each ZI type. They thus correspond to the plotted functionsπ 0 (π 0 (λ),γ), where the value ofγ is specific to the ZI type. For Type A, the MLEλ (k) is derived from the truncated meanȳ (k) + , as previously discussed. We formalise such inference in the next subsection.
Observe that this ZI type is zero-inflationary for small π 0 , but zero-deflationary for large π 0 . This is a necessary consequence of its motivation, where zero counts are generated by a process quite independent from positive counts. Note also that the neutral case ofπ y = π y is not a special case of this model.

ZI Type B
Type B is that referred to in Section 2 as the Poisson-hurdle. It can be expressed asπ 0 = π 0 (αλ) = (π 0 ) α for positive α; for the Poisson base model we haveπ 0 = e −αλ i.e. coming from a Poisson model with mean αλ. An alternative form is through the complementary loglog link function with unconstrained additional parameter γ = log(α) and log(− log(π 0 )) = γ + log(− log(π 0 )), Observe that γ = 0 defines the neutral case, with positive and negative γ corresponding to under-and over-inflation. The ZI Type B panel in Figure 2 (bottom left panel) makes it quite clear that this more clearly reflects the frequency of zeros observed in the Trajan data.
ZI Type C In the notation of Section 2 this can be written as (1 −π 0 ) = (1 − q)(1 − π 0 ) for 0 ≤ q ≤ 1; zero inflation is thus non-zero deflation by a constant factor. It could be characterised as the 'linear model'. It is very widely used; indeed for very many authors, this model is coterminous with ZI. For the Trajan data, like Type A, it is clearly a poor model of the observed zeros ( Figure 2 middle bottom panel).
Its simplest link-function expression is via the complementary log function for γ ≤ 0.
Like types B and D, and in contrast to A,π 0 = 1 when π 0 = 1. But like Type A, when π 0 = 0,π 0 > 0; this distinguishes itself qualitatively from Type B, and also from D below. This is the feature which marks both types A and C as inappropriate single parameter extensions for the Trajan data. Note also that the model is also defined for γ > 0, subject toπ 0 > 0. Thus Type C includes under-inflation; but now γ is constrained by a function of λ. More formally, when γ > 0, we must writeπ 0 = max(0, 1 − e γ (1 − π 0 )).

Type D
Type D, which is new, is most simply expressed as logit(π 0 ) = γ + logit(π 0 ) for unconstrained γ. Equivalently we haveπ 0 1−π 0 = e γ π 0 1−π 0 . The alteration may now be seen as multiplicatively altering the odds ratio of a zero count. In closed form, it may be written as π 0 = e γ π 0 1+(e γ −1)π 0 . Although its performance is similar to Type B, arguably it is the best of the four ZI types in describing the observed frequency of zeros in the Trajan data (see Figure 2 bottom right panel). But it also differs from the others in a more theoretical sense, as we elaborate below.
For, in many cases, pmfs involving Type D extensions are often within the Exponential Family.
It is for this reason that D is, for the Trajan analysis, the only one of the four ZI types for whichμ (k) =ȳ (k) , as seen in Figure 1, sharing this property with the Poisson and NB-quad distributions.
We conclude this subsection by making two remarks. Firstly, the use of the unit square for the plots ofπ 0 (π 0 , γ) emphasises a peculiar feature of the design of the Trajan data. It is lacking any samples corresponding to small means λ and thus large π 0 . It is this that renders it difficult to distinguish which is the 'best' model, despite its other peculiar feature -replication -that gives access to conditionally iid data and thus to the proportions p 0 of observed zeros. An even more extreme example would be fully iid data. There we would see that all the four ZI types would in fact be re-parameterisations of each other and would thus be impossible to distinguish.
All four functions would intersect atπ 0 = p 0 . Figure 3 uses this to provide a different contrast of the functions. There it is seen that types B and C can be remarkably similar for mid to large values of π 0 , and thus might be difficult to distinguish in a data design that did not include small values of π 0 . Further, Type D distinguishes itself from Type B in two ways. Firstly it has a symmetry (around the diagonalπ 0 = 1 − π 0 ) that the others lack; but this may in fact be disadvantageous in some cases. For example, it can be steeper than B for very small π 0 ; but then, necessarily, it will be much flatter for small 1 − π 0 .
Secondly we observe that there are many more options in the design of even the one parameter functionsπ 0 =π 0 (π 0 , γ) than the literature might suggest. These clearly extend beyond the four types we have used for illustration. They typically involve classic link functions g(p), such as complementary log-log and logit. Interestingly however, the particular characteristic of the overwhelmingly popular Type C -namely thatπ 0 > 0 when π 0 = 0 -seems not to be a property associated with any popular link function other than complementary log-log or close but uncommon relatives such as g(p) = 1 1−p . Furthermore they may all be used with over-dispersed base pmfs π y (λ, φ).
We now address issues in inference.

Inference
We focus on likelihood inference, and illustrate with the analysis of the Trajan data in Section 2, which provides some new insight. We do not dwell on algorithms or second-order issues; but it may be that our perspective will stimulate ideas on the critical evaluation of explicit ZI modelling. The theory thus focusses on the score equations, being differentials of log L = i log(π y i ) = i y i (µ(λ, γ), γ), wrt the parameters γ, λ i or functions of these such asπ 0 , µ i , and/or the coefficients β underlying the λ terms. This extends of course to cases of π y where the φ parameter is non-null; but we do not dwell on this. Recall that, for the Trajan model, there are nine parameters: one common γ term and eight values of λ (k) , or equivalently of µ (k) . Particular interest lies in useful decompositions of the likelihood, as we now discuss.
The full distributionπ y (λ, γ) can be written in various forms some of which lead to different decompositions of log L. Although all forms apply to all ZI types, these are most simply interpreted for types A and D. Equation (4) below decomposes into components of the likelihood focussing on zero and non-zero counts. It sheds light on the fitted functions in Figure 2 for all types, and on the (λ, γ) parameters in Type A. Equation (5) sheds light on Type D, for this is seen often to lie within the Exponential Family, with consequent other decompositions.
We begin by noting alternative versions of (2) which are particularly insightful where π + y = πy 1−π 0 is the zero-truncated form of π y and e γ denotes the odds ratio for (π 0 , π 0 ). From (4), with an obvious notation logπ y (λ, γ) = 0 y (π 0 (π 0 , γ)) + + y (λ), the log-likelihood decomposes as: This result applies to any ZI type, including A -D above. Importantly the second term is independent of γ. The first term depends on I{y i = 0} and supplies the only information on γ throughπ 0 ; but recall that, apart from Type A,π 0 is also a function of λ through π 0 , and thus the first term generally supplies some information on λ. The usual calculus leads to score functions whose solution is the MLE for (λ i , γ) (and for any other parameters, φ). In general regression, the λ i are dictated by coefficients, these and γ being the true target of the inference.
In the case of Type A, of course,π 0i is independent ofπ 0i , this being e −λ i for the Poisson base. In the case of iid observations, as within a single cell of the Trajan design,γ is such that π 0 (γ) = p 0 the observed proportion of zeros in the data. Recall that this remark applies to all two parameter (λ, γ) types of ZI in the iid case. However, in the Trajan case with common γ across all cells, results will differ with ZI type, as we now illustrate.
It is useful first to consider in more detail the estimation of λ under Type A, first for the iid case and then for the Trajan design. For Type A, from (4), the estimation of λ focusses solely on + y (λ), and thus on the zero-truncated distribution π + y (λ). The expected value of the zero-truncated distribution is (1 − π 0 ) −1 λ for which the MLE isȳ + , this being the case for all pmfs in the Exponential Family, such as the Poisson and NB-quad, and truncated versions Again, since all two parameter ZI types are equivalent in the iid case, this result applies also to them, as is well known. We revisit this for Type D below, where this demonstration is very simple.
But in the Trajan model, there are eight cells with λ = λ (k) in each but sharing a common ZI parameter γ, defining for all ZI types, a commonπ 0 estimated by the overall p 0 . The term + y decomposes into sums over each of these cells, each leading, separately, to estimates of each λ (k) . In the case of Type A, these are zero-truncated means and thus lead to estimateŝ It is this that leads, in Figure 1, to the fact that ZI Type A leads to valuesμ (k) other thanȳ (k) , unlike the simple Poisson. Similar but technically more difficult arguments apply to other ZI types. The exception is Type D, which does lead toμ (k) =ȳ (k) , as we discuss below.
The interpretation of Figure 2 is assisted by the following observation. Optimisation overall at (λ i ,γ) can be seen as optimisation for γ (atγ) given values ofλ i , and for the λ i (atλ i ) for givenγ. (We make no statement, here, as to the efficiency of an algorithm based on this; and any such algorithm should be distinguished from the EM algorithm widely used in ZI Type C, which uses binary regression of latent binary variables.) In particular, for givenλ i , the maximum likelihood estimation of γ focuses exclusively on 0 y (π 0 (π 0 , γ)). That is to say, at the MLE,γ is that parameter which best fits the functionπ 0i (π 0i , γ) to observations I{y i = 0} modelled as E[I{Y i = 0}], which we denote asπ 0i . Thus,γ may be considered (at the MLE) as the fitted parameter in the binary regression of the observables I{y i = 0} on g(π 0i ) using a very particular link function for E[I{Y i = 0}] with one unknown coefficient, that is γ + g(π 0i ).
A trivial example of this is that the MLEγ = p 0 is, for the Trajan analysis, a weighted average of the eight separate values p (k) 0 . There being but one parameter in this function, the ultimate best fit is likely to be dominated by its large scale properties, being here the behaviour of the function at both large and small values of π 0 ; for Type A the functions in Figure 2 are thus relatively inflexible.
For these data, Figure 2 make it clear that ZI types A and C are evidently similar, and similarly inappropriate. In contrast, the ZI types B and D are clearly more appropriate. More subtly, the data design -resulting in large means and low probabilities of zero -are challenging for the functions offered by types A and C, which offer little flexibility in this part of the unit square, unlike types B and D. Other data sets may of course be challenging for types B and D.
The second form forπ 0 (Equation (5)) is of particular importance for Type D when the base distribution π y (λ) is, like the Poisson, within the Exponential Family. For the Trajan analysis, it highlights an important distinction between the models corresponding to (here similar) types B and D. In particular, it explains why, in Figure 1, pmfs for the Poisson, NB-quad and ZI Type D with Poisson base both lead to the same estimators of the cell expected values. ZI Type D may thus be a useful addition to the ZI types.
The key point in Equation (5) is that, for Type D, the odds ratio forπ 0 and π 0 is taken to be constant, denoted e γ . As a consequence, the parameter γ can have a particular significance, when π y (λ) is in the (one parameter) Exponential Family and can thus be written as log(π y ) = yη(λ) − A(λ) + c(y); for the Poisson the natural parameter is η = log(λ). Firstly, defining c(y) such that c(0) = 0, it is clear that the cumulant function A(λ) = − log(π 0 ), being λ for the Poisson; and further, from the properties of the Exponential Family, we have being λ in the Poisson case. From (5) we thus have log(π y ) = yη(λ) + I{y = 0}γ −Ã(λ, γ) + c(y) whereÃ(λ, γ) = γ − log(π 0 ), which plays the same role as A(λ) for π 0 . The implication is that π y is in the two-parameter Exponential Family, inheriting this property from the π y . The natural parameters are (γ, η).
Further, the log-likelihood in Equation (6) includes a decomposition into the sum of two terms, being 0 y (γ) and + y (λ). And indeed, if π(y) is in the m-parameter family, thenπ y is in the (m + 1)-parameter family. It follows immediately that sufficient statistics for an iid sample y i ; i = 1, . . . n fromπ y are ( y i , I{y i = 0}), being equivalent to (ȳ, p 0 ) and having expected values already known here to be (µ,π 0 ), recalling that µ = ρλ. This result is also available in general from differentiatingÃ(λ, γ). And, following the observation above on the equivalence, in the iid case, of all two parameter ZI types,ȳ and p 0 are unbiased MLE estimators of µ and π 0 , as already shown above, via the more cumbersome Type A arguments. This is the reason why ZI Type D (with Poisson base) shares this property with the usual Poisson and NB-quad models. And further, we can immediately assert that ZI Type D with NB-quad as base will also share this property.
But for the Trajan model, the decomposition in Equation (6) extends to nine terms. Formally, with independent Poisson pmfs modelling each cell in the Trajan data, we can write this as an eight-parameter Exponential Family distribution. The log-likelihood for ZI Type D thus decomposes into the sum of nine terms, I{y i =0} (γ) and eight terms y i (λ (k) ). And thus the MLEs areγ = p 0 (as above) andμ (k) =ȳ (k) . This differs (in respect ofμ (k) ) from Type A above, and in fact coincides with the neutral (no ZI) Poisson mean for which E[Y ] = λ = µ for each cell, as shown in Figure 1. The same reasoning applies to (neutral) NB-quad estimates and indeed to ZI Type D with NB-quad base (not shown here for brevity).

Zero-Inflation via Over-Dispersion
As we have already noted, over-dispersion models, such as the negative binomial, have often been used for the analysis of count data with excess zeros. Indeed, some authors have suggested that in practice such OD models may be sufficient with no need to explicitly model the excess zeros. In Section 2 we considered the use of particular negative binomial models and here we provide a little more detail. Of course the negative binomial family is just one form of OD extension of the Poisson distribution, albeit the most widely studied and used.
General mixing of a Poisson with any distribution (also often referred to as compounding) leads to OD count distributions. Other generalisations include weighted Poisson models (see Del Castillo & Pérez-Casany, 1998) extending the original idea of size-weighting of counts in Rao (1965). This general family of models includes the currently popular COM-Poisson distribution (Shmueli et al., 2005;Sellers & Shmueli, 2010;Ribeiro Jr et al., 2020) and unlike the mixture-based over-dispersion models this (and other members of the family) can also accommodate underdispersion. Other classes of extended Poisson models that can handle both over and underdispersion, include the Generalised Poisson (Consul & Jain, 1973), the Gamma-count (Zeviani et al., 2014) based on using gamma-distributed inter-arrival times rather than the exponential times as in a Poisson process, and the discrete Weibull distribution (Luyts et al., 2019) as an example of the general approach of discretisation of a continuous random variable. While OD versions of these can be used to model excess zeros, the combination of these models with explicit ZI also provides the possibility of modelling excess zeros where the non-zero counts are under-dispersed.
In many OD models not only do we have Var(Y ) > E[Y ], but also, often as a consequence, π 0 (µ, φ) > π P 0 . Simply stated, in general, over-dispersion puts more weight in the tails of a distribution, but as a count random variable Y is constrained below by zero, this extra weight in the lower tail accumulates on low values of Y and, in particular, on Y = 0. As such these models can provide an alternative approach to the apparent problem of excess zeros in data. As this is an indirect consequence of the OD model we call this implicit ZI. Puig & Valero (2006) provide a theoretical treatment of the circumstances under which OD induces ZI, characterising the conditions under which this is necessary. They use it to also contrast several count distributions; one of these is the Negative Binomial (NB), which here we use as an exemplar of OD models.
Interestingly, as we develop below, the details depend not only on the OD distribution, but also on its parameterisation. In Figure 2, we show the implied ZI behaviour of these models for comparison with that of the explicit ZI models.
These two versions of the NB are included in Figure 2, where, as discussed in Section 3, the extended model zero probabilitiesπ 0 are plotted against the Poisson base π 0 , augmented by the observed and model zero probabilities for the Trajan data. Here we have expressedπ 0 (µ, φ) as a function of π P 0 via µ = − log π P 0 . Figure 3 also illustrates this for the two different parameterisations of the Negative Binomial distribution. In constructing this theoretical plot we have fixed the dispersion parameter φ so that the curves pass through the same common (π 0 ,π 0 ) point (0.2, 0.4). For both, the induced ZI, characterised byπ 0 (π P 0 , φ) is such thatπ 0 → 0 as π P 0 → 0 or equivalently as µ → ∞, or equivalently as π P 0 → 0, as with the ZI of types B and D. More specifically, for NB-lin, from log(π 0 (π P 0 , φ)) = φ −1 log(1 + φ) log(π P 0 ), if we set e γ = φ −1 log(1 + φ), the ZI induced by NB-lin is exactly as Type B for all π P 0 and thus for all µ. Also, except for large µ (small π P 0 ), this induced ZI bears some resemblance to Type C, the classic mixture model of ZI. Further, we see that, apart from the behaviour for very small π P 0 (ie for π P 0 above the 'elbow') the ZI function π 0 (π P 0 , φ) for NB-quad is even more similar to that for Type C. This is essentially because here this function has unit slope both at the elbow and, unlike NB-lin, as π P 0 → 1 (equivalently µ → 0).
But recall that, despite ZI Type B and NB-lin having parameters γ and φ such that they have the same probabilities of zero for all µ, they are quite different distributions and have different variance functions. Although there are similarities in behaviour between the NB-quad and Type C models their probability structures are different and, for example,π C y can be bi-modal. The point is that the over-dispersion in the Type C model is centred solely on the zero probability, whereas in the NB-quad the extra dispersion is smoothly spread across the whole range of yvalues. The central challenge is that with only two parameters available, correspondence cannot be obtained on three important aspects: mean, variance and the probability of zero.

Simple Inference for the Negative Binomial
Inference is well established for the NB model. But there is a subtlety, even in the iid case.
Moments based estimators of (µ, φ) will necessarily match the sample mean and variance; and thus not match p 0 , the observed frequency of zeros, especially if this is large. But other, more widely used, estimators will, especially for large p 0 , result in aφ that (approximately) matches the fitted π 0 (μ,φ) by overestimating the variance. The theory is more straightforward for NBquad.
NB-quad is a member of the Exponential Dispersion family. Here, in the iid case, the sample mean is the MLE for µ; while the MLE for the dispersion parameter φ requires an iterative solution of the score equation. However, the joint ML estimators do have the nice property of being asymptotically independent (see Lawless, 1987). Other estimators used for φ include a moment estimator (this can be obtained for more general models by equating the generalized Pearson chi-square statistic to the degrees of freedom). Historically, a zero-frequency based estimator has also been used and, as for the explicit ZI models, it results in the zero-frequency being fitted perfectly; however, for data generated by an explicit ZI Poisson process, it does this by overestimating the variance with a larger fitted φ value. Interestingly in this ZI setting the MLE lies between the moment and the zero-frequency estimators, giving both a fitted variance Var π 0 (μ,φ) that is larger than the observed sample variance, and a fitted π 0 (μ,φ) that is closer to the observed p 0 than might be expected, at the price of an extended fitted upper tail. However, in applications to data with excess zeros, there is typically also some OD in the non-zero counts and so this inflated estimated value of φ may simultaneously capture both excess zeros and additional dispersion and lead to an adequate and parsimonious fit.
This behaviour perhaps explains the findings in Warton (2005), who provides comparisons between different implicit and explicit ZI models fitted to multivariate abundance data from 20 datasets. His results show that the NB-quad model is superior to ZIP and ZINB in terms of AIC, when looking at an average over 1672 count variables. He comments that these abundances do not have extra zeros when compared to the NB-quad distribution, and are likely to have arisen from NB-quad distributions with small means.
The NB-lin is not in the Exponential Family, even for a known value of φ, and is inferentially less convenient. Of course, in the iid case the NB-lin is simply a reparameterisation of the NBquad, but for fixed values of the dispersion parameters they behave differently as µ varies, see Table 1 and Figure 3, and these differences potentially become apparent in the regression model setting.

R Packages for Fitting ZI Models
In this section we describe the use and background of various commonly-used R packages for modelling zero inflation. There are many packages listed on the R package web page which purport to perform zero-inflated regression, but many of these apply to specific data types or constraints (e.g. hidden Markov models, monotonic zero inflation, compositional data), apply to a particular application area, or deal in continuous rather than count data (the focus of this paper). Instead we focus on 4 main packages: zic, pscl, VGAM, and gamlss. A further popular package is COZIGAM but this package has been archived from CRAN and has not been updated since 2012.
The purpose of this section is to showcase the wide array of possibilities currently available in R for zero inflated/adjusted modelling and provide brief examples of their use on our example data to enable those new to the field to pick up on the salient features. Table 2 shows a summary of these packages and an overview of their features and differences.
All of the aforementioned packages are Frequentist in their inferential approach with the exception of zic. A number of packages exist which potentially allow for Bayesian zero-inflated and zero-adjusted models to be fitted, such as JAGS (Plummer, 2003), Stan (Stan Development Team, 2014), INLA (Rue et al., 2009), andNimble (de Valpine et al., 2017). However, these require considerable extra coding experience (and often many more lines of code) so we do not review them here. A particular key issue is that many of these packages support only a small set of probability distributions, so that extra distributions have to be added by hand (or hack) and often lack vectorisation speed-ups. For those wishing to persist in learning one or more of these packages (which we would recommend), we suggest first consulting the manual to determine whether the chosen custom probability distributions are included by default.
The zic package (Jochmann, 2013)  R package pscl (Zeileis et al., 2008) contains a very wide array of both hurdle and zero-inflated models. A key feature is the possibility of selecting different distribution types for π y and π 0 . Three different distributions are permitted for the counts (Poisson, negative binomial, and geometric), and three types of zero-inflation (binomial, Poisson, negative binomial, and geometric). These latter distributions are censored at value 1 to produce the required hurdle effect.
For standard zero inflation models only binomial (i.e. Bernoulli) inflation is permitted. Link functions can also be changed (for both hurdle and zero-inflated) though again some restrictions apply.
The VGAM package interface (Yee, 2015) is very similar to that of the standard R base lm and glm functions whilst adding in multivariate components (via constraint matrices) which do not apply to the zero inflated models implemented. Exactly as with glm, the vglm and vgam functions take a formula and family argument, the latter of which has a variety of zero-inflated options, including zero inflated Poisson, negative binomial, and geometric, and similarly zero-altered (two-stage/hurdle) versions. We found the output of these models to be somewhat confusing, since the tabular output displays multiple intercepts (corresponding seemingly to the different linear predictors).
The gamlss package contains the richest set of zero-inflated and zero-altered (two-stage/hurdle) models, including a large number of zero-inflated continuous (and bounded) likelihood distributions. The count distributions supported include binomial, beta-binomial, negative binomial, negative beta-binomial, logarithmic, Poisson, and Zipf. The gamlss package in particular allows for the modelling of multiple different parameters in a distribution, each with their own parametric relationship, but none of the examples we found in the package used this feature.
Finally, we provide unified code to fit ZI types A, B, C and D (and reproduce all results obtained from analysing the Trajan data) at www.github.com/andrewcparnell/ZI_ review.
In this paper, we have only examined the basic underpinnings of regression in the presence of excess zeros in univariate count data. Our first contribution to this is a novel perspective on the modelling of excess zeros in count data regression. This is primarily in the presentation of an approach which we have referred as explicit ZI. The modelling issue, as presented here, is simply the choice of link function, and the consequent estimation of the sole parameter. Four such simple options are presented; others and extensions to two parameter variations are possible.
The two approaches discussed -explicit ZI and implicit ZI via the use of distributions such as the NB -may of course be combined, and all parameters may be modelled via covariates.
It could be argued that the real difficulty is the embarrassment of choice. The estimation of the parameters is no longer a challenge (for univariate counts), given modern computing algorithms; nor is the identification of the best, given a metric such as the AIC, a metric which may not be natural for some users. The real challenge is in fact the identification of the most useful model given the vagaries of that term, and the ubiquity of potential outliers within all data sets.
Our approach to the current paradigm of explicit ZI models contrasts with that of data generating mechanisms that involve latent variables. Of course, as pointed out by Colin & Pravin (2013), p.147: "The latent class interpretation is not essential ... As such the approach is an alternative to non-parametric estimation." However, from the almost invariable discussion in papers in the applied literature, it does seem that the user community feels obliged to explain the mixture interpretation. This can be useful, of course; but sometimes it can involve shoehorning. Nevertheless mixtures defined by latent variables can be a fruitful theoretical avenue for defining models for use with multivariate counts. Indeed this was the purpose of Salter- The practical benefit of the new perspective in univariate regression may not in fact be new models. The non-parametric perspective may yield new diagnostics. It may even help to establish a-priori evidence of excess zeros. Current practice usually involves marginalisation over all covariates. An empirical approach would involve, as a first step, a comparison between observed I{y i = 0} and the fitted probabilities of zero under a default 'base' model, genericallyπ 0i = π 0 μ i ,φ i . This comparison could simply regress the binary I{y i = 0} -nonparametrically, using splines, for example -on functions including log (π 0i ) and log (1 −π 0i ), yielding valuesπ 0i . Any departure from the unit lineπ 0i = π 0i would constitute a priori evidence of excess zeros with respect to the fitted base model.
Furthermore such a plot might serve, post-hoc, as a diagnostic for a fitted ZI model, by regressing I{y i = 0} on functions of fittedπ 0i . We remark that the current literature on count data regression in general -and on excess zeros in particular -seems sadly lacking in criticism that most users would deem to be constructive. Portmanteau statistics based on Information Criteria do not always have such a constructive interpretation. Model details can be dominated by extreme values; and this is particularly so for models that admit very large values for the variance. But IC statistics do not readily identify such data points.
In circumstances where it is deemed to be important to choose amongst alternative models for the excess zeros, such regressions highlight the importance of the design. For without data corresponding to very large µ i (and thus small π 0i ) it will be impossible to distinguish ZI models of Type B, C or D; and difficult to distinguish from the OD approach. In practice, of course, this will be further complicated if -as is common -the parameters γ and/or φ are themselves modelled via covariates. But the fact that this is common may reflect poor diagnostics (or indeed over-enthusiasm).
Our second contribution is to make more explicit the parallels between the OD and ZI approaches to modelling excess zeros. But we have not resolved the choice between the explicit (ZI) and implicit (OD) approaches, when the base is the Poisson. On the contrary, we see that, from the point of view of the excess zeros, one ZI model -Type B -behaves exactly like the NB-lin; and NB-quad is not unlike the (dominant) ZI Type C. The difference between the latter two is only apparent for very large µ, and only carefully designed data will differentiate them.
They do differ as regard the mean-variance relationship, but only in detail for both exhibit a quadratic relationship; and this too will only be apparent at very large µ. In addition, for iid data, technical details of the usual inference procedures for NB-quad do little to help differentiate the very different models. Further, for the more typical case of regression, these details can only become more challenging. What is clear is that any attempt to discriminate between explicit and implicit ZI models will require a rich and varied dataset and diagnostics focussing jointly on both fitted zero-probability and upper tail behaviour. This discussion suggests that the burgeoning literature on other over-dispersed generalisations of the Poisson -and on zeroinflating them -may itself be premature. As a final contribution, Table 1 offers an alternative to the confusing nomenclature that has developed.    Table 1. For the Negative Binomial models φ NB-lin = 1.82 and φ NB-quad = 1.13, again with reference to the parameterisation in Table 1. Table 1: A typology of zero inflation and over-dispersion. Our four types and their common names are shown alongside their ZI behaviour with respect to the Poisson base.