Bias in parametric estimation: reduction and useful side‐effects
Conflict of interest: The author has declared no conflicts of interest for this article.
Abstract
The bias of an estimator is defined as the difference of its expected value from the parameter to be estimated, where the expectation is with respect to the model. Loosely speaking, small bias reflects the desire that if an experiment is repeated indefinitely then the average of all the resultant estimates will be close to the parameter value that is estimated. The current article is a review of the still‐expanding repository of methods that have been developed to reduce bias in the estimation of parametric models. The review provides a unifying framework where all those methods are seen as attempts to approximate the solution of a simple estimating equation. Of particular focus is the maximum likelihood estimator, which despite being asymptotically unbiased under the usual regularity conditions, has finite‐sample bias that can result in significant loss of performance of standard inferential procedures. An informal comparison of the methods is made revealing some useful practical side‐effects in the estimation of popular models in practice including: (1) shrinkage of the estimators in binomial and multinomial regression models that guarantees finiteness even in cases of data separation where the maximum likelihood estimator is infinite and (2) inferential benefits for models that require the estimation of dispersion or precision parameters.
This article is categorized under:
- Data: Types and Structure > Categorical Data
- Algorithms and Computational Methods > Maximum Likelihood Methods
- Statistical and Graphical Methods of Data Analysis > Bootstrap and Resampling
IMPACT OF BIAS IN ESTIMATION
By its definition, bias necessarily depends on how the model is written in terms of its parameters, and this dependence means that unbiasedness is not, on its own, a strong statistical principle for evaluating the performance of estimators; e.g., unbiasedness of the familiar sample variance S2 as an estimator of σ2 does not deliver an unbiased estimator of σ itself. Despite this fact, an extensive amount of literature has focused on unbiased estimators (estimators with zero bias) as the basis of refined statistical procedures (e.g., finding minimum variance unbiased estimators). In such work unbiasedness plays the dual role of a condition that (1) allows the restriction of the class of possible estimators in order to obtain something useful (like minimum variance amongst unbiased estimators), and (2) ensures that estimation is performed in an impartial way, ruling out estimators that would favour one or more parameter values at the cost of neglecting other possible values. Lehmann and Casella1 is a thorough review of statistical methods that are optimal once attention is restricted to unbiased estimators.
Another stream of literature has focused on reducing the bias of estimators, as a means of alleviating the sometimes considerable problems that bias can cause in inference. This literature, despite dating back to the early years of statistical science, is becoming increasingly relevant as the complexity of models used in practice increases and pushes traditional estimation methods to their theoretical limits.
The current review focuses on the latter literature, explaining the links between the available methods for bias reduction and illustrating their relative merits and disadvantages through the analysis of real data sets.
The following case study demonstrates the direct consequences that the bias in the estimation of a single nuisance (or incidental) parameter can have in inference, even if all parameters of interest are estimated with negligible bias.
Gasoline Yield Data
To demonstrate how bias can in some cases severely affect estimation and inference we follow the gasoline yield data example in Kosmidis and Firth2 and Grün et al.3 The gasoline yield data4 consists of n = 32 observations on the proportion of crude oil converted to gasoline after distillation and fractionation on 10 distinct experimental settings for the triplet (1) temperature in degrees Fahrenheit at which 10% of crude oil has vaporized, (2) crude oil gravity, and (3) vapor pressure of crude oil. The temperature at which all gasoline has vaporized is also recorded in degrees Fahrenheit for each one of the 32 observations.
As in Refs 2 and 3, the proportion of crude oil converted to gasoline for the ith observation is assumed to be the realization of a Beta‐distributed random variable Yi with mean μi and precision parameter φ, where

log{μi/(1 − μi)} = α + γ1si1 + ⋯ + γ9si9 + δti .   (1)

In the above expression, si1, …, si9 are the values of nine dummy covariates which represent the 10 distinct experimental settings in the data set, and ti is the temperature in degrees Fahrenheit at which all gasoline has vaporized for the ith observation (i = 1, …, n).
The parameters θ = (α,γ1, …,γ9,δ,φ) are estimated using maximum likelihood and the estimated standard errors for the estimates are calculated using the square roots of the diagonal elements of the inverse of the Fisher information matrix for model 1. The parameter φ is considered here to be a nuisance (or incidental) parameter which is only estimated to complete the specification of the Beta regression model.
Table 1 shows the parameter estimates with the corresponding estimated standard errors and the 95% Wald‐type confidence intervals. One immediate observation from the table of coefficients is the very large estimate for the precision parameter φ. If this is merely the effect of upward bias then this bias will result in underestimation of the standard errors because for such a model the entries of the Fisher information matrix corresponding to the regression parameters α, γ1, …, γ9, δ are quantities of the form ‘φ times a function of θ’ (see Refs 2, 3 for expressions on the Fisher information). Hence, if the estimation of φ is prone to upward bias, then this can lead to confidence intervals that are shorter than expected at any specified nominal level and/or anti‐conservative hypothesis testing procedures, which in turn result in spuriously strong conclusions.
| Parameter | Estimate | Estimated Standard Error | 95% CI (Lower) | 95% CI (Upper) |
|---|---|---|---|---|
| α | −6.160 | 0.182 | −6.517 | −5.802 |
| γ1 | 1.728 | 0.101 | 1.529 | 1.926 |
| γ2 | 1.323 | 0.118 | 1.092 | 1.554 |
| γ3 | 1.572 | 0.116 | 1.345 | 1.800 |
| γ4 | 1.060 | 0.102 | 0.859 | 1.260 |
| γ5 | 1.134 | 0.104 | 0.931 | 1.337 |
| γ6 | 1.040 | 0.106 | 0.832 | 1.248 |
| γ7 | 0.544 | 0.109 | 0.330 | 0.758 |
| γ8 | 0.496 | 0.109 | 0.282 | 0.709 |
| γ9 | 0.386 | 0.119 | 0.153 | 0.618 |
| δ | 0.011 | 0.000 | 0.010 | 0.012 |
| φ | 440.278 | 110.026 | 224.632 | 655.925 |
To check whether this is indeed the case, a small simulation study has been designed where 50,000 samples are simulated from the maximum likelihood fit shown in Table 1. Maximum likelihood is used to fit model 1 on each simulated sample, and the bias of the maximum likelihood estimator is estimated using the resultant parameter estimates. The estimated bias for α is 0.010, while the estimated biases for γ1, …, γ9, δ are all less than 0.005 in absolute value, providing indications that the bias of the regression parameter estimators is of no consequence. Nevertheless, the estimated bias for φ is 299.779, which indicates a strong upward bias in the estimation of φ. To check how the upward bias in the precision parameter can affect the usual Wald‐type inferences, we estimate the coverage probability (the probability that the confidence interval contains the true parameter value) of the individual Wald‐type confidence intervals at levels 90, 95, and 99%. Table 2 shows the results. It is clear that the Wald‐type confidence intervals systematically undercover the true parameter value across parameters.
| Parameter | Nominal 90% | Nominal 95% | Nominal 99% |
|---|---|---|---|
| α | 80.2 | 87.2 | 94.9 |
| γ1 | 80.3 | 87.3 | 95.2 |
| γ2 | 80.2 | 87.1 | 95.1 |
| γ3 | 80.2 | 87.1 | 94.8 |
| γ4 | 80.2 | 87.5 | 95.2 |
| γ5 | 80.5 | 87.5 | 95.2 |
| γ6 | 80.4 | 87.4 | 95.1 |
| γ7 | 80.6 | 87.4 | 95.1 |
| γ8 | 80.2 | 87.3 | 95.1 |
| γ9 | 80.5 | 87.3 | 95.0 |
| φ | 79.9 | 87.1 | 94.9 |
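The kind of simulation described above is straightforward to carry out for simpler models. The sketch below is illustrative only: it estimates the bias of the maximum likelihood estimator of the precision φ in an intercept‐only Beta model rather than in model 1, and the 'true' values and function names are hypothetical.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)

def fit_beta_ml(y):
    """ML fit of a Beta model parameterized by mean mu and precision phi,
    i.e. shape parameters (mu*phi, (1 - mu)*phi)."""
    def negloglik(par):
        mu = 1 / (1 + np.exp(-par[0]))      # logit link keeps mu in (0, 1)
        phi = np.exp(par[1])                # log link keeps phi positive
        return -np.sum(stats.beta.logpdf(y, mu * phi, (1 - mu) * phi))
    res = optimize.minimize(negloglik, x0=[0.0, 1.0], method="Nelder-Mead")
    return 1 / (1 + np.exp(-res.x[0])), np.exp(res.x[1])

# hypothetical 'true' values loosely mimicking the gasoline yield fit
mu0, phi0, n, n_sim = 0.2, 200.0, 32, 1000

phi_hats = np.empty(n_sim)
for r in range(n_sim):
    y = rng.beta(mu0 * phi0, (1 - mu0) * phi0, size=n)   # simulate from the 'fitted' model
    phi_hats[r] = fit_beta_ml(y)[1]                       # refit and keep the phi estimate

print("estimated bias of the ML estimator of phi:", phi_hats.mean() - phi0)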
Such behavior is observed even when the precision parameter is linked to covariates through a link function, such as the logarithm (see, e.g., Ref 3). More generally, similar consequences of bias in inference are present in all exponential family models that involve the estimation of dispersion (or precision) parameters.
CONSISTENCY, BIAS, AND VARIANCE
Suppose that interest is in the estimation of a p‐vector of parameters θ from data y(n), assumed to be realizations of a random quantity Y(n) distributed according to a parametric distribution Mθ, θ = (θ1, …, θp)T ∈ Θ ⊂ ℜp. The superscript n here is used as an indication of the information in the data and is usually the sample size, in the sense that the realization of Y(n) is y(n) = (y1, …, yn)T. An estimator of θ is a function t(Y(n)) of the random quantity Y(n), and in the presence of data the estimate would be t(y(n)).

An estimator t(Y(n)) is consistent if it converges in probability to the unknown parameter θ as n → ∞. Consistency is usually an essential requirement for a good estimator because, given that the family of distributions Mθ is large enough, it ensures that as n increases the distribution of t(Y(n)) becomes concentrated around the parameter θ, essentially providing a practical reassurance that for very large n the estimator recovers θ.

The bias of an estimator t(Y(n)) is defined as the difference of its expected value from the parameter to be estimated,

B(t(Y(n))) = Eθ{t(Y(n))} − θ ,

where the expectation is taken with respect to the model Mθ. Loosely speaking, small bias reflects the desire that if an experiment that results in data y(n) is repeated indefinitely, then the long‐run average of all the resultant estimates will not be far from θ. Small bias is a much weaker and hence less useful requirement than consistency. Indeed, one may get an inconsistent estimator with zero bias or a consistent estimator that is biased. For example, if Y(n) = (Y1, …, Yn)T, with Y1, …, Yn mutually independent random variables with Yi ∼ N(μ, σ2), then t(Y(n)) = Y1 is an unbiased but inconsistent estimator of μ. On the other hand, t(Y(n)) = (Y1 + ⋯ + Yn)/n + 1/n is a consistent estimator for μ but has bias B(t(Y(n))) = 1/n. So, bias becomes relevant only if it is accompanied by guarantees of consistency, or more generally when the variability of t(Y(n)) around θ is small (see Ref 5, § 8.1 for a discussion along these lines).

A classical result that links bias and variance is the Cramér–Rao lower bound: under the regularity conditions discussed below, the variance–covariance matrix of any unbiased estimator t(Y(n)) of θ satisfies

Cov{t(Y(n))} ≥ {F(θ)}−1 ,   (2)

where F(θ) is the Fisher information matrix about θ and the inequality is understood in the sense that the difference of the two sides is non‐negative definite.
Maximum Likelihood Estimation

The maximum likelihood estimator θ̂ of θ is the value of θ which maximizes the log‐likelihood function l(θ; y(n)) = log f(y(n); θ). Given that the log‐likelihood function is sufficiently smooth in θ, θ̂ can be obtained as the solution of the score equations

S(θ) = ∂l(θ; y(n))/∂θ = 0 ,

provided that the negative of the matrix of second derivatives of the log‐likelihood is positive definite when evaluated at θ̂. An appealing property of the maximum likelihood estimator is its invariance under one‐to‐one reparameterizations of the model. If θ′ = g(θ) for some one‐to‐one function g : ℜp → ℜp, then the maximum likelihood estimator of θ′ is simply g(θ̂). This result states that when obtaining the maximum likelihood estimator of θ, we automatically obtain the maximum likelihood estimator of g(θ) for any function g that is one‐to‐one, simply by calculating g(θ̂), without the need of maximizing the likelihood over g(θ).
It can also be shown that the maximum likelihood estimator θ̂ has certain optimality properties if the 'usual regularity conditions' are satisfied. Informally, the usual regularity conditions imply, among others, that (1) Mθ is identifiable (i.e., f(y(n); θ) ≠ f(y(n); θ′) for any pair (θ, θ′) such that θ ≠ θ′, apart from sets of probability zero), (2) p is finite, (3) the parameter space Θ does not depend on the sample space (which implies that p does not depend on n), and (4) there exists a sufficient number of log‐likelihood derivatives and expectations of those under Mθ. A more technical account of those conditions can be found in McCullagh,6 § 7.1, 7.2, or equivalently in Cox and Hinkley,5 § 9.1.
If these conditions are satisfied, then θ̂ is consistent and has bias of asymptotic order O(n−1), which means that its bias vanishes as n → ∞. Moreover, the maximum likelihood estimator has the property that as n → ∞ its distribution converges to a multivariate Normal distribution with expectation θ and variance‐covariance matrix {F(θ)}−1. Hence, the variance of the asymptotic distribution of the maximum likelihood estimator is exactly the Cramér–Rao lower bound {F(θ)}−1 given in Eq. 2.
Reducing Bias
All of the above shows that under the usual regularity conditions and as n → ∞, the maximum likelihood estimator θ̂ has optimal properties, a fact that makes it a default choice in applications. However, for finite n these properties may deteriorate, in some cases causing severe problems in inference. Such an effect has been seen in the gasoline yield data case study, where the bias of θ̂ affects the performance of tests and the construction of confidence intervals based on the asymptotic normality of θ̂.
Before reviewing the basic methods for reducing bias, it is necessary to emphasize again that bias necessarily depends on the parameterization of the model; if the bias of an estimator θ̂ of θ is reduced, resulting in a less biased estimator θ̃, then it is not necessary that the same will happen for the estimator g(θ̃) of g(θ). In fact, the bias of g(θ̃) as an estimator of g(θ) may be considerably inflated. Hence, correction of the bias of the maximum likelihood estimator comes at the cost of destroying its invariance properties under reparameterization. Therefore, all the methods for bias reduction that are described in the current review should be viewed with scepticism if invariance is a necessary requirement for the analysis. On the other hand, if the parameterization is fixed by the problem or the practitioner, one can do much better in terms of estimation and inference by reducing the bias. Furthermore, as will be seen later, for some models the reduction of bias produces useful side‐effects which in many cases have motivated its routine use in applications. A thorough discussion of considerations on bias and variance, and examples of exactly unbiased estimators that are useless or irrelevant, can be found in Cox and Hinkley5 § 8.2 and Lehmann and Casella1 § 1.1.
BIAS REDUCTION—A SIMPLE RECIPE WITH MANY DIFFERENT IMPLEMENTATIONS
Given an estimator θ̂ = t(Y(n)) of θ taking values in Θ ⊂ ℜp, consider the solution of the equation

θ̂ − B(θ̃) = θ̃ ,   (3)

with respect to a new estimator θ̃, where B(θ) = Eθ(θ̂) − θ is the bias function of θ̂. Equation 3 is a moment‐matching equation which links the properties of the estimation method to the properties of Mθ through θ̂ and B(θ), respectively. If both the function B(θ) and θ were known, then it is straightforward to show that θ̃ = θ̂ − B(θ) has zero bias and hence smaller mean squared error than θ̂. If, in addition, the initial estimator θ̂ has vanishing variance‐covariance matrix as n → ∞, then an application of Chebyshev's inequality shows that θ̃ is consistent, even if θ̂ is not. Of course, if θ is known then there is no reason for estimation and, furthermore, the function B(θ) usually cannot be written in closed form. The importance of Eq. 3 is that, despite its limited practical value, all known methods to reduce bias can be usefully thought of as attempts to approximate its solution. These methods can be distinguished into explicit and implicit.
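As a simple illustration of Eq. 3, suppose that θ = σ2 in the Normal example of the previous section and that θ̂ is the maximum likelihood estimator of σ2 (the average of the squared deviations of the observations from their mean), which has bias B(σ2) = −σ2/n. Equation 3 then reads θ̂ + σ2/n = σ2, whose solution is θ̃ = nθ̂/(n − 1), that is, exactly the unbiased sample variance S2 mentioned earlier.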
EXPLICIT METHODS
Explicit methods rely on a one‐step procedure where B(θ) is estimated by some estimate B̂ and then subtracted from θ̂, resulting in the new estimator θ̃ = θ̂ − B̂.
The most popular explicit methods for reducing bias are the jackknife, the bootstrap, and methods which use approximations of the bias function through asymptotic expansions of B(θ).
Jackknife
Suppose that the bias of θ̂ admits the expansion

B(θ) = b(θ)/n + b2(θ)/n2 + O(n−3) .   (4)

Then the estimator θ̂(−j), which results from leaving the jth random variable out of the original set of n variables, has the same bias expansion as in Eq. 4 but with n replaced with n − 1. In light of this observation, Quenouille7 noticed that the estimator

θ̃ = nθ̂ − (n − 1)θ̄(−) ,

where θ̄(−) = {θ̂(−1) + ⋯ + θ̂(−n)}/n is the average of the n possible leave‐one‐out estimators, has bias expansion −b2(θ)/n2 + O(n−3), which is of smaller asymptotic order than the O(n−1) bias of θ̂. This procedure is called the jackknife (see Ref 8 for an overview of the jackknife). Efron9 § 2.3 shows the basic geometric argument behind the jackknife; the jackknife estimates the bias based on a linear extrapolation of the expected value of the estimator as a function of 1/n. The same procedure can be carried out for correcting bias of higher orders, essentially replacing the linear extrapolation by quadratic extrapolation and so on. Schucany et al.10 give an elegant way of deriving such higher‐order corrections in bias, with the jackknife being a prominent special case of their method. The jackknife is an explicit method because the new estimator θ̃ is simply θ̂ − B̂jack, where

B̂jack = (n − 1){θ̄(−) − θ̂}

is the jackknife estimator of the bias.
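As a concrete sketch of the jackknife recipe (with generic, illustrative function names, not code from the references above), the bias estimate B̂jack and the corrected estimate can be computed as follows for any estimator that is a function of a simple random sample.

```python
import numpy as np

def jackknife_correct(y, estimator):
    """Jackknife bias estimate and bias-corrected estimate for a generic
    estimator computed from a one-dimensional sample y."""
    y = np.asarray(y)
    n = len(y)
    theta_hat = estimator(y)
    # leave-one-out estimates
    loo = np.array([estimator(np.delete(y, j)) for j in range(n)])
    bias_jack = (n - 1) * (loo.mean() - theta_hat)   # jackknife bias estimate
    return theta_hat - bias_jack, bias_jack          # corrected = n*theta_hat - (n-1)*mean(loo)

# Example: the ML variance estimator (divisor n) is biased downwards; for this
# estimator the jackknife correction recovers the divisor-(n - 1) sample variance.
rng = np.random.default_rng(0)
y = rng.normal(size=20)
var_ml = lambda x: np.mean((x - x.mean()) ** 2)
print(jackknife_correct(y, var_ml))
```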
Bootstrap
The bootstrap estimates the bias by generating a large number of bootstrap samples (either nonparametrically, by resampling the observed data with replacement, or parametrically, by simulating from the fitted model) and computing

B̂boot = θ̄* − θ̂ ,

where θ̄* is the average of the estimates based on each of the bootstrap samples. Efron and Tibshirani12 and Davison and Hinkley13 are thorough treatments of bootstrap methodology. Under general conditions, Hall and Martin14 showed that, if θ̂ has a bias of O(n−1) which can be consistently estimated, then the estimator

θ̃ = θ̂ − B̂boot

has bias of smaller asymptotic order than that of θ̂.
The estimates of the bias in the gasoline yield data case study were obtained by simulation from the fitted model in Table 1, and thus are parametric bootstrap estimates of the bias.
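A minimal nonparametric version of this bootstrap bias correction can be sketched as follows; the names are illustrative, and for structured data such as the case studies here the resampling scheme would need the special considerations discussed later.

```python
import numpy as np

def bootstrap_correct(y, estimator, n_boot=2000, rng=None):
    """Nonparametric bootstrap bias estimate, B_hat = mean(theta*_b) - theta_hat,
    and the corresponding bias-corrected estimate theta_hat - B_hat."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    theta_hat = estimator(y)
    boot = np.array([estimator(rng.choice(y, size=len(y), replace=True))
                     for _ in range(n_boot)])
    bias_boot = boot.mean() - theta_hat
    return theta_hat - bias_boot, bias_boot

# Example: bias-correcting the ML variance estimator by resampling the data.
rng = np.random.default_rng(0)
y = rng.normal(size=20)
var_ml = lambda x: np.mean((x - x.mean()) ** 2)
print(bootstrap_correct(y, var_ml, rng=rng))
```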
Asymptotic Bias Correction
Asymptotic bias correction methods estimate the bias by b(θ̂)/n, which is the first term in the right‐hand side of Eq. 4 evaluated at θ̂. Cox and Snell,15 in their investigation of higher‐order properties of residuals in general parametric models, derive an expression for the first‐order bias term b(θ)/n in Eq. 4 when θ̂ is the maximum likelihood estimator. That expression has sparked a still‐active research stream in correcting the bias by using the estimator

θ̃ = θ̂ − b(θ̂)/n .

Efron16 showed that θ̃ has bias of order o(n−1), which is of smaller order than the O(n−1) bias of the maximum likelihood estimator, and that the asymptotic variance of any estimator with O(n−2) bias is greater than or equal to the asymptotic variance of θ̃ (second‐order efficiency). For the interested reader, Pace and Salvan17 give a thorough discussion of those properties.
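A textbook illustration of the correction (not one of the case studies in this review) is the estimation of the rate λ of an exponential distribution from a sample Y1, …, Yn with n ≥ 2: the maximum likelihood estimator λ̂ = 1/Ȳ has expectation nλ/(n − 1), so that B(λ) = λ/(n − 1) and the first‐order bias term is b(λ)/n = λ/n. The corrected estimator λ̂ − b(λ̂)/n = (n − 1)λ̂/n turns out, in this particular case, to be exactly unbiased.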
Landmark studies in the literature for asymptotic bias corrections are Cook et al.18 who investigate correcting the bias in nonlinear regression models with Normal errors and Cordeiro and McCullagh19 who treat generalized linear models with interesting results on the shrinkage properties of the reduced‐bias estimators in binomial regression models and an attractive implementation through one supplementary reweighted least squares iteration. Furthermore, Botter and Cordeiro20 and Cordeiro and Toyama Udo21 extend the results in Cordeiro and McCullagh19 and derive the first‐order biases for generalized linear and nonlinear models with dispersion covariates.
Breslow and Lin23 derive expressions for the asymptotic biases in generalized linear mixed models for various estimation methods and use those to correct for the bias. Higher‐order corrections have also appeared in the literature,24 where expressions for b(θ)/n + b2(θ)/n2 in Eq. 4 are obtained. The expressions involved in such higher‐order corrections are cumbersome, requiring enormous effort in derivation and implementation, and there is always the danger that the benefits in estimation from this effort are only marginal, if any, compared to methods that are based on simply removing the first‐order bias term.
Advantages and Disadvantages of Explicit Methods
The main advantage of all explicit methods is the simplicity of their application. Once an estimate of the bias is available, reduction of bias is simply a matter of a one‐step procedure where the estimated bias is subtracted from the estimates. Nevertheless, because of their explicit dependence on θ̂, explicit methods directly inherit any instabilities of the original estimator. Such cases involve models with categorical responses, where there is a positive probability that the maximum likelihood estimator is not finite (see Ref 25 for conditions that characterize when infinite estimates occur in multinomial response models); these have been the subject of study in works like Mehrabi and Matthews,26 Heinze and Schemper,27 Bull et al.,28 Kosmidis and Firth,29 Kosmidis and Firth,2 and Kosmidis.30 In particular, Kosmidis30 relates to the proportional odds model case study discussed below.
Furthermore, asymptotic bias correction methods have the disadvantage that they are only applicable when b(θ)/n can be obtained in closed form, which can be a tedious or even impractical task for many models (see, e.g., Ref 3, where the expressions for b(θ)/n are given for Beta regression models).
IMPLICIT METHODS
Implicit methods approximate B(θ) at the target estimator θ̃ and then solve Eq. 3 with respect to θ̃. Hence, θ̃ is the solution of an implicit equation.
Indirect Inference
Indirect inference obtains the new estimator θ̃ by approximating B(θ) at θ̃ through parametric bootstrap. Kuk32 independently produced the same idea for reducing the bias in the estimation of generalized linear models with random effects, and Jiang and Turnbull33 give a comprehensive review of indirect inference from a statistical point of view. Furthermore, Gourieroux et al.34 and Phillips35 discuss bias reduction through indirect inference in econometric applications. Pfeffermann and Correa36 give an alternative approach to bias reduction which is in line with the basic idea of indirect inference.
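To make the idea concrete, the following sketch approximates an indirect‐inference‐type estimator for a single parameter (the rate of an exponential distribution) by a naive fixed‐point iteration: the candidate value is adjusted until the average maximum likelihood estimate from data simulated at that value matches the maximum likelihood estimate from the observed data. This is an illustrative toy implementation under those assumptions, not the optimization formulation used in the references above.

```python
import numpy as np

rng = np.random.default_rng(2)

def mle_rate(y):
    """Maximum likelihood estimator of the rate of an exponential sample."""
    return 1.0 / np.mean(y)

def indirect_rate(y, n_iter=20, n_sim=500):
    """Toy fixed-point approximation to an indirect-inference estimator of the
    exponential rate: move theta until the average ML estimate from data
    simulated at theta matches the ML estimate from the observed data."""
    n = len(y)
    theta_hat = mle_rate(y)
    theta = theta_hat
    for _ in range(n_iter):
        sims = rng.exponential(scale=1.0 / theta, size=(n_sim, n))
        mean_mle = np.mean(1.0 / sims.mean(axis=1))   # average ML estimate at theta
        theta += theta_hat - mean_mle                  # reduce the mismatch
    return theta

y = rng.exponential(scale=1.0 / 2.0, size=15)          # observed sample, true rate 2
print("ML estimate:", mle_rate(y), "indirect-inference estimate:", indirect_rate(y))
```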
Bias‐Reducing Adjusted Score Equations
When θ̂ is the maximum likelihood estimator and under the usual regularity conditions, Firth37 and Kosmidis and Firth29 investigate what penalties need to be added to the score functions S(θ) in order to get an estimator that has asymptotically smaller bias than that of the maximum likelihood estimator. In its simplest form, such an approach requires finding θ̃ by solving the adjusted score equations

S(θ) − F(θ)b(θ)/n = 0 ,   (6)

where b(θ)/n is the first‐order bias term of the maximum likelihood estimator as in Eq. 4; the solution θ̃ of Eq. 6 is an estimator with o(n−1) bias. Equation 6 can be rewritten as

θ + {F(θ)}−1S(θ) − b(θ)/n = θ ,

which shows that θ̃ is another approximate solution to Eq. 3, because b(θ)/n approximates B(θ) up to order O(n−2) and {F(θ)}−1S(θ) is the O(n−1/2) term in the asymptotic expansion of θ̂ − θ, with both quantities evaluated at θ̃.
An important property of the estimator based on adjusted score functions, which is also shared by the estimator from asymptotic bias correction, is that it has the same asymptotic distribution as the maximum likelihood estimator, namely a Normal distribution centered at the true parameter value with variance‐covariance matrix the Cramér‐Rao lower bound {F(θ)}− 1. Hence, the first‐order methods that are used for the maximum likelihood estimator, like Wald‐type confidence intervals, score tests for model comparison, and so on, are unaltered in their form and apply directly by using the new estimators.
It is noteworthy that in the case of full exponential families (e.g., logistic regression and Poisson log‐linear models) the solution of Eq. 6 can be obtained by direct maximization of a penalized likelihood where the penalty is the Jeffreys38 invariant prior (see Ref 37 for details). It should also be stressed that not all models admit a penalized likelihood interpretation of bias reduction via adjusted scores. Kosmidis and Firth29 give an easy‐to‐check necessary and sufficient condition that identifies which univariate generalized linear models admit such penalized likelihood interpretation and provide the form of the resultant penalties when the condition holds. That condition is a restriction on the variance function of the responses in terms of the derivative of the chosen link function.
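To illustrate, the sketch below solves adjusted score equations of this type for logistic regression by Fisher scoring, assuming the standard hat‐value form of the adjustment for that model; it is a minimal, illustrative implementation (with hypothetical names), not a substitute for the packages discussed later.

```python
import numpy as np

def firth_logistic(X, y, n_iter=100, tol=1e-8):
    """Bias-reduced logistic regression via Firth-type adjusted score equations:
    the score X'(y - pi) is modified to X'(y - pi + h*(1/2 - pi)), where h holds
    the diagonal of the hat matrix W^{1/2} X (X'WX)^{-1} X' W^{1/2}."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = pi * (1.0 - pi)                          # diagonal of the weight matrix
        XtWX = X.T @ (W[:, None] * X)                # Fisher information
        XW_half = X * np.sqrt(W)[:, None]
        h = np.einsum("ij,ij->i", XW_half @ np.linalg.inv(XtWX), XW_half)  # hat values
        adj_score = X.T @ (y - pi + h * (0.5 - pi))  # adjusted score
        step = np.linalg.solve(XtWX, adj_score)      # Fisher scoring step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example with complete separation: y is 0 below x = 0 and 1 above, so the ML
# estimate diverges, while the adjusted-score estimate stays finite.
x = np.linspace(-2, 2, 20)
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)
print(firth_logistic(X, y))
```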
Advantages and Disadvantages
The main disadvantage of implicit methods is that their application requires the solution of a set of implicit equations which in most of the useful cases requires numerical optimization. This task is even more computationally demanding for indirect inference approaches in general models because of the necessity to approximate the bias function in a p‐dimensional space. Furthermore, indirect inference approaches inherit the disadvantages of explicit methods because they explicitly depend on the original estimator.
The approach in Firth37 and Kosmidis and Firth,29 on the other hand, does not directly depend on θ̂ and hence has gained considerable attention compared to the other approaches. Another reason for the wide adoption of this method is recent advances which simplify its application through either iterated first‐order bias adjustments (see Refs 3, 22) or iterated maximum likelihood fits on pseudo‐observations (see Refs 2, 29, 30). Of course, as for the asymptotic bias correction methods, the adjusted score equation approach to bias reduction has the disadvantage of being directly applicable only under the same conditions that guarantee the good limiting behavior of the maximum likelihood estimator, and only when the score functions, the Fisher information, and the first‐order bias term of the maximum likelihood estimator are available in closed form.
PROPORTIONAL ODDS MODELS
This example was analyzed in Kosmidis.30 The data set in Table 3 is from Randall39 and concerns a factorial experiment for investigating factors that affect the bitterness of white wine. There are two factors in the experiment, temperature at the time of crushing the grapes (with two levels, ‘cold’ and ‘warm’) and contact of the juice with the skin (with two levels ‘Yes’ and ‘No’). For each combination of factors two bottles were rated on their bitterness by a panel of nine judges. The responses of the judges on the bitterness of the wine were taken on a continuous scale in the interval from 0 (‘None’) to 100 (‘Intense’) and then they were grouped correspondingly into five ordered categories, 1, 2, 3, 4, 5.
| Temperature | Contact | Bitterness 1 | Bitterness 2 | Bitterness 3 | Bitterness 4 | Bitterness 5 |
|---|---|---|---|---|---|---|
| Cold | No | 4 | 9 | 5 | 0 | 0 |
| Cold | Yes | 1 | 7 | 8 | 2 | 0 |
| Warm | No | 0 | 5 | 8 | 3 | 2 |
| Warm | Yes | 0 | 1 | 5 | 7 | 5 |
Following Ref 30, consider the partial proportional odds model

log{P(Yi ≤ j)/P(Yi > j)} = αj − βjwi − θci  (j = 1, …, 4) ,   (7)

where Yi is the bitterness rating for the ith bottle–judge combination, wi = 1 if the temperature at crushing was warm and wi = 0 otherwise, and ci = 1 if there was contact of the juice with the skin and ci = 0 otherwise; the hypothesis of proportional odds corresponds to β1 = β2 = β3 = β4. Table 4 shows the maximum likelihood estimates for model Eq. 7 and the corresponding estimated standard errors, as reported by the clm function of the R package ordinal.41 It is directly apparent that the absolute values of the estimates and estimated standard errors for the parameters α4, β1 and β4 are very large. Actually, these would diverge to infinity as the stopping criteria of the iterative fitting procedure become stricter and the number of allowed iterations increases. The estimates for the remaining parameters are all finite and will preserve the values shown in Table 4 even if the number of allowed iterations increases. This is an instance of the problems that practitioners may face when dealing with categorical response models. Using a Wald‐type statistic based on the maximum likelihood estimator for testing the hypothesis of proportional odds would be adventurous here, because such a statistic explicitly depends on the estimates of β1, β2, β3 and β4. Of course, given that the likelihood is close to its maximal value at the estimates in Table 4, a likelihood ratio test can be used instead; the likelihood ratio test for this particular example has been carried out in Christensen.42
| Parameter | ML Estimate (SE) | ML Z‐Statistic | Adjusted Score Estimate (SE) | Adjusted Score Z‐Statistic |
|---|---|---|---|---|
| α1 | −1.27 (0.51) | −2.46 | −1.19 (0.50) | −2.40 |
| α2 | 1.10 (0.44) | 2.52 | 1.06 (0.44) | 2.42 |
| α3 | 3.77 (0.80) | 4.68 | 3.50 (0.74) | 4.73 |
| α4 | 28.90 (193125.63) | 0.00 | 5.20 (1.47) | 3.52 |
| β1 | 25.10 (112072.69) | 0.00 | 2.62 (1.52) | 1.72 |
| β2 | 2.15 (0.59) | 3.65 | 2.05 (0.58) | 3.54 |
| β3 | 2.87 (0.82) | 3.52 | 2.65 (0.75) | 3.51 |
| β4 | 26.55 (193125.63) | 0.00 | 2.96 (1.50) | 1.98 |
| θ | 1.47 (0.47) | 3.13 | 1.40 (0.46) | 3.02 |
Note here that methods like the bootstrap and jackknife would require special considerations for their application in a well‐designed experiment like the above, the question to be answered being what constitutes an observation to be resampled or left out. Even if such considerations were resolved, the bootstrap and jackknife would be prone to the problem of infinite estimates. The latter is also true for the estimator based on asymptotic bias corrections and for indirect inference.
Kosmidis30 derives the adjusted score equations for cumulative link models, and uses them to calculate the reduced‐bias estimates shown in the right of Table 4. The reduced‐bias estimates based on the adjusted score functions are finite and, through the asymptotic normality of the reduced‐bias estimator, they can form the basis of a Wald‐test for the hypothesis β1 = β2 = β3 = β4. This test has been carried out in Kosmidis30 and gives a p‐value of 0.861, providing no evidence against the hypothesis of proportional odds.
Furthermore, the values of the Z‐statistics for α4, β1 and β4 in Table 4 are essentially zero when based on the maximum likelihood estimator. This is typical behavior when the estimates diverge to infinity, and it happens because the estimated standard errors diverge much faster than the estimates, irrespective of whether or not there is evidence against the individual hypotheses. This is also true if we were testing individual hypotheses at values other than zero, and it can lead to invalid conclusions if the maximum likelihood output is interpreted naively; as shown in Table 4, the Z‐statistics based on the reduced‐bias estimates are far from zero.
Such inferential pitfalls with the use of the maximum likelihood estimator are not specific to partial proportional odds models. For most models for categorical and discrete data (binomial‐response models like logistic regression, multinomial response models, Poisson log‐linear models, and so on) there is a positive probability of infinite estimates. Bias reduction through adjusted score functions has been found to provide a solution to those problems, and the corresponding methodology is quickly gaining in popularity and has found its way into commercial software like Stata and SAS. Open‐source solutions include the logistf R package43 for logistic regression, which is based on the work in Heinze and Schemper,27 the pmlr R package44 for multinomial logistic regression, based on the work of Bull et al.,28 and the brglm R package,45, 46 which at the time of writing handles all binomial‐response models. The brglm R package is currently being extended for its next major update, which will handle all generalized linear models, including multinomial logistic regression2 and ordinal response models.30
GASOLINE YIELD DATA REVISITED
In this section, the reduced‐bias estimates for the parameters of model 1 are calculated using the jackknife, the bootstrap, asymptotic bias correction, and the approach of bias‐reducing adjusted score functions. The full parametric bootstrap estimate of the bias has been obtained in our earlier treatment, showing that the bias of the regression parameter estimators is of no consequence. A fully nonparametric bootstrap, where the bootstrap samples are produced by resampling with replacement the full response–covariate combinations (yi, si1, …, si9, ti) (i = 1, …, n), is not advisable here because the nine dummy variables si1, …, si9 represent 10 distinct experimental settings, and resampling those will result in singular fits with high probability (see also Ref 13, § 6.3 for a description of such problems in the simpler case of multiple linear regression). An intermediate strategy is to resample residuals and use them with the original model matrix to get samples for the response; this strategy lies between the fully nonparametric and the fully parametric bootstrap (see Ref 13, § 7). Residual resampling works well in multiple linear regression because the response is related linearly to the regression parameters, which is not true for Beta regression. For more complicated models, like generalized linear models and Beta regression, an appropriate residual definition has to be chosen. Because Beta responses are restricted to (0, 1), the best option is to resample residuals on the scale of the linear predictor and then transform back to the response scale using the inverse of the logistic link, obtaining bootstrap samples for the response (see Ref 13, expression (7.13), for rationale and implementation). In the current case we choose the 'standardized weighted residual 2' of Espinheira et al.47 because it appears to be the one that is least sensitive to the inherent skewness of the response.
The reduced‐bias estimates of φ using the jackknife, residual‐resampling bootstrap (with 9999 bootstrap samples), asymptotic bias correction, and bias‐reducing adjusted score functions are 165.682, 236.003, 261.206, and 261.038, respectively, all indicating that the maximum likelihood estimator of φ is prone to substantial upward bias. The simulations in Kosmidis and Firth22 illustrate that asymptotic bias correction and the bias‐reducing adjusted score functions correctly inflate the estimated standard errors, to the extent that almost exact coverage of the first‐order Wald‐type confidence intervals is recovered.
DISCUSSION AND CONCLUSION
As can be seen from the earlier case‐studies, reduced‐bias estimators can form the basis of asymptotic inferential procedures that have better performance than the corresponding procedures based on the initial estimator. Heinze and Schemper,27 Bull et al.,48 Kosmidis,49 Kosmidis and Firth,22, 29 and Grün et al.3 all demonstrate that such improved procedures are delivered either by using the penalized likelihood that results from the approximation of Eq. 3, or by replacing the initial estimator with the reduced‐bias estimator in Wald‐type pivots, as was done in the case‐studies of this review.
At the time of writing the current review, there is no general answer as to which of the methods reviewed here produces better results. All methods deliver estimators that have o(n−1) bias, which is asymptotically smaller than the O(n−1) bias of the maximum likelihood estimator. In models with categorical or discrete responses, the adjusted score equations approach is preferable to the other bias reduction approaches because the resultant estimates appear to be always finite, even in cases where the maximum likelihood estimates are infinite (see, e.g., Refs 27, 48, 29, 2, 30 for generalized linear and nonlinear models with binomial, multinomial, and Poisson responses). This has led researchers to promote the routine use of the adjusted score equations approach in such models as an improved alternative to maximum likelihood.
The general use, though, of the adjusted score equations approach is limited by its dependence on a closed‐form expression for the first‐order bias of the maximum likelihood estimator, which may not be readily available or may even be intractable (e.g., in generalized linear mixed effects models).
At this point, we should also stress that reducing bias does not always have desirable effects; an improvement in bias can sometimes result in inflation of the mean squared error, through an inflation of the estimator's variance. The use of simple simulation studies, similar to the one in Kosmidis and Firth,22 is recommended for checking whether that is the case. If it is, then the use of reduced‐bias estimates in test statistics and confidence intervals is not recommended.
Furthermore, bias is a parameterization‐specific quantity and any attempt to improve it will violate the invariance properties of the maximum likelihood estimator. Hence, bias‐reduction methods should be used with care, unless the parameterization is fixed either by the context or by the practitioner.
All the discussion in the current review has focused on the effect that bias can have and the benefits of its reduction in cases where the usual regularity conditions are satisfied. An important research avenue is the reduction of bias under departures from the regularity conditions, and especially when the dimension of the parameter space increases with the sample size. Lancaster50 gives a review of the issues that econometricians and applied statisticians face in such settings. A viable route toward reduction of bias in such cases comes from the use of modified profile likelihood methods (see Ref 51 for a brief introduction), which have been successfully used for reducing the bias in the estimation of dynamic panel data models in Bartolucci et al.52 Another route is the appropriate adaptation of indirect inference approaches, or of other approximate solutions of Eq. 3, to such settings. The econometrics community is currently active in this direction, with a recent example being Gouriéroux et al.,53 where indirect inference is applied to dynamic panel data models. These early attempts are only indicative that there is still much to be explored and much work to be done on the topic of bias reduction in parametric estimation.
ACKNOWLEDGMENTS
The author thanks the Editor, the Review Editor, and three anonymous Referees for detailed, helpful comments which have substantially improved the presentation.
REFERENCES