Accurate bias estimation with applications to focused model selection

We derive approximations to the bias and squared bias with errors of order o ( 1 ∕ n ) where n is the sample size. Our results hold for a large class of estimators, including quantiles, transformations of unbiased estimators, maximum likelihood estimators in (possibly) incorrectly specified models, and functions thereof. Furthermore, we use the approximations to derive estimators of the mean squared error (MSE) which are correct to order o ( 1 ∕ n ) . Since the variance of many estimators is of order O ( 1 ∕ n ) , this level of precision is needed for the MSE estimator to properly take the variance into account. We also formulate a new focused information criterion (FIC) for model selection based on the estimators of the squared bias. Lastly, we illustrate the methods on data containing the number of battle deaths in all major inter-state wars between 1823 and the present day. The application illustrates the potentially large impact of using a less-accurate estimator of the squared bias.


INTRODUCTION
Bias estimation has a long tradition and an extensive list of techniques.Cox and Snell (1968) give expressions of the bias of maximum likelihood estimators under the assumption of a correctly specified model.The jackknife, a way of estimating bias based on resampling, was introduced in Quenouille (1949), with important extensions in Quenouille (1956) and Tukey (1958).Firth (1993) takes a different approach and proposes techniques for reducing the bias, and hence removing the need for its estimation.In the present article, we build on classic theory by deriving approximations to the bias of a large class of estimators.Our goal will be to give comprehensive formulas throughout the article, and special attention should be given to Section 2.3.Here we derive model-robust expressions for the bias of maximum likelihood estimators.We believe the main theorem of this section serves as a modern extension of the results in Cox and Snell (1968) and is a novel addition to the field of bias estimation and correction.
Our main motivation for deriving precise bias approximations is estimation of mean squared error, or MSE for short.This risk function measures the expected squared L 2 -distance from an estimator to the value it is aiming for.In symbols, this means MSE( θ) = E( θ −  0 ) 2 , where θ is an estimator of a parameter  0 .Famously, the MSE decomposes into the sum of the squared bias and the variance of the estimator evaluated.For many popular and well used estimators, for example, means, medians, and maximum likelihood estimators, the variance component is relatively easy to estimate.The bias component of the MSE, on the other hand, is often more complicated as the true value  0 shows up in the expression.Furthermore, the variance of many estimators, for example, quantiles, maximum likelihood estimators, and means, is of order O(1∕n), where n is the sample size.Because, of this, estimators of the squared bias must have expected errors of order o(1∕n) for the variability to be given sufficient weight in the estimated MSE.To achieve this, the precise bias approximations is crucial.
Recently, estimation of bias and MSE has been developed in great detail within the field of model selection.This is mainly due to the information criterion introduced in Claeskens and Hjort (2003) and later extended by multiple authors.See for instance Zhang and Liang (2011), Claeskens et al. (2006), or Claeskens and Hjort (2008).The criterion is called the focused information criterion, or FIC for short, and ranks models by the quality of their estimate of a prespecified parameter of interest.Quality is measured by MSE, and as a consequence, the formulation of the FIC requires sophisticated estimators of the MSE and squared bias.Because of this, many advanced techniques for estimation of the squared bias have been developed in the literature concerning the FIC.The original versions of the criterion were created for situations where models are nested within a true parametric alternative, with the degree of misspecification decreasing by the order of O(1∕ √ n).In Jullum and Hjort (2017), however, a new version of the criterion is given, which gives a general way of estimating the MSE of estimators.
The approximation given in Jullum and Hjort (2017) avoids the assumption of asymptotically correctly specified models.Because of this, the estimators in this article have found use in multiple situations, see for example, Jullum and Hjort (2019), Ko et al. (2019), Claeskens et al. (2019), and Cunen, Walløe, and Hjort (2020).The approximations can, however, be shown to have an error term of order O(1∕n) in cases where some or all of the quantities involved are asymptotically biased.This makes the criterion too imprecise to give proper weight to the variance component of the MSE.As a result the variability of estimators can potentially be greatly downplayed in the estimator of the MSE developed in Jullum and Hjort (2017).In the present article, we will build on the estimators given in Jullum and Hjort (2017) and derive an approximation to the MSE which has expected error order o(1∕n).
We will start by considering precise bias estimation for a wide class of estimators.Afterwards, we will discuss MSE estimation and the FIC.In particular, we will consider when and why the expressions given in Jullum and Hjort (2017) need to be corrected.In Section 4.1, we will give a new version of the FIC, extending that of Jullum and Hjort (2017).Lastly, we will illustrate the techniques developed in this article by estimating the MSE of a selection of estimators of the difference in the median number of battle deaths in major inter-state wars before and after the Korean war.

PRECISE BIAS ESTIMATION
Our main topic is precise estimation of the bias.We will work with estimators on the form where Y 1 ,…, Y n are i.i.d.from a distribution F and E(Y ) = 0 when Y ∼ F. At first glance, considering estimators on the form of (1) only might look slightly restrictive.In practice, however, this condition holds for many commonly used estimators.If, for instance, μ1 is a mean, the above equation holds trivially with  1 (y) = y −  1 .For the p quantile  p ,  1 (y) = [p − I(y ≤  p )]∕f ( p ) can be used when f , the density in the distribution F, exists.When μ1 is a maximum likelihood estimator of  in some parametric model f  , (1) holds with  1 (y) = ∇g( lf ) T J( lf ) −1 u(y,  lf ) where  lf is the minimizer of the Kullback-Leibler divergence from the true distribution to the parametric family, u is the score function, J( lf ) is the Fisher matrix in the model evaluated at  lf and g is a function mapping  to the value of  in the parametric model, see Section 2.3 for details.In addition, expressions for  1 when μ1 is a smooth function of the above estimators, can be derived by the delta method.For conditions ensuring that (1) is satisfied for the influence function, see for example, Thm. 5.5 in Shao (2003).
The main goal of this section is to find expressions for and estimators of Before we can begin with estimation of the quantity, however, we need to discuss when nE n converges and c is well-defined.We start with an informal argument which hopefully will provide some intuition about what c is really is and when it exists.Let F n be the empirical distribution function of Y 1 ,…, Y n ∈ R and F the true cumulative distribution function.Assume that μ can be represented as a functional of F n , that is, that there exists a function T such that μ = T(F n ).If T is sufficiently smooth, a Taylor expansion around F reveals where D (k)   F−F n T is the kth order directional derivative of T in direction F n − F. The expression above is called the von Mises expansion, see for example, chap.20 of Van der Vaart (1998) for more details.
The Donsker theorem, see for example Van der Vaart (1998, p. 266) The first term on the right-hand side is a sum of influence functions of T and has expectation zero.Under sufficient regularity, we therefore have is the quadratic term in a Taylor expansion, we would expect it to be of order O(||F n − F|| 2 ) = O pr (1∕n), ensuring that the expression in (3) converges and c exists.A formal statement, proof and precise set of conditions showing that this intuition is indeed correct is given in Thm.3.1 of Shao (1991).
In principle, we can always estimate c by approximating the right-hand side of (3).This method is quite general and can, in theory, be used for most estimators.In practice, however, this procedure is far from straightforward and, when done in full generality, results in incomprehensible formulas requiring fine-tuning and extensive calculations in each new application.We will therefore instead focus on some situations where precise formulas and/or estimation strategies for c can be found.
We will present four ways to estimate c.First, we will work with functions of unbiased estimators and quantiles.In Section 2.3 we will work with higher-order Taylor expansions of the log-likelihood function to give expressions for the bias of maximum likelihood estimators.Afterwards, we will discuss how c can be estimated when μ is a function of estimators for which approximations to the bias exists.Lastly, we will discuss resampling techniques.
Remark 1.In the following, we will derive multiple approximations to c.Most of the formulas will be obtained by finding estimators with sufficiently small remainder terms,  n .Typically these will be of size o pr (1∕n), and we will take this to imply that their expected values are of order o(1∕n).For this to hold true, a sufficient condition is that {n n } n is uniformly integrable.See, for example, Billingsley (1999, p. 31) for a proof and details concerning uniform integrability or Shao and Wu (1989) for an alternative set of regularity conditions.To make arguments more transparent, we will not focus on these technical details in the proofs, but take E o pr (1∕n) = o(1∕n) for granted.All required regularity conditions will, however, be stated in the theorems and full proofs dealing with all technicalities may be found in the Appendix B.

Functions of unbiased estimators
The first situation we will consider is when μ is a function of unbiased estimators.The following result gives a formula for c in this case.
Theorem 1.Let â ∈ R p be an unbiased estimator of a ∈ R p such that √ n(â − a) converges in distribution to a N(0, Σ)-distributed variable and all components of √ n(â − a) to the power of three are uniformly integrable.Assume h ∶ R p → R is a function for which all partial derivatives up to and including order three exist and are continuous in a neighbourhood of a.Then, if  = h(a) and μ = h(â), where Hh(a) is the Hessian matrix of h evaluated at the point a.Furthermore, c can be estimated consistently by where Σ is some consistent estimator of Σ.
Proof.Taylor expanding h around a 0 shows for some remainder term  n .Typically,  n will be of order O pr (||â − a|| 3 ) which ensures E n = O(n −3∕2 ) under uniform integrability.Under the conditions of the theorem, this is indeed the case.Consult the Appendix B for details.Hence, As is shown in the Appendix B, the conditions in the theorem ensures that Hence, by properties of the trace operator, (4) holds true.
Lastly, since â and Σ are consistent estimators of Σ and a, respectively, ĉ defined in (5) converges in probability to c by the continuous mapping theorem.This concludes the proof.▪ The simplest examples of functions of unbiased estimators are unbiased estimators themselves.For such estimators h is linear and its Hessian zero.Hence, c = 0 and needs not be estimated.Example include linear function of means and ordinary least squares regression coefficients.
Another important example is maximum likelihood estimators in exponential families.To see this, let f be a member of the exponential family.Then f is on the form for some functions A, B, , and T and a parameter .The maximum likelihood estimator based on a sample Y 1 ,…, Y n , is the maximizer of the log-likelihood function, which is the solution to the following equation Solving the equation shows ) where g is the function Now assume the focus parameter takes the value k() in the family parameterized by ( 7) and let μ = k( θ) be the maximum likelihood estimator of this quantity.Then where . This shows that the maximum likelihood estimators in exponential families are functions of means.By Theorem 1, we get It is worth noting that (8) makes no assumptions about the parametric model being correctly specified.Hence, the expression in (8) holds true even when the true distribution of the data is not a member of the parametric family f  .

Quantiles
We will now consider the case where μ is a quantile.More specifically, for Y 1 ,…, Y n ∼ F i.i.d. and some p ∈ (0, 1), we will work with (1 − )Y (j) + Y (j+1) , where Y (k) is the kth order statistic and j and  are functions of p and n.All of the standard quantile estimators in R and Python are on this form, with j and  given in Table 1.We start by considering the case when Y 1 ,…, Y n are uniformly distributed.The results are stated in the following lemma.
Lemma 1.Let U 1 ,…, U n be i.i.d.uniformly distributed variables and set μ = (1 − ) U (j) + U (j+1) with  and j defined by any of the rows in Table 1.Then, c takes the following form as p is the p-quantile in the uniform distribution.
Proof.By standard results, the jth-order statistic follows a Beta(j, n − j + 1)-distribution.The expression in (9) follows.▪ In many cases, the left-hand side of (9) converges as n grows to infinity.For the quantile versions in R with the type argument equal to 4, 5,…, 9, this is the case, and the corresponding limits are shown in Table 1.When type is equal to 1, 2, or 3, however, the expression to the left in (9) does not converge unless very specific conditions on how n changes from iteration to iteration is assumed.In applications, we recommend reading the limit of (9) off Table 1 when using a quantile estimator corresponding to one in the default quantile function in R with type equal to 4,…, 9.When type equal to 1, 2, or 3 is used, the left-hand side of (9) can be estimated directly using j and  from Table 1.The latter approach can, of course, also be used when type equals 4,…, 9, but using the limit in Table 1 leads to relatively simpler formulas.
We will now use Lemma 1 to find an expression for c for more general distributions than the uniform.
Theorem 2. Let Y 1 ,…, Y n be i.i.d.from a distribution F with p-quantile  for some fixed p ∈ (0, 1).Assume that F is invertible and that there exists a neighborhood around  on which F is thrice continuously differentiable.Let f be the density in the distribution.Then, when μ = (1 − )Y (j) + Y (j+1) , with j and  defined by either of the rows in Table 1, provided n 3∕2 ( μ − ) 3 is uniformly integrable.Furthermore, c can be estimated consistently by n f () where f () and f ′ () are some estimators going in probability to f () and f ′ (), respectively.
Proof.We sketch the proof here.More details can be found in the Appendix B.
By the probability integral transform, as F is increasing and preserves ordering of variables.Notes: The column Type gives the type argument that should be given to quantile in R to get the corresponding estimator.To the far right we have displayed c, the limits of ( 9) when they exist.The symbol ⌊⋅⌋ denotes truncation. Hence, A Taylor expansion around p shows By arguments that can be found in the Appendix B, All the standard versions of quantile estimators, have j chosen in such a way that U A similar result holds for F −1 (U (j+1) ).By the inverse function theorem (F −1 ) ′ (p) = f () −1 and (F −1 ) ′′ (p) = −f ′ ()f () −3 .Combining this with ( 12) and ( 13), shows where Furthermore, the arguments in the Appendix B ensure that under the present conditions.Plugging this into ( 14), shows Combining the above equation with Lemma 1, shows (10).By the continuous mapping theorem (11) is consistent for c.This concludes the proof ▪ If a parametric model is known, we can estimate f () and f ′ () by their maximum likelihood estimate.When working with for instance nested models, one may then fit larger models by maximum likelihood and use the fit to estimate f () and f ′ ().Otherwise, kernel density estimation can be used.In the Appendix A, we discuss the latter approach and give suggestion for bandwidths minimizing the MSE of the estimates of f () and f ′ ().

Maximum likelihood estimators
In some situations, we either know or at least strongly believe, that the true underlying distribution of the data belongs to some parametric family.When this is the case, the maximum likelihood estimator of the focus parameter will be both consistent and asymptotically unbiased for the true value.Hence, it can be used as the consistent estimator, μ0 , from Section 3.2.Because of this, we need expressions for c when μ is a maximum likelihood estimator.Such expressions exist already.Cox and Snell (1968) derived formulas for the limit of nE( θ −  0 ) when θ is a maximum likelihood estimator in a correctly specified model and  0 is the true value of this parameter.In most situations, however, we do not know which, if any, parametric model the distribution of the data belongs to.Using the results of Cox and Snell (1968) will therefore be too optimistic.In this section, we will derive model robust expressions for c.These can be applied even when we do not believe that the chosen model is the true one.
Let Y 1 ,…, Y n ∈ R d be i.i.d.random variables from some underlying distribution with density or probability mass function, g, and assume we fit a parametric model, f  , to these data using maximum likelihood.We will let θ ∈ R p denote the maximum likelihood estimator of , that is, the maximizer of  n () = ∑ n i=1 log f  (Y i ).Under the assumptions stated in White (1982), θ converges in probability to the minimizer of the Kullback-Leibler divergence from g to f  , The minimizer of this expression is in some sense, the "least false" parameter and is the quantity we have denoted by  lf .By White (1982), we also know that the speed of convergence is where In the above, g is added as a subscript to emphasize that we are taking expectations with respect to this distribution.The normal limit of √ n( θ −  lf ) is proved by finding the root of the one term Taylor expansion of ∇ n and showing that this is only o pr (1∕ √ n) away from θ.This allows us to study the behavior of an explicit expression rather than the, in general, only implicitly defined θ.For our purposes, however, a linear approximation is not sufficient, as we want to estimate the bias with an error of order o(1∕n) and therefore need higher precision than o pr (1∕ √ n).This is the main idea in the proof of the following proposition.

from a distribution F and let f 𝜃 be some parametric family of densities or probability point mass functions indexed by an open set
Θ ⊆ R p .Furthermore, let θ be the maximum likelihood estimator when fitting this family to the data and  lf the minimizer of the Kullback-Leibler divergence from F to f  .Then, the bias of θ is given by the following formula, where J and K are defined in ( 15) and ) and ) provided the following four regularity conditions hold true: (A1) All partial derivatives up to and including order four of log f  (y) at  lf exists and are continuous for F-almost all y.In addition, with Y ∼ F we need all fourth order moments of ∇  lf log f  (Y ) and all second order moments of all second-and third-order partial derivatives of log f  (Y ) to exist and be finite.
fourth-order partial derivatives of log f  (y) are bounded by m(y) for F-almost all y, and Em(Y ) 4 exists and is finite.
In addition, c can be estimated consistently by replacing the matrices J and K and V j and W j for j = 1,…, p by the empirical means, covariances, and variances of the corresponding functions evaluated at  = θ.
Proof.As mentioned previously, we will only sketch the proof of Theorem 3.For full arguments taking care of all details, consult the Appendix B.
Fix j and let  j denote the jth component function of ∇ n , that is,  n ∕ j .Taylor expanding the function around  lf , shows Here we have used that the θ maximizing  n satisfies  j ( θ) = 0.
Since  lf minimizes the Kullback-Leibler divergence, the derivative of Because of this, taking the expectation of both sides of ( 18) shows We will work with each term in (19) separately, and use the result to find an expression for E( θ −  lf ) with remainder term smaller than o(1∕n).In the following, let We start with the first term.By definition of  j and properties of the trace operator, ) .
By definition of the covariance, this equals Tr ) .
We recognize the vector E 2  n ( lf )∕ j  as −n times the jth column of the Fisher matrix J. Let this be denoted by J j .Furthermore, White (1982) showed that Hence, Let u be the score function and u j its jth component function.Then, since the Y i s are independent and identically distributed, with V j defined in ( 17).Putting all of our results together, shows that the first term of ( 19) is equal to The second term of ( 19) requires less effort.First notice that By White (1982), . Furthermore, by the law of large numbers H j ( lf )∕n = W j + o pr (1), where W j is defined in (17).Hence, Here we have again used properties of the trace operator multiple times.
We are now ready to complete the argument.Combining ( 19), ( 21), and ( 22) for j = 1,…, p.These p equations hold simultaneously if and only if Rearranging the terms and multiplying the equation by (nJ) −1 , shows ( 16).
Consistency of the estimators defined in Theorem 3 is shown in the Appendix B. ▪ If one of the parameters of the model itself is chosen as focus parameter, Theorem 3 can be used to estimate c.In most situations, however, μ is a function of θ, and the above expressions do not suffice.This is a special case of what we will discuss in Section 2.5.First, we will, however, give extensions of Theorem 3 to the more general case of M-and Z-estimators.

M-and Z-estimators
The calculations made in the proof of Theorem 3 are not limited to maximum likelihood estimators.Similar arguments can be made to find the bias of a more general class of estimators: M-and Z-estimators.Examples of such estimators include copulas fitted using a two-stage procedure, see Joe (2005) and Ko et al. (2019), and robust M-estimators, see for example, Huber (2009).M-estimators are maximizers of expressions on the form where  ∶ R d+p → R. The most obvious example of such estimators is maximum likelihood estimators.In this case (y, ) = log f  (y).Because of their similarity, M-estimators share many properties with maximum likelihood estimators.It can for instance be shown that the maximizer of ( 23), θ, converges almost surely to  0 , the maximizer of the limit function   → E(Y , ), and that √ n( θ −  0 ) has a normal limit, or in particular, that where V is the Hessian matrix of the limit function   → E(Y , ) at  0 .This expression is an analogue to (20).For more details and proofs, consult Chap. 4 of Van der Vaart (1998).
Looking at the proof of Theorem 3 we notice that the arguments can be repeated for the more general class of M-estimators.This is achieved by replacing log f  (y) by the function  and using the properties listed in the previous paragraph in place of the corresponding results for maximum likelihood estimators.The argument is very similar and will not be repeated here, but the result is stated in the following theorem.𝜃).Furthermore, let  0 ∈ R be the maximizer of the limit function   → E(Y , ) where Y ∼ F. Then the bias of θ is given by the following formula, where J and K are defined as ) . and provided the regularity conditions (A1-A4) of Theorem 3 hold true after replacing log f with ,  lf with  0 and the maximum likelihood estimator with the maximizer of n −1 ∑ n i=1 (Y i , ).In addition, c can be estimated consistently by replacing the matrices J and K and V j and W j for j = 1,…, p by the empirical means, covariances, and variances of the corresponding functions evaluated at  = θ.
Assume that the function  is differentiable in .Maximization of ( 23) can then be rephrased as solving Solutions to equations like the one above, are called Z-estimators.In the proof of Theorem 3, it is the fact that the maximum likelihood estimator is a Z-estimator which is used in the arguments.
Because of this, the result holds true also for this class of estimators, as stated in the following theorem.The proof is very similar to that of Theorem 3 and is left out.𝜃).Furthermore, let  0 ∈ R be the solution to the limit equation 0 = E(Y , ) where Y ∼ F. Then the bias of θ is given by the following formula, where J and K are defined as provided the regularity conditions (A1-A4) of Theorem 3 hold true after replacing  log f ∕ with ,  lf with  0 and the maximum likelihood estimator with the solution to In the above  j is the jth component function of .
In addition, c can be estimated consistently by replacing the matrices J and K and V j and W j for j = 1,…, p by the empirical means, covariances and variances of the corresponding functions evaluated at  = θ.
Lastly, we remark that Theorem 4 and Theorem 5 hold true even when θ is only "almost" an M-or a Z-estimator.In fact, it is sufficient that θ is an estimator which can be expressed as the root of a function on the form where  n () = o pr (1) uniformly in a neighbourhood of the root of the limit function   → E(Y , ).This follows from the fact that  n () can be incorporated into the remainder term of (18).

Resampling techniques
The last way of estimating c we will present is resampling techniques.We will discuss two methods: the bootstrap and the jackknife.The main ideas will be presented briefly here, but additional details, proofs, and discussions can be found in for example, Efron (1982), Efron and Tibshirani (1993), or Shao and Tu (1995).
The bootstrap was first introduced in Efron (1979), and has since become a popular statistical tool.To see how the procedure can be used to estimate c, notice that by ( 1) The left-hand side of this equation can be easily estimated with bootstrap.Let μ * j be the estimator of  based on the jth out of B bootstrap samples.Then is an estimator of c, see for example, Chap. 10 of Efron and Tibshirani (1993).The bootstrap procedure is quite general and works for most focus parameters.Hence, this technique is an option when all other methods fail.Using the procedure in practice is, however, often inconvenient, as a very large number of bootstrap samples must be drawn to achieve sufficient precision.To see this, notice that the bootstrap estimate B −1 ∑ B j=1 ( μ * j − μ) is a Monte Carlo estimate of its expectation.Since the variance of μ * j is of order O(1∕n) by (1), the central limit theorem ensures that the Monte Carlo error of For ĉboot to be consistent we would therefore need B ≫ n.To achieve a convergence rate of O pr (1∕ √ n) for c we would need a million bootstrap samples when n = 1000, and even with a mere 50 data points, we would need B to be greater than 2000.Because of this, the computational burden might be too heavy for bootstrap to be practical in many situations.
The jackknife is another resampling technique.It was introduced in Quenouille (1949) and Tukey (1958), and was developed specifically for the purpose of estimating the bias with an error of order o(1∕n).It is therefore well suited for estimating c.
The jackknife estimate of the bias of μ is defined as Here μ−i is the same estimator of  as μ, but based on Y 1 ,…, Y i−1 , Y i+1 ,…, Y n , for i = 1,…, n rather than the full dataset.More formally, we assume there is a functional T, such that μ = T(F n ) and , where F −i n is the empirical distribution function based on all but the ith data point.Under certain smoothness conditions, ĉjack is consistent for the true value c.This is shown in Thm.3.1 of Shao (1991).
The jackknife procedure is a flexible and general tool for bias estimation.That being said, the method requires n applications of the functional T to estimate c.In cases where application of the functional is computationally expensive or n is very large, the jackknife estimate can be slow to calculate.To ease the computational burden, it is possible to use less computer intensive versions of the standard jackknife procedure.Options include the one-step jackknife and the grouped jackknife.See Chap.five of Shao and Tu (1995) for an overview.These methods limit the number of times T needs to be applied, reducing the overall computational cost.In certain situations, however, resampling techniques can be too time consuming to be practical, even with modifications like these.In such cases, the formulas derived in the previous sections can be used instead.

Functions of the above
In the previous sections, we derived formulas and described strategies for estimating c in certain situations.We will now show how these results can be used to estimate the bias of even more complex estimators.
Theorem 6.Let θ be some estimator of  for which c  , the limit of nE( θ − ) is known.Assume furthermore that √ n( θ − ) converges in distribution to a N(0, Σ)-distributed variable.and that h ∶ R p → R is a function for which all partial derivatives up to an including order three exists and are continuous in a neighborhood of .
Furthermore, if ĉ is a consistent estimator of c  , converges in probability to c.
Notice the similarities with Theorem 1.When θ is unbiased, the first term of (24) reduces to (4).
Proof.As before, we present a proof sketch only.A full argument is similar to that of Theorem 1. Consult the proof of this theorem in the Appendix B for more details.
A Taylor expansion reveals the following Let b n = E( θ − ).Taking expectations on both sides of the above equation, shows Since ĉ is consistent for c  , (25) converges to c by the continuous mapping theorem. ▪ Now that we have discussed c in detail, we are ready to move onto our main application: estimation of the MSE and the FIC.

ESTIMATING MSE
We will start by introducing notation and assumptions.In addition, we will derive equations that show explicitly the quantities needed to estimate the MSE.

An approximation to the MSE
Assume we have i.i.d.data Y 1 ,…, Y n ∈ R d from a distribution, F. Let μ1 be an estimator of some focus parameter with true value  0 ∈ R based on these data, and assume μ1 admits an influence function,  1 , with a small enough remainder term, in the sense that with E 1 (Y ) = 0 and S n = O pr (1∕n).Lastly, let c 1 denote the limit of nES n .Direct computations show where b =  1 −  0 and  is the variance of  1 (Y ).By the above assumptions ) to be uniformly integrable, the above is sufficient for the last two terms in ( 27) to be of order o(1∕n), see for example, Billingsley (1999, p. 31).Hence, the MSE of μ1 takes the following form We summarize our findings in a theorem for easier reference.
Theorem 7. Assume Y 1 ,…, Y n ∈ R d are i.i.d.from a distribution F, and μ1 is an estimator of  0 satisfying for some S n = O pr (1∕n).Furthermore, let b =  1 −  0 , and take c 1 to be the limit of nES n .Then where  is the variances of  1 (Y ), provided this quantity exists and is finite and nS n is uniformly integrable.
To estimate the MSE of μ1 , we need to approximate all quantities in (28).This will be the topic of the next section.

Estimating to the correct order
Looking at (28), we notice that the limiting bias, b, is constant while the remaining terms decrease at the speed of O(1∕n).Because of this, b will dominate when n gets large.Hence, minimizing the MSE asymptotically reduces to choosing a μ1 with b = 0.In practice, however, we do not have infinite data points, and blindly choosing the estimator with the lowest limit bias is therefore rarely useful.Most applications require extremely flexible estimators to ensure that the bias disappears in the limit.Such procedures often have high variance.As a result, their MSEs tend to be high in finite samples.Because of this, many statistical applications instead attempt to find a trade-off between bias and variance.For our MSE estimates to be useful, they therefore need to take both the bias and variance into account.Since the latter is of order O(1∕n), we will need to create an approximation with an error of a smaller order.Otherwise, the variance is on the same scale as the error of the estimator, washing out the effect of the variability.We will create an estimator of the MSE which is only o(1∕n) away from (28) in expected value.To do this, we need to estimate three quantities: the asymptotic bias b, the variance  of  1 (Y ) and the limit c 1 of nES n .If we can find asymptotically unbiased estimators for these quantities, the two last terms in (28) can be estimated with an error of o(1∕n), due to the n appearing in the denominator of these terms.We do, however, need to be extra careful with the estimator of the squared bias.This quantity is not divided by n, and hence, we need to make sure that the bias of the estimator is of order o(1∕n).
To estimate the asymptotic bias, we need to approximate the distance from the limit of our estimator,  1 , to the truth,  0 .This is impossible to do without having some idea of the value of  0 .Because of this, we need to introduce a second estimator which is consistent for this true value.We will call this estimator μ0 and make similar assumptions about its form to the ones made for μ1 , that is, with T n being an error term of order O pr (1∕n) and  0 ∶ R d → R a function such that E 0 (Y ) = 0.The introduction of μ0 might seem slightly counter-intuitive at first glance, and one might wonder why we would bother with μ1 when we have a consistent estimator available.The answer to this concern is that even though μ1 is inconsistent for  0 , its MSE might still be lower than that of the consistent estimator μ0 , making in preferable for estimation.In addition, μ0 does not need to satisfy all the conditions we require of μ1 .The most obvious example of this is when we wish to fit a parametric model to the data.In such cases we might "need" μ1 to be a maximum likelihood estimator, while this condition is not required of μ0 .This is a crucial point when working with the FIC, which will be discuss in Section 4.
Since μ0 is consistent for the true value  0 , the asymptotic bias can be estimated consistently by the following quantity where  =  1 −  0 and R n = S n − T n .Squaring the above expression shows Taking expectations on both sides of the equation and arguing as in Section 3.1, reveals under the assumption of uniform integrability of R 2 n and n Just as for S n , ET n will typically be of order O(1∕n).So let c 0 be the limit of nET n .Then, the above equation implies We summarize the results in a theorem for easier reference.
Theorem 8. Assume Y 1 ,…, Y n ∈ R d are i.i.d.from a distribution F, and μ1 and μ0 are estimators satisfying for some S n , T n = O pr (1∕n).Furthermore, let b = μ1 − μ0 andb =  1 −  0 , and take c 0 and c 1 to be the limits of nET n and nES n .Then where  and  are the variances of (Y ) =  1 (Y ) −  0 (Y ) and  1 (Y ), respectively, provided  and  exists and are finite and nS n and nT n are uniformly integrable.
The first two terms in (30) appear in (28), the asymptotic approximation we wish to estimate.The last two terms, however, do not and need to be corrected for.If we can find consistent estimators, ĉ0 and κ, for c 0 and , respectively, the following estimator is only o(1∕n) away from b 2 + 2bc 1 ∕n in expected value b2 − κ∕n + 2 bĉ 0 ∕n.
Sometimes, (31) will be negative, leading to an estimate of the squared bias which less than zero.By definition, however, b 2 ≥ 0. Negative estimates are therefore both counter-intuitive and always further away from the true value than 0. Hence, when negative estimates show up, we will truncate them to zero, as in Jullum and Hjort (2017).This leads to the following estimator of the MSE of μ1 Estimators for b, , and  have already been discussed.By definition c 0 is the limit of nE( μ0 −  0 ), and estimation of this quantity was discussed in length in Section 2.
Remark 2. The estimator in (32) is not the only way to estimate the MSE of μ1 .Rather than adding 2 bĉ 0 ∕n to b2 to remove the bias, we could subtract ĉ0 ∕n from μ0 before estimating b.This would correct the bias of μ0 and remove the need to add 2 bĉ 0 ∕n to the squared bias estimate.Such a procedure is indeed an alternative to using (32), but does not lead to a significantly different estimator.By correcting the bias of μ0 before using it to estimate the MSE of μ1 , we are left with the estimator b2 + 2 bĉ 0 ∕n + ĉ2 0 ∕n 2 − κ∕n of the squared bias.The expected difference between this expression and (30) is a term of order O(1∕n 2 ).As we are attempting to create estimators with biases of order o(1∕n), the difference between the two approaches is negligible.
That being said, estimating and subtracting ĉ0 is only one out of many ways to remove bias from μ0 .Modifying the score function as in Firth (1993), removes the O(1∕n) part of the bias of the maximum likelihood estimator.For M-estimators, similar alterations can be made to the moment equations to achieve approximate unbiasedness, see Kim (2016).For situations where such techniques are available, correcting the bias of μ0 before using it to estimate b is indeed an alternative to the methods suggested in this article.

THE FOCUSED INFORMATION CRITERION
As mentioned previously, estimation of the MSE has been developed recently in the literature concerning the FIC.This criterion attempts to choose the model with the most precise estimate of a prespecified focus parameter, , rather than the one fitting the data the best overall.This is achieved by ranking candidate models by the MSE of their estimate of .The estimator in (32) can therefore be seen, not only as a way to approximate MSEs, but indeed as an information criterion.
Assume we fit a parametric model f  using maximum likelihood.The resulting estimator θ of  then satisfies the following equation where u is the score function and  lf and J( lf ) are as defined in Section 3.1.See for example, Thm. 5.39 in Van der Vaart (1998) for a proof and a set of conditions.Applying the delta method ensures that the focus parameter satisfies ( 26) with  1 (y) = ∇( lf ) T J( lf ) −1 u(y,  lf ), where ∇ denotes the gradient operator.Because of this, the MSE of ( θ) can be estimated using ( 32), provided a consistent estimator of the true value exists and satisfies the conditions stated previously in this chapter.
Since the FIC of the parametric model is the MSE of ( θ), we can compute the FIC of f  using (32).A related idea is used in Jullum and Hjort (2017), and their formulas are indeed quite similar to ours.We will now discuss the FIC of Jullum and Hjort (2017) in closer detail and use (32) to define a new FIC.

The new and old FIC
In Jullum and Hjort (2017), the authors derive a FIC, allowing for comparison of multiple nonnested parametric models and nonparametric alternatives, by evaluating how well a certain pre-specified parameter of interest  is estimated.This quantity is called the focus parameter, and a model's FIC score is an approximation to the MSE of its estimator μ1 of .
To derive the criterion, the authors proceed more or less as we have done in Section 3.They argue that b2 = ( μ1 − μ0 ) 2 tends to overshoot the actual bias of μ1 and that ∕n = Var(Y i )∕n should be subtracted from the expression ot correct for this.Using our notation, their final formula for the FIC is as follows, This expression is very similar to the one given in (32).The only difference is that the above expression lacks the 2 bĉ 0 ∕n term, which is present in (32).Because of this, FIC old is easier to compute than the expression in (32), but, unless b or c is zero, the expected value of the quantity is O(1∕n) away from the true MSE of μ1 .Hence, the FIC of Jullum and Hjort (2017) will, in many cases, underemphasize the importance of the variance since the error of the estimated squared bias is on the same scale as this quantity.We therefore propose the following new form of the FIC, which in some sense can be seen as a corrected version of FIC old , As argued previously, this estimator is only o(1∕n) from the true MSE of μ1 .With this new version of the FIC, we need to estimate c 0 , as was discussed in Section 2. All other quantities can be approximated by the canonical estimators.

How much does it matter?
Even though the the FIC from Jullum and Hjort ( 2017) is slightly off its mark, this version of the FIC has had success in many applications.See, for instance, Cunen, Walløe, and Hjort (2020), Claeskens et al. (2019), Wang andHobaek Haff (2019), andHermansen et al. (2015).It is therefore natural to ask "how wrong" this formula really is.In this section, we will present some examples illustrating the differences.Firstly, notice that when μ0 is unbiased, c = 0.When this is the case, the term distinguishing FIC old from FIC new is equal to zero and the two versions of the criterion agree.This happens when μ0 is a mean or some bias corrected estimator.Another perhaps less trivial example is linear combinations of the regression coefficients used in linear regression.These estimators are unbiased for the true values, provided the model is correctly specified.Hence c 0 is zero in this case.Because of this, the use of the FIC in, for instance, Cunen, Walløe, and Hjort (2020) and Claeskens et al. (2019) is unproblematic.
Another situation where FIC old and FIC new give very similar results, is when either b or c are very small.If either μ1 or μ0 are almost unbiased for the true value  0 , this can happen, and the term distinguishing the two versions of the criterion is too small to have any real effect.In addition, we truncate negative estimates of the squared bias to zero.In some cases, the estimated squared bias will be less than zero regardless of whether we subtract 2 bĉ 0 ∕n or not.Hence, the two criteria can end up being very similar, if not equal, when the bias is small, relatively speaking.An example of such a situation, is when estimating the difference in median life lengths of Roman era Egyptian men and women using data from Pearson (1902).Jullum and Hjort (2017) analyze this data set in their article and use FIC old to choose the model estimating the focus parameter with the highest precision.We repeated their computations in addition to calculating FIC new for the same data.In this example, the two information criteria gave almost identical results.The reason for this, is that the bias estimate is truncated for all but one model.This happens for both FIC new and FIC old , and hence, the two criteria agree for all but the Gompertz model.For this one case, FIC new was 0.022 larger than FIC old .The Gompertz model is, however, by far the worst of the ones considered in Jullum and Hjort (2017) with very high estimated MSE of the focus parameter.Because of this, adding 0.022 to the result, does not change the conclusions we would draw from using the FIC in this case.
The above example might lead one to believe that the difference between the new and old FIC is negligible in all but a few situations.It might therefore be tempting to use FIC old rather than FIC new because of its relatively easier formulation.This, however, is not to be advised, as there in general is no bound on the size of term 2bc∕n.To see this, consider fitting an exponential model to i.i.d.data points Y 1 ,…, Y n ∼ Gamma(k,

√
k) for a fixed number k in order to estimate the SD in the distribution, which is 1 for all k.In this situation, the limiting bias in the model will grow with k, and as a result FIC old will loose precision when k is large.A small simulation supports this claim.For a selection of k-values, we simulated one million data sets of size n = 100.In each case, we fitted an exponential model and computed FIC old as well as FIC new using the SD as focus parameter.We used the empirical SD as the consistent estimator μ0 .In Figure 1, we illustrate the The figure displays the absolute value of the difference between the mean of 10 6 simulated FIC-values and the true mean squared error as functions of the shape parameter in the underlying Gamma(k, √ k)-distribution of the data.The error for the old version of the information criterion is shown by a grey dashed line, while the error of new version is plotted as a solid black line.The focus parameter is the SD in the distribution and the empirical SD is used as the consistent estimator μ0 .The sample size is n = 100.results.We have plotted the estimated absolute value of the error of both FIC old and FIC new as functions of k, the shape parameter in the true underlying Gamma distribution.The figure shows clearly that not taking the term 2bc∕n into account, causes the error of FIC new to grow large as k increases.
This is of course only a toy example, but shows that there, at least in theory, is no bound on 2bc∕n and that in certain situations ignoring its effect can be very problematic.We therefore recommend using FIC new unless working exclusively with unbiased estimators.We will now illustrate this even further by trying out the new and old FIC on a dataset where the two criteria give very different results.

Application-battle deaths in inter-state wars
In Pinker (2011) it is claimed that the world is gradually becoming more peaceful.Such statements have since been a heated topic of discussion.See for example, Cirillo andTaleb (2016a, 2016b), Clauset (2018), or Cunen, Hjort, and Nygård (2020).When investigating claims like that of Pinker statistically, proper modeling is important.
In this section, we will use the FIC to evaluate suitability of different models for checking the claim of Pinker based on the Correlates of War (CoW) dataset (Sarkees & Wayman, 2010).This dataset consists of the number of battle deaths in the 95 most recently concluded wars.The number of battle deaths is by no means a complete measure of violence.Hence, using the CoW data set to assess the claim of Pinker is of course somewhat lacking.Nevertheless, we use these data to estimate the difference in median battle deaths between newer and older inter-state wars.
To compare recent wars to older ones, we need to decide which conflicts should be considered new.Cunen, Hjort, and Nygård (2020) investigate this question.Their analysis finds that the Korean War marks a change in the number of battle deaths in inter-state wars.Building on this, we define the "old" wars to be all conflicts before and including the Korean war.The remaining wars are treated as "new" conflicts.
We consider the data points as independent.Furthermore, we assume that the battle deaths in the recent wars are identically distributed and that the same holds true for the older wars.In our analysis, we consider eight different models and investigate how well they estimate the difference in median deaths between newer and older inter-state wars.For each case, we fit the corresponding model to the data sets consisting of newer and older wars separately.Afterwards, we use the fitted models to estimate the difference between median number of battle deaths in older and more recent wars.To evaluate the different models, we compute FIC new for each case.As the consistent estimator, μ0 , we use the difference between the empirical median for the older and newer wars.We use the default method in R with type equal to 7 to estimate this quantity.Since the two datasets are independent, ∕n and ∕n are estimated by the sum of the corresponding quantities for each separate data set.To approximate c 0 and b, we take the difference between the corresponding estimates for older and more recent wars.Afterwards, the squared bias is estimated and truncated to zero in the cases where the approximation is negative.It is worth noting that the forms of , , b, and c 0 are relatively simple in this case where the focus parameter is the difference between two quantiles.For more complicated functions, for example, μB ∕ μA where μA and μB are the median number of battle deaths in newer and older wars, respectively, the delta method and the result of Section 2.5 need to be applied.
We worked with a slightly modified version of the data.In the CoW dataset only wars with more than 1001 battle deaths are registered.To avoid problems with observations on the bounds of the support, we replaced all observations of 1001 with 1001.01.The same trick is used in Cunen, Hjort, and Nygård (2020).See this article for more explanations and discussions.After transforming the data, we fitted eight parametric models.Firstly, Dagum, Log-logistic, Pareto IV, Lomax, log-normal and log-Cauchy distribution were fit to the data shifted by 1001 to the left.Secondly, we fitted log-Gamma and log-Weibull distributions to the data scaled by 1000.The resulting values of FIC new for each model can be found in the plot in the left panel of Figure 2 together with the maximum likelihood estimate of the focus parameter for each model.From the figure, we notice that using the log-normal distributions results in the lowest MSE for the focus parameter.With this model, the estimated difference between median battle deaths before and after the Korean War is 2066.
To illustrate the difference between FIC new and FIC old , we also computed the criterion of Jullum and Hjort (2017).The results are displayed in Figure 2. From the plot, we notice that the two methods indeed give very different estimates of the MSE for many of the models, though not all.In fact, using FIC new rather than FIC old actually leads to different models being chosen.With the model selected by FIC old the difference between median number of battle deaths before and after the Korean War is estimated to be 3958, which is almost twice as large as the estimate we got from the model chosen by FIC new .This shows that the additional term in FIC new does have a real implication and is important for analysing the CoW dataset correctly.
The above analysis can be repeated for other quantiles than the median.In the right panel of Figure 2 we have computed and displayed the FIC values for and estimates of the difference in the third quartile in the two distributions.This can be used as a measure of how violence in the , , ,

F I G U R E 2
The figure shows FIC old (colored grey) and FIC new (colored black) for different models fitted to the correlates of war (CoW) dataset together with the models corresponding estimate of the focus parameter.The focus parameter is the difference in the p-quantile for older and newer wars.The value of p is shown in the title of each plot.
more deadly wars has changed in recent years.From the plot we notice that the two criteria give similar results in this case, with both the new and the old FIC selecting the Lomax model with an estimate of 28,740 for the focus parameter.The plots in Figure 2 give some insight into how the new and old version of FIC differ.For the case of the median, c 0 ∕n was estimated to be approximately 1227.This is a positive number, and hence the nonparametric estimate of the focus parameter is biased upward.Because of this, we would expect models underestimating the focus parameter compared to the nonparametric estimate, to be "less off" than what the naive estimate b suggests.Models estimating the focus parameter to be even larger than μ0 , on the other hand, are likely to be more biased than what b would lead one to believe.This intuitive idea is what is caught by FIC new , but not FIC old .If b and ĉ0 have the same signs, 2 bĉ∕n > 0 and the model is penalized by the criterion developed in this article, whereas when μ0 is biased in the opposite direction of sign( b), 2 bĉ∕n < 0, resulting in a lower value of FIC new than FIC old unless the estimate of the squared bias is truncated to zero.The above calculations show that there can be real and impactful differences between the old and new FIC.This is, however, not a guarantee that FIC new performs better than FIC old in this example.To investigate whether this is the case, we need to compute the true value of the focus parameter and compare the true MSEs with the corresponding FIC new and FIC old scores.In theory, this is a simple task, but in practice we do now know the true underlying random structure of battle deaths in inter-state wars.Because of this, we cannot check the performance of FIC new and FIC old directly.To get some indication on the quality of the MSE estimates we will therefore instead work with a distribution similar to our data set where the truth is known.To achieve this, we fitted a Lomax model to the data as explained previously.Afterwards, we simulated 10,000 datasets from this model.For each simulated dataset, we computed μ1 , FIC old and FIC new for each of the models considered in this section.We used the difference in median number of battle deaths before and after the Korean war as the focus parameter.Afterwards, we used the 10,000 samples to estimate the expected value of FIC new and FIC old in addition to the true MSE for μ1 for each model.For five out of the eight models considered, the average value of FIC new was closer to the true MSE than FIC old .For the remaining three cases FIC old performed better than FIC new .The wo criteria were more or less equally variable, with VarFIC new ∕VarFIC old ranging from 0.655 to 1.186 for the different models.
Comparing only the FIC-scores gives somewhat of a limited view.FIC old and FIC new differ only in their estimate of the squared bias.The estimated variance is identical in the two criteria.Because of this, an over-or underestimated variance might make it beneficial to under-or overestimate the squared bias, respectively, as this "cancels" parts of the error made by the variance estimate.Although this might lead to more precise estimates of the MSE in certain cases, such cancellation happens mostly due to "luck" and is in general not something one should hope or aim for.We therefore believe it is more relevant to compare the quality of the estimates of the squared bias made by each criterion rather than the full MSE scores.Looking only at the estimated squared bias shows that the cancellation described above has happened in two of the cases where FIC new performed better than FIC old .We find that the estimate of the squared bias in the new criterion is indeed closer to the true value than the estimate provided by FIC old in all but one model.
Choosing to fit a Lomax model and simulate from this is of course not the only way one can generate data similar to the number of battle deaths in the 95 most recent and concluded inter-state wars.We could fit any of the models considered in this section or even draw bootstrap samples from the dataset itself.We have tried out a number of these methods, and similar results were found in all cases.For the sake of clarity and brevity, we will therefore not go into depth about all settings, but taking the average over all simulation settings gives the following result.More often than not, FIC new performed better than FIC old when it comes to estimation of the MSE.Furthermore, the squared bias estimate provided by FIC new were on average more precise than the one from FIC old for a little more than five out of eight models.In addition, the average value of VarFIC new ∕VarFIC old ranged from 0.457 to 1.025 between the different models.

CONCLUDING REMARKS
In this article, we have only considered the framework of i.i.d.data, but extensions to more complicated situations are certainly possible.For regression and classification settings with responses Y i and predictors X i for i = 1,…, n, the formulas derived in this article are directly applicable when the covariates are considered random.This is because the tuples Z i = (Y i , X i ) for i = 1,…, n can be considered as i.i.d.data points in this case.In addition, results like the Lindeberg-Feller theorem (see e.g., Billingsley, 1995, p. 359) allows most of the formulas derived in this article to be generalized to situations where the covariates are considered nonstochastic.It is, however, worth noting that in regression and classification settings, nonparametric estimators satisfying (29) can be hard to come by.This is especially true when the number of covariates is high.In such cases, a solution can be to replace the true parameter,  0 , by the least false parameter in a widest model.This is in some sense the closest we can get to the true value, and the FIC should therefore measure the MSE of estimators relative to this quantity.Claeskens et al. (2019) and Cunen, Walløe, and Hjort (2020) take this approach for the simpler FIC of Jullum and Hjort (2017).Modifications along the same lines can be made for the FIC derived in this article as well, but to use the more precise information criterion developed in this article, c 0 needs to be estimated.This can be achieved using Theorem 3. Further extensions of the iid framework include dependent and censored data .Arguments similar to those given in Hermansen et al. (2015) and Jullum and Hjort (2019), respectively, can be used to generalize the results of this paper to these more complicated data structures.
FIC is a popular research topic and there exists many extensions.Most of which can also be made for the criterion developed in this article as well.We will mention two such extensions, but many other possibilities exist.For an overview of techniques, consult Claeskens and Hjort (2008).
In certain situations, multiple focus parameters are of interest and should be taken into account when performing model selection.One way of achieving this for the FIC is to use average FIC (AFIC).Assume the focus parameters can be written as (u) for u in some index set U. A risk function taking all parameters into account is the following expression, where W is some distribution over U indicating how much each focus parameter matters relative to the others.The risk of the estimators μ(u) for u ∈ U can therefore be estimated by the following expression Here b(u), ĉ0 (u), κ(u)a, and τ(u) are the estimators discussed previously in this article computed for the estimator μ(u).Notice that AFIC does not equal the integral of the individual FIC scores as we truncate possibly negative estimates of the integrated squared bias rather than the individual bias estimates for each u ∈ U to zero.For more information about and variants of AFIC, see Claeskens and Hjort (2008, pp. 179-183).FIC scores can be used for more than strictly selecting the optimal model for estimating the focus parameter.An alternative approach, is to use model averaging.Assume we have p models with estimates μ1 ,…, μp of the focus parameter.Rather than using the FIC to choose the estimator with the lowest MSE, we could use the weighted average μavg = ∑ p j=1 w j μj , where w j are weights summing to one.The weights should be related to the FIC scores such that μj s with low MSE are given more weight than μj s for which FIC( μj ) is large.Inspired by Eq. 6.1 in Jullum and Hjort (2017), we suggest the following form for the weights , for some tuning parameter .In Section 2.2, we derived an estimator for c when μ is a quantile.To use the formula in (10), however, we need consistent estimators of both the density f and its derivative f ′ at  p .The classic kernel density estimators are

ACKNOWLEDGMENTS
In the above, h 1 and h 2 are called the bandwidths and K is the kernel.In principle, the kernel can be any nonnegative and differentiable function, but to simplify calculations we will assume it is symmetric.The obvious choice for such a kernel is the standard normal density, .
The standard strategy is to minimize the asymptotic integrated MSE.Minimizing the L 2 distance between f and f ′ and the corresponding true values is reasonable when a good overall fit is desired.In our situation, however, we are only interested in using the kernel density estimates to approximate f ( p ) and f ′ ( p ).Hence, we should choose bandwidths minimizing the MSE of these estimators and not the full integrated MSE.Arguing as in Singh (1979), one can show that this is achieved by where f (k) denote the kth derivative of f and See Chap. 2 of Wand and Jones (1995) for more details.
To estimate the optimal bandwidths, we need to know the value of f ( p ), f (2) ( p ) and f (3) ( p ).These quantities are, however, unknown by definition of the problem.To estimate them, we therefore suggest fitting parametric models to the data.This can lead to inconsistent estimators of f ( p ), f (2) ( p ) and f (3) ( p ), but this will only result in bandwidths being slightly off from the optimal ones.The estimators in (A1) will still converge in probability to the true values, f ( p ) and f ′ ( p ) when inconsistent estimators of f ( p ), f (2) ( p ), and f (3) ( p ) are used to find optimal bandwidths.
We recommend the following rule of thumb: If the distribution has a heavy tail, fit a Pareto distribution to the data.Otherwise, fit a normal distribution.These distributions are chosen mainly because their maximum likelihood estimators have closed form expressions, allowing f ( p ), f (2) ( p ), and f (3) ( p ) to be estimated without numerical optimization.
Combining all of the above, leads to the following rule of thumb for estimating f ( p ) and f ′ ( p ): ) .
where  is the standard normal density and when the distribution of Y is heavy tailed and otherwise.In the above, σ, Y , and Y (1) are the empirical SD, mean, and minimum value in the sample, respectively.In addition, μp is the empirical quantile and ] −1 .
Lastly, we would like to remark that there exists more sophisticated methods for estimating f ( p ) and f ′ ( p ) than the ones presented here.For instance, rather than using a purely nonparametric or parametric estimate, a semi-parametric approach is possible.Parametric and nonparametric models can be combined as in, for example, Hjort and Glad (1995) and Efron and Tibshirani (1996).Another approach is to use local likelihoods, see Hjort and Jones (1996) and Loader (1996).
Theorem 9 (Functions of unbiased estimators).Let â ∈ R p be an estimator of a 0 such that √ n(â − a 0 ) converges in distribution to some U with finite second moment.Assume that all components of √ n(â − a 0 ) to the power of three are uniformly integrable and that there exists a neighborhood  of a 0 on which h ∶ R p → R has continuous partial derivatives up to order 3. Then Proof.Since all components of √ n(â − a 0 ) are uniformly square integrable, case (vii) of lemma 2 ensures that all components of En(â − a 0 )(â − a 0 ) T are uniformly integrable as well.Hence, En(â − a 0 )(â − a 0 ) T converges to EUU T and nE(â − a 0 ) T Hh(a 0 )(â − a 0 ) = Hh(a 0 )EUU T + o(1).To show the theorem, we therefore only need to prove that, nE n (â) = o(1), where  n (â) is the remainder term in (6).
Choose a compact subset K of  with a 0 in its interior.Since â converges in probability to a 0 , it lies in the interior of K with probability tending to one.Without loss of generality, we will therefore assume that â lies in the interior of K. Since all third-order partial derivatives of h are continuous on this set, the extreme value theorem guarantees that they are bounded by some M ∈ R. Hence, by standard results for Taylor expansions of multivariate functions, See for example, Coroll.6.5.8 in Lindstrøm (2017).In the above, || ⋅ || 1 denotes the L 1 norm on R p .Since all components of √ n(â − a 0 ) to the power of three are uniformly integrable, and n 3∕2 ||â − a 0 || 3 1 is O Pr (1) by the continuous mapping theorem, we have E| n (â)| = O(n −3∕2 ).▪ For the special case of empirical means, the theorem simplifies somewhat.
Corollary 1 (Functions of means).Let Y 1 ,…, Y n ∈ R d be random variables with empirical mean μ and expected value  0 .Assume further that the expected value of all components of Y i to the power of four exists, and that h is a function with continuous partial derivatives up to order three in a neighborhood of  0 .Then Proof.By the central limit theorem √ n( μ −  0 ) converges to a normal distribution with variance VarY i .Because of this, the result follows from Theorem 9 provided all components of √ n( μ −  0 ) to the power of three are uniformly integrable.By case (iv) of lemma 2, this holds true when all components of √ n( μ −  0 ) to the power of four are uniformly integrable.We will show this latter statement.
Fix j.By the continuous mapping theorem, n 2 [( μ ) j − ( 0 ) j ] 4 converges in distribution to U 4 where U ∼ N[0, (VarY i ) j,j ].Furthermore, utilizing that Y 1 ,…, Y n are independent and have mean zero one can show We will now prove (2).The argument for (3) is similar and hence omitted.We are now ready to prove Theorem 3 of the article.
Theorem 11.Let all quantities be defined as in Lemma 2 and assume that (A1-A3) hold true.Then the bias of θ −  lf is on the form given in Theorem 3 of the article, provided the following additional assumption: • There exists a neighborhood,  of  lf and an integrable function m such that all fourth-order partial derivatives are bounded by m(y) and Em(Y ) 4 exists and is finite.
Proof.By the mean value theorem, the following holds for all i, j, k = 1,…, p, for some  * on the line segment between  and  lf .Hence, if we assume without loss of generality that  is convex, the existence of m as described in (A4), ensures that for all  ∈  , for some C > 0. If we further, without loss of generality, assume that  is compact we can use the above equation and the triangle inequality to show for some positive C ′ ∈ R. The right-hand side does not depend on  and has finite expectation, hence all third-order partial derivatives of log f  (y) are bounded by a fixed integrable function.This together with condition (A1) and (A2) of Lemma 3, ensures that θ exists with probability tending to 1, θ converges in probability to  lf and that θ =  lf + (nJ) −1  n ( lf ) + o Pr (1). (B1) This follows from Thms.5.41 and 5.42 in Van der Vaart (1998).Since θ converges to  lf in probability, the probability of θ lying in  goes to one.In the following, we will therefore assume that θ ∈  without loss of generality.
To show Theorem 3 in the article we need to prove that the expected value of the remainder term  n ( θ) is of order O(1∕ √ n), and that each component of ) T is uniformly integrable.The rest of the argument is similar to that given in the article.
We start with the remainder term  n ( θ).Since m bounds all fourth order partial derivatives of log f  (y) in  for almost all y, standard results concerning remainder terms in Taylor expansions can be used to show Taking expectations on both sides of the inequality shows, , which, by standard rules for the trace operator, is equivalent to nE( μ − ) = ∇h() T nb n + (1∕2)Tr[Hh()Σ] + o(1).By assumption nb n converges to c  , and hence, c = ∇h() T c  + (1∕2)Tr[Hh()Σ].

TA B L E 1
Each row in the table describes the nonparametric quantile estimator (1 − )Y (j) + Y (j+1) .
Fix j and k.By the law of large numbers n[ (J n ) j,k − J j,k ] 2 converges almost surely to VarH  lf log f  (Y i ) j,k .Furthermore, the expected value of n [ (J n ) j,k − J j,k ] 2 is equal to VarH  lf log f  (Y i ) j,k .Hence, by case (v) of Lemma 2, n [ (J n ) j,k − J j,k ] 2 is uniformly integrable.▪