A central limit theorem for the Hellinger loss of Grenander type estimators

We consider Grenander type estimators for a monotone function $\lambda:[0,1]\to\mathbb{R}$, obtained as the slope of a concave (convex) estimate of the primitive of $\lambda$. Our main result is a central limit theorem for the Hellinger loss, which applies to statistical models that satisfy the setup in Durot (2007). This includes estimation of a monotone density, for which the limiting variance of the Hellinger loss turns out to be independent of $\lambda$.


INTRODUCTION
One of the problems in shape-constrained nonparametric statistics is to estimate a real-valued function under monotonicity constraints. Early references for this type of problem are Grenander (1956), Brunk (1958), and Marshall and Proschan (1965), concerning the estimation of a probability density, a regression function, and a failure rate under monotonicity constraints, respectively. The asymptotic distribution of these types of estimators was first obtained by Prakasa Rao (1969, 1970) and reproved by Groeneboom (1985), who introduced a more accessible approach based on inverses. The latter approach initiated a stream of research on isotonic estimators; see, for example, Groeneboom and Wellner (1992), Huang and Zhang (1994), Huang and Wellner (1995), and Lopuhaä and Nane (2013). Typically, the pointwise asymptotic behavior of isotonic estimators is characterized by a cube-root-$n$ rate of convergence and a nonnormal limit distribution.
The situation is different for global distances. A central limit theorem for the $L_1$-error of the Grenander estimator of a monotone density was obtained in Groeneboom (1985) (see also Groeneboom, Hooghiemstra, & Lopuhaä, 1999), and a similar result was established in Durot (2002) for the regression context. Extensions to general $L_p$-errors can be found in Kulikov and Lopuhaä (2005) and Durot (2007), where the latter provides a unified approach that applies to a variety of statistical models. For the same general setup, an extremal limit theorem for the supremum distance was obtained in Durot, Kulikov, and Lopuhaä (2012).
Another widely used global measure of departure from the true parameter of interest is the Hellinger distance. It is a convenient metric in maximum likelihood problems, going back to the work of LeCam (1970, 1973), and it has nice connections with Bernstein norms and empirical process methods for obtaining rates of convergence, due fundamentally to the work of Birgé and Massart (1993), Wong and Shen (1995), and others; see section 3.4 of van der Vaart and Wellner (1996) or chapter 4 in van de Geer (2000) for a more detailed overview. Consistency in Hellinger distance of shape-constrained maximum likelihood estimators was investigated by Pal, Woodroofe, and Meyer (2007), Seregin and Wellner (2010), and Doss and Wellner (2016), whereas rates on Hellinger risk measures were obtained in Seregin and Wellner (2010) and Kim, Guntuboyina, and Samworth (2016).
In contrast with $L_p$-distances or the supremum distance, no distribution theory is available for the Hellinger loss of shape-constrained nonparametric estimators. In this paper, we present a first result in this direction, that is, a central limit theorem for the Hellinger loss of Grenander-type estimators for a monotone function $\lambda$. This type of isotonic estimator was also considered by Durot (2007) and is defined as the left-hand slope of a concave (or convex) estimate of the primitive of $\lambda$, based on $n$ observations. We establish our results under the same general setup as Durot (2007), which includes estimation of a probability density, a regression function, or a failure rate under monotonicity constraints. In fact, after approximating the squared Hellinger distance by a weighted $L_2$-distance, a central limit theorem can be obtained by mimicking the approach introduced in Durot (2007). An interesting feature of our main result is that, in the monotone density model, the variance of the limiting normal distribution for the Hellinger distance does not depend on the underlying density. This phenomenon was also encountered for the $L_1$-distance in Groeneboom (1985) and Groeneboom et al. (1999).
In Section 2, we define the setup and approximate the squared Hellinger loss by a weighted $L_2$-distance. A central limit theorem for the Hellinger distance is established in Section 3. We end this paper with a short discussion of the consequences for particular statistical models and a simulation study on testing exponentiality against a nonincreasing density.

DEFINITIONS AND PREPARATORY RESULTS
Consider the problem of estimating a nonincreasing (or nondecreasing) function $\lambda: [0,1] \to \mathbb{R}^+$ on the basis of $n$ observations. Suppose that we have at hand a cadlag step estimator $\Lambda_n$ of the primitive $\Lambda$ of $\lambda$. If $\lambda$ is nonincreasing, then the Grenander-type estimator $\hat\lambda_n$ of $\lambda$ is defined as the left-hand slope of the least concave majorant of $\Lambda_n$, with $\hat\lambda_n(0) = \lim_{t\downarrow 0}\hat\lambda_n(t)$. If $\lambda$ is nondecreasing, then the Grenander-type estimator $\hat\lambda_n$ of $\lambda$ is defined as the left-hand slope of the greatest convex minorant of $\Lambda_n$, with $\hat\lambda_n(0) = \lim_{t\downarrow 0}\hat\lambda_n(t)$. We aim at proving the asymptotic normality of the Hellinger distance between $\hat\lambda_n$ and $\lambda$, defined by
$$
H(\hat\lambda_n, \lambda) = \left( \frac{1}{2} \int_0^1 \left( \sqrt{\hat\lambda_n(t)} - \sqrt{\lambda(t)} \right)^2 \mathrm{d}t \right)^{1/2}.
$$
We will consider the same general setup as in the work of Durot (2007), that is, we will assume the conditions stated there. Durot (2007) also considered an additional condition (A2) in order to obtain bounds on $p$th moments; see theorem 1 and corollary 1 in Durot (2007). However, we only need condition (A2') for our purposes.
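For concreteness, the left-hand-slope construction can be sketched in a few lines of code. The snippet below is a minimal illustration of the nonincreasing-density case (the function and variable names are ours, not the paper's): it builds the least concave majorant of the empirical distribution function with a stack-based hull scan and returns its slopes, which constitute the Grenander estimator.

```python
import numpy as np

def grenander_slopes(x):
    """Left-hand slopes of the least concave majorant (LCM) of the
    empirical distribution function of the sample x (assumed to have
    no ties, as with continuous data).
    Returns (knots, slopes): the estimator equals slopes[j] on
    (knots[j], knots[j+1]]."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # vertices of the empirical CDF, starting from the origin
    t = np.concatenate(([0.0], x))
    F = np.arange(n + 1) / n
    # build the LCM with a stack: keep only vertices on the upper hull
    hull = [0]
    for i in range(1, n + 1):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # drop k if it lies on or below the chord from j to i
            if (F[k] - F[j]) * (t[i] - t[j]) <= (F[i] - F[j]) * (t[k] - t[j]):
                hull.pop()
            else:
                break
        hull.append(i)
    knots = t[hull]
    slopes = np.diff(F[hull]) / np.diff(knots)
    return knots, slopes

# small usage example on a simulated sample
rng = np.random.default_rng(0)
x = rng.exponential(size=200)
knots, slopes = grenander_slopes(x)
```

By construction, the slopes are nonincreasing and the resulting step function integrates to one over $[0, X_{(n)}]$.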
(A3) $\hat\lambda_n(0)$ and $\hat\lambda_n(1)$ are stochastically bounded. (A4) Let $B_n$ be either a Brownian bridge or a Brownian motion. There exist $q > 12$, $C_q > 0$, $L: [0,1] \to \mathbb{R}$, and versions of $M_n = \Lambda_n - \Lambda$ and $B_n$, such that
$$
P\left( n^{1-1/q} \sup_{t \in [0,1]} \left| M_n(t) - n^{-1/2} B_n(L(t)) \right| > x \right) \le C_q x^{-q}
$$
for all $x \in (0, n]$. In Durot (2007), a variety of statistical models are discussed for which the above assumptions are satisfied, such as estimation of a monotone probability density, a monotone regression function, and a monotone failure rate under right censoring. In Section 4, we briefly discuss the consequences of our main result for these models. We restrict ourselves to the case of a nonincreasing function $\lambda$. The case of nondecreasing $\lambda$ can be treated similarly. Note that, even if this may not be a natural assumption, for example, in the regression setting, we need to assume that $\lambda$ is positive for the Hellinger distance to be well defined. The reason that one can expect a central limit theorem for the Hellinger distance is the fact that the squared Hellinger distance can be approximated by a weighted squared $L_2$-distance. This can be seen as follows:
$$
H^2(\hat\lambda_n, \lambda) = \frac{1}{2} \int_0^1 \frac{ \left( \hat\lambda_n(t) - \lambda(t) \right)^2 }{ \left( \sqrt{\hat\lambda_n(t)} + \sqrt{\lambda(t)} \right)^2 } \,\mathrm{d}t \approx \frac{1}{2} \int_0^1 \frac{ \left( \hat\lambda_n(t) - \lambda(t) \right)^2 }{ 4\lambda(t) } \,\mathrm{d}t. \qquad (3)
$$
Because $L_2$-distances for Grenander-type estimators obey a central limit theorem (e.g., Durot, 2007; Kulikov & Lopuhaä, 2005), similar behavior might be expected for the squared Hellinger distance. An application of the delta method will then do the rest.
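This second-order approximation is easy to check numerically. The sketch below is our own illustration, with an arbitrary smooth $\lambda$ bounded away from zero and a small perturbation playing the role of $\hat\lambda_n - \lambda$; it compares the squared Hellinger distance with its weighted squared $L_2$-approximation on a midpoint grid.

```python
import numpy as np

# midpoint grid on [0, 1]
m = 10_000
t = (np.arange(m) + 0.5) / m

lam = 2.0 - t                            # a positive, nonincreasing "true" function
g = lam + 0.05 * np.cos(5 * np.pi * t)   # small perturbation, stand-in for the estimator

# squared Hellinger distance: (1/2) * int (sqrt(g) - sqrt(lam))^2 dt
h2 = 0.5 * np.mean((np.sqrt(g) - np.sqrt(lam)) ** 2)
# weighted squared L2-approximation: (1/2) * int (g - lam)^2 / (4 lam) dt
l2w = 0.5 * np.mean((g - lam) ** 2 / (4.0 * lam))

rel_err = abs(h2 - l2w) / h2
```

With a perturbation of size 0.05 relative to $\inf_t \lambda(t) = 1$, the two quantities agree to within a few percent, as the pointwise error of the approximation is of the order of the relative perturbation.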
The next lemma makes the approximation in (3) precise.
In order to prove Lemma 1, we need the preparatory lemma below. To this end, we introduce the inverse of $\hat\lambda_n$, defined by
$$
\hat U_n(a) = \operatorname*{argmax}_{t \in [0,1]} \{ \Lambda_n(t) - at \},
$$
and we choose $p' \in (3/2, 2)$. It follows that $I_1 = J_1 + o_P(n^{-5/6})$. Next, we apply a change of variable in the integral over $\hat U_n$. Then, by a Taylor expansion, (A1), and (4), there exists a $K > 0$ such that the corresponding bound holds for all $b \in (\lambda(1), \lambda(0))$ and $t \in (g(b), 1]$. We find, by using (23) in Durot (2007), that for every $q' < 3(q-1)$ there exists $K_{q'} > 0$ such that the tail bound (9) holds. In the same way, one finds the analogous bound for the remaining term. Now, because $\lambda'$ is bounded, by Markov's inequality, for each $\epsilon > 0$, we can bound the corresponding probability; for the last inequality, we again used (9) with $q' = 3$. This finishes the proof.
Proof of Lemma 1. Similar to (3), we bound the remainder $R_n$, for some positive constant $C$ depending only on $\lambda(0)$ and $\lambda(1)$. Then, from Lemma 2, it follows that $n^{5/6} R_n = o_P(1)$.

MAIN RESULT
In order to formulate the central limit theorem for the Hellinger distance, we introduce the process $X$, defined as
$$
X(a) = \operatorname*{argmax}_{u \in \mathbb{R}} \left\{ W(u) - (u - a)^2 \right\},
$$
with $W$ being a standard two-sided Brownian motion. This process was introduced and investigated by Groeneboom (1985, 1989) and plays a key role in the asymptotic behavior of isotonic estimators. The distribution of the random variable $X(0)$ is the pointwise limiting distribution of several isotonic estimators, and related constants appear in the limiting variance of the $L_p$-error of isotonic estimators (e.g., Durot, 2002, 2007; Groeneboom, 1985; Groeneboom et al., 1999; Kulikov & Lopuhaä, 2005). We then have the following central limit theorem for the squared Hellinger loss.
Proof. According to Lemma 1, it is sufficient to show that $n^{1/6}(n^{2/3} I_n - \mu^2) \to N(0, \sigma^2)$, with
$$
I_n = \int_0^1 \frac{ \left( \hat\lambda_n(t) - \lambda(t) \right)^2 }{ 4\lambda(t) } \,\mathrm{d}t.
$$
Again, we follow the same line of reasoning as in the proof of theorem 2 in Durot (2007) and briefly sketch the main steps of the proof. We first express $I_n$ in terms of the inverse process $\hat U_n$ defined in (5). To this end, similar to the proof of Lemma 2, we split the integral and write the first integral in terms of $\hat U_n$. Then, if we introduce $\tilde J_1$, we obtain, similar to the reasoning in the proof of Lemma 2, that $\tilde I_1 = \tilde J_1 + o_P(n^{-5/6})$. Next, we apply the change of variable $b = \lambda(t) + \sqrt{4a\lambda(t)}$. Let us first consider the second integral on the right-hand side of (14); it is negligible, again by using (9) with $q' = 3$. Then, consider the first integral on the right-hand side of (14). Similar to (7), there exists $K > 0$ such that the corresponding expansion holds for all $b \in (\lambda(1), \lambda(0))$ and $t \in (g(b), 1]$. Taking into account that $\lambda'(g(b)) < 0$, similar to (8), it follows, by using (9) once more and the fact that $s > 3/4$, that this term is also negligible, and the same holds in the same way for its counterpart. We then mimic step 2 in the proof of theorem 2 in Durot (2007). Consider the representation $W_n(t) = B_n(t) + \xi_n t$, where $W_n$ is a standard Brownian motion, $\xi_n = 0$ if $B_n$ is a Brownian motion, and $\xi_n$ is a standard normal random variable independent of $B_n$ if $B_n$ is a Brownian bridge. Then, define a rescaled version of $W_n$, which has the same distribution as a standard Brownian motion, and, for $t \in [0,1]$, let $Y_n(t)$ be the corresponding localized process. Then, similar to (26) in Durot (2007), we will obtain (16). To prove (16), by using the approximation
$$
\hat U_n(a) - g(a) \approx \frac{ L(\hat U_n(a)) - L(g(a)) }{ L'(g(a)) }
$$
and a change of variable $\bar a = a - n^{-1/2} \xi_n L'(g(a))$, we first obtain an expression, with $\eta_n = n^{-1/6}/\log n$, in which, apart from the factor $1/4a$, the integral on the right-hand side is the same as in the proof of theorem 2 in Durot (2007) for $p = 2$.
This means that we can apply the same series of succeeding approximations for $L(\hat U_n(\bar a)) - L(g(\bar a))$ as in Durot (2007). Finally, because the integrals over $[\lambda(1), \lambda(1)+\eta_n]$ and $[\lambda(0)-\eta_n, \lambda(0)]$ are of the order $o_P(n^{-1/6})$, this yields (16) by a change of variables $t = g(a)$. The next step is to show that the term with $\xi_n$ can be removed from (16). This can be done exactly as in Durot (2007), because the only difference with the corresponding integral in Durot (2007) is the factor $1/4\lambda(t)$, which is bounded and does not influence the argument. By approximating $\tilde V(t)$ suitably and using that, by Brownian scaling, $d(t)^{2/3} V(t)$ has the same distribution as $X(0)$ (see Durot, 2007, for details), we arrive at an integral of the process $Y_n$. We then first compute the limit of the variance. Once more, following the proof in Durot (2007), we have $v_n = \mathrm{Var}\left( n^{1/6} \int_0^1 Y_n(t)\,\mathrm{d}t \right)$, and after the same sort of approximations as in Durot (2007), we get a covariance representation with $c_n = 2n^{-1/3}\log n / \inf_t L'(t)$, where, for all $s$ and $t$, $d(s)^{2/3} V_t(s)$ has the same distribution as the corresponding localized argmax process, so that the change of variable $a = n^{1/3} d(s)^{2/3}(L(t) - L(s))$ in $v_n$ leads to (18). Finally, asymptotic normality of $n^{1/6} \int_0^1 Y_n(t)\,\mathrm{d}t$ follows by Bernstein's method of big blocks and small blocks in the same way as in step 6 of the proof of theorem 2 in Durot (2007). Corollary 1 then states the corresponding result for the Hellinger distance itself, with $\tilde\mu = 2^{-1/2}\mu$ and $\tilde\sigma^2 = \sigma^2/(8\mu^2)$, where $\mu^2$ and $\sigma^2$ are defined in Theorem 1.
Proof. This follows immediately by applying the delta method with $\phi(x) = 2^{-1/2}\sqrt{x}$ to the result in Theorem 1.
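As a quick numerical sanity check of this delta-method step (with toy values for the mean and standard deviation, not the constants of Theorem 1): if a statistic is approximately $N(m, s^2)$ with small $s$, then $\phi(x) = 2^{-1/2}\sqrt{x}$ maps it to a statistic that is approximately normal with mean $\phi(m)$ and standard deviation $|\phi'(m)|\,s$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, s = 4.0, 0.01                     # toy mean/sd of the squared-distance statistic
v = rng.normal(m, s, size=200_000)   # draws playing the role of the statistic

phi = lambda x: 2.0 ** -0.5 * np.sqrt(x)
w = phi(v)

# delta-method predictions: mean phi(m), sd |phi'(m)| * s
pred_mean = phi(m)
pred_sd = 2.0 ** -0.5 / (2.0 * np.sqrt(m)) * s
```

The Monte Carlo mean and standard deviation of `w` match the predictions closely, because $s$ is small relative to $m$.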

EXAMPLES
The type of scaling for the Hellinger distance in Corollary 1 is similar to that in the central limit theorem for $L_p$-distances. This could be expected in view of the approximation in terms of a weighted squared $L_2$-distance (see Lemma 1) and the results in, for example, Kulikov and Lopuhaä (2005) and Durot (2007). However, this is not always the case. The phenomenon of the Hellinger distance converging at a different speed from the $L_1$- and $L_2$-norms was considered by Birgé (1986). In fact, this is related to the existence of a lower bound for the function we are estimating. If the function of interest is bounded from below, which is the case considered in this paper, then the approximation (3) holds; see Birgé (1986) for an explanation.
When we insert the expressions for $\mu^2$ and $\sigma^2$ from Theorem 1 into $\tilde\sigma^2 = \sigma^2/(8\mu^2)$, where $k_2$ is defined in (12), we find that, in statistical models where $L = \Lambda$ in condition (A4) and, hence, $L' = \lambda$, the limiting variance $\tilde\sigma^2 = k_2/(4\mathbb{E}[|X(0)|^2])$ does not depend on $\lambda$. One such model is estimation of the common monotone density $\lambda$ on $[0,1]$ of independent random variables $X_1, \ldots, X_n$. Then, $\Lambda_n$ is the empirical distribution function of $X_1, \ldots, X_n$, and $\hat\lambda_n$ is Grenander's estimator (Grenander, 1956). In that case, if $\inf_t \lambda(t) > 0$, the conditions of Corollary 1 are satisfied with $L = \Lambda$ (see theorem 6 in Durot, 2007), so that the limiting variance of the Hellinger loss for the Grenander estimator does not depend on the underlying density. This behavior was conjectured in Wellner (2015) and coincides with that of the limiting variance in the central limit theorem for the $L_1$-error of the Grenander estimator, first discovered by Groeneboom (1985); see also Durot (2002, 2007), Groeneboom et al. (1999), and Kulikov and Lopuhaä (2005).
Another example is when we observe independent identically distributed inhomogeneous Poisson processes $N_1, \ldots, N_n$ with common mean function $\Lambda$ on $[0,1]$ with derivative $\lambda$, for which $\Lambda(1) < \infty$. Then, $\Lambda_n$ is the restriction of $(N_1 + \cdots + N_n)/n$ to $[0,1]$. Also in that case, the conditions of Corollary 1 are satisfied with $L = \Lambda$ (see theorem 4 in Durot, 2007), so that the limiting variance of the Hellinger loss for $\hat\lambda_n$ does not depend on the common underlying intensity $\lambda$. However, note that, for this model, the $L_1$-loss for $\hat\lambda_n$ is asymptotically normal according to theorem 2 in Durot (2007), but with limiting variance depending on the value $\Lambda(1) - \Lambda(0)$.
Consider the monotone regression model $y_{i,n} = \lambda(i/n) + \epsilon_{i,n}$, for $i = 1, \ldots, n$, where the $\epsilon_{i,n}$'s are i.i.d. random variables with mean zero and variance $\sigma^2 > 0$, and let $\Lambda_n$ be the corresponding cadlag step estimator of the primitive of $\lambda$. Then, $\hat\lambda_n$ is (a slight modification of) Brunk's (1958) estimator. Under appropriate moment conditions on the $\epsilon_{i,n}$, the conditions of Corollary 1 are satisfied with $L(t) = \sigma^2 t$ (see theorem 5 in Durot, 2007). In this case, the limiting variance of the Hellinger loss for $\hat\lambda_n$ depends on both $\lambda$ and $\sigma^2$, whereas the $L_1$-loss for $\hat\lambda_n$ is asymptotically normal according to theorem 2 in Durot (2007), but with limiting variance depending only on $\sigma^2$. Finally, suppose we observe a right-censored sample $(X_1, \Delta_1), \ldots, (X_n, \Delta_n)$, where the event times have distribution function $F$ and the censoring times have distribution function $G$; see theorem 3 in Durot (2007). This means that the limiting variance of the Hellinger loss depends on $\lambda$, $F$, and $G$, whereas the limiting variance of the $L_1$-loss depends only on their values at 0 and 1. In particular, in the case of nonrandom censoring times, $L = (1-F)^{-1} - 1$, the limiting variance of the Hellinger loss depends on $\lambda$ and $F$, whereas the limiting variance of the $L_1$-loss depends only on the value $F(1)$.

TESTING EXPONENTIALITY AGAINST A NONINCREASING DENSITY
In this section, we investigate a possible application of Theorem 1, namely testing for an exponential density against a nonincreasing alternative by means of the Hellinger loss. The exponential distribution is one of the most used and well-known distributions. It plays a very important role in reliability, survival analysis, and renewal process theory, when modeling random times until some event. As a result, a lot of attention has been given in the literature to testing for exponentiality against a wide variety of alternatives, making use of different properties and characterizations of the exponential distribution (Alizadeh Noughabi & Arghami, 2011; Haywood & Khmaladze, 2008; Jammalamadaka & Taufer, 2003; Meintanis, 2007). Here, we consider a test for exponentiality, assuming that the data come from a nonincreasing density. The test is based on the Hellinger distance between the parametric estimator of the exponential density and the Grenander-type estimator of a general nonincreasing density. In order to be able to apply the result of Corollary 1, we first investigate a test of whether the data are exponentially distributed with a fixed parameter $\theta_0 > 0$. Because such a test may not be very interesting from a practical point of view, we also investigate testing exponentiality with the parameter $\theta > 0$ left unspecified.

Testing a simple null hypothesis of exponentiality
Let $f_\theta(x) = \theta e^{-\theta x} \mathbf{1}_{\{x \ge 0\}}$ be the exponential density with parameter $\theta > 0$. Assume we have a sample of i.i.d. observations $X_1, \ldots, X_n$ from some distribution with density $f$ and, for $\theta_0 > 0$ fixed, we want to test
$$
H_0: f = f_{\theta_0} \quad \text{against} \quad H_1: f \text{ is nonincreasing and } f \neq f_{\theta_0}.
$$
Under the alternative hypothesis, we can estimate $f$ on an interval $[0, \tau]$ by the Grenander-type estimator $\hat\lambda_n$ from Section 2. Then, as a test statistic, we take $T_n = H(\hat\lambda_n, f_{\theta_0})$, the Hellinger distance on $[0, \tau]$ between $\hat\lambda_n$ and $f_{\theta_0}$, and at level $\alpha$, we reject the null hypothesis if $T_n > c_{n,\alpha,\theta_0}$, for some critical value $c_{n,\alpha,\theta_0} > 0$. According to Corollary 1, $T_n$ is asymptotically normally distributed, but the mean and the variance depend on the constant $k_2$ defined in (12). To avoid computation of $k_2$, we estimate the mean and the variance of $T_n$ empirically. We generate $B = 10{,}000$ samples from $f_{\theta_0}$. For each of these samples, we compute the Grenander estimator $\hat\lambda_{n,i}$ and the Hellinger distance $T_{n,i} = H(\hat\lambda_{n,i}, f_{\theta_0})$, for $i = 1, 2, \ldots, B$. Finally, we compute the mean $\bar T$ and the standard deviation $s_T$ of the values $T_{n,1}, \ldots, T_{n,B}$. For the critical value of the test, we take $c_{n,\alpha,\theta_0} = \bar T + q_{1-\alpha} s_T$, where $q_{1-\alpha}$ is the $100(1-\alpha)\%$ quantile of the standard normal distribution. Note that, even though in the density model the asymptotic variance is independent of the underlying distribution, the asymptotic mean does depend on $\theta_0$; that is, the test is not distribution free. Another possibility, instead of the normal approximation, is to take as critical value $\tilde c_{n,\alpha,\theta_0}$ the empirical $100(1-\alpha)\%$ quantile of the values $T_{n,1}, \ldots, T_{n,B}$.
To investigate the performance of the test, we generate $N = 10{,}000$ samples from $f_{\theta_0}$. For each sample, we compute the value of the test statistic $T_n = H(\hat\lambda_n, f_{\theta_0})$, and we reject the null hypothesis if $T_n > c_{n,\alpha,\theta_0}$ (or if $T_n > \tilde c_{n,\alpha,\theta_0}$). The percentage of rejections gives an approximation of the level of the test. Table 1 shows the results of the simulations for different sample sizes $n$, two values of $\theta_0$, and $\alpha = 0.01, 0.05, 0.10$. Here, we take $\tau = 5$, because the mass of the exponential distribution with parameter one or five outside the interval $[0, 5]$ is negligible. We observe that the percentage of rejections is close to the nominal level if we use $\tilde c_{n,\alpha,\theta_0}$ as a critical value, but it is somewhat higher if we use $c_{n,\alpha,\theta_0}$. This is due to the fact that, for small sample sizes, the normal approximation of Corollary 1 is not very precise. Moreover, to investigate the power, we generate a sample from the Weibull distribution with shape parameter $\beta$ and scale parameter $\theta_0^{-1}$. Recall that Weibull$(1, \theta_0^{-1})$ corresponds to the exponential distribution with parameter $\theta_0$ and that a Weibull distribution with $\beta < 1$ has a nonincreasing density. We compute the Hellinger distance $T_n = H(\hat\lambda_n, f_{\theta_0})$, and we reject the null hypothesis if $T_n > c_{n,\alpha,\theta_0}$ (or if $T_n > \tilde c_{n,\alpha,\theta_0}$). After repeating the procedure $N = 10{,}000$ times, we compute the percentage of times that we reject the null hypothesis, which gives an approximation of the power of the test.
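The Monte Carlo calibration described above can be condensed into the following sketch (our own minimal implementation, with far fewer calibration samples than the $B = 10{,}000$ used in the study; it uses the empirical-quantile critical value $\tilde c_{n,\alpha,\theta_0}$, and all function names are ours).

```python
import numpy as np

def grenander_slopes(x):
    """Left-hand slopes of the least concave majorant of the empirical CDF."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    t = np.concatenate(([0.0], x))
    F = np.arange(n + 1) / n
    hull = [0]
    for i in range(1, n + 1):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            if (F[k] - F[j]) * (t[i] - t[j]) <= (F[i] - F[j]) * (t[k] - t[j]):
                hull.pop()
            else:
                break
        hull.append(i)
    return t[hull], np.diff(F[hull]) / np.diff(t[hull])

def hellinger(knots, slopes, f, tau, m=2000):
    """Hellinger distance on [0, tau] between the step estimator and a density f."""
    s = (np.arange(m) + 0.5) * tau / m                # midpoint grid
    idx = np.clip(np.searchsorted(knots, s, side='left') - 1, 0, len(slopes) - 1)
    lam_hat = np.where(s <= knots[-1], slopes[idx], 0.0)
    h2 = 0.5 * np.mean((np.sqrt(lam_hat) - np.sqrt(f(s))) ** 2) * tau
    return np.sqrt(h2)

def critical_value(theta0, n, tau, alpha, B, rng):
    """Empirical (1 - alpha) quantile of T_n under H0: f = f_theta0."""
    f0 = lambda s: theta0 * np.exp(-theta0 * s)
    T = np.empty(B)
    for i in range(B):
        knots, slopes = grenander_slopes(rng.exponential(1 / theta0, size=n))
        T[i] = hellinger(knots, slopes, f0, tau)
    return np.quantile(T, 1 - alpha)

rng = np.random.default_rng(2)
n, tau, alpha, B = 100, 5.0, 0.05, 200      # B much smaller than in the study
c = critical_value(1.0, n, tau, alpha, B, rng)

# test one new sample generated under the null hypothesis
knots, slopes = grenander_slopes(rng.exponential(1.0, size=n))
f0 = lambda s: np.exp(-s)
reject = hellinger(knots, slopes, f0, tau) > c
```

With the exact constants of Corollary 1 unavailable in closed form, this simulation-based calibration mirrors the role of $\tilde c_{n,\alpha,\theta_0}$ in the study.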
The results of the simulations, done with $n = 100$, $\theta_0 = 1$, $\alpha = 0.05$, and alternatives for which $\beta$ varies between 0.4 and 1 in steps of 0.05, are shown in Figure 1. As a benchmark, we compute the power of the likelihood ratio (LR) test for each $\beta$. As expected, our test is less powerful than the LR test, which is designed to test against a particular alternative. However, as the sample size increases, the performance improves significantly, and the difference between the results when using $c_{n,\alpha,\theta_0}$ or $\tilde c_{n,\alpha,\theta_0}$ becomes smaller.

Testing a composite null hypothesis of exponentiality

To investigate the level of the test with the parameter $\theta$ left unspecified, for $\alpha = 0.05$ and $\theta > 0$ fixed, we start with a sample from an exponential distribution with parameter $\theta$ and repeat the bootstrap procedure described below $N = 10{,}000$ times. We count the number of times we reject the null hypothesis, that is, the number of times the value of the test statistic exceeds the corresponding 5th upper percentile. Dividing this number by $N$ gives an approximation of the level. Table 2 shows the results of the simulations for different sample sizes $n$ and different values of $\theta$. The rejection probabilities are close to 0.05 for all values of $\theta$, which shows that the test performs well in the different scenarios (slightly and strongly decreasing densities).
To investigate the power, for $\alpha = 0.05$ and fixed $0 < \beta < 1$ and $\theta > 0$, we now start with a sample from a Weibull distribution with shape parameter $\beta$ and scale parameter $\theta^{-1}$ and compute the value $R_n = H(\hat\lambda_n, f_{\hat\theta_n})$. In order to calibrate the test, we treat this sample as if it were an exponential sample and estimate $\theta$ by $\hat\theta_n = n/\sum_{i=1}^n X_i$. Next, we generate $B = 1{,}000$ bootstrap samples of size $n$ from the exponential density with parameter $\hat\theta_n$. For each bootstrap sample, we compute the test statistic $R^*_{n,i} = H(\hat\lambda^*_{n,i}, f_{\hat\theta^*_{n,i}})$, for $i = 1, 2, \ldots, B$, and we determine the 5th upper percentile $d^*_{n,0.05}$ of the values $R^*_{n,1}, \ldots, R^*_{n,B}$. Finally, we reject the null hypothesis if $R_n > d^*_{n,0.05}$. After repeating the above procedure $N = 10{,}000$ times, each time starting with a Weibull sample, we compute the percentage of times that we reject the null hypothesis, which gives an approximation of the power of the test.
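The parametric bootstrap calibration has the generic shape sketched below, with the test statistic abstracted as a callable. As a placeholder we plug in a simple distance between the empirical and fitted exponential distribution functions rather than the Hellinger statistic of the paper, so everything here is illustrative only and all names are ours.

```python
import numpy as np

def bootstrap_test(x, statistic, alpha=0.05, B=1000, rng=None):
    """Parametric bootstrap test of exponentiality with unspecified rate.

    statistic(sample, theta) should measure the discrepancy between the
    sample and the exponential density with parameter theta (in the paper,
    the Hellinger distance between the Grenander-type estimator and the
    fitted exponential density)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    n = x.size
    theta_hat = n / x.sum()                  # MLE of the rate under the null
    r_obs = statistic(x, theta_hat)
    r_star = np.empty(B)
    for i in range(B):
        xs = rng.exponential(1 / theta_hat, size=n)   # bootstrap sample under H0
        theta_star = n / xs.sum()                     # re-estimate on each resample
        r_star[i] = statistic(xs, theta_star)
    # reject when the observed statistic exceeds the bootstrap (1 - alpha) quantile
    return r_obs > np.quantile(r_star, 1 - alpha)

# toy placeholder statistic: sup-distance between empirical and fitted exponential CDF
def ks_like(sample, theta):
    s = np.sort(sample)
    emp = np.arange(1, s.size + 1) / s.size
    return np.max(np.abs(emp - (1 - np.exp(-theta * s))))

rng = np.random.default_rng(3)
reject_null = bootstrap_test(rng.exponential(1.0, size=100), ks_like, B=300, rng=rng)
```

In the simulation study, `statistic` would instead compute the Hellinger distance between the Grenander-type estimator of the sample and the exponential density with the (re-)estimated parameter.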
We compare the Hellinger distance test with some of the tests from Alizadeh Noughabi and Arghami (2011), which are designed to test exponentiality against all possible alternatives, that is, not only against nonincreasing densities. These tests are all distribution free, which means that their critical values can be computed independently of $\theta$. Then, for each of the Weibull samples generated before, we count the percentage of times that the tests $T_1$, $T_2$, $\chi^2_n$, $S_n$, $EP_n$, $KL_{mn}$, and $CO_n$ (see Alizadeh Noughabi & Arghami, 2011, for a precise definition) reject the null hypothesis. Finally, we also compare the power of our test with the LR test for each $\beta$.

FIGURE 2: Simulated powers of the Hellinger distance test (black solid), the competitor tests $T_1$ (blue), $T_2$ (green), $\chi^2_n$ (yellow), $S_n$ (brown), $EP_n$ (red), $KL_{mn}$ (purple), and $CO_n$ (orange), and the likelihood ratio test (black dotted), for (left) $n = 100$, $\theta = 1$, $0.4 \le \beta \le 1$ (Weibull alternatives) and (right) $1 \le \beta \le 8$ (beta alternatives).
The results of the simulations, done with $n = 100$, $\theta = 1$, and alternatives for which $\beta$ varies between 0.4 and 1, are shown in the left panel of Figure 2. We also investigated the power for different choices of $\theta$ and observed behavior similar to that for $\theta = 1$. The figure shows that the test based on the Hellinger distance performs worse than the other tests; in this case, the Cox and Oakes test $CO_n$ has the greatest power. However, Alizadeh Noughabi and Arghami (2011) concluded that none of the tests is uniformly most powerful with respect to the others.
We repeated the experiment taking, instead of the Weibull distribution, the beta distribution with parameters $\alpha = 1$ and $1 \le \beta \le 8$ as the alternative. Note that it has a nonincreasing density on $[0,1]$ proportional to $(1-x)^{\beta-1}$, and the extreme case $\beta = 1$ corresponds to the uniform distribution. Results are shown in the right panel of Figure 2. We observe that, for small values of $\beta$, the Hellinger distance test again behaves worse than the others, and in this case some of the competitors, in particular $EP_n$, have greater power. However, for larger $\beta$, the Hellinger distance test outperforms all the others.