Smoothed isotonic estimators of a monotone baseline hazard in the Cox model

We consider the smoothed maximum likelihood estimator and the smoothed Grenander-type estimator for a monotone baseline hazard rate $\lambda_0$ in the Cox model. We analyze their asymptotic behavior and show that they are asymptotically normal at rate $n^{m/(2m+1)}$, when~$\lambda_0$ is $m\geq 2$ times continuously differentiable, and that both estimators are asymptotically equivalent. Finally, we present numerical results on pointwise confidence intervals that illustrate the comparable behavior of the two methods.


Introduction
The semi-parametric Cox regression model is a very popular method in survival analysis that allows incorporation of covariates when studying lifetime distributions in the presence of right censored data. The ease of interpretation, resulting from the formulation in terms of the hazard rate and the proportional effect of the covariates, and the fact that the regression coefficients (parametric component) can be estimated while leaving the baseline distribution unspecified favour the wide use of this framework, especially in medical applications. On the other hand, since its first introduction (Cox, 1972), much effort has been spent on giving a firm mathematical basis to this approach. Initially, the attention was on the derivation of large sample properties of the maximum partial likelihood estimator of the regression coefficients and of the Breslow estimator for the cumulative baseline hazard (e.g., see Cox, 1975; Efron, 1977; Tsiatis, 1981). Although the most attractive property of this approach is that it does not assume any fixed shape of the hazard curve, there are several cases where order restrictions better match the practical expectations (see van Geloven, 2013, for an example of a decreasing hazard in a large clinical trial for patients with acute coronary syndrome). Estimation of the baseline hazard function under monotonicity constraints has been studied in Chung & Chang (1994) and Lopuhaä & Nane (2013).
Traditional isotonic estimators, such as maximum likelihood estimators and Grenander-type estimators, are step functions that exhibit a non-normal limit distribution at rate $n^{1/3}$. On the other hand, a long stream of research has shown that, if one is willing to assume more regularity on the function of interest, smooth estimators may be preferred to piecewise constant ones, because they achieve a faster rate of convergence to a Gaussian distributional law and can be used to estimate derivatives. Isotonized smooth estimators, obtained either by a least squares projection, maximum likelihood or penalization, are studied in Mukerjee (1988), Ramsay (1998), Eggermont & LaRiccia (2000), van der Vaart & van der Laan (2003) and in Mammen (1991), who also compares isotonized kernel estimators with smoothed isotonic estimators in the regression setting. Smoothed maximum likelihood estimators (SMLEs) for monotone functions have also been investigated, e.g., by Durot et al. (2013), in order to bootstrap from a smoothed Grenander-type estimator. Contrary to Groeneboom & Jongbloed (2014), we obtain polynomial bounds, which will suffice for obtaining uniform $L_2$-bounds, and we avoid using the maximal distance between succeeding points of jump of the non-smoothed isotonic estimator, by establishing a sufficiently small bound on the expected supremum distance between the non-smoothed isotonic estimator and the true baseline hazard.
This leads to asymptotic normality at rate $n^{m/(2m+1)}$ of the SMLE and the smoothed Grenander-type estimator, which are also shown to be asymptotically equivalent. By means of a small simulation study, we investigate the finite sample behaviour in terms of asymptotic confidence intervals corresponding to the limit normal distributions, as well as bootstrap confidence intervals based on a smooth bootstrap proposed by Burr (1994) and Xu et al. (2014). As expected, neither estimator performs strictly better than the other.
The paper is organized as follows. In Section 2, we specify the Cox regression model and provide some background information that will be used in the sequel. The kernel smoothed versions of the Grenander-type estimator and of the maximum likelihood estimator of a nondecreasing baseline hazard function are considered in Section 3. We only consider the case of a nondecreasing baseline hazard. The same results can be obtained similarly for a non-increasing hazard. The results of a small simulation study are reported in Section 4, and we conclude with a brief discussion in Section 5. In order to keep the exposition clear and simple, most of the proofs are delayed until Section 6, and remaining technicalities have been put in the Supporting Information online.

The Cox regression model
Let $X_1,\ldots,X_n$ be an i.i.d. sample representing the survival times of $n$ individuals, which can be observed only on time intervals $[0,C_i]$ for some i.i.d. censoring times $C_1,\ldots,C_n$. One observes i.i.d. triplets $(T_1,\Delta_1,Z_1),\ldots,(T_n,\Delta_n,Z_n)$, where $T_i=\min(X_i,C_i)$ denotes the follow-up time, $\Delta_i=\mathbb{1}_{\{X_i\leq C_i\}}$ is the censoring indicator, and $Z_i\in\mathbb{R}^p$ is a time-independent covariate vector. Given the covariate vector $Z$, the event time $X$ and the censoring time $C$ are assumed to be independent. Furthermore, conditionally on $Z=z$, the event time is assumed to be a nonnegative r.v. with an absolutely continuous distribution function $F(x\mid z)$ and density $f(x\mid z)$. Similarly, the censoring time is assumed to be a nonnegative r.v. with an absolutely continuous distribution function $G(x\mid z)$ and density $g(x\mid z)$. The censoring mechanism is assumed to be non-informative, i.e., $F$ and $G$ share no parameters. Within the Cox model, the conditional hazard rate $\lambda(x\mid z)$ for a subject with covariate vector $z\in\mathbb{R}^p$ is related to the corresponding covariate by
$$\lambda(x\mid z)=\lambda_0(x)\,e^{\beta_0'z},\qquad x\in[0,\infty),$$
where $\lambda_0$ represents the baseline hazard function, corresponding to a subject with $z=0$, and $\beta_0\in\mathbb{R}^p$ is the vector of the regression coefficients. Let $H$ and $H^{uc}$ denote, respectively, the distribution function of the follow-up time and the sub-distribution function of the uncensored observations, i.e., $H^{uc}(x)=P(T\leq x,\Delta=1)$, where $P$ is the distribution of $(T,\Delta,Z)$. We also require the following assumptions, some of which are common in large sample studies of the Cox model (e.g., see Lopuhaä & Nane, 2013):

(A1) Let $\tau_F$, $\tau_G$ and $\tau_H$ be the end points of the support of $F$, $G$ and $H$. Then $\tau_H=\tau_G<\tau_F\leq\infty$;

together with two moment conditions, (A2) and (A3), on the covariates, as stated in Lopuhaä & Nane (2013). Let us briefly comment on these assumptions. While the first one tells us that, at the end of the study, there is at least one subject alive, the other two are somewhat hard to justify from a practical point of view. One can think of (A2) and (A3) as conditions on the boundedness of the second moment of the covariates, uniformly for $\beta$ in a neighbourhood of $\beta_0$.
By now, it seems to be rather a standard choice to estimate $\beta_0$ by $\hat\beta_n$, the maximizer of the partial likelihood function, as proposed by Cox (1972). The asymptotic behaviour was first studied by Tsiatis (1981). We aim at estimating $\lambda_0$, subject to the constraint that it is increasing (the case of a decreasing hazard is analogous), on the basis of $n$ observations $(T_1,\Delta_1,Z_1),\ldots,(T_n,\Delta_n,Z_n)$. We refer to the quantity $\Lambda_0(x)=\int_0^x\lambda_0(u)\,du$ as the cumulative baseline hazard and, by introducing
$$\Phi(x;\beta)=\int \mathbb{1}_{\{u\geq x\}}\,e^{\beta'z}\,dP(u,\delta,z),$$
we have $\lambda_0(x)=h(x)/\Phi(x;\beta_0)$, where $h(x)=dH^{uc}(x)/dx$ (e.g., see (9) in Lopuhaä & Nane, 2013). For $\beta\in\mathbb{R}^p$ and $x\in\mathbb{R}$, the function $\Phi(x;\beta)$ can be estimated by
$$\Phi_n(x;\beta)=\int \mathbb{1}_{\{u\geq x\}}\,e^{\beta'z}\,d\mathbb{P}_n(u,\delta,z),$$
where $\mathbb{P}_n$ is the empirical measure of the triplets $(T_i,\Delta_i,Z_i)$ with $i=1,\ldots,n$. Moreover, in lemma 4 of Lopuhaä & Nane (2013), it is shown that $\sup_{x}|\Phi_n(x;\hat\beta_n)-\Phi(x;\beta_0)|=O_p(n^{-1/2})$. It will often be used throughout the paper that a stochastic bound of the same order also holds for the distance between the cumulative hazard $\Lambda_0$ and the Breslow estimator $\Lambda_n$, but only on intervals staying away from the right boundary, i.e.,
$$\sup_{x\in[0,M]}|\Lambda_n(x)-\Lambda_0(x)|=O_p(n^{-1/2}),\qquad\text{for all }0<M<\tau_H$$
(see theorem 5 in Lopuhaä & Nane, 2013).
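To fix ideas, the Breslow estimator and $\Phi_n$ above can be computed directly from the observed triplets. The following is a minimal numpy sketch (our own illustration, with hypothetical names, not code from the paper); $\hat\beta_n$ is assumed given, and we take $p=1$:

```python
import numpy as np

def breslow(times, delta, Z, beta):
    """Breslow estimator of the cumulative baseline hazard.

    times, delta, Z: follow-up times T_i, censoring indicators Delta_i,
    scalar covariates Z_i; beta: an estimate of the regression coefficient.
    Returns the ordered times and the value of Lambda_n at each of them.
    """
    order = np.argsort(times)
    t, d, z = times[order], delta[order], Z[order]
    risk = np.exp(beta * z)
    # Phi_n(t_i; beta): sum of exp(beta * Z_j) over subjects still at risk at t_i
    phi = np.cumsum(risk[::-1])[::-1]
    # Lambda_n jumps by d_i / Phi_n(t_i; beta) at each uncensored observation
    return t, np.cumsum(d / phi)
```

With $\beta=0$ this reduces to the Nelson-Aalen estimator, which gives a quick sanity check of the implementation.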
Smoothing is done by means of kernel functions. We will consider kernel functions $k$ that are $m$-orthogonal, for some $m\geq 1$, which means that $\int |k(u)|\,|u|^m\,du<\infty$ and $\int k(u)\,u^j\,du=0$, for $j=1,\ldots,m-1$, if $m\geq 2$. We assume that $k$ has bounded support $[-1,1]$ and is differentiable with a uniformly bounded derivative.
We denote by $k_b$ its scaled version $k_b(u)=b^{-1}k(u/b)$. Here, $b=b_n$ is a bandwidth that depends on the sample size, in such a way that $0<b_n\to 0$ and $nb_n\to\infty$, as $n\to\infty$. From now on, we will simply write $b$ instead of $b_n$. Note that if $m>2$, the kernel function $k$ necessarily attains negative values, and as a result, the smooth estimators of the baseline hazard defined in Section 3 may also be negative and monotonicity might not be preserved. To avoid this, one could restrict oneself to $m=2$. In that case, the most common choice is to let $k$ be a symmetric probability density.
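As a quick illustration of these conditions (our own check, not part of the paper), the triweight kernel used in Section 4 is a symmetric probability density on $[-1,1]$, hence 2-orthogonal; the moment conditions can be verified numerically:

```python
import numpy as np

# Triweight kernel: a symmetric density supported on [-1, 1]
def k(u):
    u = np.asarray(u)
    return (35 / 32) * (1 - u**2) ** 3 * (np.abs(u) <= 1)

# Scaled version k_b(u) = b^{-1} k(u / b)
def k_b(u, b):
    return k(u / b) / b

# Riemann-sum check of 2-orthogonality: unit mass and vanishing first moment
u = np.linspace(-1.0, 1.0, 400_001)
du = u[1] - u[0]
mass = np.sum(k(u)) * du              # should be close to 1
first_moment = np.sum(u * k(u)) * du  # should be close to 0
```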

Smoothed isotonic estimators
We consider smoothed versions of two isotonic estimators of $\lambda_0$, i.e., the maximum likelihood estimator $\hat\lambda_n$ and the Grenander-type estimator $\tilde\lambda_n$, introduced in Lopuhaä & Nane (2013). The maximum likelihood estimator of a nondecreasing baseline hazard rate $\lambda_0$ can be characterized as the left derivative of the greatest convex minorant of a cumulative sum diagram consisting of the points $P_0=(0,0)$ and $P_j$, $j=1,2,\ldots$, with $\hat\beta_n$ being the partial maximum likelihood estimator (see lemma 1 in Lopuhaä & Nane, 2013). For a fixed $x\in[0,\tau_H]$, the SMLE $\hat\lambda_n^{SM}$ of a nondecreasing baseline hazard rate $\lambda_0$ was defined in Nane (2013) by
$$\hat\lambda_n^{SM}(x)=\int k_b(x-u)\,d\hat\Lambda_n(u),\qquad(10)$$
where $\hat\Lambda_n(u)=\int_0^u\hat\lambda_n(v)\,dv$. The Grenander-type estimator $\tilde\lambda_n$ of a nondecreasing baseline hazard rate $\lambda_0$ is defined as the left-hand slope of the greatest convex minorant $\tilde\Lambda_n$ of the Breslow estimator $\Lambda_n$. For a fixed $x_0\in[0,\tau_H]$, we consider the smoothed Grenander-type estimator $\tilde\lambda_n^{SG}$, which is defined by
$$\tilde\lambda_n^{SG}(x_0)=\int k_b(x_0-u)\,d\tilde\Lambda_n(u).\qquad(11)$$
Uniform strong consistency on compact intervals $[\epsilon,M]\subset[0,\tau_H]$ in the interior of the support is provided by theorem 5.2 of Nane (2013); in the same way one obtains uniform strong consistency for $\hat\lambda_n^{SM}$, similar to (12). Inconsistency at the boundaries is a frequently encountered problem in such situations and can be partially avoided by using a boundary corrected kernel. One possibility is to construct linear combinations of $k(u)$ and $uk(u)$ with coefficients depending on the value near the boundary (e.g., see Zhang & Karunamuni, 1998; Durot et al., 2013; or Lopuhaä & Musta, 2017b). Then, it can be proved, exactly as it is done in Lopuhaä & Musta (2017b), that uniform consistency holds on $[0,M]\subset[0,\tau_H]$. Figure 1 shows the SMLE (left) and the smoothed Grenander-type estimator (right) for a sample of size $n=500$ from a Weibull baseline distribution with shape parameter 1.5 and scale 1. For simplicity, we assume that the real valued covariate and the censoring times are uniformly $(0,1)$ distributed and we take $\beta_0=0.5$.
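To make the greatest convex minorant construction concrete, here is a small self-contained sketch (ours, not from the paper): given the points of a cumulative-sum diagram, it returns the left-hand slopes of the greatest convex minorant, i.e., the values of the corresponding Grenander-type estimator on each interval; convolving the resulting step function with $k_b$ then gives the smoothed version.

```python
import numpy as np

def gcm_left_slopes(x, y):
    """Left-hand slopes of the greatest convex minorant of the diagram
    {(x_0, y_0), ..., (x_k, y_k)}, with x strictly increasing.
    Returns one slope per interval (x_{i-1}, x_i]."""
    hull = [0]                     # indices of the GCM vertices
    for i in range(1, len(x)):
        hull.append(i)
        # pop middle vertices that violate convexity of the minorant
        while len(hull) >= 3:
            i0, i1, i2 = hull[-3], hull[-2], hull[-1]
            s01 = (y[i1] - y[i0]) / (x[i1] - x[i0])
            s12 = (y[i2] - y[i1]) / (x[i2] - x[i1])
            if s01 <= s12:
                break
            del hull[-2]
    # the slope of each hull segment is the estimator on that stretch
    slopes = np.empty(len(x) - 1)
    for j in range(len(hull) - 1):
        a, b = hull[j], hull[j + 1]
        slopes[a:b] = (y[b] - y[a]) / (x[b] - x[a])
    return slopes
```

Applied to $(0,0)$ and the points of the Breslow estimator, the output is nondecreasing by construction, as required for a nondecreasing baseline hazard.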
We used a boundary corrected triweight kernel function $k(u)=(35/32)(1-u^2)^3\mathbb{1}_{\{|u|\leq1\}}$ and bandwidth $b=n^{-1/5}$. In the remainder of this section, we will derive the pointwise asymptotic distribution of both smoothed isotonic estimators in (10) and (11). As already mentioned, our approach is inspired by techniques introduced in Section 11.6 of Groeneboom & Jongbloed (2014). We briefly describe this approach for the smoothed Grenander estimator, for which the computations are more complicated. We start by writing
$$\tilde\lambda_n^{SG}(x)=\int k_b(x-u)\,\lambda_0(u)\,du+\int k_b(x-u)\,d(\tilde\Lambda_n-\Lambda_0)(u).\qquad(13)$$
The first (deterministic) term on the right hand side of (13) gives us the asymptotic bias. The method applied in Lopuhaä & Musta (2017b) for the right censoring model continues by decomposing the second term in two parts and then uses a Kiefer-Wolfowitz type of result to show that $\int k_b(x-u)\,d(\tilde\Lambda_n-\Lambda_n)(u)$ converges to zero. Finally, results from empirical process theory are used to show the asymptotic normality of $\int k_b(x-u)\,d(\Lambda_n-\Lambda_0)(u)$. This approach cannot be followed in our case, because of the lack of a Kiefer-Wolfowitz type of result as in (14) for the Cox model. Alternatively, we proceed by describing the main steps of the $L_2$-bounds approach introduced in Groeneboom & Jongbloed (2014). On an event $E_n$ with probability tending to one, we will approximate
$$\int k_b(x-u)\,d(\tilde\Lambda_n-\Lambda_0)(u)\qquad(15)$$
by $\int\theta_{n,x}(u,\delta,z)\,dP(u,\delta,z)$, for some suitable function $\theta_{n,x}$ (lemma 3.1), whose piecewise constant modification $\bar\theta_{n,x}$ integrates to zero with respect to the empirical measure $\mathbb{P}_n$ (lemma 3.2). This enables us to approximate (15) by
$$\int\bar\theta_{n,x}(u,\delta,z)\,d(\mathbb{P}_n-P)(u,\delta,z)+\int\left\{\bar\theta_{n,x}(u,\delta,z)-\theta_{n,x}(u,\delta,z)\right\}dP(u,\delta,z).\qquad(16)$$
Then, the key step is to bound the second integral in (16) by means of $L_2$-bounds on the distance between the ordinary Grenander estimator and the true baseline hazard (lemma 3.3).
The last step consists of replacing $\bar\theta_{n,x}$ by a deterministic function $\eta_{n,x}$ (lemma 3.4) and using empirical process theory to show that
$$\int\eta_{n,x}(u,\delta,z)\,d(\mathbb{P}_n-P)(u,\delta,z)$$
is asymptotically normal. Before we proceed to our first main result, we formulate the steps described earlier in a series of lemmas. Let $x\in(0,\tau_H)$ and $0<M<\tau_H$. For $n$ sufficiently large, such that $0<x-b$ and $x+b<M$, define
$$a_{n,x}(u)=\frac{k_b(x-u)}{\Phi(u;\beta_0)},\qquad u\leq x+b,\qquad(17)$$
where $\Phi(u;\beta_0)$ is defined in (2), and $a_{n,x}(u)=0$ for $u>x+b$. We then have the following approximation for (15). The proof can be found in Section 6.
Lemma 3.1. Suppose that (A1)-(A2) hold. Let $a_{n,x}$ be defined by (17) and let $\hat\beta_n$ be the partial MLE for $\beta_0$. There exists an event $E_n$, with $\mathbb{1}_{E_n}\to1$ in probability, such that, for
$$\theta_{n,x}(u,\delta,z)=\mathbb{1}_{E_n}\left\{\delta\,a_{n,x}(u)-e^{\hat\beta_n'z}\int_0^u a_{n,x}(v)\,d\tilde\Lambda_n(v)\right\},\qquad(18)$$
the integral $\int\theta_{n,x}(u,\delta,z)\,dP(u,\delta,z)$ approximates $\mathbb{1}_{E_n}$ times (15), up to a remainder that is negligible at the rate $n^{m/(2m+1)}$.

Next, we consider a piecewise constant modification $\bar a_{n,x}\bar\Phi_n$ of $a_{n,x}\Phi_n$, which is constant on the same intervals as $\tilde\lambda_n$. Let $\tau_0=x-b$, $\tau_{m+1}=x+b$, and let $(\tau_i)_{i=1}^m$ be the successive points of jump of $\tilde\lambda_n$ in the interval $(x-b,x+b)$. Then, for $u\in(\tau_i,\tau_{i+1}]$, we choose $\bar a_{n,x}\bar\Phi_n(u;\hat\beta_n)$ equal to the value of $a_{n,x}(\cdot)\Phi_n(\cdot;\hat\beta_n)$ at an endpoint of this interval. Furthermore, let $E_n$ be the event from lemma 3.1 and define $\bar\Psi_{n,x}(u)=\bar a_{n,x}\bar\Phi_n(u;\hat\beta_n)$, for $u\in[x-b,x+b]$, and $\bar\Psi_{n,x}(u)=0$, for $u\notin[x-b,x+b]$. Note that, because $u\leq x+b<M<T_{(n)}$ on the event $E_n$, we have $\Phi_n(u;\hat\beta_n)>0$ (see the proof of lemma 3.1), and thus $\bar\Psi_{n,x}(u)$ is well defined. Now, define the corresponding piecewise constant modification $\bar\theta_{n,x}$ of $\theta_{n,x}$, see (22). We then have the following property. The proof can be found in Section 6.
Lemma 3.2. Let $\bar\theta_{n,x}$ be defined in (22). Then
$$\int\bar\theta_{n,x}(u,\delta,z)\,d\mathbb{P}_n(u,\delta,z)=0.$$
At this point, it is important to discuss in some detail how we will obtain suitable bounds for the second integral in (16). In order to do so, we first introduce the inverse process $\tilde U_n$. It is defined by
$$\tilde U_n(a)=\operatorname{argmin}_{x\in[0,T_{(n)}]}\left\{\Lambda_n(x)-ax\right\},$$
and it satisfies the switching relation $\tilde\lambda_n(x)\leq a$ if and only if $\tilde U_n(a)\geq x$, for $x\leq T_{(n)}$. In their analysis of the current status model, Groeneboom et al. (2010) encounter an integral that is similar to the second integral in (16). They bound this integral using the fact that the maximal distance between succeeding points of jump of the isotonic estimator is of the order $O_p(n^{-1/3}\log n)$. Such a property typically relies on exponential bounds for the tail probabilities of $\tilde U_n(a)$, obtained either by using a suitable exponential martingale (e.g., see lemma 5.9 in Groeneboom & Wellner, 1992), or by an embedding of the relevant sum process into Brownian motion or Brownian bridge (e.g., see lemma 5.1 in Durot et al., 2012). Unfortunately, an embedding of the process $\Lambda_n$ is not available and, in our current situation, the martingale approach only yields polynomial bounds for the tail probabilities of $\tilde U_n(a)$. A polynomial bound was also found by Durot (2007) (see her lemma 2), leading to
$$\mathbb{E}\left[\int_{I_n}\left|\tilde\lambda_n(t)-\lambda_0(t)\right|^p dt\right]=O\!\left(n^{-p/3}\right),\qquad(25)$$
for $p\in[1,2)$ and some interval $I_n$ (see her theorem 1). By intersecting with the event $E_n$ from lemma 3.1, we extend (25) to a similar bound for $p=2$. Groeneboom & Jongbloed (2014) provide an alternative approach to bound the second integral in (16), based on bounds such as (25) with $p=2$. Unfortunately, they still make use of the fact that the maximum distance between succeeding points of jump of the isotonic estimator is of the order $O_p(n^{-1/3}\log n)$ to obtain a result similar to (28). Nevertheless, we do follow the approach in Groeneboom & Jongbloed (2014), but instead of using the maximum distance between succeeding points of jump of $\tilde\lambda_n$, we use a bound on
$$\mathbb{E}\left[\sup_{x\in[\epsilon,M]}\left|\tilde\lambda_n(x)-\lambda_0(x)\right|\right],\qquad(26)$$
for $0<\epsilon<M<\tau_H$.
Exponential bounds for the tail probabilities of $\tilde U_n(a)$ would yield the same bound for (26) as the one in (25), apart from a factor $\log n$. Because we can only obtain polynomial bounds on the tail probabilities of $\tilde U_n(a)$, we establish a bound for (26) of the order $O(n^{-4/9})$. This is probably not optimal, but it will turn out to be sufficient for our purposes and leads to the following intermediate result, of which the proof can be found in Section 6.

Lemma 3.3. Suppose that (A1)-(A2) hold. Fix $x\in(0,\tau_H)$ and let $\theta_{n,x}$ and $\bar\theta_{n,x}$ be defined by (18) and (22), respectively. Assume that $\lambda_0$ is differentiable, such that $\lambda_0'$ is uniformly bounded above and below by strictly positive constants. Assume that $x\mapsto\Phi(x;\beta_0)$ is differentiable with a bounded derivative in a neighborhood of $x$ and let $k$ satisfy (8). Then, it holds
$$\int\left\{\bar\theta_{n,x}(u,\delta,z)-\theta_{n,x}(u,\delta,z)\right\}dP(u,\delta,z)=o_p\!\left(n^{-m/(2m+1)}\right).$$
The last step is to replace $\bar\theta_{n,x}$ in the first integral of (16) with a deterministic approximation. This is done in the next lemma, of which the proof can be found in Section 6.
Lemma 3.4. Suppose that (A1)-(A2) hold. Assume that $\lambda_0$ is differentiable, such that $\lambda_0'$ is uniformly bounded above and below by strictly positive constants. Assume that $x\mapsto\Phi(x;\beta_0)$ is differentiable with a bounded derivative in a neighborhood of $x$. Let $\bar\theta_{n,x}$ be defined in (22) and define
$$\eta_{n,x}(u,\delta,z)=\mathbb{1}_{E_n}\left\{\delta\,a_{n,x}(u)-e^{\beta_0'z}\int_0^u a_{n,x}(v)\,\lambda_0(v)\,dv\right\},\qquad(27)$$
where $a_{n,x}$ is defined in (17) and $E_n$ is the event from lemma 3.1. Let $k$ satisfy (8). Then, it holds
$$\int\left\{\bar\theta_{n,x}(u,\delta,z)-\eta_{n,x}(u,\delta,z)\right\}d(\mathbb{P}_n-P)(u,\delta,z)=o_p\!\left(n^{-m/(2m+1)}\right).$$
We are now in the position to state our first main result.
Theorem 3.5. Assume that $\lambda_0$ is $m\geq 2$ times continuously differentiable in $x$, such that $\lambda_0'$ is uniformly bounded above and below by strictly positive constants. Moreover, assume that $t\mapsto\Phi(t;\beta_0)$ is differentiable with a bounded derivative in a neighborhood of $x$ and let $k$ satisfy (8). Let $\tilde\lambda_n^{SG}$ be defined in (11) and assume that $n^{1/(2m+1)}b\to c>0$. Then, it holds
$$n^{m/(2m+1)}\left(\tilde\lambda_n^{SG}(x)-\lambda_0(x)\right)\xrightarrow{d}N(\mu,\sigma^2),\quad\text{with }\ \mu=\frac{c^m}{m!}\,\lambda_0^{(m)}(x)\int k(u)\,u^m\,du\ \text{ and }\ \sigma^2=\frac{\lambda_0(x)}{c\,\Phi(x;\beta_0)}\int k^2(u)\,du.\qquad(29)$$
Furthermore,
$$n^{m/(2m+1)}\left(\tilde\lambda_n^{SG}(x)-\hat\lambda_n^{SM}(x)\right)\to 0,\quad\text{in probability},\qquad(30)$$
where $\hat\lambda_n^{SM}$ is the SMLE defined in (10); in particular, the same limit distribution holds for $\hat\lambda_n^{SM}(x)$.

Proof. Consider the event $E_n$ from lemma 3.1 and choose $\epsilon_1,\epsilon_2>0$ and $\epsilon_3$, such that (65) is satisfied. We write
$$n^{m/(2m+1)}\left(\tilde\lambda_n^{SG}(x)-\lambda_0(x)\right)=n^{m/(2m+1)}\left(\int k_b(x-u)\lambda_0(u)\,du-\lambda_0(x)\right)+n^{m/(2m+1)}\,\mathbb{1}_{E_n}\int k_b(x-u)\,d(\tilde\Lambda_n-\Lambda_0)(u)+n^{m/(2m+1)}\,\mathbb{1}_{E_n^c}\int k_b(x-u)\,d(\tilde\Lambda_n-\Lambda_0)(u).\qquad(31)$$
Because $\mathbb{1}_{E_n^c}\to 0$ in probability, the third term on the right hand side tends to zero in probability. For the first term, we obtain from a change of variable, a Taylor expansion, and the properties of the kernel:
$$n^{m/(2m+1)}\left(\int k_b(x-u)\lambda_0(u)\,du-\lambda_0(x)\right)=n^{m/(2m+1)}\,\frac{b^m}{m!}\int k(y)\,y^m\,\lambda_0^{(m)}(\xi_n)\,dy\to\mu,\qquad(32)$$
with $|\xi_n-x|<b|y|$. Finally, for the second term on the right hand side of (31), lemmas 3.1 to 3.4 yield that
$$n^{m/(2m+1)}\,\mathbb{1}_{E_n}\int k_b(x-u)\,d(\tilde\Lambda_n-\Lambda_0)(u)=n^{m/(2m+1)}\int\eta_{n,x}(u,\delta,z)\,d(\mathbb{P}_n-P)(u,\delta,z)+o_p(1).\qquad(33)$$
For the first term on the right hand side of (33), we can write
$$n^{m/(2m+1)}\int\eta_{n,x}\,d(\mathbb{P}_n-P)=n^{m/(2m+1)}\int\delta\,a_{n,x}(u)\,d(\mathbb{P}_n-P)(u,\delta,z)-n^{m/(2m+1)}\int e^{\beta_0'z}\left(\int_0^u a_{n,x}(v)\,\lambda_0(v)\,dv\right)d(\mathbb{P}_n-P)(u,\delta,z).\qquad(34)$$
We will show that the first term on the right hand side is asymptotically normal and that the second term tends to zero in probability. Define $Y_{n,i}=n^{-(m+1)/(2m+1)}\,\Delta_i\,k_b(x-T_i)/\Phi(T_i;\beta_0)$, so that the first term on the right hand side of (34) can be written as $\sum_{i=1}^n(Y_{n,i}-\mathbb{E}[Y_{n,i}])$. Using (3), together with a Taylor expansion and the boundedness assumptions on the derivatives of $\lambda_0$ and $\Phi(x;\beta_0)$, we have
$$\sum_{i=1}^n\operatorname{Var}(Y_{n,i})\to\frac{\lambda_0(x)}{c\,\Phi(x;\beta_0)}\int k^2(u)\,du=\sigma^2.\qquad(35)$$
Moreover, $|Y_{n,i}|\leq n^{-(m+1)/(2m+1)}\,b^{-1}\,\Phi(M;\beta_0)^{-1}\sup_{x\in[-1,1]}k(x)$, so that $\sum_{i=1}^n\mathbb{E}\big[|Y_{n,i}|^2\,\mathbb{1}_{\{|Y_{n,i}|>\epsilon\}}\big]\to 0$, for any $\epsilon>0$, because $\mathbb{1}_{\{|Y_{n,i}|>\epsilon\}}=0$ for $n$ sufficiently large. Consequently, by the Lindeberg central limit theorem, and the fact that $\mathbb{1}_{E_n}\to 1$ in probability, we obtain
$$\sum_{i=1}^n(Y_{n,i}-\mathbb{E}[Y_{n,i}])\xrightarrow{d}N(0,\sigma^2).\qquad(36)$$
For the second term on the right hand side of (34), a direct computation of its second moment shows that it tends to zero in probability. Together with (31), (32) and (36), this proves the first part of the theorem.
For the SMLE, we can follow the same approach and obtain results similar to those in lemmas 3.1 to 3.4. The arguments are more or less the same as those used to prove lemmas 3.1 to 3.4. We briefly sketch the main differences. First, instead of $\tilde\Lambda_n$, we now use $\hat\Lambda_n$ in (15). Then, because the maximum likelihood estimator is defined as the left slope of the greatest convex minorant of a cumulative sum diagram that is different from the one corresponding to the Grenander-type estimator, lemmas 3.1 and 3.2 will hold with a different event $\hat E_n$, and $\bar\Psi_{n,x}$ will have a simpler form (see lemmas B.1-B.2 and definition (S10) in the Supporting Information). Similar to the proof of lemma 3.3, the proof of its counterpart for the maximum likelihood estimator (see lemma B.8 in the Supporting Information) is quite technical and involves bounds on the tail probabilities of the inverse process corresponding to $\hat\lambda_n$ (lemma B.5), used to obtain the analogue of (26) (lemma B.6). Moreover, the inverse process related to the maximum likelihood estimator is defined by
$$\hat U_n(a)=\operatorname{argmin}_{t}\left\{V_n(t)-a\hat W_n(t)\right\},$$
where $V_n$ and $\hat W_n$ are defined in (9), and we get a slightly different bound on the tail probabilities of $\hat U_n$ (compare lemma 6.3 and lemma B.5 in the Supporting Information). The reason is that the martingale decomposition of the process $V_n(t)-a\hat W_n(t)$ has a simpler form. The counterpart of lemma 3.4 (see lemma B.10 in the Supporting Information) is established in the same way, replacing $\tilde\lambda_n$ by $\hat\lambda_n$. For details, we refer to Section B in the Supporting Information.
From (31) and (33), we have that
$$n^{m/(2m+1)}\left(\tilde\lambda_n^{SG}(x)-\lambda_0(x)\right)=n^{m/(2m+1)}\int\eta_{n,x}(u,\delta,z)\,d(\mathbb{P}_n-P)(u,\delta,z)+\mu+o_p(1),\qquad(38)$$
where $\eta_{n,x}$ is defined in (27). Similarly, from the results in Section B of the Supporting Information, we have that there exists an event $\hat E_n$, such that
$$n^{m/(2m+1)}\left(\hat\lambda_n^{SM}(x)-\lambda_0(x)\right)=n^{m/(2m+1)}\int\hat\eta_{n,x}(u,\delta,z)\,d(\mathbb{P}_n-P)(u,\delta,z)+\mu+o_p(1),\qquad(40)$$
where $\hat\eta_{n,x}$ is the analogue of $\eta_{n,x}$ in (27), with $E_n$ replaced by $\hat E_n$. Together with (39) and (41), this means that $n^{m/(2m+1)}(\tilde\lambda_n^{SG}(x)-\hat\lambda_n^{SM}(x))\to 0$ in probability. Note that in the special case $\beta_0=0$ and $m=2$, we recover theorem 3.2 in Lopuhaä & Musta (2017b) and theorem 11.8 in Groeneboom & Jongbloed (2014), for the right censoring model without covariates. The fact that $\tilde\lambda_n^{SG}(x)$ and $\hat\lambda_n^{SM}(x)$ are asymptotically equivalent does not come as a surprise, because the corresponding isotonic estimators are asymptotically equivalent, according to theorem 2 in Lopuhaä & Nane (2013). However, we have not been able to exploit this fact, and we have established the asymptotic equivalence in (30) by obtaining the expansions in (38) and (40) separately for each estimator.
Remark 3.6. The estimators considered in theorem 3.5 are based on the partial maximum likelihood estimator $\hat\beta_n$, which defines the Breslow estimator, see (6), and the cumulative sum diagram from which the SMLE is determined, see (9). However, theorem 3.5 remains true for any estimator $\hat\beta_n$ that satisfies $\hat\beta_n-\beta_0\to 0$ a.s. and $\sqrt n(\hat\beta_n-\beta_0)=O_p(1)$.
In particular, this holds for the partial MLE for $\beta_0$; see, e.g., theorems 3.1 and 3.2 in Tsiatis (1981). When proving consistency of the bootstrap, we are not able to establish bootstrap versions of theorems 3.1 and 3.2 in Tsiatis (1981), but, in view of this remark, it is sufficient to assume the bootstrap version of (42).

Numerical results for pointwise confidence intervals
In this section, we illustrate the finite sample performance of the two estimators considered previously by constructing pointwise confidence intervals for the baseline hazard rate. We consider two different procedures: the first relies on the limit distribution and the second is a bootstrap-based method. In all the simulations, we use the triweight kernel function, which means that the degree of smoothness is $m=2$. The reason for choosing a second-order kernel is that higher-order kernels may also take negative values, which might then lead to non-monotone estimators of the baseline hazard.

Asymptotic confidence intervals
From theorem 3.5, it can be seen that the asymptotic $100(1-\alpha)\%$ confidence intervals at the point $x_0$ are of the form
$$\lambda_n^{SI}(x_0)-n^{-2/5}\left(\hat\mu_n(x_0)\pm q_{1-\alpha/2}\,\hat\sigma_n(x_0)\right),\qquad(43)$$
where $q_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution, $\lambda_n^{SI}(x_0)$ is the smooth isotonic estimator at hand (SG or SMLE), and $\hat\mu_n(x_0)$, $\hat\sigma_n(x_0)$ are corresponding plug-in estimators of the asymptotic mean and standard deviation, respectively. However, from the expression of the asymptotic mean in theorem 3.5 for $m=2$, it is obvious that obtaining the plug-in estimators requires estimation of the second derivative of $\lambda_0$. Because accurate estimation of derivatives is a hard problem, we choose to avoid it by using undersmoothing. This procedure is to be preferred over bias estimation, because it is computationally more convenient and leads to better results (see also Hall, 1992; Groeneboom & Jongbloed, 2015; Cheng et al., 2006). Undersmoothing consists of using a bandwidth of a smaller order than the optimal one (in our case $n^{-1/5}$). As a result, the bias of $n^{2/5}(\lambda_n^{SI}(x_0)-\lambda_0(x_0))$, which is of the order $n^{2/5}b^2$ (see (32)), will converge to zero. On the other hand, the asymptotic variance is of the order $n^{-1/5}b^{-1}\sigma^2$ (see (35) with $m=2$). For example, with $b=n^{-1/4}$, $n^{2/5}(\lambda_n^{SI}(x_0)-\lambda_0(x_0))$ asymptotically behaves like a normal distribution with mean of the order $n^{-1/10}$ and variance $n^{1/20}\sigma^2$. Hence, the confidence interval becomes
$$\lambda_n^{SI}(x_0)\pm q_{1-\alpha/2}\,n^{-3/8}\,\hat\sigma_n(x_0),\qquad(44)$$
where $\hat\sigma_n^2(x_0)$ is the plug-in estimator of $\sigma^2$. Note that undersmoothing leads to confidence intervals of asymptotic length $O_p(n^{-3/8})$, while the optimal ones would be of length $O_p(n^{-2/5})$. In our simulations, the event times are generated from a Weibull baseline distribution with shape parameter 1.5 and scale 1. The real valued covariate and the censoring time are chosen to be uniformly distributed on the interval $(0,1)$, and we take $\beta_0=0.5$. We note that this setup corresponds to around 35% uncensored observations.
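The simulation setup just described can be reproduced by inversion (a sketch; the sample size and seed are our own choices). With $\Lambda_0(x)=x^{3/2}$ for the Weibull baseline with shape 1.5 and scale 1, the conditional survival function is $\exp(-x^{3/2}e^{\beta_0 z})$, so event times can be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0 = 100_000, 0.5

Z = rng.uniform(0.0, 1.0, n)        # real valued covariate ~ U(0,1)
U = rng.uniform(0.0, 1.0, n)
# Invert S(x | z) = exp(-x^{3/2} e^{beta0 z}); baseline is Weibull(1.5, 1)
X = (-np.log(U) * np.exp(-beta0 * Z)) ** (2 / 3)
C = rng.uniform(0.0, 1.0, n)        # censoring times ~ U(0,1)
T = np.minimum(X, C)                # follow-up times
delta = (X <= C).astype(int)        # censoring indicators
frac_uncensored = delta.mean()      # roughly a third of the observations
```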
Confidence intervals are calculated at the point $x_0=0.5$ using 10,000 sets of data, and we take bandwidth $b=cn^{-1/4}$, with $c=1$, and kernel function $k(u)=(35/32)(1-u^2)^3\mathbb{1}_{\{|u|\leq1\}}$. It is important to note that the performance depends strongly on the choice of the constant $c$, because the asymptotic length is inversely proportional to $c$ (see (44)). This means that by choosing a smaller $c$, we get wider confidence intervals and, as a result, higher coverage probabilities. However, it is not clear what the optimal choice of such a constant would be. This is actually a common problem in the literature (see, for example, Cheng et al., 2006 and González-Manteiga et al., 1996). As indicated in Müller & Wang (1990), cross-validation methods that consider a trade-off between bias and variance suffer from the fact that the variance of the estimator increases as one approaches the endpoint of the support. This is even reinforced in our setting, because the bias also decreases when approaching the endpoint of the support. We tried a locally adaptive choice of the bandwidth, as proposed in Müller & Wang (1990), by minimizing an estimator of the mean squared error, but in our setting, this method did not lead to better results. A simple choice is to take $c$ equal to the range of the data (Groeneboom & Jongbloed, 2015), which in our case corresponds to $c=1$. Table 1 shows the performance of the estimators. The four columns corresponding to SG and SMLE list the average length (AL) and the coverage probabilities (CP) of the confidence intervals given in (43) for various sample sizes. Results indicate that the SMLE behaves slightly better, but as the sample size increases, its behaviour becomes comparable with that of the SG estimator. Even though the coverage probabilities are below the nominal level of 95%, they improve with the sample size. For comparison, we also consider confidence intervals based on the non-smoothed Grenander-type estimator, which rely on its non-normal limit distribution; there $q_\alpha(Z)$ is the $\alpha$-quantile of the distribution of $Z=\operatorname{argmin}_{t\in\mathbb{R}}\{W(t)+t^2\}$, with $W$ a standard two-sided Brownian motion starting from zero.
In particular, $q_{0.975}(Z)=0.998181$. The main advantage of using the non-smoothed Grenander-type estimator is that it does not involve the choice of a tuning parameter. However, the performance is not satisfactory, because we still need to estimate the derivative of $\lambda_0$, which is difficult if the estimator of $\lambda_0$ is a step function. Here, we use the slope of the segment $[\tilde\lambda_n(T_{(i)}),\tilde\lambda_n(T_{(i+1)})]$ on the interval $[T_{(i)},T_{(i+1)}]$ that contains $x_0$.
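In code, this derivative estimate is a simple lookup (a hypothetical helper of our own, with `jump_times` the ordered points $T_{(i)}$ and `values` the corresponding values of the step estimator):

```python
import numpy as np

def local_slope(jump_times, values, x0):
    """Slope of the step estimator over the segment [T_(i), T_(i+1)]
    containing x0: (value_{i+1} - value_i) / (T_(i+1) - T_(i))."""
    i = np.searchsorted(jump_times, x0, side="right") - 1
    return (values[i + 1] - values[i]) / (jump_times[i + 1] - jump_times[i])
```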
We also compare the performance of the SG estimator and the SMLE with that of the ordinary (non-monotone) kernel estimator $\lambda_n^s$, which is shown in columns 10-11 of Table 1. We note that the kernel estimator coincides with the naive estimator that approximates the isotonized smoothed Breslow estimator; see Section 4 in Lopuhaä & Musta (2017a). In their proof of theorem 4.3, it is shown that $\lambda_n^s$ exhibits a limit distribution that coincides with that of the smooth estimators in theorem 3.5. The numerical results in Table 1 also confirm that the performance of the kernel estimator is comparable with that of the smoothed isotonic estimators. However, we notice that the latter have slightly better coverage probabilities and shorter confidence intervals.
Moreover, as noticed in Lopuhaä & Musta (2017a), estimation of the parameter $\beta_0$ also has a great effect on the accuracy of the results. Columns 6-9 of Table 1 show that if we use the true value of $\beta_0$ in the computation of the estimators, the coverage probabilities increase significantly, but in this case, the confidence intervals for the SMLE become too conservative. Although the partial ML estimator $\hat\beta_n$ is a standard estimator for the regression coefficients, the efficiency results are only asymptotic. As pointed out in Cox & Oakes (1984) and Ren & Zhou (2011), for finite samples, the use of the partial likelihood leads to a loss of accuracy. Recently, Ren & Zhou (2011) introduced the MLE for $\beta_0$ obtained by joint maximization of the log-likelihood in the Cox model over both $\beta$ and $\lambda_0$. It was shown that for small and moderate sample sizes, the joint MLE for $\beta_0$ performs better than $\hat\beta_n$. However, in our case, using this estimator instead of $\hat\beta_n$ does not bring any essential difference in the coverage probabilities. Pointwise confidence intervals, for a fixed sample size $n=500$, at different points of the support are illustrated in Figure 2. The results are again comparable, and the common feature is that the length increases as we move towards the right boundary. This is due to the fact that the length is proportional to the asymptotic standard deviation, which in this case turns out to be increasing: $\sigma^2(x)=1.5\sqrt x/(c\,\Phi(x;\beta_0))$. Note that $\Phi(x;\beta_0)$, defined in (2), is decreasing.

Bootstrap confidence intervals
In an attempt to improve the coverage probabilities, we also construct bootstrap confidence intervals. Bootstrap confidence intervals in the Cox model are investigated in Burr (1994) and Xu et al. (2014). In the latter paper, the authors investigate several bootstrap procedures for the Cox model. We will use one (method M5) of the two proposals for a smooth bootstrap that had the best performance and were recommended by the authors. We fix the covariates and we generate the event time $X_i^*$ from a smooth estimate of the cdf of $X$ conditional on $Z_i$:
$$\hat F_n(x\mid Z_i)=1-\exp\left(-\Lambda_n^s(x)\,e^{\hat\beta_n'Z_i}\right),$$
where $\Lambda_n^s$ is the smoothed Breslow estimator. The censoring times $C_i^*$ are generated from the Kaplan-Meier estimate $\hat G_n$. Then we take $T_i^*=\min(X_i^*,C_i^*)$ and $\Delta_i^*=\mathbb{1}_{\{X_i^*\leq C_i^*\}}$. For constructing the confidence intervals, we take 1,000 bootstrap samples $(T_i^*,\Delta_i^*,Z_i)$, and for each bootstrap sample, we compute the smoothed Grenander-type estimate $\tilde\lambda_n^{SG,*}(x_0)$ and the smoothed maximum likelihood estimate $\hat\lambda_n^{SM,*}(x_0)$. Here, the kernel function is the same as before and the bandwidth is taken to be $b=n^{-1/5}$. Then, the $100(1-\alpha)\%$ confidence interval is given by
$$\left[q^*_{\alpha/2}(x_0),\,q^*_{1-\alpha/2}(x_0)\right],\qquad(45)$$
where $q^*_\alpha(x_0)$ is the $\alpha$-percentile of the 1,000 values of the estimates $\tilde\lambda_n^{SG,*}(x_0)$ or $\hat\lambda_n^{SM,*}(x_0)$. The average length and the empirical coverage for 1,000 iterations and different sample sizes are reported in Table 2. We observe that the bootstrap confidence intervals behave better than the confidence intervals in Table 1, i.e., the coverage probabilities are closer to the nominal level of 95%. Comparing also with the two alternative estimators considered in Lopuhaä & Musta (2017a), we notice that the SMLE and the MSLE have better coverage probabilities than the smoothed Grenander-type and the isotonized Breslow estimator, respectively.
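The resampling step for the event times amounts to inverting an estimated cumulative hazard. A generic sketch (our own helper; `Lambda_s` is a stand-in for the smoothed Breslow estimator, assumed nondecreasing and vectorized):

```python
import numpy as np

def sample_event_times(Lambda_s, z, beta_hat, grid, rng):
    """Draw X_i* with conditional cdf 1 - exp(-Lambda_s(x) e^{beta_hat z_i}),
    by numerical inversion of the cumulative hazard on a grid."""
    u = rng.uniform(size=len(z))
    # cumulative-hazard level corresponding to each uniform draw
    target = -np.log(1.0 - u) * np.exp(-beta_hat * np.asarray(z))
    L = Lambda_s(grid)
    idx = np.searchsorted(L, target)
    return grid[np.minimum(idx, len(grid) - 1)]
```

For example, with $\Lambda_n^s(x)=x$ and $\hat\beta_n=0$ this returns (up to grid rounding and truncation) standard exponential draws, which gives a simple check of the inversion.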
In order to provide some theoretical evidence for the consistency of the method, we would like to establish that, given the data $(T_1,\Delta_1,Z_1),\ldots,(T_n,\Delta_n,Z_n)$, it holds
$$n^{m/(2m+1)}\left(\lambda_n^{SI,*}(x_0)-\lambda_n^{SI}(x_0)\right)\xrightarrow{d}N(\tilde\mu,\sigma^2),\qquad(46)$$
for some $\tilde\mu\in\mathbb{R}$ (possibly different from $\mu$ in theorem 3.5) and $\sigma^2$ as in (29), where $\lambda_n^{SI}$ is one of the smooth isotonic estimators at hand and $\lambda_n^{SI,*}$ is the same estimator computed for the bootstrap sample. A detailed investigation of (46) is beyond the scope of this paper. Nevertheless, in view of Remark 3.6, we are able to obtain (46) with $P$ replaced by $P_n^*$, where $P_n^*$ is the measure corresponding to the distribution of $(T^*,\Delta^*,Z)$ conditional on the data $(T_1,\Delta_1,Z_1),\ldots,(T_n,\Delta_n,Z_n)$, with $T^*=\min(X^*,C^*)$ and $\Delta^*=\mathbb{1}_{\{X^*\leq C^*\}}$, where $X^*$ conditional on $Z$ has distribution function $\hat F_n(x\mid Z)$ and $C^*$ has distribution function $\hat G_n$. To prove (46), we mimic the proof of theorem 3.5, which means that one needs to establish the bootstrap versions of lemmas 3.1-3.4. A brief sketch of the arguments is provided in Appendix C of the Supporting Information.
Then, we can approximate the distribution of $n^{2/5}(\lambda_0(x_0) - \lambda^{SI}_n(x_0))$ by the distribution of $n^{2/5}(\lambda^{SI*}_n(x_0) - \lambda^{SI}_n(x_0)) - (\tilde\mu + \mu)$. This means that we should actually take $[q^*_{\alpha/2}(x_0),\, q^*_{1-\alpha/2}(x_0)] - n^{-2/5}(\tilde\mu + \mu)$ instead of (45). The use of (45) avoids bias estimation. However, because the effect of the bias is of the order $n^{-2/5}$, the results are still satisfactory. In order to further reduce the effect of the bias, we also investigated the possibility of constructing bootstrap confidence intervals with undersmoothing, i.e., we repeat the previous procedure with bandwidth $b = n^{-1/4}$. Results are shown in Table 3. We notice that the length of the confidence intervals increases slightly, while the coverage probabilities improve significantly. To summarize, the bootstrap confidence intervals are also affected by the choice of the bandwidth, but the results are more satisfactory in comparison with the ones in Table 1.
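The percentile interval (45), and its bias-corrected variant discussed above, are simple to compute once the bootstrap replicates are available. The sketch below uses hypothetical replicate values (the actual replicates depend on the full estimation pipeline), and `mu_sum` stands in for $\tilde\mu + \mu$:

```python
import numpy as np

def percentile_ci(boot_estimates, alpha=0.05):
    """Percentile bootstrap interval [q*_{alpha/2}(x0), q*_{1-alpha/2}(x0)]."""
    return (np.quantile(boot_estimates, alpha / 2),
            np.quantile(boot_estimates, 1 - alpha / 2))

def bias_corrected_ci(boot_estimates, n, mu_sum, alpha=0.05):
    """Shifted interval [q*_{alpha/2}, q*_{1-alpha/2}] - n^{-2/5} (mu~ + mu)."""
    lo, hi = percentile_ci(boot_estimates, alpha)
    shift = n ** (-2.0 / 5.0) * mu_sum
    return lo - shift, hi - shift

rng = np.random.default_rng(1)
boot = rng.normal(loc=1.0, scale=0.1, size=1000)  # stand-in replicates at x0
lo, hi = percentile_ci(boot)
```

The unshifted version corresponds to (45), which avoids estimating the asymptotic means at the price of a bias of order $n^{-2/5}$.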

Discussion
In this paper, we considered smooth estimation under monotonicity constraints of the baseline hazard rate in the Cox model. We investigated the asymptotic behaviour of two estimators, the kernel smoothed versions of the monotone MLE and of a Grenander-type estimator. The main result is that they are asymptotically equivalent with a normal limit distribution at rate $n^{m/(2m+1)}$, where $m$ is the degree of smoothness assumed for the baseline hazard. Two other methods that combine smoothing and isotonization for estimation of the baseline hazard in the Cox model were considered in Lopuhaä & Musta (2017a). As shown in theorems 3.6 and 4.4 in Lopuhaä & Musta (2017a), the smoothed Grenander-type estimator, the SMLE and the isotonized kernel estimator are all asymptotically equivalent, while the MSLE exhibits a different asymptotic bias (which might be smaller or larger than the one of the previous three estimators). This means that, from the theoretical point of view, there is no reason to prefer one estimator over the others (apart from the fact that the kernel smoothed estimators are differentiable, while the other two are usually only continuous).
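Schematically, writing $\hat\lambda^{SI}_n$ for either of the two smoothed isotonic estimators, the main result takes the following form (the constants $\mu$ and $\sigma^2$ are those appearing in the paper's theorem 3.5; the display here is only a summary device):

```latex
n^{m/(2m+1)}\left(\hat\lambda^{SI}_n(x_0) - \lambda_0(x_0)\right)
  \xrightarrow{\;d\;} N(\mu, \sigma^2),
\qquad \lambda_0 \in C^m,\ m \ge 2 .
```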
The method used to establish asymptotic normality for the estimators in this paper is quite different from the ones in Lopuhaä & Musta (2017a). In the latter paper, the isotonization step was performed after a smoothing step. As a consequence, the resulting estimators are asymptotically equivalent to corresponding naive estimators that are combinations of ordinary kernel type estimators, to which standard techniques apply. This approach does not apply to the smoothed isotonic estimators in this paper. Instead, we followed the approach from Groeneboom & Jongbloed (2014) based on $L_2$-bounds for the isotonic estimator. The approach had to be adapted at several points, leading to $L_2$-bounds that are suboptimal, but sufficient for our purposes.
Furthermore, we also investigated the finite sample performance of these estimators by constructing pointwise confidence intervals. First, making use of the theoretical results, we constructed pointwise confidence intervals based on the limit distribution, with undersmoothing to avoid bias estimation. The results confirm the comparable behaviour of the four methods and favour the use of the smoothed isotonic estimators over the unsmoothed Grenander-type estimator or the non-isotonic kernel estimator. However, the coverage probabilities are far from the nominal level and depend strongly on the choice of the bandwidth and on the accuracy of the estimation of the regression coefficient $\beta_0$. Because most of the well-known methods to overcome these problems do not seem to work in our setting, a thorough investigation is still needed to improve the performance of the confidence intervals. Instead, we chose to exploit pointwise confidence intervals based on bootstrap procedures. As it turns out, the simple percentile bootstrap works better than the studentized one; such a phenomenon was also observed in Burr (1994). The four estimators (the SMLE, the smoothed Grenander-type estimator, the MSLE and the isotonized smoothed Breslow estimator) again exhibit comparable behaviour, but the SMLE and the MSLE have slightly better coverage probabilities. The performance is satisfactory, but further investigation of bandwidth selection and of correcting the asymptotic bias is still required and might improve the results.
To obtain suitable bounds for (26), we will establish bounds on the tail probabilities of $\tilde U_n(a)$ defined in (24). To this end, we consider a suitable martingale that approximates the process $\Lambda_n - \Lambda_0$. For $i = 1, 2, \ldots, n$, let $N_i(t) = 1_{\{X_i \le t\}}\Delta_i$ be the right continuous counting process for the number of observed failures on $(0, t]$, and let $Y_i(t) = 1_{\{T_i \ge t\}}$ be the at-risk process. Then, for each $i = 1, 2, \ldots, n$, the process
$$M_i(t) = N_i(t) - A_i(t), \qquad A_i(t) = \int_0^t Y_i(s)\, e^{\beta_0' Z_i}\, \lambda_0(s)\,\mathrm{d}s,$$
is a mean zero martingale with respect to the natural filtration (e.g., see Kalbfleisch & Prentice, 2002). Furthermore, it is square integrable, and it has predictable variation process $\langle M_i\rangle(t) = A_i(t)$ (e.g., see Gill, 1984, or theorem 2 of Appendix B in Shorack & Wellner, 1986). For each $n \ge 1$, define $M_n = \sum_{i=1}^n M_i$, together with $N_n = \sum_{i=1}^n N_i$ and $A_n = \sum_{i=1}^n A_i$, so that $M_n = N_n - A_n$. Then $M_n(t)$ is a mean zero square integrable martingale with predictable variation process
$$\langle M_n\rangle(t) = n\int_0^t \Phi_n(s;\beta_0)\,\lambda_0(s)\,\mathrm{d}s,$$
where $\Phi_n$ is defined in (4). Moreover, the process $B_n$ defined in (51) is a mean zero, square integrable martingale with respect to the filtration $\mathcal{F}^n_t$. To see this, we apply theorem B.3.1c in Shorack & Wellner (1986) with $Y$, $H$, $M$, $N$, and $A$ replaced by $B_n$, $Y_n$, $M_n$, $N_n$, and $A_n$, respectively. In order to check the conditions of this theorem, note that $Y_n$ is a predictable process satisfying $|Y_n(t)| < \infty$, almost surely, for all $t \ge 0$. Moreover, because for $s \le M$ we have $\Phi(s;\beta_0) \ge \Phi(M;\beta_0) > 0$, the required integrability follows from assumption (A2). It then follows from theorem B.3.1c in Shorack & Wellner (1986) that $B_n$ is a square integrable martingale with mean zero and with predictable variation process expressed in terms of $\Phi$ and $\Phi_n$, defined in (2) and (4), respectively.
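For concreteness, the counting process and the at-risk process at a fixed time point can be computed directly from the observed data. This is an illustrative helper, not part of the estimation procedure; note that $1_{\{X_i \le t\}}\Delta_i = 1_{\{T_i \le t\}}\Delta_i$, since $\Delta_i = 1$ implies $T_i = X_i$.

```python
import numpy as np

def counting_process(T, Delta, t):
    """N(t) = sum_i 1{X_i <= t, Delta_i = 1}: number of observed
    failures up to time t (uses T_i = X_i whenever Delta_i = 1)."""
    return int(np.sum((T <= t) & (Delta == 1)))

def at_risk(T, t):
    """Y(t) = sum_i 1{T_i >= t}: number of subjects still at risk at t."""
    return int(np.sum(T >= t))

# Tiny example: three subjects, the second one censored.
T = np.array([1.0, 2.0, 3.0])
Delta = np.array([1, 0, 1])
n_failures = counting_process(T, Delta, 2.5)  # only subject 1 has failed by 2.5
n_at_risk = at_risk(T, 2.0)                   # subjects 2 and 3 still at risk
```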
It is straightforward to verify that, for $t \in [0, M]$ and $M < T_{(n)}$, the difference $\Lambda_n(t) - \Lambda_0(t)$ can be written as $B_n(t)$ plus a remainder term $R_n(t)$. For establishing suitable bounds on the tail probabilities of $\tilde U_n(a)$, we need the following result for the process $B_n$, which is comparable with condition (A2) in Durot (2007).

Lemma 6.2. Suppose that (A1)-(A2) hold. Let $0 < M < \tau_H$ and let $B_n$ be defined as in (51). Then, there exists a constant $C > 0$ such that, for all $x > 0$ and $t \in [0, M]$,
$$E\left[\sup_{u \in [0,M],\, |u - t| \le x}\left(B_n(u) - B_n(t)\right)^2\right] \le \frac{Cx}{n}.$$

Proof. The proof is similar to that of theorem 3 in Durot (2007). First, consider the case $t \le u \le t + x$. According to lemma 6.1, $B_n$ is a martingale. Hence, by Doob's inequality, we have
$$E\left[\sup_{u \in [0,M],\, t \le u \le t+x}\left(B_n(u) - B_n(t)\right)^2\right] \le 4\,E\left[\langle B_n\rangle(t+x) - \langle B_n\rangle(t)\right] \le \frac{Cx}{n},$$
where the last inequality follows from (A2), for some $C > 0$. This proves the lemma for the case $t \le u \le t + x$. For the case $t - x \le u \le t$, we can bound $E\left[\sup_{u\in[0,M],\, t-x\le u\le t}\left(B_n(u) - B_n(t)\right)^2\right]$ in the same way; similar to (55), the right hand side is again bounded by $Cx/n$.

Note that $U$ is continuous and differentiable on $(\lambda_0(0), \lambda_0(M))$, but it differs from the inverse of $\lambda_0$ on the entire interval $[\lambda_0(0), \lambda_0(\tau_H)]$.

Lemma 6.3. Suppose that (A1)-(A2) hold. Let $0 < M < \tau_H$ and let $\tilde U_n$ and $U$ be defined in (24) and (56), respectively. Suppose that $H^{uc}$, defined in (1), has a bounded derivative $h^{uc}$ on $[0, M]$ and that $\lambda_0'$ is bounded below by a strictly positive constant. Then, there exists an event $E_n$, such that $1_{E_n} \to 1$ in probability, and a constant $K$ such that, for every $a \ge 0$ and $x > 0$,
$$P\left(\left\{|\tilde U_n(a) - U(a)| \ge x\right\} \cap E_n \cap \left\{\tilde U_n(a) \le M\right\}\right) \le K\max\left\{\frac{1}{nx^3},\; \frac{1}{n^3x^5}\right\}, \qquad (57)$$
for $n$ sufficiently large.
Note that lemmas 6.2 and 6.3 correspond to theorem 3(i) and lemma 2 in Durot (2007). It is useful to spend some words on the restriction to the event $E_n \cap \{\tilde U_n(a) \le M\}$. The event $\{\tilde U_n(a) \le M\}$ is implicit in Durot (2007), because there the Grenander-type estimator is defined by only considering $\Lambda_n$ on a compact interval not containing the end point of the support. The event $E_n$ is needed in our setup because of the presence of the covariates, which leads to more complicated processes, and because we require (25) for $p = 2$.
Proof of Lemma 6.3. First, we note that, from the definition of $U$ and the fact that $\tilde U_n$ is increasing, it follows that for $a$ outside $[\lambda_0(0), \lambda_0(M)]$ the quantity $|\tilde U_n(a) - U(a)|$ is bounded by the corresponding difference at the nearest endpoint. Hence, it suffices to prove (57) only for $a \in [\lambda_0(0), \lambda_0(M)]$. Let $E_n$ be the event from lemma 3.1. We start by writing the decomposition (58). First, consider the first probability on the right hand side of (58). It is zero, if $U(a) + x > M$. Otherwise, if $U(a) + x \le M$, then $x \le M$ and we obtain the bound in (59). Let $i \ge 0$ be such that $M - U(a) \in [x2^i, x2^{i+1})$, and note that, on the event $E_n$, one has $T_{(n)} \ge M$. Therefore, if $U(a) < y \le M$, then $y \le T_{(n)}$ and $U(a) < T_{(n)}$. It follows that the previous probability can be bounded by a sum over $k = 0, 1, \ldots, i$ of probabilities in which the supremum is taken over $y \in [0, M]$ such that $y - U(a) \in [x2^k, x2^{k+1})$. Using that $P(X + Y \ge \epsilon) \le P(X \ge \epsilon/2) + P(Y \ge \epsilon/2)$, together with the Markov inequality, we can bound this probability by the expression in (60). For the first term on the right hand side of (60), we use lemma 3.1. In particular, for sufficiently large $n$, we have $\sup_{x\in\mathbb{R}}\left|\Phi_n(x;\hat\beta_n) - \Phi(x;\beta_0)\right| \le \Phi(M;\beta_0)/2$, which yields (61), for $x \in [0, M]$. Using (61), lemma 6.2, and the bound in (59), it follows for the first probability on the right hand side of (58) that there exist $K_1, K_2 > 0$, such that for all $a \ge 0$, $n \ge 1$ and $x > 0$, the bound (63) holds.

We proceed with the second probability on the right hand side of (58). We can assume $x \le U(a)$, because otherwise $P\left(\tilde U_n(a) \le U(a) - x\right) = 0$. On the event $\{\tilde U_n(a) \le U(a) - x\}$, there exists $y \le U(a) - x$ with
$$\Lambda_n(y) - ay - \Lambda_n(U(a)) + aU(a) \le 0.$$
Let $i \ge 0$ be such that $U(a) \in [x2^i, x2^{i+1})$. By a similar argument to the one used to obtain the bound (59), this probability is bounded by a sum of terms involving
$$E\left[\sup_{y \le U(a),\; U(a)-y \in [x2^k, x2^{k+1})} 1_{E_n}\left|R_n(y) - R_n(U(a))\right|^3\right].$$
In the same way as in the first case, we also have
$$E\left[\sup_{y \le U(a),\; U(a)-y \in [x2^k, x2^{k+1})} 1_{E_n}\left|R_n(y) - R_n(U(a))\right|^3\right] \le K_2\max\left\{\frac{x2^{k+1}}{n^3},\; \frac{x^3 2^{3(k+1)}}{n}\right\}.$$
Exactly as in (63), lemma 6.2 and (64) imply a bound of the same form for the second probability on the right hand side of (58), for some positive constant $K$. Together with (58) and (63), this finishes the proof.
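The splitting device used in the proof above, combined with Markov's inequality for third moments, can be written out as follows (for nonnegative random variables $X$ and $Y$ and a level $\epsilon > 0$; this is an elementary standard bound, stated here for completeness):

```latex
P(X + Y \ge \epsilon)
  \le P\!\left(X \ge \frac{\epsilon}{2}\right) + P\!\left(Y \ge \frac{\epsilon}{2}\right)
  \le \frac{8\,E[X^3]}{\epsilon^3} + \frac{8\,E[Y^3]}{\epsilon^3}.
```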
Lemma 6.4. Suppose that (A1)-(A2) hold. Let $0 < \epsilon < M' < M < \tau_H$ and suppose that $H^{uc}$, defined in (1), has a bounded derivative $h^{uc}$ on $[0, M]$. Let $\tilde\lambda_n$ be the Grenander-type estimator of a nondecreasing baseline hazard rate $\lambda_0$, which is differentiable with $\lambda_0'$ bounded above and below by strictly positive constants. Let $E_n$ be the event from lemma 3.1 and take the constant in (49) such that (65) holds. Then, there exists a constant $C$ such that, for $n$ sufficiently large,
$$\sup_{t \in (\epsilon, M']} E\left[n^{2/3}\, 1_{E_n} \left(\tilde\lambda_n(t) - \lambda_0(t)\right)^2\right] \le C.$$

Proof. It is sufficient to prove that there exist constants $C_1, C_2 > 0$, such that for each $n \in \mathbb{N}$ and each $t \in (\epsilon, M']$, we have
$$P\left(n^{1/3}\,1_{E_n}\left(\tilde\lambda_n(t) - \lambda_0(t)\right) \ge x\right) \le \frac{C_1}{x^3} \qquad (66)$$
and
$$P\left(n^{1/3}\,1_{E_n}\left(\lambda_0(t) - \tilde\lambda_n(t)\right) \ge x\right) \le \frac{C_2}{x^3}. \qquad (67)$$

Scand J Statist 45

Let us first consider (66). We will make use of the following result for a fixed $\eta > 0$. With $a = \lambda_0(t)$, we distinguish between the cases $a + n^{-1/3}x \le \lambda_0(M)$ and $a + n^{-1/3}x > \lambda_0(M)$. We prove that, in the first case, there exists a positive constant $C$ such that, for all $t \in (\epsilon, M']$, all $n \in \mathbb{N}$, and all $x \ge \eta$, the probability in (66) is bounded by $C/x^3$, and that in the second case it is zero. First, assume $a + n^{-1/3}x \le \lambda_0(M)$. By the switching relation, we get
$$P\left(E_n \cap \left\{n^{1/3}\left(\tilde\lambda_n(t) - \lambda_0(t)\right) > x\right\}\right) = P\left(E_n \cap \left\{\tilde U_n(a + n^{-1/3}x) < t\right\}\right).$$
Because $a + n^{-1/3}x \le \lambda_0(M)$, we have $U(a + n^{-1/3}x) \le M$ and $U(a + n^{-1/3}x) > t$. Furthermore, $\{\tilde U_n(a + n^{-1/3}x) < t\} \subseteq \{\tilde U_n(a + n^{-1/3}x) < M\}$. Hence, together with lemma 6.3, we can write
$$P\left(E_n \cap \left\{\tilde U_n(a + n^{-1/3}x) < t\right\}\right) \le K\max\left\{\frac{1}{n\left(U(a+n^{-1/3}x) - t\right)^3},\; \frac{1}{n^3\left(U(a+n^{-1/3}x) - t\right)^5}\right\} \le \frac{C}{x^3},$$
because $U(a + n^{-1/3}x) - t = U'(\theta_n)\,n^{-1/3}x$, for some $\theta_n \in (a, a + n^{-1/3}x)$, where
$$U'(\theta_n) = \frac{1}{\lambda_0'\left(\lambda_0^{-1}(\theta_n)\right)} \ge \frac{1}{\sup_{t\in[0,\tau_H]}\lambda_0'(t)} > 0.$$
Next, consider the case $a + n^{-1/3}x > \lambda_0(M)$. Note that we cannot argue as in the previous case, because for $a + n^{-1/3}x > \lambda_0(M)$ we always have $U(a + n^{-1/3}x) = M$, so that we lose the dependence on $x$. However, if $n^{1/3}(\tilde\lambda_n(t) - \lambda_0(t)) > x$, then for each $y > t$ we have
$$\tilde\Lambda_n(y) - \tilde\Lambda_n(t) \ge \tilde\lambda_n(t)(y - t) > \left(a + n^{-1/3}x\right)(y - t),$$
where $a = \lambda_0(t)$. In particular, for $y = \tilde M = M' + (M - M')/2$, we obtain (69), and, by the definition of the constant in (49), the probability on the right hand side of (69) is zero. This concludes the proof of (66). Next, we have to deal with (67). Arguing as in the proof of (66), we obtain, for a fixed $\eta > 0$ and with $a = \lambda_0(t)$, a corresponding bound. First of all, we can assume that $a - n^{-1/3}x \ge 0$, because otherwise $P\left(\tilde\lambda_n(t) \le a - n^{-1/3}x\right) = 0$.
Because $t = U(a)$, as before, we write
$$P\left(\left\{\tilde U_n(a - n^{-1/3}x) \ge t\right\} \cap E_n\right) \le P\left(\left\{\left|\tilde U_n(a - n^{-1/3}x) - U(a - n^{-1/3}x)\right| \ge t - U(a - n^{-1/3}x)\right\} \cap E_n\right).$$
In order to apply lemma 6.3, we intersect with the event $\{\tilde U_n(a - n^{-1/3}x) \le M\}$. Note that, on the event $E_n$, one has $\tilde U_n(a - n^{-1/3}x) \le M$. This can be seen as follows. If $\tilde\lambda_n(M) \le a - n^{-1/3}x$, then for each $y < M$ we have
$$\tilde\Lambda_n(M) - \tilde\Lambda_n(y) \le \tilde\lambda_n(M)(M - y) \le \left(a - n^{-1/3}x\right)(M - y).$$
In particular, for $y = \tilde M = M' + (M - M')/2$, similar to (69), we obtain a bound as in (70). Because $a = \lambda_0(t) \le \lambda_0(M')$, we can argue as in (70) and conclude that the probability on the right hand side is zero. It follows that
$$P\left(\left\{\tilde U_n(a - n^{-1/3}x) \ge t\right\} \cap E_n\right) \le P\left(\left\{\left|\tilde U_n(a - n^{-1/3}x) - U(a - n^{-1/3}x)\right| \ge t - U(a - n^{-1/3}x)\right\} \cap E_n \cap \left\{\tilde U_n(a - n^{-1/3}x) \le M\right\}\right)$$
$$\le K\max\left\{\frac{1}{n\left(t - U(a - n^{-1/3}x)\right)^3},\; \frac{1}{n^3\left(t - U(a - n^{-1/3}x)\right)^5}\right\}.$$
To bound the right hand side, we have to distinguish between $a - n^{-1/3}x > \lambda_0(0)$ and $a - n^{-1/3}x \le \lambda_0(0)$. If $a - n^{-1/3}x > \lambda_0(0)$, then the right hand side is bounded by $K/x^3$, because $t - U(a - n^{-1/3}x) = U'(\theta_n)\,n^{-1/3}x$, for some $\theta_n \in (a - n^{-1/3}x, a)$, where $U'(\theta_n)$ is bounded below by a strictly positive constant, as before. If $a - n^{-1/3}x \le \lambda_0(0)$, then we are done, because then $P\left(n^{1/3}\,1_{E_n}\left(\lambda_0(t) - \tilde\lambda_n(t)\right) \ge x\right) = 0$. This can be seen as follows. When $a - n^{-1/3}x \le \lambda_0(0)$, then for each $y < t$ we have
$$\tilde\Lambda_n(t) - \tilde\Lambda_n(y) \le \tilde\lambda_n(t)(t - y) \le \left(a - n^{-1/3}x\right)(t - y).$$
In particular, for $y = \epsilon_0 = \epsilon/2$, we obtain a bound as in (70). Because $a - n^{-1/3}x \le \lambda_0(0)$, we can argue as in (70) and conclude that the probability on the right hand side is zero. This concludes the proof of (67).
Lemma 6.5. Suppose that (A1)-(A2) hold. Fix $x \in (0, \tau_H)$. Let $0 < \epsilon < x < M' < M < \tau_H$ and suppose that $H^{uc}$, defined in (1), has a bounded derivative $h^{uc}$ on $[0, M]$. Let $\tilde\lambda_n$ be the Grenander-type estimator of a nondecreasing baseline hazard rate $\lambda_0$, which is differentiable with $\lambda_0'$ bounded above and below by strictly positive constants. Let $E_n$ be the event from lemma 3.1 and assume that the constant in (49) satisfies (65). Then
$$1_{E_n}\, n^{2/3}\int_{x-b}^{x+b}\left(\tilde\lambda_n(t) - \lambda_0(t)\right)^2\mathrm{d}t = O_p(1).$$

Proof. Markov's inequality and Fubini yield, for $K > 0$,
$$P\left(1_{E_n}\, n^{2/3}\int_{x-b}^{x+b}\left(\tilde\lambda_n(t) - \lambda_0(t)\right)^2\mathrm{d}t \ge K\right) \le \frac{1}{K}\int_{x-b}^{x+b} E\left[1_{E_n}\, n^{2/3}\left(\tilde\lambda_n(t) - \lambda_0(t)\right)^2\right]\mathrm{d}t.$$
For $n$ sufficiently large, $[x - b, x + b] \subseteq [\epsilon, M']$, so that, according to lemma 6.4, the right hand side is bounded by $2C/K$, for some constant $C > 0$. This proves the lemma.
Proof of Lemma 3.3. Take $x < M < \tau_H$ and $n$ sufficiently large such that $x + b \le M$. With $\bar a_{n,x}$ and $\bar\Lambda_n$ defined in (19), and writing $\hat\Phi_n(u) = \Phi_n(u;\hat\beta_n)$, $\hat\Phi(u) = \Phi(u;\hat\beta_n)$, and $\Phi_0(u) = \Phi(u;\beta_0)$, we have, using Fubini, the definition of $\Phi$ in (2), and the fact that $\bar a_{n,x}\circ\bar\Lambda_n$ and $a_{n,x}$ are zero outside $[x - b, x + b]$,
$$\int\left\{\bar\theta_{n,x}(u,\delta,z) - \theta_{n,x}(u,\delta,z)\right\}\mathrm{d}P(u,\delta,z) = 1_{E_n}\int_{x-b}^{x+b}\left(\frac{a_{n,x}(\hat A_n(u))}{\hat\Phi_n(\hat A_n(u))} - \frac{a_{n,x}(u)}{\hat\Phi_n(u)}\right)\hat\Phi_n(u)\left(\frac{\Phi_0(u)\,\lambda_0(u)}{\hat\Phi(u)} - \tilde\lambda_n(u)\right)\mathrm{d}u,$$
where $\hat A_n(u)$ is defined in (20). The Cauchy-Schwarz inequality then yields
$$\left|\int\left\{\bar\theta_{n,x}(u,\delta,z) - \theta_{n,x}(u,\delta,z)\right\}\mathrm{d}P(u,\delta,z)\right| \le 1_{E_n}\left\|\left(\frac{a_{n,x}\circ\hat A_n}{\hat\Phi_n\circ\hat A_n} - \frac{a_{n,x}}{\hat\Phi_n}\right)\hat\Phi_n\,1_{[x-b,x+b]}\right\|\,\left\|\left(\frac{\Phi_0\,\lambda_0}{\hat\Phi} - \tilde\lambda_n\right)1_{[x-b,x+b]}\right\|.$$
Furthermore, the first norm on the right hand side can be bounded by a sum of two terms with constants $c_1, c_2 > 0$, where we use the boundedness of $k'$, of $\mathrm{d}\Phi(x;\beta_0)/\mathrm{d}x$, and of $1/\Phi(x;\beta_0)$ on $[0, M]$. This also holds when $\hat A_n(u)$ coincides with a jump point, simply because $|\lambda_0(u) - \lambda_0(\hat A_n(u))| \le |\lambda_0(u) - \tilde\lambda_n(u)|$. As a result, using lemma 6.5, we obtain the required bound for the first term on the right hand side of (74). For the second term on the right hand side of (74), we find a corresponding bound for the difference involving $\hat\Phi_n$. Consequently, from (74) together with (62), we obtain the desired bound for the first term on the right hand side of (73). For the second term on the right hand side of (73), we first decompose it into two differences; on the event $E_n$, each of them can be bounded due to lemma 6.5. Together with (73), this concludes the proof.
To establish lemma 3.4, we need a slightly stronger version of lemma 6.4. Note that, in order to obtain the uniform result in (78), we lose a factor $n^{2/9}$ with respect to the bound in lemma 6.4. This might not be optimal, but it is sufficient for our purposes.

Lemma 6.6. Suppose that (A1)-(A2) hold. Let $0 < \epsilon < M' < M < \tau_H$ and suppose that $H^{uc}$, defined in (1), has a bounded derivative $h^{uc}$ on $[0, M]$. Let $\tilde\lambda_n$ be the Grenander-type estimator of a nondecreasing baseline hazard rate $\lambda_0$, which is differentiable with $\lambda_0'$ bounded above and below by strictly positive constants. Then the uniform bound in (78) holds.

Proof of Lemma 3.4. Take $0 < \epsilon < x < M' < M < \tau_H$ and let $n$ be sufficiently large such that $x + b \le M'$. Denote by $R_n$ the left hand side of (28). Write $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$ and decompose $R_n = R_{n1} + R_{n2}$, where
$$R_{n1} = n^{-1/2}\,1_{E_n}\int \delta\,1_{[x-b,x+b]}(u)\left(\frac{\bar a_{n,x}(\bar\Lambda_n(u;\hat\beta_n))}{\Phi_n(u;\hat\beta_n)} - \frac{a_{n,x}(u)}{\Phi_n(u;\hat\beta_n)}\right)\mathrm{d}\mathbb{G}_n(u,\delta,z),$$
and $R_{n2}$ collects the remaining part of the integrand. We consider the following events.