Nonparametric Shrinkage Estimation for Aalen's Additive Hazards Model


Abstract

Aalen's nonparametric additive model, in which the regression coefficients are assumed to be unspecified functions of time, is a flexible alternative to Cox's proportional hazards model when the proportionality assumption is in doubt. In this paper, we incorporate a general linear hypothesis into the estimation of the time-varying regression coefficients. We combine unrestricted least squares estimators with estimators that are restricted by the linear hypothesis to produce James-Stein-type shrinkage estimators of the regression coefficients. We derive the asymptotic joint distribution of the restricted and unrestricted estimators and use it to study the relative performance of the proposed estimators via their integrated asymptotic distributional risks. We conduct Monte Carlo simulations to examine the relative performance of the estimators in terms of their integrated mean squared errors. We also compare the proposed estimators with a recently devised LASSO estimator and with ridge-type estimators, both via simulations and via data on the survival of primary biliary cirrhosis patients.

1 Introduction

Survival analysis is a branch of statistics that deals with the analysis of time-to-event data (or, more generally, of event histories). Applications of event history analysis are numerous in the medical field, but are also found in economics, engineering and sociology. Usually the investigator collects data on the occurrence times of the outcome variable along with a set of predictors (covariates) such as gender, age, social status, biomarkers of diseases and other variables of a similar nature. The investigator then attempts to determine if and how such covariates influence the occurrence rate (intensity) of the event of interest. Cox's proportional hazards (PH) model is one of the earliest and perhaps the most widely used statistical model for addressing such questions. Cox's original model was reformulated in terms of counting process theory by Andersen & Gill (1982). The reformulation led to the multiplicative intensity (MI) model, which extends the PH model in the sense that it allows multiple events and time-varying covariate processes. The PH model and its variants assume that the intensity function of the counting process defined by the events of interest is the product of a nonparametric baseline intensity function and a parametric part consisting of a function of a linear combination of the independent variables. The effects of the covariates are measured through the unknown coefficients of the linear combination (regression parameters), which do not depend on time. This implies that the hazard ratio of two individuals differing only in the level of a given covariate is constant over time. This property, known as the proportional hazards assumption, yields an attractive interpretation of the model in terms of risk ratios and is mathematically tractable. However, such an assumption is sometimes not appropriate for the data at hand.
In such situations, time-varying regression coefficients are required in order to quantify the effect of the covariates on the intensity function. An alternative to the MI model is provided by the additive regression model proposed by Aalen (1980). In this model the intensity function is governed by the covariates as well as the past events through a linear regression with time-varying coefficients. Estimation of Aalen's nonparametric time-varying regression coefficients is performed via weighted least squares, and the asymptotic properties of the resulting estimators are studied using martingale theory for counting processes (Martinussen & Scheike 2006).

Often there is a large number of potential covariates that could be included in such regression models. However, it may turn out that only a handful of these covariates are relevant in explaining the outcome of interest. Investigators usually employ either intuitive judgement or a model selection mechanism in order to filter out most of the unimportant covariates and obtain a parsimonious final model. Despite their appeal, model selection methods have the disadvantage of introducing bias, because the discarded covariates may not be completely irrelevant. Therefore, the question of whether to settle for a reduced (uncertain) model or a full (possibly inefficient) model remains open. A way out of this dilemma is to construct James-Stein-type shrinkage estimators which incorporate both models into the estimation process (Saleh 2006). In the classical linear and partially linear regression models and in censored data models, the shrinkage estimators are known to dominate the unrestricted estimators (based on the full model) over the whole parameter space and to dominate the restricted estimators (based on the linear hypothesis) except in a small neighborhood of the linear restriction (Ahmed, Doksum, Hossain and You, 2007; Raheem, Ahmed and Doksum, 2012).

In this manuscript we propose James-Stein-type shrinkage estimators for the nonparametric regression coefficients in Aalen's additive model under a general linear hypothesis about the coefficients.

The manuscript is organized as follows. In Section 'The proposed methodology', we introduce Aalen's additive model, define a general linear hypothesis to be satisfied by the regression coefficients and provide restricted estimators of the coefficients. We study the joint asymptotic normality of the restricted and unrestricted estimators. We then define James-Stein-type shrinkage estimators of the coefficients under the uncertain prior information given in the form of the linear hypothesis. We define and study the integrated distributional quadratic risks of the proposed shrinkage estimators and compare them asymptotically to those of the restricted and unrestricted estimators. In Section 'Empirical Studies', we conduct Monte Carlo simulations examining the small sample performance of the estimators. Furthermore, we compare the performance of the proposed estimators to a recently devised least absolute shrinkage and selection operator (LASSO) estimator as well as to a ridge-type estimator, both via simulations and via analysis of data on the survival of primary biliary cirrhosis patients.

2 The proposed methodology

2.1 Aalen's additive model and the unrestricted estimator

Event history data are usually presented in the form of triplets math formula for a sample of individuals i = 1,…,n, where Ni(t) is the number of events up to time t, Yi(t) is a risk indicator which is one if the ith individual is at risk just prior to time t and zero otherwise, and Xi(t) is a k × 1 vector of locally bounded covariates. Let math formula be the filtration of σ-fields generated by {(Ni(t),Yi(t),Xi(t));i = 1,2,…,n;0 ≤ t ≤ τ}, where [0,τ] is the time window in which the study took place. For simplicity of notation, let us redefine the vector of covariates to include the risk indicator functions, and organize them in an n × k design matrix, math formula. Also define the vector of counting processes math formula and the corresponding vector of intensity functions math formula.

Following Martinussen & Scheike (2006) pp. 108–109, Aalen's nonparametric additive regression model is defined through the intensity functions of the counting processes, N(t), as follows

display math

where math formula is a locally integrable, k-dimensional vector of regression coefficients. In general a major objective is to estimate the cumulative regression coefficients vector, math formula, where

display math

An estimator of B(t), motivated by a weighted least squares argument, was proposed by Aalen (1980) and discussed in detail in Martinussen & Scheike (2006). The estimator is defined by

display math(1)

where math formula is an n × n predictable diagonal weight matrix. Martinussen & Scheike (2006) recommended the use of math formula. However, with this choice of weights, the math formula obtained via (1) is not a real estimator since the weights depend on the unknown parameters. One solution is to obtain initial estimates of β(t) by smoothing (1) with W = I and then to iterate the procedure by updating the weights in (1) with math formula based on the initial estimates. It has been shown that the asymptotic properties of the resulting estimator are equivalent to those for known weights. In the following discussions we will adopt these weights and, for simplicity, we will assume that the weights are known. We call the estimator obtained in this way the unrestricted estimator of B(t). Following Martinussen & Scheike (2006), under some regularity conditions, math formula on the space math formula of all càdlàg functions defined on [0,τ] with values in the real numbers, where U is a mean zero Gaussian martingale with covariance function

display math

and

display math

This covariance function can, in general, be estimated by plugging in consistent estimators of Ω and β(s). Obviously an estimator of β(s) is math formula while on the other hand, it can be easily shown that

display math(2)

provides a consistent estimator of Ω (see Tomanelli 2012).
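The least squares construction behind (1) can be sketched in a few lines. This is a minimal illustration, assuming identity weights (W = I), time-fixed covariates, right-censored single-event data and no tied event times; the function name and its arguments are hypothetical and are not taken from the paper or from any package. At each observed event time, the increment of the cumulative coefficients is the least squares solution computed over the individuals still at risk.

```python
import numpy as np

def aalen_cumulative_coefficients(times, events, X):
    """Unweighted (W = I) least squares estimate of B(t) at each event time.

    times  : (n,) observed times (event or censoring), assumed untied
    events : (n,) 1 if the observed time is an event, 0 if censored
    X      : (n, k) time-fixed covariates (first column of ones for a baseline)
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    jump_times, increments = [], []
    for i in np.argsort(times):
        if events[i] == 0:
            continue                          # censored times contribute no jump
        at_risk = times >= times[i]           # risk indicator just before this time
        Xr = X * at_risk[:, None]             # fold the risk indicators into the design
        gram = Xr.T @ Xr
        if np.linalg.matrix_rank(gram) < k:   # increment undefined once X'X is singular
            break
        dN = np.zeros(len(times))
        dN[i] = 1.0                           # the counting process jumps for subject i
        increments.append(np.linalg.solve(gram, Xr.T @ dN))
        jump_times.append(times[i])
    B_hat = np.cumsum(np.array(increments).reshape(-1, k), axis=0)
    return np.array(jump_times), B_hat
```

With an intercept as the only column of X, each increment equals one over the number at risk, so the output reduces to the Nelson-Aalen estimator. This gives a quick sanity check of the sketch.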

2.2 The restricted estimator

Often investigators have some idea about the importance of the covariates in the model. For instance, it may be suspected that a reduced model, in which some of the covariates are replaced by zero, is the correct model. In some circumstances investigators might use a model selection technique to determine a reduced model and utilize it as the model of choice. Investigators might also consider imposing a certain contrast on the coefficients of interest. This host of possible restrictions can be unified in the form of a general linear hypothesis,

display math

where 0 ≤ t ≤ τ, R is a known q × k full-rank matrix of constants and r1(t) is a function which is Riemann-integrable on every compact subset of math formula. For instance, if the investigator believes that the vector of coefficients β(t) is partitioned into math formula and it is suspected that β(1)(t) = 0, then one could identify an appropriate contrast matrix, R, and set r1(t) = 0.

Using Lagrange multipliers it is easy to derive the restricted estimator of the cumulative coefficients under the above linear hypothesis as,

display math

where Ik is the k-dimensional identity matrix, and

display math
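For intuition, the Lagrange-multiplier argument can be illustrated on a single least squares increment. The sketch below uses the textbook restricted least squares correction and is only an assumption about the general shape of such a projection; it is not a transcription of the displays above, and `restricted_increment`, `gram_inv` and `r` are hypothetical names.

```python
import numpy as np

def restricted_increment(db_hat, gram_inv, R, r):
    """Adjust an unrestricted increment so that it satisfies R b = r.

    db_hat   : (k,) unrestricted least squares increment
    gram_inv : (k, k) inverse of X'WX at the current time (assumed available)
    R        : (q, k) full-rank restriction matrix
    r        : (q,) right-hand side of the restriction for this increment
    """
    db_hat = np.asarray(db_hat, dtype=float)
    R = np.asarray(R, dtype=float)
    r = np.asarray(r, dtype=float)
    middle = R @ gram_inv @ R.T                        # q x q, invertible for full-rank R
    violation = R @ db_hat - r                         # how far db_hat is from the restriction
    correction = gram_inv @ R.T @ np.linalg.solve(middle, violation)
    return db_hat - correction
```

By construction the returned vector satisfies the restriction exactly, and it coincides with the input when the input already satisfies it.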

The main result of this section is the asymptotic joint normality of the restricted and unrestricted estimators under a sequence of local alternatives. This result is important on its own and is the main tool for the risk analysis of the shrinkage estimators. To this end, we define the following sequence of local alternatives

display math(3)

where δ1(t) is a known q×1 vector of functions which are Riemann-integrable on every compact subset of math formula and math formula. Also let math formulamath formula, and define

display math

Now, the following proposition constitutes the main result of this section. The proof of the proposition is provided in the Appendix.

Proposition 2.1. Under conditions math formula in the Appendix and for the sequence of local alternatives given above, we have

  1. display math
    on math formula, where math formula is a Gaussian martingale with mean math formula and covariance
    display math
    for 0 ≤ t ≤ τ.
  2. display math
    where math formula. Furthermore, if J(t) is replaced by a consistent estimator math formula, the asymptotic chi-square distribution remains valid. That is,
    display math(4)

A version of math formula can be obtained by replacing Ω(t) in the expression for J(t) by math formula as given in (2).

2.3 The shrinkage estimators and their asymptotic performance

We are now in a position to define the proposed James-Stein-type estimators for the cumulative regression coefficients. The James-Stein shrinkage estimator can be defined by:

display math

where c, which is known as the shrinkage constant, is chosen in an interval such that math formula dominates math formula for all c in that interval, and φn(t) is defined in (4). Following the same method as in the original work of James and Stein (Saleh 2006), one can show that math formula dominates math formula for all 0 ≤ c ≤ 2(q − 2). In addition, one can show that the integrated asymptotic distributional risk is smallest when c = q − 2; hence, in what follows, we consider only the case c = q − 2.

Sometimes the James-Stein estimator defined above suffers from a phenomenon known as over-shrinkage, whereby negative coordinates of math formula are obtained whenever math formula. A remedy for this problem is the so-called positive-part shrinkage estimator, defined by truncating the shrinkage estimator in the following way:

display math
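As a rough illustration of how the two combinations behave, the sketch below assumes the usual James-Stein-type form found in the shrinkage literature: the restricted estimate plus a factor (1 − c/φn) times the difference between the unrestricted and restricted estimates, with c = q − 2, and the positive-part version truncating that factor at zero. The function name and array layout are hypothetical, and the test statistic φn is taken as given.

```python
import numpy as np

def shrinkage_estimators(B_unres, B_res, phi_n, q):
    """Combine unrestricted and restricted estimates on a common time grid.

    B_unres, B_res : (m, k) estimates of B(t) at m time points
    phi_n          : (m,) positive values of the test statistic at those points
    q              : number of restrictions (must exceed 2)
    """
    if q <= 2:
        raise ValueError("the shrinkage constant c = q - 2 requires q > 2")
    B_unres = np.asarray(B_unres, dtype=float)
    B_res = np.asarray(B_res, dtype=float)
    phi_n = np.asarray(phi_n, dtype=float)
    c = q - 2
    factor = (1.0 - c / phi_n)[:, None]        # one shrinkage factor per time point
    diff = B_unres - B_res
    B_shrink = B_res + factor * diff
    B_pos = B_res + np.maximum(factor, 0.0) * diff   # positive-part truncation
    return B_shrink, B_pos
```

Under this assumed form, a large value of the statistic leaves the result close to the unrestricted estimate, while a value below c makes the plain shrinkage estimate overshoot past the restricted estimate and the positive-part version fall back to it.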

It is usual to compare shrinkage estimators in terms of their asymptotic distributional risk functions (ADR). Here we introduce a new risk measure, based on a quadratic integrated loss function, which we call integrated asymptotic distributional risk (IADR). We define the IADR of an estimator math formula, over [0,τ], as math formula, where Ψ is the distributional limit of the loss function

display math

in which W*(s) is a predictable weight matrix. In particular, if this weight matrix is the identity, the risk reduces to the usual integrated mean squared error. In the following proposition we state the risk dominance of the proposed shrinkage estimators with respect to the restricted and unrestricted estimators. The proof of the proposition is lengthy and hence omitted, but with the help of the joint asymptotic normality of the restricted and unrestricted estimators given in the last proposition, one can work out the proof along the same lines as in Ahmed et al. (2007), Nkurunziza & Ahmed (2011) and Tomanelli (2012).
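A weighted integrated quadratic loss of this kind can be approximated numerically on a time grid. The sketch below is a simplified illustration: it assumes a constant weight matrix rather than a predictable process, omits any normalising factor such as the sample size, and uses the trapezoidal rule. All names are hypothetical.

```python
import numpy as np

def integrated_quadratic_loss(grid, B_est, B_true, weight=None):
    """Trapezoidal approximation of the integral of e(s)' W e(s) over the grid.

    grid   : (m,) increasing time points in [0, tau]
    B_est  : (m, k) estimated cumulative coefficients on the grid
    B_true : (m, k) true cumulative coefficients on the grid
    weight : (k, k) constant weight matrix; identity if omitted
    """
    grid = np.asarray(grid, dtype=float)
    err = np.asarray(B_est, dtype=float) - np.asarray(B_true, dtype=float)
    k = err.shape[1]
    W = np.eye(k) if weight is None else np.asarray(weight, dtype=float)
    integrand = np.einsum("mi,ij,mj->m", err, W, err)   # e(s)' W e(s) at each grid point
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid)))
```

With the identity weight, this is the integrated squared error of one estimated path; averaging it over simulated data sets gives an empirical integrated mean squared error.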

Proposition 2.2. Suppose that Conditions math formula in the Appendix and the sequence of local alternatives in (3) hold. Then,

display math

3 Empirical Studies

3.1 Performance of the proposed estimators

In this section we study the performance of the proposed shrinkage estimators by means of Monte Carlo simulations. To this end we consider a simple survival model whose intensity function for the ith individual is given by math formula where we set β(t) = (β1t,β2t,…,β5t), where βj for j = 1,…,5 are unknown constants, and the covariate process is time independent. This leads to cumulative intensity functions given by Λi(t) = t²Xiβ where β = (β1,…,β5). We generated the covariates Xi2,…,Xi5 independently from the U(0,20) distribution, whereas Xi1 = 1 for every individual. We set β = (2,0,0,0,0) under the null hypothesis and β = (2,0,0,0,δ) under the alternative hypothesis, where δ varied from zero to one in steps of 0.05. These alternatives are simple but address the issue of the performance of the shrinkage estimators as a function of the distance from the null model. In general it is not important whether one or more of the coefficients change under the alternative hypothesis. It is the magnitude of the non-centrality parameter Δ, which is also a measure of how far we are from the null model, that determines the performance of the shrinkage estimators. The random survival times are generated using the fact that, given the first i event times T1 = t1,T2 = t2,…,Ti = ti, the inter-event time Zi = Ti+1 − Ti has cumulative distribution function math formula. Therefore we first set t1 = 0 and for i = 1,…,(n−1) we find zi such that −log(1 − ui) = Λ(ti+zi) − Λ(ti), where ui is a random number from the uniform distribution on [0,1], and then we define ti+1 = ti+zi. The random survival times were then censored using independent uniformly distributed random variates, chosen so that the resulting censoring rates varied from 5% to 15%. Each scenario was simulated 1000 times for sample sizes of n = 250,500,750,1000. In each scenario, we computed the empirical integrated mean squared error (IMSE) for each estimator (shrinkage, positive shrinkage, restricted and unrestricted).
We used the unrestricted estimator as a benchmark and hence reported the relative integrated mean squared error (RIMSE) defined as

display math

where math formula represents any of the proposed estimators. The results are summarised in Figure 1.
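The inversion step used to generate the event times, solving −log(1 − u) = Λ(t + z) − Λ(t) for z, has a closed form when the cumulative intensity is quadratic in time. The sketch below assumes Λ(t) = c·t², where the positive constant c is a hypothetical stand-in for the covariate-dependent term; the function names and the seed are illustrative only.

```python
import numpy as np

def inter_event_time(t, u, c):
    """Solve -log(1 - u) = c*(t + z)**2 - c*t**2 for z >= 0."""
    e = -np.log1p(-u)                  # unit exponential variate from a uniform one
    return np.sqrt(t**2 + e / c) - t

def simulate_event_times(n_events, c, seed=0):
    """Generate t_1 = 0 and then t_{i+1} = t_i + z_i by repeated inversion."""
    rng = np.random.default_rng(seed)
    times = [0.0]
    for _ in range(n_events - 1):
        u = rng.uniform()              # uniform on [0, 1)
        times.append(times[-1] + inter_event_time(times[-1], u, c))
    return np.array(times)
```

For a cumulative intensity without a closed-form inverse, the same equation could instead be solved with a numerical root finder.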

Figure 1.

Relative IMSE for the unrestricted estimator (U) represented by the line at 1, the restricted estimator (R), the shrinkage estimator (S), and the positive shrinkage estimator (PS) with respect to the unrestricted estimator.

The proposed shrinkage estimators outperform the usual restricted and unrestricted estimators on almost all of the parameter space. When the null hypothesis is true (in other words when δ = 0), we see that the best estimator is the restricted estimator, as foreseen from Proposition 2.2, but the performance of this estimator deteriorates substantially as we go away from the null model. On the other hand the positive shrinkage estimator dominates the unrestricted estimator throughout the parameter space and converges to it in terms of IMSE, as δ increases, for all the sample sizes considered. In contrast the shrinkage estimator seems to be worse than the unrestricted at the null hypothesis for sample sizes that are smaller than n = 1000. This may indicate that the asymptotic distributional risk dominance of the shrinkage estimator requires very large samples in order to take effect.

3.2 Comparison to penalized estimators

In recent years, penalty methods have become popular as tools for model selection and estimation in many regression contexts. In particular, the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996) has been extensively studied and used in the statistical literature. This method is also known as the l1 penalized model selection approach. Recently Gorst-Rasmussen & Scheike (2012) proposed an algorithm for computing LASSO estimators for Aalen's additive model when the regression coefficients are time independent. Aalen's model with constant coefficients is a special case of the model studied in the current paper. The authors are not aware of any LASSO-type methodology developed for the general additive model studied in this paper. Consequently we use only this special (constant coefficient) case as a basis for comparing our shrinkage estimators with the LASSO. We also compare our estimators with a ridge-type estimator with an l2 penalty on the regression coefficients. For this set of simulations, we generated the data from the intensity λ(t) = β1X1 + … + βq+4Xq+4 where βi = 10 for i = 1,…,4 and βj = 0 for j = 5,…,q+4. That is, with p = q + 4 denoting the total number of covariates, we set four of the coefficients to non-zero values and the remaining p − 4 = q coefficients to zero. The value of q was varied over q = 10,20,30,40,50, and the sample size n was varied over n = 100,200,500,1000. We ran 1000 simulations in each case and for each run computed our shrinkage estimators, the LASSO and ridge estimators. A 10-fold cross-validation was used to choose the tuning parameter for the LASSO method. Since the penalty methods are independent of the non-centrality parameters, we ran the simulations only under the null model H0 in order to make the comparisons fair. To save space we summarise the results only for the cases n = 100 and 500 in Figures 2 and 3, which depict the RIMSE with respect to the unrestricted estimator, taken as a benchmark.
Note that the larger the RIMSE, the better the estimator is relative to the unrestricted one. The figures show that the shrinkage and positive shrinkage estimators perform better than both the ridge and LASSO estimators only when p/n is sufficiently small. For instance, when q = 10 (and hence p = 14) and n = 100, the shrinkage, positive shrinkage, ridge and LASSO estimators have RIMSE values of 2.9, 3.4, 1.9 and 3.1, respectively. On the other hand, when q = 50, the corresponding RIMSE values are 39.2, 63.4, 66.4 and 73.3, respectively. This pattern is maintained as n grows. For instance, when n = 500 we see that the shrinkage estimators are substantially better than the penalty estimators for all dimensions p = q + 4. From Figures 2 and 3 we infer that the shrinkage estimators outperform the LASSO and ridge estimators as long as p/n < 1/3.
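The relative measure can be computed from simulation output along the following lines. The sketch assumes that the RIMSE of a candidate estimator is the IMSE of the unrestricted estimator divided by the IMSE of the candidate, which is the orientation implied by "larger is better"; the array layout and function names are hypothetical.

```python
import numpy as np

def empirical_imse(estimates, truth, grid):
    """Average over replications of the integrated squared error.

    estimates : (reps, m, k) estimated coefficient paths, one per replication
    truth     : (m, k) true coefficient path on the same grid
    grid      : (m,) increasing time points
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    grid = np.asarray(grid, dtype=float)
    sq_err = np.sum((estimates - truth) ** 2, axis=2)            # (reps, m)
    widths = np.diff(grid)
    integrals = np.sum(0.5 * (sq_err[:, 1:] + sq_err[:, :-1]) * widths, axis=1)
    return float(integrals.mean())

def relative_imse(unrestricted, candidate, truth, grid):
    """Values above 1 mean the candidate beats the unrestricted benchmark."""
    return empirical_imse(unrestricted, truth, grid) / empirical_imse(candidate, truth, grid)
```

Under this convention the unrestricted estimator itself always scores exactly 1, which matches the horizontal reference line described in the figure captions.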

Figure 2.

Relative IMSE for the unrestricted estimator (U) represented by the line at 1, the shrinkage estimator (S), the positive shrinkage estimator (PS), the ridge estimator (R) and the LASSO estimator (L) with respect to the unrestricted estimator when the number of zero coefficients q varies and the sample size is n = 100.

Figure 3.

Relative IMSE for the unrestricted estimator (U) represented by the line at 1, the shrinkage estimator (S), the positive shrinkage estimator (PS), the ridge estimator (R) and the LASSO estimator (L) with respect to the unrestricted estimator when the number of zero coefficients q varies and the sample size is n = 500.

3.3 Application and comparison with LASSO

In this section we apply the proposed shrinkage estimation strategies to the well-known PBC data from a clinical trial on primary biliary cirrhosis (Fleming & Harrington 1991). We compare the performance of our estimators with a recently proposed l1 penalized estimator for the additive model with constant regression coefficients (Gorst-Rasmussen & Scheike 2012). The PBC clinical trial was a randomized placebo-controlled trial of the drug D-penicillamine, and a total of 424 patients were eligible to participate. The data are mostly complete for the first 312 patients, but the last 112 did not actually participate in the clinical trial, consenting only to have measurements recorded and to be followed with respect to survival. In addition to the treatment indicator, a total of 16 covariates, including age, sex and various biomarkers, were measured on the patients. The outcome of interest was the patient's survival time. We calculated LASSO estimators by applying the gradient descent algorithm of Gorst-Rasmussen & Scheike, with 10-fold cross-validation for choosing the optimal tuning parameter. The overall integrated mean squared error in estimating the 16 coefficients was computed using 1000 bootstrap samples and was found to be 1.5 relative to the overall MSE of the vector of unrestricted estimators. To place our comparisons on an equitable footing, we produced our shrinkage estimators based on the linear hypothesis H1:β = 0, which assumes no specific knowledge about the coefficients. The relative (to the unrestricted) integrated mean squared errors of the shrinkage and positive shrinkage estimators were both found to be 1.531. The ridge estimator performed substantially worse than the other estimators, with a relative IMSE of 1.2. This is consistent with our simulation studies comparing the ridge, LASSO and proposed shrinkage estimators.

4 Conclusion

In this manuscript, we have proposed estimators which combine reduced and full model estimators of Aalen's additive hazards regression coefficients. The proposed estimators are computationally inexpensive, perform better than estimators based solely on either the reduced or the full models and are comparable to LASSO and ridge penalized estimators.

Acknowledgements

The authors are grateful for the careful reading and helpful suggestions of the referees and the Associate Editor. This research was in part supported by NSERC Canada Discovery grants.

Appendix: Proofs and regularity conditions

Conditions math formula

  1. display math
  2. display math
  3. display math
  4. display math
  5. display math
  6. display math

Lemma. Under conditions math formula and for the sequence of local alternatives (3),

display math

on math formula, where U* is a Gaussian martingale with mean equal to math formula and covariance equal to

display math

for all 0 ≤ t ≤ τ and κ1(s) = A(s)RΩ(s)RA(s).

Proof of the Lemma. We have ηn(t) = P1,n(t) + P2,n(t) − P3,n(t), where

display math
display math

with Γ(s) = X(s)W(s)X(s), and where M(s) = N(s) − Λ(s) is the martingale associated with the counting process N(s). By using Rebolledo's martingale central limit theorem, one can prove that math formula converges in distribution to a Gaussian martingale on math formula, with covariance function Φ*(t). Furthermore, one can verify that P2,n(t) converges in probability to 0, uniformly on [0,τ].

Finally, after some algebraic computations, one can verify that

display math

uniformly over [0,τ], and this completes the proof.

Proof of Proposition 2.1. Notice that one can rewrite math formula as

display math

The proof is then completed by following the same steps as were used in the proof of the previous lemma.
