Maximum likelihood estimation in semiparametric regression models with censored data

Authors


D. Y. Lin, Department of Biostatistics, CB 7420, University of North Carolina, Chapel Hill, NC 27599-7420, USA.
E-mail: lin@bios.unc.edu

Abstract

Summary.  Semiparametric regression models play a central role in formulating the effects of covariates on potentially censored failure times and in the joint modelling of incomplete repeated measures and failure times in longitudinal studies. The presence of infinite dimensional parameters poses considerable theoretical and computational challenges in the statistical analysis of such models. We present several classes of semiparametric regression models, which extend the existing models in important directions. We construct appropriate likelihood functions involving both finite dimensional and infinite dimensional parameters. The maximum likelihood estimators are consistent and asymptotically normal with efficient variances. We develop simple and stable numerical techniques to implement the corresponding inference procedures. Extensive simulation experiments demonstrate that the inferential and computational methods proposed perform well in practical settings. Applications to three medical studies yield important new insights. We conclude that there is no reason, theoretical or numerical, not to use maximum likelihood estimation for semiparametric regression models. We discuss several areas that need further research.

1. Introduction

The Cox (1972) proportional hazards model is the corner-stone of modern survival analysis. The model specifies that the hazard function of the failure time conditional on a set of possibly time varying covariates is the product of an arbitrary base-line hazard function and a regression function of the covariates. Cox (1972, 1975) introduced the ingenious partial likelihood principle to eliminate the infinite dimensional base-line hazard function from the estimation of regression parameters with censored data. In a seminal paper, Andersen and Gill (1982) extended the Cox regression model to general counting processes and established the asymptotic properties of the maximum partial likelihood estimator and the associated Breslow (1972) estimator of the cumulative base-line hazard function via the elegant counting process martingale theory. The maximum partial likelihood estimator and the Breslow estimator can be viewed as non-parametric maximum likelihood estimators (NPMLEs) in that they maximize the non-parametric likelihood in which the cumulative base-line hazard function is regarded as an infinite dimensional parameter (Andersen et al. (1993), pages 221–229 and 481–483, and Kalbfleisch and Prentice (2002), pages 114–128).

The proportional hazards assumption is often violated in scientific studies, and other semiparametric models may provide more accurate or more concise summarization of data. Under the proportional odds model (Bennett, 1983), for instance, the hazard ratio between two sets of covariate values converges to 1, rather than staying constant, as time increases. The NPMLE for this model was studied by Murphy et al. (1997). Both the proportional hazards and the proportional odds models belong to the class of linear transformation models which relates an unknown transformation of the failure time linearly to covariates (Kalbfleisch and Prentice (2002), page 241). Dabrowska and Doksum (1988), Cheng et al. (1995) and Chen et al. (2002) proposed general estimators for this class of models, none of which are asymptotically efficient. The class of linear transformation models is confined to traditional survival (i.e. single-event) data and time invariant covariates.

As an example of non-proportional hazards structures, Fig. 1 displays (in the full curves) the Kaplan–Meier estimates of survival probabilities for the chemotherapy and chemotherapy plus radiotherapy groups of gastric cancer patients in a randomized clinical trial (Stablein and Koutrouvelis, 1985). The crossing of the two survival curves is a strong indication of crossing hazards. This is common in clinical trials because the patients who receive the more aggressive intervention (e.g. radiotherapy or transplantation) are at elevated risks of death initially but may enjoy considerable long-term survival benefits if they can tolerate the intervention. Crossing hazards cannot be captured by linear transformation models. The use of the proportional hazards model could yield very misleading results in such situations.

Figure 1.

 Kaplan–Meier (——) and model-based estimates (-·-·-·) of survival functions for gastrointestinal tumour patients (the chemotherapy and combined therapy patients are indicated by blue and green respectively): (a) model (3); (b) model (4)

Multivariate or dependent failure time data arise when each study subject can potentially experience several events or when subjects are sampled in clusters (Kalbfleisch and Prentice (2002), chapters 8–10). It is natural and convenient to represent the dependence of related failure times through frailty or random effects (Clayton and Cuzick, 1985; Oakes, 1989, 1991; Hougaard, 2000). The NPMLE of the proportional hazards model with gamma frailty was studied by Nielsen et al. (1992), Klein (1992), Murphy (1994, 1995), Andersen et al. (1997) and Parner (1998). Gamma frailty induces a very restrictive form of dependence, and the proportional hazards assumption fails more often with complex multivariate failure time data than with univariate data. The focus of the existing literature on the proportional hazards gamma frailty model is due to its mathematical tractability. Cai et al. (2002) proposed estimating equations for linear transformation models with random effects for clustered failure time data. Zeng et al. (2005) studied the NPMLE for the proportional odds model with normal random effects and found the estimators of Cai et al. (2002) to be considerably less efficient.

Lin (1994) described a colon cancer study in which the investigators wished to assess the efficacy of adjuvant therapy on recurrence of cancer and death for patients with resected colon cancer. By characterizing the dependence between recurrence of cancer and death through a random effect, one could properly account for the informative censoring caused by death on recurrence of cancer and accurately predict a patient's survival outcome given his or her cancer recurrence time. However, random-effects models for multiple types of events have received little attention in the literature.

In longitudinal studies, data are often collected on repeated measures of a response variable as well as on the time to the occurrence of a certain event. There is a tremendous recent interest in joint modelling, in which models for the repeated measures and failure time are assumed to depend on a common set of random effects. Such models can be used to assess the joint effects of base-line covariates (such as treatments) on the two types of outcomes, to study the effects of potentially mismeasured time varying covariates on the failure time and to adjust for informative drop-out in the analysis of repeated measures. The existing literature (e.g. Wulfsohn and Tsiatis (1997), Hogan and Laird (1997) and Henderson et al. (2000)) has been focused on the linear mixed model for repeated measures and the proportional hazards model with normal random effects for the failure time.

The linear mixed model is confined to continuous repeated measures with normal error. In addition, the transformation of the response variable is assumed to be known. Inference under random-effects models is highly non-robust to misspecification of the transformation. Our experience in human immunodeficiency virus (HIV) and acquired immune deficiency syndrome research shows that different transformations of CD4 cell counts often yield conflicting results. Thus, it would be desirable to employ semiparametric models (e.g. linear transformation models) for continuous repeated measures, so that a parametric specification of the transformation or distribution can be avoided. This kind of model has not been studied even without the task of joint modelling, although econometricians (Horowitz (1998), chapter 5) have proposed inefficient estimators for univariate responses.

As evident from the above description, the existing semiparametric regression models, although very useful, have important limitations and, in most cases, lack efficient estimators or careful theoretical treatments. In this paper, we unify and extend the current literature, providing a comprehensive methodology with strong theoretical underpinning. We propose a very general class of transformation models for counting processes which encompasses linear transformation models and which accommodates crossing hazards, time varying covariates and recurrent events. We then extend this class of models to dependent failure time data (including recurrent events, multiple types of events and clustered failure time data) by incorporating a rich family of multivariate random effects. Furthermore, we present a broad class of joint models by specifying random-effects transformation models for the failure time and generalized linear mixed models for (discrete or continuous) repeated measures. We also propose a semiparametric linear mixed model for continuous repeated measures, under which the transformation of the response variable is completely unspecified.

We establish the consistency, asymptotic normality and asymptotic efficiency of the NPMLEs for the proposed models by appealing to modern empirical process theory (van der Vaart and Wellner, 1996) and semiparametric efficiency theory (Bickel et al., 1993). In fact, we develop a very general asymptotic theory for non-parametric maximum likelihood estimation with censored data. Our general theory can be used to derive asymptotic results for many existing semiparametric models which are not covered in this paper as well as those to be invented in the future. Simulation studies show that the asymptotic approximations are accurate for practical sample sizes.

It is widely believed that NPMLEs are intractable computationally. This perception has motivated the development of ad hoc estimators which are less efficient statistically. We present in this paper simple and effective methods to calculate the NPMLEs and to implement the corresponding inference procedures. These methods apply to a wide variety of semiparametric models with censored data and make the NPMLEs computationally more feasible than the ad hoc estimators (when the latter exist). Their usefulness is amply demonstrated through simulated and real data.

As hinted in the discussion thus far, we are suggesting the following strategies in the research and practice of survival analysis and related fields.

  • (a) Use the new class of transformation models to analyse failure time data.
  • (b) Make routine use of random-effects models for multivariate failure time data.
  • (c) Choose normal random effects over gamma frailty.
  • (d) Determine transformations of continuous response variables non-parametrically.
  • (e) Formulate multiple types of outcome measures with semiparametric joint models.
  • (f) Adopt maximum likelihood estimation for semiparametric regression models.
  • (g) Rely on modern empirical process theory as the primary mathematical tool.

We shall elaborate on these points in what follows, particularly at the end. In addition, we shall pose a wide range of open problems and outline several directions for future research.

2. Semiparametric models

2.1. Transformation models for counting processes

The class of linear transformation models relates an unknown transformation of the failure time T linearly to a vector of (time invariant) covariates Z:

H(T) = −β^T Z + ɛ,    (1)

where H(·) is an unspecified increasing function, β is a set of unknown regression parameters and ɛ is a random error with a parametric distribution. The choices of the extreme value and standard logistic error distributions yield the proportional hazards and proportional odds models respectively.
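As a brief check of this correspondence (a standard calculation, included here for completeness rather than taken from the original development), write the model in terms of the survival function of ɛ:

```latex
\[
P(T > t \mid Z) = P\{\epsilon > H(t) + \beta^{T}Z\}.
\]
\[
\text{Extreme value error, } P(\epsilon > x) = \exp(-e^{x}):\quad
P(T > t \mid Z) = \exp\{-e^{H(t)}e^{\beta^{T}Z}\},
\text{ i.e. } \Lambda(t \mid Z) = e^{H(t)}e^{\beta^{T}Z} \text{ (proportional hazards).}
\]
\[
\text{Standard logistic error, } P(\epsilon > x) = \{1 + e^{x}\}^{-1}:\quad
\frac{P(T \le t \mid Z)}{P(T > t \mid Z)} = e^{H(t)}e^{\beta^{T}Z} \text{ (proportional odds).}
\]
```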

Remark 1.  The familiar linear model form of equation (1) is very appealing. Since the transformation H(·) is arbitrary, the parametric assumption on ɛ should not be viewed as restrictive. In fact, without Z, there is always a transformation such that ɛ has any given distribution.

We extend equation (1) to allow time varying covariates and recurrent events. Let N*(t) be the counting process recording the number of events that have occurred by time t, and let Z(·) be a vector of possibly time varying covariates. We specify that the cumulative intensity function for N*(t) conditional on {Z(s); s ≤ t} takes the form

Λ(t|Z) = G[∫_0^t R*(s) exp{β^T Z(s)} dΛ(s)],    (2)

where G is a continuously differentiable and strictly increasing function, R*(·) is an indicator process, β is a vector of unknown regression parameters and Λ(·) is an unspecified increasing function. For survival data, R*(t)=I(T ≥ t), where I(·) is the indicator function; for recurrent events, R*(·)=1. It is useful to consider the class of Box–Cox transformations

G(x) = {(1+x)^ρ − 1}/ρ,    ρ ≥ 0,

with ρ=0 corresponding to G(x)= log (1+x) and the class of logarithmic transformations

G(x) = log(1+rx)/r,    r ≥ 0,

with r=0 corresponding to G(x)=x. The choice of G(x)=x yields the familiar proportional hazards or intensity model (Cox, 1972; Andersen and Gill, 1982). If N*(·) has a single jump at the survival time T and Z is time invariant, then equation (2) reduces to equation (1).
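For numerical work it is convenient to have these two families, together with their limiting members, available as functions of ρ and r. The following minimal Python sketch is ours (the function names are illustrative, not from any particular package):

```python
import numpy as np

def box_cox_G(x, rho):
    """Box-Cox class: G(x) = {(1 + x)^rho - 1}/rho for rho > 0, and G(x) = log(1 + x) at rho = 0."""
    x = np.asarray(x, dtype=float)
    if rho == 0.0:
        return np.log1p(x)
    return ((1.0 + x) ** rho - 1.0) / rho

def logarithmic_G(x, r):
    """Logarithmic class: G(x) = log(1 + r x)/r for r > 0, and G(x) = x at r = 0 (proportional hazards or intensity)."""
    x = np.asarray(x, dtype=float)
    if r == 0.0:
        return x
    return np.log1p(r * x) / r
```

The limiting members are returned exactly at ρ=0 and r=0, which is convenient when profiling over the transformation parameter as in Section 5.2.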

Remark 2.  Specifying the function G while leaving the function Λ unspecified is equivalent to specifying the distribution of ɛ while leaving the function H unspecified. Non-identifiability arises if both G and Λ (or both H and ɛ) are unspecified and β=0; see Horowitz (1998), page 169.

To capture the phenomenon of crossing hazards as seen in Fig. 1, we consider the heteroscedastic version of linear transformation models

H(T) = −β^T Z + exp(−γ^T Z̃) ɛ,

where Z̃ is a set of (time invariant) covariates and γ is the corresponding vector of regression parameters. For notational simplicity, we assume that Z̃ is a subset of Z, although this assumption is not necessary. Under this formulation, the hazard functions that are associated with different values of Z̃ can cross and the hazard ratio can invert over time. To accommodate such scenarios as well as recurrent events and time varying covariates, we extend equation (2) as follows:

Λ(t|Z) = G[{∫_0^t R*(s) exp{β^T Z(s)} dΛ(s)}^{exp(γ^T Z̃)}].    (3)

For survival data, model (3) with G(x)=x is similar to the heteroscedastic hazard model of Hsieh (2001), who proposed to fit his model by the method of histogram sieves.

Under model (3) and Hsieh's model, the hazard function is infinite at time 0 if γ^T Z̃ < 0. This feature causes some technical difficulty. Thus, we propose the following modification:

Λ(t|Z) = G[{1 + ∫_0^t R*(s) exp{β^T Z(s)} dΛ(s)}^{exp(γ^T Z̃)} − 1].    (4)

If γ=0, equation (4) reduces to equation (2) by redefining G(1+x)−G(1) as G(x). For survival data, the conditional hazard function under model (4) with G(x)=x becomes

λ(t|Z) = exp(β^T Z + γ^T Z̃) λ(t) {1 + exp(β^T Z) Λ(t)}^{exp(γ^T Z̃) − 1},

where λ(t)=dΛ(t)/dt. Here and in what follows, g′(x)=dg(x)/dx for a generic function g. This model is similar to the cross-effects model of Bagdonavicius et al. (2004), who fitted their model by modifying the partial likelihood.
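In the two-sample case the crossing is explicit (this display follows directly from the hazard function above; it is not an additional assumption): with a scalar treatment indicator Z = Z̃,

```latex
\[
\frac{\lambda(t \mid Z=1)}{\lambda(t \mid Z=0)}
= e^{\beta+\gamma}\,\{1 + e^{\beta}\Lambda(t)\}^{e^{\gamma}-1},
\]
```

which equals e^{β+γ} at t=0 and tends to 0 if γ<0 (or to ∞ if γ>0) as Λ(t)→∞, so that the hazard ratio can start above 1 and later fall below it, as suggested by Fig. 1.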

Let C denote the censoring time, which is assumed to be independent of N*(·) conditional on Z(·). For a random sample of n subjects, the data consist of {Ni(t),Ri(t),Zi(t); t ∈ [0,τ]} (i=1,…,n), where Ni(t)=Ni*(t∧Ci), Ri(t)=I(Ci ≥ t)Ri*(t), a∧b=min(a,b) and τ is the duration of the study. For general censoring and truncation patterns, we define Ni(t) as the number of events that are observed by time t on the ith subject, and Ri(t) as the indicator of whether the ith subject is at risk at t.

Write λ(t|Z)=dΛ(t|Z)/dt and θ=(β^T,γ^T)^T. Assume that censoring is non-informative about the parameters θ and Λ(·). Then the likelihood for θ and Λ(·) is proportional to

∏_{i=1}^n ∏_{t≤τ} λ(t|Zi)^{dNi(t)} exp{−∫_0^τ Ri(t) dΛ(t|Zi)},    (5)

where dNi(t) is the increment of Ni over [t,t+dt).
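To fix ideas, the following Python sketch (ours, for illustration only, not the authors' implementation) evaluates the negative logarithm of expression (5) for right-censored survival data with time invariant covariates, with Λ represented by its jump sizes at the distinct uncensored failure times; G and dG stand for a transformation from Section 2.1 and its derivative:

```python
import numpy as np

def neg_log_likelihood(params, X, delta, Z, event_times, G, dG):
    """Negative log of expression (5) under model (2) for right-censored survival data.

    params = (beta_1, ..., beta_p, log lambda_1, ..., log lambda_m), where lambda_j is the
    jump of Lambda at event_times[j] (the sorted distinct uncensored times); X holds the
    observed times, delta the event indicators and Z the time invariant covariates."""
    p = Z.shape[1]
    beta = params[:p]
    lam = np.exp(params[p:])                          # jump sizes, positive by construction
    eta = np.exp(Z @ beta)                            # exp(beta'Z_i)
    at_risk = X[:, None] >= event_times[None, :]      # R_i(t_j) = I(X_i >= t_j)
    A = (at_risk * lam[None, :]).sum(axis=1) * eta    # A_i = sum_{t_j <= X_i} exp(beta'Z_i) lambda_j
    j = np.searchsorted(event_times, X[delta == 1])   # index of each subject's own event time
    event_term = np.log(eta[delta == 1]) + np.log(lam[j]) + np.log(dG(A[delta == 1]))
    return -(event_term.sum() - G(A).sum())           # minus sum_i {dN_i log-intensity - G(A_i)}
```

Minimizing this function over β and the log jump sizes, as described in Section 3, gives the NPMLEs; with G(x)=x they coincide with the maximum partial likelihood estimator and the Breslow estimator.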

2.2. Transformation models with random effects for dependent failure times

For recurrent events, models (2)–(4) assume that the occurrence of a future event is independent of the prior event history unless such dependence is represented by suitable time varying covariates. It is inappropriate to use such time varying covariates in randomized clinical trials because the inclusion of a post-randomization response variable in the model will attenuate the estimator of treatment effect. It is more appealing to characterize the dependence of recurrent events through random effects or frailty. Frailty is also useful in formulating the dependence of several types of events on the same subject or the dependence of failure times among individuals of the same cluster. To accommodate all these types of data structure, we represent the underlying counting processes by N*ikl(t) (i=1,…,n; k=1,…,K; l=1,…,nik), where i pertains to a subject or cluster, k to the type of event and l to individuals within a cluster; see Andersen et al. (1993), pages 660–662. The specific choices of K=nik=1, nik=1 and K=1 correspond to recurrent events, multiple types of events and clustered failure times respectively. For the colon cancer study that was mentioned in Section 1, K=2 (and nik=1), with k=1 and k=2 representing cancer recurrence and death.

The existing literature is largely confined to proportional hazards or intensity models with gamma frailty, under which the intensity function for N*ikl(t) conditional on covariates Zikl(t) and frailty ξi takes the form

ξi R*ikl(t) exp{β^T Zikl(t)} λk(t),    (6)

where ξi (i=1,…,n) are gamma-distributed random variables, R*ikl(·) is analogous to R*(·) of Section 2.1 and λk(·) (k=1,…,K) are arbitrary base-line functions. Murphy (1994, 1995) and Parner (1998) established the asymptotic theory of the NPMLEs for recurrent events without covariates and for clustered failure times with covariates respectively.

Remark 3. Kosorok et al. (2004) studied the proportional hazards frailty model for univariate survival data. The induced marginal model (after integrating out the frailty) is a linear transformation model in the form of equation (1).

We assume that the cumulative intensity function for N*ikl(t) takes the form

Λk(t|Zikl; bi) = Gk[∫_0^t R*ikl(s) exp{β^T Zikl(s) + bi^T Z̃ikl(s)} dΛk(s)],    (7)

where Gk (k=1,…,K) are analogous to G of Section 2.1, Z̃ikl is a subset of Zikl plus the unit component, bi (i=1,…,n) are independent random vectors with multivariate density function f(b;γ) indexed by a set of parameters γ and Λk(·) (k=1,…,K) are arbitrary increasing functions. Equation (7) is much more general than equation (6) in that it accommodates non-proportional hazards or intensity models and multiple random effects that may not be gamma distributed. It is particularly appealing to allow normal random effects, which, unlike gamma frailty, have unrestricted covariance matrices. In light of the linear transformation model representation, normal random effects are more natural than gamma frailty, even for the proportional hazards model. Computationally, normal distributions are more tractable than others, especially for high dimensional random effects.

Write θ=(β^T,γ^T)^T. Let Cikl, Nikl(·) and Rikl(·) be defined analogously to Ci, Ni(·) and Ri(·) of Section 2.1. Assume that Cikl is independent of N*ikl(·) and bi conditional on Zikl(·) and non-informative about θ and Λk (k=1,…,K). The likelihood for θ and Λk (k=1,…,K) is

∏_{i=1}^n ∫_b [∏_{k=1}^K ∏_{l=1}^{nik} ∏_{t≤τ} λk(t|Zikl; b)^{dNikl(t)} exp{−∫_0^τ Rikl(t) dΛk(t|Zikl; b)}] f(b;γ) db,    (8)

where λk(t|Z; b)=dΛk(t|Z; b)/dt (k=1,…,K).
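The integral over the random effects in expression (8) (and in expressions (9) and (11) below) has no closed form in general. For normal random effects one simple way to evaluate it is Gauss–Hermite quadrature, sketched below for a scalar random effect (our illustration; cond_lik stands for the conditional likelihood contribution of a subject or cluster given b, i.e. the product inside the integral):

```python
import numpy as np

def integrate_normal_random_effect(cond_lik, sigma, n_nodes=20):
    """Approximate the integral of cond_lik(b) against the N(0, sigma^2) density of b
    by Gauss-Hermite quadrature; cond_lik must accept a 1-D array of values of b."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma * nodes                  # change of variables b = sqrt(2) * sigma * x
    return np.sum(weights * cond_lik(b)) / np.sqrt(np.pi)
```

Higher dimensional random effects can be handled by tensor products of such rules, although the computational cost grows quickly with the dimension.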

2.3. Joint models for repeated measures and failure times

Let Yij represent a response variable and Xij a vector of covariates that are observed at time tij, for observation j=1,…,ni on subject i=1,…,n. We formulate these repeated measures through generalized linear mixed models (Diggle et al. (2002), section 7.2). The random effects bi (i=1,…,n) are independent zero-mean random vectors with multivariate density function f(b;γ) indexed by a set of parameters γ. Given bi, the responses Yi1,…,Yini are independent and follow a generalized linear model with density fy(y|Xij;bi). The conditional means satisfy

g{E(Yij | Xij; bi)} = α^T Xij + bi^T X̃ij,

where g is a known link function, α is a set of regression parameters and X̃ij is a subset of Xij.

As in Section 2.1, let N*i(t) denote the number of events which the ith subject has experienced by time t and Zi(·) be a vector of covariates. We allow N*i(·) to take multiple jumps to accommodate recurrent events. If we are interested in adjusting for informative drop-out in the repeated measures analysis, however, N*i(·) will take a single jump at the drop-out time. To account for the correlation between N*i(·) and the Yij, we incorporate the random effects bi into equation (2),

Λ(t|Zi; bi) = G[∫_0^t R*i(s) exp{β^T Zi(s) + (ψ ∘ bi)^T Z̃i(s)} dΛ(s)],

where Z̃i is a subset of Zi plus the unit component, ψ is a vector of unknown constants and v1 ∘ v2 is the componentwise product of two vectors v1 and v2. Typically but not necessarily, Xij=Zi(tij). It is assumed that N*i(·) and the Yij are independent given bi, Zi and Xij.

Write θ=(α^T,β^T,γ^T,ψ^T)^T. Assume that censoring and measurement times are non-informative (Tsiatis and Davidian, 2004). Then the likelihood for θ and Λ(·) can be written as

∏_{i=1}^n ∫_b [∏_{j=1}^{ni} fy(Yij|Xij; b)] [∏_{t≤τ} λ(t|Zi; b)^{dNi(t)}] exp{−∫_0^τ Ri(t) dΛ(t|Zi; b)} f(b;γ) db,    (9)

where λ(t|Z; b)=dΛ(t|Z; b)/dt.

It is customary to use the linear mixed model for continuous repeated measures. The normality that is required by the linear mixed model may not hold. A simple strategy to achieve approximate normality is to apply a parametric transformation to the response variable. It is difficult to find the correct transformation in practice, especially when there are outlying observations. As mentioned in Section 1, our experience in analysing HIV data shows that different transformations (such as logarithmic versus square root) of CD4 cell counts or viral loads often lead to conflicting results. Thus, we propose the semiparametric linear mixed model or random-effects linear transformation model

H̃(Yij) = α^T Xij + bi^T X̃ij + ɛij,    (10)

where H̃(·) is an unknown increasing function and ɛij (i=1,…,n; j=1,…,ni) are independent errors with density function f. If the transformation function H̃ were specified, then equation (10) would reduce to the conventional (parametric) linear mixed model. Leaving the form of H̃ unspecified is in line with the semiparametric feature of the transformation models for event times. There is no intercept in α since it can be absorbed into H̃. The likelihood for θ, Λ and H̃ is

∏_{i=1}^n ∫_b [∏_{j=1}^{ni} f{H̃(Yij) − α^T Xij − b^T X̃ij} H̃′(Yij)] [∏_{t≤τ} λ(t|Zi; b)^{dNi(t)}] exp{−∫_0^τ Ri(t) dΛ(t|Zi; b)} f(b;γ) db,    (11)

where λ(t|Z; b)=dΛ(t|Z; b)/dt and H̃′(y)=dH̃(y)/dy.

3. Maximum likelihood estimation

The likelihood functions that are given in expressions (5), (8), (9) and (11) can all be written in a generic form

Ln(θ, 𝒜) = ∏_{i=1}^n Ψ(𝒪i; θ, 𝒜),    (12)

where 𝒜=(Λ1,…,ΛK), 𝒪i is the observation on the ith subject or cluster and Ψ is a functional of the random process 𝒪i, the infinite dimensional parameter 𝒜 and the d-dimensional parameter θ; expression (11) can be viewed as a special case of expression (8) with K=2, Λ2 playing the role of the transformation H̃, ni1=1 and ni2=ni, where repeated measures correspond to the second type of failure. To obtain the Kiefer–Wolfowitz NPMLEs of θ and 𝒜, we treat 𝒜 as right continuous and replace λk(t) by the jump size of Λk at t, which is denoted by Λk{t}. Under model (2) with G(x)=x, the NPMLEs are identical to the maximum partial likelihood estimator of β and the Breslow estimator of Λ.

The calculation of the NPMLEs is tantamount to maximizing Ln(θ, 𝒜) with respect to θ and the jump sizes of 𝒜 at the observed event times (and also at the observed responses in the case of expression (11)). This maximization can be carried out in many scientific computing packages. For example, the ‘Optimization toolbox’ of MATLAB (Gilat, 2004) contains an algorithm fminunc for unconstrained non-linear optimization. We may choose between large scale and medium scale optimization. The large scale optimization algorithm is a subspace trust region method that is based on the interior reflective Newton algorithm of Coleman and Li (1994, 1996). Each iteration involves the approximate solution of a large linear system by using the technique of preconditioned conjugate gradients. The gradient of the function is required. The Hessian matrix is not required and is estimated numerically when it is not supplied. In our implementation, we normally provide the Hessian matrix, so that the algorithm is faster and more reliable. The medium scale optimization is based on the BFGS quasi-Newton algorithm with a mixed quadratic and cubic line search procedure. This algorithm is also available in Press et al. (1992). MATLAB also contains an algorithm fmincon for constrained non-linear optimization, which is similar to fminunc.

The optimization algorithms do not guarantee a global maximum and may be slow for large sample sizes. Our experience, however, shows that these algorithms perform very well for small and moderate sample sizes provided that the initial values are appropriately chosen. We may use the estimates from the Cox model or a parametric model as the initial values. We may also use some other sensible initial values, such as 0 for the regression parameters and Y for H(Y). To gain more confidence in the estimates, one may try different initial values.
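The following Python sketch shows the same direct-maximization strategy with SciPy's quasi-Newton optimizer standing in for MATLAB's fminunc (a rough analogue rather than the authors' code; it reuses the hypothetical neg_log_likelihood function sketched after expression (5), together with the simple initial values suggested above):

```python
import numpy as np
from scipy.optimize import minimize

def fit_npmle(X, delta, Z, G, dG):
    """NPMLEs of beta and of the jump sizes of Lambda under model (2), by direct optimization."""
    event_times = np.unique(X[delta == 1])            # distinct uncensored failure times
    p = Z.shape[1]
    m = len(event_times)
    # initial values: 0 for the regression parameters and equal jumps for Lambda
    init = np.concatenate([np.zeros(p), np.log(np.full(m, 1.0 / m))])
    res = minimize(neg_log_likelihood, init,
                   args=(X, delta, Z, event_times, G, dG),
                   method="BFGS")                     # quasi-Newton, numerical gradients
    return res.x[:p], event_times, np.exp(res.x[p:]), res
```

Supplying analytical gradients (or the Hessian, as the authors do with fminunc) makes such an algorithm faster and more reliable.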

It is natural to fit random-effects models through the expectation–maximization (EM) algorithm (Dempster et al., 1977), in which the random effects pertain to missing data. The EM algorithm is particularly convenient for the proportional hazards model with random effects because, in the M-step, the estimator of the regression parameter is the root of an estimating function that takes the same form as the partial likelihood score function and the estimator for 𝒜 takes the form of the Breslow estimator; see Nielsen et al. (1992), Klein (1992) and Andersen et al. (1997) for the formulae in the special case of gamma frailty.

For transformation models without random effects, we may use the Laplace transformation to convert the problem into the proportional hazards model with a random effect. Let ξ be a random variable whose density f(ξ) is the inverse Laplace transformation of  exp {−G(t)}, i.e.

exp{−G(t)} = ∫_0^∞ exp(−ξt) f(ξ) dξ.

If

Λ(t|Z; ξ) = ξ ∫_0^t R*(s) exp{β^T Z(s)} dΛ(s),

then

E_ξ[exp{−Λ(t|Z; ξ)}] = exp(−G[∫_0^t R*(s) exp{β^T Z(s)} dΛ(s)]).

Thus, we can turn the estimation of the general transformation model into that of the proportional hazards frailty model. This trick also works for general transformation models with random effects, although then there are two sets of random effects in the likelihood; see Appendix A.1 for details.
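For example (a well known special case, stated here for concreteness), the logarithmic transformations correspond exactly to gamma frailty:

```latex
\[
G(x) = r^{-1}\log(1+rx)
\;\Longrightarrow\;
\exp\{-G(x)\} = (1+rx)^{-1/r} = E\{\exp(-x\xi)\},
\qquad \xi \sim \text{gamma with mean } 1 \text{ and variance } r,
\]
```

so fitting the transformation model with this choice of G is the same as fitting the proportional hazards model with a gamma frailty of variance r, and r=0 (ξ≡1) recovers the Cox model.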

There is another simple and efficient approach. Using either the forward or the backward recursion that is described in Appendix A.2, we can reduce the task of solving equations for θ and all the jump sizes of Λ to that of solving equations for θ and only one of the jump sizes. This procedure is more efficient and more stable than direct optimization.

4. Asymptotic properties

We consider the general likelihood that is given in equation (12). Denote the true values of θ and 𝒜 by θ0 and 𝒜0 and their NPMLEs by θ̂ and 𝒜̂. Under mild regularity conditions, θ̂ is strongly consistent for θ0 and 𝒜̂(·) converges uniformly to 𝒜0(·) with probability 1. In addition, the random element n^{1/2}(θ̂−θ0, 𝒜̂−𝒜0) converges weakly to a zero-mean Gaussian process, and the limiting covariance matrix of n^{1/2}(θ̂−θ0) achieves the semiparametric efficiency bound (Sasieni, 1992; Bickel et al., 1993).

To estimate the variances and covariances of θ̂ and 𝒜̂, we treat equation (12) as a parametric likelihood with θ and the jump sizes of 𝒜 as the parameters and then invert the observed information matrix for all these parameters. This procedure allows us to estimate not only the covariance matrix of θ̂ but also the covariance function for any functional of θ̂ and 𝒜̂. The latter is obtained by the delta method (Andersen et al. (1993), section II.8) and is useful in predicting occurrences of events. A limitation of this approach is that it requires inverting a potentially high dimensional matrix and thus may not work well when there are a large number of observed failure times.

When the interest lies primarily in θ, we can use the profile likelihood method (Murphy and van der Vaart, 2000). Let pln(θ) be the profile log-likelihood function for θ, i.e. pln(θ)=max_𝒜 [log{Ln(θ, 𝒜)}]. Then the (s,t)th element of the inverse covariance matrix of θ̂ can be estimated by

−ɛn^{−2} {pln(θ̂ + ɛn es + ɛn et) − pln(θ̂ + ɛn es) − pln(θ̂ + ɛn et) + pln(θ̂)},

where ɛn is a constant of order n^{−1/2}, and es and et are the sth and tth canonical vectors respectively. The profile likelihood function can easily be calculated through the algorithms that were described in the previous section. Specifically, pln(θ) can be calculated via the EM algorithm by holding θ fixed in both the E-step and the M-step. In this way, the calculation is very fast owing to the explicit expression of the estimator of 𝒜 in the M-step. In the recursive formulae, the profile likelihood function is a natural by-product of the algorithm.
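A Python sketch of this variance estimator is given below (ours; profile_loglik stands for any routine that returns pln(θ), for example the fixed-θ EM iteration just described):

```python
import numpy as np

def profile_information(profile_loglik, theta_hat, n):
    """Estimate the inverse covariance matrix of theta_hat by second differences of the
    profile log-likelihood pl_n, using a perturbation of order n^{-1/2}."""
    d = len(theta_hat)
    eps = 1.0 / np.sqrt(n)
    identity = np.eye(d)
    pl0 = profile_loglik(theta_hat)
    pl_one = [profile_loglik(theta_hat + eps * identity[s]) for s in range(d)]
    info = np.empty((d, d))
    for s in range(d):
        for t in range(d):
            pl_two = profile_loglik(theta_hat + eps * (identity[s] + identity[t]))
            info[s, t] = -(pl_two - pl_one[s] - pl_one[t] + pl0) / eps ** 2
    return info                                       # its inverse estimates the covariance matrix
```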

The regularity conditions are described in Appendix B. There are three sets of conditions. The first set consists of the compactness of the Euclidean parameter space, the boundedness of covariates, the non-emptiness of risk sets and the boundedness of the number of events (i.e. conditions D1–D4 in Appendix B); these are standard assumptions for any survival analysis and are essentially the regularity conditions of Andersen and Gill (1982). The second set of conditions pertains to the transformation function and random effects (i.e. conditions D5 and D6); these conditions hold for all commonly used transformation functions and random-effects distributions. The final set of conditions pertains to the identifiability of parameters (i.e. conditions D7 and D8); these conditions hold for the models and data structures that are considered in this paper provided that the covariates are linearly independent and the distribution of the random effects has a unique parameterization. In short, the regularity conditions hold in all practically important situations.

5. Examples

5.1. Gastrointestinal tumour study

As mentioned previously, Stablein and Koutrouvelis (1985) presented survival data from a clinical trial on locally unresectable gastric cancer. Half of the total 90 patients were assigned to chemotherapy, and the other half to combined chemotherapy and radiotherapy. There were two censored observations in the first treatment arm and six in the second. Under the two-sample proportional hazards model, the log-hazard ratio is estimated at 0.106 with a standard error estimate of 0.223, yielding a p-value of 0.64. This analysis is meaningless in view of the crossing survival curves that are shown in Fig. 1.

We fit models (3) and (4) with G(x)=x and with Z (=Z̃) indicating chemotherapy versus combined therapy by the values 1 versus 0. We use the backward recursive formula of Appendix A.2 to calculate the NPMLEs. Under model (3), β and γ are estimated at 0.317 and −0.530 with standard error estimates of 0.190 and 0.093. Under model (4), the estimates of β and γ become 3.028 and −1.317 with standard error estimates of 0.262 and 0.032. As is evident from Fig. 1, model (4) fits the data better than model (3) and accurately reflects the observed pattern of crossing survival curves.

5.2. Colon cancer study

In the colon cancer study that was mentioned in Section 1, 315, 310 and 304 patients with stage C disease received observation, levamisole alone and levamisole combined with 5-fluorouracil (group Lev+5-FU) respectively. By the end of the study, 155 patients in the observation group, 144 in the levamisole alone group and 103 in the Lev+5-FU group had recurrences of cancer, and there were 114, 109 and 78 deaths in the observation, levamisole alone and Lev+5-FU groups respectively. Lin (1994) fitted separate proportional hazards models to recurrence of cancer and death. That analysis ignored the informative censoring on cancer recurrence and did not explore the joint distribution of the two end points.

Following Lin (1994), we focus on the comparison between the observation and Lev+5-FU groups. We treat recurrence of cancer as the first type of failure and death as the second, and we consider four covariates:

the treatment indicator (Lev+5-FU versus observation) and the indicator variables labelled Surgery, Depth and Node in Table 1, which pertain to the time from surgery to randomization, the depth of tumour invasion and the extent of nodal involvement respectively.

We fit the class of models in equation (7) with a normal random effect and the Box–Cox transformations {(1+x)^ρ−1}/ρ and logarithmic transformations r^{−1} log(1+rx) through the EM algorithm. The log-likelihood functions under these transformations are shown in Fig. 2. The combination of G1(x)=2{(1+x)^{1/2}−1} and G2(x)= log(1+1.45x)/1.45 maximizes the likelihood function. By the Akaike (1985) information criterion, we select this bivariate model. Table 1 presents the results under the model selected and the proportional hazards and proportional odds models. All three models show that the Lev+5-FU treatment is effective in preventing recurrence of cancer and death. The interpretation of treatment effects and the prediction of events depend on which model is used.

Figure 2.

 Log-likelihood functions for pairs of transformations in the colon cancer data: indices below 20 pertain to the Box–Cox transformations with ρ ranging from 1 to 0, whereas indices above 20 pertain to the logarithmic transformations with r ranging from 0 to 2

Table 1.   Estimates of regression parameters and variance component under random-effects transformation models for the colon cancer study†

                      Estimates for the following models:
                      Proportional hazards    Proportional odds    Selected
  Treatment
    Cancer            −1.480 (0.236)          −1.998 (0.352)       −2.265 (0.357)
    Death             −0.721 (0.282)          −0.922 (0.379)       −1.186 (0.422)
  Surgery
    Cancer            −0.689 (0.219)          −0.786 (0.335)       −0.994 (0.297)
    Death             −0.643 (0.258)          −0.837 (0.369)       −1.070 (0.366)
  Depth
    Cancer            2.243 (0.412)           3.012 (0.566)        3.306 (0.497)
    Death             1.937 (0.430)           2.735 (0.630)        3.033 (0.602)
  Node
    Cancer            2.891 (0.236)           4.071 (0.357)        4.309 (0.341)
    Death             3.095 (0.269)           4.376 (0.384)        4.742 (0.389)
  σ_b^2               11.62 (1.22)            24.35 (2.46)         28.61 (3.06)
  Log-likelihood      −2895.1                 −2895.0              −2885.7

†Standard error estimates are shown in parentheses.

We can predict an individual's future events on the basis of his or her event history. The survival probability at time t for a patient with covariate values z and with cancer recurrence at t0 is

∫ exp[−G2{exp(β2^T z + b)Λ2(t)}] exp[−G1{exp(β1^T z + b)Λ1(t0)}] G1′{exp(β1^T z + b)Λ1(t0)} exp(β1^T z + b) dΦ(b/σ_b)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
∫ exp[−G2{exp(β2^T z + b)Λ2(t0)}] exp[−G1{exp(β1^T z + b)Λ1(t0)}] G1′{exp(β1^T z + b)Λ1(t0)} exp(β1^T z + b) dΦ(b/σ_b),

where Φ is the standard normal distribution function and β1 and β2 denote the regression parameters that pertain to recurrence of cancer and death respectively. We estimate this probability by replacing all the unknown parameters with their sample estimators and estimate the standard error by the delta method. An example of this kind of prediction is given in Fig. 3.

Figure 3.

 Estimated survival probabilities of the colon cancer patients with recurrences of cancer at 500 days under the model selected (the blue and green curves pertain to z=(1,1,0,0) and z=(0,0,1,1) respectively): ——, point estimates; -·-·-·, pointwise 95% confidence limits

To test the global null hypothesis of no treatment effect on recurrence of cancer and death, we may impose the condition of a common treatment effect while allowing separate effects for the other covariates. The estimates of the common treatment effects are −1.295, −1.523 and −1.843, with standard error estimates of 0.256, 0.333 and 0.318 under the proportional hazards, proportional odds and selected models. Thus, we would conclude that the Lev+5-FU treatment is highly efficacious.

5.3. Human immunodeficiency virus study

A clinical trial was conducted to evaluate the benefit of switching from zidovudine to didanosine (ddI) for HIV patients who have tolerated zidovudine for at least 16 weeks (Lin and Ying, 2003). A total of 304 patients were randomly assigned to continue the zidovudine therapy, whereas 298 patients were assigned to ddI. The investigators were interested in comparing the CD4 cell counts between the two groups at weeks 8, 16 and 24. A total of 174 zidovudine patients and 147 ddI patients dropped out of the study owing to patient's request, physician's decision, toxicities, death and other reasons.

To adjust for informative drop-out in the analysis of CD4 cell counts, we use a special case of equation (10):

H̃(Yij) = α1 Xi + α2 tij + bi + ɛij,    (13)

where Xi is the indicator for ddI, tij takes the values 8, 16 and 24 weeks, bi is zero-mean normal with variance σ_b^2 and ɛij is standard normal; the drop-out time is modelled as in Section 2.3, with the random effect bi entering its intensity through the coefficient ψ and the ddI indicator Xi through the regression parameter β. Table 2 summarizes the results of this analysis, along with the results based on the log- and square-root transformations. These results indicate that ddI slowed down the decline of CD4 cell counts over time. The analysis that is based on the estimated transformation provides stronger evidence for the ddI effect than those based on the parametric transformations. Model (13) includes the random intercept; additional analysis reveals that the random slope is not significant.

Table 2.   Joint analysis of CD4 cell counts and drop-out time for the HIV study†

             Results for the following transformation functions:
Parameter    Estimated             Logarithmic           Square root
             Est       SE          Est       SE          Est       SE
α1           0.674     0.222       0.506     0.215       0.613     0.261
α2           −0.043    0.005       −0.041    0.005       −0.041    0.004
β            −0.338    0.114       −0.316    0.116       −0.328    0.118
σ_b^2        7.837     0.685       7.421     0.575       8.994     0.772
ψ            −0.158    0.023       −0.132    0.021       −0.154    0.023

†The parameters α1 and α2 represent the effects of ddI and time on CD4 cell counts, and β pertains to the effect of ddI on the time to drop-out. The estimates of α, σ_b^2 and ψ under the log- and square-root transformations are standardized to have unit residual variance. Est and SE denote the parameter estimate and standard error estimate.

Fig. 4 suggests that neither the log- nor the square-root transformation provides a satisfactory approximation to the true transformation. The histograms of the residuals (which are not shown here) reveal that the residual distribution is normal looking under the estimated transformation, is right skewed under the square-root transformation and left skewed under the log-transformation. In addition, the qq-norm plots of the residuals (which are not shown) indicate that the estimated transformation is much more effective in handling the extreme observations than the log- and square-root transformations.

Figure 4.

 Transformation functions for the HIV study (the blue and green curves pertain respectively to the log- and square-root transformation functions subject to affine transformations): ——, estimated transformation function; -·-·-·, corresponding pointwise 95% confidence limits

Without adjustment of informative drop-out, the estimates of α1 and α2 under model (13) shrink drastically to 0.189 and −0.011. The same model is used for CD4 cell counts in the two analyses, but the estimators are severely biased when informative drop-out is not accounted for.

6. Simulation studies

We conducted extensive simulation studies to assess the performance of the inferential and numerical procedures proposed. The first set of studies mimicked the colon cancer study. We generated two types of failures with cumulative hazard functions Gk{ exp (β1kZ1i+β2kZ2i+bi)×Λk(t)} (k=1,2; i=1,…,n), where Z1i and Z2i are independent Bernoulli and uniform [0,1] variables, bi is standard normal, β11=β12=−β21=−β22=1, Λ1(t)=0.3t, Λ2(t)=0.15t2 and G1(x)=G2(x) equals x or  log (1+x). We created censoring times from the uniform [0, 5] distribution and set τ=4, producing approximately 51.3% and 48.5% censoring for k=1 and k=2 under G1(x)=G2(x)=x, and 59.9% and 57.3% under G1(x)=G2(x)= log (1+x). We used the EM algorithm that is described in Appendix A.1 to calculate the NPMLEs.
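As an illustration of how such failure times can be generated (our sketch, for the first type of failure only), one inverts the conditional survival function exp[−G1{exp(β11Z1i+β21Z2i+bi)Λ1(t)}] at a uniform random number:

```python
import numpy as np

rng = np.random.default_rng(12345)

def simulate_first_failure(Z1, Z2, b, beta11=1.0, beta21=-1.0, G_inv=lambda y: y, rate=0.3):
    """Failure times of the first type, with cumulative hazard
    G1{exp(beta11*Z1 + beta21*Z2 + b) * rate * t}, obtained by inverting the
    conditional survival function at U ~ uniform[0, 1]."""
    u = rng.uniform(size=len(Z1))
    eta = np.exp(beta11 * Z1 + beta21 * Z2 + b)
    return G_inv(-np.log(u)) / (rate * eta)

# inverses of the two transformations used in this set of studies
G_inv_identity = lambda y: y            # G(x) = x
G_inv_log = lambda y: np.expm1(y)       # G(x) = log(1 + x)
```

Failure times of the second type are generated in the same way from Λ2(t)=0.15t^2, with an extra square-root step in the inversion; censoring is then applied as described above.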

Table 3 summarizes the results for β11, β21, Λ1(t) and σ_b^2, where σ_b^2 is the variance of the random effect. The results for β12, β22 and Λ2(t) are similar and have been omitted. The estimators of βk appear to be virtually unbiased. There are some biases for the estimator of σ_b^2 and for the estimator of Λk(t) near the right-hand tail, although the biases decrease rapidly with sample size. The variance estimators are fairly accurate, and the confidence intervals have reasonable coverage probabilities.

Table 3.   Simulation results for bivariate failure time data†

                     Results for G1(x)=G2(x)=x            Results for G1(x)=G2(x)=log(1+x)
n     Parameter      Bias     SE      SEE     CP          Bias     SE      SEE     CP
100   β11            −0.014   0.406   0.392   0.942       −0.013   0.516   0.496   0.947
      β21            0.025    0.671   0.664   0.955       0.030    0.849   0.847   0.957
      σ_b^2          −0.089   0.489   0.482   0.965       −0.125   0.604   0.717   0.963
      Λ1(τ/4)        0.030    0.156   0.145   0.945       0.043    0.210   0.190   0.952
      Λ1(3τ/4)       0.073    0.474   0.429   0.952       0.116    0.655   0.571   0.955
200   β11            0.000    0.286   0.277   0.949       0.003    0.395   0.350   0.950
      β21            0.007    0.474   0.468   0.948       0.016    0.599   0.596   0.955
      σ_b^2          −0.037   0.353   0.346   0.961       −0.054   0.468   0.509   0.957
      Λ1(τ/4)        0.014    0.104   0.099   0.944       0.018    0.130   0.126   0.953
      Λ1(3τ/4)       0.032    0.305   0.291   0.948       0.044    0.393   0.375   0.952
400   β11            −0.000   0.207   0.196   0.943       −0.002   0.258   0.247   0.940
      β21            0.009    0.329   0.331   0.952       0.014    0.417   0.420   0.950
      σ_b^2          −0.011   0.251   0.247   0.961       −0.024   0.335   0.362   0.959
      Λ1(τ/4)        0.005    0.070   0.069   0.948       0.008    0.088   0.087   0.954
      Λ1(3τ/4)       0.014    0.210   0.202   0.947       0.020    0.267   0.259   0.950

†Bias and SE are the bias and standard error of the parameter estimator, SEE is the mean of the standard error estimator and CP is the coverage probability of the 95% confidence interval. The confidence intervals for Λ(t) are based on the log-transformation, and the confidence interval for σ_b^2 is based on the Satterthwaite (1946) approximation. Each entry is based on 5000 replicates.

In the second set of studies, we generated recurrent event times from the counting process with cumulative intensity G{Λ(t) exp(β1Z1+β2Z2+b)}, where Z1 is Bernoulli with 0.5 success probability, Z2 is normal with mean Z1 and variance 1, b is normal with mean 0 and variance σ_b^2, Λ(t)=λ log(1+t) and G(x)={(1+x)^ρ−1}/ρ or G(x)= log(1+rx)/r. We generated censoring times from the uniform [2, 6] distribution and set τ to 4. We considered various choices of β1, β2, ρ, r, λ and σ_b^2. We used a combination of the EM algorithm and the backward recursive formula to calculate the NPMLEs. The results are very similar to those of Table 3 and thus have been omitted.

The third set of studies mimicked the HIV study. We generated repeated measures from model (13), in which Xi is Bernoulli with 0.5 success probability and tij=jτ/5 (j=1,…,4). We set H̃(y)= log(y) or H̃(y)= log[{(1+y)^2−1}/2] and let the transformation function be unspecified in the analysis. We generated survival times from the proportional hazards model with conditional hazard function 0.3t exp(βXi+ψbi), and censoring times from the uniform [0, 5] distribution with τ=4. The censoring rate was approximately 53%, and the average number of repeated measures was about 1.58 per subject. We used the optimization algorithm fminunc in MATLAB to obtain the NPMLEs. We penalized the objective function for negative estimates of variance and jump sizes by setting its value to −10^6. The results are similar to those of the first two sets of studies.

The fourth set of studies is the same as the third except that the scalar random effect bi on the right-hand side of equation (13) is replaced by b1i+b2itij. The random effects b1i and b2i enter the survival time model with coefficients ψ1 and ψ2 respectively. We generated (b1i,b2i)^T from the zero-mean normal distribution with variances σ1^2 and σ2^2 and covariance σ12. Table 4 reports the results for α1=1, α2=−β=0.5, ψ1=1, ψ2=0.5 and σ12=−0.4. We again conclude that the asymptotic approximations are sufficiently accurate for practical use.

Table 4.   Simulation results for joint modelling of repeated measures and survival time†

                     Results for H̃(y)=log(y)              Results for H̃(y)=log[{(1+y)^2−1}/2]
n     Parameter      Bias     SE      SEE     CP          Bias     SE      SEE     CP
100   α1             −0.020   0.253   0.248   0.941       −0.017   0.250   0.249   0.943
      α2             −0.011   0.207   0.203   0.946       −0.011   0.208   0.205   0.947
      β              −0.041   0.415   0.416   0.960       −0.047   0.415   0.418   0.959
      σ1^2           −0.063   0.403   0.415   0.963       −0.054   0.400   0.418   0.965
      σ2^2           −0.053   0.568   0.570   0.935       −0.036   0.553   0.580   0.949
      ψ1             0.082    0.453   0.550   0.956       0.084    0.463   0.553   0.969
      ψ2             0.022    0.514   0.602   0.967       0.013    0.501   0.612   0.983
      H̃(y1)         0.015    0.201   0.196   0.947       0.016    0.308   0.302   0.948
      H̃(y2)         0.027    0.730   0.692   0.940       0.082    2.488   2.394   0.939
      Λ(3τ/4)        0.006    0.180   0.172   0.954       0.008    0.180   0.173   0.954
200   α1             −0.013   0.177   0.176   0.948       −0.014   0.177   0.176   0.948
      α2             −0.007   0.145   0.145   0.949       −0.007   0.145   0.145   0.949
      β              −0.028   0.278   0.283   0.960       −0.028   0.279   0.283   0.960
      σ1^2           −0.041   0.297   0.301   0.967       −0.042   0.297   0.301   0.966
      σ2^2           −0.047   0.411   0.411   0.958       −0.048   0.412   0.411   0.957
      ψ1             0.053    0.322   0.351   0.969       0.053    0.322   0.341   0.968
      ψ2             0.014    0.351   0.366   0.979       0.014    0.351   0.366   0.979
      H̃(y1)         0.009    0.140   0.138   0.950       0.008    0.215   0.212   0.947
      H̃(y2)         0.012    0.493   0.485   0.950       0.022    1.696   1.651   0.943
      Λ(3τ/4)        0.002    0.122   0.118   0.950       0.002    0.122   0.118   0.950

†Bias and SE are the bias and standard error of the parameter estimator, SEE is the mean of the standard error estimator and CP is the coverage probability of the 95% confidence interval. H̃(y1) and H̃(y2) denote the estimated transformation at two fixed values of the response. The confidence intervals for Λ(t) are based on the log-transformation, and the confidence intervals for σ1^2 and σ2^2 are based on the Satterthwaite (1946) approximation. Each entry is based on 5000 replicates.

In the first three sets of studies, which involve scalar random effects, it took about 5 s on an IBM BladeCenter HS20 machine to complete one simulation with n=200. In the fourth set of studies, which involves two random effects, it took about 7 min and 35 min to complete one simulation with n=100 and n=200 respectively. In the first three sets of studies, the algorithms failed to converge on very rare occasions with n=100 and always converged with n=200 and n=400. In the fourth set of studies, the algorithm failed on about 0.4% of occasions with n=100 and 0.2% of occasions with n=200.

We conducted additional studies to compare the methods proposed with the existing methods. For the class of models in equation (1), the best existing estimators are those of Chen et al. (2002). We generated survival times with cumulative hazard rate

log{1 + r exp(β1Z1 + β2Z2) Λ(t)}/r,

where Z1 is Bernoulli with 0.5 success probability, Z2 is normal with mean Z1 and unit variance, Λ(t)=3t, β1=−1 and β2=0.2. We simulated exponential censoring times with a hazard rate that was chosen to yield a desired level of censoring under τ=6. Our algorithm always converged, whereas the program that was kindly provided by Z. Jin failed to converge in about 2% of the simulated data sets. For n=100 and 25% censoring, the efficiencies of the estimators of Chen et al. (2002) relative to the NPMLEs are approximately 0.92, 0.83 and 0.69 under r=0.5, 1, 2 respectively, for both β1 and β2. We also compared the estimators proposed with those of Cai et al. (2002) for clustered failure time data and found that the former are much faster to compute and considerably more efficient than the latter; see Zeng et al. (2005) for the specific results under the proportional odds model with normal random effects.

7. Discussion

The present work contributes to three aspects of semiparametric regression models with censored data. First, we present several important extensions of the existing models. Secondly, we develop a general asymptotic theory for the NPMLEs of such models. Thirdly, we provide simple and efficient numerical methods to implement the corresponding inference procedures. We hope that our work will facilitate further development and applications of semiparametric models.

In the transformation models, the function G is regarded as fixed. One may specify a parametric family of functions and then estimate the relevant parameters. This is in a sense what we did in Section 5.2, but we did not account for the extra variation that is due to the estimation of those parameters. It is theoretically possible, although computationally demanding, to account for the extra variation. Whether this kind of variation should be accounted for is debatable (Box and Cox, 1982). Leaving G non-parametric is a challenging topic that is currently being pursued by statisticians and econometricians.

As argued in Sections 1, 2.3 and 5.3, it is desirable to use the semiparametric linear mixed model that is given in equation (10) so that parametric transformation can be avoided. It is surprising that this model has not been proposed earlier. Our simulation results (which are not shown here) reveal that the NPMLEs of the regression parameters and variance components are nearly as efficient as if the true transformation were known. Thus, we recommend that semiparametric linear regression be adopted for both single and repeated measures of continuous response, whether or not there is informative drop-out.

In the joint modelling, repeated measures are assumed to be independent conditional on the random effects. One may incorporate a within-subject autocorrelation structure in the model, as suggested by Henderson et al. (2000) and Xu and Zeger (2001). One may also use joint models for repeated measures of multiple outcomes. The likelihood functions under such extensions can be constructed. The likelihood approach can handle random intermittent missing values, but not non-ignorable missingness.

The asymptotic theory that is described in Appendix B is very general and can be applied to a large spectrum of semiparametric models with censored data. In the existing literature, the asymptotic theory for the NPMLE has been proved case by case only. This kind of proof involves very advanced mathematical arguments. The general theorems that are given in Appendix B enable one to establish the desired asymptotic results for a specific problem by checking a few regularity conditions, which is much easier than proving the results from scratch.

There are some gaps in the theory. First, we have been unable to prove the asymptotics of the NPMLEs for linear transformation models completely when the observations on the response variable are unbounded. This means that the NPMLE for model (10) does not yet have rigorous theoretical justifications, although the desired asymptotic properties are strongly supported by our simulation results. Secondly, there is no proof in the literature for the asymptotic distribution of the likelihood ratio statistic under a semiparametric model when the parameter of interest lies on the boundary of the parameter space. This is a serious deficiency since we might want to test the hypothesis of zero variance in random-effects models. In many parametric cases, the limiting distributions of likelihood ratio statistics are mixtures of χ2-distributions (Self and Liang, 1987). We expect those results to hold for the kind of semiparametric model that is considered in this paper. This conjecture is well supported by our simulation results (e.g. Diao and Lin (2005)), although it remains to be proved.

The counting process martingale theory, which has been the workhorse behind the theoretical development of survival analysis over the last quarter of a century, plays no role in establishing the asymptotic theory for the kind of problem that is considered in this paper, not even for univariate survival data. We have relied heavily on modern empirical process theory, which we believe will be the primary mathematical tool in survival analysis and semiparametric inference more broadly for the foreseeable future.

The EM algorithms that are described in Appendix A.1 are similar to the QEM algorithm of Tsodikov (2003), but the latter is confined to univariate failure time data. Although we have very good experience with them, the convergence rates of such semiparametric EM algorithms have not been investigated in the literature. It is unclear whether the recursive formulae that are given in Appendix A.2 are applicable to time varying covariates. Whether the Laplace transformation idea that is described in Section 3 can be extended to recurrent events is also an open question. Thus, the extent to which the NPMLEs will be generally adopted depends on further advances in numerical algorithms.

It is desirable to choose the ‘best’ model among all possible ones. We used the Akaike information criterion to select the transformations in Section 5.2. A related method is the Bayesian information criterion (Schwarz, 1978). An alternative approach is likelihood-based cross-validation. Another strategy is to formalize the prediction error criterion that was used in Section 5.1. Further research is warranted.

We have demonstrated through three types of problem that the NPMLE is a very general and powerful approach to the analysis of semiparametric regression models with censored data. This approach can be used to study many other problems. We list below some potential areas of research.

7.1. Cure models

In some applications, a proportion of the subjects may be considered cured in that they will not experience the event of interest even after extended follow-up (Farewell, 1982). Peng and Dear (2000) and Sy and Taylor (2000) described EM algorithms for computing the NPMLEs for a mixture cure model that postulates a proportional hazards model for the susceptible individuals, but they did not study their theoretical properties. It is desirable to extend this model by replacing the proportional hazards model with the class of transformation models that is given in equations (2) or (3), to allow non-proportional hazards models and recurrent events. The asymptotic properties are expected to follow from the general theorems of Appendix B, although the conditions need to be verified.

7.2. Joint models for recurrent and terminal events

In many instances, the observation of recurrent events is ended by a terminal event, such as death or drop-out. Shared random-effects models which are similar to those described in Section 2.3 have been proposed to formulate the joint distribution of recurrent and terminal events (e.g. Wang et al. (2001), Liu et al. (2004) and Huang and Wang (2004)). In particular, Liu et al. (2004) incorporated a common gamma frailty into the proportional intensity model for the recurrent events and the proportional hazards model for the terminal event. They developed a Monte Carlo EM algorithm to obtain the NPMLEs but provided no theoretical justifications. One may extend the joint model of Liu et al. (2004) by replacing the proportional hazards or intensity model with the general random-effects transformation models and try to establish the asymptotic properties of the NPMLEs by appealing to the general theorems of Appendix B.

7.3. Missing covariates

Robins et al. (1994) and Nan et al. (2004) obtained the information bounds with missing data. Chen and Little (1999) and Chen (2002) studied the NPMLEs for the proportional hazards model with missing covariates, whereas Scheike and Juul (2004) and Scheike and Martinussen (2004) considered the specific situations in which covariates are missing because of case–cohort or nested case–control sampling (Kalbfleisch and Prentice (2002), page 339). To make the NPMLEs tractable, one normally assumes data missing at random and imposes certain restrictions on the covariate distribution. How general the covariate distribution can be is an open question.

7.4. Genetic studies

Models (7) and (10) can be extended to genetic linkage and association studies on potentially censored non-normal quantitative traits, whereas models (2) and (7) can be adapted to haplotype-based association studies (e.g. Diao and Lin (2005) and Lin and Zeng (2006)); inference on haplotype–disease association, which is a hot topic in genetics, is essentially a missing or mismeasured covariate problem. The analysis of genetic data by the NPMLE is largely uncharted.

There are alternative approaches to the NPMLE. Martingale-based estimating equations were used by Chen et al. (2002) for linear transformation models and by Lu and Ying (2004) for cure models. This approach can also be applied to the general transformation models that are given in equations (2)–(4). The inverse probability of censoring weighting (Robins and Rotnitzky, 1992) approach was used by Cheng et al. (1995) and Cai et al. (2002) for linear transformation models, and by Kalbfleisch and Lawless (1988), Borgan et al. (2000) and Kulich and Lin (2004) for case–cohort studies. These estimators are not asymptotically efficient. The estimating equations are usually solved by Newton–Raphson algorithms, which may not converge. The moment-based estimators are expected to be more robust than the NPMLEs against model misspecification. It would be worthwhile to assess the robustness versus efficiency of the two approaches through simulation studies.

Marginal models (Wei et al. (1989) and Kalbfleisch and Prentice (2002), pages 305–306) are used almost exclusively in the analysis of multivariate failure time data, mainly because of their robustness and available commercial software. Because in general marginal and random-effects models cannot hold simultaneously, there is a debate about which approach is more meaningful. Random-effects models have important advantages. First, they enable us to predict future events on the basis of an individual's event history, as shown in Fig. 3, or to predict a person's survival outcome given the survival times of other members of the same cluster. Secondly, they allow efficient parameter estimation. Thirdly, the dependence structures are of scientific interest in many applications, especially in genetics.

Our work does not cover the accelerated failure time model, which takes the form of equation (1) but with known H and unknown distribution of ɛ (Kalbfleisch and Prentice (2002), pages 218–219). Rank and least squares estimators for this model have been studied extensively over the last three decades; see Kalbfleisch and Prentice (2002), chapter 7. These estimators are not asymptotically efficient. In addition, it is difficult to calculate them or to estimate their variances, although progress has been made on this front (Jin et al., 2003, 2006). We are pursuing a variant of the NPMLE for the accelerated failure time model with potentially time varying covariates, which maximizes a kernel-smoothed profile likelihood function. The estimator is consistent, asymptotically normal and asymptotically efficient with an easily estimated variance, and it works well in real and simulated data.

We have focused on right-censored data. Interval censoring arises when the failure time is only known to fall in some interval. It is much more challenging to apply the NPMLE to interval-censored data than to right-censored data. So far asymptotic theory is only available for proportional hazards models with current status data (Huang, 1996), which arise when the failure time is only known to be less than or greater than a single monitoring time. Wellner and Zhang (2005) studied proportional mean models for panel count data with general interval censoring. We expect considerable theoretical and numerical innovation in this area in the coming years.

We have taken a frequentist approach. Ibrahim et al. (2001) provided an excellent description of Bayesian methods for semiparametric models with censored data. There are many recent references. It would be valuable to develop the Bayesian counterparts of the methods that were presented in this paper.

Much of the theoretical and methodological development in survival analysis over the last three decades has been centred on the proportional hazards model. Because everything that has been written about that model is also relevant to transformation models, opportunities for research abound. Besides the problems that have already been mentioned earlier, it would be worthwhile to develop methods for variable selection, model checking and robust inference (under misspecified models) and to explore the use of these models in the areas of diagnostic medicine, sequential clinical trials, causal inference, multistate processes, spatially correlated failure time data, and so on.

Acknowledgements

We thank David Cox, Jack Cuzick, Vern Farewell, David Oakes, Peter Sasieni, Richard Smith and Bruce Turnbull, the referees and the Research Section Committee for helpful comments and constructive suggestions. This work was supported by the National Institutes of Health.

Discussion on the paper by Zeng and Lin

Robin Henderson (Newcastle University)

Zeng and Lin have provided a beautifully structured paper whose development in some ways mimics the history of event history methodology over the last 35 years: standard survival and proportional hazards; recurrent events and counting processes; clustered events and frailty; joint modelling for longitudinal and event history data. For discussion I shall concentrate mainly on Zeng and Lin's approach for single-event survival analysis, recognizing that their methods go much further.

The paper starts with the statement that the Cox (1972) proportional hazards model is the corner-stone of modern survival analysis. Although this may not be true for reliability, it is indeed so for biostatistics. Even in 1998 Niels Keiding remarked that ‘… explicit excuses are now needed to use different models’ (Keiding, 1998) and very little has changed since then. The proportionality assumption can and should be challenged, but the basic model is so well known and well used that it makes sense to ensure that the standard proportional hazards model is nested within more flexible alternative semiparametric models. In Section 2.1 Zeng and Lin do just that, with equation (4) providing an interesting and potentially very useful extension. This model also incorporates as special cases alternative generalized versions that have been proposed by, for instance, Bagdonavicius and Nikulin (1999), Hsieh (2001) and Bagdonavicius et al. (2004).

The price to be paid for generality, however, is lack of transparency of the role of covariates. It is difficult to look at an expression like equation (4) and to gain any intuitive impression of exactly how a covariate will influence the hazard. To explore, it is convenient to consider perhaps the simplest case of single-event survival times with two groups, described by a time constant binary covariate Z, with Z=0 in the control group and Z=1 in the treatment group. Writing θ= exp (β) and φ= exp (γ), under the Box–Cox transformations the hazards corresponding to model (4) become

image

For exploration we shall assume that λ(t|Z=0)=1. We can obtain this in two ways: either by ρ=1 and λ(t)=1, or by Λ(t)=(ρt+1)^{1/ρ}−1 for any ρ>0. We can take these in turn, starting with ρ=1 and λ(t)=1, which gives

image(14)

Interpretation in this case is straightforward. The initial hazard in the treatment group is determined by the ratio θφ. For φ>1 we have a monotonic increasing hazard (to ∞); for φ<1 monotonic decreasing (to 0). At φ=1 we have a proportional hazards model.
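As a check on the two baseline specifications (a reader's verification, assuming the Box–Cox form G(x) = {(1+x)^ρ − 1}/ρ and a control-group cumulative hazard of the form G{Λ(t)} when Z=0):

% Second route: Λ(t) = (ρt+1)^{1/ρ} − 1 for any ρ > 0
\[
  G\{\Lambda(t)\}
  = \frac{\bigl[1 + (\rho t + 1)^{1/\rho} - 1\bigr]^{\rho} - 1}{\rho}
  = \frac{(\rho t + 1) - 1}{\rho}
  = t ,
\]
% First route: ρ = 1 gives G(x) = x, and λ(t) = 1 gives Λ(t) = t, so again G{Λ(t)} = t.

so in either case the control-group cumulative hazard is t and the control-group hazard is identically 1, as assumed above.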

Now taking Λ(t)=(ρt+1)^{1/ρ}−1 we have

image(15)

The initial value is again determined by θφ but otherwise it is difficult to see the role of the three parameters. The treatment hazard tends to ∞ if φ>1, to 0 if φ<1 and to θρ if φ=1. Formally the choice between expressions (14) and (15) is identifiable from data, given a sufficiently large sample. In practice I suspect that it will be difficult and wonder whether the authors have encountered any identifiability problems in their work. Perhaps in the example above it will not matter as the fitted hazard or survival curves under the alternatives may be very close. And this brings us back to interpretation: if the model parameters are difficult to interpret do the authors advocate inspecting fitted curves? More generally, it would be helpful if Zeng and Lin could comment on what real advantages for single-event survival are provided by equation (4) over generalized additive versions of the proportional hazards model (e.g. Sasieni and Winnett (2003)):

image

or

image

Returning to the two-group illustration, another point is that neither expression (14) nor expression (15) can accommodate converging hazards unless ρ=0, which in many ways should be the default model for time constant covariates. Indeed, in discussion of Aalen and Gjessing (2001), Keiding pointed to a paper which is even more venerable than Cox (1972), namely Tetens (1786), where the author took it as almost axiomatic that hazards converge over time and suggested the hazard ratio model

image

The second family of transformations, G(x)= log (1+rx)/r, can, I believe, describe converging hazards, at least for the non-crossing alternative with γ=0. In the two-group case the hazards are

λ(t | Z) = θ^Z λ(t)/{1 + r θ^Z Λ(t)},   Z = 0, 1,

and more generally, but still with time constant covariates, the assumed survivor function is

S(t | Z) = {1 + r exp(β^T Z) Λ(t)}^{−1/r}.

This is precisely the marginal survivor function for gamma frailty within a proportional hazards model

λ(t | Z, U) = U exp(β^T Z) λ(t),   where the frailty U has a gamma distribution with mean 1 and variance r.
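The stated equivalence can be verified in one line; the following sketch assumes the logarithmic transformation G(x) = log(1+rx)/r, a time-constant covariate vector Z and a gamma frailty U with mean 1 and variance r (so shape and rate both equal to 1/r).

\[
  E\bigl[\exp\{-U\,e^{\beta^{T} Z}\Lambda(t)\}\bigr]
  = \bigl\{1 + r\,e^{\beta^{T} Z}\Lambda(t)\bigr\}^{-1/r}
  = \exp\Bigl[-\tfrac{1}{r}\log\bigl\{1 + r\,e^{\beta^{T} Z}\Lambda(t)\bigr\}\Bigr]
  = \exp\bigl[-G\{e^{\beta^{T} Z}\Lambda(t)\}\bigr],
\]

so the marginal survivor function of the frailty model coincides with the transformation-model survivor function displayed above.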

I wonder whether the authors have experience of estimating r for single-event survival data. My own experience is that there is usually downward bias in modest sample sizes. In the simulations was r fixed at the true value or estimated?

All of the above is for single-event survival with standard independent right censoring. Of course the paper goes much further and perhaps its main advantages with time will prove to be for the more complex situations that are now being more often considered in applications. Direct maximum likelihood estimation of the jumps in Λ has often been considered infeasible (e.g. Tsodikov (2003)) but the authors have shown this not to be so. What is particularly impressive about the paper is the achievement of realistic computing times for joint modelling of longitudinal and event history data. Zeng and Lin quote a computing time of 35 min at n=200 for random intercept and slope models. Similar models, though without transformations, were considered by me and colleagues Peter Diggle and Angela Dobson and could take many days to fit, meaning that it was realistic only to fit one or a small number of models in any application. Quick computation combined with the availability of a maximized log-likelihood means straightforward model comparison and proper statistical practice.

It gives me great pleasure to propose the vote of thanks.

Odd O. Aalen (University of Oslo)

The paper by Zeng and Lin is an interesting extension of the frailty models in survival and event history analysis, complementing and extending previous work which has for instance been summarized by Hougaard (2000). One feels, however, that they might have gone even further in their frailty formulation. In fact, the transformation model in formula (2) is most easily understood as the result of a frailty factor operating on the Cox proportional hazards model. The authors in fact use this indirectly at the end of Section 3 and in the beginning of Appendix A.1, but they could have given this formulation from the outset. If the intensity for an individual is U exp{β^T Z(t)} λ(t), where U is a frailty variable with Laplace transform L(x), then we obtain essentially formula (2) when the distribution of U is integrated out and we put G(x) = −log{L(x)}. Of course, this is only valid when the function G(x) has this kind of representation for some Laplace transform L(x). It is interesting to note that the Box–Cox transformation for 0 ≤ ρ ≤ 1 corresponds to L(x) being the Laplace transform of a power variance family of frailty distributions, which played a central role in Hougaard (2000). For ρ>1, it is not clear that there is any frailty representation at all; remarks at the beginning of Appendix A.1 seem to indicate that there should be one in this case as well, and the authors should explain this. We may, however, also use ρ<0; that would give a compound Poisson frailty distribution (Aalen, 1988, 1992; Moger et al., 2004), which is still a subgroup of the power variance family class. In this case there will be a group of individuals with zero risk, but this just means that we have a cure model, which would be quite reasonable in many cases.
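The construction described here can be written out explicitly; the following is a one-line sketch under the stated assumptions (frailty U with Laplace transform L(x) = E{exp(−xU)} and conditional intensity U exp{β^T Z(t)}λ(t)).

\[
  P(T > t \mid Z)
  = E\Bigl[\exp\Bigl\{-U \int_{0}^{t} e^{\beta^{T} Z(s)}\,d\Lambda(s)\Bigr\}\Bigr]
  = L\Bigl\{\int_{0}^{t} e^{\beta^{T} Z(s)}\,d\Lambda(s)\Bigr\},
\]
\[
  \Lambda(t \mid Z) = -\log P(T > t \mid Z)
  = G\Bigl\{\int_{0}^{t} e^{\beta^{T} Z(s)}\,d\Lambda(s)\Bigr\}
  \quad\text{with}\quad G(x) = -\log\{L(x)\},
\]

which is the transformation structure of formula (2) whenever G admits such a Laplace transform representation.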

A basic idea in the paper is to extend the semiparametric principle in the Cox analysis with a non-parametric Λ(t), and the rest of the model being parametric. Although the authors elegantly handle the technical problems that are connected to this, one could ask whether the semiparametric principle is reasonable in this case. In the original Cox model, the semiparametric idea yielded a very elegant solution in terms of the partial likelihood. However, this simplicity entirely disappears in the more general context here, and one could ask whether the semiparametric idea is carried too far. With a purely parametric version one would avoid many complications and, anyway, parametric models are highly underused in biostatistical survival analysis.

A nice aspect of the approach by Zeng and Lin is the ability to handle crossing curves. There is in survival analysis too much emphasis on the rather dogmatic assumption of proportional hazards, in spite of the fact that we often observe deviations from this assumption. Professional statisticians will be aware that this assumption is largely one of mathematical convenience and will know how to handle deviations, but other people doing proportional hazards regression, like many medical researchers, might not have a clear view of this limitation, and they are not helped by standard statistical software. By the way, crossing of hazard rates could easily be a frailty effect; see for example Aalen (1994).

Another interesting aspect is the general frailty structure for recurrent event models. An alternative to this would be to use a model with dynamic covariates, e.g. according to Fosen et al. (2006).

In Section 7 the authors make a rather sweeping statement concerning the use of martingale methods, saying:

‘The counting process martingale theory … plays no role in establishing the asymptotic theory for the kind of problem that is considered in this paper …. We have relied heavily on modern empirical process theory, which we believe will be the primary mathematical tool in survival analysis and semiparametric inference more broadly for the foreseeable future.’

It is true that the general model in the paper can be handled by empirical processes, and a solution with martingale central limit theorems does not seem to have been worked out, although such a solution might exist. Clearly, if Λ(t) is parametric, martingale theory can be used. It should then be possible to use a sieve approach (Bunea and McKeague, 2005) to approximate the non-parametric Λ(t) and to obtain asymptotic results within the martingale framework.

It should be noted, however, that the main point of using martingale theory in survival analysis is not to achieve the asymptotics but to provide a conceptual underpinning for the statistical approaches. Censoring, for instance, may depend on what has happened previously in the processes. The martingale formulation allows very general assumptions on the censoring mechanisms, which are related to the fundamental martingale concept of optional stopping time.

More generally, the martingale structure is not imposed from the outside but originates in the heart of the processes themselves. This is connected to the ‘French probability school’ which views stochastic processes in terms of how the past influences the future and the present. A major result here is the Doob–Meyer decomposition by which a semimartingale is decomposed into the sum of a compensator and a martingale. It was first proved by Brémaud in 1972 that this result is exactly what is needed to obtain a precise definition of the intensity process (see references in Andersen et al. (1993)). The intensity process is a general concept embedding the hazard rate, and it lies at the very heart of modern survival analysis. Before Brémaud's work the definitions of an intensity process were intuitive and had no mathematical precision.

The classical framework of statistics is the assumption of independence in various forms, and this assumption is also necessary for the empirical process approach of Zeng and Lin. The trouble with independence is that it is immediately destroyed once you apply, for instance, a censoring mechanism that is dependent on all the processes under observation. This is not so for the martingale assumption, which enjoys fundamental invariance properties expressed by preservation under optional stopping and stochastic integration. Also covariates can depend on the past in complex ways. Moreover, some statistically important quantities are martingales when they are properly normalized: the likelihood and the associated score functions, the Kaplan–Meier and Nelson–Aalen estimators, the empirical transition matrix, the two-sample test statistics and many other important quantities. The introduction of ideas from the French probability school to survival analysis has been presented in several text-books, e.g. Andersen et al. (1993) and Martinussen and Scheike (2006). Recent applications to general longitudinal data are given by Martinussen and Scheike (2006), chapter 11, Borgan et al. (2007) and Farewell (2006).

There is also a central limit theory for martingales which nicely complements the aspects that were discussed above. There might be limitations in the application of this in some cases, but that is no justification for advocating a general return to the strait-jacket of the independent and identically distributed data world.

Martingale theory has recently revolutionized the field of mathematical finance, and it is high time that the world of statistics also realizes its usefulness. Martingales should be an integrated part of the curriculum of statistics students.

In spite of my critical comments I consider the paper by Zeng and Lin to be of considerable interest. I thank Zeng and Lin for a challenging paper which is also based on a large and impressive effort to make all the technical details work. It gives me great pleasure to congratulate the authors on their paper and to second the vote of thanks.

The vote of thanks was passed by acclamation.

Daniel Commenges (University of Bordeaux 2)

The authors must be congratulated for proposing a general model encompassing multivariate failure time data, frailty models and joint models, for proving asymptotic results for the non-parametric maximum likelihood estimators (NPMLEs) in this model and for proposing maximization algorithms. I have two comments: one on a drawback of the NPMLE; the other on an alternative algorithm.

Although, as shown by the authors, the NPMLE has the advantage of being more efficient than most other estimators it has the drawback of yielding estimators which are not a priori in the class of admissible estimators for most applications: the NPMLE of the compensator of a counting process makes jumps whereas we would generally expect that, under the true law, the counting process has an absolutely continuous compensator and admits an intensity. From a descriptive point of view this drawback leads to representing the compensator itself or the survival function only, and not risk functions or transition intensities. This drawback leads to another limitation in that it makes likelihood cross-validation unusable: generally the NPMLE estimate of the risk at the time of event of a removed observation is 0 so the cross-validation criterion takes an infinite value. Thus most often the NPMLE is strongly rejected by a likelihood cross-validation criterion. Likelihood cross-validation ‘estimates’, up to a constant, the Kullback–Leibler risk, and the NPMLE is not consistent for the Kullback–Leibler risk. An alternative is to use a penalized likelihood yielding smooth compensators and intensities (O'Sullivan, 1988); the advantage of this approach is that likelihood cross-validation may be used for choosing both the structure of the model and the smoothing coefficient, as proposed in Commenges et al. (2007). However, deriving the asymptotic properties of the resulting estimator is still an open problem in the general case.
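The likelihood cross-validation point above can be seen in a few lines of code. The sketch below is a minimal illustration only, for uncensored data, where the NPMLE of the distribution function is the empirical distribution; the data and function names are hypothetical.

import numpy as np

# Minimal illustration: with no censoring the NPMLE of the distribution is the
# empirical distribution, which puts mass 1/n at each observed time.  Refitting
# it with one observation removed assigns zero mass at the removed event time,
# so that observation's contribution to the cross-validated log-likelihood is
# log(0) = -infinity and the criterion is degenerate.

rng = np.random.default_rng(0)
times = rng.exponential(scale=1.0, size=20)      # toy failure times, no censoring

def npmle_mass(sample, t):
    """Mass that the empirical distribution of `sample` places at the point t."""
    return np.mean(np.isclose(sample, t))

cv_terms = []
for i in range(len(times)):
    loo = np.delete(times, i)                    # refit the NPMLE without observation i
    with np.errstate(divide="ignore"):
        cv_terms.append(np.log(npmle_mass(loo, times[i])))

print(sum(cv_terms))                             # -inf: likelihood cross-validation rejects the NPMLE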

The numerical problem is of course crucial for complex models and the authors investigate several possibilities. I would like to draw attention to an algorithm for maximizing likelihoods, which has already been used by several researchers, and which is based on using the empirical variance of the score in place of the Hessian, thus saving much computation time; I call it the ‘robust variance scoring’ algorithm (Commenges et al., 2006). When the function to maximize is the log-likelihood, this algorithm is superior to the BFGS algorithm, which is not specific to likelihood maximization. Even with frailties this algorithm can be used, because the observed scores can be obtained by numerical integration from the scores of the full problem by using Louis's formulae, as in Hedeker and Gibbons (1994). It would be interesting to try this algorithm on the model that is proposed in this paper.
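The type of update described can be sketched as follows. This is only a schematic illustration on a toy parametric model (exponential hazards with independent censoring), not the algorithm of Commenges et al. (2006) itself; the point is simply that the empirical outer product of the individual score contributions replaces the Hessian in the Newton-type step.

import numpy as np

rng = np.random.default_rng(1)

# Toy data: hazard exp(beta' x_i) for subject i, with independent censoring.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
T = rng.exponential(1.0 / np.exp(X @ beta_true))
C = rng.exponential(2.0, size=n)
time, delta = np.minimum(T, C), (T <= C).astype(float)

def loglik(beta):
    lp = X @ beta
    return np.sum(delta * lp - time * np.exp(lp))

def indiv_scores(beta):
    """Per-subject scores of l_i = delta_i * x_i'beta - time_i * exp(x_i'beta)."""
    resid = delta - time * np.exp(X @ beta)          # martingale-type residuals
    return X * resid[:, None]                        # n x p matrix of score contributions

beta = np.zeros(2)
for _ in range(50):
    S = indiv_scores(beta)
    score = S.sum(axis=0)
    G = S.T @ S                                      # empirical outer product of scores, used in place of the Hessian
    step = np.linalg.solve(G, score)
    t_step = 1.0                                     # simple step halving keeps the log-likelihood increasing
    while loglik(beta + t_step * step) < loglik(beta) and t_step > 1e-4:
        t_step /= 2.0
    beta = beta + t_step * step
    if np.max(np.abs(t_step * step)) < 1e-8:
        break

print(beta)                                          # close to beta_true on this toy example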

Torben Martinussen and Thomas H. Scheike (University of Copenhagen)

We congratulate the authors on this very interesting and impressive paper. The class of semiparametric transformation models is an appealing class as it accommodates the crossing hazards situation. It has, however, the weakness of not being able to describe time varying covariate effects in a direct interpretable way. Time varying effects are easily estimated by using Aalen's additive hazards model (Aalen, 1980)

λ(t | Z) = β0(t) + β1(t)Z1(t) + … + βp(t)Zp(t),

where β0(t), β1(t), …, βp(t) are unspecified regression functions. We would also like to mention the semiparametric additive risk model of McKeague and Sasieni (1994)

λ(t | X, Z) = β0(t) + β(t)^T X(t) + γ^T Z(t),

where some effects are time varying and others are constant (see also Martinussen and Scheike (2006)).

We fitted Aalen's additive hazards model to the gastrointestinal tumour data. The cumulative regression coefficient corresponding to the combined therapy group is depicted in Fig. 5, showing nicely that a time varying effect is indeed the case for these data with a negative effect of the combined treatment in the first 300 days or so, and then apparently an adverse effect thereafter. Another appealing model for these data is the changepoint model (Martinussen and Scheike, 2007)

image

where Z is a scalar covariate, and γ1,γ2 and θ are unknown parameters with the last being the changepoint parameter. For the gastrointestinal tumour data we obtain the estimates inline image and inline image with the estimated cumulative regression function superimposed on the Aalen estimator in Fig. 5 indicating that this model gives a good fit to these data. The Aalen additive hazards model and its corresponding changepoint model are easily formulated in multivariate covariate settings. We wonder whether it is possible, on the basis of model (4) in the present paper, to pinpoint a time varying effect of a specific covariate in a multiple covariate setting with time constant effect of the remaining covariates, say.

Figure 5.

 Gastrointestinal tumour data: effect of combined therapy—Aalen's least squares estimate of B1(t) with 95% pointwise confidence bands (- - - - - -, estimate based on a changepoint model)
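The Aalen least squares fit referred to above is simple to compute. The sketch below is illustrative only (simulated data standing in for the gastrointestinal tumour data, and not the discussants' code); it implements the usual least squares increments for the cumulative regression functions B(t) = ∫ β(s) ds.

import numpy as np

def aalen_additive(time, event, Z):
    """Least squares estimates of the cumulative regression functions in
    Aalen's additive hazards model
        lambda(t | Z) = beta_0(t) + beta_1(t) Z_1(t) + ... + beta_p(t) Z_p(t).
    Returns the ordered event times and the cumulative estimates at those times."""
    n, p = Z.shape
    X = np.column_stack([np.ones(n), Z])             # design matrix with intercept column
    event_times = np.sort(np.unique(time[event == 1]))
    B = np.zeros((len(event_times), p + 1))
    cum = np.zeros(p + 1)
    for k, t in enumerate(event_times):
        at_risk = time >= t                          # risk set just before t
        Xk = X[at_risk]
        dN = ((time == t) & (event == 1))[at_risk].astype(float)
        XtX = Xk.T @ Xk
        if np.linalg.matrix_rank(XtX) < p + 1:       # stop once the design becomes singular
            return event_times[:k], B[:k]
        cum = cum + np.linalg.solve(XtX, Xk.T @ dN)  # least squares increment at time t
        B[k] = cum
    return event_times, B

# Illustrative usage on simulated data with a binary treatment indicator.
rng = np.random.default_rng(2)
n = 300
Z = rng.binomial(1, 0.5, size=(n, 1)).astype(float)
T = rng.exponential(1.0 / (0.5 + 0.7 * Z[:, 0]))     # additive hazard 0.5 + 0.7 Z
C = rng.exponential(2.0, size=n)
time, event = np.minimum(T, C), (T <= C).astype(int)
times, B = aalen_additive(time, event, Z)
print(B[-1])                                         # cumulative baseline and treatment effects

Plotting the second column of B against the returned event times gives a cumulative regression plot of the kind shown in Fig. 5.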

The use of non-parametric maximum likelihood estimators (NPMLEs) and their asymptotic properties in the setting considered are very useful. We wonder how far this can be taken in terms of other classes of models, some of them describing time varying effects of some of the covariates.

The NPMLE behaves sensibly for Aalen's additive hazards model in the situation with only one categorical covariate but seems to break down in the multiple-covariate setting with some of them being continuous. It would also be interesting to clarify whether NPMLEs can be applied to other general models as for example the extended proportional odds model

image

with hazard rate

image

or to a variant of this model, using a first-order approximation of the term exp{Z^T β(t)} (Scheike, 2006).

Kani Chen (Hong Kong University of Science and Technology) and Zhiliang Ying (Columbia University, New York)

The pursuit of efficient estimation is always an important problem for nearly all statistical models, especially those of a semiparametric nature. The computational complexity and theoretical justification are the main hurdles in obtaining efficient estimators in general semiparametric models. However, current computational technology, especially the fast developing statistical and mathematical software and the availability of computing capacity, has drastically reduced the computational workload and time. It has made many statistical problems that were previously considered intractable readily solvable. Professor Zeng and Professor Lin have made the important contribution of computing and justifying efficient estimation for a broad class of semiparametric models with censored data. We congratulate them for such an important development.

Their method seems to be far reaching and has many potential applications and extensions. Heuristically, this method should work if the infinite dimensional parameter, which is typically a function, can be properly discretized so that support of the likelihood function is on a finite dimensional space. The maximization procedure can then take advantage of the available computational algorithms. The idea may date back to the discovery of the empirical distribution as the non-parametric maximum likelihood estimator of the cumulative distribution function. We use the following examples to illustrate further extensions.

Consider the transformation model H(Ti)=−βZi+ɛi for doubly censored data. Let (Li,Ui) be the censoring variables, and the observations are (Yi,Zi,δi), where Yi=Ti and δi=1 if Ti ∈ [Li,Ui], Yi=Ui and δi=2 if Ti>Ui, and Yi=Li and δi=3 if Ti<Li. The likelihood is

∏i [λ{H(Yi)+βZi} dH(Yi) exp(−Λ{H(Yi)+βZi})]^{I(δi=1)} [exp(−Λ{H(Yi)+βZi})]^{I(δi=2)} [1 − exp(−Λ{H(Yi)+βZi})]^{I(δi=3)},

where λ(·) and Λ(·) are respectively the known hazard and cumulative hazard functions of ɛi. If H(·) is restricted to step functions with jumps only at the uncensored observations, the likelihood function can be maximized. The maximizers of the likelihood, namely the estimators of β and H, are consistent and asymptotically normal under suitable conditions.

In this case, there is no backward or forward recursive algorithm for computation, unlike those presented in Chen et al. (2002) and this paper for the right-censored data. However, using the function fmincon in MATLAB works reasonably well in our simulation studies. For example, we simulated the transformation model with one covariate and with ɛ following one of the Pareto family of distributions with r=0,0.5,1. Note that r=0 and r=1 correspond to the proportional hazards model and the proportional odds model respectively. With censoring percentages that are well over 50% and sample sizes at 100 and 200, we find that the resulting point and variance estimators are virtually unbiased and confidence intervals have coverage probabilities that are close to their nominal levels.
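A Python analogue of this strategy (restricting the infinite dimensional parameter to a step function and handing the maximization to a general purpose optimizer, here scipy in place of MATLAB's fmincon) might look as follows. It is a schematic sketch only, written for a scalar covariate in the equivalent parameterization Λ = exp(H), with the ɛ-distribution taken from the Pareto family mentioned above; all names are illustrative and this is not the discussants' code.

import numpy as np
from scipy.optimize import minimize

# Schematic NPMLE for the transformation model H(T) = -beta*Z + eps with doubly
# censored data, written in the equivalent form
#     P(T > t | Z) = exp[-G{exp(beta*Z) * Lam(t)}],   G(x) = log(1 + r*x)/r  (r > 0),
# where Lam = exp(H) is restricted to a step function with jumps at the
# uncensored observation times (r = 0 and r = 1 give proportional hazards and
# proportional odds respectively; the r = 0 case is handled by G(x) = x).

def G(x, r):
    return np.log1p(r * x) / r if r > 0 else x

def dG(x, r):
    return 1.0 / (1.0 + r * x) if r > 0 else np.ones_like(x)

def neg_loglik(par, Y, Z, delta, jump_times, r):
    beta, jumps = par[0], np.exp(par[1:])                 # positive jump sizes of Lam
    Lam = np.array([jumps[jump_times <= y].sum() for y in Y])
    x = np.exp(beta * Z) * Lam
    obs, rc, lc = delta == 1, delta == 2, delta == 3      # exact, right censored, left censored
    dLam = np.array([jumps[np.isclose(jump_times, y)].sum() for y in Y[obs]])
    ll = np.sum(np.log(dG(x[obs], r) * np.exp(beta * Z[obs]) * dLam) - G(x[obs], r))
    ll += np.sum(-G(x[rc], r))
    ll += np.sum(np.log(1.0 - np.exp(-G(x[lc], r)) + 1e-12))   # guard against log(0) when Lam = 0
    return -ll

def fit(Y, Z, delta, r):
    """Maximize the discretized likelihood over beta and the log jump sizes."""
    jump_times = np.sort(Y[delta == 1])
    par0 = np.concatenate([[0.0], np.full(len(jump_times), np.log(1.0 / len(Y)))])
    res = minimize(neg_loglik, par0, args=(Y, Z, delta, jump_times, r),
                   method="L-BFGS-B")
    return res.x[0], res

A call such as fit(Y, Z, delta, r=1) would return an estimate of β under the proportional odds specification; variance estimates could then be obtained from a profile likelihood or a numerical Hessian, in the spirit of the paper.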

Another type of data is the left-truncated data (Ti,Ci,Zi) where Ci is the truncation variable. The likelihood is

∏i λ{H(Ti)+βZi} dH(Ti) exp(−Λ{H(Ti)+βZi}) / exp(−Λ{H(Ci)+βZi}).

The maximization procedure is similar to that for right-censored data. Certain technical conditions on the distribution of the truncation variable near 0 will be required to ensure proper large sample behaviour of the maximizers.

Alex Tsodikov (University of Michigan, Ann Arbor)

I congratulate the authors on this interesting and general paper. I have a few comments on models and the justification for EM algorithms resulting from the quasi-expectation–maximization (QEM) approach of Tsodikov (2003).

Models can be constructed by using a transform γ(x) = QE(x^U), where QE is an operator that is defined so that γ behaves like a probability-generating function E(x^U) of a random variable U up to its first k moments. The resulting ‘artificial’ mixed model lends itself naturally to EM-like QEM algorithms. The E-step is represented by using derivatives of the transform γ, much as moments of a random variable are represented by using derivatives of its probability-generating function. When the ith derivatives satisfy (−1)^i dγ^{(i−1)}{exp(−s)}/ds > 0, i=1,…,k, the resultant algorithm is monotonic in likelihood and QE satisfies the Jensen inequality. The matrix speed of convergence of the algorithm is determined by the fraction of ‘missing information’, where IC and IM are the ‘complete-data’ and ‘missing data’ information matrices respectively, which are also expressed through derivatives of the transform.

The authors used an ‘artificial’ EM construction in their paper to deal with the transformation G that was introduced in equation (2) by suggesting that exp(−G) be a Laplace transform, and later expressing the E-step by using derivatives of G. It can be shown by using the Bernstein theorem (Feller, 1991) that exp(−G) is a Laplace transform if and only if the derivatives of G satisfy (−1)^{i+1} G^{(i)} > 0, i=1,2,…,∞. Linking this to QEM, we have γ(x) = exp[−G{−log(x)}]. As a QEM the algorithm is valid under the weaker condition of G′>0, G′′<0 and G′′ being an increasing function. The logarithmic transform satisfies the condition for any r ≥ 0, whereas it is necessary that 0 ≤ ρ < 1 for the Box–Cox transform. Potentially, G may be represented by a function with discontinuous high order derivatives (splines), in which case exp(−G) will not be a Laplace transform. However, the algorithm will still converge provided that the weak QEM condition is satisfied. Also, it can be shown that a composition of G-based QEM and an EM dealing with the random effects b will preserve the necessary conditions for monotonic convergence of likelihood values.
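As a concrete check of these conditions for the logarithmic transform G(x) = log(1+rx)/r with r > 0 (a reader's verification, not part of the contribution):

\[
  G^{(i)}(x) = (-1)^{\,i-1}(i-1)!\,\frac{r^{\,i-1}}{(1+rx)^{i}},
  \qquad
  (-1)^{\,i+1} G^{(i)}(x) = (i-1)!\,\frac{r^{\,i-1}}{(1+rx)^{i}} > 0,
  \qquad i = 1, 2, \ldots,
\]

so exp(−G) is a Laplace transform (that of a gamma frailty) for every r > 0, and in particular G′ > 0, G′′ < 0 and G′′ is increasing; the case r = 0, G(x) = x, corresponds to a degenerate frailty.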

Thomas H. Scheike and Torben Martinussen (University of Copenhagen)

We are very pleased that the authors take up the theme of random-effects models for survival data where there is clearly much more to be done. One basic problem with the standard shared frailty model

image

where Zi is a random effect that is gamma distributed with mean 1 and variance θ^{−1}, is that we can identify all parameters solely on the basis of univariate survival data. Therefore the variance parameter cannot be interpreted as reflecting only correlation; it will also reflect lack of fit of the model. Even though this may not be a big problem for multivariate data, it is difficult to know, and it is clearly a problem with the model. We believe that the two-stage procedure with marginals of a specific form provides a practical solution to this identifiability problem, and we may also use non-parametric maximum likelihood estimation techniques for this model.
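The identifiability point can be seen from the implied single-observation margin. The sketch below assumes the shared frailty model λij(t | Zi, Xij) = Zi λ0(t) exp(β^T Xij) with Zi gamma distributed with mean 1 and variance θ^{−1}; the covariate notation Xij is ours.

\[
  P(T_{ij} > t \mid X_{ij})
  = E\bigl[\exp\{-Z_i\,\Lambda_0(t)\,e^{\beta^{T} X_{ij}}\}\bigr]
  = \bigl\{1 + \theta^{-1}\Lambda_0(t)\,e^{\beta^{T} X_{ij}}\bigr\}^{-\theta},
\]

so θ enters the marginal distribution of a single failure time as a transformation-type parameter and is identified even with one member per cluster, which is why it also absorbs lack of fit of the conditional proportional hazards form.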

The authors seem to prefer normal random effects but we see no reason why these should be preferred. It has been shown that different random effects lead to different types of dependence and it will vary from case to case which random-effect distribution leads to the best description of the correlation (Hougaard, 2000).

We have considered the colon cancer data in the case of the shared frailty model that is contained in the class of models that was considered by the authors but with a gamma-distributed frailty for simplicity. The colon study is clearly asymmetric since death will censor cancer whereas the opposite is not true; therefore any correlation should be identified on the basis of how the occurrence of cancer changes the death-rates. The frailty model considered may be written as

image
image

for the ith patient where c is for cancer and d is for death, and Zi is gamma distributed with mean 1 and variance θ^{−1}. One special feature of the asymmetry is that the at-risk indicator Yic(t) is 0 when a patient has died but Yid(t) is equal to 1 if the patient is still alive even if the patient has experienced cancer. The intensities with respect to the observed history are

image
image

For the colon data there are no deaths for subjects who did not experience cancer and so all the deaths are for subjects who did experience cancer, thus indicating that a frailty-type model is not well suited to fit these data. Alternatively one may consider intensity models where one directly models the effect of cancer on the death-rate by conditioning on the timing of this event.

Per Kragh Andersen (University of Copenhagen)

Professor Zeng and Professor Lin have presented a very impressive paper covering a broad range of models for univariate and multivariate survival data, and repeated events and joint models for longitudinal data and survival data. They could give a unified approach to maximum likelihood estimation in such models, including asymptotic theory, based on results for empirical processes. I would like to address two aspects of the likelihood derivation: one deals with types of observational patterns; the other with types of time-dependent covariates.

Though ‘general censoring and truncation patterns’ are, indeed, mentioned when presenting the likelihood (5) it seems as if later likelihoods in random-effects models will only be applicable for independent and non-informative (in the sense of Kalbfleisch and Prentice (2002), chapter 6, and Andersen et al. (1993), chapter III) right censoring. Random left truncation, for example, would require non-identical frailty distributions as the distribution of bi must be evaluated conditionally on survival beyond the left truncation time. It is not clear whether the theorems as stated in Appendix A cover this situation.

The likelihoods (5) and (8) are only full likelihoods if the time-dependent covariates fulfil certain simplifying assumptions. This could be that they are either deterministic, ancillary or adapted to the history that is generated by the failure time counting process (Kalbfleisch and Prentice (2002), chapter 6, and Andersen et al. (1993), chapter III). For internal time-dependent covariates likelihoods (5) and (8) are only partial likelihoods. In Section 2.3 the resulting likelihoods (9) and (11) may be full likelihoods since a joint model is available for the failure time data and the internal time-dependent covariate. Does the type of likelihood have consequences for the asymptotic results that are derived in Appendix B and for the use of the EM algorithm or do these results only rely on the shape of the likelihood (see equation (12))?

In Section 2.3 covariates X are introduced in the model for response variables Y and it is stated that ‘Typically, but not necessarily Xij=Z(tij)’. Consider this situation for the simple Cox-type version of equation (2) with a single time-dependent covariate and random effects:

image

Consider also the simple linear version of model (10):

image

To compute the likelihoods (5), (8), (9) and (11), values of Zi(t) must be observed for all t or, at least, for all observed event times. In contrast, the response variables Y are only observed at certain (‘non-informative’) measurement times tij, i=1,…,n, j=1,…,ni. One situation in which it is plausible to observe Zi(t) continuously but Y only at tij arises when Zi(t) is deterministic (see Tsiatis and Davidian (2004)), but, if Zi(t), more generally, is a random process, it is not clear whether such observations arise in practice.

V. T. Farewell and B. D. M. Tom (Medical Research Council Biostatistics Unit, Cambridge)

We commend the authors on an impressive piece of work. The class of semiparametric regression models proposed provides a comprehensive framework for the identification of relationships between occurrence of events and potential explanatory variables.

As a means for the examination of the goodness of fit of regression models, and perhaps as the basis of exploratory investigation of relationships along with graphical and tabular procedures, the class is of particular interest. We would like to ask, however, whether, for the reporting of relationships between an outcome of interest and explanatory variables, some caution might be wise. Potential issues relate to interpretability, overfitting and reproducibility.

For the gastric cancer example, the authors’ Fig. 1 displays fits from their two heteroscedastic versions of the linear transformation model, and the claim is made that the second version fits better. From an applied perspective, we wonder how important the better fit is in this example. Additionally, we present a figure here that results from fitting a Cox regression model with two explanatory variables, treatment and a log(time)×treatment ‘interaction’ (Fig. 6). This time-dependent Cox model, we infer, would be a special case of their class of models. The interaction variable is highly significant and the fit is comparable with those provided in the paper, particularly that of the authors’ model (3). We wonder therefore whether this approach to a regression model might provide a suitable representation of the treatment effect for empirical modelling. Would the nature of the treatment effect be conveyed more usefully, in terms of it varying with a simple function of time, than in terms of the β- and γ-parameters of the models proposed?

Figure 6.

 Kaplan–Meier and time-dependent Cox model estimates of the survival functions for the gastric cancer data set: ——, Kaplan–Meier, chemotherapy; inline image, Kaplan–Meier, combined therapy; - - - - - -, time-dependent Cox model, chemotherapy; inline image, time-dependent Cox model, combined therapy
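For reference, a fit of the kind described above (treatment plus a log(time)×treatment ‘interaction’) could be set up along the following lines. This is a schematic sketch with simulated stand-in data, hypothetical column names and the Python lifelines package assumed; it is not the analysis behind Fig. 6.

import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

def split_at_event_times(time, event, treat):
    """Episode-split each subject at the distinct event times so that the
    covariate treat * log(t) can be evaluated in every risk set."""
    cuts = np.sort(np.unique(time[event == 1]))
    rows = []
    for i, (t_i, d_i, z_i) in enumerate(zip(time, event, treat)):
        grid = np.concatenate([[0.0], cuts[cuts < t_i], [t_i]])
        for start, stop in zip(grid[:-1], grid[1:]):
            rows.append({"id": i, "start": start, "stop": stop,
                         "treat": z_i,
                         "treat_logt": z_i * np.log(stop),   # log(time) x treatment term
                         "event": int(bool(d_i) and np.isclose(stop, t_i))})
    return pd.DataFrame(rows)

# Simulated stand-in data (illustrative only).
rng = np.random.default_rng(3)
n = 90
treat = rng.binomial(1, 0.5, size=n)
T = rng.exponential(np.where(treat == 1, 1.5, 1.0))
C = rng.uniform(0.5, 3.0, size=n)
time, event = np.minimum(T, C), (T <= C).astype(int)

long_df = split_at_event_times(time, event, treat)
ctv = CoxTimeVaryingFitter()
ctv.fit(long_df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()        # coefficients for treat and the treat x log(time) interaction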

With respect to overfitting and reproducibility, we wonder whether there is the potential that their models are sensitive to aspects of a particular data set, which might be absent in other studies of the same question. In particular, one might ask whether, for their models (3) and (4), the authors would respond differently to variation in β-estimates than to variation in γ-estimates across studies.

In addition, the simulation studies in the paper are restricted to data that were drawn from their proposed class of models. The behaviour of this class in other situations (e.g. data drawn from Aalen's additive hazards model) might be of interest.

Again, we congratulate the authors and ask these questions to understand better how the approach that they have developed can be incorporated into the current body of techniques that are used for comparable problems.

D. R. Cox (Nuffield College, Oxford)

It is a pleasure to have the chance of congratulating the authors on an interesting and valuable paper. It raises many points: some detailed and some general.

That fine Mexican statistician, the late Francisco Aranda-Ordaz, was the first to study estimated transformations in this context, although in a way that was different from that used in the present paper (Aranda-Ordaz, 1983). The simplest way to test for and to represent non-proportionality of hazards is often through a manufactured time-dependent variable (Cox, 1972; Grambsch and Therneau, 1994). Table 2 illustrates that, although regression coefficients in different models have different interpretations, ratios of regression coefficients are relatively stable. There is a simple invariance-based qualitative explanation of this; for a theoretical treatment in the present context, see Solomon (1984) and Struthers and Kalbfleisch (1986).

Broader aspects are: why the hazard, and are there special reasons for proportionality? Failure is a stochastic process, Markovian if properly described, and stochastic processes are usually best specified by their transition probabilities, in this case the hazard, which is the complete intensity function of a point process. Note, though, that, although the accelerated life model, which is very natural in some physical contexts with a single failure mode, can be specified via the hazard, that is not the most direct definition. Proportional odds and cumulative hazards seem much less natural as a base for initial specification.

The contrast between proportional and additive measures pervades epidemiology. Proportionality evades positivity constraints, specifies effects essentially in dimensionless form and possibly gives more stability in estimated effects. Yet quite clearly there are no all-persuasive arguments for proportionality.

More broadly still, the paper illustrates a wide-ranging tension in statistical development. If models are compact essentially descriptive representations of patterns of variability, the move to ever more general families of models is a very welcome and fruitful one. If the objective is in part to probe the data-generating process and to provide simple more incisive interpretations, more specificity in model specification may often be preferable (Cox, 1972) and this may or may not lie naturally within some prescribed general setting.

Ørnulf Borgan (University of Oslo)

I congratulate the authors on an impressive paper that will have an influence on the further development of survival and event history analysis.

In Section 1 the authors praise the ‘ingenious partial likelihood principle’ and the ‘elegant counting process martingale theory’. However, a main message is that inference for survival and event history models should be based on non-parametric maximum likelihood estimation and that empirical processes are the preferred mathematical tool. I shall advocate a more pragmatic attitude: we should adopt the inference methods and mathematical tools that are most convenient, seen both from a theoretical and a practical perspective. I shall use my experience with cohort sampling methods to underpin my point of view.

There are two classes of cohort sampling methods: the nested case–control and case–cohort designs. These designs are only briefly mentioned in the paper but, as they are likely to gain importance, methods for survival and event history data should be able to address the methodological problems of cohort sampling methods.

In a nested case–control study, a few controls are sampled from those at risk at the failure times. A joint model for the occurrences of failures and the sampling of controls may be formulated and studied by using counting processes and martingales (Borgan et al., 1995). By allowing the sampling probabilities to depend on covariates for all individuals at risk (thus leaving an independent and identically distributed data set-up), the framework makes it possible to tailor the control sampling to the specific design and analysis needs of a study (e.g. Langholz (2007)). Inference may be based on a Cox-type partial likelihood and easily performed by using standard software. For simple nested case–control designs, alternative inference methods have been suggested (e.g. Scheike and Juul (2004) and Samuelsen et al. (2007)). However, in my opinion, the modest gains in efficiency that are obtained by these methods (for simple nested case–control designs) do not outweigh the practical complications in fitting the models and the loss of flexibility in tailoring the control sampling to the study needs.
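For concreteness, in the simple nested case–control design the Cox-type partial likelihood referred to here takes the familiar form below, with the sampled risk set R̃j consisting of the case failing at tj together with its sampled controls; covariate-dependent sampling leads to a weighted version of the same expression.

\[
  L(\beta) = \prod_{j}
    \frac{\exp\{\beta^{T} Z_{j}(t_j)\}}
         {\sum_{k \in \tilde{R}_j} \exp\{\beta^{T} Z_{k}(t_j)\}} .
\]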

In a case–cohort study, a subcohort is sampled at the outset of the study by simple or stratified sampling, and the at-risk individuals in the subcohort are used as controls at all failure times. For case–cohort designs the partial likelihood degenerates, and estimation is usually based on a pseudolikelihood. Counting processes and martingales are of no help in studying the properties of the estimators, which must be done by using empirical processes for finite population sampling (e.g. Breslow and Wellner (2007)).

P. M. Philipson and W. K. Ho (Newcastle University)

The paper builds on oft-used models in an interesting and widely applicable manner and, as such, the authors are to be commended. Our comments concern the survival models of Section 2.1.

We feel that there is scope for a link between the models that are routinely used at present and those that are proposed in this paper. Given the overwhelming popularity of the Cox proportional hazards model, a useful intermediate step would be to ascertain whether there is sufficient evidence to warrant the use of transformation models, or the extension to the crossing hazards model. Score tests, requiring only the trivial fitting of a proportional hazards model, could be used for such a purpose.

Consider the class of Box–Cox transformations. If we assume that only time invariant covariates are present and that β and Λ0 are known, then, appealing to martingale theory, we obtain

image

as the score for the transformation parameter ρ under the null hypothesis (i.e. ρ=ρ0=1), where Mi is the usual counting process martingale and τ is the duration of study. The predictable variation under the null hypothesis is

image

Focusing on the model given by equation (4) of the paper, ignoring any transformations for now, allows investigation of the crossing hazards case. In an analogous fashion, the score for γ (here assumed for convenience to be scalar), under the null hypothesis, can be expressed as

image

with associated variance

image

This idea is at an embryonic stage; clearly adjustments will need to be made to accommodate estimation of β and Λ0. It is hoped that fully developed tests would dovetail with the novel models that have been put forward in this paper and provide clarity for the statistician.

The two cases of transformation and crossing hazards have been considered separately here. The similarity of the above expressions for ρ and γ leads us to ponder the effect of fitting models when both cases are present. Are the parameters suitably disentangled so that estimation remains robust? Do the authors have any insight to offer, or any reflections on the fitting of such models?

John A. Nelder (Imperial College London)

Those of us who have been working on h-likelihood methods (Lee et al., 2006) are naturally disappointed to see no reference to them in this paper. Professor Lee and Professor Ha will give detailed comments on the use of h-likelihood for fitting the model class that is described in the paper. I shall just say that more general frailty models, allowing structured dispersion, can be expressed in the form of hierarchical generalized linear models after a suitable arrangement of the data matrix (Noh et al., 2006). Advantages of this approach in fitting the models are the following.

  • (a)Quadrature is not required.
  • (b)The EM algorithm, that lumbering giant of an algorithm, is not required.
  • (c)The bias in the frailty coefficient that is caused by the large number of parameters is no longer a problem.
  • (d)Standard software (in Genstat) is available to fit these models efficiently.

I commend this approach to the authors.

J. L. Hutton (University of Warwick, Coventry)

I thank the authors for a thought-provoking paper. I make three simple remarks as a medical statistician.

Accelerated life models are useful in medical applications. In my work on cerebral palsy and epilepsy, accelerated life models have been more useful than proportional hazards models (Hutton and Pharoah, 2002; Hutton and Monaghan, 2002; Cowling et al., 2006). Professor Henderson mentioned that hazards often converge over time and suggested that proportional hazards models needed to incorporate time-dependent elements to allow for this. Accelerated life models with a log–logistic or gamma base-line have converging hazard ratios. With a log–logistic base-line, the hazard ratio converges to 1.
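The final claim is easily verified. The sketch below assumes a log–logistic base-line with shape k and scale α and an acceleration factor ψ = exp(−β^T Z), so that h(t | Z) = ψ h0(ψ t):

\[
  h_0(t) = \frac{(k/\alpha)(t/\alpha)^{k-1}}{1 + (t/\alpha)^{k}},
  \qquad
  \frac{h(t \mid Z)}{h_0(t)}
  = \psi^{k}\,\frac{1 + (t/\alpha)^{k}}{1 + (\psi t/\alpha)^{k}}
  \;\longrightarrow\; 1
  \quad\text{as } t \to \infty ,
\]

so the hazard ratio indeed converges to 1 under the log–logistic accelerated life model.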

Hazard functions should not cross. My response to the data on gastric cancer patients is to ask whether an oncologist or pharmacologist could suggest covariates which would distinguish those patients who die early on from those with a better prognosis. Accelerated life models with Pareto distributions were effective in allowing us to understand the responses to antiepileptic drugs (Cowling et al., 2007).

The following contributions were received in writing after the meeting.

Peter J. Bickel (University of California, Berkeley)

I enjoyed the authors’ presentation of a general toolbox of models for censored survival data. Their paper raised an old philosophical issue for me.

The philosophical point has to do with the failure to account for the variability of the transformation parameter estimates in the confidence bands for the treatment effect parameter that they represent. This was the subject of Bickel and Doksum (1981) and a subsequent discussion of a paper of Hinkley and Runger (1984). As I agreed, stating confidence bounds on an effect on an unknown scale is problematic, but so is underestimating variability. The simple solution is to announce only simultaneous confidence limits for the transformation parameters and the effects, permitting the reader to interpret the transformation effect on their selected scale.

Here are a few more comments.

  • (a)The authors point out that non-parametric maximum likelihood estimators are intractable computationally, leading to ad hoc estimates. The ad hoc estimates may be the most reasonable starting-point for their optimization procedure. If it involves Newton–Raphson iteration, a single step from an o(n^{1/4})-consistent estimate will suffice for efficiency, as remarked by Le Cam and extended in Bickel (1975).
  • (b)I want to stress a point that was alluded to by the authors in their data analysis. If, as expected, the model is a crude approximation to the mechanism producing the data, the interpretability of quantities that are estimated is important. In the authors’ models the form of G and ɛ must be predetermined. What they estimate are distributions that are closest to the truth in Kullback–Leibler distance—but what this means for parameters of interest is not necessarily clear.
  • (c)The foregoing suggests that making G and even the distribution of ɛ non-parametric is worthwhile if only to see what effect this has on the parameters of interest. As the authors point out, not specifying the distribution of ɛ, in the models of Section 2.1, makes the parameter of interest β unidentifiable—but only in a weak sense. We can still identify the components of β up to a common scale factor, so relative magnitudes of effects can be measured. Econometricians have studied this problem in simpler contexts as discussed in Horowitz (1998).
  • (d)Finally, here is a note of caution. As the complexity of the model increases, so does the number of parameters explicit or implicit as in the estimated H. I believe that one needs to think about imposing sparsity on one's models.

N. E. Breslow (University of Washington, Seattle)

I congratulate Zeng and Lin for their construction of a general model that nicely brings together much previous work, their development of asymptotic theory that justifies inference using profile likelihood and their description of innovative computational approaches. They provide little guidance, however, regarding parameter interpretation. One important benefit of semiparametric models is that quantities that are of key scientific interest may be summarized parametrically whereas nuisance factors are treated non-parametrically. Thus the Cox model focuses attention on the well-understood hazard ratio. Marginal mean and marginally specified hierarchical models (Heagerty and Zeger, 2000) express treatment effects in terms of population averages, possibly within subgroups defined by covariates, whereas marginal structural models (Robins et al., 2000) express population level effects that would be observed if treated and control groups had the same covariate distribution. Parameters in hierarchical transformation models, by contrast, must be interpreted conditionally and may be highly sensitive to distributional assumptions. How to interpret the fixed covariate coefficients in the random-effects transformation model for longitudinal data, where the link is left unspecified, seems particularly deserving of comment.

When faced with crossing hazards, as in Fig. 1, we might add a time-dependent covariate to the Cox model and express the log-hazard-ratio parametrically as β0+β1 g(t), where g(·) is a specified, increasing function such as g(t)= log(t/t0) with t0 a modal time. Then β0 expresses the log-hazard-ratio at t0 and β1 its rate of change with a known function of time. With the heteroscedastic transformation model, by contrast, g(t)=H(t)= log{Λ(t)} and the interpretation is less clear.

For missing data problems, including two-phase studies where data are missing by design, Zeng and Lin note that Horvitz–Thompson or inverse probability of sampling weighted estimators may be more robust than non-parametric maximum likelihood estimators in the face of model misspecification. Indeed, survey statisticians advocate their use on grounds that they consistently estimate the parameters obtained when fitting a possibly misspecified model to the source population, which the non-parametric maximum likelihood estimator may fail to do. Breslow and Wellner (2007) provide theory for Horvitz–Thompson estimation of both Euclidean and infinite dimensional parameters in semiparametric models fitted to data from two-phase stratified samples. Efficiency may be enhanced through adjustment of the sampling weights by using the survey techniques of post-stratification and calibration, or by their estimation. I concur wholeheartedly that further studies are needed to assess the relative merits of Horvitz–Thompson and non-parametric maximum likelihood methods for complex sampling designs.

Jack Cuzick (Wolfson Institute of Preventive Medicine, London)

We have recently extended our study of non-compliance in randomized trials from the binary outcome case (Cuzick et al., 1997) to a proportional hazard set-up (Cuzick et al., 2007). This greatly complicates estimation, and we have developed a non-iterative ad hoc estimator, as well as estimators that are based on a partial likelihood and the full semiparametric likelihood. The complications arise because non-compliance is modelled by a covariate (insistor, refuser or ambivalent) which is incompletely observed, leading to a latent class model. In particular insistors cannot be distinguished from patients who are willing to accept either treatment in the active treatment arm, whereas refusers and ambivalent patients are indistinguishable in the control arm. In simulations we have found that the non-parametric maximum likelihood estimator that is computed by using the BFGS quasi-Newton algorithm outperforms other estimators in a wide range of conditions, but an asymptotic theory has eluded us. The likelihood is of their general form (12) but not the specific form (11), and validation of the conditions of their general theorem in this case is still formidable, as is computation of the asymptotic variances. Nevertheless the paper gives a very valuable foundation for studying a wide range of semiparametric problems and will no doubt become a standard reference for further research.

Paddy Farrington (The Open University, Milton Keynes) and Mounia Hocine and Thierry Moreau (Institut National de la Santé et de la Recherche Médicale, Paris)

The authors have achieved an impressive unification and extension of several classes of semiparametric models for event history and repeated measures data. Particularly useful is the general asymptotic theory underpinning these models.

Does the modelling framework for recurrent events encompass models for different timescales, including both time from last event and calendar time? We were mystified by the comment regarding clinical trials at the beginning of Section 2.2. For example, Duchateau et al. (2003) proposed a model involving both frailties and time-varying effects in the form of gap time-dependent hazards, which they applied to clinical trial data.

General and complex models induce problems of model identification and interpretation. For example, crossing hazards may be attributable to one of several contrasting effects. These include time varying exposures, selection effects and the functional form of the dependence of the hazard on fixed covariates. All three are available in the present models: to what extent are they identifiable from data?

Furthermore, interpreting the model parameters is far from easy, as illustrated by several of the examples in the paper. In the gastrointestinal tumour example, the estimated parameter value for the treatment effect for the well fitting model (4) is β=3.028 (0.262). Yet to make sense of this requires us to look at the Kaplan–Meier survival curves: knowledge of β provides little further enlightenment. A similar point can be made about the colon cancer example: the parameter estimates for the selected model provide little clue about how effective the treatment really is.

It would be useful in particular if the authors could clarify under what parameter combinations the hazards do not cross, and whether a test for non-crossing hazards could be derived for model (4) which could lead to simplifications. For example, in model (3) with G(x)=x and fixed covariates, the null hypothesis of no crossing is simply γ=0, and the corresponding score test is readily obtained (Quantin et al., 1996).

An alternative to building ever more complex models is perhaps to focus on the questions of primary interest, while eliminating nuisance parameters by conditioning. One such approach, admittedly for much simpler data structures than those considered here, is provided by the semiparametric case series model (Farrington and Whitaker, 2006). This employs a conditioning argument to eliminate the multiplicative effects of frailties and non-varying covariates, thus focusing the analysis on time varying exposures of interest.

Jason P. Fine (University of Wisconsin, Madison)

Theory for non-parametric maximum likelihood estimation has percolated over the past decade, stimulated by the seminal work of Murphy (1994, 1995). The current paper presents potentially useful albeit somewhat straightforward extensions, with the modelling ideas and theoretical developments following closely earlier contributions. The argument that such methodology should play a wider role in statistical practice is intriguing. Unfortunately, the rationale for widespread adoption in applications is less convincing than that for the underlying mathematics. There is inattention to key applied issues and to the relevance and practical utility of the framework in the examples.

In the cancer illustration in Section 5.1, the treatment effect clearly violates proportional hazards and fitting models (3) and (4) yields very different results from a naïve proportional hazards analysis. Viewing Fig. 1, the study investigators would be interested in understanding how the treatment effect changes over time in the population. This is obscured by the heteroscedastic transformation model, whose interpretation is rather mathematical and difficult for non-statisticians. The proportional hazards model easily accommodates time-dependent effects. Either time-dependent covariates involving interactions of treatment and time (Therneau and Grambsch, 2000) or time-dependent coefficients (Martinussen and Scheike, 2006) may be employed.

In Section 5.2, the joint model for the recurrence time X and the death time Y is questionable. The resulting analysis of the marginal distribution of recurrence corresponds to the setting where death before recurrence does not occur, which is generally of secondary interest to patients and physicians. Typically, competing risk end points, like the cause-specific hazard and cumulative incidence (Kalbfleisch and Prentice, 2002), are reported in oncology journals. Their interpretation corresponds to the current reality where death may occur before recurrence, which is of greater interest. If the distribution of residual life post recurrence is of interest, then the distribution of Y−X can be directly modelled conditionally on X and other covariates using proportional hazards models, which would be the default in practice. This simple analysis can be implemented using standard software and the interpretation of the effect of X on Y−X is much more transparent than that in the joint model.
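The conditional analysis of the residual life Y−X that is suggested here is indeed routine with standard software. The sketch below is only an illustration of that suggestion, not code from the paper: it assumes a hypothetical pandas data frame with made-up column names (recur_ind, recur_time for X, death_ind, death_time for Y, plus baseline covariates trt and age) and uses the lifelines package for the Cox fit.

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_residual_life_cox(df: pd.DataFrame) -> CoxPHFitter:
    """Cox model for the residual life Y - X given X, among patients with an
    observed recurrence. Hypothetical columns: recur_ind, recur_time (X),
    death_ind, death_time (Y), and baseline covariates trt and age."""
    recurred = df[df["recur_ind"] == 1].copy()            # observed recurrences only
    recurred["resid_time"] = recurred["death_time"] - recurred["recur_time"]  # Y - X
    cph = CoxPHFitter()
    cph.fit(recurred[["resid_time", "death_ind", "recur_time", "trt", "age"]],
            duration_col="resid_time", event_col="death_ind")
    return cph   # cph.print_summary() shows the effect of X (recur_time) on Y - X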

For multivariate data, random-effects models may be useful for assessing failure time correlations. The gamma distribution is attractive, because of its relationship to the cross-hazard ratio (Oakes, 1989). The model can be generalized to permit time varying and asymmetric associations. The cross-hazard interpretation is opaque for the normal distribution (Hougaard, 2000). When correlation is a nuisance, random-effects models seem less attractive, as their misspecification may bias other parameter estimators. Moreover, incorporating random effects in conditional proportional hazards models generally gives marginal non-proportional hazards models, whose interpretation may be problematic. Marginal proportional hazards models avoid such limitations. These models can be coupled with copulas, e.g. multivariate normal. Presumably, non-parametric maximum likelihood estimation inferences are efficient, similarly to random-effects models.

Il Do Ha (Daegu Haany University, Daegu)

For the maximum likelihood (ML) estimation the authors use the EM algorithm and the discrete non-parametric Breslow estimator, which results in biased estimators (Rondeau et al., 2003). Moreover, the authors consider bivariate survival data, for which the problem is less severe than for univariate data (Barker and Henderson, 2005). We now demonstrate how the h-likelihood approach overcomes this problem. For simplicity of argument, we consider the semiparametric frailty models (1) with clustered failure time data. We assume that ui has a gamma distribution with E(ui)=1 and var(ui)=α to allow an explicit marginal log-likelihood m.

Let m* be the profile marginal likelihood (Nielsen et al., 1992; Murphy and van der Vaart, 2000) after eliminating the nuisance parameter λ0, defined by

m*(β,α) = m(β,α,λ̂0),

where m= log {∫ exp (h) dv} is the marginal likelihood, h is the h-likelihood (Lee and Nelder, 1996; Ha et al., 2001) and λ̂0 is the discrete Breslow estimator, obtained from ∂m/∂λ0=0. In fact, the maximization of m* gives the ML estimators by using the EM algorithm (Andersen et al., 1997). The resulting ML estimators have downward biases, particularly for the frailty parameter α.

On the basis of 200 replications of simulated data we investigate the performances of three profile likelihood methods (m*, pw(m) and inline image). For the gamma frailty we use the second-order Laplace approximation inline image (Lee and Nelder, 2001). Given the frailty parameter α, we use the profile likelihoods m* and h*, which provide the same estimates for β (Ha et al., 2001). However, they give different estimators for α, because the estimates of α are obtained by maximizing the three adjusted profile likelihoods. Under no censoring we generate data by assuming the exponential base-line hazard λ0(t)=1, one standard normal covariate with β=1, and α=1. We consider both univariate and bivariate sample cases: inline image with (n,ni)=(100,1),(100,2),(200,1). Note here that we choose fairly extreme cases, with no censoring and small sample sizes, because these situations yielded the most biased estimates of inline image in the simulation studies of Nielsen et al. (1992) and Barker and Henderson (2005). The results are summarized in Table 5. As expected, m* gives severe downward biases in all cases considered, especially with ni=1. Moreover, the underestimation of α leads to that of β. Table 5 also demonstrates that the two adjusted profile likelihoods pw(m) and inline image reduce such biases substantially, giving almost the same results.

Table 5.  Simulation results for the estimators α̂ and β̂ under marginal and h-likelihoods in semiparametric gamma frailty models†

n    ni  Method                      α̂: mean  SD     MSE     β̂: mean  SD     MSE
100  1   m*                          0.42     0.363  0.469   0.80     0.228  0.094
100  1   pw(m)                       0.87     0.618  0.398   0.97     0.297  0.088
100  1   second-order Laplace (h)    0.89     0.617  0.393   0.96     0.296  0.089
100  2   m*                          0.90     0.240  0.067   0.98     0.281  0.079
100  2   pw(m)                       0.99     0.258  0.066   1.00     0.286  0.081
100  2   second-order Laplace (h)    0.99     0.258  0.066   1.00     0.286  0.081
200  1   m*                          0.63     0.301  0.231   0.87     0.188  0.054
200  1   pw(m)                       0.98     0.437  0.191   0.99     0.221  0.050
200  1   second-order Laplace (h)    0.99     0.439  0.192   1.00     0.231  0.053

†The simulation is conducted with 200 replications at true gamma frailty variance α=1 and regression parameter β=1 (no censoring). SD, standard deviation; MSE, mean-squared error; the third method is the second-order Laplace approximation based on the h-likelihood.

Here we have considered the gamma frailty model to have an explicit form for m and pw(m). However, this is not so in general, for example, for models with log-normal frailty, or with nested and/or serially correlated frailty. Thus, the adjusted profile likelihoods that are based on h-likelihood are useful for general frailty models. We believe that the h-likelihood approach gives more flexible ML inferences than the EM approach.
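For readers who wish to repeat this comparison, the simulation design described above (exponential base-line hazard λ0(t)=1, one standard normal covariate with β=1, gamma frailty with variance α=1 and no censoring) can be generated along the following lines. This is only a sketch of the stated design, not Ha's code.

```python
import numpy as np

def generate_gamma_frailty_data(n=100, ni=2, beta=1.0, alpha=1.0, seed=0):
    """Clustered failure times from a gamma-frailty proportional hazards model
    with exponential base-line hazard lambda0(t) = 1 and no censoring."""
    rng = np.random.default_rng(seed)
    u = rng.gamma(shape=1.0 / alpha, scale=alpha, size=n)   # E(u)=1, var(u)=alpha
    cluster = np.repeat(np.arange(n), ni)                   # n clusters of size ni
    Z = rng.standard_normal(n * ni)                         # one N(0,1) covariate
    rate = u[cluster] * np.exp(beta * Z)                    # conditional hazard
    T = rng.exponential(1.0 / rate)                         # exponential baseline
    event = np.ones(n * ni, dtype=int)                      # no censoring
    return T, event, Z, cluster
```

Applying the marginal-likelihood and adjusted-profile estimators to many such replicates is what Table 5 summarizes.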

Joel L. Horowitz (Northwestern University, Evanston)

I congratulate Professor Zeng and Professor Lin on their interesting paper. It presents a class of flexible semiparametric models for censored survival and longitudinal data. The models accommodate a wide variety of distributions of random effects or frailty. In particular, the standard assumption of gamma-distributed random effects is removed. It is useful to ask whether the assumptions about frailty can be relaxed further by making the frailty distribution non-parametric.

There has been much interest in this question in econometrics over the past two decades. Heckman and Singer (1984a) showed that the parameter estimates from a Weibull hazard model are very sensitive to the choice of frailty distribution. They established consistency of a non-parametric maximum likelihood estimator of this distribution. Elbers and Ridder (1982), Heckman and Singer (1984b) and Ridder (1990) gave conditions for identification of proportional hazard and generalized accelerated failure time models with non-parametric frailty. Honoré (1990) developed an estimator of the shape parameter of a Weibull hazard model with frailty and gave conditions under which it is asymptotically normal with a rate of convergence in probability that is arbitrarily close to n^{−1/3}. Horowitz (1999) showed how to estimate a proportional hazard model in which the base-line hazard function and frailty distribution are both non-parametric. Horowitz's estimator, like Honoré’s, has a slower than n^{−1/2} rate of convergence. This happens because identification is through the behaviour of the hazard function in an arbitrarily small neighbourhood of 0. However, Ridder and Woutersen (2003) showed that n^{−1/2}-convergence is possible if the base-line hazard function is bounded away from 0 and ∞ in a neighbourhood of 0. An n^{−1/2} rate of convergence is also possible if we have longitudinal data (Horowitz and Lee, 2004). Indeed, this is possible with longitudinal data even if the frailty variable is correlated with the covariates.

Non-parametric estimation of the frailty distribution with cross-sectional data is a deconvolution problem, so the rates of convergence in probability are quite slow in general. None-the-less, it appears possible to obtain useful estimates with samples of practical sizes (Horowitz, 1999). In summary, estimation results in hazard models can be very sensitive to misspecification of the frailty distribution. Non-parametric treatment of the frailty distribution is possible in simple models such as proportional hazards models. It is worth investigating whether more complicated models such as those of Zeng and Lin are also sensitive to misspecification of the frailty distribution and whether non-parametric estimation of this distribution is possible.

John D. Kalbfleisch (University of Michigan, Ann Arbor, and National University of Singapore) and Jinfeng Xu (National University of Singapore)

We congratulate Professor Zeng and Professor Lin on an interesting and far reaching paper with many facets and intricacies. Our allotted space is very short, so we confine our comments to three points.

There is clearly value in developing methods for and investigating uses of alternative models but, like many others, this paper begins by setting up proportional hazards as a straw man. There is no recognition that the covariates in the model can incorporate interactions with functions of time and so model, quite parsimoniously, various non-proportional aspects of a problem. For this reason, Kalbfleisch and Prentice (2002) suggest use of the term relative risk or Cox model instead of the proportional hazards misnomer. The proposal of incorporating heterogeneous errors in a linear transformation model to account for possible non-proportional hazards leads to incorporation of coefficients that are difficult to interpret, and more difficult, we suggest, than the interpretation of a time varying term in a relative risk model.

How in general do we interpret parameters in a linear transformation model? There are simple interpretations for extreme value or logistic error. In the more general case with estimated G and arbitrary H, however, it is not clear what β is measuring in model (1) or its extensions. Table 1 illustrates the point. Here the parameters in the relative risk model have simple interpretations as log-relative-risks and similar interpretations apply in the proportional odds model. The paper notes that the ‘interpretation of treatment effects … depends on which model is used’ but provides no guidance on interpreting the parameters under the suggested analysis. We seem to be left with a test for no treatment effect but without the benefit of interpretable parameters.

Finally, Fig. 2 comprises four separate sheets wherein (0,0), (0,40), (40,0) and (40,40) all correspond to proportional odds and (20,20) corresponds to proportional hazards. It is not clear that (20,20) corresponds to a single ordinate in all four sheets as it should; nor is it apparent that all representations of proportional odds (e.g. (0,40) and (0,0)) have the same ordinate, and we wonder whether the normalizing constant is the same in all sheets. A contour plot would perhaps have been more informative. The reason for the arbitrary cut-off at ρ=1 is also unclear, and it seems that this choice may affect the model selected. Some investigation of simple time-dependent relative risks in this example would be interesting.

Michael R. Kosorok (University of North Carolina, Chapel Hill)

I congratulate Zeng and Lin on an excellent contribution to statistical modelling for right-censored data. The authors make a very strong case for the practical use of efficient, maximum-likelihood-based estimation for semiparametric models. Moreover, the heteroscedastic linear transformation model proposed and, especially, the random-effects linear transformation model are scientifically interesting and very appealing new models.

Nevertheless, there are a few points to be made. To begin with, several important references should be added to the part of the introduction that reviews transformation models. Slud and Vonta (2004) generalized the work of Scharfstein et al. (1998) to more general choices of the G-function than is given in expression (2) of the paper under discussion. Kosorok et al. (2004) further generalized to allow G to be parameterized with unknown parameter values. Thus the future topic that Zeng and Lin propose in the second paragraph of Section 7 has already been partly accomplished in Kosorok et al. (2004).

On a more favourable note, the ensemble of numerical tools that were developed by Zeng and Lin is a key contribution that makes the methods proposed usable in practice. However, additional gains in computational efficiency are possible when both the finite and the infinite dimensional parameters are jointly efficiently estimated, as has been verified, for example, by Kosorok et al. (2004) for transformation models with right-censored data. Incidentally, this joint efficiency holds for most of the models in Zeng and Lin's paper, even though they neglected to point this out. This gain in computational efficiency is achievable through careful utilization of the profile likelihood structure for both the finite dimensional parameter via the profile sampler (Lee et al., 2005) and for all parameters jointly via the piggyback bootstrap (Dixon et al., 2005). The computational savings of these methods have been verified rigorously and can be dramatic. Combining the profile sampler and piggyback bootstrap with the numerical innovations of Zeng and Lin should lead to further dramatic improvements.

In Section 7, the authors mention robustness under model misspecification and extending transformation models to interval-censored data as important future topics. Some initial work on robustness of transformation models was given in Kosorok et al. (2004), who showed that the direction of the regression effects can be accurately estimated even when the G-function is misspecified. Likelihood inference under interval censoring for transformation models with partly linear regression effects was developed and verified theoretically in Ma and Kosorok (2005). A key challenge here is the presence of two non-root-n consistent estimators.

In general, very little work has been done for likelihood-based semiparametric inference involving parameters that are not root n estimable. A very careful analysis of the entropy of these models is usually required, making this area of endeavour one of the most intellectually demanding in all of statistics. Adding to this the other open topics that were mentioned by Zeng and Lin, it is clear that many challenging questions in semiparametric inference remain.

Youngjo Lee (Seoul National University, Seoul)

I congratulate the authors on unifying maximum likelihood estimation for the analysis of multivariate survival data, based on the EM algorithm. However, the EM method is slow and may not be easily applicable to complicated situations. For simplicity of argument, consider the semiparametric frailty models (6) with clustered failure time data, giving conditional hazard

λij(t|ui) = λ0(t) exp (βTZij)ui, (16)

where λ0(·) is an unspecified base-line hazard function and the ui follow some distribution. For inferences, Lee and Nelder (1996) proposed to use the h-likelihood, which is defined by

h = log f(y|u) + log f(v),

where f(y|u) and f(v) are probability density functions for y|u and v= log (u) respectively. For inferences about the fixed parameters θ, the marginal (log-)likelihood has been proposed, using

m = log {∫ exp (h) dv}.

However, in general the required integration is intractable. Thus, Lee and Nelder (2001) considered a function class that they called adjusted profile likelihoods p(l); these eliminate the nuisance parameter τ from a likelihood l, defined by

pτ(l) = [l − ½ log det{D(l,τ)/(2π)}]|τ=τ̂,

where D(l,τ)=−∂²l/∂τ² and τ̂ solves ∂l/∂τ=0. The adjusted profile function pv(h) eliminates the random parameters v by the Laplace approximation to integration (Lee et al., 2006) and p(m) eliminates the fixed parameters β by conditioning on β̂ (Cox and Reid, 1987).

For frailty models, Ha and Lee (2005) proposed to use an adjusted profile h-likelihood pv(h*), which is defined by

pv(h*) = [h* − ½ log det{D(h*,v)/(2π)}]|v=v̂,

where h* is the profile h-likelihood with the solution λ̂0 from ∂h/∂λ0=0, D(h*,v)=−∂²h*/∂v² and v̂ solves ∂h*/∂v=0. Ha and Lee (2007) showed that

image

where w= log (λ0). These adjusted profile likelihoods give practically satisfactory estimators (Ha and Lee, 2005, 2007). Instead of using the E-step the h-likelihood method directly maximizes various adjusted profile likelihoods: for its advantages see Lee et al. (2006).

Yi Li (Harvard School of Public Health and Dana–Farber Cancer Institute, Boston)

Zeng and Lin are to be congratulated for their wonderful work on non-parametric maximum likelihood estimation for semiparametric frailty regression models. In this comment, I concentrate on the interpretation of frailties.

In the framework of random-effects models, the frailties have been introduced to model the clustering effect and will be useful for prediction as illustrated in Section 5.2. However, they are meant to model the within-cluster dependence as the variance components of the frailties typically gauge the magnitude of such dependence (Diggle et al., 1994). This, however, was not elucidated in this paper. This note bridges the frailty parameters with within-cluster dependence measures and highlights a challenge in interpreting these parameters. To convey the idea, consider a (much) simplified version of model (7) for bivariate failure times (T1,T2) with no covariates, namely

Λk(t|b) = G{Λ(t) exp (b)},   k=1, 2, (17)

where b∼f(·;γ). Our goal is to link the variance component γ to a ‘model-free’ and standardized dependence measure that is commonly used for bivariate survival. One such device is Kendall's coefficient of concordance (Kendall's τ), which can be evaluated by

τ = 4 ∫∫ S(t1,t2) p(t1,t2) dt1 dt2 − 1,

where p(t1,t2) and S(t1,t2) are the joint bivariate density and survival functions respectively (see, for example, Hougaard (2000)). It follows that the joint survival under model (17) is

S(t1,t2) = ∫ exp [−G{Λ(t1) exp (b)} − G{Λ(t2) exp (b)}] f(b;γ) db,

and p(t1,t2) can be conveniently evaluated as p(t1,t2)=∂²S(t1,t2)/∂t1 ∂t2. Therefore, γ can be viewed as characterizing the bivariate dependence through

image

It is worth noting that τ does not depend on the base-line function Λ(t) in model (17), and its efficient estimate can be obtained by replacing γ with its maximum likelihood estimator γ̂, whose variance estimate is immediately available via the delta method. In a similar fashion, the relationship of the variance component γ with the other global dependence measures, e.g. the Spearman correlation, the integrated hazard ratio and the median concordance, and with the local dependence measure, i.e. the local cross-ratio, can also be established.

However, a serious challenge in interpreting γ as a dependence measure lies in its dependence on the link function G in model (17). This can be illustrated by Fig. 7, which depicts Kendall's τ against various γ when b∼N(0,γ), under the proportional hazards model (with G(x)=x) and the proportional odds model (with G(x)= log (1+x)). For example, γ=1.8 corresponds to a Kendall's τ of 0.40 under the proportional hazards model, which is almost twice that of 0.21 under the proportional odds model, begging the cliché question of ‘how large is large?’ when viewing the variance component as a measure of dependence under various transformation models. I would welcome the authors’ comments on this issue.

Figure 7.

 Kendall's τ versus γ for the proportional odds (——) and proportional hazards (– – –) models
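Li's calculation can be mimicked numerically. The sketch below is a Monte Carlo approximation, not the exact integral: it assumes that the simplified model (17) specifies the conditional cumulative hazard G{Λ(t) exp(b)} with a shared b∼N(0,γ), takes Λ(t)=t (immaterial, since τ does not depend on Λ) and estimates Kendall's τ from simulated pairs with scipy.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_tau_mc(gamma, link="ph", n=100_000, seed=1):
    """Monte Carlo Kendall's tau for bivariate times sharing b ~ N(0, gamma),
    with conditional cumulative hazard G{Lambda(t) exp(b)} and Lambda(t) = t."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, np.sqrt(gamma), size=n)
    U1, U2 = rng.uniform(size=(2, n))
    if link == "ph":        # G(x) = x: S(t|b) = exp(-t e^b)
        T1, T2 = -np.log(U1) * np.exp(-b), -np.log(U2) * np.exp(-b)
    else:                   # G(x) = log(1+x): S(t|b) = 1/(1 + t e^b)
        T1, T2 = (1.0 / U1 - 1.0) * np.exp(-b), (1.0 / U2 - 1.0) * np.exp(-b)
    tau, _ = kendalltau(T1, T2)
    return tau

# kendall_tau_mc(1.8, "ph") and kendall_tau_mc(1.8, "po") should be close to the
# values 0.40 and 0.21 quoted in the comment above.
```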

N. T. Longford (SNTL, Reading, and Universitat Pompeu Fabra, Barcelona)

I am perplexed by the double negative (‘no reason for not using maximum likelihood estimation’) in the penultimate sentence of the summary, which I regard as vacuous. The authors do not mention any alternative, and I think that there is no credible alternative to maximum likelihood estimation. The difficulty is not in maximizing a likelihood, such as expression (12), but in specifying one appropriately through the details of the adopted model, balancing the requirements of validity and parsimony. This highlights the need for model selection, or dealing with model uncertainty, and for taking account of the model selection process in the analysis. Information criteria, such as Akaike's information criterion and the Bayes information criterion, are standard but exceedingly poor solutions if the model that is selected is regarded as being valid and as if it were selected before data inspection (Longford, 2005). The model selection process is far from ignorable.

Maximum likelihood is efficient only asymptotically, and only with a valid model. Even if simulations confirm that the asymptotic sampling variance closely approximates the sampling variance in a finite sample setting and that the bias is small or nil, we cannot conclude that maximum likelihood estimation is also efficient in small samples. In finite samples, some submodels of a valid model may yield more efficient estimators for some targets; the bias of an estimator that is based on an invalid model may be more than offset by the variance reduction in relation to a valid model. This issue is highly relevant in semiparametric models, in which the effective numbers of observations and parameters cannot be counted straightforwardly.

Xavier de Luna (Umeå University) and Per Johansson (Uppsala University and Institute for Labour Market Policy Evaluation, Uppsala)

The paper presents in a convincing manner a broad class of models for longitudinal and event time data. Our purpose with this comment is to make potential users of these models aware of an important pitfall in the analysis of studies with waiting time to treatment (i.e. the time for which units have been eligible for treatment). Indeed, ignoring waiting time is not innocuous even in experiments with randomized treatment assignment. When the outcome of interest is a survival time, the effect of the treatment can be defined by using the survival functions of the treated and non-treated subjects over the population of those who are eligible for treatment. The survival function S(t)=E{ℐ(T≥t)}, where ℐ(·) is the indicator function, is then estimated for treated and control units respectively. By randomizing treatment assignment to units, waiting time to treatment is balanced. In general, we would expect the hazard of death to be a function of waiting time w: h(t;w)=E{ℐ(T=t)|T≥t,W=w}. Marginalizing over waiting time is equivalent to considering h(t)=EW{h(t;W)}. Unfortunately, the use of the estimated average (over w) hazards to construct the Kaplan–Meier estimator does not yield a valid estimator of the population survival function S(t), unless h(t;w)=h(t). This is because the Kaplan–Meier estimator is a non-linear function of the hazards.
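The discussants' point can be seen in a two-line numerical example. Suppose, purely for illustration, that two equally likely waiting-time strata have constant hazards 1 and 3: the population survival curve E_W{S(t;W)} differs from the curve built from the averaged hazard E_W{h(t;W)}=2.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 5)
S_pop = 0.5 * np.exp(-1.0 * t) + 0.5 * np.exp(-3.0 * t)   # E_W{S(t; W)}
S_avg = np.exp(-2.0 * t)                                  # survival built from the W-averaged hazard
print(np.round(S_pop, 3))   # at t = 1: about 0.209
print(np.round(S_avg, 3))   # at t = 1: about 0.135, not the population survival
```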

In the colon cancer study of Section 5.2, waiting time is taken into account through the covariate Z2i. We must assume that the hazards that are modelled are not functions of w within the classes Z2i=0 (w≤20 days) and Z2i=1 (w>20 days), for the curves that are displayed in Fig. 3 to be interpretable as survival functions. Note that the waiting time is at most 1 month (Fleming, 1992), and this might be a reasonable assumption here.

In non-randomized experiments, controls have no well-defined waiting time. This constitutes a major complication because waiting time cannot then be introduced as a covariate in a model. This issue was addressed in Fredriksson and Johansson (2004) and de Luna and Johansson (2007) by conditioning the inference on waiting time. For instance, the models that are discussed by Zeng and Lin may be applied on a stratum that is defined by a waiting time w0. Then, in this stratum, all units having survived until time w0 and not treated at that time can and must be considered as controls. For this approach to be feasible, enough observed cases for each waiting time stratum of interest must be available.

Ross L. Prentice (Fred Hutchinson Cancer Research Center and University of Washington, Seattle)

I congratulate the authors on a lucid and impressive paper that unifies and extends a diverse statistical literature, while using modern empirical process theory for asymptotic distributional results, including that of semiparametric efficiency.

The authors' recommendation (a) is to ‘use the new class of transformation models to analyse failure time data’. They motivate this new class of models (2)–(4) by the need to allow for the possibility of crossing hazards. However, as noted in the first sentence of their paper, the Cox (1972) model includes time varying covariates, which may readily be defined to include crossing hazards; see Prentice et al. (2005) for a recent, practically important, example. More generally, the Cox model is by far the most important special case of models (2)–(4), because of the ready interpretation of regression effects on the hazard ratio, to the point that I wonder whether the larger class adds much. Linear transformation models as a class do not seem to share such a useful interpretation, except for accelerated failure time models which, as the authors note, are not encompassed by models (2)–(4).

The authors’ recommendation (b) is to ‘make routine use of random-effects models for multivariate failure time data’. Although frailty models allow for dependence between clustered failure times, and I agree that normal random effects have the advantage of avoiding restrictions on pairwise dependences, I do not find the frailty approach to be so appealing. For example, frailty models typically imply complicated marginal distributions for failure times, and the interpretation of regression coefficients is conditional on the frailty. Why should the marginal models for a failure time change, just because some possibly correlated failure times are being simultaneously analysed? Copula models preserve marginal distributions while also allowing correlation. A multivariate normal copula model as applied to standard normal variates arising from Cox model margins (e.g. Li and Lin (2006)) seems particularly appealing. For recurrent events, the authors argue that the inclusion of post-randomization time-dependent variables in Cox models may affect treatment effect interpretation. However, random-effect modelling cannot be expected to remove biases if censoring rates depend in a complex fashion on the preceding counting process history. Careful data analysis is then required, with Cox models having evolving covariates providing a useful and interpretable modelling context (e.g. Kalbfleisch and Prentice (2002), chapter 9).

I again congratulate the authors on their stimulating work.

N. I. Ramesh (University of Greenwich, London) and A. C. Davison (Ecole Polytechnique Fédérale de Lausanne)

We would like to mention work on a related topic for which the methods that are outlined in this interesting paper may be applicable.

We use a multistate Markov model to analyse the movement of ticks of the species Ixodes ricinus up and down blades of grass under the influence of covariates such as temperature, relative humidity or light. The movements are recorded under controlled conditions in a laboratory setting over 10 days, with light changed to mimic diurnal variation (Perret et al., 2003).

A simple model is that at any time a tick is in one of three states, resting at the foot of the blade (1), walking between the top and bottom of the blade (2) and questing for prey at the top of the blade (3), and that transitions may take place between these as follows: 1⇆2⇆3, so direct transition between states 1 and 3 is not allowed. Under a proportional hazards model, we might suppose that a tick in state 1 moves out of it at a rate h12(t)ξ12(x;β), where h12(t) depends on time t since the start of the experiment, and represents a base-line rate at which a tick in state 1 might leave that state. Similarly we define base-line rates h32(t), h21(t) and h23(t) for the other possible transitions. The quantities ξij(x;β) reflect the influence of covariates on the rates.

The key question of interest is: what influences changes between the states?

This is an application where the diurnal variation would lead us to expect cyclic behaviour, so we may use a proportional hazard model with periodic base-line hazard (Pons and de Turckheim, 1988).

To what extent do the ideas of the present paper extend to the periodic case, or to other situations where a base-line function has some special form or constraints induced by the sampling plan?
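For concreteness, the three-state process described above could be simulated as below. Everything numerical here (the sinusoidal form of the periodic base-line rates, the constants and the covariate multipliers ξij) is hypothetical and is chosen only to illustrate the structure 1⇆2⇆3 with diurnal base-line intensities; the simulation uses standard thinning for time varying rates.

```python
import numpy as np

def simulate_tick_path(xi, horizon=240.0, period=24.0, seed=0):
    """One tick moving among states 1 (foot), 2 (walking), 3 (questing),
    with transitions 1<->2<->3 only, base-line rates
    h_ij(t) = a_ij {1 + 0.5 sin(2 pi t / period)} and multipliers xi[(i, j)].
    Simulated by thinning (accept/reject against a constant rate bound)."""
    rng = np.random.default_rng(seed)
    a = {(1, 2): 0.3, (2, 1): 0.2, (2, 3): 0.25, (3, 2): 0.15}   # hypothetical constants
    moves = {1: [(1, 2)], 2: [(2, 1), (2, 3)], 3: [(3, 2)]}

    def rate(ij, t):
        return a[ij] * (1.0 + 0.5 * np.sin(2.0 * np.pi * t / period)) * xi[ij]

    state, t, path = 1, 0.0, [(0.0, 1)]
    while True:
        bound = sum(1.5 * a[ij] * xi[ij] for ij in moves[state])  # >= total exit rate
        t += rng.exponential(1.0 / bound)
        if t >= horizon:
            return path
        rates = np.array([rate(ij, t) for ij in moves[state]])
        if rng.uniform() < rates.sum() / bound:                   # a jump occurs
            k = rng.choice(len(rates), p=rates / rates.sum())
            state = moves[state][k][1]
            path.append((t, state))

# e.g. simulate_tick_path({(1, 2): 1.2, (2, 1): 1.0, (2, 3): 1.1, (3, 2): 1.0})
```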

Peter Sasieni (Queen Mary, University of London)

This impressive paper unifies many models for right-censored data (including right-censored repeated measures) and provides a general approach to asymptotic theory and computation. Although the power of empirical process theory is not doubted, the loss of the key concept of ‘history’, which is central to martingale theory, is lamentable. More particularly, are these large unifying models simply too big to be useful? If model (4) were widely used, how would one perform a meta-analysis based on published results? The following issues all relate to parameter identifiability, interpretation and approximation (McCullagh, 2002).

  • (a) In the transformation model (1), the scale is fixed by the error distribution. Since different distributions have different variances, the magnitude of the parameter β has no common interpretation. Would constraining the variance of ɛ help interpretation of β (Chen et al., 2002a; McCullagh, 2002)?
  • (b) In general, since the interpretation of the regression parameters β depends on the transformation H, there are difficulties in interpreting confidence intervals that take into account the uncertainty in H (Bickel and Doksum, 1981; Hinkley and Runger, 1984). There are exceptions: notably, when the error is extreme value or logistic, β has a natural interpretation that does not depend on H.
  • (c) In model (4), one cannot even interpret the sign of β without taking into account γ and Λ. In such circumstances the focus will turn to estimating functionals such as Pr(T1<T2|Z1,Z2) or E(T2−T1|Z1,Z2). What then is the advantage of model (4) over non-parametric estimation?
  • (d) Is model (7) simply too big? As illustrated by Fig. 2, the likelihood can be quite flat with multiple local maxima. In such circumstances we might prefer a suboptimal model with a simpler interpretation that captures the main features of the data. A hierarchy of models would be useful in practice.
  • (e) Given that the likelihood is not (even asymptotically) convex, can we be sure that the solution to the score equation that is used in the M-step will lead to a consistent estimator?
  • (f) The model that was used to analyse the colon cancer data treats cancer and death symmetrically despite the fact that there were no cases of death without recurrence. It is of clinical interest to note that the time from recurrence to death is
  • (i)strongly correlated with the time to recurrence (Fig. 8) and
  • (ii)decreased by treatment.
Figure 8.

 Kaplan–Meier estimates of survival from cancer recurrence to death in the colon cancer example (a) showing how the survival from recurrence is shorter in those who had recurrence earlier (——, <121 days; – – –, 121–365 days; - - - - - -, 366–730 days; ·—·, >730 days) and (b) showing how the survival from recurrence is worse in those in the treatment arm (although overall survival is better owing to the much bigger beneficial effect of treatment on recurrence) (——, treat = 0; – – –, treat = 1)

L. Tian (Northwestern University, Evanston) and L. J. Wei (Harvard School of Public Health, Boston)

We thank Professor Zeng and Professor Lin for providing us with practically useful, theoretically justifiable inference procedures for a general class of semiparametric models via the maximum likelihood principle. Their proposals are far reaching and can handle various classical challenging problems in survival and longitudinal data analysis. When the fitted model is correctly specified, the resulting estimation procedure is asymptotically efficient. Moreover, unlike other ad hoc methods dealing with censored data, theirs is valid without much restriction on the distribution of the censoring variable. The authors also showed that the non-parametric maximum likelihood estimator (NPMLE) is numerically tractable, at least when the number of observed failure times is not too large. An alternative way to obtain an approximation to the distribution of the NPMLE of the Euclidean parameter θ0 is to draw random samples repeatedly from the density function proportional to  exp {pln(θ)}, where pln(·) is the profile log-likelihood function. The realizations of these random samples can be generated via a Markov chain Monte Carlo sampler (Lee et al., 2005).
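As a concrete illustration of this suggestion (a sketch only, not the implementation of Lee et al. (2005)), a random-walk Metropolis sampler targeting the density proportional to exp{pln(θ)} needs nothing more than a routine that evaluates the profile log-likelihood; profile_loglik below is a user-supplied placeholder.

```python
import numpy as np

def profile_sampler(profile_loglik, theta0, n_draws=5000, step=0.1, seed=0):
    """Random-walk Metropolis draws from the density proportional to
    exp{pl_n(theta)}; profile_loglik is a callable returning pl_n(theta)."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = profile_loglik(theta)
    draws = np.empty((n_draws, theta.size))
    for i in range(n_draws):
        prop = theta + step * rng.standard_normal(theta.size)   # propose a move
        lp_prop = profile_loglik(prop)
        if np.log(rng.uniform()) < lp_prop - lp:                # Metropolis accept step
            theta, lp = prop, lp_prop
        draws[i] = theta
    return draws   # the spread of the draws approximates the variability of the NPMLE
```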

The authors concluded that ‘there is no reason, theoretical or numerical, not to use maximum likelihood estimation for semiparametric regression models’. However, a fitted model is probably an approximation to the true model. It is possible that the NPMLE may not converge under a working model. Moreover, generally the robust ‘sandwich variance estimate’ for the NPMLE is difficult to obtain owing to the lack of an explicit score function. Furthermore, for evaluating a working model, first we fit the data with the model and then validate it by using, for example, a function D, which measures the average distance between the observed and the model-based predicted responses. Preferably this distance function should be easily interpretable, say, with respect to the scale of the response variable. For example, we may use an R2-type measure or the absolute prediction error as a possible candidate for D (Uno et al., 2007; Tian et al., 2007). We then use the sampling distribution of an estimated D to evaluate the adequacy of the fitted model. However, a likelihood-based validation criterion may not be easy to interpret. Therefore, to make a coherent package from model estimation to validation, we may prefer to use certain moment-based estimates for the regression parameters.

Lastly, we may take the above approach to compare different working models, e.g. to examine whether a normal random-effects model is better than a gamma frailty counterpart or a parametric model is better than a semiparametric counterpart.

Keming Yu and Shanchao Yang (Brunel University, Uxbridge, and Guangxi Normal University, Guilin) and Ali Gannoun (Conservatoire National des Arts et Métiers, Paris)

This is an impressive piece of work; the idea may motivate some new research for quantile regression in survival analysis. Whereas a conditional survival function at time t represents the proportion of those conditionally surviving up to time t, a pth (0<p<1) conditional quantile function provides the earliest time by which the proportion p have died. Let inline image be the pth conditional quantile of h(T). First, although it is difficult to find a parametric transformation to achieve normality for a standard linear mixed model, or to leave the transformation function unspecified as in the semiparametric version (10), quantile regression has the property of equivariance to monotone transformations, i.e. inline image for any monotone function h, so we can simply select a transformation such as the Box–Cox transformation to apply. In fact, whatever parametric monotone transformation h is used to achieve normality of the response variable, we can transform back to obtain inline image. Second, in the context of this paper we may propose semiparametric quantile models in different stages. For example, corresponding to the heteroscedastic version of linear transformation models inline image, we may have their quantile versions as inline image (Chaudhuri et al., 1997; Koenker and Geling, 2001), where the parameters β(p) and γ(p) depend on p (the monotonicity of inline image over p may be required). Corresponding to those new cumulative intensity functions in equations (3) or (4), the new quantile regression models can be derived as follows: from

image

where λ(t) is the base-line hazard function and inline image, we have that inline image satisfies the equation

image

Then inline image could be estimated on the basis of the ‘check function’ or ‘loss function’ ρ(u)=u{p−I(u<0)} (Portnoy, 2003; Gannoun and Yu, 2007) with proper non-linear optimization.
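The role of the check function can be illustrated with a few lines of code: minimizing the average check loss ρ(u)=u{p−I(u<0)} over a sample recovers the sample pth quantile, which is the principle behind the non-linear optimization mentioned above. The exponential sample below is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def check_loss(u, p):
    """Check ('loss') function rho(u) = u * {p - I(u < 0)}."""
    return u * (p - (u < 0))

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=10_000)
p = 0.75
fit = minimize_scalar(lambda q: np.mean(check_loss(y - q, p)),
                      bounds=(0.0, 20.0), method="bounded")
print(fit.x, np.quantile(y, p))   # the minimizer and the sample 0.75-quantile nearly coincide
```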

The authors replied later, in writing, as follows.

We are delighted with the unusually large number of contributions from such a diverse group of researchers. We thank all the discussants for taking the time to read our paper and to prepare constructive comments. We are particularly grateful to those who travelled from outside the UK to attend the Ordinary Meeting. Because of space limitations, it is not feasible to respond to all the points that were raised. We shall focus on some common themes.

This year marks the 35th anniversary of Cox's (1972) landmark paper on proportional hazards regression, which is the foundation of our work. We are deeply honoured by Sir David Cox's participation in the meeting and the discussion. As always, his comments are extremely insightful and pertinent. We share his views on hazard and accelerated failure time modelling, contrast between proportionality and additivity, and general versus specific models.

Several discussants, particularly Prentice and Sasieni, question the usefulness of transformation models. We wish to reiterate that we are not advocating abandonment of the Cox model, but rather extension of this highly useful model to provide additional modelling capabilities. Although we have presented our models in very general and somewhat abstract forms, any specific application will probably involve only a subset of the models with simpler representation. Most studies are concerned with single events with time invariant covariates, in which setting the class of linear transformation models that is given in equation (1) or its heteroscedastic version that is given above equation (3) would suffice. Although cast within the framework of transformation models, the paper contains new development for the Cox regression, such as Cox models with non-gamma random effects and the joint modelling of repeated measures and failure times via the semiparametric linear mixed model and the Cox model with normal random effects.

There seems to be a general agreement that the proportional hazards assumption should be challenged in practice. Several discussants, including Cox, Breslow, Farewell and Tom, Fine, Kalbfleisch and Xu, and Prentice, recommend adjustment of non-proportionality through the use of manufactured time varying covariates in the form of Z f(t), where f(t) is a known function, such as t or  log (t). This approach can be quite useful, especially if we wish to stay within the hazard modelling framework, but it is rather restrictive and data driven. Finding the right form of f can be challenging, particularly when there are multiple continuous covariates. If a linear transformation model, such as the proportional odds model, truly captures the non-proportionality, then that model would provide more concise summarization of the data than the Cox model with manufactured time varying covariates. As Breslow points out, the log-hazard ratio takes the form of a+b  log (t) under the two-sample Cox model with f(t)= log (t) and the form of a+b  log {Λ(t)} under model (3). The latter formulation is actually more appealing since it is non-parametric and scale invariant.

The familiar linear model form of equation (1) is more intuitive than the hazard formulation, especially when the response variable does not pertain to failure time. The choice of the extreme value distribution for ɛ yields the proportional hazards model. If the true error distribution is not extreme value, then we should use whatever the true distribution is rather than abandoning this attractive formulation.

Equation (1) can be expressed as g{Sz(t)}=H(t)+βTZ, where Sz(·) is the conditional survival function of T given Z, and g(·) is a known link function. The choices of g(x)= log {− log (x)} and g(x)= log {x/(1−x)} yield the proportional hazards and proportional odds models respectively. These two models are analogous to the binary data regression models with the complementary log–log- and logit link functions. If the true link function is logit, it would not be sensible to insist on using the complementary log–log-link function and trying to correct for the misspecification of the link function by incorporating interaction terms.

Both equation (1) and its survival function representation show that linear transformation models characterize directly the effects of covariates on the ultimate outcome, i.e. survival time or survival probability. By contrast, hazard is a conditional concept, and the effects of covariates on the survival time under the Cox model with (manufactured) time-dependent covariates are not transparent. A few discussants, particularly Henderson, Martinussen and Scheike, Farewell and Tom, and Prentice, are concerned that the effects of covariates on the hazard function may not be clear under transformation models. But we should not confine ourselves to a hazard interpretation, especially when the hazards are not proportional and alternative formulations lead to more parsimonious models.

Hutton remarks that accelerated failure time models are more useful than proportional hazards models in certain medical applications and can accommodate non-proportional hazards through appropriate choices of the error distribution. Her comments further support the use of linear transformation models, which are the same as parametric accelerated failure time models except that the transformation of the failure time is unspecified.

Several discussants, particularly Breslow, Fine and Farrington, Hocine and Moreau, query the interpretation of the regression parameters in the gastric cancer example. In that example, model (3) takes a simple form

image

where ɛ has the extreme value distribution. This is just a heteroscedastic linear regression model. The model can also be written in terms of the cumulative hazard function

image

which is a semiparametric version of the Weibull regression model in which β and γ represent the effects of the combination therapy on the scale and shape of the failure time distribution respectively. Under model (4), the hazard ratio is

image

which reduces to equation (14) in Henderson's contribution under Λ0(t)=t. As explained by Henderson, the interpretation of β and γ is fairly straightforward in this case. We agree with Hutton and Farrington, Hocine and Moreau that it would be desirable to identify factors that cause crossing hazards.

For recurrent events and time varying covariates, it is necessary to formulate the transformation models in terms of hazard. Then the interpretation of the regression parameters indeed may not be simple, and the main advantages of the transformation models may lie in prediction. Regression analysis has traditionally been focused on individual regression parameters. Inference on individual regression parameters can be quite misleading when covariates are correlated. More emphasis should be placed on prediction, i.e. on characterizing how different covariates act together to affect the ultimate outcomes. For purposes of prediction, it is desirable to use the most accurate model. For that, rich classes of models such as those presented in our paper are highly valuable, as alluded to by Cox.

As Sasieni and Li point out, the interpretation of the regression parameters and variance components generally depends on the transformation function. In the special cases of the proportional hazards and proportional odds models, the regression parameters have simple interpretations. Thus, we recommend the use of those two models as long as they provide reasonable approximations.

Henderson and Fine mention the Cox model with time varying regression coefficients. This is a nice way of visualizing the effects of covariates on the hazard function over time. It is, however, very difficult to estimate the time varying coefficients well. Indeed, such parameters cannot be estimated at the usual n^{1/2} rate, and non-parametric smoothing is required. The proportional odds model with time varying regression coefficients that was suggested by Martinussen and Scheike encounters the same difficulties. Additive hazards models with time varying regression coefficients are easier to deal with; however, additive models do not constrain the hazard functions to be positive, as noted by Cox.

The advantages of the non-parametric maximum likelihood estimators (NPMLEs) have been discussed at great length in the paper, and we shall not repeat our arguments. As Borgan points out, partial likelihood provides a simple, although inefficient, approach to analysing nested case–control data. This approach applies only to the Cox model and is intractable when covariates are measured with error, whereas the NPMLE is more broadly applicable. As demonstrated by Zeng et al. (2006), NPMLEs provide a unified framework for efficient semiparametric inference under the nested case–control and case–cohort designs.

Fine and Prentice find marginal models conceptually more appealing than random-effects models. This view is not universally adopted. Indeed, random-effects models are more desirable than marginal models when the response for an individual rather than for the population is the focus (Zeger et al., 1988). The advantages of random-effects models over marginal models are discussed in Section 7 of our paper.

We share Scheike and Martinussen's sentiment that one should use the random-effect distribution that provides the best description of the correlation. Unfortunately, it is difficult to determine the true distribution of random effects empirically. The advantages of the normal random effects are described in the paper. Normal random effects are particularly natural in joint modelling of repeated measures and failure times.

We agree with Aalen that counting process martingale theory is a very important conceptual framework for formulating the effects of potentially time varying covariates on event history under general censoring mechanisms. Our remarks about the limitations of this tool pertain only to the proofs of asymptotic results. As Aalen points out, counting process martingale theory provides a very simple approach to understanding the properties of standard survival analysis methods, such as Kaplan–Meier and Nelson–Aalen estimators, log-rank tests and Cox regression analysis. Thus, most graduate courses in survival analysis theory are currently taught from the counting process martingale point of view. Once the students start working on their theses, however, they realize that this elegant theory cannot be used to solve cutting edge research problems whereas empirical process theory is much more powerful.

Andersen is right that the construction of the likelihoods makes the standard assumption of independent and non-informative censoring. The theoretical results in the paper cover random left truncation under suitable regularity conditions. In the presence of internal time varying covariates, likelihoods given in expressions (5) and (8) are indeed only partial likelihoods and the EM algorithms may not apply. Like the standard Cox regression analysis, our methods require that time varying covariates be measured at all observed event times.

Since the scope of our paper is very broad, it is impossible to cite all relevant papers. Kosorok et al. (2004) is discussed in remark 3 of the paper. The h-likelihood that was mentioned by Nelder, Lee and Ha is very intriguing. It is unclear whether this approach will provide numerically accurate and statistically efficient estimators in the semiparametric setting that is considered in our paper.

Tsodikov provides very nice insights into the convergence properties of the semiparametric EM algorithms that are employed in our paper. The robust variance scoring algorithm that was mentioned by Commenges is promising and worth trying. It would also be worthwhile to explore the profile sampler and piggyback bootstrap procedures that were mentioned by Kosorok as well as Tian and Wei.

Bickel and Henderson bring up the issue of estimating the transformation parameter. This issue is briefly discussed in Section 7 and is carefully studied in Zeng and Lin (2007), which shows that the transformation parameter can be estimated reliably from the data and the variability of the estimator can be properly accounted for.

One potential use of the transformation models is to test the proportional hazards model since the latter is embedded in the former. Specifically, we can check the proportional hazards assumption by testing ρ=0 under the Box–Cox transformation or by testing γ=0 under model (3) or (4) with G(x)=x. This can be done by the Wald, score or likelihood ratio statistics. The score statistics that were proposed by Philipson and Ho make the unnecessary assumption that β and Λ0 are known. As Farrington, Hocine and Moreau point out, Quantin et al. (1996) proposed a score statistic to test γ=0 in model (3) with G(x)=x. Their statistic does not seem to account properly for the variability due to the estimation of the cumulative base-line hazard function.

Fig. 2 seems to have confused Sasieni as well as Kalbfleisch and Xu. This figure is actually a concatenation of four separate plots for the four classes of bivariate transformation models. For each class of models, the likelihood appears to be convex.

We are pleased to see the interesting extension of our work to doubly censored data that was presented by Chen and Ying and the application of the transformation models to quantile regression that was described by Yu and Yang. It should be possible to apply our theory to the problem that was described by Cuzick. By imposing appropriate constraints on the jump sizes of Λ(·) to reflect periodicity, our results should also be applicable to the problem that was outlined by Ramesh and Davison. Fine as well as Farrington, Hocine and Moreau suggest the use of different timescales, such as the gap times between successive events. By formulating the dependence of the gap times through random effects, our framework can cover such models. The non-parametric transformation model that was mentioned by Bickel and the non-parametric random-effects distribution that was mentioned by Horowitz are very challenging problems, and it is unclear whether the NPMLE is feasible in either case.

References in the discussion

Appendices

Appendix A: Numerical methods

A.1. EM algorithms

We describe an EM algorithm for maximizing the likelihood function that is given in expression (8). Similar algorithms can be used for the other likelihood functions. For simplicity of description, we focus on multiple-events data. The data consist of (Yik, Δik, Zik) (i=1,…,n; k=1,…,K), where Yik is the observation time for the kth event on the ith subject, Δik indicates, by the values 1 versus 0, whether Yik is an uncensored or censored observation and Zik is the corresponding covariate vector. We wish to maximize the objective function

image

For all commonly used transformations, including the classes of Box–Cox transformations and logarithmic transformations,  exp {−Gk(x)} is the Laplace transform of some function φk(x) such that

exp {−Gk(x)} = ∫0∞ φk(ξ) exp (−ξx) dξ.

Clearly, inline image. We introduce a new frailty ξik with density function φk. Since

image

the objective function can be written as

image

This expression is the likelihood function under the proportional hazards frailty model with conditional hazard function inline image. Thus, treating the bi and ξik as missing data, we propose the following EM algorithm to calculate the NPMLEs.
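Before turning to the E- and M-steps, the Laplace transform construction above can be checked numerically for a specific transformation. As an illustration only, take the logarithmic transformation G(x)=log(1+rx)/r, for which φ is the gamma density with shape 1/r and scale r (a standard Laplace transform identity, not a formula quoted from the appendix):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

# For G(x) = log(1 + r*x)/r,  exp{-G(x)} = (1 + r*x)^(-1/r), which equals the
# Laplace transform of the gamma density with shape 1/r and scale r.
r, x = 0.5, 2.0
lhs = np.exp(-np.log1p(r * x) / r)
rhs, _ = quad(lambda xi: np.exp(-xi * x) * gamma.pdf(xi, a=1.0 / r, scale=r),
              0.0, np.inf)
print(lhs, rhs)   # both are 0.25 here, agreeing to numerical precision
```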

In the M-step, we solve the conditional expectations of the complete-data score equations given the observed data. Specifically, we solve the following equation for β:

image

where inline image is the conditional expectation given the observed data and the current parameter estimates. In addition, we estimate Λk as a step function with the following jump size at Yik:

image

and we estimate γ by the solution to the equation

image

The conditional distribution of ξik given bi and the observed data is proportional to

image

Thus, the conditional expectation of ξik given bi and the observed data is equal to

image
image

It follows that

image

which is an integration over bj only. Conditional on the data, the density of bi is proportional to

image

so the conditional expectation of any function of b can be calculated via high-order numerical approximations, such as high-order Gaussian quadrature, the Laplace approximation or Monte Carlo approximation.

On convergence of the algorithm, the Louis (1982) formula is used to calculate the observed information matrix for the parametric and non-parametric components, the latter consisting of the estimated jump sizes in the Λks.
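To make the E- and M-steps concrete, the sketch below implements the same scheme for the familiar special case of a single gamma frailty per cluster under the proportional hazards model (so there are no normal random effects bi, and the conditional expectations reduce to the usual gamma posterior means). It is a simplified illustration under these assumptions, not the authors' general algorithm.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.special import digamma, gammaln

def em_gamma_frailty_cox(time, event, Z, cluster, alpha0=1.0, n_iter=50):
    """EM for the gamma-frailty proportional hazards NPMLE (illustrative sketch).
    Clusters share a Gamma(1/alpha, 1/alpha) frailty (mean 1, variance alpha)."""
    time = np.asarray(time, float)
    event = np.asarray(event, float)
    Z = np.atleast_2d(np.asarray(Z, float))
    if Z.shape[0] != len(time):
        Z = Z.T
    _, cl = np.unique(cluster, return_inverse=True)
    n_cl = cl.max() + 1
    beta, alpha, w = np.zeros(Z.shape[1]), alpha0, np.ones(n_cl)
    t_fail = np.unique(time[event == 1])

    def neg_profile(beta, w):
        # Breslow-profiled log-likelihood in beta, with offsets log E(u_i)
        eta = Z @ beta + np.log(w[cl])
        ll = 0.0
        for t in t_fail:
            ll += eta[(time == t) & (event == 1)].sum()
            ll -= event[time == t].sum() * np.log(np.exp(eta[time >= t]).sum())
        return -ll

    for _ in range(n_iter):
        # M-step for beta and the jump sizes of Lambda0 (Breslow form)
        beta = minimize(neg_profile, beta, args=(w,), method="BFGS").x
        eta = Z @ beta + np.log(w[cl])
        jumps = np.array([event[time == t].sum() / np.exp(eta[time >= t]).sum()
                          for t in t_fail])
        Lam0 = np.array([jumps[t_fail <= y].sum() for y in time])
        # E-step: u_i | data ~ Gamma(1/alpha + D_i, 1/alpha + A_i)
        D = np.bincount(cl, weights=event, minlength=n_cl)            # events per cluster
        A = np.bincount(cl, weights=Lam0 * np.exp(Z @ beta), minlength=n_cl)
        w = (1.0 / alpha + D) / (1.0 / alpha + A)                     # E(u_i | data)
        Elog = digamma(1.0 / alpha + D) - np.log(1.0 / alpha + A)     # E(log u_i | data)
        # M-step for alpha: maximise the expected log gamma density of the frailties
        def neg_q(a):
            nu = 1.0 / a
            return -np.sum(nu * np.log(nu) - gammaln(nu) + (nu - 1.0) * Elog - nu * w)
        alpha = minimize_scalar(neg_q, bounds=(1e-4, 10.0), method="bounded").x
    return beta, alpha, t_fail, jumps
```

Run on clustered data (for instance 100 clusters of size 2), the routine returns β̂, α̂ and the estimated jump sizes of the cumulative base-line hazard.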

A.2. Recursive formulae

We first consider transformation models without random effects for survival data. Suppose that Ψ(𝒪i;θ,Λ) depends on Λ only through Λ(Yi), where Yi is the observation time for the ith subject. This condition holds if, for example, the covariates are time invariant. We wish to determine the profile likelihood function for θ, i.e. to find the value of Λ that maximizes the objective function for fixed θ. Let t1<…<tm be the ordered distinct time points where failures are observed, and let d1,…,dm be the jump sizes of Λ at these time points. The likelihood equation that dk should satisfy is given by

image

where ∇xg(x,y)=∂g(x,y)/∂x. It follows that

image

This gives a forward recursive formula for calculating the dk starting from d1. We can also obtain a backward recursive formula by reparameterizing Λ(x) as αF(x), with α=Λ(τ) and F(x) a distribution function on [0,τ]. Abusing notation, we write Ψ(𝒪i;θ,F), in which θ now contains α. Since the jump sizes of F add up to 1, the likelihood score equation for the jump size of F at tk+1, which is still denoted as dk+1, satisfies

image

This is a backward recursive formula for calculating the dk from dm. There is one additional constraint: Σk dk=1. It is straightforward to extend the recursive formulae to recurrent events, the only difference being that the summation over individuals is replaced by the double summation over individuals and over events within individuals. For transformation models with random effects, the recursive formulae can be used in the M-step of the EM algorithm.
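In the special case of the proportional hazards model without random effects (G(x)=x and time invariant covariates), the profiled jump sizes have the familiar closed form of the Breslow estimator, dk = (number of failures at tk)/Σ{i at risk at tk} exp (βTZi). A minimal sketch under those assumptions:

```python
import numpy as np

def breslow_jumps(time, event, Z, beta):
    """Jump sizes of the profiled Lambda at the distinct failure times for the
    proportional hazards model (G(x) = x), i.e. the Breslow estimator."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    score = np.exp(np.atleast_2d(np.asarray(Z, float)).reshape(len(time), -1)
                   @ np.atleast_1d(beta))
    t_fail = np.unique(time[event == 1])
    d = np.array([event[time == t].sum() / score[time >= t].sum() for t in t_fail])
    return t_fail, d
```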

Appendix B: Technical details

In this appendix, we establish the asymptotic properties of the NPMLEs. A more thorough treatment is given in Zeng and Lin (2007). We first present a general asymptotic theory. We impose the following conditions.

  • (a) The parameter value θ0 lies in the interior of a compact set Θ, and Λ0k is continuously differentiable in [0,τ] with inline image, k=1,…,K (condition (C1)).
  • (b) With probability 1, P[infs ∈ [0,t]{Rik·(s)}≥1|Zikl,l=1,…,nik]>δ0>0 for all t ∈ [0,τ], where inline image (condition (C2)).
  • (c) There is a constant c1>0 and a random variable r1(𝒪i)>0 such that E[ log {r1(𝒪i)}]<∞ and, for any θ ∈ Θ and any finite Λ1,…,ΛK,
    image
    almost surely, where inline image. In addition, for any constant c2,
    image
    where ‖h‖V[0,τ] is the total variation of h(·) in [0,τ], and r2(𝒪i) is a random variable with E{r2(𝒪i)^6}<∞ and E[ log {r2(𝒪i)}]<∞ (condition (C3)).
  • (d) For any (θ(1),θ(2)) ∈ Θ, and inline image with uniformly bounded total variations, there is a function ℱ(𝒪i) in L2(P) such that
    image
    where inline image is the derivative of Ψ(𝒪i;θ,𝒜) with respect to θ, and inline image is the derivative of Ψ(𝒪i;θ,𝒜) along the path (Λk+ɛHk) (condition (C4)).
  • (e) If
    image
    almost surely, then θ*=θ0 and inline image for t ∈ [0,τ], k=1,…,K (condition (C5); first identifiability condition).
  • (f) There are functions ζ0k(s;θ0,𝒜0) ∈ BV[0,τ], k=1,…,K, and a matrix ζ0θ(θ0,𝒜0) such that
    image
    where BV[0,τ] denotes the space of functions with bounded total variations in [0,τ]. In addition, for k=1,…,K,
    image
    where η0k(s;θ,𝒜) is a bounded function such that
    image
    η0km is a bounded bivariate function and η0kθ is a d-dimensional bounded function. Furthermore, there is a constant c3 such that
    image
    for any s ∈ [0,τ] and any t1,t2 ∈ [0,τ] (condition (C6)).
  • (g) If, with probability 1,
    image
    for some constant vector v ∈ Rd and hk ∈ BV[0,τ], k=1,…,K, then v=0 and hk=0 for k=1,…,K (condition (C7); second identifiability condition).
  • (h) There is a neighbourhood of (θ0,𝒜0) such that, for (θ,𝒜) in this neighbourhood, the first and second derivatives of Ψ(𝒪i;θ,𝒜) with respect to θ and along the path Λk+ɛHk with respect to ɛ satisfy the inequality in condition (C4) (condition (C8)).

Theorems 1 and 2 below state the consistency, weak convergence and asymptotic efficiency of the NPMLEs, whereas theorems 3 and 4 justify the use of the observed information matrix and the profile likelihood method in variance–covariance estimation.

  • Theorem 1. Under conditions (C1)–(C5),

    image

    converges to 0 almost surely.

  • Theorem 2. Under conditions (C1)–(C7), inline image converges weakly to a zero-mean Gaussian process in Rd×l∞(𝒬K), where 𝒬={h(t):‖h‖V[0,τ] ≤ 1}. Furthermore, the limiting covariance matrix of inline image attains the semiparametric efficiency bound.

  • Theorem 3. Under conditions (C1)–(C8), inline image converges in probability to the asymptotic variance of

    image

    where hk is the vector consisting of the values of hk(·) at the observed failure times and ℐn is the negative Hessian matrix of the log-likelihood function with respect to inline image and the jump sizes of inline image.

  • Theorem 4. Let pln(θ) be the profile log-likelihood function for θ, and assume that conditions (C1)–(C8) hold. For any ɛn=Op(n−1/2) and any vector v, inline image converges in probability to vTΣ−1v, where Σ is the asymptotic covariance matrix of inline image.
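
To illustrate theorem 4, the following sketch shows how the profile log-likelihood can be used for variance estimation in practice. The displayed statistic in theorem 4 is not reproduced above; the sketch assumes that it is the usual negative second-order difference quotient −{pln(θ̂+ɛnv)−2 pln(θ̂)+pln(θ̂−ɛnv)}/(nɛn²) with ɛn of order n−1/2. The function curvature_variance and its arguments are ours; profile_loglik may be any routine that profiles out the Λk for fixed θ, e.g. a wrapper such as lambda b: profile_loglik(b, time, status, Z) around the sketch following Appendix A.

    import numpy as np

    def curvature_variance(profile_loglik, theta_hat, n, eps=None):
        # Estimate the asymptotic covariance matrix Sigma of sqrt(n)(theta_hat - theta_0)
        # from second-order difference quotients of the profile log-likelihood.
        d = len(theta_hat)
        if eps is None:
            eps = 1.0 / np.sqrt(n)                 # perturbation of order n^{-1/2}
        pl0 = profile_loglik(theta_hat)

        def quad_form(v):
            # Estimates v' Sigma^{-1} v, as in theorem 4.
            return -(profile_loglik(theta_hat + eps * v) - 2.0 * pl0
                     + profile_loglik(theta_hat - eps * v)) / (n * eps ** 2)

        basis = np.eye(d)
        Sigma_inv = np.empty((d, d))
        diag = np.array([quad_form(basis[i]) for i in range(d)])
        for i in range(d):
            Sigma_inv[i, i] = diag[i]
            for j in range(i + 1, d):
                # Recover off-diagonal entries by polarization:
                # (e_i+e_j)' S (e_i+e_j) = S_ii + 2 S_ij + S_jj.
                q = quad_form(basis[i] + basis[j])
                Sigma_inv[i, j] = Sigma_inv[j, i] = 0.5 * (q - diag[i] - diag[j])
        Sigma = np.linalg.inv(Sigma_inv)
        se = np.sqrt(np.diag(Sigma) / n)           # standard errors for theta_hat
        return Sigma, se

An alternative, corresponding to theorem 3, is to invert the full observed information matrix ℐn with respect to θ̂ and the estimated jump sizes; the curvature approach above avoids forming that typically very large matrix at the cost of repeated evaluations of pln.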

Theorems 1–4 are proved in Zeng and Lin (2007). To establish the desired asymptotic results for a specific problem, all we need to do is to determine a set of conditions under which regularity conditions (C1)–(C8) are satisfied. As an illustration, we consider the transformation models with random effects for dependent failure times that were described in Section 2.2. We assume the following conditions.

  • (a) The parameter value inline image belongs to the interior of a compact set Θ in Rd, and inline image for all t ∈ [0,τ], k=1,…,K (condition (D1)).
  • (b) With probability 1, Zikl(·) and inline image are left continuous in [0,τ] with uniformly bounded left derivatives (condition (D2)).
  • (c) With probability 1, P(Cikl ≥ τ|Zikl)>δ0>0 for some constant δ0 (condition (D3)).
  • (d) With probability 1, nik is bounded by some integer n0. In addition, E{Nik·(τ)}<∞ (condition (D4)).
  • (e) For k=1,…,K, Gk(x) is four times differentiable such that Gk(0)=0, inline image, and, for any integer m ≥ 0 and any sequence 0<x1<…<xm ≤ y,
    image
    for some constants μ0k and κ0k>0. In addition, there is a constant ρ0k such that
    image
    (condition (D5)).
  • (f) For any constant a1>0,
    image
    and there is a constant a2>0 such that, for any γ,
    image
    (condition (D6)).
  • (g) If there are c(t) and v such that c(t)+vTZikl(t)=0 with probability 1 for k=1,…,K and l=1,…,nik, then c(t)=0 and v=0. In addition, there is some t ∈ [0,τ] such that inline image spans the whole space of b (condition (D7)).
  • (h) f(b;γ)=f(b;γ0) if and only if γ=γ0; if vT ∂f(b;γ0)/∂γ=0 for almost every b, then v=0 (condition (D8)).

We wish to show that conditions (D1)–(D8) imply conditions (C1)–(C8). Conditions (C1) and (C2) follow naturally from conditions (D1)–(D4). Tedious algebraic manipulations show that conditions (C5) and (C7) hold under conditions (D7) and (D8). Note that

image

where

image

and

image

If |b| and ‖Λk‖V[0,τ] are bounded, then inline image. Thus, Ψ(𝒪i;θ,𝒜) is bounded from below by inline image, so the second half of condition (C3) holds. It follows from condition (D5) that

image

Since inline image, we have

image

so

image

Thus, the first half of condition (C3) holds as well.

Under condition (D5),

image

By the mean value theorem,

image

It then follows from condition (D6) that |Ψ(𝒪i;θ(1),𝒜(1))−Ψ(𝒪i;θ(2),𝒜(2))| is bounded by the right-hand side of the inequality in condition (C4). The same arguments yield the bounds for the other two terms in condition (C4). The verification of condition (C8) is similar to that of condition (C4), relying on the explicit expressions of inline image and the first and second derivatives of Ψ(𝒪i;θ,𝒜0+ɛℋ) with respect to ɛ.

To verify condition (C6), we calculate that

image

For (θ,𝒜) in a neighbourhood of (θ0,𝒜0),

image

Thus, for the second equation in condition (C6), η0km(s,t;θ0,𝒜0) is obtained from the derivative of η0k with respect to Λm along the direction Λm−Λ0m, and η0kθ is the derivative of η0k with respect to θ. Likewise, we can obtain the first equation in condition (C6). It is straightforward to verify the Lipschitz continuity of η0km.

The asymptotic properties for the other models in this paper are verified in Zeng and Lin (2007).
