Efficient t_0-year risk regression using the logistic model

In some clinical studies, patient survival beyond a specific point in time, t_0 say, may be of special interest as it may, for instance, indicate patient cure. Analyzing the t_0-year risk for such patients may be accomplished using logistic regression with appropriate inverse probability of censoring weights (IPWCC), which may further be augmented (AIPWCC) to improve efficiency. In this paper, we derive the most efficient estimator for this problem, which differs from the AIPWCC based on the full data efficient influence function. We first give the result for a survival endpoint and then generalize it to the competing risk setting. The proposed estimator's superior behavior is illustrated using simulations and by applying it to real data concerning the survival of blood and marrow transplanted patients.

For cancer patients, for instance, one might want to compare disease-free survival for different treatments at 3 years if 3-year disease-free survival is thought to mean cure of the cancer. This calls for a more targeted analysis with specific emphasis on the chosen time point, t_0. If the interest lies in evaluating the average treatment effect, comparing two treatment regimes, then efficient estimation using baseline variables has already been developed, see Díaz et al. (2019) and Ozenne et al. (2020). In this paper, however, we are interested in evaluating effects within strata of baseline variables and will therefore consider a conditional setting rather than focusing on a marginal estimand. Direct modeling of survival until a specific point in time given covariates has been studied earlier, see Uno et al. (2007) and Zheng et al. (2006), and has also been extended to cover competing risks outcomes, see Scheike and Zhang (2007) and Scheike et al. (2008). An alternative approach builds on so-called pseudo-observations, see Andersen et al. (2003), Klein et al. (2007), Andersen and Perme (2010), and Overgaard et al. (2019). Such analyses rest on fewer assumptions than typical semi-parametric survival (or competing risk) models, which are usually based on restrictions that need to hold over the entire time span, thus increasing the risk of model misspecification. In this paper, we consider the direct modeling approach that works by adapting standard methods such as logistic regression to deal with right-censoring through inverse probability of censoring weighting (IPWCC). The target is also referred to as the t_0-year risk or survival (Blanche et al., 2023). The parameter of the logistic model leads to a direct and simple interpretation of the t_0-year risk being modeled, unlike hazard-based models such as the Cox regression model. Treatment effects calculated from t_0-year risks may also be more helpful in clinical decision-making and are arguably easier to communicate (Stensrud & Hernán, 2020). The standard logistic regression modeling approach may be improved in terms of efficiency and double robustness by augmentation (AIPWCC), see Robins and Rotnitzky (1992) and Tsiatis (2006). However, this does not result in the overall efficient estimator. Any full data influence function gives rise to a class of observed data influence functions by adding elements from the so-called augmentation space. By a suitable projection, we can find the most efficient influence function within this class. However, it is not clear which full data influence function results in the overall most efficient observed data influence function. A key result of this paper is the derivation of this optimal full data influence function, which turns out to be different from the efficient full data influence function that one might mistakenly believe should also lead to efficient estimation based on the observed data. This result enables us to develop the semiparametrically efficient estimator for the logistic t_0-year risk model. We demonstrate its superior behavior using simulations as well as in a real application. The rest of the paper is organized as follows. In Section 2.1, we present the main result for the survival data setting. Section 2.2 explains how the obtained efficient influence function may be used for estimation in practice. The results are generalized to the competing risk setting in Section 3. Section 4 contains simulation studies demonstrating the superior behavior of the proposed estimator, and in Section 5 the proposed estimator is applied to data concerning the survival of blood and marrow transplanted patients. Finally, Section 6 contains some concluding remarks, while detailed calculations are relegated to the Appendix.

Developing the efficient influence function
We first present the main result in the setting of right-censored survival data, and later generalize to competing risk data. Had there been no censoring, one could directly apply logistic regression using Y = I(T ≤ t_0) as the response variable, where T denotes the (uncensored) survival time. This is our starting point, and we thus focus our attention on the p-dimensional parameter β that is assumed to exist so that

F(t_0; X) = P(T ≤ t_0 | X) = expit(X^T β),   (1)

where X denotes a p-dimensional covariate vector and expit(a) = e^a/(1 + e^a). We need, however, to be able to take censoring into account, so the above-mentioned logistic regression analysis is not an option in reality.
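To fix ideas, the uncensored starting point can be sketched in a few lines of Python: with no censoring, β in (1) is obtained from an ordinary logistic regression of Y = I(T ≤ t_0) on X. The sketch below is our own minimal illustration, not the paper's code; the Newton-Raphson fitter and the simulated data are stand-ins under the stated model.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of P(Y=1|X) = expit(X beta); X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                      # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 0.5])
# With no censoring, Y = I(T <= t0) is fully observed, so we can draw it
# directly as Bernoulli with success probability expit(X beta).
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
Y = rng.binomial(1, p)
beta_hat = fit_logistic(X, Y.astype(float))
```

With censoring, of course, some Y_i are unobserved, which is exactly the complication the rest of the section addresses.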
We use Z = (T_{t_0} = T ∧ t_0, X) to denote the full data, that is, without right-censoring (the event T = t_0 happens with probability zero, but to avoid confusion we conceive of T ∧ t_0 as T if T ≤ t_0). In the full data case, the efficient influence function (eif) is

φ_e^F(Z) = X{Y − F(t_0; X)},

and basing estimation on the eif is nothing but the aforementioned standard logistic regression analysis. The problem arises because T_{t_0} can be right-censored by C, so that in reality we only observe

D = (T̃_{t_0} = T_{t_0} ∧ C, Δ = I(T_{t_0} ≤ C), X).

In other words, only censoring before time point t_0 matters, as censoring after t_0 does not prevent observing Y. The conditional distributions of T and C given X are defined by the respective conditional hazard functions λ_T(·|X) and λ_C(·|X). The goal is to estimate the parameter β as efficiently as possible based on n iid replicates D_1, …, D_n. Our main result builds on theory described, for instance, in Tsiatis (2006), and we therefore adopt the notation used in that reference; specifically, we refer to results in sections 10.3, 10.4, and 11.1 of Tsiatis (2006). If 𝒞 denotes the coarsening variable induced by the right-censoring, with complete data corresponding to {𝒞 = ∞}, then P(𝒞 = ∞ | Z) = K_C(T_{t_0}|X), where K_C(r|X) = P(C > r | X). We further assume that T and C are conditionally independent given X (coarsening at random). Any full data influence function has the form (see for instance Tsiatis, 2006, section 4.6)

φ^F(Z) = h(t_0; X){Y − F(t_0; X)}   (2)

for some function h(t_0; X), and such a function defines an augmented inverse probability weighted complete case (AIPWCC) class of observed data influence functions (Tsiatis, 2006, section 10.6):

Δ/K_C(T_{t_0}|X) φ^F(Z) + ∫_0^{t_0} J(r; X) dM_C(r|X),   (3)

obtained by adding elements from the so-called augmentation space, corresponding to the second term in (3), where J(r; X) is any function and dM_C(r|X) denotes the increment of the censoring martingale given X,

M_C(r|X) = I(T̃_{t_0} ≤ r, Δ = 0) − ∫_0^r I(T̃_{t_0} ≥ u) λ_C(u|X) du.

Any observed data influence function can be written as (3) for some φ^F(Z) and J(r; X). The optimal observed data influence function in the class defined by φ^F(Z) is obtained by taking J(r; X) = E{φ^F(Z) | T > r, X}/K_C(r|X), where

E{φ^F(Z) | T > r, X} = h(t_0; X){F(t_0; X) − F(r; X)}/S(r; X),   (4)

with S(r; X) = 1 − F(r; X), see section 10.6 in Tsiatis (2006). The problem is now to find the optimal φ^F(Z), which we denote by B_eff^F(Z) = h_eff(t_0; X){Y − F(t_0; X)}; this is equivalent to finding the optimal h(t_0; X) in (2). Unfortunately, the full data efficient influence function φ_e^F(Z) is not the wanted B_eff^F(Z), but the two functions are linked through an integral equation, (5), see Tsiatis (2006) and Appendix A for the proper definitions of the mappings and spaces involved. Surprisingly, (5) has an explicit solution, giving that the optimal function to use in (2) is

h_eff(t_0; X) = X F(t_0; X) S(t_0; X)/L(t_0, X),   (6)

where L(t_0, X) is defined in terms of the cumulative hazard Λ_T(r|X) = ∫_0^r λ_T(u|X) du; see Appendix A for a proof. The desired optimal full data influence function is thus

B_eff^F(Z) = h_eff(t_0; X){Y − F(t_0; X)},   (7)

which is seen to be different from φ_e^F(Z). In conclusion, the observed data efficient influence function is

Δ/K_C(T_{t_0}|X) B_eff^F(Z) + ∫_0^{t_0} [E{B_eff^F(Z) | T > r, X}/K_C(r|X)] dM_C(r|X),   (8)

where B_eff^F(Z) is given by (7) and E{B_eff^F(Z) | T > r, X} is given by the right-hand side of (4) with h(t_0; X) replaced by h_eff(t_0; X).

Remark.
(i) If there is no censoring, that is, λ_C(r|X) = 0 for r ≤ t_0, then K_C(r|X) = 1 for r ≤ t_0 and L(t_0, X) reduces to F(t_0; X)S(t_0; X), so that h_eff(t_0; X) = X; B_eff^F(Z) therefore reduces to φ_e^F(Z), as it should. (ii) We show that the standard logistic AIPWCC, i.e., using φ_e^F(Z), is efficient when the censoring mechanism does not depend on X and the logistic model (1) is correctly specified for all time points, that is, logit{F(r; X)} = log g(r) + X^T β for all r ≤ t_0, and therefore λ_T(r; X)/S(r; X) = g′(r) exp(X^T β), where g is a positive unspecified function and g′ denotes its derivative. Rewriting H(t_0; X) under this model shows that it no longer depends on X, and in this case the standard logistic AIPWCC is efficient. If the logistic model does not hold for all r ≤ t_0, then dependence on X will reappear.
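The identity used in remark (ii) can be verified by a short computation. Writing F = F(r; X), S = 1 − F, and f = dF/dr (so that λ_T = f/S), the proportional-odds form logit F(r; X) = log g(r) + X^T β gives

```latex
\frac{F(r;X)}{S(r;X)} = g(r)\,e^{X^\top\beta}
\quad\Longrightarrow\quad
\frac{d}{dr}\,\frac{F}{S}
= \frac{fS + Ff}{S^2}
= \frac{f(S+F)}{S^2}
= \frac{f}{S^2}
= \frac{\lambda_T(r;X)}{S(r;X)}
= g'(r)\,e^{X^\top\beta},
```

so the covariate dependence of λ_T(r; X)/S(r; X) factors out as exp(X^T β), exactly as claimed.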

Using the efficient influence function to construct the estimator
We now explain how the efficient influence function may be used to construct the proposed estimator. We first describe how to perform AIPWCC estimation using the efficient full data influence function φ_e^F(Z) as the starting point. Letting G(r; X) = {S(r; X), K_C(r|X)} and h(t_0, X_i) = X_i, we could use the AIPWCC estimating function

Ũ_n{β; G(·; X)} = Σ_{i=1}^n [ Δ_i/K_C(T̃_{t_0,i}|X_i) X_i{Y_i − F(t_0; X_i)} + ∫_0^{t_0} X_i {F(t_0; X_i) − F(r; X_i)}/{S(r; X_i) K_C(r|X_i)} dM_C(r|X_i) ]

and estimate β by defining β̃ as the solution to Ũ_n{β̃; G_n(·; X)} = 0, where G_n indicates that working models are used to estimate G. The estimator β̃ is what we refer to as the standard AIPWCC estimator. It is consistent if either the working model S_n(r; X) or the working model K_{C,n}(r|X) is correctly specified, but not necessarily both; this follows directly from theorem 10.5 in Tsiatis (2006). The standard IPWCC estimator, which only uses the first term in the above estimating function (i.e., without the debiasing term), is only consistent if the working model K_{C,n}(r|X) is correctly specified. Define now U_n{β; G(·; X)} similarly to Ũ_n{β; G(·; X)}, but with h(t_0, X_i) replaced by the efficient choice h_eff(t_0; X_i). Our proposed estimator β̂ is then defined as the solution to U_n{β̂; G_n(·; X)} = 0. The above-mentioned robustness property of β̃ also holds for β̂, again by theorem 10.5 in Tsiatis (2006).
The resulting estimator β̂ depends on two working models, K_C(r|X) and S(r; X), and is consistent if at least one of them is correctly specified (double robustness). If both are correctly specified, or if at least the censoring model is correctly specified, then the variance can be estimated consistently by the empirical mean of the (estimated) squared influence function, see appendix H of Scheike et al. (2022) for a similar result. If both working models are correctly specified, the resulting estimator is efficient. If the working censoring model is suspected to be misspecified, one may alternatively use the nonparametric bootstrap to estimate the variance of the estimator.
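As a concrete illustration of the complete-case weighting underlying all of these estimators, the following Python sketch solves the plain IPWCC estimating equation (the first term of Ũ_n only, without the augmentation term) on data simulated from the logistic-link model used later in Section 4. This is our own simplified stand-in: for clarity the true censoring survival function is plugged in where a Kaplan-Meier working model would be used in practice, and all function names are hypothetical.

```python
import numpy as np

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

def ipwcc_logistic(Xmat, Y, weights, n_iter=30):
    """Solve the weighted logistic score sum_i w_i X_i {Y_i - expit(X_i beta)} = 0."""
    beta = np.zeros(Xmat.shape[1])
    for _ in range(n_iter):
        p = expit(Xmat @ beta)
        grad = Xmat.T @ (weights * (Y - p))
        hess = Xmat.T @ (Xmat * (weights * p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n, t0, eta, beta1 = 40_000, 3.0, 2.0, 0.5
x = rng.choice([-1.0, 1.0], size=n)
# event times from logit F(t|x) = log(Lambda0(t)) + beta1*x, Lambda0(t) = eta*(1 - e^{-t/7});
# inverting F(T|x) = U gives Lambda0(T), with T infinite ("cured") when U exceeds F(inf|x)
U = rng.uniform(size=n)
lam0 = U / ((1.0 - U) * np.exp(beta1 * x))
T = np.full(n, np.inf)
ok = lam0 < eta
T[ok] = -7.0 * np.log1p(-lam0[ok] / eta)
cens_rate = 0.25
C = rng.exponential(1.0 / cens_rate, size=n)   # independent censoring (assumed)
Tt0 = np.minimum(T, t0)
delta = (Tt0 <= C).astype(float)               # complete case: Y = I(T <= t0) observed
Y = (T <= t0).astype(float)                    # only used where delta = 1 (weight 0 otherwise)
KC = np.exp(-cens_rate * Tt0)                  # TRUE censoring survival plugged in
w = delta / KC
Xmat = np.column_stack([np.ones(n), x])
beta_hat = ipwcc_logistic(Xmat, Y, w)
```

The intercept estimates log Λ_0(t_0) and the slope estimates β_1; augmenting this score with the martingale term, or replacing X_i by h_eff(t_0; X_i), changes the weighting but not the basic structure of the solver.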

COMPETING RISKS REGRESSION
In this section, we extend the results of the previous section to competing risk data; proofs are given in Appendix B. We consider the setting with two competing causes, with T now denoting the time until an event happens and ε ∈ {1, 2} keeping track of which of the two competing causes resulted in the event. We let F_j(t_0; X), j = 1, 2, denote the two conditional cumulative incidence functions. Now Y = I(T ≤ t_0, ε = 1) and the model is

F_1(t_0; X) = P(T ≤ t_0, ε = 1 | X) = expit(X^T β_1).

The full data efficient influence function is φ_e^F(Z) = X{Y − F_1(t_0; X)}. As in the survival data setting, T_{t_0} can be right-censored by C, so that we only observe the corresponding coarsened data. Still, any full data influence function has the form h(t_0; X){Y − F_1(t_0; X)}, and such a function defines an AIPWCC class of observed data influence functions by adding elements from the augmentation space. Solving the defining Equation (5) as in the survival case, we obtain the optimal choice h_eff(t_0; X), whose explicit expression is derived in Appendix B; the efficient influence function then has the form (8) with this choice. The standard logistic AIPWCC is not efficient even if the censoring mechanism does not depend on X since, for example, F_2(t; X) can have any structure. In this respect, the result for the competing risk regression setting differs from the standard survival setting. The derived efficient influence function may now be used for estimation in the same way as in the survival data case.

Survival regression
We simulated survival data with F(t; X) = Λ_0(t) exp(X^T β)/{1 + Λ_0(t) exp(X^T β)}, where Λ_0(t) = η{1 − exp(−t/7)} and β = (0.5, −0.5). We considered η = 2; other settings, not reported here, gave similar conclusions. We considered two covariates, where X_1 was binary with P(X_1 = −1) = P(X_1 = 1) = 0.5, and X_2 was either binary (P(X_2 = 0) = P(X_2 = 1) = 0.5), standard normal, exponential (centered at its mean), or uniform. We generated censoring dependent on the first covariate; specifically, we took λ_c(t; X) = r_c exp(0.5 X_1) with r_c = 0.5. This leads to an average censoring percentage (over the covariates) of around 75% at the time point t_0 = 3. We note that logit{F(t; X)} = log{Λ_0(t)} + X^T β, and we consider estimating the parameters of our logistic-link survival function at the time point t_0 = 3, where approximately one-third had died. We considered two sample sizes, n = 400 and n = 800. We computed the standard AIPWCC estimator and compared it to the proposed estimator based on the efficient influence function. For the needed working models we used either a Cox model (approximate model) or a logistic survival model (correct model), either stratified on X_1 with X_2 as a covariate, or with main effects of X_1 and X_2. Each of these four working models led to almost identical results, and we therefore only report the efficiency gain when using the incorrect regression model. For all simulations we used a correctly specified working model for the censoring distribution, namely the Kaplan-Meier estimator stratified on the first covariate. We computed the augmentation term and the efficient h only once, based on the working models, and then solved the estimating equation. In Table 1, we report the variance of the standard augmented IPWCC relative to the efficient estimator, and note that the estimator based on the efficient influence function is considerably more efficient than the standard augmented IPWCC estimator. Estimated SEs were calculated by squaring the corresponding estimated influence functions and scaling with the derivative. In Table 2, we report the coverage of the 95% confidence intervals for the three estimators, standard IPWCC (IPWCC), augmented IPWCC (AIPWCC), and efficient augmented IPWCC (Eff-AIPWCC), for the two different sample sizes. Coverage is only reported for the situation where X_2 was standard normal, but similar results were obtained in the other settings. Table 2 shows that the coverage is close to the nominal 95% in all cases.
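The data-generating mechanism above can be sketched as follows: inverting F(T; X) = U gives a finite event time whenever U is below the "cure" threshold F(∞; X), since Λ_0(t) is bounded by η. This is our own minimal reimplementation for illustration, using the binary-X_2 scenario, t_0 = 3, and η = 2 from the text, and it checks that the simulated risks match logit{F(t_0; X)} = log Λ_0(t_0) + X^T β cell by cell.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n, t0, eta = 200_000, 3.0, 2.0
beta = np.array([0.5, -0.5])
x1 = rng.choice([-1.0, 1.0], size=n)
x2 = rng.choice([0.0, 1.0], size=n)        # binary X2 scenario
lin = beta[0] * x1 + beta[1] * x2
# invert F(T; X) = U with F(t; X) = Lambda0(t) e^{lin} / (1 + Lambda0(t) e^{lin})
U = rng.uniform(size=n)
lam0 = U / ((1.0 - U) * np.exp(lin))       # required value of Lambda0(T)
T = np.full(n, np.inf)                     # Lambda0 bounded by eta => some never fail
ok = lam0 < eta
T[ok] = -7.0 * np.log1p(-lam0[ok] / eta)   # solve eta*(1 - e^{-T/7}) = lam0

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

# empirical vs model-implied t0-year risk in each covariate cell
base = np.log(eta * (1.0 - np.exp(-t0 / 7.0)))   # intercept log Lambda0(t0)
max_err = 0.0
for a, b in product([-1.0, 1.0], [0.0, 1.0]):
    cell = (x1 == a) & (x2 == b)
    emp = np.mean(T[cell] <= t0)
    model = expit(base + beta[0] * a + beta[1] * b)
    max_err = max(max_err, abs(emp - model))
```

With 200,000 draws the empirical and model-implied risks agree to well within Monte Carlo error in every cell.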

Competing risks
We simulated competing risks data from a logistic-link model for the cause-1 cumulative incidence function with β_1 = (0.5, −0.5); the second cause was specified in a similar form. We considered models both with and without covariate effects. We considered η_1 = 2, 5 and η_2 = 1, 5 to obtain different cumulative incidence levels for the two causes. The two covariates were independent with P(X_1 = −1) = P(X_1 = 1) = 0.5, and the distribution of X_2 was either binomial, normal, exponential, or uniform; the binomial distribution was P(X_2 = 0) = P(X_2 = 1) = 0.5, and the continuous distributions were chosen to have mean 0 and variance 1. We generated censoring dependent on the first covariate, λ_c(t; X) = r_c exp(0.5 X_1), with r_c = 0.5 or r_c = 0.3. We note that logit{F_1(t; X)} = log{Λ_1(t)} + X^T β_1, and we consider estimating the parameters of our logistic-link cumulative incidence model at the time point t_0 = 3. We computed the standard IPWCC with its corresponding augmentation term as well as the efficient estimator. We estimated the working models, the augmentation term, and the censoring weights once, either using a correct working model for F_1, a Cox model for λ_T(·|X), and a correct working model for the censoring distribution, or using correct working models for all quantities. Similar results were obtained, so we only report the results based on the approximate working model. The sample size was n = 200, and we only report results for the setting where η_1 = 2 and η_2 = 5. In this setting the average censoring rate was 0.75 (r_c = 0.5) or 0.55 (r_c = 0.3), more or less the same across the different covariate situations. The estimates had a few outliers because of numerical convergence issues in the case where X_2 was exponentially distributed, and the SE therefore did not provide a useful summary in that case. We instead report the squared interquartile range, which is a more robust measure of the variation of the estimates. More specifically, Table 3 gives the squared interquartile range for the augmented IPWCC relative to the efficient augmented IPWCC (Eff-AIPWCC). We again see that the estimator based on the efficient influence function is considerably more efficient than the standard augmented IPWCC.
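The robustness of the squared interquartile range against a handful of non-converged replications can be illustrated with synthetic estimator draws; the numbers below are illustrative only, not the simulation output.

```python
import numpy as np

def sq_iqr(est):
    """Squared interquartile range: robust alternative to the variance
    when a few failed replications produce extreme outliers."""
    q75, q25 = np.percentile(est, [75, 25])
    return (q75 - q25) ** 2

rng = np.random.default_rng(3)
# hypothetical sampling distributions of two estimators over 10,000 replications
aipwcc = rng.normal(0.5, 0.10, size=10_000)
eff = rng.normal(0.5, 0.07, size=10_000)
aipwcc[:5] = 25.0                       # a few non-converged replications...
rel = sq_iqr(aipwcc) / sq_iqr(eff)      # ...barely move the squared IQR ratio
```

The standard deviation of the contaminated sample is blown up by the outliers, while the squared-IQR ratio stays near the true variance ratio (0.10/0.07)^2 ≈ 2.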

APPLICATION TO BLOOD AND MARROW TRANSPLANT DATA
We considered data on multiple myeloma patients treated with allogeneic stem cell transplantation from the Center for International Blood and Marrow Transplant Research (CIBMTR), see Kumar et al. (2011). The data used in this paper consist of patients transplanted from 1995 to 2005. Kumar et al. (2011) considered several risk covariates, among them the transplant time period (gp). For these data, we wish to evaluate the importance of the covariates on the 40-month risk (all-cause mortality) using the logistic model. There was considerable censoring in the different strata (defined by the covariates), and the amount of censoring varied substantially between strata: the censoring survival distribution varied from 20% to 90% across strata, and the average censoring rate (over the covariates) was 60%. We calculated the standard IPWCC estimator, the augmented IPWCC estimator, and the proposed estimator based on the efficient influence function. For the two latter estimators we used a logistic regression model as the working survival model, while the censoring distribution was estimated using Kaplan-Meier estimators stratified on all covariates. The results are reported in Table 4, from which it is seen that all point estimates are in close agreement, with the AIPWCC estimator having a smaller estimated SE than the IPWCC estimator, and with the proposed estimator having the smallest estimated SE, as expected. From the analysis we see, for instance, that, within strata defined by the other covariates, the odds of dying before 40 months after transplant for patients waiting more than 24 months for a transplant are estimated to be e^{0.9} = 2.46 (95% CI: 1.34, 4.52) times the odds for patients waiting less than 24 months.
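The reported interval can be reproduced from the point estimate on the log-odds scale. The SE of roughly 0.31 is not stated in the text; it is back-calculated from the reported interval and should be read as our assumption.

```python
import math

beta_hat = 0.9   # reported log odds ratio for waiting > 24 months
se = 0.31        # assumed SE, back-calculated from the reported 95% CI (1.34, 4.52)
or_hat = math.exp(beta_hat)            # odds ratio on the natural scale
lo = math.exp(beta_hat - 1.96 * se)    # lower 95% limit
hi = math.exp(beta_hat + 1.96 * se)    # upper 95% limit
```

Exponentiating the Wald interval endpoints recovers e^{0.9} ≈ 2.46 with limits ≈ (1.34, 4.52), matching the text.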

CONCLUDING REMARKS
The logistic regression model approach to analyzing t_0-year risk is appealing, as it is arguably simpler to communicate results from such an analysis than from a hazard-based model such as the Cox regression model. A further benefit is that it puts fewer restrictions on the data-generating process. The proposed estimator is consistent if model (1) is correctly specified and either the conditional distribution of T_{t_0} given X or the censoring model is correctly specified. Note that, in the setting where all covariates are categorical, model (1) is correctly specified if it is saturated, and both working models can then also be correctly specified using saturated models; the proposed estimator is thus guaranteed to be consistent and efficient in this case. If some of the included covariates are continuous, the logistic regression may be misspecified; assuming for a moment that at least one of the two needed working models is correctly specified, the proposed estimator then converges in probability to the least false parameter β* solving the population version of the estimating equation. As an alternative, one might focus attention on the parameter β_0 defined by a weighted least squares projection of g{F(t_0; X)} onto {X^T β}, with w(X) a user-specified weight and g the logit function; setting w(X) = 1 gives the unweighted projection. The two estimands, β* and β_0, may be estimated by developing their corresponding efficient influence functions, but we refrain from pursuing this further here.

APPENDIX A. THE EFFICIENT INFLUENCE FUNCTION IN THE SURVIVAL SETTING
Let B_eff^F(Z) = h(t_0; X){Y − F(t_0; X)} for some h(t_0; X), which is the function we want to calculate. Once we have it, the observed data efficient influence function is given by (8), where E{B_eff^F(Z) | T > r, X} is given by (4). Unfortunately, the full data efficient influence function φ_e^F(Z) is not the wanted B_eff^F(Z) but, according to chapter 11 in Tsiatis (2006), the two functions solve equation (5), where Γ^F denotes the full data nuisance tangent space and Π{· | Γ^{F⊥}} denotes the projection operator onto the orthogonal complement of Γ^F. The relevant operator is also defined in chapter 11 of Tsiatis (2006), and from there it follows that the defining relation must hold in h* for all h(t_0; X). Solving the resulting identity and using (A1) gives that the wanted function is h_eff(t_0; X) = X F(t_0; X) S(t_0; X)/L(t_0, X), and we now have the eif: it is given by (8) with B_eff^F(Z) = X F(t_0; X) S(t_0; X)/L(t_0, X) {I(T ≤ t_0) − F(t_0; X)}.

APPENDIX B. THE EFFICIENT INFLUENCE FUNCTION IN THE COMPETING RISK SETTING
To find the estimator we must solve the defining equation, and the calculations proceed as in the survival case. We carry out a direct calculation of the right-hand side of the defining equation. First, note that E[{Y − F_1(t_0; X)}^2 | X] = F_1(t_0; X){1 − F_1(t_0; X)}. As before, the inverse operator applied to B_eff^F(Z) is obtained from an integral whose integrand involves S(r; X), F_1(t_0; X), the difference F_1(t_0; X) − F_1(r; X), and the censoring factor H_C(r|X).
Table 1. Variance of the standard augmented IPWCC relative to the efficient estimator (Eff-AIPWCC). Note: Based on 10,000 realizations.
Table 3. Squared interquartile range for the standard augmented IPWCC relative to the efficient augmented IPWCC (Eff-AIPWCC) for sample size n = 200. Note: Based on 10,000 realizations.
Table 4. Blood and marrow transplant data. Covariate effects on 40-month risk (all-cause mortality) using the IPWCC estimator, the AIPWCC estimator, and the proposed estimator based on the efficient influence function (Eff-AIPWCC).