Dynamic inference for non‐Markov transition probabilities under random right censoring

The main contribution of this article is the verification of weak convergence of a general non‐Markov (NM) state transition probability estimator by Titman, which has not yet been done for any other general NM estimator. A similar theorem is shown for the bootstrap, yielding resampling‐based inference methods for statistical functionals. Formulas of the involved covariance functions are presented in detail. Particular applications include the conditional expected length of stay in a specific state, given occupation of another state in the past, and the construction of time‐simultaneous confidence bands for the transition probabilities. The expected lengths of stay in a two‐sample liver cirrhosis dataset are compared and confidence intervals for their difference are constructed. With borderline significance and in comparison to the placebo group, the treatment group has an elevated expected length of stay in the healthy state given an earlier disease state occupation. In contrast, the Aalen‐Johansen (AJ) estimator‐based confidence interval, which relies on a Markov assumption, leads to a drastically different conclusion. Also, graphical illustrations of confidence bands for the transition probabilities demonstrate the biasedness of the AJ estimator in this data example. The reliability of these results is assessed in a simulation study.

landmarking. The estimators by Titman (2015) and Putter and Spitoni (2018) are applicable in general non-Markov (NM) models. Both use different transition probability decompositions and utilizations of the AJ estimator and they do not rely on strong assumptions on the censoring times, just the common random right censoring assumption. However, weak convergence properties of these estimators as elements of càdlàg function spaces have not been analyzed yet.
In the present article, we focus on the NM state transition probability estimator by Titman (2015) and its weak convergence for the following reasons: the estimator has a simple, but intuitive structure and it allows estimation of general transition probabilities between sets of states rather than single states. Finally, the bootstrap is shown to correctly reproduce the weak limit distribution. Throughout the article, let (Ω, , P) be the underlying probability space. To introduce the Titman (2015) estimator, we consider an independently right-censored multistate model with R + 1 ∈ N different states. That is, the data consist of independent copies X 1 , … , X n of a multistate process, which are indexed by time t ≥ s and which occupy the states 0, 1, … , R. The right censoring is modeled by independent random variables C 1 , … , C n after which further observation of the corresponding multistate processes is no longer possible. Let  and  be subsets of {0, 1, … , R} for which the transition probabilities P  (s, t) = P(X(t) ∈  |X(s) ∈ ), shall be estimated. Here and below, irrelevant subscripts are removed for notational convenience. Introduce the absorbing subset of states   ⊂  that implies sure occupancy of  at all later points in time, and the absorbing subset of states   ⊂ {0, 1, … , R}∖ =  which prevents occupancy of  at all later points of time. Individual competing risks processes, Z i , i = 1, … , n, are used in an intermediate estimation step. Here, Z i (t) = 0, 1, 2, respectively, imply occupation of   ∪   ,   , and   at time t. Let Y i (t), i = 1, … , n denote the left-continuous at-risk indicators of the competing risks processes and let i , i = 1, … , n be the corresponding censoring indicators, that is, i = 1 if the transition of Z i is eventually observed and i = 0 if a censoring comes first. The actually observable data are summarized in = {X i ⋅ Y i , Y i , i ∶ i = 1, … , n}. Hence, the processes X i (and Z i ) are observable only until censoring occurs. The transition probability of interest can be decomposed into three probabilities that are estimable based on : (Titman, 2015). The first two components in this decomposition are estimated by the standard Kaplan-Meier estimatorF 0 for the survival function and the AJ estimatorF 1 for the cumulative incidence function of the subgroup of those individuals, which were observed to occupy state  at time s. The remaining conditional probability is estimated using the empirical proportion where 1{⋅} denotes the indicator function. Consequently, the overall transition probability estimator by Titman (2015) is given aŝ This article shall establish the weak convergence of √ n(P  (s, ⋅) − P  (s, ⋅)) and a bootstrap variant thereof as n → ∞. Finally, implications to confidence interval construction for the conditional expected length of stay and to time-simultaneous confidence bands for the transition probabilities are demonstrated. The proofs of the central limits theorems and a detailed derivation of the asymptotic covariance function are given in the Supplementary Material provided online.

MAIN RESULTS
We focus on the Titman (2015) estimator during [s, ] where ∈ (s, ∞) is a terminal time, such that F 0 ( ) > 0 and P(C > ) > 0. The large sample properties ofP  (s, ⋅) are derived from central limit theorems for its individual components. To obtain such for the fraction processp  | , we throughout assume the existence of bounded one-and two-step hazard functions for transitions into  and  , respectively, which are approached uniformly: Here, t 1 , t, t 2 ∈ (s, ]. Conditions (1) and (2) impose slightly more regularity than just the existence of the hazard rates   ,   , and the same is true for the two-step hazard functions in (3) and (4). Loosely speaking, the latter two conditions imply that multiple state changes into or out of  do not happen in a too fluctuating way. In the special case when X is Markov, conditions (3) and (4) reduce to (1) and (2), respectively. Another realistic example, in which the above four conditions are met, is given in the data generating process for our simulations in Section 4.
From now on, we slightly modify the original definition ofp  | by using the right-continuous versions of the at-risk processes, indicated by a +-sign in the argument. This leads to the choicep The reason for this choice is that, using the same notation as before, the resulting estima-torP  (s, ⋅) can then be considered as a random element of the Skorokhod space D[s, ] of right-continuous functions with existing left-sided limits. The probabilistic consequences of this slightly modified estimatorP  are negligible because all event times are continuously distributed, so that a censoring coincides with an event with probability 0. After these preparations, we come to our main central limit theorem.
Theorem 1. Under conditions (1) to (4) and as n → ∞, the following convergence holds in distribution on D[s, ]: where U is a zero-mean Gaussian process.
The asymptotic covariance function (r, t)  → Γ  (r, t) of U is given in Section 3 in the Supplementary Material. To be able to use the result of Theorem 1 for time-simultaneous inference on P  (s, ⋅) or simply for improving the accurateness ofP  (s, ⋅)-based inference methods with a pivotal limit distribution, the classical bootstrap seems to be a useful choice of a resampling method. To this end, a bootstrap sample * is obtained by randomly drawing triples (X i Y i , Y i , i )n times with replacement from the original sample . This bootstrap sample is used to recalculate the transition probability estimator, resulting in For this bootstrap counterpart, a central limit theorem similar to the main Theorem 1 holds: Theorem 2. Under conditions (1) to (4), as n → ∞, and given , the following conditional convergence in distribution on D[s, ] holds in probability: where U is the same Gaussian process as in Theorem 1.
That is, the bootstrap succeeds in correctly recovering the limit distribution of the transition probability estimateP  (s, ⋅), enabling an approximation of its finite sample distribution.

Restricted conditional expected length of stay
A practical application of both main theorems is a two-sample comparison of expected lengths of stay in the state set of interest,  , given occupation of a fixed state set  at time s ∈ [0, ]. Considering temporarily the above one-sample setup, the restricted conditional expected length of stay is Such quantities could be of clinical interest in assessing the effect of complications to surgical interventions (Silber et al., 1999). Grand and Putter (2016) considered a pseudo-observation regression treatment in the context of estimating expected years of health or disability conditional on the state occupied at a given age. An R package for deriving the change in lengths of stay, that relies on the AJ estimator, is described in Wangler, Beyersmann, and Schumacher (2006). For further literature on lengths of stay within multistate models, see the articles cited therein. In the special case of a simple survival model, the only expected length of stay of interest is the mean residual lifetime; Irwin (1949) proposed to consider such expected lifetimes which are restricted to a fixed end point in time. Here, the Markov assumption shall not be imposed and estimation of the length of stay is based on the Titman (2015) estimator.
Sample-specific quantities are furnished with a superscript ( ) , = 1, 2. For instance,P (1)  (s, t) is the first sample's transition probability estimate to switch into some state of  at time t given that a state of  is occupied at time s. The sample sizes n 1 and n 2 may be different. We would like to test the null hypothesis If  is a favourable set of states, then the rejection of H  | is an indication of the superiority of the second treatment over the first. A reasonable estimator of e ( )  (s, ), the underlying transition probabilities P (1)  and P (2)  are allowed to differ. Hence, we do not rely on the restrictive null hypothesisH ∶ P (1)  ≡ P (2)  but rather on the hypothesis H  | of actual interest. Theorem 3. Under conditions (1) to (4) and if 0 < lim inf n 1 ∕n 2 ≤ lim sup n 1 ∕n 2 < ∞ as n → ∞, the following convergence in distribution holds: Here,̂2  is any consistent estimator, for example, found via plug-in, of the asymptotic variance along subsequences of n 1 ∕(n 1 + n 2 ) → ∈ (0, 1).
Replacing all estimators in the previous theorem by their bootstrap counterparts, where both underlying bootstrap samples (1) * and (2) * are obtained by n 1 and n 2 times independently and separately drawing with replacement from (1) and (2) , respectively, a similar convergence in conditional distribution given (1) ∪ (2) holds in probability. This result may be used for bootstrap-based and asymptotically exact confidence intervals for the difference in the restricted conditional expected lengths of stay, e (1)  (s, ) − e (2)  (s, ); see Sections 4 and 5 below.

Simultaneous confidence bands
Even though the bootstrap procedure is useful but not strictly necessary for inference on the restricted expected length of stay, it plays an essential role in time-simultaneous inference on the transition probabilities t  → P  (s, t). This is due to the unknown stochastic behaviour of the limit process U which, again in the context of creating confidence bands, is also the reason for the inevitableness of resampling procedures for AJ estimators even if the Markov assumption is fulfilled; see Bluhmki et al. (2018) for theoretical justifications and their practical performance. For the derivation of reasonable time-simultaneous 1 − ∈ (0, 1) confidence bands for P  (s, ⋅) on a time interval [t 1 , t 2 ] ⊂ [s, ] based onP  (s, ⋅) in combination with the bootstrap, we focus on the process where a transformation will ensure bands within the probability bounds 0% and 100%, and a suitable multiplicative weight function w will stabilize the bands for small samples. In particular, we use one of the log-log transformations 1 (p) = log(− log(p)) or 2 (p) = log(− log(1 − p)) (Lin, 1997), depending on whether we expect P  (s, t 1 ) to be closer to 1 or 0. As weight function, for With these choices, the resulting confidence bands will correspond to the classical equal precision (EP) bands for w = w 1k or the Hall-Wellner (H-W) bands for w = w 2k if  is an absorbing subset of states (cf. p. 265ff in Andersen et al., 1993 for the survival case and k = 1 or Lin, 1997 in the presence of competing risks and k = 2).
The confidence bands are found by solving for P  (s, t) in the probability P(sup t∈[t 1 ,t 2 ] |B n (t)| ≤ q 1− ) = 1 − . Here, the value of q 1− is approximated by the (1 − )-quantile q * 1− of the conditional distribution of the supremum of the bootstrap version B * n (t) = √ n ⋅ w * (t) ⋅ ( (P *  (s, t)) − (P  (s, t))) of B n given the data, where in the definition of w * all estimatorsP  (s, t) and̂  (t) are replaced with their bootstrap counterparts P *  (s, t) and *  (t). Finally, the resulting 1 − confidence bands for t  → P  (s, t) are as follows: The performance of these confidence bands as well as the confidence band obtained from not transforming or weighting the transition probability estimator are assessed in a simulation study in the subsequent sections.

Restricted conditional expected length of stay
To assess the performance of the conditional length of stay estimators for finite sample sizes, a small simulation study is conducted. Data are simulated from a pathological NM model in which the subsequent dynamics of the process depend on the state occupied at t = 4. A three-state illness-death model with recovery is used where the state 0 means "healthy", 1 "ill", and 2 "dead". The transition intensities are 02 (t) = 0.02, 10 (t) = 0.3, 12 (t) = 0.1, while In Section 7 of the Supplementary Material, we prove that the conditions (1) to (4) are satisfied in the present setup. Subjects are independently right-censored via an exponential distribution with rate 0.04. We focus on the restricted expected length of stay in the health state conditional on an earlier illness at time s = 5, that is,  = {1} and  = {0}. For each simulated dataset, e 10 (5, 30) is estimated and 95% confidence intervals are constructed via three methods; a Wald interval using an estimator̂2 10 of the asymptotic variance 2 10 , which is found by plugging in canonical estimators, a classical bootstrap, and a bootstrap-t procedure using the bootstrap-version of̂2 10 to studentize the bootstrap estimator. For each bootstrap, B = 1000 samples are generated. In the present three-state model, the Titman (2015) transition probability estimator reduces toP 10 (5, t) =P  (5, t) =F 0 (t)p  | (t) =F 0 (t)p 0|1 (t) because the healthy state is nonabsorbing. As a competing method, AJ estimator-based Wald-type confidence intervals are constructed for the same expected length of stay. Standard arguments yield that the asymptotic covariance function of the normalized AJ estimator for the 1 → 0 transition is 1j (s, u)(P k0 (u, t) − P j0 (u, t))(P k0 (u, r) − P j0 (u, r)) jk (u) y j (u) du; (sect. IV.4.2 in Andersen et al., 1993 for the corresponding particular variance formula). The expected length of stay asymptotic variance again results from a double integral of this covariance function with integration range r, t ∈ (s, ]. We consider several different scenarios for the total sample size, n = 50, 100, 150, and 200. Note that the NM estimator only uses the subgroup of subjects satisfying C i > 5, X i (t) = 1, which corresponds to only 0.367n subjects, on average. For each of the scenarios, 10 000 datasets are generated. As a result, the empirical coverage percentages have approximate Monte-Carlo standard error of around 0.2.
The simulation results are shown in Table 1. Both the AJ and NM estimator have substantial bias when n = 50 leading to under-coverage of all the nominal 95% confidence intervals. For larger n, both of the bootstrap confidence intervals and the Wald-based interval for the NM estimator give close to nominal coverage. The coverage of the AJ-based Wald confidence interval deteriorates with increasing n due to the estimator being inherently biased.

Simultaneous confidence bands
To investigate the coverage probabilities of simultaneous confidence bands introduced in Section 3.2, datasets are simulated under the same NM model as above, and considering total sample sizes of n = 50, 100, 150, 200, 300, and 400 patients. Ninety-five percent simultaneous confidence bands are constructed for p 21 (5, t) for t ∈ (6, 7]. For the NM estimator, in addition to the H-W and EP bands using the transformation 2 (p), a naive confidence band based on a constant weight function, w(t) ≡ 1, and an identity transformation, (p) = p, is also constructed. In addition, for the AJ estimates, EP bands based on 2 (p), are constructed both via the Efron bootstrap and also via the wild bootstrap using the R code which has been made available in the supplement to Bluhmki et al. (2018). In all cases, for each bootstrap B = 1000 samples are generated. Table 2 gives the empirical coverage probabilities of the simultaneous confidence bands based on 5000 simulated datasets. For the bands based on the NM estimator, the H-W and EP bands tend to over-cover quite markedly for small sample sizes, while the naive bands under-cover. Adequate coverage is achieved for the H-W and EP bands by n = 200. Since the AJ estimator is biased for this scenario, we would expect the coverage of the EP AJ bands to deteriorate with increasing sample size. However, it appears this is counteracted by the estimated quantiles, q * 1− increasing with n. As a consequence, the coverage initially grows with increasing n before beginning to decrease. There is more marked under-coverage for the Efron than wild bootstrap constructed EP AJ bands. This is not surprising given the fact that Efron's bootstrap does not succeed in reproducing an appropriate randomness nature of the trajectories-even in the Markov case. The reason for this is that, in the case of continuous event times, the drawn trajectory is uniquely determined by the first event time.

APPLICATION TO THE LIVER CIRRHOSIS DATASET
As an illustrative example, we consider the liver cirrhosis dataset introduced by Andersen et al. (1993) and also analyzed by Titman (2015). Patients were randomized to either a treatment of prednisone (251 patients) or a placebo (237 patients). A three-state illness-death model with recovery is assumed, where the healthy and illness states correspond to normal and elevated levels of prothrombin, respectively. A potential measure of the effectiveness of prednisone as a treatment is the restricted conditional expected length of stay in the normal prothrombin level state, from a defined starting point. Specifically, we consider Δe 10 = e (1) 10 (s, ) − e (2) 10 (s, ), the difference in restricted conditional expected length-of-stay in normal prothrombin levels, given the patient is alive and with abnormal prothrombin at time s, for the prednisone and placebo groups, corresponding to = 1 and = 2, respectively. Titman (2015) observed an apparent treatment difference with respect to the transition probabilities with respect to s = 1000 days postrandomization. We consider the length of stay in state 0 up to time = 3000 days.
In this case, n 1 = 26 patients in the prednisone group and n 2 = 35 patients in the placebo group meet the condition that X i (1000) = 1 and Y i (1000) = 1. Bootstrap-based confidence intervals are constructed using 1000 bootstrap samples within each group. Table 3 shows the three constructed 95% confidence intervals, which are broadly similar, although the naive bootstrap interval excludes 0, whereas the Wald and bootstrap-t intervals do not. The AJ-based estimate and a Wald confidence interval is also given. It is seen that there is a dramatic difference between the NM and AJ estimates, with the latter indicating no treatment difference. Potentially, the apparent NM behaviour in the data may be due to patient heterogeneity within the treatment groups.
To illustrate the construction of simultaneous confidence bands, we construct 95% confidence bands for P 21 (500, t), in the interval t ∈ (750, 1250]. Since it was seen in Section 4 that a reasonable sample size is required for good coverage, an earlier start time is used here than for the estimate of the restricted expected length of stay to ensure sufficient numbers of patients are under observation. Since we expect P 21 (500, t) < 0.5, we chose the confidence bands using the transformation 2 (p) = log(− log (1 − p)). Figure 1 shows the confidence bands for the placebo and prednisone groups using H-W and EP intervals using the NM estimator and using the EP intervals for the AJ estimator.

DISCUSSION
To the best of our knowledge, our article is the first to prove weak convergence properties of a general transition probability estimator in independently right-censored NM multistate models while also providing explicit formulas of the asymptotic covariance functions. We gave similar proofs for the classical bootstrap, which finally allows the utilization of the Titman (2015) estimator in time-simultaneous inference procedures. Unfortunately, when focusing on inference on the restricted expected length of stay, both applied bootstrap methods resulted in similarly reliable confidence intervals as the Wald method. In general, sample sizes of at least n = 100 ensured a coverage level close to the nominal 95%.
On the other hand, the time-simultaneous confidence bands derived in this article cannot rely on a purely asymptotic quantile finding approach. Instead, they truly require a resampling method such as the bootstrap method developed at the end of Section 2. Furthermore, the simulation results in Section 4.2 and the application to the liver cirrhosis dataset in Section 5 demonstrate their practical usefulness: not only did they reveal reliable coverage probabilities for relatively small sample sizes already. They were also only slightly wider than the confidence bands based on the AJ estimator for which there is no guarantee of consistency. In future research, one could analyze other resampling procedures as well; they might even improve the performance in small samples for which our NM bands were either too conservative or slightly too liberal. For example, the wild bootstrap methods of Lin, Fleming, and Wei (1994), Lin (1997), Beyersmann, Pauly, and Di Termini (2013), or Bluhmki et al. (2018) are popular choices for resampling counting process-based estimators; see, for example, the frequent use of the multiplier bootstrap with normal variables in the book by Martinussen and Scheike (2006).
In this article, in common with other treatments of NM transition probabilities in the literature, independent censoring has been assumed (Allignol et al., 2014;Meira-Machado et al., 2006;Putter & Spitoni, 2018). For Markov processes, the AJ estimator has the advantage of being robust to censoring which is dependent on the state occupied. Titman (2015) included a simulation study which showed the proposed NM estimator was biased under such scenarios. In the context of estimation of state-occupation probabilities, methods based on inverse probability of censoring weights (IPCWs) have been proposed to allow consistent estimation under state-dependent censoring (Datta & Satten, 2002). In principle, such methods could be applied to the estimator of Titman (2015). However, extending the results of this article to accommodate IPCW-based estimators is not straightforward.
Another field of possible future interest is the extension of the present methodology to quality-adjusted sums of restricted expected lengths of stay in a specific subset of states (cf. Williams, 1985 who proposed such quality-adjusted life years for operation considerations). For instance, the expected length of stay in a healthy state could be weighted with a factor p ∈ (0.5, 1), whereas the length of stay in the illness state obtains the weight 1 − p. The specific choice of p may be obtained from additional information on, for example, the pain scores of patients in the illness state and may be chosen individually by the patients themselves. This way, a comparison of treatments could be achieved that more realistically accounts for the circumstances of a specific disease. Estimation of quality-adjusted survival has received substantial attention in the statistical literature. Tsiatis (1997, 1999) used the right-censored sample of observed quality-adjusted lifetimes directly to estimate their distribution. Glasziou, Simes, and Gelber (1990) considered a partitioned survival approach to estimate quality-adjusted survival assuming a progressive process. Cook, Lawless, and Lee (2003) compared estimates of quality-adjusted survival based on the AJ and robust Pepe-type estimates of the state occupation probabilities. In addition, quality-adjusted survival estimates based on parametric models making a specific Markov or semi-Markov assumption have been considered by several authors (Chen & Sen, 2001;Pradhan & Dewanji, 2009). Future research, extending the results of this article, could focus on inference on quality-or cost-adjusted life years, possibly simultaneously for multiple different time windows. In this context, resampling methods would again assure the asymptotic exactness of the inference procedure.
However, one should bear in mind the early controversial debate on quality-adjusted life years as articulated by, for example, Harris (1987); see also the discussion of the article by Cox et al. (1992). Therein, G.W. Torrance appreciates the usefulness of quality-adjusted life years for resource allocation if combined with other measures and advices: He states that quality-adjusted life years "were never intended for clinical decision-making-they were developed for use in resource allocation." They should also be reported together with other important measures for reaching a decision. Moreover, one should add "a thorough sensitivity analysis and thoughtful discussion, including caveats" to obtain an informative quality-adjusted life years analysis.
Finally, reconsider the data analysis in Section 5, where the AJ estimator yielded a difference of restricted expected lengths of stay, which drastically deviates from the same quantity based on the Titman (2015) estimator. This might be considered as a strong evidence against the Markov assumption. However, as also noted by Putter and Spitoni (2018), if the discrepancy between the AJ and NM estimates is modest, the AJ estimator may be preferred since any bias will likely be outweighed by the substantial gain in efficiency. This is apparent in the mentioned liver cirrhosis data example, where the confidence intervals and simultaneous confidence bands based on the AJ estimator are much narrower. Equivalently, test procedures based on the classical estimator are more powerful in detecting deviances from null hypotheses. As such, there is interest in developing formal tests for the applicability of the classical AJ estimator, which will be the subject of future articles.