Multiple imputation strategies for missing event times in a multi-state model analysis

In clinical studies, multi-state model (MSM) analysis is often used to describe the sequence of events that patients experience, enabling better understanding of disease progression. A complicating factor in many MSM studies is that the exact event times may not be known. Motivated by a real dataset of patients who received stem cell transplants, we considered the setting in which some event times were exactly observed and some were missing. In our setting, there was little information about the time intervals in which the missing event times occurred and missingness depended on the event type, given the analysis model covariates. These additional challenges limited the usefulness of some missing data methods (maximum likelihood, complete case analysis, and inverse probability weighting). We show, for the first time in the MSM context, that multiple imputation (MI) of event times can perform well in this setting. MI is a flexible method that can be used with any complete data analysis model. Through an extensive simulation study, we show that MI by predictive mean matching (PMM), in which sampling is from a set of observed times without reliance on a specific parametric distribution, has little bias when event times are missing at random, conditional on the observed data. Applying PMM separately for each sub-group of patients with a different pathway through the MSM tends to further reduce bias and improve precision. We recommend MI using PMM methods when performing MSM analysis with Markov models and partially observed event times.


Introduction
In clinical studies, there is often interest in describing the sequence of events that each patient experiences, to enable better understanding of disease progression.Increasingly, multi-state model (MSM) analysis is used for this purpose.MSMs have been used in a wide variety of clinical contexts, such as organ and stem cell transplantation, 1,2 studies of dementia 3 and aging, 4 and in cancer research. 5The advantage of the MSM approach is that the probability of multiple events can be modelled simultaneously.This allows the prediction of clinicallyrelevant quantities, such as the probability of each event at any given time, and the average number of days spent in each state.This in turn enables more effective communication of risk to patients, 6 particularly because these quantities can easily be illustrated graphically.
A complicating factor in MSM studies is that the exact time of each event may not be known.
In some settings, none of the event times are exactly observed (with the possible exception of time of death).For example, in HIV 7 or dentistry, 8 changes in the health of the patient are reported only at intermittent clinic visits.Formally, such events are "interval-censored": the event time lies in the interval (L, R], where L represents the last known event-free time and R represents the first time at which the event is reported.In such settings, maximum likelihood (ML) methods for interval-censored data 1,3,9 -in which the marginal likelihood of the observed data is maximised -are generally used.In other settings, exact event times are observed for some individuals but not others.For example, in a pregnancy study, 10 gestational age at delivery was recorded for some individuals but missing for others.In this type of setting, a wider choice of methods for missing data is available because some individuals have complete data.As well as ML, available methods include complete case analysis (CCA), inverse probability weighting (IPW), and multiple imputation (MI). 11Our motivating example is in this type of setting.We consider a previously analysed dataset of patients who received haematopoietic stem cell (HSC) transplants using cord blood (CB)   .donated to the UK National Health Service (NHS) Cord Blood Bank (CBB). 12There were missing data in the NHS CBB dataset.In particular, the times of onset of acute graft-versushost disease 13 (aGvHD, caused by an immune response of donor cells -the "graft" -against the patient's tissues and organsthe "host" ) and relapse (i.e.signs and symptoms that the patient's original blood disease has returned after treatment) were missing for approximately 25% of patients who experienced aGvHD and/or relapse, respectively.Exact times of death or last follow-up were reported for all patients.Note that these missing times can still be considered interval-censored, with finite interval boundaries inferred from clinical criteria (e.g. the standard clinical definition of aGvHD 13 assumes occurrence between day 0 and 100 post-transplant) or the known length of the monitoring period for each patient.
The NHS CBB dataset is an interesting test case because ML, CCA, and IPW have limited use, for the following reasons: Our setting deviates from the assumptions of available ML methods in two ways: 1. Our event times are a mixture of observed and missing (interval-censored) times.
ML methods developed so far assume that all times are interval-censored.
2. For our missing times, the associated interval boundaries are wide relative to the observed event times.In a review of ML methods for handling interval-censoring in MSM analysis, Machado et al. found that none of the available methods performed well when censoring intervals were wide, relative to the change in hazards. 14i) CCA (in which only patients with observed values for all analysis model variables are included) will give biased estimates in our setting because missingness depends on the analysis model outcome.15 Note that CCA estimates would only be unbiased in this setting if the probability that event times were missing did not depend on the type of event nor the event times themselves (after conditioning on the analysis model covariates).
(iii) IPW (in which the complete cases are weighted by the inverse of the estimated probability of being complete) is likely to perform poorly because the types of event experienced by each patient is strongly predictive of missingness of the event times, resulting in extreme weights for some individuals. 16In addition, like CCA, IPW estimates generally lack precision because the incomplete cases (which contain partial information about the outcome) are discarded. 11n contrast to approaches (i)-(iii), above, MI (assuming the missingness mechanism is ignorable and the imputation model is correctly specified 17 ) utilises all available data, using observed data for the analysis model variables plus any additional variables that are predictive of the missing event times, from both patients with fully observed event times and those with partially observed event times.In addition, MI can accommodate a mixture of exactly observed and missing times, plus it allows flexibility when choosing the analysis model.
The standard MI procedure consists of three steps: 1.An imputation model is fitted to the observed data and missing values are replaced with draws from its predictive distribution.This is repeated multiple (M) times, to give M completed datasets.
2. The analysis model is fitted to each of the M completed datasets.
3. The M sets of results are combined using Rubin's rules. 18ing the NHS CBB dataset as motivation, Curnow et al. 12 considered MI and ML strategies for handling missing event times in a competing risks analysis.They examined the extent to which interval boundaries, the data distribution, and analysis model should be accounted for in the imputation model.They found that MI by predictive mean matching 19 (PMM) resulted in least biased estimates, and was robust to model mis-specification.PMM is a variation on the standard MI procedure described above, in which missing values are replaced with observed values from donors with a similar predicted mean.Therefore, in this paper, we also focus on MI, extending the work of Curnow et al. to multi-state Markov models.This is the first time that the performance of MI methods for missing event times has been assessed in the MSM context.Note that, in this paper, we generally refer to 'event' times rather than 'transition' times.This is because it is more realistic to have missing times for a specific event -which may affect several transition timesrather than for a specific transition (e.g. a missing times of aGvHD will affect times of transition to and from aGvHD).
In Section 2 we describe the motivating example.In Section 3 we describe MSM methodology in detail.In Section 4 we describe a simulation study comparing different MI strategies and present its results in Section 5.In Section 6 we apply our MI strategies to the motivating study dataset.We conclude with general discussion in Section 7.

Motivating Study
The NHS CBB dataset contained information about 432 CB transplants.Individual-level data were available about baseline patient, donor, and transplant characteristics (see Supplementary Material Section S4 for further details) as well as about events experienced by each patient during the post-transplant monitoring period.Event types included aGvHD and chronic GvHD (cGvHD, GvHD occurring more than 100 days post-transplant), relapse, and death.The median follow-up time was 3 years (Kaplan-Meier estimate, censoring follow-up time at death) and at least one post-transplant event was reported for each patient.For each type of event, both an indicator of whether the event was experienced and the associated time of onset were reported (censoring at the earliest of the time of a competing event or last follow-up).The time of onset of aGvHD was missing for 57 (24%) of 241 patients who experienced aGvHD, and the time of onset of relapse was missing for 22 (25%) of 89 patients who experienced relapse.

Multi-state models
Formally, we consider a stochastic process {Y(t), t ∈ T} with a finite state space Z = {0, ..., N} and process history up to time s, Hs = {Y(u); 0 ≤ u ≤ s}. 2 Then P( MSMs are often represented using diagrams, such as Figure 1 below, which depicts the classic "illness-death" model.In Figure 1, states are represented by rectangles, and possible transitions by arrows.The transition intensity, α ab (t), is shown for each transition.In Figure 1, there is a single initial state (0. Alive), and we assume that all patients are in this state at the time origin, t = 0.There is a single "absorbing" state (2.Death), that is, a state from which further transitions cannot occur.There is also an "intermediate" state (1.Illness) between initial and absorbing states.There are two possible pathways through this MSM: either 0-1-2 (patient i is alive without illness at t = 0, becomes ill at time t 1i , and dies at time t 2i , with 0 < t 1i < t 2i ), or 0-2 (patient j is alive without illness at t = 0, and is still without illness at time of death t 2j , with t 2j > 0).Patients can be right-censored at any time-point  21 This paper only considers Markov models.We include a test for the Markov property in the simulation and real data analysis.

Simulation study
We conducted a simulation study to assess the performance of MI methods when applied to a MSM analysis, when some event times were missing.The aim of the simulation study was to quantify the bias and precision of estimates from a MSM analysis in various missing data scenarios.The design of the simulation study is summarised below.Further details are provided in Supplementary Material Section S1.

Data generation
We first generated complete data for the event times and associated states using the method described by Beyersmann et al. 22 applied to a Markov uni-directional three-state model (similar to Figure 1 above).We used 1000 simulations and each simulated dataset contained 500 patients (similar to the size of the real dataset).In our model, transplant was the initial state for all patients (state 0), aGvHD the single intermediate state (state 1), and relapse or death (a composite outcome which is of clinical interest in many HSC transplant studies [23][24][25] ) was the single absorbing state (state 2).Note that here, for simplicity, we generated a single time of relapse/death for each patient (rather than separately generating a time of relapse and a time of death as in the real data).
We assumed that each transition intensity model, α ab (t), had a proportional hazards (PH) structure, defined as follows: where t represents the time in days since transplant, z 1 ~ Bernoulli(0.2) represents whether a patient is in relapse at time of transplant (assuming patients in relapse at the time of transplant are relapse-free immediately post-transplant), z 2 ~ Bernoulli(0.45)represents whether a patient receives a double cord transplant (vs.single cord), and z 1 and z 2 are independent.The magnitude of the model parameters and choice of covariates were based on the real data.
Censoring times were randomly generated between one and five years post-transplant, to represent administrative (non-informative) censoring at study end.
In a subsequent step, missing event times were generated using 12 different missing data mechanisms (MDMs).Firstly, we considered event times missing completely at random (MCAR, i.e. missingness was independent of both the observed and missing data) by setting a random 30% of event times to missing, regardless of the event type.Next, we considered 11 different MDMs (see Supplementary Material Table S1).We assumed that event times to acute GvHD and/or relapse/death were either (i) missing at random (MAR), conditional on the observed data (missingness depended on the event type and covariates but not on the missing data itself), or (ii) missing not at random (MNAR, missingness depended on the missing data itself).Although our chosen MI methods assumed data were MAR, MNAR MDMs allowed us to assess the impact on bias and precision when the MAR assumption was violated.Approximately 30% of event times were missing in each MDM, to reflect the percentage of missing times in the real data.

Analysis model, estimands and performance measures
Consistent with the data-generating mechanism (DGM), we fitted PH regression models for each transition intensity i.e. we fitted models of the form: α ab (t) = α ab 0 (t) exp(β ab ′ zi), where α ab 0 (t) represents the baseline intensity at time t when moving from state a to state b, β ab is the vector of regression parameters, and zi are the set of (time-fixed) covariates for patient i.
We fitted both Cox and Weibull models.We fitted Cox models because they are commonly used in practice.In case of bias due to mis-specification of the baseline intensity function (because it is estimated non-parametrically in the Cox model), we also fitted Weibull PH models (i.e. using the same form for the baseline intensity as the DGM).
In our analysis, the estimands of interest were: The vector of transition intensity regression parameters βab for all possible states a and b. (ii) The restricted expected length of stay (RELOS) in each state, 4 restricted to the time period between transplant and two years post-transplant, was used as a summary of the transition probability distributions.
RELOS from time 0 to time t for state b is defined as: where the state occupation probability, P b (t), denotes the probability of being in state b at time t. 4 If all patients are in state 0 initially, P b (t) is equivalent to the transition probability from state 0 to state b at time t, i.e.P b (t) = P 0b (0,t). 20 calculated e b (t) using the consistent estimator 4 : where P ̂(t m ) is the estimated state occupation probability for state b at time t and t 0 < t 1 < … < t M ≤ t M+1 are the set of ordered times from time 0 up to time t, across all transitions.For Cox models, the set of times was the set of all simulated times for the k th simulation.For Weibull models, the set of times was specified as the set of all values of t from time 0 up to time t, in increments of 0.1 days.
(iii) The final estimand of interest, regression parameter γ 12 , was used to test whether our MI approach led to conclusions about the Markov assumption that were consistent with the DGM. 26In this test, time from transplant until aGvHD, denoted by d, was included as an additional covariate in the model for α 12 (for the transition from aGvHD to relapse/death), i.e. we fitted the model: α 12 (t) = α 12 0 (t) exp(γ 12 d + β 12 ′ zi).
Since our model is Markovian, in truth, γ 12 equals zero.
Performance measures for regression parameters βab and RELOS were standardised bias (defined as bias/SD of the per-simulation estimates) and average model-based SE.The performance measure of interest for the regression parameter γ 12 was the coverage of the 95% . confidence interval i.e. the percentage of within-simulation 95% confidence intervals for γ̂1 2 that included the true value.
Model-based SE of the regression parameter estimates was calculated using standard methods. 27,28We calculated the model-based SE of RELOS for Cox models using a nonparametric bootstrap estimator, 29 with 50 bootstraps per simulated dataset, and the modelbased SE of RELOS for Weibull models using the delta method.Further details of a separate simulation study to identify the best estimator of model-based SE for Cox and Weibull models are provided in the Supplementary Material (Section S2).The true values of the regression parameters were as per the DGM.The true values of RELOS were calculated using numerical integration.

MI approaches
We considered four MI approaches in this study.Approaches (i) -(iii) did not explicitly account for the ordered nature of the event times.Approaches (ii) -(iv) allowed for different distributions of event times for those who experienced aGvHD compared with those who did not experience aGvHD before relapse/death: (i) "Type 1" 19 predictive mean matching (PMM), fitting a single imputation model for all patients.
(ii) PMM, applying separate imputation models for patients who did and did not experience aGvHD before relapse/death (PMMSUBGP).
(iii) MI using draws from a linear imputation model, applying separate imputation models for patients who did and did not experience aGvHD before relapse/death (LINMI).
Any negative imputed times were replaced by the value 0.0001 post-imputation.
(iv) PMM, compatible with the ordered nature of the event times as specified in the analysis model (PMMCOMP).In this method, PMM was applied using separate imputation models for patients who did and did not experience aGvHD before relapse/death.This method proceeds as follows: a. Impute the first event time, i.e. impute event times to relapse/death for the subgroup of patients who did not experience aGvHD, and event times to aGvHD for the subgroup who did.
b.For the latter group, also impute the (calculated) time from aGvHD to relapse/death, including the time to aGvHD and time from aGvHD to .relapse/death in the imputation model, but not the time to relapse/death.Postimputation, calculate any missing times to relapse/death as the sum of the (observed or imputed) time to aGvHD and the (observed or imputed) time from aGvHD to relapse/death.Methods (iii) and (iv) were applied in the MCAR scenario.Due to their relatively poor performance, we did not apply these methods in other scenarios.For comparison purposes, we also performed CCA because this method is often used in practice (and is the default method when there are missing values in most statistical software).
Following current guidelines, 30 each imputation model included all analysis model variables, i.e. an indicator of the associated transition (whether times were missing for 0 → 1, 0 → 2, and/or 1 → 2 transitions), both analysis model covariates, indicators of the other events experienced, and associated event times.We used default settings for the number of imputations, iterations, and size of the donor pool for PMM (five in each case) because we wanted to assess the application of MI as routinely used in practice.We concluded that default settings would not unduly influence our results, because we were repeating the analysis for 1000 simulated datasets (which controlled the overall Monte Carlo error even when using a small number of imputations).In addition, in missing data scenarios in which only one variable (the time of aGvHD or relapse/death) was incomplete, no iteration was required.In scenarios in which times to both aGvHD and relapse/death were incomplete, we confirmed that convergence was achieved within five iterations by examining trace plots 31 for a randomly chosen simulated dataset for each MI method and MDM.

Computer software
Regression parameter estimates and SEs for Cox and Weibull models were calculated using 'survival' 27 and 'flexsurv' 28 R packages, respectively; state occupation probability estimates were calculated using the 'mstate' R package; 32 MI methods were implemented using the 'mice' R package 31 .R code to perform the simulation study is provided in Supplementary Material Section S5.

Simulation study results
Simulation study results are illustrated in Figures 2 and 3 2 shows results for scenarios in which MI was expected to work well, that is, when all event times were either MCAR or MAR.Conversely, Figure 3 shows results for scenarios in which MI was not expected to work well, that is, when some or all transition times were MNAR.
For comparison, CCA estimates are also illustrated.
As expected, CCA gave unbiased estimates only when event times were MCAR.When event times were either MCAR or MAR (Figure 2), PMM resulted in a small amount of bias for all estimands (that is, the magnitude of the standardised bias was <0.5), except for e 2 (2) (RELOS for the relapse/death state) and the regression parameter β 12 1 (for the covariate "in relapse or not at time of transplant" in the transition intensity model from aGvHD to relapse/death).The bias in the RELOS estimate, e 2 (2), remained for all imputation methods and MDMs when fitting a Cox model, so is not further discussed here.Bias for regression parameter β 12 1 was large in scenarios when event times to relapse/death after aGvHD were missing and small when only event times to aGvHD or relapse/death without aGvHD were missing.
Applying PMM separately for patients who did and did not experience aGvHD before relapse/death (PMMSUBGP) or accounting for the ordered nature of the event times as specified in the analysis model (PMMCOMP) reduced the bias in regression parameter β 12 1 .
When these methods were used, bias remained small for all other estimands except the RELOS estimate, e 2 (2).Results using PMMSUBGP were very similar for both Cox and Weibull models, except that the bias in the RELOS estimate, e 2 (2), was greatly reduced when fitting a Weibull model.MI using draws from a linear imputation model (LINMI) resulted in large bias for some estimands, particularly estimates of RELOS (see Supplementary Material for PMMCOMP and LINMI results). .
When some or all event times were MNAR (Figure 3), MI using either PMM or PMMSUBGP led to biased estimates.Bias was generally the same or larger than when using  1, for all scenarios in which times to aGvHD were MAR, times to relapse/death times MNAR and the imputation method was PMMSUBGP.
As discussed above, MNAR mechanisms in which smallest times to relapse/death times tended to be missing affected mainly patients who experienced relapse/death without aGvHD.
Hence, the regression parameter γ 12 is unbiased with coverage close to the nominal value in this scenario.In MNAR mechanisms in which largest times to relapse/death tended to be missing, there is little bias when fitting a Cox model.However, the model-based SE is larger, which may explain the slight over-coverage in this case.Bias is large when fitting a Weibull model, which may explain the high degree of under-coverage in this case. .
For each MI method, the imputation model for each partially observed variable included all other analysis variables, i.e. covariates, an indicator of whether the patient experienced relapse and/or death, and times of aGvHD, relapse, and death.Note that indicators of whether the patient experienced aGvHD or cGvHD were excluded from each imputation model because these had the same value for all patients in each sub-group.Year and country of transplant, and time of cGvHD were also included in each imputation model as auxiliary variables because they were highly predictive of missingness, as well as of the incomplete variables themselves.The time of the composite event (relapse/death) was derived postimputation.
6][37] Since both covariate data and event times were missing in the NHS CBB dataset, the actual event times were included in the imputation models, rather than, for example, the baseline hazard function recommended by White and Royston. 36We performed 80 imputations (following the "rule of thumb" 38 that the number of imputations should at least equal the percentage of incomplete cases -73% in the NHS CBB dataset).As in the simulation study, we used the default of five iterations per imputation (assessing convergence using trace plots as before) and a donor pool of five donors for each PMM method.We also calculated CCA estimates for comparison purposes.

Results
To illustrate the difference between estimates from CCA and MI methods (PMM and PMMSUBGP, fitting either a Cox or Weibull model for the latter method), Figure 4 shows estimated hazard ratios (HR, conditional on all other covariates) for a double cord transplant (vs.single cord) and whether a patient was in relapse at time of transplant (vs. in remission) for each transition (see Supplementary Material Tables S4a-c for full results).For each HR, .
95% confidence intervals (CI) were wider for CCA estimates than for MI estimates.
PMMSUBGP estimates were very similar, whether a Cox or Weibull method was fitted.
PMM estimates were generally similar to PMMSUBGP estimates.Some CCA point estimates were outside the 95% CI for the equivalent MI estimates (this was the case for double cord transplant, for both the transition from transplant to aGvHD and the transition from aGvHD to relapse/death, and, for some MI estimates, for whether in relapse at time of transplant, for both the transition from transplant to relapse/death and the transition from aGvHD to relapse/death).In these cases, the MI estimates were closer to the null than the CCA estimate.
Table 2 shows CCA and MI estimates of RELOS in the first year post-transplant (illustrated for three different patient types: a patient with reference values of covariates, a low-risk, and a high-risk patient).For each patient type, CCA estimates of the time spent in relapse/death were higher than the MI estimates whilst CCA estimates of time spent in the aGvHD state were lower.For all estimates, CIs were wide and widest for CCA estimates.CIs were also wide for the aGvHD and relapse/death states when fitting a Weibull model (PMMSUBGP Weibull).This may be due to the small number of transitions relative to the range of observed event times e.g.only 46 patients with reference values of covariates experienced relapse/death after acute GvHD, and the range of event times was 14 -1711 days posttransplant.As a consequence, convergence of the Weibull model was not achieved in 29 of the 80 imputed datasets for the transition from aGvHD to relapse/death.
There was no apparent association between time of acute GvHD and the hazard of relapse/death after acute GvHD (HR 1.00, 95% CI 0.99-1.01,see Supplementary Material), suggesting there was no violation of the Markov assumption.There was some indication of a violation of the PH assumption, particularly for the model for the transition from transplant to relapse/death (global PH test p-value = 0.14, 0.03, 0.30 for the transitions from transplant to acute GvHD, transplant to relapse/death and acute GvHD to relapse/death, respectively).

Discussion
In this paper, by simulation, we have shown that MSM analysis using Markov models with an MI strategy based on PMM yields estimates with little or negligible bias when event times are MAR.In our setting, in which the probability that event times are missing depends on the event type (a common occurrence in practice because overall survival status is generally completely reported whereas non-fatal events may not be), CCA is not valid because .missingness depends on the analysis outcome.Our simulation study shows that even when CCA estimates are unbiased (e.g. when times are MCAR), PMM estimates have better precision than CCA estimates.In PMM, missing values are replaced by sampling at random from a donor pool of patients with observed values who are 'similar' to the subject with missing data.In the MSM context, this means that the donor pool tends to contain patients who have experienced the same sequence of events as the incomplete case.Therefore, the original sequence of events can be preserved for the incomplete case, without explicitly specifying the order of events in the imputation process.In both our simulation study and real data application, the distribution of event times differed across sub-groups of patients.In the simulation study, applying PMM separately for sub-groups of patients who did and did not experience the intermediate event (PMMSUBGP) tended to reduce bias and model-based SE, particularly for parameters for the transition from the intermediate to absorbing state (aGvHD to relapse/death).PMMSUBGP also improved coverage in a parameter used to test the Markov assumption (by including time from transplant to aGvHD in the model).
An extension of the PMMSUBGP method, which explicitly preserved the ordering of events by including the aGvHD event time and time from aGvHD to relapse/death in the imputation model, but not the relapse/death event time (PMMCOMP), gave results with comparable bias to PMMSUBGP, but larger model-based SE.Due to the loss of information using this method, with no advantage in terms of bias reduction, we would not recommend this approach.
In our study, MI using draws from a linear imputation model (LINMI) led to more bias than PMM when estimating transition intensity model parameters and RELOS.This may be because this approach could result in an imputed relapse/death time that was smaller than the (observed or imputed) aGvHD time.Hence, LINMI was not compatible with the analysis model and estimates were biased as a consequence.
Overall, we recommend using Type 1 PMM to impute missing event times in a MSM analysis using Markov models, first exploring the distribution of event times for each subgroup of patients with a different path through the MSM.Type 1 PMM should be applied separately for each sub-group of patients with a different distribution of event times.In our simulation study, the distributions of simulated times of relapse/death were very different for patients who did and did not experience aGvHD.In analysis of real data, there may be .
smaller differences between distributions of event times for different sub-groups of patients, and applying PMM by sub-group may make little difference to the results.Therefore, to assess the sensitivity of results to the imputation method, we recommend performing analysis using both a single imputation model and separate models for each sub-group of patients.
Note that sub-groups should be of sufficient size to allow for random donor selection in the PMM procedure.
The PMM strategy described here can only be used if some event times are exactly observed, which may not always be the case.For example, after corneal transplantation, hospitals were asked whether any post-transplant surgery had been performed since the previous follow-up report but were not asked for the date of surgery. 39In this example, time of surgery would be missing for all patients.Valid use of PMM would require further data collection obtain exact event times for a representative sample of patients.If this was not possible, a ML approach could be used instead.
Although PMM performed well in our study, there is still scope for improvement, for example, by development of methods that are explicitly compatible with a MSM analysis.
This could be achieved, for example, through an extension of the MAR stacked MI approach of Beesley and Taylor 40 or the SMC-FCS method 41 to MSM, particularly when these use parametric models.Alternatively, another method proposed by Beesley and Taylor 42 could be extended, combining an ML approach with full imputation (Beesley and Taylor use "improper" imputation within their EM algorithm).
Generally, MI techniques that assume MAR are not recommended when data are MNAR.In this study, MI resulted in biased estimates when event times were MNAR or a mixture of MAR and MNAR.Our results suggested that bias was greater when the time to the absorbing state (relapse/death) was MNAR than when the time to the intermediate state (aGvHD) was MNAR.This may be due to the constrained nature of the time to an intermediate event (in an illness-death model, this is bounded by 0 and the time of transition to the absorbing state), which may limit the degree of bias even when event times are MNAR.Conversely, the lack of constraint on the maximum time of transition to an absorbing state, and the different pathways through the MSM to that state (each potentially with a different distribution of event times), may increase the degree of bias.In our simulation study, parametric analysis models generally performed as well as semiparametric models.Furthermore, parametric models resulted in less biased estimates of the expected length of stay in state (RELOS) when there were sparse event times.However, regression parameter estimates from parametric models were more biased than estimates from semi-parametric models when event times were MNAR.In addition, in practice, parametric models seemed to be more prone to convergence problems than semi-parametric models (and this may have been the case even if all variables were fully observed because event times were sparsely distributed).Further work is required to determine if this is also the case for flexible parametric models.
In the real data analysis, there was some indication that the proportional hazards assumption did not hold, particularly for the model for the transition from transplant to relapse/death.
Therefore, the model could be improved by including time-dependent regression parameters, or by using the dynamic landmarking approach. 43In addition, clinical inference would be strengthened, and important clinical questions could be answered, if a more detailed event history was modelled for each patient.However, this does rely on the availability of additional post-transplant data, which will almost certainly include some missing data.A more complex analysis model will increase the complexity of any imputation model and the likelihood of imputation model misspecification.outside the range (0,365) were truncated to 0 and 365 for lower and upper bounds respectively.Unless otherwise stated, Cox transition intensity models were fitted.CCA, complete case analysis; PMM, MI by Type 1 predictive mean matching; PMMSUBGP, as for PMM with imputation models fit separately for patients experiencing (i) both acute and chronic GvHD or chronic GvHD without acute GvHD, (ii) acute GvHD without chronic GvHD, (iii) relapse without GvHD or neither GvHD nor relapse.

Tables
Reference patient has reference values of all categorical covariates and age 24 years.Low-risk patient has a non-malignant disorder, receives a high stem cell dose, age 4 years, with reference values of all other covariates.High-risk patient is in relapse at transplant, receives a double cord transplant, has 2 or more donorrecipient human leucocyte antigen mismatches (matching is important to avoid graft rejection), age 44 years, with reference values of all other covariates.

Figure 4 .
Figure 4. Hazard ratio estimates and 95% confidence intervals for each transition, comparing CCA and MI methods (2), for each state, fitted using a Cox model.Results are illustrated for CCA and the two main MI methods (that is, the methods that were applied in all scenarios): PMM and PMMSUBGP.Bias and model-based SE are not illustrated because these could not be shown on the same scale for all estimands, and because model-based SE was similar for all MI methods and MDMs (and always larger for CCA than for MI methods).Similarly, coverage of the Markov test parameter, γ 12 , is not illustrated because with one exception, discussed later, it was similar for all MI methods and MDMs. Figure using "lollipop" plots and all results are included in the Supplementary Material Tables S3a-b.Figures2 and 3show the standardised bias of transition intensity regression parameters βab for each transition, and RELOS within two years, e b CCA.Using MI, bias was larger when the time to the absorbing state (relapse/death) was MNAR than when the time to the intermediate state (aGvHD) was MNAR, and when the largest times were MNAR than when the smallest times were MNAR.Times to relapse/death 1, when the largest relapse/death times were MNAR and vice versa when the smallest relapse/death times were MNAR.As a test of the Markov assumption, the time from transplant until aGvHD was added as a covariate to the transition intensity model from aGvHD to relapse/death.Coverage for the regression parameter for this covariate, γ 12 , was in the range 0.92-0.98 in all methods and scenarios, except one.The coverage was 0.66 when applying MI using the PMMSUBGP method and a Weibull analysis model, with aGvHD times MAR and largest relapse/death times MNAR.To allow further exploration of this outlying value for coverage, performance measures for the regression parameter γ 12 are shown in Table event times are MNAR.Therefore, further work is needed to determine if the conclusions of this research still hold for more complex MSMs.A further, useful, extension of this research would be to consider a range of sample sizes, covariate associations, and event rates.
Here, we only considered MSMs with the Markov property.Semi-Markov or non-Markov models may result in greater bias when .intermediate state

Table 1 .
Performance measures for estimates of the regression parameter γ12 in the transition intensity model from aGvHD to relapse/death when some event times are MNAR ModSE, average model-based SE; StdBias, standardised bias; Cov, coverage Parameter γ12 is for the time from transplant until aGvHD PMMSUBGP, MI by Type 1 predictive mean matching with imputation models fit separately for patients with and without aGvHD

Table 2 .
Estimates and 95% confidence intervals (CI) of expected length of stay in each state in the first year post-transplant, comparing CCA and MI methods