Calibration plots for multistate risk predictions models

There is currently no guidance on how to assess the calibration of multistate models used for risk prediction. We introduce several techniques that can be used to produce calibration plots for the transition probabilities of a multistate model, before assessing their performance in the presence of random and independent censoring through a simulation.


Introduction
2][3][4][5][6][7][8][9] A multistate model may be used when the outcome of interest can potentially be reached via intermediate health states deemed important to include in the model, or may help improve prediction. 10For example, when modelling the co-development of multiple long-term conditions (multimorbidity), [11][12][13] once an intermediate state has been developed (such as type 2 diabetes), this may increase the risk of developing further conditions (such as cardiovascular disease).[16][17][18] Evaluation of the performance of a prediction model, through a validation study, is essential before it can be used in practice. 19Model evaluation should routinely involve assessment of calibration and discrimination, [20][21][22] or other relevant techniques such as decision curve analysis, 23,24 measures of explained variation ( 2 statistics) or the Brier score. 25The calibration of a model is evaluated by how well the estimated risks from the model agree with the observed event rates in a cohort of interest.Best practice for assessing calibration of a model predicting continuous, binary, polytomous or survival outcomes is well established, [20][21][22][26][27][28][29][30][31][32][33][34][35][36] with a (flexible) calibration plot being the preferred approach. Method for evaluating the calibration of multi-state models are not well established.Titman and Sharples suggested comparing the risk of entry into an absorbing state (of a multistate model) with a Kaplan Meier estimate.37 Spitoni et al., 38 illustrated the estimation of prediction errors based on the Brier score, and Van Geloven et al. 29 reviewed methods for assessing calibration of a competing risks model (special case of a multistate model).
We reviewed a sample of studies published since 2017 involving the development of a multistate model for prediction.In studies that did assess model calibration, a wide range of approaches were taken.Postmus et al., 39 compared the mean predicted risk of death (the primary state of interest) with mean observed survival calculated through a Kaplan Meier estimator, within subgroups defined by predicted risk.Beyer et al., 40 analyse trial data, and compare the predicted value of overall survival from their multistate model with a Kaplan Meier estimate of overall survival within each trial arm.Deschepper et al. 41 compared predicted vs observed bed occupancy (state of interest), possible due to the lack of censoring.These approaches are valid but do not allow generation of calibration curves.[42][43][44] The lack of validation applied in practice may be due in part to the lack of guidance on model validation for multistate models.
In this paper, we detail how to assess calibration of the predicted transition probabilities from a multistate model using a variety of methods, focusing on methods that can produce calibration plots (Section 2).In Section 3, we compare the performance of such methods in a simulation study, and demonstrate their use in a clinical example in Section 4.

Methods for assessing calibration 2.1. Preliminaries
Let () be a multistate survival process with  states, where () =  if an individual is in state  at time .Assume a multistate model has been developed to predict () using a set of predictor variables  (e.g., one available in the literature, which we now aim to externally validate).Transition probabilities,  , (  ,   ), are the probability of being in state  at time   , when in state  at time   .Transition probabilities are agnostic to the path taken between  and  and any number of intermediate states may have been visited (i.e., we are not only interested in 'direct' transitions, which is the case for competing risks).Let   be the censoring time, such that () is only observed for  <   .Here, the aim is to assess the calibration of the estimated transition probabilities ̂ , (  ,   ) in a cohort of interest. 45 consider cohorts where all individuals start in the same initial state.We restrict our assessment to the calibration of transition probabilities out of the initial state ( = 1), made at the start of follow up (  = 0).These two conditions are relevant to most clinical applications reported in the literature.Given this setting, we reduce the notation of the predicted transition probabilities to  ̂().How to utilise landmarking 46,47 to evaluate calibration when these requirements do not hold is discussed in Section 4.
Four categories of calibration have previously been described: 35 1) Mean calibration (or calibrationin-the-large) where "the average predicted risk is compared with the overall event rate".2) Weak calibration where "on average, the model does not over-or underestimate risk, and does not give overly extreme (too close to 0 and 1) or modest (too close to disease prevalence or incidence) risk estimates".This is assessed using calibration intercept and slope.3) Moderate calibration where the estimated risks correspond to observed proportions across the entire range of predicted risk.This would generally be assessed using flexible calibration curves, 27,28,32 where the observed event rate is plotted against the predicted risk.4) Strong calibration requires "predicted risk corresponds to the observed proportion for every possible combination of predictor values", and is rarely considered in practice.
In sections 2.2 -2.5, we detail a range of methods that can be used to assess the calibration of the transition probabilities out of the initial state in a multistate model.We focus on calibration plots, but details are also provided on how to assess mean and weak calibration.

Aalen-Johansen estimator
The Aalen-Johansen estimator 48 is an estimator for the transition probability matrix of a nonhomogeneous Markov process (a generalisation of the Kaplan Meier estimator to multistate survival data).This can be used to estimate the observed risk of each state in the cohort of interest,    ().The mstate 49 package can be used to derive the Aalen-Johansen estimator.
Mean calibration using this estimator is assessed by comparing the mean predicted transition probability with the observed risk for each outcome category: where ̂  () is the predicted transition probability for individual .This estimator makes the assumptive of non-informative censoring.In the presence of informative censoring, we recommend calculating the Aalen-Johansen estimator within sub-groups of individuals defined by the predicted transition probability of each state and taking the average.Specifically, for each category , order individuals by predicted transition probability of category , split into equally sized groups, within each group calculate the Aalen-Johansen estimator, take the average and subtract the mean predicted transition probability in the cohort.This relaxes the non-informative censoring assumption, to non-informative censoring after conditioning on the predicted transition probabilities.Moderate calibration is assessed by comparing the Aalen-Johansen estimator and average predicted transition probability within each sub-group.Importantly, the Aalen-Johansen approach does not allow the estimation of calibration curves and therefore the approach is not recommended.However, knowledge of this estimator is required to motivate the pseudo-value approach given in Section 2.3.
Finally, a discussion of the Markov assumption is required.In general, the Aalen-Johansen estimator is only a consistent estimator under the Markov assumption for the multistate process (), which means at all times  "given the present state and the event history of a patient, the next state to be visited and the time at which this will occur will only depend on the present state". 10However, it can be shown through the work of Datta and Satten 50 and Glidden, 51 that in this specific setting (all individuals start in the same initial state), that the Aalen-Johansen estimator gives a consistent estimate of the transition probabilities out of this initial state.Therefore Estimators which do not depend on the Markov assumption [52][53][54][55] are not required here.A full line of reasoning is given in Supplementary Material part 1.

Pseudo-values
Pseudo-values enable estimation of estimands that can be expressed as an expectation when analysing censored survival data. 56The pseudo-value itself denotes the contribution of each individual to the expectation that is being estimated.If  ̂ is such an estimator, the pseudo-value for individual  is defined as: where  ̂− is the estimator  ̂ calculated in the cohort with individual  removed.For more details on pseudo-observations, and the required properties of the estimator  ̂ and estimand , see the work of Andersen and Perme. 56eudo-values for the transition probabilities from a multistate model can be calculated using the Aalen-Johansen estimator. 57Let  ̂() be the vector of pseudo-values for individual  at time , where  ̂  () is the pseudo-value for the transition to state .Moderate calibration for each state can be evaluated by regressing the pseudo-values on the predicted transition probabilities using a loess smoother: ̂  () =  (̂  ()). ( Other models and link functions could also be used. 56Observed probabilities  ̂ , () can then be obtained by generating fitted values for each individual in the validation cohort from model (1).The set of points {̂  (),  ̂ , ()} gives the calibration plot for state .
The estimates of calibration derived using pseudo-values are based on the same assumptions as the underlying estimator  ̂.For example, when  ̂ is the Aalen-Johansen estimator, censoring must be non-informative.However, if the censoring mechanism and outcome are conditionally independent given some variables , then the pseudo-values will be valid if calculated within subgroups defined by . 56We therefore suggest calculating pseudo-values within subgroups defined by the predicted transition probabilities.Specifically, for each state , order individuals by ̂  () and split into groups (say deciles or ventiles).Within each group, calculate  ̂ using the Aalen-Johansen estimator and extract  ̂  ().Note that the ordering of individuals may be different for each state .All individuals are then recombined into a single dataset when fitting model (1).
The mean calibration for each outcome category can be calculated as the average difference between the pseudo-values and predicted transition probabilities: )  ⁄ To assess weak calibration, fit the following linear regression model: and report the calibration slope  ̂.
2.4.Binary logistic recalibration using inverse probability of censoring weights (BLR-IPCW) Barrowman et al., 15 first proposed evaluating the calibration of ̂() by estimating calibration intercepts and slopes (at a given time point ) using binary logistic regression calibration techniques 58,59 and inverse probability of censoring weights.We extend the approach to assess moderate calibration using calibration curves.
Let   () be an indicator variable indicating whether an individual is in state  at time  (for all individuals uncensored at time ).The indicator   () can be modelled in the validation cohort using a loess smoother 32 : Fitted values  ̂ , () can then be generated for each individual in the validation cohort using Model (2), and the set of points {̂  (),  ̂ , ()} results in the calibration plot for state .Other models could be used to model the relationship between   () and ̂(), such as a binary logistic regression model (after applying logit transformation to ̂  ()) with multiple fractional polynomials 60 or restricted cubic splines. 20 the absence of censoring, the above approach would allow us to estimate the estimand of interest, which is the observed event probabilities for state  among individuals with a predicted transition probability of ̂(): (() = |̂()) However, in the presence of censoring this can only be done in the subgroup of the population which is uncensored at time .The assessment of calibration will be biased by censoring (induces selection bias) unless the censoring mechanism is independent from everything else.The calibration Model (2) must therefore be fitted using data reweighted according to inverse probability of censoring weights 25,61 (IPCW) to obtain valid estimates.The weights are calculated as the inverse of the probability that an individual has not been censored by time point : where  is a vector of baseline covariates, and () denotes the position of the multistate process at time , and information on any previous transition times.The inclusion of () reflects that if an individual enters an absorbing state at time   < , the weight for this individual is the probability of being uncensored at time   , as opposed to time .Specifically: (  > |, ()) = (  >   |) if () is in an absorbing state, and (  > |, ()) = (  > |) otherwise.In the presence of an absorbing state, it is therefore imperative to apply IPCWs even when censoring is non-informative.These weights should be estimated from the data by modelling the time until censoring occurs.
To assess mean calibration, one would fit the following model: fixing  to 1.The estimate  ̂ (calibration intercept) gives an assessment of mean calibration on the logit scale.We recommend the following steps to transform this onto the probability scale, in line with those from Sections 2.2 and 2.3.Calculate observed event probabilities by generating fitted values from Model (3): ̂ −, () =  −1 ( ̂+  (̂  ())) The mean calibration for state  can then be estimated as the mean difference between the observed event probabilities and the predicted transition probabilities: )  ⁄ .
To assess weak calibration, one fits Model (2) with no constraints to calculate the calibration slope,  ̂, where values less than 1 indicate overfitting.

Nominal recalibration framework using multinomial logistic regression and inverse probability of censoring weights (MLR-IPCW)
Barrowman et al., 15 also proposed calculating calibration intercepts and slopes using the nominal calibration framework of Van Hoorde et al., 33 developed for assessing the calibration of polytomous outcomes to model the observed event probabilities of all outcome categories simultaneously.We extend this to assess moderate calibration by adapting the spline-based extension of the nominal calibration framework. 34t   () be a polytomous variable taking values in {1, … , }, such that   () =  if an individual is in state  at time .For each individual  in the validation cohort, calculate the log-ratio of the probability of each state ( ≠ 1) relative to state 1: The outcome   () is then modelled using a multinomial logistic regression with a vector spline smoother: for  > 1. Observed event probabilities values  ̂ , () can then be obtained by generating fitted values for each individual in the validation cohort from Model (4) using the following formulae: where  ̂ , () is defined for  > 1.The set of points {̂  (),  ̂ , ()} then gives the calibration plot for state .Alike to the methods in Section 2.4, this process can only be done in the sub-cohort of individuals uncensored at time , and therefore IPCW must be applied to obtain valid estimates of calibration.Weights are calculated the same way as in section 2.4.The VGAM package 62 can be used to fit these multinomial logistic regression models, and generate observed event probabilities (fitted values).For more details on the vector spline smoother see Van Hoorde et al. 34 Note that unlike MLR-IPCW, the BLR-IPCW and pseudo-value approaches ignore the multistate aspect of the data and treat each outcome category with a 'one vs all' approach.This means the observed event probabilities for an individual will only sum to 1 when using the MLR-IPCW approach.Furthermore, the calibration plots for MLR-IPCW are a scatter plot, reflecting the fact that individuals with the same predicted transition probability of state , may have different predicted transition probabilities of the other states, and in turn different observed event probabilities.This approach can therefore provide more insight into the calibration of a model over the BLR-IPCW and pseudo-value approaches.
To assess mean calibration, one would fit the following multinomial logistic regression model in the validation cohort: where the coefficients   are fixed to 1.The estimated  ̂ constitute the calibration intercepts from the calibration framework of Van Hoorde et al. 33 To get this estimate of mean calibration onto the probability scale, we again calculate observed event probabilities by generating fitted values from Model (4) for each individual  in the validation cohort using the following formulae: ̂ −, () = exp( ̂ +  ̂  ) The mean calibration for state  can then be calculated as the mean difference between the observed event probabilities and the predicted transition probabilities: )  ⁄ .
To assess weak calibration, fit Model (5) with no constraints and calculate the calibration slopes  ̂.

Simulation
The methods are presented using the Aims, Data generating mechanisms, Estimands, Methods and Performance (ADEMP) structure. 63All code for running the simulation is available on GitHub. 64

Aims
The primary aim was to compare the bias (and variability) of each of the calibration methods outlined in Section 2 when the assumption of non-informative censoring does and does not hold.

Underlying structures
Data was generated using the multistate model structure in Figure 1.This was designed to mimic the clinical setting from Putter et al., 16 predicting local recurrence (state 2), distant metastasis (state 3), local recurrence + distant metastasis (state 4), and death (state 5), after surgery (state 1) in patients with early stage breast cancer.
We first generated two baseline predictor variables  1 and  2 , both drawn from a standard Gaussian distribution.The survival times for transition  →  were simulated from an exponential distribution with hazard (0.5*  1 − 0.5 *  2 ) *  , .The predictor  1 represents a variable associated with higher risk of remission and/or death, and  2 represents a variable associated with lower risk of remission and/or death.The baseline hazard for each transition ( , ) was chosen so that the 7-year survival matched that of the cause-specific hazard reported by Putter et al. 16 The targeted values for each transition (and corresponding values of  , ) are provided in Table 1.7-year survival was targeted as this was the maximum follow up observed for all transitions.The censoring time was generated from an exponential survival distribution with hazard ( ,1 *  1 +  ,2 *  2 ) *   .The value of   = 5005 was chosen to induce a censoring probability of 0.4 over 7 years for individuals with  1 =  2 = 0.An uninformative censoring scenario (UIC) was induced by setting  ,1 =  ,2 = 0.A weakly informative censoring scenario (WIC) was induced by setting  ,1 = 0.25,  ,2 = −0.25.A strongly informative censoring scenario (SIC) was induced by setting  ,1 = 1,  ,2 = −1.

Process for simulation
Large validation sample ( = 200,000) and small validation sample ( = 3000,  = 1500) simulations were carried out for each scenario (NIC, WIC and SIC).Bias (bias and variance) of the calibration assessment was evaluated in the large (small) validation sample analysis.A sample size of 3000 was chosen to match the sample size used in the work of Putter et al. 16 A sample size of 1500 was chosen to test small sample performance even further.Sample size criteria do not yet exist for multistate models, 65 but would be dependent on the event rates of all the possible transitions.With this data generating mechanism some of the transitions are rare (approximately 2% of individuals are in state 4 after 7 years).We therefore believe 1,500 should be considered a very low sample size.
The large validation sample simulation followed the following steps for each scenario: 1. Generate a validation dataset of size  = 200,000 according to the relevant data generating mechanism.
2. Calculate the true transition probabilities for each individual at 7 years follow-up,   (7).
4. Assess mean and moderate calibration of ̂ , , ̂ , and ̂ , in the dataset using each of the calibration approaches (section 3.4).
5. Compare calibration to true calibration (estimand, Section 3.3) for moderate and mean calibration.
For the small sample simulation, a cohort of size 1,000,000 was generated.Cohorts of size 3000 or 1500 were then sampled at random (without replacement) and Steps 2 -5 were implemented.For Step 5 mean calibration was assessed.This process was repeated 1,000 times.This represents the process of validating the model in a cohort sampled randomly from the population of interest.

Estimands and other targets
The estimand of interest is the calibration of the predicted transition probabilities defined in the previous section (̂ , , ̂ , and ̂ , ).Let ̂  represent any of ̂ , , ̂ , and ̂ , .
Let    be the 'true' transition probabilities from state 1 to state  for individual  used in the data generating mechanism.These can be calculated from the specified cause specific hazards in section 3.2.1 following Putter et al. 10 The exact process and formulae for doing so is provided in Supplementary Material 1.When assessing moderate calibration, the estimand is the set of points {̂  ,    }.When assessing mean calibration, the estimand of interest is the average difference between the predicted transition probabilities and true transition probabilities: Note that the estimand is a function of both the 'true' transition probabilities used to generate the data and the predicted transition probabilities.In practice the predicted transition probabilities (̂  ) would be estimated by fitting a multistate model, however in this simulation they have been defined deterministically.The manner in which ̂ , and ̂ , were generated will result in a nonlinear calibration curves, desirable when assessing the ability of the proposed methods to evaluate moderate calibration.

Methods for analysis
When assessing moderate calibration, we compared the BLR-IPCW, MLR-IPCW and pseudo-values based on Aalen-Johansen estimator.The Aalen-Johansen estimator was not considered given it cannot estimate the estimand ({̂  ,    }).When assessing mean calibration we compared BLR-IPCW, MLR-IPCW and the Aalen-Johansen estimator.The pseudo-value approach was not considered as the only advantage of the pseudo-value approach over the Aalen-Johansen estimator is that it enables the estimation of calibration curves.For the pseudo-value and Aalen-Johansen approaches, individuals were grouped into 20 groups based on their risk of state  in order to assess calibration, as outlined in Sections 2.2 and 2.3.We apply this extra step in the NIC scenario, even though this would not be required in practice if the censoring mechanism is known to be non-informative.
We also conducted sensitivity analyses to assess the sensitivity of BLR-IPCW and MLR-IPCW to the calculation of the weights.For the main implementation of BLR-IPCW and MLR-IPCW, the weights were estimated from the data using a perfectly specified Cox model which adjusted for  1 and  2 .A second analysis was carried out using a misspecified Cox model for the time until censored which didn't adjust for  1 and  2 .A third analysis used the true weights, where the probability of being censored was calculated based off the data generating mechanism.A fourth analysis did not apply weights at all.

Performance metrics
In the large sample simulation, for moderate calibration bias was assessed by graphically comparing the estimated calibration curves from each method vs the true calibration curve represented by the set of points {̂  ,    }.For mean calibration the bias and standard error of the estimated mean calibration was reported for each method.Standard errors were calculated using bootstrapping, although we do not report the standard error for the Aalen-Johansen approach for computational reasons, given it is expected to be very small anyway.For the small sample simulation, we report the median bias and 2.5 -97.5 percentile range (across the 1000 simulation iterations) in the bias of the mean calibration estimate. 3.6.Results

Large sample simulation: Moderate calibration
Figure 2 contains the calibration plots of the perfectly calibrated (̂ , ), over predicting (̂ , ) and under predicting transition probabilities (̂ , ) in scenario NIC (blue lines), against the calibration plot produced by using BLR-IPCW to assess calibration.Equivalent plots for the pseudovalue and MLR-IPCW approaches are provided in the supplementary material (Figures S1 and S2).These Figures showcase the ability of each method to appropriately assess non-linear changes in calibration over the distribution of predicted transition probabilities.All methods provide an unbiased assessment of the calibration when censoring is non-informative.There are some minor deviations from the true calibration for states 2, 3 and 4, for which the predicted transition probabilities are much smaller, highlighting a bigger sample size is required to assess moderate calibration of rarer states.Variability in calibration estimates is considered in the small validation sample simulation.
Figures 3 and 4 contain the moderate calibration plots of the perfectly calibrated transition probabilities for scenarios WIC and SIC.As the strength of informative censoring increases, this introduces bias into all the calibration methods.However, even when informative censoring is at its highest (Figure 4), BLR-IPCW, MLR-IPCW and pseudo-value all provide a predominately unbiased assessment of calibration.The rug plots indicate that the bias only occurs over the range of predicted transition probabilities where data points are sparse.Comparing the BLR-IPCW and pseudo-value approaches (the calibration curves) in the SIC scenario (Figure 4), shows each method gives a better evaluation of calibration for different states.Equivalent plots for the over and under predicting transition probabilities (Figures S3 -S6) lead to the same conclusions.
The variance in the points in the MLR-IPCW scatter plot for each state is very small.This is due to a lack of variability in the predicted transition probabilities of the other states.The added information provided by this approach is therefore not evident from the simulation.We refer to the clinical example (Section 4) for an illustration of the MLR-IPCW approach where the scatter plot is more varied, and discuss the benefits of this approach in more detail there.It also appears there is a small number of data points where calibration is biased over dense regions of predicted transition probability (e.g.state 4 in Figure 4), in contrary to the arguments in the previous paragraph.However, this is due to the multidimensional nature of these calibration plots.These points represent a small group of individuals who have the same predicted transition probability of state 4 as many other individuals, but have predicted transition probabilities of the other states which few people do.Therefore this is in fact a sparse area of predicted transition probability with respect to the other states.
We tested the sensitivity of the BLR-IPCW and MLR-IPCW approaches to miss-specification of the censoring distribution (Figures S7 -S24).This resulted in a small increase in the bias of these methods, however the calibration was still unbiased over a large range of the predicted transition probabilities, indicating some protection against misspecification of the censoring distribution.

Large sample simulation: mean calibration
When there is no informative censoring (NIC), all methods give an unbiased estimate of the mean calibration (Figure 5).WIC starts to introduce bias into the mean calibration of the BLR-IPCW and MLR-IPCW approaches for states 1 and 5.When large levels of informative censoring are introduced (SIC) the bias increases further, whereas the Aalen-Johansen approach remained unbiased.Comparing this to the moderate calibration plots (Figures 4, S5 and S6), indicates the bias in the mean calibration is driven by a small number of outliers.Bias is only found for states 1 and 5 because increasing  1 or  2 only has a minor impact on the probability of being in states 2, 3 and 4. A change in  1 or  2 will increase or decrease the hazard for transitions both into and out of intermediate states.For example, an increase in  1 increases the rate of entry into state 2, but also increases the rate at which individuals leave state 2, having a net zero effect.This highlights that the issue of informative censoring may be present for some states but not others.We tested the sensitivity of the BLR-IPCW and MLR-IPCW approaches to misspecification of the weights (Figures S25 and S26).Misspecification of the weights had a minor effect on the bias, however implementing either approach without weights greatly increased the bias.

Small sample simulation: mean calibration
The results from the small sample simulation (N = 3000) are presented in Figure 6.In scenarios NIC and WIC all methods performed very similarly.For SIC there were some differences in performance for prediction of states 1 and 5, the states where informative censoring was at its strongest.The magnitude of bias was similar for AJ, BLR-IPCW and MLR-IPCW (although sometimes in opposite directions), but the variation in estimates was much bigger for the AJ approach.Reducing the number of groups patients are categorised into before evaluating calibration only had a minor impact on reducing this variation (Figures S27 and S28).Corresponding figures for N = 1500 (Figures 29 and S30) lead to similar conclusions.

[INSERT Figure 5: Bias (CI) of each method for assessing mean calibration, large sample analysis.] [INSERT Figure 6: Median bias and 2.5 -97.5 percentile range in bias of each method for assessing mean calibration, small sample analysis.]
4. Clinical example

Aim and setting
We aimed to demonstrate the application of each calibration method to typical clinical data and contrast the levels of information provided by each method.We developed an illustrative (not intended for clinical use) multistate model to predict the 10-year risk of co-existence of three longterm conditions, cardiovascular disease, type 2 diabetes and chronic kidney disease, in healthy individuals -a common and important multimorbidity state in which the develop of one condition increases the probability of the development of the other conditions.The clinical and economic impact of multimorbidity is high and rising in many parts of the world, [66][67][68][69][70][71][72][73][74][75] and preventing it is a priority for health systems.Start of follow up was defined as the maximum of date turned age 65, 1 st Jan 2000, and date of 1 year of up to standard registration in the database.Inclusion in the cohort was dependent on these three conditions being met.Patients were excluded if they had any of the outcomes prior to start of follow up.Censoring was defined as last date of data collection for the practice, or date transferred out of practice, which should be an uninformative censoring mechanism.

Methods
We extracted the transition times for the outcomes of incident cardiovascular disease, type 2 diabetes and chronic kidney disease (stage 3, 4 or 5), multimorbidity and death (Figure 7).There was a transition from every possible state to death.Outcome events were identified through the primary care, secondary care and death record data sources.Baseline data was extracted solely from the primary care record.We extracted: age, gender, systolic blood pressure, body mass index, total cholesterol/high density lipoprotein ratio, smoking status, index of multiple deprivation, and history of hypertension, depression and alcohol misuse.Code lists for all variables are available in Supplementary Material 1.To keep the scope of this example manageable, we wanted to focus on complete case datasets, however data were missing for systolic blood pressure, body mass index, cholesterol/high density lipoprotein and smoking status.We therefore generated a pseudocomplete case dataset by imputing missing values using a single stochastic imputation (achieved through a single multiple imputation chain of length 20, using all other variables as predictors).

[INSERT Figure 7: A multistate model for prediction of cardiovascular disease, type 2 diabetes and chronic kidney disease]
Model development and validation: Two development datasets (N = 5,000 and N = 100,000) and one validation dataset (N = 100,000) were sampled at random from the overall cohort.Multistate models were developed in both development datasets following the approaches of Putter et al., 10 and de Wreede et al., 49 taking a clock forward approach.Cox proportional hazard models adjusting for all baseline variables were fitted for each cause-specific hazard depicted in Figure 7.Note it was also possible that individuals were diagnosed with two conditions on the same date (i.e.transition from state 1 straight to state 5).However, we do not believe these transitions to be representative of the true underling disease process (long term conditions were likely developed at different times and only recorded in database once the patient engaged with the health system).The number of individuals who made these transitions was also small.We therefore modelled the cause-specific hazard for these transitions using a Kaplan Meier estimator and make no adjustment for baseline predictors.
After model development, transition probabilities were estimated for every patient in the validation cohort following the process of Putter et al., 10 and de Wreede et al. 49 The censoring distribution was estimated using a Cox proportional hazards model adjusting for all baseline covariates.This model was used to estimate each individuals probability of being censored at 10-years, which were converted into IPCWs (capped at a maximum weight of 10 77 ).For individuals who entered the absorbing death state, the probability of being uncensored at the time of death was used.Mean and moderate calibration was then assessed using the Aalen-Johansen, BLR-IPCW, MLR-IPCW and pseudo-values (based on the Aalen-Johansen estimator) approaches.

Results
The number of transitions over the entire course of follow up in the development (N = 5,000) and validation (N = 100,000) datasets are given in supplementary Table S1.The moderate calibration according to each method for N = 5,000 development sample are presented in Figure 8. Focusing on the pseudo-value and BLR-IPCW calibration curves, calibration of the transition probabilities of being in state 2 (CVD) is the worst, under predicting for lower risks and over predicting higher risks (possibly a sign of overfitting).A similar pattern is seen for transition probabilities into state 4 (CKD), although it is less extreme.There is over prediction of the transition probabilities into state 3 (CKD) and state 5 (multimorbidity) only at the highest predictions.Calibration of the transition probabilities into states 1 (healthy) and 6 (death) are the strongest.
The MLR-IPCW calibration scatter plot reflects quite a lot of variation in the calibration of the transition probabilities into states 2 and 4. The interpretation of this is that even for individuals with predicted transition probabilities which are deemed to be 'well calibrated' according to the BLR-IPCW or pseudo-value plots, there may be considerable miscalibration depending on their predicted transition probabilities for the other states.This contrasts with the calibration scatter plots for the model developed on the N = 100,000 sample (Figure S31).While the plots for BLR-IPCW and pseudovalues look similar in Figures 8 and S31, the difference in the calibration scatter plots produced by MLR-IPCW is notable.This evidence would lead to the conclusions that the N = 100,000 is a better calibrated model, a conclusion that may not have been reached looking only at the BLR-IPCW and pseudo-value plots.

Discussion
We have detailed a range of techniques for evaluating the calibration of the transition probabilities out of the initial state of a multistate model.The BLR-IPCW, MLR-IPCW and pseudo-value approaches each yielded unbiased calibration curves (or scatter plots for MLR-IPCW) under noninformative censoring.The calibration plots showed relatively small bias even in the presence of strong levels of informative censoring.At small sample sizes all methods had similar levels of bias when assessing mean calibration, however variability in the Aalen-Johansen estimator was larger than for BLR-IPCW and MLR-IPCW.We recommend assessing calibration using one of either BLR-IPCW or pseudo-value approach, both of which produce a smoothed calibration curve, alongside MLR-IPCW, which provides a deeper evaluation of calibration through a scatter plot.
The BLR-IPCW and pseudo-value approaches produce curves because they assess the calibration of the transition probabilities into each state vs not being in that state, taking a "one vs all" approach.On the other hand, MLR-IPCW considers the probability of being in each state simultaneously, providing information on whether the observed event rate for state  (state of interest) differs for individuals with the same risk of being in state , but different risks of being in states ≠ .To illustrate this, suppose we have a three-state illness-death model, where individuals with a 50% risk of being in the 'illness' state at time  either have a 50% risk of being in the 'healthy' state (group A), or have a 50% risk of being in the 'death' state (group B), and that these groups are the same size.Now suppose that the model overpredicts the risk of being in the 'illness' state at time  for group A, and underpredicts for group B by the same amount.Calibration of the transition probabilities into the 'illness' state is therefore poor for all these individuals.However, the BLR-IPCW and pseudovalue approach would find the model to be well calibrated for these individuals as the event rate would equal the average predicted risk across all individuals in both groups A and B. This type of miscalibration would be identified by the MLR-IPCW approach (see Van Hoorde et al., 34 for further discussion of this approach).Note that that confidence intervals can be calculated for the calibration curves using bootstrapping, however, assessing the variability in the scatter plot produced by MLR-IPCW is less tractable.
A benefit of the BLR-IPCW and MLR-IPCW approaches over the pseudo-value approach based on the Aalen-Johansen estimator is that they are cross-sectional and make no assumptions about the time scale or whether the Markov assumption holds.For example, in this study we have assumed the time scale in all models to be clock-forward, 10 however in practice the analyst may want to assess the calibration of a model developed under a clock-reset framework (where time resets to 0 after entry into a new state).This is also known as a semi-Markov or Markov renewal model, and the Markov assumption is violated.While we argued that the Aalen-Johansen approach was valid even when the Markov assumption was violated under our assumptions (section 2.2), this would not be the case if some individuals started in subsequent states, or if we were interested in assessing calibration of transition probabilities out of subsequent states.Biologically, end-organ susceptibility to common risk factors such as obesity varies in populations, where one individual may suffer kidney damage from high blood pressure, another may develop diabetes before kidney disease.In this case, other estimators do exist that could replace Aalen-Johansen, such as the Landmark Aalen-Johansen estimator, 52 or estimators designed specifically for Markov renewal processes. 78,79While these estimators can replace the Aalen-Johansen estimator when calculating the pseudo-values, the added benefit of BLR-IPCW and MLR-IPCW is that they can be applied ubiquitously to validate a multistate model developed with any range of assumptions or structure.A second advantage is that the pseudo-value approach takes considerably longer to implement than the BLR-IPCW or MLR-IPCW approaches from a computational perspective, which impacts the ability to calculate confidence intervals through bootstrapping.
The BLR-IPCW and MLR-IPCW approaches are dependent on correct estimation of the IPCWs.Even when censoring is non-informative (i.e.happens at random), the application of weights is essential for these methods in the presence of an absorbing state.Only in the presence of informative censoring, the pseudo-values and Aalen-Johansen estimator must be calculated within subgroups of predicted risk.The BLR-IPCW and MLR-IPCW approaches require conditional independence in the reweighted population, whereas the Aalen-Johansen and pseudo-value approaches require conditional independence given the predicted risks.In the simulation, when censoring was strongly informative, we found that calibration was poor over the range of predicted risks where density was low for both methods, but there was variability in the direction and magnitude of bias, and for which states it was present.Future research may usefully compare these two ways of dealing with informative censoring.Our work constitutes phase I and phase II of the methodological research pipeline, 80 and covers some aspects of a phase III study.Phase III and IV work comparing these methods in a wider range of scenarios would be of high value.We recommend this work focus on how the pseudo-value and IPCW approaches deal with informative censoring, with a wider range of data generating mechanisms and event rates.
This study was restricted to when the following two conditions held: 1) all individuals started in initial state, and 2) only interested in calibrating transition probabilities out of the initial state ( = 1) at time   = 0.All of the methods outlined in this paper can be implemented for  ≠ 1 and   > 0 through landmarking, 46,47 where calibration would only be assessed in individuals present in state  at time   .We recommend this should be explored in future work.While we have focused on irreversible multistate models, there is no reason these methods cannot also be applied to reversible models.Furthermore, we have focused on calibration of the transition probabilities, however there are a variety of possible approaches for assessing calibration.It may be of interest to validate the probability of a specific clinical outcome (such as developing multimorbidity in our clinical example).This risk may be the sum of multiple transition probabilities (i.e. the sum of the probabilities of being in the multimorbidity state, and of having entered death via the multimorbidity state).For this aim, graphical calibration curves 27 could be applied, where the outcome is a binary outcome: development of multimorbidity.Another approach may be to validate each competing risks sub-model for the transitions out of each state separately.This may help explore which sub-models are causing miscalibration in the estimated transition probabilities.For this aim, graphical calibration curves for competing risks models 28 would be appropriate.We focused on transition probabilities in this study as there is the least existing guidance in this area.However, we believe a study collating the methodology for each of the three approaches, and highlighting when each is appropriate, would be of use.
To our knowledge, this is the first study to detail how to assess the calibration of the transition probabilities of a multistate model, within the context of evaluating the performance of a risk prediction model.]22,[26][27][28][29][30][31][32] All the methods considered (BLR-IPCW, MLR-IPCW and pseudo-values based on Aalen-Johansen estimator) give unbiased estimates of the calibration.We recommend producing a calibration curve using either BLR-IPCW or pseudo-values, and reporting alongside a calibration scatter plot using MLR-IPCW.Work is underway to embed the IPCW approaches into an R package.

Discussion of reliance on the Aalen-Johansen estimator on the Markov assumption in this setting
In general, the Aalen-Johansen estimator is only a consistent estimator under the Markov assumption for the multistate process (), which means at all times  "given the present state and the event history of a patient, the next state to be visited and the time at which this will occur will only depend on the present state". 1 If this does not hold, the estimator may give a biased estimate of transition probabilities for non-Markov processes (and in turn a biased estimator of calibration), however this is not the case for this specific setting.Datta and Satten 2 and Glidden 3 show that the Aalen-Johansen estimator is a consistent estimator of the state occupation probabilities.State occupation probabilities are defined as: where (0) is the vector of probabilities of being in each state at  = 0, such that   (0) = ((0) = ), and (0, ) is the x matrix of transition probabilities where element {, } is  , (0, ).
When everyone starts in the initial state, all elements of (0) = 0 except the first element,  1 (0) = 1.The state occupational probabilities then reduce to the vector of transitions probabilities out of state 1: which is what we are trying to estimate.
Therefore, by the work of Datta and Satten 2 and Glidden, 3 the Aalen-Johansen estimator gives a consistent estimate of the transition probabilities out of state 1, and the process outlined at the start of this section will give an unbiased estimate of calibration even for non-Markov multistate processes.In summary, if all individuals start in the same initial state, and interest only lies in estimating transition probabilities out of the initial state, the Aalen-Johansen estimator is sufficient for assessing calibration.Estimators which do not depend on the Markov assumption [4][5][6][7] are not required.
9. Formulas to calculate the true transition probabilities under data generating mechanism 1 For predictors which are 'history of', we derived a variable which indicates whether an individual has a record of the comorbidity prior to their index date in their primary care record.
For predictors reliant on test data (BMI, SBP, cholesterol/HDL ratio, smoking status), we looked in the five years prior to the index date for an occurrence of the variable.Appropriate conversions were applied based on the unit of measurement recorded in the database.Extreme values were then removed.See Github page for full algorithms for extraction of each variable. 8 outcomes, we derived the time until first occurrence or censoring, and a censoring indicator, in both the primary care, secondary care and ONS datasets.We then took the event as the time until the first of any of these to occur.We also derived presence of each condition at baseline, in order to apply our exclusion criteria.Cardiovascular disease was a composite variable consisting of coronary heart disease, heart failure, myocardial infarction, stroke and transient ischaemic attack.
CKD (stage 3, 4 or 5) events were identified through medical codes and test data, namely eGFR scores and creatinine measurements.The process for doing so is below.
Identifying CKD from eGFR/GFR, and estimating eGFR/GFR from creatinine KDIGO guidelines give definition of CKD. 9 This is from 2012, but are currently still the guidelines they recommend.Abnormalities of kidney function must be present for > 3 months.This is CKD-EPI equation 10 they recommend for converting creatinine mesaurements to GFR/eGFR scores (also requires sex, ethnicity and age).Note that they state something along the lines of "this, or any equation shown to be better than this" can be used.Therefore there may be a more recent equation for this conversion, as the one they recommend was developed in 2009.This comparison shows the CKD-EPI equation is better than the MDRD, which is another alternative. 11 attached code looks for two entries below a certain value (60)  Note that some of the code lists were edited after being moved onto incline as some formatting changes had to be made.I am keeping a table of the pre-formatted variable names for personal reference (Table 2) as both have been saved on GitHub. 8

[INSERT Figure 3 :[INSERT Figure 4 :
̂, ) and under predicting transition probabilities ( ̂, ) in scenario NIC, large sample analysis.]Moderate calibration of each method for perfectly calibrated transition probabilities ( ̂, ) in scenario WIC, large sample analysis.]Moderate calibration of each method for perfectly calibrated transition probabilities ( ̂, ) in scenario SIC, large sample analysis.] that are more than 90 days (3 months) apart.Dataset should be in format of: -1 row per eGFR score -Variables: person_id(identifier for individuals), EntryDate (date of code), CodeValue (value of eGFR), num (increasing integer indicating the observation number for each individual)Overall strategy:1) Identify CKD stages 3/4/5 using medical codes 2) Extract eGFR/GFR + creatinine scores from test data 3) Convert creatinine measurements to eGFR scores 4) Use algorithm to identify if individuals meets criteria for CKD stages 3/4/5 using algorithm provided

Table 1 :
Targeted 7-year survival and corresponding baseline hazards  , used to generate the data in data generating mechanism 1 76ta source: Data from the Clinical Practice Research Datalink (CPRD) Aurum, a primary care dataset containing data from general practices with the EMIS Web® computer systems in England and Northern Ireland, was used is this study.CPRD Aurum contains > 39 million historical patients, and > 13 million currently registered.It is representative of the English population in terms of age, gender, geographical spread and deprivation (as of 2019).76Theserecords were linked to Hospital Episode Statistics (HES) and deaths data from the Office for National Statistics (ONS).
9.1.Preliminaries, notation and overview of how transition probabilities are calculated The cause specific hazards for transitions from state  to state  at time ,   () are  12 ,  13 ,  15 ,  24 ,  25 ,  34 ,  35 ,  45 .Let   () be the survival function for state , the probability of not having suffered any of the competing events out of state  prior to time  (assuming one is in state  at time point 0).() = exp(− ∑ ℎ  ()  ), where ℎ  () is the cumulative cause-specific hazard up to time , for transition to state .Let   (, ) be the conditional probability of being in state  at time , when in state  at time .(, ) be the conditional probability of being in state  at time , when in state  at time , and having reached state  via the specified route, .Note that here  could be a collection of states (for example reaching state 5 from state 1, via states 2 and 4, would be denoted  25 24 (, ) The transition probabilities we want to calculate are  12 (, ),  13 (, ),  14 (, ) and  15 (, ).The approach is to calculate the simplest transition probabilities first (i.e.probability of transition from state 4 to state 5,  45 (, ), and work backwards from there.Following the theory of Putter et al. 1 A key property used throughout, is that the probability of having not left state  by time , if in state  at time .A second key property is that the probability of being in state  at time , when in state  at time , is equal to the integral of the following quantity over all time points  between  and : The product of (1) cause-specific hazard of transitioning out of state  to some intermediate state  at time  (  ()), (2) the probability of having remained in state  until time point  (see previous equation), and (3) the conditional probability of being in state  after time , when in state  at time ,   (, ).This third quantity will have been derived by first calculating the transition probabilities at the end of the model, and using these in subsequent calculations.(i.e.first calculate the probability of transition from state 4 to state 5,  45 (, ), and work backwards from there).These are symmetrical in terms of how they are derived)Calculate each of the transition probabilities (will require multiple integrations I guess).Although the way it's written, each function will only have one explicitly defined integration, so it shouldn't get too messy.The others will be hidden/contained with already defined functions.I will then generate a cohort with all individuals same predictor, and compare the event rates to the true probabilities I have derived, to see if they match.They are symmetrical in terms of how they are derived Operational definitions for extracting variables The index date is defined as the start of follow up: maximum of date turned age 65, 1 st Jan 2000, and date of 1 year of up to standard registration in the database

Table SM3 :
Pre-formatted code list names and post-formatted code list names Code lists used for HES/ONS extraction TableSM4contains the variables and ICD 10 codes used to extract it.For chronic kidney disease, the full 5 digit ICD 10 code is required to separate CKD stages 1 and 2 from stages 3, 4 and 5.All others only require the initial 3 digits for extraction.Following the process in QRISK3, CHD and MI were grouped, and Stroke and TIA were grouped into one outcome.

Table S1 :
Number of transitions in and out of each state in the clinical example 75