## 1 Introduction

Stroke is the rapidly developing loss of brain function due to a disorder in the blood supply to the brain. It can cause serious complications that may lead to death. Stroke is the third largest cause of death in the UK and the USA [1, 2]. Non-fatal stroke may cause serious complications including permanent neurological damage and adult disability.

Multi-state modelling is a method of analysing longitudinal data when the observed outcome is a categorical variable. In medical research, multi-state models are often used to model the development or progression of a disease, where the different levels of the disease can be seen as the states of the model. This approach enables the investigation of ageing in the older population by jointly modelling the rate of having a non-fatal stroke or dying in healthy individuals and the rate of dying after having a non-fatal stroke. Multi-state models have been used in a wide range of applications including AIDS [3], liver cirrhosis [4], cognitive impairment [5], coronary heart disease [6], stroke [7] and various types of cancer [8, 9]. Putter *et al.* [10] have published a concise introduction to multi-state modelling.

Norris [11] has discussed the theory of stochastic processes and Markov chains. Fitting multi-state models involves various assumptions. A common assumption is that the data satisfy the first-order time-homogeneous Markov property. According to this assumption, the transition to the next state depends only on the current state, so that any previous history of the process can be ignored. Although this assumption simplifies statistical modelling, it may often be inappropriate and lead to incorrect conclusions. A number of extensions to the theory have been proposed, including the incorporation of history in the underlying stochastic process. Weiss and Zelen [12] first proposed a semi-Markov model for clinical trials. In semi-Markov models, the transition to the next state depends not only on the current state but also on the time spent in the current state. This involves the exact transition time from one state to the other, which in many applications is unknown. In 1999, Commenges introduced the terminology of a partial Markov model [13]. In partial Markov models, the transition to the next state depends not only on the current state but also on a multivariate explanatory process that can be predicted at the current state. This enables the inclusion of explanatory covariates in multi-state modelling. Faddy [14] originally applied a model with piecewise-constant transition intensities, which enables intensities to depend on time-varying covariates. Van den Hout and Matthews [15] have also discussed a piecewise-constant approach for the effect estimation of explanatory variables in multi-state modelling.
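
As a sketch only, the assumptions described above can be written in terms of transition intensities. The notation is introduced here for illustration and is not taken from the cited works: $X(t)$ is the state occupied at time $t$, $\mathcal{H}_{t^-}$ is the history of the process before $t$, $u$ is the time elapsed since entry into the current state, and $\tau_0 < \tau_1 < \dots$ are assumed cut points of the time scale.

```latex
\begin{align*}
% First-order Markov property: given the current state, the earlier history is irrelevant
P\{X(t+h)=s \mid X(t)=r,\ \mathcal{H}_{t^-}\} &= P\{X(t+h)=s \mid X(t)=r\}, \\
% Transition intensity from state r to state s, for r \neq s
q_{rs}(t) &= \lim_{h \downarrow 0} \frac{P\{X(t+h)=s \mid X(t)=r\}}{h}, \\
% Time-homogeneous Markov model: intensities do not change with time
q_{rs}(t) &= q_{rs}, \\
% Semi-Markov model: intensities depend on the time u spent in the current state
q_{rs}(t \mid \mathcal{H}_{t^-}) &= q_{rs}(u), \\
% Piecewise-constant model: intensities are constant within each interval
q_{rs}(t) &= q_{rs}^{(k)} \quad \text{for } \tau_{k-1} \le t < \tau_k .
\end{align*}
```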

Longitudinal studies, as opposed to cross-sectional studies, involve repeated observations on the same individuals over time. In such studies, researchers often recruit individuals over a range of ages at which some participants may have already developed and progressed through the different study endpoints. Longitudinal data are usually collected by monitoring individuals at prespecified times over the period of an observational study. Thus, the values of monitored variables are known only at a discrete set of times. The case where the exact value of a variable is unknown and only partial information is available is referred to as censoring [16]. There are three types of censoring, namely left, right and interval censoring. In left and right censoring, the value of a variable is known to lie below and above a certain value, respectively. In interval censoring, the value of a variable is known to lie within an interval with known limits. Methods for handling right-censored data have been discussed in a number of statistical textbooks [17, 18] and are widely implemented in medical research. However, methods for adjusting for left censoring are less frequently employed in longitudinal studies [19]. Ignoring the presence of left censoring when estimating the underlying stochastic process that explains the observed data may cause substantial bias [19]. Cain *et al.* have shown that including individuals whose data are subject to left censoring (by collecting all necessary information at the time of recruitment) rather than excluding them from the analysis reduces bias significantly [19]. Left truncation is a notion similar to, but distinct from, left censoring. A left-truncated distribution is one formed from another distribution by cutting off and ignoring the part lying to the left of a fixed variable value [20]. A left-truncated sample is likewise obtained by ignoring all values smaller than a fixed value [20]. Left truncation may occur in longitudinal studies when individuals who have already developed and progressed through the different study endpoints before the beginning of the study are not included in the study. One reason for an individual not to be included is death before the start of the study. In 1986, Kay [21] introduced a method that dealt with the problem of right censoring and also handled the case where the time of death is known precisely. Foucher *et al.* have investigated ways to fit multi-state models in the presence of left, right and interval censoring by using a generalised Weibull distribution for the waiting times of the underlying process [22]. Interval censoring has often been dealt with by integration [6]. In 1993, Lindsey and Ryan [8] presented another approach for adjusting for interval censoring based on the Expectation–Maximisation (EM) algorithm.
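
To make the three types of censoring concrete, the following minimal sketch classifies the unknown onset age of an irreversible event from a sequence of scheduled visits. It is an illustration only: the function name `classify_onset`, its arguments and the example ages are hypothetical, and it assumes that visits are sorted by age and that the event, once observed, remains observed at all later visits.

```python
from typing import List, Optional, Tuple


def classify_onset(
    visit_ages: List[float], has_event: List[bool]
) -> Tuple[str, Optional[float], Optional[float]]:
    """Return (censoring type, lower bound, upper bound) for the onset age.

    Hypothetical helper: assumes visits are sorted by age and that the
    event is irreversible (once observed, it stays observed).
    """
    if not visit_ages or len(visit_ages) != len(has_event):
        raise ValueError("visit_ages and has_event must be non-empty and of equal length")

    # Event already present at the first visit: onset lies below the baseline age.
    if has_event[0]:
        return ("left", None, visit_ages[0])

    # First visit at which the event is seen: onset lies between two visits.
    for previous_age, age, seen in zip(visit_ages, visit_ages[1:], has_event[1:]):
        if seen:
            return ("interval", previous_age, age)

    # Event never seen: onset, if any, lies above the last visit age.
    return ("right", visit_ages[-1], None)


if __name__ == "__main__":
    print(classify_onset([65, 67, 70], [True, True, True]))
    print(classify_onset([65, 67, 70], [False, False, True]))
    print(classify_onset([65, 67, 70], [False, False, False]))
```

An individual who would have been recruited but died before baseline never appears in `visit_ages` at all; this is left truncation, which no per-individual classification of this kind can detect.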

This paper presents a method to incorporate history in the underlying process in the presence of left truncation and left, right and interval censoring. The proposed model combines properties of semi-Markov models and partial Markov models. We handle interval censoring by integration and adjust for left censoring by using an EM-inspired algorithm [23]. We bypass left truncation by analysing data only over the period of follow-up, although the adjustment for left censoring requires assumptions about the process before baseline. We illustrate the method in an application by using data from the UK Medical Research Council Cognitive Function and Ageing Study (MRC CFAS). The objective was to investigate ageing in the older population by modelling the transition intensities in a three-state model that comprises the states ‘healthy’ (state 1), ‘history of stroke’ (state 2) and ‘death’ (state 3) and to investigate how the time elapsed since a stroke affects the rate of dying. Statistical inference about ageing is feasible only for the older population because the study includes individuals in their 65th year and above. Survival after having a stroke has been discussed in several articles [1, 24]. These articles assist the understanding of the mechanisms and the difficulties that exist in the particular data set that is used in the application and enable the validation of the results of the proposed method.
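
The idea of handling interval censoring by integration can be illustrated on the three-state structure described above. The sketch below is a deliberate simplification and not the model proposed in this paper: it assumes constant (time-homogeneous Markov) intensities `q12`, `q13` and `q23`, whose numerical values are hypothetical and are not estimates from the MRC CFAS. The probability of being in state 2 at time `t`, given state 1 at time 0, is obtained by integrating over the unobserved time of stroke and is compared with the closed form available under these assumptions.

```python
import math


def p12_closed_form(q12: float, q13: float, q23: float, t: float) -> float:
    """P(state 2 at time t | state 1 at time 0) under constant intensities."""
    leave1 = q12 + q13  # total intensity of leaving state 1
    if math.isclose(leave1, q23):
        return q12 * t * math.exp(-leave1 * t)
    return q12 / (leave1 - q23) * (math.exp(-q23 * t) - math.exp(-leave1 * t))


def p12_by_integration(q12: float, q13: float, q23: float, t: float, n: int = 2000) -> float:
    """Same probability, integrating over the unobserved stroke time u in (0, t).

    Integrand: stay in state 1 until u, move to state 2 at u, then survive
    in state 2 from u to t. Midpoint rule with n subintervals.
    """
    leave1 = q12 + q13
    h = t / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += math.exp(-leave1 * u) * q12 * math.exp(-q23 * (t - u))
    return total * h


def transition_matrix(q12: float, q13: float, q23: float, t: float) -> list:
    """Transition probability matrix over a time interval of length t.

    Rows and columns are ordered as healthy, history of stroke, death.
    """
    p11 = math.exp(-(q12 + q13) * t)
    p12 = p12_closed_form(q12, q13, q23, t)
    p22 = math.exp(-q23 * t)
    return [
        [p11, p12, 1.0 - p11 - p12],
        [0.0, p22, 1.0 - p22],
        [0.0, 0.0, 1.0],  # death is absorbing
    ]


if __name__ == "__main__":
    # Hypothetical intensities per year, chosen only for illustration.
    q12, q13, q23, t = 0.02, 0.05, 0.15, 2.0
    print(p12_closed_form(q12, q13, q23, t))
    print(p12_by_integration(q12, q13, q23, t))
    for row in transition_matrix(q12, q13, q23, t):
        print(row)
```

When the intensity of dying after a stroke depends on the time since the stroke, the closed form is no longer available in general, but the same integral over the unobserved stroke time can still be evaluated numerically.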

Section 2 presents the available data of the MRC CFAS. Section 3 presents the statistical model and the method to include time-varying explanatory covariates in the presence of right and interval censoring. Section 4 discusses the handling of left censoring. A simulation study in Section 5 shows how assumptions about the process before baseline affect the performance of the method. Section 6 illustrates the method on the MRC CFAS data and investigates model fit graphically. Section 7 concludes with a discussion.