Bayesian semi‐parametric G‐computation for causal inference in a cohort study with MNAR dropout and death

Causal inference with observational longitudinal data and time‐varying exposures is often complicated by time‐dependent confounding and attrition. The G‐computation formula is one approach for estimating a causal effect in this setting. The parametric modelling approach typically used in practice relies on strong modelling assumptions for valid inference and moreover depends on an assumption of missing at random, which is not appropriate when the missingness is missing not at random (MNAR) or due to death. In this work we develop a flexible Bayesian semi‐parametric G‐computation approach for assessing the causal effect on the subpopulation that would survive irrespective of exposure, in a setting with MNAR dropout. The approach is to specify models for the observed data using Bayesian additive regression trees, and then, use assumptions with embedded sensitivity parameters to identify and estimate the causal effect. The proposed approach is motivated by a longitudinal cohort study on cognition, health and ageing and we apply our approach to study the effect of becoming a widow on memory. We also compare our approach to several standard methods.


| INTRODUCTION
Causal inference in non-randomised longitudinal studies with time-varying exposures is often complicated by time-dependent confounding and attrition. Attrition is inevitable especially if individuals in the studied population are older and followed over a long time period. Additionally, for cohort studies, an individual's data are only recorded if that person completes follow-up testing. Hence, data for not only the outcome but also exposure level and confounders are missing at subsequent test waves.
The G-computation formula (Robins, 1986) is one approach for estimating a causal effect of time-varying exposures when time-varying confounding is present. The approach is completely non-parametric in its original form, although a parametric modelling approach based on maximum likelihood estimation is most typically used in practice (e.g. Snowden et al., 2011;. Valid inference with the parametric G-formula requires correct model specification. This can be extremely difficult when there is a large set of regressors, the relationship is non-linear and/or includes interaction terms and there are multiple observation times. Non-and semi-parametric estimation techniques that do not require prespecified distributional or functional forms of the data, have become popular in the causal inference literature (e.g. Häggström, 2018;Hill, 2011;Karim et al., 2017;Kim et al., 2017;Tan & Roy, 2019;Wager & Athey, 2018). One such modelling strategy is Bayesian Additive Regression Trees (BART, Chipman, George, McCulloch, et al., 2010). BART is a sum-oftrees model that adds together the predictions of a number of regression trees regularised by prior distributions. BART does not rely on strong modelling assumptions and in contrast to other tree-based algorithms BART yields interval estimates for full posterior inference.
A number of methodologies have been applied to address missing response or missing covariate data in causal effect estimation of longitudinal data under an assumption of missing at random (MAR; Chen & Zhou, 2011;Robins et al., 1995). These methods, however, are generally invalid when the missingness is missing not at random (MNAR) or due to death (Kurland et al., 2009). Partly conditional models have been proposed to address the combination of dropout and truncation by death, where inference is conditioning on the sub-population being alive at a specific time-point (Kurland & Heagerty, 2005;Li & Su, 2018;Rizopoulos, 2012;Shardell & Miller, 2008;Wen & Seaman, 2018). However, conditioning on survival may introduce bias due to the fact that survival is a post-randomisation event. One estimand that has gained much attention to address this issue is the 'survivors average causal effect' (SACE), that is, the causal effect on the subpopulation of those surviving irrespective of exposure (Frangakis & Rubin, 2002;Frangakis et al., 2007). Several approaches have been developed for estimation of the SACE in longitudinal randomised control studies (e.g. Lee & Daniels, 2013;Lee et al., 2010;), or in context of semicompeting risks (Comment et al., 2019;Xu et al., 2019). For observational data Tchetgen Tchetgen (2014) developed a weighting estimator to identify the SACE without missingness and Shardell et al. (2014) identified the SACE with MAR missingness using also a weighting technique. Moreover, Josefsson et al. (2016) proposed assumptions to identify the SACE of a baseline exposure on a longitudinal outcome under MNAR missingness for the outcome using parametric methods. These approaches however, do not appropriately account for MNAR data among survivors when the exposure and confounding are time-varying. Shardell and Ferrucci (2018) proposed a parametric shared parameter model with G-computation to identify a principal stratum causal effect for observational longitudinal data with time-dependent confounding. A drawback of their approach is that unbiased estimation depends on correct model specification and it does not appropriately account for MNAR data among survivors.
Widowhood has been identified as an important social factor associated with increased mortality (Håkansson et al., 2009) and cognitive impairment (e.g. Mousavi-Nasab et al., 2012). Here, our goal is to develop a framework for assessing the impact of becoming a widow on memory, a monotone exposure, by estimating the SACE in a setting with MNAR dropout among survivors. The proposed approach is motivated by the Betula study (Nilsson et al., 1997), where individuals are followed over multiple test waves to study how cognitive functions potentially deteriorate with age and identify risk factors for dementia.
The remainder of the paper is organised as follows. In Section 2, we introduce the notation and the causal estimand. In Section 3, we propose identifying the default assumptions and sensitivity parameters to allow deviations from these assumptions, followed by the identification of the SACE in Section 4. In Section 5, we propose a Bayesian semi-parametric (BSP) modelling approach for the observed data distributions and the algorithm for estimation of the SACE. In Section 6, we provide a simulation and in Section 7 an application to the Betula data. Finally, we conclude with a discussion and possible future work in Section 8.

| Data structure and notation
We begin with a formal description of the data. Let i=1,2,…,N denote individual and j=0,1,…,J denote time (the data used from the Betula study has J=3 follow-up test waves). We denote the vector of baseline confounders by X i0 (gender, education and age cohort) and the time-varying confounder by W ij (if the spouse has been seriously ill between the j−1th and jth test wave). The continuous memory outcome is denoted by Y ij and the binary exposure is denoted by Z ij . We assume a monotone exposure where initially all subjects are unexposed (Z 10 = 0 for all i). If a subject is exposed (widowed) at test wave jZ ij = 1 and if Z ij = 1, then Z ik = 1 for k>j. Let S ij denote survival, where S ij = 1 if an individual is alive at the time of the testing and 0 otherwise. Let R ij be a dropout indicator, where R ij = 1 if an individual has completed the cognitive testing or 0 otherwise. We have monotone missingness, so if R ij = 0, R ik = 0 for k>j. Note that vital status information is presumed to be available even after dropping out of the study. The history of the time-varying variables are denoted with an overbar. For example, the exposure history for individual i through test wave j is denoted by Z ij = {Z i0 , Z i1 , …, Z ij }. Furthermore, for individual i, J r i denotes the number of test waves (s)he participates in the study and J s i ≥ J r i denotes the number of test waves (s)he is alive. A simplified version of the study design restricted to two test waves is depicted in a causal diagram in Figure 1.

| Causal estimand
The goal of the study is to estimate the causal effect of becoming a widow (within 5 years) on memory among those who would survive irrespective of being widowed or not. We consider two contrasting F I G U R E 1 A causal diagram of a simplified version of the Betula study design restricted to two test waves JOSEFSSON aNd daNIELS exposure regimes, z ij = {z i0 = 0, …, z ij−1 = 0, z ij = 1}, that is, individuals exposed (widowed) between the j−1th and jth wave, and the contrasting regime z � ij = {z i0 = 0, …, z ij = 0 }, that is, individuals unexposed through test wave j, for j=1,2,3. Below, we generally suppress the subscript i to simplify notation. The potential memory outcome at wave j is denoted by Y j (z j ) for an individual under exposure regime z j . Similarly, let S j (z j ) be the potential survival outcome at wave j, denoting survival under exposure regime z j .
We consider a principal stratum causal effect of a time-varying exposure on the outcome, at wave j, for those who would survive under either exposure regime, However, main interest is not the effect at a specific wave, but rather the effect aggregated over test waves, defined as

SENSITIVIT Y PARAMETERS
To identify the causal effect in Equation (2) from the observed data we first introduce a set of assumptions followed by a set of sensitivity parameters to assess the impact of violations to some of the assumptions. The sensitivity parameters (and their values) will be explained in relation to the Betula data in Section 7.2.

| Assumptions
Assumptions 1-3 are a set of standard assumptions for causal inference of longitudinal observational data: Assumption 1 Consistency: For a given individual, if Z j = z j , then Y j = Y j (z j ) and S j = S j (z j ). Assumption 2 Positivity for a monotone exposure: for z j = 0, 1 and for all individuals, such that all unexposed individuals have a nonzero probability of becoming exposed between test wave j−1 and j if p(y j−1 , z j−1 , w j , r j−1 , s j−1 = 1, x 0 ) ≠ 0. Assumption 3 Conditional exchangeability: If X 0 and W j contains all pre-exposure covariates related to exposure, potential outcomes and survival, then for all exposure regimes That is, at each test wave j, being exposed z j is as if randomised conditional on the set of the temporally preceding variables. The assumption of conditional exchangeability is likely to be violated in many settings and is impossible to assess from the data. Therefore, we introduce a sensitivity parameter to investigate the sensitivity for unmeasured confounding in Section 3.2. (1) In cohort studies Y j , Z j and W j are not observed (but defined) for individuals who are alive but who drop out of the study. We make an MAR type assumption conditional on being survival at time j (MAR-S) to identify the distribution of dropouts among survivors.

Assumption 4 Dropout among survivors: For all j≥1 and all t≤j
That is, the outcome is distributed the same among dropouts and non-dropouts conditional on survival and the temporally preceding variables. Similarly, p(w j | y j−1 , z j−1 , w j−1 , r j = 0, s j = 1, x 0 ) = p ( w j | y j−1 , z j−1 , w j−1 , r j = 1, s j = 1, x 0 ) and p(z j | y j−1 , z j−1 , w j , r j = 0, s j = 1, x 0 ) = p (z j | y j−1 , z j−1 , w j , r j = 1, s j = 1, x 0 ). Previous studies of the Betula data have shown that individuals who drop out have lower cognitive performance and steeper decline (Josefsson et al., 2012). In Section 3.2 we introduce sensitivity parameters to allow the dropout to deviate from this MARtype assumption.
We also need three further assumptions for identification of the potential outcomes for those individuals who would survive regardless of exposure history, that is, the principal strata. We start with two standard assumptions.
; if an individual were to be alive under exposure regime z j then (s)he would also be alive under the contrasting regime z ′ j . Deterministic monotonicity can be too strong in many settings and we discuss a weakening of this in Section 8. Assumption 6 Differences in outcomes when comparing different strata. For the contrasting exposure regime z That is, there is no difference in potential outcomes when comparing the 'always survivor' strata to the strata where individuals were to live under the contrasting regime z ′ j but not under exposure regime z j . In Section 3.2 we introduce a sensitivity parameter to investigate sensitivity to this assumption, due to the fact that individuals in the always survivor strata are likely healthier and have better cognitive performance.
A common problem encountered in longitudinal cohort studies is that an individual's exposure level z j , hence the exposure regime z j , and time-varying confounder w j is only observed if (s)he is alive and participates at the jth test wave. Hence, we need to introduce a new assumption to be able to identify the probability of survival among exposed and non-exposed; this is necessary for the identification of the potential outcomes among always survivors.

| 403
JOSEFSSON aNd daNIELS that is, the exposure and confounder are distributed the same among survivors and non-survivors conditional on the temporally preceding variables.
This assumption is used for identification of the principal strata. In the Betula study, the cognitive testing is performed at 5-year intervals. Since 5 years is a rather long time period it is likely that some of the participants who died before follow-up were also widowed before death. Thus, the number of widowed participants in the sample may be underestimated and must be accounted for.

| Sensitivity parameters
To investigate sensitivity of Assumption 3 we follow the procedure of Brumback et al. (2004). The unmeasured confounding is quantified through a parameter which describes the outcome confounding. That is, for Sensitivity to several types of unmeasured confounding can be assessed using this form. Here, we restrict to an unmeasured confounder independent of the history of the joint processes (y j−1 , z j , w j , r j , s j , x 0 ).
To investigate sensitivity of Assumption 4 we first make an assumption of non-future dependence (NFD) conditional on survival (NFD-S) for the outcome and then, introduce sensitivity parameters within this partial identifying restrictions (Linero & Daniels, 2018). NFD is a special case of MNAR (Kenward et al., 2003) and NFD-S is defined as, p(y j |y j−1 , z j , w j , {r 0 = 1, …, r t−1 = 1, r t = 0, …, r j = 0}, s j = 1, x 0 ) = p(y j |y j−1 , z j , w j , r j = 1, s j = 1, x 0 ) , for all j>1 and all t<j. Here, it is defined conditional on being alive at time j. The NFD-S assumption leaves one conditional distribution per incomplete dropout pattern unidentified, that is when t=j. To identify the unidentified conditional distribution left by the NFD-S assumption, we introduce a sensitivity parameter j such that p(y j | y j−1 , z j , w j , r j = {1, …, 1, 0}, s j = 1, x 0 ) = p(y j + j | y j−1 , z j , w j , r j = 1, s j = 1, x 0 ), when j < 0 implies a negative location shift in the outcome at the first unobserved test wave. This assumption implies dropout at time j depends on being alive at that time, the history up to that time, the exposure, time-varying confounder and the outcome at time j, but not outcomes or time-varying variables after time j. This assumption of dropout not depending on the 'future' is often viewed as realistic and was proposed originally as a remedy to concerns about many pattern mixture models implicitly having future dependence. Table 1 displays a description of the possible mortality and missing data patterns under the NFD-S assumption.
To investigate sensitivity of Assumption 6 we let, That is, the mean difference in potential outcomes when comparing the 'always survivor' strata to the strata where individuals were to live under the contrasting regime z ′ j but not under exposure regime z j . In our analysis we assume Δ z � j ≥ 0 which implies that memory performance is on average higher in the 'always survivors'-strata (the always survivors-strata is healthier). We further assume this difference is independent of the preceding variables.
To investigate sensitivity of Assumption 7, we introduce a sensitivity parameter j for the exposure such that, j equals representing the mean difference in the proportion exposed between non-survivors and survivors. The first probability on the right-hand side of each expression is not identified. However, bounds can be derived for j ; see the Web Appendix Section A.2 for details. In particular, the upper bound This reflects that among non-survivors, all subjects were exposed before the event of death between the j−1th and jth wave. Furthermore, using Assumptions 1 and 5, the lower bound for j is obtained when . This reflects an equal survival probability among those exposed or unexposed at wave j. Here, using the law of total probability and Bayes theorem, the lower bound L j becomes 0.

| IDENTIFICATION
Identification of the SACE in Equation (2) follows from two results.

Result 1.
The causal contrasts in Equation (1) can be identified as follows where  denotes the set of temporally preceding variables (y j−1 , w j , r j , x 0 ). (2) can further be identified using Assumption 5 by weighting the contrasts in Equation (3) with

JOSEFSSON aNd daNIELS
The proofs of the results can be found in the Web Appendix Section A.3. The causal effect is identifiable based on the observed data and Assumptions 1-7, conditional on the fixed values for the sensitivity parameters c( z j ), c( z ′ j ), Δ z � j , j and, j . For a Bayesian analysis, the sensitivity parameters can be given informative priors. In Section 5.3 and Table 2 we describe the estimation algorithm where the sensitivity parameters are given informative, non-degenerate and priors.

DISTRIBUTIONS AND COMPUTATION OF THE CAUSAL EFFECT
The joint distribution of the observed data is specified as a marginal model for the baseline confounders and a set of sequential conditional models for the time-varying variables, given the history of the joint process (the outcome, exposure, confounders, and missingness). Details of the joint distribution are given in Web Appendix Section A.1. The baseline confounders x i0 are all observed before an individual enters the study. For each visit j we postulate the time-varying variables in the following order: ( s ij , r ij , w ij , z ij , y ij ), even though the exposure, the time-varying confounder, and survival all occurred between (j−1)st and jth test wave. Of course, y ij , w ij and z ij , are only observed if r ij = 1 and s ij = 1. It is further allowed that w ij and z ij may have occurred before s ij .

| Bayesian semi-parametric modelling
We propose a BSP modelling approach based on Bayesian Additive Regression Trees (BART, Chipman, George, McCulloch, et al., 2010) for the observed data distribution.
For the time-varying components, we specify BART models for the responses as a function of prior histories for all individuals alive and not dropped out at a given test wave. The model consists of two parts: a sum-of-trees model and a regularisation prior on the parameters of that model. The model for the continuous response Y j is conditioned on the history of the joint process (y j−1 , z j , w j , x 0 ) for the subset that satisfies r j = 1 and s j = 1, and can be expressed as ) + j . The model consists of K Y j distinct binary regression trees denoted by T k Y j . Each tree constitute a set of interior node decision rules leading down to b k Y j terminal nodes, and for a given T k is the associated terminal node parameters. The conditional distribution of the continuous outcome is specified as normal, where the mean function, Y j (y j−1 , z j , w j , x 0 ), is given by the sum-of-trees. (2) using the G-computation formula. Details of the algorithm can be found in the Web Appendix section A.4

1.
Sample the observed data posteriors as described in Section 5.

2.
For each posterior sample of the parameters sample pseudo data ( y * j − 1 , w * j , r * j , s * j , x * 0 ) and sensitivity parameters j , j , c ( z j ), and c ( z ′ j ) of size N * . Additionally, sample one set of Δ z � j .

3.
Implement G-computation for z j , and similarly for z ′ j , using the pseudo data and sensitivity parameters from Step 2 by computing E [ Y j | y j−1 , z j , w j , r j , s j = 1, x 0 ] and Furthermore, implement Monte Carlo integration using the pseudo data to compute Pr [ S j = 1 | z j ] and

4.
Use the quantities in Step 3 to compute one posterior sample of τ as defined in Equations (3)-(4).

5
Repeat step 2-4 for each of the posterior sample of the parameters.
The BART models for our binary responses Z j , W j , R j , and S j are specified as probit models. For example the model for the exposure can be expressed as: , where Φ denotes the cumulative density function of the standard normal distribution and Z j (y j−1 , z j−1 , w j−1 , x 0 ) is the probability of being exposed at wave j given (y j−1 , z j−1 , w j , x 0 ) for the subset that satisfies r j = 1 and s j = 1. The BART model for S j is fitted for the subset that satisfies r j−1 = 1 and s j−1 = 1, and for R j the subset that satisfies r j−1 = 1 and s j = 1. The predicted probabilities of r j = 1 and s j = 1 are: R j (y j−1 , z j−1 , w j−1 , x 0 ) and S j (y j−1 , z j−1 , w j−1 , x 0 ). Note that, s 0 = 1 and r 0 = 1 for all individuals, R j = 0 if r j−1 = 0, and The baseline confounders are all categorical (age cohort, sex and education level). We create a saturated multinomial random variable, ), based on these categorical variables. L is the number of categories and each category corresponds to a unique combination of the categorical variables.
is given a Dirichlet prior with parameters equal to one.

| Posterior
Draws from the posterior distribution of the sum-of-trees models are generated using Markov chain Monte Carlo (MCMC). The parameters of the conditional distributions for Y j , Z j , W j , R j , and S j are assumed independent and thus their posteriors can be sampled simultaneously. BART is implemented in the R package bartMachine (Kapelner & Bleich, 2013) for continuous and binary responses. We use default priors on all of the parameters of the sum-of-trees model, that is, on the tree structure, the terminal node parameters and the error variance. For details see Kapelner and Bleich (2013).

| Computation of the SACE
The algorithm for generating samples from the posterior distribution of τ in Equation (2) using the G-computation formula is given in Table 2. Details can be found in the Web Appendix SSection A.4. The algorithm provides the details of generating posterior samples of the causal quantities in Results 1 and 2 (from Section 4) using the posterior distribution of the observed data model parameters (Section 5.1) and the identifying restrictions with sensitivity parameters (Sections 3.1 and 3.2). Recall the expressions in Results 1 and 2 are a function of the observed data distribution and the sensitivity parameters.
For implementation of the algorithm in practice, a number of the initial posterior samples are discarded as burn-in. Parallel computation can be implemented to speed up computations. For example, instead of running one long chain in Step 1, it is possible to run multiple shorter chains in parallel, although each chain still needs to converge. Also, Step 2 may be divided into k blocks of size N * ∕k , and in Steps 3-4 the parameters of interest are computed by combining the pseudo data from the k blocks. We give further details on computation with Betula data in Section 7.3.

| SIMULATION STUDY
We performed a simulation study to evaluate the performance of the BSP G-computation algorithm. For simplicity of comparison to other appropriate methods we estimate E[Y j (z j ) − Y j ( z � j ) | s j = 1] and set Δ z � j = 0, j = 0, and c(z j ) = c(z � j ) = 0, that is, a setting with MAR missingness and no deaths. Details are found in the Web Appendix section A.5.
We consider two settings for our BSP approach. First, where we specify a normal distribution for the outcome as described in the algorithm (BSP-GC1) and second (BSP-GC2), when specifying a t-distribution with three degrees of freedom (t 3 ). We compare our approach with three other methods used for causal effect estimation of longitudinal data with time-varying confounding. The three other methods implemented are: (i) A parametric version of the proposed procedure (BP-GC). Here, we specified Bayesian linear and logistic additive regression models instead of the BART models described in Section 5.1. (ii) Inverse probability of treatment weights (IPTW; Cole & Hernán, 2008). Here, the mean E[ Y j | s j = 1, z j ] is estimated by averaging the memory outcome for the subset with Z j = z j in a pseudo-population constructed by weighting each individual using both unstabilised weights (IPTW-W) and stabilised weights (IPTW-SW), to adjust for confounding and for attrition among survivors. The IPTW-W and IPTW-SW were implemented using the ipw and survey packages in R. (iii) Targeted minimum loss-based estimation approach for longitudinal data structures (TMLE; Laan & Gruber, 2012). We implemented the TMLE using the ltmle package using default settings (Lendle et al., 2017). Confidence intervals were calculated using non-parametric bootstrap. We used 5000 bootstrap samples. The bootstrap confidence intervals were calculated using the 2.5th and 97.5th percentiles of the resulting estimates.
Data were generated based on a simplified version of the Betula data. We simulated 1000 datasets of size n=1000. We considered J i = 2 follow-up test waves, a continuous baseline covariate, X i0 , generated as X 0 ∼ Unif(0, 1 ). The outcome, Y ij , was considered a continuous time-varying variable. The binary variable Z ij indicated if the subject was widowed or not, and W ij indicated if the spouse been severely sick. Widowhood was an absorbing state, such that, if Z ij = 1 then Z ik = 1 for k≥j. Note, that Z i0 = 0 for all subjects. As in the Betula data, all time-varying variables had a highly non-linear relationship with the baseline covariate and the time-varying confounder interacted with the baseline covariate in the exposure model. Data for the simulation study were generated as X 0 ∼ Unif(0, 1 ), W j ∼ Ber (expit( − 2 + 0.5X 0 − 2X 2 0 + 5X 3 0 + 0.25W j−1 ) ), . R code for the data generation is provided in the Web Appendix Section A.5. Table 3 shows the bias, empirical standard deviation (ESD), mean squared error (MSE) and coverage of 95% confidence intervals from the simulation study for BSP-GC1, BSP-GC2, BP-GC, IPTW-W, IPTW-SW and TMLE. The causal effect estimates for BSP-GC1, BSP-GC2 and TMLE are nearly unbiased. BSP-GC1 and BSP-GC2 are however more efficient (smaller MSE and ESD) and have higher coverage (larger than 95%) than TMLE (lower than 95%). The simulation results for BSP-GC1 and BSP-GC2 are very similar. As expected, the three other methods; BP-GC, IPTW-W and IPTW-SW are all biased. Additionally, of all methods, BP-GC has the highest bias and MSE, IPTW-W is the least efficient and IPTW-SW has the lowest coverage.
To see how the proposed approach performs when the error distribution is misspecified data were instead generated from a t 3 -distribution for the error of the outcome Y j . Bias, ESD, MSE and coverage from the simulation are found in Table 4. The results are similar in terms of bias and coverage compared to the previous simulation with correctly specified error. However, ESD and MSE are higher and are now comparable to TMLE.
To see how the proposed approach performs when there is lack of overlap, data were generated as Ber(expit(a Z j ) )I X 0 >0.5 for the exposure, mimicking the Betula study where only individuals at older ages were exposed (widowed). The results from the simulation are found in Table 4. The results are similar to the first simulation with correctly specified error and non-linear effects for BSP-GC1 and TMLE. But for the other three approaches bias was higher and CP was lower.

| The Betula data
The goal is to estimate the causal effect of becoming a widow on memory among those who would survive irrespective of being widowed or not. As such, we limit our data set to those individuals who were married at enrolment. Of approximately 2000 participants N=1059 were married at study enrolment, and data were recorded at four fixed test waves (j=0,…,3) with 5 years interval. The memory outcome was assessed at each wave using a composite of three episodic memory tasks. The score can range between 0 and 76, with a higher score indicating better memory (for details see Josefsson et al., 2012). We consider two contrasting exposure regimes, subjects who became a widow between the j−1th and jth wave, z j = {z 0 = 0, …, z j−1 = 0, z j = 1 }, and subjects married through test wave j, z � j = {z 0 = 0, …, z j = 0 }, for j=1,2,3. Baseline demographic characteristics included age-cohorts: 45,50,…,80 years of age at T A B L E 3 Simulation results for causal effect estimation with n=1000 with the true causal effect τ=−0.05, using two settings for our proposed approach: error specified as normal (BSP-GC1) and error specified using a t-distribution (BSP-GC2), a parametric version of the proposed procedure (BP-GC), inverse probability of treatment weights using unstabilised weights (IPTW-W), stabilised weights (IPTW-SW) and Targeted minimum loss-based estimation approach for longitudinal data structures (TMLE). Mean squared error (MSE) are multiplied by 100 for ease of presentation. ESD denotes empirical standard deviation and CP denotes coverage probability of 95% credible intervals  enrolment, gender and education, categorised into low: 6-7 years of education (29%), intermediate: 8-9 years (31%), or high: >9 years (40%). We also measured a time-varying confounder; an indicator if the spouse has been sick within the last 5 years. We note that baseline confounders are always recorded.

| Sensitivity parameters
Our approach allows uncertainty about untestable assumptions by specifying priors for the sensitivity parameters described in Section 3.2. We restrict the parameters to a plausible range of values, reflecting the authors' beliefs about the unknown quantities. In Section 3.2, the sensitivity parameter c( z j ) reflects the average difference in potential outcomes due to unmeasured confounding (violation of Assumption 3). For the Betula data, when studying the effect of widowhood on cognition, one concern may be that the association is confounded by a healthy lifestyle, such as a healthy diet and/or exercise, something that is often shared within couples. Couples with a healthy lifestyle live longer and may have better cognitive performance than couples with a less healthy lifestyle. This information is not available from the database. Hence, it is a potential unmeasured confounder. Here, we assume c( z j ) < 0 and c( z ′ j ) > 0, reflecting that exposed (widowed) individuals are less healthy compared to unexposed (married) individuals. We further assume the effect is equal for exposed and unexposed. That is, we assume c(z j ) = − j and c(z � j ) = j . Here, we specify a uniform prior on the sensitivity parameters, j ∼ Unif(0, U j ), with upper bound U j = 1 2 × SD(Y j | y j−1 , z j , w j , r j = 1, s j = 1, x 0 ). That is, we expect the sensitivity parameter not to be bigger than one-half standard deviation of the outcome conditional on the history of the joint process. This approximately corresponds to an effect size similar to that found in previous literature on the effect of Mediterranean diet on memory (Radd-Vagenas et al., 2018).
Departures from a MAR mechanism (Assumption 4) for the missingness among survivors can be investigated by varying j in Section 3.2. Our prior belief is that j < 0, reflecting a negative shift in memory performance occur immediately after the first unobserved test wave. Here, the prior is specified as j ∼ Unif( − L j , 0), where we assume the lower bound is one observed conditional standard deviation, L j = 1 × SD(Y j | y j−1 , z j , w j , r j = 1, s j = 1, x 0 ). The effect is similar to what has been found in previous work examining differences in cognition between completers and those who withdraw, at the last cognitive testing visit before dropping out (Rabbitt et al., 2008) Sensitivity to Assumption 6, uses Δ z � j , which reflects the difference in outcomes when comparing the 'always survivor' strata to the strata where individuals were to live under the contrasting regime z ′ j but not under exposure regime z j . We again specify a uniform prior Finally, sensitivity to Assumption 7 uses the sensitivity parameter j , which represents the difference in the probability of being exposed at wave j for non-survivors and survivors conditioning on the history of the joint process. As shown in Section 3.2, j is restricted to [0, U j ]. We assume the prior for j is uniform over this range, j ∼ Unif(0, U j ). The upper bound reflects that, between the j−1th and jth wave, all subjects were exposed before death.

| Results and comparison with other methods
We estimated τ using the proposed BSP method and embedded sensitivity parameters. For each chain the first 1000 iterations were discarded as burn-in and 2240 posterior samples of τ were obtained. We sampled pseudo data of size N * = 25, 000 at each iteration. Convergence of the posterior samples was monitored using trace plots of the samples. To reduce computation time we used 448 parallel chains. Total computation time was 1 hour and 18 minutes.
For longitudinal exposure regimes limited overlap is not uncommon. To avoid extrapolation of the outcome model outside the range of estimated propensities we restrict the overlap region for the longitudinal exposure regimes. Specifically, we restrict data to the set of individuals that have an estimated propensity score that lies within the range of the observed propensities for the two contrasting regimes (similar to the procedure used in Zhou et al., 2019).
We consider two settings for our BSP approach. First, we specify a normal distribution for the residual of the outcome as described in the algorithm (BSP-GC1); second (BSP-GC2), we replace the normal distribution with a t-distribution with three degrees of freedom (t 3 ). For BSP-GC1, the posterior sampling results revealed a mean episodic memory score of 38.2 (95% CI; 35.4, 40.8) for exposed and 38.1 (95% CI; 35.6, 40.1) for unexposed individuals and an estimate of τ of 0.18 (95% CI; −1.43, 1.86), suggesting that there is no effect of becoming a widow on memory among those who would survive irrespective of exposure. For BSP-GC2, the posterior sampling results revealed a mean episodic memory score of 38.2 (95% CI; 35.5, 40.9) for exposed and 38.0 (95% CI; 35.6, 40.1) for unexposed individuals and an estimate of τ of 0.21 (95% CI; −1.42, 1.82). The conclusions are insensitive to the two choices of outcome residual distribution here.
As a sensitivity analysis we compare how the point estimates and uncertainty varied when setting one sensitivity parameter at a time to zero, while the remaining sensitivity parameters are given the priors described in Section 7.2. Setting j to zero resulted in a estimate of τ of 0.18 (95% CI; −1.42, 1.91); for j = 0, 0.20 (95% CI; −1.36, 1.83); for δ=0, 0.21 (95% CI; −1.38, 1.94); and for j = 0, −0.83 (95% CI; −2.43, 0.75). The largest differences was found for the analysis setting j to zero (i.e. no unmeasured confounding); however the CI still cover zero and we expect this assumption to not hold. Fixing the other sensitivity parameters at zero had minimal impact.
We also compare our approach, BSP-GC1 with BP-GC, IPTW-W, IPTW-SW and TMLE (described in the simulation study). For simplicity of comparison we estimate the causal contrasts described in the simulation study. Further to avoid limited overlap, we restrict our data to those age-cohorts where we observe both married and widowed participants over the study period, instead of restricting to the region as for the main analyses. For IPTW-W, IPTW-SW and TMLE, confidence intervals were calculated using non-parametric bootstrap. We used 5000 bootstrap samples. The bootstrap confidence intervals were calculated using the 2.5th and 97.5th percentiles of the resulting estimates.
The results from all the methods are given in Table 5. First, all of the methods display a negative widowhood effect on memory, although all confidence/credible intervals (CI) cover zero. There is a large discrepancy between our semi-parametric approach, BSP-GC1 and the parametric counterpart, BP-GC. In the latter, the effect was attenuated and the CI was narrower. A likely explanation of the discrepancy in effect estimates is that BP-GC is more susceptible to bias caused by model misspecification. BP-GC and IPTW-SW yielded most similar results, although the weighting approach had much wider CI. Furthermore, the effect estimate appeared most negative using IPTW-W and the CI was much wider than for any of the other methods. Weighting methods are known to be unstable and to have problems with large variance estimates in finite samples if the values of the weights are extreme. In our analysis the range of the weights was 0.06-14.3 for IPTW-W, compared to 0.06-5.4 for IPTW-SW. The large weights using IPTW-W may explain the deviating result using this method. Our BSP-GC1 approach yielded an estimate of τ most similar to TMLE, although TMLE had slightly wider CI. This is consistent with the results of the simulation study.

| 411
JOSEFSSON aNd daNIELS 8 | CONCLUDING REMARK S This paper has proposed a BSP framework for estimating the SACE with longitudinal cohort data. Our approach allows for Bayesian inference under MNAR missingness and truncation by death, as well as the ability to characterise uncertainty about unverifiable assumptions. The proposed approach has several advantages compared to existing approaches: (i) the flexible modelling of the observed data as compared to parametric methods, while maintaining computational ease, (ii) interval estimates for full posterior inference (iii) easy to introduce sensitivity parameters.
The simulation study, although simplified, mirrored the Betula data. All time-varying variables had a highly non-linear relationship with the baseline covariate and interaction effects were included. The models for BP-GC, IPTW-W, IPTW-SW and TMLE were specified using additive effects, and thus, were misspecified. The results showed that BSP-GC1, BSP-GC2 and TMLE were nearly unbiased. BSP-GC1 and BSP-GC2 were however more efficient and had better coverage than TMLE (though a bit conservative). The results are in line with previous research (Roy et al., 2018), suggesting that TMLE is less efficient than BSP and non-parametric modelling. This, however, must be explored more thoroughly in future work.
The three other methods; BP-GC, IPTW-W and IPTW-SW, were all biased. This is expected since these methods make stronger distributional assumptions and thus are more sensitive to model misspecification. Similar to TMLE our approach does not rely on strong modelling assumptions, but unlike TMLE, it is quite easy to modify assumptions and incorporate sensitivity parameters. Recall we could not easily make direct comparisons of the proposed approach with the other approaches under our assumptions that include sensitivity parameters. We attempted to implement Super learner, implemented in the R package SuperLearner, but observed highly variable results for the Betula data (causal effect estimates varied between −0.34 and −1.33). This may be a result of the cross-validation step and the fact that the exposure is a rather rare event. Using our BSP approach these problems are avoided by increasing the size of the pseudo data and running longer chains. Although, computation time can be demanding for large pseudo sample sizes, the algorithm can be fully parallelised as discussed in Sections 5.3 and 7.3, which would vastly reduce the total computation time.
For the Betula data we did not find an effect of widowhood on memory. The results were not sensitive to different specification of the errors as normal-or t-distributed, and changing the sensitivity parameters one at a time did not change the results significantly either. The difference in findings from previous studies may partly be explained by different estimands being used; ours is the only analysis using a SACE. Additionally, in this study we considered the immediate effect of widowhood (within 5 years) rather than a long-term effect; it may take longer for degeneration to become apparent.
Our approach can be generalised in various ways. For example, it is possible to allow for multiple time-varying confounders and/or continuous baseline confounders using a sequential approach T A B L E 5 Comparison of methods used for causal effect estimation of the Betula data, setting Δ z � j = 0, j = 0, and c ( z j ) = c ( z � j ) = 0, using our proposed approach (BSP-GC), a parametric version of the proposed procedure (BP-GC), inverse probability of treatment weights using unstabilised weights (IPTW-W), stabilised weights (IPTW-SW) and Targeted minimum loss-based estimation approach for longitudinal data structures (TMLE) as proposed by Xu et al. (2016). This would involve first ordering the confounders into sequential conditionals and then, applying BART to model each of these univariate conditionals. Additionally, although widowhood status is thought of as a monotone exposure pattern and an absorbing state in this study, this is not essential for the proposed approach and other (non-monotone) exposure regimes, such as the effect of widowhood duration on memory at the last visit, might be of interest and are possible to study with a few modifications (e.g. the positivity assumption).
Violations of the consistency assumption can be problematic when using observational data (Cole & Frangakis, 2009). For example, the effect of widowhood can affect memory via different pathways, for example, for some subjects via stress or depression and others via reduced physical health due to poorer lifestyle choices (Gerritsen et al., 2017). This is a limitation with the current study where the exposure is defined homogeneously, and should be explored more thoroughly in future work.
Several of our assumptions can be (further) relaxed. For example, Assumption 5 can be weakened to a stochastic Monotonicity, by following the procedure described in Lee et al. (2010). Also, in this study we have considered unmeasured outcome confounding; this assumption can easily be extended to allow unmeasured mortality confounding. Assumption 6 can be weakened by conditioning on the history of the joint process. However, a drawback with relaxing these assumptions is increasing the number of sensitivity parameters.
One limitation with BART is the restrictive, and sometimes unrealistic, assumption of IID normal errors, (although they can easily be replaced with heavier tail errors as in here). A fully non-parametric modelling approach could be obtained by extending BART to model the error distribution using the Dirichlet process mixtures (George et al., 2018). An additional limitation of the proposed approach is that we used existing R-functions for BART that are not most efficient for our setting. We will explore these limitations in future work, as well as, other choices for priors of the sensitivity parameters.

SUPPLEMENTARY MATERIALS
Web Appendices referenced in Sections 3, 4, 5, 6 and 7, as well as R code are available as Supplementary materials.