Semi-Parametric Methods of Handling Missing Data in Mortal Cohorts under Non-Ignorable Missingness

Summary We propose semi-parametric methods to model cohort data where repeated outcomes may be missing due to death and non-ignorable dropout. Our focus is to obtain inference about the cohort composed of those who are still alive at any time point (partly conditional inference). We propose: i) an inverse probability weighted method that upweights observed subjects to represent subjects who are still alive but are not observed; ii) an outcome regression method that replaces missing outcomes of subjects who are alive with their conditional mean outcomes given past observed data; and iii) an augmented inverse probability method that combines the previous two methods and is double robust against model misspecification. These methods are described for both monotone and non-monotone missing data patterns, and are applied to a cohort of elderly adults from the Health and Retirement Study. Sensitivity analysis to departures from the assumption that missingness at some visit t is independent of the outcome at visit t given past observed data and time of death is used in the data application.


Appendices
Appendix A Details about the SACE A.1 Example that compares the partly conditional mean and the SACE Let Z be an indicator of exposure (or treatment) where Z = 1 if a subject is exposed and 0 otherwise, and let Y (Z) and A(Z) be the counterfactual outcome and vital status under Z. For example A(1) = 1 means that the subject would have been alive had he/she been exposed and A(1) = 0 means that the subject would have died had he/she been exposed. Suppose that higher outcome values are more beneficial than low values, and assume that an exposure never harms survival (i.e. number of subjects in S3 is 0). Further suppose that the individuals in S1 have high (better) outcomes if exposed and moderate outcomes if not; and the individuals in S2 have low outcomes if exposed. A similar setting is found in Egleston and others (2009). Table 1 shows a hypothetical observational study of the scenario described above, where half of the people in the study are exposed and the other half is unexposed. Suppose that Y is an outcome of interest (e.g. health status) and assume that undefined post-death outcomes are represented by * .
If Z is a treatment indicator, then it is obvious that treatment (Z = 1) is better than placebo (Z = 0) because the treatment has a positive effect on the outcomes in the always survivors and on survival, and hence the partly conditional mean would be misleading because it implies that the effect of the treatment on the outcome is not different from the placebo. In this case the SACE would be more appropriate than the partly conditional mean. Now suppose that Z is an indicator of education level (i.e. Z = 1 if subjects have a high level of education and 0 otherwise). In reality, Z would include many other variables such as sex and age, but for the purpose of illustrating the difference between the estimand, we consider one variable -education. Suppose that policy-makers would like to know how the average outcomes in the oldest-old differ between those with high and low education levels so that they know approximately how much resources to allocate to surviving subjects with high/low levels of education. Higher (better) outcome would imply lower requirement for resources. Then the SACE is misleading as a measure because it suggests that people with more education would require less health support than people with less education. A policy derived from the SACE would neglect those subjects with a high level of education and low outcome (Y = 2). In this case the partly conditional mean would provide a more fairer comparison.

A.2 The SACE: methods and assumptions
Methods to estimate the SACE either (1) calculate bounds on the SACE (Zhang and Rubin, 2003;Chiba, 2012;Yang and Small, 2016), (2) make assumptions that allow the SACE to be identifiable and estimable (Mattei and Mealli, 2007;Zhang and others, 2009;Ding and others, 2011), or (3) conduct sensitivity analysis around assumptions about the distribution of the outcome or survival process (Egleston and others, 2007;Jemiai and others, 2007;Chiba and VanderWeele, 2011). The SACE is not identified without untestable assumptions, e.g. the stable unit treatment value assumption (SUTVA) and the no unmeasured confounders assumption. SUTVA says that a subject's potential outcomes do not depend on the exposure status of other subjects and that there are no multiple versions of treatment; no unmeasured confounders says that given observed covariates, exposure is independent of potential outcomes. With the exception of Zhang and others (2009) who make parametric modelling assumptions, the aforementioned papers assume monotonicity. Monotonicity says that all subjects who would survive if exposed would survive if unexposed. Zhang and Rubin (2003), Chiba (2012) and Yang and Small (2016) also make the ranked average scores assumption, which says that the average potential outcome when exposed is better in the S1 group (always survivors) than in the S2 group, and that the average potential outcome when not exposed is better for the S1 group than for the S3 group. A table summarizing methods to estimate the SACE can be found below.
It would be interesting to estimate the SACE for education in the HRS. However, while we would not exclude the possibility of doing this, there would be difficulties. First, there is the problem of multiple versions of treatment. That is, it is unclear what intervention on education one would be considering. Different interventions could be expected to produce different causal effects on cognitive function. For example, allow access to high-quality preparatory schools for everyone, encouraging women and people of colour to attend more school, or granting subsidized education might well have different causal effects on cognitive function in later life. If an estimate of SACE were to be obtained from the HRS data, it is unclear to which (if any) of these possible interventions it would relate. Second, as outlined above, quite strong assumptions are needed to estimate a SACE (e.g. no unmeasured confounders) and one might be a little wary about making such assumptions to estimate a SACE for education from the HRS data. The aim in our paper is arguably more modest: to describe associations between outcomes and covariates.
A.3 Table of methods and assumptions for identifying the SACE In the following table, X is vector of auxiliary variables, Z is exposure status (Z = 1 if exposed, 0 otherwise), Y (Z) is potential outcome under exposure Z, A(Z) is potential vital status under exposure Z (A(Z) = 1 if a subject would be alive had he/she been given exposure Z). All of the methods described in this section assume SUTVA and randomization (or no unmeasured confounders).
An estimand related to the SACE is the complier average causal effect (CACE), which is relevant when subjects in a randomized trial fail to adhere to their treatment assignment.
Let Z be a binary treatment assignment indicator (Z = 1 if assigned to active treatment, Z = 0 if assigned to control treatment) and A(Z) be the potential treatment compliance indicator. That is, A(1) = 1 if the subject would take the active treatment if assigned to active treatment (A(1) = 0 if s/he would instead take the control treatment), and A(0) = 1 if the subject would take the control treatment if assigned to control treatment (A(0) = 0 if s/he would instead take the active treatment).
There are four principal strata that arise from compliance/non-compliance: (1) Compliers: those who receive the treatment s/he is assigned (A(0) = A(1) = 1) (2) Always-takers: those who receive active treatment, regardless of assignment (A(0)=0,A(1)=1) (3) Never-takers: those who receive control treatment, regardless of assignment (A(0)=1,A(1)=0) (4) Defiers: those who receive the opposite treatment s/he is assigned (A(0) = A(1) = 0) CACE is then the effect of the treatment on an outcome Y in the compliers. There is an immediate parallel between SACE and CACE in that survival is analogous to compliance and death is analogous to non-compliance. However, unlike for death, both potential outcomes are defined for subjects in all four strata. Many of the methods in the following table are described in the context of the CACE, but for the purpose of this paper we will describe them in the context of the SACE.

Paper
Assumptions Method

Zhang and
Rubin (2003) -Monotonicity -Identifying bounds of SACE -Ranked average score -Randomized trial Imai (2008) -Monotonicity -Identifying bounds for quantiles of the causal -Stochastic dominance (similar to, but weaker than, effect in always survivors ranked score average) -Randomized trial Chiba (2012) -Monotonicity -Identifying bounds of SACE -Ranked average score -Randomized trial -Assumptions about stratum sizes (to narrow bounds)

Yang and
Small (2016) -Monotonicity -Identifying bounds of SACE -Ranked average score assumptions for two time points -Survival data before and after measurements of outcome used to narrow bounds -Observational study

Mattei and
Mealli (2007) -Monotonicity -Make assumptions so SACE is identified -Latent ignorability: within PS missingness is ignorable -Bayesian approach to draw SACE -Stochastic dominance -Complications from death and from -Other assumptions about compliers and non-compliers non-compliance for those who are given active treatment -Randomized trial Zhang and others (2009) -P (belonging to a PS | X) obey multinomial logit model -Make assumptions so SACE is identified -Y (1), Y (0) in S1 are bivariate normal with mean linear in X -Parametric model assumptions to draw SACE -Y (1) in S2, Y (0) in S3 are normally dist. with mean linear in X -Randomized trial Ding and others (2011) -Monotonicity (relaxed if (i) and/or (ii) hold and X is cts.) -Make assumptions so SACE is identified (i) X ⊥ ⊥ Y | PS and Z -Randomized trial (ii) Distribution of X in S1 and in S2 are different -SACE identified if (i), (ii) hold Hayden and others (2005) -Explainable nonrandom survival:

Asymptotic distribution of IPW estimators
Let U IP W equal to estimating equations (4). Under standard regularity conditions, and α 0 and β 0 are the true values of α and β.
We can also expand equation (3.1) around α 0 to get: whereα is an intermediate value between α 0 andα. Substituting this into equation (B.1) and using law of large of numbers, we get and the asymptotic distribution ofβ follows.
where S(α 0 ) is made up of the score function of the probability density function of missingness process. By the Multivariate Pythagorean Theorem (Tsiatis, 2006), variance matrix of , would overestimate the asymptotic variance.
Appendix C

Details about Conditional Mean Outcome Regression
Equation (3) implies the following lemma, which relates the expected outcome at visit t given historyŌ t−1 in survivors who drop out just after visit t − 1 to the expected outcome in survivors who are observed at visit t.
Lemma 1: If Assumption 1 holds, then If q t (Ō t−1 , Y t ; γ) = 0, then conditional onŌ t−1 and survival at visit t, subjects who drop out between visits t − 1 and t have the same Y t distribution as those who are observed at visit t.
is an increasing (decreasing) function of Y t , then this indicates that subjects who drop out between visits t − 1 and t tend to have larger (smaller) values of Y t than those who are observed at visit t.
with unknown parameter θ t,s . Lemma 1 implies that θ t,t−1 (t = 1, 2, . . . , J) can be estimated by regressing Y t onŌ t−1 in the subset of individuals with R t = 1, weighting each of these individuals by exp{q t (Ō t−1 , Y t )}. Letθ t,t−1 denote the resulting estimate of θ t,t−1 .
The following lemma relates the expected outcome at visit t given historyŌ s (0 s t−2 and t J) in survivors who drop out just after visit s to the expectation in those who are observed at visit s + 1.
Lemma 2: If Assumption 1 and mortal-cohort NFD hold, then for s t − 2, Setting s = t−2, Lemma 2 indicates that, if m t (Ō t−1 ) were known, then θ t,t−2 (t = 2, . . . , J) could be estimated by weighted regression after first imputing the missing Y t value in each subject with R t−1 = 1, R t = 0 and A t = 1 as its corresponding expectation m t (Ō t−1 ).
Specifically, Y t (or, if missing, its imputed value, m t (Ō t−1 )) would be regressed onŌ t−2 in the subset of individuals with R t−1 = 1 and A t = 1, weighting each individual by However, ifθ t,t−1 has previously been calculated using the weighted regression method described above (following Lemma 1), A similar procedure can then be used to estimate θ t,t−3 , then θ t,t−4 , and so on. In detail, the CMOR procedure is as follows. (1) and let Y * t be missing otherwise. Set a = 1.
(2) For each t = 1, . . . , J in turn, let s = t − a and calculateθ t,s by regressing (3) If a < J, let a = a + 1 and return to step 2.
Note that this procedure does not impute post-death outcomes (i.e. Y * t is still missing when A t = 0). The following Lemma summarizes the procedure described above.
Lemma 3: If mortal-cohort NFD holds, the selection bias function and the sensitivity parameter γ are correctly chosen, and the regression models are correctly specified, then the estimatorθ t,t−1 that solves and subsequently for s < t − 1, the estimatorθ t,s that solves where d t (Ō s ) -a function ofŌ s -has the same dimension as θ t,s (∀s < t), will be consistent.
After imputing the missing outcomes as described above, the parameter β in the model of interest can be estimated by applying IEE to the imputed data set with Y it replaced with Theorem 1: If mortal-cohort NFD holds, the selection bias function and the sensitivity parameter γ are correctly chosen, and the regression models are correctly specified, then the estimatorβ defined as the solution to estimating equations (C.1) will be consistent.

C.1 Asymptotic distribution of CMOR estimator
Let U CM OR equal to estimating equations (C.1). Under standard regularity conditions, and θ 0 and β 0 are the true values of θ and β. Proof is analogous to that of the IPW estimator.
Appendix D

Proofs of Lemmas 1 and 2
Proof. Lemma 1 From Bayes' Theorem Then it can be shown that We then multiply the left-and right-hand side by Y t , then integrate both sides with respect to Y t to obtain Lemma 1.

Proof. Lemma 2
From Bayes' Theorem This implies that by mortal cohort NFD. Then it can be shown that Integrate with respect to Y s+1 , we obtain Then multiplying both sides by Y t and integrating with respect to Y t , we obtain Lemma 2

Unbiasedness of AIPW estimating equations for monotone missing data
For simplicity, we assume that R 1 = 1 and that 0 t=1 {·} = 1, but note that this proof can be easily adapted to the case where R 1 may not always be 1. If mortal-cohort NFD holds, the selection bias function and the sensitivity parameter γ are correctly chosen, and if either the dropout models are correctly specified at all time points or the outcome regression models are correctly specified at all time points, then the estimatorβ that solves We must first prove the following Lemma 4. For the proof, we suppress the subscript i, let m t (Ō l ) = m t (Ō l ; θ t,l ), let π t = π t (Ō t−1 , Y t ; α t , γ) and let λ t = λ t (Ō t−1 , Y t ; α, γ).
Lemma 4: The expression in the large parenthesis [·] in equation (E.1) can be written as the following: Proof. (Lemma 4) We prove this Lemma by showing that (E.2) can be rewritten as (E.1).
Combining like terms, (E.2) can be re-written as the sum of: for each t. Now we can show that each item can be re-written and combined to give the expression in [·] of equation (E.1).

Proof. (Unbiasedness of AIPW estimating equations)
For simplicity, we assume that X is a vector of time-independent auxiliary variables that include Z. Let the true values of α and θ be represented by α 0 and θ 0 . Letα andθ be estimators of α and θ, and let α * and θ * be the probability limits ofα andθ. In addition, Let the true value of β be β 0 . To show double robustness, we show that when α * = α 0 or when θ * = θ 0 , then E{Ψ(α * , θ * , γ, β 0 )} = 0. First we assume that the dropout models are correctly specified at all visits t and that α * = α 0 . We show that the estimating equations in equation (E.1) have expectation zero.
Take expectation with respect to the first term of equation (E.1): Now take expectation with respect to the second term of equation (E.1): Next we assume that the outcome regression models are correctly specified at all visits t and that θ 0 = θ * . We show that the estimating equations in equation (E.1) with the content in the large parenthesis [·] replaced by equation (E.2) have expectation zero. For the proof, we break the items in equation (E.2) into sections, and let h 0t = P (R t = 1 | R t−1 = 1,Ȳ t−1 , X, A t = 1).
First two terms of equation ( Using induction by taking iterative expectations on equation (E.9), we obtain: Here g * Ō t−1 is defined by the postulated dropout model at visit t.
The two lines of the proof is similar to the proof for the first two terms of equation (E.2).
It can be shown that = 0 (By Lemma 2)

Proof of equivalence
We prove that when q t (Ō t−1 , Y t ) = 0 and D is included in both of the dropout and outcome regression models, the AIPW estimator in Wen and others (2017) (Theorem 2) is the same as the one in the current paper for monotone missing data. Like Wen and others (2017), we suppose that R 1 = 1. When q t (Ō t−1 , Y t ) = 0, then for all s < t D, be a model for for E U (β) |Ō l , D, R l = 1 with parameters β and γ, and let λ l (Ō l−1 , D; α) be a model for P (R l = 1|Ō l−1 , D) with parameters α. In addition, let π l (Ō l−1 , D; α l ) be a model for P (R l = 1|R l−1 = 1,Ō l−1 , D). We suppress the parameters for the proof.
A subject's contribution to the AIPW estimator in Wen and others (2017) is given by When q t (Ō t−1 , Y t ) = 0, we let s t (Ō s , D; γ t,l ) be a model for E(Y t |Ō s , Y s , R s = 1, D) with parameters γ. A subject's contribution (suppressing parameter) to the AIPW estimator in the current paper is given by When D = 1, equivalence is not difficult to prove. For any D > 1 (i.e. D = 1, 2, . . . , J),
The estimating equations given in (F.1) is then which can be written as Zt ( Zt(Yt − µt)+
Theorem 2: Suppose the regularity conditions given below hold. Then where Moreover, assuming that the selection bias function and γ are correctly chosen, then (1) when the dropout models are correctly specified, even if the regression models are not, (2) when the regression models are correctly specified, even if the dropout models are not, (3) when both the dropout and the regression models are correctly specified, we can replace Each component of the asymptotic variance can be then estimated from their corresponding sample averages, with (α * , θ * , β 0 ) replaced with (α,θ,β).
Note that if the dropout and regression models are misspecified, the variance estimator is still consistent (i.e. expression G.1 still holds with α * , θ * and β 0 replaced by their probability limits), even though the point estimatorβ is, in general, not consistent. Hence, to ensure double-robustness of the variance estimator, we recommend estimating variance using (G.1).
Proof. We provide a sketch proof. Let U (α * , θ * , γ, . We assume that regularity conditions 1-9 of Appendix B of Robins and others (1994) hold. But we replace U (α * , θ * , γ, β 0 ) and (α * , θ * , β 0 ) for their H i (γ) and γ 0 , respectively, and replace P (R t = 1|Ō t−1 , Y t , A t = 1; α) > c > 0, ∀α and some c for their regularity assumption 3. We can expand equations (E.1) around β 0 , then around α * and around θ * . This will give us: In Web Appendix E, we showed that the AIPW estimating equations are unbiased when mortal-cohort NFD holds, the selection bias function and the sensitivity parameter γ are correctly chosen, and if either the dropout models are correctly specified at all time points or the regression models are correctly specified at all time points.
Suppose that the dropout models are correctly specified, but the conditional mean outcome regression models are not (for all t). This implies that α * = α 0 , but θ * = θ 0 . Using equation (E.1), it is not difficult to show that A Ψ θ = 0, and the results from (1) follows.
Similarly, if the dropout models are correctly specified, but the conditional mean outcome regression models are not (for all t), then α * = α 0 and θ * = θ 0 . Using equation (E.2), it is not difficult to show that A Ψα = 0, and the results from (2) follows.
When α * = α 0 and θ * = θ 0 (i.e. both of the dropout and the conditional mean outcome regression models are correctly specified for all t), it can be shown that A Ψα = 0 and A Ψ θ = 0, and the results from (3) follows.

Mapping of assumptions to MAR
Wen and others (2017) described three assumptions related to MAR for monotone missing data. These assumptions are called unconditional MAR (u-MAR), partly conditional MAR (p-MAR) and fully conditional MAR (f-MAR). u-MAR is the assumption that ∀t s and f-MAR is the assumption that As stated in Wen and others (2017) The figure above shows, for monotone missing data, the relationship between u-MAR and f-MAR, and mortal-cohort NFD and fully conditional mortal-cohort NFD of the current paper. By definition, the data (X, Y 1 , . . . , Y J ) are MAR if P (R t =r t |Ȳ t , X) = P (R t =r t | Y t,obs (r t ), X) for all t andr t , whereȲ t,obs (r t ) denotes the elements ofȲ t that are observed By definition, for monotone missing data, the data (X, Y 1 , . . . , Y J ) are MNAR if P (R t =r t | Y t , X) = P (R t =r t |Ȳ t,obs (r t ), X) for some t andr t . If (X, Y 1 , . . . , Y D ) are MNAR conditional on D, then R t could depend on future outcomes Y t . . . Y D , which is a weaker assumption than fully-conditional mortal-cohort NFD. Similarly, if (X, Y 1 , . . . , Y t ) are MNAR conditional on {A t = 1}, then R t could depend on future outcomes, which is a weaker assumption than mortal-cohort NFD.
With non-monotone missingness, sequential explainability is not the same as MAR. Sequential explainability says nothing about dependence of R t on outcomes, whereas under MAR, R t could depend on observed future outcomes. Hence, with non-monotone missingness, our assumptions have no immediate connection to MAR (or MNAR).

Supplementary results from HRS data
To better understand the reasons for the associations seen in the partly conditional model for the HRS data, we conducted a survival analysis where the event is death. Time to death, T , equals j if the subject died in the jth interval, i.e. if the subject died in [t j , t j+1 ) (j = 1, 2, . . . , 5), and equals 6 if the subject was still alive after the 5th interval. In particular, the first interval (j = 1) is [0, 2), second interval (j = 2) is [2, 4), third interval (j = 3) is [4, 6), fourth interval (j = 4) is [6,8) and fifth interval (j = 5) is [8, 10) years after baseline.
logitP (T = j | T j, X) =α 0 + 5 k=2 α l I(j = k) + α 5 age + α 6 sex + α 7 edu + α 8 Y j−1 + α 9 Y j−1 age+ α 10 Y j−1 sex + α 11 Y j−1 edu + α 12 age · sex + α 13 age · edu + α 14 sex · edu where sex=1 if female and 0 if male, 'edu' is years of education, 'age' is recruitment age (centred at 80), and Y j−1 is the most recent cognitive score (centred at 18). We performed survival analyses for imputed outcomes under γ = {−0.2, −0.25, −0.3} and found that results did not differ much from the three datasets. Parameter estimates (empirical standard errors using bootstrap) are given in the tables below:  If the association between being a woman and a faster decrease over time in mean outcome given survival is partly due to mortality being higher in women with good cognitive function than in men with good cognitive function, then we would expect to see a positive value for α 10 (the parameter associated with Y j−1 sex). However, results show that α 10 is not statistically significant (exp(α 10 ) = 1.004, p = 0.70).
Similarly, if the association between being older and a faster decrease over time in mean outcome given survival is partly due to mortality being higher in older subjects with good cognitive function than in younger subjects with good cognitive function, then we would expect to see a positive value for α 9 (the parameter associated with Y j−1 age). However, results show that α 9 is not statistically significant (exp(α 9 ) = 1.001, p = 0.24).
[  (1)) for cognitive function fitted to HRS data using the first selection bias function at more extreme γ values