Addressing unmeasured confounders in cohort studies: Instrumental variable method for a time‐fixed exposure on an outcome trajectory

Instrumental variable methods, which handle unmeasured confounding by targeting the part of the exposure explained by an exogenous variable not subject to confounding, have gained much interest in observational studies. We consider the very frequent setting of estimating the unconfounded effect of an exposure measured at baseline on the subsequent trajectory of an outcome repeatedly measured over time. We didactically explain how to apply the instrumental variable method in such a setting by adapting the classical two-stage methodology with (1) the prediction of the exposure according to the instrumental variable, (2) its inclusion in a mixed model to quantify the exposure association with the subsequent outcome trajectory, and (3) the computation of the estimated total variance. A simulation study illustrates the consequences of unmeasured confounding in classical analyses and the usefulness of the instrumental variable approach. The methodology is then applied to 6224 participants of the 3C cohort to estimate the association of type-2 diabetes with the subsequent cognitive trajectory, using 42 genetic polymorphisms as instrumental variables. This contribution shows how to handle endogeneity when interest lies in repeated outcomes, along with an R implementation. However, it should still be used with caution as it relies on instrumental variable assumptions that are hardly testable in practice.


INTRODUCTION
Observational studies are widely used in epidemiology to assess the relation between an exposure X and an outcome Y, with the perspective of identifying the causal effect of X on Y. Statistical techniques (Ertefaie et al., 2017; Hernan & Robins, 2020) have been used to derive causal interpretations in the presence of confounding. However, they rely on the assumption that all the sources of confounding have been observed and controlled for. Yet, in many contexts, the assumption that all the confounders are observed is unrealistic, and statistical analyses are likely to provide biased estimates of causal associations (Fewell et al., 2007). For instance, when studying the relation of cardiometabolic factors with cognitive aging, so many confounders may intervene (Rawlings et al., 2014) that residual unobserved confounding is very likely. The issue of unmeasured confounding relates to the more general problem of endogeneity, which occurs when the covariate is partly explained by the system under study. Beyond confounding, endogeneity also encompasses reverse causation, which occurs when the outcome or its underlying process may cause a change in the exposure (Wagner, 2018).
To handle endogeneity, instrumental variable (IV) analysis, first developed in economics (Wright, 1928), has been applied in public health since the early 2000s (Greenland, 2000). This method consists in using an exogenous variable, the "IV", which is not subject to unmeasured confounding and recreates the randomization framework. The principle of the IV methodology can be illustrated in the cross-sectional framework (Figure 1A). Let us denote Z the IV, X the endogenous exposure variable, Y the outcome, and U the unobserved confounders. To be considered valid, the IV needs to satisfy three assumptions (Greenland, 2000): (1) Z is strongly associated with X; (2) Z is associated with Y only through X; and (3) Z is independent of U conditionally on X. Under these assumptions, Z can be used to retrieve the causal association between X and Y. In epidemiology, genetic data have been considered promising IVs because genes are determined at birth and thus not subject to confounding; in this context, the IV methodology is called Mendelian randomization (MR) (Davies et al., 2018). Finally, to be interpreted as causal effects, IV analyses require a fourth assumption: homogeneity for the average causal effect, or monotonicity for the local average causal effect (Hernán & Robins, 2006; Swanson & Hernán, 2018).
The most widely used estimation technique in IV methodology is the two-stage approach, called the two-stage least squares (2SLS) method (Burgess et al., 2017): (1) the endogenous exposure is regressed on the IV, and (2) the derived prediction, which is independent of the unmeasured confounders owing to the assumptions on Z, substitutes for the exposure in the regression of the outcome to quantify the causal relation between X and Y. First proposed in the cross-sectional framework where X and Y were continuous variables measured at a single time point (Burgess et al., 2017), it was adapted to handle binary exposures and/or binary outcomes (Li et al., 2022; Terza et al., 2008), and to treat grouped data (Li et al., 2015, 2020).
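As an illustration of this two-stage principle, here is a minimal Python sketch on simulated cross-sectional data (all parameter values are hypothetical and chosen only for the demonstration; the true causal effect is set to 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large n so that sampling noise is negligible

# Hypothetical data-generating model: U confounds X and Y, Z is a valid IV
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # unmeasured confounder
x = 0.8 * z + u + rng.normal(size=n)          # endogenous exposure
y = 0.5 * x + u + rng.normal(size=n)          # true causal effect of X on Y: 0.5

def slope(w, y):
    """OLS slope of y on w (with intercept)."""
    W = np.column_stack([np.ones(len(w)), w])
    return np.linalg.lstsq(W, y, rcond=None)[0][1]

beta_naive = slope(x, y)   # biased: absorbs the X-U and U-Y paths

# Stage 1: regress the exposure on the instrument, keep the fitted values
a0, a1 = np.linalg.lstsq(np.column_stack([np.ones(n), z]), x, rcond=None)[0]
x_hat = a0 + a1 * z
# Stage 2: regress the outcome on the stage-1 prediction
beta_2sls = slope(x_hat, y)  # close to the true causal effect 0.5
```

Here beta_naive is inflated by the confounding path through U, whereas beta_2sls recovers the causal slope because x_hat varies only through Z.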
Recently, the methodology was extended to handle longitudinal data. Two settings were explored: (i) an exposure repeatedly measured over time and its effect on the concomitant level of a repeatedly measured outcome (Hogan & Lancaster, 2004; O'Malley, 2012) and (ii) a time-fixed exposure and its effect on the subsequent risk of an event (Li et al., 2015; Martínez-Camblor et al., 2019; Tchetgen Tchetgen et al., 2015). Yet, another frequent setting encountered in longitudinal studies concerns a time-fixed exposure and its effect on the subsequent trajectory of an outcome repeatedly measured over time.
In the present contribution, we aim to didactically explain how the IV methodology can be used in observational cohort studies to assess the association between an exposure collected at baseline and the trajectory of an outcome repeatedly measured over follow-up, in the presence of potential unmeasured confounding. Our solution consists in considering a mixed model for the repeated marker in the second step of the two-stage IV approach. We show how this can solve situations of unmeasured confounding and endogeneity, and we illustrate it in a simulation study considering both a binary and a continuous exposure, and a continuous outcome. We finally apply the methodology to assess the association between type-2 diabetes and cognitive aging in the French cohort "Three city" (3C) (Alperovitch, 2003), using genetic polymorphisms as the exogenous variable.

Framework
Let us consider a classical longitudinal framework (Figure 1B) where X is the time-fixed exposure, U is an r-vector of confounders, and Z is a p-vector of exogenous (instrumental) variables, all defined and measured at entry into the cohort, while the continuous outcome Y is repeatedly measured over time t after baseline. Without loss of generality, we assume E(U) = 0.
To ease the problem description, we first consider the case of a continuous exposure, and we assume that Y evolves linearly over time and can be summarized by its latent level at baseline and its latent slope over time, on which the other variables can have an effect. The generalization to a nonlinear trajectory over time is straightforward by considering a more flexible basis of time functions instead of only the intercept and slope.
Let us assume that the true relations schematized in Figure 1(B) translate, for each subject i (i = 1, …, N) of a sample and each occasion j (j = 1, …, n_i), into a linear regression for the continuous exposure (1) and a linear mixed model for the outcome (2):

X_i = α_0* + α_Z*⊤ Z_i + α_U*⊤ U_i + ε_i^X*   (1)

Y_ij = (β_0* + β_X* X_i + β_U0*⊤ U_i + b_0i*) + (β_1* + β_Xt* X_i + β_U1*⊤ U_i + b_1i*) t_ij + ε_ij*   (2)

For the sake of readability, conditioning on covariates and random effects, although systematic, is not made explicit in any of the linear regressions throughout the manuscript.
Following classical definitions of the linear mixed model (Commenges & Jacqmin-Gadda, 2015; Laird & Ware, 1982), b_i* = (b_0i*, b_1i*)⊤ ∼ N(0, B*) is the vector of individual random effects that accounts for the intraindividual correlation within the repeated Y measures. The measurement error ε_i^X* in the exposure regression is independent of Z_i and U_i, and the measurement error ε_ij* ∼ N(0, σ*²) at time t_ij in the outcome regression is independent of all the other measurement errors ε_ij′* at different times (j′ ≠ j), and of Z_i, U_i, and b_i*. The random effects b_i* are also independent of Z_i and U_i. In Equations (1) and (2), the superscript * refers to the parameters and latent variables under the true model.
The parameters of interest are β_X* and β_Xt*, corresponding to the effect of X on the level of Y at inclusion and the effect of X on the subsequent change of Y over time, respectively. Since all confounders are included through U in model (2), we can interpret these parameters in a causal way. The fundamental problem is that this model and these parameters cannot be directly estimated when some of the confounders U are not observed. Let us split U = (U_o, U_u), with U_o the observed confounders and U_u the unobserved confounders.

Instrumental variable approach
The two-stage IV methodology aims at correcting the bias due to residual unmeasured confounding. We show here how it can be adapted to the longitudinal framework described above by replacing the second-stage least-squares regression by a second-stage linear mixed model.
For clarity, we distinguish below the case of a continuous endogenous exposure from the case of a binary endogenous exposure. The method relies on the independence between the regressors (Z, U_o) and the unobserved variables U_u. As this assumption may likely be violated between U_o and U_u, we consider below the total vector U = (U_o, U_u) as being unobserved to ensure independence.

X continuous
With a continuous endogenous exposure, the two-stage methodology is defined as follows: (i) the exposure is regressed on the IV in a first-stage linear model, providing the prediction Ê(X_i | Z_i) = α̂_0 + α̂_Z⊤ Z_i; (ii) this prediction substitutes for the exposure in the second-stage linear mixed model

Y_ij = (β_0 + β_X Ê(X_i | Z_i) + a_0i) + (β_1 + β_Xt Ê(X_i | Z_i) + a_1i) t_ij + ε_ij   (5)

with a_i = (a_0i, a_1i)⊤ the vector of individual random effects. This model relies on the same distributions and independence assumptions as model (2).
From the IV conditional independence assumption (3), the conditional expectation E(X_i | Z_i) = α_0* + α_Z*⊤ Z_i. By definition, E(X_i | Z_i) and the residual X_i − E(X_i | Z_i) are independent, so a_i = (a_0i, a_1i)⊤ is independent of the covariates in the model, as required in a linear mixed model. The model defined in Equation (5) is thus equivalent to the target model in Equation (2), except that the variance of the random effects is not homoskedastic anymore.
Since maximum likelihood estimates of the fixed effects in a mixed model are unbiased even when the covariance structure is misspecified (following the same principle as generalized estimating equations, Liang & Zeger, 1986), β̂_X and β̂_Xt are unbiased estimators of β_X* and β_Xt*; they may be used to quantify the causal relation between X and Y. However, their variance needs to be corrected for the heteroskedasticity and the use of an IV. By applying the same principle of robust variances (Royall, 1986; White, 1980) as in IV methods for cross-sectional studies (e.g., in the ivtools R package, Sjolander & Martinussen, 2019), we define the following sandwich estimator:

V_2-S(β̂) = (Σ_i Ŵ_i⊤ V̂_i⁻¹ Ŵ_i)⁻¹ (Σ_i Ŵ_i⊤ V̂_i⁻¹ Σ̂_i V̂_i⁻¹ Ŵ_i) (Σ_i Ŵ_i⊤ V̂_i⁻¹ Ŵ_i)⁻¹

where Ŵ_i is the matrix of variables associated with the vector of fixed effects β (in our example in Equation (5), Ŵ_i is an n_i × 4 matrix with intercept, time, Ê(X_i | Z_i), and its interaction with time, and β = (β_0, β_1, β_X, β_Xt)⊤); V̂_i = Q̂_i B̂ Q̂_i⊤ + σ̂² I_{n_i}, with Q̂_i the matrix of variables related to the random effects (in our example, an n_i × 2 matrix with intercept and time) and I_{n_i} the identity matrix; and β̂, B̂, σ̂ are the estimates obtained in the second-stage model (5). Finally, Σ̂_i is the empirical covariance matrix of the residuals, that is, Σ̂_i = Côv(Y_i − W_i β̂, Y_i − W_i β̂), where W_i is the n_i × 4 matrix with intercept, time, X_i, and its interaction with time.
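As a sketch of the sandwich principle (not of the exact formula above: for readability, the working covariance V̂_i is replaced by the identity, which yields the classical cluster-robust estimator for a stacked OLS fit), one may write:

```python
import numpy as np

def cluster_sandwich(W, y, beta, ids):
    """Cluster-robust sandwich variance of an OLS fit, working covariance = I.

    W: (n_obs, p) design matrix; y: stacked outcomes; beta: fitted fixed
    effects; ids: subject identifier of each row (a subject's rows = one cluster).
    """
    bread = np.linalg.inv(W.T @ W)
    resid = y - W @ beta
    p = W.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(ids):
        m = ids == g
        s = W[m].T @ resid[m]          # per-subject score contribution
        meat += np.outer(s, s)
    return bread @ meat @ bread

# toy check: with independent homoskedastic errors, the robust variance
# should be close to the classical OLS variance sigma^2 (W'W)^-1
rng = np.random.default_rng(1)
n, p = 6000, 2
W = np.column_stack([np.ones(n), rng.normal(size=n)])
y = W @ np.array([1.0, 2.0]) + rng.normal(size=n)
beta = np.linalg.lstsq(W, y, rcond=None)[0]
ids = np.repeat(np.arange(n // 3), 3)   # clusters of size 3
V_rob = cluster_sandwich(W, y, beta, ids)
V_cls = (np.sum((y - W @ beta) ** 2) / (n - p)) * np.linalg.inv(W.T @ W)
```

With correlated within-subject errors, the two estimators would diverge, and only the sandwich form would remain valid.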
The robust variance V_2-S(β̂) quantifies the second-stage variability in the estimates, but it neglects the first-stage uncertainty. To compute the total variance that accounts for the variability of the two stages, we use a parametric bootstrap (Efron & Tibshirani, 1993): instead of running the second-stage analysis once from the maximum likelihood estimates α̂, the second stage is replicated D times from first-stage parameters α̂^(d) (d = 1, …, D) randomly drawn from their asymptotic normal distribution with mean α̂ and variance V(α̂). The total variance estimate of β̂ can then be derived with Rubin's rule (Little & Rubin, 2019) from the D second-stage estimates β̂^(d) as:

V_tot(β̂) = (1/D) Σ_d V_2-S(β̂^(d)) + (1 + 1/D) (1/(D − 1)) Σ_d (β̂^(d) − β̄)(β̂^(d) − β̄)⊤, with β̄ = (1/D) Σ_d β̂^(d).
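The parametric bootstrap combined with Rubin's rule can be sketched as follows; second_stage is a hypothetical interface standing for the second-stage mixed-model fit, assumed to return the estimate and its robust variance (here reduced to a scalar for simplicity):

```python
import numpy as np

def total_variance(alpha_hat, V_alpha, second_stage, D=200, seed=0):
    """Parametric bootstrap of the first stage combined with Rubin's rule.

    second_stage(alpha) is a hypothetical function returning (beta_hat, V_beta)
    for a given first-stage parameter vector alpha (beta treated as scalar).
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(V_alpha)
    betas, variances = [], []
    for _ in range(D):
        # draw first-stage parameters from their asymptotic distribution
        alpha_d = alpha_hat + L @ rng.normal(size=len(alpha_hat))
        b, v = second_stage(alpha_d)
        betas.append(b)
        variances.append(v)
    within = np.mean(variances)              # average second-stage variance
    between = np.var(betas, ddof=1)          # variability across replicates
    return within + (1 + 1 / D) * between    # Rubin's rule

# toy illustration: a 'second stage' that just passes the first-stage
# parameter through, with no second-stage variance of its own, so the
# total variance should be close to (1 + 1/D) * Var(alpha) = 1.0025
v_tot = total_variance(np.array([0.0]), np.array([[1.0]]),
                       lambda a: (a[0], 0.0), D=400)
```

In the actual method, second_stage would refit the linear mixed model of Equation (5) with the prediction computed from the drawn first-stage parameters.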

X binary
The absence of bias demonstrated for the continuous exposure comes from the use of additive models in both stages.
Although not frequent, a linear model could also be considered for a binary exposure. Called the linear probability model (Li et al., 2022), it translates into the exact same inference technique as described for the continuous exposure, with Ê(X_i | Z_i) derived from a linear model for X and included in the second-stage linear mixed model, and the same variance estimator. Alternatively, the more classical logistic model can also be considered:

logit(P(X_i = 1 | Z_i)) = α_0 + α_Z⊤ Z_i

with the derived Ê(X_i | Z_i) = expit(α̂_0 + α̂_Z⊤ Z_i) included in the second-stage linear mixed model in (5), and the same total variance estimator used. However, due to the nonlinear nature of the logistic regression, the expectation E(X_i | Z_i, U_i) marginalized over U_i no longer equals the logistic prediction of E(X_i | Z_i), and the convergence of the estimates of β_X and β_Xt to β_X* and β_Xt* in (2) is not ensured anymore. To further account for the residual effect of the unmeasured confounders, some authors (Terza et al., 2008) recommended replacing the substitution of X by Ê(X_i | Z_i) with the combination of X and the residual X − Ê(X_i | Z_i) in the second stage. We call these three options linear/substitution, logistic/substitution, and logistic/residual inclusion, respectively.
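The logistic/substitution option can be sketched in a simplified cross-sectional analogue (a plain regression replaces the second-stage mixed model, and all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
expit = lambda w: 1 / (1 + np.exp(-w))

# hypothetical generating model with a binary endogenous exposure
z = rng.normal(size=n)                         # instrument
u = rng.normal(size=n)                         # unmeasured confounder
x = rng.binomial(1, expit(z + u)).astype(float)
y = 0.5 * x + u + rng.normal(size=n)           # true causal effect: 0.5

def logistic_fit(z, x, iters=30):
    """Newton-Raphson fit of logit P(X=1|Z) = a0 + a1 * Z."""
    W = np.column_stack([np.ones(len(z)), z])
    a = np.zeros(2)
    for _ in range(iters):
        p = expit(W @ a)
        a += np.linalg.solve((W * (p * (1 - p))[:, None]).T @ W, W.T @ (x - p))
    return a

def slope(w, y):
    W = np.column_stack([np.ones(len(w)), w])
    return np.linalg.lstsq(W, y, rcond=None)[0][1]

beta_naive = slope(x, y)                       # confounded by U
a0, a1 = logistic_fit(z, x)
x_hat = expit(a0 + a1 * z)                     # first-stage prediction of E(X|Z)
beta_iv = slope(x_hat, y)                      # logistic/substitution estimate
```

As in the simulations reported below, the substitution estimate lands close to the causal effect even though exact convergence is not guaranteed with the nonlinear first stage.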

Software
The IV estimation technique for a binary or continuous time-fixed exposure and a continuous repeatedly measured outcome is implemented in the R package IVmm, available at [URL of the package - blinded version]. It relies on the hlme function of the lcmm R package for the linear mixed model estimation (Proust-Lima et al., 2017).

SIMULATION STUDY
We ran a simulation study to illustrate the behavior of the naive approach and of the IV methods in the presence of unmeasured confounding.

Simulation design
The simulation setting followed the DAG of Figure 1(B). The data-generation procedure, including the parameter values considered, is fully summarized in Table S1. For each individual i in a sample of size N, we first generated an exogenous IV Z_i and an unobserved confounder U_i according to standard Gaussian distributions, and random visit times t_ij = j + τ_ij around theoretical annual visits j (with j = 1, …, 6), with τ_ij a visit- and subject-specific random Gaussian departure (N(0, 0.05)). We then generated the endogenous continuous exposure X_i according to model (4) (for a binary exposure, a logistic version of (4) was considered) and the repeated measures of the outcome Y_ij according to model (2). We considered scenarios with different sample sizes (N = 2000, 6000, or 20,000) and different strengths of association α_Z between the IV and the exposure, resulting in different strengths of the IV. As common in the IV literature, the strength of association between the IV and the exposure was quantified with the F-statistic (ratio of the explained variance to the residual variance) (Andrews et al., 2019) for a continuous exposure, and with the Nagelkerke R² for a binary exposure. For each scenario, 500 datasets were simulated.
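A condensed version of this generation scheme, with hypothetical parameter values rather than those of Table S1, can be sketched as follows; for brevity, the second stage uses OLS on the stacked observations, which leaves the fixed effects unbiased even though it ignores the random-effect covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
N, J = 20_000, 6

# hypothetical true parameters (illustration only, not those of Table S1)
bx, bxt = -0.5, -0.2          # effects of X on baseline level and slope
bu, but = 0.5, 0.3            # effects of the unmeasured confounder U

z = rng.normal(size=N)                        # instrumental variable
u = rng.normal(size=N)                        # unmeasured confounder
x = 0.8 * z + u + rng.normal(size=N)          # endogenous exposure
t = np.arange(1, J + 1) + rng.normal(scale=np.sqrt(0.05), size=(N, J))
r0 = rng.normal(scale=0.5, size=N)            # random intercepts
r1 = rng.normal(scale=0.1, size=N)            # random slopes
y = ((10 + bx * x + bu * u + r0)[:, None]
     + (-0.3 + bxt * x + but * u + r1)[:, None] * t
     + rng.normal(scale=0.5, size=(N, J)))    # repeated outcome measures

def fit(xvar):
    """OLS of stacked Y on intercept, time, xvar, and xvar * time."""
    X = np.repeat(xvar, J)
    T = t.ravel()
    W = np.column_stack([np.ones(N * J), T, X, X * T])
    return np.linalg.lstsq(W, y.ravel(), rcond=None)[0]

beta_naive = fit(x)                           # confounded by U
a = np.linalg.lstsq(np.column_stack([np.ones(N), z]), x, rcond=None)[0]
beta_iv = fit(a[0] + a[1] * z)                # two-stage IV version
```

The last two entries of each coefficient vector correspond to the level and slope effects of the exposure; the naive fit is shifted away from the true values while the IV fit recovers them.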

Simulation results
The results of the naive and IV approaches are reported in Tables 1 and 2; they are also displayed in Figure 2 for the slope over time (and in Figure S1 for the initial level).
As expected, whatever the sample size and the strength of the IV association with the exposure, the naive method showed very large bias and a null coverage rate for the association between the exposure and the change over time in all cases. In contrast, the two-stage IV methods retrieved the true causal association without any bias for the continuous exposure, and for the binary exposure when using the linear/substitution and logistic/substitution methods, even in the scenarios with a weak instrument. The logistic/residual methodology for a binary exposure, however, showed large bias and a null coverage rate; we thus did not investigate this method further. The simulation study also validated the proposed variance estimate, with reported coverage rates of the 95% confidence interval very close to the nominal value in both the continuous and binary cases. However, although correct, the two-stage IV method showed substantial variability in the estimates when the IV was weaker.

TABLE 1: Simulation results for the continuous exposure (over 500 replicates) for the association between the exposure and the trajectory of Y (summarized by the effect on the baseline level and the slope over time), according to the sample size and the strength of the instrumental variable (α_Z). (a) Strength of association is assessed with the F-statistic for continuous X.

Abbreviations: CR, coverage rate of the 95% confidence interval; N, sample size; RB, relative bias (defined as the average percentage of difference between the estimate and the true parameter value).

TABLE 2: Simulation results for the binary exposure with the naive, linear/substitution, and logistic/substitution IV methods (over 500 replicates) for the association between the exposure and the trajectory of Y (summarized by the effect on the baseline level and the slope over time), according to the type of exposure, the sample size, and the strength of the instrumental variable (α_Z). (a) Strength of association is assessed with the R², expressed in % (and the F-statistic), for the linear regression, and with the Nagelkerke R², also expressed in %, for the logistic regression.

Abbreviations: CR, coverage rate of the 95% confidence interval, expressed in %; Lin/Sub, linear/substitution method; Log/Sub, logistic/substitution method; N, sample size; RB, relative bias expressed in % (defined as the average percentage of difference between the estimate and the true parameter value); Str, strength.

APPLICATION
We aimed to assess the relation between type-2 diabetes measured at baseline and the subsequent cognitive trajectory in the elderly population. Indeed, biological mechanisms suggest an implication of type-2 diabetes in cognitive aging (Frison, 2019), but unmeasured confounders can interfere with this process. To handle this, we used a genetic IV defined by the 42 single nucleotide polymorphisms (SNPs) (listed in the Supporting Information) that were previously identified in genome-wide association studies of type-2 diabetes (Morris et al., 2012; Tchetgen Tchetgen et al., 2015).

The Three-city study
The 3C study is a population-based prospective cohort that aimed at assessing the relation between vascular diseases and dementia in the elderly (Alperovitch, 2003). Participants, aged 65 years and older, were randomly selected in 1999 from the electoral lists of three French cities. In total, 9294 participants underwent an in-depth examination of their health and risk factors at baseline and were then followed every 2-3 years for up to 20 years with an extensive interview and a neuropsychological battery. Among them, 6948 participants were typed on genome-wide genotyping arrays and further imputed from the Haplotype Reference Consortium panel (Lambert et al., 2009). Genotype data retained in the study are those with an imputation quality greater than 0.70. Type-2 diabetes was determined from blood glucose level (fasting glucose level ≥ 7.0 mmol/L) or the use of an antidiabetic treatment at baseline. We studied the cognitive trajectory through the Isaacs Set Test (IST), which measures verbal fluency and has been shown to differentiate early in the pathological process toward dementia (Amieva et al., 2014). The score is the total number of words given in four semantic categories in 15 s. The final sample included 6224 participants whose type-2 diabetes status was ascertained at baseline, who were genotyped, and who had at least one IST measure during follow-up. Participants were 74 years old at baseline on average, 61% were women, and 38% had an educational level higher than secondary school (Table 3). Among them, 598 (9.6%) were ascertained with diabetes at baseline; those with diabetes were more often male and more likely to have a low educational level. Participants were followed up for 8 years on average, with a mean of four repeated measures of IST.

FIGURE 2: Association estimates (over 500 replicates) of the continuous exposure or the binary exposure with the change of the outcome over time using the naive or the IV approaches (logistic/residual, linear/substitution, and logistic/substitution in the binary case) for different sample sizes (N) and different intensities of association (through the regression coefficient α_Z). In the binary case only, the Nagelkerke R² is also reported to further illustrate the strength of the IV in comparison with the application setting.

The IV analysis
We primarily used the logistic/substitution method. The R² of 4.8% showed a weak association between type-2 diabetes and the genetic polymorphisms. The linear mixed model for the IST trajectory included a basis of four natural cubic splines of the time since baseline to account for the nonlinear trajectories over time. Diabetic status (in the naive model) or its expectation based on the 42 polymorphisms (in the IV model) was included in interaction with each spline function. For the naive model, we considered both no adjustment and adjustment for measured potential confounders (educational level, age at baseline). Parameter estimates are given in Table S2. Predicted trajectories of IST according to diabetic status are displayed in Figure 3(A) (corresponding differences over time between groups in Figure 3B). The naive method, whether adjusted or not for potential confounders, highlighted a difference at inclusion according to type-2 diabetes but no differential change over time. At any time, the mean IST score was lower for participants with type-2 diabetes.

FIGURE 3: (A) Predicted trajectories of IST score according to type-2 diabetes at baseline and associated 95% confidence intervals. (B) Estimated difference in IST score over time for diabetic compared to nondiabetic participants using the naive method (not adjusted or adjusted for gender, educational level, and age) and the logistic/substitution instrumental variable method.

DISCUSSION
The IV method has gained interest in observational studies to address unmeasured confounding. Yet, although the framework is very common in observational longitudinal studies, an IV solution for assessing the effect of an exposure collected at baseline on the subsequent trajectory of a repeated outcome had not been previously described in the medical statistics literature. We showed in this work how the two-stage approach frequently used in IV methodology for cross-sectional or survival outcomes (Burgess et al., 2017; Tchetgen Tchetgen et al., 2015) can be adapted to study the association between a time-fixed exposure and the subsequent trajectory of an outcome using mixed model theory. Previous contributions dealing with repeated data over time had systematically focused on time-dependent exposures (rather than time-fixed ones) and associations with either the level of a time-fixed outcome (Sánchez et al., 2017) or the level of a repeated outcome at a given time using distributed lag models (Hogan & Lancaster, 2004; O'Malley, 2012). To our knowledge, the use of a mixed model within an IV approach in epidemiology was limited to the analysis of a complex clinical trial to treat noncompliance over time (Bond et al., 2007), the issue of measurement error of time-dependent exposures with regression calibration (Strand et al., 2014), and the issue of between/within unmeasured confounding in cross-sectional grouped data (Li et al., 2015).
The simulation study emphasized the highly biased estimates obtained when ignoring unmeasured confounding. It also showed the correct inference that our IV solution can provide for assessing the causal association between a time-fixed continuous or binary exposure and a continuous longitudinal outcome in the presence of endogeneity. However, we noticed a very high variance for moderate sample sizes (a few thousand subjects) when the IV was weakly associated with the exposure. For simplicity of result reporting, we focused in the methodology and in the simulations on scenarios with a linear trajectory for the outcome. However, the methodology applies equivalently to any scenario with a nonlinear trajectory, provided that the mixed model remains linear in the fixed and random effects and that random effects are included for each time function. This is what was done in the application, which considered natural splines to approximate the nonlinear cognitive trajectory.
The IV methodology relies heavily on additive model properties to eliminate the association with the unmeasured confounders. The use of nonlinear models may prevent the total elimination of this association and induce biased estimates. When considering a binary exposure, we explored linear and nonlinear regressions. Our simulations showed that the causal association could be correctly retrieved when using the linear probability model for the binary exposure, but also when using the nonlinear logistic model combined with a substitution method in the second stage. In the application, both methods also gave the same results. In contrast, the logistic regression combined with residual inclusion in the second stage (Terza et al., 2008) showed large bias in our simulation setting with a linear mixed model in the second stage and was not further investigated. Regarding the outcome, we restricted our framework to continuous longitudinal outcomes with linear mixed models and leave extensions to other types of outcomes to future research.
Our motivating application aimed at evaluating the causal association between type-2 diabetes and cognitive decline by using 42 genetic polymorphisms associated with type-2 diabetes as IVs. While the classical (naive) regression ignoring unmeasured confounders highlighted a lower cognitive level for type-2 diabetics at all times, the IV methodology that handles unobserved confounding suggested a different and time-varying association. However, the IV analysis does not allow any firm conclusion to be reached, as the confidence intervals were excessively large because of the limited sample size for an IV application with a binary exposure (N = 6224) and the weakness of the association between the genetic polymorphisms and type-2 diabetes (R² = 4.8%). These results were similar when considering logistic and linear models in the first stage.
MR studies had already been conducted to assess the causal association between type-2 diabetes and cerebral aging. Cross-sectional studies had focused on cognitive level (Ware et al., 2021) and dementia risk (Østergaard et al., 2015; Walter et al., 2016), and one longitudinal survival study had investigated the association with dementia risk (Tchetgen Tchetgen et al., 2015). None had identified a causal association between genetically predicted type-2 diabetes and cerebral aging.

FIGURE 1: Directed acyclic graph for the IV methodology with a cross-sectional outcome Y (Panel A) or a longitudinal continuous outcome Y (Panel B). X is the exposure, Z the instrumental variable (with 1, 2, 3 the corresponding IV assumptions), and U the (partially) unobserved confounders. Int and slope represent the underlying latent level of Y at baseline and the latent slope of Y over time, respectively.

TABLE 3: Characteristics of the 6224 participants of the 3C sample according to their type-2 diabetes status and overall.