Unbiased and efficient estimation of causal treatment effects in crossover trials

We introduce causal inference reasoning to crossover trials, with a focus on thorough QT (TQT) studies. For such trials, we propose different sets of assumptions and consider their impact on the modeling strategy and estimation procedure. We show that unbiased estimates of a causal treatment effect are obtained by a g‐computation approach in combination with weighted least squares predictions from a working regression model. Only a few natural requirements on the working regression and weighting matrix are needed for the result to hold. It follows that a large class of Gaussian linear mixed working models lead to unbiased estimates of a causal treatment effect, even if they do not capture the true data‐generating mechanism. We compare a range of working regression models in a simulation study where data are simulated from a complex data‐generating mechanism with input parameters estimated on a real TQT data set. In this setting, we find that for all practical purposes working models adjusting for baseline QTc measurements have comparable performance. Specifically, this is observed for working models that are by default too simplistic to capture the true data‐generating mechanism. Crossover trials and particularly TQT studies can be analyzed efficiently using simple working regression models without biasing the estimates for the causal parameters of interest.


Introduction
A TQT study is an essential component of drug development, ensuring drug safety for patients. Therefore, it is a regulatory requirement to conduct such trials [6]. These trials have complicated designs in an attempt to minimize the sample size. This complexity has in turn led to a long-standing debate and several suggestions on how to best model the resulting data, and in particular how to use baseline measurements [15,13,20,30,19]. In parallel with these developments, the causal inference literature, and recently also official regulatory recommendations, have increased the focus on clearly defining what we are actually trying to estimate. This has led to the endorsement of the so-called estimand framework within regulatory guidance documents and the causal roadmap concept within the causal inference literature [11,33].
One of the central points made here is that we should enable our research question to be defined in terms of the trial data and not just in terms of a specific model. That said, we would still want to use models in order to gain efficiency or eliminate bias. This is in line with recent regulatory guidance documents that encourage the active use of baseline variables in randomized trials for gaining precision of estimated treatment effects [4,5].
The developments in this paper aim to support this push towards clearer and more focused statistical procedures. In particular, the causal inference approach we present facilitates a clear and transparent definition of our target of estimation in cross-over trials and specifically in TQT studies.
Similar developments are already available in the literature for one-period trials. Here, standard estimators have been shown to be unbiased for causal parameters [29,34]. Moreover, appreciable gains in efficiency compared to unadjusted estimators have been demonstrated [28,9,1]. We provide similar results for standard estimators in cross-over trials, and in particular in TQT studies.
For a standard TQT study, healthy volunteers [11] are enrolled with the purpose of obtaining electrocardiograms (ECGs) from each subject under different treatment conditions. An ECG measures the electrical output from the heart, resulting in replicates of graphical outputs as illustrated in Figure 1. The output is characterized by different waves, complexes, and intervals. In Figure 1 we see the P, Q, R, S, and T waves. The interval from the beginning of the Q wave to the end of the T wave is called the QT interval, and is measured in milliseconds (ms). The QT interval measures the time it takes the heart to repolarize and prepare for the next beat. The longer the QT interval, the longer the time between heart beats, and the less oxygen is transported to cells in the body. Specifically, prolongation of the QT interval has been shown to be related to an increased risk of Torsades de Pointes, a malignant ventricular arrhythmia. Thus, it is undesirable for the QT interval to be prolonged due to drug exposure. The length of the QT interval is positively associated with the length of the RR interval, i.e., the interval from the R wave on an ECG until the next R wave on the ECG. Therefore, QT intervals are standardized in order to get a QTc (QT corrected) measurement corresponding to a particular length of the RR interval (typically 1 second). An example of a commonly used correction is the Fridericia correction, which is given by QTcF = QT / RR^(1/3), with the RR interval measured in seconds. The purpose of the statistical analysis in a TQT study is to formally assess if clinically relevant prolongation is present based on the QTc measurements [21].
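As a small illustration, the Fridericia correction above can be computed as follows (a minimal sketch; the function name and the unit conventions, QT in milliseconds and RR in seconds, are our own choices):

```python
def fridericia(qt_ms, rr_s):
    """Fridericia-corrected QT interval: QTcF = QT / RR^(1/3).

    qt_ms: measured QT interval in milliseconds.
    rr_s:  RR interval in seconds (RR = 1 leaves QT unchanged).
    """
    return qt_ms / rr_s ** (1.0 / 3.0)
```

At an RR interval of 1 second the correction is the identity; at faster heart rates (RR below 1 second) the corrected value exceeds the raw QT.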
TQT studies are predominantly conducted as cross-over trials in an attempt to lower the sample size and eliminate between-subject variation by paired comparisons of different treatments. In a cross-over trial, each subject is randomized to one of several treatment sequences that uniquely determines the treatment they receive at any given treatment period throughout the trial. Within each period, a baseline QTc measurement is obtained just prior to treatment, followed by a number of post-treatment QTc measurements obtained at pre-defined time points following treatment (see Figure 2 for the two-period case). Each pair of consecutive periods is separated by a washout period to minimize the risk of carry-over effects, i.e., any effects of treatment from the previous period on the QTc measurements in the current period. From a practical point of view, the main challenge with cross-over designs is that the washout period needs to be tailored to the half-life of the drug concentration to reflect proper washout of the drug. Specifically, if the half-life is long, an even longer (typically 5 half-lives) washout period is required in order to avoid carry-over effects. Ultimately, this may impose a very long study period on the subjects enrolled in the study. This is clearly not optimal and may also prove to be a challenge with regard to subject retention. In such situations, a parallel arm design may be more feasible [6].

The outline of the paper is as follows. In the next section we introduce the basic notation used in the paper and in that context define the fundamental assumption of no carry-over in a cross-over trial. We briefly introduce the concept of counterfactual outcomes and define the causal quantities of interest alongside the data assumptions needed to identify these quantities directly from the data. Section 3 is dedicated to deriving and assessing the theoretical performance of a number of causally motivated estimators. The section also contains the main result of this paper, which shows that a large class of working models still result in unbiased estimates of the causal target parameters even if they do not capture the true data generating mechanism. In Section 4, we analyse data from a real TQT study using a range of working models that in theory result in unbiased estimates of the causal target parameters according to the main result of this paper. Section 5 is dedicated to comparing the same working models in a simulation study cast around the data example in order to evaluate performance in a realistic scenario. We conclude the paper with a discussion in Section 6.

Notation and assumptions
In the following, we introduce the notation and causal assumptions needed in order to identify the target of estimation. Let Y_ipt denote the QTc measurement for subject i in period p at time t, i = 1, ..., n, p = 1, ..., P, t = 1, ..., T. Denote the baseline measurement for subject i in period p by X_ip, and the treatment by Z_ip. TQT studies tend to have as many treatments as periods. Thus, we will denote treatments by 0, ..., P − 1, where Z_ip = 0 corresponds to subject i receiving placebo in period p. Often we will suppress the i, since we assume the subjects are independent draws from the same distribution. Furthermore, let Y_p = (Y_p1, ..., Y_pT)^T denote the vector of post-baseline measurements in period p.
In line with the informed choice of washout period in TQT studies, we assume the washout period has been sufficient to ensure no carry-over effects. Under this assumption, our data can be described by the Directed Acyclic Graph (DAG) in Figure 3 in the two-period case. Note that the DAG has no arrows from baseline measurements to post-baseline measurements, since we do not expect a causal effect of the baseline measurements. Instead, we expect any association between baseline measurements and post-baseline measurements to arise from the latent variables W, W_1, and W_2 in Figure 3. The latent variable W reflects the dependence owing to measurements being from the same subject, whereas the latent variables W_1 and W_2 reflect the dependence between measurements from the same period, i.e., temporary traits. Despite the lack of arrows between baseline and post-baseline measurements, it still makes sense to adjust for baseline measurements, for example in a regression model, because the baseline measurements act as proxies for the unobserved latent variables.

The DAG in Figure 3 implies the following about the data distribution:

Assumption 1 (No carry-over). Let x = (x_1, ..., x_P)^T, and likewise for z and y, and let f_x(x) be the density for variable x and likewise for all other variables. The distribution of our data satisfies the Markov factorization property with respect to the DAG in Figure 3 [22], i.e., the joint density of our data can be written as

f(x, y, z, w, w_1, ..., w_P) = f(w) f(z) ∏_{p=1}^{P} f(w_p) f(x_p | w, w_p) f(y_p | z_p, w, w_p).

In accordance with Figure 3, Assumption 1 states that the conditional distribution of y in period p only depends on treatment in period p and the latent variables w and w_p.
In the following, we take a causal approach to TQT studies. By doing so, we provide a clearly stated research question that is completely disentangled from the modelling of the data. This exercise provides complete clarity on what assumptions about the data generating mechanism are necessary to answer the research question, and sets them apart from purely technical assumptions made during the modelling stage of the estimation.
Let Y_p^{z_1,...,z_P} denote the post-baseline QTc measurements we would have made in period p if, possibly counter to fact, the subject had received treatments z_1, ..., z_P. Note that under Assumption 1, the QTc measurements in period p only depend on the treatment in period p, i.e., Y_p^{z_1,...,z_p,...,z_P} = Y_p^{q_1,...,z_p,...,q_P} = Y_p^{z_p} for all treatment regimes and periods. Thus, from here on, Y_p^{z_p} will denote the potential QTc measurements in period p if the subject, possibly counter to fact, had received treatment z_p in period p.
If we were particularly interested in treatment z and had ample resources in terms of money, time, and subjects, we would have made a two-arm trial and used

E(Y_1t^z − Y_1t^0), t = 1, ..., T,

as the causal contrasts of interest. Note that the first period in a cross-over trial corresponds to such a trial, and as a consequence the targeted causal contrasts can be identified as

E(Y_1t^z − Y_1t^0) = E(Y_1t | Z_1 = z) − E(Y_1t | Z_1 = 0).

We can then easily estimate the contrasts based on data from the first period only, for example by the difference between the arm-specific averages of Y_1t among subjects with Z_1 = z and subjects with Z_1 = 0. Clearly this is not an efficient use of all the data collected in the cross-over trial. However, to enable full use of the data, stricter assumptions than Assumption 1 are needed. Specifically, we need to make assumptions about the distributions of the post-baseline measurements. To this end, it would be natural to assume the same treatment effect in all periods:

Assumption 2 (Same treatment effect). E(Y_pt^z − Y_pt^0) = E(Y_qt^z − Y_qt^0) for all p and q, and all t = 1, ..., T.

I.e., the treatment effect is the same in all periods.
Assumption 2 enables estimation of one overall treatment effect across all periods. A special case where Assumption 2 holds is when the distribution of period-specific data does not depend on period:

Assumption 3 (Same distribution). (X_p, Y_p^z) has the same distribution as (X_q, Y_q^z) for all p and q.
Assumption 3 is rather restrictive compared to Assumption 2, in that it a priori excludes any systematic effect of period. This contradicts current modelling and design practice in cross-over trials, where potential period effects are modelled and subjects are randomized according to a Latin square in order to balance any period effect [31].
As an alternative, one can assume that the conditional distribution of the post-baseline measurements, given baseline measurements and treatment, is the same in all periods:

Assumption 4 (Same relationship). f(y_p | x_p, z_p = z) = f(y_q | x_q, z_q = z) for all p and q.
This facilitates a model fit based on data from all periods, which may in turn be used to infer the period-specific causal contrast E(Y_p^z − Y_p^0).

Estimation
The causal framework from the last section enables us to clearly define our target of estimation. In this section, we provide inference procedures tailored to assess the causal target parameter in the context of a TQT study. For brevity, we only consider the case where we have assumed the same average causal treatment effect in all periods. I.e., we specifically develop estimators for E(Y_1^z − Y_1^0) under Assumption 2. An outline of how to estimate period-specific average causal effects without Assumption 2 is provided in Section 6.
Under Assumption 2, the fact that subjects receive both the placebo and active treatment initially motivates the following simple non-parametric estimator:

μ̂_1t(z) = (1/n) Σ_i Σ_p (I(Z_ip = z) − I(Z_ip = 0)) Y_ipt,

where the indicator function, I(A), is one if A holds and zero otherwise. Note that this, and all other estimators throughout this paper, has t in the subscript to indicate that it is the estimator for the post-baseline measurement time point t. Naturally, an effect will be estimated for each t = 1, ..., T. Note that μ̂_1t(z) simply takes the outcomes in the periods where the subjects receive the treatment of interest, subtracts the outcomes in the placebo periods, and averages across subjects. It is an unbiased estimator for the treatment effects of interest due to the randomization. However, it only uses post-baseline measurements, and may therefore lack precision. Alternatively, one can pursue fitting a working regression model to the data and thereby bring baseline measurements into play. We specifically assume that the outcome predictions in this working model are given by h_pt(x, z, β), where β is a vector of regression parameters. Under Assumption 3 it is possible to ignore periods and use the simpler working regression model h_t(x, z, β). Such a working model can be used to estimate the causal effects of interest simply by plugging into the G-computation formula [27]:

μ̂_2t(z) = (1/(nP)) Σ_i Σ_p (h_pt(X_ip, z, β̂) − h_pt(X_ip, 0, β̂)).

This estimator uses covariate information to gain efficiency. Moreover, in situations with missing endpoint data, the estimator is still unbiased under the missing at random assumption, given that the regression model is correctly specified. In comparison, the endpoint data have to be missing completely at random for the simple estimator μ̂_1t(z) to be unbiased.
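The non-parametric estimator can be sketched as follows (the array layout is our own assumption: Y[i, p, t] holds the QTc measurement for subject i in period p at time t, and Z[i, p] the treatment given in period p, with placebo coded 0):

```python
import numpy as np

def mu1(Y, Z, z, t):
    """Non-parametric estimator mu1_t(z): for each subject, take the outcome
    at time t in the period(s) with treatment z, subtract the outcome in the
    placebo period(s), and average across subjects. Assumes a complete
    crossover where each subject receives each treatment in exactly one
    period."""
    y = Y[:, :, t]
    diff = (y * (Z == z)).sum(axis=1) - (y * (Z == 0)).sum(axis=1)
    return diff.mean()
```

Because every subject contributes both a treated and a placebo period, the estimator is a plain average of within-subject differences.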
Up front, the above developments depend on the working regression model being specified so that it captures the true mean value structure. Since this is by no means warranted, it is important to mitigate the impact in terms of bias if the working model is misspecified. Such a mitigation can be successfully achieved with the semi-parametric estimator μ̂_3t(z), which is derived in Appendix C. It uses covariates to gain precision, but is unbiased due to the independence between X_p and Z_p. This independence is ensured by the randomization and implies that the augmentation term has mean zero. Thus, the estimator has the same mean as the non-parametric estimator, namely the true causal effect. The following theorem shows that μ̂_3t(z) = μ̂_2t(z) for certain types of working models.
Theorem 1 (Unbiasedness of μ̂_2t(z)). Assume the data structure of this paper, and assume we use a working model with post-baseline measurements as outcome, and with main effects of treatment specific to each post-baseline time point. Let Y_i be the vector of all post-baseline measurements for subject i, and let h(X_i, Z_i, β) be the vector of all predictions for subject i from the working model. Assume the working model parameters, β, are estimated from the following weighted least squares estimating equation:

Σ_i D_i^T V^{−1} (Y_i − h(X_i, Z_i, β)) = 0,   (1)

where D_i is the design matrix for subject i, and V is a weight matrix of the form

V = ( A  B  ...  B
      B  A  ...  B
      ...
      B  B  ...  A ),   (2)

where A and B are T × T matrices, i.e., with A in the diagonal blocks and B in all off-diagonal blocks. If A and A − B are non-singular, then μ̂_2t(z) = μ̂_3t(z).

Proof. The proof is provided in Appendix A.
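A weight matrix of the block form (2) can be assembled as follows (illustrative only; the Kronecker-product construction is our own implementation choice, not taken from the paper):

```python
import numpy as np

def block_weight(A, B, P):
    """Build a P*T x P*T weight matrix with A (T x T) in the diagonal blocks
    and B (T x T) in all off-diagonal blocks: the form required by Theorem 1.

    kron(I_P, A - B) contributes A - B on the diagonal blocks only, and
    kron(J_P, B) contributes B everywhere, so the diagonal blocks sum to A."""
    return np.kron(np.eye(P), A - B) + np.kron(np.ones((P, P)), B)
```

When A and A − B are non-singular, V is invertible, and (as used in the proof of Theorem 1 via the Woodbury identity) the inverse again has equal diagonal blocks and equal off-diagonal blocks.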
We know that μ̂_3t(z) is unbiased by construction. Hence, Theorem 1 implies unbiased estimation when using μ̂_2t(z) with a working model satisfying the conditions in Theorem 1, regardless of the ability of the working model to capture the true data generating mechanism.

Previously, the optimal choice of model for TQT studies has been debated extensively, based on the premise that such a model should provide a valid fit to the data. Theorem 1 implies the existence of a whole range of working models that may be used to provide unbiased estimates of causal treatment effects without such considerations.

In this context, it is also worthwhile to note that many of the models that are traditionally applied in TQT studies satisfy the conditions in Theorem 1. We summarize this in the following corollary.
Corollary 1 (Gaussian linear mixed models). Assume the working model is a Gaussian linear mixed model with main effects of treatment specific to each post-baseline time point, and a correlation structure satisfying (2). Then μ̂_2t(z) = μ̂_3t(z) if the model parameters are estimated by maximum likelihood or restricted maximum likelihood estimation.

Proof. The estimating equation for Gaussian linear mixed models is given on page 10 of [12] and can be rewritten to (1).
When the working model is a Gaussian linear mixed model, the requirement (2) corresponds to modelling the dependence within periods by some matrix A, and the dependence between periods by a matrix B. For example, B might be a matrix of constants corresponding to a random subject effect, and A can be modelled more flexibly, for example with an unstructured covariance structure, or an AR(1) covariance structure.
In the special case of a Gaussian linear mixed model, consider A = σ²I and B a matrix of zeros. In this case, the working model is a standard linear regression model that ignores any dependence between observations. Corollary 1 then ensures that we are able to produce sound inference even with this simplistic model. We do, however, expect this working model to be less efficient than if we model the dependence structure in a Gaussian linear mixed model.
A particular example of a much used working regression model, here referred to as model (3), is given in [21]. Clearly the conditions of Theorem 1 are satisfied for the systematic part of this model, as it includes a main effect of treatment specific to each post-baseline time point. Additionally, the covariance structure proposed in [21] is AR(1) within periods, with a constant covariance between observations on the same subject in different periods. Accordingly, the proposed covariance structure complies with (2). The estimates of the time-specific treatment effects β_zt equal the estimates obtained if we were to plug model (3) into μ̂_2t(z). Thus, the estimates of β_zt are unbiased for the treatment effects of interest under arbitrary model misspecification.
In order to fully enable inference, we further need to characterize the large sample behaviour. This is well established if the targeted treatment effects appear as parameters in the model, and assuming that the model is correctly specified. However, in more complex models, the target treatment effect is not readily identified as a parameter specified in the model. Moreover, model based standard errors may not be appropriate unless the model is correctly specified. The influence function of μ̂_2t(z) is derived in Appendix B under the assumption that the β parameters are estimated using an M-estimator. For models not covered by Theorem 1, μ̂_3t(z) is still unbiased whereas μ̂_2t(z) may be biased. Therefore, it would be preferable to use μ̂_3t(z) in such cases. Accordingly, we also derive the influence function for μ̂_3t(z) in Appendix C.
One particular use of the above asymptotic results is when assessing QT prolongation in a TQT trial. In this context, the test for QTc prolongation for the drug of interest is carried out by use of an intersection-union test. I.e., the null-hypothesis is

H_0: µ_t(z) ≥ ∆ for some t ∈ {1, ..., T},

where µ_t(z) = E(Y_1t^z − Y_1t^0), and ∆ is some reasonable amount of QT prolongation, such as 10 ms [21]. Loosely speaking, the null-hypothesis dictates that there exists a time point where QT prolongation exceeds a prespecified clinically negligible threshold. Tests are carried out for each t based on the asymptotic behaviour of the standardized estimates of µ_t(z), and the null is rejected if all of these tests are rejected.
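The intersection-union decision rule can be sketched as follows (hypothetical inputs: `upper_limits` holds the upper confidence limit of the estimated effect at each post-dose time point, and `delta` is the prolongation margin, e.g. 10 ms):

```python
def iu_reject(upper_limits, delta=10.0):
    """Intersection-union decision: reject the null of relevant QT
    prolongation only if the upper confidence limit falls below the
    margin delta at every post-dose time point."""
    return all(u < delta for u in upper_limits)
```

The rule makes explicit why the overall test rejects only when every time-point-specific test rejects: a single time point at or above the margin retains the null.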
In addition, it is common practice to assess if prolongation can be detected for the positive control. The corresponding null-hypothesis states that the positive control does not prolong the QT interval at any time point. The test of this hypothesis needs to be adjusted for multiple testing. This adjustment should be made in the most efficient way possible, which is possible because we can estimate the joint (asymptotic) distribution of our estimators [23,10]. We can derive the joint asymptotic distribution of our estimators from the influence functions, ϕ_t(d_i), t = 1, ..., T, of the individual estimators. It follows that the asymptotic variance matrix of the joint distribution of our estimators equals

Σ = E(ϕ(d_i) ϕ(d_i)^T),   (4)

where ϕ(d_i) = (ϕ_1(d_i), ..., ϕ_T(d_i))^T.

The estimation of targeted treatment effects as outlined above is based on well-known regression models, and as we have seen, the targeted treatment effects may sometimes even be identified directly as regression parameters in those models. Theorem 1 then guarantees that the identified regression parameters are estimated without bias, irrespective of whether the regression model is correctly specified or not. This guarantee, however, does not extend to the model-based variance matrix of the estimates. For this to be appropriate, one also needs to assume that the regression model is correct. The asymptotic variance matrix (4), on the other hand, is generally applicable. When the targeted treatment effects are identified as regression parameters, it may even be obtained by standard software [24].
The asymptotic theory developed is applicable as is when the sample size is large. However, TQT trials and many cross-over trials have a rather small sample size, in which case the appropriateness of the asymptotic theory is questionable. For such cases, there is a substantial literature on how to improve asymptotic standard errors and confidence intervals [2,3,16,25]. One simple improvement of the asymptotic standard errors from [3] is to use the influence function in the same way as we would in the context of a linear normal model, i.e., to estimate the variance by (1/(n(n − 1))) Σ_i φ̂_i², and to use the 97.5% quantile of the t-distribution with n − 1 degrees of freedom instead of a standard normal distribution to construct confidence intervals. These modifications vanish as the sample size increases, and will consequently lead to correct asymptotic inference. These simple small sample modifications are used in the analyses and simulations throughout this paper.
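This small-sample adjustment can be sketched as follows (our own helper, not taken from the paper: `phi` holds the estimated influence-function values φ̂_i, the caller supplies the 97.5% t-quantile with n − 1 degrees of freedom, e.g. from `scipy.stats.t.ppf(0.975, n - 1)`, and the 1/(n(n − 1)) variance scaling is our assumption, mirroring the variance estimate for a sample mean in a linear normal model):

```python
import math

def small_sample_ci(estimate, phi, t_quant):
    """Confidence interval: estimate +/- t_{0.975, n-1} * se, where the
    standard error is computed from the squared estimated influence-function
    values (assumed scaling: the estimator behaves like a sample mean of the
    mean-zero terms phi_i)."""
    n = len(phi)
    se = math.sqrt(sum(p * p for p in phi) / (n * (n - 1)))
    return estimate - t_quant * se, estimate + t_quant * se
```

Using the t-quantile instead of 1.96 widens the interval for small n; the difference vanishes as n grows, matching the asymptotic theory.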

Data example
To illustrate the developments in this paper, we reanalyse a standard TQT trial also analysed in Chapter 9 of [21]. The data set is freely available on the book website, and consists of 41 subjects, two of whom we have excluded due to missingness. There are three single-dose treatments (C, D, E) and a placebo (F). Treatment E is included as a positive control, i.e., treatment E is known to mildly prolong the QT interval. The subjects' QT intervals are measured in triplicate at baseline and at 0.5, 1, 1.5, 2.5, and 4 hours post treatment. The triplicates are averaged at each time point.
In accordance with the developments presented in Section 3, we analyse the data with a range of models of differing complexity that, in theory, facilitate unbiased estimates of the average causal effects under Assumption 2. We informally compare these models in terms of obtained estimates and standard errors.
Our benchmark model corresponds to the recommendation made in [15]. This paper advocates a regression model including average baseline measurements as a covariate. It is shown in [15] that this approach is consistent with the joint baseline and post-baseline measurement model advocated in [13,17]. It is further argued in [15] that the resulting estimates of the treatment effects will be superior in terms of precision.
Specifically, in the working model proposed in [15], both the period-specific baseline measurement and the average baseline measurement, X̄, enter as covariates, and the effects of the baseline measurements and the average baseline measurement differ between time points. Moreover, we fit a simpler model without the average baseline measurement, but still with an interaction between the effect of baseline and time point. Furthermore, we fit an even simpler model without the average baseline measurement, without an interaction between time point and period, and with the same effect of the baseline measurement at all time points. All the models above have the treatment effects as specific parameters. In order to illustrate the modelling flexibility facilitated by Theorem 1, we also fit a model with an interaction between baseline and treatment. Note that with the complexity of this last model, it is no longer possible to identify the average causal effect as a parameter in the regression model. Therefore, we can no longer rely on standard inference for regression models, but need to rely on the general inference procedures developed in Section 3.
On top of specifying a working regression structure for the models, we also need to specify the working covariance structure. The working covariance structure has to be of the form (2), and in the following we will use a random subject effect, corresponding to B being a matrix of constants, unless otherwise specified. We consider three different specifications of the A matrix:

1. Unspecified: all the variances and covariances have to be estimated.

2. AR(1): the variance matrix is of the form A_st = σ² ρ^{|s − t|}, s, t = 1, ..., T, where ρ and σ² are parameters to be estimated.
3. Independence: A = σ²I, where I is the identity matrix. In this case, B will be a matrix of zeros, corresponding to a standard linear regression model.
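For illustration, the within-period AR(1) matrix can be constructed as follows (a small helper of our own, not taken from the paper):

```python
import numpy as np

def ar1_cov(T, sigma2, rho):
    """AR(1) covariance matrix: entry (s, t) equals sigma^2 * rho^{|s - t|},
    so correlation decays geometrically with the lag between time points."""
    idx = np.arange(T)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
```

Setting rho to zero recovers the independence structure, illustrating how the three specifications nest.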
Last, we also compute the non-parametric estimator μ̂_1t(z). The estimates of the effect of treatment E compared to placebo at post-baseline time 4 are displayed in Table 1. Note that the standard errors and confidence intervals are based on the small sample size adjustment discussed in the last section. The remaining estimates of treatment effects are presented in Appendix D. The different models result in similar treatment effects, as expected. In practice, the main reason for choosing one model over another is therefore to have as much efficiency as possible, and not because we expect to actually know the true data-generating mechanism. From Table 1, the estimates seem comparable. The standard errors are substantially larger with the non-parametric approach, whereas the standard errors of all the model-based approaches are comparable. We investigate these observations further in a simulation study mimicking the data example in the next section.

Simulation study
Simulation studies have already demonstrated that it is theoretically possible to gain precision by including the average baseline measurement as a covariate [15,17]. However, it is unclear how much the addition of the average baseline covariates matters for the precision in realistic setups. Therefore, we have based our simulation on the data set from the previous section. Specifically, we have simulated the data as follows: a joint normal distribution is fitted to the baseline measurements in the data set, and baseline measurements are simulated according to this fit. The average baseline model from the previous section is fitted to the data set, and the post-baseline measurements are simulated from that model.
There are at least two advantages to this approach: first, the simulation must be considered realistic, since it is based on parameters estimated on a data set from a real TQT trial. Second, we know that models ignoring the average baseline measurements are too simple to capture the true data-generating mechanism. Thereby, the simulations can show us how much precision we can expect to lose by not using average baseline measurements as covariates in real TQT studies.
We compare all the models from the last section. The models are misspecified both in terms of mean structure and in terms of correlation structure, but are all unbiased by Corollary 1. We ran 10000 simulations in the statistical program R [26], and the code is available at https://github.com/Jeepen/TQTpaper. The results of the simulations are displayed in Table 2. The standard error estimates seem approximately correct compared to the sample standard deviation of the estimates, and the coverages of the confidence intervals are consequently close to 95%. The standard deviations in the table show that by far the majority of the gain in precision from using a model comes from the inclusion of the baseline measurements. Using the average baseline measurement as a covariate adds little precision, even over the linear regression model. However, the gain in precision from including the average baseline as a covariate and using a more flexible covariance structure than independence comes for free, since Corollary 1 implies unbiased estimation in any case.

Discussion
We have introduced causal reasoning to the field of TQT studies. We have shown how typical choices of estimators can be given very clear causal interpretations in terms of the data, and not just in terms of specific models. Furthermore, we have shown that popular choices of working models in many circumstances yield unbiased estimates of causal parameters under arbitrary model misspecification. We have illustrated these results in a data example and a simulation study. However, the unbiasedness of the proposed estimators follows from the balancing induced by randomization. In practice, we may have missing data, which can invalidate the balancing initially ensured by the randomization. That said, the amount of missing data is typically negligible in TQT studies, owing to the fact that they are most often conducted on healthy subjects who are paid to participate [6]. In cases with a non-negligible amount of missing data, the working regression models can be fitted to the observed data by maximum likelihood, assuming missingness at random (MAR) and distinct parameters for the missing data mechanism and for the outcome. As a consequence, it is straightforward to estimate the causal effects when using a linear mixed model, where the causal effects are identified as the main effects of treatment. However, this becomes challenging if we consider more complex models. In that case, a viable strategy could be to use imputation or weighting methods in order to go from model to estimates of the causal effects [14,32].
We have focused on how to estimate causal effects under Assumption 2, that is, assuming the effects are the same in all periods. To complement these developments, it is interesting to consider what we are estimating if that assumption does not hold. In general, the mean of μ̂_1t(z) equals

(1/P) Σ_{p=1}^{P} E(Y_pt^z − Y_pt^0),

i.e., the average of the causal effects for each period. μ̂_3t(z) has the same mean due to the randomization, and μ̂_2t(z) will equal μ̂_3t(z) when estimation is done according to Theorem 1. Thus, in general, the above strategy will lead to unbiased estimation of the average of the period-specific causal effects. Alternatively, one may fit a working model to all data under Assumption 4, and subsequently apply G-computation for a single period:

(1/n) Σ_i (h_1t(X_i1, z, β̂) − h_1t(X_i1, 0, β̂)).   (5)

The estimation of h gains precision by using data from all periods, which in turn makes (5) more precise. The estimator (5) emulates a standard one-period trial, while it is unclear what we are emulating by ignoring the period-specific treatment effects [8]. This is a topic for further research. Two other theoretical issues also deserve more attention. First, we have not theoretically shown that μ̂_2t(z), or for that matter μ̂_3t(z), are in fact more efficient than the non-parametric estimator μ̂_1t(z). We, however, suspect this to be the case, in line with the results obtained for one-period trials in [1,33]. Second, the impact of a violation of the restrictions on the working model dictated by Theorem 1 deserves further investigation. Possibly, a more flexible weight matrix than what is warranted by Theorem 1 may lead to further efficiency gains.
Finally, we would like to point out the difference between the estimation procedure proposed in this paper and the traditional approach of reporting treatment effects based on differences in least squares means (see, for example, chapter 8 of [21]). It is duly noted that differences in least squares means are equivalent to μ2t (z) when the treatment effects are modelled as main effects in a linear mixed model. However, they are not equivalent to μ2t (z) when the model is more complex. In those cases, least squares means lack a proper causal interpretation in a meaningful population, and on those grounds μ2t (z) or μ3t (z) should be preferred for assessing causal treatment effects.
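The equivalence in the main-effect case can be illustrated with a small numerical check (hypothetical data; an ordinary linear model stands in for the mixed model, since the algebraic point is the same): when treatment enters the working model only as a main effect, the g-computation contrast reproduces the fitted treatment coefficient exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 40
df = pd.DataFrame({
    "trt": rng.integers(0, 2, n),          # binary treatment indicator
    "baseline": rng.normal(400, 10, n),    # baseline QTc
})
df["qtc"] = df["baseline"] + 4.0 * df["trt"] + rng.normal(0, 5, n)

# Linear working model with treatment as a main effect only.
fit = smf.ols("qtc ~ trt + baseline", data=df).fit()

# g-computation contrast: predict everyone under trt=1 and trt=0, average.
g_contrast = (fit.predict(df.assign(trt=1)) - fit.predict(df.assign(trt=0))).mean()

# Identical to the main effect of treatment, i.e., the LS-means difference here.
print(g_contrast, fit.params["trt"])
```

Once treatment interacts with covariates, the coefficient no longer summarizes the effect and the two quantities diverge, which is the situation where the g-computation estimators retain their causal interpretation.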

Appendix A: Proof of Theorem 1
We base the proof on a rewriting of the estimator μ3t (z). The last step of the rewriting shows that μ3t (z) is equal to μ2t (z) plus an adjustment term, i.e., μ2t (z) = μ3t (z) if this last term is zero. We assume the β-parameters are estimated from the weighted least squares estimating equation (6),
where V is of the form (2). The inverse of such a matrix is again of the form (7), where C and D are matrices of the same dimensions as A and B, namely T × T. This follows from the Woodbury identity (Theorem 18.2.8 in [7]), which states that if X is nonsingular, then X + Y is nonsingular if and only if I + X^{-1}Y is nonsingular, and in that case (X + Y)^{-1} = X^{-1} − X^{-1}Y(I + X^{-1}Y)^{-1}X^{-1}. For our purposes, we choose X and Y such that X^{-1} is block diagonal with (A − B)^{-1} in the diagonal blocks; this is what makes (7) follow from the Woodbury identity. Moreover, the inverse of (I + X^{-1}Y) is then a product EF, which can be checked by multiplying the matrices and verifying that the product is the identity (the calculation is easier after substituting K = (A − B)^{-1}B). Then X^{-1}Y(I + X^{-1}Y)^{-1}X^{-1} is a block matrix in which all blocks are equal. The Woodbury identity thus gives a block diagonal matrix minus a block matrix with all blocks equal, which is of the form (7).
For convenience, we introduce some additional notation. There is an equation in (6) for each column of the design matrix, i.e., one equation for each β-parameter. As it turns out, however, only some of the equations are needed to obtain μ2t (z) = μ3t (z). We re-parametrize the model so that, instead of one β-parameter per combination of treatment and time point, we have main effects of the time points and main effects of each non-placebo treatment per time point; that is, placebo becomes the reference level. First, we need the equations arising from the effect of time point in the model. Looking specifically at the estimating equation coming from the effect of time point t0 ∈ {1, . . . , T}, the inverse variance matrix leads to all residuals, from all time points, periods, and subjects, being included in the estimating equation (8) for time point t0. However, (8) constitutes T linear equations, one for each t0, in T unknowns, all equalling zero. Hence, we can conclude (9). The second set of estimating equations we need arises from the main effects of treatment per time point in the model, namely the estimating equation corresponding to the effect of treatment z at time point t0. Here, (*) comes from the fact that each subject receives each treatment exactly once, so that Σ_{q≠p} I(Ziq = z) = 1 − I(Zip = z), and (**) comes from (9); the rest is just interchanging the order of summation. Again, we have T equations in T unknowns and can conclude as wanted. Note that the above argument only works for treatments other than placebo, since placebo is the reference treatment. However, we can rewrite (9), where the last equality comes from (10). Thus, the extra term in μ3t (z) is zero, and μ3t (z) = μ2t (z). Adding other covariates or terms to the regression does not change this fact, as long as we retain main effects of treatment specific to each post-baseline time point.
Appendix B: Influence function for μ2t (z)
Suppose the estimation in the first stage solves equation (11), where di denotes all the information we have about subject i. This would, for example, be the case if we used maximum likelihood estimation, in which case m would be the score function. Denote the solution to (11) by βn, and the limit in probability of βn by β0. Then the influence function of μ2t (z) is found using Theorem 6.1 of [18]:
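A minimal sketch of the resulting standard error computation, for a simplified one-period setting with an OLS first stage (hypothetical data; the correction term is the general plug-in form with m the OLS estimating function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Hypothetical one-period data: treatment z, baseline covariate x, outcome y.
z = rng.integers(0, 2, n)
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * z + 1.5 * x + rng.normal(0, 1, n)

# First stage: OLS working model y ~ 1 + z + x, i.e., the estimating
# function is m(d_i; beta) = x_i (y_i - x_i' beta).
X = np.column_stack([np.ones(n), z, x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# g-computation for E[Y(1)]: set z = 1 for everyone and average predictions.
X1 = np.column_stack([np.ones(n), np.ones(n), x])
mu1 = (X1 @ beta).mean()

# Influence function: plug-in term plus a correction for estimating beta,
# IF_i = (h(1, x_i; beta) - mu1) + (dmu/dbeta)' E[-dm/dbeta]^{-1} m(d_i; beta).
m_i = X * (y - X @ beta)[:, None]      # estimating function per subject
bread = np.linalg.inv(X.T @ X / n)     # E[-dm/dbeta]^{-1} (empirical)
dmu = X1.mean(axis=0)                  # derivative of mu1 w.r.t. beta
IF = (X1 @ beta - mu1) + m_i @ bread @ dmu
se = IF.std(ddof=1) / np.sqrt(n)
print(mu1, se)
```

In the crossover setting, the same recipe applies with m replaced by the score of the working mixed model and h by the fitted conditional mean, with sums taken over subjects rather than single observations.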

Table 1: Estimates and standard errors for data example.

Table 2: Bias, standard deviation of estimates, and coverage of confidence intervals in simulations. Standard errors and confidence interval coverages are again based on an adjustment for small sample size; as expected, all estimators have negligible bias.

Table 3: Estimates in data example.

Table 4: Standard error estimates in data example.

Table 7: Average standard error estimates in simulations.

Table 8: Coverage frequencies of confidence intervals in simulations.