Discussion on “Instrumented difference‐in‐differences” by Ting Ye, Ashkan Ertefaie, James Flory, Sean Hennessy & Dylan S. Small

We reinterpret the instrumented difference-in-differences (iDID) under a linear instrumental variables (IV) model. Under the linear IV model, we show why iDID is a clear improvement over two existing methods, difference-in-differences (DID) and a cross-sectional IV analysis. We also re-express some of the assumptions of iDID using familiar, regression-based identification assumptions. We conclude with a method inspired by the linear IV model that can potentially remedy the weak identification problem in iDID.


INTRODUCTION
I want to congratulate the authors on their fantastic work combining two well-known methods in causal inference, difference-in-differences (DID) and instrumental variables (IV), to study causal exposure effects in repeated, cross-sectional observational studies. By combining the strengths of each method, the proposed instrumented DID (iDID) is more robust to violations of DID's parallel trend assumption due to an unmeasured confounder and to violations of IV's exclusion restriction; in particular, iDID can use an instrument that has a direct effect on the outcome. The goal of this paper is to reinterpret this promising method under a simple but popular model in econometrics and statistics, a linear IV model. Linear IV models (or linear models in general) have been the workhorse of applied statistics and economics for data analysis. Also, in non-applied work, linear IV models have been used to build theoretical insights and create more robust, less parametric methods. In fact, most textbook introductions to IV or DID in econometrics use linear models as a "reference" model to ground key ideas and discuss more complex topics (e.g., Chap. 5 of Angrist & Pischke (2008) or Chap. 5 of Wooldridge (2010)). By using a linear IV model, I wish to provide an alternative explanation of the authors' fantastic method that is (hopefully) more familiar, simple, and accessible.

(This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2022 The Authors. Biometrics published by Wiley Periodicals LLC on behalf of the International Biometric Society.)
The paper will primarily focus on three aspects of iDID under the linear IV model: (a) why iDID is a clear improvement over DID or a cross-sectional IV analysis; (b) how some of the identifying assumptions of iDID can be re-expressed using traditional, regression-based assumptions; (c) how to use insights from linear IV models to potentially mitigate the weak identification problem discussed in Section 5 of the authors' work.
Of course, according to George Box's famous aphorism, all models are wrong, and the linear IV model used in this paper, while based on the authors' results (see Section 2.1), is no exception. However, I hope the model is still useful, especially for investigators contemplating using iDID in their observational studies.

Review of Section 5 and setup
We first review Section 5 of the authors' work, where the connection between iDID and a linear IV model is hinted at by a result concerning the properties of one of the authors' proposed estimators, $\hat{\delta}_{\text{wald}}$. Formally, in the absence of covariates, consider the following model for individual $i$'s observed data $O_i = (Y_i, D_i, Z_i, T_i)$ where, identical to the authors' notation, $Y_i$ is a real-valued outcome, $D_i$ is a binary exposure, $Z_i$ is a binary instrument, and $T_i$ is a binary time indicator:

$$Y_i = \beta_{\text{int}} + \beta_D D_i + \beta_Z Z_i + \beta_T T_i + \epsilon_i, \qquad E[\epsilon_i \mid Z_i, T_i] = 0. \qquad (1)$$

The terms $\beta_{\text{int}}, \beta_D, \beta_Z, \beta_T$ are unknown parameters of the model and the term $\epsilon_i$ is a random error term that has mean zero given the regressors $Z_i$ and $T_i$, but not the regressor $D_i$. Using econometrics terminology, $Z_i$ and $T_i$ are exogenous regressors (i.e., independent of the error term) and $D_i$ is an endogenous regressor (i.e., dependent on the error term). The authors showed that the well-known two-stage least squares (2SLS) estimator of $\beta_D$ with an "interacted" instrument $Z_i T_i$ is numerically equivalent to one of their proposed estimators, $\hat{\delta}_{\text{wald}}$. That is, in the first step, we regress $D_i$ on the intercept, $Z_i$, $T_i$, and $Z_i T_i$ and obtain the predicted value of $D_i$, denoted as $\hat{D}_i$; note that the regression in the first step must be linear. In the second step, we regress $Y_i$ on the intercept, $Z_i$, $T_i$, and $\hat{D}_i$, and the authors showed that the estimated coefficient for the regressor $\hat{D}_i$ (i.e., the 2SLS estimator of $\beta_D$) is numerically equivalent to $\hat{\delta}_{\text{wald}}$.
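To make the equivalence concrete, here is a small numerical sketch on simulated data (my own illustration, not the authors' code; the data-generating values, such as a true exposure effect of 2, are arbitrary). It computes the 2SLS estimator with the interacted instrument $Z_i T_i$ and the Wald-type iDID estimator, a ratio of two difference-in-differences, and the two agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated data loosely in the spirit of model (1): binary Z and T,
# and an unmeasured confounder U that makes the exposure D endogenous.
Z = rng.integers(0, 2, n)
T = rng.integers(0, 2, n)
U = rng.normal(size=n)
D = (1.0 * Z * T + U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + 2.0 * D + 0.5 * Z + 0.7 * T + U + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS with the "interacted" instrument Z*T:
# first stage: D on (1, Z, T, Z*T); second stage: Y on (1, Z, T, D_hat)
X1 = np.column_stack([np.ones(n), Z, T, Z * T])
D_hat = X1 @ ols(X1, D)
X2 = np.column_stack([np.ones(n), Z, T, D_hat])
beta_D_2sls = ols(X2, Y)[3]

# Wald-type iDID estimator: the DID of Y with respect to Z over time,
# divided by the corresponding DID of D
def did(V):
    m = lambda z, t: V[(Z == z) & (T == t)].mean()
    return (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

delta_wald = did(Y) / did(D)
```

Because $Z_i T_i$ is independent of the confounder $U_i$ in this simulation, the common estimate is also close to the true effect of 2, mirroring the consistency discussed below.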
We make a few remarks about model (1) that may be useful for extending iDID to other data types. First, if covariates $X_i$ are present, we can incorporate them in model (1), say by adding a linear term $\beta_X^\top X_i$, where $\beta_X$ is another unknown parameter whose dimension is equal to the dimension of $X_i$. But this introduces additional modeling assumptions about the relationship between $X_i$ and $Y_i$. Second, if there are multiple time points or if the instrument is non-binary, model (1) provides one simple starting point to extend iDID, especially its estimation framework. For example, for multiple time points, investigators can represent time as fixed effects in model (1) and use one (or several) interacted instruments between $Z_i$ and each level of $T_i$. For a non-binary instrument, model (1) can be used as-is or can be modified to reflect the instrument's potentially nonlinear effect on the outcome.
Taking inspiration from the authors' numerical equivalence result, the rest of the paper will assume that model (1) is the true model for the observed data. But, as forewarned in Section 1, if the model is misspecified, the discussion below may be dangerously misleading and readers should consult the authors' work, which does not rely on a parametric model.

Advantages of instrumented difference-in-differences versus difference-in-differences or instrumental variables with model (1)
Taking a step back from the authors' result on the equivalence between $\hat{\delta}_{\text{wald}}$ and the 2SLS estimator of $\beta_D$, the structure of the linear model (1) already reveals some of the advantages of iDID compared to DID or a cross-sectional IV analysis. For example, under the usual DID setup without an instrument (e.g., Chapter 5.2 of Angrist & Pischke 2008), the conditional mean of the error term given the exposure and the time indicator would be zero (i.e., $E[\epsilon_i \mid D_i, T_i] = 0$) and consequently, the parallel trend assumption would hold; in other words, the usual DID setup assumes that the exposure is exogenous. Instead, iDID allows the exposure to be endogenous, so the parallel trend assumption may be violated due to an unmeasured confounder that affects both the exposure and the outcome.
To better illustrate this point, consider a simple, hypothetical setup where we evaluate the usual DID estimator (denoted as $\hat{\delta}_{\text{DID}}$ and defined below) under model (1) where the exposure effect is zero (i.e., $\beta_D = 0$) and the instrument satisfies the exclusion restriction (i.e., the instrument has no direct effect on the outcome so that $\beta_Z = 0$). This exercise mimics an investigator who may initially run a DID analysis and assume that the exposure is exogenous, even though in reality, the exposure is endogenous due to unmeasured confounding. After some algebra, we get:

$$\hat{\delta}_{\text{DID}} = \{\bar{Y}_{D=1,T=1} - \bar{Y}_{D=0,T=1}\} - \{\bar{Y}_{D=1,T=0} - \bar{Y}_{D=0,T=0}\} \to \Delta_{T=1} - \Delta_{T=0}, \qquad \Delta_{T=t} = E[\epsilon_i \mid D_i = 1, T_i = t] - E[\epsilon_i \mid D_i = 0, T_i = t],$$

where $\bar{Y}_{D=d,T=t}$ denotes the sample mean of the outcome among individuals with $D_i = d$ and $T_i = t$. The right arrow above represents the probability limit as the sample size goes to infinity and the limiting value is derived by using the law of large numbers. Roughly speaking, the term $\Delta_{T=1}$ represents the effect of unmeasured confounding at time $T = 1$ and $\Delta_{T=0}$ represents the effect of unmeasured confounding at time $T = 0$. If there is no unmeasured confounding at each time point and there are no covariates, the exposure is effectively randomly assigned to everyone at each time point, akin to running a randomized experiment at each time point, and the means of the error terms between the exposed (i.e., $D_i = 1$) and the unexposed (i.e., $D_i = 0$) groups would be the same, leading to $\Delta_{T=1} = 0$ and $\Delta_{T=0} = 0$. In other words, the usual DID estimator $\hat{\delta}_{\text{DID}}$ will converge to 0, as expected from this hypothetical setup. More generally, if the effects from unmeasured confounders are "identical" in magnitude at each time point so that $\Delta_{T=1} = \Delta_{T=0}$, the DID estimator will still converge to 0; note that the parallel trend assumption implies $\Delta_{T=1} = \Delta_{T=0}$. However, if unmeasured confounders have different effects across time so that $\Delta_{T=1} \neq \Delta_{T=0}$, the parallel trend assumption is violated and $\hat{\delta}_{\text{DID}}$ no longer converges to 0.
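This limiting behavior can be checked by simulation. In the sketch below (my own illustration with arbitrary parameter values), the true exposure effect is zero, but an unmeasured confounder $U$ affects the outcome more strongly at $T = 1$ than at $T = 0$, so $\Delta_{T=1} \neq \Delta_{T=0}$ and the DID estimate settles far from zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# True exposure effect is zero, but the unmeasured confounder U affects
# the outcome more strongly at T = 1 than at T = 0, so the parallel
# trend assumption fails (Delta_{T=1} != Delta_{T=0}).
T = rng.integers(0, 2, n)
U = rng.normal(size=n)
D = (U + rng.normal(size=n) > 0).astype(float)          # exposure driven by U
Y = 1.0 + 0.7 * T + (1.0 + T) * U + rng.normal(size=n)  # no effect of D on Y

m = lambda d, t: Y[(D == d) & (T == t)].mean()
delta_did = (m(1, 1) - m(0, 1)) - (m(1, 0) - m(0, 0))
# delta_did estimates Delta_{T=1} - Delta_{T=0}, which is nonzero here,
# so DID falsely reports an effect even though beta_D = 0
```

In this design the confounding gap roughly doubles from $T = 0$ to $T = 1$, so the DID estimate lands near 1 rather than near the true effect of 0.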
Also, compared to a standard, cross-sectional IV analysis, iDID allows an instrument to violate the exclusion restriction. This can be clearly seen in model (1) where, after fixing a particular time point $T_i = t$, the instrument can have a non-zero direct effect on the outcome through the term $\beta_Z$. Also, if an investigator naively computes the usual Wald estimator in IV at $T = 0$ (denoted as $\hat{\delta}_{\text{wald},T=0}$ and defined below), the Wald estimator would evaluate to the following under model (1):

$$\hat{\delta}_{\text{wald},T=0} = \frac{\bar{Y}_{Z=1,T=0} - \bar{Y}_{Z=0,T=0}}{\bar{D}_{Z=1,T=0} - \bar{D}_{Z=0,T=0}} \to \beta_D + \frac{\beta_Z}{E[D_i \mid Z_i = 1, T_i = 0] - E[D_i \mid Z_i = 0, T_i = 0]},$$

where $\bar{Y}_{Z=z,T=0}$ and $\bar{D}_{Z=z,T=0}$ are sample means among individuals with $Z_i = z$ and $T_i = 0$. The estimator $\hat{\delta}_{\text{wald},T=0}$ is inconsistent for $\beta_D$ unless the instrument satisfies the exclusion restriction by setting $\beta_Z = 0$. Alternatively, by having one additional sample at $T = 1$, a time-invariant instrument, and other assumptions stated in the authors' work, we can remove the bias arising from violating the exclusion restriction and consistently estimate $\beta_D$. Note that this is not the only way to consistently estimate $\beta_D$ when the exclusion restriction is violated; see Kang et al. (2016).
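Continuing the simulated sketch (again my own illustration with arbitrary parameter values, not the authors' code), the snippet below contrasts the two estimators when the exclusion restriction fails with $\beta_Z = 1$: the cross-sectional Wald estimator at $T = 0$ drifts toward $\beta_D + \beta_Z / \{E[D \mid Z=1, T=0] - E[D \mid Z=0, T=0]\}$, while the iDID Wald estimator differences out the direct effect and stays near $\beta_D$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

beta_D, beta_Z = 2.0, 1.0      # beta_Z != 0 violates the exclusion restriction
Z = rng.integers(0, 2, n)
T = rng.integers(0, 2, n)
U = rng.normal(size=n)         # unmeasured confounder
# the instrument shifts the exposure, and shifts the exposure's trend via Z*T
D = (0.5 * Z + 1.0 * Z * T + U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + beta_D * D + beta_Z * Z + 0.5 * T + U + rng.normal(size=n)

m = lambda V, z, t: V[(Z == z) & (T == t)].mean()

# Cross-sectional Wald estimator at T = 0: biased when beta_Z != 0
den = m(D, 1, 0) - m(D, 0, 0)
wald_t0 = (m(Y, 1, 0) - m(Y, 0, 0)) / den

# iDID Wald estimator: differencing over time removes the direct effect beta_Z
did = lambda V: (m(V, 1, 1) - m(V, 1, 0)) - (m(V, 0, 1) - m(V, 0, 0))
idid = did(Y) / did(D)
```

With these parameter choices the cross-sectional Wald estimate is several times larger than the true effect, while the iDID estimate is close to $\beta_D = 2$.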

Reinterpreting instrumented difference-in-differences assumptions with regression-based assumptions in linear instrumental variables models
We can also use the well-established identifying conditions for model parameters in linear IV models, specifically a necessary condition known as the order condition (see Chap. 5.2.1 of Wooldridge 2010), to reinterpret some of the identifying assumptions of iDID. To review, in linear IV models, the order condition roughly states that if the parameters in a linear IV model are identifiable, the number of instruments must be greater than or equal to the number of endogenous variables. In model (1), the order condition is satisfied because there is one instrument (i.e., $Z_i T_i$) and one endogenous variable (i.e., $D_i$). Now suppose there is an interaction term $D_i T_i$ between the exposure and the time indicator in model (1). If included, the interaction term would allow the effect of the exposure on the outcome (i.e., the exposure effect) to vary across time. But including the interaction term would violate the order condition because there are more endogenous variables (i.e., $D_i$ and $D_i T_i$) than instruments (i.e., $Z_i T_i$) and subsequently, the model parameters in model (1) are not identifiable. In the authors' work, Assumption (2d) is the "most relevant, nonparametric formulation" of this condition, where the exposure effect is assumed to be homogeneous across time; here, we put the phrase "most relevant, nonparametric formulation" in quotes because formally tying model (1), the order condition, and the authors' nonparametric, identifying assumptions implicitly requires other assumptions in the authors' work, notably Assumption 1; see Section 4.4 of Holland (1988) for an example.
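As a toy illustration of the order condition (my own sketch, not from the authors' work), the snippet below builds the instrument columns for model (1) and a structural design that adds the endogenous interaction $D_i T_i$. The moment matrix $W^\top X$ underlying 2SLS then has more structural columns than instrument rows, so its rank cannot reach the number of structural parameters and the 2SLS system has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
Z = rng.integers(0, 2, n)
T = rng.integers(0, 2, n)
D = rng.integers(0, 2, n).astype(float)   # placeholder exposure draws

# Instrument/exogenous columns: intercept, Z, T, and the single
# interacted instrument Z*T
W = np.column_stack([np.ones(n), Z, T, Z * T])

# Structural regressors once an exposure-time interaction D*T is added:
# two endogenous columns (D and D*T) but still only one excluded instrument
X = np.column_stack([np.ones(n), Z, T, D, D * T])

# 2SLS solves moment equations built from W'X; with 4 instrument columns
# and 5 structural parameters, the system is rank deficient
rank = np.linalg.matrix_rank(W.T @ X)
```

Here `rank` is at most 4, strictly less than the 5 structural parameters, which is the order-condition failure described above.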
Similarly, suppose there is an interaction term $D_i Z_i$ between the exposure and the instrument in model (1). If included, the interaction term would allow the exposure effect to vary between the encouraged (i.e., $Z_i = 1$) and the non-encouraged (i.e., $Z_i = 0$) groups. But, similar to the previous paragraph, including the interaction term would violate the order condition. Also, in the authors' work, Assumption (2b) is the most relevant, nonparametric expression of this condition, where the exposure effect is independent of the instrument $Z_i$. More generally, it is likely that most of the identifying assumptions of iDID are nonparametric extensions of the identification conditions for model parameters in a linear IV model.

A robust confidence interval under weak identification: the Anderson-Rubin interval
Finally, we can use a simple, well-known method associated with linear IV models to potentially address the weak identification problem in iDID. To review, under Assumption (2a) in the authors' work, iDID requires an instrument that, on average, changes the trend in the exposure. But when the instrument induces little to no change in the exposure's trend, the proposed point estimators may be biased and non-normal, a problem the authors refer to as the weak identification problem. The authors proposed a diagnostic for this problem based on the well-known F-test for instrument strength in linear IV models, where if the F-statistic is sufficiently large, the proposed estimators may be less prone to bias.
In a similar vein, we can use a method inspired by linear IV models, specifically the work by Anderson and Rubin (1949), to come up with a valid $1 - \alpha$ confidence interval that does not suffer from the weak identification problem. To motivate the confidence interval, suppose we want to test the null hypothesis that the coefficient $\beta_D$ in model (1) is some hypothesized value $\beta_{D,0}$, that is, $H_0: \beta_D = \beta_{D,0}$. After subtracting $\beta_{D,0} D_i$ from both sides of the equality in model (1), we get

$$Y_i - \beta_{D,0} D_i = \beta_{\text{int}} + (\beta_D - \beta_{D,0}) D_i + \beta_Z Z_i + \beta_T T_i + \epsilon_i. \qquad (2)$$

Now, consider the following "model" for the conditional distribution of $D_i$ given $Z_i$ and $T_i$:

$$E[D_i \mid Z_i, T_i] = \gamma_{\text{int}} + \gamma_Z Z_i + \gamma_T T_i + \gamma_{ZT} Z_i T_i, \qquad (3)$$

where the four terms $\gamma_{\text{int}}, \gamma_Z, \gamma_T, \gamma_{ZT}$ are unknown parameters. We put "model" in quotes because every conditional distribution of $D_i$ given the binary $Z_i$ and $T_i$ can be characterized by Equation (3); in short, unlike Equation (1), Equation (3) is saturated and places no restrictions on the observed data. Substituting Equation (3) into Equation (2) yields

$$Y_i - \beta_{D,0} D_i = \{\beta_{\text{int}} + (\beta_D - \beta_{D,0})\gamma_{\text{int}}\} + \{\beta_Z + (\beta_D - \beta_{D,0})\gamma_Z\} Z_i + \{\beta_T + (\beta_D - \beta_{D,0})\gamma_T\} T_i + \{(\beta_D - \beta_{D,0})\gamma_{ZT}\} Z_i T_i + e_i, \qquad (4)$$

where $e_i$ is an error term with mean zero given $Z_i$ and $T_i$. Note that Equation (4) is a linear regression model with an "adjusted" outcome $Y_i - \beta_{D,0} D_i$ and regressors $Z_i$, $T_i$, and $Z_i T_i$. Thus, we can use ordinary least squares (OLS) to arrive at consistent estimators and/or tests of the parameters in the curly brackets above. Also, under the null $H_0: \beta_D = \beta_{D,0}$, the coefficient in front of the interaction term $Z_i T_i$ in (4), denoted as $\pi_{ZT} = (\beta_D - \beta_{D,0})\gamma_{ZT}$, is zero, implying another null hypothesis $H_0: \pi_{ZT} = 0$. Critically, we can test the latter null hypothesis by using the usual (two-sided) t-test from the OLS estimate of $\pi_{ZT}$, and its null distribution does not depend on how strongly the instrument changes the trend in the exposure, that is, the term $\gamma_{ZT}$ in Equation (3).
The connection between testing $H_0: \beta_D = \beta_{D,0}$ and testing a regression coefficient is the basis for the Anderson-Rubin confidence interval for $\beta_D$. Formally, for a level $\alpha \in (0, 1)$, we can test the regression coefficient hypothesis $H_0: \pi_{ZT} = 0$ across different values of $\beta_{D,0}$ with the two-sided t-test from OLS regression and, by the duality between testing and confidence intervals, the accepted values of $\beta_{D,0}$ (i.e., the values of $\beta_{D,0}$ where $H_0: \pi_{ZT} = 0$ is accepted at level $\alpha$) form a two-sided $1 - \alpha$ confidence interval for $\beta_D$. These accepted values, denoted as $\mathcal{C}^{\text{AR}}_{1-\alpha}$, can be compactly expressed as the following set:

$$\mathcal{C}^{\text{AR}}_{1-\alpha} = \left\{ \beta_{D,0} : \frac{(\mathbf{Y} - \beta_{D,0}\mathbf{D})^\top P_{\mathbf{W}} (\mathbf{Y} - \beta_{D,0}\mathbf{D})}{\hat{\sigma}^2(\beta_{D,0})} \le \chi^2_{1-\alpha,1} \right\}, \qquad P_{\mathbf{W}} = \mathbf{W}(\mathbf{W}^\top \mathbf{W})^{-1}\mathbf{W}^\top,$$

where, using the authors' notation, $\mathbf{Y}^\top = (Y_1, \ldots, Y_n)$, $\mathbf{D}^\top = (D_1, \ldots, D_n)$, $\mathbf{W}$ is the $n$-dimensional vector of residuals from regressing $Z_i T_i$ on the intercept, $Z_i$, and $T_i$, and $\hat{\sigma}^2(\beta_{D,0})$ is the estimated error variance from the OLS regression in Equation (4). Also, $\chi^2_{1-\alpha,1}$ is the $1 - \alpha$ quantile of the chi-square distribution with one degree of freedom. One of the most appealing properties of $\mathcal{C}^{\text{AR}}_{1-\alpha}$ is that, compared to the Wald-based confidence interval in the authors' work, $\mathcal{C}^{\text{AR}}_{1-\alpha}$ will always have at least $1 - \alpha$ coverage irrespective of the instrument's association with the trend in the exposure. In the extreme case where Assumption (2a) is violated so that the target parameter is no longer point identified, $\mathcal{C}^{\text{AR}}_{1-\alpha}$ will still have coverage by elongating itself to cover the entire real line, that is, $\mathcal{C}^{\text{AR}}_{1-\alpha} = (-\infty, \infty)$. While an infinite confidence interval may initially be unappealing, it alerts investigators to the lack of point identifiability from the observed data. Also, Dufour (1997, p. 1377) showed that a valid $1 - \alpha$ confidence interval of $\beta_D$ must be unbounded with non-zero probability, and Moreira (2009, pp. 133 and 134) showed that under some assumptions, the test statistic underlying $\mathcal{C}^{\text{AR}}_{1-\alpha}$ is the uniformly most powerful unbiased test for $H_0: \beta_D = \beta_{D,0}$. Of course, there are no theoretical justifications for $\mathcal{C}^{\text{AR}}_{1-\alpha}$ outside of linear IV models. But $\mathcal{C}^{\text{AR}}_{1-\alpha}$ could be a promising starting point to address the weak identification problem in iDID.
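The grid-inversion construction above can be sketched in a few lines of numpy (my own illustration on simulated data; the chi-square critical value is hard-coded to avoid extra dependencies). For each candidate $\beta_{D,0}$ on a grid, we regress the adjusted outcome $Y_i - \beta_{D,0} D_i$ on $(1, Z_i, T_i, Z_i T_i)$, form the squared t-statistic for the interaction coefficient, and collect the candidates that are not rejected.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000
beta_D = 2.0
Z = rng.integers(0, 2, n)
T = rng.integers(0, 2, n)
U = rng.normal(size=n)
D = (1.0 * Z * T + U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + beta_D * D + 0.5 * Z + 0.3 * T + U + rng.normal(size=n)

# Design for the Anderson-Rubin regression: intercept, Z, T, Z*T
X = np.column_stack([np.ones(n), Z, T, Z * T])
XtX_inv = np.linalg.inv(X.T @ X)
Xt = X.T

def ar_stat(b0):
    """Squared t-statistic for the Z*T coefficient in the OLS regression
    of the adjusted outcome Y - b0*D on (1, Z, T, Z*T)."""
    y = Y - b0 * D
    coef = XtX_inv @ (Xt @ y)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * XtX_inv[3, 3])
    return (coef[3] / se) ** 2

chi2_95 = 3.841458820694124               # 0.95 quantile of chi^2 with 1 df
grid = np.linspace(-5.0, 10.0, 3001)
ar_ci = np.array([b for b in grid if ar_stat(b) <= chi2_95])
```

The accepted grid values approximate $\mathcal{C}^{\text{AR}}_{0.95}$: with a strong instrument the set is a bounded interval around the 2SLS estimate (where the AR statistic is exactly zero), and as the instrument weakens the set lengthens, eventually covering the whole grid.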

FINAL THOUGHTS
While the linear IV model (1) is undoubtedly too simple for real data and prone to misspecification, I hope the small exercise in the paper can provide another useful explanation of iDID. More broadly, for methodologists proposing new causal methods, especially those that are historically based on linear models, it may be meaningful to illustrate their new methods under linear models to increase accessibility and accelerate adoption in applied settings.

ACKNOWLEDGMENT
The research of Hyunseung Kang was supported in part by NSF Grant DMS-1811414.