Discussion on: Instrumented difference‐in‐differences, by Ting Ye, Ashkan Ertefaie, James Flory, Sean Hennessy and Dylan S. Small

I discuss the assumptions needed for identification of average treatment effects and local average treatment effects in instrumented difference‐in‐differences (IDID), and the possible trade‐offs between the assumptions of standard IV and those needed for the new IDID proposal, in one‐ and two‐sample settings. I also discuss the interpretation of the estimands identified under monotonicity. I conclude by suggesting possible extensions to the estimation method, outlining a strategy for data‐adaptive estimation of the nuisance parameters based on recent developments.

I congratulate the authors on their work, instrumented difference-in-differences (IDID), which extends difference-in-differences (DID) estimation to situations where there is unobserved confounding but, nevertheless, there is a valid instrumental variable (IV) for the exposure trend, in settings with a binary exposure and repeated cross-sectional data. After establishing identification of the average treatment effect (ATE) and local average treatment effects (LATE), the authors also provide us with several estimators, including a multiply robust estimator based on semiparametric theory, and prove that it is consistent and asymptotically normal under the usual regularity conditions.
Here, I discuss (i) the assumptions needed for identification and their plausibility, as well as possible trade-offs between standard IV and this new proposal and (ii) extensions to the estimation method based on recent developments.

ASSUMPTIONS AND INTERPRETATION OF THE ESTIMANDS
While the authors provide some insights on the interpretation of the necessary assumptions as well as their plausibility, I believe that the readers might benefit from a deeper discussion.
The authors state that the IDID method "relaxes" both the assumptions of standard IV and standard DID (i.e., parallel trends). I believe that it is more accurate to say that IDID replaces some aspects of the standard assumptions of each by adding supplementary assumptions drawn from the other method.
Let us take the standard IV identification assumptions as a starting point. IDID inherits the core IV assumptions, but applied to trends instead of a single time point, in the following sense. Instrument relevance (Assumption 2a) says that the instrument Z is associated with the exposure trend D_1 − D_0, while Assumption 2b, specifically its exclusion restriction (ER) part, states that the instrument Z and the outcome trend Y_1 − Y_0 are conditionally independent given X and D_1 − D_0 (as can be seen in the DAG included in Ye et al. 2022). This subtle change allows the IV to have a direct effect (not through D) on the outcome Y_t, thus violating the standard IV ER assumption. The trade-off is that, to satisfy this ER assumption on the trend, the IV will now need to satisfy (i) Y_t(z, d) − Y_t(z, 0) ∼ Y_t(z', d) − Y_t(z', 0) | X for t = 0, 1, and (ii) Y_1(z, 0) − Y_0(z, 0) ∼ Y_1(z', 0) − Y_0(z', 0) | X, for all z, z', where, following Ye et al. (2022), ∼ denotes having the same distribution, and Y_t(z, d) denotes the potential outcome at time t if exposed to d and Z had been externally set to z. The first part (i) equates to requiring that Z does not modify the treatment effect, as the authors mention. Assumption (ii) above is, however, a "parallel trends" type assumption: it says that the trend in the untreated potential outcomes is the same in the encouraged (Z = 1) and non-encouraged (Z = 0) groups, respectively. This is analogous to the standard DID assumption of parallel trends, which requires the untreated potential outcome trends among the exposed and the unexposed (with respect to D) to be the same.
Let us continue using standard IV as our template. Recall that the three core IV assumptions (relevance, ER, and unconfoundedness) are not sufficient to point identify a causal effect (Hernán & Robins, 2006). An extra assumption is required. In the case of standard (one-time-point) IV estimation, the ATE can be point-identified by requiring that there is no effect modification by Z among the treated and untreated populations. Note that this is indeed similar to assumption 2b(i). However, 2b(i) is not sufficient for point identification of the ATE in IDID. Indeed, this "no effect modification by Z" assumption arose in this setting as part of the ER for outcome trends.
The IDID settings require a stronger assumption to point identify the ATE, namely assumption 2c, "no unmeasured common effect modifier". This assumption has been proposed in standard IV settings as an alternative to "no effect modification by Z" (Cui & Tchetgen, 2021). It is this alternative assumption (2c) that can be replaced by the monotonicity assumption, D_t(1) ≥ D_t(0) with probability 1, where D_t(z) denotes the potential exposure at time t ∈ {0, 1} had Z been set to z, leading to identification of the LATE. Note that, unlike relevance (which needs to hold for the exposure trend), the monotonicity assumption needs to hold at both time points, and therefore it could be considered a stronger set of assumptions than those required for standard IV.
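To make the Wald-type IDID identification concrete, the following is a minimal simulation sketch (not the authors' code; the data-generating values are invented for illustration). The instrument has a direct, time-constant effect on the outcome level, which violates the standard IV exclusion restriction, yet the ratio of the outcome DID to the exposure DID still recovers the constant treatment effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta, gamma, tau = 2.0, 1.5, 0.5  # treatment effect, direct IV effect, time trend

Z = rng.binomial(1, 0.5, n)       # binary instrument
T = rng.binomial(1, 0.5, n)       # time period (repeated cross-sections)
# exposure probabilities: the instrument shifts the exposure *trend*
p = np.select([(Z == 0) & (T == 0), (Z == 1) & (T == 0),
               (Z == 0) & (T == 1), (Z == 1) & (T == 1)],
              [0.2, 0.3, 0.3, 0.6])
D = rng.binomial(1, p)
# gamma*Z is a direct instrument effect, constant in time, so it cancels in trends
Y = beta * D + gamma * Z + tau * T + rng.normal(size=n)

def did(V, Z, T):
    """Difference-in-differences of V with respect to Z, across the two periods."""
    m = lambda z, t: V[(Z == z) & (T == t)].mean()
    return (m(1, 1) - m(0, 1)) - (m(1, 0) - m(0, 0))

wald_idid = did(Y, Z, T) / did(D, Z, T)  # should be close to beta = 2
```

A standard IV Wald ratio at a single time point would be biased here by the direct effect gamma*Z, while both DID quantities difference it away.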

Interpretation of the LATE estimand
The identified LATE can be interpreted as the causal effect in the "compliers" stratum, those with D_t(1) − D_t(0) = 1, that is, those who receive D = 1 when Z = 1 but not otherwise, at both time points, only when the IV is causally related to the exposure (Swanson & Hernán, 2018). Establishing whether the relationship between Z and the exposure trend is causal would typically vary from application to application. Thus, the interpretation of the LATE will depend on the type of "encouragement" instrument used when applying IDID.
Even with a causal IV, the LATE is controversial in clinical and epidemiological applications, as the compliers stratum always remains unobserved.
Consider now two-sample settings. Suppose that we have two cross-sectional samples, in which we have measured only the exposure or only the outcome, correspondingly denoted by O_b = (Z_b, D_b, X_b) and O_a = (Z_a, Y_a, X_a) (for before and after). Just like in standard two-sample IV, we can use (Z_b, D_b, X_b) to estimate the relationship between the exposure and the instrument, D ∼ Z (or, in this case, the exposure trend), and use the second sample (Z_a, Y_a, X_a) to estimate the relationship between the outcome and the instrument, Y ∼ Z (or, in our settings, the outcome trend). Now, the ATE is identified by a two-sample Wald estimand, Δ_0 = δ_Y/δ_D, so long as f_a(D | Z, X) = f_b(D | Z, X) and f_a(Y | Z, X) = f_b(Y | Z, X). Such "structural stability" assumptions rule out covariate shifts across the two samples for the outcome and the exposure in the strata defined by Z, and therefore seem implausible, especially in the situations where we seek to apply them, where we do not have access to the same individuals in the two time periods. It would be of interest to explore relaxing this, allowing covariate shifts across the two time periods, perhaps by adapting techniques developed by Nie et al. (2019).
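The consequence of violating structural stability can be sketched numerically. The snippet below is a hypothetical minimal example (a single time point rather than trends, for simplicity; all parameter values are invented): instrument strength depends on a covariate X, so when the covariate distribution differs between the exposure-only and outcome-only samples, the two-sample Wald ratio drifts away from the true effect of 2:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n, q):
    """One cross-sectional sample; q = P(X = 1) controls the covariate distribution."""
    X = rng.binomial(1, q, n)
    Z = rng.binomial(1, 0.5, n)
    D = rng.binomial(1, 0.1 + 0.5 * Z * X)  # instrument strength depends on X
    Y = 2.0 * D + X + rng.normal(size=n)    # true effect = 2
    return X, Z, D, Y

def two_sample_wald(exp_sample, out_sample):
    """Exposure-instrument association from one sample, outcome-instrument from the other."""
    _, Zb, Db, _ = exp_sample
    _, Za, _, Ya = out_sample
    delta_D = Db[Zb == 1].mean() - Db[Zb == 0].mean()
    delta_Y = Ya[Za == 1].mean() - Ya[Za == 0].mean()
    return delta_Y / delta_D

n = 200_000
stable = two_sample_wald(draw(n, 0.5), draw(n, 0.5))   # same covariate distribution
shifted = two_sample_wald(draw(n, 0.4), draw(n, 0.8))  # covariate shift across samples
```

With matched covariate distributions the ratio is close to 2; under the shift above it is inflated by the ratio of the X-dependent instrument strengths in the two samples.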

Extra assumptions for the two-sample estimator
Moreover, using the two-sample Wald estimand in conjunction with monotonicity has implications for what the estimand corresponds to. By analogy with standard IV (Zhao et al., 2019), we can see that, for identification of the LATE, we also need to assume "structural invariance" for the compliers class at time t, that is, P_a(D_t(1) > D_t(0)) = P_b(D_t(1) > D_t(0)); otherwise, the two-sample Wald estimand corresponds to the LATE multiplied by a scaling factor given by the ratio of the instrument–exposure (trend) associations in the two samples. For this to have the same sign as the LATE, we also need to assume that this scaling factor is positive.
Finally, assumption 2d, that the CATE is constant in time, also seems implausible in this two-sample cross-sectional setting, even when the study period spans only a short time, as it is more likely that the two samples correspond to slightly different populations. This assumption seems more plausible in one-sample, longitudinal settings, where the same individuals are followed up over time.

DATA-ADAPTIVE ESTIMATION
After establishing identification, the authors propose several estimators. As well as the Wald estimator, analogous to that of standard IV, the authors derive several estimators that target the estimand β_0 resulting from the projection of the true CATE function Δ_0(X) onto a parametric working model Δ(X; β).
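As a concrete illustration of such a projection estimand (a generic sketch, not tied to the authors' notation; the quadratic CATE and uniform design below are invented for illustration), projecting a nonlinear CATE Δ_0(x) = 1 + x² onto the linear working model Δ(x; β) = β_0 + β_1 x amounts to a least-squares fit, whose coefficients give the best linear approximation rather than "the" treatment effect:

```python
import numpy as np

# dense deterministic grid standing in for the covariate distribution X ~ U(0, 1)
x = np.linspace(0.0, 1.0, 100_001)
cate = 1.0 + x**2                      # true (nonlinear) CATE, illustrative

# L2 projection onto the working model b0 + b1 * x
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, cate, rcond=None)
# analytically, the projection of 1 + x^2 on [0, 1] is 5/6 + 1.0 * x
```

The projection parameter is well defined even when the working model is misspecified, which is what makes it a convenient inferential target.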
Here, I primarily discuss the so-called "multiply robust estimator" β̂_mr. This is an estimating-equations estimator, based on the efficient influence function (EIF), φ(O; β_0), of the (projection) estimand β_0.
As in Ye et al. (2022), for W ∈ {Y, D}, denote by δ_W(X) the conditional difference in the trend of W between the Z = 1 and Z = 0 groups, given covariates X. The authors prove that the estimator β̂_mr is multiply robust, and is consistent and asymptotically normal (CAN), under an appropriate Donsker condition (Assumption 3) and either: (i) the models for the instrument and time propensity scores, π(X) = P(Z = 1 | X) and λ(X) = P(T = 1 | X), and the exposure trend δ_D(X) are correct; or (ii) the models for the outcome regression E(Y | Z, T, X) and δ_D(X) are correct; or (iii) the models for the exposure regression E(D | Z, T, X) and the outcome trend δ_Y(X) are correct. While the multiple-robustness property means that not all the nuisance models have to be correctly specified, in practice most parametric models are probably wrong, so a multiply robust property is of limited practical utility.
Nevertheless, recent advances in semiparametric efficiency theory have shown that EIF-based estimators can converge at the fast parametric rate to the true β_0, and thus be asymptotically normal, even when the nuisance functionals are estimated nonparametrically at slower rates, for example, via flexible data-adaptive (machine learning) methods.
Since β̂_mr is an EIF-based estimating-equation estimator, it is suitable for data-adaptive nuisance parameter estimation. As discussed by Ye et al. (2022), under empirical process conditions, for example, Donsker class assumptions (which can be avoided via sample splitting, see below), the error term is (to a first-order approximation) a product of the errors of the nuisance models (Theorem 2 of Ye et al. 2022). This allows us to use flexible, machine learning plug-in estimators for the nuisance functionals, which typically converge at slower rates. Then, as long as each data-adaptive estimator converges to its respective truth (denoted by a subscript 0) at a sufficiently fast rate that the condition of Theorem 2 holds, that is, the product of the nuisance estimation errors is o_P(n^(−1/2)), the estimator that results from plugging in these data-adaptive nuisance estimators is CAN and Equation (5) of Ye et al. (2022) holds. The variance can be obtained from the variance of the EIF. I remark that, in general, using machine learning plug-in nuisance estimators within estimators based solely on inverse probability weighting or "outcome regression" leads to biased estimators, because of the slower convergence rates; moreover, constructing confidence intervals with valid coverage is then difficult. It is important to note that generic nonparametric bootstrap arguments are no longer justified in conjunction with data-adaptive plug-in estimators for nuisance parameters (Bickel et al., 1997; Coyle & van der Laan, 2018).
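The mechanics of sample splitting can be sketched generically. The snippet below is a hypothetical minimal example (using the simpler ATE functional and a crude histogram-type nuisance estimator, rather than the IDID estimand): the nuisances are fitted on one fold, the EIF is evaluated on the other fold, and the fold-specific scores are averaged:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.uniform(-1, 1, n)
e_true = 1 / (1 + np.exp(-X))                       # true propensity P(A=1|X)
A = rng.binomial(1, e_true)
Y = 1.0 * A + np.sin(2 * X) + rng.normal(size=n)    # true ATE = 1

bins = np.linspace(-1, 1, 21)

def fit_bin_means(Xtr, Vtr):
    """Crude data-adaptive nuisance: per-bin sample means of V given X."""
    idx = np.clip(np.digitize(Xtr, bins) - 1, 0, len(bins) - 2)
    means = np.array([Vtr[idx == b].mean() if np.any(idx == b) else Vtr.mean()
                      for b in range(len(bins) - 1)])
    return lambda Xev: means[np.clip(np.digitize(Xev, bins) - 1, 0, len(bins) - 2)]

def eif_ate(X, A, Y, ehat, m1hat, m0hat):
    """AIPW (EIF-based) scores for the ATE."""
    return (m1hat(X) - m0hat(X)
            + A * (Y - m1hat(X)) / ehat(X)
            - (1 - A) * (Y - m0hat(X)) / (1 - ehat(X)))

# two-fold cross-fitting: nuisances are trained on the *other* fold
half = n // 2
folds = [(np.arange(half), np.arange(half, n)), (np.arange(half, n), np.arange(half))]
scores = []
for train, ev in folds:
    ehat = fit_bin_means(X[train], A[train].astype(float))
    m1hat = fit_bin_means(X[train][A[train] == 1], Y[train][A[train] == 1])
    m0hat = fit_bin_means(X[train][A[train] == 0], Y[train][A[train] == 0])
    scores.append(eif_ate(X[ev], A[ev], Y[ev], ehat, m1hat, m0hat))

ate_hat = np.mean(np.concatenate(scores))
se_hat = np.std(np.concatenate(scores)) / np.sqrt(n)
```

Because evaluation points never enter their own nuisance fits, the empirical process term is controlled without Donsker assumptions on the nuisance estimators.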
Finally, I would like to discuss the Donsker condition on the class of functions that contains the estimated EIF. To understand why this is often required, we need to take a step back and briefly discuss the error term between a typical plug-in estimator Ψ(P̂_n), that is, an estimator that replaces the true data distribution P_0 with an estimate P̂_n, where the sub-index n denotes the sample size of the data, and the value of the estimand at P_0. This error is characterized by how Ψ changes when the data distribution changes from P_0 to a different distribution P̃ in a small neighborhood. This change is described by the so-called von Mises expansion, a functional version of the Taylor expansion, with the EIF φ playing the role of the usual derivative. Using this expansion, the error term of many plug-in estimators can be decomposed as
Ψ(P̂_n) − Ψ(P_0) = (P_n − P_0)φ(P_0) − P_n φ(P̂_n) + (P_n − P_0){φ(P̂_n) − φ(P_0)} + R_2(P̂_n, P_0),
where R_2 is a second-order remainder term, P_n denotes the empirical distribution function, and P f is shorthand for ∫ f dP. The first term is well understood and converges (after √n scaling) to a normal, mean-zero variable. The second term is known as the drift or plug-in bias term; this term is zero by construction in estimating-equation estimators (see, e.g., Hines et al. 2022). The third term is known as the empirical process term. Donsker conditions are typically required to control the asymptotic behavior of the empirical process term. This assumption can be relaxed by adopting sample splitting, or cross-fitting, as done in the debiased machine learning and cross-validated TMLE literature (Chernozhukov et al., 2018; Zheng & van der Laan, 2011). While cross-fitting can be used in conjunction with parametric nuisance models to avoid assuming Donsker conditions, the use of sample splitting or cross-fitting is preferable to Donsker conditions when using machine learning nuisance parameter estimation, as certain data-adaptive methods (e.g., random forests) may give rise to plug-in influence-function-based estimators which do not satisfy the Donsker condition (Chernozhukov et al., 2018).
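The decomposition of the plug-in error follows from two algebraic steps, sketched here in standard one-step notation (as in, e.g., Hines et al. 2022; this is a generic argument, not specific to IDID):

```latex
\begin{align*}
% von Mises (first-order) expansion with second-order remainder:
\Psi(\hat{P}_n) - \Psi(P_0)
  &= -P_0 \varphi(\hat{P}_n) + R_2(\hat{P}_n, P_0) \\
% add and subtract the empirical measure P_n:
  &= -P_n \varphi(\hat{P}_n) + (P_n - P_0)\varphi(\hat{P}_n) + R_2(\hat{P}_n, P_0) \\
% center the middle term at \varphi(P_0), using P_0 \varphi(P_0) = 0:
  &= (P_n - P_0)\varphi(P_0) - P_n \varphi(\hat{P}_n)
     + (P_n - P_0)\{\varphi(\hat{P}_n) - \varphi(P_0)\}
     + R_2(\hat{P}_n, P_0).
\end{align*}
```

The three displayed terms are, respectively, the asymptotically normal term, the drift (plug-in bias) term, and the empirical process term.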

ACKNOWLEDGMENTS
DiazOrdaz thanks the co-editor for the invitation to discuss this paper. DiazOrdaz is funded by a Royal Society Wellcome Trust Sir Henry Dale Fellowship 218554/Z/19/Z.

ORCID
Karla DiazOrdaz https://orcid.org/0000-0003-3155-1561