### Discussion on the paper by Imai, Tingley and Yamamoto

**Fabrizia Mealli** (*Università di Firenze*)

Imai, Tingley and Yamamoto must be congratulated for having attacked the challenging problem of understanding causal mechanisms. I like their engagement in exploring experimental designs, because this task requires spelling out the assumptions that are needed for identification and makes one reflect deeply on the questions that are asked.

Understanding mechanisms is particularly valuable if it helps to design improved interventions. An issue that deserves further discussion is whether investigations on pathways should focus on natural direct and indirect effects. Although these effects have received attention is some disciplines, are they the *natural* estimands that may suggest important pathways? Pearl (2001), paragraph 2.2, originally called them *descriptive* tools for attributing part of the effect of an intervention to an intermediate variable. But I think that the tool often fails to provide a good description of how things work, because of the asymmetric roles of *M*_{i}(0) and *M*_{i}(1); in general only one of the potential values of the intermediate variable is chosen as descriptive of the *causal forces under natural conditions*. To me, both values are *natural*, in that they describe how an individual reacts to a treatment. Their joint value is essentially a characteristic of a subject, so conceiving a manipulation of one of the two values is like considering changing the value of a pretreatment characteristic. This is essentially why consistency assumptions are rarely credible: they assume that the action that is taken to modify the value of a characteristic of a subject has no consequence on the outcome value. Also, quantities of the type , *t*≠*t*^{′}, are ill defined and sometimes difficult to conceive if one is not explicit about the process that led to observing *M*_{i}(0) and *M*_{i}(1) (Mealli and Rubin, 2003); , *t*≠*t*^{′}, are quantities that in a single experiment are ‘*a priori* counterfactuals’ because they cannot be observed for any subset of units. Even assuming that consistency holds, there may be subjects, possibly characterized by covariates’ values, for whom a level of *M* equal to *M*_{i}(0) under treatment can never be reached. 
If this is so, it means that the experiment is seeking an outcome that never occurs in real life, so I fail to understand how such a quantity can have any descriptive power. This suggests that, when interest lies in opening the black box, valuable design effort should be directed more at collecting detailed background covariates and additional outcomes than at generating outcomes under manipulations of the mediating variable.

Despite recognizing the value of the experiments that are proposed by the authors for opening the possibility of identifying causal mechanisms through clever manipulation, I find that the settings where these could be applied most convincingly are those, like the example of gender discrimination, where manipulation does not involve human beings directly.

A better description of how things work is provided by looking at the joint value of the natural levels of *M*_{i}(0) and *M*_{i}(1); those joint values define a stratification of the population into principal strata. Principal strata effects (PSEs), contrasts of *Y*(0) and *Y*(1) within principal strata, are well-defined causal quantities, which do not involve *a priori* counterfactuals (Frangakis and Rubin, 2002). If one is seeking information on the effect of an intervention that is not attributable to the change in the intermediate variable, it is sensible to start looking at the effect of the intervention on subjects for whom *M naturally* does not change, i.e. *M*_{i}(0)=*M*_{i}(1). These PSEs are called dissociative and can be contrasted with associative effects, i.e. effects in principal strata where *M*_{i}(0)≠*M*_{i}(1). They allow distinguishing causal effects of *T* on *Y* that are associated with causal effects of *T* on *M* from causal effects of *T* on *Y* that are dissociative and thereby associated with other causal pathways. The information that is provided by PSEs is extremely valuable: for example, large associative effects relative to small dissociative effects would indicate that the intervention has stronger effects on units where it also has an effect on the mediator. Associative and dissociative effects of equal magnitude would instead indicate that the intervention's effect is the same regardless of whether it has an effect on the mediator, which would suggest some alternative causal pathways through which the intervention has an effect without having an effect on the mediator. Even though such differences between PSEs may be due to principal strata heterogeneity only, an accurate principal strata analysis can provide useful insights on mechanisms and generate useful hypotheses that can be confronted with subject matter knowledge and also tested with a confirmatory experiment on a newly designed intervention.
Looking at the distribution of the covariates and outcomes within the strata (Frumento *et al*., 2012) may provide insights on the plausibility of ignorability assumptions for *M*(0) and *M*(1) (Jin and Rubin, 2008) to identify the effects of *M* on *Y*.
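
The associative–dissociative contrast described above can be illustrated with a small oracle simulation, in which (unlike in any real experiment) both potential mediator values are known for every unit; the data-generating model and all effect sizes below are hypothetical, chosen only to make the two kinds of PSE visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential mediator values M_i(0), M_i(1): known here only because we simulate.
m0 = rng.binomial(1, 0.3, n)
m1 = np.where(m0 == 1, 1, rng.binomial(1, 0.5, n))  # treatment can only switch M on

# Potential outcomes: a direct effect of 0.5 plus 0.2 per unit of the realized mediator.
y0 = 0.2 * m0 + rng.normal(0, 1, n)
y1 = 0.5 + 0.2 * m1 + rng.normal(0, 1, n)

effect = y1 - y0
dissociative = m0 == m1   # principal strata where M naturally does not change
associative = m0 != m1    # principal strata where T changes M

print("dissociative PSE:", effect[dissociative].mean())  # ~0.5: direct pathway only
print("associative PSE: ", effect[associative].mean())   # ~0.7: direct + mediator pathways
```

Here the gap between the two PSEs points at the mediator pathway exactly as described above, while neither quantity requires an *a priori* counterfactual.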

I appreciated the authors’ effort to relax perfect manipulation by introducing encouragement designs, providing new alternative approaches to discover causal mechanisms. However, the effects that these designs usually help to reveal are essentially PSEs, and I am glad that the authors recognize their usefulness, despite their *local* nature.

The variety of designs presented by Imai, Tingley and Yamamoto shows how some common jargon is difficult to translate into proper causal statements. They have done a great job by engaging in this challenging area, and it is therefore my pleasure to propose the vote of thanks.

**Carlo Berzuini** (*University of Cambridge*)

The authors must be congratulated for their very stimulating paper that bridges advanced causal inference methodology and experimental scientific investigation.

The authors choose a causal inference framework based on potential outcomes, where each individual is characterized by a notional value of the response for each possible treatment. These notional values, called potential outcomes, are assumed to be fixed for the individual even before any treatment is applied. In their approach to the encouragement designs, the authors use a powerful device, that of restricting inferential attention to a particular principal stratum (PS), i.e. a group of individuals defined by the values of two or more potential outcomes for the same variable. A possible difficulty arises here. This is because potential outcomes cannot be jointly observed, and therefore we do not generally know who the individuals in a given PS are. For example, in their treatment of the parallel encouragement design, the authors restrict attention to the PS of individuals characterized by specific patterns of reaction to specific stimuli, a property that we shall not normally be able to check in any given individual. The paper offers examples of clever use of PSs. But the use of this device raises *caveats* that we shall now discuss.

It will suffice to illustrate the issue in relation to the parallel encouragement design, where the authors restrict inferential attention to the PS of compliers, i.e. of those individuals who react to the encouragement in the intended direction (*M*(*t*,1)=1,*M*(*t*,−1)=0). The method assumes that complier status (albeit unobservable) is a fixed and time invariant attribute of an individual. Is this a reasonable assumption? For example, is it reasonable to assume that someone we observe reacting to a specific stimulus with an increase in anxiety will always react to it in the same way? The actual state of affairs might be different: no matter how we circumscribe the problem, complier status might really remain a random variable, causing individuals to move in and out of the group of compliers in an unpredictable way. In this case our inferences would be based on just those individuals who happened by chance to be compliers during the experiment. Can we, in such a case, claim that we are learning about a stable mechanism of nature? In certain applications, a (real or presumed) natural law (of the kind that we encounter in physics) will support the claim that the compliers constitute a stable and scientifically meaningful stratum of the population, as illustrated by the application example at the end of these comments.

We conclude by illustrating the main points with the aid of a study of the role of the acid-sensing ion channel 1 (ASIC 1) in the development of multiple sclerosis. The study is a collaboration with Luisa Bernardinelli, of the University of Pavia. The treatment here consists of inducing in each experimental mouse a disorder called experimental autoimmune encephalomyelitis (EAE), which simulates the neuropathological changes of human multiple sclerosis. The mice are randomized over two levels of severity of the induced EAE, the level of severity being represented in the diagram by the binary variable *T*. Each mouse is also characterized by the genotype at the rs28936 locus, which regulates the expression of ASIC 1 and is represented in our diagram by the three-level variable *Z*, the number of copies of the deleterious rs28936 allele. Induction of EAE, and the consequent inflammatory process, causes an increase in the expression of ASIC 1, and a corresponding neurological deficit, which we record in each mouse, in the form of an ordinal score, *Y*, 15 days after inoculation. Also recorded, in each mouse, is the level of ASIC 1 expression, *M*, in terms of the amount of messenger ribonucleic acid in neuronal cell bodies 15 days after inoculation. Of inferential interest is the extent to which the effect of EAE (node *T*) on the deficit (node *Y*) is mediated by quantitative changes in ASIC 1 expression (node *M*), and by the consequent increase in ion influx. The study can be adapted to the proposed parallel encouragement design, with the genetic effect acting as encouragement. Compliers, in this example, are all mice in which presence of the deleterious rs28936 allele induces an increase in ASIC 1 expression. Knowledge of molecular mechanisms supports the claim that such compliers represent a stable majority of the mouse population.
Hence, under the assumptions that are represented in our causal diagram, the method proposed by the authors can be used to calculate meaningful bounds on the *T*→*M*→*Y* indirect effect, the effect that inflammation exerts on disease severity via changes in ASIC 1 expression.

It is a privilege for me to have been invited to discuss a paper which will no doubt stimulate plenty of future research.

I therefore have great pleasure in seconding the vote of thanks.

The vote of thanks was passed by acclamation.

**Guanglei Hong** (*University of Chicago*)

I congratulate Kosuke Imai and his colleagues for another important methodological paper on identifying causal mechanisms. The experimental designs that they proposed have many attractive features. Yet I am concerned with the assumption of no treatment-by-mediator interaction in the parallel designs and the assumption of no carry-over effect in the crossover designs. We can find many applications in social sciences in which these two assumptions are implausible.

I propose a ‘covariate-informed parallel design’ that does not require these key assumptions. This new design is similar to the parallel design except that the second experiment employs covariate-informed randomization in the same spirit as a randomized block design.

Let *D*=0 and *D*=1 denote the first and second experiments respectively. For simplicity, let treatment *T* and mediator *M*(*t*) both be binary. After collecting pretreatment information **X**, we randomly assign participants to either *D*=0 or *D*=1.

Participants in the *D*=0 group are assigned at random to either *T*=0 or *T*=1. We observe *M*(*t*) and specify a prediction function relating **X** to *M*(*t*) for *t*=0,1.

Those in the *D*=1 group are assigned at random to either *T*=0 or *T*=1. Applying the prediction functions that are specified in the first experiment, we obtain, for each participant assigned to treatment *t* in the second experiment, *φ*(*t*,**X**)=pr{*M*(*t*)=1|*T*=*t*,**X**}. The participants are then assigned at random to *M*(*t*)=1 with probability *φ*(*t*,**X**). Analogous to a two-stage adaptive design in clinical trials (Bauer and Kieser, 1999; Liu *et al*., 2002), the covariate-informed randomization should have a higher compliance rate than a simple randomized design.

In the covariate-informed parallel design, treatment and mediator are both randomized. This design requires the stable unit treatment value assumption and the consistency assumption. If using pretreatment information to create blocks, we may estimate the block-specific treatment effects as well as the average treatment effect. By comparing each of these effects across the two parallel experiments, we may partially test the consistency assumption. In the second experiment, we may test the no treatment-by-mediator interaction assumption not only on average but also within each block.

More importantly, when the no-interaction assumption fails, researchers can nonetheless apply ratio of mediator probability weighting to estimate the counterfactual outcome *E*[*Y*{1,*M*(0)}] (Hong, 2010; Hong *et al*., 2011). For a participant assigned to *T*=1 and to mediator value *m* in the second experiment, the weight is

*ω* = *φ*(0,**X**)^{*m*}{1−*φ*(0,**X**)}^{1−*m*}/[*φ*(1,**X**)^{*m*}{1−*φ*(1,**X**)}^{1−*m*}].

We can show that *E*(*ωY*|*T*=1,*D*=1)=*E*[*Y*{1,*M*(0)}]. Future research may investigate the sensitivity of results to the specification of the prediction functions.
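
This weighting identity can be checked in a small simulation; the logistic form of *φ*(*t*,**X**), the outcome model and all effect sizes below are hypothetical, chosen only to illustrate the ratio-of-mediator-probability weight in the *T*=1 arm of the second experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x = rng.normal(0, 1, n)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Hypothetical prediction functions carried over from the first experiment.
phi0 = sigmoid(-0.5 + x)   # phi(0, X) = pr{M(0) = 1 | X}
phi1 = sigmoid(0.5 + x)    # phi(1, X) = pr{M(1) = 1 | X}

# Second experiment, T = 1 arm: the mediator is randomized with probability phi(1, X).
m = rng.binomial(1, phi1)
y = 1.0 + 0.8 * m + 0.3 * x + rng.normal(0, 1, n)   # hypothetical Y(1, m)

# Ratio-of-mediator-probability weight for a participant with mediator value m.
w = np.where(m == 1, phi0 / phi1, (1 - phi0) / (1 - phi1))
estimate = np.mean(w * y)   # estimates E[Y{1, M(0)}]

# Oracle benchmark, available only in simulation: draw M(0) directly from phi(0, X).
m0 = rng.binomial(1, phi0)
benchmark = np.mean(1.0 + 0.8 * m0 + 0.3 * x)

print(estimate, benchmark)   # the two agree up to simulation error
```

The weight simply re-expresses each *T*=1 observation under the mediator distribution it would have had under control, which is why the plain weighted mean recovers *E*[*Y*{1,*M*(0)}].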

**Brian L. Egleston** (*Fox Chase Cancer Center, Philadelphia*)

I enjoyed this paper. The authors provide useful details on assumptions that are needed to identify mediational pathways. I do worry, however, whether we are doing scientists a disservice by focusing on indirect and direct effects as targets of investigation. Some of the interest in indirect and direct effects can probably be tied back to Wright's (1921) work on path analysis. Many scientists might be using the outgrowth of path analytic techniques without considering whether the estimands are germane to their research.

Imai, Tingley and Yamamoto have a particular focus on ‘natural’ effects (Pearl, 2001), as shown in equations (2) and (3) of their paper. Natural effects are not necessarily useful in cancer therapeutic development. A current goal of much research is to identify causal pathways of cancer growth that can be blocked. Although this research has led to the creation of useful drugs, the therapeutic effect has often been less than ideal. One problem is that the human body has built-in biologic redundancy. Hence, if a pathway is blocked, the body will often find another mechanism to achieve the same goal. This has led to estimands of interest that differ from those of focus by the authors.

Notationally, let *G*_{z} represent cancer-related gene number *z* for *z*=1,…,*n* (*G*_{z}=1 if *G*_{z} is active and *G*_{z}=0 otherwise). Let *T*(*C*) represent survival time under cancer state *C* (*C*=0 if no cancer and *C*=1 if cancer). Let *T*(*C*,*G*_{1}) and *T*(*C*,*G*_{1},*G*_{2}) represent potential survival outcomes under *G*_{1} alone and with *G*_{2}. Current therapeutic research is interested in creating a situation in which *E*[*T*(1)]=*E*[*T*(0)]. Using inhibitors, *G*_{z} becomes manipulable. A first step in development is to investigate whether *E*[*T*(1,0)]=*E*[*T*(0)]. Unfortunately, scientists generally find that *E*[*T*(1,0)]<*E*[*T*(0)]. However, in the course of investigating why the survival benefit when inhibiting *G*_{1} is not as great as expected, researchers discover that *G*_{2} has taken over many of the functions of *G*_{1}. Previously, *G*_{2} was not strongly implicated as a potential confounder or mediator. A new inhibitor of *G*_{2} is developed and investigators find that *E*[*T*(1,0)]<*E*[*T*(1,0,0)]<*E*[*T*(0)], and the cycle of discovering why blocking pathways is not as successful as intended continues.

Although *G*_{j} might be a mediator of the relationship of *T*(*C*) and *G*_{j−1} for *j*>1, the relationship is not necessarily discoverable until *G*_{j−1} is inhibited. The estimation of *E*[*T*(1,0)] and *E*[*T*(1,0,0)] involves manipulation of the mediators, and the natural effects are of little inherent interest.

**Roland R. Ramsahai** (*University of Cambridge*)

The potential outcome probabilities are invariant to the value assigned by randomization, by definition, and the paper assumes that they are invariant under randomization or observation. This invariance is no weaker than condition (12), which restricts the distribution of the individual characteristics *U* to be invariant to the strategy for assigning *T* and *M*.

Chen *et al*. (2007) showed that, under a zero direct effect, the causal effect of *T* on *Y* is not predictable from the effect of *T* on *M* and *M* on *Y*. Conditions were given in Chen *et al*. (2007) to ensure that the effect is predictable from the chain of effects. Perhaps similar criteria can be developed under a non-zero direct effect and then used to develop tests to check the validity of the ‘causal chain’ approach in Section 3.

**Theis Lange** (*University of Copenhagen*)

Firstly I congratulate the authors for an important and enjoyable paper; secondly I thank Professor Imai for an inspiring presentation at the Society. On reading the paper I was left with two concerns or perhaps more accurately wishes for future research.

- (a)
In the parallel design we have replaced the (untestable) assumption of sequential ignorability with an assumption of no-interaction at the unit level (which at least has testable implications). However, for non-binary outcomes the no-interaction assumption is scale dependent. I fear that it would often be difficult to argue for the validity of the no-interaction assumption by using only subject matter knowledge, even when there are good subject matter arguments for a mechanistic causal effect separation since such arguments are rarely scale specific. Thus, we have replaced a structural assumption (namely sequential ignorability) with a purely technical assumption. Perhaps future research can either remove this assumption or establish whether we are still estimating something interesting when the no-interaction assumption fails.

- (b)
On the basis of the present paper it could be argued that for any new experiment aiming at quantifying causal mechanisms one of the novel designs should (if at all possible) be employed simply as a precautionary measure. However, before adopting this guideline it would be of great value to know the price we are paying in terms of statistical uncertainty. Or, in other words, assuming that both sequential ignorability and the no-interaction assumption hold, but we only have 100 study subjects, are these 100 subjects then best ‘used’ in a single-experiment set-up or a parallel design in terms of statistical uncertainty of the resulting estimators?
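
A crude Monte Carlo sketch of this trade-off, under a purely hypothetical linear model in which both sequential ignorability and the unit-level no-interaction assumption hold, compares the spread of a product-of-coefficients estimate of the indirect effect when 100 subjects go into a single experiment *versus* 50 into each arm of a parallel design; all effect sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n = 2000, 100
alpha, beta = 0.5, 0.7   # hypothetical T -> M and M -> Y effects (no interaction)

def slope(y, *cols):
    """Last OLS coefficient of y regressed on an intercept and the given columns."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][-1]

single, parallel = [], []
for _ in range(reps):
    # Single experiment: T randomized, M arises naturally (sequential ignorability holds).
    t = rng.binomial(1, 0.5, n)
    m = alpha * t + rng.normal(0, 1, n)
    y = 0.3 * t + beta * m + rng.normal(0, 1, n)
    single.append(slope(m, t) * slope(y, t, m))

    # Parallel design: 50 subjects estimate the T -> M link ...
    t1 = rng.binomial(1, 0.5, n // 2)
    m1 = alpha * t1 + rng.normal(0, 1, n // 2)
    a_hat = slope(m1, t1)
    # ... and 50 subjects, with M directly randomized, estimate the M -> Y link.
    t2 = rng.binomial(1, 0.5, n // 2)
    m2 = rng.normal(0, 1, n // 2)
    y2 = 0.3 * t2 + beta * m2 + rng.normal(0, 1, n // 2)
    parallel.append(a_hat * slope(y2, t2, m2))

print("SD, single experiment:", np.std(single))
print("SD, parallel design:  ", np.std(parallel))
```

In this toy set-up the parallel design pays a visible price in precision, which is exactly the comparison that the question above asks for; the ranking could of course change under other models, designs or estimators.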

**Andrew Gelman** (*Columbia University, New York*)

This is an impressive paper that goes beyond philosophical argument and mathematical manipulation and proposes specific designs to study real problems. Several of the proposed new studies seem fairly inexpensive—e.g. the expanded survey experiment in Section 4.2.3 on attitudes towards immigration—and I wonder whether the authors are considering performing these studies themselves or perhaps know of others who have such plans. Often in political science and economics we need to wait for new data (new elections; new revolutions; new economic or political trends), but these psychological studies can be replicated fairly easily, and I am curious about the results.

I have two further questions, one applied and one methodological. My applied question is about the effect of incumbency and money in US congressional elections. Unlike many causal questions in social science, this one can be formulated cleanly: for incumbency, either an incumbent runs for re-election or not. For money, if I give $100 to candidate X, what is the expected effect on his or her vote share in the upcoming election? Also, whether an incumbent runs for re-election affects the campaign contributions in his or her district. Although all these effects are clearly defined, studying them is tricky: incumbents’ decisions, results of primary elections and campaign donations are observational variables, as are the aspects of their opponents in the general elections. There is a literature on the estimation of the effects of incumbency and money on elections using various clever ideas with observational data and natural experiments. I am wondering whether the methods described in the present paper can be applied in this observational setting.

Finally, just as a minor comment: I hope in the future that the authors will think as hard about the presentation of their results as they do about their mathematical foundations. For example, they estimate a proportion as ‘0.730 for the treatment group and 0.392 for the control group’. Given that their sample sizes are below 68 and 198 respectively, I think that third digit is meaningless. Similarly, they present a confidence interval as [−0.175,0.825]. Given the evident level of uncertainty, [−0.2,0.8] would suffice. One of the most important messages statisticians convey is about the presence of uncertainty, and we dilute much of this when we display meaningless levels of precision.

**Manabu Kuroki** (*Institute of Statistical Mathematics, Tokyo*)

I congratulate the authors on this paper which tackles a difficult but interesting problem. I would like to provide some comments on the present paper from the viewpoint of quality control (QC) which is one of my main research fields.

Dr Genichi Taguchi, who was a pioneer of quality engineering, implied that the effect decomposition problem is one important issue in experimental design (Taguchi (1987), chapter 28). However, he did not provide any solution to this problem, and it has not attracted much attention from QC experimenters for a long time. In this sense, although the authors’ research area is different from that of Dr Taguchi, the authors also shed light on the effect decomposition problem in experimental studies. Thus, the present paper provides a new motivation for QC practitioners who deal with this problem.

The results of the present paper have some limitations when we apply them to the QC area:

- (a)
the assumption of no carry-over effects does not hold in many cases;

- (b)
the monotonicity assumption is often violated and

- (c)
a main interest in QC is to evaluate direct and indirect effects in the whole population instead of a subset of the population.

Despite these limitations, the present results may be applicable to, for example, an experimental study evaluating the effects of exchanging components that may cause poor performance or deterioration of assemblies (or sequential systems). As one solution to these limitations, the authors provided sharp bounds in some cases, but they did not formulate bounds for encouragement designs. It would be helpful if the authors could provide the formulation of sharp bounds for these designs, because sharp bounds often provide useful information on the evaluation of direct and indirect effects (e.g. Cai *et al*. (2008)).

The following contributions were received in writing after the meeting.

**Jeffrey M. Albert** (*Case Western Reserve University, Cleveland*)

I commend the authors on a stimulating and clearly written paper, and one that is a welcome contribution to the limited literature on study designs for the assessment of mediation. For brevity, and to cover the main ideas, my comments are focused on the (two-part) parallel design (with direct or imperfect manipulation).

The two-part design has some appealing features. In particular, it clearly separates the goal of estimating the overall treatment effect (which is provided by the first experiment) and that of estimating direct and indirect effects (which are provided by the second experiment). However, because the second experiment does not contribute to the estimation of the overall treatment effect (except possibly with additional assumptions), an obvious drawback of the design proposed is the requirement of additional resources for the estimation of mediation effects. It may be argued that researchers, or funding agencies, should be willing to pay the price for this information. However, when resources are limited this may be a difficult sell. In contrast, in the standard ‘single-experiment’ design, for which the primary objective is usually inference for the overall treatment effect, mediation analysis is offered ‘for the same price’, albeit with additional strong assumptions. Of course, low power for testing mediation effects may require a boost in the sample size, but then inference for the overall treatment effect will also benefit. To allow a complete evaluation, the power implications of the proposed *versus* standard designs could use further investigation.

It is notable that, even with the additional investment that is represented by the two-part design, the estimation of mediation effects still requires strong assumptions that are not assured by the randomization. These assumptions include that of no (individual level) treatment–mediator interactions and the ‘consistency assumption’ (assumptions 3 and 9). It is interesting that assumptions 3 and 4 (or 8 and 9) essentially render *Z* an instrumental variable. The authors dismiss the instrumental variable approach; however, some generalized (e.g. two-stage least squares, extending Albert (2008)) approach may be possible without having to assume no direct effect of *T* on *Y* (noting that multiple instrumental variables may be obtained from the multiple-category *Z*). Unfortunately, the assumption of no direct effect of *Z* (as well as *T*) on *Y* may be implausible in many situations, in which case it is not clear whether it is worth trading this assumption for that of sequential ignorability.

**John G. Bullock** (*Yale University, New Haven*) **and Donald P. Green** (*Columbia University, New York*)

Imai, Tingley and Yamamoto remind us that an intervention's ‘direct effect’ and ‘indirect effect’ are fundamentally unidentified. Both involve inherently unobservable potential outcomes. Not even a randomized experiment can render an estimate of quantities such as *Y*_{i}{*t*,*M*_{i}(*t*^{′})}. Yet they express optimism about our ability to learn about direct and indirect effects by coupling experiments with an array of supplementary assumptions. We applaud them for detailing the assumptions that are required to isolate causal mechanisms. But, when reflecting on applications in the social sciences, we remain sceptical about whether any experimental design will permit a researcher to estimate direct or indirect effects convincingly.

We are sceptical because the assumptions that are invoked by the authors are not directly testable: the ‘consistency assumption’ (assumption 3), the ‘no-interaction’ assumption (assumption 5) and the homogeneous unit effects assumption (on the eighth page). In practice, the list of assumptions in social science applications is even longer. First, social scientists routinely study mediation by using variables, such as beliefs or feelings, that are not observed directly. It is difficult to measure and manipulate a particular mediator without inadvertently measuring and manipulating other mediators as well. Measurement challenges are especially daunting given widespread reliance on survey measures; subjects are often invited to report beliefs or feelings, and their responses are used to measure the mediator and the outcome. Systematic response error that affects both mediator and outcome is a very real possibility.

Second, rarely can social scientists set specific values of a mediator (i.e. *M*(*t*)). At best, they intervene by using ‘encouragement’ designs like those which the authors discuss in Section 4. These designs force researchers to invoke additional untestable assumptions: most notably, the ‘exclusion restriction’, which says that encouragements affect outcomes solely through the intended mediator.

Few, if any, social science studies have satisfied or could convincingly satisfy these assumptions. Rather than attempt to estimate parameters that are fundamentally unidentified, let us set our sights on the still challenging task of estimating the causal effects of *T* and *M*. Even if we cannot know the indirect effect of *T*, we can still learn about its effects on hypothesized mediators, and we can learn the average effect of intervention-induced change in *M* on outcomes. The advantage of this approach is that it puts us on a firm experimental footing. After we have accumulated substantial knowledge about the effects of *T* and *M*, identification of causal mechanisms may become more plausible.

**Vanessa Didelez** (*University of Bristol*)

The importance of experimental design for causal inference is twofold. It can guarantee crucial assumptions; for example actual randomization allows identification of average causal effects. A careful design also clarifies, almost defines, the target of inference—this is especially relevant in the context of sometimes woolly notions of ‘causal mechanisms’.

The authors consider indirect or direct effects involving *Y*{*t*,*M*(*t*^{′})}. Setting treatment to *t* and *t*^{′} for the same unit is genuinely counterfactual. Consequently, although their designs improve on the single experiment, they cannot avoid untestable assumptions. Furthermore, do the designs proposed *define* the target of inference? The additional experiment in the parallel designs really targets the controlled direct effect, and the two require linking by untestable assumption 5. However, the crossover designs proposed clearly target *Y*{*t*,*M*(*t*^{′})}, and under untestable assumption 7 this comes close to observing *Y*{*t*,*M*(*t*^{′})} itself.

A different type of design is sometimes possible and clarifies the causal parameter in a decision theoretic context (Didelez *et al.*, 2006), namely when we can manipulate the mediator without controlling it, (almost) as if treatment were at two different values for the same unit. Consider, for example, double-blind placebo-controlled studies: these target the direct effect of an active ingredient not mediated by the patient's or doctor's expectation. Crucially, the mediator (the expectations) is not (and cannot) itself be controlled, but the design guarantees that it arises as under ‘drug taken’. One can easily think of variations addressing the indirect effect, here the placebo effect. This type of design seems feasible whenever ‘treatment’ comprises different aspects that could, with a little imagination, be separated out. Robins and Richardson (2010) used similar examples and (possibly hypothetical) interventions in augmented directed acyclic graphs to discuss when *Y*{*t*,*M*(*t*^{′})} can be regarded as a manipulable quantity. Does this mean that we observe *Y*{*t*,*M*(*t*^{′})} itself? Not necessarily: the design fails when there are post-treatment confounders (even if observed) of *M* and *Y*; in the placebo-controlled trial this is known as ‘unblinding’, e.g. by side effects of the active ingredient.

Looking at typical applications, it will be rare that crossover or placebo-type designs can be used. The interest in causal parameters based on *Y*{*t*,*M*(*t*^{′})} therefore remains a mystery to me—what *practical* questions does it help to answer that simpler approaches (causal chain or controlled effects) do not? If effect modification is the main problem, we should maybe direct more attention to investigating effect modification and design experiments accordingly.

**David Draper** (*University of California, Santa Cruz*)

The authors of this interesting paper have offered us some increased clarity on a difficult question: can we go beyond estimating the average effects of causes to correct identification of the actual underlying causal *mechanisms*? Their answer is a cautious yes, by employing designs they recommend that differ from those in widespread current use; I am less sanguine, for at least the following two reasons.

- (a)
It is distressingly easy to imagine experiments in which the authors’ assumption 3, which they correctly point out is crucial to their attempts at improved designs, does not hold. For example, consider an experiment in which the dichotomous treatment variable is a form of talk therapy aimed at behaviour modification to avoid out-of-wedlock pregnancy (*T*=1) or no such therapy (*T*=0), and the outcome variable is the number of sexual partners. To keep this example from being too stereotypically gendered, imagine a world in which an effective male contraceptive pill is available, and consider one of the authors’ designs in which use or non-use of this pill is the mediator to be manipulated, on a cohort of young men. It is a brave (and foolhardy) assumption in this setting to believe that a man who chooses to take the pill will behave identically to a man who is randomized to the pill with respect to the number of sexual partners that he seeks.

- (b)
Almost all the authors’ examples involve a single mediator, but what if (as will often be the case) two or more mediators are active (i.e. highly relevant to correct causal conclusions, because of strong correlations with the treatment and outcome variables) but you are aware of only one of them, and therefore—using one of the authors’ designs—you manipulate only the one that you know about? Then what looks to you like unexplained variability in the outcome may actually be bias (arising from having manipulated only one mediator), and this will potentially distort your causal conclusions.

A little more detail on the following point would also be helpful. The authors make frequent use of the expectation operator without saying what distribution the expectation is over: are we averaging over the randomization distribution that assigns subjects to experimental groups (holding the experimental subjects constant, and attempting to generalize only to what the results would have been if they had ended up in different groups), or over the distribution that is implied by the usual (often unstated, and often untrue) assumption that the subjects are like a random sample from the population to which we are actually trying to generalize, or what?

**Adam N. Glynn** (*Harvard University, Cambridge*)

This paper provides a thorough investigation of potential solutions to a difficult problem. As the authors note, much of this difficulty stems from the fact that, although designs with direct or indirect manipulation of the mediator provide more information about the mediation effect, these designs also require that the manipulation does not directly affect the outcome.

Interestingly, by clarifying these difficulties, this paper may lead researchers in non-experimental settings to reconsider whether mediation is the question that they want to address. For example, in observational studies of racial discrimination, the treatment could be conceptualized as the perception of race at the time of application (instead of race defined at birth as in the example from Section 3.2.4). This allows an applicant's qualifications to be incorporated in the analysis as pretreatment variables, and mediation analysis would not be necessary (see Greiner and Rubin (2011) for a discussion).

As another example, consider the conjecture known as the weak states mechanism—that natural resource abundance (e.g. oil or diamonds) might reduce the incentive for a state to develop the bureaucratic capacity that is necessary for taxation, and that this lack of state capacity might increase the likelihood of civil conflict (Fearon and Laitin, 2003). One reason why we might want to study this mechanism is to anticipate the effect of laws that would block the mediation effect (for example see the discussion of oil revenue management laws in Humphreys (2005)). However, any such intervention might have its own direct effects on the outcome and, therefore, the mediation effect may not necessarily represent the effect of interest.

It is unclear to me whether this paper will do more to encourage the use of good design or to dissuade questionable (and possibly unnecessary) attempts at mediation analysis. In either case, the authors have done a great service in clarifying the issues.

**Booil Jo** (*Stanford University*)

I congratulate the authors on their very important and stimulating contribution to the causal inference literature. Possibilities of manipulating mediators have been largely overlooked and, therefore, little knowledge has been accumulated so far about design possibilities in identifying causal mechanisms. It may seem that the proposed alternative experimental designs replace one untestable assumption with another set of untestable assumptions (that could be even stronger). However, these alternative experimental designs let us explore alternative identifying assumptions, the use of which is likely to improve the quality of our causal inference. As the authors emphasized, when the single-experiment design is the only option, the unavoidable choice of identifying assumption is sequential ignorability, which is not a desirable situation. The use of alternative designs and identifying assumptions opens up possibilities for diverse and improved sensitivity analysis strategies. Further, the authors demonstrated the use of encouragement, which not only makes implementation of the designs suggested more feasible but also improves the testability of some of the underlying identifying assumptions.

What seems somewhat unclear at this point is how the design strategies suggested will pan out in practice. The designs proposed will generally require larger sample sizes. This may not be feasible in many studies that must rely on small to moderate sample sizes. For example, in many medical and health-related experiments, recruiting a large sample is simply not feasible. The suggested parallel designs consist of two experiments, which inevitably require larger sample sizes. Even if recruitment is possible, the increased cost and practical issues that are related to having two experiments may discourage the use of the designs suggested. The crossover designs seem less costly, but the no-carry-over and consistency assumptions can be quite strong. To make these assumptions more testable, a larger sample is again needed to maintain the same level of statistical power (i.e. we need to include a group of individuals without mediator manipulation). I also suspect that we shall need some guidelines on ethical issues related to manipulation of mediators. Finally, I wonder how applicable the study designs suggested are. The examples that are used in the paper (transcranial magnetic stimulation and immigration) seem quite unique, making me somewhat unsure about the broad use of the designs suggested. I look forward to seeing more applications in diverse settings. I congratulate the authors again and hope that this paper will ignite further development of creative and practical study designs to elucidate causal mechanisms.

**David A. Kenny** (*University of Connecticut, Storrs*)

The paper is in the now rather old tradition of finding ways of estimating causal mechanisms by combining experimental and non-experimental approaches. I have three comments.

First, the authors’ approach is to estimate the indirect effect (IE) as the difference between the total effect of *T* on *Y* and the direct effect of *T* on *Y* controlling for *M*. Such an approach is implicit in Baron and Kenny (1986) and was formally described in Clogg *et al.* (1992). An alternative, less general, but currently quite widely utilized, strategy for the estimation of IEs is to estimate the IE as the product of two effects: the path from *T* to *M* or *a* and the path from *M* to *Y* or *b*. Where appropriate, knowing the sizes of *a* and *b* can be very informative. First, if the IE is near 0, it is useful to know whether path *a* or *b* (or both) is 0. For instance, if path *a* is 0 but *b* is not, then we know that the intervention failed to trigger the mediator. Second, the relative size of path *a* and *b* can be informative. Some mediators are ‘proximal’ in that they are closer to *T* (Hoyle and Kenny, 1999) whereas others are ‘distal’ in that they are closer to *Y*.
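In the linear, no-interaction setting that this comment has in mind, the product-of-coefficients estimator *ab* and the difference estimator (total effect minus direct effect) coincide exactly for ordinary least squares fitted on the same sample. The following simulation sketch illustrates this; the data-generating values (*a*=0.5, *b*=0.8, *c*=0.3) are purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical linear mediation model (illustrative values):
# M = a*T + e_M,  Y = c*T + b*M + e_Y, so the indirect effect is a*b.
a, b, c = 0.5, 0.8, 0.3
T = rng.integers(0, 2, n)
M = a * T + rng.normal(size=n)
Y = c * T + b * M + rng.normal(size=n)

# Path a: regress M on T (with intercept).
X_t = np.column_stack([np.ones(n), T])
a_hat = np.linalg.lstsq(X_t, M, rcond=None)[0][1]

# Paths c (direct) and b: regress Y on T and M jointly.
X_tm = np.column_stack([np.ones(n), T, M])
_, c_hat, b_hat = np.linalg.lstsq(X_tm, Y, rcond=None)[0]

# Total effect: regress Y on T alone.
total_hat = np.linalg.lstsq(X_t, Y, rcond=None)[0][1]

product_ie = a_hat * b_hat          # product method, a*b
difference_ie = total_hat - c_hat   # difference method, total - direct
print(product_ie, difference_ie)    # both estimate a*b = 0.4
```

The two estimates agree to numerical precision because, for nested least squares fits with intercepts on the same data, the short-regression coefficient of *T* equals *c*-hat plus *a*-hat times *b*-hat (Cochran's formula); the knowledge of *a* and *b* separately is what the difference method alone does not provide.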

Second, I think it highly unlikely that one ever has a ‘pure’ manipulation of *M* and so the authors’ consideration of such seems misplaced. In the tradition of Cook and Campbell (1979), a measure or manipulation is virtually never identical to the construct that it purports to measure. Moreover, mediators are typically inside the ‘black box’, and so they can be difficult to observe directly. It should also be realized that almost always the manipulation of *T* is one of ‘encouragement’, and so the use of encouragement is not a poor second choice. Rather it is what is almost always done.

Third, in cases for which we can assume continuous *M* and *Y*, no *TM* interactions and linear effects, I think that a single experiment can be undertaken in which both *T* and *M* are manipulated and measured. In such situations, the IE could be measured as the product of two effects, *ab*. The single experiment would yield a more precise estimate of the IE than the two-arm study proposed by the authors. The interested reader can consult Smith (1982) for an instructive example.

**Víctor Leiva and Emilio Porcu** (*Universidad de Valparaíso*)

This interesting paper deals with designs of randomized experiments to evaluate the treatment effect on a response under causality, where the treatment effect is the sum of the causal mediation (indirect) effect and the direct effect. Although the single design is one of the most commonly used methods for identifying causality, it is based on assumptions that are difficult to justify in practice. The paper proposes parallel and crossover experimental designs by means of which it is possible to manipulate the mediator that connects the treatment and response. These designs are based on a key assumption, consistency, which allows us to manipulate the mediator without directly affecting the response. These designs improve on the results from the single design.

Studies in diverse areas are usually causal and not associational. This makes standard statistical inference not suitable for these studies and so-called causal inference is needed instead. In general, because studies in these areas are usually observational and not experimental, it is somewhat complicated to justify parametric assumptions and so the use of semiparametric models seems to be more adequate. Indeed, there are examples where assuming parametric models implicitly leads to models that exclude *a priori* the null hypothesis of no causal effects; see Robins and Wasserman (1997). In spite of these difficulties, some efforts on the use of parametric models in causal inference, including non-normal distributions, have been made; see Shimizu and Kano (2008).

In parametric modelling, it is well known that outliers produce undesirable effects on the estimates of the model parameters, influencing their behaviour. Then, it is important to have tools that allow us to assess such influence. A method known as local influence provides us with an instrument to detect the effect of small perturbations in the model on the parameter estimates; see, for example, Leiva *et al.* (2007) and references therein. Because the problem of influence could also be present in causal models, with similar consequences, the idea of influence diagnostics could be explored in the class of models analysed in the paper.

Outcomes, mediators (such as ‘anxiety’) and direct effects can be accumulated in a similar way to that generated by a fatigue process, which acts under stress. Then, the data-generating process could be well explained by a process of this kind and so a non-normal model, such as the Birnbaum–Saunders distribution, might be considered in causal analyses of the type studied in the paper; see Leiva *et al.* (2007).

**N. T. Longford** (*SNTL and Universitat Pompeu Fabra, Barcelona*)

Statistical literature is replete with poorly founded claims of having identified causes and generated some understanding of causal mechanisms. This paper is a commendable effort to add scientific rigour to the discourse about causal mechanisms and to the design for studying them. However, the framework presented is not particularly constructive, because the numerous assumptions, although well motivated, are presented in the form of imperatives—if a particular setting departs from a required assumption, the edifice that is essential for the inference crumbles. The fact that some assumptions are unverifiable, or even untestable, adds to the difficulties. A more constructive approach would define metrics for departures from the assumptions and allow for some form of arbitration about how great a deviation from the assumption (the ideal) is permitted without undermining the inference about the causal mechanism. For example, carry-over in a (clinical) crossover trial can rarely be regarded as absent (satisfying the relevant null hypothesis *H*_{0}), because such an absence corresponds to an unsupportable *H*_{0}. Failure to reject *H*_{0} does not suffice here, even if we have ample evidence from elsewhere that the carry-over is sufficiently small for a different purpose.

I think that the limitation of the presented methodology to very simple causal mechanisms is not made clear. A unit (or link) of a causal mechanism is a direct cause without a mediator. All the examples discussed are mechanisms comprising two units. In more realistic settings, there are many interrelated mediators, and the framework presented would entail a large set of interrelated experiments and randomizations. For example, in the study of attitudes to immigration, having been abroad, having contemplated living there, having acquaintances among immigrants, having an occupation that involves international contacts, and the influence of the (self-selected) media outlets are relevant factors, most of them beyond our ingenuity and resources to manipulate. To study a causal mechanism (the verb ‘to identify’ is misleading because it implies a verdict with certainty that cannot be arrived at by a hypothesis test on a finite sample), we must have the ability to manipulate each mediator in a way that is described by the assumptions (extended to settings with several mediators), and that is a rather tall order.

**David P. MacKinnon** (*Arizona State University, Tempe*)

Imai, Tingley and Yamamoto link experimental designs and modern causal inference, thereby clarifying limitations about what experiments can demonstrate regarding a mediating mechanism. This important work is applicable to the many areas where researchers seek understanding of how a manipulation affects an outcome. I do not agree that the single-experiment design is how mediating mechanisms are identified. The search for mediating mechanisms is addressed by a programme of experimental research, replication studies, history and qualitative data, conducted by different researchers in different research contexts (MacKinnon, 2008). It is unlikely that any one study, even the ideal experiment designs that are described in the paper, would be sufficient to identify a mediating process (because of type II errors, for example). A programme of research is also critical to deal with other considerations, such as the requirement of valid and reliable measures, sample representation of the population of interest and selection of the position in a chain of mediation to investigate.

Given the strong assumptions that are necessary for identifying mediating mechanisms, it would seem surprising that mediating mechanisms can be found. However, research that is focused on predicted and observed patterns of results in different contexts is how mediating processes have been identified in the past. A few notable mediating mechanisms are atomic theory in chemistry, gene theory in genetics and cognitive dissonance theory in social psychology. In the social sciences, several designs that are closely related to those in the paper have been used to test logical predictions of mediation theory (Mark, 1986; MacKinnon, 2008; MacKinnon and Pirlott, 2010). In the social science literature, the parallel and encouragement designs correspond to blockage and enhancement designs, where additional conditions are specified that should lead to larger or smaller effects on outcomes depending on whether the mediator was enhanced or blocked. Also related are double-randomization designs, whereby a manipulation is conducted and a mediator and outcome measured, and then a second randomization addresses the mediator-to-outcome link. Other designs attempt to demonstrate specificity for a mediation process by predicting mediation through a hypothesized mediator and not through a comparison mediator. Useful future research would clarify the causal assumptions of these additional designs, including methods to address the sensitivity of conclusions to assumptions. Another valuable next step is the application of these experimental designs to answer important substantive questions with real data, in collaboration between substantive researchers and statisticians.

**Jorge Mateu** (*University Jaume I, Castellón*), **Oscar O. Melo** (*National University of Colombia, Bogotá*) **and Carlos E. Melo** (*District University Francisco José de Caldas, Bogotá*)

Identifying causes is the goal of most scientific research. We can design research to create conditions that are very comparable so that we can isolate the effect of the treatment on the dependent variable. In this way, research designs that allow us to establish these criteria require careful planning, implementation and analysis. Many times, researchers must leave one or more of the criteria unmet and are left with some important doubts about the validity of their causal conclusions, or they may even avoid making any causal assertions.

We would like to draw the authors’ attention to a particular problem that could benefit from this strategy. To improve further on the crossover design, the results can be extended to models with the observed pretreatment covariates *X*_{i}. Then, the average indirect effect, using the same notation as the authors’, is given by

E[*Y*_{i}{*t*,*M*_{i}(1)}−*Y*_{i}{*t*,*M*_{i}(0)}|*X*_{i}=*x*]

for *t*=0,1 and all *x* ∈ *χ*, and where *M*_{i}=*M*_{i}(*T*_{i}) denotes the observed value of the mediator that is realized after the exposure to the treatment, ℳ is the support of *M*_{i} and the two potential values *M*_{i}(0) and *M*_{i}(1) are the effects of the treatment on the mediator. During the second period of the experiment, the treatment status is 1−*T*_{i} for each unit, and the value of the mediator equals the observed mediator value from the first period, *M*_{i}. So, in the second period, the observed outcome can be written as *Y*_{i}{1−*T*_{i},*M*_{i}(*T*_{i})}. The following assumption is satisfied under the crossover design because the treatment is randomized:

{*Y*_{i}(*t*^{′},*m*),*M*_{i}(*t*)}⊥*T*_{i}|*X*_{i}=*x*

for *t*,*t*^{′}=0,1. Additionally, Robins (2003) and Imai *et al.* (2010) considered the identification. In this case, it should satisfy the following consistency assumption:

if *T*_{i}=*t* then *M*_{i}=*M*_{i}(*t*) and *Y*_{i}=*Y*_{i}{*t*,*M*_{i}(*t*)},

where it is also assumed that Pr(*T*_{i}=*t*|*X*_{i}=*x*)>0 and Pr{*M*_{i}(*t*)=*m*|*T*_{i}=*t*,*X*_{i}=*x*}>0 for *t*=0,1 and all *x* ∈ *χ* and *m* ∈ ℳ. To this consistency assumption, the absence of carry-over effects can also be added, i.e.

*Y*_{i2}(*t*,*m*)=*Y*_{i1}(*t*,*m*)

for *t*=0,1 and all *m* ∈ ℳ, where the second subscript indexes the period of the experiment.
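The identification logic of the crossover design discussed above can be checked in a small simulation: under consistency and no carry-over, the second period, in which treatment is flipped to 1−*T*_{i} while the mediator is held at its first-period value, supplies the otherwise unobservable counterfactual arm. The data-generating process below is entirely hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical potential outcomes (illustrative, not from the paper):
# M_i(t) ~ Bernoulli(0.3 + 0.4*t),  Y_i(t, m) = 1.0*t + 2.0*m + u_i.
u = rng.normal(size=n)
M0 = (rng.random(n) < 0.3).astype(float)
M1 = (rng.random(n) < 0.7).astype(float)

def Y(t, m):
    return 1.0 * t + 2.0 * m + u

T = rng.integers(0, 2, n)            # randomized treatment, period 1
M_obs = np.where(T == 1, M1, M0)     # observed mediator M_i(T_i)

# Period 1: observe Y_i{T_i, M_i(T_i)}.
y1 = Y(T, M_obs)
# Period 2 (consistency, no carry-over): treatment flipped to 1 - T_i,
# mediator held at its period-1 value, giving Y_i{1 - T_i, M_i(T_i)}.
y2 = Y(1 - T, M_obs)

# Average indirect effect under treatment,
# delta(1) = E[Y{1, M(1)}] - E[Y{1, M(0)}]:
# the first term comes from period 1 of the T = 1 arm,
# the second from period 2 of the T = 0 arm.
delta1_hat = y1[T == 1].mean() - y2[T == 0].mean()
print(delta1_hat)  # truth here: 2.0 * (0.7 - 0.3) = 0.8
```

Randomization of *T* makes each arm representative of the whole sample, so the two sample means estimate the two population terms of the indirect effect; if carry-over were present (period 2 outcomes drawn from a different outcome function), the same estimator would be biased.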

**Alessandra Mattei** (*University of Florence*)

Imai, Tingley and Yamamoto provide a valuable contribution on a subject that is just as attractive as it is challenging: understanding causal mechanisms. They focus on natural direct and indirect effects, which are defined as a function of potential outcomes of the type *Y*_{i}{*t*,*M*_{i}(*t*^{′})}, *t*≠*t*^{′}, usually named ‘*a priori* counterfactuals’, because they cannot be observed for any subset of units in a specific experiment.

To embed natural direct and indirect effects in the potential outcomes framework formally, the primitive concepts and the basic assumptions for causal inference should be generalized to make potential outcomes of the form *Y*_{i}{*t*,*M*_{i}(*t*^{′})} well-defined objects. Specifically, natural direct and indirect effects require that the intermediate variable *M* could be, at least in principle, regarded as an additional treatment. Therefore, assumptions on the compound assignment mechanism for the multivariate treatment variable (*T*, *M*) should be contemplated.

The parallel and crossover (encouragement) designs that are proposed by the authors imply that (partial) interventions on the intermediate variable are conceivable. My feeling is that, if we are willing to entertain hypothetical interventions on the intermediate variable, it could be more reasonable to design a single experiment posing a compound assignment mechanism for the treatment variable and the mediating or encouragement variable: alternative causal paths could be investigated, and various hypotheses on the causal mechanism could be assessed.

Another crucial issue concerns the assumptions of consistency and no carry-over effects, which allow us to carry out extrapolation of *a priori* counterfactuals for units on which the data contain no or little information, using data either across units for the same time or across time from the same unit. As the authors also recognize, these assumptions may be controversial: the experiment to which a unit is assigned may make a difference, and also time may matter, implying that treatment comparisons across time lack causal interpretation.

In my view, to understand clearly the nature of alternative identifying assumptions and to obtain useful insights on how to design experiments aiming at disclosing causal mechanisms, preliminary analyses based on the principal stratification framework could be valuable. A principal stratification analysis naturally provides information on the extent to which a causal effect of the treatment on the primary outcome occurs together with a causal effect of the treatment on the intermediate outcome, without involving *a priori* counterfactuals or identification and estimation strategies based on extrapolation methods.

**Emilio Porcu and Víctor Leiva** (*Universidad de Valparaíso*)

The paper deals with parallel and crossover designs as alternatives to the single design; these are useful when the mediator that connects the treatment and outcome can be manipulated. The difference between the two designs is that experimental units are assigned to one of two treatments at random (parallel) or sequentially assigned to two treatments (crossover) by using the manipulation of the causal mediator. These experimental designs rest on the consistency assumption, which supposes that the manipulation of the mediator does not directly affect the outcome. By means of an example analysed in the paper, the effect of media framing on the subjects’ immigration preferences is tested, using anxiety as the mediator. Because the manipulation of anxiety is imperfect, the parallel design is used, turning out to be more informative than the single design.

**T. S. Richardson** (*University of Washington, Seattle*) **and****J. M. Robins** (*Harvard School of Public Health, Boston*)

This is a thought-provoking paper that proposes several new approaches to probing mediation. It is an attractive feature of the authors’ designs that their analyses are based on counterfactual independences that hold as a consequence of randomization.

In the context of a single-intervention study where *T* alone is randomized, in several references, the following independence assumptions have been entertained on the basis of substantive hypotheses:

- {*Y*(*t*,*m*),*M*(*t*)}⊥*T* (16)

and

- *Y*(*t*,*m*)⊥*M*(*t*)|*T* (17)

for all *t*,*m* ∈ {0,1}. We have computed bounds on the average pure (or natural) direct effect (here E[*Y*{1,*M*(0)}−*Y*{0,*M*(0)}]; see expression (3) in the main text) under these assumptions (Robins and Richardson (2011), appendix C). In the above expressions we have implicitly assumed that there is a particular well-defined joint intervention that sets *M* to *m* and *T* to *t*.

Note that expression (16) follows from the assumption that *T* was randomized. By contrast, assumption (17) will hold in contexts in which there is no confounding between *M* and *Y*.

In contrast, no consistent test exists for the ‘cross-world’ counterfactual independence:

- *Y*(*t*,*m*)⊥*M*(*t*^{′})|*T*, *t*≠*t*^{′} (18)

even if we are willing to make assumption 3 and can carry out the parallel design. Note that expression (18) is required to obtain point identification of *E*[*Y*{*t*,*M*(*t*^{′})}] via Pearl's mediation formula.

More generally, Robins (1986) and Robins and Richardson (2011) gave a general framework for formulating causal models under which all counterfactual independence restrictions are in principle subject to experimental verification in the way that is outlined here.
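To make Pearl's mediation formula concrete: for binary *M* it reads E[*Y*{*t*,*M*(*t*^{′})}] = Σ_{m} E[*Y*|*T*=*t*,*M*=*m*] Pr(*M*=*m*|*T*=*t*^{′}). The sketch below evaluates it on simulated single-experiment data in which the required independences hold by construction; all numerical values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Simulated single experiment, binary T and M (illustrative model in
# which the mediation formula's independence assumptions hold by design):
T = rng.integers(0, 2, n)
M = (rng.random(n) < 0.2 + 0.5 * T).astype(int)  # Pr(M=1|T) = 0.2 + 0.5*T
Y = 1.0 * T + 2.0 * M + rng.normal(size=n)       # error independent of M given T

def mediation_formula(t, t_prime):
    # E[Y{t, M(t')}] = sum_m E[Y | T=t, M=m] * Pr(M=m | T=t')
    total = 0.0
    for m in (0, 1):
        e_y = Y[(T == t) & (M == m)].mean()
        p_m = (M[T == t_prime] == m).mean()
        total += e_y * p_m
    return total

# Average natural indirect effect delta(1) = E[Y{1,M(1)}] - E[Y{1,M(0)}]:
delta1 = mediation_formula(1, 1) - mediation_formula(1, 0)
print(delta1)  # truth here: 2.0 * (0.7 - 0.2) = 1.0
```

The point of the passage above is precisely that nothing in the single experiment certifies the cross-world independence (18) that licenses this plug-in formula: the same observed data could have been generated by a process in which (18) fails, in which case the formula would return a biased answer.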

**Donald B. Rubin** (*Harvard University, Cambridge*)

Imai, Tingley and Yamamoto are to be congratulated for addressing the challenging issue of direct or indirect causal effects using potential outcomes, a notation that was introduced by Neyman in 1923 (see Neyman (1990)) for repeated sampling, randomization-based inference in randomized experiments, and extended in Rubin (1974, 1975, 1977, 1978) to include general assignment mechanisms for treatments and other forms of inference. The condition for the notation's adequacy (e.g. discussed in Rubin (1978), pages 37–38) was eventually called the ‘stable unit treatment value assumption’ (Rubin (1980), page 591)—meaning that, no matter how the *i*th unit, *i*=1,…,*N*, was exposed to treatment level *t*, *t*=1,…,*T*, the outcome *Y*_{i}(*t*) would be realized, where this could be a probability distribution (Rubin (2010), page 40); potential outcomes are functions of units and treatments at defined times of assignment of treatments and measurement of outcomes.

One component of the stable unit treatment value assumption is ‘no interference’—explicit in this paper, but only implicit is the second component, ‘no hidden versions of treatments’, meaning that there are no levels other than those reflected in {1,…,*T*}, i.e. no levels that could lead to values of potential outcomes that are not represented in {*Y*_{i}(*t*),*i*=1,…,*N*;*t*=1,…,*T*}. With the authors’ notation indexing *Y* outcomes by treatments and mediators, the stable unit treatment value assumption implies that, given a fixed treatment level, say *t*, no matter how we force the mediator *M* to change its value for unit *i* from *M*_{i}(*t*) to another value, say *m*, the outcome would remain the same, which, if implausible for any *i*, makes the stable unit treatment value assumption implausible and thereby makes *Y*_{i}{*t*,*M*_{i}(*t*^{′})} functionally ill defined because of its multiple values, and thus makes estimands based on the notation ill defined, as argued in Rubin (1975), page 234, Rubin (1986) and Rubin (2010), pages 40–41.

To make the stable unit treatment value assumption plausible in this case, the essential conceptual task is to formulate an assignment mechanism, not only for treatment levels, but also for mediator levels given each treatment level (Mealli and Rubin, 2003), typically either ignorable (Rubin, 1978) or latently ignorable (Frangakis and Rubin, 1999); the former relies on apposite covariates—as in Nedelman *et al.* (2003); the latter typically relies also on principal stratification (Frangakis and Rubin, 2002)—as in Jin and Rubin (2008). Ill-defined notation and the jargon of direct and indirect effects distracts us from this essential, problem-specific, conceptual task—revealed by Fisher's using such jargon to justify covariance adjustment for observed values of mediators without consideration of assignment mechanisms for them (Rubin (2005), section 7).

**Marc Saez** (*University of Girona, and Consortium for Biomedical Research Network in Epidemiology and Public Health, Barcelona*)

I congratulate the authors for their splendid work. I think that they contribute in an important way to investigating the explanation of causal mechanisms. However, I am not very sure that they have succeeded, indeed, in identifying causal mechanisms. Although the theoretical argument of the two experimental designs that they propose is impeccable, the examples that they provide (i.e. Sections 4.1.3 and 4.2.3) do not satisfy the consistency assumption on which the parallel and crossover designs depend, namely that experimental subjects need to be kept unaware of the manipulation. Of course, this does not necessarily mean that the generalization of the parallel and crossover designs by allowing for imperfect manipulation does not help to identify, effectively, average natural indirect effects but, perhaps, the choice of the examples was not successful. So I would like to ask the authors to show an example with, maybe, fewer assumptions. In any case, I think that the authors have contributed in an excellent way to establishing the theoretical foundations of the identification of causal mechanisms, particularly when it is perfectly possible to manipulate an intermediate variable.

**Michael E. Sobel** (*Columbia University, New York*)

I congratulate Imai, Tingley and Yamamoto for proposing creative experimental designs to help to identify pure direct and indirect effects. This is challenging because there are no observations *Y*_{i}{*t*,*M*_{i}(*t*^{′})} (*i* denotes subject, *T*=*t* denotes assignment to treatment *t* and *M*_{i}(*t*^{′}) is the mediator when *t*≠*t*^{′}), yet one must identify *E*[*Y*{*t*,*M*(*t*^{′})}]. Identifying sequential ignorability assumptions (several are referenced in the paper) have been given, but these are typically substantively unreasonable. The authors avoid these in the parallel design by adding to the usual ‘single-experiment design’ a second experiment with both treatment assignment and the mediator randomized, thereby identifying controlled effects *E*[*Y*(*t*,*m*)]. Still, additional assumptions are needed to identify pure direct and indirect effects; the authors assume no interaction at the unit level. This is also very strong, and often not credible. They acknowledge this, developing sharp bounds for the parallel design that hold without this assumption. Their modified crossover design is nice, and the assumptions, although strong, seem more possible to meet. Similar remarks apply to the encouragement versions.

Direct and indirect effects reflect processes involving causation, providing useful information about the role of the mediator in the relationship between treatment assignment and response. But even leaving aside how one might, in the spirit of this paper, define and formalize the notion of a causal mechanism, and what it would mean to have a probabilistic causal mechanism (or should it be causal probabilistic mechanism?), it is useful to recognize that identification and estimation of direct and indirect effects need not reveal much about a causal process at work.

Consider the following hypothetical, deliberately oversimplified example. Suppose that there is a function *g*(**x**,*t*,*m*), where possibly *g*(**x**,*t*,*m*)=*g*(**x**,*t*,*m*^{′}) for every (**x**,*t*) and (*m*,*m*^{′}), such that and The indirect and direct effects are respectively

- (19)

- (20)

Suppose that the authors’ crossover experiment can be used to identify these effects. We can then obtain good estimates of these, with little knowledge of mechanisms: we do not know how *g* and *M* combine, nor the causal relationship between *g* and *M*, nor even that there is such a *g*.

The example suggests the difficulty of learning about causal mechanisms, even with the improved experimental designs in the paper. Unless the science is already strong, it may prove very difficult to do so. That said, Imai, Tingley and Yamamoto have made a very nice contribution, and certainly a step in the right direction.

**Tyler J. VanderWeele** (*Harvard University, Cambridge*) **and Richard A. Emsley** (*University of Manchester*)

Imai, Tingley and Yamamoto are to be congratulated for fine methodological work which has provided experimental designs and theoretical results that together allow researchers at least sometimes to identify the sign of a mediated effect without any assumptions beyond so-called ‘consistency’ (see VanderWeele and Vansteelandt (2009) and VanderWeele (2012)), contrasting with prior work on bounds (Sjölander, 2009; Kaufman *et al.*, 2009; Robins and Richardson, 2010). They achieve these results by relying on fairly complex experimental designs, such as running two trials, one in which treatment is randomized and another in which both treatment and mediator are randomized, or alternatively trials in which both treatment and mediator can be re-randomized without carry-over.

Although their designs have considerable identification power, they would, in many settings, be difficult to implement in practice. There is a trade-off between the complexity and practicality of the design on the one hand and strength of assumptions that must be employed to assess mediated effects on the other. A more common setting than the designs that they have considered is one in which treatment has been randomized in one trial, and the mediator has been randomized in another trial, possibly even with a different population from that of the first trial. The effect of treatment on the mediator and the outcome can be assessed in the first trial; the effect of the mediator on the outcome can be assessed in the second. Such designs lack the identification power of those considered by the authors and must make additional assumptions such as no interaction in expectation, cross-world independence and transportability when two different populations are used in the two experiments. But such designs would be easier to implement in practice and could even make use of existing trials and published data. We have been developing methods for such settings elsewhere (Emsley and VanderWeele, 2012). Although these methods do not allow for the identification of mediated effects without very strong assumptions, they can be useful in informing sensitivity analyses for these mediated effects. Such an approach constitutes an intermediate between the extremes of merely relying on observational studies and sensitivity analysis (Imai *et al.*, 2010; VanderWeele, 2010) or alternatively employing the complex experimental designs that were presented in the paper under discussion. However, when the parallel and crossover designs described by Imai, Tingley and Yamamoto are possible to implement, they clearly constitute a superior and more rigorous approach to assessing causal mechanisms.
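The two-trial strategy sketched above can be illustrated with a toy simulation. Under the strong assumptions named in the paragraph (no interaction in expectation, transportability across the two trial populations), the indirect effect is estimated by the product of the treatment-to-mediator effect from the first trial and the mediator-to-outcome effect from the second. All models, coefficients and numbers below are hypothetical, chosen only to show the arithmetic; this is not Emsley and VanderWeele's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Trial 1: treatment T randomized; the mediator M is measured.
t1 = rng.integers(0, 2, n).astype(float)
m1 = 0.5 * t1 + rng.normal(size=n)            # true T -> M effect: 0.5

# Trial 2 (possibly a different population): mediator M randomized.
m2 = rng.normal(size=n)
y2 = 1.2 * m2 + rng.normal(size=n)            # true M -> Y effect: 1.2

alpha_hat = m1[t1 == 1].mean() - m1[t1 == 0].mean()   # T -> M, from trial 1
beta_hat = np.polyfit(m2, y2, 1)[0]                   # M -> Y, from trial 2

# Valid as an indirect effect only under no interaction (in expectation)
# and transportability across the two trial populations.
indirect_hat = alpha_hat * beta_hat                   # close to 0.5 * 1.2 = 0.6
```

The product combines estimates from two separate randomizations, which is why the cross-world and transportability assumptions, untestable in either trial alone, carry all the weight.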

The **authors** replied later, in writing, as follows.

We begin by thanking a total of more than 25 scholars from various disciplines for their valuable contributions. The fact that such a large number of contributions have been submitted reflects the interdisciplinary importance and challenges of identifying causal mechanisms. Given the limited space, we shall focus on several common themes and reserve for future occasions our specific responses to the other points raised by each discussant.

*Should scientists conduct causal mediation analysis?*

Some discussants believe that the efforts to improve the credibility of causal mediation analysis may not be so worthwhile. There appear to be two main reasons for this scepticism: one fundamental and the other more practical. The fundamental criticism is that our primary estimand, the average causal mediation effect (ACME), is of limited scientific value and thus we should instead focus on some other quantity. Some contributors (e.g. Didelez and Egleston) propose as an alternative the average controlled direct effect (ACDE), defined as *E*[*Y*_{i}(1,*m*)−*Y*_{i}(0,*m*)]. Others (e.g. Mealli and Rubin) argue for the principal strata direct effect (PSDE), such as the dissociative effect *E*[*Y*_{i}(1)−*Y*_{i}(0)|*M*_{i}(1)=*M*_{i}(0)].

As explained in our paper, the ACDE represents the effect of manipulating both the treatment and the mediator to specific values and thus is not directly informative about the causal process through which the treatment affects the outcome. In contrast, the ACME formalizes the notion of a causal process by considering the counterfactual outcome values that would be realized if the mediator changed as it naturally would in response to the treatment. Putting aside the terminological issue of what should be labelled a ‘causal mechanism’ (e.g. Sobel), scientists across disciplines very often aim to learn about causal processes. This is because scientists care not only about *changing* the world by means of external intervention, but also about *understanding* the way that the world works.
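The contrast between the two estimands can be made concrete with a small simulation over fully specified, entirely hypothetical potential outcomes: when the outcome model contains a treatment-mediator interaction, the ACME (which lets the mediator vary as it naturally would) and the ACDE (which pins the mediator to a fixed value) answer visibly different questions. The shift of 1 in the mediator and the interaction coefficient below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical potential values of the mediator under control and treatment.
m0 = rng.normal(size=n)      # M_i(0)
m1 = m0 + 1.0                # M_i(1): treatment shifts the mediator by 1

# Hypothetical structural outcome model with a treatment-mediator interaction.
def y(t, m):
    return 0.5 * t + (1.0 + 0.5 * t) * m

# ACME: change the mediator as it naturally would, holding treatment fixed.
acme_1 = np.mean(y(1, m1) - y(1, m0))    # = 1.5 under treatment
acme_0 = np.mean(y(0, m1) - y(0, m0))    # = 1.0 under control

# ACDE at m = 0: manipulate both treatment and mediator.
acde_0 = float(y(1, 0.0) - y(0, 0.0))    # = 0.5, silent about the mediating process
```

The ACDE at *m* = 0 equals the pure manipulated effect of treatment and says nothing about how much of the treatment's effect travels through the mediator, which is exactly what the two ACME quantities capture.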

In the job market discrimination example that is discussed in our paper, social scientists are often interested in uncovering the causal process that leads an African-American applicant to receive fewer job opportunities. Their goal is to understand actual corporate hiring practices in the hope that such understanding will shed light on the nature of discriminatory behaviour in society and more generally among human beings. Does discrimination arise from the perceived difference in qualifications between black and white applicants, or from the fact that the applicant is black? This is a descriptive (rather than prescriptive) causal question that can be most directly answered by quantifying the natural causal process.

Of course, we do not imply that other causal quantities such as the ACDE and PSDE are of little value to scientists. In the above example, the ACDE will be more useful than the ACME if the researcher is interested in the question of whether a policy intervention to improve the qualifications of minority applicants, say through a job training programme, increases their employment prospects. We emphasize that the experimental designs that are proposed in our paper all nest the standard experiment in which only the treatment is randomized, and the parallel design in particular can point-identify the ACDE without additional untestable assumptions. Moreover, the ‘augmented design’ that has recently been proposed by Mattei and Mealli (2011) for the estimation of the PSDE is also nested in our parallel encouragement design. Therefore, the designs that are proposed in our paper simply expand the realm of possibility for experimental investigations into causal mechanisms. In fact, no opportunity will be lost by adopting one of our designs instead of simply conducting a standard single experiment (except the loss of statistical power, which may be an important concern in some situations as pointed out by Albert, Jo and Lange).

In the end, we believe that scientists should ultimately determine their causal quantity of interest in light of the specific applied problems they face. In our view, the job of statisticians in causal investigations is twofold:

- (a)
to clarify the assumptions that are required for the identification of the causal quantities that scientists wish to estimate, and

- (b)
to devise new methodological tools such as alternative designs and estimation techniques that help scientists to infer these quantities from the observed data better.

The choice of causal quantities should depend on the particular scientific questions being asked. It is clear that scientists are often interested in the examination of causal processes, and the ACME addresses this question most directly.

The second, more practical argument against causal mediation analysis is that, even if the ACME is of scientific interest, scientists should refrain from studying it because the identification of the ACME requires untestable assumptions, which may be difficult to justify in many applied research settings (e.g. Bullock and Green, Didelez, Draper and Glynn). In particular, concerns are raised about the plausibility of assumptions such as consistency and the exclusion restriction. Although these assumptions should be taken seriously in applied research, we argue that the difficulty of causal mediation analysis should not be the sole reason to deter statisticians from working on related methodological problems. Typical applied research, especially in the medical and social sciences, invokes several untestable assumptions. For example, the use of the instrumental variables method is usually accompanied by the assumptions of monotonicity and exclusion restriction. Even in randomized experiments, the consistency assumption (i.e. the stable unit treatment value assumption) may not be entirely valid. These concerns should not imply that empirical findings based on such assumptions are to be completely discredited. If such a perspective were applied, there would be very few valid studies left in many of the disciplines in the social and medical sciences!

A more constructive approach would be to confront these methodological challenges directly. In general, there are at least two ways in which statisticians can help scientists in this regard. First, the lack of point identification does not necessarily imply the absence of information about the ACME. As we demonstrate in the paper, the sharp bounds on the ACME can be derived to quantify precisely how much one can learn from the observed data without untestable assumptions. Indeed, some of the contributors (e.g. Ramsahai, Richardson and Robins) have taken this approach in their contributions and others have applied it in other contexts (e.g. Manski (2007)). Second, sensitivity analysis can be conducted to investigate how robust one's empirical findings are to the potential violations of such assumptions (e.g. Longford). Although we did not discuss them in our paper, several sensitivity analysis methods have already been developed for causal mediation analysis under the standard experimental design (e.g. Imai, Keele and Tingley (2010), Imai, Keele and Yamamoto (2010), VanderWeele (2010) and Tchetgen Tchetgen and Shpitser (2011)) and for some of the designs proposed in our paper (Imai and Yamamoto, 2012).
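The point that untestable assumptions matter, and can be probed, can be illustrated with a toy simulation (a minimal sketch under assumed models, not the sharp bounds or the sensitivity analysis methods cited above). Here an unobserved confounder *U* of the mediator-outcome relationship violates sequential ignorability, so a naive product-of-coefficients estimate computed from the observed data converges to the wrong value even though treatment is randomized; the data-generating process and all coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical data-generating process: U confounds the mediator-outcome
# relationship; treatment T is randomized (so T is independent of U).
u = rng.normal(size=n)
t = rng.integers(0, 2, n).astype(float)
m = 0.5 * t + u + rng.normal(size=n)        # mediator: affected by T and U
y = t + m + u + rng.normal(size=n)          # outcome: Y(t, m) = t + m + U + noise

# True ACME = E[M(1) - M(0)] * 1 = 0.5 under the structural model above.

# Naive product-of-coefficients estimate that ignores U:
alpha_hat = m[t == 1].mean() - m[t == 0].mean()          # T -> M effect (unbiased)
X = np.column_stack([np.ones(n), t, m])
beta_m_hat = np.linalg.lstsq(X, y, rcond=None)[0][2]     # M -> Y coefficient (biased)
naive_acme = alpha_hat * beta_m_hat                      # converges to 0.75, not 0.5
```

Because *U* is unobservable, no test on (*T*, *M*, *Y*) alone can detect the bias; bounds quantify what the data do reveal, and sensitivity analysis traces how the estimate moves as the assumed degree of confounding varies.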

*Open methodological issues and future research agenda*

The methodological literature on causal mediation analysis has evolved rapidly over the last decade and we expect this trend to continue. Many of the contributors who accept the importance of causal mediation analysis suggest open methodological issues. We outline these and other challenges here in the hope that they guide future methodological research.

First, the main message of our paper is to draw attention to the ‘design-based approach’ to causal mediation analysis. Whereas prior research focused on various statistical methods under the standard experimental design, relatively little attention has been paid to the question of how to design randomized experiments differently to conduct causal mediation analysis with more credible assumptions. We hope that future research extends our work and develops alternative designs. Several contributors to this discussion appear to have already been moving in this direction by considering the use of covariates and other information (e.g. Albert, VanderWeele and Emsley, Hong, MacKinnon, Mateu and his colleagues and Saez; see also Section 3.1.2 of our paper). We look forward to seeing these new ideas in print. These new experimental designs are also important because they naturally serve as templates for observational studies. In Imai *et al*. (2011), we describe a couple of empirical studies in political science in which the researchers analyse the observational study analogue of the crossover design that is proposed in our paper. These studies focus on the estimation of incumbency advantage, a topic which is mentioned by one of our contributors (Gelman).

Second, another important area of future research concerns multiple mediators because applied researchers are often interested in investigating the relative importance of one mediator over another (e.g. Longford). The key idea behind the proposed experimental designs is to side-step this issue by manipulating one specific mediator of interest. However, as some contributors pointed out, in practice manipulating one mechanism in isolation may be difficult, leading to the situation where multiple mediators are affected by an intervention. For this reason, it is critical to develop statistical methods that directly deal with the presence of multiple mediators. For example, Albert and Nelson (2011) discussed model-based estimation strategies for path-specific effects in the presence of multiple mediators. In Imai and Yamamoto (2012), we develop semiparametric linear models and sensitivity analyses for the potential violation of required identification assumptions concerning multiple mediators.

Third, there may be alternative approaches to causal mechanisms that are quite different from what is discussed in our paper. Some contributors mention the use of a decision theoretic framework (e.g. Berzuini and Ramsahai). Another approach is based on the identification of sufficient causes, which is briefly discussed in our paper. These alternative approaches may shed new light on key methodological issues. For example, in his discussion, Ramsahai shows how to relax the deterministic assumptions that are made in our paper and examines the effects of doing so on the identification power of the designs proposed.

Finally, we conclude our discussion by emphasizing the importance of close collaboration between statisticians and applied researchers. As George Box succinctly put it, ‘the business of the statistician is to catalyze the scientific learning process’. Any study of causal mechanisms will be best designed by taking into account specific aspects of scientific theories under investigation. Although the experimental designs that are proposed in our paper may serve as a starting point, we believe that in many situations they must be modified to address directly the methodological challenges that are faced by the researcher. In particular, practical difficulties of causal mediation analysis can be overcome by technological advances (as in the neuroscience example in our paper) and creativity on the part of the researcher (as in the labour market discrimination example). Some contributors discussed potential applications and specific challenges that range from medicine and social sciences to engineering (e.g. Egleston, Gelman, Leiva and Porcu, and Kuroki).

The challenges of causal mediation analysis should therefore motivate, rather than discourage, scientists and statisticians who are working on this important problem. For many statisticians, the mantra ‘No causation without manipulation’, which was put forth by Holland (1986) more than two decades ago, has been a starting point of causal analysis. Although we agree on the fundamental importance of manipulation in any causal analysis, this mantra should not be taken as a commandment that forbids certain scientific inquiry. Recently, Judea Pearl proposed another mantra ‘Causation precedes manipulation’. This reminds us that manipulation is merely a tool that is used by scientists to identify causal quantities of interest. It is clear to us, and hopefully to readers, that statisticians should no longer be passively analysing the data collected by applied researchers. Rather, they must understand the causal mechanisms that are specified by scientific theories and work together with applied researchers to devise an optimal design for testing them.