After years of finding that any one treatment does not affect all patients with a disorder in the same way, that a treatment highly effective for some may be ineffective or even harmful for others, and that, statistically significant or not, the effect sizes of many treatments tend to be small, emphasis in randomized clinical trials (RCTs) is gradually shifting (1) to increased focus on effect sizes [1-8] and (2) to discovery and documentation of moderators of treatment choice on outcome in RCTs, that is, personalized medicine [9-13]. In effect, the emphasis is shifting from a focus on populations as a whole to one in which the individual differences among the patients in the population are explicitly acknowledged and dealt with.
However, methodological problems continue to exist. There is often still some disagreement about the definition of a moderator and a lack of a clear distinction between a moderator and a predictor or a mediator of treatment outcome. Moreover, methods to assess the strength or impact of a moderator (effect size) in a way that is clinically and/or scientifically meaningful are lacking. Preacher and Kelley, in discussing similar issues for mediators, define an effect size as a measure that reflects a quantity of interest, scaled appropriately so as to be interpretable in the context of interest: a population parameter that can be estimated (with a confidence interval) from a sample so as to be unbiased, consistent, and/or efficient. Moreover, in order to use such an effect size to compare moderators (or mediators) of a treatment choice on an outcome in an RCT, the effect size must be such that one can reasonably compare the effect of one moderator (or mediator) to another for the same treatment choice and outcome.
Here, as a first step, we review the definitions of, and distinctions between, predictors, moderators, and mediators using the MacArthur approach [14, 16, 17] and discuss their importance both to scientific progress and to clinical decision making. We will then use the classic linear regression model for continuous outcome measures from the seminal Baron and Kenny paper on moderators and mediators, from which the MacArthur approach evolved, to develop one moderator effect size with wide, but not universal, applicability, one that carries statistical meaning but does not necessarily well reflect clinical significance. However, this approach is the one most commonly used and is very useful both in exploring for moderators and in seeking an optimal composite moderator by linearly combining individual moderators. Because each individual moderator is likely to have a small effect, such optimal composite moderators are crucial for clinical decision making.
First, let us clarify the definitions. A predictor of treatment response in an RCT describes a relationship between two variables: a baseline variable, M, and an outcome of treatment, O. For M to be a predictor of O, M must (1) precede O and (2) be correlated with O. Thus, in an RCT comparing two treatments, M may be a predictor of O in the treatment group (T1) or in the control/comparison group (T2), in both or in neither. The strength of prediction in either treatment group is usually conveyed by a correlation coefficient between M and O in T1 and in T2.
In contrast, a moderator or mediator of treatment in an RCT describes a relationship involving three random variables: here, the choice of treatment (T: T1 versus T2), the moderator or mediator (M), and the treatment outcome (O).
M is a moderator (also called an ‘effect modifier’) of T on O if (1) M precedes T, which precedes O, (2) M and T are uncorrelated, and (3) the effect size of T on O differs depending on the value of M. Thus, in RCTs, a moderator of treatment is a baseline variable (thus satisfying criteria (1) and (2)), which helps identify on whom or under what conditions T has a certain causal effect on O. For clinical applications, M could be used to stratify the population into subgroups: those where T1 is clinically preferable to T2 and those where T2 is clinically preferable to T1.
M is a mediator of T on O if (1) T precedes M that precedes O, (2) M and T are correlated, and (3) the effect size of T on O can be explained partially or totally by the effect of T on M. In RCTs, a mediator of treatment describes an event or change that occurs during treatment before outcome is determined, an event or change on which T has a causal effect, suggesting how or why T has the causal effect on O that it does. Ultimately, a mediator might be used to suggest modifications to treatment strategies in order to augment efficacy or effectiveness or to reduce cost.
Clearly, with these definitions, the same variable cannot be both a moderator and a mediator of the same T on the same O in the same population, for a moderator precedes and is uncorrelated with T, and a mediator follows and is correlated with T. Moreover, the implications of a moderator and of a mediator are completely different: a moderator suggests a stratification of the population, whereas a mediator provides insight into the treatment processes. However, there is a crucial connection between moderators and mediators of T on O. If one demonstrates that a certain baseline variable moderates the effect of T on O, what mediates the effect of T on O may be different in subgroups of the population defined by the moderator. Thus, for example, if gender moderates the effect of T on O, that is, the effect of T on O is different for men and women, how that effect is achieved (mediators) may also be different for men and women. Consequently, the search for moderators should logically precede that for mediators. The focus of this discussion is on moderators.
Clearly, moderators and mediators are important beyond the context of RCTs. The definitions can be applied to any variables M, T, and O where M and T precede O. For example, there is recent controversy about the finding that a gene (M) may moderate an environmental risk factor (T) in helping to explain the presence of a disorder (O) [19, 20]. However, in absence of randomization to T, the difficult problem exists of trying to prove absence of correlation between M and T. Here, we focus on RCTs, where we avoid such problems.
1.2 Importance of moderators of treatment on outcome in randomized clinical trials
Moderators of treatment outcome are important to basic medical research, because moderators (including biomarkers or biosignatures) may suggest the cause/etiology of the disorder being treated. Moderators are even more important to clinical research, because moderators identify for which patients a particular treatment choice should be made for maximal effectiveness and safety. Identification of moderators of treatment effect is particularly important to drug/treatment development. A drug may well be found statistically significantly better than placebo in two or more large RCTs and be approved by the Food and Drug Administration (FDA) but still have a small overall effect size. This small overall effect size may reflect not that the drug is of low effectiveness for all but that it is highly effective for perhaps most, but less effective than placebo or even harmful for a minority of the patient population, thus attenuating the overall effect size. If such a drug were prescribed in the entire population, the failures of that treatment in that minority would eventually become apparent to patients and clinicians. In absence of identification of that minority who would be harmed (using moderators of treatment response), the choice may be to avoid prescribing that drug entirely, or even for the FDA to impose a ‘black box’ label, or withdraw approval of that drug. Then, a safe and effective drug might be withheld from the majority of that patient population because of the harm done to a minority. The opposite is also possible: a treatment cannot be shown to be more effective than placebo overall but is highly effective within a subpopulation identifiable with a moderator. Because it is not shown to be more effective overall, the drug is not approved.
With knowledge of moderators of treatment choice on outcome, one can focus on prescribing that drug only in the subpopulation for which the drug is safe and effective and seek other drugs/treatments for the remaining patients.
How to test a null hypothesis specifying a single moderator of treatment outcome is reasonably well known. However, how do we define the strength of moderation, that is, the moderator effect size, either for scientific purposes or for assessment of clinical significance? If there are many moderators, how can we compare them, choose among them, or develop a composite moderator that might more strongly moderate the effect of T on O than any single moderator? Those are the issues of concern in this discussion.
1.3 Preliminary considerations
Very often a predictor of treatment outcome in both treatment groups is not a moderator of treatment. There are many predictors and far fewer moderators of treatment. Any baseline variable that only predicts placebo response, for example, will predict response in all treatments (because placebo response is part of a response to any treatment) but will not be a moderator of treatment choice in any RCT. However, a baseline variable that is a predictor of outcome in neither treatment group cannot be a moderator of treatment. Consequently, the search for moderators of T on O in a population might well begin with examining predictors of outcome in one or the other treatment group but with the knowledge that most will not be moderators. In what follows, we will use the term ‘non-specific predictor’ to describe a baseline variable M that is not a moderator of T on O but is a predictor of O in both treatment groups and ‘irrelevant to treatment outcome’ for a baseline variable that is a predictor in neither treatment group. Thus, all baseline variables can be classified into three mutually exclusive classes: those irrelevant to treatment outcome, those that are non-specific predictors, and those that are moderators. Only moderators might influence different choices between T1 and T2 for different patients.
It should also be recognized that a moderator of a treatment choice between, say, a drug and placebo, may not be a moderator for the choice between that same drug and an active comparator. Thus, in an RCT comparing more than two treatments, moderators need to be sought separately for each pair of treatments to be compared. Once that is carried out for each pair of treatments, the multiple treatments might be rank ordered.
A moderator of treatment choice in one population, say, among outpatients, may not be a moderator in another, say, among inpatients. In general, because the search for moderators is essentially the search for the sources of individual differences in response to treatment within a population, it will generally be easier to identify a moderator in a heterogeneous population, where individual differences are major, and using an outcome measure O that is highly sensitive to individual differences among patients in the RCT.
A moderator of treatment on one outcome, say, reduction in symptoms, may not be a moderator of treatment on another outcome, say, level of side effects. For this reason, the outcome measure to be used should preferably reflect the harm/benefit balance in individual patients [16, 22], again a measure as sensitive as possible to the crucial individual differences in clinical outcome among the patients in the population.
In short, the moderator relationship (as also for the predictor and mediator relationships) is specific to one population, one definition of T, M, and O, a consideration in designing RCTs or interpreting the results from RCTs. Whether the same relationship holds in other circumstances is an empirical question, the answer not to be taken for granted.
2 Parametric approach: the linear regression model
2.1 Developing a moderator effect size
The original, and still most common, approach to documenting that a specific M is a moderator [18, 23] is to use a linear regression model, in which it is assumed that
\[ O = b_0 + b_1 T + b_2 M + b_3 T M + e, \tag{1} \]
where O is the outcome variable, measured on a continuum. For clear interpretability of results [24, 25], T is coded +1 ∕ 2 for T1 and −1 ∕ 2 for T2, and M is standardized to have a mean of 0 and variance 1 (the same standardization in both treatment groups because M and T are uncorrelated). The error term, e, is assumed to be independent of T and of M and to have a normal distribution with variance σe². The regression coefficients are b0, the ‘intercept’; b1, the ‘main effect of treatment choice’; b2, the ‘main effect of M’; and b3, the ‘interaction effect’ of T by M. The standardized regression coefficients are defined as di = bi ∕ σe, i = 1,2,3.
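As an illustration only, the following sketch simulates data from the linear model with T coded +1/2 and −1/2 and recovers b0 to b3 from a separate least-squares line in each treatment group (for a single M, this is equivalent to fitting the model with the interaction term). The function names, the parameter values, and the choice of a standard normal M are assumptions made for the example, not part of the original analysis.

```python
import random

def simulate_rct(n_per_arm, b0, b1, b2, b3, sigma_e, rng):
    """Draw (T, M, O) triples from the linear model, with T coded +1/2 or -1/2."""
    rows = []
    for t in (0.5, -0.5):
        for _ in range(n_per_arm):
            m = rng.gauss(0.0, 1.0)       # assumption: M is standard normal
            e = rng.gauss(0.0, sigma_e)   # error independent of T and M
            rows.append((t, m, b0 + b1 * t + b2 * m + b3 * t * m + e))
    return rows

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_moderator_model(rows):
    """Recover b0..b3 from the two within-arm regression lines."""
    a1, s1 = fit_line(*zip(*[(m, o) for t, m, o in rows if t > 0]))
    a2, s2 = fit_line(*zip(*[(m, o) for t, m, o in rows if t < 0]))
    return {
        "b0": (a1 + a2) / 2,  # average intercept
        "b1": a1 - a2,        # separation of the lines at M = 0
        "b2": (s1 + s2) / 2,  # average slope
        "b3": s1 - s2,        # difference in slopes
    }

if __name__ == "__main__":
    data = simulate_rct(5000, b0=1.0, b1=0.4, b2=0.5, b3=0.6, sigma_e=1.0,
                        rng=random.Random(0))
    print(fit_moderator_model(data))
```

With the ±1/2 coding, b1 is the difference between the two intercepts and b3 the difference between the two slopes, which is why the two within-arm fits suffice here.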
Standard linear regression tests are based on the assumption that the M values are fixed, the same in all replications. There are no assumptions about the distribution of M, which can be binary, a 3,4,5 … point scale (e.g., a count or Likert scale) or a continuum and, if a continuum, may have a normal distribution in the population or not. However, in the moderator problem, M as well as O is a random variable, a fact that cannot be overlooked. With what instrument and on what scale M is measured and what its distribution is in the population sampled has a major effect on non-null distributions and on the clinical impact of the moderator.
For a single ‘a priori’ baseline variable, in the situation satisfying the linear model in Equation (1), Figure 1 shows the expected responses of patients at various levels of M, randomly assigned to the two treatment groups (T1 and T2). The vertical separation between the two lines at any value of M indicates the effect size of T1 versus T2 for patients with that value of M. In Figure 1(a) is shown M irrelevant to treatment outcome (M not a predictor in either treatment group), Figure 1(b), M a non-specific predictor (M a predictor in both treatment groups, but the effect size is the same for all values of M), and Figure 1(c), M a moderator (the effect size varies with M). The vertical separation at M = 0 indicates the main effect of T (b1), here the same in all three cases, b2 is the average slope of the two lines (zero in Figure 1(a) and the same non-zero value in Figure 1(b, c)), and b3 is the difference in the slopes of the two lines (zero in Figure 1(a, b) and non-zero in Figure 1(c)).
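The three panels can be summarized in a few lines of code. This is a sketch at the population level (no sampling error), assuming the linear model holds; the function names and the tolerance are hypothetical choices for the illustration.

```python
def coefficients_from_arm_lines(intercept_t1, slope_t1, intercept_t2, slope_t2):
    """Translate the two regression lines (T1 and T2) into b0, b1, b2, b3."""
    return {
        "b0": (intercept_t1 + intercept_t2) / 2,  # mean response at M = 0
        "b1": intercept_t1 - intercept_t2,        # vertical separation at M = 0
        "b2": (slope_t1 + slope_t2) / 2,          # average slope of the two lines
        "b3": slope_t1 - slope_t2,                # difference in the slopes
    }

def classify_baseline_variable(slope_t1, slope_t2, tol=1e-12):
    """Classify M from the population slopes of O on M in each treatment group."""
    if abs(slope_t1 - slope_t2) > tol:
        return "moderator"               # effect size varies with M
    if abs(slope_t1) > tol:
        return "non-specific predictor"  # same non-zero slope in both groups
    return "irrelevant"                  # predictor in neither group

if __name__ == "__main__":
    for slopes in [(0.0, 0.0), (0.5, 0.5), (0.8, 0.2)]:
        print(slopes, classify_baseline_variable(*slopes))
```

In a sample, of course, the slopes are estimated with error, so this classification would rest on tests and effect sizes rather than on exact equality.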
Some have suggested that b3 be used as the moderator effect size. However, b3 itself is not satisfactory, because changing the scale on which O is measured would change the effect size, but its impact would not change. Accordingly, others have suggested that standardized b3, d3 = b3 ∕ σe be used. However, d3 ignores the actual distribution of M in the population sampled, and its impact may be quite different when M is measured on a 2,3,4,5, … point scale from that when M is normally distributed, even if, with standardization, the means and variances of M are the same. Moreover, d3 ignores the fact that a moderator effect is superimposed on the main effects of T and M. The strength of the moderator should depend in part on how large those other effects are.
Because moderation focuses on the differential effects of treatment on individual patients, one might focus attention instead on pairwise differences among randomly selected patients, one from the T1 and one from the T2 group: ΔO = O1 − O2. According to the linear model,
\[ \Delta O = O_1 - O_2 = b_1 + b_2 (M_1 - M_2) + b_3 \, \frac{M_1 + M_2}{2} + (e_1 - e_2). \]
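Because M1 and M2 are drawn independently from the same distribution, the average of M, AM = (M1 + M2) ∕ 2, and the difference in M, DM = M1 − M2, are uncorrelated. Thus, with di = bi ∕ σe, the correlation coefficients between ΔO and, respectively, DM and AM are
\[ r(\Delta O, DM) = \frac{d_2}{\sqrt{1 + d_2^2 + d_3^2/4}}, \qquad r(\Delta O, AM) = \frac{d_3/2}{\sqrt{1 + d_2^2 + d_3^2/4}}. \]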
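A quick numerical check may help fix ideas. Under the linear model with Var(M) = 1, Var(DM) = 2, Var(AM) = 1/2, and Var(e1 − e2) = 2σe², which gives r(ΔO,DM) = d2 ∕ (1 + d2² + d3² ∕ 4)^½ and r(ΔO,AM) = (d3 ∕ 2) ∕ (1 + d2² + d3² ∕ 4)^½ with di = bi ∕ σe. The sketch below compares these closed forms with a Monte Carlo estimate from independent T1/T2 pairs; the function names, the parameter values, and the standard normal M are assumptions of the example.

```python
import math
import random

def closed_form_r(d2, d3):
    """r(dO, DM) and r(dO, AM) implied by the linear model with Var(M) = 1."""
    denom = math.sqrt(1.0 + d2 ** 2 + d3 ** 2 / 4.0)
    return d2 / denom, (d3 / 2.0) / denom

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulated_r(n_pairs, b1, b2, b3, sigma_e, rng):
    """Monte Carlo r(dO, DM) and r(dO, AM) from independent T1/T2 pairs."""
    d_o, d_m, a_m = [], [], []
    for _ in range(n_pairs):
        m1, m2 = rng.gauss(0, 1), rng.gauss(0, 1)  # assumption: M standard normal
        # b0 is omitted because it cancels in O1 - O2.
        o1 = 0.5 * b1 + b2 * m1 + 0.5 * b3 * m1 + rng.gauss(0, sigma_e)
        o2 = -0.5 * b1 + b2 * m2 - 0.5 * b3 * m2 + rng.gauss(0, sigma_e)
        d_o.append(o1 - o2)
        d_m.append(m1 - m2)
        a_m.append((m1 + m2) / 2)
    return pearson(d_o, d_m), pearson(d_o, a_m)

if __name__ == "__main__":
    print("closed form:", closed_form_r(0.5, 0.6))
    print("simulated:  ", simulated_r(20000, 0.4, 0.5, 0.6, 1.0, random.Random(0)))
```

Because only second moments of M enter the derivation, the closed forms do not depend on M being normal, although the clinical interpretation of a given value may.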
With r(ΔO,AM) as the moderator effect size, the effect size is invariant over linear transformations of either M or O. It is a number between −1 and +1, with null value 0, with greater magnitudes indicating stronger moderation. Moreover, if Cohen's d is used as the overall effect size comparing T1 with T2 in the population sampled,
\[ \text{Cohen's } d = \frac{d_1}{\sqrt{1 + d_2^2 + d_3^2/4}}, \]
whereas d1 = b1 ∕ σe is Cohen's d comparing T1 with T2 for the ‘typical patients’ with M = 0. Then
\[ \text{Cohen's } d = d_1 \sqrt{1 - r^2(\Delta O, DM) - r^2(\Delta O, AM)}. \]
Thus, r(ΔO,AM) as a moderator effect size clearly shows the attenuation of a moderator on the overall effect size.
If both r(ΔO,AM) and r(ΔO,DM) equal 0 (M irrelevant to T, Figure 1(a)), the overall treatment effect size, Cohen's d, is the effect size for individuals whatever their values of M in the population. If the r(ΔO,AM) equals 0 but r(ΔO,DM) is non-zero (M a non-specific predictor, Figure 1(b)), then the treatment effect size is the same for individuals with all values of M in the population (d1), but Cohen's d < d1, an attenuated effect because a non-specific predictor increases the within-group variance, and thus the overlap of the two groups’ overall distributions. However, if the r(ΔO,AM) is non-zero (i.e., M a moderator, Figure 1(c)), then the effect size is not the same for those with different values of M and both Cohen's d and the main effect of T, d1, poorly describe the effect of treatment for individuals within the population. Thus, the use of r(ΔO,AM) emphasizes the special importance of a moderator of treatment in an RCT to subsequent clinical decision making, is clearly interpretable, and is invariant under all linear transformations of M or of O.
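The three cases can be worked through numerically. The sketch below assumes the linear model, equal group sizes, and a pooled within-group standard deviation, under which the overall Cohen's d equals d1 ∕ (1 + d2² + d3² ∕ 4)^½; the function names and the values of d1, d2, and d3 are hypothetical.

```python
import math

def overall_cohens_d(d1, d2, d3):
    """Overall Cohen's d implied by the linear model (equal group sizes assumed)."""
    return d1 / math.sqrt(1.0 + d2 ** 2 + d3 ** 2 / 4.0)

def pair_correlations(d2, d3):
    """r(dO, DM) and r(dO, AM) under the same model."""
    denom = math.sqrt(1.0 + d2 ** 2 + d3 ** 2 / 4.0)
    return d2 / denom, (d3 / 2.0) / denom

# Illustrative values only: (d1, d2, d3) for the three panels of Figure 1.
CASES = {
    "(a) irrelevant": (0.5, 0.0, 0.0),
    "(b) non-specific predictor": (0.5, 0.8, 0.0),
    "(c) moderator": (0.5, 0.8, 1.0),
}

if __name__ == "__main__":
    for label, (d1, d2, d3) in CASES.items():
        r_dm, r_am = pair_correlations(d2, d3)
        d = overall_cohens_d(d1, d2, d3)
        # The attenuation identity: d = d1 * sqrt(1 - r_dm^2 - r_am^2).
        check = d1 * math.sqrt(1.0 - r_dm ** 2 - r_am ** 2)
        print(f"{label}: d = {d:.3f}, identity gives {check:.3f}, "
              f"r_DM = {r_dm:.3f}, r_AM = {r_am:.3f}")
```

In case (a) the overall d equals d1; in cases (b) and (c) it is smaller than d1, and only in case (c) does the treatment effect differ across values of M.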
To test the null hypothesis of absence of moderation, the usual test for the interaction effect (H0 : b3 = 0) in the linear regression model is simplest, is valid, and is most powerful. However, the strength of the moderator should be based on consideration of r(ΔO,AM). To estimate the effect size of the moderator effect, one might use bootstrap methods to obtain confidence intervals for r(ΔO,AM), or new methods might be developed. If multiple moderators are to be compared in the same populations for the same outcomes, these might be compared using estimates of r(ΔO,AM). In seeking an optimal composite moderator, a combination of moderators, the search might focus on maximizing r(ΔO,AM).
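One possible bootstrap procedure is sketched below; it is an illustration under stated assumptions, not the method used in any cited analysis. The sample r(ΔO,AM) over all N1 ⋅ N2 cross-group pairs is computed from within-group moments (the cross-group covariances vanish exactly over the full set of pairs), and a percentile interval is obtained by resampling patients within each treatment group. The function names and the percentile method are choices made for the example.

```python
import math
import random

def arm_moments(arm):
    """Second moments (1/n form) of a list of (M, O) pairs from one group."""
    n = len(arm)
    mean_m = sum(m for m, _ in arm) / n
    mean_o = sum(o for _, o in arm) / n
    var_m = sum((m - mean_m) ** 2 for m, _ in arm) / n
    var_o = sum((o - mean_o) ** 2 for _, o in arm) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in arm) / n
    return var_m, var_o, cov

def r_delta_o_am(arm1, arm2):
    """Sample r(dO, AM) over all N1*N2 cross-group pairs, from group moments."""
    vm1, vo1, c1 = arm_moments(arm1)
    vm2, vo2, c2 = arm_moments(arm2)
    # Cov(dO, AM) = (c1 - c2)/2, Var(dO) = vo1 + vo2, Var(AM) = (vm1 + vm2)/4.
    return (c1 - c2) / math.sqrt((vo1 + vo2) * (vm1 + vm2))

def bootstrap_ci(arm1, arm2, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling patients within each group."""
    rng = random.Random(seed)
    stats = sorted(
        r_delta_o_am(rng.choices(arm1, k=len(arm1)),
                     rng.choices(arm2, k=len(arm2)))
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

A call such as `bootstrap_ci(t1_patients, t2_patients)` would take two lists of (M, O) tuples, one per treatment group; those list names are placeholders.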
2.2 Multiple moderators
When there are multiple potential moderators of T on O in a population (all reasonably satisfying the linear model), the search for an optimal composite moderator should obviously focus, first and foremost, on those baseline variables for which there is a scientific rationale and justification. Moreover, because the moderator effect size is essentially a correlation coefficient, it will be attenuated by the unreliability of the measurement of O and of M. Thus, if there are multiple M's related to the same underlying construct, these M's should be combined in order both to increase the reliability of the measurement of that construct and to avoid problems associated with multicollinearity in combining them. In what follows, the multiple individual potential moderators will be assumed to be related to constructs that are judiciously selected, well measured, and relatively independent of each other. Interactions may be included by defining some M's as products of others.
Then, if one had multiple possible moderators of T on O in the same population that satisfied the linear model assumptions, one might randomly pair each of the N1 observations in the T1 group with each of the N2 observations in the T2 group (N1 ⋅ N2 pairs), compute ΔO = O1 − O2 for each pair and the paired averages of the multiple moderators AMi = (Mi1 + Mi2) ∕ 2, i = 1,2,3, … ,m, and use a multiple linear regression model to generate weights w1, w2, … ,wm so as to maximize the correlation between ΔO and \(\sum_{i=1}^{m} w_i \, AM_i\). The optimal composite moderator M * for an individual patient in T1 (j = 1) or T2 (j = 2) would then be \(M^{*} = \sum_{i=1}^{m} w_i M_{ij}\).
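The pairing procedure might be sketched as follows. This is a minimal illustration, not the implementation behind Table 1: it forms all N1 ⋅ N2 pairs explicitly, regresses ΔO on the paired averages by ordinary least squares (whose slopes maximize the correlation with ΔO among linear combinations), and solves the normal equations with a small hand-written elimination routine. All function names are hypothetical.

```python
import itertools

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def composite_weights(arm1, arm2):
    """Weights w maximizing corr(dO, sum_i w_i * AM_i) over all cross-group pairs.

    Each group is a list of (moderators, outcome); moderators is a list of m values.
    """
    pairs = [([(a + b) / 2 for a, b in zip(m1, m2)], o1 - o2)
             for (m1, o1), (m2, o2) in itertools.product(arm1, arm2)]
    n, m = len(pairs), len(pairs[0][0])
    mean_x = [sum(x[i] for x, _ in pairs) / n for i in range(m)]
    mean_y = sum(y for _, y in pairs) / n
    # Normal equations on mean-centered paired averages.
    xtx = [[sum((x[i] - mean_x[i]) * (x[j] - mean_x[j]) for x, _ in pairs)
            for j in range(m)] for i in range(m)]
    xty = [sum((x[i] - mean_x[i]) * (y - mean_y) for x, y in pairs)
           for i in range(m)]
    return solve(xtx, xty)

def composite_moderator(weights, moderators):
    """M* for one patient: the weighted sum of that patient's moderator values."""
    return sum(w * v for w, v in zip(weights, moderators))
```

Forming every pair costs N1 ⋅ N2 operations per pass, which is acceptable for trial-sized samples; the same weights could be obtained from within-group covariance matrices if that cost mattered.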
This procedure is preferable to one in which a multiple linear regression is used with O as the dependent variable and with T, the main effects of the multiple M's, and their interactive effects with T as the independent variables. Because the interaction terms cannot be entered without the main effects of their components, power is expended in estimating the weights for the main effects, reducing the power available for detection of the interactive effects (moderators). Most of the main effects relate, not to moderators, but to non-specific predictors. Consequently, because non-specific predictors are relatively more common than moderators, the index that results is likely to include more non-specific predictors than moderators, which will confuse the understanding of moderation.
However, all the foregoing considerations depend on having a situation well described by the linear model, and there are many situations in which the linear model does not hold well (e.g., a binary outcome measure). More important, as statistically appealing as r(ΔO,AM) may be with the linear model, the clinical impact of identifying a moderator M of T on O in a population is not necessarily well reflected in r(ΔO,AM). In Figure 1(c), for example, clearly for those patients with values of M above Mcross = − d1 ∕ d3 (the crossing point of the two lines), T1 is clinically preferred to T2, and for those below, T2 is clinically preferred to T1. However, in the population sampled, if the probability of falling below Mcross is small, no matter how large r(ΔO,AM), identification of M as a moderator of T on O will have little clinical impact: T1 will be preferred for almost all. Moreover, of two moderators with the same r(ΔO,AM), one may have more clinical impact simply because a greater proportion of the population falls below such a crossing point. It is important to recognize the limitations that a linear model places on identification of moderators and to separate considerations of clinical significance from statistical significance.
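To make the point concrete, the sketch below computes the crossing point Mcross = −d1 ∕ d3 and the proportion of patients falling below it. The normal distribution for the standardized M is an added assumption for the illustration (the text makes no such assumption), and the values of d1 and d3 are hypothetical.

```python
import math

def crossing_point(d1, d3):
    """Value of standardized M at which the T1 and T2 regression lines cross."""
    return -d1 / d3

def proportion_below(m_cross):
    """P(M < m_cross), assuming (for illustration) that M is standard normal."""
    return 0.5 * (1.0 + math.erf(m_cross / math.sqrt(2.0)))

if __name__ == "__main__":
    # Two hypothetical moderators with the same d3 (and, for equal d2, the same
    # r(dO, AM)) but different main effects of treatment d1.
    for d1, d3 in [(0.2, 0.4), (1.0, 0.4)]:
        m_cross = crossing_point(d1, d3)
        print(f"d1={d1}, d3={d3}: Mcross={m_cross:.2f}, "
              f"proportion below={proportion_below(m_cross):.4f}")
```

Since r(ΔO,AM) does not involve d1, the two hypothetical moderators are equally strong statistically, yet the share of patients for whom the preferred treatment changes differs greatly between them.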
3 A brief illustration
In 2011, Frank et al. reported an RCT comparing a psychotherapy treatment (interpersonal therapy (IPT)) with a drug treatment (selective serotonin reuptake inhibitors (SSRI)) for treatment of major depression. Thirty-two baseline variables were initially considered as possible moderators of treatment. Many of these were eliminated because of questionable clinical rationale, missing data, or substantial collinearity with other proposed moderators. We provide the eight remaining in Table 1, with the moderator effect size, r(ΔO,AM), computed for each individual variable. Also listed in Table 1 are the weights estimated for the optimal moderator M * , using the procedure described here (full details in ).
Table 1. Eight proposed moderators of treatment (interpersonal therapy versus selective serotonin reuptake inhibitors) for major depression. Columns: proposed moderator (e.g., number of episodes); individual moderator effect size, r(ΔO,AM); weight in optimal moderator, w.
In the original RCT, the overall effect size was approximately 0, and the effect size for the typical subject (M * = 0) too was approximately 0. However, when the sample was stratified by M * , the effect size for those 44% with M * > .03 (the crossing point of the T1 and T2 regression lines) was − 0.50, favoring IPT over SSRI, and for those 56% with M * ⩽ .03, was + 0.48, favoring SSRI over IPT. Although this result, obtained in exploratory data analysis, requires independent confirmation in an RCT designed for the purpose, this type of result could have potential major clinical and scientific ramifications, because it identifies which types of patients would benefit from which treatment.
There are several important methodological insights to be emphasized from this illustration. First of all, methods to identify individual moderators will not suffice to solve the problem. Individual moderator effect sizes are likely to be small; combining moderators is crucial. This is particularly true because the effect size of an individual moderator clearly does not predict well what weight it will carry in a combined optimal moderator. In turn, it is important not to attribute causality to such moderators, for many will only serve as ‘proxies’ for other stronger moderators not yet offered for consideration. The optimal moderator is not unique and can usually be improved upon with greater understanding of the individual differences among the patients in the population sampled.
The following summarizes our discussion:
An irrelevant baseline factor and a non-specific predictor can have no influence on making different choices between T1 and T2 based on O, because the same treatment is preferred for all and the same treatment effect size pertains for all individuals in the population regardless of the value of that baseline factor. Only a moderator can change that decision. A quantitative moderator, too, has no influence on choices between T1 and T2, although the size of the treatment effect on O may be large for some patients and trivial for others depending on their value of M; such a moderator may have scientific importance, if not clinical importance. With a qualitative moderator, T1 is preferred for some patients and T2 for others, and in extreme cases, the treatment effect sizes may range from very strong preference for T1, through trivial or no preference, to very strong preference for T2, depending on the individual values of M.
Where the linear model holds, we propose r(ΔO,AM) as a moderator effect size, where ΔO is the response difference between randomly paired patients and AM the average of their two values of M. This effect size might be used in identifying moderators and rank-ordering moderators.
To find an optimal composite moderator that combines individual moderators, one might seek to maximize the correlation between ΔO and a linear combination of paired averages on the individual moderators. This would focus attention away from identification of non-specific predictors and directly on moderators.
In addition to what might be learned in the future from understanding moderators of treatment, there is much to be learned from consideration of moderators of treatment about interpreting results from past RCTs.
Interpretation of the overall treatment effect size:
When patients within a sample from a population are randomly assigned to T1 or to T2, the overall effect size comparing T1 with T2 outcomes (here Cohen's d in the linear model approach) indicates the impact of prescribing T1 rather than T2 for all in the population, but if there are one or more unknown moderators, that overall effect size may be a poor indication of the impact of prescribing T1 rather than T2 for many individuals within that population.
Adjusting or controlling for M:
If a moderator M is used in a linear model (often described as ‘adjusting’ or ‘controlling’ for the effects of M) that includes the interaction effect and the model is properly centered, the main effect of T indicates the effect of T1 versus T2 for the ‘typical patient’ with M = 0, but it poorly indicates the effect of T1 versus T2 for other patients in the population. Indeed, the effect of T1 versus T2 may be positive for that typical patient but negative for many others within the population. Moreover, the overall effect size (here Cohen's d) comparing T1 with T2 will often differ from the effect size for the typical patient (here d1), not because one is right and the other is wrong but because they refer to different populations.
If, when M is a moderator and the interaction is included, the model is not properly centered, then the main effect of T (d1) may have no clinically meaningful interpretation. For example, in an RCT comparing two treatments for Alzheimer's disease (AD), where M is the age of onset of AD, coding M as age of onset in chronological age would mean that the main effect of treatment is that for those born with AD (M = 0), hardly a meaningful effect. If here, M were centered at the mean age of onset, the main effect of treatment is that for those with that mean age of onset (the typical patient), which conveys limited information but is meaningful.
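The effect of centering can be shown with simple arithmetic. In the sketch below, the coefficients (a treatment effect of 2.0 outcome units at the mean age of onset, changing by 0.5 units per year) and the mean age of onset of 70 years are hypothetical values chosen only to illustrate the point; the function names are likewise hypothetical.

```python
def effect_of_t1_vs_t2(age, b1_centered, b3, mean_age):
    """Expected T1-minus-T2 difference in O for a patient with this age of onset."""
    return b1_centered + b3 * (age - mean_age)

def main_effect_uncentered(b1_centered, b3, mean_age):
    """Coefficient of T when M is entered as raw age: the effect at age 0."""
    return effect_of_t1_vs_t2(0.0, b1_centered, b3, mean_age)

if __name__ == "__main__":
    b1_centered, b3, mean_age = 2.0, 0.5, 70.0  # hypothetical values
    print("main effect, M centered at mean age:", b1_centered)
    print("main effect, M coded as raw age:    ",
          main_effect_uncentered(b1_centered, b3, mean_age))
```

The fitted lines are identical under both codings; only the point at which the ‘main effect of T’ is read off changes, here from the mean age of onset to an age of onset of zero.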
Omitting the interaction term:
If, on the other hand, where M is a moderator of T on O, a linear model is used without the interaction term (thus assuming absence of moderation, perhaps using ANCOVA), unless treatment group sizes are equal, then the existing interaction effect will be partially remapped into the main effect of treatment, thus biasing it as an indicator of the effect of treatment on the typical patient, with the remaining portion remapped into the error term, attenuating power and the treatment effect size. Both type I and type II error probabilities may be increased, and once again, the interpretability of the treatment effect is compromised.
Stratifying the population:
Suppose, instead of a representative sample from the population, we took a sample stratified on a moderator M, say, an equal number of those in the majority and minority ethnic groups, and included both M and the M by T interaction, properly centered. This is often carried out in order to increase the representation in the sample of underserved minorities. If M (say, majority/minority status) moderates the effect of treatment on outcome, one can here estimate the effect size of treatment for each stratum and can test whether there is a moderator effect, but one cannot estimate the overall effect size, the effect size for the typical patient, or the moderator effect size in the absence of the sampling weights that connect the size of each sample stratum to its actual representation in the population. The main effect of treatment here in the linear model reflects the average of the effect sizes in the strata and may misrepresent the effect size in both the individual strata and overall.
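A small numerical sketch shows why the sampling weights matter. It works with raw T1-minus-T2 mean differences, which combine linearly across strata (standardized effect sizes would also require the within-stratum variances); the stratum differences and the 80%/20% population shares are hypothetical.

```python
def population_mean_difference(stratum_differences, population_shares):
    """Weighted T1-minus-T2 mean difference, using population shares as weights."""
    if abs(sum(population_shares) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(d * p for d, p in zip(stratum_differences, population_shares))

if __name__ == "__main__":
    differences = [0.6, -0.2]  # hypothetical: majority stratum, minority stratum
    print("equal-weight average (stratified sample):",
          population_mean_difference(differences, [0.5, 0.5]))
    print("population-weighted (80%/20%):           ",
          population_mean_difference(differences, [0.8, 0.2]))
```

Without the population shares, only the equal-weight average is available from the stratified sample, and it can differ appreciably from the population-weighted figure.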
In short, past RCTs that have used stratification factors in sampling or covariates in linear models to assess outcomes have often been carried out on the assumption that the stratification factors and/or covariates considered did not moderate treatment response. Where that assumption was wrong, the results may have been compromised.
Instead, initial RCTs might be carried out comparing T1 with T2 in a sample from the population of patients to which such a choice might pertain, powered adequately to test the overall effect of treatment. Hypothesis-testing results might be reported without consideration of baseline factors (i.e., no stratification, no adjusting, and no controlling). Once the overall results are established for that population, exploratory studies might be carried out on the same data considering all baseline variables (well measured and preferably relatively independent of each other) for which some scientific rationale and justification would suggest a role as a possible moderator. The goal would be to classify such baseline variables into those that are irrelevant, those that are non-specific predictors, and those that are moderators of T on O in that population.
If multiple moderators are found, efforts might be made to combine them into a single composite moderator to be used to decide clearly and without ambiguity which patients should be prescribed T1 and which T2.
As is always true of such exploratory studies, what results from such efforts are hypotheses, not conclusions. Independent confirmation of such hypotheses should be sought, either from other RCTs comparing the effects of T1 with those of T2 on O in independent samples from the same population or, better yet, from a confirmatory hypothesis-testing study specifically designed to test the hypothesized moderator. Such a hypothesis-testing study might well be based on a sample stratified on the moderator of interest to increase the power to detect the moderator effect and to increase the precision of estimation of the effect sizes, but part of the design then must include gathering the sampling weights necessary to estimate the moderator effect sizes from a stratified sample. In most cases, studies would analyze data using linear models. Once a moderator is confirmed, the crucial question is whether and how knowledge of that moderator would impact clinical decision making.
Once a moderator is confirmed and found to impact clinical decision making, the question of mediators would begin to take center stage: Discovery of such processes might suggest how one might improve upon whichever treatment is preferred. In this way, by targeting treatment appropriately (moderators) and by tailoring treatments optimally (mediators) using the individual characteristics of the patients, we might begin to realize the goals of personalized medicine, where treatment decisions are optimally made considering the individual characteristics of each individual patient.