The Use of Instrumental Variables in Peer Effects Models†

Abstract
Instrumental variables are often used to identify peer effects. This paper shows that instrumenting the 'peer average outcome' with 'peer average characteristics' requires the researcher to include the instrument at the individual level as an explanatory variable. We highlight the bias that occurs when failing to do this.


I. Introduction
Many papers in economics provide empirical evidence on the causal effect of peers on individual outcomes using an instrumental variable (IV) approach. They usually consider linear-in-mean regressions of an individual outcome on the corresponding average outcome of peers and a set of individual explanatory variables. They may then instrument the average outcome of peers with the peer average of certain characteristics.¹ As in any other standard linear regression, the IV estimator consistently estimates the causal peer effect if the instruments are as good as randomly assigned (independence), irrelevant in explaining the individual outcome except through the average peers' outcome (exclusion), and relevant in explaining the endogenous outcome averaged across peers (relevance).² The contribution of this paper is to highlight a subtle but important implication of the relevance assumption, something not explicitly recognized in this literature: the individual variable, say $x$, whose peer average, say $\bar{x}$, is used to instrument the peer average outcome $\bar{y}$ must be included as an individual explanatory variable of the dependent variable $y$. The idea is simple: if $\bar{x}$ is a valid instrument for $\bar{y}$, then $x$ must also be related to $y$ at the individual level. We show that failing to include the individual variable leads to inconsistent estimates. The only case when consistency holds is if peers are randomly allocated across individuals. However, even if peers are randomly allocated within clusters (e.g. schools) but not across clusters, the inclusion of cluster fixed effects (a necessity, as randomization takes place within clusters) renders the estimates inconsistent.³ While most applications of peer effects that use IV do include the instrument at the individual level and therefore avoid the inconsistency and bias described here, a number of papers have not done so. More generally, we have found no discussion of this issue in the literature.
Given the widespread use of IV in peer effects models, we argue that it is important to raise awareness of this among both econometricians and applied researchers.

II. The peer effects model
As the consistency of the instrumental variable estimation of a peer effect depends on whether cluster fixed effects are controlled for, we discuss both cases separately, and end with a formal proof of the asymptotic bias. To clarify what we mean by peers and clusters, consider the case where the peer group is defined by the classmates within schools: the peer effect is then the effect of the classmates, while the cluster fixed effect is the school fixed effect.

The case without cluster fixed effects
We follow the existing literature that almost exclusively specifies a linear-in-mean peer effects model and consider the following specification

$$y = \rho W y + u, \qquad (1)$$

where $y$ is the $N \times 1$ vector of the individual outcome, $W$ is an $N \times N$ row-standardized weight matrix describing the social ties between individuals, $\rho$ is the scalar peer effect parameter and $u$ is the residual error vector.⁴ Model (1) does not include the intercept but there is no loss of generality as long as all variables are expressed as deviations from their means. Furthermore, as we discuss below, the model can easily be adjusted to account for additional explanatory variables. The instruments for $Wy$ are defined as the peer average of characteristics $X$, i.e. $WX$. These must satisfy independence, exclusion and relevance. Exclusion assumes that the instruments $WX$ only affect $y$ through $Wy$, i.e. that there is zero correlation between the error term in model (1) and $WX$, or $\mathrm{corr}(WX, u) = 0$; relevance requires the instruments to explain variation in $Wy$, i.e. that $\mathrm{corr}(Wy, WX) \neq 0$. The IV estimation of the peer effect $\rho$, which we refer to as $\hat{\rho}^{IV}_{0}$, is then given by

$$\hat{\rho}^{IV}_{0} = [(Wy)' P_{WX} (Wy)]^{-1} (Wy)' P_{WX}\, y, \qquad (2)$$

where $P_{WX}$ is the projection matrix $(WX)[(WX)'(WX)]^{-1}(WX)'$. The IV estimator $\hat{\rho}^{IV}_{0}$ is equivalent to a two-stage least squares (2SLS) estimator where the first stage is the ordinary least squares (OLS) regression of $Wy$ on $WX$, and the second stage is the OLS regression of $y$ on the prediction of $Wy$ obtained from the first stage, i.e. $P_{WX}(Wy)$ (see e.g. Cameron and Trivedi, 2005). The peer effects literature that adopts this IV approach assumes that the individual outcome $y$ is not directly affected by peers' average characteristics $WX$, but it generally does not make any explicit assumption on whether the individuals' characteristics $X$ directly affect $y$. Appendix A shows that under the relevance and exclusion assumptions, it follows that $X$ directly affects $y$, and hence model (1) is misspecified because it omits $X$ from the explanatory variables.⁵ In other words, $X$ should be included as explanatory variables in model (1):

$$y = \rho W y + X\beta + \varepsilon, \qquad (3)$$

where we still omit the constant and assume that all variables, including $X$, are expressed in deviation from their mean. We therefore refer to equation (3) as the true model.⁶ By replacing $y$ in equation (2) with the right-hand side of equation (3), we can show that the estimator $\hat{\rho}^{IV}_{0}$ in equation (2) is inconsistent:

$$\hat{\rho}^{IV}_{0} = \rho + [(Wy)' P_{WX} (Wy)]^{-1} (Wy)' P_{WX} (X\beta + \varepsilon). \qquad (4)$$

Denoting $P_{WX}(Wy)$ with $(WX)\hat{\gamma}$, where $\hat{\gamma}$ is the OLS estimator of the coefficients of $WX$ in the first stage regression of $Wy$ on $WX$, and taking the probability limit, we obtain

$$\operatorname{plim} \hat{\rho}^{IV}_{0} = \rho + [\gamma' E((WX)'(WX))\, \gamma]^{-1} \gamma' E((WX)'(X\beta + \varepsilon)), \qquad (5)$$

where $\gamma = \operatorname{plim} \hat{\gamma}$, which is the vector of the true slope coefficients of $WX$ in the linear regression of $Wy$ on $WX$. This shows that the IV estimation is consistent if and only if $E((WX)'(X\beta + \varepsilon)) = 0$. We discuss this separately as $E((WX)'X) = 0$ and $E((WX)'\varepsilon) = 0$. The latter is the main assumption imposed by empirical studies that estimate peer effects by instrumenting the peer average $Wy$ with $WX$. The condition $E((WX)'X) = 0$ is satisfied when peers are randomly allocated across individuals. If, instead, peers are randomly allocated within clusters, but not across clusters, $X$ may have a different distribution across these clusters, leading to $E((WX)'X) \neq 0$ and potentially biasing the estimation. For example, university classmates can be randomly chosen from the students enrolled in a specific degree but not from other degrees, or university roommates can be randomly chosen within a college but not across colleges (see e.g. the review by Sacerdote 2001). Because students do not randomly select into different colleges or degrees, peers (i.e. class or roommates) are not necessarily randomly allocated across such clusters. Nevertheless, this potential inconsistency can be solved by controlling for the individual variables $X$ as in model (3), and adopting the following IV estimation

$$\hat{\rho}^{IV}_{1} = [(Wy)' P_{M_X(WX)} (Wy)]^{-1} (Wy)' P_{M_X(WX)}\, y, \qquad (6)$$

where $M_X = I - X(X'X)^{-1}X'$, $I$ is the identity matrix, and $P_{M_X(WX)}$ is the projection matrix $(M_X(WX))[(M_X(WX))'(M_X(WX))]^{-1}(M_X(WX))'$. The estimator $\hat{\rho}^{IV}_{1}$ is a standard two-stage least squares estimation applied to model (3) transformed by premultiplying all variables by $M_X$:

$$M_X y = \rho M_X W y + M_X \varepsilon, \qquad (7)$$

with instruments $M_X(WX)$, i.e. the original instruments $WX$ premultiplied by $M_X$.

Note that transforming model (3) by premultiplying each variable by $M_X$ is equivalent to replacing each variable with the residual from the regression of the variable itself on the explanatory variables $X$. By applying the Frisch-Waugh theorem, we can prove that the above transformation does not affect the estimation of the peer effect $\rho$. We refer to the estimation of $\hat{\rho}^{IV}_{1}$ as IV approach 1, i.e. the approach that includes the instrument at the individual level; we refer to the estimation of $\hat{\rho}^{IV}_{0}$ as IV approach 0, i.e. the estimation approach that omits the instrument at the individual level.

⁴ $W$ is generally constructed to have zero elements on the leading diagonal, ensuring that $Wy$ excludes the individuals themselves. We also assume that peer relationships are symmetric, so that $W$ is symmetric.
⁵ Appendix A shows this is true under plausible assumptions.
⁶ Here, we follow the existing literature that almost exclusively considers specifications in which all covariates enter additively and linearly (including the literature that does account for the instrument at the individual level; section III discusses some of the relevant literature). We use this specification when deriving the asymptotic bias below. However, we note that these derivations do not generalize to situations where the true model includes some other function of the instrument at the individual level (e.g. $X^2$ or $\ln(X)$). Hence, in such cases, the asymptotic bias is also likely to be different. Nevertheless, because the majority of studies specify the model as in equation (3), we derive the bias for this specification.
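The two estimators can be written down directly from their matrix formulas. The sketch below is only an illustration on hypothetical toy data (50 peer groups of 4, leave-one-out peer means, two characteristics; none of these values come from the paper). It checks two algebraic facts stated above: IV approach 0 coincides with the two-step OLS version of 2SLS, and IV approach 1 coincides with the coefficient on $Wy$ from a 2SLS regression of $y$ on $[Wy, X]$ with instruments $[WX, X]$.

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def iv_approach0(y, Wy, WX):
    """[(Wy)' P_WX (Wy)]^{-1} (Wy)' P_WX y: X is omitted from the model."""
    P = proj(WX)
    return float(Wy @ P @ y) / float(Wy @ P @ Wy)

def iv_approach1(y, Wy, WX, X):
    """Same formula after premultiplying y, Wy and WX by M_X = I - P_X."""
    M = np.eye(len(y)) - proj(X)
    return iv_approach0(M @ y, M @ Wy, M @ WX)

# Hypothetical toy data (all values are assumptions for illustration).
rng = np.random.default_rng(0)
groups, size, rho, beta = 50, 4, 0.3, np.array([1.0, -0.5])
N = groups * size
W_g = (np.ones((size, size)) - np.eye(size)) / (size - 1)   # leave-one-out mean
W = np.kron(np.eye(groups), W_g)          # row-standardized and symmetric
X = rng.normal(size=(N, 2))
X -= X.mean(axis=0)
eps = rng.normal(size=N)
y = np.linalg.solve(np.eye(N) - rho * W, X @ beta + eps)    # solves y = rho*W y + X beta + eps
Wy, WX = W @ y, W @ X

rho0 = iv_approach0(y, Wy, WX)
rho1 = iv_approach1(y, Wy, WX, X)

# Check 1: approach 0 equals OLS of y on the first-stage prediction of Wy.
fitted = proj(WX) @ Wy
two_step = float(fitted @ y) / float(fitted @ fitted)

# Check 2: approach 1 equals the coefficient on Wy from 2SLS of y on [Wy, X]
# using instruments [WX, X] (Frisch-Waugh argument).
R = np.column_stack([Wy, X])
Z = np.column_stack([WX, X])
R_hat = proj(Z) @ R
tsls = np.linalg.solve(R_hat.T @ R, R_hat.T @ y)

print(np.isclose(rho0, two_step), np.isclose(rho1, tsls[0]))
```

Dense projection matrices are used only to mirror the notation; for realistic sample sizes one would work with regressions rather than $N \times N$ matrices.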

The case with cluster fixed effects
In applied work, peers are sometimes randomized within but not across clusters. For example, class peers are often randomly chosen from the set of children enrolled in a school, but because children do not randomly sort into schools, the distribution of individual characteristics $X$ is likely to differ between schools, leading to $E((WX)'X) \neq 0$ and potentially biasing the instrumental variable estimator $\hat{\rho}^{IV}_{0}$. Because randomization in such cases is within schools, analyses of these experiments necessarily include school (or cluster) fixed effects. We now show that failing to include the instrument at the individual level leads to inconsistent estimation of the peer effect in models with cluster fixed effects, even in cases where peers are randomized.
Consider the following fixed effects model:

$$y = \rho W y + D\alpha + \eta, \qquad (8)$$

where $D$ is the $N \times J$ matrix of binary cluster indicators, $J$ is the number of clusters, $\alpha$ is the corresponding vector of fixed effects and $\eta = X\beta + e$. Applying cluster-mean deviations, we can rewrite equation (8) as follows:

$$y_* = \rho (Wy)_* + \eta_*, \qquad (9)$$

where the subscript $*$ indicates that the variable is premultiplied by the orthogonal projection matrix $I - D(D'D)^{-1}D'$. In other words, model (9) is equal to model (8) with the variables transformed to indicate deviations from their cluster means (i.e. a within-cluster transformation).
Using the instrument $(WX)_*$, the IV estimator that fails to control for the individual variables $X_*$, i.e. IV approach 0, can be written as

$$\hat{\rho}^{*IV}_{0} = [(Wy)_*' P_{WX_*} (Wy)_*]^{-1} (Wy)_*' P_{WX_*}\, y_*, \qquad (10)$$

where $P_{WX_*}$ is the projection matrix $(WX)_*[(WX)_*'(WX)_*]^{-1}(WX)_*'$.⁷ With $\eta_* = X_*\beta + e_*$, this converges in probability to

$$\rho + [\gamma_*' E((WX)_*'(WX)_*)\, \gamma_*]^{-1} \gamma_*' E((WX)_*'(X_*\beta + e_*)), \qquad (11)$$

where $\gamma_* = \operatorname{plim}\, ((WX)_*'(WX)_*)^{-1}(WX)_*'(Wy)_*$ is the effect of the instruments $(WX)_*$ on the peer average outcome $(Wy)_*$. Hence, consistency of equation (11) requires that $E((WX)_*'(X_*\beta + e_*)) = 0$. Under random assignment of peers across individuals, the individual vector of characteristics $X$ is uncorrelated with $WX$. This is because $WX$ is the peer average excluding the individual herself, and the random assignment of peers implies that $X$ is independently and identically distributed (i.i.d.) across individuals. Nevertheless, random assignment within clusters does not imply a zero correlation between the transformed variables $X_*$ and $(WX)_*$, i.e. between the within-cluster deviations of $X$ and $WX$, and therefore $E((WX)_*'X_*) = 0$ does not necessarily hold.
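A short calculation shows where this correlation comes from. It is an illustrative sketch under simplifying assumptions that are ours: a scalar $x_i$ that is i.i.d. with variance $\sigma^2$ within a cluster of $n_c$ members, a peer average $\bar{x}_{p,i}$ taken over $n_p$ cluster members other than $i$, and a symmetric $W$ so that the cluster mean of the peer averages equals the cluster mean $\bar{x}_c$. The symbols $\sigma^2$, $\bar{x}_{p,i}$ and $\bar{x}_c$ are introduced here for illustration only.

```latex
\begin{aligned}
\mathrm{Cov}\big(x_i-\bar{x}_c,\; \bar{x}_{p,i}-\bar{x}_c\big)
 &= \mathrm{Cov}(x_i,\bar{x}_{p,i}) - \mathrm{Cov}(x_i,\bar{x}_c)
    - \mathrm{Cov}(\bar{x}_c,\bar{x}_{p,i}) + \mathrm{Var}(\bar{x}_c) \\
 &= 0 - \frac{\sigma^2}{n_c} - \frac{\sigma^2}{n_c} + \frac{\sigma^2}{n_c}
  \;=\; -\frac{\sigma^2}{n_c}.
\end{aligned}
```

The untransformed covariance $\mathrm{Cov}(x_i, \bar{x}_{p,i})$ is zero, so the negative term arises entirely from subtracting the cluster mean, which contains both $x_i$ and the peers' values. It vanishes as $n_c$ grows.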
To prove this and without loss of generality, we consider a scalar exogenous variable $x_i$ and the corresponding scalar instrumental variable $\bar{x}$ […]

Generalizing the above proof to multivariate instruments, we can see that random assignment within clusters does not imply a zero correlation between $X_*$ and $(WX)_*$. Ultimately, this implies that the instrumental variables $(WX)_*$ will be correlated with $\eta_* = X_*\beta + e_*$, i.e. the error term in equation (9), biasing the instrumental variable estimation. Note that the bias is induced by the within transformation: it exists even if the untransformed instrumental variable $WX$ is unrelated to the untransformed errors $\eta$.⁸ Avoiding this bias is possible by including the instruments at the individual level, $X_*$, in the peer effects model, as in IV approach 1,⁹ considering the following model

$$y_* = \rho (Wy)_* + X_*\beta + e_*. \qquad (12)$$

The IV estimator for the peer effect can then be written as

$$\hat{\rho}^{*IV}_{1} = [(Wy)_*' P_{M_{X_*}(WX)_*} (Wy)_*]^{-1} (Wy)_*' P_{M_{X_*}(WX)_*}\, y_*, \qquad (13)$$

where $M_{X_*} = I - X_*(X_*'X_*)^{-1}X_*'$, and $P_{M_{X_*}(WX)_*}$ is the projection matrix on the space generated by the columns of $M_{X_*}(WX)_*$.¹⁰ By replacing $y_*$ in equation (13) with the right-hand side of equation (12), we can show that

$$\operatorname{plim} \hat{\rho}^{*IV}_{1} = \rho \qquad (14)$$

if $E((WX)_*'e_*) = 0$.
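A small Monte Carlo experiment makes the contrast between the two approaches concrete. The design below is a sketch with assumed values (1,000 clusters of 20 individuals, peer groups of 5 formed at random within each cluster, a true peer effect of 0.3 and a unit effect of $x$ on $y$); it is not the paper's application. Cluster means of $x$ differ, so cluster fixed effects are needed, and both estimators are computed on within-cluster deviations.

```python
import numpy as np

rng = np.random.default_rng(42)
J, n_c, m = 1000, 20, 5       # clusters, cluster size, peer-group size (assumed values)
rho, beta = 0.3, 1.0          # true peer effect and effect of x on y (assumed values)
G = n_c // m                  # peer groups per cluster

# x has a cluster-specific mean (non-random sorting into clusters),
# but peers are random within a cluster.
mu = rng.normal(scale=2.0, size=(J, 1, 1))
x = mu + rng.normal(size=(J, G, m))
e = rng.normal(size=(J, G, m))

def peer_mean(v):
    """Leave-one-out mean over the last axis (the peer group)."""
    return (v.sum(axis=-1, keepdims=True) - v) / (m - 1)

# Solve y = rho*W y + beta*x + e group by group; every group shares the same W block.
W_g = (np.ones((m, m)) - np.eye(m)) / (m - 1)
A_inv = np.linalg.inv(np.eye(m) - rho * W_g)
y = (beta * x + e) @ A_inv.T

def within(v):
    """Deviation from the cluster mean (within-cluster transformation)."""
    return (v - v.mean(axis=(1, 2), keepdims=True)).ravel()

ys, Wys = within(y), within(peer_mean(y))
xs, zs = within(x), within(peer_mean(x))

def resid(v, on):
    """Residual from a no-intercept OLS regression of v on one regressor."""
    return v - on * (on @ v) / (on @ on)

# Just-identified IV with one instrument: ratio of cross-products.
rho_iv0 = (zs @ ys) / (zs @ Wys)                          # omits x*
z_t, y_t, Wy_t = resid(zs, xs), resid(ys, xs), resid(Wys, xs)
rho_iv1 = (z_t @ y_t) / (z_t @ Wy_t)                      # controls for x*
cov_xz = (xs @ zs) / xs.size                              # induced by the within transformation

print(f"IV approach 0: {rho_iv0:.3f}  IV approach 1: {rho_iv1:.3f}  cov: {cov_xz:.3f}")
```

In this design the sample covariance between the transformed $x$ and its peer average should be close to $-1/n_c$, approach 0 should land well below the true peer effect, and approach 1 should land close to it. Other designs will give different magnitudes.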

Asymptotic bias
We next characterize the asymptotic bias. For this, we assume that equation (12) represents the true model (or equation (3) for the case without cluster fixed effects). However, if the true model specifies $y$ as some other function of the instrument at the individual level (e.g. $X^2$ or $\ln(X)$), the asymptotic bias will be different and hence, our derivations only refer to the case where $X$ enters the specification in an additively separable way. Assuming $E((WX)_*'e_*) = 0$, the asymptotic bias of the estimator $\hat{\rho}^{*IV}_{0}$ is given by $[\gamma_*' E((WX)_*'(WX)_*)\, \gamma_*]^{-1} \gamma_*' E((WX)_*'X_*)\, \beta$, as shown by equation (11) above. In general, it is difficult to predict its sign and magnitude because it depends on (i) the effect of the instrument at the individual level on the individual outcome, i.e. $\beta$, (ii) the effect of the instruments on the peer average outcome, $\gamma_*$, (iii) $E((WX)_*'X_*)$, and (iv) $E((WX)_*'(WX)_*)$. Nevertheless, we can characterize the asymptotic bias in the case with one instrument, as shown in the following Proposition.
⁸ The idea is similar to the 'Nickell bias' (Nickell, 1981) in dynamic models that include individual fixed effects, leading to a correlation between the lagged dependent variable and the mean deviation of the error term. However, whereas the Nickell bias reduces as the number of time periods increases, the bias of $\hat{\rho}^{*IV}_{0}$ reduces as the cluster size increases relative to the peer group, since the contribution of each peer to the cluster means becomes negligible.
⁹ Although the instrument at the individual level has to be included as an additional explanatory variable, the form in which it enters matters for the bias. As the existing literature mainly considers additively separable specifications, we characterize the bias for this case only in section 'Asymptotic bias'.
¹⁰ In addition to avoiding the bias discussed here, this approach also corrects for the 'exclusion bias' defined by Caeyers and Fafchamps (2016).

Proposition 1. Let us assume that the following conditions hold.

A1. Correct model specification:
The true model for $y_i$ is given by […] where the subscript $i = 1, \dots, N$ denotes individuals; $y_i$ and $x_i$ are demeaned; $\bar{y}$ […]

The proof is given in Appendix B. The above proposition shows that the asymptotic bias is inversely related to the effect of the instrument on the peers' average outcome, $\gamma_*$, and converges to zero if $n_c$ tends to infinity as long as $n_p$ remains bounded.¹¹ Similarly, the larger the peer group, $n_p$, the larger the bias. Notice that Assumption A2 implies that the size of peer groups is smaller than the size of the clusters, and this ensures that the bias does not explode. In the case where there is just one cluster, i.e. $n_c = N$, we have random allocation of peers across individuals and the asymptotic bias goes to zero as $N$ tends to $\infty$.
Note that IV approaches 0 and 1 can easily be adjusted to account for additional explanatory variables by extending model (12) to include covariates, say $Q_*$. The asymptotic results can be extended to this case by applying the Frisch-Waugh-Lovell theorem, which implies replacing $y_*$ with the residual of the regression of $y_*$ on $Q_*$, i.e. $M_{Q_*} y_* = [I - Q_*(Q_*'Q_*)^{-1}Q_*']\, y_*$, and similarly replacing $(Wy)_*$ with $M_{Q_*}(Wy)_*$ and $X_*$ with $M_{Q_*}X_*$. The conclusions remain unchanged, i.e. IV approach 1 provides a consistent estimation for the peer effect $\rho$, while IV approach 0 is inconsistent.

¹¹ The latter also holds for the 'exclusion bias', which Caeyers and Fafchamps (2016) show converges to zero when $n_c$ tends to infinity while $n_p$ remains bounded.

III. A brief discussion of the literature
Although we recognize that most empirical peer effects estimations include the instrument at the individual level, some papers have not. For example, Kang (2007) examines peer effects in students' maths attainment, estimating a school fixed effects model that uses peers' average science scores to instrument for peers' average maths scores, but excludes the individual's science score from the structural equation. Hence, despite students being quasi-randomly allocated from elementary to middle schools, not including the instrument at the individual level, combined with the inclusion of school fixed effects, leads to biased peer effects estimates. Similarly, Figlio (2007) investigates peer effects in students' disruptive behaviour, using the proportion of classroom boys with girls' names to instrument for peers' average behaviour, while adjusting for individual and grade fixed effects, but not including an indicator of whether the individuals themselves have a girl's name. Lundborg (2006) investigates peer effects in adolescent substance use, estimating school-grade fixed effects models that use various peer-level instruments, several of which are excluded at the individual level from the structural equation. For example, one of the instruments for peer average illicit drug use is the proportion of peers who indicate they know someone who could give or sell them drugs; and one of the instruments for peer average binge drinking is the proportion of peers who indicate their parents would provide beer if asked. These variables, however, are not included at the individual level.
As we discuss above, it is difficult to predict the sign and magnitude of the asymptotic bias as it depends on different factors. Nevertheless, we can comment on this to an extent. Equation (15) shows that the asymptotic bias has the same sign as $-\beta/\gamma_*$. Because it is generally true that the relationship between $x$ and $y$ at the peer group level also holds at the individual level, $\beta$ and $\gamma_*$ are of the same sign, implying the bias is negative. Furthermore, the magnitude of the asymptotic bias depends on the ratio $n_p/(n_c - n_p)$. This suggests that in primary schools, which tend to be smaller than secondary schools but have similar class sizes, one would expect to see larger biases if classes are defined as the peer group, all else equal.
As an example, consider the study by Kang (2007). The data include 4,813 students in 248 classes and 124 schools, suggesting that the average peer group (i.e. class) and school include 19 and 39 pupils respectively. The estimated $\gamma_*$ (i.e. the effect of the instrument in the first stage) is 0.64. If we assume that $\gamma_* \approx \beta$ (i.e. the effect of the instrument at the individual level on the individual outcome is similar to the first stage), the asymptotic bias approximates $-\frac{n_p}{n_c - n_p}\frac{\beta}{\gamma_*} = -0.95 \times 1 = -0.95$.¹² This suggests that the bias may be relatively large, indicating that it does matter whether the instrument at the individual level is included as a covariate or not. The peer effect is estimated to be around 0.3. Our back-of-the-envelope calculations suggest that this is an underestimate, with our estimate closer to 1.25. Although this is a large difference, we cannot comment on its significance. It is difficult to characterize the likely bias in Figlio (2007) and Lundborg (2006); their data contain approximately 76,000 and 3,000 students respectively, but they do not mention how many schools and classrooms they observe, and Lundborg (2006) does not report the first stage estimates. Furthermore, we note that the bias also depends on the extent to which our assumptions, listed in the proposition above, hold. Indeed, it relies on the true model being defined by equation (12), in the sense that $x_i$ enters the equation in an additively separable way, which may not be the case. Similarly, we assume that each individual has the same number of peers and the same number of cluster members, which is unlikely to be the case. The true data structure will therefore also affect the estimate of the bias.

¹² We do not know the true value of $\beta$, as this is precisely the parameter that is not estimated. In our illustrative application, presented in the Web Appendix, the ratio $\beta/\gamma_* = 0.332/0.290 = 1.145$. Hence, although this is tentative as this estimate is obtained from a different data set, it suggests that assuming $\beta = \gamma_*$ is a reasonable approximation.
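The back-of-the-envelope calculation for Kang (2007) can be reproduced in a few lines. The sketch below uses only the figures quoted above and the approximation $\beta \approx \gamma_*$; the function name `bias_factor` is ours, and rounding the average class and school sizes to whole pupils follows the text.

```python
def bias_factor(n_p: float, n_c: float) -> float:
    """Ratio n_p / (n_c - n_p) that scales the asymptotic bias."""
    if n_c <= n_p:
        raise ValueError("cluster size must exceed peer group size")
    return n_p / (n_c - n_p)

# Figures quoted in the text for Kang (2007).
students, classes, schools = 4813, 248, 124
n_p = round(students / classes)      # average class size
n_c = round(students / schools)      # average school size

beta_over_gamma = 1.0                # assumption: beta is close to gamma_*
bias = -bias_factor(n_p, n_c) * beta_over_gamma

reported = 0.3                       # peer effect reported in the study
implied = reported - bias            # estimate net of the approximate bias

print(n_p, n_c, round(bias, 2), round(implied, 2))
```

Replacing `beta_over_gamma` with the ratio 1.145 from the illustrative application would make the implied bias somewhat larger in absolute value, which shows how sensitive the calculation is to this assumption.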

IV. Conclusion
A popular approach to estimating peer effects in the economics literature is to fit linear-in-mean regressions of individuals' outcomes on the corresponding average outcomes of their peers. A common approach to deal with the simultaneity of the peer effect is to use IV, instrumenting the average outcome of peers with the peer average of certain characteristics. We show that the validity of the relevance assumption in this setting has a subtle but important implication: the instrument at the individual level must be included as an additional explanatory variable. We show that failing to do so leads to biased and inconsistent peer effect estimates. We demonstrate that the only case when consistency holds is if peers are randomly allocated across individuals. However, even then, the IV estimation remains inconsistent if the model includes cluster fixed effects in addition to the peer effect. Examples are those where randomization takes place within, but not across, schools or neighbourhoods, where the inclusion of school or neighbourhood fixed effects (a necessity, as randomization takes place within these clusters) renders the estimates inconsistent. In that case, the bias is induced by the inclusion of cluster fixed effects and the associated within-cluster transformation, something that has hitherto not been discussed in this literature. We present a simple solution: the instrument at the individual level must be included in the peer effects model. This leads to consistent peer effect parameter estimates under the assumptions required for IV.

Appendix A: Proof by contradiction
In the following, we prove that, if the instrumental variables WX satisfy the relevance and exclusion conditions for the estimation of the peer effect in model (1), then X directly affects y, and hence model (1) is misspecified because it omits X from the explanatory variables. The proof does not rely on any specific type of peer assignment.
As used in the spatial statistics and econometrics literature on peer effects (see e.g. Lee, 2007; Bramoullé et al., 2009), we can derive the reduced form of model (1),

$$y = (I - \rho W)^{-1} u, \qquad (A1)$$

where $I$ is the identity matrix of size $N$ and we assume that $|\rho| < 1$ and $\rho > 0$ so that the matrix $(I - \rho W)$ is invertible and the peer effect is positive. By using the series expansion $(I - \rho W)^{-1} = \sum_{s=0}^{\infty} \rho^s W^s$ we can then rewrite the reduced form model as

$$y = \sum_{s=0}^{\infty} \rho^s W^s u. \qquad (A2)$$

Given equation (A2), the symmetry of the matrix $W$ (because of the symmetry of peer relationships), and the fact that all variables are demeaned, we can prove that the covariance between $Wy$ and $WX$ is $\mathrm{Cov}(Wy, WX) = E\big(\sum_{s=0}^{\infty} \rho^s u' W^{s+2} X\big)$.
This implies that $WX$ are relevant instruments for $Wy$ only if the right-hand side of the above equation is different from zero. We can rewrite this as a sum of expectations, with weights given by $\rho^s$:

$$\mathrm{Cov}(Wy, WX) = \sum_{s=0}^{\infty} \rho^s E(u' W^{s+2} X). \qquad (A3)$$

Because $\rho^s > 0$, the above expression is different from zero if at least one of the following conditions holds: (i) $u$ depends linearly on $WX$; (ii) $u$ depends linearly on $W^h X$ for some $h > 1$ but does not depend linearly on $WX$; (iii) $u$ depends linearly on $X$. Condition (i) would invalidate the instrumental variable because the exclusion restriction would not be satisfied. Condition (ii) would imply that the outcome $y$ depends on the average of $X$ for peers separated by $h$ interactions¹³ but not on the average of $X$ for direct peers (i.e. peers separated by 1 interaction). This is unlikely, as it is implausible that peers separated by more than one interaction have a larger influence on the outcome $y$ than direct peers. This implies that condition (iii) must hold to guarantee that the right-hand side of equation (A3) is non-zero. In other words, $X$ and $u$ are correlated, implying that $X$ are omitted variables. The only situation when omitting $X$ would not bias the estimation of the peer effect is when there is no correlation between the instruments $WX$ and $X$.
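The two algebraic steps used in this appendix can be checked numerically. The sketch below works on a hypothetical toy network (6 peer groups of 4 with leave-one-out weights, which gives a symmetric, row-standardized $W$) and on arbitrary draws for $u$ and a scalar $X$. It verifies that the geometric series, started at the identity term $s = 0$ and truncated after many terms, reproduces $(I - \rho W)^{-1}$, and that the sample cross-product $(Wy)'(WX)$ equals the corresponding weighted sum of $u' W^{s+2} X$ terms.

```python
import numpy as np

rng = np.random.default_rng(1)
groups, size, rho = 6, 4, 0.3                 # hypothetical toy network
N = groups * size
W_g = (np.ones((size, size)) - np.eye(size)) / (size - 1)
W = np.kron(np.eye(groups), W_g)              # symmetric and row-standardized
u = rng.normal(size=N)
X = rng.normal(size=N)

# Truncated series sum_{s=0}^{S} rho^s W^s versus the exact inverse.
S = 80
series = sum(rho**s * np.linalg.matrix_power(W, s) for s in range(S + 1))
exact = np.linalg.inv(np.eye(N) - rho * W)

# Sample analogue of the covariance expression:
# (Wy)'(WX) = sum_s rho^s u' W^(s+2) X, which relies on W being symmetric.
y = exact @ u
lhs = (W @ y) @ (W @ X)
rhs = sum(rho**s * (u @ np.linalg.matrix_power(W, s + 2) @ X) for s in range(S + 1))

print(np.allclose(series, exact), np.isclose(lhs, rhs))
```

The check is purely algebraic: it holds for any draw of $u$ and $X$ and says nothing about whether the expectation of the cross-product is zero, which is what the conditions (i) to (iii) address.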