## 1. INTRODUCTION

### 1.1. *“But Do Your Results Pertain to …?”*

Social scientists are faced with a dilemma because they are rarely able to gather data from the full range of contexts to which they hope to generalize (Shadish, Cook, and Campbell 2002). On the one hand, overly broad generalizations can be misleading when applied to populations that were not well represented by a sample. On the other hand, confining generalization to a target population from which a sample was randomly drawn can prevent research results from informing the full range of policies for which they might be relevant. The challenge “But do your results pertain to …?” is essential, yet it poses a quandary for social scientists.

Given this problem, the generality of any inference in social sciences is likely to be debated. But current debates are typically qualitative—either a sample represents a target population or it does not. And because generality is rarely certain, debates cast in qualitative terms will often be divisive. Proponents will claim that results generalize, and opponents will claim they do not. Furthermore, while there will rarely be consensus for any given policy, those in the middle must adjudicate in the qualitative terms in which the debate is cast.

Here we suggest that debates about the generality of causal inferences in the social sciences can be informed by quantifying the conditions necessary to invalidate an inference. In this sense we build on recent work in sensitivity analyses (Copas and Li 1997; Frank 2000; Gill and Robins 2001; Robins 1987; Rosenbaum 1987, 2001). But unlike other sensitivity analyses that focus on the robustness of inferences with respect to internal validity, we focus on the robustness of inferences with respect to external validity. Thus, after controlling for all relevant confounding variables (either through a randomized experiment or statistical control), we ask how heterogeneous parameters must be to invalidate inferences regarding effects.

We begin by differentiating the target population into two subpopulations: (1) a potentially observed subpopulation from which all of a sample is drawn, and (2) a potentially unobserved subpopulation from which no members of the sample are drawn (cf. Cronbach 1982) but which is part of the population to which policymakers seek to generalize. We then quantify the robustness of an inference from the observed data in terms of recomposition with the potentially unobserved subpopulation.
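The logic of recomposition can be illustrated with a simple weighted average (a sketch of the idea developed formally later in the article; the notation here is our own, and the numbers are hypothetical). If a proportion π of an ideal sample comes from the observed subpopulation, with estimated effect d_obs, and 1 − π comes from the unobserved subpopulation, with effect d_unobs, the combined estimate is π·d_obs + (1 − π)·d_unobs. The inference is invalidated when this combination falls below the quantitative threshold for inference.

```python
# Illustrative sketch (our own notation, not the article's): how large a
# share of the sample must come from an unobserved subpopulation with
# effect d_unobs before the combined estimate drops below a threshold?
def proportion_to_invalidate(d_obs, d_unobs, threshold):
    """Smallest unobserved share (1 - pi) such that the weighted average
    pi * d_obs + (1 - pi) * d_unobs falls to the threshold."""
    if d_unobs >= threshold:
        return None  # no recomposition with this subpopulation invalidates
    return (d_obs - threshold) / (d_obs - d_unobs)

# Hypothetical values: observed effect 13.14, unobserved effect 0, and a
# threshold of 4.59 (about 1.96 times a standard error of 2.34 -- an
# assumption made for illustration only).
share = proportion_to_invalidate(13.14, 0.0, 4.59)
print(round(share, 2))  # roughly 0.65 of the sample would need replacing
```

Solving π·d_obs + (1 − π)·d_unobs = threshold for the unobserved share gives 1 − π = (d_obs − threshold) / (d_obs − d_unobs), which is what the function computes.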

### 1.2. *From Causal Inference to Policy: The Effect of Small Classes on Academic Achievement*

The typical causal inference begins when an estimated effect exceeds some quantitative threshold (e.g., defined by statistical significance or an effect size). For the primary example of this article, consider the results from the Tennessee class size studies, which randomly assigned students to small and large classrooms to evaluate the effectiveness of small classes (Cook 2002; Finn and Achilles 1990; U.S. Department of Education 2002). As reported by Finn and Achilles (1990), the mean difference in achievement on the Stanford Achievement Test for reading for small classes (teacher-pupil ratios of 1:13–17, *n* = 122) versus all other classes (teacher-pupil ratios of 1:22–25, some with an aide, *n* = 209) was 13.14 with a standard error of 2.34.^{1} This difference is statistically significant. Finn and Achilles then drew on their analysis (including the statistical inference as well as estimates of effect sizes) to make a causal inference: “This research leaves no doubt that small classes have an advantage over larger classes in reading and mathematics in the early primary grades” (p. 573).
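The significance claim can be verified directly from the reported numbers: the t-ratio is the estimated mean difference divided by its standard error.

```python
# Values reported by Finn and Achilles (1990) for small versus other
# classes on the Stanford Achievement Test for reading.
mean_diff = 13.14  # mean achievement difference
se = 2.34          # standard error of the difference

# t-ratio: estimate divided by its standard error
t = mean_diff / se
print(round(t, 2))  # about 5.62, well beyond the ~1.96 cutoff for p < .05
```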

If Finn and Achilles' causal inference is correct, it might be reasonable to develop educational policy to reduce class size (e.g., the U.S. Elementary and Secondary Education Act of 2000, which allocated $1.3 billion for class size reduction). Attention then turns to the validity of the causal inference. First, though implementation of the random assignment may not have been perfect (Hanushek 1999) as is often the case (Shadish et al. 2002, chaps. 9 and 10), random assignment of classes to conditions likely reduced most differences between classrooms assigned to be small or not (Cook 2002; Nye, Hedges, and Konstantopoulos 2000). Therefore any overestimate in the effect of small classes is unlikely to be attributed to preexisting differences between the small classrooms and other classrooms (in fact, Nye et al. suggest that deviations from intended treatment may have led to an *underestimate* of the effects of small classes). This is the power of randomization to enhance internal validity (Cook and Campbell 1979).

Attention then turns to the generality of the results beyond the particular sample. Critically, Finn and Achilles (1990) analyzed only a set of volunteer schools, all from Tennessee. Thus, in the most restricted sense, their findings generalize only to schools from Tennessee in the mid-1980s that were likely to volunteer. And yet restricted generalization places extreme limits on the knowledge gained from social science research, especially experiments on the scale of the Tennessee class size study (Shadish et al. 2002:18; Cronbach 1982). Do the results of the Tennessee study mean nothing regarding the likely effects of small classes in other contexts?

The challenge is how to establish external validity by bridging from the sample studied to any given target population. Anticipating challenges to external validity, Finn and Achilles (1990, pp. 559–60) noted that the schools studied were very similar to others in Tennessee in terms of teacher-pupil ratios and percentages of teachers with higher degrees. In the language of Shadish et al. (2002), social scientists can then use this surface similarity as one basis for generalizing from the volunteer sample to the population of schools in Tennessee. But those challenging the generality of the findings could note that the volunteer schools in the study were slightly advantaged in terms of per-pupil expenditures and teacher salaries (Finn and Achilles 1990:559), and Hanushek (1999) adds that the treatment groups were affected by nonrandom and differential attrition (although Nye et al. [2000] argue that this likely had little effect on the estimates). Thus, even for this well-designed study, there is a serious and important debate regarding the generality of the causal inference.

Critically, the debate regarding the generality of the findings beyond the interactions for which Finn and Achilles (1990) tested is either disconnected from the statistical analyses used to establish the effect or essentially qualitative—the sample is characterized as representative or not. For example, the statistical comparison of schools in the Tennessee class size study with other schools in Tennessee may suggest surface similarity, but it does not quantify how results might differ if a sample more representative of all schools in Tennessee had been used. Similarly, critics suggesting that education in Tennessee is not like that in regions such as California (e.g., Hanushek 1999) argue in qualitative terms; they do not quantify the differences between their target population and the sample necessary to invalidate the inference that small classes generally improve achievement. Thus, in this article, we develop indices of the robustness of an inference by quantifying the sample conditions necessary to invalidate it.

In Section 2 we present theoretical motivations for robustness indices; in Section 3 we define an ideal or perfectly representative sample that includes cases from a potentially unobserved population as well as the observed cases; in Section 4 we derive robustness indices for the representation of a sample in terms of the sample recomposition; in Section 5 we apply our indices to the Tennessee class size study; in Section 6 we relate our indices to discussions of the theoretical breadth of external validity; in Section 7 we consider a baseline for our indices in terms of whether there must be a statistical difference between estimates from the observed and unobserved populations to make the original inference invalid; in Section 8 we consider a Bayesian motivation for our indices; in Section 9 we compare our indices with other procedures. In the discussion we emphasize the value of quantifying robustness, discuss the use of various quantitative thresholds for inference, and consider possible extensions. The conclusion extends a metaphor of a bridge between statistical and causal inference (Cornfield and Tukey 1956).