Correspondence concerning this article should be sent to Daniel J. Navarro, School of Psychology, Level 5 Hughes Building, University of Adelaide, Adelaide, SA 5005, Australia. E-mail: email@example.com
Inductive generalization, where people go beyond the data provided, is a basic cognitive capability, and it underpins theoretical accounts of learning, categorization, and decision making. To complete the inductive leap needed for generalization, people must make a key ‘‘sampling’’ assumption about how the available data were generated. Previous models have considered two extreme possibilities, known as strong and weak sampling. In strong sampling, data are assumed to have been deliberately generated as positive examples of a concept, whereas in weak sampling, data are assumed to have been generated without any restrictions. We develop a more general account of sampling that allows for an intermediate mixture of these two extremes, and we test its usefulness. In two experiments, we show that most people complete simple one-dimensional generalization tasks in a way that is consistent with their believing in some mixture of strong and weak sampling, but that there are large individual differences in the relative emphasis different people give to each type of sampling. We also show experimentally that the relative emphasis of the mixture is influenced by the structure of the available information. We discuss the psychological meaning of mixing strong and weak sampling, and possible extensions of our modeling approach to richer problems of inductive generalization.
The ability to make sensible inductive inferences is one of the most important capabilities of an intelligent entity. The capacity to go beyond the data and make generalizations that can hold for future observations and events is extremely useful, and it is of interest not only to psychologists but also to philosophers (e.g., Goodman, 1955) and researchers with an interest in formal theories of learning (e.g., Solomonoff, 1964). From a psychological perspective, experimental research dating back to Pavlov (1927) demonstrates the tendency of organisms to generalize from one stimulus to another, with learned contingencies being applied to novel but similar stimuli. Critically, in many cases these generalizations do not involve a failure of discrimination. Stimulus generalization is better characterized as a form of inductive inference than of perceptual failure, and indeed the two have a somewhat different formal character (Ennis, 1988). As Shepard (1987, p. 1322) notes, ‘‘we generalize from one situation to another not because we cannot tell the difference between the two situations but because we judge that they are likely to belong to a set of situations having the same consequence.’’
One of the best-known analyses of inductive generalization is Shepard's (1987) exponential law, which emerges from a Bayesian analysis of an idealized single-point generalization problem. In this problem, the learner is presented with a single item known to belong to some target category and the learner is asked to judge the probability that a novel item belongs to the same category. Shepard's analysis correctly predicts the empirical tendency for these generalization probabilities to decay exponentially as a function of distance in a psychological space (this decay function is called a generalization gradient). The exponential generalization function is treated as a basic building block for a number of successful theories of categorization and concept learning (e.g., Kruschke, 1992; Love, Medin, & Gureckis, 2004; Nosofsky, 1984; Tenenbaum & Griffiths, 2001a) that seek to explain how people learn a category from multiple known category members. The analysis by Tenenbaum and Griffiths (2001a), in particular, is notable for adopting much the same probabilistic formalism as Shepard's original approach, while extending it to handle multiple observations and cases where spatial representations may not be appropriate (see also Russell, 1986). In a related line of work, other researchers have examined much the same issue using ‘‘property induction’’ problems, leading to the development of the similarity-coverage model (Osherson, Smith, Wilkie, Lopez, & Shafir, 1990), as well as feature-based connectionist models (Sloman, 1993) and a range of other Bayesian approaches (Heit, 1998; Kemp & Tenenbaum, 2009; Sanjana & Tenenbaum, 2003).
In this article we investigate the implicit ‘‘sampling’’ models underlying inductive generalizations. We begin by discussing the ideas behind ‘‘strong sampling’’ and ‘‘weak sampling’’ (Shepard, 1987; Tenenbaum & Griffiths, 2001a) and by developing an extension to the Bayesian generalization model that incorporates both of these as special cases of a more general family of sampling schemes. We then present two experiments designed to test whether people's generalizations are consistent with the model, and more specifically, to allow us to determine what sampling assumptions are involved. These experiments are designed so that we can examine both the overall tendencies that people display and the sampling models used by each individual participant. Our main findings are that there are clear individual differences in the mixture between strong and weak sampling used by different people, and that these mixtures are sensitive to the patterns of observed data. We conclude with a discussion of the psychological meaning of mixing strong and weak sampling, and of possible extensions of our modeling approach to richer problems of inductive generalization.
2. Bayesian models of inductive inference
In Bayesian accounts of inductive inference, the learner makes use of two different sources of information: the pre-existing knowledge that he or she brings to the task (the prior), and the information in the problem itself (the likelihood). If we let x denote the information in the problem and let h denote some hypothesis about the property to be inferred, then Bayes’ theorem implies that:

P(h|x) = P(x|h)P(h) / Σh′ P(x|h′)P(h′)    (1)
The numerator in this expression is composed of two terms: the prior P(h), which acts to characterize the prior beliefs, and the likelihood P(x|h), which describes the probability that one would have observed the data x if the hypothesis h were correct. The denominator is composed of the same two terms, summed over all possible hypotheses (i.e., over all possible h′). When combined in this manner, the prior and the likelihood produce P(h|x), the learner's posterior belief in the hypothesis h. It is important to recognize that Bayesian cognitive models are typically functionalist in orientation: They represent analyses of the computational problem facing the learner (Marr, 1982) and tend to remain agnostic about specific psychological processes. For the current purposes, this means that we are interested in determining whether people's generalizations are in agreement with the predictions of a Bayesian analysis.
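When the hypothesis space is discrete, the computation in Eq. 1 can be carried out directly. The following sketch is purely illustrative: the three hypotheses and their prior and likelihood values are invented, not taken from any model in this article.

```python
# Posterior over a discrete hypothesis space via Bayes' rule (Eq. 1).
# The hypotheses, priors, and likelihoods below are made up for illustration.

def posterior(prior, likelihood):
    """Combine prior P(h) and likelihood P(x|h) into the posterior P(h|x)."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())  # the denominator: the sum over all h' in Eq. 1
    return {h: p / z for h, p in unnorm.items()}

prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # P(h)
likelihood = {"h1": 0.1, "h2": 0.4, "h3": 0.0}     # P(x|h); h3 is inconsistent with x
post = posterior(prior, likelihood)
print(post)  # h3 is ruled out entirely; h2 overtakes h1 despite its smaller prior
```

Note how the data reverse the prior ordering of h1 and h2, while the zero likelihood eliminates h3 regardless of its prior.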
Much of the variation among Bayesian induction models can be characterized in terms of different choices of prior. In the simplest case, the learner has an unstructured set of hypotheses, which form the hypothesis space ℋ, and places some simple (possibly uniform) prior over all h ∈ ℋ. However, it is possible to adopt a more structured approach, in which latent mental representations constrain the hypothesis space and generate a more sophisticated and psychologically plausible prior over the hypothesis space (e.g., Kemp & Tenenbaum, 2009; Sanjana & Tenenbaum, 2003). For instance, in Shepard's analysis there was assumed to exist some low-dimensional space in which stimuli are mentally represented, and candidate hypotheses typically correspond to connected regions in that space. Alternatively, hypotheses could be organized into a tree structure, a causal graph, or any of a range of other possibilities (see Kemp & Tenenbaum, 2008, 2009). In our experiments, we restrict ourselves to the case where the prior is constrained by a simple spatial representation.
In contrast to the extensive array of mental representations that can constrain the prior, the likelihood functions considered in the Bayesian literature on inductive reasoning have largely been restricted to two possibilities, ‘‘strong sampling’’ and ‘‘weak sampling,’’ with almost no examples of empirical tests of these assumptions existing in the literature (see later).1 Many models (Heit, 1998; Kemp & Tenenbaum, 2009; Shepard, 1987) assume that the only role of the likelihood function is to determine whether the data are consistent with the hypothesis. If the data are inconsistent with a hypothesis, then the hypothesis is ruled out and so the likelihood is zero, P(x|h) = 0. If the data are consistent with the hypothesis, then the likelihood function is constant, P(x|h)∝1. Accordingly, the relative plausibility of two hypotheses h1 and h2 that are consistent with the data does not change:

P(h1|x) / P(h2|x) = P(h1) / P(h2)    (2)

This kind of likelihood function is referred to as ‘‘weak sampling.’’2
A different approach, introduced by Tenenbaum and Griffiths (2001a), suggests that the learner might assume data are generated from the true hypothesis. This ‘‘strong sampling’’ assumption can take many different forms. In the simplest case, a hypothesis h that is consistent with |h| possible observations (i.e., has ‘‘size’’ |h|) is associated with a uniform distribution over these possibilities. As with weak sampling, if hypothesis h is inconsistent with the data x, then P(x|h) = 0. However, when the data are consistent with the hypothesis, then P(x|h) = 1/|h|. What this means is that if two hypotheses h1 and h2 are both consistent with the data, their relative plausibility now depends on their relative sizes:

P(h1|x) / P(h2|x) = (|h2| / |h1|) · (P(h1) / P(h2))    (3)
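The contrast between Eqs. 2 and 3 can be made concrete with a short sketch. The hypothesis sizes, prior ratio, and sample sizes below are invented for illustration: under weak sampling the prior ratio of two data-consistent hypotheses is untouched, whereas under strong sampling the smaller hypothesis gains an advantage that compounds with every additional consistent observation.

```python
# Relative plausibility of two data-consistent hypotheses after n observations.
# All numbers here are illustrative, not drawn from any experiment.

def relative_plausibility(prior_ratio, size1, size2, n, strong=True):
    """P(h1|x)/P(h2|x) after n consistent, independent observations.
    Weak sampling leaves the prior ratio unchanged (Eq. 2); strong
    sampling multiplies it by (size2/size1)**n (Eq. 3, applied n times)."""
    if not strong:
        return prior_ratio
    return prior_ratio * (size2 / size1) ** n

# h1 is a small hypothesis (|h1| = 2), h2 a large one (|h2| = 10), equal priors.
print(relative_plausibility(1.0, 2, 10, n=1, strong=False))  # 1.0   (weak)
print(relative_plausibility(1.0, 2, 10, n=1))                # 5.0   (strong)
print(relative_plausibility(1.0, 2, 10, n=3))                # 125.0 (size principle compounds)
```

This compounding is the ‘‘size principle’’: under strong sampling, the smallest hypothesis consistent with the data rapidly dominates as observations accumulate.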
Other models are less explicit in their sampling assumptions. For instance, the featural similarity model used by Sloman (1993) relies on the contrast model for featural similarity (Tversky, 1977). It does not explicitly re-weight hypotheses according to their size, and it more closely approximates the weak sampling rule in Eq. 2 than the strong sampling rule in Eq. 3. The similarity-coverage model does not make any explicit statements either, but insofar as the ‘‘coverage’’ term attempts to measure the extent to which the observations span the range of possibilities encompassed by the hypothesis, it is consistent with strong sampling. Relatedly, if the generalization gradients used by various category learning models (e.g., Kruschke, 1992; Love et al., 2004; Nosofsky, 1984) are interpreted in terms of the Bayesian generalization model, then a strong sampling model would require the steepness of the generalization gradient to increase as a function of the number of exemplars. In some models (e.g., Nosofsky, 1984) this is not the case, and so we might consider them to be weak sampling models. In other cases, the model allows the width of the generalization gradient to be adapted over time in order to match the observed data (e.g., Nosofsky, 1984), and so it might be considered to embody a strong sampling assumption, though the mapping is not exact.
3. Why do sampling assumptions matter?
As outlined at the start of the article, inductive inference is a central problem in cognitive science and has therefore been studied a great deal. What may be less obvious is the critical role played by sampling assumptions. Most work in this area has focused on how inductive inferences are shaped by the pre-existing biases of the learner, and on the mental representations that underpin these biases. However, induction is in large part a process in which the learner's beliefs are shaped by data. In order for that to happen, the learner must rely on some (probably implicit) assumptions about the evidentiary value of his or her observations. In other words, he or she needs to have some theory for how the data were sampled, and some way of linking that theory to beliefs about the state of the world. The ‘‘strong sampling’’ and ‘‘weak sampling’’ models are two examples of such a theory, but the basic issue is more general.
To illustrate the generality of the issue, consider the dilemma faced by 17th-century European ornithologists trying to catalog the birds of the world. To their knowledge, nobody had ever observed a non-white swan. Should they have inferred that all swans are white? This question, identified by Popper (1935/1990) while discussing David Hume's problem of induction, is a classic. Although the learner in this instance has a very large number of observations of white swans, the obvious inference (that all swans are white) is of course incorrect. As it turned out, European swans are systematically different from Australian swans, and so the inference fails. A naive learner does not account for the systematic spatial variation and overestimates the informativeness of the observed data. Indeed, the evidentiary value of observations is shaped by a great many factors. The data might be very old (a new mutation might produce black swans), collected by someone untrustworthy (medieval bestiaries do not provide reliable biological data), or copies of one another (the same swan could be seen multiple times). Moreover, people are often quite sensitive to these factors (Anderson & Milson, 1989; Anderson & Schooler, 1991; Welsh & Navarro, 2007), altering their inferences based on their assumptions about the informativeness and relevance of the data.
These issues are all sampling assumptions: They relate to the learner's theory of how the data are generated. Within the Bayesian framework, such issues are handled through the likelihood function. To return to the specific issues discussed in the previous section, a learner who makes a strong sampling assumption (Eq. 3) is clearly relying on a very different theory of the data than one who uses a weak sampling model (Eq. 2), and his or her beliefs and inductions change accordingly. In other words, what these equations are saying is that the informativeness of a particular observation changes depending on how it was sampled. It is for this reason that the sampling assumptions play a key role in the specification of any Bayesian theory. That being said, these issues are by no means restricted to Bayesian theories. Rather, all theories of learning are reliant on assumptions about the nature of the observations. For instance, connectionist models (e.g., Hinton, McClelland, & Rumelhart, 1986) implement learning rules that describe how networks change in response to feedback, with different learning rules embodying different assumptions about data. Decision heuristics (Gigerenzer & Goldstein, 1996) specify a learning method by describing how to estimate ‘‘cue validities’’ and related quantities, with different estimation methods producing different inferences (Lee, Chandrasena, & Navarro, 2002). In short, although our approach to this problem is explicitly Bayesian, and covers only some of the issues at hand, the underlying problem itself has much broader scope.
4. Conservative learning in a complicated world
The previous section illustrates the central role played by sampling assumptions in guiding inductive inferences, and it illustrates that the core issue is much more general than the ‘‘strong versus weak’’ distinction. In particular, the focus on the evidentiary value of data is critical. In the weak sampling model described previously, an observation that is consistent with two hypotheses is assumed to convey no information about their relative plausibility. In contrast, a strong sampling assumption means that the observation is highly informative. However, in light of the issues raised in the last section (old data, untrustworthy data, correlated data, and so on) it seems plausible to suspect that some people would adopt a kind of intermediate position: When an observation is consistent with hypotheses h1 and h2, such a learner will make some adjustment to his or her beliefs, but this adjustment will be less extreme than the scaling that Eq. 3 implies. In fact, it has long been recognized that people generally adapt their beliefs more slowly than the naive application of Bayes’ theorem would imply, a phenomenon known as conservatism (Phillips & Edwards, 1966). Moreover, one of the main reasons why conservatism might be expected to occur is that people are sensitive to the fact that real-life observations are correlated, corrupted, and so on (Navon, 1978).
To illustrate the implications of conservative learning, consider the problem faced by a learner who encounters only positive examples of some category. If all category members are equally likely to be observed, and all observations are generated independently from the true category, then a strong sampling model is the correct assumption to make (Tenenbaum & Griffiths, 2001a). However, few if any real-life situations are this simple, so while a ‘‘naive’’ learner would adopt a strong sampling model, a more skeptical learner would not. In the most extreme case, our skeptical learner might decide that the sampling process is not informative and come up with a weak sampling model as a consequence. What this makes clear, however, is that strong and weak sampling are two ends of a continuum, since it is perfectly sensible to suppose that the learner is not completely naive (not strong) but not completely distrusting (not weak).
How might such a learner think? To motivate this, consider the problems associated with learning how alcoholic a beer is. The first beer that the world provides to the learner might be a Victoria Bitter (4.6% alc/vol), as is the second. The third beer might be Tooheys New (4.6% alc/vol), and the fourth a Coopers Pale Ale (4.5% alc/vol). If (as per strong sampling) the learner were to construe these as independent draws from a uniform distribution over beers, he or she would have strong evidence in favor of the hypothesis that all beers have alcohol content between 4% and 5% by volume. However, a little thought suggests that this inference is too strong. The problem is that the sampling scheme here is neither uniform nor independent, making it inappropriate to apply the strong sampling model in a straightforward fashion. In this example, the learner might reasonably conclude that the second Victoria Bitter ‘‘token’’ is really just an example of the same ‘‘type’’ of beer and conveys no new evidence over the first. Moreover, since Tooheys New and Victoria Bitter are both lagers, one might expect them to be a little more similar to one another than two randomly chosen beers might be. Finally, since all three brands are Australian beers and are often served in similar establishments, it is probable that none of them are truly independent, and so a conservative learner might be unsurprised to discover that Duvel (a Belgian beer) has 8.5% alc/vol. In short, while all four observations are legitimate positive examples of the category ‘‘beer,’’ it would not be unreasonable to believe that only one or two of them actually qualify as having been strongly sampled.
To formalize this intuition, we construct a sampling model in which there is some probability that an observation is strongly sampled (drawn independently from the true distribution) but with some probability the observation conveys no new information, and it may therefore be assumed to be weakly sampled. With this in mind, let θ denote the probability that any given observation is strongly sampled, and correspondingly 1−θ is the probability that an observation is weakly sampled. Then, cleaning up the notation, the learner has the model:

P(x|h) = θ/|h| + (1−θ)/|𝒳|   if x ∈ h, and P(x|h) = 0 otherwise    (4)

where 𝒳 is the set of all possible stimuli, and |𝒳| counts the total number of possible items. When θ = 0 this model is equivalent to weak sampling, and when θ = 1 it is equivalent to strong sampling. For values of θ in between, the model behaves in accordance with the ‘‘beer generation’’ idea discussed above: Only some proportion θ of the observations are deemed to be strongly sampled. As a consequence, this more general family of models smoothly interpolates between strong and weak sampling.
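A minimal sketch of Eq. 4 makes the interpolation concrete. The sizes chosen for the hypothesis and for the full stimulus set are illustrative only:

```python
# Mixed sampling likelihood (Eq. 4): with probability theta an observation is
# strongly sampled (uniform over the hypothesis), and with probability 1-theta
# it is weakly sampled (uniform over the full stimulus set).

def mixed_likelihood(theta, size_h, size_x, consistent=True):
    """P(x|h) under the mixed model; zero when x is inconsistent with h."""
    if not consistent:
        return 0.0
    return theta / size_h + (1 - theta) / size_x

size_h, size_x = 4, 20   # |h| = 4 consistent items, |X| = 20 stimuli in total
print(mixed_likelihood(0.0, size_h, size_x))  # 0.05 -> weak sampling, 1/|X|
print(mixed_likelihood(1.0, size_h, size_x))  # 0.25 -> strong sampling, 1/|h|
print(mixed_likelihood(0.5, size_h, size_x))  # 0.15 -> halfway between the two
```

Sweeping θ from 0 to 1 moves the likelihood linearly from the weak-sampling constant to the strong-sampling size principle, which is exactly the smooth interpolation described above.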
What interpretation should we give to this more general family of models? One psychologically plausible possibility is suggested by the beer sampling model discussed earlier: Some observations are deemed to be correlated, so much so that while he or she has observed n ‘‘tokens’’ (observations), they correspond to only m ‘‘types’’ (genuine samples from the distribution to be learned), where m = θn < n. This idea has some precedent in the computational learning literature (e.g., Goldwater, Griffiths, & Johnson, 2006) and seems very plausible in light of both the conservatism phenomenon (Phillips & Edwards, 1966) and its rational explanation in terms of violations of independence (Navon, 1978). This idea explains why the learner might want to adopt this class of models: If he or she thinks that the world is messy, and generates observations in a correlated fashion rather than independently from the true category distribution, then Eq. 4 describes a sensible psychological assumption for the learner to make. Like both strong and weak sampling, it is an idealization: In any particular learning situation, the particular way in which the world turns out to be ‘‘messy’’ will produce a slightly different family of sampling models. Nevertheless, we suspect that the simple idea of reducing n strongly sampled observations to some smaller value m = θn can provide a sensible first approximation to use in many cases.
More generally, while the primary justification for the model is based on the theory of psychological conservatism, it is curious to note that Eq. 4 has an interpretation as an example of a ‘‘James-Stein-type’’ shrinkage estimator (see Hausser & Strimmer, 2009; Schäfer & Strimmer, 2005) for the distribution that generates the data.3 The weak sampling model acts as a low-variance, high-bias estimator, whereas the strong sampling model is high variance and low bias (see, e.g., Hastie, Tibshirani, & Friedman, 2001 for a statistical introduction, and Gigerenzer & Brighton, 2009 for a psychological discussion). Our ‘‘mixed’’ sampling model is based on the assumption that the learner aggregates these two. The interesting point is that James-Stein-type estimators have the ability to outperform either of the two estimators from which they are constructed (James & Stein, 1961; Stein, 1956). As such, the usefulness of the mixed sampling model may extend across a quite broad range of situations.
5. Testing sampling assumptions
The Bayesian theory of generalization was developed a decade ago (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001a), with the assumption of strong sampling playing a key role in the extension to Shepard's (1987) exponential law. The strong–weak distinction appears in a different form within the theoretical literature on computational learning theory, with some structures being difficult to learn under a weak sampling model (Gold, 1967), but other evidence suggesting that strong sampling methods are able to learn them (Muggleton, 1997). Yet to date, the psychological evidence for this theoretical principle is somewhat limited.
One respect in which the limitations of this research are noticeable is that in all cases the data presented are aggregated across participants. This raises the issue of individual differences (Lee & Webb, 2005; Navarro, Griffiths, Steyvers, & Lee, 2006): it has long been known (Estes, 1956) that aggregation can induce very large distortions in the data. Do people tend to make the same default assumptions about sampling, or is this something that varies across individuals? If so, can the Bayesian generalization theory capture this variation? Indeed, the reassurance that the theory holds for individual people is of particular importance in this case, where the learner receives only positive instances of a category. In such ‘‘positive-only’’ learning scenarios there is an ongoing question about what it is theoretically possible for the human learner to actually learn from the data, dating back to Gold's (1967) theorem regarding language acquisition. The distinction between strong sampling and weak sampling matters in this regard: Under weak sampling, positive data are not powerful enough for the learner to acquire some types of complex knowledge structures, whereas under strong sampling a great deal more is learnable (see, e.g., Perfors et al., 2006). If it were the case that only some people apply strong sampling assumptions, then it may be more difficult to use strong sampling as an explanation for phenomena (such as language acquisition) that are genuinely universal across people.
6. Stimulus generalization in psychological space
In this section, we describe the Bayesian theory of generalization as it applies to stimuli that vary along a single continuous dimension, and we illustrate how inductive generalizations change as a function of the sampling assumptions and the prior beliefs of the learner.
6.1. Learning from examples
Suppose that the learner has observed a single item x that is known to possess some property. Since stimuli vary along a continuous dimension, we follow the approach taken by Shepard (1987) and Tenenbaum and Griffiths (2001a) and assume that there exists a true ‘‘consequential region’’ (denoted rt) over which the property is true. If the property of interest is constrained to occupy a single connected region, then we may define the learner's hypothesis space to be ℛ, the set of all such regions. For instance, the stimulus dimension might be the latitude of a geographical location on planet Earth, and the property in question could be whether the location lies within the tropics. In this scenario, the stimulus dimension ranges from a latitude of −90° to 90°, and the consequential region for the tropics corresponds to the range from −23.5° to 23.5°. In general, consequential regions need not correspond to a single interval: For instance, the temperate zone on Earth covers two disconnected regions, the southern temperate zone from −66.5° to −23.5° and the northern temperate zone from 23.5° to 66.5°. As discussed by several authors (Navarro, 2006; Shepard, 1987; Tenenbaum & Griffiths, 2001b), the Bayesian generalization model can be extended to handle these situations without conceptual difficulty, but for the purposes of the current article the simpler version is sufficient. The basic idea is shown on the left-hand side of Fig. 1, which depicts 10 possible hypotheses that the learner might have about how far the tropics extend.
When presented with this learning scenario, one natural goal for the learner is to discover the identity of this unknown region. What information does he or she have to work with? Notice that there are two distinct pieces of information inherent in this problem. First, the learner has observed the item x itself (and not some other possible item), which may or may not be a useful clue. Secondly, the learner has discovered that this item x possesses the property of interest: that is, he or she knows that x ∈ rt. Suppose now that the learner hypothesizes that the correct region is r; that is, that rt = r. What is the probability that this hypothesis is correct, in light of the information presented to the learner? We may calculate this probability by applying Bayes’ rule (Eq. 1), which gives

P(rt = r | x, x ∈ rt) = P(x, x ∈ rt | rt = r) P(rt = r) / ∫ P(x, x ∈ rt | rt = r′) P(rt = r′) dr′
The denominator in this expression is an integral rather than a sum, because the space of possible regions is continuous rather than discrete. In the Appendix, we present the solution to all of the integrals relevant to this article, but for the purposes of explaining the basic ideas, these solutions are not important. As such, it is convenient to ignore the denominator (since it is just a normalizing constant), and note that

P(rt = r | x, x ∈ rt) ∝ P(x, x ∈ rt | rt = r) P(rt = r)
This makes clear that the learner's degree of belief in r depends on two things. First, it depends on the extent to which he or she originally believed that r was the true region, which is expressed by the prior probability P(rt = r). Secondly, it depends on the likelihood P(x, x ∈ rt | rt = r), which describes the probability that the learner would have observed x and learned that x ∈ rt if the learner's hypothesis were correct. We will discuss these two terms in detail shortly, but notice that at a bare minimum the likelihood term is zero if the hypothesized region does not contain the item (i.e., if x∉r).
As noted by Tenenbaum and Griffiths (2001a), the theory extends to multiple examples in a very simple way. We now imagine that the learner has encountered n items that possess the property, corresponding to the observations x = (x1, …, xn). In this situation, the learner knows that the items x have been generated, and he or she also knows that these items belong to the true consequential region, x ∈ rt. If we suppose that the items are conditionally independent4 and apply Bayes’ rule, we obtain the expression

P(rt = r | x, x ∈ rt) ∝ P(rt = r) ∏i P(xi, xi ∈ rt | rt = r)
Given the simplicity of this expression, multiple items may be handled easily. Moreover, since it is constructed from the same two functions (the prior and the likelihood), there is nothing conceptually different about the multiple-item case. Again, noting that the likelihood is zero for any region r that does not contain all of the training items x, one part of the learning process is to eliminate all hypotheses that are inconsistent with the data. To continue with the ‘‘tropics’’ example earlier, the learner might be told that Cairns (latitude −16.9°) and Singapore (latitude 1.3°) lie in the tropics. On the basis of this knowledge, many of the originally possible regions are eliminated, as illustrated on the right-hand side of Fig. 1.
6.2. Making predictions
When solving an inductive generalization problem, the learner's task is slightly more complex. Rather than needing to infer the true consequential region associated with the property, the learner needs to determine whether some novel item y also shares the property. In the tropics example, this might correspond to a situation in which the learner has been told that Cairns and Singapore lie in the tropics, and is asked to guess if Brisbane (latitude −27.4°) also lies in the tropics. Formally, the problem at hand is to infer the probability that y ∈ rt given that the learner has observed x and knows that x ∈ rt. Clearly, if the identity of the true region is known to the learner, then this problem becomes trivial. However, the learner does not have this knowledge. Instead, what the learner has is a set of beliefs about the plausibility of different regions, as captured by the posterior distribution P(rt = r | x, x ∈ rt) discussed earlier. If the learner weights each hypothesis according to its probability of being correct, then he or she can infer the probability that the new item also possesses the property, as follows:

P(y ∈ rt | x, x ∈ rt) = ∫ P(y ∈ rt | rt = r) P(rt = r | x, x ∈ rt) dr    (9)
In this expression the term P(y ∈ rt | rt = r) is very simple: It is just the ‘‘probability’’ that y falls inside some known region r. Therefore, it equals 1 if y is inside this region and 0 if it is not. Not surprisingly then, only those regions that include y make any contribution to the integral in Eq. 9, and so we can simplify this expression by restricting the domain of the integration to ℛy, the set of regions that contain y. Thus, we may write

P(y ∈ rt | x, x ∈ rt) = ∫r∈ℛy P(rt = r | x, x ∈ rt) dr
As noted earlier, the solution to this integral is presented in the Appendix for all cases relevant to this article. However, in order to understand the behavior of the model, it suffices to note that all this integral does is ‘‘add up’’ the total degree of belief associated with those regions that include y. To return again to our tropics learning example in Fig. 1, there are five hypotheses that are consistent with the training data (Singapore and Cairns). Of those five hypotheses, four also contain the query item (Brisbane). Therefore, if the learner treats all five of these hypotheses as equally likely, he or she would rate the probability of Brisbane being tropical at 80%. Of course, in the actual model the learner's hypothesis space consists of all possible regions, not just the ten shown in the figure, but the basic principle is the same: The learner generalizes using only those hypotheses that are consistent with the data. The more important issue is whether the learner would really treat all of the data-consistent hypotheses as equally plausible. To answer that question, we must now look at the priors and the likelihoods in detail.
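The 80% figure can be reproduced in a few lines. The ten candidate regions below are hypothetical stand-ins (Fig. 1's exact intervals are not reproduced here), chosen so that five contain both training cities and four of those also contain the query city:

```python
# Generalization in the tropics example with equally weighted hypotheses.
# The ten candidate latitude regions are invented for illustration only.

regions = [(-5, 5), (-10, 10), (-12, 14), (-16, 20), (0, 25),   # exclude Cairns
           (-17, 15),                                           # consistent, excludes Brisbane
           (-30, 30), (-35, 35), (-40, 25), (-50, 50)]          # contain all three cities

cairns, singapore, brisbane = -16.9, 1.3, -27.4

def contains(region, x):
    lo, hi = region
    return lo <= x <= hi

# Eliminate every hypothesis inconsistent with the training data
# (its likelihood is zero, so it drops out of the posterior).
consistent = [r for r in regions if contains(r, cairns) and contains(r, singapore)]

# With all surviving hypotheses treated as equally likely, the generalization
# probability is just the fraction of them that also contain the query item.
p_brisbane = sum(contains(r, brisbane) for r in consistent) / len(consistent)
print(p_brisbane)  # 0.8
```

With these intervals, five hypotheses survive the training data and four contain Brisbane, giving the 4/5 = 80% generalization probability described in the text.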
6.3. Priors and likelihoods
To complete the model, we need to specify P(rt = r), the learner's prior degree of belief in region r, and the likelihood function P(x | rt = r). As discussed earlier, the family of likelihood functions that we consider in this article are those that correspond to the ‘‘mixed sampling’’ model (Eq. 4). To apply it here, we may (without loss of generality) assume that the range over which stimuli can vary is the unit interval, [0,1]. In the tropics example, for instance, we can do this by dividing the latitude by 180 and then adding 0.5. In these new co-ordinates, the tropics correspond to the region that runs from 0.37 to 0.63. Having made this assumption, the mixed sampling likelihood functions are given by

P(x | rt = r) = θ(1/|r|) + (1 − θ) if x ∈ r, and 0 otherwise,
where |r| denotes the size of the region, and as discussed previously θ = 0 yields the weak sampling model and θ = 1 is the strong sampling model. In these normalized co-ordinates, therefore, the size of the tropics is 0.63 − 0.37 = 0.26.
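On this reading of the mixed sampling model, each observation inside the region receives a θ-weighted mixture of the strong sampling density (1/|r|) and the weak sampling constant (1). A minimal sketch under that assumption:

```python
def likelihood(x, region, theta):
    """Mixed-sampling likelihood for one observation x given region [l, u]:
    theta = 0 recovers weak sampling, theta = 1 recovers strong sampling."""
    l, u = region
    if not (l <= x <= u):
        return 0.0                      # region falsified: x lies outside it
    size = u - l
    return theta * (1.0 / size) + (1.0 - theta)

tropics = (0.37, 0.63)                  # normalized co-ordinates, |r| = 0.26
print(likelihood(0.5, tropics, 0.0))    # weak sampling: 1.0
print(likelihood(0.5, tropics, 1.0))    # strong sampling: 1/0.26 ≈ 3.85
```

Note that smaller regions receive larger likelihoods under strong sampling, which is what drives the size principle discussed below.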
What degree of prior belief should the learner place in the hypothesis that region r is the correct one? One natural possibility is the uniform prior, P(rt = r) ∝ 1, in which the learner treats every possibility as equally likely. However, this is not the only possibility that the learner might consider. Consider the region r that covers the interval [l,u]. The size of this region is given by |r| = u − l, and the location of this region is the center c = (u + l)/2. In keeping with Shepard (1987), we assume that it is only the size of a region that matters, not its location. With this in mind, one natural choice of prior is the simple one-parameter Beta(1,φ) model, in which

P(rt = r) ∝ |r|^(φ−1)
Although simple, this prior is quite principled. It has the same structure as the likelihood function (i.e., size raised to some power), which means that φ can be interpreted as pseudodata. That is, increasing φ by one has exactly the same effect on the generalization function as decreasing the sample size by one. In other words, if the learner's pre-existing knowledge can be described ‘‘as if’’ it corresponded to a set of fictitious previous observations, then the prior distribution should take this form. The result is a family of priors in which φ = 1 corresponds to a uniform distribution over possible regions, whereas φ < 1 expresses a prior assumption that the region is small, and φ > 1 corresponds to a prior belief that the region extends over a larger range. To illustrate this idea, Fig. 2 shows the prior distribution that would be involved in the tropics example, if the 10 hypotheses shown in Fig. 1 were in fact the entire hypothesis space. When φ = 1 (middle panel), all regions are treated as equally plausible a priori. A prior that is biased toward small regions (φ = 0.5) is shown in the left panel, while the right panel depicts a bias toward larger regions (φ = 2).
Now that we have both the prior and the likelihood specified, the influence of the sampling assumptions θ can be made explicit. Fig. 3 plots three posterior distributions for the tropics problem, where the prior distribution is assumed to be uniform (as per the middle panel of Fig. 2), using a weak sampling model (θ = 0), a strong sampling model (θ = 1), and an intermediate model (θ = .33). Note that the pattern of falsification is the same in all three cases: Regardless of the value of θ, hypotheses 1, 2, 5, 6, and 10 are all inconsistent with the data, because the corresponding regions do not contain both Singapore and Cairns. Thus, the posterior probability for these regions is zero. In weak sampling (left panel), this is the only change from the prior distribution, and so the five remaining hypotheses are equally weighted. Since the only one of the remaining hypotheses not to contain Brisbane is hypothesis 3 (shown in gray), in this case the learner would say that there is an 80% chance that Brisbane lies in the tropics. However, if θ > 0, then the likelihood favors the smallest regions that are consistent with the data (i.e., contain Singapore and Cairns). Thus, the likelihood will always favor hypothesis 8 most strongly, followed by hypotheses 3, 7, 4, and 9 in that order. When θ is near 1, this effect is very strong (right panel of Fig. 3), whereas for θ close to 0 this effect is much smaller. As a result, when θ = .33, the learner's estimate of the probability that Brisbane is tropical falls to 77%, which drops to 75% when θ = 1.
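Putting the prior and the likelihood together, the posterior generalization probability over a discrete hypothesis space can be sketched as follows. The five regions and item locations below are hypothetical (not the actual Fig. 1 hypotheses), and the size-based prior weight |r|**(phi − 1) is our reading of the prior described in the text (φ = 1 uniform, φ < 1 favoring small regions).

```python
def posterior_generalization(regions, data, query, theta=0.0, phi=1.0):
    """P(query in true region | data) over a discrete hypothesis space,
    combining a size-based prior weight |r|**(phi - 1) with a
    mixed-sampling likelihood for each observation."""
    weights = []
    for l, u in regions:
        size = u - l
        if all(l <= x <= u for x in data):
            lik = (theta / size + (1 - theta)) ** len(data)
        else:
            lik = 0.0                   # region falsified by the data
        weights.append(size ** (phi - 1) * lik)
    total = sum(weights)
    hit = sum(w for (l, u), w in zip(regions, weights) if l <= query <= u)
    return hit / total

# Hypothetical regions in normalized co-ordinates (not the Fig. 1 set).
regions = [(0.30, 0.55), (0.34, 0.52), (0.25, 0.60), (0.10, 0.90), (0.40, 0.52)]
data = [0.41, 0.51]      # two observed items with the property
query = 0.35             # the novel item

print(posterior_generalization(regions, data, query, theta=0.0))  # 0.8
print(posterior_generalization(regions, data, query, theta=1.0))  # lower
```

As in the text, raising θ shifts posterior mass toward the smallest consistent regions, lowering the generalization probability for items outside them; raising φ has the opposite effect.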
6.4. Constructing a generalization gradient
Up to this point, the tropics example that we have used to illustrate model behavior has relied on an ad hoc hypothesis space consisting of only 10 possible consequential regions. Moreover, we have only examined the model predictions for a single query item (i.e., Brisbane, latitude −27.4°). However, in order to understand the overall behavior of the model, we need to remove both of these restrictions. To start with, suppose we expanded the hypothesis space to include all regions whose edge points are members of the set {.05, .15, …, .85, .95} (i.e., a very crude discrete approximation to the full continuous hypothesis space). Using this expanded hypothesis space, we can then calculate the posterior distribution over possible regions for different choices of the two parameters θ and φ, as well as the generalization probability for all possible query items y.
In the simplest case, suppose the learner has a uniform prior over regions (φ = 1) and applies a weak sampling model (θ = 0). This case is illustrated in the left panels of Fig. 4. The lower part of the plot draws all of those regions that are consistent with the two observations (Singapore and Cairns). Since all of these regions were equally likely in the prior, and the weak sampling likelihood function only acts to falsify regions that do not contain the data, all of these regions remain equally likely in the posterior (illustrated schematically by the fact that all of the lines are of the same thickness). In the top part of the panel, we plot the generalization probability for all possible latitudes of the query item y. Since all of the regions in the lower panel are weighted equally, for a given value of y this probability is just a count of the proportion of non-falsified regions that contain the query item.
When we use other parameter values, it is no longer the case that all regions are equally likely. For instance, suppose the learner switched from a weak sampling model to a strong sampling model, by increasing θ from 0 to 1. It is now the case that the likelihood function strongly favors the smallest regions that contain both Singapore and Cairns. This situation is shown in the middle panel of Fig. 4. The lower part of the plot schematically illustrates the effect of the likelihood on the posterior distribution over regions: Smaller regions are more likely and are thus drawn with thicker lines. The upper panel shows what effect this has on the generalization gradient: It ‘‘tightens’’ around the data, because the narrowest regions are preferred by the learner.
The right-hand panel of Fig. 4 shows what influence the prior has, by raising φ from 1 to 5. By choosing a prior with φ = 5, the learner has a prior bias to believe that the region is large (i.e., that the tropics covers a very wide band of latitudes). Even though the learner has a weak sampling likelihood function (i.e., θ = 0), the posterior distribution in the lower panel is quite uneven, as a result of the learner's prior biases. In this case, because the prior bias was to favor large regions, the effect on the generalization gradient is the opposite of the effect we observed in the middle panel: The generalization gradient spreads out and, in fact, becomes convex.
One critical implication of this phenomenon should be made explicit: The effects of φ and θ are very similar if we consider only a single generalization gradient. That is, raising θ has much the same effect as reducing φ. This kind of trade-off between the prior and the likelihood is a general property of Bayesian models. In order to disentangle the influence of the learner's prior beliefs (φ) from his or her assumptions about the data (θ), it is necessary to look at how the overall pattern of generalizations changes across multiple generalization gradients.
Obviously, the stepped generalization functions shown in Fig. 4 are not psychologically plausible. This lack of smoothness is an artifact of limiting the hypothesis space to a discrete approximation. If we expand the hypothesis space to consist of all possible regions, the problem becomes continuous, as do the corresponding generalization gradients. In the Appendix, we present solutions to the integrals involved in the continuous case, allowing us to directly calculate smooth generalization gradients rather than relying on numerical approximations.
6.5. Generalization gradients in the full model
Having illustrated how the model predictions are constructed, we now turn to a discussion of what kinds of generalization patterns can be produced by varying the prior (via φ) and the likelihood (via θ). Of the two parameters, the one in which we are primarily interested is θ, but it is important to show how to disambiguate the two. To do this, consider what happens to the generalization gradients when the learner makes three observations but sees them one at a time. For simplicity, assume that φ = 1, so the learner treats all regions as equally plausible a priori. The generalizations that are possible in this situation are illustrated in Fig. 5. In the weak sampling case, the generalization gradients are linear and do not tighten.5 Critically, notice that for any θ > 0 the generalization gradients get tighter as more data are observed (from left to right). This is the critical prediction: The non-linear shapes of the gradients in the lower panels can easily be mimicked by weak sampling if we choose the prior in the right way, but the tightening from left to right cannot. This is illustrated in Fig. 6, which shows what happens when we vary φ (while keeping θ fixed at 0). The prior cannot influence the manner in which generalizations change as a function of the data. It is this critical regularity that we seek to investigate experimentally.
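The tightening prediction can be illustrated directly. The sketch below uses a discrete grid of edge points like the one described earlier, with hypothetical training items; adding observations inside the same range leaves the weak sampling gradient (θ = 0) untouched but tightens it whenever θ > 0. The prior weight |r|**(phi − 1) is our reading of the size-based prior.

```python
import itertools

def gen_prob(query, data, theta, phi, edges):
    """Generalization probability for one query item, over the region
    space whose edge points are drawn from `edges`."""
    num = den = 0.0
    for l, u in itertools.combinations(edges, 2):
        if not all(l <= x <= u for x in data):
            continue                    # region falsified by the data
        size = u - l
        w = size ** (phi - 1) * (theta / size + (1 - theta)) ** len(data)
        den += w
        if l <= query <= u:
            num += w
    return num / den

edges = [0.05 + 0.10 * i for i in range(10)]   # .05, .15, ..., .95
small = [0.41, 0.51]                           # hypothetical training items
large = small + [0.44, 0.48]                   # more data, same range

for theta in (0.0, 0.33, 1.0):
    print(theta,
          gen_prob(0.20, small, theta, 1.0, edges),
          gen_prob(0.20, large, theta, 1.0, edges))
```

With θ = 0 the two samples give identical generalization probabilities for a query outside the observed range; with θ > 0 the probability drops as the sample grows, which is the signature behavior the experiments look for.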
7. Experiment 1
The question we want to investigate in this experiment is the extent to which people change their generalizations as a function of the number of observations they make. According to the weak sampling hypothesis (Heit, 1998; Kemp & Tenenbaum, 2009; Shepard, 1987), the generalization gradients should not tighten purely because more data are available. In contrast, the strong sampling hypothesis (Sanjana & Tenenbaum, 2003; Tenenbaum & Griffiths, 2001a) implies that the generalization gradients should narrow as data arrive, even if the new observations fall within the same range as the old ones.
7.1. Method
7.1.1. Participants
Participants were 22 undergraduates (16 female, 6 male) from the University of Adelaide, who were given a $10 book voucher for their participation.
7.1.2. Materials and procedure
Participants were asked about three different induction problems, presented in a random order via computer. All problems involved stimuli that varied along one continuous dimension. One problem involved the temperatures at which a bacterium can survive, another asked about the range of soil acidity levels that produce a particular colored flower, and the third related to the times at which a nocturnal animal might forage. The cover story then explained that a small number of ‘‘training’’ observations were available—in the bacterial scenario, for instance, it would indicate temperatures at which the bacterium was known to survive—and asked participants to make guesses about whether or not the property generalized to different stimulus values (the ‘‘query’’ items). The cover stories are provided in the Appendix.
To minimize any differences between the psychological representation of the stimuli and the intended one, participants were shown a visual representation of the data, which marked the locations of training items with a black dot, and the query item was shown using a red question mark. The cover stories were constructed to imply that the to-be-inferred property did not hold beyond the range shown on screen. Responses were obtained by allowing people to position a slider bar using the mouse, with response values ranging from 0% probability to 100% probability. Once the participant was satisfied with the positioning of the slider bar, he or she clicked a button to move to the next question. For every set of training data, we presented 24 query items, spread evenly across the range of possible items and presented in a random order.
Since it is impossible to infer the sampling model from a single generalization gradient, we measured three generalization gradients for each problem, as illustrated in Fig. 7. Initially, participants were asked to make generalizations from a small number of observations. These were then supplemented with additional observations, and people were asked to generalize from this larger sample. Finally, the data set was expanded a third time to allow us to measure a third generalization gradient. In effect, we present people with three samples—collectively referred to as a ‘‘data structure’’—and obtain the corresponding generalization gradients. The assignment of cover stories (bacterial, soil, foraging) to data structures was fixed rather than randomized, as illustrated in Fig. 7 (we address this in experiment 2). For example, the top panel shows the data structure associated with the bacterial cover story. Each of the three rows corresponds to the three samples: Three known observations were initially given, which was then expanded to five data points, and then finally to a sample of ten observations. In total, each participant made 216 judgments (3 scenarios × 3 samples × 24 queries).
7.2. Basic results
Fig. 8 plots all 216 judgments made by two of the participants. In both cases the judgments appear sensible, but it is clear that the generalization gradients are quite different from one another. Indeed, inspection of the raw data across all 22 participants made clear that individual differences are the norm rather than the exception. With this in mind, in the model-based analyses presented later in this section we took care not to average data across participants. Moreover, later in the article we examine the individual-level data in some detail. However, before modeling the data, we present some more basic analyses.
As discussed earlier, the basic effect that we are looking for is the ‘‘gradient-tightening’’ effect shown in Fig. 5. That is, do people narrow their generalization gradients as sample size increases? A simple way to test this proposition is to note that the experimental design is such that there are multiple occasions where more data are added to the sample, but at least one of the two edge points remains unchanged (Fig. 7). For instance, one of the seven cases corresponds to the transition from the first to the second sample in the bacterial scenario: On the left-hand side, the edge point does not change. To a first approximation, it is reasonable to assume that if θ > 0, the generalization probabilities should decrease for the query items that lie beyond this edge.6 In total, there are seven situations in which the edge point does not change, yielding a total of 64 pairs of queries. Since there are 11 query items to the left of the leftmost training example in this instance, this contributes 11 of the 64 query pairs that we can consider, and thus a total of 22 × 64 = 1,408 pairs of responses. For each such pair of queries, we are interested in the difference between the generalization probabilities reported by each participant, since this provides a measure of the extent of ‘‘gradient tightening.’’ On average, people showed a small but significant amount of gradient tightening: The generalization probabilities are lower on average when the sample size is larger (t1407 = −4.98, p < .001). The trend is in the correct direction for 51 of the 64 query pairs, and for 17 of the 22 participants. However, notice that this is a small effect: The raw judgments vary from a probability of 0% to 100%, but the average change corresponds to a decrease in generalization probability of only 2.3% from one case to the next.
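The analysis just described amounts to a one-sample t test on the paired differences in generalization probability across matched queries. A sketch with hypothetical response values (not the actual data):

```python
import math

def paired_t(before, after):
    """One-sample t statistic on the paired differences (after - before);
    a negative value indicates gradient tightening."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical generalization probabilities (%) for the same query items
# before and after the sample size is increased with a fixed edge point.
before = [80, 70, 60, 90]
after = [75, 68, 55, 90]
print(paired_t(before, after))   # ≈ -2.449: responses dropped on average
```

In the experiment the same computation is applied to the 1,408 pairs of responses, giving the t1407 statistic reported above.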
7.3. Model-based analyses
The simplified analysis in the previous section provides some evidence that the ‘‘gradient tightening’’ effect is present in these data, but the effect is quite weak and leaves a great many questions unanswered. Is the tightening effect actually consistent with the strong sampling model proposed by Tenenbaum and Griffiths (2001a), or indeed with any of the sampling models discussed in this article? Is the effect present for all scenarios, or only one or two? Do people tend to rely on the same sampling assumptions, or do people differ in this respect? In order to answer these questions, we need to fit the model to data explicitly. In this section, we present an analysis that discusses two key questions: (a) the extent to which the Bayesian model successfully fits the data, and (b) what assumptions about θ are implied by these data. However, although we are careful not to average data across participants because of the individual differences that exist, we do not discuss these differences in any detail in this section; we return to this topic later in the article.
The first question of interest relates to the adequacy of the generalization model. In order to address this question, we estimated the best-fitting values of θ and φ separately for all 22 participants and all three scenarios, yielding a total of 66 parameter estimates (details are provided in the Appendix). We then measured the correlation between human responses and the model predictions at these best-fitting parameters, where the correlation is taken over the 72 judgments (24 queries × 3 cases) that each participant made in each scenario. Note that this setup means that the model needs to predict every single judgment made by every participant: No aggregation of data at any level is involved. The resulting correlations are plotted in Fig. 9. After making a degrees of freedom adjustment to account for the two extra free parameters (i.e., θ and φ), almost all correlations (62 of 66) are significant. More important, the correlations are quite strong, especially in view of the fine grain at which we are modeling the data: The median correlation is 0.77, with an interquartile range of [.65, .85]. Overall, it is clear that the model is able to describe the data well.
Given that the model fits appear to be quite good, it is natural to ask what values of θ are involved. As the histograms in Fig. 10 illustrate, there is evidence for both strong sampling and weak sampling in this task. However, there are also clear differences between people and between scenarios. While the distributions are unimodal and biased toward weak sampling (small θ) in two of the three scenarios, the distribution of estimates for the foraging scenario is rather bimodal.
The overall pattern of results in experiment 1 highlights several points. First, there is quite strong evidence that the Bayesian generalization model is able to describe accurately the behavior of participants at a quite fine-grained level of analysis (prediction of every judgment in the experiment). However, it is important to take note of the technical literature that discusses how to safely measure the adequacy of a model's performance, addressing issues such as model complexity (e.g., Myung, 2000; J. I. Myung, Navarro, & Pitt, 2006; Pitt, Myung, & Zhang, 2002), individual differences (e.g., Lee & Webb, 2005; Navarro et al., 2006), and contaminant processes (e.g., Huber, 1981; Ratcliff & Tuerlinckx, 2002; Zeigenfuse & Lee, 2010). Although we glossed over the details, our analysis does in fact accommodate all of these topics: The Appendix provides a discussion of how this was done.
Secondly, the data are quite informative as regards the default assumptions that people make about how observations are generated (at least in some contexts). Notice that the scenario descriptions did not explicitly state how the data were generated: Participants were expected to supply the missing ‘‘sampling’’ assumption on their own. That said, there is a sense in which the data themselves suggest a strong sampling model, since participants only ever observed positive examples of a category. If people relied solely on this as a cue, then we would expect a bias toward larger values of θ. Despite this, the reverse is true: On the whole, smaller values of θ tended to predominate.
Thirdly, the pattern of variation in the estimated θ values between scenarios is interesting. When the scenarios were originally designed, we did not anticipate finding any differences between them; it was assumed that these differences would be presentational gloss and no more. Nevertheless, what we in fact observed is a relatively substantial difference in the distributions over θ. People did seem to have a strong bias toward weak sampling in two of the scenarios, and a weak bias to prefer strong sampling in the third.
This third point deserves further investigation. Specifically, the fact that we found differences across the scenarios raises additional questions regarding the origin of these differences. In particular, we note that the ‘‘foraging’’ scenario (which induced larger θ values than the other two) differed from the other two problems in two respects. First, as illustrated in Fig. 7, its sample sizes grew from 1 to 3 data points and then from 3 to 5, whereas in the other two scenarios they grew from 3 to 5 and then from 5 to 10. One possible explanation for the use of strong sampling in this situation is that the proportional increases in the ‘‘1:3:5’’ structure are larger than in the ‘‘3:5:10’’ case (the jump from 1 to 3 is a tripling, whereas the largest jump in the other structure, from 5 to 10, is only a doubling), which (a) makes it easier for us to detect an effect, and (b) may make the sampling regime more salient to people. More generally, the data structures involved in the three cases are noticeably different from one another, suggesting the possibility that this is the source of the difference.
An alternative explanation involves the cover stories themselves. In particular, the bacterial temperature and soil pH scenarios both suggested an element of ‘‘experimental control,’’ whereas the bandicoot foraging scenario did not. Experimental control over the sampling locations means that the absence of data in other locations is uninformative. That is, if an experimenter chooses the locations at which observations are to be made, then these locations clearly are not sampled from the true hypothesis. Thus, for the purposes of generalization it implies weak sampling. Accordingly, the effect may be due to the cover story and not the data structure. In essence, the cover stories may have subtly given participants instructions as to which sampling model is most plausible. Given that we have multiple hypotheses for the origin of the effect, the next section describes an experiment that disambiguates between these two possibilities.
8. Experiment 2
The goals for experiment 2 were three-fold. First, we aimed to replicate the effects from experiment 1 using a new set of cover stories. Secondly, we aimed to disambiguate between the ‘‘different data structure’’ explanation and the ‘‘implicit instruction’’ explanation for the scenario differences. Finally, we altered methodological aspects of the task slightly, to check that the results are robust to these manipulations. Specifically, queries were phrased as confidence judgments rather than probability estimates, and responses were given on a Likert scale rather than by positioning a continuous slider bar.
8.1. Method
8.1.1. Participants
Twenty participants (7 male, 13 female, aged 18–36) were recruited from the general university community. Since the cover story manipulation relied on participants’ reading the materials carefully and understanding the nuances, all participants were fluent English speakers, and most were graduate students. All were naive to the goals of the study, though one was familiar with Bayesian generalization models.
8.1.2. Materials and procedure
The experiment presented people with a cover story in which they were asked to imagine themselves as an explorer collecting samples of tropical plants and animals, and asked them to make inferences about the different types of foods encountered on the journey. There were three scenarios, relating to ‘‘bobo fruit,’’ ‘‘walking birds,’’ and ‘‘pikki-pikki leaves.’’ After presenting people with introductory text explaining the experiment, text appeared on screen introducing one of the three scenarios.
Participants were then shown a number of stimuli and asked to make generalizations, under an implicit ‘‘strong sampling’’ or implicit ‘‘weak sampling’’ regime. For instance, if the scenario was bobo fruit and the sampling regime was strong, the text indicated that a local guide had given the participant delicious fruit to eat. This is presumably a strong sampling situation, since a helpful guide is not going to choose bad-tasting fruit as an example of the local cuisine. In contrast, under the weak sampling regime the text implied that the participant had collected the fruit themselves, more or less at random.
Participants responded using the keyboard, indicating their confidence on a 9-point Likert scale (1 = very low confidence and 9 = very high confidence). The locations of the training data and the generalization questions were identical to those used in experiment 1. Each participant saw all three data structures (i.e., all three panels in Fig. 7), and all three cover stories (bobo fruit, walking bird, pikki-pikki leaves). The assignment of cover story to data structure was randomized, and the data structures were presented in a randomized order. Participants either saw all three scenarios in the strong sampling form, or all three in weak sampling form. Twelve participants saw the weak sampling version and eight saw the strong sampling version.
In order to make the cover story feel more engaging to participants, the actual task was accompanied by thematically appropriate background music, various props (including pith helmets among other things), and the onscreen display had images of cartoon tropical trees to add to the atmosphere. Qualitative feedback from participants did suggest that they found the task both engaging and enjoyable, and indicated that they did spend some time thinking about the scenarios.
8.2. Basic results
The analysis for experiment 2 proceeds in the same fashion as for experiment 1. As before, we examine the 64 simple test cases to see whether the generalization gradients tighten due to sample size. If we translate the Likert scale responses to probability judgments in the obvious way (i.e., treating 1 as 0%, 9 as 100%, and interpolating linearly for all other values), then what we observe in the data corresponds to an average decrease in generalization probabilities of 3.0%. Once again, this is a significant (t1299 = −6.5, p < .001) but weak effect, of a similar magnitude to the 2.3% decrease found in experiment 1.
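The Likert-to-probability conversion described here is a simple linear map:

```python
def likert_to_prob(response):
    """Map a 1-9 Likert confidence rating onto a 0-100% probability scale
    by linear interpolation, as described in the text."""
    return (response - 1) / 8 * 100

print(likert_to_prob(1), likert_to_prob(5), likert_to_prob(9))  # 0.0 50.0 100.0
```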
8.3. Model-based analyses
As with experiment 1, we found the best-fitting values of θ and φ for each participant and each scenario. The correlations between model predictions and human responses were somewhat higher in experiment 2. As shown in Fig. 11, all 60 correlations were significant, with a median of .90 and interquartile range of [.83, .94]. Overall, the model performance is at least moderately good in all cases, and it is usually very good.
The main difference between experiments 1 and 2 is that the cover story, the data structure, and the sampling scheme implied by the cover story were all varied independently. After checking that (as expected) there were no effects of the cover story itself (i.e., no differences between bobo fruits, walking birds, and pikki-pikki leaves) and no interaction effects, we ran a 3 × 2 ANOVA with main effect terms for data structure and the sampling scheme (weak vs. strong) implied by the cover story, using the best-fitting value of θ as the dependent variable. The results suggest that the data structure influenced the choice of θ (F2,52 = 2.98, p = .06), but that people did not adjust θ as a result of the different sampling schemes implied by the cover story (F1,52 = 0.004, p = .95).
Overall, the results from experiment 2 replicate those from experiment 1 and suggest that the differences in θ across conditions are largely due to the data structure and not to an implicit instruction buried in the cover story. Consistent with our original intuition, the cover stories in these experiments are too ‘‘generic’’ to induce strong differences in people's sampling assumptions.
A question that these results raise is why there was no effect of cover story, especially in light of the results of Xu and Tenenbaum (2007a). One possibility might be that the participants simply did not read the cover story closely, and hence did not take the implied sampling scheme into account. This seems rather unlikely, since the participants were explicitly told to pay close attention to the story. This is no guarantee that they did, but most participants did spend some time reading the stories. More plausibly, the manipulation may simply have been too subtle: People may not have given a great deal of thought to the implied sampling assumptions and made no accommodation of them as a result. This suggests that a more overt manipulation of actual sampling procedures (such as the one used by Xu and Tenenbaum [2007a]) is likely to have an effect, but subtle differences in cover story may not. In the meantime, it seems that it is the data structure that is the origin of the effect. One possible reason for this might be that people are learning about the sampling model itself as the experiment progresses, and that the 1:3:5 structure induces a different amount of learning than the 3:5:10 structure does. Future research might address this possibility.
9. The sampling assumptions of individuals
Between the two experiments there are a total of 126 posterior distributions over θ (42 participants × 3 scenarios each). In the analyses presented up to this point we have attempted to summarize these individual distributions in terms of the marginal distributions in Figs. 10 and 12. However, the aggregate distributions do not tell the whole story.
First, a focus on the aggregate distributions and goodness-of-fit statistics tends to obscure the actual behavior of people and the model. For instance, consider the comparison between the data from participants 2 and 15 in the bacterial temperatures scenario from experiment 1, shown in Fig. 13. In the first panel, the generalization gradients are quite similar, because the differences in priors (φ = 1.03 and φ = 1.71 respectively) are almost perfectly balanced by the differences in sampling assumptions (θ = .02 and θ = .47) when the learner has seen only three observations. However, as more data are observed, the two people diverge in their predictions, because the differences in the likelihood functions come to dominate their beliefs. Analogous examples can be found in experiment 2.
The opposite effect can be seen by comparing participants 10 and 19 in scenario 3 in experiment 1. These two people have different priors (φ = 1.79 and φ = 5.77, respectively) but the same likelihood (best-fitting value is θ = 1 in both cases). When only a single datum is observed (left panels), these participants behave in quite different ways, but these differences begin to disappear as more data arrive (right panels). As is generally the case with Bayesian models, the data ‘‘swamp’’ the prior. That is, in the previous comparison (Fig. 13), participants made different assumptions about sampling, and so grew more dissimilar as the sample size increased. However, in this second comparison (Fig. 14), participants agree about how data are produced; as a consequence, their prior differences are erased as the sample size increases from left to right.
The second thing missing from the earlier analyses is a detailed examination of the posterior distribution over θ for each participant. In analyses to this point we have focused on the best-fitting values of θ, giving little consideration to the full posterior distribution. Nevertheless, for each participant and each scenario we have 72 judgments available, so the analysis is worth doing. We used standard Markov chain Monte Carlo techniques (MCMC; Gilks, Richardson, & Spiegelhalter, 1996) to estimate a complete posterior distribution over θ for every participant and every scenario (assuming a uniform prior). Specifically, we employed a Metropolis sampler with a Gaussian proposal distribution and a burn-in of 1,000 iterations; the distributions were estimated from 5,000 samples drawn at a lag of 10 iterations between successive samples.
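As a concrete sketch of the estimation machinery, the following code implements a Metropolis sampler with the settings just described (Gaussian proposal, 1,000-iteration burn-in, 5,000 samples retained at a lag of 10). The target posterior is a stand-in, not the article's full model: it assumes five observations known to fall inside a region of width w within a stimulus range of size R, scored under a mixed-sampling density θ/w + (1 − θ)/R per observation; in the actual analyses θ is constrained by the full set of generalization judgments instead.

```python
import numpy as np

rng = np.random.default_rng(0)

R, w = 10.0, 2.0       # stimulus range and (hypothetical) true region width
n_obs = 5              # five observations, all inside the region

def log_post(theta):
    """Log-posterior over theta with a uniform prior on [0, 1], assuming the
    mixed-sampling density theta/w + (1 - theta)/R for each observation."""
    if not 0.0 <= theta <= 1.0:
        return -np.inf
    return n_obs * np.log(theta / w + (1.0 - theta) / R)

# Metropolis sampler: Gaussian proposal, burn-in of 1,000 iterations,
# 5,000 samples drawn at a lag of 10 (matching the settings in the text).
theta, samples = 0.5, []
for i in range(1000 + 5000 * 10):
    prop = theta + rng.normal(0.0, 0.1)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop  # accept the proposal
    if i >= 1000 and (i - 1000) % 10 == 0:
        samples.append(theta)
```

For these stand-in data the likelihood increases with θ (the observations sit inside a region much narrower than the full range), so the posterior mass concentrates toward larger values of θ.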
For experiment 1, Fig. 15 plots each of the 66 posterior distributions over θ separately. As is clear from inspection, the apparent bimodality in the foraging scenario is an artifact of aggregating over participants: all of the individual distributions over θ are unimodal, but they are centered on different values. For both the bacterial and soil scenarios, most people either adopted a weak sampling model or used a mixed sampling model centered on a fairly modest value of θ. For the foraging scenario, however, larger values of θ dominate, with only a few participants adopting weak sampling assumptions. The corresponding plots for experiment 2, shown in Fig. 16, display a similar pattern.
10. General discussion
The tendency to observe modest but non-zero values of θ in the experiments is strikingly similar to the phenomenon of conservatism in decision making: People's belief revision tends to be slower than the rate predicted by simple Bayesian models (e.g., Phillips & Edwards, 1966). In the experiments reported here, there is a sense in which θ = 1 is the ‘‘correct’’ sampling model to use, since the experimental design was such that all observed data points were constrained to lie inside the true consequential region r. Nevertheless, although some participants show clear evidence of having done so, most are at least a little conservative (θ < 1). Indeed, some participants do not adjust their generalization gradients at all (θ = 0), appearing to be insensitive to the effects of increasing the sample size (as per Tversky & Kahneman, 1974).
Conservative sampling assumptions also help resolve an odd discrepancy between two different types of theory. Most category learning models (e.g., Anderson, 1991; Kruschke, 1992; Love et al., 2004) do not tighten the generalization gradients as the experiment progresses,7 and they are capable of fitting empirical data on the basis of this implicit weak sampling assumption. Our findings are broadly consistent with this: Although a small number of people did display very strong tendencies to tighten their generalization gradients, most did not. The typical change of 2%–3% is small enough that it would probably go unnoticed in a great many categorization experiments. This seems particularly likely given that all of our problems were ‘‘positive evidence only’’ in design (in which strong sampling is the true model), whereas the majority of category learning experiments involve supervised classification designs (in which weak sampling is the true model). That the typical value of θ remains a conservative .2 even in this case goes a long way toward explaining why strong sampling is not the usual assumption made when modeling category learning.
In light of this result, the heavy reliance on strong sampling observed in other contexts (e.g., Tenenbaum & Griffiths, 2001a; Xu & Tenenbaum, 2007b) might seem unjustified. However, note that these models are typically concerned with learnability questions, often over developmental timeframes. In such cases, the critical problem facing the model is to explain how it is possible for the learner to acquire rich mental representations when exposed only to positive examples (Muggleton, 1997), and to do so, some variation of the strong sampling assumption is often implicitly adopted. For instance, in a language learning context, the amount of data strictly required to acquire abstract knowledge about syntax may be surprisingly small (Perfors et al., 2006; Perfors, Tenenbaum, & Regier, 2011): The data presented to the model by Perfors et al. (2006) correspond to only a few hours of verbal input. In real life, of course, children require a great deal more data, presumably because they have other things to learn besides syntax, their raw data are messier than preprocessed corpus data, they have processing constraints that slow the learning, and so on. These factors should act to weaken the sampling model, but, as long as the resulting model is not strictly weak (i.e., as long as θ > 0), the learner should eventually acquire the knowledge in question. In fact, as Fig. 17 illustrates for the spatial generalization problem considered in this article, as the sample size gets arbitrarily large, all sampling models other than weak sampling eventually become indistinguishable, because they all converge on the true region. Thus, a learner who consistently employed a θ = .0001 model would acquire knowledge in a fashion that is consistent with both the ‘‘strong sampling’’ pattern used in some models of language acquisition and the ‘‘weak sampling’’ pattern often found in categorization experiments.
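The convergence behavior illustrated in Fig. 17 can be checked directly in a simplified version of the one-dimensional model. The sketch below assumes a uniform prior over candidate intervals [l, u] on a discretized range and a mixed-sampling likelihood of the form (θ/(u − l) + (1 − θ)/R)^n for n observations inside the interval; the range and data values are made-up illustrations, not the article's stimuli. For any θ > 0 the generalization gradient beyond the observed data tightens as the sample size grows, whereas at θ = 0 (weak sampling) it is unaffected by sample size.

```python
import numpy as np

R = 10.0                          # size of the stimulus range [0, R] (made up)
edges = np.linspace(0.0, R, 101)  # discretized candidate interval edges

def gen_gradient(data, theta, query):
    """P(query lies in the consequential region | data), assuming a uniform
    prior over intervals [l, u] and the mixed-sampling likelihood
    (theta/(u - l) + (1 - theta)/R) ** n for n observations inside [l, u]."""
    lo, hi, n = min(data), max(data), len(data)
    num = den = 0.0
    for l in edges:
        for u in edges:
            if l < u and l <= lo and hi <= u:  # interval must cover the data
                like = (theta / (u - l) + (1.0 - theta) / R) ** n
                den += like
                if l <= query <= u:
                    num += like
    return num / den

d = [4.0, 5.0, 6.0]
g3 = gen_gradient(d, 0.5, 8.0)        # query beyond the data, n = 3
g30 = gen_gradient(d * 10, 0.5, 8.0)  # same span, n = 30: gradient tightens
g0a = gen_gradient(d, 0.0, 8.0)       # theta = 0 (weak sampling): ...
g0b = gen_gradient(d * 10, 0.0, 8.0)  # ... unchanged by sample size
```

With θ = 0 the likelihood is the same constant for every interval covering the data, so repeated observations carry no information about the region's extent; any θ > 0 increasingly penalizes wide intervals as n grows.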
A number of other avenues for future work present themselves. Even within the context of simple spatial generalization problems, the assumption that the underlying hypothesis space consists solely of connected regions is probably too simple (Tenenbaum & Griffiths, 2001b). In particular, people may be willing to entertain the notion that the correct hypothesis consists of multiple regions (see Navarro, 2006, for a formalization of this idea). To the extent that people believe that categories can cover multiple regions, the effects on the generalization gradients are likely to be twofold. First, consistent with the general findings from our experiments, the generalization gradients over the extrapolation region (i.e., beyond the training data) should tighten more slowly than would be expected under the simpler model, because the learner remains open to the possibility that there exists some additional region of the stimulus space for which the hypothesis is true, about which he or she has yet to uncover any information. Second, the generalization gradients will not remain flat inside the interpolation region, because the learner must entertain the hypothesis that the training data he or she has seen actually come from multiple regions. In fact, Fig. 8 shows evidence of this effect: Participant 14 (dark lines) decreases the generalization probability in the gaps between training examples, whereas participant 2 (lighter lines) does not. More generally, if we examine the interpolation judgments across experiment 1 as a whole, there is a weak correlation (Spearman's ρ = −.16, p ≈ 10^−5) between the generalization probability and the distance from the query item to the nearest training exemplar. This is consistent both with standard exemplar models and with Bayesian generalization models that allow multiple regions.
However, since the experimental designs used in this article were not optimized for examining interpolative judgments (these data show a strong ceiling effect, with most people giving generalization judgments very close to 1 for all interpolation questions), this remains an open question for future work.
The use of uniform distributions as a basic building block is fairly standard for models in which the learner's goal is to identify a ‘‘consequential set’’ (e.g., Navarro, 2006; Sanjana & Tenenbaum, 2003; Shepard, 1987; Tenenbaum & Griffiths, 2001a), and it helps avoid the circularity of explaining graded generalizations in terms of graded sampling distributions. However, it is important to recognize that it, too, is an oversimplification. One of the more interesting extensions one might consider is the situation in which the learner has to determine both the extension of the consequential set itself (i.e., which things are admissible members of the category) and the distribution over those entities (i.e., which things are more typical members of the category). Along similar lines, we have considered only those situations in which the learner observes positive and noise-free data from a single category. Natural extensions of this work would examine how sampling assumptions operate in other experimental designs, and in the presence of labeling noise. In the meantime, however, it seems reasonable to conclude that individual participants' inductive generalizations are in close agreement with the predictions of the Bayesian theory, but that they vary considerably in their default assumptions about how data were generated.
A portion of this work was presented in the Proceedings of the 30th Annual Conference of the Cognitive Science Society. DJN was supported by an Australian Research Fellowship (ARC grant DP0773794). We thank Nancy Briggs and Amy Perfors for helpful comments, and Peter Hughes, Angela Kinnell, and Ben Schultz for their assistance with the experiments.
Note that this is a comment about the likelihood functions used in models for classic inductive reasoning tasks: In the Bayesian literature more generally, there has of course been a much wider range of likelihoods used by modelers.
It is important to recognize that Eqs. 2 and 3 are simplifications. They depend both on the ‘‘strong versus weak’’ distinction (namely, the extent to which the distribution that generates stimuli depends on the distribution over category labels) and on the specific form that the likelihood function takes (uniform distributions in this case). The story is not so simple in general; for simplicity, we restrict our discussion to the case relevant to the experiments in this article.
We thank Fermin Moscoso del Prado Martin for drawing our attention to this link.
In view of our discussion about correlated environments, this assumption may seem odd. However, it is important to recognize that conditional independence is not inappropriate: Part of the qualitative idea behind the mixed sampling model is to use θ to express the dependencies among items. That is, with probability θ the generation of observation x may be deemed to be formative, but it is otherwise deemed to convey no new information.
The linearity may seem surprising, since Shepard's (1987) analysis produced exponential gradients: The difference is partly due to the priors and partly due to the restriction to a finite range.
This is not strictly correct, insofar as the behavior of a strong sampling model can sometimes depend on what happens to both of the edge points, not just the nearest one. Even so, the results of this simple analysis are in accordance with the results of the more detailed model fitting presented later in the article.
While it is true that these models do adjust the generalization gradients in order to accommodate learned selective attention, the effect is generally to tighten generalization along one dimension at the expense of other dimensions. What is not generally part of these models is an overall tightening along all dimensions, as strong sampling requires.