Adaptive Non-Interventional Heuristics for Covariation Detection in Causal Induction: Model Comparison and Rational Analysis


Department of Psychology, Ritsumeikan University, 56-1 Kitamachi, Toji-in, Kita-ku, Kyoto, 603–8577, Japan.


In this article, 41 models of covariation detection from 2 × 2 contingency tables were evaluated against past data in the literature and against data from new experiments. A new model was also included based on a limiting case of the normative phi-coefficient under an extreme rarity assumption, which has been shown to be an important factor in covariation detection (McKenzie & Mikkelsen, 2007) and data selection (Hattori, 2002; Oaksford & Chater, 1994, 2003). The results were supportive of the new model. To investigate its explanatory adequacy, a rational analysis using two computer simulations was conducted. These simulations revealed the environmental conditions and the memory restrictions under which the new model best approximates the normative model of covariation detection in these tasks. They thus demonstrated the adaptive rationality of the new model.

1. Introduction

To respond adaptively in an environment that changes from moment to moment, people must make rapid and precise predictions of the future based on current information. In this respect, the concept of causality is central and has probably been acquired over the long course of evolution along with other cognitive abilities (Toda, 1983). There have been many arguments in philosophy and psychology about the concept of causality and the mechanisms of causal judgment (e.g., see, Sosa & Tooley, 1993; Sperber, Premack, & Premack, 1995). However, philosophical discussion about the normative definition of causation, although undoubtedly important, could be irrelevant to descriptive theories of human causal cognition. Such discussions, as they are relevant to science, concern the justification of our causal inferences when under no time pressure and when we have many resources to devote to the problem. However, for survival in the everyday world, for people who have limited memory and limited information processing capacity, it is important to be able to derive plausible conclusions rapidly using minimal resources (i.e., storing as little information as possible).

Most models of causal induction assume that extracting covariation information is an important first step (e.g., Anderson & Sheu, 1995; Baker, Murphy, & Vallée-Tourangeau, 1996; Cheng, 1997; Cheng & Novick, 1992; Einhorn & Hogarth, 1986; Schustack & Sternberg, 1981; Shanks & Dickinson, 1987; Wasserman, Kao, Van Hamme, Katagiri, & Young, 1996). Of course, however, covariation is not causation. For example, the rooster's crow is not a cause of the sun's rising, even though the two covary. A subsequent analytic process, which may be based on prior domain-specific causal knowledge, is required to identify genuinely causal links between events. Nevertheless, covariation detection is important for two reasons. First, there are always innumerable irrelevant factors. We have to exclude a vast number of unrelated events as potential causal candidates to discriminate genuine from spurious causes. Covariation can be a useful cue to screen out irrelevant factors. Second, correlational knowledge is sufficient for predicting the future. As long as we do not intervene, a strong correlation can provide precise predictions. In this article, we concentrate on the covariation-based aspect of causal induction. Our goal is to discover the nature of people's covariation detection mechanism and whether there are circumstances in which it can be given a rational interpretation.

In most experiments, covariation information is represented in a 2 × 2 contingency table as shown in Table 1. This table shows the relationship between a candidate cause and an effect. Four combinations arise: The candidate cause is present or absent (C or C̄, respectively) and the effect is present or absent (E or Ē, respectively), leading to the four cells (“C and E,” “C and Ē,” “C̄ and E,” “C̄ and Ē”), which are labeled cells a, b, c, and d, respectively. Models of covariation detection can be expressed using the frequencies of these cells.

Table 1. A 2 × 2 contingency table containing covariation information

                        Effect
  Cause                 Present (E)    Absent (Ē)
  Present (C)               a              b
  Absent (C̄)                c              d

  Note: In areas other than causal induction, combinations of a cause and an effect could be prediction–actuality (e.g., weather forecasting), stimulus–response (e.g., recognition memory, perceptual response), etc.

Experimental studies have revealed that normative models of covariation detection from 2 × 2 contingency tables are descriptively inadequate. For example, it has been argued that participants do not weight the four cell frequencies equally: the a-cell is weighted most heavily and the d-cell least (e.g., Arkes & Harkness, 1983; Crocker, 1982; Jenkins & Ward, 1965; Kao & Wasserman, 1993; Mandel & Lehman, 1998; Nisbett & Ross, 1980; Schustack & Sternberg, 1981; Shaklee & Mims, 1982; Smedslund, 1963; Ward & Jenkins, 1965; Wasserman, Dorner, & Kao, 1990). However, normative indices (e.g., the phi-coefficient [φ], introduced later) require all the cells to be weighted equally. This unequal weighting may reflect a psychological asymmetry between the concepts of presence and absence, which is closely related to the fact that presence is often rare while absence is pervasive.

Recently, however, a number of theorists have argued for the adaptive rationality of differentially weighting the cell frequencies (e.g., Anderson & Sheu, 1995; Cheng, 1997; Friedrich, 1993; Gigerenzer & Hoffrage, 1999; Mandel & Lehman, 1998; McKenzie, 1994; but see also, Over & Green, 2001). Although a particular heuristic strategy may seem irrational from a normative standpoint, it may work practically very well given the structure of our natural environment and so can be justified from an adaptive point of view (e.g., see, Anderson, 1990; Evans & Over, 1996a; Gigerenzer, 2000; Oaksford & Chater, 1998a, 1998b; Payne, Bettman, & Johnson, 1993; Stanovich, 1999). McKenzie (1994) investigated the average accuracy of some non-normative indices proposed as descriptive models of covariation judgments from 2 × 2 contingency tables.1 He compared them with a normative statistic (the phi-coefficient, φ) and showed that many of them performed quite well. His seminal work revealed that some heuristics could be practically very useful, but it is still unclear which model predicts human performance best and, moreover, why people use a particular heuristic.

2. Scope and overview

We propose that human causal induction has two stages: a heuristic stage and an analytic stage. The analytic stage is essential to discriminate between genuine and spurious causes. The heuristic stage is also vital in distinguishing relevant causal candidates from innumerable irrelevant factors. We believe that different models are needed to explain these components of the complex process of human causal induction, and this has not been explicitly recognized. The dual process approach in causal induction is related to several two-system views in human cognition (e.g., Evans, 1989; Sloman, 1996; Stanovich, 1999). This study addresses only the heuristic stage of causal induction and leaves open the question of what else is required to identify genuinely causal relationships in the analytic stage. The heuristic stage of causal induction is mostly concerned with covariation assessment between two events. As a first step in causal induction, covariation assessment is often based on observation, whereas intervention (i.e., experimental manipulation of particular factors) to distinguish real from spurious causes is part of the analytic stage (e.g., Lagnado & Sloman, 2004).

We first introduce a new heuristic index of covariation detection, the dual factor heuristic (Hattori, 2001), which is motivated by considering the rarity of particular causes and effects in the environment. This factor has proved important in other related areas (Hattori, 2002; McKenzie & Mikkelsen, 2000, 2007; Oaksford & Chater, 1994, 2003). Next we report the results of an experiment aimed at examining the difference between two common experimental paradigms (discrete vs. continuous) in causal induction. The experiment was intended to decide which data in the literature could be included in meta-analyses comparing the dual factor heuristic with other indices of covariation. We then present a meta-analysis exhaustively comparing this index with 33 other non-parameterized indices that have been proposed in various literatures, with respect to their ability to account for the data from a set of past causal induction and covariation detection experiments. The dual factor heuristic comes out best in this comparison, but other indices are very close runners-up. We therefore report the results of an experiment designed to discriminate among these indices to determine which is the best of this group. We then present a further meta-analysis comparing the dual factor heuristic with parameterized models of covariation detection. Finally, we present a rational analysis of the dual factor heuristic using computer simulations.

The fundamental question that we address is why the dual factor heuristic fits the data. The answer we will suggest is that under certain reasonable environmental conditions it provides a good approximation to normative predictions. To clarify this point, we conducted two computer simulations using the Monte Carlo method. In the simulations, following Anderson's (1990) rational analysis, we take three factors into consideration: the structure of the environment, cognitive constraints (e.g., time, memory load), and the results of the behavior (e.g., accuracy, satisfaction). It turns out that the factors that determine when the dual factor heuristic best approximates normative predictions in covariation detection are the same as those found in data selection (Hattori, 2002; Oaksford & Chater, 1994, 2003). We first introduce the dual factor heuristic.
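To give a concrete flavor of this kind of rational analysis, the sketch below (our own illustration, not the simulations reported later; the function names and sampling choices are ours) samples random environments, caps how common the cause and effect can be, and measures how well the d-cell-free heuristic H = a/√((a + b)(a + c)) (introduced in the next section) rank-orders contingency tables relative to the normative φ:

```python
import math
import random

def phi(a, b, c, d):
    # Normative phi coefficient; returns 0.0 for a degenerate table.
    den = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / den if den else 0.0

def dual_factor_h(a, b, c):
    # Dual factor heuristic H = a / sqrt((a+b)(a+c)); the d-cell is ignored.
    den = math.sqrt((a + b) * (a + c))
    return a / den if den else 0.0

def spearman(u, v):
    # Rank correlation as Pearson on ranks (ties broken by order; fine for a sketch).
    def ranks(w):
        order = sorted(range(len(w)), key=w.__getitem__)
        r = [0] * len(w)
        for rk, i in enumerate(order):
            r[i] = rk
        return r
    ru, rv = ranks(u), ranks(v)
    n = len(u)
    mu = (n - 1) / 2
    cov = sum((x - mu) * (y - mu) for x, y in zip(ru, rv))
    var = sum((x - mu) ** 2 for x in ru)
    return cov / var

def agreement(p_max, n_tables=200, sample=1000):
    # Sample environments whose base rates P(C), P(E) are capped at p_max,
    # then ask how well H rank-orders the resulting tables relative to phi.
    hs, phis = [], []
    for _ in range(n_tables):
        pc = random.uniform(0.01, p_max)
        pe = random.uniform(0.01, p_max)
        pce = random.uniform(max(0.0, pc + pe - 1.0), min(pc, pe))  # joint P(C, E)
        a = round(sample * pce)
        b = round(sample * (pc - pce))
        c = round(sample * (pe - pce))
        d = max(sample - a - b - c, 0)  # guard against rounding overshoot
        hs.append(dual_factor_h(a, b, c))
        phis.append(phi(a, b, c, d))
    return spearman(hs, phis)

random.seed(1)
for p_max in (0.1, 0.5, 0.9):  # small p_max = rare cause and effect
    print(p_max, round(agreement(p_max), 2))
```

Under the rarity cap (p_max = 0.1) the rank agreement between H and φ should be high, and it should fall as the base rates are allowed to grow, mirroring the conditions the simulations identify.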

3. The dual factor heuristic

3.1. Two stages in causal induction

We have proposed that there is a heuristic and an analytic stage to human causal judgment. These two stages are differently motivated, and the dual factor heuristic is concerned with the heuristic stage. Identifying genuine causes is important if we are to control the environment and change the world to meet our needs. Whereas correlation enables us to predict the future, causation provides us with more: it enables us to imagine the results of our own actions on the environment. The analytic stage, however, requires more cognitive resources and time. Given the cognitive costs of the analytic process, deciding which factors should be marked out as potential causal candidates and which can be ignored is important for survival in the real world. This two-stage hypothesis in causal induction parallels dual process theory (Evans, 1989) in reasoning.

This hypothesis is also relevant to recent discussion of the distinction between observation and intervention in causal reasoning (Pearl, 2000). In principle, it is hard to deduce a causal dependency only from observational (non-interventional) data like that presented in 2 × 2 contingency tables. Correlation does not imply causation. Intervening on events that covary is essential to learning whether one of them is the cause of the other. For example, we must prevent the rooster from crowing (i.e., intervene on a causal candidate) to prove that its crowing is not the cause of the sun rising. Using Pearl's terminology of the “do” operator, we need to do(¬Crow) to check whether ¬Sunrise follows (where “¬” = not). Recently, some researchers have suggested that the cognitive processes involved in intervention differ from those of observation (Lagnado & Sloman, 2004; Sloman & Lagnado, 2005; Steyvers, Tenenbaum, Wagenmakers, & Blum, 2003). Thus, there is a clear distinction between situations where the presence or absence of the cause and effect is simply observed, and situations where someone chooses to allow or prevent the cause from occurring and keeps track of the effect. However, before one can adopt the intervention strategy one must have some candidate causes in mind, and these must derive from covariation detection.

3.2. A heuristic ignoring nonoccurrence

As we pointed out above, normative theories of covariation detection include information about all four possible events relating cause and effect, i.e., information from all four cells (a, b, c, and d) in a contingency table. However, presenting such information in summary form in a contingency table is highly unrealistic. People normally encounter the relevant events that provide the data for covariation detection sequentially and compute covariation online from this stream of data. This involves storing and retaining information about the relevant events. With respect to any particular cause–effect relation, this creates a serious problem.

Although it is clear what should be stored when the cause alone occurs (cell b), the effect alone occurs (cell c), or both cause and effect occur (cell a), it is not clear what information to record when neither cause nor effect occurs (cell d). Most of the time, any particular cause and effect are not occurring as we move around the world (just think of any particular event, such as starting your car). How then are these non-occurring events to be counted? Perhaps the time in between events of types a–c could be divided into temporal bins in which the cause or effect could occur, and these could be counted or stored away as instances of d-cell events. Storage of such events would be very inefficient and creates a problem analogous to the frame problem in artificial intelligence (McCarthy & Hayes, 1969; Pylyshyn, 1987). The frame problem arose in the context of reasoning about change: It relates to the need to store all the information about what does not change in a situation when something else does (Oaksford & Chater, 1991, 1993, 1995, 1998b, also explored the ramifications of this problem for theories of human deductive reasoning). So, if the coffee was spilt, then frame axioms would have to be added to the database indicating that the picture did not fall to the floor, the window did not open, etc. Storage of such information is computationally prohibitive, as the information to be stored soon exceeds reasonable bounds. This problem also arises in the context of covariation detection: all potential cause–effect relations would have to have non-occurring d-cell events stored as their respective d-cell instances.

A more efficient procedure would be to store information about a–c cell events and infer the d-cell. As for any particular cause–effect relation the cause and effect are not normally present, the inference is relatively clear cut: The d-cell is likely to be very large. The assumption that the d-cell is very large is equivalent to the assumption that the base rates of the cause, P(C), and the effect, P(E), are small. This is consistent with the rarity assumption that has proved important in providing rational analyses of data selection behavior (Evans & Over, 1996b; Klauer, 1999; Nickerson, 1996; Oaksford & Chater, 1994, 2003) and in providing a Bayesian justification for differential cell weightings in judging covariation (e.g., Anderson & Sheu, 1995; McKenzie & Mikkelsen, 2007).

In deriving an index of the dual factor heuristic for non-interventional covariation detection, we consider a normative statistical index, the phi-coefficient, φ, because it is the most popular measure of correlation. We explored the consequences of assuming the d-cell is very large for φ (we thank an anonymous reviewer for suggesting this derivation). In terms of the cell frequencies this coefficient can be rewritten as follows (from Formula 22 in Table 2):

φ = (1/√I) [a/√((a + b)(a + c)) − bc/(d√((a + b)(a + c)))]  (1)
Table 2. Probabilistic Models for 2 × 2 Contingency Tables
[Table 2, which lists the formulas of the probabilistic models (indices) compared in this study, is not reproduced here.]

Here, I = 1 + (b + c)/d + bc/d². Taking the limit as d → ∞, I → 1 and the second term of Equation 1 goes to zero, as every term containing d or d² in the denominator approaches zero. Consequently, we have:

H = a/√((a + b)(a + c))  (2)
We refer to this index as the dual factor heuristic, H, because it is equivalent to the geometric mean of two factors: the probability of the cause given the effect, P(C|E), and the probability of the effect given the cause, P(E|C). Rewriting φ in terms of probabilities as follows can yield an insight into the nature of H:

φ = [√(P(C|E)P(E|C)) − √(P(C)P(E))] / √((1 − P(C))(1 − P(E)))  (3)
Comparing Equations 2 and 3, we can see that H ignores the base rates of the cause and the effect, P (C) and P (E). Although φ has a high value only when P (E|C) and P (C|E) are high by comparison with the corresponding base rates, high values of P (E|C) and P (C|E) per se raise H. This is also obvious because when d → ∞, P (C), P (E) → 0.

The argument that leads to the dual factor heuristic, in short, is as follows. Normative assessments of the strength of covariation require attending to all events, including those in which nothing happens (the d-cell). Attending to all events places prohibitive demands on our limited memory capacity. Computing H allows one to assess the strength of relations without using the d-cell. This is more memory efficient and is accurate under circumstances where d is large—that is, where P(E) and P(C) are small (i.e., the situation people normally encounter in the real world). Later we also show that H is actually better than using the d-cell when one is only allowed to keep track of a small number of events.
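The limiting argument is easy to check numerically. The sketch below (our own illustration; the function names are ours) computes φ for fixed a, b, and c while the d-cell grows:

```python
import math

def phi(a, b, c, d):
    """Normative phi coefficient for a 2x2 table with cell frequencies a, b, c, d."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def dual_factor_h(a, b, c):
    """Dual factor heuristic H = a / sqrt((a+b)(a+c)); no d-cell required."""
    return a / math.sqrt((a + b) * (a + c))

a, b, c = 8, 2, 2
print(round(dual_factor_h(a, b, c), 4))    # 0.8
for d in (10, 100, 1000, 100000):
    print(d, round(phi(a, b, c, d), 4))    # approaches 0.8 as d grows
```

As the d-cell dominates the table, φ converges on H from below, so ignoring d costs nothing exactly when the cause and effect are rare.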

3.3. What “d-cell insensitivity” is not

We have argued that the dual factor heuristic, which ignores the d-cell, can be an efficient method for covariation detection in “normal” situations. This claim, however, could be subject to misunderstanding. Consequently, here we seek to clarify what d-cell insensitivity means by clarifying what it is not.

First, d-cell insensitivity is not a universal law of human causal induction. According to our account, the dual factor heuristic operates at the first, heuristic stage of causal induction, picking out possible causal candidates. In the subsequent analytic stage, people identify genuine causes from this candidate set, perhaps using interventional strategies. Interventional strategies, where people actively make the cause happen or prevent it from happening, should place as much emphasis on causal necessity (e.g., if do(¬Crow) then ¬Sunrise) as on sufficiency (if do(Crow) then Sunrise). This emphasis on causal necessity in intervention ipso facto requires taking the d-cell into account, because causal necessity is high when P(¬Crow, ¬Sunrise) is high. Thus, one would expect a simple heuristic strategy like H to be much less in evidence in paradigms where interventional strategies are used. In Experiment 1, we contrast the discrete and continuous paradigms investigated by Anderson and Sheu (1995). We argue that the discrete paradigm is mainly observational, whereas the continuous paradigm is more interventional.

Moreover, weak d-cell effects are compatible with a small subgroup of participants using such an analytic strategy even with observational data. This is similar to the case of occasional logical performance in deductive reasoning tasks. Oaksford and Chater's (1994, 2003) probabilistic approach explains this phenomenon as follows. Whereas most participants use heuristic or low-level probabilistic strategies, those with high IQs have learned logical strategies that they can apply to the tasks (Stanovich, 1999). This explains occasional logical performance and why low-level probabilistic strategies provide much better fits to the data than logical strategies. By analogy, a few participants may be able to apply higher level strategies even to purely observational data, just as in deductive reasoning some participants can use logical strategies. This is one of the reasons why we sometimes observe weak d-cell effects and why, for observational data, we would expect to find better fits for indices like H that ignore the d-cell.

Second, d-cell insensitivity is not a norm or a golden rule for causal induction. The dual factor heuristic is an approximation that is effective in the real world given limited cognitive resources. If we need to assess a correlation between two events precisely (perhaps in the analytic stage) and we have enough time and resources to do so, we should use a normative index such as the phi-coefficient or Δ P, which exploits the d-cell.

Third, d-cell insensitivity is not necessarily a “null coefficient” for the d-cell in a particular parameterized model, especially a multiple linear regression model, known as the weighted linear combination model for covariation detection (see Formula 41 in Table 2 and Equation 11, introduced later). Without definite evidence that a regression model best describes people's covariation detection, fitting the model and judging the importance of a variable (i.e., the effect of the corresponding cell) by the magnitude of its (standardized) regression coefficient can lead to erroneous conclusions. For example, consider the data set (x, y, z) = (2, 1, 4), (3, 3, 9), (4, 6, 16), where x and y are predictors and z is the dependent variable. These data are fitted perfectly by a simple non-linear model, z = x², which “ignores” y (i.e., y is assigned a zero weight). The same data set, however, is also fitted by a linear model, z = x + 2y, which has a non-null coefficient (i.e., 2) for y. As this example shows, data generated by a model in which a certain variable (e.g., the d-cell) plays no role may nevertheless be well fitted by a linear regression model with a non-null coefficient for that variable. Thus, just because the d-cell is assigned a non-null weight in a multiple regression does not mean that a model that uses the d-cell is the best model for the data. In particular, a model that uses less information to achieve the same level of fit would be preferred as more parsimonious. The long-standing observation of weak d-cell effects in covariation detection (or causal induction) is mostly based on linear regression models (e.g., Schustack & Sternberg, 1981). However, one can only determine whether a model that uses the d-cell is required by a comparative model fitting exercise like those we report in this article.
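The example is easy to reproduce. The snippet below (our own illustration) confirms that both models fit the three data points from the text exactly, with the linear model assigning y the non-null weight 2:

```python
# The three (x, y, z) triples given in the text.
data = [(2, 1, 4), (3, 3, 9), (4, 6, 16)]

# The non-linear model z = x**2 fits all three points and ignores y entirely.
assert all(z == x ** 2 for x, y, z in data)

# Yet a linear model z = w1*x + w2*y also fits them exactly. Solve for
# (w1, w2) from the first two points by Cramer's rule...
(x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = data
det = x1 * y2 - x2 * y1
w1 = (z1 * y2 - z2 * y1) / det
w2 = (x1 * z2 - x2 * z1) / det
print(w1, w2)                      # 1.0 2.0

# ...and verify that it also fits the third point: y gets a non-null weight.
assert w1 * x3 + w2 * y3 == z3
```

Both models achieve a perfect fit, so the non-null regression weight on y says nothing about whether y was actually used to generate the data.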

Consequently, despite the empirical findings on the d-cell effects, the question still remains as to whether an index of covariation that ignores the d-cell such as H can provide better accounts of the evidence than alternative indices.

3.4. Preventive causes

The case of a preventive cause might be seen as just a mirror image of the case of a generative cause such that only the truth value of the effect is reversed, so that “C prevents E” is just to say that “C causes Ē.” If this were the case, then preventive causes are also formalized in our framework—reversing the truth value of the effect results in swapping the a-cell with the b-cell and the c-cell with the d-cell.2 Formally, the dual factor heuristic index for preventive causes could then be expressed as b/√((a + b)(b + d)).

However, there seems to be more to the concept of prevention. First, note that flipping the truth value changes the probability associated with the event: For example, when P (E) is small, P (Ē) is large. According to the dual factor heuristic, when the sets C and E almost totally overlap, the predicted strength of covariation is large. However, when C is small and E is large this strength decreases. Therefore, for H to take on a high value for a putative preventive cause, when P (C) is small, P (Ē) should also be small and hence P (E) should be large (e.g., “committing a felony [C] prevents being at work [E]”). As the dual factor heuristic assumes the rarity of the cause and the effect, in the case of preventive causes, prevalence of the effect should be maintained.

In addition to reversing the truth value, most statements of preventive causation require re-contextualization, or recircumscribing the causal field (similar processes are also described by, e.g., Cheng, 1997). For example, the statement “ingesting vitamin C prevents catching a cold” does not exactly mean “ingesting vitamin C causes not-catching a cold” because even if we do not ingest vitamin C, we do not usually catch a cold. The intended class of events is not general situations in our daily life but those where some other conditions are met for catching a cold, for example, when your partner has a cold. Consequently, paraphrasing “C prevents E” as “C causes Ē” requires refocusing or recontextualizing on the appropriate causal field. Without doing so the d-cell can extend unreasonably and this can lead to anomalies even for normative indices.

Let us look at another example: “the sarin (poison gas) antidote prevents sarin poisoning.” First we consider the case without appropriate contextualization to “sarin exposure” events. Reversing the truth value of the effect and replacing “cause” with “prevent,” we have to take into account the vast number of d-cell cases (i.e., no antidote and no poisoning), which contribute to positive diagnoses for some (normative) indices. Suppose the observed frequencies of an un-contextualized sample were 1, 4, 5, and 10⁶ for the cells a (antidote and poisoning), b (antidote and non-poisoning), c (non-antidote and poisoning), and d (non-antidote and non-poisoning), respectively—almost everyone neither had the sarin antidote nor experienced sarin poisoning. In this case, φ = .18 and Δ P = .20 (i.e., both indicate a weak positive correlation)—that is, counterintuitively, rather than preventing sarin poisoning, these indices indicate that the antidote is a weak cause of sarin poisoning!

We now consider the appropriately contextualized case where the relevant class of events is sarin exposures. If all the d-cell observations actually came from sarin non-exposure cases, the “real” d-cell frequency is 0, and φ = −.82 and Δ P = −.80, which are both strongly negative, as expected. After appropriate recontextualization, we can treat the target case as a mirror image of a generative cause (i.e., we can paraphrase “… prevents sarin poisoning” as “… causes protection against sarin poisoning”). Consequently, cell a can be swapped with b, and c with d, in the contingency table. For this table, H = .89, indicating a strong preventive relation, as expected. Thus, the dual factor heuristic can handle the case of a preventive cause after appropriate re-contextualization, rephrasing, and swapping of the columns of the contingency table.
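These values are straightforward to verify. The check below (our own code, using the index definitions given earlier) reproduces both the un-contextualized and the contextualized figures:

```python
import math

def phi(a, b, c, d):
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def delta_p(a, b, c, d):
    return a / (a + b) - c / (c + d)

def dual_factor_h(a, b, c):
    return a / math.sqrt((a + b) * (a + c))

# Un-contextualized sample: the d-cell is swamped by irrelevant cases.
a, b, c, d = 1, 4, 5, 10**6
print(round(phi(a, b, c, d), 2), round(delta_p(a, b, c, d), 2))    # 0.18 0.2

# Contextualized to sarin-exposure events: the "real" d-cell is 0.
a, b, c, d = 1, 4, 5, 0
print(round(phi(a, b, c, d), 2), round(delta_p(a, b, c, d), 2))    # -0.82 -0.8

# Paraphrase as a generative cause (swap a with b, and c with d), then apply H.
print(round(dual_factor_h(b, a, d), 2))                            # 0.89
```

Note that H is only computed after the columns are swapped; it diagnoses the strength of the relation between the antidote and non-poisoning.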

Recontextualization and reversing the truth value might be viewed as too complex for a fast heuristic for covariation detection. However, in text comprehension, for example, it is known that we automatically and rapidly produce online inferences about the contextually appropriate meanings of words (on the order of a few hundred milliseconds) to establish local coherence (e.g., McKoon & Ratcliff, 1992). Likewise, encoding the polarity of the effect can be achieved by a simple “take the smaller” rule—that is, people compare the set sizes of the target and its complement (negative) proposition in the context (e.g., “being employed” and “being unemployed”) and take the smaller (i.e., rarer) one (i.e., “being unemployed”) to detect the covariation. In expressing causal relations, focusing on the rare case is exactly what people do (McKenzie, Ferreira, Mikkelsen, McDermott, & Skrable, 2001).

Although some models of causal induction (e.g., Δ P) deal with generative and preventive causes within a single framework, the dual factor heuristic basically provides a method to detect generative causes either between C and E or between C and Ē, depending on which is rarer, E or Ē. This might be regarded as a disadvantage of the model. However, the function of the heuristic stage, as in other two-stage models (Evans, 1989), is relevance detection (i.e., finding out whether a potential cause is relevant enough to be considered a candidate for subsequent testing at the analytic stage). It is only at the analytic stage that genuine causes are separated from spurious ones, and the precise nature of the causal relation, preventive or generative, is fully determined by intervention in the causal system itself rather than by simply observing it.

In the following, we first determined whether the dual factor heuristic provides a good account of covariation detection by comparing its performance to all other indices on 2 × 2 contingency tables that we could find in the literature. We now introduce these other indices before turning to a comparison based on a meta-analysis of some of the past data on causal induction and covariation detection.

4. Indices for 2 × 2 contingency tables

There have been many proposals for appropriate measures of the strength of covariation. However, there have been few studies that have exhaustively compared these models with each other. There seem to be two reasons for this. First, instead of comparing models for how well they fit the data, some studies (e.g., Anderson & Sheu, 1995; Schustack & Sternberg, 1981) are devoted to calibrating a specific model with parameters. Second, some other studies (e.g., Shanks, Lopez, Darby, & Dickinson, 1996; Wasserman et al., 1996) have concentrated on specific features of causal induction such as learning effects (or sample size effects) or overshadowing (or discounting), which are relevant only to the disagreement between proponents of associative learning models (e.g., Rescorla & Wagner, 1972) and proponents of statistical contingency models, particularly, the so-called Δ P rule (e.g., Cheng & Novick, 1992). Consequently, a subsidiary goal, in addition to determining the adequacy of the dual factor heuristic, was to exhaustively investigate the descriptive validity of various models of causal judgment from 2 × 2 contingency tables.

A wide variety of relations can be described by a 2 × 2 contingency table other than covariation including, “prediction and accuracy” (e.g., in studying the feeling-of-knowing and in weather forecasting) and “stimulus and response” relations (e.g., recognition memory and perceptual responses). Although each relational concept differs in background motivations, many of these concepts overlap. They can be roughly placed in the following two categories according to how they are motivated:

  1. Normative studies of measures in psychology, statistics, epidemiology, medical science, and meteorology. The relations studied have been labeled “accuracy,” “agreement,” “association,” or “correlation.”
  2. Psychological descriptive studies of human or animal judgment or learning. The relations studied have variously been labeled “causation,” “contingency,” “correlation,” or “covariation.”

Some articles provide detailed reviews of these indices within a certain restricted domain (e.g., Allan, 1980; Brier & Allen, 1951; A. W. F. Edwards, 1963; J. H. Edwards, 1957; Goodman & Kruskal, 1954; Landis & Koch, 1975a, 1975b; Nelson, 1984; Swets, 1986; Woodcock, 1976). Below, ignoring differences of background, we comprehensively review these indices. Table 2 shows the complete list of the indices investigated in this study. In the text below, a bracketed number following an index name or its description corresponds to the index's number in this table.

Before reviewing the indices, some points should be noted. First, some indices are symmetrical, whereas others are not: if we interchange the cause and the effect by swapping the rows and columns of the contingency table, an asymmetrical index takes a different value. We also test these “converse” indices, which are indicated by the superscript c in Table 2.

Second, as we mentioned above, some models developed to describe human and animal behavior have adjustable parameters including the weighted Δ P model (e.g., Allan, 1993; Anderson & Sheu, 1995), the weighted linear combination model (e.g., Schustack & Sternberg, 1981), and the Rescorla–Wagner model (Rescorla & Wagner, 1972). In these models, parameters are introduced and estimated so that the model fits the data; and, as a result, estimated parameter values vary from data set to data set. We first compared H to non-parameterized models in a meta-analysis of previous data. We then compared H to parameterized models using some of this data. This was because different measures of fit are required in each case and not all the data is sufficiently rich to constrain the parameters of the parameterized models.

The third issue concerns the ranges of the various indices. Some indices run over infinite ranges (e.g., from 0 to +∞), whereas others have limited ranges (e.g., from 0 to 1). Moreover, some indices (e.g., χ2) are sensitive to sample size, whereas others (e.g., φ) are not. On the one hand, theoretically speaking, the sample size can be increased without limit; it is thus reasonable to assume that indices sensitive to sample size have no upper bound. On the other hand, for indices that have an infinite range despite being insensitive to sample size, we standardized their ranges using an appropriate monotonic function to ensure a fair comparison. This is because our analyses are based on linear correlations between each model and participants' causal ratings, which have a limited range (e.g., from 0 to 100), although we back up the results with a non-parametric method. The method of standardization was chosen to suit each formula, as detailed in Appendix A.

4.1. Δ P rule and related indices

Among the many models of causal induction that have been developed based on covariation, the Δ P rule [2, 3] is a well-known standard:

Δ P = P (E|C) − P (E|C̄),

where C̄ denotes the absence of the candidate cause.

The Δ P rule has been proposed to explain human judgment of causality or correlation (e.g., Allan, 1993; Allan & Jenkins, 1980; Cheng & Novick, 1992; Jenkins & Ward, 1965; Ward & Jenkins, 1965) or as an index of association between the conditioned stimulus (CS) and the unconditioned stimulus (US) in Pavlovian learning in animals (e.g., Rescorla, 1968). Recently, it was shown that the Rescorla–Wagner model (Rescorla & Wagner, 1972) of associative learning converges to a weighted version of Δ P, introduced later (Wasserman, Elek, Chatlosh, & Baker, 1993), or simply to Δ P (Chapman & Robbins, 1990; Cheng, 1997) under certain constraints (see, Danks, 2003, for details). This index has also been the subject of much interdisciplinary discussion in the area of causal cognition (e.g., Cheng & Holyoak, 1995; Holland, Holyoak, Nisbett, & Thagard, 1986; Sperber et al., 1995).

It is, however, perhaps less well known that accuracy has been measured by the same index. Δ P is equivalent to an index that has been used to describe the accuracy of recognition memory in psychology (Gillund & Shiffrin, 1984; Woodworth, 1938) and the accuracy of forecasts in meteorology and medical diagnosis (Hanssen & Kuipers, 1965, described in Woodcock, 1976). For example, in the case of recognition memory, the index indicates the difference between the proportion of correct recognitions of old items and the proportion of incorrect recognitions of new items. Moreover, an index known as the Hart difference score (Hart, 1965), which has been used in studies of meta-cognition as a measure of the accuracy of feeling-of-knowing judgments (Nelson, 1984), is also equivalent to Δ P.
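To make the notation concrete, here is a minimal sketch (ours, not code from any of the original studies) of the Δ P rule computed from the four cell frequencies a, b, c, and d of a 2 × 2 contingency table:

```python
def delta_p(a, b, c, d):
    # Delta-P = P(E|C) - P(E|not-C), from the cell frequencies:
    # a: C present & E present,  b: C present & E absent,
    # c: C absent & E present,   d: C absent & E absent
    return a / (a + b) - c / (c + d)
```

For example, the table (a, b, c, d) = (8, 2, 2, 8) gives Δ P = .8 − .2 = .6.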

The power PC theory (Cheng, 1997) is a modification of the probabilistic contrast model (Cheng & Novick, 1990, 1992), which is based on the Δ P rule and the concept of a focal set. According to this theory, when it is not known that there is another cause that is dependent on the candidate cause, causal power (PW) [4,5] is defined as:

PW = Δ P / [1 − P (E|C̄)]

Interestingly, this index is formally equivalent to an index that has been used in studies of perceptual function (Blackwell, 1963; Fisk & Schneider, 1984), although the background theory is conceptually different. With respect to visual discrimination tasks, Blackwell regarded this index as the ideal probability of “yes” responses in a sensory discrimination task, provided that P (E|C̄) is the probability of giving (spurious) “yes” responses when sensory discrimination is absent, and P (E|C) is the total probability of “yes” responses due both to sensory discrimination and spurious “yes” responses. This idea is connected with the following approaches based on graphical models of causality.

Pearl (2000) constructed a new system to express and compute causality. He defined three indices that measure causal strength (chap. 9): the probability of necessity (PN) [6, 7], the probability of sufficiency (PS), and the probability of necessity and sufficiency (PNS). According to his definitions, in the simplest situation,3 PN = Δ P / P (E|C), PS coincides with PW, and PNS coincides with Δ P.
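As an illustrative sketch (ours, using the same cell-frequency notation as above), causal power and Pearl's PN differ from Δ P only in their denominators:

```python
def causal_power(a, b, c, d):
    # PW = Delta-P / (1 - P(E|not-C))  (generative causal power; Cheng, 1997)
    p1, p0 = a / (a + b), c / (c + d)
    return (p1 - p0) / (1 - p0)

def prob_necessity(a, b, c, d):
    # PN = Delta-P / P(E|C)  (Pearl, 2000, in the simplest situation)
    p1, p0 = a / (a + b), c / (c + d)
    return (p1 - p0) / p1
```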

Griffiths and Tenenbaum (2005; Tenenbaum & Griffiths, 2001) showed that both Δ P and PW can be regarded as maximum likelihood estimates of causal strength in a graphical model, GC, involving three variables: an effect (E), a potential cause (C), and a set of background factors (B). In GC, there are two directed links into E: C → E and B → E. When C occurs, E occurs with probability pC. Likewise, B produces E with probability pB. The causal strength is defined as the probability pC. Δ P and PW differ only in the functional relations (parameterizations) between the cause and effect. Treating causal induction as causal structure learning, which can be formalized as a decision between two causal graphical models, GC and GI, Griffiths and Tenenbaum proposed a new model called causal support, SP [8]. GI is a model of causal irrelevance in which there is only a link between B and E. Causal support is defined as the log likelihood ratio for obtaining data D from GC over GI:

SP = log [P (D|GC) / P (D|GI)]

P (D|GC) and P (D|GI) are defined by integrating over all possible values that pC and pB could assume as shown in Appendix A. SP cannot simply be expressed using cell frequencies.
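Because SP has no closed form in the cell frequencies, it must be approximated numerically. The following sketch (our illustration, assuming uniform priors on pC and pB and the noisy-OR parameterization of GC described by Griffiths and Tenenbaum) approximates the two marginal likelihoods on a grid:

```python
import math

def causal_support(a, b, c, d, n=100):
    """Grid approximation to causal support: log P(D|GC) / P(D|GI)."""
    grid = [(i + 0.5) / n for i in range(n)]
    # GI: C is irrelevant, so the effect occurs with probability pB on every trial
    p_gi = sum(p ** (a + c) * (1 - p) ** (b + d) for p in grid) / n
    # GC: noisy-OR, P(E|C) = pC + pB - pC*pB and P(E|not-C) = pB
    p_gc = 0.0
    for pc in grid:
        for pb in grid:
            pec = pc + pb - pc * pb
            p_gc += pec ** a * (1 - pec) ** b * pb ** c * (1 - pb) ** d
    p_gc /= n * n
    return math.log(p_gc / p_gi)
```

For a clearly contingent table such as (8, 2, 2, 8), SP comes out positive; for data that favor GI, it is negative.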

4.2. Other normative models in psychology

Signal Detection Theory (SDT; Green & Swets, 1966/1988; Tanner & Swets, 1954) [9, 10] is based on an analogy between the way the mind works in sensory discrimination or detection tasks and the Neyman–Pearson theory of statistical hypothesis testing (Gigerenzer & Murray, 1987). It proposes two internal distributions: “noise” alone and “signal plus noise.” Each of these distributions corresponds to the distribution of some sample statistic (e.g., the mean) under one of the two competing hypotheses in Neyman–Pearson theory. Detectability, d′, is calculated from two probabilities: P (E|C̄), the probability of a “yes” response under noise alone (false alarms), and P (E|C), the probability of a “yes” response under signal plus noise (hits; see Appendix A for more detail).
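As a sketch (ours; practical applications require a correction for empty cells, which drive the z transform to infinity), d′ can be computed by passing the hit and false-alarm rates through the inverse normal CDF:

```python
from statistics import NormalDist

def d_prime(a, b, c, d):
    # d' = z(hit rate) - z(false-alarm rate), treating P(E|C) as hits
    # and P(E|not-C) as false alarms
    z = NormalDist().inv_cdf
    return z(a / (a + b)) - z(c / (c + d))
```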

Likewise, Choice Theory (Luce, 1959, 1963) relates to human decision processes. Luce (1963) defined a measure of similarity, η, between signal and noise and argued that –log η could be regarded as mental distance (p. 113), which is similar to d′. Although these two indices may be closely related to human causal judgment, they both need to be standardized because they run from –∞ to +∞. It turns out that, after a standard transformation, an index based on Choice Theory is formally equivalent to Yule's (1912) coefficient of colligation [25] (introduced below), so this index was omitted (see Appendix A for the method).

Inhelder and Piaget (1958) formalized the concept of correlation as the difference between (a + d) and (b + c) to investigate children's conceptual development. They proposed the association coefficient [11] as the simplest measure of correlation (p. 234). This index is a variation of a well-known strategy called the sum of diagonals, or Δ D (introduced later as L3 [19]), standardized by the sample size. More recently, White (2003) has shown that some people use this strategy, which he calls pCI.

The proportion correct has been used frequently as a measure of response accuracy because of its simplicity and intuitive appeal. There are two ways to define the proportion correct: attending only to positive cases (i.e., where the cause is present) [12,13], C, or including negative cases as well [14], PC. PC was first used more than 100 years ago to evaluate the accuracy of predicting tornadoes in America (e.g., Finley, 1884, described in Nelson, 1984) and has also been used as a measure of causal strength (White, 2000) or correlation (Smedslund, 1963). Although these indices are often used in psychology, meteorology, and medical diagnosis, they are margin sensitive and considered inappropriate as normative measures of accuracy (for further discussion, see, e.g., Nelson, 1984; Swets, 1986).

4.3. Normative models in statistics

Probably the most widely used statistical measure of correlation between two discrete variables is the four-fold point correlation coefficient [22], usually known as the phi-coefficient, φ. It applies to two binary variables and corresponds to Pearson's product–moment correlation coefficient, r, which applies to two continuous variables. In addition to φ, there are two other standard indices of correlation in a 2 × 2 contingency table: Yule's (1900) coefficient of association [24], Q; and Yule's (1912) coefficient of colligation [25], Y (see Appendix A).

The chi-square statistic [23], χ2, is also a well-known measure of the relation between two discrete variables. It is strongly related to φ, as can be seen from their definitions (see Table 2). Unlike φ, however, χ2 is sensitive to sample size. This is because χ2 concerns statistical judgments of the dependency between two discrete variables, which are affected by both correlational strength and sample size.

Goodman and Kruskal (1954, 1963) defined two asymmetric statistics [26,27] both based on the principle of predicting one variable from another, which is known as proportional reduction in error (PRE; e.g., see, Bishop, Fienberg, & Holland, 1975). Suppose one needs to predict the category of a discrete variable E in two different situations: (a) where nothing else is known and (b) where information about C is known. Generally, the probability of error in b is less than in a. Thus PRE is defined as the relative degree of improvement in the prediction of E given C:

λE|C = [P (error in situation a) − P (error in situation b)] / P (error in situation a)

The second statistic involves a similar calculation for predicting C given E. In addition to this asymmetrical statistic, Goodman and Kruskal (1963) also proposed a symmetrical one [28] (see Appendix A).
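The PRE logic can be sketched as follows (our illustration of the standard Goodman–Kruskal λ for predicting E from C; errors arise from always guessing the modal category):

```python
def lambda_e_given_c(a, b, c, d):
    # Errors when predicting E without C: everything outside E's modal category
    n = a + b + c + d
    err_without = n - max(a + c, b + d)
    # Errors when predicting E within each level of C (each row of the table)
    err_with = (a + b - max(a, b)) + (c + d - max(c, d))
    # Proportional reduction in error (undefined when err_without is 0)
    return (err_without - err_with) / err_without
```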

In psychology, meteorology, and clinical medicine, the kappa statistic [29] has often been used as a measure of agreement, which is a special case of association (Cohen, 1960; Heidke, 1926, described in Brier & Allen, 1951). Let π0 indicate the observed probability of agreement, and πE indicate the probability of chance coincidence when the two variables are independent (see Appendix A). The kappa statistic is then defined as follows:

κ = (π0 − πE) / (1 − πE)

This index has been used as a measure of “response agreement” in psychology (Cohen, 1960, 1968; Fleiss, 1971; Light, 1971), “observer agreement” in clinical medicine (Landis & Koch, 1977), and the “accuracy” of weather forecasting (called the Heidke skill score) especially in America (Woodcock, 1976) because it can be regarded as the proportion correct (PC) [14] corrected for chance success. Woodcock also introduced another measure of accuracy in weather forecasting, the skill test [30], Sk, which is also a corrected version of the proportion correct.
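For a 2 × 2 table, π0 and πE reduce to simple functions of the margins; here is a sketch (ours) of κ in the cell-frequency notation used above:

```python
def kappa(a, b, c, d):
    n = a + b + c + d
    pi_0 = (a + d) / n                                       # observed agreement
    pi_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (pi_0 - pi_e) / (1 - pi_e)
```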

Some researchers (e.g., A. W. F. Edwards, 1963; Goodman, 1970) have proposed the log odds ratio of diagonal cell frequencies as a measure of association in a 2 × 2 contingency table. However, this index coincides with Yule's (1900) Q provided it is standardized using the method shown in Appendix A.

4.4. Normative models in the philosophy of science

Mill's (1843/1973) method of difference is based on the idea that a cause is the difference between cases where the effect did occur and cases where the effect did not occur. Cheng and Novick (1992) proposed a probabilistic interpretation of this idea, but as shown in Appendix A, this is formally equal to the “converse” Δ P (i.e., Δ Pc [3]).

In his attempt to construct a theory of temporal direction, Reichenbach (1956) defined an asymmetric measure of causality. According to his definition, “an event B is causally between C and E” (p. 190) if (a) 0 < P (E) < P (E|C) < P (E|B) < 1, (b) 0 < P (C) < P (C|E) < P (C|B) < 1, and (c) P (E|C, B) = P (E|B). He did not characterize this relation quantitatively. When considering just the cause and effect, this measure is equivalent to judging the existence of a causal relation by comparing P (E|C) with P (E), and P (C|E) with P (C). From Equation 3, it is clear that this idea is very close to the phi-coefficient, the normative index of correlation in statistics.

In an attempt to formalize causality in networks of events, Good (1961, 1962) defined a measure Q (E:C), which is the “causal support for E provided by C” or the “tendency of C to cause E” [31,32]. The index was defined as the log odds ratio of the conditional probabilities of the effect's absence: P (Ē|C̄) and P (Ē|C). Again, this index needs to be standardized, as it runs over the interval [–∞, ∞] (see Appendix A).

Suppes (1970) called an event, C, a prima facie cause when (a) the effect, E, occurred after C; (b) P (C) ≠ 0; and (c) P (E|C) > P (E). Note that condition c is implied by Reichenbach's (1956) condition a. Although Suppes did not introduce any quantitative measures either, Cheng and Novick (1992) defined an index based on Suppes' condition c as follows [33,34]:

S = P (E|C) − P (E)

4.5. Descriptive models in psychology

Klayman and Ha (1987) argued that positive hypothesis testing (comparing cell a with b) and positive target testing (comparing cell a with c) are favored in contingency judgment when the probability of the effect is small, P (E) < .5, and the probabilities of the cause and effect are similar, P (C) ≈ P (E). Here, we define an index based on this idea that runs over the interval [0, 1], which we call the positive test index [15]:

PT = [P (E|C) + P (C|E)] / 2

McKenzie (1994) identified some further heuristic indices in the literature on human covariation assessment. Some are among the indices already introduced; others can be regarded as specific forms of a more general parameterized model, the so-called weighted linear combination model (Anderson & Sheu, 1995; Einhorn & Hogarth, 1986; Schustack & Sternberg, 1981):4

R = β0 + β1 a + β2 b + β3 c + β4 d

First, the positive hits strategy [16], L1, is realized by proposing that β1 = 1 and all others are 0. Second, the hits minus false positives strategy [17,18], L2, is realized by proposing that β1 = 1, β2 = –1, and all others are 0. Third, the sum of diagonals strategy [19], L3, is realized by proposing that β1 = β4 = 1, β2 = β3 = –1, and β0 = 0. This index concerns the diagonal cell frequencies and is similar to Inhelder and Piaget's (1958) association coefficient and the log odds ratio index introduced earlier. It is sometimes called Δ D and has been used by many researchers as an index of causal strength or contingency (Allan & Jenkins, 1980, 1983; Arkes & Harkness, 1983; Jenkins & Ward, 1965; Kao & Wasserman, 1993; Shaklee & Hall, 1983; Shaklee & Mims, 1982; Shaklee & Tucker, 1980; Ward & Jenkins, 1965; Wasserman et al., 1990). Finally, McKenzie proposed a new index called the aggregate model [20, 21], L4, to model differential cell impact. This is another version of the linear combination model, with β1 = 4, β2 = –3, β3 = –2, β4 = 1, and β0 = 0. All versions of the linear combination model are sensitive to sample size.
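The four fixed-weight special cases can be written out directly (an illustrative sketch in the cell-frequency notation used above):

```python
def l1(a, b, c, d):
    return a                     # positive hits: cell a alone

def l2(a, b, c, d):
    return a - b                 # hits minus false positives

def l3(a, b, c, d):
    return (a + d) - (b + c)     # sum of diagonals (Delta-D)

def l4(a, b, c, d):
    return 4 * a - 3 * b - 2 * c + d  # aggregate model: differential cell impact
```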

We now compare indices for 2 × 2 contingency tables, including the dual factor heuristic. However, before exhaustively comparing the indices using past experimental data, an experiment was conducted to establish criteria for judging whether to include or exclude data from each study in the literature.

5. Experiment 1: Discrete versus continuous task

As experiments in the literature vary in method and motivation, appropriate standards are important for a meta-analysis. Lax criteria introduce heterogeneity into the data, which can lead to vague or even false conclusions. Our main concern is the distinction between the discrete and the continuous paradigm (Anderson & Sheu, 1995) in causal judgment experiments. In discrete paradigm experiments, participants simply observe a sequence of pairings of the presence or absence of the cause and effect, with frequencies set in advance by the experimenter in the form of a 2 × 2 contingency table. In the continuous paradigm, sometimes called a free-operant probabilistic schedule, participants choose at each moment whether or not to perform the candidate cause, and the effect occurs according to probabilities previously set by the experimenter.

As we pointed out when we introduced the dual factor heuristic, the difference between these experimental paradigms is directly related to the recently much-emphasized distinction between observation and intervention in human causal inference (Lagnado & Sloman, 2004; Steyvers et al., 2003). The discrete paradigm, being based solely on observation, is likely to engage only the heuristic stage of causal judgment. The continuous paradigm involves active intervention by participants and is hence likely to mainly engage the analytic stage. As these paradigms engage different processing stages, they should produce different behavior; if this is the case, we should not mix data from the two types of tasks in the subsequent meta-analyses.

In this experiment, we used H and the phi-coefficient (φ) to examine the difference in performance between the two paradigms. As H can be regarded as a simpler substitute for the well-known normative index of correlation, φ, that omits the d cell, comparing the data fit of H and φ is a reasonable test of people's sensitivity to d-cell frequencies in the discrete and continuous paradigms.5 Because in the continuous paradigm the data sequence is partly under participant control, it is difficult to fully equate experiments using these different paradigms. However, given a range of possible contingencies, we would expect H to do better than φ for the discrete task, where only observation is allowed, whereas for the continuous task, where intervention focuses attention on causal necessity, we would expect φ to do at least as well as, if not better than, H.
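In cell-frequency form, the two indices compared here are as follows (a sketch of the published formulas, using the notation above):

```python
import math

def h_index(a, b, c, d):
    # H = sqrt(P(E|C) * P(C|E)) = a / sqrt((a+b)(a+c)); note the d cell never enters
    return a / math.sqrt((a + b) * (a + c))

def phi(a, b, c, d):
    # Four-fold point correlation (phi) coefficient
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

Because d does not appear in H, any two tables differing only in d receive the same H value, which is exactly the insensitivity the comparison is designed to detect.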

5.1. Method

5.1.1. Tasks

Participants were asked to assess the strength of the causal relation between using a particular type of fertilizer and plants blooming. In the Discrete Task, participants only observed a sequence of scenes (a total of 12–15) in which fertilizer (cause) and plant blooming (effect) were either present or absent. The cell frequencies of the 2 × 2 contingency tables used in this task are shown in Table 3 (left). Each participant was presented with 12 stimuli in a randomized order, each corresponding to a line in the table.

Table 3. Stimuli in discrete (Left) and continuous (Right) tasks
(Columns, left table: No., a, b, c, d, H, φ; right table: No., P (E|C), P (E|C̄).)

In the Continuous Task, two probabilities were assigned per stimulus, as shown in Table 3 (right): a probability that the plant blooms (E) when fertilizer was used (C), P (E|C); and a probability that the plant blooms (E) when fertilizer was not used (C̄), P (E|C̄). Each participant inspected 30 cases for each stimulus. In each case, he or she chose either to use or not to use fertilizer and observed whether the plant bloomed. Given the participant's choices, the plant bloomed according to the probabilities in Table 3 (right).

5.1.2. Procedure

The experiment was conducted on personal computers. Every time a participant clicked the mouse, a new picture was displayed in a randomized order. After observing a series of situations (i.e., pictures), participants rated the subjective strength of the causal relation with a value between 0 (completely unrelated) and 100 (completely related). This cycle was repeated for all stimuli shown in Table 3.

5.1.3. Participants and design

A total of 39 undergraduate students from Ritsumeikan University participated in this experiment as unpaid volunteers. They were randomly assigned either to the Discrete Task or to the Continuous Task. The participants were run either individually or in small groups.

5.2. Results and discussion

Figure 1 shows the relation between index values and participant ratings. In the Continuous Task, contrary to the Discrete Task, each participant experienced different stimuli (i.e., cell configurations). This is partly because the experimenter cannot control how many times a participant performs the action that is the possible cause of the effect, and partly because the experimenter can only control the probability with which the effect occurs consequent on the action; the actual occurrence is left to chance. Consequently, every rating is plotted as a single point in Fig. 1. In this figure, data for participants whose correlation coefficients were significant (p < .05) are plotted with black circles; others are plotted with white circles. Regression lines were drawn for individual participants (only for those whose ratings correlated significantly with the index). Correlation coefficients were calculated separately for each participant and averaged using Fisher's z transformation.

Figure 1.

Participants' ratings of causal strength in Discrete (upper panels) and Continuous (lower panels) Tasks of Experiment 1. Regression lines are for participants whose ratings significantly correlated with the index (p < .05).

In the Discrete Task, the correlation of participants' ratings was much stronger with H (r2 = .66) than with φ (r2 = .33). Thus, when only observational data is available people tend to ignore d-cell information. In the Continuous Task, participant ratings had as much correlation with φ (r2 = .73) as with H (r2 = .73). Therefore, with the continuous paradigm where intervention focuses attention on causal necessity, as predicted, the correlation with φ rose and was similar to H, indicating as much analytic as heuristic processing in this paradigm.

The results showed differences in causal cognition between situations where people only observe the occurrences of events and situations where they intervene in the system. Consequently, continuous paradigm data were excluded in the following meta-analyses assessing the merits of the dual factor heuristic as an account of the heuristic stage.

However, the dual factor heuristic, which is a model of the non-interventional heuristic stage in causal induction, described the Continuous Task data well, as did φ. This result suggests that continuous tasks are not completely analytic; some participants may have processed the data only at the heuristic level. Our explanation of some d-cell effects in covariation detection suggests that there may be individual differences in strategy usage. Moreover, a reviewer suggested that to control for inevitable differences between the continuous and discrete paradigms, we could have used a yoking procedure, in which the sequence of trials generated by a participant in the continuous paradigm is summarized and presented to a yoked participant in the discrete paradigm. Both are clearly areas for future research.

6. Meta-analysis 1: Non-parameterized models

Many experiments have investigated human and animal judgments on causality, correlation, or contingency between two covarying events. However, as far as we know, no study has exhaustively examined the descriptive validity of the various models based on covariation listed in Table 2. The purpose of this section is to begin to determine the most appropriate descriptive model for covariation detection based on 2 × 2 contingency tables.

6.1. Method

Experiments on judgments of causality, correlation, or contingency that satisfy the following five criteria were selected from the literature: (a) the participants were human; (b) the causal relation in the experiment was assumed to be a one-to-one correspondence (i.e., a single event causes an effect); (c) the task was to assess the subjective strength of a relation with a (nearly) continuous value; (d) the candidate cause and effect were presented sequentially with frequencies decided in advance by the experimenter (i.e., discrete paradigm experiments); (e) the task involved ordinary generative causes (as opposed to preventive ones that suppress the effect). Although criterion d was justified by Experiment 1, criterion e requires further justification.

Some experiments (e.g., Buehner, Cheng, & Clifford, 2003) impose a generative or preventive scenario as different tasks, whereas some experiments (e.g., Wasserman et al., 1996) merge generative and preventive causes within a task. A problem is that the recontextualization required to interpret preventive causes (discussed in the section, The Dual Factor Heuristic) can differ between these experiments, and we have no way of determining any recontextualization participants may impose to interpret preventive causes.

Moreover, learning preventive associations from purely observational data without recontextualization would be computationally intractable and is another instantiation of the frame problem we discussed in introducing the dual factor heuristic. As we move around the world, most things are not happening. Do we therefore develop associations between the occurrence of an event (a possible cause) and all the other events that are not occurring at the time, just in case they may be instances of a preventive cause? This would involve encoding an enormous number of largely spurious associations every time an event was encountered, between a representation of that event and a representation of almost every other possible event that is not occurring (i.e., the negation of each such event). In sum, at the level of learning covariations to feed into subsequent interventional processing in the real world, as opposed to the laboratory, it would be computationally infeasible for people to attend to preventive associations. Consequently, to avoid data heterogeneity, we only included experiments that measured positive relations. We acknowledge that these exclusions may give the impression that we are testing H against a very specialized set of data. However, while specialized with respect to the classes of experimental paradigms used in the area, this set is not specialized with respect to people's real-world experience of detecting covariations by observing the world.

The following six target experiments were identified in the literature: Experiment 1 of Anderson and Sheu (1995), abbreviated to “AS95”; Experiments 1 and 3 of Buehner et al. (2003),6 abbreviated to “BCC03.1” and “BCC03.3,” respectively; Experiments 1 through 3 of Lober and Shanks (2000),7 abbreviated to “LS00”; and Experiments 2 and 6 of White (2003), abbreviated to “W03.2” and “W03.6,” respectively. In addition, Experiment 1 of the current study was also included in the analysis. For each experiment, all the indices mentioned in the previous section, together with the converse indices (indicated by the superscript c in Table 2), were calculated from the cell frequencies: 34 indices in total. To measure each model's fit to the data, the coefficient of determination (r2) between each index and participants' mean ratings of causal strength was calculated.

6.2. Results and discussion

The results of the analysis are set out in Table 4. As an overall measure of goodness of fit, the weighted average of the seven values of r2 for each index was calculated; these averages are shown in the column labeled “whole (w/o Expt 2).” This calculation was based on the Fisher z transformation of r, with each study weighted by its degrees of freedom, M − 3, where M indicates the number of stimuli used in the experiment (e.g., see, Rosenthal, 1991, chap. 4). The number of stimuli, M, was 80, 13, 6, 11, 8, and 4 for AS95, BCC03.1, BCC03.3, LS00, W03.2, and W03.6, respectively.
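The aggregation step can be sketched as follows (our illustration of the weighting scheme described in the text):

```python
import math

def weighted_mean_r2(rs, ms):
    # Fisher-z transform each r, weight by df = M - 3 (M = number of stimuli),
    # average, back-transform, and square to obtain the pooled r^2
    ws = [m - 3 for m in ms]
    z_bar = sum(w * math.atanh(r) for w, r in zip(ws, rs)) / sum(ws)
    return math.tanh(z_bar) ** 2
```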

Table 4. Data fit (r2) for the non-parameterized models
Model | AS95 | BCC03.1 | BCC03.3 | LS00 | W03.2 | W03.6 | Expt 1 | Expt 2 | whole (w/o Expt 2) | whole
  1. Note: AS95: Experiment 1 of Anderson and Sheu (1995), the number of stimulus variation, M = 80; BCC03.1 and BCC03.3: Experiments 1 and 3 of Buehner et al. (2003), M = 13 and 6; LS00: Experiments 1–3 of Lober and Shanks (2000), M = 11; W03.2 and W03.6: Experiments 2 and 6 of White (2003b), M = 8 and 4; Expt 1 and Expt 2: Experiments 1 and 2 of the current study, M = 12 and 9.

  2. In the “whole” column, the figures indicate the weighted average of r2 (see the text for details).

  3. ** p < .01, * p < .05.

 1 H     .91** .94** .91** .80** .69* .80 .93** .96** .90 .91
 2 Δ P   .78** .86** .70* .77** .00 .50* .00 .75 .72
 3 Δ Pc  .78** .74** .68* .56** .36 .50* .73** .73 .73
 4 PW    .30** .75** .90** .41** .15 .55** .01 .39 .36
 5 PWc   .53** .43* … .95** .47 .52
 6 PN    .56** .35* … .00 .47 .44
 7 PNc   .41** .68** .83* .38* .09 .38* .13 .43 .39
 8 SP    .81** .88** .74* .23** .18 .45 .79** .36 .77 .76
 9 SDT   .78** .48** .68* .23 .00 .47* .00 .68 .65
10 SDTc  .78** .48** .68* .23 .00 .47* .00 .68 .65
11 R     .76** .86** .70* .77** … .74 .72
12 C     .82** .63** .82* … .76 .73
13 Cc    .58** … .97** .48 .53
14 PC    .76** .86** .70* .77** .49 .18 .49* .16 .74 .72
15 PT    .89** .90** .88** .75** .51* .72 .90** .80** .87 .87
16 L1    .46** .63** .82* .13 .57* .86 .73** .85** .50 .53
17 L2    .73** .63** .82* .15 .66* .72 .85** .89** .71 .72
18 L2c   .49** .86** .70* .75** .14 .40 .42* .84** .55 .57
19 L3    .71** .86** .70* .75** .49 .18 .49* .01 .71 .67
20 L4    .86** .93** .99** .78** .95** .88 .91** .43 .88 .87
21 L4c   .78** .96** .90** .88** .59** .70 .76** .60* .81 .81
22 φ     .78** .84** .70* .71** .21 .50* .49* .75 .74
23 χ2    .07* .83** .63 .69** .26 .48* .36 .23 .24
24 Q     .78** .49** .67* .23 .00 .47* .01 .68 .64
25 Y     .79** .50** … .07 .68 .64
26 λE|C  .22** .61** .29 .57** .36* .45* .30 .30
27 λC|E  .08* .86** .70* .77** .37 .36* .56* .26 .27
28 λ     .13** .82** .65 .71** .37 .36* .49* .29 .30
29 K     .71** .86** .70* .77** .39 .51** .65** .71 .71
30 Sk    .72** .86** .70* .77** .28 .50* .73** .71 .71
31 G     .70** .74** .89** .38* .13 .58** .04 .66 .63
32 Gc    .54** .92** .89** .77** .78** .59** .88** .65 .66
33 S     .49** .86** .70* .77** …
34 Sc    .61** …

As shown in Table 4, only four indices exceeded .80 in average r2 value; these were H (.90), L4 (.88), PT (.87), and L4c (.82). Five indices including the top four and Gc (r2 = .65) were statistically significant (p < .05) in all the data (with the exception of W03.6, which had a low N), although Gc was ranked low (21/34). Below the top four, some indices including φ and Δ P are grouped in the range of .77 to .65. We also assessed the model fits using non-parametric, rank order correlations (i.e., Spearman's coefficient of correlation; e.g., see Siegel & Castellan, 1988). Our findings were essentially unaltered. The same indices (i.e., H, L4, PT, and L4c) remained as the distinctive top four.

H and PT are of course conceptually very similar: H is the geometric mean of P (E|C) and P (C|E) and PT is the arithmetic mean of these two factors. H has the advantage of being a limiting case of the normative index (i.e., it has a rational motivation). PT is obviously going to be highly correlated with H and we can only understand its success to the extent that it approximates H.

L4 and its converse L4c are based on setting the parameters of the weighted linear combination model to particular values. However, unlike the other indices produced in this way, the parameter values have been specifically set, albeit only crudely, to capture the differential cell weightings found in previous data. That L4 and L4c provide good fits to these new data sets without allowing these parameters to be further adjusted is impressive. It demonstrates the generalizability of the model. However, these indices are not based on a normative theory (i.e., they provide no explanation for why people behave as they do in these tasks). The only explanation they provide is the tautological one that they can fit the previous data. In contrast, H is based on a limiting case of the normative index of covariation in 2 × 2 contingency tables when C and E are rare. Thus, H explains why people behave as they do (i.e., it is an adaptively rational strategy; Chater, Oaksford, Nakisa, & Redington, 2003). Thus, we argue that H is to be preferred because it not only provides a good fit to the data, it also rationally explains why people behave in the way they do on these tasks.

7. Experiment 2: H versus ΔP and L4

Although the results of the meta-analysis indicated that the dual factor heuristic model best fit the experimental data, other models did rather well. However, these degrees of fit may have occurred because some indices are correlated with H in the stimulus sets used in the experiments against which we evaluated the models in the above meta-analysis. More specifically, when H fits the data, and H and a particular index are dependent for some stimuli, that index can also show a good fit to the data. As long as stimuli are used for which different models' predictions do not diverge, we cannot differentiate between the models. The motivation for Experiment 2 was therefore to present a stimulus set on which the predictions of these indices diverge.

We focused our attention on Δ P and L4. First, because L4 and its converse fit the data well, we needed to differentiate these indices from H. Second, because Δ P has been one of the leading psychological models of causal induction and it did quite well in the last meta-analysis (r2 = .75, ranked 8th), Δ P was worth comparing with H.8 So far, almost all experiments on causation and correlation based on covariation information have been concerned with Δ P in one way or another, either positively or negatively (e.g., Allan, 1980, 1993; Allan & Jenkins, 1980, 1983; Anderson & Sheu, 1995; Baker, Berbrier, & Vallée-Tourangeau, 1989; Buehner et al., 2003; Chapman & Robbins, 1990; Cheng, 1997; Cheng & Novick, 1992; Griffiths & Tenenbaum, 2005; Neunaber & Wasserman, 1986; Shaklee & Tucker, 1980; Shanks, 1987; Vallée-Tourangeau, Murphy, Drew, & Baker, 1998; Wasserman, 1990; Wasserman, Chatlosh, & Neunaber, 1983; Wasserman et al., 1993; White, 2000, 2003). Third, as mentioned in the previous section, although PT matched H, PT is less rationally motivated and it showed a (slightly) worse fit. Consequently, we do not consider PT further.

This experiment was therefore designed to examine which model best predicts the data in a stimulus set where H is not well correlated with these other indices. The cell frequencies were explicitly defined such that H and Δ P had independent values on three levels (low, middle, and high) and no data could validate both models at the same time. This procedure also had the consequence that neither L4 nor its converse was significantly correlated with H.

7.1. Method

7.1.1. Participants

Fifty undergraduate students of Ritsumeikan University participated in this experiment as unpaid volunteers. The participants were run either individually or in small groups of up to 10.

7.1.2. Stimuli and procedure

Table 5 indicates the cell frequencies and the values of each index for the nine contingency tables used in this experiment. The experiment was a completely within-participant design conducted on personal computers. Each participant was presented with stimuli derived from these nine contingency tables in randomized order.

Table 5. Stimuli in Experiment 2
Columns: cell frequencies (a, b, c, d, and N) and models' predictions (H, Δ P, L4, and L4c).
  1. Note: N = a + b + c + d.


The cause was “drinking milk” and the effect was “stomach-ache.” Each contingency table in Table 5 corresponded to a different person, who might have a weak digestion. Participants were instructed to judge the causal strength between drinking milk and stomach-ache for each contingency table. When a participant clicked the mouse, a pair of pictures was displayed: The picture on the left of the screen indicated whether the person drank the milk, and the picture on the right of the screen indicated whether the person experienced a stomach-ache. After observing a series of these stimuli (i.e., paired pictures) in randomized order, participants rated the subjective strength of the causal relation with a value from 0 (not related at all) to 100 (completely related) using the computer mouse. This procedure was repeated for the nine contingency tables.

The exact instructions were as follows:9

The purpose of this task is to investigate how strongly people feel a cause produces its effect.

Imaginary persons you will see on the screen may have a stomach-ache. You are asked to judge whether milk is the “cause” of their stomach-ache after observing several situations in which they drink or do not drink milk. There is no correct answer, so please answer according to your purely subjective feeling.

This task consists of several sessions and each session is concerned with a different person. When the session starts, every time you click the mouse, a new picture will be displayed indicating (1) whether or not the person drank milk, and (2) whether or not the person had a stomach-ache. After watching some cases, you must decide the degree to which you think milk is the “cause” of a stomach-ache for that person. Please use the computer mouse to indicate your response. Rate as a rough estimate from 0 (there is no causal relation at all between milk and a stomach-ache) to 100 (there is a complete causal relation). You will be asked to do this several times.

7.2. Results and discussion

Figure 2 shows participants' mean ratings of causal strength. In Figure 2a, the x axis corresponds to the value of H for each contingency, and each line connects the points that have the same Δ P value. It is clear that the changes in participants' causal evaluations bear no relation to the Δ P values. On the other hand, it is clear that there is a linear relation between H and causal strength. An analysis of variance (ANOVA) with the causal ratings as the dependent variable and the three levels of H as a factor showed a highly significant main effect of the dual factor heuristic, F (2, 98) = 274.30, p < .0001, but no main effect of Δ P, F (2, 98) = 0.42, p = .67. The interaction between these two factors was also significant, F (4, 98) = 5.62, p < .001. As Figure 2a shows, when H = .5 and H = .8 there are differences in causal ratings between levels of Δ P. However, these were inconsistent. When H = .8, causal ratings increased with Δ P, as would be expected; whereas, when H = .5, causal ratings decreased with Δ P, which is the exact opposite of what would be expected.

Figure 2.

Participants' ratings of causal strength and the predictions of (a) H and Δ P and (b) the predictions of L4 and L4c in Experiment 2.

In Figure 2b, the x axis corresponds to the value of L4 or L4c, and points are connected with lines in groups of three that had the same value of H. Figure 2b shows that there is no clear monotonic relation between participants' ratings and L4 or L4c. We can see that L4 and L4c make poor predictions particularly for the group of stimuli when H = .2 where there is almost no correlation between causal ratings and L4 (or L4c). A similar ANOVA could not be carried out for these indices, because the levels were not fully crossed. However, Figure 2b replicates the odd interaction effects we observed for Δ P (i.e., when H = .8, causal ratings increase; but when H = .5, causal ratings decrease as L4 or L4c increase).

The “Expt 2” column of Table 4 shows r2 for this experiment. H had an r2 of .96, Δ P had an r2 of .00, L4 had an r2 of .43, and its converse had an r2 of .60. An analysis using Spearman's non-parametric correlation coefficient replicated the results: .93, .00, .40, and .72 for H, Δ P, L4, and L4c, respectively. The rightmost column of Table 4 shows the weighted averages of r2 based on Fisher's z transformation (as described in the section, Meta-analysis 1) for all the data, including Experiments 1 and 2. Here we can see that H still provides the best fit. These results suggest that in the meta-analysis H actually fit the data, whereas Δ P and L4 only seemed to fit the data because of the correlation in those stimuli between H and these other two indices.
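For readers who wish to reproduce this kind of averaging, the following is a minimal sketch of one standard Fisher's z recipe; the exact weights are described in the section Meta-analysis 1, so the choice of n − 3 weights here is our illustrative assumption:

```python
import math

def weighted_mean_r2(rs, ns):
    """Average correlations across studies via Fisher's z transformation.

    rs: per-study correlations (signed r, not r^2); ns: per-study sample sizes.
    Weighting each z by (n - 3) is the textbook choice; treat it as one
    plausible recipe rather than the paper's exact procedure.
    """
    zs = [math.atanh(r) for r in rs]          # r -> z
    ws = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return math.tanh(z_bar) ** 2              # back-transform, then square
```

Because the transformation is nonlinear, averaging z values and back-transforming is not the same as averaging the raw r2 values directly.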

We also wondered how H compared to the fully parameterized weighted linear combination model when an index of fit that penalizes model complexity (i.e., the number of free parameters) is used. We therefore performed a second meta-analysis, which also included a range of other parameterized models of covariation detection that have been prominent in the literature and that, therefore, needed to be compared to H.

8. Meta-analysis 2: Parameterized models

We have now established that H compares well with other non-parameterized models of covariation detection: it outperforms both normative and descriptive models. Moreover, in an experiment where the other best-performing models were not well correlated with H, H performed substantially better than those models. However, we also needed to compare H with the range of parameterized models of covariation detection in the literature. This is the purpose of the second meta-analysis that we now report.

8.1. Parameterized models

As we indicated above, the first parameterized model we consider is the weighted linear combination model, L* [41] (Equation 11) introduced in the section Indices for 2 × 2 Contingency Tables, which is one of the most familiar models with free parameters. However, a range of other models exist in the literature.

8.1.1. The simple Bayesian model

Another covariational approach was proposed by Anderson and Sheu (1995; Anderson, 1990) as a rational analysis of causal judgments. Causal strength was proposed to be a function of the odds of a causal relationship (i.e., a hypothesis, H) given the data (D), O (H|D):

O (H|D) = P (H|D) / P (¬H|D)
Here, O (H|D) is derived using Bayes' theorem, letting pc be the probability of the effect in the presence of the cause, pa be the probability of the effect in the absence of the cause, pn be the base rate of the effect irrespective of the occurrence of the cause, and pr be the prior probability of the causal relationship, P (H):

O (H|D) = [pr / (1 − pr)] × [pc^a (1 − pc)^b pa^c (1 − pa)^d] / [pn^(a+c) (1 − pn)^(b+d)]
They called this parameterized model the Simple Bayesian Model [40].

8.1.2. Weighted Δ P models

Several models with free parameters have also been proposed in the associative learning literature. According to this approach, causal judgments are explained by learning theory where the candidate cause is treated as the CS, the effect is treated as the US, and causal strength is treated as associative strength. The Rescorla–Wagner model (Rescorla & Wagner, 1972) is expressed as a simultaneous difference equation so that it can predict the results of step-by-step learning. However, the Rescorla–Wagner model involving one causal factor (contrasted with the context) reduces to the following formula at asymptote (Wasserman et al., 1993), which is sometimes called weighted Δ P, Δ Pw1 [36]:

Δ Pw1 = wa a / (wa a + wb b) − wc c / (wc c + wd d) = a / (a + (wb/wa) b) − c / (c + (wd/wc) d)
The right hand side of this equation is derived by reductions assuming both wa and wc are not 0. Using this measure to capture the fit between the data and the Rescorla–Wagner model assumes that performance in each experiment is at asymptote. For the studies we model, this criterion cannot be guaranteed to hold. However, no simple criterion like number of trials could guarantee the opposite (i.e., that performance is not at asymptote). Consequently, although admitting that this may be a limitation, we continued to model all the data that met our selection criteria (we discuss whether performance is at asymptote further in the section General Discussion.) Confusingly, there is another version of weighted Δ P (Anderson & Sheu, 1995), Δ Pw2 [37], which has been applied to causal judgment in humans:

Δ Pw2 = β0 + β1 a / (a + b) − β2 c / (c + d)
8.1.3. Information integration models

Kao and Wasserman (1993) derived the same equation as Δ Pw1 from Busemeyer's (1991) information integration theory. They also proposed that a non-normative information integration process can be described as follows [38]:
I1 = (wa a − wb b − wc c + wd d) / (wa a + wb b + wc c + wd d) = (a − w1 b − w2 c + w3 d) / (a + w1 b + w2 c + w3 d),

which is a form obtained by substitutions, w1 = wb / wa, w2 = wc / wa, and w3 = wd / wa. Also based on Busemeyer's theory, Anderson and Sheu (1995) introduced another model [39]:


This is derived by substituting (.5 + w4), (w1w4), (w2+ w4), and (w3+ w4) with β0, β1, β2, and β3, respectively.

8.1.4. The Pearce model

Pearce (1987) proposed another associative model of stimulus generalization. Asymptotic predictions for his model are made according the following formula (Buehner et al., 2003; Perales & Shanks, 2003), J [35]:


8.1.5. Parameterized dual factor heuristic

The dual factor heuristic, H, has no parameters, but to compare it with these other parameterized models, we used the simplest linear equation as a response function which maps strength of covariation onto an actual response: β0+ β1H.

8.1.6. Causal support with the power transformation

Griffiths and Tenenbaum (2005) proposed a transformation of the causal support model [8], SP, based on the power law to accommodate non-linearity in the data before fitting their model, as follows: β0 + β1 sign(SP)|SP|^γ.

8.2. Method

The aforementioned models were compared using the same set of data as Meta-analysis 1, including the data from Experiment 1, plus the additional data from Experiment 2. The model parameters were estimated by the least-squares criterion. For models that take the form of linear equations (i.e., Δ Pw2 [37], I2 [39], L* [41], and parameterized H [1]), multiple linear regression was conducted.10 For the other, non-linear models (i.e., J [35], Δ Pw1 [36], I1 [38], B [40], and transformed SP [8]), nonlinear least-squares regression (Dennis, Gay, & Welsch, 1981) was conducted using the “nlregb” function of S-PLUS 6.0.
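S-PLUS's nlregb is no longer widely available, but the same kind of fit can be sketched in modern tools. The following illustrates fitting the power-transformed model β0 + β1 sign(SP)|SP|^γ by profiling: for any fixed γ the model is linear in β0 and β1, so that part can be solved exactly while γ is grid-searched. This is our sketch, not the original fitting code:

```python
import numpy as np

def fit_power_model(sp, ratings, gammas=np.linspace(0.1, 3.0, 59)):
    """Least-squares fit of beta0 + beta1*sign(SP)*|SP|**gamma.

    Profiling: for each candidate gamma, solve the linear subproblem
    (beta0, beta1) exactly with np.linalg.lstsq; keep the gamma with the
    smallest residual sum of squares (SSE).
    """
    best = None
    for g in gammas:
        x = np.sign(sp) * np.abs(sp) ** g
        A = np.column_stack([np.ones_like(x), x])       # [1, transformed SP]
        coef, *_ = np.linalg.lstsq(A, ratings, rcond=None)
        sse = float(np.sum((A @ coef - ratings) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], g)
    sse, b0, b1, g = best
    return (b0, b1, g), sse
```

A grid over γ is crude but robust; a general-purpose nonlinear least-squares routine would serve equally well here.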

In all experiments, the causal judgments were rated between 0 and 100, whereas the output of all the aforementioned indices is between 0 and 1. Therefore, all the definitions for the models were multiplied by 100. (This only affects the absolute value of AICc [see below], not the relative magnitude which is essential.)

8.3. Results and discussion

As a measure of goodness-of-fit, r2 between participants' causal ratings and the best fit predictions based on the least squared error criterion was calculated for each model and they are shown in Table 6. Estimated parameters are shown in Appendix B. In the “whole” column of Table 6, the weighted averages of r2 (see the section Meta-analysis 1) are shown. J, Δ Pw1, and Δ Pw2 fit the data from W03.2 and Experiment 2 poorly (r2 ≤ .20), but the other models seemed to fit all the data quite well.

Table 6. Data fit (r2) and generalizability (AICc) for the parameterized models

| Model | K | AS95 | BCC03.1 | BCC03.3 | LS00 | W03.2 | W03.6 | Expt 1 | Expt 2 | whole r2 | mean rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 H | (2) | .91** (259.3) | .94** (46.8††) | .91** (43.6††) | .80** (57.7) | .69** (40.5†††) | .80 | .93** (47.3†††) | .96** (38.2†††) | .91 | 1.8 |
| 35 J | 2 | .89** (302.6) | .92** (63.2) | .87** (49.0) | .87** (52.7††) | .14 (52.6) | .54 | .89** (54.4††) | .00 (69.0) | .85 | 4.5 |
| 36 Δ Pw1 | 2 | .93** (309.9) | .92** (64.4) | .93** (42.7†††) | .91** (49.1†††) | .07 (52.0) | .54 | .89** (54.5) | .03 (70.3) | .89 | 5.0 |
| 37 Δ Pw2 | 3 | .94** (222.3†††) | .96** (46.4†††) | .99** (59.5) | .87** (57.9) | .20 (57.3) | .54 | .89** (57.6) | .00 (75.4) | .92 | 4.3 |
| 8 SP | (3) | .80** (321.7) | .95** (48.5) | .76* (79.5) | .71** (66.6) | .66** (66.5) | — | — | .44* (70.1) | .80 | 6.3 |
| 38 I1 | 4 | .72** (351.3) | .94** (57.6) | .99** (—) | .88** (64.0) | .99** (43.6††) | — | .92** (59.9) | .72** (76.1) | .82 | 6.3 |
| 39 I2 | 4 | .93** (242.6††) | .96** (51.9) | .99** (—) | .87** (65.2) | .97** (48.7) | — | .92** (59.4) | .85** (70.5) | .93 | 4.8 |
| 40 B | 4 | .88** (282.6) | .96** (53.3) | .96** (—) | .89** (63.2) | .97** (48.5) | — | .94** (56.6) | .91** (66.1††) | .91 | 3.8 |
| 41 L* | 5 | .86** (297.4) | .96** (74.6) | — | .97** (104.7) | .93** (67.4) | — | .98** (77.3) | — | .91 | 8.0 |

  1. Note: K indicates the number of parameters in each model (with regard to the number of parameters of the model H, see the text). Each data-set cell shows r2 with AICc in parentheses; a dash indicates a value that could not be calculated. For W03.6, AICc could not be calculated for any index because of the lack of stimulus variation (i.e., only four stimuli), so only r2 is shown in that column.

  2. In the “whole” column, r2 was averaged with each study weighted as in Meta-analysis 1, and mean rank orders of AICc are shown. BCC03.3 and W03.6 were excluded from the averaging because of the insufficiency of stimulus variation.

  3. ** p < .01, * p < .05; ††† best model, †† second-best model, † third-best model.

However, with parameterized models, the r2 goodness-of-fit index is generally not an adequate criterion: the more free parameters a model possesses, the better its chance of fitting the data. An “overcomplex” model, however, loses generalizability, that is, the ability to fit samples beyond the current data set (e.g., see Pitt & Myung, 2002). As a measure of generalizability, for each model we therefore calculated AICc (Hurvich & Tsai, 1989), a corrected version of the well-known Akaike Information Criterion (AIC; Akaike, 1973). AICc is generally appropriate and dramatically outperforms AIC in small-sample settings such as those in our meta-analyses (e.g., see Burnham & Anderson, 1998). Smaller values of AICc indicate that a model is more generalizable. According to this measure, H was always ranked in the top three for all data sets, and overall it is more generalizable than the other parameterized models.11 The last column of Table 6 shows the mean rank orders of AICc. AICc values for all indices could not be derived for BCC03.3 and W03.6, as they contained insufficient stimuli; these experiments were therefore excluded from the averaging process. Again, H was the best among these eight parameterized models according to mean rank order of AICc, which was significant by Friedman's test, χr2(7) = 19.0, p = .015.
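For a least-squares fit with Gaussian errors, AICc can be computed directly from the residual sum of squares. The following sketch states the standard formula; conventions differ on whether the error variance is counted among the K parameters, so treat it as illustrative rather than the paper's exact bookkeeping:

```python
import math

def aicc(sse, n, k):
    """Corrected AIC (Hurvich & Tsai, 1989) for a least-squares fit.

    sse: residual sum of squares; n: number of data points; k: number of
    free parameters. Assumes Gaussian errors, so AIC = n*ln(SSE/n) + 2k;
    the extra term penalizes parameters more heavily when n is small.
    """
    aic = n * math.log(sse / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)
```

With the same SSE, adding a parameter always raises AICc, which is exactly the penalty for overcomplexity discussed above; the correction term vanishes as n grows, recovering ordinary AIC.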

In sum, even compared with complex parameterized models, the dual factor heuristic performed well. Note that although the number of parameters was taken into account in this comparison, the number of cells (or the number of samples) used in the calculation to detect covariation was not. All indices other than H use information about all cells, whereas H ignores d-cell information. Generally speaking, the less information needed for a calculation, the easier it is to perform, which means a lower cognitive load is imposed. This is an intrinsic advantage of the dual factor heuristic not factored into the model comparison process. To investigate this feature in more detail, further computer simulations were conducted to provide a fuller rational analysis of covariation detection using the dual factor heuristic.

9. Simulation 1: Effectiveness and parsimony

The results of the meta-analyses and the experiment show that the dual factor heuristic has some descriptive validity. It would appear that people use this heuristic as an approximation to the normative strategy of covariation assessment when they assess causal strength between a candidate cause and its effect.

In the simulations we now report, we attempted to provide a more detailed rational analysis of the dual factor heuristic. According to Anderson (1990), the adaptive rationality of a cognitive process depends on the limitations imposed by working memory and on the nature of the environment. We considered three factors relevant to the adaptive rationality of the dual factor heuristic: rarity, equiprobability, and the capacity of working memory. The importance of rarity has already been established in deriving the dual factor heuristic. The rarity assumption was initially introduced to explain data selection behavior in the Wason selection task (Oaksford & Chater, 1994). In the same task, Hattori (2002) also proposed that participants make a biconditionality assumption, that is, that a conditional is often regarded as a biconditional (i.e., not only is “if X then Y” true, so is “if Y then X”). To avoid misunderstanding, we call this the equiprobability assumption here because it implies that the probabilities of the antecedent, X, and the consequent, Y, of an indicative conditional are almost equal (see also Klayman & Ha, 1987). Working memory limitations have been appealed to in many rational analyses (e.g., Anderson, 1990). Interestingly for our current analyses, Kareev (2000) has pointed out that working memory limitations are not always disadvantageous to cognitive processing: restricting the sample size within the narrow limits imposed by working memory (i.e., approx. 7 ± 2) may amplify a sample correlation, thereby enabling people to detect correlations more efficiently.

We first explored how well and under what conditions H best predicts the normative index of covariation. This first analysis shows that H is most effective as an approximation to the phi-coefficient when only small samples are used (i.e., it is most effective when parsimony is enforced by working memory limitations).

9.1. Method

In these simulations we generated a large variety of contingency tables embodying different relations between cause and effect. To generate these tables we needed to vary three parameters that specify a population of tables: we chose P (C), P (E), and P (C, E). The probability of the candidate cause, P (C), and of the effect, P (E), were both varied systematically over the range .1 to .9 in steps of .1 to examine the effects of rarity and equiprobability. Their joint probability was set by a random variable with a uniform distribution defined between P (C)P (E) and min(P (C), P (E)): P (C, E) ∼ Unif (P (C)P (E), min(P (C), P (E))). These are the parameters of the population of contingency tables.

The number of samples was controlled in two ways: one was the total sample size (i.e., N = a + b + c + d); the other excluded the d-cell (i.e., NW = a + b + c). The sample size was varied systematically to examine the effects of working memory capacity, which is known to be about 7 ± 2: N (or NW) ∼ Norm (μ, (μ/7)²), μ = (7, 14, 28, 56). To be specific, when the number of samples was controlled by NW, the procedure was as follows. First, P (C), P (E), and μ were set at particular levels systematically. For each set of P (C), P (E), and μ, the correlation between H and φ was calculated based on the generated contingency tables. For each contingency table, P (C, E) and NW were determined according to their respective probability distributions. Each sample's cell category (a, b, c, or d) was determined sequentially according to the following probabilities (the parameters of the population): P (C, E), P (C) – P (C, E), P (E) – P (C, E), and 1 – P (C) – P (E) + P (C, E), respectively. Sampling continued until the number of samples categorized in a, b, or c (excluding d) summed to NW. As a result, the real sample size exceeds NW by exactly the size of the d-cell.

For each set of P (C), P (E), and μ, 500 contingency tables were generated, regenerating P (C, E) and N (or NW) for each table. For each contingency table, H and φ were calculated, and finally r2 between H and φ was calculated. To obtain stable estimates, this procedure was repeated 20 times and each r2 was averaged.
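The generation procedure just described can be sketched as follows. This is our reconstruction for illustration only; it controls the total sample size N (the NW-based variant, which samples until a + b + c reaches the criterion, is analogous), and the original simulation used 500 tables × 20 repetitions per parameter set:

```python
import math
import random

def simulate_r2(pc, pe, mu=7, n_tables=500, seed=0):
    """r^2 between H and sample phi over tables drawn from a population
    with parameters P(C) = pc, P(E) = pe, and P(C,E) ~ Unif as in the text."""
    rng = random.Random(seed)
    hs, phis = [], []
    while len(hs) < n_tables:
        pce = rng.uniform(pc * pe, min(pc, pe))        # P(C,E)
        n = max(4, round(rng.gauss(mu, mu / 7)))       # N ~ Norm(mu, (mu/7)^2)
        probs = [pce, pc - pce, pe - pce, 1 - pc - pe + pce]  # cells a, b, c, d
        cells = [0, 0, 0, 0]
        for _ in range(n):                             # categorize each sample
            u, acc = rng.random(), 0.0
            for i, p in enumerate(probs):
                acc += p
                if u < acc:
                    cells[i] += 1
                    break
            else:
                cells[3] += 1                          # guard against rounding
        a, b, c, d = cells
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue                                   # degenerate margin: skip
        hs.append(a / math.sqrt((a + b) * (a + c)))    # H
        phis.append((a * d - b * c) / math.sqrt(denom))  # sample phi
    mh, mp = sum(hs) / len(hs), sum(phis) / len(phis)
    cov = sum((h - mh) * (p - mp) for h, p in zip(hs, phis))
    sxy = math.sqrt(sum((h - mh) ** 2 for h in hs) * sum((p - mp) ** 2 for p in phis))
    return (cov / sxy) ** 2
```

Running this sketch with pc = pe = .1 versus pc = pe = .9 reproduces the qualitative pattern reported below: the H–φ correlation is high under rarity and falls as the events become common.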

9.2. Results and discussion

9.2.1. Relations between H and sample φ

Figure 3a shows r2 between H and φ when the equiprobability assumption is made. The horizontal axis indicates the mean probability of cause or effect and the coordinates are linked by sample size. The results have two distinctive features. First, when the probability of events is low (i.e., rare), the correlation between H and sample φ is always high, irrespective of sample size. Second, as the probability of events increases, the correlation between H and sample φ decreases.

Figure 3.

(a) Coefficients of determination (r2) between H and φ when equiprobability is maintained; (b) coefficients of determination between H and φ0 and between φ and φ0 when equiprobability is maintained.

H differs from normative indices in not using the frequency information of the d-cell. The results, however, show that non-normative H works well and it works as well as φ given some constraints. Moreover, H may be preferable because it minimizes the computational cost of calculating an estimate of causal strength (i.e., H is more economical than φ).12

9.2.2. Relations between H and population φ0

Sampling does not aim at characterizing the sample itself. For an agent living in the real world, it is essential to estimate the parameters of the population from samples. Consequently, the population φ is considerably more important than the sample φ. From now on, to avoid possible confusion, the population φ is written φ0; when φ appears alone, it indicates the sample φ.

When estimating the population φ0, sample φ provides the best estimates of φ0 under the multiple sampling model (e.g., Bishop et al., 1975), and the accuracy of the estimates increases as the number of samples increases. Figure 3b shows the precision with which H and φ predict population φ0 for various sample sizes. In this figure, φ is plotted on dotted lines and H is plotted with black diamonds on a line in the small sample case (N ≈ 7). These results show that estimation accuracy of φ0 by H matches that for φ when P (C) is in the domain .1 to .3 (i.e., rare).

However, because H ignores the d-cell frequency, more samples can be acquired than with models that use all four cells; that is, the capacity limitation of working memory forces the heuristic to control the number of samples such that NW (not N) is approximately equal to 7. The results of sampling by this method are shown in Figure 3b with a line and grey circles, which shows that H predicts φ0 far more precisely than φ in the domain where the probability of cause or effect is .1 to .5. In particular, when the probability is low (approximately .1–.3), NW-based H achieves higher accuracy than φ with double the sample size (i.e., approximately 14).

Figure 4 shows that H predicts φ0 more accurately when equiprobability is maintained. The horizontal axis indicates the difference of probabilities of the cause and the effect, P (C) – P (E). Equiprobability occurs at the point where this value is equal to 0. Points that have the same mean probability (m) of the cause and the effect are connected by a line. For instance, if P (C) and P (E) are (.1, .3), (.2, .2), or (.3, .1), the mean probability of cause and effect is .2, so they are connected with a line “m = .2” and the values of “P (C) – P (E)” of these points are –.2, .0, and .2, respectively. This figure shows the results when NW≈ 7 and P (C), P (E) ≤ .5 (i.e., when rarity is maintained). Figure 4 shows clearly that H predicts φ0 far better when both events are equiprobable.

Figure 4.

Coefficients of determination between H and φ0 under the rarity assumption and with NW = a + b + c ≈ 7. m indicates the mean probability of the cause (C) and the effect (E), and the horizontal axis indicates the degree of deviation from equiprobability.

10. Simulation 2: Efficiency

In the real world, people usually sample sequentially. In sequential sampling, it is important to form an appropriate conclusion quickly. Accordingly, there is a trade-off between the early cessation of sampling and the accuracy of estimation. Therefore, we investigated the relation between the convergence of an index value on the true population mean and sample size.

10.1. Method

The convergence of φ, Δ P, and H was examined in sequential sampling from a population. The probabilistic characteristics of the population were defined by three parameters, P (C), P (E), and φ0.13 Assuming rarity and equiprobability, P (C) and P (E) were set to .2, and φ0 was varied at three levels, .2, .5, and .8. The sample size started at 1 and increased to 30. This procedure was repeated 5,000 times, and the mean values of the indices at each N (or NW) and their standard deviations were derived.
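The sequential-sampling procedure can be sketched as follows (our reconstruction, shown for the standard deviation of H as a function of sample size; the cell probabilities are recovered from P (C), P (E), and φ0 by inverting the definition of the phi coefficient):

```python
import math
import random

def convergence_sd(phi0, pc=0.2, pe=0.2, n_max=30, trials=2000, seed=0):
    """SD of the sample H across sequential-sampling trials, per sample size n.

    P(C,E) is recovered from phi0 via
    phi0 = (P(C,E) - P(C)P(E)) / sqrt(P(C)(1-P(C))P(E)(1-P(E))).
    Returns a list indexed by n (index 0 unused -> None).
    """
    rng = random.Random(seed)
    pce = pc * pe + phi0 * math.sqrt(pc * (1 - pc) * pe * (1 - pe))
    probs = [pce, pc - pce, pe - pce, 1 - pc - pe + pce]   # cells a, b, c, d
    h_at_n = [[] for _ in range(n_max + 1)]
    for _ in range(trials):
        cells = [0, 0, 0, 0]
        for n in range(1, n_max + 1):                      # draw one sample at a time
            u, acc = rng.random(), 0.0
            for i, p in enumerate(probs):
                acc += p
                if u < acc:
                    cells[i] += 1
                    break
            a, b, c, _ = cells
            if (a + b) and (a + c):                        # H defined only then
                h_at_n[n].append(a / math.sqrt((a + b) * (a + c)))
    def sd(xs):
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [sd(h) if h else None for h in h_at_n]
```

Plotting the returned values against n reproduces the qualitative point made below: the spread of the index shrinks as sequential sampling proceeds, and how fast it shrinks is what distinguishes a reliable index from an unreliable one.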

10.2. Results and discussion

The results are shown in Figure 5 for the cases where φ0 = .8 (upper) and φ0 = .2 (lower); the results for φ0 = .2 and φ0 = .5 were essentially the same, so the latter are omitted. The y-axis in each upper panel indicates the mean values of φ, Δ P, and H over 5,000 trials of sequential sampling. These show that the indices converge as the number of samples increases. The lower panels for each level of φ0 show the SDs for each index. The dotted line with black diamonds indicates NW-based index values, whereas the other lines are N based.

Figure 5.

Convergence graphs for the sample indices (φ, Δ P, and H) as functions of sample size when φ0 = .8 (upper) and φ0 = .2 (lower) with 5,000 trials of sequential sampling. The horizontal axes indicate the number of samples, N or NW (detailed in the text); the vertical axes indicate the mean or the standard deviation of each index (i.e., φ, Δ P, or H).

The mean values of H and φ converge gradually. As the limiting value of H is higher than that of φ, the convergence line for H is always above that of φ; the rates of convergence themselves, however, do not differ. On the other hand, Δ P appears to be stable from the start, as does NW-based H.

The reliability of an index cannot be determined solely from the convergence rate of the mean. It is necessary to know the confidence interval of the population mean. As the standard deviation decreases, the confidence interval narrows and the information about the population gained from an index increases. The convergence status of the standard deviations indicates that Δ P, which seemed very stable when considering its mean convergence alone, has large standard deviations and turns out not to be a reliable index. On the other hand, NW-based H has small standard deviations. Taken together with its rate of mean convergence this clearly indicates that NW-based H is a more reliable index, assuming rarity and equiprobability.

10.3. Summary of simulations

In an environment where rarity and equiprobability hold, the index H defined by the dual factor heuristic measures φ approximately. In such an environment, H can function as a superior substitute for φ. When H is used for causal induction, d-cell information can be disregarded; more samples can thus be acquired within the limits of working memory. Using this NW-based sampling, H can predict the population φ0 far better than other indices, even better than sample φ (Simulation 1). In sequential sampling, the confidence interval of H decreases rapidly, and it is the most reliable predictor of population φ0 (Simulation 2).

11. General discussion

The dual factor heuristic, H, provided the best fit to the experimental data, both in meta-analyses of past data and in new experiments. Normative assessments of the strength of covariation require attending to all events, including those in which nothing happens (the d-cell). Attending to all cell frequencies places prohibitive demands on people's limited working memory capacity. Computing H allows people to assess the strength of relations without using the d-cell. This strategy makes fewer demands on memory and is accurate in circumstances where P (E) and P (C) are small. Despite the model's descriptive power and adaptive rationality, d-cell neglect is controversial. In this discussion we first consider the d-cell effects.

Given an observation of small but reliable positive d-cell effects (e.g., Anderson & Sheu, 1995), the dual factor heuristic might be viewed as falsified at birth. However, the conclusion of this research is that the dual factor heuristic best approximates people's behavior in the heuristic stage of causal induction. The results do not imply either that there are no d-cell effects at all or that any attempts to investigate the effects are in vain. There are many other factors besides cell frequencies that can influence covariation assessment—for example, context, intelligence, depression, culture, gender, and so on. Despite the success of the dual factor heuristic in its restricted domain of application, we do not pretend that these factors have no effect on causal judgment. The same is true of the d-cell effects. The d-cell effects may be a real part of causal induction despite the descriptive validity of the dual factor heuristic. If the effects are significant, it might suggest either that people adopt a slightly different algorithm from this model, that these effects arise at the analytic stage, or that some assumptions of the model are slightly wrong.

Nonetheless, our simulations showed that H can perform even better than using the d-cell if one is only allowed to keep track of a small number of events in working memory. Our simulations also revealed some further factors that influence the model with respect to adaptation to the real world. Next, we consider the results of the simulations indicating the importance of rarity, equiprobability, and working memory capacity to covariation detection and cognition more generally.

There is ample empirical evidence for working memory limitations but rather fewer demonstrations of the adaptive advantage such an apparent limitation can confer on a cognitive agent. Our demonstration that covariation estimation using H can be more accurate than using normative measures for small samples is therefore consistent with related demonstrations by Kareev (2000). The idea that causal statements usually concern low-probability events of approximately equal probability has also arisen in other domains, such as hypothesis testing (Klayman & Ha, 1987). Moreover, there is recent evidence that when people spontaneously frame conditional statements about causal hypotheses (i.e., if cause then effect), they do so using rare events (see McKenzie et al., 2001).

Rarity and equiprobability have also emerged as important factors in other areas, in particular in Oaksford and Chater's (1994) information gain model of data selection (see also Hattori, 2002). Suppose that there is a certain category Y in question and an arbitrary category X. Let X = {X, X̄} and Y = {Y, Ȳ}, and let I (Y) denote the average self-information (i.e., entropy) of the random variable Y. The reduction in the entropy of Y produced by knowing X (i.e., the information gain) is

I (Y) − [P (X) I (Y|X) + P (X̄) I (Y|X̄)],
which has local maxima when (a) P (X) = P (Y) = P (X, Y), and (b) P (X) = 1 − P (Y) and P (X, Y) = 0. The former corresponds to equiprobability between X and Y, and the latter to equiprobability between X̄ and Y. If the rarity assumption is made, case (a) is the only solution. Therefore, when an explanatory category X is sought for the category Y, seeking a category that has approximately the same probability increases the expected information gain. Thus, the assumptions that allow H to provide a good index of causal strength are the same as those made by optimal data selection models in explaining the rationality of people's data selection behavior (Hattori, 2002; Oaksford & Chater, 1994, 2003).
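The information gain at issue is the mutual information between X and Y. A short Python sketch (our illustration; function names ours) computes it from the joint probability P (X, Y) and the marginals, and confirms that the perfect-overlap case (a), P (X) = P (Y) = P (X, Y), attains the maximum I (Y):

```python
import math

def entropy(p):
    """Binary entropy, in bits, of a variable with P(Y) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(p_xy, p_x, p_y):
    """Information gain I(Y) - I(Y|X) (mutual information) for binary
    X and Y, given the joint P(X, Y) and the marginals P(X), P(Y)."""
    cells = {
        (1, 1): p_xy,
        (1, 0): p_x - p_xy,
        (0, 1): p_y - p_xy,
        (0, 0): 1 - p_x - p_y + p_xy,
    }
    mi = 0.0
    for (x, y), p in cells.items():
        if p > 0:
            px = p_x if x else 1 - p_x
            py = p_y if y else 1 - p_y
            mi += p * math.log2(p / (px * py))
    return mi

# Case (a): P(X) = P(Y) = P(X, Y), so X and Y coincide; gain = I(Y)
print(info_gain(0.2, 0.2, 0.2))  # ≈ 0.722, equals entropy(0.2)
# A misaligned marginal yields a smaller gain:
print(info_gain(0.2, 0.5, 0.2))
```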

This study contrasts with most other rational analyses of causal induction (e.g., Anderson, 1990; Griffiths & Tenenbaum, 2005) insofar as it is concerned with an optimal approximation in an uncertain world. Other rational analyses (e.g., Anderson, 1990; Oaksford & Chater, 1994) showed that people's behavior is adaptive if the environment can be assumed to have certain characteristics (e.g., rarity, or a power-law need-probability function). However, those analyses can still make predictions for behavior even if these environmental assumptions are violated. Oaksford and Chater (1994), speculating on the consequences for the algorithmic level, suggested two extremes. On the one hand, the cognitive system could implement the rational analysis in a hard-wired heuristic, such that, for example, the system shows no behavioral variation with respect to violations of rarity. On the other hand, the analysis could be implemented directly in the cognitive system, in which case behavior should perfectly track rarity violations. Oaksford and Chater (1994) suggested that the truth probably lies somewhere between these two extremes: Although there may be some responsiveness to rarity violation, the cognitive system would be expected to show some inertia in responding to deviations from normal environmental assumptions. In the dual factor heuristic, the environmental assumption of rarity is built in (i.e., the heuristic cannot respond to rarity violation and thus can only approximate rational performance). Therefore, it is closer to the hard-wired heuristic end of Oaksford and Chater's (1994) continuum.

This fact may call into question the rationality of using the heuristic. However, if the assumptions of the model are usually respected in the environment, then in the large majority of cases it will provide the right answers quickly and efficiently. The question that obviously arises is: can the cognitive system adapt to cases of rarity violation? The existence of some d-cell sensitivity, together with McKenzie and Mikkelsen's (2007) recent work, suggests that people may have some capacity to adapt to violations of rarity. Moreover, although the dual factor heuristic provides a rational strategy, this does not necessarily mean that everyone detects covariation in exactly the way the model describes; it does mean that people's average behavior is predictable. Each individual may make judgments in a different way. Not only individual differences but also context, wording, and general knowledge may affect causal induction (e.g., Arkes & Harkness, 1983; Crocker, 1982; Einhorn & Hogarth, 1986).

There is an intrinsic difference between an associational view of covariation judgment and our heuristic view. The dual factor heuristic is designed to make quick responses to the environment when there is pressure to continuously construct provisional judgments. On this view, the associative learning process is too slow to suggest tentative hypotheses for test by intervention at the analytic level. For example, when people feel sick, they want to detect the causally relevant food as soon as possible to avoid possible future risks. According to the dual factor heuristic, a single case in sequential sampling can have a strong impact if it is an instance of the a-cell. According to associative approaches, by contrast, the impact of individual instances is small and the effect on judgment is gradual and incremental. It is highly unlikely that people always reserve their judgments until learning reaches “asymptote,” as the associationists insist. However, sample size no doubt increases the reliability of data, which can be important in some contexts, perhaps where Type I errors (i.e., false alarms) matter decisively. For example, the manager of a baseball team would be unlikely to hire a player based on observing him hit a single, albeit decisive, home run. Rather, the manager would assess the player's overall performance based on his batting average, because the costs of a false positive are too high (given the high earnings of these players). Assessing the costs of false positives, and so whether long-run performance must be assessed, is again a matter for the analytic stage of causal induction, not for the first-pass determination of likely causal candidates.

Moreover, there is a phenomenon that has been regarded as a “bias” from the associative view but that has a rational explanation in our model. In the outcome density bias, the probability of the outcome, P (E), positively affects human judgments of causal strength even for non-contingent relations (i.e., Δ P = 0) (e.g., Allan & Jenkins, 1980, 1983; Chatlosh, Neunaber, & Wasserman, 1985; Shanks, 1985; Wasserman & Shaklee, 1984). This bias has often been observed in continuous-paradigm tasks. Although the continuous paradigm is beyond the scope of the current study, the bias may be explained, at least partly, by the dual factor heuristic. H can be written as follows:

H = [P (E) + Δ P · P (C̄)] √(P (C) / P (E)).
When Δ P = 0, H is equal to √(P (C) P (E)). Suppose that participants “perform” the act (i.e., the candidate cause) that may cause the effect irrespective of the outcome density. We can thus regard √P (C) as a fixed constant, k, so that H = k √P (E). This means that outcome density, P (E), should alter people's estimates of the predictive relationship (i.e., the higher the outcome density, the stronger the perceived causal relation).
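Writing H as (P (E) + Δ P · P (C̄)) √(P (C)/P (E)) (our rearrangement, consistent with the Δ P = 0 case above), a short numerical check shows that with Δ P = 0 and P (C) held fixed, H grows monotonically with the outcome density P (E):

```python
import math

def h_from_probs(p_c, p_e, dp):
    """H rewritten in terms of dP = P(E|C) - P(E|not-C); our rearrangement,
    which reduces to sqrt(P(C) * P(E)) when dP = 0."""
    return (p_e + dp * (1 - p_c)) * math.sqrt(p_c / p_e)

# dP = 0 and P(C) fixed at .4: higher outcome density -> larger H
for p_e in (0.1, 0.3, 0.5):
    print(round(h_from_probs(0.4, p_e, 0.0), 3))  # → 0.2, 0.346, 0.447
```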

In conclusion, when people observationally detect covariation between events as a first step in inducing causality, their behavior is most consistent with the dual factor heuristic, a simple and efficient strategy that approximates the normative index under the rarity assumption. At present, participants' performance in this type of causal induction task is best described by this heuristic, which suggests that people's behavior in non-interventional causal assessment tasks is adaptively rational.


  • 1

The method McKenzie (1994) adopted was not a Monte Carlo method, because it was not based on random variables with particular probability distributions.

  • 2

Although φ distinguishes positive and negative correlations in any 2 × 2 contingency table by means of the “±” sign of the index value, H has no such mechanism; that is because, obviously, lim(a, d → 0) φ = −1, whereas lim(a, d → 0) H = 0.

  • 3

    In terms of his theory, it is characterized by monotonicity and exogeneity. It is called monotonic if and only if there is no individual case that already has the effect without the cause but that would lose the effect if it gained the cause. It is called exogenous if and only if the cause and the effect are not influenced by any common factors.

  • 4

Schustack and Sternberg (1981) actually included an additional term for the strength of competing causes; Equation 11 follows Anderson and Sheu (1995).

  • 5

    Note that fitting a linear regression model of cell frequencies and looking for different cell weights is not an appropriate test. See the third point in the section What d-cell Insensitivity is Not, in this regard.

  • 6

In Experiment 1 of Buehner, Cheng, and Clifford (2003), Stimuli 11 (8, 0, 8, and 0 as a, b, c, and d, respectively) and 15 (0, 8, 0, and 8, ditto) were omitted because many indices (11 and 13 of the 40 indices for Stimuli 11 and 15, respectively) cannot be calculated owing to division by zero. As a result, the number of stimuli counted for this experiment amounted to 13.

  • 7

In this study, the variation of stimuli in each experiment was very small (i.e., 3 or 4), and their configurations were set up to complement each other, so we treat these three experiments as one experiment here.

  • 8

Although Δ P seems to be invalidated by the data of W03.2 (i.e., r² = .00), this study set only two levels of Δ P, so this might be considered insufficient evidence to rule out this index.

  • 9

    These instructions were written in Japanese.

  • 10

For some model–data combinations (i.e., L* with BCC03.1, BCC03.3, LS00, and W03.2; and I2 with BCC03.1, BCC03.3, and LS00), we had to compute rank-deficient linear least squares solutions. In such cases, the Moore–Penrose generalized inverse of a matrix was used (see Venables & Ripley, 1999, p. 100).

  • 11

Because AICc is already adjusted not only for the number of parameters, K, included in a model but also for the sample size (i.e., the number of stimuli, M, included in each experiment), it was not weighted by the number of stimuli when used as an overall measure of a model's appropriateness, unlike r².

  • 12

How computational simplicity should be characterized is controversial. For further discussion of this point, see, for example, Chater, Oaksford, Nakisa, and Redington (2003).

  • 13

    Between P (C, E) and φ0, there is a relation as inline image, where x = P (C), y = P (E), x̄ = 1 – x, and ȳ = 1 – y.


A part of this study was conducted while the authors were at the School of Psychology, Cardiff University, Wales. We appreciate the school's warm support for our research. Preparation of this article was partially supported by Grant-in-Aid for Scientific Research 19500229 from the Japan Society for the Promotion of Science, a research grant from the Institute of Human Sciences, Ritsumeikan University (Project Research B), and a grant from The Daiwa Anglo-Japanese Foundation awarded to M. Hattori. This article was presented, in part, at the 4th International Conference on Cognitive Science (ICCS 2003), The University of New South Wales, Sydney, Australia.

We thank Marc Buehner, Kyung Soo Do, Ken Manktelow, Masanori Nakagawa, Minoru Nakashima, Tatsuo Otsu, Tsuneo Shimazaki, Tetsuo Takigawa, Hiroshi Yama, and Kimihiko Yamagishi, as well as Josh Tenenbaum and three anonymous reviewers for their very helpful and constructive comments on this study. We are also grateful to Shiori Nakao, Yuka Otake, Tomoko Tamezane, and Miyuki Tanaka for their help in running the experiments.

Appendix A: Mathematical supplements on the indices

A. 1. Causal support

Griffiths and Tenenbaum (2005) defined causal support as the value of the log likelihood ratio for obtaining data D from GC over GI, each of which is a causal graphical model (see Equation 6). P (D|GC) is defined by integrating over all possible values pC and pB could assume. Assuming independence between C and B and a uniform prior distribution for pB, P (D|GI) simplifies, using the beta function and the cell frequencies (a, b, c, d, and N = a + b + c + d), to the following:

P (D|GI) = ∫₀¹ pB^(a+c) (1 − pB)^(b+d) dpB = B (a + c + 1, b + d + 1).
P (D|GC), by contrast, cannot be evaluated analytically, but it can be approximated by the Monte Carlo method. Assuming uniform priors on pC and pB, it can be calculated by drawing m samples of pC and pB from a uniform distribution on [0, 1] as follows:

P (D|GC) ≈ (1/m) Σ_{i=1}^{m} [1 − p̄C(i) p̄B(i)]^a [p̄C(i) p̄B(i)]^b [pB(i)]^c [p̄B(i)]^d,
where C = {C, C̄}, E = {E, Ē}, N (C, E) = a, N (C, Ē) = b, N (C̄, E) = c, and N (C̄, Ē) = d; and p̄C and p̄B indicate 1 − pC and 1 − pB, respectively.

As this index runs over [−∞, ∞], to keep the range between −1 and 1 we define an index transformed by the hyperbolic tangent as a non-parameterized model based on causal support, as follows:

tanh (support).
Note that we also tested this index as a parameterized model as Griffiths and Tenenbaum (2005) suggested in Meta-analysis 2.
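As a sketch of the computation just described (helper names and the sample size m are ours; the noisy-OR parameterization follows Griffiths and Tenenbaum, 2005), causal support can be estimated in a few lines of Python:

```python
import math
import random

def log_p_d_given_gi(a, b, c, d):
    """log P(D|GI): with a uniform prior on pB, the integral is the
    beta function B(a + c + 1, b + d + 1)."""
    return (math.lgamma(a + c + 1) + math.lgamma(b + d + 1)
            - math.lgamma(a + b + c + d + 2))

def log_p_d_given_gc(a, b, c, d, m=50_000, seed=0):
    """Monte Carlo estimate of log P(D|GC), drawing pC and pB uniformly
    and assuming a noisy-OR combination of cause and background."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        p_c, p_b = rng.random(), rng.random()
        p_e_given_c = 1 - (1 - p_c) * (1 - p_b)  # noisy-OR
        total += (p_e_given_c ** a * (1 - p_e_given_c) ** b
                  * p_b ** c * (1 - p_b) ** d)
    return math.log(total / m)

def causal_support(a, b, c, d):
    return log_p_d_given_gc(a, b, c, d) - log_p_d_given_gi(a, b, c, d)

# A contingent table (dP = .6) yields positive support;
# tanh maps it into [-1, 1]:
s = causal_support(8, 2, 2, 8)
print(s > 0, -1 < math.tanh(s) < 1)  # → True True
```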

A.2. Signal Detection Theory (SDT) measure

According to the parametric SDT (Green & Swets, 1966/1988; Tanner & Swets, 1954), the detectability, d′, is calculated as follows:

d′ = Φ⁻¹[P (Y|s)] − Φ⁻¹[P (Y|n)].
Here, P (Y|n) is the probability of a “yes” response under noise alone, P (Y|s) is the probability of a “yes” response under signal plus noise, each of which corresponds to P (Ē|C̄) and P (Ē|C), respectively; and Φ indicates the cumulative normal distribution function.

As this index also runs over [−∞, ∞], to keep the range between −1 and 1 we adopt its transformation by the cumulative normal distribution function (i.e., the so-called inverse probit transformation) as an index based on SDT:

2 Φ(d′) − 1.
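The SDT computation above can be checked numerically with Python's standard library (the bounded index is written here as 2Φ(d′) − 1, our reading of the transformation needed to keep the range between −1 and 1):

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: Phi and its inverse

def d_prime(p_yes_signal, p_yes_noise):
    """d' = Phi^{-1}(P(Y|s)) - Phi^{-1}(P(Y|n))."""
    z = STD_NORMAL.inv_cdf
    return z(p_yes_signal) - z(p_yes_noise)

def bounded_sdt_index(p_yes_signal, p_yes_noise):
    """Map d' from [-inf, inf] into [-1, 1] via 2 * Phi(d') - 1
    (our reading of the transformation)."""
    return 2 * STD_NORMAL.cdf(d_prime(p_yes_signal, p_yes_noise)) - 1

print(round(d_prime(0.8, 0.2), 3))  # → 1.683
print(round(bounded_sdt_index(0.8, 0.2), 3))
```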
A.3. Choice Theory measure

In his Choice Theory, Luce (1959, 1963) defined a measure of similarity, η, between signal and noise as follows:


According to Luce (1959, 1963), −log η is a mental distance. As the mental distance between signal and noise becomes large, the accuracy of response increases, and the linkage between stimuli and responses becomes strong. Thus, −log η indicates the strength of association between two events. This index, however, also ranges over [−∞, ∞]. Here, we define an index transformed by the hyperbolic tangent to keep the range between −1 and 1:


This index is equivalent to Yule's (1900) coefficient of colligation, Y [25] (Equation 29).

A.4. Yule's coefficients

Yule's (1900) coefficient of association, Q, and coefficient of colligation (Yule, 1912), Y, are defined as follows:

Q = (ad − bc) / (ad + bc),   Y = [√(ad) − √(bc)] / [√(ad) + √(bc)].
It is known that Y is always smaller than Q in absolute value. Q and Y always lie between −1 and 1 regardless of the marginal frequencies of the table, and this is their greatest difference from φ.
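Both coefficients are easy to verify numerically; the following sketch (function names ours) computes Q and Y for a sample table and confirms |Y| ≤ |Q|:

```python
import math

def yule_q(a, b, c, d):
    """Yule's coefficient of association."""
    return (a * d - b * c) / (a * d + b * c)

def yule_y(a, b, c, d):
    """Yule's coefficient of colligation."""
    s_ad, s_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (s_ad - s_bc) / (s_ad + s_bc)

q, y = yule_q(8, 2, 2, 8), yule_y(8, 2, 2, 8)
print(q, y)              # → 0.8823529411764706 0.6
print(abs(y) <= abs(q))  # → True
```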

A.5. Goodman–Kruskal's index of predictive association

As described in the text, Goodman and Kruskal (1954, 1963) defined an asymmetric statistic. By swapping C and E, another asymmetric index can also be defined. In terms of a 2 × 2 table, these indices are defined as follows:

λE|C = [max(a, b) + max(c, d) − max(a + c, b + d)] / [N − max(a + c, b + d)],
λC|E = [max(a, c) + max(b, d) − max(a + b, c + d)] / [N − max(a + b, c + d)].
As the above two indices usually do not have equal values (i.e., λE|C ≠ λC|E), Goodman and Kruskal (1963) defined an average index of λE|C and λC|E, as follows:

λ = (nE|C + nC|E) / (dE|C + dC|E).
Here, nE|C and nC|E indicate the numerators of λE|C and λC|E, respectively; and dE|C and dC|E indicate their respective denominators.
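Assuming the standard Goodman–Kruskal definitions for a 2 × 2 table (rows C and C̄, columns E and Ē; function names ours), the three indices can be computed as follows:

```python
def lambda_e_given_c(a, b, c, d):
    """Proportional reduction in error when predicting E from C."""
    n = a + b + c + d
    return ((max(a, b) + max(c, d) - max(a + c, b + d))
            / (n - max(a + c, b + d)))

def lambda_c_given_e(a, b, c, d):
    """Proportional reduction in error when predicting C from E."""
    n = a + b + c + d
    return ((max(a, c) + max(b, d) - max(a + b, c + d))
            / (n - max(a + b, c + d)))

def lambda_avg(a, b, c, d):
    """Symmetric lambda: pooled numerators over pooled denominators."""
    n = a + b + c + d
    num = ((max(a, b) + max(c, d) - max(a + c, b + d))
           + (max(a, c) + max(b, d) - max(a + b, c + d)))
    den = (n - max(a + c, b + d)) + (n - max(a + b, c + d))
    return num / den

# The two directed indices usually differ:
print(lambda_e_given_c(9, 1, 4, 6), lambda_c_given_e(9, 1, 4, 6))
```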

A.6. Kappa statistic

In the case of a 2 × 2 table, letting pij be the probability of cell ij (row i and column j), π0, the observed probability of agreement, and πE, the probability of agreement by chance, are expressed as follows:

π0 = p11 + p22,   πE = (p11 + p12)(p11 + p21) + (p21 + p22)(p12 + p22).
Therefore, κ is defined as follows:

κ = (π0 − πE) / (1 − πE).
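In Python (a direct transcription in terms of cell frequencies a, b, c, d; function name ours):

```python
def kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table: chance-corrected diagonal agreement."""
    n = a + b + c + d
    p0 = (a + d) / n                                       # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (p0 - pe) / (1 - pe)

print(round(kappa(8, 2, 2, 8), 3))  # → 0.6
```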
A.7. Log odds ratio

The log odds ratio of diagonal cell frequencies of a 2 × 2 contingency table is defined as follows:

log (ad / bc).
If it is transformed by the hyperbolic tangent, since it runs over [−∞, ∞], it coincides with Yule's (1900) Q [24] (Equation 28) as follows:

tanh [(1/2) log (ad / bc)] = (ad − bc) / (ad + bc) = Q.
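The identity is easy to confirm numerically (helper names ours): the hyperbolic tangent of half the log odds ratio equals Yule's Q for any table with nonzero cells:

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio of the diagonal cell frequencies."""
    return math.log((a * d) / (b * c))

def yule_q(a, b, c, d):
    """Yule's coefficient of association."""
    return (a * d - b * c) / (a * d + b * c)

table = (8, 2, 2, 8)
print(math.tanh(log_odds_ratio(*table) / 2))  # → 0.88235...
print(yule_q(*table))                         # → 0.88235...
```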
A.8. Probabilistic extension of Mill's (1843/1973) method of difference

Cheng and Novick (1992) proposed a probabilistic interpretation of this idea as follows:

Δ P = P (E|C) − P (E|C̄).
This is formally equivalent to Δ Pc [3].

A.9. Good's probabilistic causal model

Good (1961, 1962) defined a probabilistic causality as follows:

Q (C : E) = log [P (Ē|C̄) / P (Ē|C)].
As this index also runs over [−∞, ∞], we define an index transformed by the hyperbolic tangent as follows:

tanh [Q (C : E)].
Appendix B

Model               AS95    BCC03.1   BCC03.3   LS00     W03.2    W03.6       Exp 1    Exp 2
 1  H        β0     .0289   −.140     −.873     −.809    −.0634   .0994       −.131    −.109
35  J        x1     .116    .411      .722      .654     .585     .169        .283     .0100
36  Δ Pw1    w1     1.23    .753      2.47      1.04     .823     .797        1.30     .000
37  Δ Pw2    β0     .189    .208      −.428     −.017    .398     .156        .0566    .416
 8  SP       β0     47.7    43.1      13.0      31.4     47.5     1.42 × 10³  45.7     20.9
             β1     14.5    11.8      15.7      8.80     .0317    1.38 × 10³  17.0     16.2
38  I1       w0     −.412   −.355     −.123     −.0144   −.0959   −.210       −.373
39  I2       β0     .573    .549      .236      .499     .670     .508        .276
40  B        pr     .413    .360      .0584     .0523    .327     .351        .216
41  L*       β0     .420    .341      −.0134    −.608    .207     .361        .325