“What if…”: The Use of Conceptual Simulations in Scientific Reasoning


Naval Research Laboratory, Code 5515, Washington, DC, 20375-5337. E-mail: trickett@itd.nrl.navy.mil


The term conceptual simulation refers to a type of everyday reasoning strategy commonly called “what if” reasoning. It has been suggested in a number of contexts that this type of reasoning plays an important role in scientific discovery; however, little direct evidence exists to support this claim. This article proposes that conceptual simulation is likely to be used in situations of informational uncertainty, and may be used to help scientists resolve that uncertainty. We conducted two studies to investigate the relationship between conceptual simulation and informational uncertainty. Study 1 was an in vivo study of expert scientists; the results suggest that scientists do use conceptual simulation in situations of informational uncertainty, and that they use conceptual simulation to make inferences from their data using the analogical reasoning process of alignment by similarity detection. Study 2 experimentally manipulated experts' level of uncertainty and provides further support for the hypothesis that conceptual simulation is more likely to be used in situations of informational uncertainty. Finally, we discuss the relationship between conceptual simulation and other types of reasoning using qualitative mental models.

1. Introduction

In a famous anecdote, Einstein (1979) describes how, as a youth, he visualized himself chasing a beam of light; he explains that later on, this imaginative leap contributed to his development of the theory of relativity. Einstein's thought experiment is one of the best-known examples of a type of “what if” reasoning that has been implicated in scientific discovery in a variety of fields. Other famous scientists who are reported to have engaged in thought experiments include Galileo, Newton, Maxwell, Heisenberg, and Schrödinger, to name a few (e.g., Shepard, 1988).

Scientists are likely to use such thought experiments, or “what if” reasoning, when it is either impossible or impractical to conduct a physical experiment. In addition, from a purely theoretical perspective, “what if” reasoning offers several advantages. Unlike quantitative reasoning strategies, it does not require numerical precision. This may be useful (a) when precise quantitative information is not available; or (b) when a scientist is attempting to develop a general, or high-level, understanding of a system. Like other forms of mental model-based qualitative reasoning, “what if” reasoning allows one to reason with partial knowledge (whether incomplete or imprecise) and hence to accommodate the ambiguity inherent in situations of uncertainty (Forbus, 2002). “What if” reasoning also allows the construction of multiple alternatives, which may be useful in generating predictions or explanations when scientists lack principled knowledge that can allow them to proceed in their reasoning with some measure of certainty. All these situations share a high level of uncertainty; thus, “what if” reasoning may be especially useful in some situations of uncertainty.

There are many types of uncertainty in complex domains such as scientific enquiry (Schunn, Kirschenbaum, & Trafton, 2004). Schunn et al. differentiated between subjective uncertainty (what a person feels) and objective uncertainty (uncertainty in the information a person has). Our focus here is on informational uncertainty.

For reasons discussed below, we concentrate our research on the data analysis phase of scientific discovery. During this phase, scientists must first recognize what information the data actually represent; and second, come to an understanding of what that representation actually means in terms of their research questions (i.e., interpret the data). Consequently, there are two general areas where scientists are likely to encounter informational uncertainty. First, the data themselves may literally be unclear: For example, data may be missing, inaccurate, or noisy so that scientists must work to differentiate real phenomena from noise. Second, the meaning of the data may be unclear; for example, experimental results may be anomalous (i.e., incompatible with previously established empirical results or even theory), follow some unexpected or unusual pattern, or otherwise conflict with the scientist's predictions. Part of the scientist's task is to explain or otherwise resolve such expectation violations.

In other complex domains, such as meteorology, we have found that when people experience informational uncertainty while using complex visualizations, they mentally transform the visualization by adding their own representation of uncertainty, in order to resolve it (Trickett, Trafton, Saner, & Schunn, 2007). Consequently, we expect that when scientists experience informational uncertainty, they will try to resolve that uncertainty; and we propose that “what if” reasoning is likely to be one strategy by which they attempt to do so because it allows people to transform their current understanding by mentally constructing an alternative. “What if” reasoning allows people to think through the implications of different starting assumptions by playing out different scenarios and then to evaluate their plausibility. If this were the case, we would expect “what if” reasoning to occur particularly in association with tentative explanations, or hypotheses, that could account for particular instances of informational uncertainty. Furthermore, if “what if” reasoning were used to try to resolve such uncertainty, we would expect it to lead to an evaluation of the hypothesis to determine whether it adequately accounts for the uncertainty and consequently resolves it.

What constitutes “what if” thinking? Brown (2002) proposed a three-step process that consists of first, visualizing some situation; second, carrying out one or more operations on it; and third, seeing what happens. The third part of the process, seeing what happens, is crucial. It distinguishes “what if” thinking from pure imagining because during this third phase causal reasoning is applied to the results of the manipulation(s) of the second phase. A well-known example of this type of thinking is Lucretius' attempt to show that space is infinite (Brown, 2002). Assuming space has a boundary (visualize a situation), throw an imaginary spear toward it (carry out an operation on the visualization). If the spear goes through, there is no boundary; if the spear rebounds, we infer a “wall” that must itself be in space that stopped the spear (see what happens). Consequently, space has no boundary (causal reasoning).

Although Lucretius is clearly not a layperson, it is easy to apply the same processes to everyday examples of this type of thinking. For example, suppose one is figuring out the steps by which to assemble a piece of furniture (e.g., Lozano & Tversky, 2006), in the absence of clear written instructions, and prior to making any irreversible decisions. One might mentally start to arrange certain pieces where one thinks they should go (visualize a situation). Then one might mentally attempt to insert a new piece (carry out an operation on the visualization). One can then inspect the visualization to determine whether the new piece will fit (see what happens). Finally, one can determine whether the initial arrangement is correct and decide either to proceed with construction or to start over (causal reasoning).

As this illustration shows, “what if” thinking is hardly the type of arcane activity frequently associated in the popular imagination with scientific genius, but rather an everyday reasoning strategy available to scientist and layperson alike. How important is such a strategy likely to be in the scientific reasoning process? On the one hand, scientific expertise (domain knowledge and skills) is acquired only after many years of education and practice (Ericsson & Charness, 1994; Ericsson, Krampe, & Tesch-Roemer, 1993; Schunn & Anderson, 1999). On the other hand, current research suggests that, as Einstein himself maintained, what sets scientific reasoning apart from everyday reasoning is not different processes but simply greater precision, systematicity, and “logical economy” (Klahr & Simon, 1999). A full model of scientific discovery should therefore include relevant everyday reasoning strategies and heuristics. It has already been suggested that everyday reasoning strategies, such as mental simulation and other forms of reasoning with qualitative mental models, play a role in a general understanding of natural phenomena and physical systems (e.g., Hayes, 1988; Williams & de Kleer, 1991). Our question is the extent to which one such strategy, “what if” reasoning, guides experts' scientific reasoning.

In fact, several everyday reasoning strategies have already been shown to play an important role in the process of science, strategies such as analogy (Dunbar, 1995, 1997; Gentner, 2002; Okada & Simon, 1997), attending to anomalies (Kulkarni & Simon, 1988), collaboration (Azmitia & Crowley, 2001), use of mental models (Forbus, 1983; Forbus & Gentner, 1997), and the like. The goal of this article is to investigate the role of “what if” thinking in the scientific reasoning of contemporary scientists.

There is some evidence in the cognitive science literature that scientists specifically use forms of “what if” reasoning. Reconstructions of historical discoveries and analyses of contemporary records such as journals and lab notebooks suggest that scientists conduct “mental experiments” in a process that mirrors an empirical experiment (Nersessian, 1999) or otherwise construct “runnable” mental models (e.g., Ippolito & Tweney, 1995). Empirical studies of contemporary scientists also find the use of mental experiments (e.g., Clement, 2002a; Qin & Simon, 1990) and mental simulation (Schraagen, 1993). This research spans a wide variety of contexts (such as historical reconstruction, protocol study, and lab experiment), tasks (such as scientific discovery, experimental design, and prediction), and participants (from famous historical figures to contemporary expert practitioners to scientists-in-training).

Despite this body of research, it is difficult to draw general conclusions from the results. The nature of historical studies makes it impossible to determine whether the mental experimentation occurred in the course of the problem solving or retrospectively (Saner & Schunn, 1999). Nor are the studies of contemporary scientists conclusive. Qin and Simon (1990) told participants to generate a mental image prior to performing the task, so that their use of mental experimentation may not have been spontaneous. The scientists observed by Schraagen (1993) and by Clement (2002a) were not experts in the specific task domain and therefore lacked precise domain knowledge. The use of “what if” reasoning in these studies was clearly spontaneous; however, perhaps the scientists were using it to compensate for their lack of domain knowledge (i.e., in this case, conceptual simulation was more of a lay strategy than an expert one).

In sum, no experimental studies have been conducted with the express purpose of investigating the use of “what if” reasoning among expert, practicing scientists working in their own domain; as a result, no clear picture has emerged as to when, how, and why scientists might use this strategy. Our goal is first, to gather evidence that expert scientists do, in fact, engage spontaneously in “what if” reasoning; and second, to investigate how they do so and how significant a role this strategy plays in their acts of scientific enquiry.

Researchers use many different terms to describe the strategy we have loosely discussed as “what if” reasoning—mental experiment, thought experiment, inceptions, mental simulation, and so on. In all cases, however, the underlying strategy demonstrates the characteristics described by Brown (2002), discussed above. In our study, we refer to these separate processes—visualizing a situation, carrying out mental operations on it, and seeing what happens—collectively as conceptual simulation. We believe this term captures the two most crucial aspects of this type of reasoning; namely, it occurs at the conceptual level (rather than, say, in any actual or external sense), and it involves mentally playing out, or “running,” a model of the visualized situation, so that changes can be inspected.

More specifically, conceptual simulation involves constructing and manipulating a mental model that not only derives from an external representation but is also an analog of it (Clement, 2002b; Nersessian, 1999; Schwartz & Black, 1996a). Functionally, conceptual simulations adapt the external representation by adopting hypothetical values and playing out their implications, to move beyond the information actually represented. This process allows new inferences about that information to be made.

Our first challenge has been to develop a reliable means of identifying conceptual simulations, which are an internal cognitive process rather than a directly observable behavior. Our general method has been to collect verbal protocols of scientists solving problems in their own domain. This method is based on the assumption that contents of working memory are “dumped” into the speech stream, where they can be examined and coded (Ericsson & Simon, 1993). In order to increase the reliability of this detection and coding process, we have operationalized the notion of conceptual simulation such that the construct is empirically grounded and observable in the speech stream: In a continuous sequence of utterances, the scientist (a) refers to a new representation of a system or mechanism; (b) refers to transforming that representation spatially, in a hypothetical manner; and (c) refers to a result of the transformation. This three-stage process corresponds to the processes described by Brown (2002) in defining “what if” thinking.

Our first study is exploratory and examines the question of whether and to what extent practicing scientists spontaneously use conceptual simulations. We further investigate the extent to which this strategy is used to resolve specific instances of informational uncertainty, in a cycle of hypothesis statement and evaluation. To determine the significance of the relation between “what if” reasoning and hypothesis evaluation, we investigate the frequency of use for other hypothesis-evaluation strategies that have been identified in the scientific reasoning literature. If “what if” reasoning plays a significant role in the hypothesis evaluation process, it should occur at least as frequently as these known strategies. Furthermore, there may be relations between “what if” reasoning and these other strategies that can illuminate the overall process of resolving informational uncertainty. To foreshadow the outcome of Study 1, our results suggest that scientists do spontaneously use conceptual simulations, and they seem to do so as a means of resolving informational uncertainty. Study 2 is a laboratory experiment—also of expert scientists—in which we manipulate uncertainty to further test this hypothesis.

2. Study 1

2.1. Method

Dunbar (1995, 1997) demonstrated the value of naturalistic observation of scientists in uncovering previously underspecified strategies and dynamics in the science laboratory. Therefore, we have adapted Dunbar's (1995, 1997) “in vivo” methodology for online observation of scientific thinking in which participants perform their regular tasks and the experimenter observes and records their interactions. We have focused our investigation on one specific scientific task—data analysis—because it is a crucial task for many scientific domains, one during which scientists attempt to account for their data and in which they are likely to experience a great deal of informational uncertainty. Data analysis is therefore likely to produce a rich record of scientific thinking and hypothesis-generation about informational uncertainty.

2.1.1. Participants

Participants were recruited through personal connections of the experimenter or her associates. The sample of scientists was selected to represent a diverse array of fields rather than just one particular subfield and several different stages of data analysis, in order to make the results more generalizable. Observations were recorded from nine scientists in eight data analysis sessions. All the participants were either expert scientists who had earned their PhDs more than 6 years previously or graduate students working alongside one of these experts. Four of the sessions involved an expert scientist working alone. Three of the group sessions involved a senior researcher and one or more graduate students; the remaining group session involved two expert scientists. (Some scientists were thus observed over more than one session.) Four sessions were in branches of physics (astronomy and computational fluid dynamics, or CFD), two were in neuroscience (fMRI and neural spikes), and two were in cognitive psychology. Of the three datasets pertaining to CFD, one focused on a problem involving a submarine, and two focused on laser pellet research.

2.1.2. Procedure

Participants agreed to contact a member of the research team when they were ready to conduct some analysis of recently acquired data, and an experimenter visited the scientists at their regular work location. All participants agreed to be videotaped during the session. Participants working alone were trained to give talk-aloud verbal protocols. For scientists working in groups, their conversation was recorded as they engaged in scientific discussion about their data. All participants were instructed to carry out their work as though no camera were present and without explanation to the experimenter (Ericsson & Simon, 1993).

Details about each individual session are reported in Table 1. All utterances were transcribed and segmented according to complete thought (off-task utterances were excluded from analysis). Finally, a coding scheme (described below) was developed to explore the relationship between conceptual simulation, uncertainty, and hypothesis evaluation.

Table 1. Dataset characteristics

| Dataset | On-Task Utterances | % Total Utterances | No. of Scientists | Total Relevant Time |
|---|---|---|---|---|
| Astronomy | 656 | 76 | 2 | 49 min |
| CFD submarine | 437 | 42 | 1 | 39 min |
| CFD Laser 1 | 172 | 43 | 1 | 15 min |
| CFD Laser 2 | 184 | 74 | 1 | 13 min |
| fMRI | 215 | 72 | 2 | 55 min |
| Neural spikes | 217 | 64 | 2 | 54 min |
| Psychology 1 | 481 | 89 | 3 | 31 min |
| Psychology 2 | 916 | 64 | 2 | 75 min |

Note. CFD = computational fluid dynamics.

2.1.3. Analysis tools and tasks

The psychology data were displayed numerically in Excel; all the other data were displayed using visualization tools specific to the domain. Fig. 1 shows an example of the visualization software used by one of the physicists.

Figure 1. Screen snapshot of computational fluid dynamics data.

Although each scientist or group of scientists used different tools, their tasks shared several characteristics. They were all analyzing data that they themselves had collected from observations, from a controlled experiment, or from running a computational model. They displayed the data using their regular tools. Apart from the second CFD laser session, which was a follow-up to the first session, all sessions represented the initial investigation of these data. Whether their interest was exploratory or confirmatory, their goal was to understand the fundamental processes that underlay the data. Table 2 summarizes the characteristics of each data analysis session.

Table 2. Characteristics of individual data analysis sessions

| Domain | Research Stage | Data Type | Data | Data Source | Task Description |
|---|---|---|---|---|---|
| Astronomy | Exploratory | Visual | Velocity contour lines laid over optical data | Telescope observations | Understand flow of gas in galaxy |
| CFD submarine | Confirmatory | Visual | Two-dimensional line plots | Computational model | Understand model in relation to empirical data collected by a different researcher |
| CFD Laser 1 | Confirmatory | Visual | Contour plots or Fourier decomposition | Computational model | Understand growth rate and sequence of different modes |
| CFD Laser 2 | Confirmatory | Visual | Contour plots or Fourier decomposition | Computational model | Follow-up Laser 1 |
| fMRI | Confirmatory | Visual | Structural or functional brain images | Controlled experiment | Identify areas of neural activity; evaluate experiment predictions |
| Neural spikes | Exploratory | Visual | Neural spikes | Surgical observations | Isolate single cell firings to distinguish real from spurious neurons |
| Psychology 1 | Exploratory | Numeric | Numerical in spreadsheet | Controlled experiment | Seek evidence for strategies among subjects |
| Psychology 2 | Exploratory | Numeric | Numerical in spreadsheet | Controlled experiment | Understand relation between subject and model data |

Note. CFD = computational fluid dynamics.

2.1.4. Coding scheme

The overall goal of this research was to investigate whether and when scientists use conceptual simulation, whether they use it to resolve informational uncertainty, and to what extent they do so, relative to other strategies. We predicted that scientists would use conceptual simulation to evaluate hypotheses they proposed to account for informational uncertainty. Therefore, we developed a coding scheme that would allow us to identify conceptual simulations, hypotheses, and several strategies that have been shown to be associated with hypothesis evaluation.

Conceptual simulation

A conceptual simulation spans several utterances. It begins with a reference to a representation of a system or part of a system. Mental operations are then carried out on this representation to simulate the system's hypothetical behavior under certain circumstances. The initial representation may be grounded internally (e.g., in domain knowledge or memory of a previously observed phenomenon) or externally (e.g., in a displayed image). However, simply forming and transforming a mental image is not sufficient. The key feature of a conceptual simulation is that it involves a simulation “run” that is both hypothetical (i.e., it does not merely reproduce observed behavior) and alters the starting representation, producing a different end state that can be inspected to “see what happens” (cf. Brown, 2002).

To formally code conceptual simulations, we adapted Trafton's spatial transformation framework (Trafton, Trickett, & Mintz, 2005). Spatial transformations occur when a spatial object is transformed from one mental state or location into another mental state or location. They occur in a mental representation that is an analog of physical space. They can be performed purely mentally (e.g., purely within spatial working memory or a mental image) or “on top of” an existing visualization (e.g., a computer-generated image). (See Trafton et al., 2006 for more on spatial transformations.) This initial representation provides the starting point for a conceptual simulation. Therefore, we first identified references to a new representation. We then performed a spatial transformation analysis on the utterances that immediately followed to determine whether any mental operations were applied to transform that representation. Some possibilities include rotation, modification (by addition or deletion), moving an image, animating features, and comparison. Finally, we identified the reference to the result of the transformation(s). Conceptual simulations may thus be defined formally as a specific sequence:

  1. Refers to a new representation of a system or mechanism.
  2. Refers to transforming that representation spatially in a hypothetical manner.
  3. Refers to a result of the transformation (seeing what happens).
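
The three-part sequence can be sketched as a scan over coded utterances. This is only an illustration under stated assumptions, not the coding procedure used in the study, which was carried out by human coders on transcripts. The label names and the function `count_conceptual_simulations` are hypothetical, and the sketch assumes each utterance has already been assigned exactly one label, with the transformation label reserved for hypothetical spatial transformations.

```python
from typing import List

# Hypothetical per-utterance labels (not taken from the article).
REP = "representation"        # refers to a new representation
TRANSFORM = "transformation"  # refers to a hypothetical spatial transformation
RESULT = "result"             # refers to the result of the transformation
OTHER = "other"               # anything else

def count_conceptual_simulations(labels: List[str]) -> int:
    """Count runs of REP, then one or more TRANSFORM, then RESULT,
    occurring in consecutive utterances. Each run counts once."""
    count = 0
    state = "start"
    for label in labels:
        if label == REP:
            # A new representation always starts a fresh candidate.
            state = "has_rep"
        elif label == TRANSFORM and state in ("has_rep", "has_transform"):
            state = "has_transform"
        elif label == RESULT and state == "has_transform":
            count += 1
            state = "start"
        else:
            # Any break in the sequence discards the candidate.
            state = "start"
    return count

example = [REP, TRANSFORM, RESULT, OTHER, REP, RESULT,
           REP, TRANSFORM, TRANSFORM, RESULT, TRANSFORM, RESULT]
print(count_conceptual_simulations(example))
```

In the example, a representation followed directly by a result, and a transformation with no preceding representation, are both rejected, which mirrors the requirement that all three references appear in a continuous sequence.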

Table 3 illustrates examples of conceptual simulation. Note that although a conceptual simulation spans several utterances, collectively these are coded as only one conceptual simulation. (See Table 4 for additional examples of conceptual simulation, at http://www.cognitivesciencesociety.org/supplements/)

Table 3. Examples of CS

| Source | Utterance | Code | Explanation |
|---|---|---|---|
| Astronomy | In a perfect sort of spider diagram | CS | Reference to new representation (“spider diagram”) |
| | If you looked at the velocity contours without any sort of streaming motions, no, what I'm trying to say is, um, in the absence of streaming motions | CS continued | Reference to transforming representation (mentally removing existing streaming motions) |
| | You'd probably expect these lines here [gestures] to go all the way across, you know, the ring | CS continued | Reference to result (sees what happens) |
| CFD submarine | It is conceivably possible that this curve is floating around all over the place, and what they're showing is an average [scientist is looking at a graphical representation (a curve) that represents the turbulence] | CS | Reference to new representation (“this curve”) |
| | So if this thing is really floating around that much, just up and down, and I'm at the extreme end, and if I average all of this stuff, then I may actually still get the curve right | CS continued | Reference to transforming representation |
| | | CS continued | Reference to result (sees what happens) |

Note. CS = conceptual simulation; CFD = computational fluid dynamics.

Hypotheses

Every utterance was examined, and all statements that attempted to explain or account for a phenomenon were coded as hypotheses; for example, “OK, so now he's not showing activation for the motor preparation, so maybe that's just a function of it being the first thing he did” (source: fMRI; hypothesis in bold type).

Scientific reasoning strategies

We selected several strategies from the scientific reasoning literature: data focus, empirical test, consult a colleague, tie-in with theory or domain knowledge, and analogy. Data-focused strategies are highly relevant to scientific inquiry in general and to data analysis in particular. Testing a hypothesis by empirical means is part of the scientific method (Popper, 1956) and has been much studied in the scientific reasoning literature (e.g., Klahr & Dunbar, 1988; Klahr, Dunbar, & Fay, 1990; Schunn & Anderson, 1999; Vollmeyer, Burns, & Holyoak, 1996). Collaboration (consulting a colleague) has been shown to be instrumental in solving scientific problems in both instructional and professional settings (Azmitia & Crowley, 2001; Okada & Simon, 1997). Domain knowledge is also an important factor in expert performance among scientists (Chinn & Malhotra, 2001; Schunn & Anderson, 1999) as is a deep understanding of the tools, instruments, and techniques used in a given domain (Schraagen, 1993; Schunn & Anderson, 1999). Finally, research has identified analogy as a powerful reasoning mechanism for science (Dunbar, 1997; Forbus & Gentner, 1997; Nersessian, 1992a; Thagard, 1992).

Analogical reasoning involves mapping information from one domain or instance—the “source”—to another—the “target”—in order to make inferences about the target (Gentner, 1989). Different theories of analogy specify different processes by which the mapping between source and target occurs—for example, structural alignment (Gentner, 1983; Holyoak, 1985), constraint satisfaction (Holyoak & Thagard, 1989), and similarity detection (e.g., Gentner & Markman, 1997). During the mapping or alignment phase, regardless of the specific mechanism by which it occurs, the relevant parts of the source are “applied” to the target, and inferences about the target are drawn. Alignment thus involves an explicit or implicit comparison between two representations and the detection of similarities between them.

Gentner and Markman (1997) proposed that analogy and similarity are related through the process of structural alignment. The difference lies in the relative importance of relational similarity (in analogy) and attribute similarity (in similarity judgments). Whereas analogical comparisons focus primarily on structural or relational similarity, similarity judgments focus more on commonalities between attributes or surface features. (Note that “mere-appearance matches” have no relations in common, and are therefore not discussed further here.) Because of the visual–spatial nature of much of the data in these scientific domains, we expect the scientists to make a significant number of similarity judgments in addition to more structurally focused analogical comparisons. According to Gentner and Markman, structural alignment guides the comparison process in both cases, analogy and similarity. Also, in both cases, the comparison process focuses on alignable differences, which allow a person to identify relevant differences between the two entities being compared. We use the term analogy to refer to comparisons based primarily on structural or relational similarity, and the term alignment or “alignment by similarity detection” to capture the process of comparison based primarily on attribute similarity in which one representation is matched up to another to detect relevant areas of similarity and difference.

To code all these strategies, we identified all utterances that immediately followed a hypothesis that further elaborated the hypothesis, whether they supported or opposed it. Those utterances were coded as follows:

Data focus—Following Trafton et al. (2005), we coded statements that “read off” data from the visible display as data focus. Utterances that referred to looking at data in a different way (such as replotting the data or displaying it in a different visualization), to “tweaking” data (e.g., by transformation or removing outliers, etc.), or to looking at data that were not currently on view but that were available were also coded as data focus strategies. See Table 5 for examples of data focus strategies.

Table 5. Examples of data focus strategies

| Source | Utterance | Strategy |
|---|---|---|
| fMRI | We can find out what the z-score of that one is, too. Let's see, it's 4.22, 4.23 | Read off data |
| Astronomy | Actually, I know that the, this is a naturally weighted method. If we look at the robust, let's look at the robust weighted method | Change visualization |
| Psychology 1 | So I mean this is a post-hoc hypothesis, that we could verify by looking at the patterns | Examine additional available data |
| Psychology 2 | We have an outlier there. We can get rid of that guy probably …. That's more than three times the mean standard deviation | Tweak data (remove outlier) |

Empirical test—Utterances in which the scientist proposed to collect additional data were coded as empirical test strategies. These included experiment proposals, making plans to run a new experiment, planning to collect additional data for an existing experiment (e.g., increasing the sample size), or planning to collect more observational data. Plans to build and run computational models were also coded as empirical test strategies. Table 6 illustrates the coding of empirical test strategies.

Table 6. Examples of empirical test strategies

| Source | Utterance | Strategy |
|---|---|---|
| Astronomy | Do you think it's worth getting some more [telescope] time, just to do an offset plane, or offset velocity? | Collect more (observational) data |
| fMRI | But we also have to be cognizant of the limitations of the equipment we're working with. And we are, like I said, when we collect data again, for instance, we are going to get the whole brain. | Collect more (experimental) data |
| CFD (submarine) | That means I have to tweak an input parameter on the flow code. And then re-run it [the model]. | Run computational model |

Note. CFD = computational fluid dynamics.

Consult a colleague—Utterances that refer to showing the data to or asking the opinion of a coworker or other expert were coded as consulting a colleague: for example, “I'm gonna have to discuss it with, ah, John when he gets back. And with Bob” (source: CFD—submarine). (Names have been changed to safeguard participants' anonymity.)

Tie-in with theory and domain knowledge—Utterances that referred to theoretical underpinnings of the data were identified and coded as tie-in with theory: for example, “But just in general, if you have, I mean in your, your theoretical ring galaxy of the computer …” (source: Astronomy). In addition, utterances that drew on domain-specific skills, such as an understanding of tools and techniques, were also included in this category: for example, “Ah, I'm beginning to wonder if we didn't have enough velocity range all of a sudden” (source: Astronomy).

Analogical reasoning—Analogical reasoning was coded using the definition and coding scheme developed by Dunbar (1997). According to this scheme, an analogy exists when a scientist either refers to another base of knowledge to explain a concept or uses another base of knowledge to modify a concept. Analogies were coded at a general level, when both source and target were explicitly identified (e.g., “The atom is like the solar system”) and at the level of the alignment by similarity detection. Table 7 illustrates this coding of analogy.

Table 7. Examples of analogy and alignment

Astronomy | Think of this [points to part of ring galaxy] as a spiral arm | Explicit analogy between “spiral arm” (source) and “this” (ring galaxy); scientist is using the concept of a spiral arm to make inferences about the behavior of a system that is not a spiral arm
CFD (Laser 2) | So [0–2] is going to be way below the black line … but he's gonna grow at roughly the same rate [as 2–0] which is what you would expect | Alignment: scientist aligns growth rates of one mode (0–2) with another (2–0), and with theoretical expectations
CFD (Laser 2) | The high modes are supposed to take off. They're supposed to run faster, which means that if that guy took off first, then he should be like, dominating the whole action. Now the only possible way that that can't happen is if this guy has some source somewhere, that he's, like, being fed. And he is being fed … by the difference of these two guys | Alignment: Scientist aligns his expectation that mode must be being “fed” with the data representation, which indicates that the mode is, in fact, being fed
CFD (submarine) | You know what, this is an experiment that sets in a, in a tube, and they've got struts holding that sucker up onto the floor. I wonder if I'm seeing the wake of the struts, which, of course, we don't have on our computational model—so that's why we don't see a dip. But we're still off by a good few percent, way off there … | Alignment: Scientist aligns the experimental data with his image of the model data, after accounting for the presence of the struts; the alignment shows there are still significant differences between the model and experimental data
Astronomy | It's, I mean, it seems to make sense, if that's operating, if it's all the same velocity, it's probably more or less a rigid body, so that the whole thing is—I mean, so does that make sense? No, it doesn't really, nah, it's not necessarily a right body … | Alignment: Scientist aligns the output of his chain of reasoning that suggests a rigid body with the actual data, which does not show a rigid body

Note. Relevant phrases that pinpoint the actual analogy or alignment are in italics; utterances in Roman type are for context only and were not coded as analogy/alignment. CFD = computational fluid dynamics.

2.2. Results and discussion

Eight datasets were analyzed, comprising 331 min of relevant protocol, broken into 3,278 on-task utterances.

2.2.1. Interrater reliability

We used two approaches to establish interrater reliability. First, after one coder had coded all the data for conceptual simulations, a second independent coder coded 10% of the entire dataset pinpointing any conceptual simulations. (The first coder selected the data to be coded from 2 domains because they contained both clear examples of conceptual simulations and sequences whose status as conceptual simulations might be challenging to determine.) To illustrate this approach in the CFD domain, consider the set of utterances in Table 8. The first coder identified that lines 6 through 12 contained a conceptual simulation in which the speaker was trying to reconstruct how one of the modes could have grown at a slower rate than another. The first coder ended the conceptual simulation at line 12, noting that in lines 13 and 14, the scientist aligned the end result of the mode being fed with the displayed representation of the final growth of the other modes involved. The remainder of this section was not coded as conceptual simulation because the scientist is recalling theoretical information about the way the modes interact. The second coder then reviewed this entire section, embedded within a much larger context of several previous and subsequent utterances, to determine whether a conceptual simulation occurred; and, if so, which utterances it spanned. We initially took this coarse-grained approach to establish that conceptual simulations could be reliably isolated in the speech stream. This approach resulted in 98% agreement, κ = .91, p < .01.

Table 8. Illustration of initial approach to coding CSs

1 | Was outrun by the next one down
2 | And I don't know
3 | I just don't know
4 | I'll haveta get someone else's interpretation of that
5 | I don't understand that
6 | The high modes take off | CS: New mental representation of beginning state (display shows end state)
7 | They're supposed to run faster | CS: Describes new representation
8 | Which means if that guy [mode 1] took off first | CS: Mentally follows growth path of mode 1
9 | Then he should be like dominating the whole action | CS: Mentally places mode 1 in relation to mode 2
10 | Now the only possible way that that can't happen | CS: Mentally undoes growth path of mode 1
11 | Is if this guy [mode 2] has some source somewhere | CS: Mentally adds source to representation of mode 2
12 | That he's like, being fed | CS: Mentally adds source to representation of mode 2
13 | And he is being fed | Alignment
14 | The only way he gets fed is by the difference of these two guys [additional modes] | Alignment
15 | OK, the, the physics of this is
16 | The physics of this is any two modes that can add up
17 | Because of their non-linear action
18 | Feed the next one
19 | So the mode interacts with itself
20 | One-one, to produce a two
21 | But one and two can interact and produce a three
22 | But three, ah, three minus two can also produce one
23 | So they sort of interact among themselves

Note. Source: Laser 2. CS = conceptual simulation.

Second, we performed a finer grained analysis, coding for each utterance whether it was part of a conceptual simulation. The same first coder's ratings were used. We then selected 33% of the entire dataset for coding by yet a third independent coder. We divided each session's data into three equal parts based on the number of utterances and selected the first, second, or third section at random from each dataset. As a result, the third coder coded one third of each dataset; collectively, the sections represented early, middle, and late analysis on the part of the scientist. The first coder's ratings were not available to the third coder at any time during the process. To summarize the difference between the two rounds of coding, in the first, coarser grained round, the second coder identified given sequences of utterances as comprising a conceptual simulation or not. In the second round, the third coder identified line by line whether each utterance was part of a conceptual simulation.

The third coder was trained to recognize conceptual simulations by using examples that were not part of the to-be-coded data (see Appendix A for more information about the training). The coder examined each utterance and judged whether the speaker referred to a new representation; whether, immediately afterward, the speaker referred to one or more mental operations that transformed that representation (spatial transformations); and whether the speaker referred to the result of those transformations. If the coder observed this sequence, the individual utterances were scored as part of a conceptual simulation. Utterances that did not contribute to this sequence were scored as “no conceptual simulation.” The third coder worked entirely independently of the first coder. The coders conferred once after the third coder had coded one dataset to resolve any questions or difficulties on the part of the third coder. After this initial conference, the two coders did not compare their judgments until the coding was complete. Agreement for this phase of the interrater reliability coding was 98%, κ = .75, p < .05.¹ The level of agreement between the coders was thus good. All disagreements were resolved by discussion.
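The agreement statistics reported in this section (percentage agreement and kappa) can be illustrated with a minimal sketch. It assumes that each coder's judgments are stored as one binary label per utterance (1 = part of a conceptual simulation, 0 = not); the function name `cohens_kappa` and the two label lists are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(coder_a)
    # Proportion of utterances on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal label rates.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both coders used a single identical label
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical utterance-by-utterance codes (illustration only).
first_coder = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
third_coder = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]
agreement, kappa = cohens_kappa(first_coder, third_coder)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Because kappa corrects for chance agreement, it can be noticeably lower than raw percentage agreement when one category (here, "no conceptual simulation") dominates the data.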

2.2.2. Conceptual simulations

There were 37 conceptual simulations throughout the protocols, an average of one conceptual simulation approximately every 9 min. Considering the large amount of time spent on other activities (such as choosing and setting up different visualizations, reading off data from the visualizations, etc.), conceptual simulations occurred with sufficient frequency to be considered a real strategy used by the scientists. The frequency with which it was used compared with other strategies is discussed below.

There were 71 hypothesis statements, an average of 1 hypothesis approximately every 4.5 min. Fifty-five (77%) of these hypotheses were elaborated (i.e., the scientist further considered the hypothesis). Only elaborated hypotheses were included in subsequent analyses.

Thirty-two (86%) of the conceptual simulations occurred in reference to a hypothesis. Thus, the vast majority of conceptual simulations were coupled with the scientists' efforts to construct a satisfactory explanation of their data. We focus our analyses on how these conceptual simulations were used. (When conceptual simulation did not immediately follow a hypothesis, it was used as a problem-solving strategy, such as to resolve a difficulty in mapping between the display color and changes in velocity, to determine the circumstances under which a phenomenon might diverge from a theoretical model, or to account for a discrepancy.)

We then investigated the relative frequency of conceptual simulation compared with other strategies. Each individual utterance of data focus and tie-in with theory/domain knowledge was counted as one instance. For example, the utterance, “If I look at the average of that, it's a nice clean spike,” and the utterance that immediately followed it, “and I can look at the standard deviation around that and it's pretty tight right in the middle where it needs to be,” were coded as two instances of data focus (average, standard deviation) because the information extracted was different in each utterance. In all other cases, the number of overall strategy uses was counted. For example, the sequence of utterances in a conceptual simulation was coded as one conceptual simulation.

First, raw frequencies for each strategy were counted, as shown in Table 9. Clearly, the most common strategy was data focus (i.e., strategies that centered on the available data as opposed to those whose focus was beyond the current data). This result is not surprising, given that the scientists' task was data analysis. However, among the strategies that focused beyond the immediate data, tie-in with theory/domain knowledge, conceptual simulation, and analogical reasoning/alignment occurred most frequently. We expected that expert scientists would draw on their extensive domain knowledge in understanding and analyzing data, as discussed earlier. Similarly, the use of analogical reasoning as a strategy in scientific enquiry is well documented. However, the relatively large number of conceptual simulations is striking and provides evidence that conceptual simulation is an authentic reasoning strategy used by experts performing naturalistic tasks in their own domain.

Table 9. Frequencies of occurrence of hypothesis-evaluation strategies—Total number of uses (raw frequency) and percentage of all hypotheses for which strategy was used (relative frequency)

Strategy | Raw Frequency | Relative Frequency (% Hypotheses)
Data focus | 229 | 65
Tie in with theory | 51 | 35
Conceptual simulation | 32 | 46
Empirical test | 3 | 5
“Far” analogy | 2 | 4
Consult colleague | 1 | 2

Note. Because more than one strategy might be used with a given hypothesis, these percentages sum to more than 100.

Interestingly, proposing to collect more data and consulting colleagues occurred only rarely. Possibly, in the first case, the real-life expense (in time and money) of collecting more data made this a less attractive option than in laboratory studies of scientific reasoning, in which empirical test is frequently only a mouse-click away. Because several of the data analysis sessions involved more than one scientist, these scientists may have been less inclined to consult others, given that they were already working collaboratively (the single instance of this strategy occurred in an individual subject case).

In addition to raw frequencies, the relative frequency of each type of strategy was calculated, also shown in Table 9. For this analysis, we identified whether a strategy was used in reference to each hypothesis. Table 9 shows the percentage of hypotheses for which a given strategy was used at least once (i.e., repeated uses were not counted). As expected, the results of this analysis again show the prevalence of strategies that focus on the data. However, in terms of strategies that focus beyond the data, conceptual simulation was used as frequently as or more frequently than any other strategy. This again suggests that conceptual simulation plays a significant role in scientists' consideration and evaluation of hypotheses.
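The distinction between the two measures in Table 9 can be sketched as follows. This is a minimal illustration, assuming each hypothesis is represented as a list of the strategy codes applied to it; the function name `strategy_frequencies` and the four example hypotheses are hypothetical, not the study's data.

```python
from collections import Counter


def strategy_frequencies(hypotheses):
    """Return (raw counts, % of hypotheses with at least one use) per strategy."""
    raw = Counter()
    used_at_least_once = Counter()
    for uses in hypotheses:
        raw.update(uses)                      # every use counts
        used_at_least_once.update(set(uses))  # repeated uses count once
    n = len(hypotheses)
    relative = {s: 100.0 * used_at_least_once[s] / n for s in raw}
    return dict(raw), relative


# Hypothetical coded hypotheses (illustration only).
coded = [
    ["data focus", "data focus", "conceptual simulation"],
    ["data focus", "theory"],
    ["conceptual simulation", "data focus", "data focus", "data focus"],
    ["theory"],
]
raw, relative = strategy_frequencies(coded)
for strategy in raw:
    print(f"{strategy}: raw = {raw[strategy]}, relative = {relative[strategy]:.0f}%")
```

Since a single hypothesis can contribute to several strategies, the relative frequencies in such a tally need not sum to 100.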

The use of analogical reasoning is also of interest. There were only two instances of general analogy, compared with 32 alignments. This result is consistent with findings of other studies in which analogy use has been found to be more “local” than “global” (Dunbar, 1997; Saner & Schunn, 1999). The use of alignment by similarity detection in relation to conceptual simulation is discussed in more detail below.

We proposed that conceptual simulation would help scientists to resolve informational uncertainty by allowing them to evaluate their hypotheses. We suggested that upon encountering informational uncertainty, scientists would develop a possible explanation to account for it. By then running a conceptual simulation, they would be able to play out the necessary details of that explanation, creating a new representation in order to “see what happens.” The resulting representation could then function as a point of comparison with the actual data representation. Insofar as the two representations match, the hypothesis would be at least supported and, therefore, still offer a plausible explanation. If the relevant details do not match, the hypothesis would have to be rejected.

Trafton et al. (2005) have shown that scientists frequently use alignment to connect internal and external representations; consequently, we hypothesized that alignment by similarity detection would be used by these scientists to link the internal (result of the conceptual simulation) and external (phenomenon in the data) representations. Alignment would potentially allow a direct comparison between the two representations, and thus could facilitate the evaluation of the hypothesis. If this were the case, conceptual simulation would most frequently be followed by alignment (in conjunction with a hypothesis); and, to the extent that the issue is successfully resolved, alignment by similarity detection would mark the end of the reasoning chain.

The next analysis investigates this possibility by focusing on combinations of strategies. We calculated the frequencies of the transitions from one strategy to the next for all major strategies (Ericsson & Simon, 1993). In order to understand the more relevant connections between strategies, we limit our discussion to those sequences that occurred 15% or more of the time. These frequencies are represented in the transition diagram shown in Fig. 2.

Figure 2. Transition diagram showing the relations among strategies. Percentages show the frequency with which one strategy followed another.
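The transition analysis described above can be sketched in a few lines. This is a minimal illustration, assuming each reasoning chain is an ordered list of strategy codes and that the end of a chain is treated as its own state; the function name `transition_percentages`, the `END` marker, and the three example chains are hypothetical.

```python
from collections import Counter, defaultdict


def transition_percentages(chains, threshold=15.0):
    """For each strategy, return the % of times each successor followed it,
    keeping only transitions at or above the threshold."""
    counts = defaultdict(Counter)
    for chain in chains:
        steps = list(chain) + ["END"]  # mark where the reasoning chain stops
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1
    result = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        result[current] = {
            nxt: 100.0 * c / total
            for nxt, c in followers.items()
            if 100.0 * c / total >= threshold
        }
    return result


# Hypothetical coded chains (illustration only).
chains = [
    ["hypothesis", "data focus", "data focus", "conceptual simulation", "alignment"],
    ["hypothesis", "theory", "conceptual simulation", "alignment", "data focus"],
    ["hypothesis", "conceptual simulation", "alignment"],
]
transitions = transition_percentages(chains)
for strategy, followers in transitions.items():
    print(strategy, {k: round(v, 1) for k, v in followers.items()})
```

The `threshold` parameter mirrors the 15% cutoff used to decide which transitions are discussed; setting it to 0 would return every observed transition.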

The transitions of primary interest are the frequency with which conceptual simulation is followed by alignment and the frequency with which alignment occurs at the end of the reasoning process. Here a very strong pattern is revealed. Conceptual simulations were almost always (91% of the time) immediately followed by alignment, and this sequence occurred more frequently than expected by chance, χ²(4) = 99.88, p < .001; Bonferroni adjusted chi-squares significant at p < .05. Alignments themselves were most likely to end the chain, a sequence that was more frequent than expected by chance, χ²(4) = 15.81, p = .003. Post-hoc comparisons showed that alignment at the end of the chain occurred significantly more frequently than alignment followed by theory, alignment, or conceptual simulation (the latter comparison was marginally significant); Bonferroni adjusted chi-squares significant at p < .05. The difference between the frequencies of alignment followed by data focus and alignment at the end of the chain was not significant. These results suggest that the process of alignment either resolved the hypothesis under evaluation and thus terminated the chain of reasoning or failed to resolve the hypothesis, leading the scientist to seek more information from the display.

Several patterns emerge from the transition diagram in Fig. 2. A hypothesis was most likely to be followed by data focus, but was also followed fairly frequently by theory or directly by conceptual simulation. Data focus was almost always followed by more data focus, indicating numerous sequences in which the scientist focused explicitly on the data themselves. Theory was also most frequently followed by itself, suggesting that the scientist engaged in in-depth consideration of theoretical constructs. Theory was also a gateway to extracting information and to conceptual simulation. None of these sequences was unexpected, given the nature of the scientists' tasks. The frequency of the conceptual simulation → alignment sequence, however, is striking and suggests a tight coupling between the two strategies. It is in this combination of processes that the hypothesis evaluation took place.

Figure 3 illustrates this process of conceptual simulation and alignment-based similarity detection. In this example from the astronomy dataset, the scientists were considering the cause of some deviations from the expected pattern of velocity contours. One of them proposed a “streaming motion hypothesis”; he proposed that the existence of streaming motions might be the cause of the distortion. He then constructed a mental representation of the theoretical appearance of the velocity contours (“a perfect spider diagram”). He mentally deleted any streaming motions from this representation (“if you looked at the velocity contours without any sort of streaming motions”) and identified how the lines would, then, hypothetically appear (“you'd probably expect [them] to go all the way across the ring.”)—that is, he was able to “see what happened.” Finally, he made a comparison between this new mental representation and the image on screen noting that under these hypothetical circumstances, there would be no deviant segments of the contours (“without any sort of changes here in the slope”). Use of the word here and gestures to the screen to identify the actual deflected contour lines indicate the target of the comparison. In summary, the scientist suggested that the cause might be streaming motions; ran a conceptual simulation of the contours without streaming motions noting that under these circumstances there would be no deviations in the contours; pointed out that, in contrast, there were kinks in the contour lines; and concluded that, consequently, the streaming motion hypothesis was supported by the appearance of the data.

Figure 3. Conceptual simulation used as a source of comparison in the alignment process. An anomaly in the external display functions as the target of the comparison, and the scientist uses conceptual simulation to generate the source of the comparison.

2.2.3. Relation among hypotheses and conceptual simulations

Why were only some hypotheses associated with conceptual simulation? Although almost all conceptual simulations followed a hypothesis, not all hypotheses were followed by a conceptual simulation. In this section, we attempt to tease apart why this might have been so.

In general, a hypothesis represents a scientist's best guess about an uncertain situation; however, there may be greater or lesser degrees of informational uncertainty associated with different hypotheses. If conceptual simulation is a strategy for resolving informational uncertainty, it should occur more frequently after hypotheses that relate to greater uncertainty. One way to measure the uncertainty associated with a hypothesis is to consider the scientist's knowledge about the phenomenon to which the hypothesis pertains. If there is something in the data that violates the scientist's expectations (such as a major discrepancy between model and data), hypotheses pertaining to this phenomenon are likely to be associated with significant levels of uncertainty. If, however, the phenomenon itself is expected (e.g., in one psychology dataset the fact that respondents in the more difficult condition took longer than respondents in the control condition), hypotheses pertaining to it are likely to be associated with less uncertainty.

In order to investigate the relationship between the hypotheses and the data, the phenomenon behind each hypothesis was identified either as expected or as violating expectation. Three independent coders coded 15% of the data. Agreement between the coders was 87.5%, κ = .75, p < .01; disagreements were resolved by discussion.

After the hypotheses were coded as referring to phenomena that either violated expectations or not, the use of conceptual simulation and data focus strategies to evaluate each type of hypothesis was counted. Our purpose was to determine the circumstances under which each strategy was used; consequently, only the first instance of each strategy use was counted. Table 10 shows the results of this analysis. As expected, there was no significant correlation between data focus and violate expectation (r = .18, p > .1), suggesting that data focus was a general strategy that cut across the different types of hypothesis under exploration. However, the correlation between conceptual simulation and violate expectation was significant (r = .41, p < .01). Thus, conceptual simulation appears to be a strategy that is closely associated with the investigation of hypotheses that pertain to violations of the scientists' expectations (i.e., to circumstances under which there are greater levels of informational uncertainty).

Table 10. Percentages of violate expectation and no discrepancy hypotheses for which conceptual simulation and data focus were used

Variable | Violate Expectation | No Discrepancy
Conceptual simulation | 64% | 21%
Data focus | 61% | 79%
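A correlation between two yes/no codes, such as whether a hypothesis concerned an expectation violation and whether conceptual simulation was used, can be computed from a 2 × 2 table of counts. The sketch below is illustrative only: it assumes both variables are coded as binary (in which case Pearson's r equals the phi coefficient), and the function name `phi_coefficient` and the cell counts are hypothetical, since the underlying counts are not given in this section.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table:
           strategy used | strategy not used
    row 1:       a       |        b          (violate expectation)
    row 2:       c       |        d          (no discrepancy)
    """
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a row or column total is zero; phi is undefined")
    return (a * d - b * c) / denom


# Hypothetical counts (illustration only, not the study's data).
phi = phi_coefficient(6, 2, 2, 6)
print(f"phi = {phi:.2f}")
```

A positive value indicates that the strategy was used more often for hypotheses in the first row than in the second; a value near zero indicates no association.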

2.3. Summary of Study 1

The verbal protocols collected for Study 1 provided a rich dataset by which to investigate the online thinking of practicing expert scientists as they analyze their own data. In the course of their analysis, the scientists develop hypotheses to account for aspects of the data and then evaluate those hypotheses in light of both their theoretical knowledge and the data themselves. The analyses presented above reveal several new findings about the processes by which scientists perform this task. First, they show that scientists use conceptual simulation as a means of evaluating hypotheses and that they do so relatively frequently compared with other strategies. We propose that scientists use conceptual simulation to generate a representation of a phenomenon under hypothetical circumstances, which then serves as a source of comparison with the actual data. The comparison between this hypothetical representation and the data takes place by a process of alignment by similarity detection, which allows the scientist to evaluate whether the hypothesis under consideration remains plausible. Finally, these results show that the use of conceptual simulation is strongly associated with conditions of informational uncertainty, as opposed to circumstances under which the scientist's expectations were met. Study 2 investigates further the relationship between conceptual simulation and uncertainty by experimentally manipulating the scientists' expectations.

3. Study 2

Although Study 1 found a strong relationship between informational uncertainty and conceptual simulation, this relationship was correlational. Temporally, the hypotheses preceded the conceptual simulations, and conceptual simulation was more associated with phenomena that violated the scientists' expectations than phenomena that matched them. Together, these facts support our interpretation that conceptual simulation is a strategy used in situations of informational uncertainty. However, the results of Study 1 only suggest an association; they do not imply a causal relationship between informational uncertainty and conceptual simulation. In order to investigate this relationship further, we conducted a second study in which we manipulated scientists' levels of certainty about data they would be examining.

In order to retain experimental control, we conducted Study 2 as a laboratory study. However, in keeping with our goal to study the reasoning processes of practicing scientists, we replicated some of the important features of the in vivo Study 1. As in Study 1, our participants were expert or near-expert scientists, conducting a scientific activity in which they regularly engaged (in this case, understanding data collected by a third party). In Study 2, we focused on one domain, cognitive psychology, for which we ourselves had the necessary domain knowledge to construct realistic materials.

3.1. Method

3.1.1. Participants

Participants were seven cognitive psychologists (4 men, 3 women). Three were advanced graduate students, 1 was a post-doctoral fellow, and 3 were university faculty.

3.1.2. Tasks

We created five tasks related to four topics within cognitive psychology—the Stroop effect, the “cocktail party effect,” graph interpretation, and the effect on performance of interruptions (the interruptions topic was divided into 2 tasks). These topics either concerned very well-known effects or they pertained to research conducted by participants themselves or by other members of the same lab who had presented talks on this research. Thus, the participants were familiar with all the topics and were considered expert in some of them.

The format of each task was as follows: A one-page, single-spaced text described a psychological experiment—the theoretical background and rationale for the experiment (from which predictions might be drawn) and a brief method section describing the stimuli or tasks used, the experimental conditions, the participants, and the procedure. The second page contained a bar graph representing the results of the study and a caption summarizing those results, including any relevant statistical results. An example of the tasks can be found in Appendix B.

The information in the theoretical background of the experiment was designed to lead the participant to have certain expectations about the results. There were two versions of each task, one in which the results of the experiment matched these expectations and an alternative version in which it did not. Thus, two within-subjects conditions were created: an Expectation Violation (EV) condition and an Expectation Confirmation (EC) condition. The tasks were adapted from real experiments published in the psychological literature. However, they were scaled down and simplified; in some cases, the results were altered in order to create the two conditions described above.

3.1.3. Task order

Each participant performed one version (EV or EC) of each of the five tasks. Tasks were counterbalanced according to a Latin Square design, and the condition for each task was varied such that each task was seen an approximately equal number of times in the EV or EC version, and each participant performed either two EV and three EC or two EC and three EV tasks. One task (the Interruptions task) was created as a sequence of two experiments; in Experiment 1, the expectations were violated (EV condition) prompting a follow-up experiment in which expectations were confirmed (EC condition). All participants performed both versions of the interruptions task.
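One common way to build a Latin square of task orders is by cyclic rotation, sketched below. This is an assumption for illustration: the specific square used in the study is not given, the task labels `T1` to `T5` are placeholders, and the function names `cyclic_latin_square` and `assign_orders` are hypothetical. The sketch also does not model the assignment of EV and EC versions or the fixed sequence of the two-part Interruptions task.

```python
def cyclic_latin_square(items):
    """Row r is the item list rotated left by r positions, so every item
    appears exactly once in each row and in each ordinal position."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(square, n_participants):
    """Give each participant one row, reusing rows cyclically when there
    are more participants than rows."""
    return [square[p % len(square)] for p in range(n_participants)]


tasks = ["T1", "T2", "T3", "T4", "T5"]  # placeholder task labels
square = cyclic_latin_square(tasks)
orders = assign_orders(square, 7)
for participant, order in enumerate(orders, start=1):
    print(participant, order)
```

A cyclic square balances the position in which each task appears but not which task immediately precedes which; a design that also controls carryover effects would need a different construction.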

3.1.4. Procedure

Participants were trained to provide talk-aloud protocols while problem solving (Ericsson & Simon, 1993). They were given the tasks one at a time by the experimenter, and they were instructed to read the materials aloud. The first page of text ended with the statement, “The results of this experiment are presented below,” followed by the question participants were to answer: “What do you think could account for these results?” Thus, participants were required to propose at least one hypothesis about the experimental results. The extent to which they reasoned about their hypothesis or hypotheses was left entirely to the participant. Their responses were recorded by video camera. After completing the tasks, participants were asked orally whether the results of each task were expected or unexpected to them. The protocols were transcribed and segmented, and conceptual simulations were coded as described in Study 1.

3.2. Results and discussion

One task, the “cocktail party effect” task, was excluded from analysis because many participants found part of the experimental manipulation and the results confusing.

3.2.1. Interrater reliability

One coder coded all of the data, and a second coder coded a subset (10%) of the data. (Ten percent was sufficient in this study because of the high reliability previously established in Study 1.) Initial agreement for the conceptual simulation coding was 97% (κ = .92, p < .01). Thus, agreement between the two coders was extremely strong. Any disagreements were resolved by discussion.

3.2.2. Time on task

Participants spent an average of 49.7 min performing the four tasks, and produced an average of 422 utterances (excluding participants' initial reading of the task materials that described the study). Thus, participants expended considerable time and effort performing the tasks, at least given that each task involved reasoning about only one experiment and one set of data.

3.2.3. Use of conceptual simulation

Overall, participants used conceptual simulation 78 times, or approximately once every 4.5 min, on average. This rate was approximately double that of Study 1. One possible explanation for this difference is that in Study 2, the task was explicitly to account for the data; whereas in Study 1, the task was to “do what you would normally do in looking at your data.” Thus, in Study 1, participants had to spend time determining what specific task they would perform next, how to set up the display to accomplish it, and then actually change the display. In Study 2, apart from reading the introductory text, the entire session was spent trying to explain the data.

The mean number of conceptual simulations in the EV condition was 3.8, compared with 1.9 in the EC condition. Thus, participants used conceptual simulation twice as often in the EV as in the EC condition (these were within-subjects conditions). A repeated measures analysis of variance on these data was significant, F(1, 6) = 12.06, p < .05, showing that participants were significantly more likely to use conceptual simulation when their expectations were violated than when they were confirmed. This result held across all subjects and tasks.
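With only two within-subjects conditions, a repeated measures analysis of variance is equivalent to a paired t test, with F(1, n − 1) = t². The sketch below illustrates that relationship. The per-participant values are hypothetical (the individual scores are not reported here), and the function name `paired_f` is an assumption made for the example.

```python
from math import sqrt
from statistics import mean, stdev


def paired_f(cond_a, cond_b):
    """Return (F, (df1, df2)) for a two-level within-subjects comparison,
    computed as the square of the paired t statistic."""
    if len(cond_a) != len(cond_b) or len(cond_a) < 2:
        raise ValueError("need paired scores from at least two participants")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, (1, n - 1)


# Hypothetical per-participant scores for seven participants (illustration only).
ev_scores = [4, 3, 5, 2, 4, 5, 3]
ec_scores = [2, 2, 3, 1, 3, 2, 1]
f_value, df = paired_f(ev_scores, ec_scores)
print(f"F{df} = {f_value:.2f}")
```

With seven participants the degrees of freedom are (1, 6), which is the form of the statistic reported above.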

3.2.4. Local EV and EC coding

It is possible that the manipulation did not work in the predicted manner; that is, participants might not have been surprised by results in the EV condition or might have found results in the EC condition surprising. In order to confirm that participants were indeed using conceptual simulation more frequently when their expectations were violated than when they were confirmed, a “local” EV–EC coding scheme was applied to the data. A two-stage system was used to determine whether each conceptual simulation occurred when the participant's expectations had been violated or confirmed. First, internal evidence in the protocol was used. For example, “The effect of interruption doesn't seem too surprising, because, um, according to theory, er, the goals decay quickly,” was coded as EC; whereas, “That's very interesting, though, because I would have expected something [referring to null result],” was coded as EV. Second, if there were no explicit statements in a specific task's protocol that could be coded as EV or EC, the participant's self-report from the post-task interview was used. Any conceptual simulations that occurred with reference to these phenomena were coded as EC or EV accordingly, regardless of the experimental condition.

Again, one coder coded all the data; a second, independent coder coded a subset (10%) of the data. Initial agreement was 98% (κ = .77, p < .01), which was a very strong level of agreement. Any disagreements were resolved by discussion. Furthermore, for 76% of the conceptual simulations, the local coding as EV or EC matched the experimental condition. Thus, although not perfect, overall the manipulation appears to have worked as intended.

3.2.5. Use of conceptual simulation: local coding

Two instances of conceptual simulation were not coded because the participant was trying to decide whether the result was surprising. Sixty-eight percent of the conceptual simulations were associated with expectation violation, compared with 32% associated with expectation confirmation. A chi-square test showed that conceptual simulation was used when expectations were violated significantly more frequently than expected by chance, χ²(1) = 12.96, p < .001. This result echoes the 2:1 ratio of use produced by the experimental manipulation and provides strong support for the hypothesis that conceptual simulation is a strategy used under conditions of expectation violation and informational uncertainty.
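The chi-square statistic here is a one-degree-of-freedom goodness-of-fit test against an even split between the two categories. The sketch below shows the computation under the assumption that the values 68 and 32 are entered as the observed frequencies; the source does not state which frequencies were used, and the function name `chi_square_gof` is hypothetical.

```python
def chi_square_gof(observed, expected=None):
    """Pearson goodness-of-fit statistic; defaults to equal expected frequencies."""
    total = sum(observed)
    if expected is None:
        expected = [total / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Assumption: the reported percentages are treated as the observed values.
stat = chi_square_gof([68, 32])
print(f"chi-square(1) = {stat:.2f}")
```

For fixed proportions the statistic scales linearly with the total number of observations, so its value depends on the frequencies entered.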

3.3. Summary of Study 2

Study 2 provides further evidence that scientists use conceptual simulation spontaneously when reasoning about data, and that they are more likely to do so under conditions of informational uncertainty. Whereas Study 1 provided correlational support for this hypothesis, Study 2 explicitly manipulated the participants' level of informational uncertainty by generating situations in which either their expectations would be met or they would be violated. The results of Study 2 thus provide experimental confirmation of our interpretation of the results of Study 1.

4. General discussion and conclusion

These two studies show that practicing, expert scientists use conceptual simulation when working on naturalistic tasks in their own domain. This result corroborates previous research that argues for the use of mental experimentation–simulation in both historical discoveries and contemporary reasoning tasks. However, whereas historically based research depends on retrospective and narrative sources, our research finds evidence in the scientists' online, verbalized thinking. Furthermore, whereas other studies have identified the use of this type of reasoning by scientists of varying degrees of expertise working in domains that are not their own, or on artificial tasks, we have examined the behavior of professional, expert scientists working in their own domain on authentic scientific tasks.

In addition, our research demonstrates that scientists are more likely to use conceptual simulation under situations of informational uncertainty. This is shown in the in vivo data, where conceptual simulation was associated with the evaluation of hypotheses related to unexpected phenomena, and it is further supported in the experimental study in which levels of informational uncertainty were explicitly manipulated. Finally, the research shows how conceptual simulation helps resolve uncertainty: Conceptual simulation facilitates reasoning about hypotheses by generating an altered representation under the purported conditions expressed in the hypothesis and providing a source of comparison with the actual data, in the process of alignment by similarity detection.

In-depth protocol studies, which use fewer participants than are generally involved in experimental research, always face questions about their generalizability. However, the consistency with which conceptual simulation was used by many individuals, as well as the range of scientific areas included in this research, suggests that the results of these two studies are likely to generalize to other scientists, at least insofar as they are performing data analysis. The use of conceptual simulation may vary in other scientific inquiry tasks, such as generating predictions from theories or designing experiments to test those theories. In general, however, we propose that scientists are likely to use conceptual simulation in situations of informational uncertainty, regardless of the specific task.

The cycle of hypothesis–conceptual simulation–alignment bears some resemblance to analogical reasoning in that one representation (a “source”) is mapped onto another (a “target”), in order to make inferences about it. The conceptual simulation was the means by which the scientists generated the source of the comparison. The actual, displayed data representation, which the scientists were trying to understand, was the target. Alignment by similarity detection was a form of comparison that allowed the scientist to evaluate the hypothesis in order to understand something more about the underlying structure of the data representation.

There are, however, important differences between conceptual simulation and analogical reasoning. First, in the data we examined, the process of alignment was primarily based on perception because of the visual–spatial nature of the scientists' data; in analogical reasoning in general, however, inferences drawn about the target are not necessarily grounded in perception. Second, analogical reasoning is a memory-based strategy (i.e., similar situations that have been previously observed are recalled and used to generate predictions for a novel situation). The protocol data in these two studies, however, suggest that although the initial representation in a conceptual simulation may be grounded in memory, the transformations that are applied to it appear to be constructed afresh with each simulation. In conceptual simulation, new representations are not generated solely by reference to a familiar situation but by taking what is known and transforming it to generate a future state of a system. Thus, conceptual simulation may be considered a form of model construction, which is likely to occur when no easily accessible, existing source for analogy is available. This situation may be similar to that identified by Griffith and colleagues, who proposed that when model search and analogy fail, scientists construct and manipulate mental models (e.g., by means of general structural transformation; Griffith, Nersessian, & Goel, 2000).

Like analogical reasoning, conceptual simulation can also be considered a type of reasoning with inductive mental models (e.g., Nersessian, 1992b; Schwartz & Black, 1996b). Although the term mental model is used frequently, there is wide-scale disagreement about precisely what constitutes a mental model. In our view, mental models are dynamic and “runnable.” This means that the components of the model can be set in motion and their behavior and changes of state can be observed, in a process that mirrors observations of the physical components of a tangible model. The output of running a mental model is an inference about the outcome of a particular converging set of circumstances. By animating their mental models, people are able to simulate a system's behavior in their “mind's eye” and to predict one or more possible outcomes, even for situations in which they have no previous experience (Gentner, 2002). Conceptual simulation involves transforming (“running”) a representation, and inspecting the output, a changed representation that becomes the basis for inferences about the data.

Conceptual simulations, like other kinds of mental models, rely on qualitative relations such as signs and ordinal relations, relative positions, and so on rather than precise numerical representation. In general, mental models are particularly instrumental in guiding problem solving when people lack a formal scientific understanding of a domain (e.g., Forbus, 1983; Gentner & Gentner, 1983; Kieras & Bovair, 1984). Although the expert scientists in our studies did not lack formal scientific understanding, they did lack the precise knowledge to immediately resolve the informational uncertainty they were experiencing. Conceptual simulation seems to have allowed them to engage in causal reasoning about a system, even in the midst of this informational uncertainty.

As a form of “what if” reasoning, conceptual simulation is also strongly related to the type of thought experiment discussed by Nersessian (1992b). Nersessian (1992b) also interpreted thought experiments as a form of reasoning with mental models and proposed that such mental models are “temporary structures constructed in working memory for a specific reasoning task.” We have argued that conceptual simulations are similarly constructed to meet a specific, temporary need. Nersessian (1992b) argued for the importance of this type of reasoning in instances of major conceptual change in scientific discovery. Unlike these thought experiments, which may lead to large-scale conceptual change, conceptual simulations may be considered small-scale, or “local,” thought experiments. Although we did not observe any major conceptual change in our data, we did witness numerous instances of scientists using conceptual simulation to get “unstuck” when they had reached an impasse in understanding their data; in this sense, conceptual simulation may serve a similar function of helping a scientist move beyond what is currently known.

In general, experts' domain knowledge provides them with many existing solutions and analogs on which to draw during problem solving (e.g., Chi, Feltovich, & Glaser, 1981). Yet, we found true experts generating conceptual simulations rather than retrieving solutions from memory. We propose that conceptual simulation will be used by experts when they are working either outside their immediate area of expertise or on their own cutting edge research—that is, in situations that go beyond the limits of their current knowledge. This interpretation meshes with Schraagen's (1993) observation that conceptual simulation was used on a task in the domain of gustatory psychology by psychologists expert in domains other than gustatory psychology, but not by novices or by experts within the gustatory domain. Although Schraagen was led to conclude that it is therefore an intermediate strategy, his results are not inconsistent with our suggestion that experts working on a truly novel task in their own domain would engage in conceptual simulation. The extent to which novices are able to productively use conceptual simulation in situations of uncertainty remains a matter for investigation. We predict, however, that novices will be less capable of generating conceptual simulations because they lack domain knowledge, and that therefore they will use fewer conceptual simulations than experts.

There are very few studies of expert scientists performing “real” scientific tasks. In his pioneering in vivo study of molecular biologists, Dunbar (1995) asked, “How do scientists really reason?” Our studies contribute further to our understanding of how scientists really reason. Frequently, studies of experts employ problems that are well-understood for an expert and that can be solved by recalling either this very problem (i.e., by model-based search) or another that shares the same deep structure (i.e., by analogy; cf. Chi et al., 1981). In contrast, our studies show experts reasoning about problems for which neither they nor anyone else knows the answer. In such circumstances, they must construct new models “on the fly,” tailor-made to the problem and its context. This strategy of conceptual simulation is similar to mental model-based strategies used by laypeople in reasoning about the everyday world. Expert scientists, however, have the domain knowledge that allows them to generate predictions that are accurate and therefore useful in the context of scientific problem solving.

With the current emphasis in science education reform on authentic practice (National Research Council, 1996), these studies have practical implications for efforts to improve science in the classroom. Current educational theory suggests not only that instruction should be situated in the context of authentic scientific questions to which students genuinely desire to learn the answer (Barron et al., 1998), but also that students should be encouraged to use the tools and strategies of real scientific practice. Research has already shown the value of having students generate predictions prior to conducting experiments (White, 1993); however, the prediction generation process itself has been largely unexplored. It is possible that qualitative reasoning strategies, such as the use of mental models and conceptual simulation, can be explicitly taught to students, providing them with a more formal means to generate predictions, specify their implications, evaluate their accuracy, and identify potential causes of discrepancies.

There have been many myths about how scientists operate including the idea of the “lone scientist” toiling in isolation; the belief that scientific discovery is the result of genius, inspiration, and sudden insight; the assumption that hypotheses should always precede experimentation and observation; and especially the notion that scientists are unbiased processors of objective data. Research in cognitive science has helped to dispel many of these myths; the current study contributes further to our understanding of the processes by which scientific knowledge actually develops in the real world. It provides evidence to support the claim that science advances not through the use of mysterious and inexplicable processes unique to a particular group of geniuses but through the systematic use of everyday processes. Conceptual simulation—a specific type of qualitative mental model—is one such everyday reasoning process.


1. After the third coder had completed the coding, a 2 × 2 contingency table was constructed counting the number of times the coders agreed there was no conceptual simulation, the number of times they agreed there was a conceptual simulation, the number of times Coder 1 thought there was a conceptual simulation but Coder 3 did not, and the number of times Coder 3 thought there was a conceptual simulation but Coder 1 did not. The nature of the data was such that there were very many instances of “no conceptual simulation,” which were easy to identify (e.g., lines 2–5 of Table 7). The majority of coded utterances thus fell into the cell representing agreement on “no conceptual simulation.” However, because percentage agreement does not appropriately take into account agreement by chance, Cohen's kappa was used in addition to percentage agreement (Cohen, 1960). A kappa of .7 is generally considered to represent satisfactory agreement.
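To illustrate why kappa is reported alongside percentage agreement, the following minimal sketch computes both from a 2 × 2 table of the kind described in this note. The cell counts are invented for illustration and are not the study's data; the function name `cohens_kappa` and its parameter names are hypothetical.

```python
def cohens_kappa(both_cs, both_no_cs, only_first, only_second):
    """Return (percentage agreement, Cohen's kappa) for two coders and two codes."""
    n = both_cs + both_no_cs + only_first + only_second
    observed = (both_cs + both_no_cs) / n
    # Marginal proportion of "CS" codes assigned by each coder.
    first_cs = (both_cs + only_first) / n
    second_cs = (both_cs + only_second) / n
    # Agreement expected by chance if the coders were independent.
    chance = first_cs * second_cs + (1 - first_cs) * (1 - second_cs)
    return observed, (observed - chance) / (1 - chance)


# Hypothetical counts dominated by agreement on "no conceptual simulation".
agreement, kappa = cohens_kappa(both_cs=20, both_no_cs=170,
                                only_first=5, only_second=5)
print(f"agreement = {agreement:.2%}, kappa = {kappa:.2f}")
```

When one cell dominates the table, chance agreement is already high, so kappa is noticeably lower than raw percentage agreement.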


Work on this project was supported by Office of Naval Research Grants M12439 and N0001403WX30001 to J. Gregory Trafton. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Navy. This research was Susan Bell Trickett's dissertation. She thanks all the members of her dissertation committee: Chris Schunn, Debbie Boehm-Davis, and Brenda Bannan-Ritland for their comments and guidance. Thanks also to Erik Altmann, Audrey Lipps, and Bill Lile for comments on earlier drafts of this work; to Peter Cheng, Nancy Nersessian, Dan Schwartz, and an anonymous reviewer for their reviews of this manuscript; and to Raj Ratwani for assistance with coding.

Appendix A: Conceptual simulation training

We want you to read through every line in the protocol and mark it in the following way. First, you need to ask whether the speaker is creating a new mental representation. One way to think about this is to determine whether he or she is referring directly to what is currently on display on the computer screen. If so, there is no new mental representation. If the scientist is referring to something in his or her head, you should note that as a new representation. The new representation could refer to a memory of something he or she has already seen, or it could refer to a theoretical construct, or it could refer to a hypothetical situation that the scientist is constructing for the first time.

When you identify a new representation, you should code the utterances that follow it, using the spatial transformation coding scheme. That is, if the scientist mentally manipulates or transforms the starting representation spatially, you should code that utterance accordingly. Finally, immediately after any utterances that you have coded as spatially transforming the starting representation, you should examine the next utterance(s) to determine whether there is a “result” of the transformations, or an ending representation that is different from the starting representation. If you find all three components of this sequence, you should code each utterance as conceptual simulation (CS). For any utterance that is not a part of this type of sequence, you should code it as no conceptual simulation (No CS).

Example 1

| Utterance (scientist 1) | Utterance (scientist 2) | CS Coding | Explanation (training purposes only) |
|---|---|---|---|
| That might just be gas blowing from the star-forming regions | | No CS | Scientist is trying to explain what might account for “stuff all over here” identified previously |
| | But that's not a star-forming region, though, at the centre left | No CS | Identifies feature of current display |
| Centre left | | No CS | Searches display to identify area of interest |
| | That one | No CS | Identifies area of interest |
| Maybe this stuff is just sort of infalling | | No CS | Spatial transformation: mentally moves “stuff” from one location to another. However, coded as No CS because it does not follow a reference to a new representation, or lead to a changed representation |
| I mean, you know, if there's a big gas cloud … | | No CS | New representation. Coded as No CS because the representation is not transformed. |
| | Infalling as a big blob? | No CS | Queries explanation |
| Why not? Why not? Why can't gas infall as a big blob? | | No CS | Reiterates explanation |
| | The pressure thing tends to push them apart, though | No CS | States domain knowledge |
| | I mean, it seems like there should be a kinematic reason for that | No CS | States domain knowledge |
| | Ah, I don't see what it is | No CS | Unable to resolve |
| | It seems like the H1 disk here is offset | No CS | Scientist is looking at image of galaxy and interpreting it |
| The H1 disk is offset … Can you have that happen? | | No CS | Questions interpretation |
| | Sure, I, I, well, I think you can actually | No CS | |
| | Umm, I mean, remember, these things are in the elliptical orbits | CS | New representation (displayed image does not show anything about orbits) |
| | Things may be falling kind of inward as they're going around the orbits | CS | Spatial transformation: mentally moves matter from one place to another, and moves it around in orbit |
| | The gas pressure is sort of driving the H1 out a little bit more | CS | Spatial transformation: mentally moves the H1 from one location to another |
| | And when it falls back in because of the dissipation going on | CS | Spatial transformation: mentally moves H1 from one location to another |
| | You could have it offset that way | CS | End result: offset disk |

Here are two examples from the astronomy dataset that illustrate this coding scheme. In the first example, note that although the scientists are trying to explain a particular phenomenon by proposing different hypothetical situations, and although a new representation is generated, there is no conceptual simulation, because no spatial transformations are applied to the new representation. The entire sequence (refer to new representation, refer to mentally transforming representation, refer to result of transformation) is not present. In the second example, there is a reference to a new representation, a reference to several spatial transformations performed on that representation, and a reference to an end result of those transformations. Consequently, each of those utterances is coded as CS.
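The three-part rule in these instructions can be sketched as a simple pass over per-utterance labels. This is only an illustrative sketch: the label names (`new_rep`, `transform`, `result`, `other`) and the function `code_utterances` are hypothetical, and the sketch assumes strict adjacency (a new representation followed directly by one or more spatial transformations and then a result), which is a simplification of the coders' judgment.

```python
def code_utterances(labels):
    """Mark each utterance "CS" only if it belongs to a complete
    new_rep -> transform(s) -> result sequence; otherwise "No CS"."""
    codes = ["No CS"] * len(labels)
    i = 0
    while i < len(labels):
        if labels[i] == "new_rep":
            j = i + 1
            # Consume the run of spatial transformations that follows.
            while j < len(labels) and labels[j] == "transform":
                j += 1
            has_transform = j > i + 1
            has_result = j < len(labels) and labels[j] == "result"
            if has_transform and has_result:
                for k in range(i, j + 1):
                    codes[k] = "CS"
                i = j + 1
                continue
        i += 1
    return codes


# Labels loosely following the examples: a transformation with no preceding
# new representation, a new representation that is never transformed, and
# finally a complete sequence.
labels = ["other", "transform", "new_rep", "other",
          "new_rep", "transform", "transform", "transform", "result"]
print(code_utterances(labels))
```

Only the final five utterances form a complete sequence, so only they receive the CS code.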

Appendix B: Sample materials for Study 2


Altmann & Trafton (in press) have suggested that there are 3 things that memory for goals depends on:

  1. Rehearsal (you may need to rehearse your goal to remember it later)
  2. Cues in the environment (i.e., something in the environment may remind you what your goal was)
  3. The fact that individual goals decay quite quickly (in seconds)

[Figure B1]

[Error bars are standard error of the mean. There is a highly significant effect of interruption: Resuming a task after an interruption takes much more time than lags measured during the primary task. There is no effect of condition, F < 1.]

Recently, Trafton ran an experiment to examine how rehearsal affected resuming a task after an interruption. The task was set up so that participants were working on a goal as they got interrupted. The experiment used two tasks, a primary task that participants worked on most of the time and a secondary task that was the “interrupting” task. The primary task was a complex resource allocation task that had many different goals and many different things participants could do at any point in time. The secondary task was a dynamic categorization task (the Ballas task, a lot like Argus).

Participants worked on the primary task for approximately 20 minutes. There were 10 interruptions throughout the 20 minute scenario. Each interruption followed a mouse-click to ensure that a participant was working on a goal (or, rather, to ensure the participant was actively working on some task, not just thinking or spacing out). There were two conditions:

  • A No Warning condition (NW) where participants were immediately taken to the secondary task.
  • A Warning condition (W) where participants were given 8 seconds to “prepare” for the secondary task. Participants were warned they were switching to the secondary task by a set of “eyeballs” that appeared on the screen. Once the eyeballs showed up, participants were not able to work on the primary task and were told to “remember what they were working on.”

All participants were told that when they came back to the primary task, they were to resume where they left off (i.e., to remember the goal they were working on).

There were 10 subjects in each condition.

The secondary task lasted approximately 45 seconds.

According to Altmann & Trafton, the Warning condition was expected to have a much faster resumption lag (RL) than the No Warning condition. (A resumption lag is the time it takes people to resume a task after being interrupted; a regular lag is the time between key strokes without an interruption).

The results of this experiment are presented above. What do you think could account for these results?