Goal-Proximity Decision-Making


Correspondence should be sent to Vladislav D. Veksler, Air Force Research Laboratory, Wright-Patterson AFB, OH. E-mail: vdv718@gmail.com


Reinforcement learning (RL) models of decision-making cannot account for human decisions in the absence of prior reward or punishment. We propose a mechanism for choosing among available options based on goal-option association strengths, where association strengths between objects represent previously experienced object proximity. The proposed mechanism, Goal-Proximity Decision-making (GPD), is implemented within the ACT-R cognitive framework. GPD is found to be more efficient than RL in three maze-navigation simulations. GPD advantages over RL seem to grow as task difficulty is increased. An experiment is presented where participants are asked to make choices in the absence of prior reward. GPD captures human performance in this experiment better than RL.

1. Introduction

How does a cognitive agent choose a path of actions from an infinitely large decision-space? Reinforcement learning (RL) is widely adopted as a psychologically and biologically valid model of human/animal action-selection (e.g., Fu & Anderson, 2006; Holroyd & Coles, 2002; Nason & Laird, 2005). RL explains how an agent may reduce its decision-space over time by attending to the reward structure of the task environment. However, as goals change, so does the reward structure of the agent's world. Relearning the reward structure for every possible goal may take an extremely long time. For greater efficiency, a cognitive agent should be able to learn more about its environment than just the reward structure, and to exploit this knowledge for achieving new goals in the absence of prior reward/punishment. For example, a person may see a mailbox on her way to work and incidentally learn its location. Sometime later, if she needs to mail a letter, she can find her way to that mailbox because she knows its location. There had been no reward or punishment for the actions leading to this mailbox, and so the ability to find its location cannot be explained solely through the principles of reinforcement learning.

We propose a mechanism for making decisions in the absence of prior reward or punishment, and we provide initial tests of its fidelity and efficiency as compared with RL. Given multiple possible paths of action, the proposed mechanism chooses the path most strongly associated with the current goal, regardless of prior reward. Strength of association between any two items, in turn, depends on experienced temporal proximity of those items. From here forth, we refer to the proposed mechanism as GPD (goal-proximity decision-making).

The rest of this article describes a key theoretical problem for RL models of decision-making (the two-goal problem), briefly summarizes classic evidence in psychological literature for reward-independent decision-making in humans and animals, and presents two computational models that exemplify non-RL-based decision-making. The article then outlines the implementation of the GPD mechanism within the ACT-R cognitive framework. GPD and RL are contrasted for efficiency in finding multiple goal-states in three simulation environments, ranging in difficulty from easy to difficult. Finally, we describe an experiment based on a forced-choice paradigm and provide fits of GPD and RL decision mechanisms to human data in this task. We conclude that GPD is more efficient than RL in multi-goal environments, and that GPD can account for human performance where RL cannot—prior to any reward or punishment.

1.1. What this article is not about

Because everything in cognition is so closely knit, the GPD theory may evoke topics that are outside of the scope of current work. The following topics are important to cognitive science but tangential to the focus of this article.

First, GPD is not meant to replace RL, but rather to complement it. The focus of this paper is on presenting: (a) a decision mechanism that is compatible with Anderson's theory of spreading activation (Anderson & Lebiere, 1998), and (b) advantages of this mechanism over the more widely adopted RL-based decision mechanism in matching human efficiency on multi-goal tasks. However, there is much evidence supporting the idea that declarative and reward-based learning mechanisms complement each other in human/animal cognition (e.g., Daw, Niv, & Dayan, 2005; Dickinson & Balleine, 1993, 1994; Fu & Anderson, 2008; Gläscher, Daw, Dayan, & O'Doherty, 2010). Determining the exact process by which GPD and RL may interact is an extensive topic that we plan to address in future work.

Second, GPD is a model of human choice in the context of immediate behavior; as such, it does not address planning. How GPD may be used in complex planning procedures is a tangential topic.

Third, the field of human decision-making is vast and diverse, and current work only pertains to a small subset of this discipline.

Fourth, GPD bears resemblance to models of navigation (such as the one described in Section 'Voicu and Schmajuk'), but GPD is not restricted to spatial navigation tasks. In this context, it is important to think of the task environments in Sections 'Simulations' and 'Experiment' as generic instances of Markov decision tasks.

Finally, GPD partially addresses associative learning. However, associative learning is not the focus of this article. Rather, the focus here is on the goal-oriented decision-making that can emerge from a simple associative learning mechanism. The topic of associative learning should comprise other lines of research (e.g., sequence recall, free association, priming) in addition to this one and is too extensive to address here.

1.2. The two-goal problem

Consider a scenario where an agent has to achieve goal A, and then goal B, in the same environment. To increase efficiency, humans and animals learn the environment during task A and perform faster on task B (the Experiment below provides evidence for this phenomenon). That is, we do not just learn positive utility for the actions that helped us reach the goal, or negative utility for the actions that failed to reach it; we also pick up on other regularities in the environment that may help us with possible future goals. RL-based architectures have a problem matching human performance on this two-goal problem.

To make this example more concrete, imagine how an RL-based agent may perform on a specific two-goal problem. In this example, the first goal, A, can be accomplished by executing actions 1, 2, and 3. After trying the sequences 1-2-4, 1-5-7, and 1-4-3, the agent finally attempts the sequence 1-2-3. Upon reaching the desired goal A, actions 1, 2, and 3 will be positively reinforced. The utility value of actions 1, 2, and 3 will increase every time that A is reached via this route, and soon these actions will fire without fail, greatly improving the agent's time to reach the goal.

Now imagine the task switches so that the agent has to find B in the same task environment. The shortest path to B would be to fire actions 1, 5, and then 7. Although the agent had previously reached state B, actions leading to this state were not positively reinforced because B was not the goal at the time. Thus, when presented with this new goal, RL performance will be at chance level.

RL, by definition, learns only the reward structure of the world, ignoring the rest of the environmental contingencies (with the exception discussed in the Model-based RL section below). In those cases where this ignored information may help in achieving new goals, it would be useful to have an additional mechanism for collecting and using this information (especially in the case of humans, where memory is relatively cheap as compared to additional trials). The mechanism proposed in this paper, GPD, should serve as such a complement for RL-based architectures.
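To make the limitation concrete, the two-goal example above can be sketched as a minimal tabular utility learner. The action numbering follows the example; the single-utility-per-action simplification, reward values, and learning rate are illustrative assumptions, not any particular architecture's RL implementation.

```python
# Minimal tabular utility-learning sketch of the two-goal problem.

def run_episode(q, actions, reached_goal, alpha=0.1):
    """Nudge each executed action's utility toward the episode's reward."""
    reward = 1.0 if reached_goal else 0.0
    for a in actions:
        q[a] = q.get(a, 0.0) + alpha * (reward - q.get(a, 0.0))

q = {}
# Searching for goal A: three failed attempts, then the successful 1-2-3.
for actions, ok in [((1, 2, 4), False), ((1, 5, 7), False),
                    ((1, 4, 3), False), ((1, 2, 3), True)]:
    run_episode(q, actions, ok)

# Actions on the path to A were reinforced; actions 5 and 7 (the path
# to the new goal B) carry no utility, so a choice for B is at chance.
```

After these episodes, actions 1, 2, and 3 carry positive utility, while actions 5 and 7 remain at zero: nothing learned while pursuing A helps the agent choose the 1-5-7 route to B.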

2. Background

Stevenson (1954) provided evidence that children are capable of resolving the two-goal problem. In this study, children were placed at the apex of a V-shaped maze, and the goal items were located at the ends of the arms of the V. Children were asked to find some goal item A (a bird, flower, or animal sticker), and later asked to find a new goal B (a purse or a box). Although children were never rewarded for finding B, and did not know that they would be asked to look for it at any point, once presented with this goal, they proceeded to the correct arm of the maze more than 50% of the time.

This paradigm, called latent learning, does not just provide evidence that learning occurs in the absence of reward/punishment, but also that, given a goal, the learned information is reflected in decision-making, and ultimately in performance. Latent learning was originally found in rats in the context of maze running (Blodgett, 1929). Tolman provided further evidence for latent learning in rats and humans (Tolman & Honzik, 1930; Tolman, 1948), and Quartermain and Scott (1960) displayed latent learning behavior in human adults, substituting a cluttered cubicle shelf for the rat maze.

The following subsections review four decision models that employ associative knowledge (rather than reward) in the decision process, and thus, may be able to display latent learning and resolve the two-goal problem.

2.1. Model-based planners

Model-based RL (Sutton & Barto, 1998) extends RL by learning the environmental structure beyond action utilities. The term “Model” in “Model-based RL” refers to the agent's internal model of the environment. An agent based on this framework is capable of planning its route before execution. However, the planning process itself is still based on RL. To use the example from the two-goal problem section: When a model-based RL agent is presented with a new goal, B, even with the knowledge that 1-5-7 leads to B, the agent will begin to plan its route by considering random actions. In other words, because this framework employs a decision mechanism based on RL, having the additional knowledge about the world does not reduce decision cycles.

Daw et al. (2005) present a “tree-search” model that also learns the structure of the environment, but it employs a brute-force search to find the best path toward its goals. Similar to model-based RL, “tree-search” models provide no additional efficiency over RL on the two-goal problem in that they do not attempt fewer actions; the search is simply performed in the head rather than in the world.

2.2. Voicu and Schmajuk

Although models of space navigation can employ RL (e.g., Sun & Peterson, 1998), there is a class of decision mechanisms employed in many artificial navigation systems that do not use RL representation (for review see Trullier, Wiener, Berthoz, & Meyer, 1997). As Trullier et al. state, “Navigation would be more adaptive if the spatial representation were goal-independent” (p. 489).

In a primary example of goal-independent representation, Voicu and Schmajuk (2002) implemented a computational model that learns the structure of the environment as a network of adjacent cells. Once a goal is introduced, reward signal spreads from the goal-cell through this network, such that the cells farther from the goal-cell receive less activation than those that are close. Goal-driven behavior in this model comprises moving toward the cells with the highest activation.

Once this model memorizes the map of the environment, it does not need to learn the reward structure through trial-and-error. Rather, the utility of each action-path is identified through spreading activation from the goal. In this manner, this model resolves the two-goal problem.
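The mechanism can be sketched as follows; the 3×3 grid, four-neighbor adjacency, and decay rate are illustrative assumptions, not the parameters of the original model.

```python
# Sketch of goal-driven choice via spreading activation on a learned
# adjacency map, in the spirit of Voicu and Schmajuk (2002).
from collections import deque

def spread_activation(adjacency, goal, decay=0.5):
    """Activation = decay ** (distance from goal), via breadth-first search."""
    activation = {goal: 1.0}
    frontier = deque([goal])
    while frontier:
        cell = frontier.popleft()
        for neighbor in adjacency[cell]:
            if neighbor not in activation:
                activation[neighbor] = activation[cell] * decay
                frontier.append(neighbor)
    return activation

# 3x3 grid of cells 0..8 with four-neighbor adjacency
adjacency = {c: [n for n in (c - 3, c + 3,
                             c - 1 if c % 3 else None,
                             c + 1 if c % 3 < 2 else None)
                 if n is not None and 0 <= n < 9]
             for c in range(9)}

activation = spread_activation(adjacency, goal=8)

# Goal-driven behavior: from cell 0, always move to the most active neighbor.
path = [0]
while path[-1] != 8:
    path.append(max(adjacency[path[-1]], key=activation.get))
```

Because activation decays monotonically with distance from the goal cell, the greedy walk follows a shortest path with no trial-and-error learning of the reward structure.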

One major limitation of this model is that it makes unrealistic assumptions about the world (e.g., that it can be neatly mapped out as a grid of adjacent spaces). This model would be computationally infeasible for sufficiently large, dynamic, or probabilistic environments. In addition, this model is not integrated within a larger cognitive framework. As a stand-alone model of maze-navigation behavior in an oversimplified environment, there are questions as to the scalability and fidelity of the model.


2.3. SNIF-ACT

SNIF-ACT (Fu & Pirolli, 2007) is a model of human information-seeking behavior on the World Wide Web. The pertinence of SNIF-ACT to current work is that it is a model of how humans use declarative knowledge (rather than action utilities) in goal-driven behavior in a very rich and unpredictable task environment. The World Wide Web is unpredictable in the sense that there is no way for any of its users to know what links they will encounter during Web browsing. For this reason, an agent must be able to evaluate its actions (which link to click) without any prior reinforcement of those actions.

The action of clicking a link in SNIF-ACT is based not on the previous reinforcement of clicking on that link, but rather on the semantic association of the text in the link to user goals (information scent). To implement this concept in ACT-R, Fu and Pirolli changed the utilities for clicking links based on the link-goal association strengths (note the similarity to the Voicu & Schmajuk model). This is different from the standard ACT-R implementation, where the decision mechanism is based on RL. Changing the utility mechanism in this way allows SNIF-ACT to make non-random decisions between multiple matching actions that have never been reinforced.

Besides being limited to text-link browsing, SNIF-ACT's other major limitation is that it does not learn the association strengths between links and goals, but rather imports these values from an external source. However, SNIF-ACT's decision-making mechanism is an excellent example of how to achieve goal-driven behavior in the absence of prior reinforcement within the ACT-R framework.

2.4. Gläscher et al. (2010)

Gläscher et al. (2010) present an associative-learning model that makes decisions based on learned association strengths, as an alternative to RL. The model's learning mechanism employs the delta-learning rule (Rescorla & Wagner, 1972; Shanks, 1994; Widrow & Hoff, 1960), which is widely accepted as a psychologically and biologically valid mechanism of associative learning.

However, unlike the SNIF-ACT and Voicu & Schmajuk models, which predict choice based on a psychologically valid mechanism of spreading activation (Anderson & Lebiere, 1998), the Gläscher et al. model calculates option utilities by means of dynamic programming, a mathematical optimization technique that recursively defines values at each level in terms of those at the next level.

Another important aspect of the Gläscher et al. (2010) work is the proposed integration of an associative learning-based model with an RL model. Specifically, Gläscher et al. (2010) describe a Hybrid Learner mechanism for blending the results of RL-based and association-based decision models. How a mechanism like the Hybrid Learner may be employed to integrate RL with the GPD model that we propose below is a topic for future research.

2.5. Summary

In summary, RL-based decision mechanisms cannot resolve the two-goal problem, even when associative data are collected, as in model-based RL. Employing associative knowledge in the decision process can resolve the two-goal problem; the Voicu & Schmajuk, SNIF-ACT, and Gläscher et al. models employ associative knowledge in this manner. The Voicu & Schmajuk and SNIF-ACT models could be further improved as models of human decision-making by including a psychologically valid mechanism for associative learning, like the delta-learning rule employed by the Gläscher et al. model. The Gläscher et al. model, in turn, could be improved by employing a psychologically valid mechanism for estimating option utilities, like the spreading activation mechanism employed by the Voicu & Schmajuk and SNIF-ACT models.

3. Goal-proximity decision-making

RL cannot account for human/animal decision-making in the absence of reward. The Voicu & Schmajuk and the Fu & Pirolli models described above suggest an alternative decision mechanism, where agent choice depends on spreading activation from the goal.

More specifically, these models employ reward-independent associative knowledge to represent environmental contingencies. The decision process in both models comprises choosing the option most strongly associated with the current goal.

In the Voicu & Schmajuk model, the strength of association between two elements is inversely proportional to the physical distance of those elements in space. In SNIF-ACT, the strengths of associations are imported from an external source, a Pointwise Mutual Information engine (Turney, 2001), in which an association strength between two words is incremented every time the two words co-occur within a window of text, and decremented every time the two words occur in the absence of one another.

Hence, for both methods, the temporospatial proximity (object distance for the Voicu & Schmajuk model and word distance for SNIF-ACT) between items X and B may be employed to predict whether X is en route to B. Although the agent is seeking some goal, A, it may be learning the proximity of elements in its environment, including the proximity of X and B. Given a new goal, B, the agent can use its knowledge to judge the utility of approaching X to find B. In this manner, the environmental contingencies learned while performing goal A can help to improve agent performance on goal B, thus resolving the two-goal problem.

We call this mechanism goal-proximity decision-making (GPD). In more generic terms, GPD (a) relies on having associative memory, where association strengths between memory elements represent experienced temporal proximity of these elements, and (b) chooses to approach the environmental cue that is most closely associated with its current goal.

3.1. Implementation

We implement GPD in the ACT-R cognitive architecture (Anderson, 2007; Anderson & Lebiere, 1998). ACT-R comprises a production system as the central executive module, a declarative memory module, a goal module, an imaginal module, and visual and motor modules.

To implement GPD in ACT-R, we developed an ACT-R model that, given some goal G, looks through all the options on screen, performing retrievals from memory. Retrievals from memory in ACT-R, among other factors, depend on spreading activation from the goal—such that the memory elements that are more strongly associated with the goal, G, are more likely to be retrieved. The GPD model then clicks on the option that was retrieved from memory—this most likely being the option with the greatest association to G. By “option” we mean the state that the model may move to, rather than the action taken to move. The actions required for moving toward a given option (e.g., clicking a button or pressing a key) are task dependent.

Although ACT-R employs the spreading activation mechanism, making for an easy implementation of the GPD model (only 13 productions), it does not make predictions about how association strengths between memory elements are learned. ACT-R 4.0 (an older version) had a mechanism for associative learning (Lebiere, Wallach, & Taatgen, 1998; Lebiere & Wallach, 2001). However, according to Anderson (2001), this particular form of associative learning turned out to be “disastrous” and produced “all sorts of unwanted side effects” (p. 6).

To implement associative learning in ACT-R, we first create an episodic queue—a simple list containing the names of recently attended memory elements. Whenever the model checks the contents of the visual buffer (visual attention), the name of the memory element from the visual buffer is pushed into the episodic queue. This is most similar to the queue of executed productions held in ACT-R's procedural module (used for reward propagation).

Next, we update association strengths between the latest episode and every other item in the episodic queue. To do this, we employ error-driven learning, also known as the Delta rule. This is the same learning mechanism as the one employed by the Gläscher et al. model described above, and it is widely accepted as a psychologically and biologically valid mechanism of associative learning (Rescorla & Wagner, 1972; Shanks, 1994; Widrow & Hoff, 1960). Stated formally, for each new element j and previously experienced element i, the strength of association between j and i, S_ji, at current time step n, is increased in the following manner:

S_ji(n) = S_ji(n−1) + β · (δ^t_ji − S_ji(n−1))    (1)

where β is the learning rate parameter, δ is the temporal discount rate (0 < δ < 1), and t_ji is the temporal distance between j and i. The pseudocode for the GPD model and this associative learning mechanism is provided in Table 1. The full ACT-R model may be found at http://act-r.psy.cmu.edu/papers/1030/actrGPDmodel.lisp.

Table 1. GPD pseudocode
Sji is the association strength between memory elements j and i.
δ is the temporal discount rate.
β is the associative learning rate parameter.
# GPD algorithm
given a goal, G, and current best option, Y {
    for each option in the environment, X {
        learn episode (X)
        given two options, X and Y,
            attempt retrieval from declarative memory,
            spreading activation from G
        set Y to be the retrieved memory element
    }
    learn episode (Y)
    approach option Y
}
# episodic/associative learning
learn episode (j) {
    activationOfItem = δ
    for each item in episodicqueue, i (most recent first) {
        Sji += β * (activationOfItem − Sji)
        activationOfItem = activationOfItem * δ
    }
    push j into episodicqueue
}

To make it clear how this mechanism may work, consider the following scenario. When the model is initiated, all association strengths are assumed to be 0. After it is initiated, the model sees A, followed by B, followed by G. Thus, the temporal distance between G and B is 1, and the temporal distance between G and A is 2. Assuming δ = 0.5 and β = 1, according to Eq. 1 the strength of association between A and G becomes 0.25, and the strength of association between B and G becomes 0.5 (in addition, by this point the model would have also learned that the association strength between A and B is 0.5). Now consider that at some later point, having no other experience with items A, B, and G, the goal of the model is to find G. If the model was deciding between options A and B, the model would be more likely to retrieve B from memory and click (or otherwise approach) this option, as this item would have more activation spreading from the goal. No reward information would be required to make this decision, thus overcoming the limitations of ACT-R's current RL-based decision mechanism.
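This walkthrough can be sketched in a few lines, assuming a dictionary of association strengths and a list-based episodic queue (the actual ACT-R implementation differs in its buffer and retrieval machinery, and adds retrieval noise):

```python
# Sketch of GPD's associative learning (Eq. 1) and goal-proximity choice,
# reproducing the A-B-G walkthrough above (delta = 0.5, beta = 1).

def learn_episode(j, queue, S, beta=1.0, delta=0.5):
    """Delta-rule update of S[j, i] for each item i in the episodic queue."""
    activation = delta
    for i in reversed(queue):       # most recently attended item first
        s = S.get((j, i), 0.0)
        S[(j, i)] = s + beta * (activation - s)
        activation *= delta         # discount grows with temporal distance
    queue.append(j)                 # push j into the episodic queue

def choose(goal, options, S):
    """Approach the option most strongly associated with the current goal."""
    return max(options, key=lambda x: S.get((goal, x), 0.0))

S, queue = {}, []
for item in ["A", "B", "G"]:        # the model attends A, then B, then G
    learn_episode(item, queue, S)
```

After the three attended items, the sketch yields the association strengths from the walkthrough (G-B = 0.5, G-A = 0.25, B-A = 0.5), so a model seeking G chooses option B without any reward information.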

The following sections attempt to show advantages for GPD over the Temporal-Difference RL mechanism currently employed for decision-making in ACT-R, with respect to efficiency and the ability to account for human data in multi-goal task environments.

4. Simulations

The purpose of these simulations is to expose how GPD is more efficient than RL in multi-goal environments, and how these advantages multiply as the environment grows in complexity. We ran three simulations to contrast GPD and RL in traditional grid-world mazes. Grid-world simulations are a common way to represent a generic problem space and are often used to examine RL and navigation models (e.g., Sutton & Barto, 1998; Trullier et al., 1997; Voicu & Schmajuk, 2002). The three grid-world mazes are displayed in Fig. 1. Maze A allowed movement in all eight directions between neighboring cells (N, NE, E, SE, S, SW, W, and NW); maze B allowed bidirectional movement in four directions (N, E, S, W); maze C comprised unidirectional and bidirectional connections, without regard for grid consistency.

Figure 1.

Simulation environments. Numbered boxes signify locations, and arrows signify the directions in which an agent may travel.

The subtle differences in these navigation environments were meant to modify maze difficulty. A random-walk model could find each possible maze location from each possible starting point in mazes A, B, and C in an average of 181.39, 369.83, and 793.79 steps, respectively (each number represents an average of 100 model runs). Thus, the three mazes may be thought of as Easy, Medium, and Difficult, respectively.
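A random-walk baseline of this kind can be sketched as follows; the 4×4 four-neighbor grid is an assumed stand-in for the actual mazes, so the resulting step counts will not match the figures above.

```python
# Random-walk baseline for calibrating maze difficulty: average number
# of attempted moves before an undirected walk reaches a goal cell.
import random

def random_walk_steps(start, goal, size=4, rng=random):
    x, y = start
    steps = 0
    while (x, y) != goal:
        dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:  # moves off the grid are no-ops
            x, y = nx, ny
        steps += 1                             # blocked attempts still count
    return steps

random.seed(1)
mean_steps = sum(random_walk_steps((0, 0), (3, 3)) for _ in range(100)) / 100
```

Averaging over many runs, as in the simulations reported here, smooths out the heavy-tailed distribution of individual walk lengths.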

For the sake of speed, closed-form RL (based on the native Temporal-Difference RL implementation in the ACT-R cognitive architecture) and GPD models were employed for this simulation. For each of the mazes, the models attempted to find each of the 16 maze locations, one at a time, in random order, five times. Model performances were evaluated in terms of the number of steps it took to find all 16 locations the fifth time around. Table 2 and Fig. 2 display the best-found parameter values for the models and the simulation results based on those parameters (RL and GPD performance scores are averaged over 60 and 40 model runs, respectively).

Figure 2.

Best results from exploration of three varying-difficulty task environments.

Table 2. Best parameter values for GPD and RL for varying-difficulty mazes
 | Learning Rate | Noise | Episodic Queue
Note. GPD temporal discount rate for all mazes: δ = 0.7.

Maze A
Maze B
Maze C

A two-factor 2×3 ANOVA, examining the effects of Model Type (GPD, RL) and Maze Type (Easy, Medium, Difficult) on the number of steps taken to complete the 5th bin, revealed a significant interaction effect of the two factors, F(2,294) = 10.88, p < .001, a significant effect of Model Type, F(1,294) = 77.00, p < .001, and a significant effect of Maze Type, F(2,294) = 83.43, p < .001. As expected, in these multi-goal tasks GPD outperforms RL for all environment types. In addition, these results suggest that the advantage of GPD over RL increases as task difficulty increases.

5. Experiment

The purpose of this experiment is to collect behavioral data in a multi-goal environment where we hypothesize that GPD can account for human performance and RL cannot. This experiment requires participants to traverse a simple maze in search of different goal items presented one at a time. Whereas RL predicts that the reward structure is updated only after the agent reaches a goal or a dead end, GPD predicts that the agent also learns where other items in the maze are located. When asked to find a new goal, RL should perform at chance level (since there has been no reward for this goal), whereas GPD should perform above chance level. Thus, the results from this experiment help to draw a contrast between the two models. Random-walk and Ideal-performer models are included in the model comparison analyses to provide baseline performances.

5.1. Participants

Twenty-one undergraduate students at Rensselaer Polytechnic Institute (RPI) participated in this experiment for course extra credit, as specified by the course instructor.

5.2. Materials

The experiment was presented as a point-and-click application on a 17” computer screen, set to 1280×1024 resolution. Participants were presented with 150×200 pixel option buttons, where each button displayed either a letter from the English alphabet, or one of the symbols shown in Fig. 3.

5.3. Design

The experiment employed a single-group design with no between-subject variables. Participants were asked to perform a simple exploratory maze-navigation task. Each participant had to complete two two-arm mazes (two arms, two goal items in each arm) and four three-arm mazes (three arms, three goal items in each arm) in the following order: two-arm, three-arm, three-arm, two-arm, three-arm, three-arm. The choice and goal items in each of the two-arm mazes were letters of the English alphabet (chosen randomly without replacement), and the choice and goal items of the three-arm mazes were symbols from Fig. 3 (chosen randomly without replacement). Participants were required to continue with a given maze until they completed six consecutive error-free trials (trials where only the correct path to the goal was taken) in the two-arm mazes, or 12 consecutive error-free trials in the three-arm mazes.

Figure 3.

Stimuli used for 3-choice mazes.

For each trial, participants were asked to find one of the goal items (e.g., in the maze displayed on the left of Fig. 4, a goal could be: C, D, E, or F), such that no two successive trials would have repeating goals. The idea here is to replicate the two-goal (or rather n-goal) problem design—while participants are looking for a given goal item they may be learning the maze, and they will be able to perform above chance level when presented with the next goal item.

5.4. Procedure

Each trial persisted until the participant found and clicked the required goal item. At the beginning of each trial, participants were presented with a screen containing the top-level options (e.g., in the two-arm maze in Fig. 4, a participant is first presented with options A and B). After choosing one of the top-level options, participants were presented with a screen containing the bottom-level options (e.g., in the two-arm maze in Fig. 4, if a participant chose option A, they are presented with options C and D). If the participant chose the wrong path to the goal, upon choosing one of the bottom-level options, they were presented with a “Dead End” screen, and taken back to the screen containing the top-level options. If the participants found and clicked the current goal item, they were presented with their next goal.

Figure 4.

Sample navigation mazes for Experiment 1, two-arm condition (left) and three-arm condition (right).

At each screen, to ensure that participants attended each option, the options were always covered with a gray mask until clicked. Another click was necessary to re-mask an unmasked option before proceeding. After the first option is unmasked and re-masked, a participant may proceed to unmask the next option. Once all options on screen have been viewed and re-masked, the participant could make his or her choice with an additional click. Fig. 5 shows a sample screen progression for the experiment, where both options are masked (top-left), then one of the options is clicked, and it unmasks “A” (top-middle), then that option is clicked again and re-masked (top-right), and then the other option is clicked to unmask “B” (bottom-left). Finally, the second option is clicked again (bottom-right), at which point the participant can choose either of the two options by clicking it.

In addition, participants were not able to rely on their location memory, as the location of each option on screen was randomized. For example, in Fig. 5, the locations of options “A” and “B” were randomly chosen to be left and right, respectively. However, on the next trial, these locations may be reversed. Thus, the participant could not say “when I go left, I get C and D”; instead, they had to recall that “A leads to C and D.”

Figure 5.

Experiment 1 screen progression. Top-left: two options available, both are masked. Top-middle: one of the options has been clicked to unmask “A.” Top-right: option “A” is re-masked. Bottom–left: the other option was clicked to unmask “B”. Bottom-right: second option is re-masked, and now a choice between the two options can be made by clicking one of them.

5.5. Modeling

Human data were analyzed in terms of agreement with four models: GPD, RL, Random, and IdealPerformer. The Random model selected which option to click at random, and the IdealPerformer model remembered perfectly which items were located in each arm of the maze and made choices based on that memory. The RL model simply increased the utility of a goal-choice pair if the choice led to the goal successfully, and decreased it otherwise; the option with the highest utility warranted a click (no noise was added), and if multiple options had the same utility, the choice was random. After a few (fewer than 10) variations were attempted, the best-fit GPD model was derived to have error-driven learning with the following parameters: δ = 0.5, β = 0.01. No noise was added to spreading activation in GPD.

Model data were collected using the model-tracing technique (Anderson, Corbett, Koedinger, & Pelletier, 1995; Fu & Pirolli, 2007).1 For each human participant, for each decision, each model was provided with the same experience as the human participant up to that choice point, and then the model's would-be choice was recorded. For example, imagine that Table 3 presents data for a human participant having gone through the maze shown on the left of Fig. 4. At the bolded choice point (trial 1), because there is no experience with the maze yet, all models would choose randomly. Let us say that both the RL and the GPD models chose B. Thus, what will be recorded is that these two models made an error on trial 1, whereas the human participant did not. However, the experience added to the two models will be based on the human choice. At the end of trial 1, RL will have learned that the D-A (if goal is D, click A) goal-choice pair has a positive utility. GPD will have learned that D is strongly associated with C, less so with A, and even less with B, and that C is strongly associated with A, and less so with B. At the underlined choice point (trial 2, top), the RL model will still have to make a random choice (utilities for C-A and C-B goal-choice pairs are both 0 at that point). The GPD model, having learned that C is more associated with A than with B, will choose A.

Table 3. Sample data log for a human participant
Trial 1: goal = D:
looked at A, looked at B, clicked A,
looked at C, looked at D, clicked D, success
Trial 2: goal = C:
looked at B, looked at A, clicked B,
looked at E, looked at F, clicked F, fail
looked at B, looked at A, clicked A,
looked at C, looked at D, clicked C, success
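The model-tracing loop described above can be sketched as follows, with a trivial stand-in model; the log format and names are illustrative assumptions:

```python
def model_trace(model, decision_log):
    """Replay one participant's decisions: record whether the model's
    would-be choice matches the human's, then train the model on the
    HUMAN's actual choice and outcome (not the model's own choice)."""
    matches = []
    for goal, options, human_choice, success in decision_log:
        matches.append(model.choose(goal, options) == human_choice)
        model.learn(goal, human_choice, success)
    return matches

class AlwaysFirst:
    """Trivial stand-in model for illustration only."""
    def choose(self, goal, options):
        return options[0]
    def learn(self, goal, choice, success):
        pass
```

Because the model is fed the human's experience at every step, its learned state never diverges from the participant's history, and agreement can be scored decision by decision.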

5.6. Results and simulation

Each model's performance was averaged over 10 model runs for each decision point. Results from the first two-arm maze were ignored as training data (the experimenter was in the room with the participants for the first few steps of the first two-arm maze, responding to participants' questions). Results for human and model performances on the first choice of each of the first six trials for the other two-arm maze (maze 4) are shown at the top of Fig. 6 (only the first six trials are shown because some participants did not have data beyond the sixth trial). Results for human and model performances on the first choice of each of the first 14 trials for the three-arm mazes (averaged over all mazes: mazes 2, 3, 5, and 6) are shown at the bottom of Fig. 6 (only the first 14 trials are shown because some participants did not have data beyond the 14th trial).

Table 4 displays root mean square errors (RMSE) between average human and model performances for the data displayed in Fig. 6—performance on the first choice of each trial for the first six trials of the second two-arm maze, and the first 14 trials of the four three-arm mazes.

Figure 6.

Average performance from human participants, GPD, RL, Random, and IdealPerformer models on the two-arm maze (top) and the three-arm mazes (bottom). Error bars represent standard error based on 21 participants.

Table 4. RMSE between human and model performances, by maze type

The early part of the curves in Fig. 6 emphasizes the problem for reinforcement learning. RL simply cannot account for the efficiency of human performance in the early stages of the task, where the reward structure of the environment is still unknown. The IdealPerformer model assumes perfect learning and no interference. For example, on trial 1 shown in Table 3, the IdealPerformer model will have learned only the association between the clicked option, A, and the ensuing options, C and D. GPD, however, would increment association strengths between C/D and all of their preceding items: both A and B. Thus, IdealPerformer learns unrealistically fast, and RL learns unrealistically slowly.

6. Summary and conclusions

Whereas RL accounts for human decision-making based on the experienced reward, this article proposes a mechanism to account for human choice in the absence of prior reward, based on associative learning and spreading activation. The proposed mechanism, GPD, was compared to RL in its efficiency and empirical validity. To examine efficiency, the two models were contrasted in their ability to find multiple goal-states in three mazes of varying difficulty. GPD performed better than RL in these multi-goal environments. Simulation results further suggest that GPD advantages over RL increase as task difficulty increases. To examine GPD validity as a human cognitive mechanism, GPD was implemented in the ACT-R cognitive architecture and examined in its ability to simulate human behavior in a forced-choice navigation task. In this experiment, GPD was able to account for human data where RL could not—in the beginning of the task, before the reward structure of the environment was fully experienced.

The implementation of GPD in the ACT-R cognitive architecture required two things. First, we wrote an ACT-R model that made retrievals based on spreading activation from the goal, and clicked on the retrieved option. Second, associative learning was introduced: keeping recently attended memory elements in an episodic queue, and using error-driven learning to increase the strengths of association between memory elements based on their proximity in the episodic queue.
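These two pieces can be sketched as follows. This is an illustrative approximation, not the ACT-R implementation: δ = 0.5 matches the best-fit parameter reported above, but the 1/distance association target, the symmetric strengths, and the queue length are assumptions, and the β parameter is omitted for brevity:

```python
from collections import deque

class GPDMemory:
    """Sketch of GPD associative learning: an episodic queue of recently
    attended items, plus error-driven strengthening of associations whose
    target falls off with proximity (distance) in the queue."""
    def __init__(self, delta=0.5, queue_len=4):
        self.delta = delta                    # error-driven learning rate
        self.queue = deque(maxlen=queue_len)  # recently attended items
        self.S = {}                           # symmetric association strengths

    def _key(self, a, b):
        return tuple(sorted((a, b)))

    def attend(self, item):
        # Strengthen associations between the new item and each queued item,
        # with a target of 1/distance: closer items get stronger associations.
        for dist, prev in enumerate(reversed(self.queue), start=1):
            k = self._key(prev, item)
            s = self.S.get(k, 0.0)
            self.S[k] = s + self.delta * (1.0 / dist - s)
        self.queue.append(item)

    def choose(self, goal, options):
        # Stand-in for spreading activation: retrieve the option most
        # strongly associated with the current goal.
        return max(options, key=lambda o: self.S.get(self._key(goal, o), 0.0))
```

Replaying the attention sequence of trial 1 in Table 3 (A, B, A, C, D) reproduces the ordering described in Section 5.5: D ends up most associated with C, less with A, and least with B, so with goal C or goal D the model clicks A.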

This article highlights how associative learning, rather than reinforcement learning, can explain the efficiency of human goal-driven behavior where no reward information is available. However, it would be incorrect to suggest that reinforcement learning is not a good model of human behavior. Unlike RL, GPD cannot account for human/animal performance in situations where negative reward or a variable positive reward signal is involved. Associative learning merely records the temporospatial structure of the environment, and not the value of decision feedback. In other words, the two mechanisms are complementary. How the two mechanisms may interact in a cognitive architecture is a topic for future research.


Acknowledgments

Wayne Gray's work on this study was supported by ONR grant #N000141010019, Ray Perez, Project Officer, the Alexander von Humboldt Stiftung, and the Max Planck Institute for Human Development. In addition, some of the work was performed while Vladislav D. Veksler held a National Research Council Research Associateship Award with the Air Force Research Laboratory's Cognitive Models and Agents Branch.


Note

1. Model-tracing was employed here because GPD predictions depend wholly on which items were attended, and in what order. If GPD were to attend different items, or items in a different order, than the participants (e.g., because of random fluctuations, or because humans may have preferences, such as going left-to-right), GPD performance could differ considerably.