2.1. The parameter space of reinforcement learning models
In contrast to our effort, prior work has investigated settings for each of the four parameter sets of reinforcement learning models (general cognitive architecture, environment, actor, and critic) largely in isolation. Different cognitive architectures and architectural assumptions have been used across studies. For example, there have been architecture-based approaches (e.g., Napoli & Fum, 2010; Nason & Laird, 2005) and non-architectural approaches, of which several examples are given by Sutton and Barto (1998).
Similarly, reinforcement learning techniques have been applied in different task environments. These range from gambling tasks such as the Iowa gambling task (e.g., Yechiam & Busemeyer, 2005), to interactive tasks (e.g., Gray et al., 2006), to tasks in which (virtual) agents have to maneuver in an environment (e.g., Ballard & Sprague, 2007; Singh et al., 2009; see also several examples in Sutton & Barto, 1998).
Different settings have also been proposed for the actor and critic components of reinforcement learning models. Yechiam and Busemeyer (2005) and Ahn et al. (2008) investigated how different choices for the actor and critic components influence performance in the context of the Iowa gambling task. The success of the different models in predicting human behavior depended on the modeler’s objective: Some models are better at predicting short-term performance, whereas others are better at predicting long-term performance.
The critic components that Yechiam and Busemeyer (2005) and Ahn et al. (2008) investigated differ from the parameters that are the focus of this article. The aforementioned studies investigated the algorithms by which utility is calculated. An overview of the options for these algorithms is given in Sutton and Barto (1998). To mention a few, a modeler needs to decide which actions get their utility value updated (e.g., all actions preceding a reward, or only the last n actions), how strongly a reward impacts an action’s utility value (e.g., is the influence of a reward linearly or exponentially “discounted” with the number of time steps between the action and the reward), and how long the agent learns (e.g., during its entire life span, or only during a designated learning phase). We will discuss our settings in Section 2.2.
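To make this design space concrete, the decisions listed above can be summarized as a small configuration. The following sketch is purely illustrative; the class, field names, and option labels are our own and do not come from ACT-R or any other specific architecture:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CriticConfig:
    """Hypothetical summary of the critic design decisions discussed above."""
    # Which actions get their utility value updated when a reward arrives?
    credit_assignment: str = "all_since_last_reward"   # or "last_n_actions"
    last_n: Optional[int] = None                       # only used with "last_n_actions"
    # How does a reward's influence decay with temporal distance from an action?
    discounting: str = "linear"                        # or "exponential"
    # When does the agent learn?
    learning_window: str = "entire_lifetime"           # or "designated_phase"
```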
Modeling work on the effect of different settings for rewards is scattered across the literature. Different models have used different settings for moment, objective function, and magnitude of reward (see Table 1). However, systematic investigations of the different combinations of these parameters are scarce. Indeed, the only extensive study of the effect of different reward settings that we found is reported by Singh et al. (2009). In two simple simulated agent environments, they explored model behavior for, respectively, 3,240 and 54,000 alternative reward values. These settings were achieved by systematically varying the magnitude of rewards, drawing continuous values from the range [−1.0, 1.0]. The value of the rewards was not motivated by a rationale for a specific objective function or magnitude. Rather, the researchers investigated whether the rewards that provided the agent with the best performance score also matched an understanding of optimizing a specific objective function.
Similar to Singh and colleagues, we systematically explore the effect of alternative reward types on the performance of a reinforcement learning model. However, in our work the alternative reward types will be motivated by a rationale for objective function and magnitude. In addition, we will also explore the effect of the moment at which the reward is given; in the work by Singh and colleagues this parameter was not varied, as rewards were given after every time step (Singh et al., 2009). We will also investigate model performance over a shorter time frame (48 trials), similar to the duration of the human experiment. This contrasts with the focus of Singh and colleagues on evolutionary aspects of rewards (their models had at least 20,000 reward updates), which did not include a comparison to human performance.
2.2. Actor and critic in ACT-R
Our choice of the ACT-R cognitive architecture (Anderson, 2007) as a framework provides us with parameter settings for the general cognitive architecture, the actor, and the critic. As the model interacts with a task interface, this combination also provides parameter settings for the task environment. The settings for the general cognitive architecture and environment will be discussed when we flesh out the model in Section 4. In the current section, we will outline the parameter settings for the actor and critic. We will also show how the parameter settings for moment, objective function, and magnitude of reward influence the performance of the critic.
ACT-R is a production rule system, which means that behavior is generated by a sequence of production rules (condition–action pairs). Each production rule has a utility value associated with it. Higher utility values reflect successes in the past and predict successes in the future.
The functionality of a critic is achieved by updating the utility value of production rules that are associated with an experienced reward, using a utility function. ACT-R’s utility function is a special case (Anderson, 2007; Fu & Anderson, 2004, 2006) of the temporal difference learning algorithm from reinforcement learning (Sutton & Barto, 1998). At the moment when a reward is given, all production rules i that preceded the reward (and followed the previous reward) get their utility value U updated as follows:

Ui(n) = Ui(n − 1) + α [Ri − Ui(n − 1)]    (Eq. 1)
In this equation, Ui(n) is the estimated utility of production rule i after its nth usage, Ri is the estimated reward and α is the learning rate. The estimated utility at the current time (nth usage) is based on the previous estimate of the utility (Ui(n − 1)) plus an error term (Ri − Ui(n − 1)) that reflects the difference between the estimated reward and the previous estimated utility. By scaling this error term with the learning rate α (ranging between zero and one), the impact of recent experience on the estimated utility is limited, and learning is gradual.
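As a concrete illustration, the following minimal Python sketch implements Eq. 1. This is not ACT-R’s actual code; the function name is our own, and α = 0.2 is simply a commonly used value for the learning rate:

```python
def update_utility(u_prev: float, r_estimated: float, alpha: float = 0.2) -> float:
    """One application of Eq. 1: move the previous utility estimate Ui(n-1)
    toward the estimated reward Ri by a fraction alpha of the error."""
    error = r_estimated - u_prev      # error term: Ri - Ui(n-1)
    return u_prev + alpha * error     # Ui(n) = Ui(n-1) + alpha * error


# With alpha = 0.2, a rule starting at utility 0 that repeatedly receives an
# estimated reward of 10 climbs gradually: 2.0, 3.6, 4.88, 5.904, ...
u = 0.0
for _ in range(4):
    u = update_utility(u, r_estimated=10.0)
```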
For our study, the interesting parameter in this equation is the estimated reward, Ri, which is based on a behavioral reward and a temporal difference value. The behavioral reward, rj in Eq. 2, represents the reward that is experienced in the environment. Its value is determined by the objective function and its magnitude.
The second component of Ri is the temporal difference value. For each production rule i, this value estimates how much that production rule contributed to the magnitude of the eventual behavioral reward. In ACT-R, the temporal difference value is calculated as a linear difference between the time at which the production rule fired and the time at which the behavioral reward was experienced. Thus, if production rule i was used at time ti, and a behavioral reward rj is given at time tj, then the estimated reward of production rule i is (Anderson, 2007):

Ri = rj − (tj − ti)    (Eq. 2)
The magnitude of Ri decreases linearly as the moment at which the reward is given is delayed (i.e., as the difference between tj and ti increases). This captures the intuition that actions (or production rules) that are used closer in time to a behavioral reward contribute more to the magnitude of that reward than actions from the more distant past. In this sense, the moment when the reward is given influences the estimated reward and thereby the estimated utility of production rules.
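The following sketch (again our own illustration, not ACT-R code) shows how Eq. 2 behaves: the later a rule fired relative to the reward, the larger its estimated reward, and with linear discounting the estimate can even turn negative for rules fired long before the reward:

```python
def estimated_reward(r_j: float, t_j: float, t_i: float) -> float:
    """Eq. 2: the behavioral reward r_j, discounted linearly by the time
    (in seconds) between the firing of production rule i (at t_i) and
    the moment the reward is experienced (at t_j)."""
    return r_j - (t_j - t_i)


# Rules that fired closer to the reward receive a larger estimated reward:
estimated_reward(r_j=10.0, t_j=12.0, t_i=11.5)   # 9.5: fired 0.5 s before the reward
estimated_reward(r_j=10.0, t_j=12.0, t_i=4.0)    # 2.0: fired 8.0 s before the reward
estimated_reward(r_j=10.0, t_j=12.0, t_i=0.0)    # -2.0: distant rules can be penalized
```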
Utility values can be used for action selection in choice situations (i.e., for the actor component in an actor–critic system). As is common in production rule systems, ACT-R models first select a subset of available production rules that can be executed given the current state of the world and the model (i.e., the contents of the diverse buffers in ACT-R). If multiple alternatives are available, the model chooses among them with a preference for the production rule with the highest utility value, using the soft-max action selection rule that is widely applied in reinforcement learning models (e.g., Anderson, 2007; Sutton & Barto, 1998).
To moderate the strict reliance on exact utility values, and to reflect uncertainty in action selection, the soft-max selection rule has a built-in temperature component that applies some noise to each utility value during an action selection round. This is useful in situations where two actions are so close in utility value as to be practically, though not statistically, indistinguishable. The temperature component then ensures that the lower-ranked action is occasionally chosen. Similarly, this occasional selection of lower-ranked actions can help in exploring the (perhaps changed) value of alternative actions.
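A minimal sketch of soft-max (Boltzmann) action selection follows. The function name and temperature value are our own; ACT-R realizes the same idea by adding noise to each utility before selecting the maximum, which yields choice probabilities of this soft-max form:

```python
import math
import random

def softmax_select(utilities: list[float], temperature: float = 1.0) -> int:
    """Pick an action index with probability proportional to
    exp(utility / temperature): higher-utility actions win more often,
    but lower-ranked actions keep a nonzero chance of being chosen."""
    m = max(utilities)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp((u - m) / temperature) for u in utilities]
    return random.choices(range(len(utilities)), weights=weights, k=1)[0]


# Two nearly indistinguishable utilities: the nominally worse action is still
# selected on roughly 44% of draws (1 / (1 + e^0.25) is about 0.438).
counts = [0, 0]
for _ in range(10_000):
    counts[softmax_select([2.0, 1.75])] += 1
```

Lowering the temperature makes selection more deterministic (approaching a strict argmax over utilities), whereas raising it makes selection more exploratory.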
Importantly, modelers are free to choose the settings of the moment, objective function, and magnitude of reward. In the next section we will introduce the Blocks World task, which will be our test-bed for investigating these parameters. This will be followed by a description of the model and its parameter settings.