Using Reinforcement Learning to Examine Dynamic Attention Allocation During Reading



A fundamental question in reading research concerns whether attention is allocated strictly serially, supporting lexical processing of one word at a time, or in parallel, supporting concurrent lexical processing of two or more words (Reichle, Liversedge, Pollatsek, & Rayner, 2009). The origins of this debate are reviewed. We then report three simulations to address this question using artificial reading agents (Liu & Reichle, 2010; Reichle & Laurent, 2006) that learn to dynamically allocate attention to 1–4 words to “read” as efficiently as possible. These simulation results indicate that the agents strongly preferred serial word processing, although they occasionally attended to more than one word concurrently. The reason for this preference is discussed, along with implications for the debate about how humans allocate attention during reading.

One of the most contested debates among reading researchers concerns the nature of attention allocation during reading (cf., Kliegl, Nuthmann & Engbert, 2006; Rayner, Pollatsek, Drieghe, Slattery, & Reichle, 2007; see also, Reichle, Liversedge, Pollatsek, & Rayner, 2009). The question being debated is: Is attention allocated in a strictly serial manner, to support the lexical processing of only one word at a time, or is attention instead allocated as a gradient encompassing two or more words, to support the concurrent lexical processing of multiple words? Despite the seemingly simple nature of this question, it has not been answered unequivocally. For example, the computational models that have been developed to explain how attention affects readers' eye movements do so equally well using the serial-attention assumption (e.g., as in the E-Z Reader model; Reichle, Pollatsek, Fisher, & Rayner, 1998; Reichle, Pollatsek, & Rayner, 2012b; Reichle, Rayner, & Pollatsek, 2003; for a review, see Reichle, 2011) and the attention-gradient assumption (e.g., as in the SWIFT model; Engbert & Kliegl, 2011; Engbert, Longtin, & Kliegl, 2002; Engbert, Nuthmann, Richter, & Kliegl, 2005; Schad & Engbert, 2012). Similarly, various attempts to empirically settle the issue have also been equivocal (cf., Inhoff, Eiter, & Radach, 2005; Inhoff, Radach, & Eiter, 2006; Pollatsek, Reichle, & Rayner, 2006a,b). This article therefore addresses this question using a novel approach—one that entails using artificial reading agents (Liu & Reichle, 2010; Reichle & Laurent, 2006; Reichle, Liu, & Laurent, 2011a) to examine the emergence of dynamic attention allocation during reading. The goal in doing this is to move beyond the “either-or” nature of the debate and gain a better understanding of the conditions under which the serial versus parallel attention allocation enhances reading performance.

Artificial reading agents are virtual systems that use reinforcement learning (Sutton & Barto, 1998) to acquire behaviors (e.g., programming and executing saccades) that allow the agents to “read” as efficiently as possible. In the context of the simulations, this “reading” entailed the identification of linear arrays of words, and the agents learned to perform this task as efficiently as possible given a variety of physiological (e.g., limited visual acuity) and psychological (e.g., words require time to identify) constraints. Prior simulations have shown that these agents are capable of learning to control the movement of their “eyes” in a manner that supports efficient reading. For example, Reichle and Laurent (2006) demonstrated that these agents learn to direct their eyes toward the centers of words because this location permits rapid word identification, and that the agents learn the optimal “strategy” of initiating saccadic programs (which require time to complete) so that the eyes move from word n to word n + 1 just as word n has been identified. (This strategy is optimal because initiating saccadic programming any sooner would slow the identification of word n by causing the eyes to move prematurely, and because initiating saccadic programming any later would cause the fixation on word n to be unnecessarily long.) Liu and Reichle (2010; see also Reichle, Liu, et al., 2011a) replicated these results using agents implemented in artificial neural networks and showed that the aforementioned eye-movement behaviors generalized from training sentences to novel test sentences.
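The optimal-timing result described above can be reproduced in miniature with a toy reinforcement-learning sketch. The model below is our own illustration, not the authors' implementation: identifying a word takes T fixation steps, a saccade program takes P steps, and if the eyes leave before the word is identified, the leftover processing must be done from the parafovea at half speed. Because every elapsed time step is penalized, the best policy is to initiate the program exactly P steps before identification finishes.

```python
import random
from collections import defaultdict

random.seed(1)
T, P = 5, 3          # steps to identify a word; steps to program a saccade
WAIT, START = 0, 1

def episode_cost(start_step):
    """Total time steps if the saccade program is initiated after
    start_step steps of processing word n."""
    if start_step + P >= T:          # word finishes before the eyes move
        return start_step + P
    leftover = T - (start_step + P)  # finished from the parafovea, 2x slower
    return start_step + P + 2 * leftover

Q = defaultdict(float)
for _ in range(5000):                # tabular Q-learning, epsilon-greedy
    s = 0
    while True:
        a = random.choice([WAIT, START]) if random.random() < 0.2 \
            else max([WAIT, START], key=lambda act: Q[(s, act)])
        if a == START or s == T:     # the eyes must move eventually
            Q[(s, a)] += 0.1 * (-episode_cost(s) - Q[(s, a)])
            break
        target = max(Q[(s + 1, WAIT)], Q[(s + 1, START)])
        Q[(s, a)] += 0.1 * (target - Q[(s, a)])
        s += 1

policy = [max([WAIT, START], key=lambda act: Q[(s, act)]) for s in range(T)]
print(policy)  # the agent learns to WAIT until T - P = 2 steps are done, then START
```

Initiating earlier incurs the slow-parafoveal penalty; initiating later wastes fixation time, which is the trade-off the sentence above describes.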

The previously reported simulations using artificial reading agents have thus been informative about what physiological and psychological variables constrain the emergence of eye-movement behavior during reading. For example, the simulations have provided a novel account for how an early stage of lexical processing (e.g., corresponding to the “familiarity check” that is posited in the E-Z Reader model; Reichle, 2011) might come to trigger the initiation of saccadic programming during reading, and why fixation durations are sometimes inflated immediately prior to word skipping (Kliegl & Engbert, 2005; Reichle et al., 1998; see also Reichle & Drieghe, in press). But unfortunately, these simulations say nothing about the nature of attention allocation because the agents were only permitted to attend to one word at a time. In other words, because the agents were only allowed to begin processing each word in a sentence after the previous word had been completely identified, it was not possible to ascertain how different attention-allocation schemes might have affected the agents' performance. This article reports the results of three new simulations in which this restriction was relaxed, providing an opportunity for the agents to learn to attend to and process 1–4 words concurrently.

Our hypothesis was that the agents would learn to allocate attention in a dynamic manner (i.e., attending to different numbers of words in different situations) and that this would afford maximally efficient reading. We also thought that this dynamic “strategy” might provide new insights into the conditions under which humans might opt to attend to one versus more than one word during reading. In the remainder of this article, we first briefly review what has been learned about eye-movement control during reading (for comprehensive reviews, see Rayner, 1998, 2009) and how this has informed the debate about how attention is allocated. We then describe our reading agents and how they are able to dynamically allocate attention to process 1–4 words. This will then make it possible to describe how our simulations were completed and to report the results of those simulations. And finally, we discuss the theoretical implications of our results for the debate about attention allocation and models of eye-movement control during reading and other (non-reading) visual-cognitive tasks.

1. Eye movements during reading

Contrary to most people's subjective impressions, our eyes do not move smoothly or continuously across lines of text as we read. Instead, the eyes make rapid ballistic movements called saccades. These saccades move the eyes from one viewing location to the next, where the eyes remain fairly stationary for brief periods of time called fixations. Although 60–80% of the saccades move the eyes forward to the next word, 15–20% are refixations that move to a different viewing location within the same word, 10–15% move the eyes forward more than one word (thereby causing the next word to be skipped), and 10–15% are regressions that move the eyes back to previously read portions of the text (Rayner, 1998, 2009). In alphabetic languages like English and German, most of the fixations are 200–250 ms in duration, although their duration can be as short as 50 ms or as long as 1,000 ms (Rayner, 1998, 2009). The durations of the saccades themselves also vary as a function of how far they move the eyes (Fuchs, 1971) but are typically 25–40 ms in duration and (on average) 7–9 characters in length (McConkie, Kerr, Reddix, & Zola, 1988). Because visual information is not extracted from the printed page during the saccades (Matin, 1974), reading can be likened to a slide show in which the visual information that is available from each fixation or “slide” remains visible for about a quarter of a second before the eyes move to the next “slide” (Rayner & Pollatsek, 1989).

Again contrary to most people's subjective impressions, the amount of visual information that can be extracted from the printed page at any point in time is extremely limited. This fact is most convincingly demonstrated by eye-movement studies employing gaze-contingent paradigms wherein what is displayed on a computer monitor during any given fixation is contingent upon where a participant is looking (Rayner, 1979). For example, in the moving-window paradigm, normal text is displayed within a “window” that extends some number of characters to the left and right of a participant's fixation, while random letters or Xs are displayed outside of the window (McConkie & Rayner, 1975; Rayner & Bertera, 1979). As the participant moves his or her eyes, the window of normal text moves with the point of fixation, thereby delimiting how much information can be extracted during each fixation. As indicated, because visual information is not extracted during the saccades (Matin, 1974), participants usually fail to notice the display changes. However, by varying the size and symmetry of the window and the nature of the text outside of the window (e.g., leaving vs. filling in the blank spaces between words), it has been possible to ascertain that readers normally extract information from a very limited spatial extent. For example, skilled adult readers of English normally extract word boundary information up to 15 character spaces to the right of fixation, but typically only extract information about the identities of individual letters from up to 7–8 characters to the right of fixation (Häikiö, Bertram, Hyönä, & Niemi, 2009; McConkie & Rayner, 1975; Rayner, 1986).
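The display logic of the moving-window paradigm can be sketched in a few lines. The function below is our own reconstruction for illustration (not McConkie and Rayner's software): letters outside a window around the fixated character are replaced with x, and the spaces between words can be kept or masked to manipulate the availability of word-boundary information.

```python
def moving_window(text, fixation, left=4, right=14, keep_spaces=True):
    """Return `text` as it would appear during a fixation at index
    `fixation`, with letters outside the window masked by 'x'."""
    out = []
    for i, ch in enumerate(text):
        inside = fixation - left <= i <= fixation + right
        if inside or (keep_spaces and ch == " "):
            out.append(ch)           # visible: inside window (or a preserved space)
        else:
            out.append("x")          # masked: outside the window
    return "".join(out)

text = "the quick brown fox jumps over the lazy dog"
masked = moving_window(text, fixation=12)  # fixating the "o" in "brown"
print(masked)
```

On each simulated fixation the window would be recomputed at the new fixation index, so the region of normal text travels with the eyes, just as in the experiments.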

Such findings indicate that, during any given fixation, the perceptual span, or spatial extent of effective visual processing, is limited, so that only 1–3 words typically receive any amount of processing during any given fixation. And although this restriction partially reflects the fact that the type of high visual acuity that is necessary to perceive fine detail (e.g., letters) is largely limited to the fovea, or central 2° of the visual field (Rayner & Morrison, 1981), the restriction also reflects inherent limitations of attention and its allocation. Again, this was most convincingly demonstrated in an experiment using the moving-window paradigm and English–Hebrew bilingual participants; when the participants read English, their perceptual span extended asymmetrically to the right of fixation, but when they read Hebrew (which is read from right to left), their perceptual span extended asymmetrically to the left of fixation (Pollatsek, Bolozky, Well, & Rayner, 1981). The perceptual span thus reflects the joint constraints imposed by the physiology of the retina (i.e., limited visual acuity) and the manner in which attention can be allocated.

As just indicated, the perceptual span severely restricts the amount of lexical processing that can occur during each fixation. The rate of lexical processing is also restricted by the inherent limitations of memory and the speed with which the printed form of a word can be used to access its lexical representations (i.e., information about a word's pronunciation, meaning, syntactic category). Current estimates based on both eye movements (Reingold, Reichle, Glaholt, & Sheridan, 2012; Schilling, Rayner, & Chumbley, 1998) and electrophysiological measures (Reichle, Tokowicz, et al., 2011b; Sereno, Rayner, & Posner, 1998) indicate that lexical processing requires 120–180 ms to complete, but that this time is modulated by a variety of different cognitive variables. For example, words that frequently occur in printed text are processed and identified more rapidly than words that occur less frequently (Inhoff & Rayner, 1986; Just & Carpenter, 1980; Kliegl et al., 2006; Rayner, Ashby, Pollatsek, & Reichle, 2004; Rayner & Duffy, 1986; Schilling et al., 1998). Similarly, words that are predictable from their preceding sentence context (Balota, Pollatsek, & Rayner, 1985; Ehrlich & Rayner, 1981; Rayner & Well, 1996; Rayner et al., 2004) or that are acquired at an early age (Juhasz & Rayner, 2006) are also processed and identified more rapidly than words that are unpredictable and/or acquired later. Together, these cognitive constraints provide an upper bound on the maximal rate of reading; if one assumes that the goal of reading is to understand whatever is being read, and if both the perceptual span and the rate of lexical processing delimit how many words can be processed per unit time, then the maximal rate of reading with comprehension is predicted to be somewhere on the order of 300–400 words per minute (depending on the skill of the reader, the difficulty of the material being read, etc.; Rayner & Pollatsek, 1989).
This prediction closely corresponds to the reading rates that are observed with skilled (college-level) readers, who routinely read at a rate of 250–350 words per minute (Rayner, 1979).
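The bound above can be checked with back-of-envelope arithmetic. The figures below are our own simplification, assuming strictly serial word-by-word reading with no skipping and no parafoveal preview: each word costs one lexical-access time (120–180 ms, per the estimates above) plus one saccade (25–40 ms).

```python
lexical_ms = (120, 180)   # fastest and slowest lexical access per word
saccade_ms = (25, 40)     # fastest and slowest saccade duration

# 60,000 ms per minute divided by the per-word cost
fastest = 60_000 / (lexical_ms[0] + saccade_ms[0])
slowest = 60_000 / (lexical_ms[1] + saccade_ms[1])
print(f"{slowest:.0f}-{fastest:.0f} words per minute")  # roughly 270-410 wpm
```

The result brackets the 300–400 words-per-minute range cited above; skipping short words raises the bound somewhat, while refixations and regressions lower it.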

Of course, the aforementioned constraints reflect the “front end” of the perceptual and cognitive systems that are involved in reading; there are also constraints on the “back end” that reflect the inherent limitations of the oculomotor system. For example, saccades are not executed instantaneously but are instead programmed in two stages, each of which requires some amount of time to complete. Becker and Jürgens (1979) first demonstrated these facts about saccadic programming in an eye-movement experiment using the double-step paradigm. Each trial of this experiment began with participants fixating a cross that was displayed in the center of a computer monitor. A dot was subsequently displayed after a variable-length time interval at random locations on the monitor. The participants' task was simply to move their eyes as quickly as possible to the location of the dot. However, on some proportion of the trials, the dot would disappear and instantaneously reappear at a second location. By varying the time interval between when the first and second dot appeared, Becker and Jürgens were able to make the following observations: If the interval was short, the participants moved their eyes directly to the location of the second dot, but if the interval was long, the participants moved their eyes to the location of the first dot and then quickly moved their eyes to the location of the second dot. On the basis of this observation, Becker and Jürgens surmised that saccades are programmed in two stages—an initial labile stage that can be canceled by the initiation of a second saccadic program (as would occur if a second dot appeared soon after the first), followed by a second, non-labile stage in which the saccade is obligatorily executed (as would occur if the second dot appeared some time after the first). 
Based on these results, it is possible to subtract out the time needed for information about the location of the dot to be propagated from the eyes to the brain (50–60 ms; Clark, Fan, & Hillyard, 1995; Foxe & Simpson, 2002; Mouchetant-Rostaing, Giard, Bentin, Aguera, & Pernier, 2000; Van Rullen & Thorpe, 2001) to estimate the durations of the labile and non-labile stages of saccadic programming as being 75–125 and 25–50 ms, respectively. These estimates correspond closely to estimates derived from reading, where minimum saccadic latencies are estimated to be 100–175 ms (Rayner, Slowiaczek, Clifton, & Bertera, 1983; Reingold et al., 2012).

Another set of oculomotor constraints has to do with the simple fact that saccades—like all motor movements—are not completely accurate but are instead subject to motor error. Furthermore, this motor error is of two types: random and systematic. The random error causes saccades to deviate from their intended targets (i.e., the preferred-viewing location, which is to the left of the center of the target word; Rayner, 1979) in a manner that causes the fixation landing-site distributions on words to resemble truncated Gaussian distributions, with missing “tails” that reflect instances where the eyes either under- or overshot their intended targets (McConkie, Kerr, Reddix, Zola, & Jacobs, 1989; McConkie et al., 1988, 1991; Rayner & Fischer, 1996; Rayner, Sereno, & Raney, 1996). Some proportion of these under- and overshoots also reflect contributions of the second, systematic source of saccadic error. This systematic error reflects the “tuning” of the oculomotor system so that it “prefers” to make saccades of a particular length. For example, the oculomotor systems of readers of English prefer to make saccades approximately 7 character spaces in length; saccades that are intended to be longer than this often undershoot their targets, whereas saccades that are intended to be shorter than this often overshoot their targets.
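The two error components described above can be illustrated with a small simulation. The parameterization below is our own toy example, not a published model's equations: a saccade of intended length d is pulled toward a hypothetical “preferred” length of 7 characters (systematic error) and then perturbed by Gaussian noise (random error).

```python
import random

PREFERRED = 7.0   # preferred saccade length, in character spaces

def executed_saccade(intended, pull=0.3, noise_sd=1.2):
    systematic = pull * (PREFERRED - intended)   # long saccades undershoot,
    random_err = random.gauss(0, noise_sd)       # short ones overshoot
    return intended + systematic + random_err

random.seed(0)
mean_long = sum(executed_saccade(10) for _ in range(10_000)) / 10_000
mean_short = sum(executed_saccade(4) for _ in range(10_000)) / 10_000
print(round(mean_long, 1), round(mean_short, 1))  # 10 shrinks toward 7; 4 grows toward 7
```

Truncating the resulting landing sites at the target word's boundaries (fixations beyond them land on a neighboring word) would produce the truncated-Gaussian landing-site distributions described above.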

Finally, it is also known that visual and oculomotor constraints interact with cognitive constraints and the goals of the reader (e.g., to read as rapidly as possible while maintaining some minimal level of comprehension). For example, when moving the eyes from one word to the next, the location of the initial fixation on that word will affect both the duration of that fixation and whether the next saccade will be directed to a second viewing location on that word or to some other location (e.g., the next word). This complex interaction of when and where the eyes move was first demonstrated by Vitu, O'Regan, Inhoff, and Topolski (1995; see also Kliegl et al., 2006; Nuthmann, Engbert, & Kliegl, 2005; Rayner & Fischer, 1996; Vitu, McConkie, Kerr, & O'Regan, 2001) in an analysis of fixation durations contingent upon their locations. This analysis showed that the initial fixations on a word tended to be longer in duration if they were located near the center of a word, so that a plot of fixation durations by their location resembles an inverted-U shaped function. This inverted-optimal viewing position (IOVP) effect was initially somewhat paradoxical because it had previously been shown that words displayed in isolation are identified most rapidly if they are displayed in the center of the visual field (i.e., making the word center the optimal viewing position for rapid identification; O'Regan & Lévy-Schoen, 1987). However, readers are also more likely to make refixations following initial fixations near either end of a word, suggesting that an initial fixation near the center of a word is often sufficient to identify the word during a single (long) fixation, but that an initial fixation near either end of a word is likely to be followed by a rapid corrective saccade to move the eyes closer to the center of the word because this location affords more rapid lexical processing. 
Because of saccadic error, however, some portion of those corrective saccades cause the eyes to overshoot their intended target and fixate one of the spatially adjacent words, thereby producing an IOVP effect even if a word is fixated only once (Nuthmann et al., 2005).

Another example of how cognition interacts with the oculomotor and visual systems is related to word-skipping costs, or the purported finding that fixations immediately before or after a word has been skipped tend to be longer in duration than fixations before or after another fixated word (see Kliegl & Engbert, 2005). The cost associated with having skipped the previous word is uncontroversial and reflects visual acuity: because more parafoveal processing of word n will typically be completed from word n − 1 than from word n − 2, the mean fixation durations on word n will (on average) be shorter when word n − 1 is fixated than when it is skipped (Kliegl et al., 2006; Rayner & Duffy, 1986).

The cost associated with skipping the upcoming word is less well understood (see Reichle & Drieghe, in press). Such costs were originally predicted by serial-attention models (e.g., E-Z Reader; Reichle et al., 1998) because these models make the assumption that a “decision” to move the eyes from word n to word n + 2 (thereby skipping word n + 1) necessitates the cancelation of the saccadic program that would have otherwise moved the eyes to word n + 1, which increases the fixation duration on word n because of the additional time that is needed to first cancel and then reinitiate saccadic programming. This prediction has subsequently been both confirmed (Pollatsek, Rayner, & Balota, 1986; Pynte, Kennedy, & Ducrot, 2004; Rayner et al., 2004; Reichle et al., 1998) and disconfirmed (McConkie, Kerr, & Dyre, 1994; Radach & Heller, 2000). However, a recent study by Kliegl and Engbert (2005) provided evidence that this type of word-skipping cost may be modulated by properties of the skipped word; their analyses of a large corpus of eye-movement data indicated that skipping cost was pronounced when the skipped word was either long or infrequent but actually became facilitation (i.e., shorter fixation durations prior to skipping) when the skipped word was short or frequent. The reason for this interaction remains unclear; although it has been simulated using the SWIFT model of eye-movement control (Engbert & Kliegl, 2011), a precise account of exactly why the model generates this result has not yet been provided (see Schad & Engbert, 2012, p. 412). And although the pattern has also been partially simulated using the E-Z Reader model (Reichle & Drieghe, in press), the model failed to generate the skipping facilitation that was reported by Kliegl and Engbert (2005) for short and frequent words. Thus, it is fair to say that the precise reasons for skipping cost and benefit remain relatively poorly understood.

Controversies such as the one surrounding word-skipping cost have generated a considerable amount of debate about the nature of eye-movement control during reading. This debate has both resulted in and been spurred on by the recent development of (formal) computational models that simulate the patterns of eye movements that are observed during reading, and that provide precise explanations for many of the phenomena that are observed with readers' eye movements (for reviews, see Reichle et al., 2003, or the 2006 special issue of Cognitive Systems Research). The development of these models has also brought into focus the following two basic questions: First, to what extent does cognition determine when the eyes move during reading? And second, how is attention allocated to support lexical processing during reading? In the remainder of this section, we shall briefly discuss both of these questions.

It has long been known that the moment-to-moment “decisions” about where to move the eyes are largely independent of the “decisions” about when to move the eyes. This dissociation was demonstrated, for example, in a moving-window experiment in which two variables were independently varied from fixation to fixation—the size of the moving window and the onset time of text within the window (Rayner & Pollatsek, 1981). (The text-onset time was manipulated by initially displaying strings of Xs within the window at the beginning of each fixation for some variable amount of time.) The key finding from this experiment was that the two manipulations had independent effects on readers' eye movements: Changing the size of the window caused the saccade lengths to vary from one fixation to the next but did not affect the fixation durations, whereas changing the text-onset delay caused the fixation durations to vary from one fixation to the next but did not affect the saccade lengths. As these results indicate that decisions about where versus when to move the eyes are made largely independently of each other, and because there is consensus that visual and oculomotor constraints largely determine where the eyes move, the question that has yet to be answered to everyone's satisfaction is: What determines when the eyes move?

Opinions about how to answer this question can be divided into two theoretical camps (Rayner & Pollatsek, 1981; Rayner et al., 1996). The first of these camps, which advocates oculomotor-control theories, maintains that the operating characteristics of the oculomotor system in conjunction with visual constraints largely determine when the eyes move (O'Regan, 1992; Reilly & O'Regan, 1998; Suppes, 1994; Yang & McConkie, 2001). For example, by one such account, decisions about when to move the eyes are mainly determined by global parameters that can be adjusted to reflect the overall difficulty of the text being read (O'Regan, 1992). In contrast, the second camp, which advocates cognitive-control theories, maintains that the decisions about when to move the eyes are made locally, usually with the assumption that the completion of some stage of linguistic processing (e.g., lexical access) causes the eyes to move through the text (Just & Carpenter, 1980; Rayner & Pollatsek, 1989). Thus, by this latter account, the mind and eyes are tightly coupled, with cognition exerting very rapid control over when the eyes move.

Although this first theoretical question has not been completely resolved, five of the seven existing computational models of eye-movement control during reading are variants of cognitive-control theories (EMMA: Salvucci, 2001; E-Z Reader: Reichle et al., 1998, 2003, 2009; Reichle, Pollatsek, et al., 2012b; Glenmore: Reilly & Radach, 2006; SERIF: McDonald, Carpenter, & Shillcock, 2005; SWIFT: Engbert et al., 2002, 2005; Schad & Engbert, 2012). It is also fair to say that these cognitive-control models explain how both lower level variables (e.g., visual acuity, oculomotor constraints) and higher level variables (e.g., word frequency, predictability) influence readers' eye movements, whereas the existing oculomotor-control models (Competition-Interaction: Yang, 2006; SHARE: Feng, 2006) do not, but instead only explain the influence of lower level variables. This, in conjunction with the extremely large number of demonstrations that higher level variables do affect when the eyes move (for reviews, see Clifton, Staub, & Rayner, 2007; Rayner, 1998, 2009), has meant that the burden of proof has now shifted to advocates of the oculomotor-control theories to justify their theoretical stance.

As the second unanswered theoretical question (i.e., regarding how attention is allocated during reading) is only relevant if one assumes that lexical processing (which requires attention) is somehow involved in eye-movement control, this second question has mainly been of interest to proponents of cognitive-control models (Reichle et al., 2009). Within this group, opinions about whether attention is allocated in a serial or parallel manner are fairly evenly divided: Two of the cognitive-control models (EMMA: Salvucci, 2001; E-Z Reader: Reichle et al., 1998, 2003, 2009; Reichle, Liversedge, et al., 2012; Reichle, Pollatsek, et al., 2012b) are consistent with the serial view of attention (for a review, see Reichle, 2011), and two of the models (Glenmore: Reilly & Radach, 2006; SWIFT: Engbert et al., 2002, 2005; Schad & Engbert, 2012) are consistent with the parallel or gradient view of attention (see Engbert & Kliegl, 2011). Because both classes of models do equally well explaining the same phenomena despite the fact that they obviously do so using very different assumptions about how attention is allocated, the models themselves provide no basis for favoring one view of attention allocation over the other. As a result, this theoretical stalemate has motivated a number of attempts to decide the issue empirically, using predictions derived from the models.

For example, one prediction of the gradient models is that lexical properties of a parafoveal word (i.e., word n + 1) can influence the decision about when to move the eyes from the fixated word (i.e., word n), purportedly giving rise to parafoveal-on-foveal effects. Attempts to demonstrate such effects have produced mixed results: While some analyses of eye-movement data have shown such effects (Kennedy & Pynte, 2005; Kliegl et al., 2006), others have not (Drieghe, Rayner, & Pollatsek, 2007; for a review, see Rayner & Juhasz, 2004). Although the precise reasons for these discrepancies remain unclear, several explanations for why parafoveal-on-foveal effects might sometimes be observed because of artifacts (e.g., as a result of mis-located fixations) have been put forward (e.g., see Rayner, White, Kambe, Miller, & Liversedge, 2003). However, rather than re-stating the arguments surrounding this debate (for a review, see Drieghe, 2011), we will adopt a completely different approach in the remainder of this article by focusing on the computational principles and information-processing constraints that might determine how attention is allocated during reading.

As indicated in the first section of this article, we will do this by using artificial reading agents that are capable of learning to allocate attention in whatever manner supports efficient reading. This approach is inherently agnostic with respect to the serial-versus-parallel dichotomy because the agents are in principle capable of learning to allocate attention dynamically (i.e., changing back and forth from serial to parallel processing as the local constraints of the text change) if doing so turns out to be the optimal way to read. Our goal in running these simulations was thus to gain a better understanding of the conditions that foster the serial versus parallel allocation of attention during reading so that these insights might ultimately be used to inform our understanding of this issue. That being said, we shall now describe the artificial reading agents that were used to complete the simulations reported below.

2. Artificial reading agents

Artificial reading agents are systems that are capable of using reinforcement to learn both when and where to move their eyes and attention to “read” as efficiently as possible. With the exception that the artificial reading agents were allowed to concurrently attend to 1–4 spatially adjacent words, the other theoretical assumptions used in the simulations reported below were as similar as possible to those used in previously reported simulations (Liu & Reichle, 2010; Reichle & Laurent, 2006; Reichle, Liu, et al., 2011a). The remainder of this section will therefore describe those assumptions about the agents that are common to all of the simulations reported below; the assumptions that are specific to each simulation are then described with its corresponding results.

The agents were implemented within artificial neural networks that used reinforcement learning (Sutton & Barto, 1998) to learn about both the different information-processing states that the agents could be in while reading and the different action(s) that the agents could execute to read as efficiently as possible. These states were defined by 20 variables that indicated the status of various ongoing perceptual, cognitive, and motoric processes during any time step (see Table 1). As the table indicates, this information includes the length of the attended word, the number of time steps spent programming a saccade, the precise manner in which attention is allocated, and so on. Furthermore, during any given time step, the agents could execute three possible actions: (a) initiate saccadic programming (which requires three time steps to complete) to move the eyes 1–10 character spaces left or right of the current fixation location; (b) select one of 10 possible attention “gradients” (see Fig. 1) for lexical processing and continue processing using that gradient; or (c) continue processing whatever word(s) is (are) currently being attended. (For the sake of simplicity, we assumed that attention gradients could be changed instantaneously, but that such changes could only occur once per time step.)
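To make the size of the resulting state and action spaces concrete, the sketch below encodes them in code. This is our own illustration: the field names paraphrase Table 1 and are not the authors' actual variable names, and the discrete action enumeration is one plausible reading of the three action families described above.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    word_lengths: tuple       # lengths of words n - 1 .. n + 4 (variables 1-6)
    identified: tuple         # words n .. n + 3 identified? (variables 7-10)
    steps_processing: tuple   # time steps spent on words n .. n + 3 (11-14)
    window_size: int          # attention window size (15)
    window_center: int        # attention window center (16)
    center_to_fixation: int   # chars between attention center and fixation (17)
    programming: bool         # saccade currently being programmed? (18)
    program_steps: int        # time steps spent programming it (19)
    program_length: int       # programmed saccade length in characters (20)

# The three action families available on each time step:
ACTIONS = (
    [("saccade", d) for d in range(-10, 11) if d != 0]  # move 1-10 chars left/right
    + [("set_gradient", g) for g in range(10)]          # adopt 1 of the 10 gradients
    + [("continue", None)]                              # keep processing as-is
)
print(len(ACTIONS))  # 31 discrete actions in this sketch
```

Even with this modest action set, the joint state space defined by the 20 variables is far too large to tabulate, which is why (as discussed below) the agents had to be implemented as neural networks rather than lookup tables.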

Table 1. The perceptual, cognitive, and motoric variables that jointly determine the states available to the reading agents
No.  Information                                                Category    Variable Type
1    Word n − 1 length?                                         Perceptual  Integer
2    Word n length?                                             Perceptual  Integer
3    Word n + 1 length?                                         Perceptual  Integer
4    Word n + 2 length?                                         Perceptual  Integer
5    Word n + 3 length?                                         Perceptual  Integer
6    Word n + 4 length?                                         Perceptual  Integer
7    Word n identified? (Y/N)                                   Cognitive   Boolean
8    Word n + 1 identified? (Y/N)                               Cognitive   Boolean
9    Word n + 2 identified? (Y/N)                               Cognitive   Boolean
10   Word n + 3 identified? (Y/N)                               Cognitive   Boolean
11   No. time steps processing word n?                          Cognitive   Integer
12   No. time steps processing word n + 1?                      Cognitive   Integer
13   No. time steps processing word n + 2?                      Cognitive   Integer
14   No. time steps processing word n + 3?                      Cognitive   Integer
15   Attention window size?                                     Cognitive   Integer
16   Attention window center?                                   Cognitive   Integer
17   No. characters between center of attention and fixation?   Cognitive   Integer
18   Saccade being programmed? (Y/N)                            Motoric     Boolean
19   No. time steps programming saccade?                        Motoric     Integer
20   No. of characters of programmed saccade?                   Motoric     Integer
Figure 1.

The 10 attention gradients that reading agents can use to process words. The gradients vary in how many words they encompass and in how processing is distributed across word positions. Panel A shows the gradients used in Simulations 1A and 1B, with the word number and attention window center (see variables 16 and 17 in Table 1) corresponding to the gradient modes. Panel B shows the gradients used in Simulation 2, with centers indicated by white asterisks.

Given the dimensionality (and hence complexity) of the task that the agents had to learn, it was necessary to implement the agents as artificial neural networks because such systems are well suited to learning complex tasks. However, our approach required us to develop a new method for training the networks: Networks normally require an explicit “teaching signal” that compares the network's behavior to some target behavior so that the network's connections can then be adjusted to reduce the discrepancy between the two (Rumelhart, Hinton, & Williams, 1986). This type of error-driven learning was not possible with our reading agents because our main goal was to observe how the agents learned to perform their task, rather than requiring them to perform the task in some specific way. It was therefore necessary to develop a computational framework for specifying how complex behaviors that seemingly require error-driven learning might instead be acquired via simple reinforcement (Sutton & Barto, 1998). This framework includes algorithms that specify the evolution of artificial neural network topologies capable of learning large-scale, complex problems using only information about the quality of a network's performance.

Fig. 2 is a schematic diagram illustrating our algorithms for implementing the macro- and microscopic evolution that is necessary to generate artificial network topologies capable of solving large, complex problems via reinforcement learning. (A more detailed description of this process is available in the Supplemental Materials.) The process starts by generating a population of simple genomes that express themselves as individual networks (i.e., as phenotypes). Each network is then trained on the same problem using the residual-gradient reinforcement-learning algorithm to adjust the network's connection weights. Each network's performance (i.e., fitness) is then evaluated by summing the reward that it received. If the overall fitness of the population fails to improve (i.e., stagnates), then a microscopic evolutionary process is used to “nudge” the networks toward a better solution by increasing the mutation rate via simulated annealing (i.e., CMA-ES; Hansen, 2006; Hansen & Kern, 2004; Hansen, Müller, & Koumoutsakos, 2003; Hansen & Ostermeier, 2001; Igel, Hansen, & Roth, 2007; Suttorp, Hansen, & Igel, 2009). Finally, the individual networks generate the next generation via cloning of the fittest individual and via both mutation and crossover. This entire evolutionary process was repeated for each simulation reported below, with the fittest individual from each such effort being selected and included in the sample of reading agents that was evaluated.1
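The evolutionary loop just described (train each network via reinforcement learning, evaluate fitness, increase mutation when fitness stagnates, then reproduce from the fittest individual) can be sketched as follows. This is a minimal illustration under strong simplifying assumptions: the residual-gradient RL step and CMA-ES are replaced by placeholder functions, and all names are hypothetical:

```python
import random

def train_with_rl(network):
    """Placeholder for residual-gradient reinforcement learning on one network."""
    network["weights"] = [w + random.gauss(0, 0.01) for w in network["weights"]]

def fitness(network):
    """Placeholder for summed reward on the reading task."""
    return -sum(abs(w) for w in network["weights"])

def evolve(pop_size=8, generations=20, stagnation_limit=3):
    # Start from a population of simple random "genomes."
    population = [{"weights": [random.gauss(0, 1) for _ in range(4)]}
                  for _ in range(pop_size)]
    best, stagnant = float("-inf"), 0
    for _ in range(generations):
        for net in population:
            train_with_rl(net)                  # train each network via RL
        scored = sorted(population, key=fitness, reverse=True)
        top = fitness(scored[0])
        if top <= best:                         # fitness stagnated?
            stagnant += 1
        else:
            best, stagnant = top, 0
        # "Nudge": raise the mutation rate when the population stagnates.
        sigma = 0.5 if stagnant >= stagnation_limit else 0.1
        elite = scored[0]                       # clone the fittest individual
        population = [elite] + [
            {"weights": [w + random.gauss(0, sigma) for w in elite["weights"]]}
            for _ in range(pop_size - 1)        # mutation-based offspring
        ]
    return max(population, key=fitness)
```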

Figure 2.

A schematic diagram of the evolutionary process used to generate the artificial network reading agents. The top box shows the population, which consists of N species of genotypes and their resulting phenotype networks. Each generation is trained on a problem using the residual-gradient reinforcement-learning algorithm (Baird, 1995). A micro-evolution algorithm (e.g., CMA-ES; Hansen, 2006; Hansen & Kern, 2004; Hansen et al., 2003; Hansen & Ostermeier, 2001; Igel et al., 2007; Suttorp et al., 2009) is then used to “nudge” the network connection weights so as to improve the networks' overall fitness. Finally, the reproduction procedure produces the next generation via mutation, crossover, and cloning, determined probabilistically. For a complete description, see the Supplemental Materials.

3. Simulations

3.1. Simulation 1A

3.1.1. Method

In the first simulation, the artificial reading agents were allowed to use any one of 10 possible attention gradients during any given unit of time. As Fig. 1A shows, these attention gradients were approximately Gaussian in shape, but to keep the simulations tractable, it was necessary to split the “continuous” attention distributions into approximately discrete “bins” corresponding to individual words and to assume that attention was allocated in a spatiotopic rather than retinotopic manner. As Fig. 1A indicates, these attention gradients allowed the agents to process a single word at a maximal rate (i.e., 24 “units” of lexical processing per time step) or to process 2–4 words concurrently with a “peak” processing rate and lesser rates to the left and/or right (e.g., gradient #2 simultaneously allows 22 “units” of processing for one word and 2 “units” of processing for a second word). And because attention is distributed in a spatiotopic manner, words that differ in terms of their lengths and thus their identification times are nonetheless processed at the same rate. Subject to these constraints, the agents were able to regulate their attention resources, allocating them in whatever manner supported maximally efficient word processing.
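For illustration, the discrete gradients might be encoded as tuples of per-word processing rates that always sum to the maximal 24 “units” per time step. Only gradients #1 and #2 are specified numerically in the text; the values shown for gradient #10 are our invention, constrained only by the description that most processing is directed toward the right-most of four words:

```python
# Hypothetical encoding of three of the 10 discrete attention gradients.
# Each tuple gives per-word processing rates (units per time step) and
# sums to the maximal rate of 24 units.
GRADIENTS = {
    1: (24,),            # strict serial: all 24 units on a single word
    2: (22, 2),          # mostly one word, slight processing of a second
    10: (2, 4, 6, 12),   # four words, peak on the right-most (values assumed)
}

# Every gradient allocates exactly 24 units of processing per time step.
assert all(sum(rates) == 24 for rates in GRADIENTS.values())
```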

Of course, the agents also had to learn to contend with several other constraints on how they performed the task. The first was limited visual acuity: A word that required t time steps to identify when fixated centrally required one additional time step for each character space between the center of vision (i.e., the fixation location) and the center of the word. The second constraint was saccadic error, which was sampled from a Gaussian distribution with μ = 0 and σ = 1 character spaces. Finally, saccades required three time steps to program and one time step to execute. (Saccades that were initiated at time step t were thus obligatorily executed at time step t + 3.) Lexical processing was allowed to continue during saccade execution.
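These three constraints can be sketched as follows. The function names are hypothetical, but the numerical assumptions (one extra time step per character of eccentricity, Gaussian saccadic error with μ = 0 and σ = 1, and 3 + 1 time steps to program and execute a saccade) follow the text:

```python
import random

def identification_time(base_steps, fixation_pos, word_center):
    """Acuity limit: +1 time step for each character space between the
    fixation location and the center of the word."""
    return base_steps + abs(fixation_pos - word_center)

def saccade_landing(target_pos):
    """Saccadic error: landing site = target + Gaussian noise (sd = 1 char)."""
    return target_pos + random.gauss(0, 1)

SACCADE_PROGRAM_STEPS = 3   # time steps to program a saccade
SACCADE_EXECUTE_STEPS = 1   # time steps to execute (processing continues)
```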

The simulations were completed using 16 reading agents and the 20 “sentences” originally used by Reichle and Laurent (2006). Each sentence comprised random permutations of eight 1-, 3-, 5-, and 7-letter “words,” with the exception that the first and last words in each sentence were 1-letter words that were excluded from the analyses because the processing of those words started and ended abruptly. Each 1-, 3-, 5-, and 7-letter word, respectively, required 48, 144, 240, and 336 processing “units” to identify when the word was fixated from its central character. For example, assuming that an agent had fixated a 1-letter word and had “decided” to attend to and process only that word (i.e., using gradient #1 in Fig. 1A), that word would require exactly two time steps to identify.
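The identification times follow directly from these values: with gradient #1 allocating the full 24 units per time step, a centrally fixated word requires its total processing units divided by 24 time steps to identify. A worked check:

```python
# Processing units required to identify each word length when fixated
# centrally (from the text: 48, 144, 240, and 336 units).
UNITS = {1: 48, 3: 144, 5: 240, 7: 336}

MAX_RATE = 24  # units of lexical processing per time step (gradient #1)

# Time steps to identify each word length under strict serial attention.
steps = {length: units // MAX_RATE for length, units in UNITS.items()}
# steps == {1: 2, 3: 6, 5: 10, 7: 14}
```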

To examine the agents' behaviors, each agent was first trained on five randomly selected sentences; its behavior was then evaluated using the full set of sentences. The agents were “motivated” using rewards and punishments; an agent received +1 for each word identified and −1 for each time step that was spent processing a sentence.2 Critically, the agents received a reward for identifying each word irrespective of the order in which the words were actually identified. For example, an agent processing words 2–5 in parallel might identify word 5 before the other three, and in that situation the agent would immediately receive its reward for having identified word 5. This assumption about the reward structure is critical because it avoids any bias in favor of serial processing that might result from, for example, only administering rewards for words that were identified in their correct canonical order. However, it is also worth mentioning that the decision to relax the constraint of having to identify words in order is on some level “stacking the deck” in favor of parallel processing because it ignores one of the fundamental constraints imposed by higher level language processing (e.g., syntactic parsing) during reading: the need to maintain the order of the words in a sentence so that a representation of the sentence can be incrementally constructed from those words (Pollatsek & Rayner, 1999; Rayner, Pollatsek, Liversedge, & Reichle, 2009). Despite this, the decision to allow agents to identify words out of order was adopted because it provided an unbiased test of the relative efficacy of serial versus parallel lexical processing in the context of basic visual and oculomotor constraints.
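A minimal sketch of this reward scheme, with a hypothetical function name: +1 per word identified (in any order) and −1 per time step spent on the sentence.

```python
def step_reward(n_words_identified_this_step):
    """Reward accrued on a single time step: +1 for each word identified
    during this step (regardless of order), -1 for the time step itself."""
    return n_words_identified_this_step - 1
```

Note that under this scheme an agent that identifies no word on a given step always incurs a net cost, which is what pressures the agents to read efficiently.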

The simulation results reported next are based on various mean measures of performance, including both standard eye-movement measures (Inhoff & Radach, 1998) and measures of how the agents allocated attention.

3.1.2. Results

Fig. 3A shows the fixation landing-site distributions on words as a function of their length. As the figure shows, the agents, like humans (O'Regan, 1992; Rayner et al., 1996), directed their eyes toward the centers of words because this viewing location afforded maximal visual acuity and hence a maximal rate of lexical processing. And also similar to what is observed with humans (McConkie et al., 1988; Rayner et al., 1996), the simulated landing-site distributions were approximately normal in shape, reflecting the fact that saccades were subject to random error.

Figure 3.

Mean first-fixation landing-site distributions on words as a function of word length. (The error bars show the standard errors of the means.) The position labeled “0” on the x-axis corresponds to the blank space to the left of the words. Panels A-C, respectively, show the results for Simulations 1A, 1B, and 2.

Fig. 4A shows the probabilities of making refixations on words of various lengths as a function of the initial fixation locations on those words; Fig. 5A shows the locations of those refixations. In contrast to what has most often been observed with humans (e.g., see Vitu et al., 1995, 2001), the agents were not more likely to initiate a refixation following an initial fixation near either the beginning or the end of a word; instead, they exhibited only the first of these patterns, being more likely to initiate a refixation that moved their eyes to the end of a word following an initial fixation near the beginning of that word. Despite this apparent discrepancy, it is important to note that this pattern has occasionally been observed with adults reading short words (e.g., Rayner & Fischer, 1996; Fig. 4; McConkie et al., 1991; Fig. 7) and/or easy text (Joseph, Liversedge, Blythe, White, & Rayner, 2009; Fig. 2), and for native readers of Chinese (Li, Liu, & Rayner, 2011; Fig. 5; Yan, Kliegl, Richter, Nuthmann, & Shu, 2010; Fig. 5). In the context of these studies, the pattern of directing refixations toward the ends of words has been interpreted as allowing the reader to maintain the forward “momentum” of his or her eyes through the text (e.g., see Reichle, Pollatsek, et al., 2012b).

Figure 4.

Mean probabilities of making refixations as a function of first-fixation location and word length. (The error bars show the standard errors of the means.) The position labeled “0” on the x-axis corresponds to the blank space to the left of the words. Panels A–C, respectively, show the results for Simulations 1A, 1B, and 2.

Figure 5.

Mean refixation landing-site distributions on words as a function of word length. (The error bars show the standard errors of the means.) The position labeled “0” on the x-axis corresponds to the blank space to the left of the words. Panels A–C, respectively, show the results for Simulations 1A, 1B, and 2.

For example, because Chinese words are comprised of 1–4 characters that are not demarcated by word boundaries (i.e., blank spaces do not separate individual words), this strategy of directing refixations forward may allow readers to minimize the probability of having to make an interword regression because of a failure to segment a fixated word from the line of characters in which it is embedded. In other words, by moving the eyes to the beginning of the next few characters that have not already been segmented as part of a word, it is possible to move the eyes forward either a short distance (i.e., refixate the currently fixated word) if that word proves difficult to segment, or a longer distance (i.e., to the next few characters that have not been segmented) if the word is identified; either way avoids the need to make interword regressions. Such regressions are time consuming (and hence costly to the agents) because less parafoveal processing of upcoming (unidentified) words is possible following a regressive saccade, as the eyes will be further from those words.

Fig. 6A shows three fixation-probability measures for words as a function of their length: the mean probabilities of skipping a word, Pr(Skip), fixating it exactly once, Pr(1 Fixation), and fixating it two or more times, Pr(2+ Fixation). Again, consistent with what is observed with humans (Kliegl et al., 2006; Rayner & Duffy, 1986; Rayner & Fischer, 1996; Rayner et al., 1996; Schilling et al., 1998; Vitu et al., 1995), the agents tended to fixate the longer, difficult-to-identify words more often than the shorter, easy-to-identify words and skipped the latter more often than the former (Brysbaert & Vitu, 1998). This result indicates that the agents' decisions about whether to look at a given word were influenced by the difficulty associated with processing that word; that is, the agents' decisions about where to move their eyes were based on local processing difficulty.

Figure 6.

Mean probability of fixating exactly once, Pr(1 Fixation), two or more times, Pr(2+ Fixation), or skipping, Pr(Skip), words as a function of word length. (The error bars show the standard errors of the means.) Panels A–C, respectively, show the results of Simulations 1A, 1B, and 2.

Fig. 7A shows several mean fixation-duration measures on words as a function of their length, including the first-fixation duration (FFD), or duration of the first fixation on a word conditional upon it occurring during the first pass through the text; gaze duration (GD), or the sum of all first-pass fixations on a word; and total-viewing time (TT), or the sum of all fixations, irrespective of whether they occurred during the first pass. As indicated, the durations of these measures increased with word length because longer words required more time to identify than shorter words, as indicated by the mean word-identification times (ID). This result is consistent with what is observed with humans, who spend more time looking at long than short words (Rayner et al., 1996). However, it is important to note that our word-length manipulation was also a proxy for any lexical variable (e.g., how often a word occurs in printed text; see Rayner, 1998, 2009) known to modulate fixation-duration measures. As such, the agents were capable of modulating their eye movements so as to fixate words only as long as necessary to identify them. The explanation for this capability is also evident in Fig. 7A, which shows the familiarity-check times (FCT), or the time spent attending a fixated word prior to initiating a saccadic program to move the eyes off of that word.3 The fact that this measure is systematically shorter in duration than the identification times indicates that the agents learned the relationship between word length and the time required to identify words of varying length, and then used this knowledge to initiate saccadic programming so that the eyes moved off of a word immediately after it had been identified. As discussed below, this strategy is optimal because it minimizes fixation durations. It also indicates that the agents' decisions about when to move their eyes were affected by local processing difficulty.
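The learned familiarity-check strategy can be expressed compactly: given the three-time-step programming latency from the Method section, saccadic programming should begin three time steps before a word's expected identification time, so that the eyes leave the word just as identification completes. The function name here is ours, a sketch rather than the model's actual code:

```python
SACCADE_PROGRAM_STEPS = 3  # programming latency from the Method section

def familiarity_check_time(expected_id_steps):
    """Time step at which to start programming the outgoing saccade so
    the eyes move off the word immediately after it is identified."""
    return max(0, expected_id_steps - SACCADE_PROGRAM_STEPS)
```

This is why the FCT curves in Fig. 7A sit systematically below the ID curves: the offset is absorbed by the saccadic programming latency.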

Figure 7.

Mean first-fixation durations (FFD), gaze durations (GD), total-viewing times (TT), word-identification times (ID), and familiarity-check times (FCT) on words as a function of word length. (The error bars show the standard errors of the means.) Panels A–C, respectively, show the results of Simulations 1A, 1B, and 2.

Fig. 8A shows the mean first-fixation durations as a function of both their location and the length of the word being fixated. Although this pattern of fixation durations is not completely consistent with the IOVP effect that has often been reported in the literature (e.g., see Nuthmann et al., 2005; Vitu et al., 1995, 2001), it should be noted that such effects are often very erratic (Rayner & Fischer, 1996; Fig. 7) or completely absent (Rayner et al., 1996; Fig. 4). That being the case, the fixation-duration pattern in Fig. 8A is consistent with our earlier account of refixations. That is, with all else being equal, there is (on average) little penalty associated with moving the eyes too soon from the beginning of a word because, although doing so sometimes results in slightly less efficient processing (if the eyes land near the end of the word), it also sometimes actually results in more efficient processing (if the eyes land closer to the word's center). However, there is a substantial potential cost associated with moving the eyes prematurely from the end of a word because a progressive saccade will only move the eyes further from the center of the word (thereby decreasing the rate of lexical processing), whereas a regressive saccade will afford less parafoveal processing of the next (upcoming) word. The differential cost associated with moving the eyes too soon from the beginning versus the end of a word thus affects how the agents move their eyes, making them more likely to rapidly move their eyes from the beginnings of words and thereby causing their first-fixation durations to exhibit the pattern shown in Fig. 8A. This account of IOVP effects is thus remarkably similar to the one offered by Nuthmann et al. (2005).

Figure 8.

Mean first-fixation durations on words as a function of fixation location and word length. (The error bars show the standard errors of the means.) The position labeled “0” on the x-axis corresponds to the blank space to the left of the words. Panels A–C, respectively, show the results of Simulations 1A, 1B, and 2.

Finally, Table 2 shows the mean fixation durations on word n as a function of its length and whether word n + 1 was fixated, “accidentally” skipped due to oculomotor error, or “deliberately” skipped because this action increased the agents' reading efficiency. As the table shows, the agents' fixations were longer in duration prior to skipping than prior to fixating the next word, and the cost associated with accidental skipping was smaller and more sporadic than the cost associated with deliberate skipping. These skipping costs indicate that the local costs (i.e., punishment) that were incurred by the agents when they made longer fixations prior to skipping were offset by the global gains (i.e., rewards) that resulted from having to program and execute fewer saccades (which require time and are prone to error). However, it is important to emphasize that the costs and benefits that occur when human readers skip words are still not well understood and appear to be modulated by lexical variables in a manner that has not been fully explained by existing models of eye-movement control in reading (Reichle & Drieghe, in press; Schad & Engbert, 2012).

Table 2. Mean (along with number of observations and standard error of mean) fixation durations prior to fixating, accidentally skipping, and deliberately skipping the next word for Simulations 1A, 1B, and 2
Sim.  Word n Length  Word n + 1 Fixated  Word n + 1 Accidentally Skipped  Word n + 1 Deliberately Skipped

  1. Asterisks indicate significant skipping costs (*p < .05; **p < .01), and dashes indicate conditions with no observations.


In summary, the results of Simulation 1A indicate that the agents learned both when and where to move their eyes to read as efficiently as possible, thereby replicating previously reported results using artificial reading agents (e.g., see Liu & Reichle, 2010) and providing an example of the complex trade-off that can occur between where and when the eyes move (e.g., an initial fixation near the beginning of a word tends to be rapidly followed by a refixation near the end of that word). More important, however, these results are roughly consistent with the behavior that is observed with (skilled) adult readers and thus provide a basis for interpreting the next set of results, which speak directly to the issue of attention allocation during reading.

Table 3 shows the mean proportion of time that each of the 16 agents employed each of the 10 possible attention gradients. As the table shows, the agents preferred to engage in serial processing approximately 96% of the time. However, approximately 3% of the time, the agents attended to four words, with the majority of the processing resources being directed toward the right-most word (i.e., gradient #10 in Fig. 1A). This happened most often when the agents were processing sequences of (mostly) short words located near the ends of sentences, as evidenced by the fact that, when the agents used gradient #10, they did so when its right-most edge was (on average) less than one character space (M = 0.65, SE = .06) from the end of a sentence. The reason for this behavior will be discussed in the General Discussion.

Table 3. Mean proportion of time that each attention gradient was preferred by each agent in Simulations 1A, 1B, and 2
Sim.       Gradient #1  Gradient #10
1A (Mean)  0.96         0.03
1B (Mean)  0.96         0.03
2 (Mean)   0.96         0.03

  1. Dashes indicate values < 0.01.

3.1.3. Interim discussion

The results of Simulation 1A replicate but also extend previously reported simulations using artificial reading agents (Liu & Reichle, 2010; Reichle & Laurent, 2006). One possible criticism of this work, however, is that the simulations reported above do not demonstrate that the agents' eye-movement behavior generalizes to new sentence materials. As a result, it is possible that the agents' behaviors are specific to the training corpus that was used in our simulation, and that the agents' behavior might be markedly different if tested on new sentences. To evaluate this possibility, Simulation 1B was completed using the same 16 agents used in Simulation 1A, but with a completely new set of 20 sentences that were comprised entirely of words of lengths that the agents had not experienced during training. The goal of Simulation 1B was therefore to determine whether the agents' eye-movement behavior would generalize to sentences on which they had not been trained.

3.2. Simulation 1B

3.2.1. Method

The same 16 artificial reading agents that were used in the previous simulation were used in this simulation. However, this simulation was completed using 20 new sentences containing completely different random permutations of 2-, 4-, 6-, and 8-letter words that, respectively, required 96, 192, 288, and 384 processing “units” to identify when the words were fixated from their centers. The materials thus consisted of different sentences comprised of different words than the ones used during the training of the agents.

3.2.2. Results

To facilitate comparisons between the results of the previous and current simulations, the results of Simulations 1A and 1B are displayed next to each other in Panels A and B (respectively) of Figs. 3-8. Even a casual inspection of these figures is sufficient to see that the two simulations produced almost identical results. That is, as was true in Simulation 1A, the agents' behavior in Simulation 1B allowed them to read efficiently, as evidenced by the fact that the various reported dependent measures are nearly identical across the two simulations. More specifically, although the agents were tested on completely new sentences, they still directed their eyes toward the centers of words (Fig. 3B), and refixations that followed first fixations near the beginnings of words (Fig. 4B) were again directed toward the ends of those words (Fig. 5B). The agents were also more likely to skip short words and refixate long words (Fig. 6B) and initiated saccadic programming so that their eyes moved off of words right as they were identified (Fig. 7B). The complexity of the agents' behavior is also evident in Fig. 8B and Table 2. Fig. 8B indicates that the agents again made shorter first fixations when those fixations landed near the beginnings rather than the ends of words, although there was now some suggestion of the canonical inverted-U shape (at least with 6- and 8-letter words) that is indicative of the IOVP effect (Vitu et al., 1995, 2001). Table 2 shows that the agents also exhibited skipping costs and that these costs were again larger for deliberate than accidental skips. Finally, as Table 3 shows, the agents again showed a strong preference for serial attention allocation, choosing gradient #1 (see Fig. 1A) approximately 96% of the time. And interestingly, as was true in Simulation 1A, the agents again showed a slight preference for gradient #10 over the other multi-word gradients, using it approximately 3% of the time when its right-most edge was approximately one character space (M = 0.62, SE = .09) from the end of a sentence. These results collectively indicate that the agents' behavior on the new test sentences was nearly identical to that exhibited on the training sentences, which demonstrates that the agents' behavior is not limited to the sentences on which they were trained but instead generalizes to new materials.

3.2.3. Interim discussion

Although the results of our first two simulations indicate that the artificial reading agents exhibit a strong preference for serial processing of words, one might possibly object that the simulations do not provide a fair test of the parallel-processing hypothesis because the types of attention gradients that were available to the agents were not representative of those that humans presumably employ during reading. Although it is true that the attention gradients that were used in our simulations were only approximate (i.e., discrete) spatiotopic versions of the more continuous, retinotopic gradients that are posited by attention-gradient models (Engbert et al., 2005; Reilly & Radach, 2006; for a review, see Engbert & Kliegl, 2011), the fact that the agents prefer strict serial processing over any of the other nine gradients that were available to them weakens this possible objection to our simulations. However, because we do recognize that our conclusion that the agents prefer serial to parallel processing is an inductive one (i.e., a general conclusion that is based on a limited number of specific examples), we thought that a second demonstration of this result was warranted, and consequently report the results of Simulation 2.

The logic of this second simulation was to provide an even stronger test of the hypothesis that serial processing affords an advantage (in terms of overall processing efficiency) over parallel processing, even when this parallel processing permits a degree of lexical processing efficiency that is psychologically implausible based on any a priori consideration of what human readers are actually capable of doing (e.g., because of visual acuity limitations). Simulation 2 thus allowed the agents opportunities to engage in an even more parallel type of processing than in Simulations 1A and 1B by allowing them to use attention “gradients” that were distributed in a uniform rather than approximately Gaussian manner (cf. Fig. 1A,B). By allowing this type of uniformly distributed parallel processing, the agents could allocate even more of their attention “resources” toward processing parafoveal words, thereby offsetting most of the cost associated with limited visual acuity and, most important, providing a second, even stronger test of the efficacy of serial versus parallel lexical processing.

3.3. Simulation 2

3.3.1. Method

The basic method was identical to that of Simulation 1A, with the exception that the 16 artificial reading agents could select any one of the 10 uniformly distributed attention “gradients” that are shown in Fig. 1B.
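For illustration, the uniform “gradients” might be encoded as follows, with attention spread evenly over 1–4 adjacent words while still summing to the maximal 24 units per time step. The specific per-word values (beyond the serial gradient) are our assumption, not taken from Fig. 1B:

```python
# Hypothetical encoding of four uniform attention "gradients": attention
# is spread evenly over 1-4 spatially adjacent words, and every gradient
# still allocates exactly 24 units of processing per time step.
UNIFORM_GRADIENTS = {
    1: (24,),          # strict serial processing
    2: (12, 12),       # two words, equal rates
    3: (8, 8, 8),      # three words, equal rates
    4: (6, 6, 6, 6),   # four words, equal rates
}

assert all(sum(rates) == 24 for rates in UNIFORM_GRADIENTS.values())
```

Compared with the approximately Gaussian gradients of Simulations 1A and 1B, these flat distributions give parafoveal words a much larger share of the processing resources, which is precisely what makes Simulation 2 a stronger test of parallel processing.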

3.3.2. Results

The results are presented in exactly the same manner as in Simulations 1A and 1B, and as a direct comparison of Panel C of Figs. 3-8 with their analogs (i.e., Panels A and B of Figs. 3-8, respectively) indicates, the results of this third simulation closely resemble those of the first two. For example, as Fig. 3C shows, the agents tended to direct their eyes toward the centers of words, with the fixation landing-site distributions resembling truncated Gaussians because of saccadic error. Similarly, Figs. 4C and 5C, respectively, show that the agents were more likely to refixate following initial fixations near the beginnings of words, and that these refixations were directed toward the ends of those words. Fig. 6C shows that the agents were more likely to skip short words and more likely to fixate long words, while Fig. 7C shows that the agents required more time to identify the longer words, and that, as was true in Simulations 1A and 1B, the agents learned the relationship between word length and word-identification times and then used this knowledge to initiate saccadic programming in an optimal manner. Fig. 8C shows that the first-fixation durations were modulated by their location, resulting in weak IOVP effects. And as Table 2 shows, the agents again exhibited word-skipping costs. Finally and most important, Table 3 indicates that the agents once again exhibited a strong preference for serial lexical processing, opting to allocate attention to only one word at a time approximately 96% of the time. And as in the previous simulations, the rare occasions when the agents did engage in parallel lexical processing involved the use of gradient #10, which was most often used to process sequences of short words when the right-most edge of the gradient was approximately one character space (M = 0.38, SE = .07) from the end of a sentence.

Thus, despite the fact that the artificial reading agents in Simulation 2 could employ attention gradients that afforded significantly more parafoveal lexical processing than the gradients available to the agents in Simulations 1A and 1B, all of the agents generated the same eye-movement behaviors and showed the same strong preference for strict serial processing of words. Simulation 2 thus provides a direct replication of the first two simulations despite qualitatively different assumptions about the spatial distribution of attention. And critically, Simulation 2 provides additional evidence supporting our inductive conclusion that, irrespective of the precise shape of the attention gradient, it affords less efficient lexical processing than does the strict serial allocation of attention.

3.3.3. Interim discussion

As indicated, the key finding from Simulation 2 is that our artificial reading agents still showed a strong preference to attend to and process one word at a time, despite the fact that the uniformly distributed attention “gradients” actually allowed rapid processing of up to four spatially adjacent words. These results thus suggest that it is not the shape of the attention gradient per se that produced the preference for serial lexical processing; rather, the preference seems to reflect the constraints of the task being performed—that is, the goal of rapidly identifying linear configurations of words in the face of limited visual acuity, minimal saccadic programming latencies, and saccadic error. More will be said about how these and other constraints favor the serial allocation of attention during reading in the final section of this article.

4. General discussion

The simulations reported in this article replicated previous work (Liu & Reichle, 2010; Reichle & Laurent, 2006; Reichle, Liu et al., 2011a) showing that artificial reading agents can learn fairly sophisticated eye-movement behaviors in order to perform the task of rapidly identifying sequences of words during reading. These behaviors include directing the eyes toward the centers of words because this viewing location permits the most rapid identification of words. They also include learning the relationship between word length and the time required to identify words, and then anticipating when a given word is about to be identified so that saccadic programming can be initiated to move the eyes from the word just as it has been identified. This latter strategy is akin to the familiarity-check assumption in the E-Z Reader model (Reichle, 2011; Reichle et al., 1998) because it allows the initiation of saccadic programming in anticipation of word identification. As indicated, this strategy is optimal because initiating programming any earlier would likely cause the eyes to move too soon (thereby requiring the word to be processed from a poor viewing location), while initiating programming any later would cause the fixations to be unnecessarily long.
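The logic of this anticipatory strategy can be illustrated with a toy calculation (all values below—the programming latency and identification time—are illustrative assumptions, not the parameters used in the reported simulations): if the agent has learned how long a word of a given length takes to identify, the optimal moment to begin programming a saccade is the expected identification time minus the programming latency.

```python
# Toy illustration of the anticipatory saccade-programming strategy.
# The numeric values here are illustrative assumptions only.

PROGRAM_LATENCY = 3  # time steps needed to program a saccade (assumed)

def eyes_leave_at(program_start):
    """The eyes move once the saccade program completes."""
    return program_start + PROGRAM_LATENCY

def unfinished_processing(identify_time, program_start):
    """Lexical processing still incomplete when the eyes move (a cost)."""
    return max(0, identify_time - eyes_leave_at(program_start))

def wasted_fixation(identify_time, program_start):
    """Fixation time beyond what identification required (also a cost)."""
    return max(0, eyes_leave_at(program_start) - identify_time)

identify_time = 10  # learned expectation for a word of this length (assumed)
optimal_start = identify_time - PROGRAM_LATENCY  # = 7

# At the optimum, the eyes leave exactly when the word is identified:
assert unfinished_processing(identify_time, optimal_start) == 0
assert wasted_fixation(identify_time, optimal_start) == 0

# Programming earlier leaves processing unfinished when the eyes move;
# programming later pads the fixation with dead time.
early_cost = unfinished_processing(identify_time, optimal_start - 2)  # = 2
late_cost = wasted_fixation(identify_time, optimal_start + 2)         # = 2
```

Both deviations from the optimum incur a cost, which is why the learned strategy converges on initiating programming just before expected identification.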

Of course, the current simulations also extend our previously reported results in two important ways. The first is that the current simulations examined several phenomena that have played pivotal roles in our understanding of readers' eye movements. For example, the agents exhibited the complex interaction between fixation locations and their durations, generating patterns of refixations (e.g., Joseph et al., 2009), IOVP effects (e.g., Vitu et al., 1995), and skipping costs (Balota et al., 1985) that are similar to what has been observed with human readers. These findings indicate that agents' behavior was motivated by the goal of maintaining a forward momentum of the eyes so as to maximize their overall reading rate (and thereby maximize their obtained reward). As such, these results suggest that similar behavior in humans may also be adaptive in the sense that it allows readers to maximize their overall reading efficiency; by initiating a refixation toward the end of a word following an initial fixation near the beginning of a word, it may be possible to speed lexical processing when the eyes move closer to the center of a word, while simultaneously avoiding the cost that might result from overshooting the word and/or having to make an interword regression. Of course, the fact that human readers also sometimes make interword regressions (Vitu et al., 2001) and exhibit a skipping benefit (Kliegl & Engbert, 2005) but our agents do neither suggests that the constraints imposed upon our agents were incomplete or incorrect; it is important to bear this caveat in mind when considering our conclusions and to recognize that their validity will ultimately depend on additional simulations of the type reported in this article.

The second way that our simulation results extend previous results is that they speak directly to the debate about how attention is allocated during reading (see Reichle et al., 2009). Although the results of our three simulations indicate that the agents occasionally attended to more than one word at a time, these results also indicate that the agents had a strong preference for serial processing, attending to only one word at a time approximately 96% of the time. The present results thus suggest that, even ignoring the possible processing constraints related to feature “binding” (e.g., Treisman & Gelade, 1980) and having to keep word order straight (Pollatsek & Rayner, 1999; Rayner et al., 2009), the serial processing of words affords some additional advantage over the parallel processing of words.

For example, one possible advantage of serial processing is that both the spatial extent and movement of an attention gradient are limited by the rate of lexical processing and—in particular—by the rate at which the left-most word in the gradient can be identified. That is, the fact that attention is restricted to a single, spatially contiguous gradient that can only move forward after the left-most word has been identified means that there is little advantage in distributing processing over multiple words. The reason for this is best explained by way of example: Consider a situation in which attention is allocated to words n through n + 3, with processing proceeding most rapidly for word n + 1 (i.e., as might occur using gradient #8 in Fig. 1A). The parallel processing of words in this situation might not be particularly advantageous because, after word n + 1 has been identified, whatever attention resources were being allocated to that word cannot be reallocated toward the processing of other words because the gradient cannot be extended beyond the four spatially adjacent words. Thus, it is not until the left-most word (i.e., word n) has been identified and the gradient shifted two words to the right (to encompass words n + 4 and n + 5) that the attention resources that would otherwise be unavailable (because they are still being “allocated” to word n + 1, which has already been identified) can be used for further lexical processing. Although one might object that, in this type of situation, the attention resources would be re-distributed to whatever words had not been identified (i.e., words n, n + 2, and n + 3), such a re-distribution scheme would result in two spatially distinct foci of processing and would seemingly predict that readers would often be simultaneously attending to and processing words across intervening spaces—a situation that we believe is far too odd to be the default in reading.
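This resource-lock-up argument can be made concrete with a minimal toy simulation (the processing demands, gradient width, and processing rates below are illustrative assumptions, not the simulation's actual values): once an interior word such as word n + 1 has been identified, its share of attention sits idle until the left edge of the gradient catches up.

```python
# Toy sketch of the resource-lock-up argument: a single contiguous gradient
# over four words can only shift rightward once its LEFT-most word has been
# identified, so resources "allocated" to an already-identified interior
# word sit idle. All quantities below are illustrative assumptions.

needed = [6, 2, 6, 6, 6, 6]        # processing units each word requires (word n is index 0)
rates  = [0.25, 0.25, 0.25, 0.25]  # equal attention shares for the 4 words under the gradient

done = [0.0] * len(needed)
left = 0                            # index of the gradient's left-most word
idle_resource_steps = 0             # word-steps on which an attention share was wasted

while left < len(needed):
    span = range(left, min(left + 4, len(needed)))
    for i, share in zip(span, rates):
        if done[i] < needed[i]:
            done[i] += share * 4    # each step distributes 4 units of processing
        else:
            idle_resource_steps += 1  # this word is done, but its share is stuck
    # The gradient shifts past every identified word at its left edge.
    while left < len(needed) and done[left] >= needed[left]:
        left += 1
```

With these values, word n + 1 finishes after 2 steps but word n needs 6, so the short word's attention share idles for 4 steps—processing capacity that a strictly serial allocation would never strand.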

As indicated, however, the agents did occasionally use a multi-word attention gradient, especially with sentences containing several short, spatially adjacent words located near the ends of sentences. This situation might resemble one that human readers encounter with sequences of function words and/or prepositions (e.g., “…in the…”) and, as such, may indeed correspond to situations where it might be advantageous to dynamically increase the span of attention to process multiple words. We can imagine two other situations where such dynamic attention allocation might also be advantageous: the reading of idioms (e.g., “kick the bucket”) or common phrases (e.g., “United States of America”). It is important to note, however, that these two situations are ones in which the word sequences may have become lexicalized, thus functionally being treated by the reader as single, long words comprised of constituents that happen to be separated by spaces.4 Such instances would clearly not be the default during reading because most texts, like most spoken utterances, are generative in nature and thus composed of novel sequences of words. Therefore, on the basis of this latter consideration and the behavior of our reading agents, it would seem that the most adaptive strategy for reading is one in which the overwhelming majority of words are processed and identified one at a time, but with occasional parallel processing of words when they are both short and arranged in a spatially contiguous manner. This conclusion is congruent with other arguments in favor of the serial-processing hypothesis (e.g., see Reichle et al., 2009) and suggests that it is a good approximation of what actually happens during reading.

It is also once again worth emphasizing that, although our conclusion that the serial allocation of attention affords more efficient reading than a gradient is an inductive one, it is based on our finding that, across three separate simulations, the agents preferred serial word processing relative to any of the 18 different parallel-processing gradients. Thus, while we acknowledge that the attention gradients that were available to the agents may not accurately resemble the attention gradients that humans might employ during reading (e.g., the gradients used in our simulations were discrete and spatiotopic rather than continuous and retinotopic), any counterargument along these lines would have to precisely specify: (a) how the gradients used by humans differ from those of our agents, and (b) how this discrepancy invalidates our conclusions. Thus, in the final analysis, we believe that the burden of proof is on advocates of the attention-gradient theories to demonstrate how the parallel allocation of attention is adaptive during reading. In other words, exactly what advantage does the parallel processing of words (apart from the possible rapid identification of common phrases and idioms, as already mentioned) actually afford the reader? One possibility is that attention gradients might afford more robust lexical processing in the face of the large amount of “noise” that is inherent to the human cognitive and oculomotor systems but was absent (with the exception of saccade-targeting error) in our simulations. Future simulations will be necessary to test this possibility by examining whether our artificial reading agents continue to prefer serial lexical processing if stochasticity is introduced into the times required to identify words and program saccades.

And it is also worth emphasizing that our conclusions regarding how attention is allocated during reading may have no direct bearing on the broader question of how attention is allocated during other (non-reading) visual-cognitive tasks. This more general question is important because written language is a relatively recent cultural invention, making it incontrovertible that the cognitive systems that support reading actually evolved to perform other tasks (e.g., searching for objects in the environment). An appreciation of this fact has recently motivated attempts to understand attention allocation in non-reading tasks by using models to simulate the patterns of eye movements that are observed in those tasks. For example, simulations using E-Z Reader (Reichle, Pollatsek, et al., 2012b) demonstrated that the model's serial-attention assumption was sufficient to simulate patterns of eye movements observed in several non-reading tasks. However, because similar demonstrations have been completed using attention-gradient models (e.g., SWIFT; Nuthmann & Engbert, 2009), the questions about how attention is allocated during non-reading tasks and how these tasks might differ from reading have not been resolved but will undoubtedly be the focus of future research (e.g., see the 2012 special issue of Visual Cognition).

Finally, it is worth briefly discussing the theoretical implications of our results for current models of eye-movement control during reading. As indicated earlier, these models make very different assumptions about how attention is allocated to support lexical processing during reading; whereas serial-attention models like EMMA (Salvucci, 2001) and E-Z Reader (Reichle et al., 1998, 2003, 2009; Reichle, Liversedge, et al., 2012; Reichle, Pollatsek, et al., 2012b) maintain that attention is allocated in a strictly serial manner, parallel-attention models like Glenmore (Reilly & Radach, 2006) and SWIFT (Engbert et al., 2002, 2005; Schad & Engbert, 2012) maintain that attention is allocated as a gradient, to support the concurrent processing of multiple words. Although our simulation results are—strictly speaking—not completely consistent with either of these two classes of model in that our artificial reading agents exhibited both serial and parallel processing, it is equally clear that the agents showed a strong preference for serial processing. It is therefore fair to say that, on the basis of our results, the serial-processing models probably provide a better approximation of what human readers actually do during reading, but that these models are also on some level oversimplifications in that they fail to indicate the conditions under which human readers might engage in parallel lexical processing. As was just discussed, text involving common multi-word phrases (e.g., idioms) might provide one example of when such parallel processing occurs, but current serial-attention models make no provisions for explaining how or when the focus of attention might be dynamically expanded to process such word sequences. 
We hope that—at the very least—the results of our simulations provide some new evidence that such an account is necessary, and that rather than framing the debate about attention allocation as an “either-or” question, it might be more profitable to investigate the conditions under which attention is allocated in a serial versus parallel manner.


The study reported in this article was completed by the first author in partial fulfillment of his Ph.D. degree at Sun Yat-Sen University, China. It was supported by a China Scholarship Award to the first author, grant HD053639 awarded to the second author, and grants from the National Natural Science Foundation of China (31070988), the Ministry of Education of China (09YJAXLX026), the Science and Technology Planning Project of Guangdong Province, China (2008B080701041), and the 985–3 Research Program of Sun Yat-Sen University awarded to the third author. We thank four anonymous reviewers for their helpful suggestions for improving earlier versions of this article.


  1.

    Contrary to other attempts to develop reinforcement-learning algorithms capable of learning complex problems using biologically plausible evolutionary systems (e.g., see Fernando, Goldstein, & Szathmáry, 2010), we make no strong claim about the biological plausibility of our approach. Our goal was instead quite modest—to use our learning algorithm to examine how attention might be allocated to support optimal reading. Our approach is therefore one of using machine learning to examine a theoretically important psychological question (i.e., How is attention allocated during reading?) rather than an attempt to develop a biological model of attention allocation during reading.

  2.

    The training was completed using a procedure called trajectory sampling. Using this procedure, during each time step, the agent selects the action that results in the state providing the most reward with some probability determined by a greed parameter (0.5 in our simulations) but randomly selects another action with probability 1 − greed. The latter ensures that the agent explores the state space and does not simply continue to exploit whatever reward is currently available. The specific reward contingencies were chosen to be consistent with previously reported simulations (e.g., Liu & Reichle, 2010) and affect the rate of learning but not what is learned.
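    This selection rule can be sketched as follows (the action names and value estimates below are hypothetical placeholders; the actual state and action spaces used in the simulations are far richer):

```python
import random

# Minimal sketch of greedy action selection as used in trajectory sampling.
# The candidate actions and their value estimates are hypothetical examples.

GREED = 0.5  # probability of exploiting the best-known action

def select_action(q_values, rng=random):
    """With probability GREED, pick the action whose expected reward is
    highest; otherwise pick an action at random (exploration)."""
    actions = list(q_values)
    if rng.random() < GREED:
        return max(actions, key=lambda a: q_values[a])  # exploit
    return rng.choice(actions)                          # explore

# Example: three candidate actions with (hypothetical) learned values.
q = {"fixate_word_n": 0.2, "saccade_to_next_word": 0.7, "refixate": 0.1}
choice = select_action(q)
assert choice in q
```

    The random branch is what prevents the agent from prematurely settling on a locally rewarding behavior before the state space has been adequately explored.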

  3.

    We use this nomenclature because the time that a word is attended prior to initiating a saccadic program to move the eyes off of that word is functionally similar to the familiarity check that is posited to occur in the E-Z Reader model (Reichle, 2011).

  4.

    We are indebted to Simon Liversedge for suggesting this possibility and for providing our particular examples.