Research on letter position encoding has undergone a remarkable expansion in recent years. At the origin of this expansion are numerous findings concerning the flexibility of letter string representations, as well as the growing realization of their importance in the computational study of lexical access. Nevertheless, how the brain encodes visual words remains a subject of lively debate. Clearly, a stable code must be used if we are to retrieve words successfully, and order must be encoded because we can distinguish between anagrams like ‘‘relating’’ and ‘‘triangle.’’ But researchers disagree on exactly how this string encoding step¹ is achieved.
There is currently a great variety of proposals in the literature, but it is possible to classify them according to the scheme and code they use. Indeed, current proposals fall into two classes of schemes, relative and absolute, and each can be implemented using two kinds of codes: localist or distributed.
We will see that all schemes but one have been implemented using localist codes. In this article, we explore a special kind of distributed coding, based on holographic representations. We show that well-known schemes behave differently depending on the code they use, holographic ones performing generally better on a benchmark of behavioral effects. Finally, we exploit the compositional power of holographic representations to study a new scheme in the spirit of the local combination detector model (LCD; Dehaene, Cohen, Sigman, & Vinckier, 2005), a recent and biologically constrained proposal for string encoding.
1.3. The localist versus distributed debate
As this article illustrates, the choice of a representational format has a profound impact at the computational level: Schemes do not make the same predictions depending on the code they use. Consequently, one should assess which of the distributed and localist approaches, if either, is currently favored by the evidence.
Nowadays, the commitment to distributed representations is widespread among both neuroscientists and computational modelers, in part because of the success of the PDP framework (Rumelhart & McClelland, 1986), and also because distributed representations have been reported in a vast number of brain regions and in a variety of forms—see Abbott and Sejnowski (1999) and Felleman and Van Essen (1991) for good surveys. As regards string encoding, it is also worth noting here that the LCD proposal, which may currently have the strongest claims to biological plausibility, postulates distributed representations.
Recently, however, Bowers (2009) argued for a localist vision in which units are redundant (i.e., many dedicated localist units code for any given entity), activation spreads between them (i.e., each dedicated unit can also show some low level of activity for related stimuli), and units are limited to coding for individual words, faces, and objects. Bowers points to human neurophysiological recordings showing rare cells with highly selective responses to a given face, and negligible or flat responses to all others.
It has been pointed out that reports of highly tuned cells can hardly count as evidence for localist coding, since one can neither test all other cells exhaustively for any given stimulus, nor test the presumed localist cell exhaustively for all possible stimuli (Plaut & McClelland, 2010). Nevertheless, Bayesian estimates of sparseness can be and have been derived for medial temporal lobe cells (Waydo, Kraskov, Quiroga, Fried, & Koch, 2006), assuming that on the order of 10⁵ items are stored in this area—where coding is generally believed to be sparser than elsewhere in the brain—and these estimates turned out to be unequivocally in favor of sparse distributed coding. Crucially, the following claim from Bowers is incorrect:
Waydo et al.’s (2006) calculations are just as consistent with the conclusion that there are roughly 50–150 redundant Jennifer Aniston cells in the hippocampus. (p. 245)
Waydo et al.’s (2006) study was agnostic with respect to redundant coding and explicitly concludes that, rather than 50–150, on the order of several million medial temporal (MT) neurons should respond to any given complex stimulus (Jennifer Aniston included), and that each cell is likely to respond to a hundred different stimuli. In fact, such ‘‘multiplex’’ cells, to use Plaut and McClelland's (2010) proposed terminology, have been directly observed by Waydo et al. (2006) (personal communication).
As for computational efficiency, distributed codes have well-known computational qualities: They are robust to noise, have a large representational power, and support graded similarities. In distributed codes, randomly removing n units does not lead to the total deletion of n representations, but rather to the gradual loss of information in all representations. This would seem especially useful considering the different sources of noise—environmental or neural—in spite of which the brain has to operate (Knoblauch & Palm, 2005). Redundant localist coding would indeed provide a similar advantage, were it not already ruled out by the above-mentioned estimates. Distributed coding also provides a significant gain in representational power. For instance, n binary units can encode at most n local representations (without any redundancy), but this number is a lower bound when using distributed representations—and can reach 2ⁿ with dense coding. However, because firing a neuron takes energy, there is likely to be a trade-off between representational efficiency and metabolic cost. Brain regions supporting highly regular associations would be more likely to settle on dense representations, whereas those supporting highly irregular associations would settle on sparser ones, although not to the localist extreme. Moreover, while distributed representations can give rise to ‘‘catastrophic interference’’ of new memories on old ones, it is known that the independently motivated introduction of constraints such as interleaved learning, or of assumptions such as pseudo-rehearsal, can circumvent the issue (Ellis & Lambon Ralph, 2000; McClelland, McNaughton, & O'Reilly, 1995).
Finally, in localist coding, the question arises as to which entities or classes of entities a unit should stand for, especially considering that class distinctions can be entirely context dependent (Plaut & McClelland, 2010). This is intimately tied to the problem of defining similarities between representations, a problem which is solved in distributed representations, where graded similarities can be defined as the overlap between patterns—for instance, by computing the Euclidean distance between activation patterns.
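As a concrete illustration of this contrast, the following sketch (with arbitrary made-up patterns, not drawn from any cited model) shows that localist codes for distinct entities never overlap, whereas distributed patterns support graded similarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Localist: one dedicated unit per entity; distinct entities never overlap.
unit_a = np.array([1.0, 0.0, 0.0])
unit_b = np.array([0.0, 1.0, 0.0])

# Distributed: entities are activation patterns over the same units.
n = 1000
pattern_a = rng.normal(size=n)
related = 0.7 * pattern_a + 0.3 * rng.normal(size=n)  # shares structure
unrelated = rng.normal(size=n)                         # shares none

def overlap(x, y):
    """Normalized overlap between two patterns (cosine similarity);
    Euclidean distance between patterns behaves analogously."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Localist similarity is all-or-none; distributed similarity is graded.
assert overlap(unit_a, unit_b) == 0.0
assert overlap(pattern_a, related) > overlap(pattern_a, unrelated)
```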
In light of the available evidence, we think that the localist approach is not currently supported by physiology, that it runs into the fundamental issue of how to determine what a unit codes for, and that it severely limits representational power, especially if one acknowledges that redundant coding is required to overcome noise issues—which is not compatible with current estimates. On the other hand, distributed representations have been abundantly found in the brain, and they allow for a much larger, graded, and robust spectrum of representations to be coded by any given population of units.
1.4. Constituent structure and the dispersion problem
As we have seen, there are behavioral and biological reasons to explore the possibility that letter combinations are involved in the string encoding process, as proposed for instance in LCD, open-bigrams, or Wickelcoding. In these schemes, by definition, a given letter combination is active only if its constituents are present in the string in the correct order.
There is, however, a consequence of this way of representing letter order: the possible loss of information pertaining to constituent structure. Under a localist open-bigram implementation, for instance, bigram 12 is no more similar to bigrams 1d or 21, with which it shares common constituents,⁴ than to bigram dd, with which it shares none—see also Davis and Bowers (2006) for the same remark on localist Wickelcoding. This is what mathematicians refer to as a topologically discrete space: the (normalized) distance between two localist representations is always one (or zero iff they are identical). As n-gram units increase in size, this lack of sensitivity to constituent structure can become problematic for the string encoding approach, implying for instance that a localist open-quadrigram code would assign no similarity whatsoever between any two distinct four-letter strings, in flat contradiction with experimental results (Humphreys, Evett, & Quinlan, 1990).
While this issue does not arise in absolute codes, because similarities are computed at the level of letters, it is potentially serious for relative schemes. The issue seems to stem in part from the use of localist representations, but it is unclear how using distributed representations would actually improve the situation. Indeed, as described earlier, one example of a distributed string code can be found in the original PDP model, where it was applied to the Wickelcoding scheme. The code has good robustness to noise and provides graded similarity measures for strings of more than three letters, but it is of little use for the problem described above. This is because triplets are randomly assigned to units, so there is still no relationship between units coding for triplets with similar letter constituents. This phenomenon is related to the dispersion problem observed in PDP (i.e., the fact that orthographic-to-phonological regularities learned in some context do not carry over to different contexts), which has been identified as a major obstacle to generalization in the PDP model (Plaut, McClelland, Seidenberg, & Patterson, 1996).
What seems to be required is a way to represent letter-combination entities so as to respect similarities between combinations that share constituents, and we now describe how this can be achieved using holographic representations.
1.5. Binary spatter code
Binary spatter codes (BSC; Kanerva, 1997, 1998) are a particular case of holographic reduced representations (Plate, 1995) and provide a simple way to implement a distributed and compositional code.
Formally, BSCs are randomly distributed vectors of large dimension (typically on the order of 10³ to 10⁴), where each element is bipolar, independently and identically distributed according to a Bernoulli distribution. Vectors can be combined using two operators: X-or and Majority rule, often referred to as the Binding and Chunking operators, respectively.⁵ The composition of BSC vectors using these operators results in a new vector of the same dimension and distribution.
Operators are defined element-wise (see Appendix A for formal definitions): X-oring two bipolar variables gives 1 if they differ and −1 otherwise, while the majority rule applied to k bipolar variables gives 1 if they sum to above zero and −1 otherwise. Both operators preserve the format (size and distribution) of vectors, so that the resulting vectors can themselves be further combined, thus allowing arbitrarily complex combinatorial structures to be built in a constant-sized code.
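These element-wise definitions are simple enough to sketch directly. The following minimal illustration (ours, not taken from the BSC literature) uses NumPy with bipolar vectors of an arbitrary dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # dimension is arbitrary; large values keep noise low

def random_bsc(n=N):
    """A random bipolar vector: each element is +1 or -1 with p = 1/2."""
    return rng.choice([-1, 1], size=n)

def bind(a, b):
    """X-or (Binding): 1 where elements differ, -1 where they agree.
    For bipolar values this is just the negated element-wise product."""
    return -a * b

def chunk(*vectors):
    """Majority rule (Chunking): 1 where the element-wise sum is positive,
    -1 otherwise (ties, possible for an even count, broken at random)."""
    s = np.sum(vectors, axis=0).astype(float)
    ties = s == 0
    s[ties] = rng.choice([-1, 1], size=int(ties.sum()))
    return np.sign(s).astype(int)

A, B = random_bsc(), random_bsc()
bound = bind(A, B)
# Both operators preserve the format of their arguments, so results can
# be combined further; X-or is moreover its own inverse:
assert np.array_equal(bind(bound, B), A)
```

Because binding is self-inverse, a constituent can be retrieved from a bound pair, which is what makes these operators useful for building and querying compositional structures.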
As an example, consider how to implement the sequence of events [A, B, C] with an absolute scheme and vectors of dimension 10. We start by generating three event vectors at random:
Likewise, we also generate at random the three temporal vectors required by an absolute position scheme:
We express the idea that an event occurred at a given time by X-oring these vectors together, giving the three bindings:
These vectors are again densely and randomly distributed (p(−1) = p(1) = 1/2). Calculating the expected similarities⁶ between one binding and its constituents, we see that they must be zero: X-or destroys similarity.
Finally, we signify that these three events constitute one sequence by applying the majority rule:
The sequence vector S that results from the Maj also has the same distribution as its argument vectors. Calculating its expected similarity with its arguments, we see that, contrary to the X-or case, it is non-zero: here, 1/2.⁷ The fact that Maj allows for part/whole similarities also implies that two structures which share some constituents will be more similar than chance.
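The whole [A, B, C] example can be reproduced numerically. The sketch below uses a dimension of 10,000 rather than 10, so that the expected similarities are clearly visible above the sampling noise, and takes similarity to be the normalized dot product, consistent with the 1/2 value mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

def rand_vec():
    return rng.choice([-1, 1], size=N)

def xor(a, b):          # Binding
    return -a * b

def maj(*vs):           # Chunking (odd number of arguments: no ties)
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):          # similarity as a normalized dot product
    return float(a @ b) / N

# Three event vectors and the three temporal vectors of an absolute scheme.
A, B, C = rand_vec(), rand_vec(), rand_vec()
T1, T2, T3 = rand_vec(), rand_vec(), rand_vec()

# Bindings: "event A occurred at time T1", and so on.
AT1, BT2, CT3 = xor(A, T1), xor(B, T2), xor(C, T3)

# X-or destroys similarity: a binding resembles neither of its constituents.
assert abs(sim(AT1, A)) < 0.05          # close to 0

# Chunking the three bindings into one sequence vector.
S = maj(AT1, BT2, CT3)

# Maj preserves part/whole similarity: about 1/2 with each argument.
assert abs(sim(S, AT1) - 0.5) < 0.05
```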
Note that BSCs can be made arbitrarily sparse (Kanerva, 1995), although in this article we use the dense version for simplicity. It has also been shown that randomly connected sigma–pi neurons can achieve holographic reduced representations (Plate, 2000). In addition, holographic techniques such as BSCs have recently been applied to a number of language-related problems, including grammatical parsing (Whitney, 2004), learning and systematization (Neumann, 2002), lexical acquisition (Jones & Mewhort, 2007), language acquisition (Levy & Kirby, 2006), and phonology (Harris, 2002).
1.6. Comparing codes: Method, objections, and replies
The masked priming paradigm (Forster & Davis, 1984) is most commonly used to assess the relative merits of candidate schemes. Simply stated, in this paradigm a mask is presented for 500 ms, followed by a prime for a very brief duration (e.g., 50 ms), and then directly by the target string for 500 ms. Under these conditions, subjects are generally unaware of the prime—it is said to be subliminal. However, when target strings are preceded by subliminal primes that are close in visual form, subjects respond significantly faster and more accurately. In most studies the primes of interest are non-words, and there is no manipulation of frequency [F] or neighborhood [N]. By varying the orthographic similarity between prime and target (e.g., hoss-TOSS vs. nard-TOSS), one can observe different amounts of facilitation that can then be compared to similarity scores derived from candidate schemes.
The way similarity scores are calculated depends upon the scheme. Most of the time, the similarity between prime and target strings is obtained simply by dividing the number of shared units by the number of units in the target. This is the similarity defined for instance in Grainger and Van Heuven's localist open-bigrams, and when absolute positions are taken into account, in slot coding. More sophisticated similarities can introduce weights for different units (Seriol model), signal-to-weight differences (Spatial Coding), or sums of integrals over Gaussian products (overlap model).
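As a concrete example of the simplest such computation, here is a sketch of the shared-units similarity for open bigrams. For simplicity, bigrams here are unconstrained by letter distance, whereas published open-bigram schemes typically cap the gap at two or three letters:

```python
from itertools import combinations

def open_bigrams(word):
    """All ordered letter pairs, preserving left-to-right order
    (no distance limit here; published schemes often cap the gap)."""
    return {a + b for a, b in combinations(word.lower(), 2)}

def match_score(prime, target):
    """Number of shared units divided by the number of units in the target."""
    p, t = open_bigrams(prime), open_bigrams(target)
    return len(p & t) / len(t)

print(match_score("hoss", "TOSS"))  # 0.5 -> facilitation expected
print(match_score("nard", "TOSS"))  # 0.0 -> no orthographic overlap
```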
Some researchers have recently raised objections to this way of assessing codes. Lupker and Davis (2009) challenge the notion that masked priming effects can constrain codes, since the latter abstract away from N and F, which despite all precautions will play a role in the former, and they propose a new paradigm that is deemed less sensitive to these factors. Although this ‘‘sandwich priming’’ appears promising, a dismissal of the standard approach would certainly be premature. All researchers agree that the similarity/masked priming comparison should be used with caution—see, for example, Gomez et al. (2008)—which would exclude, for instance, conditions that explicitly manipulate primes for N or F. But the standard view is that in general masked priming facilitation should correlate highly with the magnitude of similarities, which in all models are thought to capture bottom-up activation from sublexical units (see, for instance, Davis & Bowers, 2006; Gomez et al., 2008; Grainger et al., 2006; Van Assche & Grainger, 2006; Whitney, 2008). This view is also supported by unpublished investigations showing that similarities in the Spatial Coding scheme correlate 0.63 with simulated priming in the Spatial Coding model, and 0.82 when conditions that manipulate F and N are excluded (Hannagan, 2010).
Although such correlations are bound to differ across models, because these make different assumptions about N and F mechanisms, the differences put models’ behaviors into perspective, allowing one to distinguish bottom-up contributions from other contributions in priming simulations. Finally, in some circumstances, similarities alone can be sufficient to rule codes out. Such a situation occurred, for instance, in the case of standard localist slot coding, where a string code suffers equal disruptions from double substitutions and from transpositions. No lexical level from any model could save this localist code from being inconsistent with human data, because these show strong priming in the latter case but weak or no priming in the former (Perea & Lupker, 2003a).
In summary, assessing codes against masked priming constraints has been standard practice in the field; it can rule out codes at face value, arguably provides accurate expectations for priming simulations carried out on models, and in any case provides a useful way to understand their inner workings. Consequently, and following others, in this article we will use similarity scores to predict the priming effects that would be expected from the codes alone in any lexical access model, and we will assess codes exclusively on masked priming data.
1.8. General procedure
In the next four sections, we will assess a number of schemes on this set of criteria. The same procedure is used throughout all simulations: Each scheme is implemented using holographic coding on the one hand, and localist coding on the other, for comparison purposes. Similarities in the localist case are computed in the standard way described earlier. The holographic procedure is carried out as follows. For each trial, we randomly generate orthogonal holographic vectors of dimension 1,000. From these, we build the various codes for all conditions, compute similarity scores based on the Hamming distance between vectors, and average them across 100 trials. The use of high-dimensional vectors implies very little variance for these similarity scores, typically 0.01 for dimension 1,000. Consequently, we will only report mean similarities, and we will consider that two conditions yield equal facilitation if their difference in similarities lies within 0.01.
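The trial-and-average logic can be sketched as follows. The exact normalization of the Hamming-based similarity is an assumption here (we take it to be the proportion of matching elements), and the condition shown, the similarity between a majority-rule chunk and one of its constituents, is just an illustrative stand-in for the priming conditions of the article:

```python
import numpy as np

rng = np.random.default_rng(2)
N, TRIALS = 1000, 100

def rand_vec():
    return rng.choice([-1, 1], size=N)

def sim(a, b):
    """Similarity derived from the Hamming distance d_H: here the
    proportion of matching elements, 1 - d_H / N."""
    return 1.0 - np.count_nonzero(a != b) / N

# One illustrative condition: similarity between a majority-rule chunk
# and one of its constituents, regenerated from scratch on every trial.
scores = []
for _ in range(TRIALS):
    a, b, c = rand_vec(), rand_vec(), rand_vec()
    scores.append(sim(np.sign(a + b + c), a))

mean, sd = float(np.mean(scores)), float(np.std(scores))
# At dimension 1,000 the trial-to-trial sd is on the order of 0.01,
# so reporting only mean similarities loses very little information.
```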