Orthographic coding is what every reader must do in order to recognize strings of letters. It is hard to see how we could retrieve meaning or pronunciation from a visual word if its representation were randomly assigned anew every time we came across it (for then how could we retrieve the same word meaning on two different occasions?), or if we did not somehow keep track, at least partially, of letter order (for then how could we distinguish between, say, anagrams?). Yet orthographic coding remains an open problem. As a determinant of what counts as a neighbor for a word, it is central to our understanding of the lexical system. But it is also a problem tied to general questions in the visual object recognition literature, such as the interplay between invariance and selectivity.
An intuitive way to represent order in a word (say TIME) would be to assign fixed positions to letters, starting from some origin (for instance, T1, I2, M3, E4). But because letters are rigidly tied to positions, taken literally this ‘‘slot coding’’ approach fails to account for robust experimental findings such as transposition and relative-position effects. For instance, when this scheme is used to present ‘‘TMIE’’ to a state-of-the-art word recognition model like DRC or CDP++ (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Perry, Ziegler, & Zorzi, 2010), M2 and I3 cannot contribute any evidence for the word TIME, and the model is driven no closer to the activation pattern corresponding to TIME than it would have been if, say, TOBE had been presented. This model behavior is falsified by priming experiments showing that TMIE primes identification of the word TIME better than TOBE does, so-called transposed-letter (TL) priming (Perea & Lupker, 2004; Schoonbaert & Grainger, 2004). Another challenging set of results for current models of lexical access is that humans can be remarkably unperturbed by several letter deletions (GRDN-GARDEN) or insertions (GARXYZDEN-GARDEN), so-called relative-position (RP) priming (Grainger, Granier, Farioli, Van Assche, & van Heuven, 2006; Van Assche & Grainger, 2006), and we have no trouble recognizing embedded or misaligned substrings (BOBCAT-CAT; Davis & Bowers, 2006).
Several proposals have been put forward to increase flexibility in orthographic representations and reproduce this signature. One possible modification would be that in-word positions are not in fact perfectly defined over slots: For each letter the word representation would maintain some measure of uncertainty as to its position (with some letter positions being less uncertain than others). This idea has been formalized and studied in the Overlap model of orthographic coding (Gomez et al., 2008), where letters in candidate stimuli have Gaussian position functions (e.g., stimuli ‘‘TRIAL’’ and ‘‘TRAIL’’ in Fig. 1A, left and middle), whereas known target words have perfectly defined positions (e.g., target ‘‘TRAIL’’ in Fig. 1A, right). The similarity between candidate and target is calculated only from those of the candidate's letters that are shared with the target, by summing the integrals of their density functions within the correct boundaries (i.e., over the slots corresponding to the target letters). In this account, a transposed prime would be less effective than the identity prime because the integration boundaries have been switched, making the areas under the curves for I and A smaller for ‘‘TRIAL’’ (Fig. 1A, left) than for ‘‘TRAIL’’ (Fig. 1A, middle). However, prime ‘‘TRIAL’’ will still produce more facilitation than say ‘‘TRONL’’ because the former shares five letters with the target against only three for the latter.
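The Overlap calculation can be sketched in a few lines of Python. This is an illustration, not the published model's parameterization: the original lets the standard deviation vary by letter position, whereas this sketch assumes unit-width slots and a single standard deviation `sigma` for every letter, and it simply sums over all identity matches.

```python
import math

def gauss_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian with mean mu, s.d. sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def overlap_similarity(candidate, target, sigma=0.5):
    """Sum, over letters shared with the target, of the mass of the candidate
    letter's Gaussian position function that falls inside the target letter's
    slot (slot i spans [i - 0.5, i + 0.5], with 1-indexed positions)."""
    score = 0.0
    for i, t_letter in enumerate(target, start=1):
        for j, c_letter in enumerate(candidate, start=1):
            if c_letter == t_letter:
                score += gauss_cdf(i + 0.5, j, sigma) - gauss_cdf(i - 0.5, j, sigma)
    return score
```

With this sketch the ordering described above falls out directly: the identity prime ‘‘TRAIL’’ scores highest against target ‘‘TRAIL’’, the transposed prime ‘‘TRIAL’’ scores lower (the switched integration boundaries shrink the areas for I and A), and ‘‘TRONL’’ lower still (only three shared letters).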
Figure 1. Three orthographic coding schemes illustrated with the comparison of target ‘‘TRAIL’’ (last column) to candidates ‘‘TRIAL’’ and ‘‘TRAIL’’ (first and middle columns, respectively). (A) Overlap coding: letter positions are encoded as Gaussian probability density functions (pdf), and similarities between candidate and target strings are computed as the sum of the pdf integrals for letters shared in both strings. (B) Spatial coding: letter positions are encoded by decreasing positional values, and similarities between candidate and target strings are calculated from the sum over each ‘‘signal-weight function’’ for letters shared in both strings (a signal-weight function is a Gaussian-like function defined for each common letter, centered on the difference between positional values in both strings). (C) Open-bigram coding: string representations are mapped to a point in the space of all possible ordered letter pairs (bigrams), and similarities between candidate and target strings are obtained by dividing the number of shared bigrams by the number of target bigrams.
A second alternative is Spatial coding (Davis, 2010), which also starts by assigning positions to letters (in Fig. 1B, the height of the bar for each letter marks its position value), and in which the similarity between two strings likewise requires summing Gaussian density functions across every common letter. However, each Gaussian function is now centered on the difference between the positions occupied by the common letter in the two strings, and the final similarity score is obtained not by summing the areas under the curves, but by finding the peak of the summed curves (data not shown). In this account, ‘‘TRIAL’’ is a less effective prime than ‘‘TRAIL’’ because transposing two letters displaces the centers of their corresponding Gaussian functions, and the subsequent summing produces a lower peak. Note that unlike Overlap coding, the letter nodes activated by candidate strings are duplicated in several banks (the clones in Fig. 1B), which are wired in finely tuned competitive and cooperative networks where ambiguities produced by repeated letters are assumed to be resolved.
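The peak-finding step can also be sketched, again under simplifying assumptions: the real scheme uses a decreasing positional gradient and resolves repeated letters with clone banks, whereas this sketch works directly on integer positions, pairs every identity match, and uses a single free bandwidth `sigma`.

```python
import math

def spatial_similarity(candidate, target, sigma=1.0):
    """Simplified spatial-coding match: each letter shared by candidate and
    target contributes a Gaussian 'signal-weight' function centered on the
    difference between its positions in the two strings; the match score is
    the peak of the summed curve, normalized by target length."""
    diffs = []
    for i, t_letter in enumerate(target):
        for j, c_letter in enumerate(candidate):
            if c_letter == t_letter:
                diffs.append(j - i)  # positional displacement of this match
    if not diffs:
        return 0.0
    # Scan candidate offsets on a fine grid for the peak of the summed curve.
    best = 0.0
    for step in range(-len(candidate) * 100, len(candidate) * 100 + 1):
        x = step / 100.0
        total = sum(math.exp(-((x - d) ** 2) / (2 * sigma ** 2)) for d in diffs)
        best = max(best, total)
    return best / len(target)
```

For identity primes all displacements are zero, so the Gaussians stack into a maximal peak; transposing I and A in ‘‘TRIAL’’ displaces two of the five Gaussians to +1 and -1, lowering the peak, exactly the intuition in the paragraph above.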
A coding scheme that we will not study here but that bears mentioning because it has appeared several times in the literature in one form or another is the so-called Both-ends coding, where letter positions are encoded from both ends of the word (Cox, Kachergis, Recchia, & Jones, 2011; Fischer-Baum, McCloskey, & Rapp, 2010; Jacobs, Rey, Ziegler, & Grainger, 1998). Slot, Both-ends, Overlap, and Spatial coding are inherently letter-based and absolute schemes (they use coordinates from an origin to define letter positions within the string), but alternatives have been proposed that use information about the relative positions of letters (Grainger & van Heuven, 2003; Mozer, 2006; Whitney, 2001), for instance, by keeping track of ordered pairs of letters in the stimulus.2 This amounts to representing letter strings as points in the high-dimensional space of all possible letter pairs (Fig. 1C), and it is generally known as Open-bigram coding, a label that covers a range of schemes with different letter gaps and weight parameters. Whitney's Open-bigram scheme had a restriction on the authorized gap, and weights that decreased with the gap (Whitney, 2001). Grainger and van Heuven (2003) considered unconstrained and constrained Open-bigrams, that is, respectively, with or without gap restrictions. Table 1 shows the Open-bigrams representations of the example word ‘‘ROOMS’’ for each of these schemes.
Table 1. Word ‘‘ROOMS’’ as represented in three variants of Open-bigram coding: Seriol, Constrained (COB), and Unconstrained (UOB)
The similarity between prime and target is then obtained by a (possibly weighted) count of how many bigrams they share, divided by the target's norm. Note how only UOB takes bigram ‘‘RS’’ into account, the other schemes imposing a gap restriction of two letters. Also, bigrams are counted only once,3 although a recent distributed implementation takes repeated bigrams into account with improved results (Hannagan et al., 2011a).
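A minimal unweighted sketch of Open-bigram extraction and similarity in Python, covering the constrained (COB) and unconstrained (UOB) variants via a `max_gap` parameter (the Seriol weighting scheme and edge bigrams are omitted here):

```python
from itertools import combinations

def open_bigrams(word, max_gap=None):
    """Set of ordered letter pairs of a string; max_gap limits the number of
    intervening letters (None = unconstrained, i.e., UOB)."""
    pairs = set()
    for i, j in combinations(range(len(word)), 2):
        if max_gap is None or j - i - 1 <= max_gap:
            pairs.add(word[i] + word[j])
    return pairs

def bigram_similarity(candidate, target, max_gap=None):
    """Number of shared bigrams divided by the number of target bigrams."""
    c = open_bigrams(candidate, max_gap)
    t = open_bigrams(target, max_gap)
    return len(c & t) / len(t)
```

Running `open_bigrams("ROOMS")` reproduces the UOB row of Table 1 (including ‘‘RS’’, which the gap-2 constrained variant excludes), and the scheme's tolerance to transposition is immediate: ‘‘TMIE’’ shares five of the six bigrams of ‘‘TIME’’, while ‘‘TOBE’’ shares only one.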
It might be worth noting at this point that all these string encoding schemes have predecessors in the memory for serial order literature: Slot coding can be compared to Conrad's seminal Box model (Conrad, 1965), Both-Ends coding to the Start-End model (Henson, 1998), Overlap coding to the Overlap model (Burgess & Hitch, 1992), Spatial coding to the Primacy model (Page & Norris, 1998), and Open-bigrams to the so-called Compound chaining models (Ebbinghaus, 1964; Slamecka, 1985). That models which were designed to capture serial order turn out to be relevant for ordered strings of letters is rather natural, although it would seem to limit the importance of domain-specific aspects for word encoding, such as parallel presentation and large storage requirements. This raises the question of whether these two fields might share more processes than is usually thought—for instance, could word recognition involve a serial beginning-to-end process? The single ocular fixation required by skilled readers to recognize a word, combined with the absence of a length effect, is good evidence that information about letter identity and location is not extracted from left to right but rather in parallel (Adelman, Marquis, & Sabatos-DeVito, 2010; Tydgat & Grainger, 2009), although probably not all letters at the same pace (Blais et al., 2009). Moreover, it would be expected that the necessity to store tens of thousands of words would trigger specific adaptations to general purpose encoding schemes in order to minimize interference for long-term storage and recall. In support of this, the processing of random letter strings appears to differ in important ways from the processing of sequences of symbols, as revealed by identification errors (Tydgat & Grainger, 2009).
Indeed, Open-bigram, Overlap, and Spatial coding describe different orthographic landscapes using two main concepts: an orthographic representation for any string and a similarity measure between two representations. But not all researchers have found it necessary to use representations in order to compute similarities. Inspired by string edit theory, Yarkoni, Balota, and Yap (2008) used the Levenshtein distance between strings to build a lexical neighborhood index that has arguably more explanatory power than the commonly used metric (the so-called N metric), which is based on a single letter substitution radius. Although this is certainly an improvement for the design and analysis of word recognition studies, as it stands the Levenshtein distance falls short of explaining why certain basic transformations (such as deletion) are generally more disruptive than others (such as repetition)—indeed it explicitly assumes that an insertion, a deletion, and a substitution have an equal editing cost. This is not to say that a principled explanation for different costs is beyond the reach of string edit models, as edit costs can be chosen on the basis of Minimum message length or Bayesian criteria (Allison, Wallace, & Yee, 1992; Dennis, 2005).4 Another option, however—and the one chosen in this article—is to assume the existence of an underlying orthographic coding scheme, and let these differences be induced by the string representations thereby generated. We now assess the behavioral, neural, and computational evidence for and against these schemes.
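To make the equal-cost assumption concrete, here is the standard dynamic-programming Levenshtein distance (not Yarkoni et al.'s actual code, just the textbook algorithm their index is built on):

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, and substitutions all
    cost 1, which is exactly the equal-cost assumption discussed above."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Under this metric the deletion prime GRDN and the transposition prime TMIE are both two edits away from their targets, even though behaviorally their priming effects need not be equivalent; this is the sense in which the raw distance cannot, by itself, explain differential disruption.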
2.2. Behavioral evidence
The standard way to gather behavioral evidence on the orthographic code has been through masked priming lexical decision experiments. A robust finding in psycholinguistics is that human decisions on whether a given target string of letters is or is not a word (the so-called lexical decision task) are affected when a subliminal stimulus (a prime) has been flashed just before the target (Forster & Davis, 1984). Because subjects are unaware of the prime, this result is thought to be uncontaminated by conscious processes. What is more, it conveniently varies in strength when subliminal primes are more or less close in form to the target. For these reasons, masked priming has become the experimental paradigm of choice to investigate the lexical system on its own.
Overlap coding has been shown to capture a large selection of phenomena observed in a less commonly used paradigm, the same-different judgment task, where participants have to determine whether two briefly presented strings are identical (Gomez et al., 2008). However, its performance on masked priming data is often problematic, as it predicts, for instance, no priming for nine-letter words when the final four letters 6789 are used as primes, and massive priming for eight-letter words with pairwise transposed primes like 21436587 (the so-called T-All condition). Both predictions have been falsified experimentally (Grainger et al., 2006; Guerrera & Forster, 2008). As for Spatial coding, it consistently achieves superior agreement with masked priming constraints (see, for instance, Hannagan, Dupoux, & Christophe, 2011a). In particular, when used as an input scheme in an 18-parameter word recognition model, it provides a very good fit to 61 masked priming conditions (Davis, 2010). Note, however, that the T-All condition can only be accounted for ‘‘outside’’ the input coding scheme, mostly as a consequence of competition at the model's lexical level. This is because, like Overlap coding, Spatial coding operates under the assumption that erroneous letters in the prime do not directly contribute negative evidence for the target. Only positive evidence is given by shared letters, and erroneous letters simply weaken the positive evidence from correct letters by disturbing their alignment with the target. Finally, Open-bigram coding can generally achieve good correlations with priming data including (when the scheme is implemented with distributed representations) priming conditions such as reversed primes (4321) or repeated letters (123245) that have been deemed problematic for this specific coding scheme (Hannagan et al., 2011a).
In summary, the three schemes can account for many aspects of the behavioral data, but as of today only Spatial coding has been studied in a full-blown word recognition model where it achieves a very good fit to the masked priming data.
2.3. Biological evidence
Functional magnetic resonance imaging (fMRI) studies show that faces, tools, houses, and word stimuli activate small neighboring cortical patches in the fusiform region, at the frontier between occipital and temporal lobes (see, for instance, Gaillard et al., 2006; Hasson, Harel, Levy, & Malach, 2003; Kanwisher, 2006). This proximal organization is precisely what one would expect if some general visual object processing mechanism were at work, with the particular location of each patch reflecting idiosyncrasies such as the use of ordered symbols for word recognition.
Spatial coding has so far gathered little evidence from brain studies. Davis (2010) argues that the monotonic activation gradient used in this scheme could be implemented by dedicated neural populations for each letter, which would fire in a periodic fashion yet consistently out of phase with one another. Learning the orthographic code for word W would mean learning to delay the correct letter signals so that these initially asynchronous signals eventually all arrive at the same time at a neural population dedicated to W. It is argued that a misspelled word would then result in the letter signals being out of phase, as described in the spatial coding scheme. But this elegant mechanism has yet to be reported in the visual word form area following the presentation of word stimuli or letter strings, and it also requires dedicated populations of neurons, a controversial assumption (Bowers, 2010; Plaut & McClelland, 2010).
As for Overlap coding, its basic premise is consistent with the notion of overlapping receptive fields which by definition would produce some uncertainty on the retinotopic location of detected entities. However, the uncertainty described in Overlap coding operates at the level of in-word letter positions, not retinotopic letter locations. Thus, taken literally, this scheme requires the existence of at least two levels: an upper level with banks of neural populations tuned to all possible letters at all possible in-word positions that would receive activity via fan-in overlapping connections from topologically corresponding units in a lower retinotopic level. Although a retinotopic level and overlapping connectivity are completely consistent with the primate visual system, the in-word position level remains speculative. While location-specific letter detectors have been successful in theoretical accounts of crowding effects (Tydgat & Grainger, 2009) and are supported by fMRI data (Dehaene et al., 2004), there is so far no evidence coming from brain studies for banks of location-invariant, but position-specific, letter detectors.
Turning to Open-bigram coding, Binder, Medler, Westbury, Liebenthal, and Buchanan (2006) found that activity in the VWFA was correlated with the frequency of (contiguous) bigrams in letter strings presented to subjects. Using similar methods, Vinckier, Dehaene, Jobert, Dubus, Sigman, and Cohen (2007) further found evidence for a posterior-to-anterior gradient of activity in the VWFA that correlates with letter combinations of increasing complexities, in a hierarchy involving an Open-bigram level (with gap 1 bigrams). This evidence supports a biologically inspired account of orthographic coding known as the letter combination detector (LCD) proposal, which, based on the general architecture of the primate visual system, hypothesized the existence of such a hierarchy (Dehaene et al., 2005). Indeed, these findings may not be surprising considering the ample evidence provided by electrophysiological studies for the existence of neurons that code for invariant feature combinations in generic object recognition (Kobatake & Tanaka, 1994; Tsunoda, 2001).
In summary, biological plausibility supports Open-bigram coding. Not only is there direct evidence for this scheme in the VWFA, but as predicted from theoretical considerations, it uses the same basic mechanism as apparently used by object recognition in general, which is to build representations from collections of feature combinations.
2.4. Computational evidence
Computational models have more evidence to offer than just fits to the data, and indeed selecting a model exclusively on this criterion would expose one to the dangers of overfitting (Pitt & Myung, 2002). Even when considering two models with the exact same goodness-of-fit, one of these might take more time or more memory to run, or it might be operating at the limits of its computational abilities, or be using complex algorithmic contraptions and a host of parameters, or simply be much harder to learn. Formal notions have been proposed to quantify each of these four criteria, respectively: time and memory complexity, complexity class (van Rooij & Wareham, 2008), parsimony as defined by Kolmogorov complexity (Li & Vitanyi, 1997), and learnability (Valiant, 1984). In this section we shall only consider informally how the different proposals fare in some of the previously cited respects, but model selection has been a very lively topic of research in recent years with several principles now available to decide between models (Grünwald, 2004; Pitt & Myung, 2002).
First let us look at how many parameters are being tuned in the three coding schemes. In order to fit behavioral data, Overlap coding uses as many parameters as the longest string involved in the comparison: Each parameter determines the variance of the Gaussian uncertainty function of a letter. Instead, Spatial coding uses a two-parameter formula that computes the variance for each letter given its position, and one parameter to specify the importance of end letters. The type of Open-bigram coding used in Whitney's Seriol model also involves edge bigrams, and two different bigram weights that depend on letter contiguity. Finally, one parameter is tuned in Grainger and van Heuven's versions (the letter gap which is either fixed or absent).
Let us now consider what kind of computational mechanisms are at play in these schemes. Overlap and Spatial coding share one critical feature, which is that similarity calculations require the prior knowledge of which letter from the target should be mapped to which letter from the candidate. For strings of equal length, this has been tackled by assuming that only the same letters at the same position are considered for comparison (an intuitive although strong hypothesis in itself). But this becomes problematic for strings with different lengths and repeated letters, because there is no easy way to tell how this mapping should proceed in general. Simply consider, for example, the candidate TENT and target METER: Overlap coding does not tell us which of the two very different uncertainty functions for T in the candidate string we should use to compute the similarity with T in the target.5 Spatial coding faces the same issue, but it proposes a complex solution that involves duplicated letter banks for each position (the so-called letter clones), competitive-cooperative interactions between and within banks, and the invocation of a new principle—the principle of clone equivalence. In Open-bigram coding, no prior knowledge of the letters to be compared is assumed, and repeated letters are not a problem when computing similarities because the basic unit of comparison, the bigram, does not have a position anymore; only its presence in both representations counts.
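The TENT/METER case makes the contrast concrete. In a bigram-based comparison there is no alignment decision to make at all: each string is reduced to a set of ordered pairs, and similarity is simply set overlap (this sketch uses unconstrained, unweighted open bigrams for illustration):

```python
from itertools import combinations

def ordered_pairs(word):
    """All ordered letter pairs of a string (unconstrained open bigrams)."""
    return {word[i] + word[j] for i, j in combinations(range(len(word)), 2)}

# No letter-to-letter mapping is required, even with repeated letters:
# the repeated T in TENT and the repeated E in METER each just contribute
# their pairs to a set, and similarity is the overlap of the two sets.
shared = ordered_pairs("TENT") & ordered_pairs("METER")
similarity = len(shared) / len(ordered_pairs("METER"))
```

The question of which T in TENT should be compared with the T in METER, which is left unanswered by Overlap coding and answered only via clone banks in Spatial coding, simply does not arise.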
Computational evidence should also be sought on how these coding schemes can be learned. Computational models of object recognition have shown that units tuned to feature combinations can arise from training hierarchically organized layers of neurons with limited receptive fields (Wallis & Rolls, 1997), possibly with interleaved layers of simple and complex cells as found in the ventral route (Riesenhuber & Poggio, 1999). Nevertheless, a neuro-computational demonstration that Open-bigrams can be learned is still unavailable, and neither is there a computational model that learns Spatial coding. In this respect, Overlap coding might arguably have the advantage, given that a backpropagation network trained for location invariant word recognition has recently been shown to implement what can be described as a distributed version of this specific coding scheme (Hannagan, Dandurand, & Grainger, 2011b).
In summary, the time and memory complexities of these schemes are still unknown, but a crude assessment of the number of parameters and the parsimony of the mechanisms involved is in support of Grainger and van Heuven's Open-bigrams. We have also identified defects in the current specification of Overlap coding that limit its computational power relative to Open-bigram and Spatial coding. However, learnability might still support a scheme in the spirit of Overlap coding. We now proceed to strengthen the computational evidence in favor of Open-bigram coding, by pointing out its relationship with a powerful machine learning tool known as String kernels.