• String kernels;
  • Visual word recognition;
  • Orthographic coding;
  • Open-bigrams;
  • Visual Word Form Area


  Abstract
  1. Introduction
  2. Orthographic coding
  3. String kernels
  4. Open-bigrams as String kernels
  5. Discussion
  6. Learning String kernels: The role of invariance
  7. Conclusion
  References
  Appendix

It has recently been argued that machine learning techniques known as Kernel methods could be relevant for capturing cognitive and neural mechanisms (Jäkel, Schölkopf, & Wichmann, 2009). We point out that ‘‘String kernels,’’ initially designed for protein function prediction and spam detection, are virtually identical to one contending proposal for how the brain encodes orthographic information during reading. We suggest some reasons for this connection, and we derive from it new ideas for visual word recognition that are successfully put to the test. We argue that the versatility and performance of String kernels make a compelling case for their implementation in the brain.

1. Introduction

Models of human visual word recognition typically involve considerable machinery and several processing levels, but it has become clear in the last decade that their behavior is deeply affected by the coding scheme they use to represent letter order (Plaut, McClelland, Seidenberg, & Patterson, 1996). Far from being a defect of such models, the importance of this ‘‘orthographic coding’’ stage appears to mirror the finding that humans devote a well-defined and reproducible brain area to this end, the so-called Visual Word Form Area (Cohen et al., 2000). In recent years, one challenging task has thus been to determine which of a number of possible coding schemes best captures human behavioral data, under various blends of parsimony and plausibility constraints. Although existing proposals bear little resemblance to one another, it has so far proved surprisingly difficult to decide between them solely on the basis of their explanatory power. Here we argue that one coding scheme, Open-bigrams, is also favored by its generality and computational efficiency.

Indeed, one argument that has not yet been fully exploited in this endeavor is the computational argument. Many researchers share the view that to a first approximation and for all computational intents and purposes, characterizing the orthographic code boils down to specifying the similarity between all possible strings of letters.1 What then are the best tools available to date in order to compare sequences? In the field of machine learning, one powerful and versatile method that has emerged in the last decade is the so-called String kernels method. String kernels were originally developed to address questions in bioinformatics (Haussler, 1999) but have now found applications in natural language processing (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002) or more recently in speech recognition (Gales, 2009).

In this article we will first describe the three main alternatives for orthographic coding and assess them on the existing behavioral, neural, and computational evidence. We will then explain what String kernels are and how they have been used to tackle problems in natural language processing and computational biology, and we will show that each existing variant of the Open-bigram scheme finds an almost identical counterpart in String kernels. We will point to novel insights that become available when the orthographic code is envisioned in this way, and we will present new simulation results on String kernels inspired from computational biology. Finally, in light of this unexpected connection, we will recapitulate and discuss the case for String kernels in the brain.

2. Orthographic coding

Orthographic coding is what every reader must do in order to recognize strings of letters. It is hard to see how we could retrieve meaning or pronunciation from a visual word if its representation were randomly assigned anew every time we came across it (for how then could we retrieve the same meaning on two different occasions?), or if we did not keep track, at least partially, of letter order (for how then could we distinguish between, say, anagrams?). Yet orthographic coding is still an open problem. As a determinant of what counts as a neighbor for a word, it is central to our understanding of the lexical system. But it is also tied to general questions in the visual object recognition literature, such as the interplay between invariance and selectivity.

An intuitive way to represent order in a word (say TIME) would be to assign fixed positions to letters, starting from some origin (for instance, T1, I2, M3, E4). But because letters are rigidly tied to positions, taken literally this ‘‘slot coding’’ approach fails to account for robust experimental findings such as transposition and relative-position effects. For instance, when this scheme is used to present ‘‘TMIE’’ to a state-of-the-art word recognition model like DRC or CDP++ (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Perry, Ziegler, & Zorzi, 2010), M2 and I3 cannot contribute any evidence for the word TIME, and the model is driven no further toward the activation pattern corresponding to TIME than it would have been had, say, TOBE been presented. This behavior is falsified by priming experiments showing that TMIE primes identification of the word TIME better than TOBE does, so-called transposed-letter (TL) priming (Perea & Lupker, 2004; Schoonbaert & Grainger, 2004). Another challenging set of results for current models of lexical access is that humans can be remarkably unperturbed by several letter deletions (GRDN-GARDEN) or insertions (GARXYZDEN-GARDEN), so-called relative-position (RP) priming (Grainger, Granier, Farioli, Van Assche, & van Heuven, 2006; Van Assche & Grainger, 2006), and we have no trouble recognizing embedded or misaligned substrings (BOBCAT-CAT; Davis & Bowers, 2006).
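The failure mode just described can be made concrete with a minimal sketch (illustrative Python, not taken from any of the cited models): slot coding credits only letters that match in both identity and absolute position.

```python
# Illustrative sketch of slot coding (our simplification, not a published
# model): similarity is the proportion of letters that match in identity
# AND absolute position.
def slot_similarity(candidate, target):
    return sum(c == t for c, t in zip(candidate, target)) / len(target)

# Transposed letters contribute nothing: M2 and I3 in TMIE cannot support
# TIME, so TMIE is no better a prime than TOBE under this scheme.
assert slot_similarity("TMIE", "TIME") == slot_similarity("TOBE", "TIME") == 0.5
```

Human data show that TMIE primes TIME better than TOBE does, which is precisely the pattern this rigid scheme cannot reproduce.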

Determining the signature of the orthographic code has been the focus of much research in the last decade (for a review, see Grainger, 2008), but one first and inescapable conclusion is that flexibility is one of its hallmarks.

2.1. Proposals

Several proposals have been put forward to increase flexibility in orthographic representations and reproduce this signature. One possible modification would be that in-word positions are not in fact perfectly defined over slots: For each letter the word representation would maintain some measure of uncertainty as to its position (with some letter positions being less uncertain than others). This idea has been formalized and studied in the Overlap model of orthographic coding (Gomez et al., 2008), where letters in candidate stimuli have Gaussian position functions (e.g., stimuli ‘‘TRIAL’’ and ‘‘TRAIL’’ in Fig. 1A, left and middle), whereas known target words have perfectly defined positions (e.g., target ‘‘TRAIL’’ in Fig. 1A, right). The similarity between candidate and target is calculated only from those of the candidate's letters that are shared with the target, by summing the integrals of their density functions within the correct boundaries (i.e., over the slots corresponding to the target letters). In this account, a transposed prime would be less effective than the identity prime because the integration boundaries have been switched, making the areas under curves for I and A smaller for ‘‘TRIAL’’ (Fig. 1A, left) than for ‘‘TRAIL’’ (Fig. 1A, middle). However, prime ‘‘TRIAL’’ will still produce more facilitation than say ‘‘TRONL’’ because the former shares five letters with the target against only three for the latter.
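This computation can be sketched in a few lines (illustrative Python; the Gaussian width and the half-integer slot boundaries are our own assumptions, not the fitted parameters of the Overlap model):

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def overlap_similarity(candidate, target, sigma=0.8):
    # The candidate letter at position i has a Gaussian position pdf centred
    # on i; its contribution is the pdf mass falling inside the slot
    # [j - 0.5, j + 0.5] of the same letter at position j in the target.
    # (Repeated letters, noted as problematic later in the text, are ignored.)
    score = 0.0
    for i, c in enumerate(candidate, start=1):
        for j, t in enumerate(target, start=1):
            if c == t:
                score += normal_cdf(j + 0.5, i, sigma) - normal_cdf(j - 0.5, i, sigma)
    return score

# Transposition ("TRIAL") costs less similarity than substitution ("TRONL"):
# I and A still place much of their probability mass near the swapped slots.
assert (overlap_similarity("TRAIL", "TRAIL")
        > overlap_similarity("TRIAL", "TRAIL")
        > overlap_similarity("TRONL", "TRAIL"))
```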


Figure 1.  Three orthographic coding schemes illustrated with the comparison of target ‘‘TRAIL’’ (last column) to candidates ‘‘TRIAL’’ and ‘‘TRAIL’’ (first and middle columns, respectively). (A) Overlap coding: letter positions are encoded as Gaussian probability density functions (pdf), and similarities between candidate and target strings are computed as the sum of the pdf integrals for letters shared in both strings. (B) Spatial coding: letter positions are encoded by decreasing positional values, and similarities between candidate and target strings are calculated from the sum over each ‘‘signal-weight function’’ for letters shared in both strings (a signal-weight function is a Gaussian-like function defined for each common letter, centered on the difference between positional values in both strings). (C) Open-bigram coding: string representations are mapped to a point in the space of all possible ordered letter pairs (bigrams), and similarities between candidate and target strings are obtained by dividing the number of shared bigrams by the number of target bigrams.

A second alternative is Spatial Coding (Davis, 2010), which starts by assigning positions to letters (in Fig. 1B, the height of the bar for each letter marks its position value) and where the similarity between two strings also requires the summing of Gaussian density functions across every common letter. However, each Gaussian function is now centered on the difference between the positions occupied by the common letter in both strings, and the final similarity score is obtained not by summing the areas under curves, but by finding the peak of the summed curves (data not shown). In this account, ‘‘TRIAL’’ is less effective a prime than ‘‘TRAIL’’ because transposing two letters displaces the center of their corresponding Gaussian functions, and the subsequent summing produces a lower peak. Note that unlike Overlap coding, the letter nodes activated by candidate strings are duplicated in several banks (the clones in Fig. 1B), which are wired in finely tuned competitive and cooperative networks where ambiguities produced by repeated letters are assumed to be resolved.
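The peak-of-summed-Gaussians idea can be sketched as follows (illustrative Python; the width parameter and the grid search are our simplifications of Davis's fitted model, and the letter clones for repeated letters are omitted):

```python
import math

def gaussian(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def spatial_match(candidate, target, sigma=1.0):
    # One Gaussian per shared letter, centred on the difference between the
    # letter's position values in candidate and target (repeated letters and
    # Davis's clone banks are deliberately omitted from this sketch).
    offsets = [i - j
               for i, c in enumerate(candidate)
               for j, t in enumerate(target) if c == t]
    # The match value is the peak of the summed curves, found here by
    # scanning a fine grid of possible alignments.
    grid = [o / 10.0 for o in range(-10 * len(target), 10 * len(target) + 1)]
    peak = max(sum(gaussian(x, mu, sigma) for mu in offsets) for x in grid)
    return peak / len(target)

# The identity prime peaks higher than a transposed prime, which in turn
# beats a substituted prime.
assert (spatial_match("TRAIL", "TRAIL")
        > spatial_match("TRIAL", "TRAIL")
        > spatial_match("TRONL", "TRAIL"))
```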

A coding scheme that we will not study here but that bears mentioning because it has appeared several times in the literature in one form or another is the so-called Both-ends coding, where letter positions are encoded from both ends of the word (Cox, Kachergis, Recchia, & Jones, 2011; Fischer-Baum, McCloskey, & Rapp, 2010; Jacobs, Rey, Ziegler, & Grainger, 1998). Slot, Both-ends, Overlap, and Spatial coding are inherently letter-based and absolute schemes (they use coordinates from an origin to define letter positions within the string), but alternatives have been proposed that use information about the relative positions of letters (Grainger & van Heuven, 2003; Mozer, 2006; Whitney, 2001), for instance, by keeping track of ordered pairs of letters in the stimulus.2 This amounts to representing letter strings as points in the high-dimensional space of all possible letter pairs (Fig. 1C), and it is generally known as Open-bigram coding, a label that covers a range of schemes with different letter gaps and weight parameters. Whitney's Open-bigram scheme had a restriction on the authorized gap, and weights that decreased with the gap (Whitney, 2001). Grainger and van Heuven (2003) considered unconstrained and constrained Open-bigrams, that is, respectively, without or with gap restrictions. Table 1 shows the Open-bigram representations of the example word ‘‘ROOMS’’ for each of these schemes.

Table 1.  Word ‘‘ROOMS’’ as represented in three variants of Open-bigram coding: Seriol, Constrained (COB), and Unconstrained (UOB)

The similarity between prime and target is then obtained by a (possibly weighted) count of how many bigrams they share, divided by the target's norm. Note how only UOB takes bigram ‘‘RS’’ into account, the other schemes imposing a gap restriction of two letters. Also bigrams are counted only once,3 although a recent distributed implementation takes repeated bigrams into account with improved results (Hannagan et al., 2011a).
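The bigram sets and the similarity measure can be sketched as follows (illustrative Python; equal bigram weights and single counting are assumed, as in Grainger and van Heuven's unweighted variants):

```python
from itertools import combinations

def open_bigrams(word, max_gap=None):
    # Ordered letter pairs, counted once; max_gap=None gives unconstrained
    # Open-bigrams (UOB), max_gap=2 the constrained variant (COB).
    return {word[i] + word[j]
            for i, j in combinations(range(len(word)), 2)
            if max_gap is None or j - i - 1 <= max_gap}

def ob_similarity(candidate, target, max_gap=None):
    c, t = open_bigrams(candidate, max_gap), open_bigrams(target, max_gap)
    return len(c & t) / len(t)

# Only UOB registers bigram "RS" for ROOMS (three letters intervene).
assert "RS" in open_bigrams("ROOMS") and "RS" not in open_bigrams("ROOMS", 2)
# A transposition preserves most bigrams: TMIE shares TI, TM, TE, IE, ME
# with TIME, whereas TOBE shares only TE.
assert ob_similarity("TMIE", "TIME") > ob_similarity("TOBE", "TIME")
```

Note how no letter-to-letter alignment is ever computed: only the presence of shared bigrams matters.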

It might be worth noting at this point that all these string encoding schemes have predecessors in the memory for serial order literature: Slot coding can be compared to Conrad's seminal Box model (Conrad, 1965), Both-Ends coding to the Start-End model (Henson, 1998), Overlap coding to the Overlap model (Burgess & Hitch, 1992), Spatial coding to the Primacy model (Page & Norris, 1998), and Open-bigrams to the so-called Compound chaining models (Ebbinghaus, 1964; Slamecka, 1985). That models which were designed to capture serial order turn out to be relevant for ordered strings of letters is rather natural, although it would seem to limit the importance of domain-specific aspects for word encoding, such as parallel presentation and large storage requirements. This raises the question of whether these two fields might share more processes than is usually thought—for instance, could word recognition involve a serial beginning-to-end process? The single ocular fixation required by skilled readers to recognize a word, combined with the absence of a length effect, is good evidence that information about letter identity and location is not extracted from left to right but rather in parallel (Adelman, Marquis, & Sabatos-DeVito, 2010; Tydgat & Grainger, 2009), although probably not all letters at the same pace (Blais et al., 2009). Moreover, it would be expected that the necessity to store tens of thousands of words would trigger specific adaptations to general purpose encoding schemes in order to minimize interference for long-term storage and recall. In support of this, the processing of random letter strings appears to differ in important ways from the processing of sequences of symbols, as revealed by identification errors (Tydgat & Grainger, 2009).

Indeed, Open-bigram, Overlap, and Spatial coding describe different orthographic landscapes using two main concepts: an orthographic representation for any string and a similarity measure between two representations. But not all researchers have found it necessary to use representations in order to compute similarities. Inspired by string edit theory, Yarkoni, Balota, and Yap (2008) used the Levenshtein distance between strings to build a lexical neighborhood index that has arguably more explanatory power than the commonly used metric (the so-called N metric), which is based on a single letter substitution radius. Although this is certainly an improvement for the design and analysis of word recognition studies, as it stands the Levenshtein distance falls short of explaining why certain basic transformations (such as deletion) are generally more disruptive than others (such as repetition)—indeed it explicitly assumes that an insertion, a deletion, and a substitution have an equal editing cost. This is not to say that a principled explanation for different costs is beyond the reach of string edit models, as edit costs can be chosen on the basis of Minimum message length or Bayesian criteria (Allison, Wallace, & Yee, 1992; Dennis, 2005).4 Another option, however—and the one chosen in this article—is to assume the existence of an underlying orthographic coding scheme, and let these differences be induced by the string representations thereby generated. We now assess the behavioral, neural, and computational evidence for and against these schemes.
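For comparison, the uniform-cost Levenshtein distance discussed above can be computed with the standard dynamic program (illustrative Python):

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance: insertions, deletions and
    # substitutions all cost 1 (the uniform-cost assumption discussed above).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

# Under uniform costs, a deletion and a letter repetition are equally
# "far" from GARDEN, although deletions are typically more disruptive
# for readers.
assert levenshtein("GRDEN", "GARDEN") == levenshtein("GAARDEN", "GARDEN") == 1
```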

2.2. Behavioral evidence

The standard way to gather behavioral evidence on the orthographic code has been through masked priming lexical decision experiments. A robust finding in psycholinguistics is that human decisions on whether a given target string of letters is or is not a word (the so-called lexical decision task) are affected when a subliminal stimulus (a prime) has been flashed just before the target (Forster & Davis, 1984). Because subjects are unaware of the prime, this result is thought to be uncontaminated by conscious processes. What is more, it conveniently varies in strength when subliminal primes are more or less close in form to the target. For these reasons, masked priming has become the experimental paradigm of choice to investigate the lexical system on its own.

Overlap coding has been shown to capture a large selection of phenomena observed in a less commonly used paradigm, the same-different judgment task, where participants have to determine whether two briefly presented strings are identical (Gomez et al., 2008). However, its performance on masked priming data is often problematic as it predicts, for instance, no priming for nine-letter words when the final four letters 6789 are used as primes, and massive priming for eight-letter words with pairwise transposed primes like 21436587 (the so-called T-All condition). Both predictions have been falsified experimentally (Grainger et al., 2006; Guerrera & Forster, 2008). As for Spatial Coding, it consistently achieves superior agreement with masked priming constraints (see, for instance, Hannagan, Dupoux, & Christophe, 2011a). In particular, when used as an input scheme in an 18-parameter word recognition model, it provides a very good fit to 61 masked priming conditions (Davis, 2010). Note, however, that the T-All condition can only be accounted for ‘‘outside’’ the input coding scheme, mostly as a consequence of competition at the model's lexical level. This is because like Overlap coding, Spatial coding operates under the assumption that erroneous letters in the prime do not directly contribute negative evidence for the target. Only positive evidence is given by shared letters, and erroneous letters simply weaken the positive evidence from correct letters by disturbing their alignment with the target. Finally Open-bigram coding can generally achieve good correlations with priming data including (when the scheme is implemented with distributed representations) priming conditions such as reversed primes (4321) or repeated letters (123245) that have been deemed problematic for this specific coding scheme (Hannagan et al., 2011a). 

In summary, the three schemes can account for many aspects of the behavioral data, but as of today only Spatial coding has been studied in a full-blown word recognition model, where it achieves a very good fit to the masked priming data.

2.3. Biological evidence

Functional magnetic resonance imaging (fMRI) studies show that faces, tools, houses, and word stimuli activate small neighboring cortical patches in the fusiform region, at the frontier between occipital and temporal lobes (see, for instance, Gaillard et al., 2006; Hasson, Harel, Levy, & Malach, 2003; Kanwisher, 2006). This proximal organization is precisely what one would expect if some general visual object processing mechanism was at work, with the particular location of each patch reflecting idiosyncrasies such as the use of ordered symbols for word recognition.

Spatial coding has so far gathered little evidence from brain studies. Davis (2010) argues that the monotonic activation gradient used in this scheme could be implemented by dedicated neural populations for each letter, which would fire periodically yet consistently out of phase with one another. Learning the orthographic code for word W would mean learning to delay the correct letter signals so that these initially asynchronous signals eventually all arrive at the same time at a neural population dedicated to W. A misspelled word would then result in the letter signals being out of phase, as described in the spatial coding scheme. But this elegant mechanism has yet to be reported in the visual word form area following the presentation of word stimuli or letter strings, and it also requires dedicated populations of neurons, a controversial assumption (Bowers, 2010; Plaut & McClelland, 2010).

As for Overlap coding, its basic premise is consistent with the notion of overlapping receptive fields which by definition would produce some uncertainty on the retinotopic location of detected entities. However, the uncertainty described in Overlap coding operates at the level of in-word letter positions, not retinotopic letter locations. Thus, taken literally, this scheme requires the existence of at least two levels: an upper level with banks of neural populations tuned to all possible letters at all possible in-word positions that would receive activity via fan-in overlapping connections from topologically corresponding units in a lower retinotopic level. Although a retinotopic level and overlapping connectivity are completely consistent with the primate visual system, the in-word position level remains speculative. While location-specific letter detectors have been successful in theoretical accounts of crowding effects (Tydgat & Grainger, 2009) and are supported by fMRI data (Dehaene et al., 2004), there is so far no evidence coming from brain studies for banks of location-invariant, but position-specific, letter detectors.

Turning to Open-bigram coding, Binder, Medler, Westbury, Liebenthal, and Buchanan (2006) found that activity in the VWFA was correlated with the frequency of (contiguous) bigrams in letter strings presented to subjects. Using similar methods, Vinckier, Dehaene, Jobert, Dubus, Sigman, and Cohen (2007) further found evidence for a posterior-to-anterior gradient of activity in the VWFA that correlates with letter combinations of increasing complexities, in a hierarchy involving an Open-bigram level (with gap 1 bigrams). This evidence supports a biologically inspired account of orthographic coding known as the letter combination detector (LCD) proposal, which, based on the general architecture of the primate visual system, hypothesized the existence of such a hierarchy (Dehaene et al., 2005). Indeed, these findings may not be surprising considering the ample evidence provided by electrophysiological studies for the existence of neurons that code for invariant feature combinations in generic object recognition (Kobatake & Tanaka, 1994; Tsunoda, 2001).

In summary, biological plausibility supports Open-bigram coding. Not only is there direct evidence for this scheme in the VWFA, but as predicted from theoretical considerations, it uses the same basic mechanism as apparently used by object recognition in general, which is to build representations from collections of feature combinations.

2.4. Computational evidence

Computational models have more evidence to offer than just fits to the data, and indeed selecting a model exclusively on this criterion would expose one to the dangers of overfitting (Pitt & Myung, 2002). Even when considering two models with the exact same goodness-of-fit, one of these might take more time or memory to run, or it might be operating at the limits of its computational abilities, or be using complex algorithmic contraptions and a host of parameters, or simply be much harder to learn. Formal notions have been proposed to quantify each of these four criteria, respectively: time and memory complexity, complexity class (van Rooij & Wareham, 2008), parsimony as defined by Kolmogorov complexity (Li & Vitanyi, 1997), and learnability (Valiant, 1984). In this section we shall only consider informally how the different proposals fare in some of the previously cited respects, but model selection has been a very lively topic of research in recent years, with several principles now available to decide between models (Grünwald, 2004; Pitt & Myung, 2002).

First let us look at how many parameters are tuned in the three coding schemes. In order to fit behavioral data, Overlap coding uses as many parameters as the longest string involved in the comparison: Each parameter determines the variance of the Gaussian uncertainty function of a letter. In contrast, Spatial coding uses a two-parameter formula that computes the variance for each letter given its position, and one parameter to specify the importance of end letters. The type of Open-bigram coding used in Whitney's Seriol model also involves edge bigrams, and two different bigram weights that depend on letter contiguity. Finally, a single parameter is tuned in Grainger and van Heuven's versions (the letter gap, which is either fixed or absent).

Let us now consider what kind of computational mechanisms are at play in these schemes. Overlap and Spatial coding share one critical feature, which is that similarity calculations require the prior knowledge of which letter from the target should be mapped to which letter from the candidate. For strings of equal length, this has been tackled by assuming that only the same letters at the same position are considered for comparison (an intuitive although strong hypothesis in itself). But this becomes problematic for strings with different lengths and repeated letters, because there is no easy way to tell how this mapping should proceed in general. Simply consider, for example, the candidate TENT and target METER: Overlap coding does not tell us which of the two very different uncertainty functions for T in the candidate string we should use to compute the similarity with T in the target.5 Spatial coding faces the same issue, but it proposes a complex solution that involves duplicated letter banks for each position (the so-called letter clones), competitive-cooperative interactions between and within banks, and the invocation of a new principle—the principle of clone equivalence. In Open-bigram coding, no prior knowledge of the letters to be compared is assumed, and repeated letters are not a problem when computing similarities because the basic unit of comparison, the bigram, does not have a position anymore; only its presence in both representations counts.

Computational evidence should also be sought on how these coding schemes can be learned. Computational models of object recognition have shown that units tuned to feature combinations can arise from training hierarchically organized layers of neurons with limited receptive fields (Wallis & Rolls, 1997), possibly with interleaved layers of simple and complex cells as found in the ventral route (Riesenhuber & Poggio, 1999). Nevertheless, a neuro-computational demonstration that Open-bigrams can be learned is still unavailable, nor is there a computational model that learns Spatial coding. In this respect, Overlap coding might arguably have the advantage, given that a backpropagation network trained for location-invariant word recognition has recently been shown to implement what can be described as a distributed version of this specific coding scheme (Hannagan, Dandurand, & Grainger, 2011b).

In summary, the time and memory complexities of these schemes are still unknown, but a crude assessment of the number of parameters and the parsimony of the mechanisms involved is in support of Grainger and van Heuven's Open-bigrams. We have also identified defects in the current specification of Overlap coding that limit its computational power relative to Open-bigram and Spatial coding. However, learnability might still support a scheme in the spirit of Overlap coding. We now proceed to strengthen the computational evidence in favor of Open-bigram coding, by pointing out its relationship with a powerful machine learning tool known as String kernels.

3. String kernels

Standard data analyses are typically carried out on vectors, but most complex real-life phenomena do not fall naturally into this category. Raw input data must usually go through a preprocessing stage before they can be suitably analyzed, which raises the questions of how to select relevant features and how this selection will affect subsequent stages. Kernel methods provide a general way to tackle this issue, which is especially relevant to cognitive science (Jäkel, Schölkopf, & Wichmann, 2009). The gist of Kernel methods is to map the initial structure onto a feature vector that lives in a vastly higher dimensional space than the original structure—for instance, the space of all possible feature combinations of a given length. The high dimensionality somewhat alleviates the feature selection step,6 and the vectorial nature of feature space then enables standard vector analyses to operate on real-life discrete structures such as trees, graphs, and most relevant to this article, sequences (Hofmann, Schölkopf, & Smola, 2007) (Fig. 2).


Figure 2.  Illustration of how String kernels can support decision processes in three different research domains. (A) Bioinformatics: protein function prediction using String kernels on sequences of amino acids. (B) Cognitive science: lexical decision using String kernels on short letter sequences. (C) Natural language processing: text classification using String kernels on concatenated words. Once sequences have been mapped to the space of domain-specific feature combinations, decision processes such as support vector machines or neural networks can be trained or tuned to perform classification tasks.

More formally, a kernel is a function that takes two structures and returns an inner product between the mapped structures in some feature space (Jäkel, Schölkopf, & Wichmann, 2007). The feature map is chosen so as to facilitate the separability of the mapped structures, and it is usually of high dimensionality. Because of this, in general, the explicit computation of all the mapped structures’ coordinates is prohibitive, but one strength of kernel methods is that for astutely chosen feature maps the inner products can nevertheless be calculated implicitly at a much smaller cost (the so-called Kernel trick).7 Note, however, that the String kernels of interest in this article are simple ones where the explicit computation of coordinates is actually feasible.
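A toy example outside the string domain may help fix the idea of the Kernel trick (illustrative Python; the quadratic kernel is a textbook case, not one used in this article): the kernel value equals an inner product in a feature space that is never explicitly constructed.

```python
import math

# For the quadratic kernel k(x, y) = (x . y)^2 on 2-dimensional inputs,
# the implicit feature map sends (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2).
def quad_kernel(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
# The implicit computation agrees with the explicit inner product.
assert abs(quad_kernel(x, y) - explicit) < 1e-9
```

The String kernels below are simple enough that the explicit coordinates can be computed directly, but the same equivalence holds.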

We will now illustrate the performance and versatility of String kernels by reporting on the results obtained by two recent studies on problems that arise in Natural language processing and in Bioinformatics.

3.1. Text classification

Human activities are increasingly mediated by the Web, and the ability to categorize text content quickly and accurately has become of utmost importance when designing search engines. This is, for instance, required to automatically prevent spurious ‘‘webspam’’ pages from invading Google search results, or simply to keep spam out of one's mailbox.

In Lodhi et al.’s (2002) string subsequence kernel for text classification, letter strings generated from a given alphabet Σ are compared in two steps. First, each letter string is mapped onto a vector of dimension |Σ|^k, where each component registers a possible subsequence of size k (i.e., an ordered, not necessarily contiguous, combination of k letters), weighted by a contiguity factor λ that diminishes exponentially with the gap. For example, a two-kernel mapping for words TIME and MEMES would produce two vectors of dimension 26^2 = 676. In this case each vector would be very sparse, consisting mostly of zeros except for the few components that respectively correspond to existing subsequences in TIME and MEMES, as described in Table 2.

Table 2.  A two-String kernel for words TIME and MEMES

Note that the dimension of the feature vector does not depend on string lengths, and in case of repeated letter combinations, weighted contributions are simply added up in the corresponding feature vector component. The kernel value between both strings is then obtained by taking the inner product of the normalized feature vectors. Assuming λ = 1 for simplicity:

  K(TIME, MEMES) = φ(TIME) · φ(MEMES) / (‖φ(TIME)‖ ‖φ(MEMES)‖) = 3 / (√6 · √20) ≈ 0.27

When applied to entire webpages rather than single words, for example, Lodhi et al.’s string subsequence kernel simply treats the text as one long string of concatenated words and spaces (with punctuation and stop words such as ‘‘the’’ and ‘‘at’’ removed). The resulting feature vector is still of the same fixed dimension (or rather, of dimension (26 + 1)^k because of the blank character), but it is much denser. The similarity between two strings S and T can nevertheless be computed in O(k|S||T|) time, that is, linearly in each of k and the string lengths.
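The feature map and normalized kernel described above can be sketched in a few lines. This is a brute-force illustration of our own (the function names are ours), not Lodhi et al.'s optimized dynamic-programming implementation; each subsequence occurrence is weighted by λ raised to the span it covers:

```python
from collections import defaultdict
from itertools import combinations
import math

def feature_map(s, k=2, lam=1.0):
    """Map string s onto the space of length-k letter combinations.

    Each occurrence of a subsequence is weighted by lam ** span, where
    span is the stretch of s it covers, so gappier occurrences
    contribute exponentially less whenever lam < 1.
    """
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        span = idx[-1] - idx[0] + 1
        phi[''.join(s[i] for i in idx)] += lam ** span
    return dict(phi)

def _norm(phi):
    return math.sqrt(sum(v * v for v in phi.values()))

def substring_kernel(s, t, k=2, lam=1.0):
    """Normalized inner product of the two feature vectors."""
    ps, pt = feature_map(s, k, lam), feature_map(t, k, lam)
    dot = sum(v * pt.get(f, 0.0) for f, v in ps.items())
    return dot / (_norm(ps) * _norm(pt))
```

With λ = 1, TIME receives six unit components (TI, TM, TE, IM, IE, ME) and MEMES receives ME = 3, MM = 1, MS = 2, EM = 1, EE = 1, ES = 2; the only shared component being ME, the kernel value is 3/(√6 · √20) ≈ 0.27.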

The authors used this kernel to train a support vector machine (SVM) to perform binary classifications on a sample of documents taken from the Reuters dataset (Lewis, 2004). SVMs are supervised binary classifiers that operate in conjunction with kernel methods (Cristianini & Shawe-Taylor, 2000). When presented during training with exemplars of two categories that have been mapped onto feature space, an SVM tries to find a hyperplane that separates the exemplars using a principle of maximum margin distance (illustration of the ‘‘decision process’’ in the right-hand side of Fig. 1). The SVM is then tested for its generalization performance in a subsequent phase, where it is presented with new exemplars of both categories. Performance is assessed using standard quantities such as precision, recall, and F1.8 Four hundred seventy documents were chosen from four of the most frequent categories in the Reuters database: 380 were used during training and 90 during the test phase. The SVM achieved state-of-the-art performance in binary classification, as measured by F1 scores, when compared to two SVMs that used other kernels (n-gram or word kernels). When averaged across all categories, performance also peaked for subsequences of length k = 5 or for low values of the contiguity factor λ (i.e., strong gap penalties).

3.2. Protein classification

Coming from computational biology, remote protein homology detection is another problem where researchers have found it useful to use String kernels. Consider the SCOP database (Murzin, Brenner, Hubbard, & Chothia, 1995), a registry of proteins for which both structure and function are known, organized hierarchically into families, superfamilies, and folds. Two proteins are deemed remote homologs if they belong to different families but to the same superfamily. The remote protein homology problem consists in distinguishing between homologs and non-homologs among a sample of proteins that span a number of families.

To tackle the remote protein homology detection problem, Leslie and Kuang (2004) proposed a ‘‘restricted gappy’’ String kernel that uses the same feature space as Lodhi et al. (that is, a feature space indexed by all letter combinations of size k in the alphabet) but where only a restricted number of gaps is allowed when matching subsequences of the target string. Let us assume an alphabet Σ, a string S written in this alphabet, and a chunk α of g contiguous letters in S. The feature map φ_{g,k} for α is a vector of |Σ|^k components, each component standing for a possible combination β of k letters in Σ, and is defined in the following way:

  φ_{g,k}(α) = (φ_β(α))_{β ∈ Σ^k},  where φ_β(α) = 1 if β occurs as a (possibly non-contiguous) subsequence of α, and 0 otherwise

The feature map for any string S is the sum of the feature maps for all α in S. This is effectively equivalent to computing all feature maps for a window of g letters that is sliding across string S. Again, the (g,k)-gappy kernel between strings S and T is then simply defined as the normalized inner product of the feature maps for S and T, that is:

  K_{g,k}(S, T) = Φ_{g,k}(S) · Φ_{g,k}(T) / (‖Φ_{g,k}(S)‖ ‖Φ_{g,k}(T)‖),  where Φ_{g,k}(S) = Σ_α φ_{g,k}(α) over all g-letter chunks α of S

Leslie and Kuang also envisioned a system of gap penalties very much in line with Lodhi et al., whereby substrings are assigned exponentially large penalties as a function of the gap with which they occur in the string.
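The sliding-window construction just defined can be sketched directly. This is a naive illustration of our own (function names are ours), with binary matching inside each window and no gap penalty:

```python
from collections import defaultdict
from itertools import combinations
import math

def gappy_feature_map(s, g=4, k=2):
    """(g,k)-gappy map: slide a window of g contiguous letters across s;
    inside each window, register every length-k subsequence once
    (binary per window), then sum the windows' vectors."""
    phi = defaultdict(int)
    for start in range(len(s) - g + 1):
        window = s[start:start + g]
        for kmer in {''.join(window[i] for i in idx)
                     for idx in combinations(range(g), k)}:
            phi[kmer] += 1
    return dict(phi)

def _norm(phi):
    return math.sqrt(sum(v * v for v in phi.values()))

def gappy_kernel(s, t, g=4, k=2):
    """Normalized inner product of the summed window vectors."""
    ps, pt = gappy_feature_map(s, g, k), gappy_feature_map(t, g, k)
    dot = sum(v * pt.get(f, 0) for f, v in ps.items())
    return dot / (_norm(ps) * _norm(pt))
```

Note that the number of windows grows linearly with |S| while the per-window work depends only on g and k, which is the intuition behind the linear-time behavior reported by Leslie and Kuang.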

The authors then used the restricted gappy kernel as a base to train an SVM on the remote protein homology problem. During training, the SVM was presented with exemplar proteins that belonged to several possible families (with the exclusion of one, the target family) within a given superfamily, as well as non-exemplar proteins coming from other superfamilies. Once satisfactory classification performance was reached on the training base, the SVM was tested on new proteins coming from the target family. Here assessment was carried out using the receiver operating characteristic with a standard area under the curve (AUC) score, another performance measure for classifiers9 that returns 1 for a perfect classification and 1/2 for a random one. With an AUC of 0.851 over 54 target families, the SVM exhibited good performance relative to SVMs trained with other kernels. Performance was essentially similar with or without gap penalties and was also correlated with an a priori similarity score defined between the kernel and the target domain. In other words, SVM performance on a given training base could be estimated a priori from the restricted gappy String kernel.
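The AUC score admits a simple rank interpretation: it is the probability that a randomly drawn positive example is scored above a randomly drawn negative one, which can be computed directly (a minimal sketch; the helper name `auc` is ours):

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly drawn positive scores higher than a
    randomly drawn negative (ties count one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Perfectly separated scores give 1, and identical score distributions give 1/2, matching the values quoted in the text.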

4. Open-bigrams as String kernels


There are two small but significant differences between Open-bigrams and String kernels. The first is that repeated letter combinations are added up in String kernels (feature vectors take integer values), but not in Open-bigrams (feature vectors take binary values). The second difference is that similarities are symmetric in String kernels, but not in Open-bigrams, where one distinguishes between candidate (prime stimulus) and target strings and the similarity is given by the strings’ inner product divided by the target's squared norm. Except for these two differences, the reader will have recognized that Leslie and Kuang's Restricted gappy kernel with g = 4, k = 2, and a non-exponential gap penalty is essentially Whitney's weighted Open-bigram scheme. Without gap penalty, we recover Grainger and van Heuven's constrained Open-bigram scheme. Finally, Lodhi et al.’s substring kernel with k = 2 is the unconstrained Open-bigram scheme.
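Both differences can be made concrete in a few lines. The sketch below (function names are ours) implements the unconstrained Open-bigram code with binary features and the asymmetric prime-to-target similarity just described:

```python
from itertools import combinations

def open_bigrams(s):
    """Unconstrained open-bigram code: the set of ordered letter pairs
    of s (binary features, so repeated combinations are not counted)."""
    return {s[i] + s[j] for i, j in combinations(range(len(s)), 2)}

def ob_similarity(prime, target):
    """Asymmetric Open-bigram match: shared bigrams divided by the
    target's squared norm, i.e., its number of bigrams."""
    p, t = open_bigrams(prime), open_bigrams(target)
    return len(p & t) / len(t)
```

The asymmetry is easy to exhibit: prime TIME matches target TIM perfectly (similarity 1), whereas prime TIM matches target TIME at only 0.5, the kind of behavior that symmetric normalization in String kernels rules out.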

A remarkable fact about the literature on String kernels and Open-bigrams is that they do not cross-reference at all. It thus appears that the same technique has been introduced in distinct scientific areas at about the same time, unbeknownst to the actors involved.10 This is yet another example of parallel meme evolution, and a modest addition to an already long list of simultaneous scientific findings. What is perhaps striking in the String kernels/Open-bigrams case is how this parallel evolution took place in disciplines driven by quite different pressures: efficiency of prediction/analysis for computational biology and natural language processing, but fidelity to human behavior and neural mechanisms for cognitive science.

What is it that String kernels seem to capture in protein prediction, in text classification, and in visual word recognition? Could it be one and the same thing? Indeed, it could be argued that there are really two main ideas at the origin of String kernels: the idea of mapping structures onto high-dimensional vectors, and the idea that in so doing vector components should stand for feature combinations. How can we tell whether the success of String kernels in different fields comes from one idea or the other, or both?

Happily, it turns out that researchers in natural language processing have studied ways to map documents onto high-dimensional vectors but without using feature combinations, using so-called ‘‘Bag of Words’’ approaches. In latent semantic analysis (Joachims, 1998; Landauer & Dumais, 1997; Landauer, McNamara, Dennis, & Kintsch, 2007), for instance, a text database is turned into a large matrix where rows represent words, columns represent documents, and each entry is weighted proportionally to the word's frequency in the document and to its inverse frequency in the database. Singular value decomposition is then used to produce a single large vector for each document, where each component reflects the importance of a word in the document (rather than a combination of words or a subset of letters). This scheme has been quite successful and indeed illustrates that the mapping of structures onto high-dimensional spaces is a key to success in classification, as expected from Cover's theorem (Cover, 1965). However, Lodhi et al. showed that the substring kernel outperformed the Bag of Words approach on all the categories it had been applied to—although it took longer to compute. The conservative expectation to form for other fields, then, would be that beyond the high-dimensionality aspect, it is really the idea of representing structures as relationships between their parts that makes a difference.
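For contrast, a Bag-of-Words pipeline of this kind can be sketched on a toy corpus. The three documents below are a hypothetical example of ours, and the tf-idf weighting and rank-2 truncation are deliberate simplifications of full latent semantic analysis:

```python
import numpy as np

# Toy corpus: two related documents and one unrelated document.
docs = [["cat", "dog", "pet"],
        ["dog", "cat", "animal"],
        ["stock", "market", "price"]]
vocab = sorted({w for d in docs for w in d})

# Term-document matrix weighted by tf * log(N / df), i.e., tf-idf.
tf = np.array([[d.count(w) for d in docs] for w in vocab], dtype=float)
df = (tf > 0).sum(axis=1)
tfidf = tf * np.log(len(docs) / df)[:, None]

# Rank-2 truncated SVD: each document becomes a dense 2-d vector.
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_vecs = (S[:2, None] * Vt[:2]).T   # one row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

In the reduced space the two pet documents stay close together while the finance document remains essentially orthogonal to them, even though no word combination was ever represented.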

We would argue that this key concept of letter combinations used in String kernels is successful across fields because it provides a way around similar issues: tolerating global displacements and local changes when comparing two sequences. In the domain of protein analysis, for instance, non-contiguous combinations are useful to reflect the fact that strings of amino acids that share displaced subsequences will probably have similar functions, as they likely evolved from a common string by way of insertion or deletion of an amino acid. In orthographic coding, non-contiguous combinations can be used to represent words independently of their location on the retina, or in other terms, to evoke the same representation despite global displacements of a given stimulus. For all the generality of String kernel methods, however, this is not to say that the success of a particular String kernel in one domain automatically guarantees its success in another. The tuning of parameters such as substring length and gap penalty will be dictated by the specifics of each domain, and in particular the size and number of all sequences that are to be distinguished as well as the statistical regularities between them. For proteins and text classification, large sequence sizes call for larger substrings and the underlying regularities are best captured by exponential gap penalties, whereas in the word domain a smaller grain of letter combinations is more effective and the gap penalties that have been proposed are relatively weak.

4.1. Insights and tools from String kernels

The classification of Open-bigram coding as a String kernel method comes with a number of insights.

4.1.1. Formalism

For one thing, String kernels have been specified in a more formal way. The mathematics are simple and require only a mapping function from input to feature space (what we have referred to as a feature map), and a kernel function that returns the similarity between any two points in feature space. The requirement on the kernel function is that it satisfies Mercer's conditions (i.e., it must be symmetric and positive semidefinite). In all the kernels we have considered, the kernel function is the normalized inner product, which automatically satisfies Mercer's conditions. However, Open-bigram schemes do not comply with this requirement because in general distances are not symmetric (similarities only involve the target's norm), and we will see that restoring symmetry promotes better performance.

4.1.2. Computational complexity

A strong appeal of String kernels is that they can be computed without actually representing structures in feature space, using string alignment and dynamic programming techniques (Lodhi et al., 2002) or data structures like Tries (Leslie & Kuang, 2004). Given two strings S and T and noting |·| the length, Lodhi et al.’s k-substring kernel can be computed in O(k|S||T|) time, whereas Leslie and Kuang's restricted gappy kernel is computed in O(C_{g,k}(|S| + |T|)) time, where C_{g,k} is a constant (in the sense that it is determined by the kernel's g and k parameters, not by string lengths or by alphabet size). In other words, this means that similarities in all Open-bigram schemes can be computed in linear time with respect to string lengths. An important point is that because structures do not need to be explicitly mapped to feature space in order for the kernel to be computed, computation time for all kernels does not depend on the dimensionality of the feature space. Hence, just as much time is needed to compute a String kernel for DNA sequences, where the alphabet is restricted to four nucleotides, as for proteins or for words. Finally, results are also available on how to approximate String kernels and how to implement them as a particular kind of automata (transducers; Leslie & Kuang, 2004).

However, such complexity analyses could be seen as irrelevant to the brain, given first that they do not involve the actual calculation of coordinates in feature space, whereas we have argued that the brain performs this mapping explicitly, and second that they are derived from the assumption of sequential processes (automata and Turing machines), whereas the neural machinery operates in parallel. For these reasons, demonstrations coming from the branch of complexity theory called circuit complexity would appear to be more directly relevant, especially with a view to an actual implementation of the feature map using standard connectionist networks (Shawe-Taylor, Anthony, & Kern, 1992). We would simply argue that determining the circuit complexity class in which String kernels fall could be facilitated by these results and by the existing theorems relating sequential and parallel complexity theory (Papadimitriou, 1993). Note that a related reason why one should not discard sequential complexity results is that, as we have seen, not all orthographic coding schemes claim a proximity to the neural substrate or assume fully parallel computations (Spatial Coding, for instance, requires a rapid sequential scan of the string in order to assign the activation gradient). The bottom line is that orthographic coding schemes have been implemented as computational models, but strikingly, until now no information was available on the computational complexity of these schemes. We think that the results on String kernels are a first step toward this end.

4.1.3. New ideas for visual word recognition

A first insight from String kernels concerns the use of a gap penalty in Open-bigram coding. Lodhi et al. noticed that weak gap penalties in the substring kernel/unconstrained Open-bigram scheme allowed successful classification on text corpora that contained polysemic words, that is, words with the same orthography but different meanings. This is because polysemic words tend not to be surrounded by the same words, something that is picked up by taking more context into account (weak gap penalties). Applying this insight to orthographic coding suggests that on the contrary, languages where many semantically related words are also orthographically related would need to use a relatively high gap penalty in order to keep words well separated in orthographic feature space and thus facilitate access to semantics. This would be the case, for instance, in agglutinative languages such as Turkish or Basque.

Another insight is that some String kernels have been described and tested that currently have no counterpart in visual word recognition, although they might well be relevant for it. One example of a potentially interesting kernel is the so-called wildcard kernel of Leslie and Kuang. Starting from a normal substring kernel, the idea is to introduce a wildcard character ‘‘*’’ in the alphabet that matches all characters. That is, feature space is expanded to dimension (|Σ| + 1)^k (or, if the blank character is also included, (|Σ| + 2)^k). When k = 3, this means that in the feature vector for word TIME, all components that correspond to *IM, TI*, T*M, IM*, and so on would be incremented. Anticipating some interesting new behavior for Open-bigram and trigram coding, we can propose a gappy version of a wildcard kernel (a Gappy 3-wildcard kernel; see Appendix) where string TIME activates feature components (*IM, T*M, TI*, *IE, T*E, TI*, *ME, T*E, TM*, *ME, I*E, IM*). This scheme is simply obtained from Grainger and van Heuven's (2003) UOB scheme—henceforth GvH UOB—by substituting the wildcard character at every possible position in each bigram, and it is tested in the following section.
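Ignoring window restrictions and gap penalties for simplicity, the feature map just described can be sketched by letting every ordered letter triple activate its three wildcarded versions (an illustration of our own; function names are ours):

```python
from collections import defaultdict
from itertools import combinations
import math

def wildcard_map(s):
    """Sketch of the 3-wildcard feature map: every ordered letter
    triple of s activates its three wildcarded versions
    (*BC, A*C, AB*), and repeated combinations add up."""
    phi = defaultdict(int)
    for i, j, l in combinations(range(len(s)), 3):
        a, b, c = s[i], s[j], s[l]
        for pattern in ('*' + b + c, a + '*' + c, a + b + '*'):
            phi[pattern] += 1
    return dict(phi)

def _norm(phi):
    return math.sqrt(sum(v * v for v in phi.values()))

def wildcard_kernel(s, t):
    """Symmetric similarity: normalized inner product (Mercer-valid)."""
    ps, pt = wildcard_map(s), wildcard_map(t)
    dot = sum(v * pt.get(f, 0) for f, v in ps.items())
    return dot / (_norm(ps) * _norm(pt))
```

On TIME this map gives TI* and *ME a value of 2 (two possible wildcard matches each) but IM* only a value of 1, so components involving edge letters carry the largest weights.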

4.1.4. Simulations of String kernels

We tested the GvH UOB scheme (without repeated bigrams and with asymmetrical similarities), the Gappy String kernel from Leslie and Kuang, the Spatial Coding scheme (rather than the Spatial Coding model), and the Gappy 3-wildcard kernel described above, on two recent benchmarks in the orthographic coding literature: Davis (2010) and Hannagan et al. (2011a). Overlap coding was not included in these simulations due to the previously mentioned under-specification on how to calculate similarities when letter repetitions are involved. These two benchmarks are somewhat complementary in that the first is more exhaustive but only provides correlations to human data, which are susceptible to being artificially improved when schemes largely underestimate similarities, whereas the second involves fewer conditions but assesses what type of relationship must hold between them so as to satisfy a number of constraints consistently reported in the literature. We selected all the conditions from Davis (2010) that involved standard masked priming experiments and that did not manipulate frequency and neighborhood, as these were not relevant to assessing orthographic coding schemes, which are blind to these factors. There were 45 such conditions out of the original 61 in Davis (2010), each corresponding to a point in the correlation graphs with human data shown in Fig. 3.


Figure 3.  Correlation between human priming data (X-axis, in milliseconds) and similarities (Y-axis) produced by four coding schemes on 45 conditions selected from Davis (2010). Upper left: GvH UOB scheme. Upper right: Gappy String kernel. Lower left: Spatial Coding scheme. Lower right: a Gappy 3-wildcard kernel.


Table 3 shows how these four schemes satisfy the constraints from Hannagan et al., which we briefly describe here:

  • Stability: no string representation should be more similar to another than to itself.
  • Edge effects: a letter substitution occurring at the edge of a string should be more disruptive than in the middle.
  • Local TL (where TL refers to ‘‘transposed letter’’): transposing two adjacent letters should be less disruptive than substituting them altogether.
  • Distant TL: transposing two distant letters should be more disruptive than substituting one of them, but less so than substituting both.
  • Compound TL: a substitution combined with a transposition should be more disruptive than a single substitution.
  • Global TL: transposing all adjacent letter pairs in a word should be maximally disruptive.
  • Distinct RP (where RP refers to ‘‘relative position’’): removing letters from a string should conserve some similarity.
  • Repeated RP: in a string with two repeated letters it should be equally disruptive to remove one of the repeated letters as one of the non-repeated letters.

A detailed presentation of these constraints can be found in Hannagan et al. (2011a).

Table 3.  Performance of the coding schemes on the constraints from Hannagan et al. (2011a). Schemes compared (columns): GvH UOB, Gappy String Kernel, Spatial Coding, 3-Wildcard Kernel; constraints tested (rows): Edge effects, Local TL, Distant TL, Compound TL, Global TL, Repeated RP.

Fig. 3 shows a good general agreement of simulations with human data but also significant differences between schemes, with the lowest correlation starting at 0.67 for UOB, then Spatial Coding (0.73), Gappy String kernel (0.75), and finally the 3-wildcard kernel, which achieves the best correlation at 0.8. This variability is amplified in the Hannagan et al. (2011a) benchmark reported in Table 3, where the ranking is changed to UOB (two constraints met), Gappy String kernel (three constraints), 3-wildcard (six constraints), and finally Spatial Coding (seven constraints). One might wonder why Spatial Coding ranks only third on Davis (2010) but first on Hannagan et al. (2011a). This is because the edge constraint is only met in Spatial Coding by making the explicit but ad hoc specification of an End-of-Word parameter that increases the weight of edge letters. Without this additional degree of freedom, Spatial Coding and the 3-wildcard kernel would both satisfy six constraints, restoring the 3-wildcard kernel to the first rank as in the Davis (2010) benchmark.

Our benchmarks also agree on two remarkable points. First, UOB is always outperformed by the Gappy String kernel—or in other words UOB is always improved by taking into account repeated bigrams in the string representation and by switching to symmetric similarities between strings. This is mostly because in so doing the superset conditions (i.e., conditions with letter insertions and repetitions, which are part of the stability constraint and of Davis's benchmark) are assigned less than perfect similarities, which restores stability in the scheme and boosts the general correlation to priming data. The second and perhaps most remarkable common point is that both schemes are always outperformed by the 3-wildcard kernel. In addition to the three constraints already satisfied by the Gappy String kernel, switching to a 3-wildcard kernel now allows one to account for edge effects, for two types of transposition effects (distant and compound TL), and it overall raises the correlation with priming data to the 0.8 level.

Detailed explanations on why Open-bigram coding succeeds or fails on different constraints can be found in Hannagan et al. (2011a), and because the mechanisms described there for Open-bigrams are also relevant to the wildcard kernel every time these schemes agree, here we will focus on the constraints where they behave differently; that is, edge effects, distant TL, and compound TL. Edge effects emerge in the 3-wildcard kernel as a result of taking into account repeated combinations: In the vector that represents word ‘‘TIME,’’ components for TI* and *ME have a value of 2—the number of possible wildcard matches—whereas the component for IM* only has a value of 1. Consequently, when similarities are computed, the components that involve edge letters tend to contribute more than others to the inner product, so that stimuli like ‘‘AIME’’ or ‘‘TIMA’’ are deemed less similar to TIME than TIPE is (the definition of edge effects). Counterintuitively though, the reason for the superior performance on certain transposition effects also involves edge letters. For instance, both UOB and Gappy String kernels fail on the distant TL constraint because they wrongly imply that a single substitution will be more disruptive than a transposition, as the former condition disturbs more units in the string representation than the latter. That much is also true of the 3-wildcard kernel, but now the difference is that the act of transposing letters C and E in string ‘‘ABCDEF’’ also transposes the values of components A*C and A*E (1 and 3, respectively) in the representation vector. When computing similarities, the contributions for those two components sum up to 1·3+3·1 = 6, rather than to 0·1+3·3 = 9 as they would in the case of a single substitution. Similar but smaller trade-offs take place for all units involved in representing the modified letters, and they aggregate to produce a distant TL effect.
The deeper cause for this effect is that introducing a wildcard character also induces a natural count of letter positions in the manner of both-ends coding (Fischer-Baum et al., 2010; Jacobs et al., 1998): In our example with string ‘‘ABCDEF,’’ the vector components for A*C and C*F have values 1 and 2, respectively, that is, the number of letters that are standing between C and the first letter, and between C and the last.
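This both-ends positional count can be verified directly from the code itself. The sketch below redefines the triple-based wildcard map and reads a letter's position out of the two components that pair it with the string's first and last letters (`position_from_edges` is a hypothetical helper name of ours, valid for interior letters):

```python
from collections import defaultdict
from itertools import combinations

def wildcard_map(s):
    """Triple-based wildcard code: every ordered letter triple of s
    activates its three wildcarded versions (*BC, A*C, AB*)."""
    phi = defaultdict(int)
    for i, j, l in combinations(range(len(s)), 3):
        a, b, c = s[i], s[j], s[l]
        for pattern in ('*' + b + c, a + '*' + c, a + b + '*'):
            phi[pattern] += 1
    return dict(phi)

def position_from_edges(s, letter):
    """Read an interior letter's position out of the code alone:
    component FIRST*letter counts the letters standing between the
    string's first letter and `letter`, so position = value + 2;
    component letter*LAST likewise gives the position from the end."""
    phi = wildcard_map(s)
    from_start = phi.get(s[0] + '*' + letter, 0) + 2
    from_end = phi.get(letter + '*' + s[-1], 0) + 2
    return from_start, from_end
```

For ‘‘ABCDEF,’’ the component A*C has value 1 and C*F has value 2, so C is recovered as the third letter from the start and the fourth from the end, exactly the both-ends count described in the text.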

5. Discussion


5.1. The case for String kernels in the brain

We have shown that String kernels are powerful machine learning algorithms, performing at a state-of-the-art level on problems coming from different fields. But if String kernels provide a performant and versatile machine learning tool, can we conclude that they are part of the solution found by the brain to represent visual words?

A first reason to refrain from drawing conclusions at this point would be that while the performance of String kernels has been demonstrated on machine learning benchmarks, Overlap or Spatial Coding have so far only been tested in the field of visual word recognition. Short of a direct comparison, how can we conclude that String kernels are more versatile than either of these two schemes? We think that such prudence would be excessive in this case. First, the fact that machine learning researchers did not come up with letter-based and absolute coding alternatives suggests that keeping track of the position of all constituents within the sequence, whether by using probability density functions as in Overlap coding or by differences of positional gradients as in Spatial coding, is not an efficient way to represent sequences. More important, we have seen that significant additional assumptions are required in Spatial and Overlap coding in order to compare strings of different lengths or to deal with repeated letters (Davis, 2010; Gomez et al., 2008). These assumptions are already problematic when dealing with short sequences like words, but they would presumably become impractical when dealing with large sequences like proteins or webpages, where the number of letter repetitions grows at least linearly with the string. In any case, what we can safely conclude for now is that Open-bigram coding provides a way to represent sequences that performs well and extends far beyond cognitive science. These are remarkable properties that are not yet known to be true of Spatial or Overlap coding, and the burden of proof remains on the proponents of these schemes to show that they can outperform String kernels on machine learning benchmarks. Until then, versatility and performance, as measured by the standards of machine learning, are on the side of String kernels.

If String kernels are indeed more versatile and better performing than other coding schemes, what does this tell about their possible implementation in the brain? One could argue, for instance, that the beginning reader faces one neural constraint that has no counterpart in machine learning: His brain must start not from a blank slate but from existing visual processing machinery (Dehaene & Cohen, 2007). However, we have previously mentioned the evidence showing that generic object recognition already involves feature combinations. Starting from this way to represent visual objects, the String kernels alternative would appear much easier to achieve than the other two approaches, because it ‘‘only’’ requires an adaptation of feature combination neurons so that they become LCDs that are sensitive to order. Thus, if this adaptation is not too prohibitive in terms of computational cost, the simplest solution for the brain will also turn out to be the best. Despite encouraging sequential complexity results, there is currently no available result on the cost of computing String kernels in a parallel architecture, let alone on the learnability of the suggested adaptation as opposed to others. However, in support of a learnable and cost-efficient adaptation, we have already mentioned how activity in the VWFA has been found to be correlated with Open-bigram frequency (Vinckier et al., 2007).

Finally, granted that String kernels are both more performant and easier to learn by the brain, why should we discard the behavioral evidence that has seemed to favor other schemes lately? To address this question, we have presented new simulations on two up-to-date behavioral benchmarks. These simulations show that when Open-bigram coding fully complies with the requirements of String kernels (as to how to handle letter repetitions and how to compute similarities), the performance of this scheme is improved on all benchmarks and is on a par with Spatial Coding on one benchmark, despite requiring fewer parameters. Perhaps more important, in this paper we have introduced a gappy 3-wildcard kernel inspired from computational biology, and we have shown that it outperforms or matches all the other schemes it was tested against. Although these simulations left some phenomena aside, because they involved experimental manipulations that are beyond the scope of input coding schemes, at the very least they imply that the behavioral evidence should now be reconsidered.

5.2. String kernels and the dual-route model of orthographic coding

So far we have been relatively silent as to how, where, and when String kernels could be implemented in the brain. We have also left aside the issue that Open-bigram coding appears to be ill suited for computing grapheme-to-phoneme correspondences, as this arguably requires knowledge of which letters are next to which, such as knowing that the T is just before the H in the complex grapheme ‘‘TH’’ (Goswami & Ziegler, 2006). A recent theoretical proposal, illustrated in Fig. 4, addresses these issues by suggesting that there are two distinct pathways to word recognition—a chunking and a diagnostic route (Grainger & Ziegler, 2011). We will now briefly describe this proposal and argue that a generic feature space could differentiate during development in order to support the kinds of schemes that are hypothesized in both routes.


Figure 4.  Grainger and Ziegler's dual-route model of orthographic coding. Starting from an alphabetic array of retinotopic letter detectors activated in parallel, reading acquisition is mediated by the construction of a diagnostic (left) and a chunking (right) route.


After the alphabet has been mastered, the next challenge children usually face in learning to read is to associate letters and short letter sequences to sounds. Part of this process requires extracting frequent letter chunks so as to isolate the relevant complex (i.e., multi-letter) graphemes (such as for instance graphemes SH and OO in SHOOK) to be associated with phonemes. This chunking route would initially be the salient part of the orthographic code, dominant at early stages of reading. The neural correlates of the chunking route could be the mid-anterior fusiform gyrus, given the evidence coming from case studies where lesions in these areas did not affect reading accuracy but generated phonological errors in spelling (Tsapkini & Rapp, 2010). The exact mechanism by which features in the chunking route would be selected and associated with phonemes is yet unknown, although past and present computational models of reading aloud point to a supervised process (Perry, Ziegler, & Zorzi, 2007; Sibley, Kello, Plaut, & Elman, 2008), and the nature of the association with phonology suggests that precise information about letter position would be required, possibly from both ends of the word as proposed by recent spelling studies (Fischer-Baum et al., 2010). It is worth noting here that, as we have seen, kernels with a wildcard character induce (but are not limited to) a natural positional count: It is easy to retrieve the position of any letter in the string by reading out these vector components that are involved in representing the combination of the target letter with either the initial or the final letter.

Furthermore, the chunking route in this account is thought to be responsible for the development of morpho-orthographic representations, known to play a key role in visual word recognition (Rastle & Davis, 2008). In languages such as English and French, affixes would be an obvious target for the chunking mechanism, given their relatively high frequency of occurrence. However, such chunking could also apply to root morphemes in the case of Semitic languages, and as such would involve contiguous and non-contiguous letter combinations. Indeed, Semitic languages are a perfect example of how natural languages can exploit non-contiguous letter combinations, as the tri-consonantal roots of these languages are formed of non-adjacent letters in the word, and empirical research on reading in Semitic languages underlines the importance of these roots in the word recognition process (Frost, 2006; Velan & Frost, 2011).

Meanwhile, according to the dual-route proposal, another type of orthographic coding would slowly be taking shape in the VWFA. There, the increasing receptive fields of hierarchically organized neurons would allow for high-level cells to be tuned to possibly remote combinations of two or more letters (Vinckier et al., 2007), as described for instance in the 3-wildcard kernel we have implemented. However, the feature space in this kernel would not be static but would evolve during reading acquisition so as to favor those letter combinations that are informative given the lexicon. Learning in the diagnostic route would be slow but automatic, effortless, and unsupervised, involving, for instance, Hebb's rule or a temporal variant thereof (Wallis & Rolls, 1997). Once set in place, the code would be dominant in silent reading and would allow for direct and fast access to semantics. It is thus proposed that masked priming experiments in the orthographic processing literature have largely been probing this diagnostic orthographic code.

Finally, the dual-route model also addresses the hard problem of letter position coding, which is how humans manage to switch from a retinotopic representation to a word-centered representation. This can be recast as the problem of learning location invariant word representations despite the fact that the same words will produce very different imprints on the retina at even the slightest variation in fixation point. This concern is not specific to words, and it is shared throughout the visual object recognition literature (DiCarlo & Cox, 2007; Fazl, Grossberg, & Mingolla, 2009), but the problem of invariance is perhaps most clearly perceived when considered in the word domain where the requirement for order is so explicit—for example, anagrams like ‘‘triangle’’ and ‘‘relating’’ have different meanings. Order is trivially present in the retinal activation pattern or its cortically magnified transform in primary visual areas, but these do not directly allow for recognition: Different words seen at the same location would have more overlapping patterns than the same word at different locations.11 Thus, if one is to recognize say ‘‘triangle’’ at all possible locations, it would appear that somehow location invariant letter detectors must be involved. However, fully invariant detectors would not allow one to distinguish between ‘‘triangle’’ and ‘‘relating.’’ So how exactly does the system achieve location invariance without ‘‘losing’’ information about letter order? The answer, according to the dual-route model, is that in both routes, location invariance is achieved not by learning location invariant letters, but rather by learning ordered and location invariant letter combinations—or in the language of String kernels, learning a feature map indexed by substrings of a given length. Using ordered letter combinations allows discrimination between anagrams as they will not have the same coordinates in this feature space.
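The contrast between position-specific codes and ordered letter combinations can be illustrated with a short sketch; the two feature maps below are simplified stand-ins for retinotopic coding and Open-bigram coding, respectively:

```python
from itertools import combinations

def positional_features(word, offset=0):
    # Retinotopic-style code: each letter is bound to an absolute slot.
    return {(offset + i, c) for i, c in enumerate(word)}

def open_bigram_features(word):
    # Ordered letter pairs; absolute position is discarded by construction.
    return {a + b for a, b in combinations(word, 2)}

def overlap(f, g):
    # Jaccard overlap between two feature sets.
    return len(f & g) / len(f | g)

same_word_shifted = overlap(positional_features("triangle", 0),
                            positional_features("triangle", 3))
anagram_same_spot = overlap(positional_features("triangle", 0),
                            positional_features("relating", 0))
```

Here the same word shifted by three slots shares no positional features with itself, while a different word at the same location does share some, exactly the inversion described in the text; the ordered-pair map, by contrast, never sees absolute position yet still separates the two anagrams.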

6. Learning String kernels: The role of invariance

What could be the computational principles at work in the learning of these two orthographic codes? This is still an open question, but one possibility is that both routes of the dual-route model could evolve from the same feature space. Let us imagine a feature space indexed by all possible letter combinations up to a certain length k, including a wildcard character. Learning the two orthographic codes could then boil down to discarding irrelevant features according to different criteria: for instance, learning a 3-wildcard kernel for the diagnostic route, as was successfully tested in this article, and a String kernel with both contiguous letter combinations and wildcard combinations for the chunking route. The features of the orthographic code in each route would ultimately depend on how frequency of occurrence is used to select relevant entities from the same original feature space, that is, to maximize diagnosticity on the one hand and chunking on the other.
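One way to picture this selection process is the following toy sketch; the lexicon, the bigram features, and the two selection criteria (raw chunk frequency versus occurrence in few words) are illustrative assumptions on our part, not a claim about the actual learning mechanism:

```python
from collections import Counter
from itertools import combinations

LEXICON = ["shook", "shoot", "booth", "tooth", "short"]  # toy lexicon

def contiguous_ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Chunking route (illustrative): keep the most frequent contiguous
# chunks, the kind of features one would map onto graphemes.
chunk_counts = Counter(g for w in LEXICON for g in contiguous_ngrams(w, 2))
chunking_features = [g for g, _ in chunk_counts.most_common(4)]

# Diagnostic route (illustrative): keep open bigrams that occur in few
# words, i.e., those most informative about word identity.
bigram_docfreq = Counter(b for w in LEXICON
                         for b in {a + c for a, c in combinations(w, 2)})
diagnostic_features = [b for b, n in bigram_docfreq.items() if n == 1]
```

On this toy lexicon the chunking criterion retains high-frequency graphemes such as OO and SH, while the diagnostic criterion retains combinations like BT (present only in BOOTH) and discards OO precisely because it is so widespread.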

This idea resonates with the way training is currently achieved in HMAX, one successful and in some ways biologically inspired model of visual object recognition and classification (Serre et al., 2005). This network of 26 million units interleaved in simple and complex layers first learns a universal dictionary of features by ‘‘passive exposure’’ to a training base of environmental images. At this stage, simple units are tuned to respond maximally to patterns of activation in the previous complex layer, whereas complex units are handwired to respond to the maximally active simple unit in the previous simple layer, from among those tuned to the same feature combination at different locations and sizes. This first training stage results in a large number of simple units tuned to feature combinations in the environment, and invariant to location and scale—a universal dictionary. In a second ‘‘expert’’ learning stage, the network is exposed to a sequence of stimuli that all contain an object from the target domain, and a unit's activity is now allowed to depend on its previous activity in the sequence. An abstract algorithm then implements a selection process within a randomly chosen pool of units, progressively replacing those units whose activity has fallen below some criterion with new random ones, which eventually produces a pool of units which are best fit to represent objects in the target domain.
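The simple/complex alternation just described can be sketched in a few lines; this is a two-layer toy with Gaussian tuning and max pooling, where the array shapes and the tuning function are our simplifications rather than the actual HMAX architecture:

```python
import numpy as np

def simple_layer(patches, templates, sigma=1.0):
    """S units: each unit is tuned to one template at one location;
    its response falls off with distance from the stored template."""
    d2 = ((templates[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))   # shape: (templates, locations)

def complex_layer(s_responses):
    """C units: max over locations, yielding one location-invariant
    response per template."""
    return s_responses.max(axis=1)
```

Because each complex unit takes a max over locations, permuting the patch locations leaves its response unchanged, which is the sense in which invariance is hand-wired at this stage.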

The performance of HMAX on classification and identification tasks has been found to be comparable to the best algorithms in machine vision (Serre et al., 2005). What is more, formal results pertaining to invariance and discrimination properties in this type of architecture have been obtained for one-dimensional strings (Bouvrie, Poggio, Rosasco, Smale, & Wibisono, 2010). In these analyses, one critical mathematical object turned out to be the notion of kernels associated with network layers.12 Interestingly, these analyses reveal that such networks can discriminate between different strings but that unlike humans, they fail to distinguish a string from its reverse (abcd–dcba) and its checkerboard transformations (ababab–bababa). More generally, at a cognitive and biological level, HMAX remains unsatisfactory for a number of reasons: It does not have feedback connections, its invariance properties are still largely handwired, and its expert selection process lacks a neurally plausible substrate. Invariance is more plausibly obtained in other hierarchical models of invariant object recognition like Visnet (Wallis & Rolls, 1997), which has a homogeneous structure and simple learning rules, and pARTSCAN (Fazl et al., 2009), which includes adaptive feedback connections. Both of these models suggest that a neurally plausible candidate mechanism to implement the selection process used in HMAX might be to learn lateral inhibitory connections within layers of the hierarchy, progressively shaping competitive networks that would select the most active units. A model that combines qualities from these three networks (adaptive feedback connections, competitive subnetworks, and simple learning rules acting at multiple stages during training) is currently under investigation (Hannagan & Grainger, 2011).
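The reverse and checkerboard failures can be illustrated directly: a feature map that pools away order (unordered letter pairs) assigns identical vectors to a string and its reverse, while the ordered pairs of an Open-bigram code do not. This is a toy illustration of the symmetry at issue, not a reimplementation of the analyses in Bouvrie et al. (2010):

```python
from collections import Counter
from itertools import combinations

def unordered_pair_counts(s):
    # Order-insensitive map: which letters co-occur, but not in which order.
    return Counter(tuple(sorted(p)) for p in combinations(s, 2))

def ordered_pair_counts(s):
    # Order-sensitive map: ordered letter pairs, i.e., open bigrams.
    return Counter(combinations(s, 2))
```

Under the unordered map, abcd and dcba (and likewise ababab and bababa) receive exactly the same feature counts, whereas the ordered map separates both pairs.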

Finally, the question of representational format arises, as at first sight the kind of localist vectors used in String kernels also suggests a localist coding in the brain, whereby one neuron would be dedicated to each letter combination. There is little hope that, as described, this kind of coding would be of any use in a brain that faces significant amounts of noise, but Bowers (2010) has argued that redundant localist coding would circumvent the noise issue, by having many ‘‘copies’’ of neurons dedicated to representing one unique entity. However, even such a redundant localist view is not consistent with estimates derived from Bayesian reasoning on single cell responses, which point to sparse distributed coding where many neurons code for any letter combination, and each neuron is also active for many letter combinations (Plaut & McClelland, 2010). How then can we accommodate the notion of kernels within a distributed framework? The answer might already be contained in computational models of invariant visual object recognition, as models like HMAX and Visnet produce sparsely distributed units. The question of how representational format impacts on orthographic coding schemes has also been studied in Hannagan et al. (2011a), where a distributed implementation of Open-bigram coding derived from Holographic representations (Plate, 1995) was proposed, and shown to outperform the localist version on a benchmark of behavioral effects. The implementation starts by assigning distributed vectors to letters and locations, and from there uses binding and chunking operators to build vectors for arbitrarily complex structures. Vectors can be as sparse as required, and correlated letter vectors can be used to incorporate letter similarity into the computational design and testing of orthographic coding schemes.
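The binding and chunking operators mentioned here can be sketched with circular convolution, the binding operator of Plate's (1995) Holographic Reduced Representations; the dimensionality, the random vectors, and the slot-based position code below are illustrative choices, not the exact scheme of Hannagan et al. (2011a):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 2048

def rand_vec():
    return rng.normal(0, 1 / np.sqrt(DIM), DIM)

def bind(a, b):
    # Circular convolution, computed in the Fourier domain (Plate, 1995).
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def encode(word, letters, positions):
    # Chunking: superpose the letter-position bindings into one vector.
    return sum(bind(letters[c], positions[i]) for i, c in enumerate(word))

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

letters = {c: rand_vec() for c in "abcdefgh"}
positions = [rand_vec() for _ in range(8)]
```

Identical words give a cosine of 1 by construction, while anagrams, whose letters are bound to different position vectors, give a cosine near zero in high dimensions: fully distributed vectors that still separate letter orders.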

7. Conclusion

In this article we have described the current situation with respect to theoretical proposals for orthographic coding, where several quite different alternatives coexist. We pointed out that unlike other coding schemes, Open-bigrams come very close to being instances of String kernels, a machine learning tool that achieves state-of-the-art performance in the fields of computational biology and natural language processing. This connection stems from the idea of carrying out comparisons in high-dimensional feature spaces indexed by letter combinations, which not only improves the separability of representations but also provides a common way to address issues encountered across domains: for instance, capturing the various mutations that can relate two proteins, and achieving location invariance in the visual word recognition system.

We have used this new connection to test String kernels on two recent behavioral benchmarks for orthographic coding. Although the modifications that turned Open-bigrams into String kernels were minor, we were able to demonstrate a significant improvement in performance. Furthermore, a new coding scheme inspired by computational biology achieved superior scores on our benchmarks. Departing from the traditional use of kernels in Machine Learning, we have argued for an explicit computation of feature coordinates in the brain, as evidenced by studies of the VWFA and more generally by brain imaging studies of visual object recognition. On this basis we have concluded that the adaptations required of any child who is learning how to read, that is, the adaptations needed to go from generic object recognition to visual word recognition, are likely to involve String kernels rather than any other scheme to encode letter order. Although we leave to future research the proper demonstration of how such a coding scheme could be learned, this convergence of computational, biological, and behavioral arguments concludes our case for String kernels in the brain.

  1. Some researchers have raised doubts on the relevance of computing similarities outside of a full-fledged dynamical model of visual word recognition. This point is discussed in Hannagan, Dupoux, and Christophe (2011a), where it is concluded that the approach is useful to capture the common bottom-up activation component in all models, which correlates highly with the final outcome.
  2. Schemes with longer n-grams have also been proposed (Dehaene, Cohen, Sigman, & Vinckier, 2005; Wickelgren, 1969).
  3. When weights are involved, only the bigram that contributes the most is considered.
  4. We thank one anonymous reviewer for pointing out this work to us.
  5. Such conditions are not covered in Gomez, Ratcliff, and Perea (2008).
  6. Especially when so-called regularization is applied.
  7. In the machine learning literature this ‘‘Kernel trick’’ has come to be associated with the mere notion of a kernel function, although it is not involved in its actual definition.
  8. F1 = 2 × Recall × Precision/(Recall + Precision), where Precision = True Positives/(True Positives + False Positives) and Recall = True Positives/(True Positives + False Negatives).
  9. A receiver operating characteristic assesses how the ratio of true positives to false positives is affected by variations in the classification threshold.
  10. We thank one anonymous reviewer for pointing out that the ‘‘Follows relation’’ in computational linguistics also encapsulates this same idea of distant ordered relationships between elements.
  11. As measured, for instance, by the Frobenius norm or any standard matrix norm.
  12. We thank one anonymous reviewer for mentioning this study.


References
  • Adelman, J. S., Marquis, S. J., & Sabatos-DeVito, M. G. (2010). Letters in words are read simultaneously, not left-to-right. Psychological Science, 21, 1799–1801.
  • Allison, L., Wallace, C. S., & Yee, C. N. (1992). Finite-state models in the alignment of macro-molecules. Journal of Molecular Evolution, 35, 77–89.
  • Binder, J. R., Medler, D. A., Westbury, C. F., Liebenthal, E., & Buchanan, L. (2006). Tuning of the human left fusiform gyrus to sublexical orthographic structure. Neuroimage, 33, 739–748.
  • Blais, C., Fiset, D., Jolicoeur, P., Arguin, M., Bub, D., & Gosselin, F. (2009). Reading between eye saccades. PLoS ONE, 4(7), e6448. doi:10.1371/journal.pone.0006448.
  • Bouvrie, J., Poggio, T., Rosasco, L., Smale, S., & Wibisono, A. (2010). Generalization and properties of the neural response. Technical report MIT-CSAIL-TR-2010-051/CBCL-292. Cambridge, MA: Massachusetts Institute of Technology.
  • Bowers, J. (2010). On the biological plausibility of grandmother cells: Implications for neural network theories in psychology and neuroscience. Psychological Review, 116, 220–251.
  • Burgess, N., & Hitch, G. (1992). Toward a network model of the articulatory loop. Journal of Memory and Language, 31, 429–460.
  • Cohen, L., Dehaene, S., Naccache, L., Lehericy, S., Dehaene-Lambertz, G., Henaff, M., & Michel, F. (2000). The visual word-form area: Spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients. Brain, 123, 291–307.
  • Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A Dual Route Cascaded model of visual word recognition and reading aloud. Psychological Review, 108, 204–256.
  • Conrad, R. (1965). Order error in immediate recall of sequences. Journal of Verbal Learning and Verbal Behavior, 4, 161–169.
  • Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14, 326–334.
  • Cox, G. E., Kachergis, G., Recchia, G., & Jones, M. N. (2011). Towards a scalable holographic word-form representation. Behavior Research Methods, 43, 602–615. doi:10.3758/s13428-011-0125-5.
  • Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge, UK: Cambridge University Press.
  • Davis, C. J. (2010). The spatial coding model of visual word identification. Psychological Review, 117, 713–758.
  • Davis, C. J., & Bowers, J. S. (2006). Contrasting five theories of letter position coding. Journal of Experimental Psychology: Human Perception and Performance, 32(2), 535–557.
  • Dehaene, S., & Cohen, L. (2007). Cultural recycling of cortical maps. Neuron, 56(2), 384–398.
  • Dehaene, S., Jobert, A., Naccache, L., Ciuciu, P., Poline, J.-B., Le Bihan, D., & Cohen, L. (2004). Letter binding and invariant recognition of masked words: Behavioral and neuroimaging evidence. Psychological Science, 15(5), 307–313.
  • Dehaene, S., Cohen, L., Sigman, M., & Vinckier, F. (2005). The neural code for written words: A proposal. Trends in Cognitive Sciences, 9, 335–341.
  • Dennis, S. (2005). A memory-based theory of verbal cognition. Cognitive Science, 29, 145–193.
  • DiCarlo, J. J., & Cox, D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11, 333–341.
  • Ebbinghaus, H. (1964). Memory: A contribution to experimental psychology. New York: Dover.
  • Fazl, A., Grossberg, S., & Mingolla, E. (2009). View-invariant object category learning, recognition, and search: How spatial and object attention are coordinated using surface-based attentional shrouds. Cognitive Psychology, 58, 1–48.
  • Fischer-Baum, S., McCloskey, M., & Rapp, B. (2010). Representation of letter position in spelling: Evidence from acquired dysgraphia. Cognition, 115, 466–490.
  • Forster, K. I., & Davis, C. (1984). Repetition priming and frequency attenuation in lexical access. Journal of Experimental Psychology: Learning, Memory and Cognition, 10, 680–698.
  • Frost, R. (2006). Becoming literate in Hebrew: The grain-size hypothesis and Semitic orthographic systems. Developmental Science, 9(5), 439–440.
  • Gaillard, R., Naccache, L., Pinel, P., Clémenceau, S., Volle, E., Hasboun, D., Dupont, S., Baulac, M., Dehaene, S., Adam, C., & Cohen, L. (2006). Direct intracranial, fMRI, and lesion evidence for the causal role of left inferotemporal cortex in reading. Neuron, 50(2), 191–204.
  • Gales, M. J. F. (2009). Sequence kernels for speaker and speech recognition. Baltimore, MD: Language Technology Workshop at Johns Hopkins University.
  • Gomez, P., Ratcliff, R., & Perea, M. (2008). The overlap model: A model of letter position coding. Psychological Review, 115(3), 577–600.
  • Goswami, U., & Ziegler, J. C. (2006). A developmental perspective on the neural code for written word. Trends in Cognitive Sciences, 10(4), 142–143.
  • Grainger, J. (2008). Cracking the orthographic code: An introduction. Language and Cognitive Processes, 23(1), 1–35.
  • Grainger, J., & van Heuven, W. (2003). Modeling letter position coding in printed word perception. In P. Bonin (Ed.), Mental lexicon: ‘‘Some words to talk about words’’ (pp. 1–23). New York: Nova Science.
  • Grainger, J., & Ziegler, J. (2011). A dual-route approach to orthographic processing. Frontiers in Language Sciences, 2(54). doi:10.3389/fpsyg.2011.00054.
  • Grainger, J., Granier, J., Farioli, F., Van Assche, E., & van Heuven, W. (2006). Letter position information and printed word perception: The relative-position priming constraint. Journal of Experimental Psychology: Human Perception and Performance, 32, 865–884.
  • Grünwald, P. (2004). A tutorial introduction to the minimum description length principle. In P. Grünwald, I. J. Myung, & M. Pitt (Eds.), Advances in minimum description length theory and applications. Cambridge, MA: MIT Press.
  • Guerrera, C., & Forster, K. I. (2008). Masked form priming with extreme transposition: Cracking the orthographic code. Language and Cognitive Processes, 23(1), 117–142.
  • Hannagan, T., & Grainger, J. (2011). From learning objects to learning letters: Transfer of invariance in a computational model of the ventral visual system. Fifteenth International Conference on Cognitive and Neural Systems (ICCNS11), Boston, MA.
  • Hannagan, T., Dupoux, E., & Christophe, A. (2011a). Holographic string encoding. Cognitive Science, 35(1), 79–118.
  • Hannagan, T., Dandurand, F., & Grainger, J. (2011b). Broken symmetries in a location invariant word recognition network. Neural Computation, 23(1), 251–283.
  • Hasson, U., Harel, M., Levy, I., & Malach, R. (2003). Large-scale mirror-symmetry organization of human occipito-temporal object areas. Neuron, 37, 1027–1041.
  • Haussler, D. (1999). Convolution kernels on discrete structures. Technical report UCSC-CRL-99-10, UC Santa Cruz.
  • Henson, R. N. A. (1998). Short-term memory for serial order: The Start-End Model of serial recall. Cognitive Psychology, 36, 73–137.
  • Hofmann, T., Schölkopf, B., & Smola, A. J. (2007). Kernel methods in machine learning. Annals of Statistics, 36, 1171–1220.
  • Jacobs, A. M., Rey, A., Ziegler, J. C., & Grainger, J. (1998). MROM-P: An interactive activation, multiple read-out model of orthographic and phonological processes in visual word recognition. In J. Grainger & A. M. Jacobs (Eds.), Localist connectionist approaches to human cognition (pp. 147–188). Mahwah, NJ: Erlbaum.
  • Jäkel, F., Schölkopf, B., & Wichmann, F. A. (2007). A tutorial on kernel methods for categorization. Journal of Mathematical Psychology, 51(6), 381–388.
  • Jäkel, F., Schölkopf, B., & Wichmann, F. A. (2009). Does cognitive science need kernels? Trends in Cognitive Sciences, 13(9), 381–388.
  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.), Proceedings of the European conference on machine learning (pp. 137–142). Berlin: Springer.
  • Kanwisher, N. (2006). What's in a face? Science, 311, 617.
  • Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral pathway of the macaque cerebral cortex. Journal of Neurophysiology, 71, 856–857.
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
  • Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of latent semantic analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
  • Leslie, C., & Kuang, R. (2004). Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5, 1435–1455.
  • Lewis, D. (2004). The Reuters-21578 text categorization test collection, distribution 1.0. Available at: (accessed on February 28, 2008).
  • Li, M., & Vitanyi, P. (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
  • Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.
  • Mozer, M. C. (1987). Early parallel processing in reading: A connectionist approach. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading (pp. 83–104). London: Erlbaum.
  • Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536–540.
  • Page, M. P. A., & Norris, D. G. (1998). The primacy model: A new model of immediate serial recall. Psychological Review, 104, 761–781.
  • Papadimitriou, C. (1993). Computational complexity (1st ed.). Reading, MA: Addison Wesley. ISBN 0-201-53082-1.
  • Perea, M., & Lupker, S. J. (2004). Can CANISO activate CASINO? Transposed-letter similarity effects with nonadjacent letter positions. Journal of Memory and Language, 51, 231–246.
  • Perry, C., Ziegler, J. C., & Zorzi, M. (2007). Nested incremental modeling in the development of computational theories: The CDP+ model of reading aloud. Psychological Review, 114, 273–315.
  • Perry, C., Ziegler, J. C., & Zorzi, M. (2010). Beyond single syllables: Large-scale modeling of reading aloud with the Connectionist Dual Process (CDP++) model. Cognitive Psychology, 61(2), 106–151.
  • Pitt, M. A., & Myung, I. J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6(10), 421–425.
  • Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623–641.
  • Plaut, D. C., & McClelland, J. L. (2010). Locating object knowledge in the brain: A critique of Bowers' (2009) attempt to revive the grandmother cell hypothesis. Psychological Review, 117, 284–288.
  • Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.
  • Rastle, K., & Davis, M. H. (2008). Morphological decomposition based on the analysis of orthography. Language and Cognitive Processes, 23(7–8), 942–971.
  • Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
  • Schoonbaert, S., & Grainger, J. (2004). Letter position coding in printed word perception: Effects of repeated and transposed letters. Language and Cognitive Processes, 19(3), 333–367.
  • Serre, T., Kouh, M., Cadieu, C., Knoblich, U., Kreiman, G., & Poggio, T. (2005). A theory of object recognition: Computations and circuits in the feedforward path of the ventral stream in primate visual cortex. MIT-CSAIL Technical Report. Cambridge, MA: Massachusetts Institute of Technology.
  • Shawe-Taylor, J., Anthony, M., & Kern, W. (1992). Classes of feedforward neural networks and their circuit complexity. Neural Networks, 5(6), 971–977.
  • Sibley, D. E., Kello, C. T., Plaut, D. C., & Elman, J. L. (2008). Large-scale modeling of wordform learning and representation. Cognitive Science, 32, 741–754.
  • Slamecka, N. (1985). Ebbinghaus: Some associations. Journal of Experimental Psychology: Learning, Memory and Cognition, 11, 414–435.
  • Tsapkini, K., & Rapp, B. (2010). The orthography-specific functions of the left fusiform gyrus: Evidence of modality and category specificity. Cortex, 46(2), 185–205.
  • Tsunoda, K. (2001). Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns. Nature Neuroscience, 4, 832–838.
  • Tydgat, I., & Grainger, J. (2009). Serial position effects in the identification of letters, digits and symbols. Journal of Experimental Psychology: Human Perception and Performance, 35(2), 480–498.
  • Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
  • Van Assche, E., & Grainger, J. (2006). A study of relative-position priming with superset primes. Journal of Experimental Psychology: Learning, Memory and Cognition, 32(2), 399–415.
  • van Rooij, I., & Wareham, T. (2008). Parameterized complexity in cognitive modeling: Foundations, applications and opportunities. Computer Journal, 51(3), 385–404.
  • Velan, H., & Frost, R. (2011). Words with and without internal structure: What determines the nature of orthographic and morphological processing? Cognition, 118(2), 141–156.
  • Vinckier, F., Dehaene, S., Jobert, A., Dubus, J., Sigman, M., & Cohen, L. (2007). Hierarchical coding of letter strings in the ventral stream: Dissecting the inner organization of the visual word-form system. Neuron, 55, 143–156.
  • Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
  • Whitney, C. (2001). How the brain encodes the order of letters in a printed word: The SERIOL model and selective literature review. Psychonomic Bulletin & Review, 8, 221–243.
  • Wickelgren, W. A. (1969). Auditory or articulatory coding in verbal short-term memory. Psychological Review, 76, 232–235.
  • Yarkoni, T., Balota, D. A., & Yap, M. J. (2008). Moving beyond Coltheart's N: A new measure of orthographic similarity. Psychonomic Bulletin & Review, 15, 971–979.


Appendix
3-wildcard coding scheme

Here we provide a definition of the 3-wildcard kernel studied in the simulation section, inspired by the wildcard kernel proposed by Leslie and Kuang (2004). Let Σ be the English alphabet augmented by a wildcard character * that matches every letter, and let α be a substring of length 3 in string S. For each pattern β ∈ Σ³, we define the coordinate φ_β as:

  • φ_β(α) = 1 if β matches α, with the wildcard * matching any letter
  • φ_β(α) = 0 otherwise

The feature map for any string S is the sum of the feature maps for all possible α in S:

  • Φ(S) = ∑_{α ∈ S} φ(α), where φ(α) = (φ_β(α))_{β ∈ Σ³}

As usual, the kernel between strings S and T is defined as the normalized inner product of the feature maps for S and T, that is:

  • K(S, T) = ⟨Φ(S), Φ(T)⟩ / (‖Φ(S)‖ ‖Φ(T)‖)
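A direct implementation of these definitions may help. We assume here that any subset of the three positions can be wildcarded and that wildcarded features carry the same weight as exact ones; Leslie and Kuang (2004) also consider capping the number of wildcards and down-weighting them:

```python
from collections import Counter
from itertools import product

def wildcard_feature_map(s, k=3):
    """Phi(S): for each contiguous k-mer of S, increment every pattern
    obtained by replacing any subset of its positions with '*'."""
    feats = Counter()
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        for mask in product((False, True), repeat=k):
            feats[''.join('*' if m else c
                          for c, m in zip(kmer, mask))] += 1
    return feats

def wildcard_kernel(s, t, k=3):
    """K(S, T): normalized inner product of the two feature maps."""
    fs, ft = wildcard_feature_map(s, k), wildcard_feature_map(t, k)
    dot = sum(v * ft[key] for key, v in fs.items())
    norm = lambda f: sum(v * v for v in f.values()) ** 0.5
    return dot / (norm(fs) * norm(ft))
```

By normalization, wildcard_kernel('table', 'table') equals 1, and similarity degrades smoothly under substitutions and transpositions because partially wildcarded patterns continue to match.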

Tables A1 and A2 provide detailed similarity measures on the benchmarks of Davis (2010) and Hannagan et al. (2011a), for the wildcard kernel and the other coding schemes studied in our simulation section.

Table A1.  Human priming (ms) and similarities for the four coding schemes on selected conditions from the Davis (2010) benchmark studied in our simulation section (adapted, with permission, from Davis, 2010)
Prime | Target | Humans | Spatial coding | GvH UOB | Kernel UOB | 3-WildCard
Corr. with humans: 0.73 (Spatial coding), 0.67 (GvH UOB), 0.75 (Kernel UOB), 0.80 (3-WildCard)
Table A2.  Similarities for the four coding schemes on the Hannagan et al. (2011a) benchmark studied in our simulation section (adapted, with permission, from Hannagan et al., 2011a)
Prime | Target | Spatial coding | GvH UOB | Kernel UOB | 3-WildCard