On the Necessity of U-Shaped Learning


Correspondence should be sent to John Case, Department of Computer and Information Sciences, University of Delaware, Newark, DE. E-mail: case@udel.edu


A U-shaped curve in a cognitive-developmental trajectory refers to a three-step process: good performance followed by bad performance followed by good performance once again. U-shaped curves have been observed in a wide variety of cognitive-developmental and learning contexts. U-shaped learning seems to contradict the idea that learning is a monotonic, cumulative process and thus constitutes a challenge for competing theories of cognitive development and learning. U-shaped behavior in language learning (in particular in learning English past tense) has become a central topic in the Cognitive Science debate about learning models. Antagonist models (e.g., connectionism versus nativism) are often judged on their ability to model or account for U-shaped behavior. The prior literature is mostly occupied with explaining how U-shaped behavior occurs. Instead, we are interested in the necessity of this kind of apparently inefficient strategy. We present and discuss a body of results in the abstract mathematical setting of (extensions of) Gold-style computational learning theory addressing a mathematically precise version of the following question: Are there learning tasks that require U-shaped behavior? All notions considered concern learning in the limit from positive data. We present results about the necessity of U-shaped learning in classical models of learning as well as in models with bounds on the memory of the learner. The pattern emerges that, for parameterized, cognitively relevant learning criteria, beyond very few initial parameter values, U-shapes are necessary for full learning power! We discuss the possible relevance of the above results for the Cognitive Science debate about learning models as well as directions for future research.

1. Introduction and motivation

A U-shaped curve in a cognitive-developmental trajectory refers to a three-step process: good performance followed by bad performance followed by good performance once again. In learning contexts, U-shaped learning is a behavior in which the learner first learns the correct behavior, then abandons the correct behavior, and finally returns to the correct behavior once again. This kind of cognitive-developmental trajectory has been observed by cognitive and developmental psychologists in a variety of child-development phenomena: language learning (Bowerman, 1982; Marcus et al., 1992; Strauss & Stavy, 1982), understanding of temperature (Strauss & Stavy, 1982; Strauss, Stavy, & Orpaz, 1977), understanding of weight conservation (Bowerman, 1982; Strauss & Stavy, 1982), object permanence (Bowerman, 1982; Strauss & Stavy, 1982), and face recognition (Carey, 1982).

U-shaped curves in cognitive development seem to contradict the “continuity model of cognitive development,” that is, the idea that performance improves with age, and that learning is a monotonic, cumulative process of improvement.1 Thus, the apparent regressions witnessed by U-shaped learning trajectories have become a challenge for competing theories of cognitive development in general and of language acquisition in particular.

The case of language acquisition is paradigmatic. In the case of the past tense of English verbs, it has been observed that, early in language acquisition, children learn correct syntactic forms (call/called, go/went), then undergo a period of ostensible over-regularization in which they attach regular verb endings such as “ed” to the present tense forms even in the case of irregular verbs (break/breaked, speak/speaked), and eventually reach a final phase in which they correctly handle both the rule-governed regular past-tense formation and the finitely many exceptions represented by the irregular verbs.

This example of U-shaped learning has been used as evidence against domain-general associative learning theories of language acquisition by supporters of linguistic nativism. It has figured so prominently in the so-called “Past Tense Debate” between connectionism and rule-based theory (the original articles are Pinker & Prince, 1988; Plunkett & Marchman, 1991; Rumelhart & McClelland, 1986; but see McClelland & Patterson, 2002, and Pinker & Ullman, 2002, for a more recent follow-up) that U-shaped learning has become the test bed for theories of language acquisition: competing models are often judged on their capacity to account for the phenomenon of U-shaped learning (see, e.g., Marcus et al., 1992; Plunkett & Marchman, 1991; Taatgen & Anderson, 2002).2

The prior literature is typically concerned with modeling how humans achieve U-shaped behavior. Instead, we are mostly interested in why humans exhibit this seemingly inefficient behavior. Is it a mere harmless evolutionary accident, or is it necessary for full human learning power, that is, for being competitive in the genetic marketplace? This is of course presently very difficult to answer empirically. Nonetheless, for potentially interesting insight into this problem, we pursue herein a mathematically precise version of the following question: Are there some formal learning tasks for which U-shaped behavior is logically necessary? We discuss a large and growing body of results (Baliga, Case, Merkle, Stephan, & Wiehagen, 2008; Case & Moelius, 2008; Case & Kötzing, 2010a, 2010b; Carlucci, Jain, Kinber, & Stephan, 2006; Carlucci, Case, Jain, & Stephan, 2007; Carlucci, Case, Jain, & Stephan, 2008) in the context of (extensions of) Gold's formal model of language learning from positive data (Gold, 1967) that suggest an answer to this latter question.3 In addition, the proofs of some of our results intriguingly may begin to shed light on the role played by exceptions on the one hand and rule-governed behavior on the other in U-shaped learning, as discussed in Section 'On the proof techniques'.

Gold's model has been very influential in theories of language acquisition (Berwick, 1985; Osherson, Stob, & Weinstein, 1982; Pinker, 1979; Wexler, 1982; Wexler & Culicover, 1980) and has been developed and extended into an independent area of mathematical research (Jain, Osherson, Royer, & Sharma, 1999; Osherson, Stob, & Weinstein, 1986). Criticisms of Gold's model are addressed later in Section 'Discussion and conclusion'.

The basics of the model are as follows. A learner is an algorithm for a (partial) computable function (see Rogers, 1967 for background on the theory of computable functions) that is fed an infinite sequence consisting of all and only the elements of a formal target language, in arbitrary order and possibly with repetitions. At each stage of the learning process, the learner outputs a corresponding hypothesis based on the evidence available so far. These hypotheses are candidate (formal) grammars for the target language. Learning in this context means that after some point, the grammars produced by the learner are correct for the target language. Importantly, different criteria can be defined by imposing conditions on the cardinality of the set of correct grammars that the learner produces in the limit, combined with restrictions on the learner's memory and other computational power. In this context, a U-shape occurs whenever, in the process of eventually successfully learning some target language, a learner abandons a conjecture that correctly describes the target language in favor of a wrong conjecture. Note that, as the learner is required to eventually learn the target language, at a later stage of the learning process a return to a correct conjecture will necessarily occur.4 For each learning criterion we consider, we say that a learner is a non-U-shaped learner if it commits no U-shapes while learning languages that it learns according to that criterion (i.e., as in the empirical settings, we mostly do not care about possible U-shapes on other languages5). We consider non-U-shaped learners as, mathematically, it is useful to examine the consequences for learning power when U-shapes are forbidden. U-shaped learning is necessary for a given criterion if some class of languages can be learned by that criterion, but not if one uses the same criterion—except with U-shapes forbidden.

We present, then, results about the impact of forbidding U-shaped behavior in a number of learning models/criteria within Gold's framework. In some cases of interest, U-shaped learning will turn out to be unavoidable: If U-shapes are forbidden, strictly fewer classes of languages are learnable. The general pattern that so far emerges from this line of research is the following. For cognitively relevant, parameterized learning criteria, beyond very few initial parameter values, U-shapes are necessary for full learning power!6

The article is organized as follows. In Section 'Gold-style computational learning theory', we review the main notions and assumptions of Gold-style learning theory and present our formal definition of U-shaped behavior. In Section 'U-shaped learning with full memory', we present results about the necessity of U-shaped learning in the context of classical learning criteria with no memory limitations. In Section 'U-shaped learning with memory limitations', we present results about the necessity of U-shaped learning in the context of learning criteria with memory limitations. In Section 'On the proof techniques', we discuss some very interesting features of the proof techniques that have some relevance for Cognitive Science. In Section 'Other forms of non-monotonic learning', we present results about forms of nonmonotonic learning other than U-shaped learning. In Section 'Discussion and conclusion', we offer a final discussion, prospects, and seemingly difficult, cognitively relevant open questions for future research.

2. Gold-style computational learning theory

We review the basic assumptions and ingredients of (extended) Gold-style computational learning theory (Jain et al., 1999) in an informal yet precise way.

2.1. Languages, grammars, texts, and learners

An alphabet is a finite set of symbols. A language is modeled as a set of finite strings from an alphabet. Without loss of generality—using standard coding techniques—a language can be identified with a set of natural numbers (see, e.g., Davis, Sigal and Weyuker, 1994). This model of language may seem to be very naive, but it is broad enough to model most of the schemes of language description commonly used in Linguistics. Typically, the alphabet symbols are taken to represent the words of the language and the strings of alphabet symbols are taken to represent the sentences of the language. Other interpretations are equally possible: The elements of the alphabet can represent morphemes, phonemes, IPA symbols, and the elements of the language could be identified with the strings that are possible words of the language.

As in most computational learning theories, a learner in Gold's model is an agent that tries to identify a target language based on implicit information.

The learning process is modeled as an inductive procedure (indexed by a discrete time parameter n) in which a learner is trying to identify a target language based on implicit information. At time n, the learner has to make a guess about what the target language is based on the finite amount of information seen so far. Gold's model is in this sense a theory of inductive inference.

In Gold's original model, and for all learning criteria of interest in this study, the information available to the learner is an infinite sequence consisting of all and only the elements of the language. The elements of the language can appear in any order whatsoever and with repetitions. Any such sequence is called a text or a presentation of a language. Each (nonempty) language has infinitely many different presentations (indeed, any infinite language has uncountably many presentations). A learner is fed a text element by element and, after receiving each new piece of information, has to make a guess about the target language.
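The notion of a text can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the language is a finite, nonempty set of natural numbers, and the helper names `round_robin_text`, `reversed_text`, and `content` are hypothetical, introduced here for illustration. The sketch shows two different presentations of the same language; both contain all and only its elements.

```python
from itertools import cycle, islice

def round_robin_text(language):
    """Infinite text cycling through a finite, nonempty language in increasing order."""
    return cycle(sorted(language))

def reversed_text(language):
    """A different presentation of the same language: decreasing order, each element twice."""
    doubled = [x for x in sorted(language, reverse=True) for _ in range(2)]
    return cycle(doubled)

def content(text, n):
    """Set of elements occurring among the first n items of a text."""
    return set(islice(text, n))

language = {2, 3, 5}
prefix_a = list(islice(round_robin_text(language), 4))  # [2, 3, 5, 2]
prefix_b = list(islice(reversed_text(language), 4))     # [5, 5, 3, 3]
```

The two prefixes differ, yet every element of the language eventually appears in each text and nothing else ever does, which is all the model requires of a presentation.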

Children appear to be learning natural languages by a casual and unsystematic exposure to the linguistic activity of the adults. The issue of whether children profit from negative information in language learning is still debated in Psycholinguistics (see, e.g., Marcoux, 1993 and, for a review, the recent Clark & Lappin, 2010).7 A substantial body of experimental evidence suggests that children learn natural language in the absence of feedback (Brown & Hanlon, 1970; Marcoux, 1993; Taatgen & Anderson, 2002). Accordingly, the learners in Gold's model learn from positive data only: The learner receives as input all and only the correct sentences of the language. In Newport, Gleitman, and Gleitman (1977), a well-documented body of experimental evidence suggests that mothers' utterances to young children are not at all calibrated by syntactic complexity to teach children gradually, but, instead, to fit the limited attention span and processing powers of children. These considerations are partly formally mirrored in Gold's model by the requirement that the learners eventually correctly learn the language no matter how the input is presented, as long as it contains all and only the correct sentences of the language.8

The hypotheses of a learning machine in Gold's model are (numerical codes for) computer programs in a pregiven programming system. The idea here is that human language acquisition involves the acquisition of a grammar for the target language. How the knowledge of such a grammar is coded in the brain is still unknown. According to formal language theory, a grammar for a language is a finite list of rules that effectively generate all and only the correct sentences of the language. Such a grammar can be identified with a computer program such that, when the program is run, all and only the elements of the language are produced as output. Such a computer program has to be written in some fixed programming language. Nowadays, we have a mathematically precise notion of algorithm and of general-purpose (universal) programming languages. We can thus fix a programming language in which any algorithm can be implemented and ask that the grammars are written in that language. Any general-purpose high-level modern programming language will do. Formally, an acceptable programming system is a universal programming system (such as Turing Machines, Random Access Machines, C, Lisp, etc.) into which one can compile from any programming system, or, equivalently, in which any control structure can be implemented (Royer, 1987; Rogers, 1958, 1967). Computer programs are finite objects and can thus be coded as natural numbers (see Rogers, 1967). We refer to these codes as names of the associated computer program and as indices of the corresponding language generated by the computer program. In view of clarifying some subtle distinctions in what follows, we recall the well-known fact that the same language can be generated by many intensionally (i.e., syntactically) distinct grammars. In fact, in our formal setting, each language can be generated by infinitely many intensionally distinct grammars. 
These distinct but coextensional grammars can be thought of as different sets of rules to generate the same language. Note that redundant rules can always be added to a grammar without changing its extension, thus resulting in infinite variation.

What kinds of languages can be captured by such grammars? The Chomsky hierarchy classifies formal languages in terms of the complexity of the grammars that generate them. The most general class is the class of computably enumerable (c.e.) languages. These are the languages that can be generated by arbitrary algorithmic procedures. Any such language has at least one index in any fixed acceptable programming system (indeed, it has infinitely many distinct ones, as observed above). The view that the class of natural languages could be identified with one of the classes of the Chomsky hierarchy had been dominant in Cognitive Science for many years. Nowadays, researchers are more inclined to think that natural languages form a class that is orthogonal to the classes of the Chomsky hierarchy. It is clear that context-free languages are not enough to model all natural languages (Bresnan, Kaplan, Peters, & Zaenen, 1987; Heinz & Idsardi, 2011; Joshi, 1985; Pullum & Gazdar, 1982; Shieber, 1985), but no one objects to the idea that each natural language is computably enumerable.

The modeling of all human cognition by algorithms or, equivalently, by computer programs, is a well-established trend in Cognitive Science (Johnson-Laird, 1988; Pylyshyn, 1984). Accordingly, we only consider learners computing (partial) computable functions, that is, learners whose behavior can be simulated by an algorithm. On the other hand, each (nonempty) language admits many noncomputable presentations.9 It might be interesting to study U-shaped learning with the restriction to computable texts, but we do not make such an assumption here. Note that, for some learning criteria, it is known that restricting to computable texts makes no difference as to successful learning (Case, 1999).

2.2. Successful learning

What does it mean for a learner to learn a language? Empirically, we say that a child has acquired knowledge of a natural language once he or she stops making errors (or once the error rate drops below a threshold) and starts to generalize (i.e., produce original linguistic output) correctly. We can (at least currently) never be sure that some error will not occur later on in an individual's linguistic behavior. From a theoretical viewpoint, however, it makes sense to require that knowledge is acquired once a point is reached in the process of hypothesis formation after which the learner does not make wrong guesses about the target language.

We do not ask that the learner knows when this point of convergence has been reached and consequently stops learning. The process of learning is in the limit in the sense that the successful learner will eventually output only correct conjectures but will not necessarily be able to know that he is doing so and consequently halt the learning process. To require this would result in a serious limitation of the power of the model. We would here like to quote Gold's own justification (Gold, 1967) for studying learning in the limit: “A person does not know when he is speaking a language correctly; there is always the possibility that he will find that his grammar contains an error. But we can guarantee that a child will eventually learn a natural language, even if it will not know when it is correct.”

We further illustrate the setting and introduce some terminology. Suppose that, at some point, after reading the elements $x_0, \ldots, x_n$ of a text for a language L, machine (algorithmic procedure) M conjectures a grammar $g_n$, and that grammar $g_n$ is a correct grammar for L. Suppose now that, after conjecturing $g_n$, M outputs forever only correct grammars for L, that is, all later grammars $g_{n+1}, g_{n+2}, \ldots$ output while seeing the rest of the input text $x_{n+1}, x_{n+2}, \ldots$ are correct grammars for L. In that case, we say that M has converged to a set of correct grammars, in this case to the set $\{g_n, g_{n+1}, g_{n+2}, \ldots\}$. We call this set the set of M's final conjectures.

For all criteria of interest to the present study, a learner is required to converge to a set of correct grammars for the target language, in response to any text for the target language. Different learning criteria can be defined by imposing conditions on the set of correct grammars to which the learner converges (e.g., the set is a singleton, the set has cardinality less than n, the set is finite but unbounded, etc.). Other criteria of interest are obtained by imposing restrictions on the learner's memory (plausible for us humans), and these will be discussed later.

It is commonly assumed that children are able to learn any natural language, given the appropriate input. Any proposed definition of natural language determines a class of languages that fulfill the definition. Accordingly, we are interested in machines that learn classes of languages, rather than individual languages. In fact, in Gold's model, learning a single language is trivial: The learner can blindly output a fixed grammar for the language, regardless of its received input. A learner is said to learn a class of languages according to a given learning criterion if the learner learns every language in the class according to that criterion.
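The contrast between learning a single language and learning a class can be made concrete with a small sketch. It rests on loud simplifying assumptions that are not part of the model itself: languages are finite sets, a "grammar" is represented directly by the finite set it generates (so correctness is trivially checkable), and only a finite prefix of a text is inspected. The names `constant_learner`, `finite_language_learner`, `run`, and `succeeds_on_prefix` are hypothetical.

```python
from itertools import cycle, islice

def constant_learner(fixed_grammar):
    """Learner that ignores its input and always outputs the same grammar."""
    return lambda seen: fixed_grammar

def finite_language_learner(seen):
    """Conjecture exactly the set of elements seen so far."""
    return frozenset(seen)

def run(learner, text, steps):
    """Feed the first `steps` items of `text` to `learner`; return its conjectures."""
    seen, conjectures = [], []
    for item in islice(text, steps):
        seen.append(item)
        conjectures.append(learner(tuple(seen)))
    return conjectures

def succeeds_on_prefix(conjectures, language):
    """Finite proxy for convergence: is the last conjecture of the run correct?"""
    return conjectures[-1] == frozenset(language)

small, large = {1, 2}, {1, 2, 3}
always_small = constant_learner(frozenset(small))

for target in (small, large):
    text = cycle(sorted(target))
    print(sorted(target),
          succeeds_on_prefix(run(always_small, text, 6), target),
          succeeds_on_prefix(run(finite_language_learner, cycle(sorted(target)), 6), target))
```

The constant learner succeeds only on the one language it was built for, whereas the learner that conjectures the data seen so far succeeds on every finite language, since any text for a finite language eventually exhibits all of its elements.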

2.3. U-shaped behavior

As Strauss and Stavy write in the Introduction to Strauss and Stavy (1982), U-shaped learning consists in “the appearance of a behavior, a later dropping of it, and what appears to be its subsequent reappearance […] Phase 1 behavior is a correct performance and Phase 2 is an incorrect performance, whereas Phase 3 behavior is a correct performance.” From a theoretical viewpoint, we decide to interpret a good performance as the (sometimes observable) consequence of the learner's conjecturing a correct grammar, and a bad performance as the (sometimes observable) consequence of the learner's conjecturing a wrong grammar, that is, a grammar for a language differing from the target language. This is a reasonable working hypothesis, and we adopt this perspective in our formal setting. It is indeed analogous to assuming that a child learning a language eventually acquires at least one grammar for the learned language.

We say that a learner learning a class $\mathcal{L}$ of languages is U-shaped on the class $\mathcal{L}$ if it exhibits U-shaped behavior on some presentation of some language in the class. That is, a machine M is U-shaped on a class $\mathcal{L}$ of languages if there is a language L in $\mathcal{L}$ and a text T for L such that, while learning L from T, M outputs at some point a correct grammar for L, then later abandons it and makes a wrong conjecture, and later outputs a correct conjecture again. More formally, if the text T for L is the infinite sequence $x_0, x_1, x_2, \ldots$, M is U-shaped on T if there exist three elements $x_m, x_n, x_p$ such that (1) m < n < p, (2) after reading the input up through $x_m$ the machine M conjectures a grammar $g_m$ which is correct for the language L, (3) after reading the input up through $x_n$ the machine M conjectures a grammar $g_n$ which is not a correct grammar for L, and (4) after reading the input up through $x_p$ the machine M conjectures a grammar $g_p$ which is again a correct grammar for L. We do not require the two correct conjectures ($g_m$ and $g_p$) to be intensionally (i.e., syntactically) the same. We only ask that they both generate, extensionally, the target language L. We note that it is currently experimentally impossible to see grammars inside a person's head. Hence, we cannot yet determine intensional equivalence (or lack thereof) of the grammars a person is actually employing. The best we can currently do in the laboratory or field is detect extensional/behavioral outcomes of what is inside a person's head. Even in a domain as apparently simple as past tense morphology, it is neither far-fetched nor contradictory to known (or currently knowable) experimental data to argue that learners may generate, during the learning process, intensionally distinct but extensionally equivalent correct grammars.
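The formal definition just given can be sketched as a check on a finite sequence of conjectures. This is only a toy illustration under explicit assumptions: a grammar is modeled as a hypothetical pair `(name, generated_set)`, so that two grammars can differ intensionally (different names) while agreeing extensionally (same set), and correctness is decidable only because the sets are finite. In the general setting grammar equivalence is undecidable, so no such checker exists. The function names are hypothetical.

```python
def is_correct(grammar, language):
    """Toy correctness check: a grammar is a (name, generated_set) pair."""
    _name, generated = grammar
    return generated == frozenset(language)

def has_u_shape(conjectures, language):
    """True if a correct conjecture is followed by a wrong one and later by a correct one."""
    seen_correct = False       # some correct conjecture has occurred (index m)
    seen_wrong_after = False   # a wrong conjecture occurred after it (index n > m)
    for grammar in conjectures:
        if is_correct(grammar, language):
            if seen_wrong_after:
                return True    # correct again at some index p > n
            seen_correct = True
        elif seen_correct:
            seen_wrong_after = True
    return False

target = {0, 1, 2}
u_shaped_run = [
    ("g_a", frozenset({0, 1, 2})),   # correct
    ("g_b", frozenset({0, 1})),      # wrong: abandons a correct conjecture
    ("g_c", frozenset({0, 1, 2})),   # correct again, under a different name
]
monotone_run = [
    ("g_b", frozenset({0, 1})),      # wrong
    ("g_a", frozenset({0, 1, 2})),   # correct
    ("g_c", frozenset({0, 1, 2})),   # still correct, syntactically different
]
```

Here `has_u_shape(u_shaped_run, target)` holds even though the first and last conjectures carry different names, while `monotone_run` changes its syntactic conjecture after becoming correct without ever committing a U-shape.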

For mathematical convenience, we will state our results in terms of non-U-shapedness. A learner is said to be non-U-shaped on a class $\mathcal{L}$ of languages if and only if it is not U-shaped on the class. When such a learner M is presented with the elements of a language L in the class in the order of some text T, if at some point M outputs a correct conjecture for L, then all conjectures output by M after that point are correct conjectures for L. We actually mostly care about U-shaped behavior of learners on classes of languages they actually learn (according to some fixed learning criterion of interest), so that convergence on a set of correct conjectures is ensured from the outset for every text for every language in the class under consideration.

Our definition of U-shaped learning obviously constitutes an idealization of the experimentally observed learning behaviors. “U-shaped learning” is an experimental concept. A U-shaped learning curve is a qualitative feature of a quantitative representation of measurable linguistic performance rather than of linguistic competence. As we noted above, at the present time it is impossible to look into a person's head to see what grammar he or she is using. The same empirical untestability holds for the problem of testing whether the grammar used in Phase 1 is the same as the grammar used in Phase 3. We thus believe that requiring correct identification of the target language in Phase 1 of a U-shaped curve is a viable abstraction, in fact akin to the common idea that a competent learner knows some grammar for the language. Also, we believe that biological biases and genetic constraints make the possibility of the child reaching a correct linguistic knowledge (i.e., a correct grammar) at an early age not so far-fetched.

We will occasionally mention stronger forms of non-U-shaped learning appearing in the literature. Strongly non-U-shaped learning (from Wiehagen, 1991, where it is called semantically finite) refers to learning in which the stronger requirement of never abandoning a correct conjecture is imposed. Thus, a strongly non-U-shaped learner has to stabilize on the very first correct conjecture issued during the learning process, while a non-U-shaped learner can issue an arbitrary number of intensionally distinct but correct conjectures after issuing a correct conjecture for the first time. An even stronger notion appears in the literature. A decisive learner (Osherson et al., 1986) is a learner that never returns to a conjecture that is extensionally equivalent to a previously abandoned conjecture (be it wrong or right with respect to the target language) if a conjecture generating a different language has been issued in between.
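The difference between these two stronger notions can be sketched on finite conjecture sequences. As before, this is a toy under stated assumptions: grammars are hypothetical `(name, generated_set)` pairs over finite sets, tuple equality stands in for syntactic identity, and set equality stands in for extensional equivalence. The function names are illustrative and not taken from the literature.

```python
def is_correct(grammar, language):
    """Toy correctness check: a grammar is a (name, generated_set) pair."""
    _name, generated = grammar
    return generated == frozenset(language)

def is_strongly_non_u_shaped(conjectures, language):
    """After the first correct conjecture, every later conjecture must be that same grammar."""
    first_correct = None
    for grammar in conjectures:
        if first_correct is not None and grammar != first_correct:
            return False
        if first_correct is None and is_correct(grammar, language):
            first_correct = grammar
    return True

def is_decisive(conjectures):
    """No abandoned extension (right or wrong) is ever conjectured again."""
    abandoned = set()
    previous = None
    for _name, generated in conjectures:
        if generated != previous:          # the conjectured language changes here
            if generated in abandoned:
                return False               # return to a previously abandoned language
            if previous is not None:
                abandoned.add(previous)
            previous = generated
    return True

target = {0, 1}
relabeling = [("g1", frozenset(target)), ("g2", frozenset(target))]
revisiting = [("w1", frozenset({0})), ("v", frozenset({0, 2})), ("w2", frozenset({0}))]
```

The run `relabeling` is decisive but not strongly non-U-shaped, because it abandons a correct grammar for a syntactically different correct one. The run `revisiting` never outputs a correct grammar, so it violates no (strong) non-U-shapedness condition on this prefix, yet it is not decisive because it returns to a wrong language it had abandoned.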

We are now able to formulate, for each formal learning criterion, the following fundamental questions. Is U-shaped behavior necessary for the full learning power? Are there classes that are learnable only by resorting to U-shaped behavior (on some text for some language of the class)?

3. U-shaped learning with full memory

In this section, we present results about the necessity of U-shapes in three classical learning contexts in which the learner has full access to previously seen data items. At any stage n of the learning process, such a learner can access the full initial segment $x_0, \ldots, x_n$ of the input text T and consequently recompute any of its own previously output conjectures $g_0, \ldots, g_{n-1}$ as well as the new one $g_n$. Explanatory Learning requires the learner to converge to a single correct hypothesis in the limit. Explanatory Learning is Gold's original model of learning in the limit (Gold, 1967). At the other extreme, Behaviorally Correct Learning (Case & Lynes, 1982; Osherson & Weinstein, 1982) allows the learner to stabilize on possibly infinitely many syntactically different correct conjectures in the limit. Vacillatory Learning is intermediate between the two: The learner here is allowed to vacillate between at most a finite fixed (or finite unbounded) number of correct conjectures in the limit.10 Vacillatory Learning (Case, 1999) defines an infinite hierarchy of more and more powerful learning criteria intermediate between Explanatory and Behaviorally Correct Learning. The case of Vacillatory Learning is paradigmatic and is the first example of a general pattern of which we will see more: In parameterized learning models, beyond very few initial parameter values, U-shaped learning is necessary for full learning power.

3.1. Explanatory, behaviorally correct, and vacillatory learning

We here more formally define Behaviorally Correct, Explanatory, and Vacillatory Learning.

The minimal requirement for a learner M to learn a language L is that M, given any text for L, eventually outputs only correct grammars for L. These grammars can possibly be infinitely many syntactically distinct ones. A learner that satisfies this minimal and correct convergence requirement is said to behaviorally identify the language L. Such a learner identifies the language only extensionally but not intensionally: It does not have to stabilize on a single grammar for the language but is nevertheless able to correctly capture the extension of the language and thus to eventually reach correct linguistic behavior. We refer to this criterion as Behaviorally Correct Learning.

By fixing a finite number b ≥ 1 as an upper bound to the number of correct conjectures to which a learner is allowed to converge in the limit, we obtain the concept of Vacillatory Learning with vacillation bound b. For each choice of a positive natural number b, we get a distinct criterion. If we require that the number of correct conjectures to which the learner converges is finite (but undetermined), we get a different criterion.

Finally, if we require that the learner converges to a single correct grammar, we have the concept of Explanatory Learning. This is Gold's original concept from Gold (1967) and is obviously equivalent to vacillatory identification with vacillation bound b = 1. Such a learner is said to explanatorily identify the language in the sense of stabilizing on a single correct description or explanatory definition of the language, without changing its mind later.
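For readers who prefer symbols, the three families of criteria can be summarized as follows. The notation is an assumption made for this summary and is not fixed by the text: $W_g$ denotes the language generated by grammar $g$, and $M(T[n])$ denotes the conjecture M outputs after reading the first $n$ elements of a text $T$. In each case the condition is required for every text $T$ for the target language $L$.

```latex
% Notation (assumed): W_g = language generated by grammar g;
% M(T[n]) = conjecture of M after the first n elements of text T.
\begin{align*}
\text{Behaviorally Correct:}\quad
  & \exists n_0\ \forall n \ge n_0\ \bigl[\, W_{M(T[n])} = L \,\bigr] \\[2pt]
\text{Vacillatory, bound } b:\quad
  & \exists n_0\ \Bigl[\, \forall n \ge n_0\ \bigl[\, W_{M(T[n])} = L \,\bigr]
    \ \wedge\ \bigl|\{\, M(T[n]) : n \ge n_0 \,\}\bigr| \le b \,\Bigr] \\[2pt]
\text{Vacillatory, finite:}\quad
  & \exists n_0\ \Bigl[\, \forall n \ge n_0\ \bigl[\, W_{M(T[n])} = L \,\bigr]
    \ \wedge\ \bigl|\{\, M(T[n]) : n \ge n_0 \,\}\bigr| < \infty \,\Bigr] \\[2pt]
\text{Explanatory:}\quad
  & \exists n_0\ \exists g\ \Bigl[\, W_g = L \ \wedge\ \forall n \ge n_0\ \bigl[\, M(T[n]) = g \,\bigr] \,\Bigr]
\end{align*}
```

Read this way, the Explanatory condition is exactly the Vacillatory condition with $b = 1$, and dropping the cardinality constraint altogether yields the Behaviorally Correct condition.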

We compare the power of the learning criteria by comparing the classes of learnable languages as sets. A learning criterion is more powerful than another if it allows learning of more classes of languages. What is learnable in the explanatory sense is also learnable in the vacillatory and in the behaviorally correct sense by definition. For each b, learning with vacillation bound b is contained in learning with vacillation bound b + 1 by definition and in learning with an arbitrary but finite number of correct grammars in the limit.

It is known that all the above mentioned inclusions are indeed strict. Thus, all the learning criteria defined so far are different and give rise to a hierarchy of more and more powerful learning paradigms, according to which more and more language classes become learnable.

It might be profitable here to recall that the general problem of deciding whether two grammars generate the same language is algorithmically undecidable. Also, grammars for the same language can be so different that it is impossible to prove their equivalence from the axioms of Set Theory (see Rogers, 1967). This gives a hint on what can be gained by allowing a learner more than one correct conjecture in the limit.

The Vacillatory Learning criteria with vacillation bound b = 1,2,3,… form an infinite strict hierarchy of more and more powerful learning criteria, on top of which is learning with an arbitrary finite number of correct conjectures in the limit (Case, 1999). We call this hierarchy the Vacillation Hierarchy. We state the existence of this hierarchy as a theorem for ease of further reference.

Theorem 1. For each choice of a positive natural number b, there are classes of languages that are learnable by vacillating between at most b + 1 correct grammars in the limit but not by vacillating between at most b correct grammars. Also, there are classes that are learnable by converging to a finite but not preassigned number of correct grammars but are not learnable by vacillating between at most b correct grammars, for any choice of a positive natural number b.

Let us describe a class of languages that witnesses the separation between learning with vacillation bound b + 1 and b.11 As grammars are coded as natural numbers and languages are sets of natural numbers, it might well be the case that a (code for a) grammar g generating a language L is also an element of L. Choose a positive natural number b. Now consider the class $\mathcal{L}_b$ of languages L defined by the following two requirements: (a) among the first b + 1 elements of L (with respect to the usual order < of the natural numbers) there occurs at least one code of a grammar for L, and (b) none of the elements of L beyond the (b + 1)-th is a code of a grammar for L. Consider, for example, the concrete case of b = 1. In this case, $\mathcal{L}_1$ contains all and only those languages L such that one or both of the first two elements of L, and no other element, is a code for a grammar for L. A variant of a proof of a theorem of Case (1999) shows that the class $\mathcal{L}_b$ can be learned by vacillating in the limit between at most b + 1 correct grammars but cannot be learned by vacillating between at most b correct grammars. Thus, the class $\mathcal{L}_b$ witnesses the fact that learning with vacillation bound b + 1 is strictly more powerful than learning with vacillation bound b.12 The result can be strengthened by showing that even allowing a learner to converge on approximately correct conjectures (in the sense of grammars that identify the target language up to a finite number of errors) does not make the class $\mathcal{L}_b$ learnable with vacillation bound b.

3.2. U shapes in explanatory, vacillatory, and behaviorally correct learning

The first result in the area (Baliga et al., 2008) showed that U-shaped learning is not necessary for the full power of Explanatory Learning, as stated in the following (nontrivial) theorem.

Theorem 2. Every class of languages that can be learned by convergence in the limit to a single correct grammar can be learned in this sense by a non-U-shaped learner.

To appreciate the meaning of the above theorem, note that a general explanatory learner has the freedom to issue, during the learning process, correct conjectures that are later abandoned in favor of incorrect ones, before eventually reaching the stabilization point after which one and the same correct conjecture is issued indefinitely. Such a freedom is obviously not allowed to a non-U-shaped learner. A very succinct and elegant proof of Theorem 2 can be found in the recent Case and Kötzing (2010a). The result has been strengthened in Case and Moelius (2008) to show that all explanatory learners can be transformed into explanatory learners that are strongly non-U-shaped. Recall that strongly non-U-shaped learners are learners that stick to the first correct conjecture issued during the learning process. This has to be contrasted with the fact that decisive learning does restrict the learning power of explanatory learners, as shown in Baliga et al. (2008). Recall that in decisive learning, a learner is forbidden to issue a conjecture for a grammar that is extensionally equivalent (yet possibly intensionally distinct) to a previously issued conjecture (be it wrong or right with respect to the target language), in case a grammar for a different language has been issued in between.
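The three restrictions just recalled can be told apart on a finite prefix of a learner's conjecture sequence. The following Python sketch is only an illustration under strong simplifying assumptions: grammars are hypothetical string labels, and a lookup table `SEM` (not part of the theory) maps each label to a finite language, so that equivalence of grammars becomes decidable. Reading "stick to the first correct conjecture" as syntactic identity of labels is also an assumption of the sketch.

```python
from typing import Dict, FrozenSet, List

# Toy assumption: each grammar is a string label, and `sem` maps a label to the
# finite language it generates. Real grammar equivalence is undecidable.
Semantics = Dict[str, FrozenSet[int]]


def is_non_u_shaped(conjs: List[str], sem: Semantics, target: FrozenSet[int]) -> bool:
    """True unless the prefix shows the pattern correct -> incorrect -> correct."""
    phase = 0  # 0: nothing correct yet, 1: correct seen, 2: correct then incorrect
    for g in conjs:
        correct = sem[g] == target
        if phase == 0 and correct:
            phase = 1
        elif phase == 1 and not correct:
            phase = 2
        elif phase == 2 and correct:
            return False
    return True


def is_strongly_non_u_shaped(conjs: List[str], sem: Semantics, target: FrozenSet[int]) -> bool:
    """After the first correct conjecture, the very same label must be repeated."""
    first_correct = None
    for g in conjs:
        if first_correct is None:
            if sem[g] == target:
                first_correct = g
        elif g != first_correct:
            return False
    return True


def is_decisive(conjs: List[str], sem: Semantics) -> bool:
    """No language is conjectured again after a different one came in between."""
    abandoned = set()
    prev = None
    for g in conjs:
        lang = sem[g]
        if prev is not None and lang != prev:
            abandoned.add(prev)
        if lang in abandoned:
            return False
        prev = lang
    return True


# "g" and "g2" are distinct labels for the same language.
SEM = {
    "g": frozenset({1, 2}),
    "g2": frozenset({1, 2}),
    "h": frozenset({1}),
    "k": frozenset({2}),
}
TARGET = frozenset({1, 2})
```

On the prefix `["h", "g", "g2"]` the learner is non-U-shaped and decisive but not strongly non-U-shaped, since it changes label after its first correct conjecture. On `["h", "k", "h", "g"]` it is non-U-shaped but not decisive, because it returns to an abandoned wrong language, which illustrates why decisiveness is the more demanding restriction.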

From a Cognitive Science perspective, the above results mean that if Explanatory Learning as such is an adequate model of human language learning, then U-shaped behavior is an unnecessary feature of human behavior. However, the adequacy of the Explanatory Learning paradigm can be challenged in a number of ways. In particular, the requirement made in Explanatory Learning that the learner must converge on exactly one correct grammar in the limit may be too restrictive. While it is possible to measure changes in linguistic behavior experimentally, as noted twice above, it is not currently possible to detect syntactic changes in a person's head experimentally. Recall that, on the other hand, general grammar equivalence is algorithmically undecidable. Humans might be taking advantage of not committing to a single description of the target language. It is therefore interesting to investigate alternative learning paradigms that address exactly this point.

Behaviorally Correct Learning is known to be strictly more powerful than Explanatory Learning, and it is interesting to investigate what the impact of U-shaped learning is in this context. As observed in Baliga et al. (2008) and Carlucci et al. (2008), based on a proof from Fulk et al. (1994), U-shaped behavior is necessary for the full learning power of behavioral learners, as stated in the following Theorem (see Carlucci et al., 2008).

Theorem 3. There are classes of languages that can be behaviorally identified but cannot be behaviorally identified by a non-U-shaped learner.

If Explanatory Learning seems to be too restrictive, Behaviorally Correct Learning seems to be much too liberal, as it allows the learner to converge on up to infinitely many distinct correct grammars for the target language. Infinitely many distinct correct grammars must include grammars whose size is unrealistically large for humans.

Vacillatory Learning is more realistic in this respect and gives rise to a completely different and much richer picture.13

We already know from Theorem 2 that U-shaped behavior is redundant for the first level of the Vacillation Hierarchy (as it is just Explanatory Learning). What about the other levels, the levels in which the power of vacillation is actually used?

Consider a learner inline image learning a class inline image with vacillation bound b > 1. Suppose now that this learner is non-U-shaped on that class. Now imagine another learner inline image observing the behavior of inline image on a text T for some language L in the class inline image and acting as follows. As soon as inline image outputs a conjecture for the first time, inline image outputs the same grammar. Each time inline image (syntactically) repeats a previously output conjecture, inline image just forgets it, and outputs again its own most recent conjecture. We claim that inline image learns the language L in the explanatory sense, that is, that inline image will eventually converge on a single correct conjecture for L. Why is that so? Since inline image learns L in the vacillatory sense, at some point n of the learning process, inline image will output, for the first time, a correct conjecture for L, call it g. By design, inline image will output g as well. As inline image by assumption converges on at most b different correct grammars, inline image can output after g only finitely many previously unseen grammars. Also, as inline image is by assumption non-U-shaped on L, all conjectures output by inline image after g are correct conjectures for L.

Let inline image be the—last but not least!—grammar to appear in inline image's output beyond g, that is, after all other grammars output by inline image from g on have already been output once. It is easy to see that inline image will converge on inline image. Thus, inline image will learn L in the explanatory sense. As L and T were arbitrary in the above argument, it shows that inline image learns the whole class inline image in the explanatory sense.
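The behavior of the observing learner in this argument can be sketched in Python as a filter on the stream of conjectures of the observed learner. This is a minimal illustration, not the construction as formalized in Carlucci et al. (2008): the function name `collapse_to_explanatory` is hypothetical, conjectures are arbitrary hashable codes, and the observed learner is assumed to emit a conjecture at every step.

```python
from typing import Hashable, Iterable, Iterator


def collapse_to_explanatory(conjectures: Iterable[Hashable]) -> Iterator[Hashable]:
    """Follow the observed learner only when it outputs a syntactically new grammar.

    Repeated grammars are ignored, and the most recent *new* grammar is
    output again instead.
    """
    seen = set()
    current = None
    for g in conjectures:
        if g not in seen:  # previously unseen conjecture: adopt it
            seen.add(g)
            current = g
        yield current  # otherwise repeat our own latest conjecture


# Observed learner: one wrong guess (7), then vacillation between 3 and 5.
observed = [7, 3, 5, 3, 5, 5, 3]
collapsed = list(collapse_to_explanatory(observed))
```

Here `collapsed` is `[7, 3, 5, 5, 5, 5, 5]`: once the observed learner stops producing new grammars, the filter is stuck on the last new one. If the observed learner is non-U-shaped, every grammar it outputs after its first correct one is correct, so the grammar the filter converges on is correct as well.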

So we have the following Theorem from Carlucci et al. (2008).

Theorem 4. For all b > 1, the following holds. Every class that is learnable by vacillation in the limit between at most b correct grammars and without U-shapes is already learnable by convergence in the limit to a single correct grammar.

U-shaped behavior is therefore necessary for the full power of Vacillatory Learning in a very strong sense: If U-shapes are forbidden, then the extra power gained by vacillation is lost. The Vacillation Hierarchy (see Theorem 1) collapses to Explanatory Learning.

It is an easy consequence of Theorems 1, 2, and 4 that the class inline image (defined above) cannot be learned without U-shaped behavior by vacillating between at most b + 1 grammars in the limit. In fact, the same holds if the learner is allowed to converge on grammars approximating the target language modulo a finite number of anomalies. Consider the particular case of b = 1. Then inline image is a concrete example of a class that requires U-shaped behavior to be learned. In fact, we know from above that inline image is learnable by vacillating between at most two correct grammars but not by converging to a single correct grammar in the limit. Consider a learner M learning inline image by vacillation between no more than two correct grammars. Such a learner exists because inline image is learnable with vacillation bound 2. Now suppose that M does not exhibit U-shaped behavior. By the argument sketched above (see Theorem 4), this would mean that there is another learner, inline image, that learns the class inline image in the explanatory sense. But this is a contradiction. Formally, we have the following Corollary (see Carlucci et al., 2008).

Corollary 1. Let b ∈ {2, 3, …}. Then any M witnessing that inline image is learnable with vacillation bound b necessarily employs U-shaped learning behavior on inline image.

The above results suggest that if some of the natural language-learning tasks humans have to face are of the kind of the classes inline image, and to the extent that Vacillatory Learning is an adequate model of human language learning, then U-shaped behavior is not a harmless and accidental feature of human behavior but may be necessary for learning what humans need to learn to be competitive in the genetic marketplace.14 From this perspective, it would be interesting to find insightful characterizations of the language classes that require U-shaped behavior to be successfully learned in the vacillatory sense.

3.3. Getting around U shapes

We know from the previous section that some classes require U-shaped behavior to be learned in the vacillatory sense. But how deep is the necessity of U-shaped behavior in such cases? What happens if we remove the finite vacillation bound and consider behaviorally correct learnability of those classes? Obviously, every class which is vacillatorily learnable is also behaviorally learnable by definition. But can we avoid some U-shapes if we only have to learn in a behaviorally correct way? The picture that emerges is interesting in our opinion. First, if we consider behaviorally correct learnability of the classes in the first nontrivial level of the Vacillation Hierarchy, the necessity of U-shaped behavior disappears. We have the following Theorem from Carlucci et al. (2008).

Theorem 5. Every class that can be identified by vacillating between at most two indices can also be behaviorally identified by a non-U-shaped learner.

In contrast, from the third level of the Vacillation Hierarchy on, the necessity of U-shaped learning cannot be removed even by allowing the learner to converge to infinitely many syntactically different correct grammars in the limit! We have the following Theorem from Carlucci et al. (2008).

Theorem 6. For b > 2, there are classes that can be identified by vacillating between at most b correct grammars, but which cannot be behaviorally identified by any non-U-shaped learner.

In this sense, we can say that the necessity of U-shaped behavior for these classes is even deeper than the necessity of U-shaped behavior for learning with vacillation bound 2. Note how an (arguably) cognitively implausible model of learning such as Behaviorally Correct Learning can be profitably used to qualitatively strengthen results about the ineliminability of a cognitively relevant learning strategy (U-shaped learning) for a not so implausible learning model (Vacillatory Learning). It is difficult to judge whether the asymmetry between two and larger than two might have some significance from the Cognitive Science perspective.

4. U-shaped learning with memory limitations

For modeling humans, a major limitation of the models considered so far is that they allow a learner too easy access to all the previous data. Humans certainly have memory limitations. It is therefore of relevance to cognitive science to investigate the impact of forbidding U-shaped strategies in the presence of memory limitations. In this section, we present results about the necessity of U-shaped learning in learning models featuring very severe memory limitations. For all these models, the convergence requirement is the same as for Explanatory Learning: A single correct grammar must be output in the limit. The models differ in the forms of memory allowed to a learner.

It is profitable to distinguish between intensional memory and extensional memory, although they cannot always be kept distinct. Intensional memory refers to the learner's memory of his or her own past conjectures. Extensional memory refers to the learner's memory of previously seen data items.

4.1. Iterative learning

Iterative Learning (Wiehagen, 1976; Wexler & Culicover, 1980) is a fundamental model of inductive learning with memory limitations. An iterative learner computes its guesses about the target language based on its own most recent conjecture and on the current data item only.15 Iterative Learning is a well-studied model (Becerra-Bonache, Case, Jain, & Stephan, 2010; Case & Moelius, 2009; Case, Jain, Lange, & Zeugmann, 1999; Jain & Kinber, 2007; Jain, Lange, & Zilles, 2006; Lange & Grieser, 2003). Most interestingly, in the perspective of the present study, Iterative Learning is the base case of a hierarchy of stronger and stronger memory-limited models. It is thus an important question whether U-shapes are necessary in this model. An attempt at answering this question was made in Carlucci et al. (2007). The problem was solved only later by Case and Moelius (2008). Their result shows that U-shapes are not necessary for the full learning power of iterative learners.

Theorem 7. All classes of languages that can be learned by an iterative learner can be learned by an iterative learner without U-shapes.

4.2. Bounded example memory learning

Few would defend the claim that humans are iterative learners. At the very least, humans have some form of extensional memory. Yet Theorem 7 above is important for the investigation of U-shaped learning in memory-limited models. As observed above, Iterative Learning is the base case of a parametrized family of criteria of learning with bounded memory of past examples. It is indeed easy to extend Iterative Learning by allowing learners to store a bounded number of items in their long-term memory. A Bounded Example Memory learner with memory bounded by n is an iterative learner that can store in memory up to n previously seen data items. This model was introduced in Lange and Zeugmann (1996) and studied further in Case et al. (1999). A hierarchy of more and more powerful learning criteria (the Bounded Example Memory Hierarchy) is obtained by increasing the size of the long-term memory: for every n ≥ 0, with the storing of n + 1 data items in memory, more language classes can be learned than by storing only n data items. If a learner is allowed to store an arbitrary but finite number of items in its long-term memory, a criterion is obtained that is strictly more powerful than each finite level of the hierarchy. Allowing long-term memory of one previously seen data item is already strictly stronger than Iterative Learning. The impact of forbidding U-shaped behavior in this setting is largely unknown and is of primary interest for future research! Mathematically, the problem seems both interesting and very difficult.

We strongly conjecture that U-shapes are necessary in the Bounded Example Memory hierarchy—at least beyond the first few memory-bound parameter values! See the next subsection for “evidence” toward this conjecture.

Below, we present results on models with more severe memory limitations.

4.3. Bounded memory states learning

In Bounded Memory States Learning (Carlucci et al., 2007), a learner has an explicit bound on its memory and otherwise only knows its current datum. No access to previously seen data or to previously formulated conjectures is allowed!16 At each step, the learner computes its conjecture as a function of the current (bounded) memory state and the current data item. The learner also chooses the new (bounded) memory state to pass on to the next learning step. Intuitively, for c ≥ 1, a learner that can choose between c memory states is a learner that can store one out of c different values in its memory. When c = 2^k, a learner with c memory states is equivalent to a learner with k bits of memory. Bounded Memory State learning with an arbitrary but finite number of memory states is equivalent to Iterative Learning (see Carlucci et al., 2007).
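The interface of such a learner can be sketched in Python. This is a minimal illustration under stated assumptions, not the formal definition of Carlucci et al. (2007): the driver `run_bounded_memory_states`, the helper `states_for_bits`, and the example learner `zero_spotter` are hypothetical names, conjectures are plain strings, and memory states are the integers 0, …, c − 1 with 0 as the initial state.

```python
from typing import Callable, Iterable, List, Tuple

# A learner step maps (memory state, current datum) to (conjecture, new state).
Step = Callable[[int, int], Tuple[str, int]]


def run_bounded_memory_states(step: Step, text: Iterable[int], c: int) -> List[str]:
    """Feed `text` to a learner that carries only one of c memory states."""
    state = 0  # assumed initial memory state
    conjectures = []
    for datum in text:
        conjecture, state = step(state, datum)
        if not 0 <= state < c:
            raise ValueError("learner left its c memory states")
        conjectures.append(conjecture)
    return conjectures


def states_for_bits(k: int) -> int:
    """k bits of memory correspond to 2**k memory states."""
    return 2 ** k


def zero_spotter(state: int, datum: int) -> Tuple[str, int]:
    """Toy two-state learner: its single bit records whether 0 has occurred."""
    new_state = 1 if (state == 1 or datum == 0) else 0
    return ("has_zero" if new_state else "no_zero"), new_state


outputs = run_bounded_memory_states(zero_spotter, [3, 0, 5], c=states_for_bits(1))
```

The learner never sees earlier data or its own earlier conjectures; everything it knows about the past is the one value handed from step to step. On the text `[3, 0, 5]` the toy learner outputs `["no_zero", "has_zero", "has_zero"]`.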

It was shown in Carlucci et al. (2007) that U-shaped behavior does not enhance the learning power of bounded memory states learners with only two memory states. Note that two memory states amount to one bit of memory. The full picture was later obtained in Case and Kötzing (2010b). This gives another instance of the 2 versus 3 phenomenon.

Theorem 8. There are language classes that are learnable with three memory states but cannot be learned without U-shaped behavior with any finite number of memory states.

The above result is consistent with the emerging picture so far: U-shaped learning is unavoidable in parametrized learning models beyond a few initial parameters. On the other hand, U-shapes are unnecessary for Bounded Memory States learning with an arbitrary, but finite number of memory states. This was proved in Case and Kötzing (2010b) on the basis of Theorem 7 and of the fact that the model is equivalent to Iterative Learning. Again, the limit case (arbitrary, but finitely many) behaves very differently from the finite cases. The same might be the case for Bounded Example Memory Learning.

4.4. Memoryless learning with queries

Queries are meant to formalize a kind of interactive memory. A query is a question of the form “Have I previously seen the following item(s)?”

In Memoryless Feedback Learning (Carlucci et al., 2007), a learner may ask a bounded number of queries about whether computed items have been previously seen in input data—and otherwise only knows its current datum. In this model the queries are parallel, in the sense that the choice of a question—within each learning step—cannot depend on the answer to a previous question. If sequential queries are allowed (each computed query beyond the first one can depend on the answer to the previous queries), we obtain the model of Memoryless Recall Learning, introduced in Case and Kötzing (2010b).
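The difference between parallel and sequential queries can be made concrete with a small Python sketch. It is an illustration under assumptions that are not taken from the cited papers: the drivers `run_feedback` and `run_recall` and the callbacks passed to them are hypothetical names, a query is answered against the data seen strictly before the current datum, and the set `seen` belongs to the environment rather than to the learner.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Answers = Tuple[bool, ...]


def run_feedback(text: Iterable[int], n: int,
                 choose_queries: Callable[[int], Sequence[int]],
                 decide: Callable[[int, Answers], str]) -> List[str]:
    """Parallel queries: all of them are fixed before any answer is known."""
    seen, out = set(), []
    for datum in text:
        queries = choose_queries(datum)
        assert len(queries) <= n
        answers = tuple(q in seen for q in queries)
        out.append(decide(datum, answers))
        seen.add(datum)  # kept by the environment, not by the learner
    return out


def run_recall(text: Iterable[int], n: int,
               next_query: Callable[[int, Answers], int],
               decide: Callable[[int, Answers], str]) -> List[str]:
    """Sequential queries: each query may depend on the earlier answers."""
    seen, out = set(), []
    for datum in text:
        answers: Answers = ()
        for _ in range(n):
            answers += (next_query(datum, answers) in seen,)
        out.append(decide(datum, answers))
        seen.add(datum)
    return out


# One parallel query: "have I seen the predecessor of the current datum?"
fb = run_feedback([3, 4, 9], 1,
                  choose_queries=lambda x: [x - 1],
                  decide=lambda x, a: "pair" if a[0] else "single")

# Two sequential queries: the second one depends on the first answer.
rc = run_recall([5, 6, 7], 2,
                next_query=lambda x, a: x - 1 if not a else (x - 2 if a[0] else x + 1),
                decide=lambda x, a: str(sum(a)))
```

In both drivers the conjecture depends only on the current datum and the query answers, which is what makes the learner memoryless. With these toy callbacks, `fb` is `["single", "pair", "single"]` and `rc` is `["0", "1", "2"]`.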

U-shaped learning is necessary for the full learning power of n-memoryless feedback learners, for every n > 0.

Theorem 9. For every n > 0, there are classes of languages that can be learned with n parallel queries by a memoryless feedback learner, but not with n + 1 parallel queries by a non-U-shaped memoryless feedback learner.

As an open problem it was asked in Carlucci et al. (2007) whether this necessity could be overturned by allowing more queries. Is it the case that for every m > 0 there exists an n > m (possibly with n much larger than m) such that all classes learnable with m queries can be learned with n queries but, then, without U-shapes? The question was answered negatively by Case and Kötzing (2010b). Indeed, much more was shown.

Theorem 10. There is a class learnable with a single feedback query by a memoryless learner that cannot be learned by a non-U-shaped memoryless learner with any finite number of feedback queries, even if sequential (rather than parallel) queries are allowed.

Interestingly, the above result is complemented by a result showing that any class of infinite languages that is learnable by a memoryless feedback learner with finitely many feedback queries is so learnable without U-shapes. In fact, all classes of infinite languages learnable with complete memory and, moreover, explanatorily learnable, can be learned without U-shapes by a memoryless feedback learner using a finite, but unbounded number of feedback queries. We will see more about this pattern: The necessity of U-shaped learning sometimes disappears when learning classes of infinite languages only. On the other hand, there is a class of infinite languages that can be learned by a memoryless feedback learner with a single feedback query but that cannot be learned without U-shapes using any particular number of feedback queries.

It is a cognitively important open question whether U-shapes are necessary for feedback or recall learning for which the learner also knows its just prior working hypothesis/conjecture. This too seems to be interestingly mathematically difficult. From Case et al. (1999) it is known that this kind of learning forms a hierarchy in dependence on the bound on the number of queries.

Suppose m + n > 0. A more general, cognitively important open question is whether U-shapes are necessary for full learning power for the learning criterion where the learner knows its prior conjecture and its current input datum, can remember m prior data items, and can (feedback or recall) query in each round n computed items. We conjecture that, at least for m + n beyond the first few positive values, U-shapes are necessary!

4.5. Counters and time/data awareness

Memory-limited learners such as human beings might take advantage of other forms of information during their training. For example, humans are to a large extent aware of the passage of time and data, certainly when they are awake. Can this awareness give some additional learning advantage in bounded memory cases?

In Case and Moelius (2008), the authors introduced an extension of Iterative Learning featuring the use of counters. In particular, they considered a model in which an iterative learner also knows the number of not necessarily distinct data items seen so far. In other words, the learner is aware of the iteration stage number of the learning process. This information is naturally coded as a counter going from 0 to infinity. We call this type of counter a full counter. We think of counters as modeling the fact that humans are at least somewhat aware of time and/or data passage. Some people may be more aware than others.

It was shown in Case and Moelius (2008) that iterative learners with full counters are strictly more powerful than plain iterative learners. Not surprisingly, they are also strictly less powerful than explanatory learners—as explanatory learners have available the data input sequence up to any point in time, so they can calculate the length of this sequence.

Kötzing (2011) began a systematic study of how counters can improve learning power. Which properties of counters give a learning advantage? Is it the higher and higher counter values, which can, for example, be used to time-bound computations? Is it merely unbounded counter values? Is it, as above, knowing the exact number of (not necessarily distinct) data items seen so far? In Kötzing (2011), the impact of six different types of counters, each one modeling one of six potential advantages of using a counter, is fully studied, at least in the context of Iterative Learning. It is not yet clear to us which type of counter best models human performance; again, there may be individual differences among humans. More work needs to be done on this.

On the one hand, Kötzing (2011) showed that even the weakest of his six types of counter does improve learning power, at least of iterative learners. On the other hand, the six types of counters studied in Kötzing (2011) turned out to fall into two groups with respect to iterative learning power. The strongest learning advantage is given by having a full counter, but a strictly monotone one is also sufficient. Indeed, the proofs (again for Iterative Learning) show that such a learner only needs to count the number of (not necessarily distinct) elements seen since its last mind change (i.e., change of conjecture), or to overestimate that number (however badly), to attain the maximal learning power of using full counters. Dropping the strictness of the monotonicity requirement already results in a loss of learning power (for Iterative Learning). The same learning power as that of an iterative learner using a not necessarily strictly monotone, but unbounded counter is achieved by dropping the monotonicity requirement altogether or by using a counter that eventually enumerates all natural numbers, but with no constraint on the order of the enumeration. More work needs to be done on the learning advantages of various humanly plausible types of counters for learning criteria more humanly plausible than the very restrictive Iterative Learning.

Even the question of whether U-shaped learning is necessary for iterative learners with counters is an interesting and still open problem for future research. In Case and Moelius (2008), the authors conjectured that U-shaped learning should be unnecessary if the learner has access to a full counter.17 The conjecture is still open. In Kötzing (2011) there are preliminary results on the interplay between counters and U-shaped learning in the context of a toy model of learning vastly weaker than even Iterative Learning: Transductive Learning is learning with no memory at all. A transductive learner outputs its new conjecture based on the current datum only. It is shown in Kötzing (2011) that in the context of Transductive Learning, the six counter types give rise to four distinct extensions of Transductive Learning. For Transductive Learning, U-shaped learning exhibits a sensitivity to the counter being ordered: it makes a difference whether the learner has access to a monotone and unbounded counter rather than just to an unbounded counter (while it made no difference for Iterative Learning).

Here is a collection of master open questions (so far) regarding the necessity of U-shaped learning for humanly plausible criteria with memory limitations.

Suppose m + n > 0. Are U-shapes necessary for full learning power for the learning criterion where the learner knows its prior conjecture and its current input datum, can remember m prior data items, and can (feedback or recall) query in each round n computed items, and has access to one of several humanly plausible counters? We conjecture that, at least for m + n beyond the first few values and for some such counters, U-shapes are necessary!

5. On the proof techniques

Do the proofs of the above results give us any insights into the necessity/dispensability of U-shaped behavior? In this section, we discuss some features of some of the proofs of results from previous sections that might be of interest from the Cognitive Science perspective.

5.1. Self-reference, self-description, and self-learning

Self-reference is a powerful technique in Computability Theory and figures prominently in Computability-theoretic Learning Theory as a whole. Many theorems proving the necessity of U-shaped learning make use of self-referential algorithms and language classes. The classes of languages witnessing many of the separations between a learning criterion and its non-U-shaped variant are self-describing classes. A self-describing class of languages is such that information about the grammars for the languages in the class is directly present in some elements of the languages themselves. For each b ≥ 2, the classes inline image used in Section 'U-shaped learning with full memory' to separate Vacillatory Learning with bound b from its non-U-shaped variant are self-describing classes: Each language in inline image contains information about its own (coded) grammar within its first b + 1 elements.

There has been a trend from Case (1994), Case and Smith (1978), Case and Lynes (1982), and Case and Smith (1983) to Case (1999) and Case et al. (1999) for the self-describing classes employed to go from obvious choices to more difficult choices, yet the original reason for their employment was to make the positive half of proofs that one criterion has more power than another immediate.

Recently, the use of self-describing classes has been generalized, in many cases improved (as regards immediacy of such positive halves of proofs) and systematized by Case and Kötzing (2010a, 2010b, 2012) and Kötzing (2011). For a given learning criterion C, the self-learning class of languages for C is that class that is C-learned by merely treating the data elements as codes for programs to be run on inputs relevant to C-learning. This may seem irrational, and, of course, many numbers so run as programs will produce no output (conjectures). Such numbers will, then, not be data elements of languages in the self-learning class for C! Surprisingly, it has been shown very generally (including for criteria as above) that the self-learning class for C witnesses that C is more powerful than another criterion inline image if and only if this is indeed the case! In particular, then, the self-learning class for C will witness C-learning is more powerful than the variant of C in which U-shapes are forbidden if and only if U-shapes are necessary for full C-learning power!

The technical counterparts of the use of self-describing and self-learnable classes are the so-called Recursion Theorems of Computability Theory. The first such theorem is due to Kleene (see, e.g., Rogers, 1967) and can be stated as follows. Let p be an arbitrary algorithmic task. Then there exists a program e (depending on p) that acts as follows. When run on an input x, the program e creates a copy of its own code (a self-copy) and performs the task p on the combined input consisting of the self-copy and the external input x. That is, e performs the preassigned algorithmic task p on its own code combined with the external input x. In some sense, e exhibits a form of self-knowledge (in this case the ability to produce and manipulate a low-level self-description). Actually, the self-knowing program e can be found algorithmically from p, and quite efficiently so (see Royer and Case, 1994). A number of extensions of Kleene's Recursion Theorem are known. One of the most far-reaching is Case's Operator Recursion Theorem (Case, 1974). This theorem features infinitary self- and other-reference. It is important for the employment of self-learning classes.
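The statement of Kleene's theorem can be illustrated for Python programs by the familiar quine technique. The sketch below is only an analogy under stated assumptions: programs are Python source strings rather than coded natural numbers, the task is assumed to be given as source text defining a function named `p(code, x)`, and `make_self_aware` is a hypothetical helper name. It is not the construction of Royer and Case (1994).

```python
def make_self_aware(task_source: str) -> str:
    """Return source text of a program defining e(x) = p(<source of e>, x).

    `task_source` must define a function p(code, x) and end with a newline.
    """
    template = (
        "{task}"
        "TEMPLATE = {template!r}\n"
        "TASK = {task!r}\n"
        "def e(x):\n"
        "    # Rebuild this very program text from the two stored strings.\n"
        "    own_code = TEMPLATE.format(task=TASK, template=TEMPLATE)\n"
        "    return p(own_code, x)\n"
    )
    return template.format(task=task_source, template=template)


# Example task: simply hand back the self-copy together with the input.
task = "def p(code, x):\n    return (code, x)\n"
e_source = make_self_aware(task)

namespace = {}
exec(e_source, namespace)          # load the program e
result = namespace["e"](42)        # run e on the external input 42
```

The program text `e_source` is computed from the task text alone, mirroring the fact that e can be found algorithmically from p. Running it yields `result == (e_source, 42)`: the task really receives an exact copy of the code of the program that is executing it.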

Even though (imperfect) self-knowledge and self-description might be powerful resources even in the case of human learning, some may be dissatisfied by the fact that the classes of languages separating a learning criterion from its non-U-shaped variant are self-referential classes (or self-describing, or self-learning), rather than natural classes of languages.

An analogy can be drawn between the self-referential witnesses of separations in Learning Theory and the self-referential witnesses of Gödel's Incompleteness phenomenon (see Gödel, 1986) in formal systems of mathematics. The relevance of Gödel's First Incompleteness Theorem for concrete mathematics was questioned on the basis of the fact that his unprovable and irrefutable sentences witnessing the incompleteness of formal systems such as Peano Arithmetic and Zermelo–Fraenkel Set Theory were self-referential sentences, whose mathematical content was devoid of interest to nonlogician mathematicians. It took more than 40 years to find “natural” examples of Gödel sentences. The first such example is the famous “Large Ramsey Theorem” of Paris and Harrington (Harrington and Paris, 1977). Later, perfectly natural mathematical theorems such as Kruskal's Tree Theorem and the famous Graph Minor Theorem of Robertson and Seymour were proved to require unexpectedly strong axioms. Partly based on this analogy, Case (1999) proposed the following Informal Thesis.18

Informal Thesis: If a self-referential example witnesses the existence of a phenomenon, there are natural examples witnessing the same!

The other basis for the just above Informal Thesis is that self-reference arguments lay bare an underlying simplest reason for the theorems they prove (Case, 1994; Case & Kötzing, 2012; Rogers, 1967); if a theorem is true for such a simple reason, the “space” of reasons for its truth may be broad enough to admit natural examples.

We consider it nonetheless a difficult and interesting challenge for future research to find humanly natural examples of classes requiring U-shaped learning. This is in part because much of human cognition is hidden in brains too complicated for current experimental techniques.

5.2. The problem of generalization: Rules and exceptions

According to a classical explanation of U-shaped learning, U-shapes occur because of the learner adopting two learning strategies: memorization of a finite table and production of a general rule (Marcus et al., 1992; Pinker, 1984, 1991). This explanation has been most notably challenged by connectionists, who posit a single learning mechanism.

It is interesting to note how advocates of opponent theories agree in describing the learning situations in which U-shaped learning occurs as critically featuring an interplay between a small set of exceptions and a general rule or common case.

Melissa Bowerman (Bowerman, 1982) thus points out that U-shaped learning curves “[…] occur in situations where there is a general rule that applies to most cases, but in which there are also a limited number of irregular instances that violate the rule” and goes on to conclude that “[…] the solution involves a general adherence to the rule plus memorization of the exceptions.” In the article by Rogers, Rakison, and McClelland (2004) of the connectionist school, the emergence of U-shaped behavior is linked to learning tasks consisting of regularities (statistical ones in this case) and exceptions: “Specifically, we suggest that U-shaped curves can arise within a domain-general learning mechanism as it slowly masters a domain characterized by statistical regularities and exceptions.”

An interplay between finite tables and infinite languages subsuming them is intriguingly featured in some of our proofs, for example, by the proofs of Theorems 2 and 3 (see Carlucci et al., 2008 for details).

The idea of these proofs is the following. Take a machine M and consider the behavior of M on finite amounts of data. Consider the case of a finite sequence of data σ such that, when M receives σ as input, M outputs a grammar g such that the language generated by g is a proper superset of the elements of σ (i.e., it contains all the elements of σ and some more). Now suppose that M is required to learn both the finite language consisting of the elements of the sequence σ (call this language $L_\sigma$) and the language $L_g$ generated by g. Consider now the behavior of M on the following text. M is first fed σ. At this point, by choice of σ, M outputs grammar g for $L_g$. After that, M is presented with the elements of σ in any order. As M must learn $L_\sigma$, at some point M will output a grammar generating the language $L_\sigma$, which is different from $L_g$. But, as $L_g$ contains $L_\sigma$, we can go on by presenting M with elements from $L_g$. As M learns $L_g$ by assumption, at some point, M will output a correct grammar for $L_g$. But then M has committed a U-shape in learning $L_g$: It has first output a conjecture for $L_g$ after reading σ, then it has abandoned it for a conjecture generating the language $L_\sigma$, and finally it has returned to a correct conjecture for $L_g$. This shows that M is in fact forced to have a U-shaped behavior on any class containing at least $L_\sigma$ and $L_g$. The other ingredients of the proofs are needed to ensure learnability and will be disregarded in this discussion. Now σ is a finite sequence, and $L_\sigma$ is a finite language, which can be thought of as a finite table of exceptions. By contrast, the language $L_g$ containing $L_\sigma$ is, at least in principle (and we have no way to decide it), an infinite set. Such an infinite set can be described, and learned, only by conjecturing a general rule.
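The three phases of this argument can be traced on a toy example. The following Python sketch is only an illustration under strong simplifying assumptions, not the construction used in the actual proofs: the learner `toy_learner`, the grammar names `EVENS` and `finite(...)`, and the helper `force_u_shape` are all hypothetical, and conjectures are compared by name, whereas equality of the languages generated by arbitrary grammars is not decidable in general. The toy learner identifies both the finite language {0, 2} and the language of all even numbers, and the text built from σ = (0, 2) drives it through a U-shape.

```python
from itertools import count, cycle

EVENS = "evens"  # hypothetical grammar name for the infinite language {0, 2, 4, ...}


def finite(elements):
    """Hypothetical grammar for exactly the finite language `elements`."""
    return ("finite", frozenset(elements))


def toy_learner(data):
    """Toy stand-in for M: the conjecture issued after the finite sequence `data`."""
    content = set(data)
    if content - {0, 2}:   # evidence beyond the finite table: conjecture the rule
        return EVENS
    if len(data) <= 2:     # early overgeneralization on very little data
        return EVENS
    return finite(content)  # otherwise memorize the finite table


def force_u_shape(learner, sigma, g, rest_of_larger_language, max_steps=1000):
    """Build the text from the proof sketch and record the learner's conjectures."""
    text = list(sigma)
    conjectures = [learner(text)]
    assert conjectures[-1] == g, "sigma must make the learner conjecture g"
    table = finite(sigma)
    # Phase 2: repeat the elements of sigma until the learner conjectures the table.
    for x in cycle(sigma):
        if conjectures[-1] == table:
            break
        if len(text) >= max_steps:
            raise RuntimeError("learner did not converge on the finite language")
        text.append(x)
        conjectures.append(learner(text))
    # Phase 3: present further elements of the larger language until g returns.
    for x in rest_of_larger_language:
        if conjectures[-1] == g:
            break
        if len(text) >= max_steps:
            raise RuntimeError("learner did not converge on the larger language")
        text.append(x)
        conjectures.append(learner(text))
    return text, conjectures


text, conjectures = force_u_shape(toy_learner, (0, 2), EVENS, count(4, 2))
# Correctness of each conjecture with respect to the target language of all evens.
correct = [c == EVENS for c in conjectures]
print(text, correct)
```

On this text the learner's conjectures are correct, then incorrect, then correct again with respect to the even numbers, which is exactly the pattern described above.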

It would be far-fetched to draw definitive conclusions about human learning from the features of the above proof. Note that the order in which finite tables and general rules come into play in the argument sketched above to enforce U-shapedness is different from the classical account for humans. The finite table σ is so chosen that the learner in the first phase commits apparent overgeneralization in the sense of conjecturing a grammar for a larger language, in the second phase correctly learns the finite table σ, and finally is forced to return to a conjecture for the target language containing the finite table σ and something more.

Still, the above proof exemplifies how U-shaped behavior may be caused by the learner having to deal with two categories of objects: a finite table (of “exceptions”) contained in a larger language, which is possibly infinite. The relation between finite and infinite members of the target class is also critical in other results discussed above. For example, in the context of Memoryless Feedback Learning, U-shapes become unnecessary when classes containing only infinite languages are considered. Furthermore, showing that U-shapes are unnecessary for iterative learning of classes consisting of infinite languages turned out to be significantly easier than obtaining the result for arbitrary classes. In the context of Explanatory Learning, if a class does not contain an extension of every finite set, then that class can be learned by a decisive learner (Carlucci et al., 2006)!

It should be noted that the same interplay is at the very heart of many fundamental results in Gold-style learning theory, including results unrelated to U-shaped learning, most notably the seminal result showing that the class of regular languages is not learnable in the limit in the explanatory sense. In this sense, the learnability of mixed classes of finite and infinite languages might be more widely relevant to the understanding of “the problem of generalization” as a central problem of learning (Heinz, 2010). Note that these matters also arise within classes of infinite languages alone, where one language may have some special finite part that another lacks, so these kinds of considerations may not go away in general if we confine ourselves to infinite languages.19

6. Other forms of non-monotonic learning

U-shapes are but an instance of nonmonotonic learning. Other nonmonotonic patterns have been experimentally documented and studied in a variety of cognitive-developmental situations (Cashon & Cohen, 2003, 2004).

In Carlucci et al. (2006), a number of variants of nonmonotonic learning criteria have been investigated in the context of Gold's model. In particular, the following restrictions on the learner's behavior have been studied: (1) no return to previously abandoned wrong hypotheses, (2) no return to overinclusive hypotheses, (3) no return to overgeneralizing hypotheses, and (4) no inverted U-shapes. In each case, “no return” means that the learner cannot output a conjecture (of the specified kind) that is extensionally equivalent to a previously abandoned conjecture. An overinclusive conjecture is a conjecture for a language that contains nonelements of the target language. An overgeneralizing conjecture is a conjecture for a language that properly includes the target language. An inverted U-shape means returning to a wrong conjecture while making a correct guess in between.
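The notions just defined can be made concrete on a toy representation. The sketch below is an illustration only and rests on an assumption that does not hold in the general setting: each conjecture is identified with the finite set of elements of the language it generates, so that extensional equivalence reduces to set equality. The function names are hypothetical.

```python
def is_overinclusive(conjecture, target):
    """The conjectured language contains at least one nonelement of the target."""
    return not conjecture <= target


def is_overgeneralizing(conjecture, target):
    """The conjectured language properly includes the target language."""
    return target < conjecture


def has_inverted_u_shape(conjectures, target):
    """A wrong conjecture, later a correct one, later the same wrong language again."""
    n = len(conjectures)
    for i in range(n):
        if conjectures[i] == target:
            continue  # the first conjecture of the pattern must be wrong
        for j in range(i + 1, n):
            if conjectures[j] != target:
                continue  # the middle conjecture of the pattern must be correct
            if conjectures[i] in conjectures[j + 1:]:
                return True
    return False


target = frozenset({1, 2})
print(is_overinclusive({1, 3}, target))       # contains 3, which is not in the target
print(is_overgeneralizing({1, 2, 3}, target))  # proper superset of the target
print(has_inverted_u_shape([{1}, {1, 2}, {1}], target))
```

Every overgeneralizing conjecture is overinclusive, but not conversely: the set {1, 3} contains a nonelement of the target without including the target.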

A fairly complete picture has been obtained of the impact of the above constraints on Explanatory Learning, Vacillatory Learning, and Behaviorally Correct Learning.

The general picture that emerged is the following. Forbidding return to previously abandoned wrong conjectures turns out to be a very restrictive requirement in all models. For explanatory and vacillatory learners, this amounts to imposing the strongest form of monotonicity, that is, decisiveness (no return to any previously abandoned conjecture). In the case of Behaviorally Correct Learning, the requirement turns out to be incomparable with non-U-shapedness. Sometimes non-U-shaped behaviorally correct learners can do better than those not returning to wrong conjectures, but sometimes it is the other way round. The results about the other forms of nonmonotonic learning confirm the extreme sensitivity of the Vacillation Hierarchy to this kind of constraint. Forbidding return to overinclusive hypotheses or forbidding inverted U-shapes causes the hierarchy to collapse to plain Explanatory learning, just as in Theorem 2 for U-shaped learning. Forbidding return to overgeneralizing hypotheses does not cause collapse, but it does restrict learning power at each level of the hierarchy. By contrast, the variants (2), (3), and (4) are useless for Explanatory and Behaviorally Correct Learning!

It is an interesting direction for future research to investigate the impact of the above variants of nonmonotonic learning in models with memory limitations.

7. Discussion and conclusion

Gold's model has been widely discussed in the Cognitive Science literature. Most commentators, however, have focused on a fundamental negative result of Gold (1967) rather than on the model as a whole and its extensions in subsequent research (see Jain et al., 1999). The theorem in question (usually referred to as Gold's Theorem tout court) shows that no superfinite language class is learnable in the limit from positive data.20

Gold's Theorem is often invoked as evidence for the nativist approach to cognition (for recent examples see the influential Hauser, Chomsky, & Fitch, 2002; Nowak, Komarova, & Niyogi, 2002). If not to show that Universal Grammar is a “logical necessity” (Nowak et al., 2002), Gold's Theorem is sometimes invoked in a less drastic way as indicating that domain-general knowledge is impossible, and that constraints on the learning process are necessary for learning, be they innate or acquired (see, e.g., Tenenbaum, Kemp, Griffiths, & Goodman, 2011). As nicely observed and documented in Heinz (2010), commentators usually disregard the fact that Gold himself proved in his original article (Gold, 1967) that constraints on the learning process do enhance learning power (the result is that the class of c.e. languages is learnable in the limit from positive data if the learner is only required to converge on primitive recursive texts).21 In this respect, inductive learning in the limit is on a par with connectionist, statistical, or Bayesian models of learning and cognition: Some language classes become learnable at the cost of extra assumptions on the input or hypothesis spaces. The plausibility of the extra assumptions made in models other than Gold's can itself be questioned. As Heinz (2010) puts it,

With respect to the claim that identification in the limit makes unrealistic assumptions, I believe it is fair to debate the assumptions underlying any learning framework. However, the arguments put forward by the authors below are not convincing, usually because they say very little about what the problematic assumptions are and how their proposed framework overcomes them without introducing unrealistic assumptions of their own.

The recent debate between connectionist/emergentist and structured probabilistic inference models offers a good example of advocates of competing models mutually accusing each other of making unrealistic assumptions.22 The connection with linguistic nativism and the comparison with statistical models have certainly contributed to exacerbating the critiques of Gold's model from adversaries of the nativist tradition. It is not necessary to downplay Gold's model as a whole to defeat the argument that goes from Gold's Theorem to nativism. It is enough to observe (as both Clark and Lappin, 2010, and Heinz, 2010, do) that the class of natural languages need not be a superfinite class. This is in addition to the fact that, as is now widely recognized, natural languages need not coincide with a class of the Chomsky hierarchy. Early results (Angluin, 1980, 1982) show that, even though regular languages are not learnable in Gold's model, interesting and rich classes that run orthogonal to classes in the Chomsky hierarchy are so learnable. Recent research has shown that many interesting and rich classes orthogonal to the Chomsky hierarchy classes are learnable in the limit from positive data, and, in some cases, even efficiently so (Becerra-Bonache & Yokomori, 2004; Becerra-Bonache et al., 2010; Clark & Eyraud, 2007; Oates, Armstrong, Becerra-Bonache, & Atamas, 2006; Yoshinaka, 2009; Yoshinaka, 2011).

Even advocates of the nativist tradition sometimes emphasize the limits of Gold's model, mostly in favor of statistical models of learning. Heinz (2010) presents a very detailed analysis of the most common critiques of Gold's model and a convincing rebuttal of most of them. The following list of “somewhat problematic assumptions” of Gold's model can be found in the influential Nowak et al. (2002), which favors probabilistic extensions of Gold's model: (1) the learner has to identify the target language exactly, (2) the learner receives only positive examples, (3) the learner has access to an arbitrarily large number of examples, and (4) the learner is not limited by any consideration of computational complexity. As these points are not all explicitly addressed in Heinz (2010), we briefly comment on them. Note that points (1–3) apply to a family of learning criteria within Gold's setting and not to Gold-style learning theory as a whole. Concerning (1), we point out that forms of approximate learning can be and have been investigated within Gold's model (Case, 1999; Case & Smith, 1978; Case & Lynes, 1982; Case & Smith, 1983). Concerning (2), we remark that the necessity of negative information for language learning is a controversial and debated issue (Brown & Hanlon, 1970; Clark & Lappin, 2010; Marcoux, 1993; Taatgen & Anderson, 2002). Interestingly, the fact that some form of possibly implicit negative information could be available to children learning languages was also suggested by Gold himself (Gold, 1967) commenting on his results on unlearnability from positive data only. For attempts at modeling partial negative information in Gold's framework see Baliga et al. (1995), Jain and Kinber (2008), and Motoki (1991). Point (3) is addressed by models with bounded memory such as those mentioned in Section 'U-shaped learning with memory limitations' of the present article. The issue of computational complexity (point [4]) is in general a serious one. The difficulty of imposing a fair feasibility restriction on the computational complexity of the learners is indeed a serious drawback of the model. See Case and Kötzing (2009) for a rigorous investigation of how some proposed solutions fail to solve the fairness problem.23 On the other hand, fair and feasible algorithms are known for interesting classes orthogonal to the Chomsky hierarchy and providing some natural language patterns (see, e.g., Becerra-Bonache et al., 2010; Clark & Eyraud, 2007; Oates et al., 2006; Yoshinaka, 2009; Yoshinaka, 2011).

The investigation of the important open mathematical questions about the necessity of U-shapes mentioned in prior sections, in the setting where the learner is required to be both fair and feasible, has barely begun. For the humanly important memory-bounded cases, the fairness problem is apparently only difficult to sort out for the case of nonzero bounded feedback and recall queries.

To some points above, Clark and Lappin (2010) add the requirement of convergence on all texts as an unrealistic assumption of Gold's model.24 Oddly enough, they motivate their claim by referring to the case of feral children, who suffer from an “Impairment of learning due to an absence of data.” But no text in Gold's model—however adversarial—features absence of data.25 One might also argue that the excessively restrictive (convergence on all texts) and the excessively liberal aspects of the model (no complexity bounds) tend to even out to some extent. As noted in Heinz (2010), the restrictive aspects of Gold's learning criteria make the known positive results stronger rather than weaker.

One of the most prominent (and usually disregarded) shortcomings of Gold's model is in our opinion its inability to model semantic information. Empirical evidence suggesting that semantics (denotation and social reinforcers) in addition to positive information might be crucial for language acquisition is presented regarding denotation in Moeser and Bregman (1972, 1973). The recent and influential paper by Meltzoff, Kuhl, Movellan, and Sejnowski (2009) also makes a strong case for the critical importance of social reinforcers.

Overall, we believe that some of the drawbacks of Gold's model are not unique to it and that those that are unique are compensated by other benefits of the model. A trade-off between applicability and predictive power on the one hand, and generality and categorical rigour on the other, has to be expected at the present time. In particular, and critically for the topic of the present article, Gold's model is a unique setting for posing and answering with mathematical rigour questions about the logical necessity of learning strategies.

In general, the state of the art of Gold-style learning theory can be compared to the status of ancient physics. Modeling human learning in Gold's model is weakly analogous to modeling the thermodynamics of gases without taking into account van der Waals forces. This kind of idealization still allows some understanding of the modeled reality. Also, many other parts of Cognitive Science do not allow for precise quantitative predictions in their present form.

We thus believe that the results obtained in Gold's model can give some insights into human learning. As suggested in Case (2007) and Heinz (2010), formal results in Gold's model might guide the design of meaningful experiments. We would like to see more interaction between this branch of Computability Theory, experimental Psychology, and Cognitive Science.

As said above, the general picture that emerges from the results known so far and presented in this article is that U-shaped behavior is unavoidable for full learning power in the context of a number of parametrized models of learning featuring a number of cognitively motivated constraints. These results might be taken as suggesting that humans exhibit U-shaped and other nonmonotonic learning patterns because otherwise it would be impossible for them to learn what they need to learn to be competitive in the evolutionary marketplace. U-shaped learning could really turn out to be a “hallmark of development” (as in Marcovitch and Lewcowicz, 2004). Also, the results presented illuminate from a novel perspective the critical issue of how U-shaped learning relates to the general “problem of generalization” and to the structure of the language classes. In a number of interesting cases, the necessity of U-shaped learning disappears when learning is restricted to classes consisting of infinite languages only. These preliminary insights should be verified in two directions: (1) empirically, by designing insightful experiments to assess measurable advantages of using a U-shaped learning strategy, and (2) mathematically, by investigating the necessity of U-shapes in the context of more and more cognitively relevant criteria. For the second purpose, we believe that future research should focus on models of learning obtained by combining the following features: (a) vacillatory identification, (b) bounded memory, (c) queries, (d) counters, (e) fair feasibility, and also (f) stochastic elements. Regarding the latter, Case (2012) argues that we live in a quantum mechanical universe for which the expected behaviors are nonetheless algorithmic; hence, there is value in modeling the expected behavior of humans nonstochastically.

We believe that each of these features is relevant for the purpose of modeling human cognition and expect that the study of U-shaped learning in models obtained by combining these features would give new insights into this interesting phenomenon.


  1. 1

    This concept of monotonic should not be confused with the interesting technical senses in computational learning theory (Jantke, 1991; Zeugmann & Lange, 1995). These latter seem not to be cognitively relevant and, hence, are not dealt with herein.

  2. 2

    Interestingly, the idea that U-shaped cognitive development might be “the quintessential hallmark of the developmental process” has been advanced by Marcovitch and Lewcowicz in their short paper (Marcovitch & Lewcowicz, 2004).

  3. 3

    Various extensions of Gold's model will hereinafter be referred to as just Gold's model.

  4. 4

    Such a return can be to an intensionally, that is, syntactically, very different grammar, but a grammar, nonetheless, that generates the exact same language.

  5. 5

    The case of forbidding returning (in the sense of the extension of the conjectured grammars) to an abandoned conjecture on all languages is called decisiveness (Baliga et al., 2008; Fulk, Jain, & Osherson, 1994; Osherson et al., 1986), and there is a little more about it below.

  6. 6

    We will develop below some important, parameterized learning criteria.

  7. 7

    Some theoretical results appear, for example, in Baliga, Case, and Jain (1995).

  8. 8

    This eventual success is without regard to such interesting issues as the timing (or relative timing) of reaching such success. Regarding formal notions of insensitivity to order of presentation, several theoretical results (for some learning criteria) are presented, for example, in Case (1999).

  9. 9

    It can be argued that noncomputable texts can be and are generated by randomness in the environment, including from quantum mechanical phenomena.

  10. 10

    An example fixed finite bound would be three. Finite unbounded allows different finite bounds (on the number of successful programs in the limit) on different texts and languages (on which the learner is successful).

  11. 11

    This class is a variant of that employed in Case (1999) and is from Carlucci et al. (2008).

  12. 12

    A similar class shows that learning with arbitrary but finitely many correct grammars in the limit is strictly more powerful than learning with vacillation bound b for each positive natural number.

  13. 13

    For the human case, the b, bounding the number of correct grammars in the limit, must have an upper limit. A slight paraphrase of a relevant argument from Case (1999) follows. At least one of b distinct grammars would have to be of size proportional to the size of b (i.e., to log(b)); hence, for extraordinarily large b, at least one of b distinct grammars would be too large to fit in our heads, unless, as seems highly unlikely, human memory storage mechanisms admit infinite regress.

  14. 14

    See the discussion in section 5.1 regarding the thesis from Case (1999) that self-referential examples portend natural ones.

  15. 15

    Note that an iterative learner can make up for its inability to explicitly remember previously seen data items by coding them into its conjectures. This trick can only be used a finite number of times, and it has to stop by the time the learner has converged to its final conjecture.

  16. 16

    In the human case, it is plausible that we do have available our prior working hypothesis for computing our next one. This makes it more urgent to solve the problem of whether U shapes are necessary (beyond the first few parameter values) in the Bounded Example Memory Hierarchy!

  17. 17

    Interestingly, the conjecture is based on the observation that in the case of Iterative Learning, the dispensability of U-shapes in learning classes consisting of infinite languages only could be much more easily proved than in the case of mixed classes. By analogy, perhaps the access to some form of infinity (as given, for example, by an unbounded counter) could make U-shapes unnecessary.

  18. 18

    Johnson (2004) makes a similar analogy between Gold's Theorem (proving the unlearnability of the class of regular languages) and Gödel's Theorem: “In general, the relation of Gold's Theorem to normal child language acquisition is analogous to the relation between Gödel's first incompleteness theorem and the production of calculators. Gödel's theorem shows that no accurate calculator can compute every arithmetic truth. But actual calculators don't experience difficulties from this fact, since the unprovable statements are far enough away from normal operations that they don't appear in real life situations.” From this analogy, Johnson concludes that Gold's Theorem is irrelevant to cognitive science, just as Gödel's Theorem is to concrete mathematics, apparently disregarding 40 years of mathematical research on concrete incompleteness theorems (see Friedman, 2011).

  19. 19

    And it has long been argued (except by finitists) that each natural language is infinite.

  20. 20

    A superfinite class of languages is a language class that contains some infinite language and all its finite sublanguages. As the class of regular languages is superfinite, Gold's Theorem implies that regular languages are not learnable, and that the same is true of all larger classes of the Chomsky hierarchy.

  21. 21

    A similarly proved positive result regarding a related kind of stochastic learning is found in Angluin (1988).

  22. 22

    For example, McClelland et al. (2010) write: “We view the entities that serve as the basis for structured probabilistic approaches as abstractions that are occasionally useful but often misleading; they have no real basis in the actual processes that give rise to linguistic and cognitive abilities or to the development of these abilities.” To which Griffiths, Chater, Kemp, Perfors, and Tenenbaum (2010) reply: “By contrast, we believe that greater danger lies in committing to the particular incorrect low-level mechanisms—a real possibility because most connectionist networks are vastly oversimplified when compared with actual neurons.”

  23. 23

    The fairness issue, first noted in Pitt (1989), is the problem that, for learning in the limit, an imposition of requiring each conjecture to appear in polynomial time (in the length of the data on which it is to be based) is really no useful restriction at all, as hard computations can be unfairly put off until a longer data sequence appears which would allow more time to compute.

  24. 24

    “Children are not generally subjected to adversarial data conditions, and if they are, learning can be seriously impaired. Therefore, there is no reason to demand learning under every presentation” (Clark and Lappin, 2010).

  25. 25

    An adversarial text might feature absence of data up to any given finite time bound (developmental stage), but the model has no pretension of mimicking child development to that level of detail.