Editors' Introduction: Why Formal Learning Theory Matters for Cognitive Science
- Note: Nick Chater was supported by ERC Advanced Research Grant “Cognitive and Social Foundations of Rationality.”
Correspondence should be sent to Sean Fulop, Department of Linguistics PB92, California State University Fresno, Fresno, CA, 93740. E-mail: email@example.com
This article reviews a number of different areas in the foundations of formal learning theory. After outlining the general framework for formal models of learning, the Bayesian approach to learning is summarized. This leads to a discussion of Solomonoff's Universal Prior Distribution for Bayesian learning. Gold's model of identification in the limit is also outlined. We next discuss a number of aspects of learning theory raised in contributed papers, related to both computational and representational complexity. The article concludes with a description of how semi-supervised learning can be applied to the study of cognitive learning models. Throughout this overview, the specific points raised by our contributing authors are connected to the models and methods under review.
Formal learning theory, broadly construed, is a diverse collection of approaches to the mathematical modeling of learning. From the point of view of cognitive science, formal learning theory can provide constraints on what is learnable by different types of idealized mechanism. Moreover, in some cases, specific algorithms discussed may be part of the cognitive model. Therefore, such modeling could potentially be relevant to human learning, either by modeling the learning algorithm directly or by deriving theoretical results which constrain what any learning system may possibly achieve given certain data. In contrast, many discussions and computational frameworks for understanding learning in cognitive science are not well connected to such theoretical results, so that, for example, whether a particular model can, in principle, scale up to more complex cases may be difficult to assess. Conversely, when not directly tied to cognitive scientific questions (or practical challenges in machine learning), formal learning theory can become a rather specialized mathematical activity, with no clear application. This disconnect between two disciplines that can productively inform each other deserves some remedy, and it is our hope that the papers within the present topic go some way toward fostering a greater connection.
Although cognitive science is most canonically concerned with the construction of computational models of specific cognitive phenomena (including learning of all kinds, and of course language acquisition), there remain fundamental questions about the capabilities of different classes of cognitive model, and about the classes of data from which such models can successfully learn. In view of this, formal learning theory has the potential to play a role within cognitive science analogous to that played by theoretical computer science with respect to applied computing. Thus, we suggest that the analysis of learning in cognitive science has a rich vein of theoretical work to draw upon, but too often the literature describes specific computational learning simulations that are not accompanied by, or situated within, any theoretical analysis.
Theoretical analysis of learning is frequently discussed under the heading of “inductive inference”—viewing learning as a type of inference in which the “premises” typically consist of observed data (which might be perceptual input, reward signals from the environment, linguistic materials, and so on). Early formal work on inductive inference attempted to model the process by extending classical logic to inductive logic, for example Carnap (1950), building on probability theory. Although still pursued in some fields (Anshakov & Gergely, 2010; Milch et al., 2007; Muggleton, 1990), research has typically shifted away from the powerful but often intractable representational machinery of formal logic, to models of learning defined over much more restricted representational formalisms. Thus, interest often focuses on relatively restrictive problems, such as learning categories from examples, or learning what sentences are allowed in a language from examples (or, even more abstractly, learning potentially infinite sets of numbers from examples, and perhaps also nonexamples, labeled as such).
Various formal learning models have gradually come to the fore, including Bayesian inference, the statistical inference model codified in Solomonoff (1964a), the Probably Approximately Correct framework proposed by Valiant (1984), and the “identification in the limit” model of Gold (1967). The now well-known “Gold's theorem” from the latter launched a vigorous debate in linguistics which is far from over. Its deceptive simplicity has led to its being possibly more often misunderstood than correctly interpreted within the linguistics and cognitive science community, as was richly documented by Johnson (2004). All of these learning models fall under the rubric of supervised learning, because they must use data that are in some sense labeled—either the learner assumes all examples are positive examples of one concept such as a language, or a more elaborate distribution of multiple labels can be provided, as with category learning. Another area of research involves induction of structured knowledge from unlabeled data—so-called unsupervised learning. This area encompasses such disparate enterprises as data clustering, independent components analysis, and self-organizing maps; it is generally difficult to measure the success of such unsupervised techniques. The use of unlabeled data in conjunction with labeled data has developed into an interesting subfield of semi-supervised learning. We will next present a brief overview of learning theory covering many subfields, attempting to touch upon all of the approaches that are applied in the contributed papers.
A standard way to model a restricted learning problem is to begin by supposing a concept class C encompassing everything that is, in some sense, like the thing to be learned. For the commonly considered language learning problem, for example, a concept class might be all sets of expressions over a certain (predetermined) vocabulary of words. This simply fixes the domain of discourse: it specifies what kind of learning problem we have. Learning can then be represented as an effort to formulate a correct hypothesis about a target concept. The target concept is generally assumed to be a member of a hypothesis class, H, which is a restricted subset of the entire concept class. One way in which this notion of learning is restricted is that the aim is to infer the elements of the category or language, but not, for example, to reconstruct the rules or grammar defining the category or language. Theoretical results generally pertain to the learnability of hypotheses in various sorts of hypothesis classes. Just what is meant by “learnability” also needs to be specified—do we mean asymptotic learnability (in the limit of an unending sequence of examples), or is some bound on performance or complexity of the process also to be required? Another possibility is to allow approximation of the correct hypothesis, rather than insisting on precise learning.
The learning process proceeds as the examples from a set D illuminating the target hypothesis arrive in a sequence one by one. The example sequence may be limited to members of the target hypothesis, which are known as positive examples, or the example sequence may include elements from both inside and outside the target (the latter being negative examples), in which case each example must indicate whether it is positive or negative. Solomonoff's (1964a) idea was that learning can be modeled as “extrapolation” from the sequence of examples, that is, successful prediction of the next symbol in the sequence. This model of learning as prediction was expansively presented by Li and Vitányi (2008), where it was shown to subsume other formal approaches, but can in the end be viewed as a subspecies of Bayesian learning.
2. Bayesian learning
The Bayesian approach to learning takes as its starting point that probabilities can be interpreted as “degrees of belief” in a “learning agent” (whether a child acquiring a language or a machine learning algorithm).1 The Bayesian learning method amounts to updating these degrees of belief on encountering new data (which are assumed to be known with certainty), in line with the rules of probability theory. Specifically, such updating typically involves applying an elementary theorem of probability, known as Bayes's rule:

Pr(Hi | D) = Pr(D | Hi) Pr(Hi) / Pr(D),  where Pr(D) = ∑j Pr(D | Hj) Pr(Hj).
Here, the various Hi are the individual hypotheses in H, while D is the set of observed examples, or “learning sample.” The formula states that the “posterior” probability Pr(H | D) of a particular hypothesis H conditioned on observations D is proportional to the product of the learner's “prior” belief in (probability assigned to) H with the probability Pr(D | H) of the learning sample conditioned on H. The latter value is commonly called by the (rather misleading) name of likelihood. The denominator of the formula, often called the Bayesian evidence, makes it an equation rather than a mere proportionality; it is not always necessary to compute it in order to choose among different hypotheses.
As data are encountered piecemeal, this formula can be applied iteratively: Today's posteriors are tomorrow's priors. More precisely, at step n, the posterior probabilities of a number of different hypotheses are computed with Bayes's formula. Then at step n + 1, the new example is considered, and the posteriors are adopted as new prior probabilities in a reapplication of the formula. This general scheme has become widely applied in cognitive models of learning (e.g., Tenenbaum & Griffiths, 2001) as well as in Bayesian statistics. But how should the initial priors—the ones used at the first step of the iteration—be chosen? Somewhat reassuringly, under some conditions (e.g., H is a finite class; Freedman, 1963), Bayesian inference eventually converges to the “optimal” hypothesis starting from almost any prior, but convergence can be much faster if we start with a “better” set of prior probabilities. If the hypothesis class is countably infinite, such as an infinite class of languages, the situation becomes more subtle, and the prior can actually affect the ability to converge. In any case, practically effective Bayesian inference demands the selection of a good prior.
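This iteration can be made concrete with a small sketch. The finite class of three coin-bias hypotheses, the uniform initial prior, and the data stream below are illustrative assumptions, not any contributor's actual model:

```python
# Iterative Bayesian updating over a hypothetical finite hypothesis class:
# three candidate coin biases (probability of heads).
hypotheses = [0.3, 0.5, 0.7]          # Pr(heads) under each hypothesis
beliefs = [1 / 3, 1 / 3, 1 / 3]       # initial prior degrees of belief

def update(priors, outcome):
    """One application of Bayes's rule; outcome is 'H' or 'T'."""
    likelihoods = [h if outcome == 'H' else 1 - h for h in hypotheses]
    unnormalized = [lk * p for lk, p in zip(likelihoods, priors)]
    evidence = sum(unnormalized)      # Pr(D), the Bayesian evidence
    return [u / evidence for u in unnormalized]

# Today's posteriors are tomorrow's priors: apply the rule example by example.
for outcome in "HHTHHHTH":
    beliefs = update(beliefs, outcome)

best = max(zip(hypotheses, beliefs), key=lambda hb: hb[1])[0]
```

Because each update renormalizes by the evidence term, the degrees of belief remain a probability distribution at every step.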
The problem of the initial priors is the main subject of the contribution by Jacob Feldman in the current issue. This article discusses the matter of how initial priors should be optimized before carrying out Bayesian learning. Feldman states that there have been two opposing views on this matter. The “classical” Bayesian approach stands against deliberately optimizing the initial priors from empirical results and instead advocates applying one's general knowledge of the scenario to assign priors according to reasonable beliefs. An example of this is the standard assumption of a fair coin, where the favored (most probable) hypothesis initially would assign probability 0.5 to each side. If a coin is, in fact, not fair, a Bayesian learning procedure will uncover this by adjusting the posteriors through a long sequence of coin tosses, and a different hypothesis assigning the unequal probabilities will gradually emerge. The ancient “principle of insufficient reason” is the classic course of last resort, refusing to favor any hypothesis initially (making all priors equal) when general knowledge is not useful. This approach is considerably generalized in the widely used maximum entropy principle in Bayesian methods (Jaynes, 2003).
An opposing view, known in the literature as empirical Bayes (Berger, 1985), advocates using empirical facts about the various hypotheses, derived from a direct experiment, to assign the prior probabilities. This approach has become quite commonplace in cognitive science, but Feldman argues that such an attempt to “optimize” priors empirically runs counter to the principles and philosophy of Bayesian inference. Such a procedure applied in the coin-tossing example would involve flipping the coin a number of times beforehand and using the observed ratio of heads to tails as the “true prior nature” of the coin. Feldman uses simple Monte Carlo simulations of classifier performance to demonstrate that this kind of prior determination is ineffective, and that using priors with greater entropy—and thus less information—leads to improved classifier performance. This shows that “mere tuning does not optimize performance,” where performance of the learner in this context means classification performance from a certain training set. Feldman offers suggestions toward a compromise, by assigning prior probabilities from a continuum between the empirically determined and the generally guessed. This kind of study could be profitably connected to more theoretical results (Haussler, Kearns & Schapire, 1994) showing the dependence of the Bayesian learning curve on the accuracy of the assumed prior distribution; it has been established that the learning curve—quantified as the probability of a mistaken prediction as the learning sample expands—is optimal when the assumed prior equals the true prior (where the true prior is defined purely internally to the simulation).
3. The universal prior
Solomonoff (1964a) proposed another solution to the problem of the Bayesian priors—the universal prior distribution. His framework of learning as predicting the next symbol in the example sequence was subsequently developed into a workable scheme by Li and Vitányi (1993, 2008), who showed that although the universal prior distribution may be uncomputable, it is approximated well by using a universal semimeasure M for a probability distribution over all hypotheses—and this can be approximated, in the limit, by a computable process. The mathematical details are not for the faint-hearted, but in essence we can view the distribution defined by M as a mixture of hypotheses which assigns greater weights to the “simplest” ones (Li & Vitányi, 2008). The interesting thing about this methodology is that the universal semimeasure is defined in terms of the complexity of hypotheses, specifically quantified as the Kolmogorov complexity measuring the degree of randomness in a sequence. The Kolmogorov complexity of the learning sample can be measured as the length of the shortest computer program which describes the examples (in a given language, though the choice of language can be ignored for theoretical purposes, just as the choice of computer can be ignored in time- or memory-complexity analyses of computer algorithms). This aspect brings in the ancient principle of Occam's razor—find all hypotheses that fit the data, and then make predictions by applying the simplest explanation—which is often connected to Bayesian learning.
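The flavor of a simplicity-weighted prior can be illustrated by substituting compressed length for Kolmogorov complexity. This is only a crude, computable proxy (the true quantity is uncomputable), and the example strings below are invented:

```python
import zlib

# Simplicity-weighted prior with compressed length as a crude, computable
# proxy for Kolmogorov complexity (which is itself uncomputable).
def description_length(hypothesis: str) -> int:
    return len(zlib.compress(hypothesis.encode()))

def simplicity_prior(hypotheses):
    # Weight each hypothesis by 2^(-description length), then normalize.
    weights = [2.0 ** -description_length(h) for h in hypotheses]
    total = sum(weights)
    return [w / total for w in weights]

# A highly regular string compresses well and so receives more prior
# weight than an irregular string of similar length.
regular = "ab" * 50
irregular = "the quick brown fox jumps over the lazy dog, twice as sly"
prior = simplicity_prior([regular, irregular])
```

The normalized weights behave like a prior distribution in which the "simplest" hypotheses, in the sense of shortest description, dominate.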
The current contribution from Anne Hsu, Nick Chater, and Paul Vitányi reviews the authors’ recent formal results in the preceding framework, concerning the possibility of learning natural languages from positive example sequences. This question is arguably among the most frequently debated cognitive issues in linguistics. The paper reviews findings showing the learner can learn successfully, according to various criteria, from positive evidence by favoring the grammar that provides the simplest encoding of the linguistic input.
4. Identification in the limit
Gold's (1967) framework of identification in the limit (i.i.l.) largely initiated the study of formal learning theory. In this scheme, we are given an unending sequence of examples (either all positive, or both positive and negative) and the task is to infer a function (language, grammar, etc.) from a predetermined hypothesis class. The learner does not have to recognize its own success—the criterion for successful learning is convergence of the learner's hypotheses to the single correct target in the limit of the example sequence. The key parameters in applications of this paradigm are the hypothesis class and the specific method of induction from the examples. Although not conceived as a probabilistic learning scheme, i.i.l. has been shown (Li & Vitányi, 2008) to be equivalent to a particular setup of a Bayesian learner.
A variation of the Gold paradigm that has been suggested on natural cognitive grounds is the model of a memory-limited learner. This is a learning algorithm which conjectures hypotheses based on the latest example received, together with the previous conjecture (Wexler & Culicover, 1980). It is useful to consider the contrast with the full Bayesian learner, which considers new data in the light of its current probability distribution over all hypotheses; the memory-limited learner, in contrast, remembers only its current “favorite” hypothesis and throws away all information about previous hypotheses. This general style of learning has also been developed into a formal definition of a U-shaped learner; this is a learning algorithm that goes from correct conjectures to incorrect ones and then back to the correct conjecture as the example sequence is received (Carlucci, Case, Jain & Stephan, 2007). Cognitive science researchers may find this familiar, since natural language learning, among other things, often proceeds along a U-shaped learning curve in the natural setting. The current contribution from Lorenzo Carlucci and John Case provides a timely overview of the formal theory of U-shaped learning and its technical requirements, as well as its features relevant to cognitive science. The paper also successfully defends identification in the limit (and variations) as a valuable model relevant to the study of learning within cognitive science.
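A toy instance shows both ideas at once. The hypothesis class Ln = {1, 2, ..., n} is a standard textbook example of a class identifiable in the limit from positive data; the particular example stream below is invented:

```python
# Toy identification in the limit by a memory-limited learner.
# Hypothesis class (a standard textbook example): L_n = {1, 2, ..., n}.
# Conjecturing L_m, where m is the largest example seen so far, identifies
# the target from positive data alone. Note that each new conjecture
# depends only on the previous conjecture and the latest example, as in a
# memory-limited learner, not on the whole history of examples.

def learn(stream):
    conjecture = 0                    # index n of the conjectured L_n
    history = []
    for example in stream:
        conjecture = max(conjecture, example)
        history.append(conjecture)
    return history

# An invented positive-example text for the target language L_5; once the
# largest element of the target has appeared, the conjecture never changes.
history = learn([2, 1, 5, 3, 5, 4, 5])
```

Convergence of the final conjectures, not any announcement of success by the learner, is what the criterion requires.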
5. Complexity of learning
In the models considered so far, the most basic criterion of successful learning is simply to converge upon the target hypothesis. In most interesting applications, there are a very large number of competing hypotheses, so it may not be generally feasible to converge to the target precisely. Practically feasible learning faces at least two constraints: that successful learning occurs given the amount of data available to the learner (e.g., the amount of linguistic input received by the child); and that the calculations involved in such learning are not explosively complex.2
The idea behind the probably approximately correct (PAC) learning framework first introduced in Valiant (1984) is to require feasibility of the learning, and to achieve this we are allowed to converge to some reasonable approximation of the target hypothesis with reasonably high probability, using feasible computational resources. To formulate these requirements coherently, it is assumed that examples are drawn according to some fixed but unknown probability distribution over the space of examples. Then we can define PAC-learnability of a concept class (possibly comprising functions, languages, etc.) as being fulfilled when we have a computationally tractable learning algorithm such that for each element c in the class and each probability ε, the algorithm converges with probability at least 1 − ε to some element which disagrees with c only on a set of examples having total probability less than ε (Li & Vitányi, 2008).
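A classic toy instance of this setting is learning a threshold concept on the unit interval, with the uniform distribution standing in for the fixed but unknown one. All specifics below (the threshold, sample sizes, and seed) are invented for illustration:

```python
import random

# Toy PAC setting: the target concept labels x positive iff x <= theta,
# examples are drawn i.i.d. from the uniform distribution on [0, 1], and
# the learner outputs the largest positive example it has seen.

def pac_learn(theta, m, rng):
    sample = [rng.random() for _ in range(m)]
    positives = [x for x in sample if x <= theta]   # labels from the target
    return max(positives, default=0.0)              # hypothesis threshold

rng = random.Random(0)
theta = 0.6
# The disagreement region is the interval between the hypothesis and theta;
# its probability mass (the error) shrinks as the sample grows.
errors = [theta - pac_learn(theta, m, rng) for m in (10, 100, 1000)]
```

The hypothesis is approximately correct (small error) with high probability over the draw of the sample, which is exactly the two-fold relaxation PAC permits.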
Although PAC learning usually requires that the learning algorithm should run in polynomial time depending on the size of the provided example sequence, formal learnability results are often quite abstract and pay little heed to the practical fact that real human learning must be computationally feasible in some way. Alex Clark and Shalom Lappin's contribution identifies two problems, related to computational complexity, with much conventional literature on language learnability. One problem is the propensity for learnability arguments to conflate or confuse the hypothesis class with the “learnable concept” class. The hypothesis class is the class of grammars (each defining a language) that the learner can conjecture, given some input; the learnable concept class is the (potentially very much smaller) subset of languages on which the learner can identify the language in the limit. Assuming that the learner is acquiring the language from prior generations of learners who successfully learned the language in the limit, then the language to be learned must lie within this restricted set. Nonetheless, the learner need not explicitly restrict all of its conjectures to this narrower set of languages—indeed, in some cases this is provably not possible, even where the learner succeeds. This confusion can lead theorists to draw excessively pessimistic conclusions about learnability, and to overstate the importance of explicit restrictions on the hypotheses the learner conjectures. This brings us to the second problem addressed by Clark and Lappin, namely the computational complexity of the learning algorithm. In language acquisition, they argue, the key focus should be the actual method of inducing a hypothesis from input data. 
Consideration of the complexity constraints and requirements of such induction procedures will then suitably guide our investigations toward the best procedures for settling on a compatible hypothesis given the input, and perhaps will ultimately be more informative than would figuring out the mathematical properties of a theoretically learnable language class. Notwithstanding the focus on language learning in the article, all of the authors’ arguments apply to any cognitive learning problem that could be formalized.
Apart from the complexity of learning algorithms, the differing representational complexity of concept classes—syntactic grammars/languages versus phonological grammars—is the main issue addressed in the contribution from Jeff Heinz and Bill Idsardi. The authors first establish that no phonological process in natural language is generally agreed to require complexity greater than is representable in a regular formal language (referring to the Chomsky hierarchy)—in fact, it is shown that the subregular classes of strictly local and strictly piecewise languages can represent phonological processes. It has, of course, long been established that recursive processes in the syntax of natural languages often require greater complexity and may even require the representational power of a context-sensitive language. The authors then argue that the best explanation for this complexity difference is that phonology is learned by a different cognitive mechanism from that applied to syntactic learning. The differences in the constraints on learning, it is argued, can best explain both what phonology allows and what it does not allow.
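The flavor of a strictly local grammar can be conveyed with a small sketch. The alphabet and the permitted bigrams below are invented for illustration, not drawn from any actual phonology:

```python
# A strictly 2-local (SL2) grammar: a word is well formed iff every
# adjacent pair of symbols (with '#' marking word edges) belongs to a
# finite permitted set. Here the invented pattern bans adjacent 'b b'.

PERMITTED = {
    ('#', 'a'), ('#', 'b'),
    ('a', 'a'), ('a', 'b'), ('b', 'a'),
    ('a', '#'), ('b', '#'),
}

def sl2_accepts(word):
    padded = ['#'] + list(word) + ['#']
    return all(pair in PERMITTED for pair in zip(padded, padded[1:]))
```

Because membership depends only on a fixed window of adjacent symbols, such patterns sit well below the full power of regular languages, which is the kind of restriction Heinz and Idsardi exploit.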
6. Semi-supervised learning
Most learning models about which there is any degree of theoretical study are supervised in the sense defined earlier, meaning that the learning data consist entirely of labeled examples (or are drawn from a single target concept and thus have a single default label), and the induction process essentially learns how to generalize and account for further data. The opposite approach, in which none of the examples are labeled but we expect them to be drawn from different groups or some sort of unknown structure, is known as unsupervised learning. Here, the usual goal is to extract “hidden structure” from the raw data, but without any labeling it is impossible to gauge success of the learning—we literally do not know what we are doing. An interesting combination of these two extremes is provided by the framework of semi-supervised learning, in which unlabeled examples are provided alongside the usual labeled ones in a learning task such as classification, and the learner somehow makes use of the unlabeled examples to augment the information provided by the labeled set.
In cognitive science, category learning is an area that seems especially ripe for the application of semi-supervised learning since, as stated in the current contribution from Bryan Gibson, Timothy Rogers, and Jerry Zhu, “in the real world, category learning is not fully supervised.” In their paper, one finds an overview of three standard cognitive models for category learning, which are generally supervised models: the exemplar model, in which all training examples are remembered for later use; the prototype model, in which each category is represented as a prototype having a set of parameter values; and the rational models, a family of compromises between the preceding extremes allowing the learning representations to vary from individual items (as in exemplar models) to summary representations (as in prototype models) as determined by the data. The paper also describes three common supervised models used in the purely practical enterprise of machine learning: kernel density estimation, Gaussian mixture models, and Dirichlet process mixture models; the goal is to show how semi-supervised techniques for the machine learning models can be applied to evaluate the cognitive learning models.
Modifying a supervised learning model so that it makes use of unlabeled data is called lifting the model into semi-supervised learning. Unlabeled data cannot be used directly, as they lack labels, but they provide extra information about P(D), which is one expression for the denominator (Bayesian evidence) in the Bayes formula. When the lifting assumptions of the framework are valid, semi-supervised models have been shown to converge more quickly than corresponding supervised models which make no use of unlabeled data. One common semi-supervised learning assumption is the mixture model assumption, which assumes all data are drawn from a mixture model. This assumption can lead to semi-supervised versions of the kernel density, Gaussian mixture, and Dirichlet process mixture models, which are in turn demonstrated by the authors to be mathematically equivalent to the three respective cognitive learning models, viz. exemplar, prototype, and the rational model proposed by Anderson (1990).
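A minimal sketch of such a lifting, for a two-component Gaussian mixture with fixed unit variances, runs as follows; all details (the data, mixing weights, and initialization) are illustrative assumptions, not the authors' models:

```python
import math
import random

# "Lifting" a supervised two-class Gaussian classifier to semi-supervised
# learning: labeled points keep fixed responsibilities, while unlabeled
# points receive soft responsibilities (an EM-style scheme).

def responsibilities(x, means, mix):
    dens = [p * math.exp(-0.5 * (x - m) ** 2) for m, p in zip(means, mix)]
    z = sum(dens)
    return [d / z for d in dens]

def semi_supervised_fit(labeled, unlabeled, means, mix=(0.5, 0.5), iters=25):
    for _ in range(iters):
        # E-step: labeled responsibilities are clamped to the given label.
        resp = [[1.0 if y == k else 0.0 for k in (0, 1)] for _, y in labeled]
        resp += [responsibilities(x, means, mix) for x in unlabeled]
        xs = [x for x, _ in labeled] + list(unlabeled)
        # M-step: re-estimate each mean from all points, labeled or not.
        means = [sum(r[k] * x for r, x in zip(resp, xs)) /
                 sum(r[k] for r in resp) for k in (0, 1)]
    return means

rng = random.Random(1)
labeled = [(rng.gauss(0.0, 1.0), 0), (rng.gauss(4.0, 1.0), 1)]   # one each
unlabeled = [rng.gauss(0.0, 1.0) for _ in range(50)] + \
            [rng.gauss(4.0, 1.0) for _ in range(50)]
means = semi_supervised_fit(labeled, unlabeled, means=[1.0, 3.0])
```

With only one labeled example per class, the many unlabeled points shape the estimate of P(D) and pull the component means toward the underlying clusters.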
In machine learning, the motivation for lifting to the semi-supervised regime is economic; in cognitive science, the motivation is realism. Gibson et al. go on to review empirical evidence that these models describe important aspects of human category learning. Studies are reviewed which show that not only is human category learning performance influenced by the distribution of unlabeled examples, but the influence can be strong enough to override the information gathered from labeled examples. The evidence from the studies does generally indicate that both the labeled and unlabeled examples are used for induction by the human subjects.
Model-fitting experiments were also conducted, in which each type of semi-supervised model was fit to empirical findings, while a single chosen parameter of each model was varied. Each model shows behavior grossly consistent with human categorization behavior under some parametrizations. To get a better quantitative picture, the empirical data were divided into training and test portions so that the models could be better evaluated on the basis of off-training set behavior—always the holy grail of model selection. Under this procedure, the semi-supervised lifting of Anderson's (1990) rational model provides the best fit with the empirical findings.
This contribution distinguishes itself from much of the cognitive science literature by providing complete algorithms for the learning methods discussed, as opposed to vague outlines. The authors admit, however, that their general conclusions are preliminary, and they suggest some useful methodological changes for future research. Their extensive survey discusses further findings about human learning strategies and also addresses some problems with extending semi-supervised learning to higher dimensional stimulus spaces.
We believe that the papers in this topic provide an excellent illustration of the power of formal methods for understanding fundamental problems of learning that lie at the heart of cognitive science, and we hope that these papers will help inspire other cognitive scientists working on learning, as well as mathematicians and computer scientists working on formal results, to focus on the exciting formal challenges raised by the cognitive science of learning. One of the most brilliant and innovative figures in this field, Partha Niyogi, was an inspiration to both the topic editors and most of the contributors. He combined dazzling technical insight with a deep understanding of the core issues in the field of machine and human learning. His untimely death in 2010 shocked the community of formal learning research; the editors and contributors would like to dedicate this entire topic to his memory.
This is the subjective interpretation of probability, for which there are a variety of justifications. Probability theory also has a number of other interpretations, for example, concerning limiting frequencies of repeatable events, which do not concern us here.
Typically, this is interpreted as requiring that the time and space requirements of any learning algorithm are polynomial rather than exponential or worse in the size of the set of learning examples. Even this extremely modest criterion is violated by general-purpose Bayesian methods, such as those described above.