Toward Computational Models of Multilingual Sentence Processing

Although computational models can simulate aspects of human sentence processing, research on this topic has remained almost exclusively limited to the single-language case. The current review presents an overview of the state of the art in computational cognitive models of sentence processing, and discusses how recent sentence-processing models can be used to study bi- and multilingualism. Recent results from cognitive modelling and computational linguistics suggest that phenomena specific to bilingualism can emerge from systems that have no dedicated components for handling multiple languages. Hence, accounting for human bi-/multilingualism may not require models that are much more sophisticated than those for the monolingual case.


Introduction
Computational modeling has been fundamental to the cognitive sciences and continues to be one of the most valuable methods for studying human cognition, including language processing. Research in the psycholinguistics of bilingualism, too, has been advanced by computational models, some of the best known or most recent examples being the BIA+ model of bilingual word recognition (Dijkstra & Van Heuven, 2002), the Multilink model of word translation (Dijkstra, Wahl, Buytenhuijs, & Van Halem, 2019), the SOMBIP model of mapping between phonology and semantics in the bilingual mental lexicon (Li & Farkaš, 2002), and its successor, the DevLex-II model of crosslinguistic lexical priming (Zhao & Li, 2013).
In stark contrast to this wide array of models of bilingual lexical development, representation, or processing, there are very few sentence-level models of bilingualism. Although formal or verbal sentence-level theories do exist (e.g., Amaral & Roeper, 2014; Hartsuiker & Bernolet, 2017; Sharwood Smith & Truscott, 2014), there is a profound lack of computationally specified and fully implementable (let alone implemented) models.
Because of the immense complexity of the human language system, details about its functioning can only be understood by capturing them in implemented computational models. Unlike "mere" verbal theories, specifying a model forces one to bring hidden assumptions out in the open, and to do away with any ambiguity or vagueness in the theory. Successful models can reveal unexpected emergent phenomena and thereby help make sense of empirical findings. Hence, developing such models is critical to advance the field. This is especially true for the bilingual case because, as bilingualism researchers are well aware, "the bilingual is not two monolinguals in one person" (Grosjean, 1989, p. 3): The bilingual system is even more complex than the monolingual system because of the ways languages can interact, not only with each other but also with other aspects of the cognitive system. However, little is understood of the mechanisms underlying these interactions. A successful computational model of bilingual or multilingual sentence processing can help fill this gap in our understanding and thereby provide answers to many concrete questions at the frontier of current multilingualism research: What is the effect of bilingualism on native-language processing (Cop, Keuleers, Drieghe, & Duyck, 2015)? Do predictive processes differ between L1 and L2, and if so, how (Kaan, 2014)? How does increasing L2 proficiency or exposure affect interaction between languages (Morett & MacWhinney, 2013; Whitford & Titone, 2012)? How are individuals' cognitive characteristics borne out in their L2 processing (Hopp, 2015; Linck, Osthus, Koeth, & Bunting, 2014)? Why are code-switches not distributed randomly over utterances (Green & Wei, 2014)?
The near nonexistence of computational models of bi-/multilingual sentence processing is particularly remarkable considering the current success of computational psycholinguistic models in the monolingual domain. In what follows, I will first review some of the current work in single-language sentence processing models. Next, I discuss how these models could be (and, occasionally, have been) extended to handle two languages simultaneously.
The final section speculates about what it would take to create truly multilingual sentence-processing models and what these might teach us about human multilingualism.

Computational Models of Sentence Processing

Probabilistic Processing and Word Surprisal
In the field of psycholinguistics, it is no longer a controversial idea that statistical patterns from the linguistic environment come to play an important role in the cognitive system. A considerable part of language acquisition comes down to learning (co-)occurrence frequencies of linguistic units, 1 and much of language processing is the application of the learned statistics. This view dovetails well with recent thinking about prediction during language comprehension because knowledge of language statistics can give rise to probabilistic expectations of upcoming language input. Although there is considerable debate about the precise role of linguistic prediction (Huettig, 2015; Kuperberg & Jaeger, 2016), it is clear that, at least sometimes and to some extent, people anticipate future input. If linguistic prediction is probabilistic (i.e., statistical), it can be formalized and quantified using concepts from information theory. The most successful of these information-theoretic measures is surprisal: the negative logarithm of a word's occurrence probability given the (linguistic) context. The linguistic unit of analysis here does not need to be a word but can be of any size; however, surprisal has most often been investigated at the word level. A word's surprisal is lower if the word is more likely to occur in the current context. To give a simple example: The word "gentlemen" is the most likely continuation of "ladies and" but not of "women and"; consequently, the surprisal of "gentlemen" is lower in the context of "ladies and" than after "women and".
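To make the definition concrete, the computation can be sketched in a few lines of Python; the bigram probabilities below are invented purely for illustration, not estimated from any corpus.

```python
import math

# Invented probabilities P(word | context), for illustration only.
next_word_prob = {
    ("ladies and", "gentlemen"): 0.70,
    ("women and", "gentlemen"): 0.01,
}

def surprisal(context, word):
    """Surprisal in bits: the negative (base-2) logarithm of the
    word's occurrence probability given its context."""
    return -math.log2(next_word_prob[(context, word)])

print(surprisal("ladies and", "gentlemen"))  # about 0.51 bits
print(surprisal("women and", "gentlemen"))   # about 6.64 bits
```

As expected, the more probable continuation carries the lower surprisal. Any logarithm base may be used; base 2 expresses surprisal in bits.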
A word's surprisal has been argued to correspond to the amount of cognitive effort required to process the word (Hale, 2001; Levy, 2008), a claim that has seen ample validation: Word reading times increase linearly with word surprisal (Goodkind & Bicknell, 2018; Smith & Levy, 2013); surprisal predicts neural activity as measured by electroencephalography (EEG; Frank, Otten, Galli, & Vigliocco, 2015), magnetoencephalography (MEG; Wehbe, Vaswani, Knight, & Mitchell, 2014), and functional magnetic resonance imaging (fMRI; Willems, Frank, Nijhof, Hagoort, & Van den Bosch, 2016); and, crucially, learning more accurate language statistics generally results in a stronger fit to human processing data (Aurnhammer & Frank, 2019; Goodkind & Bicknell, 2018). Word surprisal has also been shown to affect sentence production: When different syntactic structures are possible, writers tend to choose the one with lower surprisal (Rajkumar, Van Schijndel, White, & Schuler, 2016); and in speech, words with higher surprisal are pronounced more slowly (Demberg, Sayeed, Gorinski, & Engonopoulos, 2012) and are more likely to be preceded by a disfluency (Dammalapati, Rajkumar, & Agarwal, 2019). 2

Surprisal theory is specified at the so-called "computational level of analysis" (Marr, 1982), which is to say that it is merely a function that relates probability to cognitive effort, without specifying any mechanism that gives rise to the probabilities (or to cognitive effort, for that matter). However, to apply surprisal theory, one needs an implemented language model that estimates the word occurrence probabilities. Such a model will be specified at the "algorithmic level," where representations and mechanisms are assigned that are appropriate to the task at hand. Probabilistic language models can differ widely in their algorithmic-level descriptions.
The two general model classes that have been most influential in psycholinguistics are probabilistic phrase-structure grammars and neural networks, which will be discussed in turn below. 3

Probabilistic Grammars and Incremental Parsing
It is traditionally assumed that the first (or, at least, an important) step toward understanding a sentence is incremental parsing: the word-by-word construction of its syntactic tree structure (Frazier & Rayner, 1982, among many others). When the sentence is not yet complete (and sometimes even when it is), it is usually consistent with a range of possible tree structures. In the probabilistic view on parsing, we do not select just one of these but generate many (if not all) possible structures, together with their probabilities. When the next word of the sentence comes in, it needs to be incorporated into the set of structures being considered, leading to changes in (the probabilities of) the structures. Levy (2008) showed how this amount of change gives rise to the word's surprisal. Hence, surprisal provides a direct connection between probabilistic incremental parsing and probabilistic predictive processing.
The question remains where all these probabilities come from. Put simply, the probability of a tree structure follows from the probabilities of all the production rules (also known as rewrite rules) that were involved in the structure's derivation. The rules' probabilities (and, most often, the rules themselves) are extracted from a corpus of sentences annotated with their syntactic tree structures. This collection of rules and their probabilities forms a probabilistic phrase-structure grammar. Hale (2001) and Levy (2008) substantiate their claim that surprisal is a measure for cognitive processing effort by demonstrating that, under a probabilistic grammar, surprisal predicts reading times that match certain psycholinguistic phenomena such as garden-path effects. A garden-path effect occurs when a critical word is read more slowly if it syntactically disambiguates a locally ambiguous context than when it appears in an unambiguous context. For example, it is well known that the main verb ("lost") is read more slowly in sentence (2) than in sentence (1), because in (2) it disambiguates between the two possible structures of "The doctor sued for damages" (Rayner, Carlson, & Frazier, 1983).
1. The doctor who was sued for damages lost the lawsuit.
2. The doctor sued for damages lost the lawsuit.
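The arithmetic behind tree probabilities can be sketched as follows: a derivation's probability is the product of the probabilities of the rewrite rules applied in it. The grammar fragment below is a toy with invented rule probabilities, not a grammar from the literature.

```python
# Toy probabilistic phrase-structure grammar: each rewrite rule has a
# probability, and probabilities of rules sharing a left-hand side sum
# to 1. All numbers are invented for illustration.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("Det", "N", "RC")): 0.3,  # NP containing a relative clause
    ("VP", ("V", "NP")): 1.0,
}

def tree_prob(rules_used):
    """Probability of a derivation: the product of the probabilities
    of all rewrite rules applied while building the tree."""
    p = 1.0
    for rule in rules_used:
        p *= rule_prob[rule]
    return p

# Derivation of a simple transitive sentence without a relative clause:
simple = [("S", ("NP", "VP")), ("NP", ("Det", "N")),
          ("VP", ("V", "NP")), ("NP", ("Det", "N"))]
print(tree_prob(simple))  # 1.0 * 0.7 * 1.0 * 0.7 = 0.49
```

An incremental parser maintains such probabilities for every structure consistent with the input so far; surprisal then measures how much an incoming word redistributes this probability mass.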
Simulations using probabilistic grammars in English (Hale, 2001; Levy, 2013; Van Schijndel & Linzen, 2018) and Dutch (Brouwer, Fitz, & Hoeks, 2010) have shown that a word's surprisal is indeed higher (corresponding to longer reading times) when it disambiguates than when it does not. Moreover, the success of probabilistic grammars goes beyond accounting for the reading time effects of particular experimental manipulations. Boston, Hale, Patil, Kliegl, and Vasishth (2008) and Demberg and Keller (2008) were the first to correlate word surprisal values (estimated by probabilistic grammars) to word reading times across sentences in German and English, respectively, that were not constructed for any particular experiment. Higher surprisal correlated with longer reading times, after factoring out more superficial variables such as word length, frequency, and probability given the previous word. This finding has been replicated many times since, at least for English, using different grammars and sets of reading time data (e.g., Monsalve, Frank, & Vigliocco, 2012; Roark, Bachrach, Cardenas, & Pallier, 2009; Van Schijndel & Schuler, 2015). More recent work on grammar-based surprisal has mostly focused on relating it to neural activity (Brennan & Hale, 2019; Brennan, Stabler, Van Wagenen, Luh, & Hale, 2016; Frank et al., 2015; Shain, Blank, Van Schijndel, Schuler, & Fedorenko, 2020), often resulting in novel insights into the different brain processes and areas involved in language comprehension.
Note that this very short exposition barely scratches the surface of a large and multifaceted research field. For one, there exist many different types of grammar as well as many ways to relate aspects of sentence structure to observed cognitive processing difficulty (for a recent and more comprehensive review, see Demberg & Keller, 2019). As I will discuss next, several of the psycholinguistic results from probabilistic grammars can also be obtained using models that are not based on syntactic parsing.

Recurrent Neural Networks
Inspired by biological neural networks, artificial neural networks simulate information processing as the flow of activation through a large number of simple numerical processing units (or "neurons") that are connected to one another into a network (as shown in Figure 1). A unit's activation is a simple function of the total activation it receives from the units it is connected with. Each connection has a weight that determines how strongly the "sending" unit affects the activation of the "receiving" unit. Weights may be negative, so that one unit deactivates another. The connection weights are adapted to make the network perform a particular task. This is most often done by so-called supervised training, in which the network receives examples of inputs and the corresponding desired (target) outputs. For instance, a sentence comprehension network would receive sentences as input and their semantic representations as target output. Its connection weights are then adapted such that the network (approximately) generates the desired semantic representation for each example sentence. To the extent that it successfully learned a general sentence-to-meaning mapping, the network is then able to generate semantic representations of novel sentences.
The most popular neural network architecture for cognitive models of incremental language processing is the recurrent neural network (RNN; Elman, 1990). As can be seen in Figure 1, an RNN is fitted with feedback connections through which it receives its own previous internal activations, giving the model a decaying "short-term memory." This allows it to process input that comes in over time. If the model simulates word-by-word sentence comprehension, for example, each incoming word is interpreted in the context of the network's memory of the entire sentence so far.
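The recurrent update can be sketched in a few lines of Python. The network below is untrained (random weights) and its layer sizes are arbitrary; it only illustrates how each new input is combined with the network's decaying memory of previous inputs.

```python
import math
import random

random.seed(0)
N_IN, N_HID = 3, 4  # arbitrary sizes for the sketch

# Untrained connection weights: input-to-hidden, and the recurrent
# hidden-to-hidden feedback connections. Negative weights let one unit
# deactivate another.
W_in = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HID)]
W_rec = [[random.uniform(-1, 1) for _ in range(N_HID)] for _ in range(N_HID)]

def step(x, h_prev):
    """One time step: each hidden unit's activation is a simple
    (sigmoid) function of its weighted input plus the weighted
    previous hidden state, the network's short-term memory."""
    h = []
    for i in range(N_HID):
        total = sum(W_in[i][j] * x[j] for j in range(N_IN))
        total += sum(W_rec[i][j] * h_prev[j] for j in range(N_HID))
        h.append(1.0 / (1.0 + math.exp(-total)))
    return h

# Process a three-"word" sequence, one one-hot coded word at a time:
h = [0.0] * N_HID
for word in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    h = step(word, h)
print(h)  # the final hidden state reflects the entire sequence
```

Because each hidden state feeds into the next, the final state depends on the whole input sequence, which is what allows word-by-word interpretation in context.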
Neural networks offer a perspective on the human language system that is very different from, and perhaps less linguistically informed than, grammar-based theories (see also Frank, Monaghan, & Tsoukala, 2019). Nevertheless, currently the most successful practical applications of language processing by machines (such as automatic translation) are based on neural network technology.

RNNs for Next-Word Prediction
The most common use of RNNs is as next-word prediction models, as originally proposed by Elman (1990). In this case, the network receives a collection of sentences or texts as training material, and at each point it is trained to predict the upcoming word. The network can only perform well on this task to the extent that it has discovered relevant aspects of the language's statistical structure. However, even with perfect knowledge of the language, predicting the correct next word is usually impossible. The network's output is thus not a single predicted word but a probability distribution over word types, that is, each word type has an estimated probability that it will be the next word. This provides an easy computation of the upcoming word's surprisal, which equals the negative logarithm of its probability estimate.
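The output side of such a model can be sketched as follows: raw output-unit activations are normalized into a probability distribution (here via a softmax), and surprisal is read off as the negative log probability. The vocabulary and the activation scores are invented for illustration.

```python
import math

def softmax(scores):
    """Normalize raw output activations into a probability
    distribution over the vocabulary."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output activations over a five-word vocabulary after
# some context; both the words and the numbers are made up.
vocab = ["gentlemen", "men", "children", "dogs", "the"]
scores = [3.0, 1.5, 0.5, -1.0, -2.0]
probs = softmax(scores)

def surprisal(word):
    """Surprisal (in bits) of a candidate next word."""
    return -math.log2(probs[vocab.index(word)])

print(surprisal("gentlemen"))  # low: an expected continuation
print(surprisal("dogs"))       # high: an unexpected continuation
```

The distribution sums to 1, so the model need not commit to a single predicted word; every candidate simply receives a probability, and hence a surprisal value.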
Early RNN models could only be trained on small, artificial languages, but recent technological developments allow training on millions of sentences from text corpora, using RNN variants that are more sensitive to long-range linguistic dependencies. The resulting surprisal values predict certain psycholinguistic phenomena that are traditionally thought to reveal structural syntactic processing, such as garden-path effects in Dutch and English (Frank & Hoeks, 2019; Futrell et al., 2019; Van Schijndel & Linzen, 2018). Also, like grammar-based surprisal, RNN surprisal accounts for word-reading times (Goodkind & Bicknell, 2018; Monsalve et al., 2012) and brain activity (Brennan & Hale, 2019; Frank et al., 2015; Wehbe et al., 2014) across naturalistic sentences in English.

RNNs for Sentence Comprehension
Compared to next-word prediction, the construction of a semantic representation is more central to the goal of sentence comprehension. Only a few RNN models have recently been proposed for mapping sentences to representations of the meaning expressed. Because of the difficulty of representing realistic semantics, all these RNN comprehension models are limited to hand-crafted, miniature languages, usually modeled on English. These models take as input a sentence, presented one word at a time, and generate as output some representation of the sentence's meaning. Where they differ is in how they represent meaning: Propositional structures (Brouwer, Crocker, Venhuizen, & Hoeks, 2017; Hinaut & Dominey, 2013) identify the agent, patient, and action of a given sentence, that is, they represent the semantic roles and concepts that fill those roles. Situation vectors (Frank & Vigliocco, 2011; Venhuizen, Crocker, & Brouwer, 2019) represent the state of affairs in the world as described by the sentence, without any internal role-concept structure. Sentence gestalts (Rabovsky, Hansen, & McClelland, 2018; based on a classical model by McClelland, St. John, & Taraban, 1989) are developed by the neural network itself during training. Unlike propositional structures and situation vectors, they are not designed in advance and, consequently, a sentence gestalt is not directly interpretable but can only be used as part of the network in which it arose.
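The contrast between these representation formats can be illustrated with two toy data structures for the meaning of "the woman gives the man a book". Both are invented simplifications, not the representations used in the cited models.

```python
# Propositional structure: explicit, directly interpretable
# role-concept pairs.
proposition = {
    "action": "GIVE",
    "agent": "WOMAN",
    "recipient": "MAN",
    "patient": "BOOK",
}

# Situation vector: a flat numerical encoding of the described state
# of affairs, with no internal role-concept structure (values made up).
situation_vector = [0.9, 0.1, 0.8, 0.0, 0.7, 0.2]

print(proposition["agent"])   # roles can be read off directly
print(len(situation_vector))  # a situation vector is just numbers
```

A sentence gestalt would resemble the second format, except that the meaning of its dimensions emerges during training rather than being fixed in advance.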
Most, if not all, RNN sentence comprehension models relate some measure of the "amount of change" in their semantic representation to cognitive processing difficulty. The propositional structure models by Brouwer et al. (2017) and Hinaut and Dominey (2013) take this measure to correspond to the well-known P600 EEG component, 4 which is often viewed as indicative of a sentence reinterpretation process. 5 The situation vector models by Frank and Vigliocco (2011) and Venhuizen et al. (2019) show that the amount of change in the network's output can be expressed in terms of word surprisal. Frank and Vigliocco further demonstrate that this predicts simulated word-processing time, that is, their model provides a mechanistic account of why higher surprisal leads to longer reading time. In Rabovsky et al.'s (2018) sentence gestalt model, the amount of change in the gestalt representation explains a wide range of N400 EEG effects from the literature, which are usually interpreted as indexing a word's unexpectedness or the difficulty of semantically integrating it with the earlier context.

RNNs for Sentence Production
There exist relatively few cognitive models of sentence production. By far the most influential is the Dual-Path RNN model by Chang (2002). As its name suggests, it assumes there are two processing pathways: The syntactic path takes care of sequencing words in the correct order, and the semantic path encodes the propositional structure (i.e., the role-concept pairs such as AGENT is WOMAN) of the to-be-expressed semantics. The Dual-Path neural network also has a layer of units with recurrent connections (as in Figure 1). Both the syntactic and the semantic pathway pass through this single layer, so that syntax and semantics can interact. As input, the Dual-Path model receives the proposition it is to convey as well as information about the desired verb tense and aspect. It then produces the corresponding sentence one word at a time. Each word that is produced is fed back into the network as input, so that subsequent words depend on what has been produced so far.
To learn a mapping from semantic representations to word sequences, the model is provided with a training set of target semantics paired with corresponding sentences. The target semantics is given to the network, which then starts to produce words. Each word is compared to the target output sentence, and if the produced word is incorrect, the network's connection weights are updated such that its output is more likely to be correct next time. This makes network training very similar to the RNN word prediction models of Section RNNs for Next-Word Prediction.
As was the case for the comprehension models discussed above, the Dual-Path model has only been applied to miniature languages, although these were based on a wider range of languages (e.g., English, German, and Japanese) than for other computational cognitive models. Moreover, the model's validity has been extensively evaluated against human language data, for example, from structural priming experiments (Chang, Dell, & Bock, 2006) and aphasic patients (Dell & Chang, 2014). The model's success in accounting for human sentence production behavior once again highlights the importance of statistical learning and processing to the language system. Hence, it makes sense to also take the statistical approach when moving from the monolingual to the bilingual case.

From Mono- to Bilingual Models
What are the desiderata for a bilingual sentence processing model? The statistician George Box famously said that "all models are wrong but some are useful" (Box, 1979, p. 202), although of course a model can be too wrong to be useful.

If our bilingual model is to be of any use and suggest answers to questions of interest, such as those mentioned in the introduction, it needs to:

1. simulate aspects of processing in at least two languages, possibly incorporating differences in proficiency and exposure between languages;
2. display one or more phenomena unique to bilingual processing, such as code switching, language transfer, or crosslinguistic structural priming;
3. account for the relevant empirical data (e.g., a code-switching model should display code-switching patterns similar to those observed in bilinguals);
4. be cognitively and/or neurobiologically reasonable (e.g., the long-term memory capacity required for learning two languages should not exceed the sum of memory capacities for the individual languages).
A model that successfully mimics a relevant phenomenon and accounts for the associated empirical data (i.e., that meets the second and third criteria above) displays what Jacobs and Grainger (1994) call descriptive adequacy: It describes some aspects of reality. Any model can be made descriptively adequate to some extent by introducing additional free parameters or mechanisms that serve no other purpose than to make the model display the desired behavior. In that case, descriptive adequacy is merely due to ad hoc assumptions and the model is said to have no explanatory adequacy (in Jacobs & Grainger's terminology), which is to say that it fails to explain the phenomenon. It goes without saying that models that lack explanatory adequacy are not very useful or interesting from a scientific perspective.
One way to reduce the risk that a bilingual model lacks explanatory adequacy is to construct it by building upon a monolingual model, such that the monolingual model is subsumed under the bilingual one. Such a model will reduce to the monolingual case if the L2 is absent, 6 which means that everything we have learned from the original monolingual model remains valid. For example, removing one language from the Bilingual Interactive-Activation (BIA) model of word recognition (Van Heuven, Dijkstra, & Grainger, 1998) reverts it to the foundational, monolingual IA model (McClelland & Rumelhart, 1981). Hence, to the extent that the BIA model simulates specifically bilingual phenomena, it actually provides an explanation of these phenomena because nothing was added to the IA model apart from a second language.

Bilingual Statistical Processing
As we have seen, currently the most successful monolingual sentence processing models are those that exploit the language's statistical properties. This may not be true for second language processing in cases where L2 proficiency is very low and the language is acquired by explicit instruction, so that comprehension or production proceeds via rote-learned translations into/from the L1 and morphosyntactic (de)composition is based on the conscious application of nonprobabilistic rules. In what follows, we focus on more fluent and proficient L2 processing, when knowledge and use of both languages are statistical in nature.
However, even when a bilingual's individual languages can be captured in statistical language models, this approach may not immediately seem suitable for modeling bilingualism itself, because keeping the two languages separate requires something stricter and more categorical than a probabilistic system can offer. How is a statistical system able to tell languages apart? There are at least two answers to this question. First, co-occurrences within a language are much more frequent than co-occurrences between languages. A statistical learner can quite easily pick up on this and thereby approximately categorize the languages, as also demonstrated in the very simple RNN model by French (1998), discussed briefly in Section Bilingual Recurrent Neural Networks below.
Second, the cognitive system is sensitive not only to linguistic statistical patterns but also to their interaction with the statistics of the world in general. Languages tend to be separated in the real world (by speakers, locations, situations, etc.), enhancing their separability and identifiability. Along the same lines, Pajak, Fine, Kleinschmidt, and Jaeger (2016) propose a framework according to which learning a second (or later) language is a process of probabilistic inference based not only on (perceived) similarities between languages but, crucially, also on "socio-indexical structure." That is, the distribution of speech/language varieties in the learner's environment provides information about similarities and differences between the varieties. This theory describes how learners develop a hierarchy of language, dialect, and idiolect knowledge (all with their unique, but also shared, statistics) and may account for patterns of transfer between languages.
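The first of these answers can be demonstrated in a few lines of Python: in a corpus where each sentence is monolingual, all word co-occurrence mass falls within, not between, languages. The miniature corpus below is invented; the "en_"/"nl_" prefixes only label language for inspection and would not be available to a real learner.

```python
from collections import Counter
from itertools import combinations

# Miniature bilingual corpus: every sentence is in one language only.
corpus = [
    ["en_the", "en_dog", "en_runs"],
    ["nl_de", "nl_hond", "nl_rent"],
    ["en_the", "en_cat", "en_runs"],
    ["nl_de", "nl_kat", "nl_rent"],
]

# Count within-sentence word co-occurrences.
cooc = Counter()
for sentence in corpus:
    for a, b in combinations(sentence, 2):
        cooc[frozenset((a, b))] += 1

# Split co-occurrence counts by whether the pair crosses languages.
within = sum(n for pair, n in cooc.items()
             if len({w[:2] for w in pair}) == 1)
between = sum(n for pair, n in cooc.items()
              if len({w[:2] for w in pair}) == 2)
print(within, between)  # prints: 12 0
```

With sentence-internal code-switching the between-language count would no longer be zero, but it would remain far smaller than the within-language count, and it is this asymmetry that a statistical learner can exploit.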

Bilingual Probabilistic Grammars
Research on technologies for automatic natural language processing has included a fair amount of work on inducing bilingual grammars. Many of these studies (e.g., Burkett & Klein, 2008; Saers, Addanki, & Wu, 2012) require a so-called "parallel corpus," which contains pairs of translation-equivalent sentences in both languages. Although this is clearly not a realistic way to learn two languages from a cognitive perspective, the resulting bilingual grammars could still form useful models of a bilingual adult's language knowledge. Other proposals do not depend on parallel corpora but can learn from corpora that differ not only in language but also in content (Cohen, Das, & Smith, 2011; Iwata, Mochihashi, & Sawada, 2010). These models can therefore more realistically capture a bilingual's language exposure and form more promising cognitive models, although in practice they are not developed or presented as such. It thus remains to be seen whether these models have much psycholinguistic import. They are currently only evaluated in terms of the adequacy of the induced grammars, and not as human language processing models. However, it should be possible to use the grammars in an incremental probabilistic parser in order to compute word surprisal values, which could then be compared to human data from (bilingual) sentence comprehension experiments. Perhaps a more fundamental issue for these models is the additional syntactic machinery they require in order to handle more than one language. Unlike many of the bilingual neural network models that will be discussed later, the bilingual grammar models are not simply monolingual models exposed to two languages. For instance, when the Burkett and Klein (2008) model induces the grammars of English and Chinese, it also learns explicit links between those parts of the languages that are believed to be translation equivalents. Likewise, in the Cohen et al.
(2011) model, language-specific syntactic categories are replaced by language-general categories before grammar learning can begin. Reliance on an architecture that is specially designed to allow bilingualism substantially lowers these models' cognitive plausibility as well as their explanatory adequacy.
It should of course be kept in mind that these bilingual grammar models were designed for technological applications and were therefore never intended to be cognitively realistic or evaluated against human processing data, so judging them by these standards may be a bit unfair. An alternative approach to obtaining a bilingual, incremental, probabilistic parser would be to start from an existing cognitive theory and to embody it in a computational implementation. For example, Multiple Grammar theory (Amaral & Roeper, 2014) claims that the language system has access to a large number of mutually incompatible "sub-grammars" to deal with (apparent) irregularities in the language. Bilinguals have sub-grammars for both languages, with tags that indicate the language to which a sub-grammar belongs. Hence, there is no qualitative difference between mono- and bilinguals. This theory is not defined in probabilistic terms but, as the authors suggest, the sub-grammars could have frequency-based activation levels (along the lines proposed by Truscott, 2006), which would be a first step toward a full, probabilistic implementation. If the probabilistic sub-grammars can then be used by an incremental parser to obtain word surprisal values, the resulting model can be evaluated against human processing data in the same manner as is routinely done with standard, monolingual models. Currently, however, there are no implemented and cognitively validated bilingual probabilistic grammars.

Bilingual Recurrent Neural Networks

Bilingual Next-Word Prediction
Recurrent neural networks are routinely applied to next-word prediction in a single language. What happens when an RNN is trained on two languages simultaneously? Interestingly, this requires no architectural changes; the network does not even need to be told that its input comprises two different languages. Rather, it learns to distinguish them from the simple fact that words of the same language co-occur much more often than words from different languages. A notion of language identity (even if only approximate and implicit) is both learnable from the training data and very useful for next-word prediction; hence, the network will develop a separation between the languages. An early demonstration of this principle was provided by French (1998), who trained an RNN model on two simple, artificial languages that differed only in their vocabularies. All sentences had a three-word SVO structure, with all three words coming from the same language. During training, the input language would occasionally switch from one sentence to the next, but never within a sentence. The (rather predictable) outcome was that the network's internally developed word representations showed a clear separation between the two languages.
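The structure of training input in the style of French (1998) can be sketched as follows. The word forms and the switching probability are invented, but the key constraint is preserved: the language may change only between sentences, never within one.

```python
import random

random.seed(1)

# Two artificial lexicons that differ only in vocabulary (all word
# forms invented for the sketch).
lang_a = {"S": ["boy", "girl"], "V": ["sees", "lifts"], "O": ["ball", "toy"]}
lang_b = {"S": ["gavo", "pilu"], "V": ["tesa", "ruvo"], "O": ["miko", "dalu"]}

def generate(n_sentences, p_switch=0.1):
    """Generate a stream of three-word SVO sentences; the language
    may switch between sentences but never within a sentence."""
    stream, lang = [], lang_a
    for _ in range(n_sentences):
        if random.random() < p_switch:
            lang = lang_b if lang is lang_a else lang_a
        stream += [random.choice(lang[role]) for role in ("S", "V", "O")]
    return stream

stream = generate(1000)
print(stream[:6])  # the first two generated sentences
```

An RNN trained to predict the next word in such a stream has every incentive to separate the two vocabularies internally, since knowing the current language sharply narrows down the candidates for the next word.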
More recently, Frank, Trompenaars, and Vasishth (2016) used an RNN that was trained for next-word prediction on large corpora of Dutch and English text. The network computed surprisal values over Dutch and English sentences from a reading experiment with Dutch-English bilinguals. The experiment manipulated grammaticality of sentences with double-embedded structures and showed a difference in reading-time effects between the two languages, which the RNN-based surprisal approximately matched. Other instances of large-scale, bilingual word-prediction RNNs were developed for certain applications, in particular automatic translation (e.g., Auli, Galley, Quirk, & Zweig, 2013, for English-French and English-German), rather than for cognitive modeling.

Bilingual RNN Sentence Comprehension
There has been even less work on bilingual neural network models of sentence comprehension than of next-word prediction. Hinaut, Twiefel, Petit, Dominey, and Wermter (2015) showed that the RNN model by Hinaut and Dominey (2013; see Section RNNs for Sentence Comprehension) can successfully be trained simultaneously on two miniature languages modeled on English and French, without receiving any explicit information that the input comes from two languages. In fact, performance of the bilingual model in either language was similar to that of each of two monolingual networks. This result is particularly impressive because the English, French, and bilingual RNN models did not differ in their input and recurrent connection weights; only the weights going to the output units were trained on sentences from one or both languages. However, the authors do not present any evaluation against human data or interpretation in terms of human bilingual processing. Although research on bilingual comprehension by neural networks is clearly still in its infancy, the Hinaut et al. (2015) result suggests that there is no principled obstacle to an RNN model learning bilingual comprehension. There also seems to be no a priori reason to believe it will be particularly difficult for models based on semantic representations other than the propositional structures used by Hinaut and colleagues. The question remains, however, whether such models account for any human data that are relevant to bilingualism research and what, if anything, can be learned about human bilingualism from such models.

Language Learning 71:S1, March 2021, pp. 193-218

Bilingual RNN Sentence Production
Currently, the most successful bilingual sentence-processing models come from the domain of sentence production. Chang's (2002) Dual-Path model (see Section RNNs for Sentence Production) can quite easily be turned into a bilingual production model by simply training it on two languages simultaneously. It requires no architectural adaptations, apart from the minimal addition of language-control units that steer the network toward producing a sentence in one of the two languages. Janciauskas and Chang (2018) used the bilingual Dual-Path model to study the effects of L2 age of acquisition (AoA) and length of L2 exposure. They first trained the model on Korean-like sentences. After an amount of training that was varied to model differences in AoA, they then trained it to process sentences in an English-like language. The model mimicked human learners in that its L2 English production contained more errors with later AoA. However, the empirical finding that L2 performance never reaches native levels (except for very early AoA) could only be simulated when a "critical period" was built into the model by switching off learning in the syntactic pathway after some time, an addition that was not present in the original, monolingual model. Hence, when it comes to this particular finding, the bilingual Dual-Path model displays descriptive but not explanatory adequacy. Tsoukala, Frank, and Broersma (2017) used a similar approach to investigate why L1 Spanish speakers sometimes produce the wrong gender for L2 English pronouns, that is, confuse "he" and "she" (Antón-Méndez, 2010). A Spanish-English version of the bilingual Dual-Path model showed the same behavior, but when the Spanish-like language was adapted to lose its pro-drop feature, the L2 English pronoun gender errors disappeared. This confirmed that the error may be caused by a transfer effect from Spanish pro-drop.
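The language-control mechanism can be sketched in a few lines; this is a deliberately simplified, hypothetical encoding (the actual Dual-Path model uses a richer message representation, and the unit labels here are invented). The key point is that the input is merely extended with units flagging the target language, while the rest of the architecture is shared:

```python
# Hypothetical sketch of language-control units: the message input is
# extended with a one-hot language flag, and nothing else changes.
def encode_input(message_units, language):
    """Concatenate a one-hot language flag onto the message vector.
    The 'spanish'/'english' labels are illustrative."""
    flags = {"spanish": [1.0, 0.0], "english": [0.0, 1.0]}
    return message_units + flags[language]

message = [0.0, 1.0, 0.0, 1.0]   # toy message (semantic) representation

# Same message, same shared weights; only the control units differ,
# steering the network toward producing the sentence in one language.
x_es = encode_input(message, "spanish")
x_en = encode_input(message, "english")
print(x_es)  # [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
print(x_en)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because the flag is just another input, the trained network can in principle receive intermediate or conflicting language cues, which is one route by which code-switch-like behavior can arise without ever being trained on code-switched input.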
The Spanish-English bilingual Dual-Path model can also produce code-switched sentences, even though it was never exposed to code-switches during training (Tsoukala, Frank, Van den Bosch, Valdés Kroff, & Broersma, 2019). The patterns of generated code-switches show some resemblance to those of Spanish-English bilinguals. In particular, the model behaves like Spanish-English bilinguals in that it is much more likely to code-switch into English after the auxiliary verb in progressive than in perfect-tense structures, whereas code-switches on the auxiliary itself occur equally often in both structures. However, it remains unclear why, exactly, the model shows these typically bilingual and somewhat humanlike behaviors. Even though the model required no ad hoc assumptions (so it displays some explanatory adequacy), it currently provides little in the way of a true understanding of the cognitive processes that underlie the behavior.

Beyond Bilingualism: Is There Anything Special About Multilingual Models?
We have seen several examples of bilingual grammar induction or sentence-processing models. How can these be extended to handle more than two languages? This question has not yet been tackled in cognitive models, but it is a research focus in the field of computational linguistics, which is to say that these multilingual models are not intended to be models of human multilingualism. Cohen et al. (2011) demonstrate that, for a typologically diverse range of languages, single-language grammar induction from unannotated sentences is often more successful when, at the same time, grammars of four other languages are learned from syntactically annotated sentences. Even more impressively, the model by Iwata et al. (2010) can simultaneously learn, without supervision or parallel corpora, a shared grammar and language-specific subgrammars for ten Indo-European languages plus Finnish. Apparently, going from bi- to multilingualism is unproblematic for models of grammar induction, but, as mentioned in Section Bilingual Probabilistic Grammars, such models do require specific machinery to deal with more than one language, and their psycholinguistic relevance is questionable and has not been evaluated.

More recent work on computational linguistic models that are massively multilingual (i.e., that can handle more languages than most individual humans) applies neural networks instead of probabilistic grammars. This generally does not require much (if any) architectural adaptation specific to bi- or multilingualism. Johnson, Schuster, Le, and Krikun (2017) took an RNN model for translation between two languages and trained it on up to twelve language pairs, including several European languages, Korean, and Japanese. It was then also able to translate between pairs it was never trained on, to handle code-switched input, and to engage in code-switched production. The only required addition to the original model (apart from the additional languages) was a set of extra input units that signal which of its languages the network should translate into.
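A minimal sketch of this kind of language signaling is to mark the input with the desired target language; the token format below is illustrative and not necessarily the one used by Johnson et al. (2017):

```python
# Hypothetical sketch: the only multilingual-specific addition to the
# translation model is an input marker naming the target language,
# prepended to the source sentence before it is fed to the network.
def prepare_input(tokens, target_language):
    """Prepend a target-language marker (illustrative format)."""
    return [f"<2{target_language}>"] + tokens

src = ["hello", "world"]
print(prepare_input(src, "fr"))  # ['<2fr>', 'hello', 'world']
print(prepare_input(src, "ko"))  # ['<2ko>', 'hello', 'world']
```

The rest of the network is entirely shared across language pairs, which is what makes zero-shot translation between untrained pairs possible in the first place.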
Moving toward massive multilingualism does not imply that such language control gets out of control. Östling and Tiedemann (2017) managed to train a text-prediction RNN on Bible translations in up to 990 languages simultaneously, although increasing the number of languages did lead to worse performance. Language identity was not represented by a discrete symbol but by a continuous-valued, high-dimensional vector. This allows for the representation of an unbounded number of languages, as well as of their interrelatedness, by assigning similar vectors to similar languages. Consequently, the model could interpolate between languages and, for example, generate sentences from a continuous range of languages between Middle English and Modern English. In effect, the "number of languages" known by the model has become a meaningless concept (see Berthele, this issue, for a discussion of how the same can be argued for human multilingualism).
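The idea of continuous language identity can be illustrated with a small, hypothetical sketch (the vectors and their dimensionality are invented): each language is a point in a vector space, and interpolating between two points yields a conditioning vector for an "in-between" language.

```python
# Hypothetical sketch of continuous language embeddings in the spirit
# of Ostling and Tiedemann (2017). Real models use high-dimensional,
# learned vectors; these three-dimensional values are invented.
def lerp(v1, v2, t):
    """Linear interpolation between two language vectors (0 <= t <= 1)."""
    return [a + t * (b - a) for a, b in zip(v1, v2)]

middle_english = [0.9, 0.1, 0.4]   # illustrative language vectors
modern_english = [0.7, 0.5, 0.2]

# Conditioning generation on this vector targets a "language" halfway
# between the two, so the set of languages is a continuum rather than
# a discrete inventory.
halfway = lerp(middle_english, modern_english, 0.5)
print(halfway)
```

Under this representation, asking how many languages the model knows is like asking how many points lie on a line, which is precisely why the concept loses its meaning.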
What is the potential for cognitive models of multilingualism? The results on bilingual RNNs (see Section Bilingual Recurrent Neural Networks) suggest that multilingualism is relatively straightforward to model: Next-word prediction and sentence-production RNNs can, in principle, be trained on one, two, or many languages. Although performance may degrade as the number of languages increases, as it did in the Östling and Tiedemann (2017) model, neural networks for translation (Zoph, Yuret, May, & Knight, 2016) or image-caption retrieval (Kádár, Elliott, Côté, Chrupała, & Alishahi, 2018) have demonstrated that training a model on additional languages can actually improve performance on languages for which little training data is available.
We have already seen that surprising phenomena can emerge when monolingual models are turned bilingual or when bilingual models become multilingual: The bilingual Dual-Path model displays unexpected transfer and code-switching behavior (Tsoukala et al., 2017), and RNN translation systems can translate between new language pairs (Johnson et al., 2017).
As more languages are added, more opportunities for such emergence arise. These uniquely multilingual abilities thus emerge from simply presenting multiple languages to models that have no dedicated multilingual design features. Therefore, observing "special" behavior in multilingual models (or, by extension, people) does not imply the presence of any "special" underlying mechanism. Rather, the increased complexity of the bilingual or multilingual system can arise without increasing the complexity of the underlying cognitive architecture. The view from the state of the art in computational language models thus appears to be that there is nothing special about multilingualism.
Final revised version accepted 30 January 2020

Notes

1 This does not preclude a role for abstract linguistic units. Rather, the statistical learning view holds that the abstractions themselves are discovered from the statistical regularities in the language data, and that learning extends to the (co-)occurrence frequencies of the learned abstract units.

2 All these studies used data from English, with the exception of Willems, Frank, Nijhof, Hagoort, and Van den Bosch (2016), who used Dutch, so it remains to be demonstrated that these findings generalize to a typologically diverse set of languages. However, the information-theoretical basis of surprisal is language independent, so there is no reason to assume that its validity is restricted to Indo-European (or even just Germanic) languages.

3 There also exist hybrid models that explicitly include syntactic structure in neural networks (e.g., Dyer, Kuncoro, Ballesteros, & Smith, 2016; Sturt, Costa, Lombardo, & Frasconi, 2003), but these have not been very influential in psycholinguistics.

4 To be precise, Brouwer, Crocker, Venhuizen, and Hoeks (2017) take as a prediction for P600 size the amount of change in a network layer just before the semantic output.

5 But see Fitz and Chang (2019) for an alternative, learning-based account of the P600, supported by RNN simulations.

6 In much the same sense that a monolingual person who will (in the future) learn an L2 is not a priori different from someone who will not.