The origins of gestures and language: history, current advances and proposed theories

Investigating in depth the mechanisms underlying human and non‐human primate intentional communication systems (involving gestures, vocalisations, facial expressions and eye behaviours) can shed light on the evolutionary roots of language. Reports on non‐human primates, particularly great apes, suggest that gestural communication would have been a crucial prerequisite for the emergence of language, mainly based on the evidence of large communication repertoires and their associated multifaceted nature of intentionality that are key properties of language. Such research fuels important debates on the origins of gestures and language. We review here three non‐mutually exclusive processes that can explain mainly great apes' gestural acquisition and development: phylogenetic ritualisation, ontogenetic ritualisation, and learning via social negotiation. We hypothesise the following scenario for the evolutionary origins of gestures: gestures would have appeared gradually through evolution via signal ritualisation following the principle of derived activities, with the key involvement of emotional expression and processing. The increasing level of complexity of socioecological lifestyles and associated daily manipulative activities might then have enabled the acquisition and development of different interactional strategies throughout the life cycle. Many studies support a multimodal origin of language. However, we stress that the origins of language are not only multimodal, but more broadly multicausal. We propose a multicausal theory of language origins which better explains current findings. It postulates that primates' communicative signalling is a complex trait continually shaped by a cost–benefit trade‐off of signal production and processing of interactants in relation to four closely interlinked categories of evolutionary and life cycle factors: species, individual and context‐related characteristics as well as behaviour and its characteristics. 
We conclude by suggesting directions for future research to improve our understanding of the evolutionary roots of gestures and language.


I. INTRODUCTION
One of the fundamental features of human language is that it is learned through socially mediated interactions, culturally transmitted over generations, and varies substantially within and among human societies. According to a growing number of researchers, human culture and language are closely related evolutionarily speaking (e.g. Fitch, Huber & Bugnyar, 2010; Dediu et al., 2013). Studies of intraspecific behavioural variability in other animals have evidenced a culture-like phenomenon, especially in birds (e.g. Marler & Tamura, 1964; Aplin, 2019), cetaceans (e.g. Rendell & Whitehead, 2001; Cantor & Whitehead, 2013) and non-human primates (hereafter primates) (e.g. Whiten et al., 1999; Boesch, 2012), but also in fish (e.g. Laland, Atton & Webster, 2011) and insects (e.g. Alem et al., 2016). This review focuses on primates, particularly on great apes concerning gestural communication and on monkeys concerning vocalisations, reflecting the model species bias inherent in the literature on gesture and language origins. A distinctive behaviour pattern can be group specific (e.g. tool use by chimpanzees, Pan troglodytes: e.g. Luncz, Mundry & Boesch, 2012), sometimes involving neighbouring groups living in nearly identical environments but that differ genetically (e.g. Gruber et al., 2012). However, most primate studies focus mainly on foraging techniques, and communicative signalling has been overlooked. Some cases of culturally transmitted communication behaviour concern vocal traditions (e.g. nest-building calls by free-ranging orangutans, Pongo pygmaeus; Wich et al., 2012) and visual signalling traditions associated with vocalisations (e.g. howler monkeys, Alouatta pigra, placing their hand in front of their mouth while vocalising; Briseño-Jaramillo, Estrada & Lemasson, 2015). Other examples include gestural communication traditions [e.g. HANDCLASP GROOMING by chimpanzees (Nakamura, 2002); HAND EXTENSION by mandrill, Mandrillus sphinx (Laidre, 2008); throughout this review we use small capital letters to identify gestures]. Here we restrict the term 'gesture' to communication functions. We define a gesture as a movement of the limbs, head, and/or body directed towards a recipient that is mechanically ineffective (i.e. 'visibly lacks the mechanical force to bring about the reaction shown by the recipient, and also does not include any attempt to grab or extensively hold a body part of the other'; Pollick & de Waal, 2007, p. 8185) and elicits a voluntary response from the recipient (e.g. Pika, 2008; Pika & Bugnyar, 2011).
Further studies are needed for a better understanding of the evolutionary origins of human language. Comparative evolutionary approaches investigating communicative signalling of our closest living relatives, the non-human primates including New World and Old World monkeys as well as small and great apes, can help us to understand the evolutionary mechanisms underlying the acquisition of language.
Gestures play a crucial role in human and non-human primate communication systems (e.g. Cartmill, Beilock & Goldin-Meadow, 2012; Pika & Liebal, 2012; Gillespie-Lynch et al., 2014). Some non-human primate gestural communication systems share several key characteristics with human language such as intentionality (e.g. Meunier, Prieur & Vauclair, 2012), referentiality [e.g. directed scratches (Pika & Mitani, 2006); beckoning (Genty & Zuberbühler, 2014)], and conversational rules [turn-taking sequences (e.g. Fröhlich et al., 2016)]. All these properties underlying the production and use of sophisticated gestural communication are crucial prerequisites for human language (e.g. Leavens, 2004; Meguerditchian & Vauclair, 2008; Pika, 2008; Fitch, 2010). Therefore, the study of primate gestural communication should provide valuable clues to explore the evolutionary roots of human language (e.g. Arbib et al., 2008; Pika & Liebal, 2012). Our aim is twofold. First, we assess the current understanding of primate gestural communication, suggesting that gestures played a key role in the emergence of language. We discuss the debate on the origins of gestures and hypothesise an evolutionary scenario. Second, we present four theories concerning the origins of language: the vocal theory, the gestural theory, the multimodal theory and the multicausal theory. We propose [...]
[...] and use of gestures], with only a few studies focussing on the understood gestural repertoire by addressing the meaning of gestures, i.e. what signals are used intentionally to achieve [bonobos, Pan paniscus (Graham, Furuichi & Byrne, 2017); chimpanzees (Roberts, Vick & Buchanan-Smith, 2012); see also Hauser, 2000, Scott-Phillips, 2015, 2016a, Gruber, 2016 and Oña, 2018a for discussions about meaning and reference in human and non-human animals].
While the expressed gestural repertoire refers to 'the set of gesture types that a signaller deploys', the understood gestural repertoire refers to 'the set of gestures to which a recipient reacts in a way that satisfies the signaller repertoire' (Graham, Furuichi & Byrne, 2017, p. 171).
The gestural repertoire of primates is considerable in size and includes a great variety of gestures [e.g. bonobos (de Waal, 1988; Pika, Liebal & Tomasello, 2005b); chimpanzees (Nishida et al., 2010); gorillas (Pika, Liebal & Tomasello, 2003; Genty et al., 2009); orangutans (Liebal, Pika & Tomasello, 2006); see also Pollick & de Waal, 2007]. For example, the gestural repertoire of gorillas is larger than their vocal repertoire: 33-102 gesture types (e.g. Genty et al., 2009) versus 5-17 call types (e.g. Salmi, Hammerschmidt & Doran-Sheehy, 2013; Hedwig et al., 2014). Considering reports of the gestural repertoires of the four great apes as well as siamangs and Barbary macaques (Macaca sylvanus), at least 50% of the repertoire of each species has been calculated to consist of manual gestures, with the highest proportion (73%) for gorillas. The repertoires of these species include visual gestures (which generate a mainly visual component with no physical contact such as RAISE ARM and EXTEND HAND), tactile gestures (which include physical contact with the recipient such as EMBRACE and TOUCH BODY), auditory gestures (which generate sound while being performed such as SLAP HAND and BEAT BODY) and object manipulation gestures (which involve the use of an object such as SHAKE OBJECT and THROW OBJECT).
(c) Communication persistence to achieve a social goal
A signaller generally ceases to communicate when it has achieved its social goal (e.g. Roberts, Vick & Buchanan-Smith, 2013). In contrast, when the initial gesture is unsuccessful, a signaller can repair communication failure by persisting: the signaller either performs the same type of gesture repeatedly (repetition) or performs another type of gesture or a combination of gestures (elaboration) in relation to the recipient's comprehension state (e.g. Bates, Camaioni & Volterra, 1975; Bruner, 1981; Roberts, Vick & Buchanan-Smith, 2013; Genty, Neumann & Zuberbühler, 2015). According to some authors, signallers' communication persistence would be the most convincing evidence of intentionality in gestural communication (e.g. Golinkoff, 1986; Leavens, Russell & Hopkins, 2005). As Liebal et al. (2014) pointed out, elaboration would be a more reliable marker of intentionality than repetition: a signaller's arousal or emotional state would be less likely to drive production of several different gestures than repetition of the same gesture over a short period of time.
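The distinction between repetition and elaboration can be illustrated with a short Python sketch. It is an illustration only: the function name and the representation of a communicative bout as an ordered list of gesture-type labels are assumptions made for this example, not a coding scheme taken from the studies cited above.

```python
from typing import List, Optional


def classify_persistence(gestures: List[str]) -> Optional[str]:
    """Classify how a signaller persists after an unsuccessful first gesture.

    `gestures` is a hypothetical record of the gesture types produced in one
    bout, in order of production. Returns None when there is no persistence
    (zero or one gesture), 'repetition' when every later gesture is of the
    same type as the first, and 'elaboration' when at least one later
    gesture is of a different type.
    """
    if len(gestures) < 2:
        return None
    first = gestures[0]
    if all(g == first for g in gestures[1:]):
        return "repetition"
    return "elaboration"


if __name__ == "__main__":
    print(classify_persistence(["RAISE ARM", "RAISE ARM"]))
    print(classify_persistence(["RAISE ARM", "SLAP HAND"]))
```

A real coding scheme would also need to record the recipient's response and the time elapsed between gestures; both are deliberately left out of this sketch.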
Complementary intentionality criteria to those listed above have been proposed and defined to identify and categorise intentionality in communication in a more comprehensive way (e.g. Bates, Camaioni & Volterra, 1975; Bruner, 1981; Tomasello et al., 1994; Leavens, Russell & Hopkins, 2005; Pika & Liebal, 2012; Townsend et al., 2017; Prieur et al., 2018b): (i) a signaller uses a signal to achieve a social goal: the signaller produces a signal only in the presence of an audience or produces or withholds a signal depending on the size and/or composition of the audience (e.g. age and sex ratio; kin, affiliative or hierarchical relationship); (ii) a signaller waits for the recipient's response: the signaller pauses after sending a signal and waits for at least 2 s for a response while maintaining visual contact with the recipient (Fröhlich et al., 2016); (iii) a signaller uses attention-getting behaviours: the signaller produces an auditory or tactile signal to attract the recipient's attention that is followed by a visual signal (Liebal et al., 2014); (iv) a signaller's apparent satisfaction: the signaller ceases to communicate when the apparent social goal has been achieved by the recipient.
(d) The need for a finer-grained categorisation of intentionality
Intentionality (defined as a commitment to carrying out an action with planning and forethought; e.g. Bratman, 1987) and shared intentionality (defined as the skill and motivation to share goals and intentions with others during joint actions such as social play or cooperative activities; e.g. Tomasello, Carpenter & Hobson, 2005) are keystones of human social cognition (e.g. Malle, Moses & Baldwin, 2001; Tomasello & Carpenter, 2007; Tomasello & Moll, 2010). In particular, these properties play a fundamental role in the emergence and development of linguistic communication, cooperation and cultural learning (e.g. Tomasello & Rakoczy, 2003; Call, 2009; Tomasello, 2009). Deeper investigations of the multifaceted nature and components of intentionality and shared intentionality in social interactions of humans and other primates presenting a wide range of geographical, ecological, social, and demographic characteristics are needed to improve our understanding of the evolutionary origins of the unique motivation of humans to communicate, cooperate and share psychological states with others.
Using multicriteria methods should reduce the uncertainties and difficulties inherent in qualifying a behaviour as intentional. A recent study of bonobo and chimpanzee gestural exchanges observed during mother-infant joint travel interactions differentiated between social interactions via intentionally produced gestures presenting more than one key criterion of intentionality (multiple-criteria gestures) and gestures presenting only one key criterion of intentionality (single-criterion gestures) (Fröhlich et al., 2016). The authors' key intentionality criteria were: sensitivity to the attentional state of the recipient, response waiting, apparent satisfaction of signaller, and goal persistence. However, a finer-grained analysis of intentionally produced behaviours taking into account simultaneously as many key criteria of intentionality as possible is necessary to understand better (i) when, why and how intentionality develops at the individual, population and species levels, and (ii) possible evolutionary scenarios for the origins of human intentional communication, cooperation and shared intentionality. Prieur et al. (2018b) proposed an index to quantify the development of intentionality abilities (Intentionality Characterisation Index: ICI) taking five key criteria of intentionality into account. The ICI could help to make comparisons (i) within and among individuals, populations and species, and (ii) within and among gestural (tactile, visual and/or auditory), oro-facial and vocal communication systems. Further considerations using the ICI should help to assess better the development of intentionality abilities. We emphasise the importance of investigating intentional abilities by applying a comprehensive approach taking into account as many potentially influential factors as possible. The factors that should be considered range from sociodemographic characteristics of the signaller and recipient (e.g. age, sex, group, affiliation, kinship and rank status, e.g. Prieur et al., 2016a, 2017c) and socioecological factors of populations and species (e.g. habitat, social structure and dynamics, e.g. Cunningham & Janson, 2007; Bouchet, Blois-Heulin & Lemasson, 2013) to several characteristics related to context (e.g. audience effect and interactional components, e.g. Fröhlich et al., 2017; Coppinger, Cannistraci & Karaman, 2017) and of the communication signals (e.g. ICI and type of sensory modalities involved).
To summarise, the literature suggests that the gestural communication of primates and particularly of great apes is likely to have played a key role in the emergence of language, mainly based on the evidence of large communication repertoires and their associated multifaceted nature of intentionality that are fundamental properties of language. The increasing number of reports in favour of the evolutionary contribution of gestures to the emergence of human language has led researchers to investigate and formulate several hypotheses concerning the origins of gestures.
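The multicriteria distinction described in this section (single-criterion versus multiple-criteria gestures) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, the four fields correspond to the criteria reported above for Fröhlich et al. (2016), and the simple count used here is not the Intentionality Characterisation Index of Prieur et al. (2018b), whose computation is not given in this review.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IntentionalityCriteria:
    """Hypothetical record of which criteria one observed gesture satisfied."""
    attention_sensitivity: bool   # sensitivity to the recipient's attentional state
    response_waiting: bool        # signaller pauses and waits for a response
    apparent_satisfaction: bool   # signalling stops once the goal is achieved
    goal_persistence: bool        # repetition or elaboration after failure


def criteria_met(record: IntentionalityCriteria) -> int:
    """Count how many of the recorded criteria are satisfied."""
    return sum(bool(getattr(record, f.name)) for f in fields(record))


def categorise(record: IntentionalityCriteria) -> str:
    """Label a gesture by the number of intentionality criteria it meets."""
    n = criteria_met(record)
    if n == 0:
        return "no criterion"
    return "single-criterion" if n == 1 else "multiple-criteria"


if __name__ == "__main__":
    gesture = IntentionalityCriteria(True, True, False, False)
    print(criteria_met(gesture), categorise(gesture))
```

An unweighted count treats all criteria as equally informative; any weighting of criteria would be a further assumption that should be justified empirically.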

(3) The origins of gestures
The origins of gestures are still debated, with four main hypotheses/mechanisms proposed: (i) phylogenetic ritualisation; (ii) ontogenetic ritualisation; (iii) social learning via imitation; and (iv) learning via social negotiation (e.g. Tinbergen, 1952; Arbib et al., 2008; Genty et al., 2009; Hobaiter & Byrne, 2011b; Fröhlich, Wittig & Pika, 2016c; Liebal, Schneider & Errson-Lembeck, 2018; Pika & Fröhlich, 2018). As described below, phylogenetic ritualisation is a process in which communication displays (e.g. dominance signals such as mounting) could emerge from body movements lacking a communication function because they are 'borrowed' from other contexts (e.g. a sexual context). Ontogenetic ritualisation is a process in which a communication signal is created by two individuals shaping each other's behaviour in repeated instances of an interaction over time. Tomasello (2002, p. 329) provides an example: 'play-hitting is an important part of the rough-and-tumble play of chimpanzees, and so many individuals come to use a stylized 'arm-raise' to indicate that they are about to hit the other and thus initiate play'. Imitative learning refers to situations in which observers can acquire parts of an actor's gestural repertoire by copying its gestures directed towards them (second-person imitation) or towards other individuals without directly interacting with them (third-person imitation) (e.g. Bandura, 1986). Learning via social negotiation is a process based on the assumption that gestures originate from a shared understanding of gestural meaning and mutual construction in real time by both interactants (e.g. Fröhlich, Wittig & Pika, 2016c; Pika & Fröhlich, 2018).

(a) Phylogenetic ritualisation
Several studies support the biological inheritance of primate gesture acquisition (e.g. Redshaw & Locke, 1976; Genty et al., 2009; Hobaiter & Byrne, 2011a,b; Byrne et al., 2017; Graham, Furuichi & Byrne, 2017). In particular, they reported that chimpanzee gestures were not only shared by community members, but also that many gestures of the four great apes (bonobos, chimpanzees, gorillas, and orangutans) were similar (Hobaiter & Byrne, 2011b; Graham, Furuichi & Byrne, 2017). Furthermore, they showed that the gestural repertoire of older chimpanzees was smaller than that of younger subjects (Hobaiter & Byrne, 2011a). The latter authors proposed the repertoire tuning hypothesis, which postulates that an originally large innate repertoire is shaped by experience into the most efficient elements for each individual. The above studies indicate that the vast majority of great ape gestures are innate and exhibit flexibility in use across contexts but not flexibility in form, leading to high repertoire concordance within and among groups.

(b) Ontogenetic ritualisation
Contrary to the view of biological inheritance, several studies support an ontogenetic basis of the gestural acquisition process, called ontogenetic ritualisation (e.g. Plooij, 1978; Tomasello et al., 1985), as evidenced by idiosyncratic gestures (i.e. gestures that are learned individually so that they are used only by single individuals within a group) and by the high variability in gestural repertoires observed between groups of bonobos, chimpanzees and orangutans in captive and natural conditions (e.g. Pika, Liebal & Tomasello, 2005b; Liebal, Pika & Tomasello, 2006). This hypothesis is also supported by a developmental study focusing on initiation of joint travel by bonobo mothers carrying infants that revealed variability within dyads (Halina, Rossano & Tomasello, 2013). These studies indicate that great ape gestures can be created mainly by repeated exchanges between interactants, resulting in the shortening of a physically effective sequence of actions (without any communication function), leading to low repertoire concordance within dyads as well as within and among groups. As for other gestures subject to a ritualisation process, gestures emerging from ontogenetic ritualisation also exhibit flexibility in use across contexts but not flexibility in form.

Biological Reviews 95 (2020)

(c) Social learning via imitation
Other authors argue that great ape gestural acquisition can be explained by imitative learning (e.g. Hayes & Hayes, 1952; Russon & Galdikas, 1993; Custance, Whiten & Bard, 1995). Individuals can learn gestures by observation and replication of behaviours of their parents (vertical transmission), peers (horizontal transmission), or unrelated older group members (oblique transmission) (Cavalli-Sforza & Feldman, 1981). The social learning via imitation hypothesis predicts that concordance of repertoires would be high within groups but low among groups. However, very few primate data show the presence of group-specific gestures (e.g. Tanner & Byrne, 1999; Pika, Liebal & Tomasello, 2003, 2005b; Liebal, Pika & Tomasello, 2006). Although humans learn by observation, it seems that the role of imitation in great ape gestural acquisition is negligible.

(d) Learning via social negotiation
The authors of a systematic quantitative comparison of gestural signalling between two chimpanzee communities compared individuals horizontally between communities and the same individuals vertically between two consecutive years, focusing on mother-infant dyads, and considering gestures to initiate joint travel. As concordance rates of gestural repertoires varied within dyads as well as within and between communities, they suggested that genetic channelling, ontogenetic ritualisation and social transmission through imitation do not adequately explain the variability and flexibility of chimpanzee gestural interactions. They proposed a revised version of the social negotiation hypothesis (sensu Plooij, 1978; Wittgenstein, 1953), which states that gestures are the output of social shaping, shared understanding of gestural meaning and mutual construction in real time by interactants. This view is strengthened by their studies of gestural development in chimpanzee infants focusing on gestures produced in various behavioural contexts, in particular initiation of social play.
They reported that chimpanzees are able to adjust their communication (gestural play solicitation) according to distinct attributes of conspecifics (age, sex, kin relationship) (Fröhlich, Wittig & Pika, 2016b). Furthermore, they showed that higher interaction rates with non-maternal conspecifics and the number of previous interaction partners increase gesture frequency, gesture production in sequences, and size of the gestural repertoire, depending on the infant's age (Fig. 1). In contrast, interaction rates with mothers did not influence chimpanzee infant gestural signalling. These findings indicate that infants of highly social mothers are at an advantage in the complex social life of chimpanzees (Fröhlich et al., 2017). The longitudinal data emphasise the important role of interactional experience and social exposure in chimpanzee gestural acquisition. Fröhlich et al. (2016, 2017) and Fröhlich, Wittig & Pika (2016b,c) show that the acquisition and development of gestures do not necessarily imply the shortening of an action sequence (unlike ontogenetic ritualisation), but instead imply mutual online adjustment (unlike phylogenetic ritualisation) and exchange of fully formed behaviours by both interactants. This would lead to low repertoire concordance within dyads as well as within and among groups. Gestures emerging throughout ontogeny via a process of social negotiation exhibit flexibility in use across contexts and could also show flexibility in form, but this assumption awaits confirmation.
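The hypotheses reviewed above make contrasting predictions about repertoire concordance within dyads, within groups and among groups. One way to make such comparisons concrete is sketched below in Python. The sketch assumes that each individual's repertoire is recorded as a set of gesture-type labels and that concordance is measured as the Jaccard overlap between two sets; both choices, and all function names, are assumptions made for illustration and are not necessarily the measures used in the studies cited.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Set


def concordance(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two repertoires: shared types / all types."""
    union = a | b
    if not union:
        # Two empty repertoires share nothing; defined as 0.0 in this sketch.
        return 0.0
    return len(a & b) / len(union)


def mean_pairwise_concordance(repertoires: Dict[str, Set[str]]) -> float:
    """Average concordance over all pairs of individuals in one group."""
    pairs = list(combinations(repertoires.values(), 2))
    if not pairs:
        raise ValueError("at least two repertoires are needed")
    return mean(concordance(a, b) for a, b in pairs)


if __name__ == "__main__":
    # Hypothetical repertoires, for illustration only.
    group = {
        "individual_1": {"RAISE ARM", "EMBRACE"},
        "individual_2": {"RAISE ARM", "EMBRACE"},
        "individual_3": {"RAISE ARM"},
    }
    print(mean_pairwise_concordance(group))
```

Under these assumptions, high values within and among groups would be in line with phylogenetic ritualisation, high values within groups but low values among groups with imitative learning, and low values at every level with ontogenetic ritualisation or social negotiation.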

(e) Towards a consensus on the origins of gestures
To summarise, some data support a phylogenetic process of ritualisation in primate gestural acquisition, such as species-typical gestures (e.g. BEAT CHEST in gorillas) and family-typical gestures in common among different genera of great apes living under different ecological and/or social pressures (e.g. EMBRACE in bonobos, chimpanzees, gorillas, humans and orangutans). Other data support the ontogenetic ritualisation hypothesis with the presence of idiosyncratic gestures that are key indicators of the role of individual learning. Moreover, fine-grained analyses of longitudinal data for wild chimpanzees support the social negotiation hypothesis with a role of interactional experience and social exposure in gestural acquisition. Based on these findings, we suggest that great ape gestural acquisition can be explained mainly by the following three non-mutually exclusive processes: phylogenetic ritualisation, ontogenetic ritualisation and learning through ontogeny via social negotiation. This view is supported by studies showing that primate gestural acquisition may be due to more than one single mechanism, involving a combination of both genetic and social factors (e.g. Schneider, Call & Liebal, 2012).
To progress further, it is important to understand better the cognitive mechanisms underlying phylogenetic ritualisation, ontogenetic ritualisation and social negotiation. We assume that these three processes involve different mechanisms of concept learning. Based on previous definitions (e.g. Barsalou, 1991, 1992; Zentall et al., 2008; Zentall, Wasserman & Urcuioli, 2014), we define concept learning as the acquisition and development of the ability to categorise objects, events and relations into classes that allows the generalisation of something learnt to novel stimuli or contexts evaluated to be perceptually, associatively, or functionally equivalent to those involved in the initial learning.
First, phylogenetic ritualisation would require perceptual or similarity-based concept learning in which objects/stimuli are categorised based on physical similarity (e.g. Zentall, Wasserman & Urcuioli, 2014). Such perceptual concepts would be mainly under the control of the behavioural principles of primary stimulus generalisation and discrimination (e.g. Hull, 1943; Honig & Urcuioli, 1981; Mackintosh, 2000) which allow humans and non-human animals to respond similarly and adequately to new events that resemble past events (e.g. Pearce, 1988; Wasserman, Kiedinger & Bhatt, 1988; Zentall & Wasserman, 2012). For example, animals could link dominance signals such as mounting to ancient postural reflexes crucial for reproduction because of the physical similarity of the motions or positions of body, limbs, head, face and/or eyes. Second, ontogenetic ritualisation and social negotiation would require advanced cognitive mechanisms of associative concepts in which arbitrary stimuli (i.e. a physical resemblance between the stimuli is not required) become interchangeable because of their previous association with another stimulus, outcome, or response (e.g. Zentall, Wasserman & Urcuioli, 2014). Such associative concepts would develop through experience (e.g. Miller & Dollard, 1941; Dickinson, 2012). For instance, an action sequence, starting with 'raising one arm' or 'slapping ground with one hand', both preceding a play event, would form an associative class ('play') through repeated instances of interactions among individuals. The context of an association (e.g. play context) would thus be continuously and flexibly shaped and adjusted during ontogeny by adding new stimuli to, or discarding old stimuli from, its associative class (e.g. 'play').
We postulate that such concept-learning mechanisms might have played a fundamental role in the acquisition and development of meaningful social behaviours, and thereby intentional and functional gestural (tactile, visual and/or auditory), vocal, oro-facial and eye communication systems.
Based on the literature and on our above assumptions about great ape gestural acquisition and related mechanisms of concept learning, we hypothesise a scenario for the evolutionary origins of gestures. First, gestures would have appeared via signal ritualisation (Krebs & Dawkins, 1984; Smith & Harper, 2003) following the principle of derived activities (Tinbergen, 1952). Gestures would originate from a pre-existing set of instinctive, involuntary actions that are essential to survival (e.g. escaping, foraging, mating, nursing) and that have no signalling function but provide information concerning the actor's emotional state to observers through physiological and behavioural changes such as acceleration of breathing rate and combination of body, limb, head and eye movements with facial expressions and/or vocalisations. For instance, when competing for food, actors in need of dietary energy might benefit from the observer's responses (e.g. departure from the food location) by intensifying their emotional state reflected in their behaviour (e.g. through more rapid or more jerky movements of hands combined with more pronounced facial expressions and gaze alternation between food items and the observer) in an appropriate functional context (e.g. feeding) or other contexts (e.g. travel and sex). Observers might benefit from the detection and discrimination of the actor's emotional/behavioural patterns by anticipating their behaviour and reacting accordingly (e.g. by escaping to avoid physical aggression). An important role of emotional expression and processing in the emergence of intentional communication has been suggested by several studies [e.g. see Liebal & Oña, 2018b for a recent review].
Ultimately, such behavioural patterns would have evolved progressively and been maintained across generations, leading to phylogenetically ritualised communication signals that lack flexibility in form (at both individual and species levels) but not in use, such as attention-getting gestures and species-typical gestures (e.g. BEAT CHEST) as well as family-typical gestures (e.g. EMBRACE).
Second, the increasing level of complexity of ecological [environmental diversity and variability such as an ecosystem with several well-developed strata from arboreal to terrestrial habitats (e.g. August, 1983; Anand et al., 2010)] and social [from strictly solitary to more or less social and tolerant (e.g. Aureli et al., 2008; Grueter, Chapais & Zinner, 2012)] lifestyles and related daily manipulative activities (e.g. tasks involving multiple acts such as bimanual coordinated actions and tool use) might have shaped animals' key sociocognitive-communicative properties (e.g. intentionality, referentiality and turn-taking) and associated abilities (e.g. intentional abilities such as monitoring of the audience and elaboration). This shaping might have occurred through a trial-and-error, action-reaction learning process, allowing them to acquire and develop different interactional strategies throughout their life cycle, mainly by (i) shortening action sequences through ontogenetic ritualisation; (ii) mutual online adjustments of fully formed behaviours (arising from phylogenetic ritualisation) through learning via social negotiation; and (iii) imitative learning. On the one hand, ontogenetic ritualisation and imitative learning would lead to learned intention movements (e.g. Tinbergen, 1951, 1952; Tomasello, 2010) such as RAISE ARM and LEAF-CLIPPING (Nishida, 1980) that lack flexibility in form (at the dyad level for ontogenetic ritualisation and at both dyad and group levels for imitative learning) but not in use. On the other hand, learning during ontogeny via social negotiation would lead to intentional online adaptation and refinement of movements that hypothetically exhibit flexibility in both form and use (Pika & Fröhlich, 2018), but this remains to be confirmed.
It is only later, during the evolution of hominids, that advanced cognitive-communicative skills might have emerged and developed, such as the representation of actions using iconic gestures and elaboration of messages after communication failure using pantomime as well as referential acts such as pointing and deictic gestures that are occasionally observed in great apes [iconicity (e.g. Tanner & Byrne, 1996; Russon & Andrews, 2011; Perlman, Tanner & King, 2012; Genty & Zuberbühler, 2014; Douglas & Moscovice, 2015); referentiality (e.g. Veà & Sabater-Pi, 1998; Pika & Mitani, 2006; Genty & Zuberbühler, 2014; Hobaiter, Leavens & Byrne, 2014)].
Taken together, these studies and hypotheses favour an important evolutionary role of gestural communication in the emergence of human language and have led researchers to support the gestural theory of language origins. Below we discuss this and other theories advanced in the debate on the evolutionary origins of language.

III. THEORIES OF THE ORIGINS OF LANGUAGE
Historically, research investigating the emergence of language through the study of communication in great apes used two main approaches. The first approach consisted of trying to teach great apes human vocalisations, but such attempts failed repeatedly (Garner, 1900; Kellogg & Kellogg, 1933; Hayes, 1951; Hayes & Hayes, 1951). However, note that studies have shown the comprehension of spoken words by non-human animals [e.g. the chimpanzee Vicki (Hayes, 1951); the bonobo Kanzi (Savage-Rumbaugh, Shanker & Taylor, 1998); see also Kaminski, Call & Fischer, 2004 and Pilley & Reid, 2011 for recent studies of dogs]. Studies showed that great apes were more successful in learning other human communication systems: sign language (e.g. Gardner & Gardner, 1969) and use of plastic tokens (e.g. Premack, 1971) or geometric symbols called lexigrams (e.g. Greenfield & Savage-Rumbaugh, 1990). Results from these ape 'language' projects emphasise the multifaceted nature of communication and the fact that many features of language are not specific to humans.
From this literature three major theories have been advanced to explain the emergence of human language: the vocal theory, the gestural theory and the multimodal theory. The vocal theory of language origins proposes that language stemmed from the auditory-vocal modality (e.g. Dunbar, 1996; Zuberbühler, 2005; Knight, 2008). The gestural theory of language origins states that language developed from gestures (e.g. Corballis, 2002; McNeill, 2012). The multimodal theory of language origins posits that gestural, vocal, oro-facial and eye communication systems coevolved to build the multimodal, rhythmic (i.e. frequency temporal) and socio-interactive nature of language (e.g. Arbib et al., 2008; Masataka, 2008; Lemasson, 2011; Slocombe, Waller & Liebal, 2011; Ghazanfar, 2012; Levinson & Holler, 2014; Liebal et al., 2014). Based on the literature and our findings (e.g. Arlet et al., 2015; Lemasson et al., 2016; Prieur et al., 2016a; Crockford, Wittig & Zuberbühler, 2017; Fröhlich et al., 2017; Hobaiter, Byrne & Zuberbühler, 2017; Fedurek et al., 2019), we propose the multicausal theory of language origins, postulating that primate communicative signalling stems from a cost-benefit trade-off of signal production and processing of interactants in relation to four interrelated (evolutionary and life cycle) factors, namely species, individual and context-related characteristics as well as behaviour and its characteristics. The perennial debate concerning the origins of language is presented below.
(1) The vocal theory of language origins
The vocal theory of language origins predicts that calls represent a precursor of human language (e.g. Seyfarth, 1987; Masataka, 2003; Snowdon, 2009; Zuberbühler et al., 2009; Lemasson, 2011). At least nine key characteristics of human language have been identified in primitive forms in primate vocalisations, particularly in monkeys. First, referentiality was shown to exist in alarm calls conveying particular semantic content with respect to the type or location of predators and the urgency of the threat (e.g. Seyfarth & Cheney, 2003; Cäsar et al., 2013) and by comparing semantic dialects among geographically distant populations in the presence of different predator pressures (Schlenker et al., 2014). Reports also showed referential properties of non-alarm calls conveying semantic content with respect to the quality and quantity of food encountered (e.g. Hauser, 1998; Slocombe & Zuberbühler, 2005), or the accessibility of fertile females (Pfefferle et al., 2008).
Second, protogrammar rules that reflect similarities with grammatical principles of human language (morphology and syntax) have been identified in primate vocal communication, namely morpho-syntax shown by sound units which may be merged to form complex (suffixed) calls (e.g. Crockford & Boesch, 2005; Candiotti, Zuberbühler & Lemasson, 2012a; Coye et al., 2015, 2018; Coye, Zuberbühler & Lemasson, 2016) or calls that can be combined into vocal sequences with a context-dependent predictable concatenation pattern allowing animals to refine or enrich the information conveyed [e.g. Clarke, Reichard & Zuberbühler, 2006; Arnold & Zuberbühler, 2006; Ouattara, Lemasson & Zuberbühler, 2009; see also Collier et al., 2014 for a review concerning the combinatorial structure of human and animal vocal systems]. For instance, female Diana monkeys, Cercopithecus diana, produce four distinct social calls ('H', 'L', 'R' and 'A'): the 'H', 'L' and 'R' calls are related to particular contextual valences for the signaller (very positive social context, neutral to mildly positive context and socionegative or mildly dangerous context, respectively); the 'A' call is associated with a wide range of contexts and its acoustic structure varies significantly among individuals, suggesting that it conveys information about the caller's identity (Candiotti, Zuberbühler & Lemasson, 2012a,b). These authors observed that each of these calls can be produced either singly ('H', 'L', 'R' or 'A') or combined in non-random ways, namely a contextual unit ('H', 'L' or 'R') combined with a signature unit ('A') in relation to ongoing behaviour or external events. Coye, Zuberbühler & Lemasson (2016) investigated the relevance of the 'R', 'L' and 'A' units by merging 'L' and 'R' contextual units with 'A' signature units from either familiar group members or neighbouring individuals.
Diana monkeys responded differently to social calls composed of different morphological units (RA or LA call combination), indicating that their contact call system possesses combinatorial and morpho-semantic properties.
Third, primates' vocalisations and human speech present homologies in terms of articulation and acoustics by production and use of proto-vowels (through typical, 'voiced' calls/vocalisations such as grunts and barks, resulting from the activation of their vocal folds and their regular oscillation) and protoconsonants (through atypical, 'voiceless' calls such as lipsmacks and raspberries, resulting from supra-laryngeal manoeuvring) either singly or in relatively simple syllable-like call combinations (Lameira, 2014, 2018) [... (Preuschoft, 1995); crested macaques, Macaca nigra (Thierry et al., 2000); gelada baboons, Theropithecus gelada (Bergman, 2013); rhesus macaques (Partan, 2002)]. These findings suggest that (i) our last common ancestor with Cercopithecoidea (around 25 million years ago) exhibited ancestral articulatory abilities; and (ii) early hominids (around 7 million years ago) could have been able to produce a small repertoire of consonants and consonant-vowel combinations (i.e. syllables) and presumably to create the first human-like words.
Fifth, several studies identified conversational rules respecting key organisational properties guiding primate vocal exchanges such as turn-taking between callers, call-overlap avoidance and acoustic matching (e.g. Snowdon & Cleveland, 1984; Sugiura & Masataka, 1995 ... vocal exchanges in short-distance communication in a group of bonobos in captivity. The bonobos' vocal exchanges followed simple temporal (i.e. turn-taking, overlap avoidance) and social (i.e. interlocutor selectivity) rules, and the frequency of these vocal exchanges was only influenced by social bonds (established by the frequency of occurrence of peaceful spatial proximities). Interestingly, the frequencies of Japanese macaques' (Macaca fuscata) vocal exchanges were positively correlated with their social bonds (established by grooming duration) (Arlet et al., 2015), thus supporting the hypothesis that vocal exchanges can be interpreted as a form of 'grooming-at-a-distance' that facilitates the maintenance of social cohesion (Dunbar, 2003).
Seventh, studies of primate vocalisations reveal social learning skills as shown by social learning in juveniles of the appropriate context of use and meaning of calls (e.g. Seyfarth & Cheney, 1997; Lemasson et al., 2011; Bouchet, Koda & Lemasson, 2017).
Eighth, a growing body of evidence supports the existence of intentionality in primate vocal communication: (i) changes in call rates with the size and composition (sex ratio, kin, affiliative or hierarchical bond) of the audience (Roush & Snowdon, 2000; Wich & Sterck, 2003; Slocombe & Zuberbühler, 2007; Clay & Zuberbühler, 2012; Clay, Archbold & Zuberbühler, 2015) as well as with the state of knowledge of the receiver (Crockford et al., 2012); (ii) goal-directed signalling associated with gaze alternation (Schel et al., 2013); (iii) persistence and elaboration associated with changes in the acoustic structure of the repeated call to increase the chances of receiving a response (Koda, 2004); (iv) brain motor control of some calls produced during operant conditioning paradigms (Simões et al., 2010; Coudé et al., 2011; Hage & Nieder, 2013, 2015); and (v) dynamic control over the configuration of the vocal tract (Koda et al., 2012).
Ninth, recent investigations of statistical regularities in primate vocal communication reveal patterns consistent with Zipf's law of abbreviation (i.e. frequently used words tend to become shorter) [Semple, Hsu & Agoramoorthy, 2010; but see Ferrer-i-Cancho & Hernández-Fernández, 2013 for discussion about the visibility of the law] and with Menzerath's law (which predicts that longer sequences are made up of shorter constituents) (Gustison et al., 2016; Fedurek, Zuberbühler & Semple, 2017; Gustison & Bergman, 2017), suggesting that common linguistic laws underlie the structure of vocal communication in human and non-human primates.
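As an illustration only, the two linguistic laws mentioned above can each be expressed as an expected negative rank correlation: between how often a call type is used and its duration (Zipf's law of abbreviation), and between the number of calls in a sequence and the mean duration of those calls (Menzerath's law). The sketch below uses a plain Spearman rank correlation; the call labels, counts and durations are hypothetical values invented for the example and are not data from the studies cited, which may have used different statistical procedures.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical call types: (number of occurrences, mean duration in seconds).
call_types = {"a": (120, 0.15), "b": (60, 0.30), "c": (25, 0.45), "d": (5, 0.90)}
frequency = [n for n, _ in call_types.values()]
duration = [d for _, d in call_types.values()]
rho_zipf = spearman(frequency, duration)

# Hypothetical sequences: (number of calls in the sequence, mean call duration in seconds).
sequences = [(2, 0.50), (3, 0.42), (5, 0.35), (8, 0.28)]
rho_menzerath = spearman([n for n, _ in sequences], [d for _, d in sequences])

print(f"Zipf's law of abbreviation: rho = {rho_zipf:.2f}")
print(f"Menzerath's law: rho = {rho_menzerath:.2f}")
```

A negative coefficient in either case is in the direction the corresponding law predicts. With real recordings, one would also need a significance test and some control for the non-independence of calls from the same individual.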
(2) The gestural theory of language origins
The gestural theory of language origins predicts that gestures represent a precursor of human language (e.g. Corballis, 2002, 2003; Arbib et al., 2008; McNeill, 2012). A first argument supporting this theory is that primate gestural communication is more flexible in terms of learning and use than are primate vocalisations (e.g. Meguerditchian & Vauclair, 2008). Indeed, their flexible gestural communication enables primates to adapt to social context and to the social rank and age of conspecifics (e.g. Maestripieri, 1999; Pika et al., 2005a; Arbib et al., 2008; Pika, 2008; Hobaiter & Byrne, 2011b) leading to large variations in the composition, morphology and size of the gestural repertoire among individuals and among groups of a given species. For example, in chimpanzees, the use of certain species-typical gestures is restricted to particular age classes (Hobaiter & Byrne, 2011a). Older subjects are more likely to use the most effective gestures (i.e. gestures attaining the desired goals), and the number of gesture sequences they use, as well as their gestural repertoire, decreases with age. Hobaiter & Byrne (2011b, p. 829) defined a sequence of gestures as 'a series of more than one gesture without interspersed pauses >1 s, the criterion used by Genty & Byrne (2010)'. These findings confirm those of previous studies that the gestural repertoires of adult apes are smaller than those of juveniles (e.g. Tomasello et al., 1985, 1994; Tomasello, Gust & Frost, 1989).
A second argument supporting the gestural theory of language origins is provided by the discovery of so-called mirror neurons that exist in all primate brains [see reviews in Fabbri-Destro & Rizzolatti, 2008 and Tramacere, Pievani & Ferrari, 2017]. As shown by Gallese et al. (1996) for rhesus macaques, mirror neurons discharge both when a subject performs a given action and when it observes the same action being performed by an experimenter.
More recently, reports showed that mirror neurons can even be activated in response to hearing a sound related to a given action (Kohler et al., 2002), or when observing actions involving the use of tools (Ferrari, Rozzi & Fogassi, 2005) or mouth actions performed by a human social partner (Ferrari et al., 2003). However, non-action-related sounds (i.e. white noise and monkey calls) do not activate mirror neurons in monkeys (Kohler et al., 2002). The mirror neurons involved in the production and the perception of visuo-gestural actions and of oro-facial communication are located in area F5, which is homologous to the language production area of humans (e.g. Nishitani & Hari, 2000). Furthermore, the study of hemispheric specialisation for communication shows a predominance in the human left cerebral hemisphere of Broca's area (responsible for speech production) and Wernicke's area (responsible for understanding speech) (Horwitz et al., 2003; Xu et al., 2009) and of homologous areas in great apes (Gannon et al., 1998; Cantalupo & Hopkins, 2001; Cantalupo, Pilcher & Hopkins, 2003; Hopkins, Russell & Cantalupo, 2007a 2010; Spocter et al., 2010). Observations of great apes and monkeys in captivity suggest that their gestural communication is right-lateralised (e.g. Meguerditchian, Molesti & Vauclair, 2011; Meunier, Fizet & Vauclair, 2013b; Prieur et al., 2016a,b), although their gestural laterality can be modulated by several categories of factors such as positions of interactants and gesture sensory modality (see review in Prieur et al., 2019b). Interestingly, in chimpanzees and olive baboons, right-hand preference remains consistent over time, based on replicated measurements of hand preferences in the same individuals [chimpanzees for human-directed FOOD BEG (Meguerditchian, Vauclair & Hopkins, 2010); olive baboons for HAND SLAP (Meguerditchian, Molesti & Vauclair, 2011)].
Vocalisations are likely to be subject to less flexibility than gestures as several studies in primates highlight the strong dependence of spontaneous vocal production on brain areas controlling emotional reactions such as those involved in the limbic system [e.g. Aitken, 1981; Ploog, 1981; Preuschoft & Chivers, 1993; but see Versace, Endress & Hauser, 2008, Yamaguchi, Izumi & Nakamura, 2010, Gamba & Giacoma, 2010 and Filippi, 2016 for evidence of flexibility in primate vocalisations in relation to different emotional and environmental contexts].
Several key properties of human language are reflected in the complex systems of gestural and vocal communication in primates. Furthermore, reports show that monkey lip-smacking [a lip-smack is a rhythmic oro-facial expression commonly used during face-to-face affiliative interactions between primates (e.g. Van Hoof, 1962; Van Lawick-Goodall, 1968)] and adult human speech both exhibit a 3-8 Hz rhythmic frequency [humans (Greenberg et al., 2003; Chandrasekaran et al., 2009); gelada baboons (Bergman, 2013; Gustison & Bergman, 2017); rhesus macaques (Ghazanfar, Chandrasekaran & Morrill, 2010; Ghazanfar, 2012)]. In addition, the structure and development of macaque monkeys' lip-smacking is consistent with the rhythmic structure of human language, from infant babbling to adult speech (Morrill et al., 2012). Accumulated empirical and comparative evidence focusing on lip-smacking supports the hypothesis that the bimodal (visual and auditory) human speech rhythm could have evolved from the rhythmic facial expressions of ancestral primates (MacNeilage, 1998, 2008). Moreover, the literature provides evidence that eye behaviours, particularly eye gaze and eye blinking, are essential components to achieve, maintain and regulate mutual understanding in everyday social face-to-face interactions in humans and other primates (e.g. Goodwin, 1981; Emery, 2000; Bard et al., 2005; Csibra & Gergely, 2009; Shepherd, 2010; Innocenti et al., 2012; Rossano, 2013; Tada et al., 2013; Hömke, Holler & Levinson, 2018). For instance, reports show that the eyes embody diverse levels of signal value in relation to status, spatial attention, (dis)engagement, and emotional state of signaller and recipient (e.g. Emery, 2000; Shepherd, 2010; Rossano, 2013). These studies highlight important social interactive functions of eye behaviours that might have played a critical role in the evolution of human and primate communication systems.
Such findings have led a growing number of researchers to favour the alternative multimodal theory of language origins: the multimodal, rhythmic and social-interactive nature of human language would be the result, at least partly, of the coevolution of gestural, vocal, oro-facial and eye signalling (e.g. Arbib et al., 2008; Masataka, 2008; Lemasson, 2011; Slocombe, Waller & Liebal, 2011; Taglialatela et al., 2011; Ghazanfar, 2012; Gillespie-Lynch et al., 2014; Levinson & Holler, 2014; Liebal et al., 2014; Fröhlich et al., 2019). This is in agreement with data for humans (e.g. Bernardis et al., 2008; Gentilucci & Dalla Volta, 2008; Xu et al., 2009) suggesting that both speech and gestures could be under the control of a common integrated communication system located in the left cerebral hemisphere.
The recent development of powerful statistical tools (e.g. generalised linear models) in communication signal research has allowed a relatively small but growing number of multifactorial studies to suggest not only that language is a complex adaptive trait that has been shaped by evolution, but also that many interlinked factors can influence the acquisition and development of human and non-human primate communicative signalling throughout their life cycle: individual characteristics (genetics, epigenetics, sociodemographic factors), context-related characteristics (e.g. emotional/functional contexts, audience effect and interactional components) as well as behaviour and its characteristics (e.g. use of different types of uni- or multimodal signal combinations) (e.g. Arlet et al., 2015; Lemasson et al., 2016; Prieur et al., 2016a; Crockford, Wittig & Zuberbühler, 2017; Fröhlich et al., 2017; Hobaiter, Byrne & Zuberbühler, 2017; Fedurek et al., 2019).
Finally, we hypothesise that similarities and dissimilarities in the acquisition and development of intentional signalling within and between human and non-human primate species would result from differences in the costs (e.g. physiological energy budget, cognitive demands, risk of attracting predators and competitors) and benefits (e.g. accurate information content of the signal to coordinate behaviours, success in terms of survival and reproduction) of signal production and processing by interactants in relation to the four broad categories of evolutionary and life-cycle factors mentioned above (i.e. species, individual, context-related and behaviour characteristics). Key properties and associated abilities of language (involving gestures, vocalisations, facial expressions and eye behaviours) would have been selected for and developed based on a cost-benefit trade-off modulated by these four categories of factors and their mutual intertwinement.
We propose the following scenario for the evolutionary origins of language that complements our suggested scenario for the evolutionary origins of gestures (see Section II.3e): human language would have arisen as a result of the cognitive enrichment associated with changes to our primate ancestors' lifestyle, including ecological (moving from arboreal to terrestrial habitats with changes to features such as ambient noise, sound propagation properties, range of vision, food distribution and level of predation) and social (moving from a solitary to a multilevel society with changes to features such as level of cooperation and competition, cultural innovation and cultural transmission). The complexity of human language would have been enhanced further in several ways.
(1) The adoption of a bipedal posture and locomotion would have allowed our ancestors to use their arms and hands for display and refinement of gestural communication involving hand shape, location, trajectory and structure (e.g. McNeill, 1992; Kendon, 2004; Corballis, 2009) and adaptation of the human vocal tract to produce a wider range of sounds (including the quantal vowels [i], [u], and [a]) and of brain regions allowing learning and flexible cognitive control of vocalisations (e.g. Fitch et al., 2016; Lieberman, 2017; Bergman et al., 2019).
(2) The intensification and diversification of fitness-relevant short- and long-range daily social interactions (over shorter or longer periods of time) and space-time coordinated joint activities, notably bonding-gathering experience [e.g. 'vocal grooming-at-a-distance' (Dunbar, 2003); food-sharing (Burkart et al., 2018); campfire (Dunbar, 2014); music (Masataka, 2009) ... Gowlett, 2016; Smith et al., 2017) associated with more complex social systems (e.g. Freeberg, Dunbar & Ord, 2012) and the development of our social communication system, particularly of vocal communication (e.g. MacNeilage, 2008; Hurford, 2014) would have maximised the cost-benefit trade-off discussed above. Vocal communication is the most efficient communication channel in terms of energy cost (e.g. Russell, Cerny & Stathopoulos, 1998), combinatoriality (capacity to combine phonemes into larger units, morphemes and words, that are combined into sentences) and generality (capacity to produce an infinite number of ends using finite means, allowing phonology and morphosyntax) (e.g. Liebal, Call & Tomasello, 2004a; Coye et al., 2018) as well as transmission/reception success rate (vocal signals can be communicated rapidly in all directions over long distances to many individuals, whatever their attentional state or their location). These properties give vocal communication significant advantages in a wide range of interactive activities (e.g. bonding-gathering, travelling, foraging, teaching) and could explain why gradually speech became the main channel of communication rather than the visual channels of gestures and facial expressions [see also ...]. By minimising the costs and maximising the benefits, this rich, diverse, dynamic and increasingly complex and stimulating environment would have been crucial in the progressive development and coevolution of anatomical structures, sociocognition, communication components (gestural, vocal, oro-facial, and eye) and associated properties (e.g.
intentionality, turn-taking, referentiality, grammatical rules and iconicity), probably linked through a slow self-organisation process (e.g. Lindblom, MacNeilage & Studdert-Kennedy, 1984; Oudeyer, 2005; De Boer, 2017), to build the multimodal, rhythmic and social interactive nature of language. Recent genetic, palaeontological and archaeological data suggest that language and speech, once thought unique to modern humans, are ancient communication systems shared with Neanderthals, Homo neanderthalensis (Dediu & Levinson, 2012). Human communication systems [linguistic vocalisations, e.g. British English, American English, Singapore English 'Singlish', Indian English 'Hinglish' and Kenyan English (Pederson, 2001; Mesthrie & Bhatt, 2008; Gonçalves et al., 2018); gestures, e.g. British 'Victory-sign' and 'Thumbs-up', and American 'OK-circle-sign' (Morris, 2002); facial expressions, e.g. voluntary funny or scary grimaces] are inevitably still evolving in response to the increasing complexity of our sociocognitive-communication environment, sophisticated communication technologies and travel patterns.
To summarise, the literature suggests that some language properties are shared with either gestures (intentionality, goal directedness, and probably social learning), vocalisations (referentiality, protogrammar, conversational rules and linguistic laws), oro-facial expressions (rhythmic structure) and/or eye behaviours (mutuality, i.e. sharing of social perception, signals and emotions between interactants) of primates. Furthermore, many complementary (evolutionary and life cycle) factors influence human and non-human primate communicative signalling. These findings led us to propose the multicausal theory of language origins (Fig. 3). Finally, it is necessary to investigate the evolutionary roots of language by using all findings from different areas of investigation. To achieve this, we need a comprehensive (multimodal and multifactorial) and integrated approach combined with suitable data collection and statistical analysis methods (i.e. experiments designed considering sample size, independence of data, confounding factors, and statistical tests used) in order to understand better the full complexity of primate social cognition and communication skills (e.g. Slocombe, Waller & Liebal, 2011; Waller et al., 2013; Leavens, Bard & Hopkins, 2017; Prieur et al., 2018b). Addressing Tinbergen's (1963) four fundamental questions (i.e. function, ontogeny, phylogeny, and mechanism) in detail will help us to investigate primate signalling more accurately (see also Bateson & Laland, 2013), by determining why, how and in what social context(s) gestural, vocal, oro-facial and eye signals are used either separately or jointly.

IV. FUTURE DIRECTIONS
Despite substantial research on human and non-human primate communication systems, empirical evidence on the evolution of language properties and associated abilities is still limited. We highlight the following eight important issues for future research: (1) characterisation of signalling behaviour based on key criteria of language properties (e.g. intentionality); (2) investigation of relationships between the expression and processing of emotions and intentional signals; (3) consideration of expressed and understood repertoires and use of signals to communicate; (4) comparisons of signal asymmetries both within and among uni-, bi-, multimodal and multicomponent communication functions; (5) exploration of relationships between communication functions and species- and individual-specific psychological/physiological characteristics (e.g. motivation and personality); (6) investigation of language-like properties (e.g. intentionality, referentiality and turn-taking) and associated abilities (e.g. intentional abilities such as monitoring of the audience and elaboration) in communication systems of primates (great and small apes, Old and New World monkeys) living in different social and ecological niches; (7) adoption and development of appropriate statistical and methodological tools (e.g. generalised linear model and social network analyses, computer modelling); and (8) the application of a fine-grained combined evolutionary, developmental, functional and mechanistic approach to study gestural, vocal, oro-facial, eye and/or olfactory signalling. Addressing these issues should enable us to deepen our understanding of the multidimensional nature of human and non-human primate communication systems, an essential stage to propose a robust evolutionary trajectory of the sociocognitive-communicative properties and associated abilities leading to language.

V. CONCLUSIONS
(1) Research on primate gestural communication, particularly that of great apes, suggests that gestures played a key role in the emergence of the multimodal nature of language. This view is supported by several arguments, including their large gestural repertoires and the associated multifaceted nature of intentionality as well as the role of the left hemisphere in primate gestural communication. We considered the debate on the origins of gestures and reviewed three non-mutually exclusive processes that could explain gestural acquisition and development in great apes: phylogenetic ritualisation, ontogenetic ritualisation, and learning via social negotiation. We hypothesised a scenario concerning the phylogenetic and ontogenetic origins of gestures involving a central role of emotions and different mechanisms of concept learning.
(2) Increasing behavioural and neurological evidence supports a multimodal origin of language: gestural, vocal, oro-facial and eye components and associated characteristics (e.g. intentionality, goal directedness, referentiality, protogrammar, conversational rules, and rhythm) would have coevolved to elaborate an increasingly complex, dynamic and varied human verbal and non-verbal communication system. We emphasise that the origins of language are not only multimodal, but more broadly multicausal. This led us to propose the multicausal theory of language origins: human and non-human primate communicative signalling is a complex trait that would have evolved in response to a cost-benefit trade-off of signal production and processing of interactants in relation to the close interrelationships between four broad categories of evolutionary and life-cycle factors, namely species, individual and context-related characteristics as well as behaviour and its characteristics. Taking a wider evolutionary perspective into account, we hypothesise that such an integrative and explanatory framework could explain the origins of not only language and laterality (Prieur et al., 2019b) but also of many traits in humans and other animals (e.g. cooperation, culture, episodic and semantic memory, foraging, learning, mind reading, morality, parental care, reproduction and tool use). Further investigations are necessary to test the ubiquity of this framework in animal science. understanding of the evolutionary mechanisms underlying the acquisition and development of human and non-human primate communication systems.

VI. ACKNOWLEDGMENTS
This study was performed in the framework of a PhD funded by the French Ministry of Research and Technology with additional financial support from Rennes Metropole and the VAS Doctoral School. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 407023904. We are very grateful to Ann Cloarec for correcting the English.