Macaque claustrum, pulvinar and putative dorsolateral amygdala support the cross‐modal association of social audio‐visual stimuli based on meaning

Social communication draws on several cognitive functions such as perception, emotion recognition and attention. The association of audio‐visual information is essential to the processing of species‐specific communication signals. In this study, we use functional magnetic resonance imaging in order to identify the subcortical areas involved in the cross‐modal association of visual and auditory information based on their common social meaning. We identified three subcortical regions involved in audio‐visual processing of species‐specific communicative signals: the dorsolateral amygdala, the claustrum and the pulvinar. These regions responded to visual, auditory congruent and audio‐visual stimulations. However, none of them was significantly activated when the auditory stimuli were semantically incongruent with the visual context, thus showing an influence of visual context on auditory processing. For example, positive vocalization (coos) activated the three subcortical regions when presented in the context of positive facial expression (lipsmacks) but not when presented in the context of negative facial expression (aggressive faces). In addition, the medial pulvinar and the amygdala presented multisensory integration such that audiovisual stimuli resulted in activations that were significantly higher than those observed for the highest unimodal response. Last, the pulvinar responded in a task‐dependent manner, along a specific spatial sensory gradient. We propose that the dorsolateral amygdala, the claustrum and the pulvinar belong to a multisensory network that modulates the perception of visual socioemotional information and vocalizations as a function of the relevance of the stimuli in the social context.

Significance statement: Understanding and correctly associating socioemotional information across sensory modalities, such that happy faces predict laughter and escape scenes predict screams, is essential when living in complex social groups. Using functional magnetic resonance imaging in the awake macaque, we identify three subcortical structures (dorsolateral amygdala, claustrum and pulvinar) that only respond to auditory information that matches the ongoing visual socioemotional context, such as hearing positively valenced coo calls while seeing positively valenced mutual grooming monkeys.
We additionally describe task-dependent activations in the pulvinar, organized along a specific spatial sensory gradient, supporting its role as a network regulator.

| INTRODUCTION
In a wide variety of species, social communication often involves sending and receiving systems that must integrate information across different modalities, recruiting multiple processes such as sensory perception, emotion processing and attention. Communication has been extensively described as a multisensory process relying on the association of visual and auditory components of vocal signals based on their temporal and spatial congruence (Murray & Wallace, 2012; Partan, 2013; Partan & Marler, 2005). Macaques are notably successful in matching specific calls, such as coos, with the proper facial expression, thus associating the specific lip and mandible configuration with the matching call (Ghazanfar & Logothetis, 2003; Hauser et al., 1993; Hauser & Ybarra, 1994). This cross-modal association (defined as the binding or association of stimuli or experiences that come from different sensory modalities) of visual information with vocal signals is not fully explained by the dynamic properties of the visual input that are associated with low-level features and is shown to be specific to face processing. At the neuronal level, this cross-modal association is subserved by multisensory integrative mechanisms, such that a matching pair of faces and vocalizations triggers a neuronal response that is significantly different from that to the most effective unimodal stimulus in the pair (Ghazanfar et al., 2005). In addition to low-level dynamic visual features such as facial movements or local colour contrasts, faces and vocalizations also contain higher level information such as emotions and semantics or social meaning. Whether monkeys can associate faces and vocalizations based on their meaning rather than on their low-level matching features was unknown. For example, a scene of escaping monkeys is expected to be associated with a scream, because they describe the same event, but not with a coo. This cross-modal association based on stimulus meaning is expected to build on the knowledge acquired from previous experience with this type of stimuli. Macaques scream when they experience fear, triggered by potential danger due to complex social situations involving conspecifics or heterospecifics. Macaques also coo during positive social interactions, such as friendly approach, feeding or group coordinated displacement (Gouzoules et al., 1984; Hauser & Marler, 1993). This raises the question of whether hearing a scream generates a visual representation of antagonistic as opposed to positive social situations and whether seeing an antagonistic situation sets up the expectation for screams, but not coos, to be produced. Such a mechanism could be crucial to social cognition and may have subserved the development of language. Indeed, in humans, the core language system is amodal, as our phonology, semantics and syntax do not depend on whether the input is auditory (speech) or visual (sign). In a recent study, we demonstrated using cardiac recordings (Froesel et al., 2020) and functional magnetic resonance imaging (fMRI) that macaques associate communicatively salient audio-visual information based on the social context set by visual information. This process is mediated by a network of face and voice patches in the superior temporal sulcus (STS) and lateral sulcus (LS) (Froesel et al., 2022). We propose that this could correspond to a supramodal representation of social stimuli, that is, a representation that does not depend on the encoding sensory modality and that serves to attribute meaning to complex social stimuli.
Given that the task involves functionally meaningful communicative signals that are emotionally salient, we predicted strong amygdala activation. This subcortical nucleus contains face-selective neurons that are globally activated both by face identity and facial expressions (Fitzgerald et al., 2006; Livneh et al., 2012; Nakamura et al., 1992; Pessoa et al., 2006; Sergerie et al., 2008; Todorov, 2012). Specifically, affiliative facial expressions induce a decrease of firing rates in the amygdala, while threatening faces induce an increase in firing rates (Gothard et al., 2007). Note that the part of the amygdala responding to face stimuli in macaque fMRI studies is located in the dorsolateral part of the nucleus, across the white matter from the claustrum (Hadj-Bouziane et al., 2008, 2012; Schwiedrzik et al., 2015). Given the type of stimuli used in the present study, we thus expect to observe activations specifically in this part of the amygdala. Overall, beyond face processing, the amygdala is part of a social perception network and is proposed to play a central role in emotional encoding during complex social interactions (Barraclough & Perrett, 2011; Leonard et al., 1985), including during cross-modal sensory emotional processing (Dolan et al., 2001; Kuraoka & Nakamura, 2007). In addition to the amygdala, the pulvinar, the largest nucleus of the thalamus, is also involved in face processing, emotion regulation and multisensory integration (Froesel et al., 2021; Moeller et al., 2008; Pessoa, 2010a). This nucleus is subdivided into several parts, each exhibiting different activation patterns based on the sensory modality as well as a distinct connectivity with other brain regions. Specifically, the inferior pulvinar is mainly responsive to visual stimuli, the anterior pulvinar to auditory stimuli, and the medial part, which is strongly connected to cortical multisensory areas, is expected to display multisensory responses (see for review Froesel et al., 2021). Last, the claustrum is also activated following stimulation of cortical face-selective areas, also called face patches, and is proposed to play a role in face perception (Moeller et al., 2008). We thus hypothesize that all of these subcortical structures are activated by visual and auditory social stimuli such as macaque faces, macaque group scenes and macaque vocalizations and may play a part in multisensory integration. Social perception also involves integrating contextual, behavioural and emotional information (Freiwald, 2020; Ghazanfar & Santos, 2004). We thus predict that the audio-visual association of social information is impacted by the functional context and mediated by these subcortical structures.
Here, we describe an fMRI study in awake macaques performing a passive audio-visual task manipulating the congruency of cross-modal audio-visual social information across six different emotional contexts, as well as a passive visual task manipulating monkey facial expressions. Complementing our previous report of the cortical activation patterns, we show that subcortical activations are determined by whether monkey vocalizations match the facial expressions associated with them. As predicted, the identified subcortical structures included the claustrum and the putative dorsolateral amygdala, which respond to all of the visual, auditory and audio-visual stimuli, demonstrating their involvement in audio-visual cross-modal association, as well as the pulvinar, which is activated along a sensory gradient determined by stimulation modality (visual, auditory or audio-visual) or context. The functional activations observed in the medial pulvinar are similar to those observed in the putative dorsolateral amygdala and in the claustrum, thus suggesting a coordinated role of these three subcortical structures in multisensory social processing. In addition, the putative dorsolateral amygdala and the medial pulvinar express multisensory integration, such that congruent social audio-visual stimuli trigger a hemodynamic response that is significantly different from that to the most effective unimodal stimulus in the pair. Overall, our observations shed new light on how functionally meaningful multimodal social signals are processed by subcortical structures.

| MATERIALS AND METHODS

| Subjects and surgical procedures
Two male rhesus monkeys (Macaca mulatta) participated in the study (T, 15 years, 10 kg and S, 12 years, 11 kg). The animals were implanted with a PEEK MRI-compatible headset covered by dental acrylic. Anaesthesia for the surgery was induced by Zoletil (tiletamine-zolazepam, Virbac, 5 mg/kg) and maintained by isoflurane (Belamont, 1-2%). Post-surgery analgesia was ensured with Temgesic (buprenorphine, .3 mg/ml, .01 mg/kg). During recovery, proper analgesic and antibiotic coverage was provided. The surgical procedures conformed to European and National Institutes of Health guidelines for the care and use of laboratory animals. The project was authorized by the French Ministry for Higher Education and Research (project nos. 2016120910476056 and 1588-2015090114042892) in accordance with the French transposition texts of Directive 2010/63/UE. This authorization was based on an ethical evaluation by the French Committee on the Ethics of Experiments in Animals (C2EA) CELYNE, registered at the national level as C2EA number 42.

| Experimental setup
During the scanning sessions, monkeys sat in a sphinx position in a plastic monkey chair (Vanduffel et al., 2001) facing a translucent screen placed 60 cm from the eyes. Visual stimuli were retro-projected onto this translucent screen. The monkeys' heads were restrained, and auditory stimuli were delivered through Sensimetrics MRI-compatible S14 insert earphones. The monkey chair was secured in the MRI with safety rubber stoppers to prevent any movement. Eye position (X, Y, right eye) was recorded with a pupil-corneal reflection video-tracking system (EyeLink at 1000 Hz, SR-Research) interfaced with a program for stimulus delivery and experimental control (EventIDE®). Monkeys were rewarded for maintaining fixation within a 2° × 2° tolerance window around the fixation point.
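
The fixation-control logic can be summarized in a few lines. The following is a minimal, hypothetical Python sketch, assuming gaze samples expressed in degrees of visual angle relative to the fixation point; it is not EventIDE code, and all names are illustrative.

```python
# Minimal, hypothetical sketch of the fixation-control logic described
# above. Gaze samples (X, Y) are assumed to be in degrees of visual
# angle, centred on the fixation cross; EventIDE's actual API differs.

def in_tolerance_window(x_deg, y_deg, half_width=1.0):
    """True if gaze falls inside the 2 x 2 degree window around fixation."""
    return abs(x_deg) <= half_width and abs(y_deg) <= half_width

def fixation_ratio(samples):
    """Fraction of gaze samples inside the window (used below to select runs)."""
    inside = sum(in_tolerance_window(x, y) for x, y in samples)
    return inside / len(samples)

# Example on a short, mock stretch of 1000-Hz samples
samples = [(0.2, -0.1), (0.4, 0.3), (1.5, 0.0), (0.1, 0.1)]
print(f"{fixation_ratio(samples):.2f} of samples within the window")
```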

| General audio-visual run design
On each run, monkeys were required to fixate a central cross on the screen (Figure 1a). Runs followed a block design. Each run started with 10 s of fixation in the absence of sensory stimulation, followed by three repetitions of a pseudo-randomized sequence containing six possible 16-s blocks: fixation (Fx), visual (Vi), auditory congruent (AC), auditory incongruent (AI), congruent audio-visual (VAC) and incongruent audio-visual (VAI). Each block (except the fixation block) consisted of alternating 500-ms stimuli of the same semantic category (except for lipsmacks, presented as a succession of 1-s dynamic stimuli; see Stimuli section below), in the visual, auditory or audio-visual modalities. Each block ended with 10 s of fixation in the absence of sensory stimulation. Note that within any one run, the visual stimulations were prevalent and always reflected the same emotional content, thus setting the emotional context of the run. The initial blocks always contained a visual stimulation (Vi, VAC or VAI) such that unimodal auditory blocks could be defined as congruent or incongruent relative to the visual context set by previous blocks.
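
To make the run structure concrete, here is a minimal Python sketch of the block sequencing described above. The randomization constraint is simplified relative to the actual counterbalancing (each block preceded, on average, by the same number of blocks from the other conditions; see Figure 1), and the block labels are those of the text.

```python
import random

# Illustrative sketch of the audio-visual run structure described above:
# three pseudo-randomized repetitions of the six 16-s blocks; each block
# is followed by 10 s of fixation (not modelled here). The constraint
# handling is simplified relative to the actual counterbalancing.

BLOCKS = ["Fx", "Vi", "AC", "AI", "VAC", "VAI"]
VISUAL_BLOCKS = {"Vi", "VAC", "VAI"}

def make_run(seed=0):
    rng = random.Random(seed)
    sequence = []
    for repetition in range(3):
        blocks = BLOCKS[:]
        rng.shuffle(blocks)
        # The first stimulation block of the run must contain visual
        # information so that auditory blocks inherit a visual context.
        if repetition == 0:
            while blocks[0] not in VISUAL_BLOCKS:
                rng.shuffle(blocks)
        sequence.extend(blocks)
    return sequence

print(make_run(seed=42))  # 18 block labels, e.g. ['VAC', 'Fx', 'AI', ...]
```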

| Face and social task design
Six audio-visual contexts were presented to both monkeys, organized in runs as described above (Figure 1b). Each run combined visual stimuli of identical social content with either semantically congruent or incongruent monkey vocalizations (Figure 1b). The face affiliative context (F+) combined lipsmacks with coos and aggressive calls. The face aggressive context (F−) combined aggressive faces with coos and aggressive calls. The first social affiliative context (S1+) combined mutual grooming scenes with coos and aggressive calls. The second social affiliative context (S2+) combined mutual grooming scenes with coos and screams. The social aggressive context (S1−) combined aggressive group or individual scenes with coos and aggressive calls. The social escape context (S2−) combined fleeing groups or individuals with coos and screams. Importantly, pairs of contexts (F+ & F−; S1+ & S1−; S2+ & S2−) shared the same auditory conditions but opposite social visual content.

| Unimodal visual task design
The design of the visual runs was similar to that of the audio-visual runs, organized in blocks, except that all blocks were unimodal visual blocks and varied as a function of facial emotion. The six possible 16-s blocks were fixation (Fx), lipsmack (Lip), scared monkey faces (Sca), aggressive monkey faces (Aggr), neutral monkey faces (Neu) and scrambled monkey faces (Scr). As for the audio-visual runs, each block consisted of an alternation of 500-ms stimuli of the same emotional category (except for lipsmacks, presented as a succession of 1-s dynamic stimuli; see Figure 1c).

| Social stimuli
Vocalizations were recorded by Marc Hauser from semi-free-ranging rhesus monkeys during naturally occurring situations. Detailed acoustic and functional analyses of this repertoire have been published elsewhere (e.g., Gouzoules et al., 1984; Hauser & Marler, 1993). Field recordings were then processed, restricting the selection of experimental stimuli to calls that were recorded from known individuals, in clearly identified situations, and that were free of competing noise from the environment. Exemplars from this stimulus set have already been used in several imaging studies (Belin et al., 2007; Cohen et al., 2007; Romanski, 2012; Romanski et al., 2005; Russ et al., 2008). As in our previous study (Froesel et al., 2022), all stimuli were normalized in luminance and colour, but the frequency ranges varied between the different types of stimuli, as shown in Figure S7. For each of the three vocalization categories, we used 10 unique exemplars coming from matched male and female individuals, in order to control for possible gender, social hierarchy or individual effects. Coos are affiliative vocalizations, aggressive calls are used as a precursor of a physical attack, and screams are produced by subordinates being chased or attacked by a dominant. Facial expression (lipsmacks and aggressive facial expressions) and social scene (group grooming, aggressive individuals alone or in groups, escaping individuals or groups) stimuli were extracted from videos collected by the Ben Hamed lab, as well as by Marc Hauser on Cayo Santiago, Puerto Rico. Images were normalized for average intensity and size. All stimuli were 4° × 4° in size. We decided to keep them in colour to get closer to natural stimuli, even though this produced a greater luminosity disparity between the different stimuli, preventing us from using pupil diameter as a physiological marker. Only unambiguous facial expressions and social scenes were retained. A 10% blur was applied to all images, in the hope of triggering multisensory integration processes (Stein & Meredith, 1993) (but see Section 3). For each visual category, 10 stimuli were used. The scrambling of the images was performed with EventIDE (https://www.okazolab.com) and applied to each visual stimulus displayed in the task.

| Scanning procedures
The in-vivo MRI scans were performed on a 3T Magnetom Prisma system (Siemens Healthineers, Erlangen, Germany).
Anatomical images: For the anatomical MRI acquisitions, monkeys were first anaesthetized with an intramuscular injection of ketamine (10 mg/kg). The subjects were then intubated and maintained under 1-2% isoflurane. During the scan, animals were placed in a sphinx position in a Kopf MRI-compatible stereotaxic frame (Kopf Instruments, Tujunga, CA). Two L11 coils were placed on each side of the skull, and an L7 coil was placed on top of it. T1-weighted anatomical images were acquired for each subject using a magnetization-prepared rapid gradient-echo (MPRAGE) pulse sequence. Spatial resolution was set to .5 mm, with TR = 3000 ms, TE = 3.62 ms, inversion time (TI) = 1100 ms, flip angle = 8°, bandwidth = 250 Hz/pixel, 144 slices. T2-weighted anatomical images were acquired per monkey using a Sampling Perfection with Application-optimized Contrasts using different flip angle Evolution (SPACE) pulse sequence. Spatial resolution was set to .5 mm, with TR = 3000 ms, TE = 366.0 ms, flip angle = 120°, bandwidth = 710 Hz/pixel, 144 slices.
fMRI images were acquired in awake macaque monkeys as follows: before each scanning session, a contrast agent composed of monocrystalline iron oxide nanoparticles (Molday ION™) was injected into the animal's saphenous vein (9-11 mg/kg) to increase the signal-to-noise ratio (SNR) (Leite et al., 2002; Vanduffel et al., 2001). Monkeys were trained to stay calm so that the experimenter could perform this injection without anaesthesia and were rewarded for it. They were also trained to stay calm during the fMRI scans. We acquired gradient-echo echo-planar images covering the whole brain (TR = 2000 ms; TE = 18 ms; 37 sagittal slices; resolution: 1.25 × 1.25 × 1.38 mm anisotropic voxels) using an eight-channel phased-array receive coil and a loop radial transmit-only surface coil (MRI Coil Laboratory, Laboratory for Neuro- and Psychophysiology, Katholieke Universiteit Leuven, Leuven, Belgium; see Kolster et al., 2014). The coils were placed so as to maximize the signal over the temporal lobe.

| Data description
In total, for the audio-visual runs (155 pulses), 76 runs were collected over 12 sessions for monkey T and 65 runs over 9 sessions for monkey S. For the unimodal visual runs (155 pulses), 13 runs were collected over 8 sessions for monkey T and 12 runs over 5 sessions for monkey S. On the basis of each monkey's fixation quality during each run (at least 85% of the time within the eye fixation tolerance window), we selected 60 runs from monkey T and 59 runs from monkey S in total, that is, 10 runs per audio-visual context (except for one context of monkey S), as well as 13 unimodal visual runs for T and 11 for S.
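
The run-selection criterion reduces to a simple threshold on the per-run fixation ratio. Below is a hypothetical illustration with mock run records (the real values come from the EyeLink traces):

```python
# Hypothetical illustration of the run-selection criterion: keep runs in
# which the animal spent at least 85% of the time within the fixation
# tolerance window. Run records are mock data, not the actual sessions.

runs = [
    {"monkey": "T", "run": 1, "fixation_ratio": 0.91},
    {"monkey": "T", "run": 2, "fixation_ratio": 0.78},
    {"monkey": "S", "run": 1, "fixation_ratio": 0.88},
]

selected = [r for r in runs if r["fixation_ratio"] >= 0.85]
print(f"kept {len(selected)} of {len(runs)} runs")
```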

| Data analysis
Data were pre-processed and analysed using AFNI (Cox, 1996), FSL (Jenkinson et al., 2012), SPM (version SPM12, Wellcome Department of Cognitive Neurology, London, UK, https://www.fil.ion.ucl.ac.uk/spm/software/), the JIP analysis toolkit (http://www.nitrc.org/projects/jip) and Workbench (https://www.humanconnectome.org/software/get-connectome-workbench). The T1-weighted and T2-weighted anatomical images were processed according to the HCP pipeline (Autio et al., 2020; Glasser et al., 2013) and were normalized into the MY19 Atlas (Donahue et al., 2016). Functional volumes were corrected for head motion, slice-time corrected with reference to the middle image of the run, and skull-stripped. They were then linearly realigned onto the T2-weighted anatomical image with flirt from FSL, and image distortions were corrected using nonlinear warping with JIP. Spatial smoothing was applied with a 3-mm full width at half maximum (FWHM) Gaussian kernel.
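
As an illustration only, this functional preprocessing chain could be scripted as below; file names are placeholders, the exact options used in the study are not reported here, and the JIP nonlinear warping step is left as a comment because its invocation is site-specific.

```python
import subprocess

# Schematic of the functional preprocessing chain, using standard
# AFNI/FSL command-line tools. File names are placeholders; the exact
# options used in the study are not reported here.

run = "run01.nii.gz"

# 1. Head-motion correction (AFNI), registering every volume of the run
#    to a single reference volume. (Slice-timing correction, referenced
#    to the middle image of the run, would accompany this step.)
subprocess.run(["3dvolreg", "-prefix", "run01_mc.nii.gz",
                "-base", f"{run}[0]", run], check=True)

# 2. Linear (rigid-body, 6 degrees of freedom) realignment onto the
#    T2-weighted anatomy with FSL flirt.
subprocess.run(["flirt", "-in", "run01_mc.nii.gz", "-ref", "T2w.nii.gz",
                "-out", "run01_anat.nii.gz", "-omat", "func2anat.mat",
                "-dof", "6"], check=True)

# 3. Nonlinear distortion correction with the JIP toolkit would go here;
#    its invocation is site-specific and is therefore omitted.

# 4. Spatial smoothing with a 3-mm FWHM Gaussian kernel
#    (sigma = FWHM / 2.3548, i.e. about 1.27 mm).
subprocess.run(["fslmaths", "run01_anat.nii.gz", "-s", "1.27",
                "run01_smoothed.nii.gz"], check=True)
```
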
Fixed-effect individual analyses were performed for each monkey, with a level of significance set at p < .05 corrected for multiple comparisons (family-wise error [FWE], t-scores 4.6) and p < .001 (uncorrected level, t-scores 3.09). Head motion and eye movements were included as covariates of no interest. Because of the contrast agent injection, a specific MION hemodynamic response function (HRF) (Vanduffel et al., 2001) was used instead of the BOLD HRF provided by SPM. The main effects were computed over both monkeys. In most analyses, face blocked conditions and social blocked conditions were independently pooled.
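
For intuition, the sketch below builds one block regressor by convolving a 16-s boxcar with an inverted gamma-variate kernel. The kernel parameters are placeholders, not the MION impulse response of Vanduffel et al. (2001); the only properties relied on here are that the MION response is slower than the BOLD response and inverted in sign.

```python
import numpy as np
from scipy.stats import gamma

# Sketch of one GLM block regressor: a 16-s boxcar convolved with an
# inverted gamma-variate kernel standing in for the MION impulse
# response. The parameters below are illustrative placeholders, not the
# kernel of Vanduffel et al. (2001).

TR = 2.0                      # seconds
N_SCANS = 155                 # pulses per run
t = np.arange(0, 40, TR)      # kernel support, seconds

hrf_mion = -gamma.pdf(t, a=3.0, scale=4.0)   # sluggish, inverted kernel
hrf_mion /= np.abs(hrf_mion).sum()

boxcar = np.zeros(N_SCANS)
onset, length = 5, 8          # a 16-s block (8 TRs) starting at scan 5
boxcar[onset:onset + length] = 1.0

regressor = np.convolve(boxcar, hrf_mion)[:N_SCANS]
# 'regressor' would enter the design matrix alongside the head-motion
# and eye-movement covariates of no interest.
```
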
To identify the precise anatomical location of the activations, we used the Subcortical Atlas of the Rhesus Macaque (SARM) (Hartig et al., 2021) (see Figure S1).The activation maps are coregistered on the MY19 Atlas (Donahue et al., 2016) as well as the SARM Atlas, thus providing a direct overlap between the two.
Region of interest (ROI) analyses were performed as follows. The anterior pulvinar and claustrum ROIs were determined from the auditory congruent contrast (AC vs. Fx), and the medial pulvinar and amygdala ROIs from the audio-visual contrast (VAC vs. Fx) of the face context (Figure 2). PLvl and PLdm ROIs were extracted from the unimodal visual runs (all conditions vs. fixation contrast; Figure 4). ROIs were defined as 1-mm diameter spheres centred around the local peaks of activation. Of note, there was no overlap between the different clusters (see Figure S1 for the representation of the activation peaks and selected regions of interest and Table S3 for the exact coordinates of the centre of the ROIs selected for each ROI-based analysis). For each ROI, the activity profiles were extracted with the Marsbar SPM toolbox (marsbar.sourceforge.net), and the mean percent signal change (%SC) (± standard error of the mean across runs) was calculated for each condition relative to the fixation baseline. As the face context includes both aggressive and lipsmack expressions, for the unimodal visual runs we combined these two expression conditions and focused on this combination, so that the analysis would be comparable with the %SC results from the audio-visual runs. %SC values were compared using Friedman non-parametric tests and Wilcoxon non-parametric paired tests.
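
These ROI statistics reduce to an omnibus Friedman test followed by paired Wilcoxon tests on per-run %SC values. A minimal sketch with simulated data (the real values come from the Marsbar extraction):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Sketch of the ROI statistics on mock per-run %SC values for one ROI
# (in the real analysis these values come from the Marsbar extraction).

rng = np.random.default_rng(0)
n_runs = 20
sc = {
    "V":   rng.normal(0.4, 0.2, n_runs),
    "AC":  rng.normal(0.5, 0.2, n_runs),
    "VAC": rng.normal(0.8, 0.2, n_runs),
    "AI":  rng.normal(0.0, 0.2, n_runs),
}

# Omnibus comparison across conditions (Friedman non-parametric test)
stat, p = friedmanchisquare(sc["V"], sc["AC"], sc["VAC"], sc["AI"])
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Paired follow-up comparisons (Wilcoxon non-parametric paired tests)
for cond in ("V", "AC", "AI"):
    w, p = wilcoxon(sc["VAC"], sc[cond])
    print(f"VAC vs {cond}: W = {w:.1f}, p = {p:.4f}")
```
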
In the present study, stimuli are associated based on their functions, determined by their naturally occurring context (e.g., coos are associated with mutual grooming) and possible contingencies or predictability (e.g., screams are associated with or predictive of escape scenes). The described audio-visual association thus goes beyond the strict definition of two sensory inputs produced by a common source. Bimodal and unimodal activations are quantified as follows (a schematic implementation of both criteria is sketched after the definitions):

1. Definition of multisensory integration in our fMRI protocol: The criteria used to define multisensory integration at the neuronal level (Avillac et al., 2004, 2007; Murray & Wallace, 2012; Stein et al., 2009) do not apply in fMRI. For example, it is now established that it is quite difficult to reach the superadditive criterion with MRI, that is, a bimodal response significantly higher than the sum of the two independent unimodal responses. This can be due to the variability of the responses of multisensory neurons (some exhibiting suppression whereas others exhibit enhancement, and their interdigitated location among substantial populations of unisensory neurons), although some fMRI studies did show superadditivity (Stevenson et al., 2009; Werner & Noppeney, 2010). As a result, in fMRI studies, the maximum and additive criteria are now the most used criteria to investigate multisensory integrative processes in hemodynamic activations (Beauchamp, 2005; Pollick et al., 2011; Tyll et al., 2013; Cléry et al., 2017; see for review Stevenson et al., 2014). In the present study, we use the following criterion for multisensory integration: multisensory activations had to be significantly different from each of the two unisensory responses and significantly different from the control fixation condition. All reported multisensory integrative responses in the present manuscript were significantly higher than each of the two unisensory responses (in other words, none of the reported multisensory integrative responses was significantly lower than either of the two unisensory responses). Note that the identification of multisensory integration at the population level should be validated by other methods at the single-neuron level.
2. Definition of cross-modal association in our fMRI protocol: Cross-modal association is defined as the effect of one sensory modality on the processing of a second modality, presented either synchronously or asynchronously. The effect of temporal asynchrony has classically been investigated within a range of 100-200 ms of asynchrony (Schormans & Allman, 2018, 2023). However, cross-modal associations have also been reported at longer time ranges, whereby a general context determines whether cross-modal association takes place or not. As an example, auditory stimuli (e.g., coos, which always activate the auditory cortex in the LS) can either result in activations in the STS when congruent with a general visual context (e.g., grooming) or no STS activations when incongruent with the general visual context (e.g., escape) (Froesel et al., 2022). This type of cross-modal association probably arises from long-term learning and can result in more voxels/regions being recruited to process the stimuli (Archakov et al., 2020) or in a change in degree of activation (Froesel et al., 2022). Here, we manipulated the visual context by showing a single category of visual images of relevant emotional content per run. Setting a specific visual context aimed at identifying the effect of vision on subsequent auditory processing. Cross-modal association was determined as a significant difference in the auditory activations depending on whether they were congruent (AC) or incongruent (AI) with the context set by the visual stimuli in any given run, thus indicating that how the brain processes the auditory modality depended on the network effects of the preceding visual stimulations.
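
Both definitions can be operationalized as simple decision rules on per-run, baseline-subtracted %SC values. The sketch below is illustrative, using simulated data and the conventional alpha of .05.

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative decision rules for the two definitions above, on mock
# per-run %SC values (baseline-subtracted, so a test "vs. fixation" is
# a one-sample test against zero).

rng = np.random.default_rng(1)
n = 30
v, ac, vac, ai = (rng.normal(m, 0.2, n) for m in (0.4, 0.5, 0.9, 0.0))

def significant(x, y=None, alpha=0.05):
    """Wilcoxon signed-rank test; one-sample against zero if y is None."""
    p = wilcoxon(x)[1] if y is None else wilcoxon(x, y)[1]
    return p < alpha

# 1. Multisensory integration: the bimodal response must differ from
#    fixation and from each of the two unisensory responses (and, in all
#    cases reported here, exceed the strongest unisensory response).
integrates = (significant(vac) and significant(vac, v)
              and significant(vac, ac)
              and vac.mean() > max(v.mean(), ac.mean()))

# 2. Cross-modal association: auditory activations must depend on their
#    congruence with the visual context (AC active, AI not, AC != AI).
associates = significant(ac) and not significant(ai) and significant(ac, ai)

print(f"integration: {integrates}, cross-modal association: {associates}")
```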

| RESULTS
Here, we characterize the activations in the subcortical structures, namely, the putative dorsolateral amygdala, the claustrum and the pulvinar, underlying the cross-modal association of social visual and auditory stimuli based on their semantic content or meaning in the broad sense. In addition, we show that among these subcortical regions, only the activations observed in the dorsolateral amygdala and the medial pulvinar meet the criteria for multisensory integration in fMRI. Last, we show an effect of both sensory context and sensory modality on the location of pulvinar activations.

| Sub-cortical activations: Dorsolateral amygdala, claustrum and pulvinar
A small group of subcortical regions are significantly activated during one or several of the unimodal or bimodal conditions presented to the monkeys in the different types of runs: in a general contrast analysis, the claustrum (Claus), the putative dorsolateral amygdala (AMG) and the pulvinar. All of the activations are bilateral in at least one contrast, except for the amygdala. Figure S2 represents the SNR map for the coronal section at the level of the pulvinar. We defined functional ROIs based on these subcortical activations and computed the percent signal change relative to fixation (%SC) for all conditions, cumulated over identical block conditions of either the face contexts (Figure 3, left) or the social scene contexts (Figure 3, right). For all ROIs, and for both face and social contexts, there was no significant activation relative to the fixation baseline in response to the auditory incongruent blocks. All other blocks led to consistent significant activations, except for the inferior and anterior pulvinar in the visual blocks of both face and social scene contexts, and the claustrum in the visual blocks of the social scene contexts.

| Cross-modal association
The first major result of this study is the impact of the visual context on auditory activations. Although the dorsolateral amygdala, claustrum and pulvinar show activations in response to one or several of the unimodal or bimodal conditions presented to the monkeys in the different types of contexts, no significant activation is reported when the auditory stimuli are incongruent with the visual context (see Figure 3). In the dorsolateral amygdala and the claustrum, the %SC in response to the auditory incongruent blocks relative to fixation was significantly lower than the %SC in all other conditions (i.e., V, AC, VAC and VAI), in both the face and the social contexts (Figure 3, two top panels), except for the claustrum in the social scene contexts, where AI was not significantly different from V and AC.
The three different pulvinar ROIs (inferior, medial and anterior) present no significant %SC in response to the auditory incongruent blocks.

F I G U R E 3 Percentage of signal change (%SC) for face tasks (F+ & F−) and for social tasks (S1+, S1−, S2+ & S2−) across subcortical regions of interest (ROIs), putative dorsolateral amygdala, claustrum, inferior pulvinar, medial pulvinar and anterior pulvinar of both hemispheres, comparing the auditory, visual and audio-visual conditions. Statistical differences relative to fixation and between conditions are indicated as follows: ***, p < .001; **, p < .01; *, p < .05 (Wilcoxon non-parametric test). See Table S1 and accompanying note for quantitative effect sizes.

| Multisensory integration
The putative dorsolateral amygdala shows a higher activation following audio-visual congruent (bimodal) stimulation than following unimodal (visual only or auditory only) stimulation in the social contexts. In the putative dorsolateral amygdala ROI, VAC conditions in S1−, S2− and S2+ resulted in significant %SC relative to fixation. We did not observe a significant difference in %SC in response to the negative (S−) versus positive (S+) social context VAC conditions. More specifically, there was no significant difference in %SC for the VAC condition between the aggressive visual context (S1−) and the flight visual context (S2−), and the mutual grooming scenes with coos and aggressive calls (S1+) resulted in significantly lower %SC than the mutual grooming scenes with coos and screams (S2+). In the claustrum ROI, only the mutual grooming scenes with coos and aggressive calls (S1+) and the mutual grooming scenes with coos and screams (S2+) resulted in significant changes in %SC for VAC relative to fixation, although they were not significantly different from each other. In addition, and paralleling the putative dorsolateral amygdala activation, the medial pulvinar showed a higher activation in the bimodal (VAC) than in the unimodal conditions (V and AC), specifically in the face context (FACE: Friedman non-parametric test, χ2(4) = 19.26, p < .001, n = 80; post hoc: V-VAC: Z = 1.6, p < .01; AC-VAC: Z = 1.2, p = .02). The putative dorsolateral amygdala and the medial pulvinar thus exhibit a form of multisensory integration.

| Pulvinar audio-visual activations
Three different loci of activation within the pulvinar were identified in the audio-visual task: the inferior, the medial and the anterior pulvinar. The inferior pulvinar is activated in both face and social tasks by auditory congruent stimuli and by audio-visual congruent and incongruent stimuli (see Figure 3, panel 3). The anterior pulvinar is significantly activated during AC and VAI blocks in both types of task and during VAC blocks in the face task (see Figure 3, panel 5). This leads to the noteworthy and unexpected observation that, in this task, the inferior and anterior pulvinar preferentially respond in the auditory conditions (i.e., AC, VAC and VAI) and not in the visual condition.
The medial pulvinar is activated by AC, VAC and VAI stimulations but is, in addition, the only pulvinar subregion to respond to visual stimulation presented alone (FACE: V: Z = 2.04, p = .04; SOCIAL: V: Z = 2.9, p = .003).
F I G U R E 4 Pulvinar activations in a unimodal visual task: (lipsmacks + aggressive) conditions versus fixation contrast. Whole-brain activation maps of the unimodal visual task cumulated over both monkeys, for lipsmack + aggressive blocks versus fixation contrast. Darker shades of red indicate a level of significance of p < .001 uncorrected, t-score 3.09. Lighter shades of yellow indicate a level of significance of p < .05 FWE, t-score 4.6. PLvl: ventro-lateral lateral pulvinar; PLdm: dorso-medial lateral pulvinar. See Table S3 for region of interest (ROI) coordinates. The precise localization of the identified ROIs relative to the MY19 Atlas is described in Figure S1.

| Putative dorsolateral amygdala, claustrum and pulvinar activations in a unimodal visual task
Because the macaque pulvinar has been repeatedly observed to be involved in visual processing, we hypothesized that this specific pulvinar response pattern is a task effect and reflects a spatial segregation of pulvinar sensory responses. To test this, we investigated pulvinar activations in a unimodal visual task, in which only monkey faces were presented, in blocks of consistent emotional categories. To match the visual categories used in the main audio-visual task, in the following we only used the lipsmack and aggressive categories. Clear lateralized left visual activations could be identified in the lateral pulvinar (Figure 4, medio-dorsally: PLdm, and ventro-laterally: PLvl), at a location very distinct from the ROIs activated in the audio-visual tasks. In this same task and same contrast, the claustrum was also significantly activated (Figure 5a) but not the putative dorsolateral amygdala, although a %SC analysis using a priori ROIs extracted from the audio-visual task shows that both subcortical structures are highly activated during these unimodal visual runs (Figure 5b; AMG: Z = 4.27, p < .001; Claus: Z = 5.8, p < .001). Overall, this indicates that while the putative dorsolateral amygdala and the claustrum are involved in the processing of social cues in both visual and audio-visual contexts, the spatial organization of the pulvinar activations varies between the two types of sensory contexts. This is explored in more detail in the next section.

| Gradient of multi- and unimodal pulvinar activations
The pulvinar is involved in several sensory processes (for review, see Froesel et al., 2021). However, how this sensory information from multiple modalities is organized within the nucleus remains poorly understood. In the following, we describe a spatial gradient of sensory responses within the pulvinar, from unimodal auditory responses (Figure 6a, red), through audio-visual responses (Figure 6a,b, blue; see Figure S3 for individual monkey maps), to unimodal visual responses (Figure 6b, green). More specifically, unimodal auditory activations are located in the anterior pulvinar (PuA), anterior and medial to the audio-visual activations that are located in the medial pulvinar (PuM), with a slight overlap between the two activated ROIs. The unimodal visual activations in the visual task are located in two distinct lateral pulvinar ROIs (Figure 6b, sagittal view, green), lateral to the PuM audio-visual ROIs (Figure 6b, blue), that coincide with PLvl and PLdm (Figure 4), again with a slight overlap between the audio-visual and the visual ROIs. It is worth noting that this gradient was present in each monkey (Figure S3).
Importantly, this very clear functional sensory gradient within the pulvinar is task specific. Indeed, the visual ROIs identified during the unimodal visual task are not activated during the audio-visual task using the exact same stimuli, and vice versa (Figure 7a; p = .007; PuA: Z = 1.86, p = .062). In contrast, only the ventro-lateral pulvinar ROI (PLvl) retains a significant visual response in the visual-only condition of the audio-visual task, although both tasks involve the same visual conditions (Figure 7b; Z = 2.1, p = .02). These results suggest that pulvinar visual responses (except for PLvl) are not fully driven by sensory input and are modulated by the general task context.

F I G U R E 7 (caption, partially recovered) See Table S3 for ROI coordinates; the precise localization of the identified ROIs relative to the MY19 Atlas is described in Figure S1, and Table S2 and associated note report quantitative effect sizes. Selected ROIs are the inferior pulvinar (PuI, bilateral), medial pulvinar (PuM, bilateral), anterior pulvinar (PuA, bilateral) and lateral pulvinar (PLdm, dorso-medial lateral pulvinar, and PLvl, ventro-lateral lateral pulvinar, both in the left hemisphere). ROIs defined in the audio-visual task are in white; ROIs defined in the unimodal visual task are in grey. V: visual vs. fixation; AC: auditory congruent vs. fixation; VAC: audio-visual congruent vs. fixation. PuM: medial pulvinar; PuA: anterior pulvinar; PuI: inferior pulvinar; PLvl: ventro-lateral lateral pulvinar; PLdm: dorso-medial lateral pulvinar. L_: left hemisphere; R_: right hemisphere.
PLvl also retains a significant %SC in the visuo-auditory condition (Figure 7b, VAC: Z = 2.4, p = .01) but not in the AC condition (Z = 1.16, p = .24), suggesting that the VAC significance is driven by the visual component of the VAC condition. The inferior and anterior pulvinar ROIs contrast with PLvl and appear to be driven by auditory stimulation, as their %SC is significant for both the auditory (Figure 7b, AC: PuI: Z = 3.2, p = .0013; PuA: Z = 2.6, p < .01) and visuo-auditory conditions (VAC: PuI: Z = 2.3, p = .011; PuA: Z = 4.37, p < .001) but not in the visual condition (V: PuI: Z = .87, p = .38; PuA: Z = 1.16, p = .24). It is unclear whether PLvl would respond to auditory stimulation. Likewise, it is unclear whether the PuI and PuA auditory activations are specific to the audio-visual context. These two points remain to be investigated in a unimodal auditory task, that is, a task mirroring the unimodal visual task and containing only auditory stimuli. The medial pulvinar ROI (PuM) stands out in that it is activated by all of the visual conditions, whether in the unimodal visual task (Figure 7a, V: Z = 2.6, p = .007) or the audio-visual task (Figure 7b, V: Z = 2.04, p = .04), by the auditory condition of the audio-visual task (Figure 7b, AC: Z = 2.6, p < .01) and by the audio-visual condition of the audio-visual task (Figure 7b, VAC: Z = 5.25, p < .001). Notably, the %SC in the VAC condition is significantly higher than the %SC in the V and AC conditions of the audio-visual task (Friedman non-parametric test, χ2(4) = 19.26, p < .001, n = 80; post hoc: V-VAC: Z = 1.6, p < .01; AC-VAC: Z = 1.2, p = .02), strongly suggesting that the medial pulvinar PuM ROI, in contrast with the other ROIs, integrates visual and auditory information.

| DISCUSSION
Overall, three key subcortical structures contribute to some aspect of the audio-visual association of functionally significant communicative signals: the putative dorsolateral amygdala, the claustrum and the pulvinar. The activation of these structures depended on the meaning of the auditory stimuli relative to the visual context. They were activated by auditory stimuli when these were congruent in meaning with the visual context but not when they were incongruent. This reveals a cross-modal audio-visual modulation in these regions and possibly sheds new light on the functional organization of these subcortical structures. Further investigations in humans and non-human primates are needed to consolidate these observations, including by manipulating context in different ways, as well as to further study the precise functional connectivity of the pulvinar with the cortex (Froesel et al., 2023). In addition to this cross-modal association of communicative signals, the medial part of the pulvinar and the dorsolateral amygdala show a greater activation to bimodal audio-visual stimulation relative to each of the visual and auditory unimodal stimulations. They therefore demonstrate multisensory integration similar to what we report in the STS in a previous study using the same data (Froesel et al., 2022). It is worth noting that the putative dorsolateral amygdala, claustrum and pulvinar correspond to the three subcortical regions activated by the stimulation of face patches, that is, cortical regions selectively responsive to face presentation (Tsao et al., 2008). It has been proposed that these regions correspond to bottlenecks for the communication between face patches (Moeller et al., 2008) and consequently in face processing. This interpretation is supported by tracer studies showing that individual face patches receive input from these three subcortical structures (Grimaldi et al., 2016) as well as by resting-state fMRI, alone or in association with single-cell recording studies (Schwiedrzik et al., 2015; Zaldivar et al., 2022). Based on our study and the previous literature cited above, the putative dorsolateral amygdala, claustrum and pulvinar are thus part of a subcortical network involved in face processing. In addition, the amygdala has been shown to be involved in the guidance of social behaviours, specifically when these involve social interactions (Adolphs & Spezio, 2006). The macaque monkey pulvinar, on the other hand, has been shown to respond to visual social stimuli involving faces depicting emotions, even when viewing human faces (Maior et al., 2010) or abstract representations of a face (Nguyen et al., 2013). In this study, we extend their role to the processing of species-specific vocalizations and their association with socioemotional visual information. The degree of specificity is not yet established and will require tests with other species' vocalizations, facial expressions and social scenes.

| Audio-visual association of social stimuli in the dorsolateral amygdala, the claustrum and the pulvinar
The dorsolateral amygdala and claustrum are activated by all visual, auditory congruent and audio-visual (congruent and incongruent) stimuli, where auditory congruence or incongruence is defined by the visual context set in each presentation. Both subcortical structures thus follow a response pattern that is similar to that observed at the cortical level (Froesel et al., 2022). In addition, both are also activated by faces during a unimodal visual task, including faces with high emotional content. This pattern, combined with evidence from prior studies of selectivity to emotional facial expressions (Fusar-Poli et al., 2009; Wang et al., 2017; Williams et al., 2004), suggests that both structures contribute to context-based social and emotional processing in coordination with the face and voice patches described in Froesel et al. (2022).
The contribution of the amygdala to the processing of social stimuli, whether visual (Hadj-Bouziane et al., 2008, 2012; Nakamura et al., 1992; Pessoa et al., 2006; Sergerie et al., 2008; Todorov, 2012) or auditory (Domínguez-Borràs et al., 2019; Gadziola et al., 2012; Gothard, 2020; Morrow et al., 2019), has been extensively studied by others. Here, we further show that this processing is context dependent, such that a particular stimulus either does or does not activate the amygdala depending upon the context. Additionally, we show that, like the STS (Froesel et al., 2022), the putative dorsolateral amygdala also plays an important role in multisensory integration (see also Ross et al., 2022). In this structure, multisensory integration, defined by the minimal criterion of a difference in percent signal change between bimodal and unimodal conditions, is only observed when the visual stimulations are social scenes. This is also the case in the STS (Froesel et al., 2022). This may be due to the fact that these static social scenes were more ambiguous than faces; in this context, associating the auditory information might help resolve this ambiguity. These activations are located in the dorsal and lateral part of the nucleus, close to the claustrum, at a location already described in the literature (Hadj-Bouziane et al., 2008, 2012; Schwiedrzik et al., 2015).
Due to its connectivity with limbic inputs and with many other cortical and subcortical areas, the claustrum is well positioned as a hub associating sensory and limbic information in order to influence attention via its output to the frontal cortex (see for review Smith et al., 2020). It has recently been hypothesized that, in addition to its involvement in the coordination of slow-wave sleep, it could serve as a limbic-sensory-motor interface. The claustrum is proposed to integrate limbic and sensory information to guide and sustain attention towards behaviourally relevant and salient stimuli. Electrophysiological recordings in macaques demonstrate a segregation of the sensory responses within the claustrum, with the central part responsive to auditory stimuli and the ventral part responsive to visual stimuli (Remedios et al., 2010). The claustrum is thus described as a structure responsive to multiple senses but not as a multisensory integrator. While we did not specifically delineate the ventral and central claustrum in our images, the visual claustrum activations in both the audio-visual and unimodal visual contexts appear to be more ventral than the auditory activations. This will have to be confirmed experimentally. Overall, the activation of the claustrum during vocalizations is strongly modulated by the visual context, as this region is activated when the vocalizations are congruent with the visual context but not when they are incongruent. Moreover, as no multisensory integration was found in this region, this suggests that the perception of vocalizations in the claustrum is modulated by context-dependent visual association based on semantic information even in the absence of multisensory integration.
The contribution of the pulvinar to audio-visual association is more complex. Indeed, cross-modal association was observed in the anterior, inferior and medial subnuclei, the auditory activations being driven by the task context set by the visual stimuli. However, the medial pulvinar was the only subnucleus responsive to both auditory and visual stimulation and also the only one that presented multisensory integration. The contribution of the pulvinar to the processing of social auditory and visual stimuli was thus more heterogeneous than that of the putative dorsolateral amygdala and the claustrum.

| Task-dependent sensory gradient in the pulvinar
We show a task-dependent sensory gradient such that auditory activations are located anteriorly, visual activations laterally and posteriorly, and audio-visual activations medially. In Figure 8, we summarize the patterns of activation in the pulvinar. Several points are worth raising about these patterns. Sensory modalities associated with each pulvinar subdivision are defined based on a statistically significant percentage of signal change for that modality relative to the fixation baseline, without inference about the relative strength of one modality compared to another. First, a global sensory gradient can be seen, with the anterior pulvinar dedicated to auditory processing, the medial part to audio-visual processing and the lateral part of the nucleus to visual processing. Second, the dorso-medial lateral pulvinar (PLdm) is highly modulated by the task context and is only activated in a unimodal visual task, not in an audio-visual association task. Third, only the medial pulvinar and the ventro-lateral pulvinar respond to visual social stimuli irrespective of context. It is also worth noting that this gradient does not follow the classical subdivisions defined on the basis of cytoarchitectonic properties, that is, inferior pulvinar, lateral pulvinar and medial pulvinar (Figures S4 and S5) (Gutierrez et al., 1995; Olszewski et al., 1952; Stepniewska & Kaas, 1997; Walker, 1938). Tracer studies directly addressing the multisensory properties of the pulvinar are scarce. A recent study in the marmoset deserves mention (Homman-Ludiye et al., 2020). Based on retrograde MRI-guided labelling of the medial pulvinar, the authors show anatomical connectivity of the pulvinar with temporal, parietal, frontal, cingulate and orbitofrontal regions, all of which are highly multisensory. Of relevance to our work, reciprocal anatomical connectivity is described between the medial pulvinar and the auditory parabelt (de la Mothe et al., 2012; Homman-Ludiye et al., 2020), possibly accounting for our AC and VAC activations in this part of the pulvinar. Though this work focuses on a New World monkey, and ours on an Old World monkey, it is likely that similar connectivity has been preserved, if not enhanced, given the greater integration of visual signalling in terrestrial macaques. This proposed sensory gradient in the pulvinar is a first attempt to describe the multisensory properties of this complex subcortical structure. The proposal is inherently limited by the spatial resolution and signal-to-noise limitations of fMRI of subcortical structures. It will have to be confronted with higher resolution methods, including intracortical single-neuron recordings in non-human primates, as well as with comparative fMRI studies in humans.

| Pulvinar and face perception
The lateral part of the monkey pulvinar has been shown to contain face-responsive neurons, and the medial part is responsive to human facial expressions (Maior et al., 2010; Nguyen et al., 2013). Our observations from the unimodal visual task fully support this (Figure 8, green circles). Using resting-state analysis, it has been found that the dorsal pulvinar (i.e., dorsal lateral and medial pulvinar) is functionally connected with face patches (Schwiedrzik et al., 2015), defined as the cortical regions selectively responsive to faces as compared to other types of stimuli (Tsao et al., 2008). This is also the case for the ventral pulvinar. Generally speaking, the more anterior face patches connect to more anterior parts of the pulvinar, thus defining an antero-posterior functional connectivity gradient between the STS and the pulvinar (Grimaldi et al., 2016). In addition, the stimulation of two face patches of the STS, AL and ML (anterior lateral and medial lateral, respectively), elicits activation in the inferior pulvinar (Moeller et al., 2008). Overall, this suggests that the entire pulvinar is potentially responsive to faces, these face responses being recruited differentially as a function of task and context.

F I G U R E 8 Summary of the activations within the pulvinar as a function of the task and sensory modality of stimulation. The circle size is largest in the regions of the nucleus that present a higher activation. All circles correspond to significant activations or %SC. PuA: anterior pulvinar; PuM: medial pulvinar; PuI: inferior pulvinar; PLdm: dorso-medial lateral pulvinar; PLvl: ventro-lateral lateral pulvinar.
Our study showed that a task implicitly calling for an association between auditory and visual social information activates the medial as well as the ventro-lateral lateral pulvinar (PLvl). The unimodal visual task additionally activates the dorso-lateral and the inferior pulvinar. This is in agreement with the human pulvinar lesion literature, whereby a patient presenting with an entirely damaged unilateral pulvinar was not able to recognize fearful expressions in the contralesional field, whereas patients with damage limited to the anterior and lateral pulvinar showed no deficits in fear recognition (Ward et al., 2007). These results suggest that fear recognition is mediated by the medial pulvinar, a pattern supported by the observation that the entire nucleus is responsive to fear-related visual stimuli such as snakes. We propose to link these observations to the role of the pulvinar in emotional regulation, as discussed next. Additionally, the medial pulvinar is the only pulvinar subnucleus that presents audio-visual integration. This result supports the idea that this subregion implements multisensory integration (Froesel et al., 2021). As is the case for the amygdala, recent studies report multisensory integration in this brain structure. A macaque single-cell recording study describes sub-additive and suppressive multisensory integration in the medial pulvinar (Vittek et al., 2023), and a recent human fMRI study of the pulvinar region describes multisensory enhancement during natural narrative speech perception (Ross et al., 2022).
In the present study, activations during the unimodal social visual context are exclusively identified in the left pulvinar. A processing bias towards the left pulvinar has already been demonstrated in humans, such that left pulvinar activations are more often reported than right pulvinar activations (Padmala et al., 2010). This has been interpreted in the light of the role of the pulvinar in attentional function. In their monkey resting-state fMRI study, Schwiedrzik et al. (2015) determined that the pulvinar, the dorsolateral amygdala, the hippocampus, the caudate nucleus, the claustrum and other subcortical structures are functionally connected with face patches. In addition, their left pulvinar activations included significantly more voxels than the right activations. A left hemisphere bias for processing species-specific vocalizations was also reported in a field study of rhesus monkeys (Hauser & Andersson, 1994). These results, together with the fact that the SNR is not higher on the left than on the right (see Figure S5), raise the possibility of a functional lateralization in the monkey pulvinar. This will have to be further explored.

| Cortical-subcortical network involved in cross-modal audiovisual association of social stimuli
In Froesel et al. (2022), we characterized the cortical activations underlying the association of social visual and auditory stimuli based on their semantic content or meaning in the broad sense. Cortical regions were active when visual and auditory information were congruent but inactive when this information was incongruent. This audio-visual semantic association with contextual information relied on a core cortical functional network involving the STS and the LS. LS ROIs had a preference for auditory and audio-visual congruent stimuli, whereas STS ROIs responded equally to auditory, visual and audio-visual congruent stimuli. Multisensory integration was only identified in the STS. These cortical regions, together with the dorsolateral amygdala, the claustrum and the medial pulvinar, thus belong to the same cortical-subcortical network involved in the cross-modal audio-visual association of social stimuli (Figure 9). Ross et al. (2022) propose that multisensory enhancement in naturalistic contexts involving the understanding of semantics recruits more than the typical multisensory network, notably including the amygdala and the pulvinar, as described here. The medial pulvinar is highly connected with the limbic system and with areas involved in the regulation of emotions, such as the anterior cingulate cortex, the temporal cortex, the temporo-parietal junction, the insula and the fronto-parietal opercular cortex (Rosenberg et al., 2009; Yeterian & Pandya, 1997), as well as with the amygdala (Jones & Burton, 1976). These structures are accordingly proposed to coordinate cortical networks to evaluate the biological significance of affective visual stimuli (Pessoa, 2010b). Their recruitment during audio-visual association based on socioemotional context, as described here, supports this hypothesis.
In the cortical-subcortical network involved in the cross-modal audio-visual association of social stimuli described in Figure 9, three of the four functional nodes implement multisensory integration, such that the audio-visual condition resulted in significantly higher activations than the highest unimodal response. The claustrum was the only node that did not show multisensory integration properties. This possibly suggests that the claustrum plays a distinct role in the processing of audio-visual social stimuli relative to the other nodes of the network with which it co-activates.
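For illustration, the "max criterion" for multisensory integration assumed throughout this discussion (audio-visual response exceeding the strongest unimodal response) can be sketched as follows. This is a minimal Python sketch on synthetic per-block percent-signal-change values, not the study's analysis code; the block count and response amplitudes are hypothetical:

```python
# Minimal sketch of the max criterion: a region is said to integrate if its
# audio-visual (AV) response exceeds the best unimodal response, max(A, V).
# All values below are synthetic.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
n_blocks = 36  # hypothetical number of repeated blocks per condition

a  = rng.normal(0.30, 0.10, n_blocks)   # auditory-only response
v  = rng.normal(0.45, 0.10, n_blocks)   # visual-only response
av = rng.normal(0.65, 0.10, n_blocks)   # audio-visual response

# Best unimodal response, computed block-wise (one possible implementation).
best_unimodal = np.maximum(a, v)

# Paired non-parametric test: AV > max(A, V) indicates a supra-maximal,
# integrative response rather than the mere dominance of one modality.
stat, p = wilcoxon(av, best_unimodal, alternative="greater")
print(f"AV vs max(A, V): W = {stat:.1f}, p = {p:.4f}")
```

Under this criterion, a node like the claustrum would fail the test even while responding robustly to each modality on its own.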
It is unclear why we do not observe activations in the superior colliculus alongside the cross-modal association network described in Figure 9. Indeed, the superior colliculus has been reported to be connected with both the amygdala and the medial pulvinar and to integrate sensory information from multiple modalities (Froesel et al., 2021; Meredith et al., 1987). Based on previous research, it is also hypothesized to provide sensory input to a subcortical network involved in unconscious emotion perception, composed of the superior colliculus, the medial and lateral pulvinar and the amygdala (Almeida et al., 2015; Tamietto et al., 2012). The absence of superior colliculus activation in the present study could be due to a lower SNR in this deep brain region (Figures S2 and S6). Alternatively, the specific stimuli used in our study may have activated a different functional network from the one described above (see Figure 9). Indeed, the subcortical regions described in the present work co-activate with temporal regions closely matching face patches (Froesel et al., 2022), that is, cortical regions selectively responsive to faces as compared with other types of stimuli (Tsao et al., 2008). Grimaldi et al. (2016) show input to these face patches from the pulvinar, the amygdala and the claustrum, thus directly supporting our observation of a functional co-activation of these subcortical regions with the temporal face patches. However, they do not identify direct projections from the superior colliculus to the temporal face patches, suggesting that this structure belongs to a different subcortical network, one not activated by our paradigm.

| CONCLUSION
In this study, we propose that the claustrum, the pulvinar and the putative dorsolateral amygdala all play an essential role in the cross-modal association of communicatively significant information, but that only the dorsolateral amygdala and the medial pulvinar further implement multisensory integration. In addition, we provide evidence for a context-dependent sensory audio-visual processing gradient within the pulvinar. We propose that these three subcortical structures are part of a supramodal network that modulates sensory perception as a function of the social context, independently of the sensory modality.

FIGURE 1 Experimental designs. Each sensory stimulation block contained a rapid succession of 500-ms stimuli (with the exception of the lipsmack for the unimodal visual task). Each run started and ended with 10 s of fixation regardless of the task type. (a) Experimental design of the audio-visual task. Example of an aggressive face (F−) context. One run represents one context set up by the visual stimuli and contains three randomized repetitions of six different 16-s blocks. The six blocks were either visual stimuli only (Vi), auditory congruent stimuli only (AC), auditory incongruent stimuli only (AI), audio-visual congruent stimuli (VAC), audio-visual incongruent stimuli (VAI) or fixation with no sensory stimulation (Fx). Blocks were pseudo-randomized such that each block was, on average, preceded by the same number of blocks from the other conditions and that each run started with a block carrying visual information (Vi, VAC or VAI). (b) Description of the contexts. Six contexts were displayed. Each context combined visual stimuli of identical social content with either semantically congruent or incongruent monkey vocalizations. Pairs of contexts shared the same auditory stimuli but opposite social visual content (F+ vs. F−; S1+ vs. S1−; S2+ vs. S2−). Each run corresponded to one of the semantic contexts described above. (c) Experimental design of the unimodal visual task. In one run, six different blocks were each displayed three times in random order. The six possible 16-s blocks were fixation (Fx), lipsmack (Lip), scared monkey faces (Sca), aggressive monkey faces (Aggr), neutral monkey faces (Neu) and scrambled monkey faces (Scr). Visual stimuli were extracted from videos collected by the Ben Hamed lab, as well as by Marc Hauser on Cayo Santiago, Puerto Rico.
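The pseudo-randomization constraint described in (a) can be made concrete with the following sketch. It is illustrative only: the condition labels follow the caption, but the rejection-sampling procedure is an assumption, not the generator actually used in the study:

```python
# Sketch: build a pseudo-randomized block order in which each condition is,
# on average, preceded equally often by every other condition, and each run
# starts with a block carrying visual information (Vi, VAC or VAI).
import itertools
import numpy as np

rng = np.random.default_rng(2)
CONDS = ["Vi", "AC", "AI", "VAC", "VAI", "Fx"]
VISUAL_FIRST = {"Vi", "VAC", "VAI"}
N_REPS = 3  # three repetitions of the six blocks per run

def transition_imbalance(seq):
    """Variance of condition-to-condition transition counts; 0 = perfectly balanced."""
    counts = {pair: 0 for pair in itertools.permutations(CONDS, 2)}
    for prev, nxt in zip(seq, seq[1:]):
        if prev != nxt:
            counts[(prev, nxt)] += 1
    return np.var(list(counts.values()))

# Rejection sampling over random shuffles: keep the most balanced admissible order.
best_seq, best_score = None, np.inf
for _ in range(5000):
    seq = CONDS * N_REPS
    rng.shuffle(seq)
    if seq[0] not in VISUAL_FIRST:
        continue
    score = transition_imbalance(seq)
    if score < best_score:
        best_seq, best_score = seq, score

print(best_seq, f"imbalance = {best_score:.2f}")
```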
FIGURE 2 … and the anterior (PuA) and medial (PuM) pulvinar nuclei are activated by the Auditory congruent versus Fixation contrast (Figure 2, top row). The dorsolateral part of the right amygdala (AMG) is activated by the Visual versus Fixation contrast (Figure 2, middle left panel; for precise localization and identification criteria, see supplemental Figure S1B). Given the location of these activations, we refer to them as the putative dorsolateral amygdala. The inferior (PuI) and medial (PuM) pulvinar nuclei are activated by the Audio-visual incongruent versus Fixation contrast (Figure 2, middle right panel). The putative dorsolateral amygdala (AMG), the claustrum (Claus) and the medial pulvinar nucleus (PuM) are activated by the Audio-visual congruent versus Fixation contrast (Figure 2, bottom row).
All pulvinar ROIs except the anterior pulvinar (PuA) are significantly activated relative to fixation during the unimodal visual task (Figure 7a; PuI: Z = 3.19, p < .001; PuM: Z = 2.6, …).

FIGURE 5 Claustrum activations in a unimodal visual task: (lipsmacks + aggressive) conditions versus fixation contrasts. (a) Whole-brain activation maps of the unimodal visual task cumulated over both monkeys, for the lipsmack + aggressive blocks versus fixation contrast. Darker shades of red indicate significance at p < .001 uncorrected, t-score 3.09. Lighter shades of yellow indicate significance at p < .05 FWE, t-score 4.6. (b) Percent signal change of the lipsmack + aggressive blocks versus fixation contrast in the putative dorsolateral amygdala (AMG) and claustrum (Claus) regions of interest (ROIs) defined in the audio-visual task described in Figure 2. AMG: Z = 4.27, p < .001; Claus: Z = 5.8, p < .001. See Table
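The percent-signal-change measure reported in this and the following figures can be sketched as follows (illustrative Python on synthetic signal values; the baseline definition is an assumption, not the study's exact pipeline):

```python
# Sketch of percent signal change (PSC): the mean BOLD signal in a condition
# block expressed relative to the fixation baseline. Time series are synthetic.
import numpy as np

rng = np.random.default_rng(3)
fixation_signal = rng.normal(1000.0, 5.0, size=200)   # baseline (fixation) scans
condition_signal = rng.normal(1008.0, 5.0, size=200)  # e.g. lipsmack + aggressive blocks

baseline = fixation_signal.mean()
psc = 100.0 * (condition_signal.mean() - baseline) / baseline
print(f"percent signal change = {psc:.2f}%")
```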

FIGURE 6 Pulvinar (and whole-brain) activations in the FACE (F+ & F−) audio-visual and unimodal visual tasks. (a) Whole-brain activation maps of the FACE context cumulated over both monkeys, for the auditory congruent (red) and audio-visual congruent (blue) contrasts versus fixation. Axial, sagittal and coronal views are shown, zooming in on the pulvinar. The red and blue activation outlines correspond to activation thresholds at the level of significance p < .001 uncorrected, t-score 3.09. (b) Whole-brain activation maps of the FACE context cumulated over both monkeys, for the audio-visual congruent condition of the audio-visual task (blue) and the visual condition (aggressive + lipsmack faces) of the unimodal visual task (green).

FIGURE 7 Percentage of signal change in selected pulvinar regions of interest (ROIs), in the unimodal visual task (a) and the audio-visual task (b). Statistical differences relative to fixation and between conditions are indicated as follows: ***, p < .001; **, p < .01 (Wilcoxon non-parametric test); see Table

FIGURE 9 Cortical-subcortical network involved in the cross-modal association of faces and social scenes with vocalizations sharing the same meaning. The nodes of this network are activated by unimodal visual stimuli (white), unimodal auditory stimuli (dark grey) and audio-visual stimuli. All display multisensory integration (intermediate grey), except the claustrum. STS: superior temporal sulcus; LS: lateral sulcus; AMG: dorsolateral amygdala; Claus: claustrum; PuA: anterior pulvinar; PuM: medial pulvinar; PuL: lateral pulvinar; PuI: inferior pulvinar.