Supramodal neural networks support top‐down processing of social signals

Abstract The perception of facial and vocal stimuli is driven by sensory input and cognitive top-down influences. Important top-down influences are attentional focus and supramodal social memory representations. The present study investigated the neural networks underlying these top-down processes and their role in social stimulus classification. In a neuroimaging study with 45 healthy participants, we employed a social adaptation of the Implicit Association Test. Attentional focus was modified via the classification task, which compared two domains of social perception (emotion and gender) using exactly the same stimulus set. Supramodal memory representations were addressed via the congruency of the target categories for the classification of auditory and visual social stimuli (voices and faces). Functional magnetic resonance imaging identified attention-specific and supramodal networks. Emotion classification networks included bilateral anterior insula, pre-supplementary motor area, and right inferior frontal gyrus. They were purely attention-driven and independent of stimulus modality and of the congruency of the target concepts. No neural contribution of supramodal memory representations could be revealed for emotion classification. In contrast, gender classification relied on supramodal memory representations in rostral anterior cingulate and ventromedial prefrontal cortices. In summary, different domains of social perception involve different top-down processes, which take place in clearly distinguishable neural networks.

It is well established that the perception of auditory and visual social stimuli is driven not only by physical stimulus properties, but also by top-down processes (Gilbert & Li, 2013; Latinus, VanRullen, & Taylor, 2010). "Top-down processes" is a collective term for various types of cognitive influences driving perception. Important top-down influences on perception include attentional focus, that is, the aspect of a stimulus that a person is attending to (Corbetta & Shulman, 2002; Hopfinger, Buonocore, & Mangun, 2000; van Atteveldt, Formisano, Goebel, & Blomert, 2007), and supramodal representations in long-term memory (Choi, Lee, & Lee, 2018; Ramsey, Cross, & Hamilton, 2013). Some previous studies have addressed the role of specific top-down contributions in social perception. Bzdok et al. (2012) separated neural networks underlying social, face-specific, emotional, and cognitive aspects of stimulus processing.
These findings suggest that there are neural networks that are driven by top-down influences such as the task, but not by the stimulus material itself. Further evidence for this notion comes from a study by Hensel et al. (2015), who identified an involvement of the dorsomedial prefrontal cortex (DMPFC) specifically during social trait judgments, irrespective of stimulus modality.
Following the line of these studies, the present study investigated the neural networks underlying two types of top-down influences on the perception of voices and faces: attentional focus (i.e., the attended aspect of the stimulus material) and memory representations. Attentional focus was varied via the task, that is, attending to either the emotion or the gender of the faces and voices. Moreover, social evaluation requires a comparison with a representation in the individual's long-term (or "reference") memory (Roitblat, 1987), which has been formed through previous experience (Mazur, 2017). For social evaluation, we were interested in whether the respective networks were supramodal, that is, independent of stimulus modality. To identify such supramodal memory representations, we developed the Social Implicit Association Test (SIAT), a social variant of the well-established Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). The SIAT is described in detail in Section 2; in short, it investigates associations between memory representations via reaction times to the respective stimuli. In the original IAT, associated stimuli (such as the words "doctor" and "hospital") lead to faster responses than non-associated stimuli (such as "bird" and "cigarette"; cf. Collins & Loftus, 1975). In the SIAT, we analogously assumed supramodal associations between the same social categories in voice and face, for example, between a happy face and a happy voice. From a neurobiological perspective, we assumed that such an association may be reflected by a shared brain region. In other words, we assumed that associated representations are in fact two aspects of one and the same concept, represented in the same brain region.
With respect to faces and voices, this would correspond to a supramodal memory representation. There are previous functional magnetic resonance imaging (fMRI) studies on the IAT following the same logic. As an example, Knutson, Mah, Manly, and Grafman (2007) used the IAT during fMRI to investigate the neural substrates of gender and racial bias, identifying ventromedial prefrontal cortex (VMPFC) and ventral anterior cingulate cortex (vACC) as putative regions. These findings are well in line with Milne and Grafman (2001), who found a reduced IAT effect in patients with VMPFC lesions.
Based on these assumptions, we derived the following hypotheses:

1. For both emotion and gender evaluation, we can identify networks that are driven by attention: specific for the task (emotion/gender), but independent from stimulus modality and memory representations.
2. For both emotion and gender evaluation, we can identify networks that are driven by supramodal memory representations: independent from stimulus modality, but only present for associated auditory and visual stimuli.
These hypotheses were tested using fMRI.

| Participants
Forty-five right-handed subjects (23 female; age range 19-33 years, mean 24.7 ± 3.1) participated in the experiment. All subjects had normal or corrected-to-normal vision, normal hearing, no contraindications against MR investigations, and no history of neurological or psychiatric illness. All participants either had German as a first language or grew up bilingually (with German from early childhood on).

| Stimuli
Auditory stimuli were disyllabic pseudowords (Thonnessen et al., 2010). They followed German phonological rules but had no semantic content and were validated in a pre-study on 25 subjects who did not participate in the fMRI study (see Klasen et al., 2011 for details of stimulus validation). Auditory stimulus duration was 1 s.
Visual stimuli were taken from the validated NimStim Face Stimulus Set (Tottenham et al., 2009). In analogy to the duration of auditory stimuli, photographs were presented for 1 s each. Stimuli were always presented in isolation (unimodal presentation). Auditory and visual stimuli were counterbalanced for emotion (50% happy, 50% angry) and gender of the speaker/actor (50% female, 50% male). Moreover, each stimulus type was displayed by four different speakers/actors. In summary, the experiment thus comprised 32 different stimuli: 2 modalities (auditory/visual) × 2 emotions (happy/angry) × 2 genders (female/male) × 4 actors/speakers.
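As a minimal sketch, the factorial structure of the stimulus set described above can be enumerated as follows; all names and values are illustrative, not the original stimulus labels:

```python
from itertools import product

# Illustrative sketch of the factorial stimulus design described above:
# 2 modalities x 2 emotions x 2 genders x 4 actors/speakers = 32 stimuli.
modalities = ["auditory", "visual"]
emotions = ["happy", "angry"]
genders = ["female", "male"]
actors = [1, 2, 3, 4]

stimuli = [
    {"modality": m, "emotion": e, "gender": g, "actor": a}
    for m, e, g, a in product(modalities, emotions, genders, actors)
]

# Counterbalancing check: every factor level occurs equally often.
assert len(stimuli) == 32
assert sum(s["emotion"] == "happy" for s in stimuli) == 16
assert sum(s["gender"] == "female" for s in stimuli) == 16
```

Fully crossing the factors in this way guarantees the 50/50 counterbalancing of emotion and gender stated above.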

| Experimental design
In the present fMRI study, we employed a SIAT. The SIAT measures crossmodal associations between corresponding visual and auditory representations of social signals (faces and voices) via reaction times. Similar to the original IAT, the SIAT uses a classification task with two target categories sharing one response key, paired in either a congruent or an incongruent fashion. In the congruent condition, corresponding visual and auditory signals (e.g., happy faces and happy voices) share the same response key, whereas in the incongruent condition non-matching pairings (e.g., happy faces and angry voices) are mapped onto the same key.
To address the top-down influence of attentional focus on social perception, two different SIAT variants were employed: an emotion SIAT and a gender SIAT. Attentional focus was varied via the task: in the emotion SIAT, participants classified the emotion of faces and voices (happy or angry); in the gender SIAT, they classified stimulus gender (male or female). Participants were instructed to classify the stimuli as quickly and as accurately as possible by pressing one of two response keys according to the assigned category. Pairings of target categories were either congruent or incongruent. In the congruent condition, corresponding auditory and visual stimuli were always mapped onto the same key (e.g., for emotion, angry voice and angry face on one key, happy voice and happy face on the other). In the incongruent condition, non-corresponding auditory and visual stimuli were mapped onto the same key (e.g., for emotion, angry face and happy voice on one key, happy face and angry voice on the other). The gender SIAT was designed in analogy.
Both SIATs included one congruent and one incongruent association phase in separate sessions in randomized order. Prior to the first association phase, participants performed two shorter learning phases, where the assignment of keys for the categories was learned according to the first association phase. A learning phase consisted of either visual or auditory stimuli only. As an example, a congruent association phase was always preceded by two learning phases assigning auditory and visual emotions in a congruent fashion (e.g., happy voice = left, angry voice = right for the auditory learning phase and happy face = left and angry face = right for the visual learning phase in the emotion SIAT).
The two association phases were separated by an additional re-learning phase (either auditory or visual) with the identical setup, but with switched assignments of keys, preparing for the second association phase (see Figure 1 for a depiction of the experimental setup).
The order of the SIAT variants (emotion/gender) was randomized for each participant. The same was true for the order of the association phases within each SIAT (congruent/incongruent), the order of the learning phases (visual/auditory), and the assignment of emotion (angry/ happy) and gender (male/female) to the response keys (right/left).
In the SIAT, reaction time differences between the congruent and incongruent association tasks quantified the implicit association of auditory and visual representations of the social categories emotion and gender. Both SIATs were conducted in a repeated measurement design on two different days. Auditory and visual stimuli were identical in both SIATs.

FIGURE 1 Experimental design. Two Social Implicit Association Test (SIAT) variants were employed: one for emotion evaluation and one for gender evaluation. Both SIATs consisted of five phases, in close analogy to the original IAT by Greenwald et al. (1998). To avoid sequential effects of auditory and visual evaluation phases, four parallel versions were employed for each of the SIATs (emotion/gender). Inserts on the right show one example version of the emotion evaluation SIAT in detail.
Although the original IAT has traditionally been used to measure attitudes (stereotypes/implicit bias) in social psychology (e.g., Gawronski, 2002; Wilson & Scior, 2013), research has shown that adaptations of the IAT paradigm can be used for associations between non-social categories as well (e.g., flowers/insects and their association with pleasant/unpleasant attributes; Greenwald et al., 1998). Moreover, the IAT works for the auditory domain as well (McKay, Arciuli, Atkinson, Bennett, & Pheils, 2010) and even for the association between auditory and visual domains (Parise & Spence, 2012). This universal applicability encouraged us to use the SIAT as a social variant for investigating associations between vocal and facial stimuli.
In summary, the SIAT design allowed us to investigate the top-down contributions of attentional focus and memory representation independently from each other. By using conjunction analyses, we were moreover able to identify activation patterns that were independent from stimulus modality (i.e., supramodal). To avoid any bias arising from the stimulus material itself (and thus to exclude any bottom-up effects), we used exactly the same stimulus material for all association tasks.
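The reaction-time logic of the SIAT described above can be sketched as follows; the function name and the reaction-time values are hypothetical, chosen only to illustrate how the implicit association effect is quantified:

```python
import statistics

def siat_effect(congruent_rts, incongruent_rts):
    """Implicit association effect: mean reaction-time difference (seconds)
    between the incongruent and congruent association phases. A positive
    value means responses were faster when corresponding auditory and
    visual categories shared a response key."""
    return statistics.mean(incongruent_rts) - statistics.mean(congruent_rts)

# Hypothetical reaction times of one participant (correct trials only):
congruent = [0.62, 0.58, 0.65, 0.60]
incongruent = [0.74, 0.71, 0.69, 0.78]
effect = siat_effect(congruent, incongruent)
assert effect > 0  # consistent with a supramodal association
```

A positive effect in this sketch corresponds to the IAT-typical pattern: associated target categories on a shared key speed up classification.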
Images were presented through a mirror mounted on the head coil.
During the fMRI measurements, participants wore soft foam ear plugs and headphones, which served as ear protection as well as for delivering the auditory stimuli. The sound volume was tested before the measurements in the scanner and individually adjusted to a comfortable level, based on the participant's feedback. Previous experience with the same scanner, ear protection, and auditory stimulus set (e.g., Klasen et al., 2011) indicated that the stimuli were well audible and could easily be classified even with the scanner noise in the background. Responses were given via two keys on a keypad placed at the participant's right hand.
Total time for functional and anatomical scans was 45 min.

| Data analysis
Image analyses were performed with BrainVoyager QX 2.8 (Brain Innovation, Maastricht, The Netherlands). Preprocessing of the functional MR images included slice time correction, 3D motion correction, Gaussian spatial smoothing (6 mm full width at half maximum kernel), and high-pass filtering including linear trend removal. The first five images of each functional run were discarded to avoid T1 saturation effects. Functional images were coregistered to the 3D anatomical data and transformed into Talairach space (Talairach & Tournoux, 1988), following the standard procedure as implemented in BrainVoyager. In total, four participants were excluded from the analysis. One was excluded due to technical problems: part of the original DICOM image files was damaged and could not be restored. Three additional participants were excluded due to excessive head motion as identified by visual inspection. Of the excluded participants, two were male and two were female, leaving a final sample of 41 participants (21 female, 20 male).
Statistical parametric maps were created using a random effects general linear model (RFX-GLM) with multiple predictors according to the stimulus types. The following within-subject factors were considered in the analysis:

Attentional focus (Emotion vs. Gender)

Stimulus modality (Voice vs. Face)

Congruency of target categories (Congruent vs. Incongruent)

Stimulus emotion (Happy vs. Angry)

Stimulus gender (Female vs. Male)
The full combination of these five factors led to a total of 2^5 = 32 predictors, which are listed in Table 1 (for abbreviations see above). For each of the contrasts, their encoding is marked with "+" and "−," respectively. Fixation cross phases served as a low-level baseline. Events were defined in a stimulus-bound fashion, that is, modeled for the duration of stimulus presentation. Only trials with correct responses were included in the analyses. Trials with missing or incorrect responses were modeled as separate confound predictors. Task contrasts were investigated via paired t tests. Following the recommendations of Woo, Krishnan, and Wager (2014), activations were thresholded at voxel-wise p < .001 and Monte-Carlo-corrected for multiple comparisons on the cluster level (p < .05, corresponding to k > 11). All reported conjunction analyses tested the conservative conjunction null hypotheses (Nichols, Brett, Andersson, Wager, & Poline, 2005).
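As a minimal sketch under stated assumptions, the following illustrates the 2^5 = 32 predictor space and a conjunction-null test implemented as a minimum-statistic conjunction (cf. Nichols et al., 2005); the factor names, labels, and toy t-values are ours, not the original predictor labels or data:

```python
from itertools import product

# Five two-level within-subject factors -> 2^5 = 32 predictors.
factors = {
    "focus": ["Emotion", "Gender"],
    "modality": ["Voice", "Face"],
    "congruency": ["Congruent", "Incongruent"],
    "stim_emotion": ["Happy", "Angry"],
    "stim_gender": ["Female", "Male"],
}
predictors = ["/".join(levels) for levels in product(*factors.values())]
assert len(predictors) == 32

def conjunction_null(t_maps, threshold):
    """Conjunction-null test: a voxel survives only if *every* contrast
    exceeds the threshold, i.e. the minimum t-value across maps is
    supra-threshold."""
    return [min(voxel) > threshold for voxel in zip(*t_maps)]

# Toy example: two contrast t-maps over three voxels; only voxel 0
# is supra-threshold in both maps and thus survives the conjunction.
survives = conjunction_null([[4.2, 1.0, 3.5], [5.0, 6.0, 2.9]], threshold=3.1)
assert survives == [True, False, False]
```

The minimum-statistic formulation makes the conservatism of the conjunction null explicit: one sub-threshold contrast removes a voxel from the conjunction map.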

Results are displayed in radiological convention (left is right).
To address the study's hypotheses, the following comparisons were of interest: 1. Emotion versus gender: supramodal networks. These were networks specific for emotion or gender evaluation, respectively, but independent from stimulus modality. These were the contrasts (Voice Emotion > Voice Gender) ∩ (Face Emotion > Face Gender), as well as the reversed contrast (Gender > Emotion, respectively).
2. Emotion versus gender: networks independent from congruency of target categories. These were networks specific for emotion or gender evaluation, respectively, but independent from the congruency of the target categories. These were the contrasts (Congruent Emotion > Congruent Gender) ∩ (Incongruent Emotion > Incongruent Gender), as well as the reversed contrast (Gender > Emotion, respectively).
3. Emotion versus gender: networks depending exclusively on attentional focus. These were networks depending solely on the attentional focus (emotion or gender), independent from stimulus modality and from the congruency of the target concepts. This equals the fourfold conjunction (Voice Emotion > Voice Gender) ∩ (Face Emotion > Face Gender) ∩ (Congruent Emotion > Congruent Gender) ∩ (Incongruent Emotion > Incongruent Gender), as well as the reversed contrast (Gender > Emotion, respectively).
4. Networks depending on the congruency of target categories. These were networks that were specific for congruency or incongruency of the target categories (congruent vs. incongruent and vice versa). They were investigated separately for emotion and gender evaluation, as well as for the comparison between them.

| Neuroimaging results

| Emotion versus gender: Supramodal networks
These were networks specific for emotion or gender evaluation, respectively, but independent from stimulus modality. We thus compared emotion versus gender evaluation networks separately for auditory (voice) and visual (face) stimuli.

Gender > emotion. For gender evaluation, auditory stimuli involved VMPFC and ACC, along with left angular and superior frontal gyri (Figure S1). No clusters emerged for visual stimuli or for the conjunction of both contrasts.

| Emotion versus gender: Networks independent from congruency of target categories
These were networks specific for emotion or gender evaluation, respectively, but independent from the congruency of the target categories.

| Emotion versus gender: Networks depending exclusively on attentional focus
These were networks depending solely on the attentional focus (emotion or gender), independent from stimulus modality and from the congruency of the target concepts.
Emotion > gender. To investigate effects specific for emotion evaluation, independent of stimulus modality and of the congruency of target concepts, we calculated the fourfold conjunction of all maps, that is, (Voice emotion > Voice gender) ∩ (Face emotion > Face gender) ∩ (Congruent emotion > Congruent gender) ∩ (Incongruent emotion > Incongruent gender). The resulting map revealed common activation in bilateral anterior insula, right IFG, and pre-SMA (Figure 3; Table 1).
Differences in reaction times between emotion and gender evaluation indicated a higher difficulty of the emotion task. To investigate possible influences of the latter on the brain activation patterns displayed in

| Networks depending on the congruency of target categories
These were networks that were specific for congruency or incongruency of the target categories (congruent vs. incongruent and vice versa).
Emotion evaluation. Incongruence led to a stronger activation in areas of the emotion evaluation network (compare Figure 2), namely FFA, bilateral anterior insula, thalamus, globus pallidus, IFG/MFG, and pre-SMA, along with extended activation in visual systems, DLPFC, and superior parietal lobule (SPL; Figure 4). Congruency, in contrast, was not associated with any specific activation pattern during emotion evaluation.
Gender evaluation. A different picture emerged for the gender evaluation task. The incongruent condition led to a similar, albeit less pronounced pattern in pre-SMA, DLPFC, right anterior insula, and SPL. Congruency, in turn, led to enhanced activation in two prominent clusters in rACC and VMPFC (Figure 4).

| Networks of supramodal memory representations
These were networks independent from stimulus modality, but depending on the congruency of the target concepts.
FIGURE 4 Target categories: Congruent versus incongruent. Both emotion and gender evaluation showed similar fronto-parietal networks for incongruent target categories, indicating increased working memory load. No congruency-specific activation was observed for emotion evaluation. For gender evaluation, congruency led to enhanced activation in rostral anterior cingulate cortex (rACC) and ventromedial prefrontal cortex (VMPFC). These networks may thus reflect supramodal memory representations supporting gender evaluation.

Results are listed in Table 2.

| DISCUSSION
The present study provides new insights into top-down contributions to social perception. Specifically, the SIAT enabled us to identify top-down influences on the processing of social information in the auditory and visual domains. Task-specific, but modality-independent patterns reflected supramodal networks for social categories. These top-down components could further be separated into attention-driven networks and supramodal memory representations. Functionally distinct networks were identified for the social categories emotion and gender.
For emotion evaluation, the modality-specific analysis revealed FFA and STS specifically for face processing. Besides the FFA's well-established role in face identification, which is assumed to rely mainly on invariant facial features (Calder & Young, 2005; Dekowska, Kuniecki, & Jaśkowski, 2008; Haxby, Hoffman, & Gobbini, 2002; Hoffman & Haxby, 2000), recent studies highlight the importance of the FFA for processing emotional expressions as well (Harry, Williams, Davis, & Kim, 2013; Nestor, Plaut, & Behrmann, 2011; Saxe, 2006). This functional versatility may be explained by sub-regional specialization, but also by task-dependent co-activation with functionally distinct frontal and temporal networks (Hein & Knight, 2008). Functional synchronicity of the posterior STS with the FFA may reflect a visual emotion processing network.
Current models of auditory emotion processing highlight a right-hemispheric lateralization (Brück, Kreifelts, & Wildgruber, 2011; Klasen et al., 2018). Right-sided primary and higher-order acoustic regions extract suprasegmental information; meaningful suprasegmental sequences are then processed in posterior parts of the right STS, followed by the evaluation of emotional prosody in the IFG (Wildgruber, Ackermann, Kreifelts, & Ethofer, 2006). Neuroimaging findings (Klasen et al., 2018) highlight the relevance of the right IFG for emotional prosody. In our study, a right-hemispheric lateralization was observed for the fourfold conjunction of all maps (Figure 3c). Previous studies have shown that the pre-SMA is involved in domain-general sequence processes (Cona & Semenza, 2017) and in the emotional evaluation of signals irrespective of modality (Ethofer et al., 2013). The anterior insula has been ascribed a wide range of complex functions and participates in various cognitive and emotional processes (for a review, see Menon & Uddin, 2010). In line with our findings, Menon and Uddin (2010) propose that a basic function of the anterior insula is the bottom-up detection of salient stimuli across multiple modalities. It is well established that the insula engages in affective processes (e.g., perceiving the emotions of others) and in the experience of emotions derived from visceral and somatic information about bodily states (Uddin, 2015). As such, insula activity represents an individual's subjective and conscious emotional state, as well as the emotive value of external stimuli. Thus, it has been suggested that the ability to understand the emotions of others depends largely on experiencing similar changes in our visceral state by mirroring the perceived emotion (Critchley & Harrison, 2013). The anterior insula may be a central hub in this function. Taken together, the observed activity of pre-SMA and anterior insula may represent a supramodal neuronal signature of explicit emotion processing.
However, since similar patterns emerged for the congruent as well as the incongruent target concepts, they reflect an evaluative rather than a supramodal memory network. Considering the similarity with the salience network (Menon, 2015), this pattern may reflect the high evolutionary significance of emotion recognition.
Notably, angry voices elicited the strongest responses in the emotion evaluation network. Previous research revealed an overall increase in activation for vocal emotion compared with neutral expressions in a fronto-temporo-striatal network (Kotz et al., 2003). Ethofer et al. (2009) investigated brain regions that were more responsive to angry than to neutral prosody and identified bilateral IFG/OFC, amygdala, insula, mediodorsal thalamus, and the middle part of the STG. Furthermore, they showed that the activation of these regions was automatic and independent of the underlying task, concluding that angry prosody is processed irrespective of cognitive demands and attentional focus. Similar findings can be observed for visual emotion processing. Vuilleumier (2005) reported that the FFA was more activated by fearful than by neutral faces, even when faces were task-irrelevant. Our findings support the notion that angry prosody is perceived with particular dominance, which is of fundamental importance for prioritizing the processing of threat-related stimuli (Cox & Harrison, 2008; LeDoux, 2003).
Remarkably, no amygdala activation was observed for any of the emotion classification categories. Lesion studies show that the amygdala has modulatory influences on emotion processing areas and heightens activity in, for example, the FFA when fearful faces are perceived compared to neutral ones (Vuilleumier, Richardson, Armony, Driver, & Dolan, 2004); the same holds for prosodic emotion processing and the STS (Frühholz et al., 2015). Since emotional information was present in all trials, the absence of amygdala differences can be attributed to attention-independent emotion processing in all tasks. This notion is supported by previous findings (Vuilleumier, 2005; Vuilleumier, Armony, Driver, & Dolan, 2001) and was also explicitly validated in trials with a gender classification task, where amygdala activation was present even though attention was directed to gender (Morris, Öhman, & Dolan, 1998).
A comparison of congruent versus incongruent target concepts revealed increased workload in a fronto-parietal network for incongruence in both the emotion and gender SIATs. This network has already been described for the evaluation of semantically incongruent bimodal emotional stimuli (Klasen et al., 2011). It shows a large overlap with the executive control network described by Seeley et al. (2007), which reflects attention, working memory, and response selection. The almost identical findings of our study and Klasen et al. (2011) indicate a negligible influence of stimulus modality. Instead, the network seems to be driven by the aspect of incongruence itself, putatively reflecting increased task difficulty and cognitive workload in the incongruent condition. Moreover, pre-SMA activity may reflect conflict monitoring and error detection (Mayer et al., 2012). In a similar way, the social classification categories (emotion vs. gender) seem to be of only minor importance for incongruence networks.
In contrast to our initial hypotheses, we could not demonstrate a contribution of a supramodal memory representation to emotional categorization. Instead, emotion evaluation seems to involve large evaluative networks, some of them modality-independent, others not. In summary, recognition of facial and vocal emotions involves common networks in insula, IFG, and pre-SMA, but does not rely on a common supramodal memory representation. In line with this notion, a recent meta-analysis by Schirmer (2018) revealed fundamentally different pathways for auditory and visual emotions.
Effects of effortful, that is, conscious, emotion processing were observed in supplementary motor regions, which is in line with our findings. Emotion processing effects in limbic areas such as the amygdala, in turn, were task-independent and largely driven by the visual modality (Schirmer, 2018). This also offers a new perspective on crossmodal emotion integration. Neuroimaging findings show that congruent audiovisual emotions enhance activity primarily in limbic areas (Klasen et al., 2011). In line with the well-established visual dominance effect in audiovisual perception (Colavita, 1974), auditory emotions may be merely a supplement to visual perception, both behaviorally and neurobiologically, without the need to recruit a common memory representation.
The gender SIAT, in contrast, showed enhanced involvement of two prominent clusters: the VMPFC and ACC. These findings are well in line with the findings from Knutson et al. (2007) on the neural substrates of gender and racial bias, as well as with Milne and Grafman (2001), who found a reduced IAT effect in patients with VMPFC lesions. In these studies, VMPFC was considered as representing previously learned automatic processing of emotional and social information. Thus, VMPFC may support concept formation in long-term memory.
Widening the scope, this view is very much in line with neuroimaging research on schematic memory. Schemas are experience-based implicit memory representations of situational aspects that typically belong together. They are activated by perceptual input and form a framework for stimulus interpretation (Bowman & Zeithamova, 2018; Spalding, Jones, Duff, Tranel, & Warren, 2015), a conceptualization closely related to the spreading activation network theory by Collins and Loftus (1975). Recent neuroimaging studies highlight VMPFC contributions to establishing and retrieving schemas. A lesion study by Spalding et al. (2015) showed that subjects with focal VMPFC damage were impaired at integrating new information into a schema-congruent context. In a recent fMRI study, Bowman and Zeithamova (2018) describe the VMPFC as representing abstract prototype information, supporting generalization in conceptual learning across multiple domains. In summary, the VMPFC seems to store memories about typical examples and characteristic features of object categories. These "prototype" representations seem to facilitate object recognition in a top-down fashion: classification and response selection are based on the comparison of perceptual input with memory prototypes. In the case of gender, facial and vocal stimuli seem to access the same supramodal memory prototype, which may also account for the higher accuracy and faster reaction times compared to the emotion classification task. Supramodal prototypes may exist for emotions as well; however, the present study found no evidence for their contribution to stimulus classification.

| CONCLUSION
The present study identified modality-specific and modality-independent influences of attentional focus and memory representations on the neural processing of social stimuli. Irrespective of modality, emotion evaluation engaged a fronto-insular network which was independent of supramodal memory representations.
Gender classification, in turn, relied on supramodal memory representations in rACC and VMPFC.

Please contact Nim Tottenham at tott0006@tc.umn.edu for more information concerning the stimulus set.

CONFLICT OF INTEREST
The authors declare no potential conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from RWTH Aachen University. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from the corresponding author with the permission of RWTH Aachen University.