Functional magnetic resonance imaging (fMRI) item analysis of empathy and theory of mind

Abstract In contrast to conventional functional magnetic resonance imaging (fMRI) analysis across participants, item analysis allows generalizing the observed neural response patterns from a specific stimulus set to the entire population of stimuli. In the present study, we perform an item analysis on an fMRI paradigm (EmpaToM) that measures the neural correlates of empathy and Theory of Mind (ToM). The task includes a large stimulus set (240 emotional vs. neutral videos to probe empathic responding and 240 ToM or factual reasoning questions to probe ToM), which we tested in two large participant samples (N = 178, N = 130). Both the empathy-related network, comprising anterior insula, anterior cingulate/dorsomedial prefrontal cortex, inferior frontal gyrus, and dorsal temporoparietal junction (TPJ)/supramarginal gyrus, and the ToM-related network, including ventral TPJ, superior temporal gyrus, temporal poles, and anterior and posterior midline regions, were observed across participants and items. Regression analyses confirmed that these activations are predicted by the empathy or ToM condition of the stimuli, but not by low-level features such as video length, number of words, syllables or syntactic complexity. The item analysis also allowed for the selection of the most effective items to create optimized stimulus sets that provide the most stable and reproducible results. Finally, reproducibility was shown in the replication of all analyses in the second participant sample. The data demonstrate (a) the generalizability of empathy- and ToM-related neural activity and (b) the reproducibility of the EmpaToM task and its applicability in intervention and clinical imaging studies.


KEYWORDS
affect sharing, anterior insula, mentalizing, social cognition, temporoparietal junction

1 | INTRODUCTION
Aiming at elucidating the mechanisms underlying social understanding, human neuroscience research has extensively investigated the brain correlates of how we feel with (affective route) and know about others (cognitive route). The affective route allows for sharing others' emotions (empathy, affect sharing) (de Vignemont & Singer, 2006), for example, when vicariously sharing another person's sadness or grief.
The cognitive route enables reasoning about others' mental states (Theory of Mind, ToM, mentalizing) (Frith & Frith, 2005; Premack & Woodruff, 1978), for example, when attributing another person's belief, desire or intention. Several meta-analyses across different experimental approaches to both empathy and ToM have consistently described two distinct neural networks related to these functions.
Tania Singer and Philipp Kanske share senior authorship.
With three notable exceptions (Bruneau, Dufour, & Saxe, 2013; Dodell-Feder, Koster-Hale, Bedny, & Saxe, 2011; Theriault, Waytz, Heiphetz, & Young, 2017), all previous empathy and ToM investigations used conventional functional magnetic resonance imaging (fMRI) analyses across participants. These analyses allow generalizing the observed neural response patterns from the investigated participant sample to the human population they were sampled from, provided that they treat subjects as a random effect (as has become standard since the late 1990s; Friston, Holmes, & Worsley, 1999). However, the "fixed-effect fallacy" still applies to the item level (Clark, 1973), that is, it is unsubstantiated to claim that activation patterns observed for a sample of stimuli would generalize to the population of stimuli, for instance, that the activity observed in an experiment eliciting emotional responses would generalize to the population of emotion-eliciting stimuli. Furthermore, treating items as fixed could give single items with extreme responses disproportionate weight, thereby rendering a contrast of two conditions significant just because a (possibly small) subset of items in one condition shows very strong activity, while the majority of items show no effect.
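The last point can be illustrated with a small simulation. The sketch below uses made-up numbers rather than data from this study: a single one of eight hypothetical "emotional" items carries a strong response, and the same simulated data are tested once across subjects (items treated as fixed) and once across items (items treated as random). All names and parameter values are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(0)

N_SUBJECTS = 30
# Hypothetical true item effects: only one of eight "emotional" items is active.
emotional_effects = [0.0] * 7 + [6.0]
neutral_effects = [0.0] * 8

def simulate(effects):
    """Return responses[subject][item] = item effect + unit Gaussian noise."""
    return [[e + random.gauss(0.0, 1.0) for e in effects] for _ in range(N_SUBJECTS)]

emo = simulate(emotional_effects)
neu = simulate(neutral_effects)

# Subject-wise test: one emotional-minus-neutral difference per subject,
# followed by a one-sample t statistic across subjects.
subj_diffs = [statistics.mean(e) - statistics.mean(n) for e, n in zip(emo, neu)]
t_subject = statistics.mean(subj_diffs) / (
    statistics.stdev(subj_diffs) / math.sqrt(N_SUBJECTS)
)

# Item-wise test: one mean per item (averaged over subjects),
# followed by a two-sample Welch t statistic across items.
emo_items = [statistics.mean(col) for col in zip(*emo)]
neu_items = [statistics.mean(col) for col in zip(*neu)]
se_items = math.sqrt(
    statistics.variance(emo_items) / len(emo_items)
    + statistics.variance(neu_items) / len(neu_items)
)
t_item = (statistics.mean(emo_items) - statistics.mean(neu_items)) / se_items

print(f"subject-wise t = {t_subject:.2f}, item-wise t = {t_item:.2f}")
```

In this toy setting the subject-wise statistic is large because every subject responds to the one extreme item, whereas the item-wise statistic stays small because the variability across items is taken into account.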
To overcome these problems, item analyses that treat items as random are common in many behavioral fields of study and have been shown to be feasible for fMRI analyses as well (Andrews-Hanna, Reidler, Sepulcre, Poulin, & Buckner, 2010; Bedny, Aguirre, & Thompson-Schill, 2007; Dodell-Feder et al., 2011; Theriault et al., 2017; Troiani, Stigliani, Smith, & Epstein, 2014; Yee, Drucker, & Thompson-Schill, 2010). For example, Theriault et al. (2017) demonstrated positive correlations between regions in the ToM network and subjectivity ratings of metaethical judgments. Dodell-Feder et al. (2011) replicated a subject-wise analysis with an item analysis showing generalizability for false-belief ToM stories. Bruneau et al. (2013) performed an item-analysis on brief stories of physical or emotional pain yielding activity in the typical empathy and medial parts of the ToM related neural networks, respectively. This study did not, however, compare these results with the subject-wise analysis published previously (Bruneau, Pluta, & Saxe, 2012), which would directly show replicability of subject- and item-wise analyses. Interestingly, Bedny et al. (2007), who studied word class processing, found different results for subject- and item-wise analyses, demonstrating the potential of item analysis to make theoretically important distinctions, which in that case reconciled conflicting evidence regarding the role of the prefrontal cortex in processing nouns vs. verbs (Bedny & Thompson-Schill, 2006; Davis, Meunier, & Marslen-Wilson, 2004; Shapiro, Moo, & Caramazza, 2006; Tyler, Bright, Fletcher, & Stamatakis, 2004).
In the present study, we aimed to investigate whether item-analyses of empathy and ToM replicate the neural networks observed with subject-wise analyses. To this end, we applied a previously validated fMRI paradigm that assesses both functions (EmpaToM) (Kanske et al., 2015). Empathy is probed via video stimuli with brief autobiographical narrations that are highly emotionally negative or neutral. The negative emotional narrations included such diverse issues as traffic accidents, involuntary pregnancy, partnership problems, diverse somatic and mental diseases and disorders, betrayal and guilt, political violence, seeking refuge, rape, natural disaster, miscarriage, assault or burglary. These videos have been shown to elicit empathic responses on a subjective, a peripheral physiological, and a neural level. ToM reasoning is demanded in subsequent questions that ask either for the mental states of the narrator in the previous video or for factual reasoning about the events of the narration. The mental state questions included first and second order, true and false beliefs, preferences and desires, irony, sarcasm, metaphors, (white) lies, deception and faux pas. The empathy and ToM measures were validated in several behavioral and fMRI studies through correlations and activation overlap with established empathy (Socio-affective Video Task; Klimecki, Leiberg, Lamm, & Singer, 2013) and ToM tasks (False Belief Task; Dodell-Feder et al., 2011; Imposing Memory Task; Kinderman, Dunbar, & Bentall, 1998) and additional overlap with meta-analytical findings (Bzdok et al., 2012; Dodell-Feder et al., 2011; Kinderman et al., 1998; Klimecki et al., 2013). Conceptually, it is important for social neuroscience to show that empathy related neural activity generalizes beyond patterns only attributable to very specific stimuli, and to test whether ToM tasks other than false-belief tasks (Dodell-Feder et al., 2011) also lead to generalizable brain activation.
To illustrate this form of generalization: in psycholinguistics, an item-analysis in an experiment on verb processing allows generalizing the results from the limited sample of verbs tested to the population of verbs in that language (e.g., Bedny et al., 2007). Analogously, replicating the subject-analysis results in the EmpaToM with an item-analysis would allow generalizing to the population of empathy-inducing and ToM-demanding conversational situations. Given the breadth of the sampled situations in the EmpaToM (240 distinct videos and questions), testing generalizability may be challenging, but could also have particular impact.
Furthermore, a principal problem in subject-analysis is that discrepancies between two experimental conditions beyond the intended difference are uncontrollable confounds. Item-analysis, in contrast, allows specifically testing whether activations observed in a contrast of two conditions are actually due to unintended low-level differences between the conditions (e.g., more or less movement when telling an emotionally negative compared to a neutral story) rather than the intended difference (e.g., negative vs. neutral emotion). As item-specific activation patterns are obtained, they can be associated with the specific features of each item. Given that it is impossible to completely match emotional and neutral stimuli without erasing the difference in emotionality, ruling out the influence of such low-level features is a crucial issue. With regard to ToM, because of the considerable overlap of ToM related activity with regions involved in language processing, particularly in the temporal cortex and TPJ (Friederici, 2011; Schurz et al., 2014), it is critical to rule out the possibility that linguistic differences account for the observed ToM effects. Dodell-Feder et al. (2011) convincingly demonstrated this for false-belief tasks, but it is important to test whether this holds for other language-based ToM tasks as well.
Because the EmpaToM was designed to be used in extensive longitudinal designs, it includes five parallel sets of different videos and questions that allow the repeated testing of the same participants across time.
To enable usage of the EmpaToM in clinical and other settings, where only small participant samples are available or participants can be scanned for a very limited amount of time only, an item analysis on this large stimulus set affords the chance to select the most effective items to create stimulus sets that provide the most stable and reproducible results.
Finally, a major criticism of fMRI studies has been the limited sample size, which not only reduces the likelihood of detecting true effects, but also reduces the chance that a statistically significant result reflects a true effect (Button et al., 2013). Therefore, the present study made use of a large sample of participants (N = 178) and checked for reproducibility of the results in a second sample (N = 130).
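The second half of this argument can be made concrete with the standard positive predictive value (PPV) formula, PPV = (power × R) / (power × R + α), where R denotes the pre-study odds that a probed effect is real. The sketch below evaluates it for two hypothetical power levels; the chosen values of power, α and R are assumptions for illustration and are not estimates for this study.

```python
def positive_predictive_value(power, alpha, prior_odds):
    """Probability that a significant finding reflects a true effect.

    power      : probability of detecting a true effect (1 - beta)
    alpha      : type I error rate
    prior_odds : pre-study odds R that a probed effect is real
    """
    return (power * prior_odds) / (power * prior_odds + alpha)

# Hypothetical scenario: one in five probed effects is real (R = 0.25).
ppv_low_power = positive_predictive_value(power=0.2, alpha=0.05, prior_odds=0.25)
ppv_high_power = positive_predictive_value(power=0.8, alpha=0.05, prior_odds=0.25)

print(f"PPV at 20% power: {ppv_low_power:.2f}")
print(f"PPV at 80% power: {ppv_high_power:.2f}")
```

Under these assumed values, raising power from 20% to 80% raises the share of significant results that reflect true effects from one half to four fifths.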
In sum, applying item-analyses to an fMRI task probing empathy and ToM, the present study addresses several questions: (a) Will the item-analyses replicate the neural networks underlying empathy and ToM as observed with subject-wise analyses? This would argue for generalizability of the observed brain activation patterns to the respective stimulus classes (i.e., neutral and emotional autobiographical video narrations; factual reasoning and ToM questions, the latter involving a variety of ToM demands such as irony, higher order mental state inference, false beliefs, etc.).
(b) Can activity in the observed neural networks be predicted by low-level stimulus characteristics (i.e., number of sentences, words, syllables, characters, predicates, conjunctives, changes in tense, passive constructions, subclauses, and the amount of motion)? (c) Does the item-analysis allow creating stimulus sets including the most effective items to provide the most stable and reproducible results? (d) Are all of the above-described results replicable in the second independent participant sample?

2 | METHODS

| Participants
Two samples of 191 and 141 German-speaking participants were tested at baseline in the context of a large-scale longitudinal study (ReSource Project). Participants were recruited from the general public through adverts. Recruitment of Sample 1 took place in 2012-2013 and of Sample 2 in 2013-2014. Participants had very good language proficiency and were not included if they were below 20 or above 55 years of age, fulfilled the criteria for a mental or neurological disorder (according to structured clinical interviews for DSM-IV axis I and axis II disorders; Wittchen, Zaudig, & Fydrich, 1997) or had any contraindication for MRI scanning. Twenty-four participants had to be excluded due to study dropout (N = 5), dropout from MRI measurements (N = 1), or missing data due to technical, scheduling, or health issues (N = 18).

| Stimuli and task
For details of the EmpaToM task, see Kanske et al. (2015) and Figure 1.
Each trial started with a fixation cross (1-3 s), followed by the name of a person (2 s), who would speak in the subsequent video (~15 s).
Each participant was presented with videos of 12 persons, telling four different stories each that corresponded to four conditions (2 × 2 factorial design, negative vs. neutral emotion, ToM vs. no ToM demands).
After this, participants rated the valence of their current emotional state (sliding scale from negative to neutral to positive; 4 s) and how much compassion they felt for the person in the previous video (sliding scale from none to very much; 4 s). A second fixation cross (1-3 s) was followed by a multiple choice question with three response options (one correct). These questions demanded either the attribution of mental states or factual reasoning (ToM vs. factual reasoning).
Participants had to respond within 14 s. For example stories and questions, see Data S1. After a third fixation cross (0-2 s), participants were asked to rate their confidence that their decision was correct (4 s) to allow assessing metacognitive abilities (Molenberghs, Trautwein, Bockler, Singer, & Kanske, 2016; Valk et al., 2016). In the present study we focused on the main empathy and ToM measures, that is, comparing emotional with neutral videos and ToM with factual reasoning questions (see Kanske et al., 2015, for a validation of these contrasts).
The total stimulus set of the EmpaToM task comprised 240 videos and questions showing 60 different narrators in 4 conditions (see Figure 2). Based on this set, five parallel versions were created that each contained a different set of 12 narrators in 4 conditions (yielding 48 different videos and questions per set). The parallel sets were matched with regard to affect ratings, concern ratings, RTs, errors, confidence ratings, video lengths and linguistic characteristics of the questions (number of words, characters, predicates, changes in tense, complexity of the sentences [number of main and subordinate clauses], number of passive sentence constructions, and number of conjunctives; see Kanske et al., 2015). The five sets were randomly assigned to the participants such that each set (of 48 videos and questions) was seen by a fifth of the participants in Samples 1 and 2.
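The arithmetic of this design (60 narrators × 4 conditions = 240 stimuli, split into five sets of 48) and a balanced random assignment of participants can be sketched as follows. The allocation of narrators to sets below is a placeholder, since the published sets were matched on ratings and linguistic features, and the function and variable names are hypothetical.

```python
import random
from itertools import product

N_NARRATORS, N_SETS = 60, 5
# 2 x 2 design: emotionality of the video crossed with the type of question.
CONDITIONS = list(product(("emotional", "neutral"), ("ToM", "nonToM")))

# Placeholder allocation of 12 narrators to each set (not the matched published sets).
narrator_sets = [list(range(s, N_NARRATORS, N_SETS)) for s in range(N_SETS)]
stimulus_sets = [
    [(narrator, emotion, question) for narrator in narrators for emotion, question in CONDITIONS]
    for narrators in narrator_sets
]

def assign_participants(n_participants, seed=0):
    """Shuffle participant ids, then deal them out to the sets in turn."""
    rng = random.Random(seed)
    ids = list(range(n_participants))
    rng.shuffle(ids)
    return {pid: rank % N_SETS for rank, pid in enumerate(ids)}

assignment = assign_participants(178)
set_sizes = [sum(1 for s in assignment.values() if s == k) for k in range(N_SETS)]
print(set_sizes)
```

Dealing shuffled participants out in turn keeps the set sizes within one participant of each other even when the sample size is not divisible by five.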

| MRI data acquisition
Data were acquired on a 3 T MRI scanner (Siemens Magnetom Verio, Siemens Medical Solutions, Erlangen, Germany) using a 32-channel head coil. Functional images were acquired with a T2*-weighted echo-planar imaging (EPI) sequence (TR = 2,000 ms; TE = 27 ms; flip angle = 90°; matrix = 70 × 70 mm; FOV = 210 mm). Within one TR, 37 axial slices of 3 mm were acquired. In addition, we collected a high-resolution structural image (1 × 1 × 1 mm) with a T1-weighted MPRAGE sequence.

[...] These regressors were convolved with a canonical hemodynamic response function. Six regressors accounting for head movement effects were modeled as covariates of no interest. The RobustWLS Toolbox (Diedrichsen & Shadmehr, 2005) was used to reduce potential noise artifacts. Contrast images for empathy (emotional vs. neutral videos) and ToM (ToM vs. nonToM questions) were calculated by applying linear weights to the parameter estimates and entered into one-sample t-tests for random effects analysis.

FIGURE 1 EmpaToM trial sequence. Emotional and neutral videos with and without ToM demands (2 × 2 design) are followed by valence and compassion ratings, ToM and factual reasoning questions, and a confidence rating (adapted from Kanske et al. (2015)). This study investigated the effects of subject- and item-wise analyses on the empathy and theory of mind contrasts. Empathy was tested via emotionally negative versus neutral videos and theory of mind was tested via mental state versus factual reasoning questions. ToM, Theory of Mind

| Behavioral data analysis
The item analyses were performed for each contrast separately by modeling the emotional and neutral videos, and the ToM and factual reasoning questions on the individual subject level. Each analysis resulted in 48 beta maps per subject (12 narrators × 4 conditions).
The beta maps were averaged across the subjects within each of the five parallel versions (see Figure 2) to obtain a single beta map per narrator and condition. For each of the five subgroups, this yielded 48 beta maps, each comprising the mean beta value across subjects at every voxel, adding up to 240 beta maps in total.
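The averaging step can be sketched in a few lines. This is a toy illustration that assumes each subject's first-level output is available as a mapping from an item key (narrator, condition) to a flat list of voxel values; the data layout, names and numbers are hypothetical and not taken from the study's analysis pipeline.

```python
def average_item_maps(subject_maps):
    """Average voxel-wise beta maps across subjects, separately for each item.

    subject_maps: one dict per subject of the same subgroup, mapping an item
    key to a list of voxel values (all lists have the same length).
    """
    n_subjects = len(subject_maps)
    return {
        item: [sum(voxel) / n_subjects for voxel in zip(*(m[item] for m in subject_maps))]
        for item in subject_maps[0]
    }

# Two hypothetical subjects who saw the same parallel set (3 voxels per map).
subset_a = [
    {("narrator_01", "emotional"): [1.0, 2.0, 3.0], ("narrator_01", "neutral"): [0.0, 0.0, 1.0]},
    {("narrator_01", "emotional"): [3.0, 4.0, 5.0], ("narrator_01", "neutral"): [2.0, 0.0, 1.0]},
]
item_maps = average_item_maps(subset_a)
print(item_maps)
```

Applied to each of the five subgroups with 48 items each, a function of this kind would return the 240 item-level maps that enter the second-level analysis.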
For the second-level random effects analyses, we modeled the main condition contrasts (emotional vs. neutral videos, ToM vs. nonToM questions) together with the subgroup factor as a covariate of no interest in order to account for the dependencies among the 240 beta maps arising from the five groups of participants. The main contrasts were tested with two-sample t-tests.
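For a single voxel or region, a contrast of this kind can be sketched as a two-sample comparison across items in which subgroup offsets are removed first. The sketch assumes a balanced design (equal numbers of items per condition within every subgroup), in which case the condition effect is simply the difference of condition means and the residual degrees of freedom are reduced by the number of subgroups. The function name and toy values are hypothetical, and this is not a reproduction of the software used in the study.

```python
import math
from collections import defaultdict

def item_contrast_t(items):
    """t statistic for condition A vs. B across items, adjusting for subgroup.

    items: list of (subgroup, condition, value) with condition "A" or "B".
    Assumes a balanced design within each subgroup.
    Returns (t, degrees_of_freedom).
    """
    by_group = defaultdict(list)
    for group, _, value in items:
        by_group[group].append(value)
    group_mean = {g: sum(v) / len(v) for g, v in by_group.items()}

    a = [v for _, c, v in items if c == "A"]
    b = [v for _, c, v in items if c == "B"]
    diff = sum(a) / len(a) - sum(b) / len(b)

    # Residuals around subgroup mean plus half the condition effect (+/-).
    rss = 0.0
    for group, cond, value in items:
        fitted = group_mean[group] + (0.5 if cond == "A" else -0.5) * diff
        rss += (value - fitted) ** 2

    df = len(items) - len(group_mean) - 1
    se = math.sqrt(rss / df * (1.0 / len(a) + 1.0 / len(b)))
    return diff / se, df

# Toy data: two subgroups with very different baselines but the same effect.
toy = [
    (1, "A", 3.0), (1, "A", 5.0), (1, "B", 1.0), (1, "B", 3.0),
    (2, "A", 13.0), (2, "A", 15.0), (2, "B", 11.0), (2, "B", 13.0),
]
t_value, dof = item_contrast_t(toy)
print(t_value, dof)
```

Without the subgroup term, the large baseline difference between the two toy subgroups would inflate the error variance and mask the condition effect.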
The results for the subject-wise as well as the item-wise analyses were thresholded at p < .001 at voxel-level together with an FWE (family-wise error) correction (p < .05) at the cluster level.

| Regression analysis
For both contrasts, regions of interest (ROIs) (N = 46; 23 ToM, 23 empathy) were defined on the basis of the subject-wise random effects analyses of Sample 1 (see Table 1 for empathy, Table 2 for ToM). They were used to extract the beta values from Sample 2 for [...]

FIGURE 2 EmpaToM stimulus material. The overall stimulus material of the EmpaToM task contains 240 videos and questions with 60 different narrators in 4 conditions (emotional vs. neutral, ToM vs. nonToM), allocated to one of five parallel subsets. Each subset contains 12 different narrators in 4 conditions. The subsets are matched with regard to affect ratings, concern ratings, RTs, errors, confidence ratings, video lengths, and linguistic characteristics of the questions (number of words, characters, predicates, changes in tense, complexity of the sentences, number of passive sentence constructions, and number of conjunctives) (see Kanske et al., 2015). Subjects were randomly assigned to one of the five subsets, so that each subset was seen by a fifth of the participants in Sample 1 (N = 178) and Sample 2 (N = 130). ToM, Theory of Mind

TABLE 1 Whole brain subject- and item-wise random effects results for Videos Emotional > Neutral. The results are reported at a voxel-level threshold of p < .001 uncorrected together with an FWE-corrected cluster threshold of p [...]

The amount of spoken or written text, for example, measured by the number of words, has been used as a proxy for constituent size (Goucha & Friederici, 2015; Pallier, Devauchelle, & Dehaene, 2011).
These studies showed that increasing constituent size is associated with an increase of neural activation in left hemispheric cortical areas such as the inferior frontal gyrus, temporo-parietal junction, superior temporal sulcus and temporal pole, regions that are also engaged during empathy and theory of mind processing. Therefore, we tested whether differences in the number of words, characters, sentences or syllables can account for the observed effects in the EmpaToM task.
Five additional features measure aspects of syntactic complexity, that is, number of predicates, tenses, passives, conjunctives and complexity (lexical diversity: type token ratio). Syntactic complexity is correlated with working memory load, as indicated by higher error rates and longer processing times in sentence comprehension. fMRI studies showed that this effect modulates the neural activity in the left inferior frontal gyrus, middle frontal gyrus, and temporo-parietal junction (Meltzer, McArdle, Schafer, & Braun, 2010; Newman, Malaia, Seo, & Cheng, 2013), suggesting the possibility that items with higher syntactic complexity influence activation patterns in the same cortical areas that are engaged during empathy and ToM processing.
Besides low-level features associated with spoken and written text, we additionally selected three general low-level features that characterized the video material: duration of videos, motion and velocity of the narrator's movement. Emotionality may not only be communicated by language and facial expression but is also facilitated by spontaneous gestures and movements (Dick, Solodkin, & Small, 2010). Gesture comprehension is supported by a cortical network comprising the bilateral temporo-parietal junction, bilateral superior parietal lobe, left inferior and middle frontal gyrus, and the left superior and middle temporal gyrus (Yang, Andric, & Mathew, 2015). Because of the considerable overlap with empathy related activity, we included these factors in the regression analysis to rule out that differences in the video material account for the observed empathy effects.
We performed stepwise forward/backward regression analyses with the item responses in the previously defined ROIs as dependent variables and condition and the selected features as independent variables.
Stepwise regression is an iterative process of selecting and eliminating multiple variables depending on the model's best fit to the data. It is particularly useful in cases where there are large numbers of predictors. In
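A bidirectional stepwise procedure of this kind can be sketched as follows. The selection criterion is an assumption: the sketch uses the Akaike information criterion (AIC) of an ordinary least squares fit, which may differ from the criterion applied in the study, and the predictor names and toy values are hypothetical. At every step, the single addition or removal that lowers the AIC most is applied, and the search stops when no move improves the fit.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] * n] + [[float(v) for v in col] for col in columns]
    p = len(design)
    # Augmented normal equations [X'X | X'y], solved by Gaussian elimination.
    m = [
        [sum(a * b for a, b in zip(ci, cj)) for cj in design]
        + [sum(a * b for a, b in zip(ci, y))]
        for ci in design
    ]
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, p):
            factor = m[r][i] / m[i][i]
            for k in range(i, p + 1):
                m[r][k] -= factor * m[i][k]
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (m[i][p] - sum(m[i][k] * beta[k] for k in range(i + 1, p))) / m[i][i]
    fitted = [sum(b * col[j] for b, col in zip(beta, design)) for j in range(n)]
    return sum((yj - fj) ** 2 for yj, fj in zip(y, fitted))

def aic(predictors, names, y):
    """AIC (up to a constant) of the model with the named predictors; needs RSS > 0."""
    rss = ols_rss([predictors[name] for name in names], y)
    return len(y) * math.log(rss / len(y)) + 2 * (len(names) + 1)

def stepwise(predictors, y):
    """Forward/backward selection: toggle one predictor per step while AIC drops."""
    selected = []
    best = aic(predictors, selected, y)
    while True:
        candidates = []
        for name in predictors:
            if name in selected:
                trial = [s for s in selected if s != name]  # backward move
            else:
                trial = selected + [name]                   # forward move
            candidates.append((aic(predictors, trial, y), trial))
        score, trial = min(candidates, key=lambda c: c[0])
        if score >= best:
            return selected
        best, selected = score, trial

# Toy item data: the ROI response follows condition, not the number of words.
predictors = {
    "condition": [0, 0, 0, 0, 1, 1, 1, 1],
    "n_words": [10, 10, 12, 12, 10, 10, 12, 12],
}
roi_response = [0.1, -0.1, 0.1, -0.1, 2.1, 1.9, 2.1, 1.9]
selected = stepwise(predictors, roi_response)
print(selected)
```

In the toy data the word count is unrelated to the response, so adding it only incurs the complexity penalty and the procedure retains condition alone.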

| Optimized sets of stimuli
The results of the item analyses were used to identify optimal sets of items which elicit the most prototypical response in both contrasts [...]

TABLE 2 Whole brain subject- and item-wise random effects results for Questions ToM > non ToM. The results are reported at a voxel-level threshold of p < .001 uncorrected together with an FWE-corrected cluster threshold of p [...]

3.2 | Neuroimaging data

| Empathy
We performed whole brain subject- and item-wise random effects analyses, first on the data set acquired in Sample 1 (see Figure 3a,b and [...] showed significant activity only in the subject-wise, but not the item-wise analysis, including lingual and middle occipital gyrus, which would suggest that the activation is due to features of some specific stimuli and that it is not generalizable. The pattern of results was the same in Sample 2 (see Figure 3c,d and Table 1).

| Theory of mind
As for empathy, we first performed whole brain subject- and item-wise random effects analyses on the data set acquired in Sample 1 (see Figure 4a,b and Table 2). All of the brain regions typically involved in ToM were activated for ToM questions compared to factual reasoning questions, both across subjects and across items. These regions include bilateral ventral TPJ, STS, temporal poles, precuneus, [...] Figure 4c,d and Table 2).

| Regression analysis
We performed stepwise forward/backward regression analyses on the data set acquired in Sample 2 for the empathy (23 ROIs) and ToM (23 ROIs) contrasts. All ROIs were independently defined by the whole-brain subject-wise analysis of Sample 1. The results show that for both contrasts condition is almost the only predictor for all regions that were tested (see Table 3 [...]

FIGURE 3 Consistency of the empathy related activation patterns (video: emotional > neutral) across item-wise (a) and subject-wise (b) analyses in Sample 1 (N = 178) and in Sample 2 (N = 130) (c, d, respectively). The results show activity in the empathy related network for emotional versus neutral videos, both across subjects and across items. This network includes anterior insula, anterior cingulate cortex/dorsomedial prefrontal cortex, inferior frontal gyrus and dorsal portions of the temporoparietal junction (supramarginal gyrus)

[...] based on action observation as in social animations (e.g., Castelli et al., 2000), or to conceptual knowledge about persons as in trait judgments (e.g., Mitchell et al., 2002). Consequently, the stimuli of the EmpaToM task do not elicit all possible forms of empathic responses and theory of mind reasoning. A more comprehensive approach to generate a random sample of items that is representative for theory of mind and empathy might be realized by an ecological momentary assessment (EMA) (Shiffman, Stone, & Hufford, 2008). This approach involves repeated sampling of subjects' social interactions in real time over periodic intervals, thereby enabling a high ecological validity.
Future studies could therefore arrive at stronger conclusions about the precise nature of the population of items.
However, given the number of videos and questions (240 in total for each type) and the fact that no situation was repeated, there is considerable breadth within this conversation type situation. Complying with the call for "item-analyses with a larger and more variable set of stimuli" (Dodell-Feder et al., 2011), the present results thus expand previous reports of consistent activity for reading false-belief (20 items; Dodell-Feder et al., 2011) and physical or emotional pain stories (24 items each; Bruneau et al., 2013).
Another critical question pertains to possible confounds due to the generally high error rates and the differences in behavioral performance. The EmpaToM task was explicitly designed to be hard, which makes it unique among theory of mind tasks for adults in functional neuroscience. In other tasks, for example, false belief or social animations, healthy participants typically perform at 100% or nearly 100% accuracy. The drawback of those measurements is that they are not sensitive enough to pick up improvements in performance over time, whereas the EmpaToM task is (Böckler et al., 2017; Trautwein et al., 2020). Given that participants were less accurate in the nonToM condition than in the ToM condition, one might think that the differential brain activation identified with the contrast (ToM > nonToM) reflects the effect of general task difficulty. However, we think this is unlikely for the following reasons: First, prior to the fMRI measurements, participants were sufficiently familiarized with the task and the different conditions. Second, a previous study that validated the EmpaToM task with other measures of empathy and theory of mind did not detect any differences in accuracy (Kanske et al., 2015; Exp. 1). In line with these results, subjects' confidence ratings, indicating their performance evaluation, were equal across all conditions, meaning that the participants did not evaluate the nonToM condition as more difficult than the ToM condition. Finally, further results of this study also showed that theory of mind performance correlates positively with the activity of the default mode network, whereas areas in the default mode network typically show increasing deactivation with increasing task difficulty (e.g., Buckner, 2008).
Activity in a few regions observed in the subject-wise analyses was not present for the item-wise analyses. These include the supple- [...]

fMRI item-analyses allow an item-specific estimate of the neural activity in a brain region, which might serve as an indicator of the region's function. As the items can be characterized not only regarding their experimental category but also regarding multiple other features (e.g., constituent size, or syntactic complexity), it is possible to determine which features best predict the neural response in each brain region (see e.g., Bruneau et al., 2013; Dodell-Feder et al., 2011 [...]

Given the recent discussions about difficulties in replicating psychological findings (Lindsay, 2015; Open Science Collaboration, 2015), we aimed at testing the stability of our findings in a within-study replication. Indeed, the results from a second independent sample corroborated the conclusions of the first sample, that is, reproducible neuroimaging results in subject- and item-wise analyses that are independent from low-level stimulus characteristics. Furthermore, addressing the critique of small sample sizes in many neuroimaging studies (Button et al., 2013), the two samples we assessed were relatively large in comparison to most fMRI investigations (which mostly include <40 participants) (David et al., 2013). Thus, the present study lends a high degree of trustworthiness to the observed neural activation patterns for empathy and ToM. Future studies could of course further strengthen this conclusion, for instance by probing the test-retest reliability of the results, which has been shown to be highly variable across brain regions and experimental paradigms (Plichta et al., 2012).
The specific activation patterns observed for empathy and ToM are not only consistent across subject- and item-wise analyses, but also correspond to the typical networks associated with the two functions in large-scale meta-analyses (Bzdok et al., 2012; Lamm et al., 2011; Molenberghs, Johnson, et al., 2016; Schurz et al., 2014). An interesting aspect is that the meta-analyses suggest the existence of [...]

Assuming that most experimental paradigms capture specific component processes of full-fledged empathy or ToM (Schurz & Perner, 2015), the finding of activation in the extended networks for the EmpaToM suggests that the task comprehensively captures the complexity of these two social capacities (as is the case for other paradigms aiming at ecological validity; Wolf, Dziobek, & Heekeren, 2010). Furthermore, taking the independence of the neural bases of empathy and ToM into account (Kanske et al., 2015; Kanske et al., 2016) and observing the two networks in both types of analyses here corroborates the assumption that empathy and ToM are distinct social functions, possibly serving specific purposes in social encounters, for example, establishing the motivation for cooperation and enhancing prosocial behavior (Tusche, Bockler, Kanske, Trautwein, & Singer, 2016).
The results of the item-analysis made it possible to select those videos and questions that elicit the most prototypical responses in terms of activation in the neural networks that meta-analyses have associated with empathy and ToM (Bzdok et al., 2012; Lamm et al., 2011; Schurz et al., 2014) and in behavior. To avoid circularity, we selected the stimuli based on Sample 1 and tested them in the independent Sample 2, showing strong and consistent activation patterns across the two samples. This way, we could form several optimized stimulus sets for future usage in specific settings. In particular, the short versions of the task enable testing special populations with reduced attention spans, for instance, in psychopathology (Preckel, Kanske, Singer, Paulus, & Krach, 2016) or assessing multiple tasks, including the EmpaToM, within one session, for instance, to predict social behavior based on empathic and ToM capabilities (Tusche et al., 2016). The optimized parallel sets could be applied in longitudinal designs, including intervention research.
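The logic of selecting in one sample and evaluating in the other can be sketched as below. The ranking criterion (a single response score per item) is a simplifying assumption, as the actual selection also took behavior into account, and the item labels and scores are invented for illustration.

```python
def select_top_items(selection_scores, k):
    """Return the k items with the highest score in the selection sample."""
    return sorted(selection_scores, key=selection_scores.get, reverse=True)[:k]

def mean_score(scores, items):
    """Mean score of the given items in a (possibly different) sample."""
    return sum(scores[item] for item in items) / len(items)

# Hypothetical item-level responses in a network of interest.
sample1 = {"video_a": 0.9, "video_b": 0.1, "video_c": 0.7, "video_d": 0.3}
sample2 = {"video_a": 0.8, "video_b": 0.2, "video_c": 0.6, "video_d": 0.4}

optimized_set = select_top_items(sample1, k=2)        # chosen on Sample 1 only
held_out_mean = mean_score(sample2, optimized_set)    # evaluated on Sample 2
all_items_mean = mean_score(sample2, list(sample2))

print(optimized_set, held_out_mean, all_items_mean)
```

Because the evaluation scores come from participants who played no role in the selection, a high held-out mean cannot be an artifact of picking items on noise in the selection sample.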
To conclude, by replicating the empathy and ToM related neural networks across item- and subject-wise analyses and demonstrating their independence from low-level stimulus characteristics, the present results contribute methodologically to the social neuroscience literature and add to our understanding of these social capacities as distinct functions.

ACKNOWLEDGMENTS
This study forms part of the ReSource Project, headed by Tania Singer.

DATA AVAILABILITY STATEMENT
The data of this study are available from the authors upon reasonable request.