Sound Predicts Meaning: Cross-Modal Associations Between Formant Frequency and Emotional Tone in Stanzas

Research on the relation between sound and meaning in language has reported substantial evidence for implicit associations between articulatory–acoustic characteristics of phonemes and emotions. In the present study, we specifically tested the relation between the acoustic properties of a text and its emotional tone as perceived by readers. To this end, we asked participants to assess the emotional tone of single stanzas extracted from a large variety of poems. The selected stanzas had either an extremely high, a neutral, or an extremely low average formant dispersion. To assess the average formant dispersion per stanza, all words were phonetically transcribed and the distance between the first and second formant per vowel was calculated. Building on a long tradition of research on associations between sound frequency on the one hand and non-acoustic concepts such as size, strength, or happiness on the other hand, we hypothesized that stanzas with an extremely high average formant dispersion would be rated lower on items referring to Potency (dominance) and higher on items referring to Activity (arousal) and Evaluation (emotional valence). The results confirmed our hypotheses for the dimensions of Potency and Evaluation, but not for the dimension of Activity. We conclude that, at least in poetic language, extreme values of acoustic features of vowels are a significant predictor for the emotional tone of a text.

One branch of studies within the realm of sound iconicity has investigated the relationship between the content of texts and the frequency of occurrence of particular phonemes in these texts. Compared to other areas of research on sound iconicity, the results from these studies provide a fairly ambiguous picture. Whereas a majority of the studies have reported significant relations between the emotional tone of texts and the relative occurrence of certain phonemes (Aryani, Conrad, & Jacobs, 2013; Auracher, Albers, Zhai, Gareeva, & Stavniychuk, 2010; Auracher, Scharinger, & Menninghaus, 2019; Fónagy, 1961; Whissell, 1999, 2003, 2004a, 2004b, 2011, 2016, 2017, 2018), other studies have found no such relations (Kraxenberger & Menninghaus, 2016; Miall, 2001). What is more, the results of the studies that did find significant sound-iconic relations vary greatly and even seem to partially contradict each other (Tsur & Gafni, 2020).
We propose that these inconsistencies can partly be explained by differences in the applied research designs (Auracher, 2020). Starting with Humboldt, it has repeatedly been suggested that there are various kinds of phonosemantic relations in language that differ regarding the underlying cognitive processes (Humboldt, 1836; see also Magnus, 2001). A detailed discussion of the different kinds of phonosemantic relations would clearly go beyond the scope of this article (for reviews, see Hinton et al., 1994; Perniss et al., 2010; Reay, 1994; Schmidtke et al., 2014). However, we would like to point out that our choice of research design was motivated by the specific nature of the phonosemantic relation at stake, so that our results might not be directly comparable to those of other studies with a different research interest.
First, while most studies so far have focused on the relative occurrence of individual phonemes as a function of a text's content, this study will focus on the relation between the content of a text and its acoustic characteristics. There is good evidence for affective associations with either individual phonemes or the articulatory-acoustic properties of phonemes. For example, Monaghan and Fletcher (2019), asking participants to assess the meaning of non-words, found that statistical models based on individual phonemes fitted participants' responses significantly better than models based on phonetic features. Aryani and colleagues, applying neuroimaging techniques (fMRI), reported that words that sounded arousing led to increased activity in areas of the brain that also respond to nonverbal emotional expressions and emotional prosody (Aryani, Hsu, & Jacobs, 2018). Additionally, Aryani, Conrad, et al. (2018) reported that acoustic characteristics of words can account for a significant portion of their affective meaning and that these acoustic characteristics could also predict the assessment of the emotional valence and level of arousal of pseudowords. Comparing the effects of articulatory and acoustic features of phonemes on sound-iconic associations, Ohtake and Haryu (2013) observed that the effect of acoustic characteristics was independent of that of articulatory movements, whereas articulatory characteristics alone had no effect on sound-iconic associations. What is more, it has been claimed that some forms of sound iconicity are simply a specific variant of cross-modal associations, that is, of associations between basic features of perception that address different sensory modalities (e.g., Marks, 1978; Ohala, 1994; Parise, Knorre, & Ernst, 2014; Parise & Spence, 2013; Spence, 2011). The aim of the present study, therefore, is to test the role of such cross-modal associations for the perception of written language. Consequently, we will study the relation between the acoustic features of phonemes and a text's content rather than counting the frequency of distinct phonemes. Note that by acoustic characteristics of phonemes, we refer to the average values of acoustic properties of phonemes that have been measured for a certain number of people. That is, individual characteristics of the voice of a speaker, effects of emotional prosody, or modulations of a phoneme's pronunciation due to the specific phonetic constellation in a word will not be taken into account.
Second, a majority of the studies on sound iconicity in written language were conducted using what is, in our view, an insufficient number of text samples. We believe that it is necessary to draw on a large number of text samples in order to isolate the potential effects of sound iconicity from the influence of other linguistic features.
Third, in this study, rather than treating the relative occurrence of acoustic features in a text as a dependent variable, we test the power of acoustic features to predict readers' emotional assessment of texts (Auracher et al., 2010).
Finally, earlier studies often applied ad hoc categories to assess sound-iconic associations of phonemes or phonetic features. In contrast, for this study, we employ a well-established dimensional model of emotions comprising the three factors Evaluation, Potency, and Activity (the EPA model) as introduced by Osgood, Suci, and Tannenbaum (1957). According to this model, an emotion can be defined by its position in a three-dimensional space that characterizes states as positive or negative (Evaluation), dominant or submissive (Potency), and active or passive (Activity). Later versions of the EPA model have often referred to the three dimensions as Valence or Pleasure (for Evaluation), Arousal or Intensity (for Activity), and Dominance or Controllability (for Potency) (e.g., Mehrabian & Russell, 1974; Trnka, Lačev, Balcar, Kuška, & Tavel, 2016). While we consider these terms largely interchangeable, it is important to point out some nuances that set them apart. Bakker and colleagues have noted that the terms used in these models sometimes differ regarding their focus on affect, cognition, or behavior (Bakker, van der Voordt, Vink, & de Boon, 2014). For example, while Osgood et al. (1957) operationalized the Activity dimension in their EPA model by using antonym pairs such as fast-slow or active-passive, Mehrabian and Russell (1974) replaced the term Activity with Arousal in their PAD model, and, in so doing, shifted the focus from physical or cognitive activity to an inner state. Although the different stages related to Activity or Arousal are potentially interrelated, it is also true that the flexibility of the EPA model makes it conceptually fuzzy and its dimensions hard to define. Consequently, results of studies that used the EPA dimensions are often difficult to compare, as the operationalization of the dimensions can show substantial variations. For our study, we decided to stick to the original terms of the EPA model. However, we will also use different terms when reporting results of other studies depending on the focus of their respective research interests.

Research objective and hypothesis
Research on sound-iconic properties of vowels has mostly focused on the effect of their articulatory place (i.e., front vs. back vowels) and/or articulatory height (i.e., high or closed vowels vs. low or open vowels) on cross-modal associations (for a broader overview, see Material S1). The majority of these studies have addressed the phonosemantic relation between articulatory features of vowels and the notion of size, strength, or dominance more generally (for reviews, see Auracher, 2017; Davis, Morrow, & Lupyan, 2019; Hoshi, Kwon, Akita, & Auracher, 2019; Nuckolls, 1999; Ohala, 1994; Tsur, 2006). In sum, there is strong evidence suggesting that low-back vowels are preferably associated with a high level of Potency or Dominance, whereas high-front vowels are mostly related to low Potency or Submissiveness. As articulatory features of vowels are directly linked to their acoustic characteristics, the results of these studies can also be translated into sound-iconic associations. That is, the place of articulation has a strong influence on the frequency of the second formant, whereas the height of articulation affects the frequency of the first formant (Ladefoged & Disner, 2012). As a consequence, the distinction between high front vowels, such as /i/ in free, and low back vowels, such as /ɑː/ in spa, is mirrored in the relative formant dispersion, that is, the vowel's characteristic distance between the first and second formant.
Compared to the well-studied relation between vowels' articulatory features and Potency, considerably fewer studies have tested associations between vowels and concepts related to Evaluation or Activity. For Evaluation, the studies by Klink (2000), Markel and Hamp (1960), Miron (1961), and Tarte (1982) provided evidence that front vowels (compared to back vowels) received higher ratings for pleasant, beautiful, harmonious, and good. Whereas these studies employed questionnaires to directly test sound-iconic associations with vowels' articulatory-acoustic features, recent studies applied more indirect approaches. For example, some studies considered the relative occurrence of phonemes or phonetic-acoustic features in annotated wordlists (i.e., wordlists with ratings for Valence and Arousal per word) as an indicator for the emotional connotations of these phonemes (e.g., Adelman, Estes, & Cossu, 2018; Aryani, Conrad, et al., 2018; Louwerse & Qu, 2017). Similar strategies to tackle associations between phonemes (or phonetic features) and emotional tone have previously been applied by Ertel (1965) and Heise (1966). However, neither of these studies reported a specific relation between formant dispersion (the way it is operationalized in our study) and the dimensions of Evaluation or Activity.
Taking a different approach, Myers-Schulz and colleagues found that dynamic upward or downward shifts of the formant frequencies (particularly of the second formant) predict emotional associations with pseudo-words in a forced-choice experiment (Myers-Schulz, Pujara, Wolf, & Koenigs, 2013). These findings can also be interpreted as supporting a positive correlation between Evaluation and formant dispersion, given that downward shifts of the second formant, which have been found to be associated with negative emotional valence, diminish the distance between the first and second formant and, therefore, the formant dispersion. Conversely, upward shifts, which increase the formant dispersion, were more readily matched with positive emotional valence. However, the exact relation between emotional associations of dynamic shifts of formant frequencies on the one hand and those of static measurements of formant dispersion on the other hand remains a question for further studies.
A specific relation between the vowels /i:/ (high-front vowel) and /o:/ (back vowel) and emotional valence has been demonstrated by Rummer, Schweppe, Schlegelmilch, and Grice (2014). The authors reported that participants who repeatedly articulated the phoneme /i:/ while watching a cartoon rated it as funnier than participants who repeatedly articulated the phoneme /o:/. The authors explain this in terms of the parallels between the emotional facial gestures that express positive and negative mood and the articulatory movements of the vowels /i:/ and /o:/, respectively.
While only a few studies have tested sound-iconic associations between vowels and Evaluation, research on the relation between vowels' articulatory-acoustic characteristics and lightness or brightness has some potential to compensate for this lack of direct empirical evidence. Vowels with a high formant dispersion have repeatedly been found to be closely related to brightness, while vowels with a low formant dispersion have been found to be more readily associated with darkness (Becker & Fisher, 1988; Bentley & Varon, 1933; Fischer-Jørgensen, 1968; Greenberg & Jenkins, 1966; Markel & Hamp, 1960; Newman, 1933). Similarly, studies on cross-modal associations between vowels and colors have reported that bright colors are preferably matched with high front vowels, while dark colors are usually related to low back vowels (Cuskley, Dingemanse, Kirby, & van Leeuwen, 2019; Moos, Smith, Miller, & Simmons, 2014; Park & Osera, 2008; Wrembel & Rataj, 2008). Given that in most cultures the opposition between bright and dark is also metaphorically used to express emotional valence (e.g., dark mood [Eng.], strahlendes Lächeln [German], akarui hito [Japanese], and so forth; see also Adams & Osgood, 1973; Crawford, 2009; Meier, Robinson, Crawford, & Ahlvers, 2007), we assume that the relation between vowels and brightness accounts for at least a significant fraction of the associations between vowels and (positive) Evaluation.
Regarding the relation between vowels and Activity, front vowels (as compared to back vowels) have been found to be preferably assessed as "fast" (Folkins & Lenrow, 1966; Klink, 2000; Klink & Wu, 2014; Lowrey & Shrum, 2007; Markel & Hamp, 1960; Tarte, 1982), "active" (Folkins & Lenrow, 1966; Greenberg & Jenkins, 1966; Markel & Hamp, 1960; Tarte, 1982), and "tense" (Markel & Hamp, 1960), suggesting that a high formant dispersion is associated with a high degree of activity and tension. However, these studies mostly focused on associations of articulatory-acoustic characteristics of vowels with a notion of activity rather than with an inner state of arousal. In contrast, other studies have shown that a speaker's level of arousal has an effect on formant frequency, with higher arousal leading to a higher first formant (Goudbeek, Goldman, & Scherer, 2009; Laukka, Juslin, & Bresin, 2005; Vlasenko, Prylipko, Philippou-Hübner, & Wendemuth, 2011). However, as the frequency of the first formant alone is not sufficient to draw conclusions on the formant dispersion, that is, the distance between the first two formants, it is not clear how these results relate to our study. A similar problem arises when comparing our study with the one by Aryani, Conrad, et al. (2018), who tested the relation between Arousal and formant bandwidths as well as between Arousal and standard deviations of spectral center of gravity. These measurements not only contrast with our definition of formant dispersion (distance between mean F2 and F1), thereby precluding a direct comparison, but also introduce acoustic parameters that are prone to depend on acoustic context and speaker properties, variables that have been excluded in our study. Thus, to the best of our knowledge, there are currently no studies that tested the relation between formant dispersion, defined as the distance between the first two formants, and arousal, defined as an affective state.
Taken together, studies that monitored the associations of articulatory and acoustic properties of vowels with emotions suggest a relation between high front vowels, that is, vowels with a high value for formant dispersion, and an emotional tone that is characterized by a positive valence, a feeling of weakness or submissiveness, and an active or aroused state. In contrast, low back vowels, that is, vowels with a low formant dispersion, should be associated with a negative emotional valence, dominance, and calmness. However, as outlined above, findings regarding the relation between arousal and the frequency of the first formant indicate that the relation between formant dispersion and Activity is not clear yet and calls for further research. Hypothesizing a close relation between the acoustic properties of a text's phonemes and its content, we conducted a study in which we asked participants to assess the content of short text samples categorized according to their average formant dispersion on six bipolar items comprising the EPA model.

Participants
The text samples were assessed by 43 participants (29 women, 10 men, 4 undisclosed). Participants were recruited from a pool of volunteers who had signed up in the participant database of the Max Planck Institute for Empirical Aesthetics (Frankfurt am Main, Germany). The average age of participants was 25.7 years (min: 20, max: 40, STD: 5.14). A majority (35) of the participants were undergraduate (27) or graduate students (8). The remaining participants were mostly employees (5) or provided no information regarding their current employment situation. All participants were native German speakers and reported normal or corrected-to-normal sight.

Ethics statement
This study was conducted in full accordance with the World Medical Association's Declaration of Helsinki and the Ethical Guidelines of the German Association of Psychologists (DGPs). All participants gave their written informed consent and received EUR 10 as compensation for their participation, independently of whether or not they finished the study. All of the study procedures were approved by the Ethics Council of the Max Planck Society.

Materials
Assuming that phonosemantic relations are particularly common in poetic language, we used stanzas from German poems as stimuli. Text samples were selected from an online archive of classical German poems (Freiburger Anthologie, http://freiburger-anthologie.ub.uni-freiburg.de/). The archive comprises more than 1,500 German poems written between 1720 and 1900. After removing duplicates and poems that were written in languages other than standard German (e.g., in Latin or in a German dialect), our collection consisted of 1,399 poems. From these poems, we selected all stanzas that had a minimum of four and a maximum of six lines. As a result, we had a collection of 8,031 stanzas. All stanzas were phonetically transcribed using the Python library epitran (Mortensen, Dalmia, & Littell, 2018). To calculate the average formant dispersion per stanza, we assessed the distance between the first (F1) and second (F2) formant (dF = |F1 − F2|) and converted the result from Hertz to Mel (O'Shaughnessy, 1999). The values for the first two formants of each vowel were taken from Pätzold and Simpson (1997; see Material S2). For diphthongs, we followed Greenberg and Jenkins's (1966) analysis and used the values assessed for the latter part of the diphthongs.
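The computation of the average formant dispersion per stanza can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the study: the formant values in `FORMANTS_HZ` are placeholders (the actual values are those of Material S2), the function names are hypothetical, the Hertz-to-Mel conversion uses the widely cited formula 2595 · log10(1 + f/700), and the conversion is applied to the distance dF rather than to each formant separately. The input is assumed to be an already transcribed list of phoneme symbols.

```python
import math

# Placeholder F1/F2 values in Hz for three vowels (illustration only;
# NOT the values from Material S2 used in the study).
FORMANTS_HZ = {
    "i": (290.0, 2200.0),
    "a": (750.0, 1300.0),
    "u": (310.0, 800.0),
}


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel (assumed formula: 2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def vowel_dispersion_mel(vowel):
    """Formant dispersion dF = |F1 - F2| of one vowel, converted to Mel."""
    f1, f2 = FORMANTS_HZ[vowel]
    return hz_to_mel(abs(f1 - f2))


def stanza_dispersion_mel(phonemes):
    """Average dF over all vowels in a transcribed stanza; non-vowels are skipped."""
    values = [vowel_dispersion_mel(p) for p in phonemes if p in FORMANTS_HZ]
    if not values:
        raise ValueError("stanza contains no vowel with known formant values")
    return sum(values) / len(values)
```

Converting each formant to Mel before taking the difference would yield different numbers, so the order of operations chosen here should be read as one possible interpretation of the procedure.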
For the experiment, we divided the stanzas into three categories. To this end, we assessed an upper and a lower threshold defined by the average formant dispersion of all stanzas plus/minus the standard deviation (mean: 727 Mel, STD: 40 Mel, upper threshold: 767 Mel, lower threshold: 687 Mel). We then randomly selected 90 stanzas, 30 with an extremely high formant dispersion (dF > upper threshold), 30 with an extremely low formant dispersion (dF < lower threshold), and 30 with a formant dispersion between the upper and lower thresholds. In what follows, we refer to these three groups of stanzas as bright, dark, and neutral, respectively (Fig. 2). An overview of the selected stanzas used in the experiment can be found in Material S3.
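The threshold-based categorization and random selection can be sketched as follows. The mean and standard deviation are the values reported above. The function names, the treatment of values that fall exactly on a threshold (counted as neutral here), and the seeded random generator are assumptions made for illustration.

```python
import random

MEAN_MEL = 727.0  # corpus mean reported in the text
SD_MEL = 40.0     # corpus standard deviation reported in the text


def categorize(d_f, mean=MEAN_MEL, sd=SD_MEL):
    """Label a stanza by its average formant dispersion (in Mel)."""
    if d_f > mean + sd:
        return "bright"
    if d_f < mean - sd:
        return "dark"
    return "neutral"  # assumption: values exactly on a threshold are neutral


def sample_stimuli(dispersion_by_stanza, n_per_category=30, seed=0):
    """Randomly draw n stanzas per category from {stanza_id: dF} (hypothetical helper)."""
    rng = random.Random(seed)
    groups = {"bright": [], "neutral": [], "dark": []}
    for stanza_id, d_f in dispersion_by_stanza.items():
        groups[categorize(d_f)].append(stanza_id)
    return {cat: rng.sample(ids, n_per_category) for cat, ids in groups.items()}
```

With the reported values, the thresholds evaluate to 727 + 40 = 767 Mel and 727 − 40 = 687 Mel, matching the figures given in the text.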

Design and procedure
Every stanza was assessed by 14 or 15 participants (mean: 14.3) on six bipolar items, with two items for each of the three EPA dimensions (Evaluation, Potency, Activity; see Table 1). To test the inter-rater agreement, an intraclass correlation coefficient (ICC) per stanza was assessed (Koo & Li, 2016; Shrout & Fleiss, 1979). On average, the inter-rater agreement was good, with three-quarters of the coefficients above .73.
As recommended by Osgood et al. (1957), we adapted the items used in this study to fit the specific nature of our stimuli. Specifically, we selected items that clearly indicated that participants were supposed to assess the content of the stanzas and not their effect on the readers. However, as outlined above, the selection of a certain set of items inevitably has the side effect that the overall scope of the underlying dimensions is somewhat narrowed. For example, while most studies use items such as pleasant versus unpleasant to assess differences regarding the Evaluation dimension, these adjectives imply that the item asks for the effect of a stimulus on the participant. In contrast, we were interested in the participant's evaluation of the emotional tone expressed in a text. These two perspectives do not necessarily aim at the same effect. It is possible that a text expresses a negative emotion yet is still perceived as pleasant by the reader. In our study, we therefore used the adjective pair happy versus sad. Negative or positive emotional valence, however, is not restricted to the opposition between happiness and sadness. For example, there are other negative emotions, such as contempt, anger, or fear, that are not covered by sadness. Therefore, due to the specific operationalization of the Evaluation dimension used in this study, participants might have chosen neutral values to assess poems that, though expressing positive or negative emotions, are related to neither happiness nor sadness.
The poems and rating scales were presented on computer screens using the Tkinter package of the Python programming environment (https://docs.python.org/2/library/tkinter.html). Each item was rated using the Tkinter package's slider function. Participants were instructed to assess the stanzas by placing the slider closer to one end of the scale or the other (for the precise wording of the instructions, see Material S4). The order of the items and their orientation (i.e., dark to the left and bright to the right, or vice versa) were randomized per participant.
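The per-participant randomization of item order and orientation can be sketched without any GUI code. The adjective pairs below are illustrative only (loosely based on adjectives mentioned in this article; the full item set is given in Table 1), and the function names and the ±5 slider range are assumptions of this sketch rather than a description of the original experiment script.

```python
import random

# Illustrative (negative pole, positive pole) pairs; NOT the complete item set.
ITEMS = [("sad", "happy"), ("dark", "bright"), ("weak", "strong"), ("quiet", "lively")]


def make_layout(items, rng):
    """Shuffle item order and randomly flip the left/right orientation of each item."""
    order = list(items)
    rng.shuffle(order)
    layout = []
    for negative, positive in order:
        flipped = rng.random() < 0.5
        left, right = (positive, negative) if flipped else (negative, positive)
        layout.append({"left": left, "right": right, "flipped": flipped})
    return layout


def canonical_score(slider_value, flipped):
    """Map a slider value (assumed range -5..+5) back to the unflipped orientation."""
    return -slider_value if flipped else slider_value
```

Seeding the generator with a participant identifier, for example `make_layout(ITEMS, random.Random(participant_id))`, would make each participant's layout reproducible.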
Table 1. Bipolar items of the three dimensions of the EPA model. Items on the right side represent positive values (+5) and items on the left side represent negative values (−5). The adjectives in italics were hypothesized to be related to a relatively wide formant dispersion (i.e., + Evaluation, − Potency, + Activity). The German designations of the items used in the study are indicated in square brackets.
At the beginning of the experiment, participants were informed about the task and purpose of the study. However, the role of the acoustic properties of the stanzas for our research question was not revealed. Subsequently, participants were asked to assess the stanzas by judging which of the two adjectives for each item appeared to be a closer match to the content. They were also told that, in cases where they believed that neither of the two adjectives adequately described the stanza, they could set the slider to the middle. Every participant assessed a set of 30 pseudo-randomly selected stanzas (10 per category). The order in which the stanzas were presented was randomized. Before the experiment began, participants practiced the procedure by assessing the first stanza of the poem Es sitzt ein Vogel auf dem Leim by the German poet Wilhelm Busch. If they had no further questions, participants started the experiment and went through it at their own pace. At the end of the experiment, participants were asked to provide personal information regarding their gender, age, profession, and educational background. Finally, they were asked for comments. Altogether, the experiment took between 30 and 45 min. All data can be found in Material S5.
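The article does not specify how the pseudo-random allocation of stanzas to participants was implemented. The sketch below shows one possible scheme, a rotating draw from a shuffled pool per category, that reproduces the reported design constraints: 30 stanzas per participant, 10 per category, and 14 or 15 ratings per stanza with 43 participants. All names are hypothetical.

```python
import random


def assign_stanzas(stanzas_by_category, n_participants, per_category, seed=0):
    """Return one presentation list per participant (illustrative scheme only)."""
    rng = random.Random(seed)
    # Shuffle each category pool once so the rotation is not tied to stanza order.
    pools = {}
    for category, ids in stanzas_by_category.items():
        pool = list(ids)
        rng.shuffle(pool)
        pools[category] = pool
    assignments = []
    for p in range(n_participants):
        chosen = []
        for pool in pools.values():
            # Take the next block of stanzas, wrapping around the pool.
            start = (p * per_category) % len(pool)
            chosen.extend(pool[(start + k) % len(pool)] for k in range(per_category))
        rng.shuffle(chosen)  # randomize presentation order within the participant
        assignments.append(chosen)
    return assignments
```

Because 43 × 10 = 430 draws are spread over 30 stanzas per category, each stanza ends up with either 14 or 15 ratings, which is consistent with the reported mean of 14.3.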

Analysis of the data
To test the effect of the acoustic features of stanzas on readers' assessment, we used a linear mixed regression model that included stanza categories (i.e., bright, dark, neutral) as fixed factors and participant as a random factor. The random factor was included to account for the fact that each participant assessed only a selection of the stanzas. The data were analyzed using the R software package (R Core Team, 2019). The pairs of items were combined into three dimensions, representing the three-dimensional EPA space. The correlations between the combined items were r = .66 for Evaluation, r = .54 for Potency, and r = .64 for Activity. Linear mixed models were conducted for each dimension using the library lme4 (Bates, Mächler, Bolker, & Walker, 2015). To display the results of the analysis, we applied the tab_model() function of the sjPlot library (Lüdecke, 2017). For analysis of the data, all values were z-standardized (i.e., M = 0 and SD = 1).
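The preprocessing steps described above (combining item pairs into one dimension score and z-standardizing the values) can be sketched in Python. The analysis itself was run in R, and the sketch makes two assumptions that are not stated in the text: item pairs are combined by averaging, and the sample standard deviation is used for standardization. The mixed model is not reproduced here.

```python
import statistics


def combine_items(item_a, item_b):
    """Average two items of the same dimension, rating by rating (assumed rule).

    Both items must already be oriented the same way, e.g., higher = more Potency.
    """
    return [(a + b) / 2.0 for a, b in zip(item_a, item_b)]


def z_standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (sample SD assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```

For example, `z_standardize(combine_items(item_1, item_2))` would give standardized dimension scores for two hypothetical rating lists `item_1` and `item_2` of equal length.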

Results
In a first step, we compared the performance of different models with increasing complexity. The results showed that inclusion of random slopes for each participant did not result in a significant improvement (Table 2). To assess the coefficients, we therefore applied a model that included only random intercepts per participant (Model #2). As our hypothesis made distinct predictions for each dimension of the EPA space, we tested three separate models for Evaluation, Potency, and Activity (Table 3).
The results confirmed our hypothesis regarding the Evaluation and the Potency dimensions, but not regarding the Activity dimension: Ratings of bright stanzas are higher for items referring to Evaluation and lower for items referring to Potency (Fig. 3; the exact values for each item are given in Appendix Table A1). Notably, ratings for the neutral stanzas were always close to zero, whereas those for the stanzas with extreme values (i.e., bright and dark) had opposite algebraic signs (+/−). In contrast to our hypothesis, however, ratings for dark stanzas were also higher for Activity. Moreover, while the ratings for the three categories follow the expected order (bright < neutral < dark) for the Potency dimension, the neutral stanza ratings were highest for the Evaluation dimension.
Looking at the differences between the three categories for each dimension, it is striking that there is always one category of stanzas that stands out compared to the other two categories. For the Evaluation dimension, dark stanzas are clearly different from both bright and neutral stanzas, while bright and neutral stanzas are not significantly different from each other. Conversely, for the Activity dimension, we found a significant difference for bright versus dark and neutral stanzas, but not for dark versus neutral stanzas. Significant differences in both directions (i.e., between neutral and bright stanzas as well as between neutral and dark stanzas) were found only for the Potency dimension. However, here again, the actual effect is clearly stronger for the bright stanzas than for the dark stanzas. This observation is also confirmed by the mixed regression model: For the dimensions of Evaluation and Activity, only one category differs significantly from neutral stanzas.
The ratings for the individual items also followed the expected order. Neutral stanzas were rated with intermediate values compared to the two extreme categories, with the ratings for happy-sad [fröhlich-traurig] being the single exception to this rule (see Fig. 4; for exact values, see Appendix Table A2, and for a post hoc comparison of means between the categories, see Material S6). That is, for most items (except happy-sad), the ratings of bright and dark stanzas point in opposite directions, indicating that, on average, stanzas of the two extreme categories were associated with opposite attributes (e.g., bright stanzas with weak and dark stanzas with strong). As outlined before, the observation that ratings for happy-sad are the only exception to this rule might be due to the fact that the opposition between happiness and sadness does not cover the full range of positive and negative emotions.
Table 2. Effect of formant dispersion (fixed factor) on readers' assessment of stanzas. Goodness of fit is indicated by the log-likelihood (−2LL) criterion. Successive models were compared using the chi-square test.
Our results therefore suggest a close relationship between the average ratings per stanza category (i.e., bright, neutral, and dark stanzas) and the average formant dispersion for these categories. This observation raises the question of whether there is a continuous relationship between the average formant dispersion and the perceived emotional tone of stanzas. To test this, we conducted linear mixed models with the ratings per dimension as outcome variable, the average formant dispersion as predictor, and a random intercept per participant. The results suggest that the relation between the emotional tone of the stanzas and their formant dispersion is highly significant but weak (Evaluation: B = 0.08, p < .01, R² = .006; Potency: B = −0.22, p < .001, R² = .051; Activity: B = −0.20, p < .001, R² = .040). That is, while our data corroborate the assumption of a statistically significant correlation between formant dispersion and emotional tone, the amount of variance explained by the correlation (R²) is relatively small, indicating that acoustic characteristics of vowels are only one among many features that influence the subjectively perceived emotional tone (for detailed results, see Materials S7 and S8).
Finally, in view of our data, it is also noteworthy that the ratings for the items of the Potency dimension and those of the Activity dimension were almost identical. Thus, it seems that there is a close interrelation between these two dimensions or, more cautiously, that our rating items are not suited to distinguish them. To statistically test the impression of a relation between Potency and Activity, we correlated the average ratings for each poem between each pair of EPA dimensions (Fig. 5). The correlations were significant for Evaluation-Potency (r = −.42, p < .001) and Potency-Activity (r = .84, p < .001) but not for Evaluation-Activity (r = −.01, p > .1). Moreover, as already indicated by Fig. 3, the correlation was considerably stronger for Potency-Activity than for Evaluation-Potency. We also conducted separate correlations for each category of stanzas (i.e., bright, neutral, dark). The results indicate that correlations between dimensions were by and large similar across the categories of the stanzas (for detailed results, see Material S9). The only noticeable difference is a highly significant correlation between Evaluation and Potency for neutral stanzas.
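The correlation of average ratings between two EPA dimensions can be sketched as a two-step computation: ratings are first averaged per stanza, and the resulting means are then correlated. The data layout (pairs of stanza identifier and rating) and the function names are assumptions for illustration. Significance testing is omitted.

```python
import math
from collections import defaultdict


def mean_per_stanza(ratings):
    """Average ratings per stanza; `ratings` is an iterable of (stanza_id, value)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stanza_id, value in ratings:
        sums[stanza_id] += value
        counts[stanza_id] += 1
    return {stanza_id: sums[stanza_id] / counts[stanza_id] for stanza_id in sums}


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlate_dimensions(ratings_a, ratings_b):
    """Correlate per-stanza mean ratings of two dimensions over shared stanzas."""
    means_a = mean_per_stanza(ratings_a)
    means_b = mean_per_stanza(ratings_b)
    shared = sorted(set(means_a) & set(means_b))
    return pearson_r([means_a[s] for s in shared], [means_b[s] for s in shared])
```

Running the same function on the ratings of a single stanza category would correspond to the separate per-category correlations mentioned above.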

Discussion
We tested the predictive power of acoustic features of vowels to assess the emotional tone in poems. To this end, we calculated the average formant dispersion (i.e., the distance between the first two formants of a vowel) for 8,031 stanzas retrieved from a corpus of German poems. Participants assessed a random selection of 90 stanzas with either an extremely high formant dispersion (bright stanzas), an extremely low formant dispersion (dark stanzas), or an average formant dispersion (neutral stanzas) on six bipolar items comprising the three dimensions Evaluation, Potency, and Activity. Based on previous studies, we hypothesized that assessments of bright and dark stanzas would differ significantly on all three dimensions, with bright stanzas being rated higher on items referring to a positive Evaluation and a high level of Activity, whereas dark stanzas were expected to be rated higher on Potency. Our results support this hypothesis for the Evaluation and the Potency dimensions of the EPA model. The assessment of the bright and dark stanzas indeed differed significantly on all items designed to measure these dimensions. What is more, except for the item happy-sad, the average rating for the neutral stanzas was always close to zero, whereas those of the two extremes (i.e., bright and dark stanzas) consistently diverged toward the opposite poles of the bipolar items. Thus, our data clearly suggest that acoustic characteristics of texts (passages) can be used as an indicator for the emotional tone of a text's content.
The results for the Activity dimension are not in line with our expectations. In contrast to the results of previous studies, we found that dark stanzas (low formant dispersion) were mostly assessed as lively [lebhaft] and vigorous [energisch], whereas bright stanzas were assessed as quiet [ruhig] and gentle [sanft]. The strong correlation between the Potency and the Activity dimensions found in our data points to one possible explanation for this discrepancy between our results and those of previous studies. According to Fitch (1997), associations between formant dispersion and size may be rooted in the close relation between body size and length of the articulatory tract in mammals, including humans (Fitch & Giedd, 1999). The length of the articulatory tract determines the frequency of the formants of an individual's voice; accordingly, larger animals tend to have voices with a lower formant dispersion. For most animal species, size correlates with strength and strength with social dominance. Most likely for this reason, male individuals in particular purposefully lower the formant dispersion of their voice in order to appear larger and thereby to gain an advantage in the fight for resources (Charlton & Reby, 2016). Similar associations between formant dispersion and the notions of size, strength, and dominance have also been reported for humans (Evans, Neave, & Wakelin, 2006; Puts, Hodges, Cárdenas, & Gaulin, 2007; Watkins & Pisanski, 2016). Thus, formant dispersion is closely related to the position of an individual in the social hierarchy of an ingroup. On a similar note, the behavior of dominant individuals is usually more active, boastful, and aggressive, whereas submissive behavior is likely to be more passive, self-derogatory, and conflict-avoidant (Holekamp & Strauss, 2016; Morton, 1977; Trower & Gilbert, 1989). Consequently, from an ethological perspective, there is a close relation between Potency (or Dominance) and Activity.
We therefore speculate that the relations between the acoustic features of vowels and the notion of activity might be derived from the natural relation between formant dispersion and size. Our explanation for the discrepancy between our results and the results of previous studies is as follows: First, we assume that there is a general tendency to implicitly associate large, strong, and heavy bodies with slow and clumsy movements. Second, participants who are asked to relate sound frequency to activeness or passiveness in a context-free research design might draw on this implied relationship between size and activeness and therefore end up assessing dark sounds as passive and bright sounds as active. In contrast, in our study, participants presumably assessed the stanzas based on their content rather than on their sound. Although we did not perform a detailed content analysis of the stanzas we used, it appears that knights (or, more generally, warriors) appear relatively frequently in dark stanzas, whereas bright stanzas tend to deal with love, lovesickness, the joy of love, and so forth. That is, the relation between dark vowels and Activity that we observe in our data may be driven by the fact that the semantic concepts of knights and warfare are associated with a high degree of physical activity, irrespective of the phonosemantic properties of the verbal material.
Our results also indicate that the relation between the emotional tone of a text and its acoustic features is not necessarily bipolar. According to our results, an extremely high average formant dispersion in a text is a good indicator for a low level of Potency (i.e., small and weak). In contrast, comparing the ratings for Potency between stanzas with a low average value for formant dispersion and neutral stanzas, the distance is relatively small (although statistically significant). Similarly, an extremely low formant dispersion turned out to be a good predictor for low ratings on items referring to Evaluation, whereas extremely high values for formant dispersion showed no significant effect.
These results are partly in line with previous findings. Studies that have compared relations between the lexical meaning of words and their phonetic characteristics across languages have shown that words and morphemes that refer to smallness have a greater than chance probability of the occurrence of high front vowels (Blasi et al., 2016; Ultan, 1978), indicating a close relationship between a high formant dispersion and a low level of Potency. In contrast, neither Ultan nor Blasi and colleagues reported evidence for a relation between the notion of high potency and the occurrence of dark-sounding vowels. Additionally, in a recent study, Westbury, Hollis, Sidhu, and Pexman (2018) found evidence suggesting that acoustic cues that are predictive for one pole of a dimension are not necessarily also a good predictor (with a reversed sign) for the opposite pole. Rather, it appears that tendencies to associate opposing acoustic features (e.g., high vs. low pitch) with opposing poles of semantic dimensions (e.g., small vs. large) often result from research designs introducing items as antonyms so that one pole is conceptualized as the negative of the other pole (e.g., small as not large). Therefore, we conclude that a relation of one acoustic feature (e.g., extremely low average formant dispersion) to a specific emotional tone (e.g., negative emotional valence) does not automatically imply that the two opposite poles (e.g., high formant dispersion and positive emotional valence) are likewise related to one another. In other words, the conclusion e contrario may simply not apply.
Yet another possible explanation lies in the very design of our study. It is, for example, conceivable that the relatively low values of bright stanzas for Evaluation result from the semantic focus of the items used. Note, for example, that the opposition between happy and sad also contains an element of Activity, as happiness mostly comes along with a high degree of activity and sadness with a low degree of activity. Consequently, it could be that bright stanzas simply referred to a positive-relaxed feeling that is not covered by the word "happy." One problem with this explanation is that, for each dimension, neutral stanzas differed significantly from one pole of the extreme stanzas (e.g., neutral stanzas differed from dark stanzas for Evaluation). If the lack of clear differences between neutral stanzas and bright stanzas for Evaluation, on the one hand, and dark stanzas for Activity, on the other hand, is due to a deficient selection of items, then why does this affect only one extreme and not the other? As our data do not allow us to answer this question, we believe that further studies will be necessary to get a clearer picture.
Another limitation of this study is its exclusive focus on one specific phonetic feature, namely formant dispersion, to predict readers' assessment of the poems' emotional tone. The decision to emphasize this specific feature was based on a long tradition of research that provided sound evidence for associations between formant dispersion and concepts that refer to the dimensions of the Evaluation-Potency-Activity model. In other words, the driving question that motivated this study was whether a specific sound-meaning relation that has repeatedly been observed in studies from various academic disciplines can also be found in poetry. At the same time, results from recent studies suggest that other features are also statistically significant predictors for phonosemantic relations, such as articulatory characteristics of phonemes or the frequency of occurrence of individual phonemes (e.g., Louwerse & Qu, 2017; Nastase et al., 2007; Ohtake & Haryu, 2013). In fact, Monaghan and Fletcher (2019) reported that the frequency of individual phonemes outperforms phonetic features as a predictor for sound-meaning relations of nonwords. Thus, while our study reveals a statistically significant relation between the average formant dispersion of a text and its emotional tone, it does not allow drawing conclusions regarding the potential of formant dispersion as a predictor for emotional tone in relation to other features. As the formant dispersion of a vowel is causally linked to its articulatory characteristics, we cannot exclude the possibility that phonological features or vowel categories drive the results reported here. However, the aim of this study was to test whether findings reported in previous studies on sound-meaning relations also apply to texts. Our results suggest that this is the case.
Future studies ought to examine more directly whether formant dispersion or more abstract properties best explain the variance in the data; such a comparison is beyond the scope of this article.
Finally, we note that our results mainly reflect a distinction between a small group of stanzas with extreme values for a specific acoustic feature. Thus, it could well be that systematic relations between phonetic properties of texts and their content only come into play when the phonetic characteristics of a text exceed a certain threshold level. Evidence for the idea that sound-meaning relations in natural language only become salient when they strongly deviate from a norm has also been reported in previous studies. Aryani and colleagues, for example, found that phonetic features that are statistically overrepresented compared to their frequency of occurrence in everyday language can also predict the emotional tone of poems (Aryani, Kraxenberger, Ullrich, Jacobs, & Conrad, 2016; see also Aryani et al., 2013). Furthermore, similar evidence comes from a comparison between results reported by Auracher et al. (2010) and Kraxenberger and Menninghaus (2016). Both studies tested the relation between the average occurrence of certain phonemes (plosives and nasals) and the emotional tone of texts. However, whereas the first study did find significant interactions between sound and meaning for poems in four different languages, the data of the latter study could not replicate these results. One possible explanation for this discrepancy might lie in the research design: While in the study by Auracher et al. (2010) only poems with extreme values for the plosive-nasal ratio were used as stimuli, Kraxenberger and Menninghaus (2016) selected the poems used in their study on the basis of their emotional content.
Hence, it seems that phonosemantic relations in natural language only become salient for readers when phonetic features are foregrounded due to the unusually high frequency with which they occur in a text passage. We propose that congruencies between the acoustic properties and non-acoustic attributes of phonemes function as stylistic features, allowing authors to guide the attention of readers. However, as with any stylistic feature, authors are free to choose whether to use them or not. Thus, even though we found a highly significant correlation between formant dispersion and readers' assessment of the stanzas' emotional tone, our data also suggest that the relation between a text's acoustic characteristics and its emotional tone is far from deterministic. Rather, for all three categories of stanzas used in our study (i.e., stanzas with extremely high, extremely low, and neutral formant dispersion), we obtained ratings for the three EPA dimensions that covered the full spectrum from low to high. This might also explain why some studies with relatively small corpora have not found any phonosemantic relations (e.g., Kraxenberger & Menninghaus, 2016; Miall, 2001). However, while the content of a text does not necessarily determine the relative occurrence of specific phonetic features, an unusually high occurrence of a specific phonetic characteristic seems to significantly increase the likelihood that a text expresses a certain emotional tone. Thus, we assume that an unusually high occurrence of a phonetic feature is a reasonably good predictor for the content of a text.

Future work
Many phonosemantic relations, such as those between acoustic frequency and size, have been found to be independent of specific languages. Our results therefore call for replications both in other languages and also, going beyond poems, in other text genres. It would, moreover, be interesting to investigate whether or not sound iconicity can be used as a cross-linguistic tool for automatic text analysis.