Six-month-old infants recognize phrases in song and speech

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. Infancy published by Wiley Periodicals LLC on behalf of the International Congress of Infant Studies.

Affiliations: 1 Centre for Language Studies, Radboud University, Nijmegen, The Netherlands; 2 International Max Planck Research School for Language Sciences, Nijmegen, The Netherlands; 3 Department of Linguistics, Macquarie University, Sydney, NSW, Australia; 4 Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands; 5 Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, The Netherlands


Abstract
Infants exploit acoustic boundaries to perceptually organize phrases in speech. This prosodic parsing ability is well-attested and is a cornerstone of the development of speech perception and grammar. However, infants also receive linguistic input in child songs. This study provides evidence that infants parse songs into meaningful phrasal units and replicates previous research for speech. Six-month-old Dutch infants (n = 80) were tested in the song or speech modality in the head-turn preference procedure.
First, infants were familiarized to two versions of the same word sequence: One version represented a well-formed unit, and the other contained a phrase boundary halfway through. At test, infants were presented with two passages, each containing one version of the familiarized sequence.
The results for speech replicated the previously observed preference for the passage containing the well-formed sequence, but only in a more fine-grained analysis. The preference for well-formed phrases was also observed in the song modality, indicating that infants recognize phrase structure in song. There were acoustic differences between stimuli of the current and previous studies, suggesting that infants are flexible in their processing of boundary cues while also providing a possible explanation for differences in effect sizes.
To date, it is not known whether infants also recognize phrasal units of caregiver singing. However, many acoustic cues to phrase structure are the same in melodies and speech (Deutsch & Feroe, 1981; Heffner & Slevc, 2015; Lerdahl & Jackendoff, 1983; Trainor & Adams, 2000), and prosodic phrase segmentation is not bound to the listener's native language or spoken modality: English-speaking adults segment words from unfamiliar languages if these words are placed at prosodic phrase boundaries (Endress & Hauser, 2010; Langus, Marchetto, Bion, & Nespor, 2012), and American English infants can exploit the prosody of non-native languages, for example, Japanese (Hawthorne, Mazuka, & Gerken, 2015), Polish (Jusczyk, 2003), and even American Sign Language (Brentari, González, Seidl, & Wilbur, 2010), to recognize phrases. This combination of observations informs the hypothesis that infants might also be able to perceptually organize songs into phrases.
The arguments provided so far all suggest that infants might be able to parse phrasal units from ID song. Yet, there are also differences in the acoustic instantiation of boundary cues between song and speech (see, e.g., references in Merrill et al., 2012). For infants to recognize phrase structure in songs, they thus need flexible representations of phrase boundaries which adjust to the song modality. So far, the available literature does not provide conclusive evidence for this flexibility.
Investigating infants' ability to recognize phrase structure in songs is also relevant in light of recent evidence that the recognition of phrasal structure in linguistic or musical play is related to grammar development in typically developing preschoolers (Politimou, Dalla Bella, Farrugia, & Franco, 2019) and children with developmental language disorder (Richards & Goswami, 2019). As prosodic parsing is a precursor to syntactic development (Morgan & Demuth, 1996), these studies raise the possibility that caregivers' language play, including ID singing, contributes to the development of prosodic parsing. However, the work suggesting a relationship between phrase perception and grammar development focused on (pre)school children (Politimou et al., 2019; Richards & Goswami, 2019), whereas prosodic parsing already develops within the first year of life (Carvalho, Dautriche, Millotte, & Christophe, 2018), and a direct test of children's ability to segment phrases from songs or other forms of language play is still lacking. The current study thus aims to provide evidence for infants' recognition of the phrasal building blocks of ID songs.

| Infants' recognition of phrase structure in music
Currently, the literature on infants' perception of melodic phrase structure in music and song is sparse. In seminal studies by Jusczyk and Krumhansl (1993) and Krumhansl and Jusczyk (1990), 6-month-olds differentiated between excerpts from Mozart minuets with pauses at natural (phrase boundary) positions and excerpts with unnatural pauses (within phrases). A follow-up study with American English infants extended this finding to melodies from non-western (Japanese) child songs (Jusczyk, 2003), indicating that melodic phrase structure perception does not require extensive experience with a musical tradition. Crucially though, none of these previous studies required infants to encode and process melodic phrase structure. Instead, infants were provided with a pre-segmented stimulus that was either reminiscent of their daily musical experience (naturally segmented) or rather odd (unnaturally segmented). It thus remains unclear whether infants chunk native songs into meaningful units and recognize subcomponents of the songs, despite these being the type of musical stimulus infants are exposed to on a daily basis. Recent evidence from Dutch infants (Hahn et al., 2018) also only indirectly supports the notion of song structure being accessible: 9-month-olds differentiated rhyming (and thus more natural) songs from non-rhyming (and thus less natural) songs, but the study did not test whether this differentiation has implications for the processing of the linguistic content. In the current study, we will provide infants only with natural native child songs and will explicitly test their ability to recognize familiarized song phrases.

| Extending the prosodic parsing paradigm
A good starting point for an investigation of infants' encoding of the inherent structure of ID song is transferring a reliable paradigm from infant speech perception research to the song modality while also replicating previous research for ID speech. Such a paradigm was provided by Nazzi et al. (2000), showing that 6-month-olds used prosodic phrase structure to segment clauses from continuous speech. In this head-turn preference study, infants were familiarized to two versions of the word sequence leafy vegetables taste so good (Nazzi et al., 2000, Experiment 1). One version of the sequence was prosodically well-formed, carrying phrase boundaries at the edges, and sounded like a coherent clause: [Leafy vegetables taste so good]. The other version of the word sequence contained a phrase boundary halfway through, sounding more like snippets of two adjacent clauses: leafy vegetables] [Taste so good. In the subsequent test phase, infants heard two spoken passages of three sentences each. The well-formed sequence from the familiarization phase reoccurred as a coherent clause of one passage. The ill-formed sequence from the familiarization phase reoccurred as a subcomponent of two adjacent clauses of the other passage. Infants listened longer to the passage containing the well-formed compared to the ill-formed word sequence, indicating that they capitalized on the prosodic structure of the passage to recognize the familiarized well-formed word sequence therein. This paradigm has been adopted in numerous subsequent studies (Seidl, 2007; Seidl & Cristia, 2008; Soderstrom, Kemler Nelson, & Jusczyk, 2005). Critically for the present study, Dutch 6-month-olds also showed the same preference for the passage containing the well-formed word sequence (Johnson & Seidl, 2008).
Whether infants' prosodic parsing ability extends to the musical modality has already been explored in two short reports (Hawthorne & Gerken, 2013;Nazzi et al., 2000, see general discussion). Both studies applied the paradigm described above to melodies from a musical instrument. The preliminary results suggest that infants recognized the familiarized well-formed tone sequence within a longer musical piece.

| The current study
The current study investigates infants' recognition of the phrasal building blocks of ID song and replicates earlier studies on infants' recognition of phrases in ID speech. We will use the paradigm described above that has successfully revealed infants' phrase segmentation of ID speech (for Dutch: Johnson & Seidl, 2008; the original study for English: Nazzi et al., 2000) with a new sample of Dutch 6-month-olds and a new version of the Dutch stimuli and extend the paradigm to ID song, using natural song material that matches the ID speech stimuli in content and syntactic structure. Our approach significantly extends previous work in two ways: First, infants' processing of song lyrics has so far been limited to smaller phonological and lexical building blocks (François et al., 2017; Hahn et al., 2018; Lebedeva & Kuhl, 2010; Snijders et al., 2020; Suppanen et al., 2019; Thiessen & Saffran, 2009). We will extend the scope of this research to phrases, cognitive units which are relevant not only for the perception of song structure but also for lexical and syntactic development in infants' native language. Second, we will build upon the previous work on infants' auditory grouping in polyphonic instrumental music (Jusczyk & Krumhansl, 1993; Krumhansl & Jusczyk, 1990), monophonic melodies (Hawthorne & Gerken, 2013; Nazzi et al., 2000), and non-native child songs (Jusczyk, 2003), employing the type of musical stimulus that possibly best represents infants' musical input (Volkova, Trehub, & Schellenberg, 2006), namely native child songs. By extending the paradigm of Nazzi et al. (2000), we will also move beyond mere preferences for naturally phrased melodies. Instead, our study requires infants to incrementally process and organize ecologically valid native song input and match this input to memorized song fragments.

| Participants
A sample of 95 6-month-old infants (mean age in days: 184, range: 167-209 days, SD = 9.02, 53 girls) from monolingual Dutch households was tested, of which 12 infants were excluded because they fussed or cried during the experiment (n = 11) or grew up in a bilingual household (n = 1). Three more infants were excluded from part of the analysis because they did not contribute trials in both experimental conditions for the critical dataset (see Analysis Section), resulting in a final dataset of n = 80 or n = 83 infants, depending on the respective analysis.
Participants were recruited from the Baby and Child Research Center at Radboud University, Nijmegen, the Netherlands. According to their caregivers, infants were born full-term, had normal hearing, and no familial history of language or reading problems. The present study was conducted according to guidelines laid down in the Declaration of Helsinki, with written informed consent obtained from a parent or guardian for each infant before any assessment or data collection. Ethical approval for the study was obtained from the Ethiek Commissie Faculteit der Sociale Wetenschappen (ECSW) at Radboud University in Nijmegen, the Netherlands. Caregivers had the choice between 10€ and a book as a reward for their participation. The results of a questionnaire on musical exposure confirmed that all participants were regularly exposed to songs and music from electronic devices and human singers (the results of the questionnaire are summarized in the online materials).
A power analysis using G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), based on Experiment 1 of Johnson and Seidl (2008) (estimated correlation between groups set to 0.5, Cohen's d Experiment 1 = 0.35), resulted in a required minimum sample of 52 infants to detect the phrase segmentation effect in each modality (80% power in one-sided t test with α = .05). We thus aimed for usable data of 104 infants in total. Due to time and resource limitations, however, data collection was terminated after 95 participants.
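The reported sample-size calculation can be approximated outside G*Power. The sketch below (a Python illustration with scipy, not the authors' actual tool) converts the between-condition Cohen's d to d_z under the assumed correlation of .5 and then finds the smallest n reaching 80% power in a one-sided paired t test.

```python
from scipy import stats

def paired_t_power(d_z, n, alpha=0.05):
    # Power of a one-sided paired t test, via the noncentral t distribution.
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha, df)
    ncp = d_z * n ** 0.5
    return 1 - stats.nct.cdf(t_crit, df, ncp)

# Convert d = 0.35 between conditions to d_z assuming r = .5 between them:
d_z = 0.35 / (2 * (1 - 0.5)) ** 0.5   # = 0.35 here

# Smallest sample size reaching 80% power:
n = 2
while paired_t_power(d_z, n) < 0.80:
    n += 1
print(n)  # close to the minimum of 52 infants reported in the text
```

Because r = .5 makes d_z numerically equal to d, the conversion step is a no-op for these particular inputs, but it is where the "estimated correlation between groups" enters the calculation.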

| Materials
Materials, design, and procedure of the current study closely followed the study by Johnson and Seidl (2008, Experiment 1, henceforth "J&S"). Stimuli were novel spoken and sung recordings of the J&S stimuli, complemented by a second stimulus set. The basis of all stimulus materials was a pair of text passages from J&S, both consisting of three sentences, separated by two phrase boundaries (see Table 1, Pair 1).
Within both passages, the same sequence of words occurred (e.g., koude pizza smaakt niet zo goed), but one passage contained the sequence as a single phrase, that is, phrase-internal (e.g., [koude pizza smaakt niet zo goed]), and the other passage contained the sequence with a phrase boundary in the middle, that is, phrase-straddling (e.g., … koude pizza] [smaakt niet zo goed …). The two passages were used for the test phase of the experiment. The phrase-internal and phrase-straddling sequences extracted from the passages were used for familiarization. All stimuli were recorded in a spoken as well as a sung version. Passage pair 1 was based on the Dutch stimuli of J&S, with a slight change to fit the melody.¹
Passage pair 2 was created in analogy to pair 1: The number of syllables, word stress, and phrase structure were identical, and the lyrics had the same assumed familiarity (all content words of both pair 1 and pair 2 appeared in the Dutch N-CDI; Zink & Lejaegere, 2002), and the words had a similar mean log raw frequency of 3.6 (pair 1) and 3.8 (pair 2) in the Dutch CELEX corpus (Baayen, Piepenbrock, & Gulikers, 1995). Both passage pairs are provided in Table 1.

| Song stimuli
Both pairs of passages were set onto the melodies of child songs (Figure 1). Passage pair 1 was set onto melody 1 ("See Saw Margery Daw," originally from England), and passage pair 2 was set onto melody 2 ("Vine Melcul Suparat," originally from Romania), with one syllable per musical note and stressed syllables on strong metrical positions within each melody. The position of sentence boundaries in the passages was aligned with the position of melodic phrase boundaries in the melodies.
Three listeners (two amateur and one professional musician; two Dutch and one English native speaker), who were kept naïve to the purpose of the study, judged the quality of the resulting melodies. All three found them to resemble typical children's songs.

¹ We slightly modified one phrase. Original wording: "Hun zus vindt dat lekker." 'Their sister likes that.' Novel wording: "Hun opa vindt dat wel erg lekker." 'Their grandpa really likes that.'

Table 1. Texts from Passage Pairs 1 and 2.

| Recording
The same female Dutch speaker was recorded for the spoken and sung stimuli and was kept naïve to the purpose of the study. Only after the recording did it become apparent that the same person's voice had been recorded for the original J&S stimuli. The singer/speaker was instructed to speak and sing in a lively, child-directed manner while looking at the photograph of a toddler from her family. She chose a speaking and singing tempo and a pitch height that were convenient to her. Recording took place in a sound-attenuated booth, and further processing was done using Praat (version 5.3.49; Boersma & Weenink, 2014) and Audacity (2.1.0): Pauses between phrases were set to silence but kept at their original duration. Two sequences were cut from each passage: one internal and one straddling (see Figure 2), resulting in 8 sequences for the sung and 8 sequences for the spoken modality.

| Acoustic analysis
Acoustic measures were obtained around the internal boundary in the straddling sequence (e.g., … pizza] [smaakt …) and compared to the same sequence without a boundary in the phrase-internal sequence (e.g., [… pizza smaakt …]) using Praat (version 5.3.49; Boersma & Weenink, 2014); sound files and corresponding text grids can be found in the online materials.

Comparison of song and speech stimuli within the current study
Phrase boundaries in both song and speech stimuli were expressed by longer pauses and longer pre-boundary vowels at the phrase boundary in the straddling sequence compared to the corresponding internal sequence. In the song stimuli, the pitch rose after the boundary in the straddling sequences.
In the spoken sequences, the opposite pattern was found: Pitch increased at the final vowel and then decreased at the first vowel of the following phrase. The speaker thus used a rising boundary tone to mark the end of her spoken phrases.
Comparison of stimuli from Johnson and Seidl (2008) and the stimuli for the current study
Despite the fact that we used the same words and the same speaker, the speech stimuli of the current study were substantially slower than the stimuli by J&S (see Table S1). The longer pauses between sentences together with the overall slower speech rate of the current stimuli resulted in large differences in the onset time of the critical sequence (see Table S2). These differences will be taken into account in the analysis of the looking-time data (see section Mixed-effect model analysis below). Furthermore, the pitch reset at the phrase boundaries of the straddling sequences of the J&S study was less pronounced in the stimuli of the current study (see Table S3), probably due to the slightly different intonation and the rising boundary tone the speaker used for the current study.

Figure 1. Melody 1 and Melody 2 as used for the song stimuli. Note: Phrase-internal sequences are in bold; straddling sequences are underlined.

| Stimulus pre-test
Three Dutch native speakers judged the intelligibility of the sung and spoken sequences as well as the straddling/internal manipulation. The three judges were first asked to listen once to each of the 16 sequences and immediately write out the text orthographically, as they understood it. All three participants wrote down the correct texts without mishearing. They were then presented with the phrase-internal and phrase-straddling versions of every sequence as a pair and were asked to indicate which of the two sequences sounded more coherent. All three participants judged all phrase-internal sequences to sound more coherent than their phrase-straddling counterparts.

| Procedure
The experiment was run using the head-turn preference procedure. Three lights were placed within a three-sided booth at infant eye level: a blue light in the center and red lights on the right and left walls of the booth. A camera was hidden below the center light to observe infant behavior from outside. Stimuli were presented via loudspeakers below the red lights. The infant and caregiver were seated in the middle of the booth, directly opposite the blue center light, exactly in between the left and right red lights. Stimulus presentation was controlled from outside the test booth by the experimenter, using the stimulus presentation software Look! (Meints & Woodford, 2008). The experimenter was blind to trial number and trial condition and coded the looking behavior of the infant (left, right, center) using assigned keys. The same procedure was used for both familiarization trials and test trials. The entire session was video-recorded for offline reliability coding (see section "reliability coding" in the online materials).

| Design
Infants took part in either the song or the speech version of the experiment, and were tested on their ability to segment either songs or speech into phrases (effect of modality, between subjects). Following Nazzi et al.'s (2000) head-turn preference procedure, infants were first familiarized with two sequences of the same words, one uttered as phrase-internal, carrying phrase boundaries at the edges, for example, [Koude pizza smaakt niet zo goed] ("Cold pizza doesn't taste so good"), and the other uttered as phrase-straddling, carrying a phrase boundary halfway, for example, koude pizza] [smaakt niet zo goed ("cold pizza. Doesn't taste so good"). The internal sequence thus represented a well-formed acoustic unit, and the straddling sequence was ill-formed. Apart from this acoustic difference, the exact same words occurred in the sequences used in the familiarization phase. In the test phase, infants were presented with two passages of three sentences each: One passage contained the phrase-straddling sequence and the other the phrase-internal sequence (Table 1). For the analysis, looking times to the passages were assessed. Which passage functioned as the internal and which as the straddling passage was determined by the content of the respective sequence used during familiarization (effect of condition, within subjects).

| Counterbalancing and randomization
The four pairs of passages (Table 1) were distributed across eight lists (four lists for speech, four lists for song). Within each list, one pair of passages was used, the presentation side of the first stimulus (left, right) was counterbalanced, and the same presentation side and the same condition were restricted to occur maximally two times in a row.
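The ordering restriction can be made concrete with a small rejection-sampling sketch. This is hypothetical illustration code, not the randomizer of the Look! software: it reshuffles until neither the presentation side nor the condition occurs more than twice in a row.

```python
import random

def max_run(values):
    # Length of the longest run of identical consecutive values.
    longest, run = 1, 1
    for a, b in zip(values, values[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return longest

def make_trial_order(n_trials=12, seed=0):
    # Rejection sampling: reshuffle conditions and sides until the
    # max-two-in-a-row restriction holds for both.
    rng = random.Random(seed)
    conditions = ["internal", "straddling"] * (n_trials // 2)
    sides = ["left", "right"] * (n_trials // 2)
    while True:
        rng.shuffle(conditions)
        rng.shuffle(sides)
        if max_run(conditions) <= 2 and max_run(sides) <= 2:
            return list(zip(conditions, sides))

order = make_trial_order()
```

Rejection sampling is a simple way to honor such constraints; with 12 trials and balanced halves, a valid order is found after only a few reshuffles on average.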

| Experimental session
Caregivers were first briefed about the experimental procedure and filled out the music exposure survey (see the online materials for an English translation of the questionnaire). At the start of the experiment, infants were seated on their caregiver's lap in the center of a three-sided test booth. Both caregiver and experimenter wore headphones throughout the experiment and listened to masking music (samba music played simultaneously with spoken text from various female speakers). Testing started with a familiarization phase during which infants heard alternations of the phrase-internal and phrase-straddling sequence and accumulated a minimum of 30 s of looking time for each sequence (in accordance with J&S). Within the test phase, each infant was presented with two passages. One passage contained the phrase-internal, the other the phrase-straddling sequence from the familiarization phase. Which passage acted as phrase-straddling or phrase-internal during test depended on which sequence a particular infant was familiarized to. A single test trial consisted of repetitions of a passage for the same condition (internal/straddling). Trials alternated in condition (internal/straddling). Passages were presented in 12 trials distributed over three blocks. Within every block, each passage was presented once from the left and once from the right side. The full experimental session lasted about five minutes, depending on the number of familiarization trials an infant required to reach the 30-sec familiarization criterion. Sessions were aborted earlier if the infant fussed. Data from aborted test sessions were not analyzed. After the experiment, caregivers were debriefed about the research question of the experiment.

| RESULTS
All data preprocessing and analyses were performed using R for Windows (R Development Core Team, 2012). All raw data and analysis scripts are available in the online materials.

| Mixed-effect model analysis
Linear mixed-effect models were used to analyze differences in looking times between the internal and straddling passages in the test phase of the experiment. Two models were fit: one to the full dataset of all trials from all children (N = 83, n = 41 in song; 996 trials, 492 trials in song) and a second model starting from trials during which infants had attended long enough to be presented with the first 500 ms of the critical sequence within the test passage (n = 80, n = 39 in song; 680 trials, 295 trials in song). The second model, on this Critical Sequence dataset, was considered warranted given the overall slower speech rate and longer pauses in the present compared to the J&S stimuli, as described above. Note that three subjects were excluded from the second model because they did not contribute trials for both conditions in this dataset. The remaining 80 infants contributed an average of 4 trials per condition (range: 1-6 for both conditions). The fixed effects of both models were (a) Condition (internal vs. straddling, coded as an orthogonal contrast), (b) Modality (song vs. speech, coded as an orthogonal contrast), (c) Test Trial Number linear (1-12, coded as the linear polynomial), (d) Test Trial Number quadratic (1-12, coded as the quadratic polynomial), and (e) the interaction of Condition and Modality.² The random-effects structure of both models was specified to include random intercepts for participant and by-participant random slopes for the effect of Condition (internal/straddling). We deliberately chose not to specify the maximal random-effects structure (Barr, Levy, Scheepers, & Tily, 2013), because the use of only two pairs of passages for speech and song did not warrant specification of item-related random effects. Both models 1 and 2 were fit onto Box-Cox-transformed looking times (λ = 0.12 for model 1, λ = −0.02 for model 2; Csibra, Hernik, Mascaro, Tatone, & Lengyel, 2016).
The R-package "lmerTest" was used to run the models and evaluate significance of the effects (Kuznetsova, Brockhoff, & Christensen, 2016).
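As an illustration, the Box-Cox transform applied to the looking times is sketched below in Python (the analyses themselves were run in R). The λ values are those reported for the two models; the commented lmer-style formula is our reconstruction of the model specification described above, not the authors' actual script.

```python
import numpy as np

def box_cox(x, lam):
    # Box-Cox power transform; lam == 0 reduces to the natural log.
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1) / lam

looking_times = np.array([2.5, 8.0, 12.3])     # seconds; made-up values
bc_model1 = box_cox(looking_times, lam=0.12)   # lambda reported for model 1
bc_model2 = box_cox(looking_times, lam=-0.02)  # lambda reported for model 2

# Reconstructed R formula for both models (random intercept plus
# by-participant slope for Condition; no item random effects):
# lmer(bc_lt ~ Condition * Modality + TrialLin + TrialQuad
#              + (1 + Condition | participant), data = d)
```

With small positive λ the transform behaves much like a log, compressing the long right tail that is typical of looking-time distributions.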

| Results of mixed-effect model analysis
When only considering trials during which infants listened long enough to reach the critical sequence within the test passage, infants preferred to listen to the passage that contained the phrase-internal sequence in both song and speech (Figure 3). The linear mixed-effect model run on this Critical Sequence dataset (model 2; Table 2; n = 80, 680 trials) revealed significant main effects of Condition (t = 2.21, β = 0.05, p = .03), Modality (t = 2.78, β = 1.0, p = .007), and the linear and quadratic polynomials of Test Trial Number (t = −4.39, β = −0.28, p < .001; t = 3.42, β = 0.22, p < .001). There was no significant interaction between Condition and Modality (t = 0.24, β = 0.01, p = .81) and thus no evidence that segmentation is easier in song than in speech.
Considering all trials from all children (N = 83, 996 trials), we did not find evidence for a preference for passages with the phrase-internal or the phrase-straddling sequence, nor did we find evidence that looking times differed between song and speech (Figure S1 in the online materials). The linear mixed-effect model on the full dataset (model 1; Table 2) only indicated significant main effects of the linear (t = −7.34, β = −1.84, p < .001) and quadratic (t = 2.19, β = 0.55, p = .03) polynomials of Test Trial Number, indicating that overall looking times decreased over the course of the experiment, but to a lesser degree toward the end of the experiment.

| t test analysis
To adhere to more standard analyses of infant looking-time data, we also report results of t tests within both modalities using the aggregated looking times within the Critical Sequence dataset (n = 80). Given the number of previous studies that found a preference for the internal sequence, we decided to run one-sided t tests to test the hypothesis that looking times for the internal sequence are longer than for the straddling sequence. Two-sided t tests will be reported for the sake of completeness. Averaged looking times were also Box-Cox-transformed, using λ = 0.2 for the song and λ = 0.36 for the speech data. Levene's test indicated equal variance among groups in both the song and speech dataset. A Shapiro-Wilk test indicated that the song data deviated from normality even after transformation (p = .02). Therefore, the results of the t test for song have to be interpreted with caution. Effect sizes Cohen's d_z and Hedges' g_av were calculated for the untransformed looking times in the t test datasets, according to recommendations given by Lakens (2013) and formulas introduced by Cohen (1988) and Hedges and Olkin (1985). A spreadsheet by Lakens (2013), available under https://osf.io/ixGcd/, was used for the calculation.
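For concreteness, the two effect sizes can be computed as follows. This is a Python sketch with illustrative data; the formulas follow Lakens (2013) as we read them, and the authors used Lakens's spreadsheet rather than this code.

```python
import numpy as np
from scipy import stats

def cohens_dz(x, y):
    # d_z: mean of the paired differences divided by their SD.
    diff = np.asarray(x, float) - np.asarray(y, float)
    return diff.mean() / diff.std(ddof=1)

def hedges_g_av(x, y):
    # d_av standardizes by the average of the two condition SDs,
    # then applies Hedges' small-sample correction.
    x, y = np.asarray(x, float), np.asarray(y, float)
    d_av = (x - y).mean() / ((x.std(ddof=1) + y.std(ddof=1)) / 2)
    n = len(x)
    return d_av * (1 - 3 / (4 * (n - 1) - 1))

# Illustrative (made-up) per-infant looking times in seconds:
internal = np.array([9.1, 7.4, 8.8, 6.0, 7.9, 8.3])
straddle = np.array([8.2, 7.0, 8.1, 6.4, 7.1, 7.6])

# One-sided paired t test of internal > straddling:
t, p = stats.ttest_rel(internal, straddle, alternative="greater")
dz, g = cohens_dz(internal, straddle), hedges_g_av(internal, straddle)
```

Note that d_z and g_av answer slightly different questions: d_z scales the effect by the variability of the within-infant difference, whereas g_av scales it by the average condition variability with a bias correction for small samples.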

| t test results
The t tests run on the averaged looking times for both song and speech trials within the Critical Sequence dataset indicated a significant preference for the internal sequence for the song modality only (Table 3). In both modalities, about half of the infants tested showed a preference for the internal sequence, that is, longer listening times for the internal compared to the straddling passage.

Table 2. Parameters of linear mixed-effect models 1 and 2.

| DISCUSSION
The current study set out to replicate 6-month-old Dutch infants' auditory grouping abilities based on intonational phrase boundaries in ID speech (Johnson & Seidl, 2008) and assess whether this ability extends to ID song. Infants in the current study were tested in a paradigm first developed by Nazzi et al. (2000), which has successfully revealed phrase segmentation in an earlier study of 6-month-old Dutch infants (Johnson & Seidl, 2008). We replicated this latter study in the same laboratory, using the same speaker for the stimuli, and extended it to the ID song modality. To this end, infants were first familiarized to two critical sequences of the same words in either song or speech (e.g., /koude pizza smaakt niet zo goed/ ("cold pizza does not taste so good")). One sequence was uttered with a well-formed phrase structure, with phrase boundaries at the edges: [koude pizza smaakt niet zo goed], while the other sequence was uttered with an ill-formed phrase structure, straddling a phrase boundary in the middle: koude pizza] [Smaakt niet zo goed. Infants were then presented with two three-sentence test passages, one containing the well-formed word sequence and the other the ill-formed word sequence. In both song and speech, infants listened longer to the passage containing the well-formed sequence. This indicates that infants were able to segment the passages of song and speech into their underlying phrasal constituents and recognized the well-formed familiarized sequence therein. Infants' known ability to recognize the phrase structure of ID speech thus extends to ID song.

| Contribution
The current study is the first to provide evidence that 6-month-old infants segment native child songs into well-formed phrases. Infants thus capitalize on the acoustic boundary cues within song melodies to organize a continuous song into structurally relevant constituents and recognize phrases while the song unfolds. The present results significantly extend previous research on infants' musical grouping abilities by using ecologically valid musical stimuli and by requiring infants to group native song melodies into perceptual chunks while the song unfolds (Jusczyk & Krumhansl, 1993; Krumhansl & Jusczyk, 1990; Nazzi et al., 2000; Hawthorne & Gerken, 2013). This study also extends our knowledge of infants' recognition of phonological units in song lyrics from syllables (François et al., 2017; Lebedeva & Kuhl, 2010; Suppanen et al., 2019; Thiessen & Saffran, 2009), rhymes (Hahn et al., 2018), and single words (Snijders et al., 2020) to larger prosodic units, namely phrases. The potential functional relevance of these findings will be discussed below.
The current results also contribute to two more general issues in the field of first language acquisition: The first is the question about shared cognitive mechanisms underlying the processing of music, song, and speech; the second pertains to the optimal acoustic stimulus for infant language learning. Concerning the first question: Infants' mental organization of speech and song into phrases observed in the current study may be grounded in a modality-general processing mechanism (Conway, Pisoni, & Kronenberger, 2009; Schön et al., 2010; Trehub & Hannon, 2006): A conceivable account would be that the salient acoustic structure of instrumental music, ID song, and ID speech attracts infants' attention to utterance edges (De Diego Balaguer, Martinez-Alvarez, & Pons, 2016; Drake, Jones, & Baruch, 2000; Falk & Kello, 2017; Leong & Goswami, 2015). Alternatively, infants' phrase recognition in ID song might stem from transfer of a speech-specific or even native language-specific prosodic parsing strategy to the song modality (Morgan & Demuth, 1996). Future research should identify the exact mechanisms underlying phrase segmentation and clarify to what extent these are bound to a specific developmental stage, input modality, or language. Our contribution to this open issue is the finding that at six months, infants' perception of phrase structure is not limited to speech-specific boundary cues (Johnson & Seidl, 2008; Seidl, 2007; Seidl & Cristia, 2008; Wellmann et al., 2012) but encompasses a more generic phrase boundary percept in song melodies, a finding that needs to be incorporated into future accounts of infant speech segmentation.
The second general contribution of the current study concerns the question of the kind of acoustic stimulus from which infants learn best. Infants' astonishing learning success in their first year of life has been attributed to the exaggerated acoustic shape of ID speech (Kuhl et al., 1997). If this were the case, then infants should learn even better from ID song, a type of stimulus that is even more exaggerated than ID speech in terms of pitch, rhythm, and tempo (Trehub et al., 1997). In the current study, the pre-test confirmed the naturalness of the song and speech stimuli, and infants showed increased attention to the song versus speech stimuli. Also, the effect sizes of the speech and song modality were in the predicted direction (speech Cohen's dz = 0.09; song Cohen's dz = 0.28). Nevertheless, the current study provided no evidence for easier segmentation in ID song than in ID speech. This is contrary to previous studies that reported a song benefit for infants' linguistic processing (François et al., 2017; Lebedeva & Kuhl, 2010; Thiessen & Saffran, 2009), but is in line with other work in which no processing benefit for songs was observed (Snijders et al., 2020; Suppanen et al., 2019). In the following, we discuss possible reasons for the lack of a song advantage in the current study.

| Understanding the absence of a modality effect
The absence of evidence for easier segmentation from songs compared to speech might reflect the relative acoustic similarity between our song and speech stimuli, resulting from the fact that the stimuli in both modalities were created to be analogous. As a result of this necessary experimental control, the speech stimuli may have been slower, and the song stimuli may have displayed less repetition, than their respective real-life counterparts. Alternatively, the hypothesized processing benefits of ID song might have been present in the current study but counteracted by the higher familiarity of ID speech, resulting in overall similar segmentation outcomes from song and speech. Also, it may simply be that more statistical power is needed to provide evidence for an interaction between modality (speech/song) and phrase segmentation. As the data of the present study are inconclusive regarding the cause of the absence of a modality effect, future studies should elucidate to what extent ID song boosts, hinders, or truly has no impact on infants' segmentation abilities.

| Limitations of the replication
Infants' preference for the passages containing the well-formed sequence in both song and speech was only evident in an analysis that differed from the study we aimed to replicate and extend (Johnson & Seidl, 2008). To understand the first difference between the analyses, one should remember that the test passages consisted of three sentences. The familiarized sequences occurred within the second sentence (see Table 1). Our analysis only included looks after infants had attended long enough to be presented with the first 500 ms of the critical sequence within the test passages (316 of 996 trials and 3 infants excluded). Johnson and Seidl (2008), on the other hand, analyzed data from all test trials.
The change in analysis seemed warranted given that our stimuli were substantially slower than those in the previous study (Johnson & Seidl, 2008). As a second difference, we made use of mixed-effects models on single-trial data in addition to t tests on data averaged over trials within children. Using this more sophisticated analysis technique might have been necessary because of the relatively small effect sizes in the present study (Cohen's dz = 0.09 and 0.28 in the aggregated data of speech and song, respectively) compared to the somewhat larger effect observed in the study by Johnson and Seidl (2008, Experiment 1; Cohen's dz = 0.35).
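For readers comparing the effect sizes above across studies, note that Cohen's dz is the effect size for paired (within-infant) designs. Assuming the conventional definition (the exact computation used in the original analyses is not restated here), it is the mean of the within-infant difference scores divided by their standard deviation:

```latex
% Cohen's d_z for a paired design:
% D_i = difference score for infant i
% (e.g., looking time to the well-formed minus the ill-formed passage)
d_z = \frac{\bar{D}}{s_D}, \qquad
\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i, \qquad
s_D = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(D_i - \bar{D}\right)^2}
```

Because dz standardizes by the variability of the difference scores rather than of the raw looking times, values are comparable across paired designs but not directly comparable to between-subjects Cohen's d.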
In an attempt to understand why the effect size in the current study was smaller, we can rule out a number of factors: Language, age, experimental setup, and even the speaker of the stimuli and the laboratory in which the study was conducted were all the same as in the original study. A factor that might have impacted the effect sizes is the tempo of the experimental stimuli. First, the critical sequences occurred within three seconds from the start of the test passages in the original study (Johnson & Seidl, 2008; range: 1.26-2.99 s) but up to 4 s into the passages of the present study (range: 1.66-4.23 s, see Table 5 in the online materials). Consequently, infants in the current study had to listen longer before they encountered the critical sequences. Second, the comparatively long pauses between the consecutive sentences of the test passages of the present study (range: 400-850 ms in Johnson & Seidl, 2008; range: 923-1,541 ms in the current study; Table 4 in the online materials) might have created a less coherent auditory percept of the passages, resulting in overall more challenging listening conditions and hence smaller effect sizes. Despite these differences, the present study nevertheless provides moderate support for infants' processing of prosodic structure in ID speech (Johnson & Seidl, 2008; Nazzi et al., 2000).
To what extent could phrase segmentation from ID song be relevant to infant language acquisition? For one, it could aid infants in identifying smaller linguistic units within the song lyrics, for example, words occurring at phrase boundaries (for a similar account for speech, see, e.g., Johnson et al., 2014; Shukla et al., 2011), and help to transfer the song lyrics and melody into working memory by chunking them into manageable units. This, in turn, might help infants to identify the song and its lyrics across different occasions in their daily routines and across different singers, contributing to the formation of context- and singer-independent abstract representations. Phrase segmentation from ID song might also, indirectly, benefit the processing of (ID) speech: By attending to melodic phrases in songs, infants practice allocating attention to important units in the song input. The same units are also relevant in speech, but there they are presumably less salient and occur on a much faster time scale. Caregiver singing could thus provide infants with an acoustic playground, a practice field for engaging mechanisms that are also at work in the presumably more demanding speech signal.
Future research should investigate the functional relevance of infants' ability to segment songs into phrases. There is ample evidence that prosodic phrase segmentation of speech is a key prerequisite for lexical and morpho-syntactic development (Carvalho et al., 2018). It has even been suggested that impaired recognition of large phrasal boundaries in speech is the key underlying deficit in developmental language disorder (Richards & Goswami, 2019). Consequently, future research should investigate to what extent caregiver singing and other types of rhythmic-melodic input such as rhyming story books (Richards & Goswami, 2019) contribute to infants' perception of phrasal boundaries in speech. Such a relationship between language play and real-life linguistic abilities would speak to recent studies suggesting a link between rhythmic-melodic processing of music and speech on the one hand and grammar development on the other (Gordon, Jacobs, Schuele, & McAuley, 2015; Leong & Goswami, 2015; Politimou et al., 2019). The current study contributes an empirical foundation for such future investigations by showing that for young infants, the major phrasal units in ID song are at least as accessible as those in ID speech.

| CONCLUSION
Recognizing phrases in continuous speech is a cornerstone of the development of speech perception. This study replicated a previous finding regarding Dutch 6-month-olds' recognition of the phrase structure of ID speech (Johnson & Seidl, 2008) and extended the results to ID song. Thus, already within their first half year of life, infants actively process sung input online and memorize well-formed sung phrases. Future research should identify the mechanisms underlying this ability and clarify whether the recognition of the phrasal structure of caregiver singing contributes to linguistic development.